Search

Results for "Video-LLMs"

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv · Jun 8

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · Jun 8

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

arXiv · May 6

Researchers from MBZUAI have introduced the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) for assessing Video-LLMs. The benchmark evaluates models across 11 real-world video dimensions, revealing challenges in robustness and reasoning, particularly for open-source models. A training-free Dual-Step Contextual Prompting (DSCP) technique is proposed to enhance Video-LMM performance, with the dataset and code made publicly available.