Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv · June 8, 2023 · Significant research

Summary

Video-ChatGPT is a multimodal model that pairs a video-adapted visual encoder with a large language model (LLM) to support detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs to train the model, and they develop a quantitative evaluation framework for video-based dialogue models.

Keywords

video understanding · large language models · multimodal model · video-instruction pairs · dialogue models

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

arXiv · Jun 13

MBZUAI researchers introduce VideoGPT+, a novel video Large Multimodal Model (LMM) that integrates image and video encoders to leverage both spatial and temporal information in videos. They also introduce VCGBench-Diverse, a comprehensive benchmark for evaluating video LMMs across 18 video categories. VideoGPT+ demonstrates improved performance on multiple video benchmarks, including VCGBench and MVBench.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. It addresses the limitations of existing datasets by combining temporal length with multimodal richness and by using human-validated samples. LongShOTAgent, an agentic system for analyzing long videos, is also presented. Both the benchmark and the agent demonstrate the challenges that state-of-the-art MLLMs face.

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · Jun 8

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples. Both the model and the training set are to be publicly released.