Middle East AI

This Week · arXiv

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

arXiv · Significant research

Summary

MBZUAI researchers introduce VideoGPT+, a novel video Large Multimodal Model (LMM) that integrates image and video encoders to leverage both spatial and temporal information in videos. They also introduce VCGBench-Diverse, a comprehensive benchmark for evaluating video LMMs across 18 video categories. VideoGPT+ demonstrates improved performance on multiple video benchmarks, including VCGBench and MVBench.
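
For a concrete picture of the dual-encoder idea, here is a minimal PyTorch sketch: an image branch supplies per-frame spatial tokens, a video branch supplies clip-level temporal tokens, and both are projected into a shared LLM input space. Module names and dimensions are illustrative assumptions, not VideoGPT+'s actual implementation.

```python
# Illustrative sketch of a dual-encoder fusion (hypothetical names/dims,
# not VideoGPT+'s exact code): project image-encoder and video-encoder
# tokens into the LLM embedding space and concatenate them.
import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    def __init__(self, img_dim=1024, vid_dim=768, llm_dim=4096):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, llm_dim)  # spatial branch
        self.vid_proj = nn.Linear(vid_dim, llm_dim)  # temporal branch

    def forward(self, img_tokens, vid_tokens):
        # img_tokens: (B, N_img, img_dim) per-frame patch features
        # vid_tokens: (B, N_vid, vid_dim) clip-level temporal features
        fused = torch.cat(
            [self.img_proj(img_tokens), self.vid_proj(vid_tokens)], dim=1
        )
        return fused  # (B, N_img + N_vid, llm_dim), fed to the LLM

# Usage with dummy features
fusion = DualEncoderFusion()
img = torch.randn(1, 256, 1024)
vid = torch.randn(1, 128, 768)
print(fusion(img, vid).shape)  # torch.Size([1, 384, 4096])
```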

Keywords

VideoGPT+ · LMM · MBZUAI · video understanding · VCGBench-Diverse

Related

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.
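
Video-ChatGPT adapts frame-level image features to video largely through spatiotemporal pooling: patch features are averaged over time (per spatial location) and over space (per frame), then concatenated into the video tokens the LLM consumes. A minimal sketch of that pooling, assuming CLIP-style patch features of shape (T, P, D):

```python
import torch

def spatiotemporal_pool(frame_feats):
    # frame_feats: (T, P, D) — T frames, P patch tokens, feature dim D
    temporal = frame_feats.mean(dim=0)  # (P, D): per-patch average over time
    spatial = frame_feats.mean(dim=1)   # (T, D): per-frame average over patches
    return torch.cat([temporal, spatial], dim=0)  # (P + T, D) video tokens

feats = torch.randn(100, 256, 1024)  # e.g. 100 frames of CLIP patch features
print(spatiotemporal_pool(feats).shape)  # torch.Size([356, 1024])
```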

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

arXiv

MBZUAI researchers introduce PG-Video-LLaVA, a large multimodal model with pixel-level grounding capabilities for videos that also integrates audio cues for richer understanding. The model uses an off-the-shelf tracker and a grounding module to localize objects in videos based on user prompts. PG-Video-LLaVA is evaluated on video question-answering and grounding benchmarks, with Vicuna replacing GPT-3.5 as the evaluation model to ensure reproducibility.
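
The grounding flow described above can be pictured as a two-stage pipeline: a grounding module proposes boxes for a noun phrase on key frames, and an off-the-shelf tracker propagates them across the clip. The sketch below uses stand-in functions with illustrative names, not PG-Video-LLaVA's actual API.

```python
# Hypothetical grounding + tracking pipeline (stand-in functions only).
from dataclasses import dataclass

@dataclass
class Box:
    frame: int
    xyxy: tuple  # (x1, y1, x2, y2)

def ground_phrase(frame_id: int, phrase: str) -> Box:
    # Stand-in for an open-vocabulary grounding module proposing a box.
    return Box(frame=frame_id, xyxy=(10, 20, 110, 220))

def track(seed: Box, num_frames: int) -> list[Box]:
    # Stand-in for an off-the-shelf tracker propagating the seed box.
    return [Box(frame=f, xyxy=seed.xyxy) for f in range(seed.frame, num_frames)]

def localize(phrase: str, key_frames: list[int], num_frames: int) -> list[Box]:
    tracks = []
    for kf in key_frames:
        tracks.extend(track(ground_phrase(kf, phrase), num_frames))
    return tracks

print(len(localize("the red car", key_frames=[0], num_frames=8)))  # 8 boxes
```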

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

arXiv

Researchers at MBZUAI have introduced Video-R2, a reinforcement learning approach to improve the consistency and visual grounding of reasoning in multimodal language models. Video-R2 combines timestamp-aware supervised fine-tuning with Group Relative Policy Optimization (GRPO) guided by a Temporal Alignment Reward (TAR). The model demonstrates higher Think Answer Consistency (TAC), Video Attention Score (VAS), and accuracy across multiple benchmarks, showing improved temporal alignment and reasoning coherence for video understanding.
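
At its core, GRPO scores a group of sampled responses and normalizes each reward against the group's mean and standard deviation, so no learned value function is needed. The sketch below shows that group-relative advantage with a hypothetical split of the reward into an answer-correctness term and a temporal-alignment term; the paper's actual TAR definition is more involved.

```python
# Group-relative advantage as used in GRPO, with an illustrative reward
# split (answer correctness + temporal alignment); values are made up.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    # rewards: (G,) one scalar reward per sampled response in the group
    return (rewards - rewards.mean()) / (rewards.std() + eps)

answer_reward = torch.tensor([1.0, 0.0, 1.0, 0.0])    # answer correctness
temporal_reward = torch.tensor([0.8, 0.3, 0.2, 0.9])  # temporal alignment (TAR-like)
advantages = group_relative_advantages(answer_reward + temporal_reward)
print(advantages)  # responses above the group mean get positive advantage
```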

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

arXiv

FancyVideo, a new video generator, introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video models. CTGM pairs a Temporal Information Injector with a Temporal Affinity Refiner to deliver frame-specific textual guidance, improving the model's handling of temporal logic. Experiments on the EvalCrafter benchmark show state-of-the-art performance in generating dynamic, consistent videos, and FancyVideo also supports image-to-video generation.
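
One way to picture frame-specific textual guidance is cross-attention in which each frame's latent tokens query a per-frame text embedding, so the guidance signal can vary across the clip rather than being shared. The sketch below is a minimal illustration in that spirit; module names and shapes are assumptions, not FancyVideo's exact CTGM design.

```python
# Illustrative frame-specific text cross-attention (hypothetical shapes):
# each frame's spatial tokens attend to its own text embedding.
import torch
import torch.nn as nn

class FrameTextCrossAttention(nn.Module):
    def __init__(self, dim=320, txt_dim=768, heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_latents, text_emb):
        # frame_latents: (T, N, dim) per-frame spatial tokens
        # text_emb: (T, L, txt_dim) text embeddings, already frame-specific
        txt = self.txt_proj(text_emb)
        out, _ = self.attn(frame_latents, txt, txt)  # queries = frame tokens
        return frame_latents + out  # residual guidance, distinct per frame

ctgm = FrameTextCrossAttention()
frames = torch.randn(16, 64, 320)  # 16 frames, 64 latent tokens each
text = torch.randn(16, 77, 768)    # per-frame text embeddings
print(ctgm(frames, text).shape)    # torch.Size([16, 64, 320])
```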