Researchers introduce PALO, a polyglot large multimodal model with visual reasoning capabilities in 10 major languages, including Arabic. A semi-automated translation approach was used to adapt the multimodal instruction dataset from English to the target languages. The model is trained at three scales (1.7B, 7B, and 13B parameters), and a multilingual multimodal benchmark is proposed for evaluation.
Keywords
multilingual · multimodal · vision-language model · Arabic · benchmark
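The semi-automated translation step mentioned in the PALO summary can be pictured roughly as follows. This is a minimal sketch and not the authors' pipeline: the record layout (`id`, `conversations`, `from`, `value`), the `<image>` placeholder, the `translate` callable, and the `needs_review` heuristic are all assumptions introduced for illustration. The idea is that a machine translator handles each turn automatically, while suspicious outputs are flagged for a human pass.

```python
from typing import Callable, Dict, List

# Assumed placeholder marking where the image is attached in a turn.
IMAGE_TOKEN = "<image>"


def translate_record(
    record: Dict,
    translate: Callable[[str, str], str],
    lang: str,
) -> Dict:
    """Translate every turn of one instruction record into `lang`.

    `translate(text, lang)` is a hypothetical callable wrapping whatever
    machine-translation system is available.
    """
    turns: List[Dict] = []
    needs_review = False
    for turn in record["conversations"]:
        text = turn["value"]
        has_image = IMAGE_TOKEN in text
        # Remove the placeholder first so the translator cannot corrupt it.
        body = text.replace(IMAGE_TOKEN, "").strip()
        translated = translate(body, lang) if body else ""
        # Heuristic (assumption): empty or unchanged output suggests the
        # automatic step failed, so the record goes to manual review.
        if body and (not translated.strip() or translated == body):
            needs_review = True
        value = f"{IMAGE_TOKEN}\n{translated}" if has_image else translated
        turns.append({"from": turn["from"], "value": value})
    return {
        "id": record["id"],
        "lang": lang,
        "conversations": turns,
        "needs_review": needs_review,
    }
```

Keeping the placeholder out of the translator's input and flagging doubtful records separately is one simple way to combine automatic translation with targeted human correction; the actual checks used for PALO may differ.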
A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.
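For a benchmark organized by language and category, as ViMUL-Bench is described here, results are typically broken down along both axes. The sketch below shows one way to do that; it is not the benchmark's official scoring protocol, and the record fields (`language`, `category`, `correct`) are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable


def accuracy_by(results: Iterable[Dict], key: str) -> Dict[str, float]:
    """Group per-sample results by `key` and return accuracy per group."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        correct[r[key]] += int(r["correct"])
    return {k: correct[k] / totals[k] for k in totals}


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over groups, so small groups count as much as large ones."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Calling `accuracy_by(results, "language")` and `accuracy_by(results, "category")` on the same list gives the two breakdowns. A macro average over languages keeps low-resource languages with few samples from being drowned out by high-resource ones.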
Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.
MBZUAI researchers introduce PG-Video-LLaVA, a large multimodal model with pixel-level grounding capabilities for videos that also integrates audio cues for enhanced understanding. The model uses an off-the-shelf tracker and grounding module to localize objects in videos based on user prompts. PG-Video-LLaVA is evaluated on video question-answering and grounding benchmarks, with Vicuna used in place of GPT-3.5 during evaluation so that results are reproducible.
Video-ChatGPT is a multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.