Results for "GLaMM"

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

arXiv · Jun 8

A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.

GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

arXiv · Jul 2

The paper introduces InstAr-500k, a new Arabic instruction dataset of 500,000 examples designed to improve LLM performance in Arabic. Researchers fine-tuned the open-source Gemma-7B model using InstAr-500k and evaluated it on downstream tasks, achieving strong results on Arabic NLP benchmarks. They then released GemmAr-7B-V1, a model specifically tuned for Arabic NLP tasks. Why it matters: This work addresses the lack of high-quality Arabic instruction data, potentially boosting the capabilities of Arabic language models.

PALO: A Polyglot Large Multimodal Model for 5B People

arXiv · Feb 22

Researchers introduce PALO, a polyglot large multimodal model with visual reasoning capabilities in 10 major languages including Arabic. A semi-automated translation approach was used to adapt the multimodal instruction dataset from English to the target languages. The models are trained across three scales (1.7B, 7B and 13B parameters) and a multilingual multimodal benchmark is proposed for evaluation.

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv · Nov 20

Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents, a Proposer and a Solver, which generate image-grounded questions and solve them through internal consistency, using a continuous self-rewarding process. Evaluations using Qwen2.5-VL as the base model showed performance gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision using only raw training images.

MBZUAI launches five new “first-of-its-kind” LLMs to support real-world applications and use cases

MBZUAI

MBZUAI's Institute of Foundation Models (IFM) has launched five new specialized language and multimodal models: BiMediX, PALO, GLaMM, GeoChat, and MobiLLaMA. These models address real-world applications in healthcare, visual reasoning, multilingual capabilities, geospatial analysis, and mobile device efficiency. BiMediX is a bilingual medical LLM, while GLaMM generates natural language responses grounded to objects in an image at the pixel level. Why it matters: This launch demonstrates MBZUAI's commitment to advancing AI research and developing practical AI solutions for various industries, with a particular focus on Arabic language capabilities.

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

arXiv · May 6

Researchers from MBZUAI have introduced the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) for assessing Video-LMMs. The benchmark evaluates models across 11 real-world video dimensions, revealing challenges in robustness and reasoning, particularly for open-source models. A training-free Dual-Step Contextual Prompting (DSCP) technique is proposed to enhance Video-LMM performance, with the dataset and code made publicly available.