GCC AI Research

Search

Results for "BABL AI"

LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs

arXiv ·

MBZUAI researchers introduce LLM-BabyBench, a benchmark suite for evaluating grounded planning and reasoning in LLMs. The suite, built on a textual adaptation of the BabyAI grid world, assesses LLMs on predicting action consequences, generating action sequences, and decomposing instructions. Datasets, evaluation harness, and metrics are publicly available to facilitate reproducible assessment.

PALO: A Polyglot Large Multimodal Model for 5B People

arXiv ·

Researchers introduce PALO, a polyglot large multimodal model with visual reasoning capabilities in 10 major languages including Arabic. A semi-automated translation approach was used to adapt the multimodal instruction dataset from English to the target languages. The models are trained across three scales (1.7B, 7B and 13B parameters) and a multilingual multimodal benchmark is proposed for evaluation.

Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation

arXiv ·

MBZUAI releases Bactrian-X, a multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. They trained low-rank adaptation (LoRA) adapters using this dataset, creating lightweight, replaceable components for large language models. Experiments show the LoRA-based models outperform vanilla and existing instruction-tuned models in multilingual settings.

Why AI can describe an image but struggles to understand the culture inside it

MBZUAI ·

MBZUAI researchers release JEEM, a new benchmark dataset for evaluating vision-language models on Arabic dialects. The dataset covers image captioning and visual question answering tasks using images from Jordan, UAE, Egypt, and Morocco. Results show models struggle with cultural understanding and relevance despite fluent language generation.

RIRAG: Regulatory Information Retrieval and Answer Generation

arXiv ·

Researchers introduce a new task for generating question-passage pairs to aid in developing regulatory question-answering (QA) systems. The ObliQA dataset, comprising 27,869 questions from Abu Dhabi Global Markets (ADGM) financial regulations, is presented. A baseline Regulatory Information Retrieval and Answer Generation (RIRAG) system is designed and evaluated using the RePASs metric.

BiMediX: Bilingual Medical Mixture of Experts LLM

arXiv ·

MBZUAI researchers introduce BiMediX, a bilingual (English and Arabic) mixture of experts LLM for medical applications. The model is trained on BiMed1.3M, a new 1.3 million bilingual instruction dataset and outperforms existing models like Med42 and Jais-30B on medical benchmarks. Code and models are available on Github.