MBZUAI researchers have released ALM Bench, a new benchmark dataset for evaluating the performance of multimodal LLMs on cultural visual question-answer tasks across 100 languages. The dataset includes over 22,000 question-answer pairs across 19 categories, with a focus on low-resource languages and cultural nuances, including three Arabic dialects. They tested 16 open- and closed-source multimodal LLMs on it, revealing a significant need for greater cultural and linguistic inclusivity. Why it matters: The benchmark aims to improve the inclusivity of multimodal AI systems by addressing the underrepresentation of low-resource languages and cultural contexts.
A new dataset called ArabCulture is introduced to address the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers writing and validating culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.
A new culturally inclusive and linguistically diverse dataset called Palm for Arabic LLMs is introduced, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluation of frontier LLMs using the dataset reveals limitations in cultural and dialectal understanding, with some countries being better represented than others.
MBZUAI researchers have created ArabCulture, a new benchmark dataset to measure cultural commonsense reasoning capabilities in Arabic language models. The dataset was built by native Arabic speakers from 13 countries and is the largest of its kind. Testing 31 language models, the researchers found that many systems struggle with understanding cultural concepts across the Arab world. Why it matters: The new benchmark addresses a gap in AI, enabling development of culturally-aware AI systems tailored to the nuances of the Arabic-speaking world.
A new benchmark, ViMUL-Bench, is introduced to evaluate video LLMs across 14 languages, including Arabic, with a focus on cultural inclusivity. The benchmark includes 8k manually verified samples across 15 categories and varying video durations. A multilingual video LLM, ViMUL, is also presented, along with a training set of 1.2 million samples, with both to be publicly released.