Skip to content
GCC AI Research

Search

Results for "JEEM"

Why AI can describe an image but struggles to understand the culture inside it

MBZUAI ·

MBZUAI researchers release JEEM, a new benchmark dataset for evaluating vision-language models on Arabic dialects. The dataset covers image captioning and visual question answering tasks using images from Jordan, UAE, Egypt, and Morocco. Results show models struggle with cultural understanding and relevance despite fluent language generation.

Why AI can describe an image but struggles to understand the culture inside it

MBZUAI ·

A new paper from MBZUAI introduces JEEM, a benchmark dataset for evaluating vision-language models on their understanding of images grounded in four Arabic-speaking societies (Jordan, UAE, Egypt, and Morocco) and their ability to use local dialects. The dataset comprises 2,178 images and 10,890 question-answer pairs reflecting everyday life and culturally specific scenes. Evaluation of several Arabic-capable models (Maya, PALO, Peacock, AIN, AyaV) and GPT-4o revealed that while models can generate fluent language, they struggle with genuine understanding, consistency, and relevance, especially when cultural context is important. Why it matters: This research highlights the challenges of building AI systems that can truly understand and interact with diverse cultures, emphasizing the need for culturally grounded datasets and evaluation metrics.

Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

arXiv ·

This paper benchmarks the performance of large language models (LLMs) on Arabic medical natural language processing tasks using the AraHealthQA dataset. The study evaluated LLMs in multiple-choice question answering, fill-in-the-blank, and open-ended question answering scenarios. The results showed that a majority voting solution using Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieved 77% accuracy on MCQs, while other LLMs achieved a BERTScore of 86.44% on open-ended questions. Why it matters: The research highlights both the potential and limitations of current LLMs in Arabic clinical contexts, providing a baseline for future improvements in Arabic medical AI.