Skip to content
GCC AI Research

Search

Results for "validation datasets"

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

arXiv ·

Researchers introduce TimeTravel, a benchmark dataset for evaluating large multimodal models (LMMs) on historical and cultural artifacts. The benchmark comprises 10,250 expert-verified samples across 266 cultures and 10 historical regions, designed to assess AI in tasks like classification and interpretation of manuscripts, artworks, inscriptions, and archaeological discoveries. The goal is to establish AI as a reliable partner in preserving cultural heritage and assisting researchers.

Testing the limits of vision language models: A new benchmark dataset presented at ACL

MBZUAI ·

MBZUAI researchers presented EXAMS-V, a new benchmark dataset for evaluating the reasoning and processing abilities of vision language models (VLMs). EXAMS-V contains over 20,000 multiple-choice questions across 26 subjects and 11 languages, including Arabic. The dataset presents the questions within images, testing the VLM's ability to integrate visual and textual information. Why it matters: This dataset fills a gap in VLM evaluation, providing a valuable resource for assessing and improving the multimodal reasoning capabilities of these models, particularly in diverse languages like Arabic.

Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

arXiv ·

A new culturally inclusive and linguistically diverse dataset called Palm for Arabic LLMs is introduced, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluation of frontier LLMs using the dataset reveals limitations in cultural and dialectal understanding, with some countries being better represented than others.

ArabJobs: A Multinational Corpus of Arabic Job Ads

arXiv ·

The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.

SlimPajama-DC: Understanding Data Combinations for LLM Training

arXiv ·

Researchers at MBZUAI release SlimPajama-DC, an empirical analysis of data combinations for pretraining LLMs using the SlimPajama dataset. The study examines the impact of global vs. local deduplication and the proportions of highly-deduplicated multi-source datasets. Results show that increased data diversity after global deduplication is crucial, with the best configuration outperforming models trained on RedPajama.

TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns for Intrusion Detection

arXiv ·

Researchers introduce TII-SSRC-23, a new network intrusion detection dataset designed to improve the diversity and representation of modern network traffic for machine learning models. The dataset includes a range of traffic types and subtypes to address the limitations of existing datasets. Feature importance analysis and baseline experiments for supervised and unsupervised intrusion detection are also provided.

AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

arXiv ·

Researchers introduce AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation, comprising seven synthetic datasets in various dialects and Modern Standard Arabic (MSA). The benchmark includes approximately 45,000 post-edited samples and evaluates LLMs on dialect comprehension, generation, and cultural awareness across the Gulf, Egypt, and Levant. Results show that Arabic-specific models like Jais and AceGPT outperform multilingual models on dialectal tasks, but challenges remain in dialect identification, generation, and translation. Why it matters: This benchmark and associated datasets will help improve LLMs' ability to understand and generate diverse Arabic dialects and cultural contexts, addressing a significant gap in current models.

Commonsense Reasoning in Arab Culture

arXiv ·

A new dataset called ArabCulture is introduced to address the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers writing and validating culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.