Skip to content
GCC AI Research

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts

arXiv · · Significant research

Summary

Researchers introduce TimeTravel, a benchmark dataset for evaluating large multimodal models (LMMs) on historical and cultural artifacts. The benchmark comprises 10,250 expert-verified samples across 266 cultures and 10 historical regions, designed to assess AI in tasks like classification and interpretation of manuscripts, artworks, inscriptions, and archaeological discoveries. The goal is to establish AI as a reliable partner in preserving cultural heritage and assisting researchers.

Get the weekly digest

Top AI stories from the GCC region, every week.

Related

SaudiCulture: A Benchmark for Evaluating Large Language Models Cultural Competence within Saudi Arabia

arXiv ·

The paper introduces SaudiCulture, a new benchmark for evaluating the cultural competence of LLMs within Saudi Arabia, covering five major geographical regions and diverse cultural domains. The benchmark includes questions of varying complexity and distinguishes between common and specialized regional knowledge. Evaluations of five LLMs (GPT-4, Llama 3.3, FANAR, Jais, and AceGPT) revealed performance declines on region-specific questions, highlighting the need for region-specific knowledge in LLM training.