Researchers introduce ArabicaQA, a large-scale dataset for Arabic question answering comprising 89,095 answerable and 3,701 unanswerable questions. They also present AraDPR, a dense passage retrieval model trained on Arabic Wikipedia, and benchmark large language models (LLMs) on Arabic question answering. Why it matters: This work addresses a significant gap in Arabic NLP resources, providing a dataset, a retriever, and LLM benchmarks for further research in the field.
This paper introduces an enhanced Dense Passage Retrieval (DPR) framework tailored for Arabic text retrieval. The core innovation is an Attentive Relevance Scoring (ARS) mechanism that replaces standard interaction methods to better model semantic relevance between questions and passages. The method integrates pre-trained Arabic language models and architectural refinements, achieving improved retrieval and ranking accuracy for Arabic question answering. Why it matters: This work addresses the underrepresentation of Arabic in NLP research by providing a novel approach and publicly available code for Arabic text retrieval, which can benefit applications such as Arabic search engines and question-answering systems.
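For context, a standard dense passage retrieval setup scores each passage by the inner product of the question embedding and the passage embedding, then ranks passages by that score. The sketch below illustrates only this baseline interaction, not the ARS mechanism, whose details are not given in this summary. The toy vectors and the names `dot` and `rank_passages` are hypothetical; in practice the embeddings would come from trained encoders.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (passage_index, score) pairs sorted by descending score."""
    scores = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy, hand-written embeddings (hypothetical; real ones come from trained encoders).
question = [0.9, 0.1, 0.0]
passages = [
    [0.1, 0.8, 0.1],  # passage 0
    [0.8, 0.2, 0.0],  # passage 1
    [0.0, 0.0, 1.0],  # passage 2
]
ranking = rank_passages(question, passages)
```

Here passage 1 ranks first because its embedding is most closely aligned with the question vector. A learned relevance scorer would replace the `dot` call with a trained scoring function.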
This paper presents a comparative study of pre-trained transformer models for Arabic question answering (QA). The study evaluates the performance of AraBERTv2-base, AraBERTv0.2-large, and AraELECTRA models on four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. The researchers fine-tuned these models and analyzed the results to understand the performance disparities. Why it matters: This research contributes to the advancement of Arabic NLP by evaluating and comparing state-of-the-art models on important QA tasks, addressing the scarcity of resources in this domain.
The Inception Team presented a system for Semantic Question Similarity in Arabic as part of NSURL 2019 Task 8. The team explored several methods for determining question similarity in Arabic. Their best result came from an ensemble model built on a pre-trained multilingual BERT model, which achieved an F1-score of 95.924% and ranked first among nine participating teams. Why it matters: This demonstrates strong performance on a key Arabic NLP task, advancing the state of the art in semantic understanding for the language.
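As a reminder of what the reported metric measures, F1 is the harmonic mean of precision and recall. The sketch below computes it for the positive class of a binary task. This is an assumption for illustration: the summary does not say which F1 variant or averaging the shared task used, and the labels are made up.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1) in a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = the two questions are similar, 0 = not similar.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 1, 0, 0, 1, 1]
score = f1_score(gold, pred)
```

With these toy labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.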
The paper introduces NativQA, a language-independent framework for constructing culturally and regionally aligned QA datasets in native languages. Using the framework, the authors created MultiNativQA, a multilingual natural QA dataset consisting of ~64k manually annotated QA pairs in seven languages. The dataset contains queries from native speakers in 9 regions, spans 18 topics, and is designed for evaluating and tuning LLMs. Why it matters: The framework and dataset enable the creation of more culturally relevant and effective LLMs for diverse linguistic communities, including those in the Middle East.
The paper introduces AraTrust, a new benchmark for evaluating the trustworthiness of LLMs when prompted in Arabic. The benchmark contains 522 multiple-choice questions covering dimensions like truthfulness, ethics, safety, and fairness. Experiments using AraTrust showed that GPT-4 performed the best, while open-source models like AceGPT 7B and Jais 13B had lower scores. Why it matters: This benchmark addresses a critical gap in evaluating LLMs for Arabic, which is essential for ensuring the safe and ethical deployment of AI in the Arab world.
A new dataset called ArabCulture is introduced to address the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers, who wrote and validated culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.