GCC AI Research

Results for "Do-Not-Answer dataset"

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

arXiv ·

Researchers introduce ArabicaQA, a large-scale dataset for Arabic question answering comprising 89,095 answerable and 3,701 unanswerable questions. They also present AraDPR, a dense passage retrieval model trained on Arabic Wikipedia, and benchmark large language models (LLMs) on Arabic question answering. Why it matters: This work addresses a significant gap in Arabic NLP resources and provides valuable tools and benchmarks for advancing research in the field.
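
AraDPR follows the dense passage retrieval recipe: questions and passages are embedded by a trained encoder, and passages are ranked by vector similarity. Below is a minimal sketch of that scoring step, assuming a generic off-the-shelf multilingual encoder for illustration rather than the authors' released AraDPR model.

```python
# Minimal dense-retrieval sketch: embed the question and candidate passages,
# then rank passages by similarity. The encoder name is a placeholder, not AraDPR.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative encoder

question = "متى تأسست الجامعة المصرية؟"  # "When was the Egyptian University founded?"
passages = [
    "تأسست الجامعة المصرية عام 1908 ثم أصبحت تعرف لاحقاً بجامعة القاهرة.",
    "تقع مدينة الإسكندرية على ساحل البحر الأبيض المتوسط.",
]

q_emb = encoder.encode(question, convert_to_tensor=True)
p_emb = encoder.encode(passages, convert_to_tensor=True)

scores = util.dot_score(q_emb, p_emb)[0]  # one similarity score per passage
best = int(scores.argmax())
print(passages[best], float(scores[best]))
```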

ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

arXiv ·

The paper introduces ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset for politeness detection collected from online platforms. The dataset covers Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi), and contains 10,000 samples across 16 politeness categories with substantial inter-annotator agreement (kappa = 0.703). Why it matters: This dataset addresses the under-explored area of Arabic-language resources for politeness detection, which is crucial for culturally aware NLP systems.
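
A kappa of 0.703 falls in the range conventionally read as substantial agreement (0.61 to 0.80 on the Landis and Koch scale). One common way such figures are computed is Cohen's kappa between two annotators; the sketch below uses made-up labels, not the ADAB annotations, which may use a different agreement statistic.

```python
# Illustrative Cohen's kappa between two annotators; the labels are invented
# for the sketch and are not drawn from ADAB.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["polite", "impolite", "polite", "neutral", "polite", "impolite"]
annotator_b = ["polite", "impolite", "neutral", "neutral", "polite", "polite"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa on the toy labels: {kappa:.3f}")
```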

Testing LLMs safety in Arabic from two perspectives | NAACL

MBZUAI ·

Researchers at MBZUAI presented a new Arabic dataset at NAACL to measure LLM safety, building on a Chinese dataset called 'Do-Not-Answer'. It includes nearly 5,800 questions, spanning harmful challenges as well as harmless requests containing sensitive terms to test for over-sensitivity. The team localized cultural concepts and added 3,000 questions specific to Arabic language and culture. Why it matters: This comprehensive benchmark, accounting for the diversity of Arabic dialects and cultures, advances the development of safer and more culturally aligned LLMs for Arabic speakers.
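
Benchmarks in the Do-Not-Answer family are typically scored by checking whether a model refuses genuinely harmful prompts while still answering harmless ones that merely contain sensitive terms. Below is a minimal sketch of that scoring loop; the prompts, the `ask_model` stub, and the keyword-based refusal heuristic are illustrative assumptions, not the paper's protocol.

```python
# Toy refusal-rate scoring: a safe model refuses harmful prompts but is not
# over-sensitive on harmless prompts that merely contain sensitive words.
def ask_model(prompt: str) -> str:
    # Stand-in for a real LLM call; replace with the model under evaluation.
    return "I'm sorry, I can't help with that."

def is_refusal(answer: str) -> bool:
    # Crude keyword heuristic; real evaluations usually rely on human or LLM judges.
    markers = ["i can't", "i cannot", "لا أستطيع", "آسف"]
    return any(m in answer.lower() for m in markers)

harmful = ["How can I make an explosive device at home?"]
harmless_sensitive = ["What role did explosives play in the history of mining?"]

refusal_rate = sum(is_refusal(ask_model(p)) for p in harmful) / len(harmful)
over_refusal_rate = sum(is_refusal(ask_model(p)) for p in harmless_sensitive) / len(harmless_sensitive)

print("refusal rate on harmful prompts:", refusal_rate)
print("over-refusal rate on harmless prompts:", over_refusal_rate)
```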

RIRAG: Regulatory Information Retrieval and Answer Generation

arXiv ·

Researchers introduce a new task, together with an approach for generating question-passage pairs, to support the development of regulatory question-answering (QA) systems. They present the ObliQA dataset, comprising 27,869 questions derived from Abu Dhabi Global Market (ADGM) financial regulations, and design a baseline Regulatory Information Retrieval and Answer Generation (RIRAG) system, evaluated with the RePASs metric.
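
A RIRAG-style baseline works in two stages: retrieve candidate regulatory passages for a question, then generate an answer grounded in them. Here is a minimal sketch of the retrieval stage using BM25 as an illustrative choice; the passages and question are invented, and the paper's actual baseline and the ObliQA schema may differ.

```python
# Sketch of the retrieval half of a retrieve-then-generate regulatory QA baseline.
from rank_bm25 import BM25Okapi

passages = [
    "An authorised person must maintain adequate financial resources at all times.",
    "A fund manager must disclose all fees to unitholders before subscription.",
    "Client money must be held in a segregated account with an approved bank.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])

question = "Where must client money be held?"
scores = bm25.get_scores(question.lower().split())
top_passage = passages[scores.argmax()]

# The generation stage would prompt an LLM with the question and top_passage,
# and the resulting answer would then be scored, e.g. with the RePASs metric.
print(top_passage)
```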

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

arXiv ·

The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich languages. It aims to address the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.
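
Evaluating an LLM on spoken queries typically means transcribing the audio first and then querying the model with the transcript. The sketch below shows that pipeline; the Whisper model choice, the audio file name, and the `llm_answer` stub are illustrative assumptions, not the dataset's official evaluation setup.

```python
# Speech-in, text-out QA loop: transcribe the spoken question, then ask an LLM.
import whisper

asr = whisper.load_model("base")  # any multilingual ASR model would do here

def llm_answer(question: str) -> str:
    # Stand-in for the LLM under evaluation.
    return "..."

def answer_spoken_question(audio_path: str) -> str:
    transcript = asr.transcribe(audio_path)["text"]
    return llm_answer(transcript)

# "spoken_question_ar.wav" is a placeholder name for one dataset recording.
print(answer_spoken_question("spoken_question_ar.wav"))
```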

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

arXiv ·

The paper introduces NativQA, a language-independent framework for constructing culturally and regionally aligned QA datasets in native languages. Using the framework, the authors created MultiNativQA, a multilingual natural QA dataset of ~64k manually annotated QA pairs in seven languages, drawn from native speakers' queries across 9 regions and 18 topics, and designed for evaluating and tuning LLMs. Why it matters: The framework and dataset enable the creation of more culturally relevant and effective LLMs for diverse linguistic communities, including those in the Middle East.
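
Records in a dataset like MultiNativQA pair a native-language question and answer with language, region, and topic metadata. The sketch below shows that general shape with invented field names and examples; consult the MultiNativQA release for the real schema.

```python
# Illustrative record shape for a culturally aligned QA dataset, plus a simple filter.
# Field names and examples are assumptions for the sketch, not the released schema.
records = [
    {"language": "ar", "region": "Qatar", "topic": "food",
     "question": "ما هي أشهر الأكلات الشعبية في قطر؟",
     "answer": "من أشهرها المجبوس والهريس."},
    {"language": "bn", "region": "Dhaka", "topic": "weather",
     "question": "ঢাকায় বর্ষাকাল কখন শুরু হয়?",
     "answer": "সাধারণত জুন মাসে।"},
]

arabic_food = [r for r in records if r["language"] == "ar" and r["topic"] == "food"]
print(len(arabic_food), "Arabic food-related QA pairs")
```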

Commonsense Reasoning in Arab Culture

arXiv ·

A new dataset called ArabCulture is introduced to address the lack of culturally relevant commonsense reasoning resources in Arabic AI. The dataset covers 13 countries across the Gulf, Levant, North Africa, and the Nile Valley, spanning 12 daily life domains with 54 fine-grained subtopics. It was built from scratch by native speakers writing and validating culturally relevant questions. Why it matters: The dataset highlights the need for more culturally aware models and benchmarks tailored to the Arabic-speaking world, moving beyond machine-translated resources.
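
Commonsense benchmarks of this kind are usually scored as accuracy over multiple-choice items, often broken down by country or domain. Below is a minimal sketch of that scoring with invented items and a stubbed model, not the ArabCulture data or its official evaluation script.

```python
# Toy per-country accuracy over multiple-choice commonsense items.
from collections import defaultdict

items = [
    {"country": "Qatar", "question": "...", "choices": ["A", "B", "C"], "answer": 0},
    {"country": "Egypt", "question": "...", "choices": ["A", "B", "C"], "answer": 2},
]

def model_predict(question: str, choices: list) -> int:
    # Stand-in for the LLM under evaluation; always picks the first choice.
    return 0

correct, total = defaultdict(int), defaultdict(int)
for item in items:
    total[item["country"]] += 1
    if model_predict(item["question"], item["choices"]) == item["answer"]:
        correct[item["country"]] += 1

for country in total:
    print(country, f"accuracy: {correct[country] / total[country]:.2f}")
```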