GCC AI Research

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

arXiv · Significant research

Summary

The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich languages. It aims to address the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.

Related

NativQA: Multilingual Culturally-Aligned Natural Query for LLMs

arXiv

The paper introduces NativQA, a language-independent framework for constructing culturally and regionally aligned QA datasets in native languages. Using the framework, the authors created MultiNativQA, a multilingual natural QA dataset consisting of ~64k manually annotated QA pairs in seven languages. The dataset contains queries from native speakers in 9 regions, spans 18 topics, and is designed for evaluating and tuning LLMs. Why it matters: The framework and dataset enable the creation of more culturally relevant and effective LLMs for diverse linguistic communities, including those in the Middle East.

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

arXiv

Researchers introduce ArabicaQA, a large-scale dataset for Arabic question answering, comprising 89,095 answerable and 3,701 unanswerable questions. They also present AraDPR, a dense passage retrieval model trained on Arabic Wikipedia. The paper also benchmarks large language models (LLMs) on Arabic question answering. Why it matters: This work addresses a significant gap in Arabic NLP resources and provides valuable tools and benchmarks for advancing research in the field.

N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition

arXiv

This paper benchmarks OpenAI's Whisper model on diverse Arabic speech recognition tasks, using publicly available data and novel dialect evaluation sets. The study explores zero-shot, few-shot, and full fine-tuning scenarios. Results indicate that while Whisper outperforms XLS-R models in zero-shot settings on standard datasets, its performance drops significantly when applied to unseen Arabic dialects.

Language Models' Factuality Depends on the Language of Inquiry

arXiv

Researchers introduce a benchmark to evaluate the factual recall and knowledge transferability of multilingual language models across 13 languages. The study reveals that language models often fail to transfer knowledge between languages, even when they possess the correct information in one language. The benchmark and evaluation framework are released to drive future research in multilingual knowledge transfer.