The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130M word dataset to improve language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.
The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich languages. It aims to address the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.
Researchers created a cross-cultural corpus of annotated verbal and nonverbal behaviors in receptionist interactions. The corpus includes native speakers of American English and Arabic role-playing scenarios at university reception desks in Doha, Qatar, and Pittsburgh, USA. The manually annotated nonverbal behaviors include gaze direction, hand gestures, torso positions, and facial expressions. Why it matters: This resource can be valuable for the human-robot interaction community, especially for building culturally aware AI systems.
A research talk was given on privacy and security issues in speech processing, highlighting the unique privacy challenges due to the biometric information embedded in speech. The talk covered the legal landscape, proposed solutions like cryptographic and hashing-based methods, and adversarial processing techniques. Dr. Bhiksha Raj from Carnegie Mellon University, an expert in speech and audio processing, delivered the talk. Why it matters: As speech-based interfaces become more prevalent in the Middle East, understanding and addressing the associated privacy risks is crucial for ethical AI development and deployment.
The ArabJobs dataset is a new corpus of over 8,500 Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the UAE. The dataset contains over 550,000 words and captures linguistic, regional, and socio-economic variation in the Arab labor market. It is available on GitHub and can be used for fairness-aware Arabic NLP and labor market research.
Researchers compiled a 101 Billion Arabic Words Dataset by mining text from Common Crawl WET files and rigorously cleaning and deduplicating the extracted content. The dataset aims to address the scarcity of original, high-quality Arabic linguistic data, which often leads to bias in Arabic LLMs that rely on translated English data. This is the largest Arabic dataset available to date. Why it matters: The new dataset can significantly contribute to the development of authentic Arabic LLMs that are more linguistically and culturally accurate.
A new culturally inclusive and linguistically diverse dataset called Palm for Arabic LLMs is introduced, covering 22 Arab countries and featuring instructions in both Modern Standard Arabic (MSA) and dialectal Arabic (DA) across 20 topics. The dataset was built through a year-long community-driven project involving 44 researchers from across the Arab world. Evaluation of frontier LLMs using the dataset reveals limitations in cultural and dialectal understanding, with some countries being better represented than others.