GCC AI Research


Results for "QCRI"

QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus

arXiv

The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16 kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130-million-word text dataset to support language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.

Fanar: An Arabic-Centric Multimodal Generative AI Platform

arXiv

Hamad Bin Khalifa University's Qatar Computing Research Institute (QCRI) introduced Fanar, an Arabic-centric multimodal generative AI platform built around two Arabic LLMs, Fanar Star (7B) and Fanar Prime (9B). The models were trained on nearly 1 trillion tokens, and a custom orchestrator routes each prompt to the appropriate model. Fanar also includes a customized Islamic RAG system, a Recency RAG, bilingual speech recognition, and an attribution service for content verification. The project is sponsored by Qatar's Ministry of Communications and Information Technology. Why it matters: The platform marks a major step toward sovereign AI development in Qatar, providing advanced Arabic language capabilities tailored to regional needs.

SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs

arXiv

The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich ones. It addresses a key limitation of text-based QA datasets by capturing speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.