MBZUAI's Hanan Al Darmaki is working to improve automatic speech recognition (ASR) for low-resource languages, where labeled data is scarce. She notes that Arabic presents unique challenges due to dialectal variation and a lack of written resources corresponding to spoken dialects. Al Darmaki's research focuses on unsupervised speech recognition to address this gap. Why it matters: Overcoming these challenges can make virtual assistants more effective across diverse languages and enable more inclusive AI applications in the Arabic-speaking world.
This paper benchmarks the performance of OpenAI's Whisper model on diverse Arabic speech recognition tasks, using publicly available data and novel dialect evaluation sets. The study explores zero-shot, few-shot, and full finetuning scenarios. Results indicate that while Whisper outperforms XLS-R models in zero-shot settings on standard datasets, its performance drops significantly when applied to unseen Arabic dialects.
Pedro J. Moreno, former head of ASR R&D at Google, gave a talk at MBZUAI on the past, present, and future of speech technologies. The talk covered the evolution of speech technology, his career contributions including work on Google Voice Search, and the impact of LLMs on speech science. He also discussed the interplay between foundational and applied research and how to prepare the next generation of scientists. Why it matters: The talk provides insights into the trajectory of speech technologies from a leading researcher, highlighting future directions and the ethical considerations surrounding AI's impact on society.
MBZUAI student Karima Kadaoui is developing machine learning algorithms to help speech-impaired individuals communicate more easily. Her project aims to create an app that converts impaired speech into intelligible language, making it easier to communicate with other people and to use voice-enabled technologies like Siri and Google Assistant. The AI-powered app could assist people affected by conditions such as stroke and cerebral palsy, who often struggle with the muscle control needed for clear speech. Why it matters: The research addresses a critical need for inclusive AI solutions, potentially improving the quality of life for speech-impaired individuals in the region and beyond.
Dr. Bhiksha Raj of Carnegie Mellon University, an expert in speech and audio processing, gave a research talk on privacy and security issues in speech processing, highlighting the unique privacy challenges posed by the biometric information embedded in speech. The talk covered the legal landscape and proposed solutions, including cryptographic and hashing-based methods and adversarial processing techniques. Why it matters: As speech-based interfaces become more prevalent in the Middle East, understanding and addressing the associated privacy risks is crucial for ethical AI development and deployment.
The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16 kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130M-word dataset to improve language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.
MBZUAI researchers presented a study at ACL 2024 on improving Arabic ASR by pre-training on dialectal Arabic. They trained three versions of the ArTST model: one on Modern Standard Arabic (MSA), one on MSA and dialectal data, and one on MSA, dialectal, and multilingual data. Results showed that pre-training on dialectal Arabic improves ASR performance across MSA and various dialects. Why it matters: This research addresses a key challenge in Arabic NLP, the diversity and lack of standardization of dialects, and could lead to more accurate speech recognition systems.
The Qatar Computing Research Institute (QCRI) has released SpokenNativQA, a multilingual spoken question-answering dataset for evaluating LLMs in conversational settings. The dataset contains 33,000 naturally spoken questions and answers across multiple languages, including low-resource and dialect-rich languages. It aims to address the limitations of text-based QA datasets by incorporating speech variability, accents, and linguistic diversity. Why it matters: This benchmark enables more robust evaluation of LLMs in speech-based interactions, particularly for Arabic dialects and other low-resource languages.