ElevenLabs, a voice AI research and product company, presented at MBZUAI's Incubation and Entrepreneurship Center (IEC) on the adoption of audio AI in the Middle East. Hussein Makki, general manager for the Middle East at ElevenLabs, highlighted the potential of voice-native AI across sectors like telecommunications, banking, and education. ElevenLabs focuses on making content accessible and engaging across languages and voices through its text-to-speech models. Why it matters: This signals growing interest and investment in voice AI applications within the region, potentially transforming customer service and content accessibility in Arabic.
Egyptian AI startup Intella, which specializes in Arabic speech recognition, has raised $12.5 million in funding. The round was led by [investor name missing], with participation from other investors. Intella plans to use the capital to expand its Arabic AI speech models and related services. Why it matters: The funding will help advance Arabic language AI capabilities, which are currently underserved compared to English-centric models.
MBZUAI researchers introduce LLMVoX, a 30M-parameter, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that generates high-quality speech with low latency. The system preserves the capabilities of the base LLM and achieves a lower Word Error Rate (WER) than speech-enabled LLMs. LLMVoX supports seamless, infinite-length dialogues and generalizes to new languages, including Arabic, through dataset adaptation.
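For readers unfamiliar with the metric, Word Error Rate is commonly computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition; the function name and the whitespace tokenization are illustrative assumptions, not the evaluation setup used by the LLMVoX authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.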
Qatar Computing Research Institute (QCRI) has developed NatiQ, an end-to-end text-to-speech (TTS) system for Arabic built on encoder-decoder architectures. The system uses Tacotron-based and Transformer models to generate mel-spectrograms, which are then synthesized into waveforms by vocoders such as WaveRNN, WaveGlow, and Parallel WaveGAN. Trained on in-house speech data featuring a neutral male voice (Hamza) and an expressive female voice (Amina), NatiQ achieves Mean Opinion Scores (MOS) of 4.21 and 4.40, respectively. Why it matters: This research advances Arabic language technology, providing high-quality TTS synthesis that can enhance accessibility and usability of digital content for Arabic speakers.
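A Mean Opinion Score is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below shows that calculation with a rough 95% interval; the ratings are made-up illustration values, the function name is hypothetical, and the normal approximation is an assumption rather than the procedure QCRI reports.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumption: ratings are on the usual 1 to 5 scale, and the interval
    uses a normal approximation (1.96 standard errors).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = mean(ratings)
    if len(ratings) < 2:
        return mos, 0.0
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from five listeners for one synthesized voice.
mos, half_width = mean_opinion_score([4, 5, 4, 4, 5])
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```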
MBZUAI researchers developed LLMVoX, a system that enables LLMs to produce real-time speech, including in Arabic. LLMVoX addresses limitations of existing end-to-end and cascaded pipeline approaches, which suffer from either degraded reasoning or high latency. LLMVoX was developed as part of Project OMER, which was recently awarded a Regional Research Grant from Meta. Why it matters: This enhances the potential of LLMs to function as more natural, multimodal virtual assistants, especially for Arabic-speaking users in the Middle East.
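The latency problem of a cascaded pipeline comes from waiting for the full LLM response before synthesis starts. The sketch below illustrates the general idea of streaming instead: text is handed to a speech stage in small chunks as tokens arrive. It is a generic illustration, not LLMVoX's actual design; `stream_chunks`, the punctuation rule, and `max_words` are all hypothetical.

```python
from typing import Iterable, Iterator


def stream_chunks(tokens: Iterable[str], max_words: int = 8) -> Iterator[str]:
    """Yield speakable text chunks as soon as a boundary is reached.

    Assumption: a chunk ends at punctuation or after max_words tokens, so
    a downstream TTS stage can start speaking before generation finishes.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        at_boundary = token.endswith((".", "!", "?", ","))
        if at_boundary or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Flush whatever remains once the token stream ends.
        yield " ".join(buffer)


# Stand-in for tokens arriving one at a time from a language model.
tokens = "Hello there, how are you today? I am fine".split()
for chunk in stream_chunks(tokens):
    print(chunk)
```

Smaller chunks lower the time to first audio but give the speech stage less context for prosody, which is the trade-off any streaming design has to balance.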
The authors introduce Nile-Chat, a collection of LLMs (4B, 3x4B-A6B, and 12B) built specifically for the Egyptian dialect and capable of understanding and generating text in both Arabic and Latin scripts. A novel language adaptation approach based on the Branch-Train-MiX strategy merges script-specialized experts into a single MoE model. Nile-Chat models outperform multilingual and Arabic LLMs like LLaMa, Jais, and ALLaM on newly introduced Egyptian benchmarks, with the 12B model achieving a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks; all resources are publicly available. Why it matters: This work addresses the overlooked aspect of adapting LLMs to dual-script languages, providing a methodology for creating more inclusive and representative language models in the Arabic-speaking world.
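In a mixture-of-experts (MoE) layer, a router scores the experts for each token and only the top-scoring ones contribute to the output. The sketch below shows generic top-k routing in plain Python. It is not the Nile-Chat implementation: the choice of k=2 over three experts, the function name, and the toy vectors are assumptions made for illustration.

```python
import math


def top_k_route(router_logits, expert_outputs, k=2):
    """Mix the k highest-scoring experts' outputs using softmax weights.

    router_logits: one score per expert for a single token.
    expert_outputs: one output vector per expert for that token.
    """
    if len(router_logits) != len(expert_outputs):
        raise ValueError("need exactly one logit per expert")
    # Indices of the k experts with the largest router scores.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax restricted to the selected experts (shifted for stability).
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    dim = len(expert_outputs[0])
    mixed = [sum(weights[i] * expert_outputs[i][d] for i in top)
             for d in range(dim)]
    return mixed, weights


# Toy example: three experts, the third is scored low and gets skipped.
mixed, weights = top_k_route(
    [2.0, 2.0, -1.0],
    [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
)
print(mixed, weights)
```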
The Qatar Computing Research Institute (QCRI) has released QASR, a 2,000-hour transcribed Arabic speech corpus collected from Aljazeera news broadcasts. The dataset features multi-dialect speech sampled at 16 kHz, aligned with lightly supervised transcriptions and linguistically motivated segmentation. QCRI also released a 130M-word dataset to improve language model training. Why it matters: QASR enables new research in Arabic speech recognition, dialect identification, punctuation restoration, and other NLP tasks for spoken data.
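To give a sense of scale, the short calculation below estimates how many audio samples 2,000 hours at 16 kHz represents. The byte count assumes uncompressed 16-bit mono PCM, which is an assumption for illustration only; the summary does not state how the corpus is stored.

```python
def raw_pcm_size(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Return (total samples, total bytes) for uncompressed PCM audio.

    Assumption: 16-bit mono by default; adjust for other formats.
    """
    seconds = hours * 3600
    samples = seconds * sample_rate_hz * channels
    return samples, samples * bytes_per_sample


samples, size_bytes = raw_pcm_size(2_000)
# Prints: 115,200,000,000 samples, about 230.4 GB uncompressed
print(f"{samples:,} samples, about {size_bytes / 1e9:.1f} GB uncompressed")
```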