GCC AI Research

Text-to-speech system brings real-time speech to LLMs

MBZUAI · Significant research

Summary

MBZUAI researchers developed LLMVoX, a system enabling LLMs to produce real-time speech, including in Arabic. LLMVoX addresses the limitations of existing end-to-end and cascaded pipeline approaches, which suffer from either degraded reasoning or high latency. LLMVoX was developed as part of Project OMER, which was recently awarded a Regional Research Grant from Meta. Why it matters: This enhances the potential of LLMs to function as more natural, multimodal virtual assistants, especially for Arabic-speaking users in the Middle East.

Keywords

LLM · text-to-speech · MBZUAI · Arabic · multimodal

Related

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

arXiv ·

MBZUAI researchers introduce LLMVoX, a 30M-parameter, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that generates high-quality speech with low latency. The system preserves the capabilities of the base LLM and achieves a lower Word Error Rate than comparable speech-enabled LLMs. LLMVoX supports seamless, infinite-length dialogues and generalizes to new languages, including Arabic, through dataset adaptation.

Making human-machine conversation more lifelike than ever at GITEX

MBZUAI ·

MBZUAI researchers demonstrated a low-latency, multilingual multimodal AI system at GITEX that integrates speech, text, and visual capabilities for more lifelike human-machine conversation. The demo, led by Dr. Hisham Cholakkal, includes a mobile app where users can point their camera at an object and ask questions, receiving spoken answers in multiple languages. They are also integrating the model into a robot dog that can respond to voice commands. Why it matters: This work addresses key challenges in deploying LLMs to real-world applications in the Middle East, such as multilingual support and real-time responsiveness.

NatiQ: An End-to-end Text-to-Speech System for Arabic

arXiv ·

Qatar Computing Research Institute (QCRI) has developed NatiQ, an end-to-end text-to-speech (TTS) system for Arabic utilizing encoder-decoder architectures. The system employs Tacotron-based models and Transformer models to generate mel-spectrograms, which are then synthesized into waveforms using vocoders like WaveRNN, WaveGlow, and Parallel WaveGAN. Trained on in-house speech data featuring a neutral male voice (Hamza) and an expressive female voice (Amina), NatiQ achieves a Mean Opinion Score (MOS) of 4.21 and 4.40, respectively. Why it matters: This research advances Arabic language technology, providing high-quality TTS synthesis that can enhance accessibility and usability of digital content for Arabic speakers.
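The two-stage structure described above (an acoustic model producing mel-spectrograms, followed by a neural vocoder producing waveforms) is the standard Tacotron-style pipeline. A minimal sketch of that structure, using toy stand-ins rather than NatiQ's actual models; the function names, `N_MELS`, and `HOP_LENGTH` values are illustrative assumptions:

```python
import numpy as np

N_MELS = 80       # mel bands per spectrogram frame (a common choice, assumed here)
HOP_LENGTH = 256  # waveform samples generated per frame (assumed)

def acoustic_model(text):
    """Toy stand-in for a Tacotron/Transformer acoustic model:
    maps text to a mel-spectrogram (here, one random frame per character)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(text), N_MELS))

def vocoder(mel):
    """Toy stand-in for a neural vocoder (WaveRNN/WaveGlow/Parallel WaveGAN):
    upsamples each spectrogram frame to HOP_LENGTH waveform samples."""
    rng = np.random.default_rng(1)
    n_frames = mel.shape[0]
    return np.tanh(rng.standard_normal(n_frames * HOP_LENGTH))

def synthesize(text):
    mel = acoustic_model(text)  # stage 1: text -> mel-spectrogram
    return vocoder(mel)         # stage 2: mel-spectrogram -> waveform
```

The design point is the decoupling: the acoustic model handles linguistic content and prosody, while the vocoder only has to invert spectrograms to audio, which is why vocoders can be swapped (WaveRNN, WaveGlow, Parallel WaveGAN) without retraining the front end.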

Your voice can jailbreak a speech model – here’s how to stop it, without retraining

MBZUAI ·

A new paper from MBZUAI demonstrates that state-of-the-art speech models can be easily jailbroken using audio perturbations to generate harmful content, achieving success rates of 76-93% on models like Qwen2-Audio and LLaMA-Omni. The researchers adapted projected gradient descent (PGD) to the audio domain to optimize waveforms that push the model towards harmful responses. They propose a defense mechanism based on post-hoc activation patching that hardens models at inference time without retraining. Why it matters: This research highlights a critical vulnerability in speech-based LLMs and offers a practical solution, contributing to the development of more secure and trustworthy AI systems in the region and globally.
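The attack adapts standard projected gradient descent: take signed gradient steps on the waveform toward a target (harmful) response, then project back into a small L-infinity ball around the original audio so the perturbation stays imperceptible. A minimal NumPy sketch of that loop; the step size, epsilon, and the caller-supplied `loss_grad` are illustrative, not values from the paper:

```python
import numpy as np

def pgd_audio(waveform, loss_grad, epsilon=0.01, alpha=0.002, steps=10):
    """Projected gradient descent on a waveform, L-infinity constrained.

    loss_grad(x) returns d(loss)/dx for the attacker's loss (in an attack
    setting, e.g. the negative log-likelihood of a target response). Each
    iteration descends the loss, then projects back into the epsilon-ball
    around the original waveform and into the valid audio range [-1, 1].
    """
    x = waveform.copy()
    for _ in range(steps):
        x = x - alpha * np.sign(loss_grad(x))                    # signed gradient step
        x = np.clip(x, waveform - epsilon, waveform + epsilon)   # L-inf projection
        x = np.clip(x, -1.0, 1.0)                                # valid sample range
    return x
```

The signed-gradient step and projection are what make PGD suited to audio: the perturbation magnitude is bounded per-sample, so the adversarial waveform sounds essentially identical to the original while still steering the model's output.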