Search

Results for "low-latency"

Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation

arXiv · May 16

This paper introduces SimulMask, a new paradigm for fine-tuning large language models (LLMs) for simultaneous translation. SimulMask utilizes a novel attention masking approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applied to a Falcon LLM on the IWSLT 2017 dataset, SimulMask achieves improved translation quality compared to state-of-the-art prompting optimization strategies across five language pairs while reducing computational cost. Why it matters: The proposed method offers a more efficient way to adapt LLMs for real-time translation, potentially enhancing multilingual communication tools and services.

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

arXiv · Feb 26

Researchers from MBZUAI have released MobiLlama, a fully transparent open-source 0.5 billion parameter Small Language Model (SLM). MobiLlama is designed for resource-constrained devices, emphasizing enhanced performance with reduced resource demands. The full training data pipeline, code, model weights, and checkpoints are available on Github.

Low-Complexity NN Technology: Model and Precision Search, Acceleration Circuit, and Applications

MBZUAI · Invalid Date

Researchers at National Taiwan University are developing low-complexity neural network technologies using quantization to reduce model size while maintaining accuracy. Their work includes binary-weighted CNNs and transformers, along with a neural architecture search scheme (TPC-NAS) applied to image recognition, object detection, and NLP tasks. They have also built a PE-based CNN/transformer hardware accelerator in Xilinx FPGA SoC with a PyTorch-based software framework. Why it matters: This research provides practical methods for deploying efficient deep learning models on resource-constrained hardware, potentially enabling broader adoption of AI in embedded systems and edge devices.

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

arXiv · Jun 5

The paper introduces Sparse-Quantized Representation (SpQR), a new compression format and quantization technique for large language models (LLMs). SpQR identifies outlier weights and stores them in higher precision while compressing the remaining weights to 3-4 bits. The method achieves less than 1% accuracy loss in perplexity for LLaMA and Falcon LLMs and enables a 33B parameter LLM to run on a single 24GB consumer GPU. Why it matters: This enables near-lossless compression of LLMs, making powerful models accessible on resource-constrained devices and accelerating inference without significant accuracy degradation.

Making human-machine conversation more lifelike than ever at GITEX

MBZUAI · Invalid Date

MBZUAI researchers demonstrated a low-latency, multilingual multimodal AI system at GITEX that integrates speech, text, and visual capabilities for more lifelike human-machine conversation. The demo, led by Dr. Hisham Cholakkal, includes a mobile app where users can point their camera at an object and ask questions, receiving spoken answers in multiple languages. They are also integrating the model into a robot dog that can respond to voice commands. Why it matters: This work addresses key challenges in deploying LLMs to real-world applications in the Middle East, such as multilingual support and real-time responsiveness.

Minimalistic Autonomous Stack for High-Speed Time-Trial Racing

arXiv · Sep 23

This paper introduces a minimalistic autonomous racing stack designed for high-speed time-trial racing, emphasizing rapid deployment and efficient system integration with minimal on-track testing. Validated on real speedways, the stack achieved a top speed of 206 km/h within just 11 hours of practice, covering 325 km. The system performance analysis includes tracking accuracy, vehicle dynamics, and safety considerations. Why it matters: This research offers insights for teams aiming to quickly develop and deploy autonomous racing stacks with limited track access, potentially accelerating innovation in autonomous vehicle technology within the A2RL and similar racing initiatives.

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

arXiv · Mar 6

MBZUAI researchers introduce LLMVoX, a 30M-parameter, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that generates high-quality speech with low latency. The system preserves the capabilities of the base LLM and achieves a lower Word Error Rate compared to speech-enabled LLMs. LLMVoX supports seamless, infinite-length dialogues and generalizes to new languages with dataset adaptation, including Arabic.