Results for "LipVQ-VAE"

Smoothing the way for in-context robot learning

MBZUAI

MBZUAI researchers have developed a new action tokenization method called LipVQ-VAE to improve in-context robot learning. LipVQ-VAE combines VQ-VAE with a Lipschitz constraint to generate smoother robotic motions, addressing limitations of traditional methods. The technique was tested on simulated and real robots, showing improved performance in imitation learning. Why it matters: This research advances robot learning by enabling more fluid and successful robot actions through improved action representation, drawing inspiration from NLP techniques.

LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

arXiv · Mar 6

MBZUAI researchers introduce LLMVoX, a 30M-parameter, LLM-agnostic, autoregressive streaming text-to-speech (TTS) system that generates high-quality speech with low latency. The system preserves the capabilities of the base LLM and achieves a lower Word Error Rate than speech-enabled LLMs. LLMVoX supports seamless, infinite-length dialogues and generalizes to new languages, including Arabic, with dataset adaptation.

Towards Unified and Lossless Latent Space for 3D Molecular Latent Diffusion Modeling

arXiv · Mar 19

The paper introduces UAE-3D, a multi-modal VAE for 3D molecule generation that compresses molecules into a unified latent space, maintaining near-zero reconstruction error. This approach simplifies latent diffusion modeling by eliminating the need to handle multi-modality and equivariance separately. Experiments on GEOM-Drugs and QM9 datasets show UAE-3D establishes new benchmarks in de novo and conditional 3D molecule generation, with significant improvements in efficiency and quality.

NatiQ: An End-to-end Text-to-Speech System for Arabic

arXiv · Jun 15

Qatar Computing Research Institute (QCRI) has developed NatiQ, an end-to-end text-to-speech (TTS) system for Arabic built on encoder-decoder architectures. The system employs Tacotron-based and Transformer models to generate mel-spectrograms, which are then synthesized into waveforms using vocoders such as WaveRNN, WaveGlow, and Parallel WaveGAN. Trained on in-house speech data featuring a neutral male voice (Hamza) and an expressive female voice (Amina), NatiQ achieves Mean Opinion Scores (MOS) of 4.21 and 4.40 for the two voices, respectively. Why it matters: This research advances Arabic language technology, providing high-quality TTS synthesis that can enhance accessibility and usability of digital content for Arabic speakers.

Mutually-Regularized Dual Collaborative Variational Auto-encoder for Recommendation Systems

arXiv · Nov 21

This paper introduces a mutually-regularized dual collaborative variational auto-encoder (MD-CVAE) for recommendation systems, addressing the limitations of user-oriented auto-encoders (UAEs) in handling sparse ratings and new items. MD-CVAE integrates item content and user ratings within a variational framework, regularizing UAE weights with item content to avoid non-optimal convergence. A symmetric inference strategy eliminates the need for retraining when introducing new items, enhancing efficiency in dynamic recommendation scenarios. Why it matters: The MD-CVAE approach offers a practical solution for improving recommendation accuracy and efficiency, especially in scenarios with data sparsity and frequent item updates, relevant to e-commerce and content platforms in the Middle East.

Unscented Autoencoder

arXiv · Jun 8

The paper introduces the Unscented Autoencoder (UAE), a novel deep generative model based on the Variational Autoencoder (VAE) framework. The UAE uses the Unscented Transform (UT) for a more informative posterior representation compared to the reparameterization trick in VAEs. It replaces Kullback-Leibler (KL) divergence with the Wasserstein distribution metric and demonstrates competitive performance in Fréchet Inception Distance (FID) scores.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv · Jun 8

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

arXiv · Nov 22

MBZUAI researchers introduce PG-Video-LLaVA, a large multimodal model with pixel-level grounding capabilities for videos, integrating audio cues for enhanced understanding. The model uses an off-the-shelf tracker and grounding module to localize objects in videos based on user prompts. PG-Video-LLaVA is evaluated on video question-answering and grounding benchmarks, with Vicuna used in place of GPT-3.5 in the evaluation for reproducibility.