GCC AI Research

Results for "PerceptionLM"

From Abu Dhabi to Silicon Valley: MBZUAI students advance computer vision at Meta

MBZUAI ·

MBZUAI Ph.D. students Muhammad Maaz and Hanoona Rasheed interned at Meta, where they helped develop the Perception Encoder, a vision encoder for images and videos. To address the shortage of labeled video data, the team built PerceptionLM, a multimodal language model that captures video's spatial and temporal aspects and generates synthetic video-caption data for training the Perception Encoder. Why it matters: This highlights MBZUAI's strength in computer vision and gives students opportunities to contribute to cutting-edge research at global tech firms.

Tracking Meets Large Multimodal Models for Driving Scenario Understanding

arXiv ·

Researchers at MBZUAI have introduced a novel approach to enhance Large Multimodal Models (LMMs) for autonomous driving by integrating 3D tracking information. This method uses a track encoder to embed spatial and temporal data, enriching visual queries and improving the LMM's understanding of driving scenarios. Experiments on DriveLM-nuScenes and DriveLM-CARLA benchmarks demonstrate significant improvements in perception, planning, and prediction tasks compared to baseline models.
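The core idea, as summarized above, is to embed each object's 3D track and use it to enrich the LMM's visual queries. The following is a minimal conceptual sketch of that fusion step, not the authors' implementation; all names (`encode_track`, `fuse_queries`) and the hand-crafted track features are hypothetical stand-ins for a learned track encoder.

```python
# Conceptual sketch (not the paper's code): embedding 3D object tracks
# and concatenating them onto per-object visual queries for an LMM.

def encode_track(track, dim=8):
    """Embed a 3D track (list of (x, y, z) positions per frame) as a
    fixed-size vector: mean position plus average per-frame motion,
    zero-padded to `dim`. A real track encoder would learn this mapping."""
    n = len(track)
    mean = [sum(p[i] for p in track) / n for i in range(3)]
    if n > 1:
        motion = [(track[-1][i] - track[0][i]) / (n - 1) for i in range(3)]
    else:
        motion = [0.0, 0.0, 0.0]
    feat = mean + motion
    return feat + [0.0] * (dim - len(feat))

def fuse_queries(visual_queries, tracks, dim=8):
    """Enrich each visual query by concatenating its object's track embedding,
    giving the LMM explicit spatial and temporal context per object."""
    return [q + encode_track(t, dim) for q, t in zip(visual_queries, tracks)]

# A vehicle moving forward along x over three frames:
tracks = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]]
queries = [[0.5, 0.5]]  # toy 2-d visual query token
fused = fuse_queries(queries, tracks)
```

In the paper's framing, the enriched queries would then attend into the language model alongside the usual visual tokens, so perception, prediction, and planning questions can be answered with motion-aware context.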

Neural Models with Symbolic Representations for Perceptuo-Reasoning Tasks

MBZUAI ·

Mausam, head of the Yardi School of AI at IIT Delhi and affiliate professor at the University of Washington, will discuss Neuro-Symbolic AI. The talk will cover recent research threads with applications in NLP, probabilistic decision-making, and constraint satisfaction. Mausam's research explores neuro-symbolic machine learning, computer vision for radiology, NLP for robotics, multilingual NLP, and intelligent information systems. Why it matters: Neuro-Symbolic AI is gaining importance as it combines the strengths of neural and symbolic approaches, potentially leading to more robust and explainable AI systems.

AlcLaM: Arabic Dialectal Language Model

arXiv ·

The paper introduces AlcLaM, an Arabic dialectal language model trained on 3.4M sentences from social media. AlcLaM expands the vocabulary of a BERT-based model and continues its pretraining on only 13GB of dialectal text. Despite the smaller training corpus, AlcLaM outperforms models like CAMeL, MARBERT, and ArBERT on various Arabic NLP tasks. Why it matters: AlcLaM offers a more efficient and accurate approach to Arabic NLP by focusing on dialectal Arabic, which is often underrepresented in existing models.
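Vocabulary expansion of the kind described above appends dialect-specific tokens to a pretrained model's vocabulary and grows its embedding table to match before further pretraining. Here is a toy sketch of that mechanism, not AlcLaM's code; the function name, the example tokens, and the mean-initialization heuristic for new embedding rows are illustrative assumptions.

```python
# Toy sketch (not AlcLaM's implementation): expanding a pretrained model's
# vocabulary with new dialectal tokens and growing the embedding table.

def expand_vocab(vocab, embeddings, new_tokens):
    """Append unseen tokens to `vocab` and add matching embedding rows.
    New rows are initialized to the mean of the existing embeddings,
    a common heuristic before continued pretraining on new-domain text."""
    dim = len(embeddings[0])
    mean = [sum(row[i] for row in embeddings) / len(embeddings) for i in range(dim)]
    for tok in new_tokens:
        if tok not in vocab:          # skip tokens the model already knows
            vocab[tok] = len(vocab)   # next free token id
            embeddings.append(list(mean))
    return vocab, embeddings

# Tiny 2-d example: one known word plus one new dialectal token.
vocab = {"[UNK]": 0, "kitab": 1}
emb = [[0.0, 0.0], [1.0, 1.0]]
vocab, emb = expand_vocab(vocab, emb, ["shlonak", "kitab"])
```

After the expansion step, the full model would be further pretrained on the dialectal corpus so the new embeddings move away from their generic initialization toward dialect-specific representations.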

Human-Computer Conversational Vision-and-Language Navigation

MBZUAI ·

A presentation discusses the evolution of Vision-and-Language Navigation (VLN) from benchmarks like Room-to-Room (R2R). It highlights the role of Large Language Models (LLMs) such as GPT-4 in enabling more natural human-machine interactions. The presentation showcases work using LLMs to decode navigational instructions and improve robotic navigation. Why it matters: This research demonstrates the potential of merging vision, language, and robotics for advanced AI applications in navigation and human-computer interaction.