Multimodal machine intelligence and its human-centered possibilities

MBZUAI · Notable

Ethics · Healthcare · Research · Policy · Partnership

Summary

A panel discussion was held at MBZUAI in collaboration with the Manara Center for Coexistence and Dialogue. The discussion centered on the potential of multimodal machine intelligence for human-centered applications, particularly in health and wellbeing. USC Professor Shrikanth Narayanan spoke on creating trustworthy and inclusive AI that accounts for protected variables. Why it matters: The event signals MBZUAI's interest in ethical AI development and its applications for societal good, which could drive research and policy initiatives in the region.

Keywords

MBZUAI · multimodal AI · human-centered AI · ethics · Manara Center

Related

Foundations of Multisensory Artificial Intelligence

MBZUAI

Paul Liang from CMU presented on machine learning foundations for multisensory AI, discussing a theoretical framework for modality interactions. The talk covered cross-modal attention and multimodal transformer architectures, as well as applications in mental health, pathology, and robotics. Liang's research aims to enable AI systems to integrate and learn from diverse real-world sensory modalities. Why it matters: The talk highlights the growing importance of multimodal AI research and its potential to drive advances across sectors in the region, including healthcare and robotics.

Making human-machine conversation more lifelike than ever at GITEX

MBZUAI

MBZUAI researchers demonstrated a low-latency, multilingual multimodal AI system at GITEX that integrates speech, text, and visual capabilities for more lifelike human-machine conversation. The demo, led by Dr. Hisham Cholakkal, includes a mobile app that lets users point their camera at an object and ask questions about it, receiving spoken answers in multiple languages. The team is also integrating the model into a robot dog that can respond to voice commands. Why it matters: This work addresses key challenges in deploying LLMs in real-world applications in the Middle East, such as multilingual support and real-time responsiveness.

Multimodality for story-level understanding and generation of visual data

MBZUAI

Vicky Kalogeiton from École Polytechnique discussed the importance of multimodality for story-level recognition and generation using video, audio, text, masks, and clinical data. She presented work on multimodal video understanding using FunnyNet-W and the Short Film Dataset, and showed examples of visual generation from text and other modalities (ET, CAD, DynamicGuidance). Why it matters: Multimodal AI research is growing globally, and this talk highlights how combining different data types can improve understanding and generation, with potential applications that include those relevant to the Middle East.

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

arXiv · Dec 18

A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length and multimodal richness, using human-validated samples. LongShOTAgent, an agentic system, is also presented for analyzing long videos, with both the benchmark and agent demonstrating the challenges faced by state-of-the-art MLLMs.
