IBM Fellow Dr. Tanveer Syeda-Mahmood gave a talk on the evolution of foundation models, covering multimodal fusion in healthcare and neuro-inspired AI for computer vision. She also discussed image-driven fact-checking of AI-generated textual reports as a step toward responsible generative models. Dr. Syeda-Mahmood leads IBM's work on Multimodal Bioinspired AI and watsonx features, and previously led the Medical Sieve Radiology Grand Challenge. Why it matters: The talk highlights the ongoing development and application of AI foundation models in critical areas like healthcare and responsible AI, showing IBM's continued investment in these areas.
Paul Liang from CMU presented on machine learning foundations for multisensory AI, including a theoretical framework for modality interactions. The talk covered cross-modal attention and multimodal transformer architectures, along with applications in mental health, pathology, and robotics. Liang's research aims to enable AI systems to integrate and learn from diverse real-world sensory modalities. Why it matters: This highlights the growing importance of multimodal AI research and its potential to drive advances across sectors in the region, including healthcare and robotics.
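The cross-modal attention covered in the talk can be illustrated with a minimal NumPy sketch. This is not Liang's implementation; the function names, dimensions, and toy data are illustrative. The core idea: tokens from one modality (e.g. text) query the features of another (e.g. visual patches), yielding fused representations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_mod, context_mod):
    """Single-head cross-modal attention: each query-modality token
    (e.g. a text token) attends over the context modality
    (e.g. visual patches) and returns a fused representation."""
    d = query_mod.shape[-1]
    scores = query_mod @ context_mod.T / np.sqrt(d)  # (len_q, len_k) similarities
    weights = softmax(scores, axis=-1)               # attention distribution per query token
    return weights @ context_mod, weights            # fused features, attention map

rng = np.random.default_rng(0)
text = rng.standard_normal((10, 64))    # toy text-token features
vision = rng.standard_normal((49, 64))  # toy visual-patch features
fused, weights = cross_modal_attention(text, vision)
print(fused.shape)  # (10, 64)
```

In multimodal transformers, blocks like this are stacked with multi-head projections and residual connections; the sketch keeps only the core attention step.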
MBZUAI President Eric Xing delivered a talk at Carnegie Mellon University on May 13, 2022, titled “From Learning, to Meta-Learning, to Lego-Learning — theory, systems, and engineering.” Xing discussed the development of a standard model for learning, inspired by the standard model in physics, which aims to unify various machine learning paradigms. Before joining MBZUAI, Xing was a professor at CMU and founder of Petuum Inc., an AI development platform company. Why it matters: This talk highlights MBZUAI's leadership in advancing theoretical frameworks for machine learning and its commitment to unifying different AI approaches.
This seminar explores vision systems through self-supervised representation learning, addressing challenges and solutions in mainstream self-supervised methods for vision. It discusses developing versatile representations across modalities, tasks, and architectures to advance vision foundation models. Tong Zhang of EPFL, with prior experience at Beihang University, New York University, and the Australian National University, will lead the talk. Why it matters: Advancing vision foundation models is crucial for expanding AI applications, especially in the Middle East, where computer vision can address challenges in areas like urban planning, agriculture, and environmental monitoring.
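One mainstream family of vision self-supervised methods is contrastive learning, where two augmented views of the same image are pulled together in embedding space. A minimal NumPy sketch of an InfoNCE-style loss follows; this is a generic illustration, not the speaker's method, and the temperature value is an assumed hyperparameter.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive (InfoNCE-style) loss between two augmented views.
    z1, z2: (batch, dim) embeddings of the same images under different
    augmentations; row i of z1 and row i of z2 form a positive pair."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize so the dot
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)  # product is cosine similarity
    logits = z1 @ z2.T / temperature                     # (batch, batch) similarity matrix
    # cross-entropy with positive pairs on the diagonal
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 64))
loss_matched = info_nce_loss(z, z)                            # views agree: low loss
loss_random = info_nce_loss(z, rng.standard_normal((8, 64)))  # unrelated views: high loss
```

Training an encoder to minimize this loss yields representations that transfer across downstream tasks, which is the versatility the seminar targets.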
Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.
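VideoMolmo's actual fusion pipeline relies on SAM2; purely as a hypothetical illustration of why temporal fusion improves coherence, here is a toy blend of per-frame soft masks with the previous frame's fused mask. The function name and the alpha weight are assumptions for this sketch, not details from the paper.

```python
import numpy as np

def fuse_masks_temporally(masks, alpha=0.7):
    """Toy temporal mask fusion: blend each frame's predicted soft mask
    with the fused mask of the previous frame, smoothing flicker across
    the video sequence. masks: (num_frames, H, W), values in [0, 1]."""
    fused = [masks[0]]
    for m in masks[1:]:
        fused.append(alpha * m + (1 - alpha) * fused[-1])
    return np.stack(fused)

masks = np.zeros((3, 4, 4))
masks[1] = 1.0  # object detected only in frame 1
fused = fuse_masks_temporally(masks)
```

The blend carries evidence forward in time, so a one-frame detection decays gradually instead of vanishing, which is the coherence property the paper's attention-based module pursues with far more sophistication.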
MBZUAI has released Jais and Jais-chat, two new open generative large language models (LLMs) with a focus on Arabic. The 13-billion-parameter models are based on the GPT-3 architecture and pretrained on Arabic, English, and code. Evaluations show state-of-the-art Arabic knowledge and reasoning, along with competitive English performance.
A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length with multimodal richness, using human-validated samples. LongShOTAgent, an agentic system for analyzing long videos, is also presented; together, the benchmark and agent reveal the challenges these tasks pose for state-of-the-art MLLMs.