Researchers created a cross-cultural corpus of annotated verbal and nonverbal behaviors in receptionist interactions. The corpus comprises native speakers of American English and of Arabic role-playing scenarios at university reception desks in Doha, Qatar, and Pittsburgh, USA. The manually annotated nonverbal behaviors include gaze direction, hand gestures, torso positions, and facial expressions. Why it matters: This resource can be valuable for the human-robot interaction community, especially for building culturally aware AI systems.
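To make the annotation scheme concrete, here is a minimal sketch of how one annotated segment from such a corpus might be represented in code. The field names and label values are illustrative assumptions, not the corpus's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one annotated segment of a receptionist interaction.
# Field names and label vocabularies are assumptions for illustration only.
@dataclass
class BehaviorAnnotation:
    clip_id: str            # recording identifier
    start_s: float          # segment onset in seconds
    end_s: float            # segment offset in seconds
    speaker_lang: str       # e.g. "en-US" or "ar"
    transcript: str         # verbal content of the segment
    gaze: str               # e.g. "at_visitor", "at_screen"
    gesture: str            # e.g. "point", "beat", "none"
    torso: str              # e.g. "facing_visitor", "turned_away"
    facial_expression: str  # e.g. "smile", "neutral"

# Purely illustrative example record
example = BehaviorAnnotation(
    clip_id="doha_012", start_s=3.2, end_s=5.8, speaker_lang="ar",
    transcript="...", gaze="at_visitor", gesture="point",
    torso="facing_visitor", facial_expression="smile",
)
```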
This article previews a talk by Gül Varol from École des Ponts ParisTech on bridging natural language and 3D human motion. The talk will cover text-to-motion synthesis using generative models and text-to-motion retrieval models based on the ACTOR, TEMOS, TMR, TEACH, and SINC papers. Varol's research interests include video representation learning, human motion synthesis, and sign languages. Why it matters: Research in this area could enable more intuitive human-computer interaction and new applications in areas like virtual reality and robotics.
Christian Montag from Ulm University gave a talk about assessing attitudes towards AI, covering the IMPACT framework (Interaction, Modality, Person, Area, Country/Culture, and Transparency). He discussed how factors like age, gender, personality, and culture relate to attitudes toward AI, and how those attitudes link to trust in automation and specific AI models like ChatGPT and Ernie Bot. Montag's research explores the intersection of psychology, neuroscience, behavioral economics, and computer science, focusing on the impact of AI on the human mind. Why it matters: Understanding public perception of AI is crucial for responsible development and deployment, especially in the Arab world where cultural and demographic factors can significantly shape attitudes.
A new benchmark, LongShOTBench, is introduced for evaluating multimodal reasoning and tool use in long videos, featuring open-ended questions and diagnostic rubrics. The benchmark addresses the limitations of existing datasets by combining temporal length with multimodal richness, using human-validated samples. LongShOTAgent, an agentic system for analyzing long videos, is also presented; together, the benchmark and the agent expose the challenges that state-of-the-art MLLMs face on long-video reasoning.
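As a rough illustration of rubric-based scoring of open-ended answers, the sketch below computes the fraction of rubric criteria an answer satisfies. The rubric format, the judge callable, and the aggregation are assumptions for exposition, not LongShOTBench's actual evaluation protocol.

```python
from typing import Callable

def score_answer(answer: str,
                 rubric: list[str],
                 judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of rubric criteria the answer satisfies (hypothetical scheme)."""
    if not rubric:
        return 0.0
    hits = sum(judge(answer, criterion) for criterion in rubric)
    return hits / len(rubric)

# Toy judge based on keyword containment; a real setup would more likely use
# a human or LLM judge per criterion.
toy_judge = lambda ans, crit: crit.lower() in ans.lower()

rubric = ["red car", "final chase scene"]
print(score_answer("A red car appears just before the final chase scene.",
                   rubric, toy_judge))  # -> 1.0
```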
Ivan Laptev from INRIA Paris presented a talk at MBZUAI on embodied multi-modal visual understanding, covering advancements in video understanding tasks like question answering and captioning. The talk highlighted recent work on vision-language navigation and manipulation. He argued that detailed understanding of the physical world through vision is still at an early stage, and discussed open research directions related to robotics and video generation. Why it matters: The discussion of robotics applications and future research directions in embodied AI could influence the direction of AI research and development in the UAE, particularly at MBZUAI.
Researchers from MBZUAI have introduced VideoMolmo, a large multimodal model for spatio-temporal pointing conditioned on textual descriptions. The model incorporates a temporal module with an attention mechanism and a temporal mask fusion pipeline using SAM2 for improved coherence across video sequences. They also curated a dataset of 72k video-caption pairs and introduced VPoS-Bench, a benchmark for evaluating generalization across real-world scenarios, with code and models publicly available.
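To give a feel for why temporal fusion of per-frame masks helps coherence, here is a conceptual sketch that smooths binary masks over a small temporal window. It illustrates the general idea only; it is not VideoMolmo's actual pipeline, which propagates masks with SAM2 and uses an attention-based temporal module.

```python
import numpy as np

def fuse_masks(masks: np.ndarray, window: int = 3, threshold: float = 0.5) -> np.ndarray:
    """Temporally smooth binary masks of shape (T, H, W) by averaging over a window.

    Illustrative assumption: simple windowed averaging followed by thresholding,
    standing in for a learned temporal mask fusion step.
    """
    T = masks.shape[0]
    fused = np.empty_like(masks, dtype=np.float32)
    for t in range(T):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2 + 1)
        fused[t] = masks[lo:hi].mean(axis=0)  # average masks in the window
    return (fused >= threshold).astype(masks.dtype)

# Example: smooth 5 noisy 4x4 masks
rng = np.random.default_rng(0)
noisy = (rng.random((5, 4, 4)) > 0.4).astype(np.uint8)
smooth = fuse_masks(noisy)
```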