Towards embodied multi-modal visual understanding
MBZUAI ·
Ivan Laptev from INRIA Paris presented a talk at MBZUAI on embodied multi-modal visual understanding, covering advancements in video understanding tasks like question answering and captioning. The talk highlighted recent work on vision-language navigation and manipulation. He argued that detailed understanding of the physical world through vision is still in early stages, discussing open research directions related to robotics and video generation. Why it matters: The discussion of robotics applications and future research directions in embodied AI could influence the direction of AI research and development in the UAE, particularly at MBZUAI.