Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information such as labels, driving videos, or text. He introduced a Learnable Game Engine (LGE), trained on annotated monocular videos, which maintains states of scenes, objects, and agents to render controllable viewpoints. Why it matters: This talk highlights advancements in cross-modal AI, potentially enabling new applications in gaming, simulation, and content creation within the region.
FancyVideo, a new video generator, introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video models. CTGM pairs a Temporal Information Injector with a Temporal Affinity Refiner to provide frame-specific textual guidance, improving the model's handling of temporal logic. Experiments on the EvalCrafter benchmark demonstrate FancyVideo's state-of-the-art performance in generating dynamic and consistent videos; the model also supports image-to-video tasks.
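To make the idea of frame-specific textual guidance concrete, here is a minimal, hypothetical sketch in the spirit of CTGM: per-frame temporal codes are injected into the text embedding before cross-attention, and a lightweight refiner smooths the result along the time axis. Module names, shapes, and the attention layout are illustrative assumptions, not FancyVideo's actual implementation.

```python
import torch
import torch.nn as nn

class FrameSpecificTextGuidance(nn.Module):
    """Hypothetical sketch: give each video frame its own view of the text
    embedding before cross-attention, so textual guidance can vary over time."""

    def __init__(self, dim: int, num_frames: int):
        super().__init__()
        # Learned per-frame offsets injected into the text embedding (assumption).
        self.temporal_injector = nn.Embedding(num_frames, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Lightweight refiner along the temporal axis (assumption).
        self.affinity_refiner = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, frame_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, dim); text_emb: (batch, text_len, dim)
        b, f, d = frame_tokens.shape
        frame_ids = torch.arange(f, device=frame_tokens.device)
        # Each frame queries a text embedding shifted by its own temporal code.
        per_frame_text = text_emb.unsqueeze(1) + self.temporal_injector(frame_ids).view(1, f, 1, d)
        guided = []
        for t in range(f):
            out, _ = self.cross_attn(frame_tokens[:, t:t + 1],
                                     per_frame_text[:, t], per_frame_text[:, t])
            guided.append(out)
        guided = torch.cat(guided, dim=1)  # (batch, frames, dim)
        # Refine affinities over time so adjacent frames stay consistent.
        return self.affinity_refiner(guided.transpose(1, 2)).transpose(1, 2)
```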
Dr. Zeke Xie from HKUST(GZ) presented research on noise initialization and sampling strategies for diffusion models. The talk covered golden noise for text-to-image models, zigzag diffusion sampling, smooth initializations for video diffusion, and leveraging image diffusion for video synthesis. Xie leads the xLeaF Lab, focusing on optimization, inference, and generative AI, with previous experience at Baidu Research. Why it matters: The work addresses core challenges in improving the quality and diversity of generated content from diffusion models, a key area of advancement for AI applications in the region.
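As an illustration of the zigzag idea, the sketch below alternates a denoising step with a re-noising step at each timestep, so the sampler repeatedly revisits each noise level before moving on. Here `denoise_step` and `renoise_step` are assumed callables standing in for one step down or up a noise schedule, not a specific library API.

```python
import torch

def zigzag_sample(x_T: torch.Tensor, timesteps, denoise_step, renoise_step, zigzag_steps: int = 1):
    """Hedged sketch of a zigzag-style diffusion sampler.

    At each timestep we denoise once, then step back up the schedule and
    denoise again, so the trajectory "zigzags" and can absorb more guidance.
    """
    x = x_T
    for t in timesteps:
        x = denoise_step(x, t)          # forward (denoising) move
        for _ in range(zigzag_steps):
            x = renoise_step(x, t)      # step back up the noise schedule
            x = denoise_step(x, t)      # and denoise the same level again
    return x
```

Setting `zigzag_steps=0` recovers a plain sampling loop, which makes the extra cost of the back-and-forth moves easy to control.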
Vicky Kalogeiton from École Polytechnique discussed the importance of multimodality for story-level recognition and generation using video, audio, text, masks, and clinical data. She presented work on multimodal video understanding based on FunnyNet-W and the Short Film Dataset, and showed examples of visual generation conditioned on text and other modalities (ET, CAD, DynamicGuidance). Why it matters: Multimodal AI research is growing globally, and this talk highlights the potential of combining different data types for enhanced understanding and generation, which could have implications for various applications, including those relevant to the Middle East.
Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.
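A minimal sketch of the video-adapted encoder idea follows, assuming per-frame patch features from a frozen image encoder: features are pooled across space and across time, and the resulting tokens are projected into the LLM's embedding space. Dimensions and names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VideoToLLMProjector(nn.Module):
    """Sketch: pool per-frame visual features over space and time, then
    project the pooled tokens into the LLM's embedding space."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)  # learned projection into LLM token space

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, patches, vis_dim) from a frozen image encoder
        temporal_tokens = frame_feats.mean(dim=2)  # average over patches -> (b, frames, vis_dim)
        spatial_tokens = frame_feats.mean(dim=1)   # average over frames  -> (b, patches, vis_dim)
        video_tokens = torch.cat([temporal_tokens, spatial_tokens], dim=1)
        return self.proj(video_tokens)             # tokens fed to the LLM alongside the prompt
```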
MBZUAI researchers presented a new approach to video question answering at ICCV 2023. Guangyi Chen, Kun Zhang, and colleagues apply pre-trained image models and insights from still-image analysis to understand video content, potentially reducing the computational resources needed to train video question answering models. Why it matters: This research could lead to more efficient and accessible video analysis tools, benefiting fields like healthcare and security where video data is abundant.
VinAI Research presented research projects focused on advancing image generation and manipulation using GANs and Diffusion Models. The work aims to improve GANs in terms of utility, coverage, and output consistency. For Diffusion Models, it focuses on pushing sampling speed toward real-time performance and on preventing the negative social impact of diffusion-based personalized text-to-image generation. Why it matters: This talk indicates ongoing research and development in generative AI in Southeast Asia, an area of growing interest globally.