Skip to content
GCC AI Research

Search

Results for "video generation"

Cross-modal understanding and generation of multimodal content

MBZUAI ·

Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information like labels, driving videos, or text. He introduced a Learnable Game Engine (LGE) trained from monocular annotated videos, which maintains states of scenes, objects, and agents to render controllable viewpoints. Why it matters: This talk highlights advancements in cross-modal AI, potentially enabling new applications in gaming, simulation, and content creation within the region.

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

arXiv ·

FancyVideo, a new video generator, introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video models. CTGM uses a Temporal Information Injector and Temporal Affinity Refiner to achieve frame-specific textual guidance, improving comprehension of temporal logic. Experiments on the EvalCrafter benchmark demonstrate FancyVideo's state-of-the-art performance in generating dynamic and consistent videos, also supporting image-to-video tasks.

Golden Noise and Ziazag Sampling of Diffusion Models

MBZUAI ·

Dr. Zeke Xie from HKUST(GZ) presented research on noise initialization and sampling strategies for diffusion models. The talk covered golden noise for text-to-image models, zigzag diffusion sampling, smooth initializations for video diffusion, and leveraging image diffusion for video synthesis. Xie leads the xLeaF Lab, focusing on optimization, inference, and generative AI, with previous experience at Baidu Research. Why it matters: The work addresses core challenges in improving the quality and diversity of generated content from diffusion models, a key area of advancement for AI applications in the region.

Multimodality for story-level understanding and generation of visual data

MBZUAI ·

Vicky Kalogeiton from École Polytechnique discussed the importance of multimodality for story-level recognition and generation using video, audio, text, masks and clinical data. She presented on multimodal video understanding using FunnyNet-W and Short Film Dataset. She further showed examples of visual generation from text and other modalities (ET, CAD, DynamicGuidance). Why it matters: Multimodal AI research is growing globally, and this talk highlights the potential of combining different data types for enhanced understanding and generation, which could have implications for various applications, including those relevant to the Middle East.

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

arXiv ·

Video-ChatGPT is a new multimodal model that combines a video-adapted visual encoder with a large language model (LLM) to enable detailed video understanding and conversation. The authors introduce a new dataset of 100,000 video-instruction pairs for training the model. They also develop a quantitative evaluation framework for video-based dialogue models.

Old images to anticipate the future

MBZUAI ·

MBZUAI researchers presented a new approach to video question answering at ICCV 2023. The method leverages insights from analyzing still images to understand video content, potentially reducing the computational resources needed for training video question answering models. Guangyi Chen, Kun Zhang, and colleagues aim to apply pre-trained image models to understand video concepts. Why it matters: This research could lead to more efficient and accessible video analysis tools, benefiting fields like healthcare and security where video data is abundant.

Image generation and manipulation research at VinAI

MBZUAI ·

VinAI Research presented research projects focused on advancing image generation and manipulation using GANs and Diffusion Models. The research aims to improve GANs regarding utility, coverage, and output consistency. For Diffusion Models, the work focuses on improving the models’ speed to approach real-time performance and prevent negative social impact of diffusion-based personalized text-to-image generation. Why it matters: This talk indicates ongoing research and development in generative AI in Southeast Asia, an area of growing interest globally.