VinAI Research presented research projects focused on advancing image generation and manipulation with GANs and diffusion models. The GAN work aims to improve utility, coverage, and output consistency, while the diffusion work focuses on speeding up sampling toward real-time performance and on mitigating the negative social impact of diffusion-based personalized text-to-image generation. Why it matters: The talk reflects ongoing research and development in generative AI in Southeast Asia, an area of growing global interest.
Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information such as labels, driving videos, or text. He introduced a Learnable Game Engine (LGE), trained from monocular annotated videos, which maintains the states of scenes, objects, and agents and renders them from controllable viewpoints. Why it matters: This talk highlights advancements in cross-modal AI, potentially enabling new applications in gaming, simulation, and content creation within the region.
The study analyzes over 1,000 images generated by ImageFX, DALL-E V3, and Grok across 56 Saudi professions, finding significant gender imbalances and cultural inaccuracies. DALL-E V3 exhibited the strongest gender stereotyping, depicting men in 96% of images, particularly in leadership and technical roles. The research underscores the need for diverse training data and culturally sensitive evaluation to ensure equitable AI outputs that accurately reflect Saudi Arabia's labor market and culture.
The paper introduces VENOM, a text-driven framework for generating high-quality unrestricted adversarial examples using diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enhancing both attack success rate and image quality. The framework incorporates an adaptive adversarial guidance strategy with momentum to ensure the generated adversarial examples align with the distribution of natural images.
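The mechanism described above, folding adversarial synthesis into the reverse diffusion process with momentum-smoothed guidance, can be sketched with toy stand-ins. Everything below (`denoise_step`, `adv_gradient`, the constants) is an illustrative assumption, not VENOM's actual components:

```python
import numpy as np

def denoise_step(x, t):
    # Stand-in for a diffusion model's denoising update (pulls x toward 0).
    return 0.9 * x

def adv_gradient(x):
    # Stand-in for the gradient of an attack loss w.r.t. the current image.
    return np.sign(x)

def guided_sampling(x, steps=10, guide_scale=0.05, beta=0.9):
    """Reverse diffusion with momentum-smoothed adversarial guidance."""
    m = np.zeros_like(x)  # momentum buffer for the adversarial direction
    for t in reversed(range(steps)):
        x = denoise_step(x, t)
        m = beta * m + (1 - beta) * adv_gradient(x)  # accumulate guidance
        x = x + guide_scale * m                      # nudge toward the attack
    return x

x0 = np.random.default_rng(0).normal(size=(4,))
adv = guided_sampling(x0)
print(adv.shape)
```

The momentum buffer is what keeps the adversarial perturbation consistent across denoising steps, which is how the paper motivates keeping outputs close to the natural image distribution.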
Axel Sauer from the University of Tübingen presented research on scaling Generative Adversarial Networks (GANs) using pretrained representations. The work explores shaping GANs into causal structures, training them up to 40 times faster, and achieving state-of-the-art image synthesis. The presentation covered "Counterfactual Generative Networks", "Projected GANs", "StyleGAN-XL", and "StyleGAN-T". Why it matters: Scaling GANs and improving their training efficiency are crucial for advancing image and video synthesis, with implications for applications in computer vision, graphics, and robotics.
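One named ingredient, Projected GANs, has the discriminator judge samples in a frozen pretrained feature space rather than in pixel space. A toy numerical sketch of that idea, in which the "pretrained network" is just a fixed random projection and the discriminator is a single linear head (all stand-ins, not the actual networks):

```python
import numpy as np

rng = np.random.default_rng(0)
proj = rng.normal(size=(64, 16)) / 8.0  # frozen feature extractor (never trained)
disc = np.zeros(16)                     # tiny trainable discriminator head

def features(x):
    return np.maximum(x @ proj, 0.0)    # frozen projection + ReLU

def disc_score(x, d):
    return features(x) @ d              # the head only ever sees features

real = rng.normal(loc=1.0, size=(32, 64))   # stand-ins for real images
fake = rng.normal(loc=-1.0, size=(32, 64))  # stand-ins for generator output

# One gradient step on the head: move toward real features, away from fake.
grad = features(real).mean(axis=0) - features(fake).mean(axis=0)
disc = disc + 0.1 * grad

separates = disc_score(real, disc).mean() > disc_score(fake, disc).mean()
print(separates)
```

Training only a small head on top of frozen features is part of what makes this family of methods so much faster to train than pixel-space discriminators.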
Dr. Zeke Xie from HKUST(GZ) presented research on noise initialization and sampling strategies for diffusion models. The talk covered golden noise for text-to-image models, zigzag diffusion sampling, smooth initializations for video diffusion, and leveraging image diffusion for video synthesis. Xie leads the xLeaF Lab, focusing on optimization, inference, and generative AI, with previous experience at Baidu Research. Why it matters: The work addresses core challenges in improving the quality and diversity of generated content from diffusion models, a key area of advancement for AI applications in the region.
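Zigzag diffusion sampling, roughly, alternates denoising steps with partial re-noising so the sampler can revisit and refine earlier estimates. A minimal sketch under that reading; the `denoise`/`renoise` functions are illustrative stand-ins, not a real diffusion model:

```python
import numpy as np

def denoise(x, t):
    # Stand-in reverse (denoising) step: pulls x toward the data mode at 0.
    return 0.9 * x

def renoise(x, t, rng):
    # Stand-in partial forward step: re-injects a little noise.
    return x + 0.1 * rng.normal(size=x.shape)

def zigzag_sample(x, steps=8, seed=0):
    """Alternate denoise -> partial renoise -> denoise at each timestep."""
    rng = np.random.default_rng(seed)
    for t in reversed(range(steps)):
        x = denoise(x, t)       # forward progress along the reverse chain
        x = renoise(x, t, rng)  # step partway back (the "zig")
        x = denoise(x, t)       # denoise again, refining the estimate (the "zag")
    return x

x0 = np.random.default_rng(42).normal(size=(3,))
sample = zigzag_sample(x0)
print(sample.shape)
```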
Researchers from MBZUAI and other institutions have developed a new framework called STEREO to improve the safety of text-to-image diffusion models. STEREO uses a two-stage approach: STE (Search Thoroughly Enough) based on adversarial training and REO (Robustly Erase Once) for batch concept erasure. This framework aims to enhance safety without significantly impacting the model's performance on normal queries. Why it matters: The framework addresses vulnerabilities in AI image generation, reducing the creation of inappropriate images while preserving performance on harmless queries.
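The two-stage structure can be illustrated with a toy linear "model": stage 1 searches thoroughly for inputs that still elicit the target concept, and stage 2 applies one batched update that suppresses all of them. Every name and formula below is an illustrative assumption, not the STEREO implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
concept = rng.normal(size=8)
concept /= np.linalg.norm(concept)
weights = np.outer(concept, concept)  # toy "model" that responds to the concept

def concept_score(w, v):
    return float(v @ w @ v)

def search_adversarial(w, n=16, steps=50, lr=0.1):
    """STE-like stage: gradient-ascend inputs that maximize the concept score."""
    vs = rng.normal(size=(n, 8))
    for _ in range(steps):
        vs += lr * (vs @ (w + w.T))  # gradient of v @ w @ v w.r.t. v
        vs /= np.linalg.norm(vs, axis=1, keepdims=True)
    return vs

def erase_once(w, vs, strength=1.0):
    """REO-like stage: one batched pass that removes the response to each input."""
    for v in vs:
        w = w - strength * np.outer(w @ v, v)  # project out the response to v
    return w

found = search_adversarial(weights)
erased = erase_once(weights, found)
print(concept_score(weights, concept), "->", concept_score(erased, concept))
```

Searching before erasing is the point of the design: inputs found adversarially cover directions a single naive erasure pass would miss, while the one-shot batched erase limits damage to unrelated behavior.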