GCC AI Research

Results for "text-to-image"

Fine-tuning Text-to-Image Models: Reinforcement Learning and Reward Over-Optimization

MBZUAI

The article discusses research on fine-tuning text-to-image diffusion models, covering reward function training, online reinforcement learning (RL) fine-tuning, and the problem of reward over-optimization. A Text-Image Alignment Assessment (TIA2) benchmark is introduced to study reward over-optimization, and TextNorm, a confidence-calibration method for reward models, is presented to reduce the risk of over-optimizing against the reward. Why it matters: Improving the alignment and fidelity of text-to-image models is crucial for generating high-quality content, and curbing over-optimization makes these models more reliable in creative applications.
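To make the calibration idea concrete, here is a minimal sketch of confidence-calibrated reward scoring in the spirit of TextNorm, assuming a generic `reward_model(image, prompt) -> scalar` interface and a caller-supplied set of contrastive prompts; it illustrates the normalization idea rather than reproducing the authors' implementation.

```python
import torch

def calibrated_reward(reward_model, image, prompt, contrast_prompts, tau=1.0):
    # Score the image against the intended prompt and against a set of
    # semantically contrastive prompts (assumed interface and inputs).
    prompts = [prompt] + list(contrast_prompts)
    scores = torch.stack([reward_model(image, p) for p in prompts])
    # Normalize with a softmax over prompts: the calibrated reward is the
    # probability mass the reward model places on the intended prompt, so a
    # high raw score only survives if it clearly beats the near-miss prompts.
    return torch.softmax(scores / tau, dim=0)[0]
```

An RL fine-tuning loop would then optimize this calibrated score instead of the raw reward, which is where the reduced over-optimization risk would come from.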

From text to image: M.Sc. graduate develops cutting-edge techniques to transform T2I generation

MBZUAI

MBZUAI M.Sc. graduate Mohammad Hanan Ghani developed new techniques that combine large language models and diffusion models to improve text-to-image generation from long text prompts. Advised by Dr. Salman Khan, Ghani published three papers at ICLR, BMVC, and NeurIPS; the ICLR paper presents a system that follows the details of long input text more faithfully than existing techniques. Why it matters: This research addresses a key limitation of current T2I models and advances the field of multimodal AI, potentially improving the capabilities of robots and autonomous devices.
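The article does not spell out the method, but a common recipe for long-prompt generation pairs the two model families roughly as follows; everything here (the `llm` and `t2i_model` interfaces, the chunking instruction) is a hypothetical illustration, not Ghani's published pipeline.

```python
def generate_from_long_prompt(llm, t2i_model, long_prompt, max_chunks=4):
    # Hypothetical sketch: an LLM compresses the long description into a few
    # focused sub-prompts, and the diffusion model is conditioned on them.
    # Assumed interfaces: llm(text) -> str, t2i_model(prompts) -> image.
    instruction = (
        f"Rewrite the following scene description as at most {max_chunks} "
        "short, concrete prompts, one per line:\n" + long_prompt
    )
    sub_prompts = [p.strip() for p in llm(instruction).splitlines() if p.strip()]
    return t2i_model(sub_prompts[:max_chunks])
```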

Cross-modal understanding and generation of multimodal content

MBZUAI

Nicu Sebe from the University of Trento presented recent work on video generation, focusing on animating objects in a source image using external information like labels, driving videos, or text. He introduced a Learnable Game Engine (LGE) trained from monocular annotated videos, which maintains states of scenes, objects, and agents to render controllable viewpoints. Why it matters: This talk highlights advancements in cross-modal AI, potentially enabling new applications in gaming, simulation, and content creation within the region.

Create and edit images like a smart artist

MBZUAI

Researchers from Carnegie Mellon University and MBZUAI have developed a new method called ConceptAligner for precise image editing using AI. The system decomposes text embeddings into independent building blocks called atomic concepts, allowing users to make targeted tweaks without generating entirely new images. Their approach ensures that each latent factor maps to a specific user-controllable dial, enabling accurate concept-level modifications. Why it matters: This research addresses a major limitation in AI image generation, enhancing its usefulness in industries where precise control is crucial, such as advertising and medicine, and improving the reliability of AI-driven creative tools.
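As a rough illustration of what a concept-level "dial" can mean, the sketch below edits a prompt embedding along one direction of an assumed orthonormal basis of atomic-concept vectors; the actual ConceptAligner decomposition is not reproduced here.

```python
import torch

def edit_along_concept(text_embedding, concept_basis, concept_idx, delta):
    # concept_basis: assumed (num_concepts, dim) orthonormal matrix whose rows
    # are learned atomic-concept directions; text_embedding: (dim,) vector.
    coeffs = concept_basis @ text_embedding            # project onto concepts
    coeffs[concept_idx] += delta                       # turn exactly one dial
    # Keep whatever part of the embedding lies outside the concept subspace.
    residual = text_embedding - concept_basis.T @ (concept_basis @ text_embedding)
    return concept_basis.T @ coeffs + residual         # reassembled embedding
```

Because only one coefficient changes, the edit amounts to adding `delta * concept_basis[concept_idx]` to the embedding, which is what makes the modification targeted rather than a full regeneration.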

VENOM: Text-driven Unrestricted Adversarial Example Generation with Diffusion Models

arXiv

The paper introduces VENOM, a text-driven framework for generating high-quality unrestricted adversarial examples using diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enhancing both attack success rate and image quality. The framework incorporates an adaptive adversarial guidance strategy with momentum to ensure the generated adversarial examples align with the distribution of natural images.
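The sketch below shows what momentum-smoothed adversarial guidance inside a reverse diffusion step can look like; the `denoiser` and `classifier` interfaces and the update rule are assumptions for illustration, not the exact VENOM procedure.

```python
import torch
import torch.nn.functional as F

def adversarial_guidance_step(x_t, denoiser, classifier, target,
                              grad_momentum, scale=1.0, beta=0.9):
    # x_t: current diffusion sample; target: class indices the attack should
    # induce. denoiser(x_t) is assumed to return a clean-image estimate.
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t)
    loss = F.cross_entropy(classifier(x0_hat), target)  # targeted objective
    grad = torch.autograd.grad(loss, x_t)[0]
    # Momentum smooths the adversarial gradient across sampling steps, which
    # is what keeps the perturbation stable and the samples looking natural.
    grad_momentum = beta * grad_momentum + (1 - beta) * grad
    return x_t.detach() - scale * grad_momentum, grad_momentum
```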

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

arXiv

FancyVideo, a new video generator, introduces a Cross-frame Textual Guidance Module (CTGM) to enhance text-to-video models. CTGM uses a Temporal Information Injector and a Temporal Affinity Refiner to provide frame-specific textual guidance, improving the model's grasp of the temporal logic in prompts. Experiments on the EvalCrafter benchmark demonstrate FancyVideo's state-of-the-art performance in generating dynamic and consistent videos, and the model also supports image-to-video generation.
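Here is a rough sketch of what frame-specific textual guidance can look like, with an injector-style mixing step followed by per-frame cross-attention; the tensor shapes and both steps are assumptions for illustration, not the paper's exact Injector and Refiner modules.

```python
import torch

def frame_specific_text_guidance(frame_feats, text_emb):
    # frame_feats: (T, N, D) tokens for T frames; text_emb: (L, D) prompt tokens.
    T, N, D = frame_feats.shape
    # Injector-style step: bias the shared prompt with each frame's pooled
    # features so every frame conditions on a slightly different prompt.
    per_frame_text = text_emb.unsqueeze(0) + frame_feats.mean(dim=1, keepdim=True)
    # Refiner-style step: each frame cross-attends to its own text tokens.
    attn = torch.softmax(
        torch.einsum("tnd,tld->tnl", frame_feats, per_frame_text) / D ** 0.5,
        dim=-1,
    )
    return frame_feats + torch.einsum("tnl,tld->tnd", attn, per_frame_text)
```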

TiBiX: Leveraging Temporal Information for Bidirectional X-ray and Report Generation

arXiv

Researchers at MBZUAI have introduced TiBiX, a novel approach leveraging temporal information from previous chest X-rays (CXRs) and reports for bidirectional generation of current CXRs and reports. TiBiX addresses two key challenges: generating current images from previous images and reports, and generating current reports from both previous and current images. The study also introduces a curated temporal benchmark dataset derived from the MIMIC-CXR dataset and achieves state-of-the-art results in report generation.
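To illustrate what "bidirectional" means here, the sketch below shows one plausible inference interface over a temporal context; `encode`, `generate_image`, and `generate_report` are assumed methods for illustration, not the TiBiX architecture itself.

```python
def bidirectional_step(model, prev_cxr, prev_report, curr_cxr=None):
    # Encode the prior visit (image + report) as temporal context.
    context = model.encode(prev_cxr, prev_report)
    if curr_cxr is None:
        # Direction 1: synthesize the current CXR from the prior visit.
        return model.generate_image(context)
    # Direction 2: write the current report from prior context + current CXR.
    return model.generate_report(context, curr_cxr)
```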