Fine-tuning Text-to-Image Models: Reinforcement Learning and Reward Over-Optimization

MBZUAI · Notable

Summary

The article covers research on fine-tuning text-to-image diffusion models: training reward functions, online reinforcement learning (RL) fine-tuning, and mitigating reward over-optimization. It introduces the Text-Image Alignment Assessment (TIA2) benchmark for studying reward over-optimization, and presents TextNorm, a confidence-calibration method for reward models that reduces the risk of over-optimization. Why it matters: Better alignment and fidelity in text-to-image models are important for generating high-quality content, and addressing over-optimization makes these models more reliable in creative applications.
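
To make the idea of calibrating a reward model's confidence more concrete, here is a minimal sketch of one possible approach: normalizing the raw score for the intended prompt against scores the same image gets for contrastive prompts. The summary does not describe how TextNorm works internally, so the softmax normalization, the function name `normalized_reward`, and the `temperature` parameter are all assumptions made for illustration, not the published method.

```python
import math


def normalized_reward(scores, target_index=0, temperature=1.0):
    """Softmax-normalize one raw reward score against alternatives.

    Hypothetical illustration: `scores` holds raw reward-model outputs for
    the same image paired with the intended prompt (at `target_index`) and
    with contrastive prompts. The result lies in (0, 1] and is high only
    when the intended prompt clearly outscores the alternatives.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    return exps[target_index] / sum(exps)


# Same raw score for the intended prompt (5.0), different contrastive scores.
confident = normalized_reward([5.0, 1.0, 1.0])
ambiguous = normalized_reward([5.0, 4.9, 4.9])
```

In this sketch a high raw score earns a high normalized reward only when the contrastive prompts score clearly lower, so an image that scores well for every prompt is not rewarded as strongly. That is one intuition for why calibration could limit over-optimization.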

Keywords

text-to-image · fine-tuning · reinforcement learning · reward over-optimization · diffusion models

Related

VENOM: Text-driven Unrestricted Adversarial Example Generation with Diffusion Models

arXiv · Jan 14

The paper introduces VENOM, a text-driven framework for generating high-quality unrestricted adversarial examples using diffusion models. VENOM unifies image content generation and adversarial synthesis into a single reverse diffusion process, enhancing both attack success rate and image quality. The framework incorporates an adaptive adversarial guidance strategy with momentum to ensure the generated adversarial examples align with the distribution of natural images.

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

arXiv · Nov 20

Researchers at MBZUAI have introduced EvoLMM, a self-evolving framework for large multimodal models that enhances reasoning capabilities without human-annotated data or reward distillation. EvoLMM uses two cooperative agents, a Proposer and a Solver, which generate image-grounded questions and solve them through internal consistency, using a continuous self-rewarding process. Evaluations using Qwen2.5-VL as the base model showed performance gains of up to 3% on multimodal math-reasoning benchmarks like ChartQA, MathVista, and MathVision using only raw training images.
