GCC AI Research

Results for "regularized MDPs"

Fast Rates for Maximum Entropy Exploration

MBZUAI ·

This paper addresses exploration in reinforcement learning (RL) in unknown environments with sparse rewards, focusing on maximum entropy exploration. It introduces a game-theoretic algorithm for visitation entropy maximization with improved sample complexity of O(H^3S^2A/ε^2). For trajectory entropy, the paper presents an algorithm with O(poly(S, A, H)/ε) complexity, showing the statistical advantage of regularized MDPs for exploration. Why it matters: The research offers new techniques to reduce the sample complexity of RL, potentially enhancing the efficiency of AI agents in complex environments.
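
As a rough illustration of the quantity being maximized, the sketch below computes a visitation entropy for a fixed policy in a small tabular, finite-horizon MDP. It assumes the entropy is summed over the per-step state-action visitation distributions and that the transition kernel is the same at every step; the function name `visitation_entropy` and the array layout are hypothetical choices for this example, not taken from the paper.

```python
import numpy as np

def visitation_entropy(P, pi, mu):
    """Sum over steps h of the Shannon entropy of d_h(s, a).

    P  : array (S, A, S), P[s, a, t] = probability of moving from s to t under a
         (assumed stationary across steps for simplicity).
    pi : array (H, S, A), pi[h, s, a] = probability of action a in state s at step h.
    mu : array (S,), initial state distribution.
    """
    H, S, A = pi.shape
    rho = np.asarray(mu, dtype=float)      # state distribution at the current step
    total = 0.0
    for h in range(H):
        d = rho[:, None] * pi[h]           # d_h(s, a) = rho_h(s) * pi_h(a | s)
        nz = d > 0                         # skip zero entries: 0 * log 0 is treated as 0
        total += -np.sum(d[nz] * np.log(d[nz]))
        rho = np.einsum("sa,sat->t", d, P) # push the distribution one step forward
    return total

# Toy example: 2 states, 2 actions, horizon 3, everything uniform.
S, A, H = 2, 2, 3
P_uniform = np.full((S, A, S), 1.0 / S)
pi_uniform = np.full((H, S, A), 1.0 / A)
mu_uniform = np.full(S, 1.0 / S)
ve_uniform = visitation_entropy(P_uniform, pi_uniform, mu_uniform)

# A fully deterministic system visits a single (s, a) pair per step.
P_det = np.zeros((S, A, S)); P_det[:, :, 0] = 1.0
pi_det = np.zeros((H, S, A)); pi_det[:, :, 0] = 1.0
mu_det = np.array([1.0, 0.0])
ve_det = visitation_entropy(P_det, pi_det, mu_det)
```

In the uniform toy case every step's visitation distribution is uniform over the four state-action pairs, so the total is `H * log(S * A)`; the deterministic case gives zero. An exploration algorithm of the kind summarized above would search for a policy that makes this value large without knowing `P` in advance.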

SGD from the Lens of Markov process: An Algorithmic Stability Perspective

MBZUAI ·

A Marie Curie Fellow from Inria and UIUC presented research on stochastic gradient descent (SGD) through the lens of Markov processes, exploring the relationships between heavy-tailed distributions, generalization error, and algorithmic stability. The research challenges existing theories about the monotonic relationship between heavy tails and generalization error. It introduces a unified approach for proving Wasserstein stability bounds in stochastic optimization, applicable to convex and non-convex losses. Why it matters: The work provides novel insights into the theoretical underpinnings of stochastic optimization, relevant to researchers at MBZUAI and other institutions in the region working on machine learning algorithms.

Distillation Policy Optimization

arXiv ·

The paper introduces a novel actor-critic framework called Distillation Policy Optimization that combines on-policy and off-policy data for reinforcement learning. It incorporates variance reduction mechanisms such as a unified advantage estimator (UAE) and a residual baseline. The empirical results demonstrate improved sample efficiency for on-policy algorithms, narrowing the gap with off-policy methods.

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

arXiv ·

A new method is proposed to reduce the verbosity of LLMs in step-by-step reasoning by retaining moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. This approach acts as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. Experiments using Qwen3-4B-Thinking-2507 show the model achieves baseline accuracy with solutions nearly half as long.

Fine-tuning Text-to-Image Models: Reinforcement Learning and Reward Over-Optimization

MBZUAI ·

The article discusses research on fine-tuning text-to-image diffusion models, including reward function training, online reinforcement learning (RL) fine-tuning, and addressing reward over-optimization. A Text-Image Alignment Assessment (TIA2) benchmark is introduced to study reward over-optimization. TextNorm, a method for confidence calibration in reward models, is presented to reduce over-optimization risks. Why it matters: Improving the alignment and fidelity of text-to-image models is crucial for generating high-quality content, and addressing over-optimization enhances the reliability of these models in creative applications.

Learning to Identify Critical States for Reinforcement Learning from Videos

arXiv ·

Researchers at KAUST have developed a new method called Deep State Identifier for extracting information from videos for reinforcement learning. The method learns to predict returns from video-encoded episodes and identifies critical states using mask-based sensitivity analysis. Experiments demonstrate the method's potential for understanding and improving agent behavior in deep reinforcement learning (DRL).

An Adaptive Stochastic Sequential Quadratic Programming with Differentiable Exact Augmented Lagrangians

MBZUAI ·

Mladen Kolar from the University of Chicago Booth School of Business discussed stochastic optimization with equality constraints at MBZUAI. He presented a stochastic algorithm based on sequential quadratic programming (SQP) using a differentiable exact augmented Lagrangian. The algorithm selects random stepsizes adaptively using a stochastic line search procedure, for which global "almost sure" convergence is established. Why it matters: The presentation highlights MBZUAI's role in hosting discussions on advanced optimization techniques, fostering research and knowledge exchange in the field of machine learning.