Skip to content
GCC AI Research

Search

Results for "TextGames"

Can LLMs reason? New benchmark puts models to the test

MBZUAI ·

MBZUAI researchers created a new benchmark dataset called TextGames to evaluate the reasoning abilities of LLMs. The dataset uses simple, text-based games requiring skills like pattern recognition and logical thinking. LLMs struggled with the hardest questions, suggesting limitations in their reasoning capabilities despite advancements in language understanding. Why it matters: This research highlights the need for specialized reasoning models and benchmarks that go beyond memorization to truly test AI's problem-solving abilities.

Modeling Text as a Living Object

MBZUAI ·

The InterText project, funded by the European Research Council, aims to advance NLP by developing a framework for modeling fine-grained relationships between texts. This approach enables tracing the origin and evolution of texts and ideas. Iryna Gurevych from the Technical University of Darmstadt presented the intertextual approach to NLP, covering data modeling, representation learning, and practical applications. Why it matters: This research could enable a new generation of AI applications for text work and critical reading, with potential applications in collaborative knowledge construction and document revision assistance.

The diagnosis game: A simulated hospital environment to measure AI agents’ diagnostic abilities

MBZUAI ·

MBZUAI researchers developed MedAgentSim, a simulated hospital environment to evaluate AI diagnostic abilities. The simulation uses LLM-powered agents to mimic doctor-patient conversations, providing a dynamic assessment of diagnostic skills. The system includes doctor, patient, and evaluator agents that interact within the simulated hospital, making real-time decisions. Why it matters: This research offers a more realistic evaluation of AI in clinical settings, addressing limitations of current benchmarks and potentially improving AI's use in healthcare.

Fine-tuning Text-to-Image Models: Reinforcement Learning and Reward Over-Optimization

MBZUAI ·

The article discusses research on fine-tuning text-to-image diffusion models, including reward function training, online reinforcement learning (RL) fine-tuning, and addressing reward over-optimization. A Text-Image Alignment Assessment (TIA2) benchmark is introduced to study reward over-optimization. TextNorm, a method for confidence calibration in reward models, is presented to reduce over-optimization risks. Why it matters: Improving the alignment and fidelity of text-to-image models is crucial for generating high-quality content, and addressing over-optimization enhances the reliability of these models in creative applications.

GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human

arXiv ·

The GenAI Content Detection Task 1 is a shared task on detecting machine-generated text, featuring monolingual (English) and multilingual subtasks. The task, part of the GenAI workshop at COLING 2025, attracted 36 teams for the English subtask and 26 for the multilingual one. The organizers provide a detailed overview of the data, results, system rankings, and analysis of the submitted systems.

FIRE: Fact-checking with Iterative Retrieval and Verification

arXiv ·

A novel agent-based framework called FIRE is introduced for fact-checking long-form text. FIRE iteratively integrates evidence retrieval and claim verification, deciding whether to provide a final answer or generate a subsequent search query. Experiments show FIRE achieves comparable performance to existing methods while reducing LLM costs by 7.6x and search costs by 16.5x.