A new survey paper provides a deep dive into post-training methodologies for Large Language Models (LLMs), analyzing their role in refining LLMs beyond pretraining. It addresses key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs, and highlights emerging directions in model alignment, scalable adaptation, and inference-time reasoning. The paper also provides a public repository to continually track developments in this fast-evolving field.
The paper introduces MIRAGE, a framework for evaluating LLMs' ability to simulate human behaviors in murder mystery games. MIRAGE uses four methods: TII, CIC, ICI and SCI to assess the LLMs' role-playing proficiency. Experiments show that even GPT-4 struggles with the complexities of the MIRAGE framework.
This paper presents a UI-level evaluation of ALLaM-34B, an Arabic-centric LLM developed by SDAIA and deployed in the HUMAIN Chat service. The evaluation used a prompt pack spanning various Arabic dialects, code-switching, reasoning, and safety, with outputs scored by frontier LLM judges. Results indicate strong performance in generation, code-switching, MSA handling, reasoning, and improved dialect fidelity, positioning ALLaM-34B as a robust Arabic LLM suitable for real-world use.
This paper introduces ProgramFC, a fact-checking model that decomposes complex claims into simpler sub-tasks using a library of functions. The model uses LLMs to generate reasoning programs and executes them by delegating sub-tasks, enhancing explainability and data efficiency. Experiments on fact-checking datasets demonstrate ProgramFC's superior performance compared to baseline methods, with publicly available code and data.
MBZUAI researchers release LLM-DetectAIve, a tool for fine-grained detection of machine-generated text across four categories: human-written, machine-generated, machine-written then humanized, and human-written then machine-polished. The tool aims to address concerns about misuse of LLMs, especially in education and academia, by identifying attempts to obfuscate or polish content. LLM-DetectAIve is publicly accessible with code and a demonstration video provided.
MBZUAI researchers introduce LLM-BabyBench, a benchmark suite for evaluating grounded planning and reasoning in LLMs. The suite, built on a textual adaptation of the BabyAI grid world, assesses LLMs on predicting action consequences, generating action sequences, and decomposing instructions. Datasets, evaluation harness, and metrics are publicly available to facilitate reproducible assessment.
The paper introduces LLMEffiChecker, a tool to test the computational efficiency robustness of LLMs by identifying vulnerabilities that can significantly degrade performance. LLMEffiChecker uses both white-box (gradient-guided perturbation) and black-box (causal inference-based perturbation) methods to delay the generation of the end-of-sequence token. Experiments on nine public LLMs demonstrate that LLMEffiChecker can substantially increase response latency and energy consumption with minimal input perturbations.
Researchers introduce TimeTravel, a benchmark dataset for evaluating large multimodal models (LMMs) on historical and cultural artifacts. The benchmark comprises 10,250 expert-verified samples across 266 cultures and 10 historical regions, designed to assess AI in tasks like classification and interpretation of manuscripts, artworks, inscriptions, and archaeological discoveries. The goal is to establish AI as a reliable partner in preserving cultural heritage and assisting researchers.