GCC AI Research

Results for "Language Model Steering"

The Cylindrical Representation Hypothesis for Language Model Steering

arXiv

Researchers from MBZUAI have proposed the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability observed when steering large language models. CRH relaxes the orthogonality assumption of the Linear Representation Hypothesis, positing a cylindrical structure in which a central axis captures concept differences while the surrounding normal plane controls steering sensitivity. On this view, the intrinsic uncertainty in identifying the sensitive sectors of the normal plane explains why steering outcomes fluctuate even when the steering direction is well aligned with the concept axis. Why it matters: CRH offers a more robust theoretical framework for understanding, and potentially improving, the controllability and reliability of large language models.
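As a toy illustration of the geometry CRH describes (not code from the paper; all names and dimensions are hypothetical), the sketch below decomposes a steering vector into a component along a concept axis and a residual component in the surrounding normal plane:

```python
import numpy as np

def decompose_steering(v: np.ndarray, axis: np.ndarray):
    """Split steering vector v into axial and normal-plane parts."""
    u = axis / np.linalg.norm(axis)   # unit concept axis
    axial = (v @ u) * u               # projection onto the axis
    normal = v - axial                # remainder lies in the normal plane
    return axial, normal

rng = np.random.default_rng(0)
d = 16                                # toy hidden dimension
axis = rng.normal(size=d)             # stand-in concept axis
v = rng.normal(size=d)                # stand-in steering vector

axial, normal = decompose_steering(v, axis)
# Under CRH, two steering vectors with identical axial components can still
# behave differently if their normal-plane parts fall in different sectors.
print(np.linalg.norm(axial), np.linalg.norm(normal))
```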

YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation

arXiv

The paper introduces Yet another Policy Optimization (YaPO), a reference-free method that learns sparse steering vectors in the latent space of a Sparse Autoencoder (SAE) to steer LLMs toward a target domain. By optimizing sparse codes rather than dense directions, YaPO produces disentangled, interpretable, and efficient steering directions. Experiments show that YaPO converges faster, achieves stronger steering performance, trains more stably, and better preserves general knowledge than dense steering baselines.
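To make the mechanism concrete, here is a minimal sketch of sparse activation steering through an SAE decoder; the decoder weights, dimensions, and active features are hypothetical stand-ins, and YaPO's actual optimization objective is described in the paper:

```python
import torch

d_model, d_sae = 64, 512
W_dec = torch.randn(d_sae, d_model)   # stand-in SAE decoder weights

# A sparse code over SAE features; in YaPO-style training this code would be
# learned under a sparsity constraint so only a few features stay active.
z = torch.zeros(d_sae)
z[[3, 41, 200]] = torch.tensor([0.8, -0.5, 1.2])  # a few active features

steering_vector = z @ W_dec           # decode sparse code into model space

def steer(hidden: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Add the decoded sparse steering direction to a hidden state."""
    return hidden + alpha * steering_vector

h = torch.randn(d_model)
print(steer(h).shape)                 # torch.Size([64])
```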

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

arXiv

This paper investigates the intrinsic self-correction capabilities of LLMs and identifies model confidence as a key latent factor. The researchers developed an "If-or-Else" (IoE) prompting framework that guides an LLM to assess its confidence in its initial answer and decide whether to revise it. Experiments demonstrate that IoE-based prompts improve the accuracy of self-corrected responses; code is available on GitHub.
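The exact prompt wording is in the paper's GitHub repository; the template below is only a hypothetical sketch of what an If-or-Else style confidence prompt might look like:

```python
# Hypothetical IoE-style template (illustrative, not the paper's wording):
# keep the answer if confident, or else re-examine and correct it.
IOE_TEMPLATE = (
    "Question: {question}\n"
    "Your previous answer: {answer}\n\n"
    "If you are confident in your answer, keep it unchanged. "
    "Or else, re-examine your reasoning and provide a corrected answer."
)

def build_ioe_prompt(question: str, answer: str) -> str:
    """Fill the IoE-style template for one self-correction round."""
    return IOE_TEMPLATE.format(question=question, answer=answer)

print(build_ioe_prompt("What is 17 * 24?", "408"))
```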

Instruction-Guided Poetry Generation in Arabic and Its Dialects

arXiv

Researchers at MBZUAI have developed a new method for controllable poetry generation in Arabic and its dialects, moving beyond the analysis tasks that have traditionally dominated work on Arabic poetry with Large Language Models (LLMs). They introduce a large-scale instruction-based dataset spanning Modern Standard Arabic (MSA) and several Arabic dialects, enabling LLMs to write, revise, and continue poems according to user-specified criteria (an example record format is sketched below). Experiments show that fine-tuning LLMs on this dataset yields models that generate poetry aligned with user requirements, as validated by automated metrics and human evaluation. Why it matters: This work advances Arabic Natural Language Processing, offering tools for creative expression and cultural preservation while opening new avenues for user-guided generation of culturally rich text forms.
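For illustration only, a record in such an instruction dataset might look like the following; the field names and schema here are hypothetical, not the authors' actual format:

```python
import json

# Hypothetical instruction-tuning record for controllable poetry generation;
# the dataset's real schema is defined by the authors.
record = {
    "instruction": "Write a four-line poem in Modern Standard Arabic about the sea.",
    "constraints": {"dialect": "MSA", "task": "write"},
    "output": "...",  # gold poem text would appear here in the dataset
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```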