The Cylindrical Representation Hypothesis for Language Model Steering
arXiv · Significant research
Summary
Researchers from MBZUAI have proposed the Cylindrical Representation Hypothesis (CRH) to explain why steering large language models is often unstable and unpredictable. CRH relaxes the orthogonality assumption of the existing Linear Representation Hypothesis and posits a cylindrical structure instead: a central axis captures the difference between concepts, and a surrounding normal plane controls how sensitive the model is to steering. According to the hypothesis, only certain sectors of this normal plane are sensitive, and the intrinsic uncertainty in identifying them explains why steering outcomes frequently fluctuate even when the steering direction is well aligned.
Why it matters: By tracing steering instability to the geometry of the representation rather than to poorly chosen directions, this research offers a theoretical framework for understanding, and potentially improving, the control and reliability of large language models.
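To make the axis and normal-plane picture concrete, here is a minimal toy sketch in Python. It is not the paper's method: the vectors are random, the names `split_axis_normal`, `concept_axis`, and `steer` are hypothetical, and treating the normal plane as everything orthogonal to the axis is an assumption made for illustration only.

```python
import numpy as np


def split_axis_normal(v, axis):
    """Split v into a part along `axis` and a remainder orthogonal to it.

    Assumption: the orthogonal remainder stands in for the "normal plane"
    component described in the summary.
    """
    u = axis / np.linalg.norm(axis)   # unit vector along the axis
    axial = np.dot(v, u) * u          # projection of v onto the axis
    normal = v - axial                # what is left lies orthogonal to the axis
    return axial, normal


rng = np.random.default_rng(0)
dim = 8                               # toy dimensionality, not from the source

# Hypothetical concept axis and a steering vector that is mostly aligned with it.
concept_axis = rng.normal(size=dim)
unit_axis = concept_axis / np.linalg.norm(concept_axis)
steer = unit_axis + 0.1 * rng.normal(size=dim)

axial, normal = split_axis_normal(steer, concept_axis)

# Cosine similarity between the steering vector and the axis.
cosine = float(np.dot(steer, unit_axis) / np.linalg.norm(steer))

print("cosine with axis:", round(cosine, 3))
print("axial norm:", round(float(np.linalg.norm(axial)), 3))
print("normal norm:", round(float(np.linalg.norm(normal)), 3))
```

Even when the cosine with the axis is high, the steering vector keeps a nonzero normal component. Under the hypothesis as summarized here, it is this off-axis part, and whether it falls in a sensitive sector, that would drive the variation in steering outcomes.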
Keywords
Language Model Steering · Cylindrical Representation Hypothesis · LLMs · MBZUAI · AI Research