The Cylindrical Representation Hypothesis for Language Model Steering
arXiv · Significant research
Summary
Researchers have proposed the Cylindrical Representation Hypothesis (CRH) to explain the instability and unpredictability observed when steering large language models, behavior that the existing Linear Representation Hypothesis (LRH) does not fully account for. CRH holds that overlapping concept contributions produce a sample-specific axis-orthogonal structure: a central axis for concept generation, surrounded by a normal plane for steering sensitivity. Within that plane, the framework locates intrinsic uncertainty at the level of 'sensitive sectors', which gives a principled explanation for fluctuations in steering outcomes. Experiments verify the existence of this cylindrical structure and show that CRH helps interpret real-world model steering behavior. Code is available on GitHub from mbzuai-nlp.
Why it matters: This research from MBZUAI offers a theoretical account of why steering outcomes fluctuate, which could help improve the control and reliability of large language models.
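The axis-plus-normal-plane picture in the summary can be illustrated with a small geometric sketch. This is not the authors' method or code: it only shows, under assumed toy values, how a steering vector can be split into a part along a central axis and a part lying in the plane orthogonal to it. The function names (`decompose`, `angle_in_plane`), the example vectors, and the use of an angle to describe a direction within the plane are all assumptions made for illustration; the summary does not say how the paper defines or measures sensitive sectors.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def decompose(vector, axis):
    """Split `vector` into a part along `axis` and a part orthogonal to it.

    Hypothetical helper for illustration: `axis` stands in for the central
    axis, and the orthogonal remainder lies in the normal plane.
    """
    norm_sq = dot(axis, axis)
    if norm_sq == 0:
        raise ValueError("axis must be non-zero")
    scale = dot(vector, axis) / norm_sq
    along = [scale * a for a in axis]
    normal = [x - p for x, p in zip(vector, along)]
    return along, normal


def angle_in_plane(normal, e1, e2):
    """Angle in radians of `normal` relative to an orthonormal basis (e1, e2)
    of the plane. How directions in the plane map to sensitivity is not
    specified here; this only locates a direction."""
    return math.atan2(dot(normal, e2), dot(normal, e1))


# Toy 3-D example (assumed values, not taken from any model).
axis = [0.0, 0.0, 1.0]
steer = [1.0, 1.0, 2.0]

along, normal = decompose(steer, axis)
theta = angle_in_plane(normal, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print("along axis:", along)
print("in normal plane:", normal)
print("angle in plane (radians):", theta)
```

In real models the vectors would be high-dimensional hidden states, and the normal "plane" would be the full orthogonal complement of the axis, so any two-dimensional basis like `e1`, `e2` would itself have to be chosen by some procedure that this sketch does not cover.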
Keywords
Cylindrical Representation Hypothesis · Language Model Steering · Large Language Models · Model Interpretability · MBZUAI