arxiv:2606.11599

When is Your LLM Steerable?

Published on Jun 10

Submitted by Chenrui Fan on Jun 15

University of Maryland College Park


Authors: Chenrui Fan, Tianyi Zhou

Abstract

Activation steering effectiveness can be predicted from early decoding states using a GBDT classifier, enabling efficient steering strength optimization with reduced computational cost.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Activation steering offers a lightweight way to control a language model's behavior at inference time, but whether it succeeds depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of generation, e.g., after the first few tokens, and how such a predictor can be used to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled as a steering success or failure. Using this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features show how the effects of steering propagate across layers and token positions, and they provide key information for steerability prediction. We then train a Gradient Boosting Decision Tree (GBDT) classifier on these features to predict, without a full rollout, whether an intervention will under-steer, succeed, or over-steer. Our predictor achieves a macro-F1 score of around 0.7 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strengths, achieving near-optimal performance at a small fraction of the decoding cost.
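To make the idea of "features that compare hidden states before and after steering" concrete, here is a minimal sketch. It is not the paper's feature set, which the abstract does not specify. It assumes hidden states are available as nested lists indexed `[decoding_step][layer][hidden_dim]`, and it uses two illustrative features per (step, layer) pair: the cosine similarity between unsteered and steered states, and the norm of the shift relative to the unsteered norm. The function names `cosine` and `steering_features` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def steering_features(base, steered):
    """Flatten per-(step, layer) comparisons into one feature vector.

    base, steered: nested lists shaped [step][layer][hidden_dim], holding
    hidden states without and with the steering intervention (assumed layout).
    For each (step, layer) this emits two illustrative features:
      1. cosine similarity between the two hidden states
      2. norm of the shift divided by the norm of the unsteered state
    """
    feats = []
    for base_step, steered_step in zip(base, steered):
        for h0, h1 in zip(base_step, steered_step):
            shift = math.sqrt(sum((b - a) ** 2 for a, b in zip(h0, h1)))
            base_norm = math.sqrt(sum(a * a for a in h0))
            feats.append(cosine(h0, h1))
            feats.append(shift / base_norm if base_norm else 0.0)
    return feats


# Toy example: 1 decoding step, 2 layers, hidden size 2.
base = [[[1.0, 0.0], [0.0, 2.0]]]
steered = [[[1.0, 0.0], [2.0, 0.0]]]
features = steering_features(base, steered)
```

A vector like `features`, computed for each (prompt, concept, strength) triple, could then be fed to any gradient-boosted tree implementation (for example scikit-learn's `GradientBoostingClassifier`, named here only as one possible choice) with the three labels under-steer, success, and over-steer.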


Community

Fcr09 (paper author, paper submitter), 1 day ago

Main contributions:

We curate a steering dataset covering the steered responses of multiple LLMs under different prompts, concepts, and steering strengths. It enables fine-grained analysis of the latent dynamics of steering in LLMs.
We develop features that capture the effects of steering on the latent dynamics, yielding interpretable predictions of steering success and of two types of failure.
By exploiting the generalization capability of the steerability predictor, we introduce a practical approach that selects the optimal steering configuration to improve performance.
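One way a three-way predictor (under-steer, success, over-steer) could guide the choice of steering strength is sketched below. This is an illustration, not the authors' search procedure, which is not described on this page. It assumes the predicted label is monotone in the strength (too weak, then successful, then too strong), so a binary search over sorted candidate strengths needs only a few cheap predictor calls instead of a full rollout per candidate. The names `search_strength` and `toy_predict` are hypothetical.

```python
UNDER, SUCCESS, OVER = "under", "success", "over"


def search_strength(predict, strengths):
    """Binary-search sorted candidate strengths using a cheap outcome predictor.

    predict: callable mapping a strength to UNDER, SUCCESS, or OVER
             (stands in for the early-state steerability predictor).
    strengths: candidate strengths sorted in increasing order.
    Assumes the predicted label is monotone in strength; this is an
    assumption of the sketch, not a claim from the paper.
    Returns (chosen_strength_or_None, number_of_predictor_calls).
    """
    lo, hi = 0, len(strengths) - 1
    calls = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        label = predict(strengths[mid])
        calls += 1
        if label == SUCCESS:
            return strengths[mid], calls
        if label == UNDER:
            lo = mid + 1  # too weak: try larger strengths
        else:
            hi = mid - 1  # too strong: try smaller strengths
    return None, calls


def toy_predict(alpha):
    """Toy stand-in predictor: success only for strengths in [4, 6]."""
    if alpha < 4:
        return UNDER
    if alpha > 6:
        return OVER
    return SUCCESS


best, calls = search_strength(toy_predict, list(range(1, 11)))
none_found, calls_failed = search_strength(toy_predict, [1, 2, 3])
```

If the monotonicity assumption does not hold for a given concept or model, a plain scan that scores every candidate with the predictor is the safer fallback; it still avoids full rollouts.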


