arxiv:2606.13657

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Published on Jun 11 · Submitted by Yu on Jun 15

Abstract

On-policy distillation exhibits sparse parameter updates that are distributed across layers and favor FFN components, while maintaining geometric properties distinct from standard dense parameter rewriting.

On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
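The sparsity and geometry measurements named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's code: the function name `update_diagnostics`, the tolerance `tol`, the subspace size `k`, and the exact metric definitions are all assumptions chosen for illustration, and the paper may define these quantities differently.

```python
import numpy as np

def update_diagnostics(w_src, w_new, tol=1e-5, k=4):
    """Illustrative diagnostics for one weight matrix.

    tol and k are assumed values, not settings taken from the paper.
    """
    delta = w_new - w_src

    # Coordinate sparsity: fraction of entries that moved by at most tol.
    sparsity = float(np.mean(np.abs(delta) <= tol))

    # Spectral concentration: share of squared singular-value mass
    # carried by the top-k singular directions of the update.
    s = np.linalg.svd(delta, compute_uv=False)
    energy = s ** 2
    top_k_share = float(energy[:k].sum() / energy.sum())

    # Numerical rank of the update.
    rank = int(np.linalg.matrix_rank(delta))

    # Principal overlap: share of the update's Frobenius energy that lies
    # inside the top-k left and right singular subspaces of the source weights.
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    proj = u[:, :k].T @ delta @ vt[:k, :].T
    principal_share = float(np.sum(proj ** 2) / np.sum(delta ** 2))

    return {
        "sparsity": sparsity,
        "top_k_share": top_k_share,
        "rank": rank,
        "principal_share": principal_share,
    }

# Synthetic usage: a random source matrix plus a small, sparse update.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(64, 48))
mask = rng.random(w_src.shape) < 0.05
w_new = w_src + 0.01 * rng.normal(size=w_src.shape) * mask
print(update_diagnostics(w_src, w_new))
```

Under these assumed definitions, a high `sparsity` together with a low `principal_share` would correspond to the pattern the abstract describes: few coordinates change, and the change sits mostly outside the source weights' principal singular subspaces.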

Community


On-policy distillation is attractive because it appears to offer the best of both worlds: the student trains on its own on-policy generations, as in reinforcement learning, while still receiving dense teacher supervision, as in distillation. Yet this hybrid nature raises a basic question: does OPD rewrite model parameters like dense supervised fine-tuning, or does it inherit the sparse, geometry-aware behavior of on-policy post-training? This paper shows that the latter view is closer to reality. Across language and vision-language models, OPD produces small, coordinate-sparse, spectrally concentrated, and off-principal updates; moreover, the discovered sparse support is not merely descriptive, since training only that subnetwork nearly recovers full OPD performance. These results suggest that OPD is not ordinary dense distillation on new data, but a form of sparse on-policy editing shaped strongly by the student's own behavioral distribution.
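The idea of training only the discovered subnetwork can be sketched on a toy problem. This is a hedged illustration, not the authors' procedure: `support_mask`, `train_on_support`, the tolerance, and the least-squares objective are hypothetical, and plain gradient descent is used only for brevity (the abstract reports that AdamW outperforms SGD in the paper's ablation).

```python
import numpy as np

def support_mask(w_src, w_tuned, tol=1e-5):
    """Boolean mask of coordinates that a full run changed by more than tol (assumed threshold)."""
    return np.abs(w_tuned - w_src) > tol

def train_on_support(w_src, mask, grad_fn, steps=100, lr=0.1):
    """Gradient descent that updates only the masked coordinates.

    Coordinates outside the mask keep their source values exactly.
    """
    w = w_src.copy()
    for _ in range(steps):
        w = w - lr * grad_fn(w) * mask
    return w

# Toy least-squares problem standing in for the real training objective.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[[1, 4]] = [1.0, -2.0]
y = X @ w_true

def grad_fn(w):
    return X.T @ (X @ w - y) / len(y)

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

w_src = np.zeros(10)
w_full = train_on_support(w_src, np.ones(10, dtype=bool), grad_fn)  # full update
mask = support_mask(w_src, w_full, tol=0.1)                         # discovered support
w_sub = train_on_support(w_src, mask, grad_fn)                      # subnetwork only
print(mask.sum(), loss(w_full), loss(w_sub))
```

The design choice here is to freeze everything outside the mask by zeroing its gradient, which makes the comparison between full and subnetwork training a matter of comparing the two final losses.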


