Title: Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

URL Source: https://arxiv.org/html/2602.18312

Markdown Content:
###### Abstract.

Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that are unachievable by humans or physical robots, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes a large change in actions over time. This term often requires substantial tuning effort. We propose to use the action Jacobian penalty, which penalizes changes in action with respect to changes in the simulated state directly through automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture called a Linear Policy Net (LPN) that significantly reduces the computational burden of calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, exhibits faster learning convergence compared to baseline methods, and can be queried more efficiently at inference time than a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, is able to learn policies that generate smooth signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.

Linear Control, Physics-based Character Animation, Legged Robots

CCS Concepts: Computing methodologies → Animation; Computing methodologies → Control methods; Computing methodologies → Reinforcement learning

![Image 1: Refer to caption](https://arxiv.org/html/2602.18312v1/x1.png)

Figure 1. Our system is able to learn smooth time-varying linear feedback policies for a simulated character to perform challenging motion skills. 

1. Introduction
---------------

Deep reinforcement learning (DRL) has proven effective in physics-based character animation and robotics. However, policies that naively optimize for task rewards often favor unrealistic, high-frequency control signals. Even when the policies are constrained to imitate high-quality motion data, these unrealistic behaviors still prevail. This problem can be attributed to the learned policies being overly sensitive to slight changes in input signals. Deploying policies that are sensitive in this way on robots is more problematic, because input signals come from noisy sensory measurements and the control bandwidth is limited by what the physical actuators can achieve. Prior work addressed this issue by applying heavy penalties on the rate of change of control signals and randomizing policy inputs to decrease sensitivity to variation. However, these methods are often task-dependent and require significant trial-and-error effort to tune.

More recent work proposes using Lipschitz-constrained policies to improve the smoothness of a learned control policy. In this method, a gradient penalty is imposed on the likelihood of a control action under the current policy (Chen et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib39 "Learning smooth humanoid locomotion through lipschitz-constrained policies")). While this method has been effective for locomotion tasks, it remains unclear how well it extends to more challenging scenarios. Furthermore, it relies on a large number of action samples to accurately estimate the true sensitivity of the actions with respect to the inputs, and it imposes significant computational overhead due to the additional backpropagation required for the gradient penalty.

We propose an action Jacobian penalty, which penalizes the norm of the Jacobian of the control action with respect to the state of the character. This penalty significantly reduces high-frequency signals in learned policies across a variety of motion imitation tasks. However, similar to the Lipschitz-constrained policies, computing the Jacobian of a policy parameterized by a fully connected feed forward (FF) neural network incurs significant computational overhead. To address this, we introduce Linear Policy Net (LPN), a new architecture for parameterizing control policies. Instead of directly outputting a control action, an LPN outputs a feedback control matrix using only task information. This feedback matrix then operates on the state of the character to produce the control action. Computing the Jacobian of the action with respect to the state, and computing the gradient of the Jacobian penalty with respect to the network parameters, is reduced to a simple forward and backward pass of the neural network. This approach effectively reduces the computation time required to impose the Jacobian penalty during training.

In DeepMimic-style tasks, learning an LPN is equivalent to learning a time-varying linear feedback policy. This parametrization significantly restricts the class of controllers the policy can learn, compared to a standard FF neural network. Our surprising finding is that this restriction does not negatively impact learning performance or the quality of the final policies. On the contrary, an LPN is comparable to, and in some cases even surpasses, its FF counterparts. We demonstrate the effectiveness of an LPN with a Jacobian penalty on various motion imitation tasks using a simulated human character, including challenging motions such as a backflip, a table tennis footwork drill and challenging parkour skills that require nontrivial interactions with the environment. Furthermore, we show that the learned time-varying linear feedback control policies extracted from an LPN are capable of controlling a physical quadrupedal robot with an arm to perform a combination of dynamic hopping and arm-swing motions.

2. Related Work
---------------

In this section, we review related work in physics-based character animation with deep reinforcement learning and methods to train smooth control policies. We also review related work in synthesizing linear feedback controllers for simulated characters and robots.

### 2.1. Deep Reinforcement Learning

DRL has become a popular framework for physics-based character animation. Realistic character animations are often synthesized through imitation of kinematic references from motion capture data, e.g., (Peng et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib9 "Deepmimic: example-guided deep reinforcement learning of physics-based character skills"); Zhang et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib18 "ADD: physics-based motion imitation with adversarial differential discriminators")). An advantage of DRL compared to other approaches is its scalability and versatility. Over the past decade, a wide range of specialized tasks have been tackled with DRL, including scene interaction (Yu et al., [2021](https://arxiv.org/html/2602.18312v1#bib.bib29 "Human dynamics from monocular video with dynamic camera movements")), object manipulation (Yu et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib26 "Skillmimic-v2: learning robust and generalizable interaction skills from sparse and noisy demonstrations"); Xu et al., [2025b](https://arxiv.org/html/2602.18312v1#bib.bib30 "Intermimic: towards universal whole-body control for physics-based human-object interactions")), multi-character interaction (Zhang et al., [2023](https://arxiv.org/html/2602.18312v1#bib.bib27 "Simulation and retargeting of complex multi-character interactions")), musical instrument performance (Wang et al., [2024b](https://arxiv.org/html/2602.18312v1#bib.bib28 "Fürelise: capturing and physically synthesizing hand motion of piano performance")) and sports (Wang et al., [2024a](https://arxiv.org/html/2602.18312v1#bib.bib31 "Strategy and skill learning for physics-based table tennis animation"); Kim et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib32 "PhysicsFC: learning user-controlled skills for a physics-based football player controller")). By scaling up the system, hours of diverse human motion capture data can be reproduced in simulation, e.g., (Won et al., [2020](https://arxiv.org/html/2602.18312v1#bib.bib20 "A scalable approach to control diverse behaviors for physically simulated characters"); Luo et al., [2023](https://arxiv.org/html/2602.18312v1#bib.bib19 "Perpetual humanoid control for real-time simulated avatars"); Tessler et al., [2024](https://arxiv.org/html/2602.18312v1#bib.bib21 "Maskedmimic: unified physics-based character control through masked motion inpainting")). The versatility of this framework also opens up applications such as human motion tracking in virtual reality (Lee et al., [2023](https://arxiv.org/html/2602.18312v1#bib.bib22 "Questenvsim: environment-aware simulated motion tracking from sparse sensors")) and sim-to-real humanoid control (Liao et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib23 "Beyondmimic: from motion tracking to versatile humanoid control via guided diffusion"); Ruben et al., [2024](https://arxiv.org/html/2602.18312v1#bib.bib24 "Design and control of a bipedal robotic character"); Luo et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib25 "Sonic: supersizing motion tracking for natural humanoid whole-body control")).

### 2.2. Learning Smooth Policies

Reinforcement learning tends to exploit high-frequency signals to achieve high task rewards, which can produce jittery motions that degrade motion quality (Xie et al., [2023](https://arxiv.org/html/2602.18312v1#bib.bib43 "Too stiff, too strong, too smart: evaluating fundamental problems with motion control policies")) and result in sim-to-real failure for robotics applications. High-frequency signals can be reduced by adding a filter to the action (Bergamin et al., [2019](https://arxiv.org/html/2602.18312v1#bib.bib41 "DReCon: data-driven responsive control of physics-based characters")); however, this significantly reduces responsiveness to perturbations and can impact performance on dynamic motions. Directly penalizing the rate of change of actions in the form of a reward signal has been the prevailing approach in sim-to-real RL for quadrupedal and humanoid robot control (Hwangbo et al., [2019](https://arxiv.org/html/2602.18312v1#bib.bib42 "Learning agile and dynamic motor skills for legged robots"); Miller et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib33 "High-performance reinforcement learning on spot: optimizing simulation parameters with distributional measures"); Liao et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib23 "Beyondmimic: from motion tracking to versatile humanoid control via guided diffusion")). This approach relies on random exploration of the policy during training to discover smooth behaviors and requires manual tuning to achieve a balance between task completion and behavior regularization. Chen et al. ([2025](https://arxiv.org/html/2602.18312v1#bib.bib39 "Learning smooth humanoid locomotion through lipschitz-constrained policies")) and Mysore et al. ([2021](https://arxiv.org/html/2602.18312v1#bib.bib40 "Regularizing action policies for smooth control with reinforcement learning")) directly penalize the change of action from the policy, via an approximation to the Jacobian of the action with respect to the policy input. While this solution requires less per-task tuning, it imposes significant computational costs, resulting in long training times and limiting it to simple control tasks like locomotion.

### 2.3. Linear Feedback Control

Before DRL became popular, control policies were often formulated as linear feedback controllers. Motion can be segmented into stages and a linear feedback controller can be designed for each stage to generate locomotion controllers (Yin et al., [2007](https://arxiv.org/html/2602.18312v1#bib.bib4 "Simbicon: simple biped locomotion control")) and various athletic behaviors (Hodgins et al., [1995](https://arxiv.org/html/2602.18312v1#bib.bib3 "Animating human athletics")). These segmented controllers require substantial human effort, as each motion needs custom-designed segmentations and input features.

Sampling-based strategies can be used to search for linear feedback controllers. Ding et al. ([2015](https://arxiv.org/html/2602.18312v1#bib.bib1 "Learning reduced-order feedback policies for motion skills")) learn reduced-order linear feedback controllers via an evolutionary algorithm by directly sampling the feedback matrix. Liu et al. ([2016](https://arxiv.org/html/2602.18312v1#bib.bib2 "Guided learning of control graphs for physics-based characters")) learn time-varying linear feedback controllers via iterative sampling and linear regression. These systems are able to learn robust linear feedback policies for dynamic motions such as a backflip and a cartwheel. However, these systems are usually complex, require manual definition of a state representation specific to a given motion, and have not been shown to generalize to motions that require more complex environment interactions such as vaulting and wall climbing.

Through linear matrix parameterization, reinforcement learning or evolutionary algorithms can also be used to train a linear feedback policy. This approach has been shown on various locomotion tasks. These policies either exploit unrealistic behaviors such as high frequency leg motions, e.g.,(Mania et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib7 "Simple random search of static linear policies is competitive for reinforcement learning")), or require carefully designed features, e.g.,(Krishna et al., [2021](https://arxiv.org/html/2602.18312v1#bib.bib8 "Learning linear policies for robust bipedal locomotion on terrains with varying slopes")).

Time varying linear feedback control policies can also be synthesized using model-based control approaches such as differential dynamic programming and its variants, e.g.,(Muico et al., [2009](https://arxiv.org/html/2602.18312v1#bib.bib5 "Contact-aware nonlinear control of dynamic characters"); Li and Todorov, [2004](https://arxiv.org/html/2602.18312v1#bib.bib6 "Iterative linear quadratic regulator design for nonlinear biological movement systems")). However, linear feedback controllers obtained via model-based approaches are often brittle, and often require online replanning using computationally expensive model predictive control approaches, e.g., (Eom et al., [2019](https://arxiv.org/html/2602.18312v1#bib.bib12 "Model predictive control with a visuomotor system for physics-based character animation")).

There is also work that applies linearization to the equation of motion. This method allows for an efficient MPC formulation because it only requires solving a small scale Quadratic Programming (QP) problem online. This approach has proven to be effective for quadrupedal and bipedal locomotion, e.g.,(Di Carlo et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib13 "Dynamic locomotion in the mit cheetah 3 through convex model-predictive control"); Bishop et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib14 "The surprising effectiveness of linear models for whole-body model-predictive control")). However, the resulting controllers are still nonlinear due to the presence of inequality constraints such as the friction cone constraints.

3. Problem Setup and System Overview
------------------------------------

Our problem formulation is similar to DeepMimic (Peng et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib9 "Deepmimic: example-guided deep reinforcement learning of physics-based character skills")), with a reference motion $\mathcal{M}=\{\hat{\bm{s}}_{1},\hat{\bm{s}}_{2},\dots,\hat{\bm{s}}_{T}\}$ provided as input, where $\hat{\bm{s}}_{t}\in\mathbb{R}^{n}$ specifies the full pose of the character at timestep $t$. A control policy is a map $\pi:\mathbb{R}^{n}\times\mathbb{R}^{n}\to\mathbb{R}^{m}$ where the character state $\bm{s}_{t}$ and the reference state $\hat{\bm{s}}_{t}$ are used to generate a control action via $\bm{a}_{t}=\pi(\bm{s}_{t},\hat{\bm{s}}_{t})$. The control action $\bm{a}_{t}$ is a target angle for each actuated joint of the character. A joint-level Proportional-Derivative (PD) controller is then used to actuate the character in a physics simulation. The control policy is trained with DRL to drive the character to imitate the reference. In our experiments, we use the same humanoid character as DeepMimic and a simulated quadrupedal robot, modeled on the Boston Dynamics Spot quadruped, with an arm attached.

Our reward setup is a simplified version of DeepMimic, where

$$r=0.3\,r_{\text{pos}}+0.3\,r_{\text{orientation}}+0.4\,r_{\text{joint}}$$

with

$$r_{\text{pos}}=\exp\left(-50\,\lVert\hat{\bm{x}}-\bm{x}\rVert^{2}\right),$$

$$r_{\text{orientation}}=\exp\left(-10\,\lVert\hat{\bm{ori}}\ominus\bm{ori}\rVert^{2}\right),$$

$$r_{\text{joint}}=\exp\left(-2\,\lVert\hat{\bm{j}}-\bm{j}\rVert^{2}\right),$$

where $\bm{x}$, $\bm{ori}$ and $\bm{j}$ are the root position, root orientation and joint angles of the character, respectively. Similar to DeepMimic, we also apply reference state initialization and early termination to improve the learning efficiency.
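As a rough illustration of the reward above, the following NumPy sketch evaluates the three terms for a single timestep. The function names, the use of unit quaternions for the root orientation, and the reading of $\ominus$ as the geodesic rotation angle between the two orientations are assumptions made for this example; the text does not specify them.

```python
import numpy as np

def quat_angle(q_ref, q):
    """Rotation angle in radians between two unit quaternions (assumed convention)."""
    dot = abs(float(np.dot(q_ref, q)))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def imitation_reward(x_ref, x, q_ref, q, j_ref, j):
    """r = 0.3 r_pos + 0.3 r_orientation + 0.4 r_joint, as defined above."""
    r_pos = np.exp(-50.0 * np.sum((x_ref - x) ** 2))
    r_ori = np.exp(-10.0 * quat_angle(q_ref, q) ** 2)
    r_joint = np.exp(-2.0 * np.sum((j_ref - j) ** 2))
    return 0.3 * r_pos + 0.3 * r_ori + 0.4 * r_joint

# Hypothetical values: perfect tracking gives the maximum reward of 1.
x = np.array([0.0, 0.0, 0.9])          # root position
q = np.array([1.0, 0.0, 0.0, 0.0])     # root orientation (identity quaternion)
j = np.zeros(8)                        # joint angles (size chosen arbitrarily)
r_perfect = imitation_reward(x, x, q, q, j, j)

# A 10 cm root position error alone scales r_pos by exp(-50 * 0.01) = exp(-0.5).
r_offset = imitation_reward(x + np.array([0.1, 0.0, 0.0]), x, q, q, j, j)
```

Because each term is an exponential of a negative squared error, every term lies in $(0, 1]$ and the weighted sum is at most $0.3+0.3+0.4=1$.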

We use MuJoCo (Todorov et al., [2012](https://arxiv.org/html/2602.18312v1#bib.bib10 "Mujoco: a physics engine for model-based control")) to simulate the characters. The simulation runs at $120\,\mathrm{Hz}$ while the policy updates the joint targets at $30\,\mathrm{Hz}$. A deep neural network is used to parameterize the policy. In this paper, we experiment with both the standard fully connected feedforward (FF) neural net and the Linear Policy Net (LPN) (described in Section [5](https://arxiv.org/html/2602.18312v1#S5 "5. Linear Policy Net ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty")). During training, the actions are sampled from a fixed Gaussian distribution whose mean is the output of the policy and whose covariance matrix is diagonal with diagonal elements $\delta^{2}=0.01$.
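The timing described above can be pictured as a nested loop: one policy query for every four simulation steps, with Gaussian exploration noise of standard deviation $\delta=0.1$ added to the policy mean during training. In the sketch below, the toy single-joint dynamics, the PD gains `KP` and `KD`, the PD law itself, and the placeholder `policy_mean` are hypothetical stand-ins for the simulator and the trained network.

```python
import numpy as np

SIM_HZ, POLICY_HZ = 120, 30
SUBSTEPS = SIM_HZ // POLICY_HZ      # 4 simulation steps per policy step
DT = 1.0 / SIM_HZ
SIGMA = np.sqrt(0.01)               # delta = 0.1, from delta^2 = 0.01
KP, KD = 50.0, 5.0                  # hypothetical PD gains

def policy_mean(q, qd, q_ref):
    """Placeholder policy: output the reference angle as the PD target."""
    return q_ref

rng = np.random.default_rng(0)
q, qd = 0.0, 0.0                    # toy unit-inertia joint: angle and velocity
sim_steps = 0
for t in range(POLICY_HZ):          # one second of control
    q_ref = 0.5                     # hypothetical constant reference angle
    # Exploration: sample the action around the policy mean.
    target = policy_mean(q, qd, q_ref) + SIGMA * rng.standard_normal()
    for _ in range(SUBSTEPS):       # PD target is held fixed between policy updates
        torque = KP * (target - q) - KD * qd
        qd += torque * DT           # semi-implicit Euler integration
        q += qd * DT
        sim_steps += 1
```

At deployment, the noise term would be dropped and the mean action used directly.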

We use Proximal Policy Optimization (PPO) (Schulman et al., [2017](https://arxiv.org/html/2602.18312v1#bib.bib11 "Proximal policy optimization algorithms")) to optimize the policies. At each iteration of the PPO algorithm, $50$ parallel simulation environments are used to collect $2500$ samples by using the control policy to interact with the simulations. PPO uses these samples to optimize a loss function $\mathcal{L}_{\text{PPO}}$ via gradient descent on the policy parameters. For all the experiments in this paper, we run PPO for a maximum of $5000$ iterations, which takes around $2.5$ hours on a workstation with $12$ CPU cores for running the simulations and an NVIDIA RTX A6000 for neural network inference and optimization. In most cases, training has already converged after $2000$ iterations, which takes about $1$ hour with $5$ million samples collected. This performance is comparable to a GPU-based simulation framework in terms of both training time and number of simulation samples required, e.g., (Zhang et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib18 "ADD: physics-based motion imitation with adversarial differential discriminators")).
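The sample counts and timings quoted above are mutually consistent, as the short arithmetic check below shows. It uses only the numbers in this section; the even split of samples across environments is an assumption.

```python
envs = 50
samples_per_iter = 2500
steps_per_env = samples_per_iter // envs           # 50 policy steps per environment (assumed even split)

samples_at_convergence = samples_per_iter * 2000   # 5,000,000 samples after 2000 iterations
samples_at_max = samples_per_iter * 5000           # 12,500,000 samples after 5000 iterations

seconds_per_iter = 2.5 * 3600 / 5000               # 1.8 s per PPO iteration
hours_to_convergence = seconds_per_iter * 2000 / 3600
```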

Additional reward terms are often added to the original DeepMimic loss to reduce motion artifacts such as motion jitteriness or high impact, especially for robotics applications (Liao et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib23 "Beyondmimic: from motion tracking to versatile humanoid control via guided diffusion"); Chen et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib39 "Learning smooth humanoid locomotion through lipschitz-constrained policies")). These additional terms increase the tuning effort and often require task-specific tuning. In this work, we propose to use a Linear Policy Net with an action Jacobian penalty as a regularization loss. This approach introduces smooth behaviors to the policies with minimal tuning and compute overhead while also maintaining learning efficiency. We now describe each component in more detail.

4. Action Jacobian Penalty
--------------------------

Reinforcement learning relies on injecting Gaussian noise into the policy output for exploration and data collection. This process creates unrealistic high-frequency signals that can achieve high rewards. Reinforcement learning then tends to fit the policies to these high-frequency signals, resulting in policies that produce undesired and unnatural jittery control signals. This problem is more pronounced in robotics applications, where real-world noise from sensor measurements and imperfect motor commands amplifies the high-frequency control signals until they are visible. Existing work relies on adding a term to the reward function that penalizes the difference between actions over consecutive time steps. However, the weight of this term is usually kept small relative to the other rewards to avoid reducing the effectiveness of the task reward. As a result, the regularization can become ineffective on challenging tasks: the reward terms are summed together, and the effect of the regularization must be discovered through the inherently noisy exploration procedure.

We propose directly regularizing the policy with an action Jacobian penalty. Instead of using it as a reward signal, we directly add a loss term to the PPO optimization:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{PPO}}+w_{\text{Jac}}\mathcal{L}_{\text{Jac}},$$

where $w_{\text{Jac}}$ is a tunable weighting factor, and $\mathcal{L}_{\text{Jac}}=\lVert\mathbf{J}\rVert^{2}$ is the square of the Frobenius norm of $\mathbf{J}$, with $\mathbf{J}$ the Jacobian of the policy:

$$\mathbf{J}=\begin{bmatrix}\frac{\partial a_{1}}{\partial s_{1}}&\cdots&\frac{\partial a_{1}}{\partial s_{n}}\\ \vdots&\ddots&\vdots\\ \frac{\partial a_{m}}{\partial s_{1}}&\cdots&\frac{\partial a_{m}}{\partial s_{n}}\end{bmatrix}.$$

We use $w_{\text{Jac}}=10$ across all our experiments.

The action Jacobian captures the sensitivity of the action generated by the policy with respect to changes in the character state. For $\mathbf{J}$ with a larger norm, a small variation in the character state will cause large changes in action. The resulting motion will exhibit high-frequency joint oscillation. By penalizing the magnitude of the action Jacobian, a policy will generate smoother control signals.
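To make the sensitivity argument concrete, a first-order expansion of the policy around a state $\bm{s}$ bounds the change in action by the Frobenius norm of the Jacobian. Here $\Delta\bm{s}$ and $\Delta\bm{a}$ denote a small state perturbation and the resulting action change; this notation is introduced only for the illustration.

```latex
\pi(\bm{s}+\Delta\bm{s},\hat{\bm{s}}) \approx \pi(\bm{s},\hat{\bm{s}}) + \mathbf{J}\,\Delta\bm{s},
\qquad
\lVert \Delta\bm{a} \rVert_{2} \approx \lVert \mathbf{J}\,\Delta\bm{s} \rVert_{2}
\le \lVert \mathbf{J} \rVert_{F}\,\lVert \Delta\bm{s} \rVert_{2}.
```

The inequality holds because the spectral norm of a matrix never exceeds its Frobenius norm, so shrinking $\lVert\mathbf{J}\rVert_{F}$ shrinks the worst-case action change for a given state perturbation.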

There are prior works that try to approximate $\mathbf{J}$ to encourage a smooth policy. Mysore et al. ([2021](https://arxiv.org/html/2602.18312v1#bib.bib40 "Regularizing action policies for smooth control with reinforcement learning")) use sampling strategies around the collected data point to approximate the Jacobian, which incurs significant computation costs to get an accurate approximation. Chen et al. ([2025](https://arxiv.org/html/2602.18312v1#bib.bib39 "Learning smooth humanoid locomotion through lipschitz-constrained policies")) penalize the gradient of the likelihood of a sampled action. This approach is equivalent to penalizing $\lVert\mathbf{J}^{T}(\bm{a}-\bm{a}_{\text{mean}})\rVert^{2}$, where $\bm{a}$ is the sampled action during exploration and $\bm{a}_{\text{mean}}$ is the mean of the Gaussian distribution the policy samples from. This penalty only optimizes for a specific direction of the Jacobian and requires more samples to optimize the full Jacobian. We propose directly optimizing for the norm of the full Jacobian, which can be achieved via auto-differentiation and backpropagation.
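The equivalence stated above can be checked with a short derivation, assuming the fixed isotropic Gaussian policy with covariance $\delta^{2}\mathbf{I}$ described in Section 3 and writing only the dependence on the character state:

```latex
\log \pi(\bm{a}\mid\bm{s})
  = -\frac{1}{2\delta^{2}}\,\lVert \bm{a}-\bm{a}_{\text{mean}}(\bm{s}) \rVert^{2} + \text{const}
\;\;\Longrightarrow\;\;
\nabla_{\bm{s}} \log \pi(\bm{a}\mid\bm{s})
  = \frac{1}{\delta^{2}}\,\mathbf{J}^{T}\bigl(\bm{a}-\bm{a}_{\text{mean}}\bigr),

\mathbb{E}_{\bm{a}}\Bigl[\lVert \mathbf{J}^{T}(\bm{a}-\bm{a}_{\text{mean}}) \rVert^{2}\Bigr]
  = \delta^{2}\,\operatorname{tr}\bigl(\mathbf{J}\mathbf{J}^{T}\bigr)
  = \delta^{2}\,\lVert \mathbf{J} \rVert_{F}^{2}.
```

Under this assumption, the gradient of the log-likelihood is the Jacobian transpose applied to one sampled noise direction, scaled by $1/\delta^{2}$. Only in expectation over many samples does the directional penalty become proportional to the full squared Frobenius norm, which is consistent with the remark that it needs more samples.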

While penalizing the norm of the action Jacobian is straightforward thanks to the autograd features of modern deep learning frameworks such as PyTorch (Paszke et al., [2019](https://arxiv.org/html/2602.18312v1#bib.bib45 "Pytorch: an imperative style, high-performance deep learning library")), it imposes significant computation overhead. When optimizing the action Jacobian penalty with a fully connected neural network, each iteration of PPO is about $1.5$ times slower in our implementation than when using only the PPO loss.
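For a feedforward policy, the Jacobian depends on the state and has to be recomputed for every sample, which is where the overhead comes from. The sketch below writes the chain rule out by hand in NumPy for a small two-layer network and checks it against finite differences; it is a stand-in for what an autodiff framework would compute, not the paper's implementation. The layer sizes, the tanh activation, and the omission of the reference-state input are assumptions made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, m = 6, 16, 3                       # hypothetical state, hidden, action sizes
W1, b1 = rng.standard_normal((h, n)) * 0.5, np.zeros(h)
W2, b2 = rng.standard_normal((m, h)) * 0.5, np.zeros(m)

def ff_policy(s):
    """a = W2 tanh(W1 s + b1) + b2"""
    return W2 @ np.tanh(W1 @ s + b1) + b2

def action_jacobian(s):
    """Chain rule: J = W2 diag(1 - tanh^2(W1 s + b1)) W1, shape (m, n)."""
    z = np.tanh(W1 @ s + b1)
    return W2 @ ((1.0 - z ** 2)[:, None] * W1)

def jacobian_penalty(s):
    """Squared Frobenius norm of the action Jacobian at state s."""
    return float(np.sum(action_jacobian(s) ** 2))

s = rng.standard_normal(n)
J = action_jacobian(s)

# Central finite differences as an independent check of the analytic Jacobian.
eps = 1e-6
J_fd = np.stack([(ff_policy(s + eps * e) - ff_policy(s - eps * e)) / (2 * eps)
                 for e in np.eye(n)], axis=1)

w_jac = 10.0                             # weight quoted in the text
loss_jac = w_jac * jacobian_penalty(s)   # term added to the PPO loss
```

Training additionally requires the gradient of this penalty with respect to the network weights, a second-order computation that autodiff handles but that adds to the per-iteration cost.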

5. Linear Policy Net
--------------------

We introduce a Linear Policy Net (LPN) to parameterize our policies. See Figure [2](https://arxiv.org/html/2602.18312v1#S5.F2 "Figure 2 ‣ 5. Linear Policy Net ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). The input to the policy is the tuple $\{\bm{s}_{t},\hat{\bm{s}}_{t}\}$ that represents the state of the simulated character and the desired state from the reference motion at time step $t$. The reference state $\hat{\bm{s}}_{t}$ is fed into a two-layer MLP, whose output is a feedback matrix $\bm{K}_{t}\in\mathbb{R}^{m\times n}$ and a feedforward action $\bm{k}_{t}\in\mathbb{R}^{m}$, where $n$ is the dimension of the state of the character and $m$ is the dimension of the control action. The control action $\bm{a}_{t}=\bm{K}_{t}\bm{s}_{t}+\bm{k}_{t}+\hat{\bm{a}}_{t}$ is then applied to advance the simulation, where $\hat{\bm{a}}_{t}$ is the reference joint angle corresponding to the actuated joints, extracted from the reference state. Note that $\bm{K}_{t}$ and $\bm{k}_{t}$ do not depend on the state of the character. In the case of a fixed sequence of reference motion, this approach is equivalent to learning a time-varying linear feedback control policy, where the feedback matrix and feedforward action only depend on time.
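A minimal NumPy sketch of the forward pass described above follows. The sizes, the random weights, the tanh activation, and the way the MLP output is split into $\bm{K}_{t}$ and $\bm{k}_{t}$ are assumptions for illustration; only the structure $\bm{a}_{t}=\bm{K}_{t}\bm{s}_{t}+\bm{k}_{t}+\hat{\bm{a}}_{t}$ comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, h = 6, 3, 32                 # hypothetical state, action, hidden sizes

# Two-layer MLP that maps the reference state to (K_t, k_t).
W1, b1 = rng.standard_normal((h, n)) * 0.1, np.zeros(h)
W2, b2 = rng.standard_normal((m * n + m, h)) * 0.1, np.zeros(m * n + m)

def lpn_gains(s_ref):
    """Return the feedback matrix K_t (m x n) and feedforward term k_t (m,)."""
    out = W2 @ np.tanh(W1 @ s_ref + b1) + b2
    K = out[: m * n].reshape(m, n)
    k = out[m * n:]
    return K, k

def lpn_action(s, s_ref, a_ref):
    """a_t = K_t s_t + k_t + a_hat_t, where K_t and k_t depend only on s_ref."""
    K, k = lpn_gains(s_ref)
    return K @ s + k + a_ref

s_ref = rng.standard_normal(n)     # reference state at time t
a_ref = rng.standard_normal(m)     # reference joint angles of the actuated joints
s_a, s_b = rng.standard_normal(n), rng.standard_normal(n)
K, k = lpn_gains(s_ref)

# For a fixed reference state, the action is affine in the character state.
delta_a = lpn_action(s_a, s_ref, a_ref) - lpn_action(s_b, s_ref, a_ref)
```

For a fixed reference state, the difference between two actions equals $\bm{K}_{t}$ applied to the difference between the two character states, which is what makes the policy a linear feedback law at each time step.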

There are several design choices to be made and we will discuss them in this section.

![Image 2: Refer to caption](https://arxiv.org/html/2602.18312v1/x2.png)

Figure 2. We introduce the Linear Policy Net. A fully connected multilayer perceptron (MLP) is used to generate a feedback matrix $\bm{K}_{t}$ and a feedforward action $\bm{k}_{t}$ from the reference state $\hat{\bm{s}}_{t}$. The final control action is generated by applying the linear feedback matrix to the simulation state and adding the feedforward terms.

### 5.1. Input Features

Prior work on learning linear feedback controllers often includes manually designed features, such as the end-effector positions, in the input (Ding et al., [2015](https://arxiv.org/html/2602.18312v1#bib.bib1 "Learning reduced-order feedback policies for motion skills"); Liu et al., [2016](https://arxiv.org/html/2602.18312v1#bib.bib2 "Guided learning of control graphs for physics-based characters")). DeepMimic similarly uses the maximal coordinates of the character, which include the positions and orientations of the body parts (Peng et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib9 "Deepmimic: example-guided deep reinforcement learning of physics-based character skills")). We choose to use the minimal coordinates of the character as our policy input, which include the displacement in position and orientation of the root relative to the desired configuration specified in the reference motion, the root linear and angular velocities, and the joint angles and joint velocities of the actuated joints. These features can also be easily obtained from a state estimator on a physical quadrupedal robot, making sim-to-real transfer straightforward.

Another option is to use both the reference state and character state as input to the MLP. While this makes the policy more expressive, the linear feedback matrices would then depend on the character state. We found this dependency unnecessary as the LPN can successfully learn time-varying linear feedback policies for dynamic motion in both simulation and on a physical robot without it.

### 5.2. Action Space

An added benefit of learning a linear feedback matrix is that we can update the feedback matrix and feedforward term $\bm{K}_{t},\bm{k}_{t}$ at a slower rate than $\bm{a}_{t}$. Such a hierarchy is already present in current DRL frameworks for legged robot control, where the policy updates the joint target angle at a slow rate of around $50\,\mathrm{Hz}$, while a joint-level PD controller runs at a much higher frequency of $500\,\mathrm{Hz}$ or more. Within our framework, we can run inference on the MLP at an even lower rate, down to $15\,\mathrm{Hz}$ for some tasks, while maintaining the update frequency of the joint PD target computation and joint-level PD control.
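One way to picture this two-rate scheme is the loop below: the gains are recomputed on every second control tick (15 Hz) and held in between, while the linear feedback law is evaluated on every tick (30 Hz). The function `gains_from_reference` is a hypothetical stand-in for the LPN's MLP, and the random states stand in for simulator output.

```python
import numpy as np

CONTROL_HZ, GAIN_HZ = 30, 15
HOLD = CONTROL_HZ // GAIN_HZ           # reuse each (K, k) for 2 control ticks
n, m = 6, 3                            # hypothetical state and action sizes
rng = np.random.default_rng(0)

def gains_from_reference(s_ref):
    """Hypothetical stand-in for the MLP: returns (K_t, k_t)."""
    return 0.1 * np.outer(np.ones(m), s_ref), np.zeros(m)

gain_updates = 0
actions = []
for tick in range(CONTROL_HZ):         # one second of control
    s_ref = rng.standard_normal(n)     # reference state for this tick
    a_ref = np.zeros(m)                # reference joint angles for this tick
    if tick % HOLD == 0:               # slow loop: refresh the gains
        K, k = gains_from_reference(s_ref)
        gain_updates += 1
    s = rng.standard_normal(n)         # fast loop: fresh character state every tick
    actions.append(K @ s + k + a_ref)  # cheap matrix-vector product
```

The fast loop only costs one matrix-vector product per tick, so the network evaluation can be scheduled less often without lowering the rate at which joint targets are produced.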

Note that because the learnable parameters in the system are in the MLP layer, we can potentially use the feedback terms $\bm{K}_{t},\bm{k}_{t}$ as our action space, and then apply the linear feedback within the simulation loop, hidden from the learning loop. However, this approach increases the action space dimension from $m$ to $(n+1)\times m$, making learning much harder.

### 5.3. Action Jacobian Penalty with Linear Policy Net

Because the control policy learned by an LPN is of the form $\bm{a}_{t}=\bm{K}_{t}\bm{s}_{t}+\bm{k}_{t}+\hat{\bm{a}}_{t}$, with $\bm{K}_{t}$ independent of $\bm{s}_{t}$, the action Jacobian at time step $t$ is equal to $\bm{K}_{t}$. Computing the action Jacobian penalty and its gradient with respect to the learnable parameters then becomes a simple forward pass and backward pass of the MLP layer in the LPN. This process incurs minimal additional computation cost because the forward and backward passes are already required for evaluating the PPO loss term.
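The claim that the Jacobian equals $\bm{K}_{t}$ can be checked numerically. In the sketch below, a random matrix stands in for the MLP output at one time step (an assumption for illustration), and the finite-difference Jacobian of the action with respect to the state is compared against it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3                                  # hypothetical state and action sizes
K_t = rng.standard_normal((m, n))            # stands in for the MLP output at time t
k_t = rng.standard_normal(m)
a_ref = rng.standard_normal(m)

def lpn_action(s):
    """a_t = K_t s_t + k_t + a_hat_t for fixed K_t, k_t and a_hat_t."""
    return K_t @ s + k_t + a_ref

# Finite-difference Jacobian of the action with respect to the state.
s = rng.standard_normal(n)
eps = 1e-6
J_fd = np.stack([(lpn_action(s + eps * e) - lpn_action(s - eps * e)) / (2 * eps)
                 for e in np.eye(n)], axis=1)

# The penalty needs no differentiation through the state: it is the squared norm of K_t.
loss_jac = float(np.sum(K_t ** 2))
```

Since the penalty is a plain function of the MLP output, its gradient with respect to the weights is an ordinary first-order backward pass, in contrast to the second-order computation needed for a feedforward policy.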

6. Evaluations
--------------

Table 1. Comparison of smoothness of policies trained with different methods, averaged over three runs with different random seeds. We highlight the best performance. The LPN has top-three performance in all the measures except for the motion jerk metric for the backflip task.

We evaluate our system on a wide range of physics-based character animation tasks and on a physical quadrupedal robot. In all the examples, the MLP layer in the LPN is a 2-layer fully connected neural network with hidden layer size $256$. For baseline methods, we use the 2-layer fully connected network (FF) that is typically used in these tasks, with hidden layer size $256$.

### 6.1. DeepMimic Task with Simulated Humanoid

We apply our method to a number of motion imitation tasks for a simulated humanoid. These tasks can be classified into four categories.

#### Locomotion Tasks

We apply our framework to train the humanoid to imitate reference motions for walking and running. We use the reference data from DeepMimic (Peng et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib9 "Deepmimic: example-guided deep reinforcement learning of physics-based character skills")).

#### Gymnastic Motion

To demonstrate that our system works on dynamic motions, we train the LPN with the action Jacobian penalty to imitate a range of dynamic gymnastic motions, including a backflip, a sideflip and a cartwheel. While a time-varying linear feedback policy has been shown to work with these motions previously (Liu et al., [2016](https://arxiv.org/html/2602.18312v1#bib.bib2 "Guided learning of control graphs for physics-based characters")), it requires human-designed input features that may not generalize to other motions. We demonstrate that simple input features work across all the motions we considered.

#### Imitating Single Sequence

We apply our framework to train the humanoid to imitate a 15-second motion capture clip of a table tennis footwork drill. The data is obtained from (Wang et al., [2024a](https://arxiv.org/html/2602.18312v1#bib.bib31 "Strategy and skill learning for physics-based table tennis animation")), and involves a dynamic whole-body motion where the subject rapidly hops sideways while performing a fast forehand drive. We also demonstrate tracking of a break dance motion, retargeted to the simulated character using motion capture data from (CMU Graphics Lab, [1999](https://arxiv.org/html/2602.18312v1#bib.bib44 "CMU graphics lab motion capture database")). This result demonstrates the generalization of our approach beyond cyclic motion.

#### Environment Interaction

We demonstrate that our system is also able to learn to interact with the environment with non-trivial contact, such as a parkour motion, using motion capture data from (Zhang et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib18 "ADD: physics-based motion imitation with adversarial differential discriminators"); Xu et al., [2025a](https://arxiv.org/html/2602.18312v1#bib.bib49 "Parc: physics-based augmentation with reinforcement learning for character controllers")). In particular, we train policies to execute a reverse vault sequence, a wall climbing sequence and a double kong sequence, which has been shown to be challenging to learn using DeepMimic (Zhang et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib18 "ADD: physics-based motion imitation with adversarial differential discriminators")). We also learn to imitate a soccer juggling sequence, following the setup from (Xie et al., [2022](https://arxiv.org/html/2602.18312v1#bib.bib48 "Learning soccer juggling skills with layer-wise mixture-of-experts")), to demonstrate interaction with a dynamic environment.

### 6.2. Comparison

We compare our system with four alternatives, all implemented with FF neural networks:

*   •Feedforward Neural Net with an action Jacobian Penalty, where we apply the action Jacobian penalty during training. This is labeled as FF + Jac Pen. 
*   •No regularization, where we directly optimize for $\mathcal{L}_{\text{PPO}}$ using the imitation reward. This is labeled as No Reg. 
*   •Action Change Penalty, where a reward $r_{\text{action}}=-w_{\text{action}}\lVert\bm{a}_{t}-\bm{a}_{t-1}\rVert^{2}$ is used to penalize the changes in action between two timesteps. The weight $w_{\text{action}}$ is a tunable parameter. We experiment with three weights: $w_{\text{action}}=0.01, 0.1, 1$. We label them as reward 0.01, reward 0.1 and reward 1. 
*   •Lipschitz Constraint Policy (Chen et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib39 "Learning smooth humanoid locomotion through lipschitz-constrained policies")), where instead of minimizing the norm of the action Jacobian, a penalty is applied to $\mathcal{L}_{\text{Lipschitz}}=\frac{\lVert\bm{a}-\bm{a}_{\text{mean}}\rVert^{2}}{ds}$, where $\bm{a}_{\text{mean}}$ is the mean of the Gaussian distribution the policy will sample from while $\bm{a}$ is the sampled action. This method is equivalent to minimizing $\lVert\bm{J}^{T}(\bm{a}-\bm{a}_{\text{mean}})\rVert^{2}$. It does not require computation of the whole Jacobian matrix but will only optimize in the direction of the sampled $\bm{a}$. We use the same weighting factor as the action Jacobian penalty and optimize for $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{PPO}}+10\mathcal{L}_{\text{Lipschitz}}$. This is labeled as Lipschitz. 
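
To make the difference between these regularizers concrete, the following minimal NumPy sketch evaluates them on a toy Jacobian. It is an illustration only, not the training code: the function names are hypothetical, the Jacobian is given as a fixed matrix rather than obtained by automatic differentiation, and the use of the squared Frobenius norm for the full Jacobian penalty is an assumption.

```python
import numpy as np


def action_change_reward(a_t, a_prev, w_action):
    """Reward term r_action = -w_action * ||a_t - a_{t-1}||^2."""
    diff = np.asarray(a_t, dtype=float) - np.asarray(a_prev, dtype=float)
    return -w_action * float(diff @ diff)


def jacobian_penalty(J):
    """Squared Frobenius norm of the action Jacobian J = da/ds (assumed norm)."""
    J = np.asarray(J, dtype=float)
    return float(np.sum(J ** 2))


def directional_penalty(J, a, a_mean):
    """||J^T (a - a_mean)||^2: penalizes J only along the sampled action direction."""
    J = np.asarray(J, dtype=float)
    d = np.asarray(a, dtype=float) - np.asarray(a_mean, dtype=float)
    v = J.T @ d
    return float(v @ v)


# Toy example: 2 action dimensions (rows), 3 state dimensions (columns).
J = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
a_mean = np.array([0.0, 0.0])
a = np.array([0.1, 0.0])  # the sample deviates from the mean only in the first action

print(action_change_reward([0.2, 0.1], [0.0, 0.1], w_action=0.1))
print(jacobian_penalty(J))                # every entry of J contributes
print(directional_penalty(J, a, a_mean))  # the second row of J does not contribute
```

In this toy case the large entry in the second row of the Jacobian is invisible to the directional penalty, because the sampled action does not deviate from the mean along that action dimension.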

We compare the learning performance on walking, a backflip and the table tennis footwork drill, taking the average over three training runs with different random seeds. The learning curves are shown in Fig.[3](https://arxiv.org/html/2602.18312v1#S7.F3 "Figure 3 ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). The reward is computed only with the imitation reward. Applying a large action change penalty (reward 1) slows down task learning but results in better motion imitation performance, except for the backflip, where a large action change penalty causes learning to fail. FF with an action Jacobian penalty takes almost twice as many learning iterations to converge compared to other methods, and each iteration is about 1.5 times slower due to the Jacobian penalty computation. The Lipschitz constraint policy shows fast convergence in walking and backflip tasks, but converges slowly for the footwork task. Furthermore, it fails to produce a smooth policy compared to the alternatives, as we will show later. The LPN with the Jacobian penalty has the fastest learning convergence across all tasks with minimal computation overhead per iteration.

To quantify the smoothness of a policy, we collect data by rolling out the policy in simulation. For walking and the backflip, we run the policy for five motion cycles, and for the table tennis footwork drill, we run the policy over the complete sequence. We quantify the smoothness of each policy using the following metrics:

*   •Action Smoothness: This metric evaluates the average action change of a policy: $\frac{\sum^{T+1}_{t=1}\lVert\bm{a}_{t}-\bm{a}_{t-1}\rVert^{2}}{T}$. 
*   •High Frequency Ratio: We compute the Fourier transform of the action output over time. Since humans typically have a control bandwidth of about 10 Hz (Hogan, [2022](https://arxiv.org/html/2602.18312v1#bib.bib46 "Contact and physical interaction")), we consider signal content higher than 10 Hz to be unnatural. We use the proportion of the energy above 10 Hz with respect to the total energy to characterize the smoothness of the signal. 
*   •Motion Jerk: We compute the jerk metric of the motion, following (Rohrer et al., [2002](https://arxiv.org/html/2602.18312v1#bib.bib47 "Movement smoothness changes during stroke recovery")). From the joint acceleration signal sampled at 120 Hz, we use finite differencing to compute the joint jerk. We then evaluate the jerk metric by dividing the mean jerk magnitude by the peak speed, averaged over the 28 joints on the character. A lower value in the jerk metric corresponds to a smoother motion. 
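
The three metrics above can be sketched in a few lines of NumPy. This is a hedged illustration rather than the evaluation code used for the table: the function names are hypothetical, the action smoothness is normalized by the number of consecutive differences (an assumption about the indexing), the spectral energy includes the DC component and is summed over action dimensions, and the action sample rate is a parameter supplied by the caller.

```python
import numpy as np


def action_smoothness(actions):
    """Mean squared change between consecutive actions; actions has shape (N, D)."""
    diffs = np.diff(np.asarray(actions, dtype=float), axis=0)
    return float(np.sum(diffs ** 2) / diffs.shape[0])


def high_frequency_ratio(actions, sample_rate_hz, cutoff_hz=10.0):
    """Fraction of spectral energy strictly above cutoff_hz, summed over action dims."""
    actions = np.asarray(actions, dtype=float)
    energy = np.abs(np.fft.rfft(actions, axis=0)) ** 2
    freqs = np.fft.rfftfreq(actions.shape[0], d=1.0 / sample_rate_hz)
    total = energy.sum()
    return float(energy[freqs > cutoff_hz].sum() / total) if total > 0 else 0.0


def jerk_metric(joint_acc, joint_vel, sample_rate_hz=120.0):
    """Mean |jerk| divided by peak speed per joint, averaged over joints.

    Assumes every joint moves at some point, so that its peak speed is nonzero.
    """
    jerk = np.diff(np.asarray(joint_acc, dtype=float), axis=0) * sample_rate_hz
    mean_jerk = np.mean(np.abs(jerk), axis=0)
    peak_speed = np.max(np.abs(np.asarray(joint_vel, dtype=float)), axis=0)
    return float(np.mean(mean_jerk / peak_speed))


# A 2 Hz and a 12 Hz sine sampled at 30 Hz for 10 seconds (one action dimension each).
fs = 30.0
t = np.arange(300) / fs
slow = np.sin(2 * np.pi * 2.0 * t)[:, None]
fast = np.sin(2 * np.pi * 12.0 * t)[:, None]
print(high_frequency_ratio(slow, fs), high_frequency_ratio(fast, fs))
print(action_smoothness(slow), action_smoothness(fast))
```

The slow signal keeps almost all of its energy below the 10 Hz cutoff, while the fast signal places almost all of it above the cutoff, which is the behavior the High Frequency Ratio is meant to flag.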

The results are recorded in Table[1](https://arxiv.org/html/2602.18312v1#S6.T1 "Table 1 ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). Heavily penalizing the action change in the reward (reward 1) can produce smooth policies. However, it fails to learn dynamic tasks like a backflip. The reward 0.1 setting is also able to learn smooth behaviors, with a reduced learning convergence rate, especially for the backflip task. Task-specific reward tuning is necessary to obtain the best performance with the action change penalty. The Lipschitz constraint policies consistently fail to achieve smooth policies compared to the other methods. FF policies with an action Jacobian penalty are able to learn smooth policies that are competitive with the methods that use a reward that heavily penalizes action change, but they are slow to train. A LPN is able to learn smooth policies while maintaining a fast learning convergence rate. Note that in the backflip task, the smoothness metric for a LPN is worse than that of the feedforward neural network policies with a Jacobian penalty or an appropriate reward setup. We conjecture that the backflip is a challenging motion for a time-varying linear feedback control policy, requiring the LPN to produce higher-frequency actions.

We also plot the action of the pitch joint on the left ankle for the backflip and table tennis footwork in Fig.[4](https://arxiv.org/html/2602.18312v1#S7.F4 "Figure 4 ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty") to provide a visual demonstration of the importance of regularizing the control action, either via an action change penalty as a reward signal or using our action Jacobian penalty. Without action regularization, policies tend to generate actions that change rapidly, resulting in jittery motions, potentially causing the feet to chatter on the ground.

### 6.3. Linear Policy Net Evaluations

#### Reduced-Order Linear Feedback Policy

Inspired by (Ding et al., [2015](https://arxiv.org/html/2602.18312v1#bib.bib1 "Learning reduced-order feedback policies for motion skills")), we experiment with how to obtain a reduced-order linear policy by using low-rank linear feedback matrices. We perform singular value decomposition on the learned feedback matrices to compute their low-rank approximation. Specifically, for a feedback matrix $\bm{K}_{t}$, its singular value decomposition is in the form of $\bm{K}_{t}=\bm{U}\bm{\Sigma}\bm{V}^{T}$, with $\bm{U}$ and $\bm{V}$ being orthogonal matrices that form bases for the action space and state space of the character respectively, and $\bm{\Sigma}$ is a diagonal matrix, whose diagonal elements quantify the importance of the corresponding dimension in those bases. A best rank-$k$ approximation of the feedback matrix can be obtained via $\bm{K}_{k,t}=\bm{U}_{k}\bm{\Sigma}_{k}\bm{V}_{k}^{T}=\sum_{i=1}^{k}\sigma_{i}\bm{u}_{i}\bm{v}_{i}^{T}$, where $\bm{u}_{i}$ and $\bm{v}_{i}$ are the $i$th columns of $\bm{U}$ and $\bm{V}$ respectively.
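
The truncation described above can be sketched with NumPy's SVD routine. The matrix below is random and its 60 input features are a made-up size used only for illustration; the helper name `low_rank_approximation` is hypothetical.

```python
import numpy as np


def low_rank_approximation(K, k):
    """Best rank-k approximation of K (Eckart-Young) via a truncated SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(K, dtype=float), full_matrices=False)
    # Scale the first k left singular vectors by their singular values,
    # then project back with the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


rng = np.random.default_rng(0)
K_t = rng.standard_normal((28, 60))  # 28 actions; 60 input features is a placeholder size
K_14 = low_rank_approximation(K_t, 14)

singular_values = np.linalg.svd(K_t, compute_uv=False)
print(np.linalg.matrix_rank(K_14))
# The Frobenius error equals the root sum of squares of the discarded singular values.
print(np.linalg.norm(K_t - K_14), np.sqrt(np.sum(singular_values[14:] ** 2)))
```

Inspecting how quickly the singular values decay gives a rough indication of how far the rank can be lowered before the approximation error becomes large, although the effect on motion quality still has to be checked in simulation.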

Our action space is 28-dimensional, corresponding to the 28 joints on the character. We find that we can lower the rank of the feedback matrices learned by a LPN while still maintaining performance. For example, a sequence of rank-14 feedback matrices is able to retain the performance of a walking policy. We can further reduce the rank of these matrices down to two. While they can still command the character to walk, the motion quality degrades.

We can also find low-rank approximations of the policies for other motions. For example, the rank of a backflip policy can be reduced to 20, the rank of a cartwheel policy can be reduced to 22, and the rank of a table tennis footwork drill policy can be reduced to 18.

#### Terrain Adaptation

We adapt our backflip policy and cartwheel policy to handle uneven terrain by finetuning them on a sinusoidal terrain. Even though these policies do not perceive the terrain, the underlying linear feedback structure is able to handle the perturbation. This experiment demonstrates the robustness of the time-varying linear feedback policies learned by a LPN.

#### Linear Feedback Update Rate

The LPN is trained to update the linear feedback matrix and the feedforward action at a rate of 30 Hz. We experiment with how much we can slow down the update rate while still maintaining good motion tracking performance. We find that we can update the feedback matrix at 10 Hz with a walking policy, while policies for other motions fail when we try to lower the update rate below 30 Hz. In (Liu et al., [2016](https://arxiv.org/html/2602.18312v1#bib.bib2 "Guided learning of control graphs for physics-based characters")), the linear feedback control policies can be queried at a slow rate of 10 Hz, even for highly dynamic motions such as a backflip. It will be interesting to figure out how to train a LPN to operate at this slower rate.

#### Policy Distillation and Transitions between Skills

We can train policies for different motions and distill them into a single policy, following the procedure in (Truong et al., [2024](https://arxiv.org/html/2602.18312v1#bib.bib50 "Pdp: physics-based character animation via diffusion policy")). The resulting policy can track different reference motions in sequence. In particular, we train a LPN policy to track jumping, a sideflip and a backflip via distillation. The resulting policy can then execute jumping, a sideflip and a backflip in sequence. We also train a FF policy via the same distillation procedure, but it fails to perform the agile transition between a sideflip and a backflip.

### 6.4. Sim-to-real on a Quadrupedal Robot

We demonstrate that the time-varying linear feedback policies trained with our framework can be readily applied to a physical robot. We use a quadrupedal robot Spot with an arm attached to the body as our platform. To reduce the sim-to-real gap, we implement the actuator model of the leg motors that correspond to the physical actuator model specified by the manufacturer, following(Miller et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib33 "High-performance reinforcement learning on spot: optimizing simulation parameters with distributional measures")). The results are shown in the supplementary video.

To train a locomotion policy, a reference motion is generated for a pacing gait on Spot using a handcrafted sinusoid for the joints on the legs, with a gait cycle of 0.6 seconds, similar to (Xie et al., [2021](https://arxiv.org/html/2602.18312v1#bib.bib34 "Dynamics randomization revisited: a case study for quadrupedal locomotion")). During training, the joints on the arm are set to a random target sampled around the current configuration every 0.5 seconds. The result is a time-varying linear feedback policy that can maintain a stable pacing motion while executing fast arm movements. Instead of querying the LPN during run time, we precompute a sequence of linear matrices offline and apply them in sequence. We update the linear feedback matrices at 15 Hz, and compute the joint target angles for the PD controller using linear feedback at 30 Hz. This significantly reduces the computational load compared to a FF policy, which requires neural network inference at 50 Hz to maintain stable motions (Miller et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib33 "High-performance reinforcement learning on spot: optimizing simulation parameters with distributional measures")).
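
The run-time scheme described above, where precomputed matrices are held at a slower rate than the control loop, can be sketched as follows. This is a hypothetical illustration, not the robot code: the class name, the affine form `k_i + K_i @ features`, the cyclic playback, and all array sizes are assumptions.

```python
import numpy as np


class GainSchedulePlayer:
    """Plays back precomputed (K_i, k_i) pairs with a zero-order hold."""

    def __init__(self, feedback_matrices, feedforward_actions, control_hz=30, update_hz=15):
        if control_hz % update_hz != 0:
            raise ValueError("control_hz must be an integer multiple of update_hz")
        self.K = np.asarray(feedback_matrices, dtype=float)    # (N, action_dim, feature_dim)
        self.k = np.asarray(feedforward_actions, dtype=float)  # (N, action_dim)
        self.ticks_per_update = control_hz // update_hz
        self.tick = 0

    def step(self, features):
        """Return joint targets for the current control tick and advance time."""
        # Each pair is held for ticks_per_update control ticks; the schedule wraps around.
        i = (self.tick // self.ticks_per_update) % len(self.K)
        self.tick += 1
        return self.k[i] + self.K[i] @ np.asarray(features, dtype=float)


# Placeholder schedule: 9 entries (a 0.6 s cycle at 15 Hz), 12 joint targets, 20 features.
rng = np.random.default_rng(0)
player = GainSchedulePlayer(0.01 * rng.standard_normal((9, 12, 20)),
                            rng.standard_normal((9, 12)))
targets = player.step(np.zeros(20))
print(targets.shape)
```

Each control tick then costs one matrix-vector product and one vector addition, which is the source of the reduced run-time load relative to a neural network forward pass.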

To demonstrate a combination of an agile arm movement and an agile lower body movement, we generate a hopping motion via trajectory optimization of a single rigid body model(Di Carlo et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib13 "Dynamic locomotion in the mit cheetah 3 through convex model-predictive control")), and an agile table tennis stroke motion for the arm using a kinematic MPC planner(Nguyen et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib35 "Whole body model predictive control for spin-aware quadrupedal table tennis")). After synchronizing these motions, we create a reference motion for Spot that is similar to the table tennis footwork practice drill of a human player. Again, an LPN is able to learn a time-varying linear feedback policy to execute the motion.

7. Conclusions and Discussion
-----------------------------

In this paper, we present a framework to efficiently train smooth time-varying linear control policies for motion imitation tasks for a simulated character and a physical quadrupedal robot.

A surprising finding is that such a time-varying linear policy exists for a wide range of motions without needing special feature engineering. Another way to synthesize a time-varying linear feedback policy is via model-based control such as differential dynamic programming (DDP), although matrices obtained via DDP are often less robust. It would be interesting to explore how to combine both approaches, so that we can gain the benefit of sample efficiency from model-based approaches while retaining the robustness afforded by the DRL framework. For example, we could warm start the feedback matrices with solutions from the DDP method, e.g., (Levine and Koltun, [2013](https://arxiv.org/html/2602.18312v1#bib.bib37 "Guided policy search")). Given that our learned policies are in the simple form of linear matrices, the policies can potentially be more easily explained than a typical black-box feedforward neural network policy. For example, it is possible that there exists a DDP formulation that, with the appropriate costs and transition dynamics, can reproduce the same feedback matrices learned by our system. Inverse optimal control techniques can potentially be used to search for this formulation, e.g., using (Amos et al., [2018](https://arxiv.org/html/2602.18312v1#bib.bib36 "Differentiable mpc for end-to-end planning and control")).

While the action Jacobian penalty produces smooth policies for many motion imitation tasks, it only considers the derivatives with respect to state but not time. For dynamic motions such as a backflip, where the state of the character has to change rapidly, penalizing the action Jacobian alone is not guaranteed to reduce the changes of action in time. This observation may also explain why using the action Jacobian penalty is less effective for the backflip in our experiment.
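
One way to make this limitation explicit is to apply the chain rule to the action along a trajectory. The short derivation below is a sketch under the assumption that the policy can be written as a differentiable function $\pi(\bm{s}, t)$ of the state and of time (or motion phase); this notation is introduced here only for illustration.

```latex
\begin{aligned}
\bm{a}(t) &= \pi\bigl(\bm{s}(t), t\bigr), \\
\frac{d\bm{a}}{dt} &= \bm{J}\,\dot{\bm{s}} + \frac{\partial \pi}{\partial t},
\qquad \bm{J} = \frac{\partial \pi}{\partial \bm{s}}, \\
\left\lVert \frac{d\bm{a}}{dt} \right\rVert
&\le \lVert \bm{J} \rVert \, \lVert \dot{\bm{s}} \rVert
 + \left\lVert \frac{\partial \pi}{\partial t} \right\rVert .
\end{aligned}
```

Penalizing $\lVert\bm{J}\rVert$ shrinks only the first term of the bound, and that term is still scaled by the state velocity $\lVert\dot{\bm{s}}\rVert$, which is large during a backflip. The explicit time dependence in the second term is not affected by the penalty at all.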

As future work, it will be interesting to explore how we can learn piecewise linear policies in the state space. For example, we can segment the state space into regions where there exists a linear control policy for each region. Neural network policies that use ReLU as an activation function can already be used to generate such regions based on the activation patterns (Tjeng et al., [2017](https://arxiv.org/html/2602.18312v1#bib.bib51 "Evaluating robustness of neural networks with mixed integer programming"); Cohan et al., [2022](https://arxiv.org/html/2602.18312v1#bib.bib52 "Understanding the evolution of linear regions in deep reinforcement learning")). However, without regularization, such regions are usually too fine-grained. Learning linear feedback control policies with large regions of attraction, which would allow us to divide the state space into manageable pieces, might further improve policy robustness and explainability.
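
The observation that a ReLU network is linear within each activation pattern can be illustrated with a small NumPy sketch. The network sizes and random weights below are placeholders, and the helper names are hypothetical; the sketch only shows how the local affine map of a ReLU MLP can be read off from its activation pattern.

```python
import numpy as np


def relu_mlp(x, weights, biases):
    """MLP forward pass with ReLU on hidden layers and a linear output layer."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]


def local_affine_map(x, weights, biases):
    """Return (A, c, pattern) with relu_mlp(y) == A @ y + c for every y that
    produces the same activation pattern as x."""
    h = np.asarray(x, dtype=float)
    A = np.eye(h.shape[0])
    c = np.zeros(h.shape[0])
    pattern = []
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ h + b
        active = pre > 0.0
        pattern.append(active)
        # Inactive units output zero, so their rows of the composed map are zeroed.
        A = active[:, None] * (W @ A)
        c = active * (W @ c + b)
        h = np.maximum(pre, 0.0)
    return weights[-1] @ A, weights[-1] @ c + biases[-1], pattern


rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # placeholder layer sizes
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(4)
A, c, pattern = local_affine_map(x, weights, biases)
print(np.allclose(A @ x + c, relu_mlp(x, weights, biases)))
```

Within one such region the matrix `A` plays the role of a local feedback matrix, and the number of distinct activation patterns visited along a trajectory gives a rough sense of how fine-grained the regions are.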

We focus on the imitation of relatively short motion segments compared to current DRL systems that can scale to imitating long motion sequences. It is conceivable that by scaling the system to imitate more and longer motions, one may find a one-to-one correspondence between feasible motion capture data and the set of feedback matrices. Such correspondence can enable us to learn a policy generator, where a generative model such as a diffusion model(Rocca et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib38 "Policy-space diffusion for physics-based character animation")) can be used to generate the linear feedback policies.

While we demonstrate skill composition via policy distillation, we are not yet able to transition between arbitrary skills. Scaling the system to a large motion dataset and building a control graph(Liu et al., [2016](https://arxiv.org/html/2602.18312v1#bib.bib2 "Guided learning of control graphs for physics-based characters")) can potentially automate more diverse transitions.

Our formulation limits our use case to DeepMimic style motion imitation tasks. Expanding the formulation to other tasks, e.g., using adversarial motion imitation(Zhang et al., [2025](https://arxiv.org/html/2602.18312v1#bib.bib18 "ADD: physics-based motion imitation with adversarial differential discriminators")), or tasks where motion capture data is not available will be an interesting direction.

References
----------

*   B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter (2018)Differentiable mpc for end-to-end planning and control. Advances in Neural Information Processing Systems 31. Cited by: [§7](https://arxiv.org/html/2602.18312v1#S7.p2.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   K. Bergamin, S. Clavet, D. Holden, and J. R. Forbes (2019)DReCon: data-driven responsive control of physics-based characters. ACM Transactions On Graphics (TOG)38 (6),  pp.1–11. Cited by: [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   A. L. Bishop, J. Alvarez-Padilla, S. Schoedel, I. S. Sow, J. Chandrachud, S. Sharma, W. Kraus, B. Park, R. J. Griffin, J. M. Dolan, et al. (2025)The surprising effectiveness of linear models for whole-body model-predictive control. In 2025 IEEE-RAS 24th International Conference on Humanoid Robots (Humanoids),  pp.1–7. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p5.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Z. Chen, X. He, Y. Wang, Q. Liao, Y. Ze, Z. Li, S. S. Sastry, J. Wu, K. Sreenath, S. Gupta, et al. (2025)Learning smooth humanoid locomotion through lipschitz-constrained policies. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),  pp.4743–4750. Cited by: [§1](https://arxiv.org/html/2602.18312v1#S1.p2.1 "1. Introduction ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§3](https://arxiv.org/html/2602.18312v1#S3.p7.1 "3. Problem Setup and System Overview ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§4](https://arxiv.org/html/2602.18312v1#S4.p5.4 "4. Action Jacobian Penalty ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [4th item](https://arxiv.org/html/2602.18312v1#S6.I1.i4.p1.6.1 "In 6.2. Comparison ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   CMU Graphics Lab (1999)CMU graphics lab motion capture database. Cited by: [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px3.p1.1 "Imitating Single Sequence ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   S. Cohan, N. H. Kim, D. Rolnick, and M. van de Panne (2022)Understanding the evolution of linear regions in deep reinforcement learning. Advances in Neural Information Processing Systems 35,  pp.10891–10903. Cited by: [§7](https://arxiv.org/html/2602.18312v1#S7.p4.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   J. Di Carlo, P. M. Wensing, B. Katz, G. Bledt, and S. Kim (2018)Dynamic locomotion in the mit cheetah 3 through convex model-predictive control. In 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS),  pp.1–9. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p5.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.4](https://arxiv.org/html/2602.18312v1#S6.SS4.p3.1 "6.4. Sim-to-real on a Quadrupedal Robot ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   K. Ding, L. Liu, M. Van de Panne, and K. Yin (2015)Learning reduced-order feedback policies for motion skills. In Proceedings of the 14th ACM SIGGRAPH/Eurographics Symposium on Computer Animation,  pp.83–92. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p2.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§5.1](https://arxiv.org/html/2602.18312v1#S5.SS1.p1.1 "5.1. Input Features ‣ 5. Linear Policy Net ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.3](https://arxiv.org/html/2602.18312v1#S6.SS3.SSS0.Px1.p1.12 "Reduced-Order Linear Feedback Policy ‣ 6.3. Linear Policy Net Evaluations ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   H. Eom, D. Han, J. S. Shin, and J. Noh (2019)Model predictive control with a visuomotor system for physics-based character animation. ACM Transactions on Graphics (TOG)39 (1),  pp.1–11. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p4.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   J. K. Hodgins, W. L. Wooten, D. C. Brogan, and J. F. O’Brien (1995)Animating human athletics. In Proceedings of the 22nd Annual Conference on Computer graphics and Interactive Techniques,  pp.71–78. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p1.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   N. Hogan (2022)Contact and physical interaction. Annual Review of Control, Robotics, and Autonomous Systems 5 (1),  pp.179–203. Cited by: [2nd item](https://arxiv.org/html/2602.18312v1#S6.I2.i2.p1.3 "In 6.2. Comparison ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter (2019)Learning agile and dynamic motor skills for legged robots. Science Robotics 4 (26),  pp.eaau5872. Cited by: [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   M. Kim, E. Jung, and Y. Lee (2025)PhysicsFC: learning user-controlled skills for a physics-based football player controller. ACM Transactions on Graphics (TOG)44 (4),  pp.1–21. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   L. Krishna, U. A. Mishra, G. A. Castillo, A. Hereid, and S. Kolathaya (2021)Learning linear policies for robust bipedal locomotion on terrains with varying slopes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),  pp.5159–5164. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p3.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   S. Lee, S. Starke, Y. Ye, J. Won, and A. Winkler (2023)Questenvsim: environment-aware simulated motion tracking from sparse sensors. In ACM SIGGRAPH 2023 Conference Proceedings,  pp.1–9. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   S. Levine and V. Koltun (2013)Guided policy search. In International Conference on Machine Learning,  pp.1–9. Cited by: [§7](https://arxiv.org/html/2602.18312v1#S7.p2.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   W. Li and E. Todorov (2004)Iterative linear quadratic regulator design for nonlinear biological movement systems. In First International Conference on Informatics in Control, Automation and Robotics, Vol. 2,  pp.222–229. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p4.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Q. Liao, T. E. Truong, X. Huang, G. Tevet, K. Sreenath, and C. K. Liu (2025)Beyondmimic: from motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§3](https://arxiv.org/html/2602.18312v1#S3.p7.1 "3. Problem Setup and System Overview ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   L. Liu, M. V. D. Panne, and K. Yin (2016)Guided learning of control graphs for physics-based characters. ACM Transactions on Graphics (TOG)35 (3),  pp.1–14. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p2.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§5.1](https://arxiv.org/html/2602.18312v1#S5.SS1.p1.1 "5.1. Input Features ‣ 5. Linear Policy Net ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px2.p1.1 "Gymnastic Motion ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.3](https://arxiv.org/html/2602.18312v1#S6.SS3.SSS0.Px3.p1.4 "Linear Feedback Update Rate ‣ 6.3. Linear Policy Net Evaluations ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§7](https://arxiv.org/html/2602.18312v1#S7.p6.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Z. Luo, J. Cao, K. Kitani, W. Xu, et al. (2023)Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision,  pp.10895–10904. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castañeda, Z. Cao, J. Li, D. Minor, Q. Ben, et al. (2025)Sonic: supersizing motion tracking for natural humanoid whole-body control. arXiv preprint arXiv:2511.07820. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   H. Mania, A. Guy, and B. Recht (2018)Simple random search of static linear policies is competitive for reinforcement learning. Advances in neural information processing systems 31. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p3.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   A. J. Miller, F. Yu, M. Brauckmann, and F. Farshidian (2025)High-performance reinforcement learning on spot: optimizing simulation parameters with distributional measures. In 2025 IEEE International Conference on Robotics and Automation (ICRA),  pp.9981–9988. Cited by: [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.4](https://arxiv.org/html/2602.18312v1#S6.SS4.p1.1 "6.4. Sim-to-real on a Quadrupedal Robot ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.4](https://arxiv.org/html/2602.18312v1#S6.SS4.p2.4 "6.4. Sim-to-real on a Quadrupedal Robot ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   U. Muico, Y. Lee, J. Popović, and Z. Popović (2009)Contact-aware nonlinear control of dynamic characters. In ACM SIGGRAPH 2009 papers,  pp.1–9. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p4.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   S. Mysore, B. Mabsout, R. Mancuso, and K. Saenko (2021)Regularizing action policies for smooth control with reinforcement learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA),  pp.1810–1816. Cited by: [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§4](https://arxiv.org/html/2602.18312v1#S4.p5.4 "4. Action Jacobian Penalty ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   D. Nguyen, Z. Zaidi, K. Karol, J. Hodgins, and Z. Xie (2025)Whole body model predictive control for spin-aware quadrupedal table tennis. arXiv preprint arXiv:2510.08754. Cited by: [§6.4](https://arxiv.org/html/2602.18312v1#S6.SS4.p3.1 "6.4. Sim-to-real on a Quadrupedal Robot ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019)Pytorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32. Cited by: [§4](https://arxiv.org/html/2602.18312v1#S4.p6.1 "4. Action Jacobian Penalty ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne (2018)Deepmimic: example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG)37 (4),  pp.1–14. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§3](https://arxiv.org/html/2602.18312v1#S3.p1.8 "3. Problem Setup and System Overview ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§5.1](https://arxiv.org/html/2602.18312v1#S5.SS1.p1.1 "5.1. Input Features ‣ 5. Linear Policy Net ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px1.p1.1 "Locomotion Tasks ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   M. Rocca, S. Darkner, K. Erleben, and S. Andrews (2025)Policy-space diffusion for physics-based character animation. ACM Transactions on Graphics 44 (3),  pp.1–18. Cited by: [§7](https://arxiv.org/html/2602.18312v1#S7.p5.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   B. Rohrer, S. Fasoli, H. I. Krebs, R. Hughes, B. Volpe, W. R. Frontera, J. Stein, and N. Hogan (2002)Movement smoothness changes during stroke recovery. Journal of Neuroscience 22 (18),  pp.8297–8304. Cited by: [3rd item](https://arxiv.org/html/2602.18312v1#S6.I2.i3.p1.2 "In 6.2. Comparison ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   G. Ruben, K. Espen, A. Michael, W. Georg, B. Jared, P. Steven, M. David, and B. Moritz (2024)Design and control of a bipedal robotic character. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017)Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: [§3](https://arxiv.org/html/2602.18312v1#S3.p6.9 "3. Problem Setup and System Overview ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   C. Tessler, Y. Guo, O. Nabati, G. Chechik, and X. B. Peng (2024)Maskedmimic: unified physics-based character control through masked motion inpainting. ACM Transactions on Graphics (TOG)43 (6),  pp.1–21. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   V. Tjeng, K. Xiao, and R. Tedrake (2017)Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356. Cited by: [§7](https://arxiv.org/html/2602.18312v1#S7.p4.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   E. Todorov, T. Erez, and Y. Tassa (2012)Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems,  pp.5026–5033. Cited by: [§3](https://arxiv.org/html/2602.18312v1#S3.p5.3 "3. Problem Setup and System Overview ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   T. E. Truong, M. Piseno, Z. Xie, and K. Liu (2024)Pdp: physics-based character animation via diffusion policy. In SIGGRAPH Asia 2024 Conference Papers,  pp.1–10. Cited by: [§6.3](https://arxiv.org/html/2602.18312v1#S6.SS3.SSS0.Px4.p1.1 "Policy Distillation and Transitions between Skills ‣ 6.3. Linear Policy Net Evaluations ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   J. Wang, J. Hodgins, and J. Won (2024a)Strategy and skill learning for physics-based table tennis animation. In ACM SIGGRAPH 2024 Conference Papers,  pp.1–11. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px3.p1.1 "Imitating Single Sequence ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   R. Wang, P. Xu, H. Shi, E. Schumann, and C. K. Liu (2024b)Fürelise: capturing and physically synthesizing hand motion of piano performance. In SIGGRAPH Asia 2024 Conference Papers,  pp.1–11. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   J. Won, D. Gopinath, and J. Hodgins (2020)A scalable approach to control diverse behaviors for physically simulated characters. ACM Transactions on Graphics (TOG)39 (4),  pp.33–1. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   K. Xie, P. Xu, S. Andrews, V. B. Zordan, and P. G. Kry (2023)Too stiff, too strong, too smart: evaluating fundamental problems with motion control policies. Proceedings of the ACM on Computer Graphics and Interactive Techniques 6 (3),  pp.1–17. Cited by: [§2.2](https://arxiv.org/html/2602.18312v1#S2.SS2.p1.1 "2.2. Learning Smooth Policies ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Z. Xie, X. Da, M. Van de Panne, B. Babich, and A. Garg (2021)Dynamics randomization revisited: a case study for quadrupedal locomotion. In 2021 IEEE International Conference on Robotics and Automation (ICRA),  pp.4955–4961. Cited by: [§6.4](https://arxiv.org/html/2602.18312v1#S6.SS4.p2.4 "6.4. Sim-to-real on a Quadrupedal Robot ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Z. Xie, S. Starke, H. Y. Ling, and M. van de Panne (2022)Learning soccer juggling skills with layer-wise mixture-of-experts. In ACM SIGGRAPH 2022 Conference Papers,  pp.1–9. Cited by: [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px4.p1.1 "Environment Interaction ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   M. Xu, Y. Shi, K. Yin, and X. B. Peng (2025a)Parc: physics-based augmentation with reinforcement learning for character controllers. In ACM SIGGRAPH 2025 Conference Papers,  pp.1–11. Cited by: [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px4.p1.1 "Environment Interaction ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   S. Xu, H. Y. Ling, Y. Wang, and L. Gui (2025b)Intermimic: towards universal whole-body control for physics-based human-object interactions. In Proceedings of the Computer Vision and Pattern Recognition Conference,  pp.12266–12277. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   K. Yin, K. Loken, and M. Van de Panne (2007)Simbicon: simple biped locomotion control. ACM Transactions on Graphics (TOG)26 (3),  pp.105–es. Cited by: [§2.3](https://arxiv.org/html/2602.18312v1#S2.SS3.p1.1 "2.3. Linear Feedback Control ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   R. Yu, H. Park, and J. Lee (2021)Human dynamics from monocular video with dynamic camera movements. ACM Transactions on Graphics (TOG)40 (6),  pp.1–14. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   R. Yu, Y. Wang, Q. Zhao, H. W. Tsui, J. Wang, P. Tan, and Q. Chen (2025)Skillmimic-v2: learning robust and generalizable interaction skills from sparse and noisy demonstrations. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers,  pp.1–11. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Y. Zhang, D. Gopinath, Y. Ye, J. Hodgins, G. Turk, and J. Won (2023)Simulation and retargeting of complex multi-character interactions. In ACM SIGGRAPH 2023 Conference Proceedings,  pp.1–11. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 
*   Z. Zhang, S. Bashkirov, D. Yang, M. Taylor, and X. B. Peng (2025)ADD: physics-based motion imitation with adversarial differential discriminators. arXiv preprint arXiv:2505.04961. Cited by: [§2.1](https://arxiv.org/html/2602.18312v1#S2.SS1.p1.1 "2.1. Deep Reinforcement Learning ‣ 2. Related Work ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§3](https://arxiv.org/html/2602.18312v1#S3.p6.9 "3. Problem Setup and System Overview ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§6.1](https://arxiv.org/html/2602.18312v1#S6.SS1.SSS0.Px4.p1.1 "Environment Interaction ‣ 6.1. DeepMimic Task with Simulated Humanoid ‣ 6. Evaluations ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"), [§7](https://arxiv.org/html/2602.18312v1#S7.p7.1 "7. Conclusions and Discussion ‣ Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty"). 

![Image 3: Refer to caption](https://arxiv.org/html/2602.18312v1/x3.png)

Figure 3. Learning curves of various methods, averaged over 3 runs with different random seeds. The total reward is evaluated only on the motion imitation reward.

![Image 4: Refer to caption](https://arxiv.org/html/2602.18312v1/x4.png)

Figure 4. Control action of the left ankle pitch joint for the backflip and the table tennis footwork drill.
