Dynamics-Aligned Shared Hypernetworks for Zero-Shot Actuator Inversion
Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode is actuator inversion, where identical actions produce opposite physical effects under a latent binary context. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via an expressivity separation result for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under actuator inversion. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate discontinuous context-to-dynamics interactions. On AIB’s held-out actuator-inversion tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 111.8% and surpassing a standard context-aware baseline by 16.1%.
💡 Research Summary
The paper tackles a fundamental failure mode in contextual reinforcement learning (RL) called “actuator inversion”: under a latent binary context, the same action produces opposite physical effects, creating a discontinuous mapping from context to dynamics. Traditional approaches that concatenate context to inputs or assume smooth latent distributions cannot represent such sign‑flipping behavior and thus fail to generalize zero‑shot. To address this, the authors introduce DMA*‑SH (Dynamics‑Aligned Shared Hypernetwork). The method consists of three tightly coupled components.
First, a dynamics‑aligned context encoder (DMA*) processes a sliding window of K recent transitions (state, action, state‑difference) through random input masking, per‑sample AvgL1Norm, an LSTM, and SimNorm output projection. Masking breaks brittle co‑adaptations, AvgL1Norm provides statistic‑free scaling suitable for online RL, and SimNorm forces the embedding into a product of low‑dimensional simplices, yielding stable, directionally concentrated representations that are robust across training and out‑of‑distribution contexts.
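The three encoder operations can be sketched in a few lines of numpy. The exact forms below are assumptions based on the standard definitions of AvgL1Norm and SimNorm from prior work (the group size and masking probability are illustrative defaults, not the paper's hyperparameters):

```python
import numpy as np

def avg_l1_norm(x, eps=1e-8):
    # Scale each sample by the mean absolute value of its features;
    # no running statistics are needed, which suits online RL.
    return x / (np.mean(np.abs(x), axis=-1, keepdims=True) + eps)

def simnorm(z, group_dim=8):
    # Partition the embedding into groups of size group_dim and apply a
    # softmax to each group, projecting z onto a product of simplices.
    batch, dim = z.shape
    g = z.reshape(batch, dim // group_dim, group_dim)
    g = g - g.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(g)
    return (e / e.sum(axis=-1, keepdims=True)).reshape(batch, dim)

def random_input_mask(x, rng, p=0.1):
    # Zero each input feature independently with probability p (training only).
    return x * (rng.random(x.shape) >= p)
```

Because SimNorm constrains every group to a probability simplex, the embedding's scale is fixed by construction, which is one plausible mechanism behind the "directionally concentrated" representations the paper reports.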
Second, a single hypernetwork conditioned on the inferred context vector zₜ generates a small set of adapter weights ω. These adapters are inserted into bottleneck modules of the forward dynamics model, the policy network, and the Q‑function. By modulating internal features multiplicatively, the adapters can directly implement sign reversals or other discontinuous transformations that simple concatenation cannot express. The adapters are shared across all three networks, enforcing a consistent context‑dependent modulation.
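The generate-and-modulate pattern is easy to state concretely. The sketch below uses per-feature gains as a simplified stand-in for the paper's bottleneck adapter modules (the tanh parameterization is an assumption):

```python
import numpy as np

def generate_adapter(z, W_h, b_h):
    # Hypernetwork: map the inferred context z to adapter gains omega.
    # tanh keeps each gain in (-1, 1), so omega can reach a full sign flip.
    return np.tanh(z @ W_h + b_h)

def modulate(h, omega):
    # Shared multiplicative adapter: the SAME omega gates internal features
    # of the dynamics model, the policy, and the Q-function.
    return h * omega

# With gains near -1, the adapter negates features outright: exactly the
# discontinuous sign reversal that additive concatenation cannot express.
h = np.array([0.5, -1.0, 2.0])
assert np.allclose(modulate(h, -np.ones(3)), -h)
```

Sharing one `omega` across all three networks is what enforces a single, consistent interpretation of the context at every point where it matters.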
Third, training is performed solely with a forward‑dynamics reconstruction loss L = ‖Δŝₜ₊₁ – Δsₜ₊₁‖², jointly updating the encoder parameters ϕ, the base dynamics parameters θ, and the hypernetwork parameters η. During policy and value updates, ω is detached so that reward gradients do not alter the hypernetwork or the context encoder; thus the context representation is shaped only by dynamics prediction, providing a strong structural prior that aligns control with the true physics of each context.
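The gradient routing can be sketched with toy linear stand-ins for the three parameter groups (the shapes and tanh nonlinearities are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the three jointly trained parameter groups.
phi   = rng.normal(size=(6, 4))   # context encoder: history window -> z
eta   = rng.normal(size=(4, 3))   # hypernetwork:    z -> adapter omega
theta = rng.normal(size=(3,))     # base dynamics weights

def infer_adapter(hist):
    z = np.tanh(hist @ phi)       # inferred context embedding
    return np.tanh(z @ eta)       # hypernetwork-generated gains omega

def dynamics_loss(hist, sa, delta_true):
    # L = (state-delta prediction error)^2: the only loss that updates
    # the encoder, the hypernetwork, and the base dynamics jointly.
    delta_pred = np.dot(theta * infer_adapter(hist), sa)
    return (delta_pred - delta_true) ** 2

# Policy/value updates would recompute omega = infer_adapter(hist) and then
# treat it as a constant ("detached"), so reward gradients never touch the
# encoder or hypernetwork parameters.
```

The detach is the key design choice: the context representation is shaped by physics alone, so a poorly-shaped reward signal cannot corrupt context inference.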
Theoretical contributions include: (i) an expressivity separation theorem showing that hypernetwork‑generated multiplicative adapters can represent functions of the form f(s,a,c)=c·g(s,a) which are impossible for additive context concatenation; (ii) a variance decomposition that isolates within‑mode (intra‑context) variance from between‑mode variance, proving that DMA* dramatically compresses the former; and (iii) a policy‑gradient variance bound linking this compression to reduced gradient noise, thereby improving sample efficiency and learning stability.
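The separation result (i) can be checked numerically in a toy case: a multiplicative gain reproduces f(s,a,c) = c·g(s,a) exactly, whereas any additive shift g(s,a) + h(c) would need h(−1) − h(+1) = 2·g(s,a) for every (s,a), which is impossible unless g is constant. A minimal demonstration:

```python
import numpy as np

def g(sa):
    return np.sin(sa)       # arbitrary nonconstant base function

def f_multiplicative(sa, c):
    return c * g(sa)        # adapter sets the gain to the context c

sa = np.linspace(-2.0, 2.0, 7)
# Flipping the context flips the output for every input: the discontinuous
# behavior that a context-independent additive term cannot produce.
assert np.allclose(f_multiplicative(sa, +1.0), -f_multiplicative(sa, -1.0))
```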
To evaluate the approach, the authors construct the Actuator Inversion Benchmark (AIB), a suite of environments (2‑D robotic arm, mobile robot, continuous control tasks) where a hidden binary variable flips the sign of actuator outputs. They define three context sets: C_train for training, C_eval_in for interpolation, and C_eval_out for extrapolation. Zero‑shot performance is measured on C_eval_out without any gradient updates. DMA*‑SH achieves a 111.8% relative improvement over domain randomization and outperforms a strong concatenation‑based baseline by 16.1%. It also exceeds a “separate hypernetwork” baseline (DA) by 7.5%. Ablation studies confirm that each design element—random masking, AvgL1Norm, SimNorm, and sharing adapters across modules—contributes significantly; removing any of them degrades performance by 10–20%.
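The benchmark's core mechanic can be sketched as a thin wrapper around any base dynamics function (the class and step interface below are hypothetical stand-ins, not AIB's actual API):

```python
import numpy as np

class ActuatorInversionWrapper:
    """Hidden binary context that flips actuator signs before they reach
    the base dynamics. A hypothetical stand-in for AIB's interface."""

    def __init__(self, base_step, inverted):
        self.base_step = base_step
        self.sign = -1.0 if inverted else 1.0  # the latent binary context

    def step(self, state, action):
        # Identical actions produce opposite physical effects across contexts.
        return self.base_step(state, self.sign * np.asarray(action))

# Toy base dynamics: s' = s + a.
base = lambda s, a: s + a
normal  = ActuatorInversionWrapper(base, inverted=False)
flipped = ActuatorInversionWrapper(base, inverted=True)
```

Because the context never appears in the observation, an agent can only recover it from recent transitions, which is exactly what the dynamics-aligned encoder is built to do.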
In summary, DMA*‑SH demonstrates that (a) aligning context inference with dynamics prediction yields reliable latent representations, (b) a shared hypernetwork providing multiplicative adapters supplies the exact inductive bias needed for discontinuous context shifts, and (c) careful normalization and masking stabilize online learning. The framework not only solves the actuator inversion problem but also suggests a general recipe for zero‑shot RL in settings with abrupt mode changes, such as tool swaps, sudden parameter shifts, or safety‑critical system reconfigurations.