Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers
Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push–pull interaction induces radial oscillations, injecting noise into Adam’s second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam’s adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates curvature-adaptive radial step sizing and architecture-aware rules and projections for scale-invariant layers and low-dimensional parameters. Experiments on vision and language tasks show that AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
💡 Research Summary
The paper revisits the widely adopted AdamW optimizer and identifies a fundamental geometric flaw: AdamW treats the magnitude (norm) and direction of each parameter vector as a single entity, applying the same adaptive update while simultaneously pulling the norm down with weight decay. In deep networks, stochastic gradients tend to increase parameter norms to expand model capacity, whereas weight decay indiscriminately shrinks those norms. This creates a “Radial Tug‑of‑War” where the norm oscillates back and forth, injecting noise into Adam’s second‑moment accumulator (the v‑state). The contaminated variance estimate then harms the delicate, direction‑focused updates that are essential for learning useful features.
To resolve this, the authors propose Orthogonal Dynamics Decoupling and instantiate it as a new optimizer called AdamO. The core idea is to decompose each parameter vector w into a radial component (aligned with w) and a tangential component (orthogonal to w) at every step, and to handle these two subspaces with distinct update rules.
Radial‑Tangential Projections
- Define the radial projection φρ_w(z) = (⟨z, w⟩ / ⟨w, w⟩)·w (the component of any vector z along w) and the tangential projection φθ_w(z) = z − φρ_w(z) (the part orthogonal to w).
- For each stochastic gradient gₜ, compute gρₜ = φρ(gₜ) and gθₜ = φθ(gₜ).
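The two projections follow directly from these definitions; a minimal NumPy sketch (function names are ours, not the paper's):

```python
import numpy as np

def radial_proj(z, w, eps=1e-12):
    """phi^rho_w(z): component of z along the radial direction w."""
    return (np.dot(z, w) / (np.dot(w, w) + eps)) * w

def tangential_proj(z, w, eps=1e-12):
    """phi^theta_w(z): component of z orthogonal to w."""
    return z - radial_proj(z, w, eps)

w = np.array([3.0, 4.0])
g = np.array([1.0, 2.0])
g_rho = radial_proj(g, w)        # radial part of the gradient
g_theta = tangential_proj(g, w)  # tangential part
# The two parts recombine to g and are mutually orthogonal.
```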
State Re‑projection
- Adam’s first‑moment (m) and second‑moment (v) states are also projected each iteration because the subspaces rotate as w changes. This prevents cross‑talk between radial and tangential dynamics.
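The re‑projection can be sketched as follows; this is an illustrative reading of the mechanism, showing only the first moment (the summary states that v is handled analogously but does not spell out the rule):

```python
import numpy as np

def reproject_state(m_rho, m_theta, w_new):
    """After w moves, re-decompose the stored momentum with respect to
    the *new* radial direction, preventing radial/tangential cross-talk.
    (Illustrative sketch only.)"""
    m = m_rho + m_theta                                    # recombine
    r = (np.dot(m, w_new) / np.dot(w_new, w_new)) * w_new  # new radial part
    return r, m - r                                        # radial, tangential
```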
Radial Update (SGD‑style)
- The radial direction is updated with a simple SGD‑like step: Δwρ = ηρ,t · φρ(m̂ρ), where m̂ρ is the bias‑corrected radial momentum.
- Crucially, ηρ,t is adaptive: a curvature proxy κₜ = ‖gₜ − gₜ₋₁‖² is exponentially smoothed as τₜ = βτ·τₜ₋₁ + (1 − βτ)·κₜ, and the radial step scales inversely with it: ηρ,t = ηρ · τ_target / (τₜ + ε). In high‑curvature regions (large τₜ) the radial step shrinks, suppressing oscillations; in flat regions it grows, allowing faster norm adjustment.
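A minimal sketch of the curvature‑adaptive radial step size, assuming the inverse relation between the smoothed curvature and the step that the described behavior implies (all constants here are illustrative, not the paper's):

```python
import numpy as np

def update_curvature_ema(tau, g, g_prev, beta_tau=0.99):
    """EMA of the curvature proxy kappa_t = ||g_t - g_{t-1}||^2."""
    kappa = float(np.sum((g - g_prev) ** 2))
    return beta_tau * tau + (1.0 - beta_tau) * kappa

def adaptive_radial_lr(tau, eta_rho=1e-3, tau_target=1.0, eps=1e-8):
    """Radial step shrinks when the curvature EMA tau is large,
    grows when the landscape is flat (tau small)."""
    return eta_rho * tau_target / (tau + eps)
```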
Tangential Update (Adam‑style)
- The tangential subspace retains Adam’s adaptive preconditioning: mθ and vθ are updated with the projected tangential gradient gθ, bias‑corrected, and the update is Δwθ = ηθ · φθ(m̂θ) / (√v̂θ + ε). This ensures that direction‑only learning benefits from Adam’s variance scaling while remaining orthogonal to the radial direction.
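A minimal sketch of one tangential step under these definitions (names and hyperparameter defaults are illustrative; per‑iteration state re‑projection is omitted for brevity):

```python
import numpy as np

def tangential_adam_step(w, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style step confined to the subspace orthogonal to w
    (illustrative sketch, not the paper's exact algorithm)."""
    g_theta = g - (np.dot(g, w) / np.dot(w, w)) * w  # tangential gradient
    m = b1 * m + (1 - b1) * g_theta                  # first moment
    v = b2 * v + (1 - b2) * g_theta ** 2             # second moment
    m_hat = m / (1 - b1 ** t)                        # bias correction
    v_hat = v / (1 - b2 ** t)
    step = eta * m_hat / (np.sqrt(v_hat) + eps)
    # Elementwise scaling can leak a radial component, so re-project.
    step = step - (np.dot(step, w) / np.dot(w, w)) * w
    return step, m, v

w = np.array([3.0, 4.0])
step, m, v = tangential_adam_step(w, np.array([1.0, 2.0]),
                                  np.zeros(2), np.zeros(2), t=1)
```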
Pure Radial Weight Decay
- Weight decay is applied only along the radial axis: w ← (1 – ηρ,t λ) w. This shrinks the norm without affecting the direction, eliminating the “indiscriminate” decay of AdamW.
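Because the decay multiplies w by a scalar, it moves w only along its own direction, leaving the direction vector untouched; a tiny sketch (constants illustrative):

```python
import numpy as np

def radial_decay(w, lr_rho=0.1, lam=0.5):
    """Pure radial decay: shrinks the norm, preserves the direction."""
    return (1.0 - lr_rho * lam) * w

w = np.array([3.0, 4.0])
w_new = radial_decay(w)
# w_new has a smaller norm but points the same way as w.
```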
Architecture‑Aware Special Cases
- Low‑dimensional parameters (biases, scale‑affine terms) are identified via a threshold d_th; for them AdamO falls back to a standard Adam update, avoiding unnecessary projection overhead.
- Scale‑invariant layers (BatchNorm, LayerNorm) receive a “tangential‑only” update (Δw ← Δwθ) because radial changes have no functional effect. This mirrors the projection heuristic of AdamP but is naturally embedded in the decoupled framework.
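The dispatch logic described above might look like the following; the function name, the reading of d_th as a tensor-rank threshold, and the scale_invariant flag are our assumptions for illustration, not the paper's interface:

```python
def choose_update(param_shape, scale_invariant=False, d_th=1):
    """Hypothetical per-parameter dispatch mirroring the summary's rules."""
    if len(param_shape) <= d_th:
        return "adam"             # biases / affine scales: plain Adam
    if scale_invariant:
        return "tangential_only"  # radial changes have no functional effect
    return "radial_tangential"    # full AdamO decomposition
```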
Empirical Evaluation
- Experiments on CIFAR‑100 with a ResNet‑18 backbone (300 epochs, standard data augmentation) show AdamO achieving 79.74 % ± 0.09 test accuracy, a gain of roughly 5 percentage points over AdamW’s 74.75 %. AdamP, a prior projection‑based method, improves only marginally, to 75.07 %.
- Ablation studies reveal each component’s contribution: removing curvature‑adaptive radial steps drops accuracy to 75.21 %; disabling low‑dimensional handling reduces it to 75.99 %; omitting the projection for scale‑invariant layers yields 76.17 %. An isotropic‑decay variant (AdamO‑Isotropic) performs on par with AdamW, confirming that pure radial regularization is the key driver.
- Visualization of optimization trajectories in a 2‑D loss subspace shows AdamW’s path wandering radially, whereas AdamO follows a smoother, more directed curve. Gradient statistics (norm variance, direction‑change rate) show an 11 % reduction in gradient‑norm fluctuation and a clear drop in variance, evidence of more stable training dynamics.
Significance and Future Directions
- By explicitly separating norm control from feature learning, AdamO respects the underlying geometry of the parameter space, addressing a limitation that has persisted since the introduction of AdamW.
- The method is lightweight: it introduces only a few extra EMA states and projection operations, without any Lagrange multipliers or complex constraints.
- The authors suggest that the decoupling principle could be applied to other adaptive optimizers (e.g., RMSProp, AdaFactor) and that meta‑learning could be used to automatically tune τ_target or ηρ.
- For large‑scale language models and multimodal systems—where scale‑invariant components are abundant and parameter norms can become extremely large—the radial‑tangential split promises improved stability and potentially better generalization.
In summary, AdamO offers a principled, geometry‑aware redesign of weight‑decay regularization: it isolates norm shrinkage to a curvature‑aware SGD‑style radial step, confines Adam’s adaptive preconditioning to the orthogonal subspace, and augments the scheme with architecture‑specific handling. Empirical results across vision and language tasks demonstrate consistent gains in accuracy, smoother optimization trajectories, and reduced gradient noise, positioning AdamO as a compelling next‑generation alternative to AdamW for deep network training.