Consistency Deep Equilibrium Models

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. The framework also supports multi-step evaluation, flexibly trading computation for performance gains. Extensive experiments across tasks in various domains demonstrate that C-DEQs achieve consistent 2-20$\times$ accuracy improvements over implicit DEQs under the same few-step inference budget.


💡 Research Summary

The paper introduces Consistency Deep Equilibrium Models (C‑DEQ), a framework that dramatically speeds up inference for Deep Equilibrium Models (DEQs) while preserving their expressive power. Traditional DEQs define the output z⋆ as the fixed point of a nonlinear transformation fθ(z, x) and obtain z⋆ by iteratively solving Fθ(z; x)=fθ(z, x)−z=0 with root‑finding methods such as Broyden’s method or Anderson Acceleration (AA). Although these solvers reduce the number of iterations compared with naïve Picard iteration, they still require tens of passes through the underlying network, making DEQs considerably slower than explicit finite‑depth networks.
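The baseline inference procedure can be illustrated with the simplest solver, naïve Picard iteration z_{k+1} = fθ(z_k, x). The sketch below uses a toy contractive map standing in for the DEQ cell (the real fθ would be a neural network; the names and tolerances here are illustrative assumptions):

```python
import numpy as np

def f_theta(z, x):
    # Toy contractive stand-in for the DEQ cell f_theta(z, x);
    # in practice this would be a neural network layer.
    return np.tanh(0.5 * z + x)

def picard_solve(x, max_iters=50, tol=1e-6):
    """Naive Picard iteration z_{k+1} = f_theta(z_k, x) until the
    update falls below tol, returning the approximate fixed point."""
    z = np.zeros_like(x)
    for k in range(max_iters):
        z_next = f_theta(z, x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next, k + 1
        z = z_next
    return z, max_iters

x = np.array([0.3, -0.1])
z_star, iters = picard_solve(x)
```

Even for this two-dimensional toy problem, convergence takes on the order of twenty passes through fθ, which is the latency bottleneck that Broyden's method and Anderson Acceleration mitigate and C-DEQ aims to remove.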

C‑DEQ reframes the fixed‑point iteration as a continuous‑time flow, the Fixed‑Point ODE (FP‑ODE):
 dz/dt = fθ(z, x) − z.
The steady state of this ODE coincides exactly with the DEQ equilibrium. By treating DEQ inference as numerical integration of this ODE, the authors obtain a well‑defined trajectory that can be used as a teacher for consistency distillation, a technique originally developed for diffusion models.
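The FP-ODE view can be made concrete with forward-Euler integration, which recovers a damped fixed-point iteration (a minimal sketch with the same toy fθ as above; step size and step count are illustrative choices, not the paper's):

```python
import numpy as np

def f_theta(z, x):
    # Toy contractive stand-in for the DEQ cell f_theta(z, x).
    return np.tanh(0.5 * z + x)

def fp_ode_euler(x, h=0.5, steps=200):
    """Forward-Euler integration of the FP-ODE dz/dt = f_theta(z, x) - z.
    Note that step size h = 1 recovers plain Picard iteration."""
    z = np.zeros_like(x)
    for _ in range(steps):
        z = z + h * (f_theta(z, x) - z)
    return z

x = np.array([0.3, -0.1])
z_ode = fp_ode_euler(x)
```

At the steady state dz/dt = 0, so fθ(z, x) = z: integrating the FP-ODE to convergence reaches exactly the DEQ equilibrium, which is what licenses treating solver iterates as points on an ODE trajectory.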

The teacher trajectory is generated with Anderson Acceleration, which leverages a short history of past iterates to achieve super‑linear convergence. Starting from a fixed initial state z₀ = 0, AA produces a sequence {z₀,…,z_K} where z_K ≈ z⋆. Because the initial condition and solver are fixed, the trajectory is deterministic, eliminating the “path‑independence” issue that would otherwise make distillation ill‑posed.
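A minimal (type-II) Anderson Acceleration sketch shows how the deterministic teacher trajectory {z₀,…,z_K} is produced: each step solves a small least-squares problem over the last m residuals and recombines the corresponding updated iterates. Regularization constants and memory size here are illustrative; practical DEQ solvers add damping and safeguards:

```python
import numpy as np

def f_theta(z, x):
    # Toy contractive stand-in for the DEQ cell f_theta(z, x).
    return np.tanh(0.5 * z + x)

def anderson_trajectory(f, x, m=3, K=15):
    """Deterministic teacher trajectory {z_0, ..., z_K} from z_0 = 0 via a
    minimal Anderson Acceleration sketch (memory m, K steps)."""
    z0 = np.zeros_like(x)
    Zs, Fs = [z0], [f(z0, x)]
    traj = [z0]
    for _ in range(K):
        n = min(m, len(Zs))
        # Residuals g_i = f(z_i, x) - z_i for the last n iterates.
        G = np.stack([Fs[-i] - Zs[-i] for i in range(1, n + 1)])
        if n == 1:
            alpha = np.array([1.0])
        else:
            # Minimize ||sum_i alpha_i g_i|| subject to sum_i alpha_i = 1,
            # via regularized normal equations.
            w = np.linalg.solve(G @ G.T + 1e-10 * np.eye(n), np.ones(n))
            alpha = w / w.sum()
        # Recombine the updated iterates f(z_i, x) with the same weights.
        z_new = sum(a * Fs[-i] for a, i in zip(alpha, range(1, n + 1)))
        Zs.append(z_new)
        Fs.append(f(z_new, x))
        traj.append(z_new)
    return traj

x = np.array([0.3, -0.1])
traj = anderson_trajectory(f_theta, x)
```

Because z₀ and the solver are fixed, rerunning this procedure on the same input reproduces the same sequence, which is the determinism property the distillation objective relies on.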

The student model gφ is a time‑conditioned mapping that takes an intermediate state (or a short window of states) and predicts the equilibrium directly. Its architecture consists of two components: (1) a skip‑connection coefficient c_skip(t) that gradually shifts weight from a learned network Pφ to the raw state z_t as virtual time t approaches the final time T; and (2) an AA‑structured parameterization where Pφ is re‑expressed as an Anderson‑style update S_AA applied to the current and previous iterates, with a learnable function hφ producing the required combination weights. This design injects the solver’s structural prior into the student, allowing it to refine rather than rediscover the acceleration dynamics.
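The skip-connection component can be sketched as a time-conditioned convex combination of the raw state and the learned network's output. The linear schedule c_skip(t) = t/T and the placeholder Pφ below are assumptions for illustration; the paper's exact coefficient and network differ:

```python
import numpy as np

def c_skip(t, T=1.0):
    # Assumed schedule: the skip weight grows to 1 as virtual time t -> T,
    # so the student reduces to the identity at the end of the trajectory.
    return t / T

def student_predict(z_t, t, P_phi, T=1.0):
    """g_phi(z_t, t) = c_skip(t) * z_t + (1 - c_skip(t)) * P_phi(z_t, t):
    a weighted blend of the raw solver state and the learned correction."""
    c = c_skip(t, T)
    return c * z_t + (1.0 - c) * P_phi(z_t, t)

# Placeholder for the learned network P_phi (hypothetical; in the paper
# this is the AA-structured update S_AA with learnable weights h_phi).
P_phi = lambda z, t: np.tanh(z)

z = np.array([0.2, -0.4])
```

The blend enforces a useful boundary condition: at t = T the student is exactly the identity on the (already converged) state, while at t = 0 the prediction comes entirely from the learned network.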

Training uses a combined loss: a global consistency term that forces gφ(z_k, t_k, x) to match the terminal state z_K, and a local consistency term that penalizes discrepancies between consecutive student predictions. The global term ensures that any intermediate point can be mapped to the fixed point in a single step; the local term stabilizes multi‑step inference and prevents trajectory drift.
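The two terms can be sketched as follows, with the student evaluated at every trajectory point. Squared error and the uniform time grid t_k = k/K are assumptions for illustration; the paper's distance function and weighting may differ:

```python
import numpy as np

def dual_consistency_loss(g_phi, traj, x, lam=1.0):
    """Sketch of the combined objective: a global term pulling every
    intermediate prediction g_phi(z_k, t_k, x) toward the terminal state
    z_K, plus a local term penalizing gaps between consecutive
    predictions along the trajectory."""
    K = len(traj) - 1
    z_K = traj[-1]
    preds = [g_phi(traj[k], k / K, x) for k in range(K)]
    global_term = sum(float(np.sum((p - z_K) ** 2)) for p in preds) / K
    local_term = sum(float(np.sum((preds[k + 1] - preds[k]) ** 2))
                     for k in range(K - 1)) / max(K - 1, 1)
    return global_term + lam * local_term

# Sanity check: a perfect student that always outputs the terminal
# state incurs zero loss on any trajectory.
traj = [np.array([0.1 * k, -0.05 * k]) for k in range(5)]
perfect = lambda z, t, x: traj[-1]
loss = dual_consistency_loss(perfect, traj, x=None)
```

The global term alone would suffice for one-step inference; the local term is what keeps successive predictions mutually consistent when the student is applied several times to trade extra computation for accuracy.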

Experiments span three domains: language modeling on WikiText‑103, image classification on ImageNet, and graph node classification on OGB‑arxiv and OGB‑products. Under the same “few‑step” budget (e.g., 2–4 network evaluations), C‑DEQ consistently outperforms baseline DEQs by 2–20× in accuracy or F1 score, while matching or exceeding the performance of comparable explicit networks. Importantly, memory consumption remains identical to that of the original DEQ because the backward pass still relies on the implicit‑function theorem.

Key contributions are: (1) a novel reinterpretation of DEQ inference as a fixed‑point ODE, enabling principled consistency distillation; (2) the use of Anderson Acceleration to define a deterministic teacher trajectory and to embed its structural prior into the student; (3) a time‑aware skip‑network architecture that smoothly interpolates between raw solver states and learned corrections; and (4) a dual‑consistency loss that balances one‑step accuracy with multi‑step scalability. By “leap‑frogging” the iterative solver, C‑DEQ brings implicit models into the “one‑step” era, offering up to twenty‑fold speedups without sacrificing the benefits of equilibrium‑based learning.

