Invertible Memory Flow Networks

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Long-sequence neural memory remains a challenging problem: RNNs and their variants suffer from vanishing gradients, while Transformers scale quadratically with sequence length. Moreover, compressing long sequences into a fixed-size representation is notoriously hard to learn end-to-end because of its difficult optimization landscape. Invertible Memory Flow Networks (IMFN) make long-sequence compression tractable through factorization: instead of learning end-to-end compression, the problem is decomposed into pairwise merges using a binary tree of “sweeper” modules. Each sweeper learns a much simpler 2-to-1 compression task, yielding O(log N) depth with sublinear error accumulation in sequence length. For online inference, the tree is distilled into a constant-cost recurrent student requiring O(1) sequential steps per new token. Empirical results validate IMFN on long MNIST sequences and UCF-101 videos, demonstrating compression of high-dimensional data over long sequences.


💡 Research Summary

The paper tackles the long‑standing challenge of compressing and remembering very long sequences in neural networks. Traditional recurrent architectures (RNNs, LSTMs) suffer from vanishing gradients, while Transformers incur quadratic cost with respect to sequence length, making them impractical for very long contexts. Moreover, learning a single end‑to‑end compression function that maps an entire sequence into a fixed‑size vector is notoriously difficult because the optimization landscape becomes highly non‑convex and gradients vanish over long horizons.

Invertible Memory Flow Networks (IMFN) propose a fundamentally different approach: instead of learning a monolithic compressor, they factorize the task into many tiny, locally invertible 2‑to‑1 “sweeper” modules arranged in a binary tree. Each sweeper consists of a lightweight encoder that merges two memory vectors into one and a decoder that reconstructs the original pair from the merged code. The encoder/decoder are simple MLPs (or lightweight attention blocks) trained with a reconstruction loss plus a small L2 regularizer on the merged code; Gaussian noise is added for robustness. Because each merge operates on only two vectors, the learning problem is dramatically simpler, and the local reconstruction loss forces the merge to be approximately invertible.
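A single sweeper's objective can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: a single linear encoder/decoder pair in NumPy stands in for the MLP or attention blocks, and `d`, the noise scale, and the L2 weight are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # latent dimension (illustrative; the paper reports 128-2048)

# Hypothetical linear sweeper: the encoder merges two d-dim memories into
# one d-dim code; the decoder reconstructs the pair from the (noisy) code.
W_enc = rng.normal(scale=0.1, size=(d, 2 * d))
W_dec = rng.normal(scale=0.1, size=(2 * d, d))

def sweeper_loss(a, b, noise_std=0.01, l2_weight=1e-3):
    pair = np.concatenate([a, b])          # the two vectors to merge
    code = W_enc @ pair                    # 2-to-1 merge
    noisy = code + rng.normal(scale=noise_std, size=code.shape)
    recon = W_dec @ noisy                  # approximate inverse of the merge
    recon_loss = np.mean((recon - pair) ** 2)
    reg = l2_weight * np.sum(code ** 2)    # small L2 penalty on the code
    return recon_loss + reg
```

Minimizing this loss over many pairs is what pushes each merge toward approximate invertibility.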

By stacking sweeper modules level‑wise, a full sequence of length T is compressed in log₂ T steps: adjacent pairs are merged at level 0, the resulting latent vectors are merged again at level 1, and so on until a single root memory vector remains. This yields a depth‑O(log T) compression pipeline, and empirical results show that the total reconstruction error grows sub‑linearly (approximately logarithmically) with sequence length, far slower than the linear error growth typical of naïve hierarchical compression.
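The level-wise stacking is just repeated pairwise merging. A minimal sketch, assuming the sequence length is a power of two and `merge` is any 2-to-1 function standing in for a trained sweeper:

```python
def compress(seq, merge):
    """Merge adjacent pairs level by level until a single root remains.
    For len(seq) == T (a power of two) this takes log2(T) levels and
    T - 1 total merges."""
    level = list(seq)
    levels = 0
    while len(level) > 1:
        level = [merge(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels
```

With summation as a stand-in merge, `compress([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)` reaches a single root after log₂(8) = 3 levels.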

The authors evaluate the teacher (the binary‑tree compressor) on two domains:

  1. MNIST image sequences – each image is flattened to a 784‑dim vector. Sweeper levels map raw pixels to a latent space of dimension d (128–2048). The tree compresses up to 256 frames (8 merge levels, since 2⁸ = 256). Round‑trip reconstruction MSE is measured for sequence lengths 16, 32, 64, 128, and 256. IMFN consistently outperforms two strong baselines: a Transformer encoder‑decoder that compresses the whole sequence into a CLS token, and a Mamba state‑space model with the same decoder. Both baselines exhibit much higher MSE, confirming that the factorized, locally invertible design is easier to learn.

  2. UCF‑101 video clips – 128‑frame clips are down‑sampled to 64×64 RGB. Level 0 sweeps use a Perceiver‑style cross‑attention stack to merge two frames into 96 memory tokens (d = 256). Higher levels merge token‑latents directly, achieving cumulative temporal compression ratios of 2:1, 4:1, 8:1, 16:1. Again, local token‑space reconstruction losses keep the inverse path accurate, and the full round‑trip reconstruction shows low pixel‑wise error.

While the teacher provides an excellent compression mechanism, its O(log N) update cost per new token is still too high for real‑time streaming. To obtain constant‑time online inference, the authors distill the tree into a student model. They define a “zero‑padding trajectory”: start with all leaves zero, then sequentially replace zeros with actual inputs, generating a series of root states y₀, y₁,…, y_N that the student must predict. Using a Merkle‑style optimization, only the nodes on the path from the newly added leaf to the root need to be recomputed, so the full trajectory can be generated in O(N log N) time.
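The Merkle-style optimization can be sketched with an implicit heap-layout tree. This is a hedged toy: `merge` and `zero` are placeholders for a trained sweeper and its all-zero leaf, and N is assumed to be a power of two.

```python
def root_trajectory(leaves, merge, zero):
    """Insert leaves one at a time into a zero-padded complete binary
    tree, recomputing only the O(log N) nodes on the leaf-to-root path,
    and record the root state after each insertion (y_1 ... y_N)."""
    n = len(leaves)                  # assumed to be a power of two
    tree = [zero] * (2 * n)          # node i has children 2i and 2i + 1
    for i in range(n - 1, 0, -1):    # internal nodes over all-zero leaves
        tree[i] = merge(tree[2 * i], tree[2 * i + 1])
    roots = []
    for t, x in enumerate(leaves):
        i = n + t                    # leaf slot for timestep t
        tree[i] = x
        while i > 1:                 # walk only the path up to the root
            i //= 2
            tree[i] = merge(tree[2 * i], tree[2 * i + 1])
        roots.append(tree[1])
    return roots
```

With addition as the merge and 0 as the zero element, the roots are simply prefix sums: `root_trajectory([1, 2, 3, 4], lambda a, b: a + b, 0)` yields `[1, 3, 6, 10]`, and each insertion recomputes only log₂(4) = 2 internal nodes instead of rebuilding the whole tree, giving the O(N log N) total.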

The student is a 4‑layer MLP (hidden size 2d) that takes the current memory mₜ, the new latent xₜ, and a one‑hot positional encoding, and predicts a delta Δₜ. The update rule is mₜ₊₁ = mₜ + Δₜ, mirroring the additive dynamics of residual networks. Training proceeds on the student's own rollouts: after the teacher targets are generated, the student runs forward, its intermediate states are detached, and a random 25% of timesteps is sampled; the L2 loss compares the student's predicted next state (previous memory plus predicted delta) against the teacher's target root state. This self‑rollout training prevents the distribution shift that would occur if the student were always trained on perfect teacher states.
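The student's additive update and self-rollout can be sketched as follows. This is a toy stand-in: `mlp` is any callable in place of the paper's 4-layer MLP, and the dimension `d` is made up.

```python
import numpy as np

d = 4  # memory/latent dimension (illustrative)

def student_step(m, x, pos_onehot, mlp):
    """One O(1) online update: predict a delta from (memory, new latent,
    position) and apply the residual rule m_{t+1} = m_t + delta."""
    inp = np.concatenate([m, x, pos_onehot])
    return m + mlp(inp)

def self_rollout(xs, mlp):
    """Run the student on its OWN predicted states (not teacher states);
    during training, a sampled subset of these states is matched to the
    teacher's target roots with an L2 loss."""
    m = np.zeros(d)
    states = []
    for t, x in enumerate(xs):
        pos = np.zeros(len(xs))
        pos[t] = 1.0                 # one-hot positional encoding
        m = student_step(m, x, pos, mlp)
        states.append(m)
    return states
```

With a toy `mlp` that simply returns the new latent, the memory accumulates the inputs, illustrating the additive dynamics; a trained student instead learns deltas that track the teacher's root states.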

Experiments show that the distilled student matches the teacher’s reconstruction quality across all tested sequence lengths while requiring only O(1) computation per step and O(d) memory, independent of the total sequence length. This makes IMFN suitable for streaming applications where latency and memory constraints are critical.

Key contributions:

  • Introduces a novel factorization of long‑sequence compression into locally invertible 2‑to‑1 merges, enabling logarithmic depth and sub‑linear error growth.
  • Provides a concrete binary‑tree architecture (the “teacher”) with explicit inverse pathways, trained via simple reconstruction losses.
  • Proposes an efficient Merkle‑style trajectory generation and a lightweight recurrent student that distills the teacher’s behavior into constant‑time updates.
  • Demonstrates the approach on high‑dimensional visual data (MNIST sequences and UCF‑101 videos), achieving superior compression‑reconstruction performance compared to strong Transformer and state‑space baselines.

Overall, IMFN offers a fresh perspective on the memory problem: by treating compression as a reversible flow through a vector space and by decomposing it into many tiny, learnable primitives, the method sidesteps the optimization difficulties of end‑to‑end compression and delivers both theoretical scalability (O(log N) depth) and practical efficiency (O(1) online inference). This work opens avenues for applying invertible hierarchical compression to other modalities such as language, audio, or multimodal streams, where long‑range dependencies and memory constraints are equally challenging.

