Minibatch Optimal Transport and Perplexity Bound Estimation in Discrete Flow Matching
Discrete flow matching, a recent framework for modeling categorical data, has shown competitive performance with autoregressive models. However, unlike continuous flow matching, the rectification strategy cannot be applied due to the stochasticity of discrete paths, necessitating alternative methods to minimize state transitions. We propose a dynamic-optimal-transport-like minimization objective and derive its Kantorovich formulation for discrete flows with convex interpolants, where transport cost depends solely on inter-state similarity and can be optimized via minibatch strategies. We show that such methods can reduce the number of transitions by up to a factor of 32 (from 1024 to 32) while reaching the same generative perplexity without compromising diversity. Additionally, path nondeterminism in discrete flows precludes an instantaneous change-of-variables analogue, preventing the precise probability estimation available to continuous flows. We therefore propose two upper bounds on perplexity, enabling principled training, evaluation, and model comparison. Finally, we introduce Multimask Flows, which outperform masked flows in generative perplexity without compromising diversity, particularly when utilizing minibatch Optimal Transport.
💡 Research Summary
This paper addresses a fundamental limitation of discrete flow matching (DFM) for categorical data: unlike continuous flow matching, the rectification strategy cannot be applied because discrete sample paths are stochastic. Consequently, the authors seek to reduce the number of state transitions, which they interpret as a discrete analogue of path‑length minimization.
The first major contribution is the formulation of a dynamic optimal‑transport (OT) objective for DFM. For each time t and each token position i, the transition rate u_t^i(x^i, x_t) is weighted by a symmetric similarity measure s(x^i, x_t^i). The resulting functional (Equation 11) penalizes transitions to dissimilar tokens, thereby encouraging trajectories that stay in place whenever possible. By proving a categorical Benamou‑Brenier‑type theorem (Theorem 3.1), the authors show that this dynamic formulation is exactly equivalent to a static Kantorovich problem with cost c(x_0, x_1) = ∑_{i=1}^L s(x_0^i, x_1^i). When s is the discrete metric 1 − δ, the cost reduces to the Hamming distance, directly minimizing the number of jumps; when s is the squared L2 distance between token embeddings, the cost mirrors the quadratic cost used in continuous flows.
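With the discrete metric as the similarity measure, the static Kantorovich cost is just the Hamming distance between sequences, and exact OT between uniform minibatch marginals reduces to an assignment problem. A minimal sketch (not the authors' code; it assumes NumPy and SciPy, and uses the Hungarian algorithm in place of a regularized solver):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hamming_cost(x0, x1):
    """Pairwise Hamming distance between two batches of token sequences.

    x0: (B, L) int array of source (noise) sequences
    x1: (B, L) int array of target (data) sequences
    Returns a (B, B) cost matrix C with C[a, b] = number of positions
    where x0[a] and x1[b] disagree, i.e. c(x_0, x_1) with s = 1 - delta.
    """
    # Broadcast (B, 1, L) against (1, B, L) and count mismatched positions.
    return (x0[:, None, :] != x1[None, :, :]).sum(axis=-1)

# Toy minibatch: 8 sequences of length 4 over a 5-token vocabulary.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 5, size=(8, 4))
x1 = rng.integers(0, 5, size=(8, 4))

C = hamming_cost(x0, x1)
# Exact (unregularized) OT between uniform minibatch marginals is an
# assignment problem; the Hungarian algorithm gives the optimal pairing.
rows, cols = linear_sum_assignment(C)
print("transitions under OT pairing:      ", C[rows, cols].sum())
print("transitions under identity pairing:", np.trace(C))
```

The OT pairing can never incur more total transitions than the arbitrary identity pairing, which is the mechanism by which minibatch OT reduces the number of jumps the model must learn.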
The second contribution is the practical optimization of this OT objective using minibatch strategies. Rather than estimating the coupling π over the entire dataset, the authors sample small minibatches and solve a regularized OT problem within each batch, following recent minibatch OT techniques. To make minibatch OT applicable to DFM, they introduce Multimask Flows (DFM‑MMF). Instead of a single mask token, they define V_s distinct mask tokens and initialize sequences at t = 0 with only masks, assigning a tiny mass ε = 1/(V_s·L) to each position. This creates a “fictitious grid” of mass that can be transported to the data grid, allowing non‑trivial couplings and thus enabling minibatch OT.
A third contribution tackles the difficulty of evaluating perplexity for DFM, since the instantaneous change‑of‑variables formula is unavailable. The authors derive two upper bounds on perplexity: (1) a bound based on the expected transition cost from the dynamic OT formulation, and (2) a bound that combines the entropy of the coupling π with the transport cost. Both are valid upper bounds, in the sense that the model's true perplexity can never exceed them, and they can be used directly as training objectives.
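The paper's specific bounds are not reproduced here, but the conversion from a likelihood bound to a perplexity bound follows from monotonicity alone. A small sketch of that general relation (the numeric inputs are illustrative, not results from the paper):

```python
import math

def perplexity_upper_bound(nll_bound_nats, num_tokens):
    """Turn an upper bound on total negative log-likelihood (in nats)
    into an upper bound on per-token perplexity.

    Since exp is monotone increasing, NLL <= bound implies
    PPL = exp(NLL / N) <= exp(bound / N), so any certified NLL bound
    yields a certified perplexity bound.
    """
    return math.exp(nll_bound_nats / num_tokens)

# E.g. a model certifying total NLL <= 3500 nats over 1024 tokens:
ppl_bound = perplexity_upper_bound(3500.0, 1024)
```

This is why an upper bound on the (variational) NLL suffices for principled model comparison: ranking models by the bound can only pessimistically overstate, never understate, their perplexity.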
Empirical results on a GPT‑2‑scale model trained on OpenWebText demonstrate that minibatch OT reduces the required inference steps from 1024 to 32—a 32× speed‑up—while preserving the same generative perplexity. Multimask Flows outperform the original masked DFM in perplexity (improvements of 0.2–0.4) without sacrificing sample diversity. Moreover, training with the proposed perplexity bounds yields modest but consistent gains over standard cross‑entropy training.
In summary, the paper makes four key advances: (1) a dynamic‑OT formulation for discrete flows that directly minimizes weighted transition counts, (2) a minibatch OT algorithm that scales to large vocabularies and long sequences, (3) two theoretically grounded perplexity upper bounds that enable principled training and evaluation, and (4) the Multimask Flow architecture that unlocks the benefits of OT while retaining the time‑independent denoising probabilities of masked flows. These contributions collectively push discrete generative modeling closer to the efficiency and theoretical rigor of continuous flow‑based methods.