Title: Exploring the Design Space of Transition Matching
ArXiv ID: 2512.12465
Date: 2025-12-13
Authors: Uriel Singer, Yaron Lipman
📝 Abstract
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, like previous paradigms, gradually transforms noise samples into data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module's architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art performance among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up, excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while also indicating which design choices are not likely to provide further gains.
📄 Full Content
Exploring the Design Space of Transition Matching
Uriel Singer1, Yaron Lipman1
1FAIR at Meta
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, like previous paradigms, gradually transforms noise samples into data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module's architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art performance among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up, excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while also indicating which design choices are not likely to provide further gains.
Correspondence: First Author at urielsinger@meta.com
1 Introduction
Transition Matching (TM) (Shaul et al., 2025) is a recent generalization of several media generative paradigms, including diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), flow matching models (Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2022), and continuous-state autoregressive image generation (Li et al., 2024; Team et al., 2025). It offers new design choices that go beyond the scope of these former paradigms and has already been shown to yield improved image quality and/or more efficient sampling at inference time.
In this work we focus on TM's continuous-time bidirectional variant, which, similarly to previous paradigms, learns a transition function (kernel) that gradually transfers a source (noise) sample X0 to a target (data) sample X1 by iteratively producing future samples Xt′ from previous samples Xt, 0 ≤ t < t′ ≤ 1. Differently from previous work, TM models the transition kernel with a second "internal" generative model, offering more expressive transition kernels than, e.g., diffusion models, which use a factorized (i.e., coordinate-wise independent) multivariate Gaussian kernel. To keep things tractable, TM adopts a backbone–head paradigm: the backbone (typically a large transformer) encodes the current state Xt as well as the conditioning information, producing a rich latent representation per input token, while the head (typically much smaller than the backbone) is a learnable module tasked with translating the backbone's latent representations into concrete transition outputs, producing the next state Xt′ with t′ > t.
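To make the backbone–head split concrete, below is a minimal, self-contained PyTorch-style sketch of TM sampling. It is not the authors' implementation: the Backbone, MLPHead, and tm_step names, the module sizes, and the single noise-conditioned forward pass that stands in for the head's internal generative procedure are simplifying assumptions for illustration only.

```python
# Minimal sketch (not the paper's code): a Transition Matching step with a
# large backbone encoding the current state X_t and a small stochastic head
# producing the next state X_{t'}. All shapes and sizes are illustrative.
import torch
import torch.nn as nn


class Backbone(nn.Module):
    """Stand-in for the large transformer backbone (here a tiny MLP)."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Append the scalar time t to every token of the current state X_t.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


class MLPHead(nn.Module):
    """Small head: given backbone latents and fresh noise, emit the next state."""

    def __init__(self, dim: int, latent: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + latent, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, z: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, h], dim=-1))


@torch.no_grad()
def tm_step(backbone, head, x_t, t):
    """One transition X_t -> X_{t'}: encode once with the backbone, sample with the head."""
    h = backbone(x_t, t)        # rich per-token latents, computed once per transition
    z = torch.randn_like(x_t)   # fresh noise makes the transition kernel stochastic
    # In the full method the head runs its own internal generative procedure
    # conditioned on h; here that is collapsed into one noise-conditioned pass.
    return head(z, h)


if __name__ == "__main__":
    dim, tokens = 16, 64
    backbone, head = Backbone(dim), MLPHead(dim)
    x = torch.randn(2, tokens, dim)            # X_0: pure noise
    times = torch.linspace(0.0, 1.0, steps=9)  # 8 transitions from t=0 toward t=1
    for t in times[:-1]:
        x = tm_step(backbone, head, x, t.view(1, 1, 1))
    print(x.shape)  # torch.Size([2, 64, 16]) -- the final sample X_1
```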
While backbone architectures for diffusion models have been, and still are, thoroughly investigated (e.g., Peebles and Xie, 2022), a systematic exploration of head architecture and hyperparameters is lacking in the current literature. Most existing works treat the head as a fixed, minimal component, often a single MLP or a lightweight mapping, without investigating how variations in its design might impact model behavior and efficiency (Li et al., 2024; Fan et al.; Team et al., 2025; Shaul et al., 2025). In fact, due to its particular role in the generative process and its small relative size, the head design holds much potential for improving model performance by exploring head-specific architectures and different scaling laws. In this paper we take on this opportunity and explore different design
choices for the head in both the training and inference stages, with the goal of improving one or more of the three main

[Figure 1: grid of generated samples. Model labels: DTM++ (ours), DTM+ (ours), DTM, FM-lognorm, AR-dis, MAR-dis, AR, MAR. Prompts: "a fox"; "a stone bust next to an egg and an eggplant"; "a tiger wearing a tuxedo"; "Greek statue of a man comforting a cat. The cat has a big head. The man looks angry."; "A high resolution photo of a large bowl of ramen. There are several origami boats in the ramen of different colors."]
Figure 1 Samples comparing our best D-TM MLP (DTM++) and Transformer (DTM+) models with DTM, FM-lognormal, AR, MAR, AR-discrete, and MAR-discrete as baselines. All models share a similar architecture and training recipe.