Title: Exploring the Design Space of Transition Matching
ArXiv ID: 2512.12465
Date: 2025-12-13
Authors: Uriel Singer, Yaron Lipman
📝 Abstract
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, like previous paradigms, gradually transforms noise samples into data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module's architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art performance among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up, excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while also indicating which design choices are not likely to provide further gains.
📄 Full Content
Exploring the Design Space of Transition Matching
Uriel Singer1, Yaron Lipman1
1FAIR at Meta
Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. TM, like previous paradigms, gradually transforms noise samples into data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive compared to diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module's architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art performance among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up, excelling at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while also indicating which design choices are not likely to provide further gains.
Correspondence: First Author at urielsinger@meta.com
1 Introduction
Transition Matching (TM) (Shaul et al., 2025) is a recent generalization of several media generative paradigms, including diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020), flow matching models (Lipman et al., 2022; Liu et al., 2022; Albergo and Vanden-Eijnden, 2022), and continuous-state autoregressive image generation (Li et al., 2024; Team et al., 2025). It offers new design choices that go beyond the scope of these former paradigms and has already been shown to yield improved image quality and/or more efficient sampling at inference time.
In this work we focus on TM's continuous-time bidirectional variant, which, similarly to previous paradigms, learns a transition function (kernel) that gradually transfers a source (noise) sample X0 to a target (data) sample X1 by iteratively producing future samples Xt′ from previous samples Xt, 0 ≤ t < t′ ≤ 1. Differently from previous work, TM models the transition kernel with a second "internal" generative model, offering more expressive transition kernels than, e.g., diffusion models, which use a factorized (i.e., coordinate-wise independent) multivariate Gaussian kernel. To keep things tractable, TM adopts a backbone–head paradigm: the backbone (typically a large transformer) encodes the current state Xt as well as the conditioning information, producing a rich latent representation per input token, while the head (typically much smaller than the backbone) is a learnable module tasked with translating the backbone's latent representations into concrete transition outputs, producing the next state Xt′ with t′ > t.
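To make the backbone–head split concrete, below is a minimal, self-contained PyTorch-style sketch of TM sampling. It is not the authors' implementation: the Backbone, MLPHead, and tm_step names, the module sizes, and the single noise-conditioned forward pass that stands in for the head's internal generative procedure are simplifying assumptions for illustration only.

```python
# Minimal sketch (not the paper's code): a Transition Matching step with a
# large backbone encoding the current state X_t and a small stochastic head
# producing the next state X_{t'}. All shapes and sizes are illustrative.
import torch
import torch.nn as nn


class Backbone(nn.Module):
    """Stand-in for the large transformer backbone (here a tiny MLP)."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Append the scalar time t to every token of the current state X_t.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))


class MLPHead(nn.Module):
    """Small head: given backbone latents and fresh noise, emit the next state."""

    def __init__(self, dim: int, latent: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + latent, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, z: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, h], dim=-1))


@torch.no_grad()
def tm_step(backbone, head, x_t, t):
    """One transition X_t -> X_{t'}: encode once with the backbone, sample with the head."""
    h = backbone(x_t, t)        # rich per-token latents, computed once per transition
    z = torch.randn_like(x_t)   # fresh noise makes the transition kernel stochastic
    # In the full method the head runs its own internal generative procedure
    # conditioned on h; here that is collapsed into one noise-conditioned pass.
    return head(z, h)


if __name__ == "__main__":
    dim, tokens = 16, 64
    backbone, head = Backbone(dim), MLPHead(dim)
    x = torch.randn(2, tokens, dim)            # X_0: pure noise
    times = torch.linspace(0.0, 1.0, steps=9)  # 8 transitions from t=0 toward t=1
    for t in times[:-1]:
        x = tm_step(backbone, head, x, t.view(1, 1, 1))
    print(x.shape)  # torch.Size([2, 64, 16]) -- the final sample X_1
```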
While backbone architectures for diffusion models have been, and still are, thoroughly investigated (e.g., Peebles and Xie, 2022), a systematic exploration of head architecture and hyperparameters is lacking in the current literature. Most existing works treat the head as a fixed, minimal component, often a single MLP or a lightweight mapping, without investigating how variations in its design might impact model behavior and efficiency (Li et al., 2024; Fan et al.; Team et al., 2025; Shaul et al., 2025). In fact, due to its particular role in the generative process and its small relative size, the head design holds much potential for improving model performance by exploring head-specific architectures and different scaling laws. In this paper we take on this opportunity and explore different design
choices for the head in both the training and inference stages, with the goal of improving one or more of the three main

[Figure 1: grid of generated samples. Model labels: DTM++ (ours), DTM+ (ours), DTM, FM-lognorm, AR-dis, MAR-dis, AR, MAR. Prompts: "a fox"; "a stone bust next to an egg and an eggplant"; "a tiger wearing a tuxedo"; "Greek statue of a man comforting a cat. The cat has a big head. The man looks angry."; "A high resolution photo of a large bowl of ramen. There are several origami boats in the ramen of different colors."]
Figure 1 Samples comparing our best D-TM MLP (DTM++) and Transformer (DTM+) models with DTM, FM-lognormal, AR, MAR, AR-discrete, and MAR-discrete as baselines. All models share a similar architecture and training recipe.