Exploring the Design Space of Transition Matching

Reading time: 5 minutes

📝 Original Info

  • Title: Exploring the Design Space of Transition Matching
  • ArXiv ID: 2512.12465
  • Date: 2025-12-13
  • Authors: Uriel Singer, Yaron Lipman

📝 Abstract

Transition Matching (TM) is an emerging paradigm for generative modeling that generalizes diffusion and flow-matching models as well as continuous-state autoregressive models. Like previous paradigms, TM gradually transforms noise samples into data samples; however, it uses a second "internal" generative model to implement the transition steps, making the transitions more expressive than those of diffusion and flow models. To make this paradigm tractable, TM employs a large backbone network and a smaller "head" module to efficiently execute the generative transition step. In this work, we present a large-scale, systematic investigation into the design, training, and sampling of the head in TM frameworks, focusing on its time-continuous bidirectional variant. Through comprehensive ablations and experimentation involving training 56 different 1.7B text-to-image models (resulting in 549 unique evaluations), we evaluate the effect of the head module's architecture and modeling during training, as well as a useful family of stochastic TM samplers. We analyze the impact on generation quality, training, and inference efficiency. We find that TM with an MLP head, trained with a particular time weighting and sampled with a high-frequency sampler, provides the best ranking across all metrics, reaching state-of-the-art among all tested baselines, while a Transformer head with sequence scaling and low-frequency sampling is a runner-up that excels at image aesthetics. Lastly, we believe the experiments presented highlight the design aspects that are likely to provide the most quality and efficiency gains, while also indicating which design choices are unlikely to provide further gains.

💡 Deep Analysis

[Figure 1: qualitative comparison of samples across models]

📄 Full Content

Exploring the Design Space of Transition Matching

Uriel Singer, Yaron Lipman (FAIR at Meta)
Correspondence: Uriel Singer at urielsinger@meta.com

1 Introduction

Transition Matching (TM) Shaul et al.
(2025) is a recent generalization of several media generative paradigms, including diffusion models Sohl-Dickstein et al. (2015); Ho et al. (2020); Song et al. (2020), flow matching models Lipman et al. (2022); Liu et al. (2022); Albergo and Vanden-Eijnden (2022), and continuous-state autoregressive image generation Li et al. (2024); Team et al. (2025). It offers new design choices beyond the scope of these former paradigms and has already been shown to yield improved image quality and/or more efficient sampling at inference time. In this work we focus on TM's continuous-time bidirectional variant, which, similarly to previous paradigms, learns a transition function (kernel) that gradually transfers a source (noise) sample X0 to a target (data) sample X1 by iteratively producing future samples Xt′ from previous samples Xt, 0 ≤ t < t′ ≤ 1. Differently from previous work, TM models the transition kernel with a second "internal" generative model, offering more expressive transition kernels than, e.g., diffusion models, which use a factorized (i.e., independent in each coordinate) multivariate Gaussian kernel. To keep things tractable, TM adopts a backbone-head paradigm, in which:

  • The backbone (typically a large transformer) encodes the current state Xt as well as conditioning information, producing a rich latent representation per input token.
  • The head (typically much smaller than the backbone) is a learnable module tasked with translating backbone latent representations into concrete transition outputs, producing the next state Xt′ with t′ > t.

While backbone architectures for diffusion models have been, and still are, thoroughly investigated (e.g., Peebles and Xie (2022)), a systematic exploration of head architecture and hyperparameters is lacking in the current literature.
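The backbone-head transition loop described above can be sketched as follows. This is only a toy illustration of the control flow: the weights are frozen random matrices standing in for a learned transformer backbone and generative head, and all names and dimensions (W_bb, W_hd, DIM, n_steps, etc.) are made up for this sketch, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "weights" standing in for learned networks.
DIM, HID = 8, 16
W_bb = rng.normal(size=(DIM + 1, HID))    # backbone: (X_t, t) -> latent h
W_hd = rng.normal(size=(HID + DIM, DIM))  # head: (h, eps) -> X_{t'}

def backbone(x_t, t):
    """Encode the current state X_t plus the scalar time t into a latent."""
    inp = np.concatenate([x_t, [t]])
    return np.tanh(inp @ W_bb)

def head(h, eps):
    """Small head: fresh noise eps makes each transition generative,
    rather than a deterministic update as in an ODE flow step."""
    return np.concatenate([h, eps]) @ W_hd

def tm_sample(n_steps=8):
    """Iterate X_t -> X_{t'} along a grid 0 = t_0 < ... < t_K = 1."""
    x = rng.normal(size=DIM)                   # X_0 ~ noise
    for t in np.linspace(0.0, 1.0, n_steps + 1)[:-1]:
        h = backbone(x, t)                     # backbone latent for X_t
        x = head(h, rng.normal(size=DIM))      # one generative transition
    return x                                   # approximates X_1 ~ data

print(tm_sample().shape)  # (8,)
```

The key structural point the sketch captures is the asymmetry: the expensive backbone runs once per transition step to produce the latent, while the cheap head consumes that latent plus fresh noise to realize the stochastic transition kernel.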
Most existing works treat the head as a fixed, minimal component (often a single MLP or a lightweight mapping) without investigating how variations in design might impact model behavior and efficiency (Li et al., 2024; Fan et al.; Team et al., 2025; Shaul et al., 2025). In fact, due to its particular role in the generative process and its small relative size, the head design holds much potential for improving model performance by exploring head-specific architectures and different scaling laws.

Figure 1 Samples comparing our best D-TM MLP (DTM++) and Transformer (DTM+) models against DTM, FM-lognormal, AR, MAR, AR-discrete, and MAR-discrete baselines. All models share a similar architecture and training recipe. Prompts: "a fox"; "a stone bust next to an egg and an eggplant"; "a tiger wearing a tuxedo"; "Greek statue of a man comforting a cat. The cat has a big head. The man looks angry."; "A high resolution photo of a large bowl of ramen. There are several origami boats in the ramen of different colors."

In this paper we take on this opportunity and explore different design choices for the head in both the training and inference stages, with the goal of improving one or more of the three main


Reference

This content was AI-processed from open-access arXiv data.
