FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows
Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches remain inefficient: they require large-scale datasets, often with restrictive pairing constraints; incur high computational cost from modeling the joint distribution; and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6x fewer parameters and training 10x faster than prior methods. The project page with code is available at https://yeonwoo378.github.io/official_flowbind.
💡 Research Summary
FlowBind tackles three major bottlenecks of existing flow‑based multimodal generative models: the need for large fully‑paired datasets, high computational cost from modeling a joint distribution over many modalities, and complex multi‑stage training pipelines. The core idea is to introduce a learnable shared latent variable z* that captures common information across all modalities, and to connect each modality's latent z_i to this shared latent via its own invertible flow. For each modality, a linear interpolation path z_i(t) = t z_i + (1 − t) z* is defined, and a time‑dependent velocity field v_i(z_i(t), t) is learned with the flow‑matching loss
L(θ, ϕ) = E_{t, z_S, z*} ∑_{i∈S} ‖v_{θ_i}(z_i(t), t) − (z_i − z*)‖².
Here θ_i are the parameters of the modality‑specific drift networks and ϕ are the parameters of an auxiliary encoder H_ϕ that produces z* from a partially paired sample set z_S. A key technical contribution is a simple training schedule that stops gradients through H_ϕ for all t ∈ (0, 1] and updates the encoder only at t = 0. This prevents the "collapse" in which the encoder outputs a constant vector: at t = 0 the loss reduces to minimizing the conditional variance Var(z_i | z*), forcing z* to retain predictive information about every modality.
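To see why the t = 0 update prevents collapse, note that the interpolation path gives z_i(0) = z*, so the i‑th summand of the loss at t = 0 depends only on z* and z_i. A short derivation in the paper's notation makes the conditional‑variance argument explicit:

```latex
% At t = 0 the interpolation path gives z_i(0) = z^*, so the i-th summand of L is
\mathbb{E}_{z_i, z^*}\bigl\| v_{\theta_i}(z^*, 0) - (z_i - z^*) \bigr\|^2 .
% For a fixed z^*, the optimal velocity is the conditional mean,
v_{\theta_i}^{\star}(z^*, 0) = \mathbb{E}[\, z_i \mid z^* \,] - z^* ,
% leaving the irreducible residual
\mathbb{E}_{z^*}\!\bigl[ \operatorname{Var}(z_i \mid z^*) \bigr] ,
% which the encoder H_\phi can shrink only by making z^* predictive of every z_i.
```

A constant z* maximizes this residual, so the t = 0 gradient pushes the encoder away from degenerate solutions.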
Training can therefore use arbitrarily paired subsets of modalities; each sample only needs the modalities present in S. The shared latent is instantiated as z* = H_ϕ(z_S) and the per‑modality flows are optimized jointly in a single stage, eliminating the need for separate alignment and generation phases used by prior works such as CoDi (text‑anchor) or OmniFlow (joint velocity field).
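The single-stage objective over an available subset S can be sketched in a few lines. The following is an illustrative NumPy version (function and variable names are my own, not the authors' code); since plain NumPy has no autodiff, the stop‑gradient on z* for t > 0 is noted in a comment rather than implemented:

```python
# Illustrative sketch of the FlowBind flow-matching objective over a subset S
# of modalities. Names (flow_matching_loss, velocity_fns, ...) are assumptions,
# not the released implementation.
import numpy as np

def flow_matching_loss(z_dict, z_star, velocity_fns, rng):
    """z_dict:       {modality: (B, D) latents for the available subset S}.
    z_star:       (B, D) shared latent produced by the auxiliary encoder H_phi.
    velocity_fns: {modality: fn(z_t, t) -> predicted velocity}."""
    B = next(iter(z_dict.values())).shape[0]
    t = rng.random((B, 1))  # one interpolation time per sample
    loss = 0.0
    for name, z_i in z_dict.items():
        # Interpolation path z_i(t) = t*z_i + (1-t)*z*; target velocity z_i - z*.
        # In an autodiff framework, z_star would be detached here for t > 0,
        # so the encoder is updated only through the dedicated t = 0 term.
        z_t = t * z_i + (1.0 - t) * z_star
        v_pred = velocity_fns[name](z_t, t)
        loss += np.mean(np.sum((v_pred - (z_i - z_star)) ** 2, axis=1))
    return loss
```

With the oracle velocity v(z_t, t) = (z_t − z*)/t, which equals z_i − z* along the path, the loss vanishes, which is a quick sanity check on the target.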
During inference the auxiliary encoder is not used. Given a source modality i, the model integrates the backward flow v_i from t = 1 to t = 0 to obtain an estimate of z*. It then integrates the forward flow of the target modality j from t = 0 to t = 1 to generate ẑ_j. For multiple source modalities, the backward flows are run independently, their latent estimates are averaged to form a single ẑ*, and the forward flow of the desired target modality is applied. Thus single‑source translation reduces to two ODE solves (one backward, one forward), with only one additional backward solve per extra source modality.
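The inference procedure above can be sketched with a simple Euler integrator; this is a minimal NumPy illustration under assumed names (`translate`, `velocity_fns`), not the authors' solver:

```python
# Sketch of FlowBind inference: backward ODE per source modality to estimate
# the shared latent, averaging, then a forward ODE for the target modality.
import numpy as np

def translate(z_sources, velocity_fns, target, n_steps=100):
    """z_sources:    {modality: (B, D) source latent at t = 1}.
    velocity_fns: {modality: fn(z, t) -> velocity}.
    Returns the generated latent for `target` at t = 1."""
    dt = 1.0 / n_steps
    # Backward ODE per source: integrate v_i from t = 1 down to t = 0.
    estimates = []
    for name, z in z_sources.items():
        v = velocity_fns[name]
        for k in range(n_steps, 0, -1):
            z = z - dt * v(z, k * dt)  # Euler step toward t = 0
        estimates.append(z)
    # Fuse multiple source estimates by simple averaging.
    z_star = np.mean(estimates, axis=0)
    # Forward ODE for the target: integrate v_j from t = 0 up to t = 1.
    z = z_star
    v = velocity_fns[target]
    for k in range(n_steps):
        z = z + dt * v(z, k * dt)
    return z
```

For constant velocity fields the Euler scheme is exact, which makes the two-solve round trip easy to verify end to end.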
Experiments cover three modality pairs: text‑image (MS‑COCO), text‑audio (AudioSet), and image‑audio (VGGSound). FlowBind is compared against state‑of‑the‑art flow models (CoDi, OmniFlow) and recent discrete‑diffusion baselines. Quantitative metrics (FID, IS, MOS) and human evaluations show that FlowBind achieves comparable or slightly better quality while using up to six times fewer parameters and training ten times faster. Notably, when only 30% of the data are fully paired, performance degrades minimally, confirming that the shared latent effectively leverages partially paired data.
Limitations include the reliance on pre‑trained modality‑specific autoencoders to obtain low‑dimensional embeddings, which may be insufficient for very high‑dimensional data such as video or 3D point clouds. Moreover, the flow‑matching loss can be sensitive to ODE integration error, so scaling to high‑resolution images may require more stable solvers or adaptive step sizes. Finally, because each modality’s flow is learned independently, modeling intricate cross‑modal interactions that depend on simultaneous conditioning (e.g., text and image jointly influencing audio) may be less expressive than a fully joint velocity field.
In summary, FlowBind presents a conceptually simple yet powerful framework: a shared latent anchor plus per‑modality invertible flows trained with a single flow‑matching objective. This design dramatically reduces data and compute requirements while preserving the flexibility of any‑to‑any multimodal generation. Future work could explore richer latent regularization (contrastive or adversarial), extensions to higher‑dimensional modalities, and learned fusion mechanisms for multiple source latents, further broadening the applicability of flow‑based generative models.