Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Accurate prediction of protein-ligand binding poses is crucial for structure-based drug design, yet existing methods struggle to balance speed, accuracy, and physical plausibility. We introduce Matcha, a novel molecular docking pipeline that combines multi-stage flow matching with physically-aware post-processing. Our approach consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on an appropriate geometric space ($\mathbb{R}^3$, $\mathrm{SO}(3)$, and $\mathrm{SO}(2)$). We enhance prediction quality through GNINA energy minimization and apply unsupervised physical validity filters to eliminate unrealistic poses. Compared to prior approaches, Matcha demonstrates superior physical plausibility across all considered benchmarks. Moreover, our method runs approximately 31 times faster than modern large-scale co-folding models. The model weights and inference code to reproduce our results are available at https://github.com/LigandPro/Matcha.


💡 Research Summary

Matcha introduces a novel molecular docking pipeline that leverages multi‑stage Riemannian flow matching to generate accurate and physically plausible protein‑ligand binding poses while maintaining high computational efficiency. The core idea is to treat docking as a conditional generative problem on three manifolds: translations in Euclidean space ℝ³, global rotations on the rotation group SO(3), and internal torsional angles on a product of circles SO(2). For each manifold a separate flow‑matching model predicts a velocity field in the tangent space; integrating this field yields a deterministic transport from a noisy initial configuration to the target pose.
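To make the manifold picture concrete, here is a minimal sketch of one Euler step on each of two of the spaces involved: translations in $\mathbb{R}^3$, where the tangent space coincides with the space itself, and a single torsion angle on $\mathrm{SO}(2)$, where the update must wrap around the circle. The toy velocity $u_t = x_1 - x_0$ is the ideal conditional velocity of a linear path, not the paper's learned model:

```python
import numpy as np

def euclidean_flow_step(x, v, dt):
    # One Euler step in R^3: tangent space equals the space itself.
    return x + dt * v

def so2_flow_step(theta, omega, dt):
    # One Euler step on a circle: advance the angle, wrap to [-pi, pi).
    return (theta + dt * omega + np.pi) % (2 * np.pi) - np.pi

# Toy transport: with the ideal conditional velocity u_t = x1 - x0,
# ten Euler steps move x0 exactly onto x1.
x0 = np.array([1.0, -2.0, 0.5])   # noisy initial translation
x1 = np.array([0.0, 0.0, 0.0])    # target position
x, dt = x0.copy(), 1.0 / 10
for _ in range(10):
    x = euclidean_flow_step(x, x1 - x0, dt)
```

In the real pipeline the velocity is predicted by the network at each step; the wrapping in `so2_flow_step` illustrates why torsions need circle-aware updates rather than plain Euclidean addition.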

The pipeline consists of three sequential models, each trained independently with different noise scales to implement a coarse‑to‑fine refinement strategy. Stage 1 uses a large‑variance Gaussian for translations while sampling rotations and torsions uniformly, learning only the translation velocity. Stage 2 narrows the translation variance and still samples angular components uniformly, learning full‑dimensional velocities. Stage 3 employs small variances for all three components, producing a fine‑grained refinement of the pose. During inference, a fixed‑step Riemannian Euler solver (10 steps) integrates the predicted velocities for each stage, discarding angular samples after stage 1 and re‑initializing them uniformly before stage 2.
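The three-stage inference loop described above can be sketched as follows. The 10-step Euler solver and the re-initialization of angular components after stage 1 follow the text; the per-stage velocity models here are hypothetical stand-ins (each just pulls the translation toward a fixed target) so the sketch stays runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

def integrate(state, velocity_fn, n_steps=10):
    # Fixed-step Euler solver: dt = 1/n_steps, t goes from 0 to 1.
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        state = state + dt * velocity_fn(state, t)
    return state

# Hypothetical stage models: each pulls the translation toward a target.
target = np.zeros(3)
stage_models = [lambda x, t: target - x] * 3

# Stage 1: large-variance Gaussian initialization; only the predicted
# translation is kept for the next stage.
trans0 = rng.normal(scale=5.0, size=3)
trans = integrate(trans0, stage_models[0])

# Angular samples are discarded after stage 1 and re-drawn uniformly.
rotation = rng.uniform(-np.pi, np.pi, size=3)   # axis-angle stand-in
torsions = rng.uniform(-np.pi, np.pi, size=4)   # one per rotatable bond

# Stages 2 and 3 refine with progressively smaller noise scales
# (the noise-scale schedule itself is not modelled in this sketch).
trans = integrate(trans, stage_models[1])
trans = integrate(trans, stage_models[2])
```

Because each stage restarts its own 10-step integration, coarse errors from stage 1 are corrected rather than compounded, which is the point of the coarse-to-fine design.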

Input representation combines ligand atom tokens, protein residue tokens, and two CLS‑like tokens that aggregate translation and rotation information. Each token carries a 3‑D coordinate (atom positions, Cα positions, or the ligand centroid) and is embedded with atom‑type features (RDKit) or protein sequence embeddings from ESM‑2 (35 M parameters). Positional information is injected via a simple MLP, while spatial attention biases are constructed from distance‑based radial kernels (Gaussian RBF) and directional vector projections, following the approach of Uni-Mol and AlphaFold‑Multimer. The transformer backbone follows the Diffusion Transformer (DiT) paradigm, with time embeddings conditioning all layers. After the backbone, lightweight heads output a 3‑vector for translation velocity, a 3‑vector representing an element of so(3) for rotation, and per‑bond scalars for torsional velocities.
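A minimal sketch of the Gaussian-RBF distance bias mentioned above, under the assumption (not spelled out in the summary) that pairwise distances are expanded into K radial features which a learned linear layer would then map to one scalar bias per attention head:

```python
import numpy as np

def rbf_distance_bias(coords, centers, gamma=1.0):
    """Pairwise Gaussian RBF features for a spatial attention bias.

    coords:  (N, 3) token coordinates (atoms, C-alpha, or centroid).
    centers: (K,) RBF centres spanning the relevant distance range.
    Returns an (N, N, K) array; in the real model a learned projection
    would reduce the K features to one bias value per attention head.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                   # (N, N)
    return np.exp(-gamma * (dist[..., None] - centers) ** 2)

coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
centers = np.linspace(0.0, 4.0, 8)
bias = rbf_distance_bias(coords, centers)
```

Since the kernel depends only on pairwise distance, the resulting bias is symmetric in the two token indices, which matches its role as an attention prior rather than a directed message.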

Training follows the conditional flow‑matching loss $\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{x_0 \sim p_0,\, x_1 \sim p_{\mathrm{data}},\, t \sim \mathcal{U}[0,1]} \left\| v_\theta(x_t, t) - u_t(x_t \mid x_0, x_1) \right\|^2$, where $x_t$ lies on the conditional path between the noise sample $x_0$ and the data sample $x_1$, and $u_t$ is the corresponding target velocity in the tangent space of the respective manifold.
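For the Euclidean (translation) component this loss is easy to write down explicitly. The sketch below assumes the simple linear conditional path $x_t = (1-t)x_0 + t x_1$ with target velocity $u_t = x_1 - x_0$; on $\mathrm{SO}(3)$ and $\mathrm{SO}(2)$ the path and velocity would live on the manifold instead:

```python
import numpy as np

rng = np.random.default_rng(1)

def cfm_loss(velocity_fn, x0, x1, t):
    # Linear conditional path x_t = (1 - t) x0 + t x1, whose target
    # velocity is the constant u_t = x1 - x0 (Euclidean case only).
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1
    u = x1 - x0
    v = velocity_fn(xt, t)
    return np.mean(np.sum((v - u) ** 2, axis=-1))

x0 = rng.normal(size=(16, 3))   # noise samples from p0
x1 = rng.normal(size=(16, 3))   # data samples from pdata
t = rng.uniform(size=16)

# The oracle velocity drives the loss to zero; a trivial predictor
# that always outputs zero velocity does not.
loss_oracle = cfm_loss(lambda xt, tt: x1 - x0, x0, x1, t)
loss_zero = cfm_loss(lambda xt, tt: np.zeros_like(xt), x0, x1, t)
```

A trained network approximates the oracle in expectation over the sampled pairs, which is what makes the learned velocity field integrable into a transport map at inference time.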

