Multi-view biomedical foundation models for molecule-target and property prediction
Quality molecular representations are key to foundation model development in biomedical research. Previous efforts have typically focused on a single representation or molecular view, each of which may have strengths or weaknesses on a given task. We develop Multi-view Molecular Embedding with Late Fusion (MMELON), an approach that integrates graph, image, and text views in a foundation model setting and may be readily extended to additional representations. Single-view foundation models are each pre-trained on a dataset of up to 200M molecules. The multi-view model performs robustly, matching the performance of the highest-ranked single-view model. It is validated on over 120 tasks, including molecular solubility, ADME properties, and activity against G protein-coupled receptors (GPCRs). We identify 33 GPCRs that are related to Alzheimer's disease and employ the multi-view model to select strong binders from a compound screen. Predictions are validated through structure-based modeling and identification of key binding motifs.
💡 Research Summary
The paper introduces MMELON (Multi‑view Molecular Embedding with Late Fusion), a biomedical foundation model that jointly leverages three complementary molecular representations: a 2‑D image view, a graph view, and a text (SMILES) view. Each view is pre‑trained on a massive corpus of up to 200 million molecules curated from PubChem and ZINC22. The image encoder adopts the ImageMol CNN architecture trained on 10 M PubChem images; the text encoder follows MolFormer, a transformer that processes SMILES sequences; the graph encoder is based on TokenGT, a graph‑transformer that tokenizes atoms and bonds. For graph and text pre‑training, three self‑supervised tasks are used: node feature masking, edge prediction, and a novel Betti‑number (topological) prediction, encouraging the models to capture both local chemistry and global topology.
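The Betti-number targets used in the topological pre-training task are well defined for a molecular graph treated as a one-dimensional complex: β₀ counts connected components and β₁ counts independent cycles (the circuit rank, E − V + β₀). The paper does not publish its implementation, so the sketch below is only a minimal illustration of how such targets could be computed from an atom count and a bond list:

```python
from collections import defaultdict

def betti_numbers(num_atoms, bonds):
    """Betti numbers of a molecular graph viewed as a 1-dim complex:
    beta0 = number of connected components,
    beta1 = number of independent cycles = E - V + beta0 (circuit rank).
    `bonds` is a list of (atom_index, atom_index) pairs."""
    adj = defaultdict(list)
    for u, v in bonds:
        adj[u].append(v)
        adj[v].append(u)

    # count connected components with a depth-first traversal
    seen = set()
    beta0 = 0
    for start in range(num_atoms):
        if start in seen:
            continue
        beta0 += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)

    beta1 = len(bonds) - num_atoms + beta0
    return beta0, beta1

# benzene ring skeleton: 6 atoms, 6 bonds -> one component, one cycle
benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(betti_numbers(6, benzene))  # (1, 1)
```

A model pre-trained on this task must predict (β₀, β₁) from the molecular input, which rewards attention to global ring structure rather than purely local features.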
After independent pre‑training, the three encoders feed into a late‑fusion aggregator. The aggregator computes a weighted sum of the three view embeddings, where the weights (αₘ) are learnable parameters that are fine‑tuned together with downstream heads. This design makes the contribution of each view transparent and adaptable to each specific task.
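The late-fusion step described above reduces to a convex combination of the per-view embeddings. The paper does not specify the exact parameterization, so the following is a minimal sketch assuming the learnable α parameters are normalized with a softmax so the resulting weights are positive and readable as per-view importances:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fusion(view_embeddings, alpha):
    """Combine per-view embeddings z_m into one vector as
    z = sum_m w_m * z_m with w = softmax(alpha).
    In training, alpha would be learned jointly with the
    downstream task head; here it is just a fixed array."""
    weights = softmax(alpha)
    z = sum(w * z_m for w, z_m in zip(weights, view_embeddings))
    return z, weights

# toy example: three 4-d embeddings standing in for graph, image, text views
views = [np.array([1.0, 0.0, 0.0, 0.0]),
         np.array([0.0, 1.0, 0.0, 0.0]),
         np.array([0.0, 0.0, 1.0, 0.0])]
alpha = np.array([0.0, 0.0, 0.0])  # equal weights before any fine-tuning
z, w = late_fusion(views, alpha)
print(np.round(w, 3))  # -> [0.333 0.333 0.333]
```

Because the fused embedding is linear in the views, inspecting `w` after fine-tuning directly shows which view the model relied on for a given task, which is the transparency property the authors highlight.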
The authors first validate the quality of the embeddings by sampling 100 k molecules and measuring correlations between Euclidean distances in the learned spaces and Tanimoto distances of four classic fingerprints (Morgan, Atom‑Pair, MACCS, torsion). Text and graph embeddings are highly correlated with each other (≈ 0.7), while the image embedding is more orthogonal to the other views and aligns most closely with MACCS keys, confirming that each view captures distinct chemical information.
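The two distance measures being compared in this analysis are standard. The study computes the fingerprints with cheminformatics tooling; the pure-Python sketch below only illustrates the measures themselves, with fingerprints represented as sets of on-bit indices (an assumption for readability):

```python
import math

def tanimoto_distance(fp_a, fp_b):
    """1 - Tanimoto similarity, with fingerprints as sets of on-bits:
    sim = |A ∩ B| / |A ∪ B|."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as identical
    return 1.0 - len(fp_a & fp_b) / union

def pearson(xs, ys):
    """Pearson correlation between two equal-length distance lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# sanity checks on the distance: identical -> 0, disjoint -> 1
print(tanimoto_distance({1, 2, 3}, {1, 2, 3}))  # 0.0
print(tanimoto_distance({1, 2}, {3, 4}))        # 1.0
```

The validation in the paper then amounts to computing `pearson` (or a rank correlation) between pairwise embedding distances and pairwise `tanimoto_distance` values over the sampled molecule pairs.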
Performance is evaluated on more than 120 downstream tasks, including the MoleculeNet benchmark suite (classification and regression), CYP450 inhibition assays, and the ComputationalADME dataset. In single‑view experiments, the graph model generally outperforms image and text models, but MMELON matches or exceeds the best single‑view result on every task, never producing a “poor” outlier. Heat‑maps of the learned α weights reveal that the graph view dominates most tasks, yet the image view receives higher weight on solubility‑related tasks (ESOL, Lipophilicity) and the text view on tasks heavily dependent on SMILES‑encoded substructures. This dynamic weighting demonstrates the practical benefit of multi‑view fusion.
To showcase real‑world utility, the authors identify 33 GPCRs implicated in Alzheimer’s disease (AD) and fine‑tune MMELON on binding assay data for over 100 GPCRs. They then screen a library of gut microbiome metabolites and FDA‑approved drugs, selecting top candidates predicted to bind the AD‑related GPCRs. Structural docking and pharmacophore analysis confirm that the predicted binders occupy key interaction motifs, outperforming predictions from any single‑view model.
Key contributions of the work are: (1) massive multi‑view pre‑training on 200 M molecules, (2) a simple yet interpretable late‑fusion architecture that yields view‑specific importance scores, (3) extensive benchmarking that demonstrates robustness across diverse chemical and biological tasks, and (4) a successful application to AD‑related GPCR target discovery. Limitations include reliance on 2‑D representations (no explicit 3‑D conformational data), a linear fusion that may miss higher‑order interactions, and potential bias inherited from the pre‑training corpus. Future directions suggested are incorporation of 3‑D voxel or graph‑conformer views, and replacement of the linear aggregator with a cross‑attention transformer to capture richer multimodal interactions.
In summary, MMELON establishes that integrating graph, image, and text molecular views within a foundation‑model framework yields a versatile, high‑performing, and interpretable tool for molecular property prediction and drug‑target discovery, advancing the state of the art beyond single‑view approaches.