DeepMoLM: Leveraging Visual and Geometric Structural Information for Molecule-Text Modeling

Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv paper.

AI models for drug discovery and chemical literature mining must interpret molecular images and generate outputs consistent with 3D geometry and stereochemistry. Most molecular language models rely on strings or graphs, while vision-language models often miss stereochemical details and struggle to map continuous 3D structures into discrete tokens. We propose DeepMoLM (Deep Molecular Language Modeling), a dual-view framework that grounds high-resolution molecular images in geometric invariants derived from molecular conformations. DeepMoLM preserves high-frequency evidence from 1024 × 1024 inputs, encodes conformer neighborhoods as discrete Extended 3-Dimensional Fingerprints, and fuses the visual and geometric streams with cross-attention, enabling physically grounded generation without atom coordinates. DeepMoLM improves PubChem captioning with a 12.3% relative METEOR gain over the strongest generalist baseline while staying competitive with specialist methods. It produces valid numeric outputs for all property queries and attains an MAE of 13.64 g/mol on Molecular Weight and 37.89 on Complexity in the specialist setting. On ChEBI-20 description generation from images, it exceeds generalist baselines and matches state-of-the-art vision-language models. Code is available at https://github.com/1anj/DeepMoLM.


💡 Research Summary

DeepMoLM (Deep Molecular Language Modeling) introduces a dual‑view multimodal framework that tightly couples high‑resolution molecular images with explicit three‑dimensional (3D) geometric information derived from conformer structures. The authors identify three critical gaps in existing chemistry‑focused multimodal models: (1) weak coupling between visual and structural encoders, leading to loss of fine‑grained stereochemical cues; (2) collapse of continuous 3D geometry when projected into discrete token spaces, which destroys rotation/translation invariance and makes enantiomers indistinguishable; and (3) the prohibitive computational cost of processing high‑resolution images with standard Vision Transformers (ViTs) due to quadratic self‑attention scaling.

To address these issues, DeepMoLM comprises three main components. First, the Molecular DeepEncoder processes 1024 × 1024 pixel molecular drawings using a dual‑pathway architecture: a SAM‑Base local Vision Transformer with windowed attention captures fine‑grained bond and stereobond details, while a CLIP‑Large global transformer extracts long‑range structural context. A convolutional token compressor reduces the 4096 local tokens to 256 tokens (16 × 16 grid) without discarding high‑frequency information, preserving the token budget for downstream fusion.
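The compression step above can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact architecture: it assumes a 64 × 64 patch grid (4096 tokens) from the local pathway and uses two stride‑2 convolutions to reach the 16 × 16 grid (256 tokens); the channel sizes and layer choices are placeholder assumptions.

```python
# Hypothetical sketch of a convolutional token compressor: 4096 local ViT
# tokens (a 64x64 grid) are downsampled to 256 tokens (a 16x16 grid) with
# strided convolutions instead of dropping tokens. Dims are illustrative.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim: int = 768, out_dim: int = 1024):
        super().__init__()
        # Two stride-2 convolutions halve each side twice: 64 -> 32 -> 16.
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 4096, dim) from the local pathway.
        b, n, d = tokens.shape
        side = int(n ** 0.5)                     # 64
        x = tokens.transpose(1, 2).reshape(b, d, side, side)
        x = self.conv(x)                         # (B, out_dim, 16, 16)
        return x.flatten(2).transpose(1, 2)      # (B, 256, out_dim)

compressor = TokenCompressor()
out = compressor(torch.randn(2, 4096, 768))
print(out.shape)  # torch.Size([2, 256, 1024])
```

Because the reduction is learned and convolutional, local high‑frequency patterns (e.g. wedge/dash bonds) can be summarized into the surviving tokens rather than discarded outright.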

Second, the 3D geometric stream is built from Extended 3‑Dimensional Fingerprints (E3FP). For each heavy atom, E3FP iteratively aggregates neighbor information over K + 1 radii, hashes the aggregated descriptors, and maps them into a fixed vocabulary of size |F|. These discrete 3D tokens are aligned with SELFIES tokens via an atom‑position bijection ϕ, producing a joint structural sequence that simultaneously encodes topological (1D) and conformational (3D) information.
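The iterative aggregate‑hash‑fold recipe can be sketched as follows. This is a toy sketch of the idea only, not the actual E3FP algorithm (which also encodes 3D shell geometry and stereochemistry): atom descriptors and the neighbor graph are placeholders, and folding uses SHA‑1 modulo a vocabulary size standing in for |F|.

```python
# Toy sketch of hash-based discrete 3D tokenization in the spirit of E3FP:
# per-atom identifiers are iteratively combined with neighbor identifiers over
# K radii, re-hashed, and folded into a fixed vocabulary. Placeholder logic.
import hashlib

def hash_to_vocab(s: str, vocab_size: int) -> int:
    # Deterministic fold of an arbitrary descriptor string into [0, vocab_size).
    digest = hashlib.sha1(s.encode()).digest()
    return int.from_bytes(digest[:4], "big") % vocab_size

def geo_tokens(atom_feats, neighbors, num_radii=3, vocab_size=4096):
    """atom_feats: hashable per-atom descriptors (e.g. element symbols).
    neighbors: adjacency list. Returns one token list per radius 0..K,
    each aligned to atom positions (the bijection phi in the summary)."""
    ids = [hash_to_vocab(str(f), vocab_size) for f in atom_feats]
    all_tokens = [list(ids)]
    for _ in range(num_radii):
        new_ids = []
        for a, nbrs in enumerate(neighbors):
            # Aggregate this atom's id with its (sorted) neighbor ids, re-hash.
            payload = (ids[a],) + tuple(sorted(ids[n] for n in nbrs))
            new_ids.append(hash_to_vocab(repr(payload), vocab_size))
        ids = new_ids
        all_tokens.append(list(ids))
    return all_tokens  # (num_radii + 1) lists of per-atom tokens

# Toy 3-atom chain C-O-C with bonds 0-1 and 1-2.
tokens = geo_tokens(["C", "O", "C"], [[1], [0, 2], [1]])
```

Because each atom keeps its position in every radius's token list, the resulting discrete sequence can be aligned one‑to‑one with atom positions in the SELFIES string, as the summary describes.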

Third, the Multimodal Fusion Projector employs cross‑attention where visual tokens act as queries and the 3D token sequence provides keys and values. After linear projection into a shared hidden dimension (d_h = 4096), the attention mechanism allows each visual token to directly “look up” the corresponding atom‑level geometric descriptor, ensuring that stereochemical invariants are respected. Residual connections, layer normalization, and a feed‑forward network refine the fused representation, yielding H_fused (256 × 4096).
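The fusion block above maps naturally onto a standard cross‑attention layer. The sketch below is an assumption‑laden approximation, not the paper's implementation: the hidden size is reduced from d_h = 4096 for the toy example, and the input dimensions, head count, and FFN width are illustrative.

```python
# Sketch of cross-attention fusion: visual tokens act as queries, the 3D
# structural token embeddings as keys/values, followed by residual + layer
# norm + FFN refinement. All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FusionProjector(nn.Module):
    def __init__(self, d_vis=1024, d_geo=256, d_h=512, n_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(d_vis, d_h)   # visual tokens -> queries
        self.kv_proj = nn.Linear(d_geo, d_h)  # 3D tokens -> keys/values
        self.attn = nn.MultiheadAttention(d_h, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_h)
        self.norm2 = nn.LayerNorm(d_h)
        self.ffn = nn.Sequential(nn.Linear(d_h, 4 * d_h), nn.GELU(),
                                 nn.Linear(4 * d_h, d_h))

    def forward(self, vis, geo):
        q = self.q_proj(vis)                  # (B, 256, d_h)
        kv = self.kv_proj(geo)                # (B, N_atoms, d_h)
        attn_out, _ = self.attn(q, kv, kv)    # each visual token "looks up" atoms
        h = self.norm1(q + attn_out)          # residual + layer norm
        return self.norm2(h + self.ffn(h))    # FFN refinement -> H_fused

fuser = FusionProjector()
h_fused = fuser(torch.randn(2, 256, 1024), torch.randn(2, 30, 256))
print(h_fused.shape)  # torch.Size([2, 256, 512])
```

Note that the output keeps the visual token count (256) regardless of molecule size, so the downstream decoder sees a fixed‑length fused sequence.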

The fused multimodal embedding is fed into a Qwen2‑VL vision‑language decoder. The decoder receives both the fused visual tokens and a textual prompt, and generates captions or property descriptions autoregressively. Training is end‑to‑end, requiring only the image, the SELFIES string, and the conformer coordinates (used to compute E3FP) as inputs; no explicit atom coordinates are needed at inference time.

Experimental evaluation covers three tasks. On PubChem captioning, DeepMoLM achieves a 12.3 % relative improvement in METEOR over the strongest generalist vision‑language baseline and matches specialist methods. For property prediction, the model produces valid numeric answers for all queries and records mean absolute errors of 13.64 g/mol for molecular weight and 37.89 for molecular complexity—substantially better than 2D‑only baselines. In the ChEBI‑20 description generation benchmark, DeepMoLM surpasses generalist baselines and reaches performance comparable to state‑of‑the‑art vision‑language models. Ablation studies confirm that (a) the dual‑pathway encoder preserves stereochemical details, (b) E3FP tokens provide robust geometric grounding, and (c) cross‑attention is essential for effective multimodal fusion.

In summary, DeepMoLM demonstrates that high‑resolution visual perception combined with discrete 3D geometric fingerprints can produce physically grounded, chemically accurate text outputs without requiring explicit coordinate decoding. This work opens the door to more integrated drug‑discovery pipelines, literature mining, and automated chemical reporting where image, structure, and language are jointly understood. Future directions include scaling to larger conformer datasets, incorporating self‑supervised multimodal pretraining, and extending the framework to generative chemistry tasks such as de‑novo molecule design conditioned on textual specifications.

