Knowledge-enhanced Pretraining for Vision-language Pathology Foundation Model on Cancer Diagnosis

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

Vision-language foundation models have shown great promise in computational pathology but remain primarily data-driven, lacking explicit integration of medical knowledge. We introduce KEEP (KnowledgE-Enhanced Pathology), a foundation model that systematically incorporates disease knowledge into pretraining for cancer diagnosis. KEEP leverages a comprehensive disease knowledge graph encompassing 11,454 diseases and 139,143 attributes to reorganize millions of pathology image-text pairs into 143,000 semantically structured groups aligned with disease ontology hierarchies. This knowledge-enhanced pretraining aligns visual and textual representations within hierarchical semantic spaces, enabling deeper understanding of disease relationships and morphological patterns. Across 18 public benchmarks (over 14,000 whole-slide images) and 4 institutional rare cancer datasets (926 cases), KEEP consistently outperformed existing foundation models, showing substantial gains for rare subtypes. These results establish knowledge-enhanced vision-language modeling as a powerful paradigm for advancing computational pathology.


💡 Research Summary

The paper introduces KEEP (Knowledge‑Enhanced Pathology), a vision‑language foundation model designed specifically for cancer diagnosis in computational pathology. Existing vision‑language models in pathology have achieved impressive results but are fundamentally data‑driven; they rely on large collections of image‑text pairs that are often noisy, sparsely annotated, and lack explicit medical semantics. Consequently, these models struggle with rare tumor subtypes and provide limited interpretability.

To address these shortcomings, the authors construct a comprehensive disease knowledge graph (KG) that integrates the Disease Ontology and UMLS. The KG contains 11,454 disease entities and 139,143 attributes, including synonyms, definitions, and hierarchical (hypernym) relations. This structured medical knowledge serves as a scaffold for both data cleaning and representation learning.
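To make the structure of such a knowledge graph concrete, the sketch below models a disease entry with its synonyms, definition, and hypernym (is-a) relations. This is purely illustrative: the class and field names are assumptions, not the paper's actual schema, and the Disease Ontology identifiers are used only as example keys.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a disease knowledge-graph entry; field names
# are illustrative, not taken from the paper's implementation.
@dataclass
class DiseaseNode:
    doid: str                                     # Disease Ontology ID, e.g. "DOID:2513"
    name: str
    synonyms: list = field(default_factory=list)  # alternative names
    definition: str = ""                          # textual definition
    parents: list = field(default_factory=list)   # hypernym (is-a) relations

    def attribute_count(self) -> int:
        """Count textual attributes: synonyms plus a definition, if present."""
        return len(self.synonyms) + (1 if self.definition else 0)

# Example: a parent-child (hypernym) pair in the hierarchy
carcinoma = DiseaseNode("DOID:305", "carcinoma", synonyms=["epithelial cancer"])
bcc = DiseaseNode(
    "DOID:2513", "basal cell carcinoma",
    synonyms=["BCC", "rodent ulcer"],
    definition="A carcinoma arising from basal cells of the epidermis.",
    parents=["DOID:305"],
)
```

Aggregating such synonym, definition, and hierarchy attributes across all 11,454 disease entities is what yields the 139,143 attributes the paper reports.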

The raw pathology image‑text datasets (OpenPath, Quilt1M, etc.) are first denoised using a YOLOv8 detector that removes irrelevant background and a text‑entity extractor that aligns caption entities with KG concepts. The cleaned pairs are then clustered into 143,000 “semantic groups” based on cosine similarity of disease embeddings derived from the KG; each group reflects a specific node or sub‑node in the ontology hierarchy. This re‑organization transforms a chaotic collection of noisy pairs into a hierarchy‑aware corpus that can guide pre‑training.
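The grouping step above can be sketched with a simple greedy procedure: each caption's KG-derived disease embedding joins the first existing group whose representative exceeds a cosine-similarity threshold, otherwise it seeds a new group. The paper's exact clustering algorithm and threshold are not specified here, so treat this as an assumed stand-in that illustrates the idea.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def group_by_similarity(embeddings: np.ndarray, threshold: float = 0.9):
    """Greedy single-pass grouping: an item joins the first group whose
    representative embedding is within `threshold` cosine similarity,
    else it starts a new group. (Illustrative; the paper's clustering
    procedure may differ.)"""
    reps, groups = [], []
    for i, e in enumerate(embeddings):
        placed = False
        for g, r in enumerate(reps):
            if cosine_sim(e[None], r[None])[0, 0] >= threshold:
                groups[g].append(i)
                placed = True
                break
        if not placed:
            reps.append(e)     # first member represents the group
            groups.append([i])
    return groups
```

Applied to disease embeddings, near-synonymous captions (e.g. two phrasings of the same subtype) collapse into one semantic group, while unrelated diseases stay separate.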

A BERT‑based language encoder is pre‑trained on the KG using metric learning to embed diseases such that hierarchical distances are preserved. For vision‑language pre‑training, a ViT visual encoder processes image tiles while the text encoder processes augmented captions (random cropping, dropout, template‑based paraphrasing). The training objective aligns visual and textual embeddings at multiple semantic levels: (1) node‑level alignment (individual disease concepts) and (2) group‑level alignment (semantic clusters). To mitigate false negatives, the authors implement positive mining, hardest‑negative selection, and explicit false‑negative elimination strategies.
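The false-negative elimination idea can be sketched as an InfoNCE-style image-to-text loss in which pairs sharing a semantic group are masked out of the negative set, so a tile is never pushed away from a caption describing the same disease. This is a simplified numpy sketch under assumed conventions (diagonal positives, temperature `tau`), not the paper's exact objective.

```python
import numpy as np

def knowledge_infonce(img_emb, txt_emb, group_ids, tau=0.07):
    """Image-to-text InfoNCE with false-negative elimination: off-diagonal
    pairs that share a semantic group are masked out of the denominator.
    Simplified numpy sketch; KEEP's actual objective may differ."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau                     # (N, N) similarity logits
    gid = np.asarray(group_ids)
    same_group = gid[:, None] == gid[None, :]
    n = len(gid)
    # False negatives: same semantic group, but not the diagonal positive.
    mask = same_group & ~np.eye(n, dtype=bool)
    logits = np.where(mask, -np.inf, logits)       # exp(-inf) -> 0 in the sum
    # Cross-entropy with the diagonal pair as the positive.
    log_z = np.log(np.exp(logits).sum(axis=1))
    return float(-(np.diag(logits) - log_z).mean())
```

Because masking shrinks the denominator, eliminating a same-group "negative" can only lower (never raise) the loss for that anchor, which is exactly the intended effect of treating it as a non-negative.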

During inference, whole‑slide images (WSIs) are tiled (e.g., 256 × 256 px at 20×). Each tile is embedded by the vision encoder and compared against text prompts of the form “
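The zero-shot inference step described above can be sketched as follows: each tile embedding is matched to its most similar class-prompt embedding by cosine similarity, and tile-level votes are aggregated into a slide-level prediction. The majority-vote aggregation here is an assumption for illustration; KEEP's actual slide-level aggregation may differ.

```python
import numpy as np

def zero_shot_tiles(tile_embs: np.ndarray, prompt_embs: np.ndarray):
    """Assign each tile the class of its nearest text prompt (cosine
    similarity), then take a majority vote over tiles for the slide.
    Simplified sketch of zero-shot WSI classification."""
    t = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = t @ p.T                                # (num_tiles, num_classes)
    tile_preds = sims.argmax(axis=1)              # per-tile class labels
    counts = np.bincount(tile_preds, minlength=p.shape[0])
    return tile_preds, int(counts.argmax())       # tile labels, slide label
```

In practice one prompt embedding per candidate disease (built from KG names and synonyms) would be precomputed once and reused across all tiles of a slide.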

