Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model
Ahmet Halici*¹, Ece Tuğba Cebeci*¹, Musa Balci*¹, Mustafa Çini¹, and Serkan Sökmen¹
¹ViseurAI
*These authors contributed equally to this work

Abstract

Generating diagnostic text from histopathology whole-slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain-specific language. We propose a hierarchical vision–language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi-resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian-variance and HSV-based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6-layer Transformer decoder that generates diagnostic text via cross-attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval-based verification step that compares generated reports with a reference corpus using Sentence-BERT embeddings; if a high-similarity match is found, the generated report is replaced with the retrieved ground-truth reference to improve reliability.

1 Introduction

Histopathological examination remains the clinical reference standard for cancer diagnosis, requiring expert pathologists to interpret complex morphological patterns across cellular, tissue, and architectural levels [1]. While the digitization of pathology has enabled discriminative deep learning for tasks such as tumor classification and segmentation [2, 3], recent work has increasingly explored generative models for producing textual outputs from images.
Automated Histopathology Report Generation (AHRG) extends slide-level prediction by synthesizing coherent, clinically appropriate natural-language descriptions directly from whole-slide images (WSIs). A central difficulty in AHRG is the disparity between the scale of the visual input and the semantic density of the textual output. A single WSI often exceeds 10^10 pixels, rendering standard vision–language architectures, typically designed for natural images at 224 × 224 resolution, computationally intractable. Traditional Multiple Instance Learning (MIL) methods [13, 14] effectively aggregate features for slide-level prediction but often lack the fine-grained spatial grounding required for descriptive text generation.

Recent advances have attempted to bridge this gap through two primary avenues: domain-specific foundation models and Multimodal Large Language Models (MLLMs). Foundation models like UNI [4] and H-optimus-1 [23] provide robust, self-supervised feature representations, yet they lack inherent text generation capabilities. Conversely, MLLMs such as WSI-LLaVA [25] and ChatEXAONEPath [27] adapt general-purpose LLMs to pathology through instruction tuning. However, these end-to-end systems face significant hurdles: they are computationally expensive to train, often require massive token pruning that risks discarding rare diagnostic features, and are prone to hallucinations, i.e., plausible but factually incorrect statements [8, 28].

In this work, we present a modular, hierarchical vision–language framework that emphasizes computational efficiency and diagnostic reliability. Rather than training an end-to-end MLLM, we pair a frozen pathology encoder with a lightweight, domain-adapted decoder. Our main contributions are:

1. We propose a hierarchical pyramidal scanning strategy (downsampling 2^3 to 2^6) that follows a coarse-to-fine workflow and uses simple, interpretable filters to prioritize tissue regions while suppressing background and common artifacts.

2. We integrate the UNI encoder [4] as a frozen feature extractor and train a lightweight Transformer decoder on top of its 1024-dimensional visual tokens, avoiding end-to-end retraining of the visual backbone.

3. We use the BioGPT tokenizer [22] to better represent biomedical terminology and reduce vocabulary mismatch during decoding.

4. We add a retrieval-based verification step that compares generated reports with a reference corpus using Sentence-BERT embeddings, replacing high-similarity matches with retrieved ground-truth references to improve output reliability.

2 Related Work

Computational pathology has expanded from primarily discriminative tasks toward generative settings that require translating visual evidence into structured text. This section reviews pathology foundation models, histopathology-specific MLLMs, and verification strategies for reducing unsupported generation.

2.1 Pathology Foundation Models

Transfer learning from natural images (e.g., ImageNet-pretrained ResNet) has increasingly been complemented by domain-specific self-supervised learning (SSL). Pathology foundation models are trained on large collections of histopathology patches to learn transferable tissue representations. Chen et al. [4] introduced UNI, a ViT-Large model distilled via DINOv2 from over 100 million tissue patches, which improves performance across multiple downstream tasks relative to supervised baselines. H-optimus-1 [23] further scales SSL to slide-level corpora to capture broad morphological variability. While these models provide strong visual representations, they are feature extractors and therefore require a separate decoding component to produce diagnostic text.
2.2 Generative Vision–Language Architectures

Generating text from WSIs requires solving the "semantic gap" between pixel-level features and high-level diagnostic concepts. Early approaches relied on captioning models adapted from natural image domains, often yielding generic descriptions [17]. The current state of the art leverages Multimodal Large Language Models (MLLMs). Quilt-1M [24] facilitated this shift by curating a dataset of over 1 million image–text pairs from educational videos and social media, enabling the training of models such as Quilt-LLaVA. WSI-LLaVA [25] addresses the computational bottleneck of processing gigapixel images through dynamic token pruning, retaining only diagnostically relevant patches to fit within the context window of a LLaMA-based decoder. HistGen [26] proposes a dual-stream architecture that separately aggregates local-region details and global WSI context, allowing the decoder to attend to both cellular atypia and tissue architecture. In contrast to these end-to-end models, our work adopts a modular design and focuses on efficient patch selection and reliable decoding.

2.3 Hallucination and Verification Mechanisms

A critical barrier to clinical adoption is hallucination, where generative models invent features not present in the image. This risk is exacerbated in pathology, where a single incorrect word (e.g., "malignant" vs. "benign") has severe consequences. Recent works have begun to address this through architectural constraints and verification loops. ReinPath [28] employs Reinforcement Learning from Human Feedback (RLHF) with a semantic reward system to penalize non-factual generation. ChatEXAONEPath [27] utilizes Retrieval-Augmented Generation (RAG) to ground model responses in retrieved textbook knowledge or similar historical cases.
Similarly, the AQuA framework [29] demonstrates that statistical anomaly detection can flag realistic-looking hallucinations in generative tasks. Our approach aligns with this trend by incorporating a retrieval-based verification step using Sentence-BERT, providing a scalable method to estimate report confidence without the complexity of RLHF training.

3 Methodology

We propose a comprehensive pipeline for automated histopathology report generation that bridges the gap between gigapixel visual data and coherent text generation. The system consists of three sequential modules: (1) a hierarchical pyramidal patch selection and feature extraction mechanism using the UNI foundation model; (2) a custom Transformer decoder trained to translate visual features into diagnostic text; and (3) a post-processing verification stage.

3.1 Pyramidal Patch Selection

Processing entire gigapixel WSIs at full resolution is computationally prohibitive. We implement a coarse-to-fine pyramidal scanning strategy that processes WSI pyramid levels in descending order (ℓ ∈ {6, 5, 4, 3}), where level 0 represents the base resolution at 40× magnification [3, 15]. Each WSI W comprises a hierarchy of levels with progressively downsampled resolutions:

H_ℓ = H_0 / 2^ℓ,   W_ℓ = W_0 / 2^ℓ    (1)

where H_0 and W_0 are the base resolution dimensions. This hierarchical approach allows the system to capture both broad architectural patterns at low magnification and fine cellular details at higher resolution.

3.1.1 Tissue Segmentation

For each pyramid level ℓ, we generate a thumbnail image and compute a binary tissue mask using HSV color-space thresholding. We classify a pixel at position (x, y) as tissue if:

M_ℓ(x, y) = 1 if S(x, y) > τ_S ∧ V(x, y) > τ_V, else 0    (2)

where τ_S = 20 and τ_V = 30 are empirically determined thresholds that effectively separate H&E-stained tissue from background glass [36].
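The thresholding rule of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the released pipeline code: the function names are ours, and we reproduce OpenCV's 8-bit HSV convention (V = max channel, S scaled to [0, 255]) so that τ_S = 20 and τ_V = 30 are on the same scale as in the text.

```python
import numpy as np

def rgb_to_sv(rgb):
    """Saturation and Value channels in OpenCV's 8-bit convention:
    V = max(R, G, B); S = (V - min) / V, scaled to [0, 255]."""
    rgb = rgb.astype(np.float32)
    v = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    s = np.where(v > 0, (v - mn) / np.maximum(v, 1e-6) * 255.0, 0.0)
    return s, v

def tissue_mask(thumbnail_rgb, tau_s=20, tau_v=30):
    """Binary tissue mask (Eq. 2): a pixel is tissue when its
    saturation AND value both exceed the empirical thresholds,
    separating H&E-stained tissue from bright background glass."""
    s, v = rgb_to_sv(thumbnail_rgb)
    return ((s > tau_s) & (v > tau_v)).astype(np.uint8)
```

On a white background pixel the saturation is 0, so the mask is 0; a stained (pink-ish) pixel passes both thresholds.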
The raw binary mask is refined using morphological operations (opening followed by closing with a 5 × 5 structuring element) to remove noise and consolidate tissue regions:

M_ℓ^refined = (((M_ℓ ⊖ K) ⊕ K) ⊕ K) ⊖ K    (3)

where ⊖ and ⊕ denote morphological erosion and dilation, respectively.

3.1.2 Patch Candidate Generation

Candidate patches of size 256 × 256 pixels are identified on a regular grid with stride s = 256 pixels (non-overlapping tiling). A candidate patch is retained only if its tissue coverage exceeds a minimum threshold:

(1 / 256²) Σ_{i,j} M_{x_c, y_c}(i, j) > 0.1    (4)

where M_{x_c, y_c} is the mask region corresponding to the patch at coordinates (x_c, y_c).

3.2 Quality-Aware Patch Filtering

To ensure only diagnostically informative patches proceed to feature extraction, we implement multi-criteria quality filtering.

3.2.1 Focus Quality via Laplacian Variance

We evaluate focus quality using the variance of the Laplacian operator [37], which measures edge sharpness. For a grayscale patch G derived from RGB patch P, the Laplacian is approximated using convolution with the kernel:

K_Lap = [ 0  1  0 ;  1 −4  1 ;  0  1  0 ]    (5)

The focus measure is the variance of the Laplacian response:

f(P) = Var(∇²G) = (1 / |Ω|) Σ_{(x,y)∈Ω} ( ∇²G(x, y) − mean(∇²G) )²    (6)

where Ω denotes all pixel coordinates in the patch. Patches with f(P) < 40 are rejected as out-of-focus.

3.2.2 Exposure and Artifact Filtering

We analyze the HSV Value and Saturation channels to detect improper exposure. Computing mean Value µ_V and mean Saturation µ_S, we reject patches if:

µ_V ∉ [40, 245]  or  µ_S < 12    (7)

Additionally, we detect artifacts by computing the fraction of dark pixels (grayscale intensity < 30):

ρ_dark = (1 / |Ω|) Σ_{(x,y)∈Ω} 1[ P_gray(x, y) < 30 ]    (8)

Patches with ρ_dark > 0.2 are rejected as likely containing dust, pen marks, or other contamination [38].
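The Laplacian-variance focus measure of Eqs. (5)–(6) can be sketched as follows. This is an illustrative implementation under our own naming, using shifted sums to apply the 4-neighbor Laplacian kernel on the interior of the patch (a "valid" convolution), which avoids any dependency beyond NumPy:

```python
import numpy as np

def laplacian_variance(gray):
    """Focus measure f(P) = Var(Laplacian(G)) from Eqs. (5)-(6).

    Applies the kernel [[0,1,0],[1,-4,1],[0,1,0]] to interior pixels
    via shifted sums, then returns the variance of the response.
    Patches scoring below the paper's threshold of 40 would be
    rejected as out of focus.
    """
    g = gray.astype(np.float32)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1]        # vertical neighbors
           + g[1:-1, :-2] + g[1:-1, 2:]      # horizontal neighbors
           - 4.0 * g[1:-1, 1:-1])            # center, weighted -4
    return float(lap.var())
```

A constant (featureless) patch scores 0 and is rejected, while a high-contrast patch with sharp edges scores far above the threshold.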
3.2.3 Multi-Level Sampling

After quality filtering across all pyramid levels, we impose a maximum budget of N_max = 2500 patches per WSI. If the total number of valid patches exceeds this budget, we perform stratified random sampling proportional to the number of valid patches at each level:

N_ℓ^sample = min( N_ℓ^valid,  N_max · N_ℓ^valid / Σ_k N_k^valid )    (9)

This ensures representation across multiple magnifications.

3.3 Foundation Model Feature Extraction

3.3.1 UNI Vision Transformer Architecture

We utilize the UNI (Universal Pathology) foundation model [4] as our frozen visual encoder. UNI is a Vision Transformer (ViT-Large/16) pre-trained on over 100 million histopathology patches using DINOv2 self-supervised learning [39]. The ViT-Large architecture decomposes each 256 × 256 RGB patch into 16 × 16 = 256 non-overlapping tokens of size 16 × 16 pixels. Each token is flattened and linearly projected to model dimension d_model = 1024. A special classification token [CLS] is prepended, and learnable positional embeddings are added to preserve spatial relationships.

3.3.2 Self-Attention and Encoding

The encoder consists of 24 transformer layers, each applying multi-head self-attention followed by a feed-forward network. For head h, the attention mechanism computes:

Attention_h(Z) = softmax( Q_h K_h^⊤ / √d_k ) V_h    (10)

where Q_h = Z W_Q^(h), K_h = Z W_K^(h), and V_h = Z W_V^(h) are the query, key, and value projections with head dimension d_k = 64 [30]. The outputs of all 16 attention heads are concatenated and projected. After 24 layers with residual connections and layer normalization, the final [CLS] token embedding serves as the patch-level feature vector f ∈ R^1024.

3.3.3 Frozen Encoder Strategy

A critical design decision is keeping the UNI encoder frozen (all 307M parameters fixed) during decoder training.
This eliminates gradient computation through the encoder, reducing GPU memory requirements from approximately 16 GB to 4 GB while retaining the robust morphological representations learned from over 100 million patches. The frozen encoder also enables a modular workflow in which patch-level features are pre-computed once and cached as HDF5 files, decoupling feature extraction from decoder training.

3.3.4 Feature Storage

For each WSI with N selected patches, we extract a feature matrix:

F = [ f_1, f_2, ..., f_N ]^⊤ ∈ R^{N × 1024}    (11)

where f_i = UNI_frozen(P_i).

Table 1: Pipeline configuration and hyperparameters for patch selection and feature extraction.

Stage              | Parameter         | Value
-------------------|-------------------|------------------
Pyramid Scanning   | Levels processed  | {6, 5, 4, 3}
                   | Patch size        | 256 × 256 px
                   | Stride            | 256 px
                   | Tissue threshold  | > 10%
Quality Filtering  | Laplacian var.    | > 40
                   | Value range       | [40, 245]
                   | Saturation thr.   | > 12
                   | Dark pixel frac.  | < 20%
                   | Max patches/WSI   | 2500
Feature Extraction | Model             | UNI (ViT-L/16)
                   | Pre-training data | 100M+ patches
                   | Feature dim.      | 1024
                   | Encoder params.   | 307M (frozen)
                   | Batch size        | 128

3.4 Transformer Decoder Architecture

The core generation module is a custom 6-layer Transformer decoder [30] that conditions text generation on the extracted visual features. Unlike standard encoder–decoder architectures where both components are jointly trained, our design keeps the UNI encoder frozen and trains only the decoder along with a lightweight projection layer. This modular approach significantly reduces computational requirements while leveraging the robust representations learned by the foundation model.

3.4.1 Input Representation and Feature Projection

The visual features extracted from WSI patches form the memory input for the decoder.
Given a feature matrix F ∈ R^{N × 1024} from N selected patches, we first apply a linear projection layer to map these features into the decoder's embedding space:

M = F W_p + b_p    (12)

where W_p ∈ R^{1024 × 1024} and b_p ∈ R^{1024} are learnable parameters. The projected features M ∈ R^{N × 1024} serve as the visual memory that the decoder attends to during text generation.

Since WSIs yield variable numbers of valid patches, we employ a memory key padding mask to handle sequences of different lengths within a batch. This mask ensures that the cross-attention mechanism ignores padded positions, allowing the model to process WSIs with varying tissue content without introducing noise from padding tokens.

3.4.2 Token Embedding and Positional Encoding

For text representation, we employ the BioGPT tokenizer [22], which is specifically optimized for biomedical vocabulary. This choice reduces token fragmentation commonly observed with generic tokenizers when processing domain-specific terminology such as histological grades, cellular descriptions, and diagnostic phrases.

Input tokens are mapped to dense vectors through a learnable embedding layer E ∈ R^{V × 1024}, where V denotes the vocabulary size. To inject sequential information, we add sinusoidal positional encodings [30] to the token embeddings:

PE(pos, 2i) = sin( pos / 10000^{2i/d} )    (13)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d} )    (14)

where pos represents the position index and i denotes the dimension index. This encoding scheme enables the model to capture the sequential nature of diagnostic text without introducing additional learnable parameters.

3.4.3 Decoder Layer Configuration

Each decoder layer comprises three sub-modules: masked multi-head self-attention, multi-head cross-attention, and a position-wise feed-forward network [30]. The architecture employs 8 attention heads with a model dimension of d_model = 1024 and a feed-forward dimension of d_ff = 2048.
Dropout [34] with probability 0.1 is applied after each sub-layer for regularization.

The masked self-attention mechanism enforces autoregressive generation by applying a causal mask that prevents each position from attending to subsequent positions:

Mask_ij = 0 if i ≥ j;  −∞ if i < j    (15)

This ensures that the prediction for position t depends only on the known outputs at positions less than t.

The cross-attention module enables each generated token to attend to the entire set of visual patch features, following the encoder–decoder attention mechanism introduced for sequence-to-sequence tasks [30] and later adapted for image captioning [35]. For a query token at position t, the attention weights over the visual memory are computed as:

α_t = softmax( q_t K^⊤ / √d_k )    (16)

where q_t is the query vector derived from the text representation, and K contains the key vectors projected from the visual memory M. This mechanism allows the decoder to dynamically focus on relevant image regions when generating each diagnostic term.

3.4.4 Training Objective and Optimization

The decoder is trained using teacher forcing [32], where the ground-truth tokens are provided as input during training. The model learns to predict the next token t_{i+1} given the previous tokens t_1, ..., t_i and the visual features F. We minimize the cross-entropy loss over the vocabulary:

L = − Σ_{i=1}^{L} log P( t_i | t_1, ..., t_{i−1}, F )    (17)

where L denotes the sequence length. Padding tokens are excluded from the loss computation to prevent the model from learning trivial predictions.

All decoder parameters are initialized using Xavier uniform initialization [31] to maintain stable gradient flow during early training. We employ the AdamW optimizer [33] with a two-phase learning rate schedule: a warmup phase of 10 epochs at 5 × 10⁻⁵, followed by decay to a base rate of 5 × 10⁻⁶.
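The causal mask of Eq. (15) and the padding-excluded cross-entropy of Eq. (17) can be sketched as follows. This is a toy NumPy illustration of the two rules, not the actual PyTorch training code; the function names are ours:

```python
import numpy as np

def causal_mask(n):
    """Additive attention mask (Eq. 15): 0 where position i may
    attend to position j (i >= j), -inf where j is in the future
    (j > i), so future tokens receive zero attention after softmax."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_cross_entropy(log_probs, targets, pad_id=0):
    """Token-level negative log-likelihood (Eq. 17), summed over
    non-padding positions only.

    log_probs: (L, V) log-softmax outputs; targets: (L,) token ids.
    """
    keep = targets != pad_id
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(nll[keep].sum())
```

Excluding padding positions from the sum is what prevents the model from being rewarded for predicting the (trivially frequent) padding token.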
The output layer projects the final hidden states to vocabulary logits, from which the next-token probability distribution is obtained via softmax.

Table 2: Decoder hyperparameters.

Parameter                     | Value
------------------------------|------------------
Number of layers              | 6
Attention heads               | 8
Model dimension (d_model)     | 1024
Feed-forward dimension (d_ff) | 2048
Dropout rate                  | 0.1
Maximum sequence length       | 64
Vocabulary size               | 42,384 (BioGPT)
Optimizer                     | AdamW
Warmup epochs                 | 10
Warmup learning rate          | 5 × 10⁻⁵
Base learning rate            | 5 × 10⁻⁶
Batch size                    | 64
Total epochs                  | 350

3.5 Retrieval-Based Post-Processing

To mitigate the risk of hallucination, we incorporate a similarity-based correction module. After generating a report, the system encodes the text using a Sentence-BERT model (all-MiniLM-L6-v2) to produce a 384-dimensional semantic embedding. This embedding is compared against a database of ground-truth reports from the training set via cosine similarity. If the cosine similarity between the generated report and the nearest reference exceeds a confidence threshold (τ = 0.85), the system replaces the generated report with the matched ground-truth report, leveraging the assumption that a high-similarity match indicates a reliable reference exists in the training corpus. Reports falling below this threshold are retained as original generations, as they may represent valid but less common diagnostic patterns not well-represented in the reference database.

4 Experimental Setup

4.1 Implementation Details

The pipeline is implemented in PyTorch 2.0. Feature extraction and training were conducted on Azure ML infrastructure with GPU acceleration. The decoder was trained for 350 epochs with a batch size of 64. We utilized the AdamW optimizer with a learning rate schedule comprising a 10-epoch warmup (peaking at 5 × 10⁻⁵) followed by a decay to a base rate of 5 × 10⁻⁶. The sequence length was capped at 64 tokens to align with the typical length of diagnostic summaries.
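The verification rule of Section 3.5 reduces to a cosine-similarity nearest-neighbor lookup with a replace-or-keep decision. A minimal sketch under our own naming, with embeddings assumed to be precomputed (in the pipeline they would come from the all-MiniLM-L6-v2 Sentence-BERT encoder, giving 384-dimensional vectors):

```python
import numpy as np

def verify_report(gen_emb, ref_embs, ref_texts, gen_text, tau=0.85):
    """Retrieval-based post-processing (Section 3.5).

    Computes cosine similarity between the generated report's
    embedding and every reference embedding. If the best match
    reaches the confidence threshold tau, the matched ground-truth
    reference replaces the generation; otherwise the generation is
    kept (it may be a valid but rare diagnostic pattern).
    """
    gen = gen_emb / np.linalg.norm(gen_emb)
    refs = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = refs @ gen                      # cosine similarities
    best = int(np.argmax(sims))
    if sims[best] >= tau:
        return ref_texts[best], float(sims[best])
    return gen_text, float(sims[best])
```

The returned similarity doubles as a confidence estimate for the emitted report, which is the "scalable alternative to RLHF" role the verification step plays in Section 2.3.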
4.2 Dataset and Evaluation

Experiments were conducted on the REG 2025 Grand Challenge dataset, which comprises 10,494 WSI–report pairs collected from five institutions across Korea, Turkey, India, Japan, and Germany. The dataset spans seven organ systems (breast, bladder, cervix, colon, lung, prostate, and stomach) with reports standardized according to College of American Pathologists (CAP) guidelines. The data was divided into 8,494 training samples and two test sets of 1,000 samples each. Evaluation was performed using the composite scoring function described in Section 5, with strict patient-level separation to prevent data leakage.

5 Evaluation Metrics

The REG 2025 Grand Challenge employs a composite scoring function specifically designed for pathology report generation, developed in consultation with clinical experts to balance textual fidelity with diagnostic relevance. The ranking score S_rank integrates four complementary metrics:

S_rank = 0.15 × (S_ROUGE + S_BLEU) + 0.4 × S_KEY + 0.3 × S_EMB    (18)

The individual components capture distinct aspects of report quality. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) quantify n-gram overlap between generated and reference reports, providing a measure of lexical precision. However, in medical contexts, these surface-level metrics alone prove insufficient, as clinically equivalent statements may employ substantially different vocabulary.

The keyword score S_KEY addresses this limitation by computing Jaccard similarity between extracted clinical keyword sets:

S_KEY = |K_gen ∩ K_ref| / |K_gen ∪ K_ref|    (19)

where K_gen and K_ref denote keyword sets extracted from generated and reference reports, respectively. This metric specifically targets the preservation of diagnostically significant terminology such as disease names, grading descriptors, and anatomical locations.
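The composite metric is a weighted sum of the four components; a sketch of Eqs. (18)–(19) with our own function names (the component scores S_ROUGE, S_BLEU, and S_EMB are assumed to be computed elsewhere):

```python
def keyword_score(k_gen, k_ref):
    """Jaccard similarity of clinical keyword sets (Eq. 19)."""
    k_gen, k_ref = set(k_gen), set(k_ref)
    union = k_gen | k_ref
    if not union:
        return 0.0          # no keywords extracted on either side
    return len(k_gen & k_ref) / len(union)

def ranking_score(s_rouge, s_bleu, s_key, s_emb):
    """REG 2025 composite ranking score (Eq. 18): keyword overlap
    weighted 0.4, embedding similarity 0.3, and the two n-gram
    metrics 0.15 each."""
    return 0.15 * (s_rouge + s_bleu) + 0.4 * s_key + 0.3 * s_emb
```

Note that the weights sum to 1.0 (0.15 + 0.15 + 0.4 + 0.3), so a report that is perfect on all four components scores exactly 1.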
The embedding score S_EMB captures semantic equivalence through cosine similarity of sentence embeddings produced by a pre-trained language model:

S_EMB = (e_gen · e_ref) / (‖e_gen‖ ‖e_ref‖)    (20)

The weight distribution reflects clinical priorities: keyword matching receives the highest coefficient (0.4), emphasizing accurate capture of diagnostic terminology over stylistic similarity. Semantic embedding similarity (0.3) acknowledges that pathologically equivalent descriptions may vary in phrasing. Traditional NLG metrics contribute a combined weight of 0.15, serving primarily to penalize gross departures from reference language patterns.

6 Results

Our framework was evaluated on the REG 2025 Grand Challenge. Table 3 presents the Test Phase 2 performance comparison among participating teams. Our submission (MedInsight-ViseurAI) achieved a ranking score of 0.8093, placing 8th among 24 teams and within 4.7% of the top-performing method.

Table 3: REG 2025 Challenge Test Phase 2 Leaderboard (Top 10). Teams are ranked by the composite score defined in Equation 18.

Rank | Team                       | Score
-----|----------------------------|-------
1    | IMAGINE Lab                | 0.8494
2    | ICGI                       | 0.8472
3    | ICL_PathReport             | 0.8415
4    | nw                         | 0.8282
5    | PathX                      | 0.8237
6    | TrustPath                  | 0.8127
7    | FPathX                     | 0.8115
8    | MedInsight-ViseurAI (Ours) | 0.8093
9    | ADCT                       | 0.8040
10   | katherlab                  | 0.7960

6.1 Qualitative Analysis

To provide insight into the model's behavior across diverse organ systems and diagnostic categories, Table 4 presents representative examples of generated reports alongside their corresponding ground-truth references. Each row shows the WSI identifier, the model's output (Generated), and the clinician-authored reference (Ground Truth).

Several patterns emerge from the qualitative examples. Cases such as PIT_01_00004_01 (invasive breast carcinoma), PIT_01_04573_01 (colonic inflammation), and PIT_01_05088_01 (lung squamous cell carcinoma) demonstrate exact or near-exact matches with ground-truth reports, confirming the model's ability to correctly identify organ sites, biopsy procedures, and primary diagnoses for common pathological entities.

However, more complex cases reveal characteristic failure modes. For PIT_01_00005_01, the model generates a diagnosis of invasive ductal carcinoma with grading details, whereas the ground truth specifies ductal carcinoma in situ with architectural subtype and necrosis descriptors, illustrating the challenge of distinguishing invasive from in-situ lesions and generating multi-attribute descriptions. Similarly, for PIT_01_06231_02, the model correctly identifies acinar adenocarcinoma but assigns a Gleason score of 6 (3+3) instead of the correct 7 (3+4), demonstrating that fine-grained grading distinctions remain challenging. Minor discrepancies in biopsy procedure terminology (e.g., "colposcopic" vs. "punch" for PIT_01_03513_01; "endoscopic motome" vs. "sono-guided core" for PIT_01_00575_01) are also observed, though these typically do not affect the diagnostic conclusion.

Since ground-truth labels were not provided for the full test set, detailed per-category performance analysis was not possible. However, the overall qualitative examination confirmed that the model consistently produces outputs in the expected canonical format of [Organ], [biopsy type]; [diagnosis], representing a significant advantage of the encoder–decoder architecture over large-language-model approaches that may occasionally generate extraneous text or deviate from standardized reporting templates.

The retrieval-based post-processing module provided an additional verification mechanism by comparing generated reports against the training corpus using Sentence-BERT embeddings. Reports exceeding the similarity threshold (τ = 0.85) were replaced with the matched ground-truth references, leveraging the assumption that a high-similarity match indicates a reliable reference exists; those below the threshold were retained as original generations, potentially representing valid but less common diagnostic patterns not well-represented in the reference database.

Table 4: Qualitative comparison of generated reports versus ground-truth references across multiple organ systems. Generated reports are produced by our model; ground-truth reports are clinician-authored references.

Slide ID | Generated Report | Ground Truth
PIT_01_00005_01 | Breast, sono-guided core biopsy; 1. Ductal carcinoma of no special type, grade II (Tubule formation: 3, Nuclear grade: 2, Mitoses: 1) 2. Ductal carcinoma in situ | Breast, core-needle biopsy; 1. Ductal carcinoma in situ – Type: Solid – Nuclear grade: High – Necrosis: Present (Comedo-type) 2. Microcalcification
PIT_01_00004_01 | Breast, core-needle biopsy; Invasive carcinoma of no special type, grade II (Tubule formation: 3, Nuclear grade: 2, Mitoses: 1) | Breast, core-needle biopsy; Invasive carcinoma of no special type, grade II (Tubule formation: 3, Nuclear grade: 2, Mitoses: 1)
PIT_01_00575_01 | Breast, endoscopic motome biopsy; Fibrocystic tumor favor change | Breast, sono-guided core biopsy; Fibrocystic change
PIT_01_06231_02 | Prostate, biopsy; Acinar adenocarcinoma, Gleason's score 6 (3+3), grade group 1, tumor volume: 90% | Prostate, biopsy; Acinar adenocarcinoma, Gleason's score 7 (3+4), grade group 2 (Gleason pattern 4: 10%), tumor volume: 90%
PIT_01_03513_01 | Uterine cervix, colposcopic biopsy; Low-grade squamous intraepithelial lesion (LSIL; CIN 1) | Uterine cervix, punch biopsy; Low-grade squamous intraepithelial lesion (LSIL; CIN 1)
PIT_01_04573_01 | Colon, colonoscopic biopsy; Chronic nonspecific inflammation | Colon, colonoscopic biopsy; Chronic nonspecific inflammation
PIT_01_05088_01 | Lung, biopsy; Squamous cell carcinoma | Lung, biopsy; Squamous cell carcinoma

7 Discussion

The modular architecture combining a frozen foundation-model encoder with a lightweight domain-adapted decoder offers several practical advantages. By keeping the UNI encoder frozen, we preserve robust visual representations learned through large-scale self-supervised pre-training while substantially reducing computational requirements. Training only the 6-layer decoder required significantly fewer GPU-hours compared to end-to-end fine-tuning of vision–language models that typically exceed billions of parameters, enabling iterative experimentation within resource-constrained research environments.

A notable advantage of the encoder–decoder design over autoregressive large language models is the structural consistency of generated outputs. Unlike LLMs that sample from probability distributions at each generation step and may produce hallucinated content or formatting artifacts, our decoder employs deterministic generation that consistently adheres to learned report templates. Throughout our experiments, we observed virtually no instances of format violations or out-of-domain text generation, which would be problematic for clinical deployment where reports must conform to standardized structures.

The BioGPT tokenizer's domain-specific vocabulary proved advantageous for pathology terminology. Generic tokenizers frequently fragment medical terms into subword units that disrupt semantic coherence, whereas a tokenizer pre-trained on biomedical corpora reduces effective sequence length for pathological terms, facilitating more robust associations between visual features and complete diagnostic phrases.

The systematic errors observed in complex grading schemas reveal opportunities for architectural improvements. Diagnoses requiring simultaneous specification of multiple semi-independent attributes create a combinatorial space that may be sparsely sampled in training data.
Structured prediction heads that explicitly model attribute dependencies, or auxiliary classification objectives for individual grading components, could provide stronger supervisory signals for rare combinations. The strong performance on regularly structured grading systems such as Gleason scoring, despite their apparent complexity, likely reflects their higher prevalence in the training distribution and more consistent linguistic templates.

Our competitive ranking despite architectural simplicity suggests that careful attention to training procedures, including warmup scheduling and domain-specific tokenization, combined with post-processing verification can partially compensate for reduced model capacity compared to larger multimodal language models.

Several limitations warrant acknowledgment. The absence of ground-truth labels for the test set precludes quantitative per-category error analysis. The retrieval-based correction mechanism, while providing a safety net against obvious errors, may inadvertently suppress valid rare diagnoses not well-represented in the training corpus. Additionally, evaluation on a single challenge dataset limits generalizability assessment, as the REG 2025 data derives from specific institutional contexts that may not transfer to other settings. Finally, the current framework generates only diagnostic summary components; complete clinical reports include additional elements such as gross descriptions and ancillary test recommendations that require separate modeling approaches.

8 Conclusion

We presented a modular vision–language framework for automated histopathology report generation that addresses computational efficiency without sacrificing diagnostic accuracy. The hierarchical pyramidal scanning strategy enables tractable processing of gigapixel WSIs while preserving diagnostically relevant tissue regions.
Integration of the frozen UNI foundation model with a lightweight Transformer decoder, trained using BioGPT tokenization, produces structurally consistent reports that adhere to clinical formatting conventions.

Evaluation on the REG 2025 Grand Challenge yielded a ranking score of 0.8093 in Test Phase 2, placing our method 8th among 24 international teams. Qualitative analysis revealed robust performance in organ identification, procedure classification, and primary disease recognition, with degradation observed primarily in complex multi-attribute grading schemas characteristic of underrepresented diagnostic categories.

The architectural choices prioritizing efficiency and consistency demonstrate that competitive automated report generation is achievable without the substantial computational investment required for end-to-end multimodal large language model training. Future work will explore structured prediction approaches for complex grading schemas and validation across diverse institutional datasets.

Acknowledgments

We thank the organizers of the REG 2025 Grand Challenge for providing the evaluation infrastructure and benchmark dataset. Computational resources were provided by Microsoft Azure Machine Learning. This work was conducted at ViseurAI.

References

[1] Niazi, M.K.K., Parwani, A.V., and Gurcan, M.N. (2019). Digital pathology and artificial intelligence. The Lancet Oncology, 20(5), e253–e261.
[2] Bera, K., Schalper, K.A., Rimm, D.L., Velcheti, V., and Madabhushi, A. (2019). Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nature Reviews Clinical Oncology, 16(11), 703–715.
[3] Campanella, G., et al. (2019). Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 25(8), 1301–1309.
[4] Chen, R.J., et al. (2024). Towards a general-purpose foundation model for computational pathology.
Nature Medicine, 30, 850–862.
[5] Lu, M.Y., et al. (2024). A visual-language foundation model for computational pathology. Nature Medicine, 30, 863–874.
[6] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption generator. CVPR, 3156–3164.
[7] Anderson, P., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. CVPR, 6077–6086.
[8] Liu, G., et al. (2019). Clinically accurate chest X-ray report generation. Machine Learning for Healthcare, 249–269.
[9] Liu, H., et al. (2024). Visual instruction tuning. NeurIPS, 36.
[10] Alayrac, J.B., et al. (2022). Flamingo: A visual language model for few-shot learning. NeurIPS, 35, 23716–23736.
[11] Li, C., et al. (2024). LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. NeurIPS, 36.
[12] Litjens, G., et al. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42, 60–88.
[13] Ilse, M., Tomczak, J., and Welling, M. (2018). Attention-based deep multiple instance learning. ICML, 2127–2136.
[14] Lu, M.Y., et al. (2021). Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering, 5(6), 555–570.
[15] Chen, R.J., et al. (2022). Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. CVPR, 16144–16155.
[16] Vorontsov, E., et al. (2024). Virchow: A million-slide digital pathology foundation model. arXiv preprint arXiv:2309.07778.
[17] Jing, B., Xie, P., and Xing, E. (2018). On the automatic generation of medical imaging reports. ACL, 2577–2586.
[18] Chen, Z., et al. (2020). Generating radiology reports via memory-driven transformer. EMNLP, 1439–1449.
[19] Li, Y., et al. (2018). Hybrid retrieval-generation reinforced agent for medical image report generation. NeurIPS, 31.
[20] Huang, Z., et al. (2023).
A visual-language foundation model for pathology image analysis using medical Twitter. Nature Medicine, 29(9), 2307–2316.
[21] Jaegle, A., et al. (2021). Perceiver: General perception with iterative attention. ICML, 4651–4664.
[22] Luo, R., et al. (2022). BioGPT: Generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), bbac409.
[23] Bioptimus (2025). H-optimus-1. Available at: https://huggingface.co/bioptimus/H-optimus-1.
[24] Ikezogwo, W., et al. (2024). Quilt-1M: One Million Image-Text Pairs for Histopathology. NeurIPS.
[25] Liang, Y., Lyu, X., Ding, M., Chen, W., Zhang, J., Ren, Y., He, X., Wu, S., Yang, S., Wang, X., Xing, X., and Shen, L. (2024). WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image. arXiv preprint arXiv:2412.02141.
[26] Zuo, Y., et al. (2024). HistGen: Histopathology Report Generation via Local-Global Feature Encoding. MICCAI.
[27] Baek, S., et al. (2025). ChatEXAONEPath: An Expert-level Multimodal Large Language Model for Histopathology. arXiv preprint.
[28] Zhang, L., et al. (2026). ReinPath: A Multimodal Reinforcement Learning Approach for Pathology. arXiv preprint arXiv:2601.14757.
[29] Shao, Z., et al. (2025). AQuA: A robust and scalable framework for hallucination detection in virtual tissue staining. Nature Communications.
[30] Vaswani, A., et al. (2017). Attention is all you need. NeurIPS, 30.
[31] Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS, 249–256.
[32] Williams, R.J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
[33] Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. ICLR.
[34] Srivastava, N., et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1), 1929–1958.
[35] Xu, K., et al. (2015). Show, attend and tell: Neural image caption generation with visual attention. ICML, 2048–2057.
[36] Macenko, M., et al. (2009). A method for normalizing histology slides for quantitative analysis. ISBI, 1107–1110.
[37] Pertuz, S., Puig, D., and Garcia, M.A. (2013). Analysis of focus measure operators for shape-from-focus. Pattern Recognition, 46(5), 1415–1432.
[38] Kothari, S., et al. (2013). Removing out-of-focus blur in whole-slide digital pathology images. Journal of Pathology Informatics, 4, 22.
[39] Oquab, M., et al. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.