NucFuseRank: Dataset Fusion and Performance Ranking for Nuclei Instance Segmentation


Nuclei instance segmentation in hematoxylin and eosin (H&E)-stained images plays an important role in automated histological image analysis, with various applications in downstream tasks. While several machine learning and deep learning approaches have been proposed for nuclei instance segmentation, most research in this field focuses on developing new segmentation algorithms and benchmarking them on a limited number of arbitrarily selected public datasets. In this work, rather than developing new models, we focused on the datasets used for this task. Based on an extensive literature review, we identified manually annotated, publicly available datasets of H&E-stained images for nuclei instance segmentation and standardized them into a unified input and annotation format. Using two state-of-the-art segmentation models, one based on convolutional neural networks (CNNs) and one based on a hybrid CNN and vision transformer architecture, we systematically evaluated and ranked these datasets based on their nuclei instance segmentation performance. Furthermore, we proposed a unified test set (NucFuse-test) for fair cross-dataset evaluation and a unified training set (NucFuse-train), created by merging images from multiple datasets, for improved segmentation performance. By evaluating and ranking the datasets, performing comprehensive analyses, generating fused datasets, conducting external validation, and making our implementation publicly available, we provided a new benchmark for training, testing, and evaluating nuclei instance segmentation models on H&E-stained histological images.


💡 Research Summary

This paper addresses a critical but often overlooked aspect of nuclei instance segmentation in hematoxylin‑and‑eosin (H&E) stained histopathology images: the quality, format, and combinability of publicly available training and testing datasets. While many recent studies focus on developing ever more sophisticated deep‑learning architectures, they typically benchmark those models on a small, arbitrarily chosen subset of public datasets, making it difficult to assess how well a model will generalize to new data. The authors therefore set out to (1) compile a comprehensive list of manually annotated H&E nuclei datasets, (2) standardize their image and mask formats, (3) create a unified test set (NucFuse‑test) that draws equally from each dataset, (4) construct a fused training set (NucFuse‑train) by merging the remaining tiles, and (5) evaluate two state‑of‑the‑art segmentation models—HoVerNeXt (CNN‑based) and CellViT (CNN‑ViT hybrid)—under both single‑dataset and fused‑dataset training regimes.

Dataset selection began with a systematic literature search that identified 24 candidate datasets. The authors applied strict inclusion criteria: fully manual annotations, availability of both images and pixel‑wise masks, and no reliance on semi‑automatic annotation pipelines that could introduce bias from pre‑trained models. Ten datasets satisfied these criteria: PCNS, PUMA, MoNuSeg, CPM17, TNBC, NuInsSeg, CoNSeP, MoNuSAC, DSB, and CryoNuSeg. For each, they recorded the number of image tiles, total annotated nuclei, organ diversity, tile size, and average nuclei density (nuclei per 256 × 256 patch). CryoNuSeg exhibited the highest density (≈63 nuclei/patch), whereas TNBC was the sparsest (≈0.28 nuclei/patch).
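The density statistic used to compare the datasets can be reproduced with a small helper. This is an illustrative sketch, not the authors' code: it assumes every tile in a dataset shares one size, and the function name and sample numbers below are ours.

```python
def nuclei_per_patch(total_nuclei: int, tile_h: int, tile_w: int,
                     n_tiles: int, patch: int = 256) -> float:
    """Average nuclei per patch x patch region, under the simplifying
    assumption that all tiles in a dataset share one height and width."""
    # Express the dataset's total annotated area in units of 256 x 256 patches.
    patch_equivalents = n_tiles * (tile_h * tile_w) / (patch * patch)
    return total_nuclei / patch_equivalents
```

For example (made-up numbers), 10 tiles of 512 × 512 px containing 1,000 annotated nuclei amount to 40 patch-equivalents, i.e. 25 nuclei per patch.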

Standardization involved converting all images to .tif and all masks to .npy, then extracting non‑overlapping 256 × 256 patches. Images smaller than this size received white padding. This uniform preprocessing eliminated format and resolution disparities that previously hampered cross‑dataset comparisons.
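The tiling step can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions (images already loaded as H × W × C arrays, white padding value 255); the function names are ours, not from the paper's released code.

```python
import numpy as np

PATCH = 256  # target patch size used for the unified datasets (256 x 256)

def pad_to_multiple(image: np.ndarray, size: int = PATCH,
                    fill: int = 255) -> np.ndarray:
    """Pad an image with white pixels (value 255) on the bottom/right so
    both spatial dimensions become multiples of `size`."""
    h, w = image.shape[:2]
    ph = (-h) % size  # rows needed to reach the next multiple of `size`
    pw = (-w) % size  # columns needed
    pad = ((0, ph), (0, pw)) + ((0, 0),) * (image.ndim - 2)
    return np.pad(image, pad, mode="constant", constant_values=fill)

def extract_patches(image: np.ndarray, size: int = PATCH) -> list:
    """Split a (padded) image into non-overlapping size x size patches,
    scanning row by row."""
    padded = pad_to_multiple(image, size)
    h, w = padded.shape[:2]
    return [padded[y:y + size, x:x + size]
            for y in range(0, h, size)
            for x in range(0, w, size)]
```

A 300 × 500 RGB tile, for instance, is padded to 512 × 512 and yields four 256 × 256 patches.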

The unified test set, NucFuse‑test, comprises exactly 14 tiles from each of the ten datasets (140 tiles in total). The number 14 matches the smallest official test split (MoNuSeg) and ensures that no single dataset dominates the evaluation. Tiles were drawn preferentially from existing test or validation splits; when none existed, random sampling was used. A minimum tile size of 256 × 256 pixels was enforced, and MoNuSAC tiles containing macrophages were excluded to avoid bias. The remaining tiles from each dataset were pooled to form the training pool, which was then progressively merged into NucFuse‑train (2,739 tiles, ≈248 k nuclei). All generated resources are publicly released on FigShare.
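The per-dataset sampling can be sketched as follows. This is a simplified, hypothetical reconstruction: it prefers tiles from an official test split and falls back to random sampling, but omits details of the actual procedure such as the minimum-size check and the MoNuSAC macrophage exclusion.

```python
import random

TILES_PER_DATASET = 14  # matches the smallest official test split (MoNuSeg)

def build_unified_test_set(datasets: dict, n: int = TILES_PER_DATASET,
                           seed: int = 0):
    """Draw n tiles per dataset for a unified test set; everything not
    selected goes into the fused training pool.
    `datasets` maps a dataset name to {"test": [...], "train": [...]}."""
    rng = random.Random(seed)
    test_set, train_pool = {}, {}
    for name, splits in datasets.items():
        official = list(splits.get("test", []))
        if len(official) >= n:
            # Enough official test tiles: sample the unified split from them.
            chosen = rng.sample(official, n)
        else:
            # Top up with randomly sampled training tiles.
            chosen = official + rng.sample(splits["train"], n - len(official))
        test_set[name] = chosen
        train_pool[name] = [t for t in official + splits["train"]
                            if t not in chosen]
    return test_set, train_pool
```

Pooling the `train_pool` lists across all datasets then yields the fused training set.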

Two segmentation models were selected for benchmarking. HoVerNeXt builds on the ConvNeXt‑V2 encoder and features parallel class and instance decoders; only the instance decoder was used. CellViT employs a Vision Transformer encoder pretrained on histology data and three decoder branches (binary mask, class, distance map); the class and distance branches were disabled. Both models were trained with identical hyper‑parameters on each single dataset and on the fused training set.

Experiment 1 (single‑dataset training) revealed substantial performance variability across datasets. High‑density, multi‑organ datasets such as CryoNuSeg and MoNuSeg achieved the highest average precision (AP ≈ 0.75–0.80) on NucFuse‑test, while low‑density or single‑organ datasets (TNBC, DSB) lagged (AP ≈ 0.55). This demonstrates that nuclei density, organ heterogeneity, and image resolution are strong predictors of a dataset’s generalization power. The authors ranked the ten datasets accordingly, providing a practical guide for researchers who need to select training data for a given application.

Experiment 2 (fused‑dataset training) showed that merging datasets consistently improves performance, especially for models originally trained on small or homogeneous datasets. Training on NucFuse‑train raised the mean AP by 3–5 % for both HoVerNeXt and CellViT, reduced over‑fitting, and produced smoother loss curves. The benefit was most pronounced for datasets that previously performed poorly in isolation, confirming that data diversity acts as a regularizer and enhances the model’s ability to capture varied nuclear morphologies.

A supplementary analysis incorporated the semi‑automatically annotated PanNuke dataset. While inclusion of PanNuke slightly lowered overall AP (by ≈1 %), the massive increase in training samples (≈8 k additional tiles) demonstrated that large, albeit noisier, datasets can still be valuable when combined with high‑quality manual annotations.

The paper’s contributions are threefold: (1) a publicly available, standardized collection of ten H&E nuclei segmentation datasets plus two unified splits (NucFuse‑test and NucFuse‑train); (2) a systematic benchmark and ranking of these datasets using two state‑of‑the‑art models; (3) empirical evidence that dataset fusion improves segmentation accuracy and robustness. By releasing code, data, and detailed evaluation pipelines, the authors promote reproducibility and provide a solid baseline for future work in computational pathology.

In summary, the study shifts the focus from algorithmic novelty to data-centric evaluation, highlighting that careful selection, standardization, and combination of public datasets are essential for building reliable nuclei segmentation models that generalize across diverse histopathology domains.

