Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification

Yangmei Chen, Zhongyuan Zhang, Xikun Zhang, Xinyu Hao, Mingliang Hou*, Renqiang Luo*, Ziqi Xu

Abstract: Thyroid nodule classification using ultrasound imaging is essential for early diagnosis and clinical decision-making; however, despite promising performance on in-distribution data, existing deep learning methods often exhibit limited robustness and generalisation when deployed across different ultrasound devices or clinical environments. This limitation is mainly attributed to the pronounced heterogeneity of thyroid ultrasound images, which can lead models to capture spurious correlations rather than reliable diagnostic cues. To address this challenge, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework that accounts for data heterogeneity by learning complementary representations from multiple feature perspectives and refining decision boundaries through a prototype-based correction mechanism with mixed prototype information. By integrating multi-view representations with prototype-level guidance, the proposed approach enables more stable representation learning under heterogeneous imaging conditions. Extensive experiments on multiple thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain evaluation scenarios, leading to improved diagnostic accuracy and generalisation performance in real-world clinical settings. The source code is available at https://github.com/chenyangmeii/Prototype-Enhanced-Multi-View-Learning.

Index Terms: Thyroid nodule classification, Ultrasound imaging, Multi-view learning

I. INTRODUCTION

Thyroid nodules are among the most common diseases of the endocrine system and exhibit a high prevalence in the general population [1].
Accurate differentiation between benign and malignant nodules is therefore critical for guiding clinical decision-making, reducing unnecessary biopsies, and avoiding excessive invasive treatments [2]. Ultrasound imaging is widely adopted as the primary screening modality due to its non-invasive nature, low cost, and real-time capability [3]. However, the visual assessment of thyroid ultrasound images remains highly dependent on clinicians' subjective interpretation, which can vary across experience levels and clinical settings, leading to inconsistent diagnoses and suboptimal decision-making.

Yangmei Chen is with the College of Software, Jilin University, Changchun 130012, China (chenym5523@mails.jlu.edu.cn). Zhongyuan Zhang and Renqiang Luo are with the College of Computer Science and Technology, Jilin University, Changchun 130012, China (zhongyuanz25@mails.jlu.edu.cn, lrenqiang@jlu.edu.cn). Xikun Zhang and Ziqi Xu are with the School of Computing Technologies, RMIT University, Melbourne, VIC 3000, Australia ({xikun.zhang, ziqi.xu}@rmit.edu.au). Xinyu Hao is with the School of Software Technology, Dalian University of Technology, Dalian 116024, China (xihao@dlut.edu.cn). Mingliang Hou is with the Guangdong Institute of Smart Education, Jinan University, Guangzhou 510632, China (teemohold@outlook.com). Corresponding authors: Mingliang Hou, Renqiang Luo.

In recent years, deep learning techniques have been extensively applied to thyroid nodule diagnosis using ultrasound imaging [4], [5]. A wide range of approaches has been explored, including dynamic ultrasound video analysis [6], multimodal deep learning frameworks [7], and hybrid models that integrate traditional machine learning with deep neural networks [8]. These methods demonstrate promising performance in improving classification accuracy and diagnostic efficiency [9].
Moreover, they provide effective technical support for alleviating clinicians' workload, reducing unnecessary invasive procedures, and enhancing diagnostic consistency in clinical practice [10].

Despite recent advances, several critical challenges remain unresolved. When trained models are deployed on datasets collected from different ultrasound devices or clinical environments, their performance often degrades significantly [11], indicating limited robustness and poor generalisation. Although variance pooling strategies and data augmentation techniques have been introduced to mitigate this issue, these approaches remain sensitive to variations in imaging conditions and nodule characteristics, resulting in only marginal performance improvements. This limitation is largely attributed to the pronounced heterogeneity of thyroid ultrasound images [12], which arises from variations in imaging equipment, acquisition protocols, operator expertise, and intrinsic differences in nodule appearance. As illustrated in Figure 1, thyroid nodules sharing the same pathological type can exhibit markedly different visual manifestations in ultrasound images, including variations in echogenicity, margin definition, shape, and internal texture. Such pronounced intra-class heterogeneity may induce spurious correlations during model training, causing deep learning models to rely on non-causal visual cues and consequently undermining their robustness and generalisation across diverse clinical settings.

To address these challenges, we propose PEMV-thyroid, a Prototype-Enhanced Multi-View Learning framework for thyroid nodule ultrasound classification. The proposed approach aims to improve robustness by explicitly accounting for data heterogeneity in the relationship between image representations and diagnostic outcomes.
It comprises two key components: a Multi-View Feature Extraction (MVFE) module and a Prototype-Based Correction (PBC) module. The MVFE module constructs complementary representations from multiple feature perspectives, while the PBC module refines decision boundaries by incorporating mixed prototype information to reduce the influence of spurious correlations. Extensive experiments demonstrate that PEMV-thyroid consistently improves diagnostic accuracy and generalisation performance, underscoring its practical effectiveness for thyroid nodule ultrasound classification.

| Image | Shape | Direction | Margin | Edge | Internal Echo | Back Echo |
|---|---|---|---|---|---|---|
| 1 | Round | Horizontal | Clear | Complete | Hypoechoic | Mixed |
| 2 | Irregular | Horizontal | Unclear | Incomplete | Hyperechoic | Mixed |
| 3 | Oval | Vertical | Unclear | Incomplete | Hypoechoic | Boosted |

Fig. 1. Examples illustrating pronounced intra-class heterogeneity in thyroid ultrasound images, where nodules of the same pathological type exhibit diverse visual manifestations across multiple lesion attributes.

In summary, our main contributions are as follows:

• We propose PEMV-thyroid, a Prototype-Enhanced Multi-View learning framework for thyroid nodule ultrasound classification that accounts for data heterogeneity between image representations and diagnostic outcomes, reducing spurious correlations across diverse clinical settings.
• We design a prototype-based correction mechanism that integrates multi-view representations with mixed prototype information to enable more stable and reliable learning under heterogeneous imaging conditions.
• We conduct extensive experiments on thyroid ultrasound datasets, showing that PEMV-thyroid consistently outperforms state-of-the-art methods, particularly in cross-device and cross-domain scenarios, leading to improved diagnostic accuracy and generalisation.

II. RELATED WORK

Medical image classification aims to automatically predict clinically relevant labels from medical images, thereby providing decision support for disease screening and diagnosis. In this work, we focus on benign-malignant classification of thyroid nodules in ultrasound images. However, thyroid ultrasound images often exhibit speckle noise, low contrast, and substantial appearance variations across imaging devices and operators, which can hinder model generalisation.

Medical image classification has evolved from hand-crafted feature-based methods to deep CNN-based end-to-end learning, and more recently to transformer-based architectures and large-scale pretraining or self-supervised learning paradigms. To address common challenges such as domain shift and imaging style variations, existing studies have sought to improve robustness from both data- and representation-level perspectives. For example, Mixup [13] mitigates overfitting by interpolating and mixing training samples, MixStyle [14] enhances cross-domain generalisation by perturbing feature statistics, and Fishr [15] promotes invariant learning through gradient regularisation. Nevertheless, these methods may be insufficient for addressing disease heterogeneity and its associated confounding factors, often resulting in suboptimal performance in real-world clinical settings.

III. METHODOLOGY

We address thyroid nodule classification in ultrasound, formulated as a binary prediction problem. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ denotes an ultrasound image and $y_i \in \{0, 1\}$ is its label (0: benign, 1: malignant). While conventional classifiers optimise the observational objective associated with $p_\theta(y \mid x)$, the proposed PEMV-thyroid framework is motivated by the presence of unmeasured confounding and aims to learn more stable predictive relationships guided by causal principles.
Specifically, PEMV-thyroid constructs multi-view feature representations as an intermediate mediator $A$ through a Multi-View Feature Extraction (MVFE) module, and subsequently refines this mediator via a prototype-based correction mechanism to obtain $\hat{A}$, which is inspired by the front-door adjustment concept [16], [17] to attenuate confounder-induced variations without explicitly modelling unobserved confounders. The final prediction is produced by feeding the concatenation of a global feature $g$ and the refined mediator $\hat{A}$ into a classifier head:

$$p_\theta(y \mid g, \hat{A}) = \mathrm{softmax}\big(f_c([g; \hat{A}])\big), \tag{1}$$

$$\hat{y} = \arg\max_{c \in \{0, 1\}} p_\theta(y = c \mid g, \hat{A}). \tag{2}$$

During training, we optimise a joint objective that combines the standard classification loss with an additional fusion loss to jointly supervise representation learning and prototype-based correction. The overall architecture of the proposed method is shown in Figure 2.

A. Front-door Adjustment

A major difficulty in thyroid ultrasound classification is that acquisition-related factors (e.g., device settings and operator-dependent scanning) may introduce latent confounding that affects both the observed image appearance $x$ and the diagnostic label $y$. As a result, directly fitting the observational conditional $p(y \mid x)$ can be unstable across domains. From a causal perspective, introducing an intermediate representation that captures disease-relevant evidence transmitted from the image to the label can help attenuate confounder-induced spurious correlations. In this work, PEMV-thyroid adopts such an intermediate representation $A$ as a mediator, inspired by the front-door adjustment principle.

Under the front-door assumptions, the interventional effect can be expressed using only observational quantities as:

$$p(y \mid do(x)) = \sum_{a} p(a \mid x) \sum_{x'} p(y \mid a, x')\, p(x'). \tag{3}$$

Eq. (3) suggests a decomposition into two stages: (i) learning how the image gives rise to an intermediate representation, i.e., $p(a \mid x)$, and (ii) estimating the label distribution conditioned on this representation while marginalising over the image distribution. In the following, we describe how PEMV-thyroid instantiates the mediator $A$ using multi-view feature representations and how a prototype-based correction mechanism is employed to approximate the intervention-inspired effect implied by Eq. (3).

B. Instantiating the mediator via multi-view representations

We implement the mediator-generation term $p(a \mid x)$ by extracting disease-related representations from the input ultrasound image. Specifically, given an image $x$, a backbone network produces a shared feature map, from which we derive (i) a global representation $g$ that summarises holistic semantics, and (ii) a set of $K$ view-specific representations $\{a_k\}_{k=1}^{K}$ that capture complementary evidence. These view-specific features are treated as the mediator, and the aggregated mediator is defined as

$$A = [a_1; a_2; \ldots; a_K], \tag{4}$$

where $[\cdot\,; \cdot]$ denotes concatenation.

The multi-view design is particularly well suited to thyroid ultrasound imaging, where speckle noise, low contrast, and device- or operator-dependent appearance variations can induce spurious shortcuts when relying solely on global features. By decomposing disease evidence into multiple complementary views, the mediator $A$ encourages the model to encode more structured and reusable representations, which subsequently facilitates robustness-oriented correction under heterogeneous imaging conditions.

C. Prototype-based correction of the mediator

To mitigate the influence of unmeasured confounding on the learned mediator, PEMV-thyroid incorporates a prototype-based correction mechanism that refines mediator representations using class-conditional reference patterns.
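The correction mechanism is motivated by the front-door formula in Eq. (3), so it may help to see that formula evaluated once on a concrete case. The sketch below is not part of PEMV-thyroid: it builds a small binary toy model in which a hidden confounder U affects both X and Y while the mediator A depends on X only, and it checks that Eq. (3), computed purely from the observational joint, recovers the true interventional probability. All probability tables and function names are illustrative assumptions.

```python
from itertools import product

# Toy structural model (every number here is an illustrative assumption):
#   U -> X, U -> Y   (U is the unobserved confounder)
#   X -> A -> Y      (A is the mediator and depends on X only)
P_U = {0: 0.6, 1: 0.4}
P_X_GIVEN_U = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # [u][x]
P_A_GIVEN_X = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # [x][a]
P_Y1_GIVEN_AU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # p(y=1|a,u)


def p_y_given_au(y, a, u):
    p_y1 = P_Y1_GIVEN_AU[(a, u)]
    return p_y1 if y == 1 else 1.0 - p_y1


def joint(x, a, y):
    """Observational p(x, a, y), obtained by summing out the hidden U."""
    return sum(
        P_U[u] * P_X_GIVEN_U[u][x] * P_A_GIVEN_X[x][a] * p_y_given_au(y, a, u)
        for u in (0, 1)
    )


def true_do(x, y=1):
    """Ground-truth p(y | do(x)); needs the hidden U, so it is unavailable in practice."""
    return sum(
        P_U[u] * P_A_GIVEN_X[x][a] * p_y_given_au(y, a, u)
        for u, a in product((0, 1), repeat=2)
    )


def observational(x, y=1):
    """Plain conditional p(y | x), which is biased by the confounder."""
    p_x = sum(joint(x, a, yv) for a in (0, 1) for yv in (0, 1))
    return sum(joint(x, a, y) for a in (0, 1)) / p_x


def front_door(x, y=1):
    """Eq. (3): sum_a p(a|x) sum_x' p(y|a,x') p(x'), using observables only."""
    p_x = {xv: sum(joint(xv, a, yv) for a in (0, 1) for yv in (0, 1)) for xv in (0, 1)}
    total = 0.0
    for a in (0, 1):
        p_a_given_x = sum(joint(x, a, yv) for yv in (0, 1)) / p_x[x]
        inner = 0.0
        for x_prime in (0, 1):
            p_xa = sum(joint(x_prime, a, yv) for yv in (0, 1))
            inner += (joint(x_prime, a, y) / p_xa) * p_x[x_prime]
        total += p_a_given_x * inner
    return total


if __name__ == "__main__":
    for x in (0, 1):
        print(x, observational(x), front_door(x), true_do(x))
```

In this toy model `front_door(1)` and `true_do(1)` agree (about 0.628), whereas `observational(1)` is noticeably larger (about 0.724) because the confounder inflates the naive conditional. With high-dimensional images the sums in Eq. (3) cannot be enumerated like this, which is why the paper only approximates the effect.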
For each class $c \in \{0, 1\}$, we maintain a mediator prototype $P_c$, which serves as a class-specific reference representation. In practice, each prototype is updated during training by aggregating mediator features from samples belonging to class $c$, yielding a stable estimate of typical disease-related patterns for that class.

Given a training sample $(x, y)$, we retrieve the corresponding same-class prototype $P_y$ and additionally sample a different-class prototype $P_{\bar{y}}$. These prototypes are jointly leveraged to refine the mediator extracted from $x$, producing a corrected mediator $\hat{A}$. Intuitively, the same-class prototype encourages alignment with class-relevant evidence, while the different-class prototype provides complementary contrast that discourages reliance on confounder-driven shortcuts. Through this refinement process, the corrected mediator becomes more invariant to acquisition-related variations, thereby improving robustness across devices and clinical environments.

[Figure 2 diagram: ultrasound image $x$ → backbone CNN → global feature $g$ and multi-view mediator $A = [a_1; \ldots; a_K]$ (MVFE); prototype bank $\{P_c\}_{c \in \{0, 1\}}$ → retrieve $(P_y, P_{\bar{y}})$; PBC: $(A, P_y, P_{\bar{y}}) \to \hat{A}$; fusion $[g; \hat{A}]$ → classifier → $\hat{y}$; objective $\mathcal{L} = \mathcal{L}_o + \lambda \mathcal{L}_f$.]

Fig. 2. Overview of the proposed PEMV-thyroid framework for thyroid ultrasound classification. The MVFE module extracts multi-view mediator representations, while the PBC module refines these representations using class-conditional prototypes to mitigate spurious correlations under heterogeneous imaging conditions.

Once the corrected mediator $\hat{A}$ is obtained, it is fused with the global representation $g$ for final classification. The model is trained using a joint objective that combines the standard classification loss with an additional fusion loss, which jointly supervises representation learning and prototype-based correction under heterogeneous imaging conditions.

D. Fusion and learning objective

With the corrected mediator $\hat{A}$, PEMV-thyroid performs prediction by jointly leveraging global and mediator-level evidence. Specifically, we concatenate the global representation $g$ with the corrected mediator to form a fused feature $z = [g; \hat{A}]$, which is fed into a classifier head $f_c$ to produce logits and the predictive distribution $p_\theta(y \mid z) = \mathrm{softmax}(f_c(z))$.

The model is trained using a joint learning objective. The first term, $\mathcal{L}_o$, is the standard cross-entropy loss that enforces discriminative learning on the training set. However, optimising $\mathcal{L}_o$ alone may encourage the model to exploit spurious correlations that are predictive only under specific acquisition conditions. To further promote robustness under heterogeneous imaging environments, PEMV-thyroid introduces an additional fusion loss $\mathcal{L}_f$, which provides complementary supervision for the corrected mediator and its fusion with the global representation, encouraging more stable and invariant decision cues. The overall optimisation objective is given by

$$\mathcal{L} = \mathcal{L}_o + \lambda \mathcal{L}_f, \tag{5}$$

$$\mathcal{L}_o = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log\left( \frac{\exp(\hat{y}_{ic})}{\sum_{j=1}^{C} \exp(\hat{y}_{ij})} \right), \tag{6}$$

$$\mathcal{L}_f = -\sum_{x'} P(x') \left[ P(\hat{y}_c)\, l_c \log \frac{\exp(\hat{y}_c)}{\sum_{j=1}^{C} \exp(\hat{y}_j)} + P(\hat{y}_{c'})\, l_{c'} \log \frac{\exp(\hat{y}_{c'})}{\sum_{j=1}^{C} \exp(\hat{y}_j)} \right], \tag{7}$$

where $\lambda$ controls the relative contribution of the fusion loss.

IV. EXPERIMENTS

A. Datasets

In this study, we evaluate the proposed method on two publicly available thyroid ultrasound image datasets, namely TN5000 and TN3K. Both datasets are designed for thyroid nodule analysis and support a binary classification task of distinguishing benign and malignant nodules. They are selected for their clinical relevance, annotated diagnostic labels, and diversity of imaging conditions, which together enable a comprehensive evaluation of model robustness and generalisation.
• TN5000: A thyroid ultrasound image dataset in which each image is annotated with a benign or malignant diagnostic label. The dataset contains images acquired under diverse clinical conditions, including variations in ultrasound devices, imaging parameters, and nodule appearances, providing a realistic benchmark for evaluating robustness and generalisation performance.
• TN3K: A publicly available thyroid ultrasound dataset annotated with benign and malignant labels. As ultrasound is a primary non-invasive modality for thyroid nodule assessment, TN3K has strong clinical relevance for computer-aided diagnosis research and poses additional challenges due to variations in acquisition settings and device configurations.

Following standard practices in medical image classification, all ultrasound images are resized to a fixed resolution and normalised before being fed into the network. Images in both datasets are divided into disjoint training, validation, and test sets, which are used for model optimisation, hyperparameter selection, and final performance evaluation, respectively.

TN5000 consists of 5,000 images with predefined splits following the PASCAL VOC protocol, including 3,500 training, 500 validation, and 1,000 test images (approximately 70%/10%/20%). We strictly follow these official splits and convert the original detection annotations into image-level binary labels without altering the data partitioning. TN3K contains 3,493 images with an official test set of 614 images, while the remaining 2,879 images are split into training and validation sets using an 8:2 ratio, resulting in 2,303 training and 576 validation images (approximately 66%/16%/18%).

During training, data augmentation is applied only to the training images to improve model generalisation, while no augmentation is used for validation or test samples.
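The split sizes quoted for TN3K and TN5000 can be reproduced with a few lines of integer arithmetic. The sketch below is only a consistency check on the reported counts; the helper names are hypothetical, and rounding the 8:2 training share down to a whole image is an assumption that happens to match the reported 2,303/576 split.

```python
def holdout_split(total, test_size, train_parts=8, val_parts=2):
    """Split the non-test images into train/val by an integer ratio (train share rounded down)."""
    remaining = total - test_size
    train = remaining * train_parts // (train_parts + val_parts)
    val = remaining - train
    return train, val, test_size


def rounded_percentages(counts):
    """Express each split as a whole-number percentage of all images."""
    total = sum(counts)
    return tuple(round(100 * c / total) for c in counts)


if __name__ == "__main__":
    tn3k = holdout_split(total=3493, test_size=614)
    print("TN3K counts:", tn3k, "percent:", rounded_percentages(tn3k))

    tn5000 = (3500, 500, 1000)  # official train/val/test counts quoted in the text
    print("TN5000 percent:", rounded_percentages(tn5000))
```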
All data splits are fixed and specified via predefined text files to ensure reproducibility across experiments.

B. Baselines

To validate the effectiveness of PEMV-thyroid for thyroid ultrasound image classification, we compare it with several representative and reproducible baseline methods that are widely adopted in medical image analysis. All methods are trained and evaluated under the same data splits, input preprocessing procedures, and evaluation metrics to ensure a fair comparison. We consider the following baseline methods:

• ResNet18 (ERM) [18]: A standard convolutional neural network trained with empirical risk minimisation is adopted as the primary backbone baseline. This setting serves as a strong and widely used reference for binary thyroid nodule classification.
• Fishr [15]: Fishr is an invariant feature learning method that regularises the variance of gradients across environments to reduce reliance on spurious correlations. In our implementation, Fishr is applied as an additional regularisation term on top of the backbone training objective to enhance robustness under heterogeneous imaging conditions.
• MixStyleNet [14]: MixStyleNet performs feature-level style perturbation by mixing channel-wise statistics, such as mean and variance, during training. This strategy simulates domain and style shifts caused by different ultrasound devices and acquisition settings, making it particularly relevant for ultrasound images with substantial appearance variability.
• MixupNet [13]: MixupNet applies the Mixup strategy to construct virtual training samples by linearly interpolating pairs of input images and their corresponding labels. This regularisation encourages smoother decision boundaries and is commonly used to improve generalisation in medical image classification.
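The Mixup interpolation behind the MixupNet baseline can be illustrated on toy data. The following is a minimal pure-Python sketch, not the implementation used in the experiments: images are represented as flat lists of pixel values, labels as one-hot lists, and the function names and the value `alpha = 0.2` are assumptions chosen for illustration.

```python
import random


def sample_lambda(alpha, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as Mixup does."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two flattened images and their one-hot labels."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


if __name__ == "__main__":
    rng = random.Random(0)                 # fixed seed so the sketch is repeatable
    benign = ([0.0, 1.0], [1.0, 0.0])      # (toy pixels, one-hot label for class 0)
    malignant = ([1.0, 0.0], [0.0, 1.0])   # (toy pixels, one-hot label for class 1)
    lam = sample_lambda(alpha=0.2, rng=rng)
    print(lam, mixup_pair(*benign, *malignant, lam))
```

Because the label is mixed with the same coefficient as the image, the virtual sample carries a soft target rather than a hard class, which is what encourages the smoother decision boundaries mentioned above.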
These baselines represent commonly adopted strategies for improving robustness and generalisation in medical image classification, including empirical risk minimisation, data augmentation, and invariant representation learning. By evaluating PEMV-thyroid against Fishr, MixStyleNet, and MixupNet under a unified experimental protocol, we provide a systematic comparison with methods that address domain variability and spurious correlations from different perspectives.

C. Experimental Setup

All experiments are conducted on a workstation equipped with an NVIDIA L40 GPU. The software environment includes Python 3.8.20, PyTorch 1.10.1, and CUDA 11.3. ResNet18 is adopted as the backbone network for all methods. All thyroid ultrasound images are resized to 128 × 128 pixels. For both the TN5000 and TN3K datasets, models are trained using the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$ and a batch size of 16. All reported results are obtained by averaging over five runs with different random seeds.

D. Main Results

In this section, we present a comprehensive evaluation of PEMV-thyroid against state-of-the-art baselines on two real-world thyroid nodule ultrasound datasets, TN3K and TN5000. The comparison focuses on four commonly used metrics: accuracy (ACC), precision (P), recall (R), and F1-score (F1). Overall, the quantitative results reported in Table I and Table II show that PEMV-thyroid achieves the best accuracy, recall, and F1-score on both datasets, demonstrating its effectiveness in learning robust representations for thyroid nodule classification.

On the TN3K dataset, PEMV-thyroid achieves clear improvements over all baseline methods in accuracy, recall, and F1-score, as summarised in Table I. Specifically, compared with MixupNet, which constructs virtual training samples via linear interpolation, PEMV-thyroid yields improvements of 3.97, 2.64, 10.51, and 7.38 percentage points in accuracy, precision, recall, and F1-score, respectively.
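For reference, the four metrics used in this comparison (ACC, P, R, F1) follow from the binary confusion matrix with malignant as the positive class. The sketch below uses the standard definitions; the function name and the example counts are illustrative assumptions and are not taken from the paper's experiments.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts: 40 malignant nodules caught, 20 missed,
    # 10 benign nodules flagged by mistake, 30 benign nodules cleared.
    print(binary_metrics(tp=40, fp=10, fn=20, tn=30))
```

Recall is the fraction of malignant cases that are detected, which is why a recall gain matters clinically even when precision changes little. Note that the tables report each metric averaged over five runs, so the reported F1 need not equal the harmonic mean of the reported P and R.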
Notably, PEMV-thyroid achieves a substantial gain in recall (60.76% → 71.27%), which is particularly important in clinical diagnosis where missing malignant cases should be minimised. Moreover, PEMV-thyroid attains an ACC of 82.08% and an F1-score of 75.32%, outperforming the strongest baseline Fishr (ACC 79.74%, F1 71.71%). These results indicate that PEMV-thyroid better mitigates the impact of data heterogeneity and reduces reliance on spurious correlations, leading to improved generalisation under challenging imaging conditions.

On the TN5000 dataset, all methods achieve relatively high performance, suggesting a more stable training distribution. As shown in Table II, PEMV-thyroid delivers the best overall performance, achieving 86.50% ACC and 90.99% F1-score, compared with the strongest baseline Fishr (85.82% ACC, 90.55% F1). These results demonstrate that PEMV-thyroid not only excels on more heterogeneous data such as TN3K, but also delivers consistent performance gains on TN5000, highlighting its robustness across different thyroid ultrasound datasets.

TABLE I
COMPARISON OF DIFFERENT METHODS ON THE TN3K DATASET. ALL RESULTS ARE REPORTED IN PERCENTAGE (%), AND THE BEST PERFORMANCE IS HIGHLIGHTED IN BOLD.

| Method | ACC (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| ResNet18 | 79.67 ± 1.96 | **80.88** ± 5.00 | 62.29 ± 4.22 | 70.17 ± 2.76 |
| Fishr | 79.74 ± 3.02 | 77.61 ± 5.87 | 67.88 ± 9.78 | 71.71 ± 5.45 |
| MixupNet | 78.11 ± 2.86 | 77.31 ± 3.42 | 60.76 ± 6.41 | 67.94 ± 5.16 |
| MixStyleNet | 78.96 ± 1.54 | 76.12 ± 2.13 | 66.02 ± 4.21 | 70.63 ± 2.71 |
| PEMV-thyroid | **82.08** ± 1.14 | 79.95 ± 1.11 | **71.27** ± 3.23 | **75.32** ± 2.04 |

TABLE II
COMPARISON OF DIFFERENT METHODS ON THE TN5000 DATASET. ALL RESULTS ARE REPORTED IN PERCENTAGE (%), AND THE BEST PERFORMANCE IS HIGHLIGHTED IN BOLD.

| Method | ACC (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| ResNet18 | 85.68 ± 0.66 | 88.78 ± 1.42 | 92.09 ± 1.37 | 90.39 ± 0.39 |
| Fishr | 85.82 ± 0.46 | 88.28 ± 1.04 | 92.97 ± 1.01 | 90.55 ± 2.06 |
| MixupNet | 85.66 ± 0.60 | 88.75 ± 0.48 | 92.07 ± 1.52 | 90.37 ± 0.50 |
| MixStyleNet | 84.68 ± 0.79 | 87.52 ± 1.19 | 92.23 ± 1.41 | 89.80 ± 0.52 |
| PEMV-thyroid | **86.50** ± 0.55 | **88.88** ± 0.87 | **93.21** ± 0.97 | **90.99** ± 0.36 |

E. Sensitivity Analysis

We analyse the effect of the number of expert networks in the MVFE module by varying num_att from 1 to 9 on the TN3K dataset (Fig. 3). Overall, the performance is sensitive to the choice of num_att but remains relatively stable within a reasonable range. Among all configurations, num_att = 3 achieves the best overall performance, with 82.08% ACC, 79.95% precision, 71.27% recall, and 75.32% F1-score. Increasing the number of experts beyond this setting does not lead to consistent improvements; for example, num_att = 5 results in a noticeable performance drop (78.9% ACC and 64.6% recall), suggesting that an excessive number of experts may introduce optimisation difficulty or overfitting under limited training data. Based on these observations, we adopt num_att = 3 as the default configuration in all experiments.

Fig. 3. Sensitivity analysis of the number of expert networks (num_att) in the MVFE module on the TN3K dataset, evaluated using ACC, P, R, and F1 (%).

F. Ablation Study

We conduct a step-wise ablation study from AB1 to AB5 on the TN3K and TN5000 datasets, with mean ± std results reported in Table III. Introducing the multi-view feature extractor (AB2) consistently improves the ERM baseline (AB1) on both datasets, yielding gains on TN3K in ACC (79.67% → 79.87%) and F1 (70.17% → 71.42%), and similar improvements on TN5000. Adding the prototype-based correction module (AB3) further boosts recall on TN3K from 65.51% to 70.25%, leading to a higher F1-score (71.69%), while the improvement on TN5000 remains marginal, reflecting its more stable data distribution. Incorporating the information-purity factor alone (AB4) causes noticeable performance fluctuations on TN3K, particularly a drop in recall to 62.37%, indicating that a single constraint is insufficient for stable optimisation. By jointly integrating all components, the full model (AB5) achieves the best overall performance on both datasets, with TN3K reaching 82.08% ACC and 75.32% F1, and TN5000 achieving 86.50% ACC and 90.99% F1, demonstrating the complementarity and effectiveness of the proposed framework.

TABLE III
ABLATION RESULTS ON THE TN3K AND TN5000 DATASETS (MEAN ± STD, %). AB1: RESNET18 (ERM BASELINE); AB2: AB1 + MVFE; AB3: AB2 + PBC; AB4: AB3 + IP; AB5: FULL MODEL. THE BEST AND SECOND-BEST RESULTS ARE HIGHLIGHTED IN BOLD AND UNDERLINED, RESPECTIVELY.

TN3K

| Method | ACC (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| AB1 | 79.67 ± 1.96 | **80.88** ± 5.00 | 62.29 ± 4.22 | 70.17 ± 2.76 |
| AB2 | 79.87 ± 1.81 | 79.06 ± 4.75 | 65.51 ± 4.43 | 71.42 ± 2.36 |
| AB3 | 78.73 ± 1.57 | 73.80 ± 4.41 | 70.25 ± 5.44 | 71.69 ± 2.09 |
| AB4 | 78.73 ± 1.44 | 78.85 ± 5.54 | 62.37 ± 8.88 | 68.99 ± 3.71 |
| AB5 | **82.08** ± 1.14 | 79.95 ± 1.11 | **71.27** ± 3.23 | **75.32** ± 2.04 |

TN5000

| Method | ACC (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|
| AB1 | 85.68 ± 0.66 | 88.78 ± 1.42 | 92.09 ± 1.37 | 90.39 ± 0.39 |
| AB2 | 86.10 ± 0.77 | 88.66 ± 0.83 | 92.89 ± 1.60 | 90.71 ± 0.58 |
| AB3 | 85.06 ± 0.66 | 87.87 ± 0.82 | 92.34 ± 1.80 | 90.03 ± 0.54 |
| AB4 | 86.12 ± 0.73 | **89.14** ± 0.48 | 92.26 ± 1.10 | 90.67 ± 0.54 |
| AB5 | **86.50** ± 0.55 | 88.88 ± 0.87 | **93.21** ± 0.97 | **90.99** ± 0.36 |

V. CONCLUSION

This work presents PEMV-thyroid, a prototype-enhanced multi-view learning framework for robust thyroid nodule ultrasound classification. By explicitly accounting for data heterogeneity through complementary multi-view representations and a prototype-based correction mechanism, the proposed approach mitigates the influence of spurious correlations arising from variations in imaging devices, acquisition protocols, and nodule appearances. Extensive experiments on two publicly available thyroid ultrasound datasets demonstrate that PEMV-thyroid consistently outperforms state-of-the-art baselines, with particularly notable improvements under cross-device and heterogeneous settings. These results highlight the effectiveness of integrating multi-view representation learning with prototype-guided refinement for improving robustness and generalisation in medical image classification. Future work will explore extending the proposed framework to other ultrasound-based diagnostic tasks and investigating its applicability to additional medical imaging modalities with pronounced domain variability.

REFERENCES

[1] D. S. Dean and H. Gharib, "Epidemiology of thyroid nodules," Best Practice & Research Clinical Endocrinology & Metabolism, vol. 22, no. 6, pp. 901–911, 2008.
[2] B. R. Haugen, E. K. Alexander, K. C. Bible, G. M. Doherty, S. J. Mandel, Y. E. Nikiforov, F. Pacini, G. W. Randolph, A. M. Sawka, M. Schlumberger et al., "2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association guidelines task force on thyroid nodules and differentiated thyroid cancer," Thyroid, vol. 26, no. 1, pp. 1–133, 2016.
[3] Z. Zhou, Y. Lu, J. Bai, V. M. Campello, F. Feng, and K. Lekadir, "Segment anything model for fetal head-pubic symphysis segmentation in intrapartum ultrasound image analysis," Expert Systems with Applications, vol. 263, p. 125699, 2025.
[4] B. Wildman-Tobriner, M. Buda, J. K. Hoang, W. D. Middleton, D. Thayer, R. G. Short, F. N. Tessler, and M. A. Mazurowski, "Using artificial intelligence to revise ACR TI-RADS risk stratification of thyroid nodules: diagnostic accuracy and utility," Radiology, vol. 292, no. 1, pp. 112–119, 2019.
[5] M. Buda, B. Wildman-Tobriner, J. K. Hoang, D. Thayer, F. N. Tessler, W. D. Middleton, and M. A. Mazurowski, "Management of thyroid nodules seen on US images: deep learning may match performance of radiologists," Radiology, vol. 292, no. 3, pp. 695–701, 2019.
[6] T. Qian, Y. Zhou, J. Yao, C. Ni, S. Asif, C. Chen, L. Lv, D. Ou, and D. Xu, "Deep learning based analysis of dynamic video ultrasonography for predicting cervical lymph node metastasis in papillary thyroid carcinoma," Endocrine, vol. 87, no. 3, pp. 1060–1069, 2025.
[7] W. Wen, T. Zhang, H. Zhao, J. Liu, H. Jiang, Y. He, and Z. Jiang, "Multimodal model enhances qualitative diagnosis of hypervascular thyroid nodules: Integrating radiomics and deep learning features based on B-mode and PDI images," Gland Surgery, vol. 14, no. 8, pp. 1558–1571, 2025.
[8] S. Fan, R. Xu, Q. Dong, Y. He, C. Chang, and P. Cui, "Stable Cox regression for survival analysis under distribution shifts," Nature Machine Intelligence, vol. 6, no. 12, pp. 1525–1541, 2024.
[9] F. Ziadi, H. Fourati, and L. A. Saidane, "AI and IoT users, challenges and opportunities for e-health: A review," in Proceedings of the 2024 International Wireless Communications and Mobile Computing, 2024.
[10] G. Grani, M. Sponziello, S. Filetti, and C. Durante, "Thyroid nodules: diagnosis and management," Nature Reviews Endocrinology, vol. 20, no. 12, pp. 715–728, 2024.
[11] H. Guan and M. Liu, "Domain adaptation for medical image analysis: a survey," IEEE Transactions on Biomedical Engineering, vol. 69, no. 3, pp. 1173–1185, 2021.
[12] L. Faes, S. K. Wagner, D. J. Fu, X. Liu, E. Korot, J. R. Ledsam, T. Back, R. Chopra, N. Pontikos, C. Kern et al., "Automated deep learning design for medical image classification by health-care professionals with no coding experience: A feasibility study," The Lancet Digital Health, vol. 1, no. 5, pp. e232–e242, 2019.
[13] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in 6th International Conference on Learning Representations, ICLR, 2018.
[14] K. Zhou, Y. Yang, Y. Qiao, and T. Xiang, "Domain generalization with MixStyle," in 9th International Conference on Learning Representations, ICLR, 2021.
[15] A. Ramé, C. Dancette, and M. Cord, "Fishr: Invariant gradient variances for out-of-distribution generalization," in International Conference on Machine Learning, ICML, 2022, pp. 18347–18377.
[16] J. Pearl, Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press, 2000.
[17] Z. Xu, D. Cheng, J. Li, J. Liu, L. Liu, and K. Yu, "Causal inference with conditional front-door adjustment and identifiable variational autoencoder," in The Twelfth International Conference on Learning Representations, ICLR, 2024.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2016, pp. 770–778.
