Effectiveness of Automatically Curated Dataset in Thyroid Nodules Classification Algorithms Using Deep Learning

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

The diagnosis of thyroid nodule cancer commonly relies on ultrasound images. Several studies have shown that deep learning algorithms designed to classify benign and malignant thyroid nodules can match radiologists’ performance. However, data availability for training deep learning models is often limited by the significant effort required to curate such datasets. A previous study proposed a method to curate thyroid nodule datasets automatically, achieving a 63% yield rate and 83% labeling accuracy. However, the usefulness of the generated data for training deep learning models remained unknown. In this study, we conducted experiments to determine whether an automatically curated dataset improves deep learning algorithms’ performance. We trained deep learning models on the manually annotated and automatically curated datasets. We also trained on a smaller, higher-accuracy subset of the automatically curated dataset to explore the optimal use of such data. The deep learning model trained on the manually selected dataset achieved an AUC of 0.643 (95% confidence interval [CI]: 0.62, 0.66), significantly lower than the AUC of 0.694 (95% CI: 0.67, 0.73; P < .001) achieved by the model trained on the automatically curated dataset. The model trained on the accurate subset achieved an AUC of 0.689 (95% CI: 0.66, 0.72; P = .43), not significantly different from the model trained on the full automatically curated dataset. In conclusion, we showed that using an automatically curated dataset can substantially increase the performance of deep learning algorithms, and we suggest using all of the data rather than only the accurate subset.


💡 Research Summary

The paper investigates whether an automatically curated ultrasound image dataset can improve deep learning (DL) models for classifying thyroid nodules as benign or malignant, compared with a manually annotated dataset. Using a previously developed pipeline called MADLaP (Multi‑step Automated Data Labeling Procedure), the authors automatically extracted and labeled images from 3,981 patients’ thyroid fine‑needle aspiration (FNA) records. MADLaP combines rule‑based natural language processing, optical character recognition, and a DL segmentation model to locate two orthogonal (transverse and longitudinal) images per nodule and assign pathology‑based labels. In validation, MADLaP achieved a 63 % yield and 83 % labeling accuracy.

Three training sets were created: (1) Manual Set – 378 patients manually selected and labeled by an experienced radiologist (802 images, 752 benign, 50 malignant); (2) MADLaP Set – the full automatically curated output (5,228 images, 4,970 benign, 258 malignant); (3) S1 Set – a subset derived only from Stage 1 of MADLaP, which has higher label precision but lower yield (4,150 images). All images were pre‑processed with a Faster R‑CNN (ResNet‑101 backbone) to detect measurement calipers, generate bounding boxes around nodules, and crop them to 160 × 160 px.
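The caliper-detection and cropping step can be sketched as a fixed-size crop centered on the detected bounding box. The following is a minimal NumPy sketch with edge padding; the paper does not specify its exact crop/resize policy, so `crop_nodule` and its padding behavior are illustrative assumptions:

```python
import numpy as np

def crop_nodule(image, box, size=160):
    """Extract a size x size patch centered on a detected nodule box.

    box = (x1, y1, x2, y2) in pixel coordinates. Regions that fall
    outside the image are zero-padded. Sketch only -- the paper's
    exact crop and resize policy is not described.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    half = size // 2
    patch = np.zeros((size, size), dtype=image.dtype)
    # Source region, clipped to the image bounds.
    sx1, sy1 = max(cx - half, 0), max(cy - half, 0)
    sx2 = min(cx + half, image.shape[1])
    sy2 = min(cy + half, image.shape[0])
    # Destination offsets account for any clipping at the edges.
    dx1, dy1 = sx1 - (cx - half), sy1 - (cy - half)
    patch[dy1:dy1 + (sy2 - sy1), dx1:dx1 + (sx2 - sx1)] = image[sy1:sy2, sx1:sx2]
    return patch
```

In practice the detector's box would come from the Faster R-CNN output; here any `(x1, y1, x2, y2)` tuple works.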

The classification network was a relatively shallow CNN: six 3 × 3 convolutional layers, five 2 × 2 max‑pooling layers, a 50 % dropout layer, and a final sigmoid output. Focal loss addressed the strong class imbalance (malignant cases make up ≈5 % of images). Training used a base learning rate of 0.001, the RMSProp optimizer, and batch sizes ranging from 32 to 1,024; the optimal epoch count was chosen via 10‑fold cross‑validation.
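Focal loss down-weights easy, well-classified examples so that the rare malignant class contributes more to the gradient. A minimal NumPy sketch of the binary form follows; the `alpha` and `gamma` values are the common defaults from the original focal loss paper (Lin et al.), not values reported by the authors:

```python
import numpy as np

def binary_focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-example binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    alpha/gamma are common defaults, not the paper's settings.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # p_t is the model's probability for the true class.
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma shrinks the loss for confident, correct predictions.
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

With `gamma=2`, a confidently correct prediction (p_t = 0.9) is penalized by a factor of 0.01 relative to plain weighted cross-entropy, which is what keeps the abundant easy benign examples from dominating training.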

On an independent test set (320 patients, 378 nodules), the Manual Set model achieved an AUC of 0.643 (95 % CI 0.62‑0.66). The MADLaP Set model reached an AUC of 0.694 (95 % CI 0.67‑0.73), a statistically significant improvement (P < 0.001). The S1 Set model yielded an AUC of 0.689 (95 % CI 0.66‑0.72), not significantly different from the full MADLaP Set (P = 0.43). Increasing batch size consistently improved performance for the automatically curated datasets, supporting the notion that larger batches mitigate the impact of noisy labels.
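Confidence intervals like those above are typically obtained by resampling the test set; the paper does not state its exact method, so the following is a generic percentile-bootstrap sketch in NumPy (the rank-based AUC here assumes tie-free scores):

```python
import numpy as np

def auc(labels, scores):
    """AUC via the Mann-Whitney rank formulation (assumes no tied scores)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the AUC; a generic sketch, not the
    paper's method."""
    rng = np.random.default_rng(seed)
    stats = []
    n = len(labels)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if labels[idx].min() == labels[idx].max():
            continue  # resample drew only one class; AUC undefined
        stats.append(auc(labels[idx], scores[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Comparing two models' AUCs on the same test set (as the paper's P-values do) would additionally require a paired test such as DeLong's; that part is omitted here.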

The authors conclude that automatically curated data, despite imperfect labeling, can substantially boost DL performance for thyroid nodule classification, and that using the entire automatically generated dataset is preferable to restricting training to a higher‑precision subset. They compare their results to a prior benchmark (AUC ≈ 0.70 using 2,556 manually annotated images from 1,139 patients) and note comparable performance while requiring far fewer manually labeled cases.

Limitations include the modest labeling accuracy of MADLaP (83 %), the single‑institution data source, and the heavy class imbalance. Future work should test the pipeline across multiple centers, incorporate more diverse imaging equipment, and refine the automated labeling stages to further reduce noise. Nonetheless, the study demonstrates the practical value of automated dataset construction in medical AI, offering a scalable solution when expert annotation resources are limited.

