A two-stream network with global-local feature fusion for bone age assessment
Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual’s growth, development, and maturity. Although deep learning has advanced bone age assessment in recent years, existing methods struggle to efficiently balance global features with local skeletal details. This study develops an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher assessment accuracy. We propose the BoNet+ model, which incorporates global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to enhance global feature extraction through a multi-head self-attention mechanism. An RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, strengthening local feature extraction. Global and local features are concatenated along the channel dimension and refined by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state of the art. The BoNet+ model reduces clinical workload and delivers automatic, high-precision, and more objective bone age assessment.


💡 Research Summary

The paper introduces BoNet+, a novel two‑stream deep learning architecture for automated bone age assessment (BAA) that explicitly balances global skeletal context with fine‑grained local bone details. The authors identify a persistent gap in existing approaches: pure CNN‑based “black‑box” models ingest the whole hand radiograph but lack interpretability and often under‑utilize global structural cues, while ROI‑based methods depend on labor‑intensive annotations and can lose holistic information. To address these issues, BoNet+ comprises a global feature extraction branch and a local feature extraction branch that run in parallel and are later fused.

In the global branch, a Vision‑Transformer‑style module is inserted. The hand image is split into fixed‑size patches, linearly embedded, and enriched with positional encodings. Stacked multi‑head self‑attention (MHSA) layers capture long‑range spatial dependencies across the entire hand, effectively modeling the overall skeletal maturity pattern akin to the Greulich‑Pyle (GP) method, which relies on whole‑hand comparison.
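The global branch described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the learned patch-projection and positional-encoding tables are replaced by fixed random arrays, and a single attention head stands in for the stacked MHSA layers.

```python
import numpy as np

def patch_embed(image, patch, dim, rng):
    """Split a square image into non-overlapping patches and project each
    patch to a dim-dimensional token (random projection stands in for the
    learned linear embedding)."""
    h, w = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    proj = rng.standard_normal((patch * patch, dim)) / np.sqrt(patch * patch)
    tokens = patches @ proj
    # A fixed random table stands in for learned positional encodings.
    tokens += rng.standard_normal(tokens.shape) * 0.02
    return tokens

def self_attention(tokens):
    """One self-attention head: every patch token attends to every other
    token, capturing the long-range dependencies across the whole hand."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)         # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all patches
    return weights @ tokens

rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))                # toy stand-in radiograph
tok = patch_embed(img, patch=16, dim=32, rng=rng)  # 4x4 grid -> (16, 32)
out = self_attention(tok)                          # same shape, globally mixed
```

Because the softmax row spans every patch, each output token is a weighted mixture of all patches, which is what lets the branch model whole-hand maturity patterns.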

The local branch incorporates a Receptive‑Field Attention Convolution (RF‑AConv) module. RF‑AConv processes the feature map with multiple convolution kernels of varying receptive fields (e.g., 3×3, 5×5, 7×7) in parallel, generating scale‑specific attention maps that are multiplied back onto the features. This mechanism adaptively emphasizes salient regions such as distal, middle, and proximal phalanges as well as the carpal bones, compensating for the limited coverage of the key‑point‑based Gaussian attention maps used in the original BoNet. The design mirrors the Tanner‑Whitehouse (TW) approach, which scores individual bone regions.
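The multi-receptive-field attention idea can be sketched as follows. This is a simplified single-channel stand-in, assuming the gist of RF-AConv (scale-specific responses turned into per-pixel attention weights); it uses plain mean filters in place of learned convolution kernels.

```python
import numpy as np

def local_mean(x, k):
    """Mean over a k x k neighbourhood (zero-padded); a stand-in for a
    convolution with a k x k receptive field."""
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x)
    h, w = x.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def rf_attention(x, kernels=(3, 5, 7)):
    """Multi-receptive-field attention: responses at several scales become
    per-pixel softmax weights, and the weighted response re-scales the
    input, emphasizing regions that stand out at some scale."""
    responses = np.stack([local_mean(x, k) for k in kernels])  # (S, H, W)
    w = np.exp(responses - responses.max(axis=0, keepdims=True))
    w /= w.sum(axis=0, keepdims=True)           # softmax across scales
    return (w * responses).sum(axis=0) * x      # attention-weighted features

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 8))              # toy feature map
out = rf_attention(feat)                        # same spatial size, re-weighted
```

In the real module the per-scale responses come from learned convolutions over multi-channel feature maps, but the control flow — parallel receptive fields, softmax over scales, multiplicative re-weighting — is the same.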

Both streams output feature tensors that are concatenated along the channel dimension. The concatenated representation is then passed through an Inception‑V3 backbone, which refines multi‑scale information and reduces dimensionality before a final fully‑connected regression head predicts bone age in months. Gender information is embedded separately and merged with the fused features to account for sex‑specific growth patterns. The loss function is a simple L1 (mean absolute error) loss, directly optimizing the clinical metric of interest.
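The fusion-and-regression step can be sketched as below. This is a toy linear stand-in, assuming flattened feature vectors: the Inception-V3 refinement is collapsed into a single linear head, and `w_gender`/`w_head` are hypothetical learned parameters.

```python
import numpy as np

def fuse_and_predict(g_feat, l_feat, gender, w_head, w_gender):
    """Concatenate global and local feature vectors, append a per-sex
    gender embedding, and regress bone age in months with a linear head
    (stand-in for Inception-V3 refinement + the fully connected layer)."""
    gender_emb = w_gender[gender]                 # learned per-sex vector
    fused = np.concatenate([g_feat, l_feat, gender_emb])
    return float(fused @ w_head)

def l1_loss(pred, target):
    """Mean absolute error in months -- both the training loss and the
    reported clinical metric."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

rng = np.random.default_rng(2)
g, l = rng.standard_normal(16), rng.standard_normal(16)
w_gender = rng.standard_normal((2, 4))   # one embedding per sex
w_head = rng.standard_normal(36)         # 16 + 16 + 4 fused dimensions
age = fuse_and_predict(g, l, gender=1, w_head=w_head, w_gender=w_gender)
loss = l1_loss([age], [120.0])           # target: 120 months
```

Optimizing L1 directly is a sensible choice here because the evaluation metric (MAE) and the training objective then coincide.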

The model is evaluated on two publicly available datasets: the Radiological Society of North America (RSNA) hand X‑ray set and the Radiological Hand Pose Estimation (RHPE) set. BoNet+ achieves mean absolute errors (MAE) of 3.81 months on RSNA and 5.65 months on RHPE, matching or surpassing recent state‑of‑the‑art methods such as BoGFF‑Net (3.91 MAE), RA‑Net (4.10 MAE), and Swin‑Transformer‑based pipelines (≈4.6 MAE). Notably, the RF‑AConv module yields a pronounced improvement on RHPE, where carpal and wrist regions are more complex, demonstrating the benefit of enhanced local attention.

From a computational standpoint, the two‑stream design adds modest overhead because the Transformer is confined to the global branch and the RF‑AConv operates with lightweight convolutions. Using Inception‑V3 as the fusion backbone keeps the total parameter count and inference latency within clinically acceptable limits, enabling near‑real‑time deployment.

The authors acknowledge limitations: the system still relies on auxiliary key‑point annotations to generate initial attention maps, and cross‑domain generalization (e.g., varying imaging protocols) requires further study. Future work may explore fully self‑supervised training, integration of additional modalities (chronological age, hormonal markers), and more sophisticated multi‑task learning to further reduce annotation burden and improve robustness.

In summary, BoNet+ demonstrates that a carefully engineered combination of Transformer‑based global context modeling and receptive‑field‑aware local attention can substantially improve bone age prediction accuracy while preserving interpretability and efficiency, offering a promising tool for clinical workflow automation.

