HARU-Net: Hybrid Attention Residual U-Net for Edge-Preserving Denoising in Cone-Beam Computed Tomography
Authors: Khuram Naveed, Ruben Pauwels
Department of Dentistry and Oral Health, Aarhus University, Aarhus, 8000, Denmark

Abstract

Cone-beam computed tomography (CBCT) is widely used in dental and maxillofacial imaging, but low-dose acquisition introduces strong, spatially varying noise that degrades soft-tissue visibility and obscures fine anatomical structures. Classical denoising methods struggle to suppress noise in CBCT while preserving edges. Although deep learning–based approaches offer high-fidelity restoration, their use in CBCT denoising is limited by the scarcity of high-resolution CBCT data for supervised training. To address this research gap, we propose a novel Hybrid Attention Residual U-Net (HARU-Net) for high-quality denoising of CBCT data, trained on a cadaver dataset of human hemimandibles acquired using a high-resolution protocol of the 3D Accuitomo 170 (J. Morita, Kyoto, Japan) CBCT system. The novel contribution of this approach is the integration of three complementary architectural components: (i) a hybrid attention transformer block (HAB) embedded within each skip connection to selectively emphasize salient anatomical features, (ii) a residual hybrid attention transformer group (RHAG) at the bottleneck to strengthen global contextual modeling and long-range feature interactions, and (iii) residual learning convolutional blocks to facilitate deeper, more stable feature extraction throughout the network. HARU-Net consistently outperforms state-of-the-art (SOTA) methods including SwinIR and Uformer, achieving the highest PSNR (37.52 dB), the lowest GMSD (0.1084), and a near-best SSIM (0.9557).
This effective and clinically reliable CBCT denoising is achieved at a computational cost significantly lower than that of the SOTA methods, offering a practical advancement toward improving diagnostic quality in low-dose CBCT imaging.

Copyright © 2026 by the authors. Open access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Introduction

Cone-beam computed tomography (CBCT) is among the most widely used imaging modalities in dentistry and otorhinolaryngology. It can provide three-dimensional visualization of the anatomical structures of the teeth and jaws, and ear-nose-throat (ENT) regions. CBCT enables volumetric assessment of anatomical structures, facilitating diagnosis and treatment planning across a wide range of applications, including endodontics, implantology, orthodontics, temporomandibular joint (TMJ) evaluation, maxillofacial surgery, and ENT imaging. These applications are made possible by the relatively low radiation doses and high spatial resolution of CBCT imaging, which make it particularly suitable for imaging the head and neck regions.

Despite its broad clinical utility, the requirement to maintain low radiation doses to minimize patient harm results in a relatively high degree of noise in CBCT scans (Scarfe and Farman 2008; Pauwels et al. 2015a; Pauwels et al. 2012; Schulze et al. 2011). These noise patterns are composed of quantum noise due to low exposure (Bryce-Atkinson et al. 2021; Goldman 2007; Pauwels et al. 2016), and additive noise due to sensor electronics and post-processing errors (e.g., quantization errors) (Nuyts et al. 2013; Kalender 2011). Such granular noise often compromises the visual clarity of anatomical structures, causing soft-tissue regions to become largely indistinguishable and impairing the visualization of hard-tissue boundaries and small lesions (Mahesh, Deshpande, and Viveka 2022; Olszewski 2020).
Such limitations in CBCT imaging reduce diagnostic confidence and often necessitate repeated scans or supplementary imaging with CT or MRI. These challenges underscore the need for efficient post-processing noise reduction techniques that can enhance image quality without increasing the radiation dose. As the clinical use of CBCT increases, the demand for reliable, high-quality CBCT denoising solutions has grown substantially.

Since acquisition-related noise is not unique to CBCT imaging, denoising has long been employed to address the low signal-to-noise ratio (SNR) problem (e.g., Naveed, de Freitas, and Pauwels 2025; Khawaja et al. 2019; Naveed et al. 2021; Naveed et al. 2019). In this context, deep learning (DL) approaches for CBCT image enhancement have been explored in some recent studies (e.g., Yunker, Kettimuthu, and Roeske 2024; Zhao et al. 2025; Naveed and Pauwels 2025). However, the literature remains limited when considering CBCT used in dentistry and ENT imaging, with the majority of studies focused on CBCT used in radiotherapy. This scarcity stems from the dependence of supervised learning on paired high-quality reference data, which requires acquiring high-dose scans. Such low- and high-radiation image pairs are ethically and clinically impractical. Notably, even the supervised approaches reported in (Yunker, Kettimuthu, and Roeske 2024; Zhao et al. 2025) were only made possible through a dataset temporarily released as part of the Grand Challenge in (Biguri and Mukherjee 2023). Although phantom or cadaver datasets with controlled exposure levels can, in principle, support supervised training, they fail to fully capture the anatomical variations and tissue heterogeneity present in real patients. A way around this is to use self-supervised learning frameworks, as demonstrated in (Choi 2021; Yun et al. 2023; Zanini, Rubira-Bullen, and dos Santos Nunes 2024).
These frameworks train neural networks to map noisy inputs to independently corrupted versions of the same data, which allows the noise to be averaged out without the need for clean supervision. However, the performance of self-supervised learning methods tends to be suboptimal.

This work seeks to address this challenge through a cadaver-based CBCT dataset acquired under high-dose settings, with simulated noise added to generate paired noisy and clean CBCT slices. We introduce a pre-processing pipeline based on morphological operations for detection and segmentation of tissue from background (air), to ensure that the latter is ignored for downstream tasks. For edge-preserving denoising, we propose a hybrid attention residual U-Net (HARU-Net), which is capable of capturing anatomical structures and subtle tissue variations across CBCT slices, thereby enabling robust recovery of structural details within complex noise patterns. For validation and efficiency, we compare our method against state-of-the-art (SOTA) transformer-based architectures for denoising, i.e., SwinIR (Liang et al. 2021) and Uformer (Wang et al. 2022), and a residual U-Net as a standard CNN benchmark method.

Methods

This section describes: (i) the dataset and pre-processing steps for segmentation of the foreground tissue, (ii) preparation of training data through dynamic patching and the addition of noise, and (iii) a detailed explanation of the proposed HARU-Net architecture.

Pre-processing and Data Preparation

Here, we describe the preparation of noisy and clean image pairs and the dynamic patching developed using only the segmented anatomical regions.

Dataset Description

The dataset employed in this study consists of CBCT scans of human hemimandibles, which were acquired at Chulalongkorn University using the 3D Accuitomo 170 CBCT (J. Morita, Kyoto, Japan) system.
A total of 21 hemimandibular specimens were obtained from the Department of Anatomy, Chulalongkorn University. The study protocol received a certificate of exemption from the Faculty of Dentistry, Chulalongkorn University Human Research Ethics Committee (approval code HREC-DCU 2015-032). The imaging protocol followed standard adult high-resolution exposure settings: 90 kV, 5 mA, and an exposure time of 30.8 seconds per scan. The scans were reconstructed using a field of view (FOV) of 5 × 5 cm at an isotropic voxel resolution of 0.08 mm, allowing for high-detail visualization of osseous structures.

Each of the 21 CBCT volumes was sliced along the frontal, axial, and sagittal planes using ImageJ (National Institutes of Health, Bethesda, MD, USA) to obtain 2D views from three anatomical directions. This resulted in a total of 26,317 2D slices used for training and testing. Subsequently, each of these slices was pre-processed to extract the relevant anatomical regions for model input preparation, as detailed in the next section.

Preparation of Noisy Samples

Noise in reconstructed CBCT images is primarily composed of two sources: (i) the stochastic nature of discrete photon detection (quantum noise) and (ii) the electronic circuitry used for photon detection (electronic noise). Quantum noise depends on the radiation dose; that is, a higher radiation dose ensures a higher SNR due to a larger number of photon detections, with SNR being proportional to the square root of the number of photons (and thus, the dose). While the number of detected photons N_P follows a Poisson distribution, N_P ~ P(λ), where the mean photon count λ is proportional to the local X-ray intensity incident on the detector, for large photon flux the Poisson process can be approximated by a Gaussian distribution with variance equal to its mean, Var(N_P) = λ.
However, after logarithmic transformation and filtered back-projection, this variance is propagated and reflected in the reconstructed voxel intensities. To approximate this behavior at the image level, the quantum noise can be expressed as an additive Gaussian term with uniform variance, N(0, σ_q²), where σ_q controls the strength of the quantum noise contribution in the reconstructed domain and depends on the radiation dose (Pauwels et al. 2015a; Kak and Slaney 2001).

Fluctuations in the detection circuitry, thermal fluctuations, and analog-to-digital conversion contribute to electronic noise, which is independent of the signal and is commonly approximated as a zero-mean Gaussian process N(0, σ_e²), with σ_e denoting the standard deviation corresponding to the level of electronic noise.

The recorded voxel values in the reconstructed CBCT volume I′ can be modeled as the sum of the true values I and these two additive noise terms:

    I′ = I + ψ_q + ψ_e,   (1)

where ψ_q ~ N(0, σ_q²) denotes the quantum noise component and ψ_e ~ N(0, σ_e²) denotes the electronic noise component in a voxel.

Segmentation of Anatomical Regions and Dynamic Patching

This section describes an unsupervised clustering and morphological operations pipeline designed specifically to isolate the foreground anatomical regions from background air in CBCT slices. This approach facilitates dynamic placement of patches exclusively within anatomically meaningful areas, thereby effectively excluding empty, irrelevant air regions.

1. Manual Cropping for Excluding Empty Regions. The first step of this pipeline was performed manually, where all slices were cropped using ImageJ software.
The cropping area was determined by visually inspecting all slices to identify the region encompassing the highest anatomical interest. This ensured that each cropped slice contained the relevant anatomical structures while maintaining uniform dimensions across the dataset.

[Figure 1: An illustration of the proposed pre-processing pipeline for detection and segmentation of the foreground tissue from the background air and noise. Stages: CBCT slice → cropped slice → initial ROI mask (K-means) → final ROI mask (hole filling) → bounded and masked ROI → selected patches.]

2. K-Means Clustering for Foreground Segmentation. We employed an unsupervised K-means clustering algorithm (Ajala Funmilola et al. 2012) to isolate foreground tissue pixels, leveraging their significantly higher intensity values compared to the relatively low pixel values in air- and noise-dominated regions. The algorithm was applied with k = 2 and a convergence criterion combining a maximum of 100 iterations and a minimum cluster center shift of 0.2 (i.e., ε = 0.2). Given that I_k ∈ I denotes the k-th slice in the 3D CBCT volume I, the algorithm extracts non-zero pixel intensities and reshapes them into a one-dimensional vector Ĩ_k. The final output is an initial binary mask M_0 ∈ {0, 1}^{H×W}, where the foreground pixels represent the anatomical regions of interest. However, the resulting mask may contain distorted boundaries between air and tissue due to partial volume averaging, as well as voids within the anatomical structures due to pockets filled with air, low-density tissues, or noise. The following two steps address this issue.

3. Morphological Dilation. We employ a morphological dilation operation to correct the distorted boundaries as well as the internal noise pockets of the binary mask M_0. The dilation operation is applied using a square structuring element (of size 5 × 5) to smooth boundary irregularities by expanding the segmented regions. This operation is also helpful in filling narrow gaps, resulting in a more contiguous representation of the anatomical structures in the improved mask, denoted by M_1, along with regularized tissue boundaries. However, this step does not fully address the discontinuities caused by larger empty pockets within the anatomical boundaries.

4. Hole Detection and Filling via Contour Hierarchy. This step addresses voids or holes too large to fill using dilation. We detect these empty regions using a hierarchical contour retrieval method (cv2.RETR_CCOMP), which captures both outer and inner contours (Suzuki et al. 1985). Inner contours denote the hollow or empty regions, while their surrounding regions are termed the parent contours. Once all inner contours are identified, we fill them using a region-growing method, namely the flood-fill approach (Haralick and Shapiro 1985). This is further supplemented by morphological smoothing operations to obtain a smooth mask and close residual gaps. The process iteratively increases the kernel size, starting from 15 × 15 with a step size of 5, until either no new contours are detected or a maximum iteration count is reached. This iterative refinement produces a final binary mask M_f, where anatomical regions are fully enclosed and holes are eliminated.

5. Bounding Box Extraction and Dynamic Patching. To ensure that patches used for training the DL models are selected exclusively from the foreground tissue regions, the contours of these regions are extracted to define axis-aligned bounding boxes around each fully connected component.
These bounding boxes localize the segmented anatomical areas, thereby enabling dynamic patch placement within the tissue regions and preventing the inclusion of predominantly empty areas. However, this approach may introduce redundancy when smaller anatomical regions are completely enclosed within the bounding box of a larger region. To mitigate this, nested bounding boxes (i.e., boxes entirely contained within another) are identified and removed by evaluating the spatial containment relationships among all boxes. Consequently, only the largest non-overlapping bounding boxes are retained for patch extraction.

Subsequently, non-overlapping patches of size 256 × 256 pixels are selected starting from the top-left corner of each slice. In cases where a few rows or columns remain uncovered after patch selection, additional overlapping patches are generated to ensure complete coverage of the anatomical area. In cases where a bounding box has dimensions smaller than the desired patch size, it is symmetrically extended to meet the minimum size requirement while ensuring it remains within the image boundaries.

A total of 50,026 pairs of noisy and clean patches were generated following the process detailed above on 14 randomly selected CBCT volumes (i.e., ≈70% of the total data). For validation and testing, totals of 8,971 and 10,462 patches were obtained from 3 randomly selected CBCT volumes each (≈15% of the total data).

Hybrid Attention Residual U-Net

This section introduces the proposed hybrid attention residual U-Net (HARU-Net), illustrated in Fig. 2, which integrates transformer-based attention blocks within the classical encoder–decoder framework of U-Nets to enhance representational capacity for the challenging task of image denoising.
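Before turning to the architecture, the noisy-pair generation of Eq. (1) can be made concrete. The sketch below is a minimal NumPy illustration; the function name and the noise levels σ_q and σ_e are illustrative placeholders, not the settings used in our experiments:

```python
import numpy as np

def add_cbct_noise(clean, sigma_q=0.02, sigma_e=0.01, rng=None):
    """Simulate Eq. (1), I' = I + psi_q + psi_e, on a normalized slice.

    sigma_q (quantum) and sigma_e (electronic) are hypothetical values;
    both noise terms are zero-mean Gaussian, as in the model above.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    psi_q = rng.normal(0.0, sigma_q, size=clean.shape)  # quantum noise
    psi_e = rng.normal(0.0, sigma_e, size=clean.shape)  # electronic noise
    return clean + psi_q + psi_e

# Example: corrupt a 256x256 "clean" patch into its paired noisy version.
clean = np.zeros((256, 256))
noisy = add_cbct_noise(clean)
```

Because the two terms are independent Gaussians, the combined noise is itself Gaussian with variance σ_q² + σ_e², which is the perturbation a denoiser trained on such pairs implicitly learns to remove.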
Conventional convolutional blocks primarily rely on localized filtering operations that act as low-frequency estimators and, by design, tend to suppress high-frequency details such as edges and corners. However, preservation of these high-frequency components is critical for effective denoising, as they encode essential structural and anatomical information. To address this limitation, attention mechanisms and transformer architectures have been increasingly adopted to model global dependencies and retain high-frequency details. In this context, HARU-Net incorporates a hybrid attention block (HAB) and a residual hybrid attention group (RHAG), originally introduced in a recent transformer-based architecture for image super-resolution and denoising (Chen et al. 2023), to activate a larger number of pixels and enable effective modeling of global dependencies. Such collaborative use of convolutional and transformer blocks helps achieve effective denoising while maintaining computational efficiency compared with fully transformer-based solutions.

The proposed HARU-Net architecture is composed of four main types of convolutional and transformer-based blocks: (i) residual convolutional encoding blocks for robust feature extraction, (ii) hybrid attention transformer blocks (HABs) integrated into the skip connections to emphasize relevant features at each resolution level, (iii) a residual hybrid attention group (RHAG) at the bottleneck to enhance the representational capacity of the deepest feature maps, and (iv) residual convolutional decoding blocks that progressively reverse the encoding process and reconstruct high-resolution feature maps with the assistance of the HABs and RHAG. This architecture effectively combines the strengths of convolutional layers for local feature estimation with the enhanced representational capacity of transformer-based attention modules for capturing contextual dependencies across scales.
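The channel and resolution progression implied by this design (and by the numbers annotated in Fig. 2, with c = 64 for a 256 × 256 input patch) can be traced with a short bookkeeping script; this is a shape sketch, not the network itself:

```python
# Trace feature-map shapes through HARU-Net as described in the text:
# four encoder stages double the channels (64 -> 128 -> 256 -> 512) while
# stride-2 convolutions halve the resolution (256 -> 128 -> 64 -> 32);
# the bottleneck then operates at 16x16 with channels expanded to 1024,
# and the decoder mirrors the encoder back to (64, 256).
stages = []
channels, size = 64, 256
for _ in range(4):
    stages.append((channels, size))           # residual encoder block output
    channels, size = channels * 2, size // 2  # doubling + stride-2 down-sampling

bottleneck = (channels, size)                 # (1024, 16): where the RHAG sits
decoder = list(reversed(stages))              # transposed convs restore shapes
```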
As a result, HARU-Net is able to suppress background noise while preserving fine anatomical details critical for CBCT interpretation. The following subsections describe each architectural component in detail.

Encoder

The encoder consists of four stages, each containing a residual convolutional encoding block composed of two 3 × 3 convolutions in series, each followed by a LeakyReLU activation, and a 1 × 1 convolutional projection of the input to the output serving as a skip connection that feeds forward dimension-aligned features for residual learning. This residual design ensures stable gradient flow and effective feature propagation, even when the number of channels changes. At each stage, the first convolution operation doubles the number of channels (from c to 2c, 4c, and 8c), where c denotes the number of channels after the first convolution operation in the initial block. After each residual block, the spatial resolution of the output tensor is reduced by a factor of 2 using learnable 4 × 4 convolutions with a stride of 2. We used the LeakyReLU activation due to its sensitivity to negative feature values, enabling enhanced feature flow into the deeper layers. Furthermore, the use of convolutional operations for down-sampling, instead of pooling operations, preserves more signal information and enables flexible feature compression by learning data-driven down-sampling filters.

Hybrid Attention Transformer Block

The Hybrid Attention Block (HAB), introduced in (Chen et al. 2023), combines window-based self-attention with channel attention to jointly capture both local and global contextual information. The windowed self-attention mechanism, originally introduced in the Swin Transformer architecture (Liu et al.
2021), adapts standard self-attention to a local scale through window partitioning while retaining longer-range contextual dependencies via overlapping and shifted windows across layers. The hybrid attention module reinforces the global feature representation by incorporating a channel attention component that re-weights feature channels according to their global relevance, thereby strengthening the model's ability to emphasize contextually important information and activate a larger proportion of input pixels. We present a brief description of windowed and channel attention as necessary background:

1. Windowed self-attention (Liu et al. 2021) is a localized variant of the standard self-attention mechanism (Vaswani et al. 2017), designed to capture fine-grained spatial patterns in visual data. Serving as a core building block of the Swin Transformer architecture (Liu et al. 2021), this approach partitions the input image into fixed-size windows and computes self-attention independently within each window, enabling efficient modeling of local context while progressively aggregating global information across layers. Within each window, feature embeddings are projected into query, key, and value representations, and attention weights are computed based on pairwise query–key similarity to generate context-aware features. By operating on localized neighborhoods, windowed self-attention preserves fine structural and textural details useful for image restoration. Moreover, the use of shifted windows across consecutive layers facilitates information exchange between neighboring windows, effectively integrating local details with global anatomical context.

2. Squeeze-and-excitation channel attention enhances feature representation by adaptively re-weighting the importance of each feature channel.
It models inter-channel dependencies by aggregating global spatial information through operations such as global average pooling, followed by a small multi-layer perceptron (MLP) that learns attention weights. These learned weights are then used to selectively emphasize informative channels and suppress less relevant ones, thereby improving the network's ability to focus on diagnostically significant features while reducing redundancy.

[Figure 2: Architecture of the proposed HARU-Net, incorporating hybrid attention transformer modules (HABs and RHAGs) within the skip connections and bottleneck of a residual U-Net to improve feature representation. Left: internal structure of the HAB (layer norm → window attention + channel attention → layer norm → MLP, with tensor additions).]

A sketch of the internal structure of the HAB is illustrated in Fig. 2 (left), where the hybrid-attention mechanism is positioned between two consecutive layer normalization (LN) operations and followed by a multilayer perceptron (MLP) with nonlinear activation. This design ensures stable feature transformations and enhanced learning of pixel-wise activations. In this way, the efficacy of a recent high-fidelity transformer block is embedded within a robust residual convolutional architecture, enabling enhanced activation of features at both local and global scales and effectively addressing the limitations imposed by localized receptive fields.
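The squeeze-and-excitation channel attention described above can be sketched in a few lines of NumPy. The function name, reduction ratio r, and random weights below are illustrative stand-ins for learned parameters; a trained implementation (e.g., the one in Chen et al. 2023) learns w1 and w2:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation re-weighting of a (C, H, W) feature map.

    w1: (C, C//r) and w2: (C//r, C) are the MLP weights; here they are
    random stand-ins for learned parameters (r is the reduction ratio).
    """
    squeeze = x.mean(axis=(1, 2))                   # global average pooling -> (C,)
    hidden = np.maximum(squeeze @ w1, 0.0)          # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # sigmoid weights in (0, 1)
    return x * gate[:, None, None]                  # re-weight each channel

rng = np.random.default_rng(0)
C, r = 64, 16
x = rng.normal(size=(C, 32, 32))
w1 = rng.normal(size=(C, C // r))
w2 = rng.normal(size=(C // r, C))
y = channel_attention(x, w1, w2)
```

Channels whose gate is near 1 pass through almost unchanged, while channels gated toward 0 are suppressed; in the HAB, this global re-weighting complements the purely local windowed attention.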
We employ the HABs directly at the skip connections to refine and enhance the core feature representations transmitted between the encoder and decoder stages. Readers interested in learning more about HAB and its ability to enhance receptive fields can refer to (Chen et al. 2023).

RHAG and Bottleneck

We employ a more extensive and robust attention-based feature modeling framework, namely the residual hybrid attention transformer group (RHAG), at the bottleneck, where the feature representations are at their deepest level. This design enables the modeling of long-range contextual dependencies at a relatively low additional computational cost. It is important to note that the proposed RHAG is a simplified variant of the original RHAG module introduced in (Chen et al. 2023), where only the HAB block is repeated six times in series within a residual configuration, allowing the network to capture higher-order global patterns while preserving data fidelity through residual learning.

Before the RHAG operation, the bottleneck employs a convolutional operation to increase the number of channels from 512 to 1024 for the resulting feature map. Similar to the encoding blocks, the bottleneck also incorporates a residual connection to facilitate effective gradient propagation and stable training.

Decoder with Skip Connections

The decoder mirrors the encoder, progressively restoring spatial resolution through transposed convolutions followed by convolutional refinement. To enhance information flow through the skip connections, an HAB is integrated at each skip pathway to refine the transmitted features. In doing so, the encoder features are selectively enhanced, emphasizing salient anatomical structures over noise and thereby improving the reconstruction of fine details. After concatenation, the fused feature maps are further refined using a residual convolutional block before being forwarded to the next decoding stage.
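For completeness, the window partitioning and per-window self-attention underlying the HABs can be sketched as follows. This is a single-head sketch without window shifting, and the projection matrices are random stand-ins for learned parameters:

```python
import numpy as np

def window_attention(x, win, wq, wk, wv):
    """Self-attention computed independently inside each win x win window.

    x: (H, W, C) feature map; wq, wk, wv: (C, C) query/key/value
    projections (random stand-ins here for learned parameters).
    """
    H, W, C = x.shape
    # Partition into non-overlapping windows: (num_windows, win*win, C).
    t = x.reshape(H // win, win, W // win, win, C)
    windows = t.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)
    q, k, v = windows @ wq, windows @ wk, windows @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)  # pairwise query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # softmax over keys per query
    out = attn @ v                                  # context-aware features
    # Reverse the partitioning back to (H, W, C).
    out = out.reshape(H // win, W // win, win, win, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))
wq, wk, wv = (rng.normal(size=(4, 4)) for _ in range(3))
y = window_attention(x, 4, wq, wk, wv)
```

Because attention is restricted to each window, the cost grows linearly with the number of windows rather than quadratically with image size; shifting the windows between consecutive layers (as in Swin) is what lets information cross window boundaries.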
Training and Hyper-parameters

The mean squared error (MSE) was used as the loss function to train the proposed HARU-Net. The network was trained using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴. A learning rate scheduler reduced the learning rate when the validation loss failed to improve for five consecutive epochs, helping the model escape shallow minima and stabilize convergence. Training was terminated early if no improvement in validation loss was observed for 20 epochs, preventing overfitting and unnecessary computation. The same optimizer, learning-rate scheduling, and early-stopping strategy were consistently applied across all comparative methods (described in the next section) to ensure a fair evaluation.

Table 1: Evaluation of denoising performance on the test dataset.

    Metric   ResU-Net   Uformer   SwinIR   HAT      HARU-Net
    PSNR     35.03      36.25     36.12    36.70    37.52
    SSIM     0.9542     0.9447    0.9551   0.9569   0.9557
    GMSD     0.1240     0.1147    0.1151   0.1119   0.1084

Table 2: Computational costs during inference.

    Measure                   ResU-Net   Uformer   SwinIR    HAT       HARU-Net
    FLOPs per patch (GMACs)   6.898      78.027    111.069   349.358   40.760
    Time per scan (minutes)   0.205      4.298     8.852     13.095    1.985

Experimental Setup

This section presents the experimental setup and materials used to evaluate the performance of the proposed HARU-Net against several state-of-the-art (SOTA) denoising methods. To ensure a comprehensive comparison, we selected algorithms that are widely recognized for their denoising effectiveness across medical, dental, and natural image domains. The comparison pool includes SwinIR (Liang et al. 2021), a Swin Transformer–based image restoration model known for its strong performance in diverse imaging applications. We additionally employ Uformer (Wang et al. 2022), a U-shaped transformer architecture inspired by the U-Net design but enhanced with transformer-based feature modeling.
We also compare against HAT (Chen et al. 2023), the hybrid attention transformer from which the HAB and RHAG modules adopted here are derived. To provide an architectural baseline, we further include a Residual U-Net (ResU-Net), which forms the foundational backbone of HARU-Net shown in Fig. 2 with the HAB and RHAG blocks removed.

For quantitative evaluation of the denoising performance of the comparative methods, we employ three widely used image quality metrics and a fourth metric to compare computational cost:

1. Peak Signal-to-Noise Ratio (PSNR) quantifies the pixel-wise agreement between the noise-corrected image and the reference image (i.e., the original CBCT data before noise addition). It is derived from the MSE and is expressed in decibels (dB). Higher PSNR values indicate closer correspondence to the reference image, reflecting effective noise suppression. PSNR is defined as

       PSNR = 10 log10(L² / MSE),   (2)

   where L denotes the maximum possible pixel intensity value of the image (e.g., L = 255 for an image represented in 8-bit unsigned integer format, while L = 1 for a normalized image), and MSE = (1/N) Σ_{i=1}^{N} (I(i) − Î(i))², with I and Î respectively denoting the reference and denoised images, each having N pixels.

2. Structural Similarity Index Measure (SSIM) evaluates perceptual similarity by jointly assessing luminance, contrast, and structural consistency between two images. SSIM is more closely aligned with human perception and understanding than PSNR. Since anatomical structures and contrast preservation are critical in CBCT, SSIM is a vital indicator of preserved structural information. Its values range between 0 and 1, with values close to 1 indicating better structural preservation. SSIM is defined as

       SSIM(I, Î) = [(2 μ_I μ_Î + C1)(2 σ_{I,Î} + C2)] / [(μ_I² + μ_Î² + C1)(σ_I² + σ_Î² + C2)],   (3)

   where μ_I and μ_Î denote the mean intensities, σ_I² and σ_Î² denote the variances, and σ_{I,Î} denotes the covariance of the reference and denoised images I and Î.
The constants C1 and C2 are included to stabilize the division.

3. Gradient Magnitude Similarity Deviation (GMSD) measures the deviation between the gradient magnitudes of the reference and the noise-corrected image. GMSD is particularly useful for assessing edge preservation and the retention of fine structural details, which are essential for accurate interpretation of CBCT scans. Lower GMSD values indicate better preservation of anatomical boundaries and textures. Mathematically, the GMSD of the reference and denoised images I and Î is given as

       GMSD = sqrt( (1/N) Σ_{i=1}^{N} ( ∇_{I∼Î}(i) − ∇̄_{I∼Î} )² ),   (4)

   where ∇̄_{I∼Î} denotes the mean gradient magnitude similarity over all N pixels, and ∇_{I∼Î}(i) denotes the gradient magnitude similarity at the i-th pixel of images I and Î, defined as

       ∇_{I∼Î}(i) = (2 ∇I(i) ∇Î(i) + C) / (∇I(i)² + ∇Î(i)² + C),   (5)

   where ∇I and ∇Î denote the gradient magnitudes of the reference and denoised images I and Î, respectively, and C is a small positive constant.

4. Giga multiply–accumulate operations (GMACs) quantify the computational cost required for a single forward pass through the model (one MAC corresponds to two floating-point operations, so one GMAC equals two GFLOPs). This metric reflects the inference efficiency of the model and provides insight into its suitability for efficient clinical deployment. Lower GMAC values correspond to faster execution and reduced hardware demands, allowing for direct comparison of computational efficiency across models while accounting for the fundamental arithmetic operations involved in neural network inference.

[Figure 3: Comparison of computational cost (GMACs, log scale) against performance in terms of PSNR and SSIM for ResU-Net, Uformer, SwinIR, HAT, and HARU-Net.]
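The fidelity metrics in Eqs. (2), (4), and (5) are straightforward to implement. The NumPy sketch below uses simple finite-difference gradients for GMSD (the reference formulation of GMSD uses Prewitt filters), and the constant C is an illustrative value:

```python
import numpy as np

def psnr(ref, den, L=1.0):
    """Eq. (2): 10 * log10(L^2 / MSE) for images with peak value L."""
    mse = np.mean((ref - den) ** 2)
    return 10.0 * np.log10(L**2 / mse)

def gmsd(ref, den, C=1e-4):
    """Eqs. (4)-(5): standard deviation of the gradient magnitude
    similarity map (finite differences stand in for Prewitt filters)."""
    def grad_mag(img):
        gy, gx = np.gradient(img)           # finite-difference gradients
        return np.sqrt(gx**2 + gy**2)
    g_ref, g_den = grad_mag(ref), grad_mag(den)
    gms = (2 * g_ref * g_den + C) / (g_ref**2 + g_den**2 + C)  # Eq. (5)
    return float(np.std(gms))               # Eq. (4): deviation from the mean

# Sanity checks: a perfect "denoised" image gives GMSD = 0, and a uniform
# 0.1 error on a normalized image gives MSE = 0.01, i.e. PSNR = 20 dB.
ref = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
```

Note the opposite orientations of the two metrics: PSNR rewards overall pixel-wise agreement (higher is better), while GMSD penalizes local disagreement in edge strength (lower is better), which is why the two can rank edge-blurring denoisers differently.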
Results

This section describes the experimental results comparing the proposed HARU-Net against SOTA denoising methods. We first present quantitative performance metrics and then provide a visual analysis.

Quantitative Assessment

Table 1 lists the PSNR, SSIM, and GMSD values for each comparative method when applied to the testing data. Table 2 presents the computational cost metrics incurred during inference, expressed as the number of GMACs required to process one patch and the time (in seconds) required to process a complete 3D CBCT scan of size 512 × 512 × 512. Bold values in each table indicate the best results. A combined view of the results in Tables 1 and 2 is presented in Fig. 3 to demonstrate the balance between image fidelity and computational efficiency for each method.

The results demonstrate that the proposed HARU-Net delivers the strongest overall performance, achieving the highest PSNR of 37.52 dB, the second-highest SSIM of 0.9557 (i.e., slightly below HAT, which achieves 0.9569), and the lowest GMSD (0.1084) among all comparative methods. These results indicate that HARU-Net provides the best balance between noise suppression and preservation of anatomical detail. Despite its superior performance, HARU-Net maintains a computational cost substantially lower than both Uformer and SwinIR. This trend is clearly illustrated in Fig. 3, which visualizes how PSNR and SSIM performance scale with respect to computational cost. ResU-Net achieves the lowest PSNR and highest GMSD among the compared models, although it demonstrates higher structural similarity to the ground truth than Uformer. Notably, ResU-Net is significantly faster than the transformer-based architectures.

Visual Assessment

We next evaluate the qualitative performance across three representative CBCT slices from two different test samples, extracted from the sagittal, frontal, and axial anatomical views, as shown in Figures 4–6.
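To illustrate how per-model GMAC figures of this kind are typically obtained, the multiply-accumulate count of a single convolution layer can be derived analytically. This is a simplified sketch under stated assumptions: it ignores biases, attention layers, and normalization, and the layer sizes below are hypothetical, not taken from any of the compared networks.

```python
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate count for one 2D convolution layer:
    each of the c_out * h_out * w_out output elements needs
    c_in * k * k multiply-accumulates."""
    return c_in * c_out * k * k * h_out * w_out

# Hypothetical example: a 3x3 conv, 64 -> 64 channels, 256x256 output
macs = conv2d_macs(64, 64, 3, 256, 256)
print(macs / 1e9)  # ≈ 2.42 GMACs for this single layer
```

Summing such counts over every layer of a forward pass gives the total GMACs reported per patch; tools that profile models automate exactly this accounting.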
Figures 4 and 5 present the axial and sagittal views from the same scan, while Fig. 6 shows the frontal view from another scan. We additionally provide zoomed-in views for each figure, focusing on relevant anatomical structures. Overall, ResU-Net, Uformer, SwinIR, and the proposed HARU-Net produce visually coherent and clinically interpretable depictions, effectively reducing noise while maintaining the visibility of key anatomical structures. However, HARU-Net produces the most visually consistent improvements, with clear enhancements in the sharpness of bone boundaries, cortical outlines, and internal trabecular patterns.

In contrast, several of the competing approaches exhibit minor artifacts or structural inconsistencies despite their generally good performance. Although the visual improvements introduced by HARU-Net may appear modest at first glance, they are clearly observable in the zoomed regions and align well with the quantitative gains reported earlier.

Overall, while all deep learning models provide clinically applicable reconstructions, HARU-Net demonstrates the most balanced and visually superior denoising performance, consistently delivering sharper and more anatomically faithful results across different slice orientations.

Discussion

Noise remains one of the major limiting factors in low-dose CBCT imaging, which serves as a primary modality for dental, maxillofacial, and ENT diagnosis. Excessive noise often obscures fine anatomical structures such as root canal morphology and periapical lesions, potentially leading to diagnostic uncertainty or even misdiagnosis (Schulze et al. 2011; Patel et al. 2019; Ríos-Osorio et al. 2024). Moreover, the clarity of bone structure may be reduced by noise, which adversely impacts the planning and placement of implants (Patel et al. 2019; Pauwels et al. 2015b; Camilo et al. 2013).
Figure 4: Comparison of denoising performance on a CBCT slice (sagittal view) from a 3D CBCT scan. Panels: (a) Noisy, (b) ResU-Net, (c) Uformer, (d) SwinIR, (e) HAT, (f) HARU-Net.
Figure 5: Comparison of denoising performance on a CBCT slice (axial view) from a 3D CBCT scan. Panels: (a) Noisy, (b) ResU-Net, (c) Uformer, (d) SwinIR, (e) HAT, (f) HARU-Net.
Figure 6: Comparison of denoising performance on a CBCT slice (frontal view) from a 3D CBCT scan. Panels: (a) Noisy, (b) ResU-Net, (c) Uformer, (d) SwinIR, (e) HAT, (f) HARU-Net.

In addition, cephalometric analysis and airway evaluation are susceptible to noise, making it difficult to measure anatomical landmarks precisely (Chung et al. 2024; Lee et al. 2019). The situation is further aggravated in dynamic CBCT applications, such as TMJ evaluations, where repeated scans are needed to capture joint motion (Bag et al. 2014). Additionally, tissue characterization through dual-energy CBCT suffers from metal artifacts and noise because of the complex decomposition process (Zhu et al. 2019).

While DL has been widely applied for image enhancement and reconstruction in CT and MRI, its use for CBCT image enhancement remains limited, primarily due to the lack of paired low- and high-quality training data. This article addresses this gap by introducing a Hybrid Attention Residual U-Net (HARU-Net), trained using a cadaver-derived CBCT dataset acquired at higher radiation doses. Noisy and clean pairs were generated through a controlled noising process.

The experimental results demonstrate that the proposed HARU-Net substantially improves CBCT denoising performance compared with both the baseline U-Net architecture and state-of-the-art transformer-based models. This enhanced performance stems from the integration of transformer-style HABs at each skip connection and an RHAG at the bottleneck.
These components combine the representational strengths of transformer blocks with the stability and computational efficiency of convolutional feature extraction within a unified architecture. This hybrid design proves particularly effective for CBCT denoising, where noise is spatially varying, signal-dependent, and structurally complex. While transformer-based models such as Uformer and SwinIR rely heavily on computationally intensive attention mechanisms to capture long-range dependencies, HARU-Net uses the strengths of a convolutional backbone to efficiently extract local features, while transformer-style attention blocks provide global context to augment the local representations. This results in superior restoration quality at substantially lower computational cost, highlighting the advantage of integrating selective transformer components within a CNN framework.

The comparison with ResU-Net is particularly interesting because, while it exhibits the lowest computational cost due to its purely convolutional architecture, it also shows the worst denoising performance in terms of PSNR and GMSD. What makes this finding notable is that the incorporation of HABs into the skip connections and an RHAG at the bottleneck effectively elevates the capability of the baseline ResU-Net. These additions introduce only a modest increase in computational complexity yet yield a substantial improvement in denoising performance, demonstrating the effectiveness and efficiency of the proposed architectural enhancements. Regardless, our findings suggest that ResU-Net can serve as a baseline for developing future fast and effective real-time CBCT denoisers.

We further demonstrate the clinical relevance of the proposed method through visual evaluation.
Compared with transformer-dominant architectures, HARU-Net produces reconstructions that more faithfully preserve fine anatomical details without introducing over-smoothing or attention-induced artifacts. It must be stated that the denoised images from Uformer, SwinIR, and ResU-Net yield clinically relevant results, although HARU-Net appears visually superior. These improvements emphasize the value of integrating CNN-based inductive biases with transformer-style attention in a balanced and computationally efficient manner. These results suggest that such hybrid architectures offer a promising direction for developing high-performing yet lighter-weight denoising models for dentistry and broader medical imaging applications.

Although computationally lighter than both Uformer and SwinIR, full-volume inference for a 512 × 512 × 512 CBCT scan requires approximately 2 minutes on a consumer-grade GPU (NVIDIA RTX 2080 Ti). This is substantially faster than Uformer (approximately 4.30 minutes) and SwinIR (approximately 8.85 minutes), yet still falls short of real-time processing, which would facilitate clinical deployment. This limitation highlights the need for further exploration of model compression, architectural pruning, and efficient attention mechanisms to produce an even lighter variant of HARU-Net. Nevertheless, the present findings establish HARU-Net as an effective, computationally efficient, and clinically meaningful advancement toward high-fidelity denoising in low-dose CBCT imaging.

A key limitation of the proposed approach is that the training data originate from a limited sample scanned on a device from a single CBCT vendor, which restricts the model's generalizability across different scanner types. Future work should therefore assess cross-vendor performance and explore strategies such as vendor-specific fine-tuning or domain adaptation to strengthen the model's robustness in clinical settings.
Conclusion

In this work, we have introduced a Hybrid Attention Residual U-Net tailored for denoising CBCT data. By integrating residual convolutional encoding with hybrid attention mechanisms both within the bottleneck and along skip connections, HARU-Net effectively captures both local anatomical detail and global contextual structure. Comprehensive experiments using real CBCT scans demonstrate that HARU-Net consistently outperforms several state-of-the-art denoising models, including SwinIR, Uformer, and residual U-Net baselines, achieving higher PSNR and SSIM and lower GMSD scores, as well as superior perceptual quality across multiple anatomical views. Overall, HARU-Net offers a fast, accurate, and clinically viable solution for enhancing CBCT image quality. Its ability to preserve fine anatomical structures while effectively suppressing noise positions it as a promising candidate for real-time CBCT enhancement workflows in dental, maxillofacial, and ENT imaging. Future work will explore adapting the model for volumetric (3D) denoising, multi-site generalization, and integration with self-supervised or physics-informed learning frameworks.

References

Ajala Funmilola, A.; Oke, O.; Adedeji, T.; Alade, O.; and Adewusi, E. 2012. Fuzzy kc-means clustering algorithm for medical image segmentation. Journal of Information Engineering and Applications, ISSN 22245782:2225–0506.

Bag, A. K.; Gaddikeri, S.; Singhal, A.; Hardin, S.; Tran, B. D.; Medina, J. A.; and Curé, J. K. 2014. Imaging of the temporomandibular joint: An update. World Journal of Radiology 6(8):567.

Biguri, A., and Mukherjee, S. 2023. Advancing the frontiers of deep learning for low-dose 3D cone-beam CT reconstruction. https://sites.google.com/view/icassp2024-spgc-3dcbct/contacts [Accessed: 2025-11-25].

Bryce-Atkinson, A.; De Jong, R.; Marchant, T.; Whitfield, G.; Aznar, M. C.; Bel, A.; and van Herk, M. 2021.
Low dose cone beam CT for paediatric image-guided radiotherapy: Image quality and practical recommendations. Radiotherapy and Oncology 163:68–75.

Camilo, C. C.; Brito-Júnior, M.; Faria-e Silva, A. L.; Quintino, A. C.; Paula, A. F. d.; Cruz-Filho, A. M.; and Sousa-Neto, M. D. 2013. Artefacts in cone beam CT mimicking an extrapalatal canal of root-filled maxillary molar. Case Reports in Dentistry 2013(1):797286.

Chen, X.; Wang, X.; Zhang, W.; Kong, X.; Qiao, Y.; Zhou, J.; and Dong, C. 2023. HAT: Hybrid attention transformer for image restoration. arXiv preprint.

Choi, K. 2021. Self-supervised projection denoising for low-dose cone-beam CT. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 3459–3462. IEEE.

Chung, E.-J.; Yang, B.-E.; Kang, S.-H.; Kim, Y.-H.; Na, J.-Y.; Park, S.-Y.; On, S.-W.; and Byun, S.-H. 2024. Validation of 2D lateral cephalometric analysis using artificial intelligence-processed low-dose cone beam computed tomography. Heliyon 10(21).

Goldman, L. W. 2007. Principles of CT: radiation dose and image quality. Journal of Nuclear Medicine Technology 35(4):213–225.

Haralick, R. M., and Shapiro, L. G. 1985. Image segmentation techniques. Computer Vision, Graphics, and Image Processing 29(1):100–132.

Kak, A. C., and Slaney, M. 2001. Principles of Computerized Tomographic Imaging. SIAM.

Kalender, W. A. 2011. Computed Tomography: Fundamentals, System Technology, Image Quality, Applications. John Wiley & Sons.

Khawaja, A.; Khan, T. M.; Naveed, K.; Naqvi, S. S.; Rehman, N. U.; and Nawaz, S. J. 2019. An improved retinal vessel segmentation framework using Frangi filter coupled with the probabilistic patch based denoiser. IEEE Access 7:164344–164361.

Lee, K.-M.; Davami, K.; Hwang, H.-S.; and Kang, B.-C. 2019. Effect of voxel size on the accuracy of landmark identification in cone-beam computed tomography images. J. Korean Dental Sci 12(1):20–28.
Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; and Timofte, R. 2021. SwinIR: Image restoration using Swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1833–1844.

Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.

Mahesh, K.; Deshpande, P.; and Viveka, S. 2022. Prevalence of artifacts in cone-beam computed tomography: A retrospective study. Journal of Indian Academy of Oral Medicine and Radiology 34(4):428–431.

Naveed, K., and Pauwels, R. 2025. U-Net emulates BM3D for fast, effective denoising of dental CBCTs. In Dental Digitalization Society's Global Congress, Venice, Italy. Dental Digitalization Society (DDS).

Naveed, K.; Ehsan, S.; McDonald-Maier, K. D.; and Ur Rehman, N. 2019. A multiscale denoising framework using detection theory with application to images from CMOS/CCD sensors. Sensors 19(1):206.

Naveed, K.; Daud, F.; Madni, H. A.; Khan, M. A.; Khan, T. M.; and Naqvi, S. S. 2021. Towards automated eye diagnosis: An improved retinal vessel segmentation framework using ensemble block matching 3D filter. Diagnostics 11(1):114.

Naveed, K.; de Freitas, B. N.; and Pauwels, R. 2025. NAADA: A noise-aware attention denoising autoencoder for dental panoramic radiographs. arXiv preprint.

Nuyts, J.; De Man, B.; Fessler, J. A.; Zbijewski, W.; and Beekman, F. J. 2013. Modelling the physics in the iterative reconstruction for transmission computed tomography. Physics in Medicine & Biology 58(12):R63.

Olszewski, R. 2020. Artifacts related to cone beam computed tomography technology (CBCT) and their significance for clinicians: illustrated review of medical literature. Nemesis. Negative Effects in Medical Sciences Oral and Maxillofacial Surgery 11(1):1–29.
Patel, S.; Brown, J.; Pimentel, T.; Kelly, R.; Abella, F.; and Durack, C. 2019. Cone beam computed tomography in endodontics: a review of the literature. International Endodontic Journal 52(8):1138–1152.

Pauwels, R.; Beinsberger, J.; Collaert, B.; Theodorakou, C.; Rogers, J.; Walker, A.; Cockmartin, L.; Bosmans, H.; Jacobs, R.; Bogaerts, R.; et al. 2012. Effective dose range for dental cone beam computed tomography scanners. European Journal of Radiology 81(2):267–271.

Pauwels, R.; Araki, K.; Siewerdsen, J.; and Thongvigitmanee, S. S. 2015a. Technical aspects of dental CBCT: state of the art. Dentomaxillofacial Radiology 44(1):20140224.

Pauwels, R.; Faruangsaeng, T.; Charoenkarn, T.; Ngonphloy, N.; and Panmekiate, S. 2015b. Effect of exposure parameters and voxel size on bone structure analysis in CBCT. Dentomaxillofacial Radiology 44(8):20150078.

Pauwels, R.; Jacobs, R.; Bogaerts, R.; Bosmans, H.; and Panmekiate, S. 2016. Reduction of scatter-induced image noise in cone beam computed tomography: effect of field of view size and position. Oral Surgery, Oral Medicine, Oral Pathology and Oral Radiology 121(2):188–195.

Ríos-Osorio, N.; Quijano-Guauque, S.; Briñez-Rodríguez, S.; Velasco-Flechas, G.; Muñoz-Solís, A.; Chávez, C.; and Fernandez-Grisales, R. 2024. Cone-beam computed tomography in endodontics: from the specific technical considerations of acquisition parameters and interpretation to advanced clinical applications. Restorative Dentistry & Endodontics 49(1).

Scarfe, W. C., and Farman, A. G. 2008. What is cone-beam CT and how does it work? Dental Clinics of North America 52(4):707–730.

Schulze, R.; Heil, U.; Groß, D.; Bruellmann, D. D.; Dranischnikow, E.; Schwanecke, U.; and Schoemer, E. 2011. Artefacts in CBCT: a review. Dentomaxillofacial Radiology 40(5):265–273.

Suzuki, S., et al. 1985. Topological structural analysis of digitized binary images by border following.
Computer Vision, Graphics, and Image Processing 30(1):32–46.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30.

Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; and Li, H. 2022. Uformer: A general U-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17683–17693.

Yun, S.; Jeong, U.; Kwon, T.; Choi, D.; Lee, T.; Ye, S.-J.; Cho, G.; and Cho, S. 2023. Penalty-driven enhanced self-supervised learning (Noise2Void) for CBCT denoising. In Medical Imaging 2023: Physics of Medical Imaging, volume 12463, 464–469. SPIE.

Yunker, A. A.; Kettimuthu, B. R.; and Roeske, C. J. C. 2024. Low dose CBCT denoising using a 3D U-Net. In 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 85–86. IEEE.

Zanini, L. G. K.; Rubira-Bullen, I. R. F.; and dos Santos Nunes, F. d. L. 2024. Enhancing dental caries classification in CBCT images by using image processing and self-supervised learning. Computers in Biology and Medicine 183:109221.

Zhao, X.; Wang, X.; Du, Y.; and Peng, Y. 2025. CBCT-IDDNet: a three-dimensional Res-UNet based image domain denoising network for clinical dose cone-beam computed tomography, winner of the International Conference on Acoustics, Speech, and Signal Processing 2024 challenge. Quantitative Imaging in Medicine and Surgery 15(10):9844.

Zhu, L.; Chen, Y.; Yang, J.; Tao, X.; and Xi, Y. 2019. Evaluation of the dental spectral cone beam CT for metal artefact reduction. Dentomaxillofacial Radiology 48(2):20180044.