The Sim-to-Real Gap in MRS Quantification: A Systematic Deep Learning Validation for GABA

Zien Ma a, S. M. Shermer b, Oktay Karakuş a, Frank C. Langbein a,∗

a School of Computer Science and Informatics, Cardiff University, Cardiff, CF24 4AG, United Kingdom
b Faculty of Science and Engineering, Swansea University, Swansea, SA2 8PP, United Kingdom

Abstract

Magnetic resonance spectroscopy (MRS) is used to quantify metabolites in vivo and estimate biomarkers for conditions ranging from neurological disorders to cancers. Quantifying low-concentration metabolites such as GABA (γ-aminobutyric acid) is challenging due to low signal-to-noise ratio (SNR) and spectral overlap. We investigate and validate deep learning for quantifying complex, low-SNR, overlapping signals from MEGA-PRESS spectra, devise a convolutional neural network (CNN) and a Y-shaped autoencoder (YAE), and select the best models via Bayesian optimisation on 10,000 simulated spectra from slice-profile-aware MEGA-PRESS simulations. The selected models are trained on 100,000 simulated spectra. We validate their performance on 144 spectra from 112 experimental phantoms containing five metabolites of interest (GABA, Glu, Gln, NAA, Cr) with known ground-truth concentrations across solution and gel series acquired at 3 T under varied bandwidths and implementations. These models are further assessed against the widely used LCModel quantification tool. On simulations, both models achieve near-perfect agreement (small MAEs; regression slopes ≈ 1.00, R² ≈ 1.00). On experimental phantom data, errors initially increased substantially. However, modelling variable line widths in the training data significantly reduced this gap. The best augmented deep learning models achieved a mean MAE for GABA over all phantom spectra of 0.151 (YAE) and 0.160 (FCNN) in max-normalised relative concentrations, outperforming the conventional baseline LCModel (0.220).
A sim-to-real gap remains, but physics-informed data augmentation substantially reduced it. Phantom ground truth is needed to judge whether a method will perform reliably on real data.

Keywords: Magnetic Resonance Spectroscopy, GABA, MEGA-PRESS, Deep Learning, Bayesian Model Selection, Domain Shift, Phantom Validation

∗ Corresponding author. Email address: LangbeinFC@cardiff.ac.uk (Frank C. Langbein)

1. Introduction

Magnetic resonance spectroscopy (MRS) is a non-invasive technique for quantifying metabolite concentrations in vivo, providing insights into cellular metabolism that can support diagnosis and monitoring across neurological and oncological diseases [1, 2, 3, 4, 5]. A biomarker of particular clinical interest is γ-aminobutyric acid (GABA), the primary inhibitory neurotransmitter in the brain, whose dysregulation is implicated in numerous psychiatric and neurological disorders [1, 6, 7, 8, 9, 10, 11]. Accurate quantification of low-concentration metabolites such as GABA is, however, challenging: the weak GABA signal is partially obscured by stronger resonances from more abundant compounds such as N-acetylaspartate (NAA) and creatine (Cr), leading to substantial spectral overlap. Edited MRS techniques such as MEGA-PRESS (Mescher-Garwood point-resolved spectroscopy editing) are therefore commonly employed to isolate the GABA signal [12, 13, 14]. While spectral editing improves specificity, the required subtraction of edit-OFF and edit-ON (hereafter OFF and ON) acquisitions reduces the signal-to-noise ratio (SNR) and may introduce artefacts, creating additional challenges for reliable, unbiased quantification.

Deep learning (DL) has been applied to address these problems, potentially improving accuracy and reducing expert-driven parameter tuning [15]. In practice, however, DL methods face a major obstacle: the "sim-to-real" gap between simulated training data and experimental measurements.
Training robust models requires large, labelled datasets. In vivo data are costly to acquire at scale, subject to ethical and logistical constraints, and crucially lack ground-truth metabolite concentrations, precluding fully supervised training and rigorous evaluation [3, 16, 17]. Phantom datasets provide known concentrations but are expensive and time-consuming to prepare, calibrate (e.g. pH, temperature, relaxation), and scan under multiple conditions; covering all concentration combinations, line widths, and sequence variants would be impractical [18, 19]. As a result, most DL models are trained and selected on large collections of simulated spectra [20, 21, 22].

Simulations typically assume idealised hardware, controlled acquisition parameters, and simplified baselines. Experimental spectra, instead, show variations and artefacts from scanner-specific implementations, B0/B1 inhomogeneities, imperfect water suppression, and subtraction-related baseline distortions [19, 23, 24]. Models optimised exclusively on simulated data can therefore overfit to unrealistic training distributions and exhibit substantial estimation bias when applied to experimental spectra. Recent studies have highlighted this concern, reporting strong performance on simulations but degraded accuracy and calibration under domain shift [22, 25].

In this work, we directly address this validation challenge through a systematic investigation of DL-based quantification of GABA and related metabolites from MEGA-PRESS spectra. While GABA is our primary target, we simultaneously quantify NAA, Cr, glutamate (Glu), and glutamine (Gln), which are integral to brain metabolism and function [26]. Building on MRSNet [21], we pursue three main objectives.
Firstly, we develop two complementary architectures for multi-metabolite regression: a convolutional neural network (CNN) that captures local spectral features and a Y-shaped autoencoder (YAE) that learns a denoised latent representation. Secondly, we perform systematic model selection via Bayesian optimisation on a slice-profile-aware simulated dataset (five-fold cross-validation on 10,000 spectra) to identify the best-performing configurations of these models, and include several established architectures from the literature as comparative baselines (using their published configurations with minimal adaptation to our MEGA-PRESS pipeline). Thirdly, we assess the performance of the best DL models by validating them on 144 spectra from 112 experimental phantoms with known ground-truth concentrations (solutions and tissue-mimicking gels acquired at 3 T containing GABA, Glu, Gln, Cr, NAA and no macromolecule background or lipid signal), and comparing their performance to the widely used LCModel tool (applied to OFF spectra, as it provided the better quantification results on the phantom data).

For the baseline (fixed-line-width, non-augmented) models, across all 144 phantom spectra, the mean MAE over all spectra for GABA was 0.161 (MRSNet-YAE), 0.203 (MRSNet-CNN), 0.167 (FCNN), and 0.220 (LCModel-OFF). Our results show that both architectures achieve near-perfect agreement with ground truth on simulated data. On experimental phantoms, initial models showed a substantial sim-to-real gap. However, by incorporating realistic variability in spectral line widths into the training data, we significantly improved robustness. The augmented models outperformed the conventional LCModel baseline for GABA and Glu; with linewidth augmentation, the best DL models achieved a GABA MAE over all phantom spectra of 0.151 (YAE) and 0.160 (FCNN), outperforming LCModel-OFF (0.220), even if the sim-to-real gap could not be closed.
Physics-informed simulation such as linewidth augmentation is, hence, important for DL methods to generalise from simulation to experiment. We argue that validation on phantoms with known concentrations should precede in vivo use of any new quantification method.

Section 2 reviews conventional and machine learning-based quantification methods in MRS. Section 3 details the simulated and experimental datasets and preprocessing. Section 4 describes our DL architectures, and Section 5 outlines the Bayesian optimisation-based model selection. Section 6 presents the experimental results; Section 7 concludes.

2. Related Work

The quantification of metabolites from MRS spectra, particularly low-concentration compounds like GABA, has motivated a wide range of analytical methods [16, 27, 28]. These fall into two main paradigms: (i) conventional model-based fitting with explicit basis functions and (ii) data-driven machine learning, including deep learning (DL). We focus on methods for edited spectra such as MEGA-PRESS and summarise their main developments and limitations, in particular interpretability, uncertainty quantification, and behaviour under domain shift.

2.1. Conventional Quantification Methods

Conventional quantification in MRS is dominated by peak fitting and spectral basis set methods. Peak fitting models spectral peaks using analytical lineshapes (e.g., Gaussian, Lorentzian, Voigt), adjusting parameters like amplitude and phase to match the data and estimating metabolite concentrations from peak areas. While some tools, such as LWFIT [18], use model-free integration over fixed frequency intervals for robustness, others, such as GANNET [29] and its successor Osprey [30], specialise in edited spectra like MEGA-PRESS by fitting Gaussian models to target peaks. Although effective for well-separated signals, peak fitting accuracy is often compromised by spectral overlap and baseline distortions.
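The peak-fitting idea can be sketched as follows (a minimal, hypothetical example, not the implementation of any of the tools above): fit an analytical Lorentzian to a single synthetic resonance and quantify it via the closed-form peak area.

```python
# Sketch of peak-fitting quantification: fit a Lorentzian lineshape to one
# resonance and take the analytic area as the concentration-proportional value.
# All signal parameters here are arbitrary illustration choices.
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(f, amp, f0, hwhm):
    """Lorentzian lineshape; the area under the curve is amp * pi * hwhm."""
    return amp * hwhm**2 / ((f - f0)**2 + hwhm**2)

rng = np.random.default_rng(0)
f = np.linspace(-20.0, 20.0, 2001)                 # frequency axis in Hz
y = lorentzian(f, 1.0, 2.5, 1.0)                   # true peak: FWHM = 2 Hz
y += rng.normal(0.0, 0.01, f.size)                 # low-level measurement noise

popt, _ = curve_fit(lorentzian, f, y, p0=(0.5, 0.0, 2.0))
amp, f0, hwhm = popt
area = amp * np.pi * hwhm                          # closed-form Lorentzian area
print(f"f0={f0:.2f} Hz, FWHM={2 * hwhm:.2f} Hz, area={area:.2f}")
```

As the narrative above notes, this works well for isolated peaks; with overlapping resonances the single-peak model becomes biased, which is what motivates basis-set fitting.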
Basis set fitting methods model the acquired spectrum as a linear combination of basis spectra, which are pre-acquired from phantom experiments or generated from quantum-mechanical simulations. Using fitting algorithms such as constrained nonlinear least-squares, these methods determine the relative contributions of each metabolite. A variety of such tools exist, operating either in the frequency domain (e.g., LCModel [31, 32], INSPECTOR [33], VESPA [34], JMRUI [35], AQSES [36], AMARES [37]) or the time domain (e.g., QUEST [38], TARQUIN [39, 40]), as reviewed in [41]. Recent toolboxes also offer Bayesian posterior estimates over concentrations (e.g., FSL-MRS [42]), improving uncertainty characterisation relative to point estimates.

Despite their widespread use, these methods face several challenges. Simulated basis sets may not fully capture the nuances of experimental spectra [43], and performance can degrade with low signal-to-noise ratio or increased spectral line width [44, 45, 46]. Experimentally acquired basis sets can improve accuracy, but their generation is laborious and requires significant human expertise for preprocessing and parameter tuning, introducing operator-dependent variability [17]. Furthermore, challenges such as macromolecule contamination and ensuring consistency across different modelling choices persist [47, 48].

2.2. Machine Learning Methods

Machine learning, and particularly deep learning (DL), has been proposed as an alternative that can automate analysis and reduce dependence on manual tuning and operator-dependent variability. Models trained on large-scale simulated datasets have been shown to learn relevant features and predict metabolite concentrations directly from spectral data [21, 22, 25, 49, 50, 51, 52, 53, 54, 55]; other work uses DL for denoising before conventional linear least-squares (LLS) or linear combination model (LCM) quantification [49, 50, 54, 55, 56].
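The linear-combination step shared by basis-set fitting and the denoise-then-LCM pipelines can be sketched as follows (a toy illustration with synthetic single-peak "basis spectra", not the implementation of any cited tool); non-negative least squares keeps the estimated relative concentrations physically plausible:

```python
# Sketch of linear combination model (LCM) quantification with synthetic data.
# The Gaussian "basis spectra" and concentrations are illustration choices.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_points, n_metab = 512, 5

# Hypothetical basis set: one well-separated Gaussian resonance per metabolite.
x = np.linspace(0.0, 1.0, n_points)
centres = np.linspace(0.2, 0.8, n_metab)
basis = np.stack([np.exp(-((x - c) / 0.02) ** 2) for c in centres], axis=1)

# Mixture spectrum = basis @ concentrations + noise.
c_true = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
spectrum = basis @ c_true + rng.normal(0.0, 0.01, n_points)

# Non-negative least squares recovers the relative concentrations.
c_est, _ = nnls(basis, spectrum)
print(np.round(c_est, 2))
```

With heavily overlapping basis spectra or a distorted baseline, the same solve becomes ill-conditioned, which is the failure mode discussed above.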
We group existing DL-based quantification strategies into three categories. A review of machine learning applications in MRS can be found in [57].

Direct Regression Approaches: The most direct application of DL frames quantification as an end-to-end regression task, where a network, typically a CNN, learns to map raw spectral data directly to metabolite concentrations [22, 25, 53]. Variations on this theme include using handcrafted wavelet scattering features to provide more robust inputs to a shallow regression network [58]. These models are computationally efficient and can be designed to provide uncertainty estimates. However, because they do not explicitly model the physics of spectral formation, they can lack interpretability and are prone to overfitting, especially in low-SNR regimes. Their estimation bias is often unknown, as most have not been validated against experimental data with ground-truth concentrations.

Physics-Informed and Hybrid Models: To improve interpretability and robustness, other approaches integrate physical knowledge of spectral composition into the DL framework. One strategy involves using a CNN to predict a clean, metabolite-only spectrum, which is then quantified using a fixed, non-trainable solver such as multivariable linear regression against a predefined basis set [54, 55, 59, 60]. A key limitation here is that errors from the quantification step cannot be backpropagated to train the feature extractor. More advanced methods overcome this by incorporating a fully differentiable solver into the network architecture. For example, Q-Net [49] embeds a differentiable least-squares solver, while PHIVE [61] integrates a differentiable LCM into a variational autoencoder, enabling end-to-end training and robust uncertainty estimation. These models more tightly link representation learning with the quantification task, but are more complex to train and require access to accurate basis sets.
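Why a least-squares solve is differentiable can be shown directly: for a fixed basis matrix B, the solution c(s) = B⁺s is linear in the input spectrum s, so the Jacobian needed for backpropagation is exactly the pseudoinverse. A small numpy check (illustrative only; Q-Net and PHIVE embed this idea inside autodiff frameworks rather than computing gradients by hand):

```python
# Sketch: gradients flow through a least-squares quantification layer because
# c(s) = pinv(B) @ s is linear in s. B and s here are random stand-ins.
import numpy as np

rng = np.random.default_rng(2)
n_points, n_metab = 64, 3
B = rng.normal(size=(n_points, n_metab))   # hypothetical basis matrix
s = rng.normal(size=n_points)              # network-predicted "clean spectrum"

pinv = np.linalg.pinv(B)
c = pinv @ s                               # least-squares concentration estimate

# Finite-difference check of one Jacobian entry, dc[0]/ds[k], against the
# analytic value pinv[0, k].
k, eps = 5, 1e-6
s_pert = s.copy()
s_pert[k] += eps
fd = ((pinv @ s_pert)[0] - c[0]) / eps
print(abs(fd - pinv[0, k]) < 1e-6)         # analytic and numeric gradients agree
```

This is what allows a concentration loss to train the upstream feature extractor end-to-end, in contrast to the fixed, non-trainable solvers described first.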
Heuristic and Partially Differentiable Models: A third category of methods uses DL for feature extraction or parameter estimation but relies on fixed or heuristic components for the final quantification. For instance, some models use a denoising autoencoder to learn a robust spectral representation but decode it using a fixed LCM with a standard basis set [56]. Other encoder-decoder designs extend this approach. For example, Zhang et al. proposed a model using WaveNet blocks and an attention-based GRU to learn a robust latent representation from multi-echo JPRESS data, from which it simultaneously predicts concentrations, reconstructed FIDs, and phase parameters in an end-to-end manner [62]. Others use a CNN to estimate spectral parameters (e.g., peak locations and widths) and then reconstruct the spectrum using a fixed but differentiable physical model, such as a sum of Lorentzian functions [63]. A different approach is taken by NMRQNet [64], which uses a recurrent network to predict spectral parameters and then refines them with a non-differentiable stochastic optimisation algorithm. While these methods can enforce physical plausibility, the separation between the learned and fixed components prevents global error minimisation and limits end-to-end optimisation.

Conventional LCM methods remain interpretable and can provide uncertainty estimates but depend on accurate basis sets and can be sensitive to linewidth and baseline [45, 46]. Direct-regression DL is fast and flexible but often lacks calibrated uncertainty and has rarely been validated on experimental ground truth [22, 52]. Physics-informed and differentiable hybrids improve interpretability and gradient flow [49, 61] but still rely on basis accuracy and remain under-validated on phantoms. Recent reviews of ML for proton MRS (e.g. Vandesande et al.
[57]) describe progress in denoising, quantification, and uncertainty, and note remaining concerns about generalisation and bias under domain shift [52]. Many DL studies report strong performance on simulations or on in vivo data without ground truth; few provide validation on experimental phantoms with known concentrations.

2.3. Limitations of Existing Work and Our Contribution

Conventional and machine learning methods suffer from a lack of validation against known ground truth. Many DL-based studies rely exclusively on simulated data for evaluation, using pre-set concentrations as ground truth [22, 25, 53, 58]. Others use in vivo data, which lacks ground truth, and instead compare results to established methods such as LCModel [49, 51, 54, 55]. As recent work has highlighted, models validated solely on synthetic data often overfit and exhibit significant estimation bias when applied to real-world spectra [22, 25]. Thus, neither approach provides a true measure of a model's practical utility.

Our work addresses this validation gap. We evaluate DL models for MEGA-PRESS quantification systematically: Bayesian model selection on simulations, then validation on 144 phantom spectra with known concentrations. We develop two distinct architectures, a CNN and a novel Y-shaped autoencoder (YAE), and optimise them through Bayesian hyperparameter search on a large simulated dataset. The main focus is validation of the selected models on experimental phantom spectra and comparison with LCModel (Section 3). We compare our models to ground truth and to the widely used LCModel to assess the capabilities and limitations of deep learning for MEGA-PRESS quantification. Without validation on phantoms with known concentrations, one cannot judge whether a new method will perform reliably on real data.

3.
Simulation, Phantom Experiments, and Evaluation Metrics

Our study focuses on quantifying five metabolites (GABA, Glu, Gln, NAA, and Cr) from edited spectra acquired with the MEGA-PRESS pulse sequence, which yields OFF and ON acquisitions and their difference (DIFF; the ON minus OFF difference spectrum). GABA is the primary target; Glu and Gln are key excitatory neurotransmitters; NAA and Cr serve as prominent reference metabolites whose distinct peaks support calibration and relative quantification. Concentrations are often reported relative to NAA or Cr. Below, we describe the simulated and experimental datasets, preprocessing, and evaluation metrics.

3.1. Simulated Datasets

To train and validate the models, we generated simulated spectra by taking a weighted sum of simulated basis spectra for each metabolite. OFF and ON spectra for each metabolite were generated using custom MATLAB code and the FID-A toolbox [20]. The simulations are based on Hamiltonian models for the molecules of interest and involve solving the time-dependent Schrödinger equation for the MEGA-PRESS pulse sequence, calculating the predicted time-domain echo signals, and adding line broadening to enhance realism. For some molecules, competing Hamiltonian models exist, including the Govindaraju, Kaiser, and Near models [65, 66, 67]. In this work, we used the Near model for GABA because the spectra it produces most closely match our experimental data, and there is a consensus that it is the preferred model [17].

Spatial localisation for MEGA-PRESS spectra is commonly achieved using a technique similar to PRESS (Point RESolved Spectroscopy). This process begins with an initial radiofrequency pulse that excites a slab-shaped region, providing localisation in one dimension. This is followed by two subsequent refocusing pulses, each designed to selectively refocus spins within a slab perpendicular to the previously excited slab and to each other.
Consequently, only spins in the intersection of these three orthogonal slabs experience the complete sequence and contribute to the final signal. Crusher gradients further enhance voxel localisation. Slice-selective excitation is achieved by applying finite-bandwidth modified sinc pulses concurrently with gradients. Many simulations neglect this, implementing excitation and refocusing pulses as ideal 90° and 180° rotations, respectively. However, this ignores the non-uniformity of excitation and refocusing pulse profiles, especially for large voxel sizes. To obtain more realistic spectra, we performed simulations on a 2D spatial grid to more accurately model the excitation profiles of the refocusing pulses, thereby improving the realism of the simulated spectra. Finally, phase cycling was implemented, cycling over two phases (0°, 90°) each for the first editing and the refocusing pulses, and four phases (0°, 90°, 180°, 270°) for the second editing pulse, resulting in 2 × 2 × 2 × 4 = 32 simulation runs for each point on the spatial grid, and a total of 32 × 8 × 8 = 512 simulations for each metabolite spectrum. To simulate both the ON and OFF spectra for each metabolite therefore requires 1,024 simulation runs. See Figure 1 and mrsnet/simulators/fida/run_custom_simMegaPRESS_2D.m in [68] for details.

The simulation produces a complex time-domain signal corresponding to the time-evolution of the x and y components of the transverse magnetisation, which is Fourier-transformed to obtain a spectrum. In practice, the time-domain signal is multiplied by an exponential envelope to simulate signal attenuation due to T1 and especially T2 relaxation effects. The decay rate of the exponential function controls the linewidth of the simulated spectra. Lorentzian fits of the NAA peak in 144 experimental MEGA-PRESS OFF spectra yielded a median linewidth of just under 2 Hz and a mean just over 2 Hz (full width at half maximum, FWHM).
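This line-broadening step can be sketched as follows (a minimal illustration, not the FID-A implementation; the 100 Hz resonance offset is an arbitrary choice): multiplying the time-domain signal by exp(−π·Δν·t) yields, after Fourier transform, an absorption Lorentzian with FWHM Δν, which we verify numerically for a 2 Hz target.

```python
# Sketch of exponential line broadening: the envelope decay rate sets the
# Lorentzian FWHM of the resulting spectrum.
import numpy as np

fs = 2000.0                       # sampling frequency in Hz (dwell time 0.5 ms)
n = 1 << 16                       # long acquisition for a fine FFT grid
t = np.arange(n) / fs
f0, target_fwhm = 100.0, 2.0      # arbitrary offset; 2 Hz FWHM as in the text

fid = np.exp(2j * np.pi * f0 * t)            # undamped resonance
fid *= np.exp(-np.pi * target_fwhm * t)      # exponential envelope -> Lorentzian

spec = np.fft.fft(fid).real                  # absorption lineshape (zero phase)
freq = np.fft.fftfreq(n, d=1 / fs)

# Measure the full width at half maximum from the spectrum.
above = freq[spec >= spec.max() / 2]
width = above.max() - above.min()
print(round(width, 2))                       # close to the 2.0 Hz target
```

The same relation is used in reverse below: fitting a Lorentzian to the NAA peak of the phantom spectra gives the experimental linewidths that the simulation is matched to.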
Based on this, we chose a line width of 2 Hz FWHM for the simulated basis spectra as a realistic but slightly conservative value. Estimated linewidths across the 144 phantom spectra span approximately 1 Hz to 10 Hz FWHM (see the sim-to-real repository [69]); we use this range when assessing variable line width augmentation in Section 6.2. Consistently, our sim-to-real analysis on the phantom series [69] indicates that simulations with a fixed 2 Hz FWHM basis provide the closest overall match in spectral magnitude and linewidth to the phantom data, and allowing the simulated line width to vary did not materially reduce the sim-to-real gap in these spectral similarity metrics. In Section 6.2, we show that linewidth augmentation can improve quantification performance.

3.2. Concentration Sampling and Noise Injection

We adopt the same strategy for generating synthetic spectra as in our previous work [21], where training and validation samples are constructed by taking linear combinations of individual metabolite signals from a given basis set. Each metabolite is assigned a scaling factor within the range [0, 1], representing its relative concentration. These concentration values are sampled using a Sobol sequence: a low-discrepancy, quasi-Monte Carlo method that ensures uniform coverage of the high-dimensional concentration space even with a relatively small number of samples. Sobol sampling gives more uniform coverage of the concentration space than random sampling, which helps the models cope with diverse spectral mixtures.

Time-domain Gaussian noise with zero mean and standard deviation σ sampled uniformly in [0, 0.03] is added to all simulated spectra, whose signal amplitudes are normalised so the maximum spectral peak equals 1.0. This noise level reflects realistic scanner-induced variability and was previously shown to closely approximate noise observed in experimental phantom spectra [18, 21].
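The sampling and mixing scheme can be sketched as follows (a simplified, self-contained illustration: the random stand-in "basis spectra" are not the slice-profile-aware simulations, and noise is added directly to the normalised spectra rather than in the time domain):

```python
# Sketch of Sobol-sampled concentrations, linear mixing, max-normalisation,
# and per-spectrum Gaussian noise with sigma ~ U[0, 0.03].
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(3)
n_points, n_metab, n_samples = 2048, 5, 8

# Hypothetical stand-in basis spectra (random curves, illustration only).
basis = np.abs(np.fft.ifft(rng.normal(size=(n_metab, n_points)), axis=1))

# Low-discrepancy concentrations in [0, 1]^5 via a scrambled Sobol sequence.
sampler = qmc.Sobol(d=n_metab, scramble=True, seed=7)
conc = sampler.random(n_samples)                  # shape (8, 5)

spectra = conc @ basis                            # linear combinations
spectra /= spectra.max(axis=1, keepdims=True)     # maximum spectral peak = 1.0

sigma = rng.uniform(0.0, 0.03, size=(n_samples, 1))
spectra_noisy = spectra + rng.normal(size=spectra.shape) * sigma
print(spectra_noisy.shape)
```

Drawing sample counts that are powers of two keeps the Sobol sequence balanced, which is one reason quasi-Monte Carlo sampling covers the concentration space more evenly than independent uniform draws.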
The range was guided by expert analysis of spectral regions that contain no identifiable metabolic signal, capturing acquisition-related baseline fluctuations. Beyond mimicking experimental conditions, randomised noise both improves realism and reduces overfitting to noise-free inputs. We verified the interval on our 144 phantom spectra by estimating σ from signal-free regions only in the phantom–simulation residual: median σ ≈ 0.014, with ≈ 80% within [0, 0.03]. The 30 spectra with σ > 0.03 come from the gel series E4, E9, and E11 (different bandwidths or reacquired later) and the solution series E6; E6 and E11 are the noisiest. These series were included to test performance across SNR.

For the main training runs reported in this paper, the simulated data use the above setup: slice-profile-aware basis spectra with fixed 2 Hz linewidth, Sobol-sampled concentrations, time-domain Gaussian noise in [0, 0.03], and the same B0 alignment, ppm range, and amplitude normalisation as in the common export pipeline (Section 3.4). We did not apply explicit augmentation for frequency drift, higher-order phase errors, variable line width beyond the fixed simulation value, synthetic rolling baselines, or non-Gaussian noise. The impact of more aggressive augmentation, specifically using varied line widths, is explored in Section 6.2 and further discussed in Section 7.

(a) Sequence Diagram for MEGA-PRESS. (b) Simulation Grid (2D) for a cubic voxel with a = 3 cm.

Figure 1: The sequence diagram (a) shows the RF and gradient pulses with actual pulse shapes and timings used in the simulations. The initial excitation pulse is modelled as an ideal (instantaneous) slice-selective 90° pulse and the corresponding slice selection gradient Gz is therefore omitted. The excitation pulse excites a slice of thickness 3 cm perpendicular to the z-axis.
The refocusing pulses refocus the magnetisation of 3 cm thick slabs perpendicular to the x and y axes, respectively, to define the localised voxel. What differentiates the MEGA-PRESS sequence from the standard PRESS sequence is the presence of two 20 ms frequency-selective Gaussian editing pulses (yellow and green) at 1.9 ppm for the ON acquisition. For the OFF spectra the editing pulses could in principle be omitted, but the simulation follows the experimental implementation, where editing pulses at 7.5 ppm, which have no effect on the metabolites of interest, are applied instead. The readout of the signal starts at 68 ms as indicated. Experimentally, it can last over 1 s, depending on the dwell time and the number of samples acquired. For a bandwidth of 2000 Hz, the dwell time is 0.5 ms, and acquiring N = 2048 samples, a typical signal length, would therefore require 2048 × 0.5 ms = 1.024 s. The readout block in the diagram is truncated at 80 ms for clarity, to show the RF pulses and timings. To account for imperfect slice profiles of the refocusing pulses, the spectra are simulated on a spatial grid (b) and the average over all positions is calculated.

3.3. Experimental Dataset

In addition to validation on simulated spectra, we evaluate our models' performance on experimental spectra from test objects with known metabolite concentrations, ranging from buffered metabolite solutions to tissue-mimicking gel phantoms. These phantoms contain only the specified metabolites, and no macromolecule (MM) or lipid signal, unlike in vivo spectra.

Several sets of experiments were conducted, ranging from buffer solutions with a few metabolites to tissue-mimicking gel phantoms with several relevant metabolites. All phantoms were prepared in-house at the Institute for Life Science at Swansea University.
For the solution datasets, the general procedure was to prepare a pH-neutral phosphate buffer solution and add fixed amounts of various metabolites to obtain a base solution. Some of the base solution was then used to prepare "spiked" solutions with high concentrations of a single metabolite, typically GABA, our primary interest. The solution series were then generated by incrementally removing a small amount of solution from the phantom and replacing it with the same volume of a spiked solution. This procedure allows for increasing the concentration of a single metabolite in small increments without changing the baseline concentration of the others, providing more precise control over concentrations than if the solutions were prepared independently. After each addition of spiked solution, a new OFF and ON spectrum was acquired.

The gel phantoms were obtained similarly by preparing base solutions and adding small amounts of spiked metabolite solutions to vary concentrations. However, separate gel phantoms were prepared for each combination of concentrations, as incremental addition is not feasible for gels. To create gels, a gelling agent (Agar Agar, 1 g per 100 ml) was added to each solution, and the mixture was heated to approximately 95 °C. The hot solutions were then transferred to suitable moulds and allowed to cool to room temperature and solidify before scanning.

The composition of the phantoms was intentionally varied in complexity. In total, 144 spectra were obtained from 112 phantoms (some phantoms, e.g. E4 and E9, were scanned multiple times, yielding more spectra than physical objects). Solution series E1 had the simplest composition, with only fixed concentrations of NAA and Cr and increasing amounts of GABA, and no Glu or Gln. Solution series E2 was similar but deliberately miscalibrated with a low pH to test the models' ability to cope with miscalibrated data. Solution series E3 was similar to E1 but included fixed amounts of Glu and Gln.
Series E4 consisted of gel phantoms with fixed amounts of NAA, Cr, Glu, and Gln, and varying amounts of GABA, with concentrations similar to E3. The gel phantoms were scanned four times. The first two rounds (E4a and E4b) were acquired on the same day with the same sequence but two different acquisition bandwidths. The phantoms were scanned again with the same sequence and bandwidths a week later to acquire more data and assess any deterioration over time (E4c, E4d). E6 and E7 were solution series similar in composition to E3, while E8, E9, E11, and E14 were gel phantoms similar to E4. Not all available experimental series were included, as some involved additional metabolites for other projects. A summary of the metabolite concentrations for the experimental series is in Table 1, and further details on phantom design can be found in [18].

All spectra were acquired on a Siemens MAGNETOM Skyra 3T MRI system at Swansea University's Clinical Imaging Unit. Datasets E1 to E4 used the Siemens WIP MEGA-PRESS implementation for GABA editing, while the remaining datasets (E6, E7, E8, E9, E11, and E14) were acquired with a widely used implementation from the University of Minnesota [70]. As GABA is our main metabolite of interest, most experimental series kept the concentrations of NAA, Cr, Gln, and Glu constant while gradually increasing the GABA concentration, except for E9, where all metabolite concentrations were varied. All spectra were acquired with TE = 68 ms and TR = 2000 ms, with 160 averages and N = 2048 time samples per spectrum. The sampling frequency was 1250 Hz for experiments E1, E2, E3, E4a, and E4c, and 2000 Hz for E4b, E4d, E6, E7, E8, E9a, E9b, E11, and E14. These parameters were chosen as they are generally optimal for GABA editing. The sampling frequencies are determined by the dwell time of each measurement.
Shorter dwell times allow for faster data acquisition and a broader spectral frequency range, but also reduce the signal-to-noise ratio (SNR). 2000 Hz (dwell time 0.5 ms) is the most common sampling frequency at 3 T, but 1250 Hz (dwell time 0.8 ms) still covers the frequency range of interest with slightly better SNR, which is why both values were used.

For every MEGA-PRESS run, the acquired time-series data was Fourier-transformed, frequency- and phase-corrected, and averaged to create three spectra: OFF, ON, and DIFF. This was done using a combination of vendor-supplied scanner software and in-house MATLAB code. Acquisition parameters and reconstruction details are described in [18].

3.4. Preprocessing the Datasets

All experimental (phantom) and simulated spectra are processed with a harmonised pipeline so that inputs presented to the models share an identical ppm axis, spectral resolution and scaling. Only the data-origin steps differ; the subsequent export to model inputs is identical across datasets.

For experimental spectra only, the complex free-induction decays (FIDs) for the OFF and ON acquisitions are read. Processing proceeds in the time and then the frequency domain:

1. Apodization: no windowing is applied, as it did not improve downstream performance in our data (Hamming/Hanning windowing was tried).

2. Phase handling: when real/imaginary channels are used, we explored applying a linear phase correction (constant and linear terms in frequency) estimated by minimising the imaginary component or the spectral entropy of the real part. As no significant improvement was observed, we did not use this for training the models.

3. Water-peak attenuation: around the water resonance (4.75 ppm ± 0.75 ppm window), large outliers are clamped towards the local median in the magnitude spectrum to reduce residual water without altering phase.

4.
B0 alignment across acquisitions: a single frequency shift (ppm) is estimated per spectrum from a reference metabolite. We examine NAA (2.01 ppm) and Cr (3.015 ppm) within narrow windows (±0.25 ppm) and use Jain's method [71] to estimate the exact peak location in the Fourier spectrum. The reference peak with higher prominence is used, and the resulting shift is applied uniformly to OFF, ON and DIFF. Residual misalignment after resampling is negligible at the chosen resolution.

5. Difference spectrum: DIFF is computed as ON − OFF after the above steps to ensure consistency with aligned acquisitions.

Simulated spectra required less pre-processing. For consistency with the experimental spectra, we apply B0 alignment in the same way as for the experimental spectra after noise has been added, and DIFF is recomputed. No water suppression or additional phase correction is applied.

Table 1: Composition of the experimental data: 144 spectra from 112 phantoms (the phantom count is the sum of the first number in each row, ignoring the repetition factor when given as, e.g., 1 × 2 or 8 × 4). For E4, 8 × 4 denotes 8 gel phantoms each scanned in four acquisition rounds (E4a–E4d). Concentrations are in mM = mmol/L. E1 contains a phantom with NAA only and a phantom with NAA and Cr only, followed by phantoms with increasing amounts of GABA (Cr indicated by 0/8.0 where applicable). E4 phantoms were acquired four times with different bandwidths (E4a/c: 1250 Hz, E4b/d: 2000 Hz) and reacquired one week later (E4c/d). E9 consists of 7 phantoms with varying concentrations of all metabolites; one phantom was scanned three times (1 × 3) and the rest twice (1 × 2), so that E9a contains one more spectrum than E9b.

Series | Medium   | # Phantoms | NAA  | Cr    | Glu  | Gln  | GABA
E1     | Solution | 13         | 15.0 | 0/8.0 | 0.0  | 0.0  | 0.00, 0.52, 1.04, 1.56, 2.07, 2.59, 3.10, 4.12, 6.15, 8.15, 10.12, 11.68
E2     | Solution | 1          | 15.0 | 0.0   | 0.0  | 0.0  | 0.0
       |          | 15         | 15.0 | 8.0   | 0.0  | 0.0  | 0.00, 0.50, 1.00, 1.50, 2.00, 2.50, 3.00, 3.99, 4.98, 5.96, 6.95, 7.93, 8.90, 9.88, 11.81
E3     | Solution | 15         | 15.0 | 8.0   | 12.0 | 3.0  | 0.00, 1.00, 2.00, 3.00, 3.99, 4.98, 5.97, 6.95, 7.93, 8.91, 9.88, 10.85, 11.81, 12.77, 13.73
E4     | Gel      | 8 × 4      | 15.0 | 8.0   | 12.0 | 3.0  | 0.00, 1.00, 2.00, 3.00, 4.00, 6.00, 8.00, 10.00
E6     | Solution | 10         | 15.0 | 8.0   | 12.0 | 3.0  | 0.00, 1.03, 2.05, 3.06, 4.07, 6.09, 8.09, 10.08, 12.05, 14.98
E7     | Solution | 16         | 12.0 | 7.0   | 12.0 | 3.0  | 0.00, 1.00, 2.00, 2.99, 3.98, 4.97, 5.95, 6.93, 7.90, 8.87, 9.84, 10.80, 11.76, 12.71, 13.66, 14.61
E8     | Gel      | 8          | 15.0 | 8.0   | 12.0 | 3.0  | 0.00, 1.00, 2.00, 3.00, 4.00, 6.00, 8.00, 10.00
E9     | Gel      | 1 × 2      | 14.0 | 8.0   | 2.0  | 11.0 | 6.0
       |          | 1 × 2      | 8.0  | 7.0   | 7.0  | 13.0 | 4.0
       |          | 1 × 2      | 10.0 | 6.0   | 6.0  | 8.0  | 6.0
       |          | 1 × 2      | 12.0 | 9.0   | 2.0  | 14.0 | 4.0
       |          | 1 × 3      | 11.0 | 8.0   | 3.0  | 10.0 | 3.0
       |          | 1 × 2      | 15.0 | 10.0  | 4.0  | 9.0  | 5.0
       |          | 1 × 2      | 13.0 | 7.0   | 5.0  | 12.0 | 2.0
E11    | Gel      | 8          | 12.0 | 7.0   | 12.0 | 3.0  | 0.00, 1.00, 2.00, 3.00, 4.00, 6.00, 8.00, 10.00
E14    | Gel      | 11         | 11.8 | 6.3   | 8.55 | 2.25 | 0.00, 1.02, 2.05, 3.06, 4.08, 5.09, 6.10, 7.10, 8.10, 9.10, 10.09

Independent of their origin, spectra are prepared for training and inference using the same steps:

1. Fixed ppm band and resolution: signals are zero-filled/truncated in time and FFT-transformed; the frequency-domain spectra are then oriented to a common descending ppm axis and cropped to [4.5 ppm, 1.0 ppm] with exactly 2048 points. This harmonises datasets acquired at different bandwidths (e.g., 1250 Hz vs. 2000 Hz) without altering linewidths; zero-filling does not change SNR but provides a common sampling grid. The ppm axis direction is consistent across all inputs.

2.
Amplitude normalisation: each spectrum is scaled by the maximum magnitude of its OFF acquisition when available; otherwise by the global maximum across the provided acquisitions. This yields a comparable dynamic range across samples and acquisitions.

3. Inputs and channels: for each acquisition (OFF, ON, DIFF) we can export real, imaginary and/or magnitude channels, providing up to nine channels; the channel combination and acquisition selection used by the best CNN and YAE configurations are reported in Section 5.

4. Baseline handling: beyond the water-window attenuation, no explicit baseline modelling or subtraction is applied. In particular, the DIFF baseline is retained so that the models learn to accommodate realistic baselines present in both simulated and experimental data.

5. Targets (for supervised training): ground-truth values are exported as an ordered vector over the metabolites. Concentrations are expressed relative to NAA as reference, so the NAA entry equals 1 by construction. During model selection, we considered sum-normalisation (vector divided by its sum) and max-normalisation (divided by its maximum), a common normalisation step to account for variations in total signal or reference metabolite concentrations [72]. The normalisation and channel choices for the selected best models are given in Section 5; evaluation metrics in Section 3.5 use max-normalised concentrations.

These steps ensure that spectra presented to the models are aligned in ppm, uniformly sampled, consistently scaled, and represented in a common acquisition-channel format across experimental and simulated datasets.

3.5. Performance Evaluation

Performance is evaluated in terms of spectral reconstruction (denoising) and metabolite quantification accuracy.
For architectures that output a reconstructed spectrum (our YAE), denoising is assessed by the mean absolute error between the noise-free (clean) spectrum and the model's reconstruction from the noisy input:

$$\epsilon_{\text{spec}} = \frac{1}{F} \sum_{f=1}^{F} \left| x_f - \hat{x}_f \right| \qquad (1)$$

where $x_f$ is the clean reference value at frequency bin $f$, $\hat{x}_f$ is the reconstructed value, and $F$ is the number of frequency points. This quantifies how well the model suppresses noise while preserving spectral structure.

Quantification performance is evaluated using the mean absolute error (MAE) over maximum-normalised concentrations,

$$\epsilon = \frac{1}{N} \sum_{l=1}^{N} \left| g_l - p_l \right| \qquad (2)$$

where $g_l$ is the ground-truth and $p_l$ the predicted concentration of metabolite $l$, and $N$ is the number of metabolites. We use maximum-normalised concentrations for evaluation regardless of the normalisation used during training, as MAE in this space is robust to outliers and comparable across datasets. Where appropriate, we also report error distributions, means and standard deviations.

To assess systematic bias and scale, we fit a linear regression of predicted vs. ground-truth concentrations, $y = ax + q$. Ideal performance corresponds to slope $a = 1$ and intercept $q = 0$. We report the slope, intercept, coefficient of determination $R^2$, standard error of the fit, and $p$-value for the regression. Together, MAE and regression metrics characterise both the magnitude of errors and any consistent over- or under-estimation.

3.6. Experimental Ground-Truth Validation

The ground truth for the experimental spectra is the millimolar concentrations of the metabolites. Because the algorithms output only normalised concentrations and conversion to millimolar would require additional calibration, we evaluate relative concentrations (ratios eliminate arbitrary scaling). NAA is chosen as the reference for the relative concentrations as it has a prominent peak in both the OFF and DIFF spectra and relatively stable peak characteristics.
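The relative-concentration conversion and the metrics of Eqs. (1)–(2) and the regression fit amount to a few lines of code; a minimal NumPy sketch (function and array names are illustrative, not from the paper's codebase):

```python
import numpy as np

def relative_concentrations(conc, reference=0):
    """Express concentrations relative to a reference metabolite (NAA here),
    then max-normalise as used for evaluation."""
    conc = np.asarray(conc, dtype=float)
    rel = conc / conc[reference]   # ratios eliminate arbitrary scaling
    return rel / rel.max()         # max-normalisation

def mae(ground_truth, predicted):
    """Mean absolute error over metabolites (Eq. 2)."""
    g = np.asarray(ground_truth, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(g - p)))

def regression_metrics(ground_truth, predicted):
    """Slope a, intercept q and R^2 of the linear fit y = a x + q."""
    g = np.asarray(ground_truth, dtype=float)
    p = np.asarray(predicted, dtype=float)
    a, q = np.polyfit(g, p, 1)
    r2 = np.corrcoef(g, p)[0, 1] ** 2
    return a, q, r2
```

Ideal performance corresponds to $a = 1$, $q = 0$ and $R^2 \approx 1$, alongside a small MAE.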
Alternatively, Cr could be used as a reference, but NAA is preferable here because the Cr signal is absent in the DIFF spectra. Once the relative concentrations have been calculated, the mean absolute error (MAE) is used to measure the prediction error for each metabolite; when reported over the phantom set, we use the mean absolute error over all spectra (overall MAE) or, for experiment-level comparison, the mean over spectra within each series (Section 3.7). The standard deviation (SD) is used to reflect the variability of the prediction error. Linear regression fitting and statistical analysis are performed for the relative concentrations in the same manner as detailed above.

Finally, we compare our quantification results with LCModel, one of the most widely used tools for in vivo MRS quantification. LCModel was run separately on the OFF and DIFF spectra of each phantom. For a fair comparison with the DL models, LCModel was configured with a basis set derived from the same FID-A simulations and Hamiltonian choices used to generate the DL training data (Section 3.1), including the Near model for GABA. The fitting range, baseline model (spline), and other settings were chosen to match typical MEGA-PRESS practice; further details (version, fitting range, baseline parameters) are given in the benchmark study [18]. Table 10 compares the quantification accuracy of LCModel-OFF and LCModel-DIFF across experiments. LCModel-OFF achieved lower experiment-level errors for GABA and Glu on our phantom data and is therefore used as the main conventional baseline in the following comparisons, so the DL models are compared against a strong rather than a weak baseline; this also aligns with literature suggesting off-resonance spectra can yield more reliable quantification than difference spectra under certain conditions [73].

3.7.
Statistical Comparison of Quantification Errors

To compare the quantification accuracy of models on phantom data in a statistically principled and model-agnostic manner, we further analyse the distributions of absolute quantification errors across experiments, where a single experiment here refers to one series in the phantom data. Since absolute errors are non-negative and typically non-Gaussian, all statistical analyses are based on non-parametric methods. To avoid pseudo-replication arising from treating spectra acquired under identical experimental conditions as independent samples, each experiment is treated as the fundamental statistical unit. Thus, experiment-level MAE denotes the mean error over spectra within each series (one value per series per metabolite), whereas overall MAE denotes the mean error over all 144 spectra, as used later in summary Tables 12 and 13. For a given experiment $e$ and metabolite $m$, the mean absolute error across spectra is computed as

$$\tilde{\epsilon}_{e,m} = \frac{1}{K_e} \sum_{k=1}^{K_e} \epsilon_{e,k,m}, \qquad (3)$$

where $K_e$ denotes the number of spectra in experiment $e$. The mean provides a standard summary of the typical error magnitude per experiment.

Paired one-tailed Wilcoxon signed-rank tests are then applied to the experiment-level aggregated errors to assess whether one model consistently yields lower quantification errors than another across experiments. This paired formulation exploits the fact that both models are evaluated under identical experimental conditions, thereby isolating differences attributable to the quantification method. In addition to hypothesis testing, effect sizes are reported to facilitate direct interpretation of performance differences.
Specifically, the proportion of experiments with lower error,

$$P_m = \frac{1}{E} \sum_{e=1}^{E} \mathbb{I}\left[ \tilde{\epsilon}^{(1)}_{e,m} < \tilde{\epsilon}^{(2)}_{e,m} \right], \qquad (4)$$

quantifies the fraction of experiments in which one model outperforms the other, while the mean difference in absolute error,

$$\Delta_m = \frac{1}{E} \sum_{e=1}^{E} \left( \tilde{\epsilon}^{(1)}_{e,m} - \tilde{\epsilon}^{(2)}_{e,m} \right), \qquad (5)$$

indicates the direction and magnitude of the performance difference. We use these metrics to compare models consistently: MAE and regression slope/intercept for magnitude and bias, and experiment-level Wilcoxon tests and effect sizes for statistical comparison.

4. Deep Learning Architectures

This section describes the deep learning architectures evaluated here. We first present our two proposed architectures, a convolutional multi-class regressor (CNN) and a Y-shaped autoencoder (YAE), which are extensively parameterised and optimised via Bayesian model selection (Section 5). We then describe several comparative baselines from the literature, implemented with their published configurations and minimal adaptations to our MEGA-PRESS pipeline.

The CNN builds on the architecture of [21] and consists of 1D and 2D convolutions that extract features along the frequency axis and across acquisition channels, followed by fully connected layers that regress to metabolite concentrations. We explore a broader architectural space (kernel sizes, down-sampling, regularisation, activations) to identify an optimal configuration. The YAE is a two-branch network: an encoder maps the input to a latent representation; one branch (decoder) reconstructs a denoised spectrum, and the other (quantifier) regresses the latent representation to concentrations. This design is motivated by the physical principle that an MRS spectrum is a linear combination of basis spectra: the autoencoder encourages a latent space that captures the weights of these basis components, which the quantifier then maps to concentrations.
The decoder acts as a regulariser, ensuring the latent representation retains sufficient spectral information for reconstruction.

Table 2: Parameterised convolutional multi-class regressor (CNN) architecture with parameters. N_f is the number of frequencies in the inputs, and channels refers to the number of input spectra. The parameters f1 and f2 determine the number of convolutional filters; k1, k2, k3 and k4 determine the kernel sizes of the convolutions; s1 and s2 determine the use of strides or max-pooling; d1 determines the use of dropout or batch normalisation; d2 the dropout rate in the dense layer; and e the number of neurons in the dense layer. The activation function at the output is either sigmoid or softmax, and the output consists of N_m metabolite concentrations (fixed to 5).

Layer      | Description
input      | (channels, N_f)
conv1      | FCONV(f1, (1, k1), s1, d1)
conv2      | FCONV(f1, (1, k2), s1, d1)
reduce     | Repeat until one output channel: FCONV(f1, (min(channels, 3), k3), 1, d1)
conv3      | FCONV(f1, (1, k4), 1, d1)
conv4      | FCONV(f1, (1, k4), s2, d1)
conv5      | FCONV(f2, (1, k4), 1, d1)
conv6      | FCONV(f2, (1, k4), s2, d1)
dense1     | sigmoid(Dense(e))
dropout    | if d2 > 0.0 then Dropout(d2)
dense2     | Dense(metabolites)
activation | sigmoid() or softmax()
output     | N_m

FCONV(f, k, s, d) is a sequence of layers, mainly forming a convolution, determined by its parameters as follows:

if s <= 0: x = Conv2D(filter=f, kernel_shape=k)(x)
else:      x = Conv2D(filter=f, kernel_shape=k, strides=(1, s))(x)
if d == 0.0: x = BatchNormalisation()(x)
x = ReLU()(x)
if d > 0.0: x = Dropout(d)(x)
if s < 0:  x = MaxPool2D((1, |s|))(x)

For these architectures, the input can be chosen from the ON, OFF, and DIFF acquisitions and from real, imaginary, or magnitude representations; see Section 3.4.
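The FCONV block above can be written as a small Keras helper; a sketch under stated assumptions (the `padding="same"` choice and exact layer classes are ours, not fixed by Table 2):

```python
import tensorflow as tf
from tensorflow.keras import layers

def fconv(x, f, k, s, d):
    """FCONV(f, k, s, d): convolution with f filters and kernel k, using
    strides if s > 0, max-pooling if s < 0, batch normalisation if d == 0,
    and dropout with rate d if d > 0. padding="same" is an assumption."""
    if s <= 0:
        x = layers.Conv2D(filters=f, kernel_size=k, padding="same")(x)
    else:
        x = layers.Conv2D(filters=f, kernel_size=k, strides=(1, s),
                          padding="same")(x)
    if d == 0.0:
        x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    if d > 0.0:
        x = layers.Dropout(d)(x)
    if s < 0:
        x = layers.MaxPool2D(pool_size=(1, abs(s)))(x)
    return x
```

Down-sampling only ever acts on the frequency axis (the second spatial dimension), so the channel axis is preserved until the dedicated reduce module collapses it.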
The following subsections detail the CNN (Section 4.1), the YAE (Section 4.2), and the additional baselines (Section 4.3).

4.1. CNN: Convolutional Multi-Class Regressor

Our parameterised CNN architecture is shown in Table 2. It consists of a sequence of convolutional layers followed by two dense layers. Its input has a channel axis, for the different acquisitions and datatypes, followed by a frequency axis for the bins from the Fourier transform; see Section 3.4. The convolutional blocks are represented by FCONV(f, k, s, d), a sequence of layers detailed in Table 2. This block primarily consists of a convolution with f filters, kernel size k, and a ReLU activation function. The parameter s controls dimensionality reduction along the frequency axis, using either strides (if s > 0) or max-pooling (if s < 0). The parameter d controls regularisation: dropout is used if d > 0, and batch normalisation if d = 0. This parameterisation enables us to explore different strategies for down-sampling and regularisation during model selection. While batch normalisation addresses internal covariate shift to improve training, dropout addresses overfitting by learning more robust features; in practice, using both simultaneously is often redundant.

The first two layers are 1D convolutions along the frequency axis (kernel sizes k1, k2), applied per channel to detect local spectral features. A reduction module then collapses the channel dimension to one via sequential 2D convolutions with kernel size (min(channels, 3), k3), combining information across acquisitions; this channel-then-reduce design follows [21]. A further sequence of 1D convolutions along the frequency axis on this single combined channel extracts higher-level spectral features. Two dense layers (with sigmoid activation between them and optional dropout) map these features to concentrations via a final softmax or sigmoid output.
While softmax may match the relative concentrations better, sigmoid may be more suitable as the relative concentrations do not represent probabilities. The network is trained by minimising the mean squared error (MSE). Training and model selection are detailed in Section 5.

4.2. Fully Connected Y-Shaped Autoencoder (YAE)

The YAE is a deterministic, fully connected network with three modules: an encoder, a decoder, and a quantifier (Figure 2, Table 3). We use fully connected (rather than convolutional) layers to capture global spectral structure: in the frequency domain, information lies in overall lineshapes rather than local patterns, and the locality bias of CNNs can be misaligned with this. We do not use a variational autoencoder (VAE) design because stochastic latents and VAE regularisation can reduce reconstruction fidelity, which we need for precise quantification. Input and output options match the CNN (Section 3.4): the decoder outputs a denoised spectrum; the quantifier outputs normalised relative concentrations.

Figure 2: Illustration of the Y-shaped autoencoder (YAE) architecture. The input consists of one or more noisy MEGA-PRESS spectra (OFF, ON, DIFF using real, imaginary or magnitude representations). The encoder maps the input to a compressed latent representation. The decoder branch reconstructs denoised versions of the input spectra from this latent space. The quantifier branch predicts the metabolite concentrations from the same latent representation. Key components are highlighted: ① hidden layer activation function, ② dropout layer, ③ decoder output activation function, and ④ quantifier output activation function.

Table 3: Parameterised architecture for the YAE's encoder, decoder, and quantifier modules. The specific values explored for the parameters are given in Section 5. Channels refers to the number of channels used as input spectra. N_f: number of frequency samples in the input spectrum. N_q: number of neurons in the first dense layer of the quantifier. N_m: number of output metabolite concentrations (fixed to 5). L_e, L_d, L_q: number of dense layers in the encoder, decoder, and quantifier, respectively. a_e, a_d: activation functions for the encoder and decoder. a_q, a_m: activation functions for the hidden layers and output layer of the quantifier. d_e: dropout rate for the encoder.

Encoder:
Layer    | Description
input    | (channels, N_f)
dense_e  | for each channel c: for l = 1, ..., L_e − 1: Dense(N_f / 2^(l−1), a_e), Dropout(d_e)
latent_c | Dense(N_f / 2^(L_e−1), a_e)
output   | N_f / 2^(L_e−1) per channel

Decoder:
Layer    | Description
input    | N_f / 2^(L_e−1) per channel
dense_d  | for each channel c: for l = L_d, ..., 2: Dense(N_f / 2^(l−1), a_d)
decode_c | Dense(N_f, a_d)
output   | (N_f) per channel

Quantifier:
Layer    | Description
input    | N_f / 2^(L_e−1) per channel
flatten  | channels × frequencies
dense_q  | for l = 1, 2, ..., L_q − 1: Dense(N_q / 2^(l−1), a_q)
quant    | Dense(N_m, a_m)
output   | N_m

For each channel of the input, the encoder consists of a sequence of fully connected layers, reducing the input frequency dimension while enriching the feature representation in the latent space. As detailed in Table 3, each layer comprises N_f / 2^(l−1) neurons at the l-th layer, where N_f denotes the initial frequency dimension. Dropout layers (d_e) are applied after each dense layer for regularisation, so the model can be adapted to different complexities and dataset sizes. The final encoder layer compresses the input into a low-dimensional latent representation of size N_f / 2^(L_e−1) per channel.

The latent representation feeds into two separate branches, forming the "Y" shape. The first branch is the decoder, which aims to reconstruct a denoised spectrum from the latent representation, restoring the original input shape of (channels, N_f).
Its architecture consists of a sequence of fully connected layers that progressively expand the data, structured to reverse the encoder's compression (Table 3). The decoder is used to ensure the latent representation contains sufficient features to reconstruct a clean spectrum, rather than for denoising as such. This is motivated by the physical principle that an MRS spectrum is a linear combination of basis spectra (see Section 3.1) and the demonstrated efficacy of denoising autoencoders in MRS [74, 75]. Dropout was not applied in the decoder module, as its objective is to reconstruct the spectrum in a stable and deterministic manner; stochastic regularisation here could disrupt reconstruction fidelity.

The second branch is the quantifier, which regresses the latent representation to metabolite concentrations. The latent representation is flattened across channels and frequency dimensions, then passed through a sequence of fully connected layers (N_q / 2^(l−1) neurons at layer l). We do not use dropout in the quantifier: the encoder already regularises the latent space, and dropout at the final regression stage would add unnecessary stochasticity where a stable, deterministic mapping to concentrations is required.

The decoder acts as a regulariser, as the encoder must retain enough spectral information that the decoder can reconstruct the full spectrum; the latent representation therefore cannot collapse to quantification-only features. As an MRS spectrum is a linear combination of basis spectra (Section 3.1), this aims to ensure the latent space reflects those weights. The final output layer of the quantifier maps the transformed features to N_m target dimensions: the number of quantified metabolites. The Y-shape ties the latent representation to both reconstruction and quantification, so the encoder is encouraged to learn features that support both tasks rather than overfitting to concentration targets alone.
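Putting the three modules together, a simplified single-channel sketch in Keras (the depths, widths, dropout rate and activations here are illustrative placeholders, not the selected configuration of Section 5):

```python
import tensorflow as tf
from tensorflow.keras import layers

N_F, N_M = 2048, 5   # frequency bins and number of metabolites
L_E, N_Q = 3, 256    # illustrative encoder depth and quantifier width

inp = layers.Input(shape=(N_F,), name="spectrum")

# Encoder: halve the layer width with depth; dropout regularises.
x = inp
for l in range(1, L_E):
    x = layers.Dense(N_F // 2 ** (l - 1), activation="relu")(x)
    x = layers.Dropout(0.1)(x)
latent = layers.Dense(N_F // 2 ** (L_E - 1), activation="relu", name="latent")(x)

# Decoder branch: mirrors the encoder; no dropout for stable reconstruction.
y = latent
for l in range(L_E - 1, 1, -1):
    y = layers.Dense(N_F // 2 ** (l - 1), activation="relu")(y)
recon = layers.Dense(N_F, name="reconstruction")(y)

# Quantifier branch: regresses the latent features to concentrations.
q = layers.Dense(N_Q, activation="relu")(latent)
quant = layers.Dense(N_M, activation="sigmoid", name="concentrations")(q)

yae = tf.keras.Model(inp, [recon, quant])
yae.compile(optimizer="adam",
            loss={"reconstruction": tf.keras.losses.Huber(),
                  "concentrations": tf.keras.losses.Huber()})
```

The shared `latent` tensor feeding both heads is what produces the "Y" shape: gradients from reconstruction and quantification both shape the encoder.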
We use the Huber loss for the decoder and quantifier branches:

$$S_\delta(x, x') = \begin{cases} \frac{1}{2}(x - x')^2, & \text{if } |x - x'| \le \delta, \\ \delta\left(|x - x'| - \frac{1}{2}\delta\right), & \text{otherwise,} \end{cases} \qquad (6)$$

where $x$ is the ground truth, $x'$ is the prediction, and $\delta$ is a hyperparameter. We use $\delta = 1.0$, the default value in TensorFlow. When the absolute error is less than or equal to $\delta$, the Huber loss is quadratic like MSE; for larger errors, it becomes linear like MAE. This makes the loss less sensitive to large errors and outliers than MSE.

This choice is motivated by the nature of MRS data. For the decoder branch, the loss $S_\delta(y, y')$ is calculated between the noise-free ground-truth spectrum $y$ and the reconstructed spectrum $y'$. In MEGA-PRESS spectra, some metabolite signals are very prominent and could dominate an MSE loss. The Huber loss mitigates the influence of these large signals and helps the model reconstruct the finer details of less prominent spectral features, such as those from GABA. For similar reasons, we also use the Huber loss $S_\delta(g, p)$ for the quantifier branch, where $g$ is the ground truth and $p$ the predicted vector of concentrations. The methodology used to select the optimal parameters for this architecture, along with the training strategy, is detailed in Section 5.

4.3. Additional Deep Learning Baselines

We implement four further DL models from the literature as comparative baselines. These are used with their published default configurations, with only the minimal adaptations needed to handle the MEGA-PRESS input and train on the quantification objective. They are not subject to the Bayesian optimisation applied to the CNN and YAE, due to computational constraints; our results therefore reflect their out-of-the-box performance in this setting rather than a fully tuned comparison. Pipeline adaptations may also affect comparability with the originally reported results.
FCNN [53] is a seven-layer 1D-CNN using five 1D convolutional layers with concatenated ReLU (CReLU) activation, followed by a two-layer dense head for regression. We adopted the published architectural parameters, including the convolutional filter progression ([32, 64, 128, 256, 512]) and 1024 units in the first dense layer. The original model was designed for two-channel (real, imaginary) input; our implementation reshapes the arbitrary input acquisitions and datatypes (e.g., OFF, ON, DIFF) to a two-channel or four-channel format to match the model's expected input shape.

QNet [49] is a physics-informed model with an IF (Imperfection Factor) Extraction Module (a three-block CNN) and an MM (Macromolecule) Signal Prediction Module; concentrations are obtained via a Linear Least Squares (LLS) solver. We adapt it for multi-acquisition data by running separate IF modules per acquisition (e.g., OFF, DIFF) and concatenating their outputs. To focus solely on quantification, and due to the absence of macromolecule signal, we disable the MM module and train only the quantification path. Two LLS variants are evaluated: QNet (simplified, learnable BasicLLSModule) and QNetBasis (physics-grounded BasisLLSModule that applies the predicted IFs to a known metabolite basis set). Published defaults are used for the IF module (e.g., [16, 32, 64] filters).

QMRS [59] combines a backbone of CNNs, Inception modules, and a Bidirectional LSTM (BiLSTM) with a Multi-Head MLP designed to predict various spectral parameters. We feed a two-channel input (adapted from our pipeline) and train only the metabolite amplitude (concentration) head; the other heads (phase, linewidth, baseline) are disabled to focus on quantification and avoid gradient issues. Default parameters are used (32 initial filters, 128 LSTM units, [512, 256] MLP units).
EncDec [62], an encoder-decoder network, uses a sequence of WaveNet blocks for feature extraction, which are then integrated across acquisitions using an attention-based GRU (AttentionGRU). It was designed for JPRESS data, which consists of many echoes (e.g., 32). We added a custom ContextConverter layer that reshapes the pipeline output from (batch, acquisitions × datatypes, freqs) to (batch, acquisitions, freqs, datatypes) for the EncDec backbone. As for QMRS, the auxiliary heads for predicting FIDs and phase parameters were disabled during training to focus solely on quantification. Published parameters are used (e.g., 128 filters for the WaveNet blocks).

5. Model Selection and Performance on Simulated Data

We optimise hyperparameters for the CNN and the YAE using similar protocols. We first describe the optimisation strategy, then present model selection for the CNN (including validation that Bayesian optimisation is likely to find the grid-search optimum), then the three-stage YAE selection, and finally the performance of the selected models on simulated data. The process explores various combinations of model hyperparameters, input data representations (real, imaginary, magnitude of ON, OFF, DIFF spectra) and concentration normalisation strategies (sum vs. max normalisation; see Section 3.4). Both models are trained with the ADAM optimiser ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) with random batch shuffling. A fixed learning rate of $10^{-4} \times (\text{batch\_size}/16)$ is used, which was determined to give good convergence in preliminary experiments and reflects a linear scaling rule with batch size [76].

5.1. Hyperparameter Optimisation Strategy

For this multi-dimensional optimisation problem, we employ Bayesian optimisation with a Gaussian process, a sample-efficient method well suited to tuning expensive black-box functions such as deep learning model training [77].
As each configuration is tested, the Gaussian process is updated to better represent the performance landscape. We alternate between choosing a configuration based on Expected Improvement (EI) and selecting configurations by Thompson sampling. EI employs an L-BFGS optimiser to find the maximum of the expected improvement. Thompson sampling takes a possible version of the estimated average MAE function, sampled from the Gaussian process, and acts as if that version were the truth, choosing the best point based on that sample. Interleaving the Thompson strategy with EI explores the configuration space in a more balanced manner. The iterative process continues until a predetermined number of iterations is reached, and then selects the best model based on the estimated average MAE across the five-fold cross-validation. For each configuration proposed by the optimiser, the corresponding model is trained and evaluated using five-fold cross-validation on a dataset of 10,000 simulated spectra. The mean absolute error (MAE) on the validation folds serves as the objective function to be minimised. Our implementation uses GPyOpt [78].

5.2. CNN Model Selection

We run a grid search over the CNN parameter space, then apply Bayesian optimisation on the same space to confirm that it recovers the same optimum and to quantify its efficiency. The parameters and their options are simplified, based on the original values explored in [21]. Broader parameter exploration did not yield significantly improved models and is not pursued further here for simplicity. The full results are available in the CNN models data repository [79]. The model parameters for the grid search are listed in Table 4. The options explore sum and maximum normalisation for the relative concentration output, and different combinations of ON, OFF, DIFF acquisitions and datatypes.
For the model parameters, different kernel sizes (small, medium and large) are explored, along with two options for the final activation function: softmax with strides or sigmoid with max-pooling. The other model parameters are fixed. We also explore different batch sizes. See Table 2 for details of how the parameters are used by the CNN model architecture.

Table 4: Grid-search model selection parameters for the CNN model, varying the dataset options and exploring different convolutional kernel sizes (small, medium, large), activation functions and batch sizes, leaving the remaining model parameters fixed.

Group    | Parameter        | Values                                                              | Best Model
Dataset  | Normalisation    | sum, max                                                            | sum
Dataset  | Acquisitions     | (DIFF, OFF), (DIFF, ON), (OFF, ON), (DIFF, OFF, ON)                 | (OFF, ON)
Dataset  | Datatypes        | (magnitude), (real), (imaginary, real)                              | (real)
Model    | Kernel sizes     | small: k1 = 7, k2 = 5, k3 = 3, k4 = 3                               | medium
         |                  | medium: k1 = 9, k2 = 7, k3 = 5, k4 = 3                              |
         |                  | large: k1 = 11, k2 = 9, k3 = 7, k4 = 3                              |
Model    | Activations      | softmax: softmax with s1 = 2, s2 = 3                                | sigmoid-pool
         |                  | sigmoid-pool: sigmoid with s1 = −2, s2 = −3                         |
Model    | Fixed parameters | f1 = 256, f2 = 512, d1 = 0, d2 = 0.3, e = 1024, N_f = 2048, N_m = 5 |
Training | Batch size       | 16, 32, 64                                                          | 16

The results for the top 25 performing models from the grid search are shown in Figure 3. Analysis of the full grid-search results reveals several clear trends. Sum normalisation of target concentrations consistently outperforms maximum normalisation. Models using the real part of the spectra as input, either alone or with the imaginary part, tend to perform better than those using magnitude data. The choice of acquisitions is less clear, though combinations including OFF and ON spectra are consistently among the top models. For the architecture itself, using max-pooling with a sigmoid output activation (sigmoid-pool) is superior to using strides with a softmax output.
Overall, the best configuration identified by the grid search is a model with the medium kernel sizes and sigmoid-pool activation, trained on the real part of the OFF and ON spectra with sum normalisation and a batch size of 16. While there is a clear distinction between this model and lower-performing ones, some groups exhibit similar performance but still fall short of this optimal model. Large kernel sizes may be worth considering for larger training datasets with more variation, but their higher computational cost is not justified in this context. This optimal configuration differs from that in preliminary work [21]; we attribute the difference to the broader search used here, which explored more of the configuration space.

To verify the efficiency of the Bayesian optimisation model selection compared to the grid search, we ran it over the same model parameters listed in Table 4. Repeated runs of the Bayesian optimisation over the CNN model configurations, trained over 100 epochs, consistently gave the same best model, also found by the grid search over 432 models. Figure 4 shows the performance of the repeated Bayesian optimisation over the iterations. The best model was found before 150 models had been explored, about 35% of the model configuration space. This gives us some confidence that the process works and can also be applied to the YAE architecture.

5.3. YAE Model Selection

As the parameter space for the YAE architecture is substantially larger than that of the CNN, an exhaustive grid search is computationally infeasible. We therefore rely on the Bayesian optimisation strategy validated in Section 5.2. However, the combined parameter space of the full, jointly-trained YAE architecture (described in Section 4.2) is still too large for a direct optimisation. To keep the search computationally feasible, we used an incremental strategy in three stages:

1. Stage 1: Denoising Autoencoder Optimisation.
To find an optimal architecture for feature extraction, we focused solely on the autoencoder's denoising capability. We performed a parameter search to identify the most effective encoder (L_e) and decoder (L_d) architectures, along with optimal activation functions and regularisation strategies, for the reconstruction task.

2. Stage 2: Exploratory Quantifier Search (Frozen Encoder). We conducted a preliminary Bayesian optimisation on the quantifier module. For this exploratory stage, we attached a quantifier to the fixed, frozen optimal encoder from Stage 1. The goal of this stage was not to find the final model, but to efficiently identify the optimal architecture (e.g., N_q, L_q) and parameter ranges for the quantifier. The results of this intermediate step were used to prune the search space for the final optimisation stage.

3. Stage 3: Final Joint-Training Optimisation. Finally, using the optimal architectures and refined parameter ranges from Stages 1 and 2 as a highly informed starting point, we conducted the definitive model selection on the full, unfrozen, end-to-end joint-training model. This step fine-tuned all parameters simultaneously, leveraging the loss-weighting and ramp-up strategy. The results of this final stage are presented below.
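The staging above can be condensed into a small control-flow sketch. The per-stage trainable groups follow the text; the start (0.1) and target (1.0) weights of the Stage 3 warm-up ramp are taken from Section 5.3.2, while `ramp_epochs = 20` is an assumption, as the text only says the weight increases over the initial epochs.

```python
# Parameter groups of the YAE and the groups updated in each stage.
GROUPS = ("encoder", "decoder", "quantifier")

def trainable(stage):
    """Which parameter groups receive gradient updates in each training stage."""
    if stage == 1:          # Stage 1: denoising autoencoder only
        return {"encoder", "decoder"}
    if stage == 2:          # Stage 2: encoder frozen, exploratory quantifier search
        return {"quantifier"}
    return set(GROUPS)      # Stage 3: end-to-end joint fine-tuning

def w_q(epoch, ramp_epochs=20, w_start=0.1, w_target=1.0):
    """Warm-up ramp for the quantifier loss weight in Stage 3 (cf. Eq. 7).

    ramp_epochs is an assumed value; the paper only states a linear increase
    over the initial epochs.
    """
    return min(w_target, w_start + (w_target - w_start) * epoch / ramp_epochs)
```

Keeping the quantifier weight small at first lets the encoder settle on denoising features before the quantifier's noisy early gradients are given full weight.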
Figure 3: Validation and training concentration MAE for the top 25 of 432 CNN configurations from the grid search model selection (Table 4). Each pair of horizontal bars corresponds to one configuration: blue, validation MAE; red, training MAE. Error bars show the standard deviation across five-fold cross-validation. Configurations are ordered by validation MAE (best at top). Dataset: 10,000 simulated spectra, sum normalisation, basis set linewidth 2 Hz, 100 epochs. The best configuration (OFF–ON, batch size 16, real datatype, medium_sigmoid_pool, sum normalisation) achieved a validation MAE of 0.01263 ± 0.00054 (training MAE 0.01093 ± 0.00069).

The options for data types, acquisition combinations, and dataset used in the model selection process are kept consistent with those for the CNN model selection (except for using 200 training epochs, versus 100 for the less complex CNN), ensuring comparability.

5.3.1. Structural Exploration and Search Space Pruning (Stages 1 & 2)

To effectively navigate the vast hyperparameter space of the YAE architecture, we conducted extensive exploratory experiments in the first two stages. The goal was not to identify a single "final" model, but to empirically determine the impact of key architectural components and prune the search space for the final joint optimisation.

Stage 1: Robustness Analysis of the Encoder–Decoder Backbone. In the first stage, approximately 900 independent autoencoder models were trained to optimise the self-supervised reconstruction task over the search space of Table 5. The full selection results can be found in the data repositories [80].
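The screening over these roughly 900 runs relies on simple correlations between each hyperparameter and the validation MAE. A toy sketch with synthetic results (all numbers hypothetical) illustrates the kind of near-zero correlation reported for depth:

```python
import numpy as np

# Hypothetical screening of ~900 Stage-1 autoencoder runs: correlate a
# hyperparameter (here encoder depth) with validation MAE.
rng = np.random.default_rng(3)
depth = rng.integers(5, 10, size=900).astype(float)        # L_e drawn from {5, ..., 9}
val_mae = 0.004 + rng.normal(0.0, 0.001, size=900)         # MAE independent of depth

r = np.corrcoef(depth, val_mae)[0, 1]                      # near zero (cf. r ~ 0.06)
```

A near-zero correlation like this is what justifies fixing a hyperparameter (here the depth) and pruning it from the later search stages.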
Analysis of the correlation between the various hyperparameters and the validation MAE yields the following architectural insights:

• Depth Saturation: Increasing network depth beyond eight layers yielded diminishing returns and occasional instability, with no statistically significant reduction in reconstruction error (r ≈ 0.06). Consequently, we fixed the encoder depth to the most representative configuration (L_e = 5) for the exploratory Stage 2 to keep the search space tractable.

• Activation Robustness: The tanh activation function, with a significantly lower mean validation MAE (0.0043), consistently outperformed unbounded alternatives like ReLU (0.0122) for the spectral reconstruction task. Its bounded range appears well suited to modelling the normalised spectral intensities, making it a primary candidate.

• Input Representation: The real datatype (mean validation MAE of 0.0033) significantly outperformed imaginary (0.0042) and magnitude (0.0085) inputs across all configurations, justifying its exclusive use in the final model.

Figure 4: Performance of 50 repetitions of the Bayesian optimisation for the CNN model selection over the Bayesian optimisation iterations. The grey shaded area (right axis) shows the percentage of runs that selected the same best configuration as the full grid search ("converged").

Table 5: Model selection parameters for the YAE encoder–decoder path.

Group    | Parameter        | Values
Dataset  | Normalisation    | sum
         | Acquisitions     | (DIFF, OFF, ON), (DIFF, ON), (DIFF, OFF), (OFF, ON)
         | Datatypes        | (real), (imaginary), (magnitude)
Model    | L_e              | 5, 6, 7, 8, 9
         | L_d              | 5, 6, 7, 8, 9
         | a_e              | tanh, ReLU (magnitude)
         | a_d              | tanh, sigmoid (magnitude)
         | Fixed parameters | d_e = 0.3, N_f = 2048, N_m = 5
Training | Batch size       | 16, 32, 64
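A shape-level sketch of such an encoder–decoder backbone (untrained, forward pass only; the layer widths are illustrative, as the paper specifies depths and activations but not the widths assumed here):

```python
import numpy as np

rng = np.random.default_rng(1)
N_f = 2048  # spectral points per acquisition (as in the fixed parameters)

def mlp(x, widths, act):
    # Untrained fully connected stack: one random weight matrix per layer,
    # shared activation; scaling keeps pre-activations at unit variance.
    for w_out in widths:
        W = rng.normal(0.0, 1.0 / np.sqrt(x.shape[-1]), size=(x.shape[-1], w_out))
        x = act(x @ W)
    return x

spectrum = rng.normal(size=(1, N_f))                      # toy input spectrum
code = mlp(spectrum, [1024, 512, 256, 128, 64], np.tanh)  # encoder: L_e = 5 tanh layers
recon = mlp(code, [128, 256, 512, 1024, N_f], np.tanh)    # mirrored decoder back to N_f
```

The tanh output keeps the reconstruction bounded, matching the normalised spectral intensities discussed above.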
Stage 2: Quantifier Capacity Estimation (Frozen Encoder). With the encoder architecture fixed to a high-performing configuration from Stage 1, we conducted a targeted search to determine the necessary capacity of the quantifier module over the search space detailed in Table 6. A sensitivity analysis of the quantifier branch highlights the risks of over-parameterising the search space. Full selection results can be found in the supplementary materials [80].

• Width Constraint (Smaller is Better): Analysis of the median validation MAE reveals a clear positive trend between layer width and error. Models with compact hidden layers (128–256 units) consistently achieved lower median errors (MAE ≈ 0.024) than wider networks (e.g., 2048 units, MAE ≈ 0.030).

• Activation Function Dominance: We see a strong preference for the sigmoid activation function in the quantifier module. For the hidden layers, models using sigmoid achieved the lowest minimum validation MAE (0.0171) compared to ReLU (0.0200) and tanh (0.0207), indicating a higher performance ceiling for this specific regression task. For the output layer, the advantage of bounded activations was even more pronounced: the sigmoid output activation achieved a median MAE of 0.0246, outperforming the unbounded linear (0.0280) and ReLU (0.2615) activations. Overall, this supports the use of bounded activation functions to match the normalised range of metabolite concentrations.

Table 6: Model selection parameters for the quantifier path.

Group    | Parameter        | Values
Dataset  | Normalisation    | sum
         | Acquisitions     | (DIFF, OFF, ON), (DIFF, ON), (DIFF, OFF), (OFF, ON)
         | Datatypes        | (real)
Model    | N_q              | 2048, 1024, 512, 384, 256, 192, 128
         | L_q              | 2, 3, 4, 5
         | a_q              | linear, ReLU, sigmoid, tanh
         | a_m              | linear, ReLU, sigmoid, softmax, tanh
         | Fixed parameters | N_f = 2048, N_m = 5
Training | Batch size       | 16, 32

Table 7: Refined search space for the final joint optimisation (Stage 3), derived from the insights of Stages 1 & 2.
Group    | Parameter                   | Refined values (Stage 3)
Dataset  | Normalisation               | sum
         | Acquisitions                | (DIFF, OFF, ON), (DIFF, ON), (OFF, ON)
         | Datatypes                   | (real), (imaginary, real)
Model    | Encoder depth (L_e)         | 5, 6
         | Decoder depth (L_d)         | 6, 7, 8
         | Encoder activation (a_e)    | tanh, ReLU
         | Encoder dropout rate (d_e)  | 0.2, 0.3
         | Decoder activation (a_d)    | tanh, linear
         | Quantifier width (N_q)      | 128, 192, 256, 384, 448, 512
         | Quantifier activation (a_q) | ReLU, sigmoid
         | Output activation (a_m)     | sigmoid, softmax

Conclusion of Exploration: Stages 1 and 2 narrowed the search space (encoder depth, activation choices, quantifier width, etc.). The final joint optimisation (Stage 3, Section 5.3.2) was then run over this reduced set of options (Table 7).

5.3.2. Final Optimisation

In the final stage, we performed a Bayesian optimisation on the full, jointly-trained YAE architecture (Section 4.2), confined to the refined search space in Table 7 identified by Stages 1 and 2. To ensure stable convergence during this end-to-end training, we implemented a dynamic loss-weighting strategy. The total loss L_total is defined as a weighted sum of the reconstruction loss (L_ae) and the quantification loss (L_q):

L_total = w_ae · L_ae + w_q(t) · L_q    (7)

where w_ae is fixed at 1.0, but the quantifier weight w_q(t) is dynamic. We employed a "warm-up" ramp strategy in which w_q starts at a low value (0.1) and increases linearly to its target value (1.0) over the initial epochs. Starting with a low w_q allows the encoder to learn spectral features from the denoising task first, so that noisy gradients from the untrained quantifier do not disrupt early feature learning. All results of this joint optimisation are provided in [80].

1. Architectural Convergence and Consistency: The joint optimisation results strongly support the architectural insights from the exploratory stages.
As summarised in the performance distribution analysis in Figure 5:

• Encoder: The optimal configuration converged to an encoder depth of L_e = 5 and a decoder depth of L_d = 6, using tanh activations and a dropout rate of 0.2, confirming the "middle-ground" depth hypothesis from Stage 1.

• Quantifier: Consistent with the Stage 2 findings, the quantifier preferred a compact structure. The top-performing models consistently used a width of 384 units and a depth of two layers.

• Activation functions: The distinct preference for different activation functions was maintained in the joint setting: tanh for the spectral reconstruction path (encoder/decoder) and sigmoid for the quantification path (hidden and output layers). The results support decoupling the architectural choices for the two paths.

Figure 5: Validation and training concentration MAE for YAE configurations from the final joint optimisation (Stage 3). Each pair of horizontal bars corresponds to one configuration: blue, validation MAE; red, training MAE. Error bars show the standard deviation across five-fold cross-validation. Configurations are ordered by validation MAE (best at top). Dataset: 10,000 simulated spectra, sum normalisation, 200 epochs (Table 7).

2. Task-Specific Data Preference: A notable finding in this final stage was a shift in the optimal input representation. While Stage 1 (pure denoising) favoured the real datatype, the joint optimisation for quantification accuracy favoured the imaginary–real input, particularly with the OFF–ON acquisition pair. This suggests that the real datatype produces better reconstruction results, but the imaginary component carries phase information that helps quantification.

3. The Final Model: Based on the minimum validation MAE and cross-validation stability across five folds (Figure 5), the final selected YAE configuration is given in Table 8. This configuration achieved the lowest concentration MAE with low variance (Figure 5), giving the best trade-off between model complexity and generalisation. This is the "Optimised YAE" used for the comparative evaluation in Section 5.4 and the experimental validation in Section 6.

Table 8: The final, optimal YAE configuration selected in the joint optimisation (Stage 3). This configuration achieved the lowest validation concentration error and demonstrated high stability across five-fold cross-validation.

Group      | Parameter               | Selected value
Dataset    | Acquisitions            | edit_off, edit_on
           | Datatypes               | imaginary, real
           | Normalisation           | sum
Encoder    | Depth (L_e)             | 5
           | Activation (a_e)        | tanh
           | Dropout rate (d_e)      | 0.2
Decoder    | Depth (L_d)             | 6
           | Hidden activation       | tanh
           | Output activation (a_d) | tanh
Quantifier | Depth (L_q)             | 2
           | Width (N_q)             | 384
           | Hidden activation (a_q) | sigmoid
           | Output activation (a_m) | sigmoid
           | Dropout rate (d_q)      | none (−1.0)
Training   | Batch size              | 16

5.4. Performance of Optimal Configurations on Simulated Data

Following the optimisation process (Sections 5.2 and 5.3), we validated our final, optimised CNN and YAE models against the adapted baseline architectures (FCNN, QNet, QNetBasis, QMRS, and EncDec) using the simulated dataset. The baseline models in Section 4.3 were implemented with their published parameters rather than undergoing an equivalent Bayesian optimisation. This methodological choice had two motivations: (1) to assess the "out-of-the-box" generalisation of these established architectures, and (2) the computational infeasibility of exhaustively optimising all additional models (see also Section 6.3.6).
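In the comparison that follows, agreement with the ground truth is summarised by the slope of a regression through the origin, its standard error, R² and MAE. A minimal sketch under one common choice of estimators (the paper does not spell out the exact formulas used in its analysis):

```python
import numpy as np

def agreement(y_true, y_pred):
    """Slope of a no-intercept least-squares fit, its standard error, R^2 and MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    slope = (y_true @ y_pred) / (y_true @ y_true)          # regression through the origin
    resid = y_pred - slope * y_true
    se_slope = np.sqrt((resid @ resid) / (len(y_true) - 1) / (y_true @ y_true))
    r2 = 1.0 - (resid @ resid) / ((y_pred - y_pred.mean()) ** 2).sum()
    mae = np.abs(y_pred - y_true).mean()
    return slope, se_slope, r2, mae

# Perfect agreement gives slope 1, SE 0, R^2 1, MAE 0.
s, se, r2, mae = agreement([0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4])
```

Under these conventions, slope ≈ 1.00 with small SE and R² ≈ 1.00 corresponds to the near-perfect simulated agreement reported below.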
Figure 6 and Table 9 summarise the quantification performance of the evaluated models (training/validation split: 80%/20%) on the "idealised" simulated data. Both report the four non-reference metabolites (Cr, GABA, Gln, Glu); NAA is omitted as it is fixed to 1 by the choice of reference (Section 3.4). The scatter plots of estimated vs actual metabolite concentrations (Figure 6) show that for MRSNet-CNN, MRSNet-YAE, FCNN and QNet-default, the points lie close to a line through the origin with slope 1.00, in most cases to three significant digits. For QMRS, EncDec and QNetBasis, there is significantly more scatter, and the slopes of the linear regression fits deviate more noticeably from 1.00. Table 9 further shows that for both our systematically optimised models and the simple direct-regression models (FCNN, QNet-default), the R² values are ≥ 0.99 and the MAEs are small (e.g., 0.024–0.026 for GABA across the four best-performing models). The more complex baseline models, on the other hand, struggled significantly with the data. QMRS and EncDec exhibited much wider prediction scatter, lower R² values (e.g., 0.91 for Gln), and substantially higher MAEs (e.g., Glu MAE of 0.068 and 0.067, respectively). The QNetBasis model performed worst, with the largest errors across metabolites (e.g., Cr R² = 0.76, MAE = 0.11; Gln R² = 0.82, MAE = 0.10). These results demonstrate that for simple direct-regression models (FCNN, QNet-default), no further optimisation was necessary to achieve near-perfect simulated performance. Conversely, the poor performance of the complex hybrid models (QMRS, EncDec, QNetBasis) suggests they are either less suited to this task or require significant, dataset-specific tuning, indicating their limited "out-of-the-box" utility in our specific MEGA-PRESS setup.

Figure 6: Performance of optimal-configuration models in estimating metabolite concentrations for 2,000 simulated validation spectra. Scatter plots of predicted vs actual (ground-truth) concentrations for each metabolite, with identity line; each point is one spectrum; each panel corresponds to one model and one metabolite (Cr, GABA, Gln, Glu).

Table 9: Performance indices for optimal-configuration models in estimating metabolite concentrations for 2,000 simulated validation spectra.

(a) Cr
Model        | R²    | SE(slope) | MAE ± Std
EncDec       | 0.977 | 0.0034    | 0.033 ± 0.037
FCNN         | 0.999 | 0.0008    | 0.008 ± 0.009
MRSNet-CNN   | 0.999 | 0.0008    | 0.009 ± 0.009
MRSNet-YAE   | 0.998 | 0.0009    | 0.008 ± 0.010
QMRS         | 0.980 | 0.0029    | 0.039 ± 0.036
QNetBasis    | 0.761 | 0.0101    | 0.111 ± 0.116
QNet-default | 0.999 | 0.0009    | 0.009 ± 0.011

(b) GABA
Model        | R²    | SE(slope) | MAE ± Std
EncDec       | 0.947 | 0.0052    | 0.050 ± 0.056
FCNN         | 0.989 | 0.0023    | 0.024 ± 0.025
MRSNet-CNN   | 0.990 | 0.0022    | 0.024 ± 0.025
MRSNet-YAE   | 0.985 | 0.0027    | 0.026 ± 0.029
QMRS         | 0.960 | 0.0041    | 0.040 ± 0.045
QNetBasis    | 0.874 | 0.0071    | 0.070 ± 0.073
QNet-default | 0.990 | 0.0023    | 0.024 ± 0.026

(c) Gln
Model        | R²    | SE(slope) | MAE ± Std
EncDec       | 0.909 | 0.0065    | 0.075 ± 0.077
FCNN         | 0.976 | 0.0034    | 0.040 ± 0.039
MRSNet-CNN   | 0.978 | 0.0032    | 0.039 ± 0.038
MRSNet-YAE   | 0.970 | 0.0040    | 0.043 ± 0.043
QMRS         | 0.925 | 0.0058    | 0.069 ± 0.070
QNetBasis    | 0.821 | 0.0088    | 0.100 ± 0.101
QNet-default | 0.977 | 0.0032    | 0.040 ± 0.040

(d) Glu
Model        | R²    | SE(slope) | MAE ± Std
EncDec       | 0.907 | 0.0066    | 0.068 ± 0.074
FCNN         | 0.976 | 0.0034    | 0.038 ± 0.039
MRSNet-CNN   | 0.978 | 0.0033    | 0.038 ± 0.039
MRSNet-YAE   | 0.971 | 0.0039    | 0.040 ± 0.043
QMRS         | 0.912 | 0.0063    | 0.067 ± 0.071
QNetBasis    | 0.789 | 0.0096    | 0.093 ± 0.101
QNet-default | 0.979 | 0.0032    | 0.037 ± 0.039

On simulated data, the four best models (MRSNet-YAE, MRSNet-CNN, FCNN, QNet-default) are nearly indistinguishable: their near-perfect performance establishes a clear baseline, but the simulated benchmark lacks the discriminative power to separate them. The critical test is their performance on experimental phantom data and the sim-to-real challenge (Section 6). Note that QMRS, EncDec and QNetBasis were originally proposed for different acquisition protocols and more complex in vivo settings, and were adapted to our MEGA-PRESS pipeline without extensive re-tuning, so our evaluation reflects their out-of-the-box behaviour in this specific context rather than a reproduction of their originally reported results.

6. Results and Discussion of Best-Performing Models

We evaluate the best-performing DL models (selected on simulated data in Section 5) on the 144 spectra from 112 experimental phantoms with known ground-truth concentrations, and compare them to LCModel. LCModel was run on OFF and DIFF spectra separately; at the experiment level, LCModel-OFF achieved significantly lower errors than LCModel-DIFF for GABA and Glu (Table 10), and it is therefore used as the conventional baseline. We first present the overall phantom results and statistical comparisons, then discuss the sim-to-real gap and its implications.

6.1. Experimental Ground-Truth Validation

Figure 7 summarises the performance of the three models that performed best on simulated data (MRSNet-YAE, MRSNet-CNN, FCNN) and LCModel on the 144 spectra (112 phantoms). LCModel can only quantify a single spectrum at a time; we therefore quantified OFF and DIFF spectra separately, as described in Section 3, and compared the quantification accuracy.
Figure 7: Distribution of absolute quantification errors (max-normalised concentrations) for MRSNet-YAE, MRSNet-CNN, FCNN, and LCModel-OFF across 144 spectra from 112 phantoms, grouped by experimental series (E1–E14) and metabolite (Cr, GABA, Gln, Glu). Each box summarises the errors for one series–metabolite–model combination; outliers (points beyond 1.5 IQR) are shown. NAA is the reference metabolite and is not quantified separately.

The results of the statistical tests (described in Section 3.7) are summarised in Table 10. The comparison indicates that the LCModel quantification results based on the OFF spectra were generally more accurate than those based on the difference spectra, including for GABA. This is surprising at first, as quantifying GABA from unedited OFF spectra at 3 T is unreliable due to overlap with Cr and macromolecules, and MEGA-PRESS was specifically designed to produce difference spectra to address these issues. However, LCModel was originally designed for normal PRESS or STEAM spectra [31, 32]; the user manual documents MEGA-PRESS and off-resonance spectra as special cases [73, Sections 9.4 and 11.3.1], and there may be issues with the adaptation to the difference spectra. The absence of a macromolecule signal in the phantom data may also play a role. In any case, as our statistical analysis suggests that LCModel quantification using the OFF spectra is more accurate, we use LCModel-OFF as the conventional baseline in the following comparisons.

Relative to their near-perfect performance on simulated validation data, both the best YAE and the best CNN showed substantially higher errors on phantoms. Their accuracy was comparable to, and in some cases worse than, LCModel-OFF, with no clear winner. Table 11 reports the experiment-level mean MAE for each of the 14 series.
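The paired one-tailed Wilcoxon signed-rank test over experiment-level MAEs used for these comparisons can be sketched as follows; the error values here are hypothetical stand-ins, not the paper's data:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical experiment-level mean MAEs for two methods across 14 series.
mae_a = np.linspace(0.10, 0.25, 14)            # method A
mae_b = mae_a + np.linspace(0.01, 0.04, 14)    # method B: systematically worse

# One-tailed test: does method A achieve systematically lower error than method B?
res = wilcoxon(mae_a, mae_b, alternative="less")
```

Averaging per-spectrum errors to one value per experiment before testing, as done in the paper, avoids pseudo-replication from the multiple spectra within each series.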
From the raw per-spectrum absolute errors, the overall MAE (averaged ov er all 144 spectra), with standard error of the mean (SEM) and 95% CI, is summarised in T able 12: for GABA (primary target), MAE was 0 . 161 (MRSNet-Y AE), 0 . 203 (MRSNet-CNN), 0 . 167 (FCNN), and 0 . 220 (LCModel-OFF). By metabolite, LCModel-OFF had the lowest MAE for Cr in 9 of 14 series, FCNN for Gln in 9 series, MRSNet-Y AE for GABA in 10 series and for Glu in 7 series. Ranking of the three DL methods was unchanged when summarising per-series performance by the median of the 14 experiment-le vel MAEs instead of the MAE pooled ov er all 144 spectra. Although FCNN achie ved the lo west ov erall MAE on the phantom spectra, cross-validation on simulated data showed consistently higher errors on validation than on training sets (validation MAE ≈ 1 . 5 × training MAE), indicating mild overfitting; the apparently fa vourable phantom performance should therefore be interpreted with caution rather than as evidence of superior generalisation. Nev ertheless, the Y AE and CNN models remain competitiv e on phantom data: MRSNet-Y AE achiev ed the lowest mean MAE for GAB A (the primary target) and for Glu, and FCNN for Gln (T able 12). Cr and NAA have well- separated resonances in OFF spectra and do not require edited spectroscopy for quantification. LCModel, designed for standard PRESS or STEAM spectra, can therefore quantify them more simply and accurately from OFF , which is consistent with LCModel-OFF having the lo west MAE for Cr . Pairwise one-tailed W ilcoxon signed-rank tests (Section 3.7) were used to test whether one method had systemat- 23 T able 10: Experiment-level comparison between LCModel-OFF and LCModel-DIFF using absolute errors on phantom data. Cr is omitted as it can be quantified from OFF; the comparison focuses on GABA, Gln and Glu, for which OFF vs DIFF quantification di ff ers. 
For each experiment and metabolite, absolute errors were first computed for indi vidual spectra, and the mean error across spectra was used as a single experiment- lev el summary statistic to avoid pseudo-replication (i.e., all values in this table are based on mean MAE). Paired one-tailed Wilcoxon signed- rank tests were applied across experiments to assess whether LCModel-OFF yields systematically lower errors (Section 3.7). LCModel-OFF achiev ed significantly lower errors for GABA ( p = 0 . 045, lower error in 11 out of 14 experiments) and Glu ( p = 0 . 0019, lower error in 12 out of 14 experiments), while no significant di ff erence was observed for Gln. These results indicate that LCModel-OFF provides equal or superior quantification performance compared to LCModel-DIFF at the experiment level, supporting its use as a conservativ e and stable baseline for subsequent model comparisons. Metabolite p -value (OFF < DIFF) Lower error in e xperiments Mean MAE di ff erence N exp GAB A 0 . 045 11 / 14 − 1 . 35 × 10 − 2 14 Gln 0 . 809 4 / 14 + 3 . 62 × 10 − 3 14 Glu 0 . 0019 12 / 14 − 7 . 49 × 10 − 2 14 T able 11: Phantom performance for optimal-configuration models compared with the LCModel baseline. Experimental series E1–E14 and phan- tom compositions are summarised in T able 1. V alues are experiment-lev el mean MAE ± SEM (standard error of the mean; max-normalised concentrations). Series Cr GAB A MRSNet-Y AE MRSNet-CNN LCModel-OFF FCNN MRSNet-Y AE MRSNet-CNN LCModel-OFF FCNN E1 0 . 069 ± . 009 0 . 067 ± . 009 0 . 009 ± . 004 0 . 015 ± . 004 0 . 079 ± . 017 0 . 086 ± . 019 0 . 096 ± . 029 0 . 081 ± . 020 E2 0 . 108 ± . 015 0 . 229 ± . 023 0 . 043 ± . 007 0 . 092 ± . 014 0 . 171 ± . 029 0 . 188 ± . 026 0 . 261 ± . 056 0 . 185 ± . 036 E3 0 . 291 ± . 015 0 . 188 ± . 009 0 . 133 ± . 022 0 . 137 ± . 011 0 . 167 ± . 020 0 . 070 ± . 016 0 . 247 ± . 031 0 . 132 ± . 019 E4a 0 . 305 ± . 027 0 . 081 ± . 009 0 . 038 ± . 012 0 . 026 ± . 009 0 . 059 ± . 015 0 . 236 ± . 065 0 . 
133 ± . 046 0 . 074 ± . 015 E4b 0 . 317 ± . 039 0 . 115 ± . 016 0 . 045 ± . 009 0 . 032 ± . 011 0 . 076 ± . 019 0 . 141 ± . 049 0 . 145 ± . 045 0 . 115 ± . 022 E4c 0 . 185 ± . 025 0 . 052 ± . 026 0 . 053 ± . 009 0 . 063 ± . 014 0 . 075 ± . 039 0 . 199 ± . 031 0 . 139 ± . 050 0 . 052 ± . 018 E4d 0 . 194 ± . 028 0 . 071 ± . 019 0 . 050 ± . 009 0 . 060 ± . 018 0 . 113 ± . 024 0 . 179 ± . 029 0 . 138 ± . 051 0 . 064 ± . 018 E6 0 . 276 ± . 009 0 . 452 ± . 010 0 . 213 ± . 009 0 . 398 ± . 022 0 . 188 ± . 052 0 . 128 ± . 026 0 . 326 ± . 076 0 . 256 ± . 062 E7 0 . 100 ± . 015 0 . 103 ± . 012 0 . 040 ± . 008 0 . 047 ± . 010 0 . 352 ± . 049 0 . 424 ± . 050 0 . 380 ± . 051 0 . 357 ± . 044 E8 0 . 059 ± . 014 0 . 052 ± . 011 0 . 056 ± . 019 0 . 017 ± . 004 0 . 135 ± . 029 0 . 183 ± . 039 0 . 179 ± . 044 0 . 157 ± . 032 E9a 0 . 175 ± . 039 0 . 102 ± . 042 0 . 083 ± . 029 0 . 091 ± . 024 0 . 166 ± . 025 0 . 262 ± . 029 0 . 208 ± . 024 0 . 166 ± . 036 E9b 0 . 179 ± . 027 0 . 109 ± . 039 0 . 094 ± . 020 0 . 155 ± . 023 0 . 095 ± . 027 0 . 196 ± . 039 0 . 202 ± . 029 0 . 135 ± . 032 E11 0 . 111 ± . 032 0 . 083 ± . 042 0 . 034 ± . 011 0 . 111 ± . 045 0 . 167 ± . 050 0 . 206 ± . 049 0 . 208 ± . 050 0 . 174 ± . 041 E14 0 . 032 ± . 007 0 . 072 ± . 005 0 . 057 ± . 009 0 . 035 ± . 005 0 . 217 ± . 035 0 . 312 ± . 054 0 . 239 ± . 039 0 . 225 ± . 038 Series Gln Glu MRSNet-Y AE MRSNet-CNN LCModel-OFF FCNN MRSNet-Y AE MRSNet-CNN LCModel-OFF FCNN E1 0 . 026 ± . 002 0 . 032 ± . 005 0 . 000 ± . 000 0 . 032 ± . 008 0 . 074 ± . 011 0 . 015 ± . 002 0 . 006 ± . 003 0 . 036 ± . 006 E2 0 . 372 ± . 076 0 . 452 ± . 034 0 . 166 ± . 048 0 . 224 ± . 053 0 . 152 ± . 015 0 . 166 ± . 014 0 . 048 ± . 010 0 . 034 ± . 007 E3 0 . 073 ± . 014 0 . 057 ± . 009 0 . 116 ± . 005 0 . 063 ± . 013 0 . 065 ± . 015 0 . 095 ± . 015 0 . 263 ± . 015 0 . 095 ± . 014 E4a 0 . 250 ± . 072 0 . 090 ± . 054 0 . 113 ± . 011 0 . 042 ± . 014 0 . 146 ± . 035 0 . 128 ± . 028 0 . 279 ± . 037 0 . 193 ± . 030 E4b 0 . 130 ± . 037 0 . 053 ± . 
[Table 11, continued: per-experiment MAE values (mean ± SEM) for series E4c, E4d, E6, E7, E8, E9a, E9b, E11, and E14.]

...statistically lower experiment-level MAE than another. For Cr, the null hypothesis that the other methods were not worse than LCModel-OFF was rejected for MRSNet-YAE and MRSNet-CNN (α = 0.01) but not for FCNN. For GABA and Glu, the null hypothesis that the other methods were not worse than MRSNet-YAE was rejected for LCModel-OFF but not for MRSNet-CNN or FCNN. For Gln, the null hypothesis that the other methods were not worse than FCNN was rejected for all other methods. Thus, each of the four methods was statistically superior to at least one other on at least one metabolite, but no method was superior to all others on all metabolites.

Table 12: Overall MAE over all 144 phantom spectra (max-normalised concentrations). Primary target: GABA. Computed from raw per-spectrum absolute errors (DL models from model-dist; LCModel-OFF from LCM-results analysis on the same 144 spectra). Values are mean ± SEM with 95% CI (normal approximation) in brackets. The "Overall" row is the mean over all 144 spectra of the per-spectrum MAE (mean over the five metabolites Cr, GABA, Gln, Glu, NAA).

Metabolite | MRSNet-YAE                 | MRSNet-CNN                 | FCNN                       | LCModel-OFF
Cr         | 0.165 ± 0.009 [.146, .183] | 0.136 ± 0.010 [.116, .155] | 0.091 ± 0.009 [.074, .109] | 0.067 ± 0.006 [.056, .078]
GABA       | 0.161 ± 0.011 [.139, .183] | 0.203 ± 0.013 [.178, .229] | 0.167 ± 0.012 [.144, .190] | 0.220 ± 0.015 [.191, .249]
Gln        | 0.238 ± 0.019 [.201, .276] | 0.153 ± 0.012 [.129, .177] | 0.080 ± 0.009 [.062, .098] | 0.116 ± 0.008 [.101, .132]
Glu        | 0.219 ± 0.014 [.191, .246] | 0.230 ± 0.014 [.201, .258] | 0.237 ± 0.015 [.208, .266] | 0.304 ± 0.016 [.272, .336]
Overall    | 0.160 ± 0.007 [.147, .173] | 0.147 ± 0.006 [.134, .159] | 0.117 ± 0.006 [.105, .129] | 0.144 ± 0.006 [.131, .156]

Table 13: MAE over all 144 phantom spectra for models trained with variable linewidth augmentation (1–10 Hz). Values are max-normalised relative concentrations (mean ± SEM with 95% CI in brackets). Bold indicates the lowest error per metabolite. The "Overall" row is the mean over all 144 spectra of the per-spectrum MAE (mean over the five metabolites Cr, GABA, Gln, Glu, NAA).

Metabolite | MRSNet-YAE (Aug)           | MRSNet-CNN (Aug)           | FCNN (Aug)                 | LCModel-OFF
Cr         | 0.103 ± 0.008 [.088, .118] | 0.129 ± 0.009 [.111, .147] | 0.102 ± 0.009 [.084, .120] | 0.067 ± 0.006 [.056, .078]
GABA       | 0.151 ± 0.010 [.132, .170] | 0.161 ± 0.011 [.140, .182] | 0.160 ± 0.011 [.139, .181] | 0.220 ± 0.015 [.191, .249]
Gln        | 0.194 ± 0.020 [.155, .234] | 0.151 ± 0.013 [.126, .176] | 0.099 ± 0.011 [.079, .120] | 0.116 ± 0.008 [.101, .132]
Glu        | 0.184 ± 0.015 [.156, .213] | 0.212 ± 0.014 [.185, .239] | 0.209 ± 0.014 [.183, .236] | 0.304 ± 0.016 [.272, .336]
Overall    | 0.131 ± 0.008 [.115, .146] | 0.133 ± 0.005 [.122, .143] | 0.116 ± 0.005 [.106, .127] | 0.144 ± 0.006 [.131, .156]

Figure 7 shows that error distributions varied by series, metabolite, and model. Absolute errors for GABA spanned a wider range than for Cr, consistent with Cr's strong, well-separated signal and GABA's overlap with other metabolites. Performance did not follow a simple ordering by phantom type: the deliberately miscalibrated solution series E2 did not consistently show higher errors than the well-calibrated series E3 (e.g., for Cr and GABA, errors in E2 were similar to or lower than in E3, except for MRSNet-CNN); only for Gln did E2 show noticeably larger MAE and spread. Similarly, tissue-mimicking gel phantoms did not consistently yield worse quantification than solution phantoms. Domain shift therefore appears to depend on factors beyond calibration or physical state alone; characterising them is left for future work. Overall, each method had strengths on specific metabolites or conditions, but no single method emerged as best across all scenarios.

6.2. Impact of Linewidth Augmentation

To test whether the sim-to-real gap is primarily driven by the fixed linewidth in our training data, we trained additional versions of the CNN, FCNN, and YAE models on a dataset augmented with variable linewidths sampled uniformly from 1 to 10 Hz in 0.2 Hz steps. This range spans all estimated linewidths in our phantom data (Section 3.1) and covers the full range of experimental conditions, from high-quality phantom scans (typically 2–4 Hz) to in vivo scenarios where poor shimming can lead to significantly broader lines.
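A minimal sketch of this augmentation step, assuming the extra broadening is applied as pure Lorentzian apodisation of the time-domain FID; the function and variable names are illustrative, not the MRSNet implementation:

```python
import numpy as np

def augment_linewidth(fid, dwell_time, rng, lw_min=1.0, lw_max=10.0, lw_step=0.2):
    """Apply a randomly drawn extra Lorentzian linewidth to a simulated FID.

    The extra linewidth (Hz) is drawn uniformly from the grid lw_min..lw_max
    in lw_step increments; Lorentzian broadening by lw Hz corresponds to
    multiplying the FID by exp(-pi * lw * t).
    """
    grid = np.arange(lw_min, lw_max + lw_step / 2, lw_step)
    lw = float(rng.choice(grid))              # target extra linewidth in Hz
    t = np.arange(len(fid)) * dwell_time      # time axis in seconds
    return fid * np.exp(-np.pi * lw * t), lw

# Toy FID: a single resonance at 100 Hz, 2048 points at 2 kHz bandwidth.
rng = np.random.default_rng(0)
t = np.arange(2048) * 5e-4
fid = np.exp(2j * np.pi * 100.0 * t)
fid_aug, lw = augment_linewidth(fid, 5e-4, rng)
```

After Fourier transformation, the broadened FID yields a peak whose width grows linearly with the drawn linewidth, so each training example sees a different, randomly chosen lineshape width.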
Table 13 summarises the linewidth-augmented results alongside the non-augmented results in Table 12. All augmented models maintained near-perfect performance on the simulated test set (per-metabolite MAEs on the order of 10^-2). On the experimental phantoms, augmentation improved performance across architectures. For GABA, the augmented YAE achieved the lowest error (0.151), followed by FCNN (0.160) and CNN (0.161), all outperforming LCModel-OFF (0.220). For Glu, the augmented YAE also achieved the lowest error (0.184), followed by FCNN (0.209) and CNN (0.212), all improving upon their non-augmented baselines and LCModel (0.304). For Gln, the FCNN remained the best-performing model (0.099), although this was worse than its non-augmented performance (0.080), while the augmented YAE (0.194) improved relative to its non-augmented baseline (0.238). Note that the best model for each metabolite, and overall, remains the same in the augmented and non-augmented versions.

Despite these gains, the error on phantom data remains an order of magnitude higher than on simulated data (≈ 0.15 vs. values on the order of 10^-2 in simulations). So while linewidth mismatch is an important factor, it is not the only one. As the gap persists across all three architectures, other unmodelled effects (e.g., lineshape asymmetries, baseline instabilities, phase errors) must be addressed in future simulation pipelines, which is beyond the scope of this study.

6.3. Discussion

We performed systematic model selection and ground-truth evaluation of deep learning architectures for MEGA-PRESS quantification. Our results show that DL can match or exceed LCModel on phantoms for GABA and Glu when linewidth augmentation is used; however, a sim-to-real gap remains, and performance on simulations alone is a poor predictor of phantom performance.

6.3.1. Principal Findings

The principal finding of this work is the stark contrast between model performance on simulated versus experimental data. Following systematic Bayesian optimisation, the final YAE and CNN models achieved near-perfect quantification on a large, independent simulated validation dataset (per-metabolite MAEs on the order of 0.008 for Cr and 0.024–0.026 for GABA; regression slopes and R² ≈ 1.00; see Table 9), indicating that they had learned the mapping from idealised spectra to metabolite concentrations.

On the 144 spectra from 112 experimental phantoms with known ground truth, this high performance was not maintained. Errors increased substantially for both DL models, and neither showed a consistent, statistically significant advantage over LCModel-OFF (Section 6). Performance was comparable across methods (Table 11). No single method was statistically best across all four metabolites. Perhaps most notably, the degradation was evident even on solution-based phantoms, which are closest to the simulation ideal, and not only on tissue-mimicking gels. Cross-validation on simulated data revealed mild but systematic overfitting in the three best-performing deep models (YAE, CNN, FCNN), with validation MAE typically 20–70% higher than training MAE across folds, and somewhat stronger overfitting in QNet and QNetBasis, whereas the EncDec model showed little train–validation discrepancy. However, these effects were small compared to the much larger increase in error when moving from simulated data to experimental phantoms. This shows that conventional overfitting on simulations alone cannot account for the sim-to-real degradation.

Variable linewidth augmentation (Section 6.2) reduced the sim-to-real gap: nearly all DL-model performances improved per metabolite and overall, even though their relative performance remains largely consistent.
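For reference, the error metric reported throughout can be sketched as follows, assuming "max-normalised" means dividing each spectrum's concentration vector by its largest entry; the helper names are illustrative:

```python
import numpy as np

def max_normalise(conc):
    """Scale each spectrum's concentration vector by its largest entry."""
    return conc / conc.max(axis=1, keepdims=True)

def mae_per_metabolite(pred, truth):
    """Mean absolute error per metabolite over spectra, after max-normalisation."""
    return np.abs(max_normalise(pred) - max_normalise(truth)).mean(axis=0)

# Toy example: 3 spectra x 5 metabolites (Cr, GABA, Gln, Glu, NAA).
truth = np.array([[8.0, 2.0, 2.5, 7.0, 10.0],
                  [9.0, 1.5, 3.0, 6.0, 10.0],
                  [7.5, 2.5, 2.0, 8.0, 10.0]])
pred = 1.1 * truth  # a uniformly rescaled prediction
# A global scale factor cancels under max-normalisation, so the MAE is ~0.
assert np.allclose(mae_per_metabolite(pred, truth), 0.0)
```

As the assertion illustrates, the metric is insensitive to global scaling and only penalises errors in relative concentrations.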
However, a significant sim-to-real gap remains, and we cannot distinguish how much of this gap is due to other unmodelled factors versus fundamental limitations. Further experiments with more aggressive augmentation and substantially improved simulation pipelines are needed.

The relative strength of DL for GABA and Glu/Gln (versus LCModel for Cr) aligns with the fact that edited MEGA-PRESS is most beneficial for overlapping, low-SNR metabolites such as GABA and Glu/Gln, whereas reference metabolites such as Cr are more easily quantified from OFF spectra alone. Furthermore, our results show that all methods, including LCModel, struggled with out-of-distribution data. The highest quantification errors were observed in experiments E7, E9, and E11, which featured extreme GABA/NAA or Glu/NAA concentration ratios. These conditions were likely underrepresented in, or outside the bounds of, the models' training data, emphasising the vulnerability of data-driven models when faced with novel biochemical conditions. While Sobol sampling ensures uniform coverage of the hypercube of relative concentrations, specific physiological or experimental combinations (e.g., extremely high GABA with low NAA) may still fall into sparse regions of the training distribution, contributing to the performance gap.

6.3.2. Comparison with Literature Baselines

The comparative deep learning baselines (FCNN, QNet, QMRS, EncDec) were not originally designed for MEGA-PRESS and would require sequence-specific adaptation and optimisation to reach their full potential. They were used in their published configurations with minimal adaptation to our MEGA-PRESS pipeline (Section 4.3) and were not subjected to the same Bayesian optimisation as the CNN and YAE.
Our comparison should therefore be interpreted as assessing the transferability of these architectures to MEGA-PRESS quantification, i.e., how well they perform out-of-the-box, rather than their absolute performance limits or optimised potential. They nevertheless serve as a useful reference, in particular given the strong performance of the FCNN architecture.

Figure 8: Comparison of simulated (FID-A) vs. experimental (Skyra) basis spectra; panels (a)–(e) show OFF and (f)–(j) ON spectra for NAA, Cr, GABA, Glu, and Gln. Experimental basis spectra are derived from 100 mM metabolite solutions acquired on the same scanner.

6.3.3. The Sim-to-Real Gap: Potential Causes

The performance drop demonstrates a sim-to-real gap that also affects other medical deep learning applications. Although our simulations were rather sophisticated (solving the time-dependent Schrödinger equation for accepted Hamiltonian models of the metabolites, simulating actual pulse sequences with accurate pulse timings and pulse profiles, and incorporating phase cycling and realistic slice profiles to account for spatial localisation effects), they failed to capture the full range of subtle variations present in experimentally acquired signals. Discrepancies can arise from several physical and chemical factors not fully modelled in the simulation:

• Lineshape and Phase Variations: Minor, unmodelled variations in peak lineshapes (deviating from pure Lorentzian/Gaussian forms) and phase inconsistencies across the spectrum can distort the features DL models were trained to recognise.

• Baseline Instability: The idealised baselines in simulations do not fully account for real-world instabilities arising from factors such as imperfect water suppression, eddy currents, or subtle hardware artefacts, which can obscure low-amplitude peaks.
• Chemical Environment Effects: The precise chemical environment in the phantoms (e.g., pH, temperature, ionic concentration) can cause small shifts in peak frequencies that may not be perfectly aligned with the simulation's basis set.

To investigate further, we compared our simulated basis spectra with experimental spectra obtained for single-metabolite solutions (approx. 100 mM), acquired using the same pulse sequence on the same scanner. Figure 8 shows that while the spectra generally match quite well, there are some systematic differences. Most notably, the simulated ON spectrum for NAA has an additional small feature between 4 ppm and 4.5 ppm that is completely absent in the experimental spectra; none of the other metabolites exhibits a comparable feature in this region. In the simulated data, this peak is reproducible and robust to added noise, effectively providing a unique fingerprint for NAA, whereas in the experimental spectra the corresponding region is dominated by baseline and noise. The ratio of the Cr peaks at 3.9 ppm and 3.0 ppm is also slightly higher in the simulations than in the experimental spectra, and for GABA (and to a lesser extent Glu) the triplet around 2.3 ppm appears sharper and less noise-affected in the simulated basis than in the experimental data. Together with our sim-to-real analysis on the phantom series, this suggests that the models may have learned to rely on subtle, simulation-specific features and peak patterns that are stable in simulated data but absent, attenuated, or distorted in experimental spectra. A more detailed view is provided by the per-series sim-to-real metrics in the accompanying repository [69].
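Agreement metrics of the kind used in that per-series analysis can be illustrated as follows. This is a sketch with assumed definitions (Pearson correlation of magnitude and phase, cosine similarity of magnitude vectors, and a relative magnitude residual), not the exact metrics of the repository analysis:

```python
import numpy as np

def sim2real_metrics(sim, real):
    """Agreement metrics between a simulated and an experimental complex
    spectrum sampled on the same frequency axis (assumed definitions)."""
    mag_s, mag_r = np.abs(sim), np.abs(real)
    ph_s, ph_r = np.angle(sim), np.angle(real)
    return {
        "mag_corr": np.corrcoef(mag_s, mag_r)[0, 1],   # magnitude agreement
        "cos_sim": mag_s @ mag_r / (np.linalg.norm(mag_s) * np.linalg.norm(mag_r)),
        "phase_corr": np.corrcoef(ph_s, ph_r)[0, 1],   # phase agreement
        "residual": np.linalg.norm(mag_s - mag_r) / np.linalg.norm(mag_r),
    }

# Toy check: a noisy, slightly rescaled copy of a two-peak spectrum still
# shows high magnitude agreement but a non-zero residual.
rng = np.random.default_rng(1)
f = np.linspace(-500, 500, 1024)
sim = (np.exp(-((f - 100) / 5) ** 2)
       + 0.5 * np.exp(-((f + 50) / 8) ** 2)) * np.exp(1j * 0.001 * f)
real = 0.9 * sim + 0.02 * (rng.normal(size=f.size) + 1j * rng.normal(size=f.size))
m = sim2real_metrics(sim, real)
```

Separating magnitude and phase agreement in this way mirrors the observation above that magnitude matches well while phase agreement is considerably lower.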
The series with the highest quantification errors, E7, E9, and E11, also show the largest residual magnitude and phase discrepancy between simulated and experimental spectra in that analysis, consistent with their concentration-ratio and biochemical conditions (e.g., extreme GABA/NAA or Glu/NAA) being underrepresented in training and harder to match by our fixed-linewidth simulation. Decomposing the residual into distinct contributions from lineshape asymmetries, phase errors, and baseline drift would require targeted simulation experiments (e.g., fitting asymmetric lineshapes or perturbed phase to phantom spectra).

Figure 9: Representative comparison of simulated vs. experimental ON spectra for (a) a solution series (E1) and (b) a gel phantom series (E4a), together with their residuals. Overall spectral magnitudes and peak positions agree well, particularly for the solution series, while the gel phantoms exhibit broader lineshapes and more pronounced baseline and phase differences that are closer to in vivo conditions.

6.3.4. Interpreting Model Behaviour and Methodological Implications

Architectural differences between our models help explain their different failure modes. The YAE, designed to learn a compressed, global representation via its autoencoder, is effective on structurally consistent simulated data. However, this global focus becomes a weakness when presented with experimental data. Baseline distortions and lineshape variations can obscure the subtle features of low-concentration metabolites, preventing them from being well encoded in the latent space and subsequently reconstructed or quantified. This may explain its difficulty in resolving the fine details of metabolites such as Gln and Cr. The CNN architecture is biased towards extracting local features (i.e., spectral peaks) through its convolutional filters.
This makes it theoretically more robust to baseline issues and more adept at identifying weak signals such as Gln. However, its failure suggests that the filters were trained to recognise the overly specific, idealised peak shapes present in the simulations and did not generalise well to the broadened or otherwise distorted peaks found in the phantom data. The FCNN, particularly with its concatenated ReLU activation functions, performed comparably to the YAE on several metrics (e.g., best overall MAE). Its strong phantom performance should be interpreted with caution given its overfitting on simulated data (Section 6.3.6). That said, a direct regressor can match the YAE here because the task may not need the full capacity of an autoencoder, and the FCNN avoids the YAE's latent bottleneck and decoder; under domain shift, it may therefore rely on features that transfer better.

6.3.5. Implications for Clinical Translation

Acceptance of any new MRS quantification method in clinical or experimental settings, traditional or DL-based, requires phantom validation with known concentrations. The next step, where feasible, is validation on in vivo data (where macromolecule modelling is critical) and across multiple sites and vendors to establish generalisability.

6.3.6. Limitations

This study has several limitations. Firstly, although we identified a sim-to-real gap and discussed plausible causes, we did not perform an exhaustive quantitative analysis of spectral differences between simulated and phantom data. Summary statistics from our sim-to-real comparison nevertheless indicate good magnitude agreement (correlation 0.90, cosine similarity 0.92), lower phase agreement (correlation 0.46), and mean NAA and Cr linewidths (FWHM in ppm) of 0.041 and 0.038 with SNRs of 26.3 and 14.3, respectively; a more exhaustive analysis could guide improved simulations. Secondly, validation was confined to phantoms.
While phantoms are essential for ground-truth evaluation, they do not capture the full complexity of in vivo data (e.g., macromolecules, lipids, physiological noise). Phantom validation is therefore a necessary but not sufficient step toward establishing clinical utility. In particular, our phantoms do not include a macromolecule (MM) background. In vivo, the overlap of GABA with MM at ∼3 ppm is a major challenge that edited MEGA-PRESS and difference-spectrum quantification (e.g., LCModel-DIFF) are specifically designed to address. The absence of MM in our phantom data means that the relative merits of OFF-based versus DIFF-based quantification observed here may not directly transfer to human data, where MM modelling is critical; clinical readers should interpret our findings in that light. For the same reason, the effectiveness of our linewidth-augmentation strategy on in vivo data, where MM contributes a strong overlapping signal at ∼3 ppm, may differ from that observed in phantoms. This phantom validation should therefore be seen as a necessary step toward clinical deployment rather than a complete in vivo solution. Thirdly, our findings are specific to MEGA-PRESS and the phantom conditions and scanners used; validation was performed at a single site on a Siemens 3 T system, and generalisation to other sequences, sites, or vendors would require further validation. Finally, FCNN's phantom performance may be inflated by mild overfitting on simulated data.

6.3.7. Future Work

Reducing the sim-to-real gap requires more realistic simulations and augmentation, and possibly domain adaptation or mixed simulation–phantom training.
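One simple way to realise mixed simulation–phantom training is sketched below, under the assumption that the small real set is oversampled to a fixed fraction of each epoch; the function name and the 10% fraction are illustrative, not something we evaluated:

```python
import numpy as np

def mixed_epoch_indices(n_sim, n_real, real_fraction, rng):
    """Draw one epoch of training indices mixing simulated and real spectra.

    The real set is drawn with replacement (oversampled) so that it makes up
    a fixed fraction of the epoch, countering the simulated/real imbalance
    that would otherwise let the model largely ignore the few real spectra.
    """
    n_real_draws = int(round(real_fraction * n_sim))
    sim_idx = rng.permutation(n_sim)[: n_sim - n_real_draws]
    real_idx = rng.choice(n_real, size=n_real_draws, replace=True)
    return sim_idx, real_idx

rng = np.random.default_rng(0)
# E.g. 100,000 simulated and 144 experimental spectra, 10% real per epoch.
sim_idx, real_idx = mixed_epoch_indices(100_000, 144, 0.1, rng)
```

Oversampling is only one option; loss re-weighting of the real examples would achieve a similar balance without duplicating spectra.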
The sim-to-real comparison (Figure 9) and the structure of the residuals between simulated and experimental spectra indicate that simulations must capture realistic peak shapes as well as distortions and various noise sources and types. Improving model architecture alone may have limited impact if the training data do not better match experimental variability. In parallel, more sophisticated training strategies are needed. Techniques from domain adaptation and transfer learning could help models generalise from the simulation to the experimental domain. Training on mixed datasets of simulated and experimental spectra is another possibility, though care must be taken to address the inherent data imbalance, where a few real spectra may be ignored by the model. Semi-supervised or self-supervised learning on large-scale unlabelled in vivo data could also improve robustness, even if the lack of ground truth remains a challenge. We provide a benchmark and argue that the reliability of any new MRS quantification method should be judged against experimental phantom data with known concentrations.

7. Conclusion

We presented a systematic evaluation of different deep learning architectures for GABA quantification from MEGA-PRESS spectra, focusing on a Y-shaped autoencoder (YAE) and a convolutional neural network (CNN), and including several adapted baseline architectures (FCNN, QNet-default, QNetBasis, QMRS and EncDec). We selected CNN and YAE configurations via Bayesian optimisation on 10,000 simulated spectra (five-fold cross-validation), trained the chosen models on 100,000 spectra, and evaluated them on 144 phantom spectra with known concentrations, alongside LCModel-OFF. Our principal finding is that while non-augmented deep learning models showed performance degradation on experimental phantoms, models trained with variable linewidth augmentation showed lower errors.
The augmented YAE and FCNN models achieved an MAE for GABA over all phantom spectra of 0.151 and 0.160, respectively, surpassing the conventional LCModel-OFF baseline (0.220). Similar improvements were observed for glutamate, and partial improvements for glutamine, depending on the architecture. Table 13 reports the augmented results; Table 12 gives the non-augmented comparison. The augmented DL models perform better for GABA and Glu (and FCNN for Gln), while Cr and NAA are more simply and accurately quantified by LCModel from OFF spectra.

Translating DL-based MRS quantification from simulation to clinical use remains difficult; phantom validation is needed to quantify the gap. The performance gap likely stems from subtle variations in experimental spectra (e.g., lineshape distortions, baseline instabilities, unmodelled chemical shifts) not fully captured by high-fidelity simulations. Training initially used a fixed linewidth, and variable linewidth augmentation improved the accuracy on the phantom spectra. However, a notable sim-to-real gap remained, and it is unclear whether it stems from a fundamental limitation or from training data fidelity. Performance gains may be achieved by improving simulation and training strategies rather than by changing model architectures. This must be combined with validation on experimental data with known ground truth to devise trusted tools for clinical and biomedical research.

CRediT Authorship Contribution Statement

Zien Ma: Conceptualization; Methodology; Software; Validation; Formal analysis; Investigation; Data curation; Visualization; Writing – original draft. Oktay Karakuş: Writing – review & editing. Sophie M. Shermer: Methodology; Resources; Data curation; Writing – review & editing. Frank C. Langbein: Methodology; Software; Formal analysis; Data curation; Writing – review & editing.

Declaration of Competing Interest

The authors declare no competing interests.
Data Availability

Experimental phantom compositions and processed spectra (Swansea 3 T MEGA-PRESS phantom series E1–E14), along with summary evaluation tables and figures, are available at https://qyber.black/mrs/data-megapress-spectra. Basis spectra for simulation are at https://qyber.black/mrs/data-mrsnet-basis. Simulated datasets can be generated using the scripts and configuration in the MRSNet code repository; pre-generated simulated MEGA-PRESS spectra are also available at https://qyber.black/mrs/data-mrsnet-simulated-spectra-megapress. The sim-to-real analysis (per-series metrics and comparison reports) is at https://qyber.black/mrs/results-mrsnet-sim2real [69]. CNN, YAE, and additional model selection results are at https://qyber.black/mrs/results-mrsnet-models-cnn, https://qyber.black/mrs/results-mrsnet-models-yae, and https://qyber.black/mrs/results-mrsnet-models-extra [79, 80]. Trained models are at https://qyber.black/mrs/data-mrsnet-models. All listed repositories are tagged v2.1 for consistency and released under AGPL-3.0-or-later.

Code Availability

Training, inference, analysis, and simulation code, including model configurations, is released at https://qyber.black/mrs/code-mrsnet [68] (tag v2.1), with a versioned DOI at Zenodo: https://doi.org/10.5281/zenodo.18520504. DICOM reading for Siemens IMA spectroscopy files is provided by the QDicom Utilities [81]. Basis spectra simulation uses FID-A [20] (V1.2). The repository README and requirements.txt list further dependencies (e.g., Python, TensorFlow). All code is released under AGPL-3.0-or-later.
Funding and Acknowledgements

The authors acknowledge the use of the MRI scanner facilities at Swansea University. The chemicals for the phantom studies were sponsored by Cardiff University School of Computer Science and Informatics funds.

Ethical Approval

This study used only physical phantoms; no human participants or animals were involved. Therefore, ethical approval was not required.

References

[1] J. W. Błaszczyk, Parkinson's disease and neurodegeneration: GABA-collapse hypothesis, Frontiers in Neuroscience 10 (2016) 269. doi:10.3389/fnins.2016.00269.
[2] C. Madeira, F. V. Alheira, M. A. Calcia, T. C. S. Silva, F. M. Tannos, C. Vargas-Lopes, M. Fisher, N. Goldenstein, M. A. Brasil, S. Vinogradov, S. T. Ferreira, R. Panizzutti, Blood levels of glutamate and glutamine in recent onset and chronic schizophrenia, Frontiers in Psychiatry 9 (2018) 713. doi:10.3389/fpsyt.2018.00713.
[3] R. Rideaux, M. Mikkelsen, R. A. E. Edden, Comparison of methods for spectral alignment and signal modelling of GABA-edited MR spectroscopy data, Neuroimage 232 (2021) 117900. doi:10.1016/j.neuroimage.2021.117900.
[4] J.-L. Yuan, S.-K. Wang, X.-J. Guo, L.-L. Ding, H. Gu, W.-L. Hu, Reduction of NAA/Cr ratio in a patient with reversible posterior leukoencephalopathy syndrome using MR spectroscopy, Archives of Medical Sciences. Atherosclerotic Diseases 1 (1) (2016) e98–e100. doi:10.5114/amsad.2016.62376.
[5] L. Zhang, P. Bu, The two sides of creatine in cancer, Trends in Cell Biology 32 (5) (2022) 380–390. doi:10.1016/j.tcb.2021.11.004.
[6] P. Nuss, Anxiety disorders and GABA neurotransmission: a disturbance of modulation, Neuropsychiatric Disease and Treatment 11 (2015) 165–175. doi:10.2147/NDT.S58841.
[7] B. Luscher, Q. Shen, N. Sahir, The GABAergic deficit hypothesis of major depressive disorder, Molecular Psychiatry 16 (4) (2011) 383–406. doi:10.1038/mp.2010.120.
[8] N. A. J. Puts, R. A. E.
Edden, In vivo magnetic resonance spectroscopy of GABA: a methodological review, Progress in Nuclear Magnetic Resonance Spectroscopy 60 (2012) 29–41. doi:10.1016/j.pnmrs.2011.06.001.
[9] S. Bollmann, C. Ghisleni, S.-S. Poil, E. Martin, J. Ball, D. Eich-Höchli, R. A. E. Edden, P. Klaver, L. Michels, D. Brandeis, R. L. O'Gorman, Developmental changes in gamma-aminobutyric acid levels in attention-deficit/hyperactivity disorder, Translational Psychiatry 5 (6) (2015) e589–e589. doi:10.1038/tp.2015.79.
[10] R. A. E. Edden, D. Crocetti, H. Zhu, D. L. Gilbert, S. H. Mostofsky, Reduced GABA concentration in attention-deficit/hyperactivity disorder, Archives of General Psychiatry 69 (7) (2012) 750–753. doi:10.1001/archgenpsychiatry.2011.2280.
[11] B. E. Jewett, S. Sharma, Physiology, GABA, StatPearls Publishing, Treasure Island (FL), 2023.
[12] M. Mescher, H. Merkle, J. Kirsch, M. Garwood, R. Gruetter, Simultaneous in vivo spectral editing and water suppression, NMR in Biomedicine 11 (6) (1998) 266–272. doi:10.1002/(SICI)1099-1492(199810)11.
[13] M. Mescher, A. Tannus, M. O'Neil Johnson, M. Garwood, Solvent suppression using selective echo dephasing, Journal of Magnetic Resonance 123 (2) (1996) 226–229. doi:10.1006/jmra.1996.0242.
[14] P. G. Mullins, D. J. McGonigle, R. L. O'Gorman, N. A. Puts, R. Vidyasagar, C. J. Evans, R. A. Edden, Current practice in the use of MEGA-PRESS spectroscopy for the detection of GABA, Neuroimage 86 (2014) 43–52. doi:10.1016/j.neuroimage.2012.12.004.
[15] G. Dias, R. Berto, M. Oliveira, L. Ueda, S. Dertkigil, P. D. P. Costa, A. Shamaei, et al., Spectro-ViT: a vision transformer model for GABA-edited MEGA-PRESS reconstruction using spectrograms, Magnetic Resonance Imaging 113 (2024) 110219. doi:10.1016/j.mri.2024.110219.
[16] A. Peek, T. Rebbeck, A. Leaver, S. Foster, K. Refshauge, N. Puts, G. Oeltzschner, O. C. Andronesi, P. B. Barker, W. Bogner, K. M. Cecil, I.-Y. Choi, D. K.
Deelchand, R. A. de Graaf, U. Dydak, R. A. Edden, U. E. Emir, A. D. Harris, A. P. Lin, D. J. Lythgoe, M. Mikkelsen, P. G. Mullins, J. Near, G. Öz, C. D. Rae, M. Terpstra, S. R. Williams, M. Wilson, A comprehensive guide to MEGA-PRESS for GABA measurement, Analytical Biochemistry 669 (2023) 115113. doi:10.1016/j.ab.2023.115113.
[17] M. Wilson, O. Andronesi, P. B. Barker, R. Bartha, A. Bizzi, P. J. Bolan, K. M. Brindle, I.-Y. Choi, C. Cudalbu, U. Dydak, U. E. Emir, R. G. Gonzalez, S. Gruber, R. Gruetter, R. K. Gupta, A. Heerschap, A. Henning, H. P. Hetherington, P. S. Huppi, R. E. Hurd, K. Kantarci, R. A. Kauppinen, D. W. J. Klomp, R. Kreis, M. J. Kruiskamp, M. O. Leach, A. P. Lin, P. R. Luijten, M. Marjańska, A. A. Maudsley, D. J. Meyerhoff, C. E. Mountford, P. G. Mullins, J. B. Murdoch, S. J. Nelson, R. Noeske, G. Öz, J. W. Pan, A. C. Peet, H. Poptani, S. Posse, E.-M. Ratai, N. Salibi, T. W. J. Scheenen, I. C. P. Smith, B. J. Soher, I. Tkáč, D. B. Vigneron, F. A. Howe, Methodological consensus on clinical proton MRS of the brain: Review and recommendations, Magnetic Resonance in Medicine 82 (2) (2019) 527–550. doi:10.1002/mrm.27742.
[18] C. Jenkins, M. Chandler, F. C. Langbein, S. M. Shermer, Benchmarking GABA quantification: A ground truth data set and comparative analysis of TARQUIN, LCModel, jMRUI and Gannet (2021).
[19] M. G. Saleh, D. Rimbault, M. Mikkelsen, G. Oeltzschner, A. M. Wang, D. Jiang, A. Alhamud, et al., Multi-vendor standardized sequence for edited magnetic resonance spectroscopy, Neuroimage 189 (2019) 425–431. doi:10.1016/j.neuroimage.2019.01.056.
[20] R. Simpson, G. A. Devenyi, P. Jezzard, T. J. Hennessy, J. Near, Advanced processing and simulation of MRS data using the FID appliance (FID-A)—an open source, MATLAB-based toolkit, Magnetic Resonance in Medicine 77 (1) (2017) 23–33. doi:10.1002/mrm.26091.
[21] M. Chandler, C. Jenkins, S. M. Shermer, F. C.
Langbein, MRSNet: Metabolite quantification from edited magnetic resonance spectra with convolutional neural networks (2019). arXiv:1909.03836.
[22] R. Rizzo, M. Dziadosz, S. P. Kyathanahally, A. Shamaei, R. Kreis, Quantification of MR spectra by deep learning in an idealized setting: Investigation of forms of input, network architectures, optimization by ensembles of networks, and training bias, Magnetic Resonance in Medicine 89 (5) (2022) 1707–1727. doi:10.1002/mrm.29561.
[23] M. Mikkelsen, P. Barker, P. Bhattacharyya, M. Brix, P. Buur, K. Cecil, K. L. Chan, et al., Big GABA: Edited MR spectroscopy at 24 research sites, Neuroimage 159 (2017) 32–45. doi:10.1016/j.neuroimage.2017.07.021.
[24] M. Mikkelsen, D. Rimbault, P. Barker, P. Bhattacharyya, M. Brix, P. Buur, K. Cecil, et al., Big GABA II: Water-referenced edited MR spectroscopy at 25 research sites, Neuroimage 191 (2019) 537–548. doi:10.1016/j.neuroimage.2019.02.059.
[25] R. Rizzo, M. Dziadosz, S. P. Kyathanahally, M. Reyes, R. Kreis, Reliability of quantification estimates in MR spectroscopy: CNNs vs. traditional model fitting, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), 2022, pp. 715–724. doi:10.1007/978-3-031-16452-1_68.
[26] S. Ramadan, A. Lin, P. Stanwell, Glutamate and glutamine: a review of in vivo MRS in the human brain, NMR in Biomedicine 26 (12) (2013) 1630–1646. doi:10.1002/nbm.3045.
[27] J. M. Duda, A. D. Moser, C. S. Zuo, F. Du, X. Chen, S. Perlo, C. E. Richards, N. Nascimento, M. Ironside, D. J. Crowley, L. M. Holsen, M. Misra, J. I. Hudson, J. M. Goldstein, D. A. Pizzagalli, Repeatability and reliability of GABA measurements with magnetic resonance spectroscopy in healthy young adults, Magnetic Resonance in Medicine 85 (5) (2021) 2359–2369. doi:10.1002/mrm.28587.
[28] F. Sanaei Nezhad, A. Anton, E. Michou, J. Jung, L. M. Parkes, S. R.
Williams, Quantification of GABA, glutamate and glutamine in a single measurement at 3T using GABA-edited MEGA-PRESS, NMR in Biomedicine 31 (1) (2018) e3847. doi:10.1002/nbm.3847.
[29] R. A. E. Edden, N. A. J. Puts, A. D. Harris, P. B. Barker, C. J. Evans, Gannet: A batch-processing tool for the quantitative analysis of gamma-aminobutyric acid-edited MR spectroscopy spectra, Journal of Magnetic Resonance Imaging 40 (6) (2014) 1445–1452. doi:10.1002/jmri.24478.
[30] G. Oeltzschner, H. J. Zöllner, S. C. N. Hui, M. Mikkelsen, M. G. Saleh, S. Tapper, R. A. E. Edden, Osprey: Open-source processing, reconstruction & estimation of magnetic resonance spectroscopy data, Journal of Neuroscience Methods 343 (2020) 108827. doi:10.1016/j.jneumeth.2020.108827.
[31] S. W. Provencher, Estimation of metabolite concentrations from localized in vivo proton NMR spectra, Magnetic Resonance in Medicine 30 (6) (1993) 672–679. doi:10.1002/mrm.1910300604.
[32] S. W. Provencher, Automatic quantitation of localized in vivo 1H spectra with LCModel, NMR in Biomedicine 14 (4) (2001) 260–264. doi:10.1002/nbm.698.
[33] M. Gajdošík, K. Landheer, K. M. Swanberg, C. Juchem, INSPECTOR: free software for magnetic resonance spectroscopy data inspection, processing, simulation and analysis, Scientific Reports 11 (1) (2021) 2094. doi:10.1038/s41598-021-81193-9.
[34] B. Soher, P. Semanchuk, D. Todd, X. Ji, D. Deelchand, J. Joers, G. Oz, K. Young, Vespa: Integrated applications for RF pulse design, spectral simulation and MRS data analysis, Magnetic Resonance in Medicine 90 (2023) 823–838. doi:10.1002/mrm.29686.
[35] A. Naressi, C. Couturier, I. Castang, R. de Beer, D. Graveron-Demilly, Java-based graphical user interface for MRUI, a software package for quantitation of in vivo/medical magnetic resonance spectroscopy signals, Computers in Biology and Medicine 31 (4) (2001) 269–286. doi:10.1016/S0010-4825(01)00006-3.
[36] J.-B. Poullet, D. M. Sima, A. W.
Simonetti, B. De Neuter, L. V anhamme, P . Lemmerling, S. V an Hu ff el, An automated quantitation of short echo time MRS spectra in an open source software en vironment: A QSES, NMR in Biomedicine 20 (5) (2007) 493–504. doi:10.1002 / nbm.1112. [37] L. V anhamme, A. v an den Boogaart, S. V an Hu ff el, Impro ved method for accurate and e ffi cient quantifi- cation of MRS data with use of prior knowledge, Journal of Magnetic Resonance 129 (1) (1997) 35–43. doi:10.1006 / jmre.1997.1244. [38] H. Ratiney , M. Sdika, Y . Coenradie, S. Cav assila, D. van Ormondt, D. Graveron-Demilly , T ime-domain semi-parametric estimation based on a metabolite basis set, NMR in Biomedicine 18 (1) (2005) 1–13. doi:10.1002 / nbm.895. [39] G. Reynolds, M. W ilson, A. Peet, T . N. Arv anitis, An algorithm for the automated quantitation of metabolites in in vitro nmr signals, Magnetic Resonance in Medicine 56 (6) (2006) 1211–1219. doi:10.1002 / mrm.21081. [40] M. Wilson, G. Reynolds, R. A. Kauppinen, T . N. Arv anitis, A. C. Peet, A constrained least-squares approach to the automated quantitation of in vi vo 1H magnetic resonance spectroscopy data, Magnetic Resonance in Medicine 65 (1) (2011) 1–12. doi:10.1002 / mrm.22579. [41] J.-B. Poullet, D. M. Sima, S. V an Hu ff el, MRS signal quantitation: A revie w of time- and frequency-domain methods, Journal of Magnetic Resonance 195 (2) (2008) 134–144. doi:10.1016 / j.jmr .2008.09.005. [42] W . T . Clarke, C. J. Stagg, S. Jbabdi, Fsl-mrs: An end-to-end spectroscopy analysis package, Magnetic Resonance in Medicine 85 (6) (2021-06) 2950–2964. doi:10.1002 / mrm.28630. [43] R. Kreis, C. S. Bolliger , The need for updates of spin system parameters, illustrated for the case of γ - aminobutyric acid, NMR in Biomedicine 25 (12) (2012) 1401–1403. doi:10.1002 / nbm.2810. [44] H. Zöllner , G. Oeltzschner , A. Schnitzler, H. 
Wittsack, In silico GABA + MEGA-PRESS: E ff ects of signal-to- noise ratio and line width on modeling the 3 ppm GABA + resonance, NMR in Biomedicine 34 (1) (2020) e4410. doi:10.1002 / nbm.4410. [45] H. Zöllner , S. T apper, S. C. N. Hui, P . Barker , R. Edden, G. Oeltzschner, Comparison of linear combi- nation modeling strategies for edited magnetic resonance spectroscopy at 3T, NMR in Biomedicine (2021). doi:10.1002 / nbm.4618. 33 [46] A. R. Craven, T . K. Bell, L. Ersland, A. D. Harris, K. Hugdahl, G. Oeltzschner , Linewidth-related bias in mod- elled concentration estimates from GAB A-edited 1H-MRS, bioRxiv (2024). doi:10.1101 / 2024.02.27.582249. [47] A. Craven, P . Bhattacharyya, W . T . Clarke, U. Dydak, R. Edden, L. Ersland, P . Mandal, et al., Comparison of sev en modelling algorithms for γ -aminobutyric acid-edited proton magnetic resonance spectroscopy , NMR in Biomedicine 35 (7) (2022) e4702. doi:10.1002 / nbm.4702. [48] C. Davies-Jenkins, H. Zöllner, D. Simi ˇ ci ´ c, S. C. N. Hui, Y . Song, K. Hupfeld, J. Prisciandaro, R. Ed- den, G. Oeltzschner , GAB A-edited MEGA-PRESS at 3T: Does a measured macromolecule background improv e linear combination modeling?, Magnetic Resonance in Medicine 92 (4) (2024) 1348–1362. doi:10.1002 / mrm.30158. [49] D. Chen, M. Lin, H. Liu, J. Li, Y . Zhou, T . Kang, L. Lin, Z. W u, J. W ang, J. Li, J. Lin, X. Chen, D. Guo, X. Qu, Magnetic resonance spectroscopy quantification aided by deep estimations of imperfection factors and macromolecular signal, IEEE Transactions on Biomedical Engineering 71 (6) (2024) 1841–1852. doi:10.1109 / TBME.2024.3354123. [50] X. Chen, J. Li, D. Chen, Y . Zhou, Z. T u, M. Lin, T . Kang, J. Lin, T . Gong, L. Zhu, J. Zhou, O. yang Lin, J. Guo, J. Dong, D. Guo, X. Qu, CloudBrain-MRS: An intelligent cloud computing platform for in vi vo magnetic resonance spectroscopy preprocessing, quantification, and analysis, Journal of Magnetic Resonance 358 (2024) 107601. doi:10.1016 / j.jmr .2023.107601. [51] D. 
Das, E. Coello, R. F . Schulte, B. H. Menze, Quantification of metabolites in magnetic resonance spectroscopic imaging using machine learning, in: Medical Image Computing and Computer Assisted Intervention (MICCAI), 2017, pp. 462–470. doi:10.1007 / 978-3-319-66179-7_53. [52] M. Dziadosz, R. Rizzo, S. P . Kyathanahally , R. Kreis, Denoising single MR spectra by deep learning: Miracle or mirage?, Magnetic Resonance in Medicine 90 (5) (2023) 1749–1761. doi:10.1002 / mrm.29762. [53] N. Hatami, M. Sdika, H. Ratiney , Magnetic resonance spectroscopy quantification using deep learning (2018). URL [54] H. H. Lee, H. Kim, Intact metabolite spectrum mining by deep learning in proton magnetic resonance spec- troscopy of the brain, Magnetic Resonance in Medicine 82 (1) (2019) 33–48. doi:10.1002 / mrm.27727. [55] H. H. Lee, H. Kim, Bayesian deep learning–based 1H-MRS of the brain: Metabolite quantification with uncertainty estimation using Monte Carlo dropout, Magnetic Resonance in Medicine 88 (1) (2022) 38–52. doi:10.1002 / mrm.29214. [56] A. Shamaei, J. Starcukov a, Z. Starcuk, Physics-informed deep learning approach to quantification of human brain metabolites from magnetic resonance spectroscopy data, Computers in Biology and Medicine 158 (2023) 106837. doi:10.1016 / j.compbiomed.2023.106837. [57] D. M. J. van de Sande, J. P . Merkofer , S. Amirrajab, M. V eta, R. J. G. van Sloun, M. J. V ersluis, J. F . A. Jansen, J. S. van den Brink, M. Breeuwer, A revie w of machine learning applications for the proton MR spectroscopy workflo w , Magnetic Resonance in Medicine 90 (4) (2023) 1253–1270. doi:10.1002 / mrm.29793. [58] A. Shamaei, J. Starcuko vá, Z. S. Jr ., A wavelet scattering con volutional network for magnetic resonance spectroscopy signal quantitation, in: Proceedings of the 14th International Joint Conference on Biomedi- cal Engineering Systems and T echnologies, BIOSTEC 2021, V olume 4: Biosignals, 2021, pp. 268–275. doi:10.5220 / 0010318502680275. [59] C. J. W u, L. 
S. K egeles, J. Guo, Q-MRS: a deep learning frame work for quantitati ve magnetic resonance spectra analysis (2024). URL 34 [60] Y .-L. Huang, Y .-R. Lin, S.-Y . Tsai, Comparison of con volutional-neural-networks-based method and LCModel on the quantification of in viv o magnetic resonance spectroscopy , Magnetic Resonance Materials in Physics, Biology and Medicine 37 (3) (2024) 477–489. doi:10.1007 / s10334-023-01120-z. [61] A. Shamaei, E. Niess, L. Hingerl, B. Strasser, A. Osb urg, K. Eckstein, W . Bogner, S. Motyka, PHIVE: a physics- informed variational encoder enables rapid spectral fitting of brain metabolite mapping at 7T, medRxiv (2025). doi:10.1101 / 2025.01.02.25319930. [62] Y . Zhang, J. Shen, Quantification of spatially localized MRS by a novel deep learning ap- proach without spectral fitting, Magnetic Resonance in Medicine 90 (4) (2023) 1282–1296. arXiv:https: // onlinelibrary .wiley .com / doi / pdf / 10.1002 / mrm.29711, doi:10.1002 / mrm.29711. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/mrm.29711 [63] S. S. Gurbani, S. Sheri ff , A. A. Maudsley , H. Shim, L. A. Cooper , Incorporation of a spectral model in a con volutional neural network for accelerated spectral fitting, Magnetic Resonance in Medicine 81 (5) (2019) 3346–3357. doi:10.1002 / mrm.27641. [64] W . W ang, L.-H. Ma, M. Maletic-Sav atic, Z. Liu, NMRQNet: a deep learning approach for automatic identifi- cation and quantification of metabolites using nuclear magnetic resonance (NMR) in human plasma samples, bioRxiv (2023). doi:10.1101 / 2023.03.01.530642. [65] V . Govindaraju, K. Y oung, A. A. Maudsley , Proton NMR chemical shifts and coupling constants for brain metabolites, NMR in Biomedicine 13 (3) (2000) 129–153. doi:10.1002 / 1099-1492(200005)13:3 < 129::AID- NBM619 > 3.0.CO;2-V . [66] L. G. Kaiser , K. Y oung, D. J. Meyerho ff , S. G. Mueller , G. B. 
Matson, A detailed analysis of localized J- di ff erence GAB A editing: theoretical and experimental study at 4T, NMR in Biomedicine 21 (1) (2008) 22–32. doi:10.1002 / nbm.1150. [67] J. Near, C. J. Evans, N. A. J. Puts, P . B. Barker , R. A. E. Edden, J-di ff erence editing of gamma-aminobutyric acid (GABA): Simulated and experimental multiplet patterns, Magnetic Resonance in Medicine 70 (5) (2013) 1183–1191. doi:10.1002 / mrm.24572. [68] Z. Ma, M. Chandler , S. M. Shermer , F . C. Langbein, MRSNet, software, V ersion 2.1 (2026). doi:10.5281 / zenodo.18520504. URL https://qyber.black/mrs/code- mrsnet [69] Z. Ma, S. M. Shermer , F . C. Langbein, MRSNet Sim2Real phantom–simulation comparison, data, V ersion 2.1 (2026). URL https://qyber.black/mrs/results- mrsnet- sim2real [70] E. J. Auerbach, M. Marja ´ nska, CMRR spectroscopy package (2017). URL https://www.cmrr.umn.edu/spectro/ [71] V . Jain, W . Collins, D. Davis, High-accuracy analog measurements via interpolated FFT, IEEE Transactions on Instrumentation and Measurement 28 (1) (1979) 113–122. doi:10.1109 / TIM.1979.4314779. [72] A. Harris, N. Puts, S. A. Wijtenb urg, L. M. Rowland, M. Mikkelsen, P . B. Barker , C. J. Evans, R. A. E. Edden, Normalizing data from GAB A-edited MEGA-PRESS implementations at 3 T esla, Magnetic Resonance Imaging 4 (42) (2017) 8–15. doi:10.1016 / j.mri.2017.04.013. [73] S. W . Pro vencher , LCModel & LCMgui User’ s Manual, LCModel, version 6.3-1R; Section 9.4 (MEGA-PRESS for GAB A), Section 11.3.1 (O ff -Resonance Spectra). A vailable at ht tp: / /s- p ro v en ch er .c om /l cm- ma nu a l.shtml (2 2021). 35 [74] F . Lam, Y . Li, X. Peng, Constrained magnetic resonance spectroscopic imaging by learning non- linear lo w-dimensional models, IEEE T ransactions on Medical Imaging 39 (3) (2020) 545–555. doi:10.1109 / TMI.2019.2930586. [75] Y . Lei, B. Ji, T . Liu, W . J. Curran, H. Mao, X. 
Y ang, Deep learning-based denoising for magnetic resonance spec- troscopy signals, in: Medical Imaging 2021: Biomedical Applications in Molecular, Structural, and Functional Imaging, V ol. 11600, SPIE, 2021, pp. 16–21. doi:10.1117 / 12.2580988. [76] P . Goyal, P . Dollár, R. Girshick, P . Noordhuis, L. W esolowski, A. Kyrola, A. T ulloch, Y . Jia, K. He, Accurate, large minibatch SGD: T raining ImageNet in 1 hour (2018). URL [77] J. Snoek, H. Larochelle, R. P . Adams, Practical Bayesian optimization of machine learning algorithms, in: Advances in Neural Information Processing Systems, V ol. 25, Curran Associates, Inc., 2012, pp. 2951–2959. [78] The GPyOpt authors, GPyOpt: A Bayesian optimization framework in python, ht tp: // gi t h u b. co m/ S he ff ieldML/GPyOpt (2016). [79] Z. Ma, M. Chandler, F . C. Langbein, MRSNet-CNN model selection data, data, V ersion 2.1 (2026). URL https://qyber.black/mrs/results- mrsnet- models- cnn [80] Z. Ma, F . C. Langbein, MRSNet Y AE model selection data, data, V ersion 2.1 (2026). URL https://qyber.black/mrs/results- mrsnet- models- yae [81] F . C. Langbein, QDicom utilities, h t t p s : / / q y b e r . b l a c k / c a / c o d e- q d i c o m- u t i l i t i es , siemens DI- COM / IMA reading; included in MRSNet at tag v2.1 (2026). 
Graphical Abstract

[Figure] Simulated and experimental phantom MEGA-PRESS DIFF spectra (2.2–3.2 ppm, real part) and phantom MAE for four methods: (a) simulated MEGA-PRESS spectra (OFF, ON, DIFF); (b) solution phantom (E3) DIFF spectrum, measured vs. clean simulation; (c) gel phantom (E4a) DIFF spectrum, measured vs. clean simulation; (d) phantom mean absolute error per metabolite (Cr, GABA, Gln, Glu) for MRSNet-YAE, MRSNet-CNN, FCNN and LCModel-OFF, non-augmented vs. linewidth-augmented. Non-augmented models show a sim-to-real gap and are comparable to LCModel; linewidth-augmented deep learning models outperform LCModel for GABA and Glu, though a sim-to-real gap persists.

Highlights
• Systematic Bayesian model selection of deep learning models on realistic simulations
• Phantom ground-truth validation across solution and gel series at 3 T
• Non-augmented models show a sim-to-real gap and are comparable to LCModel
• Linewidth-augmented DL outperforms LCModel for GABA and Glu on phantoms; gap remains