Body Fat, Skin Tone, and the Accuracy of Smartwatch Caloric Expenditure Estimates


Smartwatches are widely used to estimate caloric expenditure for weight management, clinical decision making, and public health monitoring. These devices combine photoplethysmography, accelerometry, and proprietary algorithms. However, prior studies report substantial error, and the influence of moderators such as skin tone and body fat percentage (BF) remains underexamined. This study tested whether smartwatch brand, BF, and Fitzpatrick skin type (III to V) predict caloric expenditure error relative to indirect calorimetry. Fifty-eight Hispanic adults completed a single laboratory visit including a ten-minute recumbent cycling protocol with alternating two-minute moderate- and vigorous-intensity intervals, bracketed by rest and recovery. Participants wore four consumer devices: Apple Watch Series 8, Fitbit Sense 2, Samsung Galaxy Watch 5, and Garmin Forerunner 955. Energy expenditure was measured using a COSMED K5 metabolic system. After device-specific data quality filtering, valid participant-device pairings ranged from 44 to 52 per brand. One-sample tests showed significant mean bias for three devices: Apple, Garmin, and Samsung. Fitbit showed no significant overall bias, although this depended on device-specific outlier removal. Mean bias varied by brand, with Garmin and Samsung showing the largest overestimations. Mixed-effects models revealed significant effects of device and BF, as well as a device-by-BF interaction, with physical activity energy expenditure error increasing as adiposity increased. Overall, common smartwatches substantially misestimate caloric expenditure compared with indirect calorimetry. Error varies by brand and worsens with higher body fat, highlighting limitations of current consumer wearables and the need for improved accuracy across diverse body types.


💡 Research Summary

This study evaluated the accuracy of physical activity energy expenditure (PAEE) estimates from four commercially available smartwatches—Apple Watch Series 8, Fitbit Sense 2, Samsung Galaxy Watch 5, and Garmin Forerunner 955—against the gold‑standard indirect calorimetry provided by a COSMED K5 portable metabolic system. Fifty‑eight Hispanic adults (31 females, 27 males, mean age 23 ± 5.9 years) participated in a single laboratory session that included a 10‑minute recumbent cycling protocol with alternating two‑minute moderate (64‑76 % HRmax) and vigorous (77‑95 % HRmax) intensity intervals, bracketed by 5‑minute rest periods. Participants were stratified by body mass index (BMI) and Fitzpatrick skin type (III‑V) to ensure representation across adiposity and pigmentation levels. Body fat percentage (BF %) was measured using skinfold calipers and bioelectrical impedance; skin tone was recorded via self‑report Fitzpatrick classification, a colorimeter, and an experimental spatial frequency domain spectroscopy system (the latter two were excluded from analysis due to reliability issues).

Data cleaning removed six participants because of K5 equipment failures and excluded device‑specific implausible values (e.g., zero calories or estimates > 450 % of the K5 value). The final analytic sample comprised 52 Apple, 51 Garmin, 50 Samsung, and 44 Fitbit valid participant‑device pairings. Three error metrics were computed: bias (device − K5, signed kcal), absolute error (AE, |bias|), and absolute percentage error (APE = 100·AE/K5).
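The three error metrics and the exclusion rule described above can be sketched in a few lines of Python. This is only an illustration of the stated definitions: the function names and the example values (120 kcal from a device, 100 kcal from the K5) are hypothetical and do not come from the study data, and the study's actual cleaning pipeline may differ in detail.

```python
def error_metrics(device_kcal, k5_kcal):
    """Return (bias, AE, APE) for one participant-device pairing."""
    bias = device_kcal - k5_kcal   # signed error in kcal (device minus K5)
    ae = abs(bias)                 # absolute error in kcal
    ape = 100.0 * ae / k5_kcal     # absolute percentage error
    return bias, ae, ape


def is_plausible(device_kcal, k5_kcal, max_ratio=4.5):
    """Assumed form of the exclusion rule: drop zero-calorie readings
    and estimates above 450 % of the K5 value."""
    return device_kcal > 0 and device_kcal <= max_ratio * k5_kcal


# Hypothetical pairing: the device reports 120 kcal, the K5 measures 100 kcal.
bias, ae, ape = error_metrics(120.0, 100.0)
print(bias, ae, ape)
```

Because bias keeps its sign, over- and underestimates can cancel in the mean, which is why AE and APE are reported alongside it.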

Descriptive results showed systematic overestimation for three devices. Mean bias (± SD) was 21.6 ± 36.8 kcal for Apple, 68.6 ± 55.9 kcal for Garmin, 56.8 ± 42.0 kcal for Samsung, and 3.1 ± 41.0 kcal for Fitbit (the last after removal of seven extreme outliers). One-sample t-tests indicated that the Apple, Garmin, and Samsung biases differed significantly from zero (p < .05). Fitbit's bias was nonsignificant only because those outliers were excluded; including them would raise Fitbit's mean bias above 90 kcal. AE averaged roughly 80–110 kcal across devices, and median APE ranged from 15 % to 25 %, underscoring substantial relative error even in a controlled laboratory setting.
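As a rough plausibility check, the one-sample t statistics can be recomputed from the summary values quoted above. The sketch below assumes that each reported SD is a sample standard deviation and that n equals the number of valid pairings per device (44 for Fitbit, taken here to be the post-outlier count); neither assumption is confirmed by the text. The threshold of 2.02 is a conservative two-sided 5 % critical value for the roughly 43 to 51 degrees of freedom involved.

```python
import math

# Reported summary statistics: (mean bias in kcal, SD in kcal, n valid pairings).
REPORTED = {
    "Apple": (21.6, 36.8, 52),
    "Garmin": (68.6, 55.9, 51),
    "Samsung": (56.8, 42.0, 50),
    "Fitbit": (3.1, 41.0, 44),
}

T_CRIT = 2.02  # conservative two-sided 5 % critical value for df of about 43-51


def one_sample_t(mean, sd, n):
    """t statistic for H0: mean bias = 0, from summary statistics."""
    return mean / (sd / math.sqrt(n))


t_stats = {device: one_sample_t(*stats) for device, stats in REPORTED.items()}

for device, t in t_stats.items():
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{device}: t = {t:.2f} ({verdict})")
```

Under these assumptions the Apple, Garmin, and Samsung statistics fall well above the threshold and the Fitbit statistic falls well below it, matching the significance pattern described in the text.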

To examine predictors of error, linear mixed‑effects models were fitted separately for bias, log‑transformed AE, and log‑transformed APE (log(APE + 1)). Fixed effects included device (four levels), standardized BF % (z‑scored), Fitzpatrick skin type (categorical), and all pairwise interactions; participant was entered as a random intercept. Type III Satterthwaite tests revealed a highly significant device main effect (p < .001), a significant BF % main effect (p < .01), and a device × BF % interaction (p = .02). The interaction indicated that error increased with higher body fat for all devices, but the slope was steepest for Garmin and Samsung, which already showed the largest absolute biases. Fitzpatrick skin type did not reach significance either as a main effect or in interaction terms, suggesting that within the III‑V range examined, pigmentation had limited impact on PPG‑based PAEE estimation. Model diagnostics showed homoscedastic residuals but non‑normality; robust clustered standard errors and influence diagnostics (Cook’s distance, leave‑one‑out refits) confirmed that the significance pattern was stable.
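One way to write the model described above is shown below. The notation is an assumption made for illustration, not the authors' own: y_ij is the error outcome (bias, log AE, or log(APE + 1)) for participant i on device j, z_i is the z-scored BF %, s(i) is the participant's Fitzpatrick category, and u_i is the participant-level random intercept.

```latex
\begin{align*}
y_{ij} &= \beta_0 + \alpha_j + \gamma\, z_i + \delta_{s(i)}
          + (\alpha\gamma)_j\, z_i + (\alpha\delta)_{j\,s(i)} + (\gamma\delta)_{s(i)}\, z_i
          + u_i + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \\
z_i &= \frac{\mathrm{BF}_i - \overline{\mathrm{BF}}}{s_{\mathrm{BF}}}
\end{align*}
```

In this notation the device main effect corresponds to the α_j terms, the BF % main effect to γ, and the device × BF % interaction to the (αγ)_j terms, which let the BF % slope differ by device. The random intercept u_i accounts for the same participant wearing all four devices.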

The findings have several practical implications. First, consumer‑grade smartwatches systematically misestimate PAEE, with average over‑estimates of 20–70 kcal in a 10‑minute bout—errors that could accumulate to several hundred kilocalories over a day, potentially misleading weight‑management efforts or clinical energy‑balance calculations. Second, the magnitude of error is not uniform across brands; Garmin and Samsung exhibited the greatest over‑estimation, while Apple performed comparatively better, and Fitbit’s performance was highly sensitive to data‑cleaning decisions. Third, adiposity emerged as a robust moderator: higher BF % consistently amplified error across all devices, highlighting a bias against individuals with higher body fat—a demographic that already bears disproportionate obesity‑related health risks. Fourth, skin tone (within Fitzpatrick III‑V) did not significantly affect error, though the limited range and sample size may have constrained detection of subtle effects.

The study’s strengths include a well‑controlled exercise protocol, use of a gold‑standard metabolic reference, and rigorous statistical handling of non‑normal residuals and influential observations. Limitations involve the homogeneous ethnic sample (all Hispanic), a restricted skin‑tone spectrum, and a single exercise modality (recumbent cycling) that may not generalize to free‑living or high‑impact activities. Moreover, proprietary algorithms of the devices were not disclosed, preventing direct attribution of error sources to sensor hardware versus software processing.

In conclusion, current commercial smartwatches provide imprecise PAEE estimates, with error magnitude increasing as body fat rises and varying markedly across brands. These inaccuracies limit the utility of smartwatch‑derived caloric data for personal health monitoring, clinical decision‑making, and large‑scale epidemiological research. Future work should expand demographic diversity, test a broader array of activities, and collaborate with manufacturers to develop and validate adiposity‑adjusted algorithms that deliver equitable and reliable energy‑expenditure monitoring for all users.

