BoTTA: Benchmarking on-device Test Time Adaptation
The performance of deep learning models depends heavily on test samples at runtime, and shifts from the training data distribution can significantly reduce accuracy. Test-time adaptation (TTA) addresses this by adapting models during inference without requiring labeled test data or access to the original training set. While research has explored TTA from various perspectives like algorithmic complexity, data and class distribution shifts, model architectures, and offline versus continuous learning, constraints specific to mobile and edge devices remain underexplored. We propose BoTTA, a benchmark designed to evaluate TTA methods under practical constraints on mobile and edge devices. Our evaluation targets four key challenges caused by limited resources and usage conditions: (i) limited test samples, (ii) limited exposure to categories, (iii) diverse distribution shifts, and (iv) overlapping shifts within a sample. We assess state-of-the-art TTA methods under these scenarios using benchmark datasets and report system-level metrics on a real testbed. Furthermore, unlike prior work, we align with on-device requirements by advocating periodic adaptation instead of continuous inference-time adaptation. Experiments reveal key insights: many recent TTA algorithms struggle with small datasets, fail to generalize to unseen categories, and depend on the diversity and complexity of distribution shifts. BoTTA also reports device-specific resource use. For example, while SHOT improves accuracy by $2.25\times$ with $512$ adaptation samples, it uses $1.08\times$ peak memory on Raspberry Pi versus the base model. BoTTA offers actionable guidance for TTA in real-world, resource-constrained deployments.
💡 Research Summary
The paper introduces BoTTA, a benchmark specifically designed to evaluate test‑time adaptation (TTA) methods under the practical constraints of mobile and edge devices. While prior TTA research has explored algorithmic complexity, domain shifts, and model architectures, it has largely ignored the resource‑limited realities of on‑device deployment. BoTTA addresses this gap by defining four realistic challenge scenarios: (i) a limited number of adaptation samples, (ii) incomplete class coverage during adaptation, (iii) a wide variety of distribution shifts (different corruptions and severities), and (iv) overlapping shifts within a single sample.
To assess state‑of‑the‑art TTA techniques, the authors conduct extensive experiments on two widely used benchmark datasets, CIFAR‑10C (with 19 corruption types) and PACS (with 4 domain styles spanning 7 object classes). They evaluate three representative model families (ResNet‑26, ResNet‑50, and Vision Transformer) and apply a suite of recent TTA algorithms, including entropy minimization, SHOT (source hypothesis transfer), SAR, SoTTA, T3A, and OFTTA. Importantly, BoTTA departs from the conventional continuous‑adaptation protocol, in which the model updates after every inference, and instead adopts a periodic‑adaptation regime that better reflects the sporadic adaptation opportunities on real devices.
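The entropy‑minimization family referenced above adapts a model by reducing the Shannon entropy of its own softmax predictions on unlabeled test data, updating only a small set of parameters (typically batch‑norm affines). A minimal, framework‑free sketch of that objective, with illustrative function names not taken from BoTTA:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def prediction_entropy(logits):
    """Shannon entropy of the softmax distribution: the quantity that
    entropy-minimization TTA methods drive down via gradient steps on
    a few adaptable parameters (e.g., batch-norm scale and shift)."""
    p = softmax(logits)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A confident prediction has near-zero entropy; a maximally uncertain
# one over k classes has entropy log(k).
confident = prediction_entropy([8.0, 0.1, 0.2])
uncertain = prediction_entropy([1.0, 1.0, 1.0])   # log(3) for 3 classes
```

In a full implementation this entropy would be averaged over a test batch and backpropagated, but the scalar objective itself is exactly this quantity.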
Key findings emerge from both algorithmic and system‑level analyses. First, when the adaptation set contains fewer than 100 samples, most TTA methods fail to deliver measurable accuracy gains; SHOT is the only method that shows modest improvement with as few as 200 samples, and even it degrades sharply below that. Second, limited class exposure (e.g., only 50% of target classes present during adaptation) severely harms pseudo‑label‑based approaches such as SHOT and SoTTA, because label noise propagates and the model drifts toward incorrect decision boundaries. Third, methods that rely heavily on class prototypes (T3A, OFTTA) are highly sensitive to the type and severity of corruption: they excel on mild shifts but collapse under strong Gaussian noise or severe blur, whereas entropy‑based methods and SHOT remain comparatively robust across a broader range of shifts. Fourth, overlapping corruptions (e.g., blur and noise in the same image) expose a weakness of continuous adaptation: memory consumption and latency increase dramatically, while periodic adaptation mitigates these costs and preserves most of the accuracy benefit.
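The periodic‑adaptation regime contrasted with continuous adaptation above can be sketched as a simple buffering loop: rather than updating the model after every inference, unlabeled test samples accumulate until a budget is reached, and one adaptation pass runs per full buffer. This is an illustrative skeleton, not BoTTA's actual API:

```python
class PeriodicAdapter:
    """Buffer test samples and trigger one adaptation pass per period,
    instead of updating after every inference as continuous TTA does.
    `adapt_fn` stands in for any TTA update (e.g., an entropy or
    pseudo-label step); names here are hypothetical."""

    def __init__(self, adapt_fn, period=64):
        self.adapt_fn = adapt_fn      # callable taking a batch of samples
        self.period = period          # adaptation budget per pass
        self.buffer = []
        self.adaptations = 0          # how many passes have run

    def observe(self, sample):
        """Called once per inference; adapts only when the buffer fills."""
        self.buffer.append(sample)
        if len(self.buffer) >= self.period:
            self.adapt_fn(self.buffer)
            self.adaptations += 1
            self.buffer.clear()

adapter = PeriodicAdapter(adapt_fn=lambda batch: None, period=64)
for i in range(300):                  # 300 inferences at test time
    adapter.observe(i)
# 300 inferences trigger only 300 // 64 = 4 adaptation passes,
# versus 300 updates under a continuous-adaptation protocol.
```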
System‑level measurements on two real edge platforms, a Raspberry Pi 4B and an NVIDIA Jetson Orin Nano, provide concrete resource footprints. For example, SHOT with 512 adaptation samples improves top‑1 accuracy by 2.25× on CIFAR‑10C but incurs a 1.08× increase in peak memory on the Raspberry Pi and a 23% rise in CPU utilization. On the Jetson Orin Nano, the same configuration leads to a 1.12× memory increase and an 18% CPU load rise. Simpler entropy‑minimization methods consume far less memory and compute but deliver negligible gains when data are scarce.
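Peak‑memory figures like those above are gathered at the OS level on the devices themselves; as a loose, Python‑only analogue, the standard‑library `tracemalloc` module can capture peak heap allocations around a workload. This sketch only sees allocations made through the Python allocator (not native tensor buffers), so it approximates the measurement in spirit rather than reproducing BoTTA's methodology:

```python
import tracemalloc

def peak_python_memory(fn, *args):
    """Run fn and return the peak Python-heap allocation (in bytes)
    observed while it executed. Captures only Python-level allocations;
    native buffers (e.g., framework tensors) require OS-level tools."""
    tracemalloc.start()
    fn(*args)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# A million-element list allocates several megabytes transiently.
peak = peak_python_memory(lambda: [0] * 1_000_000)
```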
From these observations, the authors derive several design recommendations for on‑device TTA. (1) Algorithms should incorporate mechanisms that remain stable with very few samples, such as meta‑learning or Bayesian updates that can exploit prior uncertainty. (2) Robust pseudo‑label filtering (e.g., confidence‑thresholding or ensemble voting) is essential to handle incomplete class coverage. (3) Data‑augmentation strategies that pre‑expose the model to a diverse set of corruptions can improve resilience to unseen shifts. (4) Periodic adaptation schedules combined with lightweight parameter updates (e.g., adapting only batch‑norm statistics) strike a favorable balance between performance and resource consumption.
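Recommendation (2), confidence‑threshold pseudo‑label filtering, amounts to keeping only the samples whose maximum softmax probability clears a cutoff, so that noisy pseudo‑labels from under‑represented classes are discarded before they can drift the model. A minimal sketch, with an illustrative threshold value:

```python
def filter_pseudo_labels(probs, threshold=0.9):
    """Confidence-threshold pseudo-label filtering: keep (sample index,
    pseudo-label) pairs only where the max class probability clears the
    threshold. The 0.9 cutoff is illustrative, not from the paper."""
    kept = []
    for i, p in enumerate(probs):
        conf = max(p)
        if conf >= threshold:
            kept.append((i, p.index(conf)))   # pseudo-label = argmax class
    return kept

batch = [
    [0.95, 0.03, 0.02],   # confident  -> kept with pseudo-label 0
    [0.40, 0.35, 0.25],   # ambiguous  -> dropped
    [0.05, 0.92, 0.03],   # confident  -> kept with pseudo-label 1
]
selected = filter_pseudo_labels(batch)
```

Ensemble voting, the other filter mentioned above, would replace the single `max(p)` check with agreement across several augmented views or model snapshots.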
In summary, BoTTA offers the first comprehensive benchmark that aligns TTA evaluation with the constraints of real‑world mobile and edge deployments. By quantifying both accuracy and system‑level metrics across realistic scenarios, it provides actionable guidance for researchers and engineers aiming to bring adaptive deep learning models to resource‑constrained devices.