Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, and fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of “well-behavedness” of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird’s-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal several new insights, including that (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
💡 Research Summary
The paper tackles a fundamental gap in computer‑vision research: while deep neural networks (DNNs) have achieved remarkable accuracy on image classification benchmarks, they often underperform on other crucial quality dimensions such as robustness, calibration, fairness, and computational efficiency. Existing work typically studies a subset of these aspects in isolation, leaving it unclear how improvements in one dimension affect the others. To address this, the authors define a comprehensive “well‑behavedness” framework consisting of nine quality metrics:

1. top‑1 accuracy,
2. adversarial robustness (FGSM + PGD, normalized by clean accuracy),
3. corruption robustness (ImageNet‑C, normalized),
4. out‑of‑domain (OOD) robustness (geometric mean over five OOD datasets),
5. calibration error (geometric mean of Expected Calibration Error and Adaptive Calibration Error),
6. class balance (inverse of the standard deviation of per‑class accuracies and confidences),
7. object focus (accuracy drop when backgrounds are swapped),
8. shape bias (preference for shape over texture on conflict images), and
9. parameter count as a hardware‑agnostic proxy for computational cost.

Each metric is carefully operationalized with standardized protocols that require only the pretrained backbone’s ImageNet‑1k predictions, avoiding costly fine‑tuning.
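To make two of the less familiar metrics concrete, here is a minimal sketch of how a calibration error and a class-balance score might be computed from a model's ImageNet predictions. This is an illustration under simplifying assumptions, not the paper's exact protocol: the binning scheme, the `1 / (1 + std)` class-balance formula, and the function names are hypothetical, and the paper additionally combines ECE with Adaptive Calibration Error via a geometric mean.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard ECE sketch: bin predictions by confidence, then take the
    bin-size-weighted average of |accuracy - mean confidence| per bin."""
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Fraction of samples in this bin times the accuracy/confidence gap.
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def class_balance(per_class_accuracies):
    """Hypothetical class-balance score: an inverse-std formulation that is
    highest (1.0) when all classes are classified equally well."""
    return 1.0 / (1.0 + np.std(per_class_accuracies))
```

For example, a model whose per-class accuracies are identical would score a perfect `class_balance` of 1.0, while large disparities across classes push the score toward 0.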
The authors assemble a zoo of 326 backbone models spanning four architecture families—CNNs, Vision Transformers, B‑cos models, and Vision‑Language (ViL) models—and five training paradigms: standard supervised learning, self‑supervised pre‑training followed by fine‑tuning, semi‑supervised learning, adversarial training, and the A