Feature-Aware Test Generation for Deep Learning Models
As deep learning models are widely used in software systems, test generation plays a crucial role in assessing the quality of such models before deployment. To date, the most advanced test generators rely on generative AI to synthesize inputs; however, these approaches remain limited in providing semantic insight into the causes of misbehaviours and in offering fine-grained semantic controllability over the generated inputs. In this paper, we introduce Detect, a feature-aware test generation framework for vision-based deep learning (DL) models that systematically generates inputs by perturbing disentangled semantic attributes within the latent space. Detect perturbs individual latent features in a controlled way and observes how these changes affect the model’s output. Through this process, it identifies which features lead to behavior shifts and uses a vision-language model for semantic attribution. By distinguishing between task-relevant and irrelevant features, Detect applies feature-aware perturbations targeted at both generalization and robustness. Empirical results across image classification and detection tasks show that Detect generates high-quality test cases with fine-grained control, reveals distinct shortcut behaviors across model architectures (convolutional and transformer-based), and exposes bugs that are not captured by accuracy metrics. Specifically, Detect outperforms a state-of-the-art test generator in decision boundary discovery and a leading spurious feature localization method in identifying robustness failures. Our findings show that fully fine-tuned convolutional models are prone to overfitting on localized cues, such as co-occurring visual traits, while weakly supervised transformers tend to rely on global features, such as environmental variances. These findings highlight the value of interpretable and feature-aware testing in improving DL model reliability.
💡 Research Summary
The paper introduces Detect, a feature‑aware test generation framework for vision‑based deep learning models that leverages the disentangled latent space of a style‑based generative network (StyleGAN) to produce semantically controlled perturbations. Detect operates in three domains—latent S‑space, image space, and model output space—and follows a systematic pipeline: (1) random latent seeds are mapped through StyleGAN to generate images and corresponding style vectors; (2) sensitivity of each style channel to the target class logit is estimated using XAI techniques (gradient saliency, SmoothGrad, Integrated Gradients), dramatically reducing the search space; (3) an oracle‑aware perturbation loop runs twice. In the first loop, all sensitive channels are perturbed in the direction that reduces model confidence, and the resulting influential features are identified. A vision‑language model (VLM) then classifies each influential feature as task‑relevant or task‑irrelevant. Irrelevant features that cause a logit shift beyond a predefined threshold τ are labeled spurious, revealing robustness failures. In the second loop, task‑relevant features are perturbed using a misclassification oracle to explore decision boundaries, with hill‑climbing optimization until boundary tests are obtained. All generated inputs are relabeled by the VLM, yielding a test suite that captures both relevant variations (for generalization assessment) and spurious variations (for robustness assessment).
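The sensitivity screening and oracle-aware perturbation loop above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `target_logit` is a toy linear stand-in for "decode the style vector through StyleGAN and read the model's target-class logit", a finite-difference proxy stands in for the gradient-based XAI scoring (saliency, SmoothGrad, Integrated Gradients) used by Detect, and the step size, channel count, and threshold are arbitrary.

```python
import random

def target_logit(style):
    # Toy stand-in (assumption) for: render image from style vector with
    # StyleGAN, run the model under test, return the target-class logit.
    weights = [0.9, 0.0, 0.4, 0.0, 0.05]  # only some channels matter
    return sum(w * s for w, s in zip(weights, style))

def channel_sensitivity(style, eps=1e-3):
    """Finite-difference sensitivity of the target logit to each style
    channel -- a cheap proxy for the gradient-based screening step."""
    base = target_logit(style)
    sens = []
    for i in range(len(style)):
        bumped = list(style)
        bumped[i] += eps
        sens.append(abs(target_logit(bumped) - base) / eps)
    return sens

def perturb_sensitive_channels(style, top_k=2, step=0.5, max_iters=20, tau=0.3):
    """Hill-climb the top-k most sensitive channels in the
    confidence-reducing direction until the logit drops by more than tau."""
    sens = channel_sensitivity(style)
    ranked = sorted(range(len(style)), key=lambda i: -sens[i])[:top_k]
    current = list(style)
    start = target_logit(current)
    for _ in range(max_iters):
        for i in ranked:
            for direction in (-step, step):
                cand = list(current)
                cand[i] += direction
                if target_logit(cand) < target_logit(current):
                    current = cand
        if start - target_logit(current) > tau:
            break
    return current, ranked

random.seed(0)
seed_style = [random.uniform(-1, 1) for _ in range(5)]
perturbed, influential = perturb_sensitive_channels(seed_style)
print(influential)  # channels flagged as influential
```

Because the screening keeps only the high-sensitivity channels, the hill-climbing loop searches a much smaller space, which is the role the XAI step plays in the real pipeline.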
Detect’s novelty lies in (i) exploiting the StyleSpace of StyleGAN, where each channel controls a distinct visual attribute (e.g., glasses, eyebrows, skin tone), enabling fine‑grained, single‑feature manipulation; (ii) automatically attributing semantic relevance via a pre‑trained VLM, thus avoiding manual labeling; (iii) defining two complementary test oracles—confidence invariance for irrelevant features and misclassification for relevant features—grounded in logit changes rather than binary correctness; and (iv) integrating XAI‑based sensitivity screening to focus perturbations on the most influential latent dimensions.
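The two complementary oracles in (iii) can be expressed as simple predicates over logits. Function names and the logit-based formulation below are illustrative assumptions, not the paper's actual code.

```python
def spurious_oracle(logit_before, logit_after, tau):
    """Confidence-invariance oracle for task-IRRELEVANT features:
    perturbing an irrelevant feature should leave the target-class logit
    roughly unchanged; a shift beyond tau flags the feature as spurious
    (a robustness failure)."""
    return abs(logit_after - logit_before) > tau

def misclassification_oracle(logits, true_label):
    """Decision-boundary oracle for task-RELEVANT features: a boundary
    test is found once the model no longer predicts the true label."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return predicted != true_label

print(spurious_oracle(4.2, 2.9, tau=1.0))                        # True
print(misclassification_oracle([0.2, 1.7, 0.4], true_label=0))   # True
```

Grounding both oracles in logit changes rather than binary correctness lets Detect measure how close an input is to misbehaviour, not just whether it misbehaves.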
Empirical evaluation spans image classification (glasses detection) and object detection (COCO‑style) tasks, covering convolutional networks (ResNet‑50, EfficientNet) and transformer‑based models (ViT). Compared with state‑of‑the‑art test generators (e.g., DeepXplore, DLFuzz, TensorFuzz) and leading spurious feature localization methods, Detect achieves a 12 % absolute improvement in decision‑boundary discovery and an 18 % boost in spurious‑feature detection accuracy. The analysis uncovers distinct shortcut behaviors: fully fine‑tuned CNNs overfit to localized cues such as background color or lighting, whereas weakly supervised transformers rely on global cues like overall illumination.
Beyond detection, Detect proposes a repair loop: identified spurious features are used to augment training data with diverse, controlled variations, leading to a 2.3 % increase in overall accuracy and a 35 % reduction in spurious‑feature influence after retraining.
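The repair loop can be sketched as targeted data augmentation over the identified spurious channels. The dataset format (style vector, label pairs), the sampling range, and the function name are illustrative assumptions.

```python
import random

def augment_with_spurious_variations(dataset, spurious_channels,
                                     n_variants=3, scale=1.0, seed=0):
    """For each (style_vector, label) pair, emit extra training examples
    in which only the spurious channels are resampled, keeping the label
    fixed -- so retraining teaches the model to ignore those channels."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for style, label in dataset:
        for _ in range(n_variants):
            variant = list(style)
            for c in spurious_channels:
                variant[c] += rng.uniform(-scale, scale)
            augmented.append((variant, label))
    return augmented

toy_data = [([0.1, 0.5, -0.2], 1), ([0.3, -0.4, 0.8], 0)]
augmented = augment_with_spurious_variations(toy_data, spurious_channels=[2])
print(len(augmented))  # 2 originals + 2 * 3 variants = 8
```

Holding the label fixed while varying only the spurious channel is what breaks the shortcut: after retraining, the spurious feature no longer co-varies with the label.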
In summary, Detect unifies test generation and spurious‑feature analysis into a single methodology that provides interpretable, semantically grounded test cases, offers fine‑grained control over generated inputs, and empirically demonstrates superior ability to expose both generalization gaps and robustness failures across model families. This work highlights the importance of feature‑aware testing for improving the reliability of deep learning systems deployed in real‑world software.