On the Adversarial Robustness of Learning-based Conformal Novelty Detection
This paper studies the adversarial robustness of conformal novelty detection. In particular, we focus on two powerful learning-based frameworks that come with finite-sample false discovery rate (FDR) control: AdaDetect (Marandon et al., 2024), which is built on a positive-unlabeled classifier, and a one-class-classifier-based approach (Bates et al., 2023). While both provide rigorous statistical guarantees under benign conditions, their behavior under adversarial perturbations remains underexplored. We first formulate an oracle attack setup, under the AdaDetect formulation, that quantifies the worst-case degradation of FDR, deriving an upper bound that characterizes the statistical cost of attacks. This idealized formulation directly motivates a practical and effective attack scheme that requires only query access to the output labels of both frameworks. Coupling these formulations with two popular and complementary black-box adversarial algorithms, we systematically evaluate the vulnerability of both frameworks on synthetic and real-world datasets. Our results show that adversarial perturbations can significantly increase the FDR while maintaining high detection power, exposing fundamental limitations of current error-controlled novelty detection methods and motivating the development of more robust alternatives.
💡 Research Summary
This paper provides the first systematic investigation of the adversarial robustness of conformal novelty detection methods that offer finite‑sample false discovery rate (FDR) control. The authors focus on two state‑of‑the‑art learning‑based frameworks: AdaDetect, which leverages a positive‑unlabeled (PU) classifier to learn adaptive detection scores, and the one‑class classifier approach introduced by Bates et al. (2023). Both methods rely on the exchangeability of null samples and guarantee FDR under benign conditions, yet their behavior under adversarial perturbations has not been studied.
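Both frameworks share a common backbone: score the test points with a learned function, convert scores to conformal p-values against a held-out set of null (calibration) samples, and apply the Benjamini-Hochberg (BH) procedure to control the FDR. The sketch below illustrates that generic pipeline with a stand-in score (the raw feature value); it is not the paper's implementation, and the function names are hypothetical.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Empirical conformal p-value for each test score: the fraction of
    calibration null scores (plus the test point itself) at least as large."""
    cal = np.asarray(cal_scores)
    return np.array([(1 + np.sum(cal >= s)) / (len(cal) + 1)
                     for s in test_scores])

def benjamini_hochberg(pvals, q=0.1):
    """BH step-up rule: reject the k smallest p-values, where
    k = max{i : p_(i) <= q * i / m}. Returns a boolean rejection mask."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = np.asarray(pvals)[order]
    below = np.nonzero(sorted_p <= q * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[:below[-1] + 1]] = True
    return reject

# Toy run: null scores ~ N(0, 1), novelties shifted by +3 (illustrative data).
rng = np.random.default_rng(0)
cal = rng.normal(size=500)                          # calibration nulls
test = np.concatenate([rng.normal(size=50),         # true nulls
                       rng.normal(3.0, 1.0, 50)])   # novelties
pv = conformal_pvalues(cal, test)
rej = benjamini_hochberg(pv, q=0.1)
fdp = np.sum(rej[:50]) / max(np.sum(rej), 1)        # false discovery proportion
```

Under exchangeability of the null scores, this pipeline controls FDR at level q in finite samples; the attacks described next target exactly the inputs feeding these scores.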
The authors first define an “oracle attack” scenario that gives the attacker full knowledge of the training and test data, the true labels of each test point, and the complete configuration of AdaDetect (including the learned score function and its parameters). In this setting the attacker selects a fixed subset of true null test samples and perturbs them just enough to flip the decision of the underlying score function while preserving the original label. By formulating the attack as a constrained optimization problem, they derive an analytical upper bound on the worst‑case increase in FDR. This bound quantifies the statistical cost of an optimal adversarial attack and serves as a benchmark for practical schemes.
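Generically, such an oracle attack can be phrased as a constrained perturbation problem (the notation below is illustrative, not the paper's exact formulation): for a true null test point $x$, learned score function $g$, and rejection threshold $\tau$ implied by the BH step,

$$
\min_{\delta} \ \|\delta\| \quad \text{subject to} \quad g(x + \delta) \ge \tau,
$$

i.e., find the smallest perturbation that pushes the null sample's score past the rejection boundary so it is declared a discovery, thereby inflating the FDR.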
Motivated by the oracle analysis, the paper proposes a practical “surrogate decision‑based attack” that requires only query access to the binary output (label) of the detection system. The attacker builds a surrogate model using the same architecture as the target (e.g., a PU learner for AdaDetect or a one‑class SVM for the Bates method) and then applies two popular black‑box decision‑based adversarial algorithms—HopSkipJump and Boundary Attack—to generate minimal perturbations that change the predicted label. Crucially, the attack operates directly on the raw input data rather than on derived conformal p‑values, making it more realistic for real‑world deployments.
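The core primitive of such decision-based attacks can be sketched as a binary search along the segment between a benign point and a point the detector already flags, using only label queries. This is a minimal stand-in for the boundary-finding step shared by HopSkipJump and Boundary Attack, not either algorithm in full; the detector and function names below are hypothetical.

```python
import numpy as np

def label_flip_attack(x, x_adv_init, query_fn, tol=1e-3, max_iter=50):
    """Query-only sketch: binary-search along the segment from a benign
    point x (query_fn(x) is False, i.e. 'null') to x_adv_init (already
    labeled 'novel'), returning the point closest to x whose label flips."""
    assert not query_fn(x) and query_fn(x_adv_init)
    lo, hi = 0.0, 1.0                    # interpolation weight toward x_adv_init
    while hi - lo > tol and max_iter > 0:
        mid = (lo + hi) / 2
        if query_fn((1 - mid) * x + mid * x_adv_init):
            hi = mid                     # still flagged: shrink the perturbation
        else:
            lo = mid
        max_iter -= 1
    return (1 - hi) * x + hi * x_adv_init

# Hypothetical detector: a linear accept/reject rule standing in for the
# deployed conformal detector's binary output.
w = np.array([1.0, 1.0])
detector = lambda z: float(z @ w) > 1.0      # True = flagged as novelty
x_null = np.array([0.0, 0.0])                # true null, not flagged
x_seed = np.array([2.0, 2.0])                # point already flagged
x_adv = label_flip_attack(x_null, x_seed, detector)
```

The returned `x_adv` is a true null that the detector now flags, yet lies close to the decision boundary; repeating this over a subset of nulls is what drives the FDR inflation reported in the experiments.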
Extensive experiments are conducted on synthetic data as well as several real-world benchmarks (UCI tabular datasets, image datasets, etc.). For each dataset the authors evaluate (i) the baseline FDR and power of the unattacked detectors, (ii) the FDR after the oracle attack, and (iii) the FDR after the query-only attack using each of the two black-box algorithms. The results consistently show that adversarial perturbations can raise the empirical FDR by 30-70% while leaving detection power essentially unchanged. Both AdaDetect and the one-class method exhibit comparable vulnerability, indicating that PU-based adaptation does not inherently improve robustness. Moreover, successful attacks are achieved with only a few thousand label queries, demonstrating that even limited-budget adversaries can severely compromise error-controlled novelty detection.
The findings reveal a fundamental tension between statistical error control and adversarial robustness in conformal novelty detection. While the methods guarantee FDR under the exchangeability assumption, this guarantee collapses when an attacker can manipulate test inputs. The paper therefore calls for new research directions: (1) designing score functions that are intrinsically robust to small input perturbations while preserving exchangeability, (2) integrating adversarial detection or certification mechanisms into the conformal inference pipeline, and (3) developing theoretical frameworks that jointly bound FDR and adversarial risk. In summary, the work highlights that current finite‑sample FDR‑controlled novelty detectors are not safe against adversarial threats and motivates the development of more resilient alternatives.