Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy
The widespread adoption of Large Language Models (LLMs) has made the detection of AI-generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also those most susceptible to domain shift, formatting variation, and text-length effects. We believe these insights can inform the design of AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.
💡 Research Summary
The paper addresses the growing need to distinguish human‑written from AI‑generated text, a problem that has become critical in education, publishing, and online communication. While many recent detectors achieve near‑perfect scores on shared‑task benchmarks such as PAN‑CLEF 2025 and COLING 2025, the authors question whether these systems truly capture the latent construct of “machine authorship” or merely exploit idiosyncratic cues present in the training corpora.
To investigate this, they construct an interpretable detection pipeline based on 38 handcrafted linguistic features spanning surface statistics (character, word, sentence counts), lexical diversity (type‑token ratio, hapax ratio, entropy), syntactic structure (POS distributions, dependency depth), readability and predictability (Flesch‑Reading‑Ease, gzip compression ratio, mean token surprisal), and discourse/style markers (repetition rates, punctuation usage). These features are computed at the document level, allowing direct inspection of the evidence used by the classifier.
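A few of the feature families listed above can be sketched with the standard library alone. The snippet below is an illustrative subset, not the paper's 38-feature extractor: type-token ratio, hapax ratio, unigram entropy, gzip compression ratio, and mean sentence length, all computed at the document level as the summary describes. The simple regex tokenizer is an assumption for brevity.

```python
import gzip
import math
import re
from collections import Counter

def extract_features(text: str) -> dict:
    """Compute a small, illustrative subset of the document-level
    linguistic features described above (not the full 38-feature set)."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(tokens)
    n = len(tokens)

    # Lexical diversity: type-token ratio and hapax (once-occurring) ratio
    ttr = len(counts) / n if n else 0.0
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / n if n else 0.0

    # Shannon entropy of the unigram distribution (bits per token)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

    # Predictability proxy: gzip compression ratio (compressed / raw bytes)
    raw = text.encode("utf-8")
    gzip_ratio = len(gzip.compress(raw)) / len(raw) if raw else 0.0

    # Surface statistic: mean sentence length in tokens
    mean_sent_len = n / len(sentences) if sentences else 0.0

    return {
        "type_token_ratio": ttr,
        "hapax_ratio": hapax_ratio,
        "unigram_entropy": entropy,
        "gzip_ratio": gzip_ratio,
        "mean_sentence_length": mean_sent_len,
    }
```

Because every feature is a named scalar, a classifier trained on the resulting vector can be inspected feature by feature, which is exactly the property the authors exploit later with SHAP.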
Using classical machine‑learning models (primarily XGBoost and Random Forest), the system attains an F1 of 0.9734 on both benchmark datasets, matching or surpassing state‑of‑the‑art transformer‑based detectors despite not relying on large pre‑trained language models. However, systematic cross‑domain and cross‑generator evaluations reveal a dramatic drop in performance when a model trained on one benchmark is applied to the other, with F1 scores often falling below 0.70.
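The cross-domain protocol described above can be sketched as a train-on-A, test-on-B harness. The data below is synthetic (a hypothetical stand-in for the two benchmarks, not the paper's corpora): the class-separating cue is deliberately moved between "domains" to mimic the artefact-reliance failure mode, so the in-domain score is high while the cross-domain score collapses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def make_corpus(shift: float, n: int = 400, d: int = 38):
    """Synthetic stand-in for one benchmark: d feature columns, binary labels.
    `shift` moves the discriminative signal between columns 0 and 1,
    mimicking a dataset-specific cue that does not transfer."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    X[:, 0] += y * (1.5 - shift)   # informative in the source domain...
    X[:, 1] += y * shift           # ...but the cue migrates in the target
    return X, y

X_a, y_a = make_corpus(shift=0.0)   # "source" benchmark
X_b, y_b = make_corpus(shift=1.5)   # "target" benchmark

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_a, y_a)
f1_in = f1_score(y_a, clf.predict(X_a))       # in-domain evaluation
f1_cross = f1_score(y_b, clf.predict(X_b))    # cross-domain evaluation
```

Under this construction `f1_in` is near-perfect while `f1_cross` hovers near chance, reproducing in miniature the generalisation gap the authors report.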
To diagnose the cause, the authors apply SHAP (Shapley Additive Explanations) to obtain instance‑level feature contributions. The analysis shows that the most influential features differ markedly between datasets: for PAN‑CLEF, average sentence length and gzip compression ratio dominate, whereas for COLING, lexical entropy and surprisal scores are paramount. Error‑focused SHAP visualisations of false positives and false negatives further demonstrate that features sensitive to text length, formatting, and genre are responsible for many misclassifications.
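The paper applies tree-based SHAP to its ensemble models; as a minimal self-contained illustration of the attribution idea, the sketch below uses the closed form that exact SHAP values take for a linear model with independent features, phi_i = w_i * (x_i - E[x_i]). This is a simplification for exposition, not the authors' method, but it exhibits the local-accuracy property that makes the error-focused analyses possible: per-instance contributions sum to the gap between the model's output and its background average.

```python
import numpy as np

def linear_shap(w: np.ndarray, x: np.ndarray, X_background: np.ndarray) -> np.ndarray:
    """Exact SHAP values for a linear model f(x) = w @ x with
    independent features: phi_i = w_i * (x_i - E[x_i])."""
    mu = X_background.mean(axis=0)
    return w * (x - mu)

# Toy model over three named features (hypothetical weights)
w = np.array([2.0, -1.0, 0.5])
X_bg = np.array([[1.0, 0.0, 2.0],
                 [3.0, 2.0, 0.0]])
x = np.array([4.0, 1.0, 1.0])

phi = linear_shap(w, x, X_bg)

# Local accuracy: contributions sum to f(x) - E[f(X)]
assert np.isclose(phi.sum(), w @ x - (X_bg @ w).mean())
```

Averaging such per-instance attributions over false positives and false negatives, as the authors do, reveals which features (here, hypothetically, length- or formatting-sensitive ones) drive the misclassifications.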
These findings support the central claim that high benchmark accuracy does not guarantee valid authorship detection; instead, many detectors rely on dataset‑specific stylistic artifacts that do not generalize. The paper therefore advocates for a validity‑oriented evaluation framework that includes cross‑domain robustness and explainability as core metrics. As a practical contribution, the authors release an open‑source Python package that outputs both binary predictions and SHAP‑based explanations for individual texts, facilitating reproducible research and responsible deployment in high‑stakes settings such as academic integrity workflows.