Statistical Testing Framework for Clustering Pipelines by Selective Inference
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass through several data-dependent procedures within such pipelines. In this study, we address the problem of quantifying the statistical reliability of results produced by data analysis pipelines. As a proof of concept, we focus on clustering pipelines that identify cluster structures from complex and heterogeneous data through procedures such as outlier detection, feature selection, and clustering. We propose a novel statistical testing framework to assess the significance of clustering results obtained through these pipelines. Our framework, based on selective inference, enables the systematic construction of valid statistical tests for clustering pipelines composed of predefined components. We prove that the proposed test controls the type I error rate at any nominal level and demonstrate its validity and effectiveness through experiments on synthetic and real datasets.
💡 Research Summary
The paper tackles a fundamental problem in modern data analysis: how to assess the statistical reliability of results that emerge from multi‑step pipelines, especially when the final step is an unsupervised clustering algorithm preceded by outlier detection and feature selection. While pipelines are widely used in high‑stakes domains such as genomics or medical diagnostics, the literature has largely focused on engineering aspects (scalability, reproducibility) or on post‑hoc validation methods that ignore the data‑driven nature of the pipeline itself. Consequently, p‑values obtained after clustering are often invalid because the hypotheses being tested (e.g., “cluster A has a higher mean than cluster B on feature j”) are themselves selected by the same data that generated the clusters.
To solve this, the authors adopt the selective inference (SI) framework, which conditions on the entire selection process and computes valid conditional p‑values. They model a clustering pipeline as a directed acyclic graph (DAG) whose nodes are predefined algorithmic components: two outlier‑detection methods (k‑NN removal, k‑NN‑mean removal), two feature‑selection methods (variance‑based, correlation‑based), two clustering algorithms (DBSCAN, k‑means), and set operations (union, intersection) for combining multiple OD/FS results. The pipeline takes a raw data matrix X∈ℝ^{n×d} and outputs three objects: O (indices of detected outliers), M (selected feature indices), and C (cluster labels). The statistical model assumes X = μ + ε with ε∼N(0, Σ), where Σ is known (or estimated, as discussed in the appendix).
The hypothesis of interest is a difference in true means between two clusters a and b for a selected feature j∈M. The test statistic is the difference of sample means, which can be written as a linear functional T(X)=ηᵀX, where η encodes the cluster memberships and the selected feature set. The key technical contribution is the derivation of the conditional distribution of ηᵀX given the entire pipeline’s selection events. Each component of the pipeline imposes linear constraints on X (e.g., “sample i is flagged as an outlier because its distance exceeds a threshold”). By aggregating all constraints, the authors obtain a polyhedral region A·X ≤ b that characterizes the selection event. Under the Gaussian model, ηᵀX conditional on A·X ≤ b follows a truncated normal distribution, enabling exact computation of p‑values via standard algorithms for truncated normals.
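The conditioning step described above can be sketched concretely with the standard polyhedral computation from the selective-inference literature: given a selection event {Ax ≤ b}, the interval to which ηᵀx is truncated is found by sliding x along the direction of η. The selection event and contrast vector below are toy stand‑ins (the "sample 0 attains the maximum" event), not the paper's actual pipeline constraints:

```python
import numpy as np
from scipy.stats import norm

def truncated_normal_p(eta, x, A, b, sigma2=1.0):
    """Selective p-value for eta^T x conditional on {A x <= b},
    assuming x ~ N(mu, sigma2 * I) (polyhedral-lemma computation)."""
    eta = np.asarray(eta, float)
    x = np.asarray(x, float)
    z = eta @ x                       # observed test statistic
    s2 = sigma2 * (eta @ eta)         # Var(eta^T x)
    c = sigma2 * eta / s2             # direction along which z varies
    r = x - c * z                     # component independent of eta^T x
    Ac, slack = A @ c, b - A @ r
    # A(r + c*v) <= b  =>  Ac*v <= slack; split rows by the sign of Ac
    lo = max((slack[i] / Ac[i] for i in range(len(b)) if Ac[i] < 0),
             default=-np.inf)
    hi = min((slack[i] / Ac[i] for i in range(len(b)) if Ac[i] > 0),
             default=np.inf)
    sd = np.sqrt(s2)
    F = (norm.cdf(z / sd) - norm.cdf(lo / sd)) / \
        (norm.cdf(hi / sd) - norm.cdf(lo / sd))
    return 2 * min(F, 1 - F)          # two-sided conditional p-value

# Toy selection event: "sample j attains the maximum", i.e. x_i <= x_j.
n = 5
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
j = int(np.argmax(x))
rows = [i for i in range(n) if i != j]
A = np.zeros((n - 1, n))
b = np.zeros(n - 1)
for k, i in enumerate(rows):
    A[k, i], A[k, j] = 1.0, -1.0      # encodes x_i - x_j <= 0
eta = -np.ones(n) / (n - 1)
eta[j] = 1.0                          # selected max vs. mean of the rest
p = truncated_normal_p(eta, x, A, b)
print(round(p, 4))
```

A naive z-test would use the unconditional N(0, ηᵀΣη) reference distribution for ηᵀx; conditioning on the polyhedron replaces it with the truncated normal on (lo, hi), which is what restores validity after selection.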
The authors prove that the resulting test controls the type I error at any nominal level α, regardless of the specific pipeline configuration. This universality is a major advance: a single implementation can handle any combination of the supported OD, FS, and clustering components without additional coding. The software framework parses a JSON description of the pipeline, automatically builds the constraint system, and returns valid p‑values for any user‑specified cluster pair and feature.
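A pipeline description along these lines might look as follows; the schema and component names here are illustrative guesses, not the framework's actual JSON format:

```python
import json

# Hypothetical pipeline configuration: union of two OD methods,
# intersection of two FS methods, then k-means clustering.
# Field names are assumptions for illustration only.
pipeline = {
    "outlier_detection": {"op": "union",
                          "methods": ["knn_removal", "knn_mean_removal"]},
    "feature_selection": {"op": "intersection",
                          "methods": ["variance_based", "correlation_based"]},
    "clustering": "k_means",
}
print(json.dumps(pipeline, indent=2))
```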
Empirical evaluation includes synthetic experiments where the true mean difference between clusters is varied, and the authors demonstrate that the test maintains the prescribed false‑positive rate while achieving higher power than naïve post‑hoc t‑tests. Real‑world experiments on genomic data (identifying disease subtypes) and image data (grouping fashion items) illustrate that the method can detect meaningful differences that would be missed or overstated by conventional approaches.
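The failure mode of the naive approach is easy to reproduce. The following simulation (illustrative, not from the paper) generates data under a global null with no cluster structure, forms two "clusters" by a data-driven split, and runs an ordinary two-sample t-test between them; the rejection rate vastly exceeds the nominal 5% level, which is exactly the bias the selective test corrects:

```python
import numpy as np
from scipy import stats

# Under the global null (i.i.d. N(0,1), no true clusters), split each
# dataset at its median and naively t-test the two resulting "clusters".
rng = np.random.default_rng(1)
n, reps, alpha = 40, 500, 0.05
rejections = 0
for _ in range(reps):
    x = rng.standard_normal(n)
    labels = x > np.median(x)          # data-driven cluster assignment
    _, p = stats.ttest_ind(x[labels], x[~labels])
    rejections += (p < alpha)
print(f"naive rejection rate: {rejections / reps:.2f}")  # far above 0.05
```

Because the split itself was chosen to separate large values from small ones, the t-statistic is inflated on every replication; a valid selective test would instead condition on the event that this particular split was selected.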
The paper’s contributions are threefold: (1) a statistically rigorous test for a broad class of clustering pipelines, (2) the first application of selective inference to a pipeline comprising multiple heterogeneous components, and (3) an open‑source implementation that abstracts away the mathematical complexity from end users. Limitations include the assumption of known covariance Σ (though the authors discuss estimation) and the focus on linear mean‑difference tests; extending the framework to non‑linear cluster characteristics or deep‑learning based pipelines remains future work.
In summary, this work provides a principled, generalizable solution for quantifying the significance of cluster‑based findings when they arise from complex, data‑dependent preprocessing pipelines, thereby enhancing the reliability and reproducibility of high‑impact data‑driven decisions.