Sparse outlier-robust PCA for multi-source data
Sparse and outlier-robust Principal Component Analysis (PCA) has recently been a very active field of research. Yet most existing methods apply PCA to a single dataset, whereas multi-source data (i.e., multiple related datasets requiring joint analysis) arise across many scientific areas. We introduce a novel PCA methodology that simultaneously (i) selects important features, (ii) allows for the detection of global sparse patterns across multiple data sources as well as local source-specific patterns, and (iii) is resistant to outliers. To this end, we develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns, and in which the ssMRCD estimator is used as a plug-in to permit joint outlier-robust analysis across multiple data sources. We provide an efficient implementation of our proposal via the Alternating Direction Method of Multipliers and illustrate its practical advantages in simulations and in applications.
💡 Research Summary
This paper introduces a novel principal component analysis (PCA) framework designed for multi‑source data, i.e., several related data sets that should be analyzed jointly. Classical PCA and most sparse PCA methods operate on a single data matrix, which limits their ability to distinguish variables that drive variation globally across all sources from those that are important only locally within a particular source. The authors address this gap by (1) formulating a regularized optimization problem that simultaneously encourages global sparsity (variables selected in all sources) and local sparsity (variables selected in individual sources), (2) employing the spatially smoothed Minimum Regularized Covariance Determinant (ssMRCD) estimator as a plug‑in to obtain robust, jointly estimated covariance matrices and means for each source, and (3) solving the resulting non‑convex problem efficiently with an Alternating Direction Method of Multipliers (ADMM) algorithm.
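To make the global versus local distinction concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `explained_variance` and `sparsity_pattern`, the tolerance `tol`, and the toy data are illustrative assumptions, and the covariance matrices are simulated rather than produced by ssMRCD. It assumes one loading vector per source, stored as the columns of a matrix `V`, evaluates the per-source variance captured by each loading vector, and classifies variables as global (selected in all sources), local (selected in some but not all sources), or unselected.

```python
import numpy as np

def explained_variance(V, covs):
    """Per-source quadratic form v_i^T Sigma_i v_i for each column v_i of V."""
    return np.array([V[:, i] @ covs[i] @ V[:, i] for i in range(V.shape[1])])

def sparsity_pattern(V, tol=1e-8):
    """Split variable indices into global, local, and unselected sets."""
    nonzero = np.abs(V) > tol          # p x N boolean support matrix
    count = nonzero.sum(axis=1)        # number of sources selecting each variable
    n_sources = V.shape[1]
    return {
        "global": np.flatnonzero(count == n_sources),
        "local": np.flatnonzero((count > 0) & (count < n_sources)),
        "unselected": np.flatnonzero(count == 0),
    }

# Toy setup (assumption): p = 4 variables, N = 3 sources,
# with simulated positive-definite covariances standing in for robust estimates.
rng = np.random.default_rng(0)
p, N = 4, 3
covs = []
for _ in range(N):
    A = rng.standard_normal((p, p))
    covs.append(A @ A.T + np.eye(p))

# Hypothetical sparse loadings: variable 0 is used by every source,
# variable 2 only by the second source, variables 1 and 3 by none.
V = np.array([[1.0, 1.0, 1.0],
              [0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
V /= np.linalg.norm(V, axis=0)         # unit-norm loading vector per source

ev = explained_variance(V, covs)
pattern = sparsity_pattern(V)
print(ev)
print(pattern)
```

In this toy case variable 0 is reported as global and variable 2 as local, which mirrors the kind of structure the penalty is meant to recover; the actual method estimates `V` by optimization rather than taking it as given.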
The loading matrix V∈ℝ^{p×N} (p variables, N sources) is the decision variable. For each source i, the quadratic term v_{·i}ᵀ Σ̂_i v_{·i} measures explained variance using the robust covariance Σ̂_i supplied by ssMRCD. Sparsity is imposed through a composite penalty: γ∈