Greedy Sparsity-Constrained Optimization
Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressive Sensing. A vast body of work has studied sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models, where the fidelity of the estimate is measured by the squared error. In contrast, relatively little effort has been made to study sparsity-constrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressive Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic data, where the algorithm is employed for sparse logistic regression with and without $\ell_2$-regularization.
💡 Research Summary
This paper addresses the problem of sparsity‑constrained optimization when the objective function is non‑quadratic or non‑smooth, a setting that is common in modern machine learning but has received comparatively little algorithmic attention. The authors propose a new greedy method called Gradient Support Pursuit (GraSP), which extends the ideas behind CoSaMP, IHT, and Subspace Pursuit to arbitrary cost functions. At each iteration GraSP computes the full gradient at the current estimate, selects the 2s largest‑magnitude gradient components, merges their indices with the support of the current estimate to form a candidate support set, solves a restricted low‑dimensional subproblem (e.g., via a Newton or line‑search step) over that support, and then truncates the solution back to its s largest entries. This loop repeats until a convergence criterion is met, such as the restricted gradient becoming sufficiently small.
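The loop described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the paper's reference implementation: the function names `grasp` and `solve_restricted` are our own, and the inner restricted minimization is abstracted away as a callback that returns a length-p vector supported on the candidate set.

```python
import numpy as np

def grasp(grad, solve_restricted, p, s, iters=20):
    """Sketch of Gradient Support Pursuit (GraSP).

    grad(x)             -> gradient of the cost at x (length-p array)
    solve_restricted(T) -> approximate minimizer of the cost restricted
                           to support T, returned as a length-p vector
    """
    x = np.zeros(p)
    for _ in range(iters):
        z = grad(x)
        # indices of the 2s largest-magnitude gradient entries
        Z = np.argsort(np.abs(z))[-2 * s:]
        # merge with the support of the current estimate
        T = np.union1d(Z, np.flatnonzero(x))
        # solve the restricted subproblem over T
        b = solve_restricted(T)
        # prune back to the s largest-magnitude entries
        x = np.zeros(p)
        keep = np.argsort(np.abs(b))[-s:]
        x[keep] = b[keep]
    return x
```

With a quadratic toy cost f(x) = ½‖x − x*‖₂² the restricted subproblem has a closed form (copy the entries of x* on T), and the sketch recovers an s-sparse x* exactly, mirroring the CoSaMP special case the paper generalizes.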
The theoretical contribution rests on two novel regularity conditions. The first, Stable Restricted Hessian (SRH), requires that the Hessian of the objective is well conditioned on sparse subspaces: for sparse x and sparse directions Δ, α‖Δ‖₂² ≤ Δᵀ∇²f(x)Δ ≤ β‖Δ‖₂². This generalizes the Restricted Isometry Property (RIP) from linear measurements to smooth, possibly non‑linear models. The second condition, Stable Restricted Linearization (SRL), is a non‑smooth analogue: rather than bounding a Hessian, it requires that restricted linear approximations of f built from its (sub)gradients remain well conditioned on sparse vectors, with constants playing the same roles as α and β. Under SRH (or SRL) the authors prove that GraSP converges geometrically: after t iterations the estimate x̂^{(t)} satisfies
‖x̂^{(t)}−x*‖₂ ≤ q^{t}·‖x*‖₂ + C·ε,
where x* is the true s‑sparse minimizer, q < 1 is a contraction factor governed by the restricted condition number β/α, C is a constant, and ε captures measurement noise or model misspecification. Since β ≥ α by definition, the ratio β/α is always at least 1; the guarantee requires it to be sufficiently close to 1, in which case the error shrinks geometrically with t.
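For intuition about what SRH measures, the brute-force helper below (the name `restricted_condition` is ours, not the paper's) scans every k-sparse principal submatrix of a fixed Hessian and returns the tightest α and β. This enumeration is exponential in the dimension and purely illustrative; the paper bounds SRH analytically rather than computing it this way.

```python
import numpy as np
from itertools import combinations

def restricted_condition(H, k):
    """Tightest alpha, beta such that alpha*I <= H[S,S] <= beta*I
    over all supports S with |S| = k (brute force, for intuition only)."""
    p = H.shape[0]
    alpha, beta = np.inf, -np.inf
    for S in combinations(range(p), k):
        w = np.linalg.eigvalsh(H[np.ix_(S, S)])  # eigenvalues, ascending
        alpha = min(alpha, w[0])
        beta = max(beta, w[-1])
    return alpha, beta
```

For a diagonal Hessian the restricted constants are simply the extreme diagonal entries reachable with k indices, which makes it easy to sanity-check the helper by hand.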
To demonstrate applicability, the paper shows that the ℓ₂‑regularized logistic loss f(w)=∑ₙ log(1+e^{−yₙ aₙᵀw})+λ‖w‖₂² satisfies SRH provided the data matrix A obeys a suitable RIP‑like condition. The Hessian of the logistic term is Σₙ σₙ(1−σₙ) aₙ aₙᵀ, which is positive semidefinite with σₙ(1−σₙ) ≤ 1/4, and the ridge term contributes 2λI; together these bound the eigenvalues away from zero and, on any s‑sparse subspace, from above, yielding explicit α and β. Consequently GraSP can recover a sparse classifier without any ℓ₁ penalty.
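The Hessian above is easy to assemble and check numerically. The helper below (`logistic_hessian` is an illustrative name of ours) builds Σₙ σₙ(1−σₙ) aₙaₙᵀ + 2λI for labels yₙ ∈ {−1, +1}; since the data term is positive semidefinite, every eigenvalue is at least 2λ, which is the lower bound the ridge term provides.

```python
import numpy as np

def logistic_hessian(A, y, w, lam):
    """Hessian of sum_n log(1 + exp(-y_n * a_n^T w)) + lam * ||w||_2^2.

    A : (n, p) data matrix, y : (n,) labels in {-1, +1}, w : (p,) weights.
    The logistic curvature sigma*(1 - sigma) never exceeds 1/4, and the
    ridge term adds 2*lam*I, so the spectrum lies in [2*lam, beta].
    """
    sigma = 1.0 / (1.0 + np.exp(-(A @ w) * y))  # sigma(y_n * a_n^T w)
    D = sigma * (1.0 - sigma)                   # per-sample curvature weights
    return (A * D[:, None]).T @ A + 2.0 * lam * np.eye(A.shape[1])
```

A quick spectral check on random data confirms the 2λ floor, which is exactly why the ℓ₂ penalty makes the restricted strong-convexity constant α explicit.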
Experimental evaluation uses synthetic data (p=2000, sparsity s=20, varying sample sizes) and a real‑world spam classification task. GraSP is compared against ℓ₁‑regularized logistic regression, standard IHT, CoSaMP variants, and a recent forward‑backward algorithm. Results show that GraSP attains lower test error and converges in fewer iterations, especially when the ℓ₂ regularization parameter λ is large. Its per‑iteration cost is O(p·s), making it scalable to high‑dimensional problems.
The authors acknowledge limitations: verifying SRH/SRL for a given dataset may be non‑trivial, and when the design matrix is highly correlated the condition number β/α can approach one, slowing convergence. For non‑smooth objectives, SRL guarantees are weaker, suggesting the need for sub‑gradient adaptations. Future work is proposed on relaxing the regularity conditions, incorporating adaptive step sizes, and extending GraSP to deep learning architectures.
In summary, the paper delivers a theoretically grounded, computationally efficient greedy algorithm that broadens sparsity‑constrained optimization beyond quadratic losses. By introducing SRH and SRL, it unifies the analysis of smooth and non‑smooth cases and provides concrete guarantees for important models such as regularized logistic regression, positioning GraSP as a strong alternative to traditional ℓ₁‑based methods.