Structured sparsity through convex optimization
Submitted to Statistical Science

Francis Bach, Rodolphe Jenatton, Julien Mairal and Guillaume Obozinski
INRIA and University of California, Berkeley

Sierra Project-Team, Laboratoire d'Informatique de l'Ecole Normale Superieure, Paris, France (e-mail: francis.bach@inria.fr; rodolphe.jenatton@inria.fr; guillaume.obozinski@inria.fr). Department of Statistics, University of California, Berkeley, USA (e-mail: julien@stat.berkeley.edu).

Abstract. Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through regularization by the ℓ1-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the ℓ1-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of non-linear variable selection.

Key words and phrases: Sparsity, Convex optimization.

1. INTRODUCTION

The concept of parsimony is central in many scientific domains. In the context of statistics, signal processing or machine learning, it takes the form of variable or feature selection problems, and is commonly used in two situations: first, to make the model or the prediction more interpretable or cheaper to use, i.e., even if the underlying problem does not admit sparse solutions, one looks for the best sparse approximation; second, sparsity can also be used given prior knowledge that the model should be sparse.

Sparse linear models seek to predict an output by linearly combining a small subset of the features describing the data. To simultaneously address variable selection and model estimation, ℓ1-norm regularization has become a popular tool, which benefits both from efficient algorithms (see, e.g., Efron et al., 2004; Beck and Teboulle, 2009; Yuan, 2010; Bach et al., 2012, and multiple references therein) and from a well-developed theory of generalization properties and variable selection consistency (Zhao and Yu, 2006; Wainwright, 2009; Bickel, Ritov and Tsybakov, 2009; Zhang, 2009).

When regularizing with the ℓ1-norm, each variable is selected individually, regardless of its position in the input feature vector, so that existing relationships and structures between the variables (e.g., spatial, hierarchical or related to the physics of the problem at hand) are simply disregarded. However, in many practical situations the estimation can benefit from some type of prior knowledge, potentially both for interpretability and to improve predictive performance.
This a priori can take various forms. In neuroimaging based on functional magnetic resonance imaging (fMRI) or magnetoencephalography (MEG), the sets of voxels that discriminate between different brain states are expected to form small localized and connected areas (Gramfort and Kowalski, 2009; Xiang et al., 2009, and references therein). Similarly, in face recognition, as shown in Section 4.4, robustness to occlusions can be increased by considering as features sets of pixels that form small convex regions of the faces. Again, a plain ℓ1-norm regularization fails to encode such specific spatial constraints (Jenatton, Obozinski and Bach, 2010). The same rationale supports the use of structured sparsity for background subtraction (Cevher et al., 2008; Huang, Zhang and Metaxas, 2011; Mairal et al., 2011).

Another example of the need for higher-order prior knowledge comes from bioinformatics. For the diagnosis of tumors, the profiles of array-based comparative genomic hybridization (arrayCGH) can be used as inputs to feed a classifier (Rapaport, Barillot and Vert, 2008). These profiles are characterized by many variables, but only a few observations of such profiles are available, prompting the need for variable selection. Because of the specific spatial organization of bacterial artificial chromosomes along the genome, the set of discriminative features is expected to consist of specific contiguous patterns. Using this prior knowledge in addition to standard sparsity leads to improvements in classification accuracy (Rapaport, Barillot and Vert, 2008). In the context of multi-task regression, a problem of interest in genetics is to find a mapping between a small subset of loci presenting single nucleotide polymorphisms (SNPs) that have a phenotypic impact on a given family of genes (Kim and Xing, 2010). This target family of genes has its own structure, where some genes share common genetic characteristics, so that these genes can be embedded into some underlying hierarchy. Exploiting this hierarchical information directly in the regularization term outperforms the unstructured approach with a standard ℓ1-norm (Kim and Xing, 2010).

These real-world examples motivate the design of sparsity-inducing regularization schemes capable of encoding more sophisticated prior knowledge about the expected sparsity patterns. As mentioned above, the ℓ1-norm corresponds only to a constraint on cardinality and is oblivious to any other information available about the patterns of nonzero coefficients ("nonzero patterns" or "supports") induced in the solution, since they are all theoretically possible. In this paper, we consider a family of sparsity-inducing norms that can address a large variety of structured sparse problems: a simple change of norm will induce new ways of selecting variables; moreover, as shown in Section 3.5 and Section 3.6, algorithms to obtain estimators (e.g., convex optimization methods) and theoretical analyses are easily extended in many situations. As shown in Section 3, the norms we introduce generalize the traditional "group ℓ1-norms", which have been popular for selecting variables organized in non-overlapping groups (Turlach, Venables and Wright, 2005; Yuan and Lin, 2006; Roth and Fischer, 2008; Huang and Zhang, 2010).
Other families of norms for different types of structures are presented in Section 3.4.

The paper is organized as follows: we first review in Section 2 classical ℓ1-norm regularization in supervised contexts. We then introduce several families of norms in Section 3, and present applications to unsupervised learning in Section 4, namely sparse principal component analysis in Section 4.4 and hierarchical dictionary learning in Section 4.5. We briefly show in Section 5 how these norms can also be used for high-dimensional non-linear variable selection.

Notations. Throughout the paper, we denote vectors with bold lower-case letters and matrices with bold upper-case ones. For any integer j in the set [1; p] := {1, ..., p}, we denote the j-th coefficient of a p-dimensional vector w ∈ R^p by w_j. Similarly, for any matrix W ∈ R^{n×p}, we refer to the entry on the i-th row and j-th column as W_ij, for any (i, j) ∈ [1; n] × [1; p]. We will need to refer to sub-vectors of w ∈ R^p, and so, for any J ⊆ [1; p], we denote by w_J ∈ R^{|J|} the vector consisting of the entries of w indexed by J. Likewise, for any I ⊆ [1; n], J ⊆ [1; p], we denote by W_IJ ∈ R^{|I|×|J|} the sub-matrix of W formed by the rows (respectively the columns) indexed by I (respectively by J). We extensively manipulate norms in this paper. We thus define the ℓq-norm for any vector w ∈ R^p by ||w||_q^q := Σ_{j=1}^p |w_j|^q for q ∈ [1, ∞), and ||w||_∞ := max_{j ∈ [1;p]} |w_j|. For q ∈ (0, 1), we extend the definition above to ℓq pseudo-norms. Finally, for any matrix W ∈ R^{n×p}, we define the Frobenius norm of W by ||W||_F^2 := Σ_{i=1}^n Σ_{j=1}^p W_ij^2.

2. UNSTRUCTURED SPARSITY VIA THE ℓ1-NORM

Regularizing by the ℓ1-norm has been a topic of intensive research over the last decade. This line of work has witnessed the development of nice theoretical frameworks (Tibshirani, 1996; Chen, Donoho and Saunders, 1998; Mallat, 1999; Tropp, 2004, 2006; Zhao and Yu, 2006; Zou, 2006; Wainwright, 2009; Bickel, Ritov and Tsybakov, 2009; Zhang, 2009; Negahban et al., 2009) and the emergence of many efficient algorithms (Efron et al., 2004; Nesterov, 2007; Friedman et al., 2007; Wu and Lange, 2008; Beck and Teboulle, 2009; Wright, Nowak and Figueiredo, 2009; Needell and Tropp, 2009; Yuan et al., 2010). Moreover, this methodology has found quite a few applications, notably in compressed sensing (Candès and Tao, 2005), for the estimation of the structure of graphical models (Meinshausen and Bühlmann, 2006), or for several reconstruction tasks involving natural images (e.g., see Mairal, 2010, for a review). In this section, we focus on supervised learning and present the traditional estimation problems associated with sparsity-inducing norms such as the ℓ1-norm (see Section 4 for unsupervised learning).

In supervised learning, we predict (typically one-dimensional) outputs y in Y from observations x in X; these observations are usually represented by p-dimensional vectors, with X = R^p.
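As a quick illustration of these definitions, here is a minimal NumPy sketch (the vector and matrix values are arbitrary):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])        # a vector w in R^p
W = np.array([[1.0, 0.0], [3.0, -4.0]])     # a matrix W in R^{n x p}

l1   = np.sum(np.abs(w))                    # ||w||_1
l2   = np.sqrt(np.sum(w ** 2))              # ||w||_2
linf = np.max(np.abs(w))                    # ||w||_inf
frob = np.sqrt(np.sum(W ** 2))              # Frobenius norm ||W||_F

J = [1, 3]                                  # an index set J
w_J = w[J]                                  # the sub-vector w_J
print(l1, l2, linf, frob, w_J)
```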
M-estimation and in particular regularized empirical risk minimization are well suited to this setting. Indeed, given n pairs of data points {(x^(i), y^(i)) ∈ R^p × Y; i = 1, ..., n}, we consider the estimators solving the following convex optimization problem:

(2.1)   min_{w ∈ R^p}  (1/n) Σ_{i=1}^n ℓ(y^(i), w^T x^(i)) + λ Ω(w),

where ℓ is a loss function and Ω: R^p → R is a sparsity-inducing—typically non-smooth and non-Euclidean—norm. Typical examples of differentiable loss functions are the square loss for least-squares regression, i.e., ℓ(y, ŷ) = (1/2)(y − ŷ)^2 with y in R, and the logistic loss ℓ(y, ŷ) = log(1 + e^{−yŷ}) for logistic regression, with y in {−1, 1}. We refer the reader to Shawe-Taylor and Cristianini (2004) and to Hastie, Tibshirani and Friedman (2001) for more complete descriptions of loss functions.

Within the context of least-squares regression, ℓ1-norm regularization is known as the Lasso (Tibshirani, 1996) in statistics and as basis pursuit in signal processing (Chen, Donoho and Saunders, 1998). For the Lasso, formulation (2.1) takes the form

(2.2)   min_{w ∈ R^p}  (1/(2n)) ||y − Xw||_2^2 + λ ||w||_1,

and, equivalently, basis pursuit can be written as

(2.3)   min_{α ∈ R^p}  (1/2) ||x − Dα||_2^2 + λ ||α||_1.

(Note that the formulations typically encountered in signal processing are either min_α ||α||_1 s.t. x = Dα, which corresponds to the limiting case of Eq. (2.3) where λ → 0 and x is in the span of the dictionary D, or min_α ||α||_1 s.t. ||x − Dα||_2 ≤ η, which is a constrained counterpart of Eq. (2.3) leading to the same set of solutions; see the explanation following Eq. (2.4).)

These two equations are obviously identical, but we write them both to show the correspondence between the notations used in statistics and in signal processing. In statistical notation, we will use X ∈ R^{n×p} to denote a set of n observations described by p variables (covariates), while y ∈ R^n represents the corresponding set of n targets (responses) that we try to predict. For instance, y may have discrete entries in the context of classification. With the notation of signal processing, we will consider an m-dimensional signal x ∈ R^m that we express as a linear combination of p dictionary elements composing the dictionary D := [d^1, ..., d^p] ∈ R^{m×p}. While the design matrix X is usually assumed fixed and given beforehand, we shall see in Section 4 that the dictionary D may correspond either to some predefined basis (e.g., see Mallat, 1999, for wavelet bases) or to a representation that is actually learned as well (Olshausen and Field, 1996).
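As an illustration of formulation (2.2), the Lasso can be fitted with standard off-the-shelf software. The following minimal sketch uses scikit-learn on synthetic data (the regularization value is arbitrary); scikit-learn's `alpha` parameter plays the role of λ in (2.2).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 100, 200
X = rng.randn(n, p)
w_true = np.zeros(p)
w_true[:5] = rng.randn(5)                  # only 5 relevant variables
y = X @ w_true + 0.1 * rng.randn(n)

# scikit-learn's Lasso minimizes (1/(2n))||y - Xw||_2^2 + alpha*||w||_1,
# which matches Eq. (2.2) with alpha playing the role of lambda.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print("number of selected variables:", np.sum(lasso.coef_ != 0))
```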
Geometric intuitions for the ℓ1-norm ball. While we consider in (2.1) a regularized formulation, we could have considered an equivalent constrained problem of the form

(2.4)   min_{w ∈ R^p}  (1/n) Σ_{i=1}^n ℓ(y^(i), w^T x^(i))   such that   Ω(w) ≤ µ,

for some µ ∈ R_+: the set of solutions to problem (2.4) obtained when varying µ is indeed the same as the set of solutions to problem (2.1), for some λ_µ depending on µ (e.g., see Section 3.2 in Borwein and Lewis, 2006). At optimality, the opposite of the gradient of f: w ↦ (1/n) Σ_{i=1}^n ℓ(y^(i), w^T x^(i)) evaluated at any solution ŵ of (2.4) must belong to the normal cone to B = {w ∈ R^p; Ω(w) ≤ µ} at ŵ (Borwein and Lewis, 2006). In other words, for sufficiently small values of µ (i.e., ensuring that the constraint is active), the level set of f for the value f(ŵ) is tangent to B.

As a consequence, important properties of the solutions ŵ follow from the geometry of the ball B. If Ω is taken to be the ℓ2-norm, then the resulting ball B is the standard, isotropic, "round" ball that does not favor any specific direction of the space. On the other hand, when Ω is the ℓ1-norm, B corresponds to a diamond-shaped pattern in two dimensions, and to a double pyramid in three dimensions. In particular, B is anisotropic and exhibits singular points due to the non-smoothness of Ω. Since these singular points are located along axis-aligned linear subspaces in R^p, if the level set of f with the smallest feasible value is tangent to B at one of those points, sparse solutions are obtained. We display in Figure 1 the balls B for both the ℓ1- and ℓ2-norms. See Section 3 and Figure 2 for extensions to structured norms.

Fig 1: Comparison between the ℓ2-norm and ℓ1-norm balls in three dimensions, respectively on the left (a) and right (b). The ℓ1-norm ball presents singular points located along the axes of R^3 and along the three axis-aligned planes going through the origin.

3. STRUCTURED SPARSITY-INDUCING NORMS

In this section, we consider structured sparsity-inducing norms that lead to estimated vectors that are not only sparse, as for the ℓ1-norm, but whose support also displays some structure known a priori, reflecting potential relationships between the variables.

3.1 Sparsity-Inducing Norms with Disjoint Groups of Variables

The most natural form of structured sparsity is arguably group sparsity, matching the a priori knowledge that pre-specified disjoint blocks of variables should be selected or ignored simultaneously. In that case, if G is a collection of groups of variables forming a partition of [1; p], and d_g is a positive scalar weight indexed by group g, we define Ω as

(3.1)   Ω(w) = Σ_{g ∈ G} d_g ||w_g||_q,   for any q ∈ (1, ∞].

This norm is usually referred to as a mixed ℓ1/ℓq-norm, and in practice popular choices for q are {2, ∞}. As desired, regularizing with Ω leads variables in the same group to be selected or set to zero simultaneously (see Figure 2 for a geometric interpretation).
Fig 2: Comparison between two mixed ℓ1/ℓ2-norm balls in three dimensions (the first two directions are horizontal, the third one is vertical), without overlapping groups, (a) Ω(w) = ||w_{1,2}||_2 + |w_3|, and with overlapping groups, (b) Ω(w) = ||w_{1,2,3}||_2 + |w_1| + |w_2|, respectively on the left and right. The singular points appearing on these balls describe the sparsity-inducing behavior of the underlying norms Ω.

In the context of least-squares regression, this regularization is known as the group Lasso (Turlach, Venables and Wright, 2005; Yuan and Lin, 2006). It has been shown to improve the prediction performance and/or interpretability of the learned models when the block structure is relevant (Roth and Fischer, 2008; Stojnic, Parvaresh and Hassibi, 2009; Lounici et al., 2009; Huang and Zhang, 2010). Moreover, applications of this regularization scheme also arise in the context of multi-task learning (Obozinski, Taskar and Jordan, 2010; Quattoni et al., 2009; Liu, Palatucci and Zhang, 2009), to account for features shared across tasks, and multiple kernel learning (Bach, 2008), for the selection of different kernels (see also Section 5).

Choice of the weights. When the groups vary significantly in size, results can be improved, in particular under high-dimensional scaling, by an appropriate choice of the weights d_g, which compensates for the discrepancies in size between groups. It is difficult to provide a unique choice for the weights; in general, they depend on q and on the type of consistency desired. We refer the reader to Yuan and Lin (2006); Bach (2008); Obozinski, Jacob and Vert (2011); Lounici et al. (2011) for general discussions.

It might seem that the case of groups that overlap would be unnecessarily complex. It turns out, in reality, that appropriate collections of overlapping groups allow one to encode quite interesting forms of structured sparsity. In fact, the idea of constructing sparsity-inducing norms from overlapping groups will be key. We present two different constructions based on overlapping groups of variables, which are essentially complementary to each other, in Sections 3.2 and 3.3.

3.2 Sparsity-Inducing Norms with Overlapping Groups of Variables

In this section, we consider a direct extension of the norm introduced in the previous section to the case of overlapping groups; we give an informal overview of the structures that it can encode and examples of relevant applied settings. For more details, see Jenatton, Audibert and Bach (2011).

Starting from the definition of Ω in Eq. (3.1), it is natural to study what happens when the set of groups G is allowed to contain elements that overlap. In fact, as shown by Jenatton, Audibert and Bach (2011), the sparsity-inducing behavior of Ω remains the same: when regularizing by Ω, some entire groups of variables g in G are set to zero. This is reflected in the set of non-smooth extreme points of the unit ball of the norm represented in Figure 2(b).
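To make the definition in Eq. (3.1) concrete, here is a minimal NumPy sketch evaluating the mixed ℓ1/ℓq penalty for a given collection of groups, which may be disjoint (Section 3.1) or overlapping (this section); the groups, weights and vector below are arbitrary illustrative choices.

```python
import numpy as np

def group_norm(w, groups, weights, q=2):
    """Evaluate Omega(w) = sum_g d_g * ||w_g||_q as in Eq. (3.1)."""
    return sum(d * np.linalg.norm(w[np.asarray(g)], ord=q)
               for g, d in zip(groups, weights))

w = np.array([0.3, -1.2, 0.0, 2.0, 0.0])

# Disjoint groups (a partition of the variables), as in Section 3.1.
groups_disjoint = [[0, 1], [2, 3, 4]]
print(group_norm(w, groups_disjoint, weights=[1.0, 1.0]))

# Overlapping groups, as in Section 3.2 (variable 2 belongs to both groups).
groups_overlap = [[0, 1, 2], [2, 3, 4]]
print(group_norm(w, groups_overlap, weights=[1.0, 1.0]))
```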
While the resulting patterns of nonzero variables—also referred to as supports, or nonzero patterns—were obvious in the non-overlapping case, it is interesting to understand here the relationship that ties together the set of groups G and its associated set of possible nonzero patterns, which we denote by P. For any norm of the form (3.1), it is still the case that variables belonging to a given group are encouraged to be set to zero simultaneously; as a result, the possible zero patterns for solutions of (2.1) are obtained by forming unions of the basic groups, which means that the possible supports are obtained by taking the intersection of a certain number of complements of the basic groups. Moreover, under mild conditions (Jenatton, Audibert and Bach, 2011), given any intersection-closed family of patterns P of variables (see examples below), it is possible to build an ad-hoc set of groups G—and hence a regularization norm Ω—that enforces the support of the solutions of (2.1) to belong to P. (A set A is said to be intersection-closed if for any k ∈ N and any (a_1, ..., a_k) ∈ A^k, we have a_1 ∩ ... ∩ a_k ∈ A.) These properties make it possible to design norms that are adapted to the structure of the problem at hand, which we now illustrate with a few examples.

One-dimensional interval pattern. Given p variables organized in a sequence, using the set of groups G of Figure 3, it is only possible to select contiguous nonzero patterns. In this case, we have |G| = O(p). Imposing the contiguity of the nonzero patterns can be relevant in the context of variables forming time series, or for the diagnosis of tumors based on the profiles of CGH arrays (Rapaport, Barillot and Vert, 2008), since a bacterial artificial chromosome will be inserted as a single contiguous block into the genome.

Fig 3: (Left) The set of blue groups to penalize in order to select contiguous patterns in a sequence. (Right) In red, an example of such a nonzero pattern, with its corresponding zero pattern (hatched area).

Two-dimensional convex support. Similarly, assume now that the p variables are organized on a two-dimensional grid. To constrain the allowed supports P to be the set of all rectangles on this grid, a possible set of groups G to consider is represented in the top of Figure 4. This set is relatively small since |G| = O(√p). Groups corresponding to half-planes with additional orientations (see Figure 4, bottom) may be added to "carve out" more general convex patterns. See an illustration in Section 4.4.

Fig 4: Top: vertical and horizontal groups. (Left) The set of blue and green groups to penalize in order to select rectangles. (Right) In red, an example of a nonzero pattern recovered in this setting, with its corresponding zero pattern (hatched area). Bottom: groups with ±π/4 orientations. (Left) The set of blue and green groups, with their (not displayed) complements, to penalize in order to select diamond-shaped patterns.
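As a small illustration of the relation between the set of groups G and the induced supports, the toy sketch below assumes that the one-dimensional interval construction of Figure 3 is realized by taking all prefixes and all suffixes of the sequence as groups; it enumerates the zero patterns obtained as unions of groups and checks that the corresponding supports are exactly the contiguous segments.

```python
from itertools import combinations

p = 5
variables = set(range(p))

# Assumed groups for Figure 3: all prefixes and all suffixes of {0,...,p-1}.
prefixes = [set(range(k + 1)) for k in range(p - 1)]
suffixes = [set(range(k, p)) for k in range(1, p)]
groups = prefixes + suffixes

# Zero patterns are unions of groups; supports are their complements.
supports = set()
for r in range(len(groups) + 1):
    for subset in combinations(groups, r):
        zero_pattern = set().union(*subset) if subset else set()
        supports.add(frozenset(variables - zero_pattern))

def is_contiguous(s):
    return len(s) == 0 or max(s) - min(s) + 1 == len(s)

print(sorted(map(sorted, supports)))
print(all(is_contiguous(s) for s in supports))   # True: only contiguous patterns
```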
Two-dimensional block structures on a grid. Using sparsity-inducing regularizations built upon groups which are composed of variables together with their spatial neighbors leads to good performance for background subtraction (Cevher et al., 2008; Baraniuk et al., 2010; Huang, Zhang and Metaxas, 2011; Mairal et al., 2011), topographic dictionary learning (Kavukcuoglu et al., 2009; Mairal et al., 2011), and wavelet-based denoising (Rao et al., 2011).

Hierarchical structure. A fourth interesting example assumes that the variables are organized in a hierarchy. Precisely, we assume that the p variables can be assigned to the nodes of a tree T (or a forest of trees), and that a given variable may be selected only if all its ancestors in T have already been selected. This hierarchical rule is exactly respected when using the family of groups displayed in Figure 5. The corresponding penalty was first used by Zhao, Rocha and Yu (2009); one of its simplest instances in the context of regression is the sparse group Lasso (Sprechmann et al., 2010; Friedman, Hastie and Tibshirani, 2010). It has found numerous applications, for instance, wavelet-based denoising (Zhao, Rocha and Yu, 2009; Baraniuk et al., 2010; Huang, Zhang and Metaxas, 2011; Jenatton et al., 2011b), hierarchical dictionary learning for both topic modelling and image restoration (Jenatton et al., 2011b), log-linear models for the selection of potential orders (Schmidt and Murphy, 2010), bioinformatics, to exploit the tree structure of gene networks for multi-task regression (Kim and Xing, 2010), and multi-scale mining of fMRI data for the prediction of simple cognitive tasks (Jenatton et al., 2011a).

Fig 5: Left: set of groups G (dashed contours in red) corresponding to the tree T with p = 6 nodes represented by black circles. Right: example of a sparsity pattern induced by the tree-structured norm corresponding to G: the groups {2,4}, {4} and {6} are set to zero, so that the corresponding nodes (in gray), which form subtrees of T, are removed. The remaining nonzero variables {1,3,5} form a rooted and connected subtree of T. This sparsity pattern obeys the following two equivalent rules: (i) if a node is selected, so are all its ancestors; (ii) if a node is not selected, then its descendants are not selected either.

Extensions. Possible choices for the sets of groups G are not limited to the aforementioned examples: more complicated topologies can be considered, for example three-dimensional spaces discretized in cubes or spherical volumes discretized in slices (see an application to neuroimaging by Varoquaux et al. (2010)), and more complicated hierarchical structures based on directed acyclic graphs can be encoded, as further developed in Section 5.
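As an illustration of the hierarchical construction, the sketch below builds, for a small hypothetical tree encoded by a parent map, the groups of Figure 5 (one group per node, containing the node and all its descendants) and tests whether a candidate support respects the hierarchical rule.

```python
def build_tree_groups(parent):
    """Groups of Figure 5: one group per node, containing the node and all its
    descendants. The tree is given as a dict mapping each node to its parent
    (the root maps to None)."""
    groups = {node: [node] for node in parent}
    for node in parent:
        p = parent[node]
        while p is not None:          # add `node` to the group of every ancestor
            groups[p].append(node)
            p = parent[p]
    return list(groups.values())

def respects_hierarchy(support, parent):
    """True iff every selected node has its parent selected (root excepted)."""
    return all(parent[v] is None or parent[v] in support for v in support)

# Hypothetical tree with 6 nodes, numbered as in Figure 5 (node 1 is the root).
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 3, 6: 3}
print(build_tree_groups(parent))               # e.g., the group of node 2 is [2, 4]
print(respects_hierarchy({1, 3, 5}, parent))   # True  (a rooted subtree)
print(respects_hierarchy({3, 5}, parent))      # False (ancestor 1 is missing)
```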
Choice of the weights. The choice of the weights d_g is significantly more important in the overlapping case, both theoretically and in practice. In addition to compensating for the discrepancy in group sizes, the weights also have to make up for the potential over-penalization of parameters contained in a larger number of groups. For the case of one-dimensional interval patterns, Jenatton, Audibert and Bach (2011) showed that it was more efficient in practice to weight each individual coefficient inside a group rather than to weight the group globally.

3.3 Norms for Overlapping Groups: a Latent Variable Formulation

The family of norms defined in Eq. (3.1) is adapted to intersection-closed sets of nonzero patterns. However, some applications exhibit structures that are more naturally modelled by union-closed families of supports. This idea was introduced by Jacob, Obozinski and Vert (2009) and Obozinski, Jacob and Vert (2011) who, given a set of groups G, proposed the following norm:

(3.2)   Ω_union(w) := min_{v ∈ R^{p×|G|}}  Σ_{g ∈ G} d_g ||v^g||_q   such that   Σ_{g ∈ G} v^g = w  and, for all g ∈ G, v^g_j = 0 if j ∉ g,

where again d_g is a positive scalar weight associated with group g. The norm we just defined provides a generalization of the ℓ1/ℓq-norm to the case of overlapping groups that differs from the norm presented in Section 3.2. In fact, it is easy to see that solving Eq. (2.1) with the norm Ω_union is equivalent to solving

(3.3)   min_{(v^g ∈ R^{|g|})_{g ∈ G}}  (1/n) Σ_{i=1}^n ℓ( y^(i), Σ_{g ∈ G} (v^g)^T x^(i)_g ) + λ Σ_{g ∈ G} d_g ||v^g||_q,

and setting w = Σ_{g ∈ G} v^g. This last equation shows that using the norm Ω_union can be interpreted as implicitly duplicating the variables belonging to several groups and regularizing with a weighted ℓ1/ℓq-norm for disjoint groups in the expanded space. Again, in this case a careful choice of the weights is important (Obozinski, Jacob and Vert, 2011). This latent variable formulation pushes some of the vectors v^g to zero while keeping others with no zero components, hence leading to a vector w with a support which is, in general, the union of the selected groups. Interestingly, it can be seen as a convex relaxation of a non-convex penalty encouraging similar sparsity patterns, which was introduced by Huang, Zhang and Metaxas (2011) and which we present in Section 3.4.

Graph Lasso. One type of a priori knowledge commonly encountered takes the form of a graph defined on the set of input variables, such that connected variables are more likely to be simultaneously relevant or irrelevant; this type of prior is common in genomics, where regulation, co-expression or interaction networks between genes (or their expression levels) used as predictors are often available. To favor the selection of neighbors of a selected variable, it is possible to consider the edges of the graph as groups in the previous formulation (see Jacob, Obozinski and Vert, 2009; Rao et al., 2011).

Patterns consisting of a small number of intervals. A quite similar situation occurs when one knows a priori—typically for variables forming sequences (time series, strings, polymers)—that the support should consist of a small number of connected subsequences. In that case, one can consider the sets of variables forming connected subsequences (or connected subsequences of length at most k) as the overlapping groups (Obozinski, Jacob and Vert, 2011).
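The duplication interpretation of Eq. (3.3) can be sketched as follows: build an expanded design with one copy of the relevant columns of X per group, run any disjoint weighted group-Lasso solver in that space (not shown; random placeholders are used below), and recover w by summing the latent vectors. The data, groups and names are illustrative.

```python
import numpy as np

def expand_design(X, groups):
    """Duplicate columns of X so that each group gets its own copy of its
    variables (the disjoint-group view of Eq. (3.3))."""
    return np.hstack([X[:, g] for g in groups])

def aggregate_latents(v_list, groups, p):
    """Recover w = sum_g v^g, where each v^g lives on its own group."""
    w = np.zeros(p)
    for v_g, g in zip(v_list, groups):
        w[g] += v_g
    return w

rng = np.random.RandomState(0)
n, p = 20, 5
X = rng.randn(n, p)
groups = [np.array([0, 1, 2]), np.array([2, 3, 4])]   # overlapping groups

X_expanded = expand_design(X, groups)   # shape (20, 6): variable 2 appears twice
# A disjoint weighted group-Lasso solver would be run on X_expanded here,
# returning one latent vector per group; random placeholders are used below.
v_list = [rng.randn(len(g)) for g in groups]
w = aggregate_latents(v_list, groups, p)
print(X_expanded.shape, w)
```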
Fig 6: Two instances of unit balls of the latent group Lasso regularization Ω_union, for two groups, (a) G = {{1,3}, {2,3}}, and three groups, (b) G = {{1,3}, {2,3}, {1,2}}, of two variables each. Their singular points lie on axis-aligned circles, corresponding to each group, whose convex hull is exactly the unit ball. It should be noted that the ball on the left is quite similar to the one of Fig. 2(b), except that its "poles" are flatter, hence discouraging the selection of x_3 without either x_1 or x_2.

3.4 Related Approaches to Structured Sparsity

Norm design through submodular functions. Another approach to structured sparsity relies on submodular analysis (Bach, 2010). Starting from a non-decreasing, submodular set-function F of the support of the parameter vector w—i.e., w ↦ F({j ∈ [1; p]; w_j ≠ 0})—a structured sparsity-inducing norm can be built by considering its convex envelope (tightest convex lower bound) on the unit ℓ∞-norm ball. (A function F: 2^S → R defined on the subsets of a finite set S is said to be submodular if, for any subsets A, B ⊆ S, we have F(A ∩ B) + F(A ∪ B) ≤ F(A) + F(B); see Bach (2011) and references therein.) By selecting the appropriate set-function F, structures similar to those described above can be obtained. This idea can be further extended to symmetric, submodular set-functions of the level sets of w, that is, w ↦ max_{ν ∈ R} F({j ∈ [1; p]; w_j ≥ ν}), thus leading to different types of structures (Bach, 2011), allowing one to shape the level sets of w rather than its support. This approach can also be generalized to any set-function and to other priors on the non-zero variables than the ℓ∞-norm (Obozinski and Bach, 2012).

Non-convex approaches. We mainly focus in this review on convex penalties, but in fact many non-convex approaches have been proposed as well. In the same spirit as the norm (3.2), Huang, Zhang and Metaxas (2011) considered the penalty

ψ(w) := min_{H ⊆ G}  Σ_{g ∈ H} ω_g   such that   {j ∈ [1; p]; w_j ≠ 0} ⊆ ∪_{g ∈ H} g,

where G is a given set of groups, and {ω_g}_{g ∈ G} is a set of positive weights which defines a coding length. In other words, the penalty ψ measures, from an information-theoretic viewpoint, "how much it costs" to represent w. Finally, in the context of compressed sensing, the work of Baraniuk et al. (2010) also focuses on union-closed families of supports, although without information-theoretic considerations. All of these non-convex approaches can in fact also be relaxed to convex optimization problems (Obozinski and Bach, 2012).

Other forms of sparsity. We end this review by discussing sparse regularization functions encoding other types of structures than the structured sparsity penalties we have presented. We start with the total-variation penalty, originally introduced in the image processing community (Rudin, Osher and Fatemi, 1992), which encourages piecewise constant signals. It can be found in the statistics literature under the name of "fused lasso" (Tibshirani et al., 2005). For one-dimensional signals, it can be seen as the ℓ1-norm of finite differences of a vector w in R^p: Ω_TV-1D(w) := Σ_{i=1}^{p−1} |w_{i+1} − w_i|. Extensions have been proposed for multi-dimensional signals and for recovering piecewise constant functions on graphs (Kim, Sohn and Xing, 2009).
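For concreteness, the one-dimensional total-variation penalty Ω_TV-1D is essentially a one-liner (a minimal NumPy sketch; the signal is an arbitrary example):

```python
import numpy as np

def tv_1d(w):
    """Omega_TV-1D(w) = sum_i |w_{i+1} - w_i|, the l1-norm of finite differences."""
    return np.sum(np.abs(np.diff(w)))

w = np.array([0.0, 0.0, 1.0, 1.0, 1.0, -0.5])   # piecewise constant signal
print(tv_1d(w))   # 1.0 + 1.5 = 2.5
```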
We remark that we have presented group-sparsity penalties in Section 3.1, where the goal was to select a few groups of variables. A different approach, called the "exclusive Lasso", consists instead in selecting a few variables inside each group, with some applications in multitask learning (Zhou, Jin and Hoi, 2010). Finally, we would like to mention a few works on automatic feature grouping (Bondell and Reich, 2008; Shen and Huang, 2010; Zhong and Kwok, 2011), which could be used when no a priori group structure G is available. These penalties are typically made of pairwise terms between all variables, and encourage some coefficients to be similar, thereby forming "groups".

3.5 Convex Optimization with Proximal Methods

In this section, we briefly review proximal methods, which are convex optimization methods particularly suited to the norms we have defined. They essentially allow one to solve the problem regularized with a new norm at low implementation and computational cost. For a more complete presentation of optimization techniques adapted to sparsity-inducing norms, see Bach et al. (2012).

Proximal methods constitute a class of first-order techniques typically designed to solve problem (2.1) (Nesterov, 2007; Beck and Teboulle, 2009; Combettes and Pesquet, 2010). They take advantage of the structure of (2.1) as the sum of two convex terms. For simplicity, we will present here the proximal method known as forward-backward splitting, which assumes that at least one of these two terms is smooth. Thus, we will typically assume that the loss function ℓ is convex and differentiable, with Lipschitz-continuous gradients (such as the logistic or square loss), while Ω will only be assumed convex.

Proximal methods have become increasingly popular over the past few years, both in the signal processing community (e.g., Becker, Bobin and Candès, 2009; Wright, Nowak and Figueiredo, 2009; Combettes and Pesquet, 2010, and numerous references therein) and in the machine learning community (e.g., Jenatton et al., 2011b; Chen et al., 2011; Bach et al., 2012, and references therein). In a broad sense, these methods can be described as providing a natural extension of gradient-based techniques when the objective function to minimize has a non-smooth part. Proximal methods are iterative procedures. Their basic principle is to linearize, at each iteration, the function f around the current estimate ŵ, and to update this estimate as the (unique, by strong convexity) solution of the so-called proximal problem. Under the assumption that f is a smooth function, it takes the form:

(3.4)   min_{w ∈ R^p}  f(ŵ) + (w − ŵ)^T ∇f(ŵ) + λ Ω(w) + (L/2) ||w − ŵ||_2^2.

The role of the added quadratic term is to keep the update in a neighborhood of ŵ, where f stays close to its current linear approximation; L > 0 is a parameter which is an upper bound on the Lipschitz constant of ∇f.
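A minimal sketch of the resulting forward-backward iteration (unaccelerated, with a fixed step size 1/L and a generic prox routine solving (3.4)); the example instantiates it for the ℓ1-norm with the square loss, i.e., the Lasso of Eq. (2.2), on synthetic data.

```python
import numpy as np

def forward_backward(grad_f, prox, w0, L, lam, n_iter=200):
    """Unaccelerated proximal-gradient (ISTA-like) iterations for (2.1):
    at each step, a gradient step on the smooth part f followed by the
    proximal operator of (lam/L) * Omega."""
    w = w0.copy()
    for _ in range(n_iter):
        w = prox(w - grad_f(w) / L, lam / L)
    return w

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

# Example: Lasso of Eq. (2.2) on synthetic data.
rng = np.random.RandomState(0)
n, p = 50, 100
X = rng.randn(n, p)
w_true = np.zeros(p)
w_true[:3] = [1.0, -2.0, 1.5]
y = X @ w_true + 0.01 * rng.randn(n)

grad_f = lambda w: X.T @ (X @ w - y) / n     # gradient of (1/(2n))||y - Xw||_2^2
L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of grad_f
w_hat = forward_backward(grad_f, soft_threshold, np.zeros(p), L, lam=0.05)
print("nonzeros:", np.sum(w_hat != 0))
```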
Provided that we can solve the proximal problem (3.4) efficiently, this first iterative scheme constitutes a simple way of solving problem (2.1). It appears under various names in the literature: proximal-gradient techniques (Nesterov, 2007), forward-backward splitting methods (Combettes and Pesquet, 2010), and iterative shrinkage-thresholding algorithms (Beck and Teboulle, 2009). Furthermore, it is possible to guarantee convergence rates for the function values (Nesterov, 2007; Beck and Teboulle, 2009): after k iterations, the precision can be shown to be of order O(1/k), which should be contrasted with the rate for the subgradient case, which is rather O(1/√k).

This first iterative scheme can actually be extended to "accelerated" versions (Nesterov, 2007; Beck and Teboulle, 2009). In that case, the update is not taken to be exactly the result of (3.4); instead, it is obtained as the solution of the proximal problem applied to a well-chosen linear combination of the previous estimates. The function values then converge to the optimum with a rate of O(1/k²), where k is the iteration number. From Nesterov (2004), we know that this rate is optimal within the class of first-order techniques; in other words, accelerated proximal-gradient methods can be as fast as if there were no non-smooth component.

We have so far given an overview of proximal methods without specifying how we precisely handle their core part, namely the computation of the proximal problem, as defined in (3.4).

Proximal problem. We first rewrite problem (3.4) as

min_{w ∈ R^p}  (1/2) ||w − (ŵ − (1/L) ∇f(ŵ))||_2^2 + (λ/L) Ω(w).

Under this form, we can readily observe that when λ = 0, the solution of the proximal problem is identical to the standard gradient update rule. The problem above can be more generally viewed as an instance of the proximal operator (Moreau, 1962) associated with λΩ:

Prox_{λΩ}: u ∈ R^p ↦ argmin_{v ∈ R^p} (1/2) ||u − v||_2^2 + λ Ω(v).

For many choices of regularizers Ω, the proximal problem has a closed-form solution, which makes proximal methods particularly efficient. It turns out that for the norms defined in this paper, we can compute the proximal operator exactly and efficiently in a large number of cases (see Bach et al., 2012). If Ω is chosen to be the ℓ1-norm, the proximal operator is simply the soft-thresholding operator applied elementwise (Donoho and Johnstone, 1995): formally, for all j in [1; p], Prox_{λ||·||_1}[u]_j = sign(u_j) max(|u_j| − λ, 0). For the group Lasso penalty of Eq. (3.1) with q = 2, the proximal operator is a group-thresholding operator and can also be computed in closed form: Prox_{λΩ}[u]_g = (u_g / ||u_g||_2) max(||u_g||_2 − λ, 0) for all g in G. For norms with hierarchical groups of variables (in the sense defined in Section 3.2), the proximal operator can be obtained by a composition of group-thresholding operators in a time linear in the number p of variables (Jenatton et al., 2011b).
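A minimal sketch of the group-thresholding operator just described (disjoint groups, q = 2, with all weights d_g equal to one); once the groups are fixed (e.g., with functools.partial), it could serve as the prox routine of the forward-backward loop sketched above.

```python
import numpy as np

def prox_group_l2(u, t, groups):
    """Closed-form proximal operator of t * sum_g ||u_g||_2 for disjoint groups:
    each block is shrunk towards zero and set exactly to zero when its norm
    falls below the threshold t."""
    v = u.copy()
    for g in groups:
        idx = np.asarray(g)
        norm_g = np.linalg.norm(u[idx])
        v[idx] = 0.0 if norm_g <= t else (1.0 - t / norm_g) * u[idx]
    return v

u = np.array([0.3, -0.4, 2.0, 1.0, 0.1])
groups = [[0, 1], [2, 3], [4]]
print(prox_group_l2(u, t=0.6, groups=groups))
# The first group (norm 0.5 <= 0.6) and the last one are set to zero.
```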
In other settings, e.g., general overlapping groups of ℓ∞-norms, computing the exact proximal operator entails a more expensive polynomial dependency on p, using network-flow techniques (Mairal et al., 2011), but approximate computation is possible without harming the convergence speed (Schmidt, Le Roux and Bach, 2011). Most of these norms and the associated proximal problems are implemented in the open-source software SPAMS (http://www.di.ens.fr/willow/SPAMS/). In summary, with proximal methods, generalizing algorithms from the ℓ1-norm to a structured norm only requires the ability to compute the corresponding proximal operator, which can be done efficiently in many cases.

3.6 Theoretical Analysis

Sparse methods are traditionally analyzed according to three different criteria; it is often assumed that the data were generated by a sparse loading vector w*. Denoting by ŵ a solution of the M-estimation problem in Eq. (2.1), traditional statistical consistency results aim at showing that ||w* − ŵ|| is small for a certain norm ||·||; model consistency considers the estimation of the support of w* as a criterion; while prediction efficiency only cares about the prediction of the model, i.e., with the square loss, the quantity ||Xw* − Xŵ||_2^2 has to be as small as possible.

A striking consequence of assuming that w* has many zero components is that, for all three criteria, consistency is achievable even when p is much larger than n (Zhao and Yu, 2006; Wainwright, 2009; Bickel, Ritov and Tsybakov, 2009; Zhang, 2009).

However, to relax the often unrealistic assumption that the data are generated by a sparse loading vector, and also because a good predictor, especially in the high-dimensional setting, can possibly be much sparser than any potential true model generating the data, prediction efficiency is often formulated in the form of oracle inequalities, where the performance of the estimator is upper-bounded by the performance of any function in a fixed complexity class, reflecting approximation error, plus a complexity term characterizing the class and reflecting the hardness of estimation in that class. We refer the reader to van de Geer (2010) for a review and references on oracle results for the Lasso and the group Lasso.

It should be noted that model selection consistency and prediction efficiency are obtained in quite different regimes of regularization, so that it is not possible to obtain both types of consistency with the same Lasso estimator (Shalev-Shwartz, Srebro and Zhang, 2010). For prediction consistency, the regularization parameter is easily chosen by cross-validation on the prediction error. For model selection consistency, the regularization coefficient should typically be much larger than for prediction consistency; but rather than trying to select an optimal regularization parameter in that case, it is more natural to consider the collection of models obtained along the regularization path and to apply usual model selection methods to choose the best model in the collection.
One method that works reasonably well in practice, sometimes called the "OLS hybrid" for the least-squares loss (Efron et al., 2004), consists in refitting the different models without regularization and choosing the model with the best fit by cross-validation.

In structured sparse situations, such high-dimensional phenomena can also be characterized. Essentially, if one can make the assumption that w* is compatible with the additional prior knowledge on the sparsity pattern encoded in the norm Ω, then some of the assumptions required for consistency can sometimes be relaxed (see Huang and Zhang, 2010; Jenatton, Audibert and Bach, 2011; Huang, Zhang and Metaxas, 2011; Bach, 2010), and faster rates can sometimes be obtained (Huang and Zhang, 2010; Huang, Zhang and Metaxas, 2011; Obozinski, Wainwright and Jordan, 2011; Negahban and Wainwright, 2011; Bach, 2009; Percival, 2012). However, one major difficulty is that some of the conditions for recovery, or for fast rates of convergence, depend on an intricate interaction between the sparsity pattern, the design matrix and the noise covariance, which leads in each case to sufficient conditions that are typically not directly comparable between different structured or unstructured cases (Jenatton, Audibert and Bach, 2011). Moreover, even if the sufficient conditions are satisfied simultaneously for the norms to be compared, sharper bounds on rates and sample complexities would still often be needed to characterize more accurately the improvement resulting from having a stronger structural a priori.

4. SPARSE PRINCIPAL COMPONENT ANALYSIS AND DICTIONARY LEARNING

Unsupervised learning aims at extracting latent representations of the data that are useful for analysis, visualization, denoising, or for extracting relevant information in order to solve a supervised learning problem subsequently. Sparsity or structured sparsity is essential to specify constraints on the representations that improve their identifiability and interpretability.

4.1 Analysis and Synthesis Views of PCA

Depending on how the latent representation is extracted or constructed from the data, it is useful to distinguish two points of view. This is well illustrated in the case of PCA. In the analysis view, PCA aims at finding, sequentially, a set of directions in space that explain the largest fraction of the variance of the data. This can be formulated as an iterative procedure in which a one-dimensional projection of the data with maximal variance is found first, then the data are projected onto the orthogonal subspace (corresponding to a deflation of the covariance matrix), and the process is iterated. In the synthesis view, PCA aims at finding a set of vectors, or dictionary elements (in a terminology closer to signal processing), such that all observed signals admit a linear decomposition on that set with low reconstruction error. In the case of PCA, these two formulations lead to the same solution (an eigenvalue problem). However, in extensions of PCA in which either the dictionary elements or the decompositions of signals are constrained to be sparse or structured, they lead to different algorithms with different solutions.
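A minimal sketch of the analysis view described above, for plain PCA (no sparsity): extract the leading direction of the empirical covariance, deflate, and iterate; this is the baseline procedure that sparse and structured variants modify.

```python
import numpy as np

def pca_by_deflation(X, n_components):
    """Analysis view of PCA: repeatedly extract the maximal-variance direction
    and deflate the covariance matrix."""
    C = np.cov(X, rowvar=False)               # empirical covariance (X centered)
    directions = []
    for _ in range(n_components):
        eigvals, eigvecs = np.linalg.eigh(C)
        u = eigvecs[:, -1]                    # leading eigenvector = max variance
        directions.append(u)
        C = C - eigvals[-1] * np.outer(u, u)  # deflation
    return np.array(directions)

rng = np.random.RandomState(0)
X = rng.randn(200, 5) @ rng.randn(5, 5)
X = X - X.mean(axis=0)
print(pca_by_deflation(X, n_components=2))
```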
The analysis interpretation leads to sequential formulations (d'Aspremont, Bach and El Ghaoui, 2008; Moghaddam, Weiss and Avidan, 2006; Jolliffe, Trendafilov and Uddin, 2003) that consider components one at a time and perform a deflation of the covariance matrix at each step (see Mackey, 2009). The synthesis interpretation leads to non-convex global formulations (see, e.g., Zou, Hastie and Tibshirani, 2006; Moghaddam, Weiss and Avidan, 2006; Aharon, Elad and Bruckstein, 2006; Mairal et al., 2010) which estimate all principal components simultaneously, typically do not require the orthogonality of the components, and are referred to as matrix factorization problems (Singh and Gordon, 2008; Bach, Mairal and Ponce, 2008) in machine learning, and as dictionary learning in signal processing (Olshausen and Field, 1996). While we could also impose structured sparse priors in the analysis view, we will consider from now on the synthesis view, which we introduce with the terminology of dictionary learning.

4.2 Dictionary Learning

Given a matrix X ∈ R^{m×n} of n columns corresponding to n observations in R^m, the dictionary learning problem is to find a matrix D ∈ R^{m×p}, called the dictionary, such that each observation can be well approximated by a linear combination of the p columns (d^k)_{k ∈ [1;p]} of D, called the dictionary elements. If A ∈ R^{p×n} is the matrix of the linear combination coefficients, or decomposition coefficients (or codes), with α^k the k-th column of A being the coefficients for the k-th signal x^k, the matrix product DA is called a decomposition of X.

Learning simultaneously the dictionary D and the coefficients A corresponds to a matrix factorization problem (see Witten, Tibshirani and Hastie, 2009, and references therein). As formulated by Bach, Mairal and Ponce (2008) or Witten, Tibshirani and Hastie (2009), it is natural, when learning a decomposition, to penalize or constrain some norms or pseudo-norms of A and D, say Ω_A and Ω_D respectively, to encode prior information—typically sparsity—about the decomposition of X. While in general the penalties could be defined globally on the matrices A and D, we assume that each column of D and A is penalized separately. This can be written as

(4.1)   min_{A ∈ R^{p×n}, D ∈ R^{m×p}}  (1/(2nm)) ||X − DA||_F^2 + λ Σ_{k=1}^p Ω_D(d^k),   s.t.   Ω_A(α^i) ≤ 1 for all i ∈ [1; n],

where the regularization parameter λ ≥ 0 controls to which extent the dictionary is regularized. If we assume that both regularizers Ω_A and Ω_D are convex, problem (4.1) is convex with respect to A for fixed D, and vice versa. It is however not jointly convex in the pair (A, D), but alternating optimization schemes generally lead to good performance in practice.

4.3 Imposing Sparsity

The choice of the two norms Ω_A and Ω_D is crucial and heavily influences the behavior of dictionary learning. Without regularization, any solution (D, A) is such that DA is the best fixed-rank approximation of X, and the problem can be solved exactly with classical PCA.
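A minimal alternating-minimization sketch in the spirit of (4.1), under simplifying assumptions (a common practical variant rather than the exact formulation: the ℓ1 penalty is put on the codes A, the columns of D are constrained to unit ℓ2-norm, and the data are synthetic):

```python
import numpy as np

def soft_threshold(U, t):
    return np.sign(U) * np.maximum(np.abs(U) - t, 0.0)

def dictionary_learning(X, p, lam=0.1, n_outer=30, n_inner=50):
    """Alternate between sparse coding of A (ISTA on an l1-penalized
    least-squares problem) and a least-squares dictionary update with
    unit-norm columns."""
    m, n = X.shape
    rng = np.random.RandomState(0)
    D = rng.randn(m, p)
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((p, n))
    for _ in range(n_outer):
        # Sparse coding: minimize (1/2)||X - D A||_F^2 + lam ||A||_1 over A.
        L = np.linalg.norm(D, 2) ** 2
        for _ in range(n_inner):
            A = soft_threshold(A - D.T @ (D @ A - X) / L, lam / L)
        # Dictionary update: least squares over D, then rescale its columns to
        # unit norm (the scale is absorbed into A so that D A is unchanged).
        D = X @ A.T @ np.linalg.pinv(A @ A.T)
        norms = np.maximum(np.linalg.norm(D, axis=0), 1e-12)
        D /= norms
        A *= norms[:, None]
    return D, A

X = np.random.RandomState(1).randn(20, 100)
D, A = dictionary_learning(X, p=10)
print(D.shape, A.shape, float(np.mean(A != 0)))
```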
When Ω_A is the ℓ1-norm and Ω_D the ℓ2-norm, we aim at finding a dictionary such that each signal x^i admits a sparse decomposition on the dictionary. In this context, we are essentially looking for a basis in which the data have sparse decompositions, a framework we refer to as sparse dictionary learning. Conversely, when Ω_A is the ℓ2-norm and Ω_D the ℓ1-norm, the formulation induces sparse principal components, i.e., atoms with many zeros, a framework we refer to as sparse PCA. In Section 4.4 and Section 4.5, we replace the ℓ1-norm by the structured norms introduced in Section 3, leading to structured versions of the above estimation problems.

4.4 Adding Structure to Principal Components

One of PCA's main shortcomings is that, even if it finds a small number of important factors, the factors themselves typically involve all original variables. In the last decade, several alternatives to PCA which find sparse and potentially interpretable factors have been proposed, notably non-negative matrix factorization (NMF) (Lee and Seung, 1999) and sparse PCA (SPCA) (Jolliffe, Trendafilov and Uddin, 2003; Zou, Hastie and Tibshirani, 2006; Zass and Shashua, 2007; Witten, Tibshirani and Hastie, 2009).

However, in many applications, constraining only the size of the supports of the factors does not seem appropriate, because the considered factors are expected not only to be sparse but also to have a certain structure. In fact, the popularity of NMF for face image analysis owes essentially to the fact that the method happens to retrieve sets of variables that are partly localized on the face and capture some features or parts of the face which seem intuitively meaningful given our a priori. We might therefore gain in the quality of the factors by enforcing this a priori directly in the matrix factorization constraints. More generally, it would be desirable to encode higher-order information about the supports that reflects the structure of the data. For example, in computer vision, features associated with the pixels of an image are naturally organized on a grid, and the supports of factors explaining the variability of images could be expected to be localized, connected, or to have some other regularity with respect to that grid. Similarly, in genomics, factors explaining the gene expression patterns observed on a microarray could be expected to involve groups of genes corresponding to biological pathways, or sets of genes that are neighbors in a protein-protein interaction network.

Based on these remarks and with the norms presented earlier, sparse PCA is readily extended to structured sparse PCA (SSPCA), which explains the variance of the data by factors that are not only sparse but also respect some a priori structural constraints deemed relevant to model the data at hand: slight variants of the regularization term defined in Section 3 (with the groups defined in Figure 4) can be used successfully for Ω_D.
Experiments on face recognition. By definition, dictionary learning belongs to unsupervised learning; in that sense, our method may appear at first as a tool for exploratory data analysis, which leads us naturally to analyze the results of our decompositions qualitatively (e.g., by visualizing the learned dictionaries). This is obviously a difficult and subjective exercise, beyond the assessment of the consistency of the method on artificial examples where the "true" dictionary is known. For quantitative results, see Jenatton, Obozinski and Bach (2010). (A Matlab toolbox implementing our method can be downloaded from http://www.di.ens.fr/~jenatton/.)

We apply SSPCA to the cropped AR Face Database (Martinez and Kak, 2001), which consists of 2600 face images corresponding to 100 individuals (50 women and 50 men). For each subject, there are 14 non-occluded poses and 12 occluded ones (the occlusions are due to sunglasses and scarves). We reduce the resolution of the images from 165 × 120 pixels to 38 × 27 pixels for computational reasons.

Figure 7 shows examples of learned dictionaries (for p = 36 elements), for NMF, unstructured sparse PCA (SPCA), and SSPCA. While NMF finds sparse but spatially unconstrained patterns, SSPCA selects sparse convex areas that correspond to more natural segments of faces. For instance, meaningful parts such as the mouth and the eyes are recovered by the dictionary.

Fig 7: Top left, examples of faces in the dataset. The three remaining images represent learned dictionaries of faces with p = 36: NMF (top right), SPCA (bottom left) and SSPCA (bottom right). The dictionary elements are sorted in decreasing order of explained variance. While NMF gives sparse, spatially unconstrained patterns, SSPCA finds convex areas that correspond to more natural face segments. SSPCA captures the left/right illuminations and retrieves pairs of symmetric patterns. Some displayed patterns do not seem to be convex, e.g., nonzero patterns located at two opposite corners of the grid. However, a closer look at these dictionary elements shows that convex shapes are indeed selected, and that small numerical values (of the kind that regularizing by the ℓ2-norm may produce) give the visual impression of zeroes inside convex nonzero patterns. This also shows that if a non-convex pattern has to be selected, it will be selected through its convex hull.

4.5 Hierarchical Dictionary Learning

In this section, we consider sparse dictionary learning where the structured sparse prior knowledge is put on the decomposition coefficients, i.e., the matrix A in Eq. (4.1), and present an application to text documents.

Text documents. The goal of probabilistic topic models is to find a low-dimensional representation of a collection of documents, where the representation should provide a semantic description of the collection. Approaching the problem in a parametric Bayesian framework, latent Dirichlet allocation (LDA) (Blei, Ng and Jordan, 2003) models documents, represented as vectors of word counts, as a mixture of a predefined number of latent topics, defined as multinomial distributions over a fixed vocabulary. The number of topics is usually small compared to the size of the vocabulary (e.g., 100 against 10,000), so that the topic proportions of each document provide a compact representation of the corpus.
4.5 Hierarchical Dictionary Learning

In this section, we consider sparse dictionary learning where the structured sparse prior knowledge is put on the decomposition coefficients, i.e., the matrix A in Eq. (4.1), and present an application to text documents.

Text documents. The goal of probabilistic topic models is to find a low-dimensional representation of a collection of documents, where the representation should provide a semantic description of the collection. Approaching the problem in a parametric Bayesian framework, latent Dirichlet allocation (LDA) (Blei, Ng and Jordan, 2003) models documents, represented as vectors of word counts, as a mixture of a predefined number of latent topics, defined as multinomial distributions over a fixed vocabulary. The number of topics is usually small compared to the size of the vocabulary (e.g., 100 against 10,000), so that the topic proportions of each document provide a compact representation of the corpus.

In fact, the problem addressed by LDA is fundamentally a matrix factorization problem. For instance, Buntine (2002) argued that LDA can be interpreted as a Dirichlet-multinomial counterpart of factor analysis. We can actually cast the problem in the dictionary learning formulation presented before (doing so, we simply trade the multinomial likelihood for a least-squares formulation). Indeed, suppose that the signals X = [x_1, ..., x_n] in R^{m×n} are each the so-called bag-of-words representation of each of n documents over a vocabulary of m words, i.e., x_i is a vector whose k-th component is the empirical frequency in document i of the k-th word of a fixed lexicon. If we constrain the entries of D and A to be nonnegative, and the dictionary elements d_j to have unit ℓ1-norm, the decomposition (D, A) can be interpreted as the parameters of a topic-mixture model. Sparsity here ensures that a document is described by a small number of topics.

Switching to structured sparsity makes it possible to organize the dictionary of topics automatically in the process of learning it. Assume that Ω_A in Eq. (4.1) is a tree-structured regularization, as illustrated in Figure 5; in this case, in the light of Section 3.2, if the decomposition of a document involves a certain topic, then all ancestral topics in the tree are also present in the topic decomposition. Since the hierarchy is shared by all documents, the topics close to the root participate in every decomposition, and given that the dictionary is learned, this mechanism forces those topics to be quite generic, essentially gathering the lexicon which is common to all documents. Conversely, the deeper the topics in the tree, the more specific they should be. It should be noted that such hierarchical dictionaries can also be obtained with generative probabilistic models, typically based on non-parametric Bayesian priors over trees or paths in trees, which extend the LDA model to topic hierarchies (Blei, Griffiths and Jordan, 2010; Adams, Ghahramani and Jordan, 2010).

Fig 8: Example of a topic hierarchy estimated from 1714 NIPS proceedings papers (from 1988 through 1999). Each node corresponds to a topic whose 5 most important words are displayed. Single characters such as n, t, r are part of the vocabulary and often appear in NIPS papers, and their place in the hierarchy is semantically relevant to children topics.
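A standard construction of such a tree-structured penalty, consistent with the behavior described above, takes one group per node of the tree containing that node together with all of its descendants; it may differ in detail (e.g., group weights) from the exact penalty of Section 3.2. The snippet below builds these groups for a small hypothetical topic tree and evaluates the resulting penalty on two coefficient vectors; the tree and the values are made up for illustration and are not those of the NIPS experiment.

```python
import numpy as np

def tree_groups(children, root=0):
    """For each node of a tree (dict node -> list of children), return the
    group made of that node and all of its descendants.  With the penalty
    Omega_A(a) = sum over groups of ||a_g||_2, a topic can carry a nonzero
    coefficient only if all of its ancestors do as well."""
    groups = {}
    def descendants(node):
        nodes = [node]
        for child in children.get(node, []):
            nodes.extend(descendants(child))
        groups[node] = np.array(nodes)
        return nodes
    descendants(root)
    return [groups[node] for node in sorted(groups)]

def tree_penalty(a, groups):
    return sum(np.linalg.norm(a[g]) for g in groups)

# Hypothetical 7-topic hierarchy: 0 is the root, 1 and 2 its children, etc.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
groups = tree_groups(children)

a_hierarchical = np.array([1.0, 0.8, 0.0, 0.5, 0.0, 0.0, 0.0])  # path root -> 1 -> 3
a_orphan       = np.array([0.0, 0.0, 0.0, 0.5, 0.8, 0.0, 1.0])  # leaves without ancestors
# Same nonzero magnitudes, but the hierarchical support is cheaper.
print(tree_penalty(a_hierarchical, groups), tree_penalty(a_orphan, groups))
```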
Visualization of NIPS proceedings. We qualitatively illustrate our approach on the NIPS proceedings from 1988 through 1999 (Griffiths and Steyvers, 2004). After removing words appearing fewer than 10 times, the dataset is composed of 1714 articles, with a vocabulary of 8274 words. As explained above, we enforce both the dictionary and the sparse coefficients to be nonnegative, and constrain the dictionary elements to have unit ℓ1-norm. Figure 8 displays an example of a learned dictionary with 13 topics, obtained by using a tree-structured penalty (see Section 3.2) on the coefficients A and by manually selecting λ = 2^{−15}. (The regularization parameter striking a good compromise between sparsity and reconstruction of the data is chosen here by hand because (a) cross-validation would yield a significantly less sparse dictionary and (b) model selection criteria would not apply without serious caveats, since the dictionary is learned at the same time.) As expected, and similarly to Blei, Griffiths and Jordan (2010), we capture the stop words at the root of the tree, and topics reflecting the different subdomains of the conference such as neurosciences, optimization or learning theory.

5. HIGH-DIMENSIONAL NON-LINEAR VARIABLE SELECTION

In this section, we show how structured sparsity-inducing norms may be used to provide an efficient solution to the problem of high-dimensional non-linear variable selection. Namely, given p variables x = (x_1, ..., x_p), our aim is to find a non-linear function f(x_1, ..., x_p) which depends only on a few variables. First approaches to the problem have considered restricted functional forms such as f(x_1, ..., x_p) = f_1(x_1) + ··· + f_p(x_p), where f_1, ..., f_p are univariate non-linear functions (Ravikumar et al., 2009; Bach, 2008). However, many non-linear functions cannot be expressed as sums of functions of this form. Additional interactions have been added, leading to functions of the form f(x_1, ..., x_p) = Σ_{J ⊂ {1,...,p}, |J| ≤ 2} f_J(x_J) (Lin and Zhang, 2006). While second-order interactions make the class of functions larger, our aim in this section is to consider functions which can be expressed as a sparse linear combination of the form f(x_1, ..., x_p) = Σ_{J ⊂ {1,...,p}} f_J(x_J), i.e., a combination of functions defined on potentially larger subsets of variables.

The main difficulties associated with this problem are that (1) each function f_J has to be estimated, leading to a non-parametric problem, and (2) there are exponentially many such functions. We propose, however, an approach that overcomes both difficulties, based on the ideas that estimating functions rather than vectors can be tackled with estimators in reproducing kernel Hilbert spaces (see Section 5.1), and that the complexity issues can be addressed by imposing some structure among all the subsets J ⊂ {1, ..., p} (see Section 5.2).

5.1 Multiple Kernel Learning: From Linear to Non-Linear Predictions

Reproducing kernel Hilbert spaces (RKHSes) are arguably the simplest spaces for the non-parametric estimation of non-linear functions, since most learning algorithms for linear models can be directly ported to any RKHS via simple kernelization. We therefore start by reviewing learning from a single, and later from multiple, reproducing kernels, since our approach will be based on combining functions from multiple RKHSes (in fact, a hierarchy of them). For more details, see Bach (2008).

Single kernel learning. Let us assume that the n input data points x^{(1)}, ..., x^{(n)} belong to a set X (not necessarily R^p), and consider predictors of the form ⟨f, Φ(x)⟩, where Φ : X → F is a map from the input space to a reproducing kernel Hilbert space F (associated with the kernel function k), which we refer to as the feature space. These predictors are linearly parameterized, but may depend non-linearly on x. We consider the following estimation problem:
$$
\min_{f \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y^{(i)}, \langle f, \Phi(x^{(i)}) \rangle\big) + \frac{\lambda}{2} \|f\|_{\mathcal{F}}^2 ,
$$
where ‖·‖_F is the Hilbertian norm associated with F. The representer theorem (Kimeldorf and Wahba, 1971) states that, for all loss functions (potentially nonconvex), the solution f admits an expansion f = Σ_{i=1}^n α_i Φ(x^{(i)}), so that, replacing f by this expression, we can instead minimize
$$
\min_{\alpha \in \mathbb{R}^n} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y^{(i)}, (K\alpha)_i\big) + \frac{\lambda}{2} \alpha^\top K \alpha ,
$$
where K is the kernel matrix, the n × n matrix whose element (i, j) is equal to ⟨Φ(x^{(i)}), Φ(x^{(j)})⟩ = k(x^{(i)}, x^{(j)}). This optimization problem involves the observations x^{(1)}, ..., x^{(n)} only through the kernel matrix K, and can thus be solved as long as K can be evaluated efficiently. See Shawe-Taylor and Cristianini (2004) for more details.
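For the particular case of the squared loss ℓ(y, ŷ) = ½(y − ŷ)², the reduced problem in α is a linear system, (K + nλI)α = y, which gives a compact illustration of the representer-theorem reduction above. The snippet below is a minimal sketch with a Gaussian kernel on made-up one-dimensional data; the bandwidth, the regularization level and the data are arbitrary choices for illustration only.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs of rows.
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge(K, y, lam=0.1):
    # Squared-loss instance of the single-kernel problem: the representer
    # theorem reduces it to the linear system (K + n*lam*I) alpha = y.
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

# Made-up one-dimensional regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(40, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)

K = gaussian_kernel(X, X, gamma=2.0)
alpha = kernel_ridge(K, y, lam=0.01)
X_test = np.linspace(-2, 2, 5).reshape(-1, 1)
y_pred = gaussian_kernel(X_test, X, gamma=2.0) @ alpha  # <f, Phi(x)> = sum_i alpha_i k(x, x^(i))
print(y_pred)
```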
Multiple kernel learning (MKL). We can now assume that we are given m Hilbert spaces F_j, j = 1, ..., m, and look for predictors of the form f(x) = g_1(x) + ··· + g_m(x), where each g_j ∈ F_j (note that, unlike before, the function g_j is not restricted to depend only on a subpart of x). In order to have many g_j equal to zero, we can penalize f using a sum of norms similar to the group Lasso penalties introduced earlier, namely Σ_{j=1}^m ‖g_j‖_{F_j}. This leads to the selection of functions. Moreover, it turns out that the optimization problem may also be expressed in terms of the m kernel matrices: it is equivalent to learning a sparse linear combination K̂ = Σ_{j=1}^m η_j K_j of kernel matrices (with many η_j equal to zero), with α then the solution of the single kernel learning problem for K̂. For more details, see Bach (2008).

From MKL to sparse generalized additive models. As shown above, the MKL framework is defined for any set of m RKHSes on the same base set X. When the base set is itself a Cartesian product of p base sets, i.e., X = X_1 × ··· × X_p, it is common to consider m = p RKHSes, each defined on a single X_i, leading to the desired functional form f_1(x_1) + ··· + f_p(x_p). To overcome the limitation of this functional form, we need to consider a more complex expansion.
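The equivalence between the sum-of-norms penalty and learning a sparse combination K̂ = Σ_j η_j K_j suggests a simple alternating scheme: solve a single-kernel problem for the current combined kernel, then reweight each kernel according to the norm of its component. The sketch below implements this heuristic for the squared loss and the closely related squared variant of the penalty, (Σ_j ‖g_j‖_{F_j})², which admits the same sparse-combination interpretation; it reuses gaussian_kernel, kernel_ridge, X and y from the previous snippet. It is only a rough illustration of the variational picture, not the algorithm of Bach (2008), and the update rule, number of iterations and stopping criterion are simplifications.

```python
def mkl_alternating(kernels, y, lam=0.1, n_iter=50, eps=1e-12):
    """Heuristic alternating scheme for MKL with squared loss and the
    squared sum-of-norms penalty (lambda/2)*(sum_j ||g_j||)^2.
    kernels: list of n x n kernel matrices K_j."""
    m = len(kernels)
    eta = np.full(m, 1.0 / m)                 # kernel weights on the simplex
    for _ in range(n_iter):
        K_hat = sum(e * K for e, K in zip(eta, kernels))
        alpha = kernel_ridge(K_hat, y, lam)   # single-kernel step for K_hat
        # Norm of each component: ||g_j||_{F_j} = eta_j * sqrt(alpha' K_j alpha).
        norms = np.array([e * np.sqrt(max(alpha @ K @ alpha, 0.0))
                          for e, K in zip(eta, kernels)])
        eta = norms / (norms.sum() + eps)     # closed-form weight update
    return eta, alpha

# Example: three candidate Gaussian kernels on the data from the previous
# snippet; kernels whose components carry little signal receive small weights.
kernels = [gaussian_kernel(X, X, gamma=g) for g in (0.1, 2.0, 50.0)]
eta, alpha = mkl_alternating(kernels, y, lam=0.01)
print(eta)
```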
5.2 Hierarchical Kernel Learning

In this section, we consider functional expansions with up to m = 2^p terms corresponding to different RKHSes, each defined on a Cartesian product of a subset of the p separate input spaces. Specifically, we consider functions of the form f(x_1, ..., x_p) = Σ_{J ⊂ {1,...,p}} f_J(x_J), with f_J chosen to live in an RKHS F_J defined on the variables x_J. Penalizing by the norm Σ_{J ⊂ {1,...,p}} ‖f_J‖_{F_J} would in theory lead to an appropriate selection of functions from the various RKHSes (and to learning a sparse linear combination of the corresponding kernel matrices). However, in practice, there are 2^p such predictors, which is not algorithmically feasible.

Fig 9: Power set of the set {1, ..., 4}: in blue, an authorized set of selected subsets; in red, an example of a group used within the norm (a subset and all of its descendants in the DAG).

This is where structured sparsity comes into play. In order to obtain polynomial-time algorithms and theoretically controlled predictive performance, we may add an extra constraint to the problem. Namely, we endow the power set of {1, ..., p} with the partial order given by set inclusion, and in this directed acyclic graph (DAG) we require that predictors f select a subset only after all of its ancestors have been selected. This can be achieved in a convex formulation using a structured-sparsity-inducing norm of the type presented in Section 3.2, but defined by a hierarchy of groups as follows:
$$
\Omega\big((f_H)_{H \subset \{1,\dots,p\}}\big) = \sum_{J \subset \{1,\dots,p\}} \Big( \sum_{H \supset J} \|f_H\|_2^2 \Big)^{1/2} .
$$
As illustrated in Figure 9, this norm corresponds to overlapping groups of variables defined on the directed acyclic graph of all subsets of {1, ..., p}. We will explain briefly how introducing this norm may lead to polynomial-time algorithms and what theoretical guarantees are associated with it. Illustrations of the application of hierarchical kernel learning to real data can be found in Bach (2009).

Polynomial-time estimation algorithm. While we are, a priori, still facing an estimation problem with 2^p functions, it can be solved using an active set method, which considers adding a component f_J ∈ F_J (resp. a kernel K_J) to the active set of predictors (resp. kernels). The two crucial aspects are (1) to add the right kernel, i.e., to choose among the 2^p candidates which one to add, and (2) when to stop. As shown in Bach (2009), these steps may be carried out efficiently for certain collections of RKHSes F_J, in particular those for which we are able to compute efficiently (i.e., in time polynomial in p) the sum Σ_{J ⊂ {1,...,p}} K_J. This is the case, for example, for Gaussian kernels k_J(x_J, x'_J) = exp(−γ‖x_J − x'_J‖²_2), for which the sum over all subsets factorizes coordinate-wise (a small numerical check is given at the end of this section).

Theoretical analysis. Bach (2009) showed that, under appropriate assumptions, estimation under high-dimensional scaling, i.e., for p ≫ n but log p = O(n), is possible in this situation, in spite of the fact that the number of terms in the expansion is now potentially doubly exponential in n.
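As a small check of the polynomial-time computation mentioned above, the following snippet compares the brute-force sum of the Gaussian kernels k_J over all 2^p − 1 non-empty subsets J with the coordinate-wise product form Π_i (1 + exp(−γ(x_i − x'_i)²)) − 1. The unweighted sum is used purely for illustration; the actual hierarchical kernel learning algorithm of Bach (2009) relies on weighted sums and further quantities that are not reproduced here.

```python
import itertools
import numpy as np

def subset_kernel_sum_bruteforce(x, xp, gamma=1.0):
    # Sum of Gaussian kernels k_J(x_J, x'_J) over all 2^p - 1 non-empty subsets J.
    p = len(x)
    total = 0.0
    for r in range(1, p + 1):
        for J in itertools.combinations(range(p), r):
            J = list(J)
            total += np.exp(-gamma * np.sum((x[J] - xp[J]) ** 2))
    return total

def subset_kernel_sum_factorized(x, xp, gamma=1.0):
    # Same quantity in O(p): each coordinate contributes a factor (1 + k_i).
    t = np.exp(-gamma * (x - xp) ** 2)
    return np.prod(1.0 + t) - 1.0

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(8), rng.standard_normal(8)
print(subset_kernel_sum_bruteforce(x, xp), subset_kernel_sum_factorized(x, xp))
# The two values coincide, so this sum over exponentially many kernels
# can be evaluated in time linear in p.
```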
6. CONCLUSION

In this paper, we reviewed several approaches to structured sparsity based on convex optimization and the design of appropriate sparsity-inducing norms. Analyses and algorithms developed for the traditional ℓ1-norm can readily be extended to these new norms, making them efficient and flexible tools for introducing prior knowledge into high-dimensional statistical problems. We also presented several applications to supervised and unsupervised learning problems, where the proper use of additional knowledge leads to improved interpretability of the sparse estimates and/or increased predictive performance.

ACKNOWLEDGEMENTS

Francis Bach, Rodolphe Jenatton and Guillaume Obozinski are supported in part by ANR under grant MGA ANR-07-BLAN-0311 and the European Research Council (SIERRA Project). Julien Mairal is supported by NSF grant SES-0835531 and NSF award CCF-0939370. The authors would like to thank the anonymous reviewers, whose comments have greatly contributed to improving the quality of this paper.

REFERENCES

Adams, R., Ghahramani, Z. and Jordan, M. (2010). Tree-Structured Stick Breaking for Hierarchical Data. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta, eds.) 19–27.
Aharon, M., Elad, M. and Bruckstein, A. (2006). K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Transactions on Signal Processing 54 4311–4322.
Bach, F. (2008). Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research 9 1179–1225.
Bach, F. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems 21.
Bach, F. (2010). Structured Sparsity-Inducing Norms Through Submodular Functions. In Advances in Neural Information Processing Systems 23.
Bach, F. (2011). Learning with Submodular Functions: A Convex Optimization Perspective. Technical Report No. 00645271, HAL.
Bach, F. (2011). Shaping level sets with submodular functions. In Advances in Neural Information Processing Systems 24.
Bach, F., Mairal, J. and Ponce, J. (2008). Convex Sparse Matrix Factorizations. Technical Report, Preprint arXiv:0812.1869.
Bach, F., Jenatton, R., Mairal, J. and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 1–106.
Baraniuk, R. G., Cevher, V., Duarte, M. F. and Hegde, C. (2010). Model-based compressive sensing. IEEE Transactions on Information Theory 56 1982–2001.
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2 183–202.
Becker, S., Bobin, J. and Candès, E. (2009). NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. SIAM Journal on Imaging Sciences 4 1–39.
Bickel, P., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37 1705–1732.
Blei, D., Griffiths, T. L. and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM 57 1–30.
Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research 3 993–1022.
Bondell, H. D. and Reich, B. J. (2008).
Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64 115–123.
Borwein, J. M. and Lewis, A. S. (2006). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer.
Buntine, W. L. (2002). Variational Extensions to EM and Multinomial PCA. In Proceedings of the European Conference on Machine Learning (ECML).
Candès, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory 51 4203–4215.
Cevher, V., Duarte, M. F., Hegde, C. and Baraniuk, R. G. (2008). Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems 20.
Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic Decomposition by Basis Pursuit. SIAM Journal on Scientific Computing 20 33–61.
Chen, X., Lin, Q., Kim, S., Carbonell, J. G. and Xing, E. P. (2011). Smoothing Proximal Gradient Method for General Structured Sparse Learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI).
Combettes, P. L. and Pesquet, J. C. (2010). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer.
d'Aspremont, A., Bach, F. and El Ghaoui, L. (2008). Optimal Solutions for Sparse Principal Component Analysis. Journal of Machine Learning Research 9 1269–1294.
Donoho, D. L. and Johnstone, I. M. (1995). Adapting to Unknown Smoothness via Wavelet Shrinkage. Journal of the American Statistical Association 90 1200–1224.
Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Annals of Statistics 32 407–451.
Friedman, J., Hastie, T. and Tibshirani, R. (2010). A note on the group Lasso and a sparse group Lasso. Preprint.
Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics 1 302–332.
Gramfort, A. and Kowalski, M. (2009). Improving M/EEG source localization with an inter-condition sparse prior. In IEEE International Symposium on Biomedical Imaging.
Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences 101 5228–5235.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Springer-Verlag.
Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Annals of Statistics 38 1978–2004.
Huang, J., Zhang, T. and Metaxas, D. (2011). Learning with structured sparsity. Journal of Machine Learning Research 12 3371–3412.
Jacob, L., Obozinski, G. and Vert, J. P. (2009). Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning (ICML).
Jenatton, R., Audibert, J. Y. and Bach, F. (2011). Structured Variable Selection with Sparsity-Inducing Norms. Journal of Machine Learning Research 12 2777–2824.
Jenatton, R., Obozinski, G. and Bach, F. (2010). Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics (AISTATS).
Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F. and Thirion, B. (2011a).
Multi-scale mining of fMRI data with hierarchical structured sparsity. Technical Report, Preprint arXiv:1105.0363. To appear in SIAM Journal on Imaging Sciences.
Jenatton, R., Mairal, J., Obozinski, G. and Bach, F. (2011b). Proximal Methods for Hierarchical Sparse Coding. Journal of Machine Learning Research 12 2297–2334.
Jolliffe, I. T., Trendafilov, N. T. and Uddin, M. (2003). A modified principal component technique based on the Lasso. Journal of Computational and Graphical Statistics 12 531–547.
Kavukcuoglu, K., Ranzato, M. A., Fergus, R. and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Kim, S., Sohn, K. A. and Xing, E. P. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25 204–212.
Kim, S. and Xing, E. P. (2010). Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity. In Proceedings of the International Conference on Machine Learning (ICML).
Kimeldorf, G. S. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Applicat. 33 82–95.
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 788–791.
Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics 34 2272–2297.
Liu, H., Palatucci, M. and Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the International Conference on Machine Learning (ICML).
Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking Advantage of Sparsity in Multi-Task Learning. In Proceedings of the Conference on Learning Theory.
Lounici, K., Pontil, M., van de Geer, S. and Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Annals of Statistics 39 2164–2204.
Mackey, L. (2009). Deflation Methods for Sparse PCA. In Advances in Neural Information Processing Systems 21.
Mairal, J. (2010). Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure de Cachan. Available at http://tel.archives-ouvertes.fr/tel-00595312/fr/.
Mairal, J., Bach, F., Ponce, J. and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research 11 19–60.
Mairal, J., Jenatton, R., Obozinski, G. and Bach, F. (2011). Convex and Network Flow Optimization for Structured Sparsity. Journal of Machine Learning Research 12 2681–2720.
Mallat, S. G. (1999). A Wavelet Tour of Signal Processing. Academic Press.
Martinez, A. M. and Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence 23 228–233.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics 34 1436–1462.
Moghaddam, B., Weiss, Y. and Avidan, S. (2006).
Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems 18.
Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris Sér. A Math. 255 2897–2899.
Needell, D. and Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis 26 301–321.
Negahban, S. N. and Wainwright, M. J. (2011). Simultaneous Support Recovery in High Dimensions: Benefits and Perils of Block ℓ1/ℓ∞-Regularization. IEEE Transactions on Information Theory 57 3841–3863.
Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.
Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Technical Report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain.
Obozinski, G. and Bach, F. (2012). Convex relaxation for combinatorial penalties. Technical Report, HAL.
Obozinski, G., Jacob, L. and Vert, J. P. (2011). Group Lasso with overlaps: the latent group Lasso approach. Technical Report No. inria-00628498, HAL.
Obozinski, G., Taskar, B. and Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing 20 231–252.
Obozinski, G., Wainwright, M. J. and Jordan, M. I. (2011). Support Union Recovery in High-Dimensional Multivariate Regression. Annals of Statistics 39 1–47.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 607–609.
Percival, D. (2012). Theoretical Properties of the Overlapping Group Lasso. Electronic Journal of Statistics 6 269–288.
Quattoni, A., Carreras, X., Collins, M. and Darrell, T. (2009). An efficient projection for ℓ1/ℓ∞ regularization. In Proceedings of the International Conference on Machine Learning (ICML).
Rao, N. S., Nowak, R. D., Wright, S. J. and Kingsbury, N. G. (2011). Convex approaches to model wavelet sparsity patterns. In International Conference on Image Processing (ICIP).
Rapaport, F., Barillot, E. and Vert, J. P. (2008). Classification of arrayCGH data using fused SVM. Bioinformatics 24 i375–i382.
Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society, Series B 71 1009–1030.
Roth, V. and Fischer, B. (2008). The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the International Conference on Machine Learning (ICML).
Rudin, L. I., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60 259–268.
Schmidt, M., Le Roux, N. and Bach, F. (2011). Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization.
In Advances in Neural Information Processing Systems 24.
Schmidt, M. and Murphy, K. (2010). Convex Structure Learning in Log-Linear Models: Beyond Pairwise Potentials. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
Shalev-Shwartz, S., Srebro, N. and Zhang, T. (2010). Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization 20.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.
Shen, X. and Huang, H. C. (2010). Grouping pursuit through a regularization solution surface. Journal of the American Statistical Association 105 727–739.
Singh, A. P. and Gordon, G. J. (2008). A Unified View of Matrix Factorization Models. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases.
Sprechmann, P., Ramirez, I., Sapiro, G. and Eldar, Y. (2010). Collaborative hierarchical sparse modeling. In 44th Annual Conference on Information Sciences and Systems (CISS) 1–6. IEEE.
Stojnic, M., Parvaresh, F. and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Transactions on Signal Processing 57 3075–3085.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society, Series B 67 91–108.
Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory 50 2231–2242.
Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory 52.
Turlach, B. A., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349–363.
van de Geer, S. (2010). ℓ1-Regularization in High-Dimensional Statistical Models. In Proceedings of the International Congress of Mathematicians 4 2351–2369.
Varoquaux, G., Jenatton, R., Gramfort, A., Obozinski, G., Thirion, B. and Bach, F. (2010). Sparse Structured Dictionary Learning for Brain Resting-State Activity Modeling. In NIPS Workshop on Practical Applications of Sparse Modeling: Open Issues and New Directions.
Wainwright, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory 55 2183–2202.
Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515.
Wright, S. J., Nowak, R. D. and Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing 57 2479–2493.
Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for Lasso penalized regression. Annals of Applied Statistics 2 224–244.
Xiang, Z. J., Xi, Y. T., Hasson, U. and Ramadge, P. J. (2009). Boosting with spatial regularization. In Advances in Neural Information Processing Systems 22.
Yuan, M. (2010). High Dimensional Inverse Covariance Matrix Estimation via Linear Programming.
Journal of Machine Learning Research 11 2261–2286.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B 68 49–67.
Yuan, G. X., Chang, K. W., Hsieh, C. J. and Lin, C. J. (2010). Comparison of Optimization Methods and Software for Large-Scale L1-Regularized Linear Classification. Journal of Machine Learning Research 11 3183–3234.
Zass, R. and Shashua, A. (2007). Nonnegative sparse PCA. In Advances in Neural Information Processing Systems 19.
Zhang, T. (2009). Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics 37 2109–2144.
Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics 37 3468–3497.
Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research 7 2541–2563.
Zhong, L. W. and Kwok, J. T. (2011). Efficient Sparse Modeling with Automatic Feature Grouping. In Proceedings of the International Conference on Machine Learning (ICML).
Zhou, Y., Jin, R. and Hoi, S. C. H. (2010). Exclusive Lasso for multi-task feature selection. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
Zou, H. (2006). The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101 1418–1429.
Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics 15 265–286.