Gradient Sliding for Composite Optimization

We consider in this paper a class of composite optimization problems whose objective function is given by the summation of a general smooth and nonsmooth component, together with a relatively simple nonsmooth term. We present a new class of first-order methods, namely the gradient sliding algorithms, which can skip the computation of the gradient for the smooth component from time to time. As a consequence, these algorithms require only $O(1/\sqrt{\epsilon})$ gradient evaluations for the smooth component in order to find an $\epsilon$-solution for the composite problem, while still maintaining the optimal $O(1/\epsilon^2)$ bound on the total number of subgradient evaluations for the nonsmooth component. We then present a stochastic counterpart for these algorithms and establish similar complexity bounds for solving an important class of stochastic composite optimization problems. Moreover, if the smooth component in the composite function is strongly convex, the developed gradient sliding algorithms can significantly reduce the number of gradient and subgradient evaluations for the smooth and nonsmooth component to $O(\log(1/\epsilon))$ and $O(1/\epsilon)$, respectively. Finally, we generalize these algorithms to the case when the smooth component is replaced by a nonsmooth one possessing a certain bi-linear saddle point structure.

Authors: Guanghui Lan

Keywords: convex programming, complexity, gradient sliding, Nesterov's method, data analysis

AMS 2000 subject classification: 90C25, 90C06, 90C22, 49M37

The author of this paper was partially supported by NSF grant CMMI-1000347, DMS-1319050, ONR grant N00014-13-1-0036 and NSF CAREER Award CMMI-1254446. Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL, 32611 (email: glan@ise.ufl.edu).

1 Introduction

In this paper, we consider a class of composite convex programming (CP) problems given in the form of

$$\Psi^* \equiv \min_{x \in X} \{\Psi(x) := f(x) + h(x) + \mathcal{X}(x)\}. \quad (1.1)$$

Here, $X \subseteq \mathbb{R}^n$ is a closed convex set, $\mathcal{X}$ is a relatively simple convex function, and $f: X \to \mathbb{R}$ and $h: X \to \mathbb{R}$, respectively, are general smooth and nonsmooth convex functions satisfying

$$f(x) \le f(y) + \langle \nabla f(y), x - y \rangle + \tfrac{L}{2}\|x - y\|^2, \quad \forall x, y \in X, \quad (1.2)$$

$$h(x) \le h(y) + \langle h'(y), x - y \rangle + M\|x - y\|, \quad \forall x, y \in X, \quad (1.3)$$

for some $L > 0$ and $M > 0$, where $h'(x) \in \partial h(x)$. Composite problems of this type appear in many data analysis applications, where either $f$ or $h$ corresponds to a certain data fidelity term, while the other components in $\Psi$ denote regularization terms used to enforce certain structural properties of the obtained solutions.

Throughout this paper, we assume that one can access the first-order information of $f$ and $h$ separately. More specifically, in the deterministic setting, we can compute the exact gradient $\nabla f(x)$ and a subgradient $h'(x) \in \partial h(x)$ for any $x \in X$. We also consider the stochastic situation where only a stochastic subgradient of the nonsmooth component $h$ is available.
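To make the setting concrete, the following minimal sketch (ours, not part of the original text) instantiates (1.1)-(1.3) with a least-squares fidelity term and an $\ell_1$-type regularizer; the matrices A, B, the weight lam, and the oracle names grad_f and subgrad_h are illustrative assumptions only.

```python
import numpy as np

# Hypothetical instance of (1.1): f(x) = 0.5*||Ax - b||_2^2 (smooth),
# h(x) = lam*||Bx||_1 (nonsmooth), and calX = 0 over X = R^n.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((40, 20)), rng.standard_normal(40)
B, lam = rng.standard_normal((30, 20)), 0.1

L = np.linalg.norm(A.T @ A, 2)        # Lipschitz constant of grad f, cf. (1.2)
M = 2 * lam * np.linalg.norm(B, 2)    # (1.3) holds with M = 2*M_h, cf. the
                                      # "Notation and terminology" paragraph below

def grad_f(x):
    """Exact gradient of the smooth component f."""
    return A.T @ (A @ x - b)

def subgrad_h(x):
    """A subgradient h'(x) in the subdifferential of h."""
    return lam * (B.T @ np.sign(B @ x))
```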
The main goal of this paper is to provide a better theoretical understanding of how many gradient evaluations of $\nabla f$ and subgradient evaluations of $h'$ are needed in order to find a certain approximate solution of (1.1). Most existing first-order methods for solving (1.1) require the computation of both $\nabla f$ and $h'$ in each iteration. In particular, since the objective function $\Psi$ in (1.1) is nonsmooth, these algorithms would require $O(1/\epsilon^2)$ first-order iterations, and hence $O(1/\epsilon^2)$ evaluations for both $\nabla f$ and $h'$, to find an $\epsilon$-solution of (1.1), i.e., a point $\bar{x} \in X$ s.t. $\Psi(\bar{x}) - \Psi^* \le \epsilon$. Much recent research effort has been directed to reducing the impact of the Lipschitz constant $L$ on the aforementioned complexity bounds for composite optimization. For example, Juditsky, Nemirovski and Tauvel showed in [8] that by using a variant of the mirror-prox method, the number of evaluations for $\nabla f$ and $h'$ required to find an $\epsilon$-solution of (1.1) can be bounded by

$$O\left(\frac{L_f}{\epsilon} + \frac{M^2}{\epsilon^2}\right).$$

By developing an enhanced version of Nesterov's accelerated gradient method [15, 16], Lan [11] further showed that the above bound can be improved to

$$O\left(\sqrt{\frac{L_f}{\epsilon}} + \frac{M^2}{\epsilon^2}\right). \quad (1.4)$$

It is also shown in [11] that similar bounds hold for the stochastic case where only unbiased estimators for $\nabla f$ and $h'$ are available. It is observed in [11] that such a complexity bound is not improvable if one can only access the first-order information for the summation of $f$ and $h$ all together.

Note, however, that it is unclear whether the complexity bound in (1.4) is optimal if one does have access to the first-order information of $f$ and $h$ separately. In particular, one would expect that the number of evaluations for $\nabla f$ can be bounded by $O(1/\sqrt{\epsilon})$ if the nonsmooth term $h$ in (1.1) does not appear (see [18, 21, 3]). However, it is unclear whether such a bound still holds for the more general composite problem in (1.1) without significantly increasing the bound in (1.4) on the number of subgradient evaluations for $h'$.

It should be pointed out that in many applications the bottleneck of first-order methods lies in the computation of $\nabla f$ rather than that of $h'$. To motivate our study, let us mention a few such examples.

a) In many inverse problems, we need to enforce certain block sparsity (e.g., total variation and overlapped group Lasso) by solving the problem of

$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 + r(Bx).$$

Here $A: \mathbb{R}^n \to \mathbb{R}^m$ is a given linear operator, $b \in \mathbb{R}^m$ denotes the collected observations, $r: \mathbb{R}^p \to \mathbb{R}$ is a relatively simple nonsmooth convex function (e.g., $r = \|\cdot\|_1$), and $B: \mathbb{R}^n \to \mathbb{R}^p$ is a very sparse matrix. In this case, evaluating the gradient of $\|Ax - b\|_2^2$ requires $O(mn)$ arithmetic operations, while the computation of $r'(Bx)$ only needs $O(n + p)$ arithmetic operations.

b) In many machine learning problems, we need to minimize a regularized loss function given by

$$\min_{x \in \mathbb{R}^n} \mathbb{E}_\xi[l(x, \xi)] + q(Bx).$$

Here $l: \mathbb{R}^n \times \mathbb{R}^d \to \mathbb{R}$ denotes a certain simple loss function, $\xi$ is a random variable with unknown distribution, $q$ is a certain smooth convex function, and $B: \mathbb{R}^n \to \mathbb{R}^p$ is a given linear operator.
In this case, the computation of the stochastic subgradient for the loss function $\mathbb{E}_\xi[l(x, \xi)]$ requires only $O(n + d)$ arithmetic operations, while evaluating the gradient of $q(Bx)$ needs $O(np)$ arithmetic operations.

c) In some cases, the computation of $\nabla f$ involves a black-box simulation procedure, the solution of an optimization problem, or a partial differential equation, while the computation of $h'$ is given explicitly.

In all the cases mentioned above, it is desirable to reduce the number of gradient evaluations of $\nabla f$ to improve the overall efficiency for solving the composite problem (1.1).

Our contribution can be briefly summarized as follows. Firstly, we present a new class of first-order methods, namely the gradient sliding algorithms, and show that the number of gradient evaluations for $\nabla f$ required by these algorithms to find an $\epsilon$-solution of (1.1) can be significantly reduced from (1.4) to

$$O\left(\sqrt{\frac{L}{\epsilon}}\right), \quad (1.5)$$

while the total number of subgradient evaluations for $h'$ is still bounded by (1.4). The basic scheme of these algorithms is to skip the computation of $\nabla f$ from time to time so that only $O(1/\sqrt{\epsilon})$ gradient evaluations are needed in the $O(1/\epsilon^2)$ iterations required to solve (1.1). Such an algorithmic framework originated from the simple idea of incorporating an iterative procedure to solve the subproblems in the aforementioned accelerated proximal gradient methods, although the analysis of these gradient sliding algorithms appears to be more technical and involved.

Secondly, we consider the stochastic case where the nonsmooth term $h$ is represented by a stochastic oracle (SO), which, for a given search point $u_t \in X$, outputs a vector $H(u_t, \xi_t)$ such that (s.t.)

$$\mathbb{E}[H(u_t, \xi_t)] = h'(u_t) \in \partial h(u_t), \quad (1.6)$$

$$\mathbb{E}[\|H(u_t, \xi_t) - h'(u_t)\|_*^2] \le \sigma^2, \quad (1.7)$$

where $\xi_t$ is a random vector independent of the search points $u_t$. Note that $H(u_t, \xi_t)$ is referred to as a stochastic subgradient of $h$ at $u_t$ and its computation is often much cheaper than that of the exact subgradient $h'$. Based on the gradient sliding techniques, we develop a new class of stochastic approximation type algorithms and show that the total number of gradient evaluations of $\nabla f$ required by these algorithms to find a stochastic $\epsilon$-solution of (1.1), i.e., a point $\bar{x} \in X$ s.t. $\mathbb{E}[\Psi(\bar{x}) - \Psi^*] \le \epsilon$, can still be bounded by (1.5), while the total number of stochastic subgradient evaluations can be bounded by

$$O\left(\sqrt{\frac{L}{\epsilon}} + \frac{M^2 + \sigma^2}{\epsilon^2}\right).$$

We also establish large-deviation results associated with these complexity bounds under certain "light-tail" assumptions on the stochastic subgradients returned by the SO.

Thirdly, we generalize the gradient sliding algorithms for solving two important classes of composite problems given in the form of (1.1), but with $f$ satisfying additional or alternative assumptions. We first assume that $f$ is not only smooth, but also strongly convex, and show that the number of evaluations for $\nabla f$ and $h'$ can be significantly reduced from $O(1/\sqrt{\epsilon})$ and $O(1/\epsilon^2)$, respectively, to $O(\log(1/\epsilon))$ and $O(1/\epsilon)$. We then consider the case when $f$ is nonsmooth, but can be closely approximated by a class of smooth functions.
By incorporating a novel smoothing scheme due to Nesterov [17] into the gradient sliding algorithms, we show that the number of gradient evaluations can be bounded by $O(1/\epsilon)$, while the optimal $O(1/\epsilon^2)$ bound on the number of subgradient evaluations of $h'$ is still retained.

This paper is organized as follows. In Section 2, we provide some preliminaries on the prox-functions and a brief review of existing proximal gradient methods for solving (1.1). In Section 3, we present the gradient sliding algorithms and establish their convergence properties for solving problem (1.1). Section 4 is devoted to stochastic gradient sliding algorithms for solving a class of stochastic composite problems. In Section 5, we generalize the gradient sliding algorithms for the situation where $f$ is smooth and strongly convex, and for the case when $f$ is nonsmooth but can be closely approximated by a class of smooth functions. Finally, some concluding remarks are made in Section 6.

Notation and terminology. We use $\|\cdot\|$ to denote an arbitrary norm in $\mathbb{R}^n$, which is not necessarily associated with the inner product $\langle\cdot,\cdot\rangle$. We also use $\|\cdot\|_*$ to denote the conjugate of $\|\cdot\|$. For any $p \ge 1$, $\|\cdot\|_p$ denotes the standard $p$-norm in $\mathbb{R}^n$, i.e.,

$$\|x\|_p^p = \sum_{i=1}^n |x_i|^p, \quad \text{for any } x \in \mathbb{R}^n.$$

For any convex function $h$, $\partial h(x)$ denotes the subdifferential of $h$ at $x$. Given any $X \subseteq \mathbb{R}^n$, we say a convex function $h: X \to \mathbb{R}$ is nonsmooth if $|h(x) - h(y)| \le M_h\|x - y\|$ for any $x, y \in X$. In this case, it can be shown that (1.3) holds with $M = 2M_h$ (see Lemma 2 of [11]). We say that a convex function $f: X \to \mathbb{R}$ is smooth if it is Lipschitz continuously differentiable with Lipschitz constant $L > 0$, i.e., $\|\nabla f(y) - \nabla f(x)\|_* \le L\|y - x\|$ for any $x, y \in X$, which clearly implies (1.2).

For any real number $r$, $\lceil r \rceil$ and $\lfloor r \rfloor$ denote the nearest integer to $r$ from above and below, respectively. $\mathbb{R}_+$ and $\mathbb{R}_{++}$, respectively, denote the set of nonnegative and positive real numbers. $\mathbb{N}$ denotes the set of natural numbers $\{1, 2, \ldots\}$.

2 Review of the proximal gradient methods

In this section, we provide a brief review of the proximal gradient methods from which the proposed gradient sliding algorithms originate, and point out a few problems associated with these existing algorithms when applied to solve problem (1.1).

2.1 Preliminary: distance generating function and prox-function

In this subsection, we review the concept of prox-function (i.e., proximity control function), which plays an important role in the recent development of first-order methods for convex programming. The goal of using the prox-function in place of the usual Euclidean distance is to allow the developed algorithms to adapt to the geometry of the feasible sets.

We say that a function $\omega: X \to \mathbb{R}$ is a distance generating function with modulus $\nu > 0$ with respect to $\|\cdot\|$, if $\omega$ is continuously differentiable and strongly convex with parameter $\nu$ with respect to $\|\cdot\|$, i.e.,

$$\langle x - z, \nabla\omega(x) - \nabla\omega(z) \rangle \ge \nu\|x - z\|^2, \quad \forall x, z \in X. \quad (2.1)$$

The prox-function associated with $\omega$ is given by

$$V(x, z) \equiv V_\omega(x, z) = \omega(z) - [\omega(x) + \langle \nabla\omega(x), z - x \rangle]. \quad (2.2)$$
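As a quick illustration (ours, under the assumptions stated in the comments), the two standard distance generating functions below give concrete prox-functions of the form (2.2): the Euclidean one recovers half the squared distance, and the entropy one on the standard simplex recovers the Kullback-Leibler divergence.

```python
import numpy as np

def V_euclidean(x, z):
    # omega(x) = 0.5*||x||_2^2 (modulus nu = 1 w.r.t. the Euclidean norm)
    # gives the familiar V(x, z) = 0.5*||z - x||_2^2.
    return 0.5 * np.sum((z - x) ** 2)

def V_entropy(x, z):
    # omega(x) = sum_i x_i*log(x_i) on the standard simplex (modulus nu = 1
    # w.r.t. the l1-norm there) gives the Kullback-Leibler divergence.
    return np.sum(z * np.log(z / x))

x = np.array([0.5, 0.3, 0.2])
z = np.array([0.4, 0.4, 0.2])
print(V_euclidean(x, z), V_entropy(x, z))
```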
The prox-function $V(\cdot,\cdot)$ is also called the Bregman distance, which was initially studied by Bregman [4] and later by many others (see [1, 2, 9] and references therein). In this paper, we assume that the prox-function $V(x, z)$ is chosen such that the solution of

$$\arg\min_{u \in X} \{\langle g, u \rangle + V(x, u) + \mathcal{X}(u)\} \quad (2.3)$$

is easily computable for any $g \in \mathcal{E}^*$ and $x \in X$. Some examples of these prox-functions are given in [5].

If there exists a constant $\mathcal{Q}$ such that $V(x, z) \le \mathcal{Q}\|x - z\|^2/2$ for any $x, z \in X$, then we say that the prox-function $V(\cdot,\cdot)$ grows quadratically. Moreover, the smallest constant $\mathcal{Q}$ satisfying the previous relation is called the quadratic growth constant of $V(\cdot,\cdot)$. Without loss of generality, we assume that $\mathcal{Q} = 1$ for the prox-function $V(x, z)$ if it grows quadratically, i.e.,

$$V(x, z) \le \tfrac{1}{2}\|x - z\|^2, \quad \forall x, z \in X. \quad (2.4)$$

Indeed, if $\mathcal{Q} \ne 1$, we can multiply the corresponding distance generating function $\omega$ by $1/\mathcal{Q}$ and the resulting prox-function will satisfy (2.4).

2.2 Proximal gradient methods

In this subsection, we briefly review a few possible first-order methods for solving problem (1.1). We start with the simplest proximal gradient method, which works for the case when the nonsmooth component $h$ does not appear or is relatively simple (e.g., $h$ is affine).

For a given $x \in X$, let

$$m_\Psi(x, u) := l_f(x, u) + h(u) + \mathcal{X}(u), \quad \forall u \in X, \quad (2.5)$$

where

$$l_f(x; y) := f(x) + \langle \nabla f(x), y - x \rangle. \quad (2.6)$$

Clearly, by the convexity of $f$ and (1.2), we have

$$m_\Psi(x, u) \le \Psi(u) \le m_\Psi(x, u) + \tfrac{L}{2}\|u - x\|^2 \le m_\Psi(x, u) + \tfrac{L}{\nu}V(x, u)$$

for any $u \in X$, where the last inequality follows from the strong convexity of $\omega$. Hence, $m_\Psi(x, u)$ is a good approximation of $\Psi(u)$ when $u$ is "close" enough to $x$. In view of this observation, we update the search point $x_k \in X$ at the $k$-th iteration of the proximal gradient method by

$$x_k = \arg\min_{u \in X} \{l_f(x_{k-1}, u) + h(u) + \mathcal{X}(u) + \beta_k V(x_{k-1}, u)\}. \quad (2.7)$$

Here, $\beta_k > 0$ is a parameter which determines how much we "trust" the proximity between $m_\Psi(x_{k-1}, u)$ and $\Psi(u)$. In particular, a larger value of $\beta_k$ implies less confidence in $m_\Psi(x_{k-1}, u)$ and results in a smaller step moving from $x_{k-1}$ to $x_k$. It is well known that the number of iterations required by the proximal gradient method for finding an $\epsilon$-solution of (1.1) can be bounded by $O(1/\epsilon)$.

The efficiency of the above proximal gradient method can be significantly improved by incorporating a multi-step acceleration scheme. The basic idea of this scheme is to introduce three closely related search sequences, namely $\{\underline{x}_k\}$, $\{x_k\}$, and $\{\bar{x}_k\}$, which will be used to build the model $m_\Psi$, control the proximity between $m_\Psi$ and $\Psi$, and compute the output solution, respectively. More specifically, these three sequences are updated according to

$$\underline{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k x_{k-1}, \quad (2.8)$$

$$x_k = \arg\min_{u \in X} \{\Phi_k(u) := l_f(\underline{x}_k, u) + h(u) + \mathcal{X}(u) + \beta_k V(x_{k-1}, u)\}, \quad (2.9)$$

$$\bar{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k x_k, \quad (2.10)$$

where $\beta_k \ge 0$ and $\gamma_k \in [0, 1]$ are given parameters for the algorithm. Clearly, (2.8)-(2.10) reduces to (2.7) if $\bar{x}_0 = x_0$ and $\gamma_k$ is set to a constant, i.e., $\gamma_k = \gamma$ for some $\gamma \in [0, 1]$ for all $k \ge 1$.
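The following sketch (ours) specializes the scheme (2.8)-(2.10) to the simplest setting $h = \mathcal{X} = 0$ over $X = \mathbb{R}^n$ with the Euclidean prox-function, so that the subproblem (2.9) has a closed form; the parameter choice $\beta_k = 2L/k$, $\gamma_k = 2/(k+2)$ anticipates the discussion that follows.

```python
import numpy as np

# A minimal sketch (ours, not the paper's implementation) of the accelerated
# proximal gradient scheme (2.8)-(2.10) with h = calX = 0 and the Euclidean
# prox-function V(x, u) = 0.5*||u - x||^2, so (2.9) is a plain gradient step.
def accelerated_prox_gradient(grad_f, L, x0, N):
    x, xbar = x0.copy(), x0.copy()
    for k in range(1, N + 1):
        gamma, beta = 2.0 / (k + 2), 2.0 * L / k
        x_under = (1 - gamma) * xbar + gamma * x        # (2.8)
        x = x - grad_f(x_under) / beta                  # (2.9), closed form
        xbar = (1 - gamma) * xbar + gamma * x           # (2.10)
    return xbar
```

For instance, `accelerated_prox_gradient(grad_f, L, np.zeros(20), 100)` runs the scheme on the hypothetical least-squares oracle sketched in Section 1.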
However, by properly specifying $\beta_k$ and $\gamma_k$, e.g., $\beta_k = 2L/k$ and $\gamma_k = 2/(k+2)$, one can show that the above accelerated proximal gradient method can find an $\epsilon$-solution of (1.1) in at most $O(1/\sqrt{\epsilon})$ iterations. Since each iteration of this algorithm requires only one evaluation of $\nabla f$, the total number of gradient evaluations of $\nabla f$ can also be bounded by $O(1/\sqrt{\epsilon})$.

One crucial problem associated with the aforementioned proximal gradient type methods is that the subproblems (2.7) and (2.9) are difficult to solve when $h$ is a general nonsmooth convex function. To address this issue, one can possibly apply an enhanced accelerated gradient method by Lan [11] (see also [5, 6]). This algorithm is obtained by replacing $h(u)$ in (2.9) with

$$l_h(\underline{x}_k; u) := h(\underline{x}_k) + \langle h'(\underline{x}_k), u - \underline{x}_k \rangle \quad (2.11)$$

for some $h'(\underline{x}_k) \in \partial h(\underline{x}_k)$. As a result, the subproblems in this algorithm become easier to solve. Moreover, with a proper selection of $\{\beta_k\}$ and $\{\gamma_k\}$, this approach can find an $\epsilon$-solution of (1.1) in at most

$$O\left(\sqrt{\frac{LV(x_0, x^*)}{\epsilon}} + \frac{M^2 V(x_0, x^*)}{\epsilon^2}\right) \quad (2.12)$$

iterations. Since each iteration requires one computation of $\nabla f$ and $h'$, the total number of evaluations for $\nabla f$ and $h'$ is bounded by $O(1/\epsilon^2)$. As pointed out in [11], the bound in (2.12) is not improvable if one can only compute the subgradient of the composite function $f(x) + h(x)$ as a whole. However, as noted in Section 1, we do have access to separate first-order information about $f$ and $h$ in many applications. One interesting problem is whether we can further improve the performance of proximal gradient type methods in the latter case.

3 Deterministic gradient sliding

Throughout this section, we consider the deterministic case where exact subgradients of $h$ are available. By presenting a new class of proximal gradient methods, namely the gradient sliding (GS) method, we show that one can significantly reduce the number of gradient evaluations for $\nabla f$ required to solve (1.1), while maintaining the optimal bound on the total number of subgradient evaluations for $h'$.

The basic idea of the GS method is to incorporate an iterative procedure to approximately solve the subproblem (2.9) in the accelerated proximal gradient methods. A critical observation in our development of the GS method is that one needs to compute a pair of closely related approximate solutions of problem (2.9). One of them will be used in place of $x_k$ in (2.8) to construct the model $m_\Psi$, while the other one will be used in place of $x_k$ in (2.10) to compute the output solution $\bar{x}_k$. Moreover, we show that such a pair of approximate solutions can be obtained by applying a simple subgradient projection type subroutine. We now formally describe this algorithm as follows.

Algorithm 1 The gradient sliding (GS) algorithm

Input: Initial point $x_0 \in X$ and iteration limit $N$. Let $\beta_k \in \mathbb{R}_{++}$, $\gamma_k \in \mathbb{R}_+$, and $T_k \in \mathbb{N}$, $k = 1, 2, \ldots$, be given and set $\bar{x}_0 = x_0$.

for $k = 1, 2, \ldots, N$ do

1. Set $\underline{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k x_{k-1}$, and let $g_k(\cdot) \equiv l_f(\underline{x}_k, \cdot)$ be defined in (2.6).

2. Set

$$(x_k, \tilde{x}_k) = \mathrm{PS}(g_k, x_{k-1}, \beta_k, T_k); \quad (3.1)$$

3. Set $\bar{x}_k = (1 - \gamma_k)\bar{x}_{k-1} + \gamma_k\tilde{x}_k$.

end for

Output: $\bar{x}_N$.
The PS (prox-sliding) procedure called at step 2 is stated as follows.

procedure $(x^+, \tilde{x}^+) = \mathrm{PS}(g, x, \beta, T)$

Let the parameters $p_t \in \mathbb{R}_{++}$ and $\theta_t \in [0, 1]$, $t = 1, \ldots$, be given. Set $u_0 = \tilde{u}_0 = x$.

for $t = 1, 2, \ldots, T$ do

$$u_t = \arg\min_{u \in X} \{g(u) + l_h(u_{t-1}, u) + \beta V(x, u) + \beta p_t V(u_{t-1}, u) + \mathcal{X}(u)\}, \quad (3.2)$$

$$\tilde{u}_t = (1 - \theta_t)\tilde{u}_{t-1} + \theta_t u_t. \quad (3.3)$$

end for

Set $x^+ = u_T$ and $\tilde{x}^+ = \tilde{u}_T$.

end procedure

Observe that when supplied with an affine function $g(\cdot)$, prox-center $x \in X$, parameter $\beta$, and sliding period $T$, the PS procedure computes a pair of approximate solutions $(x^+, \tilde{x}^+) \in X \times X$ for the problem

$$\arg\min_{u \in X} \{\Phi(u) := g(u) + h(u) + \beta V(x, u) + \mathcal{X}(u)\}. \quad (3.4)$$

Clearly, problem (3.4) is equivalent to (2.9) when the input parameters are set to (3.1). Since the same affine function $g(\cdot) = l_f(\underline{x}_k, \cdot)$ is used throughout the $T$ iterations of the PS procedure, we skip the computation of the gradients of $f$ when performing the $T$ projection steps in (3.2). This differs from the accelerated gradient method in [11], where one needs to compute $\nabla f + h'$ in each projection step.

A few more remarks about the above GS algorithm are in order. Firstly, we say that an outer iteration of the GS algorithm occurs whenever $k$ in Algorithm 1 increments by 1. Each outer iteration of the GS algorithm involves the computation of the gradient $\nabla f(\underline{x}_k)$ and a call to the PS procedure to update $x_k$ and $\tilde{x}_k$. Secondly, the PS procedure solves problem (3.4) iteratively. Each iteration of this procedure consists of the computation of the subgradient $h'(u_{t-1})$ and the solution of the projection subproblem (3.2), which is assumed to be relatively easy to solve (see Section 2.1). For notational convenience, we refer to an iteration of the PS procedure as an inner iteration of the GS algorithm. Thirdly, the GS algorithm described above is conceptual only since we have not yet specified the selection of $\{\beta_k\}$, $\{\gamma_k\}$, $\{T_k\}$, $\{p_t\}$, and $\{\theta_t\}$. We will return to this issue after establishing some convergence properties of the generic GS algorithm described above.
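Putting Algorithm 1 and the PS procedure together, the sketch below (ours, a Euclidean specialization rather than the paper's implementation) is runnable with $X = \mathbb{R}^n$ and $\mathcal{X} = 0$, so that the projection step (3.2) has a closed form. It uses the parameter policies (3.27)-(3.28) established later in this section and the hypothetical oracles grad_f and subgrad_h from Section 1; D_tilde plays the role of the estimate $\tilde{D}$.

```python
import numpy as np

def PS(grad_at_xunder, subgrad_h, x, beta, T):
    """Prox-sliding procedure, Euclidean case (V(x,u) = 0.5*||u - x||^2)."""
    u, u_tilde = x.copy(), x.copy()
    for t in range(1, T + 1):
        p_t = t / 2.0                               # (3.27)
        theta_t = 2.0 * (t + 1) / (t * (t + 3))     # (3.27)
        # Closed form of (3.2): quadratic in u with two prox terms.
        u = (beta * x + beta * p_t * u
             - (grad_at_xunder + subgrad_h(u))) / (beta * (1 + p_t))
        u_tilde = (1 - theta_t) * u_tilde + theta_t * u   # (3.3)
    return u, u_tilde

def gradient_sliding(grad_f, subgrad_h, L, M, x0, N, D_tilde=1.0):
    """Algorithm 1 with the stepsize policy (3.28); D_tilde estimates D-tilde."""
    x, xbar = x0.copy(), x0.copy()
    for k in range(1, N + 1):
        beta_k, gamma_k = 2.0 * L / k, 2.0 / (k + 1)               # (3.28)
        T_k = int(np.ceil(M**2 * N * k**2 / (D_tilde * L**2)))     # (3.28)
        x_under = (1 - gamma_k) * xbar + gamma_k * x
        x, x_tilde = PS(grad_f(x_under), subgrad_h, x, beta_k, T_k)
        xbar = (1 - gamma_k) * xbar + gamma_k * x_tilde
    return xbar
```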
We first present a result which summarizes some important convergence properties of the PS procedure. The following two technical results are needed to establish the convergence of this procedure.

The first technical result below characterizes the solution of the projection step (3.2). The proof of this result can be found in Lemma 2 of [5].

Lemma 1 Let the convex function $q: X \to \mathbb{R}$, the points $\tilde{x}, \tilde{y} \in X$, and the scalars $\mu_1, \mu_2 \in \mathbb{R}_+$ be given. Let $\omega: X \to \mathbb{R}$ be a differentiable convex function and $V(x, z)$ be defined in (2.2). If

$$u^* \in \mathrm{Argmin}\{q(u) + \mu_1 V(\tilde{x}, u) + \mu_2 V(\tilde{y}, u) : u \in X\},$$

then for any $u \in X$, we have

$$q(u^*) + \mu_1 V(\tilde{x}, u^*) + \mu_2 V(\tilde{y}, u^*) \le q(u) + \mu_1 V(\tilde{x}, u) + \mu_2 V(\tilde{y}, u) - (\mu_1 + \mu_2)V(u^*, u).$$

The second technical result slightly generalizes Lemma 3 of [12] to provide a convenient way to analyze sequences with sublinear rates of convergence.

Lemma 2 Let $w_k \in (0, 1]$, $k = 1, 2, \ldots$, and $W_1 > 0$ be given, and define

$$W_k := (1 - w_k)W_{k-1}, \quad k \ge 2. \quad (3.5)$$

Suppose that $W_k > 0$ for all $k \ge 2$ and that the sequence $\{\delta_k\}_{k \ge 0}$ satisfies

$$\delta_k \le (1 - w_k)\delta_{k-1} + B_k, \quad k = 1, 2, \ldots. \quad (3.6)$$

Then for any $k \ge 1$, we have

$$\delta_k \le W_k\left[\frac{1 - w_1}{W_1}\delta_0 + \sum_{i=1}^k \frac{B_i}{W_i}\right]. \quad (3.7)$$

Proof. The result follows from dividing both sides of (3.6) by $W_k$ and then summing up the resulting inequalities.

We are now ready to establish the convergence of the PS procedure.

Proposition 1 If $\{p_t\}$ and $\{\theta_t\}$ in the PS procedure satisfy

$$\theta_t = \frac{P_{t-1} - P_t}{(1 - P_t)P_{t-1}} \quad \text{with} \quad P_t = \begin{cases} 1, & t = 0, \\ p_t(1 + p_t)^{-1}P_{t-1}, & t \ge 1, \end{cases} \quad (3.8)$$

then, for any $t \ge 1$ and $u \in X$,

$$\beta(1 - P_t)^{-1}V(u_t, u) + [\Phi(\tilde{u}_t) - \Phi(u)] \le P_t(1 - P_t)^{-1}\left[\beta V(u_0, u) + \frac{M^2}{2\nu\beta}\sum_{i=1}^t (p_i^2 P_{i-1})^{-1}\right], \quad (3.9)$$

where $\Phi$ is defined in (3.4).

Proof. By (1.3) and the definition of $l_h$ in (2.11), we have

$$h(u_t) \le l_h(u_{t-1}, u_t) + M\|u_t - u_{t-1}\|.$$

Adding $g(u_t) + \beta V(x, u_t) + \mathcal{X}(u_t)$ to both sides of this inequality and using the definition of $\Phi$ in (3.4), we obtain

$$\Phi(u_t) \le g(u_t) + l_h(u_{t-1}, u_t) + \beta V(x, u_t) + \mathcal{X}(u_t) + M\|u_t - u_{t-1}\|. \quad (3.10)$$

Now applying Lemma 1 to (3.2), we obtain

$$g(u_t) + l_h(u_{t-1}, u_t) + \beta V(x, u_t) + \mathcal{X}(u_t) + \beta p_t V(u_{t-1}, u_t)$$
$$\le g(u) + l_h(u_{t-1}, u) + \beta V(x, u) + \mathcal{X}(u) + \beta p_t V(u_{t-1}, u) - \beta(1 + p_t)V(u_t, u)$$
$$\le g(u) + h(u) + \beta V(x, u) + \mathcal{X}(u) + \beta p_t V(u_{t-1}, u) - \beta(1 + p_t)V(u_t, u)$$
$$= \Phi(u) + \beta p_t V(u_{t-1}, u) - \beta(1 + p_t)V(u_t, u),$$

where the second inequality follows from the convexity of $h$. Moreover, by the strong convexity of $\omega$,

$$-\beta p_t V(u_{t-1}, u_t) + M\|u_t - u_{t-1}\| \le -\frac{\nu\beta p_t}{2}\|u_t - u_{t-1}\|^2 + M\|u_t - u_{t-1}\| \le \frac{M^2}{2\nu\beta p_t},$$

where the last inequality follows from the simple fact that $-at^2/2 + bt \le b^2/(2a)$ for any $a > 0$. Combining the previous three inequalities, we conclude that

$$\Phi(u_t) - \Phi(u) \le \beta p_t V(u_{t-1}, u) - \beta(1 + p_t)V(u_t, u) + \frac{M^2}{2\nu\beta p_t}.$$

Dividing both sides by $1 + p_t$ and rearranging the terms, we obtain

$$\frac{\beta V(u_t, u) + \Phi(u_t) - \Phi(u)}{1 + p_t} \le \frac{\beta p_t}{1 + p_t}V(u_{t-1}, u) + \frac{M^2}{2\nu\beta(1 + p_t)p_t},$$

which, in view of the definition of $P_t$ in (3.8) and Lemma 2 (with $k = t$, $w_k = 1/(1 + p_t)$ and $W_k = P_t$), then implies that

$$\frac{\beta}{P_t}V(u_t, u) + \sum_{i=1}^t \frac{\Phi(u_i) - \Phi(u)}{P_i(1 + p_i)} \le \beta V(u_0, u) + \frac{M^2}{2\nu\beta}\sum_{i=1}^t \frac{1}{P_i(1 + p_i)p_i} = \beta V(u_0, u) + \frac{M^2}{2\nu\beta}\sum_{i=1}^t (p_i^2 P_{i-1})^{-1}, \quad (3.11)$$

where the last identity also follows from the definition of $P_t$ in (3.8). Also note that by the definition of $\tilde{u}_t$ in the PS procedure and (3.8), we have

$$\tilde{u}_t = \frac{P_t}{1 - P_t}\left[\frac{1 - P_{t-1}}{P_{t-1}}\tilde{u}_{t-1} + \frac{1}{P_t(1 + p_t)}u_t\right].$$

Applying this relation inductively and using the fact that $P_0 = 1$, we can easily see that

$$\tilde{u}_t = \frac{P_t}{1 - P_t}\left[\frac{1 - P_{t-2}}{P_{t-2}}\tilde{u}_{t-2} + \frac{1}{P_{t-1}(1 + p_{t-1})}u_{t-1} + \frac{1}{P_t(1 + p_t)}u_t\right] = \cdots = \frac{P_t}{1 - P_t}\sum_{i=1}^t \frac{1}{P_i(1 + p_i)}u_i,$$

which, in view of the convexity of $\Phi$, then implies that

$$\Phi(\tilde{u}_t) - \Phi(u) \le \frac{P_t}{1 - P_t}\sum_{i=1}^t \frac{\Phi(u_i) - \Phi(u)}{P_i(1 + p_i)}. \quad (3.12)$$

Combining the above inequality with (3.11) and rearranging the terms, we obtain (3.9).
Setting $u$ to be the optimal solution of (3.4), we can see that both $x_k$ and $\tilde{x}_k$ are approximate solutions of (3.4) if the right-hand side (RHS) of (3.9) is small enough. With the help of this result, we can establish an important recursion from which the convergence of the GS algorithm easily follows.

Proposition 2 Suppose that $\{p_t\}$ and $\{\theta_t\}$ in the PS procedure satisfy (3.8). Also assume that $\{\beta_k\}$ and $\{\gamma_k\}$ in the GS algorithm satisfy

$$\gamma_1 = 1 \quad \text{and} \quad \nu\beta_k - L\gamma_k \ge 0, \quad k \ge 1. \quad (3.13)$$

Then for any $u \in X$ and $k \ge 1$,

$$\Psi(\bar{x}_k) - \Psi(u) \le (1 - \gamma_k)[\Psi(\bar{x}_{k-1}) - \Psi(u)] + \gamma_k(1 - P_{T_k})^{-1}\left[\beta_k V(x_{k-1}, u) - \beta_k V(x_k, u) + \frac{M^2 P_{T_k}}{2\nu\beta_k}\sum_{i=1}^{T_k}(p_i^2 P_{i-1})^{-1}\right]. \quad (3.14)$$

Proof. First, notice that by the definitions of $\bar{x}_k$ and $\underline{x}_k$, we have $\bar{x}_k - \underline{x}_k = \gamma_k(\tilde{x}_k - x_{k-1})$. Using this observation, (1.2), the definition of $l_f$ in (2.6), and the convexity of $f$, we obtain

$$f(\bar{x}_k) \le l_f(\underline{x}_k, \bar{x}_k) + \frac{L}{2}\|\bar{x}_k - \underline{x}_k\|^2 = (1 - \gamma_k)l_f(\underline{x}_k, \bar{x}_{k-1}) + \gamma_k l_f(\underline{x}_k, \tilde{x}_k) + \frac{L\gamma_k^2}{2}\|\tilde{x}_k - x_{k-1}\|^2$$
$$\le (1 - \gamma_k)f(\bar{x}_{k-1}) + \gamma_k[l_f(\underline{x}_k, \tilde{x}_k) + \beta_k V(x_{k-1}, \tilde{x}_k)] - \gamma_k\beta_k V(x_{k-1}, \tilde{x}_k) + \frac{L\gamma_k^2}{2}\|\tilde{x}_k - x_{k-1}\|^2$$
$$\le (1 - \gamma_k)f(\bar{x}_{k-1}) + \gamma_k[l_f(\underline{x}_k, \tilde{x}_k) + \beta_k V(x_{k-1}, \tilde{x}_k)] - \left(\gamma_k\beta_k - \frac{L\gamma_k^2}{\nu}\right)V(x_{k-1}, \tilde{x}_k)$$
$$\le (1 - \gamma_k)f(\bar{x}_{k-1}) + \gamma_k[l_f(\underline{x}_k, \tilde{x}_k) + \beta_k V(x_{k-1}, \tilde{x}_k)], \quad (3.15)$$

where the third inequality follows from the strong convexity of $\omega$ and the last inequality follows from (3.13). By the convexity of $h$ and $\mathcal{X}$, we have

$$h(\bar{x}_k) + \mathcal{X}(\bar{x}_k) \le (1 - \gamma_k)[h(\bar{x}_{k-1}) + \mathcal{X}(\bar{x}_{k-1})] + \gamma_k[h(\tilde{x}_k) + \mathcal{X}(\tilde{x}_k)]. \quad (3.16)$$

Adding up the previous two inequalities, and using the definitions of $\Psi$ in (1.1) and $\Phi_k$ in (2.9), we have

$$\Psi(\bar{x}_k) \le (1 - \gamma_k)\Psi(\bar{x}_{k-1}) + \gamma_k\Phi_k(\tilde{x}_k).$$

Subtracting $\Psi(u)$ from both sides of the above inequality, we obtain

$$\Psi(\bar{x}_k) - \Psi(u) \le (1 - \gamma_k)[\Psi(\bar{x}_{k-1}) - \Psi(u)] + \gamma_k[\Phi_k(\tilde{x}_k) - \Psi(u)]. \quad (3.17)$$

Also note that by the definition of $\Phi_k$ in (2.9) and the convexity of $f$,

$$\Phi_k(u) \le f(u) + h(u) + \mathcal{X}(u) + \beta_k V(x_{k-1}, u) = \Psi(u) + \beta_k V(x_{k-1}, u), \quad \forall u \in X. \quad (3.18)$$

Combining these two inequalities, we obtain

$$\Psi(\bar{x}_k) - \Psi(u) \le (1 - \gamma_k)[\Psi(\bar{x}_{k-1}) - \Psi(u)] + \gamma_k[\Phi_k(\tilde{x}_k) - \Phi_k(u) + \beta_k V(x_{k-1}, u)]. \quad (3.19)$$

Now, in view of (3.9), the definition of $\Phi_k$ in (2.9), and the origin of $(x_k, \tilde{x}_k)$ in (3.1), we can easily see that, for any $u \in X$ and $k \ge 1$,

$$\frac{\beta_k}{1 - P_{T_k}}V(x_k, u) + [\Phi_k(\tilde{x}_k) - \Phi_k(u)] \le \frac{P_{T_k}}{1 - P_{T_k}}\left[\beta_k V(x_{k-1}, u) + \frac{M^2}{2\nu\beta_k}\sum_{i=1}^{T_k}(p_i^2 P_{i-1})^{-1}\right].$$

Plugging the above bound on $\Phi_k(\tilde{x}_k) - \Phi_k(u)$ into (3.19), we obtain (3.14).

We are now ready to establish the main convergence properties of the GS algorithm. Note that the following quantity will be used in our analysis of this algorithm:

$$\Gamma_k = \begin{cases} 1, & k = 1, \\ (1 - \gamma_k)\Gamma_{k-1}, & k \ge 2. \end{cases} \quad (3.20)$$
Theorem 1 Assume that $\{p_t\}$ and $\{\theta_t\}$ in the PS procedure satisfy (3.8), and also that $\{\beta_k\}$ and $\{\gamma_k\}$ in the GS algorithm satisfy (3.13).

a) If for any $k \ge 2$,

$$\frac{\gamma_k\beta_k}{\Gamma_k(1 - P_{T_k})} \le \frac{\gamma_{k-1}\beta_{k-1}}{\Gamma_{k-1}(1 - P_{T_{k-1}})}, \quad (3.21)$$

then we have, for any $N \ge 1$,

$$\Psi(\bar{x}_N) - \Psi(x^*) \le \mathcal{B}_d(N) := \frac{\Gamma_N\beta_1}{1 - P_{T_1}}V(x_0, x^*) + \frac{M^2\Gamma_N}{2\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\frac{\gamma_k P_{T_k}}{\Gamma_k\beta_k(1 - P_{T_k})p_i^2 P_{i-1}}, \quad (3.22)$$

where $x^* \in X$ is an arbitrary optimal solution of problem (1.1), and $P_t$ and $\Gamma_k$ are defined in (3.8) and (3.20), respectively.

b) If $X$ is compact, and for any $k \ge 2$,

$$\frac{\gamma_k\beta_k}{\Gamma_k(1 - P_{T_k})} \ge \frac{\gamma_{k-1}\beta_{k-1}}{\Gamma_{k-1}(1 - P_{T_{k-1}})}, \quad (3.23)$$

then (3.22) still holds by simply replacing the first term in the definition of $\mathcal{B}_d(N)$ with $\gamma_N\beta_N\bar{V}(x^*)/(1 - P_{T_N})$, where $\bar{V}(u) = \max_{x \in X} V(x, u)$.

Proof. We conclude from (3.14) and Lemma 2 that

$$\Psi(\bar{x}_N) - \Psi(u) \le \Gamma_N\frac{1 - \gamma_1}{\Gamma_1}[\Psi(\bar{x}_0) - \Psi(u)] + \Gamma_N\sum_{k=1}^N\frac{\beta_k\gamma_k}{\Gamma_k(1 - P_{T_k})}[V(x_{k-1}, u) - V(x_k, u)] + \frac{M^2\Gamma_N}{2\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\frac{\gamma_k P_{T_k}}{\Gamma_k\beta_k(1 - P_{T_k})p_i^2 P_{i-1}}$$
$$= \Gamma_N\sum_{k=1}^N\frac{\beta_k\gamma_k}{\Gamma_k(1 - P_{T_k})}[V(x_{k-1}, u) - V(x_k, u)] + \frac{M^2\Gamma_N}{2\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\frac{\gamma_k P_{T_k}}{\Gamma_k\beta_k(1 - P_{T_k})p_i^2 P_{i-1}}, \quad (3.24)$$

where the last identity follows from the fact that $\gamma_1 = 1$. Now it follows from (3.21) that

$$\sum_{k=1}^N\frac{\beta_k\gamma_k}{\Gamma_k(1 - P_{T_k})}[V(x_{k-1}, u) - V(x_k, u)] \le \frac{\beta_1\gamma_1}{\Gamma_1(1 - P_{T_1})}V(x_0, u) - \frac{\beta_N\gamma_N}{\Gamma_N(1 - P_{T_N})}V(x_N, u) \le \frac{\beta_1}{1 - P_{T_1}}V(x_0, u), \quad (3.25)$$

where the last inequality follows from the facts that $\gamma_1 = \Gamma_1 = 1$, $P_{T_N} \le 1$, and $V(x_N, u) \ge 0$. The result in part a) then clearly follows from the previous two inequalities with $u = x^*$. Moreover, using (3.23) and the fact that $V(x_k, u) \le \bar{V}(u)$, we conclude that

$$\sum_{k=1}^N\frac{\beta_k\gamma_k}{\Gamma_k(1 - P_{T_k})}[V(x_{k-1}, u) - V(x_k, u)] \le \frac{\beta_1}{1 - P_{T_1}}\bar{V}(u) - \sum_{k=2}^N\left[\frac{\beta_{k-1}\gamma_{k-1}}{\Gamma_{k-1}(1 - P_{T_{k-1}})} - \frac{\beta_k\gamma_k}{\Gamma_k(1 - P_{T_k})}\right]\bar{V}(u) = \frac{\gamma_N\beta_N}{\Gamma_N(1 - P_{T_N})}\bar{V}(u). \quad (3.26)$$

Part b) then follows from the above observation and (3.24) with $u = x^*$.

Clearly, there are various options for specifying the parameters $\{p_t\}$, $\{\theta_t\}$, $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ to guarantee the convergence of the GS algorithm. Below we provide a few such selections which lead to the best possible rate of convergence for solving problem (1.1). In particular, Corollary 1.a) provides a set of such parameters for the case when the feasible region $X$ is unbounded and the iteration limit $N$ is given a priori, while the one in Corollary 1.b) works only for the case when $X$ is compact, but does not require $N$ to be given in advance.
Corollary 1 Assume that $\{p_t\}$ and $\{\theta_t\}$ in the PS procedure are set to

$$p_t = \frac{t}{2} \quad \text{and} \quad \theta_t = \frac{2(t + 1)}{t(t + 3)}, \quad \forall t \ge 1. \quad (3.27)$$

a) If $N$ is fixed a priori, and $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ are set to

$$\beta_k = \frac{2L}{\nu k}, \quad \gamma_k = \frac{2}{k + 1}, \quad \text{and} \quad T_k = \left\lceil \frac{M^2 N k^2}{\tilde{D}L^2} \right\rceil \quad (3.28)$$

for some $\tilde{D} > 0$, then

$$\Psi(\bar{x}_N) - \Psi(x^*) \le \frac{2L}{N(N + 1)}\left[\frac{3V(x_0, x^*)}{\nu} + 2\tilde{D}\right], \quad \forall N \ge 1. \quad (3.29)$$

b) If $X$ is compact, and $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ are set to

$$\beta_k = \frac{9L(1 - P_{T_k})}{2\nu(k + 1)}, \quad \gamma_k = \frac{3}{k + 2}, \quad \text{and} \quad T_k = \left\lceil \frac{M^2(k + 1)^3}{\tilde{D}L^2} \right\rceil \quad (3.30)$$

for some $\tilde{D} > 0$, then

$$\Psi(\bar{x}_N) - \Psi(x^*) \le \frac{L}{(N + 1)(N + 2)}\left[\frac{27\bar{V}(x^*)}{2\nu} + \frac{8\tilde{D}}{3}\right], \quad \forall N \ge 1. \quad (3.31)$$

Proof. We first show part a). By the definitions of $P_t$ and $p_t$ in (3.8) and (3.27), we have

$$P_t = \frac{tP_{t-1}}{t + 2} = \cdots = \frac{2}{(t + 1)(t + 2)}. \quad (3.32)$$

Using the above identity and (3.27), we can easily see that the condition on $\theta_t$ in (3.8) holds. It also follows from (3.32) and the definition of $T_k$ in (3.28) that

$$P_{T_k} \le P_{T_{k-1}} \le \cdots \le P_{T_1} \le \frac{1}{3}. \quad (3.33)$$

Now, it can be easily seen from the definitions of $\beta_k$ and $\gamma_k$ in (3.28) that (3.13) holds. It also follows from (3.20) and (3.28) that

$$\Gamma_k = \frac{2}{k(k + 1)}. \quad (3.34)$$

By (3.28), (3.33), and (3.34), we have

$$\frac{\gamma_k\beta_k}{\Gamma_k(1 - P_{T_k})} = \frac{2L}{\nu(1 - P_{T_k})} \le \frac{2L}{\nu(1 - P_{T_{k-1}})} = \frac{\gamma_{k-1}\beta_{k-1}}{\Gamma_{k-1}(1 - P_{T_{k-1}})},$$

from which (3.21) follows. Now, by (3.32) and the fact that $p_t = t/2$, we have

$$\sum_{i=1}^{T_k}\frac{1}{p_i^2 P_{i-1}} = 2\sum_{i=1}^{T_k}\frac{i + 1}{i} \le 4T_k, \quad (3.35)$$

which, together with (3.28) and (3.34), then implies that

$$\sum_{i=1}^{T_k}\frac{\gamma_k P_{T_k}}{\Gamma_k\beta_k(1 - P_{T_k})p_i^2 P_{i-1}} \le \frac{4\gamma_k P_{T_k}T_k}{\Gamma_k\beta_k(1 - P_{T_k})} = \frac{4\nu k^2}{L(T_k + 3)}. \quad (3.36)$$

Using this observation, (3.22), (3.33), and (3.34), we have

$$\mathcal{B}_d(N) \le \frac{4LV(x_0, x^*)}{\nu N(N + 1)(1 - P_{T_1})} + \frac{4M^2}{LN(N + 1)}\sum_{k=1}^N\frac{k^2}{T_k + 3} \le \frac{6LV(x_0, x^*)}{\nu N(N + 1)} + \frac{4M^2}{LN(N + 1)}\sum_{k=1}^N\frac{k^2}{T_k + 3},$$

which, in view of Theorem 1.a) and the definition of $T_k$ in (3.28), then clearly implies (3.29).

Now let us show that part b) holds. It follows from (3.33) and the definitions of $\beta_k$ and $\gamma_k$ in (3.30) that

$$\beta_k \ge \frac{3L}{\nu(k + 1)} \ge \frac{L\gamma_k}{\nu} \quad (3.37)$$

and hence that (3.13) holds. It also follows from (3.20) and (3.30) that

$$\Gamma_k = \frac{6}{k(k + 1)(k + 2)}, \quad k \ge 1, \quad (3.38)$$

and hence that

$$\frac{\gamma_k\beta_k}{\Gamma_k(1 - P_{T_k})} = \frac{k(k + 1)}{2}\cdot\frac{9L}{2\nu(k + 1)} = \frac{9Lk}{4\nu}, \quad (3.39)$$

which implies that (3.23) holds. Using (3.30), (3.33), (3.35), and (3.37), we have

$$\sum_{i=1}^{T_k}\frac{\gamma_k P_{T_k}}{\Gamma_k\beta_k(1 - P_{T_k})p_i^2 P_{i-1}} \le \frac{4\gamma_k P_{T_k}T_k}{\Gamma_k\beta_k(1 - P_{T_k})} = \frac{4\nu k(k + 1)^2 P_{T_k}T_k}{9L(1 - P_{T_k})^2} = \frac{8\nu k(k + 1)^2(T_k + 1)(T_k + 2)}{9LT_k(T_k + 3)^2} \le \frac{8\nu k(k + 1)^2}{9LT_k}. \quad (3.40)$$

Using this observation, (3.30), (3.38), and Theorem 1.b), we conclude that

$$\Psi(\bar{x}_N) - \Psi(x^*) \le \frac{\gamma_N\beta_N\bar{V}(x^*)}{1 - P_{T_N}} + \frac{M^2\Gamma_N}{2\nu}\sum_{k=1}^N\frac{8\nu k(k + 1)^2}{9LT_k} \le \frac{\gamma_N\beta_N\bar{V}(x^*)}{1 - P_{T_N}} + \frac{8L\tilde{D}}{3(N + 1)(N + 2)} \le \frac{L}{(N + 1)(N + 2)}\left[\frac{27\bar{V}(x^*)}{2\nu} + \frac{8\tilde{D}}{3}\right].$$

Observe that by (3.3) and (3.32), with the selection of $p_t = t/2$, the definition of $\tilde{u}_t$ in the PS procedure can be simplified as

$$\tilde{u}_t = \frac{(t + 2)(t - 1)}{t(t + 3)}\tilde{u}_{t-1} + \frac{2(t + 1)}{t(t + 3)}u_t.$$
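The closed forms (3.32) and (3.34) used in this proof are easy to sanity-check numerically; the short script below (ours) verifies them against the recursions (3.8) and (3.20).

```python
# Quick numerical check (ours) of the closed forms (3.32) and (3.34)
# implied by the choices p_t = t/2 and gamma_k = 2/(k+1).
P = 1.0                                             # P_0 = 1, cf. (3.8)
for t in range(1, 11):
    p_t = t / 2.0
    P = p_t / (1 + p_t) * P                         # recursion (3.8)
    assert abs(P - 2.0 / ((t + 1) * (t + 2))) < 1e-12   # closed form (3.32)

Gamma = 1.0                                         # Gamma_1 = 1, cf. (3.20)
for k in range(2, 11):
    Gamma *= 1 - 2.0 / (k + 1)                      # recursion (3.20)
    assert abs(Gamma - 2.0 / (k * (k + 1))) < 1e-12     # closed form (3.34)
```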
In view of Corollary 1, we can establish the complexity of the GS algorithm for finding an $\epsilon$-solution of problem (1.1).

Corollary 2 Suppose that $\{p_t\}$ and $\{\theta_t\}$ are set to (3.27). Also assume that there exists an estimate $\mathcal{D}_X > 0$ s.t.

$$V(x, y) \le \mathcal{D}_X, \quad \forall x, y \in X. \quad (3.41)$$

If $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ are set to (3.28) with $\tilde{D} = 3\mathcal{D}_X/(2\nu)$ for some $N > 0$, then the total number of evaluations for $\nabla f$ and $h'$ can be bounded by

$$O\left(\sqrt{\frac{L\mathcal{D}_X}{\nu\epsilon}}\right) \quad (3.42)$$

and

$$O\left(\frac{M^2\mathcal{D}_X}{\nu\epsilon^2} + \sqrt{\frac{L\mathcal{D}_X}{\nu\epsilon}}\right), \quad (3.43)$$

respectively. Moreover, the above two complexity bounds also hold if $X$ is bounded, and $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ are set to (3.30) with $\tilde{D} = 81\mathcal{D}_X/(16\nu)$.

Proof. In view of Corollary 1.a), if $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ are set to (3.28), the total number of outer iterations (or gradient evaluations) performed by the GS algorithm to find an $\epsilon$-solution of (1.1) can be bounded by

$$N \le \sqrt{\frac{L}{\epsilon}\left[\frac{3V(x_0, x^*)}{\nu} + 2\tilde{D}\right]} \le \sqrt{\frac{6L\mathcal{D}_X}{\nu\epsilon}}. \quad (3.44)$$

Moreover, using the definition of $T_k$ in (3.28), we conclude that the total number of inner iterations (or subgradient evaluations) can be bounded by

$$\sum_{k=1}^N T_k \le \sum_{k=1}^N\left[\frac{M^2 N k^2}{\tilde{D}L^2} + 1\right] \le \frac{M^2 N(N + 1)^3}{3\tilde{D}L^2} + N = \frac{2\nu M^2 N(N + 1)^3}{9\mathcal{D}_X L^2} + N,$$

which, in view of (3.44), then clearly implies the bound in (3.43). Using Corollary 1.b) and similar arguments, we can show that the complexity bounds (3.42) and (3.43) also hold when $X$ is bounded, and $\{\beta_k\}$, $\{\gamma_k\}$, and $\{T_k\}$ are set to (3.30).

In view of Corollary 2, the GS algorithm can achieve the optimal complexity bound for solving problem (1.1) in terms of the number of evaluations for both $\nabla f$ and $h'$. To the best of our knowledge, this is the first time that this type of algorithm has been developed in the literature. It is also worth noting that we can relax the requirement on $\mathcal{D}_X$ in (3.41) to $V(x_0, x^*) \le \mathcal{D}_X$ or $\max_{x \in X} V(x, x^*) \le \mathcal{D}_X$, respectively, when the stepsize policy in (3.28) or in (3.30) is used. Accordingly, we can tighten the complexity bounds in (3.42) and (3.43) by a constant factor.

4 Stochastic gradient sliding

In this section, we consider the situation when the computation of stochastic subgradients of $h$ is much easier than that of exact subgradients. This situation happens, for example, when $h$ is given in the form of an expectation or as the summation of many nonsmooth components. By presenting a stochastic gradient sliding (SGS) method, we show that similar complexity bounds as in Section 3 for solving problem (1.1) can still be obtained in expectation or with high probability, but the iteration cost of the SGS method can be substantially smaller than that of the GS method.

More specifically, we assume that the nonsmooth component $h$ is represented by a stochastic oracle (SO) satisfying (1.6) and (1.7). Sometimes, we augment (1.7) by a "light-tail" assumption:

$$\mathbb{E}[\exp(\|H(u, \xi) - h'(u)\|_*^2/\sigma^2)] \le \exp(1). \quad (4.1)$$

It can be easily seen that (4.1) implies (1.7) by Jensen's inequality.

The stochastic gradient sliding (SGS) algorithm is obtained by simply replacing the exact subgradients in the PS procedure with the stochastic subgradients returned by the SO. This algorithm is formally described as follows.

Algorithm 2 The stochastic gradient sliding (SGS) algorithm

The algorithm is the same as GS except that the identity (3.2) in the PS procedure is replaced by

$$u_t = \arg\min_{u \in X}\{g(u) + \langle H(u_{t-1}, \xi_{t-1}), u \rangle + \beta V(x, u) + \beta p_t V(u_{t-1}, u) + \mathcal{X}(u)\}. \quad (4.2)$$

The above modified PS procedure is called the SPS (stochastic PS) procedure.
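A minimal sketch (ours) of the single change that turns PS into SPS: step (3.2) becomes (4.2), with the exact subgradient replaced by a call to the SO. The noise model below is an illustrative assumption only, chosen so that (1.6)-(1.7) hold with the stated sigma.

```python
import numpy as np

# Hypothetical SO for the earlier l1 instance: unbiased (1.6), with
# E||H(u) - h'(u)||^2 = sigma^2 (1.7), via scaled Gaussian noise.
def make_stochastic_oracle(subgrad_h, sigma, rng=np.random.default_rng(1)):
    def H(u):
        noise = rng.standard_normal(u.shape)
        return subgrad_h(u) + sigma * noise / np.sqrt(u.size)
    return H

# Inside the PS sketch from Section 3, step (3.2) becomes (4.2) by calling
# H(u) where subgrad_h(u) was called before, e.g.
#   u = (beta*x + beta*p_t*u - (grad_at_xunder + H(u))) / (beta*(1 + p_t))
```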
We add a few remarks about the above SGS algorithm. Firstly, in this algorithm, we assume that the exact gradient of $f$ will be used throughout the $T_k$ inner iterations. This is different from the accelerated stochastic approximation in [11], where one needs to compute $\nabla f$ at each subgradient projection step. Secondly, let us denote

$$\tilde{l}_h(u_{t-1}, u) := h(u_{t-1}) + \langle H(u_{t-1}, \xi_{t-1}), u - u_{t-1} \rangle. \quad (4.3)$$

It can be easily seen that (4.2) is equivalent to

$$u_t = \arg\min_{u \in X}\{g(u) + \tilde{l}_h(u_{t-1}, u) + \beta V(x, u) + \beta p_t V(u_{t-1}, u) + \mathcal{X}(u)\}. \quad (4.4)$$

This problem reduces to (3.2) if there is no stochastic noise associated with the SO, i.e., $\sigma = 0$ in (1.7). Thirdly, note that we have not provided the specification of $\{\beta_k\}$, $\{\gamma_k\}$, $\{T_k\}$, $\{p_t\}$, and $\{\theta_t\}$ in the SGS algorithm. Similarly to Section 3, we will return to this issue after establishing some convergence properties of the generic SPS procedure and SGS algorithm.

The following result describes some important convergence properties of the SPS procedure.

Proposition 3 Assume that $\{p_t\}$ and $\{\theta_t\}$ in the SPS procedure satisfy (3.8). Then for any $t \ge 1$ and $u \in X$,

$$\beta(1 - P_t)^{-1}V(u_t, u) + [\Phi(\tilde{u}_t) - \Phi(u)] \le \beta P_t(1 - P_t)^{-1}V(u_0, u) + P_t(1 - P_t)^{-1}\sum_{i=1}^t (p_i P_{i-1})^{-1}\left[\frac{(M + \|\delta_i\|_*)^2}{2\nu\beta p_i} + \langle \delta_i, u - u_{i-1} \rangle\right], \quad (4.5)$$

where $\Phi$ is defined in (3.4) and

$$\delta_t := H(u_{t-1}, \xi_{t-1}) - h'(u_{t-1}), \quad \text{with} \quad h'(u_{t-1}) = \mathbb{E}[H(u_{t-1}, \xi_{t-1})]. \quad (4.6)$$

Proof. Let $\tilde{l}_h(u_{t-1}, u)$ be defined in (4.3). Clearly, we have

$$\tilde{l}_h(u_{t-1}, u) - l_h(u_{t-1}, u) = \langle \delta_t, u - u_{t-1} \rangle.$$

Using this observation and (3.10), we obtain

$$\Phi(u_t) \le g(u_t) + l_h(u_{t-1}, u_t) + \beta V(x, u_t) + \mathcal{X}(u_t) + M\|u_t - u_{t-1}\|$$
$$= g(u_t) + \tilde{l}_h(u_{t-1}, u_t) - \langle \delta_t, u_t - u_{t-1} \rangle + \beta V(x, u_t) + \mathcal{X}(u_t) + M\|u_t - u_{t-1}\|$$
$$\le g(u_t) + \tilde{l}_h(u_{t-1}, u_t) + \beta V(x, u_t) + \mathcal{X}(u_t) + (M + \|\delta_t\|_*)\|u_t - u_{t-1}\|,$$

where the last inequality follows from the Cauchy-Schwarz inequality. Now applying Lemma 1 to (4.2), we obtain

$$g(u_t) + \tilde{l}_h(u_{t-1}, u_t) + \beta V(x, u_t) + \beta p_t V(u_{t-1}, u_t) + \mathcal{X}(u_t)$$
$$\le g(u) + \tilde{l}_h(u_{t-1}, u) + \beta V(x, u) + \beta p_t V(u_{t-1}, u) + \mathcal{X}(u) - \beta(1 + p_t)V(u_t, u)$$
$$= g(u) + l_h(u_{t-1}, u) + \langle \delta_t, u - u_{t-1} \rangle + \beta V(x, u) + \beta p_t V(u_{t-1}, u) + \mathcal{X}(u) - \beta(1 + p_t)V(u_t, u)$$
$$\le \Phi(u) + \beta p_t V(u_{t-1}, u) - \beta(1 + p_t)V(u_t, u) + \langle \delta_t, u - u_{t-1} \rangle,$$

where the last inequality follows from the convexity of $h$ and (3.4). Moreover, by the strong convexity of $\omega$,

$$-\beta p_t V(u_{t-1}, u_t) + (M + \|\delta_t\|_*)\|u_t - u_{t-1}\| \le -\frac{\nu\beta p_t}{2}\|u_t - u_{t-1}\|^2 + (M + \|\delta_t\|_*)\|u_t - u_{t-1}\| \le \frac{(M + \|\delta_t\|_*)^2}{2\nu\beta p_t},$$

where the last inequality follows from the simple fact that $-at^2/2 + bt \le b^2/(2a)$ for any $a > 0$. Combining the previous three inequalities, we conclude that

$$\Phi(u_t) - \Phi(u) \le \beta p_t V(u_{t-1}, u) - \beta(1 + p_t)V(u_t, u) + \frac{(M + \|\delta_t\|_*)^2}{2\nu\beta p_t} + \langle \delta_t, u - u_{t-1} \rangle.$$
Now dividing both sides of the above inequality by $1 + p_t$ and rearranging the terms, we obtain

$$\frac{\beta V(u_t, u) + \Phi(u_t) - \Phi(u)}{1 + p_t} \le \frac{\beta p_t}{1 + p_t}V(u_{t-1}, u) + \frac{(M + \|\delta_t\|_*)^2}{2\nu\beta(1 + p_t)p_t} + \frac{\langle \delta_t, u - u_{t-1} \rangle}{1 + p_t},$$

which, in view of Lemma 2, then implies that

$$\frac{\beta}{P_t}V(u_t, u) + \sum_{i=1}^t\frac{\Phi(u_i) - \Phi(u)}{P_i(1 + p_i)} \le \beta V(u_0, u) + \sum_{i=1}^t\left[\frac{(M + \|\delta_i\|_*)^2}{2\nu\beta P_i(1 + p_i)p_i} + \frac{\langle \delta_i, u - u_{i-1} \rangle}{P_i(1 + p_i)}\right]. \quad (4.7)$$

The result then immediately follows from the above inequality and (3.12).

It should be noted that the search points $\{u_t\}$ generated by different calls to the SPS procedure in different outer iterations of the SGS algorithm are distinct from each other. To avoid ambiguity, we use $u_{k,t}$, $k \ge 1$, $t \ge 0$, to denote the search points generated by the SPS procedure in the $k$-th outer iteration. Accordingly, we use

$$\delta_{k,t-1} := H(u_{k,t-1}, \xi_{t-1}) - h'(u_{k,t-1}), \quad k \ge 1, \ t \ge 1, \quad (4.8)$$

to denote the stochastic noises associated with the SO. Then, by (4.5), the definition of $\Phi_k$ in (2.9), and the origin of $(x_k, \tilde{x}_k)$ in the SGS algorithm, we have

$$\beta_k(1 - P_{T_k})^{-1}V(x_k, u) + [\Phi_k(\tilde{x}_k) - \Phi_k(u)] \le \beta_k P_{T_k}(1 - P_{T_k})^{-1}V(x_{k-1}, u) + P_{T_k}(1 - P_{T_k})^{-1}\sum_{i=1}^{T_k}\frac{1}{p_i P_{i-1}}\left[\frac{(M + \|\delta_{k,i-1}\|_*)^2}{2\nu\beta_k p_i} + \langle \delta_{k,i-1}, u - u_{k,i-1} \rangle\right] \quad (4.9)$$

for any $u \in X$ and $k \ge 1$. With the help of (4.9), we are now ready to establish the main convergence properties of the SGS algorithm.

Theorem 2 Suppose that $\{p_t\}$, $\{\theta_t\}$, $\{\beta_k\}$, and $\{\gamma_k\}$ in the SGS algorithm satisfy (3.8) and (3.13).

a) If relation (3.21) holds, then under Assumptions (1.6) and (1.7), we have, for any $N \ge 1$,

$$\mathbb{E}[\Psi(\bar{x}_N) - \Psi(x^*)] \le \tilde{\mathcal{B}}_d(N) := \frac{\Gamma_N\beta_1}{1 - P_{T_1}}V(x_0, x^*) + \frac{\Gamma_N}{\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\frac{(M^2 + \sigma^2)\gamma_k P_{T_k}}{\beta_k\Gamma_k(1 - P_{T_k})p_i^2 P_{i-1}}, \quad (4.10)$$

where $x^*$ is an arbitrary optimal solution of (1.1), and $P_t$ and $\Gamma_k$ are defined in (3.8) and (3.20), respectively.

b) If in addition, $X$ is compact and Assumption (4.1) holds, then

$$\mathrm{Prob}\left\{\Psi(\bar{x}_N) - \Psi(x^*) \ge \tilde{\mathcal{B}}_d(N) + \lambda\tilde{\mathcal{B}}_p(N)\right\} \le \exp\{-\lambda^2/3\} + \exp\{-\lambda\}, \quad (4.11)$$

for any $\lambda > 0$ and $N \ge 1$, where

$$\tilde{\mathcal{B}}_p(N) := \sigma\Gamma_N\left\{\frac{2\bar{V}(x^*)}{\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\left[\frac{\gamma_k P_{T_k}}{\Gamma_k(1 - P_{T_k})p_i P_{i-1}}\right]^2\right\}^{1/2} + \frac{\Gamma_N}{\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\frac{\sigma^2\gamma_k P_{T_k}}{\beta_k\Gamma_k(1 - P_{T_k})p_i^2 P_{i-1}}. \quad (4.12)$$

c) If $X$ is compact and relation (3.23) (instead of (3.21)) holds, then both part a) and part b) still hold by replacing the first term in the definition of $\tilde{\mathcal{B}}_d(N)$ with $\gamma_N\beta_N\bar{V}(x^*)/(1 - P_{T_N})$.

Proof. Using (3.19) and (4.9), we have

$$\Psi(\bar{x}_k) - \Psi(u) \le (1 - \gamma_k)[\Psi(\bar{x}_{k-1}) - \Psi(u)] + \gamma_k\left\{\frac{\beta_k}{1 - P_{T_k}}[V(x_{k-1}, u) - V(x_k, u)] + \frac{P_{T_k}}{1 - P_{T_k}}\sum_{i=1}^{T_k}\frac{1}{p_i P_{i-1}}\left[\frac{(M + \|\delta_{k,i-1}\|_*)^2}{2\nu\beta_k p_i} + \langle \delta_{k,i-1}, u - u_{k,i-1} \rangle\right]\right\}.$$

Using the above inequality and Lemma 2, we conclude that

$$\Psi(\bar{x}_N) - \Psi(u) \le \Gamma_N(1 - \gamma_1)[\Psi(\bar{x}_0) - \Psi(u)] + \Gamma_N\sum_{k=1}^N\frac{\beta_k\gamma_k}{\Gamma_k(1 - P_{T_k})}[V(x_{k-1}, u) - V(x_k, u)] + \Gamma_N\sum_{k=1}^N\frac{\gamma_k P_{T_k}}{\Gamma_k(1 - P_{T_k})}\sum_{i=1}^{T_k}\frac{1}{p_i P_{i-1}}\left[\frac{(M + \|\delta_{k,i-1}\|_*)^2}{2\nu\beta_k p_i} + \langle \delta_{k,i-1}, u - u_{k,i-1} \rangle\right].$$
The above relation, in view of (3.25) and the fact that $\gamma_1 = 1$, then implies that

$$\Psi(\bar{x}_N) - \Psi(u) \le \frac{\Gamma_N\beta_1}{1 - P_{T_1}}V(x_0, u) + \Gamma_N\sum_{k=1}^N\frac{\gamma_k P_{T_k}}{\Gamma_k(1 - P_{T_k})}\sum_{i=1}^{T_k}\frac{1}{p_i P_{i-1}}\left[\frac{M^2 + \|\delta_{k,i-1}\|_*^2}{\nu\beta_k p_i} + \langle \delta_{k,i-1}, u - u_{k,i-1} \rangle\right]. \quad (4.13)$$

We now provide bounds on the RHS of (4.13) in expectation or with high probability.

We first show part a). Note that by our assumptions on the SO, the random variable $\delta_{k,i-1}$ is independent of the search point $u_{k,i-1}$ and hence $\mathbb{E}[\langle \delta_{k,i-1}, x^* - u_{k,i-1} \rangle] = 0$. In addition, Assumption (1.7) implies that $\mathbb{E}[\|\delta_{k,i-1}\|_*^2] \le \sigma^2$. Using the previous two observations and taking expectation on both sides of (4.13) (with $u = x^*$), we obtain (4.10).

We now show that part b) holds. Note that by our assumptions on the SO and the definition of $u_{k,i}$, the sequence $\{\langle \delta_{k,i-1}, x^* - u_{k,i-1} \rangle\}_{k \ge 1, 1 \le i \le T_k}$ is a martingale-difference sequence. Denoting

$$\alpha_{k,i} := \frac{\gamma_k P_{T_k}}{\Gamma_k(1 - P_{T_k})p_i P_{i-1}},$$

and using the large-deviation theorem for martingale-difference sequences (e.g., Lemma 2 of [13]) and the fact that

$$\mathbb{E}\left[\exp\left\{\nu\alpha_{k,i}^2\langle \delta_{k,i-1}, x^* - u_{k,i-1} \rangle^2/\left(2\alpha_{k,i}^2\bar{V}(x^*)\sigma^2\right)\right\}\right] \le \mathbb{E}\left[\exp\left\{\nu\|\delta_{k,i-1}\|_*^2\|x^* - u_{k,i-1}\|^2/\left(2\bar{V}(x^*)\sigma^2\right)\right\}\right]$$
$$\le \mathbb{E}\left[\exp\left\{\|\delta_{k,i-1}\|_*^2 V(u_{k,i-1}, x^*)/\left(\bar{V}(x^*)\sigma^2\right)\right\}\right] \le \mathbb{E}\left[\exp\left\{\|\delta_{k,i-1}\|_*^2/\sigma^2\right\}\right] \le \exp\{1\},$$

we conclude that

$$\mathrm{Prob}\left\{\sum_{k=1}^N\sum_{i=1}^{T_k}\alpha_{k,i}\langle \delta_{k,i-1}, x^* - u_{k,i-1} \rangle > \lambda\sigma\sqrt{\frac{2\bar{V}(x^*)}{\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\alpha_{k,i}^2}\right\} \le \exp\{-\lambda^2/3\}, \quad \forall\lambda > 0. \quad (4.14)$$

Now let

$$S_{k,i} := \frac{\gamma_k P_{T_k}}{\beta_k\Gamma_k(1 - P_{T_k})p_i^2 P_{i-1}} \quad \text{and} \quad S := \sum_{k=1}^N\sum_{i=1}^{T_k}S_{k,i}.$$

By the convexity of the exponential function, we have

$$\mathbb{E}\left[\exp\left\{\frac{1}{S}\sum_{k=1}^N\sum_{i=1}^{T_k}S_{k,i}\|\delta_{k,i-1}\|_*^2/\sigma^2\right\}\right] \le \mathbb{E}\left[\frac{1}{S}\sum_{k=1}^N\sum_{i=1}^{T_k}S_{k,i}\exp\{\|\delta_{k,i-1}\|_*^2/\sigma^2\}\right] \le \exp\{1\},$$

where the last inequality follows from Assumption (4.1). Therefore, by Markov's inequality, for all $\lambda > 0$,

$$\mathrm{Prob}\left\{\sum_{k=1}^N\sum_{i=1}^{T_k}S_{k,i}\|\delta_{k,i-1}\|_*^2 > (1 + \lambda)\sigma^2\sum_{k=1}^N\sum_{i=1}^{T_k}S_{k,i}\right\} = \mathrm{Prob}\left\{\exp\left\{\frac{1}{S}\sum_{k=1}^N\sum_{i=1}^{T_k}S_{k,i}\|\delta_{k,i-1}\|_*^2/\sigma^2\right\} \ge \exp\{1 + \lambda\}\right\} \le \exp\{-\lambda\}. \quad (4.15)$$

Our result now directly follows from (4.13), (4.14), and (4.15). The proof of part c) is very similar to parts a) and b) in view of the bound in (3.26), and hence the details are skipped.

We now provide some specific choices for the parameters $\{\beta_k\}$, $\{\gamma_k\}$, $\{T_k\}$, $\{p_t\}$, and $\{\theta_t\}$ used in the SGS algorithm. In particular, while the stepsize policy in Corollary 3.a) requires the number of iterations $N$ to be given a priori, such an assumption is not needed in Corollary 3.b), given that $X$ is bounded. However, in order to provide some large-deviation results associated with the rate of convergence of the SGS algorithm (see (4.18) and (4.21) below), we need to assume the boundedness of $X$ in both Corollary 3.a) and Corollary 3.b).

Corollary 3 Assume that $\{p_t\}$ and $\{\theta_t\}$ in the SPS procedure are set to (3.27).

a) If $N$ is given a priori, $\{\beta_k\}$ and $\{\gamma_k\}$ are set to (3.28), and $\{T_k\}$ is given by

$$T_k = \left\lceil \frac{N(M^2 + \sigma^2)k^2}{\tilde{D}L^2} \right\rceil \quad (4.16)$$

for some $\tilde{D} > 0$, then under Assumptions (1.6) and (1.7), we have

$$\mathbb{E}[\Psi(\bar{x}_N) - \Psi(x^*)] \le \frac{2L}{N(N + 1)}\left[\frac{3V(x_0, x^*)}{\nu} + 4\tilde{D}\right], \quad \forall N \ge 1. \quad (4.17)$$
If in addition, $X$ is compact and Assumption (4.1) holds, then

$$\mathrm{Prob}\left\{\Psi(\bar{x}_N) - \Psi(x^*) \ge \frac{2L}{N(N + 1)}\left[\frac{3V(x_0, x^*)}{\nu} + 4(1 + \lambda)\tilde{D} + \frac{4\lambda\sqrt{\tilde{D}\bar{V}(x^*)}}{\sqrt{3\nu}}\right]\right\} \le \exp\{-\lambda^2/3\} + \exp\{-\lambda\}, \quad \forall\lambda > 0, \forall N \ge 1. \quad (4.18)$$

b) If $X$ is compact, $\{\beta_k\}$ and $\{\gamma_k\}$ are set to (3.30), and $\{T_k\}$ is given by

$$T_k = \left\lceil \frac{(M^2 + \sigma^2)(k + 1)^3}{\tilde{D}L^2} \right\rceil \quad (4.19)$$

for some $\tilde{D} > 0$, then under Assumptions (1.6) and (1.7), we have

$$\mathbb{E}[\Psi(\bar{x}_N) - \Psi(x^*)] \le \frac{L}{(N + 1)(N + 2)}\left[\frac{27\bar{V}(x^*)}{2\nu} + \frac{16\tilde{D}}{3}\right], \quad \forall N \ge 1. \quad (4.20)$$

If in addition, Assumption (4.1) holds, then

$$\mathrm{Prob}\left\{\Psi(\bar{x}_N) - \Psi(x^*) \ge \frac{L}{N(N + 2)}\left[\frac{27\bar{V}(x^*)}{2\nu} + \frac{8(2 + \lambda)\tilde{D}}{3} + \frac{12\lambda\sqrt{2\tilde{D}\bar{V}(x^*)}}{\sqrt{3\nu}}\right]\right\} \le \exp\{-\lambda^2/3\} + \exp\{-\lambda\}, \quad \forall\lambda > 0, \forall N \ge 1. \quad (4.21)$$

Proof. We first show part a). It can be easily seen from (3.34) that (3.13) holds. Moreover, using (3.28), (3.33), and (3.34), we can easily see that (3.21) holds. By (3.33), (3.34), (3.36), (4.10), and (4.16), we have

$$\tilde{\mathcal{B}}_d(N) \le \frac{4LV(x_0, x^*)}{\nu N(N + 1)(1 - P_{T_1})} + \frac{8(M^2 + \sigma^2)}{LN(N + 1)}\sum_{k=1}^N\frac{k^2}{T_k + 3} \le \frac{6LV(x_0, x^*)}{\nu N(N + 1)} + \frac{8(M^2 + \sigma^2)}{LN(N + 1)}\sum_{k=1}^N\frac{k^2}{T_k + 3} \le \frac{2L}{N(N + 1)}\left[\frac{3V(x_0, x^*)}{\nu} + 4\tilde{D}\right], \quad (4.22)$$

which, in view of Theorem 2.a), then clearly implies (4.17). Now observe that by the definition of $\gamma_k$ in (3.28) and relation (3.34),

$$\sum_{i=1}^{T_k}\left[\frac{\gamma_k P_{T_k}}{\Gamma_k(1 - P_{T_k})p_i P_{i-1}}\right]^2 = \left[\frac{2k}{T_k(T_k + 3)}\right]^2\sum_{i=1}^{T_k}(i + 1)^2 = \left[\frac{2k}{T_k(T_k + 3)}\right]^2\frac{(T_k + 1)(T_k + 2)(2T_k + 3)}{6} \le \frac{8k^2}{3T_k},$$

which together with (3.34), (3.36), and (4.12) then imply that

$$\tilde{\mathcal{B}}_p(N) \le \frac{2\sigma}{N(N + 1)}\left[\frac{2\bar{V}(x^*)}{\nu}\sum_{k=1}^N\frac{8k^2}{3T_k}\right]^{1/2} + \frac{8\sigma^2}{LN(N + 1)}\sum_{k=1}^N\frac{k^2}{T_k + 3}$$
$$\le \frac{2\sigma}{N(N + 1)}\left[\frac{16\tilde{D}L^2\bar{V}(x^*)}{3\nu(M^2 + \sigma^2)}\right]^{1/2} + \frac{8\tilde{D}L\sigma^2}{N(N + 1)(M^2 + \sigma^2)} \le \frac{8L}{N(N + 1)}\left[\frac{\sqrt{\tilde{D}\bar{V}(x^*)}}{\sqrt{3\nu}} + \tilde{D}\right].$$

Using the above inequality, (4.22), and Theorem 2.b), we obtain (4.18).

We now show that part b) holds. Note that $P_t$ and $\Gamma_k$ are given by (3.32) and (3.38), respectively. It then follows from (3.37) and (3.39) that both (3.13) and (3.23) hold. Using (3.40), the definitions of $\gamma_k$ and $\beta_k$ in (3.30), (4.19), and Theorem 2.c), we conclude that

$$\mathbb{E}[\Psi(\bar{x}_N) - \Psi(x^*)] \le \frac{\gamma_N\beta_N\bar{V}(x^*)}{1 - P_{T_N}} + \frac{\Gamma_N(M^2 + \sigma^2)}{\nu}\sum_{k=1}^N\sum_{i=1}^{T_k}\frac{\gamma_k P_{T_k}}{\beta_k\Gamma_k(1 - P_{T_k})p_i^2 P_{i-1}} \le \frac{\gamma_N\beta_N\bar{V}(x^*)}{1 - P_{T_N}} + \frac{16L\tilde{D}}{3(N + 1)(N + 2)} \le \frac{L}{(N + 1)(N + 2)}\left[\frac{27\bar{V}(x^*)}{2\nu} + \frac{16\tilde{D}}{3}\right]. \quad (4.23)$$

Now observe that by the definition of $\gamma_k$ in (3.30), the fact that $p_t = t/2$, (3.32), and (3.38), we have

$$\sum_{i=1}^{T_k}\left[\frac{\gamma_k P_{T_k}}{\Gamma_k(1 - P_{T_k})p_i P_{i-1}}\right]^2 = \left[\frac{k(k + 1)}{T_k(T_k + 3)}\right]^2\sum_{i=1}^{T_k}(i + 1)^2 = \left[\frac{k(k + 1)}{T_k(T_k + 3)}\right]^2\frac{(T_k + 1)(T_k + 2)(2T_k + 3)}{6} \le \frac{8k^4}{3T_k},$$

which together with (3.38), (3.40), and (4.12) then imply that

$$\tilde{\mathcal{B}}_p(N) \le \frac{6}{N(N + 1)(N + 2)}\left[\sigma\left(\frac{2\bar{V}(x^*)}{\nu}\sum_{k=1}^N\frac{8k^4}{3T_k}\right)^{1/2} + \frac{4\sigma^2}{9L}\sum_{k=1}^N\frac{k(k + 1)^2}{T_k}\right]$$
$$\le \frac{6}{N(N + 1)(N + 2)}\left[\sigma\left(\frac{8\bar{V}(x^*)\tilde{D}L^2 N(N + 1)}{3\nu(M^2 + \sigma^2)}\right)^{1/2} + \frac{4\sigma^2 L\tilde{D}N}{9(M^2 + \sigma^2)}\right] \le \frac{6L}{N(N + 2)}\left[\frac{2\sqrt{2\bar{V}(x^*)\tilde{D}}}{\sqrt{3\nu}} + \frac{4\tilde{D}}{9}\right].$$

The relation in (4.21) then immediately follows from the above inequality, (4.23), and Theorem 2.c).
Corollary 4 below states the complexity of the SGS algorithm for finding a stochastic $\epsilon$-solution of (1.1), i.e., a point $\bar{x} \in X$ s.t. $\mathbb{E}[\Psi(\bar{x}) - \Psi^*] \le \epsilon$ for some $\epsilon > 0$, as well as a stochastic $(\epsilon, \Lambda)$-solution of (1.1), i.e., a point $\bar{x} \in X$ s.t. $\mathrm{Prob}\{\Psi(\bar{x}) - \Psi^* \le \epsilon\} > 1 - \Lambda$ for some $\epsilon > 0$ and $\Lambda \in (0, 1)$. Since this result follows as an immediate consequence of Corollary 3, we skip the details of its proof.

Corollary 4 Suppose that $\{p_t\}$ and $\{\theta_t\}$ are set to (3.27). Also assume that there exists an estimate $\mathcal{D}_X > 0$ s.t. (3.41) holds.

a) If $\{\beta_k\}$ and $\{\gamma_k\}$ are set to (3.28), and $\{T_k\}$ is given by (4.16) with $\tilde{D} = 3\mathcal{D}_X/(4\nu)$ for some $N > 0$, then the number of evaluations for $\nabla f$ and $h'$, respectively, required by the SGS algorithm to find a stochastic $\epsilon$-solution of (1.1) can be bounded by

$$O\left(\sqrt{\frac{L\mathcal{D}_X}{\nu\epsilon}}\right) \quad (4.24)$$

and

$$O\left(\frac{(M^2 + \sigma^2)\mathcal{D}_X}{\nu\epsilon^2} + \sqrt{\frac{L\mathcal{D}_X}{\nu\epsilon}}\right). \quad (4.25)$$

b) If in addition, Assumption (4.1) holds, then the number of evaluations for $\nabla f$ and $h'$, respectively, required by the SGS algorithm to find a stochastic $(\epsilon, \Lambda)$-solution of (1.1) can be bounded by

$$O\left(\sqrt{\frac{L\mathcal{D}_X}{\nu\epsilon}}\max\left\{1, \log\frac{1}{\Lambda}\right\}\right) \quad (4.26)$$

and

$$O\left(\frac{M^2\mathcal{D}_X}{\nu\epsilon^2}\max\left\{1, \log^2\frac{1}{\Lambda}\right\} + \sqrt{\frac{L\mathcal{D}_X}{\nu\epsilon}}\max\left\{1, \log\frac{1}{\Lambda}\right\}\right). \quad (4.27)$$

c) The above bounds in parts a) and b) still hold if $X$ is bounded, $\{\beta_k\}$ and $\{\gamma_k\}$ are set to (3.30), and $\{T_k\}$ is given by (4.19) with $\tilde{D} = 81\mathcal{D}_X/(32\nu)$.

Observe that both bounds in (4.24) and (4.25) on the number of evaluations for $\nabla f$ and $h'$ are essentially not improvable. In fact, to the best of our knowledge, this is the first time that the $O(1/\sqrt{\epsilon})$ complexity bound on gradient evaluations has been established in the literature for stochastic approximation type algorithms applied to solve the composite problem in (1.1).

5 Generalization to strongly convex and structured nonsmooth optimization

Our goal in this section is to show that the gradient sliding techniques developed in Sections 3 and 4 can be further generalized to some other important classes of CP problems. More specifically, we first study in Subsection 5.1 the composite CP problems in (1.1) with $f$ being strongly convex, and then consider in Subsection 5.2 the case where $f$ is a special nonsmooth function given in a bi-linear saddle point form. Throughout this section, we assume that the nonsmooth component $h$ is represented by an SO (see Section 1). It is clear that our discussion also covers the deterministic composite problems as certain special cases by setting $\sigma = 0$ in (1.7) and (4.1).

5.1 Strongly convex optimization

In this section, we assume that the smooth component $f$ in (1.1) is strongly convex, i.e., there exists $\mu > 0$ such that

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|^2, \quad \forall x, y \in X. \quad (5.1)$$

In addition, throughout this section, we assume that the prox-function grows quadratically so that (2.4) is satisfied.
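For concreteness (our illustration, not from the paper), the least-squares instance sketched in Section 1 can be made strongly convex in the sense of (5.1) by adding a ridge term with an assumed modulus mu:

```python
import numpy as np

# f(x) = 0.5*||Ax - b||^2 + 0.5*mu*||x||^2 satisfies (5.1) with modulus mu,
# and (1.2) with L = lambda_max(A^T A) + mu; A and b are the matrices from
# the earlier hypothetical instance.
mu = 0.5

def grad_f_strongly_convex(x, A, b):
    return A.T @ (A @ x - b) + mu * x
```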
However, we will show in this subsection that this bound on the number of evaluations for $\nabla f$ can be significantly reduced to $O(\log(1/\epsilon))$ by properly restarting the SGS algorithm in Section 4. This multi-phase stochastic gradient sliding (M-SGS) algorithm is formally described as follows.

Algorithm 3 The multi-phase stochastic gradient sliding (M-SGS) algorithm

Input: initial point $y_0 \in X$, iteration limit $N_0$, and an initial estimate $\Delta_0$ s.t. $\Psi(y_0) - \Psi^* \le \Delta_0$.
for $s = 1, 2, \ldots, S$ do
  Run the SGS algorithm with $x_0 = y_{s-1}$, $N = N_0$, $\{p_t\}$ and $\{\theta_t\}$ in (3.27), $\{\beta_k\}$ and $\{\gamma_k\}$ in (3.28), and $\{T_k\}$ in (4.16) with $\tilde D = \Delta_0/(\nu\mu 2^s)$, and let $y_s$ be its output solution.
end for
Output: $y_S$.

We now establish the main convergence properties of the M-SGS algorithm described above.

Theorem 3 If $N_0 = \big\lceil 4\sqrt{2L/(\nu\mu)} \big\rceil$ in the M-SGS algorithm, then
\[
\mathbb{E}[\Psi(y_s) - \Psi^*] \le \frac{\Delta_0}{2^s}, \quad s \ge 0. \tag{5.2}
\]
As a consequence, the total number of evaluations for $\nabla f$ and $h'$, respectively, required by the M-SGS algorithm to find a stochastic $\epsilon$-solution of (1.1) can be bounded by
\[
O\left( \sqrt{\frac{L}{\nu\mu}}\, \log_2 \max\left\{\frac{\Delta_0}{\epsilon}, 1\right\} \right) \tag{5.3}
\]
and
\[
O\left( \frac{M^2+\sigma^2}{\nu\mu\epsilon} + \sqrt{\frac{L}{\nu\mu}}\, \log_2 \max\left\{\frac{\Delta_0}{\epsilon}, 1\right\} \right). \tag{5.4}
\]

Proof. We show (5.2) by induction. Note that (5.2) clearly holds for $s = 0$ by our assumption on $\Delta_0$. Now assume that (5.2) holds at phase $s-1$, i.e., $\Psi(y_{s-1}) - \Psi^* \le \Delta_0/2^{s-1}$ for some $s \ge 1$. In view of Corollary 3 and the definition of $y_s$, we have
\[
\mathbb{E}[\Psi(y_s) - \Psi^* \,|\, y_{s-1}] \le \frac{2L}{N_0(N_0+1)}\left[ \frac{3V(y_{s-1}, x^*)}{\nu} + 4\tilde D \right]
\le \frac{2L}{N_0^2}\left[ \frac{6}{\nu\mu}\big(\Psi(y_{s-1}) - \Psi^*\big) + 4\tilde D \right],
\]
where the second inequality follows from the strong convexity of $\Psi$ and (2.4). Now taking expectation on both sides of the above inequality w.r.t. $y_{s-1}$, and using the induction hypothesis and the definition of $\tilde D$ in the M-SGS algorithm, we conclude that
\[
\mathbb{E}[\Psi(y_s) - \Psi^*] \le \frac{2L}{N_0^2} \cdot \frac{8\Delta_0}{\nu\mu 2^{s-1}} \le \frac{\Delta_0}{2^s},
\]
where the last inequality follows from the definition of $N_0$.

Now, by (5.2), the total number of phases performed by the M-SGS algorithm can be bounded by $S = \lceil \log_2 \max\{\Delta_0/\epsilon, 1\} \rceil$. Using this observation, we can easily see that the total number of gradient evaluations of $\nabla f$ is given by $N_0 S$, which is bounded by (5.3). Now let us provide a bound on the total number of stochastic subgradient evaluations of $h'$. Without loss of generality, let us assume that $\Delta_0 > \epsilon$. Using the previous bound on $S$ and the definition of $T_k$, the total number of stochastic subgradient evaluations of $h'$ can be bounded by
\[
\sum_{s=1}^S \sum_{k=1}^{N_0} T_k \le \sum_{s=1}^S \sum_{k=1}^{N_0} \left[ \frac{\nu\mu N_0 (M^2+\sigma^2) k^2}{\Delta_0 L^2}\, 2^s + 1 \right]
\le \sum_{s=1}^S \left[ \frac{\nu\mu N_0 (M^2+\sigma^2)(N_0+1)^3}{3\Delta_0 L^2}\, 2^s + N_0 \right]
\le \frac{\nu\mu N_0 (N_0+1)^3 (M^2+\sigma^2)}{3\Delta_0 L^2}\, 2^{S+1} + N_0 S
\le \frac{4\nu\mu N_0 (N_0+1)^3 (M^2+\sigma^2)}{3\epsilon L^2} + N_0 S.
\]
This observation, in view of the definition of $N_0$, then clearly implies the bound in (5.4).
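Stripped of the analysis, Algorithm 3 is a short restart loop. The sketch below (ours, not part of the original text; `sgs` stands for a hypothetical implementation of the SGS method of Section 4, and the parameter names are assumptions) makes the phase structure explicit:

```python
import math

def m_sgs(sgs, y0, Delta0, L, nu, mu, eps):
    """Sketch (ours) of the M-SGS restart scheme of Algorithm 3.

    `sgs(x0, N, D_tilde)` is assumed to run N outer iterations of the
    SGS method with {p_t}, {theta_t} from (3.27), {beta_k}, {gamma_k}
    from (3.28), and {T_k} from (4.16) parameterized by D_tilde.
    """
    # Phase length from Theorem 3: N0 = ceil(4 sqrt(2L / (nu mu))).
    N0 = math.ceil(4.0 * math.sqrt(2.0 * L / (nu * mu)))
    # Number of phases needed to drive the expected gap below eps, cf. (5.2).
    S = math.ceil(math.log2(max(Delta0 / eps, 1.0)))
    y = y0
    for s in range(1, S + 1):
        # Halve the accuracy target each phase through D_tilde.
        D_tilde = Delta0 / (nu * mu * 2.0**s)
        y = sgs(x0=y, N=N0, D_tilde=D_tilde)
    return y
```

Note that every phase reuses the same iteration limit $N_0$ while $\tilde D$ shrinks geometrically; this is exactly what produces the $O(\log(1/\epsilon))$ gradient evaluation count in (5.3).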
We now add a few remarks about the results obtained in Theorem 3. Firstly, the M-SGS algorithm possesses optimal complexity bounds in terms of the number of gradient evaluations for $\nabla f$ and subgradient evaluations for $h'$, while existing algorithms only exhibit optimal complexity bounds on the number of stochastic subgradient evaluations (see [6]). Secondly, in Theorem 3 we only establish the optimal convergence of the M-SGS algorithm in expectation. It is also possible to establish the optimal convergence of this algorithm with high probability by making use of the light-tail assumption in (4.1) and a domain shrinking procedure similar to the one studied in Section 3 of [6].

5.2 Structured nonsmooth problems

Our goal in this subsection is to further generalize the gradient sliding algorithms to the situation when $f$ is nonsmooth, but can be closely approximated by a certain smooth convex function. More specifically, we assume that $f$ is given in the form of
\[
f(x) = \max_{y \in Y} \langle Ax, y \rangle - J(y), \tag{5.5}
\]
where $A: \mathbb{R}^n \to \mathbb{R}^m$ denotes a linear operator, $Y$ is a closed convex set, and $J: Y \to \mathbb{R}$ is a relatively simple, proper, convex, and lower semi-continuous (l.s.c.) function (i.e., problem (5.8) below is easy to solve). Observe that if $J$ is the convex conjugate of some convex function $F$ and $Y \equiv \mathrm{dom}\, J$, then problem (1.1) with $f$ given in (5.5) can be written equivalently as
\[
\min_{x \in X}\, h(x) + F(Ax) + X(x).
\]
Similarly to the previous subsection, we focus on the situation when $h$ is represented by a SO. Stochastic composite problems in this form have wide applications in machine learning, for example, to minimize the regularized loss function in
\[
\min_{x \in X}\, \mathbb{E}_\xi[l(x,\xi)] + F(Ax),
\]
where $l(\cdot,\xi)$ is a convex loss function for any $\xi \in \Xi$ and $F(Ax)$ is a certain regularization (e.g., low-rank tensor [10,20], overlapped group lasso [7,14], and graph regularization [7,19]).

Since $f$ in (5.5) is nonsmooth, we cannot directly apply the gradient sliding methods developed in the previous sections. However, as shown by Nesterov [17], the function $f(\cdot)$ in (5.5) can be closely approximated by a class of smooth convex functions. More specifically, for a given strongly convex function $v: Y \to \mathbb{R}$ such that
\[
v(y) \ge v(x) + \langle \nabla v(x), y - x \rangle + \frac{\nu'}{2}\|y - x\|^2, \quad \forall x, y \in Y \tag{5.6}
\]
for some $\nu' > 0$, let us denote $c_v := \mathrm{argmin}_{y \in Y} v(y)$,
\[
d(y) := v(y) - v(c_v) - \langle \nabla v(c_v), y - c_v \rangle \quad \text{and} \quad D_Y := \max_{y \in Y} d(y). \tag{5.7}
\]
Then the function $f(\cdot)$ in (5.5) can be closely approximated by
\[
f_\eta(x) := \max_y \big\{ \langle Ax, y \rangle - J(y) - \eta\, d(y) : y \in Y \big\}. \tag{5.8}
\]
Indeed, by definition we have $0 \le d(y) \le D_Y$ and hence, for any $\eta \ge 0$,
\[
f(x) - \eta D_Y \le f_\eta(x) \le f(x), \quad \forall x \in X. \tag{5.9}
\]
Moreover, Nesterov [17] shows that $f_\eta(\cdot)$ is differentiable and its gradients are Lipschitz continuous with the Lipschitz constant given by
\[
L_\eta := \frac{\|A\|^2}{\eta\nu'}. \tag{5.10}
\]
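As a concrete instance of this smoothing scheme (our illustration, not from the original text), take $f(x) = \|Ax\|_1$, which has the form (5.5) with $J \equiv 0$ and $Y = \{y \in \mathbb{R}^m : \|y\|_\infty \le 1\}$. Choosing $v(y) = \frac{1}{2}\|y\|_2^2$ gives $\nu' = 1$, $c_v = 0$, $d(y) = \frac{1}{2}\|y\|_2^2$, and $D_Y = m/2$, and $f_\eta$ reduces to a sum of Huber penalties with a closed-form gradient:

```python
import numpy as np

def f_eta_and_grad(A, x, eta):
    """Sketch (ours): Nesterov smoothing (5.8) for f(x) = ||Ax||_1.

    With J = 0, Y the unit box, and d(y) = ||y||^2 / 2, the inner
    maximizer of (5.8) is y* = clip(Ax / eta, -1, 1), f_eta is a sum of
    Huber penalties, and grad f_eta(x) = A^T y* is Lipschitz with
    constant ||A||^2 / eta, in line with (5.10).
    """
    t = A @ x
    y_star = np.clip(t / eta, -1.0, 1.0)
    # Huber value: quadratic near zero, linear beyond eta.
    vals = np.where(np.abs(t) <= eta, t**2 / (2 * eta), np.abs(t) - eta / 2)
    return vals.sum(), A.T @ y_star
```

Smaller $\eta$ tightens the approximation in (5.9) at the price of a larger Lipschitz constant in (5.10); the choice of $\eta$ in Theorem 4 below balances exactly this trade-off.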
We are now ready to present a smoothing stochastic gradient sliding (S-SGS) method and study its convergence properties.

Theorem 4 Let $(\bar x_k, x_k)$ be the search points generated by a smoothing stochastic gradient sliding (S-SGS) method, which is obtained by replacing $f$ with $f_\eta(\cdot)$ in the definition of $g_k$ in the SGS method. Suppose that $\{p_t\}$ and $\{\theta_t\}$ in the SPS procedure are set to (3.27). Also assume that $\{\beta_k\}$ and $\{\gamma_k\}$ are set to (3.28) and that $T_k$ is given by (4.16) with $\tilde D = 3D_X/(4\nu)$ for some $N \ge 1$, where $D_X$ is given by (3.41). If
\[
\eta = \frac{2\|A\|}{N} \sqrt{\frac{3D_X}{\nu\nu' D_Y}},
\]
then the total number of outer iterations and inner iterations performed by the S-SGS algorithm to find an $\epsilon$-solution of (1.1) can be bounded by
\[
O\left( \frac{\|A\|\sqrt{D_X D_Y}}{\epsilon\sqrt{\nu\nu'}} \right) \tag{5.11}
\]
and
\[
O\left( \frac{(M^2+\sigma^2)\, V(x_0,x^*)}{\nu\epsilon^2} + \frac{\|A\|\sqrt{D_Y V(x_0,x^*)}}{\sqrt{\nu\nu'}\,\epsilon} \right), \tag{5.12}
\]
respectively.

Proof. Let us denote $\Psi_\eta(x) := f_\eta(x) + h(x) + X(x)$. In view of (4.17) and (5.10), we have
\[
\mathbb{E}[\Psi_\eta(\bar x_N) - \Psi_\eta(x)] \le \frac{2L_\eta}{N(N+1)}\left[ \frac{3V(x_0,x)}{\nu} + 4\tilde D \right]
= \frac{2\|A\|^2}{\eta\nu' N(N+1)}\left[ \frac{3V(x_0,x)}{\nu} + 4\tilde D \right], \quad \forall x \in X,\ N \ge 1.
\]
Moreover, it follows from (5.9) that
\[
\Psi_\eta(\bar x_N) - \Psi_\eta(x) \ge \Psi(\bar x_N) - \Psi(x) - \eta D_Y.
\]
Combining the above two inequalities, we obtain
\[
\mathbb{E}[\Psi(\bar x_N) - \Psi(x)] \le \frac{2\|A\|^2}{\eta\nu' N(N+1)}\left[ \frac{3V(x_0,x)}{\nu} + 4\tilde D \right] + \eta D_Y, \quad \forall x \in X,
\]
which implies that
\[
\mathbb{E}[\Psi(\bar x_N) - \Psi(x^*)] \le \frac{2\|A\|^2}{\eta\nu' N(N+1)}\left[ \frac{3D_X}{\nu} + 4\tilde D \right] + \eta D_Y. \tag{5.13}
\]
Plugging the values of $\tilde D$ and $\eta$ into the above bound, we can easily see that
\[
\mathbb{E}[\Psi(\bar x_N) - \Psi(x^*)] \le \frac{4\sqrt{3}\,\|A\|\sqrt{D_X D_Y}}{\sqrt{\nu\nu'}\, N}, \quad \forall N \ge 1.
\]
It then follows from the above relation that the total number of outer iterations to find an $\epsilon$-solution of (1.1) can be bounded by
\[
\bar N(\epsilon) = \frac{4\sqrt{3}\,\|A\|\sqrt{D_X D_Y}}{\sqrt{\nu\nu'}\,\epsilon}.
\]
Now observe that the total number of inner iterations is bounded by
\[
\sum_{k=1}^{\bar N(\epsilon)} T_k \le \sum_{k=1}^{\bar N(\epsilon)} \left[ \frac{(M^2+\sigma^2)\,\bar N(\epsilon)\, k^2}{\tilde D L_\eta^2} + 1 \right].
\]
Combining these two observations, we conclude that the total number of inner iterations is bounded by (5.12).

In view of Theorem 4, by using the smoothing SGS algorithm we can significantly reduce the number of outer iterations, and hence the number of times needed to access the linear operators $A$ and $A^T$, from $O(1/\epsilon^2)$ to $O(1/\epsilon)$, in order to find an $\epsilon$-solution of (1.1), while still maintaining the optimal bound on the total number of stochastic subgradient evaluations for $h'$. It should be noted that, by using the result in Theorem 2.b), we can show that the aforementioned savings on the access to the linear operators $A$ and $A^T$ also hold with overwhelming probability under the light-tail assumption in (4.1) associated with the SO.
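To see how the ingredients of Theorem 4 would be instantiated numerically, the following sketch (ours, not from the original text; all names are hypothetical, and the closed form of (4.16) is taken from the way it is used in the proofs above) computes the outer budget $\bar N(\epsilon)$, the smoothing level $\eta$, the induced Lipschitz constant $L_\eta$, and the sliding periods $\{T_k\}$:

```python
import math

def s_sgs_plan(normA, D_X, D_Y, nu, nu_prime, M, sigma, eps):
    """Sketch (ours): parameter plan for the S-SGS method of Theorem 4."""
    # Outer iteration budget from the proof of Theorem 4.
    N = math.ceil(4 * math.sqrt(3) * normA * math.sqrt(D_X * D_Y)
                  / (math.sqrt(nu * nu_prime) * eps))
    # Smoothing level from Theorem 4 and the induced constant (5.10).
    eta = (2 * normA / N) * math.sqrt(3 * D_X / (nu * nu_prime * D_Y))
    L_eta = normA**2 / (eta * nu_prime)
    D_tilde = 3 * D_X / (4 * nu)
    # Sliding periods (4.16) with L replaced by L_eta.
    T = [math.ceil((M**2 + sigma**2) * N * k**2 / (D_tilde * L_eta**2))
         for k in range(1, N + 1)]
    return N, eta, L_eta, T
```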
6 Concluding remarks

In this paper, we present a new class of first-order methods which can significantly reduce the number of gradient evaluations for $\nabla f$ required to solve the composite problem in (1.1). More specifically, we show that, by using these algorithms, the total number of gradient evaluations can be significantly reduced from $O(1/\epsilon^2)$ to $O(1/\sqrt{\epsilon})$. As a result, these algorithms have the potential to significantly accelerate first-order methods for solving the composite problem in (1.1), especially when the bottleneck lies in the computation (or communication, in the case of distributed computing) of the gradient of the smooth component, as happens in many applications. We also establish similar complexity bounds for solving an important class of stochastic composite optimization problems by developing the stochastic gradient sliding methods. By properly restarting the gradient sliding algorithms, we demonstrate that dramatic savings on gradient evaluations (from $O(1/\epsilon)$ to $O(\log(1/\epsilon))$) can be achieved for solving strongly convex problems. Generalization to the case when $f$ is nonsmooth but possesses a bi-linear saddle point structure has also been discussed.

It should be pointed out that this paper focuses only on theoretical studies of the convergence properties associated with the gradient sliding algorithms. The practical performance of these algorithms, however, will certainly depend on our estimation of a few problem parameters, e.g., the Lipschitz constants $L$ and $M$. In addition, the sliding periods $\{T_k\}$ in both GS and SGS have been specified in a conservative way to obtain the optimal complexity bounds for gradient and subgradient evaluations. We expect that the practical performance of these algorithms will be further improved with proper incorporation of certain adaptive search procedures on $L$, $M$, and $\{T_k\}$, which will be very interesting research topics in the future.

References

1. A. Auslender and M. Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM Journal on Optimization, 16:697-725, 2006.
2. H.H. Bauschke, J.M. Borwein, and P.L. Combettes. Bregman monotone optimization algorithms. SIAM Journal on Control and Optimization, 42:596-636, 2003.
3. A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183-202, 2009.
4. L.M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.
5. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: a generic algorithmic framework. SIAM Journal on Optimization, 22:1469-1492, 2012.
6. S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, II: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23:2061-2089, 2013.
7. L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.
8. A. Juditsky, A.S. Nemirovski, and C. Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Manuscript, Georgia Institute of Technology, Atlanta, GA, 2011.
9. K.C. Kiwiel. Proximal minimization methods with generalized Bregman functions. SIAM Journal on Control and Optimization, 35:1142-1168, 1997.
10. T.G. Kolda and B.W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
11. G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365-397, 2012.
12. G. Lan. Bundle-level type methods uniformly optimal for smooth and non-smooth convex optimization. Manuscript, Department of Industrial and Systems Engineering, University of Florida, Gainesville, FL 32611, USA, January 2013. Mathematical Programming (to appear).
13. G. Lan, A.S. Nemirovski, and A. Shapiro. Validation analysis of mirror descent stochastic approximation method. Mathematical Programming, 134:425-458, 2012.
14. J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Convex and network flow optimization for structured sparsity. Journal of Machine Learning Research, 12:2681-2720, 2011.
15. Y.E. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN SSSR, 269:543-547, 1983.
16. Y.E. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Massachusetts, 2004.
17. Y.E. Nesterov. Smooth minimization of nonsmooth functions. Mathematical Programming, 103:127-152, 2005.
18. Y.E. Nesterov. Gradient methods for minimizing composite objective functions. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, September 2007.
19. R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91-108, 2005.
20. R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems, 25, 2011.
21. P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Manuscript, University of Washington, Seattle, May 2008.
