Mirror Descent on Riemannian Manifolds*

Jiaxin Jiang¹, Lei Shi¹,², and Jiyuan Tan¹

¹School of Mathematical Sciences, Fudan University, Shanghai 200433, China
²Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China

*Authors are listed in alphabetical order. The work of Lei Shi is supported by the National Natural Science Foundation of China [Grant No. 12171093]. Email addresses: jxjiang20@fudan.edu.cn (J. Jiang), leishi@fudan.edu.cn (L. Shi), jiyuantan19@gmail.com (J. Tan).

Abstract

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.

Keywords: Mirror descent, Riemannian optimization, Reparameterization, Non-asymptotic convergence analysis, Stiefel manifolds

MSC codes: 65K05, 90C06, 90C30

1 Introduction

In this paper, we study the Riemannian optimization problem
$$\min_{x \in \mathcal{M}} f(x),$$
where $\mathcal{M}$ is a Riemannian manifold and $f : \mathcal{M} \to \mathbb{R}$ is a smooth function on $\mathcal{M}$. This problem can be viewed as constrained nonlinear optimization with a manifold constraint. The manifold structure enables us to design efficient algorithms and perform detailed analysis. Riemannian optimization finds wide applications in practice. For instance, it can be applied to dictionary learning [8] and matrix decomposition [25, 23]. In deep learning, imposing orthogonal constraints on weight matrices can enhance training stability and improve generalization performance [3, 4]. Such constraints can be modeled by optimizing objective functions over the Stiefel manifold.

One line of research in this field focuses on generalizing unconstrained optimization algorithms to Riemannian manifolds. For example, Riemannian Gradient Descent [31], Riemannian Proximal Point [6], Riemannian Momentum [1], and Riemannian Stochastic Variance Reduced Gradient [30, 18] have been explored. However, Mirror Descent (MD), an important class of algorithms in Euclidean space, has been surprisingly unexplored in the existing literature. This paper takes a first step toward filling this gap.

The MD algorithm was proposed by Nemirovski and Yudin [16] as a generalization of gradient descent and was later popularized by the widely used formulation of Beck and Teboulle [5]. Consider the Euclidean optimization problem $\min_{x \in \Omega} f(x)$, where $\Omega \subseteq \mathbb{R}^d$ is a closed convex feasible set. Let $\psi : \Omega \to \mathbb{R}$ be a differentiable and strongly convex function, often called a potential function. The associated Bregman divergence [7] is defined as
$$D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle.$$
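To make the definition concrete, the following minimal Python sketch (illustrative only; the potentials, helper names, and test points are our own, not from the paper) evaluates $D_\psi$ for the two standard potentials discussed below: the squared Euclidean norm, for which the divergence is half the squared Euclidean distance, and the negative entropy on the simplex, for which it is the Kullback–Leibler divergence.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Bregman divergence D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# Potential 1: psi(x) = 0.5 * ||x||^2  ->  D_psi(x, y) = 0.5 * ||x - y||^2
sq = lambda x: 0.5 * np.dot(x, x)
sq_grad = lambda x: x

# Potential 2: negative entropy on the simplex  ->  D_psi(x, y) = KL(x || y)
ent = lambda x: np.sum(x * np.log(x))
ent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.6, 0.3])
print(bregman(sq, sq_grad, x, y))    # equals 0.5 * ||x - y||^2
print(bregman(ent, ent_grad, x, y))  # equals sum_i x_i log(x_i / y_i) on the simplex
```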
Given a step size $\eta_t > 0$, the MD update is
$$x_{t+1} = \arg\min_{x \in \Omega} \big\{ \langle \eta_t \nabla f(x_t), x - x_t \rangle + D_\psi(x, x_t) \big\},$$
where the Bregman term can be interpreted as a geometry-adapted regularizer. When $\Omega = \mathbb{R}^d$, the first-order optimality condition yields the implicit dual update
$$\nabla \psi(x_{t+1}) = \nabla \psi(x_t) - \eta_t \nabla f(x_t).$$
Different choices of $\psi$ recover algorithms suited to different geometries. For instance, taking $\psi(x) = \frac{1}{2}\|x\|_2^2$ reduces MD to the standard gradient descent update. When $\Omega$ is the probability simplex, the entropy potential $\psi(x) = \sum_{i=1}^d x_i \log x_i$ (with the convention $0 \log 0 = 0$) leads to Entropy Mirror Descent, and $D_\psi$ becomes the Kullback–Leibler divergence [15]. By adapting to the problem geometry with mild dependence on dimension, MD has found widespread applications, including image processing [5], policy optimization [27, 28], and neural network training [22, 20].

Subsequent work further broadened the study of MD, with particular emphasis on its extensions and connections to other methods. Nemirovski and Juditsky [15] considered using MD to solve stochastic optimization and stochastic saddle-point problems. Duchi et al. [10] generalized these results to composite functions. Recent research has focused on providing alternative explanations for MD. Gunasekar et al. [12] demonstrated that MD can be interpreted as the gradient flow on the manifold as the step sizes tend to zero. Amid and Warmuth [2] observed that MD becomes equivalent to gradient descent under different parameterizations. Li et al. [14] utilized this observation to explain the implicit regularization in over-parameterized models. Lei and Zhou [13] studied online mirror descent in a setting where data arrive sequentially.

The literature on Riemannian optimization is extensive. We briefly review several related results. Zhang and Sra [31] established the first non-asymptotic convergence rates of geodesic gradient descent in both deterministic and stochastic settings. Later, Bento et al. [6] analyzed Riemannian subgradient and proximal point methods; however, their results rely on the assumption that the underlying manifold is Hadamard. More recently, Srinivasan and Wilson [21] extended this non-asymptotic convergence analysis to general Riemannian manifolds by constructing a suitable potential function. Alimisis et al. [1] studied momentum-based acceleration of first-order methods on manifolds. In the stochastic setting, Zhang and Sra [30] and Sato et al. [18] applied variance-reduction techniques and derived improved complexity bounds for finite-sum problems. Tripuraneni et al. [24] generalized Polyak–Ruppert averaging to the manifold setting and showed that the resulting averaged iterates enjoy improved convergence guarantees for stochastic gradient descent on manifolds. Recently, Shu et al. [19] introduced a quasilinearization framework to analyze proximal-based methods on Hadamard manifolds.

Our contributions can be summarized as follows:

• We generalize Euclidean Mirror Descent to optimization on Riemannian manifolds by developing a Riemannian Mirror Descent (RMD) framework via reparameterization. Building on the proposed framework, we further propose a stochastic variant of RMD (stochastic RMD) suitable for large-scale settings in which only stochastic gradient estimates are available.
• We establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. Let $T$ denote the total number of iterations. For geodesically convex objectives, we prove that the objective error decays at a sublinear rate of $O(1/T)$ in the deterministic setting and $O(1/T^{1/3})$ in the stochastic setting. For smooth nonconvex objectives, we show that the average squared gradient norm decays at a rate of $O(1/T)$ in the deterministic case and $O(1/T^{1/2})$ in the stochastic case. Our theoretical results fill the gap left by [2, 12].

• As an application to optimization on the Stiefel manifold, we show that our RMD framework recovers the Curvilinear Gradient Descent (CGD) method of [26]. Moreover, specializing stochastic RMD to the Stiefel setting yields a stochastic extension of CGD, which we term Stochastic Curvilinear Gradient Descent (SCGD). The proposed SCGD is scalable and computationally efficient for large-scale manifold optimization problems and may be of independent interest. We further validate its effectiveness through numerical experiments.

Outline of this paper. In Section 2, we introduce the notation and assumptions used throughout the paper. In Section 3, we present the framework of RMD and stochastic RMD and provide several concrete instantiations. In Section 4, we analyze the convergence rates of our algorithms. In Section 5, we demonstrate the effectiveness of the proposed methods through numerical experiments. We conclude with a discussion and future research directions in Section 6.

2 Notation and Theoretical Background

We first recall some basic notions in Riemannian geometry.

Definition 1. A Riemannian manifold $(\mathcal{M}, g)$ is a smooth manifold endowed with a Riemannian metric $g$. Let $T_x\mathcal{M}$ be the tangent space of $\mathcal{M}$ at a point $x$. The tangent bundle of $\mathcal{M}$ is $T\mathcal{M} = \cup_{x \in \mathcal{M}} T_x\mathcal{M}$. A Riemannian metric $g$ assigns to each $x \in \mathcal{M}$ an inner product $g_x(\cdot, \cdot) = \langle \cdot, \cdot \rangle_x$, which varies smoothly with $x$. The induced norm on $T_x\mathcal{M}$ is denoted by $\|\cdot\|_x$.

Differentials and gradients are indispensable for optimization on manifolds.

Definition 2. Let $f : \mathcal{M} \to \mathbb{R}$ be differentiable. The differential of $f$ at $x \in \mathcal{M}$ is the linear map $\mathrm{D}f(x) : T_x\mathcal{M} \to \mathbb{R}$. The Riemannian gradient $\nabla f(x) \in T_x\mathcal{M}$ is the unique vector satisfying
$$\mathrm{D}f(x)[v] = \langle \nabla f(x), v \rangle_x, \quad \forall v \in T_x\mathcal{M}.$$

For a differentiable map $F : \mathcal{M} \to \mathcal{N}$ between manifolds, we denote by $\mathrm{D}F(x) : T_x\mathcal{M} \to T_{F(x)}\mathcal{N}$ its differential at $x \in \mathcal{M}$, and write $\mathrm{D}F(x)[v] \in T_{F(x)}\mathcal{N}$ for its action on $v \in T_x\mathcal{M}$. We also denote by $\nabla$ the Levi–Civita connection on $(\mathcal{M}, g)$. In what follows, we will mainly use it in the form of the covariant derivative along a smooth curve.

Definition 3. For a smooth curve $\gamma : [0, 1] \to \mathcal{M}$, we write $\dot{\gamma}(t)$ for its tangent vector field. The covariant derivative $\nabla_{\dot{\gamma}}$ is denoted by $\frac{\mathrm{d}}{\mathrm{d}t}$, as usual; thus the twice covariant derivative is $\frac{\mathrm{d}^2}{\mathrm{d}t^2}$. A curve $\gamma$ is called a geodesic if
$$\frac{\mathrm{d}^2 \gamma(t)}{\mathrm{d}t^2} = 0, \quad \forall t \in [0, 1].$$

In this paper, we assume that $(\mathcal{M}, g)$ is connected and complete (in the Riemannian sense). Equivalently, with the distance function $d_{\mathcal{M}}$ induced by the Riemannian metric, $(\mathcal{M}, d_{\mathcal{M}})$ is a complete metric space, or every geodesic can be extended for all time, i.e., the exponential map is defined on $T\mathcal{M}$.
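As a concrete illustration of Definition 2 (our own aside, not from the paper), consider the unit sphere $S^{d-1} \subset \mathbb{R}^d$ with the metric induced by the ambient Euclidean inner product: the Riemannian gradient of a smooth function is the orthogonal projection of its Euclidean gradient onto the tangent space, $\nabla_{S^{d-1}} f(x) = (I - xx^\top)\nabla f(x)$. A quick numerical check in Python:

```python
import numpy as np

def sphere_grad(euclid_grad, x):
    """Riemannian gradient on the unit sphere: project the Euclidean
    gradient onto the tangent space T_x S^{d-1} = {v : <x, v> = 0}."""
    return euclid_grad - np.dot(x, euclid_grad) * x

# Illustrative objective f(x) = 0.5 * x^T A x restricted to the sphere.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
x = rng.standard_normal(5); x /= np.linalg.norm(x)

g = sphere_grad(A @ x, x)
print(abs(np.dot(x, g)))   # ~0: the Riemannian gradient is tangent at x
```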
By the Hopf–Rinow theorem, for any $x, y \in \mathcal{M}$ there exists a minimizing geodesic segment $\gamma : [0, 1] \to \mathcal{M}$ joining them, satisfying $\gamma(0) = x$, $\gamma(1) = y$, whose length equals $d_{\mathcal{M}}(x, y)$. This minimizing geodesic $\gamma$ is not necessarily unique.

Definition 4. For any $x \in \mathcal{M}$, the exponential map $\mathrm{Exp}_x : T_x\mathcal{M} \to \mathcal{M}$ satisfies that for any $v \in T_x\mathcal{M}$, $\gamma(t) = \mathrm{Exp}_x(tv)$, $t \in [0, 1]$, is a geodesic and the length of $\gamma$ is $\|v\|_x$. For $x \in \mathcal{M}$ and $r > 0$, we denote by $N_r(x) = \{ y \in \mathcal{M} : d_{\mathcal{M}}(x, y) \leq r \}$ the closed geodesic ball of radius $r$ centered at $x$, and by $B_r(x) = \{ v \in T_x\mathcal{M} : \|v\| \leq r \}$ the ball centered at the origin of the tangent space. Let $\mathrm{inj}(\mathcal{M})$ be the injectivity radius of $\mathcal{M}$, so that for any $x \in \mathcal{M}$ and $r < \mathrm{inj}(\mathcal{M})$, the exponential map $\mathrm{Exp}_x|_{B_r(x)} : B_r(x) \to N_r(x)$ is a diffeomorphism when restricted to $B_r(x)$.

In general, since the geodesic equation $\frac{\mathrm{d}^2\gamma(t)}{\mathrm{d}t^2} = 0$ is of second order, the exponential map does not have a closed-form expression and can be extremely expensive to compute. In this case, a retraction map is used to approximate it.

Definition 5. A (first-order) retraction on $\mathcal{M}$ is a differentiable map $R : T\mathcal{M} \to \mathcal{M}$ such that $R|_{\mathcal{M}} = \mathrm{Id}$, and for each $x \in \mathcal{M}$, the map $R_x = R|_{T_x\mathcal{M}}$ satisfies $\mathrm{D}R_x(0) = \mathrm{Id}_{T_x\mathcal{M}}$.

A typical example of a retraction arises on the unit sphere $S^{d-1}$. For any point $x \in S^{d-1}$ and $v \in T_x S^{d-1}$, a retraction is
$$R_x(v) = \frac{x + v}{\|x + v\|},$$
that is, the projection of $x + v$ onto the sphere. The following definition measures how close a retraction is to the exponential map, in view of Lemma 2.

Definition 6. Let $R$ be a retraction on $\mathcal{M}$. $R$ is called $L_\Phi$-regular within radius $r$ if, for any $x \in \mathcal{M}$, there exists $0 < \rho < \mathrm{inj}(\mathcal{M})$ such that $R_x(B_r(x)) \subseteq N_\rho(x)$, the function $\Phi_x(u) = \mathrm{Exp}_x^{-1}(R_x(u))$ is twice continuously differentiable for $u \in B_r(x)$, and
$$\sup_{t \in [0, 1]} \|\mathrm{D}^2\Phi_x(tu)[u, u]\|_x \leq L_\Phi \|u\|_x^2, \quad \forall x \in \mathcal{M},\ u \in B_r(x).$$

With the notion of geodesics, we can define convexity on Riemannian manifolds.

Definition 7. A subset $A \subseteq \mathcal{M}$ is geodesically convex if for any $x, y \in A$ and for every minimizing geodesic segment $\gamma : [0, 1] \to \mathcal{M}$ connecting $x$ and $y$, we have $\gamma(t) \in A$ for all $t \in [0, 1]$.

Note that we do not require the minimizing geodesic segment connecting $x$ and $y$ to be unique. Since $(\mathcal{M}, g)$ is assumed to be complete, by the Hopf–Rinow theorem any two points in $\mathcal{M}$ can be joined by a minimizing geodesic segment; hence the whole manifold $\mathcal{M}$ is geodesically convex. For an arbitrary subset $B \subseteq \mathcal{M}$, the convex hull of $B$ is defined as the intersection of all geodesically convex subsets $A \subseteq \mathcal{M}$ that contain $B$. By construction, this convex hull is geodesically convex.

Next, we introduce the convexity of a differentiable function. Recall that in Euclidean space, a function $h$ is convex if for any $x, y$,
$$h(tx + (1 - t)y) \leq t h(x) + (1 - t) h(y), \quad \forall t \in [0, 1].$$
For functions on a Riemannian manifold, we adopt the definition of Zhang and Sra [31].

Definition 8. Let $A \subseteq \mathcal{M}$ be geodesically convex and let $f : \mathcal{M} \to \mathbb{R}$. We say that $f$ is geodesically convex on $A$ if for any $x, y \in A$ and for every minimizing geodesic segment $\gamma : [0, 1] \to \mathcal{M}$ connecting $x$ and $y$,
$$f(\gamma(t)) \leq (1 - t) f(x) + t f(y), \quad \forall t \in [0, 1].$$

If we replace the geodesic segments (given via the exponential map $\mathrm{Exp}_x$) with retraction curves $R_x$ in Definition 8, we obtain the definition of a retraction-convex function.
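The following short sketch (illustrative; the random test point and step scales are our own) contrasts the projection retraction above with the exact exponential map on the unit sphere, $\mathrm{Exp}_x(v) = \cos(\|v\|)\,x + \sin(\|v\|)\,v/\|v\|$. For small tangent vectors the two maps nearly coincide, which is the kind of closeness quantified by the $L_\Phi$-regularity of Definition 6 and by Lemma 2 below.

```python
import numpy as np

def retract(x, v):
    """Projection retraction on the unit sphere: R_x(v) = (x + v) / ||x + v||."""
    return (x + v) / np.linalg.norm(x + v)

def exp_map(x, v):
    """Exponential map on the unit sphere (geodesics are great circles)."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

rng = np.random.default_rng(1)
x = rng.standard_normal(4); x /= np.linalg.norm(x)
v = rng.standard_normal(4); v -= np.dot(x, v) * x      # project onto T_x S^3

for t in [1.0, 0.1, 0.01]:
    gap = np.linalg.norm(retract(x, t * v) - exp_map(x, t * v))
    print(t, gap)    # the gap shrinks rapidly (for this retraction, like t^3) as t -> 0
```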
If a function is both geodesically convex and differentiable, we have
$$f(\mathrm{Exp}_x(v)) - f(x) \geq \langle \nabla f(x), v \rangle_x.$$
In optimization, a common assumption is the Lipschitz property of the gradient.

Definition 9. A function $f$ on $(\mathcal{M}, g)$ is $L$-smooth if
$$\|\Gamma_x^y \nabla f(x) - \nabla f(y)\|_y \leq L\, d_{\mathcal{M}}(x, y), \quad \forall x, y \in \mathcal{M},$$
where $d_{\mathcal{M}}$ is the geodesic distance induced by the Riemannian metric $g$ and $\Gamma_x^y : T_x\mathcal{M} \to T_y\mathcal{M}$ is the parallel transport along any minimizing geodesic connecting $x$ and $y$.

Remark 1. We remark that this definition is mainly about the local property of the function $f$. Since parallel transport is an isometry, the left-hand side of the inequality satisfies
$$\|\Gamma_x^y \nabla f(x) - \nabla f(y)\|_y \leq \|\Gamma_x^y \nabla f(x)\|_y + \|\nabla f(y)\|_y = \|\nabla f(x)\|_x + \|\nabla f(y)\|_y \leq 2G$$
by Assumption 4; thus when $d_{\mathcal{M}}(x, y) \geq \frac{2G}{L}$ the definition is satisfied trivially. Moreover, when $x$ and $y$ are close enough, for example when $d_{\mathcal{M}}(x, y) < \mathrm{inj}(\mathcal{M})$, the minimizing geodesic connecting them is unique.

We conclude by introducing some notation used throughout the paper. Given a matrix $A = [a_{ij}]$ and index sets $\mathcal{I}, \mathcal{J}$, we denote by $A(\mathcal{I}, \mathcal{J}) = [a_{ij}]_{i \in \mathcal{I}, j \in \mathcal{J}}$ the submatrix of $A$ with row indices $\mathcal{I}$ and column indices $\mathcal{J}$. We use the shorthand $[n] = \{1, 2, \ldots, n\}$. Finally, $\nabla_{\mathcal{M}}$ denotes the gradient taken on the manifold $\mathcal{M}$.

3 Riemannian Mirror Descent: Algorithms and Convergence Theorems

In this section, we formulate Riemannian Mirror Descent (RMD) and stochastic RMD on manifolds. We establish non-asymptotic convergence guarantees for both algorithms. We then present several concrete examples arising from our RMD framework. In particular, we obtain Stochastic Curvilinear Gradient Descent (SCGD) as a special case of particular interest.

3.1 Motivation and Algorithm Framework

In this subsection, the ambient space is $\mathbb{R}^d$ equipped with its standard coordinates, and $\nabla$ and $\nabla^2$ denote the usual Euclidean gradient and Hessian, respectively. Our main motivation comes from the observation of Raskutti and Mukherjee [17]. They consider a different Riemannian structure on $\mathbb{R}^d$. Given a strongly convex smooth function $\psi$, for each point $x \in \mathbb{R}^d$ we define
$$g_x^\psi(v, u) = v^\top \nabla^2 \psi(x)\, u, \quad \forall u, v \in \mathbb{R}^d.$$
Notice that $g_x^\psi(\cdot, \cdot)$ is an inner product on $\mathbb{R}^d$ since $\nabla^2 \psi(x)$ is positive definite. Under the standard identification $T_x\mathbb{R}^d \simeq \mathbb{R}^d$, the matrix representation of $g_x^\psi(\cdot, \cdot)$ is $g^\psi(x) = \nabla^2 \psi(x)$. We denote by $\mathcal{M}_\psi = (\mathbb{R}^d, g^\psi)$ the Riemannian manifold obtained by equipping $\mathbb{R}^d$ with the Riemannian metric $g^\psi$.

Let $\psi^*$ be the conjugate of $\psi$, defined by $\psi^*(y) = \sup_x \{ \langle y, x \rangle - \psi(x) \}$. Then $\psi^*$ is also strongly convex and smooth. Let $\mathcal{M}_{\psi^*} = (\mathbb{R}^d, g^{\psi^*})$ be the manifold in the dual variable $y$. Then the map $x \mapsto \nabla\psi(x)$ is from $\mathcal{M}_\psi$ to $\mathcal{M}_{\psi^*}$, and
$$\nabla\psi^*(\nabla\psi(x)) = x, \qquad \nabla^2\psi(x)\, \nabla^2\psi^*(\nabla\psi(x)) = \mathrm{Id}_{\mathbb{R}^d}.$$
The update in mirror descent (MD) can be rewritten as
$$y_{t+1} = \nabla\psi(x_t) - \eta_t \nabla f(x_t), \tag{1}$$
$$x_{t+1} = \nabla\psi^*(y_{t+1}). \tag{2}$$
In the first step, $\nabla\psi$ maps $x_t \in \mathcal{M}_\psi$ to $\nabla\psi(x_t) \in \mathcal{M}_{\psi^*}$, and the update subtracts $\eta_t \nabla f(x_t)$ in the dual variable. Under our setting, $\nabla\psi$ is invertible with inverse $(\nabla\psi)^{-1} = \nabla\psi^*$. The invertible map $\nabla\psi$ pulls back $f$ on $\mathcal{M}_\psi$ to $\tilde{f} = f \circ (\nabla\psi)^{-1} = f \circ \nabla\psi^*$ on $\mathcal{M}_{\psi^*}$. We can calculate the Riemannian gradient of $\tilde{f}$ at $\nabla\psi(x_t)$ on $\mathcal{M}_{\psi^*}$.
Noting that $y_t = \nabla\psi(x_t)$, we have
$$\nabla_{\mathcal{M}_{\psi^*}} \tilde{f}(y_t) = \big(g^{\psi^*}(y_t)\big)^{-1} \nabla \tilde{f}(y_t) = \big(\nabla^2\psi^*(y_t)\big)^{-1}\, \nabla^2\psi^*(y_t)\, \nabla f(\nabla\psi^*(y_t)) = \nabla f(x_t),$$
where the first equality is the coordinate expression for the Riemannian gradient and the second uses the chain rule $\nabla\tilde{f}(y_t) = \nabla^2\psi^*(y_t)\,\nabla f(\nabla\psi^*(y_t))$ together with $\nabla\psi^*(y_t) = x_t$. This computation shows that $\nabla f(x_t)$ equals the Riemannian gradient of $\tilde{f}$ on $\mathcal{M}_{\psi^*}$ at $y_t$. As a result, the first step is a gradient step
$$y_{t+1} = y_t - \eta_t \nabla_{\mathcal{M}_{\psi^*}} \tilde{f}(y_t)$$
in the space $\mathcal{M}_{\psi^*}$. In the second step, the inverse of $\nabla\psi$, i.e., $\nabla\psi^*$, maps $y_{t+1}$ back to the primal manifold $\mathcal{M}_\psi$.

The MD update process can thus be viewed as gradient descent via reparameterization. More precisely, by the change of variables $y = \nabla\psi(x)$, the algorithm maps the primal iterate $x_t \in \mathcal{M}_\psi$ to the dual coordinate $y_t \in \mathcal{M}_{\psi^*}$, performs gradient descent in the dual space $\mathcal{M}_{\psi^*}$, and then maps back to $\mathcal{M}_\psi$ via $\nabla\psi^*$. Motivated by this idea, we propose the Riemannian Mirror Descent (RMD) algorithm (Algorithm 1) as an extension of Euclidean MD to general Riemannian manifolds.

Algorithm 1 Riemannian Mirror Descent (RMD)
Input: iteration number $T$, initial point $x_0$, step sizes $\{\eta_t\}_{t=0}^{T-1}$, radii $\{r_t\}_{t=0}^{T-1}$
for $t = 0$ to $T - 1$ do
    Construct a local diffeomorphism $\varphi_t : \hat{N}_{r_t}(x_t) \to M_t$ and a retraction $R^t_{\varphi_t(x_t)}$ on $M_t$
    $y_{t+1} = R^t_{\varphi_t(x_t)}\big(-\eta_t\, \mathrm{D}\varphi_t(x_t)[\nabla f(x_t)]\big)$        (dual update)
    $x_{t+1} = \varphi_t^{-1}(y_{t+1})$        (back to the primal space)
end for
return $x_T$

In Algorithm 1, we replace the global map $x \mapsto \nabla\psi(x)$ by a sequence of local diffeomorphisms $\varphi_t$. At each iteration, we choose a local diffeomorphism $\varphi_t : \hat{N}_{r_t}(x_t) \to M_t$ onto a reparameterized manifold $M_t$. Here $\hat{N}_r(x) = \{ y \in \mathcal{M} : d_{\mathcal{M}}(x, y) < r \}$ is the open geodesic ball. For simplicity, when $\mathrm{inj}(\mathcal{M}) < +\infty$, we assume $r_t < \mathrm{inj}(\mathcal{M})$ so that $\hat{N}_{r_t}(x)$ is diffeomorphic to the open ball $\{ v \in T_x\mathcal{M} : \|v\| < r \}$, and so is $M_t$. When $\mathrm{inj}(\mathcal{M}) = +\infty$, $r_t$ is free to be finite or infinite. We then perform a gradient step on $M_t$ via the retraction $R^t$ and map the iterate back via $\varphi_t^{-1}$. The update of $y_{t+1}$ in Algorithm 1 corresponds to (1). In essence, RMD amounts to carrying out gradient descent after a local reparameterization. One benefit is that the retraction map can be easier to compute in the reparameterized space.

Next, we present the convergence theorem for RMD in Algorithm 1. To establish these results, we introduce the following assumptions.

Assumption 1. The map $\hat{R}^t_x(u) = \varphi_t^{-1}\big(R^t_{\varphi_t(x)}(\mathrm{D}\varphi_t(x)[u])\big)$ is a retraction and is $L_\Phi$-regular within radius $r$ (see Definition 5 and Definition 6).

Assumption 2. There exists a compact geodesically convex set $A \subset \mathcal{M}$ such that all iterates satisfy $x_t \in A$ for $t = 1, \ldots, T$, and a minimizer $x^*$ satisfies $x^* \in A$.

Assumption 3 (Smoothness). The objective function $f$ is $L_f$-smooth (see Definition 9).

Assumption 4 (Bounded Gradient). The gradient is bounded, i.e., there exists a constant $G > 0$ such that $\|\nabla f(x)\|_x \leq G$ for all $x \in \mathcal{M}$.

Remark 2. We remark that in Assumption 2, it is sufficient to take $A$ to be the geodesic convex hull of $\{x_1, \ldots, x_T, x^*\}$, provided that this set $A$ is bounded. Moreover, if the initialization $x_1$ lies in a neighborhood of $x^*$, it is natural to assume that $f$ is geodesically convex on $A$.
We introduce a subset $A$ rather than working on the whole manifold $\mathcal{M}$ in order to avoid global geometric obstructions to geodesic convexity. For instance, although Hadamard manifolds admit geodesically convex (even strongly convex) functions, imposing geodesic $L$-smoothness simultaneously can be subtle; see Proposition 28 in [9]. A more direct example is that on a closed (i.e., compact and boundaryless) and connected Riemannian manifold there is no nonconstant globally geodesically convex function. If it is difficult to obtain such an initialization, and global $L$-smoothness and convexity on $A$ cannot be ensured, the "nonconvex" part of the convergence theorems below still applies. The "convex" part can be interpreted as a tail-rate (local) analysis: view $A$ as a neighborhood of $x^*$ on which $f$ is geodesically convex, and assume that the tail iterates $\{x_n, x_{n+1}, \ldots, x_T, x^*\}$ are contained in $A$ for some $n \in \mathbb{N}$.

We are now in a position to state the convergence theorem.

Theorem 1 (Convergence of RMD). For the RMD updates in Algorithm 1, assume that Assumptions 1–4 hold. Let $\eta_t \equiv \eta$ be a constant step size satisfying
$$0 < \eta < \min\left\{ \frac{1}{L_\Phi G/2 + L_f + L_f L_\Phi^2},\ \frac{2}{G},\ \frac{r}{G} \right\}.$$
Then the following statements hold:

1. (Nonconvex case)
$$\frac{1}{T} \sum_{t=1}^T \|\nabla f(x_t)\|_{x_t}^2 \leq O\left(\frac{1}{T}\right).$$

2. (Geodesically convex case) If, in addition, $f$ is geodesically convex on $A \subset \mathcal{M}$ (from Assumption 2), then
$$f(x_T) - f(x^*) \leq O\left(\frac{1}{T}\right).$$

A detailed proof is given in Section 4.

3.2 Examples of Riemannian Mirror Descent

Algorithm 1 establishes an abstract framework for RMD. By choosing different mirror maps $\varphi_t$ and manifolds $M_t$, one can recover a variety of algorithms as special cases of this framework. In what follows, we present several concrete examples to illustrate our framework.

Example 1 (Euclidean Mirror Descent). Given a strongly convex, twice differentiable function $\psi$, take $\mathcal{M} = \mathcal{M}_\psi = (\mathbb{R}^d, g^\psi)$ and $M_t \equiv \mathcal{M}_{\psi^*} = (\mathbb{R}^d, g^{\psi^*})$, where $g^\psi$ is defined in Section 3.1, with $R_x(v) = x + v$ and $\varphi_t(x) \equiv \nabla\psi(x)$. Then Algorithm 1 recovers the MD algorithm in Euclidean space.

In this example, the mirror map is in fact an isometry. Fix $x \in \mathcal{M}_\psi$ and $u, v \in T_x\mathcal{M}_\psi$. Let $y = \nabla\psi(x)$ and denote $\hat{u} = \mathrm{D}(\nabla\psi)(x)[u] = \nabla^2\psi(x) u$ and $\hat{v} = \nabla^2\psi(x) v$. Then
$$g_y^{\psi^*}(\hat{u}, \hat{v}) = \hat{u}^\top \nabla^2\psi^*(y)\, \hat{v} = u^\top \nabla^2\psi(x)\, \nabla^2\psi^*(\nabla\psi(x))\, \nabla^2\psi(x)\, v = u^\top \nabla^2\psi(x)\, v = g_x^\psi(u, v),$$
where we used the identity $\nabla^2\psi^*(\nabla\psi(x)) = (\nabla^2\psi(x))^{-1}$. Therefore, using the conjugacy relation between $\psi$ and $\psi^*$, the mirror map $\nabla\psi$ in classical MD is a Riemannian isometry between $\mathcal{M}_\psi$ and $\mathcal{M}_{\psi^*}$.

On a general Riemannian manifold, one cannot in general expect to construct a globally defined Riemannian isometry to serve as a mirror map. However, the tangent space provides a natural local parameterization of the manifold. By using the exponential map as a mirror map, Algorithm 1 recovers the Geodesic Gradient Descent algorithm.

Example 2 (Geodesic Gradient Descent). Let $\mathcal{M}$ be a Riemannian manifold. In Algorithm 1, fix $x_t \in \mathcal{M}$ and choose $r < \mathrm{inj}(\mathcal{M})$ so that $\mathrm{Exp}_{x_t}$ is a diffeomorphism on the geodesic ball $B_r(x_t)$. Define
$$\varphi_t(x) = \mathrm{Exp}_{x_t}^{-1}(x), \quad x \in \hat{N}_r(x_t), \qquad R_x(v) = x + v, \qquad M_t = \{ v \in T_{x_t}\mathcal{M} : \|v\| < r \} \subseteq T_{x_t}\mathcal{M}.$$
Then the resulting update coincides with Geodesic Gradient Descent [31].
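For a quick numerical illustration of Example 2 (our own sketch, not from the paper), the unit sphere admits closed-form expressions for both the exponential map and the Riemannian gradient, so a few Geodesic Gradient Descent iterations on an illustrative quadratic objective look as follows; the verification that Example 2 indeed reduces to this update follows the sketch.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-16 else np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_grad(x, A):
    """Riemannian gradient of f(x) = 0.5 x^T A x on the sphere (tangential projection)."""
    g = A @ x
    return g - np.dot(x, g) * x

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20)); A = (A + A.T) / 2
x = rng.standard_normal(20); x /= np.linalg.norm(x)

eta = 0.05
for t in range(200):
    x = sphere_exp(x, -eta * sphere_grad(x, A))   # x_{t+1} = Exp_{x_t}(-eta * grad f(x_t))

# At a stationary point x is an eigenvector of A; the gradient norm should be small.
print(np.linalg.norm(sphere_grad(x, A)))
```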
To verify the reduction in Example 2, note that $\mathrm{Exp}_{x_t}(0) = x_t$ and $\mathrm{D}\,\mathrm{Exp}_{x_t}(0) = \mathrm{Id}_{T_{x_t}\mathcal{M}}$. Since $\varphi_t(x_t) = \mathrm{Exp}_{x_t}^{-1}(x_t) = 0$, we obtain
$$y_{t+1} = R_{\varphi_t(x_t)}\big(-\eta_t\, \mathrm{D}\varphi_t(x_t)[\nabla f(x_t)]\big) = -\eta_t \nabla f(x_t),$$
and then
$$x_{t+1} = \varphi_t^{-1}(y_{t+1}) = \mathrm{Exp}_{x_t}(-\eta_t \nabla f(x_t)),$$
which is exactly Geodesic Gradient Descent.

Next, we discuss how to apply our framework to optimization over the Stiefel manifold.

Definition 10. For $1 \leq p \leq n$, the Stiefel manifold is
$$\mathrm{St}(n, p) = \{ X \in \mathbb{R}^{n \times p} : X^\top X = I_p \}.$$
Its tangent space at $X \in \mathrm{St}(n, p)$ is
$$T_X \mathrm{St}(n, p) = \{ Z \in \mathbb{R}^{n \times p} : X^\top Z + Z^\top X = 0 \}.$$
The canonical metric is defined, for all $A, B \in T_X \mathrm{St}(n, p)$, by
$$\langle A, B \rangle_X = \mathrm{tr}\left( A^\top \left( I_n - \tfrac{1}{2} X X^\top \right) B \right).$$

From this definition, the Stiefel manifold with $p = n$ is $\mathrm{St}(n, n) = \{ X \in \mathbb{R}^{n \times n} : X^\top X = I_n \}$. Then $I_n \in \mathrm{St}(n, n)$ and $T_{I_n}\mathrm{St}(n, n) = \mathrm{skew}(n)$, where $\mathrm{skew}(n) = \{ W \in \mathbb{R}^{n \times n} : W + W^\top = 0 \}$ denotes the set of $n \times n$ skew-symmetric matrices. Therefore, the Cayley transform provides a natural local reparameterization of $\mathrm{St}(n, n)$ near the identity. For any $W \in \mathrm{skew}(n)$, the Cayley transform is defined by
$$C(W) = (I_n - W)^{-1}(I_n + W) \in \mathrm{St}(n, n).$$
Then $C$ is a diffeomorphism between $\mathrm{skew}(n)$ and the open subset $\{ X \in \mathrm{St}(n, n) : X + I_n \text{ is invertible} \}$. On this subset, its inverse is
$$\varphi(X) = C^{-1}(X) = (X - I_n)(X + I_n)^{-1}.$$
Viewing $\varphi$ as a smooth map into $\mathrm{skew}(n)$, we have
$$\mathrm{D}\varphi(I_n)[W] = \tfrac{1}{2} W, \quad \forall W \in \mathrm{skew}(n).$$
Thus, when $x_t = I_n$ and using the Euclidean retraction on $\mathrm{skew}(n)$, the update in the dual variable $y$ is
$$y_{t+1} = -\eta_t\, \mathrm{D}\varphi(I_n)[\nabla f(I_n)] = -\frac{\eta_t}{2} \nabla f(I_n) \in \mathrm{skew}(n),$$
and mapping back yields
$$x_{t+1} = \varphi^{-1}(y_{t+1}) = C(y_{t+1}) = (I_n - y_{t+1})^{-1}(I_n + y_{t+1}) \in \mathrm{St}(n, n).$$
For a general base point $x_t = X_0 \in \mathrm{St}(n, n)$, since $\mathrm{St}(n, n)$ is a homogeneous manifold, we use the translation
$$O_{X_0} : \mathrm{St}(n, n) \to \mathrm{St}(n, n), \qquad O_{X_0}(X) = X X_0^\top,$$
which satisfies $O_{X_0}(X_0) = I_n$. On a neighborhood of $X_0$ where $O_{X_0}(X) + I_n$ is invertible, we define
$$\varphi(X) = C^{-1}\big( O_{X_0}(X) \big) = \big( X X_0^\top - I_n \big)\big( X X_0^\top + I_n \big)^{-1}.$$
This provides a local reparameterization of $\mathrm{St}(n, n)$ around $X_0$ with $\varphi(X_0) = 0$.

Example 3 (Curvilinear Gradient Descent on $\mathrm{St}(n, n)$). In Algorithm 1, let $M_t = T_{I_n}\mathrm{St}(n, n) = \mathrm{skew}(n)$. Define $\varphi_t$ at $X_t$ by
$$\varphi_t(X) = \big( X X_t^\top - I_n \big)\big( X X_t^\top + I_n \big)^{-1},$$
which is well-defined whenever $X X_t^\top + I_n$ is invertible, and use the Euclidean retraction $R_X(V) = X + V$ on the vector space $\mathrm{skew}(n)$. Since $T_{X_t}\mathrm{St}(n, n) = \{ W X_t : W \in \mathrm{skew}(n) \}$, the Riemannian gradient can be written as $\nabla f(X_t) = W_t X_t$ for some $W_t \in \mathrm{skew}(n)$. The RMD update in Algorithm 1 then admits the explicit form
$$X_{t+1} = \left( I_n + \frac{\eta_t}{2} W_t \right)^{-1} \left( I_n - \frac{\eta_t}{2} W_t \right) X_t, \tag{3}$$
which coincides with the Curvilinear Gradient Descent (CGD) update in [26].

For the general Stiefel manifold $\mathrm{St}(n, p)$ with $p < n$, we may extend any $X \in \mathrm{St}(n, p)$ to an orthogonal matrix by choosing an orthonormal complement $X_\perp \in \mathbb{R}^{n \times (n-p)}$ such that $\bar{X} = [X, X_\perp] \in \mathrm{St}(n, n)$. Applying the same construction to $\bar{X}$ and then taking its first $p$ columns yields the iteration (3) on $\mathrm{St}(n, p)$.

3.3 Stochastic Riemannian Mirror Descent

In this subsection, we introduce a stochastic variant of Riemannian Mirror Descent (RMD).
Suppose that, instead of the exact Riemannian gradient $\nabla f(x)$, we have access to an unbiased stochastic estimator $\nabla f(x, \xi)$ satisfying
$$\mathbb{E}[\nabla f(x, \xi)] = \nabla f(x),$$
where $\xi$ denotes the randomness in the oracle. In this setting, we can run Algorithm 1 by replacing the exact gradient with its stochastic estimate. In particular, the dual update becomes
$$y_{t+1} = R^t_{\varphi_t(x_t)}\big( -\eta_t\, \mathrm{D}\varphi_t(x_t)[\nabla f(x_t, \xi_t)] \big), \tag{4}$$
where $\{\xi_t\}_{t=1}^T$ are independent sampling noises. We refer to this algorithm as Stochastic Riemannian Mirror Descent (SRMD). Analogous to Theorem 1, we establish the following convergence theorem for SRMD.

Theorem 2 (Convergence of SRMD). Assume that Assumptions 1–4 hold. Suppose the dual update in Algorithm 1 is given by (4), where $\{\xi_t\}_{t=1}^T$ are i.i.d. random variables satisfying $\mathbb{E}[\nabla f(x_t, \xi_t) \mid x_t] = \nabla f(x_t)$, $\|\nabla f(x_t, \xi_t)\|_{x_t} \leq G$, and $\sigma^2 = \sup_t \mathbb{E}[\|\nabla f(x_t, \xi_t) - \nabla f(x_t)\|^2 \mid x_t] < \infty$. Let $\eta_t \equiv \eta$ be a constant step size satisfying
$$0 < \eta < \min\left\{ \frac{1}{L_\Phi G + 2L_f + 2L_f L_\Phi^2},\ \frac{2}{G},\ \frac{r}{G} \right\}.$$
Then the following statements hold:

1. (Nonconvex case)
$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla f(x_t)\|_{x_t}^2 \leq O\left( \frac{1}{\eta T} + \eta \sigma^2 \right).$$

2. (Geodesically convex case) If, in addition, $f$ is geodesically convex on $A \subset \mathcal{M}$ (from Assumption 2), then
$$\mathbb{E} f(x_T) - f(x^*) \leq O\left( \eta^2 T \sigma^2 + \frac{1}{\eta T} \right).$$

In particular, take $\eta \propto \frac{1}{T^{1/2}}$ and $\eta \propto \frac{1}{T^{2/3}}$, respectively. Then we have
$$\frac{1}{T} \sum_{t=1}^T \mathbb{E}\|\nabla f(x_t)\|_{x_t}^2 \leq O\left( \frac{1}{\sqrt{T}} \right)$$
in the nonconvex case and
$$\mathbb{E} f(x_T) - f(x^*) \leq O\left( \frac{1}{T^{1/3}} \right)$$
in the geodesically convex case.

A detailed proof is given in Section 4.

As an illustration, we specialize SRMD to optimization over the Stiefel manifold $\mathrm{St}(n, p)$ and obtain a randomized algorithm suitable for the regime where $p$ is large. Wen and Yin [26] use the Sherman–Morrison–Woodbury (SMW) formula to accelerate the Cayley-type update (3) when $2p \ll n$, exploiting the fact that the associated skew-symmetric matrix has rank at most $2p$. When $2p \geq n$, this low-rank reduction no longer yields a smaller system to invert (and hence provides little or no computational advantage); in this regime, one typically resorts to direct solvers or other implementations; see [26]. Motivated by SRMD, we propose the Stochastic Curvilinear Gradient Descent (SCGD) algorithm, which is amenable to parallel implementation and is effective when $p$ is large. We need the following lemma (Lemma 1 in [26]) to express the tangent-space gradient in terms of a skew-symmetric matrix.

Lemma 1 (Lemma 1 in [26]). Let $f : \mathbb{R}^{n \times p} \to \mathbb{R}$ be differentiable and let $X \in \mathrm{St}(n, p)$. Then
$$\nabla_{\mathrm{St}(n, p)} f(X) = W X, \qquad W = \nabla_{\mathbb{R}^{n \times p}} f(X)\, X^\top - X\, \big(\nabla_{\mathbb{R}^{n \times p}} f(X)\big)^\top \in \mathbb{R}^{n \times n}.$$

The CGD update requires applying $(I_n + \eta_t W_t / 2)^{-1}$. When $2p \ll n$, $W_t$ has rank at most $2p$ and the SMW formula can substantially reduce the computational cost. When $2p \geq n$, this reduction no longer yields a smaller system to invert, and computing $(I_n + \eta_t W_t / 2)^{-1}$ can be expensive.
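To make the preceding discussion concrete, the following Python sketch (illustrative only; the quadratic objective, problem sizes, and step size are our own choices) assembles the skew-symmetric matrix $W$ of Lemma 1 and performs CGD steps of the form (3). Forming and solving with the dense $n \times n$ matrix $I_n + \eta W/2$ is precisely the cost that the randomized block construction introduced next is designed to avoid.

```python
import numpy as np

def cgd_step(X, euclid_grad, eta):
    """One Curvilinear Gradient Descent step on St(n, p), cf. update (3)."""
    n = X.shape[0]
    W = euclid_grad @ X.T - X @ euclid_grad.T            # skew-symmetric, as in Lemma 1
    return np.linalg.solve(np.eye(n) + 0.5 * eta * W,    # apply (I + eta W/2)^{-1}
                           (np.eye(n) - 0.5 * eta * W) @ X)

# Illustrative objective: f(X) = -0.5 tr(X^T A X), whose Euclidean gradient is -A X.
rng = np.random.default_rng(0)
n, p = 50, 5
N = rng.standard_normal((n, n)); A = N.T @ N
X, _ = np.linalg.qr(rng.standard_normal((n, p)))

for _ in range(100):
    X = cgd_step(X, -A @ X, eta=1e-3)

print(np.linalg.norm(X.T @ X - np.eye(p)))   # feasibility X^T X = I_p is preserved
```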
Algorithm 2 Stochastic Curvilinear Gradient Descent (SCGD)
Input: iteration number $T$, initial point $X_0$, step sizes $\{\eta_t\}_{t=0}^{T-1}$, parameter $K$
for $t = 0$ to $T - 1$ do
    Randomly divide $[n]$ into $K$ sets $\mathcal{I}_1, \ldots, \mathcal{I}_K$ evenly
    for $k = 1$ to $K$ do
        $W_k = \nabla_{\mathbb{R}^{n \times p}} f(X_t)(\mathcal{I}_k, :)\, \big(X_t(\mathcal{I}_k, :)\big)^\top - X_t(\mathcal{I}_k, :)\, \big(\nabla_{\mathbb{R}^{n \times p}} f(X_t)(\mathcal{I}_k, :)\big)^\top$
        $X_{t+1}(\mathcal{I}_k, :) = \big( I + \tfrac{\eta_t}{2} W_k \big)^{-1} \big( I - \tfrac{\eta_t}{2} W_k \big) X_t(\mathcal{I}_k, :)$
    end for
end for
return $X_T$

To address the large-$p$ regime, we construct a block-diagonal approximation by randomly partitioning $[n]$ into $K$ equal-sized subsets $\mathcal{I}_1, \ldots, \mathcal{I}_K$ (with $n = mK$) and defining $\hat{W}$ by keeping only the within-block entries:
$$\hat{w}_{ij} = \begin{cases} w_{ij}, & i, j \in \mathcal{I}_k \text{ for some } k, \\ 0, & \text{otherwise}. \end{cases}$$
For example, if $\mathcal{I}_k = \{ (k-1)m + 1, \ldots, km \}$, then
$$\hat{W} = \begin{pmatrix} W_1 & O & \cdots & O \\ O & W_2 & \cdots & O \\ \vdots & & \ddots & \vdots \\ O & O & \cdots & W_K \end{pmatrix}.$$
For any $i \neq j$, we have $\mathbb{P}(i, j \text{ belong to the same block}) = \frac{m - 1}{n - 1}$, and hence
$$\mathbb{E}[\hat{W}] = \frac{n/K - 1}{n - 1} W.$$
Therefore, $\frac{n - 1}{n/K - 1} \hat{W}$ is an unbiased estimator of $W$. Since $\hat{W}$ is block diagonal, applying $(I + \eta_t \hat{W}/2)^{-1}$ reduces to inverting $K$ independent $m \times m$ systems $(I + \eta_t \hat{W}(\mathcal{I}_k, \mathcal{I}_k)/2)^{-1}$, which can be readily parallelized. This enables the algorithm to handle large-scale problems; see Algorithm 2 for details.

4 Convergence Analysis

This section is devoted to proving Theorems 1 and 2. To this end, we first establish the following lemma on the difference between a retraction and the exponential map. Let $\Phi_x(u) = \mathrm{Exp}_x^{-1}(R_x(u))$ be well-defined for a retraction $R_x$.

Lemma 2. Let $R_x$ be a retraction that is $L_\Phi$-regular within radius $r$. Then for all $u \in B_r(x)$,
$$\|\Phi_x(u) - u\|_x \leq \frac{L_\Phi}{2} \|u\|_x^2.$$

Proof. For $u \in B_r(x)$, consider the Taylor expansion of $g(t) = \Phi_x(tu)$ at $t = 0$:
$$g(1) = \Phi_x(u) = \Phi_x(0) + \mathrm{D}\Phi_x(0)[u] + \int_0^1 (1 - t)\, \mathrm{D}^2\Phi_x(tu)[u, u]\, \mathrm{d}t = u + \int_0^1 (1 - t)\, \mathrm{D}^2\Phi_x(tu)[u, u]\, \mathrm{d}t,$$
where we use $R_x(0) = x$ and $\mathrm{D}\Phi_x(0) = \mathrm{Id}$. By assumption, there holds
$$\|\Phi_x(u) - u\|_x \leq L_\Phi \|u\|_x^2 \int_0^1 (1 - t)\, \mathrm{d}t = \frac{L_\Phi}{2} \|u\|_x^2,$$
which implies the desired result. □

Using Lemma 2, we can analyze the one-step progress of RMD. Recall that $\hat{R}^t_x(u) = \varphi_t^{-1}\big( R^t_{\varphi_t(x)}(\mathrm{D}\varphi_t(x)[u]) \big)$. Then $\mathrm{D}\hat{R}^t_x(0) = (\mathrm{D}\varphi_t(x))^{-1}\, \mathrm{D}\varphi_t(x) = \mathrm{Id}$, which implies that $\hat{R}^t_x$ is a retraction.

Lemma 3. Suppose that $\hat{R}^t_x$ satisfies Assumption 1 and $f$ satisfies Assumptions 2–4. Then
$$f(x_{t+1}) - f(x_t) \leq -\left( \eta_t - L_\Phi G \eta_t^2/2 - \eta_t^2 L_f \right) \|\nabla f(x_t)\|_{x_t}^2 + \frac{L_f L_\Phi^2 \eta_t^4}{4} \|\nabla f(x_t)\|_{x_t}^4.$$

Proof. Let $d_t = -\mathrm{Exp}_{x_t}^{-1}\big( \hat{R}^t_{x_t}(-\eta_t \nabla f(x_t)) \big)$. By Assumption 3,
$$f(x_{t+1}) - f(x_t) \leq -\langle \nabla f(x_t), d_t \rangle_{x_t} + \frac{L_f}{2} \|d_t\|_{x_t}^2 = -\eta_t \|\nabla f(x_t)\|_{x_t}^2 - \langle \nabla f(x_t), d_t - \eta_t \nabla f(x_t) \rangle_{x_t} + \frac{L_f}{2} \|d_t\|_{x_t}^2 \leq -\left( \eta_t - \frac{L_\Phi G \eta_t^2}{2} \right) \|\nabla f(x_t)\|_{x_t}^2 + \frac{L_f}{2} \|d_t\|_{x_t}^2. \tag{5}$$
Here we use the Cauchy–Schwarz inequality and Assumption 4 to bound $\langle \nabla f(x_t), d_t - \eta_t \nabla f(x_t) \rangle_{x_t}$ in the second line as
$$\langle \nabla f(x_t), d_t - \eta_t \nabla f(x_t) \rangle_{x_t} \leq \|\nabla f(x_t)\|_{x_t} \|d_t - \eta_t \nabla f(x_t)\|_{x_t} \leq G \|d_t - \eta_t \nabla f(x_t)\|_{x_t} \leq \frac{L_\Phi G \eta_t^2}{2} \|\nabla f(x_t)\|_{x_t}^2.$$
By Lemma 2,
$$\|d_t\|_{x_t}^2 = \|d_t - \eta_t \nabla f(x_t) + \eta_t \nabla f(x_t)\|_{x_t}^2 \leq 2 \|d_t - \eta_t \nabla f(x_t)\|_{x_t}^2 + 2 \|\eta_t \nabla f(x_t)\|_{x_t}^2 \leq \frac{L_\Phi^2 \eta_t^4}{2} \|\nabla f(x_t)\|_{x_t}^4 + 2 \eta_t^2 \|\nabla f(x_t)\|_{x_t}^2.$$
Together with (5) we obtain
$$f(x_{t+1}) - f(x_t) \leq -\left( \eta_t - \frac{L_\Phi G \eta_t^2}{2} - \eta_t^2 L_f \right) \|\nabla f(x_t)\|_{x_t}^2 + \frac{L_f L_\Phi^2 \eta_t^4}{4} \|\nabla f(x_t)\|_{x_t}^4.$$
We thus conclude the proof. □

Now we are ready to prove Theorem 1.

Proof of Theorem 1. If $\eta < \min\{ 1/(L_\Phi G/2 + L_f + L_f L_\Phi^2),\ 2/G,\ r/G \}$, we have $\eta^4 \|\nabla f(x_t)\|_{x_t}^4 \leq 4\eta^2 \|\nabla f(x_t)\|_{x_t}^2$. By Lemma 3,
$$f(x_{t+1}) - f(x_t) \leq -\left( \eta - L_\Phi G \eta^2/2 - L_f \eta^2 - L_f L_\Phi^2 \eta^2 \right) \|\nabla f(x_t)\|_{x_t}^2.$$
Let $C_d = \eta - L_\Phi G \eta^2/2 - L_f \eta^2 - L_f L_\Phi^2 \eta^2 > 0$.

Nonconvex case: Summing over $t$ and rearranging, we get
$$\frac{1}{T} \sum_{t=1}^T \|\nabla f(x_t)\|_{x_t}^2 \leq \frac{f(x_1) - f(x_{T+1})}{C_d T}.$$

Geodesically convex case: We construct a potential function to prove the convergence rate. Let
$$A_{t+1} = A_t + a_t, \quad A_0 = 0, \quad E_t = A_t \big(f(x_t) - f(x^*)\big),$$
where $a_t$ will be specified later. We have
$$E_{t+1} - E_t = A_{t+1}\big(f(x_{t+1}) - f(x^*)\big) - (A_{t+1} - a_t)\big(f(x_t) - f(x^*)\big) = A_{t+1}\big(f(x_{t+1}) - f(x_t)\big) + a_t \big(f(x_t) - f(x^*)\big). \tag{6}$$
By the definition of geodesic convexity,
$$f(x_t) - f(x^*) \leq -\langle \nabla f(x_t), \mathrm{Exp}_{x_t}^{-1}(x^*) \rangle_{x_t}.$$
Combining with (6), we have
$$E_{t+1} - E_t \leq -C_d A_{t+1} \|\nabla f(x_t)\|_{x_t}^2 - a_t \langle \nabla f(x_t), \mathrm{Exp}_{x_t}^{-1}(x^*) \rangle_{x_t} = -C_d A_{t+1} \left( \|\nabla f(x_t)\|_{x_t}^2 + \left\langle \nabla f(x_t), \frac{a_t}{C_d A_{t+1}} \mathrm{Exp}_{x_t}^{-1}(x^*) \right\rangle_{x_t} \right).$$
Let $H(v) = \|v\|_{x_t}^2 + \langle v, \mathrm{Exp}_{x_t}^{-1}(x^*)/C \rangle_{x_t}$, where $C$ is a constant. Since $H$ is a convex function, it attains its minimum at the stationary point $v^* = -\frac{\mathrm{Exp}_{x_t}^{-1}(x^*)}{2C}$, so
$$H(v) \geq H\left( -\frac{\mathrm{Exp}_{x_t}^{-1}(x^*)}{2C} \right) = -\frac{1}{4C^2} \left\| \mathrm{Exp}_{x_t}^{-1}(x^*) \right\|_{x_t}^2.$$
Applying this bound with $C = C_d A_{t+1}/a_t$ gives
$$E_{t+1} - E_t \leq \frac{a_t^2}{4 C_d A_{t+1}} \left\| \mathrm{Exp}_{x_t}^{-1}(x^*) \right\|_{x_t}^2. \tag{7}$$
Since $\|\mathrm{Exp}_{x_t}^{-1}(x^*)\|_{x_t} = d(x_t, x^*) \leq \mathrm{diam}(A)$, plugging this into (7) yields
$$E_{t+1} - E_t \leq \frac{a_t^2\, \mathrm{diam}(A)^2}{4 C_d A_{t+1}}.$$
Summing with respect to $t$, we get
$$E_T \leq E_0 + \frac{\mathrm{diam}(A)^2}{4 C_d} \sum_{t=0}^{T-1} \frac{a_t^2}{A_{t+1}}.$$
Let $A_t = t^2$ and $a_t = A_{t+1} - A_t = 2t + 1$. Then
$$E_T \leq E_0 + \frac{\mathrm{diam}(A)^2}{4 C_d} \sum_{t=0}^{T-1} \frac{(2t+1)^2}{(t+1)^2} < E_0 + \frac{\mathrm{diam}(A)^2}{4 C_d} \sum_{t=0}^{T-1} \frac{(2t+2)^2}{(t+1)^2} = E_0 + \frac{\mathrm{diam}(A)^2\, T}{C_d}.$$
By the definition of $E_t$, $E_T = A_T \big(f(x_T) - f(x^*)\big)$, so
$$f(x_T) - f(x^*) \leq \frac{E_0}{T^2} + \frac{\mathrm{diam}(A)^2}{C_d T} = O\left( \frac{1}{\eta T} \right),$$
where the $\eta$ factor in the denominator comes from $C_d$. This concludes the proof of Theorem 1. □

To establish the convergence guarantee for stochastic RMD, i.e., Theorem 2, we first prove the following lemma. Its role is analogous to that of Lemma 3.

Lemma 4. Suppose that $\hat{R}^t_x$ satisfies Assumption 1 and $f$ satisfies Assumptions 2–4. Then
$$\mathbb{E}_t f(x_{t+1}) - f(x_t) \leq -\eta_t \|\nabla f(x_t)\|_{x_t}^2 + \left( \frac{L_\Phi G \eta_t^2}{2} + \eta_t^2 L_f \right) \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^2 + \frac{L_f L_\Phi^2 \eta_t^4}{4} \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^4,$$
where $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid x_1, \ldots, x_t]$.

Proof.
Let $d_t = -\mathrm{Exp}_{x_t}^{-1}\big( \hat{R}^t_{x_t}(-\eta_t \nabla f(x_t, \xi_t)) \big)$. By Assumption 3,
$$\mathbb{E}_t f(x_{t+1}) - f(x_t) \leq -\mathbb{E}_t \langle \nabla f(x_t), d_t \rangle_{x_t} + \frac{L_f}{2} \mathbb{E}_t \|d_t\|_{x_t}^2 = -\eta_t \|\nabla f(x_t)\|_{x_t}^2 - \mathbb{E}_t \langle \nabla f(x_t), d_t - \eta_t \nabla f(x_t, \xi_t) \rangle_{x_t} + \frac{L_f}{2} \mathbb{E}_t \|d_t\|_{x_t}^2 \leq -\eta_t \|\nabla f(x_t)\|_{x_t}^2 + \frac{L_\Phi G \eta_t^2}{2} \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^2 + \frac{L_f}{2} \mathbb{E}_t \|d_t\|_{x_t}^2, \tag{8}$$
where in the last inequality we use
$$\langle \nabla f(x_t), d_t - \eta_t \nabla f(x_t, \xi_t) \rangle_{x_t} \leq \|\nabla f(x_t)\|_{x_t} \|d_t - \eta_t \nabla f(x_t, \xi_t)\|_{x_t} \leq G \|d_t - \eta_t \nabla f(x_t, \xi_t)\|_{x_t} \leq \frac{G L_\Phi \eta_t^2}{2} \|\nabla f(x_t, \xi_t)\|_{x_t}^2,$$
and, by Lemma 2,
$$\|d_t\|_{x_t}^2 = \|d_t - \eta_t \nabla f(x_t, \xi_t) + \eta_t \nabla f(x_t, \xi_t)\|_{x_t}^2 \leq 2 \|d_t - \eta_t \nabla f(x_t, \xi_t)\|_{x_t}^2 + 2 \|\eta_t \nabla f(x_t, \xi_t)\|_{x_t}^2 \leq \frac{L_\Phi^2 \eta_t^4}{2} \|\nabla f(x_t, \xi_t)\|_{x_t}^4 + 2 \eta_t^2 \|\nabla f(x_t, \xi_t)\|_{x_t}^2.$$
Together with (8), we get
$$\mathbb{E}_t f(x_{t+1}) - f(x_t) \leq -\eta_t \|\nabla f(x_t)\|_{x_t}^2 + \left( \frac{L_\Phi G \eta_t^2}{2} + \eta_t^2 L_f \right) \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^2 + \frac{L_f L_\Phi^2 \eta_t^4}{4} \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^4,$$
which concludes the proof. □

Now we are ready to prove Theorem 2.

Proof of Theorem 2. Let $C_{sd} = L_\Phi G + 2L_f + 2L_f L_\Phi^2$. If $\eta_t \equiv \eta < \min\{ 1/C_{sd},\ 2/G,\ r/G \}$, we have
$$\mathbb{E}_t f(x_{t+1}) - f(x_t) \leq -\eta \|\nabla f(x_t)\|_{x_t}^2 + \left( \frac{L_\Phi G \eta^2}{2} + \eta^2 L_f \right) \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^2 + \frac{L_f L_\Phi^2 \eta^4}{4} \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^4 \leq -\eta \|\nabla f(x_t)\|_{x_t}^2 + C_{sd}\eta^2\, \mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^2,$$
where we use $\eta^4 \|\nabla f(x_t, \xi_t)\|_{x_t}^4 \leq 4\eta^2 \|\nabla f(x_t, \xi_t)\|_{x_t}^2$, which holds since by assumption $\|\nabla f(x_t, \xi_t)\|_{x_t} \leq G$ and $\eta \leq 2/G$. For the second term, since the estimator is unbiased the cross term vanishes and
$$\mathbb{E}_t \|\nabla f(x_t, \xi_t)\|_{x_t}^2 = \mathbb{E}_t \|\nabla f(x_t, \xi_t) - \nabla f(x_t)\|_{x_t}^2 + \|\nabla f(x_t)\|_{x_t}^2 \leq \sigma^2 + \|\nabla f(x_t)\|_{x_t}^2.$$
Thus, we have
$$\mathbb{E}_t f(x_{t+1}) - f(x_t) \leq -\left( \eta - C_{sd}\eta^2 \right) \|\nabla f(x_t)\|_{x_t}^2 + C_{sd}\eta^2 \sigma^2.$$
Let $F_t = \mathbb{E} f(x_t) - f(x^*) - t\, C_{sd}\eta^2\sigma^2$. Then
$$F_{t+1} - F_t \leq -\left( \eta - C_{sd}\eta^2 \right) \mathbb{E} \|\nabla f(x_t)\|_{x_t}^2.$$
Using the same argument as in the proof of Theorem 1, we prove the theorem in the two cases.

Nonconvex case: Summing over $t$ and rearranging leads to
$$\frac{1}{T} \sum_{t=1}^T \mathbb{E} \|\nabla f(x_t)\|_{x_t}^2 \leq \frac{F_1 - F_{T+1}}{(\eta - C_{sd}\eta^2) T} \leq \frac{f(x_1) - f(x^*)}{(\eta - C_{sd}\eta^2) T} + \frac{C_{sd}\eta\sigma^2}{1 - C_{sd}\eta}.$$

Geodesically convex case: Let $A_t = t^2$ and $a_t = A_{t+1} - A_t = 2t + 1$, and define $E_t = A_t \big(\mathbb{E} f(x_t) - f(x^*)\big)$. Thus,
$$E_{t+1} - E_t = A_{t+1} \mathbb{E}[f(x_{t+1}) - f(x_t)] + a_t \mathbb{E}[f(x_t) - f(x^*)].$$
By Lemma 4 and geodesic convexity, we have
$$E_{t+1} - E_t \leq -A_{t+1}\left( \eta - C_{sd}\eta^2 \right) \mathbb{E}\|\nabla f(x_t)\|_{x_t}^2 - a_t\, \mathbb{E}\langle \nabla f(x_t), \mathrm{Exp}_{x_t}^{-1}(x^*) \rangle_{x_t} + A_{t+1} C_{sd}\eta^2\sigma^2 \leq \frac{a_t^2\, \mathrm{diam}(A)^2}{4 A_{t+1} (\eta - C_{sd}\eta^2)} + A_{t+1} C_{sd}\eta^2\sigma^2,$$
where we use the same argument as in the proof of Theorem 1. Therefore,
$$E_T \leq E_0 + \frac{\mathrm{diam}(A)^2}{4(\eta - C_{sd}\eta^2)} \sum_{t=0}^{T-1} \frac{(2t+1)^2}{(t+1)^2} + \sum_{t=0}^{T-1} (t+1)^2 C_{sd}\eta^2\sigma^2 \leq O\left( E_0 + \frac{\mathrm{diam}(A)^2\, T}{\eta - C_{sd}\eta^2} + C_{sd}\eta^2\sigma^2 T^3 \right).$$
Noting that $E_T = T^2 \big(\mathbb{E} f(x_T) - f(x^*)\big)$, we get
$$\mathbb{E} f(x_T) - f(x^*) \leq O\left( \frac{1}{\eta T} + \eta^2 \sigma^2 T \right).$$
We thus finish the proof of Theorem 2. □

5 Numerical Experiments

In this section, we conduct experiments on Stiefel manifold optimization to verify our algorithms. We compare our algorithm with the classic algorithm of [26], which is a special case of Algorithm 2.
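For reference, a minimal Python sketch of the randomized per-block Cayley update of Algorithm 2, which these experiments exercise, is given below (illustrative only: the objective, block count, step size, and absence of line search are our simplifications, not the released implementation).

```python
import numpy as np

def scgd_step(X, euclid_grad, eta, K, rng):
    """One SCGD step: random partition of [n] into K blocks, Cayley update within each block."""
    n, _ = X.shape
    perm = rng.permutation(n)
    X_new = X.copy()
    for block in np.array_split(perm, K):
        G_b, X_b = euclid_grad[block, :], X[block, :]
        W = G_b @ X_b.T - X_b @ G_b.T                      # skew-symmetric block, as in Lemma 1
        m = len(block)
        X_new[block, :] = np.linalg.solve(np.eye(m) + 0.5 * eta * W,
                                          (np.eye(m) - 0.5 * eta * W) @ X_b)
    return X_new

# Illustrative use on the linear eigenvalue objective f(X) = -tr(X^T A X), gradient -2 A X.
rng = np.random.default_rng(0)
n, p, K = 300, 10, 3
N = rng.standard_normal((n, n)); A = N.T @ N
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
for _ in range(50):
    X = scgd_step(X, -2 * A @ X, eta=1e-3, K=K, rng=rng)
print(np.linalg.norm(X.T @ X - np.eye(p)))   # orthogonality is preserved by each block update
```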
Since OptStiefel adopts a nonmonotone line search strategy [29], which can greatly improve the performance of the algorithm, we apply the same line search technique [29] to Algorithm 2. In [26], the authors also use the Barzilai–Borwein (BB) step size heuristic to accelerate their algorithm. However, it is not clear how to use the BB step size in the stochastic setting. To highlight the improvement made by randomization, we do not use the BB step size in our experiments. We set the stopping criterion to be $\|\nabla_{\mathcal{M}} f(x)\| \leq 10^{-5}$, $\|x_t - x_{t-1}\| \leq 10^{-5}$, or $|f(x_t) - f(x_{t-1})| \leq 10^{-8}$. All experiments are run on a machine with an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz and 16.0 GB of RAM.¹

¹The code and data set are available at https://github.com/JiyuanTan/RMD/tree/main.

5.1 Linear Eigenvalue Problem

Given a symmetric matrix $A$, the linear eigenvalue problem is
$$\max_{X \in \mathrm{St}(n, p)} \mathrm{Tr}(X^\top A X) = \sum_{k=1}^p \lambda_k,$$
where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ are the $p$ largest eigenvalues of $A$. In the experiments, we generate a random matrix $N$ whose entries are i.i.d. standard normal random variables and set $A = N^\top N$. The test results are reported in Table 1. We observe that while both algorithms produce solutions of similar quality, SCGD achieves a lower runtime than CGD. One reason is that SCGD only needs to solve several smaller linear systems rather than one large system in each iteration.

               n = 1000    n = 2000    n = 5000
CGD    Time        1.41        7.10       64.67
       Error   1.20e-06    9.28e-07    5.88e-06
SCGD   Time        0.72        4.56       38.10
       Error   7.08e-08    9.54e-07    8.09e-06

Table 1: Results of the two algorithms on the linear eigenvalue problem. When K = 1 the step size is $10^{-3}$; when K > 1 the step size is $10^{-2}$. We take p = 10. The number of blocks K in Algorithm 2 is chosen to be ⌊n/300⌋. The results are averages over 5 runs.

5.2 Orthogonal Procrustes Problem

The orthogonal Procrustes problem [11] is
$$\min_{X \in \mathrm{St}(n, p)} \|AX - B\|_F^2.$$
In the experiments, we generate $A = (A_{ij})$ with $A_{ij}$ i.i.d. $\sim \mathrm{Unif}(0, 1)$, and construct $X^\star \in \mathrm{St}(n, p)$ by taking the Q factor of the QR decomposition of a Gaussian random matrix. We then set $B = AX^\star$. Since the objective is nonnegative and $X^\star$ attains value 0, the optimal value is 0. The results are reported in Table 2.

               n = 1000    n = 2000    n = 5000
CGD    Time       11.34       37.68      341.41
       Error    7.55e-2     7.30e-2     1.13e-1
SCGD   Time       11.58       38.49      211.09
       Error    2.05e-2     4.41e-2     9.76e-2

Table 2: Results of the two algorithms on the orthogonal Procrustes problem. When K = 1 the step size is $10^{-3}$; when K > 1 the step size is $10^{-2}$. We take p = 10. The number of blocks K in Algorithm 2 is chosen to be ⌊n/300⌋.

6 Conclusion

In this paper, we propose a new framework for mirror descent (MD) on Riemannian manifolds, based on the key observation that MD can be interpreted as an optimization method under a suitable reparameterization. Under mild assumptions, we obtain the first non-asymptotic convergence results for Riemannian Mirror Descent in both the deterministic and stochastic settings. Furthermore, we introduce the Stochastic Curvilinear Gradient Descent algorithm for large-scale Stiefel manifold optimization under the RMD framework, which we believe is of independent interest. Similar techniques may be applied to address optimization problems in other settings.

An interesting direction for future work is the explicit construction of problem-adapted reparameterizations. For instance, Li et al.
[14] showed that, under any commuting parametrization, the gradient flow is equivalent to the continuous-time mirror flow. In our general Riemannian setting, where no additional global structure is imposed, the resulting reparameterizations are inherently local. It would be of interest to investigate whether stronger geometric assumptions, such as those satisfied by Hessian manifolds, permit global reparameterizations and lead to sharper guarantees and improved algorithms.

References

[1] Foivos Alimisis, Antonio Orvieto, Gary Becigneul, and Aurelien Lucchi. Momentum improves optimization on Riemannian manifolds. In International Conference on Artificial Intelligence and Statistics, pages 1351–1359. PMLR, 2021.

[2] Ehsan Amid and Manfred K. Warmuth. Reparameterizing mirror descent as gradient descent. Advances in Neural Information Processing Systems, 33:8430–8439, 2020.

[3] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.

[4] Nitin Bansal, Xiaohan Chen, and Zhangyang Wang. Can we gain more from orthogonality regularizations in training deep networks? Advances in Neural Information Processing Systems, 31, 2018.

[5] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.

[6] Glaydston C. Bento, Orizon P. Ferreira, and Jefferson G. Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548–562, 2017.

[7] Lev M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.

[8] Anoop Cherian and Suvrit Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Transactions on Neural Networks and Learning Systems, 28(12):2859–2871, 2016.

[9] Christopher Criscitiello and Nicolas Boumal. Negative curvature obstructs acceleration for strongly geodesically convex optimization, even with exact first-order oracles. In Conference on Learning Theory, pages 496–542. PMLR, 2022.

[10] John C. Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In Conference on Learning Theory, volume 10, pages 14–26. Citeseer, 2010.

[11] John C. Gower and Garmt B. Dijksterhuis. Procrustes Problems, volume 30. OUP Oxford, 2004.

[12] Suriya Gunasekar, Blake Woodworth, and Nathan Srebro. Mirrorless mirror descent: A natural derivation of mirror descent. In International Conference on Artificial Intelligence and Statistics, pages 2305–2313. PMLR, 2021.

[13] Yunwen Lei and Ding-Xuan Zhou. Convergence of online mirror descent. Applied and Computational Harmonic Analysis, 48(1):343–373, 2020.

[14] Zhiyuan Li, Tianhao Wang, Jason D. Lee, and Sanjeev Arora. Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. Advances in Neural Information Processing Systems, 35:34626–34640, 2022.

[15] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[16] Arkadi Semenovich Nemirovski and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.

[17] Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451–1457, 2015.

[18] Hiroyuki Sato, Hiroyuki Kasai, and Bamdev Mishra. Riemannian stochastic variance reduced gradient algorithm with retraction and vector transport. SIAM Journal on Optimization, 29(2):1444–1472, 2019.

[19] Yunlu Shu, Jiaxin Jiang, Lei Shi, and Tianyu Wang. Revisit first-order methods for geodesically convex optimization. arXiv preprint arXiv:2504.06814, 2025.

[20] Hyungjoon Soh, Dongyeob Kim, Juno Hwang, and Junghyo Jo. Mirror descent of Hopfield model. Neural Computation, 35(9):1529–1542, 2023.

[21] Vishwak Srinivasan and Ashia Wilson. Sufficient conditions for non-asymptotic convergence of Riemannian optimisation methods. In 14th Annual Workshop on Optimization for Machine Learning, 2022.

[22] Haoyuan Sun, Kwangjun Ahn, Christos Thrampoulidis, and Navid Azizan. Mirror descent maximizes generalized margin and can be implemented efficiently. Advances in Neural Information Processing Systems, 35:31089–31101, 2022.

[23] Mingkui Tan, Ivor W. Tsang, Li Wang, Bart Vandereycken, and Sinno Jialin Pan. Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning, pages 1539–1547. PMLR, 2014.

[24] Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I. Jordan. Averaging stochastic gradient descent on Riemannian manifolds. In Conference on Learning Theory, pages 650–687. PMLR, 2018.

[25] Bart Vandereycken. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.

[26] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.

[27] Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, and Gang Pan. Policy optimization with stochastic mirror descent. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8823–8831, 2022.

[28] Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, and Yuejie Chi. Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061–1091, 2023.

[29] Hongchao Zhang and William W. Hager. A nonmonotone line search technique and its application to unconstrained optimization. SIAM Journal on Optimization, 14(4):1043–1056, 2004.

[30] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. Advances in Neural Information Processing Systems, 29, 2016.

[31] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638. PMLR, 2016.
