Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization
Hideaki Iiduka
iiduka@cs.meiji.ac.jp
Department of Computer Science, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan

Abstract

Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training in large-scale deep neural networks. Meanwhile, previously reported results indicated that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk, a setting compatible with heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under a boundedness condition accounting for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.

Keywords: convergence, heavy-tailed noise, Hölder-smooth, mini-batch SGD, Muon

1 Introduction

1.1 Background

Empirical risk minimization (ERM) is a central issue in training deep neural networks (DNNs) on given training datasets. ERM is an optimization problem for minimizing an empirical risk (ER) defined as the sum of the loss functions corresponding to the training set. Since a loss function such as the cross-entropy loss is nonconvex, ERM can be regarded as a nonconvex minimization problem. Mini-batch stochastic gradient descent (SGD) (Robbins and Monro, 1951; Zinkevich, 2003; Nemirovski et al., 2009; Ghadimi and Lan, 2012, 2013; Umeda and Iiduka, 2025) is a simple and useful optimizer for finding appropriate parameters of the DNN in the sense of minimizing the ER.
A standard assumption when analyzing mini-batch SGD is smoothness of the ER, i.e., Lipschitz continuity of the gradient of the ER, since almost all analyses of mini-batch SGD have been based on the descent lemma (see, e.g., (Beck, 2017, Lemma 5.7)). Mini-batch SGD uses a stochastic gradient of the ER that is randomly chosen from the gradients of the loss functions. Hence, a discrepancy arises between the stochastic gradient and the true gradient of the ER. We call this discrepancy the stochastic noise. It is commonly assumed that the stochastic noise is bounded in the sense of the expectation of its squared norm; i.e., the variance of the stochastic gradient is bounded. This is because, in theory, the bounded-variance condition works well with the descent lemma, which holds under smoothness of the ER. However, the numerical results in (Simsekli et al., 2019; Garg et al., 2021; Battash et al., 2024; Ahn et al., 2024) indicated that stochastic noise may exhibit heavy-tailed behavior. Heavy-tailed noise refers to stochastic noise whose distribution allows large fluctuations with non-negligible probability owing to its slowly decaying tails. Moreover, it has been reported that the stochastic noise of SGD can be heavy-tailed (Gorbunov et al., 2020; Hodgkinson and Mahoney, 2021; Liu and Zhou, 2025). Accordingly, the bounded-variance condition on the stochastic gradient would be unrealistic in practical machine-learning problems.

1.2 Motivation

A standard condition (Zhang et al., 2020, Assumption 1) for analyzing optimizers under heavy-tailed noise is that the stochastic noise is bounded in the sense of the expectation of the $p$-th power of its norm, where $p \in (1, 2]$. In particular, we say that the stochastic noise is heavy-tailed when $p \in (1, 2)$ (Zhang et al., 2020, Assumption 1).
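The heavy-tailed regime can be made concrete with a short numerical sketch (our illustration; the Pareto noise model and all parameter values below are assumptions, not experiments from the paper): noise with tail index $\alpha \in (1, 2)$ has infinite variance but a finite $p$-th absolute moment for every $p < \alpha$, so the sample second moment keeps drifting upward with the sample size while the sample $p$-th moment stabilizes.

```python
import numpy as np

# Illustration (our assumption, not the paper's experiment): Pareto noise
# with tail index alpha in (1, 2) has infinite variance but a finite
# p-th absolute moment for every p < alpha.
rng = np.random.default_rng(0)
alpha, p = 1.5, 1.2                      # tail index; moment order p < alpha

for n in [10**3, 10**4, 10**5, 10**6]:
    x = rng.pareto(alpha, size=n)
    noise = x - x.mean()                 # zero-mean "stochastic noise" sample
    m2 = np.mean(noise**2)               # sample variance: drifts upward with n
    mp = np.mean(np.abs(noise)**p)       # sample p-th moment: stays bounded
    print(f"n={n:>7}  E||N||^2 ~ {m2:12.2f}  E||N||^{p} ~ {mp:7.3f}")
```

In this regime a bounded second moment is the wrong assumption, while a bounded $p$-th moment with $p \in (1, 2)$ remains meaningful.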
We call the expectation of the $p$-th power of the stochastic noise norm the $p$-variance of the stochastic gradient (the precise mathematical formulation is given in Assumption 2.1(A2)(ii)). Under boundedness of the $p$-variance of the stochastic gradient, mini-batch SGD and its variants have been analyzed in (Zhang et al., 2020; Cutkosky and Mehta, 2021; Nguyen et al., 2023; Sadiev et al., 2023; Liu et al., 2024). Meanwhile, Fatkhullin et al. (2025) and Yamada et al. (2026) showed that Hölder smoothness (Hölder, 1882), which is weaker than smoothness, works well in both theory and practice with heavy-tailed stochastic noise (the precise mathematical formulation of Hölder smoothness is given in Assumption 2.1(A1)). The motivation behind this work is thus to show that, under Hölder smoothness of the ER and boundedness of the $p$-variance of the stochastic gradient, mini-batch SGD converges.

Many optimizers have been presented to accelerate mini-batch SGD. For example, adaptive gradient methods such as Adam (Kingma and Ba, 2015) and its variant AdamW (Loshchilov and Hutter, 2019) have become the de facto standard in modern deep learning, owing to their fast convergence and strong empirical performance across a wide range of tasks. Subsequent work has explored richer preconditioning strategies, including methods such as Shampoo (Gupta et al., 2018), which leverage matrix-valued statistics of gradients. More recently, the Muon (Momentum orthogonalized by Newton-Schulz) optimizer (Jordan et al., 2024) has been proposed as a new optimizer that performs updates based on orthogonalized gradients. Convergence analyses of the Muon optimizer have been presented in (Tang et al., 2026; Sato et al., 2025; Pethick et al., 2025a,b; Nagashima and Iiduka, 2026) under the smoothness or $(L_0, L_1)$-smoothness of the ER.
Meanwhile, we are interested in verifying that, under Hölder smoothness of the ER and boundedness of the $p$-variance of the stochastic gradient, Muon converges faster than mini-batch SGD.

1.3 Main results

This paper considers ERM under Hölder smoothness of the ER and boundedness of the $p$-variance of the stochastic gradient and provides useful properties of the mini-batch gradient (Section 2). Let $(W_t) \subset \mathbb{R}^{m \times n}$ be the sequence generated by an optimizer with step size $\eta_t$ and batch size $b_t$ to minimize the ER $f$. The exponent $\nu \in (0, 1]$ appears in the definition of Hölder smoothness (see Assumption 2.1(A1)), and $p \in (1, 2]$ appears in the definition of the $p$-variance of the stochastic gradient (see Assumption 2.1(A2)(ii)). The following summarizes convergence of the Muon optimizer when the momentum parameter $\beta$ is $0$ (Section 4), compared with mini-batch SGD (Section 3). Section 5 shows that Muon with $\beta \neq 0$ has the same results as in Section 4.

1.3.1 Descent property

Let $\eta_t$ be a diminishing step size converging to $0$ and let $b_t$ be a constant or an increasing batch size. Then, for all $\epsilon > 0$, there exists $s_0 \in \mathbb{N}$ such that, for all $t \geq s_0$, $\eta_t^\nu < \frac{2}{L}$, $O(\eta_t^{1+\nu} + \frac{\eta_t^{1+\nu}}{b_t^{p-1}}) < \epsilon$, and $O(\eta_t^{1+\nu} + \frac{\eta_t}{b_t^{(p-1)/p}}) < \epsilon$, where $L > 0$ is the Hölder constant (see Lemma 3.1 for the definition) and $O$ is Landau's symbol. Mini-batch SGD satisfies the following inequality (Lemma 3.1(ii)) based on the generalized descent lemma (see (2)) under the assumption of Hölder smoothness of the ER: if $1 + \nu \leq p$ holds, then, for all $t \geq s_0$,

[SGD] $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] \leq f(W_t) \underbrace{- \eta_t \left(1 - \frac{L \eta_t^\nu}{2}\right) \|\nabla f(W_t)\|_F^2}_{<\, 0} + \underbrace{O\!\left(\eta_t^{1+\nu} + \frac{\eta_t^{1+\nu}}{b_t^{p-1}}\right)}_{\to\, 0\ (t \to +\infty)} < f(W_t) + \epsilon,$

where $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}]$ is the expectation of $f(W_{t+1})$ with respect to a random variable $\xi_t$ conditioned on $\xi_{[t-1]} = (\xi_0, \cdots, \xi_{t-1})$ and $\|\cdot\|_F$ is the Frobenius norm. Meanwhile, the Muon optimizer satisfies the following inequality (Lemma 4.1(ii)): for all $t \geq s_0$,

[Muon] $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] \leq f(W_t) \underbrace{- \eta_t \|\nabla f(W_t)\|_F}_{<\, 0} + \underbrace{O\!\left(\eta_t^{1+\nu} + \frac{\eta_t}{b_t^{(p-1)/p}}\right)}_{\to\, 0\ (t \to +\infty)} < f(W_t) + \epsilon.$

The above inequalities imply that mini-batch SGD and Muon have a descent property in the sense that $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon \approx f(W_t)$. The main difference between the two inequalities is the exponent of $\|\nabla f(W_t)\|_F$. This difference between the two optimizers arises from the definition of the search direction. While the search direction of mini-batch SGD uses the mini-batch gradient (see (7) and $D_t^{\mathrm{SGD}}$ in (10)), the search direction of Muon uses the point on the Stiefel manifold $\mathrm{St}(n, m) := \{O \in \mathbb{R}^{m \times n} \colon O^\top O = I_n\}$ closest to the mini-batch gradient (see $D_t^{\mathrm{Muon}}$ in (17) and (18)).

1.3.2 Convergence

The above inequalities, together with the supermartingale convergence theorem (Bertsekas et al., 2003, Proposition 8.2.10), ensure that, for mini-batch SGD and Muon with $\eta_t$ satisfying $\sum_{t=0}^{+\infty} \eta_t = +\infty$,

[SGD] $\sum_{t=0}^{+\infty} \left(\eta_t^{1+\nu} + \frac{\eta_t^{1+\nu}}{b_t^{p-1}}\right) < +\infty$, [Muon] $\sum_{t=0}^{+\infty} \left(\eta_t^{1+\nu} + \frac{\eta_t}{b_t^{(p-1)/p}}\right) < +\infty$ $\;\Rightarrow\; \liminf_{t \to +\infty} \|\nabla f(W_t)\|_F = 0$ a.s.,

which, together with the descent properties, implies that mini-batch SGD and Muon converge to a stationary point of $f$ that corresponds to either a local minimizer or a saddle point (Theorems 3.1 and 4.1).

1.3.3 Convergence rate

Sections 1.3.1 and 1.3.2 indicate that both mini-batch SGD and Muon converge to appropriate points almost surely.
The difference between the two optimizers is reflected in the convergence rate, since the main difference between them in Section 1.3.1 is the exponent of $\|\nabla f(W_t)\|_F$. Under certain assumptions, mini-batch SGD (Theorems 3.2 and 3.3) and Muon (Theorems 4.2 and 4.3) have the following convergence rates: there exists $s \in \mathbb{N}$ such that, for all $T \geq s$,

[SGD] $\frac{1}{\sum_{t=s}^{T} \eta_t} \sum_{t=s}^{T} \eta_t \mathbb{E}[\|\nabla f(W_t)\|_F^2] = \Theta\!\left(\frac{1}{\sum_{t=s}^{T} \eta_t}\right)$, [Muon] $\frac{1}{\sum_{t=s}^{T} \eta_t} \sum_{t=s}^{T} \eta_t \mathbb{E}[\|\nabla f(W_t)\|_F] = \Theta\!\left(\frac{1}{\sum_{t=s}^{T} \eta_t}\right)$,

where $f(T) = \Theta(g(T))$ means that there exist $c_1, c_2 > 0$ and $t \in \mathbb{N}$ such that, for all $T \geq t$, $c_1 g(T) \leq f(T) \leq c_2 g(T)$. When $\eta_t = \frac{1}{(t+1)^a}$ ($t \in \{0\} \cup \mathbb{N}$) is used, where $a \in (0, 1)$ satisfies $(1 + \nu) a > 1$ (e.g., $a > \frac{1}{2}$ when $\nu = 1$), we have that $\frac{T^{1-a} - 1}{1-a} \leq \sum_{t=1}^{T} \frac{1}{t^a} \leq \frac{T^{1-a} - 1}{1-a} + 1$, i.e., $\sum_{t=1}^{T} \frac{1}{t^a} = \Theta(T^{1-a})$. Hence, mini-batch SGD has a $\Theta(\frac{1}{T^{1-a}})$ rate of convergence in the sense of the mean of the total expectation of the squared norm $\|\nabla f(W_t)\|_F^2$, while Muon has a $\Theta(\frac{1}{T^{1-a}})$ rate of convergence in the sense of the mean of the total expectation of the norm $\|\nabla f(W_t)\|_F$. In particular, since we have

$\min_{t \in \{1, 2, \cdots, T\}} \mathbb{E}[\|\nabla f(W_t)\|_F] = \begin{cases} O\!\left(\frac{1}{T^{(1-a)/2}}\right) & \text{(SGD)}, \\ O\!\left(\frac{1}{T^{1-a}}\right) & \text{(Muon)}, \end{cases}$

we can check that Muon converges faster than mini-batch SGD.

Notation and definitions

Here, we describe the notation and state some definitions. Let $\mathbb{N}$ be the set of natural numbers. Let $[N] := \{1, 2, \cdots, N\}$ and $[0 : N] := \{0, 1, \cdots, N\}$ for $N \in \mathbb{N}$. Let $\mathbb{R}_+ := \{x \in \mathbb{R} \colon x \geq 0\}$. Let $\mathbb{R}^{m \times n}$ be the set of $m \times n$ matrices with inner product $W_1 \bullet W_2 := \mathrm{Tr}(W_1^\top W_2)$ ($W_1, W_2 \in \mathbb{R}^{m \times n}$) and norm $\|W\|_F := \sqrt{W \bullet W}$, where $\mathrm{Tr}(X)$ is the trace of $X$.
The dual norm $\|W\|_{2,*}$ of the spectral norm $\|W\|_2 := \max\{\|W x\|_2 \colon \|x\|_2 \leq 1\}$ is defined by $\|W\|_{2,*} := \max\{W \bullet X \colon \|X\|_2 \leq 1\}$, where $\|x\|_2$ is the Euclidean norm of $x \in \mathbb{R}^n$. $O_{m \times n}$ denotes the $m \times n$ zero matrix and $I_n$ denotes the $n \times n$ identity matrix. $\mathbb{P}(A)$ denotes the probability of event $A$. $\mathbb{E}_\xi[X(\xi)]$ denotes the expectation of a random variable $X(\xi)$ with respect to a random variable $\xi$. The variance of $X(\xi)$ with respect to $\xi$ is defined by $\mathbb{V}_\xi[X(\xi)] := \mathbb{E}_\xi[\|X(\xi) - \mathbb{E}_\xi[X(\xi)]\|_F^2]$. Let $p > 1$. The $p$-variance of $X(\xi)$ with respect to $\xi$ is defined by $\mathbb{V}_\xi^p[X(\xi)] := \mathbb{E}_\xi[\|X(\xi) - \mathbb{E}_\xi[X(\xi)]\|_F^p]$. The $2$-variance coincides with the variance (i.e., $\mathbb{V}_\xi^2[X(\xi)] = \mathbb{V}_\xi[X(\xi)]$). $\mathbb{E}_\xi[X(\xi) \mid Y]$ (resp. $\mathbb{V}_\xi^p[X(\xi) \mid Y]$) denotes the expectation (resp. the $p$-variance) of $X(\xi)$ conditioned on $Y$. When $\xi_0, \xi_1, \cdots, \xi_t$ are independent, we define the total expectation $\mathbb{E}$ by $\mathbb{E} := \mathbb{E}_{\xi_0} \mathbb{E}_{\xi_1} \cdots \mathbb{E}_{\xi_t}$. We write $\xi \sim \mathrm{DU}(N)$ when $\xi$ follows a discrete uniform distribution on $[N]$. The gradient of a differentiable function $f \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ is denoted by $\nabla f \colon \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$.

2 Nonconvex Hölder-Smooth ERM

Let $W \in \mathbb{R}^{m \times n}$ be a parameter of a DNN, let $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ be the training set, where data point $x_i$ is associated with label $y_i$, and let $f_i(\cdot) := f(\cdot\,; (x_i, y_i)) \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ be the loss function corresponding to the $i$-th labeled training data $(x_i, y_i)$. Empirical risk minimization (ERM) minimizes the empirical risk (ER) defined for all $W \in \mathbb{R}^{m \times n}$ as

$f(W) = \frac{1}{N} \sum_{i=1}^{N} f_i(W).$ (1)

This paper considers the following stationary point problem: find $W^\star \in \mathbb{R}^{m \times n}$ such that $\nabla f(W^\star) = O_{m \times n}$.

2.1 Assumptions and Examples

We assume that the loss functions $f_i$ ($i \in [N]$) satisfy the following conditions.
Assumption 2.1 Let $N \in \mathbb{N}$, $\nu \in (0, 1]$, $L_i = L_i(\nu) > 0$ ($i \in [N]$), and $p \in (1, 2]$.

(A1) $f_i \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ ($i \in [N]$) is $L_i$-Hölder smooth, i.e., for all $W_1, W_2 \in \mathbb{R}^{m \times n}$, $\|\nabla f_i(W_1) - \nabla f_i(W_2)\|_F \leq L_i \|W_1 - W_2\|_F^\nu$, and $f_i^\star := \inf\{f_i(W) \colon W \in \mathbb{R}^{m \times n}\} \in \mathbb{R}$.

(A2) Let $\xi$ be a random variable that is independent of $W \in \mathbb{R}^{m \times n}$. $\nabla f_\xi \colon \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ is the stochastic gradient of $\nabla f$ such that (i) [Unbiasedness of stochastic gradient] for all $W \in \mathbb{R}^{m \times n}$, $\mathbb{E}_\xi[\nabla f_\xi(W)] = \nabla f(W)$, and (ii) [Boundedness of $p$-variance of stochastic gradient] there exists $\sigma \geq 0$ such that, for all $W \in \mathbb{R}^{m \times n}$, $\mathbb{V}_\xi^p[\nabla f_\xi(W)] := \mathbb{E}_\xi[\|\nabla f_\xi(W) - \mathbb{E}_\xi[\nabla f_\xi(W)]\|_F^p] \leq \sigma^p$.

The $L_i$-Hölder smoothness (Hölder, 1882) of $f_i$ in Assumption 2.1(A1) is used to analyze mini-batch SGD (Fatkhullin et al., 2025, Assumption 4), (Yamada et al., 2026, Assumption 2.1), since almost all analyses of mini-batch SGD have been based on the following inequality (Nesterov, 2015, (2.5)), (Yashtini, 2016, Lemma 1), which holds under $L_i$-Hölder smoothness of $f_i$: for all $W_1, W_2 \in \mathbb{R}^{m \times n}$,

$f_i(W_1) \leq f_i(W_2) + \nabla f_i(W_2) \bullet (W_1 - W_2) + \frac{L_i}{1 + \nu} \|W_1 - W_2\|_F^{1+\nu}.$ (2)

Inequality (2) is called the generalized descent lemma, since it generalizes the descent lemma (Beck, 2017, Lemma 5.7), which holds under $L_i$-smoothness of $f_i$ (Assumption 2.1(A1) with $\nu = 1$). If $f_i^\star := \inf\{f_i(W) \colon W \in \mathbb{R}^{m \times n}\} = -\infty$ holds, then the loss function $f_i$ corresponding to the $i$-th labeled training data $(x_i, y_i)$ does not have any global minimizer, which implies that the empirical loss $f$ satisfies $f^\star := \inf\{f(W) \colon W \in \mathbb{R}^{m \times n}\} = -\infty$.
Hence, the interpolation property (Garrigos and Gower, 2024, Section 4.3.1) (i.e., there exists $W^\star \in \mathbb{R}^{m \times n}$ such that, for all $i \in [N]$, $f_i(W^\star) = f_i^\star \in \mathbb{R}$) does not hold, whereas the interpolation property does hold for optimization of a linear model with the squared hinge loss for binary classification on linearly separable data (Vaswani et al., 2019, Section 2). Moreover, in the case where $f$ is convex with $f^\star = -\infty$, there are no stationary points of $f$, which implies that no algorithm ever finds stationary points of $f$. Accordingly, the condition $f_i^\star := \inf\{f_i(W) \colon W \in \mathbb{R}^{m \times n}\} \in \mathbb{R}$ in (A1) is a natural one for training DNNs, including the case where the empirical loss $f$ is the cross-entropy with $W^\star \in \mathbb{R}^{m \times n}$ such that $f(W^\star) = \inf\{f(W) \colon W \in \mathbb{R}^{m \times n}\} \geq 0$.

The stochastic noise is defined by $N_\xi(W) := \nabla f_\xi(W) - \nabla f(W)$. Assumption 2.1(A2) thus ensures that

$\sigma^p \geq \mathbb{V}_\xi^p[\nabla f_\xi(W)] := \mathbb{E}_\xi[\|\nabla f_\xi(W) - \underbrace{\mathbb{E}_\xi[\nabla f_\xi(W)]}_{\nabla f(W)}\|_F^p] = \mathbb{E}_\xi[\|N_\xi(W)\|_F^p],$

which implies that the stochastic noise $N_\xi(W)$ is heavy-tailed when $p \in (1, 2)$ (Zhang et al., 2020, Assumption 1). The following example indicates that Assumption 2.1(A2) is satisfied when Assumption 2.1(A1) holds and the random variable $\xi$ follows the uniform distribution that is used to train DNNs in practice.

Example 2.1 (Example satisfying Assumption 2.1(A2)) Suppose that $f_i \colon \mathbb{R}^{m \times n} \to \mathbb{R}$ ($i \in [N]$) satisfies Assumption 2.1(A1) with $L_i < 2 L_i^\nu$, $p \in (1, 2]$, and $W \in \mathbb{R}^{m \times n}$ is independent of $\xi \sim \mathrm{DU}(N)$. Then, (i) $\mathbb{E}_{\xi \sim \mathrm{DU}(N)}[\nabla f_\xi(W)] = \nabla f(W)$. Moreover, if $f_i^{\star\star} := \sup\{f_i(W) \colon W \in \mathbb{R}^{m \times n}\} \in \mathbb{R}$ holds,[1] then

(ii) $\mathbb{V}_{\xi \sim \mathrm{DU}(N)}^p[\nabla f_\xi(W)] \leq \left\{\frac{1}{N} \sum_{i=1}^{N} \left(\frac{2 L_i^{1+\nu} (f_i^{\star\star} - f_i^\star)}{2 L_i^\nu - L_i} + \frac{(1 - \nu) L_i}{(1 + \nu)(2 L_i^\nu - L_i)}\right)\right\}^{\frac{p}{2}} =: \sigma^p.$

[1] It is sufficient that $(W_t)$ generated by an optimizer satisfies Assumption 2.1.
Hence, we may replace the condition $f_i^{\star\star} := \sup\{f_i(W) \colon W \in \mathbb{R}^{m \times n}\} \in \mathbb{R}$ in Example 2.1(ii) with the condition $f_i^{\star\star} := \sup\{f_i(W_t) \colon t \in \{0\} \cup \mathbb{N}\} \in \mathbb{R}$. The supremum of $f_i$ tends to $f_i(W_0) \in \mathbb{R}$, since $f_i$ satisfies the generalized descent lemma (2) and the optimizer has the descent property (see, e.g., Lemma 3.1).

Proof (i) From $\mathbb{P}(\xi = i) = \frac{1}{N}$, we have that $\mathbb{E}_{\xi \sim \mathrm{DU}(N)}[\nabla f_\xi(W)] := \sum_{i=1}^{N} \nabla f_i(W) \mathbb{P}(\xi = i) = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(W) = \nabla(\frac{1}{N} \sum_{i=1}^{N} f_i)(W) = \nabla f(W)$.

(ii) Let $i \in [N]$ and $W_2 \in \mathbb{R}^{m \times n}$. The generalized descent lemma (2) with $W_1 := W_2 - \frac{1}{L_i} \nabla f_i(W_2)$ ensures that

$f_i^\star \leq f_i(W_1) \leq f_i(W_2) - \frac{1}{L_i} \|\nabla f_i(W_2)\|_F^2 + \frac{1}{(1 + \nu) L_i^\nu} \|\nabla f_i(W_2)\|_F^{1+\nu}.$ (3)

We apply $a = \|\nabla f_i(W_2)\|_F^{1+\nu}$, $b = 1$, $p = \frac{2}{1+\nu}$, and $q = \frac{2}{1-\nu}$ to Young's inequality $ab \leq \frac{a^p}{p} + \frac{b^q}{q}$, where $\frac{1}{p} + \frac{1}{q} = 1$. Then,

$\|\nabla f_i(W_2)\|_F^{1+\nu} \leq \frac{1 + \nu}{2} \left(\|\nabla f_i(W_2)\|_F^{1+\nu}\right)^{\frac{2}{1+\nu}} + \frac{1 - \nu}{2} = \frac{1 + \nu}{2} \|\nabla f_i(W_2)\|_F^2 + \frac{1 - \nu}{2}.$ (4)

Accordingly, (3) and (4) ensure that

$f_i^\star \leq f_i(W_2) - \frac{1}{L_i} \|\nabla f_i(W_2)\|_F^2 + \frac{1}{(1 + \nu) L_i^\nu} \left(\frac{1 + \nu}{2} \|\nabla f_i(W_2)\|_F^2 + \frac{1 - \nu}{2}\right) = f_i(W_2) + \frac{L_i - 2 L_i^\nu}{2 L_i^{1+\nu}} \|\nabla f_i(W_2)\|_F^2 + \frac{1 - \nu}{2 (1 + \nu) L_i^\nu},$

which, together with $f_i^{\star\star} \in \mathbb{R}$ and $L_i - 2 L_i^\nu < 0$, implies that

$\|\nabla f_i(W_2)\|_F^2 \leq \frac{2 L_i^{1+\nu}}{2 L_i^\nu - L_i} (f_i^{\star\star} - f_i^\star) + \frac{(1 - \nu) L_i}{(1 + \nu)(2 L_i^\nu - L_i)}.$ (5)

Let $g \colon \mathbb{R} \to \mathbb{R}$ be concave (e.g., $g(x) = x^{\frac{p}{2}}$). Jensen's inequality thus ensures that, for all $X \in \mathbb{R}_+$, $\mathbb{E}_\xi[g(X(\xi))] \leq g(\mathbb{E}_\xi[X(\xi)])$. Hence, for all $X \in \mathbb{R}^{m \times n}$,

$\mathbb{V}_\xi^p[X(\xi)] = \mathbb{E}_\xi\!\left[\left(\|X(\xi) - \mathbb{E}_\xi[X(\xi)]\|_F^2\right)^{\frac{p}{2}}\right] \leq \left(\mathbb{E}_\xi[\|X(\xi) - \mathbb{E}_\xi[X(\xi)]\|_F^2]\right)^{\frac{p}{2}} = \left(\mathbb{V}_\xi[X(\xi)]\right)^{\frac{p}{2}},$

which, together with $\mathbb{V}_\xi[X(\xi)] = \mathbb{E}_\xi[\|X(\xi)\|_F^2] - \|\mathbb{E}_\xi[X(\xi)]\|_F^2 \leq \mathbb{E}_\xi[\|X(\xi)\|_F^2]$, implies that

$\mathbb{V}_\xi^p[X(\xi)] \leq \left(\mathbb{E}_\xi[\|X(\xi)\|_F^2]\right)^{\frac{p}{2}}.$ (6)
Applying $X(\xi) = \nabla f_\xi(W) = \nabla f_\xi(W_2)$ to (6) and using (5) lead to the finding that

$\mathbb{V}_{\xi \sim \mathrm{DU}(N)}^p[\nabla f_\xi(W)] \leq \left(\mathbb{E}_{\xi \sim \mathrm{DU}(N)}[\|\nabla f_\xi(W)\|_F^2]\right)^{\frac{p}{2}} = \left(\sum_{i=1}^{N} \|\nabla f_i(W)\|_F^2 \, \mathbb{P}(\xi = i)\right)^{\frac{p}{2}} \leq \left\{\frac{1}{N} \sum_{i=1}^{N} \left(\frac{2 L_i^{1+\nu}}{2 L_i^\nu - L_i} (f_i^{\star\star} - f_i^\star) + \frac{(1 - \nu) L_i}{(1 + \nu)(2 L_i^\nu - L_i)}\right)\right\}^{\frac{p}{2}},$

which indicates that Assumption 2.1(A2)(ii) holds.

2.2 Useful properties of the mini-batch gradient

Let $b \in \mathbb{N}$ be the batch size (the number of samples) and let $\xi = (\xi_1, \xi_2, \cdots, \xi_b)^\top$ comprise $b$ independent and identically distributed (i.i.d.) random variables and be independent of $W \in \mathbb{R}^{m \times n}$. Then, the mini-batch gradient of $f$ at $W$ is defined by

$\nabla f_\xi(W) := \frac{1}{b} \sum_{i=1}^{b} \nabla f_{\xi_i}(W).$ (7)

The following proposition indicates that the mini-batch gradient inherits useful properties of the stochastic gradient, such as unbiasedness and boundedness of the $p$-variance, in Assumption 2.1(A2).

Proposition 2.1 Suppose that Assumption 2.1 holds and let $\nabla f_\xi(W)$ be defined by (7). Then, the following hold.

(i) [Unbiasedness of mini-batch gradient] $\mathbb{E}_\xi[\nabla f_\xi(W)] = \nabla f(W)$;

(ii) [Boundedness of $p$-variance of mini-batch gradient] $\mathbb{V}_\xi^p[\nabla f_\xi(W)] \leq \frac{2^{2-p} \sigma^p}{b^{p-1}}$.

Proof (i) From the properties of $\mathbb{E}_\xi$ and Assumption 2.1(A2)(i), we have

$\mathbb{E}_\xi[\nabla f_\xi(W)] = \mathbb{E}_\xi\!\left[\frac{1}{b} \sum_{i=1}^{b} \nabla f_{\xi_i}(W)\right] = \frac{1}{b} \sum_{i=1}^{b} \mathbb{E}_{\xi_i}[\nabla f_{\xi_i}(W)] = \frac{1}{b} \sum_{i=1}^{b} \nabla f(W) = \nabla f(W).$

(ii) The definition of $\mathbb{V}_\xi^p$ and Proposition 2.1(i) imply that

$\mathbb{V}_\xi^p[\nabla f_\xi(W)] = \mathbb{E}_\xi\!\left[\left\|\frac{1}{b} \sum_{i=1}^{b} (\nabla f_{\xi_i}(W) - \nabla f(W))\right\|_F^p\right] = \frac{1}{b^p} \mathbb{E}_\xi\!\left[\left\|\sum_{i=1}^{b} (\nabla f_{\xi_i}(W) - \nabla f(W))\right\|_F^p\right] = \frac{1}{b^p} \mathbb{E}_\xi\!\left[\left\|\underbrace{\sum_{i=2}^{b} (\nabla f_{\xi_i}(W) - \nabla f(W))}_{\mathcal{W}(\xi_{[2:b]})} + \underbrace{(\nabla f_{\xi_1}(W) - \nabla f(W))}_{\mathcal{W}_1(\xi_1)}\right\|_F^p\right],$

where $\xi_{[2:b]} := (\xi_2, \xi_3, \cdots, \xi_b)^\top$.
In the case of $\mathcal{W}(\xi_{[2:b]}) = O_{m \times n}$ a.s., Assumption 2.1(A2) ensures that

$\mathbb{V}_\xi^p[\nabla f_\xi(W)] = \frac{1}{b^p} \mathbb{E}_{\xi_1}[\|\mathcal{W}_1\|_F^p] = \frac{1}{b^p} \mathbb{V}_{\xi_1}^p[\nabla f_{\xi_1}(W)] \leq \frac{\sigma^p}{b^p} \leq \frac{2^{2-p} \sigma^p}{b^{p-1}},$

which implies that Proposition 2.1(ii) holds. Let us consider the case of $\mathcal{W}(\xi_{[2:b]}) \neq O_{m \times n}$ a.s. From

$\|\mathcal{W} + \mathcal{W}_1\|_F^p \leq \|\mathcal{W}\|_F^p + 2^{2-p} \|\mathcal{W}_1\|_F^p + p \frac{\mathcal{W}}{\|\mathcal{W}\|_F^{2-p}} \bullet \mathcal{W}_1 \quad (\mathcal{W}_1, \mathcal{W} (\neq O_{m \times n}) \in \mathbb{R}^{m \times n})$

and the independence of $\mathcal{W}_1(\xi_1)$ and $\mathcal{W}(\xi_{[2:b]})$, we have

$\mathbb{E}_\xi[\|\mathcal{W} + \mathcal{W}_1\|_F^p] \leq \mathbb{E}_{\xi_{[2:b]}}[\|\mathcal{W}\|_F^p] + 2^{2-p} \mathbb{E}_{\xi_1}[\|\mathcal{W}_1\|_F^p] + \mathbb{E}_\xi\!\left[p \frac{\mathcal{W}}{\|\mathcal{W}\|_F^{2-p}} \bullet \mathcal{W}_1\right] = \mathbb{E}_{\xi_{[2:b]}}[\|\mathcal{W}\|_F^p] + 2^{2-p} \mathbb{E}_{\xi_1}[\|\mathcal{W}_1\|_F^p] + \mathbb{E}_{\xi_{[2:b]}}\!\left[p \frac{\mathcal{W}}{\|\mathcal{W}\|_F^{2-p}}\right] \bullet \mathbb{E}_{\xi_1}[\mathcal{W}_1],$ (8)

which, together with $\mathbb{E}_{\xi_1}[\mathcal{W}_1] = \mathbb{E}_{\xi_1}[\nabla f_{\xi_1}(W)] - \nabla f(W) = \nabla f(W) - \nabla f(W) = O_{m \times n}$ (by Assumption 2.1(A2)(i)), implies that

$\mathbb{E}_\xi[\|\mathcal{W} + \mathcal{W}_1\|_F^p] \leq \mathbb{E}_{\xi_{[2:b]}}[\|\mathcal{W}\|_F^p] + 2^{2-p} \mathbb{E}_{\xi_1}[\|\mathcal{W}_1\|_F^p]$ (9)

$= \mathbb{E}_{\xi_{[2:b]}}\!\left[\left\|\underbrace{\sum_{i=3}^{b} (\nabla f_{\xi_i}(W) - \nabla f(W))}_{\mathcal{W}(\xi_{[3:b]})} + \underbrace{(\nabla f_{\xi_2}(W) - \nabla f(W))}_{\mathcal{W}_2(\xi_2)}\right\|_F^p\right] + 2^{2-p} \mathbb{E}_{\xi_1}[\|\mathcal{W}_1\|_F^p].$

If $\mathcal{W}(\xi_{[3:b]}) = O_{m \times n}$ a.s., then Assumption 2.1(A2) and the condition $b \geq 3$ imply that

$\mathbb{V}_\xi^p[\nabla f_\xi(W)] \leq \frac{1}{b^p} \left(\mathbb{E}_{\xi_2}[\|\mathcal{W}_2\|_F^p] + 2^{2-p} \mathbb{E}_{\xi_1}[\|\mathcal{W}_1\|_F^p]\right) \leq \frac{2 \cdot 2^{2-p} \sigma^p}{b^p} \leq \frac{b \cdot 2^{2-p} \sigma^p}{b^p} = \frac{2^{2-p} \sigma^p}{b^{p-1}}.$

Hence, we may assume $\mathcal{W}(\xi_{[i:b]}) \neq O_{m \times n}$ a.s. ($i \in [3 : b]$). A similar argument to the one above for (9) leads to

$\mathbb{E}_\xi[\|\mathcal{W} + \mathcal{W}_1\|_F^p] \leq \mathbb{E}_{\xi_b}[\|\nabla f_{\xi_b}(W) - \nabla f(W)\|_F^p] + 2^{2-p} \sum_{i=1}^{b-1} \mathbb{E}_{\xi_i}[\|\nabla f_{\xi_i}(W) - \nabla f(W)\|_F^p] = \mathbb{V}_{\xi_b}^p[\nabla f_{\xi_b}(W)] + 2^{2-p} \sum_{i=1}^{b-1} \mathbb{V}_{\xi_i}^p[\nabla f_{\xi_i}(W)].$

Accordingly, from Assumption 2.1(A2)(ii),

$\mathbb{V}_\xi^p[\nabla f_\xi(W)] \leq \frac{1}{b^p} \left(\sigma^p + 2^{2-p} (b - 1) \sigma^p\right) \leq \frac{1}{b^p} \left(2^{2-p} \sigma^p + 2^{2-p} (b - 1) \sigma^p\right) = \frac{2^{2-p} \sigma^p}{b^{p-1}}.$

This completes the proof.
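Proposition 2.1(ii) can be checked with a quick Monte-Carlo sketch (our illustration; the zero-mean Student-t noise model with a finite $p$-th moment and all parameter values are our choices, not the paper's): averaging $b$ i.i.d. stochastic gradients shrinks the empirical $p$-variance, which stays below the bound $2^{2-p} \sigma^p / b^{p-1}$.

```python
import numpy as np

# Monte-Carlo sketch (illustrative noise model, not an experiment from the
# paper) of Proposition 2.1(ii): the p-variance of a mini-batch gradient
# built from b i.i.d. stochastic gradients is at most 2^(2-p) sigma^p / b^(p-1).
rng = np.random.default_rng(1)
p = 1.5
trials = 20_000

def noise(size):
    # zero-mean noise with heavier-than-Gaussian tails and finite p-th moment
    return rng.standard_t(df=5, size=size)

sigma_p = np.mean(np.abs(noise(10**6))**p)        # estimate of sigma^p

for b in [1, 4, 16, 64]:
    avg = noise((trials, b)).mean(axis=1)         # mini-batch "gradient noise"
    vp = np.mean(np.abs(avg)**p)                  # empirical p-variance
    bound = 2**(2 - p) * sigma_p / b**(p - 1)
    print(f"b={b:>3}: p-variance {vp:.4f}  <=  bound {bound:.4f}")
```

The decay with the batch size $b$ is what later allows an increasing batch size $b_t$ to control the noise term in the descent inequalities.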
3 Mini-batch SGD

First, we consider the following mini-batch SGD to minimize $f$ defined by (1) under Assumption 2.1: given an initial point $W_0 \in \mathbb{R}^{m \times n}$,

[Mini-batch SGD] $W_{t+1} = W_t + \eta_t D_t^{\mathrm{SGD}} = W_t - \eta_t \nabla f_{\xi_t}(W_t) = W_t - \frac{\eta_t}{b_t} \sum_{i=1}^{b_t} \nabla f_{\xi_{t,i}}(W_t),$ (10)

where $\eta_t > 0$ is the step size, $b_t \in \mathbb{N}$ is the batch size, $\xi_t = (\xi_{t,1}, \cdots, \xi_{t,b_t})^\top$ comprises $b_t$ i.i.d. random variables and is independent of $W_t$, and $D_t^{\mathrm{SGD}} := -\nabla f_{\xi_t}(W_t)$ is the search direction of mini-batch SGD. We may in theory assume sampling with replacement. In sampling with replacement, even if the batch size $b_t$ exceeds $N$, $\nabla f_{\xi_t} = \nabla f$ does not hold in general. Hence, to examine the convergence of mini-batch optimizers under sampling with replacement, we can use $b_t \to +\infty$ ($t \to +\infty$).

Although the previously reported results in (Fatkhullin et al., 2025, Theorem 4) and (Yamada et al., 2026, Theorem 3.5) indicated convergence of mini-batch SGD under Hölder smoothness, this section presents it in comparison with the convergence of the Muon optimizer.

3.1 Descent property

The following lemma gives the descent property of mini-batch SGD (10) to minimize $f$ defined by (1).

Lemma 3.1 Let $(W_t)$ be a sequence generated by mini-batch SGD (10) under Assumption 2.1 and let $\xi_{[t-1]} := \{\xi_0, \cdots, \xi_{t-1}\}$. Under the condition $\nabla f(W_t) \neq O_{m \times n}$ for all $t \in \{0\} \cup \mathbb{N}$,

(i) $\mathbb{E}_{\xi_t}[\nabla f(W_t) \bullet D_t^{\mathrm{SGD}} \mid \xi_{[t-1]}] = -\|\nabla f(W_t)\|_F^2 < 0$.

Let $L := \frac{1}{N} \sum_{i=1}^{N} L_i$. If $1 + \nu \leq p$ and $\eta_t^\nu < \frac{2}{L}$ hold, then

(ii) $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \frac{(1 - \nu) L \eta_t^{1+\nu}}{2 (1 + \nu)} + \frac{2^{3-(\nu+p)} L \sigma^p \eta_t^{1+\nu}}{(1 + \nu) b_t^{p-1}}.$

This implies that, if $1 + \nu \leq p$ holds and if $(\eta_t^{1+\nu})$ and $(\frac{\eta_t^{1+\nu}}{b_t^{p-1}})$ converge to $0$, then, for all $\epsilon > 0$, there exists $t_0 \in \mathbb{N}$ such that, for all $t \geq t_0$, $\eta_t^\nu < \frac{2}{L}$ and $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon$.
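Update (10) and the role of a diminishing step size can be sketched in a few lines (a minimal runnable illustration; the least-squares loss, the synthetic data, the batch size, and the schedule $\eta_t = \eta_0/(t+1)^a$ are our choices for the sketch, not prescriptions of the paper):

```python
import numpy as np

# Minimal sketch of mini-batch SGD (10) on a synthetic least-squares problem.
# The per-example loss f_i(W) = 0.5*||x_i^T W - y_i||^2, the data, and the
# schedule eta_t = eta0/(t+1)^a are illustrative assumptions.
rng = np.random.default_rng(2)
N, m, n = 256, 8, 4
X = rng.standard_normal((N, m))
W_true = rng.standard_normal((m, n))
Y = X @ W_true + 0.1 * rng.standard_normal((N, n))

def grad_i(W, i):
    # gradient of the per-example loss f_i at W
    return np.outer(X[i], X[i] @ W - Y[i])

W = np.zeros((m, n))
a, b, eta0 = 0.75, 32, 0.5               # step-size exponent, batch size, scale
for t in range(2000):
    eta = eta0 / (t + 1)**a              # diminishing step size
    batch = rng.integers(N, size=b)      # i.i.d. sampling with replacement
    D = -sum(grad_i(W, i) for i in batch) / b    # search direction D_t^SGD
    W = W + eta * D                      # update (10)

full_grad = sum(grad_i(W, i) for i in range(N)) / N
print("||grad f(W_T)||_F =", np.linalg.norm(full_grad))
```

The printed gradient norm is small but not exactly zero: consistent with Lemma 3.1(ii), the positive terms scale as $\eta_t^{1+\nu}$ and vanish as $\eta_t \to 0$, so the iterates approach a stationary point rather than jumping past it.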
Lemma 3.1(i) indicates that the search direction $D_t^{\mathrm{SGD}} = -\nabla f_{\xi_t}(W_t)$ is a descent direction of $f$ in the sense of the conditional expectation $\mathbb{E}_{\xi_t}[\,\cdot \mid \xi_{[t-1]}]$. However, the descent-direction property of $D_t^{\mathrm{SGD}}$ alone does not guarantee minimization of $f$, since using a large step size $\eta_t$ would increase $f$. Hence, in order to minimize $f$ by using mini-batch SGD (10), we will set a small step size $\eta_t$. In fact, Lemma 3.1(ii) indicates that, if we set a diminishing step size $\eta_t$ (e.g., $\eta_t$ decreases with each epoch), then mini-batch SGD (10) decreases $f$ in the sense that $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon \approx f(W_t)$.

Proof of Lemma 3.1 (i) The properties of $\mathbb{E}_{\xi_t}$ and Proposition 2.1(i) imply that, for all $t \in \{0\} \cup \mathbb{N}$,

$\mathbb{E}_{\xi_t}[\nabla f(W_t) \bullet D_t^{\mathrm{SGD}} \mid \xi_{[t-1]}] = -\mathbb{E}_{\xi_t}[\nabla f(W_t) \bullet \nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}] = -\nabla f(W_t) \bullet \underbrace{\mathbb{E}_{\xi_t}[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}]}_{\nabla f(W_t)} = -\|\nabla f(W_t)\|_F^2.$ (11)

(ii) Summing the generalized descent lemma (2) for the Hölder-smooth functions $f_i$ ($i \in [N]$) ensures that, for all $W_1, W_2 \in \mathbb{R}^{m \times n}$,

$\sum_{i=1}^{N} f_i(W_1) \leq \sum_{i=1}^{N} f_i(W_2) + \sum_{i=1}^{N} \nabla f_i(W_2) \bullet (W_1 - W_2) + \frac{\sum_{i=1}^{N} L_i}{1 + \nu} \|W_1 - W_2\|_F^{1+\nu},$

which, together with the definition (1) of $f$, implies that, for all $W_1, W_2 \in \mathbb{R}^{m \times n}$,

$f(W_1) \leq f(W_2) + \nabla f(W_2) \bullet (W_1 - W_2) + \frac{L}{1 + \nu} \|W_1 - W_2\|_F^{1+\nu},$ (12)

where $L := \frac{1}{N} \sum_{i=1}^{N} L_i$. Applying $W_1 = W_{t+1}$ and $W_2 = W_t$ to (12) and using $W_{t+1} - W_t = \eta_t D_t^{\mathrm{SGD}}$ imply that, for all $t \in \{0\} \cup \mathbb{N}$,

$\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] \leq f(W_t) + \eta_t \underbrace{\mathbb{E}_{\xi_t}[\nabla f(W_t) \bullet D_t^{\mathrm{SGD}} \mid \xi_{[t-1]}]}_{= -\|\nabla f(W_t)\|_F^2 \ \because\ (11)} + \frac{L \eta_t^{1+\nu}}{1 + \nu} \underbrace{\mathbb{E}_{\xi_t}[\|D_t^{\mathrm{SGD}}\|_F^{1+\nu} \mid \xi_{[t-1]}]}_{D_t}.$
Using the same proof technique as in (8) (i.e., the expansion of the $p$-th power) from Proposition 2.1(ii) and the same proof techniques as in (4) and (6) (i.e., Young's inequality and Jensen's inequality) from Example 2.1, we can evaluate an upper bound on $D_t$. From $\nabla f(W_t) \neq O_{m \times n}$ for all $t$ and the expansion of the $(1+\nu)$-th power, we have

$D_t \leq \|\nabla f(W_t)\|_F^{1+\nu} + 2^{1-\nu} \mathbb{E}_{\xi_t}[\|\nabla f_{\xi_t}(W_t) - \nabla f(W_t)\|_F^{1+\nu} \mid \xi_{[t-1]}] + \mathbb{E}_{\xi_t}\!\left[(1 + \nu) \frac{\nabla f(W_t)}{\|\nabla f(W_t)\|_F^{1-\nu}} \bullet (\nabla f_{\xi_t}(W_t) - \nabla f(W_t)) \,\Big|\, \xi_{[t-1]}\right]$

$= \|\nabla f(W_t)\|_F^{1+\nu} + 2^{1-\nu} \mathbb{V}_{\xi_t}^{1+\nu}[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}] + (1 + \nu) \frac{\nabla f(W_t)}{\|\nabla f(W_t)\|_F^{1-\nu}} \bullet \mathbb{E}_{\xi_t}[\nabla f_{\xi_t}(W_t) - \nabla f(W_t) \mid \xi_{[t-1]}]$

$\leq \|\nabla f(W_t)\|_F^{1+\nu} + 2^{1-\nu} \left(\mathbb{V}_{\xi_t}^p[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}]\right)^{\frac{1+\nu}{p}},$ (13)

where the relation $\mathbb{E}_{\xi_t}[\nabla f_{\xi_t}(W_t) - \nabla f(W_t) \mid \xi_{[t-1]}] = \mathbb{E}_{\xi_t}[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}] - \nabla f(W_t) = O_{m \times n}$ comes from Proposition 2.1(i), and $\mathbb{V}_{\xi_t}^{1+\nu}[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}] \leq (\mathbb{V}_{\xi_t}^p[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}])^{\frac{1+\nu}{p}}$ comes from Jensen's inequality with the concave function $g(x) = x^{\frac{1+\nu}{p}}$ by $1 + \nu \leq p$. Moreover, Young's inequality and Proposition 2.1(ii) ensure that

$D_t \leq \frac{1 + \nu}{2} \|\nabla f(W_t)\|_F^2 + \frac{1 - \nu}{2} + \frac{2^{3-(\nu+p)} \sigma^p}{b_t^{p-1}}.$ (14)

Accordingly, for all $t \in \{0\} \cup \mathbb{N}$,

$\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] \leq f(W_t) - \eta_t \|\nabla f(W_t)\|_F^2 + \frac{L \eta_t^{1+\nu}}{1 + \nu} \left(\frac{1 + \nu}{2} \|\nabla f(W_t)\|_F^2 + \frac{1 - \nu}{2} + \frac{2^{3-(\nu+p)} \sigma^p}{b_t^{p-1}}\right) = f(W_t) - \eta_t \left(1 - \frac{L \eta_t^\nu}{2}\right) \|\nabla f(W_t)\|_F^2 + \frac{(1 - \nu) L \eta_t^{1+\nu}}{2 (1 + \nu)} + \frac{2^{3-(\nu+p)} L \sigma^p \eta_t^{1+\nu}}{(1 + \nu) b_t^{p-1}},$ (15)

which, together with $\eta_t^\nu < \frac{2}{L}$, completes the proof.

3.2 Convergence

The following is a convergence analysis of mini-batch SGD (10) under Assumption 2.1.
Theorem 3.1, together with Lemma 3.1, indicates that mini-batch SGD converges to a stationary point of $f$ (a local minimizer of $f$ or a saddle point of $f$) under the conditions in (16), which are stronger than the convergence of $(\eta_t^{1+\nu})$ and $(\frac{\eta_t^{1+\nu}}{b_t^{p-1}})$ to $0$ in Lemma 3.1(ii).

Theorem 3.1 Let $(W_t)$ be a sequence generated by mini-batch SGD (10) under Assumption 2.1. If $1 + \nu \leq p$ holds and if $(\eta_t)$ and $(b_t)$ satisfy

$\sum_{t=0}^{+\infty} \eta_t = +\infty, \quad \sum_{t=0}^{+\infty} \eta_t^{1+\nu} < +\infty, \quad \sum_{t=0}^{+\infty} \frac{\eta_t^{1+\nu}}{b_t^{p-1}} < +\infty,$ (16)

then $(\nabla f(W_t))$ converges to $O_{m \times n}$ almost surely in the sense of the limit inferior.

Proof of Theorem 3.1 Inequality (15), (16), and the supermartingale convergence theorem (Bertsekas et al., 2003, Proposition 8.2.10) give

$\sum_{t=0}^{+\infty} \eta_t \left(1 - \frac{L \eta_t^\nu}{2}\right) \|\nabla f(W_t)\|_F^2 < +\infty \quad \text{a.s.}$

From $\eta_t \to 0$ ($t \to +\infty$), there exist $t_1 \in \mathbb{N}$ and $\eta > 0$ such that, for all $t \geq t_1$, $\eta_t^\nu \leq \eta^\nu < \frac{2}{L}$. Hence, we have

$\sum_{t=t_1}^{+\infty} \eta_t \|\nabla f(W_t)\|_F^2 < +\infty \quad \text{a.s.},$

which, together with $\sum_{t=0}^{+\infty} \eta_t = +\infty$, implies that $\liminf_{t \to +\infty} \|\nabla f(W_t)\|_F^2 = 0$, i.e., $\liminf_{t \to +\infty} \|\nabla f(W_t)\|_F = 0$ a.s. This completes the proof.

3.3 Convergence Rate

3.3.1 Upper convergence bound

We show an upper convergence rate of mini-batch SGD (10) in the sense of convergence of the Cesàro mean.

Theorem 3.2 Let $(W_t)$ be the sequence generated by mini-batch SGD (10) with $1 + \nu \leq p$ and with $(\eta_t)$ and $(b_t)$ satisfying (16) under Assumption 2.1. Then, for all $T \in \mathbb{N}$, the mean of $(W_t)_{t=t_1}^{T+t_1-1}$ satisfies

$\frac{1}{\sum_{t=t_1}^{T+t_1-1} \eta_t} \sum_{t=t_1}^{T+t_1-1} \eta_t \mathbb{E}[\|\nabla f(W_t)\|_F^2] = O\!\left(\frac{1}{\sum_{t=t_1}^{T+t_1-1} \eta_t}\right) \leq \frac{C_1(\nu)}{\sum_{t=t_1}^{T+t_1-1} \eta_t} + \frac{C_2(\nu)}{\sum_{t=t_1}^{T+t_1-1} \eta_t} \sum_{t=t_1}^{+\infty} \eta_t^{1+\nu} + \frac{C_3(\nu, p, \sigma)}{\sum_{t=t_1}^{T+t_1-1} \eta_t} \sum_{t=t_1}^{+\infty} \frac{\eta_t^{1+\nu}}{b_t^{p-1}},$

where $L := \frac{1}{N} \sum_{i=1}^{N} L_i$; $t_1 \in \mathbb{N}$ and $\eta > 0$ are such that, for all $t \geq t_1$, $\eta_t^\nu \leq \eta^\nu < \frac{2}{L}$; $f^\star \in \mathbb{R}$ is such that, for all $W \in \mathbb{R}^{m \times n}$, $f(W) \geq f^\star$; and

$C_1(\nu) := \frac{2 (\mathbb{E}[f(W_{t_1})] - f^\star)}{2 - L \eta^\nu}, \quad C_2(\nu) := \frac{(1 - \nu) L}{(1 + \nu)(2 - L \eta^\nu)}, \quad C_3(\nu, p, \sigma) := \frac{2^{4-(\nu+p)} L \sigma^p}{(1 + \nu)(2 - L \eta^\nu)}.$

Proof of Theorem 3.2 From $\eta_t \to 0$ ($t \to +\infty$), there exist $t_1 \in \mathbb{N}$ and $\eta > 0$ such that, for all $t \geq t_1$, $\eta_t^\nu \leq \eta^\nu < \frac{2}{L}$. Since (15) holds for all $t \geq t_1$, we can take the total expectation $\mathbb{E} = \mathbb{E}_t := \mathbb{E}_{\xi_{t_1}} \cdots \mathbb{E}_{\xi_t}$ of (15). Hence, for all $t \geq t_1$,

$\frac{2 - L \eta^\nu}{2} \eta_t \mathbb{E}[\|\nabla f(W_t)\|_F^2] \leq \mathbb{E}[f(W_t)] - \mathbb{E}[f(W_{t+1})] + \frac{(1 - \nu) L \eta_t^{1+\nu}}{2 (1 + \nu)} + \frac{2^{3-(\nu+p)} L \sigma^p \eta_t^{1+\nu}}{(1 + \nu) b_t^{p-1}}.$

Let $T \in \mathbb{N}$. Summing the above inequality from $t = t_1$ to $t = T + t_1 - 1$ and invoking Assumption 2.1(A1) (the existence of $f_i^\star$ ($i \in [N]$)) together ensure that

$\frac{2 - L \eta^\nu}{2} \sum_{t=t_1}^{T+t_1-1} \eta_t \mathbb{E}[\|\nabla f(W_t)\|_F^2] \leq \mathbb{E}[f(W_{t_1})] - \mathbb{E}[f(W_{T+t_1})] + \frac{(1 - \nu) L}{2 (1 + \nu)} \sum_{t=t_1}^{T+t_1-1} \eta_t^{1+\nu} + \frac{2^{3-(\nu+p)} L \sigma^p}{1 + \nu} \sum_{t=t_1}^{T+t_1-1} \frac{\eta_t^{1+\nu}}{b_t^{p-1}} \leq \mathbb{E}[f(W_{t_1})] - f^\star + \frac{(1 - \nu) L}{2 (1 + \nu)} \sum_{t=t_1}^{T+t_1-1} \eta_t^{1+\nu} + \frac{2^{3-(\nu+p)} L \sigma^p}{1 + \nu} \sum_{t=t_1}^{T+t_1-1} \frac{\eta_t^{1+\nu}}{b_t^{p-1}},$

where $f^\star$ satisfies $f(W) = \frac{1}{N} \sum_{i=1}^{N} f_i(W) \geq \frac{1}{N} \sum_{i=1}^{N} f_i^\star =: f^\star$ for all $W \in \mathbb{R}^{m \times n}$.
Accordingly, for all $T \in \mathbb{N}$,
$$\frac{1}{\sum_{t=t_1}^{T+t_1-1}\eta_t}\sum_{t=t_1}^{T+t_1-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F^2\right] \le \frac{2(\mathbb{E}[f(W_{t_1})]-f_{\star})}{2-L\eta^{\nu}}\frac{1}{\sum_{t=t_1}^{T+t_1-1}\eta_t} + \frac{(1-\nu)L}{(1+\nu)(2-L\eta^{\nu})}\frac{1}{\sum_{t=t_1}^{T+t_1-1}\eta_t}\sum_{t=t_1}^{T+t_1-1}\eta_t^{1+\nu} + \frac{2^{4-(\nu+p)}L\sigma^p}{(1+\nu)(2-L\eta^{\nu})}\frac{1}{\sum_{t=t_1}^{T+t_1-1}\eta_t}\sum_{t=t_1}^{T+t_1-1}\frac{\eta_t^{1+\nu}}{b_t^{p-1}},$$
which completes the proof.

3.3.2 Lower convergence bound

Lemma 3.1 indicates that, for sufficiently large steps $t$, mini-batch SGD (10) with an appropriate step size and batch size decreases $f$ in the sense that $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon \approx f(W_t)$. Moreover, Theorem 3.1 ensures convergence of mini-batch SGD (10) to a stationary point of $f$. When the empirical loss $f$ defined by (1) is a nonconvex function with many local minimizers, we may assume from Lemma 3.1 and Theorem 3.1 that mini-batch SGD (10) converges to a local minimizer, denoted by $W^{\star}$. Hence, we assume the following:

Assumption 3.1
(A3) $f$ is convex in a neighborhood of a convergent point $W^{\star}$;
(A4) There exists $t_2 \in \mathbb{N}$ such that $C_4 := \inf\{\mathbb{E}[f(W_{t_2})] - \mathbb{E}[f(W_t)] : t \ge t_2\} \ge 0$.

When Assumption 3.1(A3) holds, we have that, for all $W$ in a neighborhood $N(W^{\star};B) := \{W : \|W-W^{\star}\|_F \le B\}$ of a stationary point $W^{\star}$, where $B > 0$,
$$f(W) \ge f(W^{\star}) + \nabla f(W^{\star}) \bullet (W-W^{\star}) = f(W^{\star}),$$
which implies that $W^{\star}$ is a local minimizer of $f$. Hence, Assumption 3.1(A3) is a slightly stronger condition than requiring only that the convergent point be a local minimizer of $f$. Theorem 3.1 ensures that, for a sufficiently large $s$, $(W_t)_{t=s}^{+\infty} \subset N(W^{\star};B)$. Hence, $f$ is convex at $W_t$ ($t \ge s$) under Assumption 3.1(A3). Let us consider Assumption 3.1(A4).
Under Assumption 3.1(A3), Theorem 3.1 implies that, for all $\epsilon > 0$, there exists $t_2 \in \mathbb{N}$ such that, for all $t \ge t_2$, $\|\nabla f(W_t)\|_F \le \epsilon$ and $f$ is convex on $N(W^{\star};B)$ ($\ni W_t$). The Cauchy–Schwarz inequality thus ensures that
$$f(W_{t_2}) \ge f(W_t) + \nabla f(W_t) \bullet (W_{t_2}-W_t) \ge f(W_t) - \|\nabla f(W_t)\|_F \|W_{t_2}-W_t\|_F \ge f(W_t) - \epsilon \approx f(W_t).$$
Hence, Assumption 3.1(A4) is not a strong requirement for an optimizer that converges. The following theorem provides a lower convergence rate of mini-batch SGD (10) in the Cesàro mean.

Theorem 3.3  Let $(W_t)$ be the sequence generated by mini-batch SGD (10) with $(\eta_t)$ and $(b_t)$ satisfying (16) under Assumptions 2.1 and 3.1. Then, the mean of $(W_t)_{t=t_2}^{T+t_2-1}$ satisfies, for all $T \in \mathbb{N}$,
$$\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\sum_{t=t_2}^{T+t_2-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F^2\right] = \Omega\!\left(\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\right) \ge \frac{C_4}{\sum_{t=t_2}^{T+t_2-1}\eta_t},$$
where $t_2 \in \mathbb{N}$ is such that $C_4 := \inf\{\mathbb{E}[f(W_{t_2})] - \mathbb{E}[f(W_t)] : t \ge t_2\} \ge 0$.

Proof of Theorem 3.3  Assumption 3.1(A3) implies that, for all $t \ge t_2+1$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t_2:t-1]}\right] \ge f(W_t) + \eta_t \underbrace{\mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{SGD}} \mid \xi_{[t_2:t-1]}\right]}_{= -\|\nabla f(W_t)\|_F^2 \ (\because (11))}.$$
Taking the total expectation $\mathbb{E} := \mathbb{E}_{\xi_{t_2}}\cdots\mathbb{E}_{\xi_t}$ of the above inequality implies that, for all $t \ge t_2$,
$$\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F^2\right] \ge \mathbb{E}[f(W_t)] - \mathbb{E}[f(W_{t+1})].$$
Let $T \in \mathbb{N}$. Summing the above inequality from $t = t_2$ to $t = T+t_2-1$ and invoking Assumption 3.1(A4) together lead to
$$\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\sum_{t=t_2}^{T+t_2-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F^2\right] \ge \frac{\mathbb{E}[f(W_{t_2})] - \mathbb{E}[f(W_{T+t_2})]}{\sum_{t=t_2}^{T+t_2-1}\eta_t} \ge \frac{C_4}{\sum_{t=t_2}^{T+t_2-1}\eta_t},$$
which completes the proof.
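The mini-batch SGD setting of this section can be illustrated numerically. The sketch below is not from the paper: it runs mini-batch SGD on a toy quadratic empirical risk $f(W) = \frac{1}{2}\|W\|_F^2$ with heavy-tailed gradient noise (Student-$t$ with 3 degrees of freedom, so only moments of order $p < 3$ are finite), a diminishing step size $\eta_t = \eta/(t+1)^a$, and an increasing batch size $b_t = b\delta^t$ in the spirit of (16); all parameter values are illustrative choices, not tuned or prescribed by the theory.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
W = 3.0 * rng.standard_normal((m, n))        # W_0

eta0, a = 0.5, 0.7                           # eta_t = eta0 / (t + 1)^a
b0, delta = 8, 1.03                          # b_t = b0 * delta^t (increasing batch size)
T = 200

weighted_sq_norms, weights = [], []
for t in range(T):
    eta = eta0 / (t + 1) ** a
    b = int(b0 * delta ** t)
    grad = W                                 # exact gradient of f(W) = 0.5 * ||W||_F^2
    # heavy-tailed mini-batch noise: mean of b i.i.d. Student-t(3) matrices
    noise = rng.standard_t(3.0, size=(b, m, n)).mean(axis=0)
    W = W - eta * (grad + noise)             # mini-batch SGD step
    weighted_sq_norms.append(eta * np.linalg.norm(grad, "fro") ** 2)
    weights.append(eta)

# Cesaro-weighted averages of ||grad f(W_t)||_F^2 over early and late windows
first_half = sum(weighted_sq_norms[: T // 2]) / sum(weights[: T // 2])
second_half = sum(weighted_sq_norms[T // 2 :]) / sum(weights[T // 2 :])
print(first_half, second_half)
```

With the seed fixed, the weighted average over the later window is much smaller than over the early one, which matches the $O(1/\sum \eta_t)$ flavor of the bounds only qualitatively; a single run on a toy problem proves nothing, of course.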
4 Muon without Momentum: Comparisons with Mini-batch SGD

The Muon optimizer (Jordan et al., 2024) is updated as follows: given initial points $W_0, M_{-1} \in \mathbb{R}^{m\times n}$ and a momentum parameter $\beta \in [0,1)$,

[Muon]
$$M_t = \beta M_{t-1} + (1-\beta)\nabla f_{\xi_t}(W_t),$$
$$O_t \in \operatorname{argmin}\{\|O - M_t\|_F : O^{\top}O = I_n\},$$
$$W_{t+1} = W_t + \eta_t D_t^{\mathrm{Muon}} = W_t - \eta_t O_t. \qquad (17)$$

To compare the convergence properties of mini-batch SGD (Section 3) using the mini-batch gradient (7) fairly with those of Muon, we consider a Muon optimizer without momentum, i.e., the case of $\beta = 0$, minimizing $f$ defined by (1) under Assumption 2.1:

[Muon with $\beta = 0$]
$$G_t = \nabla f_{\xi_t}(W_t),$$
$$O_t := U_tV_t^{\top} \in \operatorname{argmin}\{\|O - G_t\|_F : O^{\top}O = I_n\},$$
$$W_{t+1} = W_t + \eta_t D_t^{\mathrm{Muon}} = W_t - \eta_t O_t = W_t - \eta_t U_tV_t^{\top}, \qquad (18)$$

where $U_t \in \mathbb{R}^{m\times r}$ and $V_t \in \mathbb{R}^{n\times r}$ are the matrices in the singular value decomposition of $G_t$, i.e., $G_t = U_t\Sigma_tV_t^{\top}$, and $\Sigma_t$ is a diagonal matrix whose diagonal entries are the $r$ singular values of $G_t$. The $m\times n$ matrix with orthonormal columns $O_t := U_tV_t^{\top}$ minimizes the function $F_t(O) := \|O - G_t\|_F$ over the Stiefel manifold $\mathrm{St}(n,m) := \{O \in \mathbb{R}^{m\times n} : O^{\top}O = I_n\}$ (Bernstein and Newhouse, 2024, Proposition 4). Computing $O_t := U_tV_t^{\top}$ directly is expensive, since it requires the singular value decomposition of $G_t$. In practice, we use an approximation $X_{t,K}$ of $O_t := U_tV_t^{\top}$ computed with the following Newton–Schulz iteration $(X_{t,k})_{k=0}^{K}$: given $X_{t,0} := G_t/\|G_t\|_F$ and $a, b, c \in \mathbb{R}$,
$$X_{t,k+1} = aX_{t,k} + b\,X_{t,k}X_{t,k}^{\top}X_{t,k} + c\,(X_{t,k}X_{t,k}^{\top})^2X_{t,k} \;\to\; O_t := U_tV_t^{\top} \quad (k \to +\infty).$$

4.1 Descent property

The following lemma gives the descent property of Muon (18) with $\beta = 0$ for minimizing $f$ defined by (1).

Lemma 4.1  Let $(W_t)$ be the sequence generated by Muon (18) with $\beta = 0$ under Assumption 2.1, $\xi_{[t-1]} := \{\xi_0,\cdots,\xi_{t-1}\}$, and $L := \frac{1}{N}\sum_{i=1}^N L_i$.
Suppose that $\nabla f(W_t) \neq O_{m\times n}$ for all $t \in \{0\}\cup\mathbb{N}$. Then:

(i) $\displaystyle \mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}\right] \le -\|\nabla f(W_t)\|_F + \frac{2^{2/p}\sqrt{n}\,\sigma}{b_t^{(p-1)/p}}$;

(ii) $\displaystyle \mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t-1]}\right] < f(W_t) + \frac{Ln^{\frac{1+\nu}{2}}\eta_t^{1+\nu}}{1+\nu} + \frac{2^{2/p}\sqrt{n}\,\sigma\,\eta_t}{b_t^{(p-1)/p}}$.

This implies that, if $(\eta_t^{1+\nu})$ and $(\eta_t/b_t^{(p-1)/p})$ converge to $0$, then, for all $\epsilon > 0$, there exists $t_0 \in \mathbb{N}$ such that, for all $t \ge t_0$, $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon$.

Lemma 4.1(i) indicates that, if $(b_t^{-(p-1)/p})$ converges to $0$ (e.g., $b_t$ increases with each epoch), then the search direction $D_t^{\mathrm{Muon}} = -O_t = -U_tV_t^{\top}$ is a descent direction of $f$ in the sense that $\mathbb{E}_{\xi_t}[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}] \le -\|\nabla f(W_t)\|_F + \epsilon \approx -\|\nabla f(W_t)\|_F < 0$.

Let us compare Lemma 3.1(ii) with Lemma 4.1(ii). Lemma 3.1(ii) shows that, under the conditions $1+\nu \le p$ and $\eta_t^{\nu} < \frac{2}{L}$, mini-batch SGD with a diminishing step size $\eta_t$ decreases $f$ in the sense that $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon$. We do not know whether the conditions $1+\nu \le p$ (used to evaluate $\mathcal{D}_t$ in (13) with Jensen's inequality) and $\eta_t^{\nu} < \frac{2}{L}$ (used to eliminate the term $\|\nabla f(W_t)\|_F$ in (15) that comes from (14) and Young's inequality) hold before running mini-batch SGD, since $p$ and $L$ in Assumption 2.1 are unknown parameters. Hence, we may need to exercise caution when using mini-batch SGD to train DNNs. Meanwhile, Lemma 4.1(ii) indicates that Muon (18) with $\beta = 0$ and a diminishing step size $\eta_t$ decreases $f$ in the sense that $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon$ without such unverifiable conditions as $1+\nu \le p$ and $\eta_t^{\nu} < \frac{2}{L}$. This is because we can evaluate $\mathcal{G}_t$ in (21) and $\mathcal{D}_t$ in (22) without using Jensen's inequality or Young's inequality (see the proof of Lemma 4.1 for details).
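The update (18) can be sketched in a few lines. The following is an illustrative implementation, not the author's code: `orthogonal_factor_svd` computes $O_t = U_tV_t^{\top}$ exactly by SVD, while `orthogonal_factor_ns` approximates it with the Newton–Schulz iteration of Section 4. The paper leaves $a, b, c$ unspecified; the cubic coefficients $(a,b,c) = (1.5, -0.5, 0)$ used here are one classical choice for which the iteration provably drives every singular value of $X_{t,0} = G_t/\|G_t\|_F$ to $1$ (practical Muon implementations use tuned quintic coefficients instead).

```python
import numpy as np

def orthogonal_factor_svd(G):
    """Exact O_t = U_t V_t^T, the point of St(n, m) closest to G_t in Frobenius norm."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def orthogonal_factor_ns(G, K=40, a=1.5, b=-0.5, c=0.0):
    """Newton-Schulz approximation X_{t,K} of O_t:
    X_{k+1} = a X_k + b (X_k X_k^T) X_k + c (X_k X_k^T)^2 X_k,  X_0 = G / ||G||_F.
    With (a, b, c) = (1.5, -0.5, 0) the singular values converge monotonically to 1."""
    X = G / np.linalg.norm(G, "fro")
    for _ in range(K):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_step(W, grad, eta):
    """One Muon update (18) with beta = 0: W_{t+1} = W_t - eta_t * O_t."""
    return W - eta * orthogonal_factor_ns(grad)
```

For a full-rank gradient the two factors agree to high accuracy; the Newton–Schulz variant needs only matrix products, which is why it is preferred in practice.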
Proof of Lemma 4.1  (i) From $D_t^{\mathrm{Muon}} := -O_t = -U_tV_t^{\top}$, we have
$$\mathcal{G}_t := \mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}\right] = \underbrace{-\mathbb{E}_{\xi_t}\!\left[G_t \bullet O_t \mid \xi_{[t-1]}\right]}_{\mathcal{G}_{1,t}} + \underbrace{\mathbb{E}_{\xi_t}\!\left[(G_t - \nabla f(W_t)) \bullet O_t \mid \xi_{[t-1]}\right]}_{\mathcal{G}_{2,t}}.$$
From (18) and the expansion $\|O - G_t\|_F^2 = \|O\|_F^2 - 2G_t \bullet O + \|G_t\|_F^2 = -2G_t \bullet O + (n + \|G_t\|_F^2)$ ($O \in \mathrm{St}(n,m)$), we have
$$O_t \in \operatorname{argmin}\{\|O - G_t\|_F^2 : O^{\top}O = I_n\} = \operatorname{argmax}\{G_t \bullet O : \|O\|_2 = 1\}.$$
Hence, the definition of the dual norm $\|\cdot\|_{2,*}$ of $\|\cdot\|_2$ ensures that $\|G_t\|_{2,*} := \max\{G_t \bullet O : \|O\|_2 = 1\} = G_t \bullet O_t$. Accordingly, the triangle inequality for $\|\cdot\|_{2,*}$ gives
$$\mathcal{G}_{1,t} = -\mathbb{E}_{\xi_t}\!\left[\|G_t\|_{2,*} \mid \xi_{[t-1]}\right] \le -\|\nabla f(W_t)\|_{2,*} + \mathbb{E}_{\xi_t}\!\left[\|G_t - \nabla f(W_t)\|_{2,*} \mid \xi_{[t-1]}\right].$$
This, together with the relation $\|W\|_F \le \|W\|_{2,*} \le \sqrt{n}\|W\|_F$ ($W \in \mathbb{R}^{m\times n}$) and the same technique used to prove (6) (i.e., Jensen's inequality) in Example 2.1, implies that
$$\mathcal{G}_{1,t} \le -\|\nabla f(W_t)\|_F + \sqrt{n}\,\mathrm{V}^{p}_{\xi_t}\!\left[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}\right]^{1/p}.$$
Proposition 2.1(ii) thus ensures that
$$\mathcal{G}_{1,t} \le -\|\nabla f(W_t)\|_F + \frac{2^{\frac{2-p}{p}}\sqrt{n}\,\sigma}{b_t^{(p-1)/p}}. \qquad (19)$$
From $O_t \in \mathrm{St}(n,m)$, we have $\|O_t\|_F = \sqrt{n}$. The Cauchy–Schwarz inequality, together with the same technique used to prove (19), ensures that
$$\mathcal{G}_{2,t} \le \mathbb{E}_{\xi_t}\!\left[\|O_t\|_F \mid \xi_{[t-1]}\right]\mathrm{V}^{p}_{\xi_t}\!\left[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}\right]^{1/p} \le \frac{2^{\frac{2-p}{p}}\sqrt{n}\,\sigma}{b_t^{(p-1)/p}}. \qquad (20)$$
From (19) and (20), we have
$$\mathcal{G}_t \le -\|\nabla f(W_t)\|_F + \frac{2^{2/p}\sqrt{n}\,\sigma}{b_t^{(p-1)/p}}, \qquad (21)$$
which completes the proof of (i).

(ii) Applying $W_1 = W_{t+1}$ and $W_2 = W_t$ to (12) and using $W_{t+1} - W_t = \eta_t D_t^{\mathrm{Muon}}$ imply that, for all $t \in \mathbb{N}$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t-1]}\right] \le f(W_t) + \eta_t\underbrace{\mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}\right]}_{\mathcal{G}_t} + \frac{L\eta_t^{1+\nu}}{1+\nu}\underbrace{\mathbb{E}_{\xi_t}\!\left[\|D_t^{\mathrm{Muon}}\|_F^{1+\nu} \mid \xi_{[t-1]}\right]}_{\mathcal{D}_t}.$$
From the same technique used to prove (20) (i.e., $\|O_t\|_F = \sqrt{n}$), we have
$$\mathcal{D}_t = \mathbb{E}_{\xi_t}\!\left[\|O_t\|_F^{1+\nu} \mid \xi_{[t-1]}\right] = n^{\frac{1+\nu}{2}}. \qquad (22)$$
Accordingly, for all $t \in \{0\}\cup\mathbb{N}$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t-1]}\right] \le f(W_t) + \eta_t\left(-\|\nabla f(W_t)\|_F + \frac{2^{2/p}\sqrt{n}\,\sigma}{b_t^{(p-1)/p}}\right) + \frac{L\eta_t^{1+\nu}}{1+\nu}n^{\frac{1+\nu}{2}} < f(W_t) + \frac{2^{2/p}\sqrt{n}\,\sigma\,\eta_t}{b_t^{(p-1)/p}} + \frac{Ln^{\frac{1+\nu}{2}}\eta_t^{1+\nu}}{1+\nu}. \qquad (23)$$
This completes the proof.

4.2 Convergence

The following is a convergence analysis of Muon (18) with $\beta = 0$ under Assumption 2.1. Theorem 3.1 indicates that, in order to converge, mini-batch SGD must satisfy the condition $1+\nu \le p$, while Theorem 4.1 indicates that Muon (18) only requires the step size $\eta_t$ and batch size $b_t$ to be set appropriately.

Theorem 4.1  Let $(W_t)$ be the sequence generated by Muon (18) with $\beta = 0$ under Assumption 2.1. If $(\eta_t)$ and $(b_t)$ satisfy
$$\sum_{t=0}^{+\infty}\eta_t = +\infty, \quad \sum_{t=0}^{+\infty}\eta_t^{1+\nu} < +\infty, \quad \sum_{t=0}^{+\infty}\frac{\eta_t}{b_t^{(p-1)/p}} < +\infty, \qquad (24)$$
then $(\nabla f(W_t))$ converges to $O_{m\times n}$ almost surely in the sense of the limit inferior.

Proof of Theorem 4.1  Inequality (23), (24), and the supermartingale convergence theorem (Bertsekas et al., 2003, Proposition 8.2.10) give
$$\sum_{t=0}^{+\infty}\eta_t\|\nabla f(W_t)\|_F < +\infty \quad \text{a.s.},$$
which, together with $\sum_{t=0}^{+\infty}\eta_t = +\infty$, implies that $\liminf_{t\to+\infty}\|\nabla f(W_t)\|_F = 0$. This completes the proof.

4.3 Convergence rate

4.3.1 Upper convergence bound

The following gives an upper convergence rate of Muon (18) with $\beta = 0$ in the Cesàro mean.

Theorem 4.2  Let $(W_t)$ be the sequence generated by Muon (18) with $\beta = 0$ and $(\eta_t)$ and $(b_t)$ satisfying (24) under Assumption 2.1. Then, the mean of $(W_t)_{t=0}^{T-1}$ satisfies, for all $T \in \mathbb{N}$,
$$\frac{1}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] = O\!\left(\frac{1}{\sum_{t=0}^{T-1}\eta_t}\right)$$
$$\le \frac{C_1}{\sum_{t=0}^{T-1}\eta_t} + \frac{C_2(\nu)}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{+\infty}\eta_t^{1+\nu} + \frac{C_3(p,\sigma)}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{+\infty}\frac{\eta_t}{b_t^{(p-1)/p}},$$
where $L := \frac{1}{N}\sum_{i=1}^N L_i$, $f_{\star} \in \mathbb{R}$ is such that, for all $W \in \mathbb{R}^{m\times n}$, $f(W) \ge f_{\star}$, and
$$C_1 := f(W_0) - f_{\star}, \quad C_2(\nu) := \frac{Ln^{\frac{1+\nu}{2}}}{1+\nu}, \quad C_3(p,\sigma) := 2^{2/p}\sqrt{n}\,\sigma.$$

In contrast to mini-batch SGD in Theorem 3.2, which needs the existence of $t_1 \in \mathbb{N}$ such that, for all $t \ge t_1$, $\eta_t^{\nu} < \frac{2}{L}$, Theorem 4.2 shows that Muon (18) with $\beta = 0$ has the simpler upper convergence bound $O(1/\sum_{t=0}^{T-1}\eta_t)$.

Proof of Theorem 4.2  Taking the total expectation $\mathbb{E} = \mathbb{E}_t := \mathbb{E}_{\xi_0}\cdots\mathbb{E}_{\xi_t}$ of (23) ensures that, for all $t \in \{0\}\cup\mathbb{N}$,
$$\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] \le \mathbb{E}[f(W_t)] - \mathbb{E}[f(W_{t+1})] + \frac{Ln^{\frac{1+\nu}{2}}\eta_t^{1+\nu}}{1+\nu} + \frac{2^{2/p}\sqrt{n}\,\sigma\,\eta_t}{b_t^{(p-1)/p}}.$$
Summing the above inequality from $t = 0$ to $t = T-1$, where $T \in \mathbb{N}$, implies that
$$\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] \le f(W_0) - f_{\star} + \frac{Ln^{\frac{1+\nu}{2}}}{1+\nu}\sum_{t=0}^{T-1}\eta_t^{1+\nu} + 2^{2/p}\sqrt{n}\,\sigma\sum_{t=0}^{T-1}\frac{\eta_t}{b_t^{(p-1)/p}}.$$
Dividing the above inequality by $\sum_{t=0}^{T-1}\eta_t$ leads to the assertion of Theorem 4.2.

4.3.2 Lower convergence bound

The following presents a lower convergence bound of Muon (18) with $\beta = 0$.

Theorem 4.3  Let $(W_t)$ be the sequence generated by Muon (18) with $\beta = 0$ and $(\eta_t)$ and $(b_t)$ satisfying (24) under Assumptions 2.1 and 3.1. Then, the mean of $(W_t)_{t=t_2}^{T+t_2-1}$ satisfies, for all $T \in \mathbb{N}$,
$$\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\sum_{t=t_2}^{T+t_2-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] = \Omega\!\left(\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\right) \ge \frac{C_4}{\sum_{t=t_2}^{T+t_2-1}\eta_t},$$
where $t_2 \in \mathbb{N}$ is such that $C_4 := \frac{1}{\sqrt{n}}\inf\{\mathbb{E}[f(W_{t_2})] - \mathbb{E}[f(W_t)] : t \ge t_2\} \ge 0$.

Proof of Theorem 4.3  Assumption 3.1(A3) implies that, for all $t \ge t_2+1$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t_2:t-1]}\right] \ge f(W_t) + \eta_t\underbrace{\mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t_2:t-1]}\right]}_{\mathcal{G}_t}.$$
The Cauchy–Schwarz inequality, together with $D_t^{\mathrm{Muon}} = -O_t$ and $\|O_t\|_F = \sqrt{n}$, ensures that
$$\mathcal{G}_t = -\mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet O_t \mid \xi_{[t_2:t-1]}\right] \ge -\|\nabla f(W_t)\|_F\,\mathbb{E}_{\xi_t}\!\left[\|O_t\|_F \mid \xi_{[t_2:t-1]}\right] = -\sqrt{n}\,\|\nabla f(W_t)\|_F,$$
which implies that, for all $t \ge t_2+1$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t_2:t-1]}\right] \ge f(W_t) - \sqrt{n}\,\eta_t\|\nabla f(W_t)\|_F.$$
Taking the total expectation $\mathbb{E} := \mathbb{E}_{\xi_{t_2}}\cdots\mathbb{E}_{\xi_t}$ of the above inequality implies that, for all $t \ge t_2$,
$$\sqrt{n}\,\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] \ge \mathbb{E}[f(W_t)] - \mathbb{E}[f(W_{t+1})].$$
Let $T \in \mathbb{N}$. By summing the above inequality from $t = t_2$ to $t = T+t_2-1$ and invoking Assumption 3.1(A4), we have
$$\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\sum_{t=t_2}^{T+t_2-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] \ge \frac{\mathbb{E}[f(W_{t_2})] - \mathbb{E}[f(W_{T+t_2})]}{\sqrt{n}\sum_{t=t_2}^{T+t_2-1}\eta_t} \ge \frac{C_4}{\sum_{t=t_2}^{T+t_2-1}\eta_t},$$
which completes the proof.

5 Muon

The singular value decomposition of the matrix $M_t = \beta M_{t-1} + (1-\beta)G_t$ in Muon (17) is represented by $M_t = U_t\Sigma_tV_t^{\top}$, where $U_t \in \mathbb{R}^{m\times r}$, $V_t \in \mathbb{R}^{n\times r}$, and $\Sigma_t$ is a diagonal matrix whose diagonal entries are the $r$ singular values of $M_t$. The matrix $O_t := U_tV_t^{\top}$ minimizes the function $F_t(O) := \|O - M_t\|_F$ over $\mathrm{St}(n,m)$. This implies that Muon (17) with $\beta = 0$ is structurally almost identical to Muon (18). Hence, we can analyze the convergence of Muon (17) by using the results and proof techniques in Section 4.

5.1 Descent property

The following lemma gives the descent property of Muon (17) for minimizing $f$ defined by (1). The only difference from the proof of Lemma 4.1 is in evaluating $\mathcal{M}_{3,t}$ in (25).

Lemma 5.1  Let $(W_t)$ be the sequence generated by Muon (17) under Assumption 2.1, $\xi_{[t-1]} := \{\xi_0,\cdots,\xi_{t-1}\}$, and $L := \frac{1}{N}\sum_{i=1}^N L_i$.
Suppose that $\nabla f(W_t) \neq O_{m\times n}$ for all $t \in \{0\}\cup\mathbb{N}$. Then:

(i) $\displaystyle \mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}\right] \le -\|\nabla f(W_t)\|_F + 2\sqrt{n}\left(\beta^t\|M_0-\nabla f(W_0)\|_F + Ln^{\frac{\nu}{2}}\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + (1-\beta)2^{\frac{2-p}{p}}\sigma\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}\right)$;

(ii) $\displaystyle \mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t-1]}\right] < f(W_t) + \frac{Ln^{\frac{1+\nu}{2}}\eta_t^{1+\nu}}{1+\nu} + 2\sqrt{n}\,\|M_0-\nabla f(W_0)\|_F\,\eta_t\beta^t + 2Ln^{\frac{1+\nu}{2}}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + 2^{\frac{3-p}{p}}(1-\beta)\sqrt{n}\,\sigma\,\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}$.

This implies that, if $(\eta_t^{1+\nu})$, $(\eta_t\beta^t)$, $(\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu})$, and $(\eta_t\sum_{i=0}^{t}\beta^i b_{t-i}^{-(p-1)/p})$ converge to $0$, then, for all $\epsilon > 0$, there exists $t_0 \in \mathbb{N}$ such that, for all $t \ge t_0$, $\mathbb{E}_{\xi_t}[f(W_{t+1}) \mid \xi_{[t-1]}] < f(W_t) + \epsilon$.

Proof of Lemma 5.1  (i) From $D_t^{\mathrm{Muon}} := -O_t = -U_tV_t^{\top}$, we have
$$\mathcal{M}_t := \mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}\right] = \underbrace{-\mathbb{E}_{\xi_t}\!\left[M_t \bullet O_t \mid \xi_{[t-1]}\right]}_{\mathcal{M}_{1,t}} + \underbrace{\mathbb{E}_{\xi_t}\!\left[(M_t - \nabla f(W_t)) \bullet O_t \mid \xi_{[t-1]}\right]}_{\mathcal{M}_{2,t}}.$$
From (17) and the expansion $\|O - M_t\|_F^2 = \|O\|_F^2 - 2M_t \bullet O + \|M_t\|_F^2 = -2M_t \bullet O + (n + \|M_t\|_F^2)$ ($O \in \mathrm{St}(n,m)$), we have
$$O_t \in \operatorname{argmin}\{\|O - M_t\|_F^2 : O^{\top}O = I_n\} = \operatorname{argmax}\{M_t \bullet O : \|O\|_2 = 1\}.$$
Hence, the definition of the dual norm $\|\cdot\|_{2,*}$ of $\|\cdot\|_2$ ensures that $\|M_t\|_{2,*} := \max\{M_t \bullet O : \|O\|_2 = 1\} = M_t \bullet O_t$. Accordingly, from the triangle inequality for $\|\cdot\|_{2,*}$ and the relation $\|W\|_F \le \|W\|_{2,*} \le \sqrt{n}\|W\|_F$ ($W \in \mathbb{R}^{m\times n}$), we have
$$\mathcal{M}_{1,t} = -\mathbb{E}_{\xi_t}\!\left[\|M_t\|_{2,*} \mid \xi_{[t-1]}\right] \le -\|\nabla f(W_t)\|_{2,*} + \mathbb{E}_{\xi_t}\!\left[\|M_t - \nabla f(W_t)\|_{2,*} \mid \xi_{[t-1]}\right] \le -\|\nabla f(W_t)\|_F + \sqrt{n}\underbrace{\mathbb{E}_{\xi_t}\!\left[\|M_t - \nabla f(W_t)\|_F \mid \xi_{[t-1]}\right]}_{\mathcal{M}_{3,t}}. \qquad (25)$$
Moreover, from the definition of $M_t$ and the triangle inequality, we have
$$\mathcal{M}_{3,t} = \mathbb{E}_{\xi_t}\!\left[\|\beta(M_{t-1} - \nabla f(W_t)) + (1-\beta)(G_t - \nabla f(W_t))\|_F \mid \xi_{[t-1]}\right]$$
$$\le \beta\|M_{t-1} - \nabla f(W_t)\|_F + (1-\beta)\,\mathbb{E}_{\xi_t}\!\left[\|G_t - \nabla f(W_t)\|_F \mid \xi_{[t-1]}\right]$$
$$\le \beta\|M_{t-1} - \nabla f(W_{t-1})\|_F + \beta\|\nabla f(W_{t-1}) - \nabla f(W_t)\|_F + (1-\beta)\,\mathrm{V}^{1}_{\xi_t}\!\left[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}\right].$$
This, together with Assumption 2.1(A1) ($L$-Hölder smoothness of $f$), $\|W_{t-1} - W_t\|_F = \eta_{t-1}\|O_{t-1}\|_F = \sqrt{n}\,\eta_{t-1}$, the same technique used to prove (6) (i.e., Jensen's inequality) in Example 2.1, and Proposition 2.1(ii), implies that
$$\mathcal{M}_{3,t} \le \beta\|M_{t-1} - \nabla f(W_{t-1})\|_F + \beta Ln^{\frac{\nu}{2}}\eta_{t-1}^{\nu} + (1-\beta)\,\mathrm{V}^{p}_{\xi_t}\!\left[\nabla f_{\xi_t}(W_t) \mid \xi_{[t-1]}\right]^{1/p} \le \beta\|M_{t-1} - \nabla f(W_{t-1})\|_F + \beta Ln^{\frac{\nu}{2}}\eta_{t-1}^{\nu} + (1-\beta)\frac{2^{\frac{2-p}{p}}\sigma}{b_t^{(p-1)/p}}.$$
Induction thus gives
$$\mathcal{M}_{3,t} \le \beta\left(\beta\|M_{t-2} - \nabla f(W_{t-2})\|_F + \beta Ln^{\frac{\nu}{2}}\eta_{t-2}^{\nu} + (1-\beta)\frac{2^{\frac{2-p}{p}}\sigma}{b_{t-1}^{(p-1)/p}}\right) + \beta Ln^{\frac{\nu}{2}}\eta_{t-1}^{\nu} + (1-\beta)\frac{2^{\frac{2-p}{p}}\sigma}{b_t^{(p-1)/p}}$$
$$= \beta^2\|M_{t-2} - \nabla f(W_{t-2})\|_F + Ln^{\frac{\nu}{2}}\left(\beta^1\eta_{t-1}^{\nu} + \beta^2\eta_{t-2}^{\nu}\right) + (1-\beta)2^{\frac{2-p}{p}}\sigma\left(\frac{\beta^0}{b_t^{(p-1)/p}} + \frac{\beta^1}{b_{t-1}^{(p-1)/p}}\right)$$
$$\le \cdots \le \beta^t\|M_0 - \nabla f(W_0)\|_F + Ln^{\frac{\nu}{2}}\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + (1-\beta)2^{\frac{2-p}{p}}\sigma\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}.$$
The Cauchy–Schwarz inequality, together with the same technique used to prove (19), ensures that
$$\mathcal{M}_{2,t} \le \sqrt{n}\underbrace{\mathbb{E}_{\xi_t}\!\left[\|M_t - \nabla f(W_t)\|_F \mid \xi_{[t-1]}\right]}_{\mathcal{M}_{3,t}}. \qquad (26)$$
Therefore, we have
$$\mathcal{M}_t = \mathcal{M}_{1,t} + \mathcal{M}_{2,t} \le -\|\nabla f(W_t)\|_F + \sqrt{n}\,\mathcal{M}_{3,t} + \sqrt{n}\,\mathcal{M}_{3,t} \le -\|\nabla f(W_t)\|_F + 2\sqrt{n}\left(\beta^t\|M_0 - \nabla f(W_0)\|_F + Ln^{\frac{\nu}{2}}\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + (1-\beta)2^{\frac{2-p}{p}}\sigma\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}\right).$$

(ii) Applying $W_1 = W_{t+1}$ and $W_2 = W_t$ to (12) and using $W_{t+1} - W_t = \eta_t D_t^{\mathrm{Muon}}$ imply that, for all $t \in \mathbb{N}$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t-1]}\right] \le f(W_t) + \eta_t\underbrace{\mathbb{E}_{\xi_t}\!\left[\nabla f(W_t) \bullet D_t^{\mathrm{Muon}} \mid \xi_{[t-1]}\right]}_{\mathcal{M}_t} + \frac{L\eta_t^{1+\nu}}{1+\nu}\underbrace{\mathbb{E}_{\xi_t}\!\left[\|D_t^{\mathrm{Muon}}\|_F^{1+\nu} \mid \xi_{[t-1]}\right]}_{\mathcal{D}_t}.$$
From (22), we have $\mathcal{D}_t = \mathbb{E}_{\xi_t}[\|O_t\|_F^{1+\nu} \mid \xi_{[t-1]}] = n^{\frac{1+\nu}{2}}$. Accordingly, for all $t \in \{0\}\cup\mathbb{N}$,
$$\mathbb{E}_{\xi_t}\!\left[f(W_{t+1}) \mid \xi_{[t-1]}\right] \le f(W_t) + \eta_t\left(-\|\nabla f(W_t)\|_F + 2\sqrt{n}\left(\beta^t\|M_0 - \nabla f(W_0)\|_F + Ln^{\frac{\nu}{2}}\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + (1-\beta)2^{\frac{2-p}{p}}\sigma\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}\right)\right) + \frac{L\eta_t^{1+\nu}}{1+\nu}n^{\frac{1+\nu}{2}}$$
$$< f(W_t) + 2\sqrt{n}\,\|M_0 - \nabla f(W_0)\|_F\,\eta_t\beta^t + 2Ln^{\frac{1+\nu}{2}}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + 2^{\frac{3-p}{p}}(1-\beta)\sqrt{n}\,\sigma\,\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}} + \frac{Ln^{\frac{1+\nu}{2}}\eta_t^{1+\nu}}{1+\nu}. \qquad (27)$$
This completes the proof.

5.2 Convergence

The following is a convergence analysis of Muon (17) with $\beta \in [0,1)$ under Assumption 2.1. We can check that Theorem 5.1 with $\beta = 0$ coincides with Theorem 4.1.

Theorem 5.1  Let $(W_t)$ be the sequence generated by Muon (17) with $\beta \in [0,1)$ under Assumption 2.1. If $(\eta_t)$ and $(b_t)$ satisfy
$$\sum_{t=0}^{+\infty}\eta_t = +\infty, \quad \sum_{t=0}^{+\infty}\eta_t^{1+\nu} < +\infty, \quad \sum_{t=0}^{+\infty}\eta_t\beta^t < +\infty, \quad \sum_{t=0}^{+\infty}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} < +\infty, \quad \sum_{t=0}^{+\infty}\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}} < +\infty, \qquad (28)$$
then $(\nabla f(W_t))$ converges to $O_{m\times n}$ almost surely in the sense of the limit inferior.

Proof of Theorem 5.1  Inequality (27), (28), and the supermartingale convergence theorem (Bertsekas et al., 2003, Proposition 8.2.10) give
$$\sum_{t=0}^{+\infty}\eta_t\|\nabla f(W_t)\|_F < +\infty \quad \text{a.s.},$$
which, together with $\sum_{t=0}^{+\infty}\eta_t = +\infty$, implies that $\liminf_{t\to+\infty}\|\nabla f(W_t)\|_F = 0$. This completes the proof.

5.3 Convergence Rate

5.3.1 Upper convergence bound

Inequality (27) and a discussion similar to the one proving Theorem 4.2 lead to an upper bound for Muon (17).

Theorem 5.2  Let $(W_t)$ be the sequence generated by Muon (17) with $\beta \in [0,1)$ and $(\eta_t)$ and $(b_t)$ satisfying (28) under Assumption 2.1. Then, the mean of $(W_t)_{t=0}^{T-1}$ satisfies, for all $T \in \mathbb{N}$,
$$\frac{1}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] = O\!\left(\frac{1}{\sum_{t=0}^{T-1}\eta_t}\right)$$
$$\le \frac{C_1}{\sum_{t=0}^{T-1}\eta_t} + \frac{C_2(\nu)}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{+\infty}\eta_t^{1+\nu} + \frac{C_3}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{+\infty}\eta_t\beta^t + \frac{C_4}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{+\infty}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + \frac{C_5(\beta,p,\sigma)}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{+\infty}\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}},$$
where $L := \frac{1}{N}\sum_{i=1}^N L_i$, $f_{\star} \in \mathbb{R}$ is such that, for all $W \in \mathbb{R}^{m\times n}$, $f(W) \ge f_{\star}$, and
$$C_1 := f(W_0) - f_{\star}, \quad C_2(\nu) := \frac{Ln^{\frac{1+\nu}{2}}}{1+\nu}, \quad C_3 := 2\sqrt{n}\,\|M_0 - \nabla f(W_0)\|_F, \quad C_4 := 2Ln^{\frac{1+\nu}{2}}, \quad C_5(\beta,p,\sigma) := 2^{\frac{3-p}{p}}(1-\beta)\sqrt{n}\,\sigma.$$

Proof of Theorem 5.2  Taking the total expectation $\mathbb{E} = \mathbb{E}_t := \mathbb{E}_{\xi_0}\cdots\mathbb{E}_{\xi_t}$ of (27) ensures that, for all $t \in \{0\}\cup\mathbb{N}$,
$$\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] \le \mathbb{E}[f(W_t)] - \mathbb{E}[f(W_{t+1})] + \frac{Ln^{\frac{1+\nu}{2}}\eta_t^{1+\nu}}{1+\nu} + 2\sqrt{n}\,\|M_0 - \nabla f(W_0)\|_F\,\eta_t\beta^t + 2Ln^{\frac{1+\nu}{2}}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + 2^{\frac{3-p}{p}}(1-\beta)\sqrt{n}\,\sigma\,\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}.$$
Summing the above inequality from $t = 0$ to $t = T-1$, where $T \in \mathbb{N}$, implies that
$$\sum_{t=0}^{T-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] \le f(W_0) - f_{\star} + \frac{Ln^{\frac{1+\nu}{2}}}{1+\nu}\sum_{t=0}^{T-1}\eta_t^{1+\nu} + 2\sqrt{n}\,\|M_0 - \nabla f(W_0)\|_F\sum_{t=0}^{T-1}\eta_t\beta^t + 2Ln^{\frac{1+\nu}{2}}\sum_{t=0}^{T-1}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} + 2^{\frac{3-p}{p}}(1-\beta)\sqrt{n}\,\sigma\sum_{t=0}^{T-1}\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}}.$$
Dividing the above inequality by $\sum_{t=0}^{T-1}\eta_t$ leads to the assertion of Theorem 5.2.

5.3.2 Lower convergence bound

The following gives a lower convergence bound of Muon (17) with $\beta \in [0,1)$. The proof of Theorem 5.3 follows that of Theorem 4.3.

Theorem 5.3  Let $(W_t)$ be the sequence generated by Muon (17) with $\beta \in [0,1)$ and $(\eta_t)$ and $(b_t)$ satisfying (28) under Assumptions 2.1 and 3.1. Then, the mean of $(W_t)_{t=t_2}^{T+t_2-1}$ satisfies, for all $T \in \mathbb{N}$,
$$\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\sum_{t=t_2}^{T+t_2-1}\eta_t\,\mathbb{E}\!\left[\|\nabla f(W_t)\|_F\right] = \Omega\!\left(\frac{1}{\sum_{t=t_2}^{T+t_2-1}\eta_t}\right) \ge \frac{C_6}{\sum_{t=t_2}^{T+t_2-1}\eta_t},$$
where $t_2 \in \mathbb{N}$ is such that $C_6 := \frac{1}{\sqrt{n}}\inf\{\mathbb{E}[f(W_{t_2})] - \mathbb{E}[f(W_t)] : t \ge t_2\} \ge 0$.
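Putting Section 5 together, one step of Muon (17) with momentum can be sketched as follows. This is an illustrative implementation, not the reference one: it computes $O_t$ by exact SVD rather than Newton–Schulz, and the toy run minimizes the deterministic quadratic $f(W) = \frac{1}{2}\|W-A\|_F^2$ (so the heavy-tailed-noise aspect of the analysis is not exercised); the matrices, $\beta$, and the step-size schedule are arbitrary illustrative choices.

```python
import numpy as np

def muon_update(W, M_prev, grad, eta, beta):
    """One step of Muon (17): M_t = beta * M_{t-1} + (1 - beta) * G_t,
    O_t = U_t V_t^T from the SVD of M_t, and W_{t+1} = W_t - eta_t * O_t."""
    M = beta * M_prev + (1.0 - beta) * grad
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return W - eta * (U @ Vt), M

rng = np.random.default_rng(0)
m, n = 4, 3
A = rng.standard_normal((m, n))              # minimizer of f(W) = 0.5 * ||W - A||_F^2
W = rng.standard_normal((m, n))              # W_0
M = np.zeros((m, n))                         # M_{-1}
beta = 0.9

init_dist = np.linalg.norm(W - A, "fro")
for t in range(400):
    eta = 0.5 / (t + 1) ** 0.5               # diminishing step size
    W, M = muon_update(W, M, W - A, eta, beta)   # grad f(W) = W - A
final_dist = np.linalg.norm(W - A, "fro")
print(init_dist, final_dist)
```

Since $\|O_t\|_F = \sqrt{n}$ regardless of the gradient magnitude, the iterate cannot settle exactly at $A$; the diminishing $\eta_t$ is what shrinks the residual oscillation, mirroring the role of $\sum \eta_t^{1+\nu} < +\infty$ in (28).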
6 Conclusion

This paper considered nonconvex Hölder-smooth ERM under a boundedness condition on the $p$-variance of the stochastic gradient that accounts for heavy-tailed stochastic noise. We showed that Muon converges almost surely to appropriate points faster than mini-batch SGD. Our convergence proof indicated that this faster convergence of Muon depends strongly on the search direction defined by the point on the Stiefel manifold closest to the mini-batch gradient.

Appendix A. Examples of $(\eta_t)$ and $(b_t)$

A.1 Examples of $(\eta_t)$ and $(b_t)$ satisfying (16) and (24)

Let $\eta > 0$, $a \in (0,1]$, and $\eta_t := \frac{\eta}{(t+1)^a}$ ($t \in \{0\}\cup\mathbb{N}$). We have that, for all $T \in \mathbb{N}$,
$$\sum_{t=0}^{T-1}\eta_t \ge \eta\int_{0}^{T}\frac{\mathrm{d}t}{(t+1)^a} = \begin{cases}\frac{\eta}{1-a}\left\{(T+1)^{1-a}-1\right\} & (a \in (0,1)),\\[2pt] \eta\log(T+1) & (a = 1).\end{cases}$$
We also have
$$\sum_{t=0}^{T-1}\eta_t^{1+\nu} \le \eta^{1+\nu}\left(1+\int_{0}^{T-1}\frac{\mathrm{d}t}{(t+1)^{(1+\nu)a}}\right) \le \begin{cases}\frac{\eta^{1+\nu}}{1-(1+\nu)a}T^{1-(1+\nu)a} & ((1+\nu)a < 1),\\[2pt] \eta^{1+\nu}(1+\log T) & ((1+\nu)a = 1),\\[2pt] \frac{(1+\nu)a\,\eta^{1+\nu}}{(1+\nu)a-1} & (1 < (1+\nu)a).\end{cases}$$
Let $b \in \mathbb{N}$, $\delta > 1$, and $b_t := b\delta^t$ ($t \in \{0\}\cup\mathbb{N}$). Then,
$$\sum_{t=0}^{T-1}\frac{\eta_t^{1+\nu}}{b_t^{p-1}} \le \frac{\eta^{1+\nu}}{b^{p-1}}\sum_{t=0}^{T-1}\frac{1}{\delta^{(p-1)t}} \le \frac{\eta^{1+\nu}\delta^{p-1}}{b^{p-1}(\delta^{p-1}-1)}$$
and
$$\sum_{t=0}^{T-1}\frac{\eta_t}{b_t^{(p-1)/p}} \le \frac{\eta}{b^{(p-1)/p}}\sum_{t=0}^{T-1}\frac{1}{\delta^{\frac{p-1}{p}t}} \le \frac{\eta\,\delta^{\frac{p-1}{p}}}{b^{\frac{p-1}{p}}\left(\delta^{\frac{p-1}{p}}-1\right)}.$$

A.2 Examples of $(\eta_t)$ and $(b_t)$ satisfying (28)

Let $(\eta_t)$ and $(b_t)$ be the sequences defined in the above subsection, and $\beta \in [0,1)$. Then,
$$\sum_{t=0}^{T-1}\eta_t\beta^t \le \eta\sum_{t=0}^{T-1}\beta^t \le \frac{\eta}{1-\beta}.$$
Moreover,
$$\sum_{t=0}^{T-1}\eta_t\sum_{i=1}^{t}\beta^i\eta_{t-i}^{\nu} \le \eta^{1+\nu}\sum_{t=0}^{T-1}\sum_{i=1}^{t}\beta^i = \eta^{1+\nu}\sum_{t=0}^{T-1}\frac{\beta(1-\beta^t)}{1-\beta} \le \frac{\eta^{1+\nu}}{1-\beta}$$
and
$$\sum_{t=0}^{T-1}\eta_t\sum_{i=0}^{t}\frac{\beta^i}{b_{t-i}^{(p-1)/p}} \le \frac{\eta}{b^{(p-1)/p}}\sum_{t=0}^{T-1}\sum_{i=0}^{t}\frac{\beta^i}{\delta^{\frac{p-1}{p}(t-i)}} \le \frac{\eta\,\delta^{\frac{p-1}{p}}}{b^{\frac{p-1}{p}}(1-\beta)\left(\delta^{\frac{p-1}{p}}-1\right)}.$$

References

Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, and Suvrit Sra.
Linear attention is (maybe) all you need (to understand transformer optimization). In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=0uI5415ry7.

Barak Battash, Lior Wolf, and Ofir Lindenbaum. Revisiting the noise model of stochastic gradient descent. In Sanjoy Dasgupta, Stephan Mandt, and Yingzhen Li, editors, Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 of Proceedings of Machine Learning Research, pages 4780–4788. PMLR, 02–04 May 2024. URL https://proceedings.mlr.press/v238/battash24a.html.

Amir Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017.

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. In Proceedings of the OPT 2024: Workshop on Optimization for Machine Learning, 2024. Workshop paper.

Dimitri P. Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex Analysis and Optimization. Athena Scientific, Cambridge, MA, 2003.

Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. In Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 4883–4895, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/26901debb30ea03f0aa833c9de6b81e9-Abstract.html.

Ilyas Fatkhullin, Florian Hübler, and Guanghui Lan. Can SGD handle heavy-tailed noise? In OPT 2025: Optimization for Machine Learning, 2025. URL https://openreview.net/forum?id=raN3EfA42K.
Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, Zico Kolter, Zachary Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, and Pradeep Ravikumar. On proximal policy optimization's heavy-tailed gradients. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3610–3619. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/garg21b.html.

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods, 2024.

Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22:1469–1492, 2012.

Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23:2061–2089, 2013.

Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15042–15053. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/abd1c782880cc59759f4112fda0b8f98-Paper.pdf.

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/gupta18a.html.

Liam Hodgkinson and Michael Mahoney.
Multiplicative noise and heavy tails in stochastic optimization. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4262–4274. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/hodgkinson21a.html.

Otto Ludwig Hölder. Beiträge zur Potentialtheorie. J. B. Metzlersche Buchdruckerei, Stuttgart, 1882. Inaugural-Dissertation zur Erlangung der Doctorwürde der naturwissenschaftlichen Facultät zu Tübingen.

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of The International Conference on Learning Representations, 2015.

Langqi Liu, Yibo Wang, and Lijun Zhang. High-probability bound for non-smooth non-convex stochastic optimization with heavy tails. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 32122–32138. PMLR, 21–27 Jul 2024. URL https://proceedings.mlr.press/v235/liu24bo.html.

Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=NKotdPUc3L.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of The International Conference on Learning Representations, 2019.
Shuntaro Nagashima and Hideaki Iiduka. Improved convergence rates of Muon optimizer for nonconvex optimization, 2026.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19:1574–1609, 2009.

Yu Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381–404, 2015. doi: 10.1007/s10107-014-0790-0. URL https://doi.org/10.1007/s10107-014-0790-0.

Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. Improved convergence in high probability of clipped gradient methods with heavy tailed noise. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=h1FhXVM0cB.

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. In Forty-second International Conference on Machine Learning, 2025a. URL https://openreview.net/forum?id=2Oqm2IzTy9.

Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, and Volkan Cevher. Generalized gradient norm clipping & non-Euclidean $(L_0, L_1)$-smoothness. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b. URL https://openreview.net/forum?id=rMdf8jhLR7.

Herbert Robbins and Herbert Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.

Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance.
In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 29563–29648. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/sadiev23a.html.

Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka. Convergence bound and critical batch size of Muon optimizer, 2025.

Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5827–5837. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/simsekli19a.html.

Xuan Tang, Jichu Li, and Difan Zou. A convergence analysis of adaptive optimizers under floating-point quantization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=wwP1SCACee.

Hikaru Umeda and Hideaki Iiduka. Increasing both batch size and learning rate accelerates stochastic gradient descent. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=sbmp55k6iE.

Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, volume 32, 2019.

Ryusei Yamada, Naoki Sato, and Hideaki Iiduka. Vanilla SGD with momentum survives heavy-tailed noise: Convergence analysis without gradient clipping or normalization. 2026.

Maryam Yashtini.
On the global convergence rate of the gradient descent method for functions with Hölder continuous gradients. Optimization Letters, 10(6):1361–1370, 2016. doi: 10.1007/s11590-015-0936-x. URL https://doi.org/10.1007/s11590-015-0936-x.

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/b05b57f6add810d3b7490866d74c0053-Abstract.html.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936, 2003.