Policy Gradient Algorithms in Average-Reward Multichain MDPs
Authors: Jongmin Lee, Ernest K. Ryu
Preprint, pp. 1–35, 2026

Abstract: While there is an extensive body of research analyzing policy gradient methods for discounted cumulative-reward MDPs, prior work on policy gradient methods for average-reward MDPs has been limited, with most existing results restricted to ergodic or unichain settings. In this work, we first establish a policy gradient theorem for average-reward multichain MDPs based on the invariance of the classification of recurrent and transient states. Building on this foundation, we develop refined analyses and obtain a collection of convergence and sample-complexity results that advance the understanding of this setting. In particular, we show that the proposed $\alpha$-clipped policy mirror ascent algorithm attains an $\epsilon$-optimal policy with respect to positive policies.

Keywords: average-reward multichain MDPs, visitation measure, recurrent and transient states, policy gradient theorem, policy mirror ascent, convergence and sample complexity analysis

1. Introduction

Average-reward Markov decision processes (MDPs) provide a fundamental framework for modeling sequential decision-making problems in which the objective is to maximize long-term, steady-state performance. In the dynamic programming and reinforcement learning literature, both value-based and policy-based methods have been extensively studied in the average-reward setting. Surprisingly, however, since the seminal Policy Gradient Theorem of Sutton et al. (1999), a policy gradient theorem for general multichain MDPs has not been established. Existing analyses of policy gradient methods under the average-reward criterion are restricted to ergodic or unichain MDPs, leaving the convergence of policy gradient methods in multichain MDPs open.

Contribution.
In this work, we study the convergence of policy gradient methods for average-reward multichain MDPs. Using the notion of recurrent-transient classification of states, we first establish a policy gradient theorem for multichain MDPs by introducing new visitation measures, which we term the recurrent visitation measure and the transient visitation measure. Building on this framework, we analyze the convergence of $\alpha$-clipped policy mirror ascent in the tabular setting and further extend the analysis to the generative model setting. Notably, we establish $\epsilon$-optimality guarantees with respect to positive policies for general multichain MDPs, in both the tabular and generative model settings.

1.1. Prior works

Average-reward MDPs. The setup of average-reward MDPs was first introduced in the dynamic programming literature by Howard (1960), and Blackwell (1962) established a theoretical framework for their analysis. In reinforcement learning (RL), average-reward MDPs were mainly considered in the sample-based setup where the transition matrix and reward are unknown (Mahadevan, 1996; Dewanto et al., 2020). For this setup, various methods were proposed: model-based methods (Jin and Sidford, 2021; Tuynman et al., 2024; Zurek and Chen, 2024), value-based methods (Wan et al., 2024; Bravo and Cominetti, 2024; Chen, 2025), and policy gradient methods (Bai et al., 2024; Murthy and Srikant, 2023; Kumar et al., 2024). On the theoretical side, the sample complexity of obtaining an $\epsilon$-optimal policy under a generative model (Wang, 2017; Zhang and Xie, 2023; Li et al., 2024; Jin et al., 2024; Lee et al., 2025) and the regret minimization framework (Burnetas and Katehakis, 1997; Jaksch et al., 2010; Zhang and Ji, 2019; Boone and Zhang, 2024) have been actively studied.

(1: Seoul National University, 2: UCLA)

Policy gradient methods. Policy gradient methods (Williams, 1992; Sutton et al.
, 1999; Konda and Tsitsiklis, 1999; Kakade, 2001) are foundational reinforcement learning algorithms, commonly implemented with deep neural networks for policy parameterization (Schulman et al., 2015, 2017). In line with their practical success, the convergence and sample complexity of policy gradient variants have been extensively studied across settings (Shani et al., 2020; Mei et al., 2020; Agarwal et al., 2021; Cen et al., 2022; Xiao, 2022; Bhandari and Russo, 2024).

For the average-reward MDP, Sutton et al. (1999); Marbach and Tsitsiklis (2001); Baxter and Bartlett (2000) establish the policy gradient theorem in unichain MDPs. More recently, Murthy and Srikant (2023) establishes global convergence of natural policy gradient under an irreducibility assumption on the MDP, and Kumar et al. (2024) provides a global convergence analysis of projected policy gradient in tabular ergodic MDPs. Regret guarantees for general parameterized policy gradient methods were studied by Bai et al. (2024), and subsequently Ganesh et al. (2025) improved the sample complexity via a variance reduction technique. In the unichain MDP, Ganesh and Aggarwal (2025) analyzes a batched natural actor-critic algorithm, and Li et al. (2024) establishes sample-complexity guarantees for both generative and Markovian sampling models under a mixing-time assumption. In multichain and weakly communicating MDPs, however, there are no existing prior results on policy gradient methods. To the best of our knowledge, a policy gradient theorem has not been established, and only model-based and value-based methods exist (Wei et al., 2020; Zurek and Chen, 2024; Lee and Ryu, 2025a).

1.2. Preliminaries and notations

Average-reward MDP. Let $\mathcal{M}(X)$ be the space of probability distributions over a set $X$.
Write $(\mathcal{S}, \mathcal{A}, P, r)$ to denote the infinite-horizon undiscounted MDP with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, transition matrix $P : \mathcal{S} \times \mathcal{A} \to \mathcal{M}(\mathcal{S})$, and bounded reward $r : \mathcal{S} \times \mathcal{A} \to [-R, R]$. Denote $\pi : \mathcal{S} \to \mathcal{M}(\mathcal{A})$ for a policy,

$$J^\pi(s) = \liminf_{H \to \infty} \frac{1}{H} \mathbb{E}^\pi \left[ \sum_{h=0}^{H-1} r(s_h, a_h) \,\middle|\, s_0 = s \right], \qquad K^\pi(s,a) = \liminf_{H \to \infty} \frac{1}{H} \mathbb{E}^\pi \left[ \sum_{h=0}^{H-1} r(s_h, a_h) \,\middle|\, s_0 = s,\, a_0 = a \right]$$

for average rewards of a given policy, and

$$V^\pi(s) = \lim_{H \to \infty} \frac{1}{H} \sum_{h=1}^{H} \mathbb{E}^\pi \left[ \sum_{i=0}^{h-1} \big( r^\pi(s_i) - J^\pi(s_i) \big) \,\middle|\, s_0 = s \right],$$

$$Q^\pi(s,a) = \lim_{H \to \infty} \frac{1}{H} \sum_{h=1}^{H} \mathbb{E}^\pi \left[ \sum_{i=0}^{h-1} \big( r(s_i, a_i) - K^\pi(s_i, a_i) \big) \,\middle|\, s_0 = s,\, a_0 = a \right]$$

for state and state-action relative value (under an aperiodicity assumption, averaging with respect to $H$ can be omitted in the definition (Puterman, 2014, Section 8.2)), where $\mathbb{E}^\pi$ denotes the expected value over all trajectories $(s_0, a_0, s_1, a_1, \dots, s_{H-1}, a_{H-1})$ induced by $P$ and $\pi$, and $r^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}[r(s,a)]$ is the reward induced by policy $\pi$. Given $\mu \in \mathcal{M}(\mathcal{S})$, define $J^\pi_\mu = \mathbb{E}_{s_0 \sim \mu}[J^\pi(s_0)]$.

[Figure 1: The classification of MDPs. (ergodic ⊂ unichain ⊂ weakly communicating ⊂ multichain = general)]

We say $J^\star$ is the optimal average reward if $J^\star = \max_\pi J^\pi$; the optimal average reward always exists in multichain MDPs (Puterman, 2014, Section 9.1). Denote $P^\pi(s, s') = \mathrm{Prob}(s \to s' \mid a \sim \pi(\cdot|s),\, s' \sim P(\cdot|s,a))$ for the transition matrix induced by policy $\pi$. We denote

$$P_\star = \lim_{H \to \infty} \frac{1}{H} \sum_{i=0}^{H-1} P^i$$

for the Cesàro limit of a stochastic matrix $P$. (The Cesàro limit of a stochastic matrix always exists (Puterman, 2014, Theorem A.6).) Then, by definition, we can write (Puterman, 2014, Section 8.2)

$$J^\pi = P^\pi_\star r^\pi, \qquad V^\pi = (I - P^\pi + P^\pi_\star)^{-1}(I - P^\pi_\star) r^\pi, \qquad K^\pi = P J^\pi, \qquad Q^\pi = r + P V^\pi - K^\pi.$$
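These identities can be checked numerically on a small chain. The sketch below (our own toy matrix and rewards, not an example from the paper) approximates the Cesàro limit $P_\star$ by truncated averaging and computes $J^\pi = P^\pi_\star r^\pi$:

```python
import numpy as np

def cesaro_limit(P, H=5000):
    """Approximate P* = lim_H (1/H) sum_{i<H} P^i by a truncated average."""
    n = P.shape[0]
    acc = np.zeros_like(P)
    Pi = np.eye(n)
    for _ in range(H):
        acc += Pi
        Pi = Pi @ P
    return acc / H

# Toy chain: states {0,1} form a periodic recurrent class, state 2 is transient.
# The plain limit of P^i does not exist here (period 2), but the Cesaro average does.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.3, 0.3, 0.4]])
r_pi = np.array([1.0, 0.0, 5.0])   # reward induced by the (implicit) policy
P_star = cesaro_limit(P_pi)
J = P_star @ r_pi                  # average reward per starting state
```

Here every row of $P_\star$ concentrates on the recurrent class $\{0,1\}$ with its stationary distribution $(0.5, 0.5)$, so the transient state's large reward never matters asymptotically and $J \approx (0.5, 0.5, 0.5)$ up to $O(1/H)$ truncation error.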
Lastly, for every policy $\pi$, it is known that $V^\pi$ and $J^\pi$ satisfy the following Bellman equations (Puterman, 2014, Theorem 8.2.6):

$$P^\pi J^\pi = J^\pi, \qquad r^\pi + P^\pi V^\pi = J^\pi + V^\pi.$$

Classification of MDPs. MDPs are classified as follows by the structure of their transition matrices. (For definitions of basic concepts of MDPs such as irreducible class, communicating class, accessibility, etc., please refer to Puterman (2014, Appendix A.2) and Brémaud (2013, Section 3).)

An MDP is ergodic if the transition matrices induced by every policy $\pi$ have a single recurrent class and are aperiodic. An MDP is unichain if the transition matrix corresponding to every deterministic policy consists of a single irreducible recurrent class plus a possibly empty set of transient states. An MDP is weakly communicating if there exists a closed set of states where each state in that set is accessible from every other state in that set under some deterministic policy, plus a possibly empty set of states which is transient under every policy. An MDP is multichain if the transition matrix corresponding to any deterministic policy contains one or more irreducible recurrent classes. An MDP is unichain if it is ergodic, weakly communicating if it is unichain, and multichain if it is weakly communicating. Since every MDP is multichain, we use the expressions multichain and general interchangeably.

If the MDP is unichain, for every policy $\pi$, there exists a unique stationary distribution $g^\pi \in \mathcal{M}(\mathcal{S})$ satisfying $(g^\pi)^\top P^\pi = g^\pi$ (Puterman, 2014, Theorem 8.3.2). If the MDP is ergodic, for every policy $\pi$, $g^\pi$ is strictly positive (Levin and Peres, 2017, Proposition 1.19). If the MDP is unichain, for every policy $\pi$, $J^\pi$ is a uniform constant vector, i.e., $J^\pi = c\mathbf{1}$ for some $c \in \mathbb{R}$, where $\mathbf{1} \in \mathbb{R}^n$ is the all-ones vector. In the next section, we discuss the continuity of the mapping $\pi \mapsto J^\pi$ in multichain MDPs.
Since $|\mathcal{S}|$ and $|\mathcal{A}|$ are finite, we can identify $\pi$ and $J^\pi$ as finite-dimensional vectors, namely, as $\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ and $J^\pi \in \mathbb{R}^{|\mathcal{S}|}$. Therefore, continuity of $\pi \mapsto J^\pi$ can be interpreted as continuity of the mapping from $\mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ to $\mathbb{R}^{|\mathcal{S}|}$ under the usual metric. We define the classes of policies

$$\Pi = \text{set of all policies} = \mathcal{M}(\mathcal{A})^{\mathcal{S}}, \qquad \Pi_+ = \{\pi \in \Pi \mid \pi(a|s) > 0 \text{ for all } s, a\},$$

so that $\Pi_+$ is the (relative) interior of $\Pi$. Also define $\mathcal{M}_+(X)$ as the space of probability distributions with full support over a set $X$.

2. Recurrent-transient theory of policy gradients

In this section, we apply the recurrent-transient theory of Markov chains to the average-reward multichain MDP setting and introduce the notions of the recurrent visitation measure and the transient visitation measure. Building on these concepts, we establish a policy gradient theorem for average-reward multichain MDPs.

2.1. Recurrent-transient classification of states

Definition 1. Given a policy $\pi \in \Pi$, a state $s \in \mathcal{S}$ is recurrent if its return time starting from $s$ is finite with probability 1. Otherwise, $s$ is transient.

Equivalently, if $n_s$ is the random variable representing the number of visits to state $s$ starting from $s$, then $s$ is recurrent if and only if $\mathbb{E}^\pi[n_s] = \sum_{k=0}^{\infty} (P^\pi)^k(s,s) = \infty$, and otherwise it is transient (Brémaud, 2013, Theorem 3.1.3).

Let $\pi \in \Pi$. For a given $P^\pi$, the states can be classified into recurrent and transient states, and the Markov chain can be canonically represented as follows (Puterman, 2014, Appendix A.2):

$$P^\pi = \begin{pmatrix} R^\pi_1 & 0 & \cdots & 0 & 0 \\ 0 & R^\pi_2 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & R^\pi_m & 0 \\ S^\pi_1 & S^\pi_2 & \cdots & S^\pi_m & T^\pi \end{pmatrix}, \qquad P^\pi_\star = \begin{pmatrix} R^\pi_{1,\star} & 0 & \cdots & 0 & 0 \\ 0 & R^\pi_{2,\star} & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & R^\pi_{m,\star} & 0 \\ S^\pi_{1,\star} & S^\pi_{2,\star} & \cdots & S^\pi_{m,\star} & 0 \end{pmatrix},$$

where $R^\pi_1, \dots, R^\pi_m$, $T^\pi$, and $S^\pi_1, \dots, S^\pi_m$ represent transition probabilities among the recurrent states, among the transient states, and from transient to recurrent states, respectively.
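For a small chain, the recurrent-transient classification underlying this canonical form can be computed from reachability alone: a state is recurrent if and only if every state reachable from it can reach it back. The following sketch (helper names and the toy matrix are ours, not the paper's) recovers the recurrent classes and transient set of a given $P^\pi$:

```python
import numpy as np

def classify_states(P, tol=1e-12):
    """Split states of a stochastic matrix P (= P^pi) into recurrent classes
    and transient states via mutual reachability."""
    n = P.shape[0]
    reach = (P > tol) | np.eye(n, dtype=bool)
    for _ in range(n):                  # transitive closure by repeated squaring
        reach = (reach.astype(int) @ reach.astype(int)) > 0
    # s is recurrent iff every state reachable from s can reach s back
    recurrent = [s for s in range(n)
                 if all(reach[t, s] for t in range(n) if reach[s, t])]
    transient = [s for s in range(n) if s not in recurrent]
    classes, seen = [], set()
    for s in recurrent:                 # group by mutual reachability
        if s not in seen:
            cls = sorted(t for t in recurrent if reach[s, t] and reach[t, s])
            seen.update(cls)
            classes.append(cls)
    return classes, transient

# Two recurrent classes {0,1} and {2}; state 3 is transient (so m = 2 here).
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.25, 0.25, 0.25, 0.25]])
classes, transient = classify_states(P)
```

On this example, `classes` is `[[0, 1], [2]]` and `transient` is `[3]`, matching the block structure of the canonical representation above.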
(Note that a multichain MDP can have $m \geq 1$ recurrent classes, whereas a unichain MDP must have a single recurrent class.) $P^\pi_\star$ is the Cesàro limit of $P^\pi$, and it is known that $R^\pi_{i,\star} = \mathbf{1}(g^\pi_i)^\top$, where $g^\pi_i$ is the unique stationary distribution of the probability matrix $R^\pi_i$, and $S^\pi_{i,\star} = (I - T^\pi)^{-1} S^\pi_i R^\pi_{i,\star}$ for $i = 1, \dots, m$ (Puterman, 2014, Appendix A.4).

Such canonical representations exist for any $\pi \in \Pi$, but the recurrent-transient classification of states and the structure of the corresponding canonical decomposition of $P^\pi$ may vary as a function of $\pi \in \Pi$. However, the recurrent-transient classification remains invariant for all $\pi \in \Pi_+$.

Fact 1 (Lee and Ryu, 2025b, Proposition 1). The recurrent-transient classification of the states does not depend on the choice of $\pi \in \Pi_+$. More specifically, the state set $\mathcal{S}$ decomposes into the transient states $\mathcal{T}$ and $m$ recurrent classes $\mathcal{R}_1, \dots, \mathcal{R}_m$ for some $m \in \mathbb{N}$, and this decomposition is the same across all $\pi \in \Pi_+$.

We further clarify. For any $\pi \in \Pi_+$, the recurrent-transient classification is determined by the transition kernel $P$, not by the particular choice of $\pi \in \Pi_+$. While a policy $\pi \in \Pi \setminus \Pi_+$ (a policy that assigns zero probability to some actions) may induce a different classification, the algorithms we consider (as well as deep RL algorithms employing a softmax output layer) search over $\Pi_+$ rather than the full set $\Pi$. By Fact 1, we therefore fix, for any $\pi \in \Pi_+$, the collection of recurrent classes $\bigcup_{i=1}^m \mathcal{R}_i \subset \mathcal{S}$ and the transient set $\mathcal{T} \subset \mathcal{S}$, where $m$ denotes the number of recurrent classes. We also write $\mathcal{R} := \bigcup_{i=1}^m \mathcal{R}_i$ for the set of all recurrent states.

2.2.
Continuity of $J^\pi$ on $\Pi_+$

For multichain MDPs, the average reward $J^\pi(s)$ can be a discontinuous function of the policy $\pi \in \Pi$, as the counterexample in (Schweitzer, 1968, Section 8) demonstrates. However, we do have continuity on $\Pi_+$.

Lemma 2. In a multichain MDP, the mappings $\pi \mapsto J^\pi$ and $\pi \mapsto J^\pi_\mu$ (where $J^\pi_\mu = \mathbb{E}_{s \sim \mu}[J^\pi(s)]$) are continuous on $\Pi_+$ for any fixed $\mu \in \mathcal{M}(\mathcal{S})$.

In other words, we have continuity on the (relative) interior of $\Pi$, and the discontinuity described in (Schweitzer, 1968, Section 8) can arise only on the (relative) boundary of $\Pi$. In our policy gradient methods, we restrict the search to policies $\pi \in \Pi_+$. Next, define

$$J^\star_{+,\mu} = \sup_{\pi \in \Pi_+} \mathbb{E}_{s \sim \mu}[J^\pi(s)], \qquad J^\star_\mu = \max_{\pi \in \Pi} \mathbb{E}_{s \sim \mu}[J^\pi(s)],$$

where $\mu$ is an initial state distribution. By definition, $J^\star_\mu \geq J^\star_{+,\mu}$. Since our policy gradient methods search over $\Pi_+$, they should be thought of as optimizing for $J^\star_{+,\mu}$. Thus, given $\mu \in \mathcal{M}(\mathcal{S})$, we say a policy $\pi$ is an $\epsilon$-optimal policy with respect to positive policies if $J^\star_{+,\mu} - J^\pi_\mu \leq \epsilon$.

2.3. Visitation measures

In prior works, the state visitation measure for average-reward unichain MDPs under a policy $\pi$ is defined as the stationary distribution, which is independent of the initial state $s_0$. In contrast, for multichain MDPs, the stationary distribution generally depends on the starting state, and it might not be unique. To address this, we introduce new visitation measures for the multichain setting. First, define the recurrent visitation measure with respect to policy $\pi$ and starting state $s_0$ as

$$d^\pi_{s_0}(s) = \lim_{H \to \infty} \frac{1}{H} \sum_{h=0}^{H-1} \mathrm{Prob}(s_h = s \mid s_0; P^\pi) = e_{s_0}^\top P^\pi_\star e_s,$$

where $e_s$ is the $s$-th unit vector. Likewise, define the recurrent visitation measure with respect to policy $\pi$ and starting state distribution $\mu$ as $d^\pi_\mu = \mathbb{E}_{s_0 \sim \mu}[d^\pi_{s_0}]$. Note that, for all transient states $s$, we have $d^\pi_{s_0}(s) = d^\pi_\mu(s) = 0$.
Next, following (Lee and Ryu, 2025b), define the transient matrix

$$\bar{T}^\pi = \begin{pmatrix} 0 & 0 \\ 0 & T^\pi \end{pmatrix}, \quad \text{i.e.,} \quad \bar{T}^\pi(s_1, s_2) = \begin{cases} P^\pi(s_1, s_2) & \text{if } s_1, s_2 \text{ are both transient}, \\ 0 & \text{otherwise}. \end{cases}$$

Fact 2 (Berman and Plemmons, 1994, Lemma 8.3.20). The spectral radius of $\bar{T}^\pi$ is strictly less than 1.

By Fact 2 and the classical Neumann series argument, we have $(I - \bar{T}^\pi)^{-1} = \sum_{i=0}^{\infty} (\bar{T}^\pi)^i$. Now, we define the transient visitation measure with respect to policy $\pi$ and starting state $s_0$ as

$$\delta^\pi_{s_0}(s) = \sum_{h=0}^{\infty} \mathrm{Prob}(s_h = s \mid s_0; \bar{T}^\pi) = e_{s_0}^\top (I - \bar{T}^\pi)^{-1} e_s,$$

and the transient visitation measure with respect to policy $\pi$ and starting state distribution $\mu$ as $\delta^\pi_\mu = \mathbb{E}_{s_0 \sim \mu}[\delta^\pi_{s_0}]$. We point out that this transient visitation measure is not a probability measure. Also, importantly, $\max_{s, s_0 \in \mathcal{S}} \delta^\pi_{s_0}(s) < \infty$ by Fact 2.

With these recurrent and transient visitation measures, we can obtain the following performance difference lemma in the average-reward multichain MDP setup, which will be crucially used in the analysis of policy gradient algorithms.

Lemma 3 (Performance difference lemma). Consider a multichain MDP. For $\pi, \pi' \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$,

$$J^\pi_\mu - J^{\pi'}_\mu = \sum_{\substack{s \in \mathcal{R} \\ a \in \mathcal{A}}} d^\pi_\mu(s) \big( \pi(a|s) - \pi'(a|s) \big) Q^{\pi'}(s,a) + \sum_{\substack{s \in \mathcal{T} \\ a \in \mathcal{A}}} \delta^\pi_\mu(s) \big( \pi(a|s) - \pi'(a|s) \big) K^{\pi'}(s,a).$$

We note that Lemma 3 can be viewed as a generalization of the following performance difference lemma for the unichain setup.

Corollary 4 (Even-Dar et al., 2009, Lemma 4.1). Consider a unichain MDP. For $\pi, \pi' \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$,

$$J^\pi_\mu - J^{\pi'}_\mu = \sum_{s \in \mathcal{R}} \sum_{a \in \mathcal{A}} d^\pi_\mu(s) \big( \pi(a|s) - \pi'(a|s) \big) Q^{\pi'}(s,a).$$

Proof. For any $\pi' \in \Pi_+$, the unichain assumption implies $K^{\pi'} = P J^{\pi'} = c^{\pi'} \mathbf{1}$ for some $c^{\pi'} \in \mathbb{R}$ (Puterman, 2014, Theorem 8.3.2). Therefore, the second term in Lemma 3 vanishes.

2.4.
Policy gradient theorem

We are now ready to present the policy gradient theorem in the average-reward setup. Consider the optimization problem $\max_{\theta \in \Theta} J^{\pi_\theta}_\mu$, where $\{\pi_\theta \mid \theta \in \Theta \subset \mathbb{R}^d\}$ is a set of parametric policies differentiable with respect to $\theta$. Based on the previous machinery, we establish the following policy gradient theorem.

Theorem 5 (Recurrent and transient policy gradient). Consider a multichain MDP. For $\pi_\theta \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$,

$$\nabla_\theta J^{\pi_\theta}_\mu = \sum_{s \in \mathcal{R}} \sum_{a \in \mathcal{A}} d^{\pi_\theta}_\mu(s)\, \nabla_\theta \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a) + \sum_{s \in \mathcal{T}} \sum_{a \in \mathcal{A}} \delta^{\pi_\theta}_\mu(s)\, \nabla_\theta \pi_\theta(a|s)\, K^{\pi_\theta}(s,a).$$

In Theorem 5, the first term involves a sum over recurrent states, while the second term sums over transient states. Viewed in this way, Theorem 5 naturally generalizes the classical policy gradient theorem for unichain MDPs.

Corollary 6 (Recurrent policy gradient, Sutton et al. (1999)). Consider a unichain MDP. For $\pi_\theta \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$,

$$\nabla_\theta J^{\pi_\theta}_\mu = \sum_{s \in \mathcal{R}} \sum_{a \in \mathcal{A}} d^{\pi_\theta}_\mu(s)\, \nabla_\theta \pi_\theta(a|s)\, Q^{\pi_\theta}(s,a).$$

Proof. For any $\pi \in \Pi_+$, the unichain assumption implies $K^\pi = P J^\pi = c^\pi \mathbf{1}$ for some $c^\pi \in \mathbb{R}$ (Puterman, 2014, Theorem 8.3.2). By the product rule,

$$\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a|s) K^{\pi_\theta}(s,a) = \nabla_\theta \Big( \sum_{a \in \mathcal{A}} \pi_\theta(a|s) K^{\pi_\theta}(s,a) \Big) - \sum_{a \in \mathcal{A}} \pi_\theta(a|s) \nabla_\theta K^{\pi_\theta}(s,a) = \nabla_\theta (c^{\pi_\theta}) - \nabla_\theta (c^{\pi_\theta}) = 0.$$

Therefore, the second term in Theorem 5 vanishes.

We also note that for unichain MDPs, $d^\pi_\mu = g^\pi$, where $g^\pi$ is the stationary distribution of $P^\pi$.

3. Convergence of policy mirror ascent

In this section, we study the convergence of policy mirror ascent on a compact subset of $\Pi_+$.

3.1. $\alpha$-clipped policy mirror ascent

We consider the direct parameterization $\pi_\theta(a|s) = \theta_{s,a}$, where $\theta \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$ satisfies $\sum_{a \in \mathcal{A}} \theta_{s,a} = 1$ and $\theta_{s,a} \geq 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. With this direct parameterization, we do not distinguish between the policy $\pi_\theta$ and the parameter $\theta$, and we use $\pi_k$ to denote the iterates of the algorithm.
Let $h : \mathcal{M}(\mathcal{A}) \to \mathbb{R}$ be a strictly convex function, continuously differentiable on the relative interior of $\mathcal{M}(\mathcal{A})$, denoted $\mathrm{rint}\,\mathcal{M}(\mathcal{A})$. Define the Bregman divergence generated by $h$ as

$$D(p, p') = h(p) - h(p') - \langle \nabla h(p'), p - p' \rangle, \qquad \forall\, p \in \mathcal{M}(\mathcal{A}),\ p' \in \mathrm{rint}\,\mathcal{M}(\mathcal{A}).$$

In particular, if $h(p) = \frac{1}{2}\|p\|_2^2$, then $D(p, p') = \frac{1}{2}\|p - p'\|_2^2$. If $h(p) = \sum_{a \in \mathcal{A}} p(a) \log p(a)$, then $D(p, p') = \sum_{a \in \mathcal{A}} p(a) \log \frac{p(a)}{p'(a)}$. For $\rho \in \mathbb{R}^{|\mathcal{S}|}_+$, define the weighted divergence

$$D_\rho(\pi, \pi') = \sum_{s \in \mathcal{S}} \rho(s)\, D(\pi(\cdot|s), \pi'(\cdot|s)).$$

For a policy $\pi$, define $\rho^\pi_\mu(s) = d^\pi_\mu(s) \mathbf{1}_{\mathcal{R}}(s) + \delta^\pi_\mu(s) \mathbf{1}_{\mathcal{T}}(s)$, where $\mathbf{1}_{\mathcal{R}}$ and $\mathbf{1}_{\mathcal{T}}$ are indicator functions. Following the derivations of Shani et al. (2020) and Xiao (2022), we consider policy mirror ascent methods with weighted Bregman divergences:

$$\pi_{k+1} = \operatorname*{argmax}_{\pi \in \mathcal{C}} \left\{ \eta_k \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \nabla J^{\pi_k}_\mu(s,a)\, \pi(a|s) - D_{\rho^{\pi_k}_\mu}(\pi, \pi_k) \right\},$$

where $\eta_k$ is the step size, $\mu \in \mathcal{M}_+(\mathcal{S})$ is an initial state distribution with full support, and $\mathcal{C}$ is a compact subset of $\Pi_+$. For evaluating $\nabla J^{\pi_k}_\mu$, Theorem 5 implies

$$\nabla J^\pi_\mu(s,a) = d^\pi_\mu(s)\, Q^\pi(s,a)\, \mathbf{1}_{\mathcal{R}}(s) + \delta^\pi_\mu(s)\, K^\pi(s,a)\, \mathbf{1}_{\mathcal{T}}(s).$$

Let $G^\pi(s, \cdot) = Q^\pi(s, \cdot) \mathbf{1}_{\mathcal{R}}(s) + K^\pi(s, \cdot) \mathbf{1}_{\mathcal{T}}(s)$ for all $s \in \mathcal{S}$. Then, for $\mu$ with full support on $\mathcal{S}$, since $d^\pi_\mu(s) > 0$ and $\delta^\pi_\mu(s') > 0$ for any $s \in \mathcal{R}$ and $s' \in \mathcal{T}$, the update for $\pi_{k+1}$ splits across states:

$$\pi_{k+1}(\cdot|s) = \operatorname*{argmax}_{p \in \mathcal{C}'} \left\{ \eta_k \sum_{a \in \mathcal{A}} G^{\pi_k}(s,a)\, p(a) - D\big(p(\cdot), \pi_k(\cdot|s)\big) \right\}, \qquad \forall\, s \in \mathcal{S}.$$

The choice $\mathcal{C} = \Pi$ may seem natural, but in our setup, we further restrict the policy set to guarantee finiteness of certain coefficients appearing in the convergence analysis. Define

$$\Pi_\alpha = \{\pi \mid \pi(a|s) \geq \alpha \text{ for all } s \in \mathcal{S}, a \in \mathcal{A}\}, \qquad \mathcal{M}_\alpha(\mathcal{A}) = \{p \mid p(a) \geq \alpha \text{ for all } a \in \mathcal{A}\}$$

with $\alpha \in (0, 1/|\mathcal{A}|)$. Indeed, on $\Pi_\alpha$, we can ensure finiteness of the following coefficients.
Lemma 7. If $\mu \in \mathcal{M}_+(\mathcal{S})$ and $\alpha \in (0, 1/|\mathcal{A}|)$, then

$$\max_{\pi \in \Pi_\alpha} \|\rho^\pi_\mu\|_1 =: B_\alpha < \infty, \qquad \max_{\pi, \pi' \in \Pi_\alpha} \max_{s \in \mathcal{S}} \frac{\rho^\pi_\mu(s)}{\rho^{\pi'}_\mu(s)} =: C_\alpha < \infty.$$

Now, we are ready to present the $\alpha$-clipped policy mirror ascent algorithm.

Algorithm 1: $\alpha$-clipped policy mirror ascent
Input: $\alpha \in (0, 1/|\mathcal{A}|)$, $K$, $\pi_0 \in \Pi_\alpha$, $\{\eta_k\}_{k=0}^{K-1} \subset (0, \infty)$, $D(\cdot, \cdot)$
for $k = 0, 1, \dots, K-1$ do
  for $s \in \mathcal{S}$ do
    $\pi_{k+1}(\cdot|s) = \operatorname*{argmax}_{p \in \mathcal{M}_\alpha(\mathcal{A})} \big\{ \eta_k \sum_{a \in \mathcal{A}} G^{\pi_k}(s,a)\, p(a) - D\big(p(\cdot), \pi_k(\cdot|s)\big) \big\}$
  end for
end for
Output: $\pi_K$

If $h$ is a Legendre function (see Appendix A for the definition), such as the squared Euclidean norm or negative entropy, $\pi_{k+1}$ always exists (Xiao, 2022, Lemma 6). Specifically, if the Bregman divergence is the Euclidean distance, $\alpha$-clipped policy mirror ascent reduces to

$$\pi_{k+1}(\cdot|s) = \operatorname*{argmin}_{p \in \mathcal{M}_\alpha} \big\| p(\cdot) - \big( \pi_k(\cdot|s) + \eta_k G^{\pi_k}(s, \cdot) \big) \big\|_2^2, \qquad \forall\, s \in \mathcal{S},$$

and if the Bregman divergence is the KL-divergence, it reduces to

$$\pi_{k+1}(\cdot|s) = \operatorname*{argmin}_{p \in \mathcal{M}_\alpha} \mathrm{KL}\big( p(\cdot) \,\big\|\, \pi_k(\cdot|s) \exp(\eta_k G^{\pi_k}(s, \cdot)) / Z_s \big), \qquad \forall\, s \in \mathcal{S},$$

where $Z_s = \sum_{a \in \mathcal{A}} \pi_k(a|s) \exp(\eta_k G^{\pi_k}(s,a))$. If $\alpha = 0$, these methods are known as projected $Q$-descent and multiplicative weights updates, respectively (Xiao, 2022; Freund and Schapire, 1997). We note that both optimization subproblems can be solved through Euclidean and KL projection algorithms with $\tilde{O}(|\mathcal{S}||\mathcal{A}|)$ time complexity (Wang and Carreira-Perpiñán, 2013; Herbster and Warmuth, 2001). We present both projection algorithms in Appendix F.

In the following subsections, we establish convergence of $\alpha$-clipped policy mirror ascent. For a given $\mu$, define $\pi_\alpha \in \operatorname*{argmax}_{\pi \in \Pi_\alpha} J^\pi_\mu$. By Lemma 2, $\pi_\alpha$ exists, and we will show convergence of $J^{\pi_k}$ to $J^{\pi_\alpha}$ with $\pi_k \in \Pi_\alpha$.

3.2.
Sublinear convergence with constant step size

We first establish the sublinear convergence of the policy gradient algorithm with a constant step size. As a first step in our analysis, we state the following lemma, which ensures that the policies generated by the $\alpha$-clipped policy mirror ascent improve monotonically.

Lemma 8. Consider a multichain MDP. For $\pi_0 \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, the $\alpha$-clipped policy mirror ascent generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying $J^{\pi_k}_\mu \leq J^{\pi_{k+1}}_\mu$.

Through Lemma 8, we obtain the following convergence result for the $\alpha$-clipped policy mirror ascent. The proof, presented in Appendix C, closely follows Xiao (2022), which considers the discounted-reward setup and uses the discounted state-visitation distribution.

Theorem 9. Consider a multichain MDP. For $\pi_0 \in \Pi_\alpha$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, the $\alpha$-clipped policy mirror ascent with constant step size $\eta > 0$ generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying

$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \leq \frac{1}{k+1} \left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu \big) \right).$$

Theorem 9 shows that $J^{\pi_k}_\mu \to J^{\pi_\alpha}_\mu$ at a sublinear rate, and since $J^{\pi_\alpha}_\mu \to J^\star_{+,\mu}$ as $\alpha \to 0$, if we choose a sufficiently small $\alpha$ such that $J^\star_{+,\mu} - J^{\pi_\alpha}_\mu < \frac{\epsilon}{2}$, we can obtain a policy $\pi$ satisfying $J^\star_{+,\mu} - J^\pi_\mu < \epsilon$ through Theorem 9.

3.3. Linear convergence with adaptive step size

Next, we present the linear convergence rate of the $\alpha$-clipped policy mirror ascent with adaptive step sizes.

Theorem 10. Consider a multichain MDP. For $\pi_0 \in \Pi_\alpha$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, the $\alpha$-clipped policy mirror ascent with step sizes satisfying $(C_\alpha - 1)\eta_{k+1} \geq C_\alpha \eta_k > 0$ generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying

$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \leq \left( 1 - \frac{1}{C_\alpha} \right)^k \left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0 (C_\alpha - 1)} + J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu \right).$$

Although the adaptive step size yields a linear convergence rate, it requires knowledge of $C_\alpha$ to set the step sizes. In contrast, a constant step size always guarantees a sublinear rate.
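To make the Euclidean case of Algorithm 1 concrete, here is a self-contained sketch (our own toy instance, not an example from the paper) on a one-state MDP, where the single state is recurrent and $G^\pi(s,\cdot) = Q^\pi(s,\cdot) = r - J^\pi$. The projection onto $\mathcal{M}_\alpha$ is obtained by shifting the simplex projection of Wang and Carreira-Perpiñán (2013) via the substitution $q = p - \alpha\mathbf{1}$ (this reduction is our own shortcut, not the paper's Appendix F algorithm):

```python
import numpy as np

def proj_simplex(v, z=1.0):
    """Euclidean projection onto {p >= 0, sum p = z} (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - z
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def proj_clipped(v, alpha):
    """Projection onto M_alpha = {p : p(a) >= alpha, sum p = 1} via shifting."""
    nA = len(v)
    return alpha + proj_simplex(v - alpha, 1.0 - nA * alpha)

r = np.array([0.3, 1.0, -0.2, 0.5])   # one-state rewards; action 1 is best
alpha, eta, K = 0.05, 0.5, 50
pi = np.full(4, 0.25)                 # pi_0 in Pi_alpha
values = []
for _ in range(K):
    J = pi @ r
    G = r - J                         # Q^pi(s, .) in the one-state case
    pi = proj_clipped(pi + eta * G, alpha)  # Euclidean alpha-clipped update
    values.append(J)

# Clipped optimum: mass 1 - (|A|-1)alpha on the best action, alpha elsewhere.
pi_alpha = np.full(4, alpha); pi_alpha[1] = 1 - 3 * alpha
```

The recorded `values` increase monotonically, illustrating Lemma 8, and the iterate reaches the $\alpha$-clipped optimizer $\pi_\alpha = (0.05, 0.85, 0.05, 0.05)$ after a handful of steps.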
We point out that, to the best of our knowledge, Theorems 9 and 10 are the first convergence results for policy gradient methods in average-reward multichain MDPs.

Algorithm 2: Critic
Input: $\pi \in \Pi_+$, $N, H, N', H' \in \mathbb{Z}_+$
for $j = 1, \dots, N$ do
  for $(s_0, a_0) \in \mathcal{S} \times \mathcal{A}$ do
    Generate $\{(s_0, a_0), (s_1, a_1), \dots, (s_H, a_H) \mid s_i \sim P(\cdot|s_{i-1}, a_{i-1}),\ a_i \sim \pi(\cdot|s_i)\}$
    $K_j(s_0, a_0) = \frac{1}{H+1} \sum_{i=0}^{H} r_j(s_i, a_i)$
  end for
end for
$\hat{K}^\pi = \frac{1}{N} \sum_{j=1}^{N} K_j$
for $j = 1, \dots, N'$ do
  for $(s_0, a_0) \in \mathcal{S} \times \mathcal{A}$ do
    Generate $\{(s_0, a_0), (s_1, a_1), \dots, (s_{H'}, a_{H'}) \mid s_i \sim P(\cdot|s_{i-1}, a_{i-1}),\ a_i \sim \pi(\cdot|s_i)\}$
    $\hat{Q}_j(s_0, a_0) = \frac{1}{H'+1} \sum_{h=0}^{H'} \sum_{i=0}^{h} \big( r_j(s_i, a_i) - \hat{K}^\pi(s_i, a_i) \big)$
  end for
end for
$\hat{Q}^\pi = \frac{1}{N'} \sum_{j=1}^{N'} \hat{Q}_j$
for $s \in \mathcal{S}$ do
  $\hat{G}^\pi(s, \cdot) = \hat{Q}^\pi(s, \cdot) \mathbf{1}_{\mathcal{R}}(s) + \hat{K}^\pi(s, \cdot) \mathbf{1}_{\mathcal{T}}(s)$
end for
Output: $\hat{G}^\pi$

4. Sample complexity of policy mirror ascent

In this section, we extend the analysis of the $\alpha$-clipped policy mirror ascent to the sampling setting in which the transition probabilities are unknown. Specifically, we assume access to a generative model (Kearns and Singh, 1998), which provides independent samples of the next state for any given state and action.

4.1. Approximating $G^\pi$ with a generative model

With a generative model, for a given policy $\pi$ and any state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, we can generate independent trajectories of horizon $H$: $s_0 = s, a_0 = a, s_1, a_1, \dots, s_{H-1}, a_{H-1}$. With such samples, we approximate

$$G^\pi(s, \cdot) = Q^\pi(s, \cdot) \mathbf{1}_{\mathcal{R}}(s) + K^\pi(s, \cdot) \mathbf{1}_{\mathcal{T}}(s), \qquad \forall\, s \in \mathcal{S},$$

through the critic method described in Algorithm 2. Intuitively speaking, $\hat{K}^\pi$ and $\hat{Q}^\pi$ are finite-horizon truncations of $K^\pi$ and $Q^\pi$.
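The first stage of Algorithm 2 can be illustrated with a vectorized Monte Carlo sketch (the toy chain and parameter values are our own assumptions). For this chain, $K^\pi(2,0) = 2/3$, $K^\pi(2,1) = 0$, and $K^\pi(0,a) = 1$ can be computed by hand, so the $H$-horizon truncation estimate can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2
P = np.zeros((nS, nA, nS))
P[0, :, 0] = 1.0                   # state 0: absorbing, reward 1
P[1, :, 1] = 1.0                   # state 1: absorbing, reward 0
P[2, 0] = [0.5, 0.0, 0.5]          # transient state 2: action 0 may loop
P[2, 1] = [0.0, 1.0, 0.0]
r = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
pi = np.full((nS, nA), 0.5)        # uniform policy

def sample_rows(prob_rows):
    """One categorical draw per row of prob_rows (vectorized inverse CDF)."""
    u = rng.random(prob_rows.shape[0])
    return (prob_rows.cumsum(axis=1) < u[:, None]).sum(axis=1)

def critic_K(s0, a0, N=4000, H=200):
    """First stage of the critic: K_hat(s0,a0) = mean H-horizon average reward
    over N independent trajectories starting from (s0, a0)."""
    s = np.full(N, s0); a = np.full(N, a0)
    total = np.zeros(N)
    for _ in range(H + 1):
        total += r[s, a]
        s = sample_rows(P[s, a])   # s' ~ P(.|s,a)
        a = sample_rows(pi[s])     # a' ~ pi(.|s')
    return total.mean() / (H + 1)
```

For example, `critic_K(2, 0)` should land near $2/3$, with $O(1/H)$ truncation bias and $O(1/\sqrt{N})$ sampling noise, consistent with the role of $H$ and $N$ in Theorem 11 below.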
To present the sample-complexity results for the critic method under a generative model, we define the expected target time as follows. For $\pi \in \Pi$ and each recurrent class $\mathcal{R}_i$,

$$t^{\pi,i}_{\mathrm{tar}} := \sum_{s' \in \mathcal{R}_i} \rho^\pi_i(s')\, \mathbb{E}^\pi[t_{s'} \mid s_0 = s] \quad \text{for } i = 1, \dots, m, \qquad t^\pi_{\mathrm{tar}} = \max_{1 \leq i \leq m} t^{\pi,i}_{\mathrm{tar}},$$

where $t_{s'} := \inf\{t \geq 0 : s_t = s'\}$ for $s, s' \in \mathcal{R}_i$, and $\rho^\pi_i$ is the stationary distribution of $R^\pi_i$. It is known that the expected target time is always finite and independent of the starting state when the state space is finite (Levin and Peres, 2017).

Theorem 11. Consider a multichain MDP. Let $\epsilon > 0$, $\delta > 0$. For $\pi_0, \pi \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, with probability $1 - \delta$, the output of the critic method satisfies $\|\hat{G}^\pi - G^\pi\|_\infty \leq \epsilon$ with sample complexity

$$O\left( \frac{R^3 (t^\pi_{\mathrm{tar}})^3 \|Q^\pi\|_\infty^4 |\mathcal{S}||\mathcal{A}|}{\epsilon^6} \log^2 \frac{|\mathcal{S}||\mathcal{A}|}{\delta} \right).$$

4.2. Observing the classification of states through sampling

To compute $\hat{G}^\pi$ from $\hat{Q}^\pi$ and $\hat{K}^\pi$ in the critic method, the classification of the states is required. Define the transient half-life and cover time for a given $\pi$ and each recurrent class $\mathcal{R}_i$ as

$$t_{\frac{1}{2},\pi} = \min\left\{ t \geq 1 : \big\|(\bar{T}^\pi)^t\big\|_\infty \leq \tfrac{1}{2} \right\}, \qquad t^{\pi,i}_{\mathrm{cov}} = \max_{s \in \mathcal{R}_i} \mathbb{E}^\pi[t^i_{\mathrm{cov}} \mid s_0 = s], \qquad t^\pi_{\mathrm{cov}} = \max_{1 \leq i \leq m} t^{\pi,i}_{\mathrm{cov}},$$

where $\bar{T}^\pi$ is the transient matrix and $t^i_{\mathrm{cov}} := \inf\{t \geq 0 : \{s_j\}_{j=0}^{t-1} = \mathcal{R}_i\}$. It is known that these quantities are finite when the state space is finite (Lee and Ryu, 2025b; Levin and Peres, 2017).

Lemma 12. Consider a multichain MDP. Let $\delta > 0$. Given $\pi \in \Pi_+$, set $M_1 = \lceil t_{\frac{1}{2},\pi} \log(\frac{2}{\delta}) \rceil$ and $M_2 = \lceil e\, t^\pi_{\mathrm{cov}} \log \frac{2}{\delta} \rceil$. With probability $1 - \delta$, for a generated single trajectory of length $M_1 + M_2$, $(s_0, a_0, s_1, a_1, \dots, s_{M_1+M_2-1}, a_{M_1+M_2-1})$, we have $\{s_j\}_{j=M_1}^{M_1+M_2-1} = \mathcal{R}_i$ for some $i$.

This lemma shows that by sampling a trajectory of length $M_1 + M_2$ starting from an arbitrary state $s_0$, we can recover the complete set of states in a recurrent class.
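Lemma 12's procedure can be sketched on a toy chain as follows (the chain and the burn-in lengths $M_1$, $M_2$ are ad hoc choices of ours, not computed from the lemma's formulas): run one trajectory, discard a burn-in of $M_1$ steps so the transient phase has died out, and read off the set of states visited in the next $M_2$ steps.

```python
import numpy as np

rng = np.random.default_rng(2)
# States 0,1 form one recurrent class, state 2 is a second (absorbing) class,
# and state 3 is transient.
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.3, 0.3, 0.2, 0.2]])

def recurrent_class_from_trajectory(P, s0, M1, M2, rng):
    """Collect the states seen after a burn-in of M1 steps; with high
    probability they form exactly one full recurrent class."""
    s = s0
    visited = set()
    for t in range(M1 + M2):
        if t >= M1:
            visited.add(s)
        s = rng.choice(len(P), p=P[s])
    return visited

cls = recurrent_class_from_trajectory(P, s0=3, M1=50, M2=50, rng=rng)
```

Starting from the transient state 3, the returned set is either $\{0, 1\}$ or $\{2\}$, depending on which class the trajectory is absorbed into; comparing $s_0$ against the recovered set then classifies $s_0$, as described next.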
Moreover, by checking whether $s_0$ belongs to this set, we can determine the classification of $s_0$. Repeating this procedure for the remaining unclassified states under a generative model allows us to identify the classification of all states with sample complexity

$$O\left( (|\mathcal{T}| + m)\big( t_{\frac{1}{2},\pi} + t^\pi_{\mathrm{cov}} \big) \log \frac{|\mathcal{T}| + m}{\delta} \right),$$

where $m$ denotes the number of recurrent classes.

4.3. Sample complexity of stochastic $\alpha$-clipped policy mirror ascent

Based on the results in the previous sections, we now present stochastic $\alpha$-clipped policy mirror ascent:

Algorithm 3: Stochastic $\alpha$-clipped policy mirror ascent
Input: $\alpha \in (0, 1/|\mathcal{A}|)$, $K$, $\pi_0 \in \Pi_\alpha$, $\{\eta_k\}_{k=0}^{K-1} \subset (0, \infty)$, $D(\cdot, \cdot)$, $\{N_k, H_k, N'_k, H'_k\}_{k=0}^{K-1}$
for $k = 0, 1, \dots, K-1$ do
  $\hat{G}^{\pi_k} = \mathrm{Critic}(\pi_k, N_k, H_k, N'_k, H'_k)$
  for $s \in \mathcal{S}$ do
    $\pi_{k+1}(\cdot|s) = \operatorname*{argmax}_{p \in \mathcal{M}_\alpha(\mathcal{A})} \big\{ \eta_k \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(s,a)\, p(a) - D\big(p(\cdot), \pi_k(\cdot|s)\big) \big\}$
  end for
end for
Output: $\pi_K$

Theorem 13. Consider a multichain MDP. Let $\epsilon > 0$, $\delta > 0$. For any $\pi, \pi_0 \in \Pi_\alpha$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, with probability $1 - \delta$, the iterates of stochastic $\alpha$-clipped policy mirror ascent with constant step size $\eta > 0$ and

$$K = 2\left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu \big) \right) \Big/ \epsilon$$

satisfy $J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \leq \epsilon$ with sample complexity

$$\tilde{O}\left( \frac{t_{\mathrm{tar}}^3 \|Q^\pi\|_\infty^4 R^9 C_\alpha^6 D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) B_\alpha^6 |\mathcal{S}||\mathcal{A}|}{\eta^6 \epsilon^{12}} \right),$$

and with adaptive step sizes $\eta_{k+1}(C_\alpha - 1) \geq \eta_k C_\alpha > 0$ and

$$K = \log\left( 2\Big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0 (C_\alpha - 1)} \Big) \Big/ \epsilon \right) \Big/ \log\big( C_\alpha / (C_\alpha - 1) \big)$$

satisfy $J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \leq \epsilon$ with sample complexity

$$\tilde{O}\left( \frac{t_{\mathrm{tar}}^3 \|Q^\pi\|_\infty^4 R^3 C_\alpha^6 B_\alpha^6 |\mathcal{S}||\mathcal{A}|}{\epsilon^6} \right),$$

where $\tilde{O}$ ignores all logarithmic factors, $\|Q^\pi\|_\infty = \max_{0 \leq k \leq K-1} \|Q^{\pi_k}\|_\infty$, and $t_{\mathrm{tar}} = \max_{0 \leq k \leq K-1} t^{\pi_k}_{\mathrm{tar}}$.

We point out that, to the best of our knowledge, Theorem 13 gives the first sample-complexity results for policy gradient methods in average-reward multichain MDPs.
Lastly, we briefly mention that a refined analysis for weakly communicating MDPs is provided in Appendix E.

5. Conclusion

In this work, we present the first analysis of policy gradient methods for average-reward multichain MDPs, using the notion of recurrent and transient visitation measures based on the invariance of the classification of states. Our results and proof techniques open the door to future work on policy gradient algorithms for the average-reward multichain MDP setup.

One future direction is to further improve the sample complexity by incorporating variance reduction techniques (Li et al., 2024; Lee et al., 2025) and to fully characterize the sample complexity by quantifying the dependence between $\alpha$ and $\epsilon$ in our theorems. Developing trust region methods (Schulman et al., 2015; Zhang and Ross, 2021) for average-reward multichain MDPs using our machinery would be an interesting direction as well.

References

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1-76, 2021.

Qinbo Bai, Washim Uddin Mondal, and Vaneet Aggarwal. Regret analysis of policy gradient algorithm for infinite horizon average reward Markov decision processes. International Conference on Artificial Intelligence, 2024.

Jonathan Baxter and Peter L Bartlett. Direct gradient-based reinforcement learning. IEEE International Symposium on Circuits and Systems, 3:271-274, 2000.

Abraham Berman and Robert J Plemmons. Nonnegative Matrices in the Mathematical Sciences. SIAM, 1994.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906-1927, 2024.

David Blackwell. Discrete dynamic programming.
The Annals of Mathematical Statistics, 33:719–726, 1962.

Victor Boone and Zihan Zhang. Achieving tractable minimax optimal regret in average reward MDPs. Neural Information Processing Systems, 2024.

Mario Bravo and Roberto Cominetti. Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds. SIAM Journal on Control and Optimization, 62(1):191–219, 2024.

Pierre Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer Science & Business Media, 2nd edition, 2013.

Apostolos N Burnetas and Michael N Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.

Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578, 2022.

Zaiwei Chen. Non-asymptotic guarantees for average-reward Q-learning with adaptive stepsizes. Neural Information Processing Systems, 2025.

Vektor Dewanto, George Dunn, Ali Eshragh, Marcus Gallagher, and Fred Roosta. Average-reward model-free reinforcement learning: a systematic review and literature mapping. arXiv:2010.08920, 2020.

Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Swetha Ganesh and Vaneet Aggarwal. Regret analysis of average-reward unichain MDPs via an actor-critic approach. arXiv preprint arXiv:2505.19986, 2025.

Swetha Ganesh, Washim Uddin Mondal, and Vaneet Aggarwal. Order-optimal regret with novel policy gradient approaches in infinite-horizon average reward MDPs. International Conference on Artificial Intelligence and Statistics, 2025.
Mark Herbster and Manfred K Warmuth. Tracking the best linear predictor. Journal of Machine Learning Research, 1:281–309, 2001.

Ronald A Howard. Dynamic Programming and Markov Processes. John Wiley and Sons, 1960.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.

Ying Jin, Ramki Gummadi, Zhengyuan Zhou, and Jose Blanchet. Feasible Q-learning for average reward reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2024.

Yujia Jin and Aaron Sidford. Towards tight bounds on the sample complexity of average-reward MDPs. International Conference on Machine Learning, 2021.

Sham M Kakade. A natural policy gradient. Neural Information Processing Systems, 2001.

Michael Kearns and Satinder Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. Neural Information Processing Systems, 1998.

Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Neural Information Processing Systems, 1999.

Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y Levy, R Srikant, and Shie Mannor. On the global convergence of policy gradient in average reward Markov decision processes. International Conference on Learning Representations, 2024.

John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.

Jongmin Lee and Ernest K Ryu. Optimal non-asymptotic rates of value iteration for average-reward Markov decision processes. International Conference on Learning Representations, 2025a.

Jongmin Lee and Ernest K Ryu. Why policy gradient algorithms work for undiscounted total-reward MDPs. arXiv preprint, 2025b.

Jongmin Lee, Mario Bravo, and Roberto Cominetti. Near-optimal sample complexity for MDPs via anchoring. International Conference on Machine Learning, 2025.

David A Levin and Yuval Peres. Markov Chains and Mixing Times.
American Mathematical Society, 2017.

Tianjiao Li, Feiyang Wu, and Guanghui Lan. Stochastic first-order methods for average-reward Markov decision processes. Mathematics of Operations Research, 50(4):3125–3160, 2024.

Sridhar Mahadevan. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1):159–195, 1996.

Peter Marbach and John N Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. International Conference on Machine Learning, 2020.

Yashaswini Murthy and R Srikant. On the convergence of natural policy gradient and mirror descent-like policy methods for average-reward MDPs. IEEE Conference on Decision and Control, 2023.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 2nd edition, 2014.

Gareth O Roberts and Jeffrey S Rosenthal. Shift-coupling and convergence rates of ergodic averages. Stochastic Models, 13(1):147–165, 1997.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. International Conference on Machine Learning, 2015.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint, 2017.

Paul J Schweitzer. Perturbation theory and finite Markov chains. Journal of Applied Probability, 5(2):401–413, 1968.

Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. AAAI Conference on Artificial Intelligence, 2020.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 1999.

Adrienne Tuynman, Rémy Degenne, and Emilie Kaufmann. Finding good policies in average-reward Markov decision processes without prior knowledge. Neural Information Processing Systems, 2024.

Yi Wan, Huizhen Yu, and Richard S Sutton. On convergence of average-reward Q-learning in weakly communicating Markov decision processes. arXiv preprint, 2024.

Mengdi Wang. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems, 2017.

Weiran Wang and Miguel A Carreira-Perpiñán. Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. arXiv preprint, 2013.

Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. International Conference on Machine Learning, 2020.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Lin Xiao. On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1–36, 2022.

Yiming Zhang and Keith W Ross. On-policy deep reinforcement learning for the average-reward criterion. International Conference on Machine Learning, 2021.

Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. Neural Information Processing Systems, 2019.

Zihan Zhang and Qiaomin Xie. Sharper model-free reinforcement learning for average-reward Markov decision processes. Conference on Learning Theory, 2023.

Matthew Zurek and Yudong Chen. Span-based optimal sample complexity for weakly communicating and general average reward MDPs.
Neural Information Processing Systems, 2024.

Appendix A. Preliminaries

We say a function $h$ is essentially smooth if $h$ is differentiable and $\|\nabla h(x_k)\| \to \infty$ for every sequence $\{x_k\}$ converging to a boundary point of $\operatorname{dom} h$. We say a function $h$ is of Legendre type if it is essentially smooth and strictly convex on the relative interior of $\operatorname{dom} h$.

Fact 3 (Three-point descent lemma) (Xiao, 2022, Lemma 6) Suppose that $C \subset \mathbb{R}^n$ is a closed convex set, $\phi \colon C \to \mathbb{R}$ is a proper, closed, and convex function, $D(\cdot,\cdot)$ is the Bregman divergence generated by a function $h$ of Legendre type, and $\operatorname{rint} \operatorname{dom} h \cap C \neq \emptyset$. For any $x \in \operatorname{rint} \operatorname{dom} h$, let
$$x^+ = \operatorname*{argmin}_{u \in C}\, \{\phi(u) + D(u, x)\}.$$
Then $x^+ \in \operatorname{rint} \operatorname{dom} h \cap C$ and, for any $u \in C$,
$$\phi(x^+) + D(x^+, x) \le \phi(u) + D(u, x) - D(u, x^+).$$

In our setup, $C = \mathcal{M}_\alpha(\mathcal{A})$, $\phi(p) = -\eta_k \langle G^{\pi_k}(\cdot, s), p(\cdot) \rangle$, and $h$ is the negative-entropy function, which is of Legendre type and satisfies $\operatorname{rint} \operatorname{dom} h \cap C = \mathcal{M}_\alpha(\mathcal{A})$. Therefore, if we start with an initial point in $\mathcal{M}_\alpha(\mathcal{A})$, then every iterate stays in $\mathcal{M}_\alpha(\mathcal{A})$.

Fact 4 (Hoeffding inequality) Let $X_1, \dots, X_n$ be independent random variables such that $a_i \le X_i \le b_i$ for all $i$. Then
$$\mathrm{Prob}\left( \left| \frac{1}{n}\sum_{i=1}^n X_i - \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] \right| \ge \epsilon \right) \le 2\exp\left( -\frac{2 n^2 \epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).$$

Fact 5 (Target time lemma) (Roberts and Rosenthal, 1997, Corollary 3) For $\pi \in \Pi_+$ and every $s \in \mathcal{R}_i$,
$$\left\| \frac{1}{k}\sum_{j=1}^{k} (R^\pi_i)^j(\cdot, s) - g^\pi_i(\cdot) \right\|_1 \le \frac{2\, t^{\pi,i}_{\mathrm{tar}}}{k},$$
where $g^\pi_i$ is the stationary distribution of $R^\pi_i$.

Appendix B. Omitted proofs of Section 2

B.1. Proof of Lemma 3

We first prove the following lemma.

Lemma 14 For any $\pi \in \Pi_+$, $a, a' \in \mathcal{A}$, and $s, s' \in \mathcal{R}_i$, $K^\pi(s,a) = K^\pi(s',a')$ and $J^\pi(s) = J^\pi(s')$.

Proof For $s \in \mathcal{R}_i$ and $s' \notin \mathcal{R}_i$, $P(s' \mid s, a) = 0$.
For any $s, s' \in \mathcal{R}_i$, $J^\pi(s) = J^\pi(s')$, since $R^\pi_{i,\star} = \mathbf{1} g_i^\top$ where $g_i$ is the stationary distribution of $R^\pi_i$. Thus, $K^\pi(s,a) = PJ^\pi(s,a) = \sum_{s'' \in \mathcal{R}_i} P(s'' \mid s, a)\, J^\pi(s'') = J^\pi(s) = J^\pi(s')$ for any $a \in \mathcal{A}$.

Now we prove Lemma 3.

Proof First, we have $J^\pi - J^{\pi'} = P^\pi_\star (J^\pi - J^{\pi'}) + (P^\pi_\star - I) J^{\pi'}$, since $P^\pi_\star P^\pi_\star = P^\pi_\star$.

For the first term,
$$P^\pi_\star (J^\pi - J^{\pi'}) = P^\pi_\star (r^\pi - J^{\pi'}) = P^\pi_\star (r^\pi + V^{\pi'} - P^{\pi'} V^{\pi'} - r^{\pi'}) = P^\pi_\star (r^\pi + P^\pi V^{\pi'} - P^{\pi'} V^{\pi'} - r^{\pi'}) = P^\pi_\star (\Theta^\pi - \Theta^{\pi'})(r + P V^{\pi'}) = P^\pi_\star (\Theta^\pi - \Theta^{\pi'})(K^{\pi'} + Q^{\pi'}),$$
where the first and third equalities come from properties of the limiting matrix, $P^\pi_\star = P^\pi_\star P^\pi_\star = P^\pi_\star P^\pi$, the second and last equalities follow from the Bellman equation, and the fourth equality follows from the fact that $\Theta^\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}||\mathcal{A}|}$ is the matrix form of the policy $\pi$, satisfying $\Theta^\pi P = P^\pi$ and $\Theta^\pi r = r^\pi$.

Then, for given $\mu$ with full support, we have
$$\mu^\top P^\pi_\star (J^\pi - J^{\pi'}) = \mu^\top P^\pi_\star (\Theta^\pi - \Theta^{\pi'})(K^{\pi'} + Q^{\pi'}) = \sum_{s \in \mathcal{S}} d^\pi_\mu(s) \sum_{a \in \mathcal{A}} \big(\pi(a \mid s) - \pi'(a \mid s)\big)\big(K^{\pi'}(s,a) + Q^{\pi'}(s,a)\big) = \sum_{s \in \mathcal{R}} d^\pi_\mu(s) \sum_{a \in \mathcal{A}} \big(\pi(a \mid s) - \pi'(a \mid s)\big)\big(K^{\pi'}(s,a) + Q^{\pi'}(s,a)\big) = \sum_{s \in \mathcal{R}} d^\pi_\mu(s) \sum_{a \in \mathcal{A}} \big(\pi(a \mid s) - \pi'(a \mid s)\big)\, Q^{\pi'}(s,a),$$
where the third equality comes from $d^\pi_\mu(s) = 0$ for all $s \in \mathcal{T}$, and the last equality follows from Lemma 14: since $K^{\pi'}(s,\cdot)$ is constant for $s$ in a recurrent class, $\sum_{a}(\pi(a \mid s) - \pi'(a \mid s))\, K^{\pi'}(s,a) = 0$.

For the second term, if $s \in \mathcal{R}_i$, then $P^\pi_\star J^{\pi'}(s) = \sum_{s' \in \mathcal{R}_i} g^\pi_i(s')\, J^{\pi'}(s') = J^{\pi'}(s)$, since $R^\pi_{i,\star} = \mathbf{1}(g^\pi_i)^\top$ where $g^\pi_i$ is the stationary distribution, $J^{\pi'}(s') = J^{\pi'}(s'')$ for $s', s'' \in \mathcal{R}_i$, and $\pi$, $\pi'$ share the same recurrent classes by Lemma 14 and Fact 1.

Otherwise, if $s \in \mathcal{T}$, define $J^\pi_i \in \mathbb{R}^{|\mathcal{R}_i|}$ and $J^\pi_{m+1} \in \mathbb{R}^{|\mathcal{T}|}$ for $1 \le i \le m$ such that $J^\pi(s) = J^\pi_i(s)$ for $s \in \mathcal{R}_i$ and $J^\pi(s) = J^\pi_{m+1}(s)$ for $s \in \mathcal{T}$.
Since $S^\pi_{i,\star} = (I - T^\pi)^{-1} S^\pi_i R^\pi_{i,\star}$, we have
$$P^\pi_\star J^{\pi'}(s) = \sum_{i=1}^m (I - T^\pi)^{-1} S^\pi_i J^{\pi'}_i(s) = (I - \bar{T}^\pi)^{-1} (P^\pi - \bar{T}^\pi) J^{\pi'}(s)$$
for $s \in \mathcal{T}$, where the first equality comes from Lemma 14, Fact 1, and the canonical forms of $P^\pi_\star$ and $P^\pi$. This implies
$$(P^\pi_\star - I) J^{\pi'}(s) = (I - \bar{T}^\pi)^{-1} (P^\pi - I) J^{\pi'}(s) = (I - \bar{T}^\pi)^{-1} (P^\pi - P^{\pi'}) J^{\pi'}(s) = (I - \bar{T}^\pi)^{-1} (\Theta^\pi - \Theta^{\pi'}) K^{\pi'}(s)$$
for $s \in \mathcal{T}$, where the second equality follows from the Bellman equation. Multiplying by $\mu$ concludes the proof.

B.2. Proof of Theorem 5

Proof First, due to the invariance of the transient and recurrent classes, it suffices to consider the differentiability of $\{R^\pi_{\star,i}\}_{i=1}^m$ and $\{S^\pi_{\star,i}\}_{i=1}^m$, since the other entries of $P^\pi_\star$ are always zero. The differentiability of $R^\pi_{\star,i}$ is guaranteed by the proof of (Marbach and Tsitsiklis, 2001, Lemma 1). (As a technical detail, the differentiability of $R^\pi_{\star,i}$ in Marbach and Tsitsiklis (2001) is proved under aperiodicity, but the proof also works for unichain MDPs due to the uniqueness of the stationary distribution (Levin and Peres, 2017, Proposition 1.29).) The differentiability of $(I - T^\pi)^{-1}$ is guaranteed by the proof of (Lee and Ryu, 2025b, Theorem 4.7). These imply the differentiability of $S^\pi_{i,\star} = (I - T^\pi)^{-1} S^\pi_i R^\pi_{i,\star}$ and of $P^\pi_\star$ as well.

For a manifold $S$, we denote the tangent space of $S$ by $TS$ (Lee, 2003). Consider a differentiable manifold $\Theta$ and the differential $\nabla_\theta \pi_\theta \colon T\Theta \to T\Pi_+$ for $\pi_\theta \colon \Theta \to \Pi_+$, where $T\Pi_+ = \{u \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \mid \sum_{a \in \mathcal{A}} u(a \mid s) = 0 \;\; \forall s \in \mathcal{S}\}$. (Here, note that $\Pi_+$ is a product of simplices embedded in Euclidean space.) Since $J^\pi_\mu \colon \Pi_+ \to \mathbb{R}$ is a smooth function by the previous argument, we can also consider the differential $\nabla_\pi J^\pi_\mu \colon T\Pi_+ \to T\mathbb{R}$, where $T\mathbb{R} = \mathbb{R}$.
By Lemma 3, for any $u \in T\Pi_+$ and $\epsilon > 0$ such that $\pi + \epsilon u \in \Pi_+$, we have
$$J^{\pi + \epsilon u}_\mu - J^\pi_\mu = \epsilon \sum_{s \in \mathcal{R},\, a \in \mathcal{A}} d^{\pi + \epsilon u}_\mu(s)\, u(a \mid s)\, Q^\pi(s,a) + \epsilon \sum_{s \in \mathcal{T},\, a \in \mathcal{A}} \delta^{\pi + \epsilon u}_\mu(s)\, u(a \mid s)\, K^\pi(s,a).$$
By continuity of $d^\pi_\mu$ and $\delta^\pi_\mu$ on $\Pi_+$, dividing both sides by $\epsilon$ and letting $\epsilon \to 0$, the differential can be written as
$$\nabla_\pi J^\pi_\mu(s,a) = d^\pi_\mu(s)\, Q^\pi(s,a) + \delta^\pi_\mu(s)\, K^\pi(s,a).$$
Then, by the chain rule,
$$\nabla_\theta J^\pi_\mu = \sum_{s \in \mathcal{R},\, a \in \mathcal{A}} d^\pi_\mu(s)\, \nabla_\theta \pi(a \mid s)\, Q^\pi(s,a) + \sum_{s \in \mathcal{T},\, a \in \mathcal{A}} \delta^\pi_\mu(s)\, \nabla_\theta \pi(a \mid s)\, K^\pi(s,a).$$

We briefly note that Lemma 2 directly follows from Theorem 5.

Appendix C. Omitted proofs in Section 3

C.1. Proof of Lemma 7

Proof As we showed in the proof of Theorem 5, $d^\pi$ and $\delta^\pi$ are continuous on $\Pi_+$. For each $i$, $\mathcal{R}_i$ is a single recurrent class for every $\pi \in \Pi_\alpha$, so the stationary distribution $g^\pi_i$ of $R^\pi_i$ is always positive (Levin and Peres, 2017, Proposition 1.19). This implies $\min_{s \in \mathcal{R},\, \pi \in \Pi_\alpha} d^\pi_\mu(s) > 0$ by continuity, and $\|d^\pi_\mu\|_1 = 1$. On the other hand, for $s \in \mathcal{T}$, $\min_{\pi \in \Pi_\alpha} \delta^\pi_\mu(s) \ge \mu(s)$ by definition, and $\max_{\pi \in \Pi_\alpha} \delta^\pi_\mu(s) < \infty$ by continuity.

C.2. Proof of Lemma 8

We prove a more detailed version of Lemma 8.

Lemma 15 For $\mu \in \mathcal{M}_+(\mathcal{S})$, the $\alpha$-clipped policy mirror ascent generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying
$$\sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \le 0, \quad \forall s \in \mathcal{S},$$
and $J^{\pi_k}_\mu \le J^{\pi_{k+1}}_\mu$.

Proof Applying Fact 3 with $C = \mathcal{M}_\alpha(\mathcal{A})$ and $\phi(p) = -\eta_k \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\, p(a)$, we obtain
$$\eta_k \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\, p(a) + D\big(\pi_{k+1}(\cdot \mid s), \pi_k(\cdot \mid s)\big) \le \eta_k \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\, \pi_{k+1}(a \mid s) + D\big(p, \pi_k(\cdot \mid s)\big) - D\big(p, \pi_{k+1}(\cdot \mid s)\big)$$
for any $p \in \mathcal{M}_\alpha(\mathcal{A})$. Rearranging terms and dividing both sides by $\eta_k$, we get
$$\sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(p(a) - \pi_{k+1}(a \mid s)\big) + \frac{1}{\eta_k} D\big(\pi_{k+1}(\cdot \mid s), \pi_k(\cdot \mid s)\big) \le \frac{1}{\eta_k} D\big(p, \pi_k(\cdot \mid s)\big) - \frac{1}{\eta_k} D\big(p, \pi_{k+1}(\cdot \mid s)\big).$$
(∗)

Letting $p = \pi_k(\cdot \mid s)$ in the previous inequality yields
$$\sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \le -\frac{1}{\eta_k} D\big(\pi_{k+1}(\cdot \mid s), \pi_k(\cdot \mid s)\big) - \frac{1}{\eta_k} D\big(\pi_k(\cdot \mid s), \pi_{k+1}(\cdot \mid s)\big).$$
Then the first result follows from the nonnegativity of the Bregman divergence, and the second result follows from the performance difference lemma after multiplying both sides by $\rho^{\pi_{k+1}}_\mu(\cdot)$.

C.3. Proof of Theorem 9

Proof Consider the previous inequality (∗). Let $p = \pi_\alpha(\cdot \mid s) \in \mathcal{M}_\alpha$ and add and subtract $\pi_k(\cdot \mid s)$ inside the inner product. Then we have
$$\sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) + \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) \le \frac{1}{\eta_k} D\big(\pi_\alpha(\cdot \mid s), \pi_k(\cdot \mid s)\big) - \frac{1}{\eta_k} D\big(\pi_\alpha(\cdot \mid s), \pi_{k+1}(\cdot \mid s)\big).$$
This implies
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) + \sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}).$$
For the first term,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) = \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s)\, \frac{\rho^{\pi_\alpha}_\mu(s)}{\rho^{\pi_{k+1}}_\mu(s)} \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \ge C_\alpha \big(J^{\pi_k}_\mu - J^{\pi_{k+1}}_\mu\big),$$
where the last inequality follows from Lemmas 15 and 3. For the second term, by Lemma 3,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) = J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu.$$
Thus we have
$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}) + C_\alpha \big(J^{\pi_{k+1}}_\mu - J^{\pi_k}_\mu\big).$$
Setting $\eta_k = \eta$ for all $k \ge 0$ and summing over $k$ gives
$$\sum_{i=0}^{k} \big(J^{\pi_\alpha}_\mu - J^{\pi_i}_\mu\big) \le \frac{1}{\eta} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) - \frac{1}{\eta} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}) + C_\alpha \big(J^{\pi_{k+1}}_\mu - J^{\pi_0}_\mu\big).$$
Since $J^{\pi_k}_\mu$ is non-decreasing by Lemma 8 and the Bregman divergence is non-negative, we conclude that
$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{1}{k+1}\left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha \big(J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu\big) \right).$$

C.4. Proof of Theorem 10

Define $U^{\pi_\alpha}_k = J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu$. We first prove the following key lemma.

Lemma 16 For $\pi_0 \in \Pi_\alpha$ and a given $\mu \in \mathcal{M}_+(\mathcal{S})$, the $\alpha$-clipped policy mirror ascent with step sizes $\eta_k > 0$ generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying
$$C_\alpha \big( U^{\pi_\alpha}_{k+1} - U^{\pi_\alpha}_k \big) + U^{\pi_\alpha}_k \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}).$$

Proof In the proof of Theorem 9, we showed that
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) + \sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}).$$
For the first term,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \ge C_\alpha \big(J^{\pi_k}_\mu - J^{\pi_{k+1}}_\mu\big),$$
where the inequality follows from Lemmas 15 and 3. For the second term, by Lemma 3,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) = J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu.$$
We obtain the desired result after substitution.

We are now ready to prove Theorem 10.

Proof By Lemma 16,
$$C_\alpha \big( U^{\pi_\alpha}_{k+1} - U^{\pi_\alpha}_k \big) + U^{\pi_\alpha}_k \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}).$$
Dividing both sides by $C_\alpha$ and rearranging terms, we obtain
$$U^{\pi_\alpha}_{k+1} + \frac{1}{\eta_k C_\alpha} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}) \le \left(1 - \frac{1}{C_\alpha}\right) U^{\pi_\alpha}_k + \frac{1}{\eta_k C_\alpha} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k).$$
Since the step sizes satisfy $\eta_{k+1}(C_\alpha - 1) \ge \eta_k C_\alpha > 0$, and since $\frac{1}{\eta_k C_\alpha} = \big(1 - \frac{1}{C_\alpha}\big)\frac{1}{\eta_k (C_\alpha - 1)}$, we have
$$U^{\pi_\alpha}_{k+1} + \frac{1}{\eta_{k+1}(C_\alpha - 1)} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}) \le \left(1 - \frac{1}{C_\alpha}\right)\left( U^{\pi_\alpha}_k + \frac{1}{\eta_k (C_\alpha - 1)} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) \right).$$
Therefore, by recursion,
$$U^{\pi_\alpha}_k + \frac{1}{\eta_k (C_\alpha - 1)} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) \le \left(1 - \frac{1}{C_\alpha}\right)^k \left( U^{\pi_\alpha}_0 + \frac{1}{\eta_0 (C_\alpha - 1)} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) \right).$$

Appendix D. Omitted proofs in Section 4

D.1.
Proof of Theorem 11

Define $V_{k+1} = P^\pi V_k + r^\pi$ for $k = 0, 1, \dots$, where $V_0 = 0$.

Fact 6 (Classical result, (Puterman, 2014, Theorem 9.4.1)) Consider a general (multichain) MDP. Then, for $k \ge 1$, the sequence $\{V_k\}_{k=0}^\infty$ exhibits the rate
$$\left\| \frac{V_k}{k} - J^\pi \right\|_\infty \le \frac{2 \|V^\pi\|_\infty}{k}.$$

Using Fact 6, we first study the sample complexity of approximating $K^\pi$.

Lemma 17 Let $\epsilon > 0$ and $\delta > 0$. For given $\pi \in \Pi$, with probability $1 - \delta$, the estimate $\hat{K}^\pi$ in the critic satisfies $\|\hat{K}^\pi - K^\pi\|_\infty \le \epsilon$ with sample complexity
$$N H |\mathcal{S}||\mathcal{A}| = O\left( \frac{(\|V^\pi\|_\infty + R)\, R^2\, |\mathcal{S}||\mathcal{A}|}{\epsilon^3} \log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} \right).$$

Proof Since
$$\hat{K}^\pi(s_0, a_0) = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{H+1} \sum_{i=0}^{H} r_j(s_i, a_i),$$
we have
$$\mathbb{E}^\pi\left[ \frac{1}{N} \sum_{j=1}^{N} \frac{1}{H+1} \sum_{i=0}^{H} r_j(s_i, a_i) \right] = \mathbb{E}^\pi\left[ \frac{1}{H+1} \sum_{i=0}^{H} r_j(s_i, a_i) \right]$$
by independence of the samples from the generative model. Also, we have
$$\left| \mathbb{E}^\pi\left[ \frac{1}{H+1} \sum_{i=0}^{H} r_j(s_i, a_i) \right] - K^\pi(s_0, a_0) \right| = \left| \frac{1}{H+1}(P V_H + r)(s_0, a_0) - P J^\pi(s_0, a_0) \right| = \left| \left( \frac{P V_H}{H+1} - J^\pi \right)(s_0, a_0) + \frac{r(s_0, a_0)}{H+1} \right| = \left| \frac{H}{H+1} \left( \frac{P V_H}{H} - J^\pi \right)(s_0, a_0) + \frac{(r - K^\pi)(s_0, a_0)}{H+1} \right| \le \frac{2(\|V^\pi\|_\infty + R)}{H+1},$$
where the first equality comes from the Bellman equation and the last inequality follows from Fact 6 and $\|K^\pi\|_\infty \le R$.

Moreover, by the Hoeffding inequality with $N = \frac{2R^2}{(\epsilon')^2}\log\frac{2}{\delta}$, we get
$$\mathrm{Prob}\left( \left| \hat{K}^\pi(s_0, a_0) - \mathbb{E}^\pi\left[ \frac{1}{N}\sum_{j=1}^{N} \frac{1}{H+1}\sum_{i=0}^{H} r_j(s_i, a_i) \right] \right| \le \epsilon' \right) \ge 1 - \delta,$$
and by a union bound over $(s_0, a_0) \in \mathcal{S} \times \mathcal{A}$, we have $\|\hat{K}^\pi - \mathbb{E}[\hat{K}^\pi]\|_\infty < \epsilon'$. Let $H = \frac{4(\|V^\pi\|_\infty + R)}{\epsilon}$ and $\epsilon' = \epsilon/2$. Then, by the triangle inequality,
$$\|K^\pi - \hat{K}^\pi\|_\infty \le \|K^\pi - \mathbb{E}[\hat{K}^\pi]\|_\infty + \|\mathbb{E}[\hat{K}^\pi] - \hat{K}^\pi\|_\infty \le \epsilon,$$
with sample complexity
$$N H |\mathcal{S}||\mathcal{A}| = \frac{32\, |\mathcal{S}||\mathcal{A}|\, (\|V^\pi\|_\infty + R)\, R^2}{\epsilon^3} \log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta}.$$

Now we establish the following lemma for the evaluation of $Q^\pi$ with the critic method.

Lemma 18 Consider a general (multichain) MDP.
Then, for any recurrent state $s \in \mathcal{R}$ and $k \ge 1$, the sequence $\{V_k\}_{k=0}^\infty$ exhibits the rate
$$\left| \frac{1}{k}\sum_{i=1}^{k} (V_i - i J^\pi)(s) - V^\pi(s) \right| \le \frac{2\, t_{\mathrm{tar}}\, \|V^\pi\|_\infty}{k}.$$

Proof First,
$$V_k = \sum_{i=0}^{k-1} (P^\pi)^i r^\pi = \sum_{i=0}^{k-1} (P^\pi)^i \big( J^\pi + (I - P^\pi) V^\pi \big) = k J^\pi + V^\pi - (P^\pi)^k V^\pi.$$
Thus,
$$\frac{1}{k}\sum_{i=1}^{k} (V_i - i J^\pi) = V^\pi - \frac{1}{k}\sum_{i=1}^{k} (P^\pi)^i V^\pi = V^\pi + \left( P^\pi_\star - \frac{1}{k}\sum_{i=1}^{k} (P^\pi)^i \right) V^\pi,$$
where the last equality comes from $P^\pi_\star V^\pi = 0$ (Puterman, 2014, Section A.5). Then, by the canonical form of the stochastic matrix and Fact 5, we have, for $s \in \mathcal{R}$,
$$\left\| P^\pi_\star(s \mid \cdot) - \frac{1}{k}\sum_{i=1}^{k} (P^\pi)^i(s \mid \cdot) \right\|_1 \le \frac{2\, t^\pi_{\mathrm{tar}}}{k},$$
where $t^\pi_{\mathrm{tar}} = \max_{1 \le j \le m} t^{\pi,j}_{\mathrm{tar}}$.

Using the previous lemma, we establish the following sample complexity result.

Lemma 19 Let $\epsilon > 0$ and $\delta > 0$. For given $\pi \in \Pi_+$, with probability $1 - \delta$,
$$\left\| \hat{Q}^\pi(\cdot, \cdot\cdot)\,\mathbf{1}_{\mathcal{R}}(\cdot) - Q^\pi(\cdot, \cdot\cdot)\,\mathbf{1}_{\mathcal{R}}(\cdot) \right\|_\infty \le \frac{(2 t_{\mathrm{tar}} + 1)\,\|V^\pi\|_\infty}{H' + 1} + \frac{(H' + 2)\,\|K^\pi - \hat{K}^\pi\|_\infty}{2} + \epsilon,$$
with sample complexity
$$N' H' |\mathcal{S}||\mathcal{A}| = O\left( \frac{2 R^2 H' (H' + 2)^2 |\mathcal{S}||\mathcal{A}|}{\epsilon^2} \log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} \right).$$

Proof Let
$$\hat{Q}_j(s_0, a_0) = \frac{1}{H' + 1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big( r_j(s_i, a_i) - \hat{K}^\pi(s_i, a_i) \big).$$
First, we can decompose as follows:
$$\left| \frac{1}{N'}\sum_{j=1}^{N'} \hat{Q}_j(s_0, a_0) - Q^\pi(s_0, a_0) \right| \le \left| \frac{1}{N'}\sum_{j=1}^{N'} \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - \hat{K}^\pi(s_i,a_i)\big) - \frac{1}{N'}\sum_{j=1}^{N'} \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) \right| + \left| \frac{1}{N'}\sum_{j=1}^{N'} \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) - \mathbb{E}^\pi\left[ \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) \right] \right| + \left| \mathbb{E}^\pi\left[ \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) \right] - Q^\pi(s_0, a_0) \right|.$$
For the first term,
$$\left| \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(K^\pi(s_i,a_i) - \hat{K}^\pi(s_i,a_i)\big) \right| \le \frac{(H'+2)\,\|K^\pi - \hat{K}^\pi\|_\infty}{2}.$$
For the last term,
$$\left| Q^\pi(s_0,a_0) - \mathbb{E}^\pi\left[ \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) \right] \right| = \left| (P V^\pi + r - K^\pi)(s_0,a_0) - \mathbb{E}^\pi\left[ \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) \right] \right| = \left| P V^\pi(s_0,a_0) - \mathbb{E}^\pi\left[ \frac{1}{H'+1}\sum_{h=1}^{H'}\sum_{i=1}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big) \right] \right| = \left| P V^\pi(s_0,a_0) - P\left( \frac{1}{H'+1}\sum_{i=1}^{H'} (V_i - i J^\pi) \right)(s_0,a_0) \right| = \left| \frac{1}{H'+1} P V^\pi(s_0,a_0) - \frac{H'}{H'+1} P\left( \frac{1}{H'}\sum_{i=1}^{H'} (V_i - i J^\pi) - V^\pi \right)(s_0,a_0) \right| \le \frac{(2 t^\pi_{\mathrm{tar}} + 1)\,\|V^\pi\|_\infty}{H'+1},$$
where the last inequality follows from Lemma 18.

For the second term, let $Q'_j = \frac{1}{H'+1}\sum_{h=0}^{H'}\sum_{i=0}^{h} \big(r_j(s_i,a_i) - K^\pi(s_i,a_i)\big)$. By the Hoeffding inequality with $N' = \frac{2R^2(H'+2)^2}{\epsilon^2}\log\frac{2}{\delta}$,
$$\mathrm{Prob}\left( \left| \frac{1}{N'}\sum_{j=1}^{N'} Q'_j(s_0,a_0) - \mathbb{E}\left[ \frac{1}{N'}\sum_{j=1}^{N'} Q'_j(s_0,a_0) \right] \right| \le \epsilon \right) \ge 1 - \delta.$$
By a union bound over $(s,a) \in \mathcal{S}\times\mathcal{A}$, we conclude the result.

We are now ready to prove Theorem 11.

Proof In Lemma 19, let $H' = \frac{3(2 t^\pi_{\mathrm{tar}} + 1)\,\|V^\pi\|_\infty}{\epsilon'}$ and $\epsilon = \frac{\epsilon'}{3}$.
Then we have sample complexity
$$N' H' |\mathcal{S}||\mathcal{A}| = \frac{2 R^2 H' (H'+2)^2 |\mathcal{S}||\mathcal{A}|}{(\epsilon'/3)^2}\log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} = O\left( \frac{R^2 (t^\pi_{\mathrm{tar}})^3 \|Q^\pi\|_\infty^3 |\mathcal{S}||\mathcal{A}|}{(\epsilon')^5}\log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} \right),$$
where we used $\|V^\pi\|_\infty \le \|Q^\pi\|_\infty$. Also, by Lemma 17, with sample complexity
$$N H |\mathcal{S}||\mathcal{A}| = \frac{32\, |\mathcal{S}||\mathcal{A}|\, (\|V^\pi\|_\infty + R)\, R^2}{\big( \tfrac{2\epsilon'}{3(H'+2)} \big)^3}\log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} = O\left( \frac{(t^\pi_{\mathrm{tar}})^3 \|Q^\pi\|_\infty^4 R^3 |\mathcal{S}||\mathcal{A}|}{(\epsilon')^6}\log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} \right),$$
we can ensure $\|K^\pi - \hat{K}^\pi\|_\infty \le \frac{2\epsilon'}{3(H'+2)}$. This implies
$$\left\| \hat{Q}^\pi(\cdot,\cdot\cdot)\,\mathbf{1}_{\mathcal{R}}(\cdot) - Q^\pi(\cdot,\cdot\cdot)\,\mathbf{1}_{\mathcal{R}}(\cdot) \right\|_\infty \le \frac{\epsilon'}{3} + \frac{\epsilon'}{3} + \frac{\epsilon'}{3} = \epsilon'.$$
Therefore, combining the previous results, with probability $1-\delta$ we have $\|G^\pi - \hat{G}^\pi\|_\infty \le \epsilon'$ with total sample complexity
$$O\left( \frac{R^3 (t^\pi_{\mathrm{tar}})^3 \|Q^\pi\|_\infty^4 |\mathcal{S}||\mathcal{A}|}{(\epsilon')^6}\log\frac{2|\mathcal{S}||\mathcal{A}|}{\delta} \right).$$

D.2. Proof of Lemma 12

Denote by $\mathcal{F}_t = \sigma(\{s_i : 0 \le i \le t\})$ the natural filtration generated by the sampling process of the Markov chain. First, we prove the following lemma.

Lemma 20 Let $s_0 \in \mathcal{R}_i$ and $\pi \in \Pi_+$. For given deterministic $t, n > 0$,
$$\mathrm{Prob}\big( (t^i_{\mathrm{cov}} - t)_+ \ge n \mid \mathcal{F}_t \big) \le \frac{\mathbb{E}\big[(t^i_{\mathrm{cov}} - t)_+ \mid \mathcal{F}_t\big]}{n} \le \frac{t^{\pi,i}_{\mathrm{cov}}}{n}.$$

Proof The first inequality is Markov's inequality. For the second inequality, define a new cover time $t'_{\mathrm{cov}}$ starting from $t$, i.e., $t'_{\mathrm{cov}} = \inf\{u \ge 0 : \{s_j\}_{j=t}^{u+t-1} = \mathcal{R}_i\}$. Then $t^i_{\mathrm{cov}} - t \le t'_{\mathrm{cov}}$ almost surely by the definition of the cover time, and this implies
$$\mathbb{E}\big[(t^i_{\mathrm{cov}} - t)_+ \mid \mathcal{F}_t\big] \le \mathbb{E}[t'_{\mathrm{cov}} \mid \mathcal{F}_t] = \mathbb{E}[t'_{\mathrm{cov}} \mid s_t] \le t^{\pi,i}_{\mathrm{cov}}.$$

From Lemma 20 and the definition of $t^\pi_{\mathrm{cov}}$, we directly obtain the following corollary.

Corollary 21 Let $s_0 \in \mathcal{R}$ and $\pi \in \Pi_+$. For given deterministic $t, n > 0$,
$$\mathrm{Prob}\big( (t_{\mathrm{cov}} - t)_+ \ge n \mid \mathcal{F}_t \big) \le \frac{\mathbb{E}\big[(t_{\mathrm{cov}} - t)_+ \mid \mathcal{F}_t\big]}{n} \le \frac{t^\pi_{\mathrm{cov}}}{n},$$
where $t_{\mathrm{cov}} = \inf\{u \ge 0 : \{s_j\}_{j=0}^{u-1} = \mathcal{R}_i \text{ for some } i\}$.

Now, we prove Lemma 12.

Proof Suppose $s_0$ is a transient state.
Let $k = \lceil t_{1/2,\pi}\log(1/\delta) \rceil$ so that $\|(\bar{T}^\pi)^k\|_\infty < \delta$ by the definition of the transient half-life; this implies that with probability $1-\delta$, $s_k$ is recurrent.

Now, suppose $s_0$ is a recurrent state. Then, by Markov's inequality and $\mathbb{E}[t_{\mathrm{cov}} \mid s_0 \in \mathcal{R}] \le t^\pi_{\mathrm{cov}}$,
$$\mathrm{Prob}(t_{\mathrm{cov}} \ge 3 t^\pi_{\mathrm{cov}} \mid s_0 \in \mathcal{R}) \le \mathrm{Prob}(t_{\mathrm{cov}} \ge e\, t^\pi_{\mathrm{cov}} \mid s_0 \in \mathcal{R}) \le e^{-1}.$$
Then we have $\mathrm{Prob}(t_{\mathrm{cov}} \ge 3k t^\pi_{\mathrm{cov}} \mid s_0 \in \mathcal{R}) \le e^{-k}$, since, by induction,
$$\mathrm{Prob}(t_{\mathrm{cov}} \ge 3k t^\pi_{\mathrm{cov}} \mid s_0 \in \mathcal{R}) = \mathbb{E}\big[ \mathbf{1}_{t_{\mathrm{cov}} \ge 3(k-1)t^\pi_{\mathrm{cov}}}\, \mathrm{Prob}\big(t_{\mathrm{cov}} \ge 3k t^\pi_{\mathrm{cov}} \mid \mathcal{F}_{3(k-1)t^\pi_{\mathrm{cov}}}\big) \mid s_0 \in \mathcal{R} \big] = \mathbb{E}\big[ \mathbf{1}_{t_{\mathrm{cov}} \ge 3(k-1)t^\pi_{\mathrm{cov}}}\, \mathrm{Prob}\big((t_{\mathrm{cov}} - 3(k-1)t^\pi_{\mathrm{cov}})_+ \ge 3 t^\pi_{\mathrm{cov}} \mid \mathcal{F}_{3(k-1)t^\pi_{\mathrm{cov}}}\big) \mid s_0 \in \mathcal{R} \big] \le e^{-1}\, \mathbb{E}\big[ \mathbf{1}_{t_{\mathrm{cov}} \ge 3(k-1)t^\pi_{\mathrm{cov}}} \mid s_0 \in \mathcal{R} \big] \le e^{-k},$$
where the second-to-last inequality follows from Corollary 21. Therefore, $\mathrm{Prob}(t_{\mathrm{cov}} \le 3k t^\pi_{\mathrm{cov}} \mid s_0 \in \mathcal{R}) \ge 1 - \delta$ with sample complexity $\lceil 3 t^\pi_{\mathrm{cov}}\log\frac{1}{\delta} \rceil$.

Define $A = \{s_{M_1} \in \mathcal{R}\}$ and $B = \big\{ \cup_{i=M_1}^{M_1+M_2-1}\{s_i\} = \mathcal{R}_j \text{ for some } j \big\}$. Then,
$$\mathrm{Prob}(A \cap B) = \mathrm{Prob}(A)\,\mathrm{Prob}(B \mid A) = \mathrm{Prob}(A)\,\mathrm{Prob}(B \mid s_{M_1} \in \mathcal{R}) \ge \Big(1 - \frac{\delta}{2}\Big)\Big(1 - \frac{\delta}{2}\Big) \ge 1 - \delta,$$
with sample complexity $M_1 + M_2$, where the second-to-last inequality comes from the previous arguments.

D.3. Inexact α-clipped policy mirror ascent

First, we consider the inexact policy mirror ascent
$$\pi_{k+1}(\cdot \mid s) = \operatorname*{argmax}_{p \in \mathcal{M}_\alpha(\mathcal{A})} \Big\{ \eta_k \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(s, a)\, p(a) - D\big(p(\cdot), \pi_k(\cdot \mid s)\big) \Big\}, \quad \forall s \in \mathcal{S},$$
where $\hat{G}^{\pi_k}$ is an inexact evaluation of $G^{\pi_k}$. We first study the convergence properties of inexact policy mirror ascent under the following assumption.

Assumption 1 The inexact evaluations $\hat{G}^{\pi_k}$ satisfy $\|\hat{G}^{\pi_k} - G^{\pi_k}\|_\infty \le \epsilon$ for all $k \ge 0$.

We first prove the following key lemma, a counterpart of Lemma 8.
Lemma 22 For given $\pi_0 \in \Pi_\alpha$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, the inexact $\alpha$-clipped policy mirror ascent with step sizes $\eta_k > 0$ generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying
$$\sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(s, a)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \le 0, \quad \forall s \in \mathcal{S},$$
and $J^{\pi_k}_\mu - J^{\pi_{k+1}}_\mu \le 2 B_\alpha \epsilon$.

Proof The first inequality directly follows from the same arguments as in Lemma 15. By Lemma 3,
$$J^{\pi_k}_\mu - J^{\pi_{k+1}}_\mu = \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) = \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) + \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} \big(G^{\pi_k}(a, s) - \hat{G}^{\pi_k}(a, s)\big)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big).$$
For the second term,
$$\sum_{a \in \mathcal{A}} \big(G^{\pi_k}(a, s) - \hat{G}^{\pi_k}(a, s)\big)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \le \big\|G^{\pi_k} - \hat{G}^{\pi_k}\big\|_\infty \big\|\pi_k(\cdot \mid s) - \pi_{k+1}(\cdot \mid s)\big\|_1 \le 2\big\|\hat{G}^{\pi_k} - G^{\pi_k}\big\|_\infty \le 2\epsilon,$$
where the first inequality is Hölder's inequality and the last inequality follows from Assumption 1. Since the first term above is nonpositive and $\max_{\pi \in \Pi_\alpha}\|\rho^\pi\|_1 \le B_\alpha$, we get the desired result.

The following fact will be used in the proof of Theorem 23.

Fact 7 (Xiao, 2022) Suppose $0 < \alpha < 1$, $b > 0$, and a nonnegative sequence $\{a_k\}$ satisfies $a_{k+1} \le \alpha a_k + b$ for all $k \ge 0$. Then, for all $k \ge 0$,
$$a_k \le \alpha^k a_0 + \frac{b}{1 - \alpha}.$$

Now, we present the convergence result of inexact policy mirror ascent.

Theorem 23 Consider a multichain MDP. Under Assumption 1, for $\pi_0 \in \Pi_\alpha$ and given $\mu \in \mathcal{M}_+(\mathcal{S})$, the inexact $\alpha$-clipped policy mirror ascent with constant step size $\eta > 0$ generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying
$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{1}{k+1}\left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha\big(J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu\big) \right) + (2 C_\alpha + k + 2)\, B_\alpha \epsilon,$$
and with adaptive step sizes $\eta_{k+1}(C_\alpha - 1) \ge \eta_k C_\alpha > 0$ generates a sequence of policies $\{\pi_k\}_{k=1}^\infty$ satisfying
$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \left(1 - \frac{1}{C_\alpha}\right)^k \left( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \frac{1}{\eta_0(C_\alpha - 1)} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) \right) + 4 C_\alpha B_\alpha \epsilon. \quad (1)$$

Proof Following the same arguments as in the proof of Theorem 9,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) + \sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}).$$
For the first term,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \ge C_\alpha \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) = C_\alpha \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) + C_\alpha \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} \big(\hat{G}^{\pi_k}(a, s) - G^{\pi_k}(a, s)\big)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) \ge C_\alpha \sum_{s \in \mathcal{S}} \rho^{\pi_{k+1}}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_k(a \mid s) - \pi_{k+1}(a \mid s)\big) - 2 C_\alpha B_\alpha \epsilon \ge C_\alpha \big(J^{\pi_k}_\mu - J^{\pi_{k+1}}_\mu\big) - 2 C_\alpha B_\alpha \epsilon,$$
where the second-to-last inequality follows from Lemma 22 and Assumption 1, and the last inequality comes from Lemma 3. For the second term,
$$\sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} \hat{G}^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) = \sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} G^{\pi_k}(a, s)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) + \sum_{s \in \mathcal{S}} \rho^{\pi_\alpha}_\mu(s) \sum_{a \in \mathcal{A}} \big(\hat{G}^{\pi_k}(a, s) - G^{\pi_k}(a, s)\big)\big(\pi_\alpha(a \mid s) - \pi_k(a \mid s)\big) \ge J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu - 2 B_\alpha \epsilon,$$
where the last inequality comes from Lemmas 15 and 3 and Assumption 1.

First, consider the constant step size.
We have
$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k) - \frac{1}{\eta_k} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}) + C_\alpha \big(J^{\pi_{k+1}}_\mu - J^{\pi_k}_\mu\big) + 2(C_\alpha + 1) B_\alpha \epsilon,$$
and summing over $k$ gives
$$\sum_{i=0}^{k} \big( J^{\pi_\alpha}_\mu - J^{\pi_i}_\mu \big) \le \frac{1}{\eta} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) - \frac{1}{\eta} D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_{k+1}) + C_\alpha \big(J^{\pi_{k+1}}_\mu - J^{\pi_0}_\mu\big) + 2(k+1)(C_\alpha + 1) B_\alpha \epsilon.$$
By Lemma 22, $-J^{\pi_k}_\mu - 2 B_\alpha \epsilon \le -J^{\pi_{k-1}}_\mu$, and this implies
$$J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{1}{k+1}\left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha \big(J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu\big) \right) + k B_\alpha \epsilon + 2(C_\alpha + 1) B_\alpha \epsilon.$$
Second, consider the adaptive step sizes. We have
$$C_\alpha \big( U_{k+1} - U_k - 2 B_\alpha \epsilon \big) + U_k \le \frac{1}{\eta_k} D_k - \frac{1}{\eta_k} D_{k+1} + 2 B_\alpha \epsilon,$$
where $U_k = J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu$ and $D_k = D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_k)$. This implies
$$C_\alpha (U_{k+1} - U_k) + U_k \le \frac{1}{\eta_k} D_k - \frac{1}{\eta_k} D_{k+1} + 2 B_\alpha (C_\alpha + 1)\epsilon.$$
Dividing both sides by $C_\alpha$ and rearranging terms with the adaptive step sizes yields
$$U_{k+1} + \frac{1}{\eta_{k+1}(C_\alpha - 1)} D_{k+1} \le \left(1 - \frac{1}{C_\alpha}\right) U_k + \frac{1}{\eta_k (C_\alpha - 1)} D_k + 4 B_\alpha \epsilon.$$
Finally, Fact 7 with $a_k = U_k + \frac{1}{\eta_k(C_\alpha - 1)} D_k$, $\alpha = 1 - \frac{1}{C_\alpha}$, and $b = 4 B_\alpha \epsilon$ leads to
$$U_k \le \left(1 - \frac{1}{C_\alpha}\right)^k \left( U_0 + \frac{1}{\eta_0(C_\alpha - 1)} D_0 \right) + 4 C_\alpha B_\alpha \epsilon.$$

D.4. Proof of Theorem 13

Proof First, consider the constant step size with $K = 2\big( D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)/\eta + C_\alpha (J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu) \big)/\epsilon'$. By Theorem 11 and a union bound over $0 \le k \le K-1$, we have $\|G^{\pi_k} - \hat{G}^{\pi_k}\|_\infty \le \frac{\epsilon'}{2(2C_\alpha + K + 2) B_\alpha}$ with sample complexity
$$K (H N + H' N') |\mathcal{S}||\mathcal{A}| = O\left( \frac{t_{\mathrm{tar}}^3 \|Q^\pi\|_\infty^4 R^9 C_\alpha^6 D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) B_\alpha^6 |\mathcal{S}||\mathcal{A}|}{\eta^6 (\epsilon')^{12}} \log\frac{2K|\mathcal{S}||\mathcal{A}|}{\delta} \right)$$
for all $0 \le k \le K-1$. Then, by Theorem 23, we have $J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{\epsilon'}{2} + \frac{\epsilon'}{2} \le \epsilon'$.

For the adaptive step sizes $\eta_{k+1}(C_\alpha - 1) \ge \eta_k C_\alpha$ with $K = \log\Big( 2\big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \tfrac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0(C_\alpha - 1)} \big)/\epsilon \Big) \,\Big/
log( C α / ( C α − 1)) , By Theorem 11 and union bound over 0 ≤ k ≤ K − 1 , G π k − ˆ G π k ∞ ≤ ϵ ′ 8 C α B α with sample complexity K ( N H + N ′ H ′ ) |S ||A| = O R 3 C 6 α B 6 α t 3 tar ∥ Q π ∥ 4 ∞ |S ||A| ( ϵ ′ ) 6 K log 2 K |S ||A| δ ! for all 0 ≤ k ≤ K − 1 . Then, by Theorem 23 , we hav e J π α µ − J π k µ ≤ ϵ ′ 2 + ϵ ′ 2 ≤ ϵ ′ . A ppendix E. W eakly communicating MDPs In this section, we assume that MDP is weakly communicating. W e will sho w that α -clipped policy mirror ascent algorithm attains an ϵ -optimal policy satisfying J ⋆ µ − J π µ ≤ ϵ (i.e., not restricted to positi ve policies) by choosing α as a function of ϵ . E.1. Continuity of J π in weakly communicating W e first argue about continuity of J π at optimal policy in weakly communicating MDP . Lemma 24 In weakly communicating MDP , the mappings π 7→ J π and π 7→ J π µ ar e continuous at optimal policies, and J ⋆ + ,µ = J ⋆ µ for any fixed µ ∈ M ( S ) . Proof F or weakly communicating MDP , optimal av erge-re ward is uniform vector ( Puterman , 2014 , Theorem 8.32). Then, in the proof of Lemma 3 , we have ( P π ⋆ ⋆ − I ) J ⋆ = 0 , and we hav e J π µ − J ⋆ µ = X s ∈R X a ∈A d π µ ( s ) π ( a | s ) − π ⋆ ( a | s ) Q π ⋆ ( s, a ) . By Holder’ s inequality , J ⋆ µ − J π µ ≤ P s ∈R d π µ ( s ) π ( · | s ) − π ⋆ ( · | s ) 1 ∥ Q π ⋆ ∥ ∞ . So lim π → π ⋆ J π µ = J ⋆ µ for π ∈ Π + . Gi ven µ ∈ M ( S ) , we say π is an ϵ -optimal policy if J ⋆ µ − J π µ ≤ ϵ . Then, Lemma 24 guarantee that we can find ϵ -optimal policy for arbitrary ϵ > 0 by only considering policies in Π + . E.2. P olicy gradient theorem in weakly communicating MDP W e first state following ke y lemma for weakly communicating MDPs. Lemma 25 Consider a weakly communicating MDP . F or every π ∈ Π + , J π = c π 1 for some c π ∈ R . 31 Proof By ( Puterman , 2014 , Proposition 8.3.1), there exist policy π such that P π has single recurrent class and trasient states. 
Then, by the definition of accessibility, for every $\pi' \in \Pi_+$, $P^{\pi'}$ has the same single recurrent class and transient states. Lastly, since $d^{\pi'} = \mathbf{1} (g^{\pi'})^\top$, where $g^{\pi'}$ is the unique stationary distribution of $P^{\pi'}$, we conclude that $J^{\pi'} = \big( (g^{\pi'})^\top r^{\pi'} \big) \mathbf{1}$.

Applying Lemma 25 to the proof of Lemma 3, we directly obtain the following performance difference lemma for weakly communicating MDPs.

Corollary 26 Consider a weakly communicating MDP. For $\pi, \pi' \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$,
$$
J^\pi_\mu - J^{\pi'}_\mu = \sum_{s \in \mathcal{R}} \sum_{a \in \mathcal{A}} d^\pi_\mu(s) \big( \pi(a \mid s) - \pi'(a \mid s) \big) Q^{\pi'}(s, a).
$$

Again, applying Lemma 25 to the proof of Theorem 5, we directly obtain the following policy gradient theorem for weakly communicating MDPs.

Corollary 27 Consider a weakly communicating MDP. For $\pi_\theta \in \Pi_+$ and $\mu \in \mathcal{M}_+(\mathcal{S})$,
$$
\nabla_\theta J^{\pi_\theta}_\mu = \sum_{s \in \mathcal{R}} \sum_{a \in \mathcal{A}} d^{\pi_\theta}_\mu(s) \nabla_\theta \pi_\theta(a \mid s) Q^{\pi_\theta}(s, a).
$$

E.3. Convergence of policy mirror ascent in weakly communicating MDPs

If the MDP is weakly communicating, we can explicitly define an iteration count for finding an $\epsilon$-optimal policy, with the value of $\alpha$ chosen as a function of $\epsilon$.

We first present the sublinear convergence rate of the $\alpha$-clipped policy mirror ascent with constant step sizes.

Corollary 28 Consider a weakly communicating MDP. For any given $\epsilon \in (0, 1)$, set $\alpha = \frac{\epsilon}{2 (|\mathcal{A}| + 1) \| Q^{\pi^\star} \|_\infty}$. Then, for $\pi_0 \in \Pi_\alpha$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, $\alpha$-clipped policy mirror ascent with constant step size $\eta$ generates $\epsilon$-optimal policies for
$$
k \ge \frac{2}{\epsilon} \left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu \big) \right).
$$

Proof Following the proof of Lemma 24, if $\alpha = \frac{\epsilon}{2 (|\mathcal{A}| + 1) \| Q^{\pi^\star} \|_\infty}$, we have
$$
J^\star_\mu - J^\pi_\mu = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} d^\pi_\mu(s) \big( \pi^\star(a \mid s) - \pi(a \mid s) \big) Q^{\pi^\star}(s, a) \le (|\mathcal{A}| + 1) \alpha \| Q^{\pi^\star} \|_\infty \le \frac{\epsilon}{2}.
$$
By Theorem 9, when $k \ge \frac{2}{\epsilon} \big( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha ( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu ) \big)$, we have $J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{\epsilon}{2}$, which concludes the proof.

Next, we present the linear convergence rate of the $\alpha$-clipped policy mirror ascent with adaptive step sizes.
Corollary 29 Consider a weakly communicating MDP. For any given $\epsilon \in (0, 1)$, set $\alpha = \frac{\epsilon}{2 (|\mathcal{A}| + 1) \| Q^{\pi^\star} \|_\infty}$. Then, for $\pi_0 \in \Pi_\alpha$ and $\mu \in \mathcal{M}_+(\mathcal{S})$, $\alpha$-clipped policy mirror ascent with step sizes $(C_\alpha - 1) \eta_{k+1} \ge C_\alpha \eta_k > 0$ generates $\epsilon$-optimal policies for
$$
k \ge \frac{ \log\!\Big( 2 \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \tfrac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0 (C_\alpha - 1)} \big) / \epsilon \Big) }{ \log\!\big( C_\alpha / (C_\alpha - 1) \big) }.
$$

Proof Following the proof of Corollary 28, let $\alpha = \frac{\epsilon}{2 (|\mathcal{A}| + 1) \| Q^{\pi^\star} \|_\infty}$. Then there exists $\pi \in \Pi_\alpha$ such that $\| \pi^\star - \pi \|_\infty \le \alpha$. Also, we have $J^\star_\mu - J^\pi_\mu \le \frac{\epsilon}{2}$. By Theorem 10, when
$$
k \ge \frac{ \log\!\Big( 4 \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \tfrac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0 (C_\alpha - 1)} \big) / \epsilon \Big) }{ \log\!\big( C_\alpha / (C_\alpha - 1) \big) },
$$
we have $J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{\epsilon}{2}$, which concludes the proof.

To the best of our knowledge, Corollaries 28 and 29 are the first convergence results for policy gradient methods in average-reward weakly communicating MDPs.

E.4. Sample complexity of policy mirror ascent in weakly communicating MDPs

If the MDP is weakly communicating, we can explicitly provide the sample complexity of finding an $\epsilon$-optimal policy. Note that for $\pi \in \Pi_+$, $\| d^\pi_\mu \|_1 = B_\alpha = 1$ in a weakly communicating MDP.

Corollary 30 Consider a weakly communicating MDP. Let $\epsilon > 0$, $\delta > 0$, and set $\alpha = \frac{\epsilon}{2 (|\mathcal{A}| + 1) \| Q^{\pi^\star} \|_\infty}$. For any $\pi, \pi_0 \in \Pi_\alpha$ and given $\mu \in \mathcal{M}_+(\mathcal{S})$, with probability $1 - \delta$, stochastic $\alpha$-clipped policy mirror ascent with constant step size $\eta > 0$ and
$$
K = \frac{4}{\epsilon} \left( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu \big) \right)
$$
generates an $\epsilon$-optimal policy with sample complexity
$$
\widetilde{O}\left( \frac{t_{\mathrm{tar}}^3 \| Q^\pi \|_\infty^4 R^9 C_\alpha^6 D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) |\mathcal{S}||\mathcal{A}|}{\eta^6 \epsilon^{12}} \right),
$$
and with adaptive step sizes $\eta_{k+1}(C_\alpha - 1) \ge \eta_k C_\alpha > 0$ and
$$
K = \frac{ \log\!\Big( 4 \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \tfrac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0 (C_\alpha - 1)} \big) / \epsilon \Big) }{ \log\!\big( C_\alpha / (C_\alpha - 1) \big) }
$$
generates an $\epsilon$-optimal policy with sample complexity
$$
\widetilde{O}\left( \frac{t_{\mathrm{tar}}^3 \| Q^\pi \|_\infty^4 R^3 C_\alpha^6 |\mathcal{S}||\mathcal{A}|}{\epsilon^6} \right),
$$
where $\widetilde{O}$ ignores all logarithmic factors, $\| Q^\pi \|_\infty = \max_{0 \le k \le K-1} \| Q^{\pi_k} \|_\infty$, and $t_{\mathrm{tar}} = \max_{0 \le k \le K-1} t^{\pi_k}_{\mathrm{tar}}$.

Proof Following the proof of Corollary 28, let $\alpha = \frac{\epsilon}{2 (|\mathcal{A}| + 1) \| Q^{\pi^\star} \|_\infty}$. Then there exists $\pi \in \Pi_\alpha$ such that $\| \pi^\star - \pi \|_\infty \le \alpha$. Also, we have $J^\star_\mu - J^\pi_\mu \le \frac{\epsilon}{2}$.

First, for the constant step size, by Theorem 13 with $K = \frac{4}{\epsilon} \big( \frac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta} + C_\alpha ( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu ) \big)$, we have $J^{\pi_\alpha}_\mu - J^{\pi_k}_\mu \le \frac{\epsilon}{2}$ with sample complexity
$$
K (H N + H' N') |\mathcal{S}||\mathcal{A}| = O\left( \frac{t_{\mathrm{tar}}^3 \| Q^\pi \|_\infty^4 R^9 C_\alpha^6 D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0) |\mathcal{S}||\mathcal{A}|}{\eta^6 \epsilon^{12}} \log \frac{2 K |\mathcal{S}||\mathcal{A}|}{\delta} \right).
$$
Then, we have $J^\star_\mu - J^{\pi_k}_\mu \le \frac{\epsilon}{2} + \frac{\epsilon}{2} \le \epsilon$.

For the adaptive step sizes $\eta_{k+1}(C_\alpha - 1) \ge \eta_k C_\alpha$ with
$$
K = \frac{ \log\!\Big( 4 \big( J^{\pi_\alpha}_\mu - J^{\pi_0}_\mu + \tfrac{D_{\rho^{\pi_\alpha}_\mu}(\pi_\alpha, \pi_0)}{\eta_0 (C_\alpha - 1)} \big) / \epsilon \Big) }{ \log\!\big( C_\alpha / (C_\alpha - 1) \big) },
$$
the same argument applies with sample complexity
$$
K (H N + H' N') |\mathcal{S}||\mathcal{A}| = O\left( \frac{R^3 C_\alpha^6 t_{\mathrm{tar}}^3 \| Q^\pi \|_\infty^4 |\mathcal{S}||\mathcal{A}| K}{\epsilon^6} \log \frac{2 K |\mathcal{S}||\mathcal{A}|}{\delta} \right).
$$
Then, we have $J^\star_\mu - J^{\pi_k}_\mu \le \frac{\epsilon}{2} + \frac{\epsilon}{2} \le \epsilon$.

To the best of our knowledge, Corollary 30 gives the first sample complexity results for policy gradient methods in average-reward weakly communicating MDPs.

Appendix F. Projection algorithms

Algorithm 4 Euclidean projection onto $\mathcal{M}_\alpha(\mathcal{A})$
Input: $q \in \mathbb{R}^d$, $\alpha \in [0, 1/d]$
Sort $q$ into $q'$ satisfying $q'_1 \ge q'_2 \ge \cdots \ge q'_d$
Find $\rho = \max\big\{ 1 \le j \le d : q'_j + \frac{1}{j} \big( 1 - d\alpha - \sum_{i=1}^{j} q'_i \big) > 0 \big\}$
$\lambda = \frac{1}{\rho} \big( 1 - d\alpha - \sum_{i=1}^{\rho} q'_i \big)$
for $i = 1, \ldots, d$ do
    $p_i = \max\{ q_i + \lambda + \alpha, \alpha \}$
end for
Output: $p$

Algorithm 5 KL projection onto $\mathcal{M}_\alpha(\mathcal{A})$
Input: $w \in \mathcal{M}_+(\mathcal{S})$ where $|\mathcal{S}| = d$, $\alpha \in [0, 1/d]$
Set $W = \{1, \ldots, d\}$, $C_\# = 0$, $C_\% = 0$
while $W \neq \emptyset$ do
    Find the median $\omega$ of $\{ w_i : i \in W \}$
    $L = \{ i \in W : w_i < \omega \}$, $L_\# = |L|$, $L_\% = \sum_{i \in L} w_i$
    $M = \{ i \in W : w_i = \omega \}$, $M_\# = |M|$, $M_\% = \sum_{i \in M} w_i$
    $H = \{ i \in W : w_i > \omega \}$
    $m_0 = \frac{1 - (C_\# + L_\#) \alpha}{1 - (C_\% + L_\%)}$
    if $\omega\, m_0 < \alpha$ then
        $C_\# = C_\# + L_\# + M_\#$
        $C_\% = C_\% + L_\% + M_\%$
        $W = H$
    else
        $W = L$
    end if
end while
$m_0 = \frac{1 - C_\# \alpha}{1 - C_\%}$
for $i = 1, \ldots, d$ do
    if $w_i < \omega$ then
        $p_i = \alpha$
    else
        $p_i = m_0 w_i$
    end if
end for
Output: $p$
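Algorithm 4 is the classic sort-and-threshold simplex projection shifted to the clipped simplex $\mathcal{M}_\alpha(\mathcal{A}) = \{ p : p_i \ge \alpha, \sum_i p_i = 1 \}$: project $q - \alpha \mathbf{1}$ onto the simplex of total mass $1 - d\alpha$, then shift back by $\alpha$. A minimal NumPy sketch (the function name and the $\alpha = 1/d$ edge-case handling are our own illustration, not from the paper):

```python
import numpy as np

def project_clipped_simplex(q, alpha):
    """Euclidean projection of q onto {p : p_i >= alpha, sum_i p_i = 1},
    following Algorithm 4: sort, locate the threshold index rho, shift
    by lambda, and clip at alpha."""
    q = np.asarray(q, dtype=float)
    d = q.size
    assert 0.0 <= alpha <= 1.0 / d, "M_alpha is empty unless alpha <= 1/d"
    q_sorted = np.sort(q)[::-1]                # q'_1 >= ... >= q'_d
    csum = np.cumsum(q_sorted)
    js = np.arange(1, d + 1)
    # rho = max{ j : q'_j + (1/j)(1 - d*alpha - sum_{i<=j} q'_i) > 0 }
    cond = q_sorted + (1.0 - d * alpha - csum) / js > 0
    idx = np.nonzero(cond)[0]
    if idx.size == 0:                          # only possible when alpha == 1/d
        return np.full(d, alpha)               # M_alpha is then {uniform}
    rho = idx[-1] + 1
    lam = (1.0 - d * alpha - csum[rho - 1]) / rho
    return np.maximum(q + lam + alpha, alpha)
```

For $\alpha = 0$ this reduces to the standard Euclidean projection onto the probability simplex.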
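Algorithm 5 locates its clipping threshold by median search in expected linear time. The projection it computes has a simple structure: the $k$ smallest entries of $w$ are clipped to $\alpha$ and the rest are rescaled by $m_0 = (1 - k\alpha)/(1 - \text{clipped mass})$, where $k$ is the smallest count for which every surviving entry stays at least $\alpha$. The $O(d \log d)$ sorting-based sketch below is our own simplification of that fixed point; it drops the median-search bookkeeping of Algorithm 5, and the function name is illustrative:

```python
import numpy as np

def kl_project_clipped(w, alpha):
    """KL projection of a positive probability vector w onto
    {p : p_i >= alpha, sum_i p_i = 1}: clip the k smallest entries to
    alpha and rescale the rest by m0 = (1 - k*alpha)/(1 - clipped mass)."""
    w = np.asarray(w, dtype=float)
    d = w.size
    assert abs(w.sum() - 1.0) < 1e-9 and 0.0 <= alpha <= 1.0 / d
    order = np.argsort(w)                      # indices in ascending order of w
    ws = w[order]
    prefix = np.concatenate(([0.0], np.cumsum(ws)))
    k = 0
    while k < d:
        m0 = (1.0 - k * alpha) / (1.0 - prefix[k])
        if m0 * ws[k] >= alpha:                # smallest survivor is feasible
            break
        k += 1                                 # clip one more coordinate
    p = np.empty(d)
    p[order[:k]] = alpha
    p[order[k:]] = m0 * w[order[k:]]
    return p
```

By construction $\sum_i p_i = k\alpha + m_0 (1 - \text{clipped mass}) = 1$, matching the final rescaling step $m_0 = (1 - C_\# \alpha)/(1 - C_\%)$ of Algorithm 5.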