Optimal Variance-Dependent Regret Bounds for Infinite-Horizon MDPs
Guy Zamir, Matthew Zurek, and Yudong Chen
Department of Computer Sciences, University of Wisconsin–Madison
gzamir@wisc.edu, matthew.zurek@wisc.edu, yudongchen@cs.wisc.edu

Abstract

Online reinforcement learning in infinite-horizon Markov decision processes (MDPs) remains less theoretically and algorithmically developed than its episodic counterpart, with many algorithms suffering from high "burn-in" costs and failing to adapt to benign instance-specific complexity. In this work, we address these shortcomings for two infinite-horizon objectives: the classical average-reward regret and the γ-regret. We develop a single tractable UCB-style algorithm applicable to both settings, which achieves the first optimal variance-dependent regret guarantees. Our regret bounds in both settings take the form Õ(√(SA·Var) + lower-order terms), where S, A are the state and action space sizes, and Var captures cumulative transition variance. This implies minimax-optimal average-reward and γ-regret bounds in the worst case but also adapts to easier problem instances, for example yielding nearly constant regret in deterministic MDPs. Furthermore, our algorithm enjoys significantly improved lower-order terms for the average-reward setting. With prior knowledge of the optimal bias span ∥h⋆∥_sp, our algorithm obtains lower-order terms scaling as ∥h⋆∥_sp S^2 A, which we prove is optimal in both ∥h⋆∥_sp and A. Without prior knowledge, we prove that no algorithm can have lower-order terms smaller than ∥h⋆∥_sp^2 SA, and we provide a prior-free algorithm whose lower-order terms scale as ∥h⋆∥_sp^2 S^3 A, nearly matching this lower bound. Taken together, these results completely characterize the optimal dependence on ∥h⋆∥_sp in both leading and lower-order terms, and reveal a fundamental gap in what is achievable with and without prior knowledge.

1 Introduction

We study online reinforcement learning (RL) in tabular Markov decision processes (MDPs), where an agent interacts with an unknown environment and aims to maximize cumulative reward. We specifically consider infinite-horizon continuing settings, where the environment does not contain a built-in reset mechanism. Despite its practical relevance and foundational significance, online infinite-horizon RL is much less well understood theoretically than the finite-horizon episodic setting.

In this work we study two particular performance measures for infinite-horizon problems. The most classical performance objective is the average-reward regret Σ_t (ρ⋆ − r_t) introduced by the seminal work of Auer et al. [2008], which measures the instantaneous reward r_t of the agent against the optimal gain ρ⋆, which is the best long-term average reward per timestep of any policy. The reset-less nature of infinite-horizon online RL requires additional structural assumptions to permit sublinear regret bounds, such as requiring the MDP to be communicating, and these also ensure that ρ⋆ can be defined independently of an initial state. Auer et al. [2008] proved a regret lower bound of Ω(√(DSAT)), where D is the MDP diameter, S and A are the numbers of states and actions, and T is the time horizon.
Extensive research effort has since been dedicated to matching this regret lower bound and relaxing the communicativity assumptions by replacing D with ∥h⋆∥_sp, the maximum gap in cumulative rewards between any two starting states, which is smaller than D and finite even in non-communicating MDPs (e.g., Bartlett and Tewari 2012, Fruit et al. 2018, 2019, Talebi and Maillard 2018, Zhang and Xie 2023), culminating in the minimax-optimal algorithms of Zhang and Ji [2019], Boone and Zhang [2024].

| Algorithm | Õ(·) Regret | Priorless? | Burn-In Cost | Tractable? | Type |
|---|---|---|---|---|---|
| UCRL2 [Auer et al., 2008] | DS√(AT) | ✓ | N/A | ✓ | EVI |
| REGAL [Bartlett and Tewari, 2012] | ∥h⋆∥_sp S√(AT) | × | N/A | × | EVI |
| SCAL [Fruit et al., 2018] | ∥h⋆∥_sp S√(AT) | × | N/A | ✓ | EVI |
| KL-UCRL [Talebi and Maillard, 2018] | √(ST Σ_{s,a} V⋆_{s,a}) + D√T | ✓ | N/A | ✓ | EVI |
| EBF [Zhang and Ji, 2019] | √(∥h⋆∥_sp SAT) | × | (*) | × | EVI |
| UCB-AVG [Zhang and Xie, 2023] | S^5 A^2 ∥h⋆∥_sp √T | × | N/A | ✓ | UCB |
| PMEVI-DT [Boone and Zhang, 2024] | √(∥h⋆∥_sp SAT) | ✓ | ∥h⋆∥_sp^10 S^40 A^20 | ✓ | EVI |
| γ-UCB-CVI [Hong et al., 2025] | ∥h⋆∥_sp S√(AT) | × | N/A | ✓ | UCB |
| Corollary 3.6 | √(Var⋆_{1−1/T} SA) + Γ∥h⋆∥_sp SA | × | ∥h⋆∥_sp S^3 A | ✓ | UCB |
| Corollary 3.7 | √(∥h⋆∥_sp SAT) | ✓ | ∥h⋆∥_sp^2 S^3 A | ✓ | UCB |
| Lower Bound [Auer et al., 2008] | √(DSAT), implies √(∥h⋆∥_sp SAT) | — | — | — | — |

Table 1: Comparison of algorithms and regret bounds for average-reward MDPs. Here ∥h⋆∥_sp is the span of the optimal bias function and D is the diameter, which satisfy ∥h⋆∥_sp ≤ D. We always have Γ ≤ S, and Γ = 1 in deterministic MDPs. Var⋆_{1−1/T} and V⋆_{s,a} are instance-dependent variance parameters, which in particular are 0 for deterministic MDPs. Also Var⋆_{1−1/T} ≤ Õ(∥h⋆∥_sp T + ∥h⋆∥_sp^2). Only the leading terms as T → ∞ of the regret bound are shown. The burn-in cost is defined as the smallest T for which the algorithm achieves a minimax-optimal regret of Õ(√(∥h⋆∥_sp SAT)), or N/A if this does not occur. Priorless means that an algorithm does not require prior knowledge about the value of ∥h⋆∥_sp. (*) The burn-in cost of EBF [Zhang and Ji, 2019] is ∥h⋆∥_sp^6 S^4 A^4 + ∥h⋆∥_sp^6 S^6 A^2 + ∥h⋆∥_sp^3 S^12 A^3.

Another performance measure for infinite-horizon online RL is the γ-regret Σ_t ((1−γ)V⋆_γ(s_t) − r_t) introduced by Liu and Su [2021], where V⋆_γ(s_t) denotes the γ-discounted optimal value function at the state s_t encountered by the agent. While γ-regret may appear to be a weaker regret notion due to its comparison with the agent's own trajectory rather than that of an optimal policy, this feature has the advantage of enabling sublinear regret bounds without the structural assumptions needed for the average-reward setting. Furthermore, when such communicativity assumptions do hold, we show the γ-regret can actually be used to control the average-reward regret in an optimal manner, which is the approach we take in this paper (see Lemma 3.4). Recent work has developed algorithms achieving the minimax-optimal γ-regret, Õ(√(SAT/(1−γ))) [He et al., 2021, Ji and Li, 2023].

Despite all of the aforementioned algorithmic progress on the average-reward and γ-regret settings, there are several significant limitations of all existing work in both settings relative to episodic online RL.
First, existing minimax-optimal algorithms incur a large burn-in cost, meaning that they only attain the optimal regret rate when T is very large. For example, the only computationally efficient and minimax-optimal algorithm for the average-reward regret, PMEVI-DT [Boone and Zhang, 2024], has a regret bound of Õ(√(∥h⋆∥_sp SAT) + ∥h⋆∥_sp S^{5/2} A^{3/2} T^{9/20}), and thus it only matches the optimal rate for T ≥ ∥h⋆∥_sp^10 S^40 A^20. This contrasts with the episodic setting, where significant effort has been expended towards remedying such issues (e.g., Zhang et al. 2021, Zhou et al. 2023, Zhang et al. 2024). Second, prior work for infinite-horizon settings fails to adapt to easier problem instances such as low-variance or deterministic MDPs, where substantially smaller regret should be possible. In episodic RL, this gap has been addressed through variance-dependent regret bounds, which interpolate between stochastic and deterministic environments and can be optimal in both regimes [Zhou et al., 2023]. However, no optimal variance-dependent regret guarantees have been established for either infinite-horizon setting that we consider.

1.1 Contributions

In this paper, we establish the first optimal variance-dependent regret guarantees for infinite-horizon MDPs. Our main contribution is a single tractable algorithm that, in both the average-reward and γ-regret settings, attains a regret bound of the form Õ(√(Var_γ SA) + lower-order terms). We handle both of these infinite-horizon settings in a unified way by treating γ = 1 − 1/T as a tuning parameter in the average-reward case. Here Var_γ is a cumulative variance term that measures the stochasticity of the transition dynamics along the learner's trajectory. When the MDP is deterministic, we have Var_γ = 0, and thus the resulting regret is independent of T up to logarithmic factors. We also show Var_γ ≤ Õ(∥V⋆_γ∥_sp T + ∥V⋆_γ∥_sp^2), so by using that ∥V⋆_γ∥_sp ≤ 2∥h⋆∥_sp in weakly communicating MDPs [Wei et al., 2020], or simply that ∥V⋆_γ∥_sp ≤ 1/(1−γ), we obtain minimax-optimal regret bounds in the average-reward and γ-regret settings, respectively.

Focusing on the average-reward setting, another main contribution is that we significantly improve the lower-order terms relative to prior work. When given prior knowledge of the bias span ∥h⋆∥_sp, our (variance-dependent and minimax-optimal) result contains lower-order terms scaling as ∥h⋆∥_sp S^2 A, and we show via matching lower bounds that this dependence on ∥h⋆∥_sp and A is unimprovable. Additionally, without prior knowledge of ∥h⋆∥_sp, we obtain lower-order terms scaling as ∥h⋆∥_sp^2 S^3 A. We also show a surprising hardness result that no algorithm without prior knowledge of ∥h⋆∥_sp can obtain lower-order terms better than ∥h⋆∥_sp^2 SA. Taken together, our results nearly completely characterize the optimal dependence on ∥h⋆∥_sp in both the leading and lower-order terms and reveal a fundamental separation between what is achievable with and without prior knowledge.

To obtain our improved bounds for γ-regret, we develop a model-based, upper-confidence-bound (UCB) based algorithm called Fully Optimizing Clipped UCB Solver (FOCUS). We improve upon previous UCB-based algorithms for γ-regret by incorporating a sharp Bernstein-style bonus and span-clipping into our empirical Bellman operator.
Crucially, instead of performing a single step of value iteration at each update, FOCUS repeatedly applies the empirical Bellman operator until convergence. This design ensures that the Q-estimates fully exploit the collected data at each update and is essential for obtaining variance-dependent bounds. Finally, FOCUS is the first UCB-style algorithm to achieve minimax-optimal regret guarantees for the online average-reward setting, in contrast to previous optimal algorithms, which instead depend on extended value iteration (EVI).

1.2 Related Work

Here we discuss related work; we also refer to Appendix A for further discussion.

Online Average-Reward. The average-reward setting is classical and well-studied for online infinite-horizon RL. See Table 1 for a comparison of important prior work; however, given the extensive history of this problem, this table is non-exhaustive. The seminal work of Auer et al. [2008] introduces UCRL2 and establishes a regret bound of Õ(DS√(AT)), along with a lower bound of Ω(√(DSAT)). Bartlett and Tewari [2012] establish regret bounds that depend on ∥h⋆∥_sp instead of D, but their algorithm REGAL is computationally intractable and requires prior knowledge of ∥h⋆∥_sp. The computationally efficient algorithm SCAL by Fruit et al. [2018] utilizes the span-clipping technique to match the bound of REGAL. The EBF algorithm of Zhang and Ji [2019] is the first to achieve the minimax-optimal rate of Õ(√(∥h⋆∥_sp SAT)), but it is intractable and requires prior knowledge. Boone and Zhang [2024] resolve these issues by developing PMEVI-DT, which is tractable, minimax optimal, and prior-knowledge-free. Talebi and Maillard [2018] obtain variance-aware guarantees for the KL-UCRL algorithm, but their bounds remain suboptimal in the worst case and suffer a √T dependence even on deterministic MDPs. All aforementioned algorithms are based on EVI. UCB-based algorithms, which like ours also employ discounting to approximate the average-reward problem, include Wei et al. [2020], Zhang and Xie [2023], and Hong et al. [2025].

γ-regret. The γ-regret notion is introduced by Liu and Su [2021], who prove the first upper bound for this framework with their Double Q-learning algorithm. The subsequent work of He et al. [2021] proposes UCBVI-γ, a model-based algorithm with Bernstein-style bonuses, and achieves a γ-regret bound that is minimax-optimal in the leading term. Ji and Li [2023] develop Q-SlowSwitch-Adv, a model-free algorithm that matches the optimal leading term while improving the lower-order dependence on SA. More recently, Ma and Lee [2026] take a Bayesian approach with the EUBRL algorithm, which leverages epistemic uncertainty for directed exploration and achieves state-of-the-art lower-order dependence on 1/(1−γ). We note that the definition of γ-regret in our paper matches that of Liu and Su [2021], whereas He et al. [2021], Ji and Li [2023], and Ma and Lee [2026] use slightly different definitions. While there is not an immediate translation between these definitions, they are closely related, and algorithms with good guarantees for one should have good guarantees for the others. For further discussion on this issue, see Appendix A.2 of He et al. [2021] and Appendix A of Ji and Li [2023].
2 Preliminaries

MDP Basics. A Markov decision process is a tuple (S, A, P, r, µ_0),¹ where S is the state space, A is the action space, P : S × A → Δ(S) is the transition kernel with Δ(S) denoting the probability simplex over S, r : S × A → [0, 1] is the reward function, and µ_0 ∈ Δ(S) is the initial state distribution. We assume S and A are finite sets with S := |S| and A := |A|. A (stationary) policy is a mapping π : S → Δ(A). When π is a deterministic policy, we treat π as a mapping S → A. Let Π be the set of all stationary deterministic policies. An initial state distribution µ ∈ Δ(S) and a policy π induce a distribution over trajectories (s_0, a_0, s_1, a_1, ...), where s_0 ∼ µ, a_t ∼ π(s_t), and s_{t+1} ∼ P(·|s_t, a_t). We let E^π_s denote the expectation with respect to this distribution when µ satisfies µ(s) = 1. For a policy π and discount factor γ ∈ (0, 1), the discounted value function V^π_γ ∈ [0, 1/(1−γ)]^S is defined by V^π_γ(s) = E^π_s[Σ_{t=0}^∞ γ^t r_t], where r_t = r(s_t, a_t). Also define the optimal value function V⋆_γ ∈ [0, 1/(1−γ)]^S by V⋆_γ(s) = sup_{π∈Π} V^π_γ(s). We often write P_{s,a} to denote the row vector such that P_{s,a}(s′) = P(s′|s, a). Then for any V ∈ R^S we have P_{s,a} V = E_{s′∼P(·|s,a)}[V(s′)]. The gain of a policy π, ρ^π ∈ [0, 1]^S, is defined by ρ^π(s) = lim_{T→∞} (1/T) E^π_s[Σ_{t=0}^{T−1} r_t]. We define the optimal gain ρ⋆ = sup_{π∈Π} ρ^π. The bias function of a policy π, h^π ∈ R^S, is h^π(s) = C-lim_{T→∞} E^π_s[Σ_{t=0}^{T−1} (r_t − ρ^π(s_t))], where C-lim denotes the Cesàro limit. The optimal bias function is h⋆ := h^{π⋆}, where π⋆ is a Blackwell-optimal policy, which satisfies V^{π⋆}_γ = V⋆_γ for all sufficiently large γ. The diameter of an MDP is D = max_{s_1 ≠ s_2} inf_{π∈Π} E^π_{s_1}[η_{s_2}], where η_s denotes the hitting time of a state s ∈ S. An MDP is (strongly) communicating if its diameter is finite; that is, any state is reachable from any other state under some policy. An MDP is weakly communicating if the states can be partitioned into two disjoint subsets S = S_1 ∪ S_2 such that all states in S_1 are transient under all policies, and within S_2 any state is reachable from any other state under some policy. In weakly communicating MDPs, the optimal gain ρ⋆ is constant and thus, with a slight abuse of notation, treated as a scalar. We say an MDP is deterministic if the transition probabilities P(·|s, a) are one-hot vectors, i.e., from each state-action pair the agent transits to a certain state with probability 1. We also define Γ := max_{(s,a)∈S×A} |supp(P(·|s, a))|.

Online RL and Regrets. The learner interacts with the MDP for T steps, starting from a state s_1 ∼ µ_0. At each step t = 1, ..., T, the learner at state s_t chooses an action a_t and observes the next state s_{t+1} ∼ P(·|s_t, a_t). The learner aims to maximize the total reward Σ_{t=1}^T r(s_t, a_t) it receives. We consider two different notions of regret, which measure the disparity between the learner's reward and that of an optimal policy. Given a discount factor γ ∈ (0, 1), define Regret_γ(T) := Σ_{t=1}^T ((1−γ)V⋆_γ(s_t) − r(s_t, a_t)). When the MDP is weakly communicating, further define Regret(T) := Σ_{t=1}^T (ρ⋆ − r(s_t, a_t)). Throughout the paper, γ-regret or the discounted setting refers to Regret_γ(T), and regret or the average-reward setting refers to Regret(T).
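As a concrete reading of these two definitions, the following is a minimal sketch (ours, not the authors' code) of how both regret notions could be evaluated on a logged trajectory when the true MDP is known; the helper names and the toy two-state instance below are our own illustrative choices.

```python
import numpy as np

def optimal_discounted_values(P, r, gamma, tol=1e-10):
    """Standard value iteration for the optimal discounted value function V*_gamma.
    P has shape (S, A, S) with P[s, a] the row P_{s,a}; r has shape (S, A)."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = (r + gamma * (P @ V)).max(axis=1)   # max_a { r(s,a) + gamma * P_{s,a} V }
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def discounted_regret(P, r, gamma, states, actions):
    """gamma-regret of a logged trajectory: sum_t ((1 - gamma) V*_gamma(s_t) - r(s_t, a_t))."""
    V_star = optimal_discounted_values(P, r, gamma)
    return sum((1 - gamma) * V_star[s] - r[s, a] for s, a in zip(states, actions))

def average_reward_regret(r, rho_star, states, actions):
    """Average-reward regret sum_t (rho* - r(s_t, a_t)), given the (constant) optimal gain rho*."""
    return sum(rho_star - r[s, a] for s, a in zip(states, actions))

# Toy two-state, two-action MDP (ours): action 0 stays put, action 1 switches states.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0
P[0, 1, 1] = P[1, 1, 0] = 1.0
r = np.array([[0.5, 0.0], [1.0, 0.0]])
print(discounted_regret(P, r, 0.99, states=[0, 1, 1], actions=[1, 0, 0]))
print(average_reward_regret(r, rho_star=1.0, states=[0, 1, 1], actions=[1, 0, 0]))
```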
To be precise, the regret and γ-regret are both functions of the underlying MDP and the trajectory s_1, a_1, ..., s_T, a_T. The trajectory is random, with a distribution determined by the MDP, the learning algorithm, and the time horizon T. Often we only explicitly write T as a parameter of the regret because the MDP and learning algorithm are clear from context. In situations where the underlying MDP P and learning algorithm Alg are not obvious, we write Regret(T, P, Alg).

Burn-In Cost. We say that an algorithm achieves a burn-in cost of f(∥h⋆∥_sp, S, A) if for all MDPs M and any T ≥ f(∥h⋆_M∥_sp, S_M, A_M), the regret can be bounded (with high probability) by Õ(√(∥h⋆_M∥_sp S_M A_M T)), where h⋆_M, S_M, and A_M are the optimal bias function, state space size, and action space size, respectively, of M. We use lower-order terms to refer to any of the additive terms in the regret bound besides the leading term, which is typically √(Var⋆_{1−1/T} SA) or √(∥h⋆∥_sp SAT). Note that the lower-order terms may sometimes dominate the regret, including when the MDP is deterministic so that Var⋆_{1−1/T} = 0. To illustrate the difference between burn-in cost and lower-order terms, consider an algorithm that obtains a regret bound of Õ(√(∥h⋆∥_sp SAT) + ∥h⋆∥_sp S^2 A). Then the algorithm has a lower-order term of ∥h⋆∥_sp S^2 A and achieves a burn-in cost of ∥h⋆∥_sp S^3 A. There is a conversion, albeit a lossy one, between the two notions. If an algorithm has a burn-in cost of f, then its regret can be bounded by Õ(√(∥h⋆∥_sp SAT) + f(∥h⋆∥_sp, S, A)), because for large T the leading term dominates, and for small T the regret is at most T < f(∥h⋆∥_sp, S, A). If an algorithm achieves a regret of Õ(√(∥h⋆∥_sp SAT) + f(∥h⋆∥_sp, S, A)), then the algorithm has a burn-in cost of f(∥h⋆∥_sp, S, A)^2/(∥h⋆∥_sp SA), since for T larger than this quantity the leading term dominates.

¹ Sometimes we consider classes of MDPs with the same S, A, r, and µ_0. In this case we simply denote the MDP by its transition kernel P.

Additional Notation. Let [m] denote {1, ..., m} for any positive integer m. Let 0, 1 be the all-zero and all-one vectors. For x ∈ R^S, define the span semi-norm ∥x∥_sp := max_{s∈S} x(s) − min_{s∈S} x(s). For x, y ∈ R^n, we define V(x, y) := Σ_{i=1}^n x_i y_i^2 − (Σ_{i=1}^n x_i y_i)^2. Observe that when x is a probability vector, V(x, y) is the variance of a random variable that takes value y_i with probability x_i. For H ≥ 0, we define the clipping operator Clip_H : R^S → R^S by (Clip_H(V))(s) = min{V(s), min_{s′∈S} V(s′) + H}. We also define the maximum operator M : R^{S×A} → R^S by (MQ)(s) = max_{a∈A} Q(s, a). The notation O(·) hides constant factors. The notation Õ(·) hides constant factors and possible polylog factors of S, A, T, and 1/δ.

3 Main Results

In this section, we first describe our algorithm and discuss improvements over prior work. We then define our variance-dependent term Var⋆_γ and relate it to other relevant quantities. Next, we provide our main bound for γ-regret and implications for the discounted setting. We then reduce the average-reward setting to the discounted setting and derive optimal bounds on the regret with and without prior knowledge.
Finally, we state lower bounds showing that our lower-order dependence on ∥h⋆∥_sp is optimal in both the prior-knowledge and prior-free settings.

3.1 Algorithm

Algorithm 1: Fully Optimizing Clipped UCB Solver (FOCUS)
Input: run time T ≥ 1, discount factor γ ∈ (0, 1), failure probability δ ∈ (0, 1), span clipping parameter H ≥ 1
  k ← 1, U ← log(1/δ′), where δ′ = δ/(9 S^2 A T^2)
  N(s, a) ← 0, N(s, a, s′) ← 0 for all (s, a, s′) ∈ S × A × S
  Q̂_1(s, a) ← 1/(1−γ) for all (s, a) ∈ S × A
  Observe s_1
  for t = 1, ..., T do
    Take action a_t ∈ argmax_{a∈A} Q̂_k(s_t, a) and observe s_{t+1}
    N(s_t, a_t) ← N(s_t, a_t) + 1, N(s_t, a_t, s_{t+1}) ← N(s_t, a_t, s_{t+1}) + 1
    if N(s_t, a_t) = 2^i for some integer i ≥ 0 then
      k ← k + 1, ε_k ← 1/(t(1−γ))
      N_k(s, a) ← N(s, a), N_k(s, a, s′) ← N(s, a, s′) for all (s, a, s′)
      P̂^k_{s,a,s′} ← N_k(s, a, s′)/N_k(s, a) for all (s, a, s′) such that N_k(s, a) > 0
      P̂^k_{s,a,s′} ← 1/S for all (s, a, s′) such that N_k(s, a) = 0
      Compute Q̂_k = T̂_k^{(m)}(0), where m = ⌈(1/(1−γ)) log((1 + 32HU)/(ε_k(1−γ)))⌉
    end
  end

We present our algorithm, Fully Optimizing Clipped UCB Solver (FOCUS), in Algorithm 1. The algorithm uses a model-based approach: it keeps track of state-action visitation counts and uses these to maintain an empirical estimate of the transition kernel. The algorithm runs through episodes, with a new episode starting when the number of visits to a state-action pair doubles. At the start of the k-th episode, the algorithm updates the empirical transition kernel P̂^k, then uses a clipped optimistic value iteration procedure to compute Q̂_k, which is an optimistic estimate of Q⋆. At each time step t through the remainder of the episode, the algorithm observes state s_t ∈ S and takes a greedy action a_t ∈ argmax_{a∈A} Q̂_k(s_t, a).

FOCUS uses the following clipped optimistic empirical Bellman operators T̂_k. Fixing the episode number k, for any Q ∈ R^{S×A}, define
    T̂_k(Q)(s, a) := r(s, a) + γ P̂^k_{s,a} Clip_H(MQ) + γ b(s, a, Clip_H(MQ)),
where
    b(s, a, V) := max{ 4 √( V(P̂^k_{s,a}, V) U / max{N_k(s, a), 1} ), 32 H U / max{N_k(s, a), 1} }
for V ∈ R^S, and U, H, γ are defined within Algorithm 1.

We now discuss the key features of T̂_k. First, it uses span-clipping, which ensures that all value estimates have span bounded by the parameter H. Second, the operator incorporates a sharp, Bernstein-style bonus, similar to that of the MVP algorithm [Zhang et al., 2021]. Lastly, rather than performing a one-step value iteration as in UCBVI-γ [He et al., 2021] or γ-UCB-CVI [Hong et al., 2025], FOCUS fully solves for the fixed point of T̂_k by iteratively applying it until convergence. This final ingredient is essential for obtaining a bound tight enough to perform our average-to-discount reduction, for reasons that we further discuss in Section 4.

Computational Complexity. We note that FOCUS is computationally tractable because we design the empirical Bellman operators T̂_k to be γ-contractions. Consequently, the iterates converge geometrically to the fixed point of the operator, and only Õ(1/(1−γ)) iterations suffice.
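To make this fixed-point computation concrete, here is a minimal NumPy sketch (ours, not the authors' implementation) of one application of the clipped optimistic operator and of iterating it. Note that Algorithm 1 applies the operator a prescribed number m of times starting from 0, whereas this sketch simply iterates until a tolerance is met; the function names and the stopping rule are our own choices.

```python
import numpy as np

def clip_span(V, H):
    """Clip_H(V): cap each entry at min(V) + H, so the output has span at most H."""
    return np.minimum(V, V.min() + H)

def bellman_clipped_ucb(Q, P_hat, r, N_k, gamma, H, U):
    """One application of the clipped optimistic empirical Bellman operator T_hat_k.
    P_hat has shape (S, A, S); r and N_k have shape (S, A)."""
    V = clip_span(Q.max(axis=1), H)                      # Clip_H(M Q)
    n = np.maximum(N_k, 1)                               # max{N_k(s,a), 1}
    PV = P_hat @ V                                       # P_hat_{s,a} V for every (s, a)
    var = np.maximum(P_hat @ (V ** 2) - PV ** 2, 0.0)    # V(P_hat_{s,a}, V), clamped at 0
    bonus = np.maximum(4.0 * np.sqrt(var * U / n), 32.0 * H * U / n)
    return r + gamma * (PV + bonus)

def solve_episode_Q(P_hat, r, N_k, gamma, H, U, eps):
    """Iterate T_hat_k starting from Q = 0 until successive iterates are eps-close.
    The operator is a gamma-contraction, so O(log(1/eps)/(1 - gamma)) iterations suffice."""
    Q = np.zeros_like(r, dtype=float)
    while True:
        Q_next = bellman_clipped_ucb(Q, P_hat, r, N_k, gamma, H, U)
        if np.max(np.abs(Q_next - Q)) <= eps:
            return Q_next
        Q = Q_next
```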
Additionally, although the doubling trick helps replace a factor of T with SA log(T) in the computational complexity, its main purpose is to remove a factor of 1/(1−γ) from the regret bound, for a technical reason that we explain further in Section 4.

We now consider the computational complexity of FOCUS. By the doubling rule, there are at most O(SA log T) episodes. At the start of each episode, we update our empirical counts and transition kernel, which takes O(S^2 A) time. We then run O((1/(1−γ)) log T) steps of value iteration, and since each step of value iteration takes O(S^2 A) time, this takes a total of O((S^2 A/(1−γ)) log T) time. So across all episodes, the total run time of FOCUS is O((S^3 A^2/(1−γ)) (log T)^2). For the average-reward setting where we set γ = 1 − 1/T, the run time is Õ(S^3 A^2 T).

3.2 Variance Parameters

Next, we introduce the main variance-dependent term that will later appear in our regret bounds.

Definition 3.1. Let γ ∈ (0, 1). For a particular MDP with transition kernel P and trajectory {s_t, a_t}_{t∈[T]}, define the cumulative variance as
    Var⋆_γ := Σ_{t=1}^T V(P_{s_t,a_t}, V⋆_γ).

One can interpret Var⋆_γ as a measure of stochasticity in an MDP. When the transition probabilities P(·|s, a) are more concentrated on states of similar value, Var⋆_γ will generally be smaller. If the MDP is completely deterministic, it is easy to see that Var⋆_γ = 0 for any trajectory. For a more detailed discussion of related variance parameters (in the episodic setting), see Zhou et al. [2023].

Existing variance-dependent guarantees for infinite-horizon MDPs (i.e., Talebi and Maillard 2018) involve terms similar to T·V̄⋆_γ, where V̄⋆_γ := max_{s,a} V(P_{s,a}, V⋆_γ) is the maximum per-step variance. Our bounds will instead depend on Var⋆_γ, which is a random but much sharper quantity. It is easy to see that Var⋆_γ ≤ T·V̄⋆_γ. Furthermore, while T·V̄⋆_γ can be as large as T∥V⋆_γ∥_sp^2, the following lemma shows that Var⋆_γ scales with T∥V⋆_γ∥_sp. Thus this lemma can be used to derive minimax-optimal span-based bounds from our variance-dependent bounds in both the discounted and average-reward settings, extending a successful approach from offline settings (e.g., Zurek and Chen [2025b]) to the variance measure Var⋆_γ relevant for online learning. We defer the proof to Appendix E.

Lemma 3.2. For δ ∈ (0, 1), we have with probability 1 − δ that Var⋆_γ ≤ O((∥V⋆_γ∥_sp T + ∥V⋆_γ∥_sp^2) log(T/δ)).

3.3 Main Results for Discounted Setting

We now present our main theorem on the performance of FOCUS in the discounted setting.

Theorem 3.3 (Variance-Dependent γ-Regret Bound). Let T ≥ 1, γ ∈ (0, 1), δ ∈ (0, 1). For any H ≥ ∥V⋆_γ∥_sp, Algorithm 1 with input (T, γ, δ, H) achieves, with probability at least 1 − δ,
    Regret_γ(T) ≤ Õ(√(SA·Var⋆_γ) + ΓHSA).

We provide a complete proof in Appendix D. Theorem 3.3 establishes the first variance-dependent γ-regret bound. Unlike prior works that fail to exploit easier environments and thereby necessarily scale with √T, the leading term of our bound depends on Var⋆_γ, which as previously mentioned captures the stochasticity of transition dynamics in the MDP. Consequently, our bound interpolates between stochastic and deterministic environments and is significantly sharper in the latter case.
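To make the quantity driving this interpolation concrete, the following small sketch (our own illustration, not part of the paper) evaluates Var⋆_γ from Definition 3.1 along a logged trajectory, taking the true kernel and the optimal value vector as inputs.

```python
import numpy as np

def cumulative_variance(P, V_star, states, actions):
    """Var*_gamma = sum_t V(P_{s_t, a_t}, V*_gamma) along a logged trajectory.
    P has shape (S, A, S); V_star is the optimal gamma-discounted value vector."""
    total = 0.0
    for s, a in zip(states, actions):
        p = P[s, a]                                       # transition row P_{s,a}
        total += p @ (V_star ** 2) - (p @ V_star) ** 2    # V(p, V_star)
    return total
```

On a deterministic MDP every row p is one-hot, so every summand is zero and Var⋆_γ = 0, matching the discussion above.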
To illustrate these improvements, we consider the implications when one has prior knowledge of the span ∥V⋆_γ∥_sp. In this case, one can set H = ∥V⋆_γ∥_sp to obtain a regret bound of Õ(√(SA·Var⋆_γ) + Γ∥V⋆_γ∥_sp SA). When the MDP is deterministic, the γ-regret is T-independent up to logarithmic factors, scaling as Õ(∥V⋆_γ∥_sp SA). For stochastic MDPs, the leading term Õ(√(∥V⋆_γ∥_sp SAT)) matches the minimax lower bound for γ-regret. Although this rate was previously achieved by He et al. [2021] and Ji and Li [2023], their bounds depend on 1/(1−γ) in both the leading term and the lower-order terms. Now, ∥V⋆_γ∥_sp can be as large as 1/(1−γ), but it has the potential to be bounded independently of γ, such as in weakly communicating MDPs where ∥V⋆_γ∥_sp ≤ 2∥h⋆∥_sp [Wei et al., 2020].

We remark that while Theorem 3.3 is the first explicit span-dependent bound for γ-regret, the analysis in Hong et al. [2025] can be slightly modified to yield a span-dependent bound of Õ(∥V⋆_γ∥_sp S√(AT) + S/(1−γ)) (see Appendix B for details). However, this result is not minimax optimal and still has a lower-order dependence on 1/(1−γ). Their algorithm also requires prior knowledge of the span. Theorem 3.3, on the other hand, shows that we can attain a span-based bound without prior knowledge. In particular, by setting H = 1/(1−γ), we attain a bound of Õ(√(∥V⋆_γ∥_sp SAT) + S^2 A/(1−γ)). We can even remove the lower-order dependence on 1/(1−γ) by setting H = √(T/(S^3 A)), similar to Corollary 3.7 below.

3.4 Main Results for Average-Reward Setting

In this section we use our results from the discounted setting with a properly tuned γ to approximate the average-reward setting. That is, by setting the discount factor γ to be large enough, the γ-regret bound achieved by FOCUS implies an optimal variance-dependent regret bound. Our reduction hinges on the following lemma. The proof follows in a straightforward manner from standard average-to-discounted reduction techniques [Wei et al., 2020, Zurek and Chen, 2025a], and we defer the complete proof to Appendix F.

Lemma 3.4. Suppose the MDP is weakly communicating. For any T ≥ 1 and γ ∈ (0, 1), it holds that
    Regret(T) ≤ (1−γ)∥V⋆_γ∥_sp T + Regret_γ(T).

Lemma 3.4 decomposes the regret into the γ-regret and an approximation error term (1−γ)∥V⋆_γ∥_sp T. By choosing γ close to 1, this term becomes negligible, and the regret is bounded by the γ-regret. This observation reinforces the notion that a span-based γ-regret bound is crucial for deriving optimal guarantees in the average-reward setting. Indeed, when γ is close to 1, applying Lemma 3.4 to a γ-regret bound that scales with 1/(1−γ) instead of the span would imply a vacuous regret bound. We can now state our main theorem for the average-reward setting. It follows by combining Theorem 3.3 and Lemma 3.4, taking γ = 1 − 1/T, and using the fact that ∥V⋆_γ∥_sp ≤ 2∥h⋆∥_sp for all γ ∈ (0, 1).

Theorem 3.5 (Variance-Dependent Regret Bound). Suppose the MDP is weakly communicating. Let T ≥ 1 and δ ∈ (0, 1). For any H ≥ 2∥h⋆∥_sp, Algorithm 1 with input (T, γ = 1 − 1/T, δ, H) achieves, with probability at least 1 − δ,
    Regret(T) ≤ Õ(√(Var⋆_{1−1/T} SA) + ΓHSA).

Theorem 3.5 establishes the first minimax-optimal variance-dependent regret bound for the average-reward setting.
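To spell out this combination (the following arithmetic is our own unpacking of the two stated results, not an excerpt from the paper's proof): instantiate Lemma 3.4 with γ = 1 − 1/T, bound the approximation term by ∥V⋆_{1−1/T}∥_sp ≤ 2∥h⋆∥_sp, and then apply Theorem 3.3, which is admissible because H ≥ 2∥h⋆∥_sp ≥ ∥V⋆_{1−1/T}∥_sp:

\begin{align*}
\operatorname{Regret}(T)
&\le (1-\gamma)\,\|V^\star_\gamma\|_{\mathrm{sp}}\,T + \operatorname{Regret}_\gamma(T)
 = \|V^\star_{1-1/T}\|_{\mathrm{sp}} + \operatorname{Regret}_{1-1/T}(T) \\
&\le 2\|h^\star\|_{\mathrm{sp}} + \widetilde{O}\Big(\sqrt{SA\,\operatorname{Var}^\star_{1-1/T}} + \Gamma H S A\Big)
 = \widetilde{O}\Big(\sqrt{SA\,\operatorname{Var}^\star_{1-1/T}} + \Gamma H S A\Big),
\end{align*}

where the additive 2∥h⋆∥_sp is absorbed into the ΓHSA term since Γ, S, A ≥ 1 and H ≥ 2∥h⋆∥_sp.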
Similar to our γ-regret result, the leading term depends on Var⋆_{1−1/T}, so the regret adapts to the stochasticity of the environment. We remark that Talebi and Maillard [2018] provide a variance-dependent regret bound, but it only implies a suboptimal Õ(DS√(AT)) regret bound in the worst case. Furthermore, their bound includes a lower-order term of Õ(D√T), which means it cannot be optimal for deterministic MDPs. Thus, as demonstrated by the following corollary, we have the first regret guarantee that is simultaneously optimal for stochastic and deterministic MDPs.

Corollary 3.6 (Regret Bound with Prior Knowledge). Suppose the MDP is weakly communicating. Let T ≥ 1 and δ ∈ (0, 1). Algorithm 1 with input (T, γ = 1 − 1/T, δ, H = 2∥h⋆∥_sp) satisfies
    Regret(T) ≤ Õ(√(Var⋆_{1−1/T} SA) + Γ∥h⋆∥_sp SA)
with probability at least 1 − δ. Consequently, Lemma 3.2 implies that with probability at least 1 − 2δ,
    Regret(T) ≤ Õ(√(∥h⋆∥_sp SAT) + ∥h⋆∥_sp S^2 A).
It follows that with probability at least 1 − 2δ, Regret(T) ≤ Õ(√(∥h⋆∥_sp SAT)), provided that T ≥ ∥h⋆∥_sp S^3 A.

Corollary 3.6 shows the optimal bounds that FOCUS attains across different regimes. When the underlying MDP is deterministic, the regret scales as Õ(∥h⋆∥_sp SA), which is optimal and T-independent up to logarithmic factors. For stochastic MDPs, the leading term is minimax optimal, while the lower-order term is significantly smaller than those incurred by existing algorithms. We later show that this ∥h⋆∥_sp S^2 A lower-order term is nearly optimal: it could be improved at most by a factor of S, to ∥h⋆∥_sp SA. Note that Corollary 3.6 applies Theorem 3.5 with span bound H = 2∥h⋆∥_sp, which requires prior knowledge of ∥h⋆∥_sp. We next consider the performance of FOCUS when we do not have prior knowledge of ∥h⋆∥_sp.

Corollary 3.7 (Regret Bound without Prior Knowledge). Suppose the MDP is weakly communicating. Let T ≥ 1 and δ ∈ (0, 1). Algorithm 1 with input (T, γ = 1 − 1/T, δ, H = √(T/(S^3 A))) satisfies, with probability at least 1 − δ,
    Regret(T) ≤ Õ(√(Var⋆_{1−1/T} SA) + √(SAT)),
provided that T ≥ ∥h⋆∥_sp^2 S^3 A. Consequently, Lemma 3.2 implies that with probability at least 1 − 2δ,
    Regret(T) ≤ Õ(√((∥h⋆∥_sp + 1) SAT)),
provided that T ≥ ∥h⋆∥_sp^2 S^3 A. It follows that with probability at least 1 − 2δ, we have
    Regret(T) ≤ Õ(√((∥h⋆∥_sp + 1) SAT) + ∥h⋆∥_sp^2 S^3 A).

We show how the last two regret bounds follow from the first in Appendix H. Corollary 3.7 shows that our algorithm achieves a significantly improved burn-in cost compared to that of previous work on priorless algorithms. Indeed, PMEVI-DT attains minimax optimality for T ≥ ∥h⋆∥_sp^10 S^40 A^20, whereas our algorithm attains minimax optimality for T ≥ ∥h⋆∥_sp^2 S^3 A. Furthermore, we prove a matching lower bound in Theorem 3.8 showing that the lower-order dependence on the bias span cannot be improved beyond ∥h⋆∥_sp^2. We remark that our choice of H is optimized for the more interesting scenario that ∥h⋆∥_sp ≥ 1 and leads to lower-order terms scaling quadratically in ∥h⋆∥_sp.
If one cares about the possibility that ∥h⋆∥_sp ≪ 1, we could instead choose H to be lower-order in T, such as H = (T/(S^3 A))^{1/(2+ε)}, so that the leading term is Õ(√(∥h⋆∥_sp SAT)) for large T, albeit with slightly worse lower-order terms.

Unlike the case with prior knowledge of ∥h⋆∥_sp, Theorem 3.5 does not immediately imply a simultaneously optimal bound for stochastic and deterministic MDPs. We claim it is still possible to obtain some form of T-independent bound for certain MDP instances. Towards this end, consider running a diameter estimation procedure (e.g., Tarbouriech et al. 2021; Tuynman et al. 2024) for at most √(T/(S^3 A)) steps. With high probability, it will either terminate within poly(DSA) steps and output D̂ satisfying D ≤ D̂ ≤ 2D, or it will not terminate, in which case we set D̂ = ∞. For the remainder of the time steps, we run Algorithm 1 with H = min{D̂, √(T/(S^3 A))}, recovering the same bound as Corollary 3.7 for stochastic MDPs and a bound of Õ(min{poly(DSA), √(SAT) + ∥h⋆∥_sp^2 S^3 A}) for deterministic MDPs, which is T-independent for strongly communicating deterministic MDPs, and still finite when D = ∞.

3.5 Lower Bounds for Average-Reward Regret

In this section we present lower bounds on the average-reward regret. Prior work of Auer et al. [2008] establishes a regret lower bound of Ω(√(∥h⋆∥_sp SAT)) when T ≥ ∥h⋆∥_sp SA,² which has since been matched by several algorithms (Zhang and Ji 2019, Boone and Zhang 2024; our Corollaries 3.6 and 3.7) when the horizon T is sufficiently large. In contrast to the large-T regime, here we focus on lower bounds applicable for all T, in order to characterize the optimal burn-in cost of any algorithm.

We begin by formalizing the definition of algorithms to which our lower bounds apply. We define a horizon-T algorithm Alg to be a function from histories of length ≤ T to distributions over actions, that is, a function ∪_{0≤t≤T} (S × (A × S)^t) → Δ(A). We note that, as defined, such an algorithm only takes as input a sequence of elements of S and A; intuitively speaking, any other data, such as the value of T or prior knowledge of ∥h⋆∥_sp, must already be "baked in" to the algorithm. Hence by our definition, an "algorithm" (for horizon T) with prior knowledge of ∥h⋆∥_sp is actually a family of horizon-T algorithms, one for each value of ∥h⋆∥_sp.

Now we present the main theorem of this section, a lower bound on the lower-order terms incurred by any algorithm that does not have prior knowledge of ∥h⋆∥_sp. Note that this also implies a lower bound on the burn-in cost.

Theorem 3.8 (Burn-In Lower Bound for Prior-Free Algorithms). There is a universal constant c ≥ 1 such that the following holds. Let S ≥ 2 and A ≥ 2 be integers. Fix α ∈ [1, 2) and a function t ↦ β_t. Suppose that T > SA(cβ_T)^{4/(2−α)} and β_T ≥ 1. Then there exist two communicating MDPs P_1 and P_2, each with S states and A actions, such that no horizon-T algorithm Alg can satisfy, for both i = 1 and i = 2,
    E[Regret(T, P_i, Alg)] ≤ √(β_T ∥h⋆_{P_i}∥_sp SAT) + β_T SA ∥h⋆_{P_i}∥_sp^α.

We provide a proof sketch in Section 4.4 and a complete proof in Appendix J.
Here we intend β_T to encapsulate Õ(1) terms; Theorem 3.8 states that for any α < 2, we can find T such that no single horizon-T algorithm can enjoy regret bounds of the form Õ(√(∥h⋆∥_sp SAT) + ∥h⋆∥_sp^α SA) simultaneously for two certain MDPs P_1 and P_2. The two MDPs have ∥h⋆_{P_1}∥_sp ≫ ∥h⋆_{P_2}∥_sp. With prior knowledge, formally a different horizon-T algorithm would be applied to each MDP P_1, P_2, and furthermore, we would not expect a horizon-T algorithm designed for MDPs with bias span ≤ ∥h⋆_{P_2}∥_sp to enjoy a nonvacuous regret bound when deployed on P_1. So in short, Theorem 3.8 is not a counterexample to the type of theorem that one proves when designing algorithms that use prior knowledge, and in particular does not contradict our Corollary 3.6. However, this lower bound does prohibit any minimax-optimal (for large T) algorithm without prior knowledge from obtaining a better ∥h⋆∥_sp dependence in its lower-order terms than ∥h⋆∥_sp^2. This is matched by our prior-knowledge-free Corollary 3.7.

Theorem 3.8 demonstrates a "price of adaptivity" for the burn-in cost, that is, a gap between what is achievable with and without prior knowledge. In particular, this gap is established by combining the above lower bound and the strictly smaller regret upper bound provided by our prior-knowledge-based Corollary 3.6, which achieves Õ(√(∥h⋆∥_sp T) + ∥h⋆∥_sp) when applied to instances with S, A ≤ O(1). Note that previous results are insufficient for establishing this gap, as the only other algorithm which uses prior knowledge and achieves minimax-optimal regret for large T has a burn-in cost scaling with ∥h⋆∥_sp^6 even when S, A ≤ O(1) [Zhang and Ji, 2019]. The only result of a similar nature for the average-reward regret of which we are aware is Fruit et al. [2019, Lemma 3], which shows an exponential lower bound on the burn-in cost but only for algorithms achieving logarithmic regret. One particularly interesting feature of the gap implied by Theorem 3.8 is that it contrasts with the simulator (a.k.a. generative model) setting, where recent work has characterized the sample complexity in terms of ∥h⋆∥_sp and demonstrated no gap between algorithms which do and do not possess prior knowledge [Wang et al., 2022, Zurek and Chen, 2025b, a].

² This lower bound was originally stated in terms of the diameter D, but for their hard instances ∥h⋆∥_sp and D differ by only a constant factor.

Next, we show a burn-in cost lower bound applicable even to algorithms with prior knowledge.

Theorem 3.9 (General Burn-In Lower Bound). Let S ≥ 2 and A ≥ 2 be integers, and let D ≥ 4⌈log_A S⌉. For any horizon-T algorithm Alg, there exists an MDP P with S states, A actions, and diameter at most D such that for all T ≤ (1/32) DSA,
    E[Regret(T, P, Alg)] ≥ T/4.

We defer the proof, which uses standard constructions, to Appendix I. We emphasize that this result applies to any horizon-T algorithm. Since D ≥ ∥h⋆∥_sp (in fact they are equal up to a constant factor for this MDP), Theorem 3.9 implies that any algorithm with a sublinear-in-T regret bound must have a burn-in requirement of T ≥ Ω(∥h⋆∥_sp SA). In particular, combining with the lower bound from Auer et al.
[2008], we see that no algorithm can have regret below Ω(√(∥h⋆∥_sp SAT) + ∥h⋆∥_sp SA), matching our Corollary 3.6 up to an additional factor of S in the additive burn-in term and Õ(1) factors. We conjecture that this lower bound is nearly tight up to a multiple of Õ(1) and that the factor of S in our upper bound could be removed, although we believe this may be very challenging, and the techniques used to do so in the inhomogeneous episodic setting [Zhang et al., 2024] would not apply in the infinite-horizon case.

4 Technical Highlights

In this section, we discuss our algorithmic and analytical contributions in the context of related work. We also include a proof sketch for Theorem 3.8.

4.1 Algorithmic Improvements over Prior UCB-Based Approaches

Our algorithm FOCUS builds on prior model-based algorithms for the discounted setting, particularly UCBVI-γ [He et al., 2021] and γ-UCB-CVI [Hong et al., 2025], but introduces several crucial modifications that enable variance-dependent bounds and eliminate extraneous dependence on 1/(1−γ). One main novelty is how the optimistic Q-estimate is updated. Previous algorithms initialize the estimate using Q̂_1 ← (1/(1−γ))·1, and then at each time step t ∈ [T] perform the following one-step value iteration:
    Q̂_{t+1}(s, a) ← r(s, a) + γ P̂^t_{s,a} V̂_t + γ b_t(s, a).
Here, V̂_t = M Q̂_t (in UCBVI-γ) or V̂_t = Clip_H(M Q̂_t) (in γ-UCB-CVI).

We first discuss the bonus term b_t(s, a). UCBVI-γ uses a Bernstein-style bonus similar to that of the UCBVI algorithm for the episodic setting [Azar et al., 2017]. UCBVI obtains a regret bound which is optimal in the leading term but suboptimal in lower-order terms. This suboptimality is due to an extra term in the bonus; indeed, in the episodic setting, Zhang et al. [2021] show a similar term to be unnecessary for optimism. Hence, UCBVI-γ incurs suboptimal lower-order terms for the same reason. The MVP algorithm [Zhang et al., 2021] removes this extra term from the bonus to achieve significantly improved lower-order terms in the episodic setting. Secondly, as mentioned above, γ-UCB-CVI uses a clipping step to ensure ∥V̂_t∥_sp ≤ H. Their analysis has steps that involve upper bounding ∥V̂_t∥_sp, and clipping allows one to replace some factors of 1/(1−γ) with H. However, γ-UCB-CVI uses a Hoeffding-style bonus, resulting in a suboptimal leading term. Still, the result of clipping is that the leading term of their γ-regret bound depends on H instead of 1/(1−γ).

In light of the discussion above, an immediate idea is to combine clipping with the sharpest Bernstein-style bonus, similar to that of the MVP algorithm. This strategy does yield an improvement, and we believe it would achieve a γ-regret of Õ(√(∥V⋆_γ∥_sp SAT) + HS^2 A + SA/(1−γ)). While this bound is better than the state of the art for γ-regret, the lower-order factor of 1/(1−γ) would prevent us from obtaining a variance-dependent or priorless span-based regret bound for the average-reward setting. Indeed, applying our average-to-discounted reduction (Lemma 3.4) to derive an optimal regret bound would require setting γ = 1 − √(SA/(HT)), and the resulting bound would be Õ(√(HSAT) + HS^2 A).
This bound is not variance-dependent, and moreover the leading term depends on the tuning parameter H instead of ∥h⋆∥_sp, which means the algorithm would only be optimal with prior knowledge of ∥h⋆∥_sp.

An intuitive explanation for the suboptimality of these prior methods is that while their estimate Q̂_t eventually converges to Q⋆ as t → ∞, for finite t the estimation error is significant when γ is close to 1, since they perform only one step of value iteration at a time. To be more precise, let T̂_t denote the empirical Bellman operator used at time t. The estimate Q̂_{t+1} may remain significantly larger than the fixed point of T̂_t, especially for small t. This results in a detrimental dependence on 1/(1−γ) in lower-order terms even if we use clipping. In particular, as γ gets closer to 1, Q̂_1 is initialized with larger entries, and it takes longer for Q̂_t to converge to the best values supported by the data. In short, value iteration takes on the order of 1/(1−γ) steps to approximately converge, while the statistical error converges to 0 at an unrelated and potentially much faster rate, especially for low-variance MDPs or average-reward settings where γ is tuned to be very large.

To address this issue, our algorithm FOCUS fully optimizes the Q-estimate at the beginning of each episode k. Concretely, the algorithm iteratively applies the empirical Bellman operator T̂_k until convergence, producing an estimate Q̂_k that fully exploits all data collected to that point. This mechanism of fully optimizing is also a feature of algorithms that utilize the EVI subroutine, suggesting that full exploitation of available data is crucial for achieving optimal span-based bounds in the average-reward setting. With this strategy, we obtain a γ-regret bound without dependence on 1/(1−γ), which allows us to solve the average-reward problem with a simple, UCB-based approach.

4.2 How Full Optimization Helps in Regret Analysis

A key technical challenge that illustrates the necessity of fully optimizing lies in controlling T_ind, a term in our regret decomposition which accounts for changes in the value estimate along the learner's trajectory. To see this, we first note that the prior work of Hong et al. [2025] bounds an analogous quantity in the analysis of γ-UCB-CVI according to
    Σ_{t=1}^T (V̂_{t−1}(s_{t+1}) − V̂_t(s_t)) ≤ Σ_{t=1}^T (V̂_{t−1}(s_{t+1}) − V̂_{t+1}(s_{t+1})) + 1/(1−γ) ≤ Σ_{s∈S} Σ_{t=1}^T (V̂_{t−1}(s) − V̂_{t+1}(s)) + 1/(1−γ) ≤ O(S/(1−γ)).
Here, the second inequality holds because the value estimates are monotonically decreasing in t, a property they enforce by taking minimums. These calculations show that the scale of this term depends on the possible range of the value estimates, which, under one-step value iteration updates, can be as large as 1/(1−γ). In contrast, by fully running the value iteration procedure, along with the use of clipping, our value estimates lie in a range of H times the number of episodes, which is only logarithmic in T due to the doubling trick. Formally, letting m be the total number of episodes and t_k be the time at the start of the k-th episode, a telescoping argument yields
    T_ind = Σ_{k=1}^m Σ_{t=t_k}^{t_{k+1}−1} (V̂_k(s_{t+1}) − V̂_k(s_t)) ≤ Σ_{k=1}^m (V̂_k(s_{t_{k+1}}) − V̂_k(s_{t_k})) ≤ mH ≤ Õ(HSA).
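To make the final step explicit (this accounting of the doubling rule is our own, not a line quoted from the paper's appendix): a new episode begins only when some pair (s, a) has its visit count hit a power of 2, so each pair can start at most ⌊log_2 N_T(s, a)⌋ + 1 ≤ log_2 T + 1 episodes, and therefore

\[
m \;\le\; 1 + \sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} \big(\lfloor \log_2 N_T(s,a) \rfloor + 1\big)
\;\le\; 1 + SA(\log_2 T + 1),
\qquad\text{so}\qquad
T_{\mathrm{ind}} \;\le\; mH \;\le\; \widetilde{O}(HSA).
\]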
4.3 Comparison to EVI-Based Approaches

The two existing minimax-optimal algorithms for the average-reward setting, EBF [Zhang and Ji, 2019] and PMEVI-DT [Boone and Zhang, 2024], are both based on EVI. The way these algorithms refine previous EVI-based approaches suggests that obtaining an optimal span-dependent regret requires exploiting structural information encoded in the optimal bias function h⋆. We elaborate on this point by examining the state-of-the-art PMEVI-DT algorithm, whose name reflects two central modifications to standard EVI. We remark that EBF is an earlier attempt in this direction, in that it estimates bias differences to shrink the confidence set, but it incorporates this restriction through a step that is not efficiently computable.

At each episode, PMEVI-DT runs a "BiasEstimation" subroutine to construct a confidence set for h⋆. The extended Bellman operator used in EVI is then combined with a "projection" step (the P in PMEVI-DT) that constrains the possible models to those with optimal bias in the confidence set. Their extended Bellman operator also incorporates a "mitigation" step (the M in PMEVI-DT) that uses the bias confidence region to tighten a Bernstein-type variance constraint. Despite this machinery, PMEVI-DT still requires either prior knowledge of ∥h⋆∥_sp or a condition like T ≥ ∥h⋆∥_sp^5 to achieve minimax-optimal regret.

Our results show that under the same type of assumptions, our UCB-based algorithm sufficiently exploits h⋆ without explicitly estimating it. The span-clipping component of the empirical Bellman update replaces the projection and mitigation steps of PMEVI-DT with a simple, easily interpretable operation. Specifically, span-clipping prevents the value estimate from being overly optimistic; a smaller clipping threshold H reduces exploration and increases immediate exploitation of known high-reward actions. This span-clipping technique was introduced by Fruit et al. [2018] as part of the EVI-based algorithm SCAL. While SCAL obtains a span-based regret bound, it does not maintain sharp confidence regions, and consequently its regret suffers from extra factors of ∥h⋆∥_sp and S. The more involved bias estimation techniques of EBF and PMEVI-DT circumvent these issues and produce sharp confidence regions with bounded bias spans, but our algorithm achieves optimal regret bounds with the simpler combination of span-clipping and a sharp Bernstein-style bonus.

Finally, given the success of UCB-based approaches in the episodic setting, we remark that it is at least somewhat surprising that such algorithms have yet to be thoroughly studied in the infinite-horizon average-reward setting. We suggest two contributing factors. First, our approach utilizes an average-to-discounted reduction, a strategy which only recently has been shown to yield optimal span-based bounds [Zurek and Chen, 2025b, a]. In these works, which derive an optimal span-based bound via an average-to-discounted reduction, a crucial step is a tight analysis of variance-dependent quantities to remove factors of 1/(1−γ). The analogous step in our results is Lemma 3.2, which was vital in successfully applying the reduction. Secondly, since the seminal work of Auer et al.
[2008], the most well-studied and successful algorithms for the online infinite-horizon average-reward setting have been EVI-based, so the most natural route for obtaining a minimax-optimal algorithm was to refine these existing works.

4.4 Proof Sketch for Theorem 3.8

We sketch the construction underlying Theorem 3.8, which shows that without prior knowledge of the bias span, any algorithm must incur a burn-in cost of order ∥h⋆∥_sp^2 SA. Consider the MDPs P_1 and P_2 in Figure 1. The MDPs, which both have two states and two actions, are nearly identical. For both we let state 1 be the initial state. In state 1 of both MDPs, the stay action yields reward 1/2 and remains in state 1, and the leave action yields reward 0 and has a small probability (1/B, for a parameter B > 2) of transiting to state 2. Furthermore, in both MDPs the leave action in state 2 yields reward 0 and transits to state 1. The difference between P_1 and P_2 is the stay action in state 2. In both MDPs it yields the maximum reward 1, but in P_1 it remains in state 2, while in P_2 it transits back to state 1. It follows that in P_1, state 2 is an absorbing high-reward region; the optimal policy is to reach state 2 and stay there for an average reward of 1. In P_2, on the other hand, state 2 offers no long-term benefit, and the optimal policy is to stay in state 1 for an average reward of 1/2. Additionally, a direct calculation confirms that the span of the optimal bias function in P_1 is B (reflecting the long delay before being able to collect reward 1), while that of P_2 is 1/2.

Now, suppose a learning algorithm has no structural information about the underlying MDP and promises a sublinear regret bound on P_1. To achieve sublinear regret on P_1, the learner clearly must reach state 2; otherwise, the learner would incur regret of at least 1/2 per time step. Moreover, reaching state 2 requires taking the leave action B times in expectation. On P_2, however, these exploration attempts are wasteful, since leave yields reward 0 with no potential to recoup reward in state 2. Yet, since P_1 and P_2 only differ on the stay action in state 2, even when the true MDP is P_2, the algorithm must still reach state 2 and collect data there to guard against the possibility that the MDP is actually P_1. Note that this deduction is only valid without prior knowledge, because with knowledge that the bias span is at most 1/2, the possibility of P_1 could be eliminated without reaching state 2, since ∥h⋆_{P_1}∥_sp > 1/2.

[Figure 1: An example of the MDPs used in the proof of Theorem 3.8. Each state-action pair is annotated with its reward. If the transition associated with a state-action pair is deterministic, it is denoted with a solid arrow; if it is stochastic, it is represented as a solid line splitting into multiple dashed arrows to different states, each annotated with the associated transition probability. The MDPs are parameterized by B > 2, both have starting state 1, and differ only in the transition distribution of the stay action of state 2. In P_1 an optimal stationary policy traverses to state 2 and stays there, while in P_2 an optimal stationary policy remains in state 1.]
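As a numerical companion to Figure 1 (our own sketch; the paper's argument is purely analytic), the following builds P_1 and P_2 for a given B and checks their optimal gains and bias spans by value iteration with γ close to 1, using that (1−γ)max_s V⋆_γ(s) ≈ ρ⋆ and ∥V⋆_γ∥_sp → ∥h⋆∥_sp as γ → 1 in these communicating instances. The check is only approximate for any fixed γ < 1.

```python
import numpy as np

def two_state_instance(B, absorbing_state_2):
    """P_1 (absorbing_state_2=True) or P_2 (False); action 0 = stay, action 1 = leave."""
    P = np.zeros((2, 2, 2))
    r = np.zeros((2, 2))
    # State 1 (index 0): stay earns 1/2 and stays; leave earns 0 and reaches state 2 w.p. 1/B.
    P[0, 0, 0] = 1.0
    r[0, 0] = 0.5
    P[0, 1, 1] = 1.0 / B
    P[0, 1, 0] = 1.0 - 1.0 / B
    # State 2 (index 1): stay earns 1 (absorbing in P_1, returns to state 1 in P_2);
    # leave earns 0 and returns to state 1.
    P[1, 0, 1 if absorbing_state_2 else 0] = 1.0
    r[1, 0] = 1.0
    P[1, 1, 0] = 1.0
    return P, r

def vi(P, r, gamma, tol=1e-9):
    """Plain value iteration for V*_gamma (same routine as in the earlier sketches)."""
    V = np.zeros(P.shape[0])
    while True:
        V_new = (r + gamma * (P @ V)).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

B, gamma = 20.0, 1.0 - 1e-3
for name, absorbing in [("P1", True), ("P2", False)]:
    V = vi(*two_state_instance(B, absorbing), gamma)
    print(name, "gain ~", round((1 - gamma) * V.max(), 3), "span ~", round(V.max() - V.min(), 3))
# Prints approximately: P1 gain ~ 1.0 with span ~ 19.6 (tending to B as gamma -> 1);
# P2 gain ~ 0.5 with span ~ 0.5, matching the description above.
```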
Thus, any prior-knowledge-free algorithm which achieves less than T/2 regret on P_1 must incur at least Ω(B) regret on P_2. We summarize the preceding argument in the following intermediate result.

Lemma 4.1 (Simplified Version of Theorem J.1). There is a universal constant c ∈ (0, 1) so that the following holds. Fix T, B, and let Alg be any horizon-T algorithm. There exist two MDPs P_1 and P_2 such that ∥h⋆_{P_1}∥_sp = B, ∥h⋆_{P_2}∥_sp = 1/2, and
    E_{P_1}[Regret(T)] < T/4  ⟹  E_{P_2}[Regret(T)] ≥ cB.

We now sketch how Theorem 3.8 follows from Lemma 4.1, noting that in this discussion we ignore factors of S and A for ease of presentation. Let α ∈ [1, 2) and β ≥ 1 be arbitrary. For a sufficiently large T (specifically, T satisfying (β^2/c^2)(T^{3/4} + T^{α/2}) < T/4 and √(T/2) + 1/2 < √T), suppose towards a contradiction that there exists a horizon-T algorithm that obtains a β(√(∥h⋆∥_sp T) + ∥h⋆∥_sp^α) regret bound in expectation for any MDP with SA = O(1). We then choose B = (β/c)√T and let the MDPs P_1 and P_2 be as in Lemma 4.1. Next, we compute that on P_1 the algorithm obtains regret
    E_{P_1}[Regret(T)] ≤ β(√(BT) + B^α) ≤ β(√(T·(β/c)√T) + ((β/c)√T)^α) ≤ (β^2/c^2)(T^{3/4} + T^{α/2}) < T/4,
while on P_2 the algorithm obtains regret
    E_{P_2}[Regret(T)] ≤ β(√(T/2) + (1/2)^α) < β√T = cB,
which contradicts Lemma 4.1.

We emphasize that the role of no prior knowledge is implicit but crucial. The theorem fixes a single horizon-T algorithm and evaluates it on two MDPs with very different optimal bias spans. If prior knowledge were available, a different horizon-T algorithm could be deployed on each instance, and the above contradiction would not arise. Prior work (e.g., Fruit et al. [2018]) demonstrates how prior knowledge allows the learner to aggressively utilize span-clipping and exploit earlier, thereby avoiding unnecessary exploration. Our hard MDP instances illustrate that, in general, such exploitation is impossible without prior knowledge. Indeed, without prior knowledge the algorithm must explore significantly longer to perform optimally on instances with large bias span, but this exploration results in worse burn-in cost on instances with small bias span.

5 Conclusion

We developed the first algorithm for both average-reward regret and γ-regret that is simultaneously minimax-optimal and variance-dependent. Additionally, our average-reward regret bounds have optimal lower-order dependence on ∥h⋆∥_sp, and we proved lower bounds which reveal a fundamental gap in what is achievable with and without prior knowledge. One open problem is to eliminate the Γ factor from the lower-order terms of Theorems 3.3 and 3.5, which has recently been done in the inhomogeneous episodic setting [Zhang et al., 2024] but appears more challenging in infinite-horizon settings.

Acknowledgment

G. Zamir, M. Zurek, and Y. Chen acknowledge support by National Science Foundation grants CCF-2233152 and DMS-2023239, by a Cisco Systems Fellowship, and by a Vilas Associates Award.

References

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2008.
URL https://papers.nips.cc/paper_files/paper/2008/hash/ e4a6222cdb5b34375400904f03d8e6a5- Abstract.html . Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax Regret Bounds for Reinforcement Learning. In Pr o c e e dings of the 34th International Confer enc e on Machine L e arning , pages 263–272. PMLR, July 2017. URL https://proceedings.mlr.press/v70/azar17a.html . ISSN: 2640-3498. P eter L. Bartlett and Ambuj T ew ari. REGAL: A Regularization based Algorithm for Reinforcement Learning in W eakly Communicating MDPs, May 2012. URL . Victor Bo one and Zihan Zhang. Ac hieving tractable minimax optimal regret in a verage rew ard mdps. In A. Glob erson, L. Mack ey , D. Belgrav e, A. F an, U. Paquet, J. T omczak, and C. Zhang, editors, A dvanc es in Neur al Information Pr o c essing Systems , v olume 37, pages 26728–26769. Curran Associates, Inc., 2024. doi: 10.52202/079017- 0840. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 2f0bb736ccc8551ef5bcc9165c2a4d9e- Paper- Conference.pdf . Ronan F ruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcemen t Learning, July 2018. URL 04020 . arXiv:1802.04020 [cs, stat]. Ronan F ruit, Matteo Pirotta, and Alessandro Lazaric. Near Optimal Exploration-Exploitation in Non- Comm unicating Marko v Decision Pro cesses, Marc h 2019. URL . arXiv:1807.02373 [cs, stat]. Germano Gabbianelli, Gergely Neu, Nnek a Ok olo, and Matteo P apini. Offline Primal-Dual Reinforcement Learning for Linear MDPs, Ma y 2023. URL . arXiv:2305.12944 [cs]. Jiafan He, Dongruo Zhou, and Quanquan Gu. Nearly Minimax Optimal Reinforcement Learning for Discoun ted MDPs. In A dvanc es in Neur al Information Pr o c essing Systems , v olume 34, pages 22288– 22300. Curran Asso ciates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/hash/ bb57db42f77807a9c5823bd8c2d9aaef- Abstract.html . Kih yuk Hong, W o o jin Chae, Y ufan Zhang, Dabeen Lee, and Am buj T ewari. Reinforcement Learning for Infinite-Horizon A v erage-Reward Linear MDPs via Approximation b y Discounted-Rew ard MDPs, Marc h 2025. URL . arXiv:2405.15050 [stat]. 14 Xiang Ji and Gen Li. Regret-Optimal Mo del-F ree Reinforcemen t Learning for Discoun ted MDPs with Short Burn-In Time. A dvanc es in Neur al Information Pr o c essing Systems , 36:80674– 80689, Decem b er 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ ff887781480973bd3cb6026feb378d1e- Abstract- Conference.html . Y ujia Jin and Aaron Sidford. Efficiently Solving MDPs with Sto c hastic Mirror Descent, August 2020. URL http://arxiv.org/abs/2008.12776 . Y ujia Jin and Aaron Sidford. T o wards Tigh t Bounds on the Sample Complexit y of A v erage-reward MDPs, June 2021. URL . arXiv:2106.07046 [cs, math]. Jongmin Lee, Mario Brav o, and Roberto Cominetti. Near-Optimal Sample Complexit y for MDPs via An- c horing, June 2025. URL . arXiv:2502.04477 [math]. Tianjiao Li, F eiyang W u, and Guanghui Lan. Sto c hastic First-Order Metho ds for A v erage-Reward Mark ov Decision Pro cesses. Mathematics of Op er ations R ese ar ch , December 2024. ISSN 0364-765X. doi: 10.1287/ mo or.2022.0241. URL https://pubsonline.informs.org/doi/full/10.1287/moor.2022.0241 . Sh uang Liu and Hao Su. Regret Bounds for Discoun ted MDPs, Ma y 2021. URL 2002.05138 . arXiv:2002.05138 [cs]. Jianfei Ma and W ee Sun Lee. Eubrl: Epistemic uncertain ty directed ba yesian reinforcemen t learning, 2026. URL . Andreas Maurer and Massimiliano P ontil. 
Empirical Bernstein Bounds and Sample V ariance Penalization, July 2009. URL . arXiv:0907.3740 [stat]. Gergely Neu and Nnek a Okolo. Dealing with un b ounded gradients in stochastic saddle-p oin t optimization, June 2024. URL . arXiv:2402.13903 [cs, math, stat] version: 2. Asuman Ozdaglar, Sarath P attathil, Jia wei Zhang, and Kaiqing Zhang. Offline Reinforcemen t Learning via Linear-Programming with Error-Bound Induced Constraints, Decem b er 2024. URL abs/2212.13861 . arXiv:2212.13861 [cs]. Charles Chapman Pugh. R e al mathematic al analysis . Undergraduate texts in mathematics. Springer, Cham Heidelb erg, 2. ed edition, 2015. ISBN 978-3-319-17770-0. Mohammad Sadegh T alebi and Odalric-Ambrym Maillard. V ariance-aw are regret bounds for undiscoun ted reinforcemen t learning in mdps. In Firdaus Janoos, Mehryar Mohri, and Karthik Sridharan, editors, Pr o c e e dings of A lgorithmic L e arning The ory , volume 83 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 770–805. PMLR, 07–09 Apr 2018. URL https://proceedings.mlr.press/v83/talebi18a.html . Jean T arb ouriec h, Matteo Pirotta, Mic hal V alk o, and Alessandro Lazaric. A prov ably efficien t sample col- lection strategy for reinforcemen t learning. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P .S. Liang, and J. W ortman V aughan, editors, A dvanc es in Neur al Information Pr o c essing Systems , v olume 34, pages 7611–7624. Curran Asso ciates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/ paper/2021/file/3e98410c45ea98addec555019bbae8eb- Paper.pdf . A drienne T uynman, Rémy Degenne, and Emilie Kaufmann. Finding go od p olicies in av erage-rew ard marko v decision pro cesses without prior kno wledge. In A dvanc es in Neur al Information Pr o c essing Systems , vol- ume 37, pages 109948–109979, 2024. Jinghan W ang, Mengdi W ang, and Lin F. Y ang. Near Sample-Optimal Reduction-based Policy Learning for A v erage Reward MDP, December 2022. URL . [cs]. Shengb o W ang, Jose Blanc het, and P eter Glynn. Optimal Sample Complexit y for A verage Rew ard Marko v Decision Processes, F ebruary 2024. URL . 15 Chen-Y u W ei, Mehdi Jafarnia Jahromi, Haip eng Luo, Hiteshi Sharma, and Rahul Jain. Mo del-free Rein- forcemen t Learning in Infinite-horizon A verage-rew ard Mark ov Decision Processes. In Pr o c e e dings of the 37th International Confer enc e on Machine L e arning , pages 10170–10180. PMLR, Nov ember 2020. URL https://proceedings.mlr.press/v119/wei20c.html . ISSN: 2640-3498. Andrea Zanette and Emma Brunskill. Tigh ter Problem-Dep enden t Regret Bounds in Reinforcement Learning without Domain Knowledge using V alue F unction Bounds, Nov ember 2019. URL 1901.00210 . arXiv:1901.00210 [cs, stat] v ersion: 4. Zihan Zhang and Xiangyang Ji. Regret Minimization for Reinforcement Learning b y Ev aluat- ing the Optimal Bias F unction. In A dvanc es in Neur al Information Pr o c essing Systems , v ol- ume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/hash/ 9e984c108157cea74c894b5cf34efc44- Abstract.html . Zihan Zhang and Qiaomin Xie. Sharp er mo del-free reinforcement learning for av erage-reward marko v decision pro cesses. In Gergely Neu and Lorenzo Rosasco, editors, Pr o c e e dings of Thirty Sixth Confer enc e on L e arning The ory , v olume 195 of Pr o c e e dings of Machine L e arning R ese ar ch , pages 5476–5477. PMLR, 12–15 Jul 2023. URL https://proceedings.mlr.press/v195/zhang23b.html . Zihan Zhang, Xiangyang Ji, and Simon Du. Is Reinforcement Learning More Difficult Than Bandits? 
A Near-optimal Algorithm Escaping the Curse of Horizon. In Pr o c e e dings of Thirty F ourth Confer enc e on L e arning The ory , pages 4528–4531. PMLR, July 2021. URL https://proceedings.mlr.press/v134/ zhang21b.html . ISSN: 2640-3498. Zihan Zhang, Y uxin Chen, Jason D. Lee, and Simon S. Du. Settling the sample complexit y of online reinforcemen t learning. In Pr o c e e dings of Thirty Seventh Confer enc e on L e arning The ory , pages 5213–5219. PMLR, June 2024. URL https://proceedings.mlr.press/v247/zhang24a.html . ISSN: 2640-3498. Runlong Zhou, Zhang Zihan, and Simon Shaolei Du. Sharp V ariance-Dep enden t Bounds in Reinforcement Learning: Best of Both W orlds in Sto chastic and Deterministic En vironmen ts. In Pr o c e e dings of the 40th International Confer enc e on Machine L e arning , pages 42878–42914. PMLR, July 2023. URL https: //proceedings.mlr.press/v202/zhou23t.html . ISSN: 2640-3498. Matthew Zurek and Y udong Chen. The Plug-in Approac h for A verage-Rew ard and Discoun ted MDPs: Optimal Sample Complexity Analysis, October 2024. URL . arXiv:2410.07616 [cs]. Matthew Zurek and Y udong Chen. Span-Agnostic Optimal Sample Complexit y and Oracle Inequalities for A verage-Rew ard RL. In Pr o c e e dings of Thirty Eighth Confer enc e on L e arning The ory , pages 6156–6209. PMLR, July 2025a. URL https://proceedings.mlr.press/v291/zurek25a.html . Matthew Zurek and Y udong Chen. Span-Based Optimal Sample Complexity for W eakly Comm uni- cating and General A verage Rew ard MDPs. A dvanc es in Neur al Information Pr o c essing Systems , 37:33455–33504, Jan uary 2025b. URL https://proceedings.neurips.cc/paper_files/paper/2024/ hash/3acbe9dc3a1e8d48a57b16e9aef91879- Abstract- Conference.html . Matthew Zurek, Guy Zamir, and Y udong Chen. Optimal single-policy sample complexity and transien t co verage for a verage-rew ard offline RL. In The Thirty-ninth A nnual Confer enc e on Neur al Information Pr o c essing Systems , 2025. URL https://openreview.net/forum?id=MjOf5qnEX7 . A More Related W ork Here w e discuss additional related w ork. 16 A v erage-Reward Simulator and Offline Settings A complemen tary problem setting to online RL is the offline/simulator setting, where the goal is to learn a ε -optimal policy π (suc h that ρ π ( s ) ≥ ρ ⋆ ( s ) − ε for all s ∈ S ) from a fixed/simulator-generated dataset with the minimum num b er of samples. A sequence of w orks obtained sharp er sample complexit y b ounds with relaxed assumptions on the environmen tal structure (e.g. Jin and Sidford 2020 , 2021 , W ang et al. 2022 , 2024 , Li et al. 2024 ), culminating with the optimal ∥ h ⋆ ∥ sp - based sample complexity b ound of e O ( S A ∥ h ⋆ ∥ sp ε 2 ) shown by Zurek and Chen [ 2025b ], matching a low er b ound due to W ang et al. [ 2022 ]. Ho wev er, this result required prior knowledge of ∥ h ⋆ ∥ sp , leaving the question op en of whether the optimal sample complexity could b e obtained by algorithms without prior kno wledge. After extensive research effort [ Neu and Ok olo , 2024 , T uynman et al. , 2024 , Zurek and Chen , 2024 , Lee et al. , 2025 ], this question was answ ered affirmatively b y Zurek and Chen [ 2025a ]. T uynman et al. [ 2024 ] and Zurek and Chen [ 2025b ] sho w v arious hardness results related to estimating ∥ h ⋆ ∥ sp and ∥ h ⋆ ∥ sp -based P AC guaran tees with online environmen t access. 
A very common approach throughout these w orks is to reduce the a verage-rew ard problem to a discoun ted one; obtaining sharp v ariance b ounds for the discoun ted problem, somewhat analogous to our Lemma 3.2 , plays a key role in all minimax-optimal approaches [ Zurek and Chen , 2025b , 2024 , 2025a ]. All of the aforementioned w ork is for the generative mo del setting, where a dataset with uniform cov erage can b e sampled. Recent works ha ve also studied offline settings with more general data sampling patterns [ Gabbianelli et al. , 2023 , Ozdaglar et al. , 2024 , Zurek et al. , 2025 ]. Episo dic Online RL Studying the theoretical limits of regret for episo dic online RL has b een one of the most fundamental problems in RL theory . Hence, we c annot provide a comprehensive review of work on this topic, but w e discuss a few related works with strong connections to our own. Azar et al. [ 2017 ] establishes minimax regret b ounds in the episodic setting with the UCBVI algorithm, whic h uses a Bernstein-st yle b on us. Zanette and Brunskill [ 2019 ] developed the EULER algorithm whic h obtains both minimax-optimal and v ariance-dep endent regret b ounds. Zhang et al. [ 2021 ] greatly impro ves lo wer order terms by using an ev en sharp er Bernstein-st yle b on us in their MVP algorithm. Later refinemen ts to the MVP algorithm in Zhou et al. [ 2023 ] and Zhang et al. [ 2024 ] yield optimal low er order terms and v ariance-dep enden t regret b ounds that adapt to the difficulty of the en vironment. B γ -Regret Bound of γ -UCB-CVI In this section w e show how the analysis of γ -UCB-CVI in Hong et al. [ 2025 ] can be mo dified sligh tly to obtain a span-based γ -regret b ound. The quan tities P t , V t and N t refer to the empirical transition k ernel, v alue estimate, and empirical state-action counts, resp ectiv ely , of γ -UCB-CVI at time t . β = e Θ( V ⋆ γ sp √ S ) is a constan t in the b onus term which is large enough to ensure optimism. A dditionally , in their analysis they often immediately b ound V ⋆ γ sp ≤ 2 ∥ h ⋆ ∥ sp and end up with factors of ∥ h ⋆ ∥ sp in intermediate steps. Belo w, we leav e in the factors of V ⋆ γ sp . Via a concen tration argument on | ( P t − 1 s t ,a t − P s t ,a t ) V t − 1 | (their Lemma 3), Hong et al. obtains that with high probabilit y r ( s t , a t ) ≥ V t ( s t ) − γ P s t ,a t V t − 1 − 2 β p N t − 1 ( s t , a t ) . 17 Subsequen tly , Hong et al. decompose the a verage-rew ard regret as Regret( T ) = T X t =1 ( ρ ⋆ − r ( s t , a t )) ≤ T X t =1 ρ ⋆ − V t ( s t ) + γ P s t ,a t V t − 1 + 2 β p N t − 1 ( s t , a t ) ! = T X t =1 ( ρ ⋆ − (1 − γ ) V t ( s t )) | {z } ( a ) + γ T X t =1 ( V t − 1 ( s t +1 − V t ( s t ))) | {z } ( b ) + γ T X t =1 ( P s t ,a t V t − 1 − V t − 1 ( s t +1 )) | {z } ( c ) + 2 β T X t =1 1 p N t − 1 ( s t , a t | {z } ( d ) . T o instead analyze Regret γ ( T ) = P t ((1 − γ ) V ⋆ γ ( s t ) − r ( s t , a t ) , we simply replace ρ ⋆ with (1 − γ ) V ⋆ γ ( s t ) . With this replacemen t, term ( a ) v anishes due to optimism (their Lemma 4). F or the other terms, the b ounds in the original pro of still hold. In particular, term ( b ) is b ounded by O ( S 1 − γ ) , term ( c ) is b ounded b y e O ( V ⋆ γ sp √ T ) , and term ( d ) is bounded by O ( β √ S AT ) ≤ e O ( V ⋆ γ sp S √ AT ) . Recom bining terms, w e obtain that with high probabilit y , Regret γ ( T ) ≤ e O V ⋆ γ sp S √ AT + S 1 − γ . C T ec hnical Lemmas Lemma C.1 (Bernstein’s Inequality , Theorem 3 in Maurer and P on til 2009 ) . L et Z, Z 1 , . . . , Z n b e i.i.d. 
r andom variables with values in [ c min , c max ] for some c onstants c min < c max . Set c = c max − c min . L et δ > 0 . Then we have with pr ob ability at le ast 1 − δ that E [ Z ] − 1 n n X i =1 Z i ≤ r 2V ar( Z ) log (2 /δ ) n + c log (2 /δ ) 3 n . Lemma C.2 (Empirical Bernstein’s Inequalit y , Theorem 4 in Maurer and Pon til 2009 ) . L et Z, Z 1 , . . . , Z n b e i.i.d. r andom variables with values in [ c min , c max ] for some c onstants c min < c max . Set c = c max − c min . Denote Z = 1 n P n i =1 Z i and d V ar n = 1 n P n i =1 Z i − Z 2 . L et δ > 0 . Then we have with pr ob ability at le ast 1 − δ that E [ Z ] − 1 n n X i =1 Z i ≤ s 2 d V ar n log (2 /δ ) n − 1 + 7 c log (2 /δ ) 3( n − 1) . Lemma C.3 (Lemma 22 in Zhou et al. 2023 ) . F or any two nonne gative c onstants c 1 , c 2 satisfying 2 c 2 1 ≤ c 2 , let f : ∆([ S ]) × [0 , 2 C ] S × R × R → R b e define d by f ( p, v , n, u ) = pv + max ( c 1 r V ( p, v ) u n , c 2 C u n ) . Then for al l p ∈ ∆([ S ]) , v ∈ [0 , 2 C ] S , and n, u > 0 , f is non-de cr e asing in v , i.e. f ( p, v , n, u ) ≥ f ( p, v ′ , n, u ) ∀ v , v ′ ∈ [0 , 2 C ] S satisfying v ≥ v ′ . Lemma C.4 (Lemma 19 in Zhou et al. 2023 ) . L et X b e a r andom variable with sup | X | ≤ c for some c onstant c ≥ 0 . Then V ar X 2 ≤ 4 c 2 V ar( X ) . 18 Lemma C.5. If x ≤ a √ x + b for a, b > 0 , then x ≤ 2 a 2 + 2 b . Lemma C.6 (Bernoulli’s Inequality) . L et r ≥ 1 and x ≥ − 1 . Then (1 + x ) r ≥ 1 + r x . Lemma C.7 (Rearrangemen t Inequalit y) . L et x 1 ≤ · · · ≤ x n and y 1 ≤ · · · ≤ y n b e r e al numb ers. F or every p ermutation σ of 1 , . . . , n , we have x 1 y n + · · · + x n y 1 ≤ x 1 y σ (1) + · · · + x n y σ ( n ) ≤ x 1 y 1 + · · · + x n y n . D Pro of of Theorem 3.3 W e provide an ov erview of notation used in the proof. W e let I denote the indicator function, meaning that for an ev ent E , we ha ve I ( E ) = 1 if E holds and I ( E ) = 0 otherwise. W e denote by m the num b er of episo des. F or each k ∈ [ m ] , w e write t k to denote the time at the start of the k th episo de, and we set t m +1 = T + 1 . Analogously , for each t ∈ [ T ] , we write k t to denote the current episo de at time t . b Q k is the Q-estimate used during the k th episo de, and b V k = Clip H ( M Q ) . W e will frequen tly use the fact that b V k sp ≤ H for all k ∈ [ m ] . N k ( s, a, s ′ ) and N k ( s, a ) are the counts of ( s, a, s ′ ) and ( s, a ) , resp ectiv ely , at the start of the k th episo de, and we let N k denote the length of the k th episo de. W e write n k ( s, a ) as shorthand for max { N k ( s, a ) , 1 } . F or any ( s, a ) ∈ S × A , V ∈ R S , and k ∈ [ m ] , w e let b k ( s, a, V ) denote the b on us used for the k th episode, namely b k ( s, a, V ) : = max 4 v u u t V b P k s,a , V U n k ( s, a ) , 32 H U n k ( s, a ) . F or an y ( s, a ) ∈ S × A and V ∈ R S w e define V s,a ( V ) : = V ( P s,a , V ) . F urther recall that δ ′ = δ 9 S 2 AT and U = log 1 δ ′ . It is easy to see b y the doubling trick of Algorithm 1 that the n umber of episodes is bounded as m ≤ S AU . W e also ha ve ε k = 1 t k (1 − γ ) . Pr o of. Let T ≥ 1 and γ , δ ∈ (0 , 1) b e arbitrary . F or the optimal discoun ted v alue function w e will drop the γ and simply write V ⋆ . By assumption H ≥ ∥ V ⋆ ∥ sp . As is standard in the analysis of optimistic algorithms for online RL, a key step in our analysis is to establish an optimism prop ert y , which enables the rest of our regret decomposition. 
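For concreteness, the following is a minimal sketch of how the bonus $b_k(s,a,V)$ defined above, together with the distributional variance $\mathbb{V}(p,v)$ it uses, could be computed. The function names, the NumPy representation, and the toy inputs are our own illustrative choices, not taken from Algorithm 1.

```python
import numpy as np

def dist_variance(p, v):
    """V(p, v) = p.v^2 - (p.v)^2: the variance of v(s') when s' is drawn from p."""
    return float(p @ (v ** 2) - (p @ v) ** 2)

def bonus(p_hat, v, n, U, H):
    """Bernstein-style bonus b_k(s,a,V) = max( 4*sqrt(V(P_hat, V)*U/n), 32*H*U/n ).

    p_hat : empirical next-state distribution for (s, a) at the start of episode k
    v     : clipped value estimate (length-S array)
    n     : n_k(s,a) = max(N_k(s,a), 1), the visit count for (s, a)
    U     : confidence parameter log(1/delta')
    H     : span bound used by the clipping operator Clip_H
    """
    return max(4.0 * np.sqrt(dist_variance(p_hat, v) * U / n), 32.0 * H * U / n)

# Toy illustration with made-up numbers (S = 3 states).
p_hat = np.array([0.7, 0.2, 0.1])
v_hat = np.array([0.0, 3.0, 5.0])
print(bonus(p_hat, v_hat, n=40, U=np.log(1e4), H=6.0))
```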
Step 1: Optimism W e state the fact that with high probability , our Q -estimate b Q k and v alue estimate b V k are indeed optimistic across all episodes. W e defer the pro of to App endix G.1 . Lemma D.1 (Optimism) . With pr ob ability 1 − S AT δ ′ , b oth of the fol lowing hold: 1. b Q k ( s, a ) ≥ Q ⋆ ( s, a ) − ε k and b V k ( s ) ≥ V ⋆ ( s ) − ε k for al l ( s, a, k ) ∈ S × A × [ m ] . 2. F or any k ∈ [ m ] and t ∈ { t k , . . . , t k +1 − 1 } , b V k ( s t ) ≤ r ( s t , a t ) + γ b P k s t ,a t b V k + b k s t , a t , b V k . Equipp ed with Lemma D.1 , w e turn to decomposing the regret. Step 2: Regret Decomp osition Under the successful ev en ts of L emma D.1 , observ e that we hav e the follo wing b ound for any t ∈ { t k , . . . , t k +1 − 1 } : (1 − γ ) V ⋆ ( s t ) − r ( s t , a t ) ≤ γ b P k s t ,a t b V k − b V k ( s t ) + γ b k s t , a t , b V k + (1 − γ ) ε k . (1) No w, it is not immediately obvious ho w we should b ound the first term of the RHS, so w e will relate it to something we do know ho w to b ound. Observe that this term v aguely lo oks like P s t ,a t b V k − b V k ( s t +1 ) . Defining X t to b e this expression, one can v erify that { X t } 0 ≤ t ≤ T is a martingale difference sequence with resp ect to the filtration F t = σ ( s 1 , a 1 , . . . , s t +1 , a t +1 ) . Consequently , w e can b ound P T t =1 X t with the follo wing martingale concen tration result. 19 Lemma D.2 (Martingale Concen tration; Adapted from Lemma 13 in Zhang et al. 2021 ) . L et { M n } n ≥ 0 b e a martingale with r esp e ct to some filtr ation {F n } n ≥ 0 such that M 0 = 0 and | M n − M n − 1 | ≤ c almost sur ely for some c ≥ 0 and al l n ≥ 1 . L et V ar n = P n k =1 E ( M k − M k − 1 ) 2 | F k − 1 . Then for any p ositive inte ger n and any δ ∈ (0 , 1) , we have with pr ob ability at le ast 1 − 3 nδ that | M n | < 2 √ 2 s V ar n log 1 δ + 4 c log 1 δ . W e now hav e motiv ation for transforming b P k s t ,a t b V k − b V k ( s t ) into X t . W e accomplish this feat by introducing an additional term: b P k s t ,a t b V k − b V k ( s t ) = b P k s t ,a t b V k − b V k ( s t +1 ) + b V k ( s t +1 ) − b V k ( s t ) . Substituting bac k in to ( 1 ) and summing ov er t giv es us the following decomposition: T X t =1 ((1 − γ ) V ⋆ ( s t ) − r ( s t , a t )) = m X k =1 t k +1 − 1 X t = t k ((1 − γ ) V ⋆ ( s t ) − r ( s t , a t )) ≤ m X k =1 t k +1 − 1 X t = t k b P k s t ,a t b V k − b V k ( s t ) + b k s t , a t , b V k + (1 − γ ) ε k = m X k =1 t k +1 − 1 X t = t k b P k s t ,a t − P s t ,a t b V k + b k s t , a t , b V k + (1 − γ ) ε k | {z } = : T model + m X k =1 t k +1 − 1 X t = t k P s t ,a t b V k − b V k ( s t +1 ) | {z } = : T mart + m X k =1 t k +1 − 1 X t = t k b V k ( s t +1 ) − b V k ( s t ) | {z } = : T ind . W e call the first term T model b ecause it is a function of the error of our mo del estimate. Note that we include the b on us term in T model mainly due to technical reasons, but in tuitively our b ound for the mo del error term will inv olve a Bernstein inequality and this will absorb the Bernstein-st yle b onus. The second term is T mart b ecause w e will b ound it using martingale concen tration. Lastly , w e call the third term T ind b ecause it arises due to our need to shift indices. Step 3: Bounding T model and T mart b y cum ulative v ariance terms Our next step is to b ound each term in the decomp osition. Before doing so, we in tro duce some cumulativ e v ariance terms that arise in the analysis. 
W e define V ar ⋆ γ := T X t =1 V s t ,a t ( V ⋆ ) , V ar diff := T X t =1 V s t ,a t b V k t − V ⋆ . W e start with T model = P m k =1 P t k +1 − 1 t = t k b P k s t ,a t − P s t ,a t b V k + b k s t , a t , b V k + (1 − γ ) ε k . W e w ould lik e to bound the term in volving model error via a Bernstein-lik e concen tration inequalit y . W e cannot 20 immediately do so, ho wev er, b ecause b P k s t ,a t and b V k are not statistically independent. T o ov ercome this h urdle, we p erform the following decomp osition: b P k s t ,a t − P s t ,a t b V k = b P k s t ,a t − P s t ,a t V ⋆ + b P k s t ,a t − P s t ,a t b V k − V ⋆ . Since V ⋆ is fixed, w e can apply Bernstein to the first term. Bounding the second term is also doable b ecause b V k − V ⋆ ∞ decreases as k increases. The b on us term is designed so that it is subsumed b y the mo del error term. W e state the b ound for T model in the following lemma, which we formally prov e in Appendix G.2 . Lemma D.3. With pr ob ability at le ast 1 − 2 S 2 AT δ ′ , we have T model ≤ O q S AU 2 V ar ⋆ γ + p Γ S AU 2 V ar diff + Γ H S AU 2 . W e now mov e on to T mart . As discussed ab o ve, w e can apply Lemma D.2 , whic h gives us that with probabilit y 1 − 3 T δ ′ , T mart ≤ 2 √ 2 v u u t T X t =1 V s t ,a t b V k t U + 4 H U ≤ 4 q V ar ⋆ γ U + 4 p V ar diff U + 4 H U, (2) where the second inequality holds b ecause v u u t T X t =1 V s t ,a t b V k t = v u u t T X t =1 V s t ,a t b V k t − V ⋆ + V ⋆ ≤ v u u t T X t =1 2 V s t ,a t b V k t − V ⋆ + 2 V s t ,a t ( V ⋆ ) = q 2V ar ⋆ γ + 2V ar diff ≤ q 2V ar ⋆ γ + p 2V ar diff . Step 4: Bounding cum ulative v ariance terms W e can b ound V ar diff with the follo wing lemma, whose pro of we defer to App endix G.3 . Lemma D.4. Conditione d on the suc c essful events of L emma D.1 , we have with pr ob ability at le ast 1 − 3 T δ ′ that V ar diff ≤ O T model H + H 2 S AU . Subsequen tly , under the successful ev ents of Lemma D.3 , ( 2 ), and Lemma D.4 , w e hav e T model + T mart ≤ O √ Γ H S AU 2 p T model + T mart + q V ar ⋆ γ S AU 2 + Γ H S AU 2 , whic h by Lemma C.5 implies that T model + T mart ≤ O q V ar ⋆ γ S AU 2 + Γ H S AU 2 . (3) Step 5: Bounding T ind It remains to b ound T ind , which turns out to be the most straightfo ward argu- men t. W e compute T ind = m X k =1 t k +1 − 1 X t = t k b V k ( s t +1 ) − b V k ( s t ) = m X k =1 b V k ( s t k +1 ) − b V k ( s t k ) ≤ mH ≤ H S AU. Step 6: Recom bining Finally , w e recom bine terms. Conditioned under the successful ev ents of Lemma D.1 and ( 3 ), we hav e Regret γ ( T ) ≤ γ ( T model + T mart + T ind ) ≤ O q V ar ⋆ γ S AU 2 + Γ H S AU 2 . Via a union b ound (ov er the high probabilit y even ts of Lemma D.1 , Lemma D.3 , ( 2 ), and Lemma D.4 ), this o ccurs with probability at least 1 − 9 S 2 AT δ ′ = 1 − δ . 21 E Pro of of Lemma 3.2 Pr o of. Let δ ∈ (0 , 1) b e arbitrary , and set δ ′ = δ 6 T . F or the optimal v alue function we w ill drop γ and simply write V ⋆ . Our goal is to b ound V ar ⋆ γ = P T t =1 V ( P s t ,a t , V ⋆ ) . No w, observe that if we set e V ⋆ : = V ⋆ − min s V ⋆ ( s ) , then we hav e e V ⋆ ∈ h 0 , ∥ V ⋆ ∥ sp i S and P T t =1 V s t ,a t ( V ⋆ ) = P T t =1 V s t ,a t e V ⋆ . Subsequently , w e perform the follo wing decomp osition. T X t =1 V s t ,a t e V ⋆ = T X t =1 P s t ,a t e V ⋆ 2 − P s t ,a t e V ⋆ 2 = T X t =1 P s t ,a t e V ⋆ 2 − e V ⋆ ( s t +1 ) 2 | {z } = : T ⋆ 1 + T X t =1 e V ⋆ ( s t ) 2 − P s t ,a t e V ⋆ 2 | {z } = : T ⋆ 2 + T X t =1 e V ⋆ ( s t +1 ) 2 − e V ⋆ ( s t ) 2 | {z } = : T ⋆ 3 . 
With probabilit y at least 1 − 3 T δ ′ , w e ha ve that T ⋆ 1 ≤ 2 √ 2 v u u t T X t =1 V s t ,a t e V ⋆ 2 log 1 δ ′ + 4 ∥ V ⋆ ∥ 2 sp log 1 δ ′ ≤ 4 √ 2 ∥ V ⋆ ∥ sp s V ar ⋆ γ log 1 δ ′ + 4 ∥ V ⋆ ∥ 2 sp log 1 δ ′ , where the first inequality holds by Lemma D.2 and the second inequalit y holds b y Lemma C.4 . Next, w e compute that with probabilit y at least 1 − 3 T δ ′ , w e ha ve T ⋆ 2 = T X t =1 e V ⋆ ( s t ) 2 − P s t ,a t e V ⋆ 2 ( i ) ≤ 2 ∥ V ⋆ ∥ sp T X t =1 max n e V ⋆ ( s t ) − P s t ,a t e V ⋆ , 0 o ≤ 2 ∥ V ⋆ ∥ sp T X t =1 max n e V ⋆ ( s t ) − P s t ,a t e V ⋆ + 1 , 0 o ( ii ) = 2 ∥ V ⋆ ∥ sp T X t =1 e V ⋆ ( s t ) − P s t ,a t e V ⋆ + 1 = 2 ∥ V ⋆ ∥ sp T + T X t =1 e V ⋆ ( s t ) − P s t ,a t e V ⋆ ! ≤ 2 ∥ V ⋆ ∥ sp T + e V ⋆ ( s 1 ) + T X t =1 e V ⋆ ( s t +1 ) − P s t ,a t e V ⋆ ! ( iii ) ≤ 2 ∥ V ⋆ ∥ sp T + e V ⋆ ( s 1 ) + 2 √ 2 s V ar ⋆ γ log 1 δ ′ + 4 ∥ V ⋆ ∥ sp log 1 δ ′ ! ≤ 4 √ 2 ∥ V ⋆ ∥ sp s V ar ⋆ γ log 1 δ ′ + 2 T ∥ V ⋆ ∥ sp + 10 ∥ V ⋆ ∥ 2 sp log 1 δ ′ . 22 Inequalit y ( i ) holds b ecause a 2 − b 2 = ( a + b )( a − b ) ≤ ( a + b ) max { a − b, 0 } ≤ 2 ∥ V ⋆ ∥ sp max { a − b, 0 } for a, b ∈ h 0 , ∥ V ⋆ ∥ sp i . Equality ( ii ) holds b ecause e V ⋆ ( s t ) − P s t ,a t e V ⋆ + 1 ≥ 0 . Indeed, we ha v e e V ⋆ ( s t ) − P s t ,a t e V ⋆ = V ⋆ ( s t ) − P s t ,a t V ⋆ ≥ Q ⋆ ( s t , a t ) − P s t ,a t V ⋆ = r ( s t , a t ) − (1 − γ ) P s t ,a t V ⋆ ≥ − 1 . Finally , inequalit y ( iii ) holds with probability at least 1 − 3 T δ ′ b y Lemma D.2 . Observing that T ⋆ 3 is a telescoping sum, w e ha ve T ⋆ 3 = T X t =1 e V ⋆ ( s t +1 ) 2 − e V ⋆ ( s t ) 2 ≤ e V ⋆ ( s T +1 ) 2 ≤ ∥ V ⋆ ∥ 2 sp . Recom bining terms, w e ha ve with probability 1 − 6 T δ ′ that V ar ⋆ γ ≤ T ⋆ 1 + T ⋆ 2 + T ⋆ 3 ≤ O ∥ V ⋆ ∥ sp s V ar ⋆ γ log 1 δ ′ + T ∥ V ⋆ ∥ sp + ∥ V ⋆ ∥ 2 sp log 1 δ ′ ! , whic h by Lemma C.5 implies V ar ⋆ γ ≤ O ∥ V ⋆ ∥ sp T + ∥ V ⋆ ∥ 2 sp log 1 δ ′ . Substituting bac k δ ′ = δ 6 T giv es us that V ar ⋆ γ ≤ O ∥ V ⋆ ∥ sp T + ∥ V ⋆ ∥ 2 sp log( T /δ ) with probabilit y at least 1 − δ . F Pro of of Lemma 3.4 The proof relies on the following lemma. Lemma F.1 (Lemma 6 in Zurek and Chen 2025a ) . Supp ose the underlying MDP is we akly c ommunic ating. L et γ ∈ (0 , 1) . The optimal value function V ⋆ γ satisfies ρ ⋆ − (1 − γ ) V ⋆ γ ∞ ≤ (1 − γ ) V ⋆ γ sp . Pr o of. F or an y T ≥ 1 and γ ∈ (0 , 1) , we hav e Regret( T ) = T X t =1 ( ρ ⋆ − r ( s t , a t )) = T X t =1 ρ ⋆ − (1 − γ ) V ⋆ γ ( s t ) + T X t =1 (1 − γ ) V ⋆ γ ( s t ) − r ( s t , a t ) ≤ T X t =1 (1 − γ ) V ⋆ γ sp + T X t =1 (1 − γ ) V ⋆ γ ( s t ) − r ( s t , a t ) = (1 − γ ) V ⋆ γ sp T + Regret γ ( T ) , where the inequality is due to Lemma F.1 . 23 G Missing Pro ofs from App endix D G.1 Pro of of Lemma D.1 Pr o of. W e denote b y b T k the empirical Bellman op erator used in Algorithm 1 during the k th episo de. In particular, w e ha ve b T k Q ( s, a ) = r ( s, a ) + γ b P k s,a Clip H ( M Q ) + γ b k ( s, a, Clip H ( M Q )) . W e also write iters k to be the n umber of v alue iterations used Algorithm 1 during the k th episode, so that iters k = 1 1 − γ log 1 + 32 H U ε k (1 − γ ) . Observ e that γ iters k ≤ exp( − (1 − γ )iters k ) ≤ ε k (1 − γ ) 1 + 32 H U , where the first inequality is due to x ≤ e − (1 − x ) . The follo wing Lemma, whic h states some crucial prop erties of b T k , is prov ed in Appendix G.4 . Lemma G.1. With pr ob ability 1 − S AT δ ′ , the fol lowing hold for al l k ∈ [ m ] . 1. b T k is monotonic: b T k Q ≥ b T k Q ′ for any Q, Q ′ ∈ R S ×A such that Q ≥ Q ′ . 2. 
b T k is a γ -c ontr action: b T k Q − b T k Q ′ ∞ ≤ γ ∥ Q − Q ′ ∥ ∞ for any Q, Q ′ ∈ R S ×A . 3. Q ⋆ ≤ b T k Q ⋆ . No w, assume that the ev ents of Lemma G.1 hold, and fix an arbitrary k . W e first prov e P art 1 of Lemma D.1 . By the Banac h fixed point theorem [ Pugh , 2015 ], the fact that b T k is a γ -contraction im- plies that b T k has a unique fixed point, whic h w e will denote b Q ⋆ k . By monotonicit y of b T k , the sequence Q ⋆ , b T k Q ⋆ , b T k b T k Q ⋆ , . . . is nondecreasing and con verges to b Q ⋆ k , whic h implies Q ⋆ ≤ b Q ⋆ k . (4) F urthermore, w e hav e b Q k − b Q ⋆ k ∞ = b T k ( iters k ) 0 − b T k ( iters k ) b Q ⋆ k ∞ ≤ γ iters k b Q ⋆ k ∞ ≤ ε k , (5) where the final inequality holds by the abov e b ound on γ iters k and b T k Q ( s, a ) ≤ 1 + γ ∥ Q ∥ ∞ + γ 32 H U = ⇒ b T k 0 ∞ ≤ 1 + γ 32 H U = ⇒ b T k b T k 0 ∞ ≤ 1 + γ 32 H U + γ (1 + γ 32 H U ) . . . = ⇒ b Q ⋆ k ∞ ≤ 1 + γ 32 H U 1 − γ . Com bining ( 4 ) and ( 5 ) giv es us that (elemen twise) b Q k ≥ Q ⋆ − 1 ε k . 24 Subsequen tly , for an y s ∈ S , b V k ( s ) = min n M b Q k ( s ) , min s ′ M b Q k ( s ′ ) + H o ≥ min n ( M ( Q ⋆ − 1 ε k )) ( s ) , min s ′ ( M ( Q ⋆ − 1 ε k )) ( s ′ ) + H o = min n ( M Q ⋆ ) ( s ) , min s ′ ( M Q ⋆ ) ( s ′ ) + H o − ε k = min n V ⋆ ( s ) , min s ′ V ⋆ ( s ′ ) + H o − ε k = V ⋆ ( s ) − ε k , so the desired result holds. It remains to prov e Part 2 of Lemma D.1 . W e remark that b Q k ≤ b T k b Q k b ecause monotonicity implies 0 ≤ b T k 0 ≤ b T k b T k 0 ≤ · · · ≤ b T k ( iters k ) 0 | {z } = b Q k ≤ b T k ( iters k +1) 0 | {z } = b T k b Q k . It follo ws that for any t ∈ { t k , . . . , t k +1 − 1 } , b V k ( s t ) ≤ M b Q k ( s t ) = b Q k ( s t , a t ) ≤ b T k b Q k ( s t , a t ) = r ( s t , a t ) + γ b P k s t ,a t b V k + γ b k s t , a t , b V k , so the desired result holds. G.2 Pro of of Lemma D.3 Pr o of. As explained in the main section of the proof, we decomp ose T model = m X k =1 t k +1 − 1 X t = t k b P k s t ,a t − P s t ,a t b V k + b k s t , a t , b V k + (1 − γ ) ε k = m X k =1 t k +1 − 1 X t = t k b P k s t ,a t − P s t ,a t V ⋆ + P k s t ,a t − P s t ,a t b V k − V ⋆ + b k s t , a t , b V k + (1 − γ ) ε k . W e will bound each of the first three terms inside the sum separately under high probability even ts. Recall that for ease of notation w e denote n k ( s, a ) : = max { N k ( s, a ) , 1 } . Let E 1 b e the even t that b P k s,a,s ′ − P s,a,s ′ ≤ s 2 P s,a,s ′ U n k ( s, a ) + I ( P s,a,s ′ > 0) U 3 n k ( s, a ) ∀ ( s, a, s ′ , k ) ∈ S × A × S × [ m ] , and let E 2 b e the even t that b P k s,a − P s,a V ⋆ ≤ s 2 V s,a ( V ⋆ ) U n k ( s, a ) + H U 3 n k ( s, a ) ∀ ( s, a, k ) ∈ S × A × [ m ] . (6) 25 W e will later confirm that E 1 and E 2 are indeed high probabilit y even ts. Under ev ent E 1 , for an y ( s, a, k ) ∈ S × A × [ m ] we hav e b P k s,a − P s,a b V k − V ⋆ = X s ′ ∈S b P k s,a,s ′ − P s,a,s ′ b V k ( s ′ ) − V ⋆ ( s ′ ) ( i ) = X s ′ ∈S b P k s,a,s ′ − P s,a,s ′ b V k ( s ′ ) − V ⋆ ( s ′ ) − P s,a ( b V k − V ⋆ ) ( ii ) ≤ X s ′ ∈S s 2 P s,a,s ′ U n k ( s, a ) + I ( P s,a,s ′ > 0) U 3 n k ( s, a ) · b V k ( s ′ ) − V ⋆ ( s ′ ) − P s,a b V k − V ⋆ ( iii ) ≤ X s ′ ∈S s 2 P s,a,s ′ U n k ( s, a ) · b V k ( s ′ ) − V ⋆ ( s ′ ) − P s,a b V k − V ⋆ + 2Γ H U 3 n k ( s, a ) ( iv ) ≤ v u u t 2Γ U P s ′ ∈S P s,a,s ′ b V k ( s ′ ) − V ⋆ ( s ′ ) − P s,a b V k − V ⋆ 2 n k ( s, a ) + 2Γ H U 3 n k ( s, a ) = v u u t 2Γ U V s,a b V k − V ⋆ n k ( s, a ) + 2Γ H U 3 n k ( s, a ) . 
(7) Equalit y ( i ) holds b ecause P s ′ b P k s,a,s ′ − P s,a,s ′ c = 0 for an y c ∈ R . Inequalit y ( ii ) holds under even t E 1 . W e obtain inequalit y ( iii ) by bounding b V k ( s ′ ) − V ⋆ ( s ′ ) − P s,a b V k − V ⋆ ≤ 2 H and summing ov er s ′ . Inequalit y ( iv ) follo ws from Cauc hy-Sc hw arz. Moreo ver, under E 1 , for any ( s, a, k ) ∈ S × A × [ m ] w e ha ve b k s, a, b V k = max 4 v u u t V b P k s,a , b V k U n k ( s, a ) , 32 H U n k ( s, a ) ≤ 4 v u u t V b P k s,a , b V k U n k ( s, a ) + 32 H U n k ( s, a ) ≤ 4 p 3 / 2 v u u t V s,a b V k U n k ( s, a ) + 4 p 4 / 3 √ Γ H 2 U n k ( s, a ) + 32 H U n k ( s, a ) ≤ 5 v u u t V s,a b V k U n k ( s, a ) + 37 Γ H U n k ( s, a ) , (8) with the second inequality holding due to V b P k s,a , b V k = X s ′ ∈S b P k s,a,s ′ b V k ( s ′ ) − b P k s,a b V k 2 ( i ) ≤ X s ′ ∈S b P k s,a,s ′ b V k ( s ′ ) − P s,a b V k 2 ( ii ) ≤ X s ′ ∈S 3 2 P s,a,s ′ + 4 I ( P s,a,s ′ > 0) U 3 n k ( s, a ) b V k ( s ′ ) − P s,a b V k 2 ≤ 3 2 V s,a b V k + 4Γ H 2 U 3 n k ( s, a ) . 26 Here, inequality ( i ) is b ecause E [ X ] minimizes f ( λ ) = E [( X − λ ) 2 ] , and inequality ( ii ) holds under E 1 . Indeed, under E 1 , w e ha ve b P k s,a,s ′ − P s,a,s ′ ≤ s 2 P s,a,s ′ I ( P s,a,s ′ > 0) U n k ( s, a ) + I ( P s,a,s ′ > 0) U 3 n k ( s, a ) = ⇒ b P k s,a,s ′ − P s,a,s ′ ≤ 1 2 P s,a,s ′ + I ( P s,a,s ′ > 0) U n k ( s, a ) + I ( P s,a,s ′ > 0) U 3 n k ( s, a ) = ⇒ b P k s,a,s ′ ≤ 3 2 P s,a,s ′ + 4 I ( P s,a,s ′ > 0) U 3 n k ( s, a ) , where the first implication holds because √ ab ≤ a 2 + b 2 for a, b ≥ 0 . No w, combining ( 6 ), ( 7 ), and ( 8 ) gives us that under E 1 ∩ E 2 , T model ≤ O m X k =1 t k +1 − 1 X t = t k s U V s t ,a t ( V ⋆ ) n k ( s t , a t ) + v u u t Γ U V s t ,a t b V k − V ⋆ n k ( s t , a t ) + Γ H U n k ( s t , a t ) + (1 − γ ) ε k . The follo wing lemma, whic h we prov e in Appendix G.5 , allo ws us to easily b ound the ab ov e. Lemma G.2. W e have the fol lowing b ounds. 1. P T t =1 1 n k t ( s t ,a t ) ≤ S AU . 2. F or nonne gative numb ers w 1 , . . . , w T , we have P T t =1 q w t n k t ( s t ,a t ) ≤ q S AU P T t =1 w t . 3. P T t =1 (1 − γ ) ε k t ≤ S AU . Setting w t = V s t ,a t ( V ⋆ ) , P art 2 of Lemma G.2 giv es us that T X t =1 s U V s t ,a t ( V ⋆ ) n k t ( s t , a t ) ≤ q S AU 2 V ar ⋆ γ . Next, setting w t = V s t ,a t b V k − V ⋆ , P art 2 of Lemma G.2 giv es us that T X t =1 v u u t Γ U V s t ,a t b V k − V ⋆ n k t ( s t , a t ) ≤ p Γ S AU 2 V ar diff . Lastly , P arts 1 and 3 of Lemma G.2 give us that T X t =1 Γ H U n k t ( s t , a t ) + (1 − γ ) ε k t ≤ 2Γ H S AU 2 . Com bining these three bounds, w e ha ve that under E 1 ∩ E 2 , T model ≤ O q S AU 2 V ar ⋆ γ + p Γ S AU 2 V ar diff + Γ H S AU 2 . It remains to show that E 1 and E 2 hold with the claimed high probability . Starting with E 1 , fix s, s ′ ∈ S and a ∈ A , and suppose n k ( s, a ) = n for some n ≥ 1 . Observ e that P k s,a,s ′ = 1 n P n i =1 Z i , where Z 1 , . . . , Z n are i.i.d. Bernoulli random v ariables with mean P s,a,s ′ . Denoting Z to also b e an indep enden t Bernoulli random v ariable with mean P s,a,s ′ , w e ha ve with probabilit y at least 1 − δ ′ , b P k s,a,s ′ − P s,a,s ′ ≤ r 2 V ar ( Z ) log (2 /δ ′ ) n + I ( P s,a,s ′ > 0) log(2 /δ ′ ) 3 n ≤ r 2 P s,a,s ′ U n + I ( P s,a,s ′ > 0) U 3 n , 27 where the first inequalit y is due to Lemma C.1 and the observ ation that P s,a,s ′ = 0 = ⇒ b P k s,a,s ′ = 0 , and the second inequalit y holds because V ar ( Z ) ≤ E [ Z 2 ] = E [ Z ] = P s,a,s ′ and log (2 /δ ′ ) ≤ U . 
It follows from a union bound o ver possible s, s ′ , a, n giv es us that E 1 holds with probability at least 1 − S 2 AT δ ′ . Next, for E 2 , fix s ∈ S and a ∈ A , and supp ose n k ( s, a ) = n for some n ≥ 1 . Observe that b P k s,a V ⋆ = 1 n P n i =1 Z i , where Z 1 , . . . , Z n are i.i.d. multinoulli random v ariables that take the v alue V ⋆ ( s ′ ) with probabilit y P s,a,s ′ for eac h s ′ . Letting Z b e i.i.d. from the same distribution as the Z i ’s, w e hav e that E [ Z ] = P s,a V ⋆ and V ar ( Z ) = E [ Z 2 ] − ( E [ Z ]) 2 = P s,a ( V ⋆ ) 2 − ( P s,a V ⋆ ) 2 = V s,a ( V ⋆ ) . Subsequen tly , Lemma C.1 gives us that with probabilit y at least 1 − δ ′ , P t s,a V ⋆ − P s,a V ⋆ ≤ r 2 V s,a ( V ⋆ ) U n + H U 3 n , so it follows from a union b ound ov er pos sible s, a, n that E 2 holds with probability at least 1 − S AT δ ′ . W e conclude that E 1 ∩ E 2 o ccurs with probability at least 1 − 2 S 2 AT δ ′ , as desired. G.3 Pro of of Lemma D.4 Pr o of. Our goal is to b ound V ar diff = P T t =1 V s t ,a t b V k t − V ⋆ . W e will pro ceed in a manner similar to the pro of of Lemma 3.2 , where we bounded V ar ⋆ γ . F or ease of notation, write D k : = b V k − V ⋆ . Then, set e D k : = D k − min s D k ( s ) so that e D k ∈ [0 , H ] S and V ar diff = P T t =1 V s t ,a t e D k t . W e subsequen tly ha ve T X t =1 V s t ,a t e D k t = m X k =1 t k +1 − 1 X t = t k P s t ,a t e D k 2 − P s t ,a t e D k 2 = T X t =1 P s t ,a t e D k t 2 − e D k t ( s t +1 ) 2 | {z } = : T diff 1 + m X k =1 t k +1 − 1 X t = t k e D k ( s t ) 2 − P s t ,a t e D k 2 | {z } = : T diff 2 + m X k =1 t k +1 − 1 X t = t k e D k ( s t +1 ) 2 − e D k ( s t ) 2 | {z } = : T diff 3 . With probabilit y at least 1 − 3 T δ ′ , w e ha ve T diff 1 ≤ 2 √ 2 v u u t T X t =1 V s t ,a t e D k t 2 U + 4 H 2 U ≤ 4 √ 2 H p V ar diff U + 4 H 2 U, where the first inequality is by Lemma D.2 and the second inequalit y is b y Lemma C.4 . 28 Next, w e compute that T diff 2 = m X k =1 t k +1 − 1 X t = t k e D k ( s t ) 2 − P s t ,a t e D k 2 ≤ 2 H m X k =1 t k +1 − 1 X t = t k max n e D k ( s t ) − P s t ,a t e D k , 0 o = 2 H m X k =1 t k +1 − 1 X t = t k max { D k ( s t ) − P s t ,a t D k , 0 } = 2 H m X k =1 t k +1 − 1 X t = t k max n b V k ( s t ) − P s t ,a t b V k − ( V ⋆ ( s t ) − P s t ,a t V ⋆ ) , 0 o , with the inequality holding by a 2 − b 2 ≤ 2 H max { a − b, 0 } for a, b ∈ [0 , H ] . F urthermore, w e ha ve b V k ( s t ) − P s t ,a t b V k ≤ r ( s t , a t ) + γ b P k s t ,a t b V k + γ b k s t , a t , b V k − P s t ,a t b V k = r ( s t , a t ) − (1 − γ ) P s t ,a t b V k + γ b P k s t ,a t − P s t ,a t b V k + γ b k s t , a t , b V k (9) as w ell as V ⋆ ( s t ) − P s t ,a t V ⋆ ≥ r ( s t , a t ) − (1 − γ ) P s t ,a t V ⋆ . (10) Con tinuing from ab o ve by plugging in ( 9 ) and ( 10 ), we hav e T diff 2 ≤ 2 H m X k =1 t k +1 − 1 X t = t k max n b V k ( s t ) − P s t ,a t b V k − ( V ⋆ ( s t ) − P s t ,a t V ⋆ ) , 0 o ≤ 2 H m X k =1 t k +1 − 1 X t = t k max n (1 − γ ) P s t ,a t V ⋆ − b V k + γ b P k s t ,a t − P s t ,a t b V k + γ b k s t , a t , b V k , 0 o ( i ) ≤ 2 H m X k =1 t k +1 − 1 X t = t k P k s t ,a t − P s t ,a t b V k + b k s t , a t , b V k + (1 − γ ) ε k = 2 H T model . Note that inequality ( i ) holds under Lemma D.1 , in which case (1 − γ ) P s t ,a t V ⋆ − b V k ≤ (1 − γ ) max s V ⋆ ( s ) − b V k ( s ) ≤ (1 − γ ) ε k . F urthermore, w e compute T diff 3 = m X k =1 t k +1 − 1 X t = t k e D k ( s t +1 ) 2 − e D k ( s t ) 2 = m X k =1 e D k ( s t k +1 ) 2 − e D k ( s t k ) 2 ≤ mH 2 ≤ H 2 S AU. 
Recom bining terms, w e ha ve with probability 1 − 3 T δ ′ that V ar diff ≤ T diff 1 + T diff 2 + T diff 3 ≤ O H √ U p V ar diff + H T model + H 2 S AU , whic h by Lemma C.5 implies V ar diff ≤ O H T model + H 2 S AU . 29 G.4 Pro of of Lemma G. 1 Pr o of. Let E b e the ev ent that b P k s,a − P s,a V ⋆ ≤ 2 v u u t V b P k s,a , V ⋆ U n k ( s, a ) + 14 H U 3 n k ( s, a ) ∀ ( s, a, k ) ∈ S × A × [ m ] . W e restate an alternate, stronger version of Lemma G.1 , which w e will prov e instead. Afterwards, w e will complete the pro of of Lemma G.1 b y sho wing that E holds with the claimed high probabilit y . Lemma G.3. The fol lowing hold for al l k ∈ [ m ] . 1. b T k satisfies the c onstant shift pr op erty: for any Q ∈ R S ×A and c ∈ R , we have b T k ( Q + c 1 ) = b T k Q + γ c 1 . 2. b T k is monotonic: for any Q, Q ′ ∈ R S ×A such that M Q ≥ M Q ′ , we have b T k Q ≥ b T k Q ′ . 3. b T k is a γ -c ontr action: for any Q, Q ′ ∈ R S ×A , we have b T k Q − b T k Q ′ ∞ ≤ γ ∥ Q − Q ′ ∥ ∞ . 4. Under the event E , we also have Q ⋆ ≤ b T k Q ⋆ . W e remark that while w e call P art 2 the monotonicity prop erty , it is actually stronger. Indeed, for b T k Q ≥ b T k Q ′ to hold, we only need M Q ≥ M Q ′ , which is a w eaker requiremen t than Q ≥ Q ′ . This version will be useful in proving P art 3. Now, let k ∈ [ m ] b e arbitrary . T o show Part 1 of Lemma G.3 (constant shift), let Q ∈ R S ×A and c ∈ R b e arbitrary . Since Clip H ( M ( Q + c 1 )) = Clip H ( M Q + c 1 ) = Clip H ( M Q ) + c 1 , a straigh tforward computation gives us b T k ( Q + c 1 ) ( s, a ) = r ( s, a ) + γ b P k s,a (Clip H ( M ( Q + c 1 ))) + γ b k ( s, a, Clip H ( M ( Q + c 1 ))) = r ( s, a ) + γ b P k s,a (Clip H ( M Q )) + γ b P k s,a ( c 1 ) + γ b k ( s, a, Clip H ( M Q )) = b T k Q + γ c 1 . T o sho w P art 2 of Lemma G.3 (monotonicit y), let Q, Q ′ ∈ R S ×A b e suc h that M Q ≥ M Q ′ . Set- ting α : = min s (Clip H ( M Q ′ ))( s ) and β : = min s ((Clip H ( M Q ))( s ) − (Clip H ( M Q ′ ))( s )) . Observe that β > 0 , Clip H ( M Q ′ ) − α 1 ∈ [0 , H ] S , Clip H ( M Q ) − α 1 − β 1 ∈ [0 , 2 H ] S and Clip H ( M Q ) − α 1 − β 1 ≥ Clip H ( M Q ′ ) − α 1 . Using f ( p, v , n, u ) as defined in Lemma C.3 , w e ha ve b T k Q ( s, a ) = r ( s, a ) + γ b P k s,a (Clip H ( M Q )) + γ b k ( s, a, Clip H ( M Q )) = r ( s, a ) + γ b P k s,a (Clip H ( M Q ) − α 1 − β 1 ) + γ b k ( s, a, Clip H ( M Q ) − α 1 − β 1 ) + γ α + γ β = r ( s, a ) + γ f b P k s,a , Clip H ( M Q ) − α 1 − β 1 , n k ( s, a ) , U + γ α + γ β ≥ r ( s, a ) + γ f b P k s,a , Clip H ( M Q ′ ) − α 1 , n k ( s, a ) , U + γ α + γ β = r ( s, a ) + γ b P k s,a (Clip H ( M Q ′ )) + γ b k ( s, a, Clip H ( M Q ′ )) + γ β = b T k Q ′ ( s, a ) + γ β ≥ b T k Q ′ ( s, a ) , where the first inequality is due to Lemma C.3 . T o show P art 3 of Lemma G.3 ( γ -contraction), let Q, Q ′ ∈ R S ×A b e arbitrary . Since M Q ≤ M Q ′ + ∥ M Q − M Q ′ ∥ ∞ 1 = M ( Q ′ + ∥ M Q − M Q ′ ∥ ∞ 1 ) , we hav e b y the monotonicit y and constant shift properties of b T k that b T k Q ≤ b T k ( Q ′ + ∥ M Q − M Q ′ ∥ ∞ 1 ) ≤ b T k Q ′ + γ ∥ M Q − M Q ′ ∥ ∞ ≤ b T k Q ′ + γ ∥ Q − Q ′ ∥ ∞ . 30 Rearranging giv es us b T k Q − b T k Q ′ ≤ γ ∥ Q − Q ′ ∥ ∞ . Rev ersing the roles of Q and Q ′ in the ab ov e, we also ha ve b T k Q ′ − b T k Q ≤ γ ∥ Q − Q ′ ∥ ∞ , and com bining this with the ab o v e, we conclude that b T k Q − b T k Q ′ ∞ ≤ γ ∥ Q − Q ′ ∥ ∞ . Finally , w e show Part 4 of Lemma G.3 . 
F or an y ( s, a ) ∈ S × A , we hav e Q ⋆ ( s, a ) = r ( s, a ) + γ P s,a V ⋆ = r ( s, a ) + γ b P k s,a V ⋆ + γ P s,a − b P k s,a V ⋆ ≤ r ( s, a ) + γ b P k s,a Clip H ( M Q ⋆ ) + γ b k ( s, a, V ⋆ ) = b T k Q ⋆ , where the inequality holds under E . It remains to sho w that E o ccurs with high probabilit y . Fix s ∈ S and a ∈ A , and supp ose n k ( s, a ) = n for some n ≥ 2 . Note that the n k ( s, a ) = 1 case trivially alwa ys holds. Observ e that P k s,a V ⋆ = 1 n P n i =1 Z i , where Z 1 , . . . , Z n are i.i.d. m ultinoulli random v ariables that take the v alue V ⋆ ( s ′ ) with probability P s,a,s ′ for each s ′ . Letting Z be i.i.d. from the same distribution as the Z i ’s, w e hav e that E [ Z ] = P s,a V ⋆ and in the notation of Lemma C.2 , d V ar n = V b P k s,a , V ⋆ . Subsequen tly , Lemma C.2 gives us that with probabilit y at least 1 − δ ′ , b P k s,a V ⋆ − P s,a V ⋆ ≤ v u u t 2 V b P k s,a , V ⋆ log(2 /δ ′ ) n − 1 + 7 H log(2 /δ ′ ) 3( n − 1) ≤ 2 v u u t V b P k s,a , V ⋆ U n + 14 H U 3 n , where the second inequality follows from the fact that 1 x − 1 ≤ 2 x for x ≥ 2 , as well as the fact that log (2 /δ ′ ) ≤ U . T aking a union b ound o ver all p ossible s, a, n thus giv es us that E holds with probabilit y at least 1 − S AT δ ′ . G.5 Pro of of Lemma G. 2 Pr o of. W e start b y pro ving P art 1, that P T t =1 1 n k t ( s t ,a t ) ≤ S AU . Observ e that T X t =1 1 n k t ( s t , a t ) = T X t =1 X s,a I (( s, a ) = ( s t , a t )) n k t ( s, a ) = X s,a T X t =1 I (( s, a ) = ( s t , a t )) n k t ( s, a ) . No w, fix ( s, a ) ∈ S × A . W e ha ve T X t =1 I (( s, a ) = ( s t , a t )) n k t ( s, a ) = T X t =1 i max X i =0 I ( s, a ) = ( s t , a t ) , 2 i ≤ n k t ( s, a ) ≤ 2 i +1 − 1 2 i = i max X i =0 T X t =1 I ( s, a ) = ( s t , a t ) , 2 i ≤ n k t ( s, a ) ≤ 2 i +1 − 1 2 i ≤ i max X i =0 2 i 2 i = i max + 1 ≤ U, 31 where the first inequality holds due to the doubling tric k. Namely , T X t =1 I ( s t , a t ) = ( s, a ) , 2 i ≤ n k t ( s, a ) ≤ 2 i +1 − 1 ≤ 2 i . Summing o ver all p ossible ( s, a ) gives us the desired bound. W e pro ceed to proving Part 2. Let w 1 , . . . , w T ∈ R be nonnegative num b ers. T X t =1 r w t n k t ( s t , a t ) ≤ v u u t T X t =1 1 n k t ( s t , a t ) v u u t T X t =1 w t ≤ v u u t S AU T X t =1 w t . The first inequality is by Cauch y-Sch w arz, and the second inequality is by Part 1 ab o ve. Lastly , w e prov e Part 3. W e ha ve m X k =1 t k +1 − 1 X t = t k (1 − γ ) ε k = m X k =1 t k +1 − 1 X t = t k 1 t k = m X k =1 N k t k . No w, for eac h k ∈ [ m ] , it holds that t k = 1 + P k − 1 ℓ =1 N ℓ . Contin uing, w e subsequen tly hav e m X k =1 N k t k = m X k =1 N k 1 + P k − 1 ℓ =1 N ℓ ≤ m X k =1 log 1 + N k 1 + P k − 1 ℓ =1 N ℓ ! = m X k =1 log 1 + P k ℓ =1 N ℓ 1 + P k − 1 ℓ =1 N ℓ ! = m X k =1 log 1 + k X ℓ =1 N ℓ ! − log 1 + k − 1 X ℓ =1 N ℓ !! = log 1 + m X ℓ =1 N ℓ ! − log(1) = log (1 + T ) ≤ U. H Pro of of Corollary 3.7 Here w e pro ve Corollary 3.7 . Pr o of. After applying Theorem 3.5 with span b ound H = q T S 3 A and com bining with Lemma 3.2 and the fact that V ⋆ γ sp ≤ 2 ∥ h ⋆ ∥ sp [ W ei et al. , 2020 ] to b ound the v ariance parameter (and taking a union b ound to get a failure probabilit y of at most 2 δ that these do not both hold) all that remains is to confirm that the resulting regret b ound of Regret( T ) ≤ e O r ∥ h ⋆ ∥ sp T + ∥ h ⋆ ∥ 2 sp S A + √ S AT (11) 32 whenev er T ≥ ∥ h ⋆ ∥ 2 sp S 3 A implies the claim that Regret( T ) ≤ e O p ( ∥ h ⋆ ∥ sp + 1) S AT for T ≥ ∥ h ⋆ ∥ 2 sp S 3 A . 
Observ e that T ≥ ∥ h ⋆ ∥ 2 sp S 3 A ≥ ∥ h ⋆ ∥ 2 sp implies that ∥ h ⋆ ∥ sp √ S A ≤ √ S AT , so ( 11 ) simplifies to a b ound of Regret( T ) ≤ e O q ∥ h ⋆ ∥ sp S AT + √ S AT = e O r ∥ h ⋆ ∥ sp + 1 S AT whenev er T ≥ ∥ h ⋆ ∥ 2 sp S 3 A , as required. Note that the second regret bound stated in Corollary 3.7 follo ws b y considering cases. Indeed, for T ≤ ∥ h ⋆ ∥ 2 sp S 3 A , the regret can b e b ounded b y T , while for T ≥ ∥ h ⋆ ∥ 2 sp S 3 A w e ha ve just sho wn that the regret is b ounded b y e O p ( ∥ h ⋆ ∥ sp + 1) S AT . So for all T , the regret is b ounded b y e O p ( ∥ h ⋆ ∥ sp + 1) S AT + ∥ h ⋆ ∥ 2 sp S 3 A . I Pro of of Theorem 3.9 Pr o of. Let S ≥ 2 and A ≥ 2 b e integers, and let D ≥ 4 ⌈ log A S ⌉ . Suppose that T ≤ 1 32 D S A . Let S ′ = ⌈ S − 1 2 ⌉ and A ′ = A − 1 , and observe that T ≤ 1 8 D S ′ A ′ and 2 ⌈ log A S ⌉ ≤ D 2 (w e will need these facts later). W e construct a family of hard MDPs { P ( s,a ) | ( s, a ) ∈ [ S ′ ] × [ A ′ ] } such that eac h MDP has S states, A actions, and diameter at most D (see Figure 2 ). First, w e use an A -ary tree structure to connect states 1 , . . . , S − 1 by assigning A actions with deterministic transitions at each non-leaf state. In particular, eac h action go es to a distinct child no de. If a non-leaf no de has fewer than A children, the remaining actions are deterministic self-lo ops. Note that w e can (and do) choose a tree with depth at most ⌈ log A S ⌉ and at least S ′ leaf nodes. Next, w e let state S b e the “go od” state, where all actions are deterministic self-loops with rew ard 1, except for one action whic h deterministically returns to the root of the tree. A t every other no de the reward for an y action is 0, so the learner’s goal is to reac h state S . T o reach state S , the learner must search for the correct action in the correct leaf no de, which is different in each MDP instance. In the MDP P ( s,a ) , the correct state-action pair is ( s, a ) . That is, at all leaf no des except state s , A ′ actions are deterministic self-lo ops, and the remaining action is a deterministic transition bac k to the root. State s is identical except that action a transits to the go o d state with probability 2 D , and sta ys in its current state with probability 1 − 2 D . Note that the diameter of this MDP is 2 ⌈ log A S ⌉ + D 2 ≤ D as required. No w fix an y horizon- T algorithm. F or any θ ∈ [ S ′ ] × [ A ′ ] , write E θ to denote the exp ectation induced b y the algorithm when the underlying MDP is P θ , and write P a for the corresp onding probability . F or eac h ( s, a ) , let the random v ariable N ( s,a ) b e the n umber of times action a is tak en in state s , and let N b e the total n umber of visits to the bad states with rew ard 0. W e observ e that Regret( T ) ≥ N , so our goal is to lo wer b ound E θ [ N ] for some θ . Let E denote the even t that the learner do es not observe a transition to the go od state. F urther let θ ∈ [ S ′ ] × [ A ′ ] . W e start with the simple observ ation that since, T ≥ E θ [ N | E ] ≥ X ( s,a ) ∈ [ S ′ ] × [ A ′ ] E θ [ N ( s,a ) | E ] , it follo ws that for some ( s ′ , a ′ ) w e ha ve E θ [ N ( s ′ ,a ′ ) | E ] ≤ T S ′ A ′ ≤ D 8 . W e then claim that E ( s ′ ,a ′ ) [ N ( s ′ ,a ′ ) ] ≤ E ( s ′ ,a ′ ) [ N ( s ′ ,a ′ ) | E ] = E θ [ N ( s ′ ,a ′ ) | E ] ≤ D 8 . The first inequality holds b ecause w e can assume WLOG that the algorithm will stay in the go o d state up on transiting there. 
The second inequality holds b ecause the algorithm will behav e exactly the same on an y of the hard MDPs under the ev en t E . So, denoting θ ′ : = ( s ′ , a ′ ) , w e ha ve shown that E θ ′ [ N θ ′ ] ≤ D 8 . Next, w e compute that P θ ′ ( E ) ≥ P θ ′ ( E | N θ ′ < D / 4) P θ ′ ( N θ ′ < D / 4) ≥ 1 − 2 D D/ 4 1 2 ≥ 1 4 33 2 /D r = 0 r = 1 Figure 2: An example of a hard MDP construction for S = 14 and A = 3 . T o av oid clutter, w e omit an additional deterministic self-lo op at each leaf state. W e also omit the deterministic actions which transit from the leaf states to the root and from the go od state to the root, as these actions only serv e to k eep the diameter bounded b y D . where the last inequality is due to Bernoulli’s inequality (Lemma C.6 ), and the second inequalit y is due to P θ ′ ( N θ ′ ≥ D / 4) ≤ E θ ′ [ N θ ′ ] D / 4 ≤ 1 2 . Since E θ ′ [ N ] ≥ E θ ′ [ N | E ] P θ ′ ( E ) ≥ T 4 , w e conclude that E θ ′ [Regret( T )] ≥ T 4 . J Pro of of Theorem 3.8 In this section we prov e the lo wer bound Theorem 3.8 . First w e sho w the follo wing in termediate result. Theorem J.1. Ther e exist universal c onstants c 1 > 1 and 0 < c 2 < 1 such that the fol lowing holds. Fix inte gers T ≥ 1 , S ≥ 2 , A ≥ 2 , and fix B ≥ max { c 1 , 2 ⌈ log A S ⌉} . L et Alg b e any horizon- T algorithm. Then ther e exist two c ommunic ating MDPs P 1 , P 2 such that: 1. P 1 and P 2 b oth have S states and A actions. 2. h ⋆ P 1 sp = B and h ⋆ P 2 sp = 1 2 . 3. If E Alg P 1 [Regret( T )] < T / 4 , then E Alg P 2 [Regret( T )] ≥ c 2 B S A . Pr o of. Let T ≥ 1 , S ≥ 2 , A ≥ 2 b e in tegers, and let B ≥ max { 50 , ⌈ log A S ⌉} . F urther define S ′ = ⌈ S − 1 2 ⌉ and A ′ = A − 1 . W e will construct a family of MDPs P ( i,s,a ) | ( i, s, a ) ∈ [2] × [ S ′ ] × [ A ′ ] , where each P ( i,s,a ) will ha ve a tree construction similar to that used in the pro of of Theorem 3.9 in Section I (see Figure 3 ). Sp ecifically , w e again use an A -ary tree structure to connect states 1 , . . . , S − 1 b y assigning A actions with deterministic transitions at eac h non-leaf state, with eac h action going to a distinct child no de. If a non-leaf no de has fewer than A children, the remaining actions are deterministic self-lo ops. W e use a tree with depth at most ⌈ log A S ⌉ ≤ B / 2 and at least S ′ leaf nodes. F or the MDPs P (1 ,s,a ) and P (2 ,s,a ) , in all leaf states other than s , actions 1 , . . . , A ′ are deterministic self-lo ops, and action A is a deterministic transition back to the ro ot. State s is identical exce pt that action a transits to state S outside the tree with probabilit y 2 /B . In state S , actions 2 , . . . , A are all deterministic 34 2 /B r = 1 / 2 r = 1 / 2 r = 0 r = 1 P (1 , 2 , 1) 2 /B r = 1 / 2 r = 1 / 2 r = 0 r = 1 P (2 , 2 , 1) Figure 3: An example of the MDPs used in the pro of of Theorem J.1 . If the transition associated with a state-action pair is deterministic, it is den oted with a solid arro w. If it is stochastic, it is represen ted as a solid line splitting into multiple dashed arro ws to differen t states, each annotated with the associated probabilit y of that transition. The MDPs are parameterized b y B > 1 . Some actions, such as those which transit from leaf states bac k to the ro ot state, are omitted. transitions back to the ro ot. The reward is 1/2 for an y action in an y non-leaf state or any deterministic transition from a tree state back to the ro ot. The rew ard for action 1 in state S is 1, and the rew ard at any other state-action pair is 0. 
The only difference b et ween P (1 ,s,a ) and P (2 ,s,a ) is that in P (1 ,s,a ) , action 1 is a deterministic self-loop, while in P (2 ,s,a ) , action 1 is a deterministic transition bac k to the ro ot. Subsequently , it is straightforw ard to see that the optimal strategy in P (1 ,s,a ) is to take action ( s, a ) un til a transition to state S is observ ed, then rep eatedly tak e action 1. On the other hand, in P (2 ,s,a ) it is not w orth trying to reach state S , so the optimal strategy is to stay in the tree and only take rew ard 1/2 actions. F or a learner to distinguish b et w een the t wo MDPs, it must reach state S and take action 1, but that will take many time steps when the correct ( s, a ) is not known beforehand. One can easily v erify that the span of the optimal bias function is at most B for all P (1 ,s,a ) and 1 2 for all P (2 ,s,a ) , as required. W e no w consider an y horizon- T algorithm Alg . WLOG w e can assume the following: 1. Alg is deterministic, meaning that giv en a sequence of past states and actions, the next action is computed by a deterministic function. This assumption can be justified via a standard argument that an y randomized strategy is equiv alent to some random choice from the set of all deterministic strategies ( Auer et al. [ 2002 ], Auer et al. [ 2008 ]). 2. Once a transition to state S is observed, Alg acts optimally . If this were not the case, the exp ected regret w ould only increase. Since we assume that Alg is deterministic, under the even t that no transition to state S occurs Alg , will alw ays observe the deterministic sequence of state-action pairs ( s (1) , a (1) ) , . . . , ( s ( T ) , a ( T ) ) . F or any ( s, a ) , w e denote by t k ( s, a ) the index of the k th o ccurrence of ( s, a ) in this sequence, and w e denote by n t ( s, a ) the n umber of times ( s, a ) occurs through index t in this sequence. F or θ ∈ [2] × [ S ′ ] × [ A ′ ] , let P θ and E θ denote the probability and exp ectation, resp ectiv ely , induced b y Alg when the underlying MDP is P θ . F or each ( s, a ) , let N ( s,a ) denote the n umber of times the learner tak es action a in state s . F urther write N leav e = P ( s,a ) ∈ [ S ′ ] × [ A ′ ] N ( s,a ) , and let N stay b e the n umber of times the learner takes a reward 1 2 action. Additionally , let E denote the even t that the algorithm observ es a transition to state S . When the underlying MDP is some P (1 ,s,a ) the optimal gain is 1, and hence the algorithm adds 1 to the regret an y time it tries to reac h state S (i.e. some state-action in [ S ′ ] × [ A ′ ] and adds 1/2 to the regret an y time it takes a reward 1/2 action within the tree. In particular, Regret( T , P (1 ,s,a ) , Alg ) ≥ T − 1 2 N stay − N ( S, 1) . 35 Similarly , when the underlying MDP is some P (2 ,s,a ) , the optimal gain is 1/2, and hence the algorithm adds 1/2 to the regret any time it tries to reac h state S . It ma y subtract 1/2 from the regret in the ev ent that it do es ( S, 1) . Consequently , Regret( T , P (2 ,s,a ) , Alg ) ≥ 1 2 N leav e − 1 2 N ( S, 1) ≥ 1 2 N leav e − 1 2 , with the second inequality due to the fact that the algorithm will try ( S, 1) at most once in P (2 ,s,a ) . No w, let ( s, a ) ∈ [ S ′ ] × [ A ′ ] . Assuming that E (1 ,s,a ) [Regret( T )] < T 4 , we will w ork to show a lo wer b ound on n T ( s, a ) . 
Under our assumption, w e ha ve T 4 > T − 1 2 E (1 ,s,a ) [ N stay ] − E (1 ,s,a ) N ( S, 1) = ⇒ 1 2 E (1 ,s,a ) [ N stay ] + E (1 ,s,a ) [ N ( S, 1) ] > 3 T 4 = ⇒ E (1 ,s,a ) [ N ( S, 1) ] > T 4 , with the second implication following from the trivial fact that N stay ≤ T . F urthermore, since N ( S, 1) ≤ T I ( E ) , w e hav e T 4 < E (1 ,s,a ) [ N ( S, 1) ] ≤ T P (1 ,s,a ) ( E ) = ⇒ P (1 ,s,a ) ( E ) > 1 4 . W e ha ve shown that there is a constan t probability of observing a transition to state S , and w e will use this fact to show that N ( s,a ) is large with constant probability . T ow ards this end, w e compute P (1 ,s,a ) ( E c ) ≥ P (1 ,s,a ) E c | N ( s,a ) < B 10 P (1 ,s,a ) N ( s,a ) < B 10 ≥ 1 − 2 B B 10 P (1 ,s,a ) N ( s,a ) < B 10 ≥ 4 5 P (1 ,s,a ) N ( s,a ) < B 10 , with the last inequality holding due to Bernoulli’s inequality (Lemma C.6 ). Hence, 1 4 < P (1 ,s,a ) ( E ) = 1 − P (1 ,s,a ) ( E c ) ≤ 1 − 4 5 P (1 ,s,a ) N ( s,a ) < B 10 , whic h implies that P (1 ,s,a ) N ( s,a ) < B 10 < 3 4 · 5 4 = 15 16 . Therefore, P (1 ,s,a ) ( N ( s,a ) ≥ B 10 ) ≥ 1 16 , which implies that n T ( s, a ) ≥ B 10 (otherwise P (1 ,s,a ) ( N ( s,a ) ≥ B 10 ) w ould b e 0). Since ( s, a ) w as arbitrary , w e hav e shown that n T ( s, a ) ≥ B 10 for all ( s, a ) ∈ [ S ′ ] × [ A ′ ] . Our next step is to deriv e a lo wer b ound on E (1 ,s,a ) [ N leav e ] for some ( s, a ) . Observ e that for any ( s, a ) , w e hav e E (1 ,s,a ) [ N leav e ] = T X t =1 t P (1 ,s,a ) ( N leav e = t ) . F urthermore, for t < T , N leav e = t o ccurs precisely when ( s t , a t ) = ( s, a ) , all previous occurrences of ( s, a ) do not result in a transition to S , and this occurrence of ( s, a ) do es result in a transition to S . N leav e = T is similar except that it does not require a transition to S. In other w ords, w e ha ve E (1 ,s,a ) [ N leav e ] ≥ T X t =1 t I ( s t , a t ) = ( s, a ) 1 − 2 B n t ( s,a ) − 1 2 B . 36 W riting t k ( s, a ) to be the k th time ( s, a ) o ccurs, we can low er bound this sum b y ⌊ B / 8 ⌋ X k =1 t k ( s, a ) 1 − 2 B k − 1 2 B . The follo wing technical lemma, the pro of of which we p ostpone, shows that w e can b ound this sum for at least one ( s, a ) . In tuitively , the sum will be large enough when the t k ( s, a ) are large, and b ecause all ( s, a ) o ccur many times, there is at least one ( s, a ) whose indices are sufficien tly large. Lemma J.2. L et B > 0 and c ∈ (0 , 1) such that ⌈ cB ⌉ ≥ 2 . L et M b e a p ositive inte ger. L et z 1 , z 2 , . . . b e a se quenc e of inte gers such that e ach i ∈ { 1 , . . . , M } o c curs at le ast cB times. L et t k ( i ) b e the index of the k th o c curr enc e of value i . Then max i ∈{ 1 ,...,M } ⌊ cB ⌋ X k =1 t k ( i ) 1 − 2 B k − 1 2 B ≥ c 2 (1 − 2 c ) 2 B M . An application of Lemma J.2 with c = 1 / 10 and M = S ′ A ′ giv es us that there exists some ( s ′ , a ′ ) satisfying E (1 ,s ′ ,a ′ ) [ N leav e ] ≥ 1 250 B S ′ A ′ . Finally , it is not hard to see that E (1 ,s ′ ,a ′ ) [ N leav e ] = E (2 ,s ′ ,a ′ ) [ N leav e ] , so E (2 ,s ′ ,a ′ ) [Regret( T , P (2 ,s,a ) , Alg )] ≥ E (2 ,s ′ ,a ′ ) 1 2 N leav e − 1 2 ≥ B S ′ A ′ 250 − 1 2 ≥ B S A 2000 − 1 2 ≥ B S A 4000 , where the final inequality is due to the fact that B ≥ 500 = ⇒ B S A 4000 ≥ 1 2 . W e conclude that the requirements of the theorem hold with c 1 = 500 , c 2 = 1 / 4000 , P 1 = P (1 ,s ′ ,a ′ ) , and P 2 = P (2 ,s ′ ,a ′ ) . W e no w prov e Lemma J.2 . Pr o of. Fix arbitrary B > 0 and c ∈ (0 , 1) satisfying ⌈ cB ⌉ ≥ 2 , let M ∈ Z ≥ 1 , and let z 1 , z 2 , . . 
. b e a sequence of in tegers suc h that P ∞ j =1 I ( z j = i ) ≥ cB for all i ∈ { 1 , . . . , M } . F or ease of presentation, write B ′ : = ⌈ cB ⌉ and w k : = 1 − 2 B k − 1 2 B so that our goal is to b ound max i ∈{ 1 ,...,M } B ′ X k =1 t k ( i ) w k . First, w e use that the max is greater the a verage, so that max i ∈{ 1 ,...,M } B ′ X k =1 t k ( i ) w k ≥ 1 M M X i =1 B ′ X k =1 t k ( i ) w k . Observing that { t k ( i ) } = { 1 , . . . , M B ′ } , w e then reindex the sum and write 1 M M X i =1 B ′ X k =1 t k ( i ) w k = 1 M M B ′ X t =1 t w n ( t ) , where n ( t ) is defined as the num b er of times that z t app ears through index t . F urthermore, the rearrangement inequalit y (Lemma C.7 ) giv es us that 1 M M B ′ X t =1 t w n ( t ) ≥ 1 M M B ′ X t =1 t w ⌈ t/ M ⌉ ≥ 1 M B ′ X j =1 M (( j − 1) M + 1) w j ≥ M B ′ X j =1 ( j − 1) w j . 37 It remains to analyze M P j ( j − 1) w j . Note that for any j ∈ { 1 , . . . , B ′ } , Bernoulli’s inequality (Lemma C.6 ) giv es us w j = 1 − 2 B j − 1 2 B ≥ 1 − 2 B B ′ − 1 2 B ≥ 1 − 2( B ′ − 1) B 2 B ≥ 2 − 4 c B . Consequen tly , M B ′ X j =1 ( j − 1) w j ≥ M (2 − 4 c ) B B ′ − 1 X j =1 j = M (2 − 4 c ) B B ′ ( B ′ − 1) 2 ≥ M 1 B 2 − 4 c 2 cB cB 2 ≥ c 2 (1 − 2 c ) 2 B M . Since w e ha ve fully established Theorem J.1 , we can turn to pro ving Theorem 3.8 . Pr o of. Let S ≥ 2 and A ≥ 2 be integers. Fix some α ∈ [1 , 2) , and let T > S A ( c 4 β T ) 4 2 − α , where c 4 ≥ 1 is a univ ersal constant to b e defined later. Supp ose that some horizon- T algorithm Alg has for all MDPs P , E [Regret( T , P , Alg )] ≤ q β T ∥ h ⋆ P ∥ sp S AT + β T S A ∥ h ⋆ P ∥ α sp . W e w ant to use Theorem J.1 to show a con tradiction. Hence w e need a c hoice of B such that p β T B S AT + β T S AB α < T / 4 (12) p β T S AT / 2 + β T S A/ 2 α < c 2 B S A. (13) W e will set B in terms of T , S , A to b e as small as pos sible such that the second inequality holds, and then sho w that the first inequality holds under the assumed conditions. Assuming that T > β T S A , w e can then deriv e that p β T S AT / 2 + β T S A/ 2 α < 2 p β T S AT / 2 , so we can satisfy ( 13 ) b y setting c 2 B S A = 2 p β T S AT / 2 ⇐ ⇒ B = q c 3 β T T S A , where c 3 : = 2 c 2 √ 2 . No w c hecking that our c hoice of B admits ( 12 ), w e calculate the equiv alence p β T B S AT + β T S AB α < T / 4 ⇐ ⇒ c 2 3 β 3 T T 3 S A 1 4 + β T S A r c 3 β T T S A ! α < T / 4 . Hence a sufficient condition for ( 12 ) is that b oth of the follo wing are true: c 2 3 β 3 T T 3 S A 1 4 < T / 8 (14) β T S A r c 3 β T T S A ! α < T / 8 . (15) The condition ( 14 ) is equiv alent to c 2 3 β 3 T T 3 S A 1 4 < T / 8 ⇐ ⇒ c 2 3 β 3 T T 3 S A < T 4 / 8 4 ⇐ ⇒ T > 8 4 c 2 3 β 3 T S A. 38 Defining c 4 : = 8 c 3 , for condition ( 15 ) w e compute β T S A r c 3 β T T S A ! α < T / 8 ⇐ ⇒ T 2 − α 2 > 8 c α 2 3 β 2+ α 2 T ( S A ) 2 − α 2 ⇐ ⇒ T > S A 8 c α 2 − α 3 β 2+ α 2 − α T ⇐ = T > S A ( c 4 β T ) 4 2 − α where the final implication is due to α < 2 and c 3 > 1 . So the desired con tradiction holds as long as T > β T S A , T > 8 4 c 2 3 β 3 T S A , and T > S A ( c 4 β T ) 4 2 − α , but all of these conditions are clearly implied b y T > S A ( c 4 β T ) 4 2 − α , since α ≥ 1 implies that S A ( c 4 β T ) 4 2 − α ≥ S A (8 c 3 β T ) 4 . 39
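As a quick numerical sanity check, the two applications of Bernoulli's inequality (Lemma C.6) in the proofs of Theorem 3.9 and Theorem J.1, namely $(1 - 2/D)^{D/4} \ge 1/2$ and $(1 - 2/B)^{B/10} \ge 4/5$, can be verified directly; the specific values of $D$ and $B$ below are arbitrary and purely illustrative.

```python
# Check (1 - 2/D)^(D/4) >= 1 - (D/4)*(2/D) = 1/2 and (1 - 2/B)^(B/10) >= 1 - (B/10)*(2/B) = 4/5,
# the two Bernoulli-inequality steps used in Appendices I and J, on a few sample values.
for D in (8, 20, 100, 1000):
    assert (1 - 2 / D) ** (D / 4) >= 0.5, D
for B in (50, 500, 5000):
    assert (1 - 2 / B) ** (B / 10) >= 0.8, B
print("Bernoulli-inequality steps hold for the sampled values of D and B.")
```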