An Asymptotic Law of the Iterated Logarithm for $\mathrm{KL}_{\inf}$

The population $\mathrm{KL}_{\inf}$ is a fundamental quantity that appears in lower bounds for (asymptotically) optimal regret of pure-exploration stochastic bandit algorithms, and optimal stopping time of sequential tests. Motivated by this, an empi…

Authors: Ashwin Ram, Aaditya Ramdas

Ashwin Ram (ARAM2@ANDREW.CMU.EDU) and Aaditya Ramdas (ARAMDAS@CMU.EDU), Carnegie Mellon University

Abstract: The population $\mathrm{KL}_{\inf}$ is a fundamental quantity that appears in lower bounds for (asymptotically) optimal regret of pure-exploration stochastic bandit algorithms, and optimal stopping time of sequential tests. Motivated by this, an empirical $\mathrm{KL}_{\inf}$ statistic is frequently used in the design of (asymptotically) optimal bandit algorithms and sequential tests. While nonasymptotic concentration bounds for the empirical $\mathrm{KL}_{\inf}$ have been developed, their optimality in terms of constants and rates is questionable, and their generality is limited (usually to bounded observations). The fundamental limits of nonasymptotic concentration are often described by the asymptotic fluctuations of the statistics. With that motivation, this paper presents a tight (upper and lower) law of the iterated logarithm for the empirical $\mathrm{KL}_{\inf}$, applying to extremely general (unbounded) data.

Keywords: Law of the Iterated Logarithm (LIL), Kullback-Leibler (KL) Divergence, Sequential Testing, Online Learning.

1. Introduction

Consider the typical setting of observing independent and identically distributed random variables $X_1, X_2, \ldots$ with common distribution $P \in \mathcal{P}([a,b])$, all with mean $\mu$ and variance $\sigma^2 \in (0,\infty)$. Denote the empirical law of these random variables as $\hat{P}_t = \frac{1}{t}\sum_{i=1}^t \delta_{X_i}$, with empirical mean $\hat{\mu}_t$ and variance $\hat{\sigma}_t^2$. Consider some distribution $\nu$ supported on a compact interval $[a,b]$ and a target mean $m \in [a,b]$. Given this, define the one-sided mean-constrained information projection as
$$\mathrm{KL}_{\inf}(\nu, m) := \inf\Big\{ \mathrm{KL}(\nu \,\|\, Q) : Q \in \mathcal{P}([a,b]),\ \int x \, dQ(x) \ge m \Big\},$$
and we take $\mathrm{KL}(\nu \,\|\, Q) = \infty$ if $\nu \not\ll Q$.
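To make the definition concrete, here is a small numerical sketch (our illustration, not part of the paper) for a discrete $\nu$ on $[a,b]$: exponentially tilting $\nu$ raises its mean while keeping the same support, so any tilt whose mean reaches $m$ is feasible for the infimum, and its KL cost upper-bounds $\mathrm{KL}_{\inf}(\nu, m)$. The function name and bisection tolerance are illustrative choices.

```python
import numpy as np

def exp_tilt_upper_bound(xs, ps, m, tol=1e-12):
    """Upper-bound KL_inf(nu, m) for discrete nu = sum_i ps[i] * delta_{xs[i]}.

    The exponential tilt Q_lam has dQ_lam/dnu proportional to exp(lam * x);
    its mean increases continuously in lam, so we bisect for mean(Q_lam) = m.
    KL(nu || Q_lam) = log Z(lam) - lam * E_nu[X] is a feasible (not optimal) cost.
    """
    xs, ps = np.asarray(xs, float), np.asarray(ps, float)
    mean_nu = ps @ xs
    if m <= mean_nu:                             # nu is already feasible
        return 0.0

    def tilted_mean(lam):
        w = ps * np.exp(lam * (xs - xs.max()))   # shift exponent for stability
        return (w @ xs) / w.sum()

    lo, hi = 0.0, 1.0
    while tilted_mean(hi) < m:                   # grow bracket until mean >= m
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < m:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    log_z = np.log(ps @ np.exp(lam * (xs - xs.max()))) + lam * xs.max()
    return log_z - lam * mean_nu                 # KL(nu || Q_lam) >= KL_inf(nu, m)

# nu uniform on {0, 0.25, 0.5, 0.75, 1}, mean 0.5; target mean 0.6
xs = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
ps = np.full(5, 0.2)
ub = exp_tilt_upper_bound(xs, ps, 0.6)
```

Since the tilt ranges only over one exponential family, this is an upper bound; the true infimum over all of $\mathcal{P}([a,b])$ can be smaller.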
In this paper, we characterize the almost-sure iterated-logarithm scale of the empirical projection cost at the true mean, $\mathrm{KL}_{\inf}(\hat{P}_t, \mu)$, where all probabilities and almost-sure statements are under i.i.d. draws from $P$. It is well known that the sample mean satisfies the Hartman-Wintner law of the iterated logarithm (LIL) (Khinchin, 1924; Kolmogoroff, 1929; Hartman and Wintner, 1941; Feller, 1943; Chung, 1948; Strassen, 1964; Teicher, 1974). That is, almost surely,
$$\limsup_{t\to\infty} \frac{\hat{\mu}_t - \mu}{\sqrt{(2\sigma^2 \log\log t)/t}} = 1 \quad\text{and}\quad \liminf_{t\to\infty} \frac{\hat{\mu}_t - \mu}{\sqrt{(2\sigma^2 \log\log t)/t}} = -1.$$
Our main contribution is to show that $\mathrm{KL}_{\inf}(\hat{P}_t, \mu)$ has the same iterated-logarithm constant (by expanding the $\mathrm{KL}_{\inf}$ quadratically). In other words, we show that
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) = (1 + o(1)) \, \frac{(\mu - \hat{\mu}_t)_+^2}{2\hat{\sigma}_t^2}.$$
As a consequence, with probability one,
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} = 1.$$
We prove this by analyzing an affine tilt of $\hat{P}_t$ to get a sharp upper bound, and then a dual lower bound using the Donsker-Varadhan inequality (Donsker and Varadhan, 1975; Csiszár and Matúš, 2003) alongside uniform Taylor control.

© A. Ram & A. Ramdas.

Importantly, we extend our sharp LIL equality beyond the compact support introduced above: restricting the competing class to a slowly growing deterministic envelope $[-B_t, B_t]$, we show that if a very weak sufficient condition is met (the data almost surely eventually satisfying $|X_i| \le B_t$ for all $i \le t$, for some sequence $B_t = o(\sqrt{t/\log\log t})$), then the same sharp LIL constant holds for the time-varying functional $\mathrm{KL}_{\inf}^{(t)}$ defined over $\mathcal{P}([-B_t, B_t])$. We show that such envelopes are immediate in typical tail regimes such as sub-Gaussian, sub-exponential, and finite $p$-th moment.
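The local quadratic expansion above can be sanity-checked numerically. The sketch below is our own illustration: we use an artificial target $m = \hat{\mu}_t + \Delta$ in place of the unknown true mean, and compare the KL cost of the paper's affine tilt (which upper-bounds $\mathrm{KL}_{\inf}$) against the quadratic $\Delta^2/(2\hat{\sigma}_t^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)   # bounded sample, [a, b] = [0, 1]

mu_hat = x.mean()
sig2_hat = ((x - mu_hat) ** 2).mean()     # empirical variance (divide by t)

delta = 0.01                              # assumed mean deficit: target m = mu_hat + delta
theta = delta / sig2_hat                  # the affine-tilt parameter theta_t

# KL(P_hat || Q_t) for the tilt dQ_t/dP_hat = 1 + theta * (x - mu_hat)
kl_tilt = -np.log1p(theta * (x - mu_hat)).mean()
quadratic = delta ** 2 / (2.0 * sig2_hat) # the claimed (1 + o(1)) equivalent

ratio = kl_tilt / quadratic               # should be close to 1 for small delta
```

The ratio approaches 1 as the deficit shrinks, matching the $(1+o(1))$ statement; for a fixed finite sample it is only approximately 1.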
The fundamental mechanism behind KL-based index policies for stochastic multi-armed bandits is in fact the mean-constrained KL projection (Robbins, 1952; Gittins, 1989; Lai and Robbins, 1985; Burnetas and Katehakis, 1996; Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012; Garivier and Cappé, 2011; Cappé et al., 2013; Honda and Takemura, 2010, 2015; Kaufmann et al., 2012; Lattimore and Szepesvári, 2020). There has also been an enormous amount of work that uses large deviations and nonasymptotic concentration to derive logarithmic regret (Cramér, 1938; Kullback and Leibler, 1951; Sanov, 1957; Dembo and Zeitouni, 1998; Boucheron et al., 2013; Howard et al., 2020, 2021). In contrast, our work is the first to analyze (and characterize) the natural almost-sure fluctuation scale of the empirical projection, $t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)$: we show it oscillates at order $\log\log t$ with a sharp constant.

Related Work. Khinchin and Kolmogorov were among the first to analyze the LIL for sums of random variables, with Hartman and Wintner later providing the definitive finite-variance version (Khinchin, 1924; Kolmogoroff, 1929; Hartman and Wintner, 1941), and there have been innumerable extensions and variants since (Feller, 1943; Chung, 1948; Strassen, 1964; Teicher, 1974; Stout, 1970, 1974; Bingham, 1986; de Acosta, 1983). Especially important in sequential inference is later work generalizing the LIL to finite-horizon or time-uniform settings through, for example, sharp finite-time martingale LIL bounds and confidence sequences (Wald, 1947; Darling and Robbins, 1967; Balsubramani, 2015; Howard et al., 2020, 2021).
Moreover, laws of the iterated logarithm have been proven to extend far beyond the sample mean itself, with laws of the logarithm and functional LILs for empirical and local empirical processes (Shorack and Wellner, 1986; van der Vaart and Wellner, 1996; Deheuvels and Mason, 1994; Mason, 2004; Einmahl and Rosalsky, 2001). Fundamentally, the KL divergence (Kullback and Leibler, 1951) is the driving force behind large deviations of empirical measures and entropy minimization (Sanov, 1957; Dembo and Zeitouni, 1998; Dupuis and Ellis, 1997). Csiszár's $I$-divergence geometry and its refinements gave intuition for information projections under linear constraints (Csiszár, 1975, 1984; Csiszár and Matúš, 2003). Closely related to these entropy-tilting ideas are empirical likelihood and exponential tilting methods in statistics (Owen, 1988, 2001). It is now well understood in the bandit literature that optimality and bandit lower bounds are characterized by KL "information" defined by taking infima over so-called confusing alternatives (Robbins, 1952; Gittins, 1989; Lai and Robbins, 1985; Burnetas and Katehakis, 1996; Kaufmann et al., 2012). That is the motivation of a number of works that presented KL-based optimism indices, and of course nonparametric variants of these (Robbins, 1952; Garivier and Cappé, 2011; Cappé et al., 2013; Honda and Takemura, 2010, 2015; Lattimore and Szepesvári, 2020). With all this said, we present a completely orthogonal contribution: these procedures frequently compute the empirical mean-constrained projection cost as mentioned, and we provide the first almost-sure LIL calibration of this $\mathrm{KL}_{\inf}$ cost.

The rest of this paper is organized as follows. In Section 2, we provide an upper bound on the empirical $\mathrm{KL}_{\inf}$, which we then tighten (thereby showing equality) in Section 3.
We present a theorem extending these results to unbounded data in Section 4. In Appendix A, we provide the full proofs that are not given in the main text. In Appendix B, we give three concrete instantiations of our general theorem from Section 4. Finally, in Appendix C, we prove the tightness of the sharp envelope introduced in Section 4 that allows us to extend our results to unbounded data.

2. An Upper Law of the Iterated Logarithm for the Empirical $\mathrm{KL}_{\inf}$

In this section, we formalize the idea of $\mathrm{KL}_{\inf}$ satisfying an asymptotic LIL, in a general setting, outside of any specific application like sequential testing. Recall our setup from before: $X_1, X_2, X_3, \ldots$ are i.i.d. random variables with common distribution $P \in \mathcal{P}([a,b])$ supported on the compact interval $[a,b]$, with mean $\mu := \mathbb{E}[X_1]$ and variance $\sigma^2 := \mathrm{Var}(X_1) \in (0,\infty)$. All probabilities $\mathbb{P}(\cdot)$ and expectations $\mathbb{E}[\cdot]$ below are taken with respect to $P$. Given all this, denote the empirical measure for $t \ge 1$ as
$$\hat{P}_t := \frac{1}{t}\sum_{i=1}^t \delta_{X_i},$$
and denote the empirical mean and variance respectively as
$$\hat{\mu}_t := \int x \, d\hat{P}_t(x) = \frac{1}{t}\sum_{i=1}^t X_i, \qquad \hat{\sigma}_t^2 := \int (x - \hat{\mu}_t)^2 \, d\hat{P}_t(x) = \frac{1}{t}\sum_{i=1}^t (X_i - \hat{\mu}_t)^2.$$
Further, for any measurable $h$, we write $\mathbb{E}_{\hat{P}_t}[h(X)] := \int h(x) \, d\hat{P}_t(x) = \frac{1}{t}\sum_{i=1}^t h(X_i)$. Importantly, although $\hat{P}_t$ is random, conditional on the data $(X_1, \ldots, X_t)$ it is fixed. As a consequence, $\mathbb{E}_{\hat{P}_t}$ is simply an empirical average. With these defined, let us explain how we denote the $\mathrm{KL}_{\inf}$ in this setting. Namely, for a probability measure $\nu$ supported on $[a,b]$ and a scalar $m \in [a,b]$, define
$$\mathrm{KL}_{\inf}(\nu, m) := \inf\Big\{ \mathrm{KL}(\nu \,\|\, Q) : Q \in \mathcal{P}([a,b]),\ \int x \, dQ(x) \ge m \Big\},$$
where $\mathrm{KL}(\nu \,\|\, Q)$ is the usual KL divergence.
We adopt the convention that $\mathrm{KL}(\nu \,\|\, Q) = \infty$ if $\nu \not\ll Q$. In addition, we emphasize that the compact class we analyze here does matter. Why? Because if one allows $Q$ to place mass arbitrarily far out on the upper tail,$^1$ then $\mathrm{KL}_{\inf}(\nu, m)$ can collapse to $0$ whenever $m$ exceeds the mean of $\nu$: one can "sprinkle" an arbitrarily small amount of mass at some very large point to meet the mean constraint at a vanishing KL cost. As such, a bounded (or otherwise constrained) model class is really what it takes to make the $\mathrm{KL}_{\inf}$ nontrivial. That is precisely the motivation for our upper LIL theorem below.

Theorem 1 Under our above setup, we have
$$\mathbb{P}\left( \limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} \le 1 \right) = 1.$$
Equivalently, for every $\varepsilon > 0$, almost surely there exists a (possibly random) $T_\varepsilon < \infty$ such that for all $t \ge T_\varepsilon$,
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \le (1 + \varepsilon) \frac{\log\log t}{t}.$$

1. For example, on $\mathbb{R}$ with no restrictions.

Before proceeding with the proof of this theorem, recall that the quantity $\mathrm{KL}_{\inf}(\hat{P}_t, \mu)$ is, in a sense, the smallest relative-entropy budget needed to perturb the empirical law so that its mean is at least the true mean $\mu$. If the empirical mean satisfies $\hat{\mu}_t \ge \mu$, then no perturbation is necessary and $\mathrm{KL}_{\inf}(\hat{P}_t, \mu) = 0$. In the case where $\hat{\mu}_t < \mu$, one of the simplest ways to increase the mean is to slightly upweight observations above $\hat{\mu}_t$ and downweight those below $\hat{\mu}_t$, in a way that preserves the total mass. Such a perturbation is infinitesimal, so its KL cost is second-order: it is proportional to $(\mu - \hat{\mu}_t)^2 / (2\hat{\sigma}_t^2)$. From there, the classic LIL for $\hat{\mu}_t$ converts immediately into an LIL for $\mathrm{KL}_{\inf}$.
With all this said, one more tool we will need for the proof is the following lemma: a uniform Taylor bound for $-\log(1+u)$ on a shrinking neighborhood.

Lemma 2 Take some $r \in (0,1)$ and define $f(u) := -\log(1+u)$ on $(-1, \infty)$. Then, for every $u \in [-r, r]$, we have
$$f(u) \le -u + \frac{u^2}{2} + \frac{|u|^3}{3(1-r)^3}.$$

We are now ready for the proof of Theorem 1.

Proof [Theorem 1] Start by setting $\Delta_t := (\mu - \hat{\mu}_t)_+ = \max\{0, \mu - \hat{\mu}_t\}$. Note that if $\Delta_t = 0$, then necessarily $\hat{\mu}_t \ge \mu$, so the empirical law $\hat{P}_t$ itself satisfies the mean constraint and $\mathrm{KL}_{\inf}(\hat{P}_t, \mu) = 0$. Thus only the times with $\Delta_t > 0$ matter, and that is exactly what we focus on. Work on the almost-sure event on which the strong law of large numbers and the LIL for $\hat{\mu}_t$ both hold: namely, $\hat{\mu}_t \to \mu$ and $\hat{\sigma}_t^2 \to \sigma^2$ almost surely. Note also that the classical LIL of Hartman and Wintner (1941) implies that for i.i.d. sequences with $\mathrm{Var}(X_1) = \sigma^2 \in (0,\infty)$,
$$\limsup_{t\to\infty} \frac{\hat{\mu}_t - \mu}{\sqrt{(2\sigma^2 \log\log t)/t}} = 1, \qquad \liminf_{t\to\infty} \frac{\hat{\mu}_t - \mu}{\sqrt{(2\sigma^2 \log\log t)/t}} = -1, \qquad \text{almost surely}.$$
Now define a feasible affine tilt of the empirical law. That is, for those $t$ with $\Delta_t > 0$ and $\hat{\sigma}_t^2 > 0$, define
$$\theta_t := \frac{\Delta_t}{\hat{\sigma}_t^2} \qquad\text{and}\qquad \frac{dQ_t}{d\hat{P}_t}(x) := 1 + \theta_t(x - \hat{\mu}_t).$$
Crucially, this is the simplest possible perturbation preserving the total mass, because $\int (x - \hat{\mu}_t) \, d\hat{P}_t(x) = 0$; moreover, it pushes the mean upward at a rate proportional to the variance. Furthermore, because $x \in [a,b]$, we have $|x - \hat{\mu}_t| \le (b-a)$, and so
$$\inf_{x \in [a,b]} \big(1 + \theta_t(x - \hat{\mu}_t)\big) \ge 1 - \theta_t(b-a).$$
Since $\Delta_t \to 0$ and $\hat{\sigma}_t^2 \to \sigma^2 > 0$, we have $\theta_t(b-a) \to 0$ almost surely.
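Lemma 2 is elementary to verify numerically; the sketch below (our illustration; the grid size and the value of $r$ are arbitrary choices) evaluates both sides on a grid over $[-r, r]$.

```python
import numpy as np

r = 0.3                                   # any r in (0, 1)
u = np.linspace(-r, r, 10_001)            # the range Lemma 2 covers

lhs = -np.log1p(u)                        # f(u) = -log(1 + u)
rhs = -u + u**2 / 2 + np.abs(u)**3 / (3 * (1 - r)**3)

gap = rhs - lhs                           # Lemma 2 asserts gap >= 0 everywhere
```

The gap is zero at $u = 0$ and strictly positive elsewhere on the grid, as the lemma predicts.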
Thus, for all sufficiently large $t$, the Radon-Nikodym derivative is strictly positive on $[a,b]$. For such large $t$, $Q_t$ is therefore a well-defined probability measure supported on $[a,b]$:
$$\int dQ_t = \int \big(1 + \theta_t(x - \hat{\mu}_t)\big) \, d\hat{P}_t(x) = 1 + \theta_t \underbrace{\int (x - \hat{\mu}_t) \, d\hat{P}_t(x)}_{=0} = 1.$$
Moreover, its mean is lifted exactly to $\mu$. Expanding $x(x - \hat{\mu}_t) = (x - \hat{\mu}_t)^2 + \hat{\mu}_t(x - \hat{\mu}_t)$,
$$\int x \, dQ_t(x) = \int x \big(1 + \theta_t(x - \hat{\mu}_t)\big) \, d\hat{P}_t(x) = \hat{\mu}_t + \theta_t \int (x - \hat{\mu}_t)^2 \, d\hat{P}_t(x) + \theta_t \underbrace{\int \hat{\mu}_t (x - \hat{\mu}_t) \, d\hat{P}_t(x)}_{=0} = \hat{\mu}_t + \theta_t \hat{\sigma}_t^2 = \hat{\mu}_t + \Delta_t = \mu.$$
Hence $Q_t$ satisfies the constraint $\int x \, dQ_t(x) \ge \mu$ with equality, and it follows that $\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \le \mathrm{KL}(\hat{P}_t \,\|\, Q_t)$. We now need to bound the KL cost of this tilt. By definition,
$$\mathrm{KL}(\hat{P}_t \,\|\, Q_t) = \int \log\left(\frac{d\hat{P}_t}{dQ_t}\right) d\hat{P}_t = -\int \log\big(1 + \theta_t(x - \hat{\mu}_t)\big) \, d\hat{P}_t(x).$$
Now let $X \sim \hat{P}_t$ and define the random variable $U_t := \theta_t(X - \hat{\mu}_t)$. Because we treat $\hat{P}_t$ as fixed conditional on $(X_1, \ldots, X_t)$, we have $\mathbb{E}_{\hat{P}_t}[U_t] = \theta_t \mathbb{E}_{\hat{P}_t}[X - \hat{\mu}_t] = \theta_t \big(\int x \, d\hat{P}_t(x) - \hat{\mu}_t\big) = 0$. Furthermore, $|U_t| \le r_t$, where
$$r_t := \theta_t(b-a) = \frac{\Delta_t}{\hat{\sigma}_t^2}(b-a) \xrightarrow[t\to\infty]{} 0 \quad\text{almost surely}.$$
To recap, we have learned two things about $U_t$: it is centered, and it is uniformly small. Now take $t$ large enough that $r_t \in (0,1)$ and apply Lemma 2 with $r = r_t$ pointwise to $u = U_t$ to get
$$-\log(1 + U_t) \le -U_t + \frac{U_t^2}{2} + \frac{|U_t|^3}{3(1-r_t)^3}.$$
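The two displayed identities (unit total mass, mean lifted exactly to the target) hold exactly in finite samples, not just asymptotically, because the tilt is centered at $\hat{\mu}_t$ and scaled by $1/\hat{\sigma}_t^2$. A sketch of the verification (our illustration, with an artificial target standing in for the unknown $\mu$):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=1_000)

mu_hat = x.mean()
sig2_hat = ((x - mu_hat) ** 2).mean()

target = mu_hat + 0.02                    # stands in for mu (assumption for illustration)
theta = (target - mu_hat) / sig2_hat      # theta_t = Delta_t / sigma_hat_t^2

q = (1.0 + theta * (x - mu_hat)) / len(x) # Q_t's mass on each sample point

total_mass = q.sum()                      # = 1 exactly (up to float error)
q_mean = (q * x).sum()                    # = target exactly (up to float error)
min_mass = q.min()                        # positive once theta * (b - a) < 1
```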
Then, taking $\hat{P}_t$-expectations and using the fact that $\mathbb{E}_{\hat{P}_t}[U_t] = 0$ (so the first term $-\mathbb{E}_{\hat{P}_t}[U_t]$ vanishes), we get
$$\mathrm{KL}(\hat{P}_t \,\|\, Q_t) \le \frac{\mathbb{E}_{\hat{P}_t}[U_t^2]}{2} + \frac{\mathbb{E}_{\hat{P}_t}[|U_t|^3]}{3(1-r_t)^3}. \tag{1}$$
Let us now compute these moments and plug them back in. First, for the second moment, $\mathbb{E}_{\hat{P}_t}[U_t^2] = \theta_t^2 \mathbb{E}_{\hat{P}_t}[(X - \hat{\mu}_t)^2] = \theta_t^2 \hat{\sigma}_t^2$. Second, for the third absolute moment, $|U_t| \le r_t$, so $\mathbb{E}_{\hat{P}_t}[|U_t|^3] \le r_t \mathbb{E}_{\hat{P}_t}[U_t^2] = r_t \theta_t^2 \hat{\sigma}_t^2$. Plugging these back into (1) gives
$$\mathrm{KL}(\hat{P}_t \,\|\, Q_t) \le \frac{\theta_t^2 \hat{\sigma}_t^2}{2}\left(1 + \frac{2r_t}{3(1-r_t)^3}\right) = \frac{\Delta_t^2}{2\hat{\sigma}_t^2}\left(1 + \frac{2r_t}{3(1-r_t)^3}\right).$$
Now, $r_t \to 0$ almost surely, so the factor $\big(1 + \frac{2r_t}{3(1-r_t)^3}\big)$ is $(1 + o(1))$ almost surely. Hence,
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \le \mathrm{KL}(\hat{P}_t \,\|\, Q_t) \le (1 + o(1)) \frac{\Delta_t^2}{2\hat{\sigma}_t^2}, \quad\text{almost surely as } t \to \infty. \tag{2}$$
This exhibits the local quadratic behavior of the KL: the KL cost is second-order in the mean deficit. Let us now inject the classical LIL for the mean, converting the mean LIL into a $\mathrm{KL}_{\inf}$ LIL to finish. The classic LIL for $\hat{\mu}_t$ tells us that almost surely,
$$\limsup_{t\to\infty} \frac{(\hat{\mu}_t - \mu)^2}{(2\sigma^2 \log\log t)/t} = 1.$$
Since $(\mu - \hat{\mu}_t)_+ \le |\hat{\mu}_t - \mu|$, the same upper bound holds for $\Delta_t$. In addition, the classic LIL also gives $\liminf (\hat{\mu}_t - \mu)/\sqrt{(2\sigma^2 \log\log t)/t} = -1$, which means that along a subsequence we have $\hat{\mu}_t - \mu < 0$ with asymptotically maximal magnitude. This forces, with probability one,
$$\limsup_{t\to\infty} \frac{\Delta_t^2}{(2\sigma^2 \log\log t)/t} = 1.$$
Also, with probability one, $\hat{\sigma}_t^2 \to \sigma^2$.
Hence,
$$\limsup_{t\to\infty} \frac{t}{\log\log t} \cdot \frac{\Delta_t^2}{2\hat{\sigma}_t^2} = \limsup_{t\to\infty} \left(\frac{\sigma^2}{\hat{\sigma}_t^2}\right)\left(\frac{\Delta_t^2}{(2\sigma^2 \log\log t)/t}\right) = 1, \quad\text{almost surely}.$$
To finish, combining this with (2) gives, with probability one,
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} \le 1,$$
which is exactly the desired LIL bound.

3. A Matching Lower Law of the Iterated Logarithm for the Empirical $\mathrm{KL}_{\inf}$

In this section, we show that, in addition to the upper bound we proved, there is a matching lower bound, so that the $\mathrm{KL}_{\inf}$ indeed satisfies an asymptotic LIL with equality. To recap, Theorem 1 established the upper bound, with probability one,
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} \le 1.$$
We now prove a matching lower bound, so that the limsup constant becomes exactly $1$. To preface the section, we state the main theorem first.

Theorem 3 Under the exact setup of Theorem 1, we in fact have
$$\mathbb{P}\left( \limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} = 1 \right) = 1.$$

To prove this, the only additional item we need is the Donsker-Varadhan variational inequality for $\mathrm{KL}(\nu \,\|\, Q)$ (Donsker and Varadhan, 1975), which we state below and prove in a self-contained way in the appendix for completeness.

Lemma 4 Let $(\Omega, \mathcal{F})$ be a measurable space and let $\nu, Q$ be probability measures on it. Suppose that $\nu \ll Q$. Then, for every bounded and measurable $\varphi : \Omega \to \mathbb{R}$,
$$\mathrm{KL}(\nu \,\|\, Q) \ge \int \varphi \, d\nu - \log\left(\int e^{\varphi} \, dQ\right).$$
If $\nu \not\ll Q$, then $\mathrm{KL}(\nu \,\|\, Q) = \infty$ by convention and the inequality is trivially true.

We are now ready to prove Theorem 3.

Proof [Theorem 3] As mentioned, we already have an almost-sure upper bound from Theorem 1, so it remains to prove the matching lower bound, with probability one,
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} \ge 1.$$
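Lemma 4 is easy to illustrate on discrete measures: every bounded $\varphi$ yields a lower bound on the KL, with equality at $\varphi = \log(d\nu/dQ)$. A small sketch (our illustration; the specific measures and the random test functions are arbitrary):

```python
import numpy as np

nu = np.array([0.2, 0.5, 0.3])            # nu and Q on a common 3-point space
q  = np.array([0.4, 0.4, 0.2])

kl = (nu * np.log(nu / q)).sum()          # KL(nu || Q), finite since nu << Q

def dv_bound(phi):
    """Donsker-Varadhan expression: int phi dnu - log int e^phi dQ."""
    return (nu * phi).sum() - np.log((q * np.exp(phi)).sum())

rng = np.random.default_rng(2)
bounds = [dv_bound(rng.normal(size=3)) for _ in range(100)]

best = dv_bound(np.log(nu / q))           # the optimal phi attains KL exactly
```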
To begin, as in the proof of Theorem 1, define $\Delta_t := (\mu - \hat{\mu}_t)_+ = \max\{0, \mu - \hat{\mu}_t\}$. Whenever $\Delta_t = 0$, we have $\hat{\mu}_t \ge \mu$, which means $\mathrm{KL}_{\inf}(\hat{P}_t, \mu) = 0$ by feasibility of $\hat{P}_t$ alone. The takeaway is that the lower bound for the limsup can only come from those times where $\Delta_t > 0$. Work on the probability-one event where both the strong law and the classic LIL hold: $\hat{\mu}_t \to \mu$, $\hat{\sigma}_t^2 \to \sigma^2 \in (0,\infty)$, and
$$\limsup_{t\to\infty} \frac{\hat{\mu}_t - \mu}{\sqrt{(2\sigma^2 \log\log t)/t}} = 1 \quad\text{and}\quad \liminf_{t\to\infty} \frac{\hat{\mu}_t - \mu}{\sqrt{(2\sigma^2 \log\log t)/t}} = -1.$$
Importantly, the second relation tells us that along some subsequence $\{t_k\}$, we have $\hat{\mu}_{t_k} < \mu$ with asymptotically optimal magnitude. That is, with probability one,
$$\limsup_{t\to\infty} \frac{\Delta_t^2}{(2\sigma^2 \log\log t)/t} = 1. \tag{3}$$
Indeed, letting $a_t := \sqrt{(2\sigma^2 \log\log t)/t}$, the LIL gives $\limsup (\hat{\mu}_t - \mu)/a_t = 1$ and $\liminf (\hat{\mu}_t - \mu)/a_t = -1$. By the second identity, there is a subsequence $t_k$ with $(\hat{\mu}_{t_k} - \mu)/a_{t_k} \to -1$. As a result, for all large $k$ we have $\hat{\mu}_{t_k} < \mu$ and therefore
$$\frac{\Delta_{t_k}}{a_{t_k}} = \frac{\mu - \hat{\mu}_{t_k}}{a_{t_k}} = -\frac{\hat{\mu}_{t_k} - \mu}{a_{t_k}} \to 1.$$
On the other hand, for all $t$, $\Delta_t \le |\hat{\mu}_t - \mu|$ and $\limsup |\hat{\mu}_t - \mu|/a_t = 1$, so $\limsup \Delta_t/a_t \le 1$. Combining both gives $\limsup \Delta_t/a_t = 1$, which is exactly (3). Let us now derive a deterministic lower bound on $\mathrm{KL}_{\inf}(\hat{P}_t, \mu)$ in terms of $\Delta_t$ and $\hat{\sigma}_t^2$. To begin, consider a particular $t \ge 1$ and an arbitrary measure $Q \in \mathcal{P}([a,b])$ such that $\int x \, dQ(x) \ge \mu$. Let $\lambda \ge 0$ be such that $\lambda(b - \mu) < 1$, so that for every $x \in [a,b]$ we have $1 + \lambda(\mu - x) \ge 1 - \lambda(b - \mu) > 0$. As a result, $\varphi_\lambda(x) := \log(1 + \lambda(\mu - x))$ is well-defined and bounded on $[a,b]$.
Now apply Lemma 4 with $\nu = \hat{P}_t$, this particular $Q$, and $\varphi = \varphi_\lambda$. Doing so gives
$$\mathrm{KL}(\hat{P}_t \,\|\, Q) \ge \int \log(1 + \lambda(\mu - x)) \, d\hat{P}_t(x) - \log\left(\int (1 + \lambda(\mu - x)) \, dQ(x)\right).$$
The second term is controlled by the mean constraint:
$$\int (1 + \lambda(\mu - x)) \, dQ(x) = 1 + \lambda\left(\mu - \int x \, dQ(x)\right) \le 1.$$
Since $\log(\cdot)$ is increasing, this implies
$$-\log\left(\int (1 + \lambda(\mu - x)) \, dQ(x)\right) \ge -\log(1) = 0.$$
It follows that for every such $Q$ and every $\lambda \in [0, 1/(b-\mu))$,
$$\mathrm{KL}(\hat{P}_t \,\|\, Q) \ge \int \log(1 + \lambda(\mu - x)) \, d\hat{P}_t(x).$$
Taking the infimum over all feasible $Q$ gives the lower bound
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \ge \sup_{0 \le \lambda < 1/(b-\mu)} \int \log(1 + \lambda(\mu - x)) \, d\hat{P}_t(x). \tag{4}$$
The key point is that we now have a lower bound on $\mathrm{KL}_{\inf}$ only in terms of the empirical measure. We will now evaluate the right-hand side at a carefully chosen $\lambda$ that matches the local quadratic scale we have been seeing for the KL. Namely, define
$$\lambda_t := \begin{cases} \dfrac{\Delta_t}{\hat{\sigma}_t^2}, & \text{if } \Delta_t > 0 \text{ and } \hat{\sigma}_t^2 > 0, \\ 0, & \text{otherwise}. \end{cases}$$
On our probability-one event, $\hat{\sigma}_t^2 \to \sigma^2 > 0$ and $\Delta_t \to 0$, so $\lambda_t \to 0$. Hence, for all $t$ sufficiently large, $\lambda_t(b - \mu) < 1$ and thus $\lambda_t$ is admissible in (4). For all such large $t$,
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \ge \int \log(1 + \lambda_t(\mu - x)) \, d\hat{P}_t(x). \tag{5}$$
Now let $X \sim \hat{P}_t$ and set $U_t := \lambda_t(\mu - X)$; as per our convention, $\mathbb{E}_{\hat{P}_t}$ is integration with respect to the empirical measure. Since $X \in [a,b]$, we must have $|\mu - X| \le b - a$. Hence, with probability one, $|U_t| \le r_t$ where
$$r_t := \lambda_t(b-a) = \frac{\Delta_t}{\hat{\sigma}_t^2}(b-a) \xrightarrow[t\to\infty]{} 0.$$
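Bound (4) is computable by a one-dimensional search. The sketch below (our illustration, again with an artificial target $m$ in place of the unknown $\mu$) grid-maximizes the right-hand side of (4) and compares it to the affine-tilt upper bound from Section 2; the two sandwich the true $\mathrm{KL}_{\inf}$, and both sit near the quadratic $\Delta^2/(2\hat{\sigma}^2)$.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=5_000)     # data in [a, b] = [0, 1]
b = 1.0

mu_hat = x.mean()
sig2_hat = ((x - mu_hat) ** 2).mean()
m = mu_hat + 0.01                         # artificial target mean (> mu_hat)

# Lower bound (4): sup over lambda in [0, 1/(b - m)) of E[log(1 + lambda (m - X))]
lams = np.linspace(0.0, 0.99 / (b - m), 500)
vals = np.log1p(lams[:, None] * (m - x[None, :])).mean(axis=1)
lower = vals.max()

# Upper bound: KL cost of the affine tilt whose mean is lifted to m
theta = (m - mu_hat) / sig2_hat
upper = -np.log1p(theta * (x - mu_hat)).mean()

quadratic = (m - mu_hat) ** 2 / (2.0 * sig2_hat)
```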
Since $r_t \to 0$ almost surely, with probability one there exists a finite random time $T_r$ such that $r_t < 1$ for all $t \ge T_r$. When $r_t = 0$ we have $\lambda_t = 0$, so $U_t \equiv 0$ and the inequality is trivial; thus, when applying Lemma 2, we may restrict attention to the case $r_t \in (0,1)$. Applying Lemma 2 with $r = r_t$ pointwise to $u = U_t$ gives
$$\log(1 + U_t) \ge U_t - \frac{U_t^2}{2} - \frac{|U_t|^3}{3(1-r_t)^3}.$$
Taking expectations with respect to $\hat{P}_t$ gives
$$\int \log(1 + \lambda_t(\mu - x)) \, d\hat{P}_t(x) \ge \mathbb{E}_{\hat{P}_t}[U_t] - \frac{1}{2}\mathbb{E}_{\hat{P}_t}[U_t^2] - \frac{1}{3(1-r_t)^3}\mathbb{E}_{\hat{P}_t}[|U_t|^3]. \tag{6}$$
We now compute each of these terms explicitly in terms of $\Delta_t$ and $\hat{\sigma}_t^2$. For the first term,
$$\mathbb{E}_{\hat{P}_t}[U_t] = \lambda_t \mathbb{E}_{\hat{P}_t}[\mu - X] = \lambda_t(\mu - \hat{\mu}_t) = \lambda_t \Delta_t.$$
For the second term, $\mathbb{E}_{\hat{P}_t}[U_t^2] = \lambda_t^2 \mathbb{E}_{\hat{P}_t}[(\mu - X)^2]$. Since $\hat{\mu}_t = \mathbb{E}_{\hat{P}_t}[X]$, we can expand $(\mu - X)^2$ around $\hat{\mu}_t$ using the basic fact that $\mu - X = (\mu - \hat{\mu}_t) + (\hat{\mu}_t - X)$, which holds for all $t$. Squaring and taking expectations with respect to $\hat{P}_t$,
$$\mathbb{E}_{\hat{P}_t}[(\mu - X)^2] = (\mu - \hat{\mu}_t)^2 + 2(\mu - \hat{\mu}_t)\underbrace{\mathbb{E}_{\hat{P}_t}[\hat{\mu}_t - X]}_{=0} + \underbrace{\mathbb{E}_{\hat{P}_t}[(\hat{\mu}_t - X)^2]}_{=\hat{\sigma}_t^2} = \hat{\sigma}_t^2 + (\mu - \hat{\mu}_t)^2.$$
Therefore, $\mathbb{E}_{\hat{P}_t}[U_t^2] = \lambda_t^2\big(\hat{\sigma}_t^2 + (\mu - \hat{\mu}_t)^2\big)$. Lastly, by definition $\lambda_t = \Delta_t/\hat{\sigma}_t^2$, and $\lambda_t = 0$ whenever $\Delta_t = 0$; hence $\lambda_t^2(\mu - \hat{\mu}_t)^2 = \lambda_t^2 \Delta_t^2$ for all $t$, and we get
$$\mathbb{E}_{\hat{P}_t}[U_t^2] = \lambda_t^2(\hat{\sigma}_t^2 + \Delta_t^2).$$
That concludes the second term.
It remains to compute the third moment. Using the fact that $|\mu - X| \le b - a$,
$$\mathbb{E}_{\hat{P}_t}[|U_t|^3] = \lambda_t^3 \mathbb{E}_{\hat{P}_t}[|\mu - X|^3] \le \lambda_t^3(b-a)^3.$$
With all the terms computed, substitute them back into (6) to get
$$\int \log(1 + \lambda_t(\mu - x)) \, d\hat{P}_t(x) \ge \lambda_t \Delta_t - \frac{1}{2}\lambda_t^2(\hat{\sigma}_t^2 + \Delta_t^2) - \frac{1}{3(1-r_t)^3}\lambda_t^3(b-a)^3.$$
Focus on the relevant large-$t$ regime where $\Delta_t > 0$ and $\hat{\sigma}_t^2 > 0$, and plug in $\lambda_t = \Delta_t/\hat{\sigma}_t^2$. The first term becomes $\lambda_t \Delta_t = \Delta_t^2/\hat{\sigma}_t^2$. The quadratic term becomes
$$\frac{1}{2}\lambda_t^2(\hat{\sigma}_t^2 + \Delta_t^2) = \frac{1}{2}\frac{\Delta_t^2}{\hat{\sigma}_t^4}(\hat{\sigma}_t^2 + \Delta_t^2) = \frac{\Delta_t^2}{2\hat{\sigma}_t^2} + \frac{\Delta_t^4}{2\hat{\sigma}_t^4}.$$
So the first two contributions simplify as
$$\lambda_t \Delta_t - \frac{1}{2}\lambda_t^2(\hat{\sigma}_t^2 + \Delta_t^2) = \frac{\Delta_t^2}{2\hat{\sigma}_t^2} - \frac{\Delta_t^4}{2\hat{\sigma}_t^4}.$$
Lastly, the cubic remainder term is bounded by
$$\frac{1}{3(1-r_t)^3}\lambda_t^3(b-a)^3 = \frac{(b-a)^3}{3(1-r_t)^3} \cdot \frac{\Delta_t^3}{\hat{\sigma}_t^6}.$$
Putting everything together, we get the lower bound
$$\int \log(1 + \lambda_t(\mu - x)) \, d\hat{P}_t(x) \ge \frac{\Delta_t^2}{2\hat{\sigma}_t^2} - \frac{\Delta_t^4}{2\hat{\sigma}_t^4} - \frac{(b-a)^3}{3(1-r_t)^3} \cdot \frac{\Delta_t^3}{\hat{\sigma}_t^6}. \tag{7}$$
We are now ready to go back to (5): for all $t$ large enough,
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \ge \frac{\Delta_t^2}{2\hat{\sigma}_t^2} - \frac{\Delta_t^4}{2\hat{\sigma}_t^4} - \frac{(b-a)^3}{3(1-r_t)^3} \cdot \frac{\Delta_t^3}{\hat{\sigma}_t^6}.$$
Now, $\Delta_t \to 0$, $\hat{\sigma}_t^2 \to \sigma^2 > 0$, and $r_t \to 0$ almost surely. As a consequence, both error terms are $o(\Delta_t^2/\hat{\sigma}_t^2)$ almost surely. Concretely, with probability one,
$$\frac{\Delta_t^4/\hat{\sigma}_t^4}{\Delta_t^2/\hat{\sigma}_t^2} = \frac{\Delta_t^2}{\hat{\sigma}_t^2} \xrightarrow[t\to\infty]{} 0, \qquad \frac{\Delta_t^3/\hat{\sigma}_t^6}{\Delta_t^2/\hat{\sigma}_t^2} = \frac{\Delta_t}{\hat{\sigma}_t^4} \xrightarrow[t\to\infty]{} 0,$$
and $(1-r_t)^{-3} \to 1$.
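Since every step between (5) and (7) is an exact identity or a pointwise application of Lemma 2, inequality (7) can be checked directly on data. A sketch under the same artificial-target convention as before (note the bound only becomes nontrivial once the deficit is small relative to $\hat{\sigma}_t^4$, so we take a small deficit):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, size=10_000)
b_minus_a = 1.0

mu_hat = x.mean()
sig2 = ((x - mu_hat) ** 2).mean()
delta = 0.001                              # Delta_t, via artificial target mu = mu_hat + delta
mu = mu_hat + delta

lam = delta / sig2                         # lambda_t = Delta_t / sigma_hat_t^2
r = lam * b_minus_a                        # r_t, must be < 1 for Lemma 2

lhs = np.log1p(lam * (mu - x)).mean()      # left side of (7)
rhs = (delta**2 / (2 * sig2)
       - delta**4 / (2 * sig2**2)
       - b_minus_a**3 / (3 * (1 - r)**3) * delta**3 / sig2**3)
```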
As a consequence, we get the local quadratic lower bound, almost surely as $t \to \infty$,
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \ge (1 - o(1)) \frac{\Delta_t^2}{2\hat{\sigma}_t^2}. \tag{8}$$
We now combine (8) with the mean LIL information (3). Since $\hat{\sigma}_t^2 \to \sigma^2$, we get
$$\limsup_{t\to\infty} \frac{t}{\log\log t} \cdot \frac{\Delta_t^2}{2\hat{\sigma}_t^2} = \limsup_{t\to\infty} \left(\frac{\sigma^2}{\hat{\sigma}_t^2}\right)\left(\frac{\Delta_t^2}{(2\sigma^2 \log\log t)/t}\right) = 1, \quad\text{almost surely}.$$
Now take an arbitrary $\varepsilon \in (0,1)$. By (8), on our almost-sure event there exists a random $T_\varepsilon < \infty$ such that for all $t \ge T_\varepsilon$,
$$\mathrm{KL}_{\inf}(\hat{P}_t, \mu) \ge (1 - \varepsilon) \frac{\Delta_t^2}{2\hat{\sigma}_t^2}.$$
Multiplying by $t/\log\log t$ and taking the limsup over $t \to \infty$ gives
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} \ge (1 - \varepsilon) \limsup_{t\to\infty} \frac{t}{\log\log t} \cdot \frac{\Delta_t^2}{2\hat{\sigma}_t^2} = (1 - \varepsilon) \cdot 1.$$
Since $\varepsilon$ was arbitrary, almost surely,
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} \ge 1.$$
Taken together with the already-proven upper bound of Theorem 1, we have equality:
$$\limsup_{t\to\infty} \frac{t \, \mathrm{KL}_{\inf}(\hat{P}_t, \mu)}{\log\log t} = 1,$$
which completes the proof.

4. Beyond the Bounded Case: An Extension to Sub-Gaussian and Other Tail Regimes

The bounded support condition that we imposed is in fact quite important, as we already detailed. The reason is simple: we can only keep the $\mathrm{KL}_{\inf}$ controlled if mass cannot be placed at arbitrarily far-out points on the right tail. Therefore, in this section, we start by showing this degeneracy of the $\mathrm{KL}_{\inf}$ formally. We then provide a general theorem that recovers the sharp asymptotic LIL in multiple tail regimes (e.g., sub-Gaussian tails). Intuitively, what we do is enforce a natural (and apt) constraint on the competing model class.
Having said that, the first thing we show rigorously is this idea of the $\mathrm{KL}_{\inf}$ collapsing on $\mathbb{R}$ when it is not constrained. Let $\nu$ be any probability measure on $\mathbb{R}$ with finite mean $\bar{\mu} := \int x \, d\nu(x)$, and consider the unconstrained definition
$$\mathrm{KL}_{\inf}^{\mathbb{R}}(\nu, m) := \inf\Big\{ \mathrm{KL}(\nu \,\|\, Q) : Q \in \mathcal{P}(\mathbb{R}),\ \int x \, dQ(x) \ge m \Big\}.$$
We show its degeneracy without a tail constraint in the next proposition.

Proposition 5 Suppose $m > \bar{\mu}$. Then $\mathrm{KL}_{\inf}^{\mathbb{R}}(\nu, m) = 0$.

What this proposition tells us is precisely the consequence of being allowed to "sprinkle" mass far out: it is easy to pay a very small (in fact, arbitrarily small) KL cost while still achieving an increase in the mean, because the class of competitors is unconstrained. A natural question upon seeing this issue is how to escape it. In other words, can we generalize beyond bounded random variables? The simple and natural answer is that we need only impose a constraint that eliminates such "sprinkling." An easy expansion is to consider the class of tail-controlled variables.$^2$ For this class, we leverage a growing envelope that will almost surely eventually contain all samples. This yields a time-constrained and nontrivial $\mathrm{KL}_{\inf}$, and our previous analysis for random variables with bounded support extends readily; that is the crux of this section. To begin, take $(B_t)_{t\ge 1}$ to be a deterministic, nondecreasing sequence with $B_t \to \infty$. Define the time-$t$ constrained version as
$$\mathrm{KL}_{\inf}^{(t)}(\nu, m) := \inf\Big\{ \mathrm{KL}(\nu \,\|\, Q) : Q \in \mathcal{P}([-B_t, B_t]),\ \int x \, dQ(x) \ge m \Big\}.$$
For concreteness, suppose that $\nu = \hat{P}_t$.
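Proposition 5's mechanism is easy to make quantitative: mix $(1-\varepsilon)\nu$ with a point mass at a far-out location $M$, with $\varepsilon$ chosen so the mixture's mean is exactly $m$; when $\nu$ has no atom at $M$, the KL cost is exactly $-\log(1-\varepsilon)$, which vanishes as $M$ grows. A minimal sketch (our illustration; $\nu$'s mean and the grid of $M$ values are arbitrary choices):

```python
import numpy as np

nu_mean = 0.5   # mean of some nu supported on [0, 1], with no atom far out
m = 0.6         # target mean, m > nu_mean

costs = []
for M in [10.0, 100.0, 10_000.0]:
    eps = (m - nu_mean) / (M - nu_mean)   # mass to sprinkle at M so the mean hits m
    # Q = (1 - eps) nu + eps delta_M; since nu({M}) = 0, dnu/dQ = 1/(1 - eps) nu-a.e.
    kl = -np.log1p(-eps)                  # KL(nu || Q) = -log(1 - eps)
    mixture_mean = (1 - eps) * nu_mean + eps * M
    costs.append((M, eps, kl, mixture_mean))
```

As $M$ grows, $\varepsilon$ and hence the KL cost shrink toward zero while the mean constraint stays met, which is exactly the degeneracy of Proposition 5.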
In this case, the interpretation is: what KL "budget" is needed to perturb the empirical law into another distribution supported on $[-B_t, B_t]$ with mean at least $m$? With all that said, we now present our main theorem. Intuitively, it says that if $B_t$ grows slowly enough relative to the LIL scale, then we again obtain the sharp asymptotic LIL constant $1$. In short, the theorem gives a two-sided LIL under a growing envelope.

Theorem 6 Suppose that $X_1, X_2, X_3, \dots$ are iid real-valued random variables with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$. Assume that almost surely there exists a finite (possibly random) $T_B$ such that $|X_i| \le B_t$ for all $t \ge T_B$ and all $1 \le i \le t$, for some deterministic nondecreasing sequence $(B_t)_{t \ge 1}$ satisfying, as $t \to \infty$,

$$B_t = o\left(\sqrt{\frac{t}{\log\log t}}\right).$$

Then,

$$\mathbb{P}\left( \limsup_{t \to \infty} \frac{t \, \mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu)}{\log\log t} = 1 \right) = 1.$$

The proof of this theorem follows that of the bounded case closely, with minor adjustments, so we defer it to the appendix along with the proofs of the lemmas and propositions. The takeaway is this: if we can control the tails well enough to obtain the almost sure envelope $B_t = o(\sqrt{t/\log\log t})$, then our bounded-support $\mathrm{KL}_{\inf}$ asymptotic LIL argument extends. Namely, after verifying that the uniform smallness parameter $r_t := 2 B_t \Delta_t / \widehat\sigma_t^2$ satisfies $r_t \to 0$ with probability one, the LIL argument extends to the time-varying constraint $\mathrm{KL}_{\inf}^{(t)}$, and the $\limsup$ constant remains the sharp $1$.

In Appendix B we give three broad instantiations of Theorem 6: sub-Gaussian tails, sub-exponential tails, and finite $p$-th moments with $p > 2$. Finally, in Appendix C we show tightness of the envelope and explain why Theorem 6 excludes the case $p = 2$. Given the importance of Appendix C, we summarize its main ideas here. First, Lemma 10 gives a feasible competitor $Q$, obtained by sprinkling a small mass $\varepsilon$ at the boundary point $B$. This yields the upper bound

$$\mathrm{KL}_{\inf}^{(t)}(\nu, m) \le \mathrm{KL}(\nu \| Q) \le -\log(1 - \varepsilon),$$

with equality when $\nu(\{B\}) = 0$, where $\varepsilon = (m - \bar\mu)/(B - \bar\mu)$ and $-\log(1-\varepsilon) \sim \varepsilon$ as $\varepsilon \downarrow 0$. Proposition 11 takes this further by combining the sprinkling bound with the mean LIL: whenever an almost surely valid deterministic envelope satisfies $B_t / \sqrt{t/\log\log t} \to \infty$, the constrained empirical cost collapses on the $\log\log t$ scale, in that

$$\limsup_{t\to\infty} \frac{t\, \mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu)}{\log\log t} = 0.$$

Then, Lemma 12 gives a Borel–Cantelli criterion under the weak second-moment assumption that establishes the validity of a particular envelope: if $\sum_t 1/B_t^2 < \infty$, then $|X_i| \le B_t$ for all $i \le t$ eventually almost surely. In particular, $B_t = \max_{s \le t} \sqrt{s}\,(\log s)^\gamma$ with $\gamma > \tfrac12$ is an almost surely valid deterministic envelope. Corollary 13 combines this with Proposition 11: under $\mathbb{E}[X_1^2] < \infty$, such envelopes force the same collapse of the $\log\log t$ normalization. We conclude the section with Proposition 14, which shows that in general one cannot verify the key assumption of Theorem 6 under only $p = 2$, namely a probability-one envelope with $B_t = o(\sqrt{t/\log\log t})$; we do this by constructing a distribution with finite variance for which $|X_t| > \sqrt{t/\log\log t}$ infinitely often almost surely. Together, the results in Appendix C show that $\sqrt{t/\log\log t}$ is a sharp boundary for deterministic envelopes.
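The empirical quantity $\mathrm{KL}_{\inf}^{(t)}$ is also computable in practice. The following sketch (our own illustration, not code from the paper) evaluates it through a one-dimensional concave dual in the spirit of Honda and Takemura (2010), maximizing $\mathbb{E}_{\widehat P_t}[\log(1 + \lambda(m - X))]$ over $\lambda \in [0, 1/(B_t - m))$, and then checks the local quadratic behavior $\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, m) \approx \Delta_t^2/(2\widehat\sigma_t^2)$ that drives Theorem 6. The toy sample, target mean, and envelope are arbitrary choices.

```python
import math

def klinf_dual(xs, m, B, tol=1e-12):
    """Evaluate KL_inf(P_hat, m) over P([-B, B]) via its 1-D concave dual
    (a Honda-Takemura-style representation; illustrative sketch)."""
    t = len(xs)
    def obj(lam):  # dual objective: E_{P_hat}[log(1 + lam*(m - X))]
        return sum(math.log1p(lam * (m - x)) for x in xs) / t
    # Golden-section search for the maximum over lam in [0, 1/(B - m)).
    gr = (math.sqrt(5) - 1) / 2
    a, b = 0.0, (1.0 - 1e-9) / (B - m)
    while b - a > tol:
        c, d = b - gr * (b - a), a + gr * (b - a)
        if obj(c) >= obj(d):
            b = d
        else:
            a = c
    return max(obj((a + b) / 2), 0.0)

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]          # toy "sample" with empirical mean 0
m, B = 0.05, 1.0                          # target mean slightly above mu_hat
mu_hat = sum(xs) / len(xs)
var_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)
kl = klinf_dual(xs, m, B)
quad = (m - mu_hat) ** 2 / (2 * var_hat)  # local quadratic approximation
assert abs(kl - quad) < 1e-4              # the two agree when m - mu_hat is small
```

When the mean gap is small relative to the support, the information projection is well approximated by the normalized squared gap, which is exactly the local equivalence the paper exploits.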
That is, below the boundary (assuming the envelope can be verified) we remain in the local quadratic regime of Theorem 6 with the sharp constant $1$. Sharpness breaks above the boundary, where sprinkling dominates and forces the normalized cost to $0$.

5. Conclusion

In this paper, we established the first exact LIL for the empirical $\mathrm{KL}_{\inf}$: namely, we showed that

$$\limsup_{t\to\infty} \frac{t\,\mathrm{KL}_{\inf}(\widehat P_t, \mu)}{\log\log t} = 1 \quad \text{almost surely}.$$

We did this first for random variables taking values in a compact support, and then extended it to unbounded data via slowly growing envelopes $[-B_t, B_t]$ with $B_t = o(\sqrt{t/\log\log t})$ that eventually contain the data almost surely. Crucially, we have the local equivalence, with probability one, that $\mathrm{KL}_{\inf}(\widehat P_t, \mu) = (1 + o(1))\,(\mu - \widehat\mu_t)_+^2 / (2 \widehat\sigma_t^2)$. The benefit of this local equivalence is that it makes the iterated-logarithm constant immediate from the classical mean LIL.

Several promising directions remain for future work. An immediate one is the analysis of higher-dimensional (or multiple-moment) constraints, where curvature and feasibility become much more subtle, although information projections remain natural. In addition, our paper worked under the standard i.i.d. data setup; it would be valuable to extend these results to dependent data (e.g., martingales, mixing sequences, or Markov chains), where the LIL behavior could persist but different tools would be needed (Stout, 1970; de la Peña et al., 2009; Howard et al., 2021; Dembo and Zeitouni, 1998).

References

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002. doi: 10.1023/A:1013689704352.
Akshay Balsubramani. Sharp finite-time iterated-logarithm martingale concentration, 2015. URL https://arxiv.org/abs/1405.2639.

Nicholas H. Bingham. Variants on the law of the iterated logarithm. Bulletin of the London Mathematical Society, 18(5):433–467, 1986. doi: 10.1112/blms/18.5.433.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0199535255. doi: 10.1093/acprof:oso/9780199535255.001.0001.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012. doi: 10.1561/2200000024.

Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122–142, 1996. doi: 10.1006/aama.1996.0007.

Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Kullback–Leibler upper confidence bounds for optimal sequential allocation. The Annals of Statistics, 41(3):1516–1541, 2013. doi: 10.1214/13-AOS1119.

Kai Lai Chung. On the maximum partial sums of sequences of independent random variables. Transactions of the American Mathematical Society, 64(2):205–233, 1948. doi: 10.1090/S0002-9947-1948-0026274-0. URL https://www.ams.org/journals/tran/1948-064-02/S0002-9947-1948-0026274-0/.

Harald Cramér. Sur un nouveau théorème-limite de la théorie des probabilités. Actualités Scientifiques et Industrielles, 736:5–23, 1938.

Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975. doi: 10.1214/aop/1176996454.

Imre Csiszár. Sanov property, generalized I-projection and a conditional limit theorem. The Annals of Probability, 12(3):768–793, 1984. doi: 10.1214/aop/1176993227.

Imre Csiszár and František Matúš. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474–1490, 2003. doi: 10.1109/TIT.2003.810633.

Donald A. Darling and Herbert Robbins. Confidence sequences for mean, variance, and median. Proceedings of the National Academy of Sciences, 58(1):66–68, 1967. doi: 10.1073/pnas.58.1.66.

Alejandro de Acosta. A new proof of the Hartman–Wintner law of the iterated logarithm. The Annals of Probability, 11(2):270–276, 1983. doi: 10.1214/aop/1176993596.

Victor H. de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-Normalized Processes: Limit Theory and Statistical Applications. Probability and Its Applications. Springer, 2009. ISBN 978-3-540-85635-1. doi: 10.1007/978-3-540-85636-8.

Paul Deheuvels and David M. Mason. Functional laws of the iterated logarithm for local empirical processes indexed by sets. The Annals of Probability, 22(3):1619–1661, 1994. doi: 10.1214/aop/1176988617.

Amir Dembo and Ofer Zeitouni. Large Deviations Techniques and Applications, volume 38 of Stochastic Modelling and Applied Probability. Springer, 2nd edition, 1998.

Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics, 28(1):1–47, 1975. doi: 10.1002/cpa.3160280102.

Paul Dupuis and Richard S. Ellis. A Weak Convergence Approach to the Theory of Large Deviations. Wiley Series in Probability and Statistics. Wiley-Interscience, 1997. ISBN 978-0-471-07672-8. doi: 10.1002/9781118165904.

John H. J. Einmahl and Andrew Rosalsky. The functional law of the iterated logarithm for the empirical process based on sample means. Journal of Theoretical Probability, 14(2):577–597, 2001. doi: 10.1023/A:1011128101094.

William Feller. The general form of the so-called law of the iterated logarithm. Transactions of the American Mathematical Society, 54(3):373–402, 1943. doi: 10.1090/S0002-9947-1943-0009263-7.

Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), volume 19 of Proceedings of Machine Learning Research, pages 359–376. PMLR, 2011.

John C. Gittins. Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. John Wiley & Sons, 1989. ISBN 978-0471920595.

Philip Hartman and Aurel Wintner. On the law of the iterated logarithm. American Journal of Mathematics, 63(1):169–176, 1941.

Junya Honda and Akimichi Takemura. An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Conference on Learning Theory (COLT), pages 67–79, 2010.

Junya Honda and Akimichi Takemura. Non-asymptotic analysis of a new bandit algorithm for semi-bounded rewards. Journal of Machine Learning Research, 16(1):3721–3756, 2015.

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales. Probability Surveys, 17:257–317, 2020. doi: 10.1214/18-PS321.

Steven R. Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet S. Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2):1055–1080, 2021. doi: 10.1214/20-AOS1991.

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On Bayesian upper confidence bounds for bandit problems. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22 of Proceedings of Machine Learning Research, pages 592–600. PMLR, 2012.

A. Khinchin. Über einen Satz der Wahrscheinlichkeitsrechnung. Fundamenta Mathematicae, 6(1):9–20, 1924. URL http://eudml.org/doc/214283.

A. Kolmogoroff. Über das Gesetz des iterierten Logarithmus. Mathematische Annalen, 101(1):126–135, 1929. doi: 10.1007/BF01454828. URL http://eudml.org/doc/159322.

Solomon Kullback and Richard A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. doi: 10.1214/aoms/1177729694.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985. doi: 10.1016/0196-8858(85)90002-8.

Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020. ISBN 978-1108486828. doi: 10.1017/9781108571401.

David M. Mason. A uniform functional law of the logarithm for the local empirical process. The Annals of Probability, 32(2):1391–1418, 2004. doi: 10.1214/009117904000000243.

Art B. Owen. Empirical likelihood ratio confidence intervals for a single functional. Biometrika, 75(2):237–249, 1988. doi: 10.1093/biomet/75.2.237.

Art B. Owen. Empirical Likelihood, volume 92 of Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, 2001. ISBN 978-1584880714. doi: 10.1201/9781420036152.

Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952. doi: 10.1090/S0002-9904-1952-09620-8.

Igor Nikolaevich Sanov. On the probability of large deviations of random magnitudes. Matematicheskii Sbornik (N.S.), 42(84)(1):11–44, 1957.

Galen R. Shorack and Jon A. Wellner. Empirical Processes with Applications to Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, 1986.

William F. Stout. The Hartman–Wintner law of the iterated logarithm for martingales. The Annals of Mathematical Statistics, 41(6):2158–2160, 1970. doi: 10.1214/aoms/1177696716.

William F. Stout. Almost Sure Convergence. Academic Press, 1974. ISBN 978-0126727500.

Volker Strassen. An invariance principle for the law of the iterated logarithm. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 3(3):211–226, 1964. doi: 10.1007/BF00534910.

Henry Teicher. On the law of the iterated logarithm. The Annals of Probability, 2(4):714–728, 1974. doi: 10.1214/aop/1176996614.

Aad W. van der Vaart and Jon A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, 1996. ISBN 978-0-387-94640-5. doi: 10.1007/978-1-4757-2545-2.

Abraham Wald. Sequential Analysis. John Wiley & Sons, 1947.

Appendix A. Omitted Proofs

Proof [Proof of Lemma 2] Fix $r \in (0, 1)$ and $u \in [-r, r]$. We use Taylor's theorem with remainder to approximate $f(u) = -\log(1+u)$ around $u = 0$. Since $f$ is thrice differentiable (i.e., in $C^3$) on $[-r, r]$, the third-order expansion gives

$$f(u) = f(0) + f'(0)\, u + \frac{f''(0)}{2}\, u^2 + \frac{f^{(3)}(\xi)}{6}\, u^3,$$

for some intermediate value $\xi$ between $0$ and $u$. We compute the first three derivatives of $f$ at $u = 0$: clearly $f(0) = -\log(1+0) = 0$; next, $f'(u) = -\frac{1}{1+u}$, so $f'(0) = -1$; then $f''(u) = \frac{1}{(1+u)^2}$, so $f''(0) = 1$; and finally $f^{(3)}(v) = -\frac{2}{(1+v)^3}$. Since $\xi \in [-r, r]$, we have $1 + \xi \ge 1 - r$, so the remainder can be bounded independently of the exact value of $\xi$:

$$|f^{(3)}(\xi)| \le \frac{2}{(1-r)^3}.$$

Therefore,

$$f(u) \le -u + \frac{u^2}{2} + \frac{1}{6} \cdot \frac{2}{(1-r)^3}\, |u|^3 = -u + \frac{u^2}{2} + \frac{|u|^3}{3(1-r)^3},$$

which is exactly the claim.

Proof [Proof of Lemma 4] Suppose $\int e^\phi \, dQ = +\infty$. Then the right-hand side in the lemma equals $-\infty$, and the inequality holds trivially.
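As an aside, the cubic-remainder bound of Lemma 2, $-\log(1+u) \le -u + u^2/2 + |u|^3/(3(1-r)^3)$ for $u \in [-r, r]$, is easy to sanity-check numerically (our own illustration; the grids of $r$ and $u$ are arbitrary):

```python
import math

def taylor_bound(u, r):
    # Right-hand side of Lemma 2: -u + u^2/2 + |u|^3 / (3*(1-r)^3)
    return -u + u * u / 2 + abs(u) ** 3 / (3 * (1 - r) ** 3)

# Verify -log(1+u) <= taylor_bound(u, r) on a grid of u in [-r, r].
for r in (0.1, 0.5, 0.9):
    for k in range(-1000, 1001):
        u = r * k / 1000
        assert -math.log1p(u) <= taylor_bound(u, r) + 1e-12
```

The bound is tightest near $u = -r$, which is exactly the regime that matters in the proofs, where the tilted density approaches zero.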
Therefore, without loss of generality assume $Z := \int e^\phi \, dQ \in (0, \infty)$, and define a probability measure $Q_\phi$ by exponential tilting:

$$\frac{dQ_\phi}{dQ} := \frac{e^\phi}{Z}.$$

Since $\phi$ is real-valued, $e^\phi > 0$ everywhere, so $Q_\phi$ and $Q$ have the same null sets; in particular, $\nu \ll Q$ implies $\nu \ll Q_\phi$. On the set where $\frac{d\nu}{dQ_\phi}$ is defined (which has full $\nu$-measure), the Radon–Nikodym chain rule gives

$$\frac{d\nu}{dQ} = \frac{d\nu}{dQ_\phi} \cdot \frac{dQ_\phi}{dQ},$$

so that

$$\log\left(\frac{d\nu}{dQ}\right) = \log\left(\frac{d\nu}{dQ_\phi}\right) + \phi - \log Z.$$

Integrating both sides with respect to $\nu$,

$$\mathrm{KL}(\nu \| Q) = \int \log\left(\frac{d\nu}{dQ}\right) d\nu = \mathrm{KL}(\nu \| Q_\phi) + \int \phi \, d\nu - \log\left(\int e^\phi \, dQ\right).$$

Since $\mathrm{KL}(\nu \| Q_\phi) \ge 0$ (possibly $+\infty$), we conclude in any case that

$$\mathrm{KL}(\nu \| Q) \ge \int \phi \, d\nu - \log\left(\int e^\phi \, dQ\right),$$

which was to be shown.

Proof [Proof of Proposition 5] We gave the intuition for this proof in Section 1. To begin, take $\varepsilon \in (0, 1)$ and pick $M \in \mathbb{R}$ large enough that $(1-\varepsilon)\bar\mu + \varepsilon M \ge m$, and define the mixture $Q_{\varepsilon,M} := (1-\varepsilon)\nu + \varepsilon \delta_M$. Two things follow. First, $Q_{\varepsilon,M}$ is a probability measure on $\mathbb{R}$. Second, it satisfies the mean constraint: by definition, $\int x \, dQ_{\varepsilon,M}(x) = (1-\varepsilon)\int x \, d\nu(x) + \varepsilon M = (1-\varepsilon)\bar\mu + \varepsilon M \ge m$ by our construction of $M$. Also, for any measurable set $A$ we have $Q_{\varepsilon,M}(A) \ge (1-\varepsilon)\nu(A)$, so $Q_{\varepsilon,M}$ dominates $\nu$: if $Q_{\varepsilon,M}(A) = 0$, then $\nu(A) = 0$ follows. Hence $\nu \ll Q_{\varepsilon,M}$ and $\mathrm{KL}(\nu \| Q_{\varepsilon,M})$ is finite.
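The variational inequality of Lemma 4 (a Donsker–Varadhan-type bound) is easy to verify numerically for discrete measures. In the sketch below (our own illustration; the measures and test functions are arbitrary), the bound holds for every $\phi$, with equality attained at $\phi = \log(d\nu/dQ)$:

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_rhs(p, q, phi):
    """Right-hand side of Lemma 4: integral of phi dnu - log integral of e^phi dQ."""
    return (sum(pi * f for pi, f in zip(p, phi))
            - math.log(sum(qi * math.exp(f) for qi, f in zip(q, phi))))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
# The variational bound KL >= dv_rhs must hold for every test function phi ...
for phi in ([0.0, 0.0, 0.0], [1.0, -1.0, 0.5], [-2.0, 3.0, 0.3]):
    assert kl(p, q) >= dv_rhs(p, q, phi) - 1e-12
# ... with equality at phi = log(dnu/dQ), exactly as the tilting proof shows.
phi_star = [math.log(pi / qi) for pi, qi in zip(p, q)]
assert abs(kl(p, q) - dv_rhs(p, q, phi_star)) < 1e-12
```

The equality case mirrors the proof: at $\phi = \log(d\nu/dQ)$ the tilted measure $Q_\phi$ coincides with $\nu$, so the slack term $\mathrm{KL}(\nu \| Q_\phi)$ vanishes.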
We now upper bound $\mathrm{KL}(\nu \| Q_{\varepsilon,M})$. Let $L := \frac{d\nu}{dQ_{\varepsilon,M}}$. We claim that, $\nu$-almost surely, $L \le \frac{1}{1-\varepsilon}$. Indeed, take an arbitrary $\delta > 0$ and let $A_\delta := \{L > \frac{1}{1-\varepsilon} + \delta\}$. Then

$$\nu(A_\delta) = \int_{A_\delta} L \, dQ_{\varepsilon,M} \ge \left(\frac{1}{1-\varepsilon} + \delta\right) Q_{\varepsilon,M}(A_\delta) \ge \left(\frac{1}{1-\varepsilon} + \delta\right)(1-\varepsilon)\,\nu(A_\delta) = \big(1 + \delta(1-\varepsilon)\big)\,\nu(A_\delta).$$

Since a nonnegative number cannot be at least itself multiplied by $1 + \delta(1-\varepsilon) > 1$ unless it is $0$, we must have $\nu(A_\delta) = 0$. As $\delta > 0$ was arbitrary, it follows that $L \le \frac{1}{1-\varepsilon}$ holds almost surely under $\nu$. Therefore,

$$\mathrm{KL}(\nu \| Q_{\varepsilon,M}) = \int \log\left(\frac{d\nu}{dQ_{\varepsilon,M}}\right) d\nu = \int \log(L)\, d\nu \le \log\left(\frac{1}{1-\varepsilon}\right) = -\log(1-\varepsilon).$$

Since $\varepsilon \in (0, 1)$ was also chosen arbitrarily, and $-\log(1-\varepsilon) \downarrow 0$ as $\varepsilon \downarrow 0$, we get

$$0 \le \mathrm{KL}_{\inf}^{\mathbb{R}}(\nu, m) \le \inf_{\varepsilon \in (0,1)} \big(-\log(1-\varepsilon)\big) = 0.$$

Thus $\mathrm{KL}_{\inf}^{\mathbb{R}}(\nu, m) = 0$, and we are done.

Proof [Proof of Theorem 6] Throughout, we work on the probability-one event on which three things hold: the mean LIL, the convergence $\widehat\sigma_t^2 \to \sigma^2$, and the envelope property for all large $t$. To begin, fix a particular (but arbitrary) outcome $\omega$ in this event, and restrict to $t$ large enough that $|X_i(\omega)| \le B_t$ holds for all $i \le t$; then $\widehat P_t$ is indeed supported on $[-B_t, B_t]$. At this point, the proofs from the bounded-support case apply verbatim after replacing $[a, b]$ by $[-B_t, B_t]$, $b - a$ by $2B_t$, and $\mathrm{KL}_{\inf}$ by $\mathrm{KL}_{\inf}^{(t)}$. If $m > B_t$, the constraint set is empty and $\mathrm{KL}_{\inf}^{(t)}(\nu, m) = +\infty$; but in our application, $m = \mu$ and $B_t \to \infty$.
As a consequence, for all sufficiently large $t$, the constraint is feasible. Recall that the upper bound construction of Theorem 1 uses the affine tilt

$$\frac{dQ_t}{d\widehat P_t}(x) = 1 + \theta_t (x - \widehat\mu_t), \qquad \theta_t = \frac{\Delta_t}{\widehat\sigma_t^2}, \qquad \Delta_t = (\mu - \widehat\mu_t)_+.$$

The only place where boundedness was actually needed was in guaranteeing that $1 + \theta_t(x - \widehat\mu_t)$ remains positive uniformly over the support. Here, $x \in [-B_t, B_t]$ and $\widehat\mu_t \in [-B_t, B_t]$, so $|x - \widehat\mu_t| \le 2B_t$, and thus

$$\inf_{x \in [-B_t, B_t]} \big(1 + \theta_t(x - \widehat\mu_t)\big) \ge 1 - 2 B_t \theta_t.$$

Hence it suffices that $2 B_t \theta_t \to 0$ almost surely. We know that $\theta_t = \Delta_t / \widehat\sigma_t^2$ and $\widehat\sigma_t^2 \to \sigma^2 > 0$, and the mean LIL tells us that $\Delta_t = O(\sqrt{(\log\log t)/t})$ almost surely along the full sequence (i.e., via the $\limsup$). Therefore, almost surely,

$$2 B_t \theta_t = \frac{2 B_t \Delta_t}{\widehat\sigma_t^2} = o\left(\sqrt{\frac{t}{\log\log t}}\right) \cdot O\left(\sqrt{\frac{\log\log t}{t}}\right) \xrightarrow[t\to\infty]{} 0.$$

This tells us that the tilt is feasible for all large $t$ (and certainly supported on $[-B_t, B_t]$), so it is admissible for $\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu)$. One more thing we need to account for in this upper bound is the Taylor remainder terms. In the bounded proof, those terms scale with powers of the uniform radius $r_t := \theta_t(b - a) = \theta_t(2B_t) = 2 B_t \theta_t$, and we just showed that $r_t \to 0$ with probability one. The same argument therefore gives the asymptotic upper bound: almost surely,

$$\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu) \le (1 + o(1))\, \frac{\Delta_t^2}{2 \widehat\sigma_t^2}.$$

We now turn to the lower bound to complete the proof. We use the same variational argument as before, but keep track of the remainder in a way that depends only on the uniform smallness parameter.
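As a quick numerical aside (our own illustration; the toy sample and target mean are arbitrary), the affine tilt above can be checked directly: it defines a valid probability measure whenever it stays positive on the support, and it meets the mean constraint exactly, since the tilted mean is $\widehat\mu_t + \theta_t \widehat\sigma_t^2 = \mu$ when $\Delta_t > 0$.

```python
mu = 0.3                                  # target mean, above the empirical mean
xs = [-1.0, -0.2, 0.0, 0.4, 0.8]          # toy sample in [-B, B] with B = 1
t = len(xs)
mu_hat = sum(xs) / t                      # empirical mean (here 0.0)
var_hat = sum((x - mu_hat) ** 2 for x in xs) / t
delta = max(mu - mu_hat, 0.0)
theta = delta / var_hat
# Tilted weights: dQ/dP_hat = 1 + theta*(x - mu_hat) applied to uniform mass 1/t.
w = [(1 + theta * (x - mu_hat)) / t for x in xs]
assert all(wi > 0 for wi in w)            # tilt positive over the sample's support
assert abs(sum(w) - 1.0) < 1e-12          # Q is a probability measure
mean_Q = sum(wi * x for wi, x in zip(w, xs))
assert abs(mean_Q - mu) < 1e-12           # the tilt meets the constraint exactly
```

The last assertion reflects the identity $\sum_i (x_i - \widehat\mu_t) x_i / t = \widehat\sigma_t^2$, which is why the choice $\theta_t = \Delta_t/\widehat\sigma_t^2$ shifts the mean by exactly $\Delta_t$.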
Take a particular $t$ large enough that $\widehat P_t$ is supported on $[-B_t, B_t]$ and $B_t \ge |\mu|$ (this must hold eventually since $B_t \to \infty$). Now consider any feasible $Q \in \mathcal{P}([-B_t, B_t])$ with $\int x \, dQ(x) \ge \mu$, and any $\lambda \in [0, 1/(B_t - \mu))$. Exactly as in (4), with the same choice $\phi_\lambda(x) := \log(1 + \lambda(\mu - x))$, we have

$$\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu) \ge \int \log\big(1 + \lambda(\mu - x)\big)\, d\widehat P_t(x).$$

Now pick $\lambda = \lambda_t := \Delta_t / \widehat\sigma_t^2$. For large $t$, this lies in $[0, 1/(B_t - \mu))$: eventually $B_t \ge |\mu|$ and, with probability one, $r_t := 2 B_t \lambda_t \to 0$, so for all such large $t$,

$$\lambda_t (B_t - \mu) \le \lambda_t (B_t + |\mu|) \le 2 B_t \lambda_t = r_t < 1.$$

Now let $X \sim \widehat P_t$ and set $U_t := \lambda_t(\mu - X)$. For all $x \in [-B_t, B_t]$, we have $|\mu - x| \le |\mu| + B_t \le 2 B_t$, hence $|U_t| \le \lambda_t \cdot 2 B_t =: r_t$. On our almost sure event, $\Delta_t = O(\sqrt{(\log\log t)/t})$ and $\widehat\sigma_t^2 \to \sigma^2 > 0$, so with probability one,

$$r_t = 2 B_t \lambda_t = \frac{2 B_t \Delta_t}{\widehat\sigma_t^2} \xrightarrow[t\to\infty]{} 0,$$

and in particular $r_t < 1$ for all large $t$. In the case $r_t = 0$, we would have $U_t \equiv 0$ and the inequality would be trivial, so in applying Lemma 2 we may restrict to $r_t \in (0, 1)$. Applying Lemma 2 with $r = r_t$ gives

$$\log(1 + U_t) \ge U_t - \frac{U_t^2}{2} - \frac{|U_t|^3}{3(1 - r_t)^3}.$$

Further, since $|U_t| \le r_t$ as shown above, it follows that $|U_t|^3 \le r_t U_t^2$. Taking expectations and using all of this gives

$$\mathbb{E}_{\widehat P_t}[\log(1 + U_t)] \ge \mathbb{E}_{\widehat P_t}[U_t] - \left(\frac{1}{2} + \frac{r_t}{3(1 - r_t)^3}\right) \mathbb{E}_{\widehat P_t}[U_t^2].$$
Consider the relevant case $\Delta_t > 0$, so that necessarily $\lambda_t > 0$ and $\mu - \widehat\mu_t = \Delta_t$. Then $\mathbb{E}_{\widehat P_t}[U_t] = \lambda_t(\mu - \widehat\mu_t) = \lambda_t \Delta_t$, and

$$\mathbb{E}_{\widehat P_t}[U_t^2] = \lambda_t^2 \, \mathbb{E}_{\widehat P_t}[(\mu - X)^2] = \lambda_t^2 (\widehat\sigma_t^2 + \Delta_t^2),$$

where the last step follows, as before, from writing $(\mu - X) = (\mu - \widehat\mu_t) + (\widehat\mu_t - X)$ and using $\mathbb{E}_{\widehat P_t}[\widehat\mu_t - X] = 0$. Plugging in $\lambda_t = \Delta_t / \widehat\sigma_t^2$ gives $\mathbb{E}_{\widehat P_t}[U_t] = \frac{\Delta_t^2}{\widehat\sigma_t^2}$ and $\mathbb{E}_{\widehat P_t}[U_t^2] = \frac{\Delta_t^2}{\widehat\sigma_t^2} + \frac{\Delta_t^4}{\widehat\sigma_t^4}$. Therefore,

$$\mathbb{E}_{\widehat P_t}[\log(1 + U_t)] \ge \frac{\Delta_t^2}{\widehat\sigma_t^2} - \left(\frac{1}{2} + \frac{r_t}{3(1 - r_t)^3}\right)\left(\frac{\Delta_t^2}{\widehat\sigma_t^2} + \frac{\Delta_t^4}{\widehat\sigma_t^4}\right).$$

Since $r_t \to 0$ and $\Delta_t \to 0$ with probability one, the factor $\frac{1}{2} + \frac{r_t}{3(1 - r_t)^3}$ is simply $\frac{1}{2} + o(1)$, and the last term involving $\Delta_t^4$ reduces to $o(\Delta_t^2)$. Hence, with probability one,

$$\mathbb{E}_{\widehat P_t}[\log(1 + U_t)] \ge (1 - o(1))\, \frac{\Delta_t^2}{2 \widehat\sigma_t^2}.$$

Recalling that for our choice of $\lambda_t$ we had $\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu) \ge \mathbb{E}_{\widehat P_t}[\log(1 + U_t)]$, we get, almost surely,

$$\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu) \ge (1 - o(1))\, \frac{\Delta_t^2}{2 \widehat\sigma_t^2}.$$

We already showed the matching upper bound $\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu) \le (1 + o(1))\, \frac{\Delta_t^2}{2 \widehat\sigma_t^2}$, so we have both versions. Hence, almost surely as $t \to \infty$,

$$\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu) = \big(1 + o(1)\big)\, \frac{\Delta_t^2}{2 \widehat\sigma_t^2}.$$

The classical mean LIL implies, as in (3), that almost surely

$$\limsup_{t\to\infty} \frac{\Delta_t^2}{(2\sigma^2 \log\log t)/t} = 1.$$

Together with the almost sure convergence $\widehat\sigma_t^2 \to \sigma^2$, this gives

$$\limsup_{t\to\infty} \frac{t}{\log\log t} \cdot \frac{\Delta_t^2}{2\widehat\sigma_t^2} = 1,$$

and therefore, using the asymptotic equivalence just proven, with probability one,

$$\limsup_{t\to\infty} \frac{t\,\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu)}{\log\log t} = 1,$$

which completes the proof.

Appendix B. Applications of Theorem 6

The power of Theorem 6 lies in its reduction of the problem to verifying the almost sure envelope $|X_i| \le B_t$ for all $i \le t$ eventually, with $B_t = o(\sqrt{t/\log\log t})$. Such a theorem is only useful if it holds across various tail regimes, which is what we show next. The first regime is sub-Gaussian tails: assume there exists $v > 0$ such that for all $x \ge 0$,

$$\mathbb{P}(|X_1 - \mu| > x) \le 2 \exp\left(-\frac{x^2}{2v}\right). \qquad (9)$$

Given this definition, the lemma below provides an almost sure envelope for sub-Gaussian random variables.

Lemma 7 Assume that (9) holds. Fix any $\varepsilon > 0$ and define

$$u_t := \sqrt{2(v + \varepsilon) \log t}, \qquad B_t := |\mu| + u_t.$$

Then, almost surely there exists $T < \infty$ such that for all $t \ge T$ and all $1 \le i \le t$, we have $|X_i| \le B_t$. Moreover, $B_t = o(\sqrt{t/\log\log t})$.

Proof By stationarity, for each integer $t \ge 3$,

$$\mathbb{P}(|X_t - \mu| > u_t) = \mathbb{P}(|X_1 - \mu| > u_t) \le 2 \exp\left(-\frac{u_t^2}{2v}\right) = 2 \exp\left(-\frac{v + \varepsilon}{v} \log t\right) = 2\, t^{-(1 + \varepsilon/v)}.$$

Since $1 + \varepsilon/v > 1$, the series $\sum_{t \ge 3} t^{-(1+\varepsilon/v)}$ converges, so $\sum_{t=3}^\infty \mathbb{P}(|X_t - \mu| > u_t) < \infty$. By the first Borel–Cantelli lemma, we conclude that $\mathbb{P}(|X_t - \mu| > u_t \text{ infinitely often}) = 0$. In other words, with probability one there exists a random $T_0$ such that for all $t \ge T_0$ we have $|X_t - \mu| \le u_t$. Since $u_t$ is nondecreasing in $t$, it follows that for any $t \ge T_0$, $\max_{T_0 \le i \le t} |X_i - \mu| \le u_t$. Now define the (finite) random constant $M_0 := \max_{1 \le i \le T_0} |X_i - \mu|$. Since $u_t \to \infty$, there exists $T_1$ such that $u_t \ge M_0$ for all $t \ge T_1$. As such, for all $t \ge \max\{T_0, T_1\}$,

$$\max_{1 \le i \le t} |X_i - \mu| = \max\Big\{ \max_{1 \le i \le T_0} |X_i - \mu|, \ \max_{T_0 \le i \le t} |X_i - \mu| \Big\} \le \max\{M_0, u_t\} = u_t.$$
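Lemma 7's envelope can be illustrated by simulation (a sketch of our own, with illustrative constants $v = 1$, $\varepsilon = 1/2$, $\mu = 0$; the seed and sample size are arbitrary): after some finite random time, the running maximum of $|X_i|$ stays below $B_t$, while the deterministic ratio $B_t/\sqrt{t/\log\log t}$ tends to $0$.

```python
import math, random

v, eps = 1.0, 0.5
def B(t):  # Lemma 7's envelope with mu = 0: sqrt(2*(v+eps)*log t)
    return math.sqrt(2 * (v + eps) * math.log(t))

# Deterministic part: B_t = o(sqrt(t / log log t)).
def ratio(t):
    return B(t) / math.sqrt(t / math.log(math.log(t)))
assert ratio(10 ** 6) < ratio(10 ** 4) < ratio(10 ** 2)

# Simulated part: the envelope eventually contains all samples (in this run).
random.seed(0)
n = 20000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
running_max = max(abs(xs[0]), abs(xs[1]))
last_violation = 2
for t in range(3, n + 1):
    running_max = max(running_max, abs(xs[t - 1]))
    if running_max > B(t):
        last_violation = t
T_star = last_violation + 1   # envelope holds for all t >= T_star in this run
assert T_star < n             # i.e., it holds eventually, as Lemma 7 predicts
```

Early violations are expected (the envelope starts small), but Borel–Cantelli guarantees only finitely many of them, which is exactly what the run exhibits.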
It follows that for all such $t$ and all $1 \le i \le t$, $|X_i| \le |\mu| + |X_i - \mu| \le |\mu| + u_t = B_t$, which proves the envelope property. To finish, note that $B_t = |\mu| + \sqrt{2(v+\varepsilon)\log t}$ satisfies

$$\frac{B_t}{\sqrt{t/\log\log t}} = \frac{|\mu|}{\sqrt{t/\log\log t}} + \sqrt{2(v+\varepsilon)}\, \frac{\sqrt{\log t}\,\sqrt{\log\log t}}{\sqrt{t}} \xrightarrow[t\to\infty]{} 0.$$

Thus $B_t = o(\sqrt{t/\log\log t})$, and both properties hold for sub-Gaussian tails, so we are done.

Combining Lemma 7 with Theorem 6 yields the desired LIL statement for sub-Gaussian random variables: constraining this class to a slowly growing envelope that eventually contains the data almost surely ensures that the empirical $\mathrm{KL}_{\inf}$ once again satisfies an exact LIL with sharp constant $1$. Beyond the sub-Gaussian case, we can also construct such an almost sure envelope for sub-exponential random variables.

Lemma 8 Suppose there exist constants $K, c > 0$ such that for all $x \ge 0$, $\mathbb{P}(|X_1 - \mu| > x) \le K e^{-cx}$. Fix any $\varepsilon > 0$ and define, for $t \ge 3$,

$$u_t := \frac{1 + \varepsilon}{c} \log t, \qquad B_t := |\mu| + u_t,$$

and set $B_1 := B_2 := B_3$ so that $(B_t)_{t \ge 1}$ is nondecreasing. Then, almost surely there exists $T < \infty$ such that for all $t \ge T$ and all $1 \le i \le t$, one has $|X_i| \le B_t$. In addition, $B_t = o(\sqrt{t/\log\log t})$.

Proof The sub-exponential proof follows the sub-Gaussian one very closely. To begin, for each integer $t \ge 3$, apply stationarity together with the assumed tail bound.
Doing so gives

$$\mathbb{P}(|X_t - \mu| > u_t) = \mathbb{P}(|X_1 - \mu| > u_t) \le K e^{-c u_t} = K e^{-(1+\varepsilon)\log t} = K\, t^{-(1+\varepsilon)}.$$

Now, $\sum_{t \ge 3} K t^{-(1+\varepsilon)} < \infty$, so by the first Borel–Cantelli lemma we again have $\mathbb{P}(|X_t - \mu| > u_t \text{ infinitely often}) = 0$. Hence almost surely there exists a finite $T_0$ such that $|X_t - \mu| \le u_t$ for all $t \ge T_0$. Since $(u_t)$ is nondecreasing, for any $t \ge T_0$ and any $T_0 \le i \le t$ we have $|X_i - \mu| \le u_i \le u_t$, and consequently $\max_{T_0 \le i \le t} |X_i - \mu| \le u_t$. Now let $M_0 := \max_{1 \le i \le T_0} |X_i - \mu| < \infty$. Since $u_t \to \infty$, there exists $T_1$ such that $u_t \ge M_0$ for all $t \ge T_1$. So for all $t \ge \max\{T_0, T_1\}$, $\max_{1 \le i \le t} |X_i - \mu| \le u_t$, and hence for all $1 \le i \le t$,

$$|X_i| \le |\mu| + |X_i - \mu| \le |\mu| + u_t = B_t,$$

which proves the envelope property. It remains to show that $B_t = o(\sqrt{t/\log\log t})$. Indeed,

$$\frac{B_t}{\sqrt{t/\log\log t}} = \frac{|\mu|}{\sqrt{t/\log\log t}} + \frac{1+\varepsilon}{c} \cdot \frac{\log t \cdot \sqrt{\log\log t}}{\sqrt{t}} \xrightarrow[t\to\infty]{} 0,$$

which concludes the proof: both properties are satisfied, this time for sub-exponential random variables.

Lastly, we present an almost sure envelope for random variables with a finite $p$-th moment.

Lemma 9 Suppose that $\mathbb{E}[|X_1|^p] < \infty$ for some $p > 2$. Take any $\gamma > 1/p$ and define, for $t \ge 3$,

$$u_t := t^{1/p} (\log t)^\gamma, \qquad B_t := \max_{3 \le s \le t} u_s,$$

and set $B_1 := B_2 := B_3$. Then $(B_t)_{t \ge 1}$ is deterministic, nondecreasing, and $B_t \to \infty$. In addition, almost surely there exists $T < \infty$ such that for all $t \ge T$ and all $1 \le i \le t$, one has $|X_i| \le B_t$, and $B_t = o(\sqrt{t/\log\log t})$.

Proof Once again, for each integer $t \ge 3$, apply Markov's inequality and the same stationarity:

$$\mathbb{P}(|X_t| > u_t) = \mathbb{P}(|X_1| > u_t) \le \frac{\mathbb{E}[|X_1|^p]}{u_t^p} = \frac{\mathbb{E}[|X_1|^p]}{t (\log t)^{p\gamma}}.$$
Since $p\gamma > 1$, the series $\sum_{t \ge 3} \frac{1}{t (\log t)^{p\gamma}}$ converges, so $\sum_{t=3}^\infty \mathbb{P}(|X_t| > u_t) < \infty$. The same first Borel–Cantelli lemma tells us that $\mathbb{P}(|X_t| > u_t \text{ infinitely often}) = 0$; in other words, with probability one there exists a finite $T_0$ such that $|X_t| \le u_t$ for all $t \ge T_0$. By construction, $B_t \ge u_t$ and $B_t$ is nondecreasing, so for any $t \ge T_0$ and any $T_0 \le i \le t$, $|X_i| \le u_i \le B_i \le B_t$; hence $\max_{T_0 \le i \le t} |X_i| \le B_t$. Let $M_0 := \max_{1 \le i \le T_0} |X_i| < \infty$. Since $B_t \to \infty$, there exists $T_1$ such that $B_t \ge M_0$ for all $t \ge T_1$. Hence for all $t \ge \max\{T_0, T_1\}$, $\max_{1 \le i \le t} |X_i| \le B_t$, which is the envelope property. It remains to show that $B_t = o(\sqrt{t/\log\log t})$. Up to the running maximum, $B_t \sim t^{1/p}(\log t)^\gamma$ with $p > 2$, so

$$\frac{B_t}{\sqrt{t/\log\log t}} \le t^{1/p - 1/2} (\log t)^\gamma \sqrt{\log\log t} \xrightarrow[t\to\infty]{} 0.$$

Once again both properties hold, this time for random variables with a finite $p$-th moment, which concludes the proof.

Appendix C. What happens in the $p = 2$ case, and is the envelope tight?

In this section, we explain what happens under only a weak second-moment assumption (i.e., $p = 2$) on the tail. We detail two complementary points. First, suppose a deterministic envelope $(B_t)_{t \ge 1}$ grows strictly faster than $\sqrt{t/\log\log t}$ and eventually contains the data almost surely. In that case, the time-varying projection cost $\mathrm{KL}_{\inf}^{(t)}(\widehat P_t, \mu)$ is forced onto a scale strictly smaller than $\log\log t$.
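For intuition about the $\sqrt{t/\log\log t}$ boundary, the three envelope growth rates from Appendix B can be compared against it numerically (our own illustration; the constants, and the choice $p = 3$, $\gamma = 1/2$ for the moment case, are arbitrary):

```python
import math

def lil_scale(t):
    return math.sqrt(t / math.log(math.log(t)))

# Envelope growth in the three regimes of Appendix B (illustrative constants).
envelopes = {
    "sub-Gaussian":    lambda t: math.sqrt(3.0 * math.log(t)),
    "sub-exponential": lambda t: 1.5 * math.log(t),
    "p-th moment":     lambda t: t ** (1.0 / 3.0) * math.sqrt(math.log(t)),  # p=3, gamma=1/2
}
ratios = {}
for name, B in envelopes.items():
    ratios[name] = [B(10 ** k) / lil_scale(10 ** k) for k in (4, 6, 8)]
    r = ratios[name]
    assert r[2] < r[1] < r[0]   # B_t / sqrt(t/loglog t) decreases toward 0
```

The finite-moment envelope is the slowest to fall below the boundary, and at $p = 2$ the exponent $1/p - 1/2$ vanishes, which is precisely why the theorem excludes that case.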
Second, if only $\mathbb{E}[X_1^2] < \infty$, then we cannot in general verify the key assumption needed in Theorem 6: the existence of an almost surely valid deterministic envelope with $B_t = o(\sqrt{t/\log\log t})$. In other words, there exist finite-variance distributions for which, almost surely, every such envelope fails. This is what we formalize in this section. To begin, recall our time-$t$ constrained functional

$$\mathrm{KL}_{\inf}^{(t)}(\nu, m) := \inf\Big\{ \mathrm{KL}(\nu \| Q) : Q \in \mathcal{P}([-B_t, B_t]), \ \int x \, dQ(x) \ge m \Big\},$$

and, in the empirical specialization $\nu = \widehat P_t$, let $\widehat\mu_t = \int x \, d\widehat P_t(x)$ and $\Delta_t := (\mu - \widehat\mu_t)_+$. With this setup in mind, we first present a lemma that builds intuition for what happens when we "sprinkle" mass at the boundary.

Lemma 10 Take $B > 0$ and let $\nu \in \mathcal{P}([-B, B])$ have mean $\bar\mu := \int x \, d\nu(x)$. For any $m \in (\bar\mu, B)$, define $\varepsilon := \frac{m - \bar\mu}{B - \bar\mu} \in (0, 1)$ and $Q := (1-\varepsilon)\nu + \varepsilon \delta_B$. Then $Q \in \mathcal{P}([-B, B])$, $\int x \, dQ(x) = m$, and $\mathrm{KL}(\nu \| Q) \le -\log(1-\varepsilon)$, with equality if $\nu(\{B\}) = 0$. Finally, as $\varepsilon \downarrow 0$, $-\log(1-\varepsilon) = (1 + o(1))\,\varepsilon$; in particular, if $\varepsilon \le \frac12$ then $-\log(1-\varepsilon) \le 2\varepsilon$.

Proof The first thing to show is that $Q$ is supported on $[-B, B]$ with mean $m$. By construction, $\nu$ is supported on $[-B, B]$ and $\delta_B$ is supported at $B \in [-B, B]$; as a consequence, their convex combination $Q$ is as well. Computing the mean:

$$\int x \, dQ(x) = \int x \, d\big((1-\varepsilon)\nu + \varepsilon\delta_B\big)(x) = (1-\varepsilon)\int x \, d\nu(x) + \varepsilon \int x \, d\delta_B(x) = (1-\varepsilon)\bar\mu + \varepsilon B = \bar\mu + \varepsilon(B - \bar\mu) = \bar\mu + (m - \bar\mu) = m.$$
With that shown, let us proceed to show absolute continuity, namely $\nu \ll Q$, and derive a Radon-Nikodym bound. For any measurable set $A$ we have
\[
Q(A) = (1 - \varepsilon)\nu(A) + \varepsilon \delta_B(A) \ge (1 - \varepsilon)\nu(A).
\]
So if $Q(A) = 0$, then $(1 - \varepsilon)\nu(A) \le Q(A) = 0$, hence $\nu(A) = 0$, showing that $\nu \ll Q$ indeed.

Now let $L := \frac{d\nu}{dQ}$ be the Radon-Nikodym derivative. We will show that $L \le \frac{1}{1 - \varepsilon}$ $\nu$-almost surely. Take $\eta > 0$ and define the measurable set $A_\eta := \big\{ L > \frac{1}{1 - \varepsilon} + \eta \big\}$. Using $\nu(A_\eta) = \int_{A_\eta} L \, dQ$ and the definition of $A_\eta$,
\[
\nu(A_\eta) = \int_{A_\eta} L \, dQ \ge \Big( \frac{1}{1 - \varepsilon} + \eta \Big) Q(A_\eta) \ge \Big( \frac{1}{1 - \varepsilon} + \eta \Big)(1 - \varepsilon)\nu(A_\eta) = \big(1 + \eta(1 - \varepsilon)\big)\nu(A_\eta).
\]
The only way a nonnegative number can be at least itself multiplied by $1 + \eta(1 - \varepsilon) > 1$ is if it is $0$. So $\nu(A_\eta) = 0$ for every $\eta > 0$. Since $\eta > 0$ was arbitrary, we conclude that $\frac{d\nu}{dQ} = L \le \frac{1}{1 - \varepsilon}$ almost surely under $\nu$.

The KL bound is now immediate: using our bound on $L$ and the definition,
\[
\mathrm{KL}(\nu \| Q) = \int \log\Big(\frac{d\nu}{dQ}\Big) d\nu \le \int \log\Big(\frac{1}{1 - \varepsilon}\Big) d\nu = -\log(1 - \varepsilon).
\]
As for equality: if $\nu(\{B\}) = 0$, then $\delta_B$ places its mass outside the support of $\nu$, so on the support of $\nu$ we have $Q = (1 - \varepsilon)\nu$. Therefore $\frac{d\nu}{dQ} = \frac{1}{1 - \varepsilon}$ holds $\nu$-almost surely, and plugging this back into the definition gives $\mathrm{KL}(\nu \| Q) = \int \log\big(\frac{1}{1 - \varepsilon}\big) d\nu = -\log(1 - \varepsilon)$.

Since $-\log(1 - \varepsilon) \to 0$ and $\varepsilon \to 0$ as $\varepsilon \downarrow 0$, with $\lim_{\varepsilon \downarrow 0} \frac{-\log(1 - \varepsilon)}{\varepsilon} = 1$, it follows that $-\log(1 - \varepsilon) = (1 + o(1))\varepsilon$ as $\varepsilon \downarrow 0$. It remains to show the bound we get when $\varepsilon \le \frac{1}{2}$.
Note that for $0 \le u \le \varepsilon$ we have $1 - u \ge 1 - \varepsilon$ and thus $\frac{1}{1 - u} \le \frac{1}{1 - \varepsilon}$. As such,
\[
-\log(1 - \varepsilon) = \int_0^{\varepsilon} \frac{1}{1 - u} \, du \le \int_0^{\varepsilon} \frac{1}{1 - \varepsilon} \, du = \frac{\varepsilon}{1 - \varepsilon} \le 2\varepsilon,
\]
which completes our proof.

We are now ready to formalize the idea that large envelopes collapse the $\log\log t / t$ scale, via the following proposition.

Proposition 11 Suppose $X_1, X_2, X_3, \ldots$ are iid with mean $\mu$ and variance $\sigma^2 \in (0, \infty)$. Let $(B_t)_{t \ge 1}$ be deterministic and nondecreasing with $B_t \to \infty$, and assume that the envelope event holds:
\[
\mathcal{E}_B := \big\{ \exists\, T_B < \infty : \forall\, t \ge T_B,\ \max_{1 \le i \le t} |X_i| \le B_t \big\}.
\]
If in addition $\frac{B_t}{\sqrt{t/\log\log t}} \xrightarrow[t \to \infty]{} \infty$, then on the probability-one event where both $\mathcal{E}_B$ and the classical mean LIL hold, we have
\[
\limsup_{t \to \infty} \frac{t \, \mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu)}{\log\log t} = 0.
\]

Proof We begin by fixing an outcome $\omega$ in the intersection of the two probability-one events, namely the envelope event $\mathcal{E}_B$ and the classical mean LIL; with a minor abuse of notation, we omit $\omega$ in what follows.

We first show a lower bound on $B_t - \widehat{\mu}_t$ for large $t$. By the mean LIL, $\widehat{\mu}_t \to \mu$, so there exists $T_\mu < \infty$ such that for all $t \ge T_\mu$, $|\widehat{\mu}_t| \le |\mu| + |\widehat{\mu}_t - \mu| \le |\mu| + 1$. In addition, $B_t \to \infty$ and $(B_t)$ is by definition nondecreasing, so there also exists $T'_B < \infty$ such that $B_t \ge 2(|\mu| + 1)$ for all $t \ge T'_B$. Hence for all $t \ge T_0 := \max\{T_\mu, T'_B\}$,
\[
B_t - \widehat{\mu}_t \ge B_t - |\widehat{\mu}_t| \ge B_t - (|\mu| + 1) \ge \frac{B_t}{2}.
\]

Let us now derive a sprinkling bound for $\mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu)$. Take $t \ge T_0$ with $t \ge T_B$ as well, so that $\widehat{P}_t$ is supported on $[-B_t, B_t]$. If $\Delta_t = 0$, then $\widehat{\mu}_t \ge \mu$ and hence $\mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu) = 0$.
So assume $\Delta_t > 0$, i.e., $\widehat{\mu}_t < \mu$. Again, since $B_t \to \infty$, there exists $T_1 < \infty$ such that $\mu < B_t$ for all $t \ge T_1$. Then, for $t \ge \max\{T_0, T_B, T_1\}$,
\[
\varepsilon_t := \frac{\mu - \widehat{\mu}_t}{B_t - \widehat{\mu}_t} = \frac{\Delta_t}{B_t - \widehat{\mu}_t} \in (0, 1).
\]
Using the fact that $B_t - \widehat{\mu}_t \ge B_t/2$, we get $\varepsilon_t \le \frac{\Delta_t}{B_t/2} = \frac{2\Delta_t}{B_t}$. Applying Lemma 10 with $B = B_t$, $\nu = \widehat{P}_t$, and $m = \mu$ gives $\mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu) \le -\log(1 - \varepsilon_t)$.

From the mean LIL we know that
\[
\limsup_{t \to \infty} \frac{|\widehat{\mu}_t - \mu|}{\sqrt{(2\sigma^2 \log\log t)/t}} = 1.
\]
So for the constant $c := \sqrt{2\sigma^2} + 1 > \sqrt{2\sigma^2}$, there exists $T_{\mathrm{LIL}} < \infty$ such that $|\widehat{\mu}_t - \mu| \le c\sqrt{\frac{\log\log t}{t}}$ for all $t \ge T_{\mathrm{LIL}}$. Therefore $\Delta_t \le c\sqrt{\frac{\log\log t}{t}}$ for all $t \ge T_{\mathrm{LIL}}$. Combining $\varepsilon_t \le 2\Delta_t / B_t$ with this bound on $\Delta_t$,
\[
\varepsilon_t \le \frac{2c}{B_t}\sqrt{\frac{\log\log t}{t}}.
\]
And, because $B_t / \sqrt{t/\log\log t} \to \infty$,
\[
\frac{1}{B_t}\sqrt{\frac{\log\log t}{t}} = \frac{1}{B_t} \cdot \frac{1}{\sqrt{t/\log\log t}} \xrightarrow[t \to \infty]{} 0.
\]
So $\varepsilon_t \to 0$. Thus there exists $T_\varepsilon < \infty$ such that $\varepsilon_t \le \frac{1}{2}$ for all $t \ge T_\varepsilon$, whence Lemma 10 gives $-\log(1 - \varepsilon_t) \le 2\varepsilon_t$. In particular, for all sufficiently large $t$ with $\Delta_t > 0$,
\[
\mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu) \le -\log(1 - \varepsilon_t) \le 2\varepsilon_t \le 2 \cdot \frac{2\Delta_t}{B_t} = \frac{4\Delta_t}{B_t}.
\]
Multiplying by $t/\log\log t$ and using $\Delta_t \le c\sqrt{\frac{\log\log t}{t}}$,
\[
\frac{t}{\log\log t}\, \mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu) \le \frac{t}{\log\log t} \cdot \frac{4\Delta_t}{B_t} \le \frac{t}{\log\log t} \cdot \frac{4c}{B_t}\sqrt{\frac{\log\log t}{t}} = \frac{4c}{B_t}\sqrt{\frac{t}{\log\log t}}.
\]
Since $B_t / \sqrt{t/\log\log t} \to \infty$, the right-hand side converges to $0$. This proves that
\[
\limsup_{t \to \infty} \frac{t \, \mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu)}{\log\log t} = 0,
\]
and hence we are done.
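The sprinkling construction of Lemma 10, used in the proof above, is easy to check numerically on a discrete example. The sketch below (with an arbitrarily chosen discrete $\nu$, not taken from the paper) verifies that $Q = (1-\varepsilon)\nu + \varepsilon\delta_B$ has mean $m$ and that $\mathrm{KL}(\nu \| Q) = -\log(1-\varepsilon)$ when $\nu(\{B\}) = 0$.

```python
import math

# Numerical check of Lemma 10 on a discrete nu (illustrative values only).
B = 1.0
atoms = [-0.5, 0.0, 0.25]          # support of nu; none of these equals B
weights = [0.2, 0.5, 0.3]          # nu's probabilities
mu_bar = sum(x * w for x, w in zip(atoms, weights))   # mean of nu
m = 0.5                            # target mean in (mu_bar, B)

eps = (m - mu_bar) / (B - mu_bar)  # sprinkling weight
# Q = (1 - eps) * nu + eps * delta_B
q_atoms = atoms + [B]
q_weights = [(1 - eps) * w for w in weights] + [eps]

mean_Q = sum(x * w for x, w in zip(q_atoms, q_weights))
# KL(nu || Q): sum over the support of nu of nu(x) * log(nu(x) / Q(x));
# since nu({B}) = 0, on each atom of nu we have Q(x) = (1 - eps) * nu(x).
kl = sum(w * math.log(w / ((1 - eps) * w)) for w in weights)

print(mean_Q, kl, -math.log(1 - eps))  # mean_Q == m; kl == -log(1 - eps)
```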
With this now shown, it is interesting to note that even with only a second moment, we can still get a deterministic almost sure envelope. We formalize that idea with the following lemma.

Lemma 12 Suppose $E[X_1^2] < \infty$. Let $(B_t)_{t \ge 1}$ be deterministic and nondecreasing with $B_t \to \infty$. If $\sum_{t=1}^{\infty} \frac{1}{B_t^2} < \infty$, then almost surely there exists $T < \infty$ such that for all $t \ge T$ and all $1 \le i \le t$, $|X_i| \le B_t$. In particular, for any $\gamma > \frac{1}{2}$ and $t \ge 3$, the choice
\[
u_t := \sqrt{t}\,(\log t)^{\gamma}, \qquad B_t := \max_{3 \le s \le t} u_s, \qquad B_1 := B_2 := B_3,
\]
satisfies the summability condition and is thus an almost surely valid deterministic envelope.

Proof Define the event $A_t := \{|X_t| > B_t\}$ for each $t \ge 1$. Since the sequence is iid, $P(A_t) = P(|X_1| > B_t)$. By Markov's inequality applied to $X_1^2$,
\[
P(A_t) = P(|X_1| > B_t) = P(|X_1|^2 > B_t^2) \le \frac{E[X_1^2]}{B_t^2}.
\]
Summing over $t$ and using $\sum 1/B_t^2 < \infty$ gives
\[
\sum_{t=1}^{\infty} P(A_t) \le E[X_1^2] \sum_{t=1}^{\infty} \frac{1}{B_t^2} < \infty.
\]
Therefore, by the first Borel-Cantelli lemma, $P(A_t \text{ i.o.}) = 0$, so with probability 1 there exists a finite (possibly random) $T_0 < \infty$ such that $|X_t| \le B_t$ for all $t \ge T_0$.

At this point, we "upgrade" our envelope to one uniform in $i \le t$. Consider an outcome on which $|X_t| \le B_t$ for all $t \ge T_0$. Since $(B_t)$ is nondecreasing, for any $t \ge T_0$ and any index $i$ with $T_0 \le i \le t$, $|X_i| \le B_i \le B_t$. For the finitely many indices $1 \le i \le T_0$, define $M_0 := \max_{1 \le i \le T_0} |X_i| < \infty$. Since $B_t \to \infty$, there exists again a finite (possibly random) $T_1$ such that $B_t \ge M_0$ for all $t \ge T_1$. Hence for all $t \ge T := \max\{T_0, T_1\}$ and all $1 \le i \le t$, we have $|X_i| \le B_t$.
It remains to verify our choice $B_t = \max_{3 \le s \le t} u_s$. Note that for $\gamma > \frac{1}{2}$ and $t \ge 3$ we have $\frac{1}{u_t^2} = \frac{1}{t (\log t)^{2\gamma}}$. By the integral test, the series $\sum_{t \ge 3} \frac{1}{t (\log t)^{2\gamma}}$ converges if and only if $\int_3^{\infty} \frac{1}{x (\log x)^{2\gamma}} dx < \infty$. Substituting $u = \log x$ (so $du = dx/x$),
\[
\int_3^{\infty} \frac{1}{x (\log x)^{2\gamma}} \, dx = \int_{\log 3}^{\infty} \frac{1}{u^{2\gamma}} \, du = \Big[ \frac{u^{1 - 2\gamma}}{1 - 2\gamma} \Big]_{\log 3}^{\infty}.
\]
This is finite exactly when $2\gamma > 1$, i.e., when $\gamma > \frac{1}{2}$. Now for each $t$, because $B_t \ge u_t$ it follows that $\frac{1}{B_t^2} \le \frac{1}{u_t^2}$. Hence $\sum_{t=1}^{\infty} \frac{1}{B_t^2} < \infty$, which concludes our proof at last.

The importance of Lemma 12 lies in the next corollary: the $p = 2$ envelope forces the normalized limsup to be $0$.

Corollary 13 Suppose $\sigma^2 \in (0, \infty)$ and $E[X_1^2] < \infty$. Let $\gamma > \frac{1}{2}$ and take $(B_t)$ exactly as in Lemma 12. Then on the probability-one event where both the envelope and the mean LIL hold, we have
\[
\limsup_{t \to \infty} \frac{t \, \mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu)}{\log\log t} = 0.
\]

Proof By Lemma 12, the envelope event $\mathcal{E}_B$ holds almost surely for this particular choice of $(B_t)$. In addition, for $t \ge 3$ we have $B_t \ge u_t = \sqrt{t}(\log t)^{\gamma}$. As a consequence,
\[
\frac{B_t}{\sqrt{t/\log\log t}} \ge \frac{\sqrt{t}\,(\log t)^{\gamma}}{\sqrt{t/\log\log t}} = (\log t)^{\gamma} \sqrt{\log\log t} \xrightarrow[t \to \infty]{} \infty.
\]
Hence the conditions of Proposition 11 are satisfied, and the corollary's conclusion indeed holds.

One last point we formalize here is that finite variance does not imply an envelope at the LIL scale. Hence not all $p = 2$ laws can satisfy the assumption of Theorem 6, which is exactly why we consider only $p > 2$ in that theorem.
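The summability condition of Lemma 12 can be illustrated numerically. The sketch below (with the example value $\gamma = 1$, chosen arbitrarily subject to $\gamma > 1/2$) checks that the partial sums of $1/u_t^2 = 1/(t(\log t)^{2\gamma})$ stay below the integral-test bound $\frac{1}{u_3^2} + \int_3^{\infty} \frac{dx}{x(\log x)^{2\gamma}} = \frac{1}{u_3^2} + \frac{(\log 3)^{1-2\gamma}}{2\gamma - 1}$.

```python
import math

# Illustration of Lemma 12's summability condition (gamma = 1 is an example
# value with gamma > 1/2, not prescribed by the paper).
gamma = 1.0

def inv_u_sq(t: int) -> float:
    # 1 / u_t^2 with u_t = sqrt(t) * (log t)^gamma
    return 1.0 / (t * math.log(t) ** (2 * gamma))

partial = 0.0
for t in range(3, 200_000):
    partial += inv_u_sq(t)

# Integral-test bound: since the summand is decreasing for t >= 3,
# sum_{t>=3} <= f(3) + integral_3^inf f(x) dx.
bound = inv_u_sq(3) + math.log(3) ** (1 - 2 * gamma) / (2 * gamma - 1)
print(partial, bound)  # partial sums are bounded, so the series converges
```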
Proposition 14 There exists a mean-zero distribution with finite variance such that for the deterministic sequence $b_t := \sqrt{\frac{t}{\log\log t}}$ (where $t \ge 3$), one has
\[
P\big( |X_t| > b_t \ \text{i.o.} \big) = 1.
\]
Consequently, for any deterministic nondecreasing sequence $(B_t)$ with $B_t = o\big(\sqrt{t/\log\log t}\big)$, we also have
\[
P\big( |X_t| > B_t \ \text{i.o.} \big) = 1.
\]
And so, unfortunately, the envelope event $\mathcal{E}_B$ from Theorem 6 will fail for this law.

Proof The very first thing we must do is build a finite-variance distribution. To that end, let $S$ be a Rademacher random variable, i.e., $P(S = \pm 1) = \frac{1}{2}$. Given this, we define a nonnegative random variable $Y$ by specifying its survival function for $y \ge 0$:
\[
P(Y > y) =
\begin{cases}
A, & 0 \le y < e^e, \\[4pt]
\dfrac{1}{y^2 \log y \, (\log\log y)^2}, & y \ge e^e,
\end{cases}
\]
where $A$ is chosen so that the survival function is continuous at $y = e^e$. That is,
\[
A := \frac{1}{(e^e)^2 \log(e^e) \big(\log\log(e^e)\big)^2} = \frac{1}{e^{2e} \cdot e \cdot 1^2} = e^{-(2e+1)} \in (0, 1).
\]
Clearly, this function is nonincreasing, right-continuous, and tends to $0$ as $y \to \infty$. As a consequence, it defines a valid law on $[0, \infty)$ with an atom at $0$ of size $1 - A$. Now, let $Y$ be independent of $S$ and define the random variable $X := SY$. Clearly, $X$ is symmetric about $0$. Furthermore, by Cauchy-Schwarz, $E[|X|] \le \sqrt{E[X^2]} < \infty$, so the mean exists, and by symmetry $E[X] = 0$.

We now verify that $E[X^2] < \infty$. Since $|X| = Y$, it follows that $E[X^2] = E[Y^2]$. For a nonnegative random variable, the basic tail-integral identity gives
\[
E[Y^2] = \int_0^{\infty} P(Y^2 > s) \, ds = \int_0^{\infty} P(Y > \sqrt{s}) \, ds.
\]
Let us now make the substitution $s = y^2$, so $ds = 2y \, dy$.
As such we get $E[Y^2] = 2 \int_0^{\infty} y \, P(Y > y) \, dy$. Let us now compute this:
\[
E[Y^2] = 2\int_0^{e^e} y \cdot A \, dy + 2\int_{e^e}^{\infty} y \cdot \frac{1}{y^2 \log y (\log\log y)^2} \, dy = A(e^e)^2 + 2\int_{e^e}^{\infty} \frac{1}{y \log y (\log\log y)^2} \, dy.
\]
To evaluate the remaining integral, make the substitution $u = \log y$ (so $du = dy/y$):
\[
\int_{e^e}^{\infty} \frac{1}{y \log y (\log\log y)^2} \, dy = \int_{e}^{\infty} \frac{1}{u (\log u)^2} \, du.
\]
Substituting again, this time with $v = \log u$ (so $dv = du/u$), gives
\[
\int_{e}^{\infty} \frac{1}{u (\log u)^2} \, du = \int_{1}^{\infty} \frac{1}{v^2} \, dv = \Big[ -\frac{1}{v} \Big]_1^{\infty} = 1.
\]
So it follows that $E[Y^2] = A(e^e)^2 + 2 < \infty$, and hence $E[X^2] < \infty$.

With this shown, we now need to show that $P(|X_t| > b_t \text{ i.o.}) = 1$. To begin, let $X_1, X_2, X_3, \ldots$ be iid copies of $X$ and define $E_t := \{|X_t| > b_t\}$. Of course, $b_t \to \infty$ since $t/\log\log t \to \infty$. As such, there exists $T_0 < \infty$ such that $b_t \ge e^e$ for all $t \ge T_0$. For such $t$ we may use the tail formula
\[
P(E_t) = P(|X_1| > b_t) = \frac{1}{b_t^2 \log b_t (\log\log b_t)^2}.
\]
Additionally, because $\log\log t \to \infty$, there exists $T_1 < \infty$ such that $\log\log t \ge 1$ for all $t \ge T_1$, and hence for all $t \ge T_1$, $b_t^2 = \frac{t}{\log\log t} \le t$, so $b_t \le \sqrt{t} \le t$. Define $T_2 := \max\{T_0, T_1\}$. Then for all $t \ge T_2$, we have $\log b_t \le \log t$ and $\log\log b_t \le \log\log t$. Hence,
\[
P(E_t) = \frac{1}{b_t^2 \log b_t (\log\log b_t)^2} = \frac{\log\log t}{t} \cdot \frac{1}{\log b_t} \cdot \frac{1}{(\log\log b_t)^2} \ge \frac{\log\log t}{t} \cdot \frac{1}{\log t} \cdot \frac{1}{(\log\log t)^2} = \frac{1}{t \log t \log\log t}.
\]
The series $\sum_{t \ge 3} \frac{1}{t \log t \log\log t}$ diverges. This follows from the integral test, via the two substitutions $u = \log x$ and then $v = \log u$.
That is,
\[
\int_3^{\infty} \frac{1}{x \log x \log\log x} \, dx = \int_{\log 3}^{\infty} \frac{1}{u \log u} \, du = \int_{\log\log 3}^{\infty} \frac{1}{v} \, dv = \infty.
\]
As such, $\sum_{t \ge T_2} P(E_t) = \infty$. The events $E_t$ are independent, so by the second Borel-Cantelli lemma, $P(E_t \text{ i.o.}) = 1$.

Extending from $b_t$ to any $B_t = o(b_t)$ is easy. Let $(B_t)$ be deterministic and nondecreasing as usual, with $B_t = o(b_t)$. Then $B_t / b_t \to 0$, so there exists $T < \infty$ such that $B_t \le b_t$ for all $t \ge T$. As a consequence, for all $t \ge T$, $\{|X_t| > b_t\} \subseteq \{|X_t| > B_t\}$. Importantly, if $|X_t| > b_t$ happens infinitely often, then so does $|X_t| > B_t$. Since $P(|X_t| > b_t \text{ i.o.}) = 1$, it follows that $P(|X_t| > B_t \text{ i.o.}) = 1$. This proves the claim, and hence we are finally done.

All in all, we have tightness at the scale $\sqrt{t/\log\log t}$. When $\widehat{\mu}_t < \mu < B_t$, for the empirical instance $\nu = \widehat{P}_t$ and target $m = \mu$, the sprinkling construction of Lemma 10 gives $\mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu) \lesssim \frac{\Delta_t}{B_t}$. However, the local quadratic regime analyzed in Theorem 6 gives $\mathrm{KL}^{(t)}_{\inf}(\widehat{P}_t, \mu) \asymp \Delta_t^2$. We know that $\Delta_t$ fluctuates at the LIL scale $\Delta_t \asymp \sqrt{(\log\log t)/t}$. As such, the boundary where $\Delta_t^2$ and $\Delta_t / B_t$ match exactly is easily computed: $\Delta_t^2 \asymp \frac{\Delta_t}{B_t}$ implies
\[
B_t \asymp \frac{1}{\Delta_t} \asymp \sqrt{\frac{t}{\log\log t}}.
\]
What is the takeaway? Envelopes strictly smaller than $\sqrt{t/\log\log t}$ keep the problem in the quadratic regime, and they also give the sharp constant $1$ as in Theorem 6. However, envelopes strictly larger than $\sqrt{t/\log\log t}$ allow sprinkling to dominate, which collapses the $\log\log t / t$ normalization to $0$ by Proposition 11.
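As a sanity check on the second-moment computation in the proof of Proposition 14, the sketch below verifies numerically that the change of variables $y = e^{e^v}$ turns the integrand of $2\int_{e^e}^{\infty} y\,P(Y > y)\,dy$ into $2/v^2$ (so the integral equals $2$), and then recomputes $E[Y^2] = A\,e^{2e} + 2 = e^{-1} + 2$.

```python
import math

# Numerical sanity check for Proposition 14's second-moment computation.
A = math.exp(-(2 * math.e + 1))  # continuity constant at y = e^e

def tail(y: float) -> float:
    # Survival function P(Y > y) for y >= e^e.
    return 1.0 / (y ** 2 * math.log(y) * math.log(math.log(y)) ** 2)

# Under y = exp(exp(v)), the integrand 2 * y * P(Y > y) * (dy/dv) simplifies
# to 2 / v^2, so 2 * integral_{e^e}^inf y P(Y > y) dy = 2 * [-1/v]_1^inf = 2.
for v in [1.0, 1.5, 2.0, 3.0, 5.0]:
    y = math.exp(math.exp(v))
    dy_dv = y * math.exp(v)                 # dy/dv for y = exp(exp(v))
    integrand = 2 * y * tail(y) * dy_dv     # should equal 2 / v**2
    assert abs(integrand - 2 / v ** 2) < 1e-9

# Hence E[Y^2] = A * (e^e)^2 + 2 = exp(-1) + 2, which is finite.
Ey2 = A * math.exp(2 * math.e) + 2
print(Ey2)
```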
