Design-based theory for causal inference from adaptive experiments


Authors: Xinran Li, Anqi Zhao

Xinran Li (Department of Statistics, University of Chicago)
Anqi Zhao (Fuqua School of Business, Duke University)

Abstract

Adaptive designs dynamically update treatment probabilities using information accumulated during the experiment. Existing theory for causal inference from adaptive experiments primarily assumes the superpopulation framework with independent and identically distributed units, and may not apply when the distribution of units evolves over time. This paper makes two contributions. First, we extend the literature to the finite-population framework, which allows for possibly nonexchangeable units, and establish the design-based theory for causal inference under general adaptive designs using inverse-propensity-weighted (IPW) and augmented IPW (AIPW) estimators. Our theory accommodates nonexchangeable units, both nonconverging and vanishing treatment probabilities, and nonconverging outcome estimators, thereby justifying inference using AIPW estimators with black-box outcome models that integrate advances from machine learning methods. To alleviate the conservativeness inherent in variance estimation under finite-population inference, we also introduce a covariance estimator for the AIPW estimator that becomes sharp when the residuals from the adaptive regression of potential outcomes on covariates are additive across units. Our framework encompasses widely used adaptive designs, such as multi-armed bandits, covariate-adaptive randomization, and sequential rerandomization, advancing the design-based theory for causal inference in these specific settings. Second, as a methodological contribution, we propose an adaptive covariate adjustment approach for analyzing even nonadaptive designs.
The martingale structure induced by adaptive adjustment enables valid inference with black-box outcome estimators that would otherwise require strong assumptions under standard nonadaptive analysis.

Keywords: Adaptive covariate adjustment; Augmented inverse-propensity-weighted estimator; Covariate-adaptive randomization; Multi-armed bandits; Sequential rerandomization

* Authors contributed equally; names are listed alphabetically. X. L. acknowledges support from the U.S. National Science Foundation (Grant #2400961). We thank Alex Belloni, Iavor Bojinov, Hongbin (Elijah) Huang, Jinze Cui, Peng Sun, and Lei Zhang for inspiring discussions.

1 Introduction

Adaptive experimental designs, also known as sequential experimental designs, have gained increasing attention in recent years. Unlike nonadaptive designs, which fix treatment probabilities prior to the experiment, adaptive designs use information accumulated during the experiment to dynamically update treatment probabilities for subsequent units, offering multiple advantages in applications across business, medicine, and policy, among other areas (Bhatt and Mehta, 2016; Hadad et al., 2021; Offer-Westort et al., 2021). As a widely used example, multi-armed bandit algorithms aim to maximize cumulative reward over time by dynamically adjusting treatment probabilities based on realized outcomes. Contextual bandits extend the multi-armed bandit framework by incorporating contextual features, enabling personalized treatment assignment policies (Bubeck et al., 2012; Dimakopoulou et al., 2017; Agrawal, 2019). Another prominent class of adaptive designs—including covariate-adaptive randomization (Efron, 1971; Bugni et al., 2018; Ye et al., 2022), minimization methods (Pocock and Simon, 1975), and sequential rerandomization (Zhou et al.
, 2018)—dynamically adjusts treatment probabilities to improve covariate balance across treatment groups.

Inference of treatment effects remains a central objective in analyzing adaptive experiments. However, dependence among data points collected over time complicates this task, especially when the adaptive treatment assignment mechanism is designed to optimize objectives other than estimating treatment effects (Hadad et al., 2021). In this context, Melfi and Page (2000) studied inference from adaptive designs with converging treatment probabilities, and established large-sample guarantees for a class of estimators that are strongly consistent when applied to nonadaptive data; see also van der Laan (2008). Hadad et al. (2021) extended this framework to designs with nonconverging treatment probabilities, providing conditions for doubly robust inference from a class of adaptively reweighted estimators. Bibaut et al. (2021) further extended these results to contextual bandit designs and established weaker conditions for inference. Additionally, focusing on specific adaptive designs, Bugni et al. (2018) studied inference under covariate-adaptive randomization, and established theoretical properties of three commonly used hypothesis-testing procedures. Zhou et al. (2018) studied sequential rerandomization for experimental units enrolled in groups, and established its advantage in improving covariate balance over nonsequential rerandomization. Ham et al. (2023) studied multi-armed bandits, and provided both confidence intervals and confidence sequences for arm-specific reward means and mean reward differences between arms.

This paper makes two main contributions.
First, we advance the finite-population, design-based theory for causal inference from adaptive designs, and establish guarantees of Wald-type inference based on the inverse-propensity-weighted (IPW) and augmented IPW (AIPW) estimators. Our theory accommodates nonexchangeable units, both nonconverging and vanishing treatment probabilities, and nonconverging outcome estimators, thereby justifying inference using AIPW estimators with black-box outcome models that integrate advances from machine learning methods. Second, as a methodological contribution, we propose an adaptive covariate adjustment approach for causal inference from even nonadaptive designs, enabling valid inference with black-box outcome estimators that would otherwise require strong assumptions under standard nonadaptive analysis. We provide the details below.

Design-based theory for causal inference from adaptive experiments. Existing literature on causal inference from adaptive experiments primarily assumes the superpopulation framework with independent and identically distributed units (Hadad et al., 2021; Bibaut et al., 2021). However, methods developed under this framework may be biased when the distribution of units evolves over time; see Section S1 of the Supplementary Material for a simulated illustration. We instead adopt the finite-population, design-based framework, which conditions on potential outcomes and covariates—or, equivalently, views them as fixed—and takes treatment assignment as the sole source of randomness for inference (Fisher, 1935). Accordingly, our theory accommodates nonexchangeable units arising from a wide range of underlying data-generating processes.
As a technical contribution, we introduce a new covariance estimator for the AIPW estimator that leverages adaptive outcome estimation to reduce the conservativeness inherent in variance estimation under finite-population inference (Neyman, 1923). The proposed estimator is generally conservative but becomes sharp when the residuals from the adaptive regression of potential outcomes on covariates are additive across units.

Our theory encompasses widely used adaptive designs such as multi-armed bandits, covariate-adaptive randomization, and sequential rerandomization, advancing the design-based theory for causal inference in these specific settings (Bugni et al., 2018; Zhou et al., 2018; Ham et al., 2023). In particular, the theory of Zhou et al. (2018) does not include inference under sequential rerandomization, and our paper fills this gap.

Adaptive covariate adjustment for nonadaptive designs. Existing methods for covariate adjustment in nonadaptive experiments use either all units (Lin, 2013; Guo and Basse, 2023; Cohen and Fogarty, 2024; Qu et al., 2025) or cross-fitting (Aronow and Middleton, 2013; Wager et al., 2016; Wu and Gagnon-Bartsch, 2021) to estimate outcome models. Such practices may induce complex dependence among unit-level adjusted outcomes, requiring strong consistency or stability conditions on the outcome estimators, or Bonferroni correction when cross-fitting is used, for valid inference. In contrast, we propose analyzing nonadaptive designs as if they were adaptive, using the adaptively adjusted AIPW estimator we developed for adaptive designs to perform adaptive covariate adjustment. The martingale structure underlying the adaptive adjustment enables valid covariate-adjusted inference using black-box outcome estimators that would otherwise require strong assumptions under standard nonadaptive analysis.
A similar idea appears in Luedtke and van der Laan (2016) and Wager (2024), who study the optimal treatment rule and treatment heterogeneity, respectively, under the superpopulation framework.

Notation. Let $1(\cdot)$ denote the indicator function. Let $0_m$ denote the $m \times 1$ zero vector and $I_m$ the $m \times m$ identity matrix. We omit the subscript $m$ when the dimension is clear from context. For an $m \times n$ matrix $A = (a_{ij})_{m \times n}$, let $\|A\|_{\mathrm{F}} = (\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2)^{1/2}$ denote the Frobenius norm of $A$. For an $n \times n$ square matrix $A$, let $\lambda_{\min}(A)$ denote its smallest eigenvalue. For a collection of numbers $\{a_z \in \mathbb{R} : z \in \mathcal{Z}\}$, where $\mathcal{Z}$ is the index set, let $\mathrm{diag}(a_z)_{z \in \mathcal{Z}}$ denote the diagonal matrix with $a_z$ on the diagonal. For a set of vectors $\{a_i \in \mathbb{R}^m : i = 1, \dots, n\}$, define its sample covariance matrix as $(n-1)^{-1} \sum_{i=1}^n (a_i - \bar{a})(a_i - \bar{a})^\top$, where $\bar{a} = n^{-1} \sum_{i=1}^n a_i$. Let $\stackrel{d}{\to}$ denote convergence in distribution.

2 Setting

2.1 Adaptive randomization

Consider an experiment with $K \ge 2$ treatment levels, indexed by $z \in \mathcal{Z} = \{1, \dots, K\}$, and a study population of $T$ units, indexed by $t = 1, \dots, T$. An adaptive design sequentially randomizes each unit $t$ to a treatment level at time $t = 1, \dots, T$, and determines the assignment probabilities adaptively based on the information accumulated up to time $t$, referred to as the history. For each unit $t$, let $Z_t \in \mathcal{Z}$ denote its treatment assignment, $Y_t \in \mathbb{R}$ the outcome of interest, and $X_t \in \mathbb{R}^J$ a vector of $J$ baseline covariates. Let $\mathcal{H}_t$ denote the history up to time $t$, which includes the covariates, treatment assignments, and outcomes of units 1 through $t-1$, as well as the covariates of unit $t$:
$$\mathcal{H}_t = \{(Z_s, Y_s, X_s) : s = 1, \dots, t-1\} \cup \{X_t\}.$$
We characterize the assignment mechanism by the conditional distribution of $Z_t$ given $\mathcal{H}_t$, as formalized in Definition 1 below.
Denition 1 (Adaptiv e randomization) . Let e t ( z ) = P ( Z t = z | H t ) denote the probability of unit t receiving treatment lev el z ∈ Z given H t . F or t = 1 , . . . , T and z ∈ Z , e t ( z ) is a presp ecied function of H t that sati ses (i) P z ∈Z e t ( z ) = 1 and (ii) e t ( z ) ∈ (0 , 1) . Denition 1 requires p ositive assignmen t probability e t ( z ) for all treatment levels at eac h time t = 1 , . . . , T , but is otherwise general, allowing e t ( z ) to dep end arbitrarily on the history H t , as illustrated in Examples 1 – 2 b elo w. When e t ( z ) = P ( Z t = z | H t ) = P ( Z t = z | X t ) , so that the assignment Z t of unit t is indep endent of the previous units, Denition 1 reduces to a nonadaptiv e design that randomizes each unit indep endently . Example 1. Multi-armed bandit designs dynamically up date treatmen t probabilities to maximize the exp ected cum ulative outcome. The assignment probabilit y e t ( z ) at time t is a function of past assignments and outcomes up to time t − 1 , { ( Z s , Y s ) : s = 1 , . . . , t − 1 } , a subset of the history H t . Contextual bandits further incorp orate time-sp ecic contex- tual information, represented by a cov ariate vector X t for each time t = 1 , . . . , T . The corresp onding e t ( z ) is a function of the full history H t . Example 2. Biase d-c oin designs ( Efron , 1971 ) dynamically update treatmen t probabilities to improv e co v ariate balance. The assignmen t probability e t ( z ) at time t dep ends on the co v ariates of unit t and the co v ariate balance among units 1 through t − 1 , and is a function of { ( Z s , X s ) : s = 1 , . . . , t − 1 } ∪ { X t } , a subset of the history H t . 2.2 P oten tial outcomes and treatmen t eects W e dene treatmen t eects using the potential outcomes framew ork ( Neyman , 1923 ; Im- b ens and Rubin , 2015 ). Let Y t ( z ) denote the p otential outcome of unit t if assigned to 6 treatmen t level z . 
The observ ed outcome equals the p otential outcome under the real- ized treatment level: Y t = P z ∈Z 1 ( Z t = z ) Y t ( z ) = Y t ( Z t ) . Let ¯ Y ( z ) = T − 1 P T t =1 Y t ( z ) denote the p opulation av erage p otential outcome under treatmen t lev el z , and let ¯ Y = ( ¯ Y (1) , . . . , ¯ Y ( K )) ⊤ denote the v ector of ¯ Y ( z ) across all K levels. A general goal of nite- p opulation causal inference is to estimate linear combinations of { ¯ Y ( z ) : z ∈ Z } , denoted b y τ C = C ¯ Y = ( c ⊤ 1 ¯ Y , . . . , c ⊤ Q ¯ Y ) ⊤ ∈ R Q , where Q is the presp ecied num b er of estimands, and C = ( c 1 , . . . , c Q ) ⊤ ∈ R Q × K is a presp ecied co ecien t matrix with c q ∈ R K for q = 1 , . . . , Q . Often, eac h c q is sp ecied as a con trast vector, so that c ⊤ q ¯ Y represen ts a contrast among { ¯ Y ( z ) : z ∈ Z } . F or example, in a treatmen t-control exp eriment with K = 2 , setting C = ( − 1 , 1) yields τ C = ¯ Y (2) − ¯ Y (1) , the nite-p opulation a verage treatmen t eect. Alternativ ely , setting C as the identit y matrix yields τ C = ¯ Y , targeting the av erage p oten tial outcomes by treatmen t. Without loss of generalit y , w e assume a general C with K columns and full ro w rank unless specied otherwise. 2.3 Estimation by inv erse prop ensit y w eigh ting In verse prop ensity weigh ting, also kno wn as inv erse probabilit y-of-treatment weigh ting or imp ortance sampling w eighting, is standard for inference from adaptive designs. Given the assignmen t probability e t ( z ) , the inv erse-prop ensity-w eigh ted (IPW) estimator of Y t ( z ) is ˆ Y t, ipw ( z ) = 1 ( Z t = z ) e t ( z ) Y t . (1) 7 Dene ˆ Y ipw ( z ) = T − 1 T X t =1 ˆ Y t, ipw ( z ) , ˆ Y ipw = ( ˆ Y ipw (1) , . . . , ˆ Y ipw ( K )) ⊤ , ˆ τ ipw ,C = C ˆ Y ipw (2) as the corresp onding IPW estimators of ¯ Y ( z ) , ¯ Y , and τ C , respectively . W e establish the design-based guaran tees of ˆ τ ipw ,C in Sectio n 3 . 
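As a minimal numerical sketch (not part of the paper's development; the design and all parameter values are illustrative assumptions), the following simulation runs an $\epsilon$-greedy two-arm bandit satisfying Definition 1 on a fixed finite population, forms $\hat{Y}_{\mathrm{ipw}}(z)$ as in (1)–(2), and averages over replications:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed finite population of T units with two arms. The potential
# outcomes are held fixed; only the assignments {Z_t} are random,
# matching the design-based framework.
T = 200
Y = np.column_stack([np.linspace(0.0, 1.0, T),          # Y_t(1), mean 0.5
                     np.linspace(0.0, 1.0, T) + 0.3])   # Y_t(2), mean 0.8

def ipw_run(rng, eps=0.2):
    """One epsilon-greedy experiment; returns (Yhat_ipw(1), Yhat_ipw(2))."""
    sums = np.zeros(2)    # running outcome totals per arm
    counts = np.zeros(2)  # running assignment counts per arm
    ipw = np.zeros(2)     # running sums of hat{Y}_{t,ipw}(z)
    for t in range(T):
        # e_t(z): probability 1 - eps/2 on the arm with the larger observed
        # mean so far, eps/2 on the other, so e_t(z) is in (0, 1) as
        # required by Definition 1(ii) and depends on the history H_t.
        means = np.where(counts > 0, sums / np.maximum(counts, 1.0), 0.0)
        e = np.full(2, eps / 2)
        e[int(np.argmax(means))] = 1 - eps / 2
        z = rng.choice(2, p=e)
        ipw[z] += Y[t, z] / e[z]   # eq. (1); the unchosen arm contributes 0
        sums[z] += Y[t, z]
        counts[z] += 1
    return ipw / T

# Averaging over replications illustrates design-based unbiasedness:
est = np.mean([ipw_run(rng) for _ in range(2000)], axis=0)
```

Only the assignments are redrawn across replications; the potential outcome schedule stays fixed, which is exactly the design-based notion of randomness.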
The augmented inverse-propensity-weighted (AIPW) estimator augments the IPW estimator with outcome regression (Robins et al., 1994). Let $\hat{m}_t(z)$ denote an adaptive estimator of $Y_t(z)$ constructed using only information in $\mathcal{H}_t$ (Hadad et al., 2021). We define
$$\hat{Y}_{t,\mathrm{aipw}}(z) = \hat{Y}_{t,\mathrm{ipw}}(z) + \left\{1 - \frac{1(Z_t = z)}{e_t(z)}\right\} \hat{m}_t(z) \tag{3}$$
as the adaptive AIPW estimator of $Y_t(z)$ based on $\mathcal{H}_t$, and define
$$\hat{Y}_{\mathrm{aipw}}(z) = T^{-1} \sum_{t=1}^T \hat{Y}_{t,\mathrm{aipw}}(z), \quad \hat{Y}_{\mathrm{aipw}} = (\hat{Y}_{\mathrm{aipw}}(1), \dots, \hat{Y}_{\mathrm{aipw}}(K))^\top, \quad \hat{\tau}_{\mathrm{aipw},C} = C \hat{Y}_{\mathrm{aipw}} \tag{4}$$
as the corresponding AIPW estimators of $\bar{Y}(z)$, $\bar{Y}$, and $\tau_C$, respectively, generalizing $\hat{Y}_{\mathrm{ipw}}(z)$, $\hat{Y}_{\mathrm{ipw}}$, and $\hat{\tau}_{\mathrm{ipw},C}$ in (2). We establish the design-based guarantees of $\hat{\tau}_{\mathrm{aipw},C}$ in Section 4.

3 Design-based theory of the IPW estimator

We establish in this section the design-based guarantees of the IPW estimator $\hat{\tau}_{\mathrm{ipw},C}$ defined in (2) for estimating $\tau_C$. The design-based framework views covariates and potential outcomes as fixed—or, equivalently, conditions on them—and evaluates the sampling properties of $\hat{\tau}_{\mathrm{ipw},C}$ with respect to the randomness in treatment assignments $\{Z_t : t = 1, \dots, T\}$.

3.1 Sampling properties

Let
$$V_{\mathrm{ipw}} = \mathrm{diag}\left[T^{-1} \sum_{t=1}^T E\left\{\frac{1}{e_t(z)}\right\} Y_t(z)^2\right]_{z \in \mathcal{Z}} - T^{-1} \sum_{t=1}^T Y_t Y_t^\top, \tag{5}$$
where $Y_t = (Y_t(1), \dots, Y_t(K))^\top$.

Theorem 1. Assume the adaptive randomization in Definition 1. Then $E(\hat{\tau}_{\mathrm{ipw},C}) = \tau_C$ and $\mathrm{cov}(\hat{\tau}_{\mathrm{ipw},C}) = T^{-1} C V_{\mathrm{ipw}} C^\top$ for any fixed matrix $C$ with $K$ columns.

Theorem 1 establishes the unbiasedness of $\hat{\tau}_{\mathrm{ipw},C}$ for $\tau_C$ and provides the explicit form of its covariance matrix. Setting $C = I_K$ implies that $\hat{Y}_{\mathrm{ipw}}(z)$ is unbiased for the population average potential outcome $\bar{Y}(z)$, with $V_{\mathrm{ipw}} = T\,\mathrm{cov}(\hat{Y}_{\mathrm{ipw}})$. We next establish the central limit theorem for $\hat{\tau}_{\mathrm{ipw},C}$ as $T \to \infty$.
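The covariance formula in Theorem 1 can be checked numerically in the nonadaptive special case of Definition 1 with a fixed $e(z)$, where the expectation in (5) is trivial and $V_{\mathrm{ipw}}$ is computable directly from the fixed potential outcome schedule. The sketch below (an illustration with made-up data, not from the paper) compares the Monte Carlo variance of $\hat{\tau}_{\mathrm{ipw},C}$ with $T^{-1} C V_{\mathrm{ipw}} C^\top$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Fixed finite population and fixed assignment probabilities: a
# nonadaptive special case of Definition 1, with independent assignments.
T = 100
Y = np.column_stack([np.sin(np.arange(T)),         # Y_t(1)
                     1.0 + np.cos(np.arange(T))])  # Y_t(2)
e = np.array([0.3, 0.7])
C = np.array([[-1.0, 1.0]])                        # contrast Ybar(2) - Ybar(1)

# V_ipw from (5): diag[T^{-1} sum_t Y_t(z)^2 / e(z)] - T^{-1} sum_t Y_t Y_t^T.
V_ipw = (np.diag((Y**2 / e).mean(axis=0))
         - (Y[:, :, None] * Y[:, None, :]).mean(axis=0))
target_var = (C @ V_ipw @ C.T).item() / T          # Theorem 1: T^{-1} C V C^T

def tau_hat(rng):
    z = rng.choice(2, p=e, size=T)                 # independent assignments
    Yhat = np.zeros((T, 2))                        # rows: hat{Y}_{t,ipw}
    Yhat[np.arange(T), z] = Y[np.arange(T), z] / e[z]   # eq. (1)
    return (C @ Yhat.mean(axis=0)).item()

draws = np.array([tau_hat(rng) for _ in range(20000)])
```

Across replications, the mean of `draws` matches $\tau_C$ and their variance matches `target_var`, up to Monte Carlo error.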
The nite-p opulation asymptotic regime embeds the study p opulation in to an innite sequence of nite p opula- tions with increasing sizes T = 1 , 2 , . . . , and denes the asymptotic distribution of ˆ τ ipw ,C as the limit of its design-based distributions along this sequence ( Lehmann , 1975 , 1999 ; Li and Ding , 2017 ). Under adaptive randomization, it is intuitiv e to imagine a sequen tially enrolled innite population { X t , Z t , Y t ( z ) : z ∈ Z } ∞ t =1 , and dene the T -th nite p opulation as the rst T units of this innite p opulation for T = 1 , 2 , . . . . How ev er, our theory is general and applies not only to this in tuitive scheme but also to the alternative scheme in whic h the nite p opulations consist of dierent units for each T . Under the design-based framework, at time t , the treatment assignments up to time t − 1 , ( Z 1 , . . . , Z t − 1 ) , are the only random elemen ts in the history H t , so that H t can tak e at most |Z | t − 1 p ossible v alues. Let e t = min H t , z ∈Z e t ( z ) denote the minim um of { e t ( z ) : z ∈ Z } o ver the |Z | t − 1 × |Z | p ossible v alues of ( H t , z ) at time t . Denition 1 implies that e t > 0 for all t . Let L t = max z ∈Z | Y t ( z ) | denote the maximum of | Y t ( z ) | at time t . Recall the denition of V ipw from ( 5 ). Let λ min ( V ipw ) denote the smallest eigen v alue of V ipw , and let 9 ˇ V ipw = diag " T − 1 T X t =1 1 e t ( z ) Y t ( z ) 2 # z ∈Z − T − 1 T X t =1 Y t Y ⊤ t with E ( ˇ V ipw ) = V ipw . Condition 1 b elow sp ecies the regularity conditions w e impose to establish the central limit theorem for ˆ τ ipw ,C . Condition 1. As T → ∞ , (i) { λ min ( V ipw ) } − 1 · ∥ ˇ V ipw − V ipw ∥ f = o P (1) . (ii) { λ min ( V ipw ) } − 1 · T − 1 max t =1 ,...,T ( L t /e t ) 2 = o (1) . 
We show in the Supplementary Material that
$$V_{\mathrm{ipw}} = T^{-1} \sum_{t=1}^T E\left\{\mathrm{cov}(\hat{Y}_{t,\mathrm{ipw}} - Y_t \mid \mathcal{H}_t)\right\}, \qquad \check{V}_{\mathrm{ipw}} = T^{-1} \sum_{t=1}^T \mathrm{cov}(\hat{Y}_{t,\mathrm{ipw}} - Y_t \mid \mathcal{H}_t),$$
where $\{\hat{Y}_{t,\mathrm{ipw}} - Y_t : t = 1, \dots, T\}$ form a martingale difference sequence. Condition 1(i) and (ii) correspond to the variance convergence and Lindeberg conditions, respectively, for applying the martingale central limit theorem to $\{\hat{Y}_{t,\mathrm{ipw}} - Y_t : t = 1, \dots, T\}$ (Brown, 1971).

Theorem 2. Assume the adaptive randomization in Definition 1. If Condition 1 holds, then as $T \to \infty$, $(C V_{\mathrm{ipw}} C^\top)^{-1/2} \cdot \sqrt{T}(\hat{\tau}_{\mathrm{ipw},C} - \tau_C) \stackrel{d}{\to} N(0, I)$ for any fixed matrix $C$ with $K$ columns and full row rank.

Recall from Theorem 1 that $V_{\mathrm{ipw}} = T\,\mathrm{cov}(\hat{Y}_{\mathrm{ipw}})$. This implies
$$\lambda_{\min}(V_{\mathrm{ipw}}) = \min_{C \in \mathbb{R}^{1 \times K},\, \|C\|_2 = 1} C V_{\mathrm{ipw}} C^\top = \min_{C \in \mathbb{R}^{1 \times K},\, \|C\|_2 = 1} T\,\mathrm{var}(\hat{\tau}_{\mathrm{ipw},C}), \tag{6}$$
so that $\lambda_{\min}(V_{\mathrm{ipw}})$ represents the minimum scaled variance of $\hat{\tau}_{\mathrm{ipw},C} = C \hat{Y}_{\mathrm{ipw}}$ over all $C \in \mathbb{R}^{1 \times K}$ with $\|C\|_2 = 1$. Given $T$ as the total sample size, when potential outcomes vary across units and not all units are assigned to the same treatment arm, we do not expect to estimate $\tau_C = C \bar{Y}$, as a linear combination of $\{\bar{Y}(z) : z \in \mathcal{Z}\}$, at a rate faster than $T^{-1/2}$ for any nonzero $C \in \mathbb{R}^{1 \times K}$. As a result, $\mathrm{var}(\hat{\tau}_{\mathrm{ipw},C})$ generally does not vanish at a rate faster than $T^{-1}$, so (6) suggests it is reasonable to assume that $\lambda_{\min}(V_{\mathrm{ipw}})$ is uniformly bounded away from 0. In addition, outcomes in practice are typically bounded, making it reasonable to assume that $Y_t(z)$ is uniformly bounded. Proposition 1 below builds on these intuitions and provides sufficient conditions for Condition 1.

Proposition 1. Assume the adaptive randomization in Definition 1. Let $v_t = \max_{z \in \mathcal{Z}} \mathrm{var}\{e_t(z)^{-1}\}$ denote the maximum variance of $e_t(z)^{-1}$ over $z \in \mathcal{Z}$ at time $t$.
If $\lambda_{\min}(V_{\mathrm{ipw}})$ is uniformly bounded away from 0 as $T \to \infty$, then

(i) Condition 1(i) holds if either of the following conditions is satisfied:

(a) Vanishing variance: $T^{-1} \sum_{t=1}^T v_t = o(1)$.

(b) Limited-range dependence: There exist constants $\beta \in [0, 1)$ and $c_0 > 0$ such that $\max_{t=1,\dots,T} v_t = o(T^{1-\beta})$, and after some fixed time point $T_0$, $\{e_t(z) : z \in \mathcal{Z}\}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.

(ii) Condition 1(ii) holds if $\max_{t=1,\dots,T} L_t / e_t = \sqrt{T} \cdot o(1)$. If further $Y_t(z)$ is uniformly bounded, then Condition 1(ii) holds if $\max_{t=1,\dots,T} e_t^{-1} = \sqrt{T} \cdot o(1)$.

Proposition 1(i) implies that when $\lambda_{\min}(V_{\mathrm{ipw}})$ is uniformly bounded away from 0, Condition 1(i) holds if $e_t(z)$ either has vanishing variance or depends only on recent history. Specifically, Proposition 1(ia) requires $\mathrm{var}\{e_t(z)^{-1}\}$ to vanish asymptotically, which is likely to hold when $e_t(z)$ converges. In contrast, Proposition 1(ib) imposes much weaker restrictions on $\mathrm{var}\{e_t(z)^{-1}\}$ but requires $e_t(z)$ to depend only on recent history. When $e_t(z)$ is uniformly bounded away from 0, as in the $\epsilon$-greedy multi-armed bandit, the corresponding $v_t$ is uniformly bounded, with $\max_{t=1,\dots,T} v_t = o(T^{1-\beta})$ for any $\beta < 1$. This allows choosing $\beta$ arbitrarily close to 1 to accommodate longer-range dependence.

Proposition 1(ii) provides intuition for the rate at which $e_t$ may vanish under Condition 1(ii). When $Y_t(z)$ is uniformly bounded, the condition in Proposition 1(ii) holds if there exist constants $\delta, c_0 > 0$ such that $\min_{t=1,\dots,T} e_t \ge c_0 T^{-1/2+\delta}$. The rate $-1/2 + \delta$ is more stringent than the condition assumed by Bibaut et al. (2021) for identically distributed units, as their approach used reweighting to mitigate the impact of highly variable components in $\hat{Y}_{\mathrm{ipw}}(z)$.
However, this reweighting strategy may not be desirable when units are nonexchangeable. See Section S1 of the Supplementary Material for a simulated illustration.

Remark 1. When the treatment probability $e_t(z)$ vanishes asymptotically for one or more treatment arms, the diagonal elements of $V_{\mathrm{ipw}}$ may differ in order due to the small effective sample sizes for those arms. As a result, the estimated average potential outcomes $\hat{Y}_{\mathrm{ipw}}(z)$ may converge at different rates across $z \in \mathcal{Z}$. When contrasts among treatment arms are of interest, the rate required for the central limit theorem is typically determined by the arm with the smallest effective sample size. Therefore, the central limit theorem may hold under weaker sufficient conditions than those in Proposition 1. For example, when the number of units in the first arm is substantially smaller than $T$ in order, the corresponding scaled variance of $\hat{Y}_{\mathrm{ipw}}(1)$, $T\,\mathrm{var}\{\hat{Y}_{\mathrm{ipw}}(1)\}$, may diverge as $T \to \infty$. In such cases, we may relax the sufficient conditions in Proposition 1, which are derived under only the assumption that $T\,\mathrm{var}\{\hat{Y}_{\mathrm{ipw}}(1)\}$, as the first diagonal element of $V_{\mathrm{ipw}}$, is bounded away from 0. For ease of presentation and to avoid technical clutter, we focus on the central limit theorem in Theorem 2, which simultaneously accommodates all contrast matrices and allows for arms with much smaller effective sample sizes relative to $T$.

3.2 Variance estimation and confidence sets

We now construct confidence sets for $\tau_C$ based on $\hat{\tau}_{\mathrm{ipw},C}$. Recall from (5) that $Y_t = (Y_t(1), \dots, Y_t(K))^\top$. Let $\hat{Y}_{t,\mathrm{ipw}} = (\hat{Y}_{t,\mathrm{ipw}}(1), \dots, \hat{Y}_{t,\mathrm{ipw}}(K))^\top$ be its IPW estimator, and define
$$\hat{V}_{\mathrm{ipw}} = \frac{1}{T-1} \sum_{t=1}^T (\hat{Y}_{t,\mathrm{ipw}} - \hat{Y}_{\mathrm{ipw}})(\hat{Y}_{t,\mathrm{ipw}} - \hat{Y}_{\mathrm{ipw}})^\top \tag{7}$$
as the sample covariance matrix of $\{\hat{Y}_{t,\mathrm{ipw}} : t = 1, \dots, T\}$. Proposition 2 below justifies using $\hat{V}_{\mathrm{ipw}}$ to estimate $V_{\mathrm{ipw}}$.

Condition 2.
As $T \to \infty$, $\{\lambda_{\min}(V_{\mathrm{ipw}})\}^{-2} \cdot T^{-2} \sum_{t=1}^T L_t^4 / e_t^3 = o(1)$.

Proposition 2. Assume the adaptive randomization in Definition 1. Let $S = (T-1)^{-1} \sum_{t=1}^T (Y_t - \bar{Y})(Y_t - \bar{Y})^\top$ denote the sample covariance matrix of $\{Y_t : t = 1, \dots, T\}$. For any fixed matrix $C$ with $K$ columns,

(i) $E(C \hat{V}_{\mathrm{ipw}} C^\top) = C V_{\mathrm{ipw}} C^\top + C S C^\top$, where $C S C^\top = 0$ if and only if $\tau_{C,t} = C Y_t$ is constant across $t = 1, \dots, T$.

(ii) If Conditions 1–2 hold, then $C \hat{V}_{\mathrm{ipw}} C^\top = C V_{\mathrm{ipw}} C^\top + C S C^\top + \lambda_{\min}(V_{\mathrm{ipw}}) \cdot o_P(1)$.

Recall from Theorem 1 that $\mathrm{cov}(\hat{\tau}_{\mathrm{ipw},C}) = T^{-1} C V_{\mathrm{ipw}} C^\top$. Proposition 2 implies that $C \hat{V}_{\mathrm{ipw}} C^\top$ is a conservative, and therefore valid, estimator of $C V_{\mathrm{ipw}} C^\top = T\,\mathrm{cov}(\hat{\tau}_{\mathrm{ipw},C})$ in both expectation and probability limit, with an upward bias of $C S C^\top$. This conservativeness in variance estimation is inherent to the design-based framework (Neyman, 1923; Imbens and Rubin, 2015), and vanishes if and only if the individual analogs of $\tau_C$, $\tau_{C,t} = C Y_t$, are constant across all units. When covariates are available, this upward bias can be reduced by constructing a lower bound for $S$ to adjust $\hat{V}_{\mathrm{ipw}}$. We present the details in Proposition 5 in Section 4 as part of the theory for the AIPW estimator.

Let
$$\mathcal{S}_{\mathrm{ipw},C,\alpha} = \left\{\tau : (\hat{\tau}_{\mathrm{ipw},C} - \tau)^\top (T^{-1} C \hat{V}_{\mathrm{ipw}} C^\top)^{-1} (\hat{\tau}_{\mathrm{ipw},C} - \tau) \le \chi^2_{\mathrm{rank}(C), 1-\alpha}\right\}$$
denote the $100(1-\alpha)\%$ confidence set for $\tau_C$ based on $(\hat{\tau}_{\mathrm{ipw},C}, \hat{V}_{\mathrm{ipw}})$ and the normal approximation, where $\mathrm{rank}(C)$ denotes the rank of $C$ and $\chi^2_{\mathrm{rank}(C), 1-\alpha}$ denotes the $1-\alpha$ quantile of the chi-square distribution with $\mathrm{rank}(C)$ degrees of freedom. Theorem 3 below follows from Theorem 2 and Proposition 2, and establishes the validity of large-sample Wald-type inference based on $\mathcal{S}_{\mathrm{ipw},C,\alpha}$. This concludes our discussion of the IPW estimator.

Theorem 3. Assume the adaptive randomization in Definition 1.
If Conditions 1–2 hold, then $\liminf_{T \to \infty} P(\tau_C \in \mathcal{S}_{\mathrm{ipw},C,\alpha}) \ge 1 - \alpha$ for all $\alpha \in (0, 1)$ and any fixed matrix $C$ with $K$ columns and full row rank.

4 Design-based theory of the AIPW estimator

We now extend Section 3 to the AIPW estimator $\hat{\tau}_{\mathrm{aipw},C}$, as defined in (4). Recall from (3) that $\hat{\tau}_{\mathrm{aipw},C}$ is constructed using
$$\hat{Y}_{t,\mathrm{aipw}}(z) = \hat{Y}_{t,\mathrm{ipw}}(z) + \left\{1 - \frac{1(Z_t = z)}{e_t(z)}\right\} \hat{m}_t(z),$$
where $\hat{m}_t(z)$ denotes an adaptive estimator of $Y_t(z)$, constructed using only the information in $\mathcal{H}_t$. The results for $\hat{\tau}_{\mathrm{aipw},C}$ include those for $\hat{\tau}_{\mathrm{ipw},C}$ as a special case when $\hat{m}_t(z) = 0$.

4.1 Sampling properties

Let $A_t(z) = Y_t(z) - \hat{m}_t(z)$ denote an adjusted version of $Y_t(z)$, interpreted as the residual from the outcome estimator $\hat{m}_t(z)$. Let $A_t = (A_t(1), \dots, A_t(K))^\top$, analogous to $Y_t$. Let
$$V_{\mathrm{aipw}} = \mathrm{diag}\left[T^{-1} \sum_{t=1}^T E\left\{\frac{A_t(z)^2}{e_t(z)}\right\}\right]_{z \in \mathcal{Z}} - T^{-1} \sum_{t=1}^T E(A_t A_t^\top), \qquad \check{V}_{\mathrm{aipw}} = \mathrm{diag}\left[T^{-1} \sum_{t=1}^T \frac{A_t(z)^2}{e_t(z)}\right]_{z \in \mathcal{Z}} - T^{-1} \sum_{t=1}^T A_t A_t^\top \tag{8}$$
denote the AIPW analogs of $V_{\mathrm{ipw}}$ and $\check{V}_{\mathrm{ipw}}$, respectively, with $E(\check{V}_{\mathrm{aipw}}) = V_{\mathrm{aipw}}$.

Recall that $L_t = \max_{z \in \mathcal{Z}} |Y_t(z)|$, and $e_t = \min_{\mathcal{H}_t, z \in \mathcal{Z}} e_t(z)$ denotes the minimum of $e_t(z)$ over all $|\mathcal{Z}|^t$ possible values of $(\mathcal{H}_t, z)$. Let $M_t = \max_{\mathcal{H}_t, z \in \mathcal{Z}} |\hat{m}_t(z)|$ denote the maximum of $|\hat{m}_t(z)|$ over the same set. Theorem 4 below extends Theorems 1–2, and establishes the unbiasedness and central limit theorem for $\hat{\tau}_{\mathrm{aipw},C}$.

Condition 3. As $T \to \infty$,

(i) $\{\lambda_{\min}(V_{\mathrm{aipw}})\}^{-1} \cdot \|\check{V}_{\mathrm{aipw}} - V_{\mathrm{aipw}}\|_{\mathrm{F}} = o_P(1)$.

(ii) $\{\lambda_{\min}(V_{\mathrm{aipw}})\}^{-1} \cdot T^{-1} \max_{t=1,\dots,T} (L_t + M_t)^2 / e_t^2 = o(1)$.

Theorem 4. Assume the adaptive randomization in Definition 1. For any fixed matrix $C$ with $K$ columns and full row rank,

(i) $E(\hat{\tau}_{\mathrm{aipw},C}) = \tau_C$ and $\mathrm{cov}(\hat{\tau}_{\mathrm{aipw},C}) = T^{-1} C V_{\mathrm{aipw}} C^\top$.
(ii) If Condition 3 holds, then $(C V_{\mathrm{aipw}} C^\top)^{-1/2} \cdot \sqrt{T}(\hat{\tau}_{\mathrm{aipw},C} - \tau_C) \stackrel{d}{\to} N(0, I)$.

Recall from Theorem 1 that the covariance matrix of $\hat{\tau}_{\mathrm{ipw},C}$ is $T^{-1} C V_{\mathrm{ipw}} C^\top$. Theorem 4 further establishes that the covariance matrix of $\hat{\tau}_{\mathrm{aipw},C}$ is $T^{-1} C V_{\mathrm{aipw}} C^\top$. The relative efficiency of $\hat{\tau}_{\mathrm{aipw},C}$ compared to $\hat{\tau}_{\mathrm{ipw},C}$ therefore depends on the difference $V_{\mathrm{aipw}} - V_{\mathrm{ipw}}$. Let $\hat{Y}_{t,\mathrm{aipw}} = (\hat{Y}_{t,\mathrm{aipw}}(1), \dots, \hat{Y}_{t,\mathrm{aipw}}(K))^\top$ denote the AIPW estimator of $Y_t = (Y_t(1), \dots, Y_t(K))^\top$, generalizing the IPW estimator $\hat{Y}_{t,\mathrm{ipw}}$ as defined in (7). Echoing the discussion following Condition 1, we show in the Supplementary Material that
$$V_{\mathrm{ipw}} = T^{-1} \sum_{t=1}^T E\left\{\mathrm{cov}(\hat{Y}_{t,\mathrm{ipw}} - Y_t \mid \mathcal{H}_t)\right\}, \qquad V_{\mathrm{aipw}} = T^{-1} \sum_{t=1}^T E\left\{\mathrm{cov}(\hat{Y}_{t,\mathrm{aipw}} - Y_t \mid \mathcal{H}_t)\right\}, \tag{9}$$
where the elements of $\hat{Y}_{t,\mathrm{ipw}} - Y_t$ and $\hat{Y}_{t,\mathrm{aipw}} - Y_t$ are
$$\hat{Y}_{t,\mathrm{ipw}}(z) - Y_t(z) = \left\{\frac{1(Z_t = z)}{e_t(z)} - 1\right\} Y_t(z), \qquad \hat{Y}_{t,\mathrm{aipw}}(z) - Y_t(z) = \left\{\frac{1(Z_t = z)}{e_t(z)} - 1\right\} \{Y_t(z) - \hat{m}_t(z)\}$$
for $z \in \mathcal{Z}$, respectively, by (1) and (3). Heuristically, if $\hat{m}_t(z)$ is a reasonable estimator of $Y_t(z)$, then $\hat{Y}_{t,\mathrm{aipw}}$ tends to be less variable than $\hat{Y}_{t,\mathrm{ipw}}$, with $\mathrm{cov}(\hat{Y}_{t,\mathrm{aipw}} - Y_t \mid \mathcal{H}_t) < \mathrm{cov}(\hat{Y}_{t,\mathrm{ipw}} - Y_t \mid \mathcal{H}_t)$. This implies $V_{\mathrm{aipw}} < V_{\mathrm{ipw}}$ by (9), suggesting that $\hat{\tau}_{\mathrm{aipw},C}$ improves efficiency over $\hat{\tau}_{\mathrm{ipw},C}$. We illustrate this efficiency gain through simulation in Section 6.

Condition 3 extends Condition 1, and specifies the regularity conditions for the central limit theorem of $\hat{\tau}_{\mathrm{aipw},C}$ in Theorem 4(ii). Specifically, Condition 3(i) and (ii) are the variance convergence and Lindeberg conditions, respectively, for applying the martingale central limit theorem to the sequence $\{\hat{Y}_{t,\mathrm{aipw}} - Y_t : t = 1, \dots, T\}$.
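The heuristic above can be seen in a toy simulation (an illustrative sketch with made-up values, not the paper's simulation study): a two-arm design with fixed assignment probabilities, where $\hat{m}_t(z)$ is the running mean of past observed outcomes in arm $z$ and hence a function of $\mathcal{H}_t$ only. Both estimators are unbiased, and the AIPW contrast is markedly less variable once $\hat{m}_t(z)$ tracks $Y_t(z)$:

```python
import numpy as np

rng = np.random.default_rng(5)

T = 200
t_grid = np.arange(T)
Y = np.column_stack([2.0 + 0.1 * np.sin(t_grid),   # Y_t(1), fixed
                     2.5 + 0.1 * np.cos(t_grid)])  # Y_t(2), fixed
e = np.array([0.3, 0.7])                           # fixed e_t(z), a special case
tau = Y[:, 1].mean() - Y[:, 0].mean()              # true finite-population contrast

def one_run(rng):
    hist_sum = np.zeros(2)
    hist_n = np.zeros(2)
    ipw = np.zeros(2)
    aipw = np.zeros(2)
    for t in range(T):
        # hat{m}_t(z): running mean of past observed outcomes in arm z,
        # a function of the history H_t only (zero before any data).
        m = np.where(hist_n > 0, hist_sum / np.maximum(hist_n, 1.0), 0.0)
        z = rng.choice(2, p=e)
        y = Y[t, z]
        for arm in range(2):
            w = (1.0 if arm == z else 0.0) / e[arm]   # 1(Z_t = arm)/e_t(arm)
            ipw[arm] += w * y                          # eq. (1)
            aipw[arm] += w * y + (1.0 - w) * m[arm]    # eq. (3)
        hist_sum[z] += y
        hist_n[z] += 1
    return ipw[1] - ipw[0], aipw[1] - aipw[0]

draws = np.array([one_run(rng) for _ in range(2000)]) / T
tau_ipw, tau_aipw = draws[:, 0], draws[:, 1]
```

Across replications, `tau_ipw` and `tau_aipw` both center at the true contrast, while the variance of `tau_aipw` is dominated by the few early rounds before the running means stabilize.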
Parallel to the discussion following Theorem 2, $\lambda_{\min}(V_{\mathrm{aipw}})$ is the minimum scaled variance of $\hat{\tau}_{\mathrm{aipw},C} = C \hat{Y}_{\mathrm{aipw}}$ over all $C \in \mathbb{R}^{1 \times K}$ with $\|C\|_2 = 1$, and it is reasonable to assume that $\lambda_{\min}(V_{\mathrm{aipw}})$ is uniformly bounded away from 0. In addition, outcomes in practice are typically bounded, making it also reasonable to assume that both $Y_t(z)$ and $\hat{m}_t(z)$ are uniformly bounded. Proposition 3 below builds on these intuitions and provides four sufficient conditions for Condition 3. See Proposition S1 and Lemma S12 in the Supplementary Material for more general versions that allow for vanishing $e_t(z)$ and unbounded outcomes.

Recall $v_t = \max_{z \in \mathcal{Z}} \mathrm{var}\{e_t(z)^{-1}\}$ from Proposition 1. Let $\omega_t = \max_{z \in \mathcal{Z}} \mathrm{var}\{\hat{m}_t(z)\}$ denote the maximum variance of $\hat{m}_t(z)$ at time $t$.

Proposition 3. Assume the adaptive randomization in Definition 1. If, as $T \to \infty$, (a) both $\lambda_{\min}(V_{\mathrm{aipw}})$ and $e_t(z)$ are uniformly bounded away from 0, and (b) both $Y_t(z)$ and $\hat{m}_t(z)$ are uniformly bounded, then Condition 3 holds if any of the following conditions holds:

(i) Vanishing $\mathrm{var}\{e_t(z)^{-1}\}$ and $\mathrm{var}\{\hat{m}_t(z)\}$: $T^{-1} \sum_{t=1}^T v_t = o(1)$ and $T^{-1} \sum_{t=1}^T \omega_t = o(1)$.

(ii) Vanishing $\mathrm{var}\{e_t(z)^{-1}\}$ and limited-range dependence of $\hat{m}_t(z)$: $T^{-1} \sum_{t=1}^T \sqrt{v_t} = o(1)$, and there exist constants $\beta \in [0, 1)$ and $c_0 > 0$ such that after some fixed time point $T_0$, $\{\hat{m}_t(z)\}_{z \in \mathcal{Z}}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.

(iii) Vanishing $\mathrm{var}\{\hat{m}_t(z)\}$ and limited-range dependence of $e_t(z)$: $T^{-1} \sum_{t=1}^T \omega_t = o(1)$, and there exist constants $\beta \in [0, 1)$ and $c_0 > 0$ such that after some fixed time point $T_0$, $\{e_t(z)\}_{z \in \mathcal{Z}}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.
(iv) Limited-range dependence of $e_t(z)$ and $\hat m_t(z)$: there exist constants $\beta \in [0,1)$ and $c_0 > 0$ such that after some fixed time point $T_0$, $\{e_t(z), \hat m_t(z) : z \in \mathcal{Z}\}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.

Each of the four sufficient conditions in Proposition 3 imposes restrictions on both $e_t(z)$ and $\hat m_t(z)$. Together, they require:

- $e_t(z)$ has either vanishing $\operatorname{var}\{e_t(z)^{-1}\}$ or limited-range dependence on past units, and

- $\hat m_t(z)$ has either vanishing $\operatorname{var}\{\hat m_t(z)\}$ or limited-range dependence on past units,

as illustrated by the table below:

                                              Vanishing var{m_t(z)}    Limited-range dependence of m_t(z)
  Vanishing var{e_t(z)^{-1}}                  Proposition 3(i)         Proposition 3(ii)
  Limited-range dependence of e_t(z)          Proposition 3(iii)       Proposition 3(iv)

Consider the special case in which the sequence of finite populations consists of the first $T$ units of a sequentially enrolled infinite population $\{X_t, Z_t, Y_t(z) : z \in \mathcal{Z}\}_{t=1}^\infty$ for $T = 1, 2, \ldots$. By the dominated convergence theorem (Spall, 2003), the conditions of vanishing $\operatorname{var}\{e_t(z)^{-1}\}$ in Proposition 3(i)–(ii), $T^{-1}\sum_{t=1}^T v_t = o(1)$ and $T^{-1}\sum_{t=1}^T \sqrt{v_t} = o(1)$, hold if $e_t(z)$ converges in probability to a constant as $t \to \infty$. In contrast, the limited-range dependence condition on $e_t(z)$ in Proposition 3(iii)–(iv) allows for nonconverging $e_t(z)$. The same applies to the outcome estimator $\hat m_t(z)$. Moreover, there is no restriction on $\beta$ beyond $\beta \in [0,1)$, so we can choose $\beta$ arbitrarily close to 1 to allow for longer-range dependence.

4.2 Variance estimation and confidence sets

We now construct confidence sets for $\tau_C$ based on $\hat\tau_{\text{aipw},C}$. Recall from (9) that $\hat Y_{t,\text{aipw}} = (\hat Y_{t,\text{aipw}}(1), \ldots, \hat Y_{t,\text{aipw}}(K))^\top$ denotes the AIPW estimator of $Y_t = (Y_t(1), \ldots, Y_t(K))^\top$.
Parallel to $\hat V_{\text{ipw}}$ as defined in (7), define
$$\hat V_{\text{aipw}} = \frac{1}{T-1}\sum_{t=1}^T (\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}})(\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}})^\top$$
as the sample covariance matrix of $\{\hat Y_{t,\text{aipw}} : t = 1, \ldots, T\}$, where $\hat Y_{\text{aipw}} = T^{-1}\sum_{t=1}^T \hat Y_{t,\text{aipw}}$. Recall that $S$ denotes the sample covariance matrix of $\{Y_t : t = 1, \ldots, T\}$. Proposition 4 below extends Proposition 2 and justifies using $C\hat V_{\text{aipw}}C^\top$ to estimate $CV_{\text{aipw}}C^\top$, with an upward bias of $CSC^\top$ that vanishes when $CY_t$ is constant across $t = 1, \ldots, T$.

Condition 4. As $T \to \infty$, $\{\lambda_{\min}(V_{\text{aipw}})\}^{-2} \cdot T^{-2}\sum_{t=1}^T (L_t + M_t)^4 / e_t^3 = o(1)$.

Proposition 4. Assume the adaptive randomization in Definition 1. For any fixed matrix $C$ with $K$ columns,

(i) $E(C\hat V_{\text{aipw}}C^\top) = CV_{\text{aipw}}C^\top + CSC^\top$, where $CSC^\top = 0$ if and only if $\tau_{C,t} = CY_t$ is constant across $t = 1, \ldots, T$.

(ii) If Conditions 3–4 hold, then $C\hat V_{\text{aipw}}C^\top = CV_{\text{aipw}}C^\top + CSC^\top + \lambda_{\min}(V_{\text{aipw}}) \cdot o_P(1)$.

As previewed after Proposition 2, we can reduce the upward bias $CSC^\top$ by adjusting $\hat V_{\text{aipw}}$ using a lower bound for $S$. We formalize the improved covariance estimator below. Let $\hat m_t = (\hat m_t(1), \ldots, \hat m_t(K))^\top$ denote the vector of outcome estimators for unit $t$ across all $K$ levels. Let $\tilde Y_{t,\text{aipw}} = \hat Y_{t,\text{aipw}} - \hat m_t$, and define
$$\tilde V_{\text{aipw}} = \frac{1}{T-1}\sum_{t=1}^T (\tilde Y_{t,\text{aipw}} - \tilde Y_{\text{aipw}})(\tilde Y_{t,\text{aipw}} - \tilde Y_{\text{aipw}})^\top$$
as the sample covariance matrix of $\{\tilde Y_{t,\text{aipw}} : t = 1, \ldots, T\}$, where $\tilde Y_{\text{aipw}} = T^{-1}\sum_{t=1}^T \tilde Y_{t,\text{aipw}}$.

Proposition 5. Assume the adaptive randomization in Definition 1. If Conditions 3–4 hold, then $C\tilde V_{\text{aipw}}C^\top = CV_{\text{aipw}}C^\top + C\Omega C^\top + \lambda_{\min}(V_{\text{aipw}}) \cdot o_P(1)$, where $\Omega$ denotes the sample covariance matrix of $\{Y_t - \hat m_t : t = 1, \ldots, T\}$.

Propositions 4–5 imply that both $\hat V_{\text{aipw}}$ and $\tilde V_{\text{aipw}}$ are asymptotically valid for estimating $V_{\text{aipw}}$, with upward biases $S$ and $\Omega$, respectively.
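To make the two covariance estimators concrete, here is a minimal sketch (ours; the variable names and the running-OLS outcome model are illustrative, not the paper's code). It computes $\hat V_{\text{aipw}}$ as the sample covariance of the per-unit AIPW vectors, the sharper $\tilde V_{\text{aipw}}$ from $\tilde Y_{t,\text{aipw}} = \hat Y_{t,\text{aipw}} - \hat m_t$, and the implied standard errors for $\tau_C$ with $C = (-1, 1)$ under Bernoulli randomization:

```python
import numpy as np

rng = np.random.default_rng(1)
T, e = 2000, 0.5
X = rng.normal(0, 1, T)
Y0 = 1 + 2 * X + rng.normal(0, 1, T)     # fixed potential outcomes, arms z = 1, 2
Y1 = 1 + 4 * X + rng.normal(0, 1, T)
tau = Y1.mean() - Y0.mean()

Z = rng.random(T) < e                    # Bernoulli assignment to arm 2
Yobs = np.where(Z, Y1, Y0)

# Adaptive m_hat_t(z): simple linear regression of past arm-z outcomes on X,
# maintained via running sufficient statistics, so it is H_t-measurable at each t.
stats = np.zeros((2, 5))                 # per arm: n, sum x, sum y, sum x^2, sum xy
m = np.zeros((T, 2))
for t in range(T):
    for k in range(2):
        n, sx, sy, sxx, sxy = stats[k]
        den = n * sxx - sx * sx
        if n >= 2 and den > 1e-9:
            b = (n * sxy - sx * sy) / den
            m[t, k] = (sy - b * sx) / n + b * X[t]
    k = int(Z[t])
    stats[k] += [1, X[t], Yobs[t], X[t] ** 2, X[t] * Yobs[t]]

w1, w0 = Z / e, (~Z) / (1 - e)
Yhat = np.column_stack([w0 * Yobs + (1 - w0) * m[:, 0],      # per-unit AIPW vectors
                        w1 * Yobs + (1 - w1) * m[:, 1]])
Ytil = Yhat - m
V_hat = np.cov(Yhat, rowvar=False)       # conservative: upward bias ~ S, cov of Y_t
V_til = np.cov(Ytil, rowvar=False)       # sharper: upward bias ~ Omega, cov of residuals
C = np.array([-1.0, 1.0])
tau_hat = C @ Yhat.mean(axis=0)
se_hat = np.sqrt(C @ V_hat @ C / T)
se_til = np.sqrt(C @ V_til @ C / T)
print(tau_hat, tau)                      # point estimate vs finite-population estimand
print(se_hat, se_til)                    # expect se_til < se_hat
```

Because the outcome model tracks the covariate signal, the residuals $Y_t - \hat m_t$ vary much less across units than $Y_t$ itself, which is exactly why $\tilde V_{\text{aipw}}$ is less conservative.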
Since $S$ and $\Omega$ are the sample covariance matrices of $Y_t$ and $Y_t - \hat m_t$, respectively, we heuristically expect $\Omega \le S$ when $\hat m_t(z)$ is a reasonable estimator of $Y_t(z)$, making $\tilde V_{\text{aipw}}$ less conservative than $\hat V_{\text{aipw}}$ for inference. We illustrate the improved precision through simulation in Section 6. Let
$$\mathcal{S}_{\text{aipw},C,\alpha} = \bigl\{\tau : (\hat\tau_{\text{aipw},C} - \tau)^\top (T^{-1}C\hat V_{\text{aipw}}C^\top)^{-1}(\hat\tau_{\text{aipw},C} - \tau) \le \chi^2_{\operatorname{rank}(C),\,1-\alpha}\bigr\},$$
$$\tilde{\mathcal{S}}_{\text{aipw},C,\alpha} = \bigl\{\tau : (\hat\tau_{\text{aipw},C} - \tau)^\top (T^{-1}C\tilde V_{\text{aipw}}C^\top)^{-1}(\hat\tau_{\text{aipw},C} - \tau) \le \chi^2_{\operatorname{rank}(C),\,1-\alpha}\bigr\}$$
denote the $100(1-\alpha)\%$ Wald-type confidence sets for $\tau_C$ based on $(\hat\tau_{\text{aipw},C}, \hat V_{\text{aipw}})$ and $(\hat\tau_{\text{aipw},C}, \tilde V_{\text{aipw}})$, respectively. Theorem 5 below follows from Theorem 4 and Propositions 4–5, and establishes the validity of large-sample Wald-type inference based on $\mathcal{S}_{\text{aipw},C,\alpha}$ and $\tilde{\mathcal{S}}_{\text{aipw},C,\alpha}$. This concludes our theory of the AIPW estimator for estimating treatment effects under adaptive randomization.

Theorem 5. Assume the adaptive randomization in Definition 1. If Conditions 3–4 hold, then for any fixed matrix $C$ with $K$ columns and full row rank,
$$\liminf_{T\to\infty} P(\tau_C \in \mathcal{S}_{\text{aipw},C,\alpha}) \ge 1 - \alpha, \qquad \liminf_{T\to\infty} P(\tau_C \in \tilde{\mathcal{S}}_{\text{aipw},C,\alpha}) \ge 1 - \alpha$$
for all $\alpha \in (0,1)$.

4.3 Adaptive analysis of nonadaptive designs

Recall that Definition 1 includes both adaptive and nonadaptive designs. Therefore, the definition of the AIPW estimator $\hat\tau_{\text{aipw},C}$ in (4) with an adaptively constructed outcome estimator $\hat m_t(z)$, along with all its design-based guarantees in Sections 4.1–4.2, also applies to nonadaptive designs.
As previewed in Section 1, we propose using $\hat\tau_{\text{aipw},C}$ for adaptive covariate adjustment even in nonadaptive designs, especially when outcome estimation via black-box machine learning methods is desired, for the following benefits:

(i) Compared with covariate adjustment strategies that construct $\hat m_t(z)$ using all units, adaptive covariate adjustment induces a martingale structure that enables valid inference under weaker regularity conditions on $\hat m_t(z)$, thereby allowing more flexible use of black-box machine learning estimation methods to improve efficiency.

(ii) Compared with covariate adjustment strategies that construct $\hat m_t(z)$ using cross-fitting with Bonferroni correction, adaptive covariate adjustment improves efficiency.

We illustrate these improvements through simulation in Section 6.

5 Extension to block adaptive designs

5.1 Treatment assignment mechanism

The discussion so far has focused on designs with treatment assigned adaptively at the unit level. We now extend our theory to block adaptive randomization, where treatment is assigned sequentially, group by group, and the treatment probabilities for a new group may depend on the covariates, assignments, and outcomes of earlier groups.

Consider an experiment with $K \ge 2$ treatment levels, indexed by $z \in \mathcal{Z} = \{1, \ldots, K\}$, and a study population of $T$ groups of units, indexed by $t = 1, \ldots, T$. A block adaptive design sequentially randomizes the units in group $t$ to treatments at time $t$, with the assignment probabilities determined by the history up to time $t$. Let $n_t$ denote the size of group $t$, and index the units in group $t$ by $\{ti : i = 1, \ldots, n_t\}$. For each unit $ti$, let $Z_{ti} \in \mathcal{Z}$ denote its treatment assignment, $Y_{ti} \in \mathbb{R}$ its outcome, and $X_{ti} \in \mathbb{R}^J$ its covariate vector. Renew $Z_t = (Z_{t1}, \ldots, Z_{t,n_t})$, $Y_t = (Y_{t1}, \ldots, Y_{t,n_t})$, and $X_t = (X_{t1}, \ldots, X_{t,n_t})$ as the collections of $Z_{ti}$, $Y_{ti}$, and $X_{ti}$ for units in group $t$, and $\mathcal{H}_t = \{(Z_s, Y_s, X_s) : s = 1, \ldots, t-1\} \cup \{X_t\}$ as the history up to time $t$ with the renewed $(Z_s, Y_s, X_s)$ and $X_t$. This and all subsequent reuse of notation is justified by the fact that $(Z_t, Y_t, X_t, \mathcal{H}_t)$ revert to their original definitions in Section 2 when $n_t = 1$ for all $t$. Definition 2 below generalizes Definition 1 and characterizes block adaptive randomization via the conditional distribution of $Z_t$ given $\mathcal{H}_t$.

Definition 2 (Block adaptive randomization). Let $e_t(z) = P(Z_t = z \mid \mathcal{H}_t)$ denote the probability that $Z_t = (Z_{t1}, \ldots, Z_{t,n_t})$ equals $z = (z_1, \ldots, z_{n_t}) \in \mathcal{Z}^{n_t}$ given $\mathcal{H}_t$. Let $e_{ti}(z) = P(Z_{ti} = z \mid \mathcal{H}_t)$ denote the marginal probability of $Z_{ti} = z$ given $\mathcal{H}_t$ for $z \in \mathcal{Z}$, obtained by summing $e_t(z)$ over all possible values of $Z_t \setminus \{Z_{ti}\}$. For $t = 1, \ldots, T$ and all $z \in \mathcal{Z}^{n_t}$, $e_t(z)$ is a prespecified function of $\mathcal{H}_t$ that satisfies: (i) $\sum_{z \in \mathcal{Z}^{n_t}} e_t(z) = 1$; (ii) $e_{ti}(z) \in (0, 1)$ for all $i = 1, \ldots, n_t$ and $z \in \mathcal{Z}$.

Definition 2 requires that each unit-treatment pair $(ti, z)$ has positive assignment probability $e_{ti}(z)$ at each time $t = 1, \ldots, T$, but is otherwise general, allowing $e_{ti}(z)$ to depend arbitrarily on the history $\mathcal{H}_t$. Examples 3–4 below review pairwise sequential randomization and sequential rerandomization as two special cases previously studied in the literature. When $e_t(z) = P(Z_t = z \mid \mathcal{H}_t) = P(Z_t = z \mid X_t)$, so that the assignment $Z_t$ of group $t$ is independent of the previous groups, Definition 2 reduces to a nonadaptive block design that conducts an independent randomization within each group. When $n_t = 1$ for all $t$, Definition 2 reverts to the unit-level adaptive randomization in Definition 1.

Example 3 (Pairwise sequential randomization (Qin et al., 2018)).
Consider a treatment-control experiment with $T$ pairs of units, indexed by $t = 1, \ldots, T$. Pairwise sequential randomization randomly assigns one unit in pair $t$ to treatment and the other to control at each time $t$, with assignment probabilities determined by the covariates of these two units, $X_t = (X_{t1}, X_{t2})$, and the current covariate balance among pairs 1 through $t-1$. The corresponding $e_t(z)$ is a function of $\{(Z_s, X_s) : s = 1, \ldots, t-1\} \cup \{X_t\}$.

Example 4 (Sequential rerandomization (Zhou et al., 2018)). Consider an experiment with $T$ groups of units, indexed by $t = 1, \ldots, T$. Sequential rerandomization randomly assigns the units in group $t$ to treatments at each time $t$ as follows: for a tentative assignment $z \in \mathcal{Z}^{n_t}$ for the $n_t$ units in group $t$, accept $Z_t = z$ if $\{(Z_s, X_s) : s = 1, \ldots, t-1\} \cup (z, X_t)$ satisfies a prespecified balance criterion, and rerandomize by permuting the elements of $z$ otherwise. The corresponding $e_t(z)$ is a function of $\{(Z_s, X_s) : s = 1, \ldots, t-1\} \cup \{X_t\}$.

5.2 Inference using the IPW and AIPW estimators

Let $Y_{ti}(z)$ denote the potential outcome of unit $ti$ under treatment level $z$. Renew
$$\bar Y(z) = N^{-1}\sum_{t=1}^T \sum_{i=1}^{n_t} Y_{ti}(z), \qquad \text{where } N = \sum_{t=1}^T n_t,$$
as the population average potential outcome under treatment level $z$, with $N$ denoting the total number of units across all groups. The goal is to infer a set of linear combinations of the renewed $\{\bar Y(z) : z \in \mathcal{Z}\}$, denoted by $\tau_C = C\bar Y$, where $\bar Y = (\bar Y(1), \ldots, \bar Y(K))^\top$ is the vector of the renewed $\bar Y(z)$ and $C$ is a prespecified coefficient matrix with $K$ columns. Recall from Definition 2 that $e_{ti}(z) = P(Z_{ti} = z \mid \mathcal{H}_t)$ denotes the treatment probability for unit $ti$ given the history $\mathcal{H}_t$.
Parallel to the IPW and AIPW estimators of $Y_t(z)$ under unit-level adaptive randomization in (1) and (3), define
$$\hat Y_{ti,\text{ipw}}(z) = \frac{1(Z_{ti} = z)}{e_{ti}(z)}\, Y_{ti}, \qquad \hat Y_{ti,\text{aipw}}(z) = \hat Y_{ti,\text{ipw}}(z) + \left\{1 - \frac{1(Z_{ti} = z)}{e_{ti}(z)}\right\}\hat m_{ti}(z)$$
as the IPW and AIPW estimators of $Y_{ti}(z)$, respectively, where $\hat m_{ti}(z)$ is an adaptive estimator of $Y_{ti}(z)$ constructed using only information in $\mathcal{H}_t$. Renew
$$\hat Y_{\text{ipw}}(z) = N^{-1}\sum_{t=1}^T\sum_{i=1}^{n_t}\hat Y_{ti,\text{ipw}}(z), \quad \hat Y_{\text{ipw}} = (\hat Y_{\text{ipw}}(1), \ldots, \hat Y_{\text{ipw}}(K))^\top, \quad \hat\tau_{\text{ipw},C} = C\hat Y_{\text{ipw}},$$
$$\hat Y_{\text{aipw}}(z) = N^{-1}\sum_{t=1}^T\sum_{i=1}^{n_t}\hat Y_{ti,\text{aipw}}(z), \quad \hat Y_{\text{aipw}} = (\hat Y_{\text{aipw}}(1), \ldots, \hat Y_{\text{aipw}}(K))^\top, \quad \hat\tau_{\text{aipw},C} = C\hat Y_{\text{aipw}} \tag{10}$$
as the IPW and AIPW estimators of the renewed $\bar Y(z)$, $\bar Y$, and $\tau_C$, respectively. We establish below the design-based guarantees of the renewed $\hat\tau_{\text{aipw},C}$ in (10) for estimating the renewed $\tau_C$. The results include $\hat\tau_{\text{ipw},C}$ as a special case with $\hat m_{ti}(z) = 0$ for all $ti$ and $z$.

Define $\bar Y_t(z) = n_t^{-1}\sum_{i=1}^{n_t} Y_{ti}(z)$ as the average potential outcome in group $t$ under treatment level $z$, and renew
$$\hat Y_{t,\text{aipw}}(z) = n_t^{-1}\sum_{i=1}^{n_t}\hat Y_{ti,\text{aipw}}(z) \tag{11}$$
as the adaptive AIPW estimator of $\bar Y_t(z)$. Let $\pi_t = n_t/N$ denote the proportion of units within group $t$, and define $\rho_t = T \pi_t$. By direct algebra,
$$\bar Y(z) = \sum_{t=1}^T \pi_t \bar Y_t(z) = T^{-1}\sum_{t=1}^T \rho_t \bar Y_t(z), \qquad \hat Y_{\text{aipw}}(z) = \sum_{t=1}^T \pi_t \hat Y_{t,\text{aipw}}(z) = T^{-1}\sum_{t=1}^T \rho_t \hat Y_{t,\text{aipw}}(z), \tag{12}$$
paralleling the expressions of $\bar Y(z)$ and $\hat Y_{\text{aipw}}(z)$ in (4) under unit-level adaptive randomization, with the unit-level $\{Y_t(z), \hat Y_{t,\text{aipw}}(z)\}$ in (3) replaced by the group-level $\{\rho_t\bar Y_t(z), \rho_t\hat Y_{t,\text{aipw}}(z)\}$ in (12).
Therefore, all results in Section 4 extend to inference of the renewed $\tau_C$ using the renewed $\hat\tau_{\text{aipw},C}$ under block adaptive randomization after replacing $\{Y_t(z), \hat Y_{t,\text{aipw}}(z)\}$ with $\{\rho_t\bar Y_t(z), \rho_t\hat Y_{t,\text{aipw}}(z)\}$. We state the main results below.

Renew $\hat Y_{t,\text{aipw}} = (\hat Y_{t,\text{aipw}}(1), \ldots, \hat Y_{t,\text{aipw}}(K))^\top$ as the vector of the renewed $\hat Y_{t,\text{aipw}}(z)$ in (11). Renew $\hat m_t = (\hat m_t(1), \ldots, \hat m_t(K))^\top$, where $\hat m_t(z) = n_t^{-1}\sum_{i=1}^{n_t}\hat m_{ti}(z)$, and $\tilde Y_{t,\text{aipw}} = \hat Y_{t,\text{aipw}} - \hat m_t$ with the renewed $\hat Y_{t,\text{aipw}}$ and $\hat m_t$. Renew
$$\hat V_{\text{aipw}} = \frac{1}{T-1}\sum_{t=1}^T(\rho_t\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}})(\rho_t\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}})^\top, \qquad \tilde V_{\text{aipw}} = \frac{1}{T-1}\sum_{t=1}^T(\rho_t\tilde Y_{t,\text{aipw}} - \tilde Y_{\text{aipw}})(\rho_t\tilde Y_{t,\text{aipw}} - \tilde Y_{\text{aipw}})^\top \tag{13}$$
as the sample covariance matrices of $(\rho_t\hat Y_{t,\text{aipw}})_{t=1}^T$ and $(\rho_t\tilde Y_{t,\text{aipw}})_{t=1}^T$, respectively, and renew $\mathcal{S}_{\text{aipw},C,\alpha}$ and $\tilde{\mathcal{S}}_{\text{aipw},C,\alpha}$ as the $100(1-\alpha)\%$ Wald-type confidence sets for $\tau_C$ based on the renewed $(\hat\tau_{\text{aipw},C}, \hat V_{\text{aipw}})$ and $(\hat\tau_{\text{aipw},C}, \tilde V_{\text{aipw}})$ in (10) and (13), respectively. Renew
$$V_{\text{aipw}} = T^{-1}\sum_{t=1}^T E\bigl\{\operatorname{cov}(\rho_t\hat Y_{t,\text{aipw}} \mid \mathcal{H}_t)\bigr\}, \qquad \check V_{\text{aipw}} = T^{-1}\sum_{t=1}^T \operatorname{cov}(\rho_t\hat Y_{t,\text{aipw}} \mid \mathcal{H}_t),$$
and renew
$$L_t = \max_{z\in\mathcal{Z},\, i=1,\ldots,n_t} |Y_{ti}(z)|, \qquad M_t = \max_{\mathcal{H}_t,\, z\in\mathcal{Z},\, i=1,\ldots,n_t} \hat m_{ti}(z), \qquad e_t = \min_{\mathcal{H}_t,\, z\in\mathcal{Z},\, i=1,\ldots,n_t} e_{ti}(z)$$
as the respective extrema of $|Y_{ti}(z)|$, $\hat m_{ti}(z)$, and $e_{ti}(z)$ over all possible values of $(\mathcal{H}_t, z, i)$. Theorem 6 below extends Theorems 4–5 to block adaptive randomization, and justifies large-sample Wald-type inference of the renewed $\tau_C$ using the renewed $(\hat\tau_{\text{aipw},C}, \hat V_{\text{aipw}}, \mathcal{S}_{\text{aipw},C,\alpha})$.

Condition 5. As $T \to \infty$, (i) $\{\lambda_{\min}(V_{\text{aipw}})\}^{-1} \cdot \|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm{f}} = o(1)$; (ii) $\{\lambda_{\min}(V_{\text{aipw}})\}^{-1} \cdot T^{-1}\max_{t=1,\ldots,T}\{\rho_t(L_t + M_t)/e_t\}^2 = o(1)$.

Condition 6.
As $T \to \infty$, $\{\lambda_{\min}(V_{\text{aipw}})\}^{-2} \cdot T^{-2}\sum_{t=1}^T \{\rho_t(L_t + M_t)\}^4 / e_t^3 = o(1)$.

Theorem 6. Assume the block adaptive randomization in Definition 2. For any fixed matrix $C$ with $K$ columns and full row rank,

(i) $E(\hat\tau_{\text{aipw},C}) = \tau_C$ and $\operatorname{cov}(\hat\tau_{\text{aipw},C}) = T^{-1}CV_{\text{aipw}}C^\top$.

(ii) If Condition 5 holds, then $(CV_{\text{aipw}}C^\top)^{-1/2}\cdot\sqrt{T}\,(\hat\tau_{\text{aipw},C} - \tau_C) \overset{d}{\to} \mathcal{N}(0, I)$.

(iii) If Conditions 5–6 hold, then $\liminf_{T\to\infty} P(\tau_C \in \mathcal{S}_{\text{aipw},C,\alpha}) \ge 1-\alpha$ and $\liminf_{T\to\infty} P(\tau_C \in \tilde{\mathcal{S}}_{\text{aipw},C,\alpha}) \ge 1-\alpha$ for all $\alpha \in (0,1)$.

Theorem 6(i)–(ii) extend Theorem 4 and imply that $\hat\tau_{\text{aipw},C}$ is an unbiased, consistent, and asymptotically normal estimator of $\tau_C$, with $\operatorname{cov}(\hat\tau_{\text{aipw},C}) = T^{-1}CV_{\text{aipw}}C^\top$. Theorem 6(iii) extends Theorem 5 and establishes the coverage guarantee of $\mathcal{S}_{\text{aipw},C,\alpha}$ and $\tilde{\mathcal{S}}_{\text{aipw},C,\alpha}$.

Remark 2 (Adaptive covariate adjustment for nonadaptive block designs). Parallel to our recommendation in Section 4.3, the adaptive AIPW estimator in (10) also applies to data from nonadaptive block randomization and offers several advantages over standard nonadaptive covariate adjustment methods.

5.3 An alternative covariance estimator

Recall from Proposition 4 that under unit-level adaptive randomization, $C\hat V_{\text{aipw}}C^\top$ has an upward bias of $CSC^\top$ in estimating $CV_{\text{aipw}}C^\top$, which vanishes if and only if $CY_t$ is constant across $t = 1, \ldots, T$. Under block adaptive randomization, let $Y_{ti} = (Y_{ti}(1), \ldots, Y_{ti}(K))^\top$ denote the potential outcomes vector for unit $ti$, and let $\bar Y_t = n_t^{-1}\sum_{i=1}^{n_t} Y_{ti} = (\bar Y_t(1), \ldots, \bar Y_t(K))^\top$ denote the group average. The discussion following (12) implies that $\rho_t\bar Y_t$ is the analog of $Y_t$ under block adaptive randomization, so the renewed $C\hat V_{\text{aipw}}C^\top$ is unbiased if and only if $C(\rho_t\bar Y_t)$ is constant across $t = 1, \ldots, T$.
Note that $C(\rho_t\bar Y_t) = \bar n^{-1}\sum_{i=1}^{n_t} CY_{ti} = \bar n^{-1}\sum_{i=1}^{n_t}\tau_{C,ti}$, where $\tau_{C,ti} = CY_{ti}$ denotes the individual analog of $\tau_C$. Therefore, $C\hat V_{\text{aipw}}C^\top$ is unbiased if and only if the group total of $\tau_{C,ti}$, $\sum_{i=1}^{n_t}\tau_{C,ti}$, is constant across $t = 1, \ldots, T$. When it is more plausible that the group average of $\tau_{C,ti}$, denoted by $\bar\tau_{C,t} = n_t^{-1}\sum_{i=1}^{n_t}\tau_{C,ti}$, is constant across $t$, we propose using
$$\hat V_{\text{aipw},b} = \sum_{t=1}^T b_t(\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}})(\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}})^\top, \qquad \text{where } b_t = \frac{T\,\pi_t^2/(1 - 2\pi_t)}{1 + \sum_{s=1}^T \pi_s^2/(1 - 2\pi_s)}$$
with $\pi_t = n_t/N$, as an alternative estimator of $V_{\text{aipw}}$, motivated by variance estimation for stratified experiments in Pashley and Miratrix (2021) and Ding (2024). Proposition 6 below establishes the unbiasedness of $C\hat V_{\text{aipw},b}C^\top$ for $CV_{\text{aipw}}C^\top$ when $\bar\tau_{C,t}$ is constant across $t$. The same intuition extends to $\tilde V_{\text{aipw},b} = \sum_{t=1}^T b_t(\tilde Y_{t,\text{aipw}} - \tilde Y_{\text{aipw}})(\tilde Y_{t,\text{aipw}} - \tilde Y_{\text{aipw}})^\top$, a variant of $\tilde V_{\text{aipw}}$. We provide the details in the Supplementary Material.

Proposition 6. Assume the block adaptive randomization in Definition 2. If the proportion of units within any group is below $1/2$, i.e., $\pi_t = n_t/N < 1/2$ for all $t = 1, \ldots, T$, then the following holds:

(i) $b_t > 0$ for $t = 1, \ldots, T$.

(ii) $\hat V_{\text{aipw},b} = \hat V_{\text{aipw}}$ when $n_t$ is constant across $t = 1, \ldots, T$.

(iii) $E(\hat V_{\text{aipw},b}) = T\operatorname{cov}(\hat Y_{\text{aipw}}) + \sum_{t=1}^T b_t(\bar Y_t - \bar Y)(\bar Y_t - \bar Y)^\top$.

(iv) For any fixed matrix $C$ with $K$ columns, $E(C\hat V_{\text{aipw},b}C^\top) - T\operatorname{cov}(\hat\tau_{\text{aipw},C}) \ge 0$, where the equality holds if and only if $\bar\tau_{C,t} = n_t^{-1}\sum_{i=1}^{n_t}\tau_{C,ti}$ is constant across $t = 1, \ldots, T$.

6 Simulation

6.1 Adaptive covariate adjustment

We now illustrate the advantages of adaptive covariate adjustment introduced in Section 4.3 through simulation.
Assume $K = 2$ treatment levels, labeled $z = 1, 2$, and $T = 2{,}000$ units, indexed by $t = 1, \ldots, T$. We generate potential outcomes and covariates $\{Y_t(1), Y_t(2), X_t\}$ as i.i.d. samples from the following model:
$$X \sim \mathcal{N}(0, 1), \qquad Y(1) = 1 + 2X + \epsilon_1, \qquad Y(2) = 1 + 4X + \epsilon_2,$$
where $\epsilon_1, \epsilon_2$ are independent $\mathcal{N}(0,1)$, and assign treatments via Bernoulli randomization with $P(Z_t = 1) = P(Z_t = 2) = 0.5$. The observed outcome is $Y_t = 1(Z_t = 1)Y_t(1) + 1(Z_t = 2)Y_t(2)$ for unit $t$. The goal is to estimate the average treatment effect $\tau = \bar Y(2) - \bar Y(1)$.

We estimate $\bar Y(z)$ using the following five strategies, indexed by $* = \text{ipw}, \text{sm}, \text{aipw}, \text{all}, \text{cf}$, and estimate $\tau$ as $\hat\tau_* = \hat Y_*(2) - \hat Y_*(1)$, where $\hat Y_*(z)$ denotes the estimator of $\bar Y(z)$ by Strategy $*$. Strategies "all" and "cf" represent methods for covariate adjustment that estimate the outcome model using all units and via cross-fitting, respectively.

"ipw": Construct $\hat Y_{\text{ipw}}(z)$ as the IPW estimator defined in (2).

"sm": Construct $\hat Y_{\text{sm}}(z)$ as the sample mean of observed outcomes in treatment arm $z$.

"aipw": Construct $\hat Y_{\text{aipw}}(z)$ as the adaptive AIPW estimator defined in (4).

"all": Construct $\hat Y_{\text{all}}(z)$ as $\hat Y_{\text{all}}(z) = N_z^{-1}\sum_{t: Z_t = z}\{Y_t - \hat m^{\text{all}}_t(z)\} + T^{-1}\sum_{t=1}^T \hat m^{\text{all}}_t(z)$, where $\hat m^{\text{all}}_t(z)$ is an estimator of $Y_t(z)$ constructed using all $T$ units.

"cf": Construct $\hat Y_{\text{cf}}(z)$ as follows: (i) split the units into $G$ folds, indexed by $g = 1, \ldots, G$; (ii) for each fold $g$, estimate the fold average of $Y_t(z)$, denoted by $\hat Y^{[g]}(z)$, using a variant of Strategy "all" with $\hat m_t(z)$ estimated using units from the other folds; (iii) construct $\hat Y_{\text{cf}}(z)$ as a weighted average of $\{\hat Y^{[g]}(z)\}_{g=1}^G$. See Section S1 of the Supplementary Material for details.
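The strategies "sm", "all", and "cf" can be sketched on the data-generating model above. The code below is our illustration, not the paper's implementation: the helper `ols_predict` and the fold-weighting scheme are simplified stand-ins for the exact construction detailed in Section S1.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 2000
X = rng.normal(0, 1, T)
Y = {1: 1 + 2 * X + rng.normal(0, 1, T),   # fixed potential outcomes, arms 1 and 2
     2: 1 + 4 * X + rng.normal(0, 1, T)}
Z = np.where(rng.random(T) < 0.5, 1, 2)    # Bernoulli randomization
Yobs = np.where(Z == 1, Y[1], Y[2])
tau = Y[2].mean() - Y[1].mean()            # finite-population estimand

def ols_predict(x_fit, y_fit, x_new):
    b, a = np.polyfit(x_fit, y_fit, 1)     # slope, intercept
    return a + b * x_new

def Y_all(z):
    # Strategy "all": outcome model fit on all arm-z units.
    arm = Z == z
    m = ols_predict(X[arm], Yobs[arm], X)
    return (Yobs[arm] - m[arm]).mean() + m.mean()

def Y_cf(z, G=2):
    # Strategy "cf": per-fold estimates with the model fit out-of-fold,
    # combined as a fold-size-weighted average (illustrative weighting).
    folds = np.arange(T) % G
    ests, sizes = [], []
    for g in range(G):
        infold, out = folds == g, folds != g
        fit = out & (Z == z)
        m = ols_predict(X[fit], Yobs[fit], X)
        arm = (Z == z) & infold
        ests.append((Yobs[arm] - m[arm]).mean() + m[infold].mean())
        sizes.append(infold.sum())
    return np.average(ests, weights=sizes)

tau_sm = Yobs[Z == 2].mean() - Yobs[Z == 1].mean()
tau_all = Y_all(2) - Y_all(1)
tau_cf = Y_cf(2) - Y_cf(1)
print(tau, tau_sm, tau_all, tau_cf)        # adjusted estimators track tau closely
```

With the outcome model correctly specified, both adjusted estimators remove most of the covariate-driven variability that the simple difference in sample means retains.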
We implement the outcome regression in Strategies "aipw", "all", and "cf" using both ordinary least squares (OLS) regression and random forest regression of $Y_t$ on $X_t$ by treatment level. For $\hat\tau_{\text{aipw}}$ from Strategy "aipw", we construct confidence intervals using both $\hat V_{\text{aipw}}$ and $\tilde V_{\text{aipw}}$ as defined in Section 4.2 as the variance estimator. For $\hat\tau_{\text{cf}}$ from Strategy "cf", we set $G = 2$ and construct confidence intervals with and without Bonferroni correction. The variant without Bonferroni correction may lack theoretical guarantees.

Figure 1 shows the distributions of $\hat\tau_* - \tau$ for each strategy $* = \text{ipw}, \text{sm}, \text{aipw}, \text{all}, \text{cf}$ across 1,000 independent replications. All distributions are approximately normal and centered at zero, coherent with the consistency and asymptotic normality of the five point estimators.

Figure 1: Violin plots of the distributions of $\hat\tau_* - \tau$ for each strategy $* = \text{ipw}, \text{sm}, \text{aipw}, \text{all}, \text{cf}$ across 1,000 independent replications. The suffixes ".ls" and ".rf" indicate outcome regression using OLS and random forest, respectively.

Table 1 reports the mean squared errors of $\hat\tau_*$, along with the coverage rates and average lengths of the corresponding 95% confidence intervals. Key observations are threefold:

(i) Overall, adaptive covariate adjustment (aipw1, aipw2) achieves valid coverage under both OLS and random forest regressions, and yields shorter confidence intervals than unadjusted inferences (ipw, sm) and cross-fitting with Bonferroni correction (cf1).
(ii) Compared with nonadaptive adjustment that uses all units for outcome regression (all), adaptive covariate adjustment with the improved variance estimator (aipw2)

- performs comparably with OLS regression, where the outcome model is correctly specified so that $\hat\tau_{\text{all}}$ is asymptotically most efficient by standard theory;

- ensures valid inference with random forest regression, whereas Strategy "all" leads to substantial undercoverage (0.687).

(iii) The precision of aipw2 with the improved variance estimator is comparable to that of cross-fitting without Bonferroni correction (cf2), which may lack theoretical guarantees.

Table 1: Mean squared errors of $\hat\tau_*$ ($* = \text{ipw}, \text{sm}, \text{aipw}, \text{all}, \text{cf}$), along with the coverage rates and average lengths of the corresponding 95% confidence intervals. "aipw1" and "aipw2" refer to two variants of inference based on $\hat\tau_{\text{aipw}}$ from Strategy "aipw", using $\hat V_{\text{aipw}}$ and $\tilde V_{\text{aipw}}$ to estimate $\operatorname{var}(\hat\tau_{\text{aipw}})$, respectively. "cf1" and "cf2" refer to inference based on $\hat\tau_{\text{cf}}$ from Strategy "cf", with and without Bonferroni correction, respectively.

                                          OLS                                  Random forest
                      ipw    sm     aipw1  aipw2  all    cf1    cf2    aipw1  aipw2  all    cf1    cf2
  Mean squared error  0.023  0.020  0.002  0.002  0.002  0.002  0.002  0.003  0.003  0.002  0.002  0.002
  Coverage rate       0.963  0.962  0.996  0.945  0.946  0.998  0.948  0.994  0.965  0.687  1.000  0.965
  Average CI length   0.607  0.581  0.254  0.185  0.175  0.284  0.175  0.266  0.214  0.099  0.330  0.204

6.2 Application to sequential rerandomization

We now conduct a simulation study of sequential rerandomization. Assume that units arrive sequentially in blocks of size 8, with a total of $T = 200$ blocks, indexed by $t = 1, \ldots, T$. Within each block, we assign half of the units to treatment and the other half to control. Following Zhou et al. (2018), when randomizing units in the $t$-th block, we measure the covariate imbalance of each possible treatment allocation by the Mahalanobis distance between the covariate means of the treated and control groups across all units in the first $t$ blocks. We then implement the best-choice rerandomization within each block (Wang and Li, 2025), which randomly selects from the best 7 treatment assignments with the smallest Mahalanobis distances out of the total $\binom{8}{4} = 70$ possible assignments. In cases where some units receive only treatment or only control among the best 7 assignments, we iteratively expand the candidate set by one additional assignment at a time, each with the smallest Mahalanobis distance among the remaining, until every unit has a positive probability of receiving both treatment and control.

We generate potential outcomes and covariates as i.i.d. samples from the following model: $X \sim \mathcal{N}(0,1)$ and $Y(1) = Y(0) = 5X + \epsilon$, where $\epsilon \perp\!\!\!\perp X$ and $\epsilon \sim \mathcal{N}(0,1)$. Once generated, they are kept fixed, mimicking finite-population inference. We consider the sequential rerandomization design (SRD) described above and the classical completely randomized design (CRD), both of which assign half of the units to treatment and the remaining to control, and simulate $10^3$ treatment assignments from each design. Figure 2 shows the histogram of the difference-in-means estimator under the CRD, and that of the IPW estimator under the SRD, along with Gaussian approximations based on their sample averages and variances.

Figure 2: Histograms of the difference-in-means estimator and the IPW estimator under the completely randomized design and the sequential rerandomization design across 1,000 independent replications, along with densities from their Gaussian approximations.
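The best-choice assignment rule can be sketched compactly. The code below is our simplified illustration, not the paper's implementation: it uses a single covariate (so the Mahalanobis distance reduces to a squared standardized mean difference), fewer blocks for speed, and omits the candidate-set expansion step described above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, T = 8, 50                                  # block size; number of blocks (reduced)
X = rng.normal(0, 1, (T, n))                  # fixed covariates, one row per block
# All C(8, 4) = 70 half-half treatment allocations within a block.
allocs = [np.array([1 if i in c else 0 for i in range(n)])
          for c in itertools.combinations(range(n), n // 2)]

sum_diff = 0.0                                # running treated-minus-control covariate sum
Z = np.zeros((T, n), dtype=int)
for t in range(T):
    # Imbalance accumulated over blocks 1..t if block t received allocation z;
    # with one covariate this is proportional to the Mahalanobis criterion.
    scores = [(sum_diff + X[t] @ z - X[t] @ (1 - z)) ** 2 for z in allocs]
    best7 = np.argsort(scores)[:7]            # best-choice candidate set of size 7
    z = allocs[rng.choice(best7)]             # draw uniformly from the candidates
    Z[t] = z
    sum_diff += X[t] @ z - X[t] @ (1 - z)

# Final per-unit covariate imbalance; far smaller than under complete randomization.
print(abs(sum_diff) / (T * n))
```

At each step the rule steers the cumulative treated-minus-control covariate difference back toward zero, which is the mechanism behind the precision gain reported for the SRD below.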
F rom Figure 2 , the estimators under b oth designs are appro ximately un biased and Gaussian distributed, coherent with the theory for these tw o designs in Neyman ( 1923 ) and Section 5 . Moreov er, the estimator under the SRD is more precise than that under the CRD. Sp ecically , the ro ot mean squared errors (RMSEs) for the t wo estimators under SRD and CRD are 0 . 077 and 0 . 262 , resp ectively , indicating a 70 . 6% reduction in RMSE under SRD. W e then inv estigate the p erformance of our condence interv als from Section 5 . Across the 10 3 sim ulated assignmen ts under the SRD, the 95% condence interv al has 94 . 8% co verage rate, with an a v erage length of 0 . 296 . In con trast, Neyman ( 1923 )’s 95% condence in terv al under the CRD has 95 . 5% cov erage rate, with an av erage length of 1 . 03 . Both co verage rates are close to the nominal lev el, while condence in terv al under the SRD is on a verage m uch shorter, with a 71 . 3% reduction in av erage length. 6.3 Inuence of v anishing e t on inference W e illustrate b elow the inuence of e t on inference; c.f. the comments after Prop osition 1 and Remark 1 . Consider a tw o-armed bandit with T = 200 units, indexed by t = 1 , . . . , 200 , 30 and arms lab eled as z = 1 , 2 . W e generate p otential outcomes as Y t (1) ∼ N ( t, 1) and Y t (2) ∼ N (2 t, 1) for t = 1 , . . . , 200 , and assign treatment lev els by the following rule: (i) F or t = 1 , . . . , 50 , randomly assign unit t to arm 1 or 2 with equal probabilit y 0 . 5 . (ii) F or t = 51 , . . . , 200 , assign unit t to the arm with higher sample mean of the observed outcomes from units 1 to t − 1 with probability 1 − t − 1 / 2+ δ . This assignmen t mec hanism implies e t = min H t , z ∈{ 1 , 2 } e t ( z ) = t − 1 / 2+ δ , whic h v anishes asymptotically at rate − 1 / 2 + δ . Figure 3 illustrates the p erformance of the IPW estimator ˆ τ ipw ,C for estimating the a verage treatmen t eect τ = ¯ Y (2) − ¯ Y (1) at δ = 0 . 2 , 0 , − 0 . 
2 , − 0 . 4 . As δ decreases b elow 0, the distribution of ˆ τ ipw ,C − τ b ecomes increasingly skew ed, deviating from normality . The table b elo w rep orts the corresp onding empirical bias, sample skewness, and mean squared error of the p oint estimator, along with the cov erage rate and av erage length of the 95% condence interv al. Coherent with our theory and the discussion follo wing Prop osition 1 , inference based on ˆ τ ipw ,C and normal appro ximation is v alid when δ ≥ 0 , but gradually fails to ensure cov erage as δ decreases b elow 0. 7 Discussion W e establish the nite-p opulation, design-based theory for causal inference under adaptive designs that accommo date nonexc hangeable units, both noncon verging and v anishing treat- men t probabilities, and noncon verging, p ossibly blac k-b o x, outcome mo dels. Our frame- w ork encompasses widely used designs such as m ulti-armed bandit algorithms, cov ariate- adaptiv e randomization, and sequential rerandomization. As an application, we prop osed the adaptive cov ariate adjustment approach for analyzing ev en nonadaptive exp erimen ts, whic h oers m ultiple adv an tages ov er standard nonadaptive adjustment metho ds. 31 −400 −200 0 0.2 0 −0.2 −0.4 δ ( e t ) 0.2 0 − 0 . 2 − 0 . 4 ( e t = 1 /t 0 . 3 ) ( e t = 1 /t 0 . 5 ) ( e t = 1 /t 0 . 7 ) ( e t = 1 /t 0 . 9 ) Empirical bias − 1.051 − 1.020 − 2.282 0.342 Sample s k ewness − 0.112 − 0.312 − 0.517 − 1.096 Mean squa red error 575.447 957.284 2486.650 6088.563 Co verage rate 0.949 0.945 0.886 0.754 A v erage CI length 95.909 126.777 186.430 263.494 Figure 3: Violin plots of the distributions of ˆ τ ipw ,C − τ across 1,000 indep endent replications at δ = 0 . 2 , 0 , − 0 . 2 , − 0 . 4 . The adaptive reweigh ting strategy proposed b y Hadad et al. ( 2021 ) and Bibaut et al. 
(2021) reweights the units based on their propensity scores, and may yield biased and inconsistent estimates of finite-population average treatment effects when units are nonexchangeable. An alternative is to reweight units based on covariates to better align the study population with a target population of interest. Our framework accommodates this setting.

References

Agrawal, S. (2019). Recent advances in multiarmed bandits for sequential decision making. Operations Research & Management Science in the Age of Analytics, 167–188.

Aronow, P. M. and J. A. Middleton (2013). A class of unbiased estimators of the average treatment effect in randomized experiments. Journal of Causal Inference 1, 135–154.

Bhatt, D. L. and C. Mehta (2016). Adaptive designs for clinical trials. New England Journal of Medicine 375, 65–74.

Bibaut, A., A. Chambaz, M. Dimakopoulou, N. Kallus, and M. van der Laan (2021). Post-contextual-bandit inference.

Brown, B. M. (1971). Martingale central limit theorems. The Annals of Mathematical Statistics 42, 59–66.

Bubeck, S., N. Cesa-Bianchi, et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5, 1–122.

Bugni, F. A., I. A. Canay, and A. M. Shaikh (2018). Inference under covariate-adaptive randomization. Journal of the American Statistical Association 113, 1784–1796.

Cohen, P. L. and C. B. Fogarty (2024). No-harm calibration for generalized Oaxaca–Blinder estimators. Biometrika 111(1), 331–338.

Dimakopoulou, M., Z. Zhou, S. Athey, and G. Imbens (2017). Estimation considerations in contextual bandits. arXiv preprint arXiv:1711.07077.

Ding, P. (2024). A First Course in Causal Inference. Chapman and Hall/CRC.

Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika 58, 403–417.

Fisher, R. A. (1935). The Design of Experiments (1st ed.). Edinburgh, London: Oliver and Boyd.

Guo, K. and G. Basse (2023). The generalized Oaxaca–Blinder estimator. Journal of the American Statistical Association 118(541), 524–536.

Hadad, V., D. A. Hirshberg, R. Zhan, S. Wager, and S. Athey (2021). Confidence intervals for policy evaluation in adaptive experiments. Proceedings of the National Academy of Sciences 118(15), e2014602118.

Hadad, V., L. Rosenzweig, S. Athey, and D. Karlan (2021). Designing adaptive experiments. https://www.gsb.stanford.edu/faculty-research/publications/practitioners-guide-designing-adaptive-experiments.

Hall, P. and C. C. Heyde (2014). Martingale Limit Theory and Its Application. Academic Press.

Ham, D. W., I. Bojinov, M. Lindon, and M. Tingley (2023). Design-based inference for multi-arm bandits. arXiv preprint arXiv:2302.14136.

Imbens, G. W. and D. B. Rubin (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge: Cambridge University Press.

Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco: Holden-Day, Inc.

Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer.

Li, X. and P. Ding (2017). General forms of finite population central limit theorems with applications to causal inference. Journal of the American Statistical Association 112, 1759–1769.

Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. The Annals of Applied Statistics 7, 295–318.

Luedtke, A. R. and M. J. van der Laan (2016). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44(2), 713–742.

Melfi, V. F. and C. Page (2000). Estimation after adaptive allocation. Journal of Statistical Planning and Inference 87, 353–363.

Neyman, J. (1923).
On the application of probability theory to agricultural experiments. Statistical Science 5, 465–472.

Offer-Westort, M., A. Coppock, and D. P. Green (2021). Adaptive experimental design: Prospects and applications in political science. American Journal of Political Science 65, 826–844.

Pashley, N. E. and L. W. Miratrix (2021). Insights on variance estimation for blocked and matched pairs designs. Journal of Educational and Behavioral Statistics 46(3), 271–296.

Pocock, S. J. and R. Simon (1975). Sequential treatment assignment with balancing for prognostic factors in the controlled clinical trial. Biometrics 31, 103–115.

Qin, Y., Y. Li, W. Ma, and F. Hu (2018). Pairwise sequential randomization and its properties.

Qu, T., J. Du, and X. Li (2025). Randomization-based Z-estimation for evaluating average and individual treatment effects. Biometrika, asaf002.

Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89, 846–866.

Spall, J. C. (2003). Introduction to Stochastic Search and Optimization. John Wiley & Sons, Ltd.

van der Laan, M. J. (2008). The construction and analysis of adaptive group sequential designs.

Wager, S. (2024). Sequential validation of treatment heterogeneity. arXiv preprint arXiv:2405.05534.

Wager, S., W. F. Du, J. Taylor, and R. J. Tibshirani (2016). High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences 113, 12673–12678.

Wang, Y. and X. Li (2025). Asymptotic theory of the best-choice rerandomization using the Mahalanobis distance. Journal of Econometrics 251, 106049.

Wu, E. and J. A. Gagnon-Bartsch (2021). Design-based covariate adjustments in paired experiments. Journal of Educational and Behavioral Statistics 46, 109–132.

Ye, T., Y. Yi, and J. Shao (2022). Inference on the average treatment effect under minimization and other covariate-adaptive randomization methods. Biometrika 109, 33–47.

Zhou, Q., P. A. Ernst, K. L. Morgan, D. B. Rubin, and A. Zhang (2018). Sequential rerandomization. Biometrika 105, 745–752.

Supplementary Material

Section S1 provides additional results that complement the main paper. Section S2 provides the proofs of the results in Sections 3–4 of the main paper. Section S3 provides the proofs of the results in Section 5 of the main paper. Section S4 provides the proofs of the results in Section S1.

Notation and useful facts. Let $1(\cdot)$ denote the indicator function. Let $\mathcal N(\cdot,\cdot)$ denote the normal distribution. For a positive integer $m$, let $0_m$ denote the $m \times 1$ zero vector, $I_m$ the $m \times m$ identity matrix, and $\chi^2_{m,1-\alpha}$ the $(1-\alpha)$th quantile of the chi-squared distribution with $m$ degrees of freedom. We omit the subscript $m$ in $0_m$ and $I_m$ when the dimension is clear from the context. For a collection of numbers $\{a_z \in \mathbb R : z \in \mathcal Z\}$, where $\mathcal Z$ is the index set, let $\mathrm{diag}(a_z)_{z\in\mathcal Z}$ denote the diagonal matrix with $a_z$ on the diagonal. For a set of vectors $\{a_i \in \mathbb R^m : i = 1, \dots, n\}$, define its sample covariance matrix as $(n-1)^{-1}\sum_{i=1}^n (a_i - \bar a)(a_i - \bar a)^\top$, where $\bar a = n^{-1}\sum_{i=1}^n a_i$. For an $m \times n$ matrix $A = (a_{ij})_{m\times n}$, let $\|A\|_{\mathrm f} = (\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2)^{1/2}$ denote the Frobenius norm of $A$, and $\|A\|_2$ denote the spectral norm of $A$. For an $n \times n$ square matrix $B$, let $\lambda_{\max}(B)$ and $\lambda_{\min}(B)$ denote the largest and smallest eigenvalues of $B$, respectively. Standard results imply that the Frobenius and spectral norms are both sub-multiplicative with

$$1 \le \|A\|_{\mathrm f}/\|A\|_2 \le \sqrt{\mathrm{rank}(A)}, \qquad \|A\|_2^2 = \|A^\top A\|_2 = \|AA^\top\|_2 = \lambda_{\max}(A^\top A) = \lambda_{\max}(AA^\top). \quad (\text{S1})$$

Assume the probability measure induced by the treatment assignment mechanism throughout.
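As a quick numerical sanity check of the norm facts in (S1), the sketch below (illustrative only; numpy is assumed, and the random matrix is not from the paper) verifies the Frobenius/spectral norm bounds and the eigenvalue characterization of the spectral norm.

```python
import numpy as np

# Check (S1): 1 <= ||A||_F / ||A||_2 <= sqrt(rank(A)), and
# ||A||_2^2 = lambda_max(A^T A) = lambda_max(A A^T), for a random matrix A.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))          # a 4 x 6 matrix, full rank 4 almost surely

fro = np.linalg.norm(A, "fro")           # Frobenius norm
spec = np.linalg.norm(A, 2)              # spectral norm (largest singular value)
rank = np.linalg.matrix_rank(A)

assert 1.0 <= fro / spec <= np.sqrt(rank) + 1e-9
assert np.isclose(spec**2, np.linalg.eigvalsh(A.T @ A).max())
assert np.isclose(spec**2, np.linalg.eigvalsh(A @ A.T).max())
print(fro / spec, np.sqrt(rank))
```

The same inequalities are what later allow the proofs to pass from spectral-norm bounds to the Frobenius-norm quantity $\|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm f}$ appearing in Condition 3.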
Let $\xrightarrow{d}$ denote convergence in distribution, and let $o_P(1)$ denote a sequence converging to 0 in probability.

S1 Additional results

S1.1 Covariate adjustment using cross-fitting in Section 6.1

Recall that $\hat Y_{\mathrm{cf}}(z)$ denotes the estimator of $\bar Y(z)$ using cross-fitting in Strategy "cf" in Section 6.1. We construct it in the following steps:

- Split the units into $G$ folds, indexed by $g = 1, \dots, G$. Let $\mathcal T_g$ denote the index set of fold $g$.

- For fold $g$, estimate the fold average potential outcome $\bar Y^{[g]}(z) = |\mathcal T_g|^{-1}\sum_{t\in\mathcal T_g} Y_t(z)$ using
$$\hat Y^{[g]}(z) = |\mathcal T_{g,z}|^{-1}\sum_{t\in\mathcal T_{g,z}}\{Y_t - \hat m^{\mathrm{cf}}_t(z)\} + |\mathcal T_g|^{-1}\sum_{t\in\mathcal T_g} \hat m^{\mathrm{cf}}_t(z),$$
where $\mathcal T_{g,z} = \{t \in \mathcal T_g : Z_t = z\}$ denotes the index set of units receiving treatment $z$ in fold $g$ and $\hat m^{\mathrm{cf}}_t(z)$ is an estimator of $Y_t(z)$ based on units in the other $G-1$ folds.

- Estimate $\bar Y(z)$ by $\hat Y_{\mathrm{cf}}(z) = \sum_{g=1}^G w_g \hat Y^{[g]}(z)$, where $w_g = |\mathcal T_g|/T$ denotes the proportion of units in fold $g$. This weighting scheme is motivated by
$$\bar Y(z) = T^{-1}\sum_{t=1}^T Y_t(z) = T^{-1}\sum_{g=1}^G \Big\{\sum_{t\in\mathcal T_g} Y_t(z)\Big\} = \sum_{g=1}^G w_g \bar Y^{[g]}(z).$$

S1.2 Possible biases of adaptive reweighting

Hadad et al. (2021) and Bibaut et al. (2021) proposed an adaptive reweighting strategy for causal inference under adaptive designs with exchangeable units. We illustrate below the possible bias of this strategy when applied to nonexchangeable units.

Consider a two-armed bandit with $T = 200$ units, indexed by $t = 1, \dots, 200$, and arms labeled as $z = 1, 2$. We generate potential outcomes as $Y_t(1) \sim \mathcal N(t, 1)$ and $Y_t(2) \sim \mathcal N(2t, 1)$ for $t = 1, \dots, 200$, and assign treatment levels using the following $\epsilon$-greedy rule:

(i) For $t = 1, \dots, 50$, assign unit $t$ to arm 1 or 2 with equal probability 0.5;

(ii) For $t = 51, \dots, 200$, assign unit $t$ to the arm with the higher sample mean of observed outcomes among units 1 to $t-1$ with probability 0.8.

We estimate $\bar Y(1)$, $\bar Y(2)$, and $\tau = \bar Y(2) - \bar Y(1)$ using the IPW estimators defined in (2) and the adaptively reweighted estimators proposed by Hadad et al. (2021). Figure S1 shows the distributions of their deviations from the true values across 1,000 independent replications, with empirical biases summarized in the table below. The IPW estimators in (2) appear empirically unbiased, whereas the adaptively reweighted estimators show noticeable biases.

To see why, note that the data-generating process implies upward trends in $Y_t(1)$, $Y_t(2)$, and the individual treatment effect $\tau_t = Y_t(2) - Y_t(1)$, with arm 2 being the "better" arm. Accordingly, after time $t = 50$, most units are assigned to arm 2 with probability 0.8, which is higher than the initial 0.5. The propensity-score-based reweighting strategy in Hadad et al. (2021) thus assigns greater weights to later units in arm 2, leading to an overestimation of $\bar Y(2)$. Conversely, after time $t = 50$, most units are assigned to arm 1 with probability 0.2, lower than the initial 0.5. The reweighting strategy in Hadad et al. (2021) thus assigns greater weights to earlier units in arm 1, leading to an underestimation of $\bar Y(1)$. Together, these two biases lead to an overestimation of the average treatment effect $\tau$.

S1.3 Sufficient conditions for Condition 3

Assume the adaptive randomization in Definition 1. Proposition S1 below generalizes the sufficient conditions in Proposition 3 by allowing for vanishing $e_t(z)$. Recall that
$$e_t = \min_{\mathcal H_t,\, z\in\mathcal Z} e_t(z), \qquad v_t = \max_{z\in\mathcal Z}\mathrm{var}\{e_t(z)^{-1}\}, \qquad \omega_t = \max_{z\in\mathcal Z}\mathrm{var}\{\hat m_t(z)\}.$$

Proposition S1. Assume the adaptive randomization in Definition 1.
If, as $T \to \infty$, (a) $\lambda_{\min}(V_{\text{aipw}})$ is uniformly bounded away from 0, and (b) both $Y_t(z)$ and $\hat m_t(z)$ are uniformly bounded, then Condition 3(i) is ensured by $\|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm f} = o_P(1)$, and holds if any of the following conditions is satisfied:

(i) Vanishing $\mathrm{var}\{e_t(z)^{-1}\}$ and $\mathrm{var}\{\hat m_t(z)\}$: $T^{-1}\sum_{t=1}^T v_t = o(1)$; $T^{-1}\sum_{t=1}^T \omega_t \cdot e_t^{-2} = o(1)$.

(ii) Vanishing $\mathrm{var}\{e_t(z)^{-1}\}$ and limited-range dependence of $\hat m_t(z)$: $T^{-1}\sum_{t=1}^T (\max_{1\le s\le t} e_s^{-1})\cdot\sqrt{v_t} = o(1)$; there exist constants $\beta \in [0,1)$ and $c_0 > 0$ such that

- $(T^{-1}\sum_{t=1}^T e_t^{-1})\cdot(\max_{t=1,\dots,T}\omega_t\cdot e_t^{-2}) = o(T^{1-\beta})$;
- $(T^{-1}\sum_{t=1}^T e_t^{-1})\cdot(\max_{t=1,\dots,T} v_t) = o(T^{1-\beta})$;
- after some fixed time point $T_0$, $\{\hat m_t(z) : z\in\mathcal Z\}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.

(iii) Vanishing $\mathrm{var}\{\hat m_t(z)\}$ and limited-range dependence of $e_t(z)$: $T^{-1}\sum_{t=1}^T \omega_t = o(1)$; $T^{-1}\sum_{t=1}^T (\max_{1\le s\le t} e_s^{-1})\cdot\sqrt{\omega_t}\cdot e_t^{-1} = o(1)$; there exist constants $\beta \in [0,1)$ and $c_0 > 0$ such that

- $\max_{t=1,\dots,T}\omega_t\cdot e_t^{-2} = o(T^{1-\beta})$;
- $\max_{t=1,\dots,T} v_t = o(T^{1-\beta})$;
- after some fixed time point $T_0$, $\{e_t(z) : z\in\mathcal Z\}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.

(iv) Limited-range dependence of $e_t(z)$ and $\hat m_t(z)$: there exist constants $\beta \in [0,1)$ and $c_0 > 0$ such that

- $\max_{t=1,\dots,T}\omega_t\cdot e_t^{-2} = o(T^{1-\beta})$;
- $\max_{t=1,\dots,T} v_t = o(T^{1-\beta})$;
- after some fixed time point $T_0$, $\{e_t(z), \hat m_t(z) : z\in\mathcal Z\}$ depend only on information from the $\min\{t-1, c_0 t^\beta\}$ most recent units.

Figure S1: Violin plots (omitted here) of the distributions of the deviations of the estimators from the true values $(\bar Y(1), \bar Y(2), \tau)$ over 1,000 independent replications. The suffix ".ipw" refers to the standard IPW estimators in (2). The suffix ".aw" refers to the adaptively reweighted estimators proposed by Hadad et al. (2021). The empirical biases are:

                    Y1.ipw   Y2.ipw   tau.ipw    Y1.aw    Y2.aw   tau.aw
    Empirical bias   0.543   -0.286    -0.830   -9.034    7.991   17.025

Proposition S1 clarifies the rate at which $e_t(z)$ can diminish to 0 under Condition 3. If further $e_t(z)$ is uniformly bounded away from 0, then the four conditions in Proposition S1 are guaranteed by the corresponding conditions in Proposition 3.

S2 Proofs of the results in Sections 3–4

We provide below the proofs of the results on the augmented inverse-propensity-weighted (AIPW) estimator in Section 4 of the main paper. The results on the inverse-propensity-weighted (IPW) estimator in Section 3 follow as a special case with $\hat m_t(z) = 0$ for all $t$ and $z$.

Assume the adaptive randomization in Definition 1 throughout this section with $e_t(z) = P(Z_t = z \mid \mathcal H_t)$. Let
$$r_t(z) = \frac{1}{e_t(z)} - 1, \qquad w_t(z) = \frac{1(Z_t = z)}{e_t(z)} - 1 = \begin{cases} r_t(z) & \text{if } Z_t = z, \\ -1 & \text{if } Z_t \ne z, \end{cases} \quad (\text{S2})$$
with
$$|w_t(z)|,\; r_t(z),\; |r_t(z)-1| \;\le\; \max\{r_t(z), 1\} \;\le\; e_t(z)^{-1} \;\le\; e_t^{-1},$$
$$r_t(z)^2\cdot e_t(z) + 1 - e_t(z) = r_t(z)^2\cdot e_t(z) + r_t(z)\cdot e_t(z) = r_t(z)\cdot e_t(z)\cdot\{r_t(z)+1\} = r_t(z). \quad (\text{S3})$$

Recall that $A_t(z) = Y_t(z) - \hat m_t(z)$ denotes the residual from the adaptive outcome regression. Let
$$D_t(z) = \frac{1}{\sqrt T}\big\{\hat Y_{t,\text{aipw}}(z) - Y_t(z)\big\} = \frac{1}{\sqrt T}\, w_t(z)\cdot A_t(z), \quad (\text{S4})$$
where the second equality follows, by (1), (3), and (S2), from
$$\hat Y_{t,\text{aipw}}(z) = \frac{1(Z_t = z)}{e_t(z)} Y_t + \Big\{1 - \frac{1(Z_t = z)}{e_t(z)}\Big\}\hat m_t(z) = \{w_t(z)+1\} Y_t(z) - w_t(z)\hat m_t(z) = Y_t(z) + w_t(z)\cdot A_t(z).$$
Let
$$D_t = (D_t(1), \dots, D_t(K))^\top = \frac{1}{\sqrt T}\big(\hat Y_{t,\text{aipw}} - Y_t\big), \qquad \bar D = T^{-1}\sum_{t=1}^T D_t = (\bar D(1), \dots, \bar D(K))^\top, \quad \text{where } \bar D(z) = T^{-1}\sum_{t=1}^T D_t(z), \quad (\text{S5})$$
to write
$$\hat Y_{t,\text{aipw}} - Y_t = \sqrt T\, D_t, \qquad \hat Y_{\text{aipw}} - \bar Y = T^{-1}\sum_{t=1}^T\big(\hat Y_{t,\text{aipw}} - Y_t\big) = \frac{1}{\sqrt T}\sum_{t=1}^T D_t = \sqrt T\,\bar D. \quad (\text{S6})$$

Recall that
$$\check V_{\text{aipw}} = \mathrm{diag}\Big[T^{-1}\sum_{t=1}^T \frac{A_t(z)^2}{e_t(z)}\Big]_{z\in\mathcal Z} - T^{-1}\sum_{t=1}^T A_t A_t^\top,$$
with $E(\check V_{\text{aipw}}) = V_{\text{aipw}}$. Let $\check V_{\text{aipw},zz'}$ denote the $(z,z')$th element of $\check V_{\text{aipw}}$.

S2.1 Lemmas

S2.1.1 Lemmas for the finite-sample exact results

Lemma S1. Let $\{\xi_t \in \mathbb R^m : 1 \le t \le T\}$ be a square-integrable martingale difference sequence with respect to the filtration $\{\mathcal F_t : t = 1, \dots, T\}$. Let
$$V = \mathrm{cov}\Big(\sum_{t=1}^T \xi_t\Big), \qquad \check V = \sum_{t=1}^T \mathrm{cov}(\xi_t \mid \mathcal F_{t-1}), \qquad \hat V = T(T-1)^{-1}\sum_{t=1}^T (\xi_t - \bar\xi)(\xi_t - \bar\xi)^\top,$$
where $\mathcal F_0 = \emptyset$ and $\bar\xi = T^{-1}\sum_{t=1}^T \xi_t$. For $t \ne t' \in \{1, \dots, T\}$,

(i) $E(\xi_t \mid \mathcal F_{t-1}) = 0_m$, $\mathrm{cov}(\xi_t \mid \mathcal F_{t-1}) = E(\xi_t\xi_t^\top \mid \mathcal F_{t-1})$;

(ii) $E(\xi_t) = 0_m$, $\mathrm{cov}(\xi_t) = E(\xi_t\xi_t^\top) = E\{\mathrm{cov}(\xi_t \mid \mathcal F_{t-1})\}$, $\mathrm{cov}(\xi_t, \xi_{t'}) = 0_{m\times m}$;

(iii) $V = \sum_{t=1}^T \mathrm{cov}(\xi_t) = E(\check V) = E(\hat V)$, where $E(\check V) = \sum_{t=1}^T E\{\mathrm{cov}(\xi_t \mid \mathcal F_{t-1})\}$.

Proof of Lemma S1. Lemma S1(i) follows from the definition of a martingale difference sequence with $E(\xi_t \mid \mathcal F_{t-1}) = 0_m$. Lemma S1(ii) follows from
$$E(\xi_t) = E\{E(\xi_t \mid \mathcal F_{t-1})\} = 0_m, \qquad \mathrm{cov}(\xi_t) = E(\xi_t\xi_t^\top) = E\big\{E(\xi_t\xi_t^\top \mid \mathcal F_{t-1})\big\} = E\{\mathrm{cov}(\xi_t \mid \mathcal F_{t-1})\},$$
and, for $t' < t$,
$$\mathrm{cov}(\xi_t, \xi_{t'}) = E(\xi_t\xi_{t'}^\top) = E\big\{E(\xi_t\xi_{t'}^\top \mid \mathcal F_{t-1})\big\} = E\big\{E(\xi_t \mid \mathcal F_{t-1})\cdot\xi_{t'}^\top\big\} = 0_{m\times m}. \quad (\text{S7})$$
Equation (S7) implies the first equality regarding $V$ in Lemma S1(iii) as follows:
$$V = \mathrm{cov}\Big(\sum_{t=1}^T \xi_t\Big) = \sum_{t=1}^T \mathrm{cov}(\xi_t) + \sum_{t\ne t'}\mathrm{cov}(\xi_t, \xi_{t'}) = \sum_{t=1}^T \mathrm{cov}(\xi_t).$$
The remaining equalities in Lemma S1(iii) then follow from Lemma S1(ii) and the definitions of $\check V$ and $\hat V$.

Lemma S2. For $t = 1, \dots, T$ and $z \ne z' \in \mathcal Z$,
$$E\{w_t(z) \mid \mathcal H_t\} = 0, \qquad E\{w_t(z)^2 \mid \mathcal H_t\} = r_t(z), \qquad E\{w_t(z)\,w_t(z') \mid \mathcal H_t\} = -1.$$

Lemma S3. Assume the adaptive randomization in Definition 1. Then $\{D_t : t = 1, \dots, T\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal F_t = \mathcal H_{t+1} : 1 \le t \le T\}$, with

(i) $E(D_t \mid \mathcal H_t) = 0_K$ and $\sum_{t=1}^T E(D_t D_t^\top \mid \mathcal H_t) = \check V_{\text{aipw}}$, so that $\check V_{\text{aipw},zz'} = \sum_{t=1}^T E\{D_t(z)D_t(z') \mid \mathcal H_t\}$;

(ii) $\mathrm{var}\{D_t(z) \mid \mathcal H_t\} = T^{-1} r_t(z) A_t(z)^2$;

(iii) $\mathrm{cov}\big(\sum_{t=1}^T D_t\big) = V_{\text{aipw}}$.

S2.1.2 Lemma for the asymptotic results

Lemma S4 (martingale central limit theorem; Brown 1971; Hall and Heyde 2014). For each $T$, let $\{\xi_{Tt} : 1 \le t \le T\}$ be a square-integrable martingale difference sequence with respect to a filtration $\{\mathcal F_{Tt}\}$. Let $\check V_T = \sum_{t=1}^T \mathrm{var}(\xi_{Tt} \mid \mathcal F_{T,t-1})$ and $V_T = E(\check V_T)$. If $\check V_T/V_T = 1 + o_P(1)$, $V_T^{-1}\sum_{t=1}^T E\{\xi_{Tt}^2\cdot 1(|\xi_{Tt}| \ge \epsilon V_T^{1/2})\} = o_P(1)$ for all $\epsilon > 0$, and $V_T > 0$, then $V_T^{-1/2}\sum_{t=1}^T \xi_{Tt} \xrightarrow{d} \mathcal N(0, 1)$.

S2.1.3 Lemmas for variance estimation

A useful fact.
Equation (S6) implies
$$\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}} = \big(\sqrt T\, D_t + Y_t\big) - \big(\sqrt T\,\bar D + \bar Y\big) = \sqrt T(D_t - \bar D) + Y_t - \bar Y,$$
so that
$$\sum_{t=1}^T\big(\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}}\big)\big(\hat Y_{t,\text{aipw}} - \hat Y_{\text{aipw}}\big)^\top = \sum_{t=1}^T\big\{\sqrt T(D_t - \bar D) + Y_t - \bar Y\big\}\big\{\sqrt T(D_t - \bar D)^\top + (Y_t - \bar Y)^\top\big\}$$
$$= T\sum_{t=1}^T (D_t - \bar D)(D_t - \bar D)^\top + \sqrt T\sum_{t=1}^T (D_t - \bar D)(Y_t - \bar Y)^\top + \sqrt T\sum_{t=1}^T (Y_t - \bar Y)(D_t - \bar D)^\top + \sum_{t=1}^T (Y_t - \bar Y)(Y_t - \bar Y)^\top$$
$$= T\sum_{t=1}^T (D_t - \bar D)(D_t - \bar D)^\top + \sqrt T\sum_{t=1}^T D_t(Y_t - \bar Y)^\top + \sqrt T\sum_{t=1}^T (Y_t - \bar Y)D_t^\top + (T-1)S. \quad (\text{S10})$$

Lemma S5 (Chebyshev's inequality). If $(X_n)$ is a stochastic sequence such that each element has finite variance, then $X_n - E(X_n) = \sqrt{\mathrm{var}(X_n)}\cdot O_P(1)$.

Lemma S6. For $t = 1, \dots, T$ and $z \ne z' \in \mathcal Z$,
$$\mathrm{var}\{w_t(z)^2 \mid \mathcal H_t\} \le e_t^{-3}, \qquad \mathrm{var}\{w_t(z)\,w_t(z') \mid \mathcal H_t\} \le 6 e_t^{-1}.$$

Proof of Lemma S6. Lemma S2 ensures $E\{w_t(z)^2 \mid \mathcal H_t\} = r_t(z)$, so that
$$\mathrm{var}\{w_t(z)^2 \mid \mathcal H_t\} = E\big[\{w_t(z)^2 - r_t(z)\}^2 \mid \mathcal H_t\big]$$
$$= E\big[\{w_t(z)^2 - r_t(z)\}^2 \mid Z_t = z, \mathcal H_t\big]\cdot P(Z_t = z \mid \mathcal H_t) + E\big[\{w_t(z)^2 - r_t(z)\}^2 \mid Z_t \ne z, \mathcal H_t\big]\cdot P(Z_t \ne z \mid \mathcal H_t)$$
$$= \{r_t(z)^2 - r_t(z)\}^2\cdot e_t(z) + \{1 - r_t(z)\}^2\cdot\{1 - e_t(z)\} \qquad \text{by (S2)}$$
$$= \{r_t(z) - 1\}^2\big\{r_t(z)^2\cdot e_t(z) + 1 - e_t(z)\big\} = \{r_t(z) - 1\}^2\, r_t(z) \le e_t^{-3} \qquad \text{by (S3)}.$$
Similarly, Lemma S2 ensures $E\{w_t(z)\,w_t(z') \mid \mathcal H_t\} = -1$ for $z \ne z'$, so that
$$\mathrm{var}\{w_t(z)\,w_t(z') \mid \mathcal H_t\} = E\big[\{w_t(z)\,w_t(z') + 1\}^2 \mid \mathcal H_t\big]$$
$$= E\big[\{w_t(z)\,w_t(z') + 1\}^2 \mid Z_t = z, \mathcal H_t\big]\cdot P(Z_t = z \mid \mathcal H_t) + E\big[\{w_t(z)\,w_t(z') + 1\}^2 \mid Z_t = z', \mathcal H_t\big]\cdot P(Z_t = z' \mid \mathcal H_t)$$
$$\quad + E\big[\{w_t(z)\,w_t(z') + 1\}^2 \mid Z_t \ne z, z', \mathcal H_t\big]\cdot P(Z_t \ne z, z' \mid \mathcal H_t)$$
$$= \{-r_t(z) + 1\}^2\cdot e_t(z) + \{-r_t(z') + 1\}^2\cdot e_t(z') + 2^2\cdot\{1 - e_t(z) - e_t(z')\} \qquad \text{by (S2)}$$
$$\le e_t(z)^{-1} + e_t(z')^{-1} + 4 \le 6 e_t^{-1} \qquad \text{by (S3)}.$$

Lemma S7. For $z, z' \in \mathcal Z$, we have
$$\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'} = \Big\{\frac{1}{T^2}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^3}\Big\}^{1/2}\cdot O_P(1).$$

Proof of Lemma S7. Lemma S3(i) ensures $\check V_{\text{aipw},zz'} = \sum_{t=1}^T E\{D_t(z)D_t(z') \mid \mathcal H_t\}$, so that
$$\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'} = \sum_{t=1}^T\Big[D_t(z)D_t(z') - E\{D_t(z)D_t(z') \mid \mathcal H_t\}\Big] = \sum_{t=1}^T \delta_t, \quad (\text{S11})$$
where $\delta_t = D_t(z)D_t(z') - E\{D_t(z)D_t(z') \mid \mathcal H_t\}$ with $E(\delta_t \mid \mathcal H_t) = 0$. Therefore, $\{\delta_t : t = 1, \dots, T\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal F_t = \mathcal H_{t+1} : 1 \le t \le T\}$ with
$$E\Big(\sum_{t=1}^T \delta_t\Big) = 0, \qquad \mathrm{var}\Big(\sum_{t=1}^T \delta_t\Big) = \sum_{t=1}^T \mathrm{var}(\delta_t) = \sum_{t=1}^T E\{\mathrm{var}(\delta_t \mid \mathcal H_t)\} \quad (\text{S12})$$
by Lemma S1, where, given $D_t(z) = T^{-1/2}w_t(z)\cdot A_t(z)$ from (S4),
$$\mathrm{var}(\delta_t \mid \mathcal H_t) = \mathrm{var}\{D_t(z)D_t(z') \mid \mathcal H_t\} = T^{-2}A_t(z)^2 A_t(z')^2\cdot\mathrm{var}\{w_t(z)\cdot w_t(z') \mid \mathcal H_t\} \le T^{-2}(L_t+M_t)^4\cdot 6e_t^{-3} \quad (\text{S13})$$
by Lemma S6. This, together with (S11)–(S12), ensures
$$E\Big\{\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'}\Big\} = 0, \qquad \mathrm{var}\Big\{\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'}\Big\} \le \frac{6}{T^2}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^3}.$$
The result then follows from Lemma S5 (Chebyshev's inequality).
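The conditional moment identities for the weights $w_t(z)$ invoked throughout these proofs — $E\{w_t(z)\mid\mathcal H_t\} = 0$, $E\{w_t(z)^2\mid\mathcal H_t\} = r_t(z)$, and $E\{w_t(z)w_t(z')\mid\mathcal H_t\} = -1$ — can be spot-checked by simulation. The sketch below (illustrative only; the three-arm propensity vector is an assumed example, not from the paper, and numpy is assumed) fixes the history and draws a single assignment many times.

```python
import numpy as np

# Monte Carlo check of the conditional moments of w(z) = 1(Z = z)/e(z) - 1
# for a fixed propensity vector e: E{w(z)} = 0, E{w(z)^2} = r(z) = 1/e(z) - 1,
# and E{w(z) w(z')} = -1 for z != z'.
rng = np.random.default_rng(1)
e = np.array([0.2, 0.3, 0.5])      # assumed propensities over K = 3 arms
n = 1_000_000
Z = rng.choice(3, size=n, p=e)
w = (Z[:, None] == np.arange(3)).astype(float) / e - 1.0   # n x K matrix of w(z)

m1 = w.mean(axis=0)                 # should be close to 0 for every arm
m2 = (w**2).mean(axis=0)            # should be close to r(z) = 1/e - 1
cross = (w[:, 0] * w[:, 1]).mean()  # should be close to -1

print(m1, m2, cross)
```

These are exactly the identities that make $\{D_t\}$ a martingale difference sequence and give the bounds in Lemma S6.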
Lemma S8. For $z, z' \in \mathcal Z$, we have
$$\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\} = \frac{1}{\sqrt T}\Big\{\sum_{t=1}^T \frac{L_t^2(L_t+M_t)^2}{e_t} + \max_{z\in\mathcal Z}\bar Y^2(z)\cdot\sum_{t=1}^T \frac{(L_t+M_t)^2}{e_t}\Big\}^{1/2}\cdot O_P(1).$$

Proof of Lemma S8. Let $\delta_t = D_t(z)\{Y_t(z') - \bar Y(z')\}$ with $E(\delta_t \mid \mathcal H_t) = \{Y_t(z') - \bar Y(z')\}\cdot E\{D_t(z) \mid \mathcal H_t\} = 0$ by Lemma S3. Therefore, $\{\delta_t : t = 1, \dots, T\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal F_t = \mathcal H_{t+1} : 1 \le t \le T\}$, and it follows from Lemma S1 that
$$E\Big(\sum_{t=1}^T \delta_t\Big) = 0, \qquad \mathrm{var}\Big(\sum_{t=1}^T \delta_t\Big) = \sum_{t=1}^T E\{\mathrm{var}(\delta_t \mid \mathcal H_t)\}, \quad (\text{S14})$$
where, by Lemma S3 and (S3),
$$\mathrm{var}(\delta_t \mid \mathcal H_t) = \{Y_t(z') - \bar Y(z')\}^2\cdot\mathrm{var}\{D_t(z) \mid \mathcal H_t\} = T^{-1}\{Y_t(z') - \bar Y(z')\}^2\cdot r_t(z)A_t(z)^2$$
$$\le T^{-1}\cdot 2\{Y_t^2(z') + \bar Y^2(z')\}\cdot e_t^{-1}(L_t+M_t)^2 \le 2T^{-1}\{L_t^2 + \max_{z\in\mathcal Z}\bar Y^2(z)\}\cdot e_t^{-1}(L_t+M_t)^2. \quad (\text{S15})$$
Plugging (S15) into (S14) ensures
$$E\Big[\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\}\Big] = 0, \qquad \mathrm{var}\Big[\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\}\Big] \le \frac{2}{T}\Big\{\sum_{t=1}^T \frac{L_t^2(L_t+M_t)^2}{e_t} + \max_{z\in\mathcal Z}\bar Y^2(z)\cdot\sum_{t=1}^T \frac{(L_t+M_t)^2}{e_t}\Big\}.$$
The result then follows from Lemma S5.

Lemma S9. Assume the adaptive randomization in Definition 1. Then
$$\hat V_{\text{aipw}} - \frac{T}{T-1}V_{\text{aipw}} - S = \Big\{\Big(\frac{\psi}{T}\Big)^{1/2} + \|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm f}\Big\}\cdot O_P(1), \qquad \text{where } \psi = T^{-1}\sum_{t=1}^T (L_t+M_t)^4/e_t^3.$$

Proof of Lemma S9. Let $S_{zz'} = (T-1)^{-1}\sum_{t=1}^T \{Y_t(z) - \bar Y(z)\}\{Y_t(z') - \bar Y(z')\}$ denote the $(z,z')$th element of $S$. Equation (S10) implies
$$(T-1)\hat V_{\text{aipw},zz'} = T\sum_{t=1}^T D_t(z)D_t(z') - T^2\bar D(z)\bar D(z') + \sqrt T\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\} + \sqrt T\sum_{t=1}^T D_t(z')\{Y_t(z) - \bar Y(z)\} + (T-1)S_{zz'}.$$
This ensures
$$(T-1)\hat V_{\text{aipw},zz'} - TV_{\text{aipw},zz'} - (T-1)S_{zz'} = T\Big\{\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'}\Big\} + T\big\{\check V_{\text{aipw},zz'} - V_{\text{aipw},zz'}\big\} - T^2\bar D(z)\bar D(z')$$
$$+ \sqrt T\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\} + \sqrt T\sum_{t=1}^T D_t(z')\{Y_t(z) - \bar Y(z)\},$$
so that
$$\frac{T-1}{T}\Big\{\hat V_{\text{aipw},zz'} - \frac{T}{T-1}V_{\text{aipw},zz'} - S_{zz'}\Big\} = \Big\{\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'}\Big\} + \big\{\check V_{\text{aipw},zz'} - V_{\text{aipw},zz'}\big\} - T\bar D(z)\bar D(z')$$
$$+ \frac{1}{\sqrt T}\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\} + \frac{1}{\sqrt T}\sum_{t=1}^T D_t(z')\{Y_t(z) - \bar Y(z)\}. \quad (\text{S16})$$

Below we bound each of the terms on the right-hand side of (S16). Some useful facts are
$$T^{-1}\sum_{t=1}^T \frac{(L_t+M_t)^2}{e_t} \le \Big\{T^{-1}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^2}\Big\}^{1/2} \le \Big\{T^{-1}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^3}\Big\}^{1/2} = \psi^{1/2},$$
$$T^{-1}\sum_{t=1}^T \frac{L_t^2(L_t+M_t)^2}{e_t} \le T^{-1}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^3} = \psi, \quad (\text{S17})$$
$$|\bar Y(z)| \le T^{-1}\sum_{t=1}^T (L_t+M_t) \le \Big\{T^{-1}\sum_{t=1}^T (L_t+M_t)^4\Big\}^{1/4} \le \Big\{T^{-1}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^3}\Big\}^{1/4} = \psi^{1/4},$$
so that
$$T^{-1}\max_{z\in\mathcal Z}|\bar Y(z)|\cdot\Big\{\sum_{t=1}^T \frac{(L_t+M_t)^2}{e_t}\Big\}^{1/2} \le T^{-1}\psi^{1/4}\cdot\big(T\psi^{1/2}\big)^{1/2} = \Big(\frac{\psi}{T}\Big)^{1/2} \qquad \text{by (S17)}.$$

First term on the right-hand side of (S16). Lemma S7 ensures
$$\sum_{t=1}^T D_t(z)D_t(z') - \check V_{\text{aipw},zz'} = \Big\{\frac{1}{T^2}\sum_{t=1}^T \frac{(L_t+M_t)^4}{e_t^3}\Big\}^{1/2}\cdot O_P(1) = \Big(\frac{\psi}{T}\Big)^{1/2}\cdot O_P(1).$$

Second term on the right-hand side of (S16). The definition of the Frobenius norm ensures $|\check V_{\text{aipw},zz'} - V_{\text{aipw},zz'}| \le \|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm f}$.

Third term on the right-hand side of (S16). Equation (S2) ensures $r_t(z) < e_t(z)^{-1}$, so that
$$V_{\text{aipw},zz} = T^{-1}\sum_{t=1}^T E\{r_t(z)A_t(z)^2\} \le T^{-1}\sum_{t=1}^T E\{e_t(z)^{-1}A_t(z)^2\} \le T^{-1}\sum_{t=1}^T \frac{(L_t+M_t)^2}{e_t} \le \psi^{1/2} \qquad \text{by (S17)}.$$
This, together with Lemmas S3 and S5, ensures
$$T\bar D(z)\bar D(z') = T^{-1}\cdot\sqrt{\mathrm{var}\{T\bar D(z)\}}\cdot\sqrt{\mathrm{var}\{T\bar D(z')\}}\cdot O_P(1) = T^{-1}\cdot V_{\text{aipw},zz}^{1/2}\cdot V_{\text{aipw},z'z'}^{1/2}\cdot O_P(1) = T^{-1}\psi^{1/2}\cdot O_P(1).$$

Fourth and fifth terms on the right-hand side of (S16). Lemma S8 ensures
$$\frac{1}{\sqrt T}\sum_{t=1}^T D_t(z)\{Y_t(z') - \bar Y(z')\} = T^{-1}\Big\{\sqrt{\sum_{t=1}^T \frac{L_t^2(L_t+M_t)^2}{e_t}} + \max_{z\in\mathcal Z}|\bar Y(z)|\cdot\sqrt{\sum_{t=1}^T \frac{(L_t+M_t)^2}{e_t}}\Big\}\cdot O_P(1)$$
$$= T^{-1}\Big\{(T\psi)^{1/2} + T\Big(\frac{\psi}{T}\Big)^{1/2}\Big\}\cdot O_P(1) = \Big(\frac{\psi}{T}\Big)^{1/2}\cdot O_P(1) \qquad \text{by (S17)},$$
and, by symmetry, $T^{-1/2}\sum_{t=1}^T D_t(z')\{Y_t(z) - \bar Y(z)\} = (\psi/T)^{1/2}\cdot O_P(1)$.

Plugging these bounds into (S16) ensures that for all $z, z' \in \mathcal Z$,
$$\hat V_{\text{aipw},zz'} - \frac{T}{T-1}V_{\text{aipw},zz'} - S_{zz'} = \Big\{\Big(\frac{\psi}{T}\Big)^{1/2} + \|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm f} + \frac{\psi^{1/2}}{T} + \Big(\frac{\psi}{T}\Big)^{1/2}\Big\}\cdot O_P(1) = \Big\{\Big(\frac{\psi}{T}\Big)^{1/2} + \|\check V_{\text{aipw}} - V_{\text{aipw}}\|_{\mathrm f}\Big\}\cdot O_P(1).$$

S2.2 Proofs of the finite-sample results

Proof of Theorem 4(i). Recall from (S6) that $\hat Y_{\text{aipw}} - \bar Y = T^{-1/2}\sum_{t=1}^T D_t$. This, together with Lemma S3, ensures
$$E(\hat Y_{\text{aipw}} - \bar Y) = \frac{1}{\sqrt T}E\Big(\sum_{t=1}^T D_t\Big) = 0, \qquad \mathrm{cov}(\hat Y_{\text{aipw}} - \bar Y) = T^{-1}\mathrm{cov}\Big(\sum_{t=1}^T D_t\Big) = T^{-1}V_{\text{aipw}}.$$

Proof of Proposition 4(i), finite-sample exact expectation. Lemma S3 ensures that $\{D_t : t = 1, \dots, T\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal F_t = \mathcal H_{t+1} : 1 \le t \le T\}$. It then follows from Lemma S1 that
$$E(D_t) = 0, \qquad E\Big\{\frac{T}{T-1}\sum_{t=1}^T (D_t - \bar D)(D_t - \bar D)^\top\Big\} = \mathrm{cov}\Big(\sum_{t=1}^T D_t\Big) = V_{\text{aipw}}. \quad (\text{S18})$$
This, together with (S10), ensures
$$E\big\{(T-1)\hat V_{\text{aipw}}\big\} = E\Big\{T\sum_{t=1}^T (D_t - \bar D)(D_t - \bar D)^\top\Big\} + (T-1)S = (T-1)V_{\text{aipw}} + (T-1)S,$$
so that $E(\hat V_{\text{aipw}}) = V_{\text{aipw}} + S$.
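The identity $E(\hat V_{\text{aipw}}) = V_{\text{aipw}} + S$ can be illustrated numerically. The sketch below (illustrative only, under simplifying assumptions not from the paper: two arms, constant propensity $e_t(z) = 1/2$, and $\hat m_t = 0$, so the AIPW estimator reduces to IPW) averages the sample covariance of the unit-level estimates over many randomizations with the potential outcomes held fixed.

```python
import numpy as np

# Monte Carlo check of E(V_hat) = V_aipw + S in the simplest special case:
# two arms, e_t(z) = 1/2, m_hat = 0.  Then Yhat_t(z) = 1(Z_t = z) Y_t(z)/0.5 and
# V_aipw = T^{-1} sum_t [[Y_t(1)^2, -Y_t(1)Y_t(2)], [-Y_t(1)Y_t(2), Y_t(2)^2]].
rng = np.random.default_rng(2)
T = 20
Y = rng.normal(size=(T, 2))              # fixed potential outcomes (design-based view)

S_mat = np.cov(Y, rowvar=False)          # finite-population S with 1/(T-1) scaling
V_aipw = np.zeros((2, 2))
for y1, y2 in Y:
    V_aipw += np.array([[y1**2, -y1 * y2], [-y1 * y2, y2**2]])
V_aipw /= T

reps = 50_000
acc = np.zeros((2, 2))
for _ in range(reps):
    Z = rng.integers(0, 2, size=T)       # Bernoulli(1/2) assignments
    Yhat = np.zeros((T, 2))
    Yhat[np.arange(T), Z] = Y[np.arange(T), Z] / 0.5   # unit-level IPW estimates
    acc += np.cov(Yhat, rowvar=False)    # V_hat for this randomization
mean_Vhat = acc / reps

gap = np.max(np.abs(mean_Vhat - (V_aipw + S_mat)))
print(gap)                               # should be small (Monte Carlo error only)
```

The residual gap shrinks at the usual $O(\text{reps}^{-1/2})$ Monte Carlo rate, consistent with the expectation identity being exact.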
The necessary and sucient condition for C S C ⊤ = 0 follows from C S C ⊤ = C ( 1 T − 1 T X t =1 ( Y t − ¯ Y )( Y t − ¯ Y ) ⊤ ) C ⊤ = 1 T − 1 T X t =1 ( C Y t − C ¯ Y )( C Y t − C ¯ Y ) ⊤ = 1 T − 1 T X t =1 ( τ C,t − τ C )( τ C,t − τ C ) ⊤ . S2.3 Pro ofs of the asymptotic results Pro of of Theorem 4 ( ii ) . Recall from ( S6 ) that ˆ Y aipw − ¯ Y = T − 1 / 2 P T t =1 D t . The goal is to pro ve ( C V aipw C ⊤ ) − 1 / 2 · √ T C ( ˆ Y aipw − ¯ Y ) ( S6 ) = ( C V aipw C ⊤ ) − 1 / 2 C · T X t =1 D t d − → N (0 , I ) for any xed matrix C with K columns and full row rank. By the Cramér–W old theorem, it suces to v erify that, for an y constant unit v ector η , η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C · T X t =1 D t = T X t =1 δ t d − → N (0 , 1) , (S19) where δ t = η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C · D t . It follows from E ( δ t | H t ) = η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C · E ( D t | H t ) = 0 by Lemma S3 that { δ t : t = 1 , . . . , T } is a martingale dierence sequence with respect to ltration {F t = H t +1 : 1 ≤ t ≤ T } . Let ˇ V T = T X t =1 E ( δ 2 t | H t ) , V T = E ( ˇ V T ) , with ˇ V T = η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C · ( T X t =1 E ( D t D ⊤ t | H t ) ) · C ⊤ ( C V aipw C ⊤ ) − 1 / 2 η Lemma S3 = η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C ˇ V aipw C ⊤ ( C V aipw C ⊤ ) − 1 / 2 η , V T = η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C V aipw C ⊤ ( C V aipw C ⊤ ) − 1 / 2 η = η ⊤ η = 1 (S20) S17 b y Lemma S3 . W e verify b elo w that ˇ V T and V T satisfy the following conditions in the martingale cen tral limit theorem in Lemma S4 , so that ( S19 ) holds: V ariance conv ergencen: ˇ V T /V T = 1 + o P (1) , Lindeb erg: V − 1 T P T t =1 E n δ 2 t · 1 ( | δ t | ≥ ϵV 1 / 2 T ) o = o P (1) for all ϵ > 0 . 
(S21) A useful fact is λ min ( C V aipw C ⊤ ) = inf ∥ ˜ η ∥ 2 =1 ˜ η ⊤ C V aipw C ⊤ ˜ η ≥ λ min ( V aipw ) · inf ∥ ˜ η ∥ 2 =1 ˜ η ⊤ C C ⊤ ˜ η = λ min ( V aipw ) · λ min ( C C ⊤ ) so that ∥ ( C V aipw C ⊤ ) − 1 / 2 ∥ 2 2 · ∥ C ∥ 2 2 ( S1 ) = λ max  ( C V aipw C ⊤ ) − 1  · λ max ( C C ⊤ ) = λ − 1 min ( C V aipw C ⊤ ) · λ max ( C C ⊤ ) ≤ { λ min ( V aipw ) } − 1 · κ ( C C ⊤ ) , (S22) where κ ( C C ⊤ ) = λ max ( C C ⊤ ) /λ min ( C C ⊤ ) . Pro of of the v ariance con v ergence condition ˇ V T /V T = 1 + o P (1) in ( S21 ) . Giv en V T = 1 from ( S20 ), it suces to verify ˇ V T − V T = o P (1) . Note that ( ˇ V T − V T ) 2 = ∥ ˇ V T − V T ∥ 2 2 ( S20 ) =   η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C  ˇ V aipw − V aipw  C ⊤ ( C V aipw C ⊤ ) − 1 / 2 η   2 2 submultiplicit y ≤ ∥ ( C V aipw C ⊤ ) − 1 / 2 ∥ 4 2 · ∥ C ∥ 4 2 · ∥ ˇ V aipw − V aipw ∥ 2 2 ( S22 ) + ( S1 ) ≤ { λ min ( V aipw ) } − 2 · κ ( C C ⊤ ) 2 · ∥ ˇ V aipw − V aipw ∥ 2 f . Therefore, Condition 3 ( i ) ensures that ˇ V T − V T = o P (1) . Pro of of the Lindeb erg condition V − 1 T P T t =1 E { δ 2 t · 1 ( | δ t | ≥ ϵV 1 / 2 T ) } = o P (1) in ( S21 ) . Giv en V T = 1 from ( S20 ), it suces to verify T X t =1 E { δ 2 t · 1 ( | δ t | ≥ ϵ ) } = o P (1) for all ϵ > 0 . (S23) S18 Recall from ( S2 )–( S3 ) that | w t ( z ) | ( S2 ) = | 1 ( Z t = z ) · e t ( z ) − 1 − 1 | ( S3 ) ≤ e − 1 t . This implies | D t ( z ) | ( S4 ) =     1 √ T w t ( z ) · A t ( z )     ≤ 1 √ T ( L t + M t ) /e t , so that ∥ D t ∥ 2 2 = X z ∈Z D t ( z ) 2 ≤ K · T − 1 ( L t + M t ) 2 /e 2 t . (S24) Equations ( S22 ) and ( S24 ) together ensure δ 2 t = ∥ δ t ∥ 2 2 ( S19 ) =   η ⊤ ( C V aipw C ⊤ ) − 1 / 2 C · D t   2 2 submultiplicit y ≤ ∥ ( C V aipw C ⊤ ) − 1 / 2 ∥ 2 2 · ∥ C ∥ 2 2 · ∥ D t ∥ 2 2 ( S22 )+( S24 ) ≤  { λ min ( V aipw ) } − 1 · κ ( C C ⊤ )  ·  K · T − 1 ( L t + M t ) 2 /e 2 t  ≤ K · κ ( C C ⊤ ) · { λ min ( V aipw ) } − 1 · T − 1 max t =1 ,...,T ( L t + M t ) 2 /e 2 t . 
(S25) Condition 3 ( ii ) requires { λ min ( V aipw ) } − 1 · T − 1 max t =1 ,...,T ( L t + M t ) 2 /e 2 t = o (1) , so that for an y ϵ > 0 , there exists T 0 suc h that for all T ≥ T 0 , { λ min ( V aipw ) } − 1 · T − 1 max t =1 ,...,T ( L t + M t ) 2 /e 2 t < ϵ 2 K · κ ( C C ⊤ ) so that, from ( S25 ), δ 2 t ( S25 ) < ϵ 2 for all t = 1 , . . . , T . This ensures P T t =1 E { δ 2 t · 1 ( | δ t | ≥ ϵ ) } = 0 for all T ≥ T 0 , whic h implies ( S23 ). Pro of of Prop osition 3 . The result follows from Prop osition S1 , which we pro v e in Sec- tion S4 . Pro of of Prop osition 4 ( ii ) , large-sample results. Lemma S9 implies ˆ V aipw − T T − 1 V aipw − S = (  ψ T  1 / 2 + ∥ ˇ V aipw − V aipw ∥ f ) · O P (1) S19 = λ min ( V aipw ) · ( 1 λ min ( V aipw )  ψ T  1 / 2 + 1 λ min ( V aipw ) · ∥ ˇ V aipw − V aipw ∥ f ) · O P (1) , (S26) where ψ = T − 1 P T t =1 ( L t + M t ) 4 /e 3 t . Conditions 4 and 3 ensure 1 λ min ( V aipw )  ψ T  1 / 2 = o P (1) , 1 λ min ( V aipw ) · ∥ ˇ V aipw − V aipw ∥ f = o P (1) , (S27) resp ectively , in the middle term of ( S26 ), so that ˆ V aipw − V aipw − S = λ min ( V aipw ) · o P (1) · O P (1) = λ min ( V aipw ) · o P (1) . Pro of of Prop osition 5 . F or t wo sequences of v ectors ( a t ) T t =1 and ( b t ) T t =1 , dene co v s ( a t , b t ) = ( T − 1) − 1 T X t =1 ( a t − ¯ a )( b t − ¯ b ) ⊤ as the sample cov ariance matrix of ( a t , b t ) T t =1 , where ¯ a = T − 1 P T t =1 a t and ¯ b = T − 1 P T t =1 b t . Let Σ = cov s ( Y t , ˆ m t ) , ˆ Σ = cov s ( ˆ Y t, aipw , ˆ m t ) (S28) denote the sample cov ariance matrices of ( Y t , ˆ m t ) T t =1 and ( ˆ Y t, aipw , ˆ m t ) T t =1 , resp ectively . 
Then ˜ V aipw = cov s ( ˜ Y t, aipw ) = cov s ( ˆ Y t, aipw − ˆ m t ) = cov s ( ˆ Y t, aipw ) − cov s ( ˆ Y t, aipw , ˆ m t ) − cov s ( ˆ m t , ˆ Y t, aipw ) + cov s ( ˆ m t ) = ˆ V aipw − ˆ Σ − ˆ Σ ⊤ + cov s ( ˆ m t ) , (S29) S = co v s ( Y t ) = cov s ( Y t − ˆ m t + ˆ m t ) = cov s ( Y t − ˆ m t ) + cov s ( Y t − ˆ m t , ˆ m t ) + cov s ( ˆ m t , Y t − ˆ m t ) + cov s ( ˆ m t ) = Ω + cov s ( Y t , ˆ m t ) − cov s ( ˆ m t ) + cov s ( ˆ m t , Y t ) − cov s ( ˆ m t ) + cov s ( ˆ m t ) = Ω + Σ + Σ ⊤ − cov s ( ˆ m t ) , where Ω = cov s ( Y t − ˆ m t ) . (S30) S20 Equations ( S29 )–( S30 ) and Prop osition 4 together ensure ˜ V aipw − V aipw ( S29 ) = ˆ V aipw − ˆ Σ − ˆ Σ ⊤ + cov s ( ˆ m t ) − V aipw Proposition 4 = S − ˆ Σ − ˆ Σ ⊤ + cov s ( ˆ m t ) + λ min ( V aipw ) · o P (1) ( S30 ) = Ω − ( ˆ Σ − Σ) − ( ˆ Σ − Σ) ⊤ + λ min ( V aipw ) · o P (1) , so that it suces to v erify that ˆ Σ − Σ = λ min ( V aipw ) · o P (1) . (S31) F rom the discussion ab o ve, we essentially provide an consisten t estimation for S − Ω . Therefore, suc h an impro v ement on reducing conserv ativeness in v ariance estimation can also be applied to ˆ V ipw in ( 7 ). A sucient condition for ( S31 ) . Let ¯ m = T − 1 P T t =1 ˆ m t . Then ˆ Σ − Σ ( S28 ) = co v s ( ˆ Y t, aipw − Y t , ˆ m t ) = 1 T − 1 ( T X t =1 ( ˆ Y t, aipw − Y t ) · ˆ m ⊤ t − T ( ˆ Y aipw − ¯ Y ) · ¯ m ⊤ ) = T T − 1 (Σ 1 − Σ 2 ) , (S32) where Σ 1 = T − 1 T X t =1 ( ˆ Y t, aipw − Y t ) · ˆ m ⊤ t , Σ 2 = ( ˆ Y aipw − ¯ Y ) · ¯ m ⊤ . Recall from ( S6 ) that ˆ Y t, aipw ( z ) − Y t ( z ) = √ T D t ( z ) , ˆ Y aipw ( z ) − ¯ Y ( z ) = T − 1 / 2 T X t =1 D t ( z ) . S21 The ( z , z ′ ) th ele men ts of Σ 1 and Σ 2 equal Σ 1 ,z z ′ = T − 1 T X t =1 n ˆ Y t, aipw ( z ) − Y t ( z ) o · ˆ m t ( z ′ ) ( S6 ) = 1 √ T T X t =1 D t ( z ) · ˆ m t ( z ′ ) = 1 √ T T X t =1 δ t , Σ 2 ,z z ′ = n ˆ Y aipw ( z ) − ¯ Y ( z ) o · ¯ m ( z ′ ) ( S6 ) = 1 √ T ( T X t =1 D t ( z ) ) · ( T − 1 T X t =1 ˆ m t ( z ′ ) ) . 
(S33) resp ectively , where δ t = D t ( z ) · ˆ m t ( z ′ ) . W e verify b elow Σ 1 ,z z ′ = λ min ( V aipw ) · o P (1) , Σ 2 ,z z ′ = λ min ( V aipw ) · o P (1) (S34) for all z , z ′ ∈ Z . This, together with ( S32 ), implies ( S31 ). Pro of of Σ 1 ,z z ′ = λ min ( V aip w ) · o P (1) in ( S34 ) . Giv en δ t = D t ( z ) · ˆ m t ( z ′ ) from ( S33 ), Lemma S3 ensures E ( δ t | H t ) = E { D t ( z ) | H t } · ˆ m t ( z ′ ) Lemma S3 = 0 , v ar ( δ t | H t ) = ˆ m t ( z ′ ) 2 · v ar { D t ( z ) | H t } Lemma S3 = ˆ m t ( z ′ ) 2 · T − 1 r t ( z ) A t ( z ) 2 ( S3 ) ≤ T − 1 ( L t + M t ) 4 /e 3 t . (S35) Therefore, δ t is a martingale dierence sequence with resp ect to ltration {F t = H t +1 : 1 ≤ t ≤ T } , and it follows from Lemma S1 that E (Σ 1 ,z z ′ ) = 1 √ T T X t =1 E ( δ t ) = 0 , v ar (Σ 1 ,z z ′ ) = T − 1 v ar T X t =1 δ t ! = T − 1 T X t =1 E { v ar ( δ t | H t ) } ( S35 ) ≤ T − 2 T X t =1 ( L t + M t ) 4 /e 3 t , (S36) where T − 2 P T t =1 ( L t + M t ) 4 /e t = λ 2 min ( V aipw ) · o (1) under Condition 4 . This implies Σ 1 ,z z ′ Lemma S5 +( S36 ) = q v ar (Σ 1 ,z z ′ ) · O P (1) ( S36 ) + Condition 4 = λ min ( V ipw ) · o P (1) under Condition 4 by Lemma S5 (Chebyshev’s inequality). S22 Pro of of Σ 2 ,z z ′ = λ min ( V aip w ) · o P (1) in ( S34 ) . Lemma S3 implies E { D t ( z ) | H t } = 0 , v ar { D t ( z ) | H t } = T − 1 r t ( z ) A t ( z ) 2 ( S3 ) ≤ T − 1 e − 1 t ( L t + M t ) 2 , so that E n P T t =1 D t ( z ) o = 0 and v ar ( T X t =1 D t ( z ) ) Lemma S3 = T X t =1 E h v ar { D t ( z ) | H t } i ( S3 ) ≤ T − 1 T X t =1 e − 1 t ( L t + M t ) 2 . This, together with Lemma S5 (Cheb yshev’s inequalit y), ensures ( T X t =1 D t ( z ) ) 2 = v ar ( T X t =1 D t ( z ) ) · O P (1) = ( T − 1 T X t =1 e − 1 t ( L t + M t ) 2 ) · O P (1) . (S37) In addition,      T − 1 T X t =1 ˆ m t ( z ′ )      ≤ T − 1 T X t =1 M t ≤ T − 1 T X t =1 M 2 t ! 1 / 2 (S38) b y Cauc hy–Sc h w arz inequalit y . 
Plugging ( S37 )–( S38 ) in to the expression of Σ 2 ,z z ′ in ( S33 ) implies Σ 2 2 ,z z ′ ( S33 ) = T − 1 ( T X t =1 D t ( z ) ) 2 ·      T − 1 T X t =1 ˆ m t ( z ′ )      2 ( S37 ) + ( S38 ) = T − 1 ( T − 1 T X t =1 e − 1 t ( L t + M t ) 2 ) · T − 1 T X t =1 M 2 t ! · O P (1) ≤ T − 1 ( T − 1 T X t =1 e − 1 t ( L t + M t ) 2 ) 2 · O P (1) Cauch y–Sch warz ≤ T − 1 ( T − 1 T X t =1 e − 2 t ( L t + M t ) 4 ) · O P (1) ≤ T − 2 T X t =1 e − 3 t ( L t + M t ) 4 · O P (1) Condition 4 = { λ min ( V aipw ) } 2 · o P (1) . Pro of of Theorem 5 . W e verify b elo w the results for S aipw ,C,α . The pro of of the results for ˜ S aipw ,C,α is identical after replacing ˆ V aipw b y ˜ V aipw , hence omitted. S23 F rom Lemma S9 and ( S22 ), under Conditions 3 – 4 , for any xed matrix C with K columns and full row rank,     C V aipw C ⊤  − 1 / 2 C n ˆ V aipw − T / ( T − 1) · V aipw − S o C ⊤ ( C V aipw C ) − 1 / 2    2 submultiplicit y ≤ ∥ ( C V aipw C ⊤ ) − 1 / 2 ∥ 2 2 · ∥ C ∥ 2 2 ·    ˆ V aipw − T / ( T − 1) · V aipw − S    2 ( S22 ) + ( S1 ) ≤ κ ( C C ⊤ ) · λ − 1 min ( V aipw ) · ∥ ˆ V aipw − T / ( T − 1) · V aipw − S ∥ f Lemma S9 = κ ( C C ⊤ ) · λ − 1 min ( V aipw ) · (  ψ T  1 / 2 + ∥ ˇ V aipw − V aipw ∥ f ) · O P (1) ( S27 ) = o P (1) . This ensures  C V aipw C ⊤  − 1 / 2  C ˆ V aipw C ⊤ − C S C ⊤   C V aipw C ⊤  − 1 / 2 =  C V aipw C ⊤  − 1 / 2  T / ( T − 1) · C V aipw C ⊤   C V aipw C ⊤  − 1 / 2 + o P (1) , = I rank( C ) + o P (1) . 
Consequently, we have
\[
T(\hat\tau_{\mathrm{aipw},C} - \tau_C)^\top\big(C\hat V_{\mathrm{aipw}} C^\top - C S C^\top\big)^{-1}(\hat\tau_{\mathrm{aipw},C} - \tau_C)
\]
\[
= \big\{\sqrt T (C V_{\mathrm{aipw}} C^\top)^{-1/2}(\hat\tau_{\mathrm{aipw},C} - \tau_C)\big\}^\top\cdot
\big\{(C V_{\mathrm{aipw}} C^\top)^{-1/2}\big(C\hat V_{\mathrm{aipw}} C^\top - C S C^\top\big)(C V_{\mathrm{aipw}} C^\top)^{-1/2}\big\}^{-1}\cdot
\big\{\sqrt T (C V_{\mathrm{aipw}} C^\top)^{-1/2}(\hat\tau_{\mathrm{aipw},C} - \tau_C)\big\}
\]
\[
\overset{(S39)}{=} \big\{\sqrt T (C V_{\mathrm{aipw}} C^\top)^{-1/2}(\hat\tau_{\mathrm{aipw},C} - \tau_C)\big\}^\top\cdot\big\{I_{\mathrm{rank}(C)} + o_P(1)\big\}\cdot\big\{\sqrt T (C V_{\mathrm{aipw}} C^\top)^{-1/2}(\hat\tau_{\mathrm{aipw},C} - \tau_C)\big\}
= \big\|\sqrt T (C V_{\mathrm{aipw}} C^\top)^{-1/2}(\hat\tau_{\mathrm{aipw},C} - \tau_C)\big\|_2^2 + o_P(1)
\overset{d}{\longrightarrow} \chi^2_{\mathrm{rank}(C)}, \tag{S40}
\]
where $\chi^2_{\mathrm{rank}(C)}$ denotes the chi-squared distribution with $\mathrm{rank}(C)$ degrees of freedom, and the last equality and the last convergence in (S40) hold due to Theorem 4(ii) and Slutsky's theorem. This, together with $(C\hat V_{\mathrm{aipw}} C^\top)^{-1} \preceq (C\hat V_{\mathrm{aipw}} C^\top - C S C^\top)^{-1}$, ensures
\[
\liminf_{T\to\infty} P\Big\{T(\hat\tau_{\mathrm{aipw},C} - \tau_C)^\top\big(C\hat V_{\mathrm{aipw}} C^\top\big)^{-1}(\hat\tau_{\mathrm{aipw},C} - \tau_C) \le \chi^2_{\mathrm{rank}(C),1-\alpha}\Big\}
\ge \lim_{T\to\infty} P\Big\{T(\hat\tau_{\mathrm{aipw},C} - \tau_C)^\top\big(C\hat V_{\mathrm{aipw}} C^\top - C S C^\top\big)^{-1}(\hat\tau_{\mathrm{aipw},C} - \tau_C) \le \chi^2_{\mathrm{rank}(C),1-\alpha}\Big\}
\overset{(S40)}{=} 1-\alpha.
\]

S3 Proofs of the results in Section 5

All results on block adaptive randomization in Section 5, except those on $\hat V_{\mathrm{aipw},b}$ and $\tilde V_{\mathrm{aipw},b}$ in Section 5.3, follow directly from the theory for the AIPW estimator under unit-level adaptive randomization in Section 4 by treating each group as a unit and renewing (i) $Y_t(z)$ as $\rho_t \bar Y_t(z)$, where $\rho_t = T\cdot\pi_t = (T/N)\cdot n_t$ and $\bar Y_t(z) = n_t^{-1}\sum_{i=1}^{n_t} Y_{ti}(z)$, and (ii) $\hat Y_{t,\mathrm{aipw}}(z)$ as $\rho_t\cdot n_t^{-1}\sum_{i=1}^{n_t}\hat Y_{ti,\mathrm{aipw}}(z)$. We verify below Proposition 6 on $\hat V_{\mathrm{aipw},b}$ and then provide intuition for
\[
\tilde V_{\mathrm{aipw},b} = T\,\mathrm{cov}(\hat Y_{\mathrm{aipw}}) + \mathrm{cov}_b(\bar Y_t - \hat m_t) + o_P(1), \tag{S41}
\]
where $\mathrm{cov}_b(\bar Y_t - \hat m_t)$ is defined as follows.
For tuples of vectors $\{(a_t, c_t)\}_{t=1}^T$, let
\[
\mathrm{cov}_b(a_t, c_t) = \sum_{t=1}^T b_t (a_t - \bar a)(c_t - \bar c)^\top, \qquad \mathrm{cov}_b(a_t) = \mathrm{cov}_b(a_t, a_t),
\]
where $\bar a = \sum_{t=1}^T \pi_t a_t$, $\bar c = \sum_{t=1}^T \pi_t c_t$, and
\[
b_t = \frac{T\cdot\pi_t^2/(1-2\pi_t)}{1 + \sum_{s=1}^T \pi_s^2/(1-2\pi_s)} \quad\text{with } \pi_t = n_t/N, \tag{S42}
\]
as defined in the main paper. Then
\[
\hat V_{\mathrm{aipw},b} = \sum_{t=1}^T b_t\big(\hat Y_{t,\mathrm{aipw}} - \hat Y_{\mathrm{aipw}}\big)\big(\hat Y_{t,\mathrm{aipw}} - \hat Y_{\mathrm{aipw}}\big)^\top = \mathrm{cov}_b(\hat Y_{t,\mathrm{aipw}}), \qquad
\tilde V_{\mathrm{aipw},b} = \sum_{t=1}^T b_t\big(\tilde Y_{t,\mathrm{aipw}} - \tilde Y_{\mathrm{aipw}}\big)\big(\tilde Y_{t,\mathrm{aipw}} - \tilde Y_{\mathrm{aipw}}\big)^\top = \mathrm{cov}_b(\tilde Y_{t,\mathrm{aipw}}). \tag{S43}
\]
Assume the block adaptive randomization in Definition 2 throughout this section.

S3.1 Proof of Proposition 6

Proposition 6(i) follows from the definition of $b_t$ and the assumption of $\pi_t < 1/2$. We verify below Proposition 6(ii)–(iii), respectively. Proposition 6(iv) then follows from Proposition 6(i) and (iii).

Proof of Proposition 6(ii). When blocks are of equal sizes, $\pi_t = 1/T$ for all $1\le t\le T$, which implies $b_t = 1/(T-1)$ for $t = 1,\dots,T$, so that $\hat V_{\mathrm{aipw},b} = \hat V_{\mathrm{aipw}}$.

Proof of Proposition 6(iii). Write
\[
\hat V_{\mathrm{aipw},b} \overset{(S43)}{=} \underbrace{\sum_{t=1}^T b_t \hat Y_{t,\mathrm{aipw}}\hat Y_{t,\mathrm{aipw}}^\top}_{B_1} + \underbrace{\sum_{t=1}^T b_t \hat Y_{\mathrm{aipw}}\hat Y_{\mathrm{aipw}}^\top}_{B_2} - \underbrace{\sum_{t=1}^T b_t \hat Y_{t,\mathrm{aipw}}\hat Y_{\mathrm{aipw}}^\top}_{B_3} - \sum_{t=1}^T b_t \hat Y_{\mathrm{aipw}}\hat Y_{t,\mathrm{aipw}}^\top = B_1 + B_2 - B_3 - B_3^\top. \tag{S44}
\]
Recall that $Y_{ti} = (Y_{ti}(1),\dots,Y_{ti}(K))^\top$ denotes the potential outcomes vector for unit $ti$, and $\bar Y_t = n_t^{-1}\sum_{i=1}^{n_t} Y_{ti} = (\bar Y_t(1),\dots,\bar Y_t(K))^\top$ denotes the group average within block $t$. Let $\bar Y = (\bar Y(1),\dots,\bar Y(K))^\top = \sum_{t=1}^T \pi_t\bar Y_t$ denote the average potential outcomes vector for all units.
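As a quick numerical sanity check (an illustrative sketch, not part of the proof), the weights in (S42) can be verified to reduce to $1/(T-1)$ under equal block sizes, as in Proposition 6(ii), and to satisfy the algebraic identity behind $B_4 = T\,\mathrm{cov}(\hat Y_{\mathrm{aipw}})$ in (S48) below. The block sizes and the scalar stand-ins for the block-level covariances are hypothetical.

```python
import numpy as np

def block_weights(pi):
    # b_t = T * pi_t^2/(1 - 2*pi_t) / (1 + sum_s pi_s^2/(1 - 2*pi_s)), as in (S42)
    pi = np.asarray(pi, dtype=float)
    T = len(pi)
    B0 = np.sum(pi**2 / (1 - 2 * pi))
    return (T / (1 + B0)) * pi**2 / (1 - 2 * pi)

# Equal block sizes: pi_t = 1/T gives b_t = 1/(T-1), so V_hat_{aipw,b} = V_hat_aipw.
T = 8
b_equal = block_weights(np.full(T, 1.0 / T))
assert np.allclose(b_equal, 1.0 / (T - 1))

# Unequal blocks: sum_t b_t (1-2 pi_t) C_t + (sum_t b_t) C = T C whenever
# C = sum_t pi_t^2 C_t, which is the identity giving B_4 = T cov in (S48).
rng = np.random.default_rng(0)
n = np.array([3, 5, 4, 8, 6])             # hypothetical block sizes n_t
pi = n / n.sum()                          # pi_t = n_t / N, all < 1/2
b = block_weights(pi)
Ct = rng.uniform(0.5, 2.0, size=len(pi))  # stand-ins for cov(Y_hat_{t,aipw}), scalar case
C = np.sum(pi**2 * Ct)                    # block independence (S45) gives this form
assert np.isclose(np.sum(b * (1 - 2 * pi) * Ct) + np.sum(b) * C, len(pi) * C)
```

Both checks hold for any valid $\pi_t < 1/2$, since the identity is purely algebraic in the weights.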
It follows from
\[
E(\hat Y_{t,\mathrm{aipw}}) = \bar Y_t, \quad E(\hat Y_{\mathrm{aipw}}) = \bar Y, \quad \hat Y_{\mathrm{aipw}} = \sum_{t=1}^T \pi_t\hat Y_{t,\mathrm{aipw}}, \quad \mathrm{cov}(\hat Y_{t,\mathrm{aipw}}, \hat Y_{t',\mathrm{aipw}}) = 0 \ (t\ne t') \tag{S45}
\]
that
\[
E(B_1) \overset{(S44)}{=} \sum_{t=1}^T b_t\, E(\hat Y_{t,\mathrm{aipw}}\hat Y_{t,\mathrm{aipw}}^\top) \overset{(S45)}{=} \sum_{t=1}^T b_t\,\mathrm{cov}(\hat Y_{t,\mathrm{aipw}}) + \sum_{t=1}^T b_t\bar Y_t\bar Y_t^\top,
\qquad
E(B_2) \overset{(S44)}{=} \sum_{t=1}^T b_t\, E(\hat Y_{\mathrm{aipw}}\hat Y_{\mathrm{aipw}}^\top) \overset{(S45)}{=} \sum_{t=1}^T b_t\,\mathrm{cov}(\hat Y_{\mathrm{aipw}}) + \sum_{t=1}^T b_t\bar Y\bar Y^\top,
\]
\[
E(B_3) \overset{(S44)}{=} E\Big\{\Big(\sum_{t=1}^T b_t\hat Y_{t,\mathrm{aipw}}\Big)\cdot\hat Y_{\mathrm{aipw}}^\top\Big\}
\overset{(S45)}{=} \mathrm{cov}\Big(\sum_{t=1}^T b_t\hat Y_{t,\mathrm{aipw}},\ \hat Y_{\mathrm{aipw}}\Big) + \Big(\sum_{t=1}^T b_t\bar Y_t\Big)\bar Y^\top
\overset{(S45)}{=} \mathrm{cov}\Big(\sum_{t=1}^T b_t\hat Y_{t,\mathrm{aipw}},\ \sum_{t=1}^T \pi_t\hat Y_{t,\mathrm{aipw}}\Big) + \sum_{t=1}^T b_t\bar Y_t\bar Y^\top
= \sum_{t=1}^T b_t\pi_t\,\mathrm{cov}(\hat Y_{t,\mathrm{aipw}}) + \sum_{t=1}^T b_t\bar Y_t\bar Y^\top. \tag{S46}
\]
In addition, let $B_0 = \sum_{t=1}^T \pi_t^2/(1-2\pi_t)$ to write
\[
b_t \overset{(S42)}{=} \frac{T}{1+B_0}\cdot\frac{\pi_t^2}{1-2\pi_t}, \qquad b_t(1-2\pi_t) = \frac{T}{1+B_0}\cdot\pi_t^2, \qquad \sum_{t=1}^T b_t = \frac{T B_0}{1+B_0}. \tag{S47}
\]
This, together with $\mathrm{cov}(\hat Y_{\mathrm{aipw}}) = \sum_{t=1}^T \pi_t^2\,\mathrm{cov}(\hat Y_{t,\mathrm{aipw}})$ from (S45), implies
\[
B_4 \equiv \sum_{t=1}^T b_t(1-2\pi_t)\,\mathrm{cov}(\hat Y_{t,\mathrm{aipw}}) + \sum_{t=1}^T b_t\,\mathrm{cov}(\hat Y_{\mathrm{aipw}})
\overset{(S47)}{=} \frac{T}{1+B_0}\sum_{t=1}^T \pi_t^2\,\mathrm{cov}(\hat Y_{t,\mathrm{aipw}}) + \frac{T B_0}{1+B_0}\,\mathrm{cov}(\hat Y_{\mathrm{aipw}})
= \frac{T}{1+B_0}\,\mathrm{cov}(\hat Y_{\mathrm{aipw}}) + \frac{T B_0}{1+B_0}\,\mathrm{cov}(\hat Y_{\mathrm{aipw}}) = T\,\mathrm{cov}(\hat Y_{\mathrm{aipw}}). \tag{S48}
\]
Equations (S44), (S46), and (S48) ensure
\[
E(\hat V_{\mathrm{aipw},b}) \overset{(S44)+(S46)}{=} \underbrace{\sum_{t=1}^T b_t\bar Y_t\bar Y_t^\top + \sum_{t=1}^T b_t\bar Y\bar Y^\top - \sum_{t=1}^T b_t\bar Y_t\bar Y^\top - \sum_{t=1}^T b_t\bar Y\bar Y_t^\top}_{\sum_{t=1}^T b_t(\bar Y_t-\bar Y)(\bar Y_t-\bar Y)^\top} + \underbrace{\sum_{t=1}^T b_t(1-2\pi_t)\,\mathrm{cov}(\hat Y_{t,\mathrm{aipw}}) + \sum_{t=1}^T b_t\,\mathrm{cov}(\hat Y_{\mathrm{aipw}})}_{B_4}
\overset{(S48)}{=} \sum_{t=1}^T b_t(\bar Y_t-\bar Y)(\bar Y_t-\bar Y)^\top + T\,\mathrm{cov}\big(\hat Y_{\mathrm{aipw}}\big).
\]

S3.2 Intuition for (S41): $\tilde V_{\mathrm{aipw},b} = T\,\mathrm{cov}(\hat Y_{\mathrm{aipw}}) + \mathrm{cov}_b(\bar Y_t - \hat m_t) + o_P(1)$.
The denition of cov b ( · , · ) and ( S43 ) ensure ˜ V aipw ,b ( S43 ) = co v b ( ˜ Y t, aipw ) = co v b ( ˆ Y t, aipw − ˆ m t ) = co v b ( ˆ Y t, aipw ) + cov b ( ˆ m t ) − cov b ( ˆ Y t, aipw , ˆ m t ) − cov b ( ˆ m t , ˆ Y t, aipw ) ( S43 ) = ˆ V aipw ,b + cov b ( ˆ m t ) − cov b ( ˆ Y t, aipw , ˆ m t ) − cov b ( ˆ m t , ˆ Y t, aipw ) , (S49) co v b ( ¯ Y t ) = co v b ( ¯ Y t − ˆ m t + ˆ m t ) = co v b ( ¯ Y t − ˆ m t ) + cov b ( ˆ m t ) + cov b ( ¯ Y t − ˆ m t , ˆ m t ) + cov b ( ˆ m t , ¯ Y t − ˆ m t ) = co v b ( ¯ Y t − ˆ m t ) − cov b ( ˆ m t ) + cov b ( ¯ Y t , ˆ m t ) + cov b ( ˆ m t , ¯ Y t ) . (S50) Let ∆ = ˆ V aipw ,b − cov b ( ¯ Y t ) with E (∆) = T cov ( ˆ Y aipw ) by Prop osition 6 ( iii ). Equations ( S49 )– ( S50 ) ensure ˜ V aipw ,b − cov b ( ¯ Y t − ˆ m t ) = ˜ V aipw ,b − ˆ V aipw ,b + cov b ( ¯ Y t ) − cov b ( ¯ Y t − ˆ m t ) + n ˆ V aipw ,b − cov b ( ¯ Y t ) o ( S49 ) + ( S50 ) = co v b ( ˆ m t ) − cov b ( ˆ Y t, aipw , ˆ m t ) − cov b ( ˆ m t , ˆ Y t, aipw ) − co v b ( ˆ m t ) + cov b ( ¯ Y t , ˆ m t ) + cov b ( ˆ m t , ¯ Y t ) + ∆ = ∆ − co v b ( ˆ Y t, aipw − ¯ Y t , ˆ m t ) − cov b ( ˆ m t , ˆ Y t, aipw − ¯ Y t ) = ∆ − B − B ⊤ , where B = cov b ( ˆ Y t, aipw − ¯ Y t , ˆ m t ) , with E ( ˜ V aipw ,b ) − cov b ( ¯ Y t − ˆ m t ) = E (∆) − E ( B ) − E ( B ) ⊤ Proposition 6 ( iii ) = T cov ( ˆ Y aipw ) − E ( B ) − E ( B ) ⊤ . S28 Renew ¯ m = P T t =1 π t ˆ m t = T − 1 P T t =1 ρ t ˆ m t . Then B = co v b ( ˆ Y t, aipw − ¯ Y t , ˆ m t ) = T X t =1 b t n ( ˆ Y t, aipw − ¯ Y t ) − ( ˆ Y aipw − ¯ Y ) o ( ˆ m t − ¯ m ) = T X t =1 b t ( ˆ Y t, aipw − ¯ Y t )( ˆ m t − ¯ m ) | {z } unbiased for 0 − ( ˆ Y aipw − ¯ Y ) | {z } close to 0 · T X t =1 b t ( ˆ m t − ¯ m ) , so that intuitiv ely B = o P (1) . S4 Pro ofs of the results in Section S1.3 Recall that L t = max z ∈Z | Y t ( z ) | , e t = min H t , z ∈Z e t ( z ) , M t = max H t , z ∈Z | ˆ m t ( z ) | , v t = max z ∈Z v ar { e t ( z ) − 1 } , ω t = max z ∈Z v ar { ˆ m t ( z ) } . 
Further let
\[
\gamma_t = \max_{z\in\mathcal Z}\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\}, \qquad \phi_t = \max_{z,z'\in\mathcal Z}\mathrm{var}\{A_t(z)A_t(z')\}. \tag{S51}
\]

Useful facts. Let $\|\cdot\|_\infty$ denote the maximum absolute value of a random variable. For random variables $X_1,\dots,X_m\in\mathbb R$ with finite variances, standard results ensure
\[
\mathrm{var}\Big(\sum_{i=1}^m X_i\Big) \le m\sum_{i=1}^m \mathrm{var}(X_i), \qquad \mathrm{var}(X_1X_2) \le 2\,\mathrm{var}(X_1)\cdot\|X_2\|_\infty^2 + 2\,\mathrm{var}(X_2)\cdot\|X_1\|_\infty^2. \tag{S52}
\]

S4.1 Lemmas

Lemma S10. Assume the adaptive randomization in Definition 1. Let
\[
V_{1,z} = T^{-1}\sum_{t=1}^T\Big[\frac{A_t(z)^2}{e_t(z)} - E\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\}\Big] \quad\text{for } z\in\mathcal Z. \tag{S53}
\]
As $T\to\infty$, we have $\mathrm{var}(V_{1,z}) = o(1)$ for all $z\in\mathcal Z$ if any of the following conditions holds:

(i) $T^{-1}\sum_{t=1}^T \gamma_t = o(1)$.

(ii) $T^{-1}\sum_{t=1}^T\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\}\cdot(L_t+M_t)^2\cdot\sqrt{v_t} = o(1)$; there exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T(\max_{1\le s\le t}\gamma_s + \phi_t)\cdot e_t^{-1} = o(T^{1-\beta})$ and, after some fixed time point $T_0$, $\{\hat m_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.

(iii) $T^{-2}\sum_{t=1}^T \gamma_t = o(1)$; $T^{-1}\sum_{t=1}^T\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\}\cdot\sqrt{\phi_t}\cdot e_t^{-1} = o(1)$; there exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T(L_t+M_t)^2\cdot(\max_{1\le s\le t}\gamma_s + v_t) = o(T^{1-\beta})$ and, after some fixed time point $T_0$, $\{e_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.

(iv) There exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T \max_{1\le s\le t}\gamma_s = o(T^{1-\beta})$ and, after some fixed time point $T_0$, $\{e_t(z), \hat m_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.

Proof of Lemma S10. Recall that $\gamma_t = \max_{z\in\mathcal Z}\mathrm{var}\{A_t(z)^2/e_t(z)\}$ from (S51).
This ensures
\[
\mathrm{var}(V_{1,z}) \overset{(S53)}{=} T^{-2}\,\mathrm{var}\Big\{\sum_{t=1}^T\frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S52)}{\le} T^{-1}\sum_{t=1}^T\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} \le T^{-1}\sum_{t=1}^T\gamma_t,
\]
which verifies the sufficiency of Lemma S10(i). We verify below the sufficiency of Lemma S10(ii)–(iv), respectively. For their respective $T_0$, a useful decomposition is
\[
T^2\,\mathrm{var}(V_{1,z}) \overset{(S53)}{=} \mathrm{var}\Big\{\sum_{t=1}^T\frac{A_t(z)^2}{e_t(z)}\Big\}
= \sum_{t=1}^T\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} + 2\sum_{t=2}^{T_0}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\} + 2\sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\}
= S_1 + 2S_2 + 2S_3, \tag{S54}
\]
where $S_1$, $S_2$, and $S_3$ denote the three sums, respectively. Note that $S_2$ is a fixed finite number and therefore satisfies $S_2 = o(T^2)$. Given (S54), a sufficient condition for $\mathrm{var}(V_{1,z}) = o(1)$ is
\[
S_1 = o(T^2), \qquad S_3 = o(T^2). \tag{S55}
\]
We verify below this sufficient condition (S55) under Lemma S10(ii)–(iv), respectively.

Proof of (S55) under Lemma S10(ii). First, the definition of $\gamma_t$ in (S51) implies
\[
\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S51)}{\le} \gamma_t \le \Big(\max_{1\le s\le t}\gamma_s + \phi_t\Big)\cdot e_t^{-1}, \tag{S56}
\]
so that
\[
S_1 \overset{(S54)}{=} \sum_{t=1}^T\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S56)}{\le} \sum_{t=1}^T\Big(\max_{1\le s\le t}\gamma_s + \phi_t\Big)\cdot e_t^{-1} \overset{\text{Lemma S10(ii)}}{=} o(T^{2-\beta}) = o(T^2).
\]
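The decomposition (S54) is the generic expansion of the variance of a sum into variance and pairwise-covariance terms; as a minimal numerical illustration (purely illustrative, on an equally likely finite sample space with arbitrary numbers, not part of the argument):

```python
import numpy as np

# Rows are equally likely outcomes; columns play the role of the T summands.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 5))

var = lambda v: np.mean((v - v.mean()) ** 2)
cov = lambda u, v: np.mean((u - u.mean()) * (v - v.mean()))

T = X.shape[1]
lhs = var(X.sum(axis=1))                       # var of the sum
rhs = sum(var(X[:, t]) for t in range(T)) + 2 * sum(
    cov(X[:, k], X[:, t]) for t in range(1, T) for k in range(t)
)                                              # sum of vars + 2 * sum of covs
assert np.isclose(lhs, rhs)
```

The proof then splits the double covariance sum at $T_0$ and bounds the two pieces separately.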
Next, let $\Delta_t(z) = e_t(z)^{-1} - E\{e_t(z)^{-1}\}$ to write
\[
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\}
= \mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\cdot\big[E\{e_t(z)^{-1}\} + \Delta_t(z)\big]\Big\}
= \mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\Big\}\cdot E\{e_t(z)^{-1}\} + \mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\cdot\Delta_t(z)\Big\}, \tag{S57}
\]
and
\[
S_3 \overset{(S54)}{=} \sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S57)}{=} \Sigma_1 + \Sigma_2, \tag{S58}
\]
where
\[
\Sigma_1 = \sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\Big\}\cdot E\{e_t(z)^{-1}\}, \qquad
\Sigma_2 = \sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\cdot\Delta_t(z)\Big\}.
\]
We show below
\[
\Sigma_1 = o(T^2), \qquad \Sigma_2 = o(T^2), \tag{S59}
\]
respectively, which together ensure $S_3 = o(T^2)$ from (S58).

Proof of $\Sigma_1 = o(T^2)$ in (S59). Assume without loss of generality that $c_0T_0^\beta < T_0 - 1$ such that for $t > T_0$, $\{\hat m_t(z): z\in\mathcal Z\}$ depend only on the previous $\min\{t-1,\, c_0t^\beta\} = c_0t^\beta$ time points. For $t > T_0$, we have
\[
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\Big\} = 0 \ \text{ for } k < t - c_0t^\beta, \qquad
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\Big\} \le 2^{-1}\Big[\mathrm{var}\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\} + \mathrm{var}\{A_t(z)^2\}\Big] \overset{(S51)}{\le} 2^{-1}\Big(\max_{1\le s\le t}\gamma_s + \phi_t\Big) \ \text{ for } k\in[t - c_0t^\beta,\, t). \tag{S60}
\]
This ensures
\[
\Sigma_1 \overset{(S58)}{=} \sum_{t=T_0+1}^{T}\sum_{k=t-c_0t^\beta}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\Big\}\cdot E\{e_t(z)^{-1}\}
\overset{(S60)}{\le} \sum_{t=T_0+1}^{T} c_0t^\beta\cdot 2^{-1}\Big(\max_{1\le s\le t}\gamma_s + \phi_t\Big)\cdot e_t^{-1}
\le 2^{-1}c_0T^\beta\sum_{t=1}^T\Big(\max_{1\le s\le t}\gamma_s + \phi_t\Big)e_t^{-1}
\overset{\text{Lemma S10(ii)}}{=} c_0T^\beta\cdot o(T^{2-\beta}) = o(T^2).
\]

Proof of $\Sigma_2 = o(T^2)$ in (S59).
For $k < t$, we have
\[
\Big|\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\cdot\Delta_t(z)\Big\}\Big|
= \Big| E\Big\{\frac{A_k(z)^2}{e_k(z)}\cdot A_t(z)^2\cdot\Delta_t(z)\Big\} - E\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\}\cdot E\big\{A_t(z)^2\cdot\Delta_t(z)\big\}\Big|
\le E\Big\{\frac{A_k(z)^2}{e_k(z)}\cdot A_t(z)^2\cdot|\Delta_t(z)|\Big\} + E\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\}\cdot E\big\{A_t(z)^2\cdot|\Delta_t(z)|\big\}
\]
\[
\le \Big\|\frac{A_k(z)^2}{e_k(z)}\Big\|_\infty\cdot\|A_t(z)^2\|_\infty\cdot E\{|\Delta_t(z)|\} + \Big\|\frac{A_k(z)^2}{e_k(z)}\Big\|_\infty\cdot\|A_t(z)^2\|_\infty\cdot E\{|\Delta_t(z)|\}
\le 2\cdot\Big\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\Big\}\cdot(L_t+M_t)^2\cdot\sqrt{v_t}, \tag{S61}
\]
where the last inequality follows from $E\{|\Delta_t(z)|\} \le \sqrt{E\{|\Delta_t(z)|^2\}} = \sqrt{\mathrm{var}\{e_t(z)^{-1}\}} \le \sqrt{v_t}$. This ensures
\[
T^{-2}|\Sigma_2| \overset{(S58)}{\le} T^{-2}\sum_{t=1}^{T}\sum_{k=1}^{t-1}\Big|\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ A_t(z)^2\cdot\Delta_t(z)\Big\}\Big|
\overset{(S61)}{\le} T^{-2}\sum_{t=1}^T T\cdot 2\cdot\Big\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\Big\}\cdot(L_t+M_t)^2\cdot\sqrt{v_t}
\overset{\text{Lemma S10(ii)}}{=} o(1).
\]

Proof of (S55) under Lemma S10(iii). First, the definition of $\gamma_t$ in (S51) ensures
\[
S_1 \overset{(S54)}{=} \sum_{t=1}^T\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S51)}{\le} \sum_{t=1}^T\gamma_t \overset{\text{Lemma S10(iii)}}{=} o(T^2).
\]
Next, renew $\Delta_t(z)$ as $\Delta_t(z) = A_t(z)^2 - E\{A_t(z)^2\}$ to write
\[
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\}
= \mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \big[E\{A_t(z)^2\} + \Delta_t(z)\big]\cdot\frac{1}{e_t(z)}\Big\}
= E\{A_t(z)^2\}\cdot\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \frac{1}{e_t(z)}\Big\} + \mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \Delta_t(z)\cdot\frac{1}{e_t(z)}\Big\}, \tag{S62}
\]
and
\[
S_3 \overset{(S54)}{=} \sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S62)}{=} \Sigma_1 + \Sigma_2, \tag{S63}
\]
where
\[
\Sigma_1 = \sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1} E\{A_t(z)^2\}\cdot\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \frac{1}{e_t(z)}\Big\}, \qquad
\Sigma_2 = \sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \Delta_t(z)\cdot\frac{1}{e_t(z)}\Big\}.
\]
We show below
\[
\Sigma_1 = o(T^2), \qquad \Sigma_2 = o(T^2), \tag{S64}
\]
respectively, which together ensure $S_3 = o(T^2)$ from (S63).

Proof of $\Sigma_1 = o(T^2)$ in (S64). Assume without loss of generality that $c_0T_0^\beta < T_0 - 1$ such that for $t > T_0$, $\{e_t(z): z\in\mathcal Z\}$ depend only on the previous $\min\{t-1,\, c_0t^\beta\} = c_0t^\beta$ time points. For $t > T_0$, we have
\[
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \frac{1}{e_t(z)}\Big\} = 0 \ \text{ for } k < t - c_0t^\beta, \qquad
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \frac{1}{e_t(z)}\Big\} \le 2^{-1}\Big[\mathrm{var}\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\} + \mathrm{var}\{e_t(z)^{-1}\}\Big] \le 2^{-1}\Big(\max_{1\le s\le t}\gamma_s + v_t\Big) \ \text{ for } k\in[t-c_0t^\beta,\, t). \tag{S65}
\]
This ensures
\[
\Sigma_1 \overset{(S63)}{=} \sum_{t=T_0+1}^{T}\sum_{k=t-c_0t^\beta}^{t-1} E\{A_t(z)^2\}\cdot\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \frac{1}{e_t(z)}\Big\}
\overset{(S65)}{\le} \sum_{t=T_0+1}^{T} c_0t^\beta\cdot(L_t+M_t)^2\cdot 2^{-1}\Big(\max_{1\le s\le t}\gamma_s + v_t\Big)
\le 2^{-1}c_0T^\beta\sum_{t=1}^T(L_t+M_t)^2\cdot\Big(\max_{1\le s\le t}\gamma_s + v_t\Big)
\overset{\text{Lemma S10(iii)}}{=} c_0T^\beta\cdot o(T^{2-\beta}) = o(T^2).
\]

Proof of $\Sigma_2 = o(T^2)$ in (S64). For $k < t$, we have
\[
\Big|\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \Delta_t(z)\cdot\frac{1}{e_t(z)}\Big\}\Big|
= \Big| E\Big\{\frac{A_k(z)^2}{e_k(z)}\cdot\Delta_t(z)\cdot\frac{1}{e_t(z)}\Big\} - E\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\}\cdot E\Big\{\Delta_t(z)\cdot\frac{1}{e_t(z)}\Big\}\Big|
\le E\Big\{\frac{A_k(z)^2}{e_k(z)}\cdot|\Delta_t(z)|\cdot\frac{1}{e_t(z)}\Big\} + E\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\}\cdot E\Big\{|\Delta_t(z)|\cdot\frac{1}{e_t(z)}\Big\}
\]
\[
\le \Big\|\frac{A_k(z)^2}{e_k(z)}\Big\|_\infty\cdot E\{|\Delta_t(z)|\}\cdot\Big\|\frac{1}{e_t(z)}\Big\|_\infty + \Big\|\frac{A_k(z)^2}{e_k(z)}\Big\|_\infty\cdot E\{|\Delta_t(z)|\}\cdot\Big\|\frac{1}{e_t(z)}\Big\|_\infty
\le 2\cdot\Big\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\Big\}\cdot\sqrt{\phi_t}\cdot e_t^{-1}, \tag{S66}
\]
where the last inequality follows from $E\{|\Delta_t(z)|\} \le \sqrt{E\{|\Delta_t(z)|^2\}} = \sqrt{\mathrm{var}\{A_t(z)^2\}} \overset{(S51)}{\le} \sqrt{\phi_t}$.
This ensures
\[
T^{-2}|\Sigma_2| \overset{(S63)}{\le} T^{-2}\sum_{t=1}^{T}\sum_{k=1}^{t-1}\Big|\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)},\ \Delta_t(z)\cdot\frac{1}{e_t(z)}\Big\}\Big|
\overset{(S66)}{\le} T^{-2}\sum_{t=1}^T T\cdot 2\cdot\Big\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\Big\}\cdot\sqrt{\phi_t}\cdot e_t^{-1}
\overset{\text{Lemma S10(iii)}}{=} o(1).
\]

Proof of (S55) under Lemma S10(iv). The definition of $\gamma_t$ in (S51) ensures
\[
S_1 \overset{(S54)}{=} \sum_{t=1}^T\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S51)}{\le} \sum_{t=1}^T\gamma_t \le \sum_{t=1}^T\max_{1\le s\le t}\gamma_s \overset{\text{Lemma S10(iv)}}{=} o(T^{2-\beta}) = o(T^2).
\]
Next, assume without loss of generality that $c_0T_0^\beta < T_0 - 1$ such that for $t > T_0$, $\{e_t(z), \hat m_t(z): z\in\mathcal Z\}$ depend only on the previous $\min\{t-1,\, c_0t^\beta\} = c_0t^\beta$ time points. For $t > T_0$, we have
\[
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\} = 0 \ \text{ for } k < t - c_0t^\beta, \qquad
\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\} \le 2^{-1}\Big[\mathrm{var}\Big\{\frac{A_k(z)^2}{e_k(z)}\Big\} + \mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\}\Big] \le 2^{-1}(\gamma_k + \gamma_t) \le \max_{1\le s\le t}\gamma_s \ \text{ for } k\in[t-c_0t^\beta,\, t). \tag{S67}
\]
This ensures
\[
S_3 \overset{(S54)}{=} \sum_{t=T_0+1}^{T}\sum_{k=t-c_0t^\beta}^{t-1}\mathrm{cov}\Big\{\frac{A_k(z)^2}{e_k(z)}, \frac{A_t(z)^2}{e_t(z)}\Big\}
\overset{(S67)}{\le} \sum_{t=T_0+1}^{T} c_0t^\beta\cdot\max_{1\le s\le t}\gamma_s
\le c_0T^\beta\sum_{t=1}^T\max_{1\le s\le t}\gamma_s
\overset{\text{Lemma S10(iv)}}{=} c_0T^\beta\cdot o(T^{2-\beta}) = o(T^2).
\]

Lemma S11. Assume the adaptive randomization in Definition 1. Let
\[
V_{2,zz'} = T^{-1}\sum_{t=1}^T\big[A_t(z)A_t(z') - E\{A_t(z)A_t(z')\}\big] \quad\text{for } z, z'\in\mathcal Z. \tag{S68}
\]
As $T\to\infty$, we have $\mathrm{var}(V_{2,zz'}) = o(1)$ for all $z, z'\in\mathcal Z$ if either of the following conditions holds:

(i) Vanishing $\mathrm{var}\{\hat m_t(z)\}$: $T^{-1}\sum_{t=1}^T \phi_t = o(1)$.
(ii) Limited-range dependence of $\hat m_t(z)$: there exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\phi_s = o(T^{1-\beta})$ and, after some fixed time point $T_0$, $\{\hat m_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.

Proof of Lemma S11. The sufficiency of Lemma S11(i) follows from
\[
\mathrm{var}(V_{2,zz'}) \overset{(S68)}{=} \frac{1}{T^2}\,\mathrm{var}\Big\{\sum_{t=1}^T A_t(z)A_t(z')\Big\} \overset{(S52)}{\le} \frac{1}{T}\sum_{t=1}^T\mathrm{var}\{A_t(z)A_t(z')\} \overset{(S51)}{\le} \frac{1}{T}\sum_{t=1}^T\phi_t.
\]
We verify below the sufficiency of Lemma S11(ii). First, write
\[
T^2\,\mathrm{var}(V_{2,zz'}) \overset{(S68)}{=} \mathrm{var}\Big\{\sum_{t=1}^T A_t(z)A_t(z')\Big\}
= \sum_{t=1}^T\mathrm{var}\{A_t(z)A_t(z')\} + 2\sum_{t=2}^{T_0}\sum_{k=1}^{t-1}\mathrm{cov}\{A_k(z)A_k(z'),\, A_t(z)A_t(z')\} + 2\sum_{t=T_0+1}^{T}\sum_{k=1}^{t-1}\mathrm{cov}\{A_k(z)A_k(z'),\, A_t(z)A_t(z')\}
= S_1 + 2S_2 + 2S_3, \tag{S69}
\]
where $S_1$, $S_2$, and $S_3$ denote the three sums, respectively. The definition of $\phi_t$ in (S51) implies $\mathrm{var}\{A_t(z)A_t(z')\} \le \phi_t \le \max_{1\le s\le t}\phi_s$, so that
\[
S_1 \overset{(S69)}{=} \sum_{t=1}^T\mathrm{var}\{A_t(z)A_t(z')\} \le \sum_{t=1}^T\max_{1\le s\le t}\phi_s \overset{\text{Lemma S11(ii)}}{=} o(T^{2-\beta}) = o(T^2)
\]
under Lemma S11(ii). In addition, $S_2$ is a fixed finite number and therefore satisfies $S_2 = o(T^2)$. Given (S69), it suffices to verify that
\[
S_3 = o(T^2) \tag{S70}
\]
under Lemma S11(ii).

Proof of (S70) under Lemma S11(ii). Assume without loss of generality that $c_0T_0^\beta < T_0 - 1$ such that for $t > T_0$, $\{\hat m_t(z): z\in\mathcal Z\}$ depend only on the previous $\min\{t-1,\, c_0t^\beta\} = c_0t^\beta$ time points.
For $t > T_0$, we have
\[
\mathrm{cov}\{A_k(z)A_k(z'),\, A_t(z)A_t(z')\} = 0 \ \text{ for } k < t - c_0t^\beta, \qquad
\mathrm{cov}\{A_k(z)A_k(z'),\, A_t(z)A_t(z')\} \le 2^{-1}\big[\mathrm{var}\{A_k(z)A_k(z')\} + \mathrm{var}\{A_t(z)A_t(z')\}\big] \le 2^{-1}(\phi_k + \phi_t) \le \max_{1\le s\le t}\phi_s \ \text{ for } k\in[t-c_0t^\beta,\, t). \tag{S71}
\]
This ensures (S70) as follows:
\[
S_3 \overset{(S69)}{=} \sum_{t=T_0+1}^{T}\sum_{k=t-c_0t^\beta}^{t-1}\mathrm{cov}\{A_k(z)A_k(z'),\, A_t(z)A_t(z')\}
\overset{(S71)}{\le} \sum_{t=T_0+1}^{T} c_0t^\beta\cdot\max_{1\le s\le t}\phi_s
\le c_0T^\beta\sum_{t=1}^T\max_{1\le s\le t}\phi_s
\overset{\text{Lemma S11(ii)}}{=} c_0T^\beta\cdot o(T^{2-\beta}) = o(T^2).
\]

Lemma S12. Assume the adaptive randomization in Definition 1. If $\lambda_{\min}(V_{\mathrm{aipw}})$ is uniformly bounded away from 0 as $T\to\infty$, then a sufficient condition for Condition 3(i) is $\|\check V_{\mathrm{aipw}} - V_{\mathrm{aipw}}\|_{\mathrm F} = o_P(1)$, which holds if any of the following conditions is satisfied:

(i) $T^{-1}\sum_{t=1}^T\gamma_t = o(1)$; $T^{-1}\sum_{t=1}^T\phi_t = o(1)$.

(ii) $T^{-1}\sum_{t=1}^T\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\}\cdot(L_t+M_t)^2\cdot\sqrt{v_t} = o(1)$; there exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T(\max_{1\le s\le t}\gamma_s + \phi_t)\cdot e_t^{-1} = o(T^{1-\beta})$, $T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\phi_s = o(T^{1-\beta})$, and, after some fixed time point $T_0$, $\{\hat m_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.

(iii) $T^{-2}\sum_{t=1}^T\gamma_t = o(1)$; $T^{-1}\sum_{t=1}^T\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\}\cdot\sqrt{\phi_t}\cdot e_t^{-1} = o(1)$; $T^{-1}\sum_{t=1}^T\phi_t = o(1)$; there exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T(L_t+M_t)^2\cdot(\max_{1\le s\le t}\gamma_s + v_t) = o(T^{1-\beta})$ and, after some fixed time point $T_0$, $\{e_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.
(iv) There exist constants $\beta\in[0,1)$ and $c_0>0$ such that $T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\gamma_s = o(T^{1-\beta})$, $T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\phi_s = o(T^{1-\beta})$, and, after some fixed time point $T_0$, $\{e_t(z), \hat m_t(z): z\in\mathcal Z\}$ depend only on information from the $\min\{t-1,\, c_0t^\beta\}$ most recent units.

Proof of Lemma S12. Recall that
\[
V_{\mathrm{aipw}} = \mathrm{diag}\Big[T^{-1}\sum_{t=1}^T E\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\}\Big]_{z\in\mathcal Z} - T^{-1}\sum_{t=1}^T E(A_tA_t^\top), \qquad
\check V_{\mathrm{aipw}} = \mathrm{diag}\Big[T^{-1}\sum_{t=1}^T \frac{A_t(z)^2}{e_t(z)}\Big]_{z\in\mathcal Z} - T^{-1}\sum_{t=1}^T A_tA_t^\top.
\]
We have
\[
\check V_{\mathrm{aipw}} - V_{\mathrm{aipw}} = V_1 - V_2, \tag{S72}
\]
where $V_1 = \mathrm{diag}(V_{1,z})_{z\in\mathcal Z}$ and $V_2 = (V_{2,zz'})_{z,z'\in\mathcal Z}$, with
\[
V_{1,z} = T^{-1}\sum_{t=1}^T\Big[\frac{A_t(z)^2}{e_t(z)} - E\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\}\Big], \qquad
V_{2,zz'} = T^{-1}\sum_{t=1}^T\big[A_t(z)A_t(z') - E\{A_t(z)A_t(z')\}\big]
\]
as defined in (S53) and (S68), respectively. From (S72), we have
\[
\|\check V_{\mathrm{aipw}} - V_{\mathrm{aipw}}\|_{\mathrm F} = \|V_1 - V_2\|_{\mathrm F} \le \|V_1\|_{\mathrm F} + \|V_2\|_{\mathrm F} = \Big(\sum_{z\in\mathcal Z} V_{1,z}^2\Big)^{1/2} + \Big(\sum_{z,z'\in\mathcal Z} V_{2,zz'}^2\Big)^{1/2},
\]
so that $\|\check V_{\mathrm{aipw}} - V_{\mathrm{aipw}}\|_{\mathrm F} = o_P(1)$ if
\[
V_{1,z} = o_P(1), \quad V_{2,zz'} = o_P(1) \quad\text{for all } z, z'\in\mathcal Z. \tag{S73}
\]
Given $E(V_{1,z}) = E(V_{2,zz'}) = 0$ by definition, Lemma S5 (Chebyshev's inequality) ensures that a sufficient condition for (S73) is
\[
\mathrm{var}(V_{1,z}) = o(1), \quad \mathrm{var}(V_{2,zz'}) = o(1) \quad\text{for all } z, z'\in\mathcal Z. \tag{S74}
\]
Lemma S12(i) ensures (S74) by Lemma S10(i) and Lemma S11(i). Lemma S12(ii) ensures (S74) by Lemma S10(ii) and Lemma S11(ii). Lemma S12(iii) ensures (S74) by Lemma S10(iii) and Lemma S11(i). Lemma S12(iv) ensures (S74) by Lemma S10(iv) and Lemma S11(ii).

Lemma S13. For infinite sequences $(a_t)_{t=1}^\infty$ and $(b_t)_{t=1}^\infty$, write $a_t \lesssim b_t$ if there exists a constant $c_0 < \infty$ such that $a_t \le c_0 b_t$ for all $t$.
For all $t$ and $z, z'\in\mathcal Z$, we have
\[
\phi_t \lesssim \omega_t(L_t+M_t)^2, \qquad \gamma_t \lesssim \omega_t(L_t+M_t)^2\cdot e_t^{-2} + v_t(L_t+M_t)^4.
\]

Proof of Lemma S13. Recall that $|A_t(z)| = |Y_t(z) - \hat m_t(z)| \le L_t + M_t$ and $\mathrm{var}\{A_t(z)\} = \mathrm{var}\{\hat m_t(z)\} \le \omega_t$ by definition. The results follow from (S52) as follows:
\[
\mathrm{var}\{A_t(z)A_t(z')\} \overset{(S52)}{\le} 2\,\mathrm{var}\{A_t(z)\}\cdot\|A_t(z')\|_\infty^2 + 2\,\mathrm{var}\{A_t(z')\}\cdot\|A_t(z)\|_\infty^2 \le 4\,\omega_t(L_t+M_t)^2, \tag{S75}
\]
\[
\mathrm{var}\Big\{\frac{A_t(z)^2}{e_t(z)}\Big\} \overset{(S52)}{\le} 2\,\mathrm{var}\{A_t(z)^2\}\cdot\Big\|\frac{1}{e_t(z)}\Big\|_\infty^2 + 2\,\mathrm{var}\{e_t(z)^{-1}\}\cdot\|A_t(z)^2\|_\infty^2 \overset{(S75)}{\le} 8\,\omega_t(L_t+M_t)^2\cdot e_t^{-2} + 2\,v_t(L_t+M_t)^4.
\]

S4.2 Proof of Proposition S1

Proof of Proposition S1. Let
\[
b_1 = \max_{t=1,\dots,T}\omega_t\cdot e_t^{-2}, \qquad b_2 = \max_{t=1,\dots,T} v_t, \tag{S76}
\]
where we make their dependence on $T$ implicit for notational simplicity. When $Y_t(z)$ and $\hat m_t(z)$ are uniformly bounded, it follows from Lemma S13 that
\[
\phi_t \lesssim \omega_t, \quad \max_{t=1,\dots,T}\phi_t \lesssim b_1; \qquad \gamma_t \lesssim \omega_t\cdot e_t^{-2} + v_t, \quad \max_{t=1,\dots,T}\gamma_t \lesssim b_1 + b_2. \tag{S77}
\]
This ensures the results as follows.

(i) Proposition S1(i) ensures Lemma S12(i) as follows:
\[
T^{-1}\sum_{t=1}^T\gamma_t \overset{(S77)}{\lesssim} T^{-1}\sum_{t=1}^T\omega_t\cdot e_t^{-2} + T^{-1}\sum_{t=1}^T v_t \overset{\text{Proposition S1(i)}}{=} o(1), \qquad
T^{-1}\sum_{t=1}^T\phi_t \overset{(S77)}{\lesssim} T^{-1}\sum_{t=1}^T\omega_t \le T^{-1}\sum_{t=1}^T\omega_t\cdot e_t^{-2} \overset{\text{Proposition S1(i)}}{=} o(1).
\]

(ii) Proposition S1(ii) ensures
\[
T^{-1}\sum_{t=1}^T\Big(\max_{1\le s\le t} e_s^{-1}\Big)\cdot\sqrt{v_t} = o(1), \qquad b_1\cdot\Big(T^{-1}\sum_{t=1}^T e_t^{-1}\Big) \overset{(S76)}{=} o(T^{1-\beta}), \qquad b_2\cdot\Big(T^{-1}\sum_{t=1}^T e_t^{-1}\Big) \overset{(S76)}{=} o(T^{1-\beta}), \tag{S78}
\]
so that Lemma S12(ii) holds as follows:

• $T^{-1}\sum_{t=1}^T\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\}\cdot(L_t+M_t)^2\cdot\sqrt{v_t} \lesssim T^{-1}\sum_{t=1}^T(\max_{1\le s\le t} e_s^{-1})\cdot\sqrt{v_t} \overset{(S78)}{=} o(1)$;

• $T^{-1}\sum_{t=1}^T\big(\max_{1\le s\le t}\gamma_s + \phi_t\big)\cdot e_t^{-1} \le \big(\max_{t=1,\dots,T}\gamma_t + \max_{t=1,\dots,T}\phi_t\big)\cdot\big(T^{-1}\sum_{t=1}^T e_t^{-1}\big)$
$\overset{(S77)}{\lesssim} (b_1 + b_2 + b_1)\cdot\big(T^{-1}\sum_{t=1}^T e_t^{-1}\big) \overset{(S78)}{=} o(T^{1-\beta})$;

• $T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\phi_s \le \max_{t=1,\dots,T}\phi_t \overset{(S77)}{\lesssim} b_1 \le b_1\cdot\big(T^{-1}\sum_{t=1}^T e_t^{-1}\big) \overset{(S78)}{=} o(T^{1-\beta})$.

(iii) Proposition S1(iii) ensures
\[
T^{-1}\sum_{t=1}^T\omega_t = o(1), \qquad T^{-1}\sum_{t=1}^T\Big(\max_{1\le s\le t} e_s^{-1}\Big)\cdot\sqrt{\omega_t}\cdot e_t^{-1} = o(1), \qquad b_1 = o(T^{1-\beta}), \qquad b_2 = o(T^{1-\beta}), \tag{S79}
\]
so that Lemma S12(iii) holds as follows:

• $T^{-2}\sum_{t=1}^T\gamma_t \overset{(S77)}{\lesssim} T^{-2}\sum_{t=1}^T\omega_t\cdot e_t^{-2} + T^{-2}\sum_{t=1}^T v_t \overset{(S76)}{\le} T^{-1}b_1 + T^{-1}b_2 \overset{(S79)}{=} o(T^{-\beta}) = o(1)$;

• $T^{-1}\sum_{t=1}^T\{\max_{1\le s\le t}(L_s+M_s)^2/e_s\}\cdot\sqrt{\phi_t}\cdot e_t^{-1} \overset{(S77)}{\lesssim} T^{-1}\sum_{t=1}^T(\max_{1\le s\le t} e_s^{-1})\cdot\sqrt{\omega_t}\cdot e_t^{-1} \overset{(S79)}{=} o(1)$;

• $T^{-1}\sum_{t=1}^T\phi_t \overset{(S77)}{\lesssim} T^{-1}\sum_{t=1}^T\omega_t \overset{(S79)}{=} o(1)$;

• $T^{-1}\sum_{t=1}^T(L_t+M_t)^2\cdot\big(\max_{1\le s\le t}\gamma_s + v_t\big) \lesssim T^{-1}\sum_{t=1}^T\big(\max_{1\le s\le t}\gamma_s + v_t\big) \le \max_{t=1,\dots,T}\gamma_t + T^{-1}\sum_{t=1}^T v_t \overset{(S77)}{\lesssim} b_1 + b_2 + b_2 \overset{(S79)}{=} o(T^{1-\beta})$.

(iv) Proposition S1(iv) ensures $b_1 = o(T^{1-\beta})$ and $b_2 = o(T^{1-\beta})$, so that Lemma S12(iv) holds as follows:
\[
T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\gamma_s \le \max_{t=1,\dots,T}\gamma_t \overset{(S77)}{\lesssim} b_1 + b_2 = o(T^{1-\beta}), \qquad
T^{-1}\sum_{t=1}^T\max_{1\le s\le t}\phi_s \le \max_{t=1,\dots,T}\phi_t \overset{(S77)}{\lesssim} b_1 = o(T^{1-\beta}).
\]
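The elementary bounds in (S52), which drive Lemma S13 and the boundedness arguments above, can be spot-checked on a finite, equally likely sample space. This is an illustrative sketch with arbitrary numbers, not part of the proofs.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(100, 3))   # 100 equally likely outcomes, m = 3 variables
var = lambda v: np.mean((v - v.mean()) ** 2)
sup = lambda v: np.max(np.abs(v))       # finite-sample analogue of the sup norm

m = X.shape[1]
# var(sum_i X_i) <= m * sum_i var(X_i), the first bound in (S52)
assert var(X.sum(axis=1)) <= m * sum(var(X[:, i]) for i in range(m))

# var(X1 X2) <= 2 var(X1) ||X2||_inf^2 + 2 var(X2) ||X1||_inf^2, the second bound in (S52)
X1, X2 = X[:, 0], X[:, 1]
assert var(X1 * X2) <= 2 * var(X1) * sup(X2) ** 2 + 2 * var(X2) * sup(X1) ** 2
```

Both bounds hold for any finite-variance variables; the second follows by centering $X_1X_2$ at $E(X_1)E(X_2)$ and applying $(a+b)^2 \le 2a^2 + 2b^2$, which is exactly how Lemma S13 converts $\omega_t$ and $v_t$ into bounds on $\phi_t$ and $\gamma_t$.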
