Few Batches or Little Memory, But Not Both: Simultaneous Space and Adaptivity Constraints in Stochastic Bandits
Authors: Ruiyuan Huang∗ (School of Data Science, Fudan University; RuiyuanHuang00@gmail.com), Zicheng Lyu∗ (School of Data Science, Fudan University; lyuzicheng@gmail.com), Xiaoyi Zhu (School of Data Science, Fudan University; zhuxy22@m.fudan.edu.cn), Zengfeng Huang† (School of Data Science, Fudan University; Shanghai Innovation Institute; huangzf@fudan.edu.cn)

Abstract. We study stochastic multi-armed bandits under simultaneous constraints on space and adaptivity: the learner interacts with the environment in $B$ batches and has only $W$ bits of persistent memory. Prior work shows that each constraint alone is surprisingly mild: near-minimax regret $\widetilde{O}(\sqrt{KT})$ is achievable with $O(\log T)$ bits of memory under fully adaptive interaction, and with a $K$-independent $O(\log\log T)$-type number of batches when memory is unrestricted. We show that this picture breaks down in the simultaneously constrained regime. We prove that any algorithm with a $W$-bit memory constraint must use at least $\Omega(K/W)$ batches to achieve near-minimax regret $\widetilde{O}(\sqrt{KT})$, even under adaptive grids. In particular, logarithmic memory rules out $O(K^{1-\varepsilon})$ batch complexity. Our proof is based on an information bottleneck: we show that near-minimax regret forces the learner to acquire $\Omega(K)$ bits of information about the hidden set of good arms under a suitable hard prior, whereas an algorithm with $B$ batches and $W$ bits of memory can acquire only $O(BW)$ bits of information. A key ingredient is a localized change-of-measure lemma that yields probability-level arm-exploration guarantees and is of independent interest.
We also give an algorithm that, for any bit budget $W$ with $\Omega(\log T) \le W \le O(K \log T)$, uses at most $W$ bits of memory and $\widetilde{O}(K/W)$ batches while achieving regret $\widetilde{O}(\sqrt{KT})$, nearly matching our lower bound up to polylogarithmic factors.

1 Introduction

The stochastic multi-armed bandit (MAB) problem is a standard model for sequential decision making. At each round $t \in [T]$, a learner chooses an arm $A_t \in [K]$ and observes a stochastic reward with mean $\mu_{A_t}$. The goal is to minimize the expected pseudo-regret
$$\mathbb{E}[\mathrm{Reg}_T] := \mathbb{E}\Big[\sum_{t=1}^{T} \big(\mu^\star - \mu_{A_t}\big)\Big], \qquad \mu^\star := \max_{i \in [K]} \mu_i.$$
[∗ Equal contribution. † Corresponding author.]

Classical algorithms such as UCB [ACBF02] and Thompson sampling [Tho33, AG12] achieve the minimax rate $\widetilde{O}(\sqrt{KT})$. These guarantees are typically established in an idealized model that implicitly grants two resources: enough memory to maintain rich per-arm statistics, and full adaptivity, meaning that the learner may update its policy after every observation.

In many large-scale experimentation systems, such as online advertising and recommendation systems, neither resource is always abundant. When $K$ is large, storing per-arm state in fast memory can be costly. At the same time, feedback is often processed asynchronously in system pipelines, so policy updates occur only periodically rather than after every interaction.

From a theoretical perspective, these two limitations correspond to constraints on two canonical computational resources: space and adaptivity. A space constraint limits how many bits of memory a learner can use, while an adaptivity constraint limits how frequently a learner may adapt its interaction with the environment based on past feedback. Understanding how statistical performance depends on these resources, and whether one resource can compensate for the other, is therefore of fundamental interest.
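As a quick illustration of the pseudo-regret objective, here is a minimal sketch; the toy instance and the uniform-random policy are our own choices for illustration, not anything from the paper:

```python
import random

def pseudo_regret(means, actions):
    """Reg_T = sum_t (mu_star - mu_{A_t}).

    Pseudo-regret depends only on the arm means and the chosen arms,
    not on the realized stochastic rewards.
    """
    mu_star = max(means)
    return sum(mu_star - means[a] for a in actions)

# Toy instance: two good arms (mean 1/2) and two bad arms (mean 0).
means = [0.5, 0.5, 0.0, 0.0]
rng = random.Random(0)
actions = [rng.randrange(len(means)) for _ in range(1000)]  # uniform play
# Uniform play hits a bad arm about half the time, paying 1/2 each time,
# so the pseudo-regret comes out near 1000 * (1/2) * (1/2) = 250.
print(pseudo_regret(means, actions))
```

Any policy that concentrates its pulls on the good arms drives this quantity toward zero, which is the tension the paper's lower bound exploits.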
In this paper, we study stochastic bandits under simultaneous constraints on both resources.

Space constraint. We model the space constraint by restricting the learner to $W$ bits of persistent memory, meaning that the state carried from one round to the next can take at most $2^W$ values. Computation within a round is unrestricted; only the state retained between rounds is limited. This is a natural model of space-constrained computation. In bandit theory, it dates back to the finite-memory bandit formulations of [Cov68, CH03], and is also consistent with the modern sublinear-space viewpoint of [LSPY18]. Under the standard word-RAM convention with word size $\Theta(\log T)$, $W = \Theta(\log T)$ corresponds to $O(1)$ machine words.

Adaptivity constraint. We consider the batched bandit model [PRCS16, GHRZ19], in which the horizon is partitioned into $B$ batches and outcomes are only incorporated between batches. At the beginning of each batch, the learner must commit to the action-generation rule for that batch using only information available at the previous batch boundary. Equivalently, feedback observed within a batch cannot affect later actions in the same batch. Following [GHRZ19], we consider both the static-grid and the adaptive-grid variants. The static grid assumes that the batch boundaries are predetermined and fixed in advance, while the adaptive grid allows the learner to choose the next batch boundary using information available at the previous boundary.

Viewed separately, each constraint appears surprisingly mild. Stochastic bandits admit near-optimal algorithms using only $O(\log T)$ bits of memory under fully adaptive interaction [CH03, LSPY18], while batched bandits attain near-minimax regret with only $O(\log\log T)$ batches when memory is unrestricted [PRCS16, GHRZ19, JTX+21, EKMM21].
These positive results rely on the other resource being abundant: space-bounded algorithms exploit frequent updates, whereas few-batch algorithms retain rich statistics across batch boundaries. We consider the joint regime where the learner has a $W$-bit memory budget and interacts with the environment in $B$ batches. Our results show that this regime is qualitatively different from either constraint in isolation: once the space budget is logarithmic, near-minimax regret is no longer compatible with an $O(K^{1-\varepsilon})$ number of batches.

Theorem 1 (Informal). For sufficiently large $T$, any $B$-batch stochastic bandit algorithm with $W$ bits of persistent memory and near-minimax regret $\widetilde{O}(\sqrt{KT})$ must satisfy $B = \Omega(K/W)$, even if the batch grid is allowed to be adaptive.

A formal version of Theorem 1 is given in Section 4. At a high level, the lower bound comes from an information bottleneck at batch boundaries. All information gathered within a batch can influence future batches only through the $W$-bit state retained at the batch boundary. With frequent feedback, a learner can recycle a small amount of memory over time; with abundant memory, it can compensate for few updates by storing accurate per-arm summaries. Under simultaneous constraints, neither mechanism is available. The learner must either explore many arms and retain only a coarse summary, or keep precise information about only a small subset of arms and therefore require many batches before covering all $K$ arms. Our results show that this bottleneck fundamentally reshapes the batch-regret trade-off.

We also show that this lower bound is essentially tight.

Theorem 2 (Informal). For every bit budget $\Omega(\log T) \le W \le O(K \log T)$, there exists a stochastic static-grid bandit algorithm using at most $W$ bits of persistent memory and $\widetilde{O}(K/W)$ batches that achieves near-minimax regret $\widetilde{O}(\sqrt{KT})$.
A formal version of Theorem 2 is given in Section 5. Here the $\Omega(\log T) \le W$ lower bound is needed to represent the relevant counters and empirical means to inverse-polynomial precision in $T$, while the $W \le O(K \log T)$ upper bound is without loss of generality, since standard finite-armed bandit algorithms only need to store per-arm summaries for the $K$ arms.

Our contributions. We formalize stochastic bandits under simultaneous space and adaptivity constraints and show that limited space fundamentally increases the batch complexity of near-minimax learning. Our main contributions are threefold.

• A general lower bound under limited memory. We prove that, for sufficiently large $T$, any algorithm with a $W$-bit memory budget that achieves the standard near-minimax regret rate $\widetilde{O}(\sqrt{KT})$ must use at least $\Omega(K/W)$ batches, even under adaptive grids.

• Two technical ingredients of independent interest. In proving our lower bound, we establish two technical lemmas of independent interest: (i) a probability-level per-good-arm exploration lemma, and (ii) a localized change-of-measure lemma for under-sampling events, which lets the proof pay only for the first $n$ pulls of the perturbed arm.

• A near-matching upper bound across memory budgets. We propose a static-grid algorithm that, for any bit budget $\Omega(\log T) \le W \le O(K \log T)$, uses at most $W$ bits of persistent memory and at most $\widetilde{O}(K/W)$ batches, and achieves near-minimax regret $\widetilde{O}(\sqrt{KT})$.

Remark 3. The adaptive-grid model has stronger algorithmic power than the static-grid model, because the learner can choose the batch boundaries adaptively. Our lower bound is proved for the algorithmically stronger adaptive-grid model, while our upper bound is achieved in the algorithmically weaker static-grid model. Therefore, in each setting, our result is established in the strongest possible sense.
1.1 Technical Overview

The introduction argued informally that simultaneous constraints on space and adaptivity create a bottleneck in how information can be accumulated and reused over time. We now sketch how this intuition becomes a proof. Our main technical contribution is a lower bound driven by this bottleneck.

A hard prior and the lower bound proof roadmap. For simplicity, assume that $K$ is even. For each subset $S \subset [K]$ with $|S| = K/2$, let $\nu_S$ denote the Bernoulli instance in which arms in $S$ have mean $1/2$ and the remaining arms have mean $0$. Let $S^\star \subset [K]$ be drawn uniformly from all subsets of size $K/2$, and let $\nu_{S^\star}$ be the realized base instance.

On this prior, a near-minimax learner must satisfy two competing requirements. First, it must avoid allocating a substantial exploration budget to bad arms, since every pull of a bad arm incurs regret $1/2$. Second, it cannot afford to ignore too many good arms: for every good arm there is a nearby instance in which that arm becomes uniquely optimal, so uniform regret forces the learner to remain prepared for many such perturbations.

We capture this tension through a thresholded sampling profile $Y \in \{0,1\}^K$ that records which arms receive exploration on the natural per-arm scale $\widetilde{\Theta}(T/K)$. The lower bound then has three steps. We first show that bad arms rarely cross this threshold, while every good arm crosses it with constant probability; equivalently, conditioned on $S^\star$, a constant fraction of the good arms cross in expectation. This makes $Y$ a noisy coordinatewise predictor of the hidden good-arm set and yields a linear information lower bound, $I(S^\star; Y) = \Omega(K)$. We then show that a $B$-batch protocol with only $W$ bits of persistent memory can transmit at most $O(BW)$ bits of instance-dependent information across batch boundaries. Comparing the two bounds gives the lower bound.
Below we explain our proof in more detail.

A thresholded sampling profile. The constant gap between good and bad arms is what makes this family useful for a regret lower bound. Under any base instance $\nu_S$, every pull of a bad arm incurs regret $1/2$, whereas pulling a good arm incurs no regret. Hence a near-minimax learner cannot afford to spend many rounds on bad arms and must concentrate most of its budget on good arms. This makes it natural to distinguish arms according to whether they receive exploration on the average per-arm scale, namely about $\widetilde{\Theta}(T/K)$ pulls. We therefore introduce a threshold $n = \widetilde{\Theta}(T/K)$ and define the thresholded sampling profile
$$Y_i := \mathbb{I}\{N_i(T) \ge n\}, \qquad Y \in \{0,1\}^K,$$
where $N_i(T)$ denotes the number of pulls of arm $i$ up to the horizon $T$. Thus $Y_i$ records whether arm $i$ received a nontrivial share of the exploration budget.

A bound on false positives is immediate. If a bad arm crosses the threshold, then the learner has already paid $\Theta(n)$ regret on that arm. Hence low regret directly implies that not too many bad arms can satisfy $Y_i = 1$. The harder direction is to show that good arms cannot be ignored either.

Why false-negative control is the main difficulty. What is missing is the complementary statement that genuinely good arms must also cross the threshold with constant probability. This is the delicate part of the proof. Let $\nu_{S^\star}$ be the realized instance under this prior; on that instance alone, one might imagine that it suffices to find some good arm and then exploit it. However, we prove that every good arm must cross the threshold with constant marginal probability. The key point is that the regret guarantee is uniform over all instances, not tailored to a single $\nu_S$.
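In code, the profile $Y$ is just a coordinatewise threshold on the pull counts; a minimal sketch, with pull counts made up for illustration:

```python
def thresholded_profile(pull_counts, n):
    """Y_i = 1{N_i(T) >= n}: records which arms got at least n pulls."""
    return [1 if count >= n else 0 for count in pull_counts]

# Hypothetical pull counts over K = 4 arms with threshold n = 50:
# arms 0 and 2 received a nontrivial share of the budget, arms 1 and 3 did not.
print(thresholded_profile([120, 3, 95, 0], 50))  # -> [1, 0, 1, 0]
```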
For every $j \in S^\star$, we consider a neighboring instance $\nu_{S^\star}^{(j)}$ in which arm $j$ has mean $1/2 + \Delta_0$, for some perturbation size $\Delta_0 > 0$ to be specified later, while all other arm means remain the same as in $\nu_{S^\star}$. Thus arm $j$ becomes uniquely optimal under $\nu_{S^\star}^{(j)}$. If the learner largely ignores arm $j$ under the base instance, then it is poorly positioned for $\nu_{S^\star}^{(j)}$ and would incur large regret there. Hence no genuinely good arm can be safely ignored. This is what turns the hard prior into a source of linear information: the learner is not merely required to identify one arm worth exploiting; it must gather coordinatewise evidence about many membership bits of $S^\star$.

Why probability-level exploration requires localization. To make the previous intuition formal, we need a probability-level statement: for every good arm $i$, $\mathbb{P}_{\nu_S}(N_i(T) \ge n) \ge \beta$ for some constant $\beta > 0$. An expectation lower bound on $N_i(T)$ is too weak for this purpose. Indeed, a learner could ignore arm $i$ on most trajectories and explore it extremely heavily only on a tiny exceptional set of runs. Such bursty exploration can keep $\mathbb{E}[N_i(T)]$ at scale $n$ while making $\mathbb{P}(N_i(T) \ge n)$ arbitrarily small. On the base instance $\nu_S$, this pathology is not ruled out by regret, because replacing pulls of one good arm by pulls of other good arms incurs no additional regret.

A natural route is therefore to argue that, on the perturbed instance $\nu_S^{(i)}$, near-minimax regret forces arm $i$ to be sampled at least $n$ times with sufficiently large probability, and then to transfer this statement back to the base instance $\nu_S$. The obstacle is that standard transcript-level change-of-measure bounds are poorly aligned with the under-sampling event $E_i := \{N_i(T) < n\}$.
Standard change-of-measure bounds compare the full laws of the two experiments and therefore pay for all pulls of arm $i$, whereas the event $E_i$ depends only on whether the first $n$ pulls of arm $i$ ever occur. A useful toy picture is the following: under the perturbed instance, the learner may use the first $n$ pulls of arm $i$ to discover that $i$ is special and then exploit $i$ for the remaining $T - n$ rounds. The event $E_i$ is already decided after those first $n$ pulls, so all later exploitation pulls are irrelevant to $E_i$. A standard full-transcript comparison nevertheless pays for all pulls of arm $i$, including those post-identification exploitation pulls, and can therefore incur a cost as large as order $T\Delta_0^2$, which is far too large for our parameter regime.

Our localized change-of-measure lemma truncates this comparison to the first $n$ pulls relevant to $E_i$, so the transfer cost is only of order $n\Delta_0^2$. With the perturbation gap chosen at the nearly minimax-detectable scale $\Delta_0 = \widetilde{\Theta}(\sqrt{K/T})$, we take $n \asymp 1/\Delta_0^2 = \widetilde{\Theta}(T/K)$, so that $n\Delta_0^2 = O(1)$. This is exactly what allows the desired probability-level exploration guarantee to be transferred back from $\nu_S^{(i)}$ to $\nu_S$.

From thresholded exploration to an $\Omega(K)$ information requirement. The previous two steps give exactly the ingredients needed to interpret the thresholded profile $Y$ as a noisy classifier of the hidden membership bits $X_i := \mathbb{I}\{i \in S^\star\}$. False positives are controlled because a bad arm that crosses the threshold already incurs $\Theta(n)$ regret, while false negatives are controlled because the localized perturbation argument shows that each good arm crosses the threshold with constant probability. Consequently, the naive predictor $\hat{X}_i := Y_i$ has average error bounded away from $1/2$. Equivalently, the average Bayes error for predicting $X_i$ from $Y_i$ is strictly smaller than random guessing by a constant margin.
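For concreteness, the cost bookkeeping behind the localization step can be written out, using the explicit parameter choices that appear later in Lemma 5:

```latex
% Localized vs. full-transcript change-of-measure cost.
% Perturbation at the nearly minimax-detectable scale:
\Delta_0 = \widetilde{\Theta}\!\left(\sqrt{K/T}\right),
\qquad
n \asymp \frac{1}{\Delta_0^2} = \widetilde{\Theta}\!\left(\frac{T}{K}\right).
% A full-transcript comparison pays for every pull of the perturbed arm:
T\,\Delta_0^2 = \widetilde{\Theta}(K) \quad (\text{too large}),
% whereas the localized comparison pays only for the first n pulls:
n\,\Delta_0^2
= \frac{\alpha\, T}{C^2 K \log^2(T) \log^2(K)}
  \cdot \frac{c_\Delta^2\, C^2 K \log^2(T) \log^2(K)}{T}
= \alpha\, c_\Delta^2 = O(1).
```

The last identity holds by construction: the polylogarithmic factors in $n$ and $\Delta_0^2$ cancel exactly, leaving only the absolute constants.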
A chain-rule / per-coordinate entropy argument then upgrades this coordinatewise predictive advantage into the global information lower bound $I(S^\star; Y) = \Omega(K)$.

From information to batches. The second half of the lower bound uses the batched nature of the algorithm. Once a batch starts, the full within-batch action rule is fixed using only the information available at that batch boundary. Therefore the only instance-dependent information that can pass from one batch to the next is what is retained in persistent memory at the boundaries. We show that the sampling profile $Y$ is a deterministic function of the seed and the boundary-memory transcript; in the adaptive-grid case, the realized batch boundaries are themselves recursively determined by the same objects. Since the seed is independent of $S^\star$ and each boundary state carries at most $W$ bits, the entire boundary-memory transcript has entropy at most $O(BW)$. A data-processing argument therefore yields $I(S^\star; Y) \le O(BW)$. Comparing this with the lower bound on the same quantity gives $B = \Omega(K/W)$ in the large-horizon regime.

Upper bound intuition. The upper bound matches the same bottleneck from the algorithmic side. Instead of storing a large active set of candidate arms, the algorithm keeps only an incumbent arm, a small summary of its performance, and a block of at most $S$ challenger arms at a time. These challengers are compared with the incumbent through a multistage test, after which the algorithm moves on to the next block. This blockwise scan explicitly trades memory for adaptivity: with block size $S$, the implementation uses $O(S \log T)$ bits of memory and the number of batches drops to $\widetilde{O}(K/S)$. Equivalently, for any bit budget $W$ with $\Omega(\log T) \le W \le O(K \log T)$, choosing $S \asymp W/\log T$ gives an implementation using at most $W$ bits and $\widetilde{O}(K/W)$ batches.
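The memory-for-adaptivity trade in the blockwise scan can be sanity-checked numerically. This is a sketch only: `blockwise_budget` is our own illustrative helper, it assumes one $\log_2 T$-bit word per tracked arm, and it ignores the constants and polylogarithmic factors hidden in the $\widetilde{O}(\cdot)$ notation:

```python
import math

def blockwise_budget(K, T, W):
    """Accounting for the blockwise scan: with block size S ~ W / log T,
    tracking a block of S challengers costs about S * log2(T) bits,
    and scanning all K arms takes about K / S blocks.
    """
    words = int(math.log2(T))            # one word = log2(T) bits
    S = max(1, W // words)               # challengers per block
    bits_used = S * words                # memory for per-arm summaries
    num_blocks = math.ceil(K / S)        # ~ K/S = K log(T)/W blocks of batches
    return S, bits_used, num_blocks

# Example: K = 1024 arms, horizon T = 2**20, budget W = 200 bits.
print(blockwise_budget(1024, 2**20, 200))  # -> (10, 200, 103)
```

Shrinking $W$ makes the blocks smaller and the scan longer, which is exactly the $\widetilde{O}(K/W)$ batch complexity of Theorem 2.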
The resulting algorithm nearly matches the lower bound up to polylogarithmic factors. Note that the $W \le O(K \log T)$ upper bound is without loss of generality, since standard finite-armed bandit algorithms only need to store per-arm summaries for the $K$ arms.

Organization. After discussing related work in Section 2, Section 3 formalizes the space-bounded batched bandit model. Section 4 proves the lower bound in three steps: a probability-level exploration argument via localized one-arm perturbations, an information-acquisition lower bound, and a batch-memory bottleneck bound. Section 5 presents a nearly matching upper bound. Section 6 concludes with open directions.

2 Related Work

Our work lies at the intersection of two lines of research in bandit theory: limited adaptivity and space-bounded learning. We briefly review these two literatures and then explain how our setting differs from both.

Limited adaptivity. A large literature studies bandits under restrictions on how often the learner may update its policy. For stochastic batched bandits, early results include [PRCS16] for the two-armed case and [GHRZ19], which developed the BaSE policy and established near-optimal guarantees and lower bounds for general $K$-armed problems. Subsequent work extended these ideas to broader settings, including stochastic, linear, and adversarial bandits [EKMM21], Thompson-sampling-based batched methods [KO21], and anytime batched algorithms with near-optimal $O(\log\log T)$-type batch complexity [JTX+21]. Related guarantees are also known in structured settings such as finite-action linear contextual bandits [HZZ+20, RYZ21], high-dimensional sparse linear contextual bandits [RZ24], generalized linear contextual bandits [RZK20, SDBS24], and large-action-space contextual linear bandits [HYF23, RJX24].
Related formulations restrict adaptivity through switching costs, delayed feedback, or more general round-complexity constraints [ADT12, CBDS13, JGS13, VCL+20, AAAK17, JJNZ16]. However, these works typically bound how often the policy may change while allowing the learner to retain arbitrarily rich information between updates.

Space-bounded learning. Learning with limited memory has been studied broadly in learning theory [Raz16, GRT17, MM17, MT17]. Within bandits, classical antecedents on finite-memory models go back to [Cov68, CH03], and modern stochastic bandit results show that one can achieve strong regret guarantees with constant or logarithmic space under fully sequential interaction [LSPY18]. Related memory-limited variants include memory-constrained adversarial bandits [XZ21], limited-memory approaches based on subsampling or forgetting [BRC21], and bounded arm-memory models in which the learner may keep statistics for only a limited number of arms at a time [CK20, MPK21]. A different notion appears in streaming bandits, where arms arrive in a stream and the learner is constrained by pass and storage limitations [AW20, JHTX21, AKP22, AW22, Wan23, LZWL23, AW24].

These models are quite different from ours. In the streaming formulations above, the available arm set changes over time, whereas in our setting the arm set is fixed and always accessible. Moreover, streaming algorithms are typically fully adaptive within a pass, while in our batched model the learner cannot use feedback observed within a batch to change later actions in that batch. This distinction is also reflected in the complexity trade-off: in streaming bandits, one can recover the optimal regret rate up to logarithmic factors with constant space and only $O(\log\log T)$ passes, whereas in our batched model, recovering the same rate with $O(\log T)$ bits of memory requires $\widetilde{\Omega}(K)$ batches.
Furthermore, the techniques used in proving lower bounds are fundamentally different between streaming bandits and our setting.

Our setting. The works closest in spirit to ours are batched bandits and finite-memory bandits. Relative to the batched-bandit literature, we additionally impose an explicit bound on the persistent memory retained across batch boundaries. Relative to the space-bounded bandit literature, we additionally restrict adaptivity by allowing policy updates only between batches. We study this joint regime and show that combining the two constraints creates a qualitatively stronger bottleneck than either constraint alone.

3 Preliminaries

3.1 Notation

For $n \ge 1$, write $[n] := \{1, \ldots, n\}$, and let $\mathbb{I}\{E\}$ denote the indicator of an event $E$. We use $\log$ for the natural logarithm and $\log_2$ for the base-2 logarithm. For a bandit instance $\nu = (P_1, \ldots, P_K)$, let $\mathbb{P}_\nu$ and $\mathbb{E}_\nu$ denote probability and expectation under the law induced by $\nu$ and the algorithm; when the instance itself is random, we write $\mathbb{P}$ and $\mathbb{E}$ for the corresponding mixture law. We use $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ to suppress polylogarithmic factors in $K$ and $T$, and write $H(\cdot)$ and $I(\cdot\,;\cdot)$ for Shannon entropy and mutual information, measured in bits.

3.2 Stochastic multi-armed bandits

We consider a $K$-armed stochastic bandit with horizon $T$. Arm $i \in [K]$ has reward distribution $P_i$ supported on $[0,1]$ and mean $\mu_i := \mathbb{E}_{X \sim P_i}[X]$. At each round $t \in [T]$, the learner chooses an arm $A_t \in [K]$ and observes a reward $R_t \sim P_{A_t}$; conditional on the instance, rewards are independent across pulls. Let
$$\mu^\star := \max_{i \in [K]} \mu_i, \qquad \mathrm{Reg}_T := \sum_{t=1}^{T} (\mu^\star - \mu_{A_t}).$$
We measure performance by the expected pseudo-regret $\mathbb{E}_\nu[\mathrm{Reg}_T]$.

3.3 Space-bounded bandits

A learner with a $W$-bit persistent-memory budget maintains a persistent state $M_t \in \mathcal{M}$ after each round, where $|\mathcal{M}| \le 2^W$.
The algorithm may use a random seed $U$, sampled once at time 0 and independent of the bandit instance; $U$ is not counted toward the space budget. When convenient, all algorithmic randomization is absorbed into $U$. We do not restrict computation time, only the cross-round state. In the fully sequential model, a $W$-bit learner is specified by measurable maps such that, for each round $t \in [T]$,
$$A_t = \rho_t(M_{t-1}, U), \qquad M_t = \phi_t(M_{t-1}, U, A_t, R_t),$$
with a fixed initial state $M_0 \in \mathcal{M}$.

3.4 Batched bandits

Let $\mathcal{F}_t := \sigma(U, A_1, R_1, \ldots, A_t, R_t)$ for $t = 0, 1, \ldots, T$, with the convention $\mathcal{F}_0 := \sigma(U)$, denote the natural filtration of the interaction. In a $B$-batch protocol, the horizon is partitioned into consecutive batches $0 = t_0 < t_1 < \cdots < t_B = T$, where batch $b \in [B]$ consists of rounds $\{t_{b-1}+1, \ldots, t_b\}$. The batching constraint is that, at time $t_{b-1}$, the learner must choose the next boundary $t_b$ and the entire within-batch action rule for rounds $t_{b-1}+1, \ldots, t_b$ as $\mathcal{F}_{t_{b-1}}$-measurable objects. Equivalently, rewards observed during batch $b$ cannot affect later actions in the same batch.

We consider two variants. (i) In the static-grid model, the full grid $(t_1, \ldots, t_B)$ is fixed before interaction; equivalently, it is $\mathcal{F}_0$-measurable. (ii) In the adaptive-grid model, each $t_b$ is $\mathcal{F}_{t_{b-1}}$-measurable, but must still be chosen before batch $b$ begins.

3.5 The combined model: space-bounded and batched bandits

We now combine batching with the $W$-bit memory constraint. The learner maintains a persistent state $M_t \in \mathcal{M}$ with $|\mathcal{M}| \le 2^W$. All algorithmic randomness is absorbed into a seed $U$, sampled once at time 0 and not counted toward the memory budget. The current time index is public and likewise not counted toward the memory budget.
Thus, at the start of batch $b$, the learner's batch boundary and within-batch action plan may depend only on $(t_{b-1}, M_{t_{b-1}}, U)$, which serves as a memory-constrained summary of the full pre-boundary history $\mathcal{F}_{t_{b-1}}$. Formally, a $B$-batch, $W$-bit algorithm is specified by measurable maps as follows. For each internal boundary $b \in [B-1]$,
$$t_b = \psi_b(t_{b-1}, M_{t_{b-1}}, U), \qquad t_{b-1} < t_b < T,$$
and we set $t_B := T$. For each batch $b \in [B]$ and round $t \in \{t_{b-1}+1, \ldots, t_b\}$,
$$A_t = \rho_{b,t}(t_{b-1}, M_{t_{b-1}}, U), \qquad M_t = \phi_t(M_{t-1}, U, A_t, R_t).$$
Rewards may be observed and the state $M_t$ updated within a batch, but this does not create within-batch adaptivity, since the maps $\rho_{b,t}$ are fixed at time $t_{b-1}$. Hence information gathered during batch $b$ can influence later batches only through the boundary state $M_{t_b}$.

In the static-grid model, the internal boundaries $t_1, \ldots, t_{B-1}$ are fixed before interaction (equivalently, they may depend on $U$ but not on observed rewards). In the adaptive-grid model, the realized internal grid $\tau := (t_1, \ldots, t_{B-1})$ is recursively determined by $U$ and the boundary states $M_{t_0}, M_{t_1}, \ldots, M_{t_{B-1}}$. We will use this fact in Section 4.3 when bounding the information that can be transmitted across batches.

Finally, this formulation is at least as permissive as the classical delayed-feedback batched model for lower-bound purposes: rewards may be observed and the internal state updated within a batch, but such information still cannot affect later actions in the same batch. Therefore any lower bound proved here also applies to the classical delayed-feedback batched setting.

4 Lower Bounds on Batch Complexity under Space Constraints

This section proves the formal lower bound stated in Theorem 1.
We establish the result directly for the more permissive adaptive-grid model; the same conclusion therefore holds a fortiori for static-grid algorithms, which form a special case.

At a high level, we construct a Bernoulli hard family in which half of the arms are good and half are bad, and study a thresholded sampling profile that records which arms are pulled on the natural per-arm scale. The proof has three steps. First, a regret argument together with a localized one-arm perturbation argument shows that bad arms cannot cross the threshold too often, while each good arm must cross it with constant probability. Second, this makes the thresholded profile carry linear information about the hidden good-arm set. Third, batching with a $W$-bit persistent-memory budget allows at most $O(BW)$ bits of instance-dependent information to pass across batch boundaries. Comparing the two bounds yields $B = \Omega(K/W)$. We now formalize this argument.

Bernoulli subclass. For $p \in [0,1]$, let $\mathrm{Ber}(p)$ denote the Bernoulli distribution on $\{0,1\}$ with mean $p$. Since the hard family used below is Bernoulli, it suffices to prove the lower bound on
$$\mathcal{V}_{\mathrm{Ber}} := \big\{ (\mathrm{Ber}(\mu_1), \ldots, \mathrm{Ber}(\mu_K)) : \mu_1, \ldots, \mu_K \in [0,1] \big\}.$$
Any regret guarantee over a larger class of stochastic bandit instances, such as all $[0,1]$-valued reward distributions, in particular implies the same guarantee on $\mathcal{V}_{\mathrm{Ber}}$.

Standing notation for this section. Throughout this section, we write $M := (M_{t_1}, \ldots, M_{t_{B-1}})$ for the collection of persistent memory states at the batch boundaries. This is only a notational device: we do not mean that the algorithm explicitly stores the entire tuple $M$ simultaneously. We also write $N_i(t)$ for the number of pulls of arm $i$ up to time $t$, and $N_i := N_i(T)$.

Theorem 1 (Formal). Suppose $K$ is even. There exist universal constants $c, \gamma_0 > 0$ such that the following holds.
Let $\mathcal{A}$ be any adaptive-grid $B$-batch stochastic bandit algorithm with persistent state space $\mathcal{M}$ satisfying $|\mathcal{M}| \le 2^W$. If, for some constant $C \ge 1$,
$$\sup_{\nu \in \mathcal{V}_{\mathrm{Ber}}} \mathbb{E}_\nu[\mathrm{Reg}_T] \le C \sqrt{KT} \log T \log K, \tag{1}$$
with $T \ge c\, C^6 K \log^6 T \log^6 K$, then
$$B \ge \gamma_0 \, \frac{K-1}{2 \log_2(2K)\, W}.$$
In particular, $B = \Omega(K/W)$.

Theorem 1 applies to any $\widetilde{O}(\sqrt{KT})$ regret bound with minor modifications; we adopt the concrete rate $O(\sqrt{KT} \log T \log K)$ here to exactly match the known batched-bandit upper bound [GHRZ19]. Thus, the theorem shows that even attaining this rate under a $W$-bit persistent-memory budget requires batch complexity at least on the order of $K/W$.

Definition 4 (Bernoulli hard family and thresholded sampling profile). Assume $K$ is even, and let $\mathcal{S}_{K/2} := \{S \subset [K] : |S| = K/2\}$. For each $S \in \mathcal{S}_{K/2}$, let $\nu_S \in \mathcal{V}_{\mathrm{Ber}}$ denote the bandit instance with arm distributions
$$P_i = \begin{cases} \mathrm{Ber}(1/2), & i \in S, \\ \mathrm{Ber}(0), & i \notin S. \end{cases}$$
Equivalently, the arm means satisfy $\mu_i = 1/2$ for $i \in S$ and $\mu_i = 0$ for $i \notin S$. We place the prior $S^\star \sim \mathrm{Unif}(\mathcal{S}_{K/2})$ on this family. Given a threshold $n$, define the thresholded sampling profile
$$Y_i := \mathbb{I}\{N_i(T) \ge n\}, \qquad Y = (Y_1, \ldots, Y_K) \in \{0,1\}^K.$$
For $S \in \mathcal{S}_{K/2}$, $j \in S$, and $\Delta > 0$, let $\nu_S^{(j,\Delta)}$ denote the one-arm perturbation of $\nu_S$ obtained by replacing the reward law of arm $j$ by $\mathrm{Ber}(1/2 + \Delta)$ and leaving all other arms unchanged. When the perturbation size is fixed, we abbreviate this as $\nu_S^{(j)}$.

Proof roadmap. We now spell out the three ingredients announced above. For readability, the third step is stated below as a coarse $BW$ upper bound, although the proof actually yields the slightly sharper $(B-1)W$ bound.

(A) Per-good-arm exploration (Section 4.1). We prove that for every $S \in \mathcal{S}_{K/2}$ and every good arm $j \in S$, $\mathbb{P}_{\nu_S}(N_j(T) \ge n) \ge \beta$ for an absolute constant $\beta > 0$. This is the only step that uses the one-arm perturbation family explicitly.
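The hard family of Definition 4 is straightforward to instantiate in simulation; a minimal sketch, where the function names are our own:

```python
import random

def sample_hard_instance(K, rng):
    """Draw S* ~ Unif(S_{K/2}) and return (S*, arm means of nu_{S*}):
    mean 1/2 on the good arms in S*, mean 0 on the bad arms."""
    S_star = set(rng.sample(range(K), K // 2))
    means = [0.5 if i in S_star else 0.0 for i in range(K)]
    return S_star, means

def perturb(means, j, delta):
    """One-arm perturbation nu_S^{(j, delta)}: arm j becomes Ber(1/2 + delta),
    making it uniquely optimal; all other arms are unchanged."""
    out = list(means)
    out[j] = 0.5 + delta
    return out

rng = random.Random(0)
S_star, means = sample_hard_instance(8, rng)
j = min(S_star)                      # pick some good arm to perturb
print(perturb(means, j, 0.1))
```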
(B) Information lower bound (Section 4.2). We combine the false-negative control from Step (A) with the false-positive control that comes directly from regret. Together these imply that the thresholded profile $Y$ is a noisy coordinatewise predictor of the hidden good-arm set, and hence $I(S^\star; Y) = \Omega(K)$.

(C) From information to batches (Section 4.3). We show that $Y$ is determined by the information available at batch boundaries, namely the seed and the persistent boundary states. Applying data processing together with $H(M) \le BW$ yields
\[
I(S^\star; Y) \le BW.
\]
In the adaptive-grid case, the realized batch boundaries are themselves determined by the same objects, so they do not create an additional instance-dependent information channel.

Equivalently, the quantitative core of the proof is the information chain
\[
\underbrace{\Omega(K)}_{\text{Step (B): information forced by the hard family}} \;\le\; I(S^\star; Y) \;\le\; \underbrace{BW}_{\text{Step (C): batch-memory bottleneck}}. \tag{2}
\]
Combining Steps (B) and (C) yields Theorem 1.

4.1 An Event-Restricted Change of Measure and Per-Good-Arm Exploration

This subsection has a local role in the proof and a broader methodological one. Its immediate purpose is to establish a probability-level exploration guarantee for good arms in the hard family: under every base instance $\nu_S$, each good arm must cross a prescribed sampling threshold with constant probability. This will be the only output of the perturbation argument used later in the information-theoretic part of the lower bound. At the same time, the underlying change-of-measure argument is not specific to our Bernoulli hard family. For independent interest, we state it in a reusable adaptive-sampling form. To keep the main proof focused on the lower-bound argument, the full measurable-space formulation and the associated truncation-based proofs are deferred to Appendix A.1. The conceptual point is the following.
Classical transcript-level change-of-measure arguments charge the discrepancy between two environments to the full amount of information accumulated by the algorithm. For under-sampling events, however, this is often poorly aligned with the event itself. An event such as $\{ N_j(T) \le n \}$ is insensitive to any reward from the perturbed arm beyond its $n$-th pull. Thus the natural comparison cost should scale with the part of the perturbed stream that the event can actually depend on, rather than with the entire transcript. The results below formalize exactly this prefix-sensitive, event-visible comparison principle. We then instantiate this general tool for one-arm perturbations of the hard family and obtain the following exploration lemma.

Lemma 5 (Per-good-arm exploration). In the hard family of Definition 4, assume that algorithm $\mathcal{A}$ satisfies the regret bound (1) for some constant $C > 0$. There exist absolute constants $c_\Delta, \alpha, \beta > 0$ such that the following holds. Define
\[
\Delta_0 := c_\Delta\, C \log(T) \log(K) \sqrt{\frac{K}{T}}, \qquad n := \frac{\alpha\, T}{C^2 K \log^2(T) \log^2(K)}.
\]
Assume $T$ is large enough so that $\Delta_0 \le 1/4$ and $1 \le n \le T/2$. Then for every $S \in \mathcal{S}_{K/2}$ and every $j \in S$,
\[
\mathbb{P}_{\nu_S}\bigl( N_j(T) \ge n \bigr) \ge \beta.
\]
One admissible choice is $c_\Delta = 8$, $\alpha = \frac{\ln(1.8)}{512}$, $\beta = 0.1$. The proof below verifies that these values work; the subsequent arguments only use that $c_\Delta, \alpha, \beta$ are absolute positive constants.

Fix once and for all any admissible constants $c_\Delta, \alpha, \beta > 0$ for which Lemma 5 holds, and define $\Delta_0$ and $n$ as displayed above. All subsequent statements in this section are with respect to this fixed admissible choice. The proof of Lemma 5 proceeds by instantiating the next three general results, which form a reusable event-restricted change-of-measure tool for adaptive sampling problems with one-stream perturbations.
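Before proceeding, it may help to see the scales in Lemma 5 concretely. The following Python sketch (the values of $T$, $K$, $C$ are illustrative choices of ours, not taken from the analysis) evaluates $\Delta_0$ and $n$ for the admissible constants $c_\Delta = 8$, $\alpha = \ln(1.8)/512$, and checks the standing requirements together with the numeric identity $8 \cdot 64\, \alpha = 512\, \alpha = \ln(1.8)$ that the proof of Lemma 5 relies on:

```python
import math

# Admissible constants from Lemma 5: c_Delta = 8, alpha = ln(1.8)/512.
C_DELTA = 8.0
ALPHA = math.log(1.8) / 512

def lemma5_parameters(T, K, C):
    """Perturbation size Delta_0 and pull threshold n from Lemma 5."""
    logs = math.log(T) * math.log(K)
    delta0 = C_DELTA * C * logs * math.sqrt(K / T)
    n = ALPHA * T / (C ** 2 * K * logs ** 2)
    return delta0, n

# Key cancellation in the proof: 8 * n * Delta_0^2 = 8 * 64 * alpha = 512 * alpha
# = ln(1.8); the C, K, T and log factors cancel exactly.
assert abs(8 * 64 * ALPHA - math.log(1.8)) < 1e-12

# Illustrative large-horizon regime (values chosen for illustration only).
T, K, C = 10 ** 12, 100, 1.0
delta0, n = lemma5_parameters(T, K, C)
assert delta0 <= 0.25       # standing requirement: Delta_0 <= 1/4
assert 1 <= n <= T / 2      # standing requirement: 1 <= n <= T/2
```

In this illustrative regime $\Delta_0$ is on the order of $10^{-2}$ and $n$ is a few hundred pulls, far below $T/2$, so the standing requirements are comfortably satisfied.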
4.1.1 An Event-Restricted Change of Measure via $\chi^2$-Divergence

Many bandit lower bounds compare two environments that differ only on a single arm $j$ and transfer probability statements between the corresponding transcript laws [KCG16, GK16, LS20]. Standard Kullback–Leibler- or total-variation-based arguments are transcript-level: they charge the comparison to all pulls of the perturbed arm. For under-sampling events, this is often fundamentally misaligned. If the event of interest is $\{ N_j(T) \le n \}$, then only the first $n$ rewards of arm $j$ can affect that event; anything observed from arm $j$ after the $n$-th pull is irrelevant to it. The key distinction is therefore between transcript-level information and event-visible information. The result below isolates a reusable change-of-measure principle whose penalty is aligned with the latter. More precisely, whenever the event of interest depends on at most the first $n$ observations from the perturbed stream, the comparison cost is governed by $n$ one-step $\chi^2$ factors. We do not present this as a new global divergence inequality. Rather, it is an event-restricted and prefix-sensitive packaging of standard likelihood-ratio and second-moment arguments.

Definition 6 ($\chi^2$-divergence). For probability measures $P \ll Q$, define
\[
\chi^2(P \,\|\, Q) := \int \Bigl( \frac{dP}{dQ} - 1 \Bigr)^2 dQ = \mathbb{E}_{\xi \sim Q}\biggl[ \Bigl( \frac{dP}{dQ}(\xi) - 1 \Bigr)^2 \biggr].
\]
Equivalently,
\[
\mathbb{E}_{\xi \sim Q}\biggl[ \Bigl( \frac{dP}{dQ}(\xi) \Bigr)^2 \biggr] = 1 + \chi^2(P \,\|\, Q).
\]
The identity above is exactly what makes $\chi^2$ the natural divergence here: the second moment of the prefix likelihood ratio over the first $n$ pulls factorizes as $\bigl( 1 + \chi^2(P \,\|\, Q) \bigr)^n$.

The full formal setup appears in Appendix A.1. For the statements below, it suffices to consider two adaptive-sampling environments $\mathbb{P}_0$ and $\mathbb{P}_1$ induced by the same non-anticipating policy and differing only on one arm $j$, whose reward laws are $P$ under $\mathbb{P}_0$ and $Q$ under $\mathbb{P}_1$. Write $\mathcal{F}_t := \sigma(U, A_1, R_1, \ldots, A_t, R_t)$ for the transcript filtration, and for each integer $n \ge 0$ let
\[
\mathcal{G}_n := \sigma\bigl( U,\; (X_{i,\ell})_{i \neq j,\, \ell \ge 1},\; X_{j,1}, \ldots, X_{j,n} \bigr).
\]
Lemma 20, proved in Appendix A.1, shows that under a pull budget $N_j(T) \le n$, the event in question is already visible in $\mathcal{G}_n$.

Proposition 7 (Prefix-measurable $\chi^2$ change of measure). Assume $\chi^2(P \,\|\, Q) < \infty$. Then for every integer $n \ge 0$ and every event $E \in \mathcal{G}_n$,
\[
\mathbb{P}_0(E) \le \sqrt{\mathbb{P}_1(E)}\, \bigl( 1 + \chi^2(P \,\|\, Q) \bigr)^{n/2} \le \sqrt{\mathbb{P}_1(E)}\, \exp\Bigl( \frac{n}{2}\, \chi^2(P \,\|\, Q) \Bigr).
\]

Proof. Let $\mathbb{P}_k^{(n)} := \mathbb{P}_k |_{\mathcal{G}_n}$, $k \in \{0,1\}$, denote the restrictions of $\mathbb{P}_0$ and $\mathbb{P}_1$ to $\mathcal{G}_n$. Define
\[
L_n := \prod_{\ell=1}^n \frac{dP}{dQ}(X_{j,\ell}), \qquad L_0 := 1.
\]
Under both $\mathbb{P}_0$ and $\mathbb{P}_1$, the random element $\bigl( U, (X_{i,\ell})_{i \neq j,\, \ell \ge 1} \bigr)$ has the same law and is independent of $(X_{j,1}, \ldots, X_{j,n})$. Moreover,
\[
(X_{j,1}, \ldots, X_{j,n}) \sim P^{\otimes n} \ \text{under } \mathbb{P}_0, \qquad (X_{j,1}, \ldots, X_{j,n}) \sim Q^{\otimes n} \ \text{under } \mathbb{P}_1.
\]
Hence $\mathbb{P}_0^{(n)} \ll \mathbb{P}_1^{(n)}$ and
\[
\frac{d\mathbb{P}_0^{(n)}}{d\mathbb{P}_1^{(n)}} = L_n \quad \mathbb{P}_1\text{-a.s. on } \mathcal{G}_n.
\]
Equivalently, for every bounded $\mathcal{G}_n$-measurable random variable $H$, $\mathbb{E}_0[H] = \mathbb{E}_1[H L_n]$. Applying this identity with $H = \mathbb{I}_E$ yields $\mathbb{P}_0(E) = \mathbb{E}_1[\mathbb{I}_E L_n]$. By Cauchy–Schwarz,
\[
\mathbb{P}_0(E) \le \sqrt{\mathbb{P}_1(E)}\, \sqrt{\mathbb{E}_1[L_n^2]}.
\]
Under $\mathbb{P}_1$, the variables $X_{j,1}, \ldots, X_{j,n}$ are i.i.d. with law $Q$, so
\[
\mathbb{E}_1[L_n^2] = \prod_{\ell=1}^n \mathbb{E}_{\xi \sim Q}\biggl[ \Bigl( \frac{dP}{dQ}(\xi) \Bigr)^2 \biggr] = \bigl( 1 + \chi^2(P \,\|\, Q) \bigr)^n.
\]
This proves the first inequality. The second follows from $1 + x \le e^x$.

Lemma 8 (Event-restricted $\chi^2$ change of measure for adaptive sampling). Assume $\chi^2(P \,\|\, Q) < \infty$. Then for every event $E \in \mathcal{F}_T$ and every integer $n \ge 0$,
\[
\mathbb{P}_0\bigl( E \cap \{ N_j(T) \le n \} \bigr) \le \sqrt{ \mathbb{P}_1\bigl( E \cap \{ N_j(T) \le n \} \bigr) }\, \bigl( 1 + \chi^2(P \,\|\, Q) \bigr)^{n/2}.
\]
Consequently,
\[
\mathbb{P}_0\bigl( E \cap \{ N_j(T) \le n \} \bigr) \le \sqrt{\mathbb{P}_1(E)}\, \exp\Bigl( \frac{n}{2}\, \chi^2(P \,\|\, Q) \Bigr).
\]

Proof. By Lemma 20, $E \cap \{ N_j(T) \le n \} \in \mathcal{G}_n$. Apply Proposition 7 to the event $E \cap \{ N_j(T) \le n \}$. The second inequality follows from $\mathbb{P}_1\bigl( E \cap \{ N_j(T) \le n \} \bigr) \le \mathbb{P}_1(E)$ and $1 + x \le e^x$.
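Proposition 7 can be checked concretely in the Bernoulli case used later, where $P = \mathrm{Ber}(1/2)$ and $Q = \mathrm{Ber}(1/2 + \Delta)$. The following Python sketch (illustrative parameters of ours, standard library only) verifies the first inequality exactly for prefix events of the form "at most $m$ successes among the first $n$ pulls of arm $j$", which are $\mathcal{G}_n$-measurable:

```python
import math

def binom_cdf(n, p, m):
    """P(Binomial(n, p) <= m), computed exactly from the pmf."""
    return sum(math.comb(n, s) * p ** s * (1 - p) ** (n - s) for s in range(m + 1))

def chi2_bernoulli(p, q):
    """chi^2(Ber(p) || Ber(q)) = (p - q)^2 / (q (1 - q))."""
    return (p - q) ** 2 / (q * (1 - q))

# Base law P = Ber(1/2), perturbed law Q = Ber(1/2 + Delta) with Delta = 0.1.
# For p = 1/2 the divergence reduces to 4 Delta^2 / (1 - 4 Delta^2),
# the closed form used in the proof of Lemma 5.
p, q, n = 0.5, 0.6, 50
chi2 = chi2_bernoulli(p, q)
assert abs(chi2 - 4 * 0.1 ** 2 / (1 - 4 * 0.1 ** 2)) < 1e-12

# Events E = {at most m successes among the first n draws of arm j} lie in G_n,
# so Proposition 7 gives P_0(E) <= sqrt(P_1(E)) * (1 + chi2)^(n/2).
for m in range(0, n + 1, 5):
    P0 = binom_cdf(n, p, m)   # probability under the base environment
    P1 = binom_cdf(n, q, m)   # probability under the perturbed environment
    assert P0 <= math.sqrt(P1) * (1 + chi2) ** (n / 2) + 1e-12
```

The square root is what makes the transfer useful for rare events: when $\mathbb{P}_1(E)$ is exponentially small, $\mathbb{P}_0(E)$ is still forced to be small after paying only the $n$ prefix factors.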
Corollary 9 (Specialization to events already enforcing the budget). If $E \in \mathcal{F}_T$ satisfies $E \subseteq \{ N_j(T) \le n \}$, then
\[
\mathbb{P}_0(E) \le \sqrt{\mathbb{P}_1(E)}\, \bigl( 1 + \chi^2(P \,\|\, Q) \bigr)^{n/2} \le \sqrt{\mathbb{P}_1(E)}\, \exp\Bigl( \frac{n}{2}\, \chi^2(P \,\|\, Q) \Bigr).
\]
In particular, if $\mathbb{P}_1(E) \le \delta$, then $\mathbb{P}_0(E) \le \sqrt{\delta}\, \exp\bigl( \frac{n}{2}\, \chi^2(P \,\|\, Q) \bigr)$. Equivalently,
\[
\mathbb{P}_1(E) \ge \mathbb{P}_0(E)^2\, \bigl( 1 + \chi^2(P \,\|\, Q) \bigr)^{-n} \ge \mathbb{P}_0(E)^2\, \exp\bigl( -n\, \chi^2(P \,\|\, Q) \bigr).
\]

Proof. If $E \subseteq \{ N_j(T) \le n \}$, then $E = E \cap \{ N_j(T) \le n \}$. Applying Lemma 8 and rearranging gives the desired inequality.

How the general tool will be used. In our application, $E$ will be an under-sampling event for a single arm $j$, and $(\mathbb{P}_0, \mathbb{P}_1)$ will correspond to a base instance and a one-arm perturbed instance. The point of Corollary 9 is that once one proves under the perturbed environment that $\mathbb{P}_1(E)$ is small, the same event can be transferred back to the base environment with a penalty that depends only on the first $n$ relevant observations from the modified arm. This is exactly the scale needed for the exploration guarantee below. More broadly, Lemma 8 applies whenever one compares two adaptive sampling environments that differ only through a single data stream and the event of interest is controlled by a bounded pull budget on that stream. Thus its relevance is not restricted to the present Bernoulli hard family.

Remark 10 (Why the event-restricted form matters). Classical transcript-level change-of-measure arguments compare the full laws and yield, for example,
\[
\mathrm{KL}(\mathbb{P}_0 \,\|\, \mathbb{P}_1) = \mathbb{E}_0[N_j(T)]\, \mathrm{KL}(P \,\|\, Q), \qquad \mathrm{KL}(\mathbb{P}_1 \,\|\, \mathbb{P}_0) = \mathbb{E}_1[N_j(T)]\, \mathrm{KL}(Q \,\|\, P),
\]
and analogous bounds for other global divergences [BH79, Tsy08, LS20]. These quantities pay for all pulls of arm $j$ under one of the two environments. For under-sampling events such as $\{ N_j(T) \le n \}$, this can be much larger than the relevant event-level budget.
By contrast, Lemma 8 and Corollary 9 pay only for the first $n$ rewards of arm $j$ that are relevant to the event under consideration, independently of how often arm $j$ may be pulled on typical trajectories under either environment. This event-level alignment is precisely what is needed in the application below to turn regret information under a perturbed instance into a probability-level exploration guarantee.

4.1.2 Proof of Lemma 5

Proof. It suffices to verify the claim for one admissible choice of constants. Take $c_\Delta = 8$, $\alpha = \frac{\ln(1.8)}{512}$, $\beta = 0.1$. Fix $S \in \mathcal{S}_{K/2}$ and $j \in S$, and define
\[
\Delta_0 := c_\Delta\, C \log(T) \log(K) \sqrt{\frac{K}{T}}, \qquad n := \frac{\alpha\, T}{C^2 K \log^2(T) \log^2(K)}.
\]
Define the instance $\nu_S^{(j)}$ by changing only arm $j$ to mean $\mu_j = 1/2 + \Delta_0$, keeping all other arms identical to $\nu_S$. Let $E_j := \{ N_j(T) < n \}$.

Under $\nu_S^{(j)}$, arm $j$ is the unique optimal arm with mean $1/2 + \Delta_0$. For any $i \neq j$, we have $\mu_i \le 1/2$, hence each round not pulling $j$ incurs instantaneous expected regret at least $\Delta_0$. Therefore,
\[
\mathrm{Reg}_T\bigl( \nu_S^{(j)} \bigr) \ge \Delta_0\, \bigl( T - N_j(T) \bigr).
\]
On $E_j$, $T - N_j(T) \ge T - n \ge T/2$, so
\[
\mathbb{E}_{\nu_S^{(j)}}[\mathrm{Reg}_T] \ge \mathbb{E}_{\nu_S^{(j)}}\bigl[ \Delta_0\, (T - N_j(T))\, \mathbb{I}_{E_j} \bigr] \ge \frac{\Delta_0 T}{2}\, \mathbb{P}_{\nu_S^{(j)}}(E_j).
\]
By the regret bound (1), $\mathbb{E}_{\nu_S^{(j)}}[\mathrm{Reg}_T] \le C \log(T) \log(K) \sqrt{KT}$, hence
\[
\mathbb{P}_{\nu_S^{(j)}}(E_j) \le \frac{2 C \sqrt{KT}\, \log(T) \log(K)}{\Delta_0 T} = \frac{1}{4}.
\]
The environments $\nu_S$ and $\nu_S^{(j)}$ differ only at arm $j$. At arm $j$, the reward law is $P = \mathrm{Ber}(1/2)$ under $\nu_S$ and $Q = \mathrm{Ber}(1/2 + \Delta_0)$ under $\nu_S^{(j)}$. For $|\Delta_0| \le 1/4$,
\[
\chi^2(P \,\|\, Q) = \frac{4 \Delta_0^2}{1 - 4 \Delta_0^2} \le 16 \Delta_0^2.
\]
Since $E_j \subseteq \{ N_j(T) \le n \}$, Lemma 8 gives
\[
\mathbb{P}_{\nu_S}(E_j) \le \sqrt{ \mathbb{P}_{\nu_S^{(j)}}(E_j) }\, \exp\Bigl( \frac{n}{2}\, \chi^2(P \,\|\, Q) \Bigr) \le \frac{1}{2} \exp\bigl( 8 n \Delta_0^2 \bigr).
\]
By the definition of $n$ and $\Delta_0$, we have
\[
8 n \Delta_0^2 \le 8 \cdot \frac{\alpha\, T}{C^2 K \log^2(T) \log^2(K)} \cdot \frac{64\, C^2 \log^2(T) \log^2(K)\, K}{T} = 512\, \alpha = \ln(1.8).
\]
Thus
\[
\mathbb{P}_{\nu_S}(E_j) \le \frac{1}{2} \exp\bigl( 8 n \Delta_0^2 \bigr) \le \frac{1}{2}\, e^{\ln(1.8)} = 0.9.
\]
Equivalently, $\mathbb{P}_{\nu_S}\bigl( N_j(T) \ge n \bigr) \ge 0.1 = \beta$.

4.2 Information Lower Bound under the Hard Prior

We now formalize Step (B) of the proof and show that the thresholded sampling profile $Y$ must reveal linear information about the hidden set $S^\star$. Write
\[
X_i := \mathbb{I}\{ i \in S^\star \}, \qquad X = (X_1, \ldots, X_K).
\]
The argument combines the false-negative control from Step (A) with a regret-based false-positive bound, and then applies a per-coordinate Fano-type inequality to obtain $I(S^\star; Y) = \Omega(K)$. The proof packages Step (A) and the regret bound into a coordinatewise classification statement: the thresholded bit $Y_i$ predicts the hidden membership bit $X_i = \mathbb{I}\{ i \in S^\star \}$ with average error bounded away from $1/2$. A chain-rule / per-coordinate entropy argument then turns this coordinatewise signal into a linear mutual-information lower bound.

Theorem 11 (Information acquisition lower bound). In the setting of Definition 4, assume that algorithm $\mathcal{A}$ satisfies the regret bound (1) for some constant $C > 0$, let $c_\Delta, \alpha, \beta > 0$ be the constants from Lemma 5, and define
\[
n := \frac{\alpha\, T}{C^2 K \log^2(T) \log^2(K)}.
\]
Assume
\[
\frac{C \sqrt{KT}\, \log(T) \log(K)}{n} \le \frac{\beta K}{8}, \tag{3}
\]
and the standing requirements of Lemma 5. Then
\[
I(S^\star; Y) \ge \Bigl( 1 - H_b\Bigl( \frac{1}{2} - \frac{\beta}{4} \Bigr) \Bigr) \Bigl( K - \frac{1}{2} \log_2(2K) \Bigr),
\]
where $H_b(p) := -p \log_2 p - (1-p) \log_2(1-p)$ is the binary entropy. In particular, for all sufficiently large $K$, there exists a universal constant $\gamma > 0$ such that $I(S^\star; Y) \ge \gamma K$.

Remark 12 (On the horizon regime). For a sufficiently large absolute constant $c$, the condition $T \ge c\, C^6 K \log^6 T \log^6 K$ implies both (3) and the standing requirements of Lemma 5 (namely $\Delta_0 \le 1/4$ and $1 \le n \le T/2$). Hence Theorem 11 applies throughout this regime. We do not claim the exponent 6 is intrinsic.
A large-horizon assumption is unavoidable for any lower bound of this form under the target regret rate $O(\sqrt{KT} \log T \log K)$: the one-batch algorithm that pulls a fixed arm throughout has regret at most $T$, and therefore already meets the minimax regret whenever $T \lesssim K \log^2 T \log^2 K$. Thus no statement of the form $B = \Omega(K/W)$ can hold uniformly over all $T$. The stronger $\log^6$ condition comes from our present proof template. Specifically, the necessary-exploration step uses the threshold
\[
n \asymp \frac{T}{C^2 K \log^2 T \log^2 K},
\]
while the information argument requires the false-positive contribution $\mathbb{E}[\mathrm{FP}] \lesssim \frac{\sqrt{KT}\, \log T \log K}{n}$ to be at most order $K$. Combining the two yields the displayed sufficient condition. Optimizing these polylogarithmic exponents is left open.

Proof. Define the membership bits $X_i := \mathbb{I}\{ i \in S^\star \}$ and $X = (X_1, \ldots, X_K)$. Since $S^\star$ and $X$ are in bijection, $H(S^\star \,|\, Y) = H(X \,|\, Y)$ and $H(S^\star) = H(X)$.

Bound on $H(S^\star)$. Since $S^\star$ is uniform over $\binom{K}{K/2}$ subsets, $H(S^\star) = \log_2 \binom{K}{K/2}$. Using $\binom{K}{K/2} \ge \frac{2^K}{\sqrt{2K}}$ gives
\[
H(S^\star) \ge K - \frac{1}{2} \log_2(2K). \tag{4}
\]

Bound on $H(S^\star \,|\, Y)$. By the chain rule and the fact that conditioning reduces entropy,
\[
H(X \,|\, Y) = \sum_{i=1}^K H(X_i \,|\, X_{1:i-1}, Y) \le \sum_{i=1}^K H(X_i \,|\, Y) \le \sum_{i=1}^K H(X_i \,|\, Y_i).
\]
For each $i$, let
\[
p_i^\star := \inf_{\psi : \{0,1\} \to \{0,1\}} \mathbb{P}\bigl( \psi(Y_i) \neq X_i \bigr)
\]
be the Bayes minimum error probability when predicting $X_i$ from $Y_i$ alone. The (binary) Fano inequality yields $H(X_i \,|\, Y_i) \le H_b(p_i^\star)$. By concavity of $H_b$,
\[
H(X \,|\, Y) \le \sum_{i=1}^K H_b(p_i^\star) \le K\, H_b\Bigl( \frac{1}{K} \sum_{i=1}^K p_i^\star \Bigr).
\]

1. Upper bound the average Bayes error by the thresholding rule. Consider the particular (possibly suboptimal) estimator $\hat{X}_i := Y_i$.
Then $p_i^\star \le \mathbb{P}\bigl( \hat{X}_i \neq X_i \bigr) = \mathbb{P}( Y_i \neq X_i )$, hence
\[
\frac{1}{K} \sum_{i=1}^K p_i^\star \le \frac{1}{K} \sum_{i=1}^K \mathbb{P}( Y_i \neq X_i ) =: P_e, \qquad \text{where} \qquad K P_e = \mathbb{E}\Bigl[ \sum_{i=1}^K \mathbb{I}\{ Y_i \neq X_i \} \Bigr] = \mathbb{E}[\mathrm{FP}] + \mathbb{E}[\mathrm{FN}],
\]
with
\[
\mathrm{FP} := \sum_{i \notin S^\star} \mathbb{I}\{ Y_i = 1 \}, \qquad \mathrm{FN} := \sum_{i \in S^\star} \mathbb{I}\{ Y_i = 0 \}.
\]

2. Bound FP by regret. Fix $S \in \mathcal{S}_{K/2}$ and work under $\nu_S$. By Markov's inequality,
\[
\mathbb{E}_{\nu_S}[\mathrm{FP}] = \sum_{i \notin S} \mathbb{P}_{\nu_S}\bigl( N_i(T) \ge n \bigr) \le \frac{1}{n} \sum_{i \notin S} \mathbb{E}_{\nu_S}[N_i(T)].
\]
Under $\nu_S$, the optimal mean is $\mu^\star = 1/2$ and any arm $i \notin S$ has mean $0$, so pulling $i \notin S$ incurs instantaneous expected regret $1/2$, and pulling $i \in S$ incurs $0$. Thus
\[
\mathbb{E}_{\nu_S}[\mathrm{Reg}_T] = \frac{1}{2} \sum_{i \notin S} \mathbb{E}_{\nu_S}[N_i(T)], \qquad \text{so} \qquad \sum_{i \notin S} \mathbb{E}_{\nu_S}[N_i(T)] = 2\, \mathbb{E}_{\nu_S}[\mathrm{Reg}_T].
\]
Using $\mathbb{E}_{\nu_S}[\mathrm{Reg}_T] \le C \sqrt{KT}\, \log(T) \log(K)$ gives
\[
\mathbb{E}_{\nu_S}[\mathrm{FP}] \le \frac{2 C \sqrt{KT}\, \log(T) \log(K)}{n}.
\]
Since this bound holds uniformly for all $S \in \mathcal{S}_{K/2}$, we can average over the random choice of $S^\star$:
\[
\mathbb{E}[\mathrm{FP}] = \mathbb{E}_{S^\star}\bigl[ \mathbb{E}[\mathrm{FP} \,|\, S^\star] \bigr] \le \frac{2 C \sqrt{KT}\, \log(T) \log(K)}{n} \le \frac{\beta K}{4}, \tag{5}
\]
where the last inequality follows from assumption (3).

3. Bound FN by Lemma 5. Fix $S \in \mathcal{S}_{K/2}$ and work under $\nu_S$. By Lemma 5, for every $j \in S$,
\[
\mathbb{P}_{\nu_S}( Y_j = 1 ) = \mathbb{P}_{\nu_S}\bigl( N_j(T) \ge n \bigr) \ge \beta.
\]
Hence the expected number of true positives satisfies
\[
\mathbb{E}_{\nu_S}\Bigl[ \sum_{j \in S} Y_j \Bigr] = \sum_{j \in S} \mathbb{P}_{\nu_S}( Y_j = 1 ) \ge \beta\, |S| = \frac{\beta K}{2}.
\]
Therefore
\[
\mathbb{E}_{\nu_S}[\mathrm{FN}] = |S| - \mathbb{E}_{\nu_S}\Bigl[ \sum_{j \in S} Y_j \Bigr] \le \frac{K}{2} (1 - \beta).
\]
Since this bound holds for any realization $S \in \mathcal{S}_{K/2}$, we obtain the unconditional expectation by marginalizing over $S^\star$:
\[
\mathbb{E}[\mathrm{FN}] = \mathbb{E}_{S^\star}\bigl[ \mathbb{E}[\mathrm{FN} \,|\, S^\star] \bigr] \le \frac{K}{2} (1 - \beta). \tag{6}
\]

4. Conclude an upper bound on $H(S^\star \,|\, Y)$. Combining (5) and (6),
\[
P_e = \frac{\mathbb{E}[\mathrm{FP}] + \mathbb{E}[\mathrm{FN}]}{K} \le \frac{\beta}{4} + \frac{1 - \beta}{2} = \frac{1}{2} - \frac{\beta}{4} =: \delta < \frac{1}{2}.
\]
Since $H_b$ is increasing on $[0, 1/2]$,
\[
H(S^\star \,|\, Y) = H(X \,|\, Y) \le K\, H_b\Bigl( \frac{1}{K} \sum_{i=1}^K p_i^\star \Bigr) \le K\, H_b(P_e) \le K\, H_b(\delta).
\]

Mutual information lower bound. Finally,
\[
I(S^\star; Y) = H(S^\star) - H(S^\star \,|\, Y) \ge K - \frac{1}{2} \log_2(2K) - K\, H_b(\delta),
\]
using (4).
This is exactly the claimed bound with $\delta = \frac{1}{2} - \frac{\beta}{4}$. The linear-in-$K$ conclusion follows because $1 - H_b(\delta) > 0$ for any $\delta \neq 1/2$.

What remains for the lower bound. Theorem 11 completes the instance-level part of the argument: under the regret guarantee (1), even the coarse profile $Y$ must encode $\Omega(K)$ bits about the hidden set $S^\star$. What remains is now purely structural. Under batching and a $W$-bit persistent-memory budget, how much instance-dependent information can $Y$ carry at all? The next subsection answers this by showing that $Y$ is measurable with respect to the seed and the boundary-memory transcript, and therefore cannot contain more than $O(BW)$ bits of information.

4.3 From Information to Batch Complexity

We now complete the lower bound by upper-bounding how much information the learner can encode in the thresholded sampling profile $Y$ under the batching and memory constraints. Step (B) showed that near-minimax regret forces $I(S^\star; Y) = \Omega(K)$. To convert this into a lower bound on the number of batches, it remains to show that a $B$-batch algorithm with a $W$-bit persistent state can make $Y$ depend on at most $O(BW)$ bits of instance-dependent information. The key structural fact is policy commitment: once a batch begins, the complete within-batch action plan is fixed using only the public time, the boundary memory state, and the seed $U$. Hence rewards observed inside a batch can affect future action plans only through what is retained at later batch boundaries. This yields the sharper entropy bound $(B-1)W$, and in particular the coarse bound $BW$ used in the roadmap. The next lemma formalizes the structural consequence we need: although $Y$ is defined from the full pull counts $N_i(T)$, under the batching constraint it is in fact determined by the seed together with the internal boundary-memory transcript.

Lemma 13 (Boundary transcript determines the sampling profile).
Let $\mathcal{A}$ be a $(B, W)$-batched space-bounded algorithm as in Section 3.5, and let $Y$ be the thresholded sampling profile from Definition 4. Then $Y$ is a deterministic measurable function of $(U, M)$.

Proof Sketch. The key point is policy commitment. For each batch $b$, once $(t_{b-1}, M_{t_{b-1}}, U)$ is fixed, the entire within-batch action sequence is fixed as well. Hence the pull-count vector contributed by batch $b$ is determined by the boundary information available when that batch starts. Summing these batch-level pull counts shows that the full pull-count vector $(N_1(T), \ldots, N_K(T))$, and therefore the thresholded profile $Y$, is determined by $(U, \tau, M)$, where $\tau = (t_1, \ldots, t_{B-1})$ is the realized internal grid. In the static-grid model, $\tau$ is fixed in advance. In the adaptive-grid model, each new boundary is chosen from the same boundary information, so $\tau$ is itself determined recursively by $(U, M)$. Therefore $Y$ is a deterministic function of $(U, M)$. The formal measurability verification is deferred to Appendix A.2.

Proof of Theorem 1. We are now in position to complete the proof. Step (B) lower-bounds $I(S^\star; Y)$ by $\Omega(K)$, while Step (C) upper-bounds the same quantity by the batch-memory capacity. Combining the two yields the claimed lower bound on $B$.

Proof. By Remark 12, for a sufficiently large universal constant $c$, the assumption $T \ge c\, C^6 K \log^6 T \log^6 K$ implies the hypotheses of Theorem 11 for the explicit admissible choice $c_\Delta = 8$, $\alpha = \frac{\ln(1.8)}{512}$, $\beta = 0.1$. Hence Theorem 11 yields
\[
I(S^\star; Y) \ge \Bigl( 1 - H_b\Bigl( \frac{1}{2} - \frac{0.1}{4} \Bigr) \Bigr) \Bigl( K - \frac{1}{2} \log_2(2K) \Bigr) = \gamma_0 \Bigl( K - \frac{1}{2} \log_2(2K) \Bigr).
\]
By Lemma 13, there exists a deterministic and measurable function $g$ such that $Y = g(U, M)$ a.s. Hence, conditional on $U$, we have the Markov chain $S^\star \to M \to Y$. By the conditional data processing inequality,
\[
I(S^\star; Y \,|\, U) \le I(S^\star; M \,|\, U).
\]
Since $U$ is independent of $S^\star$,
\[
I(S^\star; Y) \le I(S^\star; Y, U) = I(S^\star; Y \,|\, U).
\]
Therefore
\[
I(S^\star; Y) \le I(S^\star; M \,|\, U) \le H(M \,|\, U) \le H(M).
\]
Using the chain rule and $|\mathcal{M}| \le 2^W$,
\[
H(M) = \sum_{b=1}^{B-1} H\bigl( M_{t_b} \,\big|\, M_{t_1}, \ldots, M_{t_{b-1}} \bigr) \le \sum_{b=1}^{B-1} H(M_{t_b}) \le \sum_{b=1}^{B-1} \log_2 |\mathcal{M}| \le (B-1) W \le BW.
\]
Thus $I(S^\star; Y) \le BW$. Combining the lower and upper bounds on $I(S^\star; Y)$ gives
\[
\gamma_0 \Bigl( K - \frac{1}{2} \log_2(2K) \Bigr) \le BW, \qquad \text{which implies} \qquad B \ge \gamma_0\, \frac{K - \frac{1}{2} \log_2(2K)}{W}.
\]
Since $\gamma_0 > 0$ is universal, there exists a universal constant $\gamma > 0$ such that for all sufficiently large $K$, $\gamma_0 \bigl( K - \frac{1}{2} \log_2(2K) \bigr) \ge \gamma K$. Hence $B \ge \frac{\gamma K}{W}$, which proves the asymptotic lower bound. The specialization to $W = O(\log T)$ is immediate.

Interpretation. Theorem 1 formalizes an adaptivity–memory tradeoff: achieving minimax regret requires learning $\Omega(K)$ bits about which arms are good, while a $B$-batch algorithm with $W$-bit persistent memory can transport at most $BW$ bits of instance-dependent information overall across batches. Therefore $BW$ must be at least on the order of $K$. In particular, if $W = O(\log T)$, then any such algorithm needs $B = \Omega(K / \log T)$ batches, i.e., a nearly linear number of adaptivity rounds in $K$ when memory is logarithmic.

5 Upper Bound under Simultaneous Space and Adaptivity Constraints

In this section we present a batched and space-bounded algorithm for the stochastic $K$-armed bandit problem with horizon $T$ and rewards in $[0,1]$. Throughout the paper, we use $W$ to denote the persistent-memory budget measured in bits. To avoid overloading this notation, in the upper-bound construction we use an integer block-size parameter $S \in [K]$. The algorithm uses $O(S \log T)$ bits of memory, at most $\widetilde{O}(K/S)$ batches, and achieves expected regret $\widetilde{O}(\sqrt{KT})$.
Equivalently, for any bit budget $W$ with $\Omega(\log T) \le W \le O(K \log T)$, choosing $S \asymp W / \log T$ yields an implementation using at most $W$ bits and $\widetilde{O}(K/W)$ batches. Here the $\Omega(\log T) \le W$ lower bound is needed to represent the relevant counters and empirical means to inverse-polynomial precision in $T$, while the $W \le O(K \log T)$ upper bound is without loss of generality since standard finite-armed bandit algorithms only need to store per-arm summaries for the $K$ arms. This nearly matches the lower bound of Section 4 up to logarithmic factors. Our algorithm works under the static-grid batched model. Since the static-grid model is algorithmically weaker than the adaptive-grid model, the same upper bound immediately applies to the adaptive-grid setting as well.

5.1 Algorithm Description

At a high level, our algorithm may be viewed as a space-bounded analogue of a standard batched elimination algorithm [PRCS16, GHRZ19, JTX+21]. The main difficulty is that standard batched elimination algorithms keep a large active set of candidate arms across stages, which is impossible under a strict memory budget. Our solution is to keep only one incumbent together with one block of at most $S$ challengers at a time. Instead of storing all potentially good arms simultaneously, the algorithm scans the remaining arms block by block. Within each block, challengers are compared against a frozen benchmark incumbent through a multistage test: if a challenger is confidently worse, it is deactivated for the rest of the block; if it survives all stages and appears better, it may trigger an incumbent update at the end of the block. Because only one challenger block is kept in memory at a time, this design uses $O(S \log T)$ bits while reducing the number of batches by a factor of $S$ relative to the fully sequential scan.

Concretely, the algorithm has three nested loops. The outer loop uses a deterministic grid of lengths $t_1, \ldots, t_L$ defined by
\[
t_0 := 1, \qquad t_i := \Bigl\lceil \sqrt{ t_{i-1} \cdot \frac{T}{10K} } \Bigr\rceil \quad (i \ge 1), \qquad L := \Bigl\lfloor \log_2 \log_2 \frac{T}{10K} \Bigr\rfloor.
\]
Readers familiar with the batched bandit literature will recognize this as the standard "minimax"-type grid [PRCS16, GHRZ19].

In each outer iteration $i$, the algorithm maintains a (time-varying) incumbent arm $a_i^\star$ and its empirical mean $\widehat{\mu}_i^\star$. At the beginning of the iteration, it inherits the final incumbent from the previous outer iteration and denotes this fixed initial identity by $a_{i,\mathrm{init}} := a_{i-1}^\star$. The algorithm sets the current incumbent variable to $a_i^\star \leftarrow a_{i,\mathrm{init}}$, pulls this arm for $t_i$ rounds, and stores the resulting empirical mean as $\widehat{\mu}_i^{\mathrm{init}}$. It then initializes the incumbent summary by $\widehat{\mu}_i^\star \leftarrow \widehat{\mu}_i^{\mathrm{init}}$. Throughout outer iteration $i$, the variable $a_i^\star$ denotes the current incumbent, and $\widehat{\mu}_i^\star$ denotes the stored summary of the arm currently held in $a_i^\star$. The identity $a_{i,\mathrm{init}}$ stays fixed while the challenger blocks are scanned, even though $a_i^\star$ may later be replaced.

Next, the algorithm partitions $[K] \setminus \{ a_{i,\mathrm{init}} \}$ into $J := \lceil \frac{K-1}{S} \rceil$ ordered blocks
\[
C_{i,1}, \ldots, C_{i,J}, \qquad |C_{i,j}| \le S.
\]
For each block $C_{i,j}$, the algorithm freezes the current incumbent as the benchmark
\[
\bar{a}_{i,j} := a_i^\star, \qquad \bar{\mu}_{i,j} := \widehat{\mu}_i^\star.
\]
It then preallocates $i$ comparison slots indexed by $k = 1, \ldots, i$, where slot $k$ contains $t_k$ pulls for each arm-position in the block. If arm $a \in C_{i,j}$ is still active, then slot $k$ is used to pull $a$ and compute $\widehat{\mu}_{i,j,k}(a)$ from $t_k$ fresh samples. Once $a$ is deactivated, its reserved pulls in the remaining slots are still executed but are reassigned to the frozen benchmark arm $\bar{a}_{i,j}$. Thus, for each pair $(i,j)$, the slots are fixed in advance; adaptivity changes only which challengers remain active inside the block, not the batch boundaries. We use the confidence radius
\[
c(s) := \sqrt{ \frac{\log(2/\delta)}{2s} } \qquad (s \ge 1).
\]
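The grid above is straightforward to compute. The following Python sketch (the horizon and arm count are illustrative choices of ours) evaluates $t_1, \ldots, t_L$ and checks the sandwich $s_i \le t_i \le 2 s_i$ against the smooth proxy $s_i = H^{1 - 2^{-i}}$ with $H = T/(10K)$, which the schedule analysis below relies on:

```python
import math

def minimax_grid(T, K):
    """Deterministic grid t_i = ceil(sqrt(t_{i-1} * H)) with H = T/(10K),
    t_0 = 1, and L = floor(log2 log2 H) levels."""
    H = T / (10 * K)
    L = math.floor(math.log2(math.log2(H)))
    t, grid = 1, []
    for _ in range(L):
        t = math.ceil(math.sqrt(t * H))
        grid.append(t)
    return grid

# Illustrative horizon and arm count (chosen by us, not from the paper).
T, K = 10 ** 8, 10
H = T / (10 * K)
grid = minimax_grid(T, K)

# The integer grid is sandwiched by the smooth proxy s_i = H^(1 - 2^-i):
# s_i <= t_i <= 2 s_i (small float slack on the lower side).
for i, t in enumerate(grid, start=1):
    s = H ** (1 - 2.0 ** (-i))
    assert s * (1 - 1e-9) <= t <= 2 * s
```

With these values the grid has $L = 4$ levels and roughly squares its way up toward $H$, which is what makes the number of levels doubly logarithmic in the horizon.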
If, at level $k$, the challenger estimate satisfies $\widehat{\mu}_{i,j,k}(a) < \bar{\mu}_{i,j} - 2 c(t_k)$, we mark $a$ as inactive and fill all later slots reserved for this arm in block $(i,j)$ with the frozen benchmark arm. Crucially, for a fixed slot $k$, the active/inactive status of every arm-position is already determined at the start of that slot from the outcomes of the previous slots $1, \ldots, k-1$. Thus the loop over $a \in C_{i,j}$ inside slot $k$ does not require any within-slot feedback: all pulls assigned to slot $k$ can be scheduled simultaneously in one batch, and only after that batch is completed do we update the active set for slot $k+1$.

After level $k = i$, let $\tilde{a}_{i,j}$ be an active arm in $C_{i,j}$ with the largest value of $\widehat{\mu}_{i,j,i}(a)$, if such an arm exists. If $\tilde{a}_{i,j}$ exists and moreover satisfies $\widehat{\mu}_{i,j,i}(\tilde{a}_{i,j}) \ge \bar{\mu}_{i,j} + 2 c(t_i)$, then $\tilde{a}_{i,j}$ replaces the incumbent and we update the stored incumbent summary to $\widehat{\mu}_{i,j,i}(\tilde{a}_{i,j})$. After the main outer-loop schedule $i = 1, \ldots, L$ is completed, all remaining rounds are assigned to the final incumbent $a_L^\star$ in one final batch. Our algorithm is summarized in Algorithm 1.

Remark 14. An outer iteration is not a single batch. Each incumbent-refresh block and each preallocated block-level comparison slot indexed by $(i,j,k)$ is a batch. Because the schedule is deterministic, the number of rounds used before the final exploitation batch is also deterministic, so the length of the last batch is known in advance.

5.2 Algorithm Analysis

We have the following theorem.

Theorem 2 (Formal). There exist universal constants $c, C_1, C_2, C_3 > 0$ such that the following holds.
For every integer $K \ge 2$, every block size $S \in [K]$, and every horizon $T \ge cK$, if Algorithm 1 is run with $\delta = T^{-4}$, then it defines a static-grid stochastic bandit algorithm that uses no more than $C_1 S \log T$ bits of persistent memory, no more than $C_2 \frac{K}{S} \log^2 \log T$ batches, and satisfies
\[
\sup_\nu \mathbb{E}_\nu[\mathrm{Reg}_T] \le C_3 \sqrt{KT}\, \log T\, \log^2 \log T,
\]
where the supremum is over all $K$-armed stochastic bandit instances with rewards in $[0,1]$.

Algorithm 1: Simultaneously Batched and Space-Constrained Algorithm for the standard Multi-Armed Bandit with block size $S$
1: Define $c(s) = \sqrt{\log(2/\delta)/(2s)}$
2: Initialize $a_0^\star \leftarrow 1$ and $t_0 \leftarrow 1$
3: Define $t_i \leftarrow \bigl\lceil \sqrt{t_{i-1}\, T/(10K)} \bigr\rceil$ for $i \ge 1$
4: Set $L \leftarrow \bigl\lfloor \log_2 \log_2 (T/(10K)) \bigr\rfloor$
5: for $i = 1$ to $L$ do
6:   Set $a_{i,\mathrm{init}} \leftarrow a_{i-1}^\star$
7:   Set $a_i^\star \leftarrow a_{i,\mathrm{init}}$
8:   Pull arm $a_{i,\mathrm{init}}$ for $t_i$ rounds and store its empirical mean $\widehat{\mu}_i^{\mathrm{init}}$
9:   Set $\widehat{\mu}_i^\star \leftarrow \widehat{\mu}_i^{\mathrm{init}}$
10:  Set $J \leftarrow \lceil (K-1)/S \rceil$ and partition $[K] \setminus \{a_{i,\mathrm{init}}\}$ into ordered blocks $C_{i,1}, \ldots, C_{i,J}$ of size at most $S$
11:  for $j = 1$ to $J$ do
12:    Set $\bar{a}_{i,j} \leftarrow a_i^\star$ and $\bar{\mu}_{i,j} \leftarrow \widehat{\mu}_i^\star$
13:    for each $a \in C_{i,j}$ do
14:      Set active($a$) $\leftarrow$ true
15:    end for
16:    for $k = 1$ to $i$ do
17:      for each $a \in C_{i,j}$ do
18:        if active($a$) then
19:          Pull arm $a$ for $t_k$ rounds and compute its empirical mean $\widehat{\mu}_{i,j,k}(a)$
20:          if $\widehat{\mu}_{i,j,k}(a) < \bar{\mu}_{i,j} - 2c(t_k)$ then
21:            Set active($a$) $\leftarrow$ false
22:          end if
23:        else
24:          Pull arm $\bar{a}_{i,j}$ for $t_k$ rounds
25:        end if
26:      end for
27:    end for
28:    if there exists some active arm in $C_{i,j}$ then
29:      Let $\tilde{a}_{i,j} \in \arg\max_{a \in C_{i,j} :\ \mathrm{active}(a)} \widehat{\mu}_{i,j,i}(a)$
30:      if $\widehat{\mu}_{i,j,i}(\tilde{a}_{i,j}) \ge \bar{\mu}_{i,j} + 2c(t_i)$ then
31:        Set $a_i^\star \leftarrow \tilde{a}_{i,j}$
32:        Set $\widehat{\mu}_i^\star \leftarrow \widehat{\mu}_{i,j,i}(\tilde{a}_{i,j})$
33:      end if
34:    end if
35:  end for
36: end for
37: Pull arm $a_L^\star$ for the final exploitation batch

Intuitively, the algorithm stores only one incumbent, one block of at most $S$ challengers, and a constant number of counters and empirical means per challenger in the current block, so the memory usage is $O(S \log T)$ bits. The outer loop $i = 1, \ldots, L$ forms the main outer-loop schedule, the middle loop scans arm-blocks sequentially to avoid storing a large active set, and the inner loop allocates the prearranged comparison slots for the current block. The sequential block scan contributes the factor $K/S$ in the batch complexity, while the multistage comparison contributes the additional $\log^2 \log T$ factor. Hence, for any bit budget $W$ with $\Omega(\log T) \le W \le O(K \log T)$, choosing $S \asymp W / \log T$ yields an implementation using at most $W$ bits and $\widetilde{O}(K/W)$ batches.

The total number of rounds used before the final exploitation batch is deterministic. Indeed, once $t_1, \ldots, t_L$ are fixed, the total number of rounds used before the final exploitation batch is
\[
N_{\mathrm{main}} := \sum_{i=1}^L \Bigl( t_i + (K-1) \sum_{k=1}^i t_k \Bigr), \tag{7}
\]
because outer iteration $i$ contains one incumbent-refresh block of length $t_i$ and then scans the remaining $K-1$ arm-positions through challenger blocks, each position receiving $i$ preallocated slots of lengths $t_1, \ldots, t_i$. Hence the final batch length is exactly $T - N_{\mathrm{main}}$ and is known in advance.

We now prove the theorem in four steps: space, schedule and batch accounting, concentration, and regret.

Lemma 15 (Space). Algorithm 1 uses only $O(S \log T)$ bits of memory.

Proof. At any moment, the algorithm maintains the current outer index $i$, the current block index $j$, the current comparison level $k$, the current scheduled length $t_i$, the fixed inherited initial arm $a_{i,\mathrm{init}}$, the current incumbent arm $a_i^\star$, one incumbent summary $\widehat{\mu}_i^\star$, the frozen benchmark pair $(\bar{a}_{i,j}, \bar{\mu}_{i,j})$ for the current block, the identities of the at most $S$ arms in the current block, one active/inactive flag for each of these arms, and at most $S$ empirical summaries $\widehat{\mu}_{i,j,k}(a)$ for the arms currently processed in the block.
All indices and counters are integers bounded by $T$, and therefore each requires only $O(\log T)$ bits. Each arm identity needs $O(\log K) \le O(\log T)$ bits, so the current block needs $O(S \log T)$ bits. Each empirical summary is an average of at most $T$ rewards in $[0,1]$, and it suffices to store it to inverse-polynomial precision in $T$, which also requires $O(\log T)$ bits. Since there are only $O(S)$ such summaries and flags alive simultaneously, the total persistent memory usage is $O(S \log T)$ bits.

Lemma 16 (Outer-loop schedule and round budget). Let $N_{\mathrm{main}}$ be defined by (7), and let $L \triangleq \big\lfloor \log_2 \log_2 \frac{T}{10K} \big\rfloor$. The deterministic outer-loop schedule uses a constant fraction of the horizon:
$$\frac{T}{40} \le N_{\mathrm{main}} \le \frac{3T}{5}.$$
In particular, the final exploitation batch has deterministic length
$$\frac{2T}{5} \le T - N_{\mathrm{main}} \le \frac{39T}{40}.$$

Proof. Let $H \triangleq \frac{T}{10K}$ and $s_i \triangleq H^{1 - 2^{-i}}$ for $i \ge 0$. We first compare the integer schedule $(t_i)$ with the smooth proxy $(s_i)$. We claim that for every $1 \le i \le L$,
$$s_i \le t_i \le 2 s_i. \tag{8}$$
For the lower bound, use induction:
$$t_i = \Big\lceil \sqrt{H t_{i-1}} \Big\rceil \ge \sqrt{H t_{i-1}} \ge \sqrt{H s_{i-1}} = s_i.$$
For the upper bound, the case $i = 1$ is $t_1 = \lceil \sqrt{H} \rceil \le \sqrt{H} + 1 \le 2\sqrt{H} = 2 s_1$. Now suppose $i \ge 2$ and $t_{i-1} \le 2 s_{i-1}$. Since $L \ge 1$ implies $H \ge 4$, we have $s_i \ge s_1 = \sqrt{H} \ge 2$, and therefore
$$t_i = \Big\lceil \sqrt{H t_{i-1}} \Big\rceil \le \sqrt{H t_{i-1}} + 1 \le \sqrt{2 H s_{i-1}} + 1 = \sqrt{2}\, s_i + 1 \le 2 s_i.$$
This proves (8).

Next, we estimate the last schedule level. Since $L = \lfloor \log_2 \log_2 H \rfloor$, we have
$$\frac{1}{\log_2 H} \le 2^{-L} < \frac{2}{\log_2 H}.$$
Hence
$$\frac{H}{4} = H^{1 - \frac{2}{\log_2 H}} < H^{1 - 2^{-L}} = s_L \le H^{1 - \frac{1}{\log_2 H}} = \frac{H}{2}.$$
Combining this with (8) yields
$$\frac{H}{4} < t_L \le H. \tag{9}$$
We now control the partial sums. If $1 \le i \le L-2$, then $2^{-i} \ge 2^{-(L-2)} = 4 \cdot 2^{-L} \ge \frac{4}{\log_2 H}$, and therefore
$$t_i \le 2 H^{1 - \frac{4}{\log_2 H}} = \frac{H}{8} \qquad (1 \le i \le L-2).$$
Therefore, for every $1 \le i \le L-2$,
$$t_{i+1} \ge \sqrt{H t_i} \ge 2 t_i.$$
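As a quick numerical sanity check of (8) and (9), the following Python sketch (our own; the paper provides no code, and the helper name and the test values of $T$ and $K$ are hypothetical) builds the integer schedule $t_i = \lceil\sqrt{H t_{i-1}}\rceil$, compares it with the smooth proxy $s_i = H^{1-2^{-i}}$, and also checks the ratio $t_i/\sqrt{t_{i-1}} \in [\sqrt H, 2\sqrt H]$ used later in the regret accounting.

```python
import math

def check_schedule_bounds(T, K):
    """Verify s_i <= t_i <= 2*s_i (Eq. 8), H/4 < t_L <= H (Eq. 9), and
    sqrt(H) <= t_i / sqrt(t_{i-1}) <= 2*sqrt(H), for the schedule
    t_i = ceil(sqrt(H * t_{i-1})) with t_0 = 1 and H = T / (10K)."""
    H = T / (10 * K)
    L = math.floor(math.log2(math.log2(H)))
    eps = 1e-9                                 # tolerance for float powers
    t_prev, t, ok = None, 1, True
    for i in range(1, L + 1):
        t_prev, t = t, math.ceil(math.sqrt(H * t))
        s = H ** (1.0 - 2.0 ** (-i))           # smooth proxy s_i
        ok &= s * (1 - eps) <= t <= 2 * s * (1 + eps)
        ratio = t / math.sqrt(t_prev)
        ok &= math.sqrt(H) * (1 - eps) <= ratio <= 2 * math.sqrt(H) * (1 + eps)
    return ok and H / 4 < t <= H               # t is now t_L
```

The small tolerance only compensates for floating-point rounding in `H ** (...)`; the inequalities themselves are exact, as the proof shows.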
Also, since $t_{L-1} \le 2 s_{L-1} < \frac{H}{2}$, we have
$$t_L \ge \sqrt{H t_{L-1}} \ge \sqrt{2}\, t_{L-1}.$$
Consequently,
$$\sum_{k=1}^{L-1} t_k \le 2 t_{L-1}, \qquad \sum_{k=1}^{L} t_k \le 2 t_{L-1} + t_L \le (1 + \sqrt{2})\, t_L,$$
and thus
$$\sum_{i=1}^{L} \sum_{k=1}^{i} t_k \le 2 \sum_{i=1}^{L-1} t_i + \sum_{k=1}^{L} t_k \le 4 t_{L-1} + (1 + \sqrt{2})\, t_L \le (1 + 3\sqrt{2})\, t_L < 6 t_L.$$
Using $t_i + (K-1) \sum_{k=1}^{i} t_k \le K \sum_{k=1}^{i} t_k$, we obtain
$$N_{\mathrm{main}} \le K \sum_{i=1}^{L} \sum_{k=1}^{i} t_k < 6 K t_L \le 6 K H = \frac{3T}{5}.$$
For the lower bound, the last outer iteration alone contributes at least
$$t_L + (K-1)\, t_L = K t_L > \frac{K H}{4} = \frac{T}{40},$$
where we used (9). Hence $N_{\mathrm{main}} \ge T/40$. Therefore $N_{\mathrm{main}} \in [T/40,\, 3T/5]$, and the final batch length is the predetermined quantity $T - N_{\mathrm{main}} \in \big[\frac{2T}{5}, \frac{39T}{40}\big]$.

Lemma 17 (Number of batches). Algorithm 1 uses at most $O\big(\frac{K}{S} \log^2 \log T\big)$ batches.

Proof. In outer iteration $i$, the algorithm contains one incumbent-refresh block and $J = \big\lceil \frac{K-1}{S} \big\rceil$ challenger blocks, each contributing at most $i$ preallocated comparison slots. Thus outer iteration $i$ uses at most $1 + Ji$ batches. Including the final exploitation batch, the total number of batches is at most
$$\sum_{i=1}^{L} (1 + Ji) + 1 = O\Big(\frac{K}{S} L^2\Big).$$
Since $L = O(\log \log T)$, we have $O\big(\frac{K}{S} L^2\big) = O\big(\frac{K}{S} \log^2 \log T\big)$.

It remains to bound the regret. We first state a uniform concentration event for the block-level challenger estimates and the incumbent-refresh estimates.

Lemma 18 (Uniform concentration). Let $\delta = T^{-4}$ and let $c(\cdot)$ be as defined in the algorithm description. Then with probability at least $1 - \frac{1}{T^2}$, the following event $E$ holds:
$$E \triangleq \Big\{ \forall\, i,j,k \text{ and all } a \in C_{i,j} \text{ sampled as a challenger at level } k:\ \big|\hat\mu_{i,j,k}(a) - \mu_a\big| \le c(t_k), \ \text{ and } \ \forall\, i:\ \big|\hat\mu_i^{\mathrm{init}} - \mu_{a_{i-1}^\star}\big| \le c(t_i) \Big\}.$$

Proof. Each empirical mean is an average of independent $[0,1]$-valued rewards. By Hoeffding's inequality, for any fixed mean based on $s$ samples we have
$$\Pr\big( |\hat\mu - \mu| > c(s) \big) \le \delta.$$
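To see the constants concretely, here is a small sketch (ours, not from the paper; the instance sizes below are hypothetical) of the confidence radius $c(s)$ and of the union-bound computation that completes this proof:

```python
import math

def c_radius(s, delta):
    """Hoeffding radius: P(|mu_hat - mu| > c(s)) <= delta for s samples
    in [0,1], with c(s) = sqrt(log(2/delta) / (2s))."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * s))

def failure_prob_bound(T, K):
    """(# empirical means) * delta with delta = T^{-4}: at most
    L + sum_{i<=L} (K-1)*i = O(K log^2 log T) means are ever computed."""
    L = math.floor(math.log2(math.log2(T / (10 * K))))
    n_means = L + (K - 1) * L * (L + 1) // 2
    return n_means * float(T) ** (-4)
```

Because the number of means grows only like $K \log^2 \log T$, the failure probability stays far below the $1/T^2$ target even for modest horizons.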
The total number of empirical means computed is at most
$$L + \sum_{i=1}^{L} (K-1)\, i = O(K L^2) = O\big(K \log^2 \log T\big),$$
since in outer iteration $i$, each of the $K-1$ challenger positions can produce at most one empirical mean at each of the $i$ levels. Hence this total is at most $T^2$ whenever $T \ge cK$ for a sufficiently large universal constant $c$. Applying a union bound over all computed means yields
$$\Pr(E) \ge 1 - T^2 \cdot \delta \ge 1 - \frac{1}{T^2}.$$

Lemma 19 (Accuracy of the current incumbent summary). Fix an outer iteration $i$ and work on the event $E$. At every moment during outer iteration $i$, the stored summary $\hat\mu_i^\star$ satisfies
$$\big|\hat\mu_i^\star - \mu_{a_i^\star}\big| \le c(t_i).$$
In particular, for every challenger block $C_{i,j}$, its frozen benchmark summary satisfies $\big|\bar\mu_{i,j} - \mu_{\bar a_{i,j}}\big| \le c(t_i)$.

Proof. During outer iteration $i$, the quantity $\hat\mu_i^\star$ is initialized to $\hat\mu_i^{\mathrm{init}}$ and can only be updated at the end of a challenger block, when some active arm $\tilde a_{i,j}$ replaces the incumbent, in which case it is set to $\hat\mu_{i,j,i}(\tilde a_{i,j})$. Therefore $\hat\mu_i^\star$ is always either the initial incumbent estimate $\hat\mu_i^{\mathrm{init}}$ or one of the block-level challenger summaries $\hat\mu_{i,j,i}(a)$. Both are covered by the event $E$, and both are based on $t_i$ samples. Hence $\hat\mu_i^\star$ is always within $c(t_i)$ of the true mean of the current incumbent. Since each frozen benchmark summary $\bar\mu_{i,j}$ is just a copy of $\hat\mu_i^\star$ taken at the start of block $j$, the same bound holds for $(\bar a_{i,j}, \bar\mu_{i,j})$.

Proof of Theorem 2. The space bound follows from Lemma 15, the schedule and round-budget bounds from Lemma 16, and the batch bound from Lemma 17. It remains to prove the regret bound. We condition on the event $E$ from Lemma 18. Within a fixed outer iteration, $a_i^\star$ denotes the current incumbent at the moment under discussion; at the end of the iteration it is the final incumbent of that iteration.
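Before the regret argument, the deterministic accounting of Lemmas 16 and 17 and of Eq. (7) can be replayed in a few lines of Python (our own sketch; the values of $T$, $K$, and $S$ are hypothetical):

```python
import math

def schedule(T, K):
    """t_i = ceil(sqrt(t_{i-1} * H)) with t_0 = 1 and H = T/(10K), for i = 1..L."""
    H = T / (10 * K)
    L = math.floor(math.log2(math.log2(H)))
    t = [1]
    for _ in range(L):
        t.append(math.ceil(math.sqrt(t[-1] * H)))
    return t[1:]

def round_and_batch_budget(T, K, S):
    """N_main from Eq. (7), and the batch count 1 + sum_i (1 + J*i) of Lemma 17."""
    t = schedule(T, K)
    # Outer iteration i: one refresh block of t_i rounds, then K-1 arm
    # positions, each receiving the preallocated slots t_1, ..., t_i.
    n_main = sum(t[i] + (K - 1) * sum(t[: i + 1]) for i in range(len(t)))
    J = math.ceil((K - 1) / S)
    batches = sum(1 + J * i for i in range(1, len(t) + 1)) + 1  # + final batch
    return n_main, batches
```

For instance, with the hypothetical values $T = 10^8$, $K = 100$, $S = 10$, the resulting $N_{\mathrm{main}}$ indeed lands inside the window $[T/40,\, 3T/5]$ guaranteed by Lemma 16.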
For each block $C_{i,j}$, let $(\bar a_{i,j}, \bar\mu_{i,j})$ denote the benchmark frozen at the start of that block.

The best arm is never deactivated. Let $a^\star \in \arg\max_{a \in [K]} \mu_a$ denote a true best arm. Fix an outer iteration $i$, let $j^\star$ be the block containing $a^\star$ if $a^\star \ne a_{i,\mathrm{init}}$, and suppose the algorithm tests $a^\star$ at some level $k \le i$ within that block. The deactivation condition would be $\hat\mu_{i,j^\star,k}(a^\star) < \bar\mu_{i,j^\star} - 2c(t_k)$. Under $E$, we have
$$\hat\mu_{i,j^\star,k}(a^\star) \ge \mu^\star - c(t_k), \qquad \bar\mu_{i,j^\star} \le \mu_{\bar a_{i,j^\star}} + c(t_i) \le \mu^\star + c(t_i) \le \mu^\star + c(t_k),$$
where the second inequality uses Lemma 19, and the last one uses the monotonicity of $c(\cdot)$. Therefore
$$\hat\mu_{i,j^\star,k}(a^\star) \ge \bar\mu_{i,j^\star} - 2c(t_k),$$
so the deactivation condition can never hold for $a^\star$. Hence $a^\star$ always survives to level $k = i$ whenever it is tested in outer iteration $i$.

Incumbent true means do not decrease across block updates within an outer iteration. Fix an outer iteration $i$ and a block $C_{i,j}$, and suppose its selected arm $\tilde a_{i,j}$ replaces the current incumbent. Immediately before the update, by the update rule, $\hat\mu_{i,j,i}(\tilde a_{i,j}) \ge \bar\mu_{i,j} + 2c(t_i)$. Under $E$, we have
$$\mu_{\tilde a_{i,j}} \ge \hat\mu_{i,j,i}(\tilde a_{i,j}) - c(t_i) \ge \bar\mu_{i,j} + c(t_i) \ge \mu_{\bar a_{i,j}},$$
where the last inequality uses Lemma 19. Thus every block update can only increase the true mean of the incumbent. Consequently, every incumbent that appears during outer iteration $i$ has true mean at least that of the initial incumbent, which is $a_{i-1}^\star$.

Final incumbent gap bound. Fix an outer iteration $i$. If $a^\star = a_{i,\mathrm{init}}$, then the incumbent is already optimal at the start of the iteration, and the previous paragraph implies that every later incumbent in this iteration is no worse. Thus the final incumbent has gap $0$. Now consider the case $a^\star \ne a_{i,\mathrm{init}}$, and let $j^\star$ be the block containing $a^\star$.
When the algorithm tests $a^\star$ in block $j^\star$, we showed that the best arm is never deactivated, so it reaches level $k = i$. At the end of that block, there are two cases.

If the block update succeeds, then the algorithm installs some arm $\tilde a_{i,j^\star}$ satisfying $\hat\mu_{i,j^\star,i}(\tilde a_{i,j^\star}) \ge \hat\mu_{i,j^\star,i}(a^\star)$, because $\tilde a_{i,j^\star}$ is chosen as the active arm with the largest final empirical mean. Under $E$,
$$\mu_{\tilde a_{i,j^\star}} \ge \hat\mu_{i,j^\star,i}(\tilde a_{i,j^\star}) - c(t_i) \ge \hat\mu_{i,j^\star,i}(a^\star) - c(t_i) \ge \mu^\star - 2c(t_i),$$
so the updated incumbent has gap at most $2c(t_i)$.

If the block update fails, then $\hat\mu_{i,j^\star,i}(\tilde a_{i,j^\star}) < \bar\mu_{i,j^\star} + 2c(t_i)$. Since $a^\star$ is still active at level $i$, maximality of $\tilde a_{i,j^\star}$ implies $\hat\mu_{i,j^\star,i}(a^\star) \le \hat\mu_{i,j^\star,i}(\tilde a_{i,j^\star}) < \bar\mu_{i,j^\star} + 2c(t_i)$. Therefore, under $E$,
$$\mu^\star - \mu_{\bar a_{i,j^\star}} \le \big(\mu^\star - \hat\mu_{i,j^\star,i}(a^\star)\big) + \big(\hat\mu_{i,j^\star,i}(a^\star) - \bar\mu_{i,j^\star}\big) + \big(\bar\mu_{i,j^\star} - \mu_{\bar a_{i,j^\star}}\big) < c(t_i) + 2c(t_i) + c(t_i) = 4c(t_i),$$
and hence the incumbent present after block $j^\star$ has gap at most $4c(t_i)$. Moreover, incumbent means do not decrease within the remaining blocks of outer iteration $i$, so the final incumbent at the end of outer iteration $i$ also satisfies
$$\Delta_{a_i^\star} \triangleq \mu^\star - \mu_{a_i^\star} \le 4c(t_i). \tag{10}$$

Survival implies small gap. Fix an outer iteration $i$, a block $C_{i,j}$, an arm $a \in C_{i,j}$, and a level $k \ge 2$. If arm $a$ reaches level $k$ (i.e., it is not deactivated at level $k-1$), then $\hat\mu_{i,j,k-1}(a) \ge \bar\mu_{i,j} - 2c(t_{k-1})$. Under $E$, we have $\mu_{\bar a_{i,j}} - \mu_a \le 4c(t_{k-1})$, since
$$\mu_{\bar a_{i,j}} - \mu_a \le \big(\mu_{\bar a_{i,j}} - \bar\mu_{i,j}\big) + \big(\bar\mu_{i,j} - \hat\mu_{i,j,k-1}(a)\big) + \big(\hat\mu_{i,j,k-1}(a) - \mu_a\big) \le c(t_i) + 2c(t_{k-1}) + c(t_{k-1}) \le 4c(t_{k-1}),$$
where the first term is bounded using Lemma 19. The initial incumbent of outer iteration $i$ is $a_{i-1}^\star$, and by (10) applied to outer iteration $i-1$,
$$\Delta_{a_{i-1}^\star} \le 4c(t_{i-1}) \le 4c(t_{k-1}).$$
Since incumbent means do not decrease across block updates within an outer iteration, the block benchmark arm $\bar a_{i,j}$ cannot be worse than the initial incumbent $a_{i-1}^\star$, hence $\Delta_{\bar a_{i,j}} \le 4c(t_{k-1})$. Therefore,
$$\Delta_a \le 8c(t_{k-1}). \tag{11}$$

Regret decomposition. Let $k_i(a)$ denote the largest level at which arm $a$ itself is sampled in outer iteration $i$. If $k_i(a) < i$, then the remaining slots $k = k_i(a)+1, \dots, i$ reserved for arm $a$ are filled by the frozen benchmark of its block. This decomposition mirrors the predetermined batch layout. The regret therefore has four contributions: (i) incumbent-refresh blocks at the beginning of the outer iterations, (ii) challenger pulls before deactivation or replacement, (iii) benchmark-filled tail slots after a challenger is deactivated, and (iv) the final exploitation batch.

(i) Incumbent-refresh blocks. At the start of outer iteration $i$, the algorithm pulls the current incumbent (which is $a_{i-1}^\star$) for $t_i$ rounds. The $i = 1$ term is at most $t_1$. For $i \ge 2$, using (10) for iteration $i-1$ yields $\Delta_{a_{i-1}^\star} \le 4c(t_{i-1})$, hence
$$\sum_{i=1}^{L} t_i\, \Delta_{a_{i-1}^\star} \le t_1 + \sum_{i=2}^{L} t_i \cdot 4c(t_{i-1}) = O\Big( L \sqrt{\tfrac{T}{K} \log T} \Big),$$
where we used $\frac{t_i}{\sqrt{t_{i-1}}} = \Theta\big(\sqrt{\frac{T}{10K}}\big)$ from the schedule.

(ii) Challenger slots before deactivation. The challenger pulls of arm $a$ in outer iteration $i$ occur exactly at levels $k = 1, \dots, k_i(a)$ inside the unique block that contains $a$. The $k = 1$ terms contribute at most $O(K L t_1)$ and are dominated by the final bound. Whenever arm $a$ is pulled at some level $k \ge 2$, it must have survived level $k-1$, and thus (11) applies.
Therefore the regret incurred by all challenger pulls at levels $k \ge 2$ is at most
$$\sum_{i=1}^{L} \sum_{a \in [K]} \sum_{k=2}^{k_i(a)} t_k\, \Delta_a \le \sum_{i=1}^{L} \sum_{a \in [K]} \sum_{k=2}^{k_i(a)} t_k \cdot 8c(t_{k-1}) \le \sum_{i=1}^{L} \sum_{a \in [K]} \sum_{k=2}^{i} t_k \cdot 8c(t_{k-1}) = O\Big( \sum_{i=1}^{L} \sum_{a \in [K]} \sum_{k=2}^{i} \frac{t_k}{\sqrt{t_{k-1}}} \sqrt{\log T} \Big) = O\big( L^2 \sqrt{K T \log T} \big),$$
using again $\frac{t_k}{\sqrt{t_{k-1}}} = \Theta\big(\sqrt{\frac{T}{10K}}\big)$.

(iii) Benchmark-filled tail slots. Suppose challenger $a$ is deactivated during outer iteration $i$ inside block $C_{i,j}$. Then the remaining slots reserved for this arm are filled by the frozen benchmark arm $\bar a_{i,j}$ of that block. By the monotonicity established above, every block benchmark appearing during outer iteration $i$ is at least as good as the initial incumbent $a_{i-1}^\star$. Hence, for $i \ge 2$, each such filler slot incurs regret at most $4c(t_{i-1})$ by (10). Therefore the total regret of these benchmark-filled tail slots is at most
$$\sum_{i=2}^{L} \sum_{a \in [K]} \sum_{k = k_i(a)+1}^{i} t_k\, \Delta_{\mathrm{inc},i,a,k} \le \sum_{i=2}^{L} (K-1) \sum_{k=1}^{i} t_k \cdot 4c(t_{i-1}) \le 4(1+\sqrt{2})(K-1) \sum_{i=2}^{L} t_i\, c(t_{i-1}) = O\big( L \sqrt{K T \log T} \big),$$
where $\Delta_{\mathrm{inc},i,a,k}$ denotes the gap of the benchmark arm used to fill that slot. Thus replacing a deactivated challenger by the block benchmark does not change the order of the regret bound.

(iv) Final exploitation. After the main outer-loop schedule $i = 1, \dots, L$ terminates, the algorithm commits to the final incumbent $a_L^\star$ for the remaining $T - N_{\mathrm{main}}$ rounds. By Lemma 16, this final batch length is deterministic, and by the final incumbent gap bound together with $t_L \asymp \frac{T}{10K}$, we have
$$\big(T - N_{\mathrm{main}}\big)\, \Delta_{a_L^\star} \le T \cdot 4c(t_L) = O\big( \sqrt{K T \log T} \big).$$

Combining (i)–(iv) and using $L = O(\log \log T)$ completes the proof under $E$. Taking expectation over $E$ and its complement (whose probability is at most $1/T^2$) yields the stated bound. Therefore the explicit batched formulation attains the same upper bound.
6 Conclusion

We identify a qualitative transition in stochastic bandits under joint constraints on memory and adaptivity. When either resource is available in abundance, near-minimax regret is attainable at surprisingly low cost: logarithmic memory suffices under fully sequential interaction, and a $K$-independent number of batches suffices when memory is unrestricted. Our results show that this benign picture breaks down when the two constraints are imposed simultaneously. In the large-horizon regime, near-minimax regret requires $B = \Omega(K/W)$, even under adaptive grids, and our static-grid upper bound shows that this dependence on $K/W$ is achievable up to polylogarithmic factors. Thus the batch complexity of near-minimax learning is governed by the memory-adaptivity trade-off captured by $K/W$.

Beyond the specific lower bound, the main conceptual message is that statistical performance in interactive learning is limited not only by how much information is acquired, but also by how efficiently that information can be stored and transported across rounds of adaptivity. Our proof makes this precise through an information-bottleneck viewpoint: low regret forces the learner to extract $\Omega(K)$ bits about the hidden good-arm set, while a $B$-batch protocol with a $W$-bit memory budget can preserve only $O(BW)$ bits of instance-dependent information across batch boundaries. In this sense, the relevant resource trade-off is not merely exploration versus exploitation, but information acquisition versus information retention.

On the technical side, the lower bound combines a coarse sampling profile, a per-coordinate information argument, and an event-restricted change-of-measure lemma tailored to under-sampling events.
The last ingredient may be useful beyond the present proof, as it isolates a clean way to compare adaptive-sampling processes while paying only for the first $n$ pulls relevant to the event under consideration.

Several directions remain open. The most immediate algorithmic question is to sharpen the remaining polylogarithmic gap in the $K/W$ trade-off. It would also be interesting to understand whether the same memory-adaptivity bottleneck persists in richer models such as linear, generalized linear, and contextual bandits, where the latent instance information is structured rather than arm-wise independent.

A Deferred Measurability Proofs

A.1 Localized change of measure: formal setup and proofs

This subsection collects the measurable-space formulation and the truncation-based proofs used in Section 4.1.

Formal setup. Fix a measurable reward space $(\mathcal{Y}, \mathcal{S})$. Work on a common measurable space $(\Omega, \mathcal{F})$ supporting a random seed $U \sim \lambda$ and an array of reward variables $(X_{i,\ell})_{i \in [K],\, \ell \ge 1}$ taking values in $\mathcal{Y}$. Let $P_0 := P_{\nu_0}$ and $P_1 := P_{\nu_1}$ denote two probability measures on $(\Omega, \mathcal{F})$ such that:

• $U$ is independent of all reward variables under both $P_0$ and $P_1$;
• the collection $(X_{i,\ell})_{i \in [K],\, \ell \ge 1}$ is mutually independent under both $P_0$ and $P_1$;
• for every $i \ne j$, the stream $(X_{i,\ell})_{\ell \ge 1}$ has the same i.i.d. law under $P_0$ and $P_1$;
• for arm $j$, the stream $(X_{j,\ell})_{\ell \ge 1}$ is i.i.d. with law $P$ under $P_0$ and i.i.d. with law $Q$ under $P_1$.

Write $E_0, E_1$ for the corresponding expectations. The algorithm is a non-anticipating policy: for each $t \ge 1$,
$$A_t = \rho_t\big( U, A_1, R_1, \dots, A_{t-1}, R_{t-1} \big) \in [K],$$
for some measurable map $\rho_t$. The observed reward is
$$R_t := X_{A_t,\, N_{A_t}(t)}, \qquad N_i(t) := \sum_{s=1}^{t} \mathbb{I}\{A_s = i\}.$$
Let $(\mathcal{F}_t)_{t=0}^{T}$ denote the natural filtration generated by the interaction transcript, $\mathcal{F}_t = \sigma(U, A_1, R_1, \dots, A_t, R_t)$.
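The truncation device formalized below can be exercised directly in code: run the same seed-deterministic policy on the original reward array and on a copy whose arm-$j$ stream is replaced by a dummy value after its first $n$ entries; whenever the original run pulls arm $j$ at most $n$ times, the two transcripts must coincide. The sketch below is our own toy instantiation with a greedy policy of our choosing, not the paper's construction:

```python
import random

def run(policy, rewards, T):
    """Interact for T rounds; rewards[i] is the pre-generated stream of arm i."""
    hist, pulls = [], [0] * len(rewards)
    for _ in range(T):
        a = policy(hist)
        hist.append((a, rewards[a][pulls[a]]))
        pulls[a] += 1
    return hist, pulls

def truncate(rewards, j, n, dummy=0.0):
    """Keep arm j's first n rewards, replace the rest by a fixed dummy value."""
    out = [list(s) for s in rewards]
    out[j] = out[j][:n] + [dummy] * (len(out[j]) - n)
    return out

def greedy(hist, K=3):
    """Toy deterministic policy: round-robin once, then largest empirical mean."""
    if len(hist) < K:
        return len(hist)
    sums, cnts = [0.0] * K, [0] * K
    for a, r in hist:
        sums[a] += r
        cnts[a] += 1
    return max(range(K), key=lambda a: sums[a] / cnts[a])

rng = random.Random(0)
T, K, j = 40, 3, 2
rewards = [[rng.random() for _ in range(T)] for _ in range(K)]
hist, pulls = run(greedy, rewards, T)
# On {N_j(T) <= n} the truncated and original transcripts coincide.
for n in range(pulls[j], T + 1):
    assert run(greedy, truncate(rewards, j, n), T) == (hist, pulls)
```

The loop only exercises budgets $n \ge N_j(T)$; for smaller $n$ the transcripts may diverge once the $(n{+}1)$-st pull of arm $j$ occurs, exactly as in the stopping-time argument below.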
For each integer $n \ge 0$, define the prefix sigma-field
$$\mathcal{G}_n := \sigma\big( U,\ (X_{i,\ell})_{i \ne j,\, \ell \ge 1},\ X_{j,1}, \dots, X_{j,n} \big).$$

Lemma 20 (Budget-event measurability). For every event $E \in \mathcal{F}_T$ and every integer $n \ge 0$,
$$E \cap \{ N_j(T) \le n \} \in \mathcal{G}_n.$$

Proof. Fix $n \ge 0$ and choose an arbitrary point $y^\star \in \mathcal{Y}$. Define truncated reward streams by
$$\tilde X^{(n)}_{i,\ell} := \begin{cases} X_{i,\ell}, & i \ne j, \\ X_{j,\ell}, & i = j,\ \ell \le n, \\ y^\star, & i = j,\ \ell > n. \end{cases}$$
Let $\big(\tilde A^{(n)}_t, \tilde R^{(n)}_t\big)_{t=1}^{T}$ be the transcript obtained by running the same policy on the truncated array, namely
$$\tilde A^{(n)}_t = \rho_t\big( U, \tilde A^{(n)}_1, \tilde R^{(n)}_1, \dots, \tilde A^{(n)}_{t-1}, \tilde R^{(n)}_{t-1} \big), \qquad \tilde R^{(n)}_t = \tilde X^{(n)}_{\tilde A^{(n)}_t,\, \tilde N^{(n)}_{\tilde A^{(n)}_t}(t)}, \qquad \tilde N^{(n)}_i(t) := \sum_{s=1}^{t} \mathbb{I}\{\tilde A^{(n)}_s = i\}.$$
By construction, the truncated transcript $\tilde H^{(n)}_T := \big( U, \tilde A^{(n)}_1, \tilde R^{(n)}_1, \dots, \tilde A^{(n)}_T, \tilde R^{(n)}_T \big)$ and the count $\tilde N^{(n)}_j(T)$ are $\mathcal{G}_n$-measurable. Now define the first times at which arm $j$ is pulled for the $(n+1)$-st time:
$$\tau^{(n)}_j := \inf\{ t \ge 1 : N_j(t) = n+1 \}, \qquad \tilde\tau^{(n)}_j := \inf\{ t \ge 1 : \tilde N^{(n)}_j(t) = n+1 \},$$
with the convention $\inf \emptyset = \infty$. We claim that $\tau^{(n)}_j = \tilde\tau^{(n)}_j$. Indeed, we prove by induction on $t$ that the original and truncated transcripts coincide up to every time $t < \tau^{(n)}_j \wedge \tilde\tau^{(n)}_j$. The base case $t = 0$ is trivial. Assume the claim holds up to time $t-1$ for some $t < \tau^{(n)}_j \wedge \tilde\tau^{(n)}_j$. Then the histories up to time $t-1$ are identical, so the actions at time $t$ are identical because the policy is non-anticipating and uses the same seed $U$. If the common action is some $i \ne j$, then the observed rewards coincide because the two arrays agree on arm $i$.
If the common action is $j$, then, since $t < \tau^{(n)}_j \wedge \tilde\tau^{(n)}_j$, both processes still have at most $n$ pulls of arm $j$ after round $t$, so the reward index used at time $t$ is at most $n$ in both processes, and hence the observed rewards also coincide because the two arrays agree on $X_{j,1}, \dots, X_{j,n}$. Thus the transcripts coincide up to time $t$.

If $\tau^{(n)}_j < \tilde\tau^{(n)}_j$, then the transcripts coincide up to time $\tau^{(n)}_j - 1$, so the actions at time $\tau^{(n)}_j$ also coincide. Since $N_j(\tau^{(n)}_j) = n+1$, we must have $A_{\tau^{(n)}_j} = j$, hence also $\tilde A^{(n)}_{\tau^{(n)}_j} = j$, which forces $\tilde\tau^{(n)}_j \le \tau^{(n)}_j$, a contradiction. The case $\tilde\tau^{(n)}_j < \tau^{(n)}_j$ is identical. Therefore $\tau^{(n)}_j = \tilde\tau^{(n)}_j$, and hence
$$\{ N_j(T) \le n \} = \{ \tilde N^{(n)}_j(T) \le n \} \in \mathcal{G}_n.$$
Moreover, on this event the original and truncated full transcripts coincide up to time $T$. Finally, let $H_T := (U, A_1, R_1, \dots, A_T, R_T)$. Since $E \in \mathcal{F}_T = \sigma(H_T)$, there exists a measurable set $\Gamma_E$ in the transcript space such that $E = \{ H_T \in \Gamma_E \}$. Using the coincidence of the original and truncated transcripts on $\{ N_j(T) \le n \}$, we obtain
$$E \cap \{ N_j(T) \le n \} = \{ \tilde H^{(n)}_T \in \Gamma_E \} \cap \{ \tilde N^{(n)}_j(T) \le n \},$$
and the right-hand side is $\mathcal{G}_n$-measurable.

A.2 Formal proof that the boundary transcript determines the sampling profile

Proof of Lemma 13. For each batch $b \in [B]$ and arm $i \in [K]$, define the batch-level pull count
$$N^{(b)}_i := \sum_{t = t_{b-1}+1}^{t_b} \mathbb{I}\{A_t = i\},$$
and let $N^{(b)} := (N^{(b)}_i)_{i \in [K]}$. By the policy-commitment constraint, conditional on $(t_{b-1}, M_{t_{b-1}}, U)$ the entire action plan for batch $b$ is fixed. Since $t_{b-1}$ is one coordinate of the internal grid $\tau$, where $\tau := (t_1, \dots, t_{B-1})$ denotes the realized internal grid, the vector $N^{(b)}$ is $\sigma(M_{t_{b-1}}, U, \tau)$-measurable. Hence there exists a measurable map $c_b$ such that $N^{(b)} = c_b(M_{t_{b-1}}, U, \tau)$ a.s.
Summing over batches yields $N_i(T) = \sum_{b=1}^{B} N^{(b)}_i$, and therefore
$$Y_i := \mathbb{I}\Big\{ \sum_{b=1}^{B} c_{b,i}(M_{t_{b-1}}, U, \tau) \ge n \Big\}, \qquad i \in [K].$$
Since $M_0$ is fixed, this proves that $Y$ is a deterministic function of $(U, \tau, M_{t_1}, \dots, M_{t_{B-1}})$. In the static-grid model, $\tau$ is fixed before interaction and therefore is a deterministic function of $U$ (or simply deterministic if the grid is nonrandom). In the adaptive-grid model, we have
$$t_1 = \psi_1(0, M_0, U), \qquad t_b = \psi_b(t_{b-1}, M_{t_{b-1}}, U), \quad b = 2, \dots, B-1,$$
so $\tau$ is recursively determined by $(U, M_{t_1}, \dots, M_{t_{B-1}})$. Hence in either model $Y$ is a deterministic function of $(U, M_{t_1}, \dots, M_{t_{B-1}})$, as claimed. Even in the adaptive-grid model, the realized grid introduces no additional instance-dependent information beyond the seed and the boundary-memory transcript.
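The commitment structure used in this proof can be made concrete with a toy batched protocol (entirely our own construction; `batch_plan` is a stand-in for the measurable maps $c_b$): each batch's action plan is fixed by the boundary memory and the seed before the batch begins, so the full pull-count profile can be recomputed from $(U, \tau, M_{t_1}, \dots, M_{t_{B-1}})$ alone.

```python
import random

def batch_plan(memory, seed, length, K):
    """Stand-in for c_b: the whole batch's action plan is a deterministic
    function of the boundary memory and the seed."""
    rng = random.Random(memory * 1_000_003 + seed)
    return [rng.randrange(K) for _ in range(length)]

def run_batched(seed, grid, K):
    """Toy B-batch protocol whose memory is the running count of arm 0.
    grid holds the cumulative boundary times (t_1, ..., t_B)."""
    memory, memories, counts = 0, [], [0] * K
    bounds = [0] + grid
    for b in range(len(grid)):
        for a in batch_plan(memory, seed, bounds[b + 1] - bounds[b], K):
            counts[a] += 1
        memory = counts[0]          # memory written at the batch boundary
        memories.append(memory)
    return counts, memories

def profile_from_boundary(seed, grid, memories, K):
    """Recompute the full pull-count profile from (U, tau, M_{t_1}, ...)."""
    counts = [0] * K
    bounds = [0] + grid
    entering = [0] + memories[:-1]  # memory entering each batch (M_0 = 0)
    for b in range(len(grid)):
        for a in batch_plan(entering[b], seed, bounds[b + 1] - bounds[b], K):
            counts[a] += 1
    return counts
```

Replaying the plans from the boundary memories reproduces the profile exactly, mirroring the claim that the realized grid and boundary-memory transcript determine $Y$.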