Online Learning with Improving Agents: Multiclass, Budgeted Agents and Bandit Learners

Sajad Ashkezari (University of Waterloo, sajad.ashkezari@uwaterloo.ca)
Shai Ben-David (University of Waterloo, shai@uwaterloo.ca)

February 20, 2026

Abstract

We investigate the recently introduced model of learning with improvements, where agents are allowed to make small changes to their feature values in order to be granted a more desirable label. We extensively extend previously published results by providing combinatorial dimensions that characterize online learnability in this model, by analyzing the multiclass setup and learnability in a bandit feedback setup, by modeling agents' cost for making improvements, and more.

1 Introduction

With the proliferation of machine-learning-based decision-making tools and their application to societal and personal domains, there has been growing interest in understanding the implications of the use of such tools on the behavior of the individuals influenced by their decisions. One aspect of such implications is addressed under the title of strategic classification. It addresses the possible manipulation of users' data aimed at achieving a desirable classification by the decision algorithm. The research along this line concerns setups where the feature vectors available to the learner differ from the true instance feature vectors in a way that increases the likelihood of some desirable outcomes. For example, users sign up for a gym club to appear healthier to an algorithm assigning life insurance rates. Strategic classification learning aims to develop learning algorithms that mitigate the effects of such data manipulations ([HMPW16], [ABY23], [AYZ24]). Another related line of research aims to design algorithms that incentivize users to change their behavior (and consequently their true attributes) in a direction that improves their label ([MMH20] and more). In this setup, the designer of the algorithm wishes to incentivize the individuals to, say, exercise more.
This paper follows a recent line of work titled Learning with improvements ([ABN+25], [SS25]), where the learner assumes that the affected agents do change their true attributes towards achieving a desired label, but the focus is on accurate prediction of the resulting classification (rather than on incentivizing behavioral change).

We extend the earlier published work on this topic along several axes:

1. While earlier work analyzed only the case of learning with finite hypothesis classes, we characterize learnability via a combinatorial dimension. Our dimension-based analysis implies learnability, with explicit successful learners, for infinite classes as well (Section 3). We note that the mistake bound in this model is always upper bounded by the usual (no improvements) mistake bound; however, in some cases there is a big gap between the two (Observations 1 and 2 in that section).

2. We extend the scope addressed by earlier works on learning with improving agents by analyzing the multiclass case. Namely, we consider setups in which there are (arbitrarily) many possible labels with a user-preference ordering over these labels. This reflects a situation like having one's work evaluated at several appreciation levels (say, your paper may be accepted as a poster, for a spotlight talk, for a full oral presentation, or for a best paper award...) (Section 4).

3. Once we discuss the multiclass setup, there is a natural question about the feedback provided to an online learner. When the labels are binary, feedback indicating whether a predicted instance label is true or false also reveals the true label.
In contrast, with more than two possible labels there is a distinction between full-information feedback, which reveals the correct label, and the partial-information setup, where the feedback is restricted to indicating whether the predicted label was correct or wrong. In Section 5 we analyze this 'bandit' setup (which is irrelevant to the binary classification setup). We provide a combinatorial characterization of the optimal mistake bound for this setup, as well as describe the optimal (mistake-minimizing) learner. In Subsection 5.1 we analyze the price, in terms of additional mistakes, of the learner having only limited bandit feedback.

4. An underlying feature of learning with improvements is the notion of an improvement graph whose nodes are agents' feature vectors and whose edges correspond to the ability of an agent to shift their feature vector (say, "paper with typos" to "paper without typos"). We extend previous work by removing the requirement that these graphs have a bounded degree.

5. Finally, in Section 6, we extend our investigation to the setup in which the agents incur a cost for improving their features.

Related Work

We study the learning with improvements setting initiated by Attias et al. [ABN+25], who studied the problem in the PAC setting. They assume each agent can improve their features to a set of allowed features, if doing so would help them get a more desirable prediction. They show a separation between PAC learning and PAC learning with improvements. They also show that for some classes, allowing improvements makes it possible to find classifiers that achieve zero error, as opposed to arbitrarily small error. This problem was extended by Sharma and Sun [SS25] to the online setting, where the agents are chosen adversarially. However, Sharma and Sun [SS25] only consider finite, binary hypothesis classes. Moreover, they assume the number of points each agent can improve to is bounded.
In this work we first extend their results (in the online setting) to infinite hypothesis classes and general improvement sets. We also introduce a model for studying multiclass hypothesis classes, where we associate each label with a value and each improvement with a cost. Some other prior work studies the problem of incentivizing agents to improve [MMH20; KR20; HILW21; SEA20]. However, following [ABN+25; SS25], we assume a fixed improvement set and try to minimize the classification error.

A widely studied problem, which is closely related to the classification with improvements setting, is strategic classification [HMPW16]. In this setting, the agents can game and manipulate their features to get a better prediction. However, unlike in the improvement setting, the manipulation does not truly change their underlying features. A common framework for this problem is to have a manipulation graph whose nodes represent agents (their features), and an edge between two features means the agents can manipulate their features from one node to the other [ZC21; LU22; LUB23; ABY23; CMMS24; AYZ24]. Notably, Ahmadi et al. [AYZ24] introduce a new dimension that characterizes the optimal mistake bound of strategic online classification for binary labels. While our dimension for the binary setting is similar to theirs, we extend their results in two ways. First, we consider the multiclass setting. Second, they consider the cost of manipulation to be zero for nodes that are connected by an edge (which is w.l.o.g. in the binary setting, as otherwise we could simply remove the edges whose cost is larger than the difference in utility of labels 0 and 1).

Online learning was introduced in the seminal work of Littlestone [Lit88]. The Littlestone dimension of a hypothesis class is defined as the maximum depth of a tree that it shatters. The dimensions that we introduce in this work have a similar prototype.
Our results in the bandit setting, where the learning algorithm only learns whether its prediction was correct or not, as opposed to receiving the correct label, build on results in the classic setting [DSBS15; Lon17].

2 Notation and Setup

We let X and Y denote the instance and label space, respectively. We let X be any discrete space. In the binary setting, we let Y = {0, 1}, and in the multiclass setting, Y can be any finite set. A hypothesis is a function from X to Y. A hypothesis class H ⊆ Y^X is a set of hypotheses. We use H_{x,y} := {h ∈ H : h(x) = y} to denote the subset of hypotheses that label x with y.

We study online learning with improvements on a graph G = (V, E) [SS25], which we call the improvement graph. In this setting, each agent is originally represented by a feature vector in X. Let V = X denote the nodes of the graph. An edge (x, v) ∈ E means that an agent whose original feature vector is x can potentially improve its features to v to get a more desirable prediction. For each node x, let ∆(x) = {v : (x, v) ∈ E} denote the set of its neighbors, which we also refer to as its improvement set. We assume each node x ∈ V of the graph has a self-loop, that is, x ∈ ∆(x). We consider both unweighted and weighted graphs. In the weighted case, we let Cost : E → R⁺ be the weight function, where for an edge e = (x, v) ∈ E, Cost(e) is the cost for an agent to improve its features from x to v. Here we assume that for each x ∈ V, Cost(x, x) = 0. In the case of an unweighted graph, we assume there is no cost for moving, i.e., Cost(e) = 0 for all e ∈ E. We also define a utility function, Val : Y → R, where for y ∈ Y, Val(y) represents how valuable y is.

Online learning with improvements happens in rounds as follows. At time t = 1, 2, ...:

1. The environment presents an agent x^(t) to the learner.
2. The learner implements a hypothesis ĥ^(t).
3.
The agent "best responds" to ĥ^(t) and improves to v^(t).
4. The environment selects y^(t).
5. The learner incurs loss 1[ĥ^(t)(v^(t)) ≠ y^(t)].
6. The learner receives "some feedback" on its prediction.

We say the learning problem is realizable by H if there exists some unknown h* ∈ H such that for all T ∈ N and all t ≤ T, h*(v^(t)) = y^(t).

Here, best response is defined as follows: for any h and x, the agent will improve its features to a v ∈ ∆(x) that maximizes Val(h(v)) − Val(h(x)) − Cost(x, v). If there is no v with Val(h(v)) − Val(h(x)) − Cost(x, v) > 0, the agent doesn't move. In particular, prior work of Sharma and Sun [SS25] studies the binary case with an unweighted graph. In this case, Val(1) > Val(0), and for any x, if h(x) = 0 and there is v ∈ ∆(x) with h(v) = 1, the agent x moves to such a v (in case of multiple such v's, we assume the agent chooses adversarially). Otherwise, the agent doesn't move. Let ∆⁺_h(x) = {x′ ∈ ∆(x) : h(x′) = 1}. Then the improvement set of an agent x w.r.t. a classifier h is as follows: ∆_h(x) = ∆⁺_h(x) if h(x) = 0 and ∆⁺_h(x) ≠ ∅, and ∆_h(x) = {x} otherwise.

In what follows, we study different variations of the problem depending on the label space, the graph, and the type of feedback the learner receives. Unless otherwise stated, we assume the input sequence is realizable by H.

3 Binary Classes

In this section we focus on binary classes with Val(1) > Val(0). We also assume the graph is unweighted. Sharma and Sun [SS25] studied this setting for finite classes and graphs with bounded degree. Specifically, they show an upper bound of (∆_G + 1) · log(|H|), where ∆_G is the maximum degree of a node in G. However, this bound is not tight. In fact, we can readily improve this result to Ldim(H), which denotes the Littlestone dimension of H [Lit88].
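Before turning to the analysis, the best-response rule of Section 2 can be made concrete in a short Python sketch. This is only an illustration: the dictionary-based representations of the improvement sets, Val, and Cost are assumptions of the sketch, not part of the paper's formalism, and ties are broken by iteration order rather than adversarially.

```python
def best_response(x, h, delta, val, cost):
    """Best response of an agent at node x to hypothesis h.

    The agent moves to a neighbor v in delta[x] maximizing
    Val(h(v)) - Val(h(x)) - Cost(x, v), and stays put when no
    move has strictly positive gain.
    """
    best_v, best_gain = x, 0
    for v in delta[x]:
        gain = val[h[v]] - val[h[x]] - cost.get((x, v), 0)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v


# Binary, unweighted example: an agent labeled 0 moves to a
# neighbor labeled 1, matching the rule for Delta_h(x) above.
delta = {"a": ["a", "b"], "b": ["b"]}  # improvement sets (with self-loops)
val = {0: 0, 1: 1}                     # Val(1) > Val(0)
h = {"a": 0, "b": 1}                   # implemented hypothesis
print(best_response("a", h, delta, val, {}))  # -> b
```

In the unweighted binary case the rule degenerates to the ∆_h(x) definition above; with a nonzero cost on the edge (a, b) larger than Val(1) − Val(0), the same agent stays put.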
The Littlestone dimension is defined exactly as in Definition 4, with ∆(x) = {x}. It characterizes the optimal mistake bound in online learning without improving agents [Lit88].

Observation 1. Any learner A for online learning without improvements can be converted to a learner A_I for online learning with improvements. This can simply be done as follows: the learner receives x^(t) and implements h^(t), where h^(t)(x^(t)) = A(x^(t)) and h^(t)(x′) = 0 for all x′ ≠ x^(t); A is updated whenever it makes a mistake. Thus, the agents never change their features, and A_I makes a mistake if and only if A makes a mistake.

The above observation implies that the optimal mistake bound when allowing improvements is at most the optimal mistake bound when we don't. Since the latter is exactly Ldim(H) [Lit88], the former is also bounded by the same term. This improves the previous results, as Ldim(H) ≤ log(|H|). However, as we show next, a finite Littlestone dimension is not necessary for having a finite mistake bound.

Observation 2. Let X = ∪_{i∈N} {x_i, x′_i}, with ∆(x_i) = {x_i, x′_i} and ∆(x′_i) = {x′_i}. Let H ⊆ {0, 1}^X be such that each h ∈ H labels each x′_i with 1, and H can produce any labeling on {x_1, x_2, ...}. Then H has infinite Littlestone dimension. However, a learner that at each round implements g with g(x_i) = 0 and g(x′_i) = 1 for all i does not make any mistakes.

We now generalize the previous results by defining a Littlestone-type dimension and showing that it characterizes the optimal number of mistakes. Interestingly, the dimension that we define here is closely related to the Strategic Littlestone dimension [AYZ24]. In particular, our trees have the same structure; however, our notion of shattering differs.
The intuition behind our definition is that to force a mistake, the adversary must be able to label points in the improvement set with 0, as otherwise the learner can always predict label 1 on those points and the agent will move there, similar to what happens in Observation 2.

Definition 3 (Improvement Littlestone Tree (ILT)). An ILT is a tree whose nodes are labeled by X, such that each node x has a set of |∆(x)| + 1 outgoing edges, labeled by (x, 1) and by (v, 0) for all v ∈ ∆(x) (note that x ∈ ∆(x)). An ILT is said to be shattered by H if all root-to-leaf paths of the form (u^(1), y^(1)), ..., (u^(d), y^(d)) are realizable by H. That is, there exists h ∈ H such that h(u^(t)) = y^(t) for 1 ≤ t ≤ d, where d is the depth of the tree.

Definition 4 (Improvement Littlestone Dimension, ILdim). The Improvement Littlestone Dimension is defined as the maximum depth of an ILT shattered by H. In the case of an unbalanced tree, the depth of the tree is the length of its shortest branch.

Theorem 5. The optimal number of mistakes in the realizable online learning with improvements setting for deterministic learners is ILdim(H).

Proof. The lower bound simply follows by following the maximal tree shattered by H. That is, the adversary starts with the root of the tree, x, and receives the learner's hypothesis h. If ∆_h(x) = {x}, the adversary moves along the edge labeled (x, 1 − h(x)). Otherwise, it chooses any v ∈ ∆_h(x) and moves along the edge (v, 0). It then presents the child along the selected edge and proceeds similarly. By definition of the dimension, this process can continue for at least ILdim(H) rounds, where in each round a mistake is forced. We now show that Algorithm 1 achieves this lower bound.
We claim that each mistake reduces the dimension of the version space by 1, and thus the algorithm makes at most ILdim(H) mistakes, since the dimension is nonnegative. We prove the claim for the two types of mistakes separately.

False positive: In this case, either the agent remained at v^(t) = x^(t), or it moved to some point labeled 1 by the learner's predictor. In the latter case, we know exactly to which point the agent moved (i.e., v^(t)), because by construction we label only this point in the neighborhood with 1. By the prediction rule, we have ILdim(VS_{v^(t),0}) < ILdim(VS), so since in the case of a false positive we set the new version space to be VS_{v^(t),0}, the claim holds.

False negative: In this case, the agent has not moved (otherwise our prediction would have been positive). So v^(t) = x^(t), and the new version space will be VS_{x^(t),1}. Furthermore, by our prediction rule, ILdim(VS_{v,0}) ≥ ILdim(VS) for all v ∈ ∆(x^(t)) (in fact, they are equal, as the dimension can only decrease). Now, assume for the sake of contradiction that ILdim(VS_{x^(t),1}) ≥ ILdim(VS). Then VS_{x^(t),1} and all the VS_{v,0} shatter trees of depth ILdim(VS). We create a new tree whose root is x^(t) and whose outgoing edges are labeled by (x^(t), 1) and by (v, 0) for v ∈ ∆(x^(t)), where each edge connects to the respective tree of depth ILdim(VS). Then it is easy to see that this tree of depth ILdim(VS) + 1 is shattered by VS, which contradicts the maximality of ILdim(VS). Hence, it must be that ILdim(VS_{x^(t),1}) < ILdim(VS) and the dimension decreases.
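On small finite instances, Definitions 3 and 4 and one round of Algorithm 1 can be brute-forced directly. The sketch below is exponential-time and intended only to make the definitions concrete on toy examples; representing hypotheses as dicts, and restricting attention to the neighborhood of the presented point, are assumptions of the sketch.

```python
def shatters(vs, d, delta):
    """Does the version space vs (a list of dict hypotheses) shatter an
    ILT of depth d? A depth-d ILT rooted at x needs a shattered
    depth-(d-1) subtree under each edge: (x, 1) and (v, 0) for v in delta[x]."""
    if d == 0:
        return bool(vs)  # a path is realizable iff some hypothesis survives
    for x in delta:
        if not shatters([h for h in vs if h[x] == 1], d - 1, delta):
            continue
        if all(shatters([h for h in vs if h[v] == 0], d - 1, delta)
               for v in delta[x]):
            return True
    return False

def ildim(vs, delta):
    """Improvement Littlestone dimension by brute force (-1 for an empty class)."""
    d = 0
    while vs and shatters(vs, d + 1, delta):
        d += 1
    return d if vs else -1

def isoa_round(vs, x, delta):
    """One round of Algorithm 1 (ISOA), restricted to the neighborhood of x:
    returns the implemented hypothesis on delta[x] and the point the agent
    ends up at (delta is assumed to contain self-loops)."""
    d = ildim(vs, delta)
    htilde = {v: 1 if ildim([h for h in vs if h[v] == 0], delta) < d else 0
              for v in delta[x]}
    plus = [v for v in delta[x] if htilde[v] == 1 and v != x]
    if htilde[x] == 1 or not plus:
        return htilde, x               # agent stays at x
    h = dict(htilde)
    for u in plus[1:]:                 # disagree with htilde only on plus \ {v}
        h[u] = 0
    return h, plus[0]                  # single improvement target v
```

On the finite analogue of Observation 2 (nodes a, b with ∆(a) = {a, b}, ∆(b) = {b}, and every hypothesis labeling b with 1), this yields ILdim 0, and the ISOA round labels only b with 1, so the agent moves to b and no mistake can be forced.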
Algorithm 1 ISOA
  Initialize VS = H
  for t = 1, 2, ... do
    Receive x^(t)
    Let h̃^(t) be defined on each node x as follows:
      h̃^(t)(x) = 1 if ILdim(VS_{x,0}) < ILdim(VS), and 0 otherwise
    Let ∆⁺_{h̃^(t)} = {x′ ∈ ∆(x^(t)) : h̃^(t)(x′) = 1}
    if h̃^(t)(x^(t)) = 1 or |∆⁺_{h̃^(t)}| = 0 then
      Implement h^(t) ← h̃^(t)
      v^(t) ← x^(t)
    else
      Pick an arbitrary v^(t) ∈ ∆⁺_{h̃^(t)}
      Implement h^(t) that disagrees with h̃^(t) only on ∆⁺_{h̃^(t)} \ {v^(t)}
    end if
    if a mistake was made then
      VS = VS_{v^(t), 1 − h^(t)(v^(t))}
    end if
  end for

4 Full Feedback Multiclass - Unweighted Graph

Here we assume the improvement graph is unweighted and Y = {z_1, ..., z_k}, with the labels sorted in terms of preference: z_1 < z_2 < ... < z_k, where z < z′ means Val(z) < Val(z′). Similar to the binary case, the agents are rational, in the sense that they will improve their features if and only if it allows them to receive a more preferred label. In this section, we assume that upon making a mistake, the learner receives the final feature vector of the agent and its true label. Note that receiving the final feature vector of the agent is only for simplifying the algorithm and is without loss of generality, because our learner only allows a single choice for the agent to improve to (as we saw in the binary case). We address the bandit setting, where the learner doesn't receive the true label, in the next section.

We define a new tree-based dimension, similar to an ILT. The intuition behind the tree that we will define is as follows: on any point x, in order to force a mistake, the adversary must be able to produce two different labels on x itself. Moreover, on any neighbor v ∈ ∆(x), the adversary must be able to force a mistake by either producing two different labels or the minimal label z_1.
Note that in case the agent has moved to v, the learner definitely does not label v with z_1, as otherwise the agent wouldn't have moved to v.

Definition 6 (Multiclass ILT). A multiclass ILT is a tree whose nodes are labeled by X. Each internal node x has the following set of edges:
• Two edges labeled (x, y_1) and (x, y_2) with y_1 ≠ y_2.
• For each v ∈ ∆(x) − {x}: either an edge labeled (v, z_1), or two edges labeled (v, y_3) and (v, y_4) with y_3 ≠ y_4 and y_3, y_4 ≠ z_1.
A multiclass ILT is said to be shattered by H if all of its branches are realizable by H.

Definition 7 (Multiclass ILdim). ILdim(H) is defined as the maximum depth of a multiclass ILT shattered by H, where the depth of a tree is defined as the length of its shortest branch.

Lemma 8. Any deterministic learner in multiclass online learning with improvements makes at least ILdim(H) mistakes.

Proof. The adversary picks a multiclass ILT of maximum depth. It starts by presenting the root of the tree, x. If the learner's hypothesis h is such that the agent won't move, the adversary moves along the edge labeled (x, y) with y ≠ h(x). Otherwise, let v be the node that the agent moves to. If there exists an edge labeled (v, z_1), the adversary moves along that edge; otherwise, it moves along the edge labeled (v, y) with y ≠ h(v). In any case, the learner makes a mistake. The adversary then presents the child along the chosen edge and continues the process in the same way. This process can continue for at least ILdim(H) rounds, and thus the learner makes at least that many mistakes.

Lemma 9. Let VS ⊆ H.
For each x, at least one of the following holds: max_y ILdim(VS_{x,y}) is achieved at a unique y (case 1); or there exists v ∈ ∆(x) − {x} such that max_y ILdim(VS_{v,y}) is achieved at a unique y ≠ z_1 (case 2); or at least one of these maxima is strictly smaller than ILdim(VS) (case 3).

Proof. Assume otherwise, and let L = ILdim(VS). Since case 3 doesn't hold, for all x′ ∈ ∆(x), max_y ILdim(VS_{x′,y}) = L. Since cases 1 and 2 don't hold, there are y_1 ≠ y_2 such that ILdim(VS_{x,y_1}) = ILdim(VS_{x,y_2}) = L, and for each v ∈ ∆(x) − {x}, either ILdim(VS_{v,z_1}) = L or there are y_3 ≠ y_4 such that ILdim(VS_{v,y_3}) = ILdim(VS_{v,y_4}) = L. However, this means VS can shatter a multiclass ILT of depth L + 1, which contradicts the definition of ILdim(VS).

Theorem 10. The optimal number of mistakes achieved by deterministic learners in multiclass online learning with improvements equals ILdim(H). Furthermore, this optimal number is achieved by Algorithm 2.

Proof. The lower bound is proved in Lemma 8. By Lemma 9, at least one of the cases in Algorithm 2 happens. Then, by construction, each mistake reduces the dimension of the version space by at least one. Since the dimension is nonnegative, the algorithm makes at most ILdim(H) mistakes.

5 Bandit Feedback - Unweighted Graph

Here we study the same setting as in the previous section; however, we assume the learner only learns whether it has made a mistake or not, and it won't get the true label. For any hypothesis class VS, we define VS_{x↛y} := {h ∈ VS : h(x) ≠ y}. Unlike the full feedback setting, where the learner received the true label after each mistake, here the learner only learns that the label of some x is not y. Thus, we can update the set of candidate hypotheses, the version space, from VS to VS_{x↛y}.
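In code, this bandit version-space update is a one-liner; the list-of-dicts hypothesis representation is only for illustration.

```python
def bandit_update(vs, x, y):
    """After the learner's prediction y on point x is flagged as wrong,
    keep exactly the hypotheses with h(x) != y -- this is VS_{x -/-> y},
    a weaker restriction than the full-feedback update VS_{x, y_true}."""
    return [h for h in vs if h[x] != y]


vs = [{"a": "z1"}, {"a": "z2"}, {"a": "z3"}]
vs = bandit_update(vs, "a", "z2")   # learn only that the label of "a" is not z2
print([h["a"] for h in vs])         # -> ['z1', 'z3']
```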
Our algorithm will keep track of the version space and predict in such a way that each mistake reduces its dimension.

Algorithm 2 Multiclass ISOA
  Initialize VS = H
  for t = 1, 2, ... do
    Receive x^(t)
    if ∃ x′ ∈ ∆(x^(t)) : max_y ILdim(VS_{x′,y}) < ILdim(VS) then
      Set h^(t)(x′) = z_k
      Set h^(t)(v) = z_1 for all v ∈ ∆(x^(t)) − {x′}
    else if y = argmax_y ILdim(VS_{x^(t),y}) is unique then
      Set h^(t)(x^(t)) = y
      Set h^(t)(v) = z_1 for all v ∈ ∆(x^(t)) − {x^(t)}
    else
      Let x′ ∈ ∆(x^(t)) be such that y = argmax_y ILdim(VS_{x′,y}) is unique and y ≠ z_1
      Set h^(t)(x′) = y
      Set h^(t)(v) = z_1 for all v ∈ ∆(x^(t)) − {x′}
    end if
    if a mistake was made then
      Let v^(t) be the final features of the agent
      Receive the true label y^(t) for v^(t)
      VS = VS_{v^(t),y^(t)}
    end if
  end for

The intuition behind this dimension, which we shortly define, is that the adversary must balance between forcing a mistake and keeping the version space "large enough". On any point x (ignoring improvements for now), to keep this balance and force a mistake, the adversary must ensure that for every label y the learner may predict, VS_{x↛y} is such that the game can continue for as many rounds as possible. We now formally define the tree and the dimension, and then show that it characterizes the optimal mistake bound.

Definition 11 (Bandit Multiclass ILT). A bandit multiclass ILT (BILT) is a tree whose nodes are labeled by X, and each internal node x has the following set of edges:
• k edges labeled (x, z_i) for i ∈ [k].
• For each v ∈ ∆(x) − {x}, k − 1 edges labeled (v, z_i) for i ∈ [k] \ {1}.
A BILT is shattered by H if for every branch (u_1, y_1), ..., (u_d, y_d) there is h ∈ H such that h(u_i) ≠ y_i for all i ≤ d.

Definition 12 (Multiclass BILdim).
BILdim(H) is defined as the maximum depth of a BILT shattered by H, where the depth of a tree is defined as the length of its shortest branch.

Theorem 13. The optimal mistake bound of deterministic learners in multiclass online learning with improvements and bandit feedback, for sequences realizable by H, is BILdim(H).

Proof. The lower bound is achieved by an adversary that follows a BILT of depth BILdim(H) shattered by H. The adversary starts by presenting the root of the tree, x, to the learner and receiving its hypothesis h. Depending on the learner's hypothesis, the agent's final feature vector is some v ∈ ∆(x). The adversary then indicates to the learner that it made a mistake, moves along the edge (v, h(v)), presents the child along that edge, and continues the game in the same way. Since the tree is shattered, there is some h* ∈ H that differs from the learner's predictions in each round for at least BILdim(H) rounds.

The upper bound is achieved by Bandit ISOA (BISOA), which we describe next. Start with VS^(1) = H. At round t ≥ 1, receive x^(t) and implement h^(t) as follows. Let v ∈ ∆(x^(t)) be such that min_{y ∈ Y_v} BILdim(VS^(t)_{v↛y}) < BILdim(VS^(t)), where Y_{x^(t)} = Y and Y_u = Y − {z_1} for u ∈ ∆(x^(t)) − {x^(t)}. Such a v must exist, as otherwise VS^(t) would shatter a tree of depth BILdim(VS^(t)) + 1. Let ŷ^(t) be the minimizer. Then let h^(t)(v) = ŷ^(t) and h^(t)(u) = z_1 for u ∈ ∆(x^(t)) − {v}. If a mistake was made, update VS^(t+1) = VS^(t)_{v↛ŷ^(t)}; otherwise VS^(t+1) = VS^(t). By our prediction rule, each mistake reduces the BILdim of the version space by at least 1; thus, the number of mistakes is at most BILdim(H), since BILdim is nonnegative.

5.1 Price of Bandit Feedback

Here we assume the improvement graph has bounded degree ∆_G, that is, for all x ∈ V, |∆(x)| ≤ ∆_G.
We want to determine, for any H, how many more mistakes the optimal learner with bandit feedback makes compared to the optimal learner with full feedback. We assume the final feature vector of the agent is known to the learner, as it can be inferred similarly to the binary case. For a hypothesis h, we define ∆⁻(h, x) := {x′ ∈ ∆(x) : h(x′) = z_1} and ∆⁺(h, x) := ∆(x) − ∆⁻(h, x). A learning algorithm in the full feedback setting can be denoted by e : (X × Y)* × X → Y^X. For a set of learners (also referred to as experts) E with weights w : E → R⁺ and any subset F ⊆ E, we define w(F) := Σ_{e∈F} w(e). For any such E, we define E_{x,y} := {e ∈ E : e(x) = y}. Moreover, for a learner e, we use e ← (x, y) to denote the learner e updated with the example (x, y). We are now ready to state our results.

Lemma 14. Algorithm 3 makes at most O(∆_G · ILdim(H) · k log(k)) mistakes when given the multiclass ISOA (Algorithm 2) as input. Thus, the price of bandit feedback is at most O(∆_G · k log(k)).

Proof. Let W^(t) := w(E) be the total weight of the experts at round t. We claim that after each mistake, W^(t+1) ≤ (1 − 1/(2k(∆_G + 1))) W^(t). We prove the claim for each type of mistake separately.

Mistake type 1 (the if condition holds): Here h^(t) predicts z_1 on every point in the neighborhood, and thus the agent doesn't move. We show that a constant fraction of the experts have the same behavior. We have w({e ∈ E : |∆⁺(e, x^(t))| = 0}) = W^(t) − w({e : ∃ x′ ∈ ∆(x^(t)), ∃ y ≠ z_1, e(x′) = y}). By our prediction rule, for any x′ and y ≠ z_1, we have w(E_{x′,y}) < W^(t)/(k(∆_G + 1)). By a union bound, w({e : ∃ x′ ∈ ∆(x^(t)), ∃ y ≠ z_1, e(x′) = y}) ≤ Σ_{x′, y≠z_1} w(E_{x′,y}) < (∆_G/(∆_G + 1)) W^(t), and thus w({e ∈ E : |∆⁺(e, x^(t))| = 0}) ≥ (1/(∆_G + 1)) W^(t).
For each such e, we will have multiple experts e_y, as defined in the algorithm, such that the sum of their weights is w_e/2. Thus, W^(t+1) ≤ (1 − 1/(2(∆_G + 1))) W^(t).

Mistake type 2: In this case, the weight of the experts that will be updated is at least W^(t)/(k(∆_G + 1)), and thus, by similar arguments, W^(t+1) ≤ (1 − 1/(2k(∆_G + 1))) W^(t).

Algorithm 3 Bandit-to-Full-Feedback Reduction
  Input: full feedback learner A
  Initialize E = {A}, w_A = 1
  for t = 1, 2, ... do
    Receive x^(t)
    if ∃ x′ ∈ ∆(x^(t)) : max_{y≠z_1} w(E_{x′,y}) ≥ w(E)/(k(∆_G + 1)) then
      Set h^(t)(x′) = argmax_{y≠z_1} w(E_{x′,y})
      Set h^(t)(v) = z_1 for all v ∈ ∆(x^(t)) − {x′}
    else
      Set h^(t)(x′) = z_1 for all x′ ∈ ∆(x^(t))
    end if
    if a mistake was made then
      Let v^(t) be the final features of the agent
      if v^(t) = x^(t) and |∆⁺(h^(t), x^(t))| = 0 then
        For all e ∈ E with |∆⁺(e, x^(t))| = 0 and all y ≠ z_1:
          Add e_y = e ← (x^(t), y) to E with weight w_{e_y} = w_e/(2(k − 1))
          Remove e from E
      else
        For all e with e(v^(t)) = h^(t)(v^(t)) and all y ≠ h^(t)(v^(t)):
          Add e_y = e ← (v^(t), y) to E with weight w_{e_y} = w_e/(2(k − 1))
          Remove e from E
      end if
    end if
  end for

Thus, at round t, if the learner has made N mistakes, we have W^(t) ≤ exp(−N/(2k(∆_G + 1))). On the other hand, there is an expert that has always been updated by a realizable sequence (since in each update we guess all possible true labels). Thus, by the previous guarantees, such an expert e* makes at most ILdim(H) mistakes, and so w_{e*} ≥ (1/(2(k − 1)))^{ILdim(H)}. Since the weights are nonnegative, we must have (1/(2(k − 1)))^{ILdim(H)} ≤ exp(−N/(2k(∆_G + 1))). Thus, N ≤ O(∆_G · ILdim(H) · k log(k)).

6 Full Feedback - Weighted Graph

Here we study the multiclass setting where there is a cost associated with each move. Again, we assume the label space is Y = {z_1, ..., z_k} and that for all i < j, Val(z_i) < Val(z_j).
In the unweighted setting, we took advantage of the fact that if the agent moved from x to v in response to the learner's hypothesis h, then it must be the case that h(v) ≠ z_1. Here, we can make a similar observation: for any y ∈ Y, if Val(y) − Val(h(x)) ≤ Val(y) − Val(z_1) < Cost(x, v), we know that h(v) ≠ y, as otherwise the agent wouldn't have moved to v, since the cost of moving exceeds the gain in value. Thus, the adversary can force a mistake by setting the label of v to such a y. With that in mind, we can now define our dimension. Define Y_{x,v} := {y ∈ Y : Val(y) − Val(z_1) < Cost(x, v)}.

Definition 15. A weighted ILT (WILT) is a tree whose nodes are labeled by V, and each internal node x has the following set of edges:
• Two edges labeled (x, y_1) and (x, y_2) with y_1 ≠ y_2.
• For any v ∈ ∆(x) − {x}, either a single edge labeled (v, y) with y ∈ Y_{x,v}, or two edges labeled (v, y_3) and (v, y_4) for some y_3 ≠ y_4 with y_3, y_4 ∈ Y − Y_{x,v}.
The tree is shattered by H if each branch is realizable, as for the ILdim. The weighted improvement dimension of H, WILdim(H), is the maximum depth of a tree shattered by H.

We can now prove a result similar to Lemma 9. Without loss of generality, we assume that for each x ∈ V and v ∈ ∆(x), Val(z_k) − Val(z_1) > Cost(x, v). Otherwise, no labeling incentivizes the agent to move from x to v, and we could simply remove v from ∆(x).

Lemma 16. Let VS ⊆ H. For each x, at least one of the following holds: max_y WILdim(VS_{x,y}) is achieved at a unique y (case 1); or there exists v ∈ ∆(x) − {x} such that max_y WILdim(VS_{v,y}) is achieved at a unique y ∉ Y_{x,v} (case 2); or at least one of these maxima is strictly smaller than WILdim(VS) (case 3).

Proof. Assume case 3 doesn't happen.
If case 1 does not happen, then there are $y_1, y_2 \in Y$ such that $\mathrm{WILdim}(VS_{x,y_1}) = \mathrm{WILdim}(VS_{x,y_2}) = \mathrm{WILdim}(VS)$. If case 2 does not happen, then for each $v$, either there is $y \in Y_{x,v}$ with $\mathrm{WILdim}(VS_{v,y}) = \mathrm{WILdim}(VS)$, or there are $y_3, y_4 \notin Y_{x,v}$ such that $\mathrm{WILdim}(VS_{v,y_3}) = \mathrm{WILdim}(VS_{v,y_4}) = \mathrm{WILdim}(VS)$. Together, these would imply that $VS$ shatters a tree of depth $\mathrm{WILdim}(VS) + 1$, which is a contradiction. Thus, at least one of the cases must happen.

Theorem 17. The optimal mistake bound of deterministic learners in the full feedback multiclass setting with a weighted graph, for sequences realizable by $H$, equals $\mathrm{WILdim}(H)$.

Proof. The proof is similar to the proof for unweighted graphs. In particular, for the upper bound, we can adapt Algorithm 2 according to the three cases in Lemma 16.

7 Conclusion

In this work we studied online learning with improving agents. Our work extends and improves upon previous results on this topic in several respects: we presented instance-optimal learners that work for any hypothesis class and any improvement graph; we introduced the multiclass version and described instance-optimal learners for both the full feedback and bandit feedback settings; and we studied the setting where a cost function is associated with each possible movement and the agents only move if the utility they gain is more than the cost they pay for moving.

We now raise some questions that we find interesting and leave open for future work.

1. In this paper we only focused on deterministic learners. Can randomized learners achieve better (expected) mistake bounds?

2. We only focused on the realizable setting, where the labels are consistent with some hypothesis in the class (that is known to the learner).
In what way can our results be extended to the agnostic setting, where the adversarially provided labels are not restricted by the hypothesis class?

3. We only considered the classification error. One may care about other costs of learning. For example, even if the learner does not make a mistake, it could be that the agent had to change its feature vector in order to get a desirable label, while had the learner implemented the true classifier, the agent could have saved such a change (and still gotten the desirable label). Such a case can be viewed as incentivizing an unnecessary burden on the agent. It would be interesting to see what the trade-off is between achieving low classification error and minimizing this unnecessary burden.

4. We assume the improvement graph is known to the learner. What can we achieve if we relax this assumption to only provide the learner with some limited prior knowledge about the agent's improvement graph?

5. We assume the learner publishes its implemented classifier; namely, the agents' actions are driven by the learner's hypothesis. Is it possible to learn when the agent is only allowed to access past classifiers? A similar problem has been studied in strategic classification [SXY25].

8 Acknowledgments

Shai Ben-David is supported by an NSERC Discovery Grant and a Canada CIFAR AI Chair.

References

[ABN+25] Idan Attias, Avrim Blum, Keziah Naggita, Donya Saless, Dravyansh Sharma, and Matthew Walter. "PAC Learning with Improvements". In: Forty-second International Conference on Machine Learning. 2025.

[ABY23] Saba Ahmadi, Avrim Blum, and Kunhe Yang. "Fundamental bounds on online strategic classification". In: Proceedings of the 24th ACM Conference on Economics and Computation. 2023, pp. 22–58.

[AYZ24] Saba Ahmadi, Kunhe Yang, and Hanrui Zhang. "Strategic Littlestone Dimension: Improved Bounds on Online Strategic Classification".
In: The Thirty-eighth Annual Conference on Neural Information Processing Systems. 2024.

[CMMS24] Lee Cohen, Yishay Mansour, Shay Moran, and Han Shao. "Learnability gaps of strategic classification". In: The Thirty Seventh Annual Conference on Learning Theory. PMLR. 2024, pp. 1223–1259.

[DSBS15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. "Multiclass learnability and the ERM principle". In: J. Mach. Learn. Res. (2015).

[HILW21] Nika Haghtalab, Nicole Immorlica, Brendan Lucier, and Jack Z. Wang. "Maximizing welfare with incentive-aware evaluation mechanisms". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. IJCAI'20. 2021.

[HMPW16] Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. "Strategic classification". In: Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science. 2016, pp. 111–122.

[KR20] Jon Kleinberg and Manish Raghavan. "How do classifiers induce agents to invest effort strategically?" In: ACM Transactions on Economics and Computation (TEAC) 8.4 (2020), pp. 1–23.

[Lit88] Nick Littlestone. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm". In: Machine Learning (1988).

[Lon17] Philip M. Long. "New bounds on the price of bandit feedback for mistake-bounded online multiclass learning". In: Proceedings of the 28th International Conference on Algorithmic Learning Theory. Ed. by Steve Hanneke and Lev Reyzin. Vol. 76. Proceedings of Machine Learning Research. PMLR, 2017, pp. 3–10.

[LU22] Tosca Lechner and Ruth Urner. "Learning losses for strategic classification". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. 7. 2022, pp. 7337–7344.

[LUB23] Tosca Lechner, Ruth Urner, and Shai Ben-David. "Strategic Classification with Unknown User Manipulations".
In: Proceedings of the 40th International Conference on Machine Learning. Ed. by Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 18714–18732.

[MMH20] John Miller, Smitha Milli, and Moritz Hardt. "Strategic Classification is Causal Modeling in Disguise". In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, 2020, pp. 6917–6926.

[SEA20] Yonadav Shavit, Benjamin Edelman, and Brian Axelrod. "Causal strategic linear regression". In: International Conference on Machine Learning. PMLR. 2020, pp. 8676–8686.

[SS25] Dravyansh Sharma and Alec Sun. "Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning". In: The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025.

[SXY25] Han Shao, Shuo Xie, and Kunhe Yang. "Should Decision-Makers Reveal Classifiers in Online Strategic Classification?" In: Forty-second International Conference on Machine Learning. 2025.

[ZC21] Hanrui Zhang and Vincent Conitzer. "Incentive-Aware PAC Learning". In: AAAI Conference on Artificial Intelligence. 2021.
