Gaming Helps! Learning from Strategic Interactions in Natural Dynamics
Authors: Yahav Bechavod, Katrina Ligett, Zhiwei Steven Wu, Juba Ziani
Yahav Bechavod*, Katrina Ligett†, Zhiwei Steven Wu‡, Juba Ziani§

March 2, 2021

Abstract

We consider an online regression setting in which individuals adapt to the regression model: arriving individuals are aware of the current model, and invest strategically in modifying their own features so as to improve the predicted score that the current model assigns to them. Such feature manipulation has been observed in various scenarios, from credit assessment to school admissions, posing a challenge for the learner. Surprisingly, we find that such strategic manipulations may in fact help the learner recover the meaningful variables, that is, the features that, when changed, affect the true label (as opposed to non-meaningful features that have no effect). We show that even simple behavior on the learner's part allows her to simultaneously i) accurately recover the meaningful features, and ii) incentivize agents to invest in these meaningful features, providing incentives for improvement.

* School of Computer Science and Engineering, The Hebrew University. Email: yahav.bechavod@cs.huji.ac.il.
† School of Computer Science and Engineering, The Hebrew University. Email: katrina@cs.huji.ac.il.
‡ School of Computer Science, Carnegie Mellon University. Email: zstevenwu@cmu.edu.
§ Warren Center for Network and Data Sciences, University of Pennsylvania. Email: jziani@seas.upenn.edu.

1 Introduction

As algorithmic decision-making takes a more and more important role in myriad application domains, incentives emerge to change the inputs presented to these algorithms.
Recently, a collection of very interesting papers has explored various models of strategic behavior on the part of the classified individuals in learning settings, and ways to mitigate the harms to accuracy that can arise from falsified features [6, 1, 11, 8]. Additionally, some recent work has focused on the design of learning algorithms that incentivize the classified individuals to make "good" investments in true changes to their variables [15].

The present paper takes a different tack, and explores another potential effect of strategic investment in true changes to variables, in an online learning setting: we claim that interaction between the online learner and the strategic individuals may actually aid the learning algorithm in identifying meaningful variables. By meaningful, we mean, informally, and within the context of this paper, variables for which changing their true value affects the true label and thus may lead agents to improve. In contrast, non-meaningful variables do not affect the true label; such features are susceptible to gaming, as they can potentially be used to obtain better outcomes with respect to the posted model without actually improving true labels.

The idea is quite simple. First, if a learning algorithm's hypothesis at a particular round depends heavily on a certain variable, this incentivizes the arriving individual to invest in improving that variable. If that variable were meaningful (that is, it has an effect on the true label), then the learner would observe an improved true label, increasing the observed correlation between the variable and the label. However, if that variable were non-meaningful, the changes would not have an effect on the true label, reducing the observed correlation between the variable and the label.
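As a toy numeric illustration of this mechanism (our own sketch, not from the paper; all quantities are invented), consider agents who invest a random amount in a single feature, while the label responds only to the true value of the meaningful feature. Investment in the meaningful feature strengthens the observed feature-label correlation; investment in a non-meaningful proxy weakens it:

```python
import numpy as np

# Toy simulation (ours, not the paper's) of the key intuition: agents invest
# a random amount delta in ONE feature; the label responds only to the true
# value of the meaningful feature.
rng = np.random.default_rng(1)
n = 20000
x = rng.uniform(-1, 1, n)        # meaningful feature; the proxy feature equals x
delta = rng.uniform(0, 2, n)     # heterogeneous investments (random budgets)
eps = rng.normal(0, 0.3, n)      # label noise

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

base = corr(x, x + eps)                        # no investment

# Invest in the meaningful feature: the label moves along with it.
meaningful = corr(x + delta, (x + delta) + eps)

# Invest in the proxy: the label ignores the investment.
proxy = corr(x + delta, x + eps)

print(meaningful > base > proxy)   # True
```

The heterogeneity of the investments matters here: a constant shift would leave the correlation unchanged, while agent-specific shifts add variance that is either aligned with the label (meaningful feature) or pure noise relative to it (proxy).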
Second, if a learning algorithm improves its hypotheses over time, this changing sequence of incentives should encourage investment in a variety of promising variables, exposing those that are meaningful. This process should naturally induce the learner to shift its dependence towards meaningful variables, thereby incentivizing individuals to invest in improving as opposed to gaming, resulting in an overall higher-quality population.

The goal of this paper is to highlight this potential beneficial effect of the interaction between online learning and strategic modification. To do so, we choose to focus our study on a simple linear regression setting. In our model, there is a true underlying latent regression parameter vector $\beta^*$, and there is an underlying distribution over unmodified feature vectors. On every round $t$, the learner must announce a regression vector $\hat{\beta}_t$.[Footnote 1] An individual then appears, with an unmodified feature vector $x_t$ chosen i.i.d. from the distribution. Before presenting himself to the learner, the individual observes $\hat{\beta}_t$ and has the opportunity to invest in changing his true features to some $\bar{x}_t$; we focus on a simple model wherein the individual's investment results in a targeted change to a single variable. The individual then receives utility $\langle \hat{\beta}_t, \bar{x}_t \rangle$, and the learner gets feedback $\bar{y}_t = \beta^{*\top} \bar{x}_t + \varepsilon_t$, where $\varepsilon_t$ is some noise.

Within this simple model, we consider simple behaviors for both the learner and the individuals: at each time $t$, the individual modifies his features so as to maximize his utility given the posted $\hat{\beta}_t$; periodically, the learner updates $\hat{\beta}_t$ with her best estimate of $\beta^*$ given the (modified) features and labels she has observed, via least-squares regression. Our main result is that under this simple behavior, the learner recovers $\beta^*$ accurately, after observing sufficiently many individuals.
[Footnote 1: Eventually, the learner we will consider does not update its regression vector at every round, but rather periodically, so that individuals can be treated in batches.]

Our result is divided in two parts: first, we show that least-squares regression accurately recovers $\beta^*$ with respect to features that many individuals have invested in. Second, we show that these dynamics incentivize investments in every feature, leading to accurate recovery of $\beta^*$ in its entirety, under an assumption on how the learner breaks ties between multiple least-squares solutions. Our accuracy guarantees for a feature improve with the number of times that feature is invested in.

It is important to emphasize that we focus on a setting in which individuals' modifications (which we refer to interchangeably as "manipulations") of their variables can be true investments (e.g., studying to achieve better mastery of material before an exam; the exam score is the variable and the mastery level is the label) rather than deceitful manipulations (e.g., cheating on the exam to achieve a higher score without improving mastery). Deceitful manipulations would not help to expose meaningful variables, because such changes would never affect the true label (subject mastery), regardless of whether the manipulations were in meaningful or non-meaningful variables.

Notice that any discovery of meaningful variables that occurs in our model is a result of the interaction between the online learner and the strategic individuals. On the one hand, online learning with no strategic response has no ability to distinguish non-meaningful variables from meaningful ones when the two are correlated.
On the other hand, if strategic individuals faced with a static scoring algorithm tried to maximize their scores by investing in a non-meaningful feature, the resulting information would be insufficient for an observer to draw conclusions about whether other features are meaningful or not.

For example, historical data might show that both a student's grades in high school and the make of car his parents drive to the university visit day are predictive of success in university. Suppose, for simplicity, that success in high school is causally related to success in university, but that make of parents' car is not, and is merely a proxy for other features that control one's chances of success in college. If the university admissions process put large weight on high school grades, that would incentivize students to invest effort in performing well in high school, which would also observably pay off in university, which would reinforce the emphasis on high school grades. If the admissions process put large weight on the make of car in which students arrive to the visit day, that would incentivize renting fancy cars for visits. However, this would result in a different distribution over the observed student variables, and on this modified distribution the correlation between cars and university success would be weakened, and therefore the admissions formula would not perform well. In future years, the university would naturally correct the formula to de-emphasize cars.

It is important to note that our work operates under a simplifying assumption with regards to the underlying structure of the problem (introduced in Section 3). Adding an assumption of this kind is necessary, since in the general case recovering the exact model structure is hard.
Our work thus aims to bring attention to a natural mechanism, based on re-training, for exposing meaningful variables, that we believe is worthy of further attention.

2 Related Work

Much of the work on learning assumes that an individual's data is a fixed input that is independent of the algorithm used by the decision-maker. In practice, however, individuals may try to adapt to the model in place in order to improve their outcomes. A recent line of work studies such strategic behavior in classification settings. Part of this line of work concerns itself with the negative consequences of strategic behavior, when individuals aim to game the model in place; for example, individuals may manipulate their data or features (often at a cost) in an effort to obtain positive qualification outcomes or otherwise manipulate an algorithm's output [6, 19, 7, 1, 14, 12, 2, 11, 8, 4, 3], or even to protect their privacy [9, 5]. The goal in these results is to provide algorithms whose outputs are robust to such gaming. [17] and [13] focus on the social impact of robust classification, and show that i) robust classifiers come at a social cost (by forcing even qualified individuals to invest in costly feature manipulations in order to be classified positively) and ii) disparate abilities to game the model inevitably lead to unfair outcomes.

Another part of this line of work instead sees strategic manipulation as possibly positive, when the classifier incentivizes individuals to invest in true improvements to their features; e.g., a student may decide to study and actually improve his knowledge in order to raise his test score. [15], [23], [21], [22] and [10] study how to incentivize agents to invest effort in modifying meaningful features that improve their labels.
Much of this line of work assumes that the decision-maker already understands which features are meaningful and affect agents' labels or outcomes, and which do not. In contrast, we consider a setting where the decision-maker does not initially know which features affect agents' labels, and aims to leverage the agents' strategic behavior to expose which features these are.

Most closely related to this paper is the work of [16], as well as the concurrent works of [18] and [20]. [16] formalize the distinction between gaming and actual improvements by drawing a connection to causality and introducing causal graphs that model the effects of the features and target variables on each other. They show that in such settings, the decision-maker should incentivize actual improvements rather than gaming, and that designing good incentives that push agents to improve is at least as hard as causal inference. [20] study the sample complexity of learning a linear regression model so as to either i) maximize the accuracy of the predictions, ii) maximize the agents' self-improvements, or iii) recover the causality structure of their problem. [18] show how re-training can lead to stable and optimal outcomes when the learner's model affects the distribution of agent features and labels; while our paper considers a similar re-training framework, our assumptions differ from those of [18].

3 Model

We consider a linear regression setting where the learner estimates the regression parameters based on strategically manipulated data from a sequence of agents over rounds. There is a true latent regression parameter $\beta^* \in [-1, 1]^d$ that generates an agent's label as a function of his feature vector. That is, for any agent with feature vector $x \in [-1, 1]^d$, the real-valued label $y$ is obtained via $y = \beta^{*\top} x + \varepsilon$, where $\varepsilon$ is a noise random variable with $|\varepsilon| \le \sigma$ and $E[\varepsilon \mid x] = 0$.
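As a minimal sketch (our own illustration, not code from the paper; the numeric values are assumptions), the label model above can be instantiated with a degenerate feature distribution in which a non-meaningful feature is an exact proxy of the meaningful one. Unmodified data then cannot identify $\beta^*$:

```python
import numpy as np

# Sketch of the label model y = <beta*, x> + eps with a degenerate feature
# distribution: feature 2 is an exact copy of the meaningful feature 1.
rng = np.random.default_rng(0)

d = 2
beta_star = np.array([1.0, 0.0])   # feature 1 meaningful, feature 2 not
sigma = 0.1                        # noise bound (assumed value)

n = 1000
x1 = rng.uniform(-1, 1, size=n)
X = np.column_stack([x1, x1])      # x(2) = x(1): proxy feature
eps = rng.uniform(-sigma, sigma, size=n)   # bounded noise
y = X @ beta_star + eps

# The covariance of unmodified features is rank-deficient (rank 1 < d):
cov_rank = np.linalg.matrix_rank(np.cov(X.T))
print(cov_rank)   # 1

# Any beta(alpha) = (alpha, 1 - alpha) scores agents exactly like beta*:
alpha = 0.3
print(np.allclose(X @ np.array([alpha, 1 - alpha]), X @ beta_star))   # True
```

This is precisely the kind of rank-deficient observation set the paper's later examples are built on: without modifications, $\beta^*$ is indistinguishable from a continuum of alternative parameter vectors.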
We also refer to an individual's features as variables. There is a distribution over the unmodified features $x$ in $[-1, 1]^d$; we let $\mu$ be the mean and $\Sigma$ be the covariance matrix of this distribution. We note that the distribution of unmodified features may be degenerate, i.e., $\Sigma$ may not be full-rank. For example, this can happen in settings in which the non-meaningful features are merely proxies for the meaningful features (i.e., those that really control the label); in that case, one may imagine that the non-meaningful features are (possibly randomized) functions of the meaningful features, leading in particular to low-rank observations when few features are meaningful. Throughout the paper, we set $\mu = 0$.[Footnote 2]

[Footnote 2: This can be done whenever the learner can estimate the mean feature vector, since the learner can then center the features. The learner could estimate the mean by using unlabeled historical data; for example, she could collect data during a period when the algorithm does not make any decision on the agents, so that they would have no incentive to modify their features.]

The agents and the learner interact in an online fashion. At time $t$, the learner first posts a regression estimate $\hat{\beta}_t \in \mathbb{R}^d$; then an agent (indexed by $t$) arrives with their unmodified feature vector $x_t$. Agent $t$ modifies the feature vector $x_t$ into $\bar{x}_t$ in response to $\hat{\beta}_t$, in order to improve their assigned score $\langle \hat{\beta}_t, \bar{x}_t \rangle$. Finally, the learner observes the agent's realized label after feature modification, given by $\bar{y}_t = \beta^{*\top} \bar{x}_t + \varepsilon_t$.

Meaningful vs. non-meaningful features. When an agent modifies a feature $k$, this may also affect the agent's true label. We divide the coordinates of any feature vector $x$ into meaningful and non-meaningful features; meaningful features inform and control an agent's label, while non-meaningful features are those that can be manipulated without directly affecting an agent's label.
(One can think, intuitively, of the meaningful features as causal, and the non-meaningful features as non-causal, but the language of causality is typically reserved for more complex settings than ours.) Formally, for any $k \in [d]$, feature $k$ is meaningful if and only if the coordinate $\beta^*(k) \neq 0$, and non-meaningful if and only if $\beta^*(k) = 0$. An agent $t$ can modify his true label by modifying meaningful features. As such, note that $\beta^*$ captures the underlying model structure of our problem. The magnitude of each coordinate of $\beta^*$ captures the extent to which the corresponding feature is meaningful and affects the agents' labels.

We remark that strategic agents, who best-respond to the learner's model to improve their regression outcomes, may at times have incentives to manipulate a feature $k$ such that $\beta^*(k) = 0$; this can happen when the learner sets $\hat{\beta}(k) \neq 0$. In such cases, agents can improve their regression outcomes without improving their true label, which we refer to as gaming. When agents modify a feature $k$ that aligns with the true model, we refer to such a modification as an improvement.

Agents' responses. Agents are strategic: they modify their features so as to maximize their own regression outcome;[Footnote 3] modifications are costly and agents are budgeted. We assume agent $t$ incurs a linear cost $c_t(\Delta_t) = \sum_{k=1}^d c_t(k) |\Delta_t(k)|$ to change his features by $\Delta_t$, and has a total budget of $B_t$ to modify his features. The pairs $(\{c_t(k)\}_{k \in [d]}, B_t)$ are drawn i.i.d. from a distribution $\mathcal{C}$ that is unknown to the learner. We assume $\mathcal{C}$ has discrete support $\{(c^1, B^1), \ldots, (c^l, B^l)\}$, and we denote by $\pi_i$ the probability that $(c_t, B_t) = (c^i, B^i)$. We assume $c^i(k) > 0$ and $B^i > 0$ for all $i \in [l]$, $k \in [d]$; that is, every agent can modify his features, but no feature can be modified for free.[Footnote 4]
[Footnote 3: Importantly, our agents' goal is not to cooperate with the learner. Agents are self-interested and aim to maximize their own regression outcomes; they do not actively seek to help the learner improve the accuracy of her model. The agents prefer when the learner emphasizes features that are easier to manipulate, even if said features are non-meaningful. These incentives may be ill-aligned with the learner's goal of optimizing predictive power and recovering model structure, which requires putting more weight on meaningful features.]

[Footnote 4: In our model, modifying a feature affects only that feature and the label, but does not affect the values of any other features. We leave exploration of more complex models of feature intervention to future work.]

When facing regression parameters $\hat{\beta}$, agent $t$ solves

$$M(\hat{\beta}, c_t, B_t) = \operatorname{argmax}_{\Delta_t} \ \hat{\beta}^\top (x_t + \Delta_t) \quad \text{s.t.} \quad \sum_{k=1}^d c_t(k) |\Delta_t(k)| \le B_t.$$

That is, agent $t$ strategically aims to maximize his predicted outcome given a budget of $B_t$ for modifying his features, when facing model $\hat{\beta}$. The solution of the above program does not depend on $x_t$, only on $\hat{\beta}$ and $(c_t, B_t)$, and is given by

$$\Delta_t = \sum_{k=1}^d \operatorname{sgn}(\hat{\beta}(k)) \, \mathbb{1}\!\left(k = \operatorname{argmax}_j |\hat{\beta}(j)| / c_t(j)\right) \frac{B_t}{c_t(k)},$$

up to tie-breaking; when several features maximize $|\hat{\beta}(j)| / c_t(j)$, the agent modifies a single one of these features. We call $D_\tau$ the set of features that have been modified by at least one agent $t \in [\tau]$.

Remark 3.1. We make the linearity assumption on the cost functions for simplicity. Our results extend to a more general class of cost functions that do not induce modifications wherein several features are modified in a perfectly correlated fashion.
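The closed form above can be sketched in a few lines (our own code, with a hypothetical helper name `best_response`; the example numbers are invented): under a linear cost and a budget, the whole budget goes into the single feature with the best "bang per buck" $|\hat{\beta}(j)| / c(j)$, pushed in the direction $\operatorname{sgn}(\hat{\beta}(k))$.

```python
import numpy as np

def best_response(beta_hat, c, B):
    """Return the modification Delta maximizing beta_hat . (x + Delta)
    subject to sum_k c[k] * |Delta[k]| <= B (independent of x)."""
    ratio = np.abs(beta_hat) / c          # value per unit of cost, per feature
    k = int(np.argmax(ratio))             # ties broken by lowest index here
    delta = np.zeros_like(beta_hat, dtype=float)
    if beta_hat[k] != 0:                  # if beta_hat is all zeros, no move helps
        delta[k] = np.sign(beta_hat[k]) * B / c[k]
    return delta

beta_hat = np.array([0.5, -1.0, 0.2])
c = np.array([1.0, 4.0, 1.0])             # per-unit costs (assumed values)
B = 2.0
print(best_response(beta_hat, c, B))      # feature 0 wins: 0.5/1 > 1/4 > 0.2/1
```

Note that the response moves feature 1 downward when $\hat{\beta}(k) < 0$ is the best ratio, matching the $\operatorname{sgn}$ term; the deterministic lowest-index tie-break here is one arbitrary instance of the "modifies a single one of these features" rule.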
The key technical insight we need is that the manipulations are full-rank in the subspace defined by the features that have been manipulated so far, defined as $V_{\tau(E)}$ in the paper. Very strong feature correlations (which may also be thought of as possible "directions" for modification) imply a very small minimum eigenvalue of the observation matrix, making recovery harder and increasing sample complexity. This is unavoidable: the more features are correlated, the harder they are to distinguish information-theoretically; if two features were perfectly correlated, it would be impossible to know which one affected the label. In Theorem 4.1, we encode this correlation between modifications across features in a parameter we call $\lambda$. As feature modifications become more and more correlated, the value of $\lambda$ becomes smaller and our recovery guarantees weaken.

Natural learner dynamics: batch least-squares regression. Our goal here is to identify simple, natural learning dynamics that expose meaningful variables. Note that a simple way for the learner to expose and leverage meaningful variables is to use an explore-first, then-exploit type of algorithm: initially, the learner can post a model that focuses on a single feature at a time, to observe how changing this feature affects the distribution of agents' labels. After sequentially exploring each feature, the learner obtains an accurate estimate of $\beta^*$ that she can deploy for the remainder of the time horizon. However, one may want to avoid such an approach that artificially separates features in practice: posting models that ignore most of an agent's attributes for the sake of learning may not be desirable in real life.
A bank may not want to offer loans "blindly" and willingly ignore most of a customer's data when making lending decisions, just for the purpose of learning which features are predictive of an agent's ability to repay loans. Instead, in this paper, we focus on algorithms based on re-training: i.e., periodically, the learner updates her model based on the data she has observed so far, so as to keep it consistent with the history of agent behavior. A bank may be willing to periodically update their loan decision rule in order to keep up with new, unexpected agent behavior.

While re-training leads to more natural dynamics than a "naive" explore-then-exploit approach, it comes with new technical challenges. In particular, periodic re-training leads to adaptivity: indeed, as the model posted in the current period depends on past data, and the agents' strategic behavior depends on the model in place, the observed modified data in each period depends on the data in all previous periods. In turn, we cannot treat data points as independent across periods.

The dynamics we consider are formally given in Algorithm 1. It is possible that more sophisticated learning algorithms could yield better guarantees with respect to regret and recovery; the focus of this paper is on simple and natural dynamics rather than optimal ones.

When the learner updates her regression parameters, say at time $\tau$, she does so based on the agent data observed up until time $\tau$. We model the learner as picking $\hat{\beta}$ from the set $LSE(\tau)$ of solutions to the least-squares regression problem run on the agents' data up until time $\tau$, formally defined as

$$LSE(\tau) = \operatorname{argmin}_\beta \sum_{t=1}^{\tau} \left( \bar{x}_t^\top \beta - \bar{y}_t \right)^2.$$

We introduce notation that will be useful for regression analysis. We let $\bar{X}_\tau \in \mathbb{R}^{\tau \times d}$ be the matrix of (modified) observations up until time $\tau$.
Each row corresponds to an agent $t \in [\tau]$, and agent $t$'s row is given by $\bar{x}_t^\top$. Similarly, let $\bar{Y}_\tau = (\bar{y}_t)_{t \in [\tau]}^\top \in \mathbb{R}^{\tau \times 1}$. We can rewrite, for any $\tau$,

$$LSE(\tau) = \operatorname{argmin}_\beta \left( \bar{X}_\tau \beta - \bar{Y}_\tau \right)^\top \left( \bar{X}_\tau \beta - \bar{Y}_\tau \right). \quad (1)$$

Agents are grouped in epochs. The time horizon $T$ is divided into epochs of size $n$, where $n$ is chosen by the learner. At the start of every epoch $E$, the learner updates the posted regression parameter vector as a function of the history of $(\bar{x}_t, \bar{y}_t)$ up until epoch $E$. We let $\tau(E) = En$ denote the last time step of epoch $E$. $D_{\tau(E)}$ denotes the set of features that have been modified by at least one agent by the end of epoch $E$.

Algorithm 1: Online Regression with Epoch-Based Strategic Modification (Epoch size $n$)
  Learner picks (any) initial $\hat{\beta}_0$.
  for every epoch $E \in \mathbb{N}$ do
    for $t \in \{(E-1)n + 1, \ldots, En\}$ do
      Agent $t$ reports $\bar{x}_t \in M(\hat{\beta}_{E-1}, c_t, B_t)$.
      Learner observes $\bar{y}_t = \beta^{*\top} \bar{x}_t + \varepsilon_t$.
    end
    Learner picks $\hat{\beta}_E \in LSE(\tau(E))$.
  end

Examples. We first illustrate why unmodified observations are insufficient for any algorithm to distinguish meaningful from non-meaningful features. Consider a setting where the non-meaningful features, as merely proxies for the meaningful features, are in fact convex combinations of these meaningful features in the underlying (unmodified) distribution. Absent additional information, a learner would be faced with degenerate sets of observations that have rank strictly less than $d$, which can make accurate recovery of the model structure impossible:

Example 3.2. Suppose $d = 2$, $\beta^* = (1, 0)$. Suppose feature 1 is meaningful and feature 2 is non-meaningful and is correlated with feature 1: the distribution of unmodified features is such that for any feature vector $x$, feature 2 is identical to feature 1, i.e., $x(2) = x(1)$.
Then, any regression parameter of the form $\beta(\alpha) = (\alpha, 1 - \alpha)$ for $\alpha \in \mathbb{R}$ assigns agents the same score as $\beta^*$. Indeed,

$$\beta^{*\top} x = x(1) = \alpha x(1) + (1 - \alpha) x(2) = \beta(\alpha)^\top x.$$

In turn, in the absence of additional information other than the observed features and labels, $\beta^*$ is indistinguishable from any $\beta(\alpha)$, many of which recover the model structure poorly (e.g., consider any $\alpha$ bounded away from 1).

At this point, a reader may wonder why it is important in Example 3.2 to recover the true model $\beta^*$, rather than simply any vector $\beta$ that is consistent with all the data observed so far. A major reason to do so is that only the true model $\beta^*$ can guarantee robustness in response to agent modifications, and accurately predict labels after agents have changed their features. This is illustrated in Example 3.3 below:

Example 3.3. Consider the setting of Example 3.2, and imagine agents have much lower cost for manipulating feature 2 than feature 1. Then, posting a regression parameter vector of the form $(\alpha, 1 - \alpha)$ where $\alpha$ is small enough may lead agents to modify the second, non-meaningful feature. When facing such a modification of the form $\Delta = (0, \Delta(2))$, $(\alpha, 1 - \alpha)$ predicts label

$$\alpha x(1) + (1 - \alpha)(x(2) + \Delta(2)) = x(1) + (1 - \alpha)\Delta(2)$$

for an agent with $x(1) = x(2)$, while the true label is given by $\beta^{*\top}(x + \Delta) = x(1)$. In turn, the predicted and true labels are different for any $\alpha \neq 1$.

We next illustrate that strategic agent modifications may aid in recovery of meaningful features, but only for those features that individuals actually invest in changing:

Example 3.4. Consider a setting where $d = 3$, feature 1 is meaningful, and features 2 and 3 are non-meaningful and are correlated with feature 1 as follows: for any feature vector $x$, $x(2) = x(3) = x(1)$. Let $\beta^* = (1, 0, 0)$.
Consider a situation in which the labels are noiseless (i.e., $\varepsilon = 0$ almost surely). Suppose that agents only modify their meaningful feature, by a (possibly random) amount $\Delta(1)$. Note that the difference (in absolute value) between the score obtained by applying a given regression parameter $\hat{\beta}$ and the score obtained by applying $\beta^*$ to feature vector $x$ is given by

$$\left| \hat{\beta}^\top x - \beta^{*\top} x \right| = \left| \hat{\beta}(1)(x(1) + \Delta(1)) + \hat{\beta}(2) x(2) + \hat{\beta}(3) x(3) - x(1) - \Delta(1) \right| = \left| \left( \hat{\beta}(1) + \hat{\beta}(2) + \hat{\beta}(3) - 1 \right) x(1) + \left( \hat{\beta}(1) - 1 \right) \Delta(1) \right|.$$

In particular, for appropriate distributions of $x$ and $\Delta(1)$, the predictions of $\hat{\beta}$ and $\beta^*$ coincide if and only if $\hat{\beta}(1) = 1$ and $\hat{\beta}(2) = -\hat{\beta}(3)$. As such, the learner learns after enough observations that, necessarily, $\beta^*(1) = 1$. However, any regression parameter vector with $\hat{\beta}(1) = 1$, $\hat{\beta}(2) + \hat{\beta}(3) = 0$ is indistinguishable from $\beta^*$, and accurate recovery of $\beta^*(2)$ and $\beta^*(3)$ is impossible.

Note that even in the noiseless setting of Example 3.4, only the feature that has been modified can be recovered accurately. In more complex settings where the true labels are noisy, one should not hope to recover every feature well, but rather only those that have been modified sufficiently many times.

4 Recovery Guarantees for Modified Features

In this section, we focus on characterizing the recovery guarantees (with respect to the $\ell_2$-norm) of Algorithm 1 at time $\tau(E) = En$ for any epoch $E$, with respect to the features that have been modified up until $\tau(E)$ (that is, in epochs 1 to $E$). We leave discussion of how the dynamics shape the set $D_{\tau(E)}$ of modified features to Section 5. The main result of this section guarantees the accuracy of the $\hat{\beta}_E$ that the learning process converges to in its interaction with a sequence of strategic agents.
The accuracy of the $\hat{\beta}_E$ that is recovered for a particular feature naturally depends on the number of epochs in which that feature is modified by the agents. For a feature that is never modified, we have no ability to distinguish whether it is meaningful or not. Recovery improves as the number of observations of the modified variable increases. Formally, our recovery guarantee is given by the following theorem:

Theorem 4.1 ($\ell_2$ Recovery Guarantee for Modified Features). Pick any epoch $E$. With probability at least $1 - \delta$, for $n \ge \frac{\kappa d^2}{\lambda} \sqrt{\tau(E) \log(12d/\delta)}$,

$$\sqrt{\sum_{k \in D_{\tau(E)}} \left( \hat{\beta}_E(k) - \beta^*(k) \right)^2} \le \frac{K \sqrt{d \, \tau(E) \log(4d/\delta)}}{\lambda n},$$

where $K$, $\kappa$, $\lambda$ are instance-specific constants that only depend on $\sigma$, $\mathcal{C}$, $\Sigma$, such that $\lambda > 0$.

When the epoch size is chosen so that $n = \Omega(\tau(E)^\alpha)$ for $\alpha > 1/2$, our recovery guarantee improves as $\tau(E)$ becomes larger. Now, let us fix $\tau(E) = T$ as the time horizon, and study how the relationship between $E$ and $n$ at fixed $\tau(E)$ affects the recovery guarantees. When $n = \Theta(\tau(E))$ (equivalently, $E = \Theta(1)$, and agents are grouped in a small, constant number of epochs), our bound becomes $O(1/\sqrt{\tau(E)})$; this matches the well-known recovery guarantees of least-squares regression for a single batch of $\tau(E)$ i.i.d. observations drawn from a non-degenerate distribution of features. When the epoch size $n$ is sub-linear in $\tau(E)$ (i.e., $E \gg 1$, and agents are grouped in more numerous but smaller epochs), the accuracy guarantee degrades to $O(\sqrt{\tau(E)}/n)$, where $\sqrt{\tau(E)}/n \gg 1/\sqrt{\tau(E)}$. This is because some features may be modified only in a small number of epochs,[Footnote 5] that is, $\Theta(n)$ times, and the number of times such features are modified drives how accurately they can be recovered.

Proof sketch for Theorem 4.1. Full proof in Appendix A.
We focus on the subspace $V_{\tau(E)}$ of $\mathbb{R}^d$ spanned by the observed features $\bar{x}_1, \ldots, \bar{x}_{\tau(E)}$, and for any $z \in \mathbb{R}^d$, we denote by $z(V_{\tau(E)})$ the projection of $z$ onto $V_{\tau(E)}$. First, we show via concentration that in this subspace, the mean-squared error is strongly convex, with parameter $\Theta(n)$ (see Claim A.6). This strong convexity parameter is controlled by the smallest eigenvalue of $\bar{X}_{\tau(E)}^\top \bar{X}_{\tau(E)}$ over subspace $V_{\tau(E)}$. Formally, we lower bound this eigenvalue and show that with probability at least $1 - \delta/2$, for $n$ large enough,

$$\left( \hat{\beta}_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)}) \right)^\top \bar{X}_{\tau(E)}^\top \bar{X}_{\tau(E)} \left( \hat{\beta}_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)}) \right) \ge \frac{\lambda n}{4} \left\| \hat{\beta}_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)}) \right\|_2^2. \quad (2)$$

Second, we bound the effect of the noise $\varepsilon$ on the mean-squared error by $O(\sqrt{\tau(E)})$ in Lemma A.3, once again via concentration. Formally, we abuse notation and let $\varepsilon_{\tau(E)} \triangleq (\varepsilon_t)_{t \in [\tau(E)]}^\top$, and show that with probability at least $1 - \delta/2$,

$$\left( \hat{\beta}_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)}) \right)^\top \bar{X}_{\tau(E)}^\top \varepsilon_{\tau(E)} \le \left\| \hat{\beta}_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)}) \right\|_2 \cdot K \sqrt{d \, \tau(E) \log(4d/\delta)}. \quad (3)$$

Finally, we obtain the result via Lemma A.2, which states that taking the first-order conditions on the mean-squared error yields

$$\bar{X}_{\tau(E)}^\top \bar{X}_{\tau(E)} \left( \hat{\beta}_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)}) \right) = \bar{X}_{\tau(E)}^\top \varepsilon_{\tau(E)},$$

which can be combined with Equations (2) and (3) to show our bound with respect to subspace $V_{\tau(E)}$. In turn, as $D_{\tau(E)}$ defines a subspace of $V_{\tau(E)}$, our accuracy bound applies to $D_{\tau(E)}$.

[Footnote 5: In particular, as we will see, we expect correlated, non-meaningful features to only be modified in a small number of epochs: once a non-meaningful feature $k$ has been modified in a few epochs, it is accurately recovered. In further periods $E$, the learner sets $\hat{\beta}_E(k)$ close to 0. This disincentivizes further modifications of feature $k$.]

Remark 4.2.
Theorem 4.1 is not a direct consequence of the classical recovery guarantees of least-squares regression, as those assume that $\bar{X}_{\tau(E)}^\top \bar{X}_{\tau(E)}$ has full rank $d$. We deal with degenerate distributions over modified features, which can arise in our setting as per Examples 3.2 and 3.4.

5 Exploration via Least Squares Tie-Breaking

In this section, we show that a natural tie-breaking rule over the set of least-squares solutions incentivizes agents to modify a diverse set of variables over time. Recall that we solve the least-squares problem $LS_E(\tau(E))$ given in Equation (1) for every epoch $E$. When $\bar{X}_{\tau(E)}^\top \bar{X}_{\tau(E)}$ is invertible, this problem has a unique solution. In our setting, however, it may be the case that $\bar{X}_{\tau(E)}^\top \bar{X}_{\tau(E)}$ is rank-deficient (see Examples 3.2, 3.4); the least-squares problem then admits a continuum of solutions. This raises the question of which solutions are preferable in our setting, and of how to break ties between them. The learner's choice of regression parameters in each epoch affects the distribution of feature modifications in subsequent epochs. As the recovery guarantee of Theorem 4.1 applies only to features that have been modified, we would like our tie-breaking rule to regularly incentivize agents to modify new features. We first show that a natural, commonly used tie-breaking rule (picking the minimum-norm solution to the least-squares problem) may fail to do so:

Example 5.1. Consider a setting with $d = 2$, $\beta^* = (1, 2)$, and noiseless labels, i.e., $\varepsilon_t = 0$ always. Suppose that with probability $1$, every agent $t$ has features $x_t = (0, 0)$, budget $B_t = 1$, and costs $c_t(1) = c_t(2) = 1$ to modify each feature. Let the tie-breaking rule pick the solution with the least $\ell_2$ norm among all solutions to the least-squares problem.
Pick any initial regression parameter $\hat\beta_0$ with $\hat\beta_0(1) > \hat\beta_0(2)$. Every agent $t$ in epoch 1 then picks modification vector $\Delta_t = (1, 0)$, inducing observations $\bar{x}_t = (1, 0)$, $\bar{y}_t = 1$. The set of least-squares solutions (with error exactly $0$) in epoch 1 is then $\{(1, \beta_2) : \beta_2 \in \mathbb{R}\}$, and the minimum-norm solution chosen at the end of epoch 1 is $\hat\beta_1 = (1, 0)$. This solution again incentivizes agents to set $\Delta_t = (1, 0)$, and Algorithm 1 gets stuck in a loop in which every agent $t$ reports $\bar{x}_t = (1, 0)$ and the algorithm posts regression parameter vector $\hat\beta_E = (1, 0)$ in response, in every epoch $E$. The second feature is never modified by any agent, and is never recovered accurately.

Example 5.1 highlights that a wrong choice of tie-breaking rule can lead Algorithm 1 to explore the same features over and over again. In response, we propose the tie-breaking rule described in Algorithm 2. Intuitively, at the end of epoch $E$, our tie-breaking rule picks a solution of $LS_E(\tau(E))$ with large norm. This ensures the existence of a feature $k \notin D_{\tau(E)}$ that has not yet been modified up until time $\tau(E)$ and that is assigned a large weight by our least-squares solution; in turn, this feature is more likely to be modified in future epochs. Our main result in this section shows that the tie-breaking rule of Algorithm 2 eventually incentivizes the agents to modify all $d$ features, allowing for accurate recovery of $\beta^*$ in its entirety. The intuition behind our algorithm is to choose a tie-breaking rule that puts enough weight on directions that have not yet been explored, incentivizing agents to explore them.

Algorithm 2: Tie-Breaking Scheme at Time $\tau(E)$.
  Input: Epoch $E$, observations $(\bar{x}_1, \bar{y}_1), \ldots, (\bar{x}_{\tau(E)}, \bar{y}_{\tau(E)})$, parameter $\alpha$
  Let $U_{\tau(E)} = \mathrm{span}(\bar{x}_1, \ldots, \bar{x}_{\tau(E)})$.
  if $\mathrm{rank}(U_{\tau(E)}) < d$ then
    Find an orthonormal basis $B^\perp_{\tau(E)}$ of $U^\perp_{\tau(E)}$.
    Set $v = \sum_{b \in B^\perp_{\tau(E)}} b$ (note $v \neq 0$); renormalize $v := v / \|v\|_2$.
    Pick $\beta_E$, a vector in $LS_E(\tau(E))$ with minimal norm.
    Set $\hat\beta_E = \beta_E + \alpha v$.
  else
    Set $\hat\beta_E$ to be the unique element of $LS_E(\tau(E))$.
  end
  Output: $\hat\beta_E$.

Theorem 5.2 (Recovery Guarantee with the Tie-Breaking Scheme of Algorithm 2). Suppose the epoch size satisfies $n \geq \frac{\kappa d^2}{\lambda}\sqrt{2T\log(24d/\delta)}$, and take $\alpha$ such that
$$\alpha \geq \gamma\Bigg(\sqrt{d} + \frac{Kd\sqrt{2T\log(8d/\delta)}}{\lambda n}\Bigg),$$
where $\gamma, K, \kappa, \lambda$ are instance-specific constants that depend only on $\sigma$, $\mathcal{C}$, $\Sigma$, with $\lambda > 0$. If $T \geq dn$, then under the tie-breaking rule of Algorithm 2, with probability at least $1 - \delta$, at the end of the last epoch $T/n$,
$$\Big\|\hat\beta_{T/n} - \beta^*\Big\|_2 \leq \frac{K\sqrt{2dT\log(8d/\delta)}}{\lambda n}.$$

Remark 5.3. The bound in Theorem 5.2 provides guidance for selecting the epoch length so as to ensure optimal recovery guarantees. Under the natural assumption that $T \gg d$, the optimal recovery rate is achieved when roughly $n = \Theta(T/d)$. This yields an $O\big(d\sqrt{(d\log d)/T}\big)$ upper bound on the $\ell_2$ distance between the recovered regression parameters and $\beta^*$.

Proof sketch of Theorem 5.2. Full proof in Appendix B. For $\alpha$ arbitrarily large, the norm of $\hat\beta_E$ becomes arbitrarily large. Because at the end of epoch $E$, $\hat\beta_E$ guarantees accurate recovery of all features modified up until time $En$, it must be that $\hat\beta_E(k)$ is arbitrarily large for some feature $k$ that has not yet been modified. In turn, this feature is modified in epoch $E + 1$. After $d$ epochs, and in particular for $T \geq dn$, this leads to $D_T = [d]$. The recovery guarantee of Theorem 4.1 then applies to all features.
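The tie-breaking scheme above is straightforward to sketch numerically. The following is an illustrative numpy implementation, not the authors' code: the orthonormal basis of $U^\perp_{\tau(E)}$ is read off the SVD of the observation matrix, and the sketch is run on the data of Example 5.1; the value $\alpha = 3$ is a hypothetical choice for illustration only:

```python
import numpy as np

def tie_break(X, y, alpha):
    """Sketch of Algorithm 2: start from the minimum-norm least-squares
    solution; if the observations do not span R^d, add alpha times a unit
    vector built from an orthonormal basis of the orthogonal complement."""
    beta_min, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimum-norm solution
    rank = np.linalg.matrix_rank(X)
    if rank == X.shape[1]:
        return beta_min                               # full rank: unique solution
    # Rows of Vh beyond the rank span the complement of span(observations).
    basis_perp = np.linalg.svd(X)[2][rank:]
    v = basis_perp.sum(axis=0)
    v /= np.linalg.norm(v)
    return beta_min + alpha * v

# Data of Example 5.1: every agent reports x_bar = (1, 0) with label y_bar = 1.
X = np.tile([1.0, 0.0], (10, 1))
y = np.ones(10)
beta_hat = tie_break(X, y, alpha=3.0)
```

On this rank-deficient data, the minimum-norm solution alone is $(1, 0)$, while the tie-break returns a vector whose second coordinate has magnitude $\alpha$ (up to the sign of the basis vector), so the unexplored second feature receives large weight.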
6 Conclusion

This work takes a first step towards illuminating a phenomenon we believe is both surprising and worthy of further study: strategic agents may in fact help a learner better understand the underlying structure of a classification problem. As an immediate implication, the recovery guarantees we have proven provide the learner with knowledge of how to choose good incentives, laying the ground for individual improvement rather than gaming. In future work, it would be natural to explore this interaction in richer and more complex settings.

Acknowledgments

Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing. The work of Yahav Bechavod and Katrina Ligett was supported in part by Israel Science Foundation (ISF) grants #1044/16 and #2861/20, the United States Air Force and DARPA under contracts FA8750-16-C-0022 and FA8750-19-2-0222, and the Federmann Cyber Security Center in conjunction with the Israel National Cyber Directorate. Yahav Bechavod was also supported in part by the Apple Scholars in AI/ML PhD Fellowship. Katrina Ligett was also funded in part by a grant from Georgetown University and by Simons Foundation Collaboration 733792. Zhiwei Steven Wu was supported in part by NSF FAI Award #1939606, a Google Faculty Research Award, a J.P. Morgan Faculty Award, a Facebook Research Award, and a Mozilla Research Grant. Juba Ziani was supported in part by the Inaugural PIMCO Graduate Fellowship at Caltech, the National Science Foundation through grant CNS-1518941, as well as the Warren Center for Network and Data Sciences at the University of Pennsylvania.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force and DARPA. We thank Mohammad Fereydounian and Aaron Roth for useful discussions.
A Proof of Theorem 4.1

A.1 Preliminaries

A.1.1 Useful concentration

Our proof will require the following concentration inequality, derived from Azuma's inequality:

Lemma A.1. Let $W_1, \ldots, W_\tau$ be random variables in $\mathbb{R}$ such that $|W_t| \leq W_{\max}$. Suppose that for all $t \in [\tau]$ and all $w_1, \ldots, w_{t-1}$, $\mathbb{E}[W_t \mid W_{t-1} = w_{t-1}, \ldots, W_1 = w_1] = 0$. Then, with probability at least $1 - \delta$,
$$\Big|\sum_{t=1}^{\tau} W_t\Big| \leq W_{\max}\sqrt{2\tau\log(2/\delta)}.$$

Proof. This is a reformulated version of Azuma's inequality. To see this, define $Z_t = \sum_{i=1}^{t} W_i$ for all $t$, and initialize $Z_0 = 0$. We start by noting that for all $t \in [\tau]$, since
$$Z_t = \sum_{i=1}^{t} W_i = W_t + \sum_{i=1}^{t-1} W_i = W_t + Z_{t-1},$$
we have
$$\mathbb{E}[Z_t \mid Z_{t-1}, \ldots, Z_1] = \mathbb{E}[W_t \mid Z_{t-1}, \ldots, Z_1] + Z_{t-1}.$$
Further, it is easy to see that $Z_i = z_i$ for all $i \in [t-1]$ if and only if $W_i = z_i - z_{i-1}$ for all $i \in [t-1]$, hence
$$\mathbb{E}[W_t \mid Z_{t-1} = z_{t-1}, \ldots, Z_1 = z_1] = \mathbb{E}[W_t \mid W_i = z_i - z_{i-1} \ \forall i \in [t-1]] = 0.$$
Combining the last two equations implies that $\mathbb{E}[Z_t \mid Z_{t-1}, \ldots, Z_1] = Z_{t-1}$, so the $Z_t$'s define a martingale. Since $|Z_t - Z_{t-1}| = |W_t| \leq W_{\max}$ for all $t$, we can apply Azuma's inequality to show that with probability at least $1 - \delta$,
$$|Z_\tau - Z_0| \leq W_{\max}\sqrt{2\tau\log(2/\delta)},$$
which immediately gives the result. ∎

A.1.2 Subspace decomposition and projection

We will also need to divide $\mathbb{R}^d$ into several subspaces, and to project our observations onto these subspaces.

Subspace decomposition. We focus on the subspace generated by the unmodified features $x_t$ and the subspace generated by the feature modifications $\Delta_t$. Let $r$ be the rank of $\Sigma$, and let $\lambda_1 \geq \ldots \geq \lambda_r > 0$ be the non-zero eigenvalues of $\Sigma$ (so $\lambda_r$ is the smallest non-zero eigenvalue). Further, let $f_1, \ldots, f_r$ be unit eigenvectors (i.e., such that $\|f_1\|_2 = \ldots = \|f_r\|_2 = 1$) corresponding to eigenvalues $\lambda_1, \ldots, \lambda_r$ of $\Sigma$. As $\Sigma$ is a symmetric matrix, $f_1, \ldots, f_r$ are orthonormal. We abuse notation in the proof of Theorem 4.1 and write $\Sigma = \mathrm{span}(f_1, \ldots, f_r)$ when clear from context. For all $k$, let $e_k$ be the unit vector such that $e_k(k) = 1$ and $e_k(j) = 0$ for all $j \neq k$. At time $\tau$, we denote by $D_\tau = \mathrm{span}((e_k)_{k \in D_\tau})$ the subspace of $\mathbb{R}^d$ spanned by the features in $D_\tau$. Finally, we let $V_\tau = \Sigma + D_\tau = \mathrm{span}(f_1, \ldots, f_r) + \mathrm{span}((e_k)_{k \in D_\tau})$ be the Minkowski sum of the subspaces $\Sigma$ and $D_\tau$.

Projection onto subspaces. For any vector $z$ and subspace $H$ of $\mathbb{R}^d$, we write $z = z(H) + z(H^\perp)$, where $z(H)$ is the projection of $z$ onto $H$, uniquely defined as
$$z(H) = \sum_{q \in B} (z^\top q)\, q$$
for any orthonormal basis $B$ of $H$, and where $z(H^\perp)$ is the projection onto the orthogonal complement $H^\perp$. In particular, $z(H)$ is orthogonal to $z(H^\perp)$. Further, we write $\bar{X}_\tau(H)$ for the matrix whose rows are given by $\bar{x}_t(H)^\top$ for all $t \in [\tau]$.
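The projection notation above can be made concrete with a short numerical sketch (illustrative only, not from the paper): an orthonormal basis $B$ of a subspace $H$ is obtained from an SVD, and the projection $z(H) = \sum_{q \in B}(z^\top q)\,q$ then satisfies $z = z(H) + z(H^\perp)$ with the two parts orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
# A subspace H of R^d: the row space of a random 3 x d matrix.
A = rng.normal(size=(3, d))
B = np.linalg.svd(A)[2][: np.linalg.matrix_rank(A)]  # orthonormal basis of H

def project(z, B):
    """z(H) = sum over q in B of (z^T q) q, for an orthonormal basis B of H."""
    return B.T @ (B @ z)

z = rng.normal(size=d)
z_H = project(z, B)   # projection onto H
z_perp = z - z_H      # projection onto the orthogonal complement H^perp
# z decomposes as z(H) + z(H^perp), and the two parts are orthogonal.
```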
A.2 Main Proof

Characterization of the least-squares estimate via first-order conditions. First, for any least-squares solution $\hat\beta_E$ at time $\tau(E)$, we write the first-order conditions solved by $\hat\beta_E(V_{\tau(E)})$, the projection of $\hat\beta_E$ onto subspace $V_{\tau(E)}$. We abuse notation and let $\varepsilon_{\tau(E)} \triangleq (\varepsilon_t)_{t \in [\tau(E)]}$ denote the vector of all $\varepsilon_t$'s up until time $\tau(E)$, and state the result as follows:

Lemma A.2 (First-order conditions projected onto $V_{\tau(E)}$). Suppose $\hat\beta_E \in LS_E(\tau(E))$. Then,
$$\bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big) = \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \varepsilon_{\tau(E)}.$$

Proof. For simplicity of notation, we drop all $\tau(E)$ indices and subscripts in this proof. Recall that
$$LS_E = \operatorname*{argmin}_\beta\; (\bar{X}\beta - \bar{Y})^\top (\bar{X}\beta - \bar{Y}).$$
Since $\hat\beta_E \in LS_E$, it must satisfy the first-order conditions given by $2\bar{X}^\top(\bar{X}\hat\beta_E - \bar{Y}) = 0$, which can be rewritten as
$$\bar{X}^\top \bar{X}\hat\beta_E = \bar{X}^\top \bar{Y}.$$
Second, we note that for all $t$, $x_t \in \mathrm{span}(f_1, \ldots, f_r)$ and $\Delta_t \in \mathrm{span}((e_k)_{k \in D})$ (by definition of $D$). This immediately implies, in particular, that $\bar{x}_t = x_t + \Delta_t \in V$. In turn, $\bar{x}_t(V) = \bar{x}_t$ for all $t$, and $\bar{X} = \bar{X}(V)$. As such, the first-order condition can be written
$$\bar{X}(V)^\top \bar{X}(V)\hat\beta_E = \bar{X}(V)^\top \bar{Y}.$$
Now, we remark that
$$\bar{X}(V)^\top \bar{X}(V)\hat\beta_E = \sum_t \bar{x}_t(V)\bar{x}_t(V)^\top \hat\beta_E = \sum_t \bar{x}_t(V)\bar{x}_t(V)^\top \hat\beta_E(V) + \sum_t \bar{x}_t(V)\bar{x}_t(V)^\top \hat\beta_E(V^\perp) = \bar{X}(V)^\top \bar{X}(V)\hat\beta_E(V),$$
where the last equality follows from the fact that $V$ and $V^\perp$ are orthogonal, which immediately implies $\bar{x}_t(V)^\top \hat\beta_E(V^\perp) = 0$ for all $t$. To conclude the proof, we note that $\bar{Y} = \bar{X}\beta^* + \varepsilon = \bar{X}(V)\beta^*(V) + \varepsilon$. Plugging this into the above equation, we obtain
$$\bar{X}(V)^\top \bar{X}(V)\hat\beta_E(V) = \bar{X}(V)^\top \bar{X}(V)\beta^*(V) + \bar{X}(V)^\top \varepsilon.$$
This can be rewritten as
$$\bar{X}(V)^\top \bar{X}(V)\Big(\hat\beta_E(V) - \beta^*(V)\Big) = \bar{X}(V)^\top \varepsilon,$$
which completes the proof. ∎

Upper-bounding the right-hand side of the first-order conditions. We now use concentration to give an upper bound on a function of the right-hand side of the first-order conditions, $\big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \varepsilon_{\tau(E)}$.

Lemma A.3. With probability at least $1 - \delta$,
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \varepsilon \;\leq\; \Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2 \cdot K'\sqrt{d\,\tau(E)\log(2d/\delta)},$$
where $K'$ is a constant that depends only on the distribution of costs and the bound $\sigma$ on the noise.

Proof. Pick any $k \in [d]$, and define $W_t = \bar{x}_t(k)\varepsilon_t$. First, we remark that
$$|\bar{x}_t(k)| \leq |x_t(k)| + |\Delta_t(k)| \leq 1 + \max_{k \in [d],\, i \in [l]} \frac{B_i}{c_i(k)}.$$
In turn, $|W_t| \leq W_{\max}$, where $W_{\max} \triangleq \big(1 + \max_{k \in [d],\, i \in [l]} B_i/c_i(k)\big)\sigma$; we will take $K' \triangleq \sqrt{2}\,W_{\max}$. Further, note that both $x_t(k)$ and $\varepsilon_t$ are independent of the history of play up through time $t-1$, hence of $W_1, \ldots, W_{t-1}$, and that $\varepsilon_t$ is further independent of $\Delta_t$ (the distribution of $\Delta_t$ is a function of the currently posted $\hat\beta_{E-1}$ only, which depends only on the previous time steps). Noting that for random variables $A, B, C$,
$$\mathbb{E}_{A,B}[AB \mid C = c] = \sum_a \sum_b ab\,\Pr[A = a, B = b \mid C = c] = \sum_b b\,\mathbb{E}_A[A \mid B = b, C = c]\,\Pr[B = b \mid C = c] = \mathbb{E}_B\big[\mathbb{E}_A[A \mid B, C = c]\,B \mid C = c\big],$$
and applying this with $A = \varepsilon_t$, $B = \Delta_t(k)$, $C = (W_1, \ldots, W_{t-1})$, we obtain
$$\mathbb{E}[W_t \mid W_{t-1}, \ldots, W_1] = \mathbb{E}[x_t(k)\varepsilon_t \mid W_{t-1}, \ldots, W_1] + \mathbb{E}[\Delta_t(k)\varepsilon_t \mid W_{t-1}, \ldots, W_1] = \mathbb{E}_{x_t}\big[x_t(k)\,\mathbb{E}_\varepsilon[\varepsilon_t \mid x_t(k)]\big] + \mathbb{E}_{\Delta_t}\big[\Delta_t(k)\,\mathbb{E}_{\varepsilon_t}[\varepsilon_t] \mid W_{t-1}, \ldots, W_1\big] = 0,$$
since $\mathbb{E}_{\varepsilon_t}[\varepsilon_t] = 0$ and $\mathbb{E}_\varepsilon[\varepsilon_t \mid x_t(k)] = 0$. Hence, we can apply Lemma A.1 and a union bound over all $d$ features to show that with probability at least $1 - \delta$,
$$\Big|\sum_{t=1}^{\tau(E)} \bar{x}_t(k)\varepsilon_t\Big| \leq W_{\max}\sqrt{2\tau(E)\log(2d/\delta)} \quad \forall k \in [d].$$
By Cauchy-Schwarz, we have
$$\Big(\hat\beta_E(V) - \beta^*(V)\Big)^\top \sum_{t=1}^{\tau(E)} \bar{x}_t\varepsilon_t \leq \Big\|\hat\beta_E(V) - \beta^*(V)\Big\|_2 \cdot \Big\|\sum_{t=1}^{\tau(E)} \bar{x}_t\varepsilon_t\Big\|_2 = \Big\|\hat\beta_E(V) - \beta^*(V)\Big\|_2 \sqrt{\sum_{k=1}^{d}\Big(\sum_t \bar{x}_t(k)\varepsilon_t\Big)^2} \leq \Big\|\hat\beta_E(V) - \beta^*(V)\Big\|_2 \cdot K'\sqrt{d\,\tau(E)\log(2d/\delta)}. \qquad \blacksquare$$

Strong convexity of the mean-squared error on subspace $V_{\tau(E)}$. We give a lower bound on the eigenvalues of $\bar{X}^\top\bar{X}$ on subspace $V_{\tau(E)}$, so as to show that at time $\tau(E)$, any least-squares solution $\hat\beta_E$ satisfies
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big) \geq \Omega(n)\,\Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2^2.$$
To do so, we will need the following concentration inequalities:

Lemma A.4. Suppose $\mathbb{E}[x_t] = 0$. Fix $\tau(E) = En$ for some $E \in \mathbb{N}$. With probability at least $1 - \delta$, we have that
$$\sum_{t=1}^{\tau(E)} z^\top x_t x_t^\top z \geq \Big(\lambda_r \tau(E) - 2rd\sqrt{\tau(E)\log(6r/\delta)}\Big)\|z\|_2^2 \quad \forall z \in \Sigma,$$
and
$$\sum_{t=1}^{\tau(E)} z^\top \Delta_t \Delta_t^\top z \geq \Bigg(n\min_{i,k}\Big(\pi_i \frac{B_i^2}{c_i(k)^2}\Big) - \max_{i,k}\frac{B_i^2}{c_i(k)^2}\sqrt{2n\log(6d/\delta)}\Bigg)\|z\|_2^2 \quad \forall z \in D_{\tau(E)},$$
and
$$\sum_{t=1}^{\tau(E)} z^\top x_t \Delta_t^\top z \geq -2\max_{i,k}\frac{B_i}{c_i(k)}\,d\sqrt{\tau(E)\log(6d/\delta)}\,\|z\|_2^2 \quad \forall z \in \mathbb{R}^d.$$

Proof. Deferred to Appendix A.2.1. ∎

We will also need the following statement on the norms of the projections of any $z \in V$ onto $D$ and $\Sigma$:

Lemma A.5. Let
$$\lambda(D, \Sigma) = \inf_{z \in D + \Sigma} \Big\{\|z(D)\|_2 + \|z(\Sigma)\|_2 \;\text{ s.t. }\; \|z\|_2 = 1\Big\}.$$
Then $\lambda(D, \Sigma) > 0$.

Proof.
With respect to the Euclidean metric, the objective function is continuous in $z$ (the orthogonal projection operators are linear, hence continuous, functions of $z$, and $z \mapsto \|z\|_2$ is also continuous), and its feasible set is compact (it is a sphere in a finite-dimensional space over the reals). By the extreme value theorem, the optimization problem admits an optimal solution; i.e., there exists $z^*$ with $\|z^*\|_2 = 1$ such that $\lambda(D, \Sigma) = \|z^*(D)\|_2 + \|z^*(\Sigma)\|_2$. Now, supposing $\lambda(D, \Sigma) \leq 0$, it must necessarily be the case that $z^*(D) = 0$ and $z^*(\Sigma) = 0$. In particular, this means $z^*$ is orthogonal to both $D$ and $\Sigma$. In turn, $z^*$ must be orthogonal to every vector in $D + \Sigma$; since $z^* \in D + \Sigma$, this is only possible when $z^* = 0$, contradicting $\|z^*\|_2 = 1$. ∎

We can now move on to the proof of our lower bound on
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big).$$

Corollary A.6. Fix $\tau(E) = En$ for some $E \in \mathbb{N}$. With probability at least $1 - \delta$,
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big) \geq \Big(\frac{\lambda n}{2} - \kappa' d^2\sqrt{\tau(E)\log(6d/\delta)}\Big)\Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2^2,$$
for some constants $\kappa', \lambda$ that depend only on $\sigma$, $\mathcal{C}$, and $\Sigma$, with $\lambda > 0$.

Proof. Since it is clear from context, we drop all $\tau(E)$ subscripts in the notation of this proof. First, we remark that
$$z^\top \bar{X}^\top \bar{X} z = \sum_t z^\top \bar{x}_t \bar{x}_t^\top z = \sum_t z^\top x_t x_t^\top z + \sum_t z^\top \Delta_t \Delta_t^\top z + 2\sum_t (z^\top \Delta_t)(z^\top x_t).$$
We have by Lemma A.5 that for all $z \in V = D + \Sigma$, $\|z(D)\|_2 + \|z(\Sigma)\|_2 \geq \lambda(D, \Sigma)\|z\|_2$. Let $\lambda(\Sigma) \triangleq \min_{D \subset [d]} \lambda(D, \Sigma)$. Since there are finitely many subsets $D$ of $[d]$ (and corresponding subspaces $D$), and since $\lambda(D, \Sigma) > 0$ for each such subset, we have that $\lambda(\Sigma) > 0$. Further,
$$\|z(D)\|_2 + \|z(\Sigma)\|_2 \geq \lambda(\Sigma)\|z\|_2.$$
Therefore, it must be the case that either $\|z(D)\|_2 \geq \frac{\lambda(\Sigma)}{2}\|z\|_2$ or $\|z(\Sigma)\|_2 \geq \frac{\lambda(\Sigma)}{2}\|z\|_2$. We divide our proof into the corresponding two cases:

1. The first case is when $\|z(\Sigma)\|_2 \geq \frac{\lambda(\Sigma)}{2}\|z\|_2$. Then, note that since $z^\top \Delta_t \Delta_t^\top z \geq 0$ always,
$$\sum_t z^\top \bar{x}_t \bar{x}_t^\top z \geq \sum_t z^\top x_t x_t^\top z + 2\sum_t (z^\top \Delta_t)(z^\top x_t) = \sum_t z(\Sigma)^\top x_t x_t^\top z(\Sigma) + 2\sum_t (z^\top \Delta_t)(z^\top x_t),$$
where the last equality follows from the fact that $x_t \in \Sigma$ and $z = z(\Sigma) + z(\Sigma^\perp)$. By Lemma A.4, we get that for some constant $C_1$ that depends only on $\mathcal{C}$,
$$\sum_t z^\top \bar{x}_t \bar{x}_t^\top z \geq \Big(\lambda_r\tau(E) - 2rd\sqrt{\tau(E)\log(6r/\delta)}\Big)\|z(\Sigma)\|_2^2 - C_1 d\sqrt{\tau(E)\log(6d/\delta)}\,\|z\|_2^2$$
$$\geq \Big(\frac{\lambda(\Sigma)\lambda_r}{2}\tau(E) - \lambda(\Sigma)rd\sqrt{\tau(E)\log(6r/\delta)} - C_1 d\sqrt{\tau(E)\log(6d/\delta)}\Big)\|z\|_2^2$$
$$\geq \Big(\frac{\lambda(\Sigma)\lambda_r}{2}\tau(E) - \lambda(\Sigma)d^2\sqrt{\tau(E)\log(6d/\delta)} - C_1 d\sqrt{\tau(E)\log(6d/\delta)}\Big)\|z\|_2^2.$$
(The second step assumes $\lambda_r\tau(E) - 2rd\sqrt{\tau(E)\log(6r/\delta)} \geq 0$. When this quantity is negative, the bound trivially holds, as $\sum_t z^\top \bar{x}_t \bar{x}_t^\top z \geq 0$.)

2. The second case arises when $\|z(D)\|_2 \geq \frac{\lambda(\Sigma)}{2}\|z\|_2$. Note that
$$\sum_t z^\top \bar{x}_t \bar{x}_t^\top z \geq \sum_t z^\top \Delta_t \Delta_t^\top z + 2\sum_t (z^\top \Delta_t)(z^\top x_t) = \sum_t z(D)^\top \Delta_t \Delta_t^\top z(D) + 2\sum_t (z^\top \Delta_t)(z^\top x_t),$$
as $\Delta_t \in D$ and $z = z(D) + z(D^\perp)$. By Lemma A.4, it follows that for some constants $C_2, C_3$ that depend only on $\mathcal{C}$,
$$\sum_t z^\top \bar{x}_t \bar{x}_t^\top z \geq \Big(n\min_{i,k}\big(\pi_i \tfrac{B_i^2}{c_i(k)^2}\big) - C_2\sqrt{n\log(6d/\delta)}\Big)\|z(D)\|_2^2 - C_3 d\sqrt{\tau(E)\log(6d/\delta)}\,\|z\|_2^2$$
$$\geq \Big(\frac{\lambda(\Sigma)n}{2}\min_{i,k}\big(\pi_i \tfrac{B_i^2}{c_i(k)^2}\big) - \frac{\lambda(\Sigma)C_2}{2}\sqrt{n\log(6d/\delta)} - C_3 d\sqrt{\tau(E)\log(6d/\delta)}\Big)\|z\|_2^2$$
$$\geq \Big(\frac{\lambda(\Sigma)n}{2}\min_{i,k}\big(\pi_i \tfrac{B_i^2}{c_i(k)^2}\big) - \frac{\lambda(\Sigma)C_2}{2}\sqrt{\tau(E)\log(6d/\delta)} - C_3 d\sqrt{\tau(E)\log(6d/\delta)}\Big)\|z\|_2^2.$$
Noting that by definition $\lambda_r > 0$ and $\min_{i,k} \pi_i B_i^2/c_i(k)^2 > 0$, and picking the worse of the two bounds above on $\sum_t z^\top \bar{x}_t \bar{x}_t^\top z$, concludes the proof with
$$\lambda = \frac{\lambda(\Sigma)}{2}\min\Big(\lambda_r,\; \min_{i,k}\big(\pi_i \tfrac{B_i^2}{c_i(k)^2}\big)\Big) > 0. \qquad \blacksquare$$

We can now prove Theorem 4.1. By Lemma A.2, we have that
$$\bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big) = \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \varepsilon_{\tau(E)};$$
multiplying both sides of the first-order conditions on the left by $\big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\big)^\top$ immediately yields
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big) = \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \varepsilon_{\tau(E)}.$$
Further, by Lemma A.3, Corollary A.6, and a union bound, we get that with probability at least $1 - \delta$,
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \bar{X}_{\tau(E)}(V_{\tau(E)}) \Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big) \geq \Big(\frac{\lambda n}{2} - \kappa' d^2\sqrt{\tau(E)\log(12d/\delta)}\Big)\Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2^2,$$
and
$$\Big(\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big)^\top \bar{X}_{\tau(E)}(V_{\tau(E)})^\top \varepsilon \leq \Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2 \cdot K'\sqrt{d\,\tau(E)\log(4d/\delta)}.$$
Combining the two inequalities above with the first-order conditions yields
$$\Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2 \leq \frac{K'\sqrt{d\,\tau(E)\log(4d/\delta)}}{\frac{\lambda n}{2} - \kappa' d^2\sqrt{\tau(E)\log(12d/\delta)}}.$$
For $n \geq \frac{4\kappa' d^2}{\lambda}\sqrt{\tau(E)\log(12d/\delta)}$, the bound becomes
$$\Big\|\hat\beta_E(V_{\tau(E)}) - \beta^*(V_{\tau(E)})\Big\|_2 \leq \frac{4K'\sqrt{d\,\tau(E)\log(4d/\delta)}}{\lambda n}.$$
The proof concludes by letting $K \triangleq 4K'$, $\kappa \triangleq 4\kappa'$, and noting that since $D_{\tau(E)} \subset V_{\tau(E)}$ by construction, the statement holds true over $D_{\tau(E)}$ (projecting onto a subspace cannot increase the $\ell_2$-norm). ∎

A.2.1 Proof of Lemma A.4

For the first statement, note that for all $k, j \leq r$,
$$\mathbb{E}\big[f_k^\top x_t x_t^\top f_j\big] = f_k^\top \mathbb{E}\big[x_t x_t^\top\big] f_j = \lambda_j f_k^\top f_j,$$
as $f_j$ is (by definition) an eigenvector of $\Sigma = \mathbb{E}[x_t x_t^\top]$ for eigenvalue $\lambda_j$.
Note that the $f_j^\top x_t x_t^\top f_k = (f_j^\top x_t)(f_k^\top x_t)$ are random variables that are independent across $t$. Further, by Cauchy-Schwarz,
$$\big|(f_k^\top x_t)(f_j^\top x_t)\big| \leq \|f_k\|_2\|f_j\|_2\|x_t\|_2^2 = \|x_t\|_2^2 \leq d.$$
Therefore, we can apply Hoeffding with a union bound over the $r^2$ choices of $(f_k, f_j)$ to show that with probability at least $1 - \delta'$,
$$\Big|\sum_{t=1}^{\tau(E)} f_k^\top x_t x_t^\top f_j - \lambda_j\tau(E)f_k^\top f_j\Big| \leq d\sqrt{2\tau(E)\log(2r^2/\delta')}.$$
Note now that for all $z \in \Sigma$, we can write $z = \sum_{k=1}^{r} (z^\top f_k) f_k$, and as such
$$\Big|\sum_{t=1}^{\tau(E)} z^\top x_t x_t^\top z - \sum_{k,j=1}^{r} (z^\top f_k)(z^\top f_j)\lambda_j\tau(E)f_k^\top f_j\Big| = \Big|\sum_{k,j=1}^{r} (z^\top f_k)(z^\top f_j)\Big(\sum_t f_k^\top x_t x_t^\top f_j - \lambda_j\tau(E)f_k^\top f_j\Big)\Big| \leq d\sqrt{2\tau(E)\log(2r^2/\delta')}\sum_{k,j=1}^{r}|z^\top f_k||z^\top f_j| \leq rd\sqrt{2\tau(E)\log(2r^2/\delta')}\,\|z\|_2^2,$$
where the last step follows from the fact that, by Cauchy-Schwarz,
$$\sum_{k=1}^{r}|z^\top f_k| \leq \sqrt{\sum_{k=1}^{r}1^2}\sqrt{\sum_{k=1}^{r}(z^\top f_k)^2} = \sqrt{r}\,\|z\|_2.$$
Hence, for $z \in \Sigma$, remembering that $f_k^\top f_j = 0$ when $k \neq j$ and $f_k^\top f_k = 1$, and noting that $\|z\|_2^2 = \sum_{k=1}^{r}(z^\top f_k)^2$, we get that
$$\sum_{t=1}^{\tau(E)} z^\top x_t x_t^\top z \geq \sum_{k=1}^{r}\lambda_k\tau(E)(z^\top f_k)^2 - rd\sqrt{2\tau(E)\log(2r^2/\delta')}\,\|z\|_2^2 \geq \lambda_r\tau(E)\sum_{k=1}^{r}(z^\top f_k)^2 - rd\sqrt{2\tau(E)\log(2r^2/\delta')}\,\|z\|_2^2 \geq \Big(\lambda_r\tau(E) - 2rd\sqrt{\tau(E)\log(2r/\delta')}\Big)\|z\|_2^2.$$
For the second statement, we remind the reader that the costs of modification are such that $\Delta_t(k)^2 \leq \max_{i,j}(B_i/c_i(j))^2$, and that within any epoch $\phi$, the $\Delta_t$'s are independent of each other.
We can therefore apply Hoeffding's inequality and a union bound (over $k \in D_{\tau(E)} \subset [d]$) to show that with probability at least $1 - \delta'$, for any $k \in D_{\tau(E)}$, there exists an epoch $\phi(k) \leq E$ (pick any $\phi$ in which $k$ is modified) such that
$$\sum_{t \in \phi(k)} e_k^\top \Delta_t \Delta_t^\top e_k \geq n\,\mathbb{E}\big[\Delta_t(k)^2\big] - \max_{i,j}\frac{B_i^2}{c_i(j)^2}\sqrt{2n\log(d/\delta')} \geq n\min_{i \in [l],\, j \in [d]}\Big(\pi_i\frac{B_i^2}{c_i(j)^2}\Big) - \max_{i,j}\frac{B_i^2}{c_i(j)^2}\sqrt{2n\log(d/\delta')}.$$
The last inequality holds by noting that $k$ can be modified in period $\phi(k)$ only if there exists a cost type $i$ in the support of $\mathcal{C}$ such that $k$ is a best response to $\hat\beta_{\phi(k)-1}$; in turn, $k$ is modified with probability $\pi_i$ by amount $\Delta(k) = B_i/c_i(k)$, leading to $\mathbb{E}[\Delta_t(k)^2] \geq \pi_i(B_i/c_i(k))^2$. Since $\Delta_t(k)\Delta_t(j) = 0$ when $k \neq j$ (as a single direction is modified at a time), note that for all $z \in D_{\tau(E)}$, we have
$$\sum_{t \leq \tau(E)} z^\top \Delta_t \Delta_t^\top z = \sum_{k=1}^{d}\sum_{t \leq \tau(E)} \Delta_t(k)^2(z^\top e_k)^2 \geq \sum_{k \in D_{\tau(E)}}\sum_{t \in \phi(k)} \Delta_t(k)^2(z^\top e_k)^2 \geq \Bigg(n\min_{i \in [l],\, j \in [d]}\Big(\pi_i\frac{B_i^2}{c_i(j)^2}\Big) - \max_{i,j}\frac{B_i^2}{c_i(j)^2}\sqrt{2n\log(d/\delta')}\Bigg)\sum_{k \in D_{\tau(E)}}(z^\top e_k)^2.$$
For $z \in D_{\tau(E)}$, $\sum_{k \in D_{\tau(E)}}(z^\top e_k)^2 = \|z\|_2^2$, and the second statement immediately holds.

Finally, let us prove the last inequality. Take $(k, j) \in [d]^2$, and write $W_t = e_k^\top x_t \Delta_t^\top e_j$. First, note that $x_t$ and $\Delta_t$ are independent: in epoch $\phi$, the distribution of $\Delta_t$ is a function of $\hat\beta_{\phi-1}$ (and $\mathcal{C}$) only, which depends only on the realizations of $x, \varepsilon, \Delta$ in previous time steps; further, $x_t$ is independent of the history of features and modifications up to and including time $t-1$. Hence, it must be the case that
$$\mathbb{E}[W_t \mid W_{t-1}, \ldots, W_1] = \mathbb{E}\Big[\mathbb{E}\big[e_k^\top x_t \mid \Delta_t, W_{t-1}, \ldots, W_1\big]\,\Delta_t^\top e_j \,\Big|\, W_{t-1}, \ldots, W_1\Big] = \mathbb{E}\big[e_k^\top x_t\big]\cdot\mathbb{E}\big[\Delta_t^\top e_j \mid W_{t-1}, \ldots, W_1\big] = 0,$$
where the last equality follows from the fact that $\mathbb{E}[x_t] = 0$. Further,
$$\big|e_k^\top x_t \Delta_t^\top e_j\big| = |x_t(k)|\,|\Delta_t(j)| \leq \max_{i,k}\frac{B_i}{c_i(k)}.$$
We can therefore apply Lemma A.1 and a union bound over all $(k, j) \in [d]^2$ to show that with probability at least $1 - \delta'$,
$$\Big|\sum_{t=1}^{\tau(E)} e_k^\top x_t \Delta_t^\top e_j\Big| \leq \max_{i,k}\frac{B_i}{c_i(k)}\sqrt{2\tau(E)\log(2d^2/\delta')}.$$
In particular, we get that for all $z \in \mathbb{R}^d$,
$$\Big|\sum_{t} z^\top x_t \Delta_t^\top z\Big| = \Big|\sum_{k,j}(z^\top e_k)(z^\top e_j)\sum_t e_k^\top x_t \Delta_t^\top e_j\Big| \leq \max_{i,k}\frac{B_i}{c_i(k)}\sqrt{2\tau(E)\log(2d^2/\delta')}\Big(\sum_k |z^\top e_k|\Big)^2 \leq 2d\max_{i,k}\frac{B_i}{c_i(k)}\sqrt{\tau(E)\log(2d/\delta')}\,\|z\|_2^2,$$
where the last step follows from the fact that, by Cauchy-Schwarz,
$$\Big(\sum_k |z^\top e_k|\Big)^2 = \Big(\sum_k |z(k)|\Big)^2 \leq \Big(\sum_k 1^2\Big)\cdot\Big(\sum_k z(k)^2\Big) = d\,\|z\|_2^2.$$
We conclude the proof with a union bound over all three inequalities, taking $\delta' = \delta/3$. ∎

B Proof of Theorem 5.2

We drop the $\tau(E)$ subscripts when clear from context. We first note that $\hat\beta_E$ is a least-squares solution.

Claim B.1. $\hat\beta_E \in LS_E(\tau(E))$.

Proof. This follows immediately from noting that
$$\big(\bar{X}\hat\beta_E - \bar{Y}\big)^\top\big(\bar{X}\hat\beta_E - \bar{Y}\big) = \big(\bar{X}\beta_E - \bar{Y}\big)^\top\big(\bar{X}\beta_E - \bar{Y}\big),$$
as $\bar{X}v = \bar{X}(U)v = 0$: by definition of $U$, every row $\bar{x}_t^\top$ of $\bar{X}$ lies in $U$, while $v \in U^\perp$. ∎

Second, we show that $\hat\beta_E$ has large norm:

Claim B.2. $\|\hat\beta_E\|_2 \geq \alpha$.

Proof. First, we note that necessarily $\beta_E \in U_{\tau(E)}$. Suppose not; then we can write $\beta_E = \beta_E(U_{\tau(E)}) + \beta_E(U_{\tau(E)}^\perp)$ with $\beta_E(U_{\tau(E)}^\perp) \neq 0$. By the same argument as in Claim B.1, $\beta_E(U_{\tau(E)})$ is a least-squares solution.
Using orthogonality of $U_{\tau(E)}$ and $U_{\tau(E)}^\perp$ and the fact that $\big\| \beta_E^{U_{\tau(E)}^\perp} \big\|_2 > 0$, we have
\[
\| \beta_E \|_2^2 = \big\| \beta_E^{U_{\tau(E)}} \big\|_2^2 + \big\| \beta_E^{U_{\tau(E)}^\perp} \big\|_2^2 > \big\| \beta_E^{U_{\tau(E)}} \big\|_2^2.
\]
This contradicts $\beta_E$ being a minimum-norm least-squares solution. Hence, it must be the case that $\beta_E \in U_{\tau(E)}$. Since $v \in U_{\tau(E)}^\perp$, we have that $\beta_E$ and $v$ are orthogonal with $\|v\|_2 = 1$, implying
\[
\| \hat\beta_E \|_2^2 = \| \beta_E \|_2^2 + \alpha^2 \|v\|_2^2 \ge \alpha^2.
\]
This concludes the proof.

We argue that such a solution places a large amount of weight on currently unexplored features:

Lemma B.3. At time $\tau(E)$, suppose $\mathrm{rank}\big(U_{\tau(E)}\big) < d$. Suppose $n \ge \frac{\kappa d^2}{\lambda} \sqrt{\tau(E) \log(12 d/\delta')}$. Take any $\alpha$ with
\[
\alpha \ge \gamma \left( \sqrt{d} + \frac{K d \sqrt{T \log(4 d/\delta')}}{\lambda n} \right),
\]
where $\gamma$ is a constant that depends only on $\mathcal{C}$. With probability at least $1-\delta'$, there exist $i \in [l]$ and a feature $k \notin D_{\tau(E)}$ with
\[
\frac{\hat\beta_E(k)}{c_i(k)} > \frac{\hat\beta_E(j)}{c_i(j)}, \quad \forall j \in D_{\tau(E)}.
\]

Proof. Since $\hat\beta_E \in \mathrm{LSE}(\tau(E))$, it must be, by Theorem 4.1, that with probability at least $1-\delta'$,
\[
\sqrt{ \sum_{k \in D} \big( \hat\beta_E(k) - \beta^*(k) \big)^2 } \le \frac{K \sqrt{d \tau(E) \log(4 d/\delta')}}{\lambda n} \le \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n}. \tag{4}
\]
First, since $z \mapsto \sqrt{\sum_{k \in D} z(k)^2}$ defines a norm (in fact, the $\ell_2$-norm in $\mathbb{R}^{|D|}$), it must be the case that
\[
\sqrt{ \sum_{k \in D} (z(k) - z'(k))^2 } \ge \sqrt{ \sum_{k \in D} z(k)^2 } - \sqrt{ \sum_{k \in D} z'(k)^2 }.
\]
In turn, plugging this into Equation (4), we obtain
\[
\sqrt{ \sum_{k \in D} \hat\beta_E(k)^2 } \le \sqrt{ \sum_{k \in D} \beta^*(k)^2 } + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n} \le \| \beta^* \|_2 + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n} \le \sqrt{d} + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n}.
\]
By the triangle inequality and the lemma's assumption, we also have
\[
\sqrt{ \sum_{k \in D} \hat\beta_E(k)^2 } + \sqrt{ \sum_{k \notin D} \hat\beta_E(k)^2 } \ge \| \hat\beta_E \|_2 \ge \alpha.
\]
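Claims B.1 and B.2 rest on a standard linear-algebra fact: the minimum-norm least-squares solution lies in the row space of the design matrix, so adding $\alpha v$ for a unit null-space direction $v$ preserves the residual while growing the squared norm by exactly $\alpha^2$. A small NumPy sketch with generic rank-deficient matrices (not the paper's $\bar X$):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 5
# Rank-3 design matrix (product of n x 3 and 3 x d factors), so rank < d.
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d))
y = rng.normal(size=n)

# Minimum-norm least-squares solution via the pseudoinverse.
beta = np.linalg.pinv(X) @ y

# beta lies in the row space of X: projecting onto it changes nothing.
P_row = np.linalg.pinv(X) @ X
assert np.allclose(P_row @ beta, beta)

# A unit vector v in the null space (orthogonal to the row space).
_, _, Vt = np.linalg.svd(X)
v = Vt[-1]                      # smallest singular direction; rank(X)=3 < d
assert np.allclose(X @ v, 0)

# beta + alpha*v has the same residual but strictly larger norm.
alpha = 2.0
beta_hat = beta + alpha * v
assert np.isclose(np.linalg.norm(X @ beta_hat - y),
                  np.linalg.norm(X @ beta - y))
assert np.isclose(np.linalg.norm(beta_hat) ** 2,
                  np.linalg.norm(beta) ** 2 + alpha ** 2)
```

The last assertion is the Pythagorean identity behind Claim B.2: since $\beta \perp v$ and $\|v\|_2 = 1$, the perturbed solution has norm at least $\alpha$.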
Combining the last two equations, we obtain
\[
\sqrt{d} + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n} + \sqrt{ \sum_{k \notin D} \hat\beta_E(k)^2 } \ge \alpha,
\]
which implies that for $\alpha \ge \gamma \Big( \sqrt{d} + \frac{K d \sqrt{T \log(4 d/\delta')}}{\lambda n} \Big)$, we have
\[
\sqrt{ \sum_{k \notin D} \hat\beta_E(k)^2 } \ge \alpha - \sqrt{d} - \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n} \ge \sqrt{d}\, (\gamma - 1) \left( 1 + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n} \right).
\]
Second, note that Equation (4) implies immediately that for any $j \in D$,
\[
\big| \hat\beta_E(j) - \beta^*(j) \big| \le \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n},
\]
and in turn,
\[
\big| \hat\beta_E(j) \big| \le | \beta^*(j) | + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n} \le 1 + \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n}.
\]
Therefore,
\[
\sqrt{ \sum_{k \notin D} \hat\beta_E(k)^2 } \ge \sqrt{d}\, (\gamma - 1) \max_{j \in D} \big| \hat\beta_E(j) \big|.
\]
Hence, there must exist a feature $k \notin D$ with $\big| \hat\beta_E(k) \big| \ge (\gamma - 1) \max_{j \in D} \big| \hat\beta_E(j) \big|$. Picking $\gamma$ such that for some $i \in [l]$, $\gamma - 1 \ge \max_{j \in D} \frac{c_i(k)}{c_i(j)}$ yields the result immediately.

The proof of Theorem 5.2 follows directly from Lemma B.3 and a union bound over the first $d$ epochs. With probability at least $1 - d\delta'$, for every epoch $E \in [d]$, there is a feature $k \notin D_{\tau(E)}$ such that for some $i \in [l]$,
\[
\frac{\hat\beta_E(k)}{c_i(k)} > \frac{\hat\beta_E(j)}{c_i(j)}, \quad \forall j \in D_{\tau(E)}.
\]
This implies that there exists $k \in D_{\tau(E+1)}$ with $k \notin D_{\tau(E)}$. Applying this $d$ times, we have that if $T \ge dn$, then necessarily $D_T = [d]$. We can then apply Theorem 4.1 to show that with probability at least $1-\delta'$,
\[
\Big\| \hat\beta_{T/n} - \beta^* \Big\|_2 \le \frac{K \sqrt{d T \log(4 d/\delta')}}{\lambda n}.
\]
Taking a union bound over the two events above and $\delta = 2 d \delta'$, we get the theorem statement with probability at least $1 - \delta'(d+1) \ge 1 - \delta$.
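The epoch-by-epoch argument concluding the theorem can be caricatured in a toy simulation: if, in each epoch, the published model places a sufficiently large coefficient on some undiscovered feature (as Lemma B.3 guarantees), agents' best responses reveal a new feature every epoch, so all of $[d]$ is discovered after $d$ epochs. The single cost type, cost vector, and coefficient values below are all hypothetical, chosen only to make the best-response comparison go through:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
c = rng.uniform(0.5, 2.0, size=d)   # hypothetical cost vector, one cost type

discovered = set()
for epoch in range(d):
    # Stand-in for Lemma B.3: the published model places a large coefficient
    # on one undiscovered feature (10.0 dominates every ratio beta/c here,
    # since other coefficients are at most 1 and costs are at least 0.5).
    beta_hat = rng.uniform(0, 1, size=d)
    undiscovered = [k for k in range(d) if k not in discovered]
    beta_hat[rng.choice(undiscovered)] = 10.0

    # Agents best-respond by modifying the feature maximizing beta_hat(k)/c(k),
    # which the learner then adds to the discovered set D.
    k_star = int(np.argmax(beta_hat / c))
    discovered.add(k_star)

# Each epoch discovers at least one new feature, so after d epochs D = [d].
assert discovered == set(range(d))
```

This is only a cartoon of the induction, not the paper's algorithm: the real argument derives the large coefficient on an unexplored feature from the minimum-norm structure of $\hat\beta_E$ rather than planting it by hand.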