Online Learning for Supervisory Switching Control


Authors: Haoyuan Sun, Ali Jadbabaie

Massachusetts Institute of Technology

Abstract

We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy the best controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the most suitable controller in $O(N \log N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.

1 Introduction

This paper presents a supervisory switching framework that guarantees finite-time adaptation for unknown linear systems, leveraging online learning algorithms to establish a non-asymptotic analysis of this classical problem.
In many applications involving complex systems, such as power systems [16], autonomous vehicles [2], and epidemic control [5], it is impractical for a single controller design to achieve acceptable performance across all operating conditions of the unknown plant. To address this challenge, switching control employs a collection of candidate controllers alongside a switching policy that selects the controller best suited to the current system behavior [14]. This approach has important applications in many established control paradigms, including gain scheduling [23, 25] and linear parameter-varying control [17, 33]. Within this framework, supervisory switching control relies on a higher-level policy that monitors system measurements to determine which controller should be inserted into the feedback loop. One well-studied class of supervisory policies is estimator-based supervision [7, 18, 19]. This approach considers a setting where the true unknown system belongs to a predefined collection of models, each associated with a candidate controller providing satisfactory performance for that specific model. Then, a multi-estimator is constructed by simulating the predicted output of each model and computing its deviation from the true observed process $y_t$. A switching signal $p_t$ is then determined by periodically selecting the model most consistent with the observations. Provided the switches are separated by a dwell time sufficiently long to avoid chattering, the certainty equivalence principle ensures that the closed-loop system resulting from this policy is asymptotically stable.

This work was supported by ONR grant #N00014-23-1-2299 and AFOSR MURI grant #FA9550-25-1-0375. Corresponding author: Haoyuan Sun (haoyuans [at] mit [dot] edu).
However, this traditional estimator-based approach does not quantitatively specify the time required to find a suitable controller, and deriving precise finite-time bounds remains an open challenge.

To develop a non-asymptotic analysis of supervisory switching control, we revisit this classical problem through the lens of modern learning techniques. A natural starting point is the non-asymptotic analysis of system identification, which estimates the parameters of an unknown dynamical system from a single trajectory of input-output data. Conceptually, estimator-based supervisors select the model that minimizes the error between predicted and observed outputs, akin to the least-squares objectives underpinning system identification. Recent advances in non-asymptotic system identification have extended classical asymptotic results [15] to quantify the sample complexity of learning accurate system estimates from data [13, 27, 34]. This has led to numerous methods for the non-asymptotic identification of partially observed linear systems, typically by estimating the system's response matrix to an exploratory signal and reconstructing the system parameters via the Ho-Kalman algorithm [4, 21, 24, 28]. However, these methods cannot be directly applied to our supervisory switching control paradigm because they presuppose system stability. This assumption is frequently violated in switching systems, where not all candidate controllers stabilize the underlying plant. Therefore, standard system identification approaches cannot ensure safe exploration, as the potentially unstable controllers may cause the system to diverge before sufficient data is collected. Several works have attempted to address this limitation of performing system identification under instability.
For example, one can explicitly identify the system dynamics by injecting large probing signals and subsequently control the learned system with online convex optimization [6]. A different approach pairs a multi-armed bandit algorithm with a stability certificate to safely switch between candidate controllers for unknown nonlinear systems [10, 12]. Crucially, however, these methods apply only to fully-observed systems, requiring direct knowledge of the system states to guarantee both performance and stability. In contrast, our prior work [29] addresses the partially-observed setting by employing a two-step procedure: first detecting and eliminating destabilizing controllers, and then applying least-squares system identification to the remaining stable candidates. While this approach ensures safety under partial observability, it relies on a conservative instability detection criterion and exhaustively searches through controllers in a fixed order. This results in a highly inefficient sample complexity, where the number of exploration steps scales exponentially with the number of candidates.

Contributions. In this work, we propose a data-driven analysis of supervisory switching control that establishes non-asymptotic guarantees for adapting to partially-observed linear systems with a collection of $N$ candidate controllers. Drawing on techniques from the machine learning and multi-armed bandit literature, our approach (Algorithm 1) evaluates each candidate controller over fixed switching intervals. Specifically, our evaluation criteria leverage system observability to approximate state trajectories from output data, effectively isolating the influence of earlier destabilizing controls. This eliminates the need for the stability or fully-observed state assumptions mandated by previous works.
To achieve this, we introduce two evaluation criteria to simultaneously detect destabilizing controllers (Criterion 1) and identify the unknown system (Criterion 2). The fixed interval between switches is deliberately chosen to ensure sufficient measurements for both criteria, thereby providing a quantitative lower bound on the dwell time that was only qualitatively defined in classical estimator-based supervision.

Finally, based on these evaluation metrics, our algorithm selects the candidate controller that balances exploration and exploitation. Unlike standard multi-armed bandit frameworks, which assume independent and stationary rewards, our algorithm explicitly accounts for the evolution of the system states, striking a balance between learning performance and system stability. We present two algorithmic variants that find the most suitable controller in $O(N \log N)$ steps and at the same time guarantee finite $L_2$-gain with respect to disturbances (Theorem 3). These guarantees represent a significant improvement over our prior work [29], which required $\exp(N)$ steps for identification.

2 Problem Settings

We consider a collection of discrete-time, partially-observed linear systems $\{(C_i, A_i, B_i)\}_{i=1}^N$. The unknown true system parameters $(C_\star, A_\star, B_\star)$ belong to this collection, and we denote their index as $i_\star$. Each of these systems evolves according to the following discrete-time dynamics:
$$x_{t+1} = A_i x_t + B_i \bar u_t + w_t, \qquad y_t = C_i x_t + \eta_t,$$
where the dimensions are $x_t \in \mathbb{R}^{d_x}$, $\bar u_t \in \mathbb{R}^{d_u}$, and $y_t \in \mathbb{R}^{d_y}$. We assume zero initialization $x_0 = 0$ and independent Gaussian noise for both the process disturbance $w_t \sim \mathcal{N}(0, \sigma_w^2 I_{d_x})$ and the observation noise $\eta_t \sim \mathcal{N}(0, \sigma_\eta^2 I_{d_y})$. Each candidate model is paired with a stabilizing candidate linear controller.
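To make the setup concrete, the following sketch shows how the closed-loop systems and their Markov parameters (both defined precisely below) can be pre-computed from a candidate collection. For simplicity it assumes a static output-feedback gain in place of the dynamic multi-controller architecture adopted in the paper, and all numbers are illustrative.

```python
import numpy as np

def closed_loop(A, B, C, K):
    """Closed loop of plant (C, A, B) under a static output-feedback gain K:
    with the control input K y_t + u_t, the state update becomes
    x_{t+1} = (A + B K C) x_t + B u_t + w_t."""
    return C, A + B @ K @ C, B

def markov_parameter(C, A, B, h):
    """Markov parameter G = [C A^{h-1} B, ..., C A B, C B] over horizon h."""
    blocks, Ak = [], np.eye(A.shape[0])
    for _ in range(h):
        blocks.append(C @ Ak @ B)     # C A^s B for s = 0, 1, ..., h-1
        Ak = A @ Ak
    return np.hstack(blocks[::-1])    # highest power of A first

# Scalar plant: the gain -0.5 stabilizes it, while +1.0 destabilizes it.
A = np.array([[1.2]]); B = np.array([[1.0]]); C = np.array([[1.0]])
C1, A_good, B1 = closed_loop(A, B, C, np.array([[-0.5]]))
_,  A_bad,  _  = closed_loop(A, B, C, np.array([[1.0]]))
print(max(abs(np.linalg.eigvals(A_good))))   # 0.7 -> stable closed loop
print(max(abs(np.linalg.eigvals(A_bad))))    # 2.2 -> unstable closed loop
print(markov_parameter(C1, A_good, B1, h=3)) # [C A^2 B, C A B, C B]
```

Under this simplification, the $N^2$ closed loops $M_{[i,j]}$ are obtained by pairing every plant with every gain, matching the pre-computation described below.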
We assume that while a model's associated controller provides satisfactory performance for that specific model, applying a controller mismatched to the true plant may result in an unstable closed-loop system. To facilitate switching between candidate controllers, we adopt the multi-controller architecture $K(p_t; \check x_t, y_t)$ of [7], where $\check x_t$ is the internal state of the controller, $p_t \in [N]$ is a piecewise-constant switching signal indicating the active controller, and $y_t$ is the system's output. To ensure persistency of excitation, we inject an independent, additive Gaussian signal $u_t \sim \mathcal{N}(0, \sigma_u^2 I_{d_u})$, so that the control signal into the open-loop system is $\bar u_t = u_t + K(p_t; \check x_t, y_t)$.

For every pair $1 \le i, j \le N$, we denote $M_{[i,j]} = (C_{[i,j]}, A_{[i,j]}, B_{[i,j]})$ as the closed-loop system resulting from applying the $j$th candidate controller to the $i$th open-loop model. As illustrated by Figure 1, applying a constant switching signal $p_t = j$ to the switching system results in a closed loop modeled by $M_{[i_\star,j]}$. Furthermore, we can pre-compute every possible closed-loop system $M_{[i,j]}$ from the collection of candidate models and controllers. Consequently, under a piecewise-constant switching signal, our observations are generated by one of these $N^2$ possible processes.

Figure 1: Illustration of a linear switching system with a fixed switching signal $p_t = j$: the multi-controller $K$ in feedback with the true plant $(C_\star, A_\star, B_\star)$ realizes the closed-loop system $(C_{[i_\star,j]}, A_{[i_\star,j]}, B_{[i_\star,j]})$.

To quantitatively analyze the finite-time performance of this switching system, we assume that applying the matched controller yields an observable system, and that any destabilizing controller can be readily detected:

Assumption 1. For every $1 \le i \le N$, the correctly matched closed-loop system $M_{[i,i]}$ is $\varepsilon_c$-strictly observable with index $\nu$.
That is, there exists a constant $\varepsilon_c > 0$ so that
$$\sigma_{\min}\left(\left[C_{[i,i]} A_{[i,i]}^{\nu-1};\ \ldots;\ C_{[i,i]} A_{[i,i]};\ C_{[i,i]}\right]\right) \ge \varepsilon_c \qquad \forall\, 1 \le i \le N.$$

Assumption 2. If a closed-loop system $M_{[i,j]}$ is unstable, there exists a constant $\varepsilon_a > 0$ so that its spectral radius satisfies $\rho(A_{[i,j]}) \ge 1 + \varepsilon_a$.

Furthermore, we assume the candidate controllers are sufficiently distinct, so that their resulting closed-loop systems exhibit different dynamics.

Assumption 3. Given a system $(C, A, B)$, define its Markov parameter $G$ with horizon $h$ as
$$G := \left[C A^{h-1} B,\ \ldots,\ C A B,\ C B\right].$$
Then, for a fixed horizon $h$, the Markov parameters $G_{[i,j]}$ corresponding to the $N^2$ possible closed-loop systems are sufficiently different. Specifically, there exists a constant $\gamma > 0$ such that for any values of $j$ and $i \neq i'$, we have $\|G_{[i,j]} - G_{[i',j]}\|_{op} \ge \gamma$.

Notations. For a matrix $M$, we denote $\|M\|_F$ as its Frobenius norm, $\|M\|_{op} = \sigma_{\max}(M)$ as its operator norm, $\rho(M)$ as its spectral radius, and $\operatorname{tr}(M)$ as its trace. For a stable linear system $(C, A, B)$, we define its $H_\infty$ norm as $\|C, A, B\|_{H_\infty} = \sup_{|s| = 1} \sigma_{\max}(C(sI - A)^{-1}B)$, and for simplicity, we use the shorthand $\|C, A\|_{H_\infty} = \|C, A, I\|_{H_\infty}$ and $\|A\|_{H_\infty} = \|I, A, I\|_{H_\infty}$.

To streamline our exposition, we occasionally omit constant factors that do not meaningfully impact our conclusions. We use big-$O$ notation $f \in O(g)$ if $\limsup_{x \to \infty} f(x)/g(x)$ is finite. Also, we write $f \lesssim g$ if $f \le c \cdot g$ for some universal constant $c$, and $f \asymp g$ if both $f \lesssim g$ and $g \lesssim f$ hold. Unless stated otherwise, all dimension-dependent terms are written explicitly. In particular, we consider the Frobenius norm and the trace to be dimension-dependent, whereas the operator norm is dimension-free.

Finally, to analyze stochastic disturbances, we consider a generalization of Gaussian random variables.
Intuitively, a sub-Gaussian random variable has tails that are dominated by those of a Gaussian distribution.

Definition 1. We say that a zero-mean random vector $X \in \mathbb{R}^d$ is $\sigma^2$-sub-Gaussian if for every unit vector $v$ and real value $\lambda$, it holds that $\mathbb{E} \exp(\lambda \langle v, X \rangle) \le \exp(\sigma^2 \lambda^2 / 2)$.

3 Algorithm Design and Evaluation Criteria

In this section, we introduce our online algorithm for supervisory switching control and establish its finite-time performance guarantees. The algorithm proceeds in episodes of fixed length $\tau$, where $\tau$ is a constant specified in Propositions 1 and 2 to ensure sufficient measurements are collected for each switching decision. During each episode, a candidate controller is applied to the plant, and we observe the resulting output trajectory. Upon completion of the episode, we evaluate the active controller using a scoring criterion $S = \lfloor \frac{1}{2}(S^{(1)} + S^{(2)}) \rfloor$, where $S^{(1)}$ determines whether the controller stabilizes the unknown plant, and $S^{(2)}$ uses least-squares system identification to assess consistency between the controller's associated model and the observed data. We then select the controller for the next episode by maximizing the average score plus an exploration factor governed by the parameter $a_\ell$. After a predetermined number of episodes $L$, we terminate exploration and commit to the most frequently selected controller. This procedure is formalized in Algorithm 1.

Algorithm 1 Online algorithm for supervisory switching control
1: Input: collection of candidate models $\{(C_i, A_i, B_i)\}_{i=1}^N$ and their associated controllers.
2: Define $\bar s_i(q) = \frac{1}{q} \sum_{\ell=1}^{q} s_i[\ell]$ to be the average score.
3: Let $Q_i(\ell)$ be the number of applications of controller $i$ through episode $\ell$.
4: for episodes $\ell = 1, 2, \ldots, L$ do
5:   if $\ell \le N$ then
6:     Select controller $i_\ell = \ell$.
7:   else
8:     Select controller $i_\ell = \operatorname{argmax}_i\ \bar s_i(Q_i(\ell-1)) + \sqrt{a_\ell / Q_i(\ell-1)}$.
9:   Roll out a trajectory of length $\tau$ steps and observe $(y_{\tau(\ell-1)+1}, \ldots, y_{\tau\ell})$.
10:  From this trajectory, compute the score $s_{i_\ell}[Q_{i_\ell}(\ell)] = S(y_{\tau(\ell-1)+1}, \ldots, y_{\tau\ell};\ i_\ell)$.
11: Commit to controller $\hat i = \operatorname{argmax}_i Q_i(L)$.

We emphasize that this setting introduces fundamental challenges distinct from standard online learning frameworks:

• Standard multi-armed bandit problems assume independent and identically distributed observations, where actions in one round do not affect observations in future rounds. In contrast, in our setting, the system state at the end of one episode serves as the initial condition for the next, creating intricate dependencies across switching decisions.

• Theoretical analyses of online learning with states, such as reinforcement learning (RL), typically rely on one of two assumptions to bound state trajectories during exploration. Episodic RL [9] assumes the system is periodically reset to a default initial state distribution, preventing compounding errors across episodes. Conversely, non-episodic online RL analyzes continuous operation without resets [20, 26, 30], but relies on restrictive structural assumptions, such as open-loop system stability, to prevent the state from diverging. In contrast, our setting requires exploring potentially destabilizing controllers without resets. This means that the system states may diverge, thus violating the core assumptions of standard RL analysis.

To overcome these challenges, our evaluation criteria leverage system observability to isolate the effect of the initial state of each episode. Specifically, we can use observability to approximate the episode's initial state from the output trajectory and use this estimate to subtract the transient response from the data. This allows us to evaluate each controller's intrinsic performance without the influence of the state history from previous episodes.
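The selection rule of Algorithm 1 can be sketched in a few lines. Here the trajectory rollout and the scoring criterion $S$ are replaced by a toy stand-in `score_fn`, the scores are made noiseless so the run is deterministic, and all names and numbers are illustrative.

```python
import numpy as np

def supervisory_ucb(score_fn, N, L, a):
    """Skeleton of Algorithm 1.  `score_fn(i)` stands in for rolling out one
    episode under controller i and scoring it with S (Criteria 1 and 2);
    `a(l)` is the exploration parameter a_l.  Returns argmax_i Q_i(L)."""
    Q = np.zeros(N)           # Q_i: number of applications of controller i
    total = np.zeros(N)       # running sums, so total / Q is the average score
    for l in range(1, L + 1):
        if l <= N:
            i = l - 1                                   # warm start: try each once
        else:
            i = int(np.argmax(total / Q + np.sqrt(a(l) / Q)))
        total[i] += score_fn(i)
        Q[i] += 1
    return int(np.argmax(Q))

# Toy stand-in: controller 2 is "matching", so its expected score is highest.
mean_score = [0.1, 0.1, 0.9, 0.1]
a_var2 = lambda l: 0.5 * np.log(np.pi**2 * 4 * l**2 / 6)   # second variant of a_l
print(supervisory_ucb(lambda i: mean_score[i], N=4, L=100, a=a_var2))  # 2
```

With the large score gap, the exploration bonus of the non-matching controllers shrinks below the gap after a handful of pulls, so the matching controller dominates $Q_i(L)$, mirroring the committed choice $\hat i$ in the algorithm.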
3.1 Criterion 1: Instability Detection

The core objective of this criterion is to construct a statistic that remains bounded when the active controller and its associated model match the true system, but grows unbounded when the active controller is destabilizing. To achieve this, we leverage the system's observability to reconstruct an estimate of the episode's initial state from output measurements alone.

For convenience, denote the current episode's outputs as $y_1, \ldots, y_\tau$ and let $j$ index the active controller. We write out the following input-output relation for the first $\nu$ steps:
$$\begin{bmatrix} y_\nu \\ \vdots \\ y_2 \\ y_1 \end{bmatrix} = O_{[i_\star,j]} x_1 + \begin{bmatrix} T^{(\nu-1)}_{[i_\star,j]} \\ 0_{1 \times (\nu-1)} \end{bmatrix} \begin{bmatrix} w_{\nu-1} \\ \vdots \\ w_2 \\ w_1 \end{bmatrix} + \begin{bmatrix} T^{(\nu-1)}_{[i_\star,j]} \\ 0_{1 \times (\nu-1)} \end{bmatrix} \begin{bmatrix} B_{[i_\star,j]} u_{\nu-1} \\ \vdots \\ B_{[i_\star,j]} u_2 \\ B_{[i_\star,j]} u_1 \end{bmatrix} + \begin{bmatrix} \eta_\nu \\ \vdots \\ \eta_2 \\ \eta_1 \end{bmatrix}, \quad (1)$$
where we denote $O = [C A^{\nu-1};\ \ldots;\ C A;\ C]$ as the observability matrix and
$$T^{(k)} = \begin{bmatrix} C & C A & C A^2 & \cdots & C A^{k-1} \\ 0 & C & C A & \cdots & C A^{k-2} \\ 0 & 0 & C & \cdots & C A^{k-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & C \end{bmatrix}$$
as a $k \times k$ block Toeplitz matrix. Multiplying both sides of (1) by the pseudo-inverse $O^\dagger_{[i_\star,j]}$ yields
$$O^\dagger_{[i_\star,j]} [y_\nu, \ldots, y_1]^\top = x_1 + O^\dagger_{[i_\star,j]} \xi, \quad (2)$$
where $\xi$ aggregates all of the disturbance terms. (This pseudo-inverse can be computed efficiently via Arnoldi iteration without explicitly forming the observability matrix.)

We first consider the case where the active controller matches the true system ($i_\star = j$). In this case, the closed-loop dynamics are stable, and the Toeplitz matrices are bounded in norm by $\|C_{[j,j]}, A_{[j,j]}, B_{[j,j]}\|_{H_\infty}$ (see [31]). Under Assumption 1, the closed-loop system is $\nu$-strictly observable, which means $\|O^\dagger_{[j,j]}\|_{op} \le 1/\varepsilon_c$. Consequently, the aggregated disturbance term $O^\dagger_{[j,j]} \xi$ is sub-Gaussian. By standard concentration inequalities for sub-Gaussian random variables, the estimate $\hat x_1 := O^\dagger_{[j,j]} [y_\nu, \ldots, y_1]^\top$ is a close approximation to the true initial state $x_1$ of the episode with high probability. Then, simulating a predicted trajectory $\hat y_1, \ldots, \hat y_\tau$ starting from $\hat x_1$ under the candidate model yields residuals $y_t - \hat y_t$ that remain below a threshold $\Theta$ with high probability.

Next, when the active controller is destabilizing, Assumption 2 ensures the closed-loop system possesses an explosive mode in the direction of the leading eigenvector of $A_{[i_\star,j]}$. Since the system is persistently excited by Gaussian process noise, this explosive mode is activated with non-zero probability. In this regime, the residual $y_t - \hat y_t$ can be decomposed as the sum of an estimation error and accumulated disturbances, where the first part arises because $\hat x_1$ does not approximate the true initial state $x_1$ when $i_\star \neq j$. Critically, the estimation error is determined entirely by the first $\nu$ observations, while the system disturbances accumulate at every step of the trajectory. As a result, the accumulated disturbance contains a component driven by process noise realized after the $\nu$-th step, which is independent of the estimation error. Hence, with non-zero probability, these independent components will not cancel, and contributions from the explosive mode will dominate the residual for sufficiently large $\tau$. Thus, the residual $y_\tau - \hat y_\tau$ will often exceed the threshold $\Theta$ when the active controller is destabilizing.

We formalize this intuition in the following instability criterion, which outputs 1 when it classifies the current closed-loop system as stable and 0 otherwise.

Criterion 1 Instability detection criterion $S^{(1)}$
1: Input: an observed trajectory $y_1, \ldots, y_\tau$ of length $\tau$, active controller $j$.
2: Let $O_{[j,j]}$ be the observability matrix of the model $(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})$ and compute $\hat x_1 = O^\dagger_{[j,j]} [y_\nu, \ldots, y_1]^\top$.
3: Compute the predicted trajectory $\hat y_t = C_{[j,j]} A^{t-1}_{[j,j]} \hat x_1$.
4: if $\|y_\tau - \hat y_\tau\| > \sqrt{2 d_y \Theta \log(2 d_y / \delta)}$ then return 0
5: else return 1

We state the theoretical guarantees for this criterion below, with the full proof deferred to Appendix B.

Proposition 1. Given a fixed threshold $\Theta$ depending only on $\varepsilon_c$, $\sigma_u$, $\sigma_w$, $\sigma_\eta$, and $\max_i \|C_{[i,i]}, A_{[i,i]}, B_{[i,i]}\|_{H_\infty}$, and a sufficiently long trajectory length
$$\tau \ge T_1 \asymp \nu + \frac{1}{\log(1 + \varepsilon_a)} \log\!\left(\frac{d_y \Theta \log(2 d_y / \delta)}{\varepsilon_a \varepsilon_c \sigma_w}\right),$$
the instability criterion $S^{(1)}(y_1, \ldots, y_\tau; j)$ satisfies:
• If the $j$-th controller destabilizes the underlying plant, then $S^{(1)} = 0$ with probability $\ge 2/5$.
• If the active controller matches the true system ($i_\star = j$), then $S^{(1)} = 1$ with probability $\ge 1 - \delta$.

Remark 1. A sharper, dimension-free guarantee for instability detection can be attained by employing a more sophisticated thresholding scheme. As this variant follows directly from the proof of Proposition 1, we defer its discussion to Appendix B.

While this criterion may not successfully detect a destabilizing controller in a single episode, the matching controller will achieve a strictly higher expected score over repeated trials. This separation in expected scores enables the algorithm to reject destabilizing candidates over multiple episodes. We note, however, that this criterion is inconclusive for the remaining candidate controllers that do not match the true system but are stabilizing. We shall address this scenario with our second criterion.

3.2 Criterion 2: System Identification

Having established a mechanism to detect destabilizing controllers, we now introduce a second criterion to distinguish the matching controller from the other stabilizing candidates.
The ob jective is to construct a statistic that remains small when the active con troller’s associated mo del matches the true system, but is large when there is a mismatch. Like the previous part, we denote the curren t episo de’s outputs as y 1 , . . . , y τ and let j index the active con troller. W e b egin by recursiv ely 7 expand the system dynamics o v er a horizon of h steps: y t = C [ i ⋆ ,j ] x t + η t = C [ i ⋆ ,j ] ( A [ i ⋆ ,j ] x t − 1 + B [ i ⋆ ,j ] u t − 1 + w t − 1 ) + η t = C [ i ⋆ ,j ] A h [ i ⋆ ,j ] x t − h + h X s =1 C [ i ⋆ ,j ] A s − 1 [ i ⋆ ,j ] B [ i ⋆ ,j ] u t − s | {z } G [ i ⋆ ,j ] z t + h X s =1 C [ i ⋆ ,j ] A s − 1 [ i ⋆ ,j ] w t − s + η t := G [ i ⋆ ,j ] z t + e t , where z t = [ u t − h ; . . . ; u t − 1 ] is a sliding window of exploratory input signals, e t accoun ts for b oth the initial condition and accumulated noise, and G [ i ⋆ ,j ] denotes the Marko v parameter that is equal to the system’s output con trollabilit y matrix ov er horizon h : G [ i ⋆ ,j ] := h C [ i ⋆ ,j ] A h − 1 [ i ⋆ ,j ] B [ i ⋆ ,j ] , . . . , C [ i ⋆ ,j ] A [ i ⋆ ,j ] B [ i ⋆ ,j ] , C [ i ⋆ ,j ] B [ i ⋆ ,j ] i . W e then construct a shifted output tra jectory b y subtracting both the predicted tra jectory rolled out from the estimated initial condition and the candidate mo del’s nominal resp onse from the Mark ov parameters: e y t = y t − C [ i ⋆ ,j ] A t − 1 [ j,j ] ˆ x 1 − G [ j,j ] z t = ( G [ i ⋆ ,j ] − G [ j,j ] ) z t + ( e t − C [ i ⋆ ,j ] A t − 1 [ j,j ] ˆ x 1 ) := ( G [ i ⋆ ,j ] − G [ j,j ] ) z t + r t , where r t aggregates the residual terms not mo deled by z t . When the active controller matches the true system ( i ⋆ = j ), the parameter gap ∆ := G [ i ⋆ ,j ] − G [ j,j ] is identically zero. By the same stability argumen ts established in Section 3.1, the residual term r t is a σ 2 r -sub-Gaussian random v ector. 
Estimating $\Delta$ from the data points $\{(z_t, \tilde y_t)\}_{t=\nu+h+1}^{\tau}$ via ordinary least squares (OLS) yields:
$$\hat\Delta := \operatorname{argmin}_\Delta \sum_{t=\nu+h+1}^{\tau} \|\tilde y_t - \Delta z_t\|^2 = \left(\sum_{t=\nu+h+1}^{\tau} \tilde y_t z_t^\top\right) \left(\sum_{t=\nu+h+1}^{\tau} z_t z_t^\top\right)^{-1}.$$
For sufficiently large $\tau$, the value of $\|\hat\Delta\|_{op}$ falls below the threshold $\gamma$ with high probability.

Next, when the active controller does not match the true system ($i_\star \neq j$), Assumption 3 guarantees that $\|\Delta\|_{op} \ge \gamma$. We can decompose the OLS estimate as follows:
$$\hat\Delta = \left(\sum_{t=\nu+h+1}^{\tau} (\Delta z_t + r_t) z_t^\top\right) \left(\sum_{t=\nu+h+1}^{\tau} z_t z_t^\top\right)^{-1} = \Delta + \left(\sum_{t=\nu+h+1}^{\tau} r_t z_t^\top\right) \left(\sum_{t=\nu+h+1}^{\tau} z_t z_t^\top\right)^{-1}.$$
Crucially, in the second term, the $r_t$'s and the $z_t$'s are independent. It then turns out that when controller $j$ does not destabilize the underlying plant, there is non-zero probability that the residuals do not cancel the true difference $\Delta$, yielding $\|\hat\Delta\|_{op} \ge \|\Delta\|_{op} \ge \gamma$.

To derive bounds free of the system dimensions $d_x$, $d_u$, $d_y$, our criterion evaluates the OLS estimate along specific directions. Assumption 3 implies that, for every $k \neq j$, there exist unit vectors $(u_k, v_k)$ satisfying $u_k^\top (G_{[k,j]} - G_{[j,j]}) v_k = \|G_{[k,j]} - G_{[j,j]}\|_{op} \ge \gamma$. We refer to these as the critical directions, which can be pre-computed from the collection of candidate models.

Criterion 2 System identification criterion $S^{(2)}$
1: Input: an observed trajectory $y_1, \ldots, y_\tau$ of length $\tau$, active controller $j$.
2: Let $O_{[j,j]}$ be the observability matrix of the model $(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})$ and compute $\hat x_1 = O^\dagger_{[j,j]} [y_\nu, \ldots, y_1]^\top$.
3: Compute the shifted trajectory $\tilde y_t = y_t - C_{[j,j]} A^{t-1}_{[j,j]} \hat x_1 - G_{[j,j]} z_t$.
4: for all critical directions $(u_k, v_k)$, $k = 1, \ldots, N$ do
5:   Compute the OLS estimate $\hat\Delta_k$ on the data points $\{(v_k^\top z_t,\ u_k^\top \tilde y_t)\}_{t=\nu+h+1}^{\tau}$.
6:   if $|\hat\Delta_k| > \gamma$ then return 0
7: return 1

We state the theoretical guarantees for this criterion below. The proof of this result is given in Appendix C.

Proposition 2. For a sufficiently long trajectory length
$$\tau \ge T_2 \asymp \nu + h + \frac{\sigma_r^2}{\sigma_u^2 \gamma^2} \log(N/\delta),$$
and when controller $j$ does not destabilize the underlying plant, the system identification criterion $S^{(2)}(y_1, \ldots, y_\tau; j)$ satisfies:
• If we have a matching controller ($i_\star = j$), then $S^{(2)} = 1$ with probability $\ge 1 - \delta$.
• Otherwise, if $i_\star \neq j$, then $S^{(2)} = 0$ with probability $\ge 1/2$.

We note that the ratio $\sigma_r^2 / \sigma_u^2$ in our bound corresponds to the inverse signal-to-noise ratio of the exploratory signal $u_t$. By combining both Criterion 1 and Criterion 2, the matching controller achieves the highest expected score over repeated episodes, thus provably identifying the most suitable controller.

4 Main Result

With the guarantees of the two evaluation criteria established, we now analyze the performance of our supervisory switching control algorithm. To balance exploration and exploitation, we draw inspiration from the Upper Confidence Bound (UCB) algorithm [3]. At each episode, we select the controller by maximizing its empirical average score plus an exploration bonus. The magnitude of this bonus encourages exploration of controllers with limited interactions, and the parameter $a_\ell$ controls the exploration strength.

We emphasize that standard UCB analysis fails in our setting due to the state dependency between different episodes and the requirement for closed-loop system stability. However, by leveraging the properties of the criterion $S$ (Propositions 1 and 2), we can isolate the empirical average scores from the initial state of each episode.
This effectively allows for a UCB-style analysis by decoupling the episodes.

We present two distinct choices for the exploration parameter $a_\ell$. The first choice prioritizes aggressive, uniform exploration, while the second scales dynamically to provide a more refined trade-off between exploration and exploitation. We demonstrate that both variants enable the framework to identify the optimal controller in $O(N \log N)$ episodes while simultaneously ensuring system stability.

Theorem 3. Suppose the length of each episode satisfies $\tau \ge \max(T_1, T_2)$ and the total number of episodes equals $L = O(N \log(N/\delta))$. Let the exploration parameter be set to either $a_\ell = \frac{L}{72 N}$ or $a_\ell = \frac{1}{2} \log\left(\frac{\pi^2 N \ell^2}{6 \delta}\right)$. Then with probability $1 - \delta$, Algorithm 1 satisfies the following properties:
• The algorithm commits to the most suitable controller at termination, i.e., $i_\star = \hat i$.
• The closed-loop trajectory has finite $L_2$-gain with respect to the disturbances $u_t$ and $w_t$; i.e., there exist constants $C_0, C_1$ so that for any $T > \tau L$,
$$\sum_{t=1}^{T} \|x_t\|^2 \le C_0 \left(\sum_{t=1}^{\tau L} \|u_t\|^2 + \|w_t\|^2\right) + C_1 \left(\sum_{t=\tau L + 1}^{T} \|u_t\|^2 + \|w_t\|^2\right).$$

Proof sketch. From Propositions 1 and 2, we can probabilistically bound the outcomes of our evaluation metric $S$ independently of the initial states of the episodes. This enables us to decouple the state dependencies of different episodes. Let $\mu_+$ denote a lower bound on the expected score of the matching controller ($j = i_\star$), and let $\mu_-$ denote an upper bound on the expected score of any non-matching controller ($j \neq i_\star$). Recall that $Q_i(\ell)$ denotes the number of times controller $i$ has been applied through episode $\ell$.
Similar to standard UCB analysis, if Algorithm 1 selects a non-matching controller $j \neq i_\star$ at episode $\ell$, then at least one of the following three conditions must be met:
• The average score for the matching controller is too low: $\bar s_{i_\star}[Q_{i_\star}(\ell)] \le \mu_+ - \sqrt{a_\ell / Q_{i_\star}(\ell)}$.
• The average score for a non-matching controller is too high: $\exists\, j \neq i_\star$ s.t. $\bar s_j[Q_j(\ell)] \ge \mu_- + \sqrt{a_\ell / Q_j(\ell)}$.
• The exploration bonus is too big: $\exists\, j \neq i_\star$ s.t. $2\sqrt{a_\ell / Q_j(\ell)} > \mu_+ - \mu_-$.

Notice that the third condition establishes an upper bound on the number of times a non-matching controller can be selected before its exploration bonus shrinks below the expected score gap. Let $E$ denote the "good" event where the first two conditions do not occur. Conditioned on $E$, we can upper-bound the frequency of selecting any non-matching controller by $4(N-1)(\mu_+ - \mu_-)^{-2} a_L$. We can then lower-bound $\Pr(E)$ with martingale concentration inequalities, yielding a probabilistic lower bound on the frequency of selecting the matching controller.

To establish the first part of the theorem, we evaluate our two specific choices for the exploration parameter $a_\ell$:
• In the first variant, we set $a_\ell = \frac{L}{72 N}$ to ensure that the matching controller is selected at least half of the time under $E$. Then we choose a sufficiently long horizon $L$ so that the event $E$ occurs with probability $1 - \delta$.
• In the second variant, we dynamically set $a_\ell = \frac{1}{2} \log\left(\frac{\pi^2 N \ell^2}{6 \delta}\right)$ so that the event $E$ occurs with probability $1 - \delta$ for any horizon $L$. Then, we pick a sufficiently large $L$ so that the matching controller is selected at least half of the time under $E$.

The second part of the theorem follows from the fact that applying the matching controller results in a stable closed-loop system. The constant $C_0$ is an upper bound on the worst-case transient energy accumulated over the $O(N \log(N/\delta))$ exploratory episodes.
The constant $C_1$ is directly inherited from the performance of the matching controller. For the complete proof of this result, we refer the reader to Appendix D.

The first part of this theorem establishes a sample complexity bound, stating that our algorithm requires $O(N \log N)$ episodes to identify the matching controller with high probability. This represents a significant improvement over our prior work [29], which required $O(\exp(N))$ steps. The second part of this theorem addresses closed-loop stability. We note that the transient term $C_0$ scales as $\exp(O(N \log(N/\delta)))$, which corresponds to the cumulative state growth incurred while exploring potentially destabilizing controllers over $O(N \log(N/\delta))$ episodes. As shown by Chen and Hazan [6], such an exploration penalty is unavoidable when testing unstable controllers. Compared to the fully-observed setting in Li et al. [12], our bound contains an additional $\log(N/\delta)$ factor in the exponent. This reflects the reality that our instability criterion cannot guarantee successful detection in every single episode due to partial observation, thereby necessitating repeated trials to achieve high-confidence identification. Finally, we emphasize that the value of $C_0$ reflects a highly conservative, worst-case scenario where all destabilizing controllers are applied consecutively, which would be exceedingly rare in practice because the UCB selection mechanism heavily favors the matching controller.

Remark 2. The stability guarantees of classical estimator-based supervision are typically given in terms of a time-discounted $L_2$-norm. Specifically, for a forgetting factor $\lambda > 0$, Hespanha [7] showed that there exist constants $C_0, C_1$ such that for any $T > 0$,
$$\sum_{t=1}^{T} e^{-2\lambda(T-t)} \|x_t\|^2 \le C_0 e^{-2\lambda T} + C_1 \sum_{t=1}^{T} e^{-2\lambda(T-t)} \|w_t\|^2.$$
In contrast, our Theorem 3 bounds the undiscounted case ($\lambda = 0$).
This regime is not covered by the classical analysis, as the traditional framework requires a dwell time that scales inversely with $\lambda$. Furthermore, we can provide quantitative bounds for the constants $C_0, C_1$ in terms of the problem parameters, whereas the classical analysis is asymptotic and leaves the explicit values of $C_0$ and $C_1$ unspecified.

5 Conclusion

In this work, we presented a supervisory switching control framework that guarantees finite-time adaptation for unknown partially-observed linear systems with a collection of candidate controllers. Our proposed data-driven method achieves a dimension-free sample complexity of $O(N \log N)$ steps to identify the matching controller while maintaining finite $L_2$-gain, even when some of the candidate controllers risk destabilizing the underlying plant.

Looking forward, a promising direction for future research involves expanding the capacity of our evaluation criteria. Currently, our criteria only evaluate the model associated with the active controller. Future algorithmic extensions could significantly strengthen both safety and performance guarantees by simultaneously cross-evaluating the observed trajectory against all candidate models within a single episode.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.

[2] A. Pedro Aguiar and Joao P. Hespanha. Trajectory-tracking and path-following of underactuated autonomous vehicles with parametric modeling uncertainty. IEEE Transactions on Automatic Control, 52(8):1362–1379, 2007.

[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[4] Ainesh Bakshi, Allen Liu, Ankur Moitra, and Morris Yau. A new approach to learning linear dynamical systems.
In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 335–348, 2023.

[5] Michelangelo Bin, Emanuele Crisostomi, Pietro Ferraro, Roderick Murray-Smith, Thomas Parisini, Robert Shorten, and Sebastian Stein. Hysteresis-based supervisory control with application to non-pharmaceutical containment of COVID-19. Annual Reviews in Control, 52:508–522, 2021.

[6] Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.

[7] Joao P. Hespanha. Tutorial on supervisory control. In Lecture Notes for the Workshop "Control using Logic and Switching" at the 40th Conference on Decision and Control, Orlando, Florida, 2001.

[8] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.

[9] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.

[10] Jihun Kim and Javad Lavaei. Online bandit nonlinear control with dynamic batch length and adaptive learning rate. Transactions on Machine Learning Research, 2025.

[11] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

[12] Yingying Li, James A. Preiss, Na Li, Yiheng Lin, Adam Wierman, and Jeff S. Shamma. Online switching control with stability and regret guarantees. In Learning for Dynamics and Control Conference, 2023.

[13] Yingying Li, Jing Yu, Lauren Conger, Taylan Kargin, and Adam Wierman. Learning the uncertainty sets of linear control systems via set membership: A non-asymptotic analysis. In International Conference on Machine Learning, 2024.

[14] Daniel Liberzon. Switching in Systems and Control, volume 190. Springer, 2003.
[15] Lennart Ljung. System identification. In Signal Analysis and Prediction, pages 163–173. Springer, 1998.

[16] Lexuan Meng, Eleonora Riva Sanseverino, Adriana Luna, Tomislav Dragicevic, Juan C. Vasquez, and Josep M. Guerrero. Microgrid supervisory controllers and energy management systems: A literature review. Renewable and Sustainable Energy Reviews, 60:1263–1273, 2016.

[17] Javad Mohammadpour and Carsten W. Scherer. Control of Linear Parameter Varying Systems with Applications. Springer Science & Business Media, 2012.

[18] A. Stephen Morse. Supervisory control of families of linear set-point controllers. Part I: Exact matching. IEEE Transactions on Automatic Control, 41(10):1413–1431, 1996.

[19] A. Stephen Morse. Supervisory control of families of linear set-point controllers. Part II: Robustness. IEEE Transactions on Automatic Control, 42(11):1500–1515, 1997.

[20] Michael Muehlebach, Zhiyu He, and Michael I. Jordan. The sample complexity of online reinforcement learning: A multi-model perspective. In International Conference on Learning Representations, 2026.

[21] Samet Oymak and Necmiye Ozay. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference (ACC), pages 5655–5661. IEEE, 2019.

[22] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. 2013.

[23] Wilson J. Rugh and Jeff S. Shamma. Research on gain scheduling. Automatica, 36(10):1401–1425, 2000.

[24] Tuhin Sarkar, Alexander Rakhlin, and Munther A. Dahleh. Finite time LTI system identification. Journal of Machine Learning Research, 22(26):1–61, 2021.

[25] Jeff S. Shamma and Michael Athans. Gain scheduling: Potential hazards and possible remedies. IEEE Control Systems Magazine, 12(3):101–107, 2002.

[26] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online LQR. In International Conference on Machine Learning, 2020.
[27] Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, pages 439–473, 2018.

[28] Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, pages 2714–2802. PMLR, 2019.

[29] Haoyuan Sun and Ali Jadbabaie. A least-square method for non-asymptotic identification in linear switching control. In 2024 IEEE 63rd Conference on Decision and Control (CDC), pages 2993–2998. IEEE, 2024.

[30] Yi Tian, Kaiqing Zhang, Russ Tedrake, and Suvrit Sra. Can direct latent model learning solve linear quadratic Gaussian control? In Learning for Dynamics and Control Conference, pages 51–63. PMLR, 2023.

[31] Paolo Tilli. Singular values and eigenvalues of non-Hermitian block Toeplitz matrices. Linear Algebra and its Applications, 272(1-3):59–89, 1998.

[32] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[33] Fen Wu. Control of Linear Parameter Varying Systems. University of California, Berkeley, 1995.

[34] Ingvar Ziemann, Anastasios Tsiamis, Bruce Lee, Yassir Jedra, Nikolai Matni, and George J. Pappas. A tutorial on the non-asymptotic theory of system identification. In Conference on Decision and Control (CDC). IEEE, 2023.

A Mathematical Preliminaries

This section collects several technical results that will become useful in our proofs. In this work, we employ a system identification method based on ordinary least squares (OLS). Consider a linear model $y_t = \Theta_\star z_t + r_t$, $t = 1, \ldots, T$, where the sequences of random vectors $\{y_t\}_{t=1}^T$ and $\{z_t\}_{t=1}^T$ are adapted to a filtration $\{\mathcal{F}_t\}_{t \ge 0}$ and the residuals/noise $r_t$ are $\sigma^2$-sub-Gaussian.
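This linear-model setup can be illustrated numerically. The sketch below generates synthetic data with a known parameter and recovers it by ordinary least squares; the dimensions and noise level are hypothetical and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_y, d_z, T = 2, 3, 5000

Theta_star = rng.standard_normal((d_y, d_z))   # true parameter
Z = rng.standard_normal((T, d_z))              # regressors z_t
R = 0.1 * rng.standard_normal((T, d_y))        # small residuals r_t
Y = Z @ Theta_star.T + R                       # y_t = Theta_star z_t + r_t

# OLS estimate: (sum_t y_t z_t^T) V^{-1} with V = sum_t z_t z_t^T.
V = Z.T @ Z
Theta_hat = (Y.T @ Z) @ np.linalg.inv(V)
print(np.linalg.norm(Theta_hat - Theta_star))  # small estimation error
```

With independent regressors and small residuals, the estimation error shrinks as the trajectory length grows, which is the behavior the self-normalized tail bound below quantifies.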
Then, OLS seeks to recover the true parameter $\Theta_\star$ through the following estimate:
$$\hat{\Theta} = \left( \sum_{t=1}^{T} y_t z_t^\top \right) V^{-1}, \quad \text{where } V = \sum_t z_t z_t^\top.$$
We can bound the estimation error $\hat{\Theta} - \Theta_\star$ with the self-normalized martingale tail bound.

Proposition 4 (Theorem 1 in [1]). Consider $\{z_t\}_{t=1}^T$ adapted to a filtration $\{\mathcal{F}_t\}_{t \ge 0}$. Let $V = \sum_{t=1}^T z_t z_t^\top$. If the scalar-valued random variable $r_t \mid \mathcal{F}_{t-1}$ is $\sigma^2$-sub-Gaussian, then for any $V_0 \succeq 0$,
$$\left\| \sum_{t=1}^{T} z_t r_t \right\|_{(V+V_0)^{-1}}^2 \le 2\sigma^2 \log\left( \frac{\det(V+V_0)^{1/2} \det(V_0)^{-1/2}}{\delta} \right)$$
with probability $1-\delta$.

We also need a concentration inequality for quadratic functions of Gaussian random variables (e.g., chi-squared distributions).

Proposition 5. Consider a symmetric matrix $M \in \mathbb{S}^{d \times d}$ and a random vector $g \sim \mathcal{N}(0, I_{d \times d})$. Then, for any $\delta \in (0, 1/e)$, we have
$$\Pr\left( g^\top M g > \operatorname{tr}(M) + 2\|M\|_F \sqrt{\log(1/\delta)} + 2\|M\|_{\mathrm{op}} \log(1/\delta) \right) < \delta,$$
$$\Pr\left( g^\top M g < \operatorname{tr}(M) - 2\|M\|_F \sqrt{\log(1/\delta)} \right) < \delta.$$
We note that this is a simplified version of the Hanson-Wright inequality [22], and this particular form can be derived from Lemma 1 of [11] and Proposition 1 of [8].

Next, we employ the following concentration inequality on martingales, which is crucial to the proof of our main result because the outcomes of different episodes are not independent.

Proposition 6 (Corollary 2.20 in [32]). Let $\{D_k\}_{k=1}^{\infty}$ be a martingale difference sequence adapted to a filtration $\{\mathcal{F}_k\}_{k=1}^{\infty}$. If $D_k \in [0,1]$ almost surely for all $k$, then for any $t > 0$,
$$\Pr\left( \frac{1}{n} \left| \sum_{k=1}^{n} D_k \right| \ge t \right) \le 2\exp(-2nt^2).$$

Finally, we introduce the following technical lemma to manipulate inequalities of the form $b \ge a \log b$, which show up when we bound the episode length $\tau$.

Lemma 7. For any positive real numbers $a, b$: if $b \ge 2a\log(2a)$, then $b \ge a\log b$.

Proof. We consider two cases on the value of $a$:

• Case 1: $a < e$.
Then, we note that $\min_b (b - a\log b) = a - a\log a$, which is positive when $a < e$.

• Case 2: $a \ge e$. In this case, $2a\log(2a) > a$. So, for $b \ge 2a\log(2a)$, we have $\frac{\partial}{\partial b}(b - a\log b) = 1 - a/b > 0$. Hence it suffices to check that $b \ge a\log b$ when $b = 2a\log(2a)$:
$$2a\log(2a) - a\log(2a\log(2a)) = 2a\log(2a) - a\log(2a) - a\log\log(2a) = a\log(2a) - a\log\log(2a) \ge a\min_x (x - \log x) = a.$$
So we are done.

B Proof of Proposition 1

In addition to proving Proposition 1, we also present and analyze a variant of Criterion 1 that achieves a dimension-free guarantee. In this variant, we consider unit vectors $u_1, \ldots, u_N$ that are associated with the unstable modes of the destabilizing controllers. The explicit form of these unit vectors will be defined later in the proof.

Criterion 3 Instability detection criterion $S^{(1)}$ (dimension-free variant)
1: Input: an observed trajectory $y_1, \ldots, y_\tau$ of length $\tau$, active controller $j$.
2: Let $O_{[j,j]}$ be the observability matrix of the model $(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})$; compute $\hat{x}_1 = O_{[j,j]}^\dagger [y_\nu, \ldots, y_1]^\top$.
3: Compute the predicted trajectory $\hat{y}_t = C_{[j,j]} A_{[j,j]}^{t-1} \hat{x}_1$.
4: for all unit vectors $u_k$, $k = 1, \ldots, N$ do
5:   if $|u_k^\top(y_\tau - \hat{y}_\tau)| > \sqrt{2\Theta\log(2n/\delta)}$ then return 0
6: return 1

This variant achieves the same guarantee as in Proposition 1, but with a trajectory length of
$$\tau \gtrsim \nu + \log(1+\varepsilon_a)^{-1}\log\left( \frac{\sqrt{\Theta\log(2n/\delta)}}{\varepsilon_a \varepsilon_c \sigma_w} \right).$$

To start the proof, we write out the following input-output relation for the first $\nu$ steps:
$$\begin{bmatrix} y_\nu \\ \vdots \\ y_2 \\ y_1 \end{bmatrix} = O_{[i^\star,j]} x_1 + \begin{bmatrix} T^{(\nu-1)}_{[i^\star,j]} \\ 0_{1\times(\nu-1)} \end{bmatrix} \begin{bmatrix} w_{\nu-1} \\ \vdots \\ w_2 \\ w_1 \end{bmatrix} + \begin{bmatrix} T^{(\nu-1)}_{[i^\star,j]} \\ 0_{1\times(\nu-1)} \end{bmatrix} \begin{bmatrix} B_{[i^\star,j]} u_{\nu-1} \\ \vdots \\ B_{[i^\star,j]} u_2 \\ B_{[i^\star,j]} u_1 \end{bmatrix} + \begin{bmatrix} \eta_\nu \\ \vdots \\ \eta_2 \\ \eta_1 \end{bmatrix}, \quad (3)$$
where we denote $O = [CA^{\nu-1}; \ldots; CA; C]$ as the observability matrix and
$$T^{(k)} = \begin{bmatrix} C & CA & CA^2 & \cdots & CA^{k-1} \\ 0 & C & CA & \cdots & CA^{k-2} \\ 0 & 0 & C & \cdots & CA^{k-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & C \end{bmatrix}$$
as a $k \times k$ block Toeplitz matrix.

Case 1: matching controller ($j = i^\star$). Since $O_{[i^\star,j]} = O_{[j,j]}$, we can multiply both sides of (3) by the pseudo-inverse $O^\dagger_{[j,j]}$ to get
$$\hat{x}_1 = O^\dagger_{[j,j]} \begin{bmatrix} y_\nu \\ \vdots \\ y_2 \\ y_1 \end{bmatrix} = x_1 + O^\dagger_{[j,j]} \begin{bmatrix} T^{(\nu-1)}_{[j,j]} \\ 0_{1\times(\nu-1)} \end{bmatrix} \begin{bmatrix} w_{\nu-1} \\ \vdots \\ w_1 \end{bmatrix} + O^\dagger_{[j,j]} \begin{bmatrix} T^{(\nu-1)}_{[j,j]} \\ 0_{1\times(\nu-1)} \end{bmatrix} \begin{bmatrix} B_{[j,j]} u_{\nu-1} \\ \vdots \\ B_{[j,j]} u_1 \end{bmatrix} + O^\dagger_{[i^\star,j]} \begin{bmatrix} \eta_\nu \\ \vdots \\ \eta_1 \end{bmatrix}.$$
Under Assumption 1, the closed-loop system is $\nu$-strictly observable, and we have $\|O^\dagger_{[i^\star,j]}\|_{\mathrm{op}} \le 1/\varepsilon_c$. And since the matching controller is stabilizing, the Toeplitz matrices are bounded by $\|(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})\|_{H_\infty}$. It follows that $x_1 - \hat{x}_1$ is
$$\varepsilon_c^{-1}\left( \|(C_{[j,j]}, A_{[j,j]})\|_{H_\infty}\sigma_w^2 + \|(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})\|_{H_\infty}\sigma_u^2 + \sigma_\eta^2 \right) \quad (4)$$
sub-Gaussian. Then, we recursively expand the system dynamics to get
$$y_t = C_{[i^\star,j]}A^{t-1}_{[i^\star,j]}x_1 + \sum_{s=1}^{t-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}B_{[i^\star,j]}u_{t-s} + \sum_{s=1}^{t-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}w_{t-s} + \eta_t,$$
which means
$$y_\tau - \hat{y}_\tau = C_{[i^\star,j]}A^{\tau-1}_{[i^\star,j]}(x_1 - \hat{x}_1) + \sum_{s=1}^{\tau-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}B_{[i^\star,j]}u_{\tau-s} + \sum_{s=1}^{\tau-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}w_{\tau-s} + \eta_\tau.$$
Therefore, the quantity $y_\tau - \hat{y}_\tau$ is
$$\|(C_{[j,j]}, A_{[j,j]})\|_{H_\infty}\,\varepsilon_c^{-1}\left( \|(C_{[j,j]}, A_{[j,j]})\|_{H_\infty}\sigma_w^2 + \|(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})\|_{H_\infty}\sigma_u^2 + \sigma_\eta^2 \right) + \|(C_{[j,j]}, A_{[j,j]})\|_{H_\infty}\sigma_w^2 + \|(C_{[j,j]}, A_{[j,j]}, B_{[j,j]})\|_{H_\infty}\sigma_u^2 + \sigma_\eta^2$$
sub-Gaussian. If we let
$$\zeta = \max\left\{ \|(C_{[i,i]}, A_{[i,i]})\|_{H_\infty},\; \|(C_{[i,i]}, A_{[i,i]}, B_{[i,i]})\|_{H_\infty} : 1 \le i \le N \right\},$$
we can say that $y_\tau - \hat{y}_\tau$ is $\Theta := (1 + \zeta\varepsilon_c^{-1})(\zeta\sigma_w^2 + \zeta\sigma_u^2 + \sigma_\eta^2)$ sub-Gaussian. By the properties of sub-Gaussian random variables (see Eq. 2.9 in [32]), we have that for any unit vector $u$ and $\lambda > 0$,
$$\Pr\left( |u^\top(y_\tau - \hat{y}_\tau)| \ge \lambda \right) \le 2\exp(-\lambda^2/(2\Theta)). \quad (5)$$

Case 2: active controller $j$ is destabilizing. As in the previous case, we have
$$y_\tau - \hat{y}_\tau = C_{[i^\star,j]}A^{\tau-1}_{[i^\star,j]}(x_1 - \hat{x}_1) + \sum_{s=1}^{\tau-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}B_{[i^\star,j]}u_{\tau-s} + \sum_{s=1}^{\tau-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}w_{\tau-s} + \eta_\tau.$$
Because $\hat{x}_1$ is independent of the noise at time $\nu+1$, we can express the difference as
$$\Delta := y_\tau - \hat{y}_\tau = C_{[i^\star,j]}A^{\tau-\nu-2}_{[i^\star,j]}w_{\nu+1} + \xi,$$
where $\xi$ is a quantity independent of $w_{\nu+1}$. Under Assumption 2, the closed-loop dynamics $A_{[i^\star,j]}$ has spectral radius $\ge 1 + \varepsilon_a$. So, we let the unit vector $q_{i^\star}$ be an eigenvector of $A_{[i^\star,j]}$ corresponding to an eigenvalue $\lambda_{[i^\star,j]}$ of magnitude $\ge 1 + \varepsilon_a$. Then, by strict observability, we have
$$\varepsilon_c \le \|O_{[i^\star,j]}q_{i^\star}\| = \left\| \left[ \lambda^{\nu-1}_{[i^\star,j]}C_{[i^\star,j]}q_{i^\star};\; \ldots;\; \lambda_{[i^\star,j]}C_{[i^\star,j]}q_{i^\star};\; C_{[i^\star,j]}q_{i^\star} \right] \right\| \le \|C_{[i^\star,j]}q_{i^\star}\| \cdot \frac{|\lambda_{[i^\star,j]}|^\nu - 1}{|\lambda_{[i^\star,j]}| - 1} \;\Longrightarrow\; \|C_{[i^\star,j]}q_{i^\star}\| \ge \frac{\varepsilon_c\varepsilon_a}{|\lambda_{[i^\star,j]}|^\nu}.$$
It follows that
$$\left\| C_{[i^\star,j]}A^{\tau-\nu-2}_{[i^\star,j]}q_{i^\star} \right\| \ge |\lambda_{[i^\star,j]}|^{\tau-\nu-2} \cdot \frac{\varepsilon_c\varepsilon_a}{|\lambda_{[i^\star,j]}|^\nu} \ge \varepsilon_c\varepsilon_a(1+\varepsilon_a)^{\tau-2\nu-2}.$$
Now, we can let $(u_{i^\star}, v_{i^\star})$ be the leading singular vectors of $C_{[i^\star,j]}A^{\tau-\nu-2}_{[i^\star,j]}$, corresponding to a top singular value $\ge \varepsilon_c\varepsilon_a(1+\varepsilon_a)^{\tau-2\nu-2}$. Since $w_{\nu+1}$ is Gaussian, we know from the standard normal CDF that $\Pr(|v_{i^\star}^\top w_{\nu+1}| > \sigma_w/4) > 4/5$.
Then we have that
$$\|\Delta\| \ge |u_{i^\star}^\top\Delta| \ge |u_{i^\star}^\top C_{[i^\star,j]}A^{\tau-\nu-2}_{[i^\star,j]}w_{\nu+1}| = \left\|C_{[i^\star,j]}A^{\tau-\nu-2}_{[i^\star,j]}\right\|_{\mathrm{op}} |v_{i^\star}^\top w_{\nu+1}| \ge \varepsilon_c\varepsilon_a(1+\varepsilon_a)^{\tau-2\nu-2}\,|v_{i^\star}^\top w_{\nu+1}| \ge \frac{1}{4}\varepsilon_c\varepsilon_a(1+\varepsilon_a)^{\tau-2\nu-2}\sigma_w \quad \text{w.p. } 2/5.$$
Note that the probability is halved because, by symmetry, there is a 50/50 chance that the signs of $u_{i^\star}^\top\xi$ and $u_{i^\star}^\top C_{[i^\star,j]}A^{\tau-\nu-2}_{[i^\star,j]}w_{\nu+1}$ agree.

We note that for the current active controller $j$, we can pre-compute $(u_i, v_i)$ for every $1 \le i \le N$ that makes $A_{[i,j]}$ unstable (and set the stable ones to 0) by applying an iterative SVD algorithm² to the matrix $C_{[i,j]}(A_{[i,j]}/\lambda_{[i,j]})^{\tau-\nu-2}$. Then, we use these $u_i$'s for the threshold in Criterion 3, so that for at least one of $1 \le i \le N$ we have $|u_i^\top\Delta| \ge \frac{1}{4}\varepsilon_c\varepsilon_a(1+\varepsilon_a)^{\tau-2\nu-2}\sigma_w$ w.p. 2/5.

With the analysis for both cases, we can now complete the proof for both variants.

Variant 1 (Criterion 1). Note that we can bound $\|\Delta\|$ by considering the basis vectors $e_1, \ldots, e_{d_y}$ of $\mathbb{R}^{d_y}$ and measuring $|e_k^\top\Delta|$ for all $k$. Then, after applying a union bound to (5), we have that
$$\Pr\left( |e_k^\top(y_\tau - \hat{y}_\tau)| \le \sqrt{2\Theta\log(2d_y/\delta)} \;\; \forall k \right) \ge 1 - \delta.$$
It follows that
$$\Pr\left( \|y_\tau - \hat{y}_\tau\| \le \sqrt{2d_y\Theta\log(2d_y/\delta)} \right) \ge 1 - \delta.$$
So, this criterion correctly finds the matching controller to be stabilizing with probability $1 - \delta$. Next, if we set
$$\tau = 2 + 2\nu + \log(1+\varepsilon_a)^{-1}\log\left( \frac{4\sqrt{2d_y\Theta\log(2d_y/\delta)}}{\varepsilon_a\varepsilon_c\sigma_w} \right),$$
an unstable $A_{[i^\star,j]}$ results in $\|y_\tau - \hat{y}_\tau\| > \sqrt{2d_y\Theta\log(2d_y/\delta)}$ with probability $\ge 2/5$. So, this criterion correctly detects an unstable $A_{[i^\star,j]}$ with probability at least 2/5.

Variant 2 (Criterion 3). After applying a union bound to (5), we have that
$$\Pr\left( |u_k^\top(y_\tau - \hat{y}_\tau)| \le \sqrt{2\Theta\log(2n/\delta)} \;\; \forall k \right) \ge 1 - \delta.$$
So, this criterion correctly finds the matching controller to be stabilizing with probability $1 - \delta$.
Next, if we set
$$\tau = 2 + 2\nu + \log(1+\varepsilon_a)^{-1}\log\left( \frac{4\sqrt{2\Theta\log(2n/\delta)}}{\varepsilon_a\varepsilon_c\sigma_w} \right),$$
an unstable $A_{[i^\star,j]}$ results in $|u_{i^\star}^\top(y_\tau - \hat{y}_\tau)| > \sqrt{2\Theta\log(2n/\delta)}$ with probability $\ge 2/5$. So, this criterion correctly detects an unstable $A_{[i^\star,j]}$ with probability at least 2/5.

C Proof of Proposition 2

Recall that, in Criterion 2, we defined
$$\tilde{y}_t = y_t - C_{[j,j]}A^{t-1}_{[j,j]}\hat{x}_1 - G_{[j,j]}z_t = (G_{[i^\star,j]} - G_{[j,j]})z_t + r_t.$$
Assumption 3 implies that, for every $k \ne j$, there exist critical directions $(u_k, v_k)$ satisfying
$$u_k^\top(G_{[k,j]} - G_{[j,j]})v_k = \|G_{[k,j]} - G_{[j,j]}\|_{\mathrm{op}} \ge \gamma.$$
Next, we estimate the difference $G_{[i^\star,j]} - G_{[j,j]}$ along these critical directions by performing an OLS estimate from the data points $\{(v_k^\top z_t, u_k^\top\tilde{y}_t)\}_{t=\nu+h+1}^{\tau}$.

² This is typically a modified Lanczos algorithm, such as MATLAB's svds function.

In particular, we define the OLS estimate along a direction $(u, v)$ as
$$\Delta_{u,v} = \left( \sum_{t=\nu+h+1}^{\tau}(u^\top\tilde{y}_t)(v^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v^\top z_t)^2 \right)^{-1} = \left( \sum_{t=\nu+h+1}^{\tau} u^\top\left( (G_{[i^\star,j]} - G_{[j,j]})z_t + r_t \right)(v^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v^\top z_t)^2 \right)^{-1}.$$
Then, this criterion considers the active controller's associated model as matching the true system if $\Delta_{u_k,v_k}$ is small along every critical direction, and it declares non-matching if any one of the $\Delta_{u_k,v_k}$ is large.

Case 1: matching controller ($j = i^\star$). For each critical direction, we have
$$\Delta_{u_j,v_j} = \left( \sum_{t=\nu+h+1}^{\tau}(u_j^\top\tilde{y}_t)(v_j^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v_j^\top z_t)^2 \right)^{-1} = \left( \sum_{t=\nu+h+1}^{\tau}(u_j^\top r_t)(v_j^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v_j^\top z_t)^2 \right)^{-1}.$$
Therefore, it suffices to show that for arbitrary unit vectors $(u, v)$, the quantity
$$\Delta_{u,v} = \left( \sum_{t=\nu+h+1}^{\tau}(u^\top r_t)(v^\top z_t) \right)\underbrace{\left( \sum_{t=\nu+h+1}^{\tau}(v^\top z_t)^2 \right)^{-1}}_{=:\,\Phi^{-1}} \quad (6)$$
satisfies $|\Delta_{u,v}| \le \gamma$ with probability $1 - \delta/N$.
Then, from a union bound, Criterion 2 correctly finds controller $j$ to match the true system with probability $1 - \delta$.

To begin, we first establish that the residual term $r_t$ is sub-Gaussian. In Section 3.2, we derived that
$$y_t = C_{[i^\star,j]}A^h_{[i^\star,j]}x_{t-h} + G_{[i^\star,j]}z_t + \sum_{s=1}^{h}C_{[i^\star,j]}A^s_{[i^\star,j]}w_{t-s-1} + \eta_t.$$
And since
$$x_{t-h} = A^{t-h-1}_{[i^\star,j]}x_1 + \sum_{s=1}^{t-h-1}A^{s-1}_{[i^\star,j]}B_{[i^\star,j]}u_{t-h-s} + \sum_{s=1}^{t-h-1}A^{s-1}_{[i^\star,j]}w_{t-h-s},$$
we can write
$$y_t = C_{[i^\star,j]}A^{t-1}_{[i^\star,j]}x_1 + G_{[i^\star,j]}z_t + \sum_{s=1}^{t-h-1}C_{[i^\star,j]}A^{s+h-1}_{[i^\star,j]}B_{[i^\star,j]}u_{t-h-s} + \sum_{s=1}^{t-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}w_{t-s} + \eta_t.$$
Therefore,
$$r_t = y_t - C_{[i^\star,j]}A^{t-1}_{[i^\star,j]}\hat{x}_1 - G_{[i^\star,j]}z_t = C_{[i^\star,j]}A^{t-1}_{[i^\star,j]}(x_1 - \hat{x}_1) + \sum_{s=1}^{t-h-1}C_{[i^\star,j]}A^{s+h-1}_{[i^\star,j]}B_{[i^\star,j]}u_{t-h-s} + \sum_{s=1}^{t-1}C_{[i^\star,j]}A^{s-1}_{[i^\star,j]}w_{t-s} + \eta_t.$$
Recall from (4) in the previous part that the estimation error on the initial state, $x_1 - \hat{x}_1$, is
$$\varepsilon_c^{-1}\left( \|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]})\|_{H_\infty}\sigma_w^2 + \|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]}, B_{[i^\star,i^\star]})\|_{H_\infty}\sigma_u^2 + \sigma_\eta^2 \right)$$
sub-Gaussian. We conclude that $r_t$ is
$$\|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]})\|_{H_\infty}\,\varepsilon_c^{-1}\left( \|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]})\|_{H_\infty}\sigma_w^2 + \|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]}, B_{[i^\star,i^\star]})\|_{H_\infty}\sigma_u^2 + \sigma_\eta^2 \right) + \|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]})\|_{H_\infty}\sigma_w^2 + \|(C_{[i^\star,i^\star]}, A_{[i^\star,i^\star]}, B_{[i^\star,i^\star]})\|_{H_\infty}\sigma_u^2 + \sigma_\eta^2$$
sub-Gaussian. If we let $\zeta = \max\{\|(C_{[i,i]}, A_{[i,i]})\|_{H_\infty}, \|(C_{[i,i]}, A_{[i,i]}, B_{[i,i]})\|_{H_\infty} : 1 \le i \le N\}$, we can say that $r_t$ is $\sigma_r^2 := (1+\zeta\varepsilon_c^{-1})(\zeta\sigma_w^2 + \zeta\sigma_u^2 + \sigma_\eta^2)$ sub-Gaussian. Additionally, we note that this calculation implies that $z_t$ and $r_t$ are independent.
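The scalar directional OLS estimate (6) can be sketched numerically. The dimensions, directions, and noise level below are hypothetical, chosen only to illustrate that when the residuals are independent of the regressors (the matching-controller case), the directional estimate concentrates near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T = 4, 2000

u = np.array([1.0, 0.0, 0.0, 0.0])  # fixed unit direction u
v = np.array([0.0, 1.0, 0.0, 0.0])  # fixed unit direction v

Z = rng.standard_normal((T, d))            # regressors z_t
Rres = 0.2 * rng.standard_normal((T, d))   # residuals r_t, independent of z_t

# Directional OLS: Delta_{u,v} = (sum (u^T r_t)(v^T z_t)) / (sum (v^T z_t)^2).
num = float((Rres @ u) @ (Z @ v))
Phi = float((Z @ v) @ (Z @ v))
Delta = num / Phi
print(abs(Delta))  # concentrates near zero in the matching-controller case
```

In the non-matching case of the proof, an additional bias term of size at least $\gamma$ is added to this quantity, which is what the criterion thresholds against.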
Then, applying Proposition 4 to the OLS estimate (6) yields that for any $\mu > 0$, with probability $1 - \delta/(4N)$,
$$\frac{\Phi^2\Delta_{u,v}^2}{\Phi + \mu} = \frac{1}{\Phi + \mu}\left( \sum_{t=\nu+h+1}^{\tau}(u^\top r_t)(v^\top z_t) \right)^2 \le 2\sigma_r^2\log\left( \frac{4N\sqrt{\Phi + \mu}}{\delta\sqrt{\mu}} \right). \quad (7)$$
To bound $\Delta_{u,v}$, we need to find both upper and lower bounds for $\Phi$. For convenience, define $\tau' = \tau - \nu - h$.

For the upper bound, we note that $\Phi$ can be written as a quadratic form $g^\top Mg$, where $g \in \mathbb{R}^{\tau d}$ is indexed so that $g_{i,j} = u_i[j]$. We can write
$$\sum_{t=\nu+h+1}^{\tau}(v^\top z_t)^2 = \sum_{t=\nu+h+1}^{\tau}z_t^\top vv^\top z_t = \sum_{t=\nu+h+1}^{\tau}g^\top M_t g,$$
where each $M_t$ has a submatrix equal to $vv^\top$ and all remaining entries zero. Then, by the Hanson-Wright inequality (Proposition 5), we have that for $\lambda > 1$,
$$\Pr\left( \frac{1}{\sigma_u^2}g^\top Mg > \operatorname{tr}(M) + 2\|M\|_F\sqrt{\lambda} + 2\|M\|_{\mathrm{op}}\lambda \right) < \exp(-\lambda), \qquad \Pr\left( \frac{1}{\sigma_u^2}g^\top Mg < \operatorname{tr}(M) - 2\|M\|_F\sqrt{\lambda} \right) < \exp(-\lambda),$$
and in the regime of interest the upper deviation is at most $4\|M\|_F\sqrt{\lambda}$. For the trace, we note that $\operatorname{tr}(M) = \sum_t \operatorname{tr}(M_t) = \tau'$, and the Frobenius norm satisfies $\|M\|_F^2 \le \sum_t \|M_t\|_F^2 = \tau'$. Therefore
$$\Pr\left( \Phi \ge \sigma_u^2(\tau' + 4\sqrt{\tau'\lambda}) \right) \le \exp(-\lambda), \qquad \Pr\left( \Phi < \sigma_u^2(\tau' - 2\sqrt{\tau'\lambda}) \right) \le \exp(-\lambda).$$
Setting $\lambda = \log(4N/\delta)$ in the first line, we have
$$\Pr\left( \Phi \ge \sigma_u^2(3\tau' + 2\log(4N/\delta)) \right) \le \Pr\left( \Phi \ge \sigma_u^2(\tau' + 4\sqrt{\tau'\log(4N/\delta)}) \right) \le \delta/(4N). \quad (8)$$
Setting $\lambda = \tau'/16$ in the second line, we have
$$\Pr\left( \Phi < \frac{\sigma_u^2\tau'}{2} \right) \le \exp\left( -\frac{\tau'}{16} \right). \quad (9)$$
With these bounds in mind, we claim that for any $\mu > 0$,
$$\tau' = \max\left( 16\log(4N/\delta),\; \frac{2\mu}{\sigma_u^2},\; \frac{8\sigma_r^2}{\sigma_u^2\gamma^2}\log\left( \frac{800N^2\sigma_r^2}{\delta^2\gamma^2\mu} \right) \right) \quad (10)$$
ensures that $|\Delta_{u,v}| \le \gamma$ with probability $1 - \delta/N$. First, applying the first two quantities of (10) to (9) yields that $\Phi \ge \mu$ with probability $1 - \delta/(4N)$.
Thus, we have that
$$\Delta_{u,v}^2 \le \frac{\Phi + \mu}{\Phi^2} \cdot 2\sigma_r^2\log\left( \frac{4N\sqrt{\Phi + \mu}}{\delta\sqrt{\mu}} \right) \le \frac{2\sigma_r^2}{\Phi}\log\left( \frac{32N^2\Phi}{\delta^2\mu} \right) \le \frac{2\sigma_r^2}{\Phi}\log\left( \frac{32N^2\sigma_u^2(3\tau' + 2\log(4N/\delta))}{\delta^2\mu} \right) \le \frac{2\sigma_r^2}{\Phi}\log\left( \frac{100N^2\sigma_u^2\tau'}{\delta^2\mu} \right) \quad (11)$$
with probability $1 - 3\delta/(4N)$, where the first inequality follows from (7), the third follows from (8), and the final one follows from the first quantity of (10). Also, from Lemma 7, the third quantity of (10) implies
$$\frac{100N^2\sigma_u^2}{\delta^2\mu}\tau' \ge \frac{100N^2\sigma_u^2}{\delta^2\mu} \cdot \frac{8\sigma_r^2}{\sigma_u^2\gamma^2}\log\left( \frac{100N^2\sigma_u^2}{\delta^2\mu} \cdot \frac{8\sigma_r^2}{\sigma_u^2\gamma^2} \right)$$
$$\Rightarrow \frac{100N^2\sigma_u^2}{\delta^2\mu}\tau' \ge \frac{100N^2\sigma_u^2}{\delta^2\mu} \cdot \frac{4\sigma_r^2}{\sigma_u^2\gamma^2}\log\left( \frac{100N^2\sigma_u^2\tau'}{\delta^2\mu} \right) \;\Leftrightarrow\; \tau' \ge \frac{4\sigma_r^2}{\sigma_u^2\gamma^2}\log\left( \frac{100N^2\sigma_u^2\tau'}{\delta^2\mu} \right).$$
Then, applying this to (9), we have that
$$\Phi \ge \frac{\sigma_u^2\tau'}{2} \ge \frac{2\sigma_r^2}{\gamma^2}\log\left( \frac{100N^2\sigma_u^2\tau'}{\delta^2\mu} \right) \quad (12)$$
with probability $1 - \delta/(4N)$. Therefore, by combining (11) and (12), we conclude that $|\Delta_{u,v}| \le \gamma$ with probability $1 - \delta/N$. Finally, by picking $\mu = \sigma_r^2/\gamma^2$ and union bounding over all $N$ critical directions, we have that when
$$\tau = \nu + h + \max\left( 16\log(4N/\delta),\; \frac{2\sigma_r^2}{\sigma_u^2\gamma^2},\; \frac{8\sigma_r^2}{\sigma_u^2\gamma^2}\log\left( \frac{800N^2}{\delta^2} \right) \right) \asymp \frac{\sigma_r^2}{\sigma_u^2\gamma^2}\log(N/\delta),$$
the quantity $\Delta_{u_k,v_k}$ is small along all critical directions. So, Criterion 2 correctly outputs 1 with probability $1 - \delta$.

Case 2: non-matching controller $j \ne i^\star$. Per the definition of the critical directions, we have
$$u_{i^\star}^\top(G_{[i^\star,j]} - G_{[j,j]})v_{i^\star} = \|G_{[i^\star,j]} - G_{[j,j]}\|_{\mathrm{op}} \ge \gamma.$$
Then we have
$$\Delta_{u_{i^\star},v_{i^\star}} = \left( \sum_{t=\nu+h+1}^{\tau} u_{i^\star}^\top(G_{[i^\star,j]} - G_{[j,j]})z_t(v_{i^\star}^\top z_t) + (u_{i^\star}^\top r_t)(v_{i^\star}^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v_{i^\star}^\top z_t)^2 \right)^{-1}$$
$$= \|G_{[i^\star,j]} - G_{[j,j]}\|_{\mathrm{op}}\left( \sum_{t=\nu+h+1}^{\tau}(v_{i^\star}^\top z_t)^2 \right)\left( \sum_{t=\nu+h+1}^{\tau}(v_{i^\star}^\top z_t)^2 \right)^{-1} + \left( \sum_{t=\nu+h+1}^{\tau}(u_{i^\star}^\top r_t)(v_{i^\star}^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v_{i^\star}^\top z_t)^2 \right)^{-1}$$
$$= \|G_{[i^\star,j]} - G_{[j,j]}\|_{\mathrm{op}} + \left( \sum_{t=\nu+h+1}^{\tau}(u_{i^\star}^\top r_t)(v_{i^\star}^\top z_t) \right)\left( \sum_{t=\nu+h+1}^{\tau}(v_{i^\star}^\top z_t)^2 \right)^{-1} \ge \gamma \quad \text{w.p. } 1/2,$$
where the second line follows from the fact that $(u_{i^\star}, v_{i^\star})$ are the leading singular vectors of $G_{[i^\star,j]} - G_{[j,j]}$, and the last inequality holds because the second term is odd in the $z_t$'s, so it is nonnegative with probability 1/2. Therefore, Criterion 2 correctly finds that controller $j$'s associated model does not match the true system with probability at least 1/2.

D Proof of Theorem 3

We shall prove the two parts of this theorem in order.

Part 1: Sample complexity. Given Propositions 1 and 2, we can instantiate the following guarantees for our evaluation metric:

Corollary 8. Given that each episode has a sufficient length of $\tau = \max(T_1, T_2)$ steps, the metric
$$S(y_1, \ldots, y_\tau; j) = \left\lfloor \tfrac{1}{2}\left( S^{(1)}(y_1, \ldots, y_\tau; j) + S^{(2)}(y_1, \ldots, y_\tau; j) \right) \right\rfloor$$
satisfies:

• If the active controller matches the true system ($i^\star = j$), then $S = 1$ with probability $\ge 14/15$.

• If the active controller destabilizes the true system, then $S = 0$ with probability $\ge 2/5$.

• If the active controller is not matching ($i^\star \ne j$) but stabilizing, then $S = 0$ with probability $\ge 1/2$.

Note that these bounds are instantiated by setting $\delta = 1/30$ in both Propositions. And we emphasize that the constant $\delta$ in the rest of the proof refers to the success probability of Algorithm 1 and is distinct from the $\delta$ in either Proposition.

Recall that for any $1 \le i \le N$ and $q > 0$, we defined the average score to be
$$\bar{s}_i[q] = \frac{1}{q}\sum_{\ell=1}^{q} s_i[\ell].$$
We first utilize martingale concentration inequalities to show that:

• If $i = i^\star$, then $\Pr\left( \bar{s}_i[q] \le \frac{14}{15} - \sqrt{\frac{\alpha}{q}} \right) \le \exp(-2\alpha)$.
• If $i \ne i^\star$, then $\Pr\left( \bar{s}_i[q] \ge \frac{3}{5} + \sqrt{\frac{\alpha}{q}} \right) \le \exp(-2\alpha)$.

To show this claim, we consider each of the three cases in Corollary 8 separately. First, for the matching controller ($i^\star = j$), we denote by $\mathbb{E}_{q-1}[s_{i^\star}[q]]$ the conditional expectation given the initial state of the episode in which controller $i^\star$ is applied for the $q$-th time. By Corollary 8, we have $\mathbb{E}_{q-1}[s_{i^\star}[q]] \ge 14/15$. Then, by the Azuma-Hoeffding inequality (Proposition 6), we can bound
$$\Pr\left( \bar{s}_{i^\star}[q] \le \frac{14}{15} - \sqrt{\frac{\alpha}{q}} \right) \le \Pr\left( \frac{1}{q}\sum_{\ell=1}^{q}\left( s_{i^\star}[\ell] - \mathbb{E}_{\ell-1}[s_{i^\star}[\ell]] \right) \le -\sqrt{\frac{\alpha}{q}} \right) \le \exp\left( -2q\left( \sqrt{\alpha/q} \right)^2 \right) = \exp(-2\alpha).$$
Similarly, when the active controller $j$ destabilizes the underlying plant, we have $\mathbb{E}_{q-1}[s_j[q]] \le 3/5$. Then, by the Azuma-Hoeffding inequality,
$$\Pr\left( \bar{s}_j[q] \ge \frac{3}{5} + \sqrt{\frac{\alpha}{q}} \right) \le \Pr\left( \frac{1}{q}\sum_{\ell=1}^{q}\left( s_j[\ell] - \mathbb{E}_{\ell-1}[s_j[\ell]] \right) \ge \sqrt{\frac{\alpha}{q}} \right) \le \exp(-2\alpha).$$
Finally, by the same logic, the remaining controllers that are non-matching ($i^\star \ne j$) but stabilizing satisfy
$$\Pr\left( \bar{s}_j[q] \ge \frac{3}{5} + \sqrt{\frac{\alpha}{q}} \right) \le \exp(-2\alpha).$$

Next, having established concentration of the average scores, we consider the two algorithmic variants separately.

Variant 1: $a_\ell = \frac{L}{72N}$. We consider the event
$$E = \left\{ \bar{s}_{i^\star}[q] \ge \frac{14}{15} - \sqrt{\frac{a_\ell}{q}} \;\wedge\; \bar{s}_i[q] \le \frac{3}{5} + \sqrt{\frac{a_\ell}{q}} \quad \forall i \ne i^\star,\; 1 \le q \le L \right\}.$$
Under this event $E$, we notice that
$$\bar{s}_{i^\star}[q] + \sqrt{\frac{a_\ell}{q}} \ge \frac{14}{15} \quad \forall q,$$
and for any $i \ne i^\star$, if $q > 36a_\ell$, then
$$\bar{s}_i[q] + \sqrt{\frac{a_\ell}{q}} \le \frac{3}{5} + 2\sqrt{\frac{a_\ell}{q}} < \frac{3}{5} + \frac{1}{3} = \frac{14}{15}.$$
By the selection rule of Algorithm 1, we conclude that any non-matching controller can be selected at most $36a_\ell = \frac{L}{2N}$ times. Therefore, under event $E$, the matching controller is the most frequently selected one and the algorithm succeeds in identification (i.e., $\hat{i} = i^\star$).
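A quick arithmetic sanity check of the selection cap in the first variant, using illustrative numbers only:

```python
# With a_ell = L / (72 N), the exploration-bonus condition caps each
# non-matching controller at 36 * a_ell = L / (2N) selections, so all
# N - 1 of them together take fewer than half of the L episodes.
N, L = 5, 7200
a_ell = L / (72 * N)
cap_per_controller = 36 * a_ell
total_non_matching = (N - 1) * cap_per_controller
print(cap_per_controller, total_non_matching, L / 2)
```

Since the non-matching controllers together account for strictly fewer than $L/2$ selections, the matching controller must be the plurality (indeed majority) choice under the good event $E$.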
It remains to find a sufficiently large number of episodes $L$ so that the event $E$ occurs with probability $1 - \delta$. From our calculations above, the probability of this event is at least
$$\Pr(E) \ge 1 - LN\exp(-2a_\ell) = 1 - LN\exp(-L/36N).$$
Then, according to Lemma 7, we have
$$L \ge 72N\log(72N^2/\delta) \;\Leftrightarrow\; \frac{LN}{\delta} \ge \frac{72N^2}{\delta}\log(72N^2/\delta) \;\Rightarrow\; \frac{LN}{\delta} \ge \frac{36N^2}{\delta}\log(LN/\delta) \;\Leftrightarrow\; \frac{L}{36N} \ge \log(LN/\delta) \;\Leftrightarrow\; LN\exp(-L/36N) \le \delta.$$
Therefore, with $a_\ell = \frac{L}{72N}$ and $L = 72N\log(72N^2/\delta) \in O(N\log(N/\delta))$, Algorithm 1 identifies the matching controller with probability $1 - \delta$.

Variant 2: $a_\ell = \frac{1}{2}\log\left(\frac{\pi^2 N\ell^2}{6\delta}\right)$. We consider the event
$$E = \left\{ \bar{s}_{i^\star}[Q_{i^\star}(\ell)] \ge \frac{14}{15} - \sqrt{\frac{a_\ell}{Q_{i^\star}(\ell)}} \;\wedge\; \bar{s}_i[Q_i(\ell)] \le \frac{3}{5} + \sqrt{\frac{a_\ell}{Q_i(\ell)}} \quad \forall i \ne i^\star,\; 1 \le \ell \le L \right\}.$$
We note that since $Q_i(\ell) \le \ell$ and $a_\ell$ is increasing in $\ell$, we have $E \supseteq \bigcap_{q=1}^{L} E_q$, where
$$E_q = \left\{ \bar{s}_{i^\star}[q] \ge \frac{14}{15} - \sqrt{\frac{a_q}{q}} \;\wedge\; \bar{s}_i[q] \le \frac{3}{5} + \sqrt{\frac{a_q}{q}} \quad \forall i \ne i^\star \right\}.$$
From our calculations above, the probability of each event is at least
$$\Pr(E_q) \ge 1 - N\exp(-2a_q) = 1 - N\exp\left( -\log\left( \frac{\pi^2 q^2 N}{6\delta} \right) \right) = 1 - \frac{6\delta}{\pi^2 q^2}.$$
Then, by the union bound, we have that
$$\Pr(E) \ge 1 - \sum_{q=1}^{L}\frac{6\delta}{\pi^2 q^2} \ge 1 - \frac{6\delta}{\pi^2}\sum_{q=1}^{\infty}q^{-2} = 1 - \delta.$$
Under this event $E$, we notice that
$$\bar{s}_{i^\star}[Q_{i^\star}(\ell)] + \sqrt{\frac{a_\ell}{Q_{i^\star}(\ell)}} \ge \frac{14}{15} \quad \forall \ell,$$
and for any $i \ne i^\star$, if $Q_i(\ell) > 36a_\ell$, then
$$\bar{s}_i[Q_i(\ell)] + \sqrt{\frac{a_\ell}{Q_i(\ell)}} < \frac{3}{5} + \frac{1}{3} = \frac{14}{15}.$$
By the selection rule of Algorithm 1, we conclude that any non-matching controller can be selected at most $36a_L = 18\log\left(\frac{\pi^2 NL^2}{6\delta}\right)$ times. If $L \ge 2N \cdot 36a_L$, then under event $E$, the matching controller is the most frequently selected one and the algorithm succeeds in identification (i.e., $\hat{i} = i^\star$). Finally, we find a value of $L$ satisfying $L \ge 72Na_L = 36N\log\left(\frac{\pi^2 NL^2}{6\delta}\right)$.
Then, according to Lemma 7, we have
$$L \ge 144N\log\left(\frac{24\sqrt{6}\,\pi N^{3/2}}{\sqrt{\delta}}\right) \iff \sqrt{\frac{N}{6\delta}}\,\pi L \ge \frac{144 N^{3/2}\pi}{\sqrt{6\delta}}\log\left(\frac{144 N^{3/2}\pi}{\sqrt{6\delta}}\right)$$
$$\implies \sqrt{\frac{N}{6\delta}}\,\pi L \ge \frac{72 N^{3/2}\pi}{\sqrt{6\delta}}\log\left(\sqrt{\frac{N}{6\delta}}\,\pi L\right) \iff L \ge 72N\log\left(\sqrt{\frac{N}{6\delta}}\,\pi L\right) \iff L \ge 36N\log\left(\frac{\pi^2 N L^2}{6\delta}\right).$$
Therefore, with $a_\ell = \frac{1}{2}\log\left(\frac{\pi^2 N \ell^2}{6\delta}\right)$ and $L = 144N\log\left(\frac{24\sqrt{6}\,\pi N^{3/2}}{\sqrt{\delta}}\right) \in O(N\log(N/\delta))$, Algorithm 1 identifies the matching controller with probability $1-\delta$.

Part 2: System stability. For this part, we consider the two stages of the algorithm. In the first $L$ episodes, the algorithm tests different controllers that are potentially destabilizing. To derive a (conservative) bound for this stage, let us define
$$R_1 := \max_{i,j}\left\|A^{[i,j]}\right\|_{\mathrm{op}}, \qquad R_2 := \max_{i,j}\left\|B^{[i,j]}\right\|_{\mathrm{op}}.$$
For convenience, denote $T' = \tau L$ and let $A_t$ be the closed-loop dynamics at time $t$; specifically, $A_t = A^{[i^\star, i_{\lceil t/\tau\rceil}]}$. Then, we directly expand the closed-loop dynamics:
$$\begin{pmatrix} x_{T'} \\ \vdots \\ x_2 \\ x_1 \end{pmatrix} = \mathcal{T}(T')\begin{pmatrix} w_{T'} \\ \vdots \\ w_2 \\ w_1 \end{pmatrix} + \begin{pmatrix} \mathcal{T}(T'-1) & 0_{1\times(\nu-1)} \end{pmatrix}\begin{pmatrix} B_{T'} u_{T'-1} \\ \vdots \\ B_2 u_1 \\ B_1 u_0 \end{pmatrix},$$
where
$$\mathcal{T}(k) = \begin{pmatrix} I & A_{T'-1} & A_{T'-1}A_{T'-2} & \cdots & A_{T'-1}A_{T'-2}\cdots A_{T'-k+1} \\ 0 & I & A_{T'-2} & \cdots & A_{T'-2}A_{T'-3}\cdots A_{T'-k+1} \\ 0 & 0 & I & \cdots & A_{T'-3}\cdots A_{T'-k+1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \end{pmatrix}$$
is a $k \times k$ block matrix. In particular, we can write
$$\mathcal{T}(k) = \sum_{\ell=1}^{k} H^{\ell-1} D_\ell,$$
where $H$ is a shift matrix (with identity blocks on the first superdiagonal and zeros elsewhere), and $D_\ell$ is the block-diagonal matrix corresponding to the $\ell$-th diagonal of $\mathcal{T}$. Since each block of $D_\ell$ is a product of $\ell - 1$ matrices of operator norm at most $R_1$ (and we may take $R_1 > 1$ after enlarging it if necessary), it follows that
$$\left\|\mathcal{T}(k)\right\|_{\mathrm{op}} \le \sum_{\ell=1}^{k} R_1^{\ell-1} \le \frac{R_1^k}{R_1 - 1}.$$
Therefore,
$$\sum_{t=1}^{T'}\|x_t\|^2 \le \frac{2R_1^{2T'}}{(R_1-1)^2}\sum_{t=1}^{T'}\|w_t\|^2 + \frac{2R_1^{2T'-2}R_2^2}{(R_1-1)^2}\sum_{t=1}^{T'}\|u_t\|^2.$$
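The Variant 2 episode count can also be verified numerically: the sketch below (with arbitrary illustrative pairs $(N, \delta)$, not from the paper) checks that $L = \lceil 144N\log(24\sqrt{6}\,\pi N^{3/2}/\sqrt{\delta})\rceil$ satisfies the implicit requirement $L \ge 36N\log(\pi^2 N L^2/6\delta)$:

```python
import math

def variant2_L(N, delta):
    # explicit episode count L = 144 N log(24*sqrt(6)*pi*N^{3/2}/sqrt(delta))
    return math.ceil(144 * N * math.log(
        24 * math.sqrt(6) * math.pi * N**1.5 / math.sqrt(delta)))

def requirement(N, delta, L):
    # the implicit condition L >= 36 N log(pi^2 N L^2 / (6 delta))
    return L >= 36 * N * math.log(math.pi**2 * N * L**2 / (6 * delta))

ok = [requirement(N, d, variant2_L(N, d))
      for N, d in [(2, 0.5), (10, 0.05), (100, 0.01)]]
print(all(ok))
```

This is the standard pattern of resolving a bound of the form $L \ge a \log(bL^2)$ via a self-bounding lemma (Lemma 7 here): doubling the leading constant buys enough slack to replace $L$ inside the logarithm with an explicit quantity.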
Now, when the UCB portion of the algorithm succeeds in identifying the matching controller, the closed-loop system follows a stable trajectory after time $T'$. Specifically, for constants $\kappa_0, \kappa_1$ associated with the matched closed-loop dynamics $A^{[i^\star,i^\star]}$, we have
$$\sum_{t=T'+1}^{T}\|x_t\|^2 \le \kappa_0\|x_{T'}\|^2 + \kappa_1\sum_{t=T'+1}^{T}\left(\|w_t\|^2 + \|u_t\|^2\right).$$
In conclusion, we can choose
$$C_0 = 2\kappa_0\max\left\{\frac{R_1^{2T'}}{(R_1-1)^2},\ \frac{R_1^{2T'-2}R_2^2}{(R_1-1)^2}\right\}, \qquad C_1 = \kappa_1.$$
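The key step in the stage-one bound is the operator-norm estimate $\|\mathcal{T}(k)\|_{\mathrm{op}} \le R_1^k/(R_1-1)$ for the block upper-triangular matrix of cumulative products. A minimal numerical sketch, with arbitrary block size, horizon, and random factor matrices rescaled so that each has operator norm exactly $R_1 > 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, R1 = 3, 6, 1.5  # block size, number of blocks, norm cap (all arbitrary)

# random factor matrices A_1, ..., A_{k-1}, each rescaled to operator norm R1
factors = []
for _ in range(k - 1):
    M = rng.standard_normal((n, n))
    factors.append(R1 * M / np.linalg.norm(M, 2))

# build the block upper-triangular matrix T(k): identity blocks on the
# diagonal; block (r, c) with c > r is the cumulative product
# factors[r] @ factors[r+1] @ ... @ factors[c-1]
T = np.zeros((k * n, k * n))
for r in range(k):
    P = np.eye(n)
    for c in range(r, k):
        T[r*n:(r+1)*n, c*n:(c+1)*n] = P
        if c < k - 1:
            P = P @ factors[c]

op_norm = np.linalg.norm(T, 2)   # largest singular value
bound = R1**k / (R1 - 1)
print(op_norm <= bound)
```

The bound holds for any choice of factors because each of the $k$ block diagonals has norm at most $R_1^{\ell-1}$, and the triangle inequality over the diagonal decomposition gives the geometric sum.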
