Dual Control with Active Learning using Gaussian Process Regression


Authors: Tansu Alpcan

Tansu Alpcan
Technical University Berlin
Deutsche Telekom Laboratories
alpcan@sec.t-labs.tu-berlin.de

Abstract

In many real world problems, control decisions have to be made with limited information. The controller may have no a priori (or even a posteriori) data on the nonlinear system, except for a limited number of points that are obtained over time. This is either due to the high cost of observation or the highly non-stationary nature of the system. The resulting conflict between information collection (identification, exploration) and control (optimization, exploitation) necessitates an active learning approach for iteratively selecting the control actions which concurrently provide the data points for system identification. This paper presents a dual control approach where the information acquired at each control step is quantified using the entropy measure from information theory and serves as the training input to a state-of-the-art Gaussian process regression (Bayesian learning) method. The explicit quantification of the information obtained from each data point allows for iterative optimization of both identification and control objectives. The approach developed is illustrated with two examples: control of the logistic map as a chaotic system and position control of a cart with inverted pendulum.

1 Introduction

In many real world problems, control decisions have to be made with limited information. Obtaining extensive and accurate information about the controlled system can often be a costly and time consuming process. In some cases, acquiring detailed information on system characteristics may be simply infeasible due to high observation costs.
In others, the observed system may be so nonstationary that by the time the information is obtained, it is already outdated due to the system's fast-changing nature. Therefore, the only option left to the controller is to develop a strategy for collecting information efficiently and choose a model to estimate the "missing portions" of the system in order to control it according to a given objective.

A variant of this problem has been well-known in the control literature since the 1960s as dual control. The underlying concept in dual control is obtaining good process information through perturbation while controlling it. The controller necessarily has dual goals. First, the controller must control the process as well as possible. Second, the controller must inject a probing signal or perturbation to get more information about the process. By gaining more process information, better control can be achieved in the future [20].

The problem considered here differs from the classical dual control problem in the very limited amount of information available to the controller. The controller here cannot aim to identify the system first to obtain better performance in the future, due to non-stationarity and/or prohibitive observation costs. Furthermore, the perturbation idea is not fully applicable, since each action-observation pair provides a single data point for identifying the nonlinear discrete-time system, unlike in the identification of (linear) continuous-time systems.

This paper approaches the "dual control" problem from a Bayesian perspective. Gaussian processes (GP) are utilized as a state-of-the-art regression (function estimation) method for identifying the underlying state-space equations of the discrete-time nonlinear system from observed (training) data.
More importantly, the adopted GP (Bayesian) framework allows explicit quantification of the information which each observed data point provides within the a priori chosen model. Hence, the information collection goal can be explicitly combined with the control objectives and posed as a (weighted-sum, multi-objective) optimization problem based on one (or multi-) step lookahead. This results in a joint and iterative scheme of active learning and control.

The proposed approach consists of three main parts: observation, update of the GP for regression, and optimization to determine the next control action. These three steps, shown in Figure 1, are taken iteratively to achieve the dual objectives of identification and control.

Figure 1: The underlying model of the dual control approach.

Observations, given that they are a scarce resource in the class of problems considered, play an important role in this approach. Uncertainties in the observed quantities can be modeled as additive noise. Likewise, properties (variance or bias) of the additive noise can be used to model the reliability of (and bias in) the observed data points. GPs provide a straightforward mathematical structure for incorporating these aspects into the model under some simplifying assumptions.

The set of observations collected provides the (supervised) training data for GP regression in order to estimate the characteristics of the function or system at hand. This process relies on the GP methods, which will be described in Subsection 2.1. Thus, at each iteration an up-to-date description of the function or system is obtained based on the latest observations.

The final step of the approach provides a basis for determining the next control action based on an optimization process that takes into account dual objectives.
The information measurement aspect of these objectives will be discussed in Subsection 2.2. An important issue here is the fact that there are infinitely many candidate points in this optimization process, but in practice only a finite collection of them can be evaluated.

The investigated approach incorporates many concepts that have been implicitly considered by heuristic schemes, and builds upon results from seemingly disjoint but relevant fields such as information theory, machine learning, optimization, and control theory. Specifically, it combines concepts from these fields by

• explicitly quantifying the information acquired using the entropy measure from information theory,

• modeling and estimating the (nonlinear) controlled system adopting a Bayesian approach and using Gaussian processes as a state-of-the-art regression method,

• using an iterative scheme for observation, learning, and control,

• capturing all of these aspects under the umbrella of a multi-objective "meta" optimization and control formulation.

Although methods and approaches from machine (statistical) learning are heavily utilized in this framework, the problem at hand is very different from many classical machine learning ones, even in its learning aspect. In most classical application domains of machine learning, such as data mining, computer vision, or image and voice recognition, the difficulty is often in handling a significant amount of data, in contrast to a lack of it. Many methods such as Expectation-Maximization (EM) inherently make this assumption, with the exception of "active learning" schemes [3]. Information theory plays an important role in evaluating scarce (and expensive) data and developing strategies for obtaining it. Interestingly, data scarcity at the same time converts the disadvantages of some methods into advantages, e.g. the scalability problem of Gaussian processes.
It is worth noting that the class of problems described here is much more frequently encountered in practice than it may first seem. Social systems and economics, where information is scarce and systems are very non-stationary by nature, constitute an important application domain. The control framework proposed is further applicable to a wide variety of fields due to its fundamentally adaptive nature. One example is decentralized resource allocation decisions in networked and complex systems, e.g. wired and wireless networks, where parameters change quickly and global information on network characteristics is not available at the local decision-making nodes. Another example is security and information technology risk management in large-scale organizations, where acquiring information on individual subsystems and processes can be very costly. Yet another example application is in biological systems, where individual organisms or subsystems operate autonomously (even if they are part of a larger system) under limited local information.

2 Methodology

This section summarizes the results in [2] and presents the underlying methods that are utilized within the dual control framework. First, the regression model and Gaussian Processes (GP) are presented. Subsequently, modeling and measurement of information is discussed using (Shannon) information theory.

2.1 Regression and Gaussian Processes (GP)

The system identification problem here involves inferring the nonlinear function(s) $f$ in the state-space equations describing the system using the set of observed data points. This is known as regression in the machine learning literature, which is a supervised learning method since the data observed here is at the same time the training data.
This learning process involves selection of a "model", where the learned function $\hat{f}$ is, for example, expressed in terms of a set of parameters and specific basis functions. Gaussian processes (GP) provide a nonparametric alternative to this but follow the same idea in spirit.

The main goal of regression involves a trade-off. On the one hand, it tries to minimize the observed error between $f$ and $\hat{f}$. On the other, it tries to infer the "real" shape of $f$ and make good estimates using $\hat{f}$ even at unobserved points (generalization). If the former is overly emphasized, then one ends up with "overfitting", which means $\hat{f}$ follows $f$ closely at observed points but has weak predictive value at unobserved ones. This delicate balance is usually achieved by balancing the prior "beliefs" on the nature of the function, captured by the model (basis functions), and fitting the model to the observed data.

This paper focuses on Gaussian Processes [11] as the chosen regression method within the proposed dual control approach, without loss of any generality. There are multiple reasons behind this preference. Firstly, GP provides an elegant mathematical method for easily combining many aspects of the approach. Secondly, being a nonparametric method, GP eliminates any discussion of model degree. Thirdly, it is easy to implement and understand, as it is based on well-known Gaussian probability concepts. Fourthly, noise in observations is immediately taken into account if it is modeled as Gaussian. Finally, one of the main drawbacks of GP, namely being computationally heavy, does not really apply to the problem at hand, since the amount of data available is already very limited.

It is not possible to present here a comprehensive treatment of GP. Therefore, a very rudimentary overview is provided next within the context of the control problem.
Consider a set of $M$ data points $D = \{x_1, \ldots, x_M\}$, where each $x_i \in X$ is a $d$-dimensional vector, and the corresponding vector of scalar values is $f(x_i)$, $i = 1, \ldots, M$. Assume that the observations are distorted by a zero-mean Gaussian noise $n \sim \mathcal{N}(0, \sigma)$ with variance $\sigma$. Then, the resulting observation is a Gaussian vector $y = f(x) + n \sim \mathcal{N}(f(x), \sigma)$.

A GP is formally defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [11]. It is completely specified by its mean function $m(x)$ and covariance function $C(x, \tilde{x})$, where
$$m(x) = E[\hat{f}(x)] \quad \text{and} \quad C(x, \tilde{x}) = E[(\hat{f}(x) - m(x))(\hat{f}(\tilde{x}) - m(\tilde{x}))], \quad \forall x, \tilde{x} \in D.$$
Let us for simplicity choose $m(x) = 0$. Then, the GP is characterized entirely by its covariance function $C(x, \tilde{x})$. Since the noise in the observation vector $y$ is also Gaussian, the covariance function can be defined as the sum of a kernel function $Q(x, \tilde{x})$ and the diagonal noise variance,
$$C(x, \tilde{x}) = Q(x, \tilde{x}) + \sigma I, \quad \forall x, \tilde{x} \in D, \qquad (2.1)$$
where $I$ is the identity matrix. While it is possible to choose here any (positive definite) kernel $Q(\cdot, \cdot)$, one classical choice is
$$Q(x, \tilde{x}) = \exp\left(-\frac{1}{2}\|x - \tilde{x}\|^2\right). \qquad (2.2)$$
Note that GP makes use of the well-known kernel trick here by representing an infinite dimensional continuous function using a (finite) set of continuous basis functions and an associated vector of real parameters, in accordance with the representer theorem [12].

The (noisy)¹ training set $(D, y)$ is used to define the corresponding GP, $\mathcal{GP}(0, C(D))$, through the $M \times M$ covariance matrix $C(D) = Q + \sigma I$, where the conditional Gaussian distribution of any point outside the training set, $\bar{y} \in X$, $\bar{y} \notin D$, given the training data $(D, y)$ can be computed as follows. Define the vector
$$k(\bar{x}) = [Q(x_1, \bar{x}), \ldots, Q(x_M, \bar{x})] \qquad (2.3)$$
and the scalar
$$\kappa = Q(\bar{x}, \bar{x}) + \sigma. \qquad (2.4)$$
Then, the conditional distribution $p(\bar{y} \mid y)$ that characterizes the $\mathcal{GP}(0, C)$ is a Gaussian $\mathcal{N}(\hat{f}, v)$ with mean $\hat{f}$ and variance $v$,
$$\hat{f}(\bar{x}) = k^T C^{-1} y \quad \text{and} \quad v(\bar{x}) = \kappa - k^T C^{-1} k. \qquad (2.5)$$

¹The special case of perfect observation without noise is handled the same way as long as the kernel function $Q(\cdot, \cdot)$ is positive definite.

This is a key result that defines GP regression: the mean function $\hat{f}(x)$ of the Gaussian distribution provides a prediction of the function $f(x)$. At the same time, it belongs to the well-defined class $\hat{f} \in F$, which is the set of all possible sample functions of the GP,
$$F := \{\hat{f}(x): X \to \mathbb{R} \text{ such that } \hat{f} \in \mathcal{GP}(0, C(D)), \; \forall D, C\},$$
where $C(D)$ is defined in (2.1) and $\mathcal{GP}$ through (2.3), (2.4), and (2.5) above. Furthermore, the variance function $v(x)$ can be used to measure the uncertainty level of the predictions provided by $\hat{f}$, which will be discussed in the next subsection.

2.2 Quantifying the Information in Observations

Each observation provides a data point to the regression problem (estimating $f$ by constructing $\hat{f}$), as discussed in the previous subsection. Active learning addresses the question of "how to quantify the information obtained and optimize the observation process?". Following the approach discussed in [9, 10], the method here provides a precise answer to this question.

Making any decision on the next (set of) observations in a principled manner necessitates first measuring the information obtained from each observation within the adopted model. It is important to note that the information measure here is dependent on the chosen model. For example, the same observation provides a different amount of information to a random search model than to a GP one.
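The prediction equations (2.1)-(2.5) can be sketched directly in NumPy. The kernel, noise variance, and the sine test function below are illustrative assumptions for this sketch, not values taken from the paper:

```python
import numpy as np

def kernel(x, x_tilde):
    """Squared-exponential kernel Q(x, x~) of equation (2.2)."""
    return np.exp(-0.5 * np.sum((x - x_tilde) ** 2))

def gp_predict(D, y, x_bar, sigma=0.01):
    """Posterior mean f_hat(x_bar) and variance v(x_bar), equation (2.5).

    D:     (M, d) array of training inputs
    y:     (M,) array of noisy observations
    x_bar: (d,) query point outside the training set
    sigma: observation-noise variance (illustrative value)
    """
    M = len(D)
    # Covariance matrix C = Q + sigma*I of equation (2.1)
    C = np.array([[kernel(D[i], D[j]) for j in range(M)] for i in range(M)])
    C += sigma * np.eye(M)
    # Vector k(x_bar) of (2.3) and scalar kappa of (2.4)
    k = np.array([kernel(D[i], x_bar) for i in range(M)])
    kappa = kernel(x_bar, x_bar) + sigma
    C_inv = np.linalg.inv(C)
    f_hat = k @ C_inv @ y          # posterior mean
    v = kappa - k @ C_inv @ k      # posterior variance
    return f_hat, v

# Usage: estimate f(x) = sin(x) from five noisy observations
rng = np.random.default_rng(0)
D = np.linspace(0, np.pi, 5).reshape(-1, 1)
y = np.sin(D).ravel() + 0.01 * rng.standard_normal(5)
f_hat, v = gp_predict(D, y, np.array([1.0]))
```

Near the training data the posterior mean tracks $\sin(x)$ closely and the variance is small; far from the data the variance approaches the prior value $\kappa$, which is exactly the uncertainty signal used in the next subsection.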
Shannon information theory readily provides the necessary mathematical framework for measuring the information content of a variable. Let $p$ be a probability distribution over the set of possible values of a discrete random variable $A$. The entropy of the random variable is given by $H(A) = \sum_i p_i \log_2(1/p_i)$, which quantifies the amount of uncertainty. Then, the information obtained from an observation on the variable, i.e. the reduction in uncertainty, can be quantified simply by taking the difference of its initial and final entropy, $I = H_0 - H_1$.

It is important here to avoid the common conceptual pitfall of equating entropy to information itself, as is sometimes done in the communication theory literature. Since this issue is not of great importance for the class of problems considered in communication theory, it is often ignored. However, the difference is of conceptual importance in this problem.² In this case, (Shannon) information is defined as a measure of the decrease of uncertainty after (each) observation (within a given model).

²See http://www.ccrnp.ncifcrf.gov/~toms/information.is.not.uncertainty.html for a detailed discussion.

To apply this idea to GP, let the zero-mean multivariate Gaussian (normal) probability distribution be denoted as
$$p(x) = \frac{1}{\sqrt{(2\pi)^d |C_p(x)|}} \exp\left(-\frac{1}{2}[x - m]^T C_p(x)^{-1} [x - m]\right), \qquad (2.6)$$
where $x \in X$, $|\cdot|$ is the determinant, $m$ is the mean (vector) as defined in (2.5), and $C_p(x)$ is the covariance matrix as a function of the newly observed point $x \in X$, given by
$$C_p(x) = \begin{bmatrix} C(D) & k(x) \\ k(x)^T & \kappa \end{bmatrix}. \qquad (2.7)$$
Here, the vector $k(x)$ is defined in (2.3) and $\kappa$ in (2.4), respectively. The matrix $C(D)$ is the covariance matrix based on the training data $D$ as defined in (2.1).
The entropy of the multivariate Gaussian distribution (2.6) is [1]
$$H(x) = \frac{d}{2} + \frac{d}{2}\ln(2\pi) + \frac{1}{2}\ln|C_p(x)|,$$
where $d$ is the dimension. Note that this is the entropy of the GP estimate at the point $x$ based on the available data $D$. The aggregate entropy of the function on the region $X$ is given by
$$H_{agg} := \int_{x \in X} \frac{1}{2}\ln|C_p(x)| \, dx. \qquad (2.8)$$
The problem of choosing a new data point $\hat{x}$ such that the information obtained from it within the GP regression model is maximized can be formulated as
$$\hat{x} = \arg\max_{\tilde{x}} I = \arg\max_{\tilde{x}} \int_{x \in X} [H_0 - H_1] \, dx = \arg\min_{\tilde{x}} \int_{x \in X} \frac{1}{2}\ln|C_q(x, \tilde{x})| \, dx, \qquad (2.9)$$
where the integral is computed over all $x \in X$, and the covariance matrix $C_q(x, \tilde{x})$ is defined as
$$C_q(x, \tilde{x}) = \begin{bmatrix} C(D) & k(\tilde{x}) & k(x) \\ k^T(\tilde{x}) & \tilde{\kappa} & Q(x, \tilde{x}) \\ k^T(x) & Q(x, \tilde{x}) & \kappa \end{bmatrix}, \qquad (2.10)$$
and $\tilde{\kappa} = Q(\tilde{x}, \tilde{x}) + \sigma$. Here, $C(D)$ is an $M \times M$ matrix and $C_q$ is an $(M+2) \times (M+2)$ one, whereas $\kappa$ and $Q(x, \tilde{x})$ are scalars and $k$ is an $M \times 1$ vector. This result from [2] is summarized in the following proposition.

Proposition 1. As a maximum information data collection strategy for a Gaussian Process with a covariance matrix $C(D)$, the next observation $\hat{x}$ should be chosen in such a way that
$$\hat{x} = \arg\max_{\tilde{x}} I = \arg\min_{\tilde{x}} \int_{x \in X} \ln|C_q(x, \tilde{x})| \, dx,$$
where $C_q(x, \tilde{x})$ is defined in (2.10).

An Approximate Solution to Information Maximization

When making a decision on the next action through multi-objective optimization, there are (infinitely) many candidate points. A pragmatic solution to the problem of finding solution candidates is to (adaptively) sample the problem domain $X$ to obtain the set
$$\Theta := \{x_1, \ldots, x_T : x_i \in X, \; x_i \notin D, \; \forall i\}$$
that does not overlap with known points. In low (one or two) dimensions, this can be easily achieved through grid sampling methods.
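As a rough illustration, the augmented covariance (2.10) and a sampled version of the log-determinant objective of Proposition 1 can be assembled as follows; the kernel, noise variance, data, and evaluation grid are illustrative assumptions of this sketch:

```python
import numpy as np

def kernel(x, x_tilde):
    # Squared-exponential kernel Q of equation (2.2)
    return np.exp(-0.5 * np.sum((x - x_tilde) ** 2))

def C_q(D, x, x_tilde, sigma=0.01):
    """Assemble the (M+2) x (M+2) covariance of equation (2.10):
    training data D augmented with the candidate x~ and a probe point x."""
    M = len(D)
    C = np.array([[kernel(D[i], D[j]) for j in range(M)] for i in range(M)])
    C += sigma * np.eye(M)                      # C(D) of (2.1)
    k_t = np.array([kernel(D[i], x_tilde) for i in range(M)])
    k_x = np.array([kernel(D[i], x) for i in range(M)])
    kappa_t = kernel(x_tilde, x_tilde) + sigma  # kappa~ of the candidate
    kappa_x = kernel(x, x) + sigma              # kappa of (2.4)
    q = kernel(x, x_tilde)
    top = np.column_stack([C, k_t, k_x])
    mid = np.concatenate([k_t, [kappa_t], [q]])
    bot = np.concatenate([k_x, [q], [kappa_x]])
    return np.vstack([top, mid, bot])

def information_objective(D, x_tilde, grid, sigma=0.01):
    """Grid approximation of the integral of ln|C_q(x, x~)| over x;
    per Proposition 1, the next observation minimizes this value."""
    return sum(np.linalg.slogdet(C_q(D, x, x_tilde, sigma))[1] for x in grid)

# Usage: score one candidate point given two observed inputs
D = np.array([[0.0], [1.0]])
grid = [np.array([g]) for g in np.linspace(-2.0, 2.0, 9)]
score = information_objective(D, np.array([-1.5]), grid)
```

Using `slogdet` rather than `det` keeps the sum numerically stable when the determinants span many orders of magnitude.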
In higher dimensions, (Quasi) Monte Carlo schemes can be utilized. For large problem domains, the current domain of interest $X$ can be defined around the last or most promising observation in such a way that such a sampling is computationally feasible. Likewise, multi-resolution schemes can also be deployed to increase computational efficiency.

Given a set of (candidate) points $\Theta$ sampled from $X$, the result in Proposition 1 can be revisited. The problem in (2.9) is then approximated [15] by
$$\max_{\tilde{x}} I \approx \min_{\tilde{x}} \sum_{x \in \Theta} \ln|C_q(x, \tilde{x})| \;\Rightarrow\; \hat{x} = \arg\min_{\tilde{x} \in \Theta} \prod_{x \in \Theta} |C_q(x, \tilde{x})|, \qquad (2.11)$$
using the monotonicity property of the natural logarithm and the fact that the determinant of a covariance matrix is non-negative. Thus, the following counterpart of Proposition 1 is obtained:

Proposition 2. As an approximately maximum information data collection strategy for a Gaussian Process with a covariance matrix $C(D)$ and given a collection of candidate points $\Theta$, the next observation $\hat{x} \in \Theta$ should be chosen in such a way that
$$\hat{x} = \arg\min_{\tilde{x} \in \Theta} \prod_{x \in \Theta} |C_q(x, \tilde{x})| \approx \arg\max_{\tilde{x} \in \Theta} I,$$
where $C_q(x, \tilde{x})$ is given in (2.10).

Although it is an approximation, finding a solution to the optimization problem in Proposition 2 can still be computationally costly. Therefore, a greedy algorithm is proposed as a computationally simpler alternative. Choosing the maximum-variance $\hat{x}$ as
$$\hat{x} = \arg\max_{\tilde{x} \in \Theta} v(\tilde{x}) \approx \arg\min_{\tilde{x} \in \Theta} \prod_{x \in \Theta} |C_q(x, \tilde{x})|$$
leads to a large (possibly the largest) reduction in $\prod_{x \in \Theta} |C_q(x, \hat{x})|$, and hence provides a rough approximate solution to (2.11) and to the result in Proposition 1. This result from [2] is consistent with widely-known heuristics such as "maximum entropy" or "minimum variance" methods [14], and a variant has been discussed in [9].

Proposition 3.
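The greedy maximum-variance selection can be sketched by reusing the posterior variance of (2.5); the kernel, noise variance, data, and candidate set below are illustrative choices:

```python
import numpy as np

def kernel(x, x_tilde):
    # Squared-exponential kernel Q of equation (2.2)
    return np.exp(-0.5 * np.sum((x - x_tilde) ** 2))

def posterior_variance(D, x_tilde, sigma=0.01):
    """GP posterior variance v(x~) from equation (2.5)."""
    M = len(D)
    C = np.array([[kernel(D[i], D[j]) for j in range(M)] for i in range(M)])
    C += sigma * np.eye(M)
    k = np.array([kernel(D[i], x_tilde) for i in range(M)])
    return kernel(x_tilde, x_tilde) + sigma - k @ np.linalg.solve(C, k)

def next_observation(D, Theta, sigma=0.01):
    """Greedy active-learning step: the candidate in Theta with the
    maximum posterior variance."""
    variances = [posterior_variance(D, x, sigma) for x in Theta]
    return Theta[int(np.argmax(variances))]

# With data at 0 and 1, the candidate farthest from the data is selected
D = np.array([[0.0], [1.0]])
Theta = [np.array([0.5]), np.array([2.0]), np.array([4.0])]
x_next = next_observation(D, Theta)
```

Each candidate costs one linear solve against $C(D)$, rather than an $(M+2) \times (M+2)$ determinant per $(x, \tilde{x})$ pair, which is the computational saving motivating the greedy scheme.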
Given a Gaussian Process with a covariance matrix $C(D)$ and a collection of candidate points $\Theta$, an approximate solution to the maximum information data collection problem defined in Proposition 1 is to choose the sample point(s) $\tilde{x}$ in such a way that it has (they have) the maximum variance within the set $\Theta$.

3 Dual Control with Limited Information

Consider a nonlinear discrete-time representation of a dynamical system that evolves on a $d$-dimensional state space $X^d \subset \mathbb{R}^d$, steered by control actions chosen from an $e$-dimensional space $U^e \subset \mathbb{R}^e$. Usually, the dimension of the control space is smaller than that of the state space, $e \leq d$. It is assumed here for simplicity that both control and state spaces are nonempty, convex, and compact. The system states evolve according to
$$x_i(t+1) = f_i(x(t), u(t)), \quad i = 1, \ldots, d, \qquad (3.1)$$
where $x(t) \in X^d$, $x_i(t)$ is a scalar, $t = 1, \ldots$ denotes discrete time instances, and each $f_i: X^d \times U^e \to \mathbb{R}$ is a possibly nonlinear function. States of dynamical systems are, however, often not observable. Therefore, define a mapping from the states to observable quantities $y$ as
$$y_j(t) = g_j(x(t)), \quad j = 1, \ldots, \bar{d}, \qquad (3.2)$$
where each $g_j: X^d \to \mathbb{R}$ is a possibly nonlinear function, and $\bar{d} \leq d$.

If nothing is known about the dynamic system defined by (3.1)-(3.2) in the beginning, and there is no observation or system noise, then the system can be simplified to its input-output relationship:
$$y_j(t+1) = g_j\big(f[g^{-1}(y(t)), u(t)]\big) \;\Rightarrow\; y_j(t+1) = h_j(y(t), u(t)), \quad j = 1, \ldots, \bar{d}, \qquad (3.3)$$
where each $h_j: X^{\bar{d}} \times U^e \to \mathbb{R}$ is a possibly nonlinear function. As a simplification, system and observation noise can be modeled as zero-mean Gaussian.³ Thus, a noisy variant of system (3.3) is
$$y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad j = 1, \ldots, \bar{d}, \qquad (3.4)$$
where $n(t) \sim \mathcal{N}(0, \sigma)$ and $\sigma$ is the respective noise variance.

³Biased Gaussian noise can be easily handled by GPs by introducing a mean function, which we omit in this paper for simplicity.

3.1 Problem Formulation

The dual control problem is defined as follows. Consider an unknown nonlinear discrete-time dynamic system, which has a control input and a (partially) observable output that is possibly distorted by noise. The control input may affect the system linearly, which leads to a simpler problem, or its effect may be nonlinear and unknown to the decision maker. The objective of the decision maker is to control the system in such a way that it follows a given reference signal. Each action taken is assumed to be very costly, and the decision maker may only have limited time to satisfy the dual goals of identification and control. What is the best strategy to address this problem?

Based on the discussion above, the described problem can be formulated more concretely. Let $r(t) \in X^{\bar{d}} \; \forall t$ denote the $\bar{d}$-dimensional reference signal. The discrete-time nonlinear system can be modeled using (3.4), where $y(t)$ is the output, $u(t)$ is the control action, and $n(t)$ is the observation noise at time $t$. Then, the following dual control problem is formulated.

Problem 1. [Dual Control under Limited Information] Let a discrete-time system be described by the following input-output relationship:
$$y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad j = 1, \ldots, \bar{d},$$
where $y(t)$ is the $\bar{d}$-dimensional output, $u(t)$ is the $e$-dimensional control action, and $n(t) \sim \mathcal{N}(0, \sigma)$ is a zero-mean Gaussian observation noise with variance $\sigma$ at time $t$. The function $h_j: X^{\bar{d}} \times U^e \to \mathbb{R}$ is possibly nonlinear for all $j$.
Given a $\bar{d}$-dimensional desired reference signal $r(t)$, what is the best control strategy (series of control actions) $\mu(t)$ such that
$$\mu(t) = \arg\min_{u(t)} \|y(t) - r(t)\|, \quad \forall t = 1, \ldots,$$
where $\|\cdot\|$ is a norm quantifying the mismatch between the observed and desired outputs?

If there were more information available on the system, or more time for experimentation, one could have resorted to the rich literature on adaptive and robust control to find a solution. However, Problem 1 differs from the ones in the classical adaptive and robust control literature by the fact that the decision maker starts with zero or very little prior information, and a solution has to be found online while learning the system. This puts special emphasis on observations and on quantifying information using the methods described in Subsection 2.2.

Using GP regression for estimating the system dynamics in (3.4), and Shannon information theory to measure and maximize the amount of information obtained with each observation, a model-based variant of Problem 1 is defined.

Problem 2. [Model-based Control under Limited Information] Let a discrete-time dynamic system be described by the following input-output relationship:
$$y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad j = 1, \ldots, \bar{d},$$
where $y(t)$ is the $\bar{d}$-dimensional output, $u(t)$ is the $e$-dimensional control action, and $n(t) \sim \mathcal{N}(0, \sigma)$ is a zero-mean Gaussian observation noise with variance $\sigma$ at time $t$. The function $h_j: X^{\bar{d}} \times U^e \to \mathbb{R}$ is possibly nonlinear for all $j$. The goal is to control the system in such a way that the output $y(t)$ follows a given $\bar{d}$-dimensional reference signal $r(t)$. Let $\hat{h}$ be an estimate of the system dynamics $h$ based on an a priori model and a set of observations.
What is the best control strategy (series of control actions) $\mu(t)$ that solves the multi-objective problem with the following components?

• Objective 1: $\min_{u \in U} \|y(t) - r(t)\|$

• Objective 2: $\max_{u \in U} I(\hat{h}, u(t))$

The main (first) objective of Problem 2 is naturally the same as that of Problem 1. The second objective states the "exploration" or information collection aspect. As a side note, unlike in the static optimization problem in [2], how closely the estimated system dynamics $\hat{h}$ approximate the original ones is not set as an objective. The reason behind this is the fact that the data points used for identifying $\hat{h}$ can only be selected indirectly through control actions $u$. Therefore, a reasonably complete identification of the system dynamics may be too costly. A partial identification relevant to the main objective is sufficient for the purpose here.

3.2 Solution Approach

The solution approach to Problem 2 utilizes the methodology in Section 2. The GP variance maximization approximates here the information maximization objective. A (random or grid-based) sampling scheme is adopted again for evaluating candidate solutions, in this case, a combination of the observed current state and available control actions. A weighted-sum scheme is utilized to combine the two objectives in Problem 2. A visual depiction of the control framework is shown in Figure 2.

Figure 2: The dual control framework for identifying and controlling an unknown dynamic system with limited information.
Since the problem is by its very nature iterative, the best control strategy has to be evaluated at the current state, taking into account newly received information and using the latest update of the estimated system dynamics. As a starting point, a gradient or greedy algorithm is proposed which aims to balance both exploration and exploitation objectives.

Proposition 4. Let a discrete-time dynamic system be described by the following input-output relationship:
$$y_j(t+1) = h_j(y(t), u(t)) + n(t), \quad y(t) \in X^{\bar{d}}, \; u(t) \in U^e, \; j = 1, \ldots, \bar{d},$$
where $n(t) \sim \mathcal{N}(0, \sigma)$ is a zero-mean Gaussian observation noise with variance $\sigma$. Further, let $\Phi$ be a grid-based or randomly sampled set of available control actions $u$ from the control space $U$. Given a reference signal $r(t) \in X^{\bar{d}}$, define the optimization problem
$$\min_{u(t) \in \Phi} J(u, y, r, w) \;\Rightarrow\; \min_{u(t) \in \Phi} w_1 \|\hat{y}(t+1) - r(t)\| - w_2 v(\hat{y}(t+1), u(t)), \qquad (3.5)$$
where $\hat{y}_j(t+1) = \hat{h}_j(y(t), u(t)) + n(t)$ is the next estimated output using a GP based on control $u(t)$, and $v(\hat{y}(t+1), u(t))$ is the variance of the associated Gaussian as defined in (2.5). The solution to this problem,
$$\mu(t) = \arg\min_{u(t) \in \Phi} J(u, y, r, w), \quad t = 1, \ldots,$$
approximates the best control strategy under limited information, and hence approximately solves Problem 2.

A couple of remarks should be made at this point regarding the solution approach presented. Firstly, the approach in Proposition 4 constitutes a greedy one, which aims to solve the problem in the shortest time based on available information and goes in the direction of the steepest gradient (here, of the weighted sum of objectives). The main concern here is whether such an algorithm gets stuck in a local minimum.
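The greedy step of Proposition 4 reduces to a one-dimensional search over the sampled action set $\Phi$. In the sketch below the GP estimate $\hat{h}$ is replaced by a hypothetical stand-in (`gp_predict` and its toy variance model are assumptions for illustration, not the paper's learned dynamics):

```python
import numpy as np

def gp_predict(y, u):
    """Hypothetical stand-in for the GP estimate h_hat(y, u): returns the
    predicted next output and its variance. The toy model below is an
    assumption for illustration only."""
    mean = 0.7 * y + u              # toy dynamics, linear in the action
    var = 0.05 + 0.1 * u ** 2       # variance grows for unexplored actions
    return mean, var

def dual_control_action(y, r, Phi, w1=1.0, w2=0.1):
    """Minimize J(u) = w1*|y_hat(t+1) - r| - w2*v over the sampled action
    set Phi, as in equation (3.5)."""
    costs = []
    for u in Phi:
        y_hat, v = gp_predict(y, u)
        costs.append(w1 * abs(y_hat - r) - w2 * v)
    return Phi[int(np.argmin(costs))]

Phi = np.linspace(-1.0, 1.0, 101)   # grid-sampled control actions
u_star = dual_control_action(y=0.5, r=0.8, Phi=Phi)
```

The negative sign on the variance term rewards actions whose predicted outcome is uncertain, so the selected action sits near the tracking optimum but is nudged toward unexplored regions by the weight $w_2$.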
This issue can be remedied at least partially by putting a higher weight on the information collection objective. Secondly, it is implicitly assumed here that the system at hand is at least partially observable and controllable. It is naturally difficult, if not impossible, to check such properties of an unknown system. Thus, the approach here can also be interpreted as a "best effort" one, which aims to achieve the best performance possible given controllability and observability limitations. A summary of the solution approach discussed above for a specific set of choices is provided by Algorithm 1.

Algorithm 1 Dual Control with Limited Information
Input: Problem domain, GP meta-parameters, objective weights $[w_1, w_2]$, initial data set $D$, reference signal $r$, control actions $\Phi$.
while system is (partially) observable and controllable do
  Estimate the system dynamics (I/O function) $\hat{h}$ using GP.
  Compute the best control action $u \in \Phi$ solving (3.5).
  Compute the variance, $v(y, u)$, of $\hat{h}$ as an estimate of $I(\hat{h})$.
  Update the data set $D$ using the newly observed data point $y$.
end while

4 Examples

4.1 Dual Control of Logistic Map

The logistic map
$$x(n+1) = r x(n)(1 - x(n)),$$
parameterized by the scalar $r$, is a well-known one-dimensional discrete-time nonlinear system, where $n$ denotes the time step or iteration. It is chosen as an illustrative example due to its interesting properties and for visualization purposes. For $r = 3.5$, the logistic map converges to a limit cycle, while it exhibits chaotic behavior for $r = 3.8$, as shown in Figure 3.

Figure 3: Example trajectories of the logistic map for $r = 3.5$ and $r = 3.8$.
Linear Control: First, the logistic map is controlled with additive actions while being identified using the GP method described in Algorithm 1:

x(n+1) = r x(n) (1 − x(n)) + u(n).

The controller knows here that the control is linear (additive) and utilizes this extra knowledge in identifying the system, which simplifies the problem significantly. The system description (input-output relationship) from the perspective of the controller is:

y(n+1) = ĥ(y(n)) + u(n).

The control actions are taken from the finite set Φ = {u_i ∈ [−1, 1] : u_{i+1} = u_i + 0.02, i = 1, ..., 101}. The kernel variance is 0.5 and the weights in the objective function (3.5) are chosen as w1 = w2 = 1. The goal is to stabilize the system at x* = 0.8, which constitutes the constant reference signal. The starting point is x_0 = 0.1.

The control actions and state estimation errors over time (at each step, based on the data points received so far) for r = 3.5, and the corresponding trajectory of the logistic map, are depicted in Figures 4 and 5. Note that in this case the logistic map acts only as a nonlinear system with a limit cycle rather than behaving chaotically. It is observed that approximately the first 10 steps are used by the algorithm to explore or learn the system, after which the trajectory approaches the target. Figure 6 shows the estimated function versus the original mapping for u = 0, as well as one standard deviation around the estimated value. It can be seen that the variance is minimal, i.e. the estimate is best, around the target value.

The same numerical analysis is repeated for r = 3.8, in which case the logistic map behaves chaotically and the task turns from control of an unknown nonlinear system into control of an unknown chaotic system. In this case, the goal is again to stabilize the system at x* = 0.8.
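A compact end-to-end sketch of this linear-control experiment, GP identification of the unforced map plus the greedy rule (3.5), might look as follows. The kernel length scale, noise level, state clipping, and step count are illustrative assumptions; only the kernel variance 0.5, the grid Φ, the weights w1 = w2 = 1, the target 0.8, and the start 0.1 follow the text.

```python
import numpy as np

def kern(a, b, ell=0.1, sf2=0.5):
    # 1-D squared-exponential kernel; sf2 = 0.5 as in the text, ell assumed.
    return sf2 * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def posterior(X, y, xs, noise=1e-4):
    # GP posterior mean and variance of the unforced map h at inputs xs.
    K = kern(X, X) + noise * np.eye(len(X))
    Ks = kern(xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    var = np.diag(kern(xs, xs)) - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, var

r_par, target, x = 3.5, 0.8, 0.1
Phi = np.linspace(-1.0, 1.0, 101)          # the grid from the text, step 0.02
X, Y = [x], [r_par * x * (1 - x)]          # one seed observation of h
for n in range(60):
    Xa, Ya = np.array(X), np.array(Y)
    m_here, _ = posterior(Xa, Ya, np.array([x]))
    x_cand = m_here[0] + Phi               # predicted next state for each u
    _, v_cand = posterior(Xa, Ya, x_cand)  # model uncertainty at those states
    J = np.abs(x_cand - target) - v_cand   # w1 = w2 = 1, eq. (3.5)
    u = float(Phi[int(np.argmin(J))])
    x_next = r_par * x * (1 - x) + u       # true system, unknown to controller
    X.append(x)
    Y.append(x_next - u)                   # additive control: h(x) = x' - u
    x = min(max(x_next, 0.0), 1.0)         # clip state for numerical safety
```

Because the control is known to be additive, each observed transition yields a direct sample of h, and the exploration term steers the trajectory toward states where the estimate of h is still uncertain.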
The control actions and state estimation errors over time (at each step, based on the data points received so far) for r = 3.8, and the corresponding trajectory of the logistic map, are depicted in Figures 7 and 8. Note that the learning process takes longer in this case, possibly due to the chaotic (complex) behavior of the system. Figure 9 shows the estimated function versus the original mapping for u = 1.5.

Figure 4: The control actions and state estimation errors for the logistic map with r = 3.5 and linear control.

Figure 5: The controlled trajectory of the logistic map for r = 3.5 and linear control.

Figure 6: The logistic map and its estimate along with one standard deviation for u = 0 and r = 3.5 after 100 iterations (data points).

Figure 7: The control actions and state estimation errors for the logistic map with r = 3.8 and linear control.

Figure 8: The controlled trajectory of the logistic map for r = 3.8 and linear control.

Figure 9: The logistic map and its estimate along with one standard deviation for u = 1.5 and r = 3.8 after 100 iterations (data points).
Nonlinear and Unknown Control: Next, the logistic map is controlled with actions that affect the system nonlinearly, in a way that is unknown to the controller:

x(n+1) = 3.8 x(n) (1 − x(n)) + cos(u).

The system description (input-output relationship) from the perspective of the controller is:

y(n+1) = ĥ(y(n), u(n)).

Compared to the linear and known control case, this problem is obviously much harder to address. The control actions are taken from the finite set Φ = {u_i ∈ [0, π] : u_{i+1} = u_i + 0.1, i = 1, ..., 32}. The weights in the objective function (3.5) are initially chosen as w1 = 1 and w2 = 0; the exploration weight w2 is then increased gradually to w2 = 40 to improve the system estimate and thereby achieve as good a control performance as possible.

Figures 10, 11, and 12 summarize the obtained results. Since the objective of Algorithm 1 is not only learning the entire system behavior but achieving the control target in a greedy manner, the system is estimated accurately only around the target value. It is observed that the learning process takes longer (about twice as long as in the linear control case) and the control actions are less accurate. It should be kept in mind, however, that concurrently identifying and adaptively controlling a chaotic system with limited information is not an easy task.

4.2 Position Control of a Cart with Inverted Pendulum

The inverted pendulum on a cart is a classic example system for control problems.
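The cart-pendulum model used in this section, equations (4.1)-(4.5) below with the parameter values given in the text, can be sketched as a one-step simulation routine. This is only an illustration of the model, not of the controller; the forward-Euler form follows the printed equations.

```python
import math

T, b, M, L, g, m = 0.05, 12.98, 1.378, 0.325, 9.8, 0.051  # values from the text

def cart_step(x, u):
    # One step of (4.1)-(4.4); x = [position, velocity, angle, angular velocity].
    x1, x2, x3, x4 = x
    s, c = math.sin(x3), math.cos(x3)
    D = M + m * s * s                       # common denominator in (4.2), (4.4)
    x1n = x1 + T * x2
    x2n = x2 + (T / D) * (u + m * L * x4 ** 2 * s - b * x2 - m * g * c * s)
    x3n = x3 + T * x4
    x4n = x4 + (T / (L * D)) * (-u * c + (M + m) * g * s
                                + b * x2 * c - m * L * x4 ** 2 * c * s)
    return [x1n, x2n, x3n, x4n]

state = [0.0, 0.0, 0.6, 0.0]               # starting point used in the text
state = cart_step(state, 0.0)              # with u = 0 the pendulum falls over
```

The measured output is y(n) = x1(n), eq. (4.5); the sign of the b x2 cos(x3) term in (4.4) follows the standard cart-pendulum derivation.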
Figure 10: The control actions and state estimation errors for the logistic map with r = 3.8 and nonlinear control.

Figure 11: The controlled trajectory of the logistic map for r = 3.8 and nonlinear control.

Figure 12: The logistic map and its estimate along with one standard deviation for u = 1.5 and r = 3.8 after 100 iterations (data points).

In this case, the problem is formulated as the position control of the cart with the inverted pendulum, which is defined by the following set of discrete-time nonlinear state-space equations [19, 18]:

x1(n+1) = x1(n) + T x2(n),   (4.1)

x2(n+1) = x2(n) + [T / (M + m sin²(x3(n)))] [u(n) + m L x4(n)² sin(x3(n)) − b x2(n) − m g cos(x3(n)) sin(x3(n))],   (4.2)

x3(n+1) = x3(n) + T x4(n),   (4.3)

x4(n+1) = x4(n) + [T / (L (M + m sin²(x3(n))))] [−u(n) cos(x3(n)) + (M + m) g sin(x3(n)) + b x2(n) cos(x3(n)) − m L x4(n)² cos(x3(n)) sin(x3(n))],   (4.4)

y(n) = x1(n),   (4.5)

where T = 0.05 is the sampling period, y = x1 is the position of the cart, x2 = dx/dt is the cart velocity, x3 = θ is the inverted pendulum angle, and x4 = dθ/dt is the angular velocity. The parameter values are: b = 12.98, M = 1.378, L = 0.325, g = 9.8, and m = 0.051. Further details on this standard model are available in [19, 18].

First, the cart is controlled using a one-step look-ahead strategy with full knowledge from the starting point x = [0, 0, 0.
6, 0], with control actions chosen from the set {u_i ∈ [−10, 10] : u_{i+1} = u_i + 1, i = 1, ..., 21}. The objective is to fix the position of the cart at y* = 0.5. The weights in the objective function (3.5) are w1 = 1 and w2 = 20. The results of this case, shown in Figures 13 and 14, provide a benchmark to compare against.

Next, the cart is controlled using a one-step look-ahead strategy as a black-box system:

y(n+1) = ĥ(ẏ(n), u(n)).

As side information, the controller knows (4.1) but has to estimate (4.2), while (4.3) and (4.4) effectively act as external/unmodeled dynamics. The kernel and noise variances of the GP are chosen as 0.5 and 0.01, respectively. The results obtained using Algorithm 1 are shown in Figures 15 and 16. The performance is satisfactory considering that the trajectory comes within 10% of the target within 30 steps.

Figure 13: The control actions for the cart with inverted pendulum, single-step look-ahead, and full knowledge.

Figure 14: The trajectory of the cart with inverted pendulum controlled with full knowledge and single-step look-ahead.

Figure 15: Dual control of the cart with inverted pendulum and single-step look-ahead.

5 Literature Review

The book [10] provides important and valuable insights into the relationship between information theory, inference, and learning, and discusses measuring the information content of data points using Shannon information. However, focusing mainly on more traditional coding, communication, and machine learning topics, the book does not discuss the type of control problems presented in this paper.
Learning plays an important role in the presented framework, especially regression, which is a classical machine (or statistical) learning method. A very good introduction to the subject can be found in [3]. A complementary and detailed discussion of kernel methods is in [12]. Another relevant topic is Bayesian inference [17, 10], which is at the foundation of the presented framework. In the machine learning literature, Gaussian processes (GPs) are becoming increasingly popular due to their various favorable characteristics. The book [11] presents a comprehensive treatment of GPs. Additional relevant works on the subject include [10, 12, 8], which also discuss GP regression.

Figure 16: The trajectory of the cart with inverted pendulum under dual control with single-step look-ahead estimates.

Gaussian processes have recently been applied to optimization and regression [4] as well as system identification [16]. While the latter mentions active learning [14], neither work discusses explicit information quantification or builds a connection with Shannon information theory. Using GPs for system identification is discussed again in [7], yet again without the information collection aspects. The paper [9] discusses, in a static optimization setting, objective functions which measure the expected informativeness of candidate measurements within a Bayesian learning framework. The subsequent study [13] investigates active learning for GP regression in machine learning applications, using variance as a (heuristic) confidence measure for test point rejection.

Dual control is an old topic which attracted the interest of the research community in the second half of the last century [20].
The article [21] revisits this subject and incorporates information explicitly into the dual control problem, but focuses on the estimation of parameters in a known, linear system. Adopting a different perspective, a dynamic programming approach is presented recently in [5], where an approximate value-function based reinforcement learning algorithm based on GPs, together with an online variant, is presented. An application of GP-based identification and control to an autonomous blimp is discussed in [6].

6 Conclusion

The dual control approach presented in this paper focuses on black-box control with very limited information. The information acquired at each control step is quantified using the entropy measure from information theory and serves as the training input to a state-of-the-art Gaussian process regression (Bayesian learning) method. The quantification of the information obtained from each data point allows for iterative and joint optimization of both identification and control objectives. The results obtained from two illustrative examples, control of the logistic map as a chaotic system and position control of a cart with an inverted pendulum, demonstrate the developed approach.

The dynamic control problem in this paper differs from the static optimization analysis in [2] in multiple ways. One of the main differences is the fact that the system states are now influenced indirectly through control actions. The data points used for identifying the underlying system mapping can only be selected indirectly (unlike in static optimization) and under the constraints imposed by the nature of the "control" in the dynamic system at hand. The presented results should be considered mainly as an initial step.
Future research directions are abundant and include further investigation of the exploration-exploitation trade-off, more elaborate adaptive weighting parameters, and random sampling methods for problems in higher-dimensional spaces. Applications to multi-person decision-making and game theory constitute another interesting future research topic.

Acknowledgement

This work is supported by Deutsche Telekom Laboratories.

References

[1] N. Ahmed and D. Gokhale, "Entropy expressions and their estimators for multivariate distributions," IEEE Transactions on Information Theory, vol. 35, no. 3, pp. 688-692, May 1989.

[2] T. Alpcan, "A framework for optimization under limited information," in 5th Intl. Conf. on Performance Evaluation Methodologies and Tools (ValueTools), ENS, Cachan, France, May 2011.

[3] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[4] P. Boyle, "Gaussian processes for regression and optimisation," Ph.D. dissertation, Victoria University of Wellington, Wellington, New Zealand, 2007. [Online]. Available: http://researcharchive.vuw.ac.nz/handle/10063/421

[5] M. P. Deisenroth, C. E. Rasmussen, and J. Peters, "Gaussian process dynamic programming," Neurocomputing, vol. 72, pp. 1508-1524, March 2009.

[6] J. Ko, D. Klein, D. Fox, and D. Haehnel, "Gaussian processes and reinforcement learning for identification and control of an autonomous blimp," in IEEE Intl. Conf. on Robotics and Automation, April 2007, pp. 742-747.

[7] J. Kocijan, "Gaussian process models for systems identification," in Proc. of 9th Intl. PhD Workshop on Systems and Control: Young Generation Viewpoint, Izola, Slovenia, October 2008.

[8] D. J. C. MacKay, "Introduction to Gaussian processes," in Neural Networks and Machine Learning, ser.
NATO ASI Series, C. M. Bishop, Ed. Kluwer Academic Press, 1998, pp. 133-166.

[9] D. J. C. MacKay, "Information-based objective functions for active data selection," Neural Computation, vol. 4, no. 4, pp. 590-604, 1992. [Online]. Available: http://www.mitpressjournals.org/doi/abs/10.1162/neco.1992.4.4.590

[10] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. [Online]. Available: http://www.inference.phy.cam.ac.uk/mackay/itila/

[11] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[12] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001.

[13] S. Seo, M. Wallat, T. Graepel, and K. Obermayer, "Gaussian process regression: active data selection and test point rejection," in Proc. of IEEE-INNS-ENNS Intl. Joint Conf. on Neural Networks IJCNN 2000, vol. 3, July 2000, pp. 241-246.

[14] B. Settles, "Active learning literature survey," University of Wisconsin-Madison, Computer Sciences Technical Report 1648, 2009.

[15] R. Tempo, G. Calafiore, and F. Dabbene, Randomized Algorithms for Analysis and Control of Uncertain Systems. London, UK: Springer-Verlag, 2005.

[16] K. R. Thompson, "Implementation of Gaussian process models for non-linear system identification," Ph.D. dissertation, University of Glasgow, Glasgow, Scotland, 2009.

[17] M. E. Tipping, "Bayesian inference: An introduction to principles and practice in machine learning," in Advanced Lectures on Machine Learning, 2003, pp. 41-62.

[18] D. Wang and J. Huang, "A neural network-based approximation method for discrete-time nonlinear servomechanism problem," IEEE Trans. on Neural Networks, vol. 12, no. 3, pp. 591-597, May 2001.
[19] D. Wang and J. Huang, "A neural network based method for solving discrete-time nonlinear output regulation problem in sampled-data systems," in Advances in Neural Networks - ISNN 2004, ser. Lecture Notes in Computer Science, F. Yin, J. Wang, and C. Guo, Eds. Springer Berlin / Heidelberg, 2004, vol. 3174. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-28648-6_9

[20] B. Wittenmark, "Adaptive dual control," in Control Systems, Robotics and Automation, Encyclopedia of Life Support Systems (EOLSS), Developed under the auspices of the UNESCO. Oxford, UK: Eolss Publishers, Jan. 2002.

[21] J. J. Yame, "Dual adaptive control of stochastic systems via information theory," in Proc. of 26th IEEE Conf. on Decision and Control CDC, vol. 26, December 1987, pp. 316-320.
