A Framework for Optimization under Limited Information
Tansu Alpcan
Technical University Berlin / Deutsche Telekom Laboratories
alpcan@sec.t-labs.tu-berlin.de

Abstract

In many real world problems, optimization decisions have to be made with limited information. The decision maker may have no a priori or a posteriori data about the often nonconvex objective function, except at a limited number of points that are obtained over time through costly observations. This paper presents an optimization framework that takes into account the information collection (observation), estimation (regression), and optimization (maximization) aspects in a holistic and structured manner. Explicitly quantifying the information acquired at each optimization step using the entropy measure from information theory, the (nonconvex) objective function to be optimized (maximized) is modeled and estimated by adopting a Bayesian approach and using Gaussian processes as a state-of-the-art regression method. The resulting iterative scheme allows the decision maker to solve the problem by expressing preferences for each aspect quantitatively and concurrently.

1 Introduction

In many real world problems, optimization decisions have to be made with limited information. Whether it is a static optimization or dynamic control problem, obtaining detailed and accurate information about the problem or system can often be a costly and time consuming process. In some cases, acquiring extensive information on system characteristics may be simply infeasible. In others, the observed system may be so nonstationary that by the time the information is obtained, it is already outdated due to the system's fast-changing nature.
Therefore, the only option left to the decision maker is to develop a strategy for collecting information efficiently and to choose a model to estimate the "missing portions" of the problem, in order to solve it satisfactorily and according to a given objective.

To make the discussion more concrete, consider the problem of maximizing a (Lipschitz) continuous nonconvex objective function, which is unknown except for its value at only a small number of data points. The decision maker may have no a priori information about the function and start with zero data points. Furthermore, only a limited number of (possibly noisy) observations may be available before making a decision on the maximum value and its location. The function itself, however, remains unknown even after the decision is made. What is the best strategy to address this problem?

The decision making framework presented in this paper captures the posed problem by taking into account the information collection (observation), estimation (regression), and (multi-objective) optimization aspects in a holistic and structured manner. Hence, the framework enables the decision maker to solve the problem by expressing preferences for each aspect quantitatively and concurrently. It explicitly incorporates many concepts that have been implicitly considered by heuristic schemes, and builds upon many results from seemingly disjoint but relevant fields such as information theory, machine learning, and optimization and control theories.
Specifically, it combines concepts from these fields by

• explicitly quantifying the information acquired using the entropy measure from information theory,

• modeling and estimating the (nonconvex) function or (nonlinear) system by adopting a Bayesian approach and using Gaussian processes as a state-of-the-art regression method,

• using an iterative scheme for observation, learning, and optimization,

• capturing all of these aspects under the umbrella of a multi-objective "meta" optimization formulation.

Although methods and approaches from machine (statistical) learning are heavily utilized in this framework, the problem at hand is very different from many classical machine learning problems, even in its learning aspect. In most classical application domains of machine learning, such as data mining, computer vision, or image and voice recognition, the difficulty is often in handling a significant amount of data, in contrast to a lack of it. Many methods, such as Expectation-Maximization (EM), inherently make this assumption, with the exception of "active learning" schemes [3]. Information theory plays an important role in evaluating scarce (and expensive) data and in developing strategies for obtaining it. Interestingly, data scarcity at the same time converts the disadvantages of some methods into advantages, e.g. the scalability problem of Gaussian processes.

It is worth noting that the class of problems described here is encountered in practice much more frequently than it may first seem. For example, the class of black-box methods known as "kriging" [10] has been applied to such problems in geology and mining, as well as in hydrology, since the mid-1960s. In addition, the proposed solution framework is applicable to a wide variety of fields due to its fundamental nature.
One example is decentralized resource allocation decisions in networked and complex systems, e.g. wired and wireless networks, where parameters change quickly and global information on network characteristics is not available at the local decision-making nodes. Another example is security-related decisions, where opponents spend a conscious effort to hide their actions. A related area is security and information technology risk management in large-scale organizations, where acquiring information on individual subsystems and processes can be very costly. Yet another example application is in biological systems, where individual organisms or subsystems operate autonomously (even if they are part of a larger system) under limited local information.

2 Problem Definition and Approach

A concrete definition of the motivating problem mentioned in the introduction is helpful for describing the multiple aspects of the limited information decision making framework. Without any loss of generality, let X ⊆ Ψ ⊂ R^d be a nonempty, convex, and compact (closed and bounded) subset of the original problem domain Ψ of d dimensions. The original domain Ψ does not have to be convex, compact, or even fully known. However, adopting a "divide and conquer" approach, the subset X provides a reasonable starting point.

Define next the objective function to be maximized, f : X → R, which is unknown except at a finite number of (possibly imperfectly) observed points. As a simplifying assumption, let f be Lipschitz continuous on X. One of the main distinguishing characteristics of this problem is the limitation on the set of observations

Ω_n := { x_1, ..., x_n : x_i ∈ X ∀i, n ≥ 1 },

due to the cost of obtaining information or non-stationarity of the underlying system.
Assume for now that the cost of observing the value of the objective function f(x) is the same for any x ∈ X. Then, a basic search problem is defined as follows:

Problem 1 (Basic Search Problem) Consider a Lipschitz-continuous objective function f : X → R on the d-dimensional nonempty, convex, and compact set X ⊂ R^d. The function is unknown except at a finite number of observed data points. What is the best search strategy

Ω_N := { x_1, ..., x_N : x_i ∈ X ∀i, N ≥ 1 }

that solves max_{Ω_N} f(x), for a given N?

The number of observations, N, in Problem 1 may be imposed by the nature of the specific application domain. In many problems where there is no time constraint, adopting an iterative (one-by-one) approach, and hence choosing N = 1, is clearly beneficial, as it allows the usage of incoming new information at each step. Alternatively, the assumption of equal observation costs can be relaxed and formulated as a constraint

∑_{x ∈ Ω_n} c_o(x) ≤ C,

where c_o(x) : X → R is the observation cost function and the scalar C is the total "exploration budget". It is also possible to define this cost iteratively based on the (distance from the) previous observation, e.g. c_o(x_n, x_{n−1}). In such cases, a location-based iterative search scheme can be considered.

The simplest (both conceptually and computationally) strategy to solve Problem 1 is random search on the domain X. As such, no attempt is made to "learn" the properties of the function f. Unless f is "algorithmically random" [14], which is rarely the case, this strategy wastes the information collected on f. A slightly more complicated and very popular set of strategies combines random search with simple modeling of the function through gradient methods.
In this case, the collected information is used to model f rudimentarily, using derived gradients to "define slopes" in a heuristic manner. Then, these slopes of f are explored step-by-step in the upwards direction to find a local maximum, after which the search algorithm randomly jumps to another location. It is also possible to randomize the gradient climbing scheme for additional flexibility [24].

The framework presented in this paper takes one further step and explicitly models the (entire) objective function f (on the set X) using the information collected, instead of heuristically describing only the slopes. The function f̂, which models, approximates, and estimates f, belongs to a certain class of functions, such that f̂ ∈ F. The selection and properties of this class are based on the "a priori" information available and can be interpreted as the "world view" of the decision maker. These properties can often be expressed using meta-parameters, which are then updated based on the observations through a separate optimization process. Likewise, a slower time-scale process can be used for model selection, if processing capabilities permit a multi-model approach.

This model-based search process, which lies at the center of the framework, is fundamentally a manifestation of the Bayesian approach [18]. It first imposes explicit and a priori modeling assumptions by choosing f̂ from a certain class of functions, F, and then infers (learns, updates) f̂ in a structured manner as more information becomes available through observations.

From a computational point of view, the decision making framework with limited information lies at one end of the computation vs. observation spectrum, while random search is at the opposite end.
The framework tries to utilize each piece of information to the maximum possible extent, almost regardless of the computational cost. The underlying assumption here is: observation is very costly, whereas computation is rather cheap. This assumption is not only valid for a wide variety of problems from different fields, ranging from networking and security to economics and risk management, but is also inspired by biological systems. In many biological organisms, from single cells to human beings, operating close to this end of the computation-observation spectrum is more advantageous than doing random search.

When doing random search on the domain X, at each stage, i.e. given the previous observations, each remaining candidate data point provides an equivalent amount of information. However, this is not the case when doing model-based search. Depending on the model adopted and the previous information collected, different unexplored points provide different amounts of information. This information can be exactly quantified using the definitions of entropy and information from the field of (Shannon) information theory. Accordingly, the scalar quantity I(f̂, Ω_n) denotes the aggregate information obtained from the set of observations Ω_n within the model represented by f̂. A related issue is the reliability and possibly noisy nature of observations, which will be discussed in further detail in the next section. An extension of Problem 1 that captures the aspects discussed above is defined next.

Problem 2 (Model-based Search Problem) Let f : X → R be an objective function on the d-dimensional nonempty, convex, and compact set X ⊂ R^d, which is unknown except at a finite number of observed data points. Further, let f̂(x) be an estimate of the objective function obtained using an a priori model and observed data.
What is the best search strategy

Ω_N := { x_1, ..., x_N : x_i ∈ X ∀i, N ≥ 1 }

that solves the multi-objective problem with the following components?

• Objective 1: max_{Ω_N} f(x), given f̂(x)

• Objective 2: arg min_{Ω_N} R( f(x), f̂(x) ), f̂ ∈ F

• Objective 3: max_{Ω_N} I( f̂, Ω_n )

Here, R(·,·) is a risk or expected loss function quantifying the mismatch between the actual and estimated functions on the observation data [23]. The scalar quantity I is the aggregate information obtained from the set of observations Ω_N within the model represented by f̂. The cardinality N of Ω_N can either be given, e.g. N = 1, or defined through an additional constraint ∑_{x ∈ Ω_N} c_o(x) ≤ C, where c_o(x) : X → R is the observation cost function and the scalar C is the total "exploration budget".

Figure 1: The three fundamental aspects of decision making with limited information.

It is important to observe here that the three objectives defined in Problem 2 are (almost) independent from and orthogonal to each other, despite being closely related. Objective 1 purely aims to maximize the unknown objective function f using the best estimate (model) f̂. Objective 2 focuses on minimizing the error between the estimate f̂ and the real unknown function f, based on the observations made. Objective 3 tries to maximize the amount of information provided by each (costly) observation or experiment. It is worth noting that Objective 3 is formulated independently from Objective 2; in other words, exploration is done independently from estimation. In contrast, ensuring a balance between Objectives 1 and 2 is necessary to ensure that the solution is robust. These objectives and the fundamental aspects of decision making with limited information are visually depicted in Figure 1.
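Although the framework keeps the three objectives separate, a common concrete way to trade off Objective 1 (exploitation of the current estimate) against Objective 3 (information gain) in GP-based search is an upper-confidence-bound scalarization, score(x) = μ(x) + β σ(x). The sketch below is purely illustrative: the UCB rule, the weight β, and the toy numbers are standard acquisition-function machinery, not this paper's formulation.

```python
import numpy as np

def ucb_select(mu, sigma, beta=2.0):
    """Pick the candidate maximizing an exploitation + exploration score.

    mu[i]    : posterior mean of the objective at candidate i (Objective 1)
    sigma[i] : posterior standard deviation, a proxy for the information
               an observation at i would provide (Objective 3)
    beta     : preference weight between the two aspects
    """
    return int(np.argmax(mu + beta * sigma))

# Toy posterior over five candidate points (illustrative numbers only).
mu = np.array([0.9, 1.0, 0.2, 0.5, 0.8])
sigma = np.array([0.1, 0.0, 1.0, 0.7, 0.3])

best = ucb_select(mu, sigma)   # index 2: modest mean but high uncertainty
```

A larger β expresses a stronger preference for exploration; β = 0 reduces the rule to pure exploitation of the current estimate.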
Table 1: Fundamental Trade-offs

    Exploration   versus   Exploitation
    Observation   versus   Computation
    Robustness    versus   Optimization

There are multiple trade-offs inherent to this problem, as listed in Table 1. The first one, exploration versus exploitation, puts exploration, i.e. obtaining more observations, against exploitation, i.e. trying to achieve the given objective. Observation versus computation captures the trade-off between building sophisticated models that use the available information to the fullest extent and making more observations. Robustness versus optimization puts risk avoidance against optimization with respect to the original objective, as in exploitation.

3 Methodology

This section presents the methods utilized within the framework, which addresses the problem defined in the previous section. First, the regression model and Gaussian Processes (GP) are presented. Subsequently, the modeling and measurement of information is discussed, based on (Shannon) information theory.

3.1 Regression and Gaussian Processes (GP)

Problem 2, presented in the previous section, involves inferring or learning the function f using the set of observed data points. This is known as the regression problem in machine learning and is a supervised learning method, since the observed data constitutes at the same time the learning data set. This learning process involves the selection of a "model", where the learned function f̂ is, for example, expressed in terms of a set of parameters and specific basis functions, and at the same time the minimization of an error measure between the functions f and f̂ on the learning data set. Gaussian processes (GP) provide a nonparametric alternative to this, but follow the same idea in spirit.

The main goal of regression involves a trade-off. On the one hand, it tries to minimize the observed error between f and f̂.
On the other hand, it tries to infer the "real" shape of f and make good estimates using f̂ even at unobserved points. If the former is overly emphasized, then one ends up with "overfitting", which means that f̂ follows f closely at the observed points but has weak predictive value at unobserved ones. This delicate balance is usually achieved by balancing the prior "beliefs" on the nature of the function, captured by the model (basis functions), against fitting the model to the observed data.

This paper focuses on Gaussian Processes [23] as the chosen regression method within the developed framework, without loss of any generality. There are multiple reasons behind this preference. Firstly, GP provides an elegant mathematical method for easily combining many aspects of the framework. Secondly, being a nonparametric method, GP eliminates any discussion of model degree. Thirdly, it is easy to implement and understand, as it is based on well-known Gaussian probability concepts. Fourthly, noise in observations is immediately taken into account if it is modeled as Gaussian. Finally, one of the main drawbacks of GP, namely being computationally heavy, does not really apply to the problem at hand, since the amount of data available is already very limited.

It is not possible to present a comprehensive treatment of GP here. Therefore, a very rudimentary overview is provided next, within the context of the decision making problem. Consider a set of M data points D = { x_1, ..., x_M }, where each x_i ∈ X is a d-dimensional vector, and the corresponding vector of scalar values f(x_i), i = 1, ..., M. Assume that the observations are distorted by zero-mean Gaussian noise n with variance σ, i.e. n ∼ N(0, σ). Then, the resulting observation vector is Gaussian, y = f(x) + n ∼ N(f(x), σ).
A GP is formally defined as a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by its mean function m(x) and covariance function C(x, x̃), where

m(x) = E[ f̂(x) ]  and  C(x, x̃) = E[ ( f̂(x) − m(x) )( f̂(x̃) − m(x̃) ) ],  ∀ x, x̃ ∈ D.

Let us for simplicity choose m(x) = 0. Then, the GP is characterized entirely by its covariance function C(x, x̃). Since the noise in the observation vector y is also Gaussian, the covariance function can be defined as the sum of a kernel function Q(x, x̃) and the diagonal noise variance,

C(x, x̃) = Q(x, x̃) + σ I,  ∀ x, x̃ ∈ D,   (1)

where I is the identity matrix. While it is possible to choose any (positive definite) kernel Q(·,·) here, one classical choice is

Q(x, x̃) = exp( −(1/2) ‖x − x̃‖² ).   (2)

Note that GP makes use of the well-known kernel trick here, representing an infinite dimensional continuous function using a (finite) set of continuous basis functions and an associated vector of real parameters, in accordance with the representer theorem [26].

The (noisy)¹ training set (D, y) is used to define the corresponding GP, GP(0, C(D)), through the M × M covariance matrix C(D) = Q + σI. The conditional Gaussian distribution of any point outside the training set, x̄ ∈ X, x̄ ∉ D, given the training data (D, y), can then be computed as follows. Define the vector

k(x̄) = [ Q(x_1, x̄), ..., Q(x_M, x̄) ]   (3)

and the scalar

κ = Q(x̄, x̄) + σ.   (4)

Then, the conditional distribution p(ȳ | y) that characterizes the GP(0, C) is a Gaussian N(f̂, v) with mean f̂ and variance v,

f̂(x̄) = kᵀ C⁻¹ y  and  v(x̄) = κ − kᵀ C⁻¹ k.   (5)

¹ The special case of perfect observation without noise is handled in the same way, as long as the kernel function Q(·,·) is positive definite.
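A minimal numerical sketch of equations (1)-(5), using the unit-bandwidth squared-exponential kernel of (2); the data points and noise level below are made up for illustration:

```python
import numpy as np

def kernel(x, z):
    """Squared-exponential kernel Q(x, z) = exp(-0.5 ||x - z||^2), as in (2)."""
    return np.exp(-0.5 * np.sum((x - z) ** 2))

def gp_predict(D, y, x_bar, sigma=0.01):
    """Posterior mean f_hat(x_bar) and variance v(x_bar), as in (5).

    D     : (M, d) array of observed points
    y     : (M,) vector of (noisy) observed values
    sigma : observation noise variance
    """
    M = len(D)
    Q = np.array([[kernel(D[i], D[j]) for j in range(M)] for i in range(M)])
    C = Q + sigma * np.eye(M)                              # covariance matrix (1)
    k = np.array([kernel(D[i], x_bar) for i in range(M)])  # vector k, eq. (3)
    kappa = kernel(x_bar, x_bar) + sigma                   # scalar kappa, eq. (4)
    f_hat = k @ np.linalg.solve(C, y)                      # posterior mean, (5)
    v = kappa - k @ np.linalg.solve(C, k)                  # posterior variance, (5)
    return f_hat, v

# Three noisy observations of an unknown 1-d function (illustrative values).
D = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.1, 0.9, 0.2])

mean, var = gp_predict(D, y, np.array([0.5]))
```

Far away from all observed points, k ≈ 0, so the prediction reverts to the prior: mean ≈ 0 and variance ≈ κ, reflecting maximal uncertainty.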
This is a key result that defines GP regression: the mean function f̂(x) of the Gaussian distribution provides a prediction of the objective function f(x). At the same time, it belongs to the well-defined class f̂ ∈ F, which is the set of all possible sample functions of the GP,

F := { f̂(x) : X → R such that f̂ ∈ GP(0, C(D)), ∀ D, C },

where C(D) is defined in (1) and the GP through (3), (4), and (5) above. Furthermore, the variance function v(x) can be used to measure the uncertainty level of the predictions provided by f̂, which will be discussed in the next subsection.

3.2 Quantifying Information in Observations

In the framework presented, each observation provides a data point to the regression problem (estimating f by constructing f̂), as discussed in the previous subsection. Many works in the learning literature consider the "training" data used in regression as given (all at once or sequentially) and do not discuss the possibility of the decision maker influencing, or even optimizing, the data collection process. The active learning problem defined in Section 2, however, requires exactly addressing the question of "how to quantify the information obtained and optimize the observation process?". Following the approach discussed in [17, 18], the framework here provides a precise answer to this question.

Making any decision on the next (set of) observations in a principled manner necessitates first measuring the information obtained from each observation within the adopted model. It is important to note that the information measure here is dependent on the chosen model.
For example, the same observation provides a different amount of information to a random search model than to a GP one.

Shannon information theory readily provides the necessary mathematical framework for measuring the information content of a variable. Let p be a probability distribution over the set of possible values of a discrete random variable A. The entropy of the random variable is given by

H(A) = ∑_i p_i log₂(1/p_i),

which quantifies the amount of uncertainty. Then, the information obtained from an observation of the variable, i.e. the reduction in uncertainty, can be quantified simply by taking the difference of its initial and final entropy,

I = H₀ − H₁.

It is important here to avoid the common conceptual pitfall of equating entropy to information itself, as is sometimes done in the communication theory literature.² Within this framework, (Shannon) information is defined as a measure of the decrease of uncertainty after (each) observation (within a given model). This can best be explained with the following simple example.

3.2.1 Example: Bisection

Choose a number between 1 and 64 randomly with uniform probability (prior). What is the best search strategy for finding this number? Let the random variable A represent this number. In the beginning, the entropy of A is

H₀(A) = ∑_{i=1}^{64} (1/64) log₂ 64 = 6 (bits).

The information maximization problem is defined as

max I = max (H₀ − H₁) = min H₁,

since H₀, the entropy before the action (obtaining information), is constant. The entropy H₁ is the one after the information is obtained, and hence is directly affected by the specific action chosen. Now, define the action as setting a threshold 1 < t < 64 to check whether the chosen number is less than or greater than this threshold t.
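The threshold analysis can also be cross-checked on the discrete problem by brute force. The sketch below (not from the paper) computes the expected post-observation entropy H₁ for every threshold and confirms that splitting in the middle is optimal:

```python
import math

def h1_after_threshold(t, n=64):
    """Expected entropy (bits) of a uniform choice from {1..n}, after
    observing whether the number is at most the threshold t."""
    p = t / n                                  # probability the number is <= t
    return p * math.log2(t) + (1 - p) * math.log2(n - t)

# H0 = log2(64) = 6 bits; the information gained is H0 - H1, so the best
# threshold is the one minimizing H1.
best_t = min(range(1, 64), key=h1_after_threshold)   # t = 32, the midpoint
gain = math.log2(64) - h1_after_threshold(best_t)    # exactly 1 bit
```

As expected, the optimal threshold is the midpoint, and each such observation yields exactly one bit of information.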
To simplify the analysis, consider a continuous version of the problem by defining p as the probability of the chosen number being less than the threshold. Thus, in this uniform prior case, the problem simplifies to

min_p H₁ = min_p [ p log(p) + (1 − p) log(1 − p) ]

(up to an additive constant, since the expected entropy after the observation is log 64 + p log p + (1 − p) log(1 − p) for the uniform prior), which has the derivative

dH₁/dp = log(p) − log(1 − p).

Clearly, the threshold p* = 0.5 is the global minimum, which roughly corresponds to t = 32 (ignoring quantization and boundary effects). Thus, bisection from the middle is the optimal search strategy for the uniform prior. In this example, the number can be found in the worst case in 6 steps, each providing one bit of information. Nonuniform probabilities (priors) can be handled in a similar way.

If this search process (bisection) is repeatedly applied without any feedback, then it results in the optimal quantization of the search space, both in the uniform case above and for nonuniform probabilities. If feedback is available, i.e. one learns after each bisection whether the number is larger or smaller than the boundary, then this is, as shown, the best search strategy.

² Since this issue is not of great importance for the class of problems considered in communication theory, it is often ignored. However, the difference is of conceptual importance in this problem. See http://www.ccrnp.ncifcrf.gov/~toms/information.is.not.uncertainty.html for a detailed discussion.

4 Model

The model adopted in the framework for decision making with limited information builds on the methods presented in the previous section and addresses the problem introduced in Section 2. The model consists of three main parts: observation, update of the GP for regression, and optimization to determine the next action. These three steps, shown in Figure 2, are taken iteratively to achieve the objectives in Problem 2.
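The three-part loop can be sketched as follows. The sketch is schematic: it is restricted to one dimension, uses the kernel of (2), and selects the next action purely by posterior variance (the exploration aspect of Problem 2); the function names and the toy objective are placeholders, not the paper's notation.

```python
import numpy as np

def rbf(a, b):
    """Unit-bandwidth squared-exponential kernel, as in (2)."""
    return np.exp(-0.5 * (a - b) ** 2)

def search_loop(observe, candidates, budget, noise=0.01):
    """Schematic observe -> GP update -> optimize loop for 1-d problems.

    observe    : the costly oracle for f (assumed available)
    candidates : 1-d array of candidate points (the sampled set Theta)
    budget     : total number of observations allowed
    """
    D, y = [candidates[0]], [observe(candidates[0])]   # 1. first observation
    for _ in range(budget - 1):
        X = np.array(D)
        C = rbf(X[:, None], X[None, :]) + noise * np.eye(len(X))  # 2. GP update
        K = rbf(X[:, None], candidates[None, :])   # kernels to all candidates
        var = 1.0 + noise - np.einsum('ij,ij->j', K, np.linalg.solve(C, K))
        x_next = candidates[int(np.argmax(var))]   # 3. optimize next action
        D.append(x_next)
        y.append(observe(x_next))                  # back to step 1
    X = np.array(D)
    C = rbf(X[:, None], X[None, :]) + noise * np.eye(len(X))
    mean = rbf(X[:, None], candidates[None, :]).T @ np.linalg.solve(C, np.array(y))
    return candidates[int(np.argmax(mean))]        # current best estimate of x*

# Toy run on f(x) = -(x - 0.7)^2 over [0, 2] (illustrative only).
cands = np.linspace(0.0, 2.0, 41)
best = search_loop(lambda x: -(x - 0.7) ** 2, cands, budget=10)
```

In the full framework, step 3 would weigh all three objectives of Problem 2 rather than variance alone.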
As a result of its iterative nature, this approach can, in a sense, be considered similar to the well-known Expectation-Maximization algorithm [3].

Figure 2: The main parts of the underlying model of the decision making framework.

Observations, given that they are a scarce resource in the class of problems considered, play an important role in the model. Uncertainties in the observed quantities can be modeled as additive noise. Likewise, the properties (variance or bias) of the additive noise can be used to model the reliability of (and bias in) the observed data points. GPs provide a straightforward mathematical structure for incorporating these aspects into the model under some simplifying assumptions.

The set of observations collected provides the (supervised) training data for GP regression, in order to estimate the characteristics of the function or system at hand. This process relies on the GP methods described in Subsection 3.1. Thus, at each iteration, an up-to-date description of the function or system is obtained based on the latest observations. Specifically, f̂ provides an estimate of the original function f.³ Assuming an additive Gaussian noise model, the noise variance σ can be used to model uncertainties, e.g. older and noisier data resulting in higher σ values.

The final and most important part of the model provides a basis for determining the next action, after an optimization process that takes into account all three objectives in Problem 2. The information aspect of these objectives has already been discussed in Subsection 3.2. An important issue here is the fact that there are infinitely many candidate points in this optimization process, while in practice only a finite collection of them can be evaluated.
4.1 Sampling Solution Candidates

When making a decision on the next action through multi-objective optimization, there are (infinitely) many candidate points. A pragmatic solution to the problem of finding solution candidates is to (adaptively) sample the problem domain X to obtain a set

Θ := { x_1, ..., x_T : x_i ∈ X, x_i ∉ D, ∀i }

that does not overlap with known points. In low (one or two) dimensions, this can easily be achieved through grid sampling methods. In higher dimensions, (Quasi) Monte Carlo schemes can be utilized. For large problem domains, the current domain of interest X can be defined around the last or most promising observation, in such a way that such a sampling is computationally feasible. Likewise, multi-resolution schemes can also be deployed to increase computational efficiency.

Although such a solution may seem restrictive at first glance, it is in spirit not very different from other schemes, such as simulated annealing, which are widely used to address nonconvex optimization problems. However, a major difference between this and other schemes is the fact that the candidate sampling and evaluation are done here "a priori", due to experimentation being costly, while other methods rely on an abundance of information.

³ See [23, Chap. 7.2] for a discussion of the asymptotic analysis of GP regression. It should be noted, however, that asymptotic properties are of little relevance to the problem at hand.

A natural question that arises is whether, and under what conditions, such a sampling method gives satisfactory results. The following result from [30, 31] provides an answer to this question in terms of the number of samples required.

Theorem 1 Define a multivariate function f(x) on the convex, compact set X, which admits the maximum x* = arg max_{x∈X} f(x). Based on a set of N random samples Θ = { x_1, ...
, x_N : x_i ∈ X ∀i } from the entire set X, let x̂ := arg max_{x∈Θ} f(x) be an estimate of the maximum x*. Given an ε > 0 and δ > 0, the minimum number of random samples N which guarantees that

Pr( Pr[ f(x*) > f(x̂) ] ≤ ε ) ≥ 1 − δ,

i.e. the probability that 'the probability of the real maximum surpassing the estimated one is less than ε' is larger than 1 − δ, is

N ≥ ln(1/δ) / ln( 1/(1 − ε) ).

Furthermore, this bound is tight if the function f is continuous on X.

It is interesting and important to note that this bound is independent of the sampling distribution used (as long as it covers the whole set X with nonzero probability), of the function f itself, as well as of the properties and dimension of the set X.

4.2 Quantifying Information in GP

The information measurement and GP approaches in Section 3 can be directly combined. Let the zero-mean multivariate Gaussian (normal) probability distribution be denoted as

p(x) = (2π)^{−d/2} |C_p(x)|^{−1/2} exp( −(1/2) [x − m]ᵀ C_p(x)⁻¹ [x − m] ),  x ∈ X,   (6)

where |·| denotes the determinant, m is the mean (vector) as defined in (5), and C_p(x) is the covariance matrix as a function of the newly observed point x ∈ X, given by the block matrix

C_p(x) = [ C(D)   k(x)ᵀ
           k(x)   κ     ].   (7)

Here, the vector k(x) is defined in (3) and κ in (4), respectively. The matrix C(D) is the covariance matrix based on the training data D, as defined in (1). The entropy of the multivariate Gaussian distribution (6) is [1]

H(x) = d/2 + (d/2) ln(2π) + (1/2) ln |C_p(x)|,

where d is the dimension. Note that this is the entropy of the GP estimate at the point x, based on the available data D. The aggregate entropy of the function on the region X is then given by

H_agg := ∫_{x∈X} (1/2) ln |C_p(x)| dx.   (8)
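The Gaussian entropy formula and the aggregate entropy (8) can be evaluated numerically; the sketch below approximates the integral in (8) by a Riemann sum over a grid, with made-up one-dimensional training data and the kernel of (2):

```python
import numpy as np

def kernel(a, b):
    """Squared-exponential kernel, as in (2)."""
    return np.exp(-0.5 * (a - b) ** 2)

def cov_Cp(D, x, sigma=0.01):
    """Block covariance matrix C_p(x) of (7) for a new point x (1-d case)."""
    M = len(D)
    Cp = np.empty((M + 1, M + 1))
    Cp[:M, :M] = kernel(D[:, None], D[None, :]) + sigma * np.eye(M)  # C(D), (1)
    Cp[:M, M] = Cp[M, :M] = kernel(D, x)        # vector k(x), eq. (3)
    Cp[M, M] = kernel(x, x) + sigma             # scalar kappa, eq. (4)
    return Cp

def gaussian_entropy(Cp):
    """Entropy (in nats) of a multivariate Gaussian with covariance Cp."""
    d = Cp.shape[0]
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.log(np.linalg.det(Cp))

D = np.array([0.0, 1.0])            # two observed 1-d points (toy data)
grid = np.linspace(0.0, 2.0, 21)    # grid approximation of the region X
dx = grid[1] - grid[0]

# Aggregate entropy (8), approximated as a Riemann sum over the grid.
H_agg = sum(0.5 * np.log(np.linalg.det(cov_Cp(D, x))) * dx for x in grid)
```

Note that |C_p(x)| is small near the observed points and larger far from them, so H_agg shrinks as observations accumulate.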
(8) The problem of c ho osing a new data p oin t ˆ x s u c h that the informa- tion obtained f rom it within the GP regression mo del is maximized can b e form ulated in a w a y similar to the one in the bisection example: ˆ x = arg max ˜ x I = arg max ˜ x Z x ∈X [ H 0 − H 1 ] dx = arg min ˜ x Z x ∈X 1 2 ln | C q ( x, ˜ x ) | dx, (9) where the int egral is computed o ve r all x ∈ X , and the co v ariance matrix C q ( x, ˜ x ) is d efined as C q ( x, ˜ x ) = C ( D ) k T ( ˜ x ) k T ( x ) k ( ˜ x ) ˜ κ Q ( x, ˜ x ) k ( x ) Q ( x, ˜ x ) κ , (10) and ˜ κ = Q ( ˜ x , ˜ x ) + σ . Here, C ( D ) is a M × M matrix and C q is a ( M + 2) × ( M + 2) one, whereas κ and Q ( x, ˜ x ) are scalars and k is a M × 1 vec tor. This result is summarized in the follo wing prop osition. Prop osition 1 As a maximum information data c ol le ction str ate gy for a Gaussian Pr o c ess with a c ovarianc e matrix C ( D ) , the next observation ˆ x should b e c hosen in su c h a way that ˆ x = arg max ˜ x I = arg min ˜ x Z x ∈X ln | C q ( x, ˜ x ) | dx, wher e C q ( x, ˜ x ) is define d in (10). 15 An Appro ximate Solution to Information Maximization Giv en a set of (candidate) p oin ts Θ sampled from X , the r esult in Prop o- sition 1 can b e revisited. The problem in (9) is then appr o ximated [31] b y max ˜ x I ≈ min ˜ x X x ∈ Θ ln | C q ( x, ˜ x ) | (11) ⇒ ˆ x = arg min ˜ x ∈ Θ Y x ∈ Θ | C q ( x, ˜ x ) | , using monotonicit y prop er ty of the natural logarithm and the fact that the determinan t of a co v ariance m atrix is non-negativ e. Thus, th e follo wing coun terpart of Prop osition 1 is obtained: Prop osition 2 As an appr oximately maximum information data c ol le ction str ate gy for a Gaussian Pr o c ess with a c ovarianc e matrix C ( D ) and given a c ol le ction of c andidate p oints Θ , the next observation ˆ x ∈ Θ should b e chosen in such a way that ˆ x = arg min ˜ x ∈ Θ Y x ∈ Θ | C q ( x, ˜ x ) | ≈ arg max ˜ x ∈ Θ I , wher e C q ( x, ˜ x ) is given in (10). 
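As an illustration of how Theorem 1 can guide the sampling step, the following Python sketch (using `numpy`; the one-dimensional domain and the observed point are hypothetical stand-ins) computes the required sample count N for given ε and δ and draws a candidate set Θ uniformly at random, excluding already-observed points:

```python
import numpy as np

def required_samples(eps: float, delta: float) -> int:
    """Minimum N from Theorem 1: N >= ln(1/delta) / ln(1/(1-eps))."""
    return int(np.ceil(np.log(1.0 / delta) / np.log(1.0 / (1.0 - eps))))

def sample_candidates(bounds, n, observed, rng, min_dist=1e-9):
    """Draw n uniform random candidates in the box `bounds`, discarding
    any point that (numerically) coincides with an observed point in D."""
    lo, hi = np.asarray(bounds, dtype=float).T
    pts = []
    while len(pts) < n:
        x = rng.uniform(lo, hi)
        if all(np.linalg.norm(x - d) > min_dist for d in observed):
            pts.append(x)
    return np.array(pts)

# Example: a 95% guarantee (delta = 0.05) that the best sampled point lies
# within the top eps = 5% probability mass of the domain.
N = required_samples(eps=0.05, delta=0.05)  # 59 samples, in any dimension
rng = np.random.default_rng(0)
Theta = sample_candidates([(0.1, 3.9)], N, observed=[np.array([0.1])], rng=rng)
```

Note that, exactly as the theorem states, N depends only on ε and δ, not on the dimension of the domain or on f itself.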
Although it is an approximation, finding a solution to the optimization problem in Proposition 2 can still be computationally costly. Therefore, a greedy algorithm is proposed as a computationally simpler alternative. Let x* ∈ Θ be defined as

    x* := arg max_{x∈Θ} |C_p(x)| = arg max_{x∈Θ} |C(D)| ( κ(x) − k(x) C^{−1}(D) k^T(x) ),

where the matrix C_p is given by (7) [21]. The first factor above, |C(D)|, is fixed, and the second one, κ(x) − k(x) C^{−1}(D) k^T(x), is the same as the GP variance v(x) in (5). Hence, the sample x* is one of those with the maximum variance in the set Θ, given the current data D.

It follows from (10) and basic matrix theory that, for a given x, |C_q(x, x̃)| is minimized when x̃ = x. As a simplification, ignore the dependencies between the C_q(x, x̃) matrices for different x ∈ Θ. Then, choosing the maximum-variance point x̂ as

    x̂ = arg max_{x̃∈Θ} v(x̃) ≈ arg min_{x̃∈Θ} Π_{x∈Θ} |C_q(x, x̃)|

leads to a large (possibly the largest) reduction in Π_{x∈Θ} |C_q(x, x̂)|, and hence provides a rough approximate solution to (11) and to the result in Proposition 1. This result is consistent with widely known heuristics such as "maximum entropy" or "minimum variance" methods [28], and a variant has been discussed in [17].

Proposition 3 Given a Gaussian process with a covariance matrix C(D) and a collection of candidate points Θ, an approximate solution to the maximum information data collection problem defined in Proposition 1 is to choose the sample point(s) x̃ in such a way that it has (they have) the maximum variance within the set Θ.

5 Optimization with Limited Information

Let f : X → R be the unknown Lipschitz-continuous function of interest on the d-dimensional nonempty, convex, and compact set X ⊂ R^d.
The amount of information about this function available to the decision maker is limited to a finite number of possibly noisy observations. Since the observations are costly, the goal of the decision maker is, at the same time, to find the maximum of f, to estimate f as accurately as possible using the available observations, and to select the most informative data points. This naturally calls for an iterative and myopic optimization procedure, since each new observation provides a new data point that concurrently affects the maximization, the function estimation (regression), and the information quantity.

The first and basic objective is the maximization of the function f(x) on x ∈ X. As a simplification, observations are assumed to be sequential, one at a time. Since f is essentially unknown, this problem has to be formulated as

    max_{x̃∈X} F_1(x̃) = f̂(x̃),

where f̂ is the best estimate obtained through GP regression (5) using the current data set D. Data uncertainty (observation errors) is modeled through additive Gaussian noise with variance σ as a first approximation.

The second objective is to minimize the difference (estimation error) between f̂ and f. Define

    e(x) = f̂(x) − f(x), ∀x ∈ X.

Given the set of noisy observations

    O = { f(x_i) + n(x_i) : x_i ∈ D, ∀i },

where n ∼ N(0, σ) denotes zero-mean Gaussian noise, it is possible to use another GP regression (5) to estimate this error function, ê(D, x), on the entire set X. Thus, the second objective is to ensure that the next observation x̃ solves

    min_{x̃∈X} F_2(x̃) = ∫_{τ∈X} |ê(x̃, D, τ)| dτ.

Note that F_2 here corresponds to a risk or loss estimate function.
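The GP quantities underlying F_1, and the variance used as the information surrogate in Proposition 3, can be sketched directly from the covariance structure above. The following Python sketch (numpy only; the squared-exponential kernel, its length scale, the unit prior variance, and the toy data are illustrative assumptions, not the paper's exact settings) computes the posterior mean f̂ and variance v on a candidate set and selects the maximum-variance candidate:

```python
import numpy as np

def sq_exp_kernel(a, b, length=0.5):
    """Squared-exponential (Gaussian) kernel; an illustrative choice."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X_train, y_train, X_cand, noise_var=1e-4, length=0.5):
    """Posterior mean f_hat and variance v at candidate points, i.e. the
    GP regression estimate with C(D) = K + sigma * I on the diagonal."""
    K = sq_exp_kernel(X_train, X_train, length) + noise_var * np.eye(len(X_train))
    k = sq_exp_kernel(X_train, X_cand, length)            # cross-covariances k(x)
    kappa = 1.0 + noise_var                               # prior variance kappa
    K_inv = np.linalg.inv(K)
    f_hat = k.T @ K_inv @ y_train                         # posterior mean
    v = kappa - np.einsum('ij,ik,kj->j', k, K_inv, k)     # kappa - k^T C^-1 k
    return f_hat, v

# Toy data: a few noisy observations of an unknown 1-D function.
X_train = np.array([0.5, 1.5, 3.0])
y_train = np.sin(5 * X_train) / X_train
Theta = np.linspace(0.1, 3.9, 50)

f_hat, v = gp_posterior(X_train, y_train, Theta)
x_next = Theta[np.argmax(v)]   # Proposition 3: max-variance candidate
```

As expected, the variance (and hence the selected point) is largest far away from the existing observations.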
The third objective is to maximize the amount of information obtained with each observation x̃, or

    max_{x̃∈X} F_3(x̃) = I(x̃, f̂) = −∫_{x∈X} ln |C_q(x, x̃)| dx,

given the best estimate of the original function, f̂ (the sign follows from (9), where maximizing the information I corresponds to minimizing the integral of ln |C_q|). This objective has already been discussed in detail in Section 3.2.

The values of the three objectives, F_1, F_2, and F_3, cannot be evaluated numerically on the entire set X. Therefore, a sampling method is used, as described in Section 4, to obtain a set of solution candidates Θ, which replaces X in the maximization and minimization problems above. Next, specific problem formulations are presented based on such a sampling of solution candidates. The overall structure of the framework is visualized in Figure 3.

5.1 Solution Approaches

The most common approach to multi-objective optimization is the weighted sum method [19, 9]. The three objectives discussed above can be combined into a single objective using the respective weights [w_1, w_2, w_3], Σ_{i=1}^3 w_i = 1, 0 ≤ w_i ≤ 1 ∀i. Assuming a single data point is chosen and observed from among the candidates Θ at each step, i.e. x̃ = Ω_1, a specific weighted sum formulation to address Problem 2 is obtained.

Proposition 4 The solution, x̃ ∈ Θ, to the optimization problem

    max_{x̃∈Θ} F(x̃) = Σ_{i=1}^3 w_i F_i(x̃) = w_1 f̂(x̃) − w_2 (1/N) Σ_{τ∈Θ} |ê(x̃, D, τ)| + w_3 I(x̃, f̂),    (12)

constitutes the best search strategy for this weighted sum formulation of Problem 2.

Figure 3: The decision making framework for static optimization with limited information.

As discussed in Subsection 3.2 and stated in Proposition 2, the information objective, F_3, in (12) can be approximated by substituting the GP variance v(x) in (5) for it, to decrease the computational load.
Thus, an approximation to the solution in Proposition 4 is:

Proposition 5 The solution, x̃ ∈ Θ, to the optimization problem

    max_{x̃∈Θ} F(x̃) = Σ_{i=1}^3 w_i F_i(x̃) = w_1 f̂(x̃) − w_2 (1/N) Σ_{τ∈Θ} |ê(x̃, D, τ)| + w_3 v(x̃),    (13)

where v(x̃) is defined in (5), approximates the search strategy in Proposition 4.

The weighting scheme described is only meaningful if the three objectives are of the same order of magnitude. Therefore, the original objective functions, F_i, i = 1, 2, 3, have to be transformed or "normalized". There are many different approaches to performing such a transformation [19, 9]. The most common one, which coincidentally is known as normalization, aims to map each objective function to a predefined interval, e.g. [0, 1]. To do this, first estimate an upper bound F_i^U and a lower bound F_i^L on each individual objective F_i(x). Then, the i-th normalized objective is

    F_i^N(x) = ( F_i(x) − F_i^L ) / ( F_i^U − F_i^L ).

The main issue in normalization is determining the appropriate upper and lower bounds, which is a very problem-dependent task. In the case of Proposition 5, the estimated functions f̂ and ê on the set Θ, as well as the existing observations D, can be utilized to obtain these values. The specific bounds for the respective objectives,

    F_1^U = max_{x∈Θ} f̂(x),  F_1^L = min_{x∈Θ} f̂(x),  F_2^U = max_{x∈Θ} |ê(x, D)|,  F_2^L = 0,  F_3^U = max_{x∈Θ} κ(x),  and  F_3^L = 0,

provide a suitable starting estimate and can be combined with prior domain knowledge if necessary. Thus, a normalized version of the formulation in Proposition 5 is obtained.
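A minimal sketch of this normalization step follows (Python/numpy; the bound estimates mirror the choices above, while the objective arrays are hypothetical stand-ins for f̂, ê, and v evaluated on Θ):

```python
import numpy as np

def normalize(F, F_low, F_up):
    """Map an objective array to [0, 1] using estimated bounds:
    F^N = (F - F^L) / (F^U - F^L)."""
    return (F - F_low) / (F_up - F_low)

# Hypothetical objective values on a 4-point candidate set Theta.
f_hat = np.array([0.2, 1.4, -0.7, 0.9])   # F_1: GP mean estimate
e_hat = np.array([0.5, 0.1, 0.8, 0.3])    # F_2: mean absolute error estimate
v     = np.array([0.9, 0.05, 0.6, 0.4])   # F_3: GP variance (kappa bound = 1.0)

F1 = normalize(f_hat, f_hat.min(), f_hat.max())  # F_1^L = min, F_1^U = max
F2 = normalize(e_hat, 0.0, e_hat.max())          # F_2^L = 0
F3 = normalize(v, 0.0, 1.0)                      # F_3^L = 0, F_3^U = kappa

w = np.array([1.0, 1.0, 1.0])                    # equal weights
score = w[0] * F1 - w[1] * F2 + w[2] * F3        # normalized weighted sum (13)
x_next_idx = int(np.argmax(score))
```

The candidate that balances a high estimated value, a low estimated error, and a high variance wins the weighted sum.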
Proposition 6 The solution, x̃ ∈ Θ, to the optimization problem

    max_{x̃∈Θ} F(x̃) = Σ_{i=1}^3 w_i F_i^N(x̃) = (w_1/Δ_1) ( f̂(x̃) − F_1^L ) − (w_2/Δ_2) (1/N) Σ_{τ∈Θ} |ê(x̃, D, τ)| + (w_3/Δ_3) v(x̃),    (14)

where Δ_i = F_i^U − F_i^L, i = 1, 2, 3, provides an approximation to the best search strategy for solving the normalized weighted-sum formulation of Problem 2.

The bounded objective function method provides a suitable alternative to the weighted sum formulation above in addressing the multi-objective problem defined. The bounded objective function method optimizes the single most important objective, in this case F_1(x), while the other two objective functions, F_2(x) and F_3(x), are converted into additional constraints. Such constraints are in a sense similar to the QoS constraints that naturally arise in many real-life problems [20, 2, 29]. As an advantage, the bounded objective formulation requires no normalization. The bounded objective counterpart of the result in Proposition 5 is as follows.

Proposition 7 The solution, x̃ ∈ Θ, to the constrained optimization problem

    max_{x̃∈Θ} f̂(x̃)    (15)

    such that  0 ≤ F_2(x̃) = (1/N) Σ_{τ∈Θ} |ê(x̃, D, τ)| ≤ b_1  and  0 ≤ F_3(x̃) = v(x̃) ≤ b_2,

where b_1 and b_2 are given (predetermined) scalar bounds on F_2 and F_3, respectively, provides an approximate best search strategy for a bounded-objective formulation of Problem 2.

The advantage of the bounded objective function method is that it provides a bound on the information collection and estimation objectives while maximizing the estimated function. In practice, this leads to an initial emphasis on information collection and correct estimation of the objective function. In that sense, the method is more "classical", i.e. it follows the common approach of learning first and maximizing later.
Furthermore, it does not require normalization, i.e. it is easier to deploy. The method has, however, a significant disadvantage that can make its usage prohibitive. In large-scale or high-dimensional problems, the space that must be explored to satisfy any bound on information is simply immense. Therefore, one does not have the luxury of identifying the function first and maximizing it later, as doing so would take too many samples. In such cases, it makes more sense to deploy the weighted sum method, possibly along with a cooling scheme that modifies the weights over time to balance depth-first versus breadth-first search.

Until now, it has been (implicitly) assumed that the static optimization problem at hand is stationary. However, in a variety of problems this is not the case, and the function f(x, t) changes with time. The decision making framework allows for modeling such systems in the following way. Let

    O(t) = { f(x_i, t_i) + n(x_i, t_i) : x_i ∈ D, t_i ≤ t, ∀i }

be the set of noisy or unreliable past observations until time t, where n(x, t) ∼ N(0, σ(t)) is the zero-mean Gaussian "noise" term at time t. Now, the deterioration of past information due to changes in f(x, t) can be captured by increasing the variance of the noise term, σ(t), with time. For example, a simple linear dynamic can be defined as

    dσ(t)/dt = η,

where η > 0 captures the level of (non)stationarity; e.g., a large η indicates a rapidly changing system and function f(x, t).

5.2 Algorithm

An algorithmic summary of the solution approaches discussed above, for a specific set of choices, is provided by Algorithm 1, which describes both the weighted-sum and the bounded objective variants.
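The linear deterioration dynamic above can be sketched in discrete time as follows (Python; inflating each past observation's noise variance by η per unit of elapsed time is the modeling assumption being illustrated here, not an API of any library):

```python
import numpy as np

def inflated_noise_variance(t_obs, t_now, sigma0, eta):
    """Noise variance of past observations under d(sigma)/dt = eta:
    an observation made at t_obs is trusted less as (t_now - t_obs) grows."""
    age = t_now - np.asarray(t_obs, dtype=float)
    return sigma0 + eta * age

# Three observations made at times 0, 2, and 5; current time t = 6.
t_obs = [0.0, 2.0, 5.0]
sigma_t = inflated_noise_variance(t_obs, t_now=6.0, sigma0=0.01, eta=0.1)
# Oldest observation gets the largest variance: [0.61, 0.41, 0.11]
```

In the GP regression, these per-observation variances would replace the constant σ on the diagonal of C(D), so that stale data is automatically down-weighted.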
Algorithm 1 Optimization with Limited Information
1: Input: Function domain X, GP meta-parameters, objective weights [w_1, w_2, w_3] or bounds b_1, b_2, initial data set (D, y).
2: Use GPs with a Gaussian kernel and specific expected error variances for estimating the function f̂ and the error function ê.
3: while search budget available, 1 ≤ n ≤ N_max, do
4:   Sample the domain X to obtain Θ(n). In some cases, Θ(n) = Θ ∀n.
5:   Estimate f̂ and ê on Θ(n) based on the observed data (D, y) using GPs.
6:   Compute the variance, v(x), of f̂ (5) on Θ(n) as an estimate of I(f̂).
7:   if weighted-sum method then
8:     The next action maximizes a normalized and weighted sum of the objectives F_i^N, as stated in Proposition 6.
9:   else if bounded objective method then
10:    The next action is the solution to the constrained problem in Proposition 7.
11:  end if
12:  Update the observed data (D, y).
13: end while

5.3 Numerical Analysis

Algorithm 1 is illustrated next with multiple numerical examples. It is worth remembering that the main issue here is to solve the optimization problems with minimum data using active learning. In all examples, a uniform grid is used to sample the solution space rather than resorting to more sophisticated methods, since the examples are chosen to be only one or two dimensional for visualization purposes.

Example 1

The first numerical example aims to visualize the presented framework and algorithm. Hence, the chosen function is only one dimensional, f(x) = sin(5x)/x on the interval X = [0.1, 3.9]. The interval is linearly sampled to obtain a grid with a distance of 0.01 between points, i.e. Θ = { x_i ∈ X ∀i : x_1 = 0.1, x_2 = 0.11, ..., x_N = 3.9 }. A Gaussian kernel with variance 0.1 is chosen for estimating both f̂ and ê. The weights are all equal to one, w = [1, 1, 1], in the weighted-sum method.
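A simplified sketch of how Example 1's weighted-sum loop might be implemented is given below (Python/numpy; the kernel variance, grid, and initial point follow the example's description, but the error objective ê is omitted for brevity, i.e. w_2 = 0, so this is a stand-in for Algorithm 1, not the author's code):

```python
import numpy as np

def kernel(a, b, var=0.1):
    """Gaussian kernel with variance 0.1, as in Example 1."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / var)

def gp(X, y, Xs, noise=1e-4):
    """GP posterior mean and variance at Xs given data (X, y)."""
    C = kernel(X, X) + noise * np.eye(len(X))
    k = kernel(X, Xs)
    Ci = np.linalg.inv(C)
    mean = k.T @ Ci @ y
    var = (1.0 + noise) - np.einsum('ij,ik,kj->j', k, Ci, k)
    return mean, np.maximum(var, 0.0)

f = lambda x: np.sin(5 * x) / x
Theta = np.arange(0.1, 3.9 + 1e-9, 0.01)   # grid with 0.01 spacing
D = np.array([0.1])                         # single initial observation
y = f(D)
w1, w3 = 1.0, 1.0                           # F_1 and F_3 weights (w2 = 0 here)

for _ in range(11):                         # 11 iterations -> 12 data points
    f_hat, v = gp(D, y, Theta)
    # Normalize the two active objectives to [0, 1] (F_3^L = 0).
    F1 = (f_hat - f_hat.min()) / (np.ptp(f_hat) + 1e-12)
    F3 = v / (v.max() + 1e-12)
    score = w1 * F1 + w3 * F3
    score[np.isin(Theta.round(2), D.round(2))] = -np.inf  # skip known points
    x_next = Theta[np.argmax(score)]
    D, y = np.append(D, x_next), np.append(y, f(x_next))

x_peak = D[np.argmax(y)]   # best observed point after the search budget
```

The exact sequence of selected points depends on the kernel and weighting details, so it will not reproduce the paper's sequence D exactly; the structure of the loop (estimate, normalize, score, observe, update) is the point.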
The bounds are b_1 = 0.5 for the error and b_2 = 0.2 for the maximum variance estimate in the bounded objective method. The initial data consists of a single point, x = 0.1.

Figure 4 shows the results based on the normalized weighted-sum method in Proposition 6 after 5 iterations (6 samples in total, together with the initial data point). The variance shown is v(x) of the estimated function f̂ using the data points D. Clearly, the estimated peak is not that of the real function f. Next, Figure 5 shows that after 11 iterations (12 data points in D), the function and the location of its peak are estimated correctly. The sequence of points selected during the iteration process is D = {0.47, 3.22, 1.17, 1.66, 2.43, 2.06, 3.9, 2.83, 3.6, 0.82, 1.42}.

Figure 4: Optimization result using the weighted-sum method with 6 data points.

The amount of information obtained during the iterative optimization is of particular interest. Figure 6 depicts the mean variance v and entropy I of the estimated function f̂ on Θ at each iteration step. In this specific example, the two quantities are very well correlated. Note, however, that this correlation is a function of the relative weights between information collection and the other objectives.

Finally, Figure 7 depicts the results of the bounded objective method with the given bounds. The number of iterations is 11, as before, which gives an opportunity for direct comparison with the weighted-sum method. The sequence of points selected during the iteration process is D = {0.47, 3.22, 1.17, 1.66, 2.43, 2.06, 3.9, 2.83, 3.6, 0.82, 1.42}.

Figure 5: Optimization result using the weighted-sum method with 12 data points.

Figure 6: Mean variance v and entropy I on Θ at each iteration step.

Figure 7: Optimization result using the bounded objective method with 12 data points.

Example 2

The objective function in the second numerical example is the Goldstein–Price function [8], which is shown in Figure 8 in its inverted form to ensure consistency with the maximization formulation in this paper. The problem domain consists of the two-dimensional rectangular region X = [−2, 2] × [−2, 2], which is linearly sampled to obtain a uniform grid with a 0.05 interval between sample points. A Gaussian kernel with variance 0.5 and 0.1 is chosen for estimating f̂ and ê, respectively. The weighted-sum method is utilized in Algorithm 1 with the weights w = [4, 2, 3]. The search budget is chosen as 50 before stopping the algorithm (for a search space of approximately 6400 samples in the grid). The real global minimum (peak) of the (inverted) Goldstein–Price function is at (0, −1), and the location found by the algorithm using the 50 data points is (−0.15, −1.05). Figure 9 depicts the estimated function, the data points, as well as the optimum found. Although the real optimum value is −3 (in the inverted version) while the obtained one is −9.75, the result is still very satisfactory considering the simple sampling scheme used and the fact that the Goldstein–Price function takes values in a range of one million, i.e. the error is less than 0.001 percent of the range. Finally, Figure 10 depicts the mean variance v and entropy I of the estimated function f̂ on Θ at each iteration step.

Figure 8: The inverted Goldstein–Price function [8].

Figure 9: Optimization of the inverted Goldstein–Price function [8] using the weighted-sum method with 50 data points.

Figure 10: Mean variance v and entropy I on Θ at each iteration step.

Example 3

The third example uses the same setup as the second one, but this time with the (inverted) Branin function [6] shown in Figure 11. The rectangular problem domain X = [−5, 10] × [0, 15] is sampled uniformly to obtain a grid of points with a 0.2 interval. The real global minima (peaks) of the (inverted) Branin function are at (9.4, 2.47), (−π, 12.28), and (π, 2.28), whereas the locations found by the algorithm are (9, 2.6), (−3.2, 12), and (3, 2.2). The values found at these locations vary between −4.3 and −0.5, compared to the real global value of −0.4 (of the inverted function). Thus, the algorithm again performs satisfactorily. Figure 12 shows the computed location of one optimum, the data points, as well as the estimated function based on the data points.

Figure 11: The inverted Branin function [6].
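For reference, the benchmark objectives in Examples 2–4 are standard test functions from the global-optimization literature; their conventional (non-inverted) forms can be sketched as follows (Python; negate each to obtain the inverted, maximization versions used above):

```python
import numpy as np

def goldstein_price(x, y):
    """Goldstein-Price function; global minimum 3 at (0, -1)."""
    a = 1 + (x + y + 1) ** 2 * (19 - 14*x + 3*x**2 - 14*y + 6*x*y + 3*y**2)
    b = 30 + (2*x - 3*y) ** 2 * (18 - 32*x + 12*x**2 + 48*y - 36*x*y + 27*y**2)
    return a * b

def branin(x, y):
    """Branin function; global minima ~0.398 at (-pi, 12.275),
    (pi, 2.275), and (9.42478, 2.475)."""
    b, c = 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return (y - b * x**2 + c * x - r) ** 2 + s * (1 - t) * np.cos(x) + s

def six_hump_camel(x, y):
    """Six-hump camel function; global minimum ~ -1.0316 at
    (+/-0.0898, -/+0.7126)."""
    return (4 - 2.1*x**2 + x**4 / 3) * x**2 + x*y + (-4 + 4*y**2) * y**2
```

These minima match the inverted optimum values quoted in the examples: −3 for the inverted Goldstein–Price function, approximately −0.4 for the inverted Branin function, and approximately 1.03 for the inverted six-hump camel function.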
Figure 12: Optimization of the inverted Branin function [6] using the weighted-sum method with 50 data points.

Example 4

The fourth example is based on the six-hump camel function [7] (see Figure 13) on the domain X = [−2, 2] × [−2, 2], which is sampled uniformly with a 0.05 interval. All of the parameters are chosen to be the same as before. Figure 14 shows the computed locations of two optima, the 50 data points, as well as the estimated function based on the data points. The optimum locations found are (0, 0.65) and (0.05, −0.6), with respective values of 0.98 and 1.06, whereas the real locations are (−0.09, 0.71) and (0.09, −0.71), with the value 1.03.

Figure 13: The inverted six-hump camel function.

Figure 14: Optimization of the inverted six-hump camel function [7] using the weighted-sum method with 50 data points.

6 Literature Review

Decision making with limited information is related to search theory. The idea of using information (theory) in this context is hardly new, as evidenced by the 1979 article "A New Look at the Relation Between Information Theory and Search Theory" [22]. The subject is further studied in [11]. The topic of optimal search has more recently been revisited in [35], which contains substantial historical notes and studies problems where the search target distribution is itself unobservable.

The book [18] provides important and valuable insights into the relationship between information theory, inference, and learning.
Measuring the information content of experiments using Shannon information is explicitly mentioned there, and a slightly informal version of the bisection example in Subsection 3.2 is discussed. However, focusing mainly on more traditional coding, communication, and machine learning topics, the book does not discuss the type of decision making problems presented in this paper.

Learning plays an important role in the presented framework, especially regression, which is a classical machine (or statistical) learning method. A very good introduction to the subject can be found in [3]. A complementary and detailed discussion of kernel methods is given in [26]. Another relevant topic is Bayesian inference [33, 18], which is at the foundation of the presented framework. In the machine learning literature, Gaussian processes (GPs) are becoming increasingly popular due to their various favorable characteristics. The book [23] presents a comprehensive treatment of GPs. Additional relevant works on the subject include [18, 26, 16], which also discuss GP regression.

Convex optimization [4] is a well-understood topic that is often easy to handle even if available information is limited. Optimizing nonconvex functions, however, is still a research subject [12]. It is interesting to note that the method known as kriging in global optimization is almost the same as GP regression in machine learning. The field of stochastic programming focuses on optimization under uncertainty but assumes a certain amount of prior knowledge about the problem at hand and models the uncertainty probabilistically [25]. The popular heuristic method simulated annealing [24] is essentially based on iterative random search.
Another popular heuristic scheme, particle swarm optimization [13], is also based on random search but, as a distinguishing characteristic, is parallel in nature rather than iterative.

Gaussian processes have recently been applied to the areas of optimization and regression [5] as well as system identification [32]. While the latter mentions active learning, neither work discusses explicit information quantification or builds a connection with Shannon information theory. The recent articles [15, 34], which utilize GP regression for optimization in a setting similar to the one in this paper and for state-space inference and learning, respectively, do not consider the information-theoretic aspects of the problem, either. Likewise, the article [10] on stochastic black box optimization, which considers a problem similar to the one here, does not take into account explicit measurement of information.

The area of active learning or experiment design focuses on data scarcity in machine learning and makes use of Shannon information theory among other criteria [28]. The paper [17] discusses objective functions that measure the expected informativeness of candidate measurements within a Bayesian learning framework. The subsequent study [27] investigates active learning for GP regression using variance as a (heuristic) confidence measure for test point rejection.

7 Discussion

The foundation of the approach adopted in this paper is Bayesian inference, where the main idea is to choose an a priori model and update it with the experimental data actually observed (see [18, Chap. 2] for a beautiful introductory discussion of the subject). As long as the a priori model is close to the reality (of the problem at hand), this inference methodology works very efficiently, as indicated by the numerical examples in Section 5.3.
In many cases this background information, which is sometimes referred to as "domain knowledge", is already available. In other cases, however, one has to explore the model domain and learn the model meta-parameters on a time scale naturally longer than that of the actual optimization [16].

The GP regression adopted in the presented framework is only one method for function estimation; other, e.g. parametric, methods can easily replace GP for the regression part. In any case, the regression methodology here is consistent with the principle of "Occam's razor", more specifically its interpretation via Kolmogorov complexity [14]. A priori, the optimization problems at hand are more likely to be simple rather than complex to describe, in accordance with the universal distribution [14]. Hence, given a data set, it is reasonable to start describing it with the simplest explanation. GP regression already incorporates this line of thinking by relying on a kernel-based approach and making use of the representer theorem [23, Chap. 6.2]. As a visual example, we refer to Figures 4 and 5 for a comparison of function estimates with different sets of available data.

This paper considers a class of problems where data is scarce and obtaining it is costly. Information theory plays an especially important role in devising optimal schemes for obtaining new data points (active learning). The entropy measure from Shannon information theory provides the necessary metric for this purpose, quantifying the "exploration" aspect of the problem. Using a multi-objective optimization formulation, the presented framework allows explicit weighting of the exploration versus exploitation aspects. This trade-off is also very similar to the one between the well-known depth-first and breadth-first search algorithms in search theory.
The amount of information obtained from each data point differs here only because a specific a priori general model is utilized to explain the observed data (GP regression). Because of this, the amount of information obtained is specific to the model. Otherwise, without this Bayesian approach, each data point would give the same amount of information (inversely proportional to the total number of candidate points).

The illustrative examples discussed are low dimensional, which makes it possible to use grids for sampling. In higher dimensions (i.e. when the problem is much more "difficult"), however, this "luxury" is not affordable, and one necessarily has to resort to Monte Carlo methods. In such cases, the trade-off between exploration and exploitation is even more pronounced. Possible methods to address this issue include "cooling" approaches similar to those used in simulated annealing, multi-resolution sampling based on a region of interest, or using topological properties of Gaussian mixtures to intelligently estimate candidate points based on the current state.

The optimization approach presented here can also be interpreted from a biological perspective. If an analogy is established between the decision maker and a biological organism, then the a priori Bayesian model (the meta-parameters of the GP), which is refined over a long time scale, corresponds to the evolution of a species in an environment (the problem domain). Each individual organism belonging to the species obtains new information to achieve its objective while preserving resources as much as possible. The existing evolutionary basis (the GP model) gives them an advantage in finding a solution much faster than random search.
From the perspective of the species, it also makes sense for some of its members to explore the model (meta-parameter) domain and further refine it through adaptation. Those with better meta-parameters then achieve their objectives even more efficiently and obtain an evolutionary edge in natural selection (assuming competition).

8 Conclusion

The decision-making framework presented in this paper addresses the problem of decision making under limited information by taking into account the information collection (observation), estimation (regression), and (multi-objective) optimization aspects in a holistic and structured manner. The methodology is based on Gaussian processes and active learning. Various issues such as quantifying the information content of new data points using information theory, the relationship between information and GP variance, as well as related approximation and multi-objective optimization schemes are discussed. The framework is demonstrated with multiple numerical examples.

The presented framework should be considered mainly as an initial step. Future research directions are abundant and include further investigation of the exploration-exploitation trade-off, adaptive weighting parameters, and random sampling methods for problems in higher-dimensional spaces. Additional research topics are the relationship of the framework with genetic/evolutionary methods, dynamic control problems, and multi-person decision making, i.e., game theory.

Acknowledgements

This work is supported by Deutsche Telekom Laboratories. The author wishes to thank Lacra Pavel, Slawomir Stanczak, Holger Boche, and Kivanc Mihcak for stimulating discussions on the subject.