Sequential search based on kriging: convergence analysis of some algorithms
Let $\mathcal{F}$ be a set of real-valued functions on a set $\mathbb{X}$ and let $S : \mathcal{F} \to \mathcal{G}$ be an arbitrary mapping. We consider the problem of making inference about $S(f)$, with $f \in \mathcal{F}$ unknown, from a finite set of pointwise evaluations of $f$. We are mainly interested in the problems of approximation and optimization.
Authors: Emmanuel Vazquez, Julien Bect
SUPELEC, Gif-sur-Yvette, France
e-mail: emmanuel.vazquez@supelec.fr, julien.bect@supelec.fr

1 Introduction

Let $\mathcal{F}$ be a set of real-valued functions on a set $\mathbb{X}$ and let $S : \mathcal{F} \to \mathcal{G}$ be an arbitrary mapping. We consider the problem of making inference about $S(f)$, with $f \in \mathcal{F}$ unknown, from a finite set of pointwise evaluations of $f$. We are mainly interested in the problems of approximation and optimization.

Formally, a deterministic algorithm to infer a quantity of interest $S(f)$ from a set of $n$ evaluations of $f$ is a pair $(\underline{X}_n, \widehat{S}_n)$ consisting of a deterministic search strategy $\underline{X}_n : f \mapsto \underline{X}_n(f) = (X_1(f), X_2(f), \ldots, X_n(f)) \in \mathbb{X}^n$ and a mapping $\widehat{S}_n : \mathcal{F} \to \mathcal{G}$, such that:

a) $X_1(f) = x_1$, for some arbitrary $x_1 \in \mathbb{X}$;

b) for all $1 \le i < n$, $X_{i+1}(f)$ depends measurably on $I_i(f)$, where $I_i = ((X_1, Z_1), \ldots, (X_i, Z_i))$ and $Z_i(f) = f(X_i(f))$, $1 \le i \le n$;

c) there exists a measurable function $\varphi_n$ such that $\widehat{S}_n = \varphi_n \circ I_n$.

The algorithm $(\underline{X}_n, \widehat{S}_n)$ describes a sequence of decisions made from an increasing amount of information: for each $i = 1, \ldots, n-1$, the algorithm uses the information $I_i(f)$ to choose the next evaluation point $X_{i+1}(f)$. The estimator $\widehat{S}_n(f)$ of $S(f)$ is the terminal decision.

We shall denote by $\mathcal{A}_n$ the class of all strategies $\underline{X}_n$ that query sequentially $n$ evaluations of $f$, and also define the subclass $\mathcal{A}_n^0 \subset \mathcal{A}_n$ of non-adaptive strategies, that is, the class of all strategies such that the $X_i$'s do not depend on $f$.

A classical approach to study the performance of a sequential strategy is to consider the worst error of estimation over some class of functions $\mathcal{F}$:
$$\epsilon_{\mathrm{worstcase}}(\underline{X}_n) := \sup_{f \in \mathcal{F}} L\big(S(f), \widehat{S}_n(f)\big),$$
where $L$ is a loss function.
There are many results dealing with the problems of function approximation and optimization in the worst-case setting. Two noticeable results concern convex and symmetric classes of bounded functions. For such classes, from a worst-case point of view, any strategy will behave similarly for the problem of global optimization and that of function approximation. Moreover, the use of adaptive methods cannot be justified by a worst-case analysis (see, e.g., Novak, 1988, Propositions 1.3.2 and 1.3.3). These results, combined with the fact that most optimization algorithms are adaptive, lead to think that the worst-case setting may not be the most appropriate framework to assess the performance of a search algorithm in practice. Indeed, it would also be important, in practice, to know whether the loss $L(S(f), \widehat{S}_n(f))$ is close to, or on the contrary much smaller than, $\epsilon_{\mathrm{worstcase}}$ for "typical" functions $f \in \mathcal{F}$ not corresponding to worst cases.

To address this question, a classical approach is to adopt a Bayesian point of view. In this paper, we consider methods where $f$ is seen as a sample path of a real-valued random process $\xi$ defined on some probability space $(\Omega, \mathcal{B}, P_0)$ with parameter in $\mathbb{X}$. Then $\underline{X}_n(\xi)$ is a random sequence in $\mathbb{X}$, with the property that $X_{n+1}(\xi)$ is measurable with respect to the $\sigma$-algebra generated by $\xi(X_1(\xi)), \ldots, \xi(X_n(\xi))$. From a Bayesian decision-theoretic point of view, the random process represents prior knowledge about $f$ and makes it possible to infer a quantity of interest before evaluating the function. This point of view has been widely explored in the domains of optimization and computer experiments. Under this setting, the performance of a given strategy $\underline{X}_n$ can be assessed by studying the average loss
$$\epsilon_{\mathrm{average}}(\underline{X}_n) := \mathbb{E}\, L\big(S(\xi), \widehat{S}_n(\xi)\big).$$
How much does adaption help on the average, and is it possible to derive rates of decay for average errors? In this article, we make a brief review of results concerning average error bounds of Bayesian search methods based on a random process prior. The article has three parts: the precise assumptions about $\xi$ are given in Section 2; Section 3 deals with the problem of function approximation, while Section 4 deals with the problem of optimization.

2 Framework

Let $\xi$ be a random process defined on a probability space $(\Omega, \mathcal{B}, P_0)$, with parameter $x \in \mathbb{R}^d$. Assume moreover that $\xi$ has zero mean and a continuous covariance function $k$. The kriging predictor of $\xi(x)$, based on the observations $\xi(X_i(\xi))$, $i = 1, \ldots, n$, is the orthogonal projection
(1) $$\widehat{\xi}_n(x) := \sum_{i=1}^{n} \lambda_i\big(x; \underline{X}_n(\xi)\big)\, \xi\big(X_i(\xi)\big)$$
of $\xi(x)$ onto $\mathrm{span}\{\xi(X_i(\xi)),\ i = 1, \ldots, n\}$ in $L^2(\Omega, \mathcal{B}, P_0)$. At step $n \ge 1$, given the evaluation points $\underline{X}_n(\xi)$, the kriging coefficients $\lambda_i(x; \underline{X}_n(\xi))$ can be obtained by solving a system of linear equations (see, e.g., Chilès and Delfiner, 1999). Note that for any sample path $f = \xi(\omega, \cdot)$, $\omega \in \Omega$, the value $\widehat{\xi}_n(\omega, x)$ is a function of $I_n(f)$ only. The mean-square error (MSE) of estimation at a fixed point $x \in \mathbb{R}^d$ will be denoted by
$$\sigma_n^2(x) := \mathbb{E}\big\{\big(\xi(x) - \widehat{\xi}(x; \underline{X}_n(\xi))\big)^2\big\}.$$
It is generally not possible to compute $\sigma_n^2(x)$ when $\underline{X}_n$ is an adaptive strategy.

Regularity assumptions. Assume that there exists $\Phi : \mathbb{R}^d \to \mathbb{R}$ such that $k(x, y) = \Phi(x - y)$, which is in $L^2(\mathbb{R}^d)$ and has a Fourier transform
$$\widetilde{\Phi}(u) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} \Phi(x)\, e^{i(x,u)}\, dx$$
that satisfies
(2) $$c_1 \big(1 + \|u\|_2^2\big)^{-s} \le \widetilde{\Phi}(u) \le c_2 \big(1 + \|u\|_2^2\big)^{-s}, \quad u \in \mathbb{R}^d,$$
with $s > d/2$ and constants $0 < c_1 \le c_2$.
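As a concrete illustration, the predictor (1) and the MSE $\sigma_n^2(x)$ can be computed for a non-adaptive design by solving the linear system mentioned above. The following Python sketch does this for a zero-mean process with a Matérn covariance of regularity $\nu = 3/2$; the covariance choice, the unit variance, and the unit length scale are illustrative assumptions, not taken from the text:

```python
import numpy as np

def matern32(r):
    # Matern covariance with regularity nu = 3/2 (illustrative choice;
    # any covariance satisfying (2) would do), unit variance and scale
    a = np.sqrt(3.0) * np.abs(r)
    return (1.0 + a) * np.exp(-a)

def kriging(x_eval, X, z, cov=matern32):
    """Simple (zero-mean) kriging predictor and MSE sigma_n^2 at the
    points x_eval, from observations z = f(X), in dimension d = 1.
    The coefficients lambda_i(x) solve a linear system with the Gram
    matrix of the design, as in equation (1)."""
    K = cov(X[:, None] - X[None, :])        # Gram matrix k(X_i, X_j)
    k = cov(x_eval[:, None] - X[None, :])   # cross-covariances k(x, X_i)
    lam = np.linalg.solve(K, k.T)           # kriging coefficients, one column per x
    pred = lam.T @ z                        # prediction: sum_i lambda_i(x) z_i
    mse = cov(np.zeros(1))[0] - np.sum(k * lam.T, axis=1)  # sigma_n^2(x)
    return pred, np.maximum(mse, 0.0)       # clip tiny negative round-off
```

At the design points, the predictor interpolates the observations and the MSE vanishes, as expected from the orthogonal-projection property.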
Note that the Matérn covariance with regularity parameter $\nu$ (see, e.g., Stein, 1999) satisfies such a regularity assumption, with $s = \nu + d/2$. Tensor-product covariance functions, however, never satisfy such a condition (see Ritter, 2000, Chapter 7, for some results in this case).

Let $\mathcal{H}$ be the RKHS of functions generated by $k$. Denote by $(\cdot, \cdot)_{\mathcal{H}}$ the inner product of $\mathcal{H}$, and by $\|\cdot\|_{\mathcal{H}}$ the corresponding norm. It is well known (see, e.g., Wendland, 2005) that $\mathcal{H}$ is the Sobolev space
$$W_2^s(\mathbb{R}^d) = \big\{ f \in L^2(\mathbb{R}^d);\ \widetilde{f}(\cdot)\,(1 + \|\cdot\|_2^2)^{s/2} \in L^2(\mathbb{R}^d) \big\}$$
due to the following result.

Proposition 1. $\mathcal{H} \subset L^2(\mathbb{R}^d)$ and, for all $f \in \mathcal{H}$,
$$\|f\|_{\mathcal{H}}^2 = \int_{\mathbb{R}^d} |\widetilde{f}(u)|^2\, \widetilde{\Phi}(u)^{-1}\, du.$$
In particular, $\|f\|_{\mathcal{H}}^2$ is equivalent to the squared Sobolev norm
$$\|f\|_{W_2^s(\mathbb{R}^d)}^2 = \big\| \widetilde{f}(\cdot)\, (1 + \|\cdot\|_2^2)^{s/2} \big\|_{L^2(\mathbb{R}^d)}^2.$$

3 Approximation

We first consider the problem of approximation, with the point of view exposed in Section 2. Using the notations introduced above, the problem of approximation corresponds to considering operators $S$ and $\widehat{S}_n$ defined by $S(\xi) := \xi_{|\mathbb{X}}$ and $\widehat{S}_n(\xi) := \widehat{\xi}_{n|\mathbb{X}}$, with $\mathbb{X} \subset \mathbb{R}^d$ a compact domain with non-empty interior. For the design of computer experiments, classical criteria for assessing the quality of a strategy $\underline{X}_n \in \mathcal{A}_n$ for the approximation problem are the maximum mean-square error (MMSE)
$$\epsilon_{\mathrm{mmse}}(\underline{X}_n) := \sup_{x \in \mathbb{X}} \mathbb{E}\big(\xi(x) - \widehat{\xi}_n(x)\big)^2 = \sup_{x \in \mathbb{X}} \sigma_n^2(x)$$
and the integrated mean-square error (IMSE)
$$\epsilon_{\mathrm{imse}}(\underline{X}_n) := \mathbb{E}\, \big\|\xi - \widehat{\xi}_n\big\|_{L^2(\mathbb{X}, \mu)}^2 = \int_{\mathbb{X}} \sigma_n^2(x)\, \mu(dx)$$
(see, e.g., Sacks et al., 1989; Currin et al., 1991; Welch et al., 1992; Santner et al., 2003). These criteria correspond to $G$-optimality and $I$-optimality in the theory of (parametric) optimal design. As mentioned earlier, computing $\sigma_n^2(x)$ is usually not possible in the case of adaptive sampling strategies, even for a Gaussian process.
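For a non-adaptive design, both criteria can be approximated by evaluating $\sigma_n^2$ on a discretization of $\mathbb{X}$. A minimal Python sketch, with $\mathbb{X} = [0, 1]$, $\mu$ the uniform measure, and a Matérn covariance with $\nu = 3/2$, all of which are illustrative choices:

```python
import numpy as np

def matern32(r):
    # illustrative covariance satisfying the assumption of Section 2
    a = np.sqrt(3.0) * np.abs(r)
    return (1.0 + a) * np.exp(-a)

def sigma2(grid, X, cov=matern32):
    # kriging MSE sigma_n^2(x) on a grid, for a non-adaptive design X
    K = cov(X[:, None] - X[None, :])
    k = cov(grid[:, None] - X[None, :])
    lam = np.linalg.solve(K, k.T)
    return np.maximum(cov(np.zeros(1))[0] - np.sum(k * lam.T, axis=1), 0.0)

def mmse_imse(X, grid, cov=matern32):
    # Riemann approximations of the MMSE and IMSE criteria,
    # with mu the uniform measure on the interval covered by the grid
    s2 = sigma2(grid, X, cov)
    return s2.max(), s2.mean() * (grid[-1] - grid[0])
```

Refining the design decreases both criteria, in line with the optimal rate discussed in the remainder of this section.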
From a theoretical point of view, however, it is important to know whether adaptive strategies can improve upon non-adaptive strategies for the approximation problem.

Proposition 2. Assume that $\xi$ is a Gaussian process. Then adaptivity does not help for the approximation problem, with respect to either the MMSE or the IMSE criterion.

Proof. For any adaptive strategy $\underline{X}_n$, it can be proved by induction (using the fact that $X_{i+1}$ only depends on $I_i$) that, for each $x \in \mathbb{X}$,
(3) $$\sigma_n^2(x) = \mathbb{E}\, \sigma^2\big(x; X_1(\xi), \ldots, X_n(\xi)\big),$$
where $\sigma^2(x; x_1, \ldots, x_n)$, $x_1, \ldots, x_n \in \mathbb{X}$, denotes the MSE at $x$ of the non-adaptive strategy that selects the points $x_1, \ldots, x_n$. Therefore, for each $x \in \mathbb{X}$,
$$\sigma_n^2(x) \ge \min_{x_1, \ldots, x_n \in \mathbb{X}} \sigma^2(x; x_1, \ldots, x_n),$$
which proves the claim in the case of the MMSE criterion. Similarly, integrating (3) yields
$$\int_{\mathbb{X}} \sigma_n^2\, d\mu = \mathbb{E} \int_{\mathbb{X}} \sigma^2\big(x; \underline{X}_n(\xi)\big)\, \mu(dx) \ge \min_{x_1, \ldots, x_n \in \mathbb{X}} \int_{\mathbb{X}} \sigma^2(x; x_1, \ldots, x_n)\, \mu(dx),$$
which proves the claim in the case of the IMSE criterion.

In the case of the IMSE criterion, Proposition 2 can be seen as a special case of a general result about linear problems (see, e.g., Ritter, 2000, Chapter 7). The following proposition establishes a connection between the MMSE criterion and the worst-case $L^\infty$-error of approximation in the unit ball of $\mathcal{H}$, which will be useful to establish the optimal rate for IMSE- and MMSE-optimal designs.

Proposition 3. Let $\mathcal{H}_1$ denote the unit ball of $\mathcal{H}$. For any non-adaptive strategy $\underline{X}_n \in \mathcal{A}_n^0$, the MMSE criterion equals the squared worst-case $L^\infty$-error of approximation in $\mathcal{H}_1$ using $\widehat{S}_n$:
$$\epsilon_{\mathrm{mmse}}(\underline{X}_n) = \Big( \sup_{f \in \mathcal{H}_1} \big\| S(f) - \widehat{S}_n(f) \big\|_{L^\infty(\mathbb{X})} \Big)^2.$$

Proof. Let $\underline{X}_n \in \mathcal{A}_n^0$ be a non-adaptive strategy such that $X_i(\xi) = x_i$, $i = 1, \ldots, n$, for some arbitrary $x_i$'s in $\mathbb{X}$.
Denote by $\lambda_i(x) = \lambda_i(x; \underline{X}_n(\xi))$ the corresponding kriging coefficients (which do not depend on $\xi$). Using the fact that the mapping $\xi(x) \mapsto k(x, \cdot)$ extends linearly to an isometry from $\mathrm{span}\{\xi(y),\ y \in \mathbb{R}^d\}$ to $\mathcal{H}$, we have, for all $x \in \mathbb{X}$,
$$\sigma_n(x) = \Big\| \xi(x) - \widehat{\xi}_n(x) \Big\|_{L^2(\Omega, \mathcal{B}, P_0)} = \Big\| k(x, \cdot) - \sum_i \lambda_i(x)\, k(x_i, \cdot) \Big\|_{\mathcal{H}} = \sup_{f \in \mathcal{H}_1} \Big( f,\ k(x, \cdot) - \sum_i \lambda_i(x)\, k(x_i, \cdot) \Big)_{\mathcal{H}} = \sup_{f \in \mathcal{H}_1} \big( f - \widehat{S}_n f \big)(x).$$
Thus,
$$\sup_{x \in \mathbb{X}} \sigma_n(x) = \sup_{f \in \mathcal{H}_1} \sup_{x \in \mathbb{X}} \big( f - \widehat{S}_n f \big)(x) = \sup_{f \in \mathcal{H}_1} \big\| f - \widehat{S}_n f \big\|_{L^\infty(\mathbb{X})}.$$

The following proposition summarizes known results concerning the optimal rate of decay in the class of non-adaptive strategies, for both the IMSE criterion and the MMSE criterion. Note that, by Proposition 2, this rate is also the optimal rate of decay in the class of all adaptive strategies if $\xi$ is a Gaussian process.

Proposition 4. Assume that $\xi$ has a continuous covariance function satisfying the regularity assumptions of Section 2, and let $\nu = s - d/2 > 0$. Then there exists $C_1 > 0$ such that, for any $\underline{X}_n \in \mathcal{A}_n^0$,
(4) $$C_1\, n^{-2\nu/d} \le \epsilon_{\mathrm{imse}}(\underline{X}_n) \le \mu(\mathbb{X})\, \epsilon_{\mathrm{mmse}}(\underline{X}_n).$$
Moreover, if $\mathbb{X}$ has a Lipschitz boundary and satisfies an interior cone condition, then there exists $C_2 > 0$ such that
(5) $$\inf_{\underline{X}_n \in \mathcal{A}_n^0} \epsilon_{\mathrm{imse}}(\underline{X}_n) \le \mu(\mathbb{X}) \inf_{\underline{X}_n \in \mathcal{A}_n^0} \epsilon_{\mathrm{mmse}}(\underline{X}_n) \le C_2\, n^{-2\nu/d}.$$
The optimal rate of decay is therefore $n^{-2\nu/d}$ for both criteria.

Proof. It is proved in (Ritter, 2000, Chapter 7, Proposition 8) that there exists $C_1 > 0$ such that $\epsilon_{\mathrm{imse}}(\underline{X}_n) \ge C_1\, n^{-2\nu/d}$ in the case where $\mathbb{X} = [0, 1]^d$. This readily proves the lower bound in (4), since any $\mathbb{X}$ with non-empty interior contains a hypercube on which Ritter's result holds.
If $\mathbb{X}$ is a bounded Lipschitz domain satisfying an interior cone condition, then (Narcowich et al., 2005, Proposition 3.2) there exists $c_1 > 0$ such that
$$\big\| S(f) - \widehat{S}_n(f) \big\|_{L^\infty(\mathbb{X})} \le c_1\, h_n^{s - d/2}\, \big\| S(f) \big\|_{W_2^s(\mathbb{X})}$$
for all $f \in \mathcal{H}$, where
$$h_n = \sup_{x \in \mathbb{X}} \min_{i \in \{1, \ldots, n\}} \big\| x - X_i(f) \big\|_2$$
is the fill distance of the non-adaptive strategy $\underline{X}_n$ in $\mathbb{X}$. Therefore
$$\big\| S(f) - \widehat{S}_n(f) \big\|_{L^\infty(\mathbb{X})} \le c_1\, h_n^{\nu}\, \big\| S(f) \big\|_{W_2^s(\mathbb{X})} \le c_1\, h_n^{\nu}\, \| f \|_{W_2^s(\mathbb{R}^d)} \le c_2\, h_n^{\nu}\, \| f \|_{\mathcal{H}}$$
for some $c_2 > 0$, using the equivalence of the Sobolev $W_2^s(\mathbb{R}^d)$ norm with the RKHS norm (see Section 2). Considering any non-adaptive space-filling strategy $\underline{X}_n$ with fill distance $h_n = O(n^{-1/d})$ yields
$$\inf_{\underline{X}_n \in \mathcal{A}_n^0} \sup_{f \in \mathcal{H}_1} \big\| f - \widehat{S}_n f \big\|_{L^\infty(\mathbb{X})} \le c_3\, n^{-\nu/d}$$
for some $c_3 > 0$, and the upper bound (5) then follows from Proposition 3.

Finding a non-adaptive MMSE-optimal design is a difficult non-convex optimization problem in $nd$ dimensions. Instead of addressing directly such a high-dimensional global optimization problem, we can use the classical sequential non-adaptive greedy strategy $\underline{X}_n(\cdot) = (x_1, \ldots, x_n) \in \mathbb{X}^n$ defined by
(6) $$x_{i+1} = \operatorname*{argmax}_{x \in \mathbb{X}} \sigma^2(x; x_1, \ldots, x_i), \quad 1 \le i < n.$$
Of course, this strategy is suboptimal, but it only involves simpler optimization problems in $d$ dimensions and has the advantage that it can be stopped at any time. Following Binev et al. (2010), it can be established that this greedy strategy is rate-optimal.

Proposition 5. Assume that $\xi$ has a continuous covariance function satisfying the regularity assumptions of Section 2, and let $\nu = s - d/2 > 0$. Let $\underline{X}_n$ be the sequential strategy defined by (6). Then $\epsilon_{\mathrm{mmse}}(\underline{X}_n) = O(n^{-2\nu/d})$.

Proof. Theorem 3.1 in Binev et al. (2010), applied to the compact subset $\{ \xi(x),\ x \in \mathbb{X} \}$ of $L^2(\Omega, \mathcal{B}, P_0)$, states that the greedy algorithm (6) preserves polynomial rates of decay.
The result follows from Proposition 4.

4 Optimization

In this section, we consider the problem of global optimization on a compact domain $\mathbb{X} \subset \mathbb{R}^d$, which corresponds formally to operators $S$ and $\widehat{S}_n$ defined by $S(\xi) = \sup_{x \in \mathbb{X}} \xi(x)$ and $\widehat{S}_n(\xi) = \max_{i \in \{1, \ldots, n\}} \xi(X_i(\xi))$. In a Bayesian setting, a classical criterion to assess the performance of an optimization procedure is the average error
$$\epsilon_{\mathrm{opt}}(\underline{X}_n) := \mathbb{E}\big( S(\xi) - \widehat{S}_n(\xi) \big).$$
Although it may not be possible in the context of this article to make a comprehensive review of known results concerning the average case in the Gaussian setting, it can safely be said, however, that such results are scarce and specific. In fact, most available results about the average-case error concern the one-dimensional Wiener process $\xi$ on the interval $[0, 1]$. Under this setting, Ritter (1990) shows that the average error of the best non-adaptive optimization procedure decreases at rate $n^{-1/2}$ (extensions of this result for non-adaptive algorithms and the $r$-fold Wiener measure can be found in Wasilkowski, 1992). Under the same assumptions on $\xi$, Calvin (1997) derives the exact limiting distribution of the error of a particular adaptive algorithm, which suggests that adaptivity does yield a better average error for the optimization problem: the result is that, for any $0 < \delta < 1$, it is possible to find an adaptive strategy such that $n^{1-\delta}\, \big( S(\xi) - \widehat{S}_n(\xi) \big)$ converges in distribution.

A theoretical result concerning the optimal average-error criterion for less restrictive Gaussian priors is also available. If the covariance of a Gaussian process $\xi$ is $\alpha$-Hölder continuous, then Grünewälder et al. (2010) show that a space-filling strategy $\underline{X}_n$ achieves
(7) $$\epsilon_{\mathrm{opt}}(\underline{X}_n) = O\big( n^{-\alpha/(2d)} (\log n)^{1/2} \big).$$
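The average error $\epsilon_{\mathrm{opt}}$ is easy to estimate by Monte Carlo simulation in the Wiener-process setting discussed above. The sketch below is an illustrative setup, not taken from any of the cited papers: it simulates Brownian motion on a fine grid and measures the average gap between the supremum of the path and the best value seen by a uniform non-adaptive design. The estimates decay roughly like $n^{-1/2}$ as the design grows, consistent with Ritter (1990):

```python
import numpy as np

def avg_opt_error(n, n_paths=3000, m=1024, seed=0):
    """Monte Carlo estimate of eps_opt(X_n) = E[ sup xi - max_i xi(X_i) ]
    for a uniform non-adaptive design of size n, with xi a standard
    Brownian motion on [0, 1] discretized on an m-point grid.
    Grid size, path count, and seed are arbitrary illustrative choices."""
    rng = np.random.default_rng(seed)
    idx = np.linspace(0, m - 1, n).astype(int)  # uniform design on the grid
    # simulate all sample paths at once as cumulative sums of increments
    paths = np.cumsum(rng.standard_normal((n_paths, m)), axis=1) / np.sqrt(m)
    # average gap between the (discretized) supremum and the best observation
    return float(np.mean(paths.max(axis=1) - paths[:, idx].max(axis=1)))
```

Doubling the number of evaluation points should shrink the estimated error by roughly $\sqrt{2}$, up to Monte Carlo noise and discretization bias.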
Thus, under the assumptions of Section 2, for a Matérn covariance with regularity parameter $\nu$, the rate of the optimal average error of estimation of the optimum is less than $n^{-\nu/d} (\log n)^{1/2}$ (since a Matérn covariance is $\alpha$-Hölder continuous with $\alpha = 2\nu$). Note that this bound is not sharp in general, since the optimal non-adaptive rate is $n^{-1/2}$ for the Brownian motion on $[0, 1]$, the covariance function of which is $\alpha$-Hölder continuous with $\alpha = 1$.

In view of these results, we can safely say that characterizing the average behavior of adaptive sequential optimization algorithms is still an open (and apparently difficult) problem. At present, the only way to draw useful conclusions about the interest of a particular optimization algorithm is to resort to numerical simulations. Empirical studies such as the one presented in Benassi et al. (2011), for instance, are therefore very useful from a practical point of view, since they make it possible to obtain fine and sound performance assessments of any strategy at a reasonable computational cost.

References

R. Benassi, J. Bect, and E. Vazquez. Robust Gaussian process-based global optimization using a fully Bayesian expected improvement criterion. In Proceedings of the 5th Learning and Intelligent Optimization Conference (LION 5), Rome, 2011.

P. Binev, A. Cohen, W. Dahmen, R. DeVore, G. Petrova, and P. Wojtaszczyk. Convergence Rates for Greedy Algorithms in Reduced Basis Methods. IGPM Report 310, RWTH Aachen, 2010.

J. M. Calvin. Average performance of a class of adaptive algorithms for global optimization. The Annals of Applied Probability, 7(3):711–730, 1997.

J.-P. Chilès and P. Delfiner. Geostatistics: Modeling Spatial Uncertainty. Wiley, New York, 1999.

C. Currin, T. Mitchell, M. Morris, and D. Ylvisaker.
Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. J. Amer. Statist. Assoc., pages 953–963, 1991.

S. Grünewälder, J.-Y. Audibert, M. Opper, and J. Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

F. J. Narcowich, J. D. Ward, and H. Wendland. Sobolev bounds on functions with scattered zeros, with applications to radial basis function surface fitting. Math. Comp., 74:743–763, 2005.

E. Novak. Deterministic and Stochastic Error Bounds in Numerical Analysis, volume 1349 of Lecture Notes in Mathematics. Springer-Verlag, 1988.

K. Ritter. Approximation and optimization on the Wiener space. Journal of Complexity, 6(4):337–364, 1990.

K. Ritter. Average-Case Analysis of Numerical Problems, volume 1733 of Lecture Notes in Mathematics. Springer-Verlag, 2000.

J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statist. Sci., 4(4):409–435, 1989.

T. J. Santner, B. J. Williams, and W. I. Notz. The Design and Analysis of Computer Experiments. Springer, 2003.

M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York, 1999.

G. W. Wasilkowski. On average complexity of global optimization problems. Mathematical Programming, 57(1):313–324, 1992.

W. J. Welch, R. J. Buck, J. Sacks, H. P. Wynn, T. J. Mitchell, and M. D. Morris. Screening, predicting, and computer experiments. Technometrics, 34(1):15–25, 1992.

H. Wendland. Scattered Data Approximation. Monographs on Applied and Computational Mathematics. Cambridge Univ. Press, Cambridge, 2005.