Maximin design on non-hypercube domain and kernel interpolation

Yves Auffray
Dassault Aviation & Département de Mathématiques, Université Paris-Sud, France

Pierre Barbillon
INRIA Saclay, projet select, Département de Mathématiques, Université Paris-Sud, France

Jean-Michel Marin (corresponding author: place Eugène Bataillon, Case Courrier 051, 34095 Montpellier cedex 5; jean-michel.marin@univ-montp2.fr)
Institut de Mathématiques et Modélisation de Montpellier, Université Montpellier 2

November 4, 2018

Abstract

In the paradigm of computer experiments, the choice of an experimental design is an important issue. When no information is available about the black-box function to be approximated, an exploratory design has to be used. In this context, two dispersion criteria are usually considered: the minimax and the maximin ones. In the case of a hypercube domain, a standard strategy consists of taking the maximin design within the class of Latin hypercube designs. However, in a non-hypercube context, it does not make sense to use the Latin hypercube strategy. Moreover, whatever the design is, the black-box function is typically approximated by kernel interpolation. Here, we first provide a theoretical justification of the maximin criterion with respect to kernel interpolation. Then, we propose simulated annealing algorithms to determine maximin designs in any bounded connected domain, and we prove the convergence of the different schemes. Finally, the methodology is applied to a challenging real example where the black-box function describes the behaviour of an aircraft engine.

Keywords: computer experiments, kernel interpolation, Kriging, maximin designs, simulated annealing.

1 Introduction

A function f : E → R is said to be an expensive black-box function if f is only known through a time-consuming code.
It is assumed that E is enclosed in a known bounded subset of R^d. E is not necessarily a hypercube domain, nor even explicit: E may be given through an indicator function only. It is assumed that testing whether a point belongs to E is not burdensome. Hence, simulating a point in E can be performed by rejection sampling. In order to deal with concerns such as pre-visualization, prediction, optimization and probabilistic analysis depending on f, an approximation of f is usually used. This is the paradigm of computer experiments (Santner et al., 2003; Fang et al., 2005), where the unknown function f is deterministic. An approximation of f can be obtained using a kernel interpolation method (Schaback, 1995, 2007). For the corresponding covariance structure, the Best Linear Unbiased Predictor (BLUP) given by Kriging metamodeling (Matheron, 1963) provides the same approximation of f. Due to its flexibility and its adaptivity to non-linear functions, Kriging is one of the approximation methods most used by the computer experiments community. In our tests, we have observed that Kriging works well for approximating a large class of non-linear functions when the input space is of dimension less than ten. For more details on Kriging, one can see for instance: Sacks et al. (1989b,a); Cressie (1993); Laslett (1994); Stein (1999, 2002); Li and Sudjianto (2005); Joseph (2006); den Hertog et al. (2006).

The kernel interpolation methodology requires the choice of a kernel K (satisfying some conditions detailed below) and of a design X = {x_1, ..., x_N} where the function f is to be evaluated, giving {f(x_1), ..., f(x_N)}. As is well known, a space of functions H_K is associated with K. If the function f belongs to H_K, we can control the pointwise error of s_{K,X}(f), the interpolator of f on X.
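As noted above, sampling uniformly in E only requires the indicator function and the enclosing box. A minimal sketch of rejection sampling under these assumptions (the function and parameter names are ours, for illustration):

```python
import numpy as np

def sample_in_domain(indicator, lower, upper, rng=None):
    """Draw one point uniformly in E by rejection from the enclosing box.

    `indicator(x)` is assumed to return True iff x belongs to E;
    `lower` and `upper` are the bounds of the known enclosing box.
    """
    rng = np.random.default_rng() if rng is None else rng
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    while True:
        x = rng.uniform(lower, upper)   # uniform draw in the box
        if indicator(x):                # keep it only if it falls in E
            return x

# Example: the triangular domain {(x1, x2) in [0, 1]^2 : x1 >= x2}
point = sample_in_domain(lambda x: x[0] >= x[1], [0.0, 0.0], [1.0, 1.0])
```

The acceptance rate equals Vol(E)/Vol(box), so the scheme is cheap exactly when E fills a reasonable fraction of its bounding box.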
In this deterministic paradigm (the function f is not random), there are essentially two main kinds of properties that a design can have (Koehler and Owen, 1996):

• projection properties, such as those of Latin hypercube designs (McKay et al., 1979) or of their generalization, Latin hyper-rectangle sampling, which allows for non-equal cell probabilities (Mease and Bingham, 2006);

• exploratory properties, which are warranted by criteria such as:

– minimax, which means that the design has to minimize

    h_X = sup_{y ∈ E} min_{1 ≤ i ≤ N} ||y − x_i||,   (1)

– maximin, which means that the design has to maximize

    δ_X = min_{1 ≤ i < j ≤ N} ||x_i − x_j||.   (2)

  Moreover, between two designs X_1 and X_2 such that δ_{X_1} = δ_{X_2}, using the maximin criterion, we choose the design for which the number of pairs of points with distance equal to δ_{X_1} is minimal;

– Mean Squared Error (MSE) based criteria. These criteria are linked to the MSE of the BLUP in the context of Kriging metamodeling. Designs can be sought to minimize the Integrated Mean Squared Error (IMSE) of the BLUP over the domain E, or to minimize the maximum of the MSE over E.

The minimax and maximin criteria were proposed for Kriging metamodeling by Johnson et al. (1990). Sacks et al. (1989a) have detailed the MSE-based criteria. For other criteria, one can see Bursztyn and Steinberg (2006). For kernels defined by radial basis functions, Schaback (1995) and Madych and Nelson (1992) have shown that the minimax criterion h_X explicitly intervenes in an upper bound on the pointwise error between f and s_{K,X}(f). The upper bound has the form G(h_X), where G is an increasing function R^+ → R^+. Here, we prove that h_X ≤ δ_X and, consequently, that a maximin design also provides a uniform upper bound on the pointwise error. The minimax and IMSE criteria are costly to evaluate and, typically, the maximin criterion is privileged.
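The cost asymmetry between (1) and (2) is easy to see in code: δ_X is an exact finite minimization over pairs, whereas h_X involves a supremum over E, which we approximate below by a maximum over a large uniform sample of E (the Monte Carlo stand-in is our illustration, not a procedure from the paper):

```python
import numpy as np

def maximin_criterion(X):
    """delta_X of (2): smallest distance among pairs of design points (N x d array)."""
    i, j = np.triu_indices(len(X), k=1)
    return np.sqrt(((X[i] - X[j]) ** 2).sum(-1)).min()

def minimax_criterion(X, candidates):
    """Approximation of h_X of (1): the sup over E is replaced by a max over
    a large uniform sample `candidates` of E."""
    d = np.sqrt(((candidates[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return d.min(axis=1).max()   # farthest candidate from its nearest design point

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))        # a random 20-point design in [0, 1]^2
cand = rng.uniform(size=(5000, 2))   # candidate sample standing in for E
delta = maximin_criterion(X)         # criterion (2), to be maximized
h = minimax_criterion(X, cand)       # criterion (1), to be minimized
```

For a random design both values are far from optimal; the point of the algorithms below is to drive δ_X up.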
In the case where E is a hypercubic set, Morris and Mitchell (1995) provided an algorithm based on simulated annealing to obtain a design very close to a maximin Latin hypercube design (the criterion optimized is not exactly the maximin one). For the two-dimensional case, van Dam et al. (2007) derived explicit constructions of maximin Latin hypercube designs when the distance measure is L∞ or L1. For the L2 distance measure, using a branch-and-bound algorithm, they obtained maximin Latin hypercube designs for N ≤ 70.

For some non-hypercubic domains, the use of projection properties can lead to poor exploratory designs. For instance, if E = {(x_1, x_2) ∈ [0, 1]^2 : x_1 ≥ x_2}, then the only Latin hypercube design lies on the line x_1 = x_2. Moreover, in some cases, projection properties are impossible to satisfy. Therefore, we focus on exploratory properties only. In the case of an explicitly constrained subset of [0, 1]^d, Stinstra et al. (2003) proposed an algorithm based on the use of NLP solvers. Here, we propose algorithms to achieve a maximin design on general (even non-explicit) non-hypercubic domains. Our schemes are based on simulated annealing. Our proposals are not heuristic: we study the convergence properties of the proposed schemes.

Recall that the simulated annealing algorithm aims at finding a global extremum of a function by using a Markovian kernel which is the composition of an exploratory kernel and an acceptance step depending on a temperature which decreases along the iterations. In some presentations (e.g. Bartoli and Del Moral (2001)), the simulated annealing algorithm is based on a Metropolis-Hastings algorithm (Chib and Greenberg, 1995). In that case, it can be proved that the resulting Markov chain tends to concentrate on a global extremum of the function to be optimized with high probability as the number of iterations tends to infinity.
Here, we introduce a simulated annealing scheme based on a Metropolis-within-Gibbs algorithm (Roberts and Rosenthal, 2006) and prove its convergence.

The paper is organized as follows. In Section 2, the kernel interpolation method is described and a theoretical justification of the minimax and maximin criteria is provided, thanks to the pointwise error bound between the interpolator and the function f. Then, in Section 3, the simulated annealing algorithm is presented and a proof of convergence is given. Section 4 deals with the case where E is not explicit and can only be known through an indicator function: two variants of the algorithm are proposed and their theoretical properties are stated. In Section 5, the algorithms are tried on some examples and practical issues are discussed. Finally, in a last section, the methodology is applied to a real example for which the domain is not a hypercube.

2 Error bounds with kernel interpolations

A kernel is a symmetric function K : E × E → R, where E is the input space, which is assumed to be bounded. The kernel has to be at least conditionally positive definite to be used in kernel interpolation. For the sake of simplicity, kernel interpolation is presented for positive definite kernels only. R^E denotes the space of functions from E to R.

Definition 2.1. A kernel K is positive definite if, for all (ζ_1, x_1), ..., (ζ_N, x_N) ∈ R × E,

    Σ_{1 ≤ l,m ≤ N} ζ_l ζ_m K(x_l, x_m) ≥ 0.

For any x ∈ E, let K_x denote the partial function x' ∈ E ↦ K(x, x') ∈ R. The linear combinations of functions taken in {K_x, x ∈ E} span a functional pre-Hilbert space F_K, where

    < Σ_{l=1}^{L} ζ_l K_{x_l} , Σ_{m=1}^{M} μ_m K_{x'_m} >_{F_K} = Σ_{m=1}^{M} Σ_{l=1}^{L} ζ_l μ_m K(x_l, x'_m)

is the scalar product. Aronszajn's theorem states that there exists a unique space H_K which is a completion of F_K and in which the following reproducing property holds:

    ∀ f ∈ H_K, ∀ x ∈ E,  f(x) = < f , K_x >_{H_K}.
H_K is called a Reproducing Kernel Hilbert Space (RKHS). Let us denote by s_{K,X}(f) the orthogonal projection of f on H_K(X) = span{K_{x_1}, ..., K_{x_N}} (f is assumed to be in H_K; X = {x_1, ..., x_N} and K are given).

Lemma 2.1. s_{K,X}(f) interpolates f on X. Among the interpolators of f on X, s_{K,X}(f) has the smallest norm: s_{K,X}(f) is the solution of the following problem

    min_{g ∈ H_K} ||g||_{H_K}  subject to  g(x_k) = f(x_k), k = 1, ..., N.

This interpolator corresponds to the BLUP in the Kriging literature (Cressie, 1993; Stein, 2002). It has a Lagrangian formulation.

Lemma 2.2. For any x ∈ E,

    s_{K,X}(f)(x) = Σ_{i=1}^{N} u_i(x) f(x_i),

where the functions (u_i : E → R) ∈ H_K(X) are such that, for all 1 ≤ i ≤ N,

    u_i(x_i) = 1  and  u_i(x_k) = 0 if k ≠ i,

and

    K[X, x] = K[X, X] U(x),

where U(x) = (u_1(x), ..., u_N(x))^T, K[X, x] = (K(x_1, x), ..., K(x_N, x))^T, and K[X, X] is such that (K[X, X])_{1 ≤ i,j ≤ N} = K(x_i, x_j).

Hence, the pointwise error can be bounded from above: for all x ∈ E,

    |f(x) − s_{K,X}(f)(x)| = |< f , K_x − Σ_{i=1}^{N} u_i(x) K_{x_i} >_{H_K}|
                           ≤ ||f||_{H_K} ||K_x − Σ_{i=1}^{N} u_i(x) K_{x_i}||_{H_K}.

Let P_X(x) = ||K_x − Σ_{i=1}^{N} u_i(x) K_{x_i}||_{H_K}. P_X depends only on the kernel K and on the design X, and corresponds to the MSE of the BLUP. When it is integrated over the domain E, we obtain the IMSE criterion. As already explained, the IMSE or the maximum MSE can be minimized to determine an exploratory design. However, these criteria depend on the kernel and are costly to compute.

For kernels K defined by K(x, x') = φ(||x − x'||_2), where ||x||_2 = (Σ_{i=1}^{d} (x^i)^2)^{1/2} (x ∈ E) and φ : R^+ → R, Schaback (1995) provides the following upper bound on P_X(x):

    P_X(x) ≤ G_K(h_X).

The quantity h_X = sup_{y ∈ E} min_{1 ≤ i ≤ N} ||y − x_i|| is associated to the minimax criterion.
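Lemma 2.2 reduces the interpolator to a linear solve: the cardinal weights U(x) satisfy K[X,X] U(x) = K[X,x], and the prediction is Σ_i u_i(x) f(x_i). A minimal sketch with a Gaussian kernel (the θ value, the test function, and the stabilizing nugget are our illustration choices, not the paper's):

```python
import numpy as np

def gaussian_kernel(A, B, theta=30.0):
    """K(x, x') = exp(-theta * ||x - x'||_2^2) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-theta * sq)

def kernel_interpolator(X, fX, theta=30.0, nugget=1e-10):
    """Build s_{K,X}(f): x -> sum_i u_i(x) f(x_i), with K[X,X] U(x) = K[X,x].

    The tiny `nugget` added on the diagonal is a numerical-stability device
    of ours, not part of the paper's formulation.
    """
    KXX = gaussian_kernel(X, X, theta) + nugget * np.eye(len(X))
    def s(x):
        KXx = gaussian_kernel(X, np.atleast_2d(np.asarray(x, float)), theta)  # K[X, x]
        U = np.linalg.solve(KXX, KXx)   # cardinal weights u_1(x), ..., u_N(x)
        return float(U[:, 0] @ fX)
    return s

# Interpolate f(x1, x2) = sin(5 x1) + x2^2 from 40 scattered points in [0, 1]^2
rng = np.random.default_rng(1)
X = rng.uniform(size=(40, 2))
f = lambda x: np.sin(5 * x[..., 0]) + x[..., 1] ** 2
s = kernel_interpolator(X, f(X))
```

By construction s reproduces f at the design points, which is the interpolation property of Lemma 2.1.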
G_K is an increasing function, obviously depending on the kernel. The smoother the kernel K, the faster G_K(h) tends to 0 as h → 0. For instance, the Gaussian kernel is defined by K(x, x') = e^{−θ ||x − x'||_2^2}, where θ is a positive real parameter; in that case, G_K(h) = C e^{−δ/h^2}, where C and δ are constants depending on θ. By choosing a design X with a low h_X, one thus ensures a small pointwise error independently of the chosen (radially symmetric) kernel. This justifies the use of minimax (1) optimal designs. The next proposition shows that a bound on the pointwise interpolation error is still guaranteed when a maximin optimal design is used.

Proposition 2.1. If X is a maximin design, E is enclosed in the union of the balls of center x_i and of radius δ_X = min_{1 ≤ i < j ≤ N} ||x_i − x_j||.

Proof. This proposition is proved by contradiction. Let X be a maximin design and suppose that there exists a point x_0 ∈ E such that ||x_0 − x_i|| > δ_X for all x_i ∈ X. Let (x_{i_0}, x_{j_0}) ∈ X^2 be a pair of points such that ||x_{i_0} − x_{j_0}|| = δ_X, and construct the design X' = {x_1, ..., x_{i_0−1}, x_0, x_{i_0+1}, ..., x_N}, where the point x_{i_0} is replaced by the point x_0. Then δ_{X'} ≥ δ_X and, in the case δ_{X'} = δ_X, X' is better than X with respect to the maximin criterion, because X' contains fewer pairs of points whose distance equals δ_X. This contradicts the assumption that X is a maximin design. Hence, for any x ∈ E, there exists an x_i ∈ X such that ||x − x_i|| ≤ δ_X. □

As a consequence of this proposition, if X is a maximin design,

    |f(x) − s_{K,X}(f)(x)| ≤ ||f||_{H_K} G_K(δ_X).

This result theoretically justifies the use of maximin designs when a kernel interpolation is used as an approximation of f.
Besides, it proves that the interpolation based on a maximin design is consistent, since δ_{X_N} tends to 0 for a sequence (X_N)_{N∈N} of maximin designs of respectively N points.

3 Computing maximin designs

In this section, we propose an algorithm providing a maximin design with N points in any set E enclosed in a bounded set. It is based on a simulated annealing method and aims at finding the global minimum of the function

    U : E^N → R^+,  U(X) = diam(E) − δ_X,

where diam(E) is the diameter of the set E (diam(E) = max_{x, x' ∈ E} ||x − x'||). Obviously, minimizing U is equivalent to maximizing δ : X ↦ δ_X.

The initialization step consists of simulating uniformly a large number of points in the domain E (using rejection sampling) and of computing the corresponding empirical covariance matrix, denoted by Σ. At the end of the initialization step, we randomly keep N points, denoted by X^(0) = {x^(0)_1, ..., x^(0)_N}. An inverse cooling schedule β : t ↦ β_t (i.e. (β_t)_{t≥0} is an increasing positive sequence and lim_{t→∞} β_t = ∞) is chosen in order to ensure the convergence of the algorithm. The parameter τ is a variance parameter which is allowed to change during the iterations but, at each iteration, τ is such that τ_0 ≥ τ ≥ τ_min. A parameter γ > 0 is fixed to a very small value to prevent numerical problems. We propose to iterate the following steps, for t = 1, ...:

Algorithm 1.
1. A pair of points (x^(t)_i, x^(t)_j) is drawn in X^(t) according to a multinomial distribution with probabilities proportional to 1/(||x_i − x_j|| + γ);
2. One of the two points is chosen with probability 1/2; it is denoted by x^(t)_k;
3. A constrained Gaussian random walk is used to propose a new point:

    x^prop_k ∼ N^d_E(x^(t)_k, τΣ),

   x^prop_k being constrained to belong to E. The proposed design is denoted by X^prop = {x^(t)_1, ..., x^(t)_{k−1}, x^prop_k, x^(t)_{k+1}, ..., x^(t)_N};
4. X^(t+1) = X^prop with probability

    min( 1 , exp(−β_t (U(X^prop) − U(X^(t)))) q_{τ,k}(X^prop, X^(t)) / q_{τ,k}(X^(t), X^prop) ),

   otherwise X^(t+1) = X^(t).

The idea behind this proposal is to force the pairs of points which are very close to become more distant. In order to make the proposal kernel Q_τ(X, dY) explicit, where dY is an infinitesimal neighborhood of the state Y, let us introduce some notations:

• d^X_{i,j} = 1/(||x_i − x_j|| + γ),
• D_X = Σ_{k,l : k<l} d^X_{k,l}.

For any X, Y ∈ E^N, the proposal kernel reads

    Q_τ(X, dY) = Σ_{i=1}^{N} q_{τ,i}(X, Y) λ(dy_i) Dir_{X_{−i}}(dY_{−i}),

where Dir_{X_{−i}} denotes the Dirac mass at the unchanged components X_{−i} and, for i = 1, ..., N,

    q_{τ,i}(X, Y) = G^{−1}_{x_i,τΣ} φ(y_i | x_i, τΣ) 1_E(y_i) ( Σ_{j : j≠i} (1/2) d^X_{i,j} / D_X ),

with φ(· | m, S) the Gaussian density with mean m and covariance matrix S, and G_{m,S} the normalization constant of this density restricted to E. The global kernel of Algorithm 1, denoted K^{A1}_{β,τ}, combines this proposal with the acceptance probabilities a_{β,τ,i}(X, Y) of step 4.

Proposition 3.1. For all non-decreasing sequences 0 ≤ β_1 ≤ ... ≤ β_N and all sequences (τ_i)_{1≤i≤N} such that τ_0 ≥ τ_i ≥ τ_min, we have, for all (X, A) ∈ E^N × B(E^N),

    (K^{A1}_{β_1,τ_1} ··· K^{A1}_{β_N,τ_N})(X, A) ≥ e^{−Nβ_N osc(U)} (q_min/q_max)^N q_min^N λ^N(A),

where q_max ≥ q_min > 0 and osc(U) is the smallest positive number h such that, for all X, Y in E^N, U(Y) − U(X) ≤ h.

Proof. Let us first prove that, for all X ∈ E^N and i = 1, ..., N, q_{τ,i}(X, ·) ≥ q_min > 0 and q_{τ,i}(X, ·) ≤ q_max, Q_τ(X, ·)-almost everywhere on E^N. The fact that q_{τ,i}(X, ·) ≤ q_max holds since the normalization constants are lower-bounded, the Gaussian densities are uniformly bounded since τ_0 ≥ τ ≥ τ_min > 0, and all the other terms can be upper-bounded by 1. The other assertion is only true Q_τ(X, ·)-almost everywhere on E^N: the lower bound on q_{τ,i}(X, Y) holds when X and Y are such that X_{−i} = Y_{−i} and are both in E^N. The following lower bounds are used:

• G^{−1}_{x_i,τΣ} ≥ 1,
• Σ_{j : j≠i} (1/2) d^X_{i,j}/D_X ≥ γ / (N (diam(E) + γ)),
• φ(y | x, τΣ) ≥ (2π)^{−d/2} |Σ|^{−1/2} τ_0^{−d/2} exp(−(1/2) τ_min^{−1} diam(E)^2 ξ), where ξ is the largest eigenvalue of Σ^{−1}.

q_min > 0 is obtained by multiplying these expressions and, for any i = 1, ..., N, it is a lower bound of q_{τ,i}(X, Y) which depends neither on τ nor on the states, provided X ∈ E^N and Y ∈ E^N have at least N − 1 points in common. By definition of osc(U), for all X ∈ E^N and Q_τ(X, ·)-almost every Y ∈ E^N,

    a_{β,τ,i}(X, Y) ≥ e^{−β osc(U)} q_min / q_max.

Thus, for (X, A) ∈ E^N × B(E^N), K^{A1}_{β,τ}(X, A) ≥ e^{−β osc(U)} (q_min/q_max) Q_τ(X, A).
Then, for all non-decreasing sequences 0 ≤ β_1 ≤ ... ≤ β_N and for a sequence (τ_i)_{1≤i≤N} such that τ_0 ≥ τ_i ≥ τ_min, we have, for all (X, A) ∈ E^N × B(E^N),

    (K^{A1}_{β_1,τ_1} ··· K^{A1}_{β_N,τ_N})(X, A) ≥ e^{−(β_1 + ··· + β_N) osc(U)} (q_min/q_max)^N (Q_{τ_1} ··· Q_{τ_N})(X, A)
                                                 ≥ e^{−Nβ_N osc(U)} (q_min/q_max)^N (Q_{τ_1} ··· Q_{τ_N})(X, A)
                                                 ≥ e^{−Nβ_N osc(U)} (q_min/q_max)^N q_min^N λ^N(A). □

U : E^N → R^+ is lower-bounded with respect to λ^N (the Lebesgue measure on the compact set E^N). We use the notation m = sup_a [a ; λ^N({X ; U(X) < a}) = 0]; by definition, λ^N({X ; U(X) < m}) = 0. Moreover, for all ε > 0, we define U^ε_{λ^N} = {X ∈ E^N ; U(X) ≤ m + ε}, which is clearly such that λ^N(U^ε_{λ^N}) > 0, and U^{ε,c}_{λ^N} = {X ; U(X) > m + ε}. As stated in Bartoli and Del Moral (2001), for the Boltzmann measure μ_β (with density proportional to exp(−βU) with respect to λ^N),

    ∀ ε > 0,  lim_{β→∞} μ_β(U^ε_{λ^N}) = 1.

Following the proof of Theorem 4.3.16 in Bartoli and Del Moral (2001) and using Proposition 3.1, we obtain the convergence of this algorithm.

Theorem 3.1. If the sequence (τ_n)_{n≥0} is such that, for all n ≥ 0, τ_0 ≥ τ_n ≥ τ_min > 0, and if β_n = (1/C) log(n + e) with C > N osc(U), then

    ∀ ε > 0,  lim_{n→∞} P_η(X^(n) ∈ U^ε_{λ^N}) = 1,

where {X^(n) ; n ≥ 0} denotes the random sequence produced by Algorithm 1 with an initial probability distribution η on E^N.

Unfortunately, the function U is not regular enough to estimate the convergence speed.

4 Variants of the algorithm

In the case where E is not explicit, the normalization constant G_{m,S} of a Gaussian distribution with mean m and covariance matrix S restricted to E cannot be computed. Hence, the ratio of densities of proposal kernels is not tractable. In that case, we first propose to use as a proposal an unconstrained Gaussian random walk. Steps 3 and 4 of Algorithm 1 are modified.

Algorithm 2. The first steps until step 3 are the same. Step 3 is replaced with 3bis.
3bis. A Gaussian random walk is used to propose a new point:

    x^prop_k ∼ N^d(x^(t)_k, τΣ).

And step 4 is replaced with

4bis. If X^prop ∈ E^N, X^(t+1) = X^prop with probability

    min( 1 , exp(−β_t (U(X^prop) − U(X^(t)))) q̃_{τ,k}(X^prop, X^(t)) / q̃_{τ,k}(X^(t), X^prop) ),

otherwise X^(t+1) = X^(t).

The proposal kernel corresponding to this algorithm, where the Gaussian random walk is not constrained to remain in the domain E, reads: for any X ∈ E^N, Y ∈ (R^d)^N,

    Q_τ(X, dY) = Σ_{i=1}^{N} q̃_{τ,i}(X, Y) λ(dy_i) Dir_{X_{−i}}(dY_{−i}),

where, for i = 1, ..., N,

    q̃_{τ,i}(X, Y) = φ(y_i | x_i, τΣ) ( Σ_{j : j≠i} (1/2) d^X_{i,j} / D_X ).

As for Algorithm 1, if (τ_n)_{n≥0} is such that, for all n ≥ 0, τ_0 ≥ τ_n ≥ τ_min > 0, and if β_n = (1/C) log(n + e) with C > N osc(U), the random sequence produced by Algorithm 2 satisfies, for all ε > 0, lim_{n→∞} P_η(X^(n) ∈ U^ε_{λ^N}) = 1. However, since a point can be proposed outside of the domain E, this algorithm can suffer from a lack of efficiency. Another solution is to use the first algorithm without the ratio of densities of proposal kernels.

Algorithm 3. The first steps until step 4 are the same as in Algorithm 1. Step 4 is replaced with

4ter. X^(t+1) = X^prop with probability

    min( 1 , exp(−β_t (U(X^prop) − U(X^(t)))) ),

otherwise X^(t+1) = X^(t).

The global kernel associated with Algorithm 3 is

    K^{A3}_{β,τ}(X, dY) = b_{β,τ}(X, Y) Q_τ(X, dY) + ( 1 − ∫_{E^N} b_{β,τ}(X, Z) Q_τ(X, dZ) ) δ_X(dY),

where

    b_{β,τ}(X, Y) = min( 1 , exp(−β U(Y)) / exp(−β U(X)) ).

The measure μ_β is not K^{A3}_{β,τ}-invariant. Hence, we cannot use the Markov chain convergence theory to obtain a result similar to Theorem 3.1. However, μ_β is K^{A3}_{β,τ}-irreducible and we can easily state the following proposition.

Proposition 4.1. For all non-decreasing sequences 0 ≤ β_1 ≤ ... ≤ β_N and for a sequence (τ_i)_{1≤i≤N} such that τ_0 ≥ τ_i ≥ τ_min, we have, for all (X, A) ∈ E^N × B(E^N),

    (K^{A3}_{β_1,τ_1} ··· K^{A3}_{β_N,τ_N})(X, A) ≥ α_{A3} e^{−Nβ_N osc(U)} λ^N(A),

where α_{A3} > 0.

As shown in Locatelli (1996), this proposition implies that a design reaching a neighborhood U^ε_{λ^N} of a global maximum of δ_X is attained almost surely in a finite number of iterations when using Algorithm 3.

Proposition 4.2. For any ε > 0, if, for all n ∈ N, β_n ≤ (1/C) log(n + e) with C > N osc(U), the expected time until the first visit in U^ε_{λ^N} is finite.

Proof. The expected time until the first visit in U^ε_{λ^N} is equal to

    Σ_{k=1}^{∞} k P(X^(1), ..., X^(k) ∉ U^ε_{λ^N} | X^(0) ∉ U^ε_{λ^N}) × P(X^(k+1) ∈ U^ε_{λ^N} | X^(0), ..., X^(k) ∉ U^ε_{λ^N}).

The aim is to find an upper bound showing that this quantity is finite. The second probability in the argument of the series is bounded from above by one. The first probability in the argument of the series is the probability of never visiting U^ε_{λ^N} in the first k steps. It can also be written as

    P(X^(1), ..., X^(N) ∉ U^ε_{λ^N} | X^(0) ∉ U^ε_{λ^N}) × ··· × P(X^(⌊k/N⌋N), ..., X^(k) ∉ U^ε_{λ^N} | X^(⌊k/N⌋N−1), ..., X^(0) ∉ U^ε_{λ^N}).

Thanks to Proposition 4.1, it holds that

    P(at least one visit in U^ε_{λ^N} in the first N steps) ≥ P(X^(N) ∈ U^ε_{λ^N}) ≥ α_{A3} λ^N(U^ε_{λ^N}) exp(−β_N N osc(U)).

Thus,

    P(X^(1), ..., X^(N) ∉ U^ε_{λ^N} | X^(0) ∉ U^ε_{λ^N}) ≤ 1 − α_{A3} λ^N(U^ε_{λ^N}) exp(−β_N N osc(U)),

and, in a similar way,

    P(X^(iN+1), ..., X^((i+1)N) ∉ U^ε_{λ^N} | X^(0), ..., X^(iN) ∉ U^ε_{λ^N}) ≤ 1 − α_{A3} λ^N(U^ε_{λ^N}) exp(−β_{N(i+1)} N osc(U)).

Hence, the expected time before the first visit in U^ε_{λ^N} can be bounded from above by

    Σ_{k=1}^{∞} k Π_{i=1}^{⌊k/N⌋} ( 1 − α_{A3} λ^N(U^ε_{λ^N}) exp(−β_{N(i+1)} N osc(U)) ).
As log(1 − 2x) < −x if 0 < x < 1/2, the previous sum is bounded by

    Σ_{k=1}^{∞} k exp( − Σ_{i=1}^{⌊k/N⌋} (α_{A3} λ^N(U^ε_{λ^N}) / 2) exp(−β_{N(i+1)} N osc(U)) ).

If β_n is chosen such that β_n = (1/C) log(n + e) (C > N osc(U)), the sum becomes

    Σ_{k=1}^{∞} k exp( − (α_{A3} λ^N(U^ε_{λ^N}) / 2) Σ_{i=1}^{⌊k/N⌋} ( 1/((i + 1)N) )^{N osc(U)/C} ),

which can be bounded above by

    Σ_{k=1}^{∞} k exp( − (α_{A3} λ^N(U^ε_{λ^N}) / 2) ⌊k/N⌋ ( 1/(k + N) )^{N osc(U)/C} ),

which is a convergent series. □

Since the best design ever found during the iterations is saved, the previous proposition provides a theoretical guarantee for Algorithm 3. Moreover, as a direct consequence, the Markov chain defined by Y^(n) = X^(t_n) with t_n = arg min_{1 ≤ t ≤ n} U(X^(t)) is such that lim_{n→∞} P_η(Y^(n) ∈ U^ε_{λ^N}) = 1. In practice, n is finite and the chosen design is Y^(n) and not X^(n), so we can consider that the previous convergence result is sufficient. However, this kind of result can be obtained with any algorithm producing a Markov chain which visits the state space well, even if the temperature is fixed! Algorithm 3 directly derives from Algorithm 1 and we can expect them to have similar behaviors.

5 Numerical illustrations

First, the three algorithms are tested on three different toy cases: a design with 100 points in [0, 1]^2, a design with 250 points in [0, 1]^5 and a design with 400 points in [0, 1]^8. In these hypercubic cases, the normalization constants can be computed and Algorithm 1 can be used. In each case, 100 calls are made to one million iterations of each algorithm. We observed that the chains produced by the algorithms remain quite stationary after one million iterations. The inverse cooling schedule is β_n = (1/T_0) log(n) and the variance schedule is τ_n = τ_0/√n. In order to choose T_0, a large number of designs with N points can be drawn uniformly in E.
Then, the median of δ_X, the minimum distance between pairs of points, is computed over these designs. This gives access to an order of magnitude of δ_X when X is uniformly distributed. According to our tries, a fraction of this value is a good choice for T_0. Note that it is much lower than the value required by the convergence theorem. For τ_0, we suggest using τ_0 = Vol(E)/N^{1/d}, where Vol(E) is the volume of E or an upper bound on this volume. Clearly, τ_0, which parametrizes the random walk variance, should not exceed Vol(E), and the previous formula is derived by analogy with a grid. For a d-dimensional space, the number of points in a grid reads N = k^d, where k is an integer corresponding to the number of projected points on each axis. Thus, it seems reasonable to divide the volume of the domain by k, or more generally by N^{1/d}, to ensure a good exploration of the space.

Figures 1, 2 and 3 present the results. For each algorithm, boxplots of the best solutions to the maximization of δ_X over one million iterations are given (boxplots are constructed using 100 replicates). Algorithms 1 and 3 give the best results. Algorithm 2 suffers from the fact that the proposal can fall outside of the domain. The computation of U is the most time-consuming step of these algorithms, which is why the comparison is based on the number of iterations.

Cooling schedules other than the ones with theoretical guarantees can be tried. It seems that they can lead to satisfying results, even better than the ones obtained with the log schedule. Since the results depend too much on the examples, it is quite hard to state a general rule. However, a schedule β_n = (1/T_0)√n is robust to a bad choice of T_0, and a schedule τ_n = τ_0/√n performs quite well.
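Putting the pieces together, Algorithm 3 with the initialization and the schedules discussed above can be sketched as follows. This is a condensed illustration, not the authors' implementation: the pool size, the 0.5 fraction used for T_0, the seed, and the use of the box diagonal as an upper bound on diam(E) (which only shifts U by a constant) are our choices.

```python
import numpy as np

def maximin(X):
    """delta_X: smallest pairwise distance within the design X (N x d array)."""
    i, j = np.triu_indices(len(X), k=1)
    return np.sqrt(((X[i] - X[j]) ** 2).sum(-1)).min()

def algorithm3(indicator, lower, upper, N, iters=3000, gamma=1e-6, seed=0):
    """Sketch of Algorithm 3: simulated annealing on U(X) = diam - delta_X,
    for a domain E given by `indicator` inside the box [lower, upper]."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    d = len(lower)
    # Initialization: uniform points in E by rejection; empirical covariance Sigma.
    pool = []
    while len(pool) < max(25 * N, 500):
        x = rng.uniform(lower, upper)
        if indicator(x):
            pool.append(x)
    pool = np.asarray(pool)
    Sigma = np.cov(pool.T)
    X = pool[rng.choice(len(pool), N, replace=False)].copy()
    diam = np.linalg.norm(upper - lower)   # upper bound on diam(E)
    # T0: a fraction of the median delta_X over random N-point designs.
    T0 = 0.5 * np.median([maximin(pool[rng.choice(len(pool), N, replace=False)])
                          for _ in range(20)])
    tau0 = np.prod(upper - lower) / N ** (1.0 / d)
    best, best_val = X.copy(), maximin(X)
    i_idx, j_idx = np.triu_indices(N, k=1)
    for t in range(1, iters + 1):
        beta = np.sqrt(t) / T0        # inverse cooling schedule beta_n = (1/T0) sqrt(n)
        tau = tau0 / np.sqrt(t)       # variance schedule tau_n = tau0 / sqrt(n)
        # Step 1: draw a pair with probability proportional to 1/(distance + gamma).
        dists = np.sqrt(((X[i_idx] - X[j_idx]) ** 2).sum(-1))
        w = 1.0 / (dists + gamma)
        pair = rng.choice(len(w), p=w / w.sum())
        # Step 2: pick one point of the pair with probability 1/2.
        k = i_idx[pair] if rng.random() < 0.5 else j_idx[pair]
        # Step 3: constrained Gaussian random walk (redrawn until it lands in E).
        while True:
            prop = rng.multivariate_normal(X[k], tau * Sigma)
            if np.all(prop >= lower) and np.all(prop <= upper) and indicator(prop):
                break
        Xprop = X.copy()
        Xprop[k] = prop
        # Step 4ter: Metropolis acceptance without the proposal-density ratio.
        dU = (diam - maximin(Xprop)) - (diam - maximin(X))
        if dU <= 0 or rng.random() < np.exp(-beta * dU):
            X = Xprop
            if maximin(X) > best_val:   # keep the best design ever visited
                best, best_val = X.copy(), maximin(X)
    return best, best_val

# Example: 20-point design on the triangle E_T = {(x1, x2) in [0,1]^2 : x1 >= x2}
design, delta = algorithm3(lambda x: x[0] >= x[1], [0.0, 0.0], [1.0, 1.0], N=20)
```

Returning the best design ever visited, rather than the last state, is what the guarantee of Proposition 4.2 covers.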
The variance decreasing schedule is set to freeze at a given n; it thus satisfies the boundedness assumption of the theoretical results.

Finally, in the domain E_T = {(x_1, x_2) ∈ [0, 1]^2 : x_1 > x_2}, four strategies are compared to obtain a design with 100 points: taking a design whose points are realizations of a uniform distribution on E_T; taking an lhs-maximin design in [0, 1]^2 with 200 points thanks to the algorithm of Morris and Mitchell (1995) and keeping only the subset which is in E_T (this gives what we call a truncated lhs-maximin design); using a Sobol' sequence of 100 points constrained to be in E_T; and making use of Algorithm 3. Table 1 displays some statistics on the values of δ_X for 100 replicates. Only the mean is given for the Sobol' sequence since it is a deterministic strategy. The truncated lhs-maximin strategy provides designs with approximately 100 points (between 93 and 107 over the 100 replicates).

                         Mean     Variance    Min        Max
Uniform                  0.0048   8.2·10^−6   4.0·10^−4  0.013
Truncated lhs-maximin    0.034    8.2·10^−6   0.025      0.039
Sobol' sequence          0.011    N/A         N/A        N/A
Algorithm 3              0.080    7.8·10^−8   0.079      0.081

Table 1: Comparison of designs in E_T based on the values of δ_X for 100 replicates.

6 Application to a simulator of an aircraft engine

The behavior of an aircraft engine is described by a numerical code. A run of the code determines whether the given flight conditions are acceptable and, provided they are, computes the corresponding outputs. The function which associates the outputs to the flight conditions is denoted by f. It is accessible only through runs of the code: it is a black-box function and a run is time-consuming (a thousand calls to the code run in ten minutes). The goal is to incorporate the model of the engine into a global model of an aircraft for a preliminary design study.
Since the simulator of the engine is too burdensome, we are asked to compute an approximation of f which can be included in the global model. The acceptable flight conditions constitute the domain of definition of f, denoted by E. Outside E, the code cannot provide outputs, since the conditions are physically impossible or the code encounters convergence failures. E is not explicit: as explained above, we have to run the code to know whether given flight conditions are acceptable. Therefore, we need to estimate E (the indicator function associated with E); this estimation is not our goal here. E is included in a known hypercube (lower and upper bounds are available on each of the variables). Using other prior information and some calls to f, a binary classification tree has been built to determine an estimate of the indicator function of E (Auffray et al., 2011). This method works quite well and leads to a misclassification error rate of around 0.5%. The resulting domain is not a hypercube.

In the following case study, only the flow rate output is considered. The flight conditions are described by ten variables such as altitude, speed, temperature, humidity, etc. A variable selection procedure has shown that only d = 8 input variables are useful for prediction (Auffray et al., 2011). Hence, the function to be approximated is f : E ⊂ R^d → R. A maximin design is drawn thanks to 10^7 iterations of Algorithm 3. The initial temperature T_0 and the initial variance τ_0 were chosen as described in the previous section. The inverse cooling schedule was β_n = (1/T_0)√n, and the variance schedule was constant during the first quarter of the iterations and then τ_n = τ_0/√(n − 10^7/4).
Approximations of the function f are made by kernel interpolation on four different designs: the maximin design that was computed; a design whose points follow a uniform distribution on E; a design obtained by truncating a Latin hypercube design of 5,000 points defined on the hypercube domain containing E; and a design given by a low-discrepancy (Sobol') sequence constrained to be in E (see Bratley and Fox, 1988). The lhs is truncated by keeping only the points which belong to E. The kernel interpolations are computed with the Matlab toolbox DACE (Lophaven and Sondergaard, 2002). The regression functions are chosen as the polynomials of degree smaller than or equal to two, and the kernel is a generalized exponential kernel:

    K(x, x') = exp( − Σ_{j=1}^{d} θ_j |x^(j) − x'^(j)|^ν ),

where x^(j), x'^(j), j = 1, ..., d, are respectively the j-th coordinates of x and x', and θ_1, ..., θ_d, ν are parameters estimated by the usual maximum likelihood estimators. Other methods, such as cross-validation, could be used to choose these parameters. Note that the results given in Section 2 only apply when the kernel is isotropic and Gaussian, which means θ_1 = ... = θ_d = θ and ν = 2.

The four designs are sets of approximately 1,300 points which are included in the domain E according to the estimated indicator function. For the lhs, around 5,000 points are needed to get approximately 1,300 points in E. The function f is computed at the points of the designs. Some points have to be removed from the designs since the code indicates that they are not in E (recall that the designs were built thanks to an estimate of E). Table 2 provides the performances of the kernel interpolations according to the designs. The performances are evaluated on another set of T = 1,300 points uniformly distributed in E (obtained by rejection sampling), on which the function f is also computed.
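The generalized exponential kernel above is straightforward to evaluate once θ_1, ..., θ_d and ν have been estimated (the estimation itself, done here by maximum likelihood inside DACE, is not sketched). A minimal evaluation routine, with illustrative parameter values of ours:

```python
import numpy as np

def generalized_exponential_kernel(x, xp, theta, nu):
    """K(x, x') = exp(-sum_j theta_j * |x_j - x'_j|**nu).

    `theta` holds one positive range parameter per coordinate; taking all
    theta_j equal and nu = 2 recovers the isotropic Gaussian kernel of Section 2.
    """
    x, xp = np.asarray(x, float), np.asarray(xp, float)
    return float(np.exp(-np.sum(np.asarray(theta, float) * np.abs(x - xp) ** nu)))

# With theta_j = 1 and nu = 2 this equals exp(-||x - x'||^2)
k = generalized_exponential_kernel([0.2, 0.4], [0.5, 0.1], theta=[1.0, 1.0], nu=2.0)
```

Per-coordinate θ_j values let the kernel adapt its correlation length to each input variable, which is the point of the anisotropic form used in the application.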
If f̂ denotes a kernel interpolator and {z_1, …, z_T} is the set of test points, the following quantities are reported:

• the Mean Relative Error (MRE), (1/T) ∑_{i=1}^{T} |(f(z_i) − f̂(z_i)) / f(z_i)|;

• the Maximum Relative Error (MaxRE), max_{i=1,…,T} |(f(z_i) − f̂(z_i)) / f(z_i)|;

• the Mean Squared Error (MSE), (1/T) ∑_{i=1}^{T} (f(z_i) − f̂(z_i))^2.

Table 2 also contains the number of points which are actually in E and the minimal distance δ_X between the pairs of points of each design. To compute these distances, the designs were translated into the hypercube [0,1]^8.

                  MRE     MaxRE   MSE    Nb of points   δ_X
Uniform           0.49%   5.2%    0.63   1284           0.15
lhs               0.48%   6.9%    0.73   1275           0.14
maximin           0.47%   3.5%    0.56   1249           0.33
Sobol' sequence   0.46%   7.7%    0.62   1277           0.15

Table 2: Comparison of the performances of kernel interpolation on the different designs

The maximin design makes the kernel interpolation more efficient, especially according to the MaxRE criterion, although it contains fewer admissible points than the other designs. As shown above, the kernel interpolation accuracy depends sharply on how spread out the points of the design are. Thus, the maximin design, which ensures that no point of E is far from the points of the design, leads to the best performances.

Acknowledgements

The authors are grateful to Pierre Del Moral for very helpful discussions on the convergence properties of the algorithms. This work has been supported by the Agence Nationale de la Recherche (ANR, 212 rue de Bercy, 75012 Paris) through the 2009-2012 project Big'MC.

[Figure 1: Case of a design of 100 points in [0,1]^2]
[Figure 2: Case of a design of 250 points in [0,1]^5]
[Figure 3: Case of a design of 400 points in [0,1]^8]

References

Auffray, Y., Barbillon, P., and Marin, J.-M. (2011). Modèles réduits à partir d'expériences numériques. Journal de la Société Française de Statistique, 152(1):89–102.

Bartoli, N. and Del Moral, P. (2001). Simulation & algorithmes stochastiques. Cépaduès.

Bratley, P. and Fox, B. L. (1988). Algorithm 659: Implementing Sobol's quasirandom sequence generator. ACM Transactions on Mathematical Software, 14(1):88–100.

Bursztyn, D. and Steinberg, D. M. (2006). Comparison of designs for computer experiments. Journal of Statistical Planning and Inference, 136(3):1103–1119.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335.

Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, New York.

den Hertog, D., Kleijnen, J., and Siem, A. (2006). The correct Kriging variance estimated by bootstrapping. The Journal of the Operational Research Society, 57(4):400–409.

Fang, K.-T., Li, R., and Sudjianto, A. (2005). Design and Modeling for Computer Experiments. Chapman & Hall/CRC.

Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Johnson, M. E., Moore, L. M., and Ylvisaker, D. (1990). Minimax and maximin distance designs. Journal of Statistical Planning and Inference, 26(2):131–148.

Joseph, V. R. (2006). Limit kriging. Technometrics, 48(4):458–466.

Koehler, J. R. and Owen, A. B. (1996). Computer experiments. In Design and Analysis of Experiments, volume 13 of Handbook of Statistics, pages 261–308. North-Holland, Amsterdam.

Laslett, G. M. (1994). Kriging and splines: an empirical comparison of their predictive performance in some applications. Journal of the American Statistical Association, 89(426):391–409.

Li, R. and Sudjianto, A. (2005). Analysis of computer experiments using penalized likelihood in Gaussian Kriging models. Technometrics, 47(2):111–120.

Locatelli, M. (1996). Convergence properties of simulated annealing for continuous global optimization. Journal of Applied Probability, 33(4):1127–1140.

Lophaven, S. N., Nielsen, H. B., and Søndergaard, J. (2002). DACE, a Matlab Kriging toolbox. Technical Report IMM-TR-2002-12, DTU.

Madych, W. R. and Nelson, S. A. (1992). Bounds on multivariate polynomials and exponential error estimates for multiquadric interpolation. Journal of Approximation Theory, 70(1):94–114.

Matheron, G. (1963). Principles of geostatistics. Economic Geology, 58(8):1246–1266.

McKay, M., Beckman, R., and Conover, W. (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245.

Mease, D. and Bingham, D. (2006). Latin hyperrectangle sampling for computer experiments. Technometrics, 48(4):467–477.

Morris, M. D. and Mitchell, T. J. (1995). Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43:381–402.

Roberts, G. O. and Rosenthal, J. S. (2006). Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains. Annals of Applied Probability, 16(4):2123–2139.

Sacks, J., Schiller, S., Mitchell, T., and Wynn, H. (1989a). Design and analysis of computer experiments (with discussion). Statistical Science, 4(4):409–435.

Sacks, J., Schiller, S. B., and Welch, W. J. (1989b). Designs for computer experiments. Technometrics, 31(1):41–47.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer Series in Statistics. Springer-Verlag, New York.

Schaback, R. (1995). Error estimates and condition numbers for radial basis function interpolation. Advances in Computational Mathematics, 3(3):251–264.

Schaback, R. (2007). Kernel-based meshless methods. Technical report, Institute for Numerical and Applied Mathematics, Georg-August-Universität Göttingen.

Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer-Verlag, New York.

Stein, M. L. (2002). The screening effect in Kriging. The Annals of Statistics, 30(1):298–323.

Stinstra, E., den Hertog, D., Stehouwer, P., and Vestjens, A. (2003). Constrained maximin designs for computer experiments. Technometrics, 45(4):340–346.

van Dam, E. R., Husslage, B., den Hertog, D., and Melissen, H. (2007). Maximin Latin hypercube designs in two dimensions. Operations Research, 55(1):158–169.
