Efficient Multi-Start Strategies for Local Search Algorithms


Authors: András György, Levente Kocsis

Journal of Artificial Intelligence Research 41 (2011) 407-444. Submitted 02/11; published 07/11.

András György (gya@szit.bme.hu), Machine Learning Research Group, Computer and Automation Research Institute of the Hungarian Academy of Sciences, 1111 Budapest, Hungary

Levente Kocsis (kocsis@sztaki.hu), Data Mining and Web Search Research Group, Informatics Laboratory, Computer and Automation Research Institute of the Hungarian Academy of Sciences, 1111 Budapest, Hungary

Abstract

Local search algorithms applied to optimization problems often suffer from getting trapped in a local optimum. The common solution for this deficiency is to restart the algorithm when no progress is observed. Alternatively, one can start multiple instances of a local search algorithm, and allocate computational resources (in particular, processing time) to the instances depending on their behavior. Hence, a multi-start strategy has to decide (dynamically) when to allocate additional resources to a particular instance and when to start new instances. In this paper we propose multi-start strategies motivated by works on multi-armed bandit problems and Lipschitz optimization with an unknown constant. The strategies continuously estimate the potential performance of each algorithm instance by supposing a convergence rate of the local search algorithm up to an unknown constant, and in every phase allocate resources to those instances that could converge to the optimum for a particular range of the constant. Asymptotic bounds are given on the performance of the strategies. In particular, we prove that at most a quadratic increase in the number of times the target function is evaluated is needed to achieve the performance of a local search algorithm started from the attraction region of the optimum.
Experiments are provided using SPSA (Simultaneous Perturbation Stochastic Approximation) and k-means as local search algorithms, and the results indicate that the proposed strategies work well in practice, and, in all cases studied, need only logarithmically more evaluations of the target function as opposed to the theoretically suggested quadratic increase.

© 2011 AI Access Foundation. All rights reserved.

1. Introduction

Local search algorithms applied to optimization problems often suffer from getting trapped in a local optimum. Moreover, local search algorithms that are guaranteed to converge to a global optimum under some conditions (such as Simulated Annealing or Simultaneous Perturbation Stochastic Approximation, SPSA; see, e.g., Spall, Hill, & Stark, 2006) usually converge at a very slow pace when the conditions are satisfied. On the other hand, if the algorithms are employed with more aggressive settings, much faster convergence to local optima is achievable, but with no guarantee of finding the global optimum. The common solution to escape from a local optimum is to restart the algorithm when no progress is observed (see, e.g., Martí, Moreno-Vega, & Duarte, 2010; Zabinsky, Bulger, & Khompatraporn, 2010, and the references therein). Alternatively, one can start multiple instances of the local search algorithm, and allocate computational resources, in particular, processing time, to the instances depending on their behavior. Instances can be started at any time, and so the number of instances may grow over time depending on the allocation strategy (see, e.g., Chapter 10 of Battiti, Brunato, & Mascia, 2008, and the references therein).
In this type of problem the computational cost is usually measured as the total number of steps made by all search algorithm instances: this often reflects the situation that the evaluation of the target function to be optimized is expensive, and the costs related to determining which algorithms to use next are negligible compared to the former (e.g., this is clearly the case if the task is to tune the parameters of a system whose performance can only be tested via lengthy experiments; see, e.g., Bartz-Beielstein, 2006; Hutter, Hoos, Leyton-Brown, & Stützle, 2009).

In this paper we address the above problem of dynamically starting several instances of local search algorithms and allocating resources to the instances based on their (potential) performance. To our knowledge, solutions to the above problem have either been based on heuristics or on the assumption that the local optima the search algorithms converge to have an extreme value distribution (see Section 2 below). In this paper, we propose new multi-start strategies under very mild conditions on the target function, with attractive theoretical and practical properties: supposing a convergence rate of the local search algorithms up to an unknown constant, our strategies continuously estimate the potential performance of each algorithm instance, and in every phase allocate resources to those instances that could converge to the optimum for a particular range of the constant. The selection mechanism is analogous to the DIRECT algorithm (Jones, Perttunen, & Stuckman, 1993; Finkel & Kelley, 2004; Horn, 2006) for optimizing Lipschitz functions with an unknown constant, where preference is given to rectangles that may contain the global optimum. The optimum within each rectangle is estimated in an optimistic way, and the estimate depends on the size of the rectangle.
In our strategies we use the function describing the convergence rate of the local search algorithms in a similar way as the size of the rectangles is used in the DIRECT algorithm. Since in the proposed multi-start strategies the potential performance of each local search algorithm is continuously estimated from the currently best value of the target function returned by that algorithm, our method is restricted to work with local search algorithms that return the best known value of the target function after each step. This is the case, for example, in certain meta-learning problems, where the goal is to find a good parameter setting of a learning algorithm. Here the search space is the parameter space of the learning algorithm, and one step of the local search methods means running the learning algorithm completely on a possibly very large data set. On the other hand, if the local search algorithm is some sort of a gradient search optimizing an error function over some training data, then the value of the target function is usually available only in the case of batch learning (potentially after some very cheap computations), but not when the gradient is estimated only from a few samples.

The rest of the paper is organized as follows. Section 2 summarizes related research. The problem is defined formally in Section 3. The new multi-start local search strategies of this paper are described and analyzed in Section 4: in Section 4.1 we deal with a selection mechanism among a fixed number of instances of the local search algorithm, while, in addition, simple schedules for starting new instances are also considered in Section 4.2, which are natural extensions of the case of finitely many local search algorithm instances. This section concludes with a discussion of the results in Section 4.3.
Simulation results on real and synthetic data are provided in Section 5. Conclusions and future work are described in Section 6.

2. Related Work

The problem of allocating resources among several instances of search algorithms can be comfortably handled in a generalized version of the maximum K-armed bandit problem. The original version of this problem consists of several rounds, where in each round one chooses one of K arms and receives some reward depending on the choice, with the goal of maximizing the highest reward received over several rounds. This model can easily be used for our problem by considering each local search algorithm instance as an arm: pulling an arm means taking one additional step of the corresponding algorithm, that is, evaluating the target function at a point suggested by that algorithm, and the reward received is the value of the target function at the sampled point. A generic algorithm for the standard maximum K-armed bandit problem, where each reward is assumed to have independent and identical distribution, is provided by Adam (2001), where the so-called reservation price of an instance is introduced, which gives the maximum amount of resources worth spending on an instance: if an instance achieves its reservation price, it is useless to select it again. The computation of the reservation price depends on a model of the algorithm that can be learnt under some specific constraints.

Now consider a scenario where several instances of some, possibly randomized, local search algorithms are run after each other with the goal of maximizing the expected performance. Each instance is run until it terminates. In this scenario it is natural to assume that the values returned by the instances (usually some local optima of the target function) are independent.
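The arm/instance correspondence above can be sketched in a few lines: pulling an arm takes one step of the corresponding instance, and only the maximum reward seen so far matters. The toy arms, their Gaussian noise, and the uniform-random allocation policy below are illustrative assumptions, not part of the paper's model:

```python
import random

def max_k_armed(instances, budget, select):
    """Generic maximum K-armed bandit loop: each 'arm' is a search instance;
    pulling it takes one step and the reward is the observed f-value."""
    best = float("-inf")
    for _ in range(budget):
        arm = select(instances)   # any allocation policy
        reward = arm()            # one step of the chosen instance
        best = max(best, reward)  # only the highest reward counts
    return best

# Toy arms: each returns noisy values around its own local optimum.
random.seed(0)
arms = [lambda m=m: m + random.gauss(0, 0.1) for m in (0.3, 0.5, 0.9)]
best = max_k_armed(arms, budget=300, select=random.choice)
```

Any smarter allocation policy (e.g., one preferring arms with high optimistic estimates) can be dropped in via `select` without changing the loop.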
Furthermore, since good search algorithms follow (usually heuristic) procedures that yield substantially better results than pure random guessing, Cicirello and Smith (2004, 2005) suggested that the rewards (evaluated target function values) of the search instances may be viewed as the maximum of many random variables (if the instances are run for sufficiently long time), and hence may be modeled by extreme value distributions. Several algorithms are based on this assumption, and are hence developed for the maximum K-armed bandit problem with returns following generalized extreme value distributions: Cicirello and Smith apply (somewhat heuristic) methods that use the above extreme-value-distribution assumption at each decision point of a meta-learning algorithm, while Streeter and Smith (2006a) use this model to obtain upper confidence bounds on the performance estimate of each type of algorithm used and then try only the algorithms with the best expected result. The latter is a theoretically justified example of the natural strategy to probe the algorithm instances for a while, estimate their future performance based on the results of this trial phase, and then use the most promising algorithm for the time remaining. Streeter and Smith (2006b) proposed a distribution-free approach that combines a multi-armed bandit exploration strategy with a heuristic selection among the available arms. While in the standard maximum K-armed bandit problem the rewards in each round are assumed to be independent, this is clearly not the case in our situation, where the algorithm instances are run in parallel and the reward for evaluating the target function at a point is the improvement upon the current maximum, since the samples chosen by a local search algorithm usually depend on previous samples.
Nevertheless, the ideas and lessons learnt from the maximum K-armed bandit problems can be used in our case as well: for example, the algorithm Threshold Ascent of Streeter and Smith (2006b) gives reasonably good solutions in our case, and the principle of probing instances for a while and then using the most promising one in the time remaining also carries over to this situation easily; such algorithms, having first an exploration and then an exploitation phase, will be referred to in the sequel as explore-and-exploit algorithms. In this class of algorithms, simple rules were suggested by Beck and Freuder (2004) to predict the future performance of each algorithm, while Carchrae and Beck (2004) employ Bayesian prediction.

Another related problem is to find fast algorithms among several ones that solve the same problem. More precisely, several algorithm instances are available that all produce the correct answer to a certain question if run for a sufficiently long time. The time needed for an algorithm instance to find the answer is assumed to be a random quantity with independent and identical distributions for all the instances, and the goal is to combine the given algorithms to minimize the expected running time until the answer is found. When the distribution of the running time is known, an optimal non-adaptive time-allocation strategy¹ is to perform a sequence of runs with a certain cut-off time that depends on the distribution (Luby, Sinclair, & Zuckerman, 1993). If the distribution is unknown, a particular running time sequence can be chosen that results in an expected total running time that is only a logarithmic factor larger than the optimum achievable if the distribution is known. We note that this strategy is among the few that provide a schedule that increases the number of algorithm instances.
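The universal cut-off sequence of Luby et al. (1, 1, 2, 1, 1, 2, 4, 1, 1, 2, ...) mentioned above is a standard construction and can be generated as follows; the cut-off times are in abstract step units:

```python
def luby(i):
    """i-th term (1-based) of the Luby et al. universal restart sequence:
    1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8, ...
    Each block repeats the whole sequence so far, then doubles the cut-off."""
    k = i.bit_length()                    # 2^(k-1) <= i < 2^k
    if i == (1 << k) - 1:                 # end of a block: new largest cut-off
        return 1 << (k - 1)
    return luby(i - (1 << (k - 1)) + 1)   # otherwise repeat an earlier prefix

print([luby(i) for i in range(1, 16)])
# -> [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```

Running the i-th restart of an algorithm instance for `luby(i)` steps yields the logarithmic-factor guarantee cited above when the run-time distribution is unknown.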
The above set-up can be specialized to our problem: the goal is to find an ε-optimal approximation of the optimum, and the running time is the number of steps needed by the given search algorithm to achieve such an approximation. Note that in this case the running time of any algorithm instance providing an ε-suboptimal solution has to be defined to be infinity, but the results of Luby et al. remain valid if an ε-optimal solution can be found with positive probability. For the same problem, Kautz, Horvitz, Ruan, Gomes, and Selman (2002) proposed an allocation strategy based on dynamically updating the belief over the run-time distribution. Concerning the latter, Hoos and Stützle (1999) found empirically that run-time distributions are approximately exponential in certain (NP-hard) problems, while Ribeiro, Rosseti, and Vallejos (2009) dealt with the comparison of different run-time distributions.

Finally, when a set of time-allocation strategies are available and the optimization problem is to be solved several times, one can use the standard multi-armed bandit framework, as done by Gagliolo and Schmidhuber (2006, 2007, 2010).

Running several instances of an algorithm or several algorithms in parallel and selecting among the algorithms have been intensively studied, for example, in the area of meta-learning (Vilalta & Drissi, 2002) or automatic algorithm configuration (Hutter et al., 2009).

1. In a non-adaptive time-allocation strategy the running time of an algorithm instance is fixed in advance, that is, the measured performance of the algorithm instances has no effect on the schedule.
The underlying problem is very similar in both cases: automatic algorithm configuration usually refers to tuning search algorithms, while meta-learning is used for a subset of these problems, tuning machine learning algorithms (the latter often allows more specific use of the data). The main problem here is to allocate time slices to particular algorithms with the aim of maximizing the best result returned. This allocation may depend on the intermediate performance of the algorithms. Most of the automatic algorithm configuration and meta-learning systems use various heuristics to explore the space of algorithms and parameters (see, e.g., Hutter et al., 2009).

Finally, it is important to note that, although multi-start local search strategies solve global optimization problems, we concentrate on maximizing the performance given the underlying family of local optimization methods. Since the choice of the latter has a major effect on the achievable performance, we do not compare our results to the vast literature on global optimization.

3. Preliminaries

Assume we wish to maximize a real-valued function f on the d-dimensional unit hypercube [0,1]^d, that is, the goal is to find a maximizer x* ∈ [0,1]^d such that f(x*) = f*, where f* = max_{x ∈ [0,1]^d} f(x) denotes the maximum of f in [0,1]^d. For simplicity, we assume that f is continuous on [0,1]^d.² The continuity of f implies the existence of x*, and, in particular, that f is bounded. Therefore, without loss of generality, we assume that f is non-negative. If the form of f is not known explicitly, search algorithms usually evaluate f at several locations and return an estimate of x* and f* based on these observations.
There is an obvious trade-off between the number of samples used (i.e., the number of points where the target function f is evaluated) and the quality of the estimate, and the performance of any search strategy may be measured by the accuracy it achieves in estimating f* under a constraint on the number of samples used. Given a local search algorithm A, a general strategy for finding a good approximation of the optimum x* is to run several instances of A initialized at different starting points and approximate f* with the maximum f value observed. We concentrate on local search algorithms A defined formally by a sequence of possibly randomized sampling functions s_n : [0,1]^{d·n} → [0,1]^d, n = 1, 2, ...: A evaluates f at locations X_1, X_2, ..., where X_{i+1} = s_i(X_1, ..., X_i) for i ≥ 1, and the starting point X_1 = s_0 is chosen uniformly at random from [0,1]^d; after n observations A returns the estimates of x* and of the maximum f*, respectively, by

    X̂_n = argmax_{1 ≤ k ≤ n} f(X_k)    and    f(X̂_n),

where ties in the argmax function may be broken arbitrarily, that is, if several samples X_k achieve the maximum, X̂_n can be chosen to be any of them. To avoid ambiguity and simplify notation, here and in the following, unless stated explicitly otherwise, we adopt the convention to use argmax to denote the maximizing sample with the smallest index, but the results remain valid under any other choice to break ties.

2. The results can easily be extended to (arbitrary valued) bounded piecewise continuous functions with finitely many continuous components.
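The abstraction above (sampling functions plus a running argmax estimate) can be sketched as follows, for d = 1; the perturbation-based `hill_sampler` is a hypothetical stand-in for a concrete local search method, not something from the paper:

```python
import random

class LocalSearchInstance:
    """One instance of algorithm A given by sampling functions s_n:
    keeps the history X_1, X_2, ... and the running argmax estimate."""
    def __init__(self, f, sampler):
        self.f, self.sampler = f, sampler
        self.xs = [random.random()]   # X_1 = s_0: uniform start (d = 1 here)
        self.best_x = self.xs[0]
        self.best_f = f(self.xs[0])

    def step(self):
        """Evaluate f at X_{n+1} = s_n(X_1, ..., X_n); return current f-hat."""
        x = self.sampler(self.xs)
        self.xs.append(x)
        v = self.f(x)
        if v > self.best_f:           # strict '>' keeps the smallest-index argmax
            self.best_x, self.best_f = x, v
        return self.best_f

# Illustrative sampler (an assumption): perturb the latest point, clipped to [0,1].
def hill_sampler(xs):
    return min(1.0, max(0.0, xs[-1] + random.gauss(0, 0.05)))

random.seed(1)
inst = LocalSearchInstance(lambda x: 1 - (x - 0.5) ** 2, hill_sampler)
for _ in range(200):
    inst.step()
print(inst.best_x, inst.best_f)
```

Note that `step` returns the best value found so far, matching the requirement (stated earlier) that each instance report its best known target value after every step.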
For simplicity, we consider only starting a single local search algorithm A at different random points, although the results of this work can be extended to allow varying the parameters of A (including the situation of running different local search algorithms, where a parameter would choose the actually employed search algorithm), as well as to allow dependence among the initializations of A (that is, the starting point and parameters of a local search instance may depend on information previously obtained about the target function).

It is clear that if the starting points are sampled uniformly from [0,1]^d and each algorithm is evaluated at its starting point, then this strategy is consistent, that is, f(X̂_n) converges to the maximum of f with probability 1 as the number of instances tends to infinity (in the worst case we perform a random search, which is known to converge to the maximum almost surely). On the other hand, if algorithm A has some favorable properties, then it is possible to design multi-start strategies that still keep the random-search-based consistency, but provide much faster convergence to the optimum in terms of the number of evaluations of f.

Since the sequence f(X̂_n) is bounded and non-decreasing, it converges (no matter what random effects occur during the search). The next lemma, proved in Appendix A, shows that, with high probability, the convergence cannot be arbitrarily slow.

Lemma 1 For any f̂* ∈ ℝ, let E denote the event lim_{n→∞} f(X̂_n) = f̂*.³ If P(E) > 0, then, for any 0 < δ < 1, there is an event E_δ ⊆ E with 1 − P(E_δ) < δ such that f(X̂_n) → f̂* uniformly almost everywhere on E_δ.
In other words, there exists a non-negative, non-increasing function g_δ(n) with lim_{n→∞} g_δ(n) = 0 such that

    P( lim_{t→∞} f(X̂_t) − f(X̂_n) ≤ g_δ(n) for all n | lim_{t→∞} f(X̂_t) = f̂* ) ≥ 1 − δ.    (1)

In certain cases, g_δ(n) = O(e^{−αn}), as shown by Nesterov (2004) for (gradient-based) optimization of convex functions, by Gerencsér and Vágó (2001) for noise-free SPSA for convex functions, or by Kieffer (1982) for k-means clustering (or Lloyd's algorithm) in one dimension for log-concave densities. While these results pertain to the simple situation where there is only one local optimum, which is the global one, many of these results can be extended to more general situations, and we observed an exponential rate of convergence in our own experiments with functions with many local maxima. The convergence property of local search algorithms guaranteed by Lemma 1 will be exploited in the next section to derive efficient multi-start search strategies.

4. Multi-Start Search Strategies

Standard multi-start search strategies run an instance of A until it seems to converge to a location where there is no hope of beating the currently best approximation of f*. An alternative way of using multiple instances of local search algorithms is to run all algorithms in parallel, and in each round decide which algorithms can take an extra step. This approach may be based on estimating the potential performance of a local search algorithm A based on Lemma 1.

3. In practice we can usually assume that local search algorithms converge to local optima, and so f̂* may be assumed to be a local optimum.
Note that if g_δ were known, an obvious way would be to run each instance until their possible performances become separated with high probability, in the sense that the margin between the performance of the actually best and the second best algorithm is so large that the actually best algorithm is guaranteed to be the best, in the long run, with high probability. Then we could just pick the best instance and run it until the given computational budget is exhausted (this would be a simple adaptation of the explore-and-exploit idea of choosing the best algorithm based on a trial phase, as in Beck & Freuder, 2004; Carchrae & Beck, 2004). In practice, however, g_δ is usually not known, but for certain problem classes and local search algorithms it may be known to belong to some function class; for example, g_δ may be known up to a (multiplicative) constant factor (here, for example, the constant may depend on certain characteristics of f, such as its maximum local steepness). Even in the latter case, the best instance still cannot be selected with high probability no matter how large the margin is (as g_δ may be arbitrarily large). However, using ideas from the general methodology for Lipschitz optimization with an unknown constant (Jones et al., 1993), we can get around this problem and estimate, in a certain optimistic way, the potential performance of each algorithm instance, and in each round we can step the most promising ones.

The main idea of the resulting strategy can be summarized as follows. Assume we have K instances of an algorithm A, denoted by A_1, ..., A_K. Let X_{i,n}, i = 1, ..., K, denote the location at which f is evaluated by A_i the n-th time it can take a step, where X_{i,1} is the starting point of A_i.
The estimate of the location of the maximum by algorithm A_i after n samples (steps) is X̂_{i,n} = argmax_{1 ≤ t ≤ n} f(X_{i,t}), and the maximum value of the function is estimated by f̂_{i,n} = f(X̂_{i,n}). For any i, let f̂_i = lim_{n→∞} f̂_{i,n} denote the limiting estimate of the maximum of f provided by A_i. Let g_δ be defined as in Lemma 1 for the largest of these values, f̂* = max_{i=1,...,K} f̂_i. Since f̂* is the best achievable estimate of the maximum of f given the actual algorithms A_1, ..., A_K, g_δ gives a high-probability convergence rate for those algorithms that provide the best estimate of the maximum in the long run (note that the assumption deals with each limiting estimate, usually a local maximum, separately; that is, here no assumption is made on algorithms whose limiting estimates are less than f̂*). Then, if A_i evaluates f at n_{i,r} points by the end of the r-th round and A_i converges to the best achievable estimate f̂*, by Lemma 1 we have, with probability at least 1 − δ, f̂_i − f̂_{i,n_{i,r}} ≤ g_δ(n_{i,r}), and so

    f̂_{i,n_{i,r}} + g_δ(n_{i,r})    (2)

is an optimistic estimate of f̂*. If A_i is suboptimal in the sense that lim_{n→∞} f̂_{i,n} < f̂*, then the above estimate is still optimistic if the rate of convergence is not slower than g_δ, and pessimistic if the rate of convergence is slower than g_δ. The latter is desirable in the sense that we have a negatively biased estimate of the expected performance of an algorithm that we do not want to use (we should not waste samples on suboptimal choices). As in practice g_δ is usually not known exactly, the estimate (2) often cannot be constructed.
On the other hand, if g_δ is known up to a constant factor, then we can construct a family of estimates for all scales: let g′_δ denote a normalized version of g_δ such that g′_δ(0) = 1 and g_δ(n)/g′_δ(n) is a constant for all n, and construct the family of estimates

    f̂_{i,n_{i,r}} + c g′_δ(n_{i,r})    (3)

where c ranges over all positive reals. Then it is reasonable to choose, in each round, those algorithms to take another step that provide the largest estimate for some value of c (typically, if an algorithm gives the largest estimate for some c = c′, then there is an interval I containing c′ such that the algorithm provides the largest estimate for any c ∈ I). In this way we can get around the fact that we do not know the real scaling factor of g′_δ, as we certainly use the algorithms that provide the largest value of (3) for c = g_δ(1)/g′_δ(1), and, as will be discussed later, we do not waste too many samples on algorithms that maximize (3) for other values of c.

Using the optimistic estimate (3) is very similar, in spirit, to the optimistic estimates in the standard "upper confidence bound"-type solutions to the multi-armed bandit problem (Auer, Cesa-Bianchi, & Fischer, 2002) or in the well-known A* search algorithm (Hart, Nilsson, & Raphael, 1968). However, the exact (local) convergence rate is not known, even up to a constant factor, for many local search algorithms, and even if it is, the corresponding bounds are usually meaningful only in the asymptotic regime, which is often not of practical interest. Therefore, to give more freedom in the design of the algorithm, we are going to use an estimate of the form

    f̂_{i,n_{i,r}} + c h(n_{i,r})    (4)

where, similarly to the requirements on g_δ, h is a positive, monotone decreasing function with lim_{n→∞} h(n) = 0. We will also assume, without loss of generality, that h(0) = 1.
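The effect of sweeping the constant c in estimate (4) can be illustrated numerically; the exponential form of h and the instance states below are invented for this example (the paper only requires h positive, decreasing, with h(0) = 1 and h(n) → 0):

```python
def h(n, gamma=0.9):
    """Illustrative h: positive, decreasing, h(0) = 1, vanishing exponentially.
    The rate gamma is our own choice, not prescribed by the paper."""
    return gamma ** n

def winners(states, cs):
    """For each trial constant c, index of the instance maximizing the
    optimistic estimate f_hat_i + c * h(n_i) from Eq. (4)."""
    return [max(range(len(states)),
                key=lambda i: states[i][0] + c * h(states[i][1]))
            for c in cs]

# (f_hat_i, n_i): instance 0 has the best value so far, instance 2 the fewest steps.
states = [(0.80, 30), (0.70, 10), (0.40, 2)]
print(winners(states, cs=[0.01, 0.5, 100.0]))
# -> [0, 1, 2]
```

Small c favors the instance with the currently best value, large c the least-stepped instance, and intermediate c picks out instances in between, which is exactly the range-of-constants behavior the text describes.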
The actual form of h will be based on the theoretical analysis of the resulting algorithms and some heuristic considerations. Essentially we will use h functions that converge to zero exponentially fast, which is in agreement with the exponentially fast local convergence rates in the examples given after Lemma 1. The optimal choice of h, given, for example, g_δ, is not known, and is left for future work.

4.1 Constant Number of Instances

The above idea can be translated to the algorithm MetaMax(K), shown in Figure 1. Here we consider the case when we have a fixed number of instances, and our goal is to perform (almost) as well as the best of them (in hindsight), while using the minimum number of evaluations of f. Note the slight abuse of notation that in the MetaMax(K) algorithm X̂_r and f̂_r denote the estimates of the algorithm after r rounds (and not r steps/samples). In the first part of step (a) of MetaMax we sweep over all positive ĉ and select local search algorithms that maximize the estimate (4). It is easy to see that if A_i maximizes

MetaMax(K): A multi-start strategy with K algorithm instances.

Parameters: K > 0 and a positive, monotone decreasing function h with lim_{n→∞} h(n) = 0.

Initialization: For each i = 1, ..., K, take a step with each algorithm A_i once, and let n_{i,0} = 1 and f̂_{i,0} = f(X_{i,1}).

For each round r = 1, 2, ...

(a) For i = 1, ..., K, select algorithm A_i if there exists a ĉ > 0 such that

    f̂_{i,n_{i,r−1}} + ĉ h(n_{i,r−1}) > f̂_{j,n_{j,r−1}} + ĉ h(n_{j,r−1})    (5)

for all j = 1, ..., K such that (n_{i,r−1}, f̂_{i,n_{i,r−1}}) ≠ (n_{j,r−1}, f̂_{j,n_{j,r−1}}). If there are several values of i selected that have the same step number n_{i,r−1}, then keep only one of these, selected uniformly at random.

(b) Step each selected A_i, and update variables.
That is, set n_{i,r} = n_{i,r−1} + 1 if A_i is selected, and n_{i,r} = n_{i,r−1} otherwise. For each selected A_i, evaluate f(X_{i,n_{i,r}}) and compute the new estimates X̂_{i,n_{i,r}} and f̂_{i,n_{i,r}}.

(c) Let I_r = argmax_{i=1,...,K} f̂_{i,n_{i,r}} denote the index of the algorithm with the currently largest estimate of f*, and estimate the location of the maximum with X̂_r = X̂_{I_r,n_{I_r,r}} and its value with f̂_r = f̂_{I_r,n_{I_r,r}}.

Figure 1: The MetaMax(K) algorithm.

(4) for a particular ĉ = u, then there is a closed interval I containing u such that A_i also maximizes (4) for any ĉ ∈ I. Therefore, in each round, the strategy MetaMax(K) selects the local search algorithms A_i for which the corresponding point (h(n_{i,r−1}), f̂_{i,n_{i,r−1}}) is a corner of the upper convex hull of the set

    P_r = { (h(n_{j,r−1}), f̂_{j,n_{j,r−1}}) : j = 1, ..., K } ∪ { (0, max_{1 ≤ j ≤ K} f̂_{j,n_{j,r−1}}) }.    (6)

The selection mechanism is illustrated in Figure 2. To avoid confusion, note that the random selection in step (a) of MetaMax(K) implies that if all algorithms are in exactly the same state, that is, (n_{i,r−1}, f̂_{i,n_{i,r−1}}) = (n_{j,r−1}, f̂_{j,n_{j,r−1}}) for all i, j, then one algorithm is selected uniformly at random (this pathological situation may arise, e.g., at the beginning of the algorithm, or if all the local search algorithms give the same estimate of f* for some range of step numbers).
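The corner-of-the-upper-convex-hull selection of Eq. (6) can be sketched with a standard monotone-chain scan; the exponential h and the instance states are illustrative assumptions, and this is only the selection step, not the full MetaMax(K) loop:

```python
def upper_hull_corners(points):
    """Indices of corner points of the upper convex hull of (x, y) pairs,
    scanned left to right; strict turns only, so collinear points drop out."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    hull = []
    for i in order:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = points[hull[-2]], points[hull[-1]]
            x3, y3 = points[i]
            # keep hull[-1] only if it is a strict corner (clockwise turn)
            if (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1) < 0:
                break
            hull.pop()
        hull.append(i)
    return hull

def metamax_round(states, h):
    """One selection step of MetaMax(K): instance i is stepped iff the point
    (h(n_i), f_hat_i) is a corner of the upper hull of P_r from Eq. (6)."""
    pts = [(h(n), fhat) for fhat, n in states]
    anchor = (0.0, max(fhat for fhat, _ in states))  # the point (0, max_j f_hat_j)
    hull = upper_hull_corners(pts + [anchor])
    return [i for i in hull if i < len(states)]      # drop the anchor itself

h = lambda n: 0.9 ** n                               # illustrative h, as before
states = [(0.80, 30), (0.45, 10), (0.40, 2)]         # (f_hat_i, n_i)
print(metamax_round(states, h))
# -> [0, 2]
```

Instance 1 is skipped here because its point lies strictly below the segment joining the other two, so no value of ĉ makes it the maximizer of (4); instances 0 (best value) and 2 (fewest steps) are always corners, matching the two-algorithms-per-round observation in the text.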
Apart from the case when one of the least used algorithms provides the currently best estimate, which happens surely in the first round but usually does not happen later (and includes the previous pathological case), it is guaranteed that in each round we use at least two algorithms: one with the largest estimate f̂_{i,n_{i,r−1}} = f̂_{r−1} (for very small values of ĉ), and one with the smallest step number n_{j,r−1} (for very large values of ĉ). Thus, usually at most half of the total number of function calls to f can be used by any optimal local search algorithm. This observation gives a practical lower bound (which is valid apart from the pathological situation mentioned above) on the proportion of function calls to f made by optimal local search algorithms; surprisingly, Theorem 6 below shows that this lower bound is achieved by the algorithm asymptotically.

Figure 2: Selecting algorithm instances in MetaMax: the points represent the algorithm instances (plotted as h(n) versus f̂), and the algorithms that lie on the corners of the upper convex hull (drawn with blue lines) are selected.

The randomization in step (a) that precludes using multiple instances with the same step number is introduced to speed up the algorithm in certain pathological cases. For example, suppose A_1 converges to the correct estimate, while all the other algorithms A_2, ..., A_K produce the same estimate in each round, independently of their samples, that is inferior to the estimates of A_1. Then, if we use the randomization, half of the calls to compute f will be made by A_1, but without the randomization this would drop down to 1/K, as in each round we would use each algorithm.
Furthermore, we could take a step with all algorithms that lie on the convex hull, but similar pathological examples can be constructed where it is more beneficial to use only the algorithms at its corners. On the other hand, in practice it almost never happens that three algorithms lie on the same line, and so algorithms typically never fall at non-corner points of the convex hull.

In the remainder of this section we analyze the performance of the MetaMax(K) algorithm. Proposition 2 shows that the algorithm is consistent in the sense that its performance asymptotically achieves that of the best algorithm instance as the number of rounds increases. To understand the algorithm better, Lemma 3 provides a general sufficient condition under which an algorithm instance is not advanced in a given round, while, based on this result, Lemma 4 provides conditions ensuring that "suboptimal" algorithm instances are not used in a round if they have been stepped too many times (i.e., they have evaluated $f$ at too many points) before. Lemma 5 gives an upper bound on the number of algorithm instances used in a round. The results of the lemmas are then used to show, in Theorems 6 and 8 and Remark 9, that "optimal" algorithm instances are used (asymptotically) at least at a minimum frequency, which, in turn, yields the asymptotic rate of convergence of the MetaMax(K) algorithm.

The following proposition shows that the MetaMax(K) algorithm is consistent in a sense:

Proposition 2 The MetaMax(K) algorithm is consistent in the sense that $\hat{f}_r \le f^*$ for all $r$, and
\[
f^* - \lim_{r \to \infty} \hat{f}_r = \min_{i=1,\ldots,K} \Big\{ f^* - \lim_{n \to \infty} \hat{f}_{i,n} \Big\}.
\]

Proof The proof follows trivially from the fact that each algorithm is selected infinitely often, that is, $\lim_{r \to \infty} n_{i,r} = \infty$.
To see the latter, we show that in every $K$ rounds the number of steps taken by the least used algorithm, that is, $\min_{i=1,\ldots,K} n_{i,r}$, is guaranteed to increase by one. That is, for all $k \ge 0$,
\[
\min_{i=1,\ldots,K} n_{i,kK} \ge k. \tag{7}
\]
As described above, in each round we select exactly one of the algorithms that have made the least number of steps. Thus, even if there are $K$ such algorithms, the minimum step number per algorithm will increase in $K$ rounds, which completes the proof. □

The MetaMax(K) algorithm is more efficient if suboptimal algorithms do not step too often. The next lemma provides sufficient conditions under which an algorithm is not used in a given round.

Lemma 3 An algorithm instance $A_j$ is not used in round $r+1$ of the MetaMax(K) algorithm if there are algorithms $A_i$ and $A_k$ such that $\hat{f}_{i,n_{i,r}} > \hat{f}_{j,n_{j,r}} > \hat{f}_{k,n_{k,r}}$ and either $n_{i,r} \le n_{j,r}$ or
\[
\hat{f}_{j,n_{j,r}} \le \hat{f}_{i,n_{i,r}} \left( 1 - \frac{h(n_{j,r})}{h(n_{k,r})} \right) + \hat{f}_{k,n_{k,r}} \frac{h(n_{j,r})}{h(n_{k,r})}. \tag{8}
\]

Proof As in each round the algorithms at the corners of the upper convex hull of $P_{r+1}$ are used, it is easy to see that an algorithm $A_j$ is not used in a round $r$ if there are algorithms $A_i$ and $A_k$ such that $\hat{f}_{i,n_{i,r}} > \hat{f}_{j,n_{j,r}} > \hat{f}_{k,n_{k,r}}$ and either $n_{i,r} \le n_{j,r}$ or
\[
\frac{\hat{f}_{i,n_{i,r}} - \hat{f}_{j,n_{j,r}}}{h(n_{j,r}) - h(n_{i,r})} \ge \frac{\hat{f}_{i,n_{i,r}} - \hat{f}_{k,n_{k,r}}}{h(n_{k,r}) - h(n_{i,r})}. \tag{9}
\]
To finish the proof we show that (8) implies the latter. Indeed, (9) is equivalent to
\[
h(n_{k,r}) \big( \hat{f}_{i,n_{i,r}} - \hat{f}_{j,n_{j,r}} \big) \ge h(n_{j,r}) \big( \hat{f}_{i,n_{i,r}} - \hat{f}_{k,n_{k,r}} \big) + h(n_{i,r}) \big( \hat{f}_{k,n_{k,r}} - \hat{f}_{j,n_{j,r}} \big).
\]
As the last term on the right-hand side of the above inequality is negative by our assumptions, the inequality is satisfied if
\[
h(n_{k,r}) \big( \hat{f}_{i,n_{i,r}} - \hat{f}_{j,n_{j,r}} \big) \ge h(n_{j,r}) \big( \hat{f}_{i,n_{i,r}} - \hat{f}_{k,n_{k,r}} \big),
\]
which is equivalent to (8).
□

The above lemma provides conditions, depending on the actual performance of the instances, for not using some algorithm instances in a certain round. The next result gives similar conditions based, however, on the best estimates (usually local optima) achievable by the algorithms. Let $\hat{f}^*_i = \lim_{r \to \infty} \hat{f}_{i,n_{i,r}}$ be the asymptotic estimate of algorithm $A_i$ for $f^*$, and let $\hat{f}^* = \max_{1 \le i \le K} \hat{f}^*_i$ denote the best estimate achievable using algorithms $A_1, \ldots, A_K$. Let $O \subseteq \{1, \ldots, K\}$ be the set of optimal algorithms that converge to the best estimate $\hat{f}^*$ (for these algorithms $\hat{f}^*_i = \hat{f}^*$), and let $|O|$ denote the cardinality of $O$ (i.e., the number of optimal algorithm instances). Note that $O$ is a random variable that depends on the actual realizations of the possibly randomized search sequences of the algorithms. The next lemma shows that if $j \notin O$, then $A_j$ is not used at a round $r$ if it has been used too often so far.

Lemma 4 Let $\Delta = \hat{f}^* - \max_{j \notin O} \hat{f}^*_j$ denote the margin between the estimates of the best and the second best algorithms. Then for any $0 < u < \Delta$ there is a random index $R^{(u)} > 0$ such that for any $j \notin O$, $A_j$ is not used by MetaMax(K) at a round $r + 1 > R^{(u)}$ if
\[
n_{j,r} \ge h^{-1}\left( h\Big( \min_{i=1,\ldots,K} n_{i,r} \Big) \left( 1 - \frac{\hat{f}^*_j}{\hat{f}^* - u} \right) \right). \tag{10}
\]
Furthermore, let $0 < \delta < 1$, for any $i \in O$ let $g_{\delta,i}$ denote the convergence rate of algorithm $A_i$ guaranteed by Lemma 1, and let $g_\delta(n) = \max_{i \in O} g_{\delta,i}(n)$ for all $n$. Then $P\big( R^{(u)} \le K g_\delta^{-1}(u) \,\big|\, \hat{f}^*_1, \ldots, \hat{f}^*_K \big) \ge 1 - \delta$ (where $g_\delta^{-1}$ is the generalized inverse of $g_\delta$), and no suboptimal algorithm $A_j$, $j \notin O$, is used for any $r > K g_\delta^{-1}(u)$ with probability at least $1 - \delta |O|$ given the limiting estimates $\hat{f}^*_1, \ldots, \hat{f}^*_K$.

Proof Let $i \in O$.
Since $\lim_{n \to \infty} \hat{f}_{i,n} = \hat{f}^*_i = \hat{f}^*$ by assumption and (7) implies that $\lim_{r \to \infty} n_{i,r} = \infty$, there is an almost surely finite random index $R^{(u)} > 0$ such that for all $r > R^{(u)}$ we have
\[
\hat{f}^* - \hat{f}_{i,n_{i,r}} \le u \quad \text{and so} \quad \hat{f}^* - \hat{f}_r \le u. \tag{11}
\]
Using Lemma 1 we can easily derive a high-probability upper bound on $R^{(u)}$. Since for any $r > K g_\delta^{-1}(u) = R'$, (7) implies $n_{i,r} \ge g_\delta^{-1}(u)$, Lemma 1 yields
\[
P\Big( \hat{f}^*_i - \hat{f}_{i,n_{i,r}} \le u \text{ for all } r > R' \,\Big|\, \hat{f}^*_i = \hat{f}^* \Big) \ge 1 - \delta.
\]
It follows that with probability at least $1 - \delta |O|$ there is an $i \in O$ such that $\hat{f}^*_i - \hat{f}_{i,n_{i,r}} \le u$, which implies that $R^{(u)}$ can be chosen such that $P\big( R^{(u)} > R' \,\big|\, \hat{f}^*_1, \ldots, \hat{f}^*_K \big) \le \delta |O|$. Thus, to prove the lemma, it is enough to show (10).

Clearly, by (11), the algorithm will pick the estimate of one of the best algorithms after $R^{(u)}$ rounds. Let $A_k$ be an algorithm with the least number of steps taken up to the end of round $r$, that is, $n_{k,r} = \min_i n_{i,r}$. If $\hat{f}_{k,n_{k,r}} \ge \hat{f}_{j,n_{j,r}}$ then $A_j$ is not used in round $r+1$. Moreover, since $A_j \notin O$, $\hat{f}_{j,n_{j,r}} \le \hat{f}^*_j < \hat{f}_r$, and in this case $A_j$ is not used in round $r+1$ if $n_{j,r} \ge n_{I_r,r}$ (recall that $n_{I_r,r}$ is the number of evaluations of $f$ initiated by the actually best algorithm $I_r$). Therefore, Lemma 3 implies that $A_j$ is not used for $r > R^{(u)}$ if
\[
h(n_{j,r}) \le h(n_{k,r}) \left( 1 - \frac{\hat{f}_{j,n_{j,r}} - \hat{f}_{k,n_{k,r}}}{\hat{f}_r - \hat{f}_{k,n_{k,r}}} \right).
\]
This is clearly satisfied if
\[
h(n_{j,r}) \le h\Big( \min_i n_{i,r} \Big) \left( 1 - \frac{\hat{f}^*_j}{\hat{f}^* - u} \right) \tag{12}
\]
for any $0 < u < \Delta$, since by (11) we have
\[
1 - \frac{\hat{f}_{j,n_{j,r}} - \hat{f}_{k,n_{k,r}}}{\hat{f}_r - \hat{f}_{k,n_{k,r}}} \ge 1 - \frac{\hat{f}_{j,n_{j,r}}}{\hat{f}_r} \ge 1 - \frac{\hat{f}^*_j}{\hat{f}^* - u}.
\]
Applying the inverse of $h$ to both sides of (12) proves (10) and, hence, the lemma. □

The above result provides an "individual" condition for each suboptimal algorithm for not being used in a round.
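Condition (8) of Lemma 3 is a simple arithmetic test. The following sketch is our own illustration (with an assumed $h(n) = 2^{-n}$ and made-up values, not from the paper); it shows the condition excluding an instance that has been stepped often yet trails in value:

```python
def excluded(fj, nj, fi, ni, fk, nk, h):
    """Sufficient condition of Lemma 3 for instance j not being stepped in
    the next round, given instances i and k with f_i > f_j > f_k."""
    assert fi > fj > fk
    ratio = h(nj) / h(nk)
    return ni <= nj or fj <= fi * (1.0 - ratio) + fk * ratio

h = lambda n: 2.0 ** -n   # an assumed exponentially decaying h

# j trails far behind i in value and has been stepped much more than k:
assert excluded(fj=0.4, nj=9, fi=0.9, ni=10, fk=0.3, nk=3, h=h)
# j is still competitive, so neither clause of the condition applies:
assert not excluded(fj=0.85, nj=5, fi=0.9, ni=10, fk=0.3, nk=4, h=h)
```

The first clause (`ni <= nj`) captures a better instance that has taken no more steps than $A_j$; the second is inequality (8) itself.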
On the other hand, if one of the optimal algorithms has been stepped sufficiently many times, we can give a cumulative upper bound on the number of suboptimal algorithms used in each round.

Lemma 5 Assume that $h$ decreases asymptotically at least exponentially fast, that is, there exist $0 < \alpha < 1$ and $n_\alpha > 0$ such that $\frac{h(n+1)}{h(n)} < \alpha$ for all $n > n_\alpha$. Assume that $r$ is large enough so that $n_{i,r} > n_\alpha$ for all $i$, and let
\[
\varepsilon_r = 1 - \max_{i : \hat{f}_{i,n_{i,r}} \ne \hat{f}_r} \frac{\hat{f}_{i,n_{i,r}}}{\hat{f}_r} > 0.
\]
Then at most $\left\lceil \frac{\ln \varepsilon_r}{\ln \alpha} \right\rceil + 1$ algorithms are stepped in round $r+1$, where $\lceil x \rceil$ denotes the smallest integer at least as large as $x$.

Proof Let $i_0, i_1, \ldots, i_m$ denote the indices of the algorithms chosen in round $r+1$, with $\hat{f}_{i_0,r} < \hat{f}_{i_1,r} < \cdots < \hat{f}_{i_m,r} = \hat{f}_r$. Then Lemma 3 implies that $n_{i_0,r} < n_{i_1,r} < \cdots < n_{i_m,r}$ and $\hat{f}_r - \hat{f}_{i_k,r} < \alpha \big( \hat{f}_r - \hat{f}_{i_{k-1},r} \big)$ for all $k = 1, \ldots, m-1$. Repeated application of the above inequality implies
\[
\hat{f}_r \varepsilon_r \le \hat{f}_r - \hat{f}_{i_{m-1},r} < \alpha \big( \hat{f}_r - \hat{f}_{i_{m-2},r} \big) < \cdots < \alpha^{m-1} \big( \hat{f}_r - \hat{f}_{i_0,r} \big) \le \hat{f}_r \alpha^{m-1},
\]
which yields $m - 1 < \frac{\ln \varepsilon_r}{\ln \alpha}$. As we assumed that $m+1$ algorithms were chosen in round $r+1$, this finishes the proof. □

Based on Lemmas 4 and 5, the next theorem shows that if the local search algorithms converge fast enough (exponentially with a problem-dependent rate, or faster than exponentially), then half of the function calls to evaluate $f$ correspond to optimal algorithm instances.

Theorem 6 Assume that the performances of the algorithms $A_i$, $i = 1, \ldots, K$, are not all the same, that is, $|O| < K$, and suppose that
\[
\limsup_{n \to \infty} \frac{h(n+1)}{h(n)} < \min_{j \notin O} \left( 1 - \frac{\hat{f}^*_j}{\hat{f}^*} \right). \tag{13}
\]
Then asymptotically at least half of the function calls to evaluate $f$ in MetaMax(K) correspond to an optimal algorithm.
That is,
\[
P\left( \liminf_{r \to \infty} \frac{\sum_{i \in O} n_{i,r}}{\sum_{i=1}^K n_{i,r}} \ge \frac{1}{2} \,\middle|\, \hat{f}^*_1, \ldots, \hat{f}^*_K \right) = 1. \tag{14}
\]
Furthermore, for any $0 < \delta < 1$ and $\varepsilon > 0$ there is a constant $R^{(\delta,\varepsilon)} > 0$ such that
\[
\hat{f}^* - \hat{f}_r \le g_\delta\left( \left\lfloor \frac{\sum_{i=1}^K n_{i,r}}{(2 + \varepsilon) |O|} \right\rfloor \right) \tag{15}
\]
with probability at least $1 - |O|\delta$ given $\hat{f}^*_1, \ldots, \hat{f}^*_K$, simultaneously for all $r > R^{(\delta,\varepsilon)}$, where $g_\delta$ is defined in Lemma 4, and the threshold $R^{(\delta,\varepsilon)} > 0$ depends on $\delta, \varepsilon, h, g_\delta^{-1}$ and $\Delta$, where $g_\delta^{-1}$ and $\Delta$ are also defined in Lemma 4.

Proof We show that a suboptimal $A_j$ is not chosen for large enough $r$ if $n_{j,r} > \min_k n_{k,r}$. By Lemma 4, it is sufficient to prove that, for large enough $r$,
\[
\min_k n_{k,r} + 1 \ge h^{-1}\left( h\Big( \min_k n_{k,r} \Big) \left( 1 - \frac{\hat{f}^*_j}{\hat{f}^* - u} \right) \right) \tag{16}
\]
for some $0 < u < \Delta$ (recall that here $r$ should be larger than $R^{(u)}$, the almost surely finite random index of Lemma 4). As the minimum in (13) is taken over a finite set, it follows that there exists a small enough positive $u < \Delta$ such that
\[
\limsup_{n \to \infty} \frac{h(n+1)}{h(n)} \le \min_{j \notin O} \left( 1 - \frac{\hat{f}^*_j}{\hat{f}^* - u} \right), \tag{17}
\]
which clearly implies (16), as $\lim_{r \to \infty} \min_k n_{k,r} = \infty$ by (7). This finishes the proof of (14), the first part of the theorem.

Next we prove (15). Let $N_u > 0$ be a threshold such that (17) holds for all $n \ge N_u$. Furthermore, by Lemma 1 and the union bound, (1) holds for each local search algorithm $A_i$ with $g_{\delta,i}$ in place of $g_\delta$, simultaneously for all $i \in O$, with probability at least $1 - |O|\delta$. Then (7) and a slight modification of Lemma 4 imply that (16) holds simultaneously for all $r > K \max\{ g_\delta^{-1}(u), N_u \} = R'$ with probability at least $1 - |O|\delta$ given $\hat{f}^*_1, \ldots, \hat{f}^*_K$. Since in each such round at most two algorithms are used, for any $r > R' + c$ we have
\[
\frac{\sum_{i \in O} n_{i,r}}{\sum_{i=1}^K n_{i,r}} > \frac{c + R'}{2c + K R'}
\]
with high probability.
Since the latter is bounded from below by $1/(2+\varepsilon)$ for $c \ge R'(K - 2 - \varepsilon)/\varepsilon$, we have $\sum_{i \in O} n_{i,r} \ge \frac{\sum_{i=1}^K n_{i,r}}{2+\varepsilon}$ for any $r > R^{(\delta,\varepsilon)} = R' + R'(K - 2 - \varepsilon)/\varepsilon$ with high probability. Then there is an algorithm $A_i$, $i \in O$, that is used in at least $\left\lceil \frac{\sum_{i=1}^K n_{i,r}}{|O|(2+\varepsilon)} \right\rceil$ rounds, implying the statement of the theorem via Lemma 1. □

Remark 7 The above proof of Theorem 6 is based on Lemma 4. A proof based on Lemma 5 is also possible, since, setting $\alpha = \min_{j \notin O} (1 - \hat{f}^*_j / \hat{f}^*)$ in the lemma, (13) implies that, for large enough $r$, $\varepsilon_r \approx \alpha$, and so each round is approximately of length 2 by the lemma.

It may happen that although the decay rate of $h$ is exponential, it is not quite fast enough to satisfy (13), so the optimal scenarios of the above theorem do not hold. In this case it turns out that the number of algorithms converging to the same local maximum plays the key role in determining the usage frequency of the optimal algorithms.

Theorem 8 Assume that the estimates provided by the algorithms $A_1, \ldots, A_K$ converge to $N > 1$ distinct limit points, such that $k_0 = |O|$ algorithms converge to $\hat{f}^*$, and $k_1, k_2, \ldots, k_{N-1}$ algorithms converge to the respective suboptimal limit points. Suppose furthermore that $h$ decreases asymptotically at least exponentially fast, that is, for some $0 < \alpha < 1$, $\limsup_{n \to \infty} \frac{h(n+1)}{h(n)} < \alpha$. Then
\[
P\left( \liminf_{r \to \infty} \frac{\sum_{i \in O} n_{i,r}}{\sum_{i=1}^K n_{i,r}} \ge \frac{k_{\max}}{K - k_0 + k_{\max}} \,\middle|\, \hat{f}^*_1, \ldots, \hat{f}^*_K \right) = 1,
\]
where $k_{\max} = \max_{1 \le i \le N-1} k_i$. Furthermore, using the definitions of Lemma 4, for any $0 < \delta < 1$ and $\varepsilon > 0$ there is a constant threshold $R^{(\delta,\varepsilon)}$ such that
\[
\hat{f}^* - \hat{f}_r \le g_\delta\left( \left\lfloor \frac{k_{\max} \sum_{i=1}^K n_{i,r}}{(K - k_0 + k_{\max} + \varepsilon) |O|} \right\rfloor \right)
\]
with probability at least $1 - |O|\delta$ given $\hat{f}^*_1, \ldots, \hat{f}^*_K$, simultaneously for all $r > R^{(\delta,\varepsilon)}$, where $R^{(\delta,\varepsilon)}$ depends on $\varepsilon$, $\delta$,
$\hat{f}^*_1, \ldots, \hat{f}^*_K$, and the convergence rates of all the algorithms.⁴

(⁴ Instead of the convergence rates of all algorithms, $R^{(\delta,\varepsilon)}$ may be defined to be dependent on $g_\delta$ and $\Delta$.)

Proof Suppose $\hat{f}^*_1, \ldots, \hat{f}^*_K$ are given, and fix the random trajectories of all the algorithms. If there is a single suboptimal algorithm, then the statement is trivial, as $k_{\max}/(K - k_0 + k_{\max}) = 1/2$, in each round at least two algorithms are used, and at most one of them can be the suboptimal one. From now on assume there are at least two suboptimal algorithms.

Assume that $A_j$ and $A_k$ converge to suboptimal local maxima (strictly less than $\hat{f}^*$). For any $r$ large enough, an optimal algorithm $A_i$ is better than any of the suboptimal ones, that is, if $A_i$ converges to $\hat{f}^*$ then $\hat{f}_{i,n_{i,r}} > \hat{f}_{j,n_{j,r}}, \hat{f}_{k,n_{k,r}}$. Assume, without loss of generality, that $\hat{f}_{j,n_{j,r}} \ge \hat{f}_{k,n_{k,r}}$. If $n_{j,r} \le n_{k,r}$ then clearly only $A_j$ can be chosen in round $r+1$. Assume $n_{j,r} > n_{k,r}$. Since $\hat{f}_{j,n_{j,r}}$ and $\hat{f}_{k,n_{k,r}}$ are convergent sequences (as $r \to \infty$), for large enough $r$ we have, for some $\alpha_{j,k} < 1$,
\[
1 - \frac{\hat{f}_{j,n_{j,r}} - \hat{f}_{k,n_{k,r}}}{\hat{f}_{i,n_{i,r}} - \hat{f}_{k,n_{k,r}}} > \alpha_{j,k} \ge \alpha^{t_{j,k}}, \tag{18}
\]
where $t_{j,k} = \lceil \ln \alpha_{j,k} / \ln \alpha \rceil$ is a positive integer. Note that if $A_j$ and $A_k$ converge to the same point, that is, $\lim_{r \to \infty} \big( \hat{f}_{j,n_{j,r}} - \hat{f}_{k,n_{k,r}} \big) = 0$, the second term on the left-hand side of (18) converges to 0, and so $\alpha_{j,k}$ can be chosen to be $\alpha$, implying $t_{j,k} = 1$. Rearranging the above inequality, one obtains
\[
\big( 1 - \alpha^{t_{j,k}} \big) \hat{f}_{i,n_{i,r}} + \alpha^{t_{j,k}} \hat{f}_{k,n_{k,r}} > \hat{f}_{j,n_{j,r}}. \tag{19}
\]
If $n_{j,r} - n_{k,r} \ge t_{j,k}$, then the conditions on $h$ and the fact that both $n_{j,r}$ and $n_{k,r}$ tend to infinity as $r \to \infty$ (recall (7)) imply that, for large enough $r$, $h(n_{j,r})/h(n_{k,r}) < \alpha^{t_{j,k}}$.
Since $\hat{f}_{i,n_{i,r}} \ge \hat{f}_{k,n_{k,r}}$ for large enough $r$, from (19) we obtain
\[
\hat{f}_{j,n_{j,r}} < \big( 1 - \alpha^{t_{j,k}} \big) \hat{f}_{i,n_{i,r}} + \alpha^{t_{j,k}} \hat{f}_{k,n_{k,r}} \le \left( 1 - \frac{h(n_{j,r})}{h(n_{k,r})} \right) \hat{f}_{i,n_{i,r}} + \frac{h(n_{j,r})}{h(n_{k,r})} \hat{f}_{k,n_{k,r}}.
\]
Thus, by Lemma 3, if $r$ is large enough, $A_j$ cannot be used in round $r+1$ if $n_{j,r} - n_{k,r} \ge t_{j,k}$. Since both $n_{j,r}$ and $n_{k,r}$ tend to infinity, it follows that, for large enough $r$,
\[
|n_{j,r} - n_{k,r}| \le t_{j,k} \tag{20}
\]
for any two suboptimal algorithms $A_j$ and $A_k$. Note that this fact also implies that from any set of suboptimal algorithms converging to the same point, eventually at most one can be used in any round (since the corresponding thresholds are $t_{j,k} = 1$).

Clearly, (7) implies that both $n_{j,r}$ and $n_{k,r}$ grow linearly with $r$, and since their difference is bounded by (20), $\lim_{r \to \infty} n_{j,r}/n_{k,r} = 1$. Therefore, for any suboptimal algorithm $A_j$, we have $\limsup_{r \to \infty} n_{j,r}/r \le 1/k_{\max}$ (this is the maximal rate of using elements from the largest group of suboptimal algorithms converging to the same local optimum). Finally, as an optimal algorithm is used in each round $r$, for large enough $r$, we have
\[
\liminf_{r \to \infty} \frac{\sum_{i \in O} n_{i,r}}{\sum_{i=1}^K n_{i,r}} = \liminf_{r \to \infty} \frac{\sum_{i \in O} n_{i,r}}{\sum_{i \in O} n_{i,r} + \sum_{i \notin O} n_{i,r}} \ge \lim_{r \to \infty} \frac{r}{r + (K - k_0) \frac{r}{k_{\max}}} = \frac{k_{\max}}{K - k_0 + k_{\max}},
\]
where we used the fact that $a/(a+b)$ is an increasing function of $a$ for $a, b > 0$. Since the above inequality holds for all realizations of the trajectories of $A_1, \ldots, A_K$, given $\hat{f}^*_1, \ldots, \hat{f}^*_K$, the first statement of the theorem follows.

The second statement follows similarly to (15) in Theorem 6. Since the exact value of $R^{(\delta,\varepsilon)}$ is not of particular interest, the derivation is omitted.
□

Remark 9 The main message of the above theorem is the somewhat surprising observation that suboptimal algorithms are slowed down if there is a large group of suboptimal algorithms converging to the same local optimum; the rate at which suboptimal algorithms are used is bounded by the size of the largest such group.

4.2 Unbounded Number of Instances

It is clear that if the local search algorithms are not consistent (i.e., they do not achieve the global optimum $f^*$), then, despite its favorable properties, the MetaMax(K) strategy is inconsistent, too. However, if we increase the number of algorithms to infinity, then we obtain consistency from random search, while still keeping the reasonably fast convergence rate of MetaMax(K). Clearly, one needs to balance between exploration and exploitation, that is, we have to control how often we introduce a new algorithm. One solution is to let the MetaMax algorithm solve this problem: the MetaMax(∞) algorithm, given in Figure 3, is the extension of MetaMax(K) that is able to run with infinitely many local search algorithm instances. Here a major issue is that new local search algorithms have to be started from time to time (this ensures that the algorithm will converge to the global maximum of $f$, since it also performs a random search): this is implemented by modifying step (a) of the MetaMax(K) algorithm so that a new, randomly initialized local search algorithm is introduced in each round (randomly selecting one algorithm "uniformly" from the infinitely many possible algorithms not used so far). Obviously, we have to skip the initialization step of MetaMax(K) and "start" each algorithm with 0 samples.
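As a concrete toy illustration of how these pieces fit together, the following is our own sketch of a MetaMax(∞)-style loop (not the authors' implementation): one new, randomly initialized hill climber per round, hull-based selection, and one step for each selected instance. Simplifying assumptions: a toy objective, a fixed $h(n) = 2^{-n}$, and one representative kept per step count in place of the paper's uniformly random tie-breaking.

```python
import math
import random

def f(x):                               # toy multimodal objective on [0, 1]
    return 0.5 + 0.5 * math.sin(13 * x) * math.sin(27 * x)

def local_step(x, rng, scale=0.05):     # one hill-climbing proposal
    return min(1.0, max(0.0, x + rng.gauss(0.0, scale)))

def select_on_hull(hs, fs):
    """Positions whose point (hs[i], fs[i]) is a corner of the upper convex
    hull of the point set together with the anchor (0, max fs)."""
    order = sorted(range(len(hs)), key=lambda i: hs[i])
    hull = [(0.0, max(fs), None)]       # anchor point carries no index
    for i in order:
        p = (hs[i], fs[i], i)
        while len(hull) >= 2:
            x1, y1, _ = hull[-2]
            x2, y2, _ = hull[-1]
            if (x2 - x1) * (p[1] - y1) - (p[0] - x1) * (y2 - y1) >= 0:
                hull.pop()              # non-corner point: drop it
            else:
                break
        hull.append(p)
    return [i for _, _, i in hull if i is not None]

def metamax_inf(rounds, seed=0):
    rng = random.Random(seed)
    xs, fs, ns = [], [], []             # current point, best value, steps
    for _ in range(rounds):
        x0 = rng.random()               # step (a): start a fresh instance
        xs.append(x0); fs.append(f(x0)); ns.append(1)
        # keep one representative per step count (stands in for the paper's
        # random tie-breaking among instances in identical states)
        reps = {}
        for i in range(len(xs)):
            if ns[i] not in reps or fs[i] > fs[reps[ns[i]]]:
                reps[ns[i]] = i
        cand = list(reps.values())
        chosen = select_on_hull([2.0 ** -ns[i] for i in cand],
                                [fs[i] for i in cand])
        for j in chosen:                # step (c): advance selected ones
            i = cand[j]
            x_new = local_step(xs[i], rng)
            if f(x_new) > fs[i]:
                xs[i], fs[i] = x_new, f(x_new)
            ns[i] += 1
    best = max(range(len(fs)), key=lambda i: fs[i])
    return xs[best], fs[best]
```

Note the exploration/exploitation balance described above appears automatically: the anchor point keeps the least-stepped (newest) instance on the hull, so fresh random starts keep being tried while the leader keeps being advanced.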
To better control the length of each round (i.e., the exploration), in each round $r$ we allow the use of a different function $h$, denoted by $h_{r-1}$, that may depend on any value measured before round $r$ (this is suppressed in the notation). As before, we assume that $h_r(0) = 1$, $h_r(n)$ is monotone decreasing in $n$, and $\lim_{n \to \infty} h_r(n) = 0$ for all $r$. Typically we will make $h_{r-1}$ depend on the total number of steps (i.e., the function calls to evaluate $f$) $t_{r-1} = \sum_{i=1}^{K_{r-1}} n_{i,r-1}$ made by all the algorithms before round $r$, where $K_{r-1}$ is the number of algorithm instances used before round $r$; note that $K_{r-1} = r - 1$ for all $r$, as we start exactly one new algorithm in each round.

It is desired that, although the number of local search algorithms grows to infinity, the number of times the best local search algorithm is advanced by the MetaMax(∞) algorithm approaches infinity reasonably fast. Somewhat relaxing the random initialization condition, we may imagine a situation where the local search algorithms are initialized in some clever, deterministic way, and so in the first few steps they do not find any better value than their initial guesses. If all algorithms are optimal (this may be viewed as a result of the clever initialization), then they may provide, for example, the identical estimates $0.5, 0.5, 1$ for the first three steps. Then it is easy to see that each algorithm is stepped exactly twice, and thus no convergence to the optimum (which would be found after the third step) is achieved. Although the random initialization of the search algorithms guarantees the consistency of MetaMax(∞) (see Proposition 10 below), robust behavior even in such pathological cases is preferred.
This can be achieved by a slight modification of the algorithm: if in a round a local search algorithm overtakes the currently best algorithm, that is, $I_r \ne I_{r-1}$, then algorithm $A_{I_r}$ is stepped several times until it is used more times than $A_{I_{r-1}}$.⁵

MetaMax(∞): A multi-start strategy with infinitely many algorithm instances.

Parameters: $\{h_r\}$, a set of positive, monotone decreasing functions with $\lim_{n \to \infty} h_r(n) = 0$.

For each round $r = 1, 2, \ldots$

(a) Initialize algorithm $A_r$ by setting $n_{r,r-1} = 0$, $\hat{f}_{r,0} = 0$.

(b) For $i = 1, \ldots, r$ select algorithm $A_i$ if there exists a $\hat{c} > 0$ such that
\[
\hat{f}_{i,n_{i,r-1}} + \hat{c} h_{r-1}(n_{i,r-1}) > \hat{f}_{j,n_{j,r-1}} + \hat{c} h_{r-1}(n_{j,r-1})
\]
for all $j = 1, \ldots, r$ such that $(n_{i,r-1}, \hat{f}_{i,n_{i,r-1}}) \ne (n_{j,r-1}, \hat{f}_{j,n_{j,r-1}})$. If several values of $i$ are selected that have the same step number $n_{i,r-1}$, then keep only one of these, selected uniformly at random.

(c) Step each selected $A_i$ and update the variables. That is, set $n_{i,r} = n_{i,r-1} + 1$ if $A_i$ is selected, and $n_{i,r} = n_{i,r-1}$ otherwise. For each selected $A_i$ evaluate $f(X_{i,n_{i,r}})$ and compute the new estimates $\hat{X}_{i,n_{i,r}}$ and $\hat{f}_{i,n_{i,r}}$.

(d) Let $I_r = \mathrm{argmax}_{i=1,\ldots,r} \hat{f}_{i,n_{i,r}}$ denote the index of the algorithm with the currently largest estimate of $f^*$, and estimate the location of the maximum with $\hat{X}_r = \hat{X}_{I_r,n_{I_r,r}}$ and its value with $\hat{f}_r = \hat{f}_{I_r,n_{I_r,r}}$.

Figure 3: The MetaMax(∞) algorithm.

The resulting algorithm, called MetaMax, is given in Figure 4. Note that the algorithms MetaMax(∞) and MetaMax conceptually differ only at one place: step (c) is extended with step (c') in the new algorithm. As a result, a technical modification also appears in step (d), and, to simplify the presentation of the MetaMax algorithm, a slight, insignificant modification is also introduced in step (b); see the discussion below.
The modification in MetaMax is not really significant in the practical examples we studied (see Section 5), as the number of steps taken by an algorithm that overtakes the currently best algorithm grows very quickly also in the MetaMax(∞) algorithm, since in MetaMax an overtake usually introduces very short rounds (close to the minimum length two in many cases) until the leading algorithm also becomes the most used one. The goal of the modification in step (b) is only to synchronize the choice of the optimal algorithms in steps (b) and (c'). An equally good solution would be to choose, in case of a tie in step (c'), an algorithm that was used in the current round.

(⁵ In this way we achieve that the actually best algorithm dominates all others in terms of both accuracy and the number of calls made by the algorithms to compute the target function. This is the same type of dominance as used by Hutter et al. (2009) in a slightly different context.)

MetaMax: A multi-start strategy with infinitely many algorithm instances.

Parameters: $\{h_r\}$, a set of positive, monotone decreasing functions with $\lim_{n \to \infty} h_r(n) = 0$.

For each round $r = 1, 2, \ldots$

(a) Initialize algorithm $A_r$ by setting $n_{r,r-1} = 0$, $\hat{f}_{r,0} = 0$.

(b) For $i = 1, \ldots, r$ select algorithm $A_i$ if there exists a $\hat{c} > 0$ such that
\[
\hat{f}_{i,n_{i,r-1}} + \hat{c} h_{r-1}(n_{i,r-1}) > \hat{f}_{j,n_{j,r-1}} + \hat{c} h_{r-1}(n_{j,r-1})
\]
for all $j = 1, \ldots, r$ such that $(n_{i,r-1}, \hat{f}_{i,n_{i,r-1}}) \ne (n_{j,r-1}, \hat{f}_{j,n_{j,r-1}})$. If several values of $i$ are selected that have the same step number $n_{i,r-1}$, then keep only the one with the smallest index.

(c) Step each selected $A_i$ and update the variables. That is, set $n_{i,r} = n_{i,r-1} + 1$ if $A_i$ is selected, and $n_{i,r} = n_{i,r-1}$ otherwise. For each selected $A_i$ evaluate $f(X_{i,n_{i,r}})$ and compute the new estimates $\hat{X}_{i,n_{i,r}}$ and $\hat{f}_{i,n_{i,r}}$.
(c') Let $I_r = \mathrm{argmax}_{i=1,\ldots,r} \hat{f}_{i,n_{i,r}}$ denote the index of the algorithm with the currently largest estimate of $f^*$ (in case $I_r$ is not unique, choose the one with the smallest number of steps $n_{i,r}$). If $I_r \ne I_{r-1}$, step algorithm $A_{I_r}$ $(n_{I_{r-1},r} - n_{I_r,r} + 1)$ times and set $n_{I_r,r} = n_{I_{r-1},r} + 1$.

(d) Estimate the location of the maximum with $\hat{X}_r = \hat{X}_{I_r,n_{I_r,r}}$ and its value with $\hat{f}_r = \hat{f}_{I_r,n_{I_r,r}}$.

Figure 4: The MetaMax algorithm.

Also note that, as a result of the modifications, the currently best algorithm (with index $I_r$) has taken the most steps, and so the extra number of steps taken in step (c') is indeed positive. An important consequence of the modifications is that, in any round $r$, the number of steps taken by the local search algorithm $A_{I_r}$, which is the best at the end of the round, is between $r$ and $2r$ (see Theorem 15 below).

The rest of this section is devoted to the theoretical analysis of MetaMax(∞) and MetaMax, following the lines of the analysis provided for MetaMax(K). First, Proposition 10 shows that the algorithms are consistent, that is, the solution found by the algorithm actually converges to $f^*$. Lemma 12 (a counterpart of Lemma 4) shows that suboptimal algorithms can make only finitely many steps, while Lemma 14 gives an upper bound on the length of each round.
The main theoretical results of this section apply only to the MetaMax algorithm: Theorem 15 gives a lower bound on the number of steps taken by the actually best algorithm by the end of a given round, while, as a consequence, Theorem 16 shows the rate of convergence of the algorithm as a function of the total number of steps (i.e., function calls to evaluate $f$) used by all algorithm instances: it turns out that at most quadratically more steps are needed than for a generic local search algorithm instance that converges to the optimum.

Since the MetaMax(∞) and MetaMax strategies perform a random search (the number of algorithms tends to infinity, as the length of each round is finite), the algorithms are consistent:

Proposition 10 The strategies MetaMax(∞) and MetaMax are consistent. That is, $\lim_{r \to \infty} \hat{f}_r = f^*$ almost surely.

Proof Clearly, the event that $\hat{f}_r$ does not converge to $f^*$ can be written as
\[
\Big\{ \lim_{r \to \infty} \hat{f}_r \ne f^* \Big\} = \bigcup_{n=1}^{\infty} \bigcap_{R=1}^{\infty} \bigcup_{r=R}^{\infty} \Big\{ \hat{f}_r < f^* - 1/n \Big\}. \tag{21}
\]
Now the continuity of $f$ implies that, for any $n$, if $X$ is chosen uniformly from $[0,1]^d$ then $q_n = P\big( f(X) > f^* - 1/n \big) > 0$. Thus, for any round $r$, $P\big( \hat{f}_r < f^* - 1/n \big) \le (1 - q_n)^r$, and so $\sum_{r=1}^{\infty} P\big( \hat{f}_r < f^* - 1/n \big)$ is finite. Therefore, the Borel–Cantelli lemma (see, e.g., Ash & Doléans-Dade, 2000) implies that
\[
P\left( \bigcap_{R=1}^{\infty} \bigcup_{r=R}^{\infty} \Big\{ \hat{f}_r < f^* - 1/n \Big\} \right) = 0
\]
for all $n$. This, together with (21), finishes the proof, as
\[
P\Big( \lim_{r \to \infty} \hat{f}_r \ne f^* \Big) \le \sum_{n=1}^{\infty} P\left( \bigcap_{R=1}^{\infty} \bigcup_{r=R}^{\infty} \Big\{ \hat{f}_r < f^* - 1/n \Big\} \right) = 0. \qquad \Box
\]

In the remainder of this section we will assume that local search algorithms that achieve an almost optimal value eventually converge to the optimum.

Assumption 11 Let $\mathcal{F}^* \subset \mathbb{R}$ denote the set of local maxima of $f$, and let $\Delta = f^* - \sup_{\hat{f} \in \mathcal{F}^*, \hat{f} < f^*} \hat{f}$. Assume that $\Delta > 0$ and that if an algorithm $A_i$ is such that $\hat{f}_{i,n} > f^* - \Delta$ for some $n$, then $\lim_{n \to \infty} \hat{f}_{i,n} = f^*$.
If the local search algorithms converge to local optima (which is a reasonable assumption in practice), the above assumption is usually satisfied: the only situation in which it does not hold is the pathological case when $f$ has infinitely many local maxima and the set of these maxima is dense at the global maximum.

Under Assumption 11 we can prove, similarly to Lemma 4, that any suboptimal algorithm is selected only a limited number of times, which increases with $h_r^{-1}$. In particular, if $h_r = h$ for all $r$ large enough, then any suboptimal algorithm is chosen only finitely many times.

Lemma 12 Suppose Assumption 11 holds, and let $q = P\big( f(X) > f^* - \Delta/2 \big)$ for an $X$ uniformly distributed in $[0,1]^d$. Then, for both the MetaMax(∞) and the MetaMax algorithms, a suboptimal algorithm $A_j$ started before round $r+1$ is not used at round $r+1$, with probability at least $1 - (1-q)^r$, if
\[
n_{j,r} \ge h_r^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right).
\]
In addition, if $h_r(n)$ is a non-decreasing function of $r$ for all $n$, then
\[
\limsup_{r \to \infty} \frac{n_{j,r}}{h_{r-1}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)} < \infty \quad \text{almost surely.} \tag{22}
\]
In particular, if $h_r$ is a constant function, that is, $h_r = h_0$ for all $r$, then $\lim_{r \to \infty} n_{j,r} < \infty$ almost surely.

Remark 13 Note that in the second part of the lemma we could drop the monotonicity assumption on $h_r$ and replace $h_{r-1}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)$ with $\max_{0 \le r' \le r-1} h_{r'}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)$ in (22).

Proof Consider whether algorithm $A_j$ is used at a round $r+1$. First note that, with probability at least $1 - (1-q)^r$, $\hat{f}_r > f^* - \Delta/2$. Furthermore, the newly introduced algorithm $A_{r+1}$ has not been used yet, and we have $n_{r+1,r} = 0$ and $\hat{f}_{r+1,0} = 0$. Thus, by Lemma 3, $A_j$ is not used if
\[
\hat{f}_{j,n_{j,r}} \le \hat{f}_r \left( 1 - \frac{h_r(n_{j,r})}{h_r(0)} \right) = \hat{f}_r \big( 1 - h_r(n_{j,r}) \big).
\]
Since this is equivalent to $n_{j,r} \ge h_r^{-1}\left( 1 - \frac{\hat{f}_{j,n_{j,r}}}{\hat{f}_r} \right)$
and
\[
h_r^{-1}\left( 1 - \frac{\hat{f}_{j,n_{j,r}}}{\hat{f}_r} \right) \le h_r^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)
\]
by
\[
\frac{\hat{f}_r - \hat{f}_{j,n_{j,r}}}{\hat{f}_r} \ge \frac{\hat{f}_r - (f^* - \Delta)}{\hat{f}_r} > \frac{(f^* - \Delta/2) - (f^* - \Delta)}{f^* - \Delta/2} = \frac{\Delta}{2 f^* - \Delta}, \tag{23}
\]
the first statement of the lemma follows.

To prove the second part, let $\hat{R}$ denote the first round in which there is an optimal algorithm $A_i$ with $\hat{f}_{i,n_{i,\hat{R}}} > f^* - \Delta/2$. Then, for any suboptimal algorithm $A_j$, the first part of the lemma implies that, for any $r > \hat{R}$,
\[
n_{j,r} \le \max\left\{ \hat{R},\ \max_{0 \le r' \le r-1} h_{r'}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right) + 1 \right\} = \max\left\{ \hat{R},\ h_{r-1}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right) + 1 \right\},
\]
where the equality holds since $h_r^{-1}(n)$ is non-decreasing in $r$. Thus
\[
\limsup_{r \to \infty} \frac{n_{j,r}}{h_{r-1}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)} \le \limsup_{r \to \infty} \max\left\{ \frac{\hat{R}}{h_{r-1}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)},\ 1 + \frac{1}{h_{r-1}^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)} \right\} \le \max\left\{ \frac{\hat{R}}{h_0^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)},\ 1 + \frac{1}{h_0^{-1}\left( \frac{\Delta}{2 f^* - \Delta} \right)} \right\}, \tag{24}
\]
where we used again that $h_r^{-1}$ is non-decreasing in $r$. Since $\hat{R}$ is finite, (24) is also finite with probability 1. □

A simple modification of Lemma 5 implies that once a $\Delta/2$-optimal sample point is found, only a limited number of suboptimal algorithms is chosen in each round.

Lemma 14 Consider the algorithms MetaMax(∞) and MetaMax. Suppose Assumption 11 holds, and assume that $f^* - \hat{f}_R < \Delta/2$ for some $R > 0$. In any round $r > R$, if $h_r(n) = \alpha_r^n$ for some $0 < \alpha_r < 1$ and for all $n \ge 0$, then at most
\[
\left\lceil \frac{\ln \frac{2 f^* - \Delta}{\Delta}}{\ln(1/\alpha_r)} \right\rceil
\]
algorithms are chosen that have estimates $\hat{f}_j \le f^* - \Delta$.

Proof The proof follows from Lemma 5, taking into account that any suboptimal algorithm $A_j$ satisfies $\hat{f}^*_j \le f^* - \Delta$ and that at least one optimal algorithm is chosen in each round $r > R$: similarly to (23), $\varepsilon_r$ defined in Lemma 5 can be bounded as $\varepsilon_r > \Delta/(2 f^* - \Delta)$, and so the number of suboptimal algorithms used in round $r$ is bounded by
\[
\left\lceil \frac{\ln(1/\varepsilon_r)}{\ln(1/\alpha_r)} \right\rceil \le \left\lceil \frac{\ln \frac{2 f^* - \Delta}{\Delta}}{\ln(1/\alpha_r)} \right\rceil.
\]
$\Box$

Finally we can derive the convergence rate of the MetaMax algorithm. First we bound the number of steps taken by the currently best algorithm, both in terms of the number of rounds and the total number of steps taken by all the local search algorithms.

Theorem 15 Consider the MetaMax algorithm. At the end of any round $r$ the number of steps taken by the currently best algorithm is between $r$ and $2r$. That is,
\[ r \le n_{I_r,r} < 2r. \qquad (25) \]
Furthermore, the number of calls $n_{I_r,r}$ to evaluate $f$ by the currently best algorithm $A_{I_r}$ can be bounded by a function of the total number of times $t_r = \sum_{i=1}^r n_{i,r}$ the target function $f$ is evaluated by all local search instances, as
\[ n_{I_r,r} \ge \frac{\sqrt{2t_r+7}-1}{2}. \qquad (26) \]

Proof The first statement of the theorem is very simple, since in any round the currently best algorithm takes one step if there is no overtaking, and one or two steps if there is overtaking. Indeed, in any round $r \ge 2$, if there is no overtaking, that is, $I_r = I_{r-1}$, then $n_{I_r,r} = n_{I_r,r-1} + 1$. Otherwise, if $I_r \ne I_{r-1}$, then $n_{I_r,r} = n_{I_{r-1},r} + 1$, and since $0 \le n_{I_{r-1},r} - n_{I_{r-1},r-1} \le 1$, we have $1 \le n_{I_r,r} - n_{I_{r-1},r-1} \le 2$ in all situations. Since in the first round the only algorithm used clearly takes one step, that is, $n_{I_1,1} = 1$, (25) follows.

To prove the second part, notice that in any round $r$, at most $n_{I_{r-1},r-1} + 1$ algorithms can be stepped in step (c), as no algorithm can be used that has taken more steps than the currently best one. Also, in step (c') no extra samples are used if there is no overtaking. In case of overtaking, $A_{I_r}$ has to be advanced in step (c), as well as $A_{I_{r-1}}$, and so at most $n_{I_{r-1},r-1} + 1$ extra steps have to be taken by $A_{I_r}$. Therefore,
\[ t_r \le t_{r-1} + 2 n_{I_{r-1},r-1} + 2. \]
Thus, since no overtaking happens in round 1, we obtain
\[ t_r \le 1 + \sum_{s=2}^r 2\left(n_{I_{s-1},s-1} + 1\right). \]
Then, by (25), we have
\[ t_r \le 1 + 4\sum_{s=2}^r s = 1 + 2(r+2)(r-1) \le 1 + 2(n_{I_r,r}+2)(n_{I_r,r}-1), \]
which yields (26). $\Box$

Note that in the proof we used a crude estimate on the length of a usual round (without overtaking) relative to, for example, Lemma 14. This, however, affects the result only by a constant factor as long as we are not able to bound the number of rounds or the number of extra steps taken when overtaking happens, since the effect of the overtakings itself introduces the quadratic dependence in the proof of (26). Experimental results in Section 5 show (see Figure 10) that the number of algorithm instances (which is in turn the number $r$ of rounds) has a usual growth rate of $\Omega(t_r/\ln t_r)$, which, if taken into account, may sharpen the bound on how often the best algorithm is chosen.

Under Assumption 11, the random search component of MetaMax implies that eventually we will have an optimal algorithm that is the best. From that point on, the convergence rate of the optimal local search algorithms determines the performance of the search, and the number of steps taken by the best local search algorithm is bounded by Theorem 15.

Theorem 16 Suppose Assumption 11 holds. Then there is an almost surely finite random index $R$ such that for all rounds $r > R$, the estimate $\hat f_r$ of the MetaMax algorithm and the total number of steps $t_r$ taken by all local search algorithms up to the end of round $r$ satisfy
\[ f^* - \hat f_r \le g_\delta\left(\frac{\sqrt{2t_r+7}-1}{2}\right) \]
with probability at least $1-\delta$, where $g_\delta$ is defined by Lemma 1 for the global maximum $f^*$.
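As a quick numerical sanity check (ours, not part of the paper's argument), the quadratic inversion at the end of the proof of Theorem 15 — from $t_r \le 1 + 2(n+2)(n-1)$ with $n = n_{I_r,r}$ to the lower bound (26) — can be verified directly; the bound holds with equality at the worst case and is only looser for smaller $t_r$:

```python
import math

# Worst case from the proof of Theorem 15: t_r <= 1 + 2*(n + 2)*(n - 1),
# where n = n_{I_r, r}. Inverting this quadratic gives (26):
#     n >= (sqrt(2*t_r + 7) - 1) / 2.
for n in range(1, 10_001):
    t_worst = 1 + 2 * (n + 2) * (n - 1)          # largest t_r compatible with this n
    bound = (math.sqrt(2 * t_worst + 7) - 1) / 2  # right-hand side of (26)
    assert abs(n - bound) < 1e-9, (n, bound)      # equality at the worst case
```

(Here $2t_{\text{worst}} + 7 = (2n+1)^2$, so the square root is exact.)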
Remark 17 (i) The value of $R$ can be bounded with high probability using the properties of uniform random search for the actual problem; this would yield similar bounds as in Theorems 6 and 8 for the MetaMax(K) algorithm. (ii) Note the exploration-exploitation trade-off in the MetaMax algorithm: the value of $R$ is potentially decreased if we introduce new algorithms more often, while $n_{I_r,r}$ is reduced at the same time. (iii) Theorems 15 and 16 imply that, asymptotically, the MetaMax algorithm needs only quadratically more function evaluations than a local search algorithm that is ensured to converge to the optimum. In particular, if $f$ is of the form $f(x) = \sum_{i=1}^s f_i(x)\,\mathbb{I}_{S_i}(x)$, where the $S_i$ form a partition of $[0,1]^d$, $\mathbb{I}_{S_i}$ denotes the indicator function of $S_i$, and the $f_i$ belong to some nicely behaving function class such that a local search algorithm started in $S_i$ converges to the maximum of $f_i$ on $S_i$ (e.g., $f$ is a piecewise concave function with exponential convergence rate for the SPSA algorithm, which is used with a sufficiently small step size), then we can preserve the performance of a local search algorithm for the original function class at the price of an asymptotically quadratic increase in the number of function calls to evaluate $f$ (i.e., the total number of steps taken by all local search algorithm instances).

4.3 Discussion on the Results

In some sense the theoretical results presented in the previous sections are weak. The consistency result for the MetaMax(K) algorithm follows easily from the fact that each local search algorithm is used infinitely many times, and the consistency of MetaMax($\infty$) and MetaMax follows from the consistency of a random search.
The performance bounds provided have the disadvantage that they are asymptotic, in the sense that they hold only after a possibly large number of rounds (a weakness of the bounds is that the minimum number of rounds is obtained from properties of uniform random search/sampling for the particular problem, neglecting more attractive properties of the algorithms). In fact, it is quite easy to construct scheduling strategies that are consistent and in which, asymptotically, an arbitrarily large fraction of the function evaluations (even almost all) is used by optimal local search algorithms: the explore-and-exploit algorithms achieve both of these goals if the number of function evaluations to be used is known in advance and they use an arbitrarily small fraction of the evaluations of the target function $f$ for exploration. We compare the performance of our algorithms to such explore-and-exploit algorithms in Section 5. In particular, to match the performance guarantees for the MetaMax family, we use algorithms that spend half of their time on exploration and half on exploitation, where in the exploration part a uniform allocation strategy is used when there is a finite number of local search algorithms, and the schedule of Luby et al. (1993) is used for infinitely many local search algorithms. Although the theoretical guarantees proved in this paper for the MetaMax family also hold for these explore-and-exploit algorithms, in all experiments the MetaMax family appears to perform better than these algorithms, as expected.

The theoretical results also do not give sufficient guidance on how to choose the parameter $h$ or $h_r$ (the time-varying version of $h$ is not considered for the MetaMax(K) algorithm for simplicity and ease of presentation). Most of our results require a sufficiently fast exponential decay in $h$, which is problem dependent and cannot be determined in advance.
A sufficiently fast decay rate would ensure, for example, that for the MetaMax(K) algorithm we could always use the stronger results of Theorem 6 and would never have to deal with the case when only the bound of Theorem 8 holds. One may easily choose $h$ to be any function that decreases super-exponentially: this would make the asymptotic bounds work; however, it would slow down exploration (in the extreme case of $h_r(n) \equiv 0$, which is excluded by our conditions, no exploration would be performed, and the algorithms would use only the currently best local search algorithm). In practice we have always found it appropriate to choose $h_r$ to decay exponentially. Furthermore, we have found it even more effective to gradually decrease the decay rate to enhance exploration as time elapses (the rationale behind this approach is the assumption that good algorithms should have more or less converged after a while, and so there may be greater potential in exploration to improve our estimates). Finally, the connection between $g_\delta$ and $h$ should be investigated further.

Keeping the above limitations of the theoretical results in mind, we still believe that the theoretical analyses given provide important insight into the algorithms and may guide a potential user in practical applications, especially since the properties of the MetaMax family that can be proved in the asymptotic regime (e.g., that the rounds are quite short) can usually be observed in practice as well. Furthermore, we think that it is possible to improve the analysis to bound the thresholds from which the results become valid with reasonable values, but this would require a different approach and is therefore left for future work.

5. Experiments

The variants of the MetaMax algorithm are tested on synthetic and real examples.
Since there was only negligi bl e difference in the p erformance of Met aMax( ∞ ) and Met aMax , 6 in the follo wing w e presen t results only for Met aMax(K) and Met aMax . Firs t we demon- strate the p erformance of our algorithm in optimizing a synthetic function (using SPSA as the lo cal search algor i t hm) . Next the b ehavior of our algorithm is tested on standard data sets. W e show how Met aMax can b e applied for tu ni ng the parameters of mac hine learning algorithms: a cl assi fi cat i on task is solved b y a neural ne tw ork, and the parameters of the training algorithm (back-propagation) are fine-tuned b y Met aMax com bined with SPSA. Met aMax is also applied to b o ost the p erformance of k -means cluste r i ng. At the end of the section, w e compare the results of the exp eriments to the theoretical b ounds obtained in Section 4.2. In all the exp eriments, in accordance with our simplifying assumptions in tro duced in Section 3, the main difference b etw een the individual runs of the particular l o cal search algorithm is their starting p oint. Obviously , more general diversification tec hniques exist: for example, the parameters of the lo cal search algorithm cou l d also v ary from instance to instance (including running instances of different lo cal search algorithms, where a param- eter would select the actually employ ed search algorithm), and the initialization (starting p oint and par ame tr i z ati on ) of a new instance could also dep end on the results deliv ered by 6. 
For example, the relative difference between the average error $e_{\mathrm{MetaMax}(\infty)}$ of MetaMax($\infty$) and $e_{\mathrm{MetaMax}}$ of MetaMax in optimizing the parameters of a multi-layer perceptron for learning the letter data set (see Section 5.1 and especially Figure 6, right, for more details) was 0.033 with a standard deviation of 0.06 (averaged over 1000 experiments), where the relative difference is defined as $|e_{\mathrm{MetaMax}(\infty)} - e_{\mathrm{MetaMax}}| / \max(e_{\mathrm{MetaMax}(\infty)}, e_{\mathrm{MetaMax}})$.

existing instances. Although the MetaMax strategies could also be applied to these more general scenarios, their behavior can be studied better in the simpler scenario; hence, our experiments correspond only to this setup.

5.1 Optimizing Parameters with SPSA

In this section we compare the two versions of the MetaMax algorithm with six multi-start strategies, including three with a constant and three with a variable number of algorithm instances. The strategies are run for a fixed $T$ time steps; that is, the target function can be evaluated $T$ times in total by all local search instances (note that several reference strategies use $T$ as a parameter). We used SPSA (Simultaneous Perturbation Stochastic Approximation; Spall, 1992) as the base local search algorithm in all cases. SPSA is a local search algorithm with a sampling function that uses gradient descent with a stochastic approximation of the derivative: at the current location $X_t = (X_{t,1}, \ldots, X_{t,d})$, SPSA estimates the $l$th partial derivative of $f$ by
\[ \tilde f_{t,l}(X_t) = \frac{f(X_t + \phi_t B_t) - f(X_t - \phi_t B_t)}{2 \phi_t B_{t,l}}, \]
where the $B_{t,l}$ are i.i.d. Bernoulli random variables that form the components of the vector $B_t$, and then uses the sampling function $s_t(X_t) = X_t + a_t \tilde f_t(X_t)$ to choose the next point to be sampled, that is, $X_{t+1,l} = X_{t,l} + a_t \tilde f_{t,l}(X_t)$ for $l = 1, \ldots, d$ ($\phi_t$ and $a_t$ are scalar parameters).
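For concreteness, a single SPSA iteration as described above can be sketched as follows (our sketch, not the authors' implementation; `f` is the target function, `a_t` and `phi_t` the scalar gains, and we take the Bernoulli perturbations $B_{t,l}$ to be $\pm 1$-valued, as is standard for SPSA):

```python
import random

def spsa_step(f, x, a_t, phi_t, rng=random):
    """One SPSA iteration: simultaneous-perturbation derivative estimate,
    then a gradient-ascent move. Returns the next point and the value of f
    at the current point (which is also evaluated in the experiments)."""
    d = len(x)
    b = [rng.choice((-1.0, 1.0)) for _ in range(d)]          # components of B_t
    x_plus = [x[l] + phi_t * b[l] for l in range(d)]
    x_minus = [x[l] - phi_t * b[l] for l in range(d)]
    diff = f(x_plus) - f(x_minus)
    grad = [diff / (2.0 * phi_t * b[l]) for l in range(d)]   # estimate of the l-th partial derivative
    x_next = [x[l] + a_t * grad[l] for l in range(d)]        # sampling function s_t
    return x_next, f(x)
```

Since the objective here is maximized, the move is taken along the estimated gradient; each step costs three evaluations of $f$.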
In the implementation of the algorithm we followed the guidelines provided by Spall (1998), with the gain sequence $a_t = a/(A+t+1)^\alpha$ and perturbation size $\phi_t = \phi/(t+1)^\gamma$, where $A = 60$, $\alpha = 0.602$ and $\gamma = 0.101$. The values of $a$ and $\phi$ vary in the different experiments; they were chosen heuristically based on our experience with similar problems (this should not cause any problem here, as the goal of the experiments is not to provide fast solutions to the global optimization problems at hand but to demonstrate the behavior of the multi-start algorithms to be compared). In addition to the two evaluations required at the perturbed points, we also evaluate the function at the current point $X_t$. The starting point is chosen randomly, and the function is evaluated first at this point.

The six reference algorithms to which the MetaMax(K) and MetaMax algorithms are compared are the following:

Unif: This algorithm selects from a constant number of instances of SPSA uniformly. In our implementation the instance $I_t = t \bmod K$ is selected at time $t$, where $K$ denotes the number of instances.

ThrAsc: The Threshold Ascent algorithm of Streeter and Smith (2006b). The algorithm begins by selecting each of a fixed number of instances once. After this phase, at each time step $t$ ThrAsc selects the best $s$ estimates produced so far by all algorithm instances $A_i$, $i = 1, \ldots, K$, in all the previous time steps, and for each $A_i$ it counts how many of these estimates were produced by $A_i$. Denoting the latter value by $S_{i,t}$, at time $t$ the algorithm selects the instance with index $I_t = \operatorname*{argmax}_i U(S_{i,t}/n_{i,t}, n_{i,t})$, where $n_{i,t}$ is the number of times the $i$th instance has been selected up to time $t$,
\[ U(\mu, n) = \mu + \frac{\alpha + \sqrt{2 n \mu \alpha + \alpha^2}}{n}, \]
and $\alpha = \ln(2TK/\delta)$. Here $s$ and $\delta$ are the parameters of the algorithm; in the experiments the best value for $s$ appeared to be 100, while $\delta$ was set to 0.01. We note that Threshold Ascent was developed for the max $K$-armed bandit problem; nevertheless, it provides sufficiently good performance in our setup to justify testing it in our experiments.

Rand: The random search algorithm. It can be seen as running a sequence of SPSA algorithms such that each instance is used for exactly one step, which is the evaluation of the random starting point of the SPSA algorithm.

Luby: The algorithm based on the work of Luby et al. (1993). This method runs several instances of SPSA sequentially after each other, where the $i$th instance is run for $t_i$ steps, with $t_i$ defined by
\[ t_i = \begin{cases} 2^{k-1}, & \text{if } i = 2^k - 1; \\ t_{i - 2^{k-1} + 1}, & \text{if } 2^{k-1} \le i < 2^k - 1. \end{cases} \]
The above definition produces a schedule such that, from the first $2^k - 1$ algorithm instances, one is run for $2^{k-1}$ steps, two for $2^{k-2}$ steps, four for $2^{k-3}$ steps, and so on.

EE-Unif: This algorithm is an instance of explore-and-exploit algorithms. For the first $T/2$ steps the Unif algorithm is used for exploration, and subsequently, in the exploitation phase, the SPSA instance that achieved the highest value in the exploration phase is selected.

EE-Luby: This algorithm is similar to EE-Unif, except that Luby is used for exploration.

Both versions of the MetaMax algorithm were tested. Motivated by the fact that SPSA is known to converge to a global optimum exponentially fast if $f$ satisfies some restrictive conditions (Gerencsér & Vágó, 2001), we chose an $h_r(n)$ that decays exponentially fast.
To control the exploration of the so-far suboptimal algorithm instances, we allowed $h_r(n)$ to be a time-varying function; that is, it changes with $t_r$, the total number of function calls to evaluate $f$ (or, equivalently, the total number of steps taken) by all algorithms so far. Thus, at round $r+1$ we used
\[ h_r(n) = e^{-n/\sqrt{t_r}} \qquad (27) \]
(note that we used the time-varying version of $h_r$ also in the case of MetaMax(K); the latter can easily be extended to this situation, but this has been omitted to simplify the presentation). For the algorithms with a fixed number of local search instances (MetaMax(K), Unif, EE-Unif, and ThrAsc), the number of instances $K$ was set to 100 in the simulations, as this choice provided reasonably good performance in all problems analyzed.

The multi-start algorithms were tested using two versions of a synthetic function, and by tuning the parameters of a learning algorithm on two standard data sets.

The synthetic function was a slightly modified7 version of the Griewank function (Griewank, 1981):
\[ f(x) = \prod_{l=1}^d \cos\frac{2\pi x_l}{\sqrt{l}} - \sum_{l=1}^d \frac{4\pi^2 x_l^2}{100}, \]
where $x = (x_1, \ldots, x_d)$ and the $x_l$ were constrained to the interval $[-1, 1]$. We show the results for the 2-dimensional and the 10-dimensional cases. The parameters of SPSA were $a = 0.05$ and $\phi = 0.1$ for the 2-dimensional case, and $a = 0.5$ and $\phi = 0.1$ for the 10-dimensional case. The performance of the search algorithms was measured by the error, defined as the difference between the maximum value of the function (in this case 1) and the best result obtained by the search algorithm in a given number of steps. The results for the above multi-start strategies on the two- and the 10-dimensional test functions are shown in Figure 5. Each error curve is averaged over 10,000 runs, and each strategy was run for 100,000 steps (or iterations).
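As an aside, the run lengths $t_i$ of the Luby et al. schedule used by the Luby and EE-Luby strategies above can be computed directly from the recursive definition; a short sketch (the function name is ours):

```python
def luby_run_length(i):
    """Run length t_i of the i-th instance (1-indexed) under the
    Luby et al. (1993) schedule:
        t_i = 2^(k-1)               if i = 2^k - 1,
        t_i = t_{i - 2^(k-1) + 1}   if 2^(k-1) <= i < 2^k - 1.
    """
    k = 1
    while (1 << k) - 1 < i:      # smallest k such that i <= 2^k - 1
        k += 1
    if i == (1 << k) - 1:
        return 1 << (k - 1)
    return luby_run_length(i - (1 << (k - 1)) + 1)

print([luby_run_length(i) for i in range(1, 16)])
# -> [1, 1, 2, 1, 1, 2, 4, 1, 1, 2, 1, 1, 2, 4, 8]
```

Among the first $2^k - 1$ instances, one runs for $2^{k-1}$ steps, two for $2^{k-2}$, and so on, matching the description above.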
One may observe that in both cases the two versions of the MetaMax algorithm converge the fastest. ThrAsc is better than Unif, while Luby seems fairly competitive with these two. The two explore-and-exploit-type algorithms (EE-Unif and EE-Luby) have similar performance on the 2-dimensional function, clearly better than their 'non-exploiting' base algorithms, but on the 10-dimensional function their behavior is somewhat pathological in the sense that for low values of $T$ their performance is the best among all algorithms, but with increasing $T$ the error actually increases, such that their respective base algorithms achieve smaller errors for some values of $T$. Random search seems an option only for the 2-dimensional function. Similar results were obtained for dimensions between 2 and 10. The pathological behavior of the explore-and-exploit algorithms starts to appear gradually from the 5-dimensional function onwards, and it is pronounced from 8 dimensions on. Limited experimental data obtained for higher dimensions up to 100 (averaged over only a few hundred runs) shows that the superiority of MetaMax is preserved for high-dimensional problems as well.

The reason for the pathological behavior of the explore-and-exploit strategies (i.e., why the error curves are not monotone decreasing in the number of iterations) can be illustrated as follows. Assume we have two SPSA instances, one converging to the global optimum and another one converging to a suboptimal local optimum. Assume that in the first few steps the optimal algorithm gives the better result, then the suboptimal algorithm takes over and reaches its local maximum, while if the algorithms are run even further, the optimal algorithm beats the suboptimal one. If the exploration is stopped in the first or the last regime, an explore-and-exploit algorithm will choose the first, optimal local search instance, whose performance may get quite close to the global optimum in the exploitation phase (even if it is stopped in the first regime). If the exploration is stopped in the middle regime, the suboptimal search instance will be selected for exploitation, whose performance may not even get close to the global optimum. In this scenario, the error after the exploitation phase (i.e., at the end) is lower if $T$ is small, and increases with higher values of $T$. A decrease in the error with increasing $T$ is assured only when the optimal instance converges in the exploration phase past all suboptimal local optima, which results in selecting the optimal local search instance for exploitation. In the above scenario the error will decrease fast

7. The modification was made in order to have more significant differences between the values of the function at the global maximum and at other local maxima.

Figure 5: The average error for the multi-start strategies on the 2-dimensional (left) and 10-dimensional (right) modified Griewank function. 99% confidence intervals are shown with the same color as the corresponding curves. Note that most of the intervals are very small.
Figure 6: The average error for the multi-start strategies on tuning the parameters of Multilayer Perceptron for the vehicle data set (left) and the letter data set (right). 99% confidence intervals are also shown with the same color as the corresponding curves.

initially, then increase for a while, and then it may decrease again until it converges to 0, which is quite similar to what we observe in Figure 5, right. This pathological behavior becomes more apparent if there are many local search algorithms, as the length of the exploitation phase scales with the number of local search instances if the length of the exploration for each instance is kept fixed. Analyzing the experimental data shows that more complex versions of the scenario outlined above occurred in the simulations and are the main cause of the observed pathological behavior (the non-monotonicity of the error curves).

For tuning the parameters of a learning algorithm, we used two standard data sets from the UCI Machine Learning Repository (Asuncion & Newman, 2007), vehicle and letter, and the Multilayer Perceptron learning algorithm of Weka (Witten & Frank, 2005) (here the back-propagation algorithm is used in the training phase). Two parameters were tuned: the learning rate and the momentum, both in the range $[0, 1]$. The size of the hidden layer of the Multilayer Perceptron was set to 8, and the number of epochs to 100. The parameters of the SPSA algorithm were $a = 0.5$ and $\phi = 0.1$, the same as for the 10-dimensional Griewank function (as in the previous experiment, the parameters were chosen based on our experience). The rate of correctly classified items on the test set for vehicle using Multilayer Perceptron with varying values of the two parameters is shown in Figure 7, with the highest rate being 0.910112. Similarly, the classification rate for letter is shown in Figure 8, with the highest rate being 0.7505.

The error rates of the optimized Multilayer Perceptron on the data sets vehicle and letter are shown in Figure 6, where the parameters of the learning algorithm were tuned by the multi-start strategies above. The error in these cases is the difference between the best classification rate that can be obtained (0.910112 and 0.7505, respectively) and the best classification rate obtained by the multi-start strategies in a given number of steps. The results shown are averaged over 1,000 runs. We observe that the MetaMax algorithm (with an increasing number of algorithm instances) converged fastest on average, the three strategies with a fixed number of algorithm instances had nearly identical results, Luby (and its explore-and-exploit variant) was slightly worse than these, and random search was the slowest, although it performed not nearly as badly as for the synthetic functions. The reason why random search had relatively better performance (relative to those that used SPSA) could be twofold: (i) large parts of the error surface offer fairly small error, and (ii) the error surface is less smooth, and therefore SPSA is less successful in using gradient information. The explore-and-exploit variants performed well on the vehicle data set initially, but their performance worsened for larger values of $T$ (compared to MetaMax, and to some extent to the other algorithms).
This, coupled with the observation for Figure 5, right, would suggest that explore-and-exploit variants are more competitive for small values of $T$, despite their asymptotic guarantees.

In summary, the MetaMax algorithm (with an increasing number of algorithm instances) provided by far the best performance in all tests, usually requiring significantly fewer steps to find the optimum than the other algorithms. For example, for the letter data set only the MetaMax algorithm found the global optimum in all runs within 100,000 time steps. We can conclude that MetaMax converged faster than the other multi-start strategies investigated in all four test cases, with a notable advantage on the difficult surfaces (at least from a gradient-based optimization viewpoint) induced by the classification tasks.

5.2 k-Means Clustering

In this section we consider the problem of partitioning a set of $M$ $d$-dimensional real vectors $x_j \in \mathbb{R}^d$, $j = 1, \ldots, M$, into $N$ clusters, where each cluster $S_i$ is represented by a center (or reconstruction) point $c_i \in \mathbb{R}^d$, $i = 1, \ldots, N$. The cost function to be minimized is the sum of the distances $\rho(x, c_i)$ from the data points to the corresponding centers; that is, we want to minimize $\sum_{i=1}^N \sum_{x \in S_i} \rho(x, c_i)$. There are two necessary conditions for optimality (see, e.g., Linde, Buzo, & Gray, 1980; Gersho & Gray, 1992): for all $i = 1, \ldots, N$,
\[ S_i = \{x : \rho(x, c_i) \le \rho(x, c_j) \text{ for all } j = 1, \ldots, N\} \qquad (28) \]
(with ties broken arbitrarily) and
\[ c_i = \operatorname*{argmin}_{c \in \mathbb{R}^d} \sum_{x \in S_i} \rho(x, c). \qquad (29) \]

Figure 7: Classification rate on the vehicle data set. The rates are plotted by subtracting them from 0.911112, and thus global optima are at the scattered black spots corresponding to a value equal to 0.001.

Figure 8: Classification rate on the letter data set. The rates are plotted by subtracting them from 0.7515, and thus global optima are at the scattered black spots corresponding to a value equal to 0.001.

A usual choice for $\rho$ is the squared Euclidean distance, in which case $c_i = \frac{\sum_{x \in S_i} x}{|S_i|}$.8 According to the above necessary conditions, the k-means algorithm (or Generalized Lloyd algorithm; see, e.g., Linde et al., 1980; Gersho & Gray, 1992) alternates between partitioning the data set according to (28) with the centers being fixed, and recomputing the centers by (29) while the partitioning is kept fixed. It can easily be seen that the cost (or error) cannot increase in any of the above steps; hence the algorithm converges to a local minimum of the cost function. In practice, the algorithm stops when there is no (or insufficient) decrease in the cost function. However, the k-means algorithm is often trapped in a local optimum, whose value is influenced by the initial set of centers. As with SPSA, restarting k-means with a different initialization may result in finding the global optimum. We consider two initialization techniques: the first, termed k-means, chooses the centers uniformly at random from the data points; the second, k-means++ (Arthur & Vassilvitskii, 2007), chooses an initial center uniformly at random from the data set, and then chooses the further centers from the data points with a probability that is proportional to the distance between the data point and the closest center already selected. The k-means algorithm usually terminates after a relatively small number of steps, and thus multi-start strategies with a bounded number of instances would run out of active local search algorithms, and therefore do not appear particularly attractive.
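For reference, a minimal sketch of the k-means++ seeding rule just described (ours, not the authors' code; we use the $D^2$ weighting of Arthur & Vassilvitskii, 2007, i.e., squared Euclidean distances, and the function names are our own):

```python
import random

def squared_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_centers(points, n_centers, rng=None):
    """Pick the first center uniformly at random, then each further center
    from the data points with probability proportional to the squared
    distance to the closest center chosen so far (D^2 weighting)."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]
    while len(centers) < n_centers:
        # Weight of each point: squared distance to its nearest chosen center.
        weights = [min(squared_dist(p, c) for c in centers) for p in points]
        total = sum(weights)
        if total == 0:                   # all points coincide with chosen centers
            centers.append(rng.choice(points))
            continue
        r = rng.random() * total         # sample a point proportionally to its weight
        acc = 0.0
        for p, w in zip(points, weights):
            acc += w
            if acc > r:
                centers.append(p)
                break
    return centers
```

With this seeding, points far from all chosen centers are much more likely to become the next center, which is what makes the initialization "more clever" than uniform sampling.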
However, this is a natural domain in which to consider the strategy that starts a new instance when the previous one has finished. This strategy will be referred to subsequently as Serial. For the above-mentioned reasons, we test only MetaMax, the variant of our algorithms applicable to an unbounded number of instances.9 As in the experiments with SPSA, we used $h_r$ as in (27). Note that some theoretical results indicate that k-means may converge at an exponential rate (in particular, Kieffer, 1982, showed that the rate of convergence is exponential for random variables with log-concave densities in one dimension, provided that the logarithm of the density is not piecewise affine).

The two multi-start strategies, Serial and MetaMax, were tested on the data set cloud from the UCI Machine Learning Repository (Asuncion & Newman, 2007). The data set was employed by Arthur and Vassilvitskii (2007) as well. The number of clusters was set to ten. The performance of the multi-start strategies is defined as the difference between the smallest cost function value obtained by the strategy in a given number of steps and the smallest cost seen in any of the experiments (5626.6357). The results, averaged over 1,000 runs, are plotted in Figure 9. With both initialization methods the MetaMax strategy converges faster than the Serial strategy. We note that for this data set, k-means++ with its more clever initialization procedure yields faster convergence than the standard k-means with its uniform initialization, which is consistent with the results presented by Arthur and Vassilvitskii (2007).

8. An extension to clustering random variables is well-known and straightforward, but is omitted here because in this paper we only consider clustering finite data sets.

9.
Note that in the MetaMax algorithm we have to make the practical modification that if a local search algorithm has terminated then it is not chosen anymore. This clearly improves the performance, as an algorithm is not chosen again when no improvement can be observed.

Figure 9: The average error for the multi-start strategies with k-means (left) and k-means++ (right). 99% confidence intervals are shown with the same color as the corresponding curves.

5.3 Practical Considerations

In all experiments with the MetaMax algorithm presented above, we observed that the number of algorithm instances $r$ (shown in Figure 10) grows at a rate of $\Omega(t_r/\ln t_r)$ (recall that $t_r$ is the total number of function calls to evaluate $f$, or the total number of steps, by all algorithm instances by the end of round $r$). On the other hand, in the derivation of the theoretical bounds (see Theorems 15 and 16) we used a bound $r \ge \Omega(\sqrt{t_r})$. In contrast to the quadratic penalty suggested by Theorem 16, plugging the $\Omega(t_r/\ln t_r)$ estimate of $r$ into the theorem, we would find that only a logarithmic factor more calls to evaluate $f$ (total number of steps) are needed to achieve the performance of a search algorithm started from the attraction region of the optimum.

Finally, perhaps the main practical question concerning the MetaMax family of multi-start algorithms is to decide when to use them. As a rule of thumb, we would say that there should be a sufficiently large performance difference between an "average" run of the local search algorithm and the best one.
Clearly, if a single local search produces an acceptable result, then it is not worth the effort to run several instances of the local search, especially not with a complicated schedule. In many real problems it is often relatively easy to get close to the optimum, which may be acceptable for some applications, but approaching the optimum with greater precision is hard; if the latter is important, the MetaMax algorithm and its variants may be very useful.

Last, one may wonder about the computational costs of our algorithms. As discussed before, we consider the case when the evaluation of the target function is very expensive: this is clearly not the case for the Griewank function, which is only used to demonstrate basic properties of the algorithm, but it holds for many optimization problems in practice, including all other experiments considered in this paper. In these problems the function evaluation is indeed expensive (and depends on the available data), while the overhead introduced by the MetaMax algorithms depends on the number of rounds. For the MetaMax(K) algorithm we have to find the upper convex hull of a set of K points in each round; in the worst case this can take as long as O(K²) calculations, but in practice it is usually much cheaper, as the upper convex hull is determined by the point that corresponds to the currently best estimate and by the point that corresponds to the least used algorithm, which requires only O(K) computations, or even less if some special ordering tricks are introduced. Since the target function f is evaluated at least twice in each round, on average at most O(K²) computational overhead is needed for each evaluation of f in the worst case, which in practice reduces to O(K) or even less. Similar considerations hold for the MetaMax(∞) and MetaMax algorithms, resulting in an average O(r²) worst-case overhead for each call to f (in r rounds), which is closer to O(r) or even less in practice. In all examples we considered (apart from the case of the Griewank function), this overhead was negligible relative to the computational resources needed to evaluate f at a single point.

Figure 10: Number of algorithm instances (r) for MetaMax. The average number of instances is shown on the six benchmarks: the Griewank function (2- and 10-dimensional), parameter tuning of the multilayer perceptron (on the vehicle and on the letter data sets), and clustering with k-means and k-means++. The maximum and minimum number of instances over all runs of all benchmarks are also shown. One can notice that for larger values of t_r, 0.45 t_r / ln t_r ≤ r ≤ 1.65 t_r / ln t_r.

6. Conclusions

In this paper we provided multi-start strategies for local search algorithms. The strategies continuously estimate the potential performance of each algorithm instance in an optimistic way, by supposing a convergence rate of the local search algorithms up to an unknown constant, and in every phase resources are allocated to those instances that could converge to the optimum for a particular range of the constant. Three versions of the algorithm were presented: one that is able to follow the performance of the best of a fixed number of local search algorithm instances, and two that, by gradually increasing the number of local search algorithm instances, achieve global consistency. A theoretical analysis of the asymptotic behavior of the algorithms was also given.
Specifically, under some mild conditions on the function to be maximized (e.g., the set of values of the local maxima is not dense at the global maximum), our best algorithm, MetaMax, preserves the performance of a local search algorithm for the original function class with at most a quadratic increase in the number of times the target function needs to be evaluated (asymptotically). Simulations demonstrate that the algorithms work quite well in practice. While the theoretical bound suggests that the target function has to be evaluated a quadratic factor more times to achieve the performance of a search algorithm started from the attraction region of the optimum, in the experiments we found only a logarithmic penalty. It is not clear whether this difference is the result of our slightly conservative (asymptotic) analysis or of the choice of the experimental settings. Also, a finite-sample analysis of the algorithm is of interest, as the experiments indicate that the MetaMax algorithm provides good performance even for a relatively small number of steps taken by the local search algorithms, in the sense that it provides a speed-up compared to other approaches even if the number of times the target function can be evaluated (i.e., the total number of steps that can be taken by all algorithms together) is relatively small. Finally, future work is needed to clarify the connection between the convergence rate of the optimal algorithms (g_δ) and the function h_r used in the exploration.

Acknowledgments

The authors would like to thank the anonymous referees for their numerous insightful and constructive comments. This research was supported in part by the Mobile Innovation Center of Hungary, by the National Development Agency of Hungary from the Research and Technological Innovation Fund (KTIA-OTKA CNK 77782), and by the PASCAL2 Network of Excellence (EC grant no. 216886). Parts of this paper were presented at ECML 2009 (Kocsis & György, 2009).

Appendix A. Proof of Lemma 1

Fix δ ∈ (0, 1) and let U_n = f̂* − f(X̂_n). Since U_n → 0 almost everywhere on E, Egoroff's theorem (see, e.g., Ash & Doléans-Dade, 2000) implies that there is an event E_δ ⊆ E with 1 − P(E_δ) < δ such that U_n → 0 uniformly almost everywhere on E_δ. The second part of the lemma follows from the definition of uniform convergence. □

References

Adam, K. (2001). Learning while searching for the best alternative. Journal of Economic Theory, 101, 252–280.

Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035.

Ash, R. B., & Doléans-Dade, C. A. (2000). Probability & Measure Theory. Academic Press.

Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository.

Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3), 235–256.

Bartz-Beielstein, T. (2006). Experimental Research in Evolutionary Computation – The New Experimentalism. Natural Computing Series. Springer, New York.

Battiti, R., Brunato, M., & Mascia, F. (2008). Reactive Search and Intelligent Optimization, Vol. 45 of Operations Research/Computer Science Interfaces. Springer Verlag.

Beck, C. J., & Freuder, E. C. (2004). Simple rules for low-knowledge algorithm selection. In Régin, J. C., & Rueher, M. (Eds.), CPAIOR, Lecture Notes in Computer Science 3011, pp. 50–64. Springer.

Carchrae, T., & Beck, J. C. (2004). Low-knowledge algorithm control. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI), pp. 49–54.

Cicirello, V. A., & Smith, S. F. (2004). Heuristic selection for stochastic search optimization: Modeling solution quality by extreme value theory. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming, pp. 197–211. Springer.

Cicirello, V. A., & Smith, S. F. (2005). The max k-armed bandit: A new model of exploration applied to search heuristic selection. In Proceedings of the Twentieth National Conference on Artificial Intelligence, pp. 1355–1361.

Finkel, D. E., & Kelley, C. T. (2004). Convergence analysis of the DIRECT algorithm. Tech. rep. CRSC-TR04-28, NCSU Mathematics Department.

Gagliolo, M., & Schmidhuber, J. (2006). Learning dynamic algorithm portfolios. Annals of Mathematics and Artificial Intelligence, 47(3–4), 295–328. AI&MATH 2006 Special Issue.

Gagliolo, M., & Schmidhuber, J. (2007). Learning restart strategies. In Veloso, M. M. (Ed.), IJCAI 2007 — Twentieth International Joint Conference on Artificial Intelligence, Vol. 1, pp. 792–797. AAAI Press.

Gagliolo, M., & Schmidhuber, J. (2010). Algorithm selection as a bandit problem with unbounded losses. In Blum, C., & Battiti, R. (Eds.), Learning and Intelligent Optimization, Vol. 6073 of Lecture Notes in Computer Science, pp. 82–96. Springer Berlin/Heidelberg.

Gerencsér, L., & Vágó, Z. (2001). The mathematics of noise-free SPSA. In Proceedings of the IEEE Conference on Decision and Control, pp. 4400–4405.

Gersho, A., & Gray, R. M. (1992). Vector Quantization and Signal Compression. Kluwer, Boston.

Griewank, A. O. (1981). Generalized descent for global optimization. Journal of Optimization Theory and Applications, 34, 11–39.

Hart, P., Nilsson, N., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107.
Hoos, H. H., & Stützle, T. (1999). Towards a characterisation of the behaviour of stochastic local search algorithms for SAT. Artificial Intelligence, 112, 213–232.

Horn, M. (2006). Optimal algorithms for global optimization in case of unknown Lipschitz constant. Journal of Complexity, 22(1), 50–70.

Hutter, F., Hoos, H. H., Leyton-Brown, K., & Stützle, T. (2009). ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36(1), 267–306.

Jones, D. R., Perttunen, C. D., & Stuckman, B. E. (1993). Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1), 157–181.

Kautz, H., Horvitz, E., Ruan, Y., Gomes, C., & Selman, B. (2002). Dynamic restart policies. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI), pp. 674–681.

Kieffer, J. C. (1982). Exponential rate of convergence for Lloyd's method I. IEEE Transactions on Information Theory, IT-28, 205–210.

Kocsis, L., & György, A. (2009). Efficient multi-start strategies for local search algorithms. In Buntine, W., Grobelnik, M., Mladenic, D., & Shawe-Taylor, J. (Eds.), Machine Learning and Knowledge Discovery in Databases, Vol. 5781 of Lecture Notes in Computer Science, pp. 705–720. Springer Berlin/Heidelberg.

Linde, Y., Buzo, A., & Gray, R. M. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28, 84–95.

Luby, M., Sinclair, A., & Zuckerman, D. (1993). Optimal speedup of Las Vegas algorithms. Information Processing Letters, 47, 173–180.

Martí, R., Moreno-Vega, J., & Duarte, A. (2010). Advanced multi-start methods. In Gendreau, M., & Potvin, J.-Y. (Eds.), Handbook of Metaheuristics (2nd edition). Springer.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers.

Ribeiro, C., Rosseti, I., & Vallejos, R. (2009). On the use of run time distributions to evaluate and compare stochastic local search algorithms. In Stützle, T., Birattari, M., & Hoos, H. (Eds.), Engineering Stochastic Local Search Algorithms. Designing, Implementing and Analyzing Effective Heuristics, Vol. 5752 of Lecture Notes in Computer Science, pp. 16–30. Springer Berlin/Heidelberg.

Spall, J., Hill, S., & Stark, D. (2006). Theoretical framework for comparing several stochastic optimization approaches. In Calafiore, G., & Dabbene, F. (Eds.), Probabilistic and Randomized Methods for Design under Uncertainty, chap. 3, pp. 99–117. Springer-Verlag, London.

Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37, 332–341.

Spall, J. C. (1998). Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Transactions on Aerospace and Electronic Systems, 34, 817–823.

Streeter, M. J., & Smith, S. F. (2006a). An asymptotically optimal algorithm for the max k-armed bandit problem. In Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference, pp. 135–142.

Streeter, M. J., & Smith, S. F. (2006b). A simple distribution-free approach to the max k-armed bandit problem. In Principles and Practice of Constraint Programming – CP 2006, 12th International Conference, CP 2006, Nantes, France, September 25-29, 2006, Proceedings, pp. 560–574.

Vilalta, R., & Drissi, Y. (2002). A perspective view and survey of meta-learning. Artificial Intelligence Review, 18(2), 77–95.

Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd edition). Morgan Kaufmann, San Francisco.

Zabinsky, Z. B., Bulger, D., & Khompatraporn, C. (2010). Stopping and restarting strategy for stochastic sequential search in global optimization. Journal of Global Optimization, 46(2), 273–286.
