NAPX: A Polynomial Time Approximation Scheme for the Noah's Ark Problem

Glenn Hickey (1), Paz Carmi (1), Anil Maheshwari (1), and Norbert Zeh (2)

(1) School of Computer Science, Carleton University, Ottawa, Ontario, Canada
(2) Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
ghickey@scs.carleton.ca, carmip@gmail.com, anil@scs.carleton.ca, nzeh@cs.dal.ca

Abstract. The Noah's Ark Problem (NAP) is an NP-Hard optimization problem with relevance to ecological conservation management. It asks to maximize the phylogenetic diversity (PD) of a set of taxa given a fixed budget, where each taxon is associated with a cost of conservation and a probability of extinction. NAP has received renewed interest with the rise in availability of genetic sequence data, allowing PD to be used as a practical measure of biodiversity. However, only simplified instances of the problem, where one or more parameters are fixed as constants, have as yet been addressed in the literature. We present NAPX, the first algorithm for the general version of NAP that returns a $(1-\epsilon)$-approximation of the optimal solution. It runs in $O\left(\frac{nB^2h^2\left(\log n+\log\frac{1}{\epsilon}\right)^2}{\log^2(1-\epsilon)}\right)$ time, where $n$ is the number of species, $B$ is the total budget, and $h$ is the height of the input tree. We also provide improved bounds for its expected running time.

Key words: Noah's Ark Problem, phylogenetic diversity, approximation algorithm

1 Introduction

1.1 Motivation

Measures of biodiversity are commonly used as indicators of environmental health. Biodiversity is presently being lost at an alarming rate, due largely to human activity. It is speculated that this loss can lead to disastrous consequences if left unchecked [9].
Consequently, the discipline of conservation biology has arisen and a considerable amount of resources are being allocated to research and implement conservation projects around the world. A conservation strategy will necessarily depend on the measure of biodiversity used. Traditionally, indices based on species richness and abundance have been used to quantify the biodiversity of an ecosystem [8]. These indices are based on counting and do not account for genetic variance. Phylogenetic diversity (PD) [3] addresses this issue by taking into account evolutionary relationships derived from DNA or protein samples. The use of PD in biological conservation has become increasingly widespread as more phylogenetic information becomes available [7]. It is also used to determine diverse sequence samples in comparative genomics [10].

The Noah's Ark Problem (NAP) [13] is an abstraction of the fundamental problem of many conservation projects: how best to allocate a limited amount of resources to maximally conserve phylogenetic diversity. This is in turn a generalization of the Knapsack Problem [4] and is therefore NP-Hard. Several algorithms have been proposed to solve special cases of the problem but, as yet, no non-heuristic solutions have been proposed to solve general instances of NAP. Given that NAP itself is an abstraction of realistic scenarios, it is important to have a general solution in order to be able to extend this framework to useful applications. For this reason, we present an algorithm that can be used to compute an approximate solution for NAP in polynomial time, so long as the approximation factor is held constant and the total budget is polynomial in the input size.

1.2 Definitions

Throughout this paper, we use the following definition of a phylogenetic tree $T$, with notation consistent with that of [6].
$T$ has a root of degree 2, interior vertices of degree 3, and $n$ leaves, each associated with a species from the set $X$. If an edge $e$ of $T$ is incident to a leaf, it is called a pendant edge. Otherwise $e$ has exactly two adjacent edges, $l$ and $r$, below it (not on the path from $e$ to the root), and these are referred to as $e$'s children. $\lambda$ is a function that assigns a non-negative branch length to each edge in $T$. The phylogenetic diversity of $T$, $PD(T)$, is defined as

$$PD(T) = \sum_{e} \lambda(e), \tag{1}$$

where the summation is over each edge $e$ of the tree. Intuitively, this measure corresponds to the amount of evolutionary history represented by $T$.

The Noah's Ark Problem has the objective of maximizing the expected PD, $E(PD)$, under the following constraints. Each taxon $i \in X$ is associated with an initial survival probability $a_i$, which can be increased to $b_i$ at some integer cost $c_i$, and the total expenditure cannot exceed the budget $B$. Since $B$ is a factor in the running time, we assume that the budget and each cost have been divided by the greatest common divisor of all the costs. In the original formulation of NAP, each species was also associated with a utility value. However, in [6] it was shown that these values are redundant, as they can be incorporated into the branch lengths without altering the problem. To avoid accounting for degenerately small probability values, we make the assumption that the conserved survival probabilities are not exponentially small in $n$. In other words, there exists a constant $k$ such that $b_i \ge n^{-k}$ for each $i \in X$. We feel this assumption is reasonable, as it is unrealistic that money would be allocated to obtain such a negligible probability of survival.

If a species survives, the information represented by its path to the root is conserved.
Consequently, the probability that an edge survives is equivalent to the probability that at least one leaf below it in $T$ survives. Let $C_e$ be the set of leaves below $e$ in the tree and $S$ be the set of species selected for protection. $E(PD \mid S)$ can be derived from (1) as follows:

$$E(PD \mid S) = \sum_{e} \lambda(e)\left(1 - \prod_{i \in C_e \cap S}(1-b_i)\prod_{j \in C_e - S}(1-a_j)\right), \tag{2}$$

where the summation is over all edges. NAP asks to maximize $E(PD \mid S)$ subject to

$$\sum_{s \in S} c_s \le B.$$

Our algorithm is based on decomposing $T$ into clades, which are associated with the edges of the tree. A clade corresponding to edge $e$, denoted $K_e$, is the minimal subtree of $T$ containing $e$ and $C_e$, the set of leaves below it. The $E(PD)$ of $K_e$ can be computed as in (2), but summing only over edges in the clade. The entire tree can be considered a clade by attaching an edge of length 0 to its root. If $e$ has two descendant edges $l$ and $r$, then we say $K_e$ has two child clades $K_l$ and $K_r$.

1.3 Related Work

Let $a_i \xrightarrow{c_i} b_i$ NAP refer to the problem as described above, where the survival probabilities and cost of each taxon are input variables. Fixing one or more of these variables as constants produces a hierarchy of increasingly simpler subproblems [11]. The simplest, $0 \xrightarrow{1} 1$ NAP, is equivalent to finding the set of $B$ leaves whose induced subtree (including the root) has maximum PD and can be solved by a greedy algorithm [12] [10]. $0 \xrightarrow{c_i} 1$ NAP on ultrametric trees (all leaves equidistant from the root) and $(1-x_i) \xrightarrow{1} (1-\kappa x_i)$ NAP on general trees, where $x_i$ is a variable probability and $\kappa$ is a constant factor such that $0 \le \kappa \le 1$, can likewise be solved in polynomial time by greedy algorithms [6].
Given that $0 \xrightarrow{c_i} 1$ NAP is itself a generalization of the Knapsack problem, which is NP-Hard, it is extremely unlikely that an exact, polynomial-time solution for this kind of NAP or any generalizations will ever be found. Pardi and Goldman [11] did find a pseudopolynomial-time dynamic programming algorithm for $0 \xrightarrow{c_i} 1$ NAP on general (non-ultrametric) trees that makes the realistic assumption that $B$ is polynomial in $n$. They also show that any instance of $a_i \xrightarrow{c_i} 1$ NAP can be transformed to an instance of $0 \xrightarrow{c_i} 1$ NAP, allowing their algorithm to solve such instances as well.

This algorithm relies upon the observation that the solution to $0 \xrightarrow{c_i} 1$ NAP for any clade can be obtained from the solutions to its two child clades [11]. Which solutions to use depends on how the budget is allocated to the two subproblems. If the budget at $K_e$ is $b$, then there are $b+1$ ways to split it across $K_l$ and $K_r$. By solving these $b+1$ pairs of subproblems, the optimal solution can be found in the pair with maximum total $E(PD)$ (plus the expected contribution of $e$). Recursively proceeding in this fashion from the root down would not yield an efficient algorithm, as the number of possible budget divisions increases exponentially with each level of the tree. Instead, the clades are processed bottom-up from the leaves. All $b+1$ scores are computed and stored in a dynamic programming table for each clade. Each score can be determined by taking the maximum of $b+1$ possible scores of its child clades, which are already computed, or is computed directly from (2) if the clade contains a single leaf. Each table entry can therefore be computed in $O(B)$ time. There are $O(B)$ entries per clade and $O(n)$ clades in the tree, giving a total running time of $O(nB^2)$.
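The bottom-up table construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [11]: the `Clade` class, the function `pd_table`, and the example tree are hypothetical, every species has $a_s = 0$ and $b_s = 1$, and each table entry is read as the best PD achievable with at most $b$ dollars.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clade:
    """Hypothetical container: the edge e above a clade plus its children."""
    length: float                     # branch length lambda(e)
    cost: Optional[int] = None        # c_s for a pendant edge, None otherwise
    left: Optional["Clade"] = None
    right: Optional["Clade"] = None


def pd_table(clade: Clade, B: int) -> Tuple[List[float], int]:
    """Return (table, min_cost) where table[b] is the best PD of the clade
    when at most b dollars are spent inside it (0 -> 1 NAP only)."""
    if clade.cost is not None:
        # Pendant edge: the leaf survives (probability 1) iff it is paid for.
        table = [clade.length if b >= clade.cost else 0.0 for b in range(B + 1)]
        return table, clade.cost

    left, left_min = pd_table(clade.left, B)
    right, right_min = pd_table(clade.right, B)
    min_cost = min(left_min, right_min)

    table = []
    for b in range(B + 1):
        # Try all b + 1 ways of splitting the budget across the two children.
        best = max(left[i] + right[b - i] for i in range(b + 1))
        # Edge e contributes its length once any leaf below it can be afforded.
        table.append(best + (clade.length if b >= min_cost else 0.0))
    return table, min_cost


# Hypothetical example: a root edge of length 0 above a leaf (length 5, cost 1)
# and an internal edge (length 2) with leaves (length 3, cost 1), (length 4, cost 2).
example = Clade(0.0,
                left=Clade(5.0, cost=1),
                right=Clade(2.0, left=Clade(3.0, cost=1), right=Clade(4.0, cost=2)))
scores, _ = pd_table(example, 3)
print(scores)  # [0.0, 5.0, 10.0, 11.0]
```

Each entry costs $O(B)$ work and there are $O(B)$ entries per clade, which matches the $O(nB^2)$ bound quoted above.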
This procedure does not work for $a_i \xrightarrow{c_i} b_i$ NAP because this version of the problem does not display the same optimal substructure [11]. In $0 \xrightarrow{c_i} 1$ NAP, the dynamic programming algorithm implicitly maximizes the survival probability of the clade in addition to its $E(PD)$ value. The total score of the tree is a function of both of these values, which is why the algorithm works for this case. In $a_i \xrightarrow{c_i} b_i$ NAP, a budget assignment that maximizes the survival probability of the clade does not guarantee that it will have maximal $E(PD)$, and vice versa. The correct allocation cannot be made without knowledge of the entire tree; hence, the optimal substructure exploited by [11] for $0 \xrightarrow{c_i} 1$ NAP is not present.

As an example, consider the instance of NAP in Figure 1 with $B = 3$. The optimal solution is to conserve $w$ and $y$ for $E(PD) = 225$. However, locally computing the best allocation of budget 1 for the clade containing $y$ and $z$ will select $z$ for conservation, and any chance of obtaining the optimal solution will be lost. In this case, it is more important to maximize the survival probability of the clade rather than $E(PD)$, but there is no way for an algorithm to be aware of this without globally solving the entire tree.

Fig. 1. An example of why the dynamic programming algorithm of [11] does not work for general instances of NAP (a tree on species $w$, $x$, $y$, $z$ with $B = 3$). The optimal allocation for the clade containing $y$ and $z$ is not part of a globally optimal solution.

2 NAPX Algorithm

2.1 Description

In this section we present NAPX, an $O\left(\frac{nB^2h^2\left(\log n+\log\frac{1}{\epsilon}\right)^2}{\log^2(1-\epsilon)}\right)$ dynamic programming algorithm for $a_i \xrightarrow{c_i} b_i$ NAP that produces a $(1-\epsilon)$-approximation of the optimal solution, where $h$ denotes the height of $T$.
As with the algorithm in [11], our algorithm is polynomial only if $B$ is polynomial in $n$. This assumption is justifiable if, for example, $B$ is expressed in millions of dollars, so that its value is a reasonably small integer. Without loss of generality, we also assume that no single cost exceeds the budget.

NAPX essentially generalizes the dynamic programming table of [11] by computing, for each clade, each desired survival probability of the clade, and any budget between 1 and $B$, the maximum $E(PD)$ score achievable while guaranteeing this survival probability. This way, we need not make the choice between maximizing $E(PD)$ or probability as the tables are constructed. From the definition of $E(PD)$ in (2), the probability of survival of an edge can be written as a function of its two children. Let $P_e$ denote the survival probability of edge $e$, and let $l$ and $r$ be $e$'s children. Then

$$P_e = P_l + P_r - P_l P_r. \tag{3}$$

In the optimal solution for NAP on $T$, assume $b$ dollars are assigned to clade $K_e$ and $e$ survives with probability $p$. It follows that $i$ and $b-i$ dollars are assigned to $K_l$ and $K_r$ respectively, where $0 \le i \le b$. These subclades must survive with probabilities $j$ and $\frac{p-j}{1-j}$ (or 0 when $p = j = 1$), for some $0 \le j \le p$, in order to satisfy (3). Because the probability is continuous, we discretize it into intervals by rounding it down to the nearest power of a chosen constant $\alpha$, where $0 < \alpha < 1$. Probabilities less than a chosen cutoff value $p_{min}$ are rounded to zero:

$$p \in \left\{0, \alpha^{\lceil \log_\alpha p_{min} \rceil}, \ldots, \alpha^2, \alpha, 1\right\}.$$

If two non-zero probabilities lie in the same interval, the ratio of the smaller to the larger is at least $\alpha$. If they are in consecutive intervals, this ratio is likewise bounded below by $\alpha^2$. For notational convenience, we define a mapping $\pi(\cdot)$ that rounds a probability to the lower bound of its corresponding interval:

$$\pi(p) = \begin{cases} 0 & \text{if } p < p_{min}, \\ \alpha^{\lceil \log_\alpha p \rceil} & \text{otherwise.} \end{cases}$$
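The discretization and the mapping $\pi(\cdot)$ can be illustrated with a short Python sketch. The function names `probability_grid` and `round_down` are hypothetical, and the small tolerance subtracted before taking the ceiling is an assumption added here to guard against floating-point error when $p$ is already an exact power of $\alpha$.

```python
import math
from typing import List


def probability_grid(alpha: float, p_min: float) -> List[float]:
    """Discretized values {0, alpha^t, ..., alpha^2, alpha, 1},
    where t = ceil(log_alpha(p_min)) and 0 < alpha < 1."""
    t = math.ceil(math.log(p_min) / math.log(alpha))
    return [0.0] + [alpha ** e for e in range(t, -1, -1)]


def round_down(p: float, alpha: float, p_min: float) -> float:
    """pi(p): 0 below the cutoff, otherwise the largest power of alpha
    that does not exceed p."""
    if p < p_min:
        return 0.0
    # log_alpha(p) >= 0 for p <= 1; the ceiling picks the next larger exponent,
    # i.e. the next smaller power of alpha. The 1e-12 slack is an assumption.
    exponent = math.ceil(math.log(p) / math.log(alpha) - 1e-12)
    return alpha ** max(exponent, 0)


print(probability_grid(0.5, 0.1))  # [0.0, 0.0625, 0.125, 0.25, 0.5, 1.0]
print(round_down(0.3, 0.5, 0.1))   # 0.25
```

For any $p \ge p_{min}$ the rounded value lies in $(\alpha p, p]$, which is the property the approximation analysis relies on.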
We now formally describe our algorithm. For each edge $e$, we construct a two-dimensional table $T_e$, where $T_e(b, p)$ stores the optimal expected diversity of $K_e$ given that $b$ dollars are assigned to it and it survives with a probability that is no less than $p$. The table is constructed in the following manner if $e$ is a pendant edge incident to the leaf for species $s$:

$$T_e(b, p) = \begin{cases} a_s \lambda(e) & \text{if } b < c_s \text{ and } p = \pi(a_s), \\ b_s \lambda(e) & \text{if } b \ge c_s \text{ and } p = \pi(b_s), \\ -\infty & \text{otherwise.} \end{cases} \tag{4}$$

Otherwise, $T_e$ is computed from the tables of its two children, $T_l$ and $T_r$:

$$T_e(b, p) = p\lambda(e) + \max_{i,j,k}\left\{T_l(i, j) + T_r(b-i, k)\right\} \tag{5}$$

subject to $i \in \{0, 1, 2, \ldots, b\}$, $j, k \in \left\{0, \alpha^{\lceil \log_\alpha p_{min} \rceil}, \ldots, \alpha^2, \alpha, 1\right\}$, and $\pi(j + k - jk) = p$.

The $E(PD)$ score for the entire tree can be obtained by attaching an edge $e_r$ of length 0 to the root and finding $\max_j \{T_{e_r}(B, j)\}$. The tables are computed from the bottom up, and each time an entry is filled, pointers are kept to the two entries in the child tables from which it was computed. This way the optimal budget allocation can be obtained by following the pointers down from the entry for the optimal score for $e_r$.

2.2 Approximation Ratio

In this section, we express the worst-case approximation ratio as a function of the constants $p_{min}$ and $\alpha$ introduced above, beginning with $p_{min}$. Note that since any species $s$ with $c_s > B$ can be transformed into a new species $s'$ with $c_{s'} = 0$, $b_{s'} = a_s$ and $a_{s'} = a_s$ without affecting the outcome, we can safely assume that $c_s \le B$ for all $s \in X$.

Lemma 1. Let $I$ be an instance of NAP for which there exists a constant $k$ such that $b_i \ge n^{-k} \ge p_{min}$ for all $i \in S$. Consider a transformed instance $I'$ where all $a_i$ values in the range $(0, p_{min})$ are rounded to 0.
Let $OPT(I)$ and $OPT(I')$ be the expected PD scores of the optimal solutions to $I$ and $I'$ respectively. Then the ratio of these scores is bounded as follows:

$$OPT(I') \ge \left(1 - n^{k+1} p_{min}\right) OPT(I).$$

Proof. Let $\mathrm{path}(s)$ be the set of edges comprising the path from leaf $s$ to the root. We define $w(s)$ as the expected diversity of the path from $s$ to the root if $s$ is conserved:

$$w(s) = b_s \sum_{e \in \mathrm{path}(s)} \lambda(e).$$

Let $w_{max} = \max_{s \in X}\{w(s)\}$. This value allows us to place a trivial lower bound on the optimal solution (recalling that we can assume that $c_s \le B$):

$$w_{max} \le OPT(I). \tag{6}$$

We also observe that if any species $s$ survives with a non-zero probability smaller than $p_{min}$ in the optimal solution, its contribution to $OPT(I)$ will be bounded by $p_{min} \frac{w(s)}{b_s}$. It follows that

$$OPT(I') \ge OPT(I) - \sum_{s \in X} p_{min} \frac{w(s)}{b_s}.$$

Since $b_s \ge n^{-k}$ and $w(s) \le w_{max}$, we can express the bound as

$$OPT(I') \ge OPT(I) - n\, p_{min} \frac{w_{max}}{n^{-k}}.$$

Dividing by $OPT(I)$ yields

$$\frac{OPT(I')}{OPT(I)} \ge 1 - n^{k+1} p_{min} \frac{w_{max}}{OPT(I)}.$$

From (6) we obtain

$$\frac{OPT(I')}{OPT(I)} \ge 1 - n^{k+1} p_{min},$$

which completes the proof. $\square$

The size of the probability intervals in the tables, determined by $\alpha$, also affects the approximation ratio. This relationship is detailed in the following lemma.

Lemma 2. Let $OPT_e(b, p)$ denote the optimal expected PD score for clade $K_e$ if $e$ survives with probability exactly $p$ and $b$ dollars are allocated to it. Now consider an instance of NAP such that all $a_s$ and $b_s$ are either 0 or at least $p_{min}$. For any $OPT_e(b, p)$ where $e$ is at height $h$ in the tree, there exists a table entry $T_e(b, p')$ constructed by NAPX such that the following conditions hold:

i) $T_e(b, p') \ge \alpha^h\, OPT_e(b, p)$
ii) $p' \ge \alpha^h p$

Proof. If $p = 0$, then $OPT_e(b, p) = 0$ and the lemma holds.
For the remainder of the proof, we assume $p \ge p_{min}$. The proof will proceed by induction on $h$, the height of $e$ in the tree, beginning with the base case where $h = 1$ and $e$ is a pendant edge connected to leaf $s$. We need only consider the cases where the optimal solution is defined. So, without loss of generality, assume we have $OPT_e(b, a_s) = \lambda(e)\, a_s$. From (4), we know there is an entry $T_e(b, \pi(a_s)) = a_s \lambda(e)$, and therefore both i) and ii) hold.

We now assume that the lemma holds for $h \le x$ and consider some edge $e$ at height $x + 1$. By definition, $OPT_e(b, p)$ can be expressed in terms of its children $l$ and $r$:

$$OPT_e(b, p) = p\lambda(e) + OPT_l(i, j) + OPT_r(b-i, k),$$

where $j + k - jk = p$. From the induction hypothesis, there exist $T_l(i, j') \ge \alpha^x\, OPT_l(i, j)$ and $T_r(b-i, k') \ge \alpha^x\, OPT_r(b-i, k)$, where $j' \ge \alpha^x j$ and $k' \ge \alpha^x k$. Let $q = j' + k' - j'k'$. It follows that

$$q \ge \alpha^x j + \alpha^x k - \alpha^{2x} jk \ge \alpha^x p. \tag{7}$$

The left inequality in (7) holds because $j' + k' - j'k'$ increases as $j'$ or $k'$ increases, so long as their values do not exceed 1. This can be checked by observing that the partial derivatives with respect to $j'$ and $k'$ are $1 - k'$ and $1 - j'$, respectively. The right inequality holds because $\alpha^{2x} \le \alpha^x$. $T_l(i, j')$ and $T_r(b-i, k')$ will be considered when computing the entry $T_e(b, p')$, where $p' = \pi(q)$. Since $q \ge p_{min}$, we have $\pi(q) \ge \alpha q$, because $\pi$ simply rounds $q$ down to the nearest power of $\alpha$. Therefore, $p' \ge \alpha^{x+1} p$ and $T_e(b, p')$ can be bounded as follows:

$$\begin{aligned} T_e(b, p') &\ge p'\lambda(e) + T_l(i, j') + T_r(b-i, k') \\ &\ge \alpha^{x+1} p\lambda(e) + \alpha^x\, OPT_l(i, j) + \alpha^x\, OPT_r(b-i, k) \\ &\ge \alpha^{x+1}\left(p\lambda(e) + OPT_l(i, j) + OPT_r(b-i, k)\right) \\ &\ge \alpha^{x+1}\, OPT_e(b, p). \end{aligned}$$

$\square$

Combining Lemmas 1 and 2 allows us to state that NAPX returns a solution that is at least a factor of $\left(1 - n^{k+1} p_{min}\right)\alpha^h$ of the optimal solution.
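To make the table entries $T_e(b, p')$ used in these lemmas concrete, recurrences (4) and (5) can be sketched in Python as below. This is a simplified illustration under stated assumptions rather than the authors' implementation: the `Edge` class and the function `napx_score` are hypothetical names, tables are stored sparsely as dictionaries keyed by the exponent of $\alpha$ (with `None` standing for probability 0), plain nested loops are used to fill the tables, and the back-pointers needed to recover the allocation are omitted.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Edge:
    """Hypothetical container for an edge e and the clade below it."""
    length: float                    # lambda(e)
    a: float = 0.0                   # a_s (pendant edges only)
    b: float = 0.0                   # b_s (pendant edges only)
    cost: int = 0                    # c_s (pendant edges only)
    left: Optional["Edge"] = None
    right: Optional["Edge"] = None


def napx_score(root: Edge, B: int, alpha: float, p_min: float) -> float:
    """Best approximate E(PD) for budget B; root is the length-0 edge e_r."""

    def level(p: float) -> Optional[int]:
        # Exponent of pi(p); None encodes pi(p) = 0. The 1e-12 slack is an
        # assumption that protects exact powers of alpha from float error.
        if p < p_min:
            return None
        return max(0, math.ceil(math.log(p) / math.log(alpha) - 1e-12))

    def prob(lv: Optional[int]) -> float:
        return 0.0 if lv is None else alpha ** lv

    def table(e: Edge) -> List[Dict[Optional[int], float]]:
        if e.left is None:
            # Recurrence (4); missing dictionary keys play the role of -infinity.
            return [{level(e.b): e.b * e.length} if b >= e.cost
                    else {level(e.a): e.a * e.length} for b in range(B + 1)]
        tl, tr = table(e.left), table(e.right)
        out: List[Dict[Optional[int], float]] = [dict() for _ in range(B + 1)]
        for b in range(B + 1):
            for i in range(b + 1):                        # budget split
                for lj, score_l in tl[i].items():         # left probability j
                    for lk, score_r in tr[b - i].items():  # right probability k
                        j, k = prob(lj), prob(lk)
                        lp = level(j + k - j * k)         # p = pi(j + k - jk)
                        score = prob(lp) * e.length + score_l + score_r
                        if score > out[b].get(lp, -math.inf):
                            out[b][lp] = score            # recurrence (5)
        return out

    return max(table(root)[B].values())


# Hypothetical instance: two leaves with a = 0.5, b = 1, cost 1, length 10.
tree = Edge(0.0, left=Edge(10.0, a=0.5, b=1.0, cost=1),
            right=Edge(10.0, a=0.5, b=1.0, cost=1))
print(napx_score(tree, 1, 0.5, 0.01))  # 15.0
```

Scattering each triple $(i, j, k)$ into the entry for $p = \pi(j + k - jk)$ touches $O(Bt^2)$ triples per budget value, where $t$ is the number of probability intervals.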
In this section we show that these results also imply that a $(1-\epsilon)$-approximation can be obtained in polynomial time for an arbitrary constant $\epsilon$.

Lemma 3. $O\left(\frac{h\left(\log n + \log\frac{1}{\epsilon}\right)}{|\log(1-\epsilon)|}\right)$ probability intervals are required in the table in order to obtain a $(1-\epsilon)$-approximation.

Proof. The number of probability intervals, $t$, required for the table is bounded by the number of times 1 must be multiplied by $\alpha$ to reach $p_{min}$. Hence $\alpha^t \le p_{min}$ and

$$t = \left\lceil \frac{\log p_{min}}{\log \alpha} \right\rceil. \tag{8}$$

From Lemmas 1 and 2 we can obtain the desired approximation ratio by selecting $\alpha = \sqrt{(1-\epsilon)^{1/h}}$ and $p_{min} = \frac{1 - \sqrt{1-\epsilon}}{n^{k+1}}$, so that $\alpha^h = \sqrt{1-\epsilon}$ and $1 - n^{k+1} p_{min} = \sqrt{1-\epsilon}$. Plugging these values into (8) gives

$$t = \left\lceil \frac{\log\left(\frac{1 - \sqrt{1-\epsilon}}{n^{k+1}}\right)}{\log\left(\sqrt{(1-\epsilon)^{1/h}}\right)} \right\rceil = \left\lceil \frac{2h\left(\log\left(1 - \sqrt{1-\epsilon}\right) - (k+1)\log n\right)}{\log(1-\epsilon)} \right\rceil.$$

It can be shown that $\log\left(1 - \sqrt{1-\epsilon}\right)$ is $O(\log \epsilon)$, so multiplying the numerator and denominator by $-1$ we can express $t$ asymptotically as

$$t \in O\left(\frac{h(\log n - \log \epsilon)}{-\log(1-\epsilon)}\right) = O\left(\frac{h\left(\log n + \log\frac{1}{\epsilon}\right)}{|\log(1-\epsilon)|}\right).$$

$\square$

Theorem 1. NAPX is a $(1-\epsilon)$-approximation with time complexity

$$O\left(\frac{nB^2h^2\left(\log n + \log\frac{1}{\epsilon}\right)^2}{\log^2(1-\epsilon)}\right).$$

Proof. For each table entry $T(b, p)$, we must find the maximum of all possible combinations of entries in the left and right child tables that satisfy $b$ and $p$. These combinations correspond to the possible $\{i, j, k\}$ triples from (5). There are $O(Bt^2)$ such combinations, as $i$ corresponds to the budget and $j$ and $k$ correspond to probability intervals. Furthermore, for fixed values of $p$ and $j$, there are potentially $O(t)$ different values of $k$ that could satisfy $\pi(j + k - jk) = p$ due to rounding. It follows that a naive algorithm would have to compare all $O(Bt^2)$ combinations when computing the maximum in (5) for each table entry.
Fortunately, because $\pi(j + k - jk)$ is monotonically nondecreasing with respect to either $j$ or $k$, we can directly compute, for any fixed $p$ and $j$, the interval of $k$ entries that satisfy $\pi(j + k - jk) = p$. Its endpoints, expressed as exponents of $\alpha$ and rounded to integers, are given by

$$\log_\alpha\left(\frac{\alpha p - j}{1 - j}\right) \quad \text{and} \quad \log_\alpha\left(\frac{p - j}{1 - j}\right).$$

Finding the value of $k$ in the interval such that $T(b-i, k)$ is maximized is effectively a range maxima query (RMQ) on an array. Regardless of the size of the interval, such a query can be performed in constant time if, instead of an array, the values are stored in an RMQ structure as described in [1]. Such structures are linear both in space and in the time they take to create, meaning that we can use them to store each column in the table (corresponding to budget value $i$) without adversely affecting the complexity. Now, when given a pair $\{i, j\}$, the optimal value of $k$ can be computed in constant time, bringing the complexity of filling a single table entry to $O(Bt)$, the number of combinations of the pair $\{i, j\}$. There are $O(Bt)$ entries in each table, and a table for each of the $O(n)$ edges in the tree. The space complexity is therefore $O(nBt)$ and the time complexity is $O(nB^2t^2)$. Substituting for $t$ the value from Lemma 3 that yields a $(1-\epsilon)$ approximation ratio gives

$$O\left(\frac{nB^2h^2\left(\log n + \log\frac{1}{\epsilon}\right)^2}{\log^2(1-\epsilon)}\right).$$

2.3 Expected Running Time

Since in general the height of a phylogenetic tree with $n$ leaves is $O(n)$, the running time derived above is technically cubic in $n$. Fortunately, for most inputs we can expect the height to be much smaller. In this section, we will provide improved running times for trees generated by the two principal random models. Additionally, we will show that caterpillar trees, which should be the pathological worst-case topology according to the above analysis, actually have a much lower complexity.
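To see how strongly the table width $t$ depends on the height $h$, the parameter choices from Lemma 3 can be evaluated numerically. The following Python sketch is only illustrative: the function name `napx_parameters` and the sample values of $n$, $h$, $\epsilon$, and $k$ are assumptions chosen for the example.

```python
import math
from typing import Tuple


def napx_parameters(n: int, h: int, eps: float, k: int = 1) -> Tuple[float, float, int]:
    """Return (alpha, p_min, t) as selected in Lemma 3."""
    alpha = (1.0 - eps) ** (1.0 / (2 * h))               # sqrt((1 - eps)^(1/h))
    p_min = (1.0 - math.sqrt(1.0 - eps)) / n ** (k + 1)
    t = math.ceil(math.log(p_min) / math.log(alpha))     # equation (8)
    return alpha, p_min, t


n, eps, k = 64, 0.1, 1
for h in (6, 64):   # a height near log2(n) versus the worst case h = n
    alpha, p_min, t = napx_parameters(n, h, eps, k)
    # Combined guarantee of Lemmas 1 and 2; it should equal 1 - eps.
    ratio = (1.0 - n ** (k + 1) * p_min) * alpha ** h
    print(f"h={h}: alpha={alpha:.6f}, t={t}, guaranteed ratio={ratio:.6f}")
```

Because $t$ grows linearly in $h$ and the running time is quadratic in $t$, a logarithmic height reduces the $h^2$ factor from $n^2$ to $\log^2 n$.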
The Yule-Harding model [14][5], also known as the equal-rates-Markov model, assumes that trees are formed by a succession of random speciation events. The expected height of trees formed in this way, regardless of the speciation rate, is $O(\log n)$ [2], giving a time complexity of $O\left(\frac{nB^2\log^2 n\left(\log n + \log\frac{1}{\epsilon}\right)^2}{\log^2(1-\epsilon)}\right)$.

A caterpillar tree is a tree where all internal nodes are on a path beginning at the root, and therefore has height $n$. This implies that every internal edge has at least one child edge that is incident to a leaf. Suppose edge $e$ has child $l$ that is incident to the leaf for species $s$. The table $T_l$ contains only two meaningful values: $T_l(0, a_s)$ and $T_l(c_s, b_s)$. Therefore, to compute entry $T_e(b, p)$, only $O(1)$ combinations of child table entries need to be compared, and the time complexity is improved to $O\left(\frac{n^2B\left(\log n + \log\frac{1}{\epsilon}\right)}{|\log(1-\epsilon)|}\right)$.

3 Conclusion

NAPX is, to our best knowledge, the first polynomial-time algorithm for $a_i \xrightarrow{c_i} b_i$ NAP that places guarantees on the approximation ratio. While there are still some limitations, especially for large budgets or tree heights, our algorithm significantly increases the number of instances of NAP that can be solved. Moreover, our expected running time analysis shows that the algorithm will usually be much more efficient than its worst-case complexity suggests. This work towards a more general solution is important if the Noah's Ark Problem framework is to be used for real conservation projects. Some interesting questions do remain, however. Does NAP remain NP-Hard when the budget is constrained to be polynomial in $n$? We conjecture that it does, but the usual reduction from Knapsack is clearly no longer valid. We would also like to find an efficient algorithm whose complexity is independent of $h$ and/or $B$.

References

1. M.A. Bender and M. Farach-Colton. The LCA Problem Revisited. LATIN 2000: Theoretical Informatics: 4th Latin American Symposium, Punta Del Este, Uruguay, April 10-14, 2000: Proceedings, 2000.
2. P.L. Erdos, M.A. Steel, L.A. Szekely, and T.J. Warnow. A few logs suffice to build (almost) all trees: Part II. Theoretical Computer Science, 221(1):77–118, 1999.
3. D.P. Faith. Conservation evaluation and phylogenetic diversity. Biological Conservation, 61:1–10, 1992.
4. Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, 1979.
5. E.F. Harding. The Probabilities of Rooted Tree-Shapes Generated by Random Bifurcation. Advances in Applied Probability, 3(1):44–77, 1971.
6. K. Hartmann and M. Steel. Maximizing Phylogenetic Diversity in Biodiversity Conservation: Greedy Solutions to the Noah's Ark Problem. Systematic Biology, 55(4):644–651, 2006.
7. S.B. Heard and A.O. Mooers. Phylogenetically patterned speciation rates and extinction risks change the loss of evolutionary history during extinctions. Proc. R. Soc. Lond. B, 267:613–620, 2000.
8. A.E. Magurran. Measuring Biological Diversity. Blackwell Publishing, 2004.
9. S. Nee and R.M. May. Extinction and the Loss of Evolutionary History. Science, 278(5338):692, 1997.
10. F. Pardi and N. Goldman. Species choice for comparative genomics: Being greedy works. PLoS Genetics, 1(6):71, 2005.
11. F. Pardi and N. Goldman. Resource-aware taxon selection for maximizing phylogenetic diversity. Syst Biol, 56(3):431–44, 2007.
12. M. Steel. Phylogenetic diversity and the greedy algorithm. Systematic Biology, 54(4):527–529, 2005.
13. M.L. Weitzman. The Noah's Ark problem. Econometrica, 66:1279–1298, 1998.
14. G.U. Yule. A Mathematical Theory of Evolution, Based on the Conclusions of Dr. JC Willis, FRS. Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character, 213:21–87, 1925.
