Antilope - A Lagrangian Relaxation Approach to the de novo Peptide Sequencing Problem

Sandro Andreotti, Gunnar W. Klau∗, Knut Reinert∗

Abstract. Peptide sequencing from mass spectrometry data is a key step in proteome research. Especially de novo sequencing, the identification of a peptide from its spectrum alone, is still a challenge even for state-of-the-art algorithmic approaches. In this paper we present ANTILOPE, a new fast and flexible approach based on mathematical programming. It builds on the spectrum graph model and works with a variety of scoring schemes. ANTILOPE combines Lagrangian relaxation for solving an integer linear programming formulation with an adaptation of Yen's k shortest paths algorithm. It shows a significant improvement in running time compared to mixed integer optimization and performs at the same speed as other state-of-the-art tools. We also implemented a generic probabilistic scoring scheme that can be trained automatically for a dataset of annotated spectra and is independent of the mass spectrometer type. Evaluations on benchmark data show that ANTILOPE is competitive with the popular state-of-the-art programs PepNovo and NovoHMM both in terms of running time and accuracy. Furthermore, it offers increased flexibility in the number of considered ion types. ANTILOPE will be freely available as part of the open source proteomics library OpenMS.

I. INTRODUCTION

Mass spectrometry-based high-throughput identification of peptides and proteins is a key step in most proteomics research experiments. It requires fast algorithmic solutions with good identification capabilities. Depending on the initial situation of the experiment, two general strategies exist: database-assisted and de novo identification. If a database for the studied proteins exists, the first method is usually preferred over de novo sequencing.
The crucial step in database search algorithms like INSPECT [1], SEQUEST [2], Mascot [3] and OMSSA [4] is to filter the database based on different methods. INSPECT generates peptide sequence tags (PST) and keeps only those candidate peptides containing the tag as a subsequence. SEQUEST uses the parent mass as filter criterion. After filtering, the query spectrum is scored against the remaining candidates and a ranking of possible identifications is produced. In addition to the quality of the spectrum, database search methods clearly depend on the correctness and completeness of the database and hence on the availability of a suitable set of peptides or transcripts for the studied organism. Even if this is the case, factors like alternative splice variants and mutations can lead to missing identifications.

S. Andreotti and K. Reinert are with the Department of Computer Science, Freie Universität Berlin, Germany and the International Max Planck Research School for Computational Biology and Scientific Computing, Berlin, Germany. E-mail: andreott@inf.fu-berlin.de
G. W. Klau is with the CWI, Life Sciences Group, Amsterdam, the Netherlands, and the Netherlands Institute for Systems Biology.
∗ shared last authors

In such situations de novo sequencing algorithms provide an alternative as they infer the sequence from the spectrum itself without any information collected in databases. In recent years, many algorithms and software packages were published, with the most popular being PEAKS [5], PepNovo [6], NovoHMM [7], Lutefisk [8], Sherenga [9], EigenMS [10], and PILOT [11]. Most of them use the graph-theoretical approach introduced by Bartels [12] and construct a so-called N-C spectrum graph which is then used to search for the correct sequence. See Fig. 1.
Using this formulation, the de novo peptide sequencing problem can be formulated as the search for the longest antisymmetric path, an NP-complete problem [13]. PepNovo and Lutefisk solve a special case of this problem by restricting the construction of the spectrum graph, which enables them to apply a dynamic programming algorithm proposed by Chen et al. [14], [15]. The restrictions limit the possible interpretations of each peak to at most one N-terminal (usually b-ion) and one C-terminal (usually y-ion) ion type. Liu and Cai [16] use tree decomposition to solve the restricted problem. Bafna and Edwards [17] propose a variant of the dynamic programming approach that also allows for more interpretations, leading to a polynomial algorithm of a higher degree. Their algorithm is still limited to so-called simple ion types, excluding doubly and triply charged ions that can also aid the identification process.

PILOT [11] overcomes all these restrictions using an integer linear programming (ILP) formulation for the longest antisymmetric path problem that is flexible and extensible at the cost of efficiency. This allows for more interpretations of each peak, which can lead to improved identification in situations where the prominent b- and y-ions are missing. Furthermore, the ILP formulation can easily be extended in several ways by simply adding or modifying constraints to further restrict or modify the set of possible solutions. The approach also allows for global reasoning such as limiting the number of a certain amino acid type for each prediction.

The main contribution of this work is an improvement of this approach by an extension that retains most of the flexibility and leads to a remarkable improvement in running time. Instead of focusing on computing one antisymmetric path we propose a novel algorithm to find the k best antisymmetric paths.
We achieve this by applying the Lagrangian relaxation technique to the problem and solving the subproblems with an elegant variant of Yen's k shortest paths algorithm. Lagrangian relaxation was already successfully applied to biological problems such as sequence alignment [18], protein [19] and RNA [20] structural alignment or protein threading [21].

Fig. 1. Spectrum graph generation. (a) Simplified tandem mass spectrum of the peptide VEALR (axes: m/z in Da, intensity). Rounded m/z values in Da are presented on top of each peak. (b) The corresponding spectrum graph with two nodes being generated for each peak: one under the assumption of being a b-ion, the other under the assumption of being a y-ion. It is obvious that the path starting at node s with mass 0 and ending at node t with mass 568 encodes the correct peptide sequence. The undirected edges connecting complementary nodes are drawn as dashed lines.

An additional contribution of this paper is a generic probabilistic scoring scheme that can be trained automatically for a dataset of annotated spectra and is independent of the mass spectrometer type. The performance of both de novo and database search approaches depends on a good scoring function to model prediction quality. Currently used scoring functions range from rather simple peak intensity-based scoring to statistical models including Bayesian networks. The latter show a better performance but require re-training for different spectrometer types and thus depend on reliable annotated datasets. Our flexible scoring scheme allows for user-controlled training on supplied annotated datasets. The topology of the network can either be defined by the user or, following the approach proposed by Bern [22], learned from the given dataset directly.
We extend this approach by considering ion intensities and cleavage positions similar to the PepNovo scoring in order to account for shifts of the fragmentation patterns between different m/z regions along the spectrum.

Our software ANTILOPE (ANTIsymmetric path search with Lagrangian Optimization for PEptide identification), an implementation of the improved approach, is freely available as part of upcoming releases of the open source proteomics library OpenMS [23].

The structure of the remainder of this paper is as follows. Section II describes our new method. In Section III we compare our tool with the state-of-the-art tools PepNovo, NovoHMM, LutefiskXP and PILOT. Finally, in Section IV, we discuss our results and future work.

II. NOVEL DE NOVO PEPTIDE SEQUENCING ALGORITHM

This section describes our new approach to the de novo sequencing problem. At first we formally introduce the graph-theoretic formulation and the resulting ILP formulation our method ANTILOPE is based on. Then we present our new algorithmic approach to find the k best solutions of the ILP. Finally, we explain the scoring model of ANTILOPE.

A. Graph-Theoretical Formulation

Bartels introduced the transformation of a tandem mass spectrum into the so-called spectrum graph, a now commonly used data structure in graph-theoretical approaches to the de novo sequencing problem [6], [9], [16], see also Fig. 1. Using this data structure, the original problem amounts to finding a longest path with certain properties in this graph.

When a peptide P is fragmented by collision-induced dissociation (CID) it usually breaks along the backbone between two neighboring amino acids into a pair of N-terminal (prefix) and C-terminal (suffix) fragments. We define the residual mass of P as the sum of the monoisotopic masses of all amino acid residues in P.
By parent mass $M_P$ we denote the total mass of P, which is the residual mass plus 18 Da for an additional water molecule. Depending on the exact fragmentation position, different types of fragment ions are produced that have a certain mass offset compared to the prefix residue mass (PRM) or suffix residue mass. Besides the types presented in Fig. 2, neutral loss variants (e.g., loss of water or ammonia) of several ion types are also observed frequently, as are multiply charged ions. The fragmentation process is still not fully understood, and which types are generated with which intensity depends on many factors.

Fig. 2. Peptide fragmentation along the backbone. This figure displays the most prominent fragmentation positions for the generation of pairs of b/y-ions, a/x-ions and c/z-ions in the backbone of a peptide.

The spectrum graph G consists of a set of nodes $V$, a set of directed edges $E_D$ and a set of undirected edges $E_U$. In the original definition the spectrum graph does not contain the set of undirected edges, which is a slight modification by Liu and Cai [16], who termed this the extended spectrum graph. In the spectrum graph each node corresponds to some possible prefix residue mass of the peptide to be identified. Directed edges represent amino acids and connect nodes if their mass difference can be explained by some amino acid. Two nodes that lead to contradicting interpretations of some mass peak are called complementary and are connected by an undirected edge.

Given the tandem mass spectrum of some unknown peptide, the construction of the spectrum graph is as follows: Each peak $s$ with mass $m_s$ in the input spectrum generates a set of nodes. If we consider $k$ different N-terminal ion types (e.g., b-ion and a-ion) with mass offsets $\delta_1, \ldots, \delta_k$ ($+1$ Da for b-ions, $-27$ Da for a-ions) from the PRM, then peak $s$ generates $k$ nodes with masses $m_s - \delta_1, \ldots, m_s - \delta_k$. For C-terminal ion types with offsets $\delta_1, \ldots, \delta_k$, additional $k$ nodes with masses $M_P - 18 - (m_s - \delta_1), \ldots, M_P - 18 - (m_s - \delta_k)$ are generated. Each of these nodes represents the prefix residue mass under the assumption that $s$ was generated by an ion of a certain type. Clearly at most one of these nodes can represent the true PRM; therefore they all contradict each other and are connected by undirected edges. Whenever the mass difference of two nodes $v_i$ and $v_k$ equals the mass of some amino acid $\alpha$ ($\pm\epsilon$), we connect $v_i$ and $v_k$ via a directed edge $(v_i, v_k)$ labeled with $\alpha$. Finally we add two so-called goalpost nodes $s$ and $t$, with masses 0 and $M_P - 18$ Da, respectively.

If the spectrum of some peptide P is complete, i.e., fragment ion peaks are abundant for each possible cleavage site of P, then there exists a node for each PRM of P. Therefore the correct sequence of P is obtained by finding the s-t path of nodes corresponding to the true prefix sequences of P and by concatenating the edge labels along this path.

Each node in the spectrum graph has a score that represents how reliably that node corresponds to a true PRM. However, simply looking for the longest path in the graph often leads to infeasible solutions, namely if two nodes that were generated by the same peak are included in the path, since in general only one of them corresponds to a true PRM. This problem is aggravated when the score of each node is directly related to the intensity of the generating peak.
In such a scenario a high-intensity peak generates several high-scoring nodes, and a longest path search then tends to include a pair of complementary nodes in the longest path, leading to contradicting N- and C-terminal interpretations of the same peak. Such an infeasible path is called symmetric because the forbidden pairs of N-terminal and C-terminal nodes form a symmetric structure, which can be seen in Fig. 1. To solve the de novo sequencing problem we hence have to search for antisymmetric paths. These are paths without contradicting nodes; they therefore do not contain pairs of nodes that are connected by an undirected edge. See Fig. 3 for an example.

Fig. 3. Symmetric path example. This figure sketches schematically the situation when an infeasible symmetric path would be preferred over a feasible antisymmetric solution. Assuming that the small nodes have a score of 1 and the bold nodes have a score of 2, the illegal s-t path scores higher than the legal one. Therefore, in this example, a simple longest path search yields infeasible solutions.

Most de novo sequencing algorithms generate one pair of complementary nodes for each peak, assuming it to be either a b-ion or a y-ion. These pairs form a nested non-interleaving structure allowing for efficient computation. But although b- and y-ions are usually the most abundant in CID spectra, there are cases in which both of them are missing and therefore no correct node is generated. Therefore it is promising to include nodes for other interpretations, especially in the low and high mass range of the spectrum where fragmentation is usually less complete.

While the longest antisymmetric path problem is NP-complete for general directed graphs [13], there exist polynomial algorithms for the special case where the non-interleaving property is satisfied.
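To make the construction and the antisymmetry condition concrete, the following is a minimal sketch, not the ANTILOPE implementation. It assumes only singly charged b- and y-ions, uses exact offsets (proton mass for b-ions, water plus proton for y-ions) instead of the rounded values in the text, and synthesizes the four peaks of the VEALR example from the sequence. All names (`RESIDUE`, `build_spectrum_graph`, `is_antisymmetric`, `all_paths`) are hypothetical, and `residual_mass` plays the role of $M_P - 18$.

```python
from itertools import combinations

# Monoisotopic residue masses in Da (only the residues needed for the example).
RESIDUE = {"V": 99.06841, "E": 129.04259, "A": 71.03711,
           "L": 113.08406, "R": 156.10111}
WATER, PROTON = 18.01056, 1.00728
B_OFFSET = PROTON          # singly charged b-ion: PRM + proton
Y_OFFSET = WATER + PROTON  # singly charged y-ion: suffix residue mass + water + proton


def build_spectrum_graph(peaks, residual_mass, eps=0.01):
    """Return (nodes, directed, undirected) for b/y interpretations of each peak."""
    nodes = [(0.0, None)]                                   # goalpost s
    for p, mz in enumerate(peaks):
        nodes.append((mz - B_OFFSET, p))                    # PRM if peak p is a b-ion
        nodes.append((residual_mass - (mz - Y_OFFSET), p))  # PRM if peak p is a y-ion
    nodes.append((residual_mass, None))                     # goalpost t
    nodes.sort(key=lambda node: node[0])

    directed = {}                                           # (i, k) -> amino acid label
    for i, k in combinations(range(len(nodes)), 2):
        diff = nodes[k][0] - nodes[i][0]
        for aa, mass in RESIDUE.items():
            if abs(diff - mass) <= eps:
                directed[(i, k)] = aa

    # Nodes generated by the same peak contradict each other.
    undirected = {frozenset((i, k))
                  for i, k in combinations(range(len(nodes)), 2)
                  if nodes[i][1] is not None and nodes[i][1] == nodes[k][1]}
    return nodes, directed, undirected


def is_antisymmetric(path, undirected):
    """True if no two nodes of the path are joined by an undirected edge."""
    return not any(frozenset(pair) in undirected for pair in combinations(path, 2))


def all_paths(directed, s, t):
    """Enumerate all s-t paths with their label strings (only sensible for tiny graphs)."""
    succ = {}
    for (i, k), aa in directed.items():
        succ.setdefault(i, []).append((k, aa))
    stack = [([s], "")]
    while stack:
        path, seq = stack.pop()
        if path[-1] == t:
            yield path, seq
        for k, aa in succ.get(path[-1], []):
            stack.append((path + [k], seq + aa))


peptide = "VEALR"
residual = sum(RESIDUE[a] for a in peptide)
prefix = [sum(RESIDUE[a] for a in peptide[:j]) for j in range(1, len(peptide))]
# Synthetic peaks: y1, b2, b3, y4 (roughly 175.1, 229.1, 300.2 and 488.3 Da).
peaks = [residual - prefix[3] + Y_OFFSET, prefix[1] + B_OFFSET,
         prefix[2] + B_OFFSET, residual - prefix[0] + Y_OFFSET]

nodes, directed, undirected = build_spectrum_graph(peaks, residual)
s, t = 0, len(nodes) - 1   # in this example the goalposts are the lightest and heaviest node
paths = list(all_paths(directed, s, t))
```

In this toy instance the only s-t path spells VEALR and is antisymmetric. The mirrored interpretations of the same peaks also form a chain of edges, but that chain is not connected to the goalposts.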
The polynomial algorithm proposed by Chen [14] uses dynamic programming to compute an optimal solution to the longest antisymmetric path problem with non-interleaving forbidden pairs. In a second paper Lu and Chen [15] extended this approach to compute suboptimal solutions by constructing a so-called matrix spectrum graph and applying depth-first search and a backtracking algorithm. In contrast, the ILP formulation presented in the next section does not depend on such a nested structure and corresponds to the de novo sequencing problem for any desired set of ion types.

B. Integer Linear Programming Formulation

Our algorithm is based on the following integer linear programming (ILP) formulation [24], which is very similar to the one Floudas and DiMaggio used for their tool PILOT [11]. Our formulation models the problem by means of zero-one variables for each edge. We put the score of each node on all its outgoing directed edges. As the graph is acyclic, this is a safe transformation.

$$\max \sum_{(v_i,v_k)\in E_D} c_{i,k}\,x_{i,k} \tag{1}$$

$$\sum_{(v_s,v_k)\in E_D} x_{s,k} = 1 \tag{2}$$

$$\sum_{(v_k,v_t)\in E_D} x_{k,t} = 1 \tag{3}$$

$$\sum_{(v_i,v_k)\in E_D} x_{i,k} - \sum_{(v_k,v_j)\in E_D} x_{k,j} = 0 \qquad \forall\, v_k \in V \setminus \{v_s, v_t\} \tag{4}$$

$$\sum_{v_i \in e}\ \sum_{(v_i,v_k)\in E_D} x_{i,k} \le 1 \qquad \forall\, e \in E_U \tag{5}$$

$$x_{i,k} \in \{0,1\} \tag{6}$$

We introduce a binary variable $x_{i,k}$ for every directed edge $(v_i, v_k) \in E_D$ which has value one if edge $(v_i, v_k)$ is part of the path (active) and zero otherwise (inactive). The objective function (1) maximizes the summed score of all active directed edges. For the two goalposts $s$ and $t$, the two constraints (2) and (3) ensure that exactly one active edge leaves $s$ and one enters $t$. Together with the flow conservation constraints (4), they establish a correspondence between feasible solutions of the ILP and s-t paths in the graph.
An optimal solution of the ILP consisting of objective function (1) and constraints (2), (3) and (4) corresponds to a longest s-t path, still possibly symmetric and therefore infeasible for the de novo sequencing problem. Therefore we add another constraint (5) that makes sure that for each pair of contradicting nodes at most one will be selected.

The difference of our model to the one proposed by Floudas and DiMaggio is twofold. First, we do not introduce variables for nodes as they are not required. This does not change the general structure of the formulation and has no strong effect on the time required for solving. Second, we do not formulate a constraint that prevents the exact mass of the predicted sequence from deviating from the measured parent mass by more than a certain threshold value (usually 2.5 Da). We argue that in our algorithm it is more promising to defer this filtering to a later stage of the algorithm. Since we add edges that correspond to pairs and triples of amino acids, which often represent several possible combinations of amino acids, there is no exact mass which could be used in such a constraint. Therefore we perform the filtering at a later stage when we have created the candidate superset.

C. Applying Lagrangian Relaxation

While linear programming (LP) problems can be solved in polynomial worst-case time, adding integrality constraints makes them generally NP-hard, and the resulting integer linear programs (ILPs) require different algorithmic solution approaches. One common method is to first solve the LP relaxation and then investigate the obtained solution. If the solution is fractional one has to resort to techniques like branch-and-bound or branch-and-cut using upper and lower bounds obtained from heuristics and from the relaxed solution.
We apply a different kind of relaxation method, Lagrangian relaxation, which in many cases yields much more efficient algorithms than those based on LP relaxations because it can exploit structural knowledge of the problem. Lagrangian relaxation is motivated by the experience that many hard integer programming problems correspond to a significantly easier problem that has been complicated by an additional set of constraints. To obtain the efficiently computable Lagrangian problem, the complicating constraints are removed and replaced by a penalty term in the objective function. The relaxed problem obtained that way is called the Lagrangian problem and can often be solved efficiently.

The Lagrangian relaxation of the de novo sequencing ILP (1)-(6) is straightforward, as the antisymmetry constraints clearly form the class of hard constraints that complicate the computationally easy problem of a longest path search in a directed acyclic graph (DAG). We can solve this relaxed problem by means of a simple standard algorithm, which can be found in reference material [25]. To make the Lagrangian relaxation more transparent we rewrite the objective function in a way that the edge variables are grouped by the undirected edges incident to their left end:

$$\max \sum_{e \in E_U}\ \sum_{(v_i,v_k)\in E_D,\ v_i \in e} c_{i,k}\,x_{i,k}.$$

Next we apply Lagrangian relaxation by dropping the antisymmetry constraint (5) and moving it to the objective function to penalize its violation. This leads to the Lagrangian problem

$$Z(\lambda) = \max \sum_{e \in E_U}\ \sum_{(v_i,v_k)\in E_D,\ v_i \in e} (c_{i,k} - \lambda_e)\,x_{i,k} + \sum_{e \in E_U} \lambda_e \tag{7}$$

subject to

$$\sum_{(v_s,v_k)\in E_D} x_{s,k} = 1$$

$$\sum_{(v_k,v_t)\in E_D} x_{k,t} = 1$$

$$\sum_{(v_i,v_k)\in E_D} x_{i,k} - \sum_{(v_k,v_j)\in E_D} x_{k,j} = 0 \qquad \forall\, v_k \in V \setminus \{v_s, v_t\}$$

$$x_{i,k} \in \{0,1\}$$

The vector $\lambda$ holds the Lagrangian multipliers, non-negative real numbers that define the weight of the penalty term.

Lemma 1.
The Lagrangian problem (7) can be solved in linear time and space.

Proof: Solving the Lagrangian problem consists of the following steps: First we simply subtract from each edge weight $c_{i,k}$ the value $\lambda_e$, for all undirected edges $e$ incident to node $v_i$. Then we apply the linear-time $O(|V| + |E_D|)$ longest path search algorithm for DAGs on the graph with the modified edge weights. Finally we add the value of $\sum_{e \in E_U} \lambda_e$ to the score obtained from the longest path search algorithm. Obviously each of the steps requires only linear time and space.

By restricting the Lagrangian multipliers to non-negative values one can easily show that the value of the solution of the Lagrangian problem is an upper bound to the optimal value of the original problem [26]. In order to obtain a tight bound, the strategy is to find the values for the Lagrangian multipliers that minimize $Z(\lambda)$, which means solving the dual problem:

$$Z_D = \min_{\lambda \ge 0} Z(\lambda).$$

We apply the efficient iterative subgradient optimization algorithm, computing sequences of multipliers $\lambda^t$ where $t = 0, 1, 2, \ldots$ denotes the iteration. We start with $\lambda^0_e = 0$ for all $e \in E_U$, and in each iteration $t$ we compute the subgradients $S^t_e = 1 - \sum_{v_i \in e} \sum_{(v_i,v_k)\in E_D} x_{i,k}$ for all $e \in E_U$ and update the Lagrangian multipliers according to the formula:

$$\lambda^{t+1}_e = \max\{0,\ \lambda^t_e - \theta_t S^t_e\}. \tag{8}$$

One crucial factor with a huge influence on the performance is the step size $\theta$. The subgradient method converges to the optimal solution $Z_D$ if the step size satisfies the following conditions [27]:

$$\lim_{k \to \infty} \theta_k = 0 \quad \text{and} \quad \lim_{k \to \infty} \sum_{i=1}^{k} \theta_i = \infty.$$
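The steps of the proof and the multiplier update (8) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: nodes are assumed to be numbered in topological order, the step size is the simple choice $\theta_t = 1/(t+1)$, which satisfies both conditions above, and the function names and the toy instance are hypothetical. The edges are sorted on every call for brevity, so this version is not strictly linear time.

```python
def longest_path(n, edges, s, t):
    """Longest s-t path in a DAG whose nodes 0..n-1 are topologically numbered."""
    best = [float("-inf")] * n
    pred = [None] * n
    best[s] = 0.0
    for (i, k), w in sorted(edges.items()):       # ascending tail = topological order
        if best[i] > float("-inf") and best[i] + w > best[k]:
            best[k], pred[k] = best[i] + w, i
    if best[t] == float("-inf"):
        raise ValueError("t is not reachable from s")
    path = [t]
    while path[-1] != s:
        path.append(pred[path[-1]])
    return best[t], path[::-1]


def lagrangian_antisymmetric_path(n, edges, conflicts, s, t, max_iter=50, tol=1e-6):
    """Subgradient optimization of Z(lambda); returns (best feasible path, lower, upper)."""
    lam = {e: 0.0 for e in conflicts}
    upper, lower, best_path = float("inf"), float("-inf"), None
    for it in range(max_iter):
        # Penalized weights: c_ik minus the multipliers of all conflicts containing v_i.
        penalized = {(i, k): c - sum(l for e, l in lam.items() if i in e)
                     for (i, k), c in edges.items()}
        value, path = longest_path(n, penalized, s, t)
        upper = min(upper, value + sum(lam.values()))    # Z(lambda) is an upper bound
        tails = set(path[:-1])                           # nodes with an active outgoing edge
        sub = {e: 1 - len(e & tails) for e in conflicts} # subgradient S_e
        if all(g >= 0 for g in sub.values()):            # path is antisymmetric
            score = sum(edges[a, b] for a, b in zip(path, path[1:]))
            if score > lower:
                lower, best_path = score, path
        if upper - lower <= tol:
            break
        theta = 1.0 / (it + 1)                           # tends to 0, sum diverges
        lam = {e: max(0.0, lam[e] - theta * sub[e]) for e in conflicts}
    return best_path, lower, upper


# Toy DAG: node scores sit on outgoing edges; nodes 1 and 3 are complementary.
edges = {(0, 1): 0.0, (0, 2): 0.0, (1, 3): 2.0, (1, 4): 2.0,
         (2, 3): 1.0, (2, 4): 1.0, (3, 5): 2.0, (4, 5): 1.5}
conflicts = [frozenset({1, 3})]
path, lower, upper = lagrangian_antisymmetric_path(6, edges, conflicts, 0, 5)
```

On this instance the unconstrained longest path 0-1-3-5 (score 4) uses both complementary nodes. After one multiplier update the relaxed problem returns the antisymmetric path 0-1-4-5 with score 3.5, and the upper and lower bounds coincide.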
A formula that is widely used for step-size computation because it shows good performance in practice is given by

$$\theta_t = \gamma_t\, \frac{Z(\lambda^t) - Z^*}{\sum_{e \in E_U} (S^t_e)^2}, \tag{9}$$

where $Z^*$ is the value of the best solution to the original problem that has been computed so far and $\gamma_t$ defines a decreasing adaptation parameter.

D. Suboptimal Solutions

A straightforward strategy to compute suboptimal solutions, also implemented in PILOT [11], is to cut off previous solutions by an additional constraint. A known drawback of this approach is that solving time may increase dramatically after generating a few suboptimal solutions. We suggest a different strategy and overcome this problem by means of an algorithm which, for a given number $k$, directly computes the $k$ longest paths. We use the algorithm by Yen [28] that was originally designed to compute the $k$ shortest paths without cycles on general directed graphs.

Yen's algorithm is a deviation algorithm based on the fact that the $i$-th shortest path $P_i$ will coincide with every shorter path $P_{i-1}, \ldots, P_1$ up to some node until it deviates. The farthest node from the source $s$ with this property is called the deviation node $d(P_i)$. The strategy to find the $(i+1)$-st shortest s-t path $P_{i+1}$ is, starting at $d(P_i)$, to compute for each node $v^i_j$ of $P_i$ the shortest path to $t$ that deviates from $P_i$ at node $v^i_j$. To this end, a shortest path from $v^i_j$ to $t$ is computed which is not allowed to use the edge $(v^i_j, v^i_{j+1})$. This shortest path from $v^i_j$ to $t$ is then concatenated with the prefix $(v^i_1, \ldots, v^i_{j-1})$ of $P_i$ to obtain the shortest s-t path that deviates from $P_i$ at node $v^i_j$. This path is added to a candidate set $X$. After the shortest deviating paths of $P_i$ have been computed, the shortest path in the candidate set $X$ corresponds to the $(i+1)$-st shortest s-t path $P_{i+1}$ and is removed from $X$.
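The deviation idea can be illustrated with a small sketch for the k longest s-t paths in a DAG, ignoring antisymmetry. It is a simplified variant under stated assumptions rather than Yen's original procedure: nodes are topologically numbered, deviations are tried at every node of the last accepted path instead of starting at the deviation node, and at each deviation node the outgoing edges of all previously accepted paths with the same prefix are banned to avoid duplicates. The names `longest_from` and `k_longest_paths` are hypothetical.

```python
import heapq


def longest_from(edges, src, t, banned):
    """Longest src-t path avoiding banned edges; None if t cannot be reached."""
    best, pred = {src: 0.0}, {}
    for (i, k), w in sorted(edges.items()):       # ascending tail = topological order
        if i in best and (i, k) not in banned and best[i] + w > best.get(k, float("-inf")):
            best[k], pred[k] = best[i] + w, i
    if t not in best:
        return None
    path = [t]
    while path[-1] != src:
        path.append(pred[path[-1]])
    return path[::-1]


def k_longest_paths(edges, s, t, k):
    """Return up to k (score, path) pairs in order of decreasing score."""
    def score(p):
        return sum(edges[a, b] for a, b in zip(p, p[1:]))

    first = longest_from(edges, s, t, set())
    if first is None:
        return []
    found, heap, seen = [first], [], {tuple(first)}
    while len(found) < k:
        last = found[-1]
        for j in range(len(last) - 1):            # try to deviate at every node of `last`
            root = last[: j + 1]
            banned = {(p[j], p[j + 1]) for p in found if p[: j + 1] == root}
            spur = longest_from(edges, last[j], t, banned)
            if spur is None:
                continue
            cand = root + spur[1:]
            if tuple(cand) not in seen:
                seen.add(tuple(cand))
                heapq.heappush(heap, (-score(cand), cand))   # max-heap via negated score
        if not heap:
            break
        found.append(heapq.heappop(heap)[1])
    return [(score(p), p) for p in found]


# Toy DAG with four s-t paths of scores 4.0, 3.5, 3.0 and 2.5.
edges = {(0, 1): 0.0, (0, 2): 0.0, (1, 3): 2.0, (1, 4): 2.0,
         (2, 3): 1.0, (2, 4): 1.0, (3, 5): 2.0, (4, 5): 1.5}
top3 = k_longest_paths(edges, 0, 5, 3)
```

Because the graph is acyclic, the sketch needs no special handling to exclude cycles. Replacing `longest_from` by an antisymmetric path search would give the overall scheme the paper builds on.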
Yen's algorithm performs an additional trick to guarantee paths without cycles, which we do not discuss here. For a more detailed description of this algorithm and its variants please refer to reference material [28], [29].

Our problem differs in a few points from the original problem solved by Yen's algorithm, so it requires a few adaptations. While Yen's algorithm is designed for general directed graphs that may contain cycles, we are working on a DAG. This simplifies the problem as we do not have to worry about cycles and can simply transform the shortest path problem into a longest path problem. Note that the longest path problem is NP-complete in graphs with cycles. A second difference is that we have the additional condition to find antisymmetric paths. Therefore, every time the shortest path algorithm is called in Yen's algorithm, we replace this call by solving the Lagrangian relaxation formulation for the longest antisymmetric path search. The following theorem and its proof capture the main algorithmic result of this paper.

Theorem 1. The combination of our Lagrangian relaxation-based algorithm for antisymmetric paths and a modification of Yen's algorithm solves the problem of computing the $k$ longest antisymmetric paths in time $O(k\,l\,s\,(|E| + |V|))$, where $l$ is the length of the longest path and $s$ is the total number of subgradient iterations.

Proof: In iteration $i + 1$ of Yen's algorithm the computed path deviating from $P_i$ at node $v^i_j$ must satisfy two conditions in order to form an antisymmetric path in $G$.
1) There are no two nodes in the path from $v^i_j$ to $t$ that are in conflict.
2) None of the nodes in the computed path from $v^i_j$ to $t$ is in conflict with some node from the prefix of path $P_i$ up to node $v^i_j$.

The first condition is satisfied by the Lagrangian relaxation formulation itself, because if applied to the subgraph $v^i_j \ldots t$, every feasible solution corresponds to an antisymmetric path from node $v^i_j$ to $t$. To meet the second condition it is sufficient to remove all nodes from the subgraph $v^i_j \ldots t$ that are connected via an undirected edge with some node of the prefix $s \ldots v^i_j$ of $P_i$ before we compute the longest antisymmetric path. This trick also simplifies the longest antisymmetric path search for increasing $j$, as the possibilities to generate an infeasible solution are decreasing.

The complexity of Yen's algorithm for computing the $k$ longest paths in a DAG is $O(k\,|V|\,(|E| + |V|))$. The first factor $|V|$ comes from the fact that, in a general graph, one path can possibly contain all $|V|$ nodes. In the case of peptide sequencing, the length of a path equals the length of the predicted peptide, which usually does not exceed a length of 30 for typical experimental settings. In the longest antisymmetric path version using Lagrangian relaxation, the $O(|E| + |V|)$ DAG longest path algorithm is called iteratively during the subgradient optimization algorithm. Therefore the complexity of our formulation for identification of a peptide containing $l$ amino acids is $O(k\,l\,s\,(|E| + |V|))$, with $s$ being the number of iterations during subgradient optimization.

Note that the value of $s$ is possibly exponential if the subgradient optimization does not converge and the complete branch-and-bound tree has to be enumerated. Nevertheless, in the results section we will show that for our peptide sequencing formulation on average only very few iterations are required, which leads to a practically efficient algorithm.

E. Scoring Model

We use a probabilistic scoring based on a Bayesian network similar to the scoring model of PepNovo. Bayesian networks are directed acyclic graphs where nodes represent random variables and the edges represent conditional dependencies between variables.
The variables in our model are the ion types $t \in T$ that are considered by our scoring model, and the possible values of each variable are the intensities. Therefore, as a first step, we normalize the intensity of all peaks to discrete values as defined by Dančík et al. [9] by using their rank as intensity. The usage of Bayesian networks for scoring nodes in the spectrum graph is motivated by the observation that fragmentation events are not independent. For example, the probability of observing a strong b-ion is not independent of the abundance and intensity of the complementary y-ion.

Unlike for the PepNovo algorithm, where the structure of the probabilistic network is predefined, leading to a fixed set of accounted conditional dependencies, we implemented a flexible scoring scheme where the network topology can either be defined by the user or be learned automatically during the training process. For inference and training of the Bayesian network we used the Bayesian network classifiers in the machine learning suite Weka [30].

Similar to PepNovo, we discretize the relative position of a cleavage into several (default 3) equally sized regions $r$ to account for the different intensity distributions in the center and terminal regions usually observed in CID spectra. For each of the regions we train a Bayesian network using some training set of tandem mass spectra with known peptide identification. For each training spectrum we construct the node set of the spectrum graph and select an equal number of true positives (vertices representing a true PRM) and false positives (vertices not representing a true PRM). For each of the selected nodes we look for witnessing peaks at their calculated positions and record their normalized intensity to obtain the training vectors for the Bayesian network.
Each of the training vectors has one additional entry, the class label, which is T for true positives and F for false positives. We select only those ion types of the witness set for the network training that appear in at least $t$ percent of the true positive samples of the training set, where the threshold parameter $t$ can be defined by the user. For each of these selected types we then add a node in the Bayesian network. One additional node for the class is added. During the network training, the structure (set of directed edges) of the network is learned, and once the structure is fixed the conditional probability tables are learned from the training data. While the user can control a huge range of possible options for the Weka Bayesian network classifier training through our program, we set as the default training algorithm the K2-HillClimber and the Bayesian metric for local scoring [31]. For a user-defined network topology the first step is skipped and only the conditional probability tables are computed.

Once the network is trained, we score a node $v$ in the spectrum graph by looking for peaks in the spectrum at the calculated masses for the selected ion types to obtain the set of intensity observations $I_v$. Using the trained Bayesian network $BN$ we then compute the log-likelihood ratio as:

$$\mathrm{LLR}(v) = \log \frac{\Pr(I_v \mid BN, \text{class} = T)}{\Pr(I_v \mid BN, \text{class} = F)}, \tag{10}$$

where $\Pr(I_v \mid BN, \text{class} = X)$, $X \in \{T, F\}$, is the probability of observation $I_v$ under model $BN$ when the class variable is set to $X$.

In contrast to Bern and Datta, who obtain their false positive samples from perturbation of the correct PRM, we only take false positive PRMs for which a node was created during the spectrum graph construction. We chose that approach since we want the Bayesian network to discriminate between correct and false nodes in the spectrum graph.
By just perturbing the true PRM one will very likely generate false positive training samples containing only zero-intensity entries, which can never be the case for a node in the spectrum graph, as a node requires at least one peak to be generated.

In addition to the Bayesian network we also use a simple intensity rank score $S_R(v)$, as used by INSPECT [1]. This score is the logarithm of the ratio between two probabilities: the probability that a peak with a certain intensity rank corresponds to a certain ion type (e.g., a b-ion) and the probability that a randomly chosen peak was generated by that ion type. As these values differ between the mass regions of a spectrum, we split the spectrum into three equally spaced mass regions and estimate the probabilities for each of them separately, using the same training data as for the Bayesian network. For example, if we generate a node for a peak of rank 4, and this node interprets the peak as a b-ion, then $S_R(v)$ is the log ratio between the probability that a rank 4 peak is a b-ion and the probability that any random peak is a b-ion.

The final score $s(v)$ for each node $v$ of the spectrum graph is then computed as

$$s(v) = \mathrm{LLR}(v) + S_R(v). \tag{11}$$

Nodes with negative scores correspond to unreliable PRMs and are removed from the graph in order to reduce its size and speed up the candidate generation process. Since our formulation works with edge weights, we move the node scores onto the edges such that each directed edge receives the score of its left node. In the filtered spectrum graph we compute the predefined number of suboptimal solutions, each corresponding to one antisymmetric path. To account for missed cleavages we also add edges corresponding to pairs and triples of amino acids to the spectrum graph.
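The node scoring of Eq. (11), the removal of negatively scored nodes, and the transfer of node scores onto edges can be sketched as follows. This is only an illustration under stated assumptions: the LLR value is taken as given (it would come from the trained Bayesian network), the "left node" of an edge is interpreted as its source, and all function names are hypothetical.

```python
import math


def rank_score(p_type_given_rank, p_type):
    """S_R(v): log ratio of P(ion type | intensity rank) to P(ion type)."""
    return math.log(p_type_given_rank / p_type)


def node_score(llr, s_r):
    """Final node score s(v) = LLR(v) + S_R(v), as in Eq. (11)."""
    return llr + s_r


def filter_and_weight(node_scores, edges):
    """Remove nodes with negative score and weight the remaining edges.

    node_scores: dict mapping node -> s(v)
    edges: iterable of (source, target) pairs
    Returns a dict mapping each surviving edge to the score of its source
    node (assumption: "left node" means the source of the directed edge).
    """
    kept = {v: s for v, s in node_scores.items() if s >= 0.0}
    return {(u, w): kept[u] for (u, w) in edges if u in kept and w in kept}
```

For instance, with scores `{"a": 1.5, "b": -0.2, "c": 0.7}` and edges a→b, a→c, b→c, only the edge a→c survives and carries the weight 1.5.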
For each of the generated candidates we try, in a second step, to resolve the pairs and triples of amino acids. To this end we generate all possible combinations and permutations of amino acids to obtain a candidate superset. The candidates in that superset are then re-scored by a refined version of a shared peaks count in which we reward abundant witness peaks and penalize missing ones. Given a candidate sequence, we look for witnessing peaks in the query spectrum and give a bonus if one is found or a penalty if it is missing. Furthermore, we check whether the peak is a primary isotopic peak, a secondary isotopic peak or a lone peak. A peak is called a primary isotopic peak if we find a child peak at offset 1 Da for a singly charged ion or 0.5 Da for a doubly charged ion. Analogously, a peak is called a secondary isotopic peak if it has a parent peak at offset -1 Da for a singly charged ion or -0.5 Da for a doubly charged ion. If a peak is neither primary isotopic nor secondary isotopic, it is a lone peak. If a witnessing peak is a primary peak we add another bonus to the score, whereas we charge a penalty if it is a secondary peak. The reward and penalty scores are currently user parameters; we plan to offer a generic algorithm to estimate reasonable values in future versions. The candidates are then re-ranked according to this score and the predefined number of candidates is returned.

III. RESULTS

In this section we present and discuss our computational results. We compared ANTILOPE with state-of-the-art alternative peptide identification software with respect to running time and quality.

A. Efficiency

The major contribution of this work is the new algorithmic approach based on Lagrangian relaxation. We first give a thorough analysis of its performance and compare it to the ILP formulation (1)-(6).
As implemented in PILOT, we generate the suboptimal solutions of this formulation by introducing additional constraints that cut off previous solutions. We implemented our algorithm in C++ using the OpenMS [23] library, which offers convenient data structures and algorithms to handle and manipulate spectral data. For the ILP formulation we use the commercial CPLEX [32] solver (version 9.0), which is in general among the fastest solvers available. We were not able to compare directly to PILOT because the software was not available upon request.

In Fig. 4 we compare the running times of the ILP and our Lagrangian relaxation formulation on a set of 100 tandem mass spectra from the ISB dataset [33]. In this comparison we only consider the time required to generate the set of candidate sequences and ignore the preprocessing of the spectrum and the spectrum graph generation, as these steps are independent of the applied algorithm. We compared the running time required to generate the top-scoring 20, 30 and 50 candidates for each spectrum. The figure shows that our approach significantly outperforms the ILP formulation on all instances and that the performance gain increases with the number of candidates to be generated. Our algorithm is on average ≈ 9 times faster for the best 20 candidates; for 30 and 50 candidates the average advantage increases to factors of ≈ 12 and ≈ 18, respectively. While the run time of the ILP formulation for the top 50 candidates was usually above 2 seconds, the Lagrangian relaxation formulation requires on average only a few tenths of a second.

In a closer analysis we investigated the convergence behavior of our Lagrangian relaxation formulation. It reveals that for each Lagrange problem solved during the path ranking algorithm only very few iterations of the subgradient optimization are required.
The path ranking algorithm maintains a list of previously detected candidate paths together with their scores. Since the score $Z(\lambda)$ of the Lagrange problem is an upper bound on the score $Z_{IP}$ of the best possible feasible solution, subgradient optimization can be aborted as soon as $Z(\lambda)$ falls below the lowest score in the candidate list. If the Lagrangian relaxation does not converge and cannot be aborted after 100 iterations, we apply a branching step. We use the best infeasible path found during the subgradient optimization and arbitrarily choose one node $v_b$ involved in a conflict. Then we generate two subproblems, one forcing $v_b$ to be in the path and one forbidding $v_b$ to be selected. We found that a branching step had to be performed for only a very small fraction of the Lagrange problems, and the depth of the branch-and-bound trees never exceeded three.

It should be noted that the performance strongly depends on the scoring function used, since a good scoring function will not generate many high-scoring nodes for the same peak; only the correct one should receive a significantly high score. A good scoring function therefore affects not only the identification performance but also the complexity of the candidate generation.

Fig. 4. Running time comparison between Lagrangian relaxation and ILP formulation for the computation of 20, 30, and 50 suboptimal solutions of 100 benchmark spectra. Box-and-whisker plots display median, quartiles, and extrema of the distribution of relative performance gains rpg = run-time(ILP) / run-time(Lagrange). ANTILOPE outperforms the CPLEX-based method for all spectra and all numbers of suboptimal solutions. The advantage increases with the number of suboptimal solutions. The considered spectrum graphs contained between 80 and 200 nodes.

B. Sequencing Performance

We compared the performance of ANTILOPE to four non-commercial de novo sequencing tools, LutefiskXP, NovoHMM, PILOT¹ and PepNovo, and to the commercial software PEAKS². We used two measures, accuracy and recall, to assess their performance. Accuracy denotes the fraction of correctly predicted amino acid residues among all predicted residues. Recall is the fraction of correctly predicted residues relative to the total number of residues of the correct peptide sequences. When looking at suboptimal solutions, for each algorithm we selected the prediction with the highest recall and reported the recall and accuracy of this prediction. In case of multiple predictions with the same recall we report the values for the one with the highest accuracy among them.

¹ As PILOT was not available, the identifications for the test data were generated by the authors of PILOT.
² We used the PEAKS Online 2.0 web interface.

As benchmark set we chose tandem mass spectra from the ISB dataset [33] that were generated by an ESI ion trap mass spectrometer by Thermo Finnigan, and spectra from the Open Proteomics Database. This set of reliably annotated spectra from tryptic peptides has already been used for training PepNovo and NovoHMM. We created a training set of 1214 spectra from doubly charged precursor ions of unique peptides to train the scoring model. During the Bayesian network training for each of the 3 mass sectors, only the ion types that had a peak in at least 20% of the true positive training samples are selected for the corresponding Bayesian network. The topologies of the Bayesian networks, together with a brief discussion, can be found in the supplementary material.

Fig. 5. Benchmark. Comparison of accuracy and recall of ANTILOPE with NovoHMM, PepNovo, PILOT, PEAKS and LutefiskXP. We compare the accuracy and recall of the best prediction among the top 1, 3, 5 and 10 ranked candidates returned by each tool. As the best prediction we consider the one with the best recall among the candidates. Since NovoHMM generates only one candidate per spectrum it appears only in the first plot. Discussion in text.

The parameters to score the peptide-spectrum matches for the candidate sequences in the superset were chosen as follows: for an abundant b- or y-ion we awarded the score $PSM_b = PSM_y = 1$, doubly charged b- or y-ions scored 0.5, a-ions 0.3, and all neutral losses were awarded a score of 0.2. Isotopic peaks for an ion type $t$ were awarded a score of $PSM_t \cdot 0.2$. If a peak was missing, a penalty of $PSM_t \cdot 0.5$ was subtracted from the score. When a peak was classified as a secondary peak, its score was reduced to $PSM_t \cdot 0.8$. The score for each peak is then weighted with the relative m/z distance between the expected and the observed m/z value using a linear function.

The test set consists of 200 spectra of peptides (the peptides in the training and test datasets are disjoint) with a molecular mass of at most 1600 Da and an average peptide length of 10 residues. We score a predicted amino acid as correct if its predicted starting mass position does not deviate by more than 2.5 Da from the correct starting mass position. Furthermore, in our evaluation we do not discriminate between the amino acids Q/K and I/L, since their masses cannot be distinguished. To compare the tools we do not only look at the top hit, but also at the accuracy and recall for the best hit among the top 3, 5 and 10 candidates. The results are presented in Fig. 5. Since NovoHMM only generates one candidate per spectrum, it appears only in the first plot.
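The evaluation criteria described above can be sketched in a few lines. The sketch assumes complete (gap-free) predictions given as one-letter strings, takes the starting mass position of a residue to be the sum of the monoisotopic masses of all preceding residues, and uses hypothetical function names; it is an illustration of the stated rules, not the evaluation script used for the benchmark.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
CANON = {"I": "L", "K": "Q"}  # I/L and Q/K are not discriminated


def start_masses(seq):
    """Starting mass position of every residue (sum of preceding masses)."""
    masses, total = [], 0.0
    for aa in seq:
        masses.append(total)
        total += MONO[aa]
    return masses


def accuracy_recall(predicted, correct, tol=2.5):
    """Return (accuracy, recall) for two non-empty sequences."""
    truth = [(CANON.get(aa, aa), m)
             for aa, m in zip(correct, start_masses(correct))]
    hits = 0
    for aa, m in zip(predicted, start_masses(predicted)):
        aa = CANON.get(aa, aa)
        # Correct if the same residue starts within tol Da in the true peptide.
        if any(t_aa == aa and abs(t_m - m) <= tol for t_aa, t_m in truth):
            hits += 1
    return hits / len(predicted), hits / len(correct)


def best_of_top_k(candidates, correct, k):
    """Best prediction among the top k: highest recall, ties by accuracy."""
    scored = [accuracy_recall(c, correct) for c in candidates[:k]]
    return max(scored, key=lambda ar: (ar[1], ar[0]))
```

For the true peptide `AGVQ`, the prediction `GAVK` yields accuracy and recall of 0.5 each: the swapped first two residues start at the wrong mass positions, while V and K (treated as Q) match.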
Looking only at the top hit, the recall of ANTILOPE (≈ 73.4%) is only marginally lower than that of PEAKS (≈ 73.7%) but slightly better than that of PILOT (≈ 71.5%), NovoHMM (≈ 70.2%) and PepNovo (≈ 69.4%). The recall of LutefiskXP (≈ 60.6%) is much lower than that of all other tools. Since ANTILOPE and NovoHMM both compute complete sequences, they have almost equal values for accuracy and recall, while for LutefiskXP and PepNovo these values differ, as they allow for gaps in their predicted sequences. If we go from the top hit to the best 3, 5 and 10 candidates, we observe that in terms of recall ANTILOPE is always very close to PepNovo and PEAKS (equal for the top 3, a 2.5% advantage of PepNovo and PEAKS for the top 10) and always approximately 4% better than PILOT. In terms of accuracy PepNovo (≈ 92%) performs better than ANTILOPE, PEAKS and PILOT since it allows for partial peptide predictions. The accuracy of LutefiskXP is slightly better than that of ANTILOPE and PILOT, but this accuracy is achieved at a much lower recall, which is between 12% and 14% lower in all four cases.

The four tools ANTILOPE, LutefiskXP, NovoHMM and PepNovo are comparable in terms of run time, which is usually between 0.5 and 1.5 seconds per spectrum. The running time of PILOT as reported by the authors was on average around 9 seconds per spectrum. We cannot directly estimate the run time of PEAKS since the identification is performed via a web interface.

IV. CONCLUSION

We proposed a new algorithmic approach to solve the longest antisymmetric path problem by means of Lagrangian relaxation, combined with a polynomial algorithm for suboptimal solutions.
Using this approach, the algorithm is flexible and not restricted to the nested structure of the spectrum graph, and it solves the problem much faster than an LP relaxation-based method for the same formulation. Therefore, for our tool ANTILOPE, the candidate generation is no longer the bottleneck; the most time-consuming step is now the re-ranking phase, since the number of possible candidates can easily explode if several double and triple amino acid edges are selected. In terms of sequencing performance, ANTILOPE is already competitive with the available state-of-the-art programs PepNovo and PEAKS, while it outperforms LutefiskXP and NovoHMM, especially if we also consider suboptimal solutions. For long peptides PepNovo still has a small advantage, which is mostly due to the fact that the current version of ANTILOPE produces only complete annotations without gaps.

In our experiments we generated only two nodes for each peak, one for a b-ion and one for a y-ion. Generating nodes for all ion types decreased the performance, as this always led to some high-scoring but false nodes and thus to wrong interpretations. Nevertheless, we are confident that generating more nodes can lead to better identifications in combination with a refined scoring scheme. The algorithmic framework is flexible enough to work with mass spectra generated by different kinds of mass spectrometers: the user can define for which ion types a node shall be generated, which can lead to improved identification performance for different datasets. Combined with a scoring function trained on a representative set of spectra, the ability of our algorithm to directly model multiply charged ions can lead to an improvement over the other algorithms when analyzing tandem mass spectra obtained from higher charged precursor ions.

For the future we plan to improve our algorithm in several directions.
We will include support for the identification of peptides containing post-translational modifications. Furthermore, we want to support combinations of complementary fragmentation techniques like CID together with electron transfer dissociation (ETD) or CID with electron capture dissociation (ECD), which can improve the identification [22], [34]. In these applications the flexibility of our formulation may become a major advantage over the other programs. To improve the performance for spectra of longer peptides we will extend ANTILOPE such that it can produce partial predictions allowing for gaps at the termini. This, together with a machine learning strategy for the re-scoring like the rank-boosting algorithm used by PepNovo, should lead to a further improvement. ANTILOPE is freely available as part of upcoming releases of the open source proteomics library OpenMS [23], allowing for convenient integration into experimental workflows.

REFERENCES

[1] S. Tanner, H. Shu, A. Frank, L. C. Wang, E. Zandi, M. Mumby, P. A. Pevzner, and V. Bafna, “Inspect: identification of posttranslationally modified peptides from tandem mass spectra,” Anal. Chem., vol. 77, no. 14, pp. 4626–4639, 2005.
[2] J. K. Eng, A. L. McCormack, and J. R. Yates, “An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database,” J. Am. Soc. Mass. Spectrom., vol. 5, no. 11, pp. 976–989, 1994.
[3] D. N. Perkins, D. J. Pappin, D. M. Creasy, and J. S. Cottrell, “Probability-based protein identification by searching sequence databases using mass spectrometry data,” Electrophoresis, vol. 20, no. 18, pp. 3551–3567, 1999.
[4] L. Y. Geer, S. P. Markey, J. A. Kowalak, L. Wagner, M. Xu, D. M. Maynard, X. Yang, W. Shi, and S. H. Bryant, “Open mass spectrometry search algorithm,” J. Proteome Res., vol. 3, no. 5, pp. 958–964, 2004.
[5] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie, “Peaks: Powerful software for peptide de novo sequencing by MS/MS,” Rapid Commun. Mass Spectrom., vol. 17, pp. 2337–2342, 2003.
[6] A. Frank and P. Pevzner, “PepNovo: De novo peptide sequencing via probabilistic network modeling,” Anal. Chem., vol. 77, no. 4, pp. 964–973, 2005.
[7] B. Fischer, V. Roth, F. Roos, J. Grossmann, S. Baginsky, P. Widmayer, W. Gruissem, and J. M. Buhmann, “NovoHMM: A hidden Markov model for de novo peptide sequencing,” Anal. Chem., vol. 77, no. 22, pp. 7265–7273, 2005.
[8] J. A. Taylor and R. S. Johnson, “Sequence database searches via de novo peptide sequencing by tandem mass spectrometry,” Rapid Commun. Mass Spectrom., vol. 11, no. 9, pp. 1067–1075, 1997.
[9] V. Dančík, T. A. Addona, K. R. Clauser, J. E. Vath, and P. Pevzner, “De novo protein sequencing via tandem mass-spectrometry,” J. Comput. Biol., vol. 6, pp. 327–341, 1999.
[10] M. Bern and D. Goldberg, “De novo analysis of peptide tandem mass spectra by spectral graph partitioning,” J. Comput. Biol., vol. 13, no. 2, pp. 364–378, 2006.
[11] P. A. DiMaggio and C. A. Floudas, “De novo peptide identification via tandem mass spectrometry and integer linear optimization,” Anal. Chem., vol. 79, no. 4, pp. 1433–1446, 2007.
[12] C. Bartels, “Fast algorithm for peptide sequencing by mass spectroscopy,” Biological Mass Spectrometry, vol. 19, no. 6, pp. 363–368, 1990.
[13] H. N. Gabow, S. N. Maheshwari, and L. J. Osterweil, “On two problems in the generation of program test paths,” IEEE Trans. Softw. Eng., vol. 2, no. 3, pp. 227–231, 1976.
[14] T. Chen, M. Y. Kao, M. Tepel, J. Rush, and G. M. Church, “A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry,” J. Comput. Biol., vol. 8, no. 3, pp. 325–337, 2001.
[15] B. Lu and T. Chen, “A suboptimal algorithm for de novo peptide sequencing via tandem mass spectrometry,” J. Comput. Biol., vol. 10, no. 1, pp. 1–12, 2003.
[16] C. Liu, Y. Song, B. Yan, Y. Xu, and L. Cai, “Fast de novo peptide sequencing and spectral alignment via tree decomposition,” in Proc. 11th Pacific Symp Biocomp (PSB). World Scientific, 2006, pp. 255–266.
[17] V. Bafna and N. Edwards, “On de novo interpretation of tandem mass spectra for peptide identification,” in Proc. 7th Ann Intern Conf Res Comp Mol Bio (RECOMB). ACM Press, 2003, pp. 9–18.
[18] E. Althaus and S. Canzar, “A Lagrangian relaxation approach for the multiple sequence alignment problem,” J. Comb. Optim., vol. 16, no. 2, pp. 127–154, 2008.
[19] A. Caprara, R. Carr, S. Istrail, G. Lancia, and B. Walenz, “1001 optimal PDB structure alignments: integer programming methods for finding the maximum contact map overlap,” J. Comput. Biol., vol. 11, no. 1, pp. 27–52, 2004.
[20] M. Bauer, G. W. Klau, and K. Reinert, “Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization,” BMC Bioinf., vol. 8, no. 1, p. 271, 2007.
[21] R. Andonov, S. Balev, and N. Yanev, “Protein threading: From mathematical models to parallel implementations,” INFORMS J. on Computing, vol. 16, no. 4, pp. 393–405, 2004.
[22] R. Datta and M. Bern, “Spectrum fusion: using multiple mass spectra for de novo peptide sequencing,” J. Comput. Biol., vol. 16, no. 8, pp. 1169–1182, 2009.
[23] M. Sturm, A. Bertsch, C. Groepl, A. Hildebrandt, R. Hussong, E. Lange, N. Pfeifer, O. Schulz-Trieglaff, A. Zerck, K. Reinert, and O. Kohlbacher, “OpenMS - an open-source software framework for mass spectrometry,” BMC Bioinf., vol. 9, p. 163, 2008.
[24] S. Andreotti, “Fast de novo sequencing with mathematical programming,” Master’s thesis, Freie Universität Berlin, Germany, January 2008.
[25] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 2nd ed. The MIT Press, 2001.
[26] D. Bertsimas and J. Tsitsiklis, Introduction to Linear Optimization. Athena Scientific, 1997.
[27] M. Held, P. Wolfe, and H. D. Crowder, “Validation of subgradient optimization,” Mathematical Programming, vol. 6, pp. 62–88, 1974.
[28] J. Y. Yen, “Finding the K shortest loopless paths in a network,” Management Science, vol. 17, pp. 712–716, 1971.
[29] E. Martins and M. Pascoal, “A new implementation of Yen’s ranking loopless paths algorithm,” Quarterly Journal of the Belgian, French and Italian Operations Research Societies, vol. 1, pp. 121–133, 2003.
[30] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explor. Newsl., vol. 11, pp. 10–18, November 2009. [Online]. Available: http://doi.acm.org/10.1145/1656274.1656278
[31] R. R. Bouckaert, “Bayesian network classifiers in weka,” (Working paper series. University of Waikato, Department of Computer Science), vol. 14, October 2004.
[32] IBM, “CPLEX,” http://www.cplex.com.
[33] A. Keller, S. Purvine, A. I. Nesvizhskii, S. Stolyar, D. R. Goodlett, and E. Kolker, “Experimental protein mixture for validating tandem mass spectral analysis,” OMICS: A Journal of Integrative Biology, vol. 6, pp. 207–212, 2002.
[34] A. Bertsch, A. Leinenbach, A. Pervukhin, M. Lubeck, R. Hartmer, C. Baessmann, Y. A. Elnakady, R. Müller, S. Böcker, C. G. Huber, and O. Kohlbacher, “De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation,” Electrophoresis, vol. 30, no. 21, pp. 3736–3747, 2009.

Sandro Andreotti received the MSc degree in Bioinformatics from Freie Universität Berlin, Germany, in 2008, where he is currently working as a PhD student in the Algorithmic Bioinformatics group of Knut Reinert.
His research focuses on computational proteomics and discrete optimization.

Gunnar W. Klau received his PhD in Computer Science in 2001 from Saarland University, Germany. Currently, he heads the Life Sciences group at CWI, the national research center for mathematics and computer science in the Netherlands. His research interests are combinatorial algorithms and discrete optimization in computational biology.

Knut Reinert received his PhD in 1999 from Saarland University, Germany. He worked from 1999 to 2002 for Celera Genomics and took part in the sequencing of the human genome. Currently he is professor for Algorithmic Bioinformatics at the FU Berlin in Germany. His research interests lie in developing algorithms for sequence analysis and proteomics.
