A pruned dynamic programming algorithm to recover the best segmentations with 1 to $K_{max}$ change-points

Guillem Rigaill

Institute of Plant Sciences Paris-Saclay (IPS2), UMR 9213/UMR 1403, CNRS, INRA, Université Paris-Sud, Université d'Evry, Université Paris-Diderot, Sorbonne Paris-Cité, Bâtiment 630, 91405 Orsay, France. e-mail: rigaill@inra.evry.fr

Abstract: A common computational problem in multiple change-point models is to recover the segmentations with 1 to $K_{max}$ change-points of minimal cost with respect to some loss function. Here we present an algorithm to prune the set of candidate change-points which is based on a functional representation of the cost of segmentations. We study the worst case complexity of the algorithm when there is a unidimensional parameter per segment and demonstrate that it is at worst equivalent to the complexity of the segment neighbourhood algorithm: $O(K_{max} n^2)$. For a particular loss function we demonstrate that pruning is on average efficient even if there are no change-points in the signal. Finally, we empirically study the performance of the algorithm in the case of the quadratic loss and show that it is faster than the segment neighbourhood algorithm.

Keywords: multiple change-point detection, dynamic programming, functional cost, segment neighbourhood

1. Introduction

A common computational problem in multiple change-point models is to recover segmentations with 1 to $K_{max}$ change-points of minimal cost, where the cost is some well chosen criterion, such as minus the log-likelihood or the quadratic loss (Bai and Perron, 2003; Picard et al., 2005; Harchaoui and Cappé, 2007; Guédon et al., 2007; Killick and Eckley, 2011; Arlot et al., 2012; Cleynen and Lebarbier, 2014; Cleynen et al., 2014a). Many algorithms have been proposed to exactly solve this problem (Bellman, 1961; Fisher, 1958; Auger and Lawrence, 1989; Bai and Perron, 2003; Guédon, 2008).
All these are dynamic programming algorithms and have a complexity which is linear in $K_{max}$ and quadratic in the length of the signal $n$: $O(K_{max} n^2)$. We will refer to these algorithms as segment neighbourhood algorithms. In practice, this quadratic complexity in $n$ is a problem. Indeed, if $n$ is larger than $10^5$ or $10^6$, one run of the algorithm can take several hours or even days. In applications such as DNA copy number studies the signal length is typically of this order: $n = 10^5$ to $10^6$. Several strategies have been developed to cope with this problem, the most famous probably being the binary segmentation heuristic (Scott and Knott, 1974). A common idea is to first identify a restricted set of candidate change-points with any fast heuristic and then run a segment neighbourhood algorithm on this restricted set (Kolesnikov and Fränti, 2003; Harchaoui and Lévy-Leduc, 2010; Gey et al., 2008). These heuristics typically have a complexity which is linear in $n$. From a computational perspective the main drawback of these approaches is that they lack optimality.

Most algorithms and heuristics that aim at recovering the segmentations of minimal cost operate on segmentations through their costs. Here we propose a new representation of segmentations that we call the functional cost, where the cost of a segmentation is represented as a function of a, possibly multidimensional, parameter. We demonstrate that using this functional cost it is theoretically possible to prune the set of segmentations while searching for the segmentations of minimal cost with 1 to $K_{max}$ change-points, and thus to carry out the calculation only for a small subset of candidate segmentations. We call the resulting algorithm the pruned dynamic programming algorithm (pDPA).
From an intuitive point of view, if we consider a random sequence of $n$ data-points with $K$ true abrupt change-points, we expect that the segmentation with those $K$ change-points will greatly outperform other possible segmentations in terms of cost. Hence it makes sense that it is possible to efficiently prune the set of segmentations. On the contrary, if we now consider a random sequence of $n$ data-points without any change-points, all segmentations will have roughly the same cost and we do not expect an efficient pruning of the set of segmentations. This intuition is indeed true if segmentations are represented through their cost. In the context of another change-point optimization problem, the optimal partitioning problem (Jackson et al., 2005), this idea was made particularly clear by Killick et al. (2012). For this problem, the PELT algorithm, which prunes segmentations based on their cost, was proven to be linear if the true number of change-points is linear in the number of data-points. If it is not linear, in particular if there are no change-points, PELT's pruning is less efficient. Here we argue that if we consider the functional costs of segmentations, rather than their costs, we can have an efficient pruning even for random sequences without any change-points.

As a proof of concept we study the complexity of the pDPA for change-point models with a unidimensional parameter per segment. In that case, we demonstrate that the algorithm is at worst as efficient as segment neighbourhood algorithms. For a special loss function we demonstrate that on average pruning is efficient even if we consider random sequences without change-points, and we retrieve an average complexity which is sub-quadratic: $O(n \log(n))$. We implemented the algorithm for the commonly used quadratic loss and empirically show that it is faster than segment neighbourhood algorithms.
Related works. Since the first pre-print of this work (Rigaill, 2010), the pDPA has been implemented for other losses (Cleynen et al., 2014b), and in the context of DNA copy number analysis its competitive runtime compared to the segment neighbourhood algorithm, PELT or other approaches for this problem was confirmed by others (Hocking et al., 2013; Hocking, 2013; Maidstone et al., 2014). Furthermore, we demonstrate in this new version of the paper, for a simple yet non-trivial loss function, that on average the pDPA pruning is efficient even if we consider random sequences without change-points.

Finally, the main contribution of this paper is of a computational nature, i.e. an algorithm that recovers the segmentations with 1 to $K_{max}$ change-points of minimal cost. From a computational point of view several methods boil down to this specific problem. Importantly, the statistical properties of these methods have been studied theoretically and empirically, through simulations and on real data, by many (Yao and Au, 1989; Horváth, 1993; Lavielle and Moulines, 2000; Lavielle, 2005; Lebarbier, 2005; Birgé and Massart, 2007; Boysen et al., 2009; Cleynen and Lebarbier, 2014; Zhang and Siegmund, 2007; Lai et al., 2005; Cleynen et al., 2014a), and recently the pDPA as implemented in the R cghseg package for the quadratic loss was shown to reach state-of-the-art performances for the segmentation of DNA copy number profiles (Hocking et al., 2013).

Outline. Section 2 describes the change-point framework used in the paper. Section 3 gives a quick overview of segment neighbourhood algorithms and informally describes the functional cost and the pDPA. Section 4 gives a detailed description of the functional cost and the pDPA. Section 5 is about the worst case, average and empirical complexity of the pDPA.

2. Change-point framework and cost minimization

Data. We assume that we have a sequence of $n$ observations.
We denote this sequence $Y = \{Y_t\}_{1 \le t \le n}$ and denote $Y_{i:j}$ the subsequence between data-point $i$ and data-point $j-1$, i.e. $Y_{i:j} = \{Y_t\}_{i \le t < j}$.

Segmentations. We define a segmentation $m$ of $Y$ by a set of $K$ change-points splitting the sequence into $K+1$ segments. We denote the positions of these $K$ change-points $\tau_k$ for $k = 1, \dots, K$ and we also set the convention that $\tau_0 = 1$ and $\tau_{K+1} = n+1$. We call $r_k$ the $k$-th segment of a segmentation: $r_k = \tau_{k-1}:\tau_k = \{i \mid \tau_{k-1} \le i < \tau_k\}$. $\tau_{k-1}$ is thus the first data-point belonging to $r_k$ and $\tau_k$ is the first data-point not belonging to $r_k$. For $K > 0$, we define $M^K_{1:t}$ as the set of all possible segmentations with exactly $K$ change-points of the sequence $Y_{1:t}$. There are $\binom{t-2}{K}$ such segmentations, and so for the whole sequence we have $|M^K_{1:n+1}| = \binom{n-1}{K}$.

Statistical model. Our goal is to infer from the data both the positions and the number of change-points. Many methods, statistical models and model selection criteria have been proposed to address this problem for various types of data and various types of changes, i.e. change in the mean, in the variance, in the distribution, etc. (Lavielle, 2005; Cleynen and Lebarbier, 2014; Arlot et al., 2012). In practice most of these methods critically rely on algorithms or heuristics that explore the set of all possible segmentations in search of a list of optimal segmentations w.r.t. some well chosen statistical criterion. However, this set of segmentations is extremely large ($|M^K_{1:n+1}| = \binom{n-1}{K}$) and it is hard to explore it efficiently, especially when $n$ is large.
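The count $|M^K_{1:n+1}| = \binom{n-1}{K}$ can be checked directly; a small Python sketch (the function name is ours, for illustration):

```python
from math import comb

def n_segmentations(n, K):
    """Number of segmentations of n data-points with exactly K change-points.

    A change-point tau_k marks the first data-point of a new segment, so it
    can sit at any of the n - 1 positions 2, ..., n, and the K positions are
    distinct: |M^K_{1:n+1}| = C(n-1, K)."""
    return comb(n - 1, K)

# the search space blows up quickly, which motivates efficient exploration
print(n_segmentations(4, 1))      # 3 candidate positions for one change-point
print(n_segmentations(1000, 10))  # astronomically many segmentations
```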
From a computational point of view this exploration problem is often re-formulated as a cost minimization problem under the constraint that the number of change-points is $K = 1, 2, \dots, K_{max}$, where $K_{max}$ is defined by the user (Bellman, 1961; Fisher, 1958; Auger and Lawrence, 1989; Hawkins, 2001; Bai and Perron, 2003; Harchaoui and Cappé, 2007; Guédon, 2008; Arlot et al., 2012; Cleynen and Lebarbier, 2014; Guédon, 2013). To be more specific, the aim is to find the segmentations of minimal cost in $M^K_{1:n+1}$ for $K$ from 1 to $K_{max}$, where the cost of a segmentation is the sum of the costs of its segments:

$$R_m = \sum_{k=1}^{K+1} c_{\tau_{k-1}:\tau_k}, \qquad (1)$$

where $c_{\tau_{k-1}:\tau_k}$ is the cost of the $k$-th segment of $m$.

A smaller class of problems. In this paper, we consider a smaller class of models and methods for which it is possible to write the cost of a segment as follows:

$$c_{\tau_{k-1}:\tau_k} = \min_{\mu} \left( \sum_{i=\tau_{k-1}}^{\tau_k - 1} \gamma(Y_i, \mu) + g(\mu) \right),$$

where $\gamma$ is a loss function, depending on the data-point $Y_i$ and the (possibly multidimensional) parameter $\mu$, and the function $g$ is a regularization penalty. For this class of models we show in the following that it is theoretically possible to prune the set of segmentations, and we explain how to implement this pruning if $\mu$ is unidimensional.

The quadratic loss belongs to this class of models: $\gamma(Y_i, \mu) = (Y_i - \mu)^2$. This framework also includes maximum likelihood inference of identically distributed and independent data-points. Indeed, it suffices to take $\gamma$ equal to minus the log-likelihood: $\gamma(Y_i, \mu) = -\log(p(Y_i, \mu))$. Let us give some examples.

1. We can consider a linear model. To do that we take $Y_i = (Z_i, x_{i1}, \dots, x_{ip})$ in $\mathbb{R}^{p+1}$, $\mu = (\beta_0, \beta_1, \dots, \beta_p, \sigma^2)$ in $\mathbb{R}^{p+1} \times \mathbb{R}^+$, and within a segment $\tau_{k-1}:\tau_k$ we assume that $Z_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. In that case minus the log-likelihood is
$$\gamma(Y_i, \mu) = \frac{1}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \Big( Z_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2.$$
If the variance is known this simplifies to $\gamma(Y_i, \mu) = (Z_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij})^2$.

2. We can also consider the segmentation in the mean of a $p$-dimensional sequence. Here we take $Y_i = (X_{i1}, \dots, X_{ip})$ in $\mathbb{R}^p$, $\mu = (\beta_1, \dots, \beta_p, \sigma^2)$ in $\mathbb{R}^p \times \mathbb{R}^+$, and for all $i$ within a segment $\tau_{k-1}:\tau_k$ we assume that $X_{ij} = \beta_j + \varepsilon_{ij}$ with $\varepsilon_{ij} \sim N(0, \sigma^2)$ i.i.d. In that case minus the log-likelihood is
$$\gamma(Y_i, \mu) = \frac{p}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{j=1}^{p} (X_{ij} - \beta_j)^2.$$
If the variance is known this simplifies to $\gamma(Y_i, \mu) = \sum_{j=1}^{p} (X_{ij} - \beta_j)^2$.

3. We could also consider categorical data. Indeed, we can take $Y_i \in \{1, \dots, p\}$, $\mu = (\pi_1, \dots, \pi_p)$ in $[0,1]^p$ with $\sum_j \pi_j = 1$, and assume that within a segment the $Y_i$ are independent and follow a multinomial distribution with $P(Y_i = j) = \pi_j$. In that case minus the log-likelihood is
$$\gamma(Y_i, \mu) = -\sum_{j=1}^{p} I\{Y_i = j\} \log(\pi_j),$$
where $I$ is the indicator function.

Many segmentation models do not include a regularization penalty ($g(\mu) = 0$); however, in some cases it can be of interest to include a regularization penalty such as the ridge penalty. We could take this into account by setting $g(\mu) = \lambda \mu^2$. For simplicity, in the rest of this paper we will only consider the case $g(\mu) = 0$; the extension to $g(\mu) \neq 0$ is however straightforward.
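For the quadratic loss with $g(\mu) = 0$, the inner minimisation has a closed form: the optimal $\mu$ is the segment mean. A minimal sketch (Python, illustrative naming):

```python
def segment_cost(y, i, j):
    """Quadratic-loss cost c_{i:j} of the segment y[i:j] (0-based, half-open):
    min over mu of sum_{t=i}^{j-1} (y[t] - mu)^2, attained at the segment mean."""
    seg = y[i:j]
    mu = sum(seg) / len(seg)            # arg min of the convex quadratic in mu
    return sum((v - mu) ** 2 for v in seg)

y = [0, 0.5, 0.4, -0.5]                 # the four-point signal of section 3.4
print(segment_cost(y, 0, 2))            # cost of r_{1:3}: 0.125
print(segment_cost(y, 0, 3))            # cost of r_{1:4}: ~0.14
```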
3. Exact segment neighbourhood algorithm and pruning

In this section, we first describe the main update rule of the standard dynamic programming algorithm, often called the segment neighbourhood algorithm, used to recover the segmentations of minimal cost for $K = 1$ to $K_{max}$:

$$Cost^K_{1:n+1} = \min_{m \in M^K_{1:n+1}} \{R_m\}. \qquad (2)$$

Then we present the functional cost representation of a segmentation and informally explain how, using this representation, it is possible to prune the search space of the segment neighbourhood algorithm. In the last subsection (3.5) we explicitly describe the optimal partitioning problem and the advantages and disadvantages of functional pruning over inequality based pruning (as is done in PELT (Killick et al., 2012)).

3.1. The segment neighbourhood algorithm update rule

$R_m$ is segment additive, thus the Bellman optimality principle applies: if a segmentation is optimal, any sub-segmentation of this segmentation is also optimal. Mathematically, this can be expressed as the following update rule:

$$Cost^K_{1:t} = \min_{\tau < t} \{ Cost^{K-1}_{1:\tau} + c_{\tau:t} \}. \qquad (3)$$

This update rule can be performed with an $O(t)$ time complexity. Many algorithms developed for various statistical models implement this particular update rule (Fisher, 1958; Bellman, 1961; Auger and Lawrence, 1989; Bai and Perron, 2003; Guédon, 2008). This algorithm was called the segment neighbourhood algorithm by Auger and Lawrence (1989). To recover $Cost^K_{1:n+1}$ we need to apply update rule (3) for every $t$ smaller than $n+1$ and for every $K$ smaller than $K_{max}$. The overall time complexity is thus $O(K_{max} n^2)$. Most of these algorithms have an $O(n^2)$ space complexity. However some, like Guédon (2008), have an $O(K_{max} n)$ space complexity.
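For the quadratic loss, update rule (3) can be sketched directly; this is a naive $O(K_{max} n^2)$ version with $O(1)$ segment costs via prefix sums (a sketch with names of our choosing, not the author's code):

```python
def segment_neighbourhood(y, kmax):
    """Naive segment neighbourhood algorithm (update rule (3)) for the
    quadratic loss.  cost[k][t] is the minimal cost of segmenting y[:t]
    with exactly k change-points; O(kmax * n^2) time overall."""
    n = len(y)
    INF = float("inf")
    ps, ps2 = [0.0], [0.0]               # prefix sums of y and y^2
    for v in y:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def seg(i, j):                       # quadratic-loss cost of y[i:j], O(1)
        s = ps[j] - ps[i]
        return ps2[j] - ps2[i] - s * s / (j - i)

    cost = [[INF] * (n + 1) for _ in range(kmax + 1)]
    cost[0] = [0.0] + [seg(0, t) for t in range(1, n + 1)]
    for k in range(1, kmax + 1):
        for t in range(k + 1, n + 1):    # update rule (3)
            cost[k][t] = min(cost[k - 1][tau] + seg(tau, t)
                             for tau in range(k, t))
    return cost

cost = segment_neighbourhood([0, 0.5, 0.4, -0.5], 2)
print(cost[1][4])   # best cost with one change-point (~0.14)
print(cost[2][4])   # best cost with two change-points (~0.005)
```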
3.2. Functional cost

The segment neighbourhood algorithm, and in fact most algorithms and heuristics that aim at recovering the segmentations of minimal cost, operate on segmentations through their costs, $R_m$, which are numbers in $\mathbb{R}$. Our pruned DPA algorithm is different and uses a slightly more complex representation that we call the functional cost of a segmentation. We define this function of $\mu$ as:

$$\tilde{R}_m(\mu) = \sum_{k=1}^{K} c_{\tau_{k-1}:\tau_k} + \tilde{c}_{\tau_K:\tau_{K+1}}(\mu), \qquad (4)$$

with $\tilde{c}_{\tau_K:\tau_{K+1}}(\mu) = \sum_{t=\tau_K}^{\tau_{K+1}-1} \gamma(Y_t, \mu)$ if $\tau_{K+1} > \tau_K$ and $\tilde{c}_{\tau_K:\tau_{K+1}}(\mu) = 0$ if $\tau_{K+1} = \tau_K$.

In words, the functional cost $\tilde{R}_m(\mu)$ is the cost of segmentation $m$ if the parameter of $m$'s last segment is set to $\mu$, rather than to the optimal value $\hat{\mu} = \arg\min_\mu \{\tilde{c}_{\tau_K:\tau_{K+1}}(\mu)\}$. The minimum value of the functional cost is simply the cost: $\min_\mu \{\tilde{R}_m(\mu)\} = R_m$.

The functional cost is a more complex representation of a segmentation than the cost; however, it leads to some great simplifications. Namely, the functional cost is point additive, i.e. if we consider a segmentation $m$ of $M^K_{1:t}$ with change-points $\tau_1, \tau_2, \dots, \tau_K$ and the segmentation $m'$ of $M^K_{1:t+1}$ with the same change-points, we have:

$$\tilde{R}_{m'}(\mu) = \tilde{R}_m(\mu) + \gamma(Y_t, \mu).$$

In the next sub-section (3.3) we explain informally how this functional formulation and the point additivity make it theoretically possible to prune the set of segmentations.

3.3. Functional cost and pruning

The pDPA searches for:

$$\widetilde{Cost}^K_{1:n+1}(\mu) = \min_{m \in M^K_{1:n+1}} \{\tilde{R}_m(\mu)\}, \qquad (5)$$

for $K = 1$ to $K_{max}$. This is the functional formulation of equation (2). Like the cost (see update rule (3)), the functional cost is segment additive and a similar update rule applies:

$$\widetilde{Cost}^K_{1:t}(\mu) = \min_{\tau < t} \{ Cost^{K-1}_{1:\tau} + \tilde{c}_{\tau:t}(\mu) \}. \qquad (6)$$

Thanks to the point additivity of $\tilde{R}_m(\mu)$, it is possible to further simplify this update rule. If we consider a given value of $\mu$, two last change-points $\tau$ and $\tau'$ and a time $t' > t$, we get the following implication:

$$Cost^{K-1}_{1:\tau} + \tilde{c}_{\tau:t}(\mu) \le Cost^{K-1}_{1:\tau'} + \tilde{c}_{\tau':t}(\mu) \implies Cost^{K-1}_{1:\tau} + \tilde{c}_{\tau:t'}(\mu) \le Cost^{K-1}_{1:\tau'} + \tilde{c}_{\tau':t'}(\mu). \qquad (7)$$

In other words, if for a given $t$ and $\mu$ the functional cost of last change-point $\tau$ is less than the functional cost of $\tau'$, then this will still be the case for any time point in the future (i.e. for any $t' > t$). Hence for the parameter value $\mu$ we can discard $\tau'$ as it will never be minimal. If we consider only one value of $\mu$, this pruning rule can be formalised as the following update rule:

$$\widetilde{Cost}^K_{1:t+1}(\mu) = \min \left\{ \widetilde{Cost}^K_{1:t}(\mu) + \gamma(Y_t, \mu), \; Cost^{K-1}_{1:t} + \gamma(Y_t, \mu) \right\}, \qquad (8)$$

and this update rule can be performed with an $O(1)$ time complexity: we only need to compare the values of $\widetilde{Cost}^K_{1:t}(\mu) + \gamma(Y_t, \mu)$ and $Cost^{K-1}_{1:t} + \gamma(Y_t, \mu)$.

3.4. Implementing update rule (8) for all possible $\mu$

Update rule (8) cannot be implemented as is, because the rule would have to be applied to all possible values of $\mu$ and in most cases the set of all $\mu$ is uncountably infinite. A key idea to cope with this problem is to consider the set of $\mu$ for which a particular last change-point $\tau$ is better than any other change-point $\tau'$, in the sense that $Cost^{K-1}_{1:\tau} + \tilde{c}_{\tau:t}(\mu)$ is smaller than $Cost^{K-1}_{1:\tau'} + \tilde{c}_{\tau':t}(\mu)$. We will call this set $S^K_{1:t,\tau}$. We will properly define and study the properties of $S^K_{1:t,\tau}$ in section 4. For now, let us have a look at these sets on a small example. We will consider the segmentations with one change-point ($K = 1$) of a four-point signal with the quadratic loss: $\gamma(y_i, \mu) = (y_i - \mu)^2$. Our observations are $y_1 = 0$, $y_2 = 0.5$, $y_3 = 0.4$, $y_4 = -0.5$ (see Figure 1).

Figure 1. Four-point signal: $y_i$ as a function of $i$, with $y_1 = 0$, $y_2 = 0.5$, $y_3 = 0.4$, $y_4 = -0.5$.

Functional cost of $Y_{1:3}$. Let us first consider the first two data-points: $Y_{1:3}$. The first segmentation we need to consider is the one with a change-point at $\tau = 2$. The functional cost of this segmentation is simply the cost of its first segment, $r_{1:2}$, which is 0, plus the functional cost of the last segment, which is the polynomial function $\mu \mapsto (y_2 - \mu)^2$. We should also consider a segmentation with a change-point at $\tau = 3$. Indeed, for $t = 2$ its functional cost is well defined: it is the cost of its first segment, $r_{1:3}$, which is 0.125, plus the functional cost of its last segment, $r_{3:3}$. $r_{3:3}$ is empty and so its functional cost is simply the zero function: $\mu \mapsto 0$. These two functional costs are represented in Figure 2 (left) as a solid red line for $\tau = 2$ and a dashed orange line for $\tau = 3$. Simple calculations show that the set of $\mu$ for which $\tau = 2$ is best, $S^1_{1:3,2}$, is an interval centered on 0.5: $[0.146, 0.854]$. $S^1_{1:3,3}$ is a union of two intervals: $(-\infty, 0.146] \cup [0.854, +\infty)$. These intervals are also given in Figure 2 (right):

$\tau = 2$: functional cost $0.25 - \mu + \mu^2$, $S^1_{1:3,2} = [0.146, 0.854]$
$\tau = 3$: functional cost $0.125$, $S^1_{1:3,3} = (-\infty, 0.146] \cup [0.854, +\infty)$

Figure 2. Functional cost of $Y_{1:3}$ for $K = 1$ using the quadratic loss. (Left) Functional cost as a function of $\mu$ of segmentations having a change-point at $\tau = 2$ (solid red) and $\tau = 3$ (orange dashed). (Right) Analytical expression of the functional costs for $\tau = 2$ and $\tau = 3$ and the sets of $\mu$ for which they are optimal: $S^1_{1:3,\tau}$.

Functional cost of $Y_{1:4}$. Now consider the first three data-points: $Y_{1:4}$. We have to update the functional costs of the segmentations having a change-point at $\tau = 2$ and $\tau = 3$.
We do this by adding the function $\mu \mapsto (y_3 - \mu)^2$ to both of them (point additivity). We also need to consider another possible change-point: $\tau = 4$. Its functional cost is simply the cost of segment $r_{1:4}$, which is 0.14, plus the functional cost of segment $r_{4:4}$, which is the zero function: $\mu \mapsto 0$. These three functional costs are represented in Figure 3 (left) as a solid red line for $\tau = 2$, a dashed orange line for $\tau = 3$ and a blue dotted line for $\tau = 4$. Simple calculations show that the set of $\mu$ for which $\tau = 2$ is best, $S^1_{1:4,2}$, is an interval centered on 0.45: $[0.190, 0.709]$. $S^1_{1:4,4}$ is a union of two intervals: $(-\infty, 0.190] \cup [0.709, +\infty)$. $S^1_{1:4,3}$ is, strikingly, empty.

Based on equation (7), this means that irrespective of the last data-point it is certain that the segmentation with a last change-point at $\tau = 3$ will always have a greater cost than those with a change-point at 2 or 4. Thus for all possible values of $\mu$ we can discard $\tau = 3$. Note however that neither $\tau = 2$ nor $\tau = 4$ is better than $\tau = 3$ for all $\mu$. This simple example illustrates that by keeping track of the $S^K_{1:t,\tau}$ we might be able to prune some candidate change-points very early in the process. We will make this idea explicit in the next sections (sections 4 and 5).

$\tau = 2$: functional cost $0.41 - 1.8\mu + 2\mu^2$, $S^1_{1:4,2} = [0.190, 0.709]$
$\tau = 3$: functional cost $0.285 - 0.8\mu + \mu^2$, $S^1_{1:4,3} = \emptyset$
$\tau = 4$: functional cost $0.14$, $S^1_{1:4,4} = (-\infty, 0.190] \cup [0.709, +\infty)$

Figure 3. Functional cost of $Y_{1:4}$ for $K = 1$ using the quadratic loss. (Left) Functional cost of segmentations having a change-point at $\tau = 2$ (solid red), $\tau = 3$ (orange dashed) and $\tau = 4$ (blue dotted). (Right) Analytical expression of the functional costs for $\tau = 2, 3$ and $4$ and the sets of $\mu$ for which they are optimal: $S^1_{1:4,\tau}$.
3.5. Segment neighbourhood, optimal partitioning and PELT

Since 2010 another pruned algorithm, PELT, has been proposed for change-point detection problems by Killick et al. (2012). The pDPA and PELT are different in nature. First, they do not solve the same problem. PELT is an extension of the optimal partitioning algorithm (Jackson et al., 2005) rather than of the segment neighbourhood algorithm. More precisely, if we call $M_{1:n+1}$ the set of all segmentations, the optimal partitioning problem is to find:

$$OPCost_{1:n+1} = \min_{m \in M_{1:n+1}} \{ R_m + \lambda |m| \}, \qquad (9)$$

where $R_m$ is defined in equation (1), $\lambda$ is a user defined scalar and $|m|$ is the number of change-points of segmentation $m$. Intuitively, the number of change-points of the recovered solution decreases with $\lambda$. If the penalty $\lambda$ is known in advance, solving the optimal partitioning problem is faster than solving the segment neighbourhood problem. The advantage of solving the segment neighbourhood problem is that it gives optimal segmentations for a range of numbers of change-points.

Second, PELT and the pDPA do not prune segmentations in the same way. PELT prunes segmentations based on their cost and the pDPA based on their functional cost. In a recent pre-print, Maidstone et al. (2014) called these two ways of pruning respectively inequality based pruning (IP) and functional based pruning (FP) and studied their relationship. The benefit of IP over FP is that it is more widely applicable, i.e. the assumptions on the segment cost needed to apply FP imply the assumptions needed to apply IP. Furthermore, the implementation of IP is usually straightforward, which is not the case for FP. The advantage of FP over IP is that it can prune the set of change-points efficiently even if there are only a few change-points in the sequence.
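For comparison, equation (9) can be solved by the classic optimal partitioning recursion $F(t) = \min_{0 \le i < t} \{F(i) + c_{i:t} + \lambda\, 1\{i > 0\}\}$; PELT accelerates this recursion by inequality based pruning. A minimal un-pruned sketch for the quadratic loss (illustrative, not the PELT implementation):

```python
def optimal_partitioning(y, lam):
    """Solve min_m { R_m + lam * |m| } (equation (9)) for the quadratic loss.

    F[t] is the optimal penalised cost of y[:t]; each split at i > 0 creates
    one change-point and pays the penalty lam.  O(n^2) without pruning."""
    n = len(y)
    ps, ps2 = [0.0], [0.0]                     # prefix sums of y and y^2
    for v in y:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def seg(i, j):                             # quadratic-loss cost of y[i:j]
        s = ps[j] - ps[i]
        return ps2[j] - ps2[i] - s * s / (j - i)

    F = [0.0] * (n + 1)
    for t in range(1, n + 1):
        F[t] = min(F[i] + seg(i, t) + (lam if i > 0 else 0.0)
                   for i in range(t))
    return F[n]

# four-point signal of section 3.4: with lam = 0.1 the best trade-off uses
# two change-points (raw cost 0.005 plus penalty 2 * 0.1)
print(optimal_partitioning([0, 0.5, 0.4, -0.5], 0.1))
```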
We prove this advantage of FP for a special loss function and under some restrictive conditions in sub-section 5.2. We empirically confirm this result for the quadratic loss in sub-section 5.3. Furthermore, Maidstone et al. (2014) proved in their Theorem 6.1, for the optimal partitioning and segment neighbourhood problems, that any candidate change-point pruned by IP will also be pruned by FP, at worst at the same time. In other words, FP cannot prune less than IP.

4. Functional pruning of the set of segmentations

In this section we first describe the key quantities and properties used by the pruned DPA algorithm. Then we describe the algorithm.

4.1. Key quantities and their properties

The four main quantities of interest in the pDPA are given below.

1. The optimal functional cost with $K$ change-points:
$$\widetilde{Cost}^K_{1:n+1}(\mu) = \min_{m \in M^K_{1:n+1}} \{\tilde{R}_m(\mu)\},$$
where the functional cost of a segmentation, $\tilde{R}_m(\mu)$, is defined in equation (4).

2. The optimal functional cost with $K$ change-points if the last change-point is $\tau \le t$:
$$\tilde{C}^K_{1:t,\tau}(\mu) = Cost^{K-1}_{1:\tau} + \tilde{c}_{\tau:t}(\mu), \qquad (10)$$
where $Cost^{K-1}_{1:\tau}$ is the optimal cost with $K-1$ change-points up to $\tau$ defined in equation (2) and $\tilde{c}_{\tau:t}(\mu)$ is defined in equation (4).

3. The set of $\mu$ for which the last change-point $\tau$ is optimal:
$$S^K_{1:t,\tau} = \{ \mu \mid \tilde{C}^K_{1:t,\tau}(\mu) = \widetilde{Cost}^K_{1:t}(\mu) \}.$$

4. The set of last change-points which are optimal for at least one $\mu$:
$$\tau^K_{1:t} = \{ \tau < t \mid S^K_{1:t,\tau} \neq \emptyset \}.$$

Below are the properties of these quantities. The proofs of these properties are given afterwards.

Property 1. The optimal functional cost may be computed using only the (pruned) set of change-points which are optimal for at least one $\mu$:
$$\widetilde{Cost}^K_{1:t}(\mu) = \min_{\tau < t} \{\tilde{C}^K_{1:t,\tau}(\mu)\} = \min_{\tau \in \tau^K_{1:t}} \{\tilde{C}^K_{1:t,\tau}(\mu)\}.$$

Property 2. The functional cost with a last change-point $\tau$ is easy to update using point additivity:
$$\tilde{C}^K_{1:t+1,\tau}(\mu) = \tilde{C}^K_{1:t,\tau}(\mu) + \gamma(Y_t, \mu).$$

Property 3. The set of $\mu$ for which a change-point $\tau$ is optimal decreases with $t$ (w.r.t. set inclusion) and can be pruned:
$$\forall \tau < t, \quad S^K_{1:t+1,\tau} = S^K_{1:t,\tau} \cap \{ \mu \mid \tilde{C}^K_{1:t,\tau}(\mu) \le Cost^{K-1}_{1:t} \}, \qquad (11)$$
$$S^K_{1:t+1,t} = \bigcap_{\tau \in \tau^K_{1:t}} \{ \mu \mid Cost^{K-1}_{1:t} \le \tilde{C}^K_{1:t,\tau}(\mu) \}, \qquad (12)$$
$$S^K_{1:t,\tau} = \emptyset \implies \forall t' \ge t, \; S^K_{1:t',\tau} = \emptyset. \qquad (13)$$

Property 4. The set of candidate change-points can be pruned and computed recursively:
$$\tau^K_{1:t+1} = \{ \tau \in (\tau^K_{1:t} \cup \{t\}) \mid S^K_{1:t+1,\tau} \neq \emptyset \}.$$

Property 1 is very similar to the update rule of the segment neighbourhood algorithm (equation (3)). The main difference is that the set of last change-points to consider is not necessarily all those before $t$, $\{\tau \mid \tau < t\}$, but rather those that still have a chance to be optimal at this stage: $\tau^K_{1:t}$. At worst this $\tau^K_{1:t}$ is equal to $\{\tau \mid \tau < t\}$, but based on property 4 this set could be smaller. Indeed, any change-point that has been discarded from $\tau^K_{1:t}$ will never be included again. (Depending on the properties of $\gamma$, it might be possible to consider only the set of last change-points $\tau$ for which $S^K_{1:t,\tau}$ is not restricted to a finite set of $\mu$. This is the case for the quadratic loss, which is continuous.)

The four previous properties are in fact simple consequences of the point additivity and segment additivity of the functional cost. However, given their combinatorial nature, they might look a bit tedious. To clarify these properties we provide detailed proofs in the following paragraphs.

Proof of property 1. As with the cost in the segment neighbourhood algorithm, the functional cost of a segmentation is segment additive, thus the Bellman optimality principle holds and we recover the first part of property 1. The second part follows by definition of $\tau^K_{1:t}$.
Indeed, if $\tau$ is the optimal last change-point then there must exist a $\mu$ for which its functional cost is equal to $\widetilde{Cost}^K_{1:t}(\mu)$, and thus $\tau$ is in $\tau^K_{1:t}$.

Proof of property 2. By definition of $\tilde{C}^K_{1:t,\tau}(\mu)$ (see equation (10)) we have:
$$\tilde{C}^K_{1:t+1,\tau}(\mu) = Cost^{K-1}_{1:\tau} + \sum_{i=\tau}^{t} \gamma(Y_i, \mu) = Cost^{K-1}_{1:\tau} + \sum_{i=\tau}^{t-1} \gamma(Y_i, \mu) + \gamma(Y_t, \mu) = \tilde{C}^K_{1:t,\tau}(\mu) + \gamma(Y_t, \mu).$$

Proof of property 3. Using property 1 and by definition of $S^K_{1:t+1,\tau'}$, we get that for $\tau' < t+1$:
$$S^K_{1:t+1,\tau'} = \Big\{ \mu \mid \tilde{C}^K_{1:t+1,\tau'}(\mu) \le \min_{\tau < t+1} \{\tilde{C}^K_{1:t+1,\tau}(\mu)\} \Big\} = \bigcap_{\tau < t+1} \{ \mu \mid \tilde{C}^K_{1:t+1,\tau'}(\mu) \le \tilde{C}^K_{1:t+1,\tau}(\mu) \}.$$
Now, using property 2 we also have for $\tau < t+1$:
$$\{ \mu \mid \tilde{C}^K_{1:t+1,\tau'}(\mu) \le \tilde{C}^K_{1:t+1,\tau}(\mu) \} = \{ \mu \mid \tilde{C}^K_{1:t,\tau'}(\mu) \le \tilde{C}^K_{1:t,\tau}(\mu) \}.$$
Finally, by definition we also have $\tilde{C}^K_{1:t,t}(\mu) = Cost^{K-1}_{1:t}$. Combining these three facts, for $\tau' < t$ we get:
$$S^K_{1:t+1,\tau'} = \bigcap_{\tau < t+1} \{ \mu \mid \tilde{C}^K_{1:t,\tau'}(\mu) \le \tilde{C}^K_{1:t,\tau}(\mu) \} = S^K_{1:t,\tau'} \cap \{ \mu \mid \tilde{C}^K_{1:t,\tau'}(\mu) \le \tilde{C}^K_{1:t,t}(\mu) \} = S^K_{1:t,\tau'} \cap \{ \mu \mid \tilde{C}^K_{1:t,\tau'}(\mu) \le Cost^{K-1}_{1:t} \},$$
and we recover equation (11) of property 3. We recover equation (12) of property 3 by taking $\tau' = t$:
$$S^K_{1:t+1,t} = \bigcap_{\tau < t} \{ \mu \mid Cost^{K-1}_{1:t} \le \tilde{C}^K_{1:t,\tau}(\mu) \} = \Big\{ \mu \mid Cost^{K-1}_{1:t} \le \min_{\tau < t} \{\tilde{C}^K_{1:t,\tau}(\mu)\} \Big\} = \Big\{ \mu \mid Cost^{K-1}_{1:t} \le \min_{\tau \in \tau^K_{1:t}} \{\tilde{C}^K_{1:t,\tau}(\mu)\} \Big\} = \bigcap_{\tau \in \tau^K_{1:t}} \{ \mu \mid Cost^{K-1}_{1:t} \le \tilde{C}^K_{1:t,\tau}(\mu) \}.$$
Finally, if $S^K_{1:t,\tau} = \emptyset$, using equation (11) we have that $S^K_{1:t+1,\tau} = S^K_{1:t,\tau} \cap \{ \mu \mid \tilde{C}^K_{1:t,\tau}(\mu) \le Cost^{K-1}_{1:t} \} = \emptyset$, and by induction we recover equation (13) of property 3.

Proof of property 4. By definition of $\tau^K_{1:t+1}$ we have $\tau^K_{1:t+1} \supseteq \{ \tau \in (\tau^K_{1:t} \cup \{t\}) \mid S^K_{1:t+1,\tau} \neq \emptyset \}$. Now suppose $\tau$ is in $\tau^K_{1:t+1}$.
If $\tau < t$, then using equation (13) of property 3 we see that $\tau$ must also be in $\tau^K_{1:t}$. If $\tau = t$, then by definition of $\tau^K_{1:t+1}$ we have that $S^K_{1:t+1,t}$ is not empty. Thus we recover: $\tau^K_{1:t+1} \subseteq \{ \tau \in (\tau^K_{1:t} \cup \{t\}) \mid S^K_{1:t+1,\tau} \neq \emptyset \}$.

4.2. The pDPA algorithm

The pruned dynamic programming algorithm uses the four properties described in subsection 4.1 to update the optimal functional cost, the set of optimal last change-points and the sets of $\mu$ for which these last change-points are optimal. For every $t \le n$ and $K \le K_{max}$ the algorithm:

1. updates for each $\tau$ in $\tau^K_{1:t}$ the set of $\mu$ for which it is optimal (equation (11));
2. initialises the set of values of $\mu$ for which a change-point at $t$ is optimal (equation (12));
3. updates the set of possible last change-points (property 4);
4. updates the functional cost of the possible last change-points (property 2);
5. computes the minimal cost with $K$ change-points at $t$ (property 1).
More formally, the algorithm can be described as follows.

Data: a sequence $Y_{1:n+1}$.
Result: two matrices $D_{K,t}$ and $I_{K,t}$ of size $(n+1) \times K_{max}$ containing the optimal costs and optimal last change-points.

for $K = 1$ to $K_{max}$ do
    $\tau^K_{1:K+1} \leftarrow \{K\}$
    $\tilde{C}^K_{1:K+1,K}(\mu) \leftarrow Cost^{K-1}_{1:K} + \gamma(Y_K, \mu)$
    for $t = K+1$ to $n$ do
        for $\tau \in \tau^K_{1:t}$ do
            $S^K_{1:t+1,\tau} \leftarrow S^K_{1:t,\tau} \cap \{ \mu \mid \tilde{C}^K_{1:t,\tau}(\mu) \le Cost^{K-1}_{1:t} \}$
        end
        $S^K_{1:t+1,t} \leftarrow \bigcap_{\tau \in \tau^K_{1:t}} \{ \mu \mid Cost^{K-1}_{1:t} \le \tilde{C}^K_{1:t,\tau}(\mu) \}$
        $\tau^K_{1:t+1} \leftarrow \{ \tau \in (\tau^K_{1:t} \cup \{t\}) \mid S^K_{1:t+1,\tau} \neq \emptyset \}$
        for $\tau \in \tau^K_{1:t+1}$ do
            $\tilde{C}^K_{1:t+1,\tau}(\mu) \leftarrow \tilde{C}^K_{1:t,\tau}(\mu) + \gamma(Y_t, \mu)$
        end
        $D_{K,t+1} \leftarrow \min_{\tau \in \tau^K_{1:t+1}} \{ \min_\mu \{\tilde{C}^K_{1:t+1,\tau}(\mu)\} \}$
        $I_{K,t+1} \leftarrow \arg\min_{\tau \in \tau^K_{1:t+1}} \{ \min_\mu \{\tilde{C}^K_{1:t+1,\tau}(\mu)\} \}$
    end
end
Algorithm 1: Pruned DPA algorithm

Importantly, $\widetilde{Cost}^K_{1:t}(\mu)$, $S^K_{1:t,\tau}$ and $\tau^K_{1:t}$ are used only at steps $(K, t)$ and $(K, t+1)$. So they need not be stored: they can be discarded or overwritten immediately.

Implementing the pruned DPA. The pruned DPA critically relies on the possibility of easily updating the functional cost ($\mu \mapsto \tilde{C}^K_{1:t,\tau}(\mu)$) and the set of $\mu$ for which one particular last change-point is optimal ($S^K_{1:t,\tau}$). If this is not the case, then the algorithm, and the fact that the set of last change-points can be functionally pruned, is purely theoretical. Representing and updating the functional cost is typically easy if there exists a simple analytical decomposition of the functional cost. For example, if we consider the quadratic loss, the functional cost is a second degree polynomial function and it can be represented by the three coefficients of this polynomial.
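For the quadratic loss, Algorithm 1 can be implemented by storing, for each candidate $\tau$, the three coefficients of its parabola $\tilde{C}^K_{1:t,\tau}(\mu) = a + b\mu + c\mu^2$ and $S^K_{1:t,\tau}$ as a list of intervals. The following is a proof-of-concept sketch of ours (not the author's cghseg implementation); it returns only the cost matrix $D$:

```python
import math

def pdpa_quadratic(y, kmax):
    """Pruned DPA (Algorithm 1) for the quadratic loss, unidimensional mu.

    Returns cost, where cost[k][t] is the minimal cost of segmenting y[:t]
    with exactly k change-points.  Candidates tau are 0-based indices of the
    first data-point of the last segment; each carries the coefficients
    (a, b, c) of a + b*mu + c*mu^2 and its set S as a list of intervals."""
    INF = float("inf")
    n = len(y)

    def leq(a, b, c, C):
        # {mu | a + b*mu + c*mu^2 <= C}: an interval, or None (c >= 1 here)
        disc = b * b - 4 * c * (a - C)
        if disc <= 0:
            return None
        r = math.sqrt(disc)
        return ((-b - r) / (2 * c), (-b + r) / (2 * c))

    def complement(ivals):
        # complement in R of a union of intervals (used for equation (12))
        out, cur = [], -math.inf
        for lo, hi in sorted(i for i in ivals if i is not None):
            if lo > cur:
                out.append((cur, lo))
            cur = max(cur, hi)
        if cur < math.inf:
            out.append((cur, math.inf))
        return out

    cost = [[INF] * (n + 1) for _ in range(kmax + 1)]
    cost[0][0], s, s2 = 0.0, 0.0, 0.0
    for t in range(1, n + 1):                 # one segment: closed form
        s, s2 = s + y[t - 1], s2 + y[t - 1] ** 2
        cost[0][t] = s2 - s * s / t
    for k in range(1, kmax + 1):
        # first admissible candidate: last segment starts at tau = k
        cands = {k: [cost[k - 1][k] + y[k] ** 2, -2 * y[k], 1.0,
                     [(-math.inf, math.inf)]]}
        cost[k][k + 1] = cost[k - 1][k]
        for t in range(k + 1, n):
            C = cost[k - 1][t]
            ivals = []
            for poly in cands.values():       # equation (11)
                a, b, c, S = poly
                I = leq(a, b, c, C)
                ivals.append(I)
                poly[3] = ([] if I is None else
                           [(max(l, I[0]), min(h, I[1]))
                            for (l, h) in S if max(l, I[0]) < min(h, I[1])])
            newS = complement(ivals)          # equation (12)
            cands = {tau: p for tau, p in cands.items() if p[3]}  # property 4
            if newS:
                cands[t] = [C, 0.0, 0.0, newS]
            for poly in cands.values():       # property 2: add gamma(Y_t, .)
                poly[0] += y[t] ** 2
                poly[1] -= 2 * y[t]
                poly[2] += 1
            cost[k][t + 1] = min(a - b * b / (4 * c)   # property 1
                                 for a, b, c, _ in cands.values())
    return cost
```

On the four-point example of section 3.4 this reproduces the costs 0.14 (one change-point) and 0.005 (two change-points), and the candidate $\tau = 3$ is indeed pruned at $t = 3$.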
Other loss functions can be represented in this way; in fact the pruned DPA has been implemented for other losses since its first pre-print (namely the Poisson and negative binomial log-likelihood losses, Cleynen et al. (2014b)).

Representing the sets $S^K_{1:t,\tau}$ is a priori more difficult. Their representation depends on the dimensionality of $\mu$ and on the characteristics of the loss function $\gamma$. In the next section we study theoretically and empirically the expected and worst case complexity of functional pruning in the simple case where $\mu$ is a unidimensional parameter in $\mathbb{R}$ and all functions $\mu \to \sum_i \gamma(Y_i, \mu)$ are unimodal. Here by unimodal we mean functions $f$ such that for any $c$ in $\mathbb{R}$ the set $\{x \mid f(x) \leq c\}$ is an interval. In that case it is simple to keep track of the sets $\{\mu \mid \tilde{C}^K_{1:t,\tau}(\mu) \leq Cost^{K-1}_{1:t}\}$ and $S^K_{1:t,\tau}$, as they are respectively intervals and unions of intervals. Functional pruning is theoretically possible even if these conditions do not hold, and in particular if $\mu$ is multidimensional. However, the implementation and the theoretical study of functional pruning in a multidimensional setting are not straightforward and are outside the scope of this paper.

5. Complexity of the algorithm for a unidimensional $\mu$

In this section we study the complexity of the pruned DPA for $\mu$ in $\mathbb{R}$. We also assume that updating the functional cost is in $O(1)$, which is the case if there is a simple analytical decomposition of the functional cost. Finally, we assume that any function $\mu \to \sum_i \gamma(Y_i, \mu)$ is a unimodal function of $\mu$. This last assumption is true if $\gamma$ is convex, which is often the case if we take $\gamma$ to be minus the log-likelihood. Under these conditions all $S^K_{1:t,\tau}$ are unions of intervals². A key question then is precisely how many intervals we need.
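For a unidimensional $\mu$, the bookkeeping above reduces to two operations on unions of intervals. A small illustrative sketch (the helper names are mine): intersecting a set $S$ with a level set $\{\mu \mid \tilde{C}(\mu) \leq c\}$, which is the update of equation (11), and removing a level set, which is one of the complements intersected in equation (12).

```python
# Illustrative helpers for the interval bookkeeping used when mu is
# unidimensional: each set S is a union of disjoint intervals, and the level
# set {mu | C(mu) <= c} of a unimodal cost is a single interval.
def intersect_union(union, interval):
    """Equation (11) step: intersect a union of disjoint intervals
    with one interval."""
    lo, hi = interval
    clipped = [(max(a, lo), min(b, hi)) for a, b in union]
    return [(a, b) for a, b in clipped if a < b]

def subtract_interval(union, interval):
    """Equation (12) step: intersect a union with the complement of one
    interval, i.e. remove the interval from the union."""
    lo, hi = interval
    out = []
    for a, b in union:
        if a < lo:
            out.append((a, min(b, lo)))
        if b > hi:
            out.append((max(a, hi), b))
    return out
```

Note that `intersect_union` never increases the number of stored intervals, while `subtract_interval` can split one interval into two, increasing the count by at most one per update; this is consistent with the bound of at most $2t - 1$ intervals proved in theorem 8.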
In this section, we propose a bound on this number and use it to bound the worst case complexity of the pDPA, showing that it is at worst as efficient as the segment neighbourhood algorithm (subsection 5.1). Then we theoretically study the average complexity of the pDPA for a particular loss function on a random sequence without change-points (subsection 5.2). Finally, we empirically study the complexity of the algorithm for the quadratic loss (subsection 5.3).

² Indeed, if all $\mu \to \sum_i \gamma(Y_i, \mu)$ are unimodal then all $\{\mu \mid \tilde{C}^K_{1:t,\tau}(\mu) \leq Cost^{K-1}_{1:t}\}$ are intervals, as $Cost^{K-1}_{1:t}$ is a constant in $\mathbb{R}$ and all $\tilde{C}^K_{1:t,\tau}(\mu)$ are unimodal. Then, using property 3, we get by induction that all $S^K_{1:t,\tau}$ are unions of intervals.

5.1. Worst case complexity of the pDPA

In this section we prove that the pDPA is at worst as efficient as the segment neighbourhood algorithm. More precisely we have the following proposition.

Proposition 5. If all $\mu \to \sum_j \gamma(Y_j, \mu)$ are unimodal in $\mu$, and if both minimising $\tilde{C}^K_{1:t,\tau}(\mu)$ and finding the roots of $\tilde{C}^K_{1:t,\tau}(\mu) = Cost^{K-1}_{1:t}$ are in $O(1)$, then the pDPA is at worst in $O(K_{max} n^2)$ time and $O(K_{max} n)$ space.

Proof. The key quantity to control is the number of intervals needed to represent the $S^K_{1:t,\tau}$. For a given $K$, at step $t$ the number of candidate last change-points is obviously bounded by $t$. If all $\mu \to \sum_{j=\tau+1}^{t+1} \gamma(Y_j, \mu)$ are unimodal, then using theorem 8 (proved in appendix A) the total number of intervals is bounded by $2t - 1$. Thus at each step there are at most $t$ last change-points and $2t - 1$ intervals to update. Summing these bounds from $1$ to $n$ and over every possible $K$, we retrieve an $O(K_{max} n^2)$ worst case time complexity.
As for the worst case space complexity, we need to store two $(n+1) \times K_{max}$ matrices ($D_{K,t}$ and $I_{K,t}$), and at each step there are at most $t$ candidates and $2t - 1$ intervals. This gives an $O(K_{max} n)$ space complexity.

The previous theorem provides a worst case bound on the complexity. One can wonder in which cases this quadratic complexity in $n$ is reached. Consider the sequence with $Y_i = i$ for all $i$, segmentations with one change-point, and the quadratic loss. The best segmentation of $Y_{1:t}$ has a change-point at $t/2$. So at step $t$ of the pDPA the change-point $\tau = t/2$ has not been pruned, and in fact none of the change-points from $t/2$ to $t$ have been pruned. Hence at each step we have at least $t/2$ candidates to update, and summing from $t = 1$ to $n$ we recover a quadratic complexity: $O(n^2)$. However, as illustrated with an example in subsection 3.4, and as demonstrated for a specific loss function in the following subsection (5.2), one can hope that in many cases the pruning will be more efficient than that.

5.2. Average pruning of the pDPA with a special loss function and $K_{max} = 1$

In this subsection, we prove for a particular loss function and $K_{max} = 1$ that on average we get an efficient pruning with the functional cost representation, even when we consider a random sequence without true change-points. As explained in the introduction, this result is not intuitive. In subsection 5.3 we empirically show that this is also the case for the quadratic loss function.

Here, we use the negative log-likelihood loss of a continuous uniform distribution on $[0, \mu]$. This loss function is:
$$\gamma(Y_i, \mu) = \begin{cases} \log(\mu) & \text{if } 0 \leq Y_i \leq \mu \\ \infty & \text{otherwise} \end{cases}$$
For this particular loss function³ and for $K_{max} = 1$, it is possible to bound the average number of candidate last change-points, $E(|\tau^K_{1:t}|)$.

Property 6.
For the negative log-likelihood loss, $K_{max} = 1$, and $Y_{1:n+1}$ made of $n$ independent and identically distributed random variables with density $f$ and continuous distribution $F$, we have $E(|\tau^1_{1:n}|) = O(\log(n))$, and the average time complexity of the pDPA is in $O(n \log(n))$.

Proof. The proof of $E(|\tau^1_{1:t}|) = O(\log(t))$ is given in appendix B. We obtain this result by studying the set $S^1_{1:n,\tau}$. More precisely, we characterize some simple events for which $S^1_{1:n,\tau}$ is empty and compute the probability of these events. Then, by taking the expectation and summing over all possible $\tau$, we get the desired result.

For the complexity, using theorem 8 we know that the number of intervals stored by the pruned DPA is always smaller than 2 times the number of candidate change-points. Thus for $K_{max} = 1$, for every $t \leq n$ the pruned DPA updates on average $O(\log(t))$ functional costs and intervals. From this the complexity follows.

5.3. Empirical complexity of the pDPA

In this section, we empirically assess the efficiency of the pDPA on both simulated and real data in the case of the quadratic loss, $\gamma(Y_i, \mu) = (Y_i - \mu)^2$. The pDPA was implemented in C++ and was run on an Intel Core i7-3687U (2.10 GHz). The code is available in the cghseg package on the CRAN (http://cran.r-project.org/web/packages/cghseg/index.html). Here is an example code using this function:

install.packages("cghseg")
library(cghseg)
Kmax <- 40; n <- 10^5; y <- rnorm(n)
system.time(res_ <- cghseg:::segmeanCO(y, Kmax))

³ Note that any $\mu \to \sum_i \gamma(Y_i, \mu)$ is unimodal.

5.3.1. Synthetic data

We simulated a number of sequences with a constant, a sinusoid or a rectangular wave signal. We considered an additional Gaussian noise of variance 1, a uniform noise and a Cauchy noise.
We also considered the worst case scenario for the pDPA, which, as explained at the end of subsection 5.1, is achieved for $y_i = i$. We compared the pDPA with the segment neighbourhood algorithm for $n \leq 81\,920$. For all these tests, we set $K_{max} = 40$. We then ran the pDPA alone for larger sequences, $n \leq 2\,621\,440$. The R code to run the pDPA on these simulations is given in appendix C.

Figures 4-A and B show that for all simulated sequences (coloured dotted and dashed lines) except the worst case scenario (black dashed and crossed line) the pDPA is faster than segment neighbourhood (black solid line). It took roughly 150 seconds for the segment neighbourhood algorithm to process a sequence of 81,920 data-points. In the worst case scenario the pDPA was able to process 40,960 data-points in the same amount of time, and for the other simulated signals the pDPA was able to process more than 2.6 million data-points in the same amount of time. The runtime of the pDPA depends on the nature of the noise; for example, it is faster for a Cauchy noise (green dotted lines in figures 4-A and B, and green box-plots in figure 4-C).

5.3.2. Real data

We downloaded the publicly available GSE17359 project from Gene Expression Omnibus (GEO: http://www.ncbi.nlm.nih.gov/geo/). This data set is made of SNP (Single Nucleotide Polymorphism) array experiments. SNP arrays enable the study of DNA copy number gains and losses along the genome. For this kind of sequence, a multiple change-point model based on the quadratic loss is often used (Picard et al., 2005). This model has been shown to reach state of the art performance (Lai et al., 2005; Hocking et al., 2013).

Each SNP array experiment can be viewed as a sequence of 1.8 million data-points. We assessed the runtime of the pDPA to process these sequences from data-point 1 to $t$ for various values of $t$, again with $K_{max} = 40$. The R code is given in appendix D. Runtimes are reported in Figure 4-D.
The runtime to process the 1.8 million data-points was on average 28 seconds and was always smaller than 33 seconds. Note that the runtime performances of the pDPA on such SNP array datasets have been confirmed by Hocking et al. (2013).

FIGURE 4. Runtimes as a function of $n$ in log-scale. A: Mean runtimes in seconds of the segment neighbourhood algorithm (black solid line) and pDPA (dotted, dashed, and dash-dotted lines) for sequences of fewer than 81,920 data-points. The black dashed and crossed line is the runtime of the pDPA in the worst case scenario. Coloured dashed and dotted lines correspond to sequences simulated with or without sine or block waves plus an additional normal (red), Cauchy (green) or uniform (blue) noise (see appendix C). B: Same as A for sequences of fewer than 2.6 million data-points. C: Boxplot of runtimes (in seconds) of the pDPA on simulated sequences of size $n = 2\,621\,440$. D: Runtimes (in seconds) of the pDPA on the sequences of the GSE17359 dataset, from data-point 1 to $t$.

Number of intervals. We also directly assessed the efficiency of the pruning by counting the number of intervals stored by the pDPA at every time step of the algorithm. Figures 5-A, B and C show, for sequences with a constant or sine wave signal plus an additional Gaussian noise and for sequences of the GSE17359 project, the limited number of intervals stored by the pDPA. Indeed, for all these sequences we observed fewer than 50 intervals. Without pruning we would expect at least $n = 1.8$ million and at worst $2n - 1$ intervals (see subsection 5.1).

FIGURE 5. Maximum number of intervals stored by the pDPA at each point of the sequence for $K = 1$. A: For 100 sequences of $1.8 \times 10^6$ points simulated with a constant signal plus an additional normal noise of variance 1. B: For 100 sequences of $1.8 \times 10^6$ points simulated with a sine wave signal plus an additional normal noise of variance 1. C: For the 18 profiles of length $1.8 \times 10^6$ of the GSE17359 dataset.

6. Conclusion

This paper presented the pDPA, which recovers exactly the best segmentations with 1 to $K_{max}$ change-points of an $n$-point sequence with respect to a loss function. This algorithm is based on a functional representation of the cost of segmentations. For loss functions with a unidimensional parameter per segment, the worst case complexity is in $O(K_{max} n^2)$; thus the pDPA is at worst equivalent to the segment neighbourhood algorithm. For a special loss function it is proven that the pDPA prunes the set of candidate change-points efficiently even when there are no change-points in the signal. Finally, in the case of the quadratic loss, it was shown empirically that the pDPA has a small runtime compared to the segment neighbourhood algorithm.

Future work. Here we theoretically proved and empirically assessed that functional pruning is efficient under a restrictive set of assumptions. It would be interesting to characterise more generally the conditions under which functional pruning is efficient. In particular, functional pruning is theoretically possible for a $p$-dimensional parameter. Recently, Maidstone et al. (2014) proved that functional pruning prunes at least as much as inequality-based pruning. Based on this, it follows that if the conditions of Theorem 3.2 of Killick et al. (2012) are met, then functional pruning with optimal partitioning (Fpop) is efficient. However, whether functional pruning is on average efficient when $p$ is large and the number of change-points is fixed is, to our knowledge, an open question.

Another question is how to implement functional pruning in the multidimensional case. Indeed, this would require an efficient way to represent and update the set of parameter values for which a particular change is optimal.
To the best of our knowledge these sets do not have particularly simple properties and are not necessarily easy to handle, except maybe for some special loss functions such as the $\ell_\infty$ loss.

Acknowledgement

I would like to thank Alice Cleynen, Michel Koskas, Robert Maidstone, Toby Hocking, Paul Fearnhead, Emilie Lebarbier and Stéphane Robin for fruitful discussions that helped me to better understand the pDPA. Special thanks to Toby, Alban and Patrick.

References

Arlot, S., Celisse, A., and Harchaoui, Z. (2012). Kernel change-point detection. arXiv preprint arXiv:1202.3878.
Auger, I. E. and Lawrence, C. E. (1989). Algorithms for the optimal identification of segment neighborhoods. Bulletin of Mathematical Biology, 51(1):39–54.
Bai, J. and Perron, P. (2003). Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18(1):1–22.
Bellman, R. (1961). On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6):284.
Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33–73.
Boysen, L., Kempe, A., Liebscher, V., Munk, A., and Wittich, O. (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. The Annals of Statistics, pages 157–183.
Cleynen, A., Dudoit, S., and Robin, S. (2014a). Comparing segmentation methods for genome annotation based on RNA-seq data. Journal of Agricultural, Biological, and Environmental Statistics, 19(1):101–118.
Cleynen, A., Koskas, M., Lebarbier, E., Rigaill, G., and Robin, S. (2014b). Segmentor3IsBack: an R package for the fast and exact segmentation of seq-data. Algorithms for Molecular Biology, 9:6.
Cleynen, A. and Lebarbier, E. (2014). Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM: Probability and Statistics, 18:750–769.
Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798.
Gey, S., Lebarbier, E., et al. (2008). Using CART to detect multiple change points in the mean for large sample. SSB preprint.
Guédon, Y. (2008). Exploring the segmentation space for the assessment of multiple change-point models. hal.inria.fr preprint inria-00311634.
Guédon, Y. (2013). Exploring the latent segmentation space for the assessment of multiple change-point models. Computational Statistics, pages 1–38.
Guédon, Y., Caraglio, Y., Heuret, P., Lebarbier, E., and Meredieu, C. (2007). Analyzing growth components in trees. Journal of Theoretical Biology, 248(3):418–447.
Harchaoui, Z. and Cappé, O. (2007). Retrospective multiple change-point estimation with kernels. In IEEE Workshop on Statistical Signal Processing.
Harchaoui, Z. and Lévy-Leduc, C. (2010). Multiple change-point estimation with a total variation penalty. Journal of the American Statistical Association, 105(492).
Hawkins, D. M. (2001). Fitting multiple change-point models to data. Computational Statistics & Data Analysis, 37(3):323–341.
Hocking, T. D. (2013). Comparing least squares segmentation code. Available at http://sugiyama-www.cs.titech.ac.jp/~toby/org/HOCKING-ml-seg-compare.pdf.
Hocking, T. D., Schleiermacher, G., Janoueix-Lerosey, I., Boeva, V., Cappo, J., Delattre, O., Bach, F., and Vert, J.-P. (2013). Learning smoothing models of copy number profiles using breakpoint annotations. BMC Bioinformatics, 14(1):164.
Horváth, L. (1993). The maximum likelihood method for testing changes in the parameters of normal observations. The Annals of Statistics, pages 671–680.
Jackson, B., Scargle, J. D., Barnes, D., Arabhi, S., Alt, A., Gioumousis, P., Gwin, E., Sangtrakulcharoen, P., Tan, L., and Tsai, T. T. (2005). An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters, 12(2):105–108.
Killick, R.
and Eckley, I. A. (2011). changepoint: an R package for changepoint analysis. R package version 0.6, URL http://CRAN.R-project.org/package=changepoint.
Killick, R., Fearnhead, P., and Eckley, I. (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500):1590–1598.
Kolesnikov, A. and Fränti, P. (2003). Reduced-search dynamic programming for approximation of polygonal curves. Pattern Recognition Letters, 24(14):2243–2254.
Lai, W. R., Johnson, M. D., Kucherlapati, R., and Park, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 21(19):3763–3770.
Lavielle, M. (2005). Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501–1510.
Lavielle, M. and Moulines, E. (2000). Least-squares estimation of an unknown number of shifts in a time series. Journal of Time Series Analysis, 21(1):33–59.
Lebarbier, É. (2005). Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Processing, 85(4):717–736.
Maidstone, R., Hocking, T., Rigaill, G., and Fearnhead, P. (2014). On optimal multiple changepoint algorithms for large data. arXiv preprint arXiv:1409.1842.
Picard, F., Robin, S., Lavielle, M., Vaisse, C., and Daudin, J.-J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6(1):27.
Rigaill, G. (2010). Pruned dynamic programming for optimal multiple change-point detection. arXiv preprint arXiv:1004.0887.
Scott, A. and Knott, M. (1974). A cluster analysis method for grouping means in the analysis of variance. Biometrics, 30(3):507–512.
Yao, Y.-C. and Au, S. (1989). Least-squares estimation of a step function. Sankhyā: The Indian Journal of Statistics, Series A, pages 370–381.
Zhang, N. R. and Siegmund, D. O. (2007).
A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63(1):22–32.

Appendix A: Bound on the number of intervals

In this section, we study a special class of functions, denoted $\mathscr{B}$, and demonstrate theorem 8. A direct application of this theorem shows that if there are $t$ candidate last change-points, then the pDPA needs to store at most $2t - 1$ intervals. We used this theorem in section 5 to prove the worst-case complexity of the pDPA.

A.1. $\mathscr{B}$ functions

Definition A.1.1. Let $\mathscr{B}_n$ denote the set of all functions $B : \mathbb{R} \to \mathbb{R}$ such that
$$\forall \mu \in \mathbb{R}, \quad B(\mu) = \min_{t \in \{1 \dots n\}} \Big\{ u_{B,t} + \sum_{j=t+1}^{n+1} f_{B,j}(\mu) \Big\},$$
where all $u_{B,t}$ are real numbers, all $f_{B,j}$ are unimodal functions of $\mu$⁴, and any $\sum_j f_{B,j}(\mu)$ is also unimodal. Note that $\mathscr{B}_n \subset \mathscr{B}_{n+1}$. Let $\mathscr{B} = \bigcup_n \mathscr{B}_n$.

If the loss function $\gamma(y_i, \mu)$ and any $\sum_i \gamma(y_i, \mu)$ are unimodal in $\mu$, then $\widetilde{Cost}^K_{1:t}(\mu)$ belongs to $\mathscr{B}_{t-1}$, because we have $\widetilde{Cost}^K_{1:t}(\mu) = \min_{\tau < t} \{ \tilde{C}^K_{1:t,\tau}(\mu) \}$.

Definition A.1.2. For any $B \in \mathscr{B}_n$ and $A$ a subset of $\{1 \dots n\}$ we define the function $B_A$ as
$$\forall \mu, \quad B_A(\mu) = \min_{t \in A} \Big\{ u_{B,t} + \sum_{j=t+1}^{n+1} f_{B,j}(\mu) \Big\}.$$

Property 7. $B_A \in \mathscr{B}_{card(A)}$. This is easily shown for $A = \{1 \dots n\} \setminus \{i\}$ with $i \in \{1 \dots n\}$, and thus by induction it is true for any $A$.

Definition A.1.3. The rank of a function $B \in \mathscr{B}$ is $R(B) = \min\{ n \in \mathbb{N}^* \mid B \in \mathscr{B}_n \}$.

A.2. Decomposition in intervals and order of $B$

Definition A.2.1. Let $I$ be a partition of $\mathbb{R}$ into a finite set of intervals $I = \{I_j\}_{j \in \{1 \dots k\}}$. $I$ is a $k$-decomposition of a function $B \in \mathscr{B}$ if $\forall I_j, \exists i, \forall x \in I_j, B_{\{i\}}(x) = B(x)$. The set of all $\mathscr{B}$ functions with a $k$-decomposition is denoted $\mathscr{B}^k$. Similarly, the set of all $\mathscr{B}_n$ functions with a $k$-decomposition is denoted $\mathscr{B}^k_n$.

Definition A.2.2.
The order $O(B)$ of a $\mathscr{B}$ function is $\min\{ k \in \mathbb{N}^* \mid B \in \mathscr{B}^k \}$.

Theorem 8. For all $B \in \mathscr{B}$, we have $O(B) \leq 2 R(B) - 1$.

Proof. We demonstrate this theorem by induction. It is true if $R(B) \leq 1$. Assume it is true for any $B$ with $R(B) \leq n$, and let $B \in \mathscr{B}$ with $R(B) = n + 1$. We have:
$$\forall \mu \in \mathbb{R}, \quad B(\mu) = \min\{ B_{\{1 \dots n\}}(\mu), B_{\{n+1\}}(\mu) \} = \min\{ C(\mu), u_{B,n+1} \} + f_{B,n+2}(\mu),$$
where $C \in \mathscr{B}_n$ is $C(\mu) = \min_{t \in \{1 \dots n\}} \{ u_{B,t} + \sum_{j=t+1}^{n+1} f_{B,j}(\mu) \}$.

⁴ Here by unimodal we mean functions $f$ such that for any $c$ in $\mathbb{R}$ the set $\{x \mid f(x) \leq c\}$ is an interval.

Let $I = \bigcup_{j \in \{1 \dots k\}} I_j$ be the smallest set of intervals such that $\forall I_j \in I, \forall \mu \in I_j, u_{B,n+1} > C(\mu)$. Let $A_k$ be the subset of $\{1 \dots n\}$ defined as $\{ i \mid \exists x \in I_k, C_{\{i\}}(x) < u_{B,n+1} \}$. As $R(B) = n + 1$ and as for all $i$ in $\{1 \dots n\}$ the function $C_{\{i\}}$ is unimodal, there exists a unique $j$ such that $i \in A_j$, and therefore $\sum_{j=1}^{k} card(A_j) = n$. In each interval $I_k$ we have $C(\mu) = C_{A_k}(\mu)$. By induction, $O(C_{A_k}) \leq 2\,card(A_k) - 1$. Overall, for any $B$ with $R(B) = n + 1$, we have:
$$O(B) \leq \sum_{j=1}^{k} O(C_{A_j}) + (k+1) \leq 2 \sum_{j=1}^{k} card(A_j) + 1 \leq 2n + 1 = 2 R(B) - 1.$$

Appendix B: Lower bound on the probability that $S^1_{1:n,\tau}$ is empty

Model. We consider the negative log-likelihood loss of a continuous uniform distribution on $[0, \mu]$:
$$\gamma(y_i, \mu) = \begin{cases} \log(\mu) & \text{if } \mu \geq Y_i \\ \infty & \text{otherwise} \end{cases}$$
This loss function is unimodal and any sum of $\gamma(Y_i, \mu)$ is also unimodal. Indeed, for any $t \geq \tau \geq 1$ we have:
$$\sum_{i=\tau+1}^{t} \gamma(Y_i, \mu) = \begin{cases} (t - \tau)\log(\mu) & \text{if } \mu \geq \hat{y}_{\tau:t} \\ \infty & \text{otherwise} \end{cases}$$
where we define $\hat{y}_{\tau:t} = \max_{i \in \{\tau \dots t-1\}} \{y_i\}$. We have
$$Cost^0_{1:t} = \min_{\mu \in \mathbb{R}^{+*}} \Big\{ \sum_{i=1}^{t-1} \gamma(Y_i, \mu) \Big\} = (t-1)\log(\hat{y}_{1:t}).$$
Finally, note that the loss is continuous to the right, and as explained in the footnote on page 11 we can further restrict $\tau^K_{1:t}$ to the set of $\tau$ such that $S_{1:t,\tau}$ is not restricted to a single value.

Proof. Using equation (12) we get that for any $t$ greater than 2 we have:
$$S^1_{1:t+1,t} \subset \{ \mu \mid Cost^0_{1:t} \leq \tilde{C}^1_{1:t,t}(\mu) \} = \left]0, y_{t-1}\right] \cup \left[ \hat{y}_{1:t} \left( \tfrac{\hat{y}_{1:t}}{\hat{y}_{1:t-1}} \right)^{t-1}, +\infty \right[, \quad (14)$$
with $\left( \hat{y}_{1:t} / \hat{y}_{1:t-1} \right)^{t-1} \geq 1$. There are two cases in which it can be shown that $S^1_{1:n,t}$ is empty or restricted to a single value, and thus can be discarded.

Case 1: if $y_{t-1} < y_t < \hat{y}_{1:t}$, we have $\hat{y}_{1:t+1} = \hat{y}_{1:t} = \hat{y}_{1:t-1}$. Thus we have
$$S^1_{1:t+1,t} \subset \left]0, y_{t-1}\right] \cup \left[\hat{y}_{1:t}, +\infty\right[, \qquad \{ \mu \mid \tilde{C}^1_{1:t+1,t}(\mu) \leq Cost^0_{1:t+1} \} = [y_t, \hat{y}_{1:t+1}],$$
and, using equation (11) from property 3,
$$S^1_{1:t+2,t} \subset \left( \left]0, y_{t-1}\right] \cup \left[\hat{y}_{1:t}, +\infty\right[ \right) \cap [y_t, \hat{y}_{1:t}] = \{\hat{y}_{1:t}\},$$
so by induction $S^1_{1:n,t}$ is restricted to a single value.

Case 2: if $y_t < y_{t-1} < \hat{y}_{t:n}$, we have $\hat{y}_{1:t+1} = \hat{y}_{1:t}$ and, using equation (11),
$$S^1_{1:t+2,t} \subset \left( \left]0, y_{t-1}\right] \cup \left[ \hat{y}_{1:t} \left(\tfrac{\hat{y}_{1:t}}{\hat{y}_{1:t-1}}\right)^{t-1}, +\infty \right[ \right) \cap \left[ y_t, \ \hat{y}_{1:t+1} \left(\tfrac{\hat{y}_{1:t+1}}{\hat{y}_{1:t}}\right)^{t} \right] \subset [y_t, y_{t-1}].$$
Now we also have:
$$\{ \mu \mid \tilde{C}^1_{1:n,t}(\mu) \leq Cost^0_{1:n} \} = \left[ \hat{y}_{t:n}, \ \hat{y}_{1:n} \left(\tfrac{\hat{y}_{1:n}}{\hat{y}_{1:t}}\right)^{\frac{t-1}{n-t}} \right].$$
Using equation (11) we get $S^1_{1:n,t} \subset [y_t, y_{t-1}] \cap \{ \mu \mid \tilde{C}^1_{1:n,t}(\mu) \leq Cost^0_{1:n} \} = \emptyset$, as $y_{t-1} < \hat{y}_{t:n}$.

Let us now compute the probability of these two events. All $Y_i$ are independent with the same density $f$ and continuous distribution $F$; thus the distribution of $\hat{Y}_{1:t}$ is $F^{t-1}$ and $P(\hat{Y}_{1:t} < x) = F^{t-1}(x)$.
As $Y_t$ is independent of $\hat{Y}_{1:t}$, we get
$$P(\hat{Y}_{1:t} < Y_t) = \int_{\mathbb{R}} f(x)\, P(\hat{Y}_{1:t} < x)\, dx = \int_{\mathbb{R}} f(x)\, F^{t-1}(x)\, dx = \frac{1}{t}.$$
Using this, we see that case 1 happens with probability
$$P(Y_{t-1} < Y_t < \hat{Y}_{1:t}) = \frac{1}{2}\Big(1 - \frac{2}{t}\Big),$$
and case 2 happens with probability
$$P(Y_t < Y_{t-1} < \hat{Y}_{t:n}) = \frac{1}{2}\Big(1 - \frac{2}{n-t}\Big).$$
Furthermore, case 1 and case 2 are disjoint events, so the probability that one of the two occurs is the sum $1 - \frac{1}{t} - \frac{1}{n-t}$. Thus the probability that the change-point $t$ has not been discarded at step $n$ is lower than $\frac{1}{t} + \frac{1}{n-t}$. Taking the expectation and summing over all $t$ smaller than $n$, we recover that the expected number of non-discarded candidate change-points is in $O(\log(n))$.

Appendix C: R code to assess the runtime of the pDPA on simulated data

Here we provide the R code used to assess the performance of the pDPA. The code of the pDPA itself is in C++. First, here is a function to simulate the different types of signals.

#### simple simulation function
funSimu_ <- function(n, noise="norm", signal="flat"){
  ### signal
  if(signal == "flat"){
    x <- numeric(n)
  }
  if(signal == "sin"){ ## sin
    x <- 2*sin(1:n/100)
  }
  if(signal == "block"){ ## 10 blocks
    nb <- 5
    x <- rep(rep(c(0, 2), nb), each=n/(2*nb))[1:n]
  }
  if(signal == "sinblock"){ ## 10 blocks + sin
    nb <- 5
    x <- rep(rep(c(0, 2), nb), each=n/(2*nb))[1:n] + 2*sin(1:n/100)
  }
  if(signal == "linear"){
    x <- 1:n
  }
  ### noise
  ## if noise == "none", do nothing
  if(noise == "norm") x <- x + rnorm(n)
  if(noise == "Cauchy") x <- x + rcauchy(n)
  if(noise == "unif") x <- x + runif(n, 0, 1) - 0.5
  return(x)
}

Second, here is a simple function to record the runtime of the algorithm.
simpleStat <- function(x){ c(mean(x), sd(x), range(x)) }

#### simple function to test the runtimes for various n
testRunTime <- function(Kmax = 50, ns=c(100, 200), rep=5,
                        funSimu=rnorm, noise, signal){
  rTime <- list()
  for(n in ns){
    cat("Size:", n, "\n")
    for(j in 1:rep){
      x <- funSimu(n, noise=noise, signal=signal)
      rTime[[length(rTime)+1]] <- c("pDPA", n, j,
        system.time(res_p <- cghseg:::segmeanCO(x, Kmax=Kmax))[3])
    }
  }
  ### formatting the runtimes
  dTime <- data.frame(matrix(unlist(rTime), ncol=4, byrow=TRUE))
  colnames(dTime) <- c("Algo", "n", "rep", "time")
  dTime$time <- as.numeric(as.vector(dTime$time))
  dTime$n <- as.numeric(as.vector(dTime$n))
  mTime <- aggregate(dTime$time, by=list(dTime$Algo, dTime$n),
                     FUN=simpleStat, simplify=TRUE)
  mTime <- cbind(mTime[, 1:2], mTime[, 3][, 1:4])
  colnames(mTime) <- c("Algo", "n", "mean", "sd", "min", "max")
  mTime$signal <- signal
  mTime$noise <- noise
  mTime <- mTime[order(mTime$n), ]
  return(mTime)
}

Here is code to run the algorithm for the different types of signal (constant, sinusoid, block), the various types of noise (Gaussian, uniform and Cauchy) and different values of n.

install.packages("cghseg")
require(cghseg)
Kmax <- 40
ns <- 40 * 2^(1:16)
respDPA_l <- list()
for(signal in c("flat", "sin", "block", "sinblock")){
  print(paste("###", signal))
  for(noise in c("norm", "Cauchy", "unif")){
    print(noise)
    respDPA_l[[length(respDPA_l)+1]] <-
      testRunTime(Kmax = Kmax, ns=ns, rep=5, funSimu=funSimu_,
                  noise=noise, signal=signal)
  }
}
save(respDPA_l, file="respDPA_l.Rdata", compress=TRUE)

Here is code to run the algorithm in the worst case scenario, $Y_i = i$.
Kmax <- 40
ns <- 40 * 2^(1:10)
respDPA_w <- list()
for(signal in c("linear")){
  print(paste("###", signal))
  for(noise in c("none")){
    print(noise)
    respDPA_w[[length(respDPA_w)+1]] <-
      testRunTime(Kmax = Kmax, ns=ns, rep=5, funSimu=funSimu_,
                  noise=noise, signal=signal)
  }
}
save(respDPA_w, file="respDPA_w.Rdata", compress=TRUE)

Appendix D: R code to assess the runtime of the pDPA on the GSE17359 data set

Here we provide the R code used to assess the performance of the pDPA on the data from the GSE17359 data set.

dat <- read.table("GSE17359_affy_snp_ratios_matrix.txt",
                  header=TRUE, sep="\t")
require(cghseg)
ie <- which(apply(is.na(dat), 1, sum) == 0)
dat <- dat[ie, ]
ns <- c(40 * 2^(1:15), nrow(dat))
Kmax <- 40
rTime <- list()
for(iSample in 2:ncol(dat)){
  print(iSample)
  for(n in ns){
    x <- dat[1:n, iSample]
    rTime[[length(rTime)+1]] <- c("pDPA", n, iSample,
      system.time(res_p <- cghseg:::segmeanCO(x, Kmax=Kmax))[3])
  }
}
dTime <- data.frame(matrix(unlist(rTime), ncol=4, byrow=TRUE))