Mining Temporal Patterns from iTRAQ Mass Spectrometry(LC-MS/MS) Data
Large-scale proteomic analysis is emerging as a powerful technique in biology and relies heavily on data acquired by state-of-the-art mass spectrometers. As with any other field in Systems Biology, computational tools are required to deal with this o…
Authors: Fahad Saeed, Trairak Pisitkun, Mark A. Knepper
Mining T emporal Patter ns fr om iTRA Q Mass Spectr ometry(LC-MS/MS) Data Fahad Saeed ∗ ∗ T rairak Pisitkun Epithelial Systems Biology Laboratory Epithelial Systems Biology Laboratory National Institutes of Health (NIH) National Institutes of Health (NIH) Bethesda, Maryland 20892 Bethesda, Maryland 20892 Mark A Knepper Jason D Hof fert Epithelial Systems Biology Laboratory Epithelial Systems Biology Laboratory National Institutes of Health (NIH) National Institutes of Health (NIH) Bethesda, Maryland 20892 Bethesda, Maryland 20892 Abstract Lar ge-scale pr oteomic analysis is emerging as a powerful technique in biology and r elies heavily on data acquired by state-of-the-art mass spectr ometers. As with any other field in Systems Biology , computational tools are r equired to deal with this ocean of data. iTRA Q (isobaric T ags for Relative and Absolute quantification) is a technique that al- lows simultaneous quantification of pr oteins fr om multiple samples. Although iTRA Q data gives useful insights to the biologist, it is mor e comple x to perform analysis and draw biological conclusions because of its multi-plexed design. One such pr oblem is to find pr oteins that behave in a sim- ilar way (i.e. change in abundance) among various time points since the tempor al variations in the pr oteomics data r eveal important biological information. Distance based methods such as Euclidian distance or P earson coef ficient, and clustering techniques such as k-mean etc, are not able to take into account the temporal information of the series. In this paper , we present an linear -time algorithm for clus- tering similar patterns among various iTRA Q time course data irr espective of their absolute values. The algorithm, r eferr ed to as T emporal P attern Mining(TPM), maps the data fr om a Cartesian plane to a discr ete binary plane. Af- ter the mapping a dynamic pr ogramming technique allows mining of similar data elements that ar e temporally closer to each other . The pr oposed algorithm accurately clusters iTRA Q data that ar e temporally closer to each other with mor e than 99% accuracy . Experimental results for dif ferent pr oblem sizes are analyzed in terms of quality of clusters, execution time and scalability for larg e data sets. An ex- ample fr om our pr oteomics data is pr ovided at the end to demonstrate the performance of the algorithm and its abil- ∗ * Corresponding author: National Institutes of Health, Building 10, Room 6N312, 10 Center Drive, MSC 1603, Bethesda, MD 20892- 1603. E-mail addr ess : fahad.saeed@nih.gov ity to cluster temporal series irr espective of their distance fr om each other . 1 Intr oduction Mass spectrometry is a fundamental part of any modern proteomics research platform for accurate protein identi- fication and quantification [1] [2] [3]. Mass spectrome- ters measure the mass-to-charge ratio(m/z) of ionized par- ticles [4]. In the case of a typical LC-MS/MS proteomic experiment, the ionized particles (i.e. peptides) are intro- duced into a mass spectrometer at the ion source in the form of liquid solutions, then desolvated and transferred into the gas phase as gas phase ions. A v ariety of search algorithms are then used to match the peptide spectra to sequences in online databases in order to identify the proteins in the mix- ture [5] [6]. iTRA Q (isobaric tags for relativ e and absolute quan- tification) is a technique used to identify and quantify pro- teins from dif ferent sources in one single experiment. It uses isotope coded cov alent tags and is used to study quan- titativ e changes in the proteome [7, 8]. The method is based on cov alent labeling of the N-terminus and lysine side chains from protein digestions with tags of v arious masses for distinction. Up to 8 different tagging reagents (8-plex kit) are used to label peptides from different sam- ples. The samples are then pooled together as a single sam- ple and analyzed by mass spectrometer . The fragmentation of the attached tag generates a lo w molecular mass reporter ion which is useful in quantifying relativ e peptide ab un- dance between the different iTRA Q channels. The iTRA Q technique allows analysis of samples in a more sophisticated and accurate manner in turn giving more rele vant biological information such as phosphory- lation of peptides or the effect of v asopressin at different time points in the mass spec. Although the technique al- lows greater accuracy in quantitation, it raises many com- putational problems. One such problem is to identify the peptides that behav e similarly for a gi ven external agent e.g. dD A VP over the time course study . Time course mea- surements from iTRA Q data are becoming a common pro- cedure in many systems biology e xperiments [9] [10] [11]. If the experiment is subject to variations in time, the con ventional methods to cluster and analyze the similarity such as Euclidean distance, Hamming distance hav e sig- nificant limitations. Like wise the clustering mechanisms that use distance based measures such as k-means [12], or hierarchical clustering [13] do not alw ays succeed when responses are highly v ariable in magnitude. Other meth- ods such as fuzzy clustering of short time-series [14] are not computationally ef ficient after a certain number of time courses due to the combinatorial explosion in possibilities. A facilitating characteristic of successful scalable cluster- ing is still to find a linear algorithm that in volv es a small number of passes ov er the database [15]. In this paper we present a near-linear time cluster- ing algorithm that finds temporal patterns in a given large iTRA Q labeled dataset using one or small number of passes ov er the data and without compr omising the quality of the clusters . Our algorithm draws its moti vation from mapping problems in the parallel processing community [16] [17] and quantization in information theory [18] [19]. The pro- posed algorithm allo ws us to map the time points of a Cartesian plane into a discrete plane. The mapping then produces a predictable number of clustering possibilities that are then binned using a efficient dynamic programming technique to extract the patterns. The paper is or ganized as follows. In section 2 we provide the biological experimental details and the associ- ated computational problem. Section 3 discusses the pro- posed clustering algorithm and complexity analysis. Sec- tion 4 discusses the e xperimental results and illustrati ve e x- amples for our clusters. Finally , conclusions are presented in section 5 of the paper . 2 Pr oblem Statement The objectiv e of the study was to perform a quantita- tiv e comparison of protein phosphorylation under vaso- pressin(dD A VP) treatment using iTRA Q labels for dif fer- ent time points. LC-MS/MS [20] phosphoproteomics anal- ysis was performed and the time course clusters that would be obtained from the study provides the basis for modeling the signaling network inv olved. The flo w diagram for the experiment is sho wn in Fig.1. In computational terms, the problem that we wish to solve is as follo ws. W e are gi ven a set of peptides with time points ( t 1 , t 2 · · · etc ) with each time point having a certain real value which in our case is the iTRA Q ratio. Given the peptides with time points and real values, we want to be able to cluster the peptides that give a similar pattern over kidney collecting duct (IMCD) dDA VP treatment (0.5, 2, 5, 15 min) typsin digestion iTRAQ labeling t im e ( m i n ) v e h ic l e d D A V P 0 .5 iTRAQ-1 1 3 iTRAQ-1 1 5 2 iTRAQ-1 1 4 iTRAQ-1 1 6 5 iTRAQ-1 1 7 iTRAQ-1 1 9 1 5 iTRAQ-1 1 8 iTRAQ-1 2 1 LC-MS/MS database searching phosphopeptide identification phosphorylation site assignment phosphopeptide quantification and clustering Figure 1: The Flo w diagram for the experiment the time course. An example of such is sho wn in the figure 2. Pep$de& T1& T2& T3& T4& ." ." ." VYEPLK" 0.2" 0.3" ‐0.1" 0.5" IHIDPE" 0.1" 0.2" 0.4" 0.6" LEVAK" 0.5" 0.7" 0.1" 0.3" ." ." ." ‐0.2" ‐0.1" 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" t1" t2" t3" t4" VYEPLK" 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" t1" t2" t3" t4" IHIDPE" 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" t1" t2" t3" t4" LEVAK"" Figure 2: The peptides with corresponding time scales and their values. The first and third peptides belong to the same cluster The first column in figure 2 shows peptide sequences followed by 4 columns of iTRA Q ratios corresponding to the change in peptide abundance between the v asopressin and v ehicle control samples (”dD A VP/vehicle”), each cor- responding to a different time point. Of course the number of columns would be dependent on the time courses that are considered and can increase or decrease depending on the particular experiment. Our objecti ve is to determine the data points that have similar temporal patterns. It can be seen in the figure that peptide V Y E P LK and LE V AK hav e similar patterns over the time course because both in- crease from point t 1 to t 2 , decrease from t 2 to t 3 and in- crease from t 3 to t 4 . The peptide I H I D P E on the other hand increases for all the time points. Hence, the peptides V Y E P LK and LE V AK must be clustered together . Now let us formally define the problem. Let there be N data points with each data point having K time course values and X represents the peptide name. Then, let U present the set such that U i = { X i , ( t 1 , · · · , t K ) } where 1 ≤ i ≤ N . Also, let the number of clusters be Q . Then, cluster set C = { c 1 , c 2 , · · · , c Q } where each c j ∈ U i such that 1 ≤ i ≤ N and 1 ≤ j ≤ Q . Then each of the cluster c j has the set of data points that hav e the same temporal pattern with time. The temporal similarity is defined as follows. Let c pq represent q th data point in cluster p where 1 ≤ p ≤ Q and 1 ≤ q ≤ N . Now let c pq ( x ) represent the mapped data point at point x . Let there be another mapped data point c pq ( y ) at point y and ∀ x < y . Now define an array r h where 1 ≤ h ≤ K . Each point in array r h = c pq ( y ) − c pq ( x ) (1) Then, the data points that ha ve strictly equal r h would be considered temporally similar . 3 Pr oposed Mining Algorithm In this section, we present details of the proposed mining strategy , referred to as T emporal Pattern Mining (TPM). W e also analyze the computational complexities of the pro- posed algorithm. The proposed algorithm TPM draws its motiv ation from the mapping problem in parallel and distributed com- puting [16] [21] [17] and information theory [18, 19]. The mapping problem in parallel computing and mapping for our mining algorithm share a similar characteristic. In map- ping for parallel processing, two sets of nodes are consid- ered: problem modules and processor modules. The objec- tiv e is to map the problem modules on to processor mod- ules in an efficient manner . In mapping for the mining algorithm we are seeking a mapping such that the Carte- sian plane of the data points are mapped onto a discrete plane(e.g. binary plane) of finite possibilities. The discrete plane is defined by using the Nyquist sampling technique that allo ws conservation of the information from a contin- uous signal (or a data set with real numbers). These finite possibilities, which can grow exponentially with increasing time courses, are then mined using our ef ficient dynamic programming technique. Algorithm1 giv es an intuiti ve de- scription of the strategy . 3.1 Mapping from Cartesian to discr ete plane The proposed algorithm TPM can be classified as a fea- ture extraction algorithm [22]. As defined in the problem statement in the section abov e we have a number of pep- tides with associated iTRA Q ratio v alues and we are inter- Data : A set U i = { X i , ( t 1 , · · · , t K ) } of peptides and their time courses Result : Compute the cluster set C = { c 1 , c 2 , · · · , c Q } such that the clusters hav e temporal similarity within the distinct clusters for i = 1 to K do Compute A [ i ] := mapping ( U i ) from Cartesian to discrete plane end while ther e ar e values in A that are not null do pick a random value from A call it A R count++ for j = 0 to N do distance = E ditD istance ( A R , A [ j ]) if distance == 0 then c count ← A [ j ] / * This is to eliminate values from A that have already been assigned to a cluster * / A [ j ] ← N U LL end end end Algorithm 1: Mapping based temporal pattern mining algorithm Data : A data point in the Cartesian plane Result : Return the discrete plane representation of the data point mapping(datapoint U w ) V ector V w ← ∅ for w = 0 to U w .leng th () do if w=0 then / * save as name of the peptide * / else if U w +1 − U w ≥ 0 then V w = a end else V w = b end end end end retur n V Algorithm 2: Mapping function Data : T wo strings of length m and n Result : LevenshteinDistance between the tw o string is returned / * d is a table with m+1 rows and n+1 columns * / EditDistance(char s[1..m], char t[1..n]) int d [0 ..m, 0 ..n ] ← ∅ for i fr om 0 to m do d[i, 0] := i / * deletion * / end for j fr om 0 to n do / * insertion * / d[0, j] := j end for j fr om 1 to n do for i fr om 1 to m do if s[i] = t[j] then d[i, j] := d[i-1, j-1] end else d[i, j] := minimum ( d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + 1) end end end retur n d[m,n] Algorithm 3: Dynamic programming cluster extraction subroutine ested in extracting clusters of the peptides that giv e similar expression lev els(falling or rising) at different time points. Howe ver , k-means or hierarchical clustering cannot be used because the time points may be closer to each other in Eu- clidean distance, but may not be close in temporal changes ov er the time course. Clustering using the real v alues from the Cartesian plane howe ver , is not feasible because of its continuous nature (infinite values). Therefore, with Cartesian co- ordinates the number of possible cluster combinations will be infinite in nature and clustering for all possible com- binations is not computationally feasible. Thus, a Carte- sian plane co-ordinates has to be mapped to a more dis- crete plane co-ordinates to restrict the number of combina- tions. The mapping function should be such that it would allow us to quantify the variations in the data with respect to time and also make the v alues discrete enough such that the number of combinations that are possible would decrease drastically . T o address these challenges, the mapping function that is presented allows us to make the values more discrete and also conserves the important information of expression lev els between the time periods. Using the same notation that we presented in the pre vious section. Let U present the set of values such that U i = { X i , ( t 1 , · · · , t K ) } where 1 ≤ i ≤ N . X i represents the peptides and t 1 , · · · , t K rep- resents that v alues of the expression level from 1 to K . The mapping from the Cartesian plane would be accomplished as follows: [ M ( x, y ) = a if ( t x < t y ) and ( x < y ) and ( y − x = 1); b o.w . ] (2) The assumption in the mapping function is that the first value is zero. Howe ver , this is an assumption that is appropriate for our data but is a not a generalized rule and can be changed accordingly . The mapping function in its functionality is simple and does the following. It looks for the data points at the next time point. If the current data point is below the previous value it is assigned ’a’. Oth- erwise it is assigned a ’b’. Therefore, after the mapping has been completed, each of the time series would be a sequence of a’ s and b’ s with each of the characters repre- senting the rise or fall in the expression lev el. The number of discrete levels can change according to the biological system under consideration. The mapping function has been defined in a way that makes a clear distinction between the data points that hav e similar temporal patterns and the ones that don’t. Figure 3 shows that data points A(red) and B(black) are very close to each other in distance. Howe ver , in our mining scenario we are more interested in the pattern of the expression lev- els over time and in consideration of our criterion, the data point C( blue) is closer to B(black). If a nai ve k-means t1 t2 t3 t4 Time a b Discrete values t1 t2 t3 t4 Time a b Discrete values t1 t2 t3 t4 Time Real Values t1 t2 t3 t4 Time a b Discrete values Cartesian Plane Mapped Plane Mapped Plane Mapped Plane A B C Figure 3: The mapping of a Cartesian plane to a discrete mapped plane clustering would be executed, data points A and B would be clustered together . Using our mapping function how- ev er, we are able to make a clear distinction irrespecti ve of the Euclidean distance of the data points with respect to each other . Observe that the mapping of co-ordinates that is produced by our algorithm is same for the data points that hav e similar temporal beha viors (i.e. B and C). Therefore, once the mapping has been accomplished, the data points that hav e the same mapping belong to a single cluster . The psuedocode for the mapping function is giv en in Algorithm 2. 3.2 Efficient Extraction of Clusters Once the mapping is complete, the next step is to mine the clusters that are present in the data set. An ef ficient way is needed to extract the clusters from the mapped data because the number of clusters can increase exponentially with increase in the time points as well as the number of discrete le vels desired. Consider , for example, tw o discrete lev els are considered as in our case. The number of clus- ters would be upper bounded by the order of Z N (for Z states or 2 N for two states), but may or may not be present in the data [23]. Even for a moderately large N the num- ber of combinations are huge and for each data point going through all of the possible combinations is waste of pre- cious computing resources. For TPM we de veloped an ef fi- cient technique that allows us to keep the search space con- fined to the clusters that are present in the dataset. There- fore, using our technique the exponential search space of possible clusters is minimized to the set of clusters that are present in the data set, saving valuable computing re- sources. The procedure to mine the rele vant clusters shown in Algorithm 3. The crux of the technique is based on a dy- namic programming edit distance algorithm that allow us to calculate the ’ distance’ of a particular data point from another . W e randomly pick a data point from the mapped data values. The randomly mapped data point is then used to calculate the lev enstein distance with other data points. The data values that have zero distance with one another belong to the same cluster , because they would have the same pattern. The technique is very efficient in practice be- cause for large number of time points, the number of clus- ters that are actually present in the data is far less than the possible number of clusters. Thus, the computational com- plexity is greatly decreased and makes the system more ef- ficient. abababababababababababababb abababababababababababababb abababababababababababababb abababababababababababababb abababababababababababababb abababababababababababababb babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab c11 c12 c13 c14 c15 babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa babababbabaaaabbbbaaaababaa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa abbabababababababababbababa baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab baababababababababababaabab c21 c22 c23 c24 c25 c31 c33 c35 c32 c34 c41 c42 c43 c44 c45 c1={1,5,9,13,17,21} c1={1,5,9,13,17,21} c2={2,6,10,14,18,22} c1={1,5,9,13,17,21} c2={2,6,10,14,18,22} c3={3,7,11,15,19,24} c1={1,5,9,13,17,21} c2={2,6,10,14,18,22} c3={3,7,11,15,19,23} c4={4,8,12,16,20,24} pass 1 pass 2 pass 3 Clusters with indexes Figure 4: Extraction of clusters after mapping is complete Figure 4 shows the clusters are extracted from the set in only small finite operations as compared to large num- ber of operations that would be required if all the possible combinations would be considered. The figure shows 27 time points for each data set that has been mapped using our mapping strate gy . The total number of possible combi- nations are 2 27 for each of the data points making total op- erations equal to 23 × 2 27 i.e. O ( N 2 N ) . Howe ver , consid- ering our strategy , we pick one of the data points randomly . In the figure, data points (time courses) in black font are picked and the edit distance is calculated for each of the data points. These data points would get a zero distance and would be accumulated in a single cluster c 1 . After , adding them to the cluster set these black data points are excluded from the search and the process is repeated until only one kind of data points is present. The total number of operations is equal to 23 × 4 i.e. O ( QN ) where Q is the number of clusters and N is the number of elements present. It is very apparent in fig 4 that only 3 passes are needed to extract all the clusters from the data set. The extraction of the clusters works very well in practice for a two-state dataset. Howev er , it can be further improved by using kd-tree data structure for extraction of clusters for higher dimen- sional states [24] [25] [26]. The k-d tree is a multidimensional bi- nary search mechanism that represents a recursive subdivision of the data space into disjoint subspaces by means of d-dimensional hyper planes. The root of the tree then represents all the patterns, while the children of the root represents subsets of patterns in the subspaces. Searching for the clusters for the algorithm can then be performed in O ( QN (1 − 1 / N o.of states ) ) . Note that as the number of states increase, the searching time would reduce corre- spondingly because of the division of subspaces using the hyper planes. 3.3 Complexity Analysis The time complexity of the algorithm can be broken down into two parts. The first part is for the mapping and the second part for the extraction of the clusters. The mapping part has complex- ity of O ( K N ) since there are N data elements and each is of length K . The second part of the algorithm is for the extraction of the clusters. This part run Q times where Q is the number of clusters present in the data. For each run, dynamic programming algorithm is executed, which runs in O ( K 2 ) time; assumption is that both of the data points that are being compared are equal in length. This procedure runs for Q times giving the complexity of O ( QK 2 ) . Thus the total time complexity of the algorithm is T ( . ) = O ( QK 2 ) + O ( K N ) . 4 P erf ormance Evaluation Performance e valuation was done for the quality of the clus- ters that were extracted as well as the efficienc y of the tech- nique. W e tested the algorithm with data from a lar ge-scale quan- titativ e phosphoproteomics experiments done as follows: Inner medullary collecting duct (IMCD) samples were incubated in the presence or absence of 1nM dD A VP (vasopressin) for 0.5, 2, 5, and 15 minutes (N=3) follo wed by LC-MS/MS-based phospho- proteomic analysis. Quantification used 8-plex iTRA Q and com- mercially available software. These phosphopeptides were ana- lyzed with our algorithm in order to identify groups that changed in abundance with similar temporal responses after exposure to vasopressin. The algorithm identified 16 clusters of phosphopep- tides with distinct temporal profiles. These time-course clusters provide a starting point for modeling of the signaling network in- volv ed. The algorithm 1 has been implemented in Jav a(TM) SE Runtime Environment ( buil d 1 . 6 . 0) . The experiments were con- ducted on a Dell server consisting of 2 Intel Xeon(R) Processors, each running 2.40 GHz, with 12000 KB cache and 64GB DRAM memory . The operating system on the server is Linux RedHat enterprise version with kernel 2.6.9-89.ELlar gesmip. For the timing experiments we used the same data that we got from our biological experiments. In order to access the timing for the algorithm, we generated the data as follows. The com- plexity analysis suggested that the algorithm must e xhibit a linear time with increasing data points. Therefore, we wanted to ac- cess the timing by keeping other variables relatively constant i.e. 1 An executable can be obtained by requesting the author; A webser- vice will also be av ailable at authors’ page. the number of clusters(Q) and the length of the time points(K). W e generated our data set for timing ev aluation by replicating the data that we got from our biological experiment. The ratio of the number of clusters and the length of the time points remained constant whene ver we replicated our data by a real positi ve whole factor . W e tested the timings for data points from 3500 to 80 , 000 elements. 0 2 4 6 8 · 10 4 0 1 2 3 4 N U M B E R O F D A TA P O I N T S ( N ) T ime in seconds Q = 5 Q = 10 Q = 15 Q = 20 Q = 30 Figure 5: T iming with increasing number of data points 0 20 40 60 80 100 0 200 400 600 L E N G T H O F T I M E P O I N T S ( K ) T ime in seconds Q = 5 Q = 10 Q = 15 Q = 20 Q = 30 Figure 6: T iming with increasing number of time points The timings for the algorithm with increasing number of data points is shown in Fig. 5 for up to 80000 . The timings ob- served hav e a linear trend with increasing number of data points as predicted by our complexity analysis. Even for up to 80000 elements the timings observed are no more than 3 seconds. Ob- serve, as the number of clusters decrease, the times observed for the same number of data points have a decreasing trend. The rea- 0 20 40 60 80 100 10 0 10 1 10 2 10 3 L E N G T H O F T I M E P O I N T S ( K ) T ime in seconds N = 3500 N = 7000 N = 20000 N = 40000 Figure 7: T iming with increasing number of time points T able 1: Number of correct clusters with different algorithms Algorithm Clusters Identified Correct Clusters K-means 16 3 Hierarchical 18 6 SOM 20 10 TPM 16 16 son is that for fewer clusters, the number of passes that have to be made to e xtract cluster pattern is smaller further decreasing the computation time. It is also useful to know the beha vior of the al- gorithm with increasing number of time points(K) in the data. Fig. 6 sho ws the timings of the algorithm with increasing number of time points(K) for up to 100 for variable number of clusters. The timings observed are in accordance with our complexity analy- sis that suggested a O ( K 2 ) behavior and time remains under 500 seconds for clustering data sets that have 100 time points. Figure 7 also shows the timings with increasing number of time points while v arying the number of data points. As can be seen the algo- rithm exhibits a very consistent behavior with increasing number of time points as well as increasing number of data points. The constants used during these experiments are K = 4 , Q = 30 and N = 3500 wherever appropriate. The quality of the clustering was then compared with other algorithms such as kmeans [10] [12], Self organizing maps [11] [27], and Hierarchical clustering [9] [28] [29]. A brief summary of the results from these experiments are in table 1. The assess- ment of the quality w as done as follo ws. The standard algorithms used as described in [10] [11] [9] on our dataset which are the same iTRA Q labeled data that has been described in the litera- ture. For the clusters that were identified by these algorithms, if there were more than 1% of data points that were incorrect it is labeled as incorrect cluster . Although there are many data points in the cluster that are similar to one another; data points that are incorrectly clustered can have a serious impact on the quantifica- tion of the proteins. As shown in the table, the number of clus- ters that were identified varied according to the algorithm used because of the high variability in the data. For the cluster that were identified, above mentioned criteria was used to distinguish the correct clusters from the incorrect one. The results obtained matched well with the previous studies e.g. In [11] the authors using self organizing maps(SOM) were only able to use 9 clusters that were sufficiently accurate to be used for quantification stud- ies thus limiting the biological meaningful data analysis. Using our algorithm, it can be seen that the number of clusters observed were in agreement with our theoretical analysis. All of the data points that were included in the clusters didn’t had a variation more than 1% making it a highly accurate clustering algorithm for iTRA Q labeled protein quantification analysis. 0 2 4 6 8 10 12 14 16 − 2 − 1 . 5 − 1 − 0 . 5 0 T I M E I N M I N U T E S iTRA Q ratios Figure 8: Example of clustered data points 0 2 4 6 8 10 12 14 16 − 2 . 5 − 2 − 1 . 5 − 1 − 0 . 5 0 T I M E I N M I N U T E S iTRA Q ratios Figure 9: Example 2 of clustered data points W e were able to identify 16 distinct clusters for phospho- peptides with distinct temporal profiles. W e also performed sub- 0 2 4 6 8 10 12 14 16 − 0 . 2 0 0 . 2 0 . 4 T I M E I N M I N U T E S iTRA Q ratios Figure 10: Example 3 of clustered data points clustering of the data purely for biological analysis i.e. for quan- tification we wanted separate clusters for the ratios that are nega- tiv e all over the time course, the ones that were positiv e and the ones that crossed from positi ve to ne gativ e or negati ve or positiv e ov er the length of the experiment. Some examples of the clus- ters are sho wn in Figs. 8, 9 and 10,each line presenting a peptide. As shown in Fig. 8, e ven though the data points have high vari- ability in terms of distance of the points, the points have been clustered accurately with respect to the pattern of the response. For all of the data points that were clustered in Fig.8, the ratios first decreased then increased and decreased further in the last time point. Same can be observed for Fig. 9 and 10 that even though the points have a lot of variability in terms of distance of the points, they are clustered correctly in terms of patterns over the time. 5 Conclusion W e dev eloped a new algorithm called TPM for clustering time- course patterns following step-inputs in biological systems for iTRA Q labeled phosphopeptides. W e tested the algorithm with data from lar ge-scale quantitative phosphoproteomics experi- ments for up to 80 , 000 data points and up to 100 time point inter- vals. Quantification used 8-plex iTRA Q and commercially avail- able software. These phosphopeptides were analyzed with our algorithm in order to identify groups that changed in abundance with similar temporal responses after v asopressin addition. The algorithm maps the data from a Cartesian plane to a discrete bi- nary plane and uses an efficient dynamic programming technique to mine similar patterns after mapping. The mapping allo ws clus- tering of similar time courses that are temporally closer to each other . The algorithm identified 16 clusters of phosphopeptides with distinct temporal profiles in response to v asopressin. The al- gorithm was also compared for quality to other standard clustering techniques that have been used for similar experiments in the lit- erature. It was shown that the proposed algorithm performed with significantly better accuracy (at least 99% of clusters were as- signed correctly) with the ability to handle large data sets. These time-course clusters provide a starting point for modeling of the signaling network. W e believe that the proposed algorithm will prov e useful to the computational biology and mass spectrometry community . Acknowledgements Mass spectrometry was conducted in the National Heart, Lung and Blood Institute Proteomics Core Facility (director , Mar- jan Gucek). This work was funded by the operating budget of Division of Intramural Research, National Heart, Lung and Blood Institute, National Insitutes of Health (NIH), Project ZO1- HL001285. Refer ences [1] J. Hoffert, T . Pisitkun, G. W ang, R. Shen, and M. Knepper, “Quantitativ e phosphoproteomics of vasopressin-sensiti ve renal cells: Regulation of aquaporin-2 phosphorylation at two sites., ” Proc Natl Acad Sci U S A , 2006. [2] X. Li, S. A. Gerber , A. D. Rudner, S. A. Beausoleil, W . Haas, J. E. Elias, and S. P . Gygi, “Large-scale phos- phorylation analysis of alpha-factor-arrested saccharomyces cerevisiae., ” J Pr oteome Res , vol. 6, no. 3, pp. 1190–7, 2007. [3] A. Gruhler, J. V . Olsen, S. Mohammed, P . Mortensen, N. J. F , M. Mann, and O. N. Jensen, “Quantitative phosphopro- teomics applied to the yeast pheromone signaling pathway , ” Molecular and Cellular Pr oteomics , vol. 4, pp. 310–327, March 2005. [4] J. P . Whitelegge, “Hplc and mass spectrometry of intrinsic membrane proteins, ” vol. 251, December 2003. [5] S. T anner , H. Shu, A. Frank, L. C. W ang, E. Zandi, M. Mumby , P . A. Pevzner , and V . Bafna, “Inspect: identi- fication of post translationally modified peptides from tan- dem mass spectra, ” Analytical Chemistry , v ol. 77, pp. 4626– 4639, July 2005. [6] J. K. Eng, B. Fischer , J. Grossmann, and M. J. Maccoss, “A Fast SEQUEST Cross Correlation Algorithm, ” J. Proteome Res. , September 2008. [7] P . L. Ross, Y . N. Huang, J. N. Marchese, B. W illiamson, K. Parker , S. Hattan, N. Khainovski, S. Pillai, S. Dey , S. Daniels, S. Purkayastha, P . Juhasz, S. Martin, M. Bartlet- Jones, F . He, A. Jacobson, and D. J. Pappin, “Multiplexed Protein Quantitation in , ” Molecular & Cellular Pr oteomics , vol. 3, no. 12, pp. 1154–1169, 2004. [8] L. R. Zieske, “A perspectiv e on the use of iTRA Q reagent technology for protein complex and profiling studies, ” Jour - nal of Experimental Botany , vol. 57, no. 7, pp. 1501–1508, 2006. [9] J. V . Olsen, M. V ermeulen, A. Santamaria, C. Kumar , M. L. Miller , L. J. Jensen, F . Gnad, J. Cox, T . S. Jensen, E. A. Nigg, S. Brunak, and M. Mann, “Quantitativ e Phospho- proteomics Rev eals Widespread Full Phosphorylation Site Occupancy During Mitosis, ” Sci. Signal. , vol. 3, no. 104, pp. ra3–, 2010. [10] A. L. Stev ens, J. S. W ishnok, F . M. White, A. J. Grodzinsky , and S. R. T annenbaum, “Mechanical Injury and Cytokines Cause Loss of Cartilage Integrity and Upregulate Proteins Associated with Catabolism, Immunity , Inflammation, and Repair, ” Molecular & Cellular Pr oteomics , vol. 8, no. 7, pp. 1475–1489, 2009. [11] Y . Zhang, A. W olf-Y adlin, P . L. Ross, D. J. Pappin, J. Rush, D. A. Lauffenbur ger , and F . M. White, “T ime-resolved Mass Spectrometry of T yrosine Phosphorylation Sites in the Epidermal Growth Factor Receptor Signaling Network Rev eals Dynamic Modules, ” Molecular & Cellular Pr o- teomics , vol. 4, no. 9, pp. 1240–1250, 2005. [12] T . Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y . W u, “ An efficient k-means cluster- ing algorithm: Analysis and implementation, ” IEEE T rans. P attern Anal. Mach. Intell. , vol. 24, pp. 881–892, July 2002. [13] J. W ard, Joe H., “Hierarchical grouping to optimize an ob- jectiv e function, ” Journal of the American Statistical Asso- ciation , vol. 58, no. 301, pp. pp. 236–244, 1963. [14] C. S. Mller-Lev et, F . Klawonn, K. Cho, and O. W olken- hauer , “Fuzzy clustering of short Time-Series and unev enly distributed sampling points, ” in Advances in Intelligent Data Analysis V (M. Berthold, ed.), v ol. 2810 of Lectur e Notes in Computer Science (LNCS) , pp. 330–340, Springer V erlag, 2003. [15] J. Ghosh, “Scalable clustering methods for data mining, ” Handbook of Data Mining , pp. 247–277, 2003. [16] S. H. Bokhari, “On the mapping problem, ” IEEE T rans. Comput. , vol. 30, pp. 207–214, March 1981. [17] M. Ahmed and S. Bokhari, “Mapping with space fill- ing surfaces, ” IEEE T rans. P arallel Distrib. Syst. , vol. 18, pp. 1258–1269, September 2007. [18] C. E. Shannon, “Communication in the Presence of Noise, ” Pr oceedings of the IRE , vol. 37, no. 1, pp. 10–21, 1949. [19] H. Nyquist, “Certain topics in telegraph transmission the- ory , ” Proceedings of the IEEE , vol. 90, no. 2, pp. 280–305, 2002. [20] M. T . Davis, J. Beierle, E. T . Bures, M. D. McGinley , J. Mort, J. H. Robinson, C. S. Spahr , W . Y u, R. Luethy , and S. D. Patterson, “ Automated lc-lc-ms-ms platform us- ing binary ion-exchange and gradient reversed-phase chro- matography for improv ed proteomic analyses, ” Journal of Chr omatography B: Biomedical Sciences and Applications , vol. 752, no. 2, pp. 281 – 291, 2001. [21] A. Khokhar and V . Prasanna, “Mapping stereo and image matching algorithms onto parallel architectures, ” Sadhana , vol. 18, pp. 209–221, 1993. 10.1007/BF02742658. [22] T . W arren Liao, “Clustering of time series data - a surve y , ” P attern Recognition , vol. 38, no. 11, pp. 1857–1874, 2005. Cited By (since 1996): 104. [23] R. K othari and D. Pitts, “On finding the number of clusters, ” P attern Recognition Letters , vol. 20, no. 4, pp. 405 – 416, 1999. [24] J. L. Bentley , “Multidimensional binary search trees used for associative searching, ” Commun. A CM , vol. 18, pp. 509– 517, September 1975. [25] D. T . Lee and C. K. W ong, “W orst-case analysis for re- gion and partial region searches in multidimensional binary search trees and balanced quad trees, ” Acta Informatica , vol. 9, pp. 23–29, 1977. 10.1007/BF00263763. [26] S. Maneew ongvatana and D. M. Mount, “Its okay to be skinny , if your friends are fat, ” in Center for Geometric Computing 4th Annual W orkshop on Computational Geom- etry , 1999. [27] H. Ritter, “Self-organizing maps on non-euclidean spaces, ” in K ohonen Maps , pp. 97–108, Elsevier , 1999. [28] A. Fernndez and S. Gmez, “Solving non-uniqueness in agglomerativ e hierarchical clustering using multidendro- grams, ” Journal of Classification , vol. 25, pp. 43–65, 2008. 10.1007/s00357-008-9004-x. [29] J. W ard, “Hierarchical grouping to optimize an objectiv e function., ” Journal of the American Statistical Association , vol. 58, pp. 236–244, 1963.
Original Paper
Loading high-quality paper...
Comments & Academic Discussion
Loading comments...
Leave a Comment