Statistical File Matching of Flow Cytometry Data
Authors: Gyemin Lee (1), William Finn (2), Clayton Scott (1)
Abstract: Flow cytometry is a technology that rapidly measures antigen-based markers associated with cells in a cell population. Although analysis of flow cytometry data has traditionally considered one or two markers at a time, there has been increasing interest in multidimensional analysis. However, flow cytometers are limited in the number of markers they can jointly observe, which is typically a fraction of the number of markers of interest. For this reason, practitioners often perform multiple assays based on different, overlapping combinations of markers. In this paper, we address the challenge of imputing the high-dimensional jointly distributed values of marker attributes based on overlapping marginal observations. We show that simple nearest neighbor based imputation can lead to spurious subpopulations in the imputed data, and introduce an alternative approach based on nearest neighbor imputation restricted to a cell's subpopulation. This requires us to perform clustering with missing data, which we address with a mixture model approach and a novel EM algorithm. Since mixture model fitting may be ill-posed, we also develop techniques to initialize the EM algorithm using domain knowledge. We demonstrate our approach on real flow cytometry data.

Index Terms: statistical file matching, flow cytometry, mixture model, probabilistic PCA, EM algorithm, imputation, incomplete data, clustering

I. INTRODUCTION

Flow cytometry is a technique for quantitative cell analysis [1]. It provides simultaneous measurements of multiple characteristics of individual cells. Typically, a large number of cells are analyzed in a short period of time, up to thousands of cells per second. Since its development in the late 1960s, flow cytometry has become an essential tool in various biological and medical laboratories. Major applications of flow cytometry include hematological immunophenotyping and diagnosis of diseases such as acute leukemias, chronic lymphoproliferative disorders, and malignant lymphomas [2].

Flow cytometry data has traditionally been analyzed by visual inspection of one-dimensional histograms or two-dimensional scatter plots. Clinicians visually inspect a sequence of scatter plots based on different pairwise marker combinations and perform gating, the manual selection of marker thresholds, to eliminate certain subpopulations of cells. They identify various pathologies based on the shape of cell subpopulations in these scatter plots. There has been recent work, reviewed below, on automatic cell gating or classification of pathologies based on multidimensional analysis of flow cytometry data.

Despite the promise of multidimensional analysis, this direction is limited by the number of markers that can be simultaneously measured, which is typically a fraction of the number of markers of interest. It is therefore common in practice to perform multiple assays based on different, overlapping combinations of markers. We may view these combinations as different marginals of the joint distribution of all observed markers. However, even when analysis is based on visual inspection of scatter plots, problems arise when the desired marker pair was not jointly measured. This situation arises frequently in the analysis of historical data.

G. Lee and C. Scott are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. E-mail: {gyemin, cscott}@eecs.umich.edu. W. Finn is with the Department of Pathology, University of Michigan, Ann Arbor, MI, USA. E-mail: wgfinn@umich.edu.

October 29, 2018 DRAFT
To address these issues and to facilitate higher dimensional analysis, we present a statistical method for file matching, which imputes higher dimensional flow cytometry data from multiple lower dimensional data files. While [3] proposed a simple approach based on Nearest Neighbor (NN) imputation, this method is prone to induce spurious clusters, as we demonstrate below. Our method can improve the file matching of flow cytometry data and is less likely to generate false clusters.

In the following, we explain the principles of flow cytometry and introduce the file matching problem in the context of flow cytometry data. We then present an approach to file matching which imputes a cell's missing marker values with the values of the nearest neighbor among cells of the same type. To implement this approach, we develop a method for clustering with missing data. We model flow cytometry data with a latent variable Gaussian mixture model, where each Gaussian corresponds to a cell type, and develop an expectation-maximization (EM) algorithm to fit the model. Since a large majority of all values are unobserved, most covariances cannot be estimated from the data. However, domain experts possess considerable knowledge about the characteristics of different cell types, and we incorporate this knowledge into the initialization of the EM algorithm. We compare our method with simple nearest neighbor imputation on real flow cytometry data, and show that our method offers improved performance.

Fig. 1. A flow cytometer system. As a stream of cells passes through a laser beam, the photo-detectors detect forward angle light scatter, side angle light scatter, and light emissions from fluorochromes. The digitized signals are then analyzed in a computer.

II. BACKGROUND AND MOTIVATION

In this section, we explain the principles of flow cytometry.
We also define the statistical file matching problem in the context of flow cytometry data, and motivate the need for an improved solution.

A. Flow Cytometry

In flow cytometry analysis, a cell suspension is first prepared from peripheral blood, bone marrow, or lymph node. The suspension of cells is then mixed with a solution of fluorochrome-labeled antibodies. Typically, each antibody is labeled with a different fluorochrome. As the stream of suspended cells passes through a focused laser beam, the cells either scatter or absorb the light. If the labeled antibodies are attached to proteins of a cell, the associated fluorescent markers absorb the laser light and emit light of the corresponding wavelength (color). A set of photo-detectors, placed in line with the light beam and perpendicular to it, then captures the scattered and emitted light. The signals from the detectors are digitized and stored in a computer system. Forward scatter (FS) and side scatter (SS) signals, as well as various fluorescence signals, are collected for each cell (see Fig. 1).

In a flow cytometer capable of measuring $d$ attributes, called markers, the measurements of each cell can be represented by a $d$-dimensional vector $x = (x^{(1)}, x^{(2)}, \dots, x^{(d)})$, where $x^{(1)}$ is FS, $x^{(2)}$ is SS, and $x^{(3)}, \dots, x^{(d)}$ are the fluorescent markers. Thus, the accumulation of $N$ cells forms an $N \times d$ matrix.

The detected signals provide information about the physical and chemical properties of each cell analyzed. FS is related to the relative size of the cell, and SS is related to its internal granularity or complexity. The fluorescence signals reflect the abundance of expressed antigens on the cell surface. These various attributes are used for identification and quantification of cell populations. FS and SS are always measured, while the marker combination is part of the experimental design.
Flow cytometry data is usually analyzed using a sequence of one-dimensional histograms and two- or three-dimensional scatter plots, obtained by choosing a subset of one, two, or three markers. The analysis typically involves manually selecting and excluding cell subpopulations, called gating, by thresholding and drawing boundaries on the scatter plots. Clinicians routinely diagnose by visualizing the scatter plots.

Recently, some attempts have been made to analyze flow cytometry data directly in high-dimensional spaces by mathematical modeling. In [4], [5], a mixture of Gaussian distributions is used to model cell populations, while a mixture of t-distributions with a Box-Cox transformation is used in [6]. A mixture of skew t-distributions is studied in [7]. The knowledge of experts is sometimes incorporated as prior information [8]. Instead of using finite mixture models, some recent approaches propose information-preserving dimension reduction to analyze high-dimensional flow cytometry data [9], [10]. However, standard techniques for multidimensional flow cytometry analysis are not yet established.

B. Statistical File Matching

The number of dimensions in flow cytometry is limited by the number of light sources, detectable fluorochrome markers, and available reagent combinations. Even though recent innovations have enabled measuring nearly 20 cellular attributes, there are typically dozens or hundreds of markers of interest in a given flow cytometry experiment. Furthermore, instruments deployed in clinical laboratories still only measure 5-7 markers simultaneously [11]. Being unable to simultaneously measure all markers of interest, it is common to divide a sample into several "tubes" and stain each tube separately with a different set of markers [12]. In practice, partially overlapping marker combinations are used to help identify cell populations (see Fig. 2).
The marker combinations are designed based on which markers need to be observed together. However, it is not always possible to anticipate all marker combinations of potential interest.

In the sequel, we present a method that generates flow cytometry data in which all the markers of interest are available for the union of cells. Thus, we obtain a single higher dimensional dataset beyond the current limits of instrumentation. Pairs of markers that were not measured together can then still be visualized through scatter plots, and methods of multidimensional analysis may be applied to the full dataset. This technique, called file matching, merges two or more datasets that have some commonly observed variables as well as some variables unique to each dataset.

Fig. 2. Flow cytometry analysis of a large number of antibody reagents within the limited capacity of a flow cytometer. A sample from a patient is separated into multiple tubes, each stained with a different combination of fluorochrome-labeled antibodies. Each output file contains at least two variables, FS and SS, in common, as well as some variables that are specific to the file.

Fig. 3. Data structure of two incomplete data files. The two files have some overlapping variables $c$, and some variables $s_1$ and $s_2$ that are never jointly observed. File matching combines the two files by completing the missing blocks of variables.

An exemplary two-file case is drawn in Fig. 3. Each unit (cell) $x_n$ is a vector in $\mathbb{R}^d$ and belongs to one of the data files (tubes) $\mathcal{X}_1$ or $\mathcal{X}_2$, where the files have $N_1$ and $N_2$ units, respectively. While the variables $c$ are observed in all units, units in $\mathcal{X}_1$ have variables $s_2$ missing and units in $\mathcal{X}_2$ have variables $s_1$ missing, where $s_1$, $s_2$, and $c$ denote the specific and common variable sets.
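As a concrete illustration of the data layout in Fig. 3, the following sketch builds two such incomplete files from a complete array, using the marker layout of the experiments in Section V (the synthetic Gaussian data and file sizes are illustrative only; NaN marks the missing blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Marker layout from the paper's experiments: FS, SS, CD56 are common (c);
# CD16, CD3 appear only in file 1 (s1); CD8, CD4 only in file 2 (s2).
markers = ["FS", "SS", "CD56", "CD16", "CD3", "CD8", "CD4"]
c, s1, s2 = [0, 1, 2], [3, 4], [5, 6]

# Toy complete data standing in for the (unavailable) full measurements.
X = rng.normal(size=(8, len(markers)))

X1 = X[:4].copy()
X1[:, s2] = np.nan   # file 1: the s2 block is never measured
X2 = X[4:].copy()
X2[:, s1] = np.nan   # file 2: the s1 block is never measured
```

File matching then amounts to filling in the NaN blocks of `X1` and `X2` so that variables $s_1$ and $s_2$, never jointly observed, can be analyzed together.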
If the observed and missing components of a unit $x_n$ are denoted by $o_n$ and $m_n$, then $o_n = c \cup s_1$ and $m_n = s_2$ for $x_n \in \mathcal{X}_1$, and $o_n = c \cup s_2$ and $m_n = s_1$ for $x_n \in \mathcal{X}_2$.

The file matching problem is a missing data problem in which blocks of missing data need to be imputed. Among imputation methods, algorithms using the conditional mean or regression are most common. As shown in Fig. 4, however, these imputation algorithms tend to shrink the variance of the data. Thus, these approaches are inappropriate in flow cytometry, where the shapes of cell populations are important in analysis and the preservation of variability after file matching is highly desired. Further discussion of missing data analysis and file matching can be found in [13] and [14].

Fig. 4. Examples of imputation methods: NN, conditional mean, and regression. The NN method preserves the distribution of the imputed data relatively well, while other imputation methods such as conditional mean and regression significantly reduce the variability of the data.

[3] proposed to use Nearest Neighbor (NN) imputation to match flow cytometry data files. In their approach, the missing variables of a unit in one file, called the recipient, are imputed with the observed variables of the most similar unit in the other file, called the donor. If $x_i$ is a unit in $\mathcal{X}_1$, the missing variables are set as follows:
$$x_i^{s_2} = x_{j^*}^{s_2} \quad \text{where} \quad x_{j^*} = \arg\min_{x_j \in \mathcal{X}_2} \| x_i^c - x_j^c \|_2 .$$
Note that the similarity is based on the distance in the projected space of jointly observed variables. This algorithm is advantageous over other imputation algorithms based on conditional mean or regression, as displayed in Fig. 4. It generally preserves the distribution of cells, while the other methods cause the variance structure to shrink toward zero.

Fig. 5. Comparison of results for two imputation methods to the ground-truth cell distribution.
The figure shows scatter plots for pairs of markers that are not jointly observed. The middle and bottom rows show the imputation results from NN and Cluster-based NN, respectively. The results from the NN method show conspicuous spurious clusters in the two right panels; the false clusters are indicated by dotted circles in the CD3 vs. CD8 and CD3 vs. CD4 scatter plots. The results from our proposed approach, in contrast, better resemble the true distribution in the top row.

However, the NN method sometimes introduces spurious clusters into the imputed results and fails to replicate the true distribution of cell populations. Fig. 5 shows an example of false clusters produced by the NN imputation algorithm (for the detailed experimental setup, see Section V). We next present a toy example to explain why NN imputation can fail, and to motivate our approach.

C. Motivating Toy Example

Fig. 6 shows a toy example dataset in $\mathbb{R}^3$. In the two data files, two of the three features of these points are observed: $c$ and $s_1$ in file 1, and $c$ and $s_2$ in file 2. Each data point belongs to one of two clusters, but its label is unavailable. When imputing feature $s_1$ of units in file 2, the NN algorithm produces four clusters where there should be two, as shown in Fig. 6 (d). This is because the NN method uses only one feature, and fails to leverage information about the joint distribution of variables that are not observed together. On the other hand, if we can infer the cluster membership of the data points, NN imputation can be applied within the same cluster. Hence, in the example, we search for a donor from subgroup (1) for the data points in (3), and likewise we search for a donor from (2) for the points in (4). The file matching result then improves greatly and better replicates the true distribution, as in Fig. 6 (e).

Fig. 6. Toy example of file matching. Two files (b) and (c) provide partial information about data points (a) in $\mathbb{R}^3$.
The variable $c$ is observed in both files, while $s_1$ and $s_2$ are specific to each file. The NN method creates false populations in the $s_1$ vs. $s_2$ scatter plot in (d). On the other hand, NN applied within the same cluster successfully replicates the true distribution (e). If the clusters are incorrectly paired, however, the Cluster-NN approach fails, as in (f).

In this example, as in real flow cytometry data, there is no way to infer cluster membership from the data alone, and incorrect labeling can lead to poor results (Fig. 6 (f)). Fortunately, in flow cytometry we can incorporate prior knowledge to achieve an accurate clustering.

Input: two files $\mathcal{X}_1$ and $\mathcal{X}_2$ to be matched
1. Cluster the units in $\mathcal{X}_1$ and $\mathcal{X}_2$.
2. Perform NN imputation within the same cluster.
Output: statistically matched complete files $\hat{\mathcal{X}}_1$ and $\hat{\mathcal{X}}_2$

Fig. 7. Description of the Cluster-based NN algorithm for two files. For two input flow cytometry data files $\mathcal{X}_1$ and $\mathcal{X}_2$, specific variables are imputed using the NN method after clustering each cell into one of $K$ clusters.

III. CLUSTER-BASED IMPUTATION OF MISSING VARIABLES

We first focus on the case of matching two files; the case of more than two files is discussed in Section VI. For the present section, we assume that $\mathcal{X}_1$ and $\mathcal{X}_2$ have both been partitioned into $K$ clusters. Let $\mathcal{X}_1^k$ and $\mathcal{X}_2^k$ denote the cells in $\mathcal{X}_1$ and $\mathcal{X}_2$ from the $k$th cluster, respectively.

Suppose that the data is configured as in Fig. 3. In order to impute the missing variables of a unit in file 1, we locate a donor among the data points in file 2 that have the same cluster label as the recipient. When imputing incomplete units in file 2, the roles are reversed. The similarity between two units is evaluated in the projected space of jointly observed variables, while constraining both units to belong to the same cluster.
Then we impute the missing variables of the recipient by patching in the corresponding variables from the donor. More specifically, for $x_i \in \mathcal{X}_1^k$, we impute the missing variables by
$$x_i^{m_i} = x_{j^*}^{m_i} \quad \text{where} \quad x_{j^*} = \arg\min_{x_j \in \mathcal{X}_2^k} \| x_i^c - x_j^c \|_2 .$$
The proposed Cluster-based NN imputation algorithm is summarized in Fig. 7.

In social applications such as survey completion, file matching is often performed within classes defined by variables such as gender, age, or county of residence. In that setting, however, the information used to label each unit is available in the data, and inference as in our algorithm is not necessary [14].

IV. CLUSTERING WITH MISSING DATA

To implement the above approach, we view $\mathcal{X}_1$ and $\mathcal{X}_2$ as a single data set and cluster its elements. We propose a method for clustering with missing data based on a finite Gaussian mixture model. Mixture models are common models for flow cytometry, where each component corresponds to a cell type. While non-Gaussian models might provide a better fit, there is a trade-off between estimation error and approximation error: more complicated models tend to be more challenging to fit. Furthermore, even with an imperfect data model, we may still achieve improved file matching.

Thus, clustering amounts to fitting the parameters of the mixture model. In general, fitting such a model is ill-posed. For example, in the toy example, there is no way to infer the correct clustering based solely on the data. However, we can leverage domain knowledge to select the number of components and initialize the model parameters.

A. Mixture of PPCA

In a mixture model framework, the probability distribution of a $d$-dimensional data vector $x$ takes the form
$$p(x) = \sum_{k=1}^{K} \pi_k p_k(x)$$
where $K$ is the number of components in the mixture and $\pi_k$ is a mixing weight. In flow cytometry, mixture models are common models of cell subpopulations.
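The Cluster-based NN rule of Section III (Fig. 7, step 2) can be sketched as follows, assuming cluster labels have already been assigned; the helper name, array layout, and the assumption that every cluster has at least one donor are illustrative, not the authors' implementation:

```python
import numpy as np

def cluster_nn_impute(X1, X2, labels1, labels2, c, s2):
    """Impute the s2 block of file 1 from donors in file 2.

    For each recipient in X1, the donor is the nearest unit in X2
    *with the same cluster label*, where distance is measured only
    on the common variables c. Assumes each cluster has a donor.
    """
    X1 = X1.copy()
    for i in range(len(X1)):
        donors = np.flatnonzero(labels2 == labels1[i])
        d = np.linalg.norm(X2[donors][:, c] - X1[i, c], axis=1)
        j = donors[np.argmin(d)]
        X1[i, s2] = X2[j, s2]   # patch in the donor's specific block
    return X1
```

Restricting the donor pool to the recipient's cluster is exactly what prevents the four-cluster artifact of Fig. 6 (d): a recipient in subgroup (3) can only borrow from subgroup (1).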
Mixture models with Gaussian components are common [4], [5], [8], although distributions with more parameters, such as t-distributions or skew t-distributions, have also been proposed [6], [7]. However, these models require estimating a large number of parameters, and it becomes difficult to obtain reliable estimates as the number of components or the dimension of the data increases. In this application, the model need not be perfect to obtain improved imputation.

We adopt a probabilistic principal component analysis (PPCA) mixture model as a way to model cell populations with fewer parameters. Without PPCA, our experience has revealed that even a Gaussian mixture model may have too many parameters to be accurately fit.

PPCA was proposed by [15] as a probabilistic interpretation of PCA. While conventional PCA lacks a probabilistic formulation, PPCA specifies a generative model. It is a latent variable model, in which a data vector is linearly related to a latent variable. The latent variable space is generally of lower dimension than the ambient variable space, so the latent variable provides an economical representation of the data.

The PPCA model is built by specifying a conditional distribution of a data vector $x \in \mathbb{R}^d$, given a latent variable $t \in \mathbb{R}^q$:
$$p(x \mid t) = \mathcal{N}(Wt + \mu, \, \sigma^2 I)$$
where $\mu$ is a $d$-dimensional vector and $W$ is a $d \times q$ linear transform matrix. The latent variable is also assumed to be Gaussian, with $p(t) = \mathcal{N}(0, I)$. Then the marginal distribution of $x$ is also Gaussian:
$$p(x) = \mathcal{N}(\mu, C)$$
with covariance matrix $C = WW^T + \sigma^2 I$. The posterior distribution can be shown to be Gaussian as well:
$$p(t \mid x) = \mathcal{N}(M^{-1} W^T (x - \mu), \, \sigma^2 M^{-1})$$
where $M = W^T W + \sigma^2 I$ is a $q \times q$ matrix.

The PPCA mixture model is a combination of multiple PPCA components, each of which explains local data structure or a cell subpopulation.
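As a quick sanity check on these formulas, the following sketch draws samples from the PPCA generative model and verifies that the empirical covariance matches $C = WW^T + \sigma^2 I$; it also computes the posterior means $E[t \mid x] = M^{-1}W^T(x - \mu)$. The dimensions, seed, and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, sigma2 = 5, 2, 0.1

W = rng.normal(size=(d, q))
mu = rng.normal(size=d)

# Generative model: t ~ N(0, I), then x | t ~ N(W t + mu, sigma2 * I).
n = 200_000
t = rng.normal(size=(n, q))
x = t @ W.T + mu + np.sqrt(sigma2) * rng.normal(size=(n, d))

# Predicted marginal covariance C = W W^T + sigma2 I vs. the empirical one.
C = W @ W.T + sigma2 * np.eye(d)
C_hat = np.cov(x, rowvar=False)

# Posterior means E[t | x] = M^{-1} W^T (x - mu), with M = W^T W + sigma2 I.
M = W.T @ W + sigma2 * np.eye(q)
t_mean = (x - mu) @ W @ np.linalg.inv(M)
```

With a large sample, `C_hat` closely matches `C`, and the posterior means track the latent variables that generated each point.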
The model is defined by the collection of component parameters $\theta_k = \{\pi_k, \mu_k, W_k, \sigma_k^2\}$. From a flow cytometry dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, an EM algorithm can learn the mixture model by iteratively computing these parameters. More details on the PPCA mixture and its EM algorithm can be found in [16].

The mixture of PPCA offers a way of controlling the number of parameters to be estimated without completely sacrificing the flexibility of the model. In the mixture model framework, a more common choice is the standard Gaussian mixture model. In the Gaussian mixture model, however, each Gaussian component requires $d(d+1)/2$ covariance parameters to be estimated if a full covariance matrix is used. Thus, as the data dimension increases, more data points are needed for reliable estimation of those parameters. The number of parameters can be reduced by constraining the covariance matrix to be isotropic or diagonal. These constraints are too restrictive, however, since an isotropic or diagonal covariance makes the Gaussian component spherical or, respectively, elliptical and aligned along the data axes; hence, the correlation structure between variables cannot be captured. The PPCA mixture model lies between these two extremes, and allows the number of parameters to be controlled by specifying $q$, the dimension of the latent variable.

B. Mixture of PPCA with Missing Data

Even though our file matching problem has a particular pattern of missing variables, we develop a more general algorithm that allows for an arbitrary pattern of missing variables. Our development assumes values are "missing at random," meaning that whether a variable is missing or not is independent of its value [13]. Our algorithm may be viewed as an extension of the algorithm of [17] to PPCA, or of the algorithm of [16] to data with missing values.
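The parameter savings can be made concrete with a naive count per component's covariance model: $d(d+1)/2$ for a full covariance versus the $dq$ entries of $W$ plus $\sigma^2$ for PPCA (this count ignores the rotational redundancy of $W$, a deliberate simplification for illustration):

```python
# Naive covariance parameter counts per mixture component,
# ignoring W's rotational redundancy (illustrative simplification).
def cov_params_full(d):
    """Full covariance matrix: d(d+1)/2 free parameters."""
    return d * (d + 1) // 2

def cov_params_ppca(d, q):
    """PPCA covariance C = W W^T + sigma2 I: d*q entries of W plus sigma2."""
    return d * q + 1

d = 7                         # seven markers, as in Section V
print(cov_params_full(d))     # 28
print(cov_params_ppca(d, 2))  # 15, with latent dimension q = 2
```

For the seven-marker data of Section V with $q = 2$, the PPCA component nearly halves the covariance parameters per component, and the gap widens as $d$ grows.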
Denoting the observed and missing components of a data point by $o_n$ and $m_n$, each data point can be divided as $x_n = (x_n^{o_n}, x_n^{m_n})$. In a missing data problem, a set of partial observations $\{x_1^{o_1}, \dots, x_N^{o_N}\}$ is given. As in EM algorithms for Gaussian mixture models, we introduce indicator variables $z_n$. One and only one entry of $z_n$ is nonzero, and $z_{nk} = 1$ indicates that the $k$th component is responsible for generating $x_n$. We also include the missing components $x_n^{m_n}$ and the set of latent variables $t_{nk}$ for each component, to form the complete data $(x_n^{o_n}, x_n^{m_n}, t_{nk}, z_n)$ for $n = 1, \dots, N$ and $k = 1, \dots, K$.

We derive an iterative EM algorithm for the PPCA mixture model with missing data. The key difference from the EM algorithm for completely observed data is that, in the expectation steps, the conditional expectation is taken with respect to $x^o$ as opposed to $x$. To develop the EM algorithm, we employ and extend the two-stage procedure described in [16]. In the first stage of the algorithm, the component weights $\pi_k$ and the component centers $\mu_k$ are updated:
$$\hat{\pi}_k = \frac{1}{N} \sum_n \langle z_{nk} \rangle, \qquad (1)$$
$$\hat{\mu}_k = \frac{\sum_n \langle z_{nk} \rangle \begin{bmatrix} x_n^{o_n} \\ \langle x_n^{m_n} \rangle \end{bmatrix}}{\sum_n \langle z_{nk} \rangle} \qquad (2)$$
where $\langle z_{nk} \rangle = P(z_{nk} = 1 \mid x_n^{o_n})$ is the responsibility of mixture component $k$ for generating the unit $x_n$, and $\langle x_n^{m_n} \rangle = E[x_n^{m_n} \mid z_{nk} = 1, x_n^{o_n}]$ is the conditional expectation of the missing components. Note that we do not assume that the vectors in the bracket are literally stacked in this order; the notation can be replaced by the true component ordering without difficulty.

In the second stage, we update $W_k$ and $\sigma_k^2$:
$$\hat{W}_k = S_k W_k \left( \sigma_k^2 I + M_k^{-1} W_k^T S_k W_k \right)^{-1}, \qquad (3)$$
$$\hat{\sigma}_k^2 = \frac{1}{d} \operatorname{tr}\left( S_k - S_k W_k M_k^{-1} \hat{W}_k^T \right) \qquad (4)$$
from the local covariance matrix $S_k$:
$$S_k = \frac{1}{N \hat{\pi}_k} \sum_n \langle z_{nk} \rangle \left\langle (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T \right\rangle .$$
These update rules reduce to the update rules for completely observed data when there are no missing variables.
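To make the two-stage updates concrete, the following sketch implements one M-step for a single component in the complete-data special case noted above, where the missing-data expectations in (2) and $S_k$ reduce to observed values; the function name and this simplification are illustrative, not the authors' full missing-data algorithm:

```python
import numpy as np

def ppca_m_step(X, resp, W, sigma2):
    """One M-step for a single PPCA component, complete-data case.

    X: (N, d) data; resp: (N,) responsibilities <z_nk> from the E-step;
    W: (d, q) current loading matrix; sigma2: current noise variance.
    Follows updates (1)-(4), with missing-data expectations replaced
    by observed values (the no-missing-variables special case).
    """
    N, d = X.shape
    q = W.shape[1]

    pi_k = resp.sum() / N                         # eq. (1)
    mu_k = resp @ X / resp.sum()                  # eq. (2), observed case

    Xc = X - mu_k
    S = (resp[:, None] * Xc).T @ Xc / (N * pi_k)  # local covariance S_k

    M = W.T @ W + sigma2 * np.eye(q)
    SW = S @ W
    W_new = SW @ np.linalg.inv(
        sigma2 * np.eye(q) + np.linalg.inv(M) @ W.T @ SW)        # eq. (3)
    sigma2_new = np.trace(S - SW @ np.linalg.inv(M) @ W_new.T) / d  # eq. (4)
    return pi_k, mu_k, W_new, sigma2_new
```

Iterating this step, interleaved with an E-step that recomputes the responsibilities, fits the component to the local covariance structure, per [16].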
We derive the EM algorithm in detail in Appendix A.

After the model parameters are estimated, the observations are divided into groups according to their posterior distribution:
$$\arg\max_{k = 1, \dots, K} \; p(z_{nk} = 1 \mid x_n^{o_n}),$$
so each unit (cell) is classified into one of $K$ cell populations. Note that this posterior probability is computed in the E-step.

Cell Type                      CD markers
granulocyte                    CD45+, CD15+
monocyte                       CD45+, CD14+
lymphocyte: helper T cell      CD45+, CD3+
lymphocyte: cytotoxic T cell   CD45+, CD3+, CD8+
lymphocyte: B cell             CD45+, CD19+ or CD45+, CD20+
lymphocyte: NK cell            CD16+, CD56+, CD3−

Fig. 8. Types of white blood cells. Each cell type is characterized by a set of expressed CD markers. The cluster of differentiation (CD) markers are commonly used to identify cell surface molecules on white blood cells. The '+/−' signs indicate whether a certain cell type has the corresponding antigens on its cell surface.

C. Domain Knowledge and Initialization of the EM Algorithm

In file matching of flow cytometry data, domain knowledge is critical. First, as explained above, the incompletely observed data is insufficient to determine the correct cluster labeling. Second, the initial conditions of the EM algorithm affect its performance and convergence rate. Domain knowledge allows us to choose the number of components, and to initialize the algorithm so that it converges to the best local maximum.

In flow cytometry, from the design of fluorochrome marker combinations and knowledge of the blood sample composition, we can anticipate certain properties of cell subpopulations. For example, Fig. 8 summarizes white blood cell types and their characteristic cluster of differentiation (CD) marker expressions. That there are six cell types suggests choosing $K = 6$ when analyzing white blood cells. The CD markers indicated are commonly used in flow cytometry to identify cell surface molecules on leukocytes [18]. However, this information is qualitative, and needs to be quantified.
To achieve this, we use one-dimensional histograms. In a histogram, two large peaks are generally expected, depending on the expression level of the corresponding CD marker. If a cell subpopulation expresses a CD marker, denoted by '+', then it forms a peak on the right side of the histogram. On the other hand, if a cell population does not express the marker, denoted by '−', then a peak can be found on the left side of the histogram. We use the locations of the peaks to quantify the expression levels. These quantified values can be combined with the CD marker expression levels of each cell type to specify the initial cluster centers. Thus, each component of $\mu_k$ for a certain cell type is initialized by either the positive quantity or the negative quantity from the histogram. In our implementation, these are set manually based on visual inspection of the histograms. We then initialize the mixture model parameters $\{\pi_k, \mu_k, W_k, \sigma_k^2\}$ as described in Fig. 9.

Input: data files $\mathcal{X}_1$, $\mathcal{X}_2$; $K$, the number of components; $q$, the dimension of the latent variable space; $\mu_k$, the initial component means.
for $k = 1$ to $K$ do
1. using the distance $\| x_n^{o_n} - \mu_k^{o_n} \|$, find the set $\mathcal{X}^k$ of data points whose nearest component mean is $\mu_k$
2. initialize a covariance matrix $C_k$ with random entries
3. replace submatrices of $C_k$ with the sample covariances of the data points in $\mathcal{X}^k$
4. make $C_k$ positive definite by enforcing the eigenvalues to be positive
5. set $\pi_k = |\mathcal{X}^k| / (N_1 + N_2)$
6. set $W_k$ with the $q$ principal eigenvectors of $C_k$
7. set $\sigma_k^2$ with the average of the remaining eigenvalues of $C_k$
end for
Output: $\{\pi_k, \mu_k, W_k, \sigma_k^2\}$ for $k = 1, \dots, K$

Fig. 9. Parameter initialization of the EM algorithm for missing data. Cell populations are partitioned into $K$ groups based on the distance to each component center. The component weight $\pi_k$ is initialized according to the size of each partition.
From the covariance matrix estimate $\hat{C}_k$, the parameters $\hat{W}_k$ and $\sigma_k^2$ are initialized by eigen-decomposition.

Fig. 10. Structure of the covariance matrix $C$. The sub-matrices $C_k^{s_1, s_2}$ and $C_k^{s_2, s_1}$ cannot be estimated from a sample covariance matrix because these variables are never jointly observed.

An important issue in file matching arises from the covariance matrix. When data is completely observed, a covariance matrix is commonly initialized with the sample covariance matrix. In the case of file matching, however, the sample covariance cannot be evaluated, since some sets of variables are never jointly observed (see Fig. 10). We chose to build the covariance matrix $C_k$ block-by-block from sample covariances. For example, we can set $C_k^{c, s_1}$ to the sample covariance of variables $c$ and $s_1$, based on the cases for which both $c$ and $s_1$ are present. On the other hand, the submatrix $C_k^{s_1, s_2}$ cannot be built from the observations. In our implementation, we set the submatrix $C_k^{s_1, s_2}$ with arbitrary values. However, the resulting matrix may not be positive definite. Thus, $C_k$ is made positive definite by replacing negative eigenvalues with a small positive value.

ID         N1      N2      Ne
Patient1   10000   10000   5223
Patient2   7000    7000    4408
Patient3   3000    3000    3190

Fig. 11. Three flow cytometry datasets from three different patients. Each dataset is divided into two data files and an evaluation set. $N_1$ and $N_2$ denote the sizes of the two data files, and $N_e$ is the size of the evaluation set.

Fig. 12. File structure used in the experiment. FS, SS, and CD56 are common to both files, and each pair of CD markers is observed in only one of the files. The blank blocks correspond to the unobserved variables. The blocks in file 1 are matrices with $N_1$ rows, and the blocks in file 2 are matrices with $N_2$ rows.
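The positive-definite repair and eigen-decomposition steps of the initialization (Fig. 9, steps 4, 6, 7) can be sketched as follows. The eigenvalue floor is an illustrative choice, and the square-root scaling of the principal eigenvectors is the standard PPCA convention of [15] (the figure itself only says to take the principal eigenvectors):

```python
import numpy as np

def make_positive_definite(C, floor=1e-3):
    """Repair a symmetric matrix as in step 4 of Fig. 9: eigenvalues
    below a small positive floor are raised to it (slightly stronger
    than replacing only the negative ones)."""
    C = (C + C.T) / 2                     # symmetrize first
    vals, vecs = np.linalg.eigh(C)
    vals = np.maximum(vals, floor)
    return vecs @ np.diag(vals) @ vecs.T

def ppca_from_covariance(C, q):
    """Initialize W_k and sigma2_k from C_k (steps 6-7 of Fig. 9):
    sigma2 is the average of the d-q smallest eigenvalues; W spans
    the q principal eigenvectors, with the usual PPCA scaling."""
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]        # eigh is ascending; sort descending
    vals, vecs = vals[order], vecs[:, order]
    sigma2 = vals[q:].mean()
    W = vecs[:, :q] * np.sqrt(np.maximum(vals[:q] - sigma2, 0.0))
    return W, sigma2
```

With this scaling, a covariance of the exact PPCA form $C = WW^T + \sigma^2 I$ is recovered exactly by `ppca_from_covariance`.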
Once a covariance matrix $C_k$ is obtained, we can initialize $W_k$ and $\sigma_k^2$ by taking the eigen-decomposition of $C_k$.

V. EXPERIMENTS AND RESULTS

We apply the proposed file matching technique to real flow cytometry datasets and present experimental results. Three flow cytometry datasets were prepared from lymph node samples of three patients; these datasets were provided by the Department of Pathology at the University of Michigan. The measurements are of different sizes and have seven attributes: FS, SS, CD56, CD16, CD3, CD8, and CD4. Each dataset is randomly permuted ten times and divided into two data files and a separate evaluation set. In Fig. 11, the cell counts of the two files and the held-out set are denoted $N_1$, $N_2$, and $N_e$, respectively.

Two attributes from each file are hidden to construct hypothetical files with missing data. Thus, CD16 and CD3 are available only in file 1, and CD8 and CD4 are available only in file 2, while FS, SS, and CD56 are common. The pattern of the constructed data files is illustrated in Fig. 12, where the blocks of missing variables are left blank.

For each white blood cell type, its expected marker expressions (CD markers), relative size (FS), and relative granularity (SS) are presented in Fig. 13.

Cell type             FS   SS   CD56   CD16   CD3   CD8   CD4
granulocyte           +    +    −      +      −     −     −
monocyte              +    −    −      +      −     −     −
helper T cell         −    −    −      −      +     −     +
cytotoxic T cell      −    −    −      −      +     +     −
B lymphocyte          −    −    −      −      −     −     −
Natural Killer cell   −    −    +      +      −     −     −

Fig. 13. Cell types in the dataset and their corresponding marker expressions. '+' or '−' indicates whether a certain cell type expresses the CD marker or not.

Because the samples are from lymph nodes, the majority of the cell population is lymphocytes, whereas the most common white blood cells in a human body
0 200 400 600 800 1000 0 50 100 150 X: 419.5 Y: 100 FS X: 791.5 Y: 18 0 200 400 600 800 1000 0 50 100 150 200 250 X: 396.5 Y: 248 SS X: 675.5 Y: 27 0 200 400 600 800 1000 0 50 100 150 X: 245.5 Y: 135 CD56 0 200 400 600 800 1000 0 50 100 150 X: 136.5 Y: 147 CD16 X: 348.5 Y: 47 0 200 400 600 800 1000 0 50 100 150 X: 558.5 Y: 126 CD3 X: 215.5 Y: 49 0 200 400 600 800 1000 0 10 20 30 40 50 X: 171.5 Y: 49 CD8 X: 766.5 Y: 30 0 200 400 600 800 1000 0 50 100 150 200 X: 219.5 Y: 48 CD4 X: 654.5 Y: 164 Fig. 14. Histogram of each marker in the dataset. The peaks are hand-picked and are indicated in each panels. are granulocytes. The ‘ + / − ’ signs indicate whether a certain cell type expresses the markers or not. For example, helper T cells express both CD3 and CD4 but not others. This qualitati ve knowledge is quantified with the help of single dimensional histograms as explained in Section IV -C. T wo dominant peaks are picked from each histogram and their corresponding measurement values are set to the positi ve and negati ve expression levels. Fig. 14 and Fig. 15 summarize this histogram analysis. T wo incomplete data files are completed following the procedure as described in Fig. 7. A mixture of PPCA is fitted with six components because six cell types are e xpected from this dataset. The latent v ariable dimension of each PPCA component is fixed to two. October 29, 2018 DRAFT 17 FS SS CD56 CD16 CD3 CD8 CD4 + 800 680 500 350 550 750 650 − 400 400 240 130 200 170 200 Fig. 15. The positiv e and negati ve expression levels are summarized. The synthesized data after file matching is displayed in Fig. 5. The figure shows scatter plots of specific v ariables: CD16, CD3, CD4, and CD8. Note that these marker pairs are not jointly observed from the two incomplete data files. The imputation results from the NN and the Cluster-based NN methods are compared in the figure. For reference, scatter plots from the original complete dataset (ground truth) are also presented. 
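In the paper, the two dominant peaks are picked by hand from each histogram (Fig. 14). An automated stand-in for that step might look like the sketch below; the bin count, the local-maximum rule, and the minimum peak separation are our own illustrative choices, not part of the paper's procedure.

```python
import numpy as np

def expression_levels(x, bins=100, min_sep=10):
    """Return (negative, positive) expression levels as the centers of
    the two tallest, sufficiently separated local-maximum bins of a
    1-D histogram of marker values x."""
    counts, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # interior local maxima of the histogram
    peaks = [i for i in range(1, bins - 1)
             if counts[i] >= counts[i - 1] and counts[i] > counts[i + 1]]
    peaks.sort(key=lambda i: counts[i], reverse=True)
    first = peaks[0]
    # second peak: tallest local maximum at least min_sep bins away
    second = next(i for i in peaks if abs(i - first) >= min_sep)
    lo, hi = sorted((first, second))
    return centers[lo], centers[hi]
```

The lower peak is taken as the negative expression level and the higher peak as the positive level, matching the convention of Fig. 15.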
As can be seen, the results from the Cluster-based NN are far more similar to the true distributions, while the NN method generates spurious clusters in the CD3-CD8 and CD3-CD4 scatter plots. These false clusters are indicated in Fig. 5.

A. Evaluation method

To quantitatively evaluate the previous results, we use the Kullback-Leibler (KL) divergence. The KL divergence between two distributions f(x) and g(x) is defined by

KL(g ‖ f) = E_g[log g − log f].

Let f denote the true distribution responsible for the observations, and g denote its estimate. The KL divergence is not symmetric, so KL(f ‖ g) and KL(g ‖ f) have different meanings. For a given distribution f, a distribution g minimizes KL(f ‖ g) when g takes nonzero values in the region where f takes nonzero values; hence, such a g overestimates the support of f. On the other hand, KL(g ‖ f) is minimized by a g that is close to zero in the region where f is near zero, so a distribution g that minimizes KL(g ‖ f) tends to have smaller support. Therefore, KL(g ‖ f) is better suited to detecting spurious clusters in an estimate. The empirical estimate of the KL divergence is then evaluated by

KL(g ‖ f) ≈ KL(ĝ ‖ f̂) ≈ (1/N_e) Σ_{n=1}^{N_e} [log ĝ(x̂_n) − log f̂(x̂_n)],

where the distributions f and g are replaced by their corresponding density estimates, and the expectation is approximated by a finite sum over the imputed results x̂_n on the held-out evaluation set of size N_e. We used kernel density estimation on the ground truth data and the imputed data for f̂ and ĝ, respectively.
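A numerical sketch of this estimate, using a simple isotropic Gaussian KDE, is given below. The kernel form and the fixed bandwidth are our own assumptions for illustration; any density estimator could be substituted.

```python
import numpy as np

def kde_logpdf(data, points, bw):
    """Log-density of an isotropic Gaussian KDE with bandwidth bw,
    fit to `data` (n, d), evaluated at `points` (m, d)."""
    n, d = data.shape
    diff = points[:, None, :] - data[None, :, :]     # (m, n, d)
    sq = (diff ** 2).sum(-1) / (2.0 * bw ** 2)       # (m, n)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * bw ** 2)
    # log-sum-exp over the n kernels, stabilized by the smallest exponent
    mx = sq.min(1, keepdims=True)
    return np.log(np.exp(-(sq - mx)).sum(1)) - mx[:, 0] - log_norm

def empirical_kl(imputed, truth, eval_points, bw=1.0):
    """KL(g ‖ f) ≈ mean_n [log ĝ(x̂_n) − log f̂(x̂_n)], with ĝ a KDE of
    the imputed data and f̂ a KDE of the ground truth, evaluated on the
    held-out points x̂_n (rows of eval_points)."""
    return float(np.mean(kde_logpdf(imputed, eval_points, bw)
                         - kde_logpdf(truth, eval_points, bw)))
```

An imputation that places mass where the truth has none drives log f̂ strongly negative at those points, which is exactly why KL(g ‖ f) penalizes spurious clusters.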
The KL divergences are computed for ten random permutations, and their averages and standard errors are reported in Fig. 16.

ID        NN (file 1)   Cluster-NN (file 1)   NN (file 2)   Cluster-NN (file 2)
Patient1  2.90 ± 0.05   1.55 ± 0.05           2.66 ± 0.03   1.12 ± 0.04
Patient2  4.54 ± 0.07   1.22 ± 0.03           4.12 ± 0.08   0.92 ± 0.03
Patient3  4.46 ± 0.10   2.40 ± 0.11           4.18 ± 0.11   2.30 ± 0.07

Fig. 16. KL divergences computed for ten permutations of each flow cytometry dataset; averages and standard errors are reported. The file matching results are evaluated for both the NN and Cluster-based NN algorithms. The KL divergences of Cluster-based NN are closer to zero than those of NN, so the results of Cluster-based NN better replicate the true distribution.

As can be seen, the KL divergences from Cluster-based NN are substantially smaller than those from NN. Therefore, Cluster-based NN yields a better replication of the true distribution.

VI. DISCUSSION

In this paper, we demonstrated the use of a cluster-based nearest neighbor imputation method for file matching in flow cytometry data. We applied the proposed algorithm to real flow cytometry data to generate a dataset of higher dimension by merging two data files of lower dimensions. The resulting matched file can be used for visualization and high-dimensional analysis of cellular attributes. While the presented imputation method focused on the case of two files, it can be generalized to more than two files: for each missing component of a recipient cell, we can find a donor cell among the files that contain the component of interest.

We envision two extensions of the clustering-based imputation method. The first is training a PPCA mixture model on all the data files at once; this approach involves the entire set of data points in model fitting. The second considers a pair of files at a time.
In this approach, we first select a donor file in which the missing component of the recipient file is available, and then apply the method of this paper to the pair of files. This approach involves fewer data points in training, but mixture models of smaller dimension need to be fitted multiple times. After a mixture model is trained and each cell is assigned to a cluster, the similarity between cells can be computed: the Euclidean distance in the projected space of the commonly observed variables is used, under the constraint that both units share the same cluster label. The missing components are then imputed from the donor.

Future research directions include finding ways to extract prior information automatically. The construction of covariance matrices from an incomplete dataset in the initialization of the EM algorithm is also an interesting problem; we expect that better covariance structure estimation will help replicate non-symmetric and non-elliptic cell populations in the imputed results. A limitation of this work is that it has only been validated on lymphocyte data, where, for certain marker combinations, cell types tend to form relatively well-defined clusters. For other samples and marker combinations, however, clusters may be more elongated or less well-defined due to cells being at different stages of physiologic development. Future work may also consider more flexible models for clustering such data, and associated inference algorithms.

APPENDIX

Suppose that we are given an incomplete observation set. We can divide each unit x_n as x_n = (x_n^{o_n}; x_n^{m_n}) by separating the observed components and the missing components. Note that we do not assume that the observed variables come first; the notation can be adapted to the actual ordering of components without difficulty.
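Using the observed/missing split just introduced, the cluster-constrained donor search described in Section VI can be sketched as follows. This is a simplified illustration in which cluster labels are given as inputs, whereas the paper obtains them from the fitted PPCA mixture; all names are our own.

```python
import numpy as np

def cluster_nn_impute(recipients, donors, common_idx, missing_idx,
                      rec_labels, don_labels):
    """For each recipient, copy the missing components from the nearest
    donor (Euclidean distance on the commonly observed variables) that
    carries the same cluster label. Returns completed recipients."""
    out = recipients.copy()
    for i, (r, lab) in enumerate(zip(recipients, rec_labels)):
        cand = np.flatnonzero(don_labels == lab)
        if cand.size == 0:                 # fall back to all donors
            cand = np.arange(len(donors))
        dist = np.linalg.norm(
            donors[np.ix_(cand, common_idx)] - r[common_idx], axis=1)
        out[i, missing_idx] = donors[cand[np.argmin(dist)], missing_idx]
    return out
```

The cluster-label constraint is what distinguishes this from plain NN imputation: a recipient may be closer to a donor from another subpopulation on the common variables, but that donor is excluded from the search.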
In the PPCA mixture model, the probability distribution of x is

p(x) = Σ_{k=1}^{K} π_k p(x | k),

where K is the number of components in the mixture and π_k is the mixing weight corresponding to the component density p(x | k). We estimate the set of unknown parameters θ = {π_k, μ_k, W_k, σ²_k} using an EM algorithm from the partial observations {x_1^{o_1}, ..., x_N^{o_N}}.

To develop an EM algorithm, we introduce indicator variables z_n = (z_n1, ..., z_nK) for n = 1, ..., N. One and only one entry of z_n is nonzero, and z_nk = 1 indicates that the kth component is responsible for generating x_n. We also include the set of latent variables t_nk for each component, and the missing variables x_n^{m_n}, to form the complete data (x_n^{o_n}, x_n^{m_n}, t_nk, z_n) for n = 1, ..., N and k = 1, ..., K. Then the corresponding complete data likelihood function has the form

L_C = Σ_n Σ_k z_nk ln[π_k p(x_n, t_nk)]
    = Σ_n Σ_k z_nk { ln π_k − (d/2) ln σ²_k − (1/(2σ²_k)) tr[(x_n − μ_k)(x_n − μ_k)^T]
                     + (1/σ²_k) tr[(x_n − μ_k) t_nk^T W_k^T] − (1/(2σ²_k)) tr[W_k^T W_k t_nk t_nk^T] },

where terms independent of the parameters are not included in the second equality. Instead of developing an EM algorithm directly on this likelihood function L_C, we extend the strategy in [16] and build a two-stage EM algorithm, where each stage is a two-step process.

In the first stage of the two-stage EM algorithm, we update the component weights π_k and the component means μ_k. We form a complete data log-likelihood function with the component indicator variables z_n and the missing variables x_n^{m_n}, while ignoring the latent variables t_nk. Then we have the following likelihood function:

L_1 = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk ln[π_k p(x_n^{o_n}, x_n^{m_n} | k)]
    = Σ_n Σ_k z_nk { ln π_k − (1/2) ln|C_k| − (1/2) tr[C_k^{-1} (x_n − μ_k)(x_n − μ_k)^T] },

where terms unrelated to the model parameters are omitted in the second line.
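For reference, evaluating log p(x) under this mixture, with each component a Gaussian of mean μ_k and covariance C_k = W_k W_k^T + σ²_k I, can be sketched as below. This is a naive direct implementation for illustration only; the function name and interface are our own.

```python
import numpy as np

def ppca_mixture_logpdf(x, pis, mus, Ws, sigma2s):
    """log p(x) for a PPCA mixture: component k is Gaussian with mean
    mus[k] and covariance C_k = Ws[k] @ Ws[k].T + sigma2s[k] * I."""
    d = x.shape[0]
    logs = []
    for pi_k, mu, W, s2 in zip(pis, mus, Ws, sigma2s):
        C = W @ W.T + s2 * np.eye(d)
        diff = x - mu
        _, logdet = np.linalg.slogdet(C)
        quad = diff @ np.linalg.solve(C, diff)
        logs.append(np.log(pi_k)
                    - 0.5 * (d * np.log(2 * np.pi) + logdet + quad))
    # log-sum-exp over components
    m = max(logs)
    return m + np.log(sum(np.exp(l - m) for l in logs))
```

The same quantity, restricted to the observed coordinates, is what the responsibilities ⟨z_nk⟩ in the E-step are built from.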
We take the conditional expectation with respect to p(z_n, x_n^{m_n} | x_n^{o_n}). Since the conditional probability factorizes as

p(z_n, x_n^{m_n} | x_n^{o_n}) = p(z_n | x_n^{o_n}) p(x_n^{m_n} | z_n, x_n^{o_n}),

the conditional expectations follow as

⟨z_nk⟩ = p(k | x_n^{o_n}) = π_k p(x_n^{o_n} | k) / Σ_{k'} π_{k'} p(x_n^{o_n} | k'),
⟨z_nk x_n^{m_n}⟩ = ⟨z_nk⟩ ⟨x_n^{m_n}⟩,
⟨x_n^{m_n}⟩ = μ_k^{m_n} + C_k^{m_n o_n} (C_k^{o_n o_n})^{-1} (x_n^{o_n} − μ_k^{o_n}),
⟨z_nk x_n^{m_n} x_n^{m_n T}⟩ = ⟨z_nk⟩ ⟨x_n^{m_n} x_n^{m_n T}⟩,
⟨x_n^{m_n} x_n^{m_n T}⟩ = C_k^{m_n m_n} − C_k^{m_n o_n} (C_k^{o_n o_n})^{-1} C_k^{o_n m_n} + ⟨x_n^{m_n}⟩ ⟨x_n^{m_n}⟩^T,

where ⟨·⟩ denotes the conditional expectation. Maximizing ⟨L_1⟩ with respect to π_k (using a Lagrange multiplier) and with respect to μ_k gives the parameter updates

π̂_k = (1/N) Σ_n ⟨z_nk⟩,                                    (5)
μ̂_k = Σ_n ⟨z_nk⟩ (x_n^{o_n}; ⟨x_n^{m_n}⟩) / Σ_n ⟨z_nk⟩.    (6)

In the second stage, we include the latent variables t_nk as well to formulate the complete data log-likelihood function. The new values of π̂_k and μ̂_k are used in this step to compute sufficient statistics. Taking the conditional expectation of L_C with respect to p(z_n, t_nk, x_n^{m_n} | x_n^{o_n}), we have

⟨L_C⟩ = Σ_n Σ_k ⟨z_nk⟩ { ln π̂_k − (d/2) ln σ²_k − (1/(2σ²_k)) tr⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩
                          + (1/σ²_k) tr[⟨(x_n − μ̂_k) t_nk^T⟩ W_k^T] − (1/(2σ²_k)) tr[W_k^T W_k ⟨t_nk t_nk^T⟩] }.
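The conditional moments of the missing components above, and the simplified single-component M-step updates (9) and (10) that the derivation arrives at, can be sketched as below. This is a minimal illustration under our own naming; it covers one component k, with the responsibility weighting omitted.

```python
import numpy as np

def conditional_moments(x_obs, o, m, mu, C):
    """For one Gaussian component with mean mu and covariance C:
    <x^m> = mu^m + C^{mo} (C^{oo})^{-1} (x^o - mu^o),
    Q     = C^{mm} - C^{mo} (C^{oo})^{-1} C^{om}."""
    gain = C[np.ix_(m, o)] @ np.linalg.inv(C[np.ix_(o, o)])
    x_m = mu[m] + gain @ (x_obs - mu[o])
    Q = C[np.ix_(m, m)] - gain @ C[np.ix_(o, m)]
    return x_m, Q

def m_step(S, W, sigma2):
    """Updates (9) and (10) for one component:
    W_new  = S W (sigma2 I + M^{-1} W^T S W)^{-1},  M = W^T W + sigma2 I,
    s2_new = (1/d) tr(S - S W M^{-1} W_new^T)."""
    d, q = W.shape
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))
    W_new = S @ W @ np.linalg.inv(sigma2 * np.eye(q) + Minv @ W.T @ S @ W)
    sigma2_new = np.trace(S - S @ W @ Minv @ W_new.T) / d
    return W_new, sigma2_new
```

A useful sanity check on (9) and (10) is that they leave (W_k, σ²_k) fixed when S_k equals the model covariance W_k W_k^T + σ²_k I exactly.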
Since the conditional probability factorizes as

p(z_n, t_nk, x_n^{m_n} | x_n^{o_n}) = p(z_n | x_n^{o_n}) p(x_n^{m_n} | z_n, x_n^{o_n}) p(t_nk | z_n, x_n^{o_n}, x_n^{m_n}),

we can evaluate the conditional expectations as follows:

⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩ = ((x_n^{o_n}; ⟨x_n^{m_n}⟩) − μ̂_k)((x_n^{o_n}; ⟨x_n^{m_n}⟩) − μ̂_k)^T + (0 0; 0 Q_nk),
Q_nk = C_k^{m_n m_n} − C_k^{m_n o_n} (C_k^{o_n o_n})^{-1} C_k^{o_n m_n},
⟨t_nk⟩ = M_k^{-1} W_k^T (x_n − μ̂_k),
⟨(x_n − μ̂_k) t_nk^T⟩ = ⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩ W_k M_k^{-1},
⟨t_nk t_nk^T⟩ = σ²_k M_k^{-1} + M_k^{-1} W_k^T ⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩ W_k M_k^{-1}.

Recall that M_k = W_k^T W_k + σ²_k I is a q × q matrix. Then the maximization of ⟨L_C⟩ with respect to W_k and σ²_k leads to the parameter updates

Ŵ_k = [Σ_n ⟨z_nk⟩ ⟨(x_n − μ̂_k) t_nk^T⟩] [Σ_n ⟨z_nk⟩ ⟨t_nk t_nk^T⟩]^{-1},                                  (7)
σ̂²_k = (1 / (d Σ_n ⟨z_nk⟩)) { Σ_n ⟨z_nk⟩ tr⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩
        − 2 Σ_n ⟨z_nk⟩ tr[⟨(x_n − μ̂_k) t_nk^T⟩ Ŵ_k^T] + Σ_n ⟨z_nk⟩ tr[Ŵ_k^T Ŵ_k ⟨t_nk t_nk^T⟩] }.       (8)

Substituting the conditional expectations simplifies the M-step equations to

Ŵ_k = S_k W_k (σ²_k I + M_k^{-1} W_k^T S_k W_k)^{-1},      (9)
σ̂²_k = (1/d) tr(S_k − S_k W_k M_k^{-1} Ŵ_k^T),             (10)

where

S_k = (1 / (N π̂_k)) Σ_n ⟨z_nk⟩ ⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩.

Each iteration of the EM algorithm updates the set of old parameters {π_k, μ_k, W_k, σ²_k} with the set of new parameters {π̂_k, μ̂_k, Ŵ_k, σ̂²_k} as in (5), (6), (9), and (10). The algorithm terminates when the value of the log-likelihood function no longer changes.

REFERENCES

[1] H. Shapiro, Practical Flow Cytometry, 3rd ed. Wiley-Liss, 1994.
[2] M. Brown and C. Wittwer, "Flow cytometry: Principles and clinical applications in hematology," Clinical Chemistry, vol. 46, pp. 1221-1229, 2000.
[3] C. E. Pedreira, E. S. Costa, S. Barrena, Q. Lecrevisse, J. Almeida, J. J. M.
van Dongen, and A. Orfao, "Generation of flow cytometry data files with a potentially infinite number of dimensions," Cytometry A, vol. 73A, pp. 834-846, 2008.
[4] M. J. Boedigheimer and J. Ferbas, "Mixture modeling approach to flow cytometry data," Cytometry Part A, vol. 73, pp. 421-429, 2008.
[5] C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. Kepler, "Statistical mixture modeling for cell subtype identification in flow cytometry," Cytometry Part A, vol. 73, pp. 693-701, 2008.
[6] K. Lo, R. R. Brinkman, and R. Gottardo, "Automated gating of flow cytometry data via robust model-based clustering," Cytometry Part A, vol. 73, pp. 321-332, 2008.
[7] S. Pyne, X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. McLachlan, P. Tamayo, D. A. Hafler, P. L. D. Jager, and J. P. Mesirov, "Automated high-dimensional flow cytometric data analysis," PNAS, vol. 106, pp. 8519-8524, 2009.
[8] J. Lakoumentas, J. Drakos, M. Karakantza, G. C. Nikiforidis, and G. C. Sakellaropoulos, "Bayesian clustering of flow cytometry data for the diagnosis of B-chronic lymphocytic leukemia," Journal of Biomedical Informatics, vol. 42, pp. 251-261, 2009.
[9] K. Carter, R. Raich, W. Finn, and A. O. Hero, "Information preserving component analysis: data projections for flow cytometry analysis," Journal of Selected Topics in Signal Processing, vol. 3, pp. 148-158, 2009.
[10] ——, "FINE: Fisher information non-parametric embedding," IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009, to appear.
[11] S. Perfetto, P. K. Chattopadhyay, and M. Roederer, "Seventeen-colour flow cytometry: unravelling the immune system," Nature Reviews Immunology, vol. 4, pp. 648-655, 2004.
[12] M. Sánchez, J. Almeida, B. Vidriales, M. López-Berges, M. García-Marcos, M. Moro, A. Corrales, M. Calmuntia, J. S. Miguel, and A.
Orfao, "Incidence of phenotypic aberrations in a series of 467 patients with B chronic lymphoproliferative disorders: basis for the design of specific four-color stainings to be used for minimal residual disease investigation," Leukemia, vol. 16, pp. 1460-1469, 2002.
[13] R. Little and D. Rubin, Statistical Analysis with Missing Data, 2nd ed. Wiley, 2002.
[14] S. Rässler, Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches, ser. Lecture Notes in Statistics. Springer, 2002, no. 168.
[15] M. Tipping and C. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611-622, 1999.
[16] ——, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, pp. 443-482, 1999.
[17] Z. Ghahramani and M. Jordan, "Supervised learning from incomplete data via an EM approach," Advances in Neural Information Processing Systems, vol. 6, 1994.
[18] H. Zola, B. Swart, I. Nicholson, B. Aasted, A. Bensussan, L. Boumsell, C. Buckley, G. Clark, K. Drbal, P. Engel, D. Hart, V. Horejsi, C. Isacke, P. Macardle, F. Malavasi, D. Mason, D. Olive, A. Saalmueller, S. F. Schlossman, R. Schwartz-Albiez, P. Simmons, T. F. Tedder, M. Uguccioni, and H. Warren, "CD molecules 2005: human cell differentiation molecules," Blood, vol. 106, pp. 3123-3126, 2005.