Statistical File Matching of Flow Cytometry Data
Authors: Gyemin Lee (1), William Finn (2), Clayton Scott (1)
Abstract: Flow cytometry is a technology that rapidly measures antigen-based markers associated with cells in a cell population. Although analysis of flow cytometry data has traditionally considered one or two markers at a time, there has been increasing interest in multidimensional analysis. However, flow cytometers are limited in the number of markers they can jointly observe, which is typically a fraction of the number of markers of interest. For this reason, practitioners often perform multiple assays based on different, overlapping combinations of markers. In this paper, we address the challenge of imputing the high-dimensional jointly distributed values of marker attributes based on overlapping marginal observations. We show that simple nearest neighbor based imputation can lead to spurious subpopulations in the imputed data, and introduce an alternative approach based on nearest neighbor imputation restricted to a cell's subpopulation. This requires us to perform clustering with missing data, which we address with a mixture model approach and a novel EM algorithm. Since mixture model fitting may be ill-posed, we also develop techniques to initialize the EM algorithm using domain knowledge. We demonstrate our approach on real flow cytometry data.

Index Terms: statistical file matching, flow cytometry, mixture model, probabilistic PCA, EM algorithm, imputation, incomplete data, clustering

I. INTRODUCTION

Flow cytometry is a technique for quantitative cell analysis [1]. It provides simultaneous measurements of multiple characteristics of individual cells. Typically, a large number of cells are analyzed in a short period of time, up to thousands of cells per second. Since its development in the late 1960s, flow cytometry has become an essential tool in various biological and medical laboratories. Major applications of flow cytometry include hematological immunophenotyping and diagnosis of diseases such as acute leukemias, chronic lymphoproliferative disorders, and malignant lymphomas [2].

Flow cytometry data has traditionally been analyzed by visual inspection of one-dimensional histograms or two-dimensional scatter plots. Clinicians visually inspect a sequence of scatter plots based on different pairwise marker combinations and perform gating, the manual selection of marker thresholds, to eliminate certain subpopulations of cells. They identify various pathologies based on the shape of cell subpopulations in these scatter plots. There has been recent work, reviewed below, on automatic cell gating or classification of pathologies based on multidimensional analysis of flow cytometry data.

Despite the promise of multidimensional analysis, this direction is limited by the number of markers that can be simultaneously measured, which is typically a fraction of the number of markers of interest. It is therefore common in practice to perform multiple assays based on different, overlapping combinations of markers. We may view these combinations as different marginals of the joint distribution of all observed markers. However, even when analysis is based on visual inspection of scatter plots, problems arise when the desired marker pair was not jointly measured. This situation arises frequently in the analysis of historical data.

G. Lee and C. Scott are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. E-mail: {gyemin, cscott}@eecs.umich.edu. W. Finn is with the Department of Pathology, University of Michigan, Ann Arbor, MI, USA. E-mail: wgfinn@umich.edu.

October 29, 2018 DRAFT
To address these issues and to facilitate higher dimensional analysis, we present a statistical method for file matching, which imputes higher dimensional flow cytometry data from multiple lower dimensional data files. While [3] proposed a simple approach based on Nearest Neighbor (NN) imputation, this method is prone to induce spurious clusters, as we demonstrate below. Our method can improve the file matching of flow cytometry data and is less likely to generate false clusters.

In the following, we explain the principles of flow cytometry and introduce the file matching problem in the context of flow cytometry data. We then present an approach to file matching which imputes a cell's missing marker values with the values of the nearest neighbor among cells of the same type. To implement this approach, we develop a method for clustering with missing data. We model flow cytometry data with a latent variable Gaussian mixture model, where each Gaussian corresponds to a cell type, and develop an expectation-maximization (EM) algorithm to fit the model. Since a large majority of all values are unobserved, most covariances cannot be estimated from the data. However, domain experts possess considerable knowledge about the characteristics of different cell types, and we incorporate this knowledge into the initialization of the EM algorithm. We compare our method with simple nearest neighbor imputation on real flow cytometry data, and show that our method offers improved performance.

Fig. 1. A flow cytometer system. As a stream of cells passes through a laser beam, the photo-detectors detect forward angle light scatter, side angle light scatter, and light emissions from fluorochromes. The digitized signals are then analyzed in a computer.

II. BACKGROUND AND MOTIVATION

In this section, we explain the principles of flow cytometry.
We also define the statistical file matching problem in the context of flow cytometry data, and motivate the need for an improved solution.

A. Flow Cytometry

In flow cytometry analysis, a cell suspension is first prepared from peripheral blood, bone marrow, or lymph node. The suspension of cells is then mixed with a solution of fluorochrome-labeled antibodies. Typically, each antibody is labeled with a different fluorochrome. As the stream of suspended cells passes through a focused laser beam, the cells either scatter or absorb the light. If the labeled antibodies are attached to proteins of a cell, the associated fluorescent markers absorb the laser light and emit light of the corresponding wavelength (color). A set of photo-detectors, placed in line with the light beam and perpendicular to it, then captures the scattered and emitted light. The signals from the detectors are digitized and stored in a computer system. Forward scatter (FS) and side scatter (SS) signals, as well as various fluorescence signals, are collected for each cell (see Fig. 1).

In a flow cytometer capable of measuring $d$ attributes, called markers, the measurements of each cell can be represented by a $d$-dimensional vector $x = (x^{(1)}, x^{(2)}, \dots, x^{(d)})$, where $x^{(1)}$ is FS, $x^{(2)}$ is SS, and $x^{(3)}, \dots, x^{(d)}$ are the fluorescent markers. Thus, the accumulation of $N$ cells forms an $N \times d$ matrix.

The detected signals provide information about the physical and chemical properties of each cell analyzed. FS is related to the relative size of the cell, and SS is related to its internal granularity or complexity. The fluorescence signals reflect the abundance of expressed antigens on the cell surface. These various attributes are used for identification and quantification of cell populations. FS and SS are always measured, while the marker combination is part of the experimental design.
Flow cytometry data is usually analyzed using a sequence of one-dimensional histograms and two- or three-dimensional scatter plots, obtained by choosing a subset of one, two, or three markers. The analysis typically involves manually selecting and excluding cell subpopulations, called gating, by thresholding and drawing boundaries on the scatter plots. Clinicians routinely diagnose by visualizing the scatter plots.

Recently, some attempts have been made to analyze flow cytometry data directly in high-dimensional spaces by mathematical modeling. In [4], [5], a mixture of Gaussian distributions is used to model cell populations, while a mixture of t-distributions with a Box-Cox transformation is used in [6]. A mixture of skew t-distributions is studied in [7]. The knowledge of experts is sometimes incorporated as prior information [8]. Instead of using finite mixture models, some recent approaches propose information-preserving dimension reduction to analyze high-dimensional flow cytometry data [9], [10]. However, standard techniques for multidimensional flow cytometry analysis are not yet established.

B. Statistical File Matching

The number of dimensions in flow cytometry is limited by the number of light sources, detectable fluorochrome markers, and available reagent combinations. Even though recent innovations have enabled measuring nearly 20 cellular attributes, there are typically dozens or hundreds of markers of interest in a given flow cytometry experiment. Furthermore, instruments deployed in clinical laboratories still only measure 5-7 markers simultaneously [11]. Being unable to simultaneously measure all markers of interest, it is common to divide a sample into several "tubes" and stain each tube separately with a different set of markers [12]. In practice, partially overlapping marker combinations are used to help identify cell populations (see Fig. 2).
The marker combinations are designed based on which markers need to be observed together. However, it is not always possible to anticipate all marker combinations of potential interest.

In the sequel, we present a method that generates flow cytometry data in which all the markers of interest are available for the union of cells. Thus, we obtain a single higher dimensional dataset beyond the current limits of instrumentation. Pairs of markers that were not measured together can then still be visualized through scatter plots, and methods of multidimensional analysis may be applied to the full dataset. This technique, called file matching, merges two or more datasets that have some commonly observed variables as well as some variables unique to each dataset.

Fig. 2. Flow cytometry analysis of a large number of antibody reagents within the limited capacity of a flow cytometer. A sample from a patient is separated into multiple tubes, each stained with a different combination of fluorochrome-labeled antibodies. Each output file contains at least two variables, FS and SS, in common, as well as some variables that are specific to the file.

Fig. 3. Data structure of two incomplete data files. The two files have some overlapping variables $c$, and some variables $s_1$ and $s_2$ that are never jointly observed. File matching combines the two files by completing the missing blocks of variables.

An exemplary two-file case is drawn in Fig. 3. Each unit (cell) $x_n$ is a vector in $\mathbb{R}^d$ and belongs to one of the data files (tubes) $\mathcal{X}_1$ or $\mathcal{X}_2$, where the files have $N_1$ and $N_2$ units, respectively. While the variables $c$ are observed in all units, units in $\mathcal{X}_1$ have variables $s_2$ missing and units in $\mathcal{X}_2$ have variables $s_1$ missing, where $s_1$, $s_2$, and $c$ denote the specific and common variable sets.
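As a concrete illustration of the data layout in Fig. 3, the following sketch builds two such incomplete files from a complete array, using the marker layout of the experiments in Section V (the synthetic Gaussian data and file sizes are illustrative only; NaN marks the missing blocks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Marker layout from the paper's experiments: FS, SS, CD56 are common (c);
# CD16, CD3 appear only in file 1 (s1); CD8, CD4 only in file 2 (s2).
markers = ["FS", "SS", "CD56", "CD16", "CD3", "CD8", "CD4"]
c, s1, s2 = [0, 1, 2], [3, 4], [5, 6]

# Toy complete data standing in for the (unavailable) full measurements.
X = rng.normal(size=(8, len(markers)))

X1 = X[:4].copy()
X1[:, s2] = np.nan   # file 1: the s2 block is never measured
X2 = X[4:].copy()
X2[:, s1] = np.nan   # file 2: the s1 block is never measured
```

File matching then amounts to filling in the NaN blocks of `X1` and `X2` so that variables $s_1$ and $s_2$, never jointly observed, can be analyzed together.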
If the observed and missing components of a unit $x_n$ are denoted by $o_n$ and $m_n$, then $o_n = c \cup s_1$ and $m_n = s_2$ for $x_n \in \mathcal{X}_1$, and $o_n = c \cup s_2$ and $m_n = s_1$ for $x_n \in \mathcal{X}_2$.

The file matching problem is a missing data problem in which blocks of missing data need to be imputed. Among imputation methods, algorithms using the conditional mean or regression are most common. As shown in Fig. 4, however, these imputation algorithms tend to shrink the variance of the data. Thus, these approaches are inappropriate in flow cytometry, where the shapes of cell populations are important in analysis and the preservation of variability after file matching is highly desired. Further discussion of missing data analysis and file matching can be found in [13] and [14].

Fig. 4. Examples of imputation methods: NN, conditional mean, and regression. The NN method preserves the distribution of the imputed data relatively well, while other imputation methods such as conditional mean and regression significantly reduce the variability of the data.

[3] proposed to use Nearest Neighbor (NN) imputation to match flow cytometry data files. In their approach, the missing variables of a unit in one file, called the recipient, are imputed with the observed variables of the most similar unit in the other file, called the donor. If $x_i$ is a unit in $\mathcal{X}_1$, the missing variables are set as follows:
$$x_i^{s_2} = x_{j^*}^{s_2} \quad \text{where} \quad x_{j^*} = \arg\min_{x_j \in \mathcal{X}_2} \| x_i^c - x_j^c \|_2 .$$
Note that the similarity is based on the distance in the projected space of jointly observed variables. This algorithm is advantageous over other imputation algorithms based on conditional mean or regression, as displayed in Fig. 4. It generally preserves the distribution of cells, while the other methods cause the variance structure to shrink toward zero.

Fig. 5. Comparison of results for two imputation methods to the ground-truth cell distribution.
The figure shows scatter plots for pairs of markers that are not jointly observed. The middle and bottom rows show the imputation results from NN and Cluster-based NN, respectively. The results from the NN method show conspicuous spurious clusters in the two right panels; the false clusters are indicated by dotted circles in the CD3 vs. CD8 and CD3 vs. CD4 scatter plots. The results from our proposed approach, in contrast, better resemble the true distribution in the top row.

However, the NN method sometimes introduces spurious clusters into the imputed results and fails to replicate the true distribution of cell populations. Fig. 5 shows an example of false clusters produced by the NN imputation algorithm (for the detailed experimental setup, see Section V). We next present a toy example to explain why NN imputation can fail, and to motivate our approach.

C. Motivating Toy Example

Fig. 6 shows a toy example dataset in $\mathbb{R}^3$. In the two data files, two of the three features of these points are observed: $c$ and $s_1$ in file 1, and $c$ and $s_2$ in file 2. Each data point belongs to one of two clusters, but its label is unavailable. When imputing feature $s_1$ of units in file 2, the NN algorithm produces four clusters where there should be two, as shown in Fig. 6 (d). This is because the NN method uses only one feature, and fails to leverage information about the joint distribution of variables that are not observed together. On the other hand, if we can infer the cluster membership of the data points, NN imputation can be applied within the same cluster. Hence, in the example, we search for a donor from subgroup (1) for the data points in (3), and likewise we search for a donor from (2) for the points in (4). The file matching result then improves greatly and better replicates the true distribution, as in Fig. 6 (e).

Fig. 6. Toy example of file matching. Two files (b) and (c) provide partial information about data points (a) in $\mathbb{R}^3$.
The variable $c$ is observed in both files, while $s_1$ and $s_2$ are specific to each file. The NN method creates false populations in the $s_1$ vs. $s_2$ scatter plot in (d). On the other hand, NN applied within the same cluster successfully replicates the true distribution (e). If the clusters are incorrectly paired, however, the Cluster-NN approach fails, as in (f).

In this example, as in real flow cytometry data, there is no way to infer cluster membership from the data alone, and incorrect labeling can lead to poor results (Fig. 6 (f)). Fortunately, in flow cytometry we can incorporate prior knowledge to achieve an accurate clustering.

Input: two files $\mathcal{X}_1$ and $\mathcal{X}_2$ to be matched
1. Cluster the units in $\mathcal{X}_1$ and $\mathcal{X}_2$.
2. Perform NN imputation within the same cluster.
Output: statistically matched complete files $\hat{\mathcal{X}}_1$ and $\hat{\mathcal{X}}_2$

Fig. 7. Description of the Cluster-based NN algorithm for two files. For two input flow cytometry data files $\mathcal{X}_1$ and $\mathcal{X}_2$, specific variables are imputed using the NN method after clustering each cell into one of $K$ clusters.

III. CLUSTER-BASED IMPUTATION OF MISSING VARIABLES

We first focus on the case of matching two files; the case of more than two files is discussed in Section VI. For the present section, we assume that $\mathcal{X}_1$ and $\mathcal{X}_2$ have both been partitioned into $K$ clusters. Let $\mathcal{X}_1^k$ and $\mathcal{X}_2^k$ denote the cells in $\mathcal{X}_1$ and $\mathcal{X}_2$ from the $k$th cluster, respectively.

Suppose that the data is configured as in Fig. 3. In order to impute the missing variables of a unit in file 1, we locate a donor among the data points in file 2 that have the same cluster label as the recipient. When imputing incomplete units in file 2, the roles are reversed. The similarity between two units is evaluated in the projected space of jointly observed variables, while constraining both units to belong to the same cluster.
Then we impute the missing variables of the recipient by patching in the corresponding variables from the donor. More specifically, for $x_i \in \mathcal{X}_1^k$, we impute the missing variables by
$$x_i^{m_i} = x_{j^*}^{m_i} \quad \text{where} \quad x_{j^*} = \arg\min_{x_j \in \mathcal{X}_2^k} \| x_i^c - x_j^c \|_2 .$$
The proposed Cluster-based NN imputation algorithm is summarized in Fig. 7.

In social applications such as survey completion, file matching is often performed within classes defined by variables such as gender, age, or county of residence. In that setting, however, the information used to label each unit is available in the data, and inference as in our algorithm is not necessary [14].

IV. CLUSTERING WITH MISSING DATA

To implement the above approach, we view $\mathcal{X}_1$ and $\mathcal{X}_2$ as a single data set and cluster its elements. We propose a method for clustering with missing data based on a finite Gaussian mixture model. Mixture models are common models for flow cytometry, where each component corresponds to a cell type. While non-Gaussian models might provide a better fit, there is a trade-off between estimation error and approximation error: more complicated models tend to be more challenging to fit. Furthermore, even with an imperfect data model, we may still achieve improved file matching.

Thus, clustering amounts to fitting the parameters of the mixture model. In general, fitting such a model is ill-posed. For example, in the toy example, there is no way to infer the correct clustering based solely on the data. However, we can leverage domain knowledge to select the number of components and initialize the model parameters.

A. Mixture of PPCA

In a mixture model framework, the probability distribution of a $d$-dimensional data vector $x$ takes the form
$$p(x) = \sum_{k=1}^{K} \pi_k p_k(x)$$
where $K$ is the number of components in the mixture and $\pi_k$ is a mixing weight. In flow cytometry, mixture models are common models of cell subpopulations.
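The Cluster-based NN rule of Section III (Fig. 7, step 2) can be sketched as follows, assuming cluster labels have already been assigned; the helper name, array layout, and the assumption that every cluster has at least one donor are illustrative, not the authors' implementation:

```python
import numpy as np

def cluster_nn_impute(X1, X2, labels1, labels2, c, s2):
    """Impute the s2 block of file 1 from donors in file 2.

    For each recipient in X1, the donor is the nearest unit in X2
    *with the same cluster label*, where distance is measured only
    on the common variables c. Assumes each cluster has a donor.
    """
    X1 = X1.copy()
    for i in range(len(X1)):
        donors = np.flatnonzero(labels2 == labels1[i])
        d = np.linalg.norm(X2[donors][:, c] - X1[i, c], axis=1)
        j = donors[np.argmin(d)]
        X1[i, s2] = X2[j, s2]   # patch in the donor's specific block
    return X1
```

Restricting the donor pool to the recipient's cluster is exactly what prevents the four-cluster artifact of Fig. 6 (d): a recipient in subgroup (3) can only borrow from subgroup (1).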
Mixture models with Gaussian components are common [4], [5], [8], although distributions with more parameters, such as t-distributions or skew t-distributions, have also been proposed [6], [7]. However, these models require estimating a large number of parameters, and it becomes difficult to obtain reliable estimates as the number of components or the dimension of the data increases. In this application, the model need not be perfect to obtain improved imputation.

We adopt a probabilistic principal component analysis (PPCA) mixture model as a way to model cell populations with fewer parameters. Without PPCA, our experience has revealed that even a Gaussian mixture model may have too many parameters to be accurately fit.

PPCA was proposed by [15] as a probabilistic interpretation of PCA. While conventional PCA lacks a probabilistic formulation, PPCA specifies a generative model. It is a latent variable model, in which a data vector is linearly related to a latent variable. The latent variable space is generally of lower dimension than the ambient variable space, so the latent variable provides an economical representation of the data.

The PPCA model is built by specifying a conditional distribution of a data vector $x \in \mathbb{R}^d$, given a latent variable $t \in \mathbb{R}^q$:
$$p(x \mid t) = \mathcal{N}(Wt + \mu, \, \sigma^2 I)$$
where $\mu$ is a $d$-dimensional vector and $W$ is a $d \times q$ linear transform matrix. The latent variable is also assumed to be Gaussian, with $p(t) = \mathcal{N}(0, I)$. Then the marginal distribution of $x$ is also Gaussian:
$$p(x) = \mathcal{N}(\mu, C)$$
with covariance matrix $C = WW^T + \sigma^2 I$. The posterior distribution can be shown to be Gaussian as well:
$$p(t \mid x) = \mathcal{N}(M^{-1} W^T (x - \mu), \, \sigma^2 M^{-1})$$
where $M = W^T W + \sigma^2 I$ is a $q \times q$ matrix.

The PPCA mixture model is a combination of multiple PPCA components, each of which explains local data structure or a cell subpopulation.
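As a quick sanity check on these formulas, the following sketch draws samples from the PPCA generative model and verifies that the empirical covariance matches $C = WW^T + \sigma^2 I$; it also computes the posterior means $E[t \mid x] = M^{-1}W^T(x - \mu)$. The dimensions, seed, and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, sigma2 = 5, 2, 0.1

W = rng.normal(size=(d, q))
mu = rng.normal(size=d)

# Generative model: t ~ N(0, I), then x | t ~ N(W t + mu, sigma2 * I).
n = 200_000
t = rng.normal(size=(n, q))
x = t @ W.T + mu + np.sqrt(sigma2) * rng.normal(size=(n, d))

# Predicted marginal covariance C = W W^T + sigma2 I vs. the empirical one.
C = W @ W.T + sigma2 * np.eye(d)
C_hat = np.cov(x, rowvar=False)

# Posterior means E[t | x] = M^{-1} W^T (x - mu), with M = W^T W + sigma2 I.
M = W.T @ W + sigma2 * np.eye(q)
t_mean = (x - mu) @ W @ np.linalg.inv(M)
```

With a large sample, `C_hat` closely matches `C`, and the posterior means track the latent variables that generated each point.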
The model is defined by the collection of component parameters $\theta_k = \{\pi_k, \mu_k, W_k, \sigma_k^2\}$. From a flow cytometry dataset $\mathcal{X} = \{x_1, \dots, x_N\}$, an EM algorithm can learn the mixture model by iteratively computing these parameters. More details on the PPCA mixture and its EM algorithm can be found in [16].

The mixture of PPCA offers a way of controlling the number of parameters to be estimated without completely sacrificing the flexibility of the model. In the mixture model framework, a more common choice is the standard Gaussian mixture model. In the Gaussian mixture model, however, each Gaussian component requires $d(d+1)/2$ covariance parameters to be estimated if a full covariance matrix is used. Thus, as the data dimension increases, more data points are needed for reliable estimation of those parameters. The number of parameters can be reduced by constraining the covariance matrix to be isotropic or diagonal. These constraints are too restrictive, however, since an isotropic or diagonal covariance makes the Gaussian component spherical or, respectively, elliptical and aligned along the data axes; hence, the correlation structure between variables cannot be captured. The PPCA mixture model lies between these two extremes, and allows the number of parameters to be controlled by specifying $q$, the dimension of the latent variable.

B. Mixture of PPCA with Missing Data

Even though our file matching problem has a particular pattern of missing variables, we develop a more general algorithm that allows for an arbitrary pattern of missing variables. Our development assumes values are "missing at random," meaning that whether a variable is missing or not is independent of its value [13]. Our algorithm may be viewed as an extension of the algorithm of [17] to PPCA, or of the algorithm of [16] to data with missing values.
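The parameter savings can be made concrete with a naive count per component's covariance model: $d(d+1)/2$ for a full covariance versus the $dq$ entries of $W$ plus $\sigma^2$ for PPCA (this count ignores the rotational redundancy of $W$, a deliberate simplification for illustration):

```python
# Naive covariance parameter counts per mixture component,
# ignoring W's rotational redundancy (illustrative simplification).
def cov_params_full(d):
    """Full covariance matrix: d(d+1)/2 free parameters."""
    return d * (d + 1) // 2

def cov_params_ppca(d, q):
    """PPCA covariance C = W W^T + sigma2 I: d*q entries of W plus sigma2."""
    return d * q + 1

d = 7                         # seven markers, as in Section V
print(cov_params_full(d))     # 28
print(cov_params_ppca(d, 2))  # 15, with latent dimension q = 2
```

For the seven-marker data of Section V with $q = 2$, the PPCA component nearly halves the covariance parameters per component, and the gap widens as $d$ grows.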
Denoting the observed and missing components of a data point by $o_n$ and $m_n$, each data point can be divided as $x_n = (x_n^{o_n}, x_n^{m_n})$. In a missing data problem, a set of partial observations $\{x_1^{o_1}, \dots, x_N^{o_N}\}$ is given. As in EM algorithms for Gaussian mixture models, we introduce indicator variables $z_n$. One and only one entry of $z_n$ is nonzero, and $z_{nk} = 1$ indicates that the $k$th component is responsible for generating $x_n$. We also include the missing components $x_n^{m_n}$ and the set of latent variables $t_{nk}$ for each component, to form the complete data $(x_n^{o_n}, x_n^{m_n}, t_{nk}, z_n)$ for $n = 1, \dots, N$ and $k = 1, \dots, K$.

We derive an iterative EM algorithm for the PPCA mixture model with missing data. The key difference from the EM algorithm for completely observed data is that, in the expectation steps, the conditional expectation is taken with respect to $x^o$ as opposed to $x$. To develop the EM algorithm, we employ and extend the two-stage procedure described in [16]. In the first stage of the algorithm, the component weights $\pi_k$ and the component centers $\mu_k$ are updated:
$$\hat{\pi}_k = \frac{1}{N} \sum_n \langle z_{nk} \rangle, \qquad (1)$$
$$\hat{\mu}_k = \frac{\sum_n \langle z_{nk} \rangle \begin{bmatrix} x_n^{o_n} \\ \langle x_n^{m_n} \rangle \end{bmatrix}}{\sum_n \langle z_{nk} \rangle} \qquad (2)$$
where $\langle z_{nk} \rangle = P(z_{nk} = 1 \mid x_n^{o_n})$ is the responsibility of mixture component $k$ for generating the unit $x_n$, and $\langle x_n^{m_n} \rangle = E[x_n^{m_n} \mid z_{nk} = 1, x_n^{o_n}]$ is the conditional expectation of the missing components. Note that we do not assume that the vectors in the bracket are literally stacked in this order; the notation can be replaced by the true component ordering without difficulty.

In the second stage, we update $W_k$ and $\sigma_k^2$:
$$\hat{W}_k = S_k W_k \left( \sigma_k^2 I + M_k^{-1} W_k^T S_k W_k \right)^{-1}, \qquad (3)$$
$$\hat{\sigma}_k^2 = \frac{1}{d} \operatorname{tr}\left( S_k - S_k W_k M_k^{-1} \hat{W}_k^T \right) \qquad (4)$$
from the local covariance matrix $S_k$:
$$S_k = \frac{1}{N \hat{\pi}_k} \sum_n \langle z_{nk} \rangle \left\langle (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T \right\rangle .$$
These update rules reduce to the update rules for completely observed data when there are no missing variables.
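To make the two-stage updates concrete, the following sketch implements one M-step for a single component in the complete-data special case noted above, where the missing-data expectations in (2) and $S_k$ reduce to observed values; the function name and this simplification are illustrative, not the authors' full missing-data algorithm:

```python
import numpy as np

def ppca_m_step(X, resp, W, sigma2):
    """One M-step for a single PPCA component, complete-data case.

    X: (N, d) data; resp: (N,) responsibilities <z_nk> from the E-step;
    W: (d, q) current loading matrix; sigma2: current noise variance.
    Follows updates (1)-(4), with missing-data expectations replaced
    by observed values (the no-missing-variables special case).
    """
    N, d = X.shape
    q = W.shape[1]

    pi_k = resp.sum() / N                         # eq. (1)
    mu_k = resp @ X / resp.sum()                  # eq. (2), observed case

    Xc = X - mu_k
    S = (resp[:, None] * Xc).T @ Xc / (N * pi_k)  # local covariance S_k

    M = W.T @ W + sigma2 * np.eye(q)
    SW = S @ W
    W_new = SW @ np.linalg.inv(
        sigma2 * np.eye(q) + np.linalg.inv(M) @ W.T @ SW)        # eq. (3)
    sigma2_new = np.trace(S - SW @ np.linalg.inv(M) @ W_new.T) / d  # eq. (4)
    return pi_k, mu_k, W_new, sigma2_new
```

Iterating this step, interleaved with an E-step that recomputes the responsibilities, fits the component to the local covariance structure, per [16].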
We derive the EM algorithm in detail in Appendix A.

After the model parameters are estimated, the observations are divided into groups according to their posterior distribution:
$$\arg\max_{k = 1, \dots, K} \; p(z_{nk} = 1 \mid x_n^{o_n}),$$
so each unit (cell) is classified into one of $K$ cell populations. Note that this posterior probability is computed in the E-step.

Cell Type                      CD markers
granulocyte                    CD45+, CD15+
monocyte                       CD45+, CD14+
lymphocyte: helper T cell      CD45+, CD3+
lymphocyte: cytotoxic T cell   CD45+, CD3+, CD8+
lymphocyte: B cell             CD45+, CD19+ or CD45+, CD20+
lymphocyte: NK cell            CD16+, CD56+, CD3−

Fig. 8. Types of white blood cells. Each cell type is characterized by a set of expressed CD markers. The cluster of differentiation (CD) markers are commonly used to identify cell surface molecules on white blood cells. The '+/−' signs indicate whether a certain cell type has the corresponding antigens on its cell surface.

C. Domain Knowledge and Initialization of the EM Algorithm

In file matching of flow cytometry data, domain knowledge is critical. First, as explained above, the incompletely observed data is insufficient to determine the correct cluster labeling. Second, the initial conditions of the EM algorithm affect its performance and convergence rate. Domain knowledge allows us to choose the number of components, and to initialize the algorithm so that it converges to the best local maximum.

In flow cytometry, from the design of fluorochrome marker combinations and knowledge of the blood sample composition, we can anticipate certain properties of cell subpopulations. For example, Fig. 8 summarizes white blood cell types and their characteristic cluster of differentiation (CD) marker expressions. That there are six cell types suggests choosing $K = 6$ when analyzing white blood cells. The CD markers indicated are commonly used in flow cytometry to identify cell surface molecules on leukocytes [18]. However, this information is qualitative, and needs to be quantified.
To achieve this, we use one-dimensional histograms. In a histogram, two large peaks are generally expected, depending on the expression level of the corresponding CD marker. If a cell subpopulation expresses a CD marker, denoted by '+', then it forms a peak on the right side of the histogram. On the other hand, if a cell population does not express the marker, denoted by '−', then a peak can be found on the left side of the histogram. We use the locations of the peaks to quantify the expression levels. These quantified values can be combined with the CD marker expression levels of each cell type to specify the initial cluster centers. Thus, each component of $\mu_k$ for a certain cell type is initialized by either the positive quantity or the negative quantity from the histogram. In our implementation, these are set manually based on visual inspection of the histograms. We then initialize the mixture model parameters $\{\pi_k, \mu_k, W_k, \sigma_k^2\}$ as described in Fig. 9.

Input: data files $\mathcal{X}_1$, $\mathcal{X}_2$; $K$, the number of components; $q$, the dimension of the latent variable space; $\mu_k$, the initial component means.
for $k = 1$ to $K$ do
1. using the distance $\| x_n^{o_n} - \mu_k^{o_n} \|$, find the set $\mathcal{X}^k$ of data points whose nearest component mean is $\mu_k$
2. initialize a covariance matrix $C_k$ with random entries
3. replace submatrices of $C_k$ with the sample covariances of the data points in $\mathcal{X}^k$
4. make $C_k$ positive definite by enforcing the eigenvalues to be positive
5. set $\pi_k = |\mathcal{X}^k| / (N_1 + N_2)$
6. set $W_k$ with the $q$ principal eigenvectors of $C_k$
7. set $\sigma_k^2$ with the average of the remaining eigenvalues of $C_k$
end for
Output: $\{\pi_k, \mu_k, W_k, \sigma_k^2\}$ for $k = 1, \dots, K$

Fig. 9. Parameter initialization of the EM algorithm for missing data. Cell populations are partitioned into $K$ groups based on the distance to each component center. The component weight $\pi_k$ is initialized according to the size of each partition.
From the covariance matrix estimate $\hat{C}_k$, the parameters $\hat{W}_k$ and $\sigma_k^2$ are initialized by eigen-decomposition.

Fig. 10. Structure of the covariance matrix $C$. The sub-matrices $C_k^{s_1, s_2}$ and $C_k^{s_2, s_1}$ cannot be estimated from a sample covariance matrix because these variables are never jointly observed.

An important issue in file matching arises from the covariance matrix. When data is completely observed, a covariance matrix is commonly initialized with the sample covariance matrix. In the case of file matching, however, the sample covariance cannot be evaluated, since some sets of variables are never jointly observed (see Fig. 10). We chose to build the covariance matrix $C_k$ block-by-block from sample covariances. For example, we can set $C_k^{c, s_1}$ to the sample covariance of variables $c$ and $s_1$, based on the cases for which both $c$ and $s_1$ are present. On the other hand, the submatrix $C_k^{s_1, s_2}$ cannot be built from the observations. In our implementation, we set the submatrix $C_k^{s_1, s_2}$ with arbitrary values. However, the resulting matrix may not be positive definite. Thus, $C_k$ is made positive definite by replacing negative eigenvalues with a small positive value.

ID         N1      N2      Ne
Patient1   10000   10000   5223
Patient2   7000    7000    4408
Patient3   3000    3000    3190

Fig. 11. Three flow cytometry datasets from three different patients. Each dataset is divided into two data files and an evaluation set. $N_1$ and $N_2$ denote the sizes of the two data files, and $N_e$ is the size of the evaluation set.

Fig. 12. File structure used in the experiment. FS, SS, and CD56 are common to both files, and each pair of CD markers is observed in only one of the files. The blank blocks correspond to the unobserved variables. The blocks in file 1 are matrices with $N_1$ rows, and the blocks in file 2 are matrices with $N_2$ rows.
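The positive-definite repair and eigen-decomposition steps of the initialization (Fig. 9, steps 4, 6, 7) can be sketched as follows. The eigenvalue floor is an illustrative choice, and the square-root scaling of the principal eigenvectors is the standard PPCA convention of [15] (the figure itself only says to take the principal eigenvectors):

```python
import numpy as np

def make_positive_definite(C, floor=1e-3):
    """Repair a symmetric matrix as in step 4 of Fig. 9: eigenvalues
    below a small positive floor are raised to it (slightly stronger
    than replacing only the negative ones)."""
    C = (C + C.T) / 2                     # symmetrize first
    vals, vecs = np.linalg.eigh(C)
    vals = np.maximum(vals, floor)
    return vecs @ np.diag(vals) @ vecs.T

def ppca_from_covariance(C, q):
    """Initialize W_k and sigma2_k from C_k (steps 6-7 of Fig. 9):
    sigma2 is the average of the d-q smallest eigenvalues; W spans
    the q principal eigenvectors, with the usual PPCA scaling."""
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]        # eigh is ascending; sort descending
    vals, vecs = vals[order], vecs[:, order]
    sigma2 = vals[q:].mean()
    W = vecs[:, :q] * np.sqrt(np.maximum(vals[:q] - sigma2, 0.0))
    return W, sigma2
```

With this scaling, a covariance of the exact PPCA form $C = WW^T + \sigma^2 I$ is recovered exactly by `ppca_from_covariance`.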
Once a covariance matrix $C_k$ is obtained, we can initialize $W_k$ and $\sigma_k^2$ by taking the eigen-decomposition of $C_k$.

V. EXPERIMENTS AND RESULTS

We apply the proposed file matching technique to real flow cytometry datasets and present experimental results. Three flow cytometry datasets were prepared from lymph node samples of three patients; these datasets were provided by the Department of Pathology at the University of Michigan. The measurements are of different sizes and have seven attributes: FS, SS, CD56, CD16, CD3, CD8, and CD4. Each dataset is randomly permuted ten times and divided into two data files and a separate evaluation set. In Fig. 11, the cell counts of the two files and the held-out set are denoted $N_1$, $N_2$, and $N_e$, respectively.

Two attributes from each file are hidden to construct hypothetical files with missing data. Thus, CD16 and CD3 are available only in file 1, and CD8 and CD4 are available only in file 2, while FS, SS, and CD56 are common. The pattern of the constructed data files is illustrated in Fig. 12, where the blocks of missing variables are left blank.

For each white blood cell type, its expected marker expressions (CD markers), relative size (FS), and relative granularity (SS) are presented in Fig. 13.

Cell type             FS   SS   CD56   CD16   CD3   CD8   CD4
granulocyte           +    +    −      +      −     −     −
monocyte              +    −    −      +      −     −     −
helper T cell         −    −    −      −      +     −     +
cytotoxic T cell      −    −    −      −      +     +     −
B lymphocyte          −    −    −      −      −     −     −
Natural Killer cell   −    −    +      +      −     −     −

Fig. 13. Cell types in the dataset and their corresponding marker expressions. '+' or '−' indicates whether a certain cell type expresses the CD marker or not.

Because the samples are from lymph nodes, the majority of the cell population is lymphocytes, whereas the most common white blood cells in a human body
0 200 400 600 800 1000 0 50 100 150 X: 419.5 Y: 100 FS X: 791.5 Y: 18 0 200 400 600 800 1000 0 50 100 150 200 250 X: 396.5 Y: 248 SS X: 675.5 Y: 27 0 200 400 600 800 1000 0 50 100 150 X: 245.5 Y: 135 CD56 0 200 400 600 800 1000 0 50 100 150 X: 136.5 Y: 147 CD16 X: 348.5 Y: 47 0 200 400 600 800 1000 0 50 100 150 X: 558.5 Y: 126 CD3 X: 215.5 Y: 49 0 200 400 600 800 1000 0 10 20 30 40 50 X: 171.5 Y: 49 CD8 X: 766.5 Y: 30 0 200 400 600 800 1000 0 50 100 150 200 X: 219.5 Y: 48 CD4 X: 654.5 Y: 164 Fig. 14. Histogram of each marker in the dataset. The peaks are hand-picked and are indicated in each panels. are granulocytes. The ‘ + / − ’ signs indicate whether a certain cell type expresses the markers or not. For example, helper T cells express both CD3 and CD4 but not others. This qualitati ve knowledge is quantified with the help of single dimensional histograms as explained in Section IV -C. T wo dominant peaks are picked from each histogram and their corresponding measurement values are set to the positi ve and negati ve expression levels. Fig. 14 and Fig. 15 summarize this histogram analysis. T wo incomplete data files are completed following the procedure as described in Fig. 7. A mixture of PPCA is fitted with six components because six cell types are e xpected from this dataset. The latent v ariable dimension of each PPCA component is fixed to two. October 29, 2018 DRAFT 17 FS SS CD56 CD16 CD3 CD8 CD4 + 800 680 500 350 550 750 650 − 400 400 240 130 200 170 200 Fig. 15. The positiv e and negati ve expression levels are summarized. The synthesized data after file matching is displayed in Fig. 5. The figure shows scatter plots of specific v ariables: CD16, CD3, CD4, and CD8. Note that these marker pairs are not jointly observed from the two incomplete data files. The imputation results from the NN and the Cluster-based NN methods are compared in the figure. For reference, scatter plots from the original complete dataset (ground truth) are also presented. 
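In the paper, the two dominant peaks are picked by hand from each histogram (Fig. 14). An automated stand-in for that step might look like the sketch below; the bin count, the local-maximum rule, and the minimum peak separation are our own illustrative choices, not part of the paper's procedure.

```python
import numpy as np

def expression_levels(x, bins=100, min_sep=10):
    """Return (negative, positive) expression levels as the centers of
    the two tallest, sufficiently separated local-maximum bins of a
    1-D histogram of marker values x."""
    counts, edges = np.histogram(x, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # interior local maxima of the histogram
    peaks = [i for i in range(1, bins - 1)
             if counts[i] >= counts[i - 1] and counts[i] > counts[i + 1]]
    peaks.sort(key=lambda i: counts[i], reverse=True)
    first = peaks[0]
    # second peak: tallest local maximum at least min_sep bins away
    second = next(i for i in peaks if abs(i - first) >= min_sep)
    lo, hi = sorted((first, second))
    return centers[lo], centers[hi]
```

The lower peak is taken as the negative expression level and the higher peak as the positive level, matching the convention of Fig. 15.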
As can be seen, the results from the Cluster-based NN are far more similar to the true distributions, while the NN method generates spurious clusters in the CD3-CD8 and CD3-CD4 scatter plots. These false clusters are indicated in Fig. 5.

A. Evaluation method

To quantitatively evaluate the previous results, we use the Kullback-Leibler (KL) divergence. The KL divergence between two distributions f(x) and g(x) is defined by

KL(g ‖ f) = E_g[log g − log f].

Let f denote the true distribution responsible for the observations, and g denote its estimate. The KL divergence is not symmetric, so KL(f ‖ g) and KL(g ‖ f) have different meanings. For a given distribution f, a distribution g minimizes KL(f ‖ g) when g takes nonzero values in the region where f takes nonzero values; hence, such a g overestimates the support of f. On the other hand, KL(g ‖ f) is minimized by a g that is close to zero in the region where f is near zero, so a distribution g that minimizes KL(g ‖ f) tends to have smaller support. Therefore, KL(g ‖ f) is better suited to detecting spurious clusters in an estimate. The empirical estimate of the KL divergence is then evaluated by

KL(g ‖ f) ≈ KL(ĝ ‖ f̂) ≈ (1/N_e) Σ_{n=1}^{N_e} [log ĝ(x̂_n) − log f̂(x̂_n)],

where the distributions f and g are replaced by their corresponding density estimates, and the expectation is approximated by a finite sum over the imputed results x̂_n on the held-out evaluation set of size N_e. We used kernel density estimation on the ground truth data and the imputed data for f̂ and ĝ, respectively.
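A numerical sketch of this estimate, using a simple isotropic Gaussian KDE, is given below. The kernel form and the fixed bandwidth are our own assumptions for illustration; any density estimator could be substituted.

```python
import numpy as np

def kde_logpdf(data, points, bw):
    """Log-density of an isotropic Gaussian KDE with bandwidth bw,
    fit to `data` (n, d), evaluated at `points` (m, d)."""
    n, d = data.shape
    diff = points[:, None, :] - data[None, :, :]     # (m, n, d)
    sq = (diff ** 2).sum(-1) / (2.0 * bw ** 2)       # (m, n)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * bw ** 2)
    # log-sum-exp over the n kernels, stabilized by the smallest exponent
    mx = sq.min(1, keepdims=True)
    return np.log(np.exp(-(sq - mx)).sum(1)) - mx[:, 0] - log_norm

def empirical_kl(imputed, truth, eval_points, bw=1.0):
    """KL(g ‖ f) ≈ mean_n [log ĝ(x̂_n) − log f̂(x̂_n)], with ĝ a KDE of
    the imputed data and f̂ a KDE of the ground truth, evaluated on the
    held-out points x̂_n (rows of eval_points)."""
    return float(np.mean(kde_logpdf(imputed, eval_points, bw)
                         - kde_logpdf(truth, eval_points, bw)))
```

An imputation that places mass where the truth has none drives log f̂ strongly negative at those points, which is exactly why KL(g ‖ f) penalizes spurious clusters.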
The KL divergences are computed for ten random permutations, and their averages and standard errors are reported in Fig. 16.

ID        NN (file 1)   Cluster-NN (file 1)   NN (file 2)   Cluster-NN (file 2)
Patient1  2.90 ± 0.05   1.55 ± 0.05           2.66 ± 0.03   1.12 ± 0.04
Patient2  4.54 ± 0.07   1.22 ± 0.03           4.12 ± 0.08   0.92 ± 0.03
Patient3  4.46 ± 0.10   2.40 ± 0.11           4.18 ± 0.11   2.30 ± 0.07

Fig. 16. KL divergences computed for ten permutations of each flow cytometry dataset; averages and standard errors are reported. The file matching results are evaluated for both the NN and Cluster-based NN algorithms. The KL divergences of Cluster-based NN are closer to zero than those of NN, so the results of Cluster-based NN better replicate the true distribution.

As can be seen, the KL divergences from Cluster-based NN are substantially smaller than those from NN. Therefore, Cluster-based NN yields a better replication of the true distribution.

VI. DISCUSSION

In this paper, we demonstrated the use of a cluster-based nearest neighbor imputation method for file matching in flow cytometry data. We applied the proposed algorithm to real flow cytometry data to generate a dataset of higher dimension by merging two data files of lower dimensions. The resulting matched file can be used for visualization and high-dimensional analysis of cellular attributes. While the presented imputation method focused on the case of two files, it can be generalized to more than two files: for each missing component of a recipient cell, we can find a donor cell among the files that contain the component of interest.

We envision two extensions of the clustering-based imputation method. The first is training a PPCA mixture model on all the data files at once; this approach involves the entire set of data points in model fitting. The second considers a pair of files at a time.
In this approach, we first select a donor file in which the missing component of the recipient file is available, and then apply the method of this paper to the pair of files. This approach involves fewer data points in training, but mixture models of smaller dimension need to be fitted multiple times. After a mixture model is trained and each cell is assigned to a cluster, the similarity between cells can be computed: the Euclidean distance in the projected space of the commonly observed variables is used, under the constraint that both units share the same cluster label. The missing components are then imputed from the donor.

Future research directions include finding ways to extract prior information automatically. The construction of covariance matrices from an incomplete dataset in the initialization of the EM algorithm is also an interesting problem; we expect that better covariance structure estimation will help replicate non-symmetric and non-elliptic cell populations in the imputed results. A limitation of this work is that it has only been validated on lymphocyte data, where, for certain marker combinations, cell types tend to form relatively well-defined clusters. For other samples and marker combinations, however, clusters may be more elongated or less well-defined due to cells being at different stages of physiologic development. Future work may also consider more flexible models for clustering such data, and associated inference algorithms.

APPENDIX

Suppose that we are given an incomplete observation set. We can divide each unit x_n as x_n = (x_n^{o_n}; x_n^{m_n}) by separating the observed components and the missing components. Note that we do not assume that the observed variables come first; the notation can be adapted to the actual ordering of components without difficulty.
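Using the observed/missing split just introduced, the cluster-constrained donor search described in Section VI can be sketched as follows. This is a simplified illustration in which cluster labels are given as inputs, whereas the paper obtains them from the fitted PPCA mixture; all names are our own.

```python
import numpy as np

def cluster_nn_impute(recipients, donors, common_idx, missing_idx,
                      rec_labels, don_labels):
    """For each recipient, copy the missing components from the nearest
    donor (Euclidean distance on the commonly observed variables) that
    carries the same cluster label. Returns completed recipients."""
    out = recipients.copy()
    for i, (r, lab) in enumerate(zip(recipients, rec_labels)):
        cand = np.flatnonzero(don_labels == lab)
        if cand.size == 0:                 # fall back to all donors
            cand = np.arange(len(donors))
        dist = np.linalg.norm(
            donors[np.ix_(cand, common_idx)] - r[common_idx], axis=1)
        out[i, missing_idx] = donors[cand[np.argmin(dist)], missing_idx]
    return out
```

The cluster-label constraint is what distinguishes this from plain NN imputation: a recipient may be closer to a donor from another subpopulation on the common variables, but that donor is excluded from the search.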
In the PPCA mixture model, the probability distribution of x is

p(x) = Σ_{k=1}^{K} π_k p(x | k),

where K is the number of components in the mixture and π_k is the mixing weight corresponding to the component density p(x | k). We estimate the set of unknown parameters θ = {π_k, μ_k, W_k, σ²_k} using an EM algorithm from the partial observations {x_1^{o_1}, ..., x_N^{o_N}}.

To develop an EM algorithm, we introduce indicator variables z_n = (z_n1, ..., z_nK) for n = 1, ..., N. One and only one entry of z_n is nonzero, and z_nk = 1 indicates that the kth component is responsible for generating x_n. We also include the set of latent variables t_nk for each component, and the missing variables x_n^{m_n}, to form the complete data (x_n^{o_n}, x_n^{m_n}, t_nk, z_n) for n = 1, ..., N and k = 1, ..., K. Then the corresponding complete data likelihood function has the form

L_C = Σ_n Σ_k z_nk ln[π_k p(x_n, t_nk)]
    = Σ_n Σ_k z_nk { ln π_k − (d/2) ln σ²_k − (1/(2σ²_k)) tr[(x_n − μ_k)(x_n − μ_k)^T]
                     + (1/σ²_k) tr[(x_n − μ_k) t_nk^T W_k^T] − (1/(2σ²_k)) tr[W_k^T W_k t_nk t_nk^T] },

where terms independent of the parameters are not included in the second equality. Instead of developing an EM algorithm directly on this likelihood function L_C, we extend the strategy in [16] and build a two-stage EM algorithm, where each stage is a two-step process.

In the first stage of the two-stage EM algorithm, we update the component weights π_k and the component means μ_k. We form a complete data log-likelihood function with the component indicator variables z_n and the missing variables x_n^{m_n}, while ignoring the latent variables t_nk. Then we have the following likelihood function:

L_1 = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk ln[π_k p(x_n^{o_n}, x_n^{m_n} | k)]
    = Σ_n Σ_k z_nk { ln π_k − (1/2) ln|C_k| − (1/2) tr[C_k^{-1} (x_n − μ_k)(x_n − μ_k)^T] },

where terms unrelated to the model parameters are omitted in the second line.
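For reference, evaluating log p(x) under this mixture, with each component a Gaussian of mean μ_k and covariance C_k = W_k W_k^T + σ²_k I, can be sketched as below. This is a naive direct implementation for illustration only; the function name and interface are our own.

```python
import numpy as np

def ppca_mixture_logpdf(x, pis, mus, Ws, sigma2s):
    """log p(x) for a PPCA mixture: component k is Gaussian with mean
    mus[k] and covariance C_k = Ws[k] @ Ws[k].T + sigma2s[k] * I."""
    d = x.shape[0]
    logs = []
    for pi_k, mu, W, s2 in zip(pis, mus, Ws, sigma2s):
        C = W @ W.T + s2 * np.eye(d)
        diff = x - mu
        _, logdet = np.linalg.slogdet(C)
        quad = diff @ np.linalg.solve(C, diff)
        logs.append(np.log(pi_k)
                    - 0.5 * (d * np.log(2 * np.pi) + logdet + quad))
    # log-sum-exp over components
    m = max(logs)
    return m + np.log(sum(np.exp(l - m) for l in logs))
```

The same quantity, restricted to the observed coordinates, is what the responsibilities ⟨z_nk⟩ in the E-step are built from.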
We take the conditional expectation with respect to p(z_n, x_n^{m_n} | x_n^{o_n}). Since the conditional probability factorizes as

p(z_n, x_n^{m_n} | x_n^{o_n}) = p(z_n | x_n^{o_n}) p(x_n^{m_n} | z_n, x_n^{o_n}),

the conditional expectations follow as

⟨z_nk⟩ = p(k | x_n^{o_n}) = π_k p(x_n^{o_n} | k) / Σ_{k'} π_{k'} p(x_n^{o_n} | k'),
⟨z_nk x_n^{m_n}⟩ = ⟨z_nk⟩ ⟨x_n^{m_n}⟩,
⟨x_n^{m_n}⟩ = μ_k^{m_n} + C_k^{m_n o_n} (C_k^{o_n o_n})^{-1} (x_n^{o_n} − μ_k^{o_n}),
⟨z_nk x_n^{m_n} x_n^{m_n T}⟩ = ⟨z_nk⟩ ⟨x_n^{m_n} x_n^{m_n T}⟩,
⟨x_n^{m_n} x_n^{m_n T}⟩ = C_k^{m_n m_n} − C_k^{m_n o_n} (C_k^{o_n o_n})^{-1} C_k^{o_n m_n} + ⟨x_n^{m_n}⟩ ⟨x_n^{m_n}⟩^T,

where ⟨·⟩ denotes the conditional expectation. Maximizing ⟨L_1⟩ with respect to π_k (using a Lagrange multiplier) and with respect to μ_k gives the parameter updates

π̂_k = (1/N) Σ_n ⟨z_nk⟩,                                    (5)
μ̂_k = Σ_n ⟨z_nk⟩ (x_n^{o_n}; ⟨x_n^{m_n}⟩) / Σ_n ⟨z_nk⟩.    (6)

In the second stage, we include the latent variables t_nk as well to formulate the complete data log-likelihood function. The new values of π̂_k and μ̂_k are used in this step to compute sufficient statistics. Taking the conditional expectation of L_C with respect to p(z_n, t_nk, x_n^{m_n} | x_n^{o_n}), we have

⟨L_C⟩ = Σ_n Σ_k ⟨z_nk⟩ { ln π̂_k − (d/2) ln σ²_k − (1/(2σ²_k)) tr⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩
                          + (1/σ²_k) tr[⟨(x_n − μ̂_k) t_nk^T⟩ W_k^T] − (1/(2σ²_k)) tr[W_k^T W_k ⟨t_nk t_nk^T⟩] }.
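The conditional moments of the missing components above, and the simplified single-component M-step updates (9) and (10) that the derivation arrives at, can be sketched as below. This is a minimal illustration under our own naming; it covers one component k, with the responsibility weighting omitted.

```python
import numpy as np

def conditional_moments(x_obs, o, m, mu, C):
    """For one Gaussian component with mean mu and covariance C:
    <x^m> = mu^m + C^{mo} (C^{oo})^{-1} (x^o - mu^o),
    Q     = C^{mm} - C^{mo} (C^{oo})^{-1} C^{om}."""
    gain = C[np.ix_(m, o)] @ np.linalg.inv(C[np.ix_(o, o)])
    x_m = mu[m] + gain @ (x_obs - mu[o])
    Q = C[np.ix_(m, m)] - gain @ C[np.ix_(o, m)]
    return x_m, Q

def m_step(S, W, sigma2):
    """Updates (9) and (10) for one component:
    W_new  = S W (sigma2 I + M^{-1} W^T S W)^{-1},  M = W^T W + sigma2 I,
    s2_new = (1/d) tr(S - S W M^{-1} W_new^T)."""
    d, q = W.shape
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q))
    W_new = S @ W @ np.linalg.inv(sigma2 * np.eye(q) + Minv @ W.T @ S @ W)
    sigma2_new = np.trace(S - S @ W @ Minv @ W_new.T) / d
    return W_new, sigma2_new
```

A useful sanity check on (9) and (10) is that they leave (W_k, σ²_k) fixed when S_k equals the model covariance W_k W_k^T + σ²_k I exactly.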
Since the conditional probability factorizes as

p(z_n, t_nk, x_n^{m_n} | x_n^{o_n}) = p(z_n | x_n^{o_n}) p(x_n^{m_n} | z_n, x_n^{o_n}) p(t_nk | z_n, x_n^{o_n}, x_n^{m_n}),

we can evaluate the conditional expectations as follows:

⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩ = ((x_n^{o_n}; ⟨x_n^{m_n}⟩) − μ̂_k)((x_n^{o_n}; ⟨x_n^{m_n}⟩) − μ̂_k)^T + (0 0; 0 Q_nk),
Q_nk = C_k^{m_n m_n} − C_k^{m_n o_n} (C_k^{o_n o_n})^{-1} C_k^{o_n m_n},
⟨t_nk⟩ = M_k^{-1} W_k^T (x_n − μ̂_k),
⟨(x_n − μ̂_k) t_nk^T⟩ = ⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩ W_k M_k^{-1},
⟨t_nk t_nk^T⟩ = σ²_k M_k^{-1} + M_k^{-1} W_k^T ⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩ W_k M_k^{-1}.

Recall that M_k = W_k^T W_k + σ²_k I is a q × q matrix. Then the maximization of ⟨L_C⟩ with respect to W_k and σ²_k leads to the parameter updates

Ŵ_k = [Σ_n ⟨z_nk⟩ ⟨(x_n − μ̂_k) t_nk^T⟩] [Σ_n ⟨z_nk⟩ ⟨t_nk t_nk^T⟩]^{-1},                                  (7)
σ̂²_k = (1 / (d Σ_n ⟨z_nk⟩)) { Σ_n ⟨z_nk⟩ tr⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩
        − 2 Σ_n ⟨z_nk⟩ tr[⟨(x_n − μ̂_k) t_nk^T⟩ Ŵ_k^T] + Σ_n ⟨z_nk⟩ tr[Ŵ_k^T Ŵ_k ⟨t_nk t_nk^T⟩] }.       (8)

Substituting the conditional expectations simplifies the M-step equations to

Ŵ_k = S_k W_k (σ²_k I + M_k^{-1} W_k^T S_k W_k)^{-1},      (9)
σ̂²_k = (1/d) tr(S_k − S_k W_k M_k^{-1} Ŵ_k^T),             (10)

where

S_k = (1 / (N π̂_k)) Σ_n ⟨z_nk⟩ ⟨(x_n − μ̂_k)(x_n − μ̂_k)^T⟩.

Each iteration of the EM algorithm updates the set of old parameters {π_k, μ_k, W_k, σ²_k} with the set of new parameters {π̂_k, μ̂_k, Ŵ_k, σ̂²_k} as in (5), (6), (9), and (10). The algorithm terminates when the value of the log-likelihood function no longer changes.

REFERENCES

[1] H. Shapiro, Practical Flow Cytometry, 3rd ed. Wiley-Liss, 1994.
[2] M. Brown and C. Wittwer, "Flow cytometry: Principles and clinical applications in hematology," Clinical Chemistry, vol. 46, pp. 1221-1229, 2000.
[3] C. E. Pedreira, E. S. Costa, S. Barrena, Q. Lecrevisse, J. Almeida, J. J. M.
van Dongen, and A. Orfao, "Generation of flow cytometry data files with a potentially infinite number of dimensions," Cytometry A, vol. 73A, pp. 834-846, 2008.
[4] M. J. Boedigheimer and J. Ferbas, "Mixture modeling approach to flow cytometry data," Cytometry Part A, vol. 73, pp. 421-429, 2008.
[5] C. Chan, F. Feng, J. Ottinger, D. Foster, M. West, and T. Kepler, "Statistical mixture modeling for cell subtype identification in flow cytometry," Cytometry Part A, vol. 73, pp. 693-701, 2008.
[6] K. Lo, R. R. Brinkman, and R. Gottardo, "Automated gating of flow cytometry data via robust model-based clustering," Cytometry Part A, vol. 73, pp. 321-332, 2008.
[7] S. Pyne, X. Hu, K. Wang, E. Rossin, T.-I. Lin, L. M. Maier, C. Baecher-Allan, G. J. McLachlan, P. Tamayo, D. A. Hafler, P. L. D. Jager, and J. P. Mesirov, "Automated high-dimensional flow cytometric data analysis," PNAS, vol. 106, pp. 8519-8524, 2009.
[8] J. Lakoumentas, J. Drakos, M. Karakantza, G. C. Nikiforidis, and G. C. Sakellaropoulos, "Bayesian clustering of flow cytometry data for the diagnosis of B-chronic lymphocytic leukemia," Journal of Biomedical Informatics, vol. 42, pp. 251-261, 2009.
[9] K. Carter, R. Raich, W. Finn, and A. O. Hero, "Information preserving component analysis: data projections for flow cytometry analysis," Journal of Selected Topics in Signal Processing, vol. 3, pp. 148-158, 2009.
[10] ——, "FINE: Fisher information non-parametric embedding," IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009, to appear.
[11] S. Perfetto, P. K. Chattopadhyay, and M. Roederer, "Seventeen-colour flow cytometry: unravelling the immune system," Nature Reviews Immunology, vol. 4, pp. 648-655, 2004.
[12] M. Sánchez, J. Almeida, B. Vidriales, M. López-Berges, M. García-Marcos, M. Moro, A. Corrales, M. Calmuntia, J. S. Miguel, and A.
Orfao, "Incidence of phenotypic aberrations in a series of 467 patients with B chronic lymphoproliferative disorders: basis for the design of specific four-color stainings to be used for minimal residual disease investigation," Leukemia, vol. 16, pp. 1460-1469, 2002.
[13] R. Little and D. Rubin, Statistical Analysis with Missing Data, 2nd ed. Wiley, 2002.
[14] S. Rässler, Statistical Matching: A Frequentist Theory, Practical Applications, and Alternative Bayesian Approaches, ser. Lecture Notes in Statistics. Springer, 2002, no. 168.
[15] M. Tipping and C. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society, Series B, vol. 61, pp. 611-622, 1999.
[16] ——, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, pp. 443-482, 1999.
[17] Z. Ghahramani and M. Jordan, "Supervised learning from incomplete data via an EM approach," Advances in Neural Information Processing Systems, vol. 6, 1994.
[18] H. Zola, B. Swart, I. Nicholson, B. Aasted, A. Bensussan, L. Boumsell, C. Buckley, G. Clark, K. Drbal, P. Engel, D. Hart, V. Horejsi, C. Isacke, P. Macardle, F. Malavasi, D. Mason, D. Olive, A. Saalmueller, S. F. Schlossman, R. Schwartz-Albiez, P. Simmons, T. F. Tedder, M. Uguccioni, and H. Warren, "CD molecules 2005: human cell differentiation molecules," Blood, vol. 106, pp. 3123-3126, 2005.