Colouring and breaking sticks: random distributions and heterogeneous clustering

We begin by reviewing some probabilistic results about the Dirichlet Process and its close relatives, focussing on their implications for statistical modelling and analysis. We then introduce a class of simple mixture models in which clusters are of …

Authors: Peter J. Green

The purpose of this note is four-fold: to remind some Bayesian nonparametricians gently that closer study of some probabilistic literature might a School of Mathematics, University of Bristol, Bristol BS8 1TW; P.J.Green@bristol.ac.uk be rewarded, to encourage probabilists to think that there are statistical modelling problems worth of their attention, to point out to all another important connection between the work of John Kingman and modern statistical methodology (the role of the coalescent in population genetics approaches to statistical genomics being the most important example; see papers by Donnelly, Ewens and Griffiths in this volume), and finally to introduce a modest generalisation of the Dirichlet process. The most satisfying basis for statistical clustering of items of data is a probabilistic model, which usually takes the form of a mixture model, broadly interpreted. In most cases, the statistical characteristics of each cluster or mixture component are the same, so that cluster identities are a priori exchangeable. In Section 5 we will introduce a class of simple mixture models in which clusters are of different categories, or colours as we shall call them, with statistical characteristics that are constant within colours, but different between colours. Thus cluster identities are exchangeable only within colours. 2 Mixture models and the Dirichlet process Many statistical models have the following character. Data {Y i } are available on n units that we shall call items, indexed i = 1, 2, . . . , n. There may be item-specific covariates, and other information, and the distribution of each Y i is determined by an unknown parameter φ i ∈ Ω, where we will take Ω here to be a subset of a Euclidean space. Apart from the covariates, the items are considered to be exchangeable, so we assume the {Y i } are conditionally independent given {φ i }, and model the {φ i } as exchangeable random variables. Omitting covariates for simplicity, we write It is natural to take {φ i } to be independent and identically distributed random variables, with common distribution G, where G itself is unknown, and treated as random. We might be led to this assumption whether we are thinking of a de Finetti-style representation theorem (Finetti (1931(Finetti ( , 1937)); see also Kingman (1978), Kallenberg (2005)), or by following hierarchical modelling principles (Gelman et al., 1995;Green et al., 2003), Thus, unconditionally, Y i |G ∼ f (•|φ)G(dφ), independently given G. This kind of formulation enables us to borrow strength across the units in inference about unknown parameters, with the aim of controlling the degrees of freedom, capturing the idea that while the {φ i } may be dif-ferent from item to item, we nevertheless understand that, through exchangeability, knowing the value of one of them would tell us something about the others. There are still several options. One is to follow a standard parametric formulation, and to assume a specific parametric form for G, with parameters, or rather 'hyperparameters', in turn given a hyperprior distribution. However, many would argue that in most practical contexts, we would have little information to build such a model for G, which represents variation in the population of possible items of the parameter φ that determines the distribution of the data Y . Thus we would be led to consider more flexible models, and one of several approaches might occur to us: • a nonparametric approach, modelling uncertainty about G without making parametric assumptions; • a mixture model representation for G; • a partition model, where the {φ i } are grouped together, in a way determined a posteriori by the data. One of the things we will find, below, is that taking natural choices in each of these approaches can lead to closely related formulations in the end, so long as both modelling and inference depend solely on the {φ i }. These connections, not novel but not entirely well-known either, shed some light on the nature and implications of the different modelling approaches. Much Bayesian nonparametric distributional modelling (Walker et al., 1999) begins with the Dirichlet process (Ferguson, 1973). Building on earlier work by Dubins, Freedman and Fabius, Ferguson intended this model to provide a nonparametric prior model for G with a large support, yet one remaining capable of tractable prior-to-posterior analysis. Given a probability distribution G 0 on an arbitrary measure space Ω, and a positive real θ, we say the random distribution G on Ω follows a Dirichlet process, where Dirichlet(α 1 , α 2 , . . . , α m ) denotes the distribution on the m-dimensional simplex with density at (x 1 , x 2 , . . . , x m ) proportional to m j=1 x αj -1 j . The base measure G 0 gives the expectation of G: (Kingman, 1967;Ferguson, 1973;Blackwell, 1973;Kingman, 1975), so i.i.d. draws {φ i , i = 1, 2, . . . , n} from G exhibit ties. The parameter θ measures (inverse) concentration: given i.i.d. draws {φ i , i = 1, 2, . . . , n} from G, • as θ → 0, all φ i are equal, a single draw from G 0 ; • as θ → ∞, the φ i are drawn i.i.d. from G 0 . A draw G from a Dirichlet process is a discrete distribution on Ω, so an alternative way to define the Dirichlet process would be via a construction of such a random distribution, through specification of the joint distribution of the locations of the atoms, and their probabilities. Such a construction was given by Ferguson (1973): in this, the locations are i.i.d. draws from G 0 , with probabilities forming a decreasing sequence constructed from increments of a gamma process. This is not the explicit construction that is most commonly used today, which is that known in the Bayesian nonparametric community as Sethuraman's stick-breaking model (Sethuraman and Tiwari, 1982;Sethuraman, 1994). This leads to this algorithm for generating the {φ i }: This construction can be found considerably earlier in the probability literature, especially in connection with models for species sampling. The earliest reference seems to be in McCloskey (1965); for more readily accessible sources, see Patil and Taillie (1977) and Donnelly and Joyce (1989), where it is described in the context of size-biased sampling and the GEM (Generalised Engen-McCloskey) distributions. See also Section 3.3 below. A more direct, classical approach to modelling the distribution of Y in a flexible way would be to use a finite mixture model. Suppose that Y i are i.i.d. with density j w j f 0 (•|φ ⋆ j ) for a prescribed parametric density family f 0 (•|φ), and consider a Bayesian formulation with priors on the component weights {w j } and the component-specific parameters {φ ⋆ j }. The simplest formulation (e.g. Richardson and Green (1997)) uses a Dirichlet prior on the weights, and takes the {φ ⋆ j } to be i.i.d. a priori, but with arbitrary distribution, so in algorithmic form: 1. draw (w 1 , w 2 , . . . , w k ) ∼ Dirichlet(δ, . . . , δ); 2. draw c i ∈ {1, 2, . . . , k} with P {c i = j} = w j , i.i.d., i = 1, . . . , n; It is well known that if we take the limit k → ∞, δ → 0 such that kδ → θ, then the joint distribution of the {φ i } is the same as that obtained via the Dirichlet process formulation in the previous subsections (see for example Green and Richardson (2001)). This result is actually a corollary of a much stronger statement due to Kingman (1975), about the convergence of discrete probability measures. For more recent results in this direction see Muliere and Secchi (2003) and Ishwaran and Zarepour (2002). We are still using the formulation Y i |G ∼ f (•|φ)G(dφ), independently given G, but note that G is invisible in this view; it has implicitly been integrated out. Suppose that, as above, G is drawn from DP (θ, G 0 ), and then {φ i : i = 1, 2, . . . , n} drawn i.i.d. from G. We can exploit the conjugacy of the Dirichlet with respect to multinomial sampling to integrate out G. For a fixed partition {B j } m j=1 of Ω, and integers c i ∈ {1, 2, . . . , m}, we can write where n j = #{i : c i = j}. The jth factor in the product above is 1 if n j = 0, and otherwise θG 0 (B j )(θG 0 (B j )+1)(θG 0 (B j )+2) . . . (θG 0 (B j )+ n j -1), so we find that if the partition becomes increasingly refined, and G 0 is non-atomic, then the joint distribution of the {φ i } can equivalently be described by 1. partitioning {1, 2, . . . , n} = d j=1 C j at random, so that where n j = #C j ; 2. drawing φ ⋆ j ∼ G 0 , i.i.d., j = 1, . . . , d, and then 3. setting Note that the partition model (2.1) shows extreme preference for unequal cluster sizes. If we let a r = #{j : n j = r}, then the joint distribution of (a 1 , a 2 , . . .) is This is equation (A3) of Ewens (1972), derived in a context where n j is the number of genes in a sample of the jth allelic type, in sampling from a selectively neutral population process. The first factor in (2.2) is the multinomial coefficient accounting for the number of ways the n items can be allocated to clusters of the required sizes, and the second factor accounts for the different sets of {n 1 , n 2 . . . , n d } leading to the same (a 1 , a 2 , . . .). Multiplying all this together, a little manipulation leads to the familiar Ewens sampling formula: See also Kingman (1993), page 97. This representation of the partition structure implied by the Dirichlet process was derived by Antoniak (1974), in the form (2.3). He noted that a consequence of this representation is that the joint distribution of the {φ i } given d is independent for θ; thus given observed {φ i }, d is sufficient for θ. A similar observation was also made by Ewens (1972) in the genetics context of his work. Note that as in the previous section, G has been integrated out, and so is invisible in this view of the Dirichlet process model. Whichever of the points of view is taken, items are clustered, according to a tractable distribution parametrised by θ > 0, and for each cluster the cluster-specific parameter φ is an independent draw from G 0 . Much statistical methodology built on the Dirichlet-process model uses only this joint distribution of the {φ i }, and so should hardly be called 'nonparametric'. Of course, even though G itself is invisible in two of the derivations above, the Dirichlet-process model does support inference about G, but this is seldom exploited in applications. In what follows, we will need to make use of different notations for the random partition induced by the Dirichlet-process model, or its relatives. We will variously use Note that the first of these makes no use of the (arbitrary) labelling of the clusters used in the second and third. We have to take care with multiplicities, and the distinction between (labelled) allocations and (unlabelled) partitions. Lack of space precludes a thorough discussion of the huge statistical methodology literature exploiting the Dirichlet process in Bayesian nonparametric procedures, so we will only review a few highlights. Lo (1984) proposed density estimation procedures devised by mixing a user-defined kernel function K(y, u) with respect to a Dirichlet process; thus i.i.d. data {Y i } are assumed distributed as K(•, u)G(du) with G drawn from a Dirichlet process. This is now known as the Dirichlet process mixture model (a better terminology than the formerly-used 'mixture of Dirichlet processes'). The formulation is identical to that we started with in Section 2, but for the implicit assumption that y and u lie in the same space, and that the kernel K(•, u) is a unimodal density located near u. In the 1990s there was a notable flourishing of applied Bayesian nonparametrics, stimulated by interest in the Dirichlet process, and the rapid increase in computational power available to researchers, allowing almost routine use of the Pólya urn sampler approach (see Section 4) to posterior computation. For example, Escobar (1994) re-visited the Normal Means problem, West et al. (1994) discussed regression and density estimation, and Escobar and West (1995) further developed Bayesian density estimation. Müller et al. (1996) ingeniously exploited multivariate density estimation using Dirichlet process mixtures to perform Bayesian curve fitting of one margin on the others. Let us consider a substantial and more specific application in some detail, to motivate the Dirichlet process (DP) set-up as a natural elaboration of a standard parametric Bayesian hierarchical model approach. A remarkable aspect of modern microbiology has been the dramatic development of novel high-throughput assays, capable of delivering very high dimensional quantitative data on the genetic characteristics of organisms from biological samples. One such technology is the measurement of gene expression using Affymetrix gene chips. In Lau and Green (2007), we work with possibly replicated gene expression measures. The data are {Y isr }, indexed by • conditions s = 1, 2, . . . , S, and Typically R s is very small, S is much smaller than n, and the 'conditions' represent different subjects, different treatments, or different experimental settings. We suppose there is a k-dimensional (k ≤ S) covariate vector x s describing each condition, and model parametric dependence of Y on x; the focus of interest is on the pattern of variation in these gene-specific parameters across the assayed genes. Although other variants are easily envisaged, we suppose here that Here φ i = (β i , τ i ) ∈ R k+1 is a gene-specific parameter vector characterising the dependence of gene expression on the condition-specific covariates. A priori, the genes can be considered exchangeable, and a standard hierarchical formulation would model the {φ i } as i.i.d. draws from a parametric prior distribution G, say, whose (hyper)parameters have unknown values. This set-up allows borrowing of strength across genes in the interest of stability and efficiency of inference. The natural nonparametric counterpart to this would be to suppose instead that G, the distribution describing variation of φ across the population of genes, does not have prescribed parametric form, but is modelled as a random distribution from a 'nonparametric' prior such as the Dirichlet process, specifically A consequence of this assumption, as we have seen, is that G is atomic, so that the genes will be clustered together into groups sharing a common value of φ. A posteriori we obtain a probabilistic clustering of the gene expression profiles. Lau and Green ( 2007) take a standard normal-inverse Gamma model, This is a conjugate set-up, so that (β, τ ) can be integrated out in each cluster. This leads easily to explicit within-cluster parameter posteriors: where The marginal likelihoods p(Y Cj ) are multivariate t distributions. We continue this example later, in Sections 5.4 and 5.5. Viewed as a nonparametric model or as a basis for probabilistic clustering, the Dirichlet process is simple but inflexible-a single real parameter θ controls both variation and concentration, for example. And although the space Ω where the base measure G 0 lies and in which φ lives can be rather general, it is essentially a model for 'univariate' variation and unable to handle in a flexible way, for example, time-series data. Driven both by such considerations of statistical modelling (Walker et al., 1999), or curious pursuit of more general mathematical results, the Dirichlet process has proved a fertile starting point for numerous generalisations, and we touch on just a few of these here. The Poisson-Dirichlet distribution and its two-parameter generalisation. Kingman (1975) observed and exploited the fact that the limiting behaviour of random discrete distributions could become nontrivial and accessible through permutation of the components to be in ranked (decreasing) order. The limit law is the Poisson-Dirichlet distribution, implicitly defined and later described (Kingman, 1993, page 98) as 'rather less than user-friendly'. Donnelly and Joyce (1989) elucidated the role of both ranking and size-biased sampling in establishing limit laws for random distributions; see also Holst (2001) and Arratia et al. (2003, page 107). The twoparameter generalisation of the Poisson-Dirichlet model was discovered by Pitman and co-workers, see for example Pitman and Yor (1997). This has been a rich topic for probabilistic study to the present day; see chapters by Gnedin, Haulk and Pitman, and by Aldous in this volume. The simplest view to take of the two-parameter Poisson-Dirichlet model PD(α, θ) is to go back to stick-breaking (Section 2.2) and replace the Beta(1, θ) distribution for the variables V j there by Beta(1α, θ + jα). Ishwaran and James (2001) have considered Bayesian statistical applications of stick-breaking priors defined in this way, and implementation of Gibbs sampling for computing posterior distributions. Dirichlet process relations in structured dependent models. Motivated by the need to build statistical models for structured data of various kinds, there has been a huge effort in generalising Dirichlet process models for such situations-indeed, there is now an 'xDP' for nearly every letter of the alphabet. This has become a rich and sometimes confusing area; perhaps the most important current models are Dependent Dirichlet processes (MacEachern, 1999;MacEachern et al., 2001), Order-based dependent Dirichlet processes (Griffin and Steel, 2006), Hierarchical Dirichlet processes (Teh et al., 2006), and Kernel stick breaking processes (Dunson and Park, 2007). Many of the models are based on stick-breaking representations, but in which the atoms and/or the weights for the representations of different components of the process are made dependent on each other, or on covariates. The new book by Hjort et al. (2010) provides an excellent introduction and review of these developments. Pólya trees. Ferguson's definition of the Dirichlet process focussed on the (random) probabilities to be assigned to arbitrary partitions (Section 2.1). As we have seen, the resulting distributions G are almost surely discrete. An effective way to modify this process to control continuity properties is to limit the partitions to which elementary probabilities are assigned, and in the case of Pólya tree processes this is achieved by imposed a fixed binary partition of Ω, and assigning probabilities to successive branches in the tree through independent Beta distributions. The parameters of these distributions can be set to obtain various degrees of smoothness of the resulting G. This approach, essentially beginning with Ferguson himself, has been pursued by Lavine (1992Lavine ( , 1994)); see also Walker et al. (1999). There is a huge literature on Markov chain Monte Carlo methods for posterior sampling in Dirichlet mixture models (MacEachern, 1994;Escobar and West, 1995;Müller et al., 1996;MacEachern and Müller, 1998;Neal, 2000;Green and Richardson, 2001). Although these models have 'variable dimension', the posteriors can be sampled without necessarily using reversible jump methods (Green, 1995). Cases where G 0 is not conjugate to the data model f (•|φ) demand keeping {φ i } in the state vector, to be handled through various augmentation or reversible jump schemes. In the conjugate case, however, it is obviously appealing to integrate φ out, and target the Markov chain on the posterior solely of the partition, generating φ values subsequently as needed. To discuss this, we first go back to probability theory. The Pólya urn is a simple and well-known discrete probability model for a reinforcement process: coloured balls are drawn sequentially from an urn; after each is drawn it is replaced, together with a new ball of the same colour. This idea can be seen in a generalised form, in a recursive definition of the joint distribution of the {φ i }. Suppose that for each n = 0, 1, 2, . . . , where θ > 0, G 0 is an arbitrary probability distribution, and δ φ is a point probability mass at φ. Blackwell and MacQueen (1973) termed such a sequence a Pólya sequence; they showed that the conditional distribution on the right hand side of (4.1) converges to a random probability distribution G distributed as DP (θ, G 0 ), and that, given G, φ 1 , φ 2 , . . . are i.i.d. distributed as G. See also Antoniak (1974) and Pitman (1995). Thus we have yet another approach to defining the Dirichlet process, at least in so far as specifying the joint distribution of the {φ i } is concerned. This representation has a particular role, of central importance in computing inferences in DP models. This arises directly from (4.1) and the exchangeability of the {φ i }, for it follows that where φ -i means {φ j : j = 1, 2, . . . , n, j = i}. In this form, the statement has an immediate role as the full conditional distribution for each component of (φ i ) n i=1 , and hence defines a Gibbs sampler update in a Markov chain Monte Carlo method aimed at this target distribution. By conjugacy this remains true, with obvious changes set out in the next section, for posterior sampling as well. The Pólya urn representation of the Dirichlet process has been the point of departure for yet another class of probability models, namely species sampling models (Pitman, 1995(Pitman, , 1996)), that are beginning to find a use in statistical methodology (Ishwaran and James, 2003). We will consider posterior sampling in the conjugate case in a more general setting, specialising back to the Dirichlet process mixture case later. The set-up we will assume is based on a partition model: it consists of a prior distribution p(c|θ) on partitions c of {1, 2, . . . , n} with hyperparameter θ, together with a conjugate model within each cluster. The prior on the cluster-specific parameter φ j has hyperparameter ψ, and is conjugate to the likelihood, so that for any subset C ⊆ {1, 2, . . . , n}, p(Y C |ψ) is known explicitly, where C is the subvector of (Y i ) n i=1 with indices in C. We have We first consider only re-allocating a single item at a time (the singlevariable Gibbs sampler for c i ). Then repeatedly we withdraw an item, say i, from the model, and reallocate it to a cluster according to the full conditional for c i , which is proportional to p(c|Y, θ, ψ). It is easy to see that we have two choices: where c i→⋆ denotes the current partition c with i moved to C ⋆ , or • allocate Y i to cluster C -i j , with probability where c i→j denotes the partition c, with i moved to cluster C j . The ratio of marginal likelihoods p(Y |ψ) in the second expression can be interpreted as the posterior predictive distribution of Y i given those observations already allocated to the cluster, i.e. p(Y i |Y C -i j , ψ) (a multivariate t for the Normal-inverse gamma set-up from Section 3.2). For Dirichlet mixtures we have, from (2.1), where n j = #C j and c = (C 1 , C 2 , . . . , C d ), so the re-allocation probabilities are explicit and simple in form. But the same sampler can be used for many other partition models, and the idea is not limited to moving one item at a time. All we require of the model for the Pólya urn sampler to be available for posterior simulation are that 1. a partition c of {1, 2, . . . , n} is drawn from a prior distribution with parameter θ; 2. conditionally on c, parameters (φ 1 , φ 2 , . . . , φ d ) are drawn independently from a distribution G 0 (possibly with a hyperparameter ψ); 3. conditional on c and on φ = (φ 1 , φ 2 , . . . , φ d ), {y 1 , y 2 , . . . , y n } are drawn independently, from not necessarily identical distributions p(y i |c, φ) = f i (y i |φ j ) for i ∈ C j , for which G 0 is conjugate. If these all hold, then the Pólya urn sampler can be used; we see from Section 4.2 that it will involve computing only marginal likelihoods, and ratios of the partition prior, up to a multiplicative constant. The first factor depends only on G 0 and the likelihood, the second only on the partition model. Examples. p(c i→⋆ |θ) and p(c i→j |θ) are proportional simply to • θ and #C -i j for the DP mixture model, • (k-d(c -i ))δ and #C -i j +δ for the Dirichlet-multinomial finite mixture model, • θ+αd(c -i ) and #C -i j -α for the Kingman-Pitman-Yor two-parameter Poisson-Dirichlet process (Section 3.3). It is curious that the ease of using the Pólya urn sampler has often been cited as motivation to use Dirichlet process mixture models, when the class of models for which it is equally readily used is so wide. There is no need to restrict to updating only one c i at a time: the idea extends to simultaneously re-allocating any subset of items currently in the same cluster. The notation can be rather cumbersome, but again the subset forms a new cluster, or moves to an existing cluster, with relative probabilities that are each products of two terms: • the relative (new) partition prior probabilities, and • the predictive density of the moved set of item data, given those already in the receiving cluster. A more sophisticated variant on this scheme has been proposed by Nobile and Fearnside (2007), and studied in the case of finite mixture models. For the remainder of this note, we focus on the use of these models for clustering, rather than density estimation or other kinds of inference. There needs to be a small caveat-mixture models are commonly used either for clustering, or for fitting non-standard distributions; in a problem demanding both, we cannot expect to be able meaningfully to identify clusters with the components of the mixture, since multiple components may be needed to fit the non-standard distributional shape within each cluster. Clustered Dirichlet process methodology in which there is clustering at two levels that can be used for such a purpose is under development by Dan Merl and Mike West at Duke (personal communication). Here we will not pursue this complication, and simply consider a mixture model used for clustering in the obvious way. In many domains of application, practical considerations suggest that the clusters in the data do not have equal standing; the most common such situation is where there is believed to be a 'background' cluster, and one or several 'foreground' clusters, but more generally, we can imagine there being several classes of cluster, and our prior beliefs are represented by the idea that cluster labels are exchangeable within these classes, but not overall. It would be common, also, to have different beliefs about cluster-specific parameters within each of these classes. In this section, we present a variant on standard mixture/cluster models of the kinds we have already discussed, aimed at modelling this situation of partial exchangeability of cluster labels. We stress that it will remain true that, a priori, item labels are exchangeable, and that we have no prior information that particular items are drawn to particular classes of cluster; the analysis is to be based purely on the data {Y i }. We will describe the class of a cluster henceforth as its 'colour'. To define a variant on the DP in which not all clusters are exchangeable: 1. for each 'colour' k = 1, 2, . . . , draw G k from a Dirichlet process DP(θ k , G 0k ), independently for each k; 2. draw weights (w k ) from the Dirichlet distribution Dir(γ 1 , γ 2 , . . .), independently of the This process, denoted CDP({(γ k , θ k , G 0k )}), is a Dirichlet mixture of Dirichlet processes (with different base measures), k w k DP(θ k , G 0k ), with the added feature that the the colour of each cluster is identified (and indirectly observed), while labelling of clusters within colours is arbitrary. It can be defined by a 'stick-breaking-and-colouring' construction: 1. colour segments of the stick using the Dirichlet({γ k })-distributed weights; 2. break each coloured segment using an infinite sequence of independent Beta(1, Note that in contrast to other elaborations to more structured data of the Dirichlet process model, in which the focus is on nonparametric analysis and more sharing of information would be desirable, here, where the focus is on clustering, we are content to leave the atoms and weights within each colour completely uncoupled a priori. The coloured Dirichlet process (CDP) generates the following partition model: partition {1, 2, . . . , n} = k d k j=1 C kj at random, where C kj is the jth cluster of colour k, so that where It is curious to note that this expression simplifies when θ k ≡ γ k , although such a choice seems to have no particular significance in the probabilistic construction of the model. Only when it is also true that the θ k are independent of k (and the colours are ignored) does the model degenerate to an ordinary Dirichlet process. The clustering remains exchangeable over items. To complete the construction of the model, analogously to Section 2.4, for i ∈ C kj , we set k i = k and φ i = φ ⋆ j , where φ ⋆ j are drawn i.i.d. from G 0k . The explicit availability of the (coloured) partition distribution immediately allows generalisation of the Pólya-urn Gibbs sampler to the CDP. In reallocating item i, let n -i kj denote the number among the remaining items currently allocated to C kj , and define n -i k accordingly. Then reallocate i to Again, the expressions simplify when θ k ≡ γ k . In many applications of probabilistic clustering, including the gene expression example from Section 3.2, it is natural to suppose a 'background' cluster that is not a priori exchangeable with the others. One way to think about this is to adapt the 'limit of finite mixtures' view from Section 2.3: 1. draw (w 0 , w 1 , w 2 , . . . , w k ) ∼ Dirichlet(γ, δ, . . . , δ); 2. draw c i ∈ {0, 1, . . . , k} with P {c i = j} = w j , i.i.d., i = 1, . . . , n; Now let k → ∞, δ → 0 such that kδ → θ, but leave γ fixed. The cluster labelled 0 represents the 'background'. The background cluster model is a special case of the CDP, specifically CDP({(γ, 0, H 0 ), (θ, θ, G 0 )}). The two colours correspond to the background and regular clusters. The limiting-case DP(0, H 0 ) is a point mass, randomly drawn from H 0 . We can go a little further in a regression setting, and allow different regression models for each colour. The Pólya urn sampler for prior or posterior simulation is readily adapted. When re-allocating item i, there are three kinds of choice: a new cluster C ⋆ , the 'top table' C 0 , or a regular cluster C j , j = 0: the corresponding prior probabilities p(c i→⋆ |θ), p(c i→0 |θ) and p(c i→j |θ) are proportional to θ, (γ + #C -i 0 ) and #C -i j for the background cluster CDP model. As a practical illustration of the use of the CDP background cluster model, we discuss a regression set-up that expresses a vector of measurements y i = (y i1 , . . . , y iS ) for i = 1, . . . , n, where S is the number of samples, as a linear combination of known covariates, (z 1 • • • z S ) with dimension K ′ and (x 1 • • • x S ) with dimension K. These two collections of covariates, and the corresponding regression coefficients δ j and β j , are distinguished since we wish to hold one set of regression coefficients fixed in the background cluster. We assume . . . where ǫ j ∼ N (0 S×1 , τ -1 j I S×S ), 0 S×1 is the S-dimension zero vector and I S×S is the order-S identity matrix. Here, δ j , β j and τ j are clusterspecific parameters. The profile of measurements for individual i is and the cluster j, the parameters/latent variables are and τ j . The kernel is now represented as k(y i |δ j , β j , τ j ), which is a multivariate Normal density, In particular, we take different probability measures, the parameters of heterogeneous DP, for the background and regular clusters, Here H 0 is a probability measure that includes a point mass at δ 0 and a Normal-Gamma density for β 0 and τ -1 0 . On the other hand, we take G 0 to be a probability measure that is a Normal-Gamma density for (δ ′ j , β ′ j ) ′ and τ -1 j . Thus the regression parameters corresponding to the z covariates are held fixed at δ 0 in the background cluster, but not in the others. We will first discuss the marginal distribution for the regular clusters. Given τ j , (δ ′ j , β ′ j ) ′ follows the (K ′ + K)-dimensional multivariate Normal with mean m and variance (τ j t) -1 and τ j follows the univariate Gamma with shape a and scale b. We denote the joint distribution G 0 (d(δ ′ j , β ′ j ) ′ , dτ j ) as a joint Gamma and Normal distribution, Normal-Gamma( a, b, m, t), and further we take Based on this set-up, we have where vector, Z Cj is a e j S ×K ′ matrix and X Cj is a e j S ×K matrix. Moreover, m G0 (y Cj ) is a multivariate t density with mean Z and degree of freedom 2 a. For the background cluster, we take H 0 to be a joint Gamma and Normal distribution, Normal-Gamma(a, b, m β , t β ). The precision τ 0 follows the univariate Gamma with shape a and scale b. Given τ 0 , β 0 follows the K-dimension multivariate Normal with mean m β and variance (τ 0 t β ) -1 and τ 0 follows the univariate Gamma with shape a and scale b. The mar-ginal distribution becomes and degree of freedom 2a. In some applications, the xs and βs are not needed and so can be omitted, and we consider the following model, (5.5) here we assume that K = 0 or [x 1 • • • x S ] ′ = 0 S×K where 0 S×K is the S × K matrix with all zero entries of the model (5.1). We can derive the marginal distributions analogous to (5.3) and (5.4), (5.7) Here t ν (x |µ, Σ ) is a multivariate t density in d dimensions with mean µ and scale Σ with degrees of freedom ν; (5.8) We consider the application of this methodology to data from a gene expression time-course experiment. Wen et al. (1998) studied the central nervous system development of the rat; see also Yeung et al. (2001). The mRNA expression levels of 112 genes were recorded over the period of development of the central nervous system. In the dataset, there are 9 records for each gene over 9 time points; they are from embryonic days 11, 13, 15, 18, 21, postnatal days 0, 7, 14, and the 'adult' stage (postnatal day 90). - In their analysis, Wen et al. (1998) obtained 5 clusters/waves (totally 6 clusters), taken to characterize distinct phases of development. The data set is available at http://faculty.washington.edu/kayee/cluster/GEMraw.txt. We take S = 9 and K ′ = 5. The design matrix of covariates is taken to be 1 1 1 1 1 0 0 0 0 11 13 15 18 21 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 7 14 0 0 0 0 0 0 0 0 0 1 representing piecewise linear dependence on time, within three separate phases (embryonic, postnatal and adult). In our analysis of these data, we take θ = 1, γ = 5, a = a = 0. and run for 20000 sweeps starting from the partition consisting of all singleton clusters, 10000 being discarded as burn-in. We then use the last 10000 partitions sampled as in Lau and Green (2007), to estimate the optimal Bayesian partition on a decision-theoretic basis, using a pairwise coincidence loss function that equally weights false 'positives' and 'negatives'. We present some views of the resulting posterior analysis of this data set.

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment