Fast search for Dirichlet process mixture models
Dirichlet process (DP) mixture models provide a flexible Bayesian framework for density estimation. Unfortunately, their flexibility comes at a cost: inference in DP mixture models is computationally expensive, even when conjugate distributions are used. In the common case when one seeks only a maximum a posteriori assignment of data points to clusters, we show that search algorithms provide a practical alternative to expensive MCMC and variational techniques. When a true posterior sample is desired, the solution found by search can serve as a good initializer for MCMC. Experimental results show that using these techniques it is possible to apply DP mixture models to very large data sets.
💡 Research Summary
Dirichlet‑process (DP) mixture models are a powerful non‑parametric Bayesian tool for density estimation because they automatically infer the number of mixture components. Unfortunately, this flexibility comes with a heavy computational burden. Traditional inference methods—Gibbs sampling (Neal 1998) and variational Bayes (Blei & Jordan 2005)—require many iterations, complex bookkeeping, and often scale poorly to tens of thousands of observations.
The authors observe that in many practical applications the user does not need a full posterior sample of cluster assignments; instead, a single maximum‑a‑posteriori (MAP) clustering is sufficient. They therefore propose to replace stochastic sampling with deterministic search algorithms (A* and beam search) that directly maximize the posterior probability of a clustering. The key insight is that the DP prior admits a closed‑form expression for the probability of a particular partition (Antoniak 1974) and that, when the base distribution G₀ is conjugate to the likelihood, the marginal likelihood of each cluster can also be computed analytically. By combining these two terms we obtain the exact posterior p(c, x)=p(c) p(x | c).
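This joint score can be illustrated concretely for a DP mixture of multinomials. The sketch below is not taken from the paper: the prior term is the standard Chinese-restaurant-process partition probability, the per-cluster marginal H is the Dirichlet–multinomial evidence under an assumed symmetric Dirichlet(β) base measure, and per-document multinomial coefficients are dropped since they are constant across clusterings.

```python
import math

def log_crp_prior(sizes, alpha):
    """Log CRP partition probability p(c | alpha) for given cluster sizes."""
    n = sum(sizes)
    return (len(sizes) * math.log(alpha)
            + math.lgamma(alpha) - math.lgamma(alpha + n)
            + sum(math.lgamma(s) for s in sizes))  # lgamma(s) = log (s-1)!

def log_dm_marginal(counts, beta):
    """Log Dirichlet-multinomial evidence H for one cluster's pooled
    word counts, under a symmetric Dirichlet(beta) base measure."""
    b = beta * len(counts)
    return (math.lgamma(b) - math.lgamma(b + sum(counts))
            + sum(math.lgamma(beta + c) - math.lgamma(beta) for c in counts))

def log_joint(clusters, alpha, beta):
    """log p(c, x) = log p(c) + sum over clusters k of log H(x_{c=k}).
    `clusters` is a list of clusters, each a list of word-count vectors."""
    vocab = len(clusters[0][0])
    score = log_crp_prior([len(cl) for cl in clusters], alpha)
    for cl in clusters:
        pooled = [sum(doc[v] for doc in cl) for v in range(vocab)]
        score += log_dm_marginal(pooled, beta)
    return score
```

Under this score, grouping identical documents into one cluster beats splitting them, exactly the behavior the search exploits.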
The generic search algorithm maintains a priority queue of partial clusterings (prefixes). At each step it removes the most promising prefix, expands it by assigning the next data point either to an existing cluster or to a brand‑new cluster, scores each resulting child with a heuristic function g, and re‑inserts the children into the queue. A beam size b can be imposed to keep the queue bounded; b = ∞ with an admissible heuristic yields an exact A* search, while a finite b yields an approximate beam search that trades optimality for speed.
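A minimal sketch of this search is below. It is a simplified, level-synchronous variant rather than the paper's single priority queue over mixed-length prefixes: all prefixes of length n are expanded before any of length n+1, and the generic `score` callback stands in for whichever heuristic g is chosen.

```python
import heapq

def beam_search(points, score, beam=10):
    """Level-synchronous beam search over cluster assignments.

    `score(assignment, n)` returns the heuristic value g of a prefix
    assigning the first n points (higher is better).  Each assignment
    is a tuple whose i-th entry is the cluster index of point i.
    """
    frontier = [(0.0, ())]           # entries are (-g, assignment)
    for n in range(len(points)):
        children = []
        for _neg_g, assign in frontier:
            k_new = max(assign, default=-1) + 1
            # Try every existing cluster plus one brand-new cluster.
            for k in range(k_new + 1):
                child = assign + (k,)
                children.append((-score(child, n + 1), child))
        # Keep only the `beam` highest-scoring prefixes of length n+1.
        frontier = heapq.nsmallest(beam, children)
    return min(frontier)[1]          # best complete assignment
```

With a score that rewards fewer (or more) clusters, the search recovers the corresponding extreme partition, which is a cheap sanity check before plugging in the real DP posterior score.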
Three scoring functions are defined:
- Trivial (admissible) – uses only the likelihood of already assigned points, i.e., g₁(c₀)=∏_{k∈c₀} H(x_{c₀=k}). This is a valid upper bound but provides no look‑ahead, leading to a huge search tree.
- Tighter (admissible) – treats each unassigned point independently and assumes it will be placed in the cluster (existing or new) that gives the highest marginal likelihood. The resulting bound g₂ is much tighter than g₁ while still guaranteeing admissibility, so A* with g₂ still finds the true MAP if the beam is unlimited.
- Inadmissible – assumes every remaining point will start a new cluster, yielding g₃(c₀)=g₁(c₀) · ∏_{n>N₀} H(x_n). This dramatically overestimates the true posterior, causing the search to prune aggressively. Although optimality is no longer guaranteed, empirical results show that g₃ combined with a modest beam (e.g., b = 10) produces the fastest runtimes with only negligible loss in solution quality. The authors also find that ordering the data by increasing marginal likelihood improves the performance of the inadmissible heuristic.
In addition to the heuristic, the authors present an efficient method for maximizing the prior term p(c). They work directly with the count vector m, whose entry m_i is the number of clusters containing exactly i points, and use Antoniak’s formula to compute p(m|α,N). Adding a new observation either creates a new singleton cluster (incrementing m₁) or enlarges an existing cluster of size ℓ (decrementing m_ℓ and incrementing m_{ℓ+1}). By greedily choosing the action that yields the larger increase in the prior term, they compute the optimal p(c) for any prefix in O(N) time, with additional caching to handle very large N.
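A simplified version of this greedy rule can be written directly in terms of cluster sizes rather than the paper's m-vector bookkeeping (the helper name `best_prior_sizes` is hypothetical). Under the CRP factorization of p(c), opening a new singleton multiplies the prior by a factor proportional to α, while growing a size-ℓ cluster multiplies it by ℓ, so each step simply compares α against the largest existing cluster:

```python
def best_prior_sizes(n, alpha):
    """Greedily build the cluster-size profile maximizing the
    unnormalized log CRP prior: a new singleton contributes log(alpha),
    while growing a size-l cluster contributes log(l), so each step
    compares alpha against the size of the largest existing cluster."""
    sizes = []
    for _ in range(n):
        largest = max(sizes, default=0)
        if not sizes or alpha > largest:
            sizes.append(1)                   # open a new singleton
        else:
            sizes[sizes.index(largest)] += 1  # grow the largest cluster
    return sorted(sizes, reverse=True)
```

For small α the greedy rule pours everything into one cluster; for large α it keeps opening singletons, matching the usual behavior of the DP concentration parameter.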
The experimental evaluation covers three domains:
- Synthetic Gaussian/Dirichlet‑multinomial data – data sets ranging from 5k to 50k points. The authors compare Gibbs sampling, split‑merge Metropolis–Hastings, and all six search variants (three scoring functions, each with full search and beam size 10). Results show that the inadmissible‑beam method achieves the lowest negative log‑likelihood and highest F‑score, and reduces runtime and queue size by one to two orders of magnitude relative to sampling methods.
- MNIST handwritten digits – 10k images (784‑dimensional). The beam search finds a high‑quality MAP clustering in under 30 seconds, whereas Gibbs sampling would require many minutes to converge.
- NIPS paper abstracts – several thousand documents modeled with a DP mixture of multinomials. Again, beam search with the tighter heuristic attains topic coherence comparable to sampling at dramatically lower computational cost.
A further experiment demonstrates that initializing Gibbs sampling with the MAP clustering obtained from the search dramatically speeds up convergence, confirming that the search output is a useful warm‑start.
In summary, the paper shows that when only a MAP clustering is required, deterministic search algorithms—particularly beam search with a carefully chosen heuristic—provide a practical, scalable alternative to traditional MCMC and variational inference for Dirichlet‑process mixture models. The approach retains the exact DP prior, works with any conjugate exponential‑family likelihood, and can be combined with sampling methods to improve their efficiency. Future work may explore non‑conjugate extensions, online streaming variants, and theoretical analysis of the inadmissible heuristic’s error bounds.