Learning Determinantal Point Processes
Determinantal point processes (DPPs), which arise in random matrix theory and quantum physics, are natural models for subset selection problems where diversity is preferred. Among many remarkable properties, DPPs offer tractable algorithms for exact inference, including computing marginal probabilities and sampling; however, an important open question has been how to learn a DPP from labeled training data. In this paper we propose a natural feature-based parameterization of conditional DPPs, and show how it leads to a convex and efficient learning formulation. We analyze the relationship between our model and binary Markov random fields with repulsive potentials, which are qualitatively similar but computationally intractable. Finally, we apply our approach to the task of extractive summarization, where the goal is to choose a small subset of sentences conveying the most important information from a set of documents. In this task there is a fundamental tradeoff between sentences that are highly relevant to the collection as a whole, and sentences that are diverse and not repetitive. Our parameterization allows us to naturally balance these two characteristics. We evaluate our system on data from the DUC 2003/04 multi-document summarization task, achieving state-of-the-art results.
💡 Research Summary
This paper addresses the previously open problem of learning Determinantal Point Processes (DPPs) from labeled data. DPPs are probabilistic models that assign higher probability to diverse subsets; they have the attractive property that many inference tasks (marginals, conditioning, sampling) can be performed exactly in polynomial time. The authors introduce a feature‑based conditional DPP model suitable for discriminative learning.
The core of the model is an L‑ensemble representation: a positive semidefinite kernel L is factorized as L_{ij} = q_i (φ_i^T φ_j) q_j, where q_i > 0 measures the intrinsic “quality” of item i and φ_i ∈ ℝ^n (with ‖φ_i‖ = 1) encodes its similarity features. Quality scores are modeled log‑linearly as q_i = exp(θ^T f_i / 2), where f_i is an item‑specific feature vector and θ is the parameter vector to be learned. The similarity features φ_i can come from any kernel‑compatible representation, including implicit, possibly infinite‑dimensional feature maps.
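The quality/similarity factorization can be sketched in a few lines of NumPy. This is a minimal illustration with random placeholder features, not the paper's actual feature set; the dimensions and variable names are assumptions for the example.

```python
import numpy as np

# Minimal sketch of the factorization L_ij = q_i (phi_i^T phi_j) q_j.
# Feature values are random placeholders, not from the paper.
rng = np.random.default_rng(0)
N, d, n = 5, 3, 4                    # items, quality-feature dim, similarity-feature dim

F = rng.normal(size=(N, d))          # quality features f_i
theta = rng.normal(size=d)           # parameter vector (here arbitrary, normally learned)
q = np.exp(F @ theta / 2.0)          # q_i = exp(theta^T f_i / 2) > 0

Phi = rng.normal(size=(N, n))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # normalize so ||phi_i|| = 1

S = Phi @ Phi.T                      # similarity matrix, S_ii = 1
L = np.outer(q, q) * S               # L_ij = q_i q_j (phi_i^T phi_j)

# L is positive semidefinite by construction (up to round-off).
assert np.all(np.linalg.eigvalsh(L) >= -1e-9)
```

Because S has a unit diagonal, the diagonal of L is exactly q_i², so quality scales each item's marginal weight while φ controls the repulsion between items.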
Given a training set of input–output pairs (X_t, Y_t) (e.g., a document cluster and its ideal extractive summary), the learning objective is the log‑likelihood L(θ) = ∑_t log P_θ(Y_t | X_t). Substituting the L‑ensemble definition, each term decomposes into a part linear in θ (the sum of quality features of the observed items), a constant term log det(S_{Y_t}) (the diversity contribution of the observed subset), and the negative log‑normalizer −log det(L(θ) + I), which is a log‑sum‑exp over all possible subsets. The latter is a concave function of the linear scores, so the entire objective is concave in θ. Consequently, standard convex optimization methods can be applied.
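The L‑ensemble definition underlying this objective, P(Y) = det(L_Y) / det(L + I), can be verified by brute force on a tiny example. The kernel below is a random placeholder; the point is only that the subset probabilities normalize correctly.

```python
import numpy as np
from itertools import combinations

# Sketch: for an L-ensemble, P(Y) = det(L_Y) / det(L + I).
# Brute-force normalization check on a tiny random PSD kernel.
rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
L = B @ B.T                                   # random PSD kernel (placeholder)
N = L.shape[0]
Z = np.linalg.det(L + np.eye(N))              # normalizer det(L + I)

def log_prob(Y):
    """log P(Y) = log det(L_Y) - log det(L + I); det of the empty matrix is 1."""
    if not Y:
        return -np.log(Z)
    LY = L[np.ix_(Y, Y)]
    return np.log(np.linalg.det(LY)) - np.log(Z)

# Probabilities over all 2^N subsets sum to 1.
total = sum(np.exp(log_prob(list(Y)))
            for k in range(N + 1) for Y in combinations(range(N), k))
assert abs(total - 1.0) < 1e-8
```

Taking the log of this expression gives exactly the decomposition above: log det(L_Y) splits into the θ‑linear quality part plus log det(S_Y), and −log det(L + I) is the log‑sum‑exp term.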
The gradient of L(θ) is the difference between empirical feature counts and their expectations under the model. Crucially, the expected counts can be computed efficiently because the marginal inclusion probability of item i is given by the diagonal entry K_{ii} of the marginal kernel K = (L + I)^{-1} L. Computing K requires an eigendecomposition of L (O(N^3) time), after which each K_{ii} is obtained in O(N), i.e., O(N^2) for the full diagonal. The authors present Algorithm 1, which details this computation and yields the exact gradient without enumerating the exponential number of subsets.
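A sketch of this gradient computation for a single training pair is below. The features are random placeholders, and a linear solve stands in for the paper's eigendecomposition route (both compute the same K); the observed subset is an assumed example.

```python
import numpy as np

# Sketch of the exact gradient: empirical quality-feature counts minus
# model expectations via the marginal kernel K = (L + I)^{-1} L.
rng = np.random.default_rng(2)
N, d = 6, 3
F = rng.normal(size=(N, d))                  # quality features f_i (placeholders)
theta = np.zeros(d)
q = np.exp(F @ theta / 2.0)
Phi = rng.normal(size=(N, N))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)
L = np.outer(q, q) * (Phi @ Phi.T)

# K_ii = P(i in Y). A linear solve avoids an explicit inverse; an
# eigendecomposition of L (as in the paper's Algorithm 1) works equally well.
K = np.linalg.solve(L + np.eye(N), L)

Y_obs = [0, 2]                               # an observed training subset (example)
grad = F[Y_obs].sum(axis=0) - np.diag(K) @ F   # empirical minus expected counts
```

Each diagonal entry K_{ii} lies in [0, 1], so the expected-count term is just a marginal-weighted sum of the item features.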
The paper also compares DPPs to pairwise Markov Random Fields (MRFs) with negative interaction potentials. For two items the models are equivalent, but for three or more items DPPs impose a transitivity constraint on the similarity matrix (stemming from positive semidefiniteness), whereas MRFs can represent a broader class of negative correlations. This analysis clarifies the expressive limits of DPPs relative to more general, but intractable, MRFs.
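The two‑item equivalence can be checked directly: a 2×2 L‑ensemble matches a pairwise binary MRF whose interaction weight turns out non‑positive (repulsive). The kernel values below are placeholders chosen for illustration.

```python
import numpy as np

# Sketch: for two items, a DPP equals a pairwise binary MRF with a
# non-positive interaction. Match parameters and compare all 4 outcomes.
L = np.array([[1.0, 0.6],
              [0.6, 1.5]])                  # placeholder 2x2 PSD kernel
Z = np.linalg.det(L + np.eye(2))
dpp = {(): 1.0, (0,): L[0, 0], (1,): L[1, 1], (0, 1): np.linalg.det(L)}
dpp = {Y: p / Z for Y, p in dpp.items()}

# MRF weights recovered from the DPP; the interaction w12 is <= 0 (repulsive).
w1, w2 = np.log(L[0, 0]), np.log(L[1, 1])
w12 = np.log(np.linalg.det(L)) - w1 - w2
scores = {(): 0.0, (0,): w1, (1,): w2, (0, 1): w1 + w2 + w12}
Zm = sum(np.exp(s) for s in scores.values())
mrf = {Y: np.exp(s) / Zm for Y, s in scores.items()}

for Y in dpp:
    assert abs(dpp[Y] - mrf[Y]) < 1e-12
assert w12 <= 0
```

With three or more items this construction no longer goes through in general, which is where the constraints discussed above appear.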
At test time the goal is to predict a subset Y for a new input X. While exact sampling from a conditional DPP is feasible (cubic time), the authors find that selecting the maximum‑a‑posteriori (MAP) set under a length budget yields better summarization performance. MAP inference for DPPs is NP‑hard, but the authors employ a greedy algorithm that leverages the submodular nature of the DPP log‑determinant objective, providing an efficient approximation.
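The greedy scheme can be sketched as follows: at each step, add the item with the largest marginal gain in log det(L_Y), stopping at a budget. A simple cardinality cap stands in for the paper's length budget, and the kernel is a random placeholder.

```python
import numpy as np

# Sketch of greedy MAP inference for a DPP: repeatedly add the item that
# most increases log det(L_Y), under a cardinality budget (a stand-in for
# the paper's length budget). Motivated by submodularity of log det.
def greedy_map(L, budget):
    N = L.shape[0]
    Y = []
    while len(Y) < budget:
        base = np.linalg.slogdet(L[np.ix_(Y, Y)])[1] if Y else 0.0
        best_gain, best_i = -np.inf, None
        for i in range(N):
            if i in Y:
                continue
            cand = Y + [i]
            gain = np.linalg.slogdet(L[np.ix_(cand, cand)])[1] - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_gain <= 0:        # stop once no item improves the objective
            break
        Y.append(best_i)
    return Y

rng = np.random.default_rng(3)
B = rng.normal(size=(5, 5))
Y = greedy_map(B @ B.T, budget=3)     # e.g., pick up to 3 diverse items
```

This is an approximation: exact MAP is NP‑hard, but the greedy selection is cheap and works well in practice for budgeted selection.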
The methodology is evaluated on the DUC 2003/04 multi‑document summarization benchmark. Sentences are represented with quality features (position, length, TF‑IDF scores) and similarity features (cosine similarity of TF‑IDF vectors). After learning θ, the conditional DPP is used to produce summaries that balance relevance and redundancy. The system achieves state‑of‑the‑art ROUGE scores, outperforming previous approaches based on MRFs or purely relevance‑driven selection.
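The similarity side of this setup can be sketched as length‑normalized TF‑IDF rows, so that S_ij = φ_i^T φ_j is exactly the cosine similarity between sentences. The tiny vocabulary and counts below are placeholders, not the DUC data.

```python
import numpy as np

# Sketch: similarity features as length-normalized TF-IDF vectors, so that
# S_ij = phi_i^T phi_j is the cosine similarity between sentences.
tf = np.array([[2, 0, 1],        # placeholder term counts: 3 sentences, 3 terms
               [0, 3, 1],
               [1, 1, 0]], dtype=float)
df = (tf > 0).sum(axis=0)                    # document frequency per term
idf = np.log(tf.shape[0] / df)               # inverse document frequency
Phi = tf * idf                               # TF-IDF rows
norms = np.linalg.norm(Phi, axis=1, keepdims=True)
Phi = np.where(norms > 0, Phi / norms, Phi)  # normalize so ||phi_i|| = 1
S = Phi @ Phi.T                              # cosine similarities, S_ii = 1
```

Combined with learned quality scores q_i from position, length, and TF‑IDF‑based features, this S yields the kernel L whose DPP trades off relevance against redundancy.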
In summary, the paper makes three major contributions: (1) a convex, feature‑based parameterization of conditional DPPs that enables efficient maximum‑likelihood learning; (2) a theoretical comparison of DPPs and repulsive MRFs, highlighting both expressive power and tractability; and (3) an empirical demonstration that learned DPPs excel at extractive summarization, effectively managing the relevance‑diversity trade‑off. The work opens the door for applying learned DPPs to a wide range of subset selection problems such as recommendation, sensor placement, and diverse retrieval.