Classification by Set Cover: The Prototype Vector Machine


We introduce a new nearest-prototype classifier, the prototype vector machine (PVM). It arises from a combinatorial optimization problem which we cast as a variant of the set cover problem. We propose two algorithms for approximating its solution. The PVM selects a relatively small number of representative points which can then be used for classification. It contains 1-NN as a special case. The method is compatible with any dissimilarity measure, making it amenable to situations in which the data are not embedded in an underlying feature space or in which using a non-Euclidean metric is desirable. Indeed, we demonstrate on the much studied ZIP code data how the PVM can reap the benefits of a problem-specific metric. In this example, the PVM outperforms the highly successful 1-NN with tangent distance, and does so retaining fewer than half of the data points. This example highlights the strengths of the PVM in yielding a low-error, highly interpretable model. Additionally, we apply the PVM to a protein classification problem in which a kernel-based distance is used.


💡 Research Summary

The paper introduces the Prototype Vector Machine (PVM), a novel nearest‑prototype classifier that formulates prototype selection as a combinatorial optimization problem closely related to the set‑cover problem. Given a training set X with class labels and a candidate prototype set Z (typically Z = X, but any set of points is permissible), the method defines a dissimilarity matrix D_{ij}=d(x_i, z_j) where d can be any non‑negative measure, not necessarily a metric. For a chosen radius ε, each prototype induces an ε‑ball B_ε(z_j). The goal is to select a small collection of prototypes for each class such that (a) the ε‑balls cover as many training points of the same class as possible, (b) they cover as few points of other classes as possible, and (c) the total number of prototypes is minimized.
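As a concrete illustration of this setup, the sketch below builds the dissimilarity matrix D and the ε-ball coverage relation (a minimal sketch with our own variable names; the Euclidean default stands in for whatever dissimilarity d is chosen):

```python
import numpy as np

def coverage_matrix(X, Z, eps, d=None):
    """Return a boolean matrix B with B[i, j] = True iff training point
    x_i lies in the eps-ball B_eps(z_j) around candidate prototype z_j.

    d may be any non-negative dissimilarity (not necessarily a metric);
    here it defaults to Euclidean distance for illustration."""
    if d is None:
        d = lambda a, b: np.linalg.norm(a - b)
    # D[i, j] = d(x_i, z_j), the dissimilarity matrix from the paper
    D = np.array([[d(x, z) for z in Z] for x in X])
    return D <= eps

# Toy example with Z = X (candidates are the training points themselves).
X = np.array([[0.0], [0.1], [1.0]])
B = coverage_matrix(X, X, eps=0.2)
print(B[:, 0])  # which points fall inside the ball around z_0
```

Since d need not be a metric, the same function accepts, for example, a tangent distance or a kernel-induced dissimilarity in place of the Euclidean default.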

The authors encode these desiderata in an integer program. Binary variables α_{j}^{(l)} indicate whether candidate z_j is chosen as a prototype for class l. Two slack variables per training point, ξ_i and η_i, capture respectively (i) failure to be covered by a same‑class prototype and (ii) being covered by a prototype of a different class. The objective function is

 ∑_i ξ_i + ∑_i η_i + λ ∑_{j,l} α_j^{(l)}

where λ ≥ 0 penalizes the inclusion of each prototype. Constraint (3a) forces every training point to be covered by at least one same‑class ε‑ball unless ξ_i = 1, while constraint (3b) limits the number of “wrong‑class” coverings, with η_i counting the excess. The formulation is shown to be equivalent to L independent prize‑collecting set‑cover problems, where the cost of selecting prototype (j,l) is C_{j}^{(l)} = λ + |B_ε(z_j) ∩ (X \ X_l)|, i.e., a base cost plus a penalty proportional to the number of opposite‑class points it would cover.

Because the integer program is NP‑hard, the paper proposes two approximation algorithms. The first relaxes the binary α variables to the interval [0, 1] and solves the resulting linear program, rounding the fractional solution back to a binary one.
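A natural approximation for a prize-collecting set-cover instance is a greedy pass that repeatedly picks the candidate with the best coverage-to-cost trade-off. The sketch below is illustrative only, under our own assumptions; it is not necessarily the paper's exact procedure:

```python
import numpy as np

def greedy_select(B_same, costs, max_iter=100):
    """Greedy selection for one class of a prize-collecting set cover
    (illustrative sketch, not the paper's exact algorithm).

    B_same : boolean (n_l, m) matrix; B_same[i, j] = True iff same-class
             point i is covered by the ball around candidate j
    costs  : (m,) cost of choosing each candidate prototype
    Returns the list of selected candidate indices."""
    uncovered = np.ones(B_same.shape[0], dtype=bool)
    chosen = []
    for _ in range(max_iter):
        # number of still-uncovered points each ball would newly cover
        gain = B_same[uncovered].sum(axis=0)
        j = int(np.argmax(gain - costs))
        if gain[j] - costs[j] <= 0:
            break  # covering more points no longer pays for its cost
        chosen.append(j)
        uncovered &= ~B_same[:, j]
    return chosen

# Toy example: ball 0 covers two class points, ball 1 covers the third.
B_same = np.array([[True, False],
                   [True, False],
                   [False, True]])
chosen = greedy_select(B_same, np.array([0.5, 0.5]))
print(chosen)
```

Leaving a point uncovered here corresponds to paying its slack ξ_i in the objective, which is why the loop stops once a candidate's net prize is no longer positive.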

