Active learning from positive and unlabeled examples
Learning from positive and unlabeled data (PU learning) is a weakly supervised variant of binary classification in which the learner receives labels for only some of the positive instances, while all other examples remain unlabeled. Motivated by applications such as advertising and anomaly detection, we study an active PU learning setting where the learner can adaptively query instances from an unlabeled pool, but a queried label is revealed only when the instance is positive and an independent coin flip succeeds; otherwise the learner receives no information. In this paper, we provide the first theoretical analysis of the label complexity of active PU learning.
💡 Research Summary
The paper presents the first theoretical study of active learning in the Positive‑Unlabeled (PU) setting, where the learner only has access to a set of positively labeled examples and a large pool of unlabeled data. In contrast to standard active learning, where a queried instance always yields a label, the authors assume a more realistic feedback model: when the learner queries an instance, the true label is revealed only if the instance is truly positive and an independent coin with bias ω∈(0,1] lands heads; otherwise the learner receives a null response (denoted “*”). This model corresponds to the Selected‑Completely‑At‑Random (SCAR) assumption commonly used in passive PU learning.
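This feedback model is easy to simulate. The sketch below is illustrative (the function and parameter names are not from the paper): a query reveals the label only when the instance is truly positive and a coin with bias ω lands heads.

```python
import random

def pu_query(x, true_label, omega, rng=random):
    """SCAR-style PU feedback: the true label is revealed only if
    x is truly positive AND an independent coin with bias omega
    lands heads; otherwise the learner sees a null response."""
    if true_label(x) == 1 and rng.random() < omega:
        return 1
    return None  # the "*" (null) response

# Example: positives are x >= 0.5
label = lambda x: 1 if x >= 0.5 else 0
pu_query(0.2, label, omega=0.3)   # always None: x is negative
pu_query(0.8, label, omega=1.0)   # always 1: positive, coin bias 1
```

Note the asymmetry this creates: a null response never distinguishes "negative instance" from "positive instance whose coin came up tails", which is exactly why the analysis must lower-bound the response rate of queries placed in the disagreement region.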
The authors develop a label‑complexity analysis based on the disagreement‑based framework. They define the usual region of disagreement DIS(V) for a version space V⊆H and introduce the disagreement coefficient θ, which measures how the probability mass of DIS(B_D(h,r)) scales with the radius r. Assuming a continuous data distribution (no point masses) and a hypothesis class H of VC dimension d, they prove that the positive portion of the disagreement region, DIS_P(H)=DIS(H)∩{x:ℓ(x)=1}, contains at least a 1/(4θ) fraction of positive examples (Lemma 4.2). Consequently, each query placed in DIS(V) yields a useful label with probability at least ω/(4θ), establishing a lower bound on the effective response rate.
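For concreteness, membership in the disagreement region follows directly from its definition: x ∈ DIS(V) iff some pair of hypotheses in V disagrees on x. A toy sketch with hypotheses as plain Python predicates:

```python
def in_disagreement(x, version_space):
    """x lies in DIS(V) iff the hypotheses in V do not all agree on x."""
    predictions = {h(x) for h in version_space}
    return len(predictions) > 1

# Threshold class h_t(x) = 1[x >= t] with t in {0.3, 0.5, 0.7}:
# DIS(V) is exactly [0.3, 0.7), where the extreme thresholds disagree.
V = [lambda x, t=t: int(x >= t) for t in (0.3, 0.5, 0.7)]
in_disagreement(0.5, V)   # True:  h_{0.3} predicts 1, h_{0.7} predicts 0
in_disagreement(0.9, V)   # False: all predict 1
```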
Known class‑prior (π_D) case.
Algorithm 1 adapts the classic CAL active learning algorithm of Cohn, Atlas, and Ladner. It first draws a random unlabeled sample S of size k=O(d·log(1/γ)/γ²) with γ=ε/(8θ) and restricts the version space to hypotheses whose empirical positive mass does not exceed π_D+γ. This guarantees that any hypothesis in the initial version space has false-positive rate ≤2γ. The algorithm then repeatedly draws instances from the pool, queries them only if they lie in the current disagreement region, and prunes the version space using any non-null label received. Using Lemma 4.2, Chernoff bounds, and standard PAC arguments, the authors show that each round of O(θ·d·log θ/ω) label requests halves the disagreement mass, so O(log(1/ε)) rounds suffice. The total number of label requests needed to achieve error ≤ε with confidence 1−δ is
$$O\!\left(\frac{\theta\bigl(d\log\theta+\log\log(1/\varepsilon)+\log(1/\delta)\bigr)}{\omega}\cdot\log\frac{1}{\varepsilon}\right).$$
Compared with the classic CAL bound of O(θ·d·log(1/ε)) label requests, the overhead is essentially a factor of 1/ω (up to logarithmic terms), reflecting the probability that a query on a positive instance actually yields a label.
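To see how the two ingredients fit together, here is a toy instantiation for the threshold class on uniform data over [0, 1]. This is an illustrative sketch under those assumptions, not the paper's Algorithm 1 verbatim: the prior cap fixes the lower end of the version space, and queries are placed only in the disagreement region, shrinking the upper end whenever a positive response arrives.

```python
import random

def pu_cal_thresholds(t_star, pi_d, omega, gamma, n_queries=5000, seed=0):
    """Toy CAL-style loop for thresholds h_t(x) = 1[x >= t] on
    uniform data over [0, 1], with true labels l(x) = 1[x >= t_star]
    (so the class prior is pi_d = 1 - t_star).

    Step 1: the prior cap discards every t whose positive mass
    1 - t exceeds pi_d + gamma, giving lo = 1 - pi_d - gamma.
    Step 2: query only inside the disagreement region [lo, hi);
    a positive response at x proves t_star <= x, so hi shrinks to x.
    A null response is uninformative and prunes nothing.
    """
    rng = random.Random(seed)
    lo, hi = max(0.0, 1.0 - pi_d - gamma), 1.0
    for _ in range(n_queries):
        if hi - lo <= 1e-9:
            break
        x = rng.uniform(lo, hi)                    # draw from DIS(V) only
        if x >= t_star and rng.random() < omega:   # PU feedback model
            hi = min(hi, x)                        # prune thresholds > x
    return lo, hi

lo, hi = pu_cal_thresholds(t_star=0.7, pi_d=0.3, omega=0.5, gamma=0.01)
# the surviving interval [lo, hi] brackets the true threshold 0.7
```

Even though null responses carry no information, the version-space interval still collapses onto the true threshold, because Lemma 4.2-style reasoning guarantees that a constant fraction of the queries in [lo, hi) land on positives.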
Unknown class‑prior case.
When π_D is not given, the authors propose a two-stage approach. First, Algorithm 2 estimates ω up to a constant factor. Then a large unlabeled set S₁ is sampled to obtain an empirical positive rate \hatπ_{S₁}. The version space is iteratively refined: at iteration i the learner maintains an upper bound u_i on the true positive rate and a lower bound b_i derived from the fraction of points that all hypotheses in the current version space label as positive. If the observed response rate from DIS(V_i) falls below a threshold proportional to ω/θ, Lemma 5.4 guarantees that \hatπ_{S₁} < (u_i+b_i)/2, so the algorithm can move the upper bound to the midpoint (u_{i+1} = (u_i+b_i)/2), halving the gap u_i − b_i. This binary-search-style refinement can occur at most O(log(1/ε)) times, after which the response rate is guaranteed to be Ω(ω/θ). From that point onward the analysis mirrors the known-π_D case, yielding the same asymptotic label-complexity bound.
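The halving step can be sketched abstractly. In the sketch below the names are hypothetical, the lower bound is held fixed for simplicity, and `rate_is_high` stands in for the paper's test of the observed response rate against the ω/θ threshold:

```python
def refine_upper_bound(u0, b0, rate_is_high, eps):
    """Binary-search-style refinement from the unknown-prior case:
    while the observed response rate from DIS(V_i) stays below the
    threshold, move the upper bound to the midpoint, halving the
    gap u_i - b_i at each iteration."""
    u, b, steps = u0, b0, 0
    while u - b > eps and not rate_is_high(u, b):
        u = (u + b) / 2.0          # u_{i+1} = (u_i + b_i) / 2
        steps += 1
    return u, steps

# Worst case: the rate test never passes, so the gap halves until it
# is at most eps -- ceil(log2((u0 - b0)/eps)) iterations, matching
# the O(log(1/eps)) refinements in the analysis.
u, steps = refine_upper_bound(1.0, 0.0, lambda u, b: False, eps=2**-10)
# steps == 10
```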
Implications and discussion.
The results demonstrate that even under the severe information restriction of PU learning—where only a fraction ω of positive queries provide labels—active learning can still achieve the same order of label efficiency as fully supervised active learning, up to the factor 1/ω. The dependence on the disagreement coefficient θ remains, highlighting that the geometry of the hypothesis class relative to the data distribution continues to dominate label complexity. Moreover, the binary‑search procedure for estimating the unknown class prior shows that the learner does not need prior knowledge of π_D to retain optimal rates.
Overall, the paper extends the disagreement‑based active learning theory to a realistic weakly supervised scenario, provides explicit label‑complexity bounds, and offers algorithmic strategies for both known and unknown class‑prior settings. These contributions lay a theoretical foundation for designing cost‑effective labeling strategies in applications such as targeted advertising, anomaly detection, and knowledge‑base completion, where obtaining labels is expensive and often limited to positive feedback.