Enhancing Object Discovery for Unsupervised Instance Segmentation and Object Detection

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

We propose Cut-Once-and-LEaRn (COLER), a simple approach for unsupervised instance segmentation and object detection. COLER first uses CutOnce, a mask generator we developed, to produce coarse pseudo labels, then trains a detector on these masks. CutOnce applies Normalized Cut (NCut) only once and does not rely on any clustering method (e.g., K-Means), yet it can generate multiple object masks in an image. Our work opens a new direction for the NCut algorithm in multi-object segmentation. We design several novel yet simple modules that not only allow CutOnce to fully leverage the object discovery capabilities of self-supervised models, but also free it from reliance on mask post-processing. During training, COLER achieves strong performance without requiring specially designed loss functions for pseudo labels, and its performance is further improved through self-training. COLER is a zero-shot unsupervised model that outperforms previous state-of-the-art methods on multiple benchmarks. We believe our method can help advance the field of unsupervised object localization. Code is available at: https://github.com/Quantumcraft616/COLER.


💡 Research Summary

The paper introduces COLER (Cut‑Once‑and‑LEaRn), a novel framework for unsupervised instance segmentation and object detection that builds on a new mask generator called CutOnce. Traditional unsupervised methods either apply Normalized Cut (NCut) once and then cluster the resulting eigen‑vectors (e.g., with K‑Means) or recursively re‑apply NCut to split foreground/background repeatedly. Both approaches have drawbacks: clustering requires a predefined number of clusters, and recursive splitting suffers from error accumulation and high computational cost.
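To ground the discussion, here is a minimal NumPy sketch of the single NCut step that all of these methods share: build an affinity matrix, form the symmetric normalized Laplacian, and read off the eigenvector with the second-smallest eigenvalue. The toy affinity matrix below is illustrative, not taken from the paper.

```python
import numpy as np

def second_eigenvector(W):
    """Relaxed Normalized Cut: return the eigenvector of the symmetric
    normalized Laplacian with the second-smallest eigenvalue. Thresholding
    this vector yields a single foreground/background bipartition."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    return eigvecs[:, 1]                      # skip the trivial constant mode

# Toy affinity matrix: two tight clusters, weakly linked to each other.
W = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
y1 = second_eigenvector(W)
labels = (y1 > 0).astype(int)  # sign of y1 separates the two clusters
```

Thresholding the sign of y₁ recovers the two clusters; the methods surveyed above differ in what they do next with this vector (cluster it, recurse on it, or, in CutOnce's case, manipulate it directly).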

CutOnce departs from these paradigms by performing NCut exactly once and directly manipulating y₁, the eigenvector associated with the second-smallest eigenvalue, to obtain multiple object masks. Three lightweight yet effective modules are introduced to make this possible:

  1. Density‑Tune Cosine Similarity – The edge weight matrix W is no longer a plain cosine similarity. For each patch i, a local density ρᵢ is computed as the average similarity to its top‑k nearest neighbors. An adaptive temperature Tᵢⱼ = T₀ + α·(ρᵢ+ρⱼ)/2 modulates the cosine similarity, suppressing intra‑object variance while preserving uniform background similarity. This idea mirrors self‑tuning spectral clustering but is implemented with a single matrix operation.

  2. Boundary Augmentation – To overcome the tendency of y₁ to focus on the most salient object, a boundary eigen‑vector X_b is derived by averaging absolute differences between each pixel and its 8‑neighborhood. The enhanced vector X_a = X − X_b reduces attention inside the foreground and amplifies boundary regions, effectively expanding less‑salient objects and preventing adjacent objects from merging into a single segment.

  3. Rank Feature Filter – After graph partitioning, each candidate mask is scored by the sum of its pixel values (rank). Only the top 95 % of masks are retained, discarding noisy or spurious proposals before they are fed to the detector.
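The three modules above can be sketched roughly as follows. This is an illustrative reconstruction from the summary, not the authors' code: the exact modulation formula, neighborhood handling, and scoring in the paper may differ, and the function names and defaults (`k`, `T0`, `alpha`, `keep`) are assumptions.

```python
import numpy as np

def density_tuned_similarity(F, k=3, T0=1.0, alpha=0.5):
    """Module 1 (sketch): cosine similarity rescaled by an adaptive,
    density-aware temperature. rho_i is the mean similarity of patch i
    to its top-k neighbours (excluding itself)."""
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    S = F @ F.T                            # plain cosine similarity
    topk = np.sort(S, axis=1)[:, -k-1:-1]  # k largest, excluding self-similarity
    rho = topk.mean(axis=1)                # local density per patch
    T = T0 + alpha * (rho[:, None] + rho[None, :]) / 2
    return S / T                           # temperature-modulated edge weights

def boundary_augment(X):
    """Module 2 (sketch): X is the eigenvector reshaped to the patch grid.
    X_b averages absolute differences to the 8-neighborhood; X_a = X - X_b
    suppresses object interiors and highlights boundaries."""
    H, W_ = X.shape
    Xp = np.pad(X, 1, mode="edge")
    diffs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            diffs.append(np.abs(X - Xp[1 + dy:1 + dy + H, 1 + dx:1 + dx + W_]))
    X_b = np.mean(diffs, axis=0)
    return X - X_b

def rank_filter(masks, keep=0.95):
    """Module 3 (sketch): score each candidate mask by the sum of its
    pixel values and keep only the top fraction."""
    scores = np.array([m.sum() for m in masks])
    order = np.argsort(-scores)
    n_keep = max(1, int(np.ceil(keep * len(masks))))
    return [masks[i] for i in order[:n_keep]]

# Toy usage on random features and synthetic masks (illustrative only):
rng = np.random.default_rng(0)
F = rng.normal(size=(10, 8))               # 10 patch features of dimension 8
Wm = density_tuned_similarity(F)
Xa = boundary_augment(np.ones((4, 4)))     # constant map has no boundaries
kept = rank_filter([np.full((2, 2), v) for v in (1, 3, 2, 4)], keep=0.5)
```

Note the design choice this reconstruction highlights: each module is a plain array operation over the similarity matrix, the eigenvector grid, or the candidate masks, so no clustering step, recursion, or CRF post-processing enters the pipeline.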

These modules together enable CutOnce to generate more than ten object masks per image, a capability not seen in prior NCut‑based methods such as TokenCut, MaskCut, or VoteCut. Moreover, mask generation takes only ~0.23 seconds per image, more than twenty times faster than TokenCut's 5.6 seconds. No CRF or other heavy post‑processing is required.

COLER then uses the coarse pseudo‑masks from CutOnce as training data for a standard detector (e.g., Mask R‑CNN). Crucially, the authors do not design a special loss for noisy pseudo‑labels; they rely on the conventional detection losses (classification, bounding‑box regression, mask loss). After an initial training pass, a self‑training loop is employed: the trained detector predicts new masks on the unlabeled dataset; these predictions are filtered (using the same rank filter) and added back into the training set, iteratively refining the model.
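The self-training loop just described can be sketched as a generic skeleton. All names here (`train_fn`, `predict_fn`) are hypothetical stand-ins, not the authors' API; the stubs merely illustrate the train → predict → rank-filter → retrain cycle.

```python
import numpy as np

def self_training(train_images, pseudo_masks, train_fn, predict_fn,
                  rounds=2, keep=0.95):
    """Sketch of a self-training loop: train on the current pseudo-masks,
    predict fresh masks on the same unlabeled images, rank-filter the
    predictions, and use them as labels for the next round."""
    labels = pseudo_masks
    model = None
    for _ in range(rounds):
        model = train_fn(train_images, labels)  # standard detection losses
        preds = [predict_fn(model, img) for img in train_images]
        # Rank filter: keep only the highest-scoring predicted masks per image.
        labels = []
        for masks in preds:
            scores = np.array([m.sum() for m in masks])
            order = np.argsort(-scores)
            n_keep = max(1, int(np.ceil(keep * len(masks))))
            labels.append([masks[i] for i in order[:n_keep]])
    return model, labels

# Toy demo with stub train/predict functions (illustrative only):
train_images = [0, 1]
pseudo = [[np.ones((2, 2))] for _ in train_images]
model, labels = self_training(
    train_images, pseudo,
    train_fn=lambda imgs, lbls: "stub-model",
    predict_fn=lambda m, img: [np.full((2, 2), v) for v in (1, 2, 3, 4)],
    rounds=1, keep=0.5,
)
```

Because the filter reuses the same rank criterion as mask generation, no new hyperparameters or loss terms are introduced between rounds.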

Extensive experiments on COCO val2017, PASCAL VOC, and LVIS demonstrate that COLER outperforms all previous unsupervised baselines. With AP₅₀ = 22.1 % on COCO, it surpasses TokenCut (18.9 %) and MaskCut (19.3 %). The ability to detect >10 objects per image and the dramatic speed improvement make the approach practical for large‑scale, label‑free vision tasks.

In summary, COLER presents a clean, efficient pipeline that leverages a single NCut operation enhanced by density‑aware similarity, boundary augmentation, and rank‑based mask pruning. This enables high‑quality multi‑object pseudo‑labels without clustering or recursive graph cuts, and the resulting masks can train a strong detector without bespoke loss functions. The work opens a new direction for applying spectral graph methods to unsupervised object discovery, with potential extensions to video, 3‑D data, and real‑time robotics.

