PANC: Prior-Aware Normalized Cut for Object Segmentation
Fully unsupervised segmentation pipelines naively seek the most salient object, if one is present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. Building on the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading a few annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly supervised and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). In particular, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on the CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
💡 Research Summary
The paper introduces PANC (Prior‑Aware Normalized Cut), a weakly‑supervised spectral segmentation framework that injects a small set of manually annotated token‑level priors into a graph‑based normalized‑cut pipeline built on frozen self‑supervised Vision Transformer (ViT) embeddings. The authors first identify a key limitation of recent unsupervised object discovery methods such as TokenCut: they rely on “most salient” heuristics and consequently produce non‑deterministic, unstable masks that are highly sensitive to initialization, seed order, and threshold choices. This instability is especially problematic in multi‑object scenes, homogeneous textures, or fine‑grained domains where saliency cues are ambiguous.
To address this, PANC constructs a compact prior bank from a handful of representative images selected by clustering image‑level CLS embeddings. Human annotators label a few tokens in each representative image as positive (foreground) or negative (background). For a given test image, a relevance‑aware selection mechanism retrieves a small, label‑balanced subset of priors that are both highly similar to the image tokens and diverse among themselves.
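The summary does not say how the relevance-aware selection is computed. As a rough illustration only, the sketch below uses a greedy scheme in the style of maximal marginal relevance, which is a swapped-in technique and not necessarily the authors' mechanism. It scores each prior by its best cosine similarity to any image token, penalizes similarity to priors already chosen, and picks the same number of priors per label. The function name `select_priors` and the parameters `per_label` and `lam` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_priors(tokens, prior_feats, prior_labels, per_label=2, lam=0.5):
    """Greedy relevance/diversity selection, balanced across labels.

    tokens:       (N, D) token features of the test image
    prior_feats:  (M, D) features of annotated prior tokens in the bank
    prior_labels: (M,) array, 1 = foreground, 0 = background
    per_label:    priors to keep per label (hypothetical parameter)
    lam:          relevance vs. diversity trade-off (hypothetical parameter)
    Returns a list of indices into the prior bank.
    """
    tokens = l2_normalize(np.asarray(tokens, dtype=float))
    priors = l2_normalize(np.asarray(prior_feats, dtype=float))
    prior_labels = np.asarray(prior_labels)

    # Relevance: best cosine similarity between a prior and any image token.
    relevance = (priors @ tokens.T).max(axis=1)

    selected = []
    for label in (1, 0):  # same budget for each label keeps the subset balanced
        candidates = [int(i) for i in np.flatnonzero(prior_labels == label)]
        chosen = []
        while candidates and len(chosen) < per_label:
            def score(i):
                # Redundancy: similarity to the closest prior already chosen
                # for this label (0 when nothing has been chosen yet).
                redundancy = max((priors[i] @ priors[j] for j in chosen), default=0.0)
                return lam * relevance[i] - (1.0 - lam) * redundancy
            best = max(candidates, key=score)
            chosen.append(best)
            candidates.remove(best)
        selected.extend(chosen)
    return selected
```

With `lam` close to 1 the selection reduces to plain nearest-prior retrieval; lower values trade relevance for diversity, so a duplicated prior is skipped in favor of a different one.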
The core technical contribution lies in how these priors are incorporated into the token affinity graph. Tokens are first extracted from a frozen DINOv3 ViT (e.g., ViT‑H/16) and pairwise cosine similarities are transformed into non‑negative edge weights via a temperature‑scaled exponential kernel. Two anchor vertices—one for the foreground class and one for the background—are then added. Each token is connected to the appropriate anchor with a uniform coupling strength α proportional to the average affinity in the original graph (α = κ·mean(W_feat)). This yields a block adjacency matrix
W =
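The matrix is cut off in this copy, so the sketch below is only one plausible reading of the construction described above, not the authors' definition. It assumes the kernel has the form exp((cos − 1)/τ), that only tokens matched to a prior are linked to an anchor, and that the two anchors are not linked to each other. The names `build_anchored_affinity` and `ncut_foreground_mask`, the arguments `fg_idx` and `bg_idx`, and the default values of `tau` and `kappa` are all hypothetical. The second function applies the standard normalized-cut relaxation (second-smallest eigenvector of the normalized Laplacian) and labels as foreground the tokens that fall on the same side as the foreground anchor, which is a simple thresholding choice made here for illustration.

```python
import numpy as np

def build_anchored_affinity(tokens, fg_idx, bg_idx, tau=0.1, kappa=1.0):
    """Token affinity graph augmented with two anchor vertices.

    tokens: (N, D) features; fg_idx / bg_idx: indices of tokens tied to the
    foreground / background anchor (assumed to come from matched priors).
    Returns an (N + 2, N + 2) symmetric matrix; row N is the foreground
    anchor and row N + 1 the background anchor.
    """
    x = np.asarray(tokens, dtype=float)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    cos = x @ x.T
    # Assumed temperature-scaled exponential kernel: weights lie in (0, 1].
    w_feat = np.exp((cos - 1.0) / tau)
    alpha = kappa * w_feat.mean()  # alpha = kappa * mean(W_feat), as in the text
    n = x.shape[0]
    # Coupling block: column 0 = foreground anchor, column 1 = background anchor.
    coupling = np.zeros((n, 2))
    coupling[list(fg_idx), 0] = alpha
    coupling[list(bg_idx), 1] = alpha
    w = np.zeros((n + 2, n + 2))
    w[:n, :n] = w_feat
    w[:n, n:] = coupling
    w[n:, :n] = coupling.T  # anchor-anchor block stays zero (assumption)
    return w

def ncut_foreground_mask(w):
    """Bipartition via the second-smallest eigenvector of the normalized Laplacian.

    Requires every vertex to have positive degree, i.e. each anchor must be
    linked to at least one token.
    """
    n = w.shape[0] - 2
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = np.eye(n + 2) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)           # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]       # generalized eigenvector of (D - W, D)
    # Foreground = tokens with the same sign as the foreground anchor.
    return fiedler[:n] * fiedler[n] > 0
```

Because the mask is read off relative to the foreground anchor, the sign ambiguity of the eigenvector does not change the result, which is one way to see how anchors can make the partition reproducible.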