CoHiRF: Hierarchical Consensus for Interpretable Clustering Beyond Scalability Limits
We introduce CoHiRF (Consensus Hierarchical Random Features), a hierarchical consensus framework that enables existing clustering methods to operate beyond their usual computational and memory limits. CoHiRF is a meta-algorithm that operates exclusively on the label assignments produced by a base clustering method, without modifying its objective function, optimization procedure, or geometric assumptions. It repeatedly applies the base method to multiple low-dimensional feature views or stochastic realizations, enforces agreement through consensus, and progressively reduces the problem size via representative-based contraction. Across a diverse set of synthetic and real-world experiments involving centroid-based, kernel-based, density-based, and graph-based methods, we show that CoHiRF can improve robustness to high-dimensional noise, enhance stability under stochastic variability, and enable scalability to regimes where the base method alone is infeasible. We also provide an empirical characterization of when hierarchical consensus is beneficial, highlighting the role of reproducible label relations and their compatibility with representative-based contraction. Beyond flat partitions, CoHiRF produces an explicit Cluster Fusion Hierarchy, offering a multi-resolution and interpretable view of the clustering structure. Together, these results position hierarchical consensus as a practical and flexible tool for large-scale clustering, extending the applicability of existing methods without altering their underlying behavior.
💡 Research Summary
The paper introduces CoHiRF (Consensus Hierarchical Random Features), a meta‑algorithm designed to extend the applicability of any existing clustering method (the “base clustering method”, BCM) to datasets that are too large in sample size or dimensionality for the original algorithm to handle. CoHiRF never modifies the objective, optimization, or geometric assumptions of the BCM; it operates solely on the label assignments produced by the BCM.
The core workflow proceeds iteratively. At each iteration the current set of “active medoids” (initially all data points) is projected onto R random low‑dimensional feature views, each consisting of q ≪ p randomly selected dimensions. The BCM is run independently on each view, yielding R distinct partitions of the active set. These partitions are then merged through a consensus step. In the strict version, only label relations that appear identically across all R views are retained; a relaxed version discards highly inconsistent views to improve robustness under noisy conditions.
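The per-iteration workflow above can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: we stand in for the BCM with scikit-learn's `KMeans`, draw `R` random views of `q` dimensions each, and apply the strict consensus rule by grouping points whose label tuple is identical across all views. All variable names and the synthetic data are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))           # n = 300 points, p = 50 dims
X[:150] += 4.0                           # two well-separated groups

R, q, k = 5, 8, 2                        # views, dims per view, clusters
labels = np.empty((X.shape[0], R), dtype=int)
for r in range(R):
    dims = rng.choice(X.shape[1], size=q, replace=False)  # random view
    labels[:, r] = KMeans(n_clusters=k, n_init=10,
                          random_state=r).fit_predict(X[:, dims])

# Strict consensus: identical label tuples across all R views => one
# consensus cluster (per-view label permutations do not matter here,
# since grouping is by the whole tuple, not by label values).
_, consensus = np.unique(labels, axis=0, return_inverse=True)
consensus = consensus.reshape(-1)
print(consensus.max() + 1)               # number of consensus clusters
```

Note that the strict rule can only split clusters relative to any single view, never merge them, which is why the relaxed variant (dropping inconsistent views) matters under noise.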
After consensus, each consensus cluster is represented by a single medoid (the most central point within the cluster). The collection of selected medoids becomes the active set for the next iteration, thereby reducing the problem size. A parent vector records which medoids were merged at each step, producing an explicit “Cluster Fusion Hierarchy” (CFH). The process repeats until no further reduction occurs or a maximum number of iterations is reached, at which point final labels are propagated back to the original data via the parent vector.
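A minimal sketch of the contraction and label-propagation steps described above, under our own assumptions: the medoid is taken as the point minimizing total in-cluster distance, and the parent vector maps every point to its representative so that final labels can be read off by following pointers. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def contract(X, consensus, parent):
    """Replace each consensus cluster by its medoid; record fusions."""
    medoid_idx = []
    for c in np.unique(consensus):
        members = np.flatnonzero(consensus == c)
        # pairwise distances within the cluster
        D = np.linalg.norm(X[members, None] - X[None, members], axis=-1)
        m = members[D.sum(axis=1).argmin()]   # most central member
        parent[members] = m                   # fuse cluster into medoid
        medoid_idx.append(m)
    return np.array(medoid_idx)              # active set for next round

def propagate(parent):
    """Follow parent pointers until every point reaches its root medoid."""
    labels = parent.copy()
    while not np.array_equal(labels, parent[labels]):
        labels = parent[labels]
    return labels

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
consensus = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
parent = np.arange(len(X))                   # initially, every point is a root
medoids = contract(X, consensus, parent)
final = propagate(parent)
print(len(medoids), np.unique(final).size)   # 3 medoids, 3 final labels
```

Across iterations, the successive parent vectors are exactly the Cluster Fusion Hierarchy: each level records which representatives were fused into which medoid.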
Scalability is achieved in two complementary ways. First, random feature views dramatically lower the dimensionality of each BCM run, reducing both time and memory from O(np) to O(nq). Second, CoHiRF supports batch processing: the dataset is split into B batches, each processed independently, and the resulting medoids are merged hierarchically. Theoretical analysis shows that the overall time complexity is roughly Σ_ℓ R·C_BCM(q, n_ℓ), where the sum runs over iterations ℓ, n_ℓ is the size of the active set at iteration ℓ, and C_BCM(q, n) denotes the cost of running the BCM on n points in q dimensions. Memory usage scales as O(max{(n/B)·q, B·q}), a substantial improvement over the O(n²) or O(np) requirements of many traditional clustering algorithms.
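The batch variant can be illustrated with the following sketch. We stand in for the per-batch CoHiRF reduction with a plain `KMeans` over-clustering followed by medoid extraction; this shows only the control flow and memory pattern (each batch is touched once, then only B·50 representatives are pooled), not the full method. Batch count, cluster counts, and the synthetic data are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(6000, 20))            # pretend this is too big to cluster at once
B, k = 6, 4                                # batches, final cluster count

batch_medoids = []
for Xb in np.array_split(X, B):            # each batch fits in memory on its own
    km = KMeans(n_clusters=50, n_init=3, random_state=0).fit(Xb)
    # nearest actual point to each centroid acts as a medoid representative
    for c in km.cluster_centers_:
        batch_medoids.append(Xb[np.linalg.norm(Xb - c, axis=1).argmin()])

M = np.stack(batch_medoids)                # only B * 50 representatives survive
final = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(M)
print(M.shape, np.unique(final).size)
```

The key point is that the second-stage clustering sees B·50 = 300 representatives rather than 6,000 points, which is what keeps peak memory bounded by the larger of a single batch and the pooled medoid set.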
Empirical evaluation covers both synthetic and real‑world benchmarks. Synthetic experiments include spherical, non‑spherical, manifold, and high‑dimensional noisy scenarios. Real datasets span image segmentation (BSDS500), single‑cell transcriptomics (10X Genomics, >1 M cells), and large social‑network graphs (≈5 M nodes). Four representative BCMs are tested: K‑Means, kernel K‑Means, DBSCAN, and the scalable spectral method SC‑SRGF. Results demonstrate that CoHiRF consistently improves Adjusted Rand Index and Normalized Mutual Information by 10‑15 % over the raw BCM, especially when high‑dimensional noise is present. Stability across multiple runs is markedly higher, with variance in label assignments reduced dramatically for stochastic methods such as DBSCAN. In large‑scale settings, CoHiRF enables algorithms that would otherwise exceed memory limits (e.g., SC‑SRGF on millions of points) to run within a few gigabytes and with 3‑5× speed‑up.
A key insight is that the benefit of hierarchical consensus hinges on the existence of “reproducible label relations”. When the agreement metric across views exceeds roughly 0.7, the contraction step preserves ≥95 % of the original clustering quality. In low‑agreement regimes, the relaxed consensus or an increase in R or q can restore robustness, albeit at higher computational cost. The CFH provides a multi‑resolution view of the data: unlike classic dendrograms built from pairwise distances, the hierarchy reflects consensus‑driven cluster fusions, allowing analysts to explore structures at various granularities without pre‑specifying the number of clusters.
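One simple way to operationalize such an agreement check is sketched below. Using mean pairwise Adjusted Rand Index across the R views as the agreement metric is our assumption; the paper's exact metric may differ. ARI is invariant to label permutations, so views that agree on the partition but swap label names still score 1.0.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def mean_pairwise_ari(labels):
    """labels: (n, R) array, one column of cluster labels per view."""
    R = labels.shape[1]
    pairs = combinations(range(R), 2)
    return float(np.mean([adjusted_rand_score(labels[:, i], labels[:, j])
                          for i, j in pairs]))

base = np.array([0, 0, 0, 1, 1, 1])
views = np.stack([base, 1 - base, base], axis=1)  # same partition, relabeled
score = mean_pairwise_ari(views)
print(round(score, 2))   # 1.0: high agreement, strict consensus is safe
```

In a pipeline, a score above the reported ~0.7 regime would justify strict consensus and contraction, while a lower score would argue for the relaxed variant or for increasing R or q.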
The authors acknowledge limitations: if the BCM yields essentially random labels (e.g., K‑Means on extremely high‑dimensional data with no structure), consensus cannot recover meaningful clusters. Future work may incorporate probabilistic co‑association matrices, more sophisticated medoid selection strategies (e.g., based on centrality or domain knowledge), and extensions to semi‑supervised or constrained clustering.
In summary, CoHiRF offers a practical, algorithm‑agnostic framework that (1) preserves the original behavior of any clustering method, (2) leverages random low‑dimensional projections and label consensus to achieve scalability, and (3) produces an interpretable hierarchical representation of cluster relationships. This makes it a valuable tool for practitioners who need to apply appropriate clustering paradigms to massive, high‑dimensional datasets without being forced to default to simple, scalable but potentially mismatched algorithms.