Persistent Multiscale Density-based Clustering
Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN’s computational costs to k-Means, demonstrating competitive run-times on low-dimensional datasets. At higher dimensions, run times scale more similarly to HDBSCAN*.
💡 Research Summary
The paper introduces Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN), a novel density‑based clustering algorithm that builds on HDBSCAN* but eliminates the need for extensive hyper‑parameter tuning. Traditional density‑based methods such as DBSCAN require a density threshold and a minimum number of points, while HDBSCAN* only needs a minimum cluster size (m_c). However, selecting an appropriate m_c remains difficult in practice, especially for exploratory data analysis (EDA) where prior knowledge about the data is scarce.
PLSCAN addresses this by computing, in a single pass, the full hierarchy of leaf‑clusters that would appear for every possible m_c value. The method proceeds in three stages that mirror HDBSCAN*: (1) compute mutual‑reachability distances using a user‑specified k (the number of nearest neighbours), (2) build a minimum spanning tree (MST) and a single‑linkage dendrogram, and (3) construct a “condensed tree” that records every merge event together with the size of the child clusters and the mutual‑reachability distance at which the merge occurs.
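Stages (1) and (2) can be illustrated with a minimal standard-library sketch. It is not the paper's implementation: the function names are hypothetical, the brute-force distance computation is only suitable for tiny inputs, and the convention that the k-th nearest neighbour excludes the point itself is an assumption (implementations differ on this). The mutual-reachability distance between two points is taken as the maximum of their two core distances and their Euclidean distance, and the MST is built with Prim's algorithm on the dense matrix.

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour.

    Assumption: the point itself is not counted as a neighbour.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points, k):
    """Dense matrix with d_mr(a, b) = max(core(a), core(b), d(a, b))."""
    n = len(points)
    core = core_distances(points, k)
    mr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = max(core[i], core[j], math.dist(points[i], points[j]))
            mr[i][j] = mr[j][i] = d
    return mr


def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense distance matrix.

    Returns the MST edges as (u, v, weight), sorted by increasing weight,
    which is the order in which single linkage merges components.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex that is not in the tree yet.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return sorted(edges, key=lambda e: e[2])


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = minimum_spanning_tree(mutual_reachability(points, k=2))
for u, v, w in edges:
    print(u, v, round(w, 3))
```

Reading the sorted edges from shortest to longest gives the merge order of the single-linkage dendrogram; in this toy input the single long edge is the one that joins the two groups.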
The key innovation lies in the subsequent construction of a “leaf tree”. By traversing the condensed tree in increasing distance order, the algorithm determines for each cluster segment the interval of m_c values (s_{\min}, s_{\max}] for which the segment remains a leaf (i.e., a local density maximum). This interval is stored alongside the distance range (