Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to a 17.9$\times$ end-to-end speedup over the strongest baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.


💡 Research Summary

Flash‑KMeans revisits the classic Lloyd’s K‑means algorithm with a systems‑first perspective, targeting modern GPU‑accelerated AI pipelines where clustering must be performed online with low latency and high throughput. The authors identify two fundamental implementation bottlenecks in existing GPU libraries: (1) an I/O‑bound assignment stage that materializes the full N × K distance matrix in high‑bandwidth memory (HBM), incurring massive read‑write traffic; and (2) severe atomic‑write contention during the centroid update stage, where many threads concurrently scatter‑add to the same centroid, throttling effective bandwidth. To eliminate these inefficiencies without altering the mathematical formulation, Flash‑KMeans introduces two novel kernels.

FlashAssign fuses distance computation with an online arg‑min reduction. By streaming blocks of points and centroids from HBM to on‑chip SRAM, it computes Euclidean distances via matrix‑multiply, immediately updates the per‑point minimum distance and cluster index, and discards the intermediate distance values. This eliminates the O(NK) memory traffic of the distance matrix, reducing it to O(N) and allowing the assignment stage to become compute‑bound rather than memory‑bound.
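The fused distance-plus-argmin pattern can be sketched in plain NumPy. This is an illustrative CPU model of the idea, not the authors' CUDA kernel: distances for each block of centroids are computed via the expansion ||x − c||² = ||x||² − 2x·c + ||c||², folded into a running per-point minimum, and then discarded, so the full N × K matrix is never stored. The function name `flash_assign` and the `block` parameter are our own labels.

```python
import numpy as np

def flash_assign(points, centroids, block=1024):
    """Tiled nearest-centroid assignment with an online argmin.

    Hypothetical sketch of the FlashAssign idea: each block of
    centroid distances is computed with a matmul, merged into the
    running minimum, and dropped -- the N x K matrix never exists.
    """
    n = points.shape[0]
    best_dist = np.full(n, np.inf)
    best_idx = np.zeros(n, dtype=np.int64)
    pt_sq = (points ** 2).sum(axis=1)  # ||x||^2, computed once
    for start in range(0, centroids.shape[0], block):
        c = centroids[start:start + block]
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, via a matmul
        d = pt_sq[:, None] - 2.0 * points @ c.T + (c ** 2).sum(axis=1)[None, :]
        local = d.argmin(axis=1)
        local_d = d[np.arange(n), local]
        mask = local_d < best_dist          # strict < keeps the first winner
        best_dist[mask] = local_d[mask]
        best_idx[mask] = start + local[mask]
        # d goes out of scope here: the intermediate block is discarded
    return best_idx, best_dist
```

On a GPU the same fusion happens inside one kernel, with the block of distances living in on-chip SRAM rather than a temporary array.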

Sort‑Inverse Update resolves update‑stage contention. After the assignment step, the algorithm sorts the assignment vector by cluster ID, producing contiguous segments of points belonging to the same cluster. It then performs a segmented reduction over each segment, replacing per‑token atomic adds with high‑bandwidth, locality‑friendly reductions. The sorting is implemented with an efficient GPU radix sort, and the subsequent reductions achieve bandwidth close to the hardware’s theoretical peak, delivering up to a 6.3× speedup over traditional scatter‑style atomics.
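A minimal NumPy model of the sort-then-segmented-reduce pattern (again a sketch of the idea, not the paper's kernel; the helper name `sort_inverse_update` is ours): sorting the assignment vector groups same-cluster points into contiguous runs, after which each centroid is one segment sum instead of many scattered atomic adds.

```python
import numpy as np

def sort_inverse_update(points, assign, k):
    """Contention-free centroid update via sort + segmented reduction.

    Sketch of the sort-inverse idea: a stable sort of the assignment
    vector yields an inverse mapping; same-cluster points then sit in
    contiguous segments, and each centroid is a single segment mean.
    Clusters with no points keep a zero centroid in this toy version.
    """
    order = np.argsort(assign, kind='stable')   # the inverse mapping
    sorted_assign = assign[order]
    sorted_pts = points[order]
    # Segment boundaries: first sorted position of each cluster present
    ids, starts, counts = np.unique(sorted_assign,
                                    return_index=True, return_counts=True)
    sums = np.add.reduceat(sorted_pts, starts, axis=0)  # one sum per segment
    centroids = np.zeros((k, points.shape[1]))
    centroids[ids] = sums / counts[:, None]
    return centroids
```

The GPU version replaces `argsort` with a radix sort and `reduceat` with a bandwidth-bound segmented reduction, which is where the reported 6.3× over atomics comes from.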

Beyond kernel redesign, the system‑level co‑optimizations include (a) chunked‑stream overlap that pipelines PCIe host‑to‑device transfers with computation, enabling out‑of‑core processing of up to one billion points while hiding communication latency, and (b) a cache‑aware compile heuristic that automatically selects optimal thread‑block sizes and memory access patterns based on workload parameters (N, K, dimensionality d, batch size B). This heuristic reduces compile‑time tuning overhead by up to 175× and incurs less than 0.3 % performance loss compared to hand‑tuned kernels.
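The out-of-core pattern behind chunked-stream overlap can be illustrated with a single Lloyd iteration that streams data in chunks, so only one chunk is resident at a time. This is a simplified CPU sketch under our own naming (`chunked_lloyd_step`); on a GPU the chunk transfer would additionally be overlapped with compute on separate CUDA streams, which plain NumPy cannot express.

```python
import numpy as np

def chunked_lloyd_step(chunks, centroids):
    """One Lloyd iteration over data delivered chunk by chunk.

    Illustrative out-of-core sketch: each chunk is assigned to its
    nearest centroids and folded into running per-cluster sums and
    counts, so memory use is bounded by the chunk size, not N.
    """
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for chunk in chunks:                    # e.g. a generator over host memory
        dist = ((chunk[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)
        np.add.at(sums, idx, chunk)         # accumulate partial sums per cluster
        counts += np.bincount(idx, minlength=k)
    nonempty = counts > 0
    new_c = centroids.copy()                # empty clusters keep their centroid
    new_c[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_c
```

Because the per-cluster sums and counts compose associatively, the chunked result matches the full-batch update exactly, which is what makes the overlap purely a latency-hiding optimization.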

Extensive evaluation on NVIDIA H200 GPUs demonstrates that FlashAssign achieves up to 21.2× acceleration for the assignment kernel, while Sort‑Inverse Update yields up to 6.3× for the update kernel. End‑to‑end, Flash‑KMeans outperforms the strongest baseline by up to 17.9×, and surpasses industry‑standard libraries such as NVIDIA cuML and FAISS by 33× and over 200× respectively. In large‑scale out‑of‑core scenarios, the system delivers a 10.5× speedup, and under dynamic shape workloads it maintains near‑optimal performance with negligible overhead.

In summary, Flash‑KMeans shows that by restructuring data movement and synchronization to respect modern GPU hardware constraints—rather than by reducing FLOPs—exact K‑means can be transformed into a fast, memory‑efficient, and deployable primitive for a wide range of AI applications, including embedding quantization, sparse routing in large language models, and real‑time clustering in generative video models. The approach is orthogonal to algorithmic accelerations and can be combined with existing convergence‑speed techniques for further gains.

