Boltzmann-Shannon Index: A Geometric-Aware Measure of Clustering Balance


The Boltzmann-Shannon Index (BSI) for clustered continuous data is introduced as a normalized measure that captures the relationship between geometry-based and frequency-based probability distributions defined over the clusters. In essence, it quantifies the similarity among the densities of the clusters, which are defined by a given labeling. This labeling may originate from a geometric partitioning of the state space itself, but need not in general. We illustrate its performance on synthetic Gaussian mixtures, the Iris benchmark data set, and a high-imbalance resource-allocation scenario, showing that the BSI provides a coherent assessment in cases where traditional metrics give incomplete or misleading signals. Moreover, in the resource-allocation setting, where equal density may be associated with a “fair” distribution, we demonstrate that the BSI not only detects inequality with high sensitivity, but also offers a numerically smooth measure that can be easily embedded in optimization frameworks as a regularization term for modern policy-making. Finally, the BSI also offers a new measure of the effectiveness of a given symbolic representation, i.e., coarse-grained states, for continuous-valued data recorded from complex dynamical systems.


💡 Research Summary

The paper introduces the Boltzmann‑Shannon Index (BSI), a normalized metric for evaluating clustered continuous data that simultaneously accounts for frequency‑based and geometry‑based probability distributions over the same set of clusters. Starting from the historical connection between Boltzmann’s entropy (volume of phase‑space regions) and Shannon’s information entropy (uncertainty over discrete symbols), the authors note that traditional Shannon entropy assumes a uniform prior over bins and therefore ignores the spatial arrangement of data points. Recent work on geometric partition entropy (GPE) addresses this by constructing bins of equal mass in the data’s cumulative distribution, but GPE can fail when data contain repeated values or when geometric concentration is ambiguous.

BSI bridges this gap by defining two distributions for a given labeling L of a dataset X: (1) p, the normalized histogram of cluster memberships (frequency), and (2) q, a normalized geometric measure of each cluster’s “volume”. The geometric measure can be obtained either from explicit region volumes (e.g., Voronoi cells) when a bounded domain is known, or more generally by computing the singular values of the data matrix for each cluster and taking the product of all singular values—a quantity that captures spread in every principal direction and is robust to overlapping or outlier‑laden clusters.
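The two distributions described above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' reference implementation: the function name `cluster_distributions`, the centering of each cluster before the SVD, and the simple sum-to-one normalization of the geometric measure are assumptions made for this sketch.

```python
import numpy as np

def cluster_distributions(X, labels):
    """Return (p, q) over the clusters of a labeled dataset.

    p[k]: fraction of points carrying label k (frequency distribution).
    q[k]: proportional to the product of singular values of cluster k's
          centered data matrix, a proxy for spread in every principal
          direction (geometry distribution).
    Assumes each cluster has at least as many points as dimensions, so
    all clusters yield the same number of singular values.
    """
    clusters = np.unique(labels)
    # Frequency-based distribution: normalized histogram of memberships.
    p = np.array([np.sum(labels == k) for k in clusters], dtype=float)
    p /= p.sum()
    # Geometry-based distribution: product of all singular values per cluster.
    vols = []
    for k in clusters:
        Xk = X[labels == k]
        Xk = Xk - Xk.mean(axis=0)  # center before measuring spread (assumption)
        s = np.linalg.svd(Xk, compute_uv=False)
        vols.append(np.prod(s))
    q = np.array(vols, dtype=float)
    q /= q.sum()
    return p, q
```

With two well-separated Gaussian blobs of equal size and spread, `p` and `q` both come out close to uniform; shrinking one blob's variance while keeping its point count fixed drives `q` away from `p`, which is exactly the mismatch the BSI is designed to detect.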

The index is defined as

 BSI = 1 − JSD(p‖q) = 1 − ½[D_KL(p‖m) + D_KL(q‖m)], with m = ½(p + q),

where JSD is the Jensen-Shannon divergence between the frequency distribution p and the geometric distribution q, computed with base-2 logarithms so that JSD ∈ [0, 1] and hence BSI ∈ [0, 1]. BSI = 1 indicates that the two distributions agree exactly; BSI = 0 indicates maximal disagreement.
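Assuming the standard Jensen-Shannon divergence with base-2 logarithms (so the divergence, and hence the index, lies in [0, 1]), the index can be sketched as follows; the function name `bsi` is chosen for this illustration.

```python
import numpy as np

def bsi(p, q):
    """Boltzmann-Shannon Index: 1 minus the Jensen-Shannon divergence
    (in bits) between the frequency distribution p and the geometric
    distribution q. Both inputs must be probability vectors of equal
    length. Returns a value in [0, 1]; 1 means p and q agree exactly.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # KL divergence in bits; terms with a[i] == 0 contribute nothing,
        # and wherever a[i] > 0 the mixture b[i] is also positive.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))
```

Because the JSD is smooth and bounded, `1 - bsi(p, q)` can be dropped directly into a loss function as the regularization term the abstract describes for resource-allocation settings.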

