CoHSI I; Detailed properties of the Canonical Distribution for Discrete Systems such as the Proteome

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The CoHSI (Conservation of Hartley-Shannon Information) distribution is at the heart of a wide class of discrete systems, defining the length distribution of their components amongst other global properties. Discrete systems such as the known proteome, where components are proteins; computer software, where components are functions; and texts, where components are books, are all known to fit this distribution accurately. In this short paper, we explore its solution and its resulting properties and lay the foundation for a series of papers which will demonstrate, amongst other things, why the average length of components is so highly conserved and why long components occur so frequently in these systems. These properties are not amenable to local arguments such as natural selection in the case of the proteome or human volition in the case of computer software, and indeed turn out to be inevitable global properties of discrete systems, devolving directly from CoHSI and shared by all. We will illustrate this using examples from the UniProt protein database as a prelude to subsequent studies.

💡 Research Summary

The paper presents a statistical‑mechanical framework that derives the length distribution of components in a wide class of discrete systems—from proteins in the proteome to functions in software and books in a library—by invoking the Conservation of Hartley‑Shannon Information (CoHSI). The authors model a system as M “boxes” (components) each containing ti indivisible tokens (e.g., amino‑acid residues) drawn from a unique alphabet of size ai. Tokens are ordered, so the number of possible arrangements for a given box is N(ti, ai; ai), a combinatorial function that counts ordered selections with repetition while ensuring every alphabet symbol appears at least once.
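Read literally, the counting rule described above (ordered tokens, repetition allowed, every one of the ai symbols used at least once) is the number of surjections from ti positions onto ai symbols. The sketch below computes that count by inclusion-exclusion. Whether this matches the paper's exact definition of N(ti, ai; ai) is an assumption, and the function name `count_arrangements` is hypothetical.

```python
from math import comb

def count_arrangements(t: int, a: int) -> int:
    """Count ordered length-t sequences over a symbols that use every symbol at least once.

    Assumption: this surjection count is one reading of N(t, a; a) as described
    in the summary; it is not taken from the paper's code.
    """
    if t < a:
        # Fewer tokens than distinct symbols: no valid arrangement exists.
        return 0
    # Inclusion-exclusion over the number k of symbols that are left unused.
    return sum((-1) ** k * comb(a, k) * (a - k) ** t for k in range(a + 1))
```

For instance, `count_arrangements(3, 2)` gives 6: the eight binary strings of length three minus the two constant ones.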

Using the principle that all microstates are equally probable, they introduce two Lagrange multipliers: α (normalisation) and β (shape). Maximising entropy under the constraint of fixed total information yields the implicit equation

log ti = −α − β · (d/dti) log N(ti, ai; ai) (1)

For small ti the authors replace the simple Stirling approximation with a Ramanujan‑type correction, leading to a more accurate form (2). When ti≫ai, the derivative term simplifies to log ai, and the equation reduces to

log ti = −α − β log ai (4)

which, on exponentiating both sides, gives the explicit power law

ti = e^{−α} · ai^{−β} (5)

Thus β controls the slope of the long tail (of magnitude approximately 1/β, since inverting (5) gives ai ∝ ti^{−1/β}), while α ensures that the probability density function (pdf) integrates to one.
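The large-ti limit used above can be checked numerically. The sketch below assumes that N counts sequences in which every symbol appears at least once (an interpretation of this summary, not code from the paper) and compares a central difference of log N in ti against log ai. The function names are hypothetical.

```python
from math import comb, log

def log_count(t: int, a: int) -> float:
    """log of the number of length-t sequences over a symbols using each symbol at least once."""
    n = sum((-1) ** k * comb(a, k) * (a - k) ** t for k in range(a + 1))
    return log(n)  # requires t >= a so that n > 0

def dlog_count_dt(t: int, a: int) -> float:
    """Central-difference estimate of d/dt log N at integer t (requires t - 1 >= a)."""
    return (log_count(t + 1, a) - log_count(t - 1, a)) / 2.0

a = 5
for t in (8, 20, 100):
    # The middle column should approach the last one as t grows.
    print(t, dlog_count_dt(t, a), log(a))
```

Under this assumption the difference from log ai shrinks rapidly once ti is several times ai, which is the regime in which equation (4) applies.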

The authors solve (2) numerically by pre‑computing log N on a grid (ti = 1…100, ai = 1…50), differentiating discretely, and applying a bisection method. Convergence is robust for ti ≥ 4; for ti < ai no solution exists because a box cannot contain fewer tokens than distinct alphabet symbols.
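The bisection step can be illustrated on the simplified asymptotic equation (4), where the answer is known in closed form. This is a minimal sketch, not the authors' solver: it does not use the pre-computed log N grid or equation (2), and the values of α, β and ti, as well as the bracketing interval, are illustrative assumptions.

```python
from math import exp, log

def bisect(f, lo: float, hi: float, tol: float = 1e-10, max_iter: int = 200) -> float:
    """Find a root of f in [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or (hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                 # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid    # root lies in the upper half
    return 0.5 * (lo + hi)

alpha, beta, t = -6.0, 0.5, 100.0    # illustrative values only, not fitted parameters

def residual(a: float) -> float:
    """Equation (4) rearranged to zero: log t + alpha + beta * log a."""
    return log(t) + alpha + beta * log(a)

a_root = bisect(residual, 1.0, 1000.0)
a_closed = exp((-alpha - log(t)) / beta)  # closed form obtained by rearranging (4)
print(a_root, a_closed)
```

The two printed values should agree to within the tolerance, which is a convenient sanity check before swapping in a residual built from tabulated derivatives of log N.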

Exploring the parameter space, they find that increasing β shifts the sharp peak leftward and flattens the power‑law tail, whereas increasing α moves the peak rightward without appreciably altering the tail. This demonstrates that, for small ti, both α and β jointly affect normalisation and shape, whereas in the asymptotic regime they decouple.

Empirical validation uses the UniProt/TrEMBL protein length data (versions 15‑07 and 17‑03). Both datasets display a pronounced unimodal peak followed by a power‑law tail with an exponent of approximately –3.13. By fitting the complementary cumulative distribution function (ccdf) across a grid of α and β values, the authors locate the region (α≈4–5, β≈0.2–0.3) that yields an adjusted R² near 0.95, which they take as confirmation of a good fit. On this evidence, CoHSI predicts not only the existence of a conserved average length but also the surprisingly frequent occurrence of very long components, without invoking system‑specific mechanisms such as natural selection or human design.
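Two ingredients of such a fit, the empirical ccdf and the adjusted R², can be sketched as follows. The function names are hypothetical, the ccdf convention P(X ≥ x) is an assumption (the summary does not say which convention the authors use), and the grid search over α and β is omitted because it depends on the numerical CoHSI solution.

```python
from collections import Counter

def empirical_ccdf(lengths):
    """Return the sorted distinct lengths x and P(X >= x) for each of them."""
    n = len(lengths)
    counts = Counter(lengths)
    xs = sorted(counts)
    ccdf = []
    remaining = n                    # number of observations with value >= current x
    for x in xs:
        ccdf.append(remaining / n)
        remaining -= counts[x]
    return xs, ccdf

def adjusted_r2(observed, predicted, n_params):
    """Adjusted R^2 for a model with n_params fitted parameters."""
    n = len(observed)
    mean = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)
```

With real data, `observed` would hold the empirical ccdf values (often on a log scale) and `predicted` the model ccdf at the same lengths, with `n_params=2` if only α and β are varied.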

The paper also emphasizes reproducibility: all source code, data, and scripts are publicly available, and the methodology follows established reproducibility guidelines. By grounding the analysis in a single information‑conservation principle, the work unifies disparate phenomena—protein evolution, software architecture, and textual organization—under a common statistical law. The authors conclude that the CoHSI framework provides a parsimonious, global explanation for the observed length distributions in discrete systems, and they outline a program of subsequent papers to extend the theory to other properties such as variance, higher moments, and dynamical evolution.
