Scaling laws and fluctuations in the statistics of word frequencies


In this paper we combine statistical analysis of large text databases and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps’ law), here we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words by simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf’s law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and ii) correlations in the co-occurrence of words lead to an increase in the variance, and the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google-ngram, English Wikipedia, and a collection of scientific articles).


💡 Research Summary

The paper investigates three scaling laws in textual data: Zipf’s law for word frequencies, Heaps’ law for the growth of distinct words with text length, and a third, less studied law governing the fluctuations of vocabulary size around its mean. Using three massive corpora (Google‑ngram, English Wikipedia, and a collection of scientific articles from PLoS ONE), the authors first confirm that Zipf’s rank‑frequency relation (F_r ∝ r^−α) holds and that the average number of distinct words N(M) grows sub‑linearly with the total token count M (Heaps’ law, N ∝ M^λ with 0 < λ < 1).
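The connection between the two laws can be illustrated with a quick simulation. The sketch below (all parameters hypothetical, not the paper’s fitted values) draws tokens from a Zipf-like distribution F_r ∝ r^−α and tracks how the number of distinct words grows with text length:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: Zipf exponent, vocabulary size, text length
alpha, W, M = 1.0, 50_000, 200_000

# Zipf rank-frequency law: F_r proportional to r^-alpha
probs = np.arange(1, W + 1) ** -alpha
probs /= probs.sum()

# Draw a synthetic "text" of M tokens
tokens = rng.choice(W, size=M, p=probs)

# growth[m-1] = N(m): number of distinct words among the first m tokens
seen, growth = set(), []
for t in tokens:
    seen.add(t)
    growth.append(len(seen))

# Heaps' law: doubling the text length far less than doubles the vocabulary
print("N(M/2) =", growth[M // 2 - 1], " N(M) =", growth[-1])
```

Because frequent words are quickly exhausted and ever-rarer words must be found to enlarge the vocabulary, N(M) grows sub-linearly even though tokens are drawn i.i.d.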

When they examine the variance of N across documents of the same length, they discover a striking linear relationship σ(M) ≈ 0.1 µ(M), i.e., the standard deviation scales proportionally to the mean (β≈1 in the fluctuation‑scaling relation σ ∝ µ^β). This contrasts sharply with the prediction of a simple Poisson‑null model, which assumes each word follows an independent Poisson process with a fixed global frequency F_r. Under that model, the variance would scale as σ ∝ √µ (β=½), and the mean vocabulary size would be slightly over‑estimated.
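The Poisson null model’s predictions can be written in closed form: if word r occurs as an independent Poisson process with frequency F_r, it is absent from an M-token text with probability e^(−M F_r), so the vocabulary size is a sum of independent Bernoulli indicators. A minimal sketch (the Zipf parameters are illustrative, not taken from the paper):

```python
import numpy as np

# Under the Poisson null model, word r is present in an M-token text
# with probability p_r = 1 - exp(-M * F_r), independently of other words:
#   E[N]   = sum_r p_r
#   Var[N] = sum_r p_r * (1 - p_r)
# Illustrative Zipf frequencies (alpha = 1, W = 100,000 word types)
F = np.arange(1, 100_001) ** -1.0
F /= F.sum()

for M in (10_000, 100_000, 1_000_000):
    p = 1.0 - np.exp(-M * F)
    mean = p.sum()
    std = np.sqrt((p * (1.0 - p)).sum())
    print(f"M={M:>9,}  E[N]={mean:9.0f}  sigma={std:6.0f}  sigma/mu={std / mean:.4f}")
```

Because the indicators are independent, σ grows only like √µ (β = 1/2), so σ/µ shrinks steadily as the text grows, in sharp contrast with the empirical σ(M) ≈ 0.1 µ(M).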

To explain the discrepancy, the authors introduce topical variability. Real documents are mixtures of a limited number of latent topics; each topic has its own word‑frequency profile F_r(t). Consequently, the effective frequency of a word in a given document, F_r,doc = Σ_t P_doc(t) F_r(t), is a random variable that fluctuates across documents. By treating the word frequencies as quenched random variables and averaging over both the Poisson process and the topic distribution (a “quenched average”), they derive new expressions for the mean and variance:

  • The mean vocabulary size becomes a quenched average, E_q[N(M)] = E_doc[ Σ_r (1 − e^(−M F_r,doc)) ]. Because 1 − e^(−x) is concave, this average is smaller than the homogeneous (annealed) prediction obtained from the global frequencies F_r: the inhomogeneous dissemination of words reduces the mean vocabulary size.
  • The variance acquires an extra term from the fluctuations of F_r,doc across documents. This term grows like µ(M)², so σ ∝ µ (β ≈ 1) and the vocabulary size becomes a non-self-averaging quantity, consistent with the empirical σ(M) ≈ 0.1 µ(M).
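This quenched mechanism can be sketched numerically. The toy model below is a deliberate simplification of the paper’s setup (all parameters hypothetical): just two topics with disjoint, Zipf-distributed vocabularies, mixed in each document with a random weight p, so every word’s effective frequency fluctuates from document to document:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy topic model: two topics with disjoint, Zipf-distributed vocabularies
# (alpha = 2, so vocabulary growth stays far from saturation at these sizes).
V = 20_000                                  # word types per topic
f = np.arange(1, V + 1) ** -2.0
f /= f.sum()

def expected_vocab(M, p):
    """Per-document expected vocabulary under Poisson usage, given the
    document's topic weight p (effective frequencies p*f and (1-p)*f)."""
    return (1 - np.exp(-M * p * f)).sum() + (1 - np.exp(-M * (1 - p) * f)).sum()

# Quenched randomness: the topic weight varies across documents
ps = rng.uniform(0.1, 0.9, size=500)

for M in (10_000, 100_000, 1_000_000):
    N = np.array([expected_vocab(M, p) for p in ps])
    print(f"M={M:>9,}  mu={N.mean():8.1f}  sigma/mu={N.std() / N.mean():.3f}")
```

The document-to-document part of the variance scales like µ², so σ/µ stays roughly constant as M grows (β ≈ 1), reproducing the anomalous fluctuation scaling; in the Poisson null model the same ratio would instead decay like µ^(−1/2).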
