Empirical Growing Networks vs Minimal Models: Evidence and Challenges from Software Heritage and APS Citation Datasets
We investigate the evolution rules and degree distribution properties of the Software Heritage dataset, a large-scale growing network linking software source-code versions from open-source communities. The network spans more than 40 years and includes about 6 billion nodes and edges. Our analysis relies on deterministic temporal and topological partitions of nodes and edges, which account for the multilayer and partially timestamped structure of the main graph. We derive a temporal graph that reveals a mesoscale structure and enables the study of edge dynamics (creation, inheritance, and aging), together with comparisons to minimal models using degree distributions and histograms of edge timestamp differences. The temporal graph also exposes regime shifts that correlate with changes in developer practices, as reflected in the average number of edges per new node. We estimate scaling exponents under the scale-free hypothesis and highlight the sensitivity of the estimation method to both regime shifts and outliers, while showing that partitioning improves regularity and helps disentangle these effects. We extend the analysis to the APS citation network, which also exhibits a major regime shift, with an accelerated growth regime becoming dominant after 1985. Although both datasets are a priori good candidates for advanced quantitative analysis, our results illustrate how structural and dynamical transitions hamper our ability to draw firm conclusions about the existence and observability of a scale-free regime in these empirical networks. These findings underscore the need for refined tools and models to study transient growth regimes, to extend current frameworks toward minimal causal growth models, and to enable robust comparisons between empirical growing networks and minimal models.
Research Summary
This paper presents an empirical investigation of two large, long-running networks: the Software Heritage (SWH) dataset, which records software source-code versions and their provenance over more than four decades, and the American Physical Society (APS) citation network, which captures scholarly citations from the early 20th century to the present. The SWH graph contains about 6 billion nodes and edges. The two datasets differ in the completeness of their temporal information: SWH's revision (RV) and release (RL) nodes carry commit timestamps, whereas origin (O) nodes do not, while APS nodes have clear publication and citation dates.
The authors first define the “main graph” G (including O, RV, RL nodes) and then extract a subgraph G_RV/RL that contains only temporally stamped nodes. To study growth mechanisms on the full graph, they introduce two systematic partitioning strategies. The first, “temporal partitioning,” propagates timestamps from RV/RL nodes up to their associated origin nodes and aggregates directed paths through RV/RL to create O→O edges. Four variants are constructed by combining (i) inheritance rules (I: add an edge whenever any directed path exists; WI: add an edge only if a direct RV/RL‑to‑O edge exists) with (ii) true‑time rules (TT: edge direction follows the chronological order of the propagated timestamps; NoTT: direction follows the original path orientation). This yields four derived temporal graphs G_modeI,modeT.
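The timestamp propagation and the TT/NoTT orientation rule described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dictionary-based graph, the choice of the earliest reachable RV/RL timestamp as the origin's timestamp, and the convention that a TT edge points from the later origin to the earlier one are all assumptions made for illustration.

```python
from collections import deque

def propagate_timestamps(adj, stamps, origins):
    """Give each origin the earliest timestamp among the RV/RL nodes it can reach.

    Assumption: 'earliest reachable timestamp' stands in for the paper's
    propagation rule, which the summary does not spell out.
    """
    origin_time = {}
    for origin in origins:
        seen, queue, best = {origin}, deque([origin]), None
        while queue:
            node = queue.popleft()
            t = stamps.get(node)          # origins have no timestamp of their own
            if t is not None and (best is None or t < best):
                best = t
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        origin_time[origin] = best
    return origin_time

def orient(edge, origin_time, true_time=True):
    """TT: point from the later origin to the earlier one (assumed convention).
    NoTT: keep the orientation of the original path."""
    u, v = edge
    tu, tv = origin_time[u], origin_time[v]
    if true_time and tu is not None and tv is not None and tu < tv:
        return (v, u)
    return (u, v)

# Hypothetical toy graph: two origins, three timestamped revisions.
adj = {"o1": ["r1"], "o2": ["r3"], "r3": ["r2"]}
stamps = {"r1": 100, "r2": 200, "r3": 300}
times = propagate_timestamps(adj, stamps, ["o1", "o2"])
```

In this toy example `o1` receives timestamp 100 and `o2` receives 200, so a derived edge written as `("o1", "o2")` is flipped under TT and left unchanged under NoTT.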
The second strategy, “TSL partitioning,” classifies each origin node by a triplet (T,S,L): the capped in‑degree (T), capped out‑degree (S), and a binary flag for self‑loops (L). With a cap δₘ=1, nodes fall into four types (001, 011, 101, 111). The resulting TSL graph G_TSL_δₘ is then compared to a directed version of the Barabási–Albert model (the Price model), providing a minimal generative baseline.
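For readers unfamiliar with the baseline, a minimal simulation of a Price-style directed preferential-attachment process might look as follows. The parameter names (`m` edges per new node, offset `a`) and the mixture sampling scheme are illustrative assumptions, not the configuration used in the paper.

```python
import random

def price_model(n, m=1, a=1.0, seed=0):
    """Grow a directed graph where each new node links to m older nodes,
    chosen with probability proportional to (in-degree + a)."""
    rng = random.Random(seed)
    targets = []   # one entry per edge head; sampling it is proportional to in-degree
    edges = []
    for new in range(1, n):
        existing = new                      # older nodes are 0 .. new-1
        k = min(m, existing)                # cannot pick more distinct targets than exist
        chosen = set()
        while len(chosen) < k:
            total_in = len(targets)
            # Mixture trick: in-degree part with weight total_in, uniform part with weight a * existing
            if targets and rng.random() < total_in / (total_in + a * existing):
                chosen.add(rng.choice(targets))
            else:
                chosen.add(rng.randrange(existing))
        for t in sorted(chosen):
            edges.append((new, t))          # edges always point from newer to older
            targets.append(t)
    return edges

edges = price_model(200, m=2)
```

Because every new node adds a fixed number of edges, this baseline has a constant average number of edges per new node, which is exactly the property the empirical regime shifts call into question.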
Applying this pipeline to SWH, the authors discover a clear regime shift around 2010, coinciding with the widespread adoption of Git. Topological partitioning based on out-degree reveals that the average number of edges per newly added node jumps and then stabilizes, indicating a change in developer workflow (more forking, merging, and collaborative branching). However, the overall in- and out-degree distributions remain highly irregular. Large "outlier" events, such as massive project forks or sudden influxes of versions, produce heavy tails that dramatically affect power-law exponent estimates. Using the maximum-likelihood approach of Clauset, Shalizi, and Newman, the scaling exponent γ for the in-degree tail varies between roughly 2.1 and 2.9 depending on the time window and the presence of outliers, underscoring the sensitivity of scale-free diagnostics to transient dynamics.
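To make the estimation step concrete, here is a minimal sketch of the continuous maximum-likelihood estimator for a power-law tail, γ̂ = 1 + n / Σ ln(x_i / x_min). It is a simplification: degrees are discrete, the full Clauset-Shalizi-Newman procedure also selects x_min and tests goodness of fit, and the function names and synthetic data below are assumptions for illustration only.

```python
import math
import random

def powerlaw_mle(values, xmin):
    """Continuous MLE of the exponent gamma for the tail x >= xmin,
    with its asymptotic standard error (gamma - 1) / sqrt(n)."""
    tail = [x for x in values if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail values equal xmin; exponent is undefined")
    gamma = 1.0 + len(tail) / log_sum
    stderr = (gamma - 1.0) / math.sqrt(len(tail))
    return gamma, stderr

def sample_powerlaw(n, gamma, xmin=1.0, seed=0):
    """Draw n synthetic values with density proportional to x**(-gamma), x >= xmin."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(n)]

data = sample_powerlaw(20000, gamma=2.5)
gamma_hat, gamma_err = powerlaw_mle(data, xmin=1.0)
```

On clean synthetic data the estimate sits close to the true exponent. Running the same estimator on separate time windows, or on a mixture of samples drawn with different exponents, is one simple way to see how a regime shift can move the fitted value.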
The APS citation network exhibits a different but equally striking transition. Around 1985, the citation rate accelerates, leading to an “accelerated growth” regime where the average in‑degree per paper continuously rises rather than stabilizing. This shift is evident in the temporal evolution of degree distributions and in histograms of citation‑date differences. As with SWH, the estimated power‑law exponent changes across the transition, and the presence of highly cited “hub” papers further distorts the tail.
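A histogram of citation-date differences can be computed directly from publication years and a citation edge list. The sketch below assumes a hypothetical data layout (a `pub_year` mapping and `(citing, cited)` pairs) and works at yearly resolution; the paper's actual binning is not specified here.

```python
from collections import Counter

def citation_age_histogram(pub_year, citations):
    """Count citations by age gap: year of the citing paper minus year of the cited paper."""
    gaps = Counter()
    for citing, cited in citations:
        gaps[pub_year[citing] - pub_year[cited]] += 1
    return dict(sorted(gaps.items()))

# Hypothetical toy data: three papers and three citations.
pub_year = {"a": 1980, "b": 1985, "c": 1990}
citations = [("b", "a"), ("c", "a"), ("c", "b")]
hist = citation_age_histogram(pub_year, citations)
```

Computing this histogram separately for citing papers published before and after a suspected transition year gives a direct view of how edge aging differs between regimes.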
Across both datasets, the authors demonstrate that (1) growth rules are not static; they evolve over decades, violating the constant‑m assumption of classic preferential‑attachment models; (2) multilayer structures and partially missing timestamps necessitate careful graph derivation before any statistical inference; (3) standard scale‑free hypothesis testing is highly vulnerable to regime shifts and outliers, leading to contradictory exponent estimates.
The paper concludes by advocating for a refined methodological framework: (i) systematic partitioning to isolate temporally coherent subgraphs; (ii) explicit handling of regime shifts, by segmenting the timeline and fitting models separately; (iii) development of "transition-aware" minimal models that allow time-varying attachment kernels, edge-inheritance mechanisms, and variable edge-creation rates. Such models would bridge the gap between the elegant asymptotic theory of preferential attachment and the messy, evolving reality of large-scale empirical networks, enabling more robust comparisons and deeper insight into the causal processes shaping software ecosystems and scientific citation practices.
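As an illustration of recommendation (ii), the timeline can be cut at candidate breakpoints and the average number of edges per new node computed within each segment. The function name, the data layout, and the breakpoint value below are assumptions chosen for the example, not values taken from the paper.

```python
import bisect

def mean_edges_per_new_node(node_time, edges, breakpoints):
    """Average number of outgoing edges per new node in each time segment.

    Segment i holds nodes whose time t satisfies
    breakpoints[i-1] <= t < breakpoints[i] (open-ended at both extremes).
    """
    n_segments = len(breakpoints) + 1
    node_count = [0] * n_segments
    edge_count = [0] * n_segments
    for t in node_time.values():
        node_count[bisect.bisect_right(breakpoints, t)] += 1
    for source, _target in edges:
        # An edge is attributed to the segment in which its source node appeared.
        edge_count[bisect.bisect_right(breakpoints, node_time[source])] += 1
    return [e / n if n else None for e, n in zip(edge_count, node_count)]

# Hypothetical toy network with one breakpoint at 1985.
node_time = {1: 1980, 2: 1982, 3: 1990, 4: 1995}
edges = [(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3)]
means = mean_edges_per_new_node(node_time, edges, breakpoints=[1985])
```

A marked difference between segment means, as in this toy case, signals that a single constant-m model is unlikely to describe the whole timeline, so each segment would be fitted on its own.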