High-impact Scientific Software in Astronomy and its creators
In recent decades, scientific software has graduated from a hidden side-product to a first-class member of the astrophysics literature. We aim to quantify the activity and impact of software development for astronomy through a systematic survey. Starting from the Astrophysics Source Code Library and the Journal of Open Source Software, we analyse 3432 public git-based scientific software packages. Text analysis of paper abstracts suggests seven dominant themes: cosmology, data reduction pipelines, exoplanets, hydrodynamic simulations, radiative-transfer spectral simulation, statistical inference and galaxies. We identify key individual contributors to high-impact software in astronomy & astrophysics, together with their affiliated institutes and countries. We use the number of citations to papers using the software and the number of person-days logged in their git repositories as proxies for impact and complexity, respectively. We find that half of the mapped development is carried out through US-affiliated institutes, and that a large number of high-impact projects are led by a single person. Our results indicate that on any given day, over 200 people are actively improving software in astronomy.
💡 Research Summary
The paper addresses the growing importance of scientific software in astronomy and seeks to quantify both its activity and impact using a systematic, data‑driven approach. The authors start from two publicly curated sources: the Astrophysics Source Code Library (ASCL) and the Journal of Open Source Software (JOSS). As of November 2025, they had assembled a dataset of 3,432 unique, publicly accessible Git repositories (1,328 from JOSS, 2,104 from ASCL); the vast majority (≈95%) are hosted on GitHub, with a minority on GitLab and a few on other platforms.
Impact is defined not by direct citations to the software itself but by the total number of citations received by all astronomy papers that cite the software (a second‑order impact metric). This is extracted from the NASA ADS database using the “collection:astronomy” filter, ensuring that the metric reflects the scientific influence of the software through the papers that depend on it. The authors acknowledge that citation practices vary across sub‑fields and that citations are an imperfect proxy for scientific value, but they argue that this approach captures a broad, comparable signal across the entire sample.
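This second‑order metric can be sketched as a simple aggregation over a citation graph. The sketch below uses invented paper identifiers and citation counts purely for illustration; in the actual study these would come from NASA ADS queries, not from hand‑built dictionaries.

```python
# Sketch of the second-order impact metric described above: the impact of a
# software package is the total number of citations received by the papers
# that cite it. All identifiers and counts below are hypothetical.

def second_order_impact(citing_papers, citation_counts):
    """Sum the citations received by every paper that cites the software."""
    return sum(citation_counts.get(p, 0) for p in citing_papers)

# Hypothetical example: three astronomy papers cite the package.
citing = ["2020A&A..1", "2021ApJ..2", "2022MNRAS..3"]
counts = {"2020A&A..1": 150, "2021ApJ..2": 40, "2022MNRAS..3": 10}

print(second_order_impact(citing, counts))  # 200
```

The metric deliberately ignores direct citations to the software paper itself; only the downstream influence of the papers that used the software is counted.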
Effort is measured by counting the number of distinct days on which a contributor made at least one Git commit, termed “person‑days”. Commits are timestamped, deduplicated within 86,400‑second (one‑day) intervals, and aggregated per author. The most active contributor in each repository defines the “maximum” person‑days; any other contributor who has logged at least 10% of that maximum is considered a “major” contributor. This metric is deliberately coarse: it does not capture the size or difficulty of individual changes, but it does provide a simple, comparable proxy for sustained involvement and cognitive investment. Author names from Git logs are matched to paper authors using flexible string matching (first‑name/last‑name permutations, initials, etc.) to link effort to institutional affiliation.
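The person‑days bookkeeping described above can be sketched as follows. The commit data is invented for illustration; in practice the (author, timestamp) pairs would be extracted from `git log` output.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Sketch of the "person-days" effort metric: for each author, count the
# distinct UTC calendar days with at least one commit. The timestamps
# below are illustrative Unix epochs, not from a real repository.

def person_days(commits):
    """commits: iterable of (author, unix_timestamp) pairs."""
    days = defaultdict(set)
    for author, ts in commits:
        days[author].add(datetime.fromtimestamp(ts, tz=timezone.utc).date())
    return {author: len(d) for author, d in days.items()}

def major_contributors(effort, threshold=0.10):
    """Authors with at least `threshold` of the top contributor's days."""
    top = max(effort.values())
    return {a for a, d in effort.items() if d >= threshold * top}

commits = [("alice", 0), ("alice", 3600), ("alice", 90000),  # two days
           ("bob", 0)]                                       # one day
effort = person_days(commits)
print(effort)  # {'alice': 2, 'bob': 1}
```

Here both authors count as “major” contributors, since bob's single day exceeds 10% of alice's two; with a more lopsided repository the threshold prunes drive‑by committers.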
To uncover thematic structure, the authors perform text mining on the titles and abstracts of the papers associated with each software package. After stop‑word removal, they compute TF‑IDF vectors and apply Non‑Negative Matrix Factorization (NMF) to extract seven latent topics. The topics are further visualized using Uniform Manifold Approximation and Projection (UMAP) based on cosine distances, and each software package is assigned to the topic with the highest loading. The seven topics, described by their top 20 keywords, correspond to: (1) cosmology, (2) data‑reduction pipelines, (3) exoplanets, (4) hydrodynamic simulations, (5) radiative‑transfer spectral simulations, (6) statistical inference, and (7) galaxies.
Key findings:
- Geographic distribution – roughly half of the total development effort originates from US‑affiliated institutions (NASA, Caltech, MIT, Harvard, etc.), with the remainder spread across Europe, Asia, and a few other regions.
- Contributor concentration – a large fraction of high‑impact software is driven by one or two core developers. The “single‑point‑of‑failure” risk is highlighted, as well as the outsized influence of individual expertise.
- Effort vs. impact – there is no strong correlation between person‑days and citation count. Some software with modest development time has accrued thousands of citations (e.g., Astropy), while other heavily maintained packages show relatively low citation impact, suggesting that factors such as community adoption, documentation quality, and domain relevance play significant roles.
- Data quality issues – the authors identify systematic problems: (a) incomplete migration of legacy repositories (e.g., IRAF’s early development predates GitHub, leading to under‑counted effort), (b) mismatches between Git author names and paper author names causing inaccurate affiliation mapping, and (c) double‑counting when a developer lists multiple affiliations. Specific examples include IRAF being incorrectly attributed to post‑2017 GitHub contributors rather than the original NOAO team, and Starlink’s affiliations being conflated with recent LSST contributors. The paper treats these limitations transparently and calls for future cleaning work.
The authors propose a combined metric (citations × person‑days) to capture both usefulness and development difficulty, though they note its interpretive limits. They suggest that this dual‑axis framework could be applied to other scientific domains to assess software ecosystems.
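Given the two quantities above, the combined metric reduces to a product and a ranking. The package names and numbers below are invented placeholders, not values from the paper.

```python
# Sketch of the combined metric: second-order citations multiplied by total
# person-days, intended to reward software that is both used and hard to
# build. All values are hypothetical.
packages = {
    "pkg_a": {"citations": 5000, "person_days": 120},
    "pkg_b": {"citations": 800,  "person_days": 2000},
}

combined = {name: m["citations"] * m["person_days"]
            for name, m in packages.items()}
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)  # ['pkg_b', 'pkg_a']
```

Note how the product reorders the packages relative to citations alone, which is exactly the interpretive subtlety the authors caution about.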
Implications: The study provides empirical evidence for funding agencies and institutional leaders to recognize software development as a scholarly output deserving of credit, support, and sustainable maintenance models. It also underscores the need for better citation practices (e.g., software citation standards) and for mechanisms that mitigate reliance on a few individuals for critical infrastructure.
Future work outlined includes: (i) extracting funding acknowledgments to link financial support to software impact, (ii) incorporating private or non‑GitHub repositories to broaden coverage, (iii) refining effort metrics (e.g., weighting commits by lines changed or code complexity), and (iv) longitudinal studies to track how impact evolves as software matures.
In summary, the paper delivers a comprehensive, reproducible analysis of the astronomical open‑source software landscape, revealing dominant themes, geographic and institutional patterns, and the nuanced relationship between development effort and scientific impact. It sets a methodological benchmark for meta‑research on research software and offers actionable insights for the community at large.