Analyzing the Availability of E-Mail Addresses for PyPI Libraries
Open Source Software (OSS) libraries form the backbone of modern software systems, yet their long-term sustainability often depends on maintainers being reachable for support, coordination, and security reporting. In this paper, we empirically analyze the availability of contact information - specifically e-mail addresses - across 686,034 Python libraries on the Python Package Index (PyPI) and their associated GitHub repositories. We examine how and where maintainers provide this information, assess its validity, and explore coverage across individual libraries and their dependency chains. Our findings show that 81.6% of libraries include at least one valid e-mail address, with PyPI serving as the primary source (79.5%). When analyzing dependency chains, we observe that up to 97.8% of direct and 97.7% of transitive dependencies provide valid contact information. At the same time, we identify over 698,000 invalid entries, primarily due to missing fields. These results demonstrate strong maintainer reachability across the ecosystem, while highlighting opportunities for improvement - such as offering clearer guidance to maintainers during the packaging process and introducing opt-in validation mechanisms for existing e-mail addresses.
Research Summary
This paper presents a large‑scale empirical study of the availability and validity of maintainer email addresses in the Python Package Index (PyPI) ecosystem. The authors collected metadata for 686,034 PyPI packages and their associated GitHub repositories, extracting declared email addresses from the PyPI project pages and GitHub profile information and recording whether a SECURITY.md file is present. Each email was checked for syntactic correctness and domain resolvability using an external validator, and each entry was classified into one of four categories: valid, syntactically incorrect, undeliverable (domain does not resolve), and empty field.
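The four-way classification described above can be sketched as follows. This is a minimal illustration, not the external validator used in the study: the regular expression is a deliberately simple stand-in for full address syntax rules, `domain_resolves` is a hypothetical helper that only checks for an address record (a real validator might look up MX records instead), and the category labels are illustrative names.

```python
import re
import socket
from typing import Callable, Optional

# Simplified syntax check (assumption): real validators follow the RFCs more closely.
_EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)$")


def domain_resolves(domain: str) -> bool:
    """Hypothetical resolvability check: True if the domain has an address record."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def classify_email(
    raw: Optional[str],
    resolver: Callable[[str], bool] = domain_resolves,
) -> str:
    """Return one of: 'empty', 'syntax_error', 'undeliverable', 'valid'."""
    if raw is None or not raw.strip():
        return "empty"
    match = _EMAIL_RE.match(raw.strip())
    if match is None:
        return "syntax_error"
    # The captured group is the domain part after the '@'.
    if not resolver(match.group(1)):
        return "undeliverable"
    return "valid"
```

Passing the resolver as a parameter keeps the classification logic testable without network access; for example, `classify_email("a@example.org", resolver=lambda d: False)` exercises the undeliverable branch.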
Four research questions guided the analysis: (RQ1) the distribution of sources (PyPI, GitHub, both, or none) where emails are provided; (RQ2) the distribution of reasons for invalid emails; (RQ3) the proportion of packages that provide at least one valid email when considered in isolation; and (RQ4) the proportion of packages that provide valid emails when examined within their dependency chains (both direct and transitive). To capture the structural importance of packages, the authors built a directed dependency graph (1,866,485 edges) and computed PageRank scores, enabling a fine‑grained analysis across different importance tiers (top 10 %, top 1 %, top 0.1 %).
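To illustrate how PageRank scores over a dependency graph can define importance tiers, here is a small power-iteration sketch using only the standard library. It is not the authors' implementation: the edge direction (dependent to dependency), the damping factor of 0.85, the iteration count, and the `top_tier` helper are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def pagerank(
    edges: List[Tuple[str, str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power-iteration PageRank. Edges are (dependent, dependency) pairs,
    so rank flows toward packages that others depend on (assumed direction)."""
    nodes = sorted({name for edge in edges for name in edge})
    if not nodes:
        return {}
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by packages without dependencies is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def top_tier(rank: Dict[str, float], fraction: float) -> List[str]:
    """Hypothetical helper: names of the top `fraction` of packages by score."""
    k = max(1, int(len(rank) * fraction))
    return sorted(rank, key=rank.get, reverse=True)[:k]
```

With scores in hand, a call such as `top_tier(rank, 0.001)` would select the top 0.1 % tier, and coverage can then be computed separately for each tier.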
Key findings include:
- Source distribution – 59.5 % of packages list an email exclusively on PyPI, 2.1 % only on GitHub, 20.0 % on both platforms, and 18.4 % provide none. PyPI is clearly the dominant channel for contact information.
- Invalid emails – A total of 698,141 invalid entries were identified. The majority (79.9 %) originate from GitHub, largely due to empty fields (79.7 % of GitHub entries). Only 0.7 % of all invalid entries are syntactic errors, and 4.2 % are undeliverable domains. Empty fields account for 95.1 % of all invalid cases.
- Package‑level coverage – 81.6 % of all packages contain at least one valid email address. Coverage rises with importance: the top 0.1 % of packages (by PageRank) reach 96.1 % valid‑email coverage.
- Dependency‑chain coverage – When considering the full dependency graph, up to 97.8 % of direct dependencies and up to 97.7 % of transitive dependencies provide valid emails. Packages that are not used as dependencies exhibit lower coverage, highlighting a gap in metadata completeness for less‑used libraries.
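The source categories and dependency-chain coverage in the findings above could be computed along these lines. This is a sketch under one possible reading of the metrics: the function names, the input shapes (a mapping from package to its direct dependencies, and a mapping from package to a boolean "has a valid email" flag), and the choice to treat transitive dependencies as those reachable only indirectly are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def source_category(pypi_valid: bool, github_valid: bool) -> str:
    """Label where a package provides a valid email (labels are illustrative)."""
    if pypi_valid and github_valid:
        return "both"
    if pypi_valid:
        return "pypi_only"
    if github_valid:
        return "github_only"
    return "none"


def direct_and_transitive(
    package: str, deps: Dict[str, List[str]]
) -> Tuple[Set[str], Set[str]]:
    """Breadth-first walk of the dependency chain starting at `package`."""
    direct = set(deps.get(package, []))
    seen = set(direct)
    queue = deque(direct)
    while queue:
        current = queue.popleft()
        for nxt in deps.get(current, []):
            # Skip already-visited packages and cycles back to the root.
            if nxt not in seen and nxt != package:
                seen.add(nxt)
                queue.append(nxt)
    return direct, seen - direct


def coverage(packages: Iterable[str], has_valid_email: Dict[str, bool]) -> float:
    """Share of the given packages with at least one valid email (0.0 if empty)."""
    names = list(packages)
    if not names:
        return 0.0
    return sum(1 for p in names if has_valid_email.get(p, False)) / len(names)
```

For a single package, `coverage(direct, flags)` and `coverage(transitive, flags)` give the two per-chain shares; aggregating them across all packages is left open here because the source does not spell out the exact aggregation.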
The authors interpret these results as evidence that the PyPI ecosystem is, overall, well‑connected in terms of maintainer reachability, especially for high‑impact libraries and their dependencies. However, the sizable fraction of packages lacking any contact information (≈18 %) and the high proportion of empty‑field entries on GitHub suggest opportunities for improvement. The paper recommends several interventions: (1) enhancing the PyPI upload UI to make email entry mandatory or at least strongly encouraged; (2) offering an opt‑in automated validation service that checks syntax and domain resolvability during package submission; (3) encouraging maintainers to populate contact details on GitHub profiles and in SECURITY.md files; and (4) possibly extending the contact‑information model to include alternative channels such as GPG keys or dedicated support URLs.
Methodologically, the study demonstrates a reproducible pipeline for large‑scale metadata mining, combining REST API calls, dependency graph construction, and PageRank‑based importance weighting. Limitations include reliance on publicly available metadata (private emails are not captured), potential false negatives where maintainers use indirect contact mechanisms, and the static nature of the snapshot (temporal dynamics are not explored). Future work could track changes over time, assess the impact of proposed validation tools, and broaden the scope to other ecosystems (e.g., npm, Maven) for comparative analysis.
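As an illustration of the metadata-mining step, the sketch below reads declared addresses from PyPI's public JSON endpoint (`https://pypi.org/pypi/<name>/json`), whose `info` object carries `author_email` and `maintainer_email` fields. Whether the authors used this endpoint is an assumption; the function names are hypothetical, and the GitHub side of the pipeline is omitted.

```python
import json
import urllib.request
from email.utils import getaddresses
from typing import Any, Dict, List


def fetch_pypi_metadata(name: str) -> Dict[str, Any]:
    """Download package metadata from the PyPI JSON API (requires network access)."""
    url = f"https://pypi.org/pypi/{name}/json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def declared_emails(metadata: Dict[str, Any]) -> List[str]:
    """Extract unique addresses from the author and maintainer email fields."""
    info = metadata.get("info") or {}
    fields = [
        value
        for value in (info.get("author_email"), info.get("maintainer_email"))
        if value
    ]
    emails: List[str] = []
    # Fields may hold several comma-separated "Name <address>" entries.
    for _display_name, address in getaddresses(fields):
        address = address.strip()
        if address and address not in emails:
            emails.append(address)
    return emails
```

Separating the download from the parsing means `declared_emails` can be applied to an archived metadata snapshot, which matches the static nature of the study's dataset.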
In conclusion, the research provides a comprehensive picture of email‑based contact information in PyPI, showing strong overall availability but also pinpointing concrete areas where the ecosystem can become more robust, transparent, and supportive of long‑term open‑source sustainability.