Remembering Unequally: Global and Disciplinary Bias in LLM Reconstruction of Scholarly Coauthor Lists


Ongoing breakthroughs in large language models (LLMs) are reshaping scholarly search and discovery interfaces. While these systems offer new possibilities for navigating scientific knowledge, they also raise concerns about fairness and representational bias rooted in the models’ memorized training data. As LLMs are increasingly used to answer queries about researchers and research communities, their ability to accurately reconstruct scholarly coauthor lists becomes an important but underexamined issue. In this study, we investigate how memorization in LLMs affects the reconstruction of coauthor lists and whether this process reflects existing inequalities across academic disciplines and world regions. We evaluate three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, by comparing their generated coauthor lists against bibliographic reference data. Our analysis reveals a systematic advantage for highly cited researchers, indicating that LLM memorization disproportionately favors already visible scholars. However, this pattern is not uniform: certain disciplines, such as Clinical Medicine, and some regions, including parts of Africa, exhibit more balanced reconstruction outcomes. These findings highlight both the risks and limitations of relying on LLM-generated relational knowledge in scholarly discovery contexts and emphasize the need for careful auditing of memorization-driven biases in LLM-based systems.


💡 Research Summary

This paper investigates how large language models (LLMs) reproduce scholarly co‑author lists and whether this process amplifies existing disciplinary and geographic inequities. The authors focus on three widely used LLMs—DeepSeek R1 (671 B total parameters), Llama 4 Scout (17 B active parameters), and Mixtral 8x7B (a mixture of eight 7 B experts, roughly 13 B active parameters per token)—and evaluate them against two comprehensive bibliographic baselines, OpenAlex and Google Scholar.

To construct a balanced test set, the researchers first selected ten broad scientific fields (as defined by the Stanford/Elsevier Top 2% Scientists List 2024) and eight world regions (North America, South/Central America, Europe, North Africa, Sub‑Saharan Africa, Middle East, East/Southeast Asia, and Oceania). Within each field‑region cell they randomly sampled ten “high‑citation” and ten “low‑citation” seed authors, requiring a minimum of 100 Google Scholar citations to increase the likelihood that the authors appear in the LLM training corpus. After filtering out four authors lacking OpenAlex records, the final sample comprised 1,596 scholars.
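The sampling design above can be sketched as follows. This is a minimal illustration, not the authors' code: the field names, the tier-split rule (median citation count within a cell), and the function names are assumptions; only the cell structure (10 fields × 8 regions × 2 tiers × 10 authors = 1,600, minus 4 filtered = 1,596) comes from the summary.

```python
import random

REGIONS = ["North America", "South/Central America", "Europe", "North Africa",
           "Sub-Saharan Africa", "Middle East", "East/Southeast Asia", "Oceania"]

def sample_seed_authors(pool, n_per_tier=10, min_citations=100, seed=0):
    """For one field-region cell, draw n_per_tier 'high-citation' and
    n_per_tier 'low-citation' seed authors.

    `pool` is a list of (author_name, google_scholar_citations) tuples.
    The split into tiers at the within-cell median is an assumption;
    the paper only states that both tiers were sampled.
    """
    rng = random.Random(seed)
    # Minimum of 100 citations, to raise the chance the author is memorized.
    eligible = sorted((a for a in pool if a[1] >= min_citations),
                      key=lambda a: a[1])
    half = len(eligible) // 2
    low_tier, high_tier = eligible[:half], eligible[half:]
    return (rng.sample(low_tier, min(n_per_tier, len(low_tier))),
            rng.sample(high_tier, min(n_per_tier, len(high_tier))))

# Illustrative cell: 40 synthetic authors with 100+ citations each.
pool = [(f"author_{i}", 100 + 50 * i) for i in range(40)]
low, high = sample_seed_authors(pool)
print(len(low), len(high))  # 10 low-citation and 10 high-citation seeds
```

Repeating this draw over all 10 × 8 field-region cells yields 1,600 seed authors; dropping the four without OpenAlex records gives the final 1,596.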

The core methodological contribution is the “Discernable Name Extraction” (DNE) metric, adapted from prior work on discoverable extraction. For each seed author s, DNE measures the proportion of true co‑author last names (from the baseline) that are matched exactly (a Levenshtein distance threshold of zero) against any name generated by an LLM when prompted with “Give me the co‑authors of
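A minimal sketch of the DNE computation as described: the fraction of baseline co-author last names that match some LLM-generated name at Levenshtein distance zero, i.e. exactly. Function names, the case-insensitive comparison, and taking the final token of each generated name as the last name are assumptions for illustration, not details from the paper.

```python
def dne(true_last_names, generated_names):
    """Discernable Name Extraction for one seed author.

    true_last_names: co-author last names from the bibliographic baseline.
    generated_names: full names produced by the LLM for the same prompt.
    Returns the proportion of baseline last names recovered exactly
    (Levenshtein distance 0; compared case-insensitively here, an assumption).
    """
    truth = {n.strip().lower() for n in true_last_names if n.strip()}
    if not truth:
        return 0.0
    # Assume the last whitespace-separated token of a generated name is the last name.
    generated_last = {n.split()[-1].lower() for n in generated_names if n.strip()}
    return len(truth & generated_last) / len(truth)

score = dne(["Nguyen", "Okafor", "Silva"],
            ["Ana Silva", "T. Nguyen", "John Doe"])
print(score)  # 2 of 3 baseline last names recovered -> 0.666...
```

Averaging this score over seed authors within a field-region cell then gives the per-cell reconstruction quality the paper compares across disciplines and regions.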

