Generics in science communication: Misaligned interpretations across laypeople, scientists, and large language models
Scientists often use generics, that is, unquantified statements about whole categories of people or phenomena, when communicating research findings (e.g., “statins reduce cardiovascular events”). Large language models (LLMs), such as ChatGPT, frequently adopt the same style when summarizing scientific texts. However, generics can prompt overgeneralizations, especially when they are interpreted differently across audiences. In a study comparing laypeople, scientists, and two leading LLMs (ChatGPT-5 and DeepSeek), we found systematic differences in how generics are interpreted. Laypeople judged scientific generics as more generalizable and credible than scientists did, and LLMs rated them higher still. These mismatches pose significant risks for science communication: scientists may use generics and incorrectly assume laypeople share their interpretation, while LLMs may systematically overgeneralize scientific findings when summarizing research. Our findings underscore the need for greater attention to language choices in both human and LLM-mediated science communication.
💡 Research Summary
This paper investigates how “generic” statements—unquantified claims that refer to whole categories (e.g., “statins reduce cardiovascular events”)—are interpreted by three distinct audiences: laypeople, scientific experts, and two state‑of‑the‑art large language models (LLMs), ChatGPT‑5 and DeepSeek‑V3.1. The authors motivate the study by noting that scientists frequently employ generics when summarizing research, and that LLMs often mimic this style when generating scientific summaries. Because generics are inherently ambiguous (they can imply “some,” “many,” or “all”), different audiences may ascribe different scopes and levels of evidential strength, potentially leading to overgeneralization and miscommunication.
Methodology
A total of 18 one‑sentence research conclusions (nine from psychology, nine from biomedicine) were selected by disciplinary experts. Each conclusion was rendered in three linguistic frames: (1) a bare generic (e.g., “Statins reduce major adverse cardiovascular events”), (2) a past‑tense version (e.g., “Statins reduced major adverse cardiovascular events”), and (3) a hedged version (e.g., “The study suggests that statins might reduce major adverse cardiovascular events”). Participants rated each sentence on three 5‑point Likert scales: (a) generalizability (from “only to the people studied” to “all people”), (b) credibility (from “not at all” to “extremely”), and (c) impact (likelihood of further engagement such as reading, sharing, or using the claim).
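To make the design concrete, here is a minimal sketch in Python of how one of the 18 stimulus items could be represented. The frame texts come from the paper's statin example; the anchors for the impact scale are an assumption, since the summary describes that scale only as likelihood of further engagement.

```python
# Illustrative structure of one stimulus item. The three frame sentences are
# taken from the paper's statin example; the impact-scale anchors are assumed
# for illustration only.

STATIN_ITEM = {
    "domain": "biomedicine",
    "frames": {
        "bare_generic": "Statins reduce major adverse cardiovascular events.",
        "past_tense": "Statins reduced major adverse cardiovascular events.",
        "hedged": ("The study suggests that statins might reduce "
                   "major adverse cardiovascular events."),
    },
    # Three 5-point Likert scales, stored as (low anchor, high anchor).
    "scales": {
        "generalizability": ("only to the people studied", "all people"),
        "credibility": ("not at all", "extremely"),
        "impact": ("very unlikely to engage", "very likely to engage"),  # assumed anchors
    },
}
```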
Human participants were recruited via Prolific and academic networks, yielding 192 laypeople (≤ undergraduate education) and 240 experts (graduate or professional degrees) across psychology, biomedicine, other sciences, and humanities. Power analysis indicated a minimum of 33 participants per group; all groups met or exceeded this threshold. For the LLMs, 50 independent chat interactions per model were conducted through the public web interfaces (no API), with each interaction treated as a “pseudo‑participant.” The same prompts and randomization procedures used for humans were applied to the models.
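As a rough sketch of the pseudo-participant protocol, the loop below treats each fresh, memory-free chat session as one participant and randomizes item order and frame assignment. `ask_model` is a stand-in for the manual web-UI interaction (the study used no API), and the prompt wording and per-item frame assignment are assumptions, not the authors' exact procedure.

```python
import random

FRAMES = ["bare_generic", "past_tense", "hedged"]

def run_pseudo_participant(items, ask_model, seed):
    """Simulate one of the 50 independent chat sessions per model.

    `items` is a list of stimulus dicts like STATIN_ITEM above; `ask_model`
    is a placeholder callable for the manual web-UI interaction. Prompt
    wording and frame assignment here are illustrative assumptions.
    """
    rng = random.Random(seed)
    responses = []
    for idx in rng.sample(range(len(items)), k=len(items)):  # randomized item order
        frame = rng.choice(FRAMES)                           # assumed frame assignment
        sentence = items[idx]["frames"][frame]
        prompt = (
            "On 5-point scales, rate the generalizability, credibility, and "
            f'impact of this conclusion: "{sentence}"'
        )
        responses.append({"item": idx, "frame": frame, "reply": ask_model(prompt)})
    return responses
```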
Hypotheses
- H1 (RQ1): Bare generics will be judged more generalizable, credible, and impactful than past‑tense or hedged variants across all groups.
- H2 (RQ2 & RQ3): Laypeople will rate generics higher on all three dimensions than experts, and LLMs will rate them higher than both.
- H3 (RQ2, interaction): Experts will show less variability across linguistic frames than laypeople and LLMs.
Results
Mixed‑effects modeling confirmed all three hypotheses:
1. Across all participants, bare generics received the highest mean scores (generalizability ≈ 4.2, credibility ≈ 4.0, impact ≈ 3.9), compared with past‑tense (≈ 3.5, 3.2, 3.1) and hedged (≈ 3.6, 3.3, 3.2) statements (p < .001).
2. Laypeople gave the most generous ratings (generalizability ≈ 4.4, credibility ≈ 4.2, impact ≈ 4.0), experts were more conservative (≈ 3.8, 3.5, 3.3), and LLMs were the most inflated (ChatGPT‑5 ≈ 4.8, 4.7, 4.5; DeepSeek ≈ 4.6, 4.5, 4.3). LLMs thus not only mirror but amplify the overgeneralization tendency.
3. Variance analyses showed that experts’ scores varied minimally across frames (Δ ≈ 0.3), whereas laypeople and LLMs exhibited larger frame‑driven swings (Δ ≈ 0.8–1.2).
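The summary does not give the exact model specification. A minimal sketch of a comparable analysis in Python with statsmodels, assuming long-format data and a random intercept per participant, might look like this (fully crossed participant-and-item random effects would typically be fit with lme4 in R instead):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical analysis sketch, not the authors' code. Assumes long-format
# data with one row per participant x item and columns: participant, item,
# frame (bare_generic / past_tense / hedged), group (lay / expert / llm),
# plus the three ratings. The file name and layout are assumptions.
df = pd.read_csv("ratings.csv")

# Fixed effects: frame, group, and their interaction (bearing on H1-H3);
# random intercept per participant. 'hedged' serves as the reference frame.
model = smf.mixedlm(
    "generalizability ~ C(frame, Treatment('hedged')) * C(group)",
    data=df,
    groups=df["participant"],
)
print(model.fit().summary())
```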
Qualitative free‑text responses reinforced these patterns: lay participants and LLMs frequently invoked “applies to everyone” or “everyone can trust this,” while experts emphasized sample limits, statistical uncertainty, and the need for further validation.
Discussion
The findings illustrate a three‑way misalignment. Scientists, treating generic phrasing as concise shorthand, may unintentionally signal broader applicability than they intend. Laypeople, lacking experts’ habits of epistemic vigilance, tend to interpret generics as universal claims, which boosts perceived credibility and impact. LLMs, trained on massive corpora in which generics often co‑occur with positive framing, systematically ascribe even broader scope and greater trustworthiness to such statements. This creates a risk cascade: an LLM‑generated summary that replaces qualified language with a generic can mislead a large audience, reinforcing misconceptions at scale.
Practical implications include:
- Scientists should explicitly qualify generic statements (e.g., by adding “in the studied population” or “preliminary evidence”) when communicating to non‑expert audiences.
- Developers of scientific summarization tools should incorporate prompts or post‑processing checks that flag generic language and request clarification of scope (a toy prototype of such a check is sketched below).
- Users of LLMs for science learning should be educated about the models’ propensity to over‑generalize.
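The second implication (flagging generic language) could be prototyped with simple surface heuristics, as sketched below. The patterns are illustrative assumptions, not the authors' tooling; a production system would need real syntactic parsing to identify bare-plural subjects reliably.

```python
import re

# Toy prototype of a generic-language flagger for summarization pipelines.
# Purely illustrative: crude lexical heuristics standing in for proper parsing.
QUANTIFIED = re.compile(r"\b(some|many|most|several|a few)\b|\d+\s*%")
HEDGED = re.compile(
    r"\b(may|might|could|suggest(s|ed)?|appear(s|ed)?|preliminary|"
    r"in the studied population)\b"
)

def flag_generic(sentence: str) -> bool:
    """Flag sentences with neither a quantifier nor a hedge as candidate generics."""
    s = sentence.lower()
    return not (QUANTIFIED.search(s) or HEDGED.search(s))

assert flag_generic("Statins reduce major adverse cardiovascular events.")
assert not flag_generic(
    "The study suggests that statins might reduce major adverse cardiovascular events."
)
```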
Several limitations are noted: the study is confined to English‑language materials drawn from Western journals, limiting cross‑cultural generalizability; the LLM interaction protocol (fresh web‑UI sessions with no memory) may differ from real‑world usage, where conversation history influences responses; and the pseudo‑participant treatment does not capture downstream effects of LLM‑generated text on human belief formation.
Conclusion
Generics are a double‑edged sword in science communication. While they enable succinct statements, they also open the door to divergent interpretations that can inflate perceived applicability and trust. This divergence is amplified when large language models enter the communication pipeline, as they tend to over‑rate generic claims. To safeguard public understanding and maintain scientific credibility, both human communicators and AI systems must attend carefully to language choice, explicitly signaling uncertainty and population limits whenever possible.