Do LLMs produce texts with "human-like" lexical diversity?
The degree to which large language models (LLMs) produce writing that is truly human-like remains unclear despite the extensive empirical attention that this question has received. The present study addresses this question from the perspective of lexical diversity. Specifically, the study investigates patterns of lexical diversity in LLM-generated texts from four ChatGPT models (ChatGPT-3.5, ChatGPT-4, ChatGPT-o4 mini, and ChatGPT-4.5) in comparison with texts written by L1 and L2 English participants (n = 240) across four education levels. Six dimensions of lexical diversity were measured in each text: volume, abundance, variety-repetition, evenness, disparity, and dispersion. Results from one-way MANOVAs, one-way ANOVAs, and Support Vector Machines revealed that the ChatGPT-generated texts differed significantly from human-written texts for each variable, with ChatGPT-o4 mini and ChatGPT-4.5 differing the most. Of these two, ChatGPT-4.5 demonstrated higher levels of lexical diversity than the older models despite producing fewer tokens. The human writers’ lexical diversity did not differ across subgroups (i.e., education, language status). Altogether, the results indicate that ChatGPT models do not produce human-like texts in relation to lexical diversity, and the newer models produce less human-like text than older models. We discuss the implications of these results for language pedagogy and related applications.
💡 Research Summary
The paper investigates whether the output of four ChatGPT models (3.5, 4.0, o4 mini, and 4.5) can be considered “human‑like” with respect to lexical diversity, and how this compares to essays written by 240 human participants (120 L1 English speakers and 120 L2 speakers) across four education levels. The authors adopt a multidimensional view of lexical diversity, drawing on Jarvis’s framework and operationalizing six dimensions: volume (token count), abundance (type count after lemmatization), variety‑repetition (MATTR with a 50‑word sliding window), evenness (Shannon‑based evenness index), disparity (WordNet sense index), and dispersion (inverse measure of how often the same lemma recurs within a 20‑word window).
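To make the variety-repetition and evenness measures concrete, here is a minimal Python sketch, assuming lemmatized tokens are already available as a list of strings, that MATTR uses a 50-word sliding window, and that evenness is Shannon entropy normalized by the log of the type count. The paper's exact operationalizations may differ, so the function names and formulas below are illustrative only.

```python
import math
from collections import Counter

def mattr(tokens, window=50):
    """Moving-average type-token ratio over a sliding window
    (a rough stand-in for the variety-repetition dimension)."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)

def shannon_evenness(tokens):
    """Shannon entropy of the lemma distribution, normalized by the
    maximum possible entropy (log of the number of types)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    n_types = len(counts)
    return entropy / math.log(n_types) if n_types > 1 else 0.0

lemmas = "the breadth of knowledge matters more than the depth of knowledge".split()
print(mattr(lemmas), shannon_evenness(lemmas))
```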
Human participants were recruited via Prolific and asked to write a TOEFL‑style argumentative essay of at least 250 words in response to a prompt about breadth versus depth of knowledge. The sample was balanced for language status (L1/L2) and education (high school, bachelor’s, master’s, doctorate), yielding 30 participants per subgroup. Each ChatGPT model was prompted zero‑shot with the same prompt and asked to “provide an essay response in prose.” Thirty independent outputs were generated per model, for a total of 120 machine‑generated essays. All texts were lemmatized with TreeTagger and processed through custom Python scripts to extract the six lexical‑diversity indices.
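As an illustration of the preprocessing step, the sketch below assumes TreeTagger's standard tab-separated word/POS/lemma output has been saved to a file; the file name and the fallback to the surface form for unknown lemmas are assumptions, not details reported in the paper.

```python
def read_lemmas(treetagger_output_path):
    """Parse TreeTagger's tab-separated word/POS/lemma output and return
    the lemma sequence (falling back to the surface form for unknowns)."""
    lemmas = []
    with open(treetagger_output_path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            word, pos, lemma = parts
            lemmas.append(word.lower() if lemma == "<unknown>" else lemma.lower())
    return lemmas

lemmas = read_lemmas("essay_001.tagged.txt")  # hypothetical file name
print(len(lemmas), len(set(lemmas)))  # volume (tokens) and abundance (types)
```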
Statistical analysis comprised one‑way MANOVAs (and permutation‑based MANOVAs for robustness) to test overall differences between human and machine groups, followed by univariate ANOVAs for each index. The results showed highly significant differences (p < .001) across all six dimensions. The newer models (o4 mini and 4.5) deviated most strongly from human writing. Notably, ChatGPT‑4.5 produced fewer tokens overall but achieved higher scores on abundance, variety‑repetition, and disparity, indicating a tendency to generate shorter yet semantically richer texts. In contrast, the older models (3.5 and 4.0) generated longer texts with lower abundance and higher repetition, again differing from human patterns.
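A minimal sketch of such an analysis, assuming the six indices and a group label are stored in a pandas DataFrame with hypothetical file and column names, could use statsmodels for the one-way MANOVA and scipy for the univariate follow-ups (the permutation-based MANOVA is omitted here).

```python
import pandas as pd
from scipy import stats
from statsmodels.multivariate.manova import MANOVA

df = pd.read_csv("lexical_diversity_indices.csv")  # hypothetical file/columns
indices = ["volume", "abundance", "variety_repetition",
           "evenness", "disparity", "dispersion"]

# One-way MANOVA: do the groups (human vs. the four ChatGPT models)
# differ on the six indices considered jointly?
manova = MANOVA.from_formula(" + ".join(indices) + " ~ group", data=df)
print(manova.mv_test())

# Univariate follow-up ANOVAs, one per index
for index in indices:
    samples = [grp[index].values for _, grp in df.groupby("group")]
    f_stat, p_val = stats.f_oneway(*samples)
    print(f"{index}: F = {f_stat:.2f}, p = {p_val:.4f}")
```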
Within the human cohort, no significant effects of education level or L1/L2 status emerged for any lexical‑diversity measure, confirming that these dimensions are relatively stable across proficiency and academic attainment for the given task.
To assess discriminability, the authors trained a linear Support Vector Machine on the six‑dimensional feature vectors. Cross‑validation yielded a classification accuracy of 92%, with disparity and dispersion contributing the most to model separation. This confirms that the lexical‑diversity profile of LLM output is systematically distinct from that of human writers.
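In the same spirit, a linear SVM with cross-validation can be sketched with scikit-learn; the file name, column names, fold count, and scaling step below are illustrative assumptions rather than the authors' exact setup.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv("lexical_diversity_indices.csv")  # hypothetical file, as above
features = ["volume", "abundance", "variety_repetition",
            "evenness", "disparity", "dispersion"]
X = df[features].values
y = (df["group"] != "human").astype(int)  # 0 = human, 1 = ChatGPT (any model)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
print(f"mean cross-validated accuracy: {scores.mean():.3f}")

# Linear-SVM weights hint at which indices drive the separation
clf.fit(X, y)
for name, weight in zip(features, clf.named_steps["svc"].coef_[0]):
    print(f"{name}: {weight:+.3f}")
```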
The discussion interprets these findings in the context of language pedagogy. Because LLM‑generated texts differ in the balance of repetition, semantic redundancy, and lexical spread, educators cannot assume that higher lexical diversity automatically translates into higher writing quality. The authors caution that reliance on LLMs for writing assistance may obscure the natural patterns of repetition and cohesion that human writers employ, potentially affecting learners’ development of authentic writing strategies. They recommend that assessment frameworks incorporate multidimensional lexical‑diversity metrics alongside more holistic judgments of coherence and discourse quality when evaluating LLM‑assisted writing.
In sum, the study provides robust empirical evidence that current ChatGPT models do not produce “human‑like” lexical diversity; indeed, newer iterations appear less human‑like than older ones. This challenges the prevailing narrative of ever‑increasing human likeness in generative AI and underscores the need for nuanced, multidimensional evaluation criteria in both research and educational practice.