Can professional translators identify machine-generated text?
This study investigates whether professional translators can reliably identify short stories generated in Italian by artificial intelligence (AI) without prior specialized training. Sixty-nine translators took part in an in-person experiment in which they assessed three anonymized short stories: two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and justified their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship; unexpected calques, semantic loans, and syntactic transfer from English were also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.
💡 Research Summary
The paper investigates whether professional Italian translators can reliably detect short stories generated by artificial intelligence without any specialized training. Sixty‑nine translators participated in an in‑person session in Milan, where they were given three anonymized short stories of comparable length: one human‑written story by Alberto Moravia (marked with an oval) and two AI‑generated stories produced by ChatGPT‑4o (marked with a hexagon and a star). The AI texts were selected because they had previously been the most frequently misidentified as human in a postgraduate study; the hexagon story was prompted to emulate the style of Giorgio Faletti, while the star story had no explicit authorial style.
Participants rated each text on a 0‑10 scale (0 = human, 10 = machine, 5 = uncertain) and were asked to underline passages that informed their judgment, explain their reasoning, and note any recognition of the author or prior reading. Demographic data (age, gender, native language, education, years of translation experience) were also collected. No prior AI‑detection training was provided, following earlier work that showed limited benefit from such training.
A “successful identification” was defined as assigning a score ≤5 to the human text, scores ≥5 to both AI texts, and a minimum gap of four points between the lowest score and the two highest. Under this criterion, 11 participants (16.2% of the valid sample) succeeded, a result that is statistically unlikely to arise by chance (≈2.45% probability). Raising the required gap to five points still left 10 successful participants, with a chance probability of ≈5.47%. Conversely, nine participants inverted the pattern, rating the human text as AI‑generated and the AI texts as human, a near‑mirror error rate that obscured the mean scores in the aggregated data.
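The success criterion above can be expressed as a small predicate. The sketch below is an illustrative reading of the rule as stated (human text scored ≤5, both AI texts scored ≥5, and the human score at least four points below each AI score); the function name and the sample ratings are invented for demonstration.

```python
def successful_identification(human_score, ai_scores, gap=4):
    """Check the paper's criterion for a 'successful identification':
    the human text scored <= 5, both AI texts scored >= 5, and the
    lowest (human) score at least `gap` points below each AI score."""
    return (
        human_score <= 5
        and all(s >= 5 for s in ai_scores)
        and all(s - human_score >= gap for s in ai_scores)
    )

# Hypothetical ratings: human text rated 2, AI texts rated 7 and 9.
print(successful_identification(2, [7, 9]))  # True
print(successful_identification(4, [6, 8]))  # False: gap to the 6 is only 2
```

Tightening `gap` to 5 reproduces the stricter variant of the criterion mentioned above.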
Qualitative analysis of the justifications revealed that the most reliable cues were “low burstiness” (i.e., limited variation in sentence length and structure) and “narrative contradiction” (logical inconsistencies in the storyline). These align with prior observations that synthetic texts often exhibit repetitive, bland prose and superficial plot development. Participants also noted English‑language transfer phenomena: unnecessary possessive adjectives, overuse of present participles, punctuation placement around quotation marks, semantic loans (e.g., “speculazioni”, “casuale”, “sfida”), and deeper discourse‑level calques that sounded natural in English but awkward in Italian. Such transfer effects were especially prominent in the hexagon text.
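“Burstiness” in the sense used here, variation in sentence length and structure, can be approximated in a few lines of standard-library Python. The coefficient-of-variation proxy below is an assumption for illustration, not the measure (if any) applied in the study, and the sample texts are invented.

```python
import re
import statistics

def burstiness(text):
    """Coefficient of variation of sentence lengths (in words).
    Low values indicate a uniform, 'flat' sentence rhythm; this is
    an illustrative proxy, not the study's own metric."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

flat = "The cat sat down. The dog ran off. The bird flew away."
varied = ("Stop. The old house at the end of the lane had been empty "
          "for thirty years, its windows dark. Nobody went near it.")
print(burstiness(flat) < burstiness(varied))  # True under this proxy
```

A production detector would also look at structural variation (clause types, openings), but sentence-length dispersion alone already captures the "repetitive, bland prose" cue the participants reported.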
Statistical tests (Fisher’s exact test for 2 × 2 tables, chi‑square for larger tables) showed no significant relationship between demographic variables (age, gender, education level, native language, years of experience) and success in identification. This suggests that individual linguistic intuition or analytical strategy, rather than professional background per se, drives performance.
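For the 2 × 2 case, Fisher's exact test is simple enough to sketch with the standard library alone. The contingency counts below are invented for illustration and are not the study's data; in practice one would typically reach for `scipy.stats.fisher_exact` and `scipy.stats.chi2_contingency` instead.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]], computed from the hypergeometric distribution."""
    row1, row2 = a + b, c + d
    col1, n = a + c, a + b + c + d

    def hypergeom(k):
        # Probability of a table with k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = hypergeom(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    probs = [hypergeom(k) for k in range(lo, hi + 1)]
    # Sum over all tables at least as extreme (probability <= observed).
    return sum(p for p in probs if p <= p_obs + 1e-12)

# Invented counts: rows = two experience bands, columns = successful
# vs unsuccessful identification (not the paper's data).
print(round(fisher_exact_2x2(6, 25, 5, 32), 3))
```

A large p-value here would mirror the paper's finding: no detectable association between the demographic split and identification success.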
All three texts were also processed with the Plagramme AI detector. The human text received a low AI‑likelihood score (17%), whereas the AI texts scored 83% and 94%, confirming objective differences detectable by automated tools. However, the modest alignment between detector scores and translator judgments indicates that current automatic detectors cannot fully substitute for expert human assessment.
The authors discuss implications for Synthetic‑Text Editing (STE). If professional translators cannot consistently flag AI‑generated prose, it may mean that Italian synthetic texts have reached a level of human‑like quality that requires minimal post‑editing. Conversely, the identified linguistic markers (low burstiness, narrative contradictions, English‑style transfer) could form the basis of targeted training materials to improve detection and editing skills.
In summary, a minority of professional translators can statistically distinguish AI‑generated Italian short stories, relying chiefly on structural and narrative anomalies and on subtle cross‑linguistic transfer cues. The majority, however, perform at near‑chance levels, often misled by superficial human‑like qualities such as grammatical correctness and emotional tone. The study underscores the need for refined detection criteria, better training, and complementary automated tools to support translators in the emerging landscape of AI‑assisted literary production.