Can ChatGPT Really Understand Modern Chinese Poetry?

ChatGPT has demonstrated remarkable capabilities on both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundament…

Authors: Shanshan Wang, Derek F. Wong, Jingming Yao, Lidia S. Chao

Shanshan Wang 1, Derek F. Wong 1*, Jingming Yao 2, Lidia S. Chao 1
1 NLP2CT Lab, Department of Computer and Information Science, University of Macau
nlp2ct.shanshan@gmail.com, {derekfw,lidiasc}@um.edu.mo
2 Department of Portuguese, Faculty of Arts and Humanities, University of Macau
jmyao@um.edu.mo

Abstract

ChatGPT has demonstrated remarkable capabilities in both poetry generation and translation, yet its ability to truly understand poetry remains unexplored. Previous poetry-related work merely analyzed experimental outcomes without addressing fundamental issues of comprehension. This paper introduces a comprehensive framework for evaluating ChatGPT’s understanding of modern poetry. We collaborated with professional poets to evaluate ChatGPT’s interpretation of modern Chinese poems by different poets along multiple dimensions. Evaluation results show that ChatGPT’s interpretations align with the original poets’ intents in over 73% of the cases. However, its understanding in certain dimensions, particularly in capturing poeticity, proved to be less satisfactory. These findings highlight the effectiveness and necessity of our proposed framework. This study not only evaluates ChatGPT’s ability to understand modern poetry but also establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.

1 Introduction

Large language models (LLMs) have been widely used for various tasks (Zhang et al., 2024; Shen and Ying, 2025; Chatterjee et al., 2025; Lan et al., 2025a; Ye et al., 2025; Lin et al., 2025). Previous studies have explored the application of ChatGPT to poetry-related tasks, primarily focusing on poetry generation (Antar, 2023; Deng et al., 2024; Hutson and Schnellmann, 2023), translation (Virvou et al., 2023; Wang et al., 2024), and detection (Wang et al., 2025).
However, without exception, these studies have drawn conclusions solely based on ChatGPT’s outputs. In other words, their scope has been limited to the evaluation and analysis of experimental results. For example, Virvou et al. (2023) conducted an evaluative study examining the interpretation of poetry generated by ChatGPT. They suggest that ChatGPT has the potential to delve into the depth of poetic content. However, the fundamental question remains: Does ChatGPT truly understand poetry? Poetry, particularly modern poetry, has long been difficult to read and interpret (Zhang, 2022). Despite its significance, the issue of understanding poetry has not been adequately addressed in prior research.

* Corresponding author.

The ability to understand poetry (Tate, 1940; Pierce, 2003; McGrath, 2018) is a prerequisite for all poetry-related tasks. Without this foundation, advancements in tasks like poetry generation (Manurung, 2004; Zhang and Lapata, 2014; Yi et al., 2018; Van de Cruys, 2020) and poetry translation (Genzel et al., 2010; Ghazvininejad et al., 2018; Chakrabarty et al., 2021; Song et al., 2023) remain constrained. Therefore, to effectively advance these tasks, it is essential to first address and emphasize the issue of ChatGPT’s comprehension of poetry.

In this paper, we aim to explore whether ChatGPT is capable of understanding modern poetry. Specifically, we propose a comprehensive framework for evaluating ChatGPT’s understanding of modern Chinese poetry. Our approach is as follows. First, in collaboration with professional poets, we identified five key dimensions crucial for understanding poetry: content, expression methods (Yu, 2001; Wang, 1998), thought & emotion (Xi, 2019; Mo, 2009; Wang, 2006, 2019), modernity (Cheng, 2005; Long, 2005; Zang, 1998) and poeticity (Wang et al., 2024).
Then, we developed and refined prompts to enable ChatGPT to interpret modern poetry more accurately. Reading poetry is inherently subjective, as interpretations often vary significantly among readers (Brodsky, 1999; Jiang, 2012; Zhang, 2022). However, the poet’s original intent provides a unique benchmark for evaluation. To assess the validity of ChatGPT’s interpretations, we invited the original poets to evaluate its outputs across the identified dimensions. Our evaluation results reveal that ChatGPT’s interpretations align with the poets’ original intent in over 73% of the cases. However, its performance in certain dimensions, particularly in capturing poeticity, is less satisfactory.

[Figure 1: The framework for evaluating ChatGPT’s understanding of modern poetry. Panels: 1) Identify Dimensions for Understanding Modern Poetry (Content; Expression Methods: Language, Imagery, Rhetorical Techniques, Rhythm, Defamiliarization; Thought & Emotion; Modernity; Poeticity); 2) Design Prompt; 3) ChatGPT Interpret Poetry; 4) Original Poets’ Evaluation.]
The primary contributions of our work are as follows: 1) We are the first to systematically evaluate ChatGPT’s ability to understand poetry. 2) We introduce a comprehensive framework to assess ChatGPT’s understanding of modern Chinese poetry. 3) We define key dimensions necessary for understanding poetry and establish a systematic evaluation methodology. 4) We involved professional poets to evaluate ChatGPT’s interpretations, ensuring a reliable and nuanced assessment.

2 Evaluating ChatGPT’s Understanding of Modern Poetry

To investigate whether ChatGPT can truly understand modern poetry, we propose a comprehensive framework called ECUMP (Evaluation of ChatGPT’s Understanding of Modern Poetry). The framework, as illustrated in Figure 1, consists of four components:

Identify Dimensions for Understanding Modern Poetry. Referring to prior research on modern poetry comprehension (Chang, 2012) and informed by the recommendations of eight professional modern poets, we identified five critical dimensions for understanding modern poetry: 1) Content (Cont), which refers to what the poem describes; 2) Expression Methods, which refer to the techniques employed in writing the poem. These include linguistic features (Language, Lang), the use of visual and sensory images (Imagery, Imag), rhetorical strategies such as metaphor, simile, or symbolism (Rhetorical Techniques, Rhet), the poem’s internal rhythm reflected in external forms like pauses, line breaks, and stanzas (Rhythm, Rhyt), and innovation through challenging conventional perceptions (Defamiliarization, Defa); 3) Thought & Emotion (Thou), which captures the ideas and emotions conveyed by the poem; 4) Modernity (Mode), referring to the contemporary consciousness or modern sensibility embedded in the poem; and 5) Poeticity (Poet), a subjective quality encapsulating the poem’s poetic essence.
There may be multiple sentences in a poem with varying degrees of poetic quality, but only one sentence is considered the most poetic. To measure poeticity, we identify the most poetic sentence in the poem as a benchmark for evaluation.

Design Prompt. Inspired by previous work on prompt design (Jiao et al., 2023; Gao et al., 2023) and modern poetry comprehension (Zhang, 2021), we designed and optimized a prompt to enable ChatGPT to accurately interpret modern poetry. The final prompt is presented in Table 1, with additional details provided in Appendix A.2.

ChatGPT Interpret Poetry. Using the designed prompt, ChatGPT generates interpretations of the poems along the identified dimensions, facilitating a systematic evaluation of its comprehension ability.

Original Poets’ Evaluation. Automatic evaluation methods are unsuitable for poetry-related tasks due to the unique characteristics of modern poetry (Refaee, 2023; Novikova et al., 2017; Wang et al., 2024). Static evaluation methods may be unreliable under distribution shifts (Shen et al., 2026). Instead, we rely on human evaluation conducted by the original poets themselves. While interpretations of a poem often vary among readers (Jiang, 2012), the author’s original intent provides a definitive benchmark. Therefore, the primary evaluation of ChatGPT’s interpretations is based on assessments by the original poets. Details of the evaluation process can be found in Section 4.

Table 1: The prompt we designed for ChatGPT to understand modern poetry.

Please understand the following modern poem titled T_i from the five dimensions of content, expression methods, thought and emotion, modernity, and the most poetic sentence, and finally give your interpretation of the poem. In addition to content, please provide examples and specific explanations for each dimension. Please note that this is a complete modern poem. The specific definitions of each dimension are as follows:
Content: Summarize what the poem is about.
Expression Methods: Understand how this poem is written from five aspects: language, imagery, rhetorical techniques, rhythm, and defamiliarization. The five specific aspects are as follows:
1) What are the characteristics of the language of this poem? Is it innovative?
2) What imageries are used in this poem and what are their functions?
3) What rhetorical techniques is used in this poem? What is the function?
4) How is the rhythm of this poem? Such as pauses, lines, and stanzas.
5) Does this poem contain any expression of defamiliarization? If there is any unfamiliar expression, please point it out; if not, you can skip it.
Thought & Emotion: What thought and emotion does this poem express?
Modernity: Is the poem modern or modern-minded?
Poeticity: Choose the most poetic sentence in this poem.
Poem: {{P_i}} (Directly used the original Chinese poem)
Interpretation of this poem: {{I_i}}

3 Experiment

Following the dataset scale established in prior work (Wang et al., 2024; Virvou et al., 2023; Hutson and Schnellmann, 2023; Antar, 2023), we selected a collection of 48 modern Chinese poems for our experiments. Specifically, we collaborated with six professional modern Chinese poets, each of whom contributed eight poems they had recently written. This process resulted in a high-quality dataset of 48 modern Chinese poems, comprising 147 stanzas and 750 lines in total. The dataset is divided into two categories: 40 poems from five poets that represent common types of modern poetry (denoted as Com-Poetry, and labeled Com-Poems in the tables) and 8 poems from one poet that represent special types of modern poetry (denoted as Spe-Poetry, and labeled Spe-Poems in the tables). The Spe-Poetry category is characterized by its incorporation of content from ancient Chinese literature, making extensive use of classical imagery (Yeh, 1990). These poems may feature protagonists who are historical figures.
Despite these characteristics, both the external forms and the underlying thoughts of Spe-Poetry align with the conventions of modern poetry, and they are therefore classified as part of this genre.

Using the designed prompts, GPT-4 (gpt-4-0125) (Achiam et al., 2023) was tasked with interpreting these 48 poems across multiple dimensions, as specified in our framework. The temperature of the model is set to 0.5, and top_p is set to 1.0. Examples of experimental output are provided in Appendix A.3.

4 Evaluation

4.1 Original Poets’ Evaluation

We invited the six original poets to evaluate GPT-4’s interpretations of their own poems across five dimensions: content, expression methods, thought & emotion, modernity, and poeticity. The definitions of these dimensions are consistent with those introduced in Section 2.

Evaluation Rules. The evaluation methodology is identical for both Com-Poetry and Spe-Poetry. Each original poet evaluated only GPT-4’s interpretations of their own poems. For poeticity, scores were restricted to 0, 50, or 100 points. A score of 0 indicates that “the most poetic sentence” selected by GPT-4 lacks poeticity entirely, while 100 signifies that the sentence is also recognized by the original poet as the most poetic. A score of 50 indicates that the sentence possesses poeticity but is not considered the most poetic by the original poet.

For the other four dimensions (content, expression methods, thought & emotion, and modernity), scores ranged from 0 to 100. A score of 0 indicates that GPT-4’s interpretation of the poem is completely inconsistent with the poet’s intentions, while a score of 100 signifies complete alignment with the original poet’s thoughts. Higher scores represent a greater overlap between GPT-4’s understanding and the original poet’s intent.
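The scoring rules above can be sketched as a small validation routine. This is a minimal illustration rather than code from the paper: the function name, the abbreviated dimension keys, and the example score sheet are all hypothetical, and it assumes the five expression-method aspects are scored separately, as the eight columns of Table 2 suggest.

```python
# Hypothetical helper: the paper states the scoring rules but publishes no code.
# Dimension keys follow the abbreviations used in the tables.
POETICITY_SCORES = {0, 50, 100}
GRADED_DIMENSIONS = ("Cont", "Lang", "Imag", "Rhet", "Rhyt", "Defa", "Thou", "Mode")


def validate_scores(scores):
    """Return a list of rule violations for one poem's score sheet."""
    problems = []
    # Graded dimensions accept any value from 0 to 100.
    for dim in GRADED_DIMENSIONS:
        value = scores.get(dim)
        if value is None:
            problems.append(f"{dim}: missing")
        elif not 0 <= value <= 100:
            problems.append(f"{dim}: {value} is outside 0-100")
    # Poeticity is restricted to exactly three values.
    poet = scores.get("Poet")
    if poet not in POETICITY_SCORES:
        problems.append(f"Poet: {poet} is not one of 0, 50, 100")
    return problems


# Illustrative score sheet (invented values, not data from the paper).
example = {"Cont": 85, "Lang": 80, "Imag": 90, "Rhet": 75,
           "Rhyt": 70, "Defa": 80, "Thou": 85, "Mode": 80, "Poet": 50}
print(validate_scores(example))  # []
```

A check of this kind would flag, for instance, a poeticity score of 70, which the three-level rule does not allow.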
Evaluation Results. Table 2 presents the evaluation results provided by the six original poets on GPT-4’s interpretation of 48 poems across all dimensions except poeticity. Table 3 summarizes the poets’ evaluations of GPT-4’s understanding of poeticity. Figure 2 illustrates the distribution of scores for GPT-4’s interpretation of 48 poems, evaluated by six poets across different dimensions, with a focus on the range of scores.

Table 2: Evaluation results of different types of poems by the original poets.

Type        Cont   Lang   Imag   Rhet   Rhyt   Defa   Thou   Mode
Com-Poems   80.33  79.05  81.18  77.83  76.15  79.40  78.80  79.88
Spe-Poems   77.50  73.75  81.25  88.75  82.50  77.50  78.75  82.50

Table 3: Evaluation results of ChatGPT’s understanding of poeticity. “N-100”, “N-0” and “N-50” respectively represent the number of occurrences of 100, 0, and 50.

Type        N-100  N-0  N-50
Com-Poems   17     2    21
Spe-Poems   3      1    4

[Figure 2: The distribution of evaluation scores for GPT-4’s interpretation of 48 poems. x-axis: Range of Scores, in bins [0, 30], (30, 40], (40, 50], (50, 60], (60, 70], (70, 80], (80, 90], (90, 100]; y-axis: Number of Poems; series: Cont, Lang, Imag, Rhet, Rhyt, Defa, Thou, Mode, Poet.]

4.2 LLMs’ Evaluation

Traditional automatic evaluation methods are not well-suited for poetry-related tasks due to the unique characteristics of poetry (Novikova et al., 2017; Refaee, 2023; Wang et al., 2024). Fortunately, LLMs have been used as evaluation tools due to their obvious potential and advantages (Chiang and Lee, 2023; Wang et al., 2023; Liu et al., 2023), and have made effective progress in the evaluation of multiple tasks such as open-domain conversations (Lin and Chen, 2023), text summarization (Chu et al., 2024), dialogue response generation, and open-ended question answering (Chan et al., 2023).
Therefore, we use LLMs as evaluation tools to evaluate the interpretations generated by GPT-4, applying the same evaluation rules as human evaluators. Table 4 presents the prompt we designed for LLMs as the evaluators to evaluate GPT-4’s understanding of poems. Table 5 and Table 6 present the evaluation results of LLMs as the evaluators on GPT-4’s understanding of Com-Poems and Spe-Poems, respectively.

Table 4: The prompt we designed for the LLMs as the evaluators to evaluate GPT-4’s understanding of poems.

Please score the following “readers’ understanding of the poem” from the eight dimensions of content, language, imagery, rhetorical techniques, rhythm, defamiliarization, thought and emotion, and modernity, and give detailed reasons for your score. Please note that you only need to score and give reasons for the “Readers’ Understanding of this poem”, not the poem itself.
Scoring Dimensions: Same as human evaluation.
Scoring Rules: Each dimension ranges from 0 to 100. A score of 0 means that you think the reader’s understanding of a certain dimension is completely wrong, and 100 indicates that you think the reader’s understanding of a certain dimension is completely correct; the more correct the reader’s understanding of a certain dimension is, the higher the score you will give.
Poem: {{P_i}}
Readers’ Understanding of this Poem: {{U_i}}

5 Analysis

Human Evaluation. Table 2 and Figure 2 show that, except for poeticity, GPT-4 demonstrates a strong understanding of poetry across all dimensions, achieving at least 73% alignment with the original poets’ intentions. The scores are concentrated between 60 and 90, indicating reasonably consistent performance. However, GPT-4’s understanding of poeticity is unsatisfactory. When tasked with identifying the most poetic sentences, GPT-4’s responses were incorrect in more than half of the cases (28 of 48 sentences; Table 3).
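The poeticity figures quoted here follow directly from Table 3, and the arithmetic can be checked with a few lines of Python. The counts are copied from the table; the variable names are illustrative only.

```python
# Counts copied from Table 3: how often each poeticity score was assigned.
table3 = {
    "Com-Poems": {100: 17, 0: 2, 50: 21},
    "Spe-Poems": {100: 3, 0: 1, 50: 4},
}

total = sum(sum(counts.values()) for counts in table3.values())
exact = sum(counts[100] for counts in table3.values())       # poet agreed: most poetic
no_poeticity = sum(counts[0] for counts in table3.values())  # poet saw no poeticity
not_exact = total - exact                                    # scores of 0 or 50

print(total, exact, not_exact, no_poeticity)   # 48 20 28 3
print(round(100 * not_exact / total, 1))       # 58.3
```

GPT-4 therefore matched the poet’s own choice of the most poetic sentence for 20 of the 48 poems, and missed it for the remaining 28 (about 58%).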
Notably, GPT-4 selected three sentences that were entirely devoid of poeticity, highlighting significant limitations in its ability to discern poetic essence.

For Com-Poems, GPT-4 demonstrates the strongest understanding of imagery, achieving an average score of 81.18 (Table 2). This result aligns with expectations, as imagery is a salient and explicit feature of poetry. Additionally, GPT-4 is capable of accurately grasping much of the poem’s content, with a score of 80.33. Notably, GPT-4 performs better with poems that have clear themes or narrative structures, providing more comprehensive and accurate summaries with fewer omissions.

Table 5: The evaluation results of LLMs as the evaluators on GPT-4’s understanding of Com-Poems.

Evaluator      Cont   Lang   Imag   Rhet   Rhyt   Defa   Thou   Mode
o1-preview     94.70  92.03  96.33  91.40  90.35  93.90  95.23  96.75
GPT-4o         92.38  89.38  94.00  88.50  88.00  87.38  91.75  93.25
Qwen2.5-Plus   94.50  89.50  94.50  90.00  85.13  91.25  90.25  94.88
Qwen2.5-Max    95.00  89.13  94.13  93.00  90.00  86.13  95.00  95.00

Table 6: The evaluation results of LLMs as the evaluators on GPT-4’s understanding of Spe-Poems.

Evaluator      Cont   Lang   Imag   Rhet   Rhyt   Defa   Thou   Mode
o1-preview     95.00  92.50  96.25  88.75  93.13  83.13  96.25  94.38
GPT-4o         91.88  91.25  94.38  87.50  93.75  81.88  92.50  93.75
Qwen2.5-Plus   95.00  90.00  95.00  90.00  85.00  91.25  90.00  95.00
Qwen2.5-Max    95.00  90.00  95.00  90.00  90.00  85.00  95.00  95.00

For Spe-Poems, GPT-4’s performance diverges significantly from its results with Com-Poems. It demonstrates the strongest understanding of rhetorical techniques (88.75), rhythm (82.50), and modernity (82.50). For example, GPT-4 is generally able to identify pauses in Spe-Poems and analyze them in ways that align closely with the original poets’ intentions. However, GPT-4 struggles with language in Spe-Poems, scoring the lowest (73.75) in this dimension.
While it can recognize that some of the language originates from ancient Chinese literature, the inherent complexity of Spe-Poems poses challenges to GPT-4’s ability to fully comprehend the nuances of the language. This aligns with previous findings (Lan et al., 2025b; Lyu et al., 2025; Qi et al., 2025) that the cultural context behind a text influences the performance of LLMs.

Several key insights emerge from the analysis. First, in addition to identifying external imagery, GPT-4 can detect the implicit modern consciousness embedded within the poems. Second, GPT-4 performs better when the language expression is straightforward, with greater coherence and stronger correlations between words leading to more accurate comprehension. Third, GPT-4 can reliably identify similes in sentences with explicit cues such as “like,” “as,” and “is.” However, its ability to interpret rhetorical techniques diminishes significantly, or even fails entirely, when such explicit cues are absent. More detailed analyses are provided in Appendix A.4.

LLMs Evaluation. Disappointingly, the results from LLMs deviated significantly from the evaluations provided by professional poets, indicating that LLMs are not yet reliable for evaluating the understanding of poetry. For Com-Poems, although GPT-4o performed slightly closer to human evaluations compared to other models, all LLMs’ results were still significantly different from those of the professional poets. This finding highlights the necessity and reliability of human evaluation in tasks requiring nuanced and culturally embedded understanding, such as poetry.

Our findings further highlight the reliability and necessity of human evaluation in the poetry domain. Poetry is inherently nuanced, with elements such as imagery, rhythm, and thought requiring a deep cultural and contextual understanding that current AI models cannot fully replicate.
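One way to make the gap between LLM evaluators and the poets concrete is to compute, for Com-Poems, the mean absolute difference between each evaluator’s scores (Table 5) and the poets’ scores (Table 2). The paper does not state which distance measure underlies its comparison, so the metric below is an assumption chosen for illustration; the numbers are copied from the two tables.

```python
# Per-dimension averages for Com-Poems, in the order
# Cont, Lang, Imag, Rhet, Rhyt, Defa, Thou, Mode.
human_com = [80.33, 79.05, 81.18, 77.83, 76.15, 79.40, 78.80, 79.88]  # Table 2
llm_com = {                                                            # Table 5
    "o1-preview":   [94.70, 92.03, 96.33, 91.40, 90.35, 93.90, 95.23, 96.75],
    "GPT-4o":       [92.38, 89.38, 94.00, 88.50, 88.00, 87.38, 91.75, 93.25],
    "Qwen2.5-Plus": [94.50, 89.50, 94.50, 90.00, 85.13, 91.25, 90.25, 94.88],
    "Qwen2.5-Max":  [95.00, 89.13, 94.13, 93.00, 90.00, 86.13, 95.00, 95.00],
}


def mean_abs_gap(llm_scores, human_scores):
    """Mean absolute difference across dimensions (an assumed metric)."""
    diffs = [abs(a - b) for a, b in zip(llm_scores, human_scores)]
    return sum(diffs) / len(diffs)


gaps = {name: mean_abs_gap(scores, human_com) for name, scores in llm_com.items()}
closest = min(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f}")
print("closest to the poets:", closest)  # closest to the poets: GPT-4o
```

Under this measure every evaluator rates GPT-4’s interpretations more than 11 points higher on average than the poets did, and GPT-4o has the smallest gap, which is consistent with the observation above.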
As such, human evaluation remains indispensable for accurately assessing poetic comprehension in our study.

6 Conclusion

Understanding poetry is a fundamental prerequisite for all poetry-related tasks. In this paper, we propose a comprehensive framework to evaluate ChatGPT’s ability to understand modern poetry and examine the extent of its comprehension. Specifically, we identify several critical dimensions essential for understanding poetry, informed by professional expertise in the field. Using these dimensions, ChatGPT was tasked with interpreting the poems provided by various poets, based on carefully designed prompts. To assess the quality of these interpretations, we invited the original poets to evaluate GPT-4’s outputs across the identified dimensions. The evaluation results show that over 73% of ChatGPT’s interpretations align with the original poets’ intentions and thoughts. However, significant room for improvement remains in specific dimensions, particularly poeticity. These findings highlight the effectiveness and necessity of our proposed framework. This study establishes a solid foundation for future research on LLMs and their application to poetry-related tasks.

Limitations

The data we utilized consists of recently written poems by professional poets, so there has been no prior human analysis of these poems to date. Consequently, in our study, the evaluation of ChatGPT’s poetic comprehension was conducted manually by the original poets themselves, which incurred a high cost. Fortunately, our work substantiates the effectiveness of ChatGPT as a tool to assist poetry comprehension, which lays a solid foundation for future research on LLMs related to poetry.
Ethical Considerations

Our work has validated the effectiveness of ChatGPT as a tool for assisting in the understanding of poetry, as it is capable of providing human readers with interpretations that largely align with the original poets’ intentions. However, this could potentially lead to an over-reliance on ChatGPT’s interpretations by human readers.

Acknowledgements

This work was supported in part by the Science and Technology Development Fund of Macau SAR (Grant No. FDCT/0007/2024/AKP), the UM and UMDF (Grant Nos. MYRG-GRG2024-00165-FST-UMDF, MYRG-GRG2025-00236-FST, EF2024-00185-FST), the Tencent AI Lab (EF2023-00151-FST), the Stanley Ho Medical Development Foundation (Grant No. SHMDF-AI/2026/001), and the National Natural Science Foundation of China (Grant No. 62266013).

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
D Antar. 2023. The effectiveness of using ChatGPT4 in creative writing in Arabic: Poetry and short story as a model.
J. Brodsky. 1999. Witness and Pleasure.
Tuhin Chakrabarty, Arkadiy Saakyan, and Smaranda Muresan. 2021. Don’t go far off: An empirical study on neural poetry translation. arXiv preprint arXiv:2109.02972.
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2023. ChatEval: Towards better LLM-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201.
Yina Chang. 2012. A brief discussion on cutting-edge issues in contemporary poetry research. Northern Literature Monthly.
Sreejato Chatterjee, Linh Tran, Quoc Duy Nguyen, Roni Kirson, Drue Hamlin, Harvest Aquino, Hanjia Lyu, Jiebo Luo, and Timothy Dye. 2025. Assessing historical structural oppression worldwide via rule-guided prompting of large language models. arXiv preprint arXiv:2509.15216.
Bo Cheng. 2005. The special ecology of modernity in new poetry–a study on the poets of Southwest Associated University. Journal of the School of Liberal Arts of Nanjing Normal University, (4):69–76.
Cheng-Han Chiang and Hung-Yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631.
Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, and Yiqun Liu. 2024. PRE: A peer review based large language model evaluator. arXiv preprint arXiv:2401.15641.
Zekun Deng, Hao Yang, and Jun Wang. 2024. Can AI write classical Chinese poetry like humans? An empirical study inspired by Turing test. arXiv preprint arXiv:2401.04952.
Yuan Gao, Ruili Wang, and Feng Hou. 2023. How to design translation prompts for ChatGPT: An empirical study. arXiv e-prints, pages arXiv–2304.
Dmitriy Genzel, Jakob Uszkoreit, and Franz Och. 2010. “Poetic” statistical machine translation: Rhyme and meter. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 158–166, Cambridge, MA. Association for Computational Linguistics.
Marjan Ghazvininejad, Yejin Choi, and Kevin Knight. 2018. Neural poetry translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 67–71.
James Hutson and Ana Schnellmann. 2023. The poetry of prompts: The collaborative role of generative artificial intelligence in the creation of poetry and the anxiety of machine influence. Global Journal of Computer Science and Technology, 23(D1):1–14.
Yongjun Jiang. 2012. The "unspeakable" in the perceptual effect of poetry - an analysis of the "unspeakable" meaning of feeling. Foreign Language, 28(1):4.
Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. arXiv preprint arXiv:2301.08745, 1(10).
Tian Lan, Jiang Li, Yemin Wang, Xu Liu, Xiangdong Su, and Guanglai Gao. 2025a. F²Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 2031–2046. Association for Computational Linguistics.
Tian Lan, Xiangdong Su, Xu Liu, Ruirui Wang, Ke Chang, Jiang Li, and Guanglai Gao. 2025b. McBE: A multi-task Chinese bias evaluation benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6033–6056.
Dianqing Lin, Aruukhan, Hongxu Hou, Shuo Sun, Wei Chen, Yichen Yang, and Guodong Shi. 2025. Can large language models translate unseen languages in underrepresented scripts? In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 23137–23150. Association for Computational Linguistics.
Yen-Ting Lin and Yun-Nung Chen. 2023. LLM-Eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. arXiv preprint arXiv:2305.13711.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522. Association for Computational Linguistics.
Quanming Long. 2005. Modernity of Chinese New Poetry.
Hanjia Lyu, Jiebo Luo, Jian Kang, and Allison Koenecke. 2025. Characterizing bias: Benchmarking large language models in simplified versus traditional Chinese. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pages 2815–2846. Association for Computing Machinery.
Hisar Manurung. 2004. An evolutionary algorithm approach to poetry generation.
Brian McGrath. 2018. Understating poetry. New Literary History, 49(3):289–308.
Qingwei Mo. 2009. Emotional classification of new poetry and its cultural roots. Management Observation, (13):1.
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252.
Robert B Pierce. 2003. Defining "poetry". Philosophy and Literature, 27(1):151–163.
Weihong Qi, Hanjia Lyu, and Jiebo Luo. 2025. Representation bias in political sample simulations with large language models. In Companion Proceedings of the ACM on Web Conference 2025, pages 1264–1267, New York, NY, USA. Association for Computing Machinery.
Eshrag Ali Refaee. 2023. Okaz: A deep-learning-based system for automatic Arabic poem generation. In 2023 3rd International Conference on Computing and Information Technology (ICCIT), pages 396–403. IEEE.
Hanwen Shen and Ting Ying. 2025. Measuring information distortion in hierarchical ultra long novel generation: The optimal expansion ratio. arXiv preprint arXiv:2505.12572.
Hanwen Shen, Ting Ying, Jiajie Lu, and Shanshan Wang. 2026. Preconditioned test-time adaptation for out-of-distribution debiasing in narrative generation. arXiv preprint arXiv:2603.13683.
Wai Lei Song, Haoyun Xu, Derek F. Wong, Runzhe Zhan, Lidia S. Chao, and Shanshan Wang. 2023. Towards zero-shot multilingual poetry translation. In Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track, pages 324–335. Asia-Pacific Association for Machine Translation.
Allen Tate. 1940. Understanding modern poetry. The English Journal, 29(4):263–274.
Tim Van de Cruys. 2020. Automatic poetry generation from prosaic text. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2471–2480.
Maria Virvou, George A Tsihrintzis, Dionisios N Sotiropouloss, Konstantina Chrysafiadi, Evangelos Sakkopoulos, and Evangelia-Aikaterini Tsichrintzi. 2023. ChatGPT in artificial intelligence-empowered e-learning for cultural heritage: The case of lyrics and poems. In 2023 14th International Conference on Information, Intelligence, Systems & Applications (IISA), pages 1–9. IEEE.
Guangming Wang. 1998. Reflections on the ontology of new Chinese poetry. Chinese Social Sciences, (4):16.
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11.
Mi Wang. 2019. Study on the relationship between external form and internal emotion in the rhythm of new poetry (1917-1949). Ph.D. thesis, Sichuan Academy of Social Sciences.
Shanshan Wang, Derek Wong, Jingming Yao, and Lidia Chao. 2024. What is the best way for ChatGPT to translate poetry? In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14025–14043.
Shanshan Wang, Junchao Wu, Fengying Ye, Derek F. Wong, Jingming Yao, and Lidia S. Chao. 2025. Benchmarking the detection of LLMs-generated Modern Chinese poetry. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9533–9552. Association for Computational Linguistics.
Shuting Wang. 2006. Giving shape to emotion: The theory and practice of rhythm and imagery in new poetry (1917-1937). World Literature Review.
Yunshu Xi. 2019. Two characteristics of new poetry language: talking about the creation of new poetry part three. Masterpiece Appreciation: Appreciation Edition (First Decade), (8):6.
Fengying Ye, Shanshan Wang, Lidia S Chao, and Derek F Wong. 2025. Unveiling LLMs’ metaphorical understanding: Exploring conceptual irrelevance, context leveraging and syntactic influence. arXiv preprint arXiv:2510.04120.
Michelle Yeh. 1990. A new orientation to poetry: The transition from traditional to modern. Chinese Literature: Essays, Articles, Reviews (CLEAR), 12:83–105.
Xiaoyuan Yi, Maosong Sun, Ruoyu Li, and Wenhao Li. 2018. Automatic poetry generation with mutual reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3143–3153.
Haizhang Yu. 2001. About the changes and analysis of new poetry expressions. Writing, 000(011):3–5.
Di Zang. 1998. Modernity and evaluation of new poetry. Literary Contention, (3):6.
Taozhou Zhang. 2021. How to read a modern poem?–take the analysis of Zhu Xiang’s "Rain Scene" as an example. Journal of Chengdu University: Social Sciences Edition, (2):8.
Taozhou Zhang. 2022. Problems in reading new poetry from the perspective of "interpretation of poetics". Literary Research, (3):13.
Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 670–680.
Xinnong Zhang, Jiayu Lin, Libo Sun, Weihong Qi, Yihang Yang, Yue Chen, Hanjia Lyu, Xinyi Mou, Siming Chen, Jiebo Luo, et al. 2024. ElectionSim: Massive population election simulation powered by large language model driven agents. arXiv preprint arXiv:2410.20746.
A Appendix

A.1 Professional Poets

The poets involved in this work are not only poetry creators but also seasoned researchers of poetry theory, including members of the Chinese Writers Association (see footnote 1), poetry editors, and university professors. They are all accomplished Chinese poets who have published many poetry collections and many articles on modern Chinese poetry, and who have won many poetry awards.

A.2 Details About Prompt Optimization

During the experiments, we observed that when GPT-4 generated an interpretation of a poem, a reminder such as “You did not provide the complete text of the poem” sometimes appeared before the interpretation. To quantify this, we tallied the frequency of such reminders and found that, in the analysis of 48 poems, more than one-third of the cases (18 poems in total) elicited a similar response. This suggests that GPT-4 may not reliably recognize when a modern poem is complete. To better elicit GPT-4's understanding of poetry, we added the sentence “This is a complete modern poem” to the designed prompt. All experiments then used the optimized prompts shown in Table 1.

A.3 Example of the Experiment

Here is an example of the experiment on GPT-4.
Poem:

醉酒一事 (The Matter of Being Drunk)

在酒的国度里，鱼鳌可以 (In the kingdom of wine, turtles might)
离开海洋，鸟雀不必站在树上 (Depart the sea, and birds need not perch in trees)
男人醉酒后，很放心地 (A man, inebriated, with carefree trust)
把自己摆到了街边。人间之事 (Lay himself along the street's edge, where worldly matters)
全都碎在杯里，一饮而尽后 (All shattered in the cup, drained in one gulp)
在肠胃里激荡，像巨浪冲刷着堤防 (Surging in the stomach, like giant waves washing over the embankment)
车辆在街灯的重影里轰鸣、拉扯 (Vehicles roar and tug in the doubled shadows of streetlights)
犹如那些轻易给出的诺言，在黑暗中 (Like promises too lightly made, in dark)
徒劳地撒网，没人会在意一个醉汉 (Casting nets in vain. No one cares about a drunk)
倒置的身体和语言，他只是马路上的一块礁石 (Inverted body and language, he is just a rock on the road)
是礁石上一条搁浅的鱼，鱼腹里的一根刺 (It's a stranded fish on the reef, a thorn in its belly)
想到这里，他真的像一根鱼刺一样 (Thinking of this, he really looks like a fishbone)
卡住了道路的喉咙 (Stuck in the throat of the road)

Footnote 1: https://www.chinawriter.com.cn/n1/2025/0417/c403937-40462301.html

Interpretation of this poem by GPT-4:

• Content: This poem describes the state and feelings of a drunkard. It depicts how alcohol allows one to escape reality, making the drunkard feel as if they are in a “kingdom of wine” where they are not bound by the rules of the real world. However, this escapism ultimately renders him a mere rock in the city, a stranded fish, even a fishbone stuck in the throat of the road.

• Expression Methods: 1) Language Characteristics. The poem uses direct and concrete language, innovatively comparing the drunkard to a rock and a fishbone, which is a common language feature in modern poetry. 2) Imagery.
The poem uses images like “turtles,” “birds,” “giant waves,” and “vehicles,” comparing the state of drunkenness to the freedom of animals and the power of natural disasters, which enhances the expressiveness of the poem. 3) Rhetorical Techniques. The poem employs rhetorical devices such as metaphor (comparing the drunkard to a rock and fishbone) and hyperbole (e.g., “all shattered in the cup”), which enhance the expressiveness and emotional impact of the language. 4) Rhythm. The poem controls the rhythm through short lines and pauses, simulating the discontinuous stream of consciousness of the drunkard and his unsteady presence in the city. 5) Defamiliarization. The poem achieves a defamiliarizing effect by comparing people to objects (rock, fishbone), allowing readers to see the image and state of the drunkard from a new perspective.

• Thought & Emotion: The poem conveys a deep understanding of the state of drunkenness and sympathy for the isolated and unsupported state of the drunkard in society. Through the image of the drunkard, the poet reflects the potential for people to become lost and lonely in modern society.

• Modernity: This poem has modernity; it takes the experience of an ordinary person in modern society as a starting point, using modern poetic language and expression to explore the individual's isolated state in society and the complexity of the inner world. Through the portrayal of the drunkard, the poet reflects the inner struggle of modern people and their longing for freedom.

• Poeticity: The sentence “Surging in the stomach, like giant waves washing over the embankment” is particularly poetic. It vividly describes the effect of alcohol in the body and metaphorically alludes to the inner turmoil and conflict within a person.

A.4 Detailed Analysis

Com-Poems. Detailed analysis of Com-Poems across dimensions (from highest to lowest performance):

1) Imagery.
GPT-4 achieved the highest comprehension score in the imagery dimension, with an average score of 81.18 (Table 2). This aligns with expectations, as imagery is a salient and explicit characteristic of poetry. Notably, GPT-4 scored above 80 in imagery interpretation 18 times (Figure 2), demonstrating its relatively strong capability in recognizing and interpreting visual and descriptive elements in poetry. This finding highlights GPT-4's ability to engage with one of the most concrete and explicit dimensions of poetic language.

2) Content. GPT-4 scored 80.33 in content comprehension, indicating its capability to discern the core subject matter and themes of poems. We observed that GPT-4 performs particularly well with poems that have clear themes or straightforward narrative structures, often generating comprehensive summaries with minimal omissions. For instance, GPT-4 scored above 80 in content understanding 19 times (14 scores between 80–90 and 5 scores above 90), the highest frequency among dimensions with scores above 80 (Figure 2). This finding underscores GPT-4's proficiency in grasping the explicit content of modern poetry.

3) Modernity. Scoring 79.88, GPT-4 demonstrated a strong ability to recognize modernity, particularly through the imagery unique to modern Chinese poetry. While external modern imagery is relatively explicit, we were surprised to find that GPT-4 also identified implicit modern consciousness embedded within the poems. This nuanced capability reflects its potential to go beyond surface-level comprehension when analyzing modern poetic elements.

4) Defamiliarization. GPT-4 scored 79.40 in defamiliarization, placing it between modernity and language. Defamiliarization, often expressed through innovative linguistic techniques and modernist stylistics, was recognized by GPT-4 in a manner consistent with established poetic analysis.
This suggests that GPT-4 can reasonably interpret creative deviations in language and style, even if its understanding of such techniques is not perfect.

5) Language. GPT-4 scored 79.05 in language comprehension. While this score is relatively high, our analysis revealed limitations in its interpretative depth. GPT-4 often resorts to generalized phrases when analyzing linguistic features, which, while broadly applicable, lack specificity for individual poems. For example, GPT-4 frequently describes the language characteristics of modern poetry as employing “succinct and direct modern Chinese language” with “a straightforward and unembellished lexicon.” These descriptions are accurate at a general level but fail to capture the unique linguistic features of specific poems. Additionally, we found that GPT-4's comprehension improves with simpler and more coherent language. When linguistic leaps are minimal and word correlations are stronger, GPT-4's interpretations are more precise.

6) Thought & Emotion. GPT-4 scored 78.80 in thought & emotion. We find that GPT-4 is adept at capturing emotional undertones in poems with clear and vivid imagery. However, its performance declines when themes become more complex or abstract, indicating a limitation in its ability to grasp deeper cognitive and emotional nuances. This finding emphasizes the critical role of imagery in facilitating GPT-4's understanding of a poem's emotional and intellectual depth.

7) Rhetorical Techniques. GPT-4 scored 77.83 in rhetorical techniques, slightly outperforming its understanding of rhythm. Our analysis shows that GPT-4 reliably identifies similes with explicit cues (e.g., “like,” “as,” “is”). However, its ability to recognize more subtle or implicit rhetorical devices is significantly weaker, often leading to incorrect interpretations.
For example, GPT-4 misidentified the line “When your message arrives on my phone, I hear your voice” as employing personification. This interpretation is incorrect for two reasons: the line does not attribute human-like qualities to inanimate objects, and the relationship between “message” and “voice” is sequential and causal, not rhetorical. Such errors highlight GPT-4's limitations in detecting rhetorical techniques without explicit cues.

8) Rhythm. GPT-4's comprehension of rhythm scored a relatively low 76.15. We observed that GPT-4 lacks the ability to perform a detailed rhythmic analysis. Instead, it defaults to generalized observations, such as noting variations in sentence length, pauses, and line breaks as indicators of rhythm. Notably, GPT-4 scored above 80 in rhythm comprehension only 8 times, while scores at or below 70 accounted for 16 occurrences. This finding indicates that GPT-4 struggles to capture the nuanced rhythmic patterns that are integral to poetic expression.

9) Poeticity. Poeticity was the weakest dimension, with GPT-4 scoring the lowest overall (Table 3). In the task of identifying the most poetic sentence from 40 poems, GPT-4's selections were incorrect in more than half the cases (23 sentences). For example, GPT-4 misidentified the phrase “tempering a rod of iron into a needle” as highly poetic, interpreting it as a metaphor for dedication and refinement. While this interpretation is valid in isolation, GPT-4 failed to consider the line's context within the poem, leading to an erroneous judgment of poeticity. Each poem is a vibrant entity (Zhang, 2021), in which lines are independent yet intimately interconnected. This suggests that GPT-4 tends to analyze sentences in isolation rather than considering their interplay within the holistic structure.
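The frequency counts cited in the Com-Poems analysis (for example, how often a dimension scores above 80, or at or below 70) amount to sorting per-poem scores into bands. The following is a minimal sketch of that bookkeeping, assuming per-poem scores on a 0–100 scale. The function names, the exact band boundaries, and the example scores are hypothetical illustrations, not the scripts or data used in this study.

```python
from collections import Counter
from statistics import mean

# Band labels, ordered from highest to lowest.
# Assumption: boundaries are exclusive on the lower side, so a score of
# exactly 80 falls in "70-80" and a score of exactly 70 in "at or below 70".
BANDS = ("above 90", "80-90", "70-80", "at or below 70")


def band(score: float) -> str:
    """Map one 0-100 score to its band label."""
    if score > 90:
        return "above 90"
    if score > 80:
        return "80-90"
    if score > 70:
        return "70-80"
    return "at or below 70"


def summarize(scores):
    """Return the mean score and per-band counts for one dimension."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(band(s) for s in scores)
    return {
        "mean": round(mean(scores), 2),
        "bands": {name: counts.get(name, 0) for name in BANDS},
        "above_80": counts["above 90"] + counts["80-90"],
    }


# Placeholder scores for illustration only (not experimental data).
example_scores = [92, 85, 81, 78, 70, 65]
print(summarize(example_scores))
```

Reporting the band counts alongside the mean helps distinguish a dimension with uniformly moderate scores from one, such as rhythm, where many low scores sit beside a few high ones.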
Spe-Poems. Within our dataset, Spe-Poems represent a unique type of modern poetry, often incorporating references to ancient literary content. Prior to conducting the experiments, we anticipated that GPT-4 might struggle with understanding this type of poetry, but the experimental results were surprising, revealing both strengths and limitations. For Spe-Poems, detailed findings for key dimensions are as follows:

1) Rhetorical Techniques. For Spe-Poems, GPT-4 demonstrated the strongest understanding of rhetorical techniques, with an agreement rate of 88.75% with the original poets' ideas, the highest score across all dimensions. This result is particularly noteworthy. For example, GPT-4 correctly identified the use of metaphor in the line, “The undulating mountain ranges resemble a giant hastening along his way,” accurately recognizing the explicit cue word “resemble” and providing an error-free analysis of the rhetorical device. This finding is consistent with GPT-4's performance on common modern poetry, where it excels in recognizing rhetorical techniques with explicit signals. The high accuracy in this dimension highlights GPT-4's capability to analyze and interpret rhetorical elements in poetry where cues are explicit.

2) Modernity. GPT-4's comprehension of modernity in Spe-Poems reached 82.50% agreement with the original poets' intentions, which was a surprising and encouraging result. For instance, in the poem “Lotus Leaves as Clothing (I),” GPT-4 correctly recognized the modern elements interwoven with ancient themes. While the poem is centered on the ancient poet Qu Yuan, it incorporates modern elements such as the mention of “296 BC” and juxtaposes ancient imagery with modern psychological experiences.
GPT-4 successfully identified the poem's reflection of universal human emotions, such as loneliness, confusion, and the pursuit of beauty, that transcend time and space, effectively interpreting the modern consciousness embedded within the poem. This demonstrates GPT-4's potential to bridge classical and modern elements in poetic interpretation.

3) Rhythm. GPT-4's understanding of rhythm in Spe-Poems scored 82.50, second only to rhetorical techniques. This result stands in contrast to its relatively weaker performance in rhythm comprehension for common types of modern poetry. For example, GPT-4 accurately analyzed the rhythm of “Lotus Leaves as Clothing (I),” recognizing how line breaks and pauses, such as in the brief sentences “Night. The lodge.”, create a tranquil and solitary atmosphere. This analysis was entirely consistent with the original poet's intent, demonstrating that GPT-4 can effectively recognize rhythmic features when they are explicitly designed to evoke particular moods or tones.

4) Language. GPT-4's alignment with the original poets in language comprehension for Spe-Poems was 73.75%, the lowest among all dimensions. This lower score can be attributed to the inherent challenges posed by this category of poetry, which often employs classical Chinese imagery and lexicon. These elements are laden with rich cultural connotations and intrinsic meanings that require a deep understanding of the relevant cultural and historical context. For example, in analyzing the poem “Lotus Leaves as Clothing (I),” GPT-4 correctly observed the use of a classical poetic style, accurately identifying ancient phrases like “huai xi” while also recognizing the integration of modern language. However, GPT-4 mistakenly categorized “wrinkles and late spring” as modern language, despite their frequent use as poetic images in ancient Chinese poetry.
While GPT-4 correctly identified the innovation in the poem, it failed to recognize that the sentence “Qu Yuan carrying wrinkles and late spring” innovatively materializes “wrinkles and late spring” through the word “carrying,” blending modern techniques with classical imagery. Although GPT-4 struggles to fully grasp the linguistic nuances of classical elements, it exhibits the potential to identify their presence and stylistic relevance. This suggests that its understanding of language in Spe-Poems, while imperfect, provides a promising foundation for further improvement.

5) Poeticity. GPT-4's ability to understand poeticity in Spe-Poems remains unsatisfactory, consistent with its performance on common modern poetry. When tasked with identifying the most poetic sentence from 8 poems, GPT-4's selections were incorrect in more than half the cases (5 out of 8). In one instance, it selected a sentence entirely devoid of poetic sense as the most poetic. This finding reinforces GPT-4's tendency to analyze sentences in isolation, overlooking their contextual interplay within the broader structure of a poem. This limitation highlights an area where future improvements are necessary to enhance GPT-4's capacity for holistic poetic interpretation.

The detailed analysis presented above demonstrates the analytical depth of our work. By systematically evaluating GPT-4's performance across multiple poetic dimensions and providing specific examples, we offer nuanced insights into its strengths and limitations when analyzing Com-Poems and Spe-Poems.
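The poeticity results for both subsets are stated as counts of incorrect selections (23 of 40 Com-Poems sentences and 5 of 8 Spe-Poems sentences). As a worked check of the "more than half" statements, the sketch below converts those counts into error rates. The counts are taken from the text above; the function name and the use of exact fractions are illustrative assumptions, not part of the original evaluation pipeline.

```python
from fractions import Fraction


def error_rate(incorrect: int, total: int) -> Fraction:
    """Exact share of incorrect most-poetic-sentence selections."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= incorrect <= total:
        raise ValueError("incorrect must lie between 0 and total")
    return Fraction(incorrect, total)


# Counts as reported in the text for each subset.
com_poems = error_rate(23, 40)
spe_poems = error_rate(5, 8)

for name, rate in (("Com-Poems", com_poems), ("Spe-Poems", spe_poems)):
    # Compare against one half exactly, then show a rounded percentage.
    print(f"{name}: {float(rate):.1%} incorrect, "
          f"more than half: {rate > Fraction(1, 2)}")
```

Exact fractions avoid floating-point edge cases in the comparison against one half. With only 8 Spe-Poems, a single selection shifts the rate by 12.5 percentage points, so the two subsets are better compared qualitatively than by their raw rates.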
