Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models
This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced evaluation set of 100 Turkish sentences that systematically pit local against non-local antecedents for the reflexives kendi and kendisi. We compare two contrasting systems: o1 Mini, an OpenAI chain-of-thought model optimized for multi-step reasoning, and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined paradigm that integrates sentence-level perplexity with a forced-choice comparison between minimally differing continuations. Overall, Trendyol-LLM favors local bindings in approximately 70 percent of trials, exhibiting a robust locality bias consistent with a preference for structurally proximate antecedents. By contrast, o1 Mini distributes its choices nearly evenly between local and long-distance readings, suggesting weaker or less consistent sensitivity to locality in this binding configuration. Taken together, these results reveal a marked contrast in binding behavior across the two systems and motivate closer analysis of how model architecture, training data, and inference-time reasoning strategies shape the representation of Turkish anaphoric dependencies.
💡 Research Summary
This paper investigates whether state‑of‑the‑art large language models (LLMs) capture the binding relations of Turkish reflexive pronouns, focusing on the two forms kendi and kendisi. The authors construct a balanced evaluation set of 100 Turkish sentences, each presented as a minimal pair that pits a locally bound antecedent against a non‑local (long‑distance) antecedent. For each pair, two minimally differing continuations are generated, allowing a forced‑choice paradigm that tests which continuation the model prefers. In addition to this forced choice, sentence‑level perplexity is measured to assess overall language‑model fit.
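The paper does not publish its evaluation code; the following is a minimal sketch of how the forced-choice, perplexity-based comparison could be implemented. The function names (`perplexity`, `forced_choice`) and the per-token log-probability values are hypothetical, standing in for scores a causal language model would assign to the two minimally differing continuations.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def forced_choice(logprobs_local, logprobs_longdist):
    """Prefer the continuation with lower perplexity (higher model fit)."""
    if perplexity(logprobs_local) < perplexity(logprobs_longdist):
        return "local"
    return "long-distance"

# Hypothetical per-token log-probs for the two continuations of one item:
local_continuation = [-1.2, -0.8, -2.1]      # locally bound antecedent
longdist_continuation = [-1.9, -1.4, -2.6]   # long-distance antecedent

print(forced_choice(local_continuation, longdist_continuation))  # -> local
```

Aggregating this binary decision over all 100 items yields the locality-preference rates reported below, while averaging the raw perplexities gives the overall language-model fit measure.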
Two contrasting systems are compared: OpenAI’s chain‑of‑thought (CoT) model o1 Mini, which is optimized for multi‑step reasoning, and Trendyol‑LLM‑7B‑base‑v0.1, a LLaMA‑2‑derived model that has been extensively fine‑tuned on Turkish data. The CoT model is prompted to produce explicit reasoning traces before answering, whereas Trendyol‑LLM is used in a standard left‑to‑right language‑modeling regime, assigning probabilities directly to the two continuations.
Results show a stark divergence. Trendyol‑LLM selects the locally bound antecedent in roughly 70% of the trials, indicating a robust locality bias that aligns with the structural constraints traditionally associated with kendi. By contrast, o1 Mini distributes its choices almost evenly between local and long‑distance readings, suggesting that the CoT reasoning pipeline does not substantially increase sensitivity to the locality requirements of Turkish reflexive binding. Perplexity scores corroborate this pattern: Trendyol‑LLM exhibits lower overall perplexity on the test items, reflecting better adaptation to Turkish syntax.
The authors interpret these findings in terms of model architecture, training data, and inference‑time strategies. The Turkish‑specific fine‑tuning of Trendyol‑LLM appears sufficient for the model to internalize the structural cues (case marking, verb morphology) that signal binding domains, even though such cues are relatively sparse compared to high‑signal phenomena like subject‑verb agreement. The CoT model, despite its multi‑step reasoning capabilities, relies on a broader multilingual corpus and does not receive explicit supervision for Turkish reflexive binding, which may explain its weaker locality effect.
The paper situates its contribution within a broader literature that treats LLMs as surrogate psycholinguistic subjects. It highlights that while LLMs have demonstrated impressive performance on frequent, overtly marked constructions, they often falter on low‑frequency, structurally opaque, or discourse‑pragmatically conditioned dependencies. Turkish reflexive binding, with its interplay of syntactic locality and discourse pragmatics, serves as a rigorous test case for this limitation.
Finally, the authors outline future directions: expanding the evaluation to other under‑represented languages, designing probing methods that target deeper hierarchical representations, and exploring whether CoT‑style prompting can be tailored (e.g., with syntax‑aware prompts) to improve sensitivity to binding phenomena. Overall, the study reveals that language‑specific fine‑tuning can yield stronger grammatical competence for subtle anaphoric relations than generic reasoning‑oriented prompting, underscoring the importance of both data and architectural choices in building linguistically robust LLMs.