Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.


💡 Research Summary

The paper “Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation” investigates how meaning‑preserving lexical and syntactic perturbations affect both the absolute performance and the relative rankings of 23 contemporary large language models (LLMs) across three diverse benchmarks: MMLU (multiple‑choice knowledge QA), SQuAD (extractive reading comprehension), and AMEGA (clinical guideline adherence).

Two linguistically principled pipelines are introduced. The lexical pipeline uses a quantized Llama‑3.3‑70B‑Instruct to perform guided synonym substitution, ensuring contextual appropriateness and preserving domain‑specific terminology; for SQuAD a constraint guarantees that the answer span remains unchanged. The syntactic pipeline first parses each input with spaCy’s dependency parser to identify constituents (subjects, objects, complements, etc.) and then asks an LLM to reorder or restructure these constituents while keeping the original words intact. Both pipelines output the perturbed text together with a change log, and human verification confirms meaning preservation.
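The SQuAD constraint described above lends itself to a simple post-hoc check: after substitution, the gold answer span must still appear verbatim in the context, and word-level replacements can be recorded as a change log. A minimal sketch, assuming a word-level diff is sufficient (the function name and example sentences are illustrative, not from the paper):

```python
# Post-hoc verification for a lexical perturbation: the answer span must
# survive intact, and substituted words are logged. This is a sketch of
# the idea, not the paper's actual pipeline code.
import difflib

def verify_perturbation(original: str, perturbed: str, answer: str):
    """Return (ok, change_log); ok is True iff the answer span is intact."""
    ok = answer in perturbed
    a, b = original.split(), perturbed.split()
    log = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=a, b=b).get_opcodes():
        if op == "replace":
            # Record each substituted phrase pair (original -> perturbed).
            log.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ok, log

orig = "The committee approved the proposal last week."
pert = "The committee endorsed the proposal last week."
ok, log = verify_perturbation(orig, pert, answer="the proposal")
# ok is True; log contains ("approved", "endorsed")
```

In practice such a check would complement, not replace, the human verification the paper describes.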

All models are evaluated in a zero‑shot setting (temperature = 0, fixed random seed) on the original and perturbed versions of each dataset. Performance metrics are task‑appropriate: accuracy for MMLU, Exact Match/F1/Semantic Answer Similarity for SQuAD, and a guideline‑adherence score for AMEGA. Statistical significance is assessed via paired t‑tests and bootstrap confidence intervals.
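The significance testing described above can be sketched as follows, using synthetic per-item correctness scores (the paper's actual data are not reproduced; the degradation rate here is made up for illustration):

```python
# Paired t-test on per-item scores from the original vs. perturbed
# benchmark, plus a percentile bootstrap CI on the mean score difference.
# Synthetic data; an ~8% degradation rate is assumed for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.binomial(1, 0.75, size=500).astype(float)  # per-item correctness
perturbed = np.clip(original - rng.binomial(1, 0.08, size=500), 0, 1)

# Paired t-test: each item is its own control.
t_stat, p_value = stats.ttest_rel(original, perturbed)

# Percentile bootstrap over per-item score differences.
diffs = original - perturbed
boot_means = [rng.choice(diffs, size=diffs.size, replace=True).mean()
              for _ in range(2000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"t={t_stat:.2f}, p={p_value:.4g}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

A paired design is the natural choice here because original and perturbed versions of the same item are directly comparable.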

Results show a striking asymmetry. Lexical perturbations cause substantial, statistically significant drops in performance for virtually every model and task, with average declines ranging from 5 to 8 percentage points. The effect is especially pronounced on multiple‑choice tasks and for smaller models, suggesting heavy reliance on surface lexical cues. In contrast, syntactic perturbations produce heterogeneous outcomes: some models improve slightly (≈1–2 pp), others degrade, and many remain unchanged. This heterogeneity indicates that syntactic re‑ordering is not uniformly detrimental and that certain architectures may better capture abstract grammatical structure.

Beyond raw scores, the study examines leaderboard stability. Kendall’s τ between original and perturbed rankings falls below 0.35 for all three benchmarks, meaning that minor, meaning‑preserving changes can reshuffle the top‑10 positions and invalidate static leaderboards as reliable model‑selection tools. Importantly, model size does not guarantee robustness: larger models are not consistently more stable than mid‑size or even sub‑billion‑parameter models, and the relationship between parameter count and perturbation resilience is non‑monotonic and task‑dependent.
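The stability measure above compares how the same set of models ranks before and after perturbation. A minimal sketch with made-up scores (model names and numbers are illustrative, not the paper's):

```python
# Kendall's tau between a leaderboard built from original-benchmark scores
# and one built from perturbed-benchmark scores. Tau on the raw scores
# equals tau on the induced rankings, so no explicit ranking step is needed.
from scipy.stats import kendalltau

models = ["A", "B", "C", "D", "E"]
orig_scores = [0.82, 0.80, 0.76, 0.71, 0.65]  # illustrative values
pert_scores = [0.74, 0.77, 0.70, 0.72, 0.60]  # two adjacent pairs swap rank

tau, p = kendalltau(orig_scores, pert_scores)
print(f"Kendall's tau = {tau:.2f}")
```

With two of ten model pairs flipping order, τ drops to 0.6; the sub-0.35 values the paper reports correspond to considerably heavier reshuffling.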

The authors conclude that current LLM evaluation practices overestimate true generalization because they ignore surface‑level sensitivity. They advocate for incorporating robustness testing—specifically, systematic lexical and syntactic perturbations—into standard benchmark pipelines. All code, perturbation pipelines, and the perturbed datasets are released on GitHub to promote reproducibility and further research into LLM stability.

