Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

LLMs are routinely evaluated on language use, yet their explicit knowledge about linguistic structure remains poorly understood. Existing linguistic benchmarks focus on narrow phenomena, emphasize high-resource languages, and rarely test metalinguistic knowledge - explicit reasoning about language structure. We present a multilingual evaluation of metalinguistic knowledge in LLMs, drawing on the World Atlas of Language Structures (WALS), which documents 192 linguistic features across 2,660 languages. We convert WALS features into natural-language multiple-choice questions and evaluate models across documented languages. Using accuracy and macro F1, and comparing against chance and majority-class baselines, we assess performance and analyse variation across linguistic domains and language-related factors. Results show limited metalinguistic knowledge: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. Although all models perform above chance, they fail to outperform the majority-class baseline, suggesting they capture broad cross-linguistic patterns but lack fine-grained distinctions. Performance varies by domain, partly reflecting differences in online visibility. At the language level, accuracy correlates with digital language status: languages with greater digital presence and resources are evaluated more accurately, while low-resource languages fare worse. Analysis of predictive factors confirms that resource-related indicators (Wikipedia size, corpus availability) are more informative than geographic, genealogical, or sociolinguistic factors. Overall, LLM metalinguistic knowledge appears fragmented and shaped mainly by data availability rather than broadly generalizable grammatical competence. We release the benchmark as an open-source dataset to support evaluation across languages and encourage greater global linguistic diversity in future LLMs.


💡 Research Summary

This paper addresses a critical gap in the evaluation of large language models (LLMs): their explicit, metalinguistic knowledge—i.e., the ability to reason about language structure rather than merely generate fluent text. While numerous benchmarks test surface grammatical competence or acceptability judgments, they are typically limited to a handful of phenomena, focus heavily on high‑resource languages, and rarely assess true metalinguistic reasoning. To overcome these limitations, the authors leverage the World Atlas of Language Structures (WALS), a typological database that documents 192 linguistic features across more than 2,600 languages, spanning phonology, morphology, lexicon, and syntax.

Benchmark Construction
The authors systematically convert each WALS feature into a natural‑language multiple‑choice question. For a given language and feature, a prompt such as “In language X, the typical word order is …” is generated, with the correct answer drawn from the WALS entry and several distractors created by random sampling of other possible values. The conversion pipeline includes automatic template filling, human‑in‑the‑loop verification for fluency, and balancing of answer options to avoid trivial cues. The resulting benchmark, named WALS‑MCQ, covers the full set of documented languages, including a substantial proportion of low‑resource and under‑documented languages.
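The conversion step described above can be illustrated with a minimal sketch. This is not the authors' actual pipeline: the helper name `make_mcq`, the question template, and the feature values shown are hypothetical stand-ins, using WALS feature 81A (order of subject, object, and verb) as an example.

```python
import random

# Hypothetical value inventory for WALS feature 81A
# (order of Subject, Object and Verb); a real pipeline would
# read values directly from the WALS data files.
FEATURE_81A_VALUES = ["SOV", "SVO", "VSO", "VOS", "OVS", "OSV"]

def make_mcq(language, correct_value, all_values, n_distractors=3, seed=0):
    """Build one multiple-choice item: the correct WALS value plus
    distractors sampled at random from the feature's other values."""
    rng = random.Random(seed)
    distractors = rng.sample(
        [v for v in all_values if v != correct_value], n_distractors
    )
    options = distractors + [correct_value]
    rng.shuffle(options)  # avoid positional cues for the correct answer
    question = (
        f"In {language}, the typical order of subject, object and verb is:"
    )
    return {
        "question": question,
        "options": options,
        "answer": options.index(correct_value),
    }

mcq = make_mcq("Japanese", "SOV", FEATURE_81A_VALUES)
```

The shuffle step mirrors the paper's stated goal of balancing answer options so that the correct choice cannot be recovered from its position alone.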

Evaluation Protocol
Fourteen LLMs are evaluated, ranging from the proprietary GPT‑4o (the most capable model at the time of writing) to open‑source models such as LLaMA‑2, Mistral, and Falcon. All models are queried in a zero‑shot setting using a uniform prompt that presents the question and the answer choices, requesting the model to output the selected option. Performance is measured with two metrics: overall accuracy and macro‑averaged F1 (to mitigate class imbalance). Two baselines are provided for comparison: a random‑guess baseline (uniform probability over choices) and a majority‑class baseline (always selecting the most frequent answer for each feature).
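The two metrics and the majority-class baseline can be sketched with plain Python. This is an illustrative implementation under the usual definitions of accuracy and macro-averaged F1, not the authors' evaluation code; the toy `gold`/`pred` label lists are invented for demonstration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of items where the predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores, so rare labels count
    as much as frequent ones (mitigating class imbalance)."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def majority_baseline(y_true):
    """Always predict the most frequent gold label for the feature."""
    majority = Counter(y_true).most_common(1)[0][0]
    return [majority] * len(y_true)

# Toy labels for one feature, purely for illustration.
gold = ["SOV", "SVO", "SOV", "VSO", "SOV", "SVO"]
pred = ["SOV", "SOV", "SOV", "VSO", "SVO", "SVO"]
```

Note why the majority-class baseline is a stronger reference point than chance here: for a skewed feature (e.g. most languages being SOV or SVO), always guessing the dominant value already scores well on accuracy, which is exactly the comparison the paper uses to argue that models capture broad tendencies rather than language-specific facts.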

Results
GPT‑4o achieves the highest scores (accuracy = 0.367, macro‑F1 = 0.342) but does not surpass the majority‑class baseline, which sits around 0.31. Open‑source models lag further behind, scoring between 0.20 and 0.30 accuracy. Domain‑wise analysis reveals a clear hierarchy: lexical features are answered most accurately (≈ 0.42), morphological and syntactic features are intermediate (≈ 0.35), and phonological features are the hardest (≈ 0.28). This pattern aligns with the intuition that textual corpora contain abundant lexical information, whereas phonological patterns are poorly represented in orthographic data.

Language‑Level Variation
A regression analysis links per‑language performance to a suite of language‑level indicators. Resource‑related variables—Wikipedia article size, the number of publicly available corpora, and overall digital presence—explain the largest share of variance (Pearson r ≈ 0.58). In contrast, geographic, genealogical, and sociolinguistic factors (e.g., language family, continent, speaker population) have minimal predictive power. Random‑forest feature importance confirms that resource metrics dominate, while typological diversity contributes little.

Discussion of Limitations
The authors acknowledge several constraints. First, WALS itself is uneven: many low‑resource languages have only a handful of documented features, limiting the depth of evaluation for those languages. Second, the multiple‑choice format may encourage models to rely on statistical cueing rather than genuine rule induction, potentially inflating performance on high‑frequency patterns. Third, the zero‑shot setting does not explore whether fine‑tuning or few‑shot prompting could improve metalinguistic reasoning.

Conclusions and Future Work
The study concludes that current LLMs possess fragmented metalinguistic knowledge. They capture broad cross‑linguistic tendencies—evidenced by performance above chance—but fail to internalize fine‑grained grammatical distinctions, especially for phonology and for languages with scant digital footprints. The authors argue that expanding digital resources for under‑represented languages and designing benchmarks that probe rule induction (e.g., via probing tasks or generative explanations) are essential next steps. All benchmark data, code, and evaluation scripts are released as open‑source resources to facilitate further research on multilingual metalinguistic competence.

