Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation
Regular sound correspondences constitute the principal evidence in historical language comparison. Despite this heuristic focus on regularity, regularity is usually judged intuitively rather than evaluated quantitatively, and irregularity is more common than the Neogrammarian model predicts. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method in two experiments, using simulated and real data. In both, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% on the datasets derived from real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular-cognate identification method based on it, we conclude that both could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
💡 Research Summary
The paper addresses a longstanding gap in historical linguistics: the lack of a quantitative measure for the regularity of sound correspondences, which has traditionally been judged intuitively. The authors introduce a novel metric called “balanced average recurrence” (BAR) that quantifies how frequently individual correspondence patterns appear across the aligned sites of cognate sets. The computation proceeds in three steps. First, for each column (alignment site) in a cognate set’s phonetic alignment, all compatible correspondence patterns are identified, and the pattern that matches the greatest number of sites is selected. The count of how many sites each chosen pattern covers constitutes its raw recurrence. Second, these raw recurrence values are log‑transformed to dampen the influence of outliers (e.g., a single very frequent pattern) and then averaged across the sites of the cognate set. Third, the exponential of this mean is taken, yielding a BAR score that is close to 1 for highly regular alignments and deviates downward as irregularity increases.
To enable comparison across datasets of differing size, the authors normalize each pattern’s count by the total number of alignment sites in the dataset before log‑transformation. This normalization makes the BAR score comparable across language families, concept inventories, and numbers of cognate sets. The authors illustrate the metric on twenty LexiBank v2.1 datasets (576 languages, 19 families), showing that datasets with manually curated alignments (e.g., CrossAndean, BlumPanotacana, LeeAinu) achieve markedly higher BAR scores, confirming that careful annotation boosts measurable regularity.
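Under the description above, the three computation steps reduce to a geometric mean of normalized per-site recurrences. The following minimal sketch illustrates that reading; the function name and the toy recurrence counts are illustrative assumptions, not the authors' implementation:

```python
import math

def balanced_average_recurrence(site_recurrences, total_sites):
    """Illustrative sketch of a BAR score for one cognate set.

    `site_recurrences` holds, for each alignment site of the cognate
    set, how many sites in the whole dataset the best-matching
    correspondence pattern covers; `total_sites` is the total number
    of alignment sites in the dataset (used for normalization).
    """
    # Steps 1-2: normalize each raw recurrence, log-transform to
    # dampen the influence of very frequent patterns, then average.
    mean_log = sum(
        math.log(r / total_sites) for r in site_recurrences
    ) / len(site_recurrences)
    # Step 3: exponentiate; values near 1 indicate high regularity.
    return math.exp(mean_log)

# Patterns covering most of a (tiny) 20-site dataset score high...
print(round(balanced_average_recurrence([18, 19, 20], 20), 2))  # 0.95
# ...while a single rarely recurring pattern pulls the score down.
print(round(balanced_average_recurrence([18, 19, 1], 20), 2))   # 0.35
```

Because the log-transform turns the product into a sum, one outlier site shifts the score far less than it would under a plain arithmetic mean of raw counts.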
The second major contribution is a leave‑one‑out (LOO) validation procedure that uses the BAR score to pinpoint irregular word forms within cognate sets. For each cognate set, the algorithm temporarily removes one word, recomputes the BAR score, and records the change. If the removal leads to a substantial increase in BAR, the omitted word is flagged as an irregular element. By iterating this process over all words in all sets, the method automatically produces a ranked list of candidate irregular forms.
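The leave-one-out loop can be sketched as follows. Here `bar_score` stands in for the BAR computation, and the `min_gain` threshold is a hypothetical parameter introduced for illustration, not a value from the paper:

```python
def rank_irregular_candidates(cognate_sets, bar_score, min_gain=0.05):
    """Leave-one-out scan: flag words whose removal makes a cognate
    set markedly more regular under the given scoring function."""
    candidates = []
    for set_id, words in cognate_sets.items():
        base = bar_score(words)
        for i, word in enumerate(words):
            reduced = words[:i] + words[i + 1:]
            if len(reduced) < 2:  # need at least a pair to compare
                continue
            gain = bar_score(reduced) - base
            if gain >= min_gain:  # removal made the set more regular
                candidates.append((gain, set_id, word))
    # Largest regularity gains first: most suspicious forms on top.
    return sorted(candidates, reverse=True)

# Toy stand-in for BAR: the share of words beginning with "t".
toy_bar = lambda ws: sum(w.startswith("t") for w in ws) / len(ws)
flagged = rank_irregular_candidates({"HAND-1": ["ta", "to", "xa"]}, toy_bar)
print(flagged[0][2])  # the word whose removal raised the score most
```

In the toy example, dropping "xa" lifts the score from 2/3 to 1, so it tops the ranked list, mirroring how the method surfaces the word form responsible for a set's irregularity.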
Two experimental regimes evaluate the approach. In simulated data, the authors inject known irregular forms into otherwise regular cognate sets and verify that the LOO method recovers them with high precision. In real data, they create “perturbed” versions of each of the twenty datasets by randomly replacing a single word in a cognate set with an irregular counterpart, then run the LOO detection. The overall detection accuracy across all real-data experiments is 85%, demonstrating robust performance despite the noise inherent in automatically aligned lexical data.
Additional analyses explore how dataset size and the proportion of irregular forms affect performance. Subsampling the datasets reduces the stability of BAR scores, while increasing the proportion of injected irregular forms gradually lowers detection accuracy; however, even with up to 30% irregularity the method retains above 70% accuracy, suggesting practical utility for large-scale corpora.
The paper’s significance lies in (1) providing a concrete, reproducible metric for regularity that can be used to assess and compare the quality of cognate annotations, and (2) delivering an automated tool for detecting irregular cognate judgments, thereby facilitating data cleaning before downstream tasks such as proto‑language reconstruction, cognate prediction, or phylogenetic inference. By integrating BAR into existing computational pipelines, researchers can obtain more reliable lexical datasets, improve the robustness of historical linguistic models, and potentially uncover systematic patterns of analogy or borrowing that were previously hidden behind “intuitive” judgments. Future work may involve coupling BAR with Bayesian models of sound change, extending the method to partial cognacy, or applying it to under‑documented language families to test its universality.