Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish – with its transparent morphological markers – both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, in contrast, monolingual and multilingual models diverge: a multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, while a monolingual model with morpheme-aware segmentation performs well. On more synthetic datasets, performance improves for all models.
💡 Research Summary
This paper investigates how transformer‑based language models encode complex verbal paradigms in Turkish and Modern Hebrew, with a particular focus on the influence of tokenization strategies. The authors employ the Blackbird Language Matrices (BLM) task, a paradigm‑based multiple‑choice diagnostic that presents three fully‑specified verb forms and one incomplete form in a context set, together with four candidate answer sentences. Models must select the sentence that correctly completes the underlying linguistic rule.
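The structure of a BLM item described above can be sketched as a small data container. This is a hypothetical schema for illustration, not the authors' code; the field names and placeholder strings are assumptions.

```python
from dataclasses import dataclass

@dataclass
class BLMInstance:
    """One Blackbird Language Matrices item (illustrative schema only)."""
    context: list        # three fully specified verb forms plus one incomplete form
    candidates: list     # four candidate answer sentences
    answer: int          # index of the candidate that completes the paradigm

# Hypothetical item: the model must pick the candidate that continues the rule.
instance = BLMInstance(
    context=["form-active", "form-passive", "form-causative", "???"],
    candidates=["cand-a", "cand-b", "cand-c", "cand-d"],
    answer=2,
)
assert len(instance.context) == 4 and len(instance.candidates) == 4
```

Framing the task this way makes clear that it is a four-way multiple-choice problem over sentences, not token-level prediction.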
Data are extracted from Universal Dependencies treebanks. For Turkish, sentences are drawn from news, non‑fiction, and dictionary examples (≈183 k tokens); for Hebrew, from news and encyclopedic sources (≈217 k tokens). Each language’s dataset contains 8 000 BLM instances, balanced across four voices: active, passive, causative, and causative‑passive. The split is 90 % training, 10 % testing, with disjoint instances.
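A disjoint 90 %/10 % split of 8 000 instances, as described above, can be sketched as follows. This is a minimal illustration; the paper's exact splitting procedure (e.g. how lexical overlap is controlled) may differ.

```python
import random

def split_blm(instances, train_frac=0.9, seed=0):
    """Shuffle and split instances into disjoint train/test sets (sketch)."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

train, test = split_blm(range(8000))
assert len(train) == 7200 and len(test) == 800
assert set(train).isdisjoint(test)  # no instance appears in both splits
```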
Three transformer models are evaluated: (1) a monolingual Turkish BERT (BERTurk), (2) a monolingual Hebrew AlephBERT, and (3) a multilingual Electra (base‑discriminator). Each model’s native tokenizer is used, allowing a direct comparison of token granularity. Token statistics reveal that monolingual models produce relatively atomic segmentations (average 1.86 tokens per Turkish verb form, 1.33 per Hebrew verb form). In contrast, the multilingual Electra splits Hebrew verbs into an average of 5.14 sub‑word tokens, essentially operating at the character level because its vocabulary is dominated by Latin‑script languages. Turkish verbs receive a milder sub‑word split (≈1.95 tokens) but still show increased token counts for morphologically complex forms (up to 6.86 tokens for causative‑passive).
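The tokens-per-verb statistics above compare how finely different tokenizers fragment the same forms. The contrast can be illustrated with toy tokenizers: a "morpheme-aware" split versus a character-level split, standing in for the monolingual and multilingual vocabularies respectively. The verb forms and the `+` separator are invented for illustration; real subword tokenizers (BPE, WordPiece) behave less cleanly.

```python
def avg_tokens(verbs, tokenize):
    """Average number of tokens per verb form under a given tokenizer."""
    return sum(len(tokenize(v)) for v in verbs) / len(verbs)

# Illustrative Turkish-like forms with hypothetical morpheme boundaries marked by "+".
verbs = ["yap+il+di", "yap+tir+il+di"]

morpheme_split = lambda v: v.split("+")                # stand-in for a morpheme-aware tokenizer
char_split = lambda v: [c for c in v if c != "+"]      # stand-in for character-level fallback

assert avg_tokens(verbs, morpheme_split) == 3.5   # few, linguistically meaningful units
assert avg_tokens(verbs, char_split) == 8.5       # many tiny units, no morpheme structure
```

The same verb forms yield very different token counts, which is exactly the gap the paper measures between the monolingual models and multilingual Electra.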
Performance is measured with F1 scores using a simple feed‑forward classifier on top of sentence embeddings. For Turkish, all models achieve high scores, indicating that the language’s transparent, concatenative morphology (affix‑based) is robust to different tokenization granularities. Even the most fragmented sub‑word splits preserve enough morphological cues for the model to learn the relationships among the four voices.
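Per-voice F1, as used above, can be computed from predicted and gold voice labels in a few lines. This is a generic F1 sketch, not the authors' evaluation script; the label strings are invented.

```python
def f1(preds, golds, label):
    """F1 score for one class, from parallel lists of predicted and gold labels."""
    tp = sum(p == label and g == label for p, g in zip(preds, golds))
    fp = sum(p == label and g != label for p, g in zip(preds, golds))
    fn = sum(p != label and g == label for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions over the four voices.
golds = ["act", "act", "act", "caus"]
preds = ["act", "pas", "act", "caus"]
assert abs(f1(preds, golds, "act") - 0.8) < 1e-9   # precision 1.0, recall 2/3
```

Averaging this over the four voice classes gives the macro-F1 figures the paper reports per model and language.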
Hebrew yields a stark contrast. The multilingual Electra, due to its character‑level tokenization, fails to capture the non‑concatenative, templatic morphology (binyanim) that combines a root with a pattern. Consequently, its F1 scores drop substantially, especially on causative and causative‑passive forms. AlephBERT, which employs a morpheme‑aware tokenizer that keeps the root and pattern as distinct tokens, performs markedly better, surpassing the multilingual model by roughly 15–20 points of absolute F1. This demonstrates that preserving the root‑pattern relationship is crucial for languages with non‑concatenative morphology.
When the authors test on synthetic datasets—artificially generated sentences that strictly follow the paradigm rules—performance improves across all models, confirming that the BLM task is sensitive to noise and lexical variability present in natural corpora. However, the gap between tokenization strategies remains, underscoring that tokenization choices have a decisive impact on real‑world generalization.
The study concludes that tokenization must be tailored to the morphological typology of each language. For concatenative languages like Turkish, standard sub‑word tokenizers (BPE, WordPiece) suffice. For templatic, non‑concatenative languages like Hebrew, a morpheme‑aware or root‑pattern preserving tokenizer is essential; otherwise multilingual models may suffer severe degradation. The authors suggest future work on dynamic tokenization, joint learning of morphology and tokenization, or integrating explicit morphological information during pre‑training to bridge the observed performance gap.