A Rule-based Computational Model for Gaidhlig Morphology
Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which is typically unavailable for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher-level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use cases.
💡 Research Summary
This paper presents a work‑in‑progress rule‑based computational model for the morphology of Scottish Gaelic (Gaidhlig), aimed at addressing the challenges faced by low‑resource languages. The authors begin by highlighting the data scarcity problem that hampers modern neural language models, especially for languages like Gaelic that have limited digital corpora. They argue that a rule‑based approach can make efficient use of the modest amount of lexical information available in resources such as Wiktionary, while offering interpretability and educational value.
The background section situates Gaelic within the Celtic language family, noting its typological features—VSO word order, inflected prepositions, a rich system of initial‑consonant mutations (lenition, prothesis, glottalisation), and a vowel‑harmony distinction between broad and slender vowels. These phenomena create a many‑to‑many mapping between surface forms and grammatical categories, complicating statistical modeling but lending themselves well to explicit rule formulation.
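The broad/slender distinction mentioned above can be made concrete with a small sketch. This is an illustrative helper, not code from the paper: it classifies a stem by its final vowel letter, with broad vowels being a, o, u and slender vowels e, i (plus their accented forms). The per-character scan is a simplification that ignores digraph subtleties.

```python
# Hypothetical sketch: classify the final vowel of a Gaelic stem as broad
# (a, o, u) or slender (e, i), the distinction that governs which
# inflectional ending a stem takes ("caol ri caol is leathann ri leathann").

BROAD = set("aouàòù")
SLENDER = set("eièì")

def final_vowel_quality(stem: str) -> str:
    """Return 'broad' or 'slender' for the last vowel letter in the stem."""
    for ch in reversed(stem.lower()):
        if ch in BROAD:
            return "broad"
        if ch in SLENDER:
            return "slender"
    raise ValueError(f"no vowel found in {stem!r}")

print(final_vowel_quality("cat"))   # broad  -> takes e.g. the -an plural
print(final_vowel_quality("sùil"))  # slender -> takes e.g. the -ean plural
```

Rules that choose between broad and slender endings can then branch on this single predicate rather than pattern-matching each suffix separately.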
Data preparation is the core of the methodology. The authors download the latest English‑language Wiktionary dump, filter for Gaidhlig entries, and extract the “principal parts” for each headword: for nouns the nominative singular, nominative plural, and genitive singular; for verbs the singular imperative and the verbal noun; for adjectives the positive and comparative forms. These principal parts are stored in a Structured Vocabulary Format (SVF), a line‑based text schema that records lemma, part of speech, principal parts, and glosses. The SVF file is then imported into a relational database, enabling SQL queries that enumerate lexical patterns, count occurrences of specific inflectional categories, and compare dictionary frequencies with those observed in real‑world Gaelic texts.
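The database-backed pattern analysis can be sketched as follows. The paper does not publish the exact SVF layout, so this assumes a simple tab-separated line per headword with the principal parts in separate columns; the example entries and the suffix query are illustrative, not taken from the authors' data.

```python
# Minimal sketch of importing SVF-style noun records into SQLite and
# counting inflectional patterns with SQL. Column layout is an assumption:
# lemma, POS, nominative singular, nominative plural, genitive singular, gloss.
import sqlite3

svf_lines = [
    "cat\tnoun\tcat\tcait\tcait\tcat",
    "sùil\tnoun\tsùil\tsùilean\tsùla\teye",
    "bàta\tnoun\tbàta\tbàtaichean\tbàta\tboat",
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicon "
    "(lemma TEXT, pos TEXT, nom_sg TEXT, nom_pl TEXT, gen_sg TEXT, gloss TEXT)"
)
conn.executemany(
    "INSERT INTO lexicon VALUES (?, ?, ?, ?, ?, ?)",
    [line.split("\t") for line in svf_lines],
)

# Count how many noun plurals end in each suffix -- the kind of pattern
# enumeration the authors run over the Wiktionary-derived data.
# Note the patterns overlap: 'bàtaichean' matches all three.
for suffix in ("aichean", "ean", "an"):
    n = conn.execute(
        "SELECT COUNT(*) FROM lexicon WHERE pos = 'noun' AND nom_pl LIKE ?",
        (f"%{suffix}",),
    ).fetchone()[0]
    print(suffix, n)
```

Once the lexicon is in a relational table, the same `LIKE`-based queries extend naturally to comparing dictionary frequencies against counts drawn from corpus text.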
Morphological rules are expressed declaratively: each rule consists of a condition (e.g., “stem ends with a broad vowel”) and a transformation (e.g., “apply the plural suffix –an, or –ean for slender harmony”). The rule set captures vowel‑harmony constraints, the choice between broad and slender endings, and the various mutation processes triggered by grammatical particles such as the definite article. A Python utility loads the rule base, instantiates objects representing each lemma’s principal parts, and generates all possible inflected forms on demand. For example, given the noun “cat” (cat) and the possessive particle “mo” (my), the system produces the lenited form “mo chat”. Prothesis is stripped during tokenisation and re‑added during generation, ensuring that the analysis and synthesis pipelines remain consistent.
The authors emphasize several advantages of this approach: (1) it requires far fewer training examples than neural models; (2) the rules are human‑readable, facilitating debugging and pedagogical explanation; (3) the computational footprint is tiny, allowing deployment on modest hardware or mobile devices; (4) the database‑backed analysis can reveal diachronic or genre‑specific trends in the usage of particular cases or mutations, informing the design of teaching materials that focus on high‑frequency patterns. They sum this up in the guiding maxim that “a rule is worth a thousand data points.”
Limitations are acknowledged. The current implementation handles only content words (nouns, verbs, adjectives) and ignores function words, which are essential for full syntactic parsing. Irregular verbs and rare exceptions must be added manually, and the system does not yet automate the discovery of such outliers. Orthographic variation (e.g., older acute‑accent conventions versus modern grave‑accent conventions) is noted but only partially normalised. Consequently, the model is not a complete lemmatizer or parser, but rather a lower‑level morphology stack that can be integrated into higher‑level tools.
Future work outlined includes automatic detection of irregular paradigms, extension of the rule base to cover prepositions and conjunctions, integration with a rule‑based dependency parser, and adaptation of the pipeline to other Goidelic languages such as Irish and Manx. The authors also suggest that the same methodology could be applied to typologically similar low‑resource languages that possess rich inflectional systems.
In conclusion, the paper demonstrates that a carefully engineered rule‑based system, built on publicly available lexical data, can provide a practical, interpretable, and resource‑efficient foundation for Gaelic language technology. It offers immediate utility for educational software, corpus analysis, and as a scaffolding layer for more sophisticated NLP components, while also charting a path toward broader multilingual low‑resource language support.