The Komi-Zyrian morphology and tools

The GitHub repository contains finite-state source files for the Komi-Zyrian language, used for building morphological analysers, proofing tools and dictionaries.

The Komi-Zyrian language

Komi-Zyrian belongs to the Permian branch of the Uralic language family. Its closest relatives are Komi-Permyak and Udmurt; the former appears to form the southern cluster of a pluricentric Komi language whose varieties are nearly mutually comprehensible. Although Komi shares many etymologically related words with Finnish and Hungarian, the correspondences are not obvious enough to recognize immediately, e.g., Komi “ki” = Finnish “käsi”, Hungarian “kéz” ‘hand; arm’; “vir” = “veri”, “vér” ‘blood’; “śin” = “silmä”, “szem” ‘eye’.

Komi-Zyrian is used in regularly published newspapers and journals, in readers, and in radio and television programs. A complete translation of the New Testament appeared in the early part of this millennium. Despite numerous changes in its orthography during the early part of the twentieth century, the (Zyrian) Komi language has established its own terminology in many fields, and its morphologically prolific character is well maintained. Jack Rueter began a finite-state description of the language in the mid-1990s, and much of the subsequent work has been done in collaboration with native-speaker researchers and others.

In 2004, an open-source description of about 6,000 lemmas entered what is today known as the GiellaLT infrastructure. There, Komi-Zyrian served the GiellaLT team as an example of a highly inflectional non-Saami language, and it is also where the Komi specialist Paula Kokkonen did Komi-Finnish dictionary work with funding from the Kone Foundation during the Kone Language Programme. In more recent years, Niko Partanen, a doctoral student working with Komi dialects, has also made notable contributions to the +Dialect lexicon.

The finite-state analyser

The analyser is at present quite extensive. There are 681 lexicons and approximately 319,760 lemma and stem pairs in the finite-state analyser. Of these, 94,663 come from a shared proper nouns file and 8,946 from a list of Komi-Zyrian proper nouns. When examining the breakdown of the open, practically unbounded part-of-speech classes, we find a discrepancy in the ratio of nominals to verbs. Whereas there are only 26,474 nouns and 20,815 adjectives in the rule-based description, there is an exceptionally large number of verbs, over 163,533. This imbalance appears to be directly related to the incorporation of HunSpell data sets intended to provide extensive coverage of the Komi verbal inflection system without derivation. In combination with the regular derivation in the description, the finite-state analysers have proven to supply relatively good coverage for spellers and for the analysis required by Universal Dependencies treebanks, such as those found in «UD v2.13» hosted on the Language Bank of Finland Korp server.
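
To make the imbalance concrete, the figures above can be turned into shares of the total. The following is only a back-of-the-envelope sketch: it assumes that the five counts are disjoint subsets of the 319,760 pairs (the text does not say so explicitly), and it treats the verb count, given above as a lower bound, as if it were exact.

```python
# Figures quoted in the paragraph above; the categories are ASSUMED to be disjoint.
TOTAL = 319_760
counts = {
    "shared proper nouns": 94_663,
    "Komi-Zyrian proper nouns": 8_946,
    "nouns": 26_474,
    "adjectives": 20_815,
    "verbs": 163_533,  # stated as "over 163,533", so a lower bound
}

# Share of each category in the whole lexicon, in percent.
shares = {name: 100 * n / TOTAL for name, n in counts.items()}

# How many verb entries there are for every noun entry.
verb_to_noun = counts["verbs"] / counts["nouns"]

for name, share in shares.items():
    print(f"{name:26s}{share:5.1f}%")
print(f"verbs per noun: {verb_to_noun:.1f}")
```

Under these assumptions verbs make up roughly half of all entries, with about six verb entries for every noun entry.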

Coverage for Komi-Zyrian in PaBiVUS

In anticipation of the forthcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analysers were evaluated against the words and word forms in the New Testament (NT); some books were represented by more than one version. The total number of tokens, i.e., words and punctuation marks, was 180,381, with 18,643 unique forms. At the outset, 443 unique forms were not recognized, and 89 of these occurred more than once.
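
The kind of count reported here can be sketched roughly as follows. This is a minimal illustration and not the script actually used for the evaluation: the function name `coverage_report`, the `analyse` callable and the analysis tags in the toy data are all hypothetical. The toy analyser merely stands in for a real one, which could for instance be a wrapper around the Komi-Zyrian analyser in UralicNLP.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def coverage_report(tokens: Iterable[str],
                    analyse: Callable[[str], List[str]]) -> Dict[str, int]:
    """Count tokens, unique forms, unique misses and non-hapax misses.

    `analyse` is any function that returns a (possibly empty) list of
    analyses for a word form; an empty list counts as a miss.
    """
    freq = Counter(tokens)
    misses = [form for form in freq if not analyse(form)]
    return {
        "total_tokens": sum(freq.values()),
        "unique_tokens": len(freq),
        "unique_misses": len(misses),
        # Missed forms that occur more than once, i.e. the lines that
        # precede the hapaxes in a frequency-sorted list of misses.
        "non_hapax_misses": sum(1 for form in misses if freq[form] > 1),
    }


# Toy stand-in for a real analyser (hypothetical tags): it knows three forms.
KNOWN = {"zon": ["zon+N+Sg+Nom"], "nyv": ["nyv+N+Sg+Nom"], ".": ["."]}


def toy_analyse(form: str) -> List[str]:
    return KNOWN.get(form, [])


report = coverage_report(["zon", "nyv", "xyz", "xyz", "abc", "."], toy_analyse)
print(report)
```

Keeping the analyser behind a plain callable means the same counting code can be reused with any backend that returns a list of readings.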

Komi-Zyrian materials

New Testament 2008, Gospels of Mark 1995, and John 1997
total tokens: 180,381
unique tokens: 18,643

Beginning time: 2024-04-15
unique misses: 443
number of lines before hapax: 89
Lacking unambiguous PoS: 632
Lacking unambiguous dependency: 12,713

Ending time: 2024-04-18
unique misses: 210
number of lines before hapax: 25
Lacking unambiguous PoS: 310
Lacking unambiguous dependency: 12,448

Size of lexicon.lexc: 319,760
Number of LEXICONs: 681
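
For orientation, the miss counts above can be converted into coverage over unique forms. This is simple arithmetic on the listed figures; the helper name `type_coverage` is introduced here only for illustration.

```python
UNIQUE_FORMS = 18_643  # "unique tokens" in the list above


def type_coverage(unique_misses: int, unique_forms: int = UNIQUE_FORMS) -> float:
    """Percentage of unique forms that received at least one analysis."""
    return 100 * (unique_forms - unique_misses) / unique_forms


before = type_coverage(443)  # beginning of the evaluation
after = type_coverage(210)   # end of the evaluation
print(f"{before:.2f}% -> {after:.2f}%")  # 97.62% -> 98.87%
```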

While evaluating the coverage, it was noted that the Biblical texts had low coverage for the associated proper nouns as well as for pair-verb and collective-noun constructions. While reducing the missing unique forms from 443 to 234, Rueter performed special morphological work to allow for parallel inflection in nouns and verbs. This meant utilizing special features of Helsinki Finite-State Technology (HFST) that allow morphology to be aligned regardless of adjacency or number of iterations, e.g., “zon[jas]-nyv[jas]”, approximately ‘boy[s]-n-girl[s]’, vs “boy-n-girl”. Naturally, in Komi-Zyrian we would expect large paradigms where grammatical categories are marked in tandem. Rules in the two-level model were also corrected for more consistent coverage.
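
The notion of marking in tandem can be illustrated with a toy example. The sketch below is written in plain Python and only mimics the expected surface forms; it is not the HFST mechanism used in the analyser. The helper names `pair_form` and `in_tandem` are hypothetical, and the only suffix considered is the plural marker “jas” from the example above.

```python
def pair_form(first: str, second: str, suffix: str = "") -> str:
    """Join two stems with a hyphen, attaching the same suffix to both."""
    return f"{first}{suffix}-{second}{suffix}"


def in_tandem(form: str, suffix: str) -> bool:
    """True if both hyphen-joined members end in `suffix`, or neither does."""
    left, _, right = form.partition("-")
    return left.endswith(suffix) == right.endswith(suffix)


singular = pair_form("zon", "nyv")       # zon-nyv
plural = pair_form("zon", "nyv", "jas")  # zonjas-nyvjas
print(singular, plural)

# A form marked on only one member is rejected by the tandem check.
print(in_tandem(plural, "jas"), in_tandem("zonjas-nyv", "jas"))
```

A real description has to enforce this agreement across full case, number and possessive paradigms, which is why the alignment is handled inside the transducer rather than by string checks like these.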

Improvements included the addition of missing words as well as paradigmatic enhancement:

Missing words
“śojtög-jutög” ‘without eating or drinking’ (paradigmatic enhancement)
Two-level rule work: «пышъясны» ‘to escape’, extending the context of the existing rule й:ъ
Indication of transitivity in pair verbs
Verbal abessive marked as a part of regular derivation: “vermyny” ‘to win’ >> “vermytom” ‘invincible’

Future work with the analysers

Initial work with disambiguation revealed that not all words with the nominal plural marker /jas/ are actually nouns. In fact, “mentioned” words and raised NP heads can also take nominal plural marking. Hence, the word /tadzi/ ‘in this way’ might also appear as a plural noun in text, as do the English and Finnish conjunctions “buts” in “no buts about it” and “muttia” in “ei mitään muttia”. Raised NP heads, in contrast, show what might be done in cases of NP head ellipsis: the number and case categories are inherited directly from the elided noun. In Mordvin studies, this phenomenon is referred to as secondary declension. A near equivalent in English might be seen in the construction “The King of England’s name is Charles III”, where the ‹’s› comes after the final element of the noun phrase rather than after the noun with which possession is associated. In Komi-Zyrian, the structure /as mu-yś-jas dor-as/ ‘own country-from-Plural near-at.theirs’ would translate to ‘At the [people] from one’s country’, where the word ‘people’ has been elided.

Although the coverage of the analyser has been improved, with only 25 non-hapax unique forms missing, there is still much more work to be done, particularly in morpho-syntactic analysis. This is evidenced by at least two figures. Word forms still lacking an unambiguous part-of-speech analysis number 310, a little over 1% of the total. In addition, 12,448 word forms, 7% of the total, still lack unambiguous dependency marking. Follow our progress on GiellaLT and in the UralicNLP Python, Java and .NET libraries.


