The Erzya morphology and tools

The GitHub repository contains finite-state source files for the Erzya language, used for building morphological analysers, proofing tools and dictionaries.

The Erzya language

Erzya belongs to the Mordvin branch of the Uralic language family, and its closest relative is Moksha. Earlier, the Mordvin languages were classified geographically as members of the Volga branch, which implied a closer relation to the Mari languages – Hill Mari and the larger Eastern and Meadow Mari. This classification, however, appears to reflect geographic proximity rather than genetic relationship. Etymologically, Erzya shares an abundance of cognates with the Finnic and Saamic languages, e.g. (Erzya = Finnish = North Saami), “ked́” = “käsi” = “giehta” ‘hand; arm’; “veŕ” = “veri” = “varra” ‘blood’; “śeĺme” = “silmä” = “čalbmi” ‘eye’; “kuz” = “kuusi” = “guossa” ‘spruce’; “koto” = “kuusi” = “guhtta” ‘six’.

Erzya literature spans many genres, which can be accessed through regular issues of newspapers, journals and readers as well as radio and television news programs. Hence, the open-source, finite-state description of Erzya, begun by Jack Rueter at the end of the 1990s with the help of Kimmo Koskenniemi, has a relatively broad base. The lexicon draws on work by Mixail Mosin, Jaana Niemi, Alho Alhoniemi, Nina Agafonova, Kuzʹma Abramov, Raisa Buzakova, Martti Kahla and Evgenij Četvergov, to name but a few. All of the documentation was moved to the forerunner of the GiellaLT infrastructure in about 2006. In 2020–2022, Jorma Luutonen, Sirkka Saarinen and the helpful people at the University of Turku provided ample opportunity for work on lemmatization (2020) and hands-on work with Constraint Grammar dependencies (2021–2022).

The Erzya-language materials upcoming in the next version of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2) represent a new genre for Erzya. They will include the New Testament published in 2006, some test translations from the 1990s, the Gospel of 1910 as well as Gospel texts from 1821, and more recent Old Testament books from the 2010s. Because the texts span two hundred years, some of the words will not be recognized, owing to changes in orthography or lexicon.

The finite-state analyzer

The Erzya analyzer provides coverage for an extensive morphology in both verbs and nominals. There are approximately 1,370 continuation lexica for 176,832 lemma-stem pairs. In addition to a shared set of over 94,000 proper nouns, the lexicon contains approximately 24,000 common nouns, 16,000 verbs and over 31,000 adjectives. The adjectives exhibit an abundance of Russian cognates, which points to a marked effort to document these kinds of words – thanks to people such as Marina Fedina and Enye Lav, who initiated this kind of adjective collection in the 2010s. The size of the lexical network might, on the one hand, be attributed to strategies of NP head ellipsis known as secondary declension, where modifiers take case, number and definiteness marking. On the other hand, the number of continuation lexica might also be attributed to ongoing work in vowel-harmony validation: it is one thing to describe perfectly written word forms, but quite another to describe systematic misspellings as well. The coverage of the analyzer can be observed in Korp materials at the Language Bank of Finland, i.a. «UD v2.13», «ERME v2», «Uspenskij 4 battles».

Coverage for Erzya in PaBiVUS

In preparation for the upcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analyzers were evaluated against the words and word forms in books of the New Testament (NT) from different centuries, in addition to some newer translations of books from the Old Testament. All in all, there were 311,957 word forms and 35,077 unique tokens, of which 4,535 unique word forms were missing from the analyzer. Of these missing unique forms, 1,093 appeared more than once. A further 8,625 words were ambiguous for part-of-speech tagging, and 28,961 tokens had ambiguous dependency tagging.
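The counts reported here (total word forms, unique forms, unique misses, and misses that occur more than once) could be gathered along the following lines. This is a minimal sketch, not the evaluation script actually used: the function name `coverage_report`, the dictionary keys, the toy analyzer and its placeholder forms and tags are all assumptions made for illustration. Any real analyzer that returns an empty result for unknown forms could be passed in as `analyze`.

```python
from collections import Counter
from typing import Callable, Iterable, List


def coverage_report(tokens: Iterable[str],
                    analyze: Callable[[str], List[str]]) -> dict:
    """Count word forms, unique forms, and unique forms the analyzer misses."""
    counts = Counter(tokens)
    # A form counts as "missing" when the analyzer returns no analysis for it.
    missing = {form: n for form, n in counts.items() if not analyze(form)}
    return {
        "total_words": sum(counts.values()),
        "unique_forms": len(counts),
        "unique_misses": len(missing),
        # Missing forms seen more than once, i.e. excluding hapax legomena.
        "non_hapax_misses": sum(1 for n in missing.values() if n > 1),
    }


# Hypothetical stand-in for a real analyzer: it "knows" two placeholder forms.
# The forms and tag strings are invented, not Erzya data.
KNOWN = {"aa": ["aa+N+Sg+Nom"], "bb": ["bb+V+Inf"]}


def toy_analyze(form: str) -> List[str]:
    return KNOWN.get(form, [])


report = coverage_report(["aa", "bb", "aa", "xx", "xx", "yy"], toy_analyze)
print(report)
```

Passing the analyzer in as a function keeps the counting logic independent of any particular toolkit, so the same report can be produced before and after a round of lexicon work.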

Erzya materials

New Testament 2006; Gospel of Matthew 1821; Gospels 1910; test translations Mark 1995, Luke 1996, Acts 1996, Matthew 1998, Psalms 2011, Ruth, Ecclesiastes 2020, Songs 2020, Jonah 2020.
total tokens: 397,941
total words: 311,957
total characters (from words): 3,981,520
unique tokens: 35,077
date of attestation: 2024-05-04
unique misses: 4,535
number of lines before hapax (missing unique forms occurring more than once): 1,093
Lacking unambiguous PoS: 8,625
Lacking unambiguous dependency: 28,961
Size of lexicon.lexc: 176,832
Number of LEXICONs: 1370

Observations

In addition to missing proper nouns, a second reason why words go unrecognized is change in orthography. First, there is the unstandardized spelling, which is rampant in the 1821 text. The 1910 translation of the Gospel follows a standard slightly different from that of today but quite consistent with the orthography of the early 1880s, fifty years before a standard was developed in the Soviet Union.

Initial work with the non-standard Erzya from 1821 began by simply listing the unrecognized word forms. It soon became apparent that developing normalization practices would be more time-efficient. Peculiarities of the text include, but are not limited to, the use of the hard sign ‹ъ› word-finally, the soft sign ‹ь› after non-alveolar consonants, and the use of ‹я› to indicate the /æ/ sound of the Southeastern/Sura dialect. This vowel is cognate to the first-syllable vowels ‹i, e, y› of Finnish, ‹a› of North Saami and ‹õ› of Skolt Saami: /sæĺ/ = Finnish “syli” = North Saami “salla” = Skolt Saami “sõll” ‘fathom; embrace’.
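One of the peculiarities named above, the word-final hard sign, lends itself to a simple rule-based normalization. The sketch below is only an illustration of that idea under stated assumptions: the function `normalize` and the single rule it applies are hypothetical, the input strings are constructed letter sequences rather than attested 1821 forms, and the actual normalization practices for the corpus may differ considerably.

```python
import re

# Hypothetical rule: delete a hard sign that ends a word.
# In Python 3, \b is Unicode-aware, so Cyrillic letters count as word characters.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize(text: str) -> str:
    """Return text with word-final hard signs removed; other letters are kept."""
    return FINAL_HARD_SIGN.sub("", text)


# Constructed strings, not real corpus data.
original = "абвъ гдъ."
normalized = normalize(original)

# Keep the original next to the normalized form so neither is lost.
pair = (original, normalized)
print(pair)
```

A hard sign inside a word is left untouched, because the pattern requires a word boundary immediately after it.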

With the progress of enhancement work, the number of missing analyses began to drop, but, as noted above, additional work will be done on normalization. Normalization will mean that word forms can be searched for in Korp using standard-language forms, while the original text remains available through data in the margin.

Ending: 2024-05-30
unique misses: 4,060
number of lines before hapax: 826
Lacking unambiguous PoS: 6,620
Lacking unambiguous dependency: 27,161
Size of lexicon.lexc: 177,106
Number of LEXICONs: 1370
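The change between the 2024-05-04 and 2024-05-30 figures can be checked with a few lines of arithmetic. The sketch below uses only numbers reported in this article. The variable and key names are chosen here for illustration, and taking the shares relative to the 311,957 total words is an assumption about the intended denominator.

```python
TOTAL_WORDS = 311_957

# Figures reported for 2024-05-04 and 2024-05-30.
before = {
    "unique_misses": 4_535,
    "non_hapax_misses": 1_093,
    "lacking_unambiguous_pos": 8_625,
    "lacking_unambiguous_dep": 28_961,
    "lexc_size": 176_832,
}
after = {
    "unique_misses": 4_060,
    "non_hapax_misses": 826,
    "lacking_unambiguous_pos": 6_620,
    "lacking_unambiguous_dep": 27_161,
    "lexc_size": 177_106,
}

# Absolute change per figure (negative means fewer problem cases).
change = {key: after[key] - before[key] for key in before}
for key, delta in change.items():
    print(f"{key}: {before[key]} -> {after[key]} ({delta:+d})")

# Shares of the total word count at the later date.
share_misses = after["unique_misses"] / TOTAL_WORDS
share_dep = after["lacking_unambiguous_dep"] / TOTAL_WORDS
print(f"unique misses: {share_misses:.1%} of total words")
print(f"lacking unambiguous dependency: {share_dep:.1%} of total words")
```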

Future work with the analyzers

Although the coverage of the analyzer has been improved, with only 826 non-hapax unique forms missing, there is still much more work to be done. Coverage for non-normative morphology remains an apparent issue, as at least two figures indicate. Unique word forms still lacking recognition number 4,060, a little over 1.3% of the total. In addition, 27,161 word forms, about 8.7% of the total, still lack unambiguous dependency marking. Furthermore, the number of continuation lexicons, at 1,370, is nearly twice that of Komi-Zyrian, so there may be reason to reduce the network by restructuring the analyzer. Follow our progress on GiellaLT and in the UralicNLP Python, Java and .NET libraries.
