The Olonets-Karelian morphology and tools

This GitHub repository contains finite-state source files for the Olonets-Karelian language, used for building morphological analysers, proofing tools and dictionaries.

The Olonets-Karelian language

Olonets-Karelian, also known as Livvi, belongs to the Balto-Finnic branch of the Uralic language family. Its closest ties are with the Veps and Ludic languages, on the one side, and Karelian Proper, on the other. As a language form, Olonets-Karelian clearly shares morpho-syntactic features with Veps, e.g., its case marking often runs elatives together with inessives, and ablatives together with adessives. In Livvi, “Suures mečäs” ‘in a/the big forest’ is contrasted with “Suures mečäspäi” ‘from a/the big forest’, whereas Karelian Proper is very close to Finnish, with “Suurešša mečäššä” ‘in a/the big forest’ contrasted with “Suuresta mečästä” ‘from a/the big forest’. The latter shows congruence in case marking, which is not present in some cases of Olonets-Karelian. From a phonological perspective, however, Olonets-Karelian shares the features of gradation as well as the dichotomies of vowel and consonant length and quality with Karelian Proper, Finnish and Estonian. Lexically, of course, Olonets-Karelian exhibits an abundance of Russian loanwords, as do the other minority languages of Karelia.

The Olonets-Karelian language is actively used in news media in both Karelia and Finland. This includes issues of the Oma Mua newspaper in Karelia, and YLE Uudizet karjalakse ‘News in Karelian’ as text and podcasts on the Internet. Work on the open-source, finite-state description of Olonets-Karelian was begun only in 2012 by Timo Rantakaulio and Jack Rueter under the auspices of the “Kone Foundation Language Programme”. Here Rantakaulio contributed both to the extension of lexical work by various compilers, such as G. N. Makarov, Martti Penttonen, «Jougi» and Jaan Õispuu, and to the location of extensive paradigms for better understanding the inflection of verbs and nominals. On the basis of this preparatory work and in collaboration, Rueter designed an open-source analyzer, which addressed not only the preferred spelling of words but also some possible misspellings. In mid-2014, collaboration with writers from the «Oma Mua» newspaper provided an evaluation of the finite-state description of the language and invaluable advice on how to improve it.

In more recent years, work has been done with the lexicon to promote Pan-Karelian language use. The Olonets-Karelian analyzer lexicon has been used in the development of machine translation between Karelian Proper, Olonets-Karelian and Finnish at Apertium (see GitHub apertium-krl-olo and GitHub apertium-fin-olo), in work with Timo Rantakaulio, Flammie Pirinen and Jack Rueter (Google Summer of Code 2021). Click-in-text dictionaries for Livvi are also available at Giellatekno. The Olonets-Karelian-language materials upcoming in the next version of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2) represent a lesser-documented genre of Livvi. The materials will include the New Testament published in 2003.

The finite-state analyzer

The Olonets-Karelian analyzer provides coverage for an extensive morphology in both verbs and nominals.

There are approximately 1,591 continuation lexica for 63,079 lemma-stem pairs. This figure includes a set of proper names exceeding 32,100, approximately 13,200 common nouns, 4,500 verbs and over 3,000 adjectives. As in many other descriptions in the Giella infrastructure (e.g. Skolt Saami, North Saami and Inari Saami), compound word description allows for analysis at the individual constituent level. The coverage of the analyzer can be observed in Korp materials at the Language Bank of Finland, i.a. «UD v2.13», «PaBiVUS».
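Counts of this kind can be sketched in a few lines of Python. The example below is only an illustration: it assumes that the figures correspond to LEXICON declarations and to lemma:stem entries in a lexc source file, and the sample fragment, including its lexicon names and entries, is invented here and not taken from the repository.

```python
import re

# Hypothetical lexc fragment: lexicon names and entries are invented
# for illustration and are not copied from the actual source files.
SAMPLE_LEXC = """\
LEXICON Root
Nouns ;
Verbs ;

LEXICON Nouns
! a comment line is ignored
kala:kal N_KALA ;
meččy:meč N_MECCY ;

LEXICON Verbs
andua:an V_ANDUA ;
"""


def lexc_counts(text):
    """Return (LEXICON declarations, lemma:stem entries) found in lexc text.

    Simplification: everything after '!' is treated as a comment, so an
    escaped '%!' inside an entry would be cut off by this sketch.
    """
    lexicons = 0
    pairs = 0
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if re.match(r"LEXICON\s+\S+", line):
            lexicons += 1
        elif line.endswith(";") and ":" in line.split()[0]:
            # An entry whose first field has the shape lemma:stem
            pairs += 1
    return lexicons, pairs


print(lexc_counts(SAMPLE_LEXC))
```

Bare continuation lines such as "Nouns ;" are deliberately not counted as lemma-stem pairs, since they carry no lemma. How the reported figures were actually obtained is not stated here, so the counting rule above is an assumption.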

Coverage for Olonets-Karelian in PaBiVUS

In preparation for the upcoming publication of Parallel Biblical Verses for Uralic Studies (PaBiVUS v2), the analyzers were evaluated against the words and word forms in the books of the New Testament (NT). The text contains 134,493 word forms in all, with 15,959 unique tokens, of which 3,359 were unique word forms missing from the analyzer. Of these missing forms, 1,225 appeared more than once. In addition, 2,727 words were ambiguous for part-of-speech tagging, and 6,010 tokens had ambiguous dependency tagging.
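The figures above allow a rough type-level coverage estimate. The following Python sketch uses only the reported numbers; the function and constant names are introduced here for illustration. Token-level coverage cannot be derived this way, because the number of running tokens affected by the missing forms is not reported.

```python
# Figures as reported for the New Testament (2003) test data.
UNIQUE_TOKENS = 15_959        # unique tokens in the text
UNIQUE_MISSES = 3_359         # unique word forms without an analysis
REPEATED_MISSES = 1_225       # missing forms that occur more than once


def type_coverage(unique_tokens, unique_misses):
    """Share of unique tokens that receive at least one analysis."""
    return (unique_tokens - unique_misses) / unique_tokens


def hapax_misses(unique_misses, repeated_misses):
    """Missing forms that occur exactly once in the text."""
    return unique_misses - repeated_misses


coverage = type_coverage(UNIQUE_TOKENS, UNIQUE_MISSES)
hapaxes = hapax_misses(UNIQUE_MISSES, REPEATED_MISSES)

print(f"type coverage: {coverage:.2%}")
print(f"missing forms occurring once: {hapaxes}")
```

On these figures, roughly 79% of the unique tokens are analysed, and 2,134 of the 3,359 missing forms occur only once.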

Olonets-Karelian materials

Olonets-Karelian (olo)
Test data
New Testament 2003:
total tokens: 168,235
total words: 134,493
total characters (from words): 931,383
unique tokens: 15,959
Beginning: (2024-08-14)
unique misses: 3,359
number of lines before hapax: 1,225
Lacking unambiguous PoS: 2,727
Lacking unambiguous dependency: 6,010
Size of lexicon.lexc: 63,079
Number of LEXICONs: 1,591

Future work with the analyzers

There are some shortcomings in the lexical and inflectional coverage of the analyzer. The texts of the New Testament contain numerous proper nouns not present in previous texts. The description of ordinal numerals requires further documentation, as does the general inflection of both nominals and verbs. The verb description needs work on passive forms as well as verbal adjectives and adverbs.

Upcoming work on the coverage of the analyzer should be integrated with the use and development of the language in the New Media. This could, conceivably, be achieved through work with students of the language. Follow the development of the Olonets-Karelian analyzer on Giella and UralicNLP.

