
The Karelian morphology and tools

The GitHub repository contains finite state source files for the Karelian language, for building morphological analysers, proofing tools and dictionaries.

The Karelian language

Karelian is a language form with three different literary norms. According to some, these can readily be divided into Karelian Proper, on the one hand, and Olonets-Karelian, on the other. Karelian Proper can be further divided into Dvina-Karelian, one of the language forms published in the newspaper «Oma mua», and Tver-Karelian, which was originally published in Tver Oblastʹ, approximately halfway between St Petersburg and Moscow. Any further division of Karelian into smaller groups amounts to localization, reflecting preferred word choice and morphology. All divisions of Karelian Proper, to some extent, both support and hinder the development of a single Karelian-language community. This article deals with the Dvina-Karelian variant of Karelian Proper, materials for which can be found through Karelian and language contacts, the ‘Karelian Union Library’ and the VepKar corpora. A Wikipedia for Karelian is under construction. In addition to the 2011 New Testament in Dvina-Karelian, an earlier test version of the Gospel of Mark appeared in 1996.

The finite-state description of the language in the GiellaLT infrastructure was originally developed by Flammie Pirinen, the author of OMorFi, the open-source morphological description of Finnish. The objective of this undertaking was to facilitate machine translation between the Karelian variants and Finnish. Dvina-Karelian resembles Olonets-Karelian in its extensive Slavic vocabulary and in its ‹ua› and ‹iä› diphthongs, which correspond to ‹aa› and ‹ää›, respectively, in literary Finnish. Like Finnish and unlike Olonets-Karelian, however, Dvina-Karelian distinguishes between the locative and departure cases. Dvina-Karelian distinguishes the inessive ‹šuurešša pereheššä› ‘in the big family’ from the elative ‹šuurešta pereheštä› ‘from the big family’ in both the adjective and the noun, whereas Olonets-Karelian makes this distinction only on the head of the noun phrase: ‹suures perehes› ‘in the big family’ vs ‹suures perehespäi› ‘from the big family’.
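The case-marking contrast described above can be made concrete with a toy sketch. The suffix rules below are read off the four example phrases in the text only; the function names are hypothetical, and this is not how the GiellaLT finite-state analyser works internally.

```python
# Toy illustration: suffix rules inferred solely from the four example
# phrases in the text. All names here are hypothetical.

# Dvina-Karelian: inessive and elative endings appear on every word.
DVINA_SUFFIXES = {
    "šša": "inessive", "ššä": "inessive",
    "šta": "elative", "štä": "elative",
}

def tag_dvina(word):
    """Return the case suggested by a Dvina-Karelian ending."""
    for suffix, case in DVINA_SUFFIXES.items():
        if word.endswith(suffix):
            return case
    return "unknown"

def tag_olonets(word):
    """Return the case suggested by an Olonets-Karelian ending."""
    # Check the longer ending first: "-späi" marks the elative.
    if word.endswith("späi"):
        return "elative"
    # A bare "-s" form; on modifiers it is shared by both readings.
    if word.endswith("s"):
        return "inessive"
    return "unknown"

def tag_phrase(phrase, tagger):
    """Tag each whitespace-separated word of a phrase."""
    return [(word, tagger(word)) for word in phrase.split()]

if __name__ == "__main__":
    print(tag_phrase("šuurešša pereheššä", tag_dvina))
    print(tag_phrase("šuurešta pereheštä", tag_dvina))
    print(tag_phrase("suures perehes", tag_olonets))
    print(tag_phrase("suures perehespäi", tag_olonets))
```

In the Dvina-Karelian elative phrase both words receive the elative tag, while in the Olonets-Karelian one the adjective keeps its ‹-s› form and only the head noun carries ‹-späi›.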

Karelian materials

Dvina-Karelian (krl)
New Testament 2011
total tokens: 175,863
total words: 139,767
total characters (from words): 807,821
unique words: 16,940
Beginning (2024-09-06):
unique misses: 13,643
number of lines before hapax: 6,163
Lacking unambiguous PoS: 15,234
Lacking unambiguous dependency: 16,652
Size of lexicon.lexc: 1,974
Number of LEXICONs: 651

In September 2024, the model for Dvina-Karelian was minimal. Its lexicon comprised 1,974 lexical items, of which 339 were nouns, 181 were verbs and 43 were adjectives. This is far fewer than the incoming lexical material: a large number of nouns (19,675), verbs (15,186) and adjectives (7,431) simply have not yet been aligned with paradigms. This can be taken as preparation for analyzer development, to be documented with upcoming corpus attestation. The Dvina-Karelian New Testament from 2011 will be annotated to the extent possible in the upcoming Parallel Biblical Verses for Uralic Studies (PaBiVUS, version 2) through the Language Bank of Finland.
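The figures above can be put in proportion with a few lines of arithmetic. The sketch below is illustrative only and assumes that "unique misses" counts unique word forms that received no analysis; the variable and function names are hypothetical.

```python
# Figures copied from the statistics and the paragraph above.
unique_words = 16_940
unique_misses = 13_643  # ASSUMPTION: unique word forms with no analysis
lexicon_size = 1_974

aligned = {"nouns": 339, "verbs": 181, "adjectives": 43}
backlog = {"nouns": 19_675, "verbs": 15_186, "adjectives": 7_431}

def type_coverage(total, misses):
    """Share of unique word forms that were not counted as misses."""
    return (total - misses) / total

coverage = type_coverage(unique_words, unique_misses)
aligned_total = sum(aligned.values())
backlog_total = sum(backlog.values())
# Lexical items that are not among the three listed word classes.
other_entries = lexicon_size - aligned_total

if __name__ == "__main__":
    print(f"type coverage (under the assumption above): {coverage:.1%}")
    print(f"nouns, verbs and adjectives in the lexicon: {aligned_total}")
    print(f"other lexical items: {other_entries}")
    print(f"items awaiting paradigm alignment: {backlog_total}")
```

Under that assumption, roughly one unique word form in five in the New Testament was recognised at the starting point, and the unaligned backlog is many times larger than the lexicon itself.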

Follow our progress on GiellaLT and in the UralicNLP Python, Java and .NET libraries.
