The Northern Mansi morphology and tools

The GitHub repository contains finite-state source files for the Mansi language, used for building morphological analysers, proofing tools and dictionaries.

Mansi belongs to the Ob-Ugric branch of the Uralic language family. Its closest relatives are Khanty and Hungarian, with which it shares more lexical, morphological and syntactic features than it does with the Balto-Finnic, Saami, Permian, Mari and Mordvin languages. Today, Northern Mansi is the only surviving variety of the numerous Mansi language forms recorded by Hungarian, Finnish and other foreign scholars in the 19th and early 20th centuries. The online newspaper «Luima Seripos» is published regularly and is supplemented by occasional books, readers and, as used here, a translation of the Gospel of Mark from the year 2000.

Northern Mansi has undergone a change in orthography since the beginning of this millennium. Therefore, the finite-state description developed for the language in the GiellaLT infrastructure has focused on the language of the predominant news media. This has meant, on the one hand, allowing a large variety of spellings to cover alternation in vowel-length marking and, on the other, filtering to accommodate the earlier spelling principles found in publications only twenty-some years old.
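The effect of variable vowel-length marking can be illustrated with a small sketch. It assumes that length is written with a combining macron (U+0304) over the vowel, and it reduces two spellings to a common length-neutral key for comparison. This is only an illustration of the spelling alternation, not the method used in the finite-state source files; the function names are hypothetical.

```python
import unicodedata

# Assumption: vowel length is marked with a combining macron (U+0304).
COMBINING_MACRON = "\u0304"


def length_neutral_key(word: str) -> str:
    """Return the word with macrons removed, so that spellings differing
    only in vowel-length marking map to the same key."""
    # NFD splits precomposed letters (e.g. U+04E3) into base letter + mark.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = decomposed.replace(COMBINING_MACRON, "")
    # NFC recomposes the remaining marks, e.g. the breve of й.
    return unicodedata.normalize("NFC", stripped)


def same_word_ignoring_length(a: str, b: str) -> bool:
    """Compare two spellings while ignoring vowel-length marking."""
    return length_neutral_key(a) == length_neutral_key(b)


# Spelling with and without a macron on the first vowel.
print(same_word_ignoring_length("ма\u0304ньси", "маньси"))  # True
```

A key of this kind is useful for comparing word lists across the two spelling practices; it deliberately discards information that an analyser would normally keep.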

Mansi (mns)
Test data
MRK 2000
total tokens: 13,426
unique tokens: 3,125
unique misses: 908
number of lines before hapax: 195

Ending: (20240412)
unique misses: 615
number of lines before hapax: 73

Lacking unambiguous PoS: 778
Lacking unambiguous dependency: 5889
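
The miss counts above can be turned into a rough coverage figure. The sketch below assumes that "unique misses" counts the unique tokens for which the analyser returned no analysis, and that the first set of figures describes the state before the work period that ended on 20240412; the function name is hypothetical.

```python
def type_coverage(unique_tokens: int, unique_misses: int) -> float:
    """Percentage of unique tokens that received at least one analysis.

    Assumption: a "miss" is a unique token with no analysis.
    """
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return 100.0 * (unique_tokens - unique_misses) / unique_tokens


UNIQUE_TOKENS = 3125  # unique tokens in MRK 2000

before = type_coverage(UNIQUE_TOKENS, 908)  # first set of figures
after = type_coverage(UNIQUE_TOKENS, 615)   # figures dated 20240412

print(f"coverage before: {before:.1f}%")  # coverage before: 70.9%
print(f"coverage after:  {after:.1f}%")   # coverage after:  80.3%
```

Under these assumptions, the share of unique tokens with an analysis rose from about 70.9% to about 80.3%.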

During a week-long period in April 2024, the model reached a size of 103,937 lexical items, i.e., 9,314 Mansi words and 94,623 proper nouns shared with other languages written in Cyrillic. The lemmatization was improved by adding about 100 new words to the lexicon and adding new vowel-length variants to verb paradigms in consultation with Csilla Horvát (a Mansi scholar at the University of Helsinki) and Trond Trosterud (head of Giellatekno).

The description of Mansi is an ongoing project, so these figures provide only a snapshot from the spring of 2024, measured against Biblical texts that will soon be made available in a new release of Parallel Biblical Verses for Uralic Studies (PaBiVUS) through the Language Bank of Finland.

