Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jack Rueter tells us about his research on morpho-syntactic description of minority languages.
I am Jack Rueter, a principal investigator in Digital Humanities at the University of Helsinki and a Project Researcher in Finnish and Finno-Ugric Languages at the University of Turku. I work with the contextual disambiguation of corpora that are annotated both manually and with rule-based systems. At the age of seventeen, I spoke my first words of Finnish, and since then I have endeavored to acquire a working knowledge of several other languages.
During my studies and subsequent research of Uralic and other minority languages, I have gradually expanded my understanding of how language-technological tools and practices can enhance fundamental work in linguistics. Although I began my first finite-state description of Komi-Zyrian a quarter of a century ago, and followed it with parallel and corpus work for the Erzya language at the beginning of this millennium, it is the last decade that has seen ambitious collaboration in the description of languages in several branches of the Uralic language family and beyond. These descriptions have centered on the study of lexica, rich yet regular morphology, syntax and the idea that useful language documentation might be facilitated through the development of tools and learning environments for multilingual application.
My work with the Komi-Zyrian language began while taking a course at the University of Helsinki in the early nineties. Our teacher, E. Cypanov, offered us lessons based on materials he had written in Russian. No Komi-Finnish or Komi-English dictionaries were available at the time, so I undertook the translation of his glossary into a small trilingual Komi-English-Finnish word list, which I was able to proofread and expand with a scholarship from the Alfred Kordelin Foundation. At the time, such word lists were seen as a fundamental starting point for finite-state descriptions, and so, in 1995, I was able to begin modeling a finite-state description for Komi-Zyrian on a Unix system with advice from Professor Kimmo Koskenniemi.
From 1996 until 2004, I spent a large part of my time among the Komi, the Erzya and the Moksha. During this time, I taught Finnish at the Mordovian State University in Saransk, Mordovia – about 600 kilometers east-southeast of Moscow. There, in addition to language instruction, I began collecting and digitizing Mordvin language literature, learning the two literary languages and developing relations with professional language users and native speakers. These personal contacts have contributed to my knowledge of the languages and provided me with native-language descriptions of them, which are essential to their adequate documentation. This was also a time to become familiar with other languages spoken in Russia as well as to foster affiliations with language research at the Universities of Turku and Tromsø.
Upon leaving my teaching position in Saransk, I immediately became involved in work with the open-source infrastructure Giellatekno in Tromsø. Trond Trosterud and his colleagues were interested in my work with Komi and wanted to include it in their Barents and circumpolar language-technology development. Needless to say, I acquiesced, and open-source Komi became another piece of the puzzle for extensive dictionary and morphology work in my collaboration from Helsinki, where I began my postgraduate studies. Language technology definitely played a strong role in the categorization of morphological phenomena in the Erzya language, a forerunner to what I documented in my dissertation in 2010 and what I would greatly expand upon in subsequent work funded by the Kone Foundation under the auspices of its «Language Programme» (2012–2021).
The Language Programme saw extensive pilots and projects for digitizing endangered materials from the 1920s–40s for Finnish kindred languages in Fenno-Ugrica at the National Library of Finland. Preparation for and continued work with these materials helped pave the way to extensive work with lexica and morphology in Olonets-Karelian, Livonian, Hill Mari, Moksha and Tundra Nenets. The success of these projects, of course, was due largely to the team of language specialists involved and to previous documentation work done on the languages. As open-source projects, the language documentation projects also made use of open Helsinki Finite-State Technology (HFST) and the open infrastructure for Saami language-technology research (Giellatekno) and tool implementation (Divvun) in Tromsø, Norway (Giella). It was experience with these technologies that I applied to other minority languages, such as Ingrian, Skolt Saami, Meadow Mari, Udmurt, Võro, Komi-Permyak, Mansi, and even Apurinã on the Amazon and Lushootseed in the Pacific Northwest. The resulting tools were online morphology-savvy dictionaries, e.g. for Olonets-Karelian, Skolt Saami, Erzya and Moksha, and intelligent computer-assisted language learning (ICALL), such as Nuõrti for Skolt Saami, which follows the lead of Davvi, the ICALL for Northern Saami. The tools also included something for everyday writing: spell checkers at Divvun.
Lexicon and morphology only really make sense if you can apply them to a broader usage: syntax and meaningful usage, for example, translation. Thanks to Anssi Yli-Jyrä, I became involved in the Universal Dependencies project in the late 2010s. It was here that I debuted with a treebank for Erzya, and subsequently continued with work on Moksha, Komi-Zyrian, Komi-Permyak, Skolt Saami and Apurinã with meaningful collaboration from Helsinki, Turku, Oulu, Saransk, Syktyvkar, Tromsø, Tartu, Göttingen, Belém and Bloomington. Work with treebanks can, on the one hand, be considered a means of making language documentation available to multiple user types, and, on the other hand, it serves as an open repository for development in Constraint Grammar disambiguation, function and dependency work after morphological analysis. The pursuit of meaningful morphosyntax has, in turn, led me to Apertium and shallow-transfer translation modeling for closely related languages.
Apertium started out with translation between Catalan and Spanish, two closely related language forms. This initially involved conversion of the lexicon from source to target, the subsequent transfer of morphological information, and finally an adaptation of the resulting source syntax to target syntax and idioms. The idea of being able to translate between closely related languages on the basis of the shallow transfer of regular morphological categories and information describes a tool that, in addition to facilitating informative reference translation, might also be used in measuring the distance between language forms through documented lexical, morphological, syntactic and idiomatic convertibility. The development of shallow-transfer tools for the triangle of (Northern Dvina) Karelian, Olonets-Karelian and Finnish, for example, has led to dictionary development that correlates with finite-state morphology in the Giella infrastructure, applied at Akusanat and in Google Summer of Code through Apertium. Upcoming language pairs might include work with the Mordvin languages Erzya and Moksha, which have recently enjoyed a lot of support through work in the Digilang project at the University of Turku.
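The three stages described above (lexicon conversion, transfer of morphological information, and structural adaptation) can be illustrated with a minimal Python sketch. This is not Apertium's actual implementation or rule format: the `Token` type, the function names, the tag labels and the tiny Spanish-to-Catalan dictionary are all assumptions made for illustration, and the output is a sequence of target lemmas with tags rather than generated surface forms.

```python
from typing import NamedTuple


class Token(NamedTuple):
    lemma: str
    tags: tuple


# Hypothetical toy resources for a Spanish -> Catalan pair (illustration only).
BIDIX = {"mi": "meu", "perro": "gos"}   # source lemma -> target lemma
TAG_MAP = {"poss": "pos"}               # source tag name -> target tag name
AGREEMENT_TAGS = {"m", "f", "sg", "pl"}


def lexical_transfer(tokens, bidix):
    """Stage 1: replace each source lemma; unknown lemmas are marked with '*'."""
    return [Token(bidix.get(t.lemma, "*" + t.lemma), t.tags) for t in tokens]


def tag_transfer(tokens, tag_map):
    """Stage 2: carry morphological tags over, renaming where the tag sets differ."""
    return [Token(t.lemma, tuple(tag_map.get(tag, tag) for tag in t.tags))
            for t in tokens]


def structural_transfer(tokens):
    """Stage 3: one toy structural rule. A possessive directly before a noun
    receives a preceding definite article that copies the noun's gender and
    number tags (compare Spanish 'mi perro' with Catalan 'el meu gos')."""
    out = []
    for i, tok in enumerate(tokens):
        has_next_noun = i + 1 < len(tokens) and "n" in tokens[i + 1].tags
        if "pos" in tok.tags and has_next_noun:
            agreement = tuple(tag for tag in tokens[i + 1].tags
                              if tag in AGREEMENT_TAGS)
            out.append(Token("el", ("det", "def") + agreement))
        out.append(tok)
    return out


def translate(tokens):
    """Run the three stages in order on an analysed source sentence."""
    return structural_transfer(tag_transfer(lexical_transfer(tokens, BIDIX), TAG_MAP))


source = [Token("mi", ("det", "poss")), Token("perro", ("n", "m", "sg"))]
for token in translate(source):
    print(token.lemma, token.tags)
```

Because each stage only consults a dictionary or a local pattern, the proportion of lemmas, tags and structures that pass through unchanged gives a rough, inspectable picture of how convertible two language forms are, which is the measuring idea mentioned above.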
At the end of the last millennium, I began collecting Moksha, Erzya and Komi literature with releases from the authors and publishers for compilation and research study in the University of Helsinki Language Corpus Server (UHLCS), which has since been incorporated into the Language Bank of Finland materials at Kielipankki. FIN-CLARIN has provided me with time and resources for validating older UHLCS materials and for coaching in the development of newer corpora and educational materials. This has meant that I have had the opportunity to bring my own ERME materials for Erzya and Moksha to the Korp server, as well as Pabivus, a corpus of parallel Biblical verses in Uralic languages compiled with Erik Axelson (thanks to the Bible Translation Institute). At present, work is underway to introduce Universal Dependencies corpora of Finno-Ugric languages to the Korp server. Hopefully, my work in Mordvin syntax at the University of Turku will soon also contribute to the quality of the minority-language corpora at Kielipankki. More accurate morphological analysis with rule-based, contextually derived syntactic readings helps bring speech-to-text and text-to-speech technology closer to less-documented minority languages.
Rueter, J., Partanen, N., Hämäläinen, M., & Trosterud, T. (2021). Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and Future. In Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages (pp. 62–72). The Association for Computational Linguistics. https://aclanthology.org/2021.iwclul-1.4.pdf
Hämäläinen, M., Rueter, J., & Alnajjar, K. (2021). Documentação de línguas ameaçadas na era digital. Linha D’Água, 34(2), 47-64. https://doi.org/10.11606/issn.2236-4242.v34i2p47-64
Rueter, J., Hämäläinen, M., & Partanen, N. (2020). Open-Source Morphology for Endangered Mordvinic Languages. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS) (pp. 94–100). The Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlposs-1.13
Hämäläinen, M., Alnajjar, K., Rueter, J., Lehtinen, M., & Partanen, N. (2021). An Online Tool Developed for Post-Editing the New Skolt Sami Dictionary. In I. Kosem, M. Cukr, M. Jakubíček, J. Kallas, S. Krek, & C. Tiberius (Eds.), Electronic lexicography in the 21st century (eLex 2021). Proceedings of the eLex 2021 conference (pp. 653-664). Lexical Computing CZ s.r.o. Available: https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_42_pp653-664.pdf
Rueter, J., Pereira de Freitas, M. F., Facundes, S., Hämäläinen, M., & Partanen, N. (2021). Apurinã Universal Dependencies Treebank. In M. Mager, A. Oncevay, A. Rios, I. V. Meza Ruiz, A. Palmer, G. Neubig, & K. Kann (Eds.), Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas (pp. 28-33). The Association for Computational Linguistics. DOI: 10.18653/v1/2021.americasnlp-1.4
Rueter, J. (2020). Корпус национальных мордовских языков: принципы разработки и перспективы функционирования/ действия. In ФИННО-УГОРСКИЕ НАРОДЫ В КОНТЕКСТЕ ФОРМИРОВАНИЯ ОБЩЕРОССИЙСКОЙ ГРАЖДАНСКОЙ ИДЕНТИЧНОСТИ И МЕНЯЮЩЕЙСЯ ОКРУЖАЮЩЕЙ СРЕДЫ (pp. 118-127). Издательский центр Историко-социологического института. https://www.researchgate.net/publication/342869938_Corpus_of_the_national_languages_Erzya_and_Moksha_priciples_of_development_and_perspectives_of_functionactionKorpus_nacionalnyh_mordovskih_azykov_principy_razrabotki_i_perspektivy_funkcionirovania_dej
Rueter, J. (Author), & Axelson, E. (Author). (2020). Raamatun jakeita uralilaisille kielille, rinnakkaiskorpus, sekoitettu, Korp [tekstikorpus]. Software, Kielipankki. Available: http://urn.fi/urn:nbn:fi:lb-2020021119
Rueter, J., Partanen, N., & Ponomareva, L. (2020). On the questions in developing computational infrastructure for Komi-Permyak. In T. A. Pirinen, F. M. Tyers, & M. Rießler (Eds.), Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages (pp. 15–25). The Association for Computational Linguistics. DOI: 10.18653/v1/2020.iwclul-1.3
Rueter, J. M. (2020). Linguistic Distance between Erzya and Moksha. Dependent Morphology. In Е. Ф. Клементьева, Т. И. Мочалова, & И. Н. Рябов (Eds.), ФИННО-УГОРСКИЕ ЯЗЫКИ В СОВРЕМЕННОМ МИРЕ: ФУНКЦИОНИРОВАНИЕ И ПЕРСПЕКТИВЫ РАЗВИТИЯ: Материалы Всероссийской научно-практической конференции, посвященной 95-летию заслуженного деятеля науки РФ, доктора филологических наук, профессора Цыганкина Дмитрия Васильевича (pp. 90-110). МГУ им. Н. П. Огарёва. Available: http://hdl.handle.net/10138/330042
Rueter, J., Partanen, N., & Pirinen, T. A. (2021). Numerals and what counts. In M. D. Lhoneux, & R. Tsarfaty (Eds.), Fifth Workshop on Universal Dependencies : Proceedings (pp. 151–159). The Association for Computational Linguistics. Available: https://aclanthology.org/2021.udw-1.13
Rueter, J., & Hämäläinen, M. (2020). Prerequisites For Shallow-Transfer Machine Translation Of Mordvin Languages: Language Documentation With A Purpose. In Материалы Международного образовательного салона (pp. 18-29). Ижевск: Институт компьютерных исследований. Available: http://hdl.handle.net/10138/325962
Rueter, J. M. (Accepted/In press). Mordva. In R. Valijärvi & D. Abondolo (Eds.), The Uralic Languages. Routledge.
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.