Photo: Sonja Holopainen, Kotus
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Niko Partanen, a researcher in the Kone Foundation-funded project Language Documentation meets Language Technology: The Next Step in the Description of Komi, tells us about his ongoing research, in which he will produce new resources for the Language Bank of Finland.
I am Niko Partanen, a researcher in the project Language Documentation meets Language Technology: The Next Step in the Description of Komi, funded by the Kone Foundation. While working as a senior adviser at the Institute for the Languages of Finland last year, I had a unique opportunity to get to know several language resources archived in Finland, and I will continue to work on questions of archiving and digitization in the future. The scope and quantity of language resources in Finland are very good, but many questions remain open, especially concerning current web publishing practices and how to provide access that suits different user groups. I will spend this summer as a visiting researcher at the University of Helsinki.
My research topic is variation and change in Komi-Zyrian dialects, studied using digital resources from different periods. I focus on certain interesting dialect features that are known but inadequately described, and I am currently working on articles on phonological and morphological topics.
Researchers have been collecting resources on Komi dialects for over a hundred years, which makes it possible to compare data over a long time span. There is never too much data on endangered languages, and this is also true of the Komi dialects. This has led me to study various resources that were collected earlier and published in different formats. For example, I have worked with text recognition, which is one of the most effective means of converting handwritten texts into digital form.
I aim to develop and make use of language technology in research on speech data. Our research project, Language Documentation meets Language Technology: The Next Step in the Description of Komi, is led by Rogier Blokland and Michael Rießler and funded by the Kone Foundation, and it will continue for some years. The project focuses on developing the morphosyntactic analysis of Komi and has regularly published articles on different solutions that make use of language technology. In practice, we can take a text in a Komi dialect and run it through the analyzer developed in the Giellatekno environment, with relatively good results for each word. However, it is not entirely clear how good the analysis needs to be in order to answer different kinds of research questions realistically. In this respect I can serve as a test subject myself when I try to answer specific research questions using this resource. Our project will also produce a broader description of Komi syntax, and I will finish my doctoral research during the project period.
My research project is in the process of transferring its Komi corpora to the infrastructure provided by the Language Bank. The corpora compiled and transcribed during the earlier project between 2014 and 2016 will be available in the Korp interface, which is very important for researchers. It is of utmost importance that resources be made available to the whole research community as quickly as possible, and practices that make this as easy as possible should be actively developed.
At the moment I am working on scripts for the Language Bank that analyze the data in the Komi corpus and convert it into the format required by the Korp interface. The same scripts also check the files. Since the transcripts are the result of five years of manual work, they include many minor non-standard structures, which are now located automatically and fixed as appropriate. Otherwise these anomalies would cause various problems for the user. For example, part of the corpus content would not be visible through Korp, or the data would end up in the wrong place. All solutions and experiences gathered within the project will naturally be published in accordance with the principles of open science.
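An automated check of the kind described above could look roughly like the following Python sketch. It is not the project's actual script. It assumes the target is a verticalized text layout of the kind Korp corpora are commonly prepared in, with one token per line, tab-separated annotation columns, and XML-like tags on their own lines. The function name check_vrt, the sentence element, and the three-column layout are illustrative assumptions only.

```python
import re

# An XML-like structural tag that occupies a whole line, e.g. <sentence> or </text>.
TAG = re.compile(r"^<(/?)([A-Za-z_][\w-]*)(\s[^>]*)?>$")


def check_vrt(lines, n_columns, token_container="sentence"):
    """Return (line_number, message) pairs for structural anomalies.

    n_columns and token_container are assumptions about the target layout.
    """
    problems = []
    open_tags = []  # stack of (tag name, line number where it was opened)
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        match = TAG.match(line)
        if match:
            closing, name = match.group(1), match.group(2)
            if not closing:
                open_tags.append((name, number))
            elif open_tags and open_tags[-1][0] == name:
                open_tags.pop()
            else:
                problems.append((number, f"unexpected </{name}>"))
            continue
        # Anything that is not a tag is treated as a token line.
        fields = line.split("\t")
        if len(fields) != n_columns:
            problems.append(
                (number, f"expected {n_columns} columns, found {len(fields)}")
            )
        elif token_container not in [name for name, _ in open_tags]:
            problems.append((number, f"token outside <{token_container}>"))
    for name, number in open_tags:
        problems.append((number, f"<{name}> is never closed"))
    return problems


# Placeholder tokens only; no real corpus data is shown here.
sample = [
    '<text id="example">',
    "<sentence>",
    "w1\tlemma1\tN",
    "w2\tlemma2",  # a column is missing
    "</sentence>",
    "w3\tlemma3\tV",  # token sits outside any sentence
    "</text>",
]

for number, message in check_vrt(sample, n_columns=3):
    print(f"line {number}: {message}")
```

A token with a missing column or a token placed outside its sentence is exactly the sort of minor anomaly that could leave content invisible in a search interface or show it in the wrong place, so reporting line numbers makes the manual fixes easier to target.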
So far I have not used the Language Bank resources in my work, but I am interested in using the resources of Finnish and Karelian that are available in the Language Bank.
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive.