Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampo Pyysalo tells us about his research on natural language processing. Openly available large language models are necessary for developing tools similar to ChatGPT also for smaller languages, such as Finnish.
I’m Sampo Pyysalo, University Research Fellow at the TurkuNLP group of the University of Turku.
My research is on machine learning approaches to natural language processing, with particular focus on processing Finnish text and analyzing biomedical domain scientific literature. A lot of my recent work revolves around training large neural network models, including general ”foundation” models such as FinBERT and FinGPT as well as task-specific models such as a named entity recognition model for Finnish. I also work on data, both compiling raw text resources for the unsupervised training of foundation models and running manual annotation efforts to create resources for supervised training, such as the Turku NER and TurkuONE corpora.
Large neural language models are central to a lot of state-of-the-art natural language processing and the basis for tools such as ChatGPT, but most such models focus on English and many of the best models are not publicly available. We believe that openly available Finnish models such as FinBERT and FinGPT are necessary to enable the creation of tools for processing Finnish language with comparable capabilities to tools available for English.
Creating large language models from scratch requires billions of words of text, and collections of Finnish of this size are not readily available. To compile sufficiently large corpora for language model training we have drawn on various sources, including web crawls and resources available through Kielipankki such as the Yle News Archive, the Finnish News Agency Archive (STT) and the Suomi 24 Corpus. We also distribute resources created by TurkuNLP through Kielipankki among other channels.
In the near future, we hope that we will be able to provide access to the full text resources used to create our models for research purposes through Kielipankki to improve the replicability of our work and to make it easier for future efforts to create models for Finnish.
J. Luoma & LH. Chang & F. Ginter & S. Pyysalo. 2021. Fine-grained Named Entity Annotation for Finnish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 135–144, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden. https://aclanthology.org/2021.nodalida-main.14
A. Virtanen & J. Kanerva & R. Ilo & J. Luoma & J. Luotolahti & T. Salakoski & F. Ginter & S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. In CoRR, abs/1912.07076. https://doi.org/10.48550/arXiv.1912.07076
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.