Researcher of the Month: Filip Ginter

Filip Ginter
Photo: Filip Ginter

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Filip Ginter tells us about his work with the TurkuNLP research group.

Who are you?

I am Filip Ginter and I am an associate professor of language technology at the University of Turku. I am also presently the longest-serving member of the TurkuNLP research group. I am a computer scientist by training, profoundly enjoying the many unique challenges human language poses.

What is the focus of your research?

Blessed with neither patience nor a long attention span, I have managed to dip into quite a few research topics over the years with our TurkuNLP team. We started off with scientific literature mining, but then branched out into more general development of various NLP tools and resources. I’ve always had a soft spot for Finnish and chose to contribute especially to Finnish NLP, perhaps to give back to the society which so generously hosted me for my PhD research. Personally, my most important (or at least most visible) undertaking was the Turku Dependency Treebank, which later became one of the first treebanks in the super-successful Universal Dependencies (UD) initiative and allowed TurkuNLP to be an important member of the UD community from Day 1. The treebank was also the basis for TurkuNLP’s relatively widely used line of statistical dependency parsers for Finnish. I am proud that this work helped to bring Finnish into the results tables of ACL papers and to close the gap to much more studied languages, at least in terms of parsing accuracy.

Recently, I of course could not help but jump on board the deep learning tsunami. TurkuNLP’s previous work on crawling the Finnish Internet and gathering billions of words of Finnish paid off when it became a crucial part of the training corpus of the FinBERT model. If you have recently done any machine learning on Finnish, it is quite likely you used this model to squeeze out those extra few percentage points of accuracy. The story of FinBERT is a story of having plenty of language data ready at the right moment, and it shows the importance of gathering and maintaining language resources. You never know when you will next need a few billion words of Finnish.

And where do I go from here? I see it as my goal to bring to Finnish, one way or another, most of the tools, tasks, and resources that the bigger languages have. Think about question answering, summarization, semantic search, paraphrase models and many other NLP tasks not yet properly covered for Finnish. If they can exist for English, then they should also exist for Finnish. We are living in exciting times in NLP, and we now have many more opportunities to make it happen than we had even five years ago. And of course, with the LUMI supercomputer around the corner, you can expect exciting new language models from the TurkuNLP workshop.

Apart from these more or less mainstream NLP projects, I have had several, I dare say successful, collaborations in the field of digital humanities, in particular with historians. I enjoyed these projects as they challenged us with interesting technical and algorithmic problems to solve.

How is your research related to Kielipankki?

Perhaps my most visible contribution to the Language Bank is the Finnish dependency parser (of course, there were many of us working on it in TurkuNLP), which is used by the Language Bank to make data more accessible to researchers. The most recent version of the parser brings a substantial improvement in accuracy on all levels of analysis. One day, when the legislation catches up with present-day language technology needs, I also hope to see our Internet Parsebank and other large-scale web-based data contributed to the Language Bank.

Naturally, we have used the Language Bank’s resources extensively here in TurkuNLP, perhaps most of all the Suomi24 corpus, in various research projects as well as in language model training. We have also benefited enormously from the Newspaper and Periodical OCR Corpus of the National Library of Finland in our work with historians.

I cannot stress enough how important it is for Finnish NLP that we all contribute open datasets and free tools and models to the Language Bank and also maintain our edge in terms of computational resources, with LUMI being the perfect example.

Publications

J. Kanerva & F. Ginter & S. Pyysalo 2020. Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task. Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies. DOI: 10.18653/v1/2020.iwpt-1.17

J. Kanerva & F. Ginter & T. Salakoski 2020. Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks. Natural Language Engineering. DOI: 10.1017/S1351324920000224

J. Kanerva & F. Ginter & N. Miekka & A. Leino & T. Salakoski 2018. Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. DOI: 10.18653/v1/K18-2013

A. Vesanto & A. Nivala & T. Salakoski & H. Salmi & F. Ginter 2017. A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). https://aclanthology.org/W17-0249

More information

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.

Researcher of the Month: Sampsa Holopainen

Sampsa Holopainen
Photo: Laura Horváth

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampsa Holopainen tells us about his research on the history of the Uralic languages.

Who are you?

My name is Sampsa Holopainen, and I am a researcher of the history of the Uralic languages. I am currently working as a recipient of an APART-GSK Fellowship of the Austrian Academy of Sciences at the Finno-Ugrian department of the University of Vienna. I completed my doctoral studies at the University of Helsinki; my PhD defence was in December 2019.

What is your research topic?

My current research topic is the history of Hungarian or, more widely, the history of the Ugric languages (also including Khanty and Mansi): historical phonology, etymology and loanword research. I am investigating these topics in my current project (2021–2023) Hungarian historical phonology reexamined (with special focus on Ugric vocabulary and Iranian loanwords). In my earlier work I have done research on the etymology of the other Uralic languages too, especially on the Indo-Iranian and other Indo-European lexical influence on the various Uralic languages. In 2019–2021, I worked on Finnic etymology in particular, in the project Suomen vanhimman sanaston etymologinen verkkosanakirja (The digital etymological dictionary of the oldest vocabulary of Finnish) at the University of Helsinki. This project is led by Dr. Santeri Junttila and funded by the Kone Foundation.

How is your research related to Kielipankki?

As a part of my current project I am developing an etymological database of the shared vocabulary of Hungarian, Khanty and Mansi (the vocabulary traditionally reconstructed into the Ugric proto-language) and of the early Iranian loanwords of Hungarian; the database is built into the Sanat-wiki that is maintained by Kielipankki. These vocabulary layers are investigated critically and the results are presented in word-articles, and the database will also later include tables illustrating the developments of historical phonology. The database forms only part of my current research work, but it gives a good opportunity to publish research results and observations quickly and openly.

My database is based on a much larger etymological database of the Finnic languages, which has been developed in Santeri Junttila’s project Suomen vanhimman sanaston etymologinen verkkosanakirja (The digital etymological dictionary of the oldest vocabulary of Finnish). Docent Petri Kallio, MA Juha Kuokkala and MA Juho Pystynen have also worked on this project. The project is still active, but I am no longer involved in it as a full-time researcher. I think that this project is especially significant, as it has produced the excellent Wiki-database of etymology that has served as the basis of further projects on etymology, such as my own current project at the University of Vienna. The Wiki-database makes it easy to update research results and forms a good platform for researchers to communicate.

Publications

Holopainen, Sampsa 2022: Uralilaisen lingvistisen paleontologian ongelmia – mitä sanasto voi kertoa kulttuurista? – Kaheinen, Kaisla & Leisiö, Larisa & Erkkilä, Riku & Qiu, Toivo E.H. (eds.), Hämeenmaalta Jamalille: kirja Tapani Salmiselle 07.04.2022. Helsinki: Helsingin yliopiston kirjasto. 101–114. DOI: 10.31885/9789515180858.9

Holopainen, Sampsa 2021: On the question of substitution of palatovelars in Indo-European loanwords into Uralic. – Suomalais-Ugrilaisen Seuran Aikakauskirja 98. 197–233. DOI: 10.33340/susa.95365

Junttila, Santeri & Holopainen, Sampsa & Pystynen, Juho 2020: Digital Etymological Dictionary of the Oldest Vocabulary of Finnish. – Rasprave 46, 2. 733–747. DOI: 10.31724/rihjj.46.2.15


Researcher of the Month: Jack Rueter

Jack Rueter
Photo: Jack Rueter

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jack Rueter tells us about his research on morpho-syntactic description of minority languages.

Who are you?

I am Jack Rueter, a principal investigator in Digital Humanities at the University of Helsinki and a Project Researcher in Finnish and Finno-Ugric Languages at the University of Turku, working with contextual disambiguation of corpora, annotated manually and using rule-based systems. At the age of seventeen, I spoke my first words of Finnish, and from there I have endeavored to acquire a working knowledge of several other non-English languages.

What is your research topic?

During my studies and subsequent research of Uralic and other minority languages, I have gradually expanded my understanding of how to use language-technological tools and practices to enhance fundamental work in linguistics. Although I began my first finite-state description of Komi-Zyrian a quarter of a century ago, which I followed with parallel and corpus work for the Erzya language at the beginning of this millennium, it is the last decade that has seen ambitious collaboration in the description of languages in several branches of the Uralic language family and beyond. These descriptions have centered on the study of lexica, rich yet regular morphology and syntax, and on the idea that useful language documentation might be facilitated through the development of tools and learning environments for multilingual application.

My work with the Komi-Zyrian language began while taking a course at the University of Helsinki in the early nineties. Our teacher, E. Cypanov, offered us lessons based on materials he had written in Russian – no Komi-Finnish or Komi-English dictionaries were available at the time, so I undertook the translation of his glossary into a small trilingual Komi-English-Finnish word list, which I was able to proofread and expand with a scholarship from the Alfred Kordelin Foundation. At the time, such word lists were seen as a fundamental point of development for finite-state descriptions, and as such I was able to begin my modeling of a finite-state description for Komi-Zyrian with advice from Professor Kimmo Koskenniemi on a Unix system in 1995.

From 1996 until 2004, I spent a large part of my time among the Komi, the Erzya and the Moksha. During this time, I taught Finnish at the Mordovian State University in Saransk, Mordovia, about 600 kilometers east-southeast of Moscow. There, in addition to language instruction, I began collecting and digitizing Mordvin language literature, learning the two literary languages and developing relations with professional language users and native speakers. These personal contacts have contributed to my knowledge of the languages and provided me with native-language descriptions of the languages, which are essential to their adequate documentation. This was also a time to become familiar with other languages spoken in Russia as well as to foster affiliations with language research at the Universities of Turku and Tromsø.

Upon leaving my teaching position in Saransk, I immediately became involved in work with the open-source infrastructure Giellatekno in Tromsø. Trond Trosterud and his colleagues were interested in my work with Komi and wanted to include it in their Barents and Circumpolar language-technology development. Needless to say, I acquiesced, and open-source Komi became another piece of the puzzle for extensive dictionary and morphology work in my collaboration from Helsinki, where I began my postgraduate studies. Language technology definitely played a strong role in the categorization of morphological phenomena in the Erzya language, a forerunner to what I documented in my dissertation in 2010 and what I would greatly expand upon in subsequent work funded by the Kone Foundation under the auspices of its «Language Programme» (2012–2021).

The Language Programme saw extensive pilots and projects for digitizing endangered materials from the 1920s–40s for Finnish kindred languages in Fenno-Ugrica at the National Library of Finland. Preparation for and continued work with these materials helped pave the way to extensive work with lexica and morphology in Olonets-Karelian, Livonian, Hill Mari, Moksha and Tundra Nenets. The success in these, of course, was due largely to the team of language specialists involved and to previous documentation work done on the languages. As open-source projects, the language documentation projects also made use of the open Helsinki Finite-State Technology (HFST) and of the open infrastructure for Saami language-technology research (Giellatekno) and tool implementation (Divvun) in Tromsø, Norway (Giella). It was experience with these technologies that I applied to other minority languages, such as Ingrian, Skolt Saami, Meadow Mari, Udmurt, Võro, Komi-Permyak, Mansi, and even Apurinã in the Amazon and Lushootseed in the Pacific Northwest. The resulting tools were online morphology-savvy dictionaries, e.g. for Olonets-Karelian, Skolt Saami, Erzya and Moksha, and intelligent computer-assisted language learning (ICALL), such as Skolt Saami Nuõrti, which follows the lead of the ICALL for Northern Saami, Davvi. The tools also included something for everyday writing and spell checkers at Divvun.

Lexicon and morphology only really make sense if you can apply them to a broader usage: syntax and meaningful usage, for example, translation. Thanks to Anssi Yli-Jyrä, I became involved in the Universal Dependencies project in the late 2010s. It was here that I debuted with a treebank for Erzya, and subsequently continued with work on Moksha, Komi-Zyrian, Komi-Permyak, Skolt Saami and Apurinã, with meaningful collaboration from Helsinki, Turku, Oulu, Saransk, Syktyvkar, Tromsø, Tartu, Göttingen, Belém and Bloomington. Work with treebanks can, on the one hand, be considered a means of making language documentation available to multiple user types, and, on the other hand, it serves as an open repository for development in Constraint Grammar disambiguation, function and dependency work after morphological analysis. The drive towards meaningful morphosyntax takes me to Apertium and shallow-transfer translation modeling for closely related languages.

Apertium started out with translation between Catalan and Spanish, two closely related language forms. This initially involved conversion of lexicon from source to target, the subsequent transfer of morphological information, and finally an adaptation of the resulting source syntax to target syntax and idioms. The idea of being able to translate between closely related languages on the basis of the shallow transfer of regular morphological categories and information describes a tool that, in addition to facilitating informative reference translation, might also be used in measuring the distance between language forms through documented lexical, morphological, syntactic and idiomatic convertibility. The development of shallow-transfer tools for the triangle of (Northern Dvina) Karelian, Olonets-Karelian and Finnish, for example, has led to dictionary development correlating to finite-state morphology in the Giella infrastructure applied at Akusanat and Google Summer of Code through Apertium. Upcoming language pairs might include work with the Mordvin languages Erzya and Moksha, which have recently enjoyed a lot of support through work in the Digilang project at the University of Turku.
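As a rough illustration of the three steps described above (lexicon conversion, transfer of morphological information, adaptation of syntax), here is a minimal Python sketch. It is not Apertium code: the lexicon entries, the tag names, the `TAG_MAP` renaming and the single structural rule are all invented for this example, and the input is assumed to arrive already morphologically analysed as (lemma, tags) pairs.

```python
from typing import List, Tuple

Unit = Tuple[str, Tuple[str, ...]]  # (lemma, morphological tags)

# Hypothetical toy data for Catalan -> Spanish; not taken from Apertium.
BILINGUAL_LEXICON = {"el": "el", "meu": "mi", "gat": "gato"}
# Assumption: pretend the two tagsets name possessives differently.
TAG_MAP = {"poss": "pos"}


def lexical_transfer(units: List[Unit]) -> List[Unit]:
    """Step 1: replace each source lemma with its target lemma."""
    # Unknown lemmas are passed through with a leading '*' marker.
    return [(BILINGUAL_LEXICON.get(lemma, "*" + lemma), tags)
            for lemma, tags in units]


def tag_transfer(units: List[Unit]) -> List[Unit]:
    """Step 2: carry the tags over, renaming where the tagsets differ."""
    return [(lemma, tuple(TAG_MAP.get(tag, tag) for tag in tags))
            for lemma, tags in units]


def structural_transfer(units: List[Unit]) -> List[Unit]:
    """Step 3: adapt syntax; here, drop a definite article before a possessive."""
    out: List[Unit] = []
    for i, (lemma, tags) in enumerate(units):
        next_tags = units[i + 1][1] if i + 1 < len(units) else ()
        if "def" in tags and "pos" in next_tags:
            continue  # Catalan "el meu gat" corresponds to Spanish "mi gato"
        out.append((lemma, tags))
    return out


def translate(units: List[Unit]) -> List[Unit]:
    return structural_transfer(tag_transfer(lexical_transfer(units)))


source = [("el", ("det", "def", "m", "sg")),
          ("meu", ("det", "poss", "m", "sg")),
          ("gat", ("n", "m", "sg"))]
result = translate(source)
print(result)
```

The sketch stops at the level of lemmas and tags; turning the result into surface word forms would require a morphological generator for the target language, which is left out here.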

How is your research related to Kielipankki?

At the end of the last millennium, I began collecting Moksha, Erzya and Komi literature with releases from the authors and publishers for compilation and research study in the University of Helsinki Language Corpus Server (UHLCS), which has since been incorporated into the Language Bank of Finland materials at Kielipankki. FIN-CLARIN has provided me with time and resources for validating older UHLCS materials and for coaching in work on newer corpus development and educational materials. This has meant that I have had the opportunity to bring my own ERME materials for Erzya and Moksha to the Korp server, as well as parallel Biblical verses of Uralic languages, Pabivus, with Erik Axelson (thanks to the Bible Translation Institute). At present, work is underway to introduce Universal Dependencies corpora of Finno-Ugric languages to the Korp server. Hopefully, my work in Mordvin syntax at the University of Turku will soon also contribute to the quality of the minority-language corpora at Kielipankki. More accurate morphological analysis with rule-based, contextually derived syntactic readings helps bring speech-to-text and text-to-speech technology closer to lesser documented, minority languages.

Publications

Rueter, J., Partanen, N., Hämäläinen, M., & Trosterud, T. (2021). Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and Future. In Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages (pp. 62–72). The Association for Computational Linguistics. https://aclanthology.org/2021.iwclul-1.4.pdf

Hämäläinen, M., Rueter, J., & Alnajjar, K. (2021). Documentação de línguas ameaçadas na era digital. Linha D’Água, 34(2), 47-64. https://doi.org/10.11606/issn.2236-4242.v34i2p47-64

Rueter, J., Hämäläinen, M., & Partanen, N. (2020). Open-Source Morphology for Endangered Mordvinic Languages. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS) (pp. 94–100). The Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlposs-1.13

Hämäläinen, M., Alnajjar, K., Rueter, J., Lehtinen, M., & Partanen, N. (2021). An Online Tool Developed for Post-Editing the New Skolt Sami Dictionary. In I. Kosem, M. Cukr, M. Jakubíček, J. Kallas, S. Krek, & C. Tiberius (Eds.), Electronic lexicography in the 21st century (eLex 2021). Proceedings of the eLex 2021 conference (pp. 653-664). Lexical Computing CZ s.r.o. Available: https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_42_pp653-664.pdf

Rueter, J., Pereira de Freitas, M. F., Facundes, S., Hämäläinen, M., & Partanen, N. (2021). Apurinã Universal Dependencies Treebank. In M. Mager, A. Oncevay, A. Rios, I. V. Meza Ruiz, A. Palmer, G. Neubig, & K. Kann (Eds.), Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas (pp. 28-33). The Association for Computational Linguistics. DOI: 10.18653/v1/2021.americasnlp-1.4

Rueter, J. (2020). Корпус национальных мордовских языков: принципы разработки и перспективы функционирования/ действия. In ФИННО-УГОРСКИЕ НАРОДЫ В КОНТЕКСТЕ ФОРМИРОВАНИЯ ОБЩЕРОССИЙСКОЙ ГРАЖДАНСКОЙ ИДЕНТИЧНОСТИ И МЕНЯЮЩЕЙСЯ ОКРУЖАЮЩЕЙ СРЕДЫ (pp. 118-127). Издательский центр Историко-социологического института. https://www.researchgate.net/publication/342869938_Corpus_of_the_national_languages_Erzya_and_Moksha_priciples_of_development_and_perspectives_of_functionactionKorpus_nacionalnyh_mordovskih_azykov_principy_razrabotki_i_perspektivy_funkcionirovania_dej

Rueter, J., & Axelson, E. (2020). Raamatun jakeita uralilaisille kielille, rinnakkaiskorpus, sekoitettu, Korp [text corpus]. Kielipankki. Available: http://urn.fi/urn:nbn:fi:lb-2020021119

Rueter, J., Partanen, N., & Ponomareva, L. (2020). On the questions in developing computational infrastructure for Komi-Permyak. In T. A. Pirinen, F. M. Tyers, & M. Rießler (Eds.), Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages (pp. 15–25). The Association for Computational Linguistics. DOI: 10.18653/v1/2020.iwclul-1.3

Rueter, J. M. (2020). Linguistic Distance between Erzya and Moksha. Dependent Morphology. In Е. Ф. Клементьева, Т. И. Мочалова, & И. Н. Рябов (Eds.), ФИННО-УГОРСКИЕ ЯЗЫКИ В СОВРЕМЕННОМ МИРЕ: ФУНКЦИОНИРОВАНИЕ И ПЕРСПЕКТИВЫ РАЗВИТИЯ: Материалы Всероссийской научно-практической конференции, посвященной 95-летию заслуженного деятеля науки РФ, доктора филологических наук, профессора Цыганкина Дмитрия Васильевича (pp. 90-110). МГУ им. Н. П. Огарёва. Available: http://hdl.handle.net/10138/330042

Rueter, J., Partanen, N., & Pirinen, T. A. (2021). Numerals and what counts. In M. D. Lhoneux, & R. Tsarfaty (Eds.), Fifth Workshop on Universal Dependencies : Proceedings (pp. 151–159). The Association for Computational Linguistics. Available: https://aclanthology.org/2021.udw-1.13

Rueter, J., & Hämäläinen, M. (2020). Prerequisites For Shallow-Transfer Machine Translation Of Mordvin Languages: Language Documentation With A Purpose. In Материалы Международного образовательного салона (pp. 18-29). Ижевск: Институт компьютерных исследований. Available: http://hdl.handle.net/10138/325962

Rueter, J. M. (Accepted/In press). Mordva. In R. Valijärvi & D. Abondolo (Eds.), The Uralic Languages. Routledge.


Researcher of the Month: Mika Hämäläinen

Mika Hämäläinen
Photo: Khalid Alnajjar

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mika Hämäläinen tells us about his research on computational creativity and developing language technology for endangered languages.

Who are you?

I am Mika Hämäläinen, a postdoctoral researcher at the Department of Digital Humanities at the University of Helsinki. In 2020, I finished my PhD thesis on computational creativity with the title Generating Creative Language: theories, practice and evaluation. The title describes my research interests well, as I am not only interested in the technical implementation of language technology models, but also in their relation to theories and real-world phenomena. Open source code and publishing research results as tools that are as easy to use as possible are very important to me.

What is your research topic?

I have researched computational creativity as well as language technology for endangered languages and for non-standard languages such as dialects and historical language forms. Computational creativity is a challenging research topic from the perspective of Artificial Intelligence (AI), as the aim is to develop computational models that are capable of producing new creative texts such as poetry (Hämäläinen & Alnajjar, 2019) or humour (Alnajjar & Hämäläinen, 2021). A machine shouldn’t just be able to output new text, but also be able to interpret its output on some meaningful level. For this purpose, we have developed analysis tools, such as the FinMeter library, which analyses Finnish poetry. The library can be used, for example, to analyse meter and interpret metaphors.

Language technology for endangered languages is very challenging, as modern language technology increasingly relies on massive text resources that are not readily available. The corpora of endangered languages also tend to contain a lot of variation, as the languages concerned may not have been subject to the same degree of language guidance as, for example, Finnish. This kind of linguistic diversity is difficult from the perspective of machine learning: the more variation the corpus contains, the larger it needs to be for machine learning models to cope with the variation. Language technology for endangered languages therefore requires some ingenuity. We have successfully analysed the morphology (Hämäläinen et al., 2021a), morphosyntax (Hämäläinen & Wiechetek, 2020) and cognates (Hämäläinen & Rueter, 2019) of endangered languages by generating synthetic data for machine learning models. Data from endangered languages can be easily processed using the UralicNLP library that I have developed.

Even in the case of vital languages, the abundant variation is a headache for language technologists. I have done research on the normalisation of historical English language forms (Hämäläinen et al., 2018). Normalisation simply means that a computer converts deviant historical spellings into their modern standard form. The English language normalisation tool Natas is available on GitHub. Since then, I have worked on the normalisation of Finnish (Partanen et al., 2019) and Finland Swedish dialects (Hämäläinen et al., 2020a), as well as on the generation of Finnish dialects (Hämäläinen et al., 2020b) based on the written language. These research results have been published in the Murre library. My most recent work has been the automatic recognition of Finnish dialects based on audio and text (Hämäläinen et al., 2021b).
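To make the idea of normalisation concrete, here is a deliberately simple Python sketch. It does not use Natas or Murre and does not reflect how those tools work internally: a small hand-written lookup table (`HISTORICAL_TO_MODERN`, a hypothetical name with a few invented entries) is swapped in for a trained model, purely to show what the input and output of a normaliser look like.

```python
# Hypothetical lookup table of historical English spellings.
# A real normaliser learns such mappings from data instead of listing them.
HISTORICAL_TO_MODERN = {
    "haue": "have",
    "vpon": "upon",
    "loue": "love",
}


def normalise(text: str) -> str:
    """Replace each known historical spelling with its modern form.

    Tokens that are not in the table are returned unchanged.
    """
    tokens = text.split()
    return " ".join(HISTORICAL_TO_MODERN.get(tok.lower(), tok) for tok in tokens)


normalised = normalise("I haue written vpon this")
print(normalised)
```

A lookup table cannot handle spellings it has never seen, which is one reason the abundant variation mentioned above calls for models that generalise beyond a fixed word list.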

How is your research related to Kielipankki?

The Samples of Spoken Finnish corpus has been absolutely crucial in building dialect models. Without this corpus, my research on Finnish dialects would simply have been impossible.

The data from the Language Bank has also been useful in the study of computational creativity. For example, the Finnish WordNet has been used in my poetry generator (Hämäläinen, 2018) and Opusparcus has been useful in producing creative dialogue (Alnajjar & Hämäläinen, 2019).

Publications

Alnajjar, K., & Hämäläinen, M. (2021). When a Computer Cracks a Joke: Automated Generation of Humorous Headlines. In Proceedings of the 12th International Conference on Computational Creativity (ICCC 2021) (pp. 292-299). Association for Computational Creativity.

Hämäläinen, M., Alnajjar, K., Partanen, N., & Rueter, J. (2021b). Finnish Dialect Identification: The Effect of Audio and Text. In M-F. Moens, X. Huang, L. Specia, & S. Wen-tau Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8777-8783). The Association for Computational Linguistics.

Hämäläinen, M. (2020). Generating Creative Language: Theories, Practice and Evaluation. Helsingin yliopisto. Available: http://urn.fi/URN:ISBN:978-951-51-6707-1

Alnajjar, K., & Hämäläinen, M. (2019). A Creative Dialog Generator for Fallout 4. In Proceedings of the 14th International Conference on the Foundations of Digital Games [48] ACM. https://doi.org/10.1145/3337722.3341824

Hämäläinen, M., & Alnajjar, K. (2019). Let’s FACE it: Finnish Poetry Generation with Aesthetics and Framing. In K. V. Deemter, C. Lin, & H. Takamura (Eds.), 12th International Conference on Natural Language Generation: Proceedings of the Conference (pp. 290-300). The Association for Computational Linguistics. https://doi.org/10.18653/v1/w19-8637

Hämäläinen, M., Partanen, N., Rueter, J., & Alnajjar, K. (2021a). Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered. In S. Dobnik, & L. Øvrelid (Eds.), Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 166-177). (NEALT Proceedings Series; No. 45), (Linköping Electronic Conference Proceedings; No. 178). Linköping University Electronic Press.

Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates with a Character-Based NMT Approach. In A. Arppe, J. Good, M. Hulden, J. Lachler, A. Palmer, L. Schwartz, & M. Silfverberg (Eds.), Proceedings of the 3rd Workshop on Computational Methods in the Study of Endangered Languages: (Volume 1) Papers (pp. 39-45). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-6006.pdf

Hämäläinen, M., Partanen, N., & Alnajjar, K. (2020a). Normalization of Different Swedish Dialects Spoken in Finland. In GeoHumanities’20: Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities (pp. 24–27). ACM. https://doi.org/10.1145/3423337.3429435

Hämäläinen, M., Partanen, N., Alnajjar, K., Rueter, J., & Poibeau, T. (2020b). Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity. In F. A. Cardoso, P. Machado, T. Veale, & J. M. Cunha (Eds.), Proceedings of the 11th International Conference on Computational Creativity (ICCC’20) (pp. 204-211). Association for Computational Creativity.

Hämäläinen, M., & Wiechetek, L. (2020). Morphological Disambiguation of South Sámi with FSTs and Neural Networks. In D. Beermann, L. Besacier, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020) (pp. 36-40). European Language Resources Association (ELRA).

Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., & Mäkelä, E. (2018). Normalizing early English letters to Present-day English spelling. In B. Alex, S. Degaetano-Ortlieb, A. Feldman, A. Kazantseva, N. Reiter, & S. Szpakowicz (Eds.), Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (pp. 87-96). (ACL Anthology; No. W18-45). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-4510

Hämäläinen, M. (2018). Harnessing NLG to Create Finnish Poetry Automatically. In F. Pachet, A. Jordanous, & C. León (Eds.), Proceedings of the Ninth International Conference on Computational Creativity (pp. 9-15). Association for Computational Creativity (ACC).

Partanen, N., Hämäläinen, M., & Alnajjar, K. (2019). Dialect Text Normalization to Normative Standard Finnish. In W. Xu, A. Ritter, T. Baldwin, & A. Rahimi (Eds.), The Fifth Workshop on Noisy User-generated Text (W-NUT 2019): Proceedings of the Workshop (pp. 141–146). The Association for Computational Linguistics.

More information on the tools and corpora

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.

Researcher of the Month: Ari Huhta

Ari Huhta
Photo: Anne Pitkänen-Huhta

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Ari Huhta tells us about his research on language assessment.

Who are you?

I am Ari Huhta, a professor of language assessment and the director of the Centre for Applied Language Studies (CALS) at the University of Jyväskylä.

What is your research topic?

During my career I have been involved in developing various kinds of language assessment instruments and assessment systems as well as in carrying out related research. In the past 15 years I have also investigated learning a foreign or second language and the factors involved in learning languages.

Language assessment, or assessment in general, has several different purposes. Some of them concern awarding certificates to individuals for achieving a certain level of proficiency or a certain goal, as is the case in the Matriculation Examination or the National Certificates of Language Proficiency (Yleiset kielitutkinnot, which is used to demonstrate the level of language proficiency required for Finnish citizenship). I have been involved in both of these examinations but most of my research has focused on assessment that supports learning and that is called formative or diagnostic assessment.

A particularly important activity in my career was the international Dialang project in which we developed a 14-language assessment and feedback system that can be used via a web browser. Dialang was completed back in 2004, but it is still accessible. Dialang led to a number of studies that combine the perspectives of language assessment and language learning research. These projects investigated the relationship between the ability to use a language and different linguistic features (e.g., structures and vocabulary) and their co-development, which will help design both teaching materials and assessment instruments for supporting learning. Researchers have been particularly interested in the linguistic characteristics of the functionally defined proficiency levels of the Common European Framework of Reference for Languages (CEFR); these levels are nowadays widely used in Europe, including Finland, as a way to define learning targets in foreign language education.

The most important examples of the above-mentioned studies were the Cefling and Topling projects (PI Prof. Maisa Martin, JyU), which investigated writing and its development among Finnish-speaking learners of English and Swedish and among learners of Finnish as a second language, and the Dialuki project, which I led and which studied reading and writing skills among learners of English and Finnish. The participants in all these projects were school-aged language learners. More recently, I have studied the learning and teaching of English in primary school. In addition, I am involved in the DigiTala project, which is a joint venture between the University of Helsinki, Aalto University and the University of Jyväskylä; this project investigates automatic recognition and assessment of speech produced by learners of Finnish and Swedish.

How is your research related to Kielipankki?

Some of the learners’ texts collected during the Cefling and Topling projects (the Topling corpus) are already available via the Language Bank of Finland. The Dialuki corpus is to be published soon. As for the DigiTala project, we intend to make the speech material available to the scientific community to the extent that this is possible. By sharing our corpora, we aim to support and enhance research on language learning.

Publications

Khushik, Ghulam & Huhta, Ari. 2022. Syntactic complexity in English as a foreign language learners’ writing at CEFR levels A1 – B2. European Journal of Applied Linguistics, 10(1). Early online. https://doi.org/10.1515/eujal-2021-0011

Khushik, Ghulam & Huhta, Ari. 2020. Investigating syntactic complexity in EFL learners’ writing across Common European Framework of Reference levels A1, A2, and B1. Applied Linguistics 41(4), 506-553. https://doi.org/10.1093/applin/amy064

Leontjev, Dmitri; Huhta, Ari & Mäntylä, Katja. 2016. Word derivational knowledge and writing proficiency: How do they link? System 59, 73-89. https://doi.org/10.1016/j.system.2016.03.013

Huhta, Ari; Alanen, Riikka; Tarnanen, Mirja; Martin, Maisa & Hirvelä, Tuija. 2014. Assessing learners’ writing skills in a SLA study: Validating the rating process across tasks, scales and languages. Language Testing 31(3) 307–328. https://doi.org/10.1177/0265532214526176

Mäntylä, Katja & Huhta, Ari. 2013. Knowledge of word parts. In Milton, James & Fitzpatrick, Tess (eds.) Dimensions of Vocabulary Knowledge. (pp. 45-59). Palgrave.

Alanen, Riikka; Huhta, Ari & Tarnanen, Mirja. 2010. Designing and assessing L2 writing tasks across CEFR proficiency levels. In Bartning, Inge, Martin, Maisa & Vedder, Ineke (eds.) Communicative proficiency and linguistic development: intersections between SLA and language testing research. EUROSLA Monograph Series, 1. 21-56. http://eurosla.org/monographs/EM01/EM01home.html

More information on the aforementioned resources

Researcher of the Month: Tuisku Vilenius

Tuisku Vilenius
On a linguistic field trip in Tver Karelia in the summer of 2019. Photo: Tuisku Vilenius

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tuisku Vilenius investigated a corpus of Finnish online discussions to outline the cultural stereotypes that emerged from the discussions related to the indigenous Saami people.

Who are you?

I am Tuisku Vilenius and I graduated last summer with a Master’s degree in Linguistics from the University of Helsinki. My degree also included Saami studies and Indigenous studies. On the level of languages, I am particularly interested in the Saami languages, the Mayan languages and Nahuatl. Currently, I am working as a Finnish language teacher for immigrants and planning my postgraduate studies.

What is your research topic?

The aim of my Master’s thesis was to find out how ordinary Finns perceive the Saami people and their culture. As I had just recently begun my Saami studies when I started working on my Master’s thesis, I decided to approach the topic through material that was written in Finnish. I examined which adjectives were used in Finnish online discussions when referring to the Saami, and I also wanted to find out which broader discourses or stereotypes affected the chosen adjectives. At the same time, my research was also a diachronic overview of the Finnish Saami discussions during recent decades.

It was interesting to notice that although the number of discussions related to the Saami increased significantly during the period I reviewed (2001-2017), the references to the Saami changed little. Throughout the reviewed time period, the discussion was dominated by a stereotypical view in which the Saami were perceived as a traditional and even ancient people. This may be explained by the fact that the average Finn has little day-to-day contact with the Saami. On the other hand, much of the discussion focused on defining who and what the genuine Saami actually are. This reflects the need of the mainstream population to control and define the indigenous people.

How is your research related to Kielipankki?

I used the Suomi24 corpus (2001-2017) as the source of research data for my study. This corpus is available in the Language Bank’s Korp tool, and it contains discussions from the Suomi24 online forum. I chose this data because it provided a very broad view of the history of Finnish Internet discussion. The online discussion forum material is also more likely to reflect the views of ordinary Finns than, for example, the newspaper articles that had been used as a basis for earlier research on Saami discussions. In addition to the extensive material, I was delighted with the various additional features available in Korp. I was able to easily search for adjectives referring to the Saami with the search tool, and I also used identification data to learn, for example, when and in which discussion area a message had been posted. This allowed me to better outline the topics to which the Saami discussions related.

Publications related to Kielipankki

Vilenius, Tuisku 2021. Oikeat ja muinaiset: saamelaisstereotyypit suomalaisissa internetkeskusteluissa. Master’s Thesis. University of Helsinki. Available: URN:NBN:fi:hulib-202106152749

More information on the aforementioned resources in Kielipankki

Studies and Degree programmes at the University of Helsinki

Researcher of the Month: Jussi Ylikoski

Jussi Ylikoski
Photo: Ilona Ylikoski

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jussi Ylikoski tells us about his research on the grammatical properties of Finnish and other Uralic languages.

Who are you?

I am Jussi Ylikoski, a linguist. I have been working at the University of Oulu for five years as a professor of Saami language, but starting in the autumn of 2022, I will be a professor of Finno-Ugric languages at the University of Turku. So, I do research on quite a few languages, including Finnish.

What is your research topic?

I have worked on quite a large number of research topics on Finnish and other Uralic languages, and partly outside the Uralic family, too. I have mainly focused on grammars (morphology and syntax) of both better- and lesser-known languages, and occasionally also on etymology. When describing present-day languages, I often can’t help looking at them also from a diachronic perspective, and when I study the historical development of these languages, I tend to pay quite a lot of attention to the actual use of modern languages in the light of real text corpora.

How is your research related to Kielipankki?

I have used the corpora available in the Language Bank of Finland particularly as a researcher of Finnish grammar. As early as 2003, I published an article in which I used the Finnish Text Collection in the Language Bank to show that the so-called fifth infinitive (-maisillaan/-mäisillään, ‘on the verge of doing something’) can be used in many other ways in addition to the periphrastic construction with the verb olla (‘to be’), contrary to what had been regularly stated in grammars. For instance, the ‘forehead veins’ (otsasuonet) may ‘be on the verge of bursting’ (olla repeämäisillään), but they might also be ‘bulging on the verge of bursting’ (pullistella repeämäisillään), or someone may be afraid and ‘waiting (for something) with his/her forehead veins on the verge of bursting’ (odottaa otsasuonet repeämäisillään).

In recent years, I have been fascinated by the ever larger text corpora containing billions of words that are available through the Language Bank of Finland and other CLARIN services. In my research, I have used, e.g., the Korp version of the University of Helsinki E-thesis collection, the Finnish subcorpus of the Newspaper and Periodical Corpus of the National Library of Finland, the Suomi 24 Corpus, Ylilauta Corpus, and the Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, version 2. With the help of large corpora, it has been possible to discover, in a way, even new morphological cases in a language as well known and well described as Finnish. Among other things, I have studied the syntactic properties of forms traditionally known as the prolative, and I have found them to be used in ways that are much more similar to case forms than what has been suggested by previous research literature. Prolatives are not always only individual adverbs (e.g., maitse ‘by land’ and meritse ‘by sea’), but these forms can also be modified by subordinate clauses (e.g., mailitse jossa on helpompi kaunistella asioita ‘by email where it is easier to embellish facts’ and tekstiviestitse joihin turhan harva vastaa ‘by text messages that tend to be answered by too few’).

I have made my most exciting observations when studying forms that were previously considered clear-cut derivations, such as lauantaisin ‘on Saturdays’ and viikonloppuisin ‘on weekends’ or kunnittain ‘by/across municipalities’ and aihealueittain ‘by/across thematic areas’. In the multi-billion word corpora searchable through the Korp interface of the Language Bank of Finland, it is possible to find hundreds or even thousands of relatively natural sentences in which even these kinds of forms can have various modifiers that make them look like noun inflections: elokuun lauantaisin ‘on August Saturdays’, joka lauantaisin ‘on every Saturday’, satunnaisin viikonloppuisin ‘on random weekends’ or, e.g., Suomen kunnittain ‘by the municipalities of Finland’, eri maittain ‘by different countries’ and tietyin aihealueittain ‘by certain thematic areas’. Since these kinds of temporal and distributive expressions look like case-inflected noun phrases, I have playfully called them “dwarf cases” by analogy with Pluto, which was formerly known as a planet but is now called a dwarf planet.

After working on the hazy boundary between derivation and inflection, I have also ended up studying the abessive case in Finnish (rahatta ‘without money’, internetittä ‘without Internet’, etc.) and the so-called t accusative (minut ‘me’, meidät ‘us’, etc.) more thoroughly than before. Even though I personally like to observe and to describe forms and syntactic structures largely by means of descriptive linguistics, the tools of the Language Bank do also offer a lot of opportunities for those who are interested in quantitative analysis.

In addition to the corpora in the Language Bank of Finland, I have also used the corpora of Saami languages and many other Uralic minority languages that have been produced by the language technologists in Tromsø, Norway. The corpora are available via the Korp service maintained by Giellatekno, i.e., the user interface is similar to that of the Korp service in the Language Bank of Finland. Those who are interested also in other Uralic languages besides Finnish can access the corpora in the Tromsø Korp service, http://gtweb.uit.no/korp/ (Saami) and http://gtweb.uit.no/u_korp/ (other languages). With 63 million words of annotated Mari, what more can a Uralicist wish for?

Publications related to Kielipankki

Ylikoski, Jussi. 2003. Havaintoja suomen ns. viidennen infinitiivin käytöstä. [Summary: Remarks on the use of the proximative verb form (the so-called 5th infinitive) in Finnish.] Sananjalka 45. 7–44. https://doi.org/10.30673/sja.86640

Ylikoski, Jussi. 2018. Prolatiivi ja instrumentaali: suomen –(i)tse ja –teitse kieliopin ja leksikon rajamailla. Sananjalka 60. 7–27. [Summary: On Finnish prolatives and instrumentals: –(i)tse and –teitse in between grammar and lexicon.] https://doi.org/10.30673/sja.69978

Ylikoski, Jussi. 2020. Kielemme kääpiösijoista: prolatiivi, temporaali ja distributiivi. Virittäjä 124. 529–554. [Summary: On Finnish dwarf cases: prolative, temporal and distributive.] https://doi.org/10.23982/vir.76971

Ylikoski, Jussi. 2021. Abessiivin apologia. Puhe ja kieli 41. 139–157. [Summary: Apologia of the Finnish abessive case.] https://doi.org/10.23997/pk.110924

Ylikoski, Jussi. 2021. Mistä voisin löytää sen entisen sinut? Suomen kielen akkusatiivi- ja pronominioppia. – Leena Maria Heikkola, Geda Paulsen, Katarzyna Wojciechowicz & Jutta Rosenberg (toim.), Språkets funktion. Juhlakirja Urpo Nikanteen 60-vuotispäivän kunniaksi. Festskrift till Urpo Nikanne på 60-årsdagen. Festschrift for Urpo Nikanne in honor of his 60th birthday. Åbo: Åbo Akademis förlag. 220–243. https://urn.fi/URN:ISBN:978-952-12-4062-1

More information on the aforementioned resources in Kielipankki

Researcher of the Month: Jutta Salminen

Jutta Salminen
Photo: Malin Bengtsson

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jutta Salminen tells us about her research on the various ways of expressing negation in Finnish.

Who are you?

I am Jutta Salminen (PhD, BMus). I defended my dissertation on the Finnish language at the University of Helsinki in the spring of 2020 and I have been working as a Finnish language lecturer at the University of Greifswald in Germany for more than five years. I am interested in grammar and linguistic meaning — particularly in the expression of negation and also in ambiguity.

What is your research topic?

In my dissertation, I studied the use and interpretations of the verb epäillä (‘to doubt, to suspect, to suppose’) and its nominal derivatives epäily and epäilys (‘a doubt, a suspicion’), as well as the changes related to the verb during the era of written Finnish. The starting point of the study was the observation that, in present-day Finnish, these lexemes may express that something is considered either probable or unlikely, depending on the context of use. So, I became interested in how a single word can be used in two opposite senses. In addition, these words provided an opportunity for observing how the negation proper (‘it is not (true) that X’) and the so-called evaluative negativity (‘it’s not good that X’, ‘I don’t like X’) relate to each other in language use, since both of these aspects of negativity are included in the meaning potential of the verb and its nominal derivatives.

My current research is focused on the negative polarity items (e.g., kukaan) in Finnish and on what their contexts of use can tell about their grammatical and semantic nature. In the English-language research literature, negative polarity items (NPIs) have been studied rather extensively (especially in major Indo-European languages), and it is interesting to observe how the Finnish NPIs could relate to these descriptions.

How is your research related to Kielipankki?

In order to study the variation, change and prevalence of different interpretations of linguistic meaning, it is necessary to have access to language material where it is possible to observe and to analyze instances of the language phenomenon under study. For the purpose of my dissertation research, I compiled a data set representing various text genres from several corpora: The Helsinki Korp Version of the Finnish Text Collection, Classics of Finnish Literature, The Corpus of Early Modern Finnish, The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (KLK) and Corpus of Old Literary Finnish. When I began my dissertation study, the Finnish Text Collection was available via the old search interface, Lemmie, in Kielipankki, and the rest of the corpora (excluding KLK) were accessible via the Kaino service provided by Kotus (Institute for the Languages of Finland). Nowadays, I can use all of them via the Korp service in Kielipankki.

I based my comparison of the epäily(s) nouns on the occurrences found in the HS.fi News and Comments Corpus, which made it possible to examine the use of the words in both the edited news texts and the readers’ comments. The linguistic context plays a key role in the perception of the meaning variants of ambiguous words, so access to the wider context of search results provided by the Language Bank was essential.

My ongoing research on negative polarity items mostly consists of grammatical description. Since grammar tends to change when in use, linguistic data is necessary for this type of research in addition to self-postulated examples, especially when the acceptability and the entrenchment of a particular expression are questionable to some extent. The Suomi24 Corpus has turned out to be a fruitful source of data for studying the use of the Finnish NPIs.

Publications related to Kielipankki

Salminen, Jutta (2020). Epäilemisen merkitys. Epäillä-sanueen polaarinen kaksihahmotteisuus kiellon ja kielteisyyden semantiikan peilinä. (The meaning and import of epäillä: The polar ambiguity of the Finnish verb epäillä ‘doubt, suspect, suppose’ and its nominal derivatives as a reflection of the semantics of negation and negativity.) Doctoral dissertation. Helsinki: University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5879-6

Salminen, Jutta (2018). Paratactic negation revisited. The case of the Finnish verb epäillä. Functions of Language 25(2): 259–288. https://doi.org/10.1075/fol.15030.sal

Salminen, Jutta (2017). Mitä tarkoittaa epäillä? Epäillä-verbin polaarisesta merkitysvariaatiosta nykysuomessa. (What does epäillä mean? On the polar meaning variation of the verb epäillä in Modern Finnish.) Virittäjä 121: 4–36. https://journal.fi/virittaja/article/view/52322

Salminen, Jutta (2017). Epäillä-verbin polaarinen kaksihahmotteisuus merkitysmuutoksena. (The polar ambiguity of the Finnish verb epäillä as evidenced through meaning development.) Virittäjä 121: 37–66. https://journal.fi/virittaja/article/view/52323

Salminen, Jutta (2017). Epäily vai epäilys? Jaettu polysemia ja lekseemien tyypilliset käytöt. (Epäily or epäilys? Shared polysemy and specialised typical uses.) Sananjalka 59: 217–243. https://doi.org/10.30673/sja.66636

More information on the aforementioned resources in Kielipankki

Researcher of the Month: Mikko Kurimo

Mikko Kurimo
Photo: Evelin Kask, Aalto-yliopisto

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikko Kurimo tells us about his research on automatic speech recognition.

Who are you?

I am a Professor in Speech and Language Processing and leader of the Speech Recognition research team at the Department of Signal Processing and Acoustics of Aalto University.

What is your research topic?

For my PhD dissertation 25 years ago, I developed neural network algorithms to make automatic speech recognition more accurate and more robust. In order to train statistical models for recognizing speech sounds, it is necessary to utilize large amounts of speech material where the sounds are aligned with the corresponding text. At that time, very few such corpora were available. Thus, the research team had to collect and process the data themselves. When we developed automatic methods for aligning speech and text, it became possible to utilize larger data such as audiobooks and radio and television news (e.g., FBC – The Finnish Broadcast Corpus) in training the Finnish speech recognizer.

However, sufficient accuracy cannot be reached just by modeling individual speech sounds, since they do not appear separately in speech and in practice they are modified to fit in the word and sentence context. Therefore, the speech recognizer must also be provided with a model of the language in question. On the basis of the language model, the recognizer decides which words and sentences are represented by the observed speech sound sequences. To train the language model, huge quantities of text are required that should also contain a large variety of examples of different types of language use. For training the Finnish speech recognizer, we have used, e.g., the Finnish Text Collection (FTC).
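As a rough illustration of how a language model can help a recognizer choose between competing word sequences, here is a minimal sketch of a bigram model with add-one smoothing. It is not the method used in Aalto-ASR; the function names, the toy training sentences and the candidate transcripts are all assumptions made for this example, and a real recognizer would combine such a score with acoustic scores and train on far larger text collections.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count context words and word pairs in whitespace-tokenised sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of one candidate transcript."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


def pick_transcript(candidates, model):
    """Return the candidate that the language model finds most probable."""
    return max(candidates, key=lambda c: log_prob(c, model))


# Hypothetical toy training text; real systems use very large text corpora.
training_text = [
    "we recognize speech",
    "they recognize speech sounds",
    "a nice beach",
]
model = train_bigram_model(training_text)

# Two word sequences that could plausibly match similar speech sounds.
candidates = ["wreck a nice beach", "recognize speech"]
print(pick_transcript(candidates, model))  # prints "recognize speech" for this toy data
```

With this toy data the model prefers the sequence whose word pairs it has seen more often, which is the role the interview describes for the language model. Add-one smoothing is used only to keep unseen word pairs from receiving zero probability; it is a simplification chosen for brevity.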

When it is possible to automatically convert read-aloud speech and dictation into text with sufficient accuracy, this technology can be used in dictation services as well as in many other useful applications, such as transcribing planned speeches or respeaking presentations or television programmes. However, I am even more interested in natural and spontaneous speech that we all use in our everyday conversations and storytelling. Since free speech is the most efficient means of communication for humans, it is of utmost importance to have an automatic speech recognizer that can understand this kind of speech when developing Artificial Intelligence systems that are to communicate with people.

The challenges in training models of conversational speech lie in the huge amount of variation in speech and in the limited availability of carefully transcribed resources of natural speech that are suited for training the recognizers. Since written language differs from spoken language in many ways, it is in practice necessary to create the text resources by transcribing speech first.

How is your research related to Kielipankki?

When training the first conversational speech recognizer, we used the FinDialogue corpus in addition to the DSPCON corpus we collected ourselves. The language models were trained with specific portions of conversations in written format that were found to be similar to spoken language according to the aforementioned spoken corpora.

At the moment, we are preparing two new corpora of free speech for publication: an extension of the Plenary Sessions of the Parliament of Finland and the speech material collected in the Donate Speech campaign. Both corpora contain approximately 4000 hours of speech, which clearly exceeds the total amount that was included in all previously published Finnish speech corpora that were suitable for training automatic speech recognizers. I am confident that the new data will enable us to significantly improve the automatic speech recognizer we have developed at Aalto University (Aalto-ASR), whose most recent version (Aalto-ASR 2.1) is currently available via the Language Bank of Finland.

Publications related to Kielipankki

Mikko Kurimo (1997). Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models. PhD thesis, Helsinki University of Technology, Espoo, Finland.

Mikko Kurimo, Vesa Siivola, Teemu Hirsimäki, Janne Pylkkönen, Reima Karhila, Peter Smit, Seppo Enarvi, André Mansikkaniemi, Matti Varjokallio, Ulpu Remes, Heikki Kallasjoki, Sami Keronen, Katri Leino, Ville T. Turunen & Kalle Palomäki (author names in no particular order, except the project leader is first). 2000–2016. AaltoASR open source large-vocabulary continuous speech recognition system, Aalto University.

Seppo Enarvi & Mikko Kurimo (2013). Studies on Training Text Selection for Conversational Finnish Language Modeling. In Proceedings of the 10th International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, pp. 256–263. Available: http://urn.fi/URN:NBN:fi:aalto-201708036342.

André Mansikkaniemi, Peter Smit & Mikko Kurimo (2017). Automatic Construction of the Finnish Parliament Speech Corpus. Proceedings of Interspeech 2017, Vol. 8, pp. 3762–3766. Available: https://doi.org/10.21437/Interspeech.2017-1115

Juho Leinonen, Sami Virpioja & Mikko Kurimo (2021). Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. Available: http://hdl.handle.net/10138/330758

Peter Smit, Sami Virpioja & Mikko Kurimo (2021). Advances in subword-based HMM-DNN speech recognition across languages. Computer Speech & Language, Vol. 66. Available: https://doi.org/10.1016/j.csl.2020.101158

More information on the aforementioned resources in Kielipankki

Researcher of the Month: Veronika Laippala

Veronika Laippala
Photo: Matti Honka-Hallila

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Veronika Laippala tells us about her research on large language resources and computational methods.

Who are you?

My name is Veronika Laippala. I am a Professor of Digital Language Research at the School of Languages and Translation Studies of the University of Turku and a member of the TurkuNLP research group.

What is your research topic?

Most of my research is related to language use in one way or another: to large language resources, mostly compiled from the Internet, and to computational methods to analyze the data. In addition, I have been involved in the development of Finnish language technology, including resources such as the Turku Dependency Treebank and the Turku NER named entity recognition system.

We currently have several ongoing projects in which we process large web-based language resources by analyzing the genres or registers found in them and by developing machine learning methods that can automatically recognize the different registers. Such methods and tools would benefit both Internet users in general and researchers using Internet-based language materials.

How is your research related to Kielipankki?

The wide selection of corpora and resources in the Language Bank of Finland provides huge opportunities! The Suomi24 corpus is quite unique in its scope and it is probably the resource I have used the most. In addition, the syntactic parser developed on the basis of our treebank is used to parse the corpora in Kielipankki. Naturally, I also teach the use of the Korp interface in my courses.

Publications related to Kielipankki

Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo & Veronika Laippala (2021). Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 183–191. Available: https://aclanthology.org/2021.eacl-srw.24.

Veronika Laippala, Jesse Egbert, Douglas Biber & Aki-Juhani Kyröläinen (2021). Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents. Language Resources and Evaluation, Vol. 55, pp. 757–788. Available: https://doi.org/10.1007/s10579-020-09519-z.

Researcher of the Month: Juho Leinonen

Juho Leinonen
Photo: Petteri Haapaniemi

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juho Leinonen tells us about his research on automatic speech recognition, speech alignment and chatbots.

Who are you?

My name is Juho Leinonen and I am completing my PhD studies in the Speech Recognition research group led by Mikko Kurimo at Aalto University. I started my PhD studies in 2017 after a couple of years of work in industry.

What is your research topic?

The topic of my Master’s thesis was automatic speech recognition for the Sámi language, and I am able to build on this experience in my PhD work as well. In my current research on chatbots and forced alignment of speech, I still need language models and acoustic models, both of which are also required in automatic speech recognition. In speech recognizers, language models are used for recognizing words that are pronounced in an unclear or ambiguous way, whereas chatbots need language models for generating new text. Language models can also be applied to assessing the quality of text generated by bots. The process becomes circular: in order to evaluate the results in a reliable way, we need to understand what high-quality text is like, but the same understanding is a prerequisite for generating text in the chatbot. This constitutes a philosophical problem as well as an engineering one.

The goal in traditional speech recognition is to find the text that corresponds to the audio recording as well as possible. When developing a speech recognizer, previously aligned speech data is first required in order to train the acoustic models. Aligning text with speech is actually routine work in speech recognition. However, speech alignment would be a useful functionality for researchers in other fields as well, and it is hardly possible for everyone to become a speech recognition professional before they can get started with their own research. During the past year, I have packaged the speech recognition and alignment tools used in our research group into a toolkit that would be as easy to share as possible. I am also searching for good measures that could be used for assessing the quality of the alignment. My goal is to find out which acoustic models or features produce the best alignment, and in what sort of situations it is possible or worthwhile to use the models trained on major languages for aligning minority languages. This research has also opened up the world of language researchers for me, since I am trying to adapt the tool to suit their purposes as well as possible.

How is your research related to Kielipankki?

On the spur of the moment, I ended up testing the Finnish speech recognizer, developed by our group, for aligning the Giellagas corpus of Northern Sámi. This project gave me the idea of cross-language alignment that is described in my latest publication (Leinonen, Virpioja & Kurimo, 2021). Thus, an alignment tool developed for one language can possibly be applied to aligning speech and text in other languages as well, provided that the sound and writing systems of the languages are sufficiently similar. In the future, I will also be utilizing other previously aligned speech corpora that are in the Language Bank of Finland. The automatic speech aligner that I have used in my research is now also available for other researchers as part of the Aalto University Automatic Speech Recognition System (Aalto-ASR v.2) that has been installed in the Puhti computing environment at CSC.

For training chatbots, I also use the Suomi24 corpus available in the Language Bank. It may seem strange to use the sort of language used in online discussion forums for “training” purposes. However, huge amounts of text are required in order to train useful language models, and finding suitable material in sufficiently large quantities is very difficult.

Publications related to Kielipankki

Leinonen, J., Smit, P., Virpioja, S., & Kurimo, M. (2017). New baseline in automatic speech recognition for Northern Sámi. In International Workshop on Computational Linguistics for the Uralic Languages (pp. 89-99). https://doi.org/10.18653/v1/W18-0208

Leino, K., Leinonen, J., Singh, M., Virpioja, S., & Kurimo, M. (2020). FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics. In Interspeech (pp. 429-433). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-2511

Leinonen, J., Virpioja, S., & Kurimo, M. (2021, May). Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. http://hdl.handle.net/10138/330758

Researcher of the Month: Okko Räsänen

Okko Räsänen
Photo: Jonne Renvall/Tampere University

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Okko Räsänen tells us about his research on the computational modeling of infant language development.

Who are you?

I am Okko Räsänen, Associate Professor and Academy Research Fellow at the Unit of Computing Sciences of Tampere University, where I also lead the Speech and Cognition research group. Before moving to Tampere, I worked at the Department of Signal Processing and Acoustics at Aalto University, where I am Docent in Speech Processing.

What is your research topic?

The main topics of my research are the computational modeling of infants’ early language acquisition and the speech that infants hear. Our aim is to understand the principles of information processing that underlie language learning: What sort of transformations and processing steps does the speech signal undergo in the human brain in order to make it possible for the individual to learn how to comprehend it, and how can we build similar language capabilities into artificial intelligence systems? We are interested in what sort of linguistic structures can be acquired in a language-independent and unsupervised manner from speech and from the rest of the sensory information that is available to children. On the other hand, we study the learning mechanisms and presuppositions that must be included in the models in order for the learning to succeed. An interesting question is what kind of language input and other multisensory information infants are generally able to hear and perceive during their early language development, and to what extent the acquisition of linguistic structures (e.g., sounds and words) is supported by the amount, quality, and multisensory nature of the input.

In addition to computational models, we have also developed practical tools for the automated analysis of large child-centered audio data sets, which can help us to better understand the characteristics of speech heard by children. The data sets typically consist of day-long recordings made with wearable microphones in children’s natural acoustic and linguistic environments. For example, in the recently completed international collaboration project Analyzing Child Language Experiences around the World, we analyzed about 14,000 hours of child-centered audio material in order to study children’s early language experiences in various linguistic and cultural settings. Our next goal is to turn the analysis results into publications.

Computational research in language learning is multidisciplinary and interesting work, but it is also challenging. In order to work with speech signals and to model human learning processes, an in-depth command of signal processing and machine learning methods is required. It is also important to have a good understanding of phonetics, early language development and the functioning of human cognition, so that the new models and methods can be reconciled with theory and data from language development research.

In addition to research on language acquisition, my research team develops various analysis methods for speech, e.g., for evaluating the health condition or the emotional state of a given speaker. My group is also involved in the development of smart wearables for babies for the clinical assessment and monitoring of their neurophysiological and motor development (as part of the Academy of Finland’s Health from Science research program). Moreover, I work on many other themes in speech technology, cognitive science, and signal analytics based on machine learning. Often, the signal processing and machine learning methods that are used in speech technology are also well suited for processing a wide variety of time series data.

How is your research related to Kielipankki?

In my research, I have used the FinDialogue corpus that is currently on its way to the Language Bank of Finland, and many other corpora that are provided by the Language Bank are also familiar to me. I am looking forward to the publication of the speech material collected during the Donate Speech campaign for research use. In my opinion, the Language Bank is also a viable publication channel for any new data that we may create during our research in the future.

Publications related to Kielipankki

Khorrami, K. & Räsänen, O. (2021). Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? – A computational investigation. Language Development Research, https://doi.org/10.34842/w3vw-s845

Räsänen, O., Seshadri, S., Lavechin, M., Cristia, A., & Casillas, M. (2021). ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Behavior Research Methods, 53, 818–835, https://doi.org/10.3758/s13428-020-01460-x.

Räsänen, O., Doyle, G., & Frank, M. C. (2018). Pre-linguistic segmentation of speech into syllable-like units. Cognition, 171, 130–150, https://doi.org/10.1016/j.cognition.2017.11.003.

Kakouros, S., Salminen, N. & Räsänen, O. (2018). Making predictable unpredictable with style — Behavioral and electrophysiological evidence for the critical role of prosodic expectations in the perception of prominence in speech. Neuropsychologia, 109, 181–199, https://doi.org/10.1016/j.neuropsychologia.2017.12.011.

Räsänen, O., Kakouros, S. & Soderstrom, M. (2018). Is infant-directed speech interesting because it is surprising? — Linking properties of IDS to statistical learning and attention at the prosodic level. Cognition, 178, 193–206, https://doi.org/10.1016/j.cognition.2018.05.015.

Rasilo, H. & Räsänen, O. (2017). An online model of vowel imitation learning. Speech Communication, 86, 1–23, https://doi.org/10.1016/j.specom.2016.10.010.

Räsänen, O. & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4), 792–829, https://doi.org/10.1037/a0039702.

Researcher of the Month: Olli Kuparinen

Olli Kuparinen
Photo: Ilona Lehtonen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Olli Kuparinen tells us about his research on language variation and change where he has used The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s), the Samples of Spoken Finnish and The Finnish Dialect Syntax Archive.

Who are you?

I am Olli Kuparinen, Doctor of Philosophy in Finnish language. In my doctoral dissertation, which I defended in June 2021, I studied change in the Finnish spoken in Helsinki and theories of language change. My dissertation was written in the multidisciplinary research group Kippo, and the study was funded by the Kone Foundation.

What is your research topic?

I study the variation and change in spoken Finnish as well as the theories that are utilized in sociolinguistics. My research methods have for the most part been statistical.

My dissertation scrutinized the change in Finnish spoken in Helsinki from the 1970s to the 2010s. The real-time corpus, which covers three points in time, enabled me to study the concrete changes in Helsinki and to test theories that have been formulated in studies of one or two time points. Studying three time points challenges, for instance, the practical applicability of the patterns of change put forth by William Labov.

In my postdoctoral research I will examine the variation in Finnish dialects and the ways variation is discussed in works on dialects.

How is your research related to Kielipankki?

In my dissertation I used the Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s), which consists of interviews of Helsinki natives from the 1970s, 1990s and 2010s. The interviews are available as sound files in the Language Bank. Many of the interviews have also been transcribed. In my dissertation I focused mainly on the transcriptions.

During my work on Helsinki Finnish I have also utilized the Samples of Spoken Finnish as a test corpus for different statistical models. I also plan to use the corpus in my postdoctoral research, in which I study the variation in Finnish dialects. The great benefit of the corpus is that it has been translated into standard Finnish. This makes it possible, for instance, to apply different machine learning algorithms to the corpus in order to examine the topics of the interviews.

I also plan to use the Finnish Dialect Syntax Archive as a supplement for the Samples of Spoken Finnish in my postdoctoral work.

Publications related to Kielipankki

Kuparinen, Olli 2018: Infinitiivien variaatio ja muutos Helsingissä. – Virittäjä 122 s. 29–52. https://doi.org/10.23982/vir.65310

Kuparinen, Olli 2021: Muutoksen mekanismit. Kolmen aikapisteen reaaliaikatutkimus Helsingin puhekielestä. Tampereen yliopiston väitöskirjat 428. Tampere: Tampereen yliopisto 2021. http://urn.fi/URN:ISBN:978-952-03-1990-8 

Kuparinen, Olli – Mustanoja, Liisa – Peltonen, Jaakko – Santaharju, Jenni – Leino, Unni 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. – Sananjalka 61 s. 30–56. https://doi.org/10.30673/sja.80056

Kuparinen, Olli – Peltonen, Jaakko – Mustanoja, Liisa – Leino, Unni – Santaharju, Jenni 2021: Lects in Helsinki Finnish: a probabilistic component modeling approach. – Language Variation and Change. https://doi.org/10.1017/s0954394521000041

Researcher of the Month: Karita Suomalainen

Karita Suomalainen
Photo: Heidi Suomalainen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Karita Suomalainen tells us about her research on interactional linguistics where she has used the ArkiSyn Database of Finnish Conversational Discourse, The Finnish Dialect Syntax Archive and The Suomi24 Sentences Corpus 2001-2017.

Who are you?

I am Karita Suomalainen, Doctor of Philosophy in Finnish language. I defended my doctoral dissertation in December 2020, and at the moment, I am working as a university teacher at the University of Turku. During the academic year 2021–2022 I will be visiting Aarhus University in Denmark as a postdoctoral researcher, with a grant that I received from the Finnish Academy of Science and Letters via the Foundations’ Post Doc Pool (Säätiöiden post doc -pooli).

What is your research topic?

My main research interests lie in the area of interactional linguistics. My research concerns the way different grammatical structures are used in interactional contexts. In particular, I have worked on the use of different referential expressions.

My doctoral dissertation examined the second person singular, focusing on the variation of its use in Finnish everyday conversations. My study revealed that, in addition to referring to and addressing the recipient, second person singular forms can also be used in fixed expressions (e.g., tietsä ‘(do) you know’) or to create open reference, so that they do not refer exclusively to the addressee, but rather describe interpersonal or generic experiences or states of affairs; similar uses of the second person singular can be found in many other languages. My current postdoctoral project deals with the grammaticalization of verbal constructions expressing person. The goal of the project is to describe the use, development and status of these expressions in Finnish and to compare them to similar expressions in Danish.

In collaboration with Ritva Laury and Anna Vatanen, I have also worked on the use of the Finnish se että construction in spoken language. In addition, I have examined the linguistic features of online hate speech together with Simo Määttä and Ulla Tuomarla.

How is your research related to Kielipankki?

Most of my research is actually based on data that is also available in the corpora of Kielipankki – the Language Bank of Finland. My doctoral dissertation was part of the project “Arkisyn: Morphosyntactically coded database of conversational Finnish” (funded by Kone Foundation). The project produced a morphosyntactically annotated corpus of everyday Finnish conversations that is also available in Kielipankki (ArkiSyn Database of Finnish Conversational Discourse, Helsinki Korp Version). The corpus enables the study of morphosyntactic phenomena in conversational data, and this feature has been very useful in my own research. I have also used the Finnish Dialect Syntax Archive, which makes it possible to examine older spoken language from a diachronic perspective. It is also possible to listen to the samples of the data, and that feature has been especially useful for a spoken language researcher like me. I appreciate that Kielipankki also hosts spoken language corpora – I know that coding such data is not always a very simple task.

In our research on online hate speech, Simo Määttä, Ulla Tuomarla and I have analyzed a discussion thread found within the Suomi24 corpus available in Kielipankki. Our study was based on the qualitative analysis of a particular case, but it would be interesting to use corpus data for a more comprehensive study. However, it turned out in our project that it is difficult to define specific lexical or grammatical search criteria that could be used for locating samples of hate speech. New solutions will be needed in order to extend the analysis.

Publications related to Kielipankki

Suomalainen, Karita (2020): Kuka sinä on? Tutkimus yksikön 2. persoonan käytöstä ja käytön variaatiosta suomenkielisissä arkikeskusteluissa [Who is ‘you’? On the use of the second person singular in Finnish everyday conversations]. Annales Universitatis Turkuensis C 499. Doctoral dissertation. http://urn.fi/URN:ISBN:978-951-29-8238-7

Suomalainen, Karita – Vatanen, Anna – Laury, Ritva (2020): The Finnish se että initiated expressions: NPs or not? In Sandra Thompson & Tsuyoshi Ono (eds.), The ‘Noun Phrase’ across Languages. An emergent unit in interaction, 12–41. Typological Studies in Language 128. Amsterdam: John Benjamins. https://doi.org/10.1075/tsl.128.02suo

Määttä, Simo – Suomalainen, Karita – Tuomarla, Ulla (2020): Maahanmuuttovastaisen ideologian ja ryhmäidentiteetin rakentuminen Suomi24-keskustelussa [Constructing anti-immigration ideology and group identity in an online conversation thread on the Suomi24 discussion board]. Virittäjä 124 (2), 190–216. https://doi.org/10.23982/vir.81931

Researcher of the Month: Mila Oiva

Mila Oiva
Photo: Mila Oiva

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mila Oiva tells us about her research in Cultural History, including the making of the Yves Montand in the USSR interviews.

Who are you?

My name is Mila Oiva. I’m a Cultural Historian and I work as a Senior Researcher at CUDAN Open Lab at Tallinn University. CUDAN is a Horizon 2020-funded cultural data analytics initiative that studies cultural phenomena by integrating qualitative and quantitative approaches from humanities, social sciences, network science, complexity science and beyond.

What is your research topic?

I study how knowledge and assumptions circulate and how the communication tools used affect the way knowledge moves and takes shape. For example, I have studied the global circulation of news in 19th-century newspapers (https://oceanicexchanges.org/) and the circulation of popular interpretations of history in Russian-language web discussions in the 2010s (https://sites.utu.fi/pseudohistoria/en/). In addition, I have explored the construction and reception of a tour of the French-Italian singer-actor Yves Montand to the Soviet Union in 1956–57 in the context of the Cold War. All these studies, which I have carried out in collaboration with my colleagues, demonstrate in an interesting way how our assumptions are built simultaneously as global phenomena and local interpretations of them.

How is your research related to Kielipankki?

I am about to publish at the Language Bank of Finland, for research and teaching purposes, the collection of oral history interviews that we made for our book Yves Montand in the USSR. Cultural Diplomacy and Mixed Messages (Palgrave Macmillan 2021). It is still relatively rare for historians to share their data, but I think that the dataset can also be useful for other scholars and students interested in the memories of Soviet popular culture. Furthermore, this year marks the 100th anniversary of Montand’s birth, and publishing memories concerning his Soviet tour is a good way to celebrate it!

Publications related to Kielipankki

Oiva, Mila, Hannu Salmi, and Bruce Johnson. Yves Montand in the USSR: Cultural Diplomacy and Mixed Messages. Palgrave Macmillan, 2021. https://doi.org/10.1007/978-3-030-69048-9.

Fridlund, Mats, Mila Oiva, and Petri Paju, eds. Digital Readings of History. History Research in the Digital Era. Helsinki: Helsinki University Press, 2020. https://doi.org/10.33134/HUP-5.

Oiva, Mila, Asko Nivala, Hannu Salmi, Otto Latva, Marja Jalava, Jana Keck, Laura Martínez Domínguez, and James Parker. “Spreading News in 1904. The Media Coverage of Nikolay Bobrikov’s Shooting.” Media History 25, no. 3 (August 11, 2019): 1–17. https://doi.org/10.1080/13688804.2019.1652090.

Researcher of the Month: Gwenaëlle Bauvois

Gwenaëlle Bauvois 
Photo: Gwenaëlle Bauvois

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Gwenaëlle Bauvois tells us about her research based on various media data sources, including the Plenary Sessions of the Parliament of Finland, Downloadable Version 1 available via Kielipankki.

Who are you?

I am a researcher based at the Centre for Research of Ethnic Relations and Nationalism (CEREN) at the Swedish School of Social Science, University of Helsinki. I hold a PhD in Sociology.

What is your research topic?

I am interested in right-wing populism, countermedia, reinformation, hybrid media and post-truth. My interest in these phenomena was really sparked in 2015 after the Charlie Hebdo events, and I have been working on these topics since then.

Niko Pyrhönen
Photo: Niko Pyrhönen
Tuukka Ylä-Anttila
Photo: Ilkka Vuorinen

In the years 2016–2019, my colleagues Niko Pyrhönen and Tuukka Ylä-Anttila and I were involved in a research project called Mobilizing ‘the Disenfranchised’ in Finland, France and the United States. Post-truth public stories in the transnational hybrid media space. We studied how countermedia mobilizes a “disenfranchised” community of people who are losing trust in the mainstream media. ‘Countermedia’ refers to partisan media that oppose conventional media and the establishment. For this project, we collected data from online media located in Finland, France and the United States.

Some of the results of our project were published in our co-authored article Politicization of migration in the countermedia style: A computational and qualitative analysis of populist discourse (2019). In this paper, we set out to investigate whether countermedia style is also used in the arena of ‘high politics’ – in this case the Parliament of Finland – and if so, how and by whom. The results of our computational and qualitative analysis of media data from Helsingin Sanomat and MV Lehti (2015–2017) and of the Plenary Sessions of the Parliament of Finland (years 2015–2016) showed that countermedia style expressions are indeed used in parliamentary debates, especially by the populist right-wing Finns Party, during debates on the “refugee crisis”.

How is your research related to Kielipankki?

As one of our data sets for this research, we used the minutes from the years 2015–2016 that were included in the Plenary Sessions of the Parliament of Finland, Downloadable Version 1, available via the Language Bank of Finland. The selected subset of the data contains the full transcripts of 183 parliamentary sessions and 6,819 speeches that we analyzed computationally and qualitatively.

Publications related to Kielipankki

Tuukka Ylä-Anttila, Gwenaëlle Bauvois & Niko Pyrhönen (2019). Politicization of migration in the countermedia style: A computational and qualitative analysis of populist discourse. Discourse, Context & Media, 32: 1–8. Available: https://doi.org/10.1016/j.dcm.2019.100326.

Researcher of the Month: Heikki Rasilo

Heikki Rasilo
Photo: Jessie Dupont

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Heikki Rasilo tells us about his use of the Aalto University DSP Course Conversation Corpus for his research related to speech production.

Who are you?

I am Heikki Rasilo, a postdoc researcher in the Artificial Intelligence Lab at Vrije Universiteit Brussel (VUB), Belgium. I got my PhD as a joint degree between VUB and Aalto University in 2017. After working in the private sector for a couple of years, I received a research grant from the Ulla Tuominen Foundation, through the Finnish Foundations’ Post Doc Pool (Säätiöiden post doc -pooli), for continuing my research.

What is your research topic?

Since the beginning of my PhD studies, my main research focus has been on physical speech production and its learning mechanisms. How do human children learn to articulate and imitate the speech of their parents while using their own vocal tracts of very different size and shape? The acoustic properties of adult and infant speech are different as well, and it is difficult to compare them directly. Nevertheless, children learn to articulate their mother tongue, and I am interested in whether the articulatory learning process can also affect the way in which we recognize and comprehend speech. Perhaps one of the reasons why we understand speech better than machines is that we know the physical mechanism through which speech is produced.

I am currently investigating whether the acoustic representations of speech that are formed in learning speech articulation could also be utilized in automatic speech recognition. The amount of recorded speech data that is required in order to train the world’s best speech recognizers is vast, and human children are not likely to encounter a similar amount of speech during their speech acquisition process. Therefore, it must be possible to learn to understand speech with smaller amounts of data, and physical articulation may play a role in the learning process.

How is your research related to Kielipankki?

In a study that was published last year, I trained a neural network to simultaneously recognize both phonemes and physical articulation from speech. The hypothesis was that the articulatory learning would shape the representations the network would learn, and these new representations could be helpful also when recognizing phonemes. For the experiment, I needed some recorded speech as well as articulatory information related to it. In the Language Bank of Finland, I found the Aalto University DSP Course Conversation Corpus that contained a sufficient amount of Finnish speech material including phonemic transcriptions. From the transcriptions, I was able to generate coarse synthetic articulatory data by using a Finnish speech synthesizer. The results of the experiment were promising – the articulatory learning did shape the speech representations in ways that can enhance phoneme recognition.

In my previous research, I have also used the CAREGIVER Corpus (available via ELRA) that consists of simple sentences and their orthographic transcriptions. With Academy Research Fellow Okko Räsänen, we used the corpus in order to investigate certain algorithms for learning word-meaning mappings, word segmentation and acoustic patterns related to words.

Publications related to Kielipankki

Rasilo, H. (2020). Phonemic learning based on articulatory-acoustic speech representations. In S. Denison, M. Mack, Y. Xu, & B.C. Armstrong (Eds.), Proceedings of the 42nd Annual Conference of the Cognitive Science Society (pp. 2203–2209). Cognitive Science Society. Available at: https://cogsci.mindmodeling.org/2020/papers/0512/index.html

Rasilo, H. & Räsänen, O. (2017), An online model for vowel imitation learning. Speech Communication, 86, 1-23. Available at: https://doi.org/10.1016/j.specom.2016.10.010

Räsänen, O. & Rasilo, H. (2015), A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4), 792–829. Available at: https://psycnet.apa.org/doi/10.1037/a0039702

Rasilo, H. & Räsänen, O. (2015), Weakly-supervised word learning is improved by an active online algorithm. Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, pp. 1561-1565. Available at: https://www.isca-speech.org/archive/interspeech_2015/i15_1561.html

Researcher of the Month: Emmi Lahti

Emmi Lahti
Photo: Julius Jaakola

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Emmi Lahti tells us about her research that is related to rhetoric and discourse studies and based on the Suomi 24 Corpus (2016H2), available via Kielipankki.

Who are you?

My name is Emmi Lahti and I am a grant researcher at the University of Helsinki. I finished my doctoral dissertation on Finnish in 2019. I am especially interested in argumentation and rhetoric as well as in critical discourse analysis. I am fascinated by the various ways in which language participates in the social construction of reality.

What is your research topic?

In my dissertation research, I analyzed the rhetoric of discussions on immigration. As data, I used immigration-related discussion threads on Suomi 24 from the year 2015. In particular, I investigated the linguistic construction of various groups, the types of arguments and argumentation strategies used, and the ways of showing agreement or disagreement with other participants in the discussions.

The results of the study showed how mutual solidarity and support are expressed by the like-minded discussants who are opposed to immigration and how these participants construct a common view of the world and common argumentation.

How is your research related to Kielipankki?

In my doctoral study, I utilized the Suomi 24 corpora available in Kielipankki – the Language Bank of Finland. The Suomi 24 Sentences Corpus (2016H2) can be used via the Korp user interface in Kielipankki, and the corresponding data referred to as the Suomi 24 Corpus (2016H2) can be downloaded for research purposes. In my study, I ended up selecting the downloadable version of the corpus from which I collected 117 discussion threads for my analysis.

Publications related to Kielipankki

Lahti, Emmi (2019). Maahanmuuttokeskustelun retoriikkaa. Doctoral dissertation. Helsinki: University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5707-2

Researcher of the Month: Mats Fridlund

Mats Fridlund
Photo: Mats Fridlund

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Professor Mats Fridlund tells us about his research that is related to digital history and supported by the tools and corpora available via Kielipankki in Finland and via Språkbanken in Sweden.

Who are you?

I am Associate Professor of History of Science & Ideas and Deputy Director of the Centre for Digital Humanities at the University of Gothenburg. By background I am a diploma engineer in Engineering Physics with a PhD in History of Technology from KTH Royal Institute of Technology in Stockholm. From 2013 to 2018 I worked in Finland as Associate Professor in the History of Industrialization at Aalto University.

What is your research topic?

As a historian of science, technology and innovation and an emerging digital historian, my research is focused on infrastructure history and on the political history of technology.

Within infrastructure history, I initially did research on the role of users in the development of electric power and telecommunications systems, while during the last couple of years I have broadened these interests towards digital infrastructures. I especially focus on how academic users such as historical researchers have changed their professional practices to take advantage of the affordances of new digital infrastructures such as those made possible by the Language Bank of Finland. Connected to this are also my most recent interests in digital humanities.

Since 2012 I have been involved in various efforts in Finland and Sweden to develop digital humanities in general and digital history in particular. I have been principal investigator of two Kone Foundation funded projects to develop and strengthen Finnish digital history (see Paju et al. 2020). Since 2019 I have been deputy director of the Centre for Digital Humanities at the University of Gothenburg, where I get several opportunities to put these interests into practice together with language technologists and engineers, developing new digital infrastructures for scholars in the humanities and social sciences and for the wider public.

My current research on the political history of technology is focused on the global history of technology of terrorism from the late 18th century until the present. I currently lead two research projects on the history of terrorism. The first, Things for living with terror: a global history of the materialities of urban terror and security, is funded by the Swedish Riksbankens Jubileumsfond. The second is the large research project Terrorism in Swedish politics (SweTerror): A multimodal study of the configuration of terrorism in parliamentary debates, legislation and policy networks in Sweden 1968–2018, which is part of the digital humanities DIGARV research program initiated by the Government of Sweden and financed by the Swedish Research Council, Riksbankens Jubileumsfond and the Royal Swedish Academy of Letters, History and Antiquities. In SweTerror I collaborate with the National Language Bank (Språkbanken) in Sweden to analyse and make digitally accessible the text and audio corpora of the political debates of the Swedish Parliament.

How is your research related to Kielipankki?

As a part of my research on the history of terrorism I use various large digital text corpora to analyse various media discourses to trace the historical emergence of terrorism as a political and cultural phenomenon. One of the projects that I am currently involved in is conducted together with language technologists from Swedish Språkbanken and with support from Swe-Clarin where we analyse historical Swedish-language newspaper corpora accessible through two national CLARIN B-centers: the National Language Bank (Nationella språkbanken) in Sweden and the Language Bank of Finland (Kielipankki) to determine how the modern meaning of terrorism emerged from the 18th century. This research is part of an initiative of Swe-Clarin to develop genuine interdisciplinary collaboration between researchers in humanities and language technology, using e-science tools for large-scale corpus studies. Thus, the project combines history domain knowledge and language technology expertise to evaluate and expand on earlier research claims regarding the historical meanings associated with terrorism in Swedish and Finnish contexts.

Primarily, we are interested in testing the hypothesis that sub-state terrorism’s modern meaning was not yet established in the 19th century but primarily restricted to Russian terrorism. Using a cross-border comparative approach we explore overlapping national discourses on terrorism. By using the Korp tool, installed in the Swedish as well as in the Finnish language banks, we have been able to efficiently investigate terrorism-related words and their historical contexts to show a more complex image of the history of terrorism in the Nordic countries, especially the meanings associated with salient state terrorism and various forms of ethnic sub-state terrorisms within Great Power empires, i.e. Finnish terrorism within the Russian empire, Macedonian terrorism within the Ottoman empire and Indian terrorism in the British empire. Together with Finnish historians of terrorism and language technologists, we are planning to extend the analysis to the wider Finnish context via the corresponding Finnish-language newspaper corpora in Kielipankki. Furthermore, the study allows us to develop the concrete practices of cross-border comparative studies by utilizing the extensive corpus resources of Swe-Clarin and FIN-CLARIN. There are great opportunities for researchers in the humanities and language technologists to conduct cross-disciplinary, comparative big data studies on national online newspaper corpora.

Kielipankki has also been important in my work on strengthening digital humanities research in Finland, not just through the tools it provides but also in other, less direct ways. In 2018, as Principal Investigator of the Kone Foundation project “From Roadmap to Roadshow: A collective demonstration & information project to strengthen Finnish digital history”, I organized a roadshow to the six Finnish universities of Oulu, Jyväskylä, Eastern Finland, Turku, Tampere and Helsinki. At each university we arranged a one-day digital history methods workshop with lectures and hands-on sessions led by experienced digital historians, language technologists and information technology specialists from Finland, Sweden and the United States. Among them was Kielipankki’s application specialist Tero Aalto, who contributed a much-appreciated lecture on “Digital Methods in Language Research”. The great enthusiasm that the roadshow lectures generated among Finnish historians led to an unplanned expansion and continuation of this project. In May 2018, together with my two postdoctoral researchers Mila Oiva and Petri Paju, I organized a workshop where we matched up historians curious about digital humanities with language technologists and information technology specialists to jointly explore, develop and conduct digital history research projects. In December 2020 several of these project ideas are published as peer-reviewed research articles in one of the first Open Access books of Helsinki University Press, Digital Histories: Emergent Approaches within the New Digital History, edited by myself together with Mila Oiva and Petri Paju.

Publications related to Kielipankki

Mats Fridlund, Leif-Jöran Olsson, Daniel Brodén & Lars Borin, 2019 “Trawling for Terrorists: A Big Data Analysis of Conceptual Meanings and Contexts in Swedish Newspapers, 1780–1926”, in Melvin Wevers, Mohammed Hasanuzzaman, Gaël Dias, Marten Düring, & Adam Jatowt, eds. Proceedings of the 5th International Workshop on Computational History (HistoInformatics 2019) co-located with the 23rd International Conference on Theory and Practice of Digital Libraries (TPDL 2019) Oslo, Norway, September 12th, 2019, CEUR-WS vol. 2461 (Aachen: CEUR-WS.org, 2019), 1-10. http://ceur-ws.org/Vol-2461/paper_5.pdf

Mats Fridlund, Leif-Jöran Olsson, Daniel Brodén & Lars Borin, 2020 “Trawling the Gulf of Bothnia of News: A Big Data Analysis of the Emergence of Terrorism in Swedish and Finnish Newspapers, 1780–1926”, in Costanza Navarretta & Maria Eskevich, eds. Proceedings of CLARIN Annual Conference 2020 (Virtual edition: CLARIN, 2020), 61-65. https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf

Mats Fridlund, Mila Oiva, & Petri Paju, eds., 2020 Digital Histories: Emergent Approaches within the New Digital History (Helsinki: Helsinki University Press, 2020), 3-18. https://doi.org/10.33134/HUP-5

Mats Fridlund, 2020 “Digital History 1.5: A Middle Way between Normal and Paradigmatic Digital Historical Research”, in Mats Fridlund, Mila Oiva, & Petri Paju, eds., Digital Histories: Emergent Approaches within the New Digital History (Helsinki: Helsinki University Press, 2020), 69-87. https://doi.org/10.33134/HUP-5

Paju, Petri & Mila Oiva. “Digitaalisen historiantutkimuksen opetuskiertue”, Historiallinen Aikakauskirja 1/2019, pp. 89-94.

Petri Paju, Mila Oiva & Mats Fridlund, 2020 “Digital and Distant Histories: Emergent Approaches within the New Digital History”, in Mats Fridlund, Mila Oiva, & Petri Paju, eds., Digital Histories: Emergent Approaches within the New Digital History (Helsinki: Helsinki University Press, 2020), 3-18. https://doi.org/10.33134/HUP-5

Researcher of the Month: Tommi Jauhiainen

Tommi Jauhiainen
Photo: Heidi Jauhiainen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tommi Jauhiainen works as a Project Planning Officer in Kielipankki and he is currently starting his two-year post doc. Here, Tommi tells us about his research related to some language resources in Kielipankki.

Who are you?

I am Tommi Jauhiainen, and at the moment I work as a Project Planning Officer in Kielipankki. At the beginning of 2021, I will start as a postdoc researcher on a grant from the Finnish Research Impact Foundation.

What is your research topic?

During the past ten years, my research has focused on language identification of text. On this topic, I completed my Master’s thesis in 2010 and my PhD dissertation in 2019. Language identification refers to the comparison of a text written in an unknown language to a set of given languages in order to determine which of them the text is written in. A similar method can also be used to classify texts by subject area, for example.

The difficulty of language identification varies greatly depending on the situation. The task is easy when there are only a few clearly different languages to choose from, such as Finnish and Swedish, and the texts are reasonably long, for example several sentences. When there are hundreds of languages to choose from, when the languages are close to each other (e.g. Kven and Meänkieli) and/or when the texts are short (e.g. single words only), it may be very difficult to identify the language.
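To make the idea of comparing a text against a set of given languages concrete, here is a minimal sketch in Python. It is not the method used in the research described here. It scores a text with smoothed character-trigram frequencies, which is one common baseline. The training sentences, the language codes and the function names are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram frequency table per language from the training text."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def identify(text, models, n=3):
    """Return the language whose model gives the text the highest smoothed log-probability."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing keeps n-grams unseen in training from zeroing out the score.
        scores[lang] = sum(
            math.log((counts[g] + 1) / (total + len(vocab) + 1))
            for g in char_ngrams(text, n)
        )
    return max(scores, key=scores.get)


# Hypothetical toy training data; a real system would be trained on large corpora.
SAMPLES = {
    "fin": "minä olen opiskelija ja asun helsingissä. tämä on suomen kieltä.",
    "swe": "jag är student och bor i helsingfors. det här är svenska språket.",
}
MODELS = train(SAMPLES)

print(identify("tämä on suomen kieltä", MODELS))
print(identify("det här är svenska", MODELS))
```

With only two clearly different languages and inputs of several words, even this toy model has plenty of evidence to work with. With closely related languages or single-word inputs, far fewer trigrams separate the candidates, which is why those settings are so much harder.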

Last year, our extensive survey of automatic language identification in texts was published in the Journal of Artificial Intelligence Research. We are also currently working on a textbook on the same topic. The book is expected to be published in the “Synthesis Lectures on Human Language Technologies” series by Morgan & Claypool in late 2021.

During and after my PhD research, I have participated in several international shared tasks that have focused on distinguishing between very close languages or dialects. In 2018, we won the shared tasks focusing on Swiss German dialects and Indo-Aryan languages, and last year we won a shared task focusing on different versions of Mandarin Chinese. I am also a member of the “Ancient Near Eastern Empires” Centre of Excellence, in which context I have studied how cuneiform texts written in different dialects of Akkadian and Sumerian could be distinguished from one another. I organized an international shared task on this topic last year, and the winner was a Canadian research team using deep learning.

In the forthcoming “Language Identification of Speech and Text” project, funded by the Finnish Research Impact Foundation, I will move towards the study of language identification in speech, in addition to text. Until now, the research fields of speech and text language identification have been relatively separate from each other, and my intention is to bring more collaboration between them.

How is your research related to Kielipankki?

Most of my PhD research was done in the Finno-Ugric Languages and Internet project, which was part of the FIN-CLARIN research group that maintains Kielipankki. In the project, we searched the Internet for websites written in small Uralic languages, created a portal site for them, and compiled sentence corpora from the texts they contained. During the processes of harvesting the web and creating the sentence corpora, we used automatic language recognition as part of the workflow. The portal site, Wanca, is now part of the tools maintained by Kielipankki, and the Wanca 2016 corpora can be found in Kielipankki in three different versions. The Wanca 2017 corpora are being used in the ongoing ULI (Uralic Language Identification) shared task, and the corpora will be published next year.

Publications related to Kielipankki

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2015). The Finno-Ugric Languages and the Internet project. In First International Workshop on Computational Linguistics for Uralic Languages: Proceedings of the Workshop (Vol. 2, pp. 87–98). (Septentrio Conference Series; Vol. 2015, No. 2). Septentrio Academic Publishing. https://doi.org/10.7557/scs.2015.2

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2015). Language Set Identification in Noisy Synthetic Multilingual Documents. In Computational Linguistics and Intelligent Text Processing (Vol. Part I, pp. 633-643). (Lecture Notes in Computer Science; Vol. 9041). Springer International Publishing AG. https://doi.org/10.1007/978-3-319-18111-0_48

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2016). HeLI, a Word-Based Backoff Method for Language Identification. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects: VarDial3, Osaka, Japan, December 12 2016 (pp. 153-162). https://www.aclweb.org/anthology/W16-4820/

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2017). Evaluation of language identification methods using 285 languages. In 21st Nordic Conference of Computational Linguistics: Proceedings of the Conference (pp. 183-191). (Linköping Electronic Conference Proceedings; No. 31). Linköping University Electronic Press. https://www.aclweb.org/anthology/W17-0221/

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). Iterative Language Model Adaptation for Indo-Aryan Language Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 66-75). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-3907

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). HeLI-based Experiments in Swiss German Dialect Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 254-262). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-3929

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2019). Wanca in Korp: Text corpora for underresourced Uralic languages. In Proceedings of the Research data and humanities (RDHUM) 2019 conference : data, methods and tools (pp. 21-40). Studia Humaniora Ouluensia; No. 17. University of Oulu.

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2019). Language Model Adaptation for Language and Dialect Identification of Text. Natural Language Engineering, 25(5), 561-583. https://doi.org/10.1017/S135132491900038X

Jauhiainen, T. (2019). Language identification in texts. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5131-5

Jauhiainen, T., Jauhiainen, H., Alstola, T., & Linden, K. (2019). Language and Dialect Identification of Cuneiform Texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 89-98). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1409/

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2019). Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 178-187). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1419/

Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2019). Automatic Language Identification in Texts: A Survey. Journal of Artificial Intelligence Research, 65, 675-782. https://doi.org/10.1613/jair.1.11675

Zampieri, M., Malmasi, S., Scherrer, Y., Samardžic, T., Tyers, F., Silfverberg, M. P., Klyueva, N., Pan, T-L., Huang, C-R., Ionescu, R. T., Butnaru, A., & Jauhiainen, T. S. (2019). A Report on the Third VarDial Evaluation Campaign. In Proceedings of the (pp. 1-16). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1401/

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2020). Building Web Corpora for Minority Languages. In Proceedings of the 12th Web as Corpus Workshop (pp. 23-32). The Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.wac-1.4

Gaman, M., Hovy, D., Ionescu, R. T., Jauhiainen, H., Jauhiainen, T., Linden, K., Ljubešić, N., Partanen, N., Purschke, C., Scherrer, Y., & Zampieri, M. (Accepted/In press). A Report on the VarDial Evaluation Campaign 2020. In Proceedings of VarDial 2020

Jauhiainen, T., Jauhiainen, H., Partanen, N., & Linden, K. (Accepted/In press). Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora. In Proceedings of VarDial 2020 https://arxiv.org/pdf/2008.12169.pdf

Lindgren, M., Jauhiainen, T., & Kurimo, M. (2020). Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets. In Proceedings of Interspeech 2020 (pp. 467-471) http://www.interspeech2020.org/uploadfile/pdf/Mon-1-11-5.pdf