Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Noora Hoffrén tells us about her PhD research on constructed action in Finnish Sign Language and Finnish language.
I am Noora Hoffrén, a sign language interpreter and a doctoral researcher. I am working on my PhD thesis at the Sign Language Centre (SLC) in the Department of Language and Communication Studies at the University of Jyväskylä.
The topic of my dissertation is showing by enacting, i.e. constructed action. When a speaker or signer is immersed in the role of another character and displays the character’s thoughts, speech, emotions or actions, he or she is constructing action. Constructed action is not always obvious or overt. Especially in signed languages, it is often so closely integrated into the language that it is hard to discern. In my research, I am studying constructed action in both Finnish Sign Language and Finnish. My dissertation is part of the ongoing ShowTell project at the University of Jyväskylä.
As my research data, I will use the Corpus of Finnish Sign Language, part of which is already available for download in Kielipankki (CFINSL). In addition to videos that are recorded from multiple angles, the database contains basic annotations and metadata. The fact that such a corpus exists allows us to study constructed action in the best possible way.
My aim is to collect a video corpus of spoken Finnish, parallel to the Finnish Sign Language material, and to deposit the corpus in Kielipankki. The Finnish video corpus will be collected in pairs from six native speakers of Finnish. The methods that are used to collect the material will be similar to those used to collect the Finnish Sign Language corpus, for example, using multiple cameras during filming sessions and using the same elicitation materials (e.g. ’The Snowman’ and ’Frog, Where Are You?’ picture books).
Hoffrén, Noora 2019. Kuvailevien viittomien ja konstruoidun toiminnan yhteispeli. Master’s thesis. University of Jyväskylä. Available: http://urn.fi/URN:NBN:fi:jyu-201910144419
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.
Maria Sarhemaa tells us about her research on the appellativization of first names in Finnish language. Online discussions are a fruitful source of data for studying informal or colloquial language use.
I am Maria Sarhemaa, a doctoral researcher in Finnish language at the University of Helsinki. Currently, I am working on my thesis on a grant from the Kone Foundation.
I am doing research on the appellativization of first names in Finnish language, i.e. words that typically belong to the informal registers of the language and originate from a first name. These include yrjö meaning ’vomiting’ and jonne meaning a certain kind of teenage boy, but there are also compound words with an appellativized first name as part of the word, such as baarimikko ‘bartender’. In my dissertation research, I am exploring appellativization as a linguistic phenomenon in Finnish, and in the sub-publications I will examine compound words with an appellativized part, the expressions uuno, tauno and urpo meaning ’stupid’, and the construction jonnet ei muista ‘teenagers cannot remember’.
I collected data from the Suomi24 corpus in Kielipankki for my article on uuno, tauno and urpo. The Suomi24 corpus is a fruitful source of data for my research topic, as appellativized expressions are used extensively, particularly in informal language, and the language used in Suomi24 is often colloquial. I have also collected data from the same corpus for my forthcoming article on the jonnet ei muista construction and for a study on the jonne appellative that I am conducting with Lasse Hämäläinen, PhD.
Hämäläinen, Lasse & Sarhemaa, Maria 2022: Jonnen jäljillä: Appellatiivisen jonnen alkuvaiheet verkkokeskusteluaineistojen valossa. Sananjalka 64, 255–269. https://doi.org/10.30673/sja.114194
Sarhemaa, Maria 2021: Tavan tauno uunoilee urpokaupungissa: Nimien Uuno, Tauno ja Urpo appellatiivistuminen ja appellatiivien käyttö Suomi24-keskustelupalstalla. Sananjalka 63, 103–129. https://doi.org/10.30673/sja.107278
Therese Lindström Tiedemann tells us about her research on Swedish as a second language. There is a definite need to continue developing Finland-Swedish corpora to ensure that Finland-Swedish is also included in future studies of the Swedish language.
My name is Therese Lindström Tiedemann and I am a university lecturer in the Swedish Language at the University of Helsinki. In addition to the Swedish language, I also work on general linguistics. I wrote my PhD thesis on the history of grammaticalisation as a concept in linguistics, i.e. within the history of linguistics.
In recent years, most of my research has been on Swedish as a second language. In my research I often use corpus linguistic methods. Together with colleagues, I have also tried to use crowdsourcing. I also do research on other topics such as grammaticalisation, the history of linguistics, the teaching of grammar and metalinguistic knowledge.
I have used Kielipankki’s resources mainly in connection with my research on Swedish as a second language and in the context of teaching. For instance, I have used the Swedish subcorpus of the Topling corpus. Currently, I am managing our faculty’s part of the Digisvenska project, where we are creating a text corpus from the Digital Matriculation Examination in B1-Swedish in Finland (Swedish as a second language, studied from year 6, or from year 7 in the old curriculum). We aim to study how the exam corresponds to the curriculum and how fair and transparent the test results are. Among other things, we will study how lexical breadth in the form of lexical variation (cf. vocabulary size) relates to scores and marks in the exams. We will also look at verb conjugation and adverbial clause modifiers, as well as linguistic accuracy, i.e. how close the language is to the norm.
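As a rough illustration of what a lexical variation measure can look like, here is a minimal Python sketch of the type-token ratio, i.e. the number of distinct word forms divided by the total number of words. This is only one simple measure among many and is not necessarily the one used in the Digisvenska project; the function name and the two short example answers are invented for illustration.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms divided by total word tokens (0.0 for empty text)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical learner answers, invented for illustration only.
essays = {
    "essay_a": "jag tycker om att läsa och jag tycker om att skriva",
    "essay_b": "på fritiden läser jag gärna romaner eller skriver egna dikter",
}

# essay_a repeats several words, essay_b repeats none,
# so essay_b gets the higher ratio.
ratios = {name: type_token_ratio(text) for name, text in essays.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

The plain ratio is sensitive to text length, so a comparison across exam answers of different lengths would probably need a length-corrected variant.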
A few years ago, I tried to study the Swedish word nog (lit. ‘enough’) using the Sinebrychoff corpus together with Jan Lindström. However, in the end the work needed to be done primarily with a more comprehensive text version of the corpus and not with the version available in Korp.
I also have a more general interest in the Swedish-language resources available in Kielipankki because of my research on Swedish and my teaching of students in Scandinavian languages, and since I often use corpus-based methods. This is why it is important for me to know which corpora I can recommend to students and how they can be used. There is definitely a need to continue developing Finland-Swedish corpora to ensure that we can describe Finland-Swedish (Sw. ”finlandssvenska”) in a similar way to how we can describe Swedish as spoken in Sweden (Sw. ”sverigesvenska”), and that Finland-Swedish is also included in future studies of the Swedish language. In the Finnish context, we can also see that some corpora contain both Finnish and Swedish. There is a need to consider the best way to study how and when Swedish is used in these corpora, and whether this is representative of how Swedish is used in these contexts in Finland. This applies, for example, to the corpus of parliamentary plenary sessions (Eduskunnan täysistunnot), where Swedish words are currently only tagged as foreign words. This limits the research that can be done on this part of the data. At the same time, we can clearly see that Swedish words dominate the list of words tagged as foreign in the plenary sessions. It would be interesting to see whether the Swedish parts could be annotated as Swedish, which would make it easier to study them from a Swedish perspective.
Besides the Swedish-language resources, I also have an interest in interoperability between different corpora and resources, transparency of research data and comparability between different sources for the Swedish language. Many of the Swedish language corpora are available via Språkbanken Text (Sweden), and we need to be able to compare the corpora at Kielipankki with these. I therefore see a need for information about how comparable these corpora are, and whether corpora in Kielipankki have been annotated in the same way. This is important to ensure that Finland-Swedish and other Swedish corpora located in Finland can be compared with Swedish corpora located in Sweden. This could give Finland-Swedish and second language Swedish (L2 Swedish) with Finnish as the first language (L1) a clear and fair place in research on Swedish and L2 Swedish in general.
As part of my work on corpora, my colleagues and I have also checked how well the automatic annotation works, especially on material produced by L2 speakers. We have checked the annotation of coursebook texts (written by L1 speakers but aimed at, or selected for, L2 learners), texts written by L2 learners, and texts written by L2 speakers and ”normalised” (e.g. with standardised spelling) to facilitate annotation, queries and comparisons. The results showed that texts written by learners are often annotated less accurately, but not always. Lemmatisation, word class tagging and sense disambiguation were good enough to be used in studies of L2 Swedish, even though sense disambiguation was more problematic than the first two. There were bigger problems with dependency analysis (cf. clause analysis, parsing), and multiword expressions also proved to be problematic, especially in learner writings. Still, multiword annotation was good enough to allow us to conclude that we can use it in our work, although the user should know that something may have been missed and that the multiword annotation is based on the expressions which are part of the Saldo lexicon, and on how they have been listed in Saldo. The results showed that sometimes there was disagreement regarding whether a preposition should be seen as part of the expression or not.
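One simple way to quantify how well an annotation layer works is to compare the automatic tags with a manually checked version, token by token. The Python sketch below illustrates that idea only; it is not the evaluation procedure used in the study, and the sentence, the tags and the function name are assumptions made for the example.

```python
def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of tokens whose automatic annotation matches the manual one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same tokens")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Toy sentence "jag läste en bok" ('I read a book'), invented for illustration.
gold_lemma = ["jag", "läsa", "en", "bok"]
pred_lemma = ["jag", "läsa", "en", "bok"]

gold_pos = ["PN", "VB", "DT", "NN"]
pred_pos = ["PN", "VB", "DT", "VB"]  # hypothetical tagging error on the last token

print("lemma accuracy:", accuracy(pred_lemma, gold_lemma))
print("word class accuracy:", accuracy(pred_pos, gold_pos))
```

Computing the score separately for each layer (lemma, word class, and so on) and each text type makes it possible to see where learner texts are annotated less reliably than L1 texts.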
I am very happy to see that more Swedish corpora have been added to Kielipankki in the last few years. I hope that in the future even more Swedish corpora will be added to Kielipankki, that they will be annotated in the same way as the Swedish corpora in Språkbanken Text (Sweden), and that information about the data will be made accessible in such a way that students and researchers can easily find comparable material and know how representative the material is for a certain type of language (e.g. a dialect, newspaper writings).
In the coming years I will be working on a project on pseudonymisation of linguistic data (Mormor Karl är 27 år). Pseudonymisation means that information such as names of people and places is replaced with pseudonyms in the data when this information might reveal who wrote the text. In this project we will study how pseudonymisation affects research data in the humanities. This is an important step in the work on open, reusable data, which is needed for reproducibility and for replication studies on data already collected, while at the same time protecting people’s identity.
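To make the idea concrete, the following minimal Python sketch replaces listed names with pseudonyms using whole-word matching. It is a deliberately naive illustration, not the method of the project; the mapping, the example sentence and the function name are all invented.

```python
import re


def pseudonymise(text: str, replacements: dict[str, str]) -> str:
    """Replace each listed name with its pseudonym, matching whole words only."""
    if not replacements:
        return text
    # Longest names first so that a longer name wins over a shorter one it contains.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], text)


# Hypothetical mapping and sentence, invented for illustration.
mapping = {"Anna": "Maria", "Turku": "Oulu"}
original = "Anna flyttade till Turku. Annas bror bor kvar."

print(pseudonymise(original, mapping))
# prints: Maria flyttade till Oulu. Annas bror bor kvar.
```

The genitive form Annas is left untouched, which shows why simple string replacement is not enough: inflected forms are missed, and the replaced text can end up inconsistent.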
In connection with the project which I have just finished together with Elena Volodina, University of Gothenburg (L2 profiles – Development of lexical and grammatical competences in immigrant Swedish), we have released a dataset with manual morphological annotation of lexemes which are present in materials aimed at learners of Swedish as a second language or produced by speakers of Swedish as a second language (CoDeRooMor). This resource has now been updated and will be released as part of the resource Swedish L2 profiles during 2023. Swedish L2 profiles is a resource where you can search for e.g. a word, a tense, a morpheme or a word formation pattern to see how it is used at different proficiency levels (according to CEFR, the Common European Framework of Reference for Languages, Council of Europe), both in course books for Swedish as a second language and in learner essays from different CEFR levels. The resources which we have created are part of Språkbanken Text (Sweden), but are or will be openly accessible.
I have also been involved in the development of an annotation tool in relation to research on Swedish (Legato) and in the use of the CALL platform Lärka for the teaching of syntactic functions, word classes and semantic roles. The CALL platform Lärka is something I have used in teaching grammar, which meant that I could give feedback to the developers from that perspective. Together with Volodina I have also used the platform to collect anonymous data to study what students often get right or wrong when they practise these categories, useful in connection to research on metalinguistic knowledge and the ability to analyse Swedish grammatically.
Apart from research related to Kielipankki’s resources and areas of interest, I am also the current project manager of Finland Swedish Online (FSO), an online course in Finland Swedish created at the University of Helsinki based on an Icelandic model (Icelandic Online). FSO is currently part of SAFMORIL, one of the K-Centres within CLARIN. One of my aims has been that FSO would not only support the learning of a language but also make it possible to study language acquisition, by tracing the development of learners in FSO if they grant access to that information. (Icelandic Online has done research on this based on their data.)
Alfter, D., Borin, L., Pilán, I., Lindström Tiedemann, T. & Volodina, E. 2019a. Lärka: From Language learning platform to infrastructure for research and language learning. In: Selected papers from the CLARIN Annual Conference 2018. Linköping: Linköping university press. 14pp. http://www.ep.liu.se/ecp/159/001/ecp18159001.pdf
Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2019b. LEGATO: A flexible lexicographic annotation tool. In: Hartmann, M. & Plank, B. (eds.), The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa): Proceedings of the conference. Linköping: Linköping University Electronic Press. pp. 382–388. http://hdl.handle.net/10138/306297
Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2021. Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts vs Non-Experts. Northern European Journal of Language Technology, 7 (1): 35pp. https://doi.org/10.3384/nejlt.2000-1533.2021.3128
Arnbjörnsdóttir, B., Friðriksdóttir, K., & Bédi, B. 2020. Icelandic Online: twenty years of development, evaluation, and expansion of an LMOOC. CALL for widening participation: short papers from EUROCALL 2020, 13.
Borin, L., Forsberg, M. & Lönngren, L. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4): 1191–1211. https://doi.org/10.1007/s10579-013-9233-4
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching and assessment. https://rm.coe.int/1680459f97
Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion Volume with new descriptors. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989
Council of Europe. 2020. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion volume. https://rm.coe.int/common-european-framework-of-reference-for-languages-learning-teaching/16809ea0d4
Friðriksdóttir, K. 2021. The effect of tutor-specific and other motivational factors on student retention on Icelandic Online. Computer Assisted Language Learning, 34(5-6), 663-684.
Lenardič, J., Lindström Tiedemann, T. & Fišer, D. 2018. Overview of L2 corpora and resources. CLARIN report. CLARIN ERIC. https://office.clarin.eu/v/CE-2018-1202-L2-corpora-report.pdf
Lindström, J. & Lindström Tiedemann, T. 2020. ”Ni minnes nog hvilka jag menar”: Subjektiva och intersubjektiva aspekter av modaladverbet nog. In: Lehti-Eklund, H. & Silén, B. (eds.), Handel med konst. Språk och dialog i Paul Sinebrychoffs brevsamling från sekelskiftet 1900. Helsinki: Svenska litteratursällskapet. pp. 293–323. http://hdl.handle.net/10138/315043
Lindström, J. & Lindström Tiedemann, T. 2018. Subjektivt och intersubjektivt nog: Om grammatikalisering och bruk i ljuset av Paul Sinebrychoffs brevväxling kring 1900. In: Lönnroth, H, Haagensen, B., Kvist, M. & Sandvad West, K. (eds.) Studier i svensk språkhistoria 14. Vaasa: University of Vaasa. pp. 180–197. http://hdl.handle.net/10138/243079
Lindström [Tiedemann], T. 2004. The History of the Concept of Grammaticalisation. Unpublished PhD thesis, University of Sheffield. https://etheses.whiterose.ac.uk/1437/
Lindström Tiedemann, T., Alfter, D. & Volodina, E. 2022. CEFR-nivåer och svenska flerordsuttryck. In: Björklund, S., Haagensen, B., Nordman, M. & Westerlund, A. (eds.), Svenskan i Finland 19. Vasa: Svensk-österbottniska samfundet. pp. 218–233. https://urn.fi/URN:ISBN:978-952-69650-5-5
Lindström Tiedemann, T., Lenardič, J. & Fišer, D. 2018. L2 learner corpus survey: towards improved verifiability, reproducability and inspiration in learner corpus research. CLARIN annual conference, Pisa. https://office.clarin.eu/v/CE-2018-1292-CLARIN2018_ConferenceProceedings.pdf
Lindström Tiedemann, T., Volodina, E. & Jansson, H. 2016. Lärka – ett verktyg för träning av språkterminologi och grammatik. LexicoNordica, 23: 161–181. https://tidsskrift.dk/lexn/article/view/111823
Prentice, J., Håkansson, C, Lindström Tiedemann, T., Pilán, I. & Volodina, E. 2021. Language learning and teaching with Swedish FrameNet++: two examples. In: Dannélls, D., Borin, L. & Friberg Heppin, K. (eds.), The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications. Amsterdam: Benjamins. pp. 303–329. https://doi.org/10.1075/nlp.14.12pre
Stemle, E. W., Boyd, A., Jansen, M., Lindström Tiedemann, T., Mikelić Preradović, N., Rosen, A., Rosén, D. & Volodina, E. 2019. Working together towards an ideal infrastructure for language learner corpora. In: Abel, A., Glaznieks, A., Lyding, V. & Nicolas, L. (eds.) Widening the Scope of Learner Corpus Research: Selected papers from the fourth learner corpus research conference. Louvain-la-Neuve: Presses universitaires de Louvain. http://hdl.handle.net/10138/311309
Volodina, E., Alfter, D., Lindström Tiedemann, T., Lauriala, M.S. & Piipponen, D. H. 2022. Reliability of Automatic Linguistic Annotation: Native vs Non-native Texts. In: Monachini, M. & Eskevich, M. (eds.), Selected papers from the CLARIN Annual Conference 2021. Linköping: Linköping University Electronic Press. pp. 151–167. https://doi.org/10.3384/ecp18914
Volodina, E., Mohammed, Y. A. & Lindström Tiedemann, T. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. Proceedings of the 23rd Nordic conference on computational linguistics (NoDaLiDa). Linköping. pp. 178–189. http://hdl.handle.net/10138/339476
Volodina, E. & Lindström Tiedemann, T. 2014. Evaluating students’ metalinguistic knowledge with Lärka. Swedish Language Technology Conference, Uppsala. http://hdl.handle.net/10138/347397
Marja-Liisa Helasvuo tells us about the digital language resources that have been compiled at the University of Turku. The collaboration with others in the same field has now evolved into a full-scale infrastructure of language data and resources.
I am Marja-Liisa Helasvuo, professor of Finnish language at the University of Turku. I studied Finnish language and general linguistics at the University of Helsinki, and I did my PhD in linguistics at the University of California, Santa Barbara. I have always been particularly interested in spoken language, and in my doctoral thesis I examined spoken Finnish from a crosslinguistic perspective.
My research has focused on grammar and human interaction. I have investigated a wide variety of data: everyday conversations between adults or between adults and children, online conversations, and other computer-mediated interactions. I have also studied written texts, from the oldest Finnish texts to more recent ones. I have explored a wide range of grammatical topics with the help of these resources.
I work at the Department of Finnish and Finno-Ugric Languages at the University of Turku. We have produced several digital corpora, starting from The Finnish Dialect Corpus of the Syntax Archive, whose compilation began in 1967. It is the first Finnish language corpus that has been directly compiled into a machine-readable format.
Since the Dialect Corpus, several others have followed: the Agricola Corpus, which contains all the works of Mikael Agricola from the 16th century, the Advanced Finnish Learners’ Corpus (LAS2) and the Corpus of Academic Finnish (LAS1). These are all grammatically coded and they are available in Kielipankki – the Language Bank of Finland (LAS1 will be available soon). In addition, we have produced several resources for Finno-Ugric languages. These materials have been collected in the Archive of Finnish and Finno-Ugric Languages. As we have produced many language resources in our organization, we also have many researchers who are interested in conducting corpus-based research. It’s always easy to ask a colleague for assistance when figuring out which corpus to use to study a particular topic.
Recently, we have been increasingly collaborating with the TurkuNLP research group. We established the UTU-Digilang infrastructure, which includes not only the Archive of Finnish and Finno-Ugric Languages, but also the Digilang portal, the Digilang long-term storage, and the TurkuNLP research group with its language resources and data tools. This collaboration has been very rewarding and I have learned a lot from it. I would like to see more collaboration of this kind in the future as well.
I have used language corpora in almost all my research. Many of these resources are available in Kielipankki.
I have been working on the ArkiSyn Corpus, which is available in Kielipankki. We received funding for the project from the Kone Foundation, which helped us to build a morphosyntactically annotated corpus. You can easily search it for all occurrences of a given word (e.g. all forms of the verb ajatella, ’think’) or all occurrences of a given grammatical form (e.g. all forms of the past tense).
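The two kinds of search mentioned above can be pictured with a small Python sketch over tokens that carry a lemma and a tense tag. The data structure, the tag values and the sample tokens are assumptions made for the example; they do not reflect the actual ArkiSyn annotation scheme or the Kielipankki search interface.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word as transcribed
    lemma: str  # dictionary form
    tense: str  # hypothetical tag: "past", "present", or "" for non-verbs


# Toy sample invented for illustration; not taken from ArkiSyn.
tokens = [
    Token("mä", "minä", ""),
    Token("ajattelin", "ajatella", "past"),
    Token("että", "että", ""),
    Token("ajattelen", "ajatella", "present"),
    Token("menin", "mennä", "past"),
]


def forms_of_lemma(tokens: list[Token], lemma: str) -> list[str]:
    """All word forms annotated with the given lemma."""
    return [t.form for t in tokens if t.lemma == lemma]


def forms_with_tense(tokens: list[Token], tense: str) -> list[str]:
    """All word forms annotated with the given tense."""
    return [t.form for t in tokens if t.tense == tense]


print(forms_of_lemma(tokens, "ajatella"))  # ['ajattelin', 'ajattelen']
print(forms_with_tense(tokens, "past"))    # ['ajattelin', 'menin']
```

Because the search runs on the annotation rather than on the surface string, it finds every inflected form of a verb without the user having to list the forms by hand.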
Recently, my research has focused in particular on different kinds of fixed expressions, which occur frequently and mostly in the same form. For example, the verb ajatella ’think’ is a very common verb in everyday Finnish conversation. It almost always occurs in the 1st person singular and in the past tense (ajattelin ’I thought’). When we compared the results of the corpus search with the corresponding passages in the audio recordings, we found that although the expressions were transcribed as ’I thought’, they were in fact phonetically quite eroded. In most cases, the expression occurred in the form maattet. The first person singular pronoun minä ‘I’ was reduced to an initial m, the first two syllables of the verb ’think’ (ajat) were fused together (aat), and a reduced form of the word että ’that’ was attached at the end. This type of phonetic reduction and crystallization of usage into a particular form is very common in fixed expressions.
In addition to ArkiSyn, I have also used the Suomi24 Corpus, the Agricola Corpus, The Finnish Dialect Corpus of the Syntax Archive and newspaper materials. The different corpora allow for different research topics.
Laury, Ritva, Marja-Liisa Helasvuo & Janica Rauma 2020. “When an expression becomes fixed: mä ajattelin että ‘I thought that’ in spoken Finnish”. – Ritva Laury & Tsuyoshi Ono (eds.), Fixed Expressions: Building language structure and social action, pp. 133–166. Pragmatics & Beyond New Series 315. Amsterdam: John Benjamins. DOI: http://dx.doi.org/10.1075/pbns.315.06lau
Helasvuo, Marja-Liisa 2019. “Free NPs as units”. Special issue “On the Notion of Unit in the Study of Human Languages”, guest editors Tsuyoshi Ono, Ritva Laury & Ryoko Suzuki. Studies in Language 43:2:301–328. DOI: http://dx.doi.org/10.1075/sl.16064.hel
Laury, Ritva & Marja-Liisa Helasvuo 2016. “Disclaiming epistemic access with ‘know’ and ‘remember’ in Finnish”. Special Issue on “Grammar and negative epistemics in talk-in-interaction”, guest editors Jan Lindström, Yael Maschler and Simona Pekarek Doehler. Journal of Pragmatics 106 (2016): 80–96. DOI: http://dx.doi.org/10.1016/j.pragma.2016.07.005
Helasvuo, Marja-Liisa & Aki-Juhani Kyröläinen 2016. “Choosing between zero and pronominal subject: Modeling subject expression in the 1st person singular in Finnish conversation”. Corpus Linguistics and Linguistic Theory 12(2):263–299. DOI: http://dx.doi.org/10.1515/cllt-2015-0066
Marjatta Palander tells us about her research on the dialects of the Karelian language. The Karelian language speech corpora that were compiled in her research projects will be available via Kielipankki.
I am Marjatta Palander, Professor Emerita of Finnish Language at the School of Humanities of the University of Eastern Finland. I am the leader of the recently finished research project KATVE (Migration and linguistic differentiation: Karelian in Tver and Finland), which was funded by the Academy of Finland.
During my career, I have done research mostly on the Eastern dialects of Finnish, but in the 2000s, I have also studied Karelian in two research projects. The FINKA project (2011–2014) focused on the dialects of Border Karelia. The KATVE project (2018–2022) investigated the differences and similarities between the dialects of Karelian in Border Karelia and Tver. These Karelian dialects are descended from the common Southern Karelian dialect of Karelian Proper, which was still spoken in the area of present-day Eastern Finland in the early 17th century. After the Swedish conquest of Eastern Finland, most of the Karelian-speaking population of the region fled to Russia, as far as Tver. Since then, the Karelians of Tver have lived without contact with other Karelians. In the KATVE project, we have examined the differentiation of dialects that has occurred over the course of around 350 years.
Our research concerns, among other things, features of sentence structure, possessive forms and vocabulary. We are also investigating to what extent people with a Border Karelian background and people with a Tver Karelian background can understand each other’s dialects. In my own research, I have examined Karelians’ linguistic awareness using folk-linguistic methods. In addition, I have investigated temporal variation in one Border Karelian idiolect, for which we have recordings spanning 17 years.
In the research projects of the 2010s and 2020s, we have compiled three Karelian language speech corpora, which include recorded dialect interviews and their transcriptions in Finno-Ugric (FU) transcription. The Border Karelian corpus (119 h) is based on interviews recorded in the 1960s and 1970s, preserved at the Institute for the Languages of Finland (Kotus). The Tver Karelian corpus 1957–1971 (approx. 30 h) was also compiled from recordings at the Institute for the Languages of Finland. The more recent Tver Karelian is represented in the Tver Karelian corpus 2016–2019 (approx. 15 h), which was compiled by researchers from the KATVE project and Karelian language students on our field trips. All the corpora have been submitted to the Language Bank in order to provide researchers with more electronic data on Karelian, which is an endangered minority language.
Palander, Marjatta 2015. Rajakarjalaistaustaisten ja muiden suomalaisten käsityksiä karjalasta. Virittäjä, 119(1), 34–66. Available: https://journal.fi/virittaja/article/view/41260
Palander, Marjatta & Mäkisalo, Jukka 2022. Reaaliaikatutkimus rajakarjalaisidiolektista. Virittäjä, 126(3), 339–368.
Palander, Marjatta & Riionheimo, Helka 2018. Miten Raja-Karjalan murre eroaa suomesta? Rajakarjalaistaustaiset pohjoiskarjalaiset kuuntelutestissä. Sananjalka, 60, 49–70. DOI: 10.30673/sja.69997
Riionheimo, Helka & Palander, Marjatta 2017. Rajakarjalainen kuuntelutesti: havainnoijina suomen kielen yliopisto-opiskelijat. Lähivõrdlusi/Lähivertailuja 27, 212–241. Eesti rakenduslingvistika ühing. Tallinn. DOI: 10.5128/LV27.07
Uusitupa, Milla, Koivisto, Vesa & Palander, Marjatta 2017. Raja-Karjalan murteet ja raja-alueiden kielimuotojen nimitykset. Virittäjä 121(1), 67–106. Available: https://journal.fi/virittaja/article/view/53121
Benjamin Schweitzer tells us about his research on the Finnish special language of art music. Corpus linguistics enables the researcher to study the topic from several points of view.
I am a German composer, translator and linguist (in biographical order). I studied composition, music theory and orchestral conducting – at the Sibelius Academy in Helsinki, among other places – and have since worked mainly as a freelance artist with some additional work as a lecturer and concert organiser. In the early 2000s, I started translating from Finnish to German – mostly historical and musicological non-fiction, but also opera librettos and short stories.
In my forties, I entered a second career path and studied Fennistics and Scandinavistics in Greifswald and Tartu. When I received my MA degree in 2018, I already had the feeling that this wouldn’t be the end of my linguistic ambitions. I was very happy to get the opportunity to continue soon afterwards with a PhD project: I am now employed as a researcher at the Department of Finnish Studies of the University of Greifswald and working on my PhD thesis within the framework of an International Research Training Group called Baltic Peripeties. My supervisor is Professor Marko Pantermöller.
I am researching the Finnish special language of art music from several points of view. My first aspect is historical-systematic: I am trying to show how a special language emerged for a field which, as a cultural practice, was itself imported to Finland. What happened spontaneously and what came about as the result of language planning and maintenance? Which terms were adapted, where did the language community succeed in inventing ”originally” Finnish words, and which structural problems had to be overcome in the process?
The second aspect concerns the transition from terms to texts, from words to narration: Which challenges did Finnish critics and musicologists face when writing about music in Finnish? Which models did they follow, and are there structurally ”typically Finnish” ways to write about music?
The third and most complex aspect is a discourse-linguistic approach: What kind of intertextual relations can be found in Finnish texts about (Finnish) music? How does this discourse reveal national auto- and heterostereotypes? And how is art music as a core element of Finnish ”cultural identity” reflected in the writing about music since the beginning of the 20th century?
Corpus linguistics plays an important role in my research, even though I am probably employing a somewhat nonstandard approach. Within the official taxonomy, my research might qualify as corpus-based or corpus-oriented, but I would maybe prefer the attribute corpus-aware. In my research, I am mainly looking at longer passages or even entire texts, from which I extract key words, collocations and discourse-semantic frames. This means that my analytical approach is clearly qualitative. Nevertheless, if I want to find out when and in which context certain key words or concepts first appeared, how they were distributed diachronically and how big or small their impact was, I also need to look at bulk material from a quantitative angle.
This is where Kielipankki enters. I mainly use the Newspaper and Periodical Corpus of the National Library of Finland (KLK), which contains not only a huge collection of daily papers up to the mid-20th century but also early music journals, an invaluable source. Basically, I use corpus analysis to test, back up and extend research hypotheses which often arise from one single finding in a text, or even an ”I know that there must be something somewhere around here” gut feeling. That can, to name a concrete example, be a question like ”since when has the co-occurrence of ’Sibelius’ and ’alkuvoima’ appeared? Does the corpus provide evidence for the assumption that it became a fixed collocation, and if so, when?”
To this end, I mainly use the extended search tool (Korp) to identify co-occurrences in comparatively large samples (paragraphs), because a simple left/right-neighbour search wouldn’t reveal much – especially not in the complicated syntax of early modern Finnish writing on music, which is often closer to literary works than to factual non-fiction style. The corpus excerpts can then be used for further investigation, e.g. for qualitative data analysis, but sometimes also to generate new hypotheses. I have to admit it happened more than once that I found a needle in a haystack – e.g. an interesting text that I might have overlooked otherwise – by browsing through my corpus search results.
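The idea of paragraph-level co-occurrence can be illustrated with a minimal Python sketch. It is not how Korp itself is queried; it only assumes that paragraphs have already been exported as (year, text) pairs, and it uses a crude prefix match as a stand-in for proper lemma search. The function name and the sample sentences are invented for illustration.

```python
from collections import Counter

def cooccurrence_by_year(paragraphs, stem_a, stem_b):
    """Count, per year, the paragraphs in which both stems occur.

    `paragraphs` is an iterable of (year, text) pairs. Matching is a
    case-insensitive prefix match on whitespace-separated tokens, so that
    inflected forms such as "alkuvoimaa" also match the stem "alkuvoima".
    """
    a, b = stem_a.lower(), stem_b.lower()
    counts = Counter()
    for year, text in paragraphs:
        # Strip surrounding punctuation from each token before matching.
        tokens = [t.strip('.,;:!?"”’()').lower() for t in text.split()]
        has_a = any(t.startswith(a) for t in tokens)
        has_b = any(t.startswith(b) for t in tokens)
        if has_a and has_b:
            counts[year] += 1
    return dict(sorted(counts.items()))

# Invented sample paragraphs, not taken from any corpus.
sample = [
    (1905, "Sibeliuksen uusi sinfonia esitettiin eilen."),
    (1912, "Sibeliuksen musiikissa kuuluu luonnon alkuvoima."),
    (1912, "Konsertti oli loppuunmyyty."),
    (1921, "Alkuvoimaa ja karuutta: Sibelius johti itse teoksensa."),
]
print(cooccurrence_by_year(sample, "Sibeliu", "alkuvoima"))
```

Counting whole paragraphs rather than adjacent words is what allows the two terms to be far apart in a long sentence and still be registered as occurring together.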
Schweitzer, Benjamin 2019. Musikinstrumentenbezeichnungen im Finnischen: Historisch-systematischer Überblick, Varianten und Verstetigung. MA thesis. Universität Greifswald. Available: urn:nbn:de:gbv:9-oa-000003-2
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikko Laitinen tells us about his recent work on social media datasets, which also allow researchers to explore social networks.
I am Mikko Laitinen, professor of English Language and Culture at the School of Humanities at the University of Eastern Finland and one of the PIs of the national Digital Humanities research infrastructure consortium, FIN-CLARIAH.
I am a sociolinguist, which means that I am interested in the use of language in different situations and as a social phenomenon. As a researcher, I have worked with small and structured corpora as well as with large and computationally intensive mass data, but always with some background variables through which language use has been examined. The corpora have been both synchronic snapshots and diachronic cross-sections through time.
Recently, my research team has been working a lot with various Twitter datasets. We are now building a large, representative and continuously updated benchmark corpus that follows language use in near real time on this social media platform. This kind of ”digital observatory”, which offers us a means to monitor language use in society, is useful, for example, as a background for language policy discussions. What is more, if it is combined with illustrative visualisations in a more comprehensible format, it may also increase people’s interest in language research in general. Twitter is an interesting resource because, despite its limited text length, it has extremely rich metadata that allow us to explore people’s language use in social networks, for example.
I think it is great that we have all these resources collected and accessible in one place and through one easy-to-use interface. This is a great service for students and researchers! I have personally used the English language resources the most, including the COHA and COCA corpora, and I have downloaded the English lingua franca corpus (ELFA) to my own computer. I also occasionally check the Suomi24 corpus for some interesting phenomena.
Laitinen, Mikko. 2020. Empirical perspectives on English as a lingua franca (ELF) grammar. World Englishes 39:3, 1–16. DOI: 10.1111/weng.12482
Laitinen, Mikko, Masoud Fatemi & Jonas Lundberg. 2020. Size matters: Digital social networks and language change. Frontiers in Artificial Intelligence 3:46. DOI: 10.3389/frai.2020.00046
Laitinen, Mikko. 2018. Placing ELF among the varieties of English: Observations from typological profiling. In Sandra Deshors (ed.), Modelling World Englishes in the 21st Century: Assessing the Interplay of Emancipation and Globalization of ESL varieties, 109–131. Amsterdam: John Benjamins. DOI: 10.1075/veaw.g61.05lai
Laitinen, Mikko & Magnus Levin. 2016d. On the globalization of English: Observations of subjective progressives in present-day Englishes. In Elena Seoane & Cristina Suárez-Gómez (eds.), World Englishes: New Theoretical and Methodological Considerations, 229–252. (Varieties of English around the World G57). Amsterdam: John Benjamins. DOI: 10.1075/veaw.g57.10lai
Lundberg, Jonas & Mikko Laitinen. 2020b. Twitter trolls: a linguistic profile of anti-democratic discourse. Language Sciences 79. DOI: 10.1016/j.langsci.2019.101268
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Filip Ginter tells us about his work with the TurkuNLP research group.
I am Filip Ginter and I am an associate professor of language technology at the University of Turku. I am also presently the longest-serving member of the TurkuNLP research group. I am a computer scientist by training, profoundly enjoying the many unique challenges human language poses.
Blessed with neither patience nor a long attention span, I have managed to dip into quite a few research topics over the years with our TurkuNLP team. We started off with scientific literature mining, but then branched into more general development of various NLP tools and resources. I’ve always had a soft spot for Finnish and chose to contribute especially to Finnish NLP, perhaps to give back to the society which so generously hosted me for my PhD research. My personally most important – or at least most visible – undertaking was the Turku Dependency Treebank, which later on became one of the first treebanks in the super-successful Universal Dependencies (UD) initiative and allowed TurkuNLP to be an important member of the UD community from Day 1. The treebank was also the basis for the relatively widely used line of statistical dependency parsers for Finnish from TurkuNLP. I am proud that this work helped to bring Finnish into the results tables of ACL papers and to close the gap to much more studied languages, at least in terms of parsing accuracy.
Recently, I of course could not help but jump on board the deep learning tsunami. TurkuNLP’s previous work on crawling the Finnish Internet and gathering billions of words of Finnish paid off when it became a crucial part of the training corpus of the FinBERT model. If you have recently done any machine learning on Finnish, it is quite likely you used this model to squeeze out those extra few percentage points of accuracy. The story of FinBERT is a story of having plenty of language data ready at the right moment and shows the importance of gathering and maintaining language resources. You never know when you will next need a few billion words of Finnish.
And where do I go from here? I see it as my goal to bring to Finnish, one way or another, most of the tools, tasks, and resources that the bigger languages have. Think about question answering, summarization, semantic search, paraphrase models and many other NLP tasks not yet properly covered for Finnish. If they can exist for English, then they should also exist for Finnish. We are living in exciting times in NLP and now we have many more opportunities to make it happen than we had even five years ago. And of course, with the LUMI supercomputer around the corner, you can expect new exciting language models from the TurkuNLP workshop.
Apart from these more or less mainstream NLP projects, I have had several I dare say successful collaborations in the field of digital humanities, in particular with the historians. I enjoyed these projects as they challenged us with interesting technical and algorithmic problems to solve.
Perhaps my most visible contribution to the Language Bank is the Finnish dependency parser (of course there were many of us working on it in TurkuNLP), which is used by the Language Bank to make data more accessible to researchers. The most recent version of the parser brings about a substantial improvement in accuracy on all levels of analysis. One day, when the legislation catches up with present-day language technology needs, I hope to see our Internet Parsebank and other large-scale web-based data contributed to the Language Bank as well.
Naturally, we have used the Language Bank’s resources extensively here in TurkuNLP, perhaps most of all the Suomi24 corpus, in various research projects as well as in language model training. We have also benefited enormously from the Newspaper and Periodical OCR Corpus of the National Library of Finland in our work with the historians.
I cannot stress enough how important it is for Finnish NLP that we all contribute open datasets and free tools and models to the Language Bank and also maintain our edge in terms of computational resources, with LUMI being the perfect example.
J. Kanerva & F. Ginter & S. Pyysalo 2020. Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task. Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies. DOI: 10.18653/v1/2020.iwpt-1.17
J. Kanerva & F. Ginter & T. Salakoski 2020. Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks. Natural Language Engineering. DOI: 10.1017/S1351324920000224
J. Kanerva & F. Ginter & N. Miekka & A. Leino & T. Salakoski 2018. Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. DOI: 10.18653/v1/K18-2013
A. Vesanto & A. Nivala & T. Salakoski & H. Salmi & F. Ginter 2017. A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). https://aclanthology.org/W17-0249
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampsa Holopainen tells us about his research on the history of the Uralic languages.
My name is Sampsa Holopainen, and I am a researcher of the history of the Uralic languages. I am currently working as a recipient of an APART-GSK Fellowship of the Austrian Academy of Sciences at the Finno-Ugrian department of the University of Vienna. I completed my doctoral studies at the University of Helsinki; my PhD defence was in December 2019.
My current research topic is the history of Hungarian or, more widely, the history of the Ugric languages (also including Khanty and Mansi): historical phonology, etymology and loanword research. I am investigating these topics in my current project (2021–2023) Hungarian historical phonology reexamined (with special focus on Ugric vocabulary and Iranian loanwords). In my earlier work I have done research on the etymology of the other Uralic languages too, especially on the Indo-Iranian and other Indo-European lexical influence on the various Uralic languages. In 2019–2021, I worked with Finnic etymology in particular in the project Suomen vanhimman sanaston etymologinen verkkosanakirja (The digital etymological dictionary of the oldest vocabulary of Finnish) at the University of Helsinki. This project is led by Dr. Santeri Junttila and funded by the Kone Foundation.
As a part of my current project I am developing an etymological database of the shared vocabulary of Hungarian, Khanty and Mansi (the vocabulary traditionally reconstructed into the Ugric proto-language) and of the early Iranian loanwords of Hungarian; the database is built into the Sanat-wiki that is maintained by Kielipankki. These vocabulary layers are investigated critically and the results are presented in word-articles, and the database will also later include tables illustrating the developments of historical phonology. The database forms only part of my current research work, but it gives a good opportunity to publish research results and observations quickly and openly.
My database is based on a much larger etymological database of the Finnic languages, which has been developed in Santeri Junttila’s project Suomen vanhimman sanaston etymologinen verkkosanakirja (The digital etymological dictionary of the oldest vocabulary of Finnish). Docent Petri Kallio, MA Juha Kuokkala and MA Juho Pystynen have also worked in this project. The project is still active, but I am no longer involved in it as a full-time researcher. I think that this project is especially significant, as it has produced the excellent Wiki-database of etymology that has served as the basis of further projects on etymology, such as my own current project at the University of Vienna. The Wiki-database makes it easy to update research results and forms a good platform for researchers to communicate.
Holopainen, Sampsa 2022: Uralilaisen lingvistisen paleontologian ongelmia – mitä sanasto voi kertoa kulttuurista? – Kaheinen, Kaisla & Leisiö, Larisa & Erkkilä, Riku & Qiu, Toivo E.H. (toim.), Hämeenmaalta Jamalille: kirja Tapani Salmiselle 07.04.2022. Helsinki: Helsingin yliopiston kirjasto. 101–114. DOI: 10.31885/9789515180858.9
Holopainen, Sampsa 2021: On the question of substitution of palatovelars in Indo-European loanwords into Uralic. – Suomalais-Ugrilaisen Seuran Aikakauskirja 98. 197–233. DOI: 10.33340/susa.95365
Junttila, Santeri & Holopainen, Sampsa & Pystynen, Juho 2020: Digital Etymological Dictionary of the Oldest Vocabulary of Finnish. – Rasprave 46, 2. 733–747. DOI: 10.31724/rihjj.46.2.15
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jack Rueter tells us about his research on morpho-syntactic description of minority languages.
I am Jack Rueter, a principal investigator in Digital Humanities at the University of Helsinki and a Project Researcher in Finnish and Finno-Ugric Languages at the University of Turku working with contextual disambiguation of corpora, annotated manually and using rule-based systems. At the age of seventeen, I spoke my first words of Finnish, and from there have endeavored to acquire a working knowledge in several other non-English languages.
During my studies and subsequent research of Uralic and other minority languages, I have gradually expanded my comprehension of using language-technological tools and practices for the enhancement of fundamental work in linguistics. Although I began my first finite-state description of Komi-Zyrian a quarter of a century ago, which I followed with parallel and corpus work for the Erzya language at the beginning of this millennium, it is the last decade that has seen ambitious collaboration in the description of languages in several branches of the Uralic language family and beyond. These descriptions have centered on the study of lexica, rich yet regular morphology, syntax and the idea that useful language documentation might be facilitated in the development of tools and learning environments for multilingual application.
My work with the Komi-Zyrian language began while taking a course at the University of Helsinki in the early nineties. Our teacher, E. Cypanov, offered us lessons based on materials he had written in Russian – no Komi-Finnish or Komi-English dictionaries were available at the time, so I undertook the translation of his glossary into a small trilingual Komi-English-Finnish word list, which I was able to proofread and expand with a scholarship from the Alfred Kordelin Foundation. At the time, such word lists were seen as a fundamental point of development for finite-state descriptions, and as such I was able to begin my modeling of a finite-state description for Komi-Zyrian with advice from Professor Kimmo Koskenniemi on a Unix system in 1995.
From 1996 until 2004, I spent a large part of my time among the Komi, the Erzya and the Moksha. During this time, I taught Finnish at the Mordovian State University in Saransk, Mordovia – about 600 kilometers east-southeast of Moscow. There, in addition to language instruction, I began collecting and digitizing Mordvin language literature, learning the two literary languages and developing relations with professional language users and native speakers. These personal contacts have contributed to my knowledge of the languages and provided me with native-language descriptions of the languages, elementary to their adequate documentation. This was also a time to become familiar with other languages spoken in Russia as well as to foster affiliations with language research at the Universities of Turku and Tromsø.
Upon leaving my teaching position in Saransk, I immediately became involved in work with the open-source infrastructure, Giellatekno, in Tromsø. Trond Trosterud and his colleagues were interested in my work with Komi and wanted to include it in the development of their Barents and Circum-polar language-technology development. Needless to say, I acquiesced, and open-source Komi became another piece of the puzzle for extensive dictionary and morphology work in my collaboration from Helsinki, where I began my postgraduate studies. Language technology definitely played a strong role in the categorization of morphological phenomena in the Erzya language, a forerunner to what I documented in my dissertation in 2010 and what I would greatly expand upon in subsequent work funded by the Kone Foundation and under the auspices of its «Language Programme» (2012–2021).
The Language Programme saw extensive pilots and projects for digitizing endangered materials from the 1920s–1940s for Finnish kindred languages in Fenno-Ugrica at the National Library of Finland. Preparation for and continued work with these materials helped pave the way to extensive work with lexica and morphology in Olonets-Karelian, Livonian, Hill Mari, Moksha and Tundra Nenets. The success in these, of course, was due largely to the team of language specialists involved and previous documentational work done on the languages. As open-source projects, the language documentation projects also made use of open Helsinki Finite-State Technology (HFST) and open infrastructure for Saami language-technology research (Giellatekno) and tool implementation (Divvun) in Tromsø, Norway (Giella). It was experience with these technologies which I applied to other minority languages, such as Ingrian, Skolt Saami, Meadow Mari, Udmurt, Võro, Komi-Permyak, Mansi, even Apurinã on the Amazon and Lushootseed in the Pacific Northwest. The resulting tools were online morphology-savvy dictionaries, e.g. Olonets-Karelian, Skolt Saami, Erzya and Moksha, and intelligent computer-assisted language learning (ICALL), such as Skolt Saami Nuõrti, which follows the lead of ICALL for Northern Saami Davvi. The tools also included something for everyday writing: spell checkers at Divvun.
Lexicon and morphology only really make sense if you can apply them to a broader usage – syntax and meaningful usage, for example, translation. Thanks to Anssi Yli-Jyrä, I became involved in the Universal Dependencies project in the late 2010s. It was here that I debuted with a treebank for Erzya and subsequently developed work on Moksha, Komi-Zyrian, Komi-Permyak, Skolt Saami and Apurinã, with meaningful collaboration from Helsinki, Turku, Oulu, Saransk, Syktyvkar, Tromsø, Tartu, Göttingen, Belém and Bloomington. Work with treebanks can, on the one hand, be considered a means of making language documentation available to multiple user types, and, on the other hand, it serves as an open repository for development in Constraint Grammar disambiguation, function and dependency work after morphological analysis. A driving force behind meaningful morphosyntax takes me to Apertium and shallow-transfer translation modeling for closely related languages.
Apertium started out with translation between related language forms such as Catalan and Spanish. This initially involved conversion of lexicon from source to target, the subsequent transfer of morphological information, and finally an adaptation of the resulting source syntax to target syntax and idioms. The idea of being able to translate between closely related languages on the basis of the shallow transfer of regular morphological categories and information describes a tool that, in addition to facilitating informative reference translation, might also be used in measuring the distance between language forms through documented lexical, morphological, syntactic and idiomatic convertibility. The development of shallow-transfer tools for the triangle (Northern Dvina) Karelian, Olonets-Karelian and Finnish, for example, has led to dictionary development correlating to finite-state morphology in the Giella infrastructure applied at Akusanat and Google Summer of Code through Apertium. Upcoming language pairs might include work with the Mordvin languages Erzya and Moksha, which have recently enjoyed a lot of support through work in the Digilang project at the University of Turku.
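The first two stages of shallow transfer described above (lexical conversion and transfer of morphological information) can be sketched in a few lines of Python. This is a toy illustration only: the three lookup tables, the tag names and the `*`/`#` failure markers are assumptions made for this sketch, not Apertium's actual data formats or API, and the third stage (adapting syntax and idioms) is left out entirely. The example words are Spanish nouns and their Catalan equivalents.

```python
# Toy analyser: source surface form -> (source lemma, morphological tags)
ANALYSES = {
    "gato": ("gato", ("n", "m", "sg")),
    "gatos": ("gato", ("n", "m", "pl")),
    "casa": ("casa", ("n", "f", "sg")),
    "casas": ("casa", ("n", "f", "pl")),
}

# Toy bilingual lexicon: source lemma -> target lemma
BILINGUAL = {"gato": "gat", "casa": "casa"}

# Toy generator: (target lemma, tags) -> target surface form
GENERATION = {
    ("gat", ("n", "m", "sg")): "gat",
    ("gat", ("n", "m", "pl")): "gats",
    ("casa", ("n", "f", "sg")): "casa",
    ("casa", ("n", "f", "pl")): "cases",
}

def translate(words):
    """Translate word by word: analyse, swap the lemma, regenerate."""
    output = []
    for word in words:
        analysis = ANALYSES.get(word.lower())
        if analysis is None:
            output.append("*" + word)  # unknown source word
            continue
        lemma, tags = analysis
        target_lemma = BILINGUAL.get(lemma)
        # The morphological tags are carried over unchanged; only the
        # lemma is replaced before the target form is generated.
        surface = GENERATION.get((target_lemma, tags))
        output.append(surface if surface is not None else "#" + lemma)
    return output

print(translate(["gatos", "casas", "perro"]))
```

Because the tags pass through unchanged, the approach only works where the two languages share their morphological categories, which is why it suits closely related language pairs. The share of words that survive all lookups gives a rough sense of how convertible one language form is into the other.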
At the end of the last millennium, I began collecting Moksha, Erzya and Komi literature with releases from the authors and publishers for compilation and research study in the University of Helsinki Language Corpus Server (UHLCS), which has since been incorporated into the Language Bank of Finland materials at Kielipankki. FIN-CLARIN has provided me with time and resources for validating older UHLCS materials and coaching with work in newer corpora development and educational materials. This has meant that I have had the opportunity to bring my own ERME materials for Erzya and Moksha to the Korp server as well as parallel Biblical verses of Uralic languages with Erik Axelson, Pabivus (thanks to the Bible Translation Institute). At present, work is underway to introduce Universal Dependencies corpora of Finno-Ugric languages to the Korp server. Hopefully, my work in Mordvin syntax at the University of Turku will soon also contribute to the quality of the minority-language corpora at Kielipankki. More accurate morphological analysis with rule-based, contextually derived syntactic readings helps bring speech-to-text and text-to-speech technology closer to lesser-documented minority languages.
Rueter, J., Partanen, N., Hämäläinen, M., & Trosterud, T. (2021). Overview of Open-Source Morphology Development for the Komi-Zyrian Language: Past and Future. In Proceedings of the Seventh International Workshop on Computational Linguistics of Uralic Languages (pp. 62–72). The Association for Computational Linguistics. https://aclanthology.org/2021.iwclul-1.4.pdf
Hämäläinen, M., Rueter, J., & Alnajjar, K. (2021). Documentação de línguas ameaçadas na era digital. Linha D’Água, 34(2), 47-64. https://doi.org/10.11606/issn.2236-4242.v34i2p47-64
Rueter, J., Hämäläinen, M., & Partanen, N. (2020). Open-Source Morphology for Endangered Mordvinic Languages. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS) (pp. 94–100). The Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.nlposs-1.13
Hämäläinen, M., Alnajjar, K., Rueter, J., Lehtinen, M., & Partanen, N. (2021). An Online Tool Developed for Post-Editing the New Skolt Sami Dictionary. In I. Kosem, M. Cukr, M. Jakubíček, J. Kallas, S. Krek, & C. Tiberius (Eds.), Electronic lexicography in the 21st century (eLex 2021). Proceedings of the eLex 2021 conference (pp. 653-664). Lexical Computing CZ s.r.o. Available: https://elex.link/elex2021/wp-content/uploads/2021/08/eLex_2021_42_pp653-664.pdf
Rueter, J., Pereira de Freitas, M. F., Facundes, S., Hämäläinen, M., & Partanen, N. (2021). Apurinã Universal Dependencies Treebank. In M. Mager, A. Oncevay, A. Rios, I. V. Meza Ruiz, A. Palmer, G. Neubig, & K. Kann (Eds.), Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas (pp. 28-33). The Association for Computational Linguistics. DOI: 10.18653/v1/2021.americasnlp-1.4
Rueter, J. (2020). Корпус национальных мордовских языков: принципы разработки и перспективы функционирования/ действия. In ФИННО-УГОРСКИЕ НАРОДЫ В КОНТЕКСТЕ ФОРМИРОВАНИЯ ОБЩЕРОССИЙСКОЙ ГРАЖДАНСКОЙ ИДЕНТИЧНОСТИ И МЕНЯЮЩЕЙСЯ ОКРУЖАЮЩЕЙ СРЕДЫ (pp. 118-127). Издательский центр Историко-социологического института. https://www.researchgate.net/publication/342869938_Corpus_of_the_national_languages_Erzya_and_Moksha_priciples_of_development_and_perspectives_of_functionactionKorpus_nacionalnyh_mordovskih_azykov_principy_razrabotki_i_perspektivy_funkcionirovania_dej
Rueter, J. (Author), & Axelson, E. (Author). (2020). Raamatun jakeita uralilaisille kielille, rinnakkaiskorpus, sekoitettu, Korp [tekstikorpus]. Software, Kielipankki. Available: http://urn.fi/urn:nbn:fi:lb-2020021119
Rueter, J., Partanen, N., & Ponomareva, L. (2020). On the questions in developing computational infrastructure for Komi-Permyak. In T. A. Pirinen, F. M. Tyers, & M. Rießler (Eds.), Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages (pp. 15–25). The Association for Computational Linguistics. DOI: 10.18653/v1/2020.iwclul-1.3
Rueter, J. M. (2020). Linguistic Distance between Erzya and Moksha. Dependent Morphology. In Е. Ф. Клементьева, Т. И. Мочалова, & И. Н. Рябов (Eds.), ФИННО-УГОРСКИЕ ЯЗЫКИ В СОВРЕМЕННОМ МИРЕ: ФУНКЦИОНИРОВАНИЕ И ПЕРСПЕКТИВЫ РАЗВИТИЯ: Материалы Всероссийской научно-практической конференции, посвященной 95-летию заслуженного деятеля науки РФ, доктора филологических наук, профессора Цыганкина Дмитрия Васильевича (pp. 90-110). МГУ им. Н. П. Огарёва. Available: http://hdl.handle.net/10138/330042
Rueter, J., Partanen, N., & Pirinen, T. A. (2021). Numerals and what counts. In M. D. Lhoneux, & R. Tsarfaty (Eds.), Fifth Workshop on Universal Dependencies : Proceedings (pp. 151–159). The Association for Computational Linguistics. Available: https://aclanthology.org/2021.udw-1.13
Rueter, J., & Hämäläinen, M. (2020). Prerequisites For Shallow-Transfer Machine Translation Of Mordvin Languages: Language Documentation With A Purpose. In Материалы Международного образовательного салона (pp. 18-29). Ижевск: Институт компьютерных исследований. Available: http://hdl.handle.net/10138/325962
Rueter, J. M. (Accepted/In press). Mordva. In R. Valijärvi & D. Abondolo (Eds.), The Uralic Languages. Routledge.
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mika Hämäläinen tells us about his research on computational creativity and developing language technology for endangered languages.
I am Mika Hämäläinen, a postdoctoral researcher at the Department of Digital Humanities at the University of Helsinki. In 2020, I finished my PhD thesis on computational creativity with the title Generating Creative Language: theories, practice and evaluation. The title describes my research interests well, as I am interested not only in the technical implementation of language technology models, but also in their relation to theories and real-world phenomena. Open source code and publishing research results as tools that are as easy to use as possible are very important to me.
I have researched computational creativity as well as language technology for endangered languages and for non-standard languages such as dialects and historical language forms. Computational creativity is a challenging research topic from the perspective of Artificial Intelligence (AI), as the aim is to develop computational models that are capable of producing new creative texts such as poetry (Hämäläinen & Alnajjar, 2019) or humour (Alnajjar & Hämäläinen, 2021). A machine shouldn’t just be able to output new text, but also be able to interpret its output on some meaningful level. For this purpose, we have developed analysis tools, such as the FinMeter library, which analyses Finnish poetry. The library can be used, for example, to analyse meter and interpret metaphors.
Language technology for endangered languages is very challenging, as modern language technology increasingly relies on massive text resources that are not readily available. The corpora of endangered languages also tend to contain a lot of variation, as the languages concerned may not have been subject to the same degree of language planning as, for example, Finnish. This kind of linguistic diversity is difficult from the perspective of machine learning: the more variation the corpus contains, the larger it needs to be for machine learning models to cope with the variation. Language technology for endangered languages therefore requires some ingenuity. We have successfully analysed the morphology (Hämäläinen et al., 2021a), morphosyntax (Hämäläinen & Wiechetek, 2020) and cognates (Hämäläinen & Rueter, 2019) of endangered languages by generating synthetic data for machine learning models. Data from endangered languages can be easily processed using the UralicNLP library that I have developed.
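The idea of generating synthetic training data can be illustrated with a toy sketch. It is not the method of the cited papers: the suffix table, the tag names and the function names below are hypothetical, and the three Finnish noun stems were chosen only because they inflect without stem changes. A real setup would more plausibly derive such pairs from a rule-based morphological analyser.

```python
import random

# Hypothetical, heavily simplified suffix table (tag string -> suffix).
# It ignores consonant gradation and vowel harmony, so it is only valid
# for stems such as "talo", "pallo" and "auto".
SUFFIXES = {
    "N+Sg+Nom": "",
    "N+Sg+Ine": "ssa",
    "N+Sg+Ela": "sta",
    "N+Pl+Nom": "t",
    "N+Pl+Ine": "issa",
    "N+Pl+Ela": "ista",
}


def generate_pairs(stems, suffixes=SUFFIXES):
    """Return (surface form, analysis) pairs for every stem and tag combination."""
    return [
        (stem + suffix, f"{stem}+{tags}")
        for stem in stems
        for tags, suffix in suffixes.items()
    ]


def to_char_sequences(pairs, seed=0):
    """Shuffle the pairs and split each surface form into space-separated
    characters, the input shape a character-level model typically expects."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    return [(" ".join(surface), analysis) for surface, analysis in shuffled]


if __name__ == "__main__":
    pairs = generate_pairs(["talo", "pallo", "auto"])
    for source, target in to_char_sequences(pairs)[:3]:
        print(source, "->", target)
```

Each generated pair maps an inflected form to its lemma and tags, so a model trained on such pairs can learn the analysis task without any hand-annotated corpus.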
Even in the case of vital languages, the abundant variation is a headache for language technologists. I have done research on the normalisation of historical English language forms (Hämäläinen et al., 2018). Normalisation simply means that a computer converts deviant historical spellings into their modern standard forms. The English language normalisation tool Natas is available on GitHub. Since then, I have worked on the normalisation of Finnish (Partanen et al., 2019) and Finland Swedish dialects (Hämäläinen et al., 2020a), as well as on the generation of Finnish dialects (Hämäläinen et al., 2020b) based on the written language. These research results have been published in the Murre library. My most recent work has been the automatic recognition of Finnish dialects based on sound and text (Hämäläinen et al., 2021b).
The Samples of Spoken Finnish corpus has been absolutely crucial in building dialect models. Without this corpus, my research on Finnish dialects would simply have been impossible.
The data from the Language Bank has also been useful in the study of computational creativity. For example, the Finnish WordNet has been used in my poetry generator (Hämäläinen, 2018) and Opusparcus has been useful in producing creative dialogue (Alnajjar & Hämäläinen, 2019).
Alnajjar, K., & Hämäläinen, M. (2021). When a Computer Cracks a Joke: Automated Generation of Humorous Headlines. In Proceedings of the 12th International Conference on Computational Creativity (ICCC 2021) (pp. 292-299). Association for Computational Creativity.
Hämäläinen, M., Alnajjar, K., Partanen, N., & Rueter, J. (2021b). Finnish Dialect Identification: The Effect of Audio and Text. In M-F. Moens, X. Huang, L. Specia, & S. Wen-tau Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8777-8783). The Association for Computational Linguistics.
Hämäläinen, M. (2020). Generating Creative Language: Theories, Practice and Evaluation. Helsingin yliopisto. Available: http://urn.fi/URN:ISBN:978-951-51-6707-1
Alnajjar, K., & Hämäläinen, M. (2019). A Creative Dialog Generator for Fallout 4. In Proceedings of the 14th International Conference on the Foundations of Digital Games [48] ACM. https://doi.org/10.1145/3337722.3341824
Hämäläinen, M., & Alnajjar, K. (2019). Let’s FACE it: Finnish Poetry Generation with Aesthetics and Framing. In K. V. Deemter, C. Lin, & H. Takamura (Eds.), 12th International Conference on Natural Language Generation: Proceedings of the Conference (pp. 290-300). The Association for Computational Linguistics. https://doi.org/10.18653/v1/w19-8637
Hämäläinen, M., Partanen, N., Rueter, J., & Alnajjar, K. (2021a). Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered. In S. Dobnik, & L. Øvrelid (Eds.), Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 166-177). (NEALT Proceedings Series; No. 45), (Linköping Electronic Conference Proceedings; No. 178). Linköping University Electronic Press.
Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates with a Character-Based NMT Approach. In A. Arppe, J. Good, M. Hulden, J. Lachler, A. Palmer, L. Schwartz, & M. Silfverberg (Eds.), Proceedings of the 3rd Workshop on Computational Methods in the Study of Endangered Languages: (Volume 1) Papers (pp. 39-45). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-6006.pdf
Hämäläinen, M., Partanen, N., & Alnajjar, K. (2020a). Normalization of Different Swedish Dialects Spoken in Finland. In GeoHumanities’20: Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities (pp. 24–27). ACM. https://doi.org/10.1145/3423337.3429435
Hämäläinen, M., Partanen, N., Alnajjar, K., Rueter, J., & Poibeau, T. (2020b). Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity. In F. A. Cardoso, P. Machado, T. Veale, & J. M. Cunha (Eds.), Proceedings of the 11th International Conference on Computational Creativity (ICCC’20) (pp. 204-211). Association for Computational Creativity.
Hämäläinen, M., & Wiechetek, L. (2020). Morphological Disambiguation of South Sámi with FSTs and Neural Networks. In D. Beermann, L. Besacier, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020) (pp. 36-40). European Language Resources Association (ELRA).
Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., & Mäkelä, E. (2018). Normalizing early English letters to Present-day English spelling. In B. Alex, S. Degaetano-Ortlieb, A. Feldman, A. Kazantseva, N. Reiter, & S. Szpakowicz (Eds.), Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (pp. 87-96). (ACL Anthology; No. W18-45). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-4510
Hämäläinen, M. (2018). Harnessing NLG to Create Finnish Poetry Automatically. In F. Pachet, A. Jordanous, & C. León (Eds.), Proceedings of the Ninth International Conference on Computational Creativity (pp. 9-15). Association for Computational Creativity (ACC).
Partanen, N., Hämäläinen, M., & Alnajjar, K. (2019). Dialect Text Normalization to Normative Standard Finnish. In W. Xu, A. Ritter, T. Baldwin, & A. Rahimi (Eds.), The Fifth Workshop on Noisy User-generated Text (W-NUT 2019): Proceedings of the Workshop (pp. 141–146). The Association for Computational Linguistics.
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Ari Huhta tells us about his research on language assessment.
I am Ari Huhta, a professor of language assessment and the director of the Centre for Applied Language Studies (CALS) at the University of Jyväskylä.
During my career I have been involved in developing various kinds of language assessment instruments and assessment systems as well as in carrying out related research. In the past 15 years I have also investigated learning a foreign or second language and the factors involved in learning languages.
Language assessment, or assessment in general, has several different purposes. Some of them concern awarding certificates to individuals for achieving a certain level of proficiency or a certain goal, as is the case in the Matriculation Examination or the National Certificates of Language Proficiency (Yleiset kielitutkinnot, which is used to demonstrate the level of language proficiency required for Finnish citizenship). I have been involved in both of these examinations but most of my research has focused on assessment that supports learning and that is called formative or diagnostic assessment.
A particularly important activity in my career was the international Dialang project in which we developed a 14-language assessment and feedback system that can be used via a web browser. Dialang was completed as early as 2004, but it is still accessible. Dialang led to a number of studies that combine the perspectives of language assessment and language learning research. These projects investigated the relationship between the ability to use a language and different linguistic features (e.g., structures and vocabulary) and their co-development, which will help design both teaching materials and assessment instruments for supporting learning. Researchers have been particularly interested in the linguistic characteristics of the functionally defined proficiency levels of the Common European Framework of Reference for Languages (CEFR); these levels are nowadays widely used in Europe, including Finland, as a way to define learning targets in foreign language education.
The most important examples of the above-mentioned studies were the Cefling and Topling projects (PI: Prof. Maisa Martin, JyU), which investigated writing and its development among Finnish-speaking learners of English and Swedish and among learners of Finnish as a second language, as well as the Dialuki project, which I led and which studied reading and writing skills among learners of English and Finnish. The participants in all these projects were school-aged language learners. More recently, I have studied the learning and teaching of English in primary school. In addition, I am involved in the DigiTala project, which is a joint venture between the University of Helsinki, Aalto University and the University of Jyväskylä; this project investigates automatic recognition and assessment of speech produced by learners of Finnish and Swedish.
Some of the learners’ texts collected during the Cefling and Topling projects (the Topling corpus) are already available via the Language Bank of Finland. The Dialuki corpus is to be published soon. As for the DigiTala project, we intend to make the speech material available to the scientific community to the extent that this is possible. By sharing our corpora, we aim to support and enhance research on language learning.
Khushik, Ghulam & Huhta, Ari. 2022. Syntactic complexity in English as a foreign language learners’ writing at CEFR levels A1 – B2. European Journal of Applied Linguistics, 10(1). Early online. https://doi.org/10.1515/eujal-2021-0011
Khushik, Ghulam & Huhta, Ari. 2020. Investigating syntactic complexity in EFL learners’ writing across Common European Framework of Reference levels A1, A2, and B1. Applied Linguistics 41(4), 506-553. https://doi.org/10.1093/applin/amy064
Leontjev, Dmitri; Huhta, Ari & Mäntylä, Katja. 2016. Word derivational knowledge and writing proficiency: How do they link? System 59, 73-89. https://doi.org/10.1016/j.system.2016.03.013
Huhta, Ari; Alanen, Riikka; Tarnanen, Mirja; Martin, Maisa & Hirvelä, Tuija. 2014. Assessing learners’ writing skills in a SLA study: Validating the rating process across tasks, scales and languages. Language Testing 31(3) 307–328. https://doi.org/10.1177/0265532214526176
Mäntylä, Katja & Huhta, Ari. 2013. Knowledge of word parts. In Milton, James & Fitzpatrick, Tess (eds.) Dimensions of Vocabulary Knowledge. (pp. 45-59). Palgrave.
Alanen, Riikka; Huhta, Ari & Tarnanen, Mirja. 2010. Designing and assessing L2 writing tasks across CEFR proficiency levels. In Bartning, Inge, Martin, Maisa & Vedder, Ineke (eds.) Communicative proficiency and linguistic development: intersections between SLA and language testing research. EUROSLA Monograph Series, 1. 21-56. http://eurosla.org/monographs/EM01/EM01home.html
On a linguistic field trip in Tver, Karelia in the summer of 2019. Photo: Tuisku Vilenius
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tuisku Vilenius investigated a corpus of Finnish online discussions to outline the cultural stereotypes that emerged from the discussions related to the indigenous Saami people.
I am Tuisku Vilenius and I graduated last summer with a Master’s degree in Linguistics from the University of Helsinki. My degree also included Saami studies and Indigenous studies. On the level of languages, I am particularly interested in the Saami languages, the Mayan languages and Nahuatl. Currently, I am working as a Finnish language teacher for immigrants and planning my postgraduate studies.
The aim of my Master’s thesis was to find out how ordinary Finns perceive the Saami people and their culture. As I had just recently begun my Saami studies when I started working on my Master’s thesis, I decided to approach the topic through material that was written in Finnish. I examined which adjectives were used in Finnish online discussions when referring to the Saami, and I also wanted to find out which broader discourses or stereotypes affected the chosen adjectives. At the same time, my research was also a diachronic overview of the Finnish Saami discussions during recent decades.
It was interesting to notice that although the number of discussions related to the Saami increased significantly during the period I reviewed (2001-2017), the references to the Saami changed little. Throughout the reviewed time period, the discussion was dominated by a stereotypical view in which the Saami were perceived as a traditional and even ancient people. This may be explained by the fact that the average Finn has little day-to-day contact with the Saami. On the other hand, much of the discussion focused on defining who and what the genuine Saami actually are. This reflects the need of the mainstream population to control and define the indigenous people.
I used the Suomi24 corpus (2001-2017) as the source of research data for my study. This corpus is available in the Language Bank’s Korp tool, and it contains discussions from the Suomi24 online forum. I chose this data because it provided a very broad view of the history of Finnish Internet discussion. The online discussion forum material is also more likely to reflect the views of ordinary Finns than, for example, the newspaper articles that had been used as a basis for earlier research on Saami discussions. In addition to the extensive material, I was delighted with the various additional features available in Korp. I was able to easily search for adjectives referring to the Saami with the search tool, and I also used identification data to learn, for example, when and in which discussion area the message had been posted. This allowed me to better outline the topics to which the Saami discussions related.
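The kind of search described above can be sketched in a few lines of code. This is not the Korp interface or its API: the token fields, the tag "A" for adjectives and the two sample messages are invented for illustration, and the sketch only counts adjectives that stand immediately before the target noun.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str  # hypothetical tag set: "A" = adjective, "N" = noun, "V" = verb


def adjectives_before(messages, target_lemma):
    """Count adjective lemmas that immediately precede the target lemma,
    grouped by the year in which the message was posted."""
    counts = defaultdict(Counter)
    for year, tokens in messages:
        for prev, cur in zip(tokens, tokens[1:]):
            if cur.lemma == target_lemma and prev.pos == "A":
                counts[year][prev.lemma] += 1
    return counts


# Invented sample messages as (year, tokens) pairs; not taken from the corpus.
SAMPLE = [
    (2005, [Token("oikea", "oikea", "A"),
            Token("saamelainen", "saamelainen", "N")]),
    (2015, [Token("muinaisia", "muinainen", "A"),
            Token("saamelaisia", "saamelainen", "N"),
            Token("tutkitaan", "tutkia", "V")]),
]

if __name__ == "__main__":
    for year, counter in sorted(adjectives_before(SAMPLE, "saamelainen").items()):
        print(year, counter.most_common())
```

Grouping the counts by year mirrors the diachronic angle of the study; the same structure could be keyed by discussion area instead.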
Vilenius, Tuisku 2021. Oikeat ja muinaiset: saamelaisstereotyypit suomalaisissa internetkeskusteluissa. Master’s Thesis. University of Helsinki. Available: URN:NBN:fi:hulib-202106152749
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jussi Ylikoski tells us about his research on the grammatical properties of Finnish and other Uralic languages.
I am Jussi Ylikoski, a linguist. I have been working at the University of Oulu for five years as a professor of Saami language, but starting in the autumn of 2022, I will be a professor of Finno-Ugric languages at the University of Turku. So, I do research on quite a few languages, including Finnish.
I have worked on quite a large number of research topics on Finnish and other Uralic languages, and partly outside the Uralic family, too. I have mainly focused on grammars (morphology and syntax) of both better- and lesser-known languages, and occasionally also on etymology. When describing present-day languages, I often can’t help looking at them also from a diachronic perspective, and when I study the historical development of these languages, I tend to pay quite a lot of attention to the actual use of modern languages in the light of real text corpora.
I have used the corpora available in the Language Bank of Finland particularly as a researcher of Finnish grammar. As early as 2003, I published an article in which I used the Finnish Text Collection in the Language Bank to show that the verb form known as the fifth infinitive (-maisillaan/-mäisillään, ’on the verge of doing something’) can be used in many other ways in addition to the periphrastic construction with the verb olla (’to be’), contrary to what had been regularly stated in grammars. For instance, the ’forehead veins’ (otsasuonet) may ‘be on the verge of bursting’ (olla repeämäisillään), but they might also be ‘bulging on the verge of bursting’ (pullistella repeämäisillään), or someone may be afraid and ‘waiting (for something) with his/her forehead veins on the verge of bursting’ (odottaa otsasuonet repeämäisillään).
In recent years, I have been fascinated by the ever larger text corpora, containing billions of words, that are available through the Language Bank of Finland and other CLARIN services. In my research, I have used e.g. the Korp version of the University of Helsinki E-thesis collection, the Finnish subcorpus of the Newspaper and Periodical Corpus of the National Library of Finland, the Suomi24 Corpus, the Ylilauta Corpus, and the Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, version 2. With the help of large corpora, it has been possible to discover what are, in a way, new morphological cases even in a well-known and well-described language like Finnish. Among other things, I have studied the syntactic properties of forms traditionally known as the prolative, and I have found them to be used in ways that are much more similar to case forms than previous research literature has suggested. Prolatives are not always only individual adverbs (e.g., maitse ‘by land’ and meritse ‘by sea’), but these forms can also be modified by subordinate clauses (e.g., mailitse jossa on helpompi kaunistella asioita ‘by email where it is easier to embellish facts’ and tekstiviestitse joihin turhan harva vastaa ‘by text messages that tend to be answered by too few’).
I have made my most exciting observations when studying forms that were previously considered clear-cut derivations, such as lauantaisin ‘on Saturdays’ and viikonloppuisin ‘on weekends’ or kunnittain ‘by/across municipalities’ and aihealueittain ‘by/across thematic areas’. In the multi-billion word corpora searchable through the Korp interface of the Language Bank of Finland, it is possible to find hundreds or even thousands of relatively natural sentences in which even these kinds of forms can have various modifiers that make them look like noun inflections: elokuun lauantaisin ‘on August Saturdays’, joka lauantaisin ‘on every Saturday’, satunnaisin viikonloppuisin ‘on random weekends’ or, e.g., Suomen kunnittain ‘by the municipalities of Finland’, eri maittain ‘by different countries’ and tietyin aihealueittain ‘by certain thematic areas’. Since these kinds of temporal and distributive expressions look like case-inflected noun phrases, I have playfully called them “dwarf cases”, by analogy with Pluto, which was formerly known as a planet but is now called a dwarf planet.
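A search of this kind can be roughly approximated in code. The following is only a sketch, not the Korp queries used in the research: it matches word endings with regular expressions, whereas a real corpus search would rely on morphological annotations. The function name and the suffix list are assumptions made for illustration.

```python
import re

WORD = re.compile(r"\w+")
# Words ending in the temporal -isin or the distributive -ittain/-ittäin.
SUFFIX = re.compile(r"\w+(?:isin|ittain|ittäin)$", re.IGNORECASE)


def candidate_phrases(text):
    """Return (preceding word, suffixed word) pairs as candidate
    'modifier + dwarf case form' phrases. Sentence boundaries are ignored."""
    words = WORD.findall(text)
    return [(prev, cur) for prev, cur in zip(words, words[1:]) if SUFFIX.match(cur)]


if __name__ == "__main__":
    for phrase in candidate_phrases("elokuun lauantaisin ja satunnaisin viikonloppuisin"):
        print(phrase)
    for phrase in candidate_phrases("Suomen kunnittain tai eri maittain"):
        print(phrase)
```

Applied to the first phrase, the sketch also reports the pair (ja, satunnaisin), because the adjective form satunnaisin happens to end in -isin. Pure suffix matching therefore overgenerates, and each hit still has to be checked by hand or filtered with part-of-speech information.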
After working on the hazy boundary between derivation and inflection, I have also ended up studying the abessive case in Finnish (rahatta ‘without money’, internetittä ‘without Internet’, etc.) and the so-called t accusative (minut ‘me’, meidät ‘us’, etc.) more thoroughly than before. Even though I personally like to observe and describe forms and syntactic structures largely by means of descriptive linguistics, the tools of the Language Bank also offer many opportunities for those who are interested in quantitative analysis.
In addition to the corpora in the Language Bank of Finland, I have also used the corpora of Saami languages and many other Uralic minority languages that have been produced by the language technologists in Tromsø, Norway. The corpora are available via the Korp service maintained by Giellatekno, i.e., the user interface is similar to that of the Korp service in the Language Bank of Finland. Those who are interested also in other Uralic languages besides Finnish can access the corpora in the Tromsø Korp service, http://gtweb.uit.no/korp/ (Saami) and http://gtweb.uit.no/u_korp/ (other languages). With 63 million words of annotated Mari, what more can a Uralicist wish for?
Ylikoski, Jussi. 2003. Havaintoja suomen ns. viidennen infinitiivin käytöstä. [Summary: Remarks on the use of the proximative verb form (the so-called 5th infinitive) in Finnish.] Sananjalka 45. 7–44. https://doi.org/10.30673/sja.86640
Ylikoski, Jussi. 2018. Prolatiivi ja instrumentaali: suomen –(i)tse ja –teitse kieliopin ja leksikon rajamailla. Sananjalka 60. 7–27. [Summary: On Finnish prolatives and instrumentals: –(i)tse and –teitse in between grammar and lexicon.] https://doi.org/10.30673/sja.69978
Ylikoski, Jussi. 2020. Kielemme kääpiösijoista: prolatiivi, temporaali ja distributiivi. Virittäjä 124. 529–554. [Summary: On Finnish dwarf cases: prolative, temporal and distributive.] https://doi.org/10.23982/vir.76971
Ylikoski, Jussi. 2021. Abessiivin apologia. Puhe ja kieli 41. 139–157. [Summary: Apologia of the Finnish abessive case.] https://doi.org/10.23997/pk.110924
Ylikoski, Jussi. 2021. Mistä voisin löytää sen entisen sinut? Suomen kielen akkusatiivi- ja pronominioppia. – Leena Maria Heikkola, Geda Paulsen, Katarzyna Wojciechowicz & Jutta Rosenberg (toim.), Språkets funktion. Juhlakirja Urpo Nikanteen 60-vuotispäivän kunniaksi. Festskrift till Urpo Nikanne på 60-årsdagen. Festschrift for Urpo Nikanne in honor of his 60th birthday. Åbo: Åbo Akademis förlag. 220–243. https://urn.fi/URN:ISBN:978-952-12-4062-1
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jutta Salminen tells us about her research on the various ways of expressing negation in Finnish.
I am Jutta Salminen (PhD, BMus). I defended my dissertation on the Finnish language at the University of Helsinki in the spring of 2020 and I have been working as a Finnish language lecturer at the University of Greifswald in Germany for more than five years. I am interested in grammar and linguistic meaning — particularly in the expression of negation and also in ambiguity.
In my dissertation, I studied the use and interpretations of the verb epäillä (’to doubt, to suspect, to suppose’) and its nominal derivatives epäily and epäilys (’a doubt, a suspicion’), as well as the changes related to the verb during the era of written Finnish. The starting point of the study was the observation that, in present-day Finnish, these lexemes may express that something is considered either probable or unlikely, depending on the context of use. So, I became interested in how a single word can be used in two opposite senses. In addition, these words provided an opportunity for observing how the negation proper (‘it is not (true) that X’) and the so-called evaluative negativity (‘it’s not good that X’, ‘I don’t like X’) relate to each other in language use, since both of these aspects of negativity are included in the meaning potential of the verb and its nominal derivatives.
My current research focuses on negative polarity items (e.g., kukaan) in Finnish and on what their contexts of use can tell us about their grammatical and semantic nature. In the English-language literature, negative polarity items (NPIs) have been studied rather extensively (especially in major Indo-European languages), and it is interesting to observe how the Finnish NPIs relate to these descriptions.
In order to study the variation, change and prevalence of different interpretations of linguistic meaning, it is necessary to have access to language material where it is possible to observe and to analyze instances of the language phenomenon under study. For the purpose of my dissertation research, I compiled a data set representing various text genres from several corpora: The Helsinki Korp Version of the Finnish Text Collection, Classics of Finnish Literature, The Corpus of Early Modern Finnish, The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland (KLK) and Corpus of Old Literary Finnish. When I began my dissertation study, the Finnish Text Collection was available via the old search interface, Lemmie, in Kielipankki, and the rest of the corpora (excluding KLK) were accessible via the Kaino service provided by Kotus (Institute for the Languages of Finland). Nowadays, I can use all of them via the Korp service in Kielipankki.
I based my comparison of the epäily(s) nouns on the occurrences found in the HS.fi News and Comments Corpus, which made it possible to examine the use of the words both in the edited news texts and in the readers’ comments. The linguistic context plays a key role in the perception of the meaning variants of ambiguous words, so access to the wider context of search results provided by the Language Bank was essential.
My ongoing research on negative polarity items mostly consists of grammatical description. Since grammar tends to change when in use, linguistic data is necessary for this type of research in addition to self-postulated examples, especially when the acceptability and the entrenchment of a particular expression is questionable to some extent. The Suomi24 Corpus has turned out to be a fruitful source of data for studying the use of the Finnish NPIs.
Salminen, Jutta (2020). Epäilemisen merkitys. Epäillä-sanueen polaarinen kaksihahmotteisuus kiellon ja kielteisyyden semantiikan peilinä. (The meaning and import of epäillä: The polar ambiguity of the Finnish verb epäillä ‘doubt, suspect, suppose’ and its nominal derivatives as a reflection of the semantics of negation and negativity.) Doctoral dissertation. Helsinki: University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5879-6
Salminen, Jutta (2018). Paratactic negation revisited. The case of the Finnish verb epäillä. Functions of Language 25(2): 259–288. https://doi.org/10.1075/fol.15030.sal
Salminen, Jutta (2017). Mitä tarkoittaa epäillä? Epäillä-verbin polaarisesta merkitysvariaatiosta nykysuomessa. (What does epäillä mean? On the polar meaning variation of the verb epäillä in Modern Finnish.) Virittäjä 121: 4–36. https://journal.fi/virittaja/article/view/52322
Salminen, Jutta (2017). Epäillä-verbin polaarinen kaksihahmotteisuus merkitysmuutoksena. (The polar ambiguity of the Finnish verb epäillä as evidenced through meaning development.) Virittäjä 121: 37–66. https://journal.fi/virittaja/article/view/52323
Salminen, Jutta (2017). Epäily vai epäilys? Jaettu polysemia ja lekseemien tyypilliset käytöt. (Epäily or epäilys? Shared polysemy and specialised typical uses.) Sananjalka 59: 217–243. https://doi.org/10.30673/sja.66636
Photo: Evelin Kask, Aalto University
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikko Kurimo tells us about his research on automatic speech recognition.
I am a Professor in Speech and Language Processing and leader of the Speech Recognition research team at the Department of Signal Processing and Acoustics of Aalto University.
For my PhD dissertation 25 years ago, I developed neural network algorithms to make automatic speech recognition more accurate and more robust. In order to train statistical models for recognizing speech sounds, it is necessary to utilize large amounts of speech material where the sounds are aligned with the corresponding text. At that time, very few such corpora were available. Thus, the research team had to collect and process the data themselves. When we developed automatic methods for aligning speech and text, it became possible to utilize larger data such as audiobooks and radio and television news (e.g., FBC – The Finnish Broadcast Corpus) in training the Finnish speech recognizer.
However, sufficient accuracy cannot be reached just by modeling individual speech sounds, since they do not appear separately in speech and in practice they are modified to fit in the word and sentence context. Therefore, the speech recognizer must also be provided with a model of the language in question. On the basis of the language model, the recognizer decides which words and sentences are represented by the observed speech sound sequences. To train the language model, huge quantities of text are required that should also contain a large variety of examples of different types of language use. For training the Finnish speech recognizer, we have used, e.g., the Finnish Text Collection (FTC).
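As a rough illustration of how a language model can help a recognizer choose between competing word sequences, the following minimal sketch combines a toy bigram language model with acoustic scores. It is not the Aalto-ASR implementation: the tiny English training text, the acoustic scores, and all function names are assumptions made up for this example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def pick_hypothesis(hypotheses, unigrams, bigrams, lm_weight=1.0):
    """Return the text whose acoustic score plus weighted LM score is highest.

    hypotheses: list of (text, acoustic_log_score) pairs; the acoustic
    scores here are invented placeholders, not real model output.
    """
    best = max(
        hypotheses,
        key=lambda h: h[1] + lm_weight * log_prob(h[0], unigrams, bigrams),
    )
    return best[0]


# Toy training text (hypothetical); a real model needs huge quantities of text.
corpus = [
    "the weather is nice today",
    "the weather is cold",
    "whether it rains or not",
]
unigrams, bigrams = train_bigram_model(corpus)

# Two candidates that sound alike, so their acoustic scores are set equal.
candidates = [("the weather is nice", -10.0), ("the whether is nice", -10.0)]
best = pick_hypothesis(candidates, unigrams, bigrams)
print(best)
```

Because the two candidates are acoustically indistinguishable in this toy setup, only the language model separates them, which is the role described above. Production recognizers use far larger models and search over many more hypotheses.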
When it is possible to automatically convert read-aloud speech and dictation into text with sufficient accuracy, this technology can be used in dictation services as well as in many other useful applications, such as transcribing planned speeches or respeaking presentations or television programmes. However, I am even more interested in natural and spontaneous speech that we all use in our everyday conversations and storytelling. Since free speech is the most efficient means of communication for humans, it is of utmost importance to have an automatic speech recognizer that can understand this kind of speech when developing Artificial Intelligence systems that are to communicate with people.
The challenges in training models of conversational speech lie in the huge amount of variation in speech and in the limited availability of carefully transcribed resources of natural speech that are suited for training the recognizers. Since written language differs from spoken language in many ways, it is in practice necessary to create the text resources by transcribing speech first.
When training the first conversational speech recognizer, we used the FinDialogue corpus in addition to the DSPCON corpus we collected ourselves. The language models were trained with specific portions of conversations in written format that were found to be similar to spoken language according to the aforementioned spoken corpora.
At the moment, we are preparing two new corpora of free speech for publication: an extension of the Plenary Sessions of the Parliament of Finland and the speech material collected in the Donate Speech campaign. Both corpora contain approximately 4000 hours of speech, which clearly exceeds the total amount that was included in all previously published Finnish speech corpora that were suitable for training automatic speech recognizers. I am confident that the new data will enable us to significantly improve the automatic speech recognizer we have developed at Aalto University (Aalto-ASR), whose most recent version (Aalto-ASR 2.1) is currently available via the Language Bank of Finland.
Mikko Kurimo (1997). Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models. PhD thesis, Helsinki University of Technology, Espoo, Finland.
Mikko Kurimo, Vesa Siivola, Teemu Hirsimäki, Janne Pylkkönen, Reima Karhila, Peter Smit, Seppo Enarvi, André Mansikkaniemi, Matti Varjokallio, Ulpu Remes, Heikki Kallasjoki, Sami Keronen, Katri Leino, Ville T. Turunen & Kalle Palomäki (author names in no particular order, except the project leader is first). 2000–2016. AaltoASR open source large-vocabulary continuous speech recognition system, Aalto University.
Seppo Enarvi & Mikko Kurimo (2013). Studies on Training Text Selection for Conversational Finnish Language Modeling. In Proceedings of the 10th International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, pp. 256–263. Available: http://urn.fi/URN:NBN:fi:aalto-201708036342.
André Mansikkaniemi, Peter Smit & Mikko Kurimo (2017). Automatic Construction of the Finnish Parliament Speech Corpus. Proceedings of Interspeech 2017, Vol. 8, pp. 3762–3766. Available: https://doi.org/10.21437/Interspeech.2017-1115
Juho Leinonen, Sami Virpioja & Mikko Kurimo (2021). Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. Available: http://hdl.handle.net/10138/330758
Peter Smit, Sami Virpioja & Mikko Kurimo (2021). Advances in subword-based HMM-DNN speech recognition across languages. Computer Speech & Language, Vol. 66. Available: https://doi.org/10.1016/j.csl.2020.101158
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Veronika Laippala tells us about her research on large language resources and computational methods.
My name is Veronika Laippala. I am a Professor of Digital Language Research at the School of Languages and Translation Studies of the University of Turku, and I work in the TurkuNLP research group.
Most of my research is related to language use in one way or another: to large language resources, mostly compiled from the Internet, and to computational methods to analyze the data. In addition, I have been involved in the development of Finnish language technology, including resources such as the Turku Dependency Treebank and the Turku NER named entity recognition system.
We currently have several ongoing projects where we process large web-based language resources by analyzing the genres or registers found in them and by developing machine learning methods that can automatically recognize the different registers. Such methods and tools would benefit both Internet users in general and researchers using Internet-based language materials.
The wide selection of corpora and resources in the Language Bank of Finland provides huge opportunities! The Suomi24 corpus is quite unique in its scope and it is probably the resource I have used the most. In addition, the syntactic parser developed on the basis of our treebank is used to parse the corpora in Kielipankki. Naturally, I also teach the use of the Korp interface in my courses.
Liina Repo, Valtteri Skantsi, Samuel Rönnqvist, Saara Hellström, Miika Oinonen, Anna Salmela, Douglas Biber, Jesse Egbert, Sampo Pyysalo & Veronika Laippala (2021). Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 183–191. Available: https://aclanthology.org/2021.eacl-srw.24.
Veronika Laippala, Jesse Egbert, Douglas Biber & Aki-Juhani Kyröläinen (2021). Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents. Language Resources and Evaluation, Vol. 55, pp. 757–788. DOI: 10.1007/s10579-020-09519-z.
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juho Leinonen tells us about his research on automatic speech recognition, speech alignment and chatbots.
My name is Juho Leinonen and I am completing my PhD studies in the Speech Recognition research group led by Mikko Kurimo at Aalto University. I started my PhD studies in 2017 after a couple of years of work in industry.
The topic of my Master’s thesis was automatic speech recognition for the Sámi language, and I can build on this experience in my PhD work as well. In my current research on chatbots and forced alignment of speech, I still need language models and acoustic models, both of which are also required in automatic speech recognition. In speech recognizers, language models are used for recognizing words that are pronounced in an unclear or ambiguous way, whereas chatbots need language models for generating new text. Language models can also be applied to assessing the quality of text generated by bots. The process becomes circular: in order to evaluate the results in a reliable way, we need to understand what high-quality text is like, but the same understanding is a prerequisite for generating text in the chatbot. This constitutes a philosophical problem as well as an engineering one.
The goal in traditional speech recognition is to find the text that corresponds to the audio recording as well as possible. When developing a speech recognizer, previously aligned speech data is first required in order to train the acoustic models. Aligning text with speech is actually routine work in speech recognition. However, speech alignment would be a useful functionality for researchers in other fields as well, and it is hardly possible for everyone to become a speech recognition professional before they can get started with their own research. During the past year, I have packaged the speech recognition and alignment tools used in our research group into a toolkit that would be as easy to share as possible. I am also searching for good measures that could be used for assessing the quality of the alignment. My goal is to find out which acoustic models or features produce the best alignment, and in what sort of situations it is possible or worthwhile to use the models trained on major languages for aligning minority languages. This research has also opened up the world of language researchers for me, since I am trying to adapt the tool to suit their purposes as well as possible.
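To make the idea of an alignment quality measure concrete, here is a minimal sketch of one commonly used kind of measure: the deviation of predicted word start times from a manually checked reference. The interview does not say which measures were finally chosen, so the function names, the 50 ms tolerance, and the example timings below are all assumptions for illustration only.

```python
def boundary_errors(reference, predicted):
    """Absolute start-time differences in seconds, pairing words by position.

    Each alignment is a list of (word, start_time_in_seconds) tuples.
    """
    if len(reference) != len(predicted):
        raise ValueError("alignments must contain the same number of words")
    if not reference:
        raise ValueError("alignments must not be empty")
    return [abs(ref[1] - pred[1]) for ref, pred in zip(reference, predicted)]


def alignment_quality(reference, predicted, tolerance=0.05):
    """Return (mean absolute error, share of boundaries within tolerance)."""
    errors = boundary_errors(reference, predicted)
    mean_error = sum(errors) / len(errors)
    within = sum(1 for e in errors if e <= tolerance) / len(errors)
    return mean_error, within


# Hypothetical timings: a hand-corrected reference and an automatic alignment.
reference = [("hyvää", 0.00), ("huomenta", 0.50), ("kaikille", 1.20)]
predicted = [("hyvää", 0.02), ("huomenta", 0.48), ("kaikille", 1.40)]

mean_error, within = alignment_quality(reference, predicted)
print(f"mean error: {mean_error:.3f} s, within tolerance: {within:.0%}")
```

A measure like this requires hand-corrected reference alignments, which are themselves scarce for minority languages; that scarcity is one reason why choosing a good measure is a research question in its own right.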
On the spur of the moment, I ended up testing the Finnish speech recognizer, developed by our group, for aligning the Giellagas corpus of Northern Saami. This project gave me the idea of cross-language alignment that is described in my latest publication (Leinonen, Virpioja & Kurimo, 2021). Thus, an alignment tool developed for one language can possibly be applied to aligning speech and text in other languages as well, provided that the sound and writing systems of the languages are sufficiently similar. In the future, I will also be utilizing other previously aligned speech corpora that are in the Language Bank of Finland. The automatic speech aligner that I have used in my research is now also available for other researchers as part of the Aalto University Automatic Speech Recognition System (Aalto-ASR v.2) that has been installed in the Puhti computing environment at CSC.
For training chatbots, I also use the Suomi24 corpus available in the Language Bank. It may seem strange to use the sort of language used in online discussion forums for “training” purposes. However, huge amounts of text are required in order to train useful language models, and finding suitable material in sufficiently large quantities is very difficult.
Leinonen, J., Smit, P., Virpioja, S., & Kurimo, M. (2017). New baseline in automatic speech recognition for Northern Sámi. In International Workshop on Computational Linguistics for the Uralic Languages (pp. 89-99). https://doi.org/10.18653/v1/W18-0208
Leino, K., Leinonen, J., Singh, M., Virpioja, S., & Kurimo, M. (2020). FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics. In Interspeech (pp. 429-433). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-2511
Leinonen, J., Virpioja, S., & Kurimo, M. (2021, May). Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. http://hdl.handle.net/10138/330758
Photo: Jonne Renvall/Tampere University
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Okko Räsänen tells us about his research on the computational modeling of infant language development.
I am Okko Räsänen, Associate Professor and Academy Research Fellow at the Unit of Computing Sciences of Tampere University, where I also lead the Speech and Cognition research group. Before moving to Tampere, I worked at the Department of Signal Processing and Acoustics at Aalto University, where I am Docent in Speech Processing.
The main topics of my research are the computational modeling of infants’ early language acquisition and the speech that infants hear. Our aim is to understand the principles of information processing that underlie language learning: What sort of transformations and processing steps does the speech signal undergo in the human brain in order to make it possible for the individual to learn how to comprehend it, and how can we build similar language capabilities into artificial intelligence systems? We are interested in what sort of linguistic structures can be acquired in a language-independent and unsupervised manner from speech and from the rest of the sensory information that is available to children. On the other hand, we study the learning mechanisms and presuppositions that must be included in the models in order for the learning to succeed. An interesting question is what kind of language input and other multisensory information infants are generally able to hear and to perceive during their early language development, and to what extent the acquisition of linguistic structures (e.g., sounds and words) is supported by the amount, quality, and the multisensory nature of the input.
In addition to computational models, we have also developed practical tools for the automated analysis of large child-centered audio data, which can help us to better understand the characteristics of speech heard by children. The data sets typically consist of day-long recordings made with wearable microphones in children’s natural acoustic and linguistic environments. For example, in the recently completed international collaboration project Analyzing Child Language Experiences around the World, we analyzed about 14,000 hours of child-centered audio material in order to study children’s early language experiences in various linguistic and cultural settings. Our next goal is to further process our analysis results into publications.
Computational research in language learning is multidisciplinary and interesting work, but on the other hand, it is also challenging. In order to work with speech signals and to model human learning processes, an in-depth command of signal processing and machine learning methods is required. In addition, however, it is important to have a good understanding of phonetics, early language development and the functioning of human cognition, so as to make it possible to reconcile the new models and methods with theory and data from language development research.
In addition to research on language acquisition, my research team develops various analysis methods for speech, e.g., for evaluating the health condition or the emotional state of a given speaker. My group is also involved in the development of smart wearables for babies for the clinical assessment and monitoring of their neurophysiological and motor development (as part of the Academy of Finland’s Health from Science research program). Moreover, I work on many other themes in speech technology, cognitive science, and signal analytics based on machine learning. Often, the signal processing and machine learning methods that are used in speech technology are also well suited for processing a wide variety of time series data.
In my research, I have used the FinDialogue corpus that is currently on its way to the Language Bank of Finland, and many other corpora that are provided by the Language Bank are also familiar to me. I am looking forward to the publication of the speech material collected during the Donate Speech campaign for research use. In my opinion, the Language Bank is also a viable publication channel for any new data that we may create during our research in the future.
Khorrami, K. & Räsänen, O. (2021). Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? – A computational investigation. Language Development Research, https://doi.org/10.34842/w3vw-s845
Räsänen, O., Seshadri, S., Lavechin, M., Cristia, A., & Casillas, M. (2021). ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Behavior Research Methods, 53, 818–835, https://doi.org/10.3758/s13428-020-01460-x.
Räsänen, O., Doyle, G., & Frank, M. C. (2018). Pre-linguistic segmentation of speech into syllable-like units. Cognition, 171, 130–150, https://doi.org/10.1016/j.cognition.2017.11.003.
Kakouros, S., Salminen, N. & Räsänen, O. (2018). Making predictable unpredictable with style — Behavioral and electrophysiological evidence for the critical role of prosodic expectations in the perception of prominence in speech. Neuropsychologia, 109, 181–199, https://doi.org/10.1016/j.neuropsychologia.2017.12.011.
Räsänen, O., Kakouros, S. & Soderstrom, M. (2018). Is infant-directed speech interesting because it is surprising? — Linking properties of IDS to statistical learning and attention at the prosodic level. Cognition, 178, 193–206, https://doi.org/10.1016/j.cognition.2018.05.015.
Rasilo H. & Räsänen O. (2017). An online model of vowel imitation learning. Speech Communication, 86, 1–23, https://doi.org/10.1016/j.specom.2016.10.010.
Räsänen, O. & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4), 792–829, https://doi.org/10.1037/a0039702.
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Olli Kuparinen tells us about his research on language variation and change where he has used The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s), the Samples of Spoken Finnish and The Finnish Dialect Syntax Archive.
I am Olli Kuparinen, Doctor of Philosophy in Finnish language. In my doctoral dissertation, which I defended in June 2021, I studied the change of Finnish spoken in Helsinki and theories on language change. My dissertation was written in the multidisciplinary research group Kippo, and the study was funded by the Kone Foundation.
I study the variation and change in spoken Finnish as well as the theories that are utilized in sociolinguistics. My research methods have for the most part been statistical.
My dissertation scrutinized the change in Finnish spoken in Helsinki from the 1970s to the 2010s. The real-time corpus of three time points enabled me to study the concrete changes in Helsinki as well as to test the theories that have been drafted in studies of one or two time points. Studying three time points contests, for instance, the practicality of the patterns of change put forth by William Labov.
In my postdoctoral research I will examine the variation in Finnish dialects and the ways variation is discussed in works on dialects.
In my dissertation I used the Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s), which consists of interviews of Helsinki natives from the 1970s, 1990s and 2010s. The interviews are available as sound files in the Language Bank. Many of the interviews have also been transcribed. In my dissertation I focused mainly on the transcriptions.
During my work on Helsinki Finnish I have also utilized the Samples of Spoken Finnish as a test corpus for different statistical models. I also plan to use the corpus in my postdoctoral research, in which I study the variation in Finnish dialects. The great benefit of the corpus is that it has been translated into standard Finnish. This enables, for instance, the use of different machine learning algorithms on the corpus to scrutinize the topics of the interviews.
I also plan to use the Finnish Dialect Syntax Archive as a supplement for the Samples of Spoken Finnish in my postdoctoral work.
Kuparinen, Olli 2018: Infinitiivien variaatio ja muutos Helsingissä. – Virittäjä 122, pp. 29–52. https://doi.org/10.23982/vir.65310
Kuparinen, Olli 2021: Muutoksen mekanismit. Kolmen aikapisteen reaaliaikatutkimus Helsingin puhekielestä. Tampereen yliopiston väitöskirjat 428. Tampere: Tampereen yliopisto 2021. http://urn.fi/URN:ISBN:978-952-03-1990-8
Kuparinen, Olli – Mustanoja, Liisa – Peltonen, Jaakko – Santaharju, Jenni – Leino, Unni 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. – Sananjalka 61, pp. 30–56. https://doi.org/10.30673/sja.80056
Kuparinen, Olli – Peltonen, Jaakko – Mustanoja, Liisa – Leino, Unni – Santaharju, Jenni 2021: Lects in Helsinki Finnish: a probabilistic component modeling approach. – Language Variation and Change. https://doi.org/10.1017/s0954394521000041