
Researcher of the Month: Elina Vaahensalo

Elina Vaahensalo
Photo: Elina Vaahensalo

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Elina Vaahensalo tells us about her research on confrontation and otherness in online discussions.

Who are you?

I am Elina Vaahensalo, doctoral researcher in Digital Culture at the Faculty of Humanities, University of Turku, in the Degree Programme in Digital Culture, Landscape and Cultural Heritage. In addition, at the beginning of October I will start as a researcher in the Academy project SoliPro (”Solidariteetit käytäntöön – Nuorten arkiyhteisöt tunnustuksen lähteenä ja ehkäisevän sosiaalityön areenana”), coordinated by the University of Tampere.

What is your research topic?

In my dissertation, I examine online discussion that produces otherness, especially from the perspective of anonymous Finnish-language online communities. I am interested in how confrontation, alienation and even violent hostility are constructed in Finnish-language online discussion cultures, and what different forms the concept of otherness takes in these cultures. Otherness is a fruitful conceptual starting point for research on online discussions because it can be used in a variety of ways to outline descriptions of community, group identities, and the sense of being an outsider or downgraded and different. In Finnish-language online discussions, otherness takes very different – and also contradictory – forms: the other can be an enemy who is violently and dehumanisingly opposed, but also a relatable fellow sufferer with whom one shares common, peer-based experiences of marginalisation.

In addition, my colleague Lilli Sihvonen and I have studied online cultures from the framework of media archaeology. In particular, we are interested in what happens when a cybercultural phenomenon or object – a meme that has gone viral or a social media platform – dies, and what kind of afterlife can be associated with it. Our interest is driven by the perception of the vulnerability of digital phenomena. In our view, online phenomena in Finnish, for example, are particularly vulnerable because they often do not spread globally and are therefore not stored very widely online. In storing Finnish-language online cultural phenomena, Kielipankki has therefore done a valuable job by depositing online discussions from both the Suomi24 forum and the Ylilauta forum.

In my research for the SoliPro project, I will continue my work on othering, but from an even more robust perspective of community and solidarity. My aim is to examine the descriptions of community, otherness and solidarity shared by young people on social media.

How is your research related to Kielipankki?

In my more recent research, I have used qualitative and ethnographic online discussion data that I collected myself, but the Suomi24 data from Kielipankki also played an important role at the beginning of my research career. In 2017, I started as a research assistant in the ”Citizen Mindscapes” consortium project, funded by the Research Council of Finland. The project, in which I also wrote my Master’s thesis, was built around the Suomi24 data from Kielipankki. Already then, I developed the concept of othering online discourse, and tested its identification and quantitative measurement using the Suomi24 data. Experimenting with corpus-based research was quite a dive into the unknown for a cultural researcher such as myself. However, with all its challenges, it was a valuable lesson to see how working on a Master’s thesis provides opportunities to try out different research tools – also outside one’s own comfort zone.

From time to time, I also teach digital culture students, and my teaching focuses on the tools and methods that can be used for conducting qualitative research on online discussions. I always encourage my students to use the online discussion corpora in Kielipankki, as they are unique collections of Finnish online culture, and they also prove that the language used online is worth saving and remembering.

Recent publications

Vaahensalo, E., & Sihvonen, L. (2022). Elävät, kuolleet ja elävät kuolleet keskustelufoorumit: verkkoyhteisöjen elämänvaiheet ja niiden tutkiminen. In R. Mähkä, M. Ahonen, N. Heikkilä, S. Ollitervo, & M. Räsänen (Eds.), Kulttuurihistorian tutkimusmenetelmät (pp. 411-429). Turun yliopisto.

Vaahensalo, E. (2022). ”Uuniin siitä” – Väkivaltainen ja toiseuttava verkkokeskustelu Ylilaudalla. Lähikuva – audiovisuaalisen kulttuurin tieteellinen julkaisu, 35(3), 29–44. https://doi.org/10.23994/lk.121893

Vaahensalo, E. (2022). Organisaatiot ja toiseuttava verkkokeskustelu. In H. Kantanen & M. Koskela (Eds.), Procomma Academic 2022: Poikkeuksellinen viestintä. ProCom – Viestinnän ammattilaiset ry. https://doi.org/10.31885/2022.00001

Vaahensalo, E. (2021). Samanlaista toiseuttamista, erilaisia toisia: Toiseuttavan verkkokeskustelun muodot anonyymeissä suomenkielisissä keskustelukulttuureissa. Media & Viestintä, 44(3), 1–29. https://doi.org/10.23983/mv.111507

Vaahensalo, E. (2021). Kontekstualisointimalli sosiaalisen median lähdekritiikin avaimena. Informaatiotutkimus, 40(3), 110–141. https://doi.org/10.23978/inf.107897

Vaahensalo, E. (2021). Creating the other in online interaction: Othering online discourse theory. In J. Bailey, A. Flynn, & N. Henry (Eds.), Handbook on technology-facilitated violence and abuse: International perspectives and experiences (pp. 227-246). Emerald Studies on Digital Crime, Technology & Social Harms. https://doi.org/10.1108/978-1-83982-848-520211016

Suominen, J., Saarikoski, P., & Vaahensalo, E. (2019). Digitaalisia kohtaamisia: Verkkokeskustelut BBS-purkeista sosiaaliseen mediaan. Helsinki: Gaudeamus.


The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers of Social Sciences and Humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.


Researcher of the Month: Aku Rouhe

Aku Rouhe
Photo: Jasmine Gustafsson

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Aku Rouhe tells us about his research on speech recognition. His current work includes, among other things, fine-tuning large language models that are optimized for Finnish and Nordic languages. These openly available LLMs have been created through successful academia-enterprise collaboration.

Who are you?

I am Aku Rouhe. For several years, I did research in the Aalto University Speech Recognition research group, and defended my doctoral thesis there this past February. After Aalto, I moved to Silo AI (now owned by AMD), where I work with large language models (LLMs) – I have moved from speech to text. My interest in language also carries over into my free time, which I spend on creative writing.

What is your research topic?

In my doctoral thesis, I compared end-to-end models with more traditional multi-model decomposed systems. In recent years, both academia and commercial deployments in speech recognition have largely moved to end-to-end models. However, my work showed that multi-model decomposed systems remain a competitive alternative, for instance, in terms of recognition accuracy. Indeed, the main advantage of end-to-end models is probably their simplicity.

End-to-end models often require vast training resources. Thus, it was important for me to study end-to-end models applied to under-resourced languages as well.

My current work at Silo is on fine-tuning large language models such as Poro and Viking, which are models optimized for Finnish and Nordic languages. These LLMs were developed in a collaborative research project between Silo and TurkuNLP.

How is your research related to Kielipankki?

End-to-end models hunger for data, so large corpora are needed. I was involved in compiling the Aalto Finnish Parliament ASR Corpus 2008-2020, which consists of Finnish Parliament plenary session recordings, and also in the Lahjoita Puhetta project, where volunteers donated their speech to produce the Puhelahjat corpus. I got to combine both of these large speech corpora in an article that was published when I was finalizing my PhD, at a time when I was involved with the LAREINA project. Nowadays, the Finnish speech recognition resources are respectable for a language spoken by so few.

Recent publications

Rouhe, A., Grósz, T., Kurimo, M. 2024. Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-Hour Scale. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 623-638, 2024. doi: 10.1109/taslp.2023.3336517

Virkkunen, A., Rouhe, A., Phan, N. et al. 2023. Finnish parliament ASR corpus. Lang Resources & Evaluation 57, 1645–1670 (2023). doi: 10.1007/s10579-023-09650-7

Moisio, A., Porjazovski, D., Rouhe, A. et al. 2023. Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks. Lang Resources & Evaluation 57, 1295–1327 (2023). doi: 10.1007/s10579-022-09606-3

Rouhe, A., Virkkunen, A., Leinonen, J., Kurimo, M. 2022. Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0. Proc. Interspeech 2022, 3543–3547. doi: 10.21437/Interspeech.2022-11318



Researcher of the Month: Tuukka Törö

Tuukka Törö
Photo: Riina Kiianmies

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Tuukka Törö tells us about his research on Finnish speech synthesis. Neural network models, which are trained with large amounts of audio data from varied datasets, enable researchers to analyze speech in new ways.

Who are you?

I am Tuukka Törö. I have been working as a doctoral researcher at the University of Helsinki’s Phonetics and Speech Synthesis Research Group since the beginning of this year. My background is in linguistics and phonetics, and I hold a BA in English studies from the University of Malmö and an MA in Phonetics from the University of Helsinki. After writing my Master’s thesis on controlling speaking styles in speech synthesis, I spent some time working with YLE on AI radio projects where we created synthetic ‘actors’ for radio features.

In my current position, I work in the Academy of Finland-funded project Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis. The project’s aim is to develop text-to-speech (TTS) synthesis inspired by the predictive processing theory of human cognition.

While my focus has become more technically inclined, the primary inspiration behind my work stems from a fascination with how social structures influence speech, from macro level variation to how people convey social dynamics in specific contexts.

What is your research topic?

Currently I am researching macro level language variation using neural-network models built for TTS and speech recognition. While the models’ original purpose is in technological applications, they enable us to analyze speech in new ways. As the models are trained with large amounts of audio, they can be used to model ’wild’ data of varying quality on a large scale instead of picking apart specific acoustic features from small, professionally recorded datasets.

Within the academy project, my aim is to tie together sociolinguistic variation with the predictive processing and speech synthesis side of things. Hopefully, in the coming years we will learn something new about how humans perceive social cues in speech and how high-level social predictions can be utilized to improve speech synthesis.

How is your research related to Kielipankki?

I often use corpora from Kielipankki such as Samples of Spoken Finnish (SKN), FinSyn (to be available in Kielipankki), and most of all Donate Speech (Lahjoita puhetta). In order to train speech synthesizers that we can control with social variables – such as age, gender, and dialect – we need a large amount of audio data from people with a rich variety of backgrounds. With Finnish being a relatively small language, it is vital to have a concerted effort to build large datasets like the Donate Speech corpus.

Recent publications

Törö, T., Suni, A. and Šimko, J. (2024). Analysis of regional variants in a vast corpus of Finnish spontaneous speech using a large-scale self-supervised model, Proceedings of Speech Prosody 2024, Leiden, Netherlands. DOI: 10.21437/SpeechProsody.2024-8

Šimko, J., Törö, T., Vainio M., and Suni, A. (2023). Prosody under control: Controlling prosody in text-to-speech synthesis by adjustments in latent reference space, Proceedings of the 18th International Congress of Phonetic Sciences, Prague, Czech Republic. http://hdl.handle.net/10138/565382



Researcher of the Month: Heidi Niva

Heidi Niva
Photo: Emmi Pollari

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Heidi Niva tells us about her research on Finnish grammatical phenomena and introduces a Vepsian-Finnish dictionary project. In a joint research project, she also aims to evaluate the corpus of online discussions as a source for a language researcher.

Who are you?

I am Heidi Niva, a postdoc Finnish language researcher. I am currently a substitute lecturer of Finnish language and culture at the University of Helsinki. I am also actively involved in the LOST DOC collective, a community for postdoc language researchers.

What is your research topic?

Both in my dissertation and afterwards, grammatical phenomena have been the focus of my research. Among other things, I have studied the structures that are used to express futurity in Finnish. Now I am involved in a joint project where we study the structures expressing avertivity, i.e. non-realization of events. I am also working on a project where we aim to compile a Vepsian-Finnish dictionary. Vepsian, also known as Veps, is a related but endangered language spoken south of Lake Onega (Ääninen). In addition to the dictionary project, I am also doing research on adpositional structures in the Veps language.

How is your research related to Kielipankki?

In my research on Finnish grammar, I am less interested in normativity than in how people actually use linguistic structures, and what types of meanings and connotations these structures can convey. For this purpose, I have used the resources in Kielipankki: The Suomi24 Sentences Corpus 2001-2020 for the study of Modern Finnish, and the corpora of Early Modern Finnish and Old Literary Finnish for the study of the older forms of the language. I am also currently using the Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s and the Finnish News Agency Archive Corpus.

In fact, the Suomi24 Sentences Corpus 2001-2020 is itself the subject of our joint research with Max Wahlström and Olli Silvennoinen. What is interesting about this corpus is that it largely represents informal language use but is still different from spoken language in terms of its linguistic features. In addition, the corpus is a diverse source in terms of the formality of language use and the occurrence of linguistic phenomena as they seem to be influenced by the various topics of discussion and their styles of expression. In our forthcoming article, we will critically examine what kind of source the Suomi24 corpus actually is for a language researcher.

Publications

Niva, Heidi 2022: Suomen progressiivirakenne intentioiden ja ennakoinnin ilmaisuissa. Helsinki: Helsingin yliopisto. Available: http://urn.fi/URN:ISBN:978-951-51-8727-7

Niva, Heidi 2024: Tulen muistamaan hänet aina. Tulla V-mAAn vääjäämättömän tulevaisuuden ilmaisukeinona. Virittäjä 128(2), 238–263. DOI: 10.23982/vir.126878



Researcher of the Month: Krister Lindén

Krister Lindén
Photo: Juhani Jokinen

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Krister Lindén, the Director of the Language Bank, describes how researchers in Humanities can benefit from the use of artificial intelligence in their corpus-based research.

Who are you?

I am Krister Lindén. At the University of Helsinki, I am Research Director for Language Technology at the Department of Digital Humanities, and Deputy Team Leader at the Centre of Excellence for Ancient Near Eastern Empires. For national research infrastructures, I am the Director of the Language Bank of Finland, the National Coordinator of FIN-CLARIN, and the PI of FIN-CLARIAH. At the EU level, I am Chair of the National Coordinators Forum of CLARIN, a research infrastructure for the humanities and social sciences, and a member of the CLARIN Legal Issues Committee (CLIC).

What is your research topic?

I have always been interested in language technology and its application and, due to my involvement in the Language Bank, increasingly also in the prerequisites for developing and applying technology:

  • How can we use data to answer a broad range of research questions in the humanities and social sciences?
  • Where can we obtain development and test data to develop and evaluate our data processing methods?
  • Under what conditions can data be shared with other researchers so that they can verify the proclaimed performance of the methods?

An independent evaluation of methods is important to ensure progress and that we find the best methods in each case. If only a preliminary evaluation is needed, and a small-scale experiment is sufficient, you can give ChatGPT a few examples to see how it copes with the task. If there is insufficient data to reliably use a statistical method, and the task requires a high precision method, it may be quicker to use manually developed methods. On the other hand, if there is enough data, a suitable machine learning method is available, and the processing environment performance is sufficient, this combination often provides the most reproducible development path.

All the above development paths are data-driven and require data to be shared with other researchers for replication. In previous years, there has been a strong enthusiasm for completely open source data sets. While this is still a desirable goal, there are many datasets that, for one reason or another, cannot be made available to everyone. Gradually, our community of researchers, together with the lawmakers, has succeeded in developing a legal framework for data access that is open enough for academic researchers to study the data and verify the results in a relatively straightforward way, while keeping the data accessible to a sufficiently small audience so as not to put personal data at risk or infringe on copyrights.

A new development need is to create a method that lets researchers in the humanities and social sciences discuss with an AI the content of the datasets they deposit in the Language Bank.

How is your research related to Kielipankki?

The Language Bank provides both a platform for tool development and an opportunity to show how different types of research-oriented datasets can be shared with other researchers in a safe and legal way.

Recent publications

Jauhiainen, T., Zampieri, M., Baldwin, T. C., & Linden, K. (2024). Automatic Language Identification in Texts. (Synthesis Lectures on Human Language Technologies). Springer. https://doi.org/10.1007/978-3-031-45822-4

Jauhiainen, T., Piitulainen, J., Axelson, E., Dieckmann, U., Lennes, M., Niemi, J., Rueter, J., & Linden, K. (2024). Investigating Multilinguality in the Plenary Sessions of the Parliament of Finland with Automatic Language Identification. In D. Fišer, M. Eskevich, & D. Bordon (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): ParlaCLARIN IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (pp. 48-56). (International conference on computational linguistics), (LREC proceedings). European Language Resources Association (ELRA). https://researchportal.helsinki.fi/files/312866811/ArtikkeliJulkaistu.pdf

Sahala, A., & Linden, K. (2023). BabyLemmatizer 2.0 – A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages. In A. Anderson, S. Gordin, B. Li, Y. Liu, & M. C. Passarotti (Eds.), Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023 (pp. 203-212). INCOMA. https://aclanthology.org/2023.alp-1.23

Linden, K., Niemi, J., & Kontino, T. (Eds.) (2023). CLARIN Annual Conference Proceedings 2023. (CLARIN Annual Conference Proceedings). CLARIN ERIC. https://researchportal.helsinki.fi/files/298353929/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf

Lindén, K., Ruokolainen, T., Hämäläinen, L., & Harviainen, J. T. (2023). Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC). In M. M. Rantanen, S. Westerstrand, O. Sahlgren, & J. Koskinen (Eds.), Proceedings of the Conference on Technology Ethics 2023 – Tethics 2023 (pp. 114-131). (CEUR Workshop Proceedings; Vol. 3582). CEUR-WS.org. https://researchportal.helsinki.fi/files/295005165/FP_10.pdf

Kamocki, P., Linden, K., Puksas, A., & Kelli, A. (2023). EU Data Governance Act: Outlining a Potential Role for CLARIN. In T. Erjavec, & M. Eskevich (Eds.), Selected papers from the CLARIN Annual Conference 2022 (pp. 57-65). (Linköping Electronic Conference Proceedings; No. 198). CLARIN ERIC. https://doi.org/10.3384/ecp198006

Linden, K., Jauhiainen, T., & Hardwick, S. (2023). FinnSentiment: A Finnish Social Media Corpus for Sentiment Polarity Annotation. Language Resources and Evaluation, 57(2), 581-609. https://doi.org/10.1007/s10579-023-09644-5

Axelson, E., Hardwick, S., & Linden, K. (2023). HFST Training Environment and Recent Additions. In A. Hurskainen, K. Koskenniemi, & T. P. (Eds.), Rule-Based Language Technology (pp. 60-69). (NEALT Monograph Series; No. 2[1]). Northern European Association for Language Technology. http://hdl.handle.net/10062/89595



Researcher of the Month: Juraj Šimko

Juraj Šimko
Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juraj Šimko tells us about his research on speech articulation and prosody. The Phonetics and Speech Synthesis Research Group at the University of Helsinki also aims to use large language models for finding answers to certain theoretical questions related to speech.

Who are you?

I am a University Lecturer in Phonetics and have been working at the University of Helsinki since 2013. Prior to that, I studied and worked at several universities in Slovakia, Ireland and Germany, and I spent several years as a Language Specialist at Microsoft. I currently also hold an Honorary Professorship at the Indian Institute of Technology in Guwahati. My background is in Maths, Cognitive Science and Phonetics.

I am a member of the Phonetics and Speech Synthesis Research Group at the Department of Digital Humanities, but I am currently also involved in an ERC Advanced grant (to Professor Alice Turk) called Planning the Articulation of Spoken Utterances at the University of Edinburgh, where we investigate and model cognitive processes behind speech production and articulation.

What is your research topic?

I am passionate about human speech research. Besides speech articulation, my own as well as our Group’s main research interest is speech prosody, that is, essentially, all those melodic, rhythmic, emotional aspects of speech that go beyond the linguistic message that we pass on when we speak. In our current project Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis we are working on a novel speech synthesis architecture that is inspired by the influential theoretical and modelling paradigm of human cognition called Predictive Processing. Of course, the first obvious aim is to produce world-class speech synthesis, and our team has indeed been creating state-of-the-art Finnish and Finland Swedish synthesis systems. But we also want to use the huge language models that drive technological applications as statistical representations of the speech material used for their training, and use them to answer theoretical questions related to speech. These questions include, among others, the distribution and evolution of accents and dialects, the relationship between sociolinguistics and prosody, and prosodic patterns in politicians’ parliamentary speeches.

How is your research related to Kielipankki?

In order to do all that, we need quite a lot of data. Some of it we create ourselves, with invaluable assistance from Kielipankki experts: we have designed and recorded the FinSyn corpus of high-quality speech material intended for speech technology applications, primarily for speech synthesis. The corpus contains ~75 hours of studio-quality recordings from three voice talents, two of them speaking Finnish and one Finland Swedish. This corpus will appear as part of the Kielipankki collection. Our work on dialects and sociolinguistics relies heavily on other Kielipankki corpora, primarily the groundbreaking Donate Speech (Lahjoita puhetta) Corpus and the Aalto Finnish Parliament ASR Corpus.

Recent publications

Törö, T., Suni, A. and Šimko, J. (2024). Analysis of regional variants in a vast corpus of Finnish spontaneous speech using a large-scale self-supervised model, Proceedings of Speech Prosody 2024, Leiden, Netherlands. DOI: 10.21437/SpeechProsody.2024

Vainio, M., Suni, A., Šimko, J. and Kakouros, S. (2024). The Power of Prosody and Prosody of Power: An Acoustic Analysis of Finnish Parliamentary Speech, Proceedings of Speech Prosody 2024, Leiden, Netherlands. DOI: 10.21437/SpeechProsody.2024

Elie, B., Šimko, J., and Turk, A. (2024). Optimization-based modeling of Lombard speech articulation: Supraglottal characteristics. JASA Express Letters, 4(1). https://doi.org/10.1121/10.0024364

Kakouros, S., Šimko, J., Vainio M., and Suni, A. (2023). Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody, Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France. https://doi.org/10.21437/SSW.2023-20

Šimko, J., Törö, T., Vainio M., and Suni, A. (2023). Prosody under control: Controlling prosody in text-to-speech synthesis by adjustments in latent reference space, Proceedings of the 18th International Congress of Phonetic Sciences, Prague, Czech Republic. http://hdl.handle.net/10138/565382

Šimko, J., Adigwe, A., Suni, A. and Vainio M. (2022). A Hierarchical Predictive Processing Approach to Modelling Prosody, Proc. 11th International Conference on Speech Prosody, Lisbon, Portugal. https://doi.org/10.21437/SpeechProsody.2022-86



Researcher of the Month: Lotta Leiwo

Lotta Leiwo
Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Lotta Leiwo tells us about her research in folkloristics, digging into the life and work of Finnish-American T-Bone Slim.

Who are you?

I am Lotta Leiwo, a doctoral researcher at the University of Helsinki, where I am studying for a PhD in history and cultural heritage. My dissertation in Folklore Studies examines the political role and nature-related rhetoric of Finnish-American women in the Finnish Socialist Federation (FSF) in the early 20th century. My main research data consists of FSF documents and a newspaper called Toveritar. The Toveritar, a mouthpiece of the FSF, targeted women and was edited and written mainly by women.

Prior to my doctoral project, I worked for two years as a research assistant on the project T-Bone Slim and the transnational poetics of the migrant left in North America (Kone Foundation 2022–2023). My main responsibility in this international project was the construction of the T-Bone Slim corpus and database. During the project, I wrote my Master’s thesis on Finnish socialist women in North America and found the topic for my dissertation.

What is your research topic?

In the T-Bone Slim project, an international research team studied the life and literary works of the second-generation Finnish American Matti Valentinpoika Huhta (1882–1942), also known as T-Bone Slim. Huhta was born in Ashtabula, Ohio, to a Finnish family that had emigrated from Kälviä, Central Ostrobothnia. He spent his childhood and youth in Finnish communities in the US, working as a dock worker and as a correspondent for the local chapter of the temperance movement. In the 1910s, Huhta abandoned his family and took up life as a ’hobo’, or itinerant worker. By the 1920s, Huhta had become radicalised, joining the Industrial Workers of the World (IWW) and becoming a columnist for IWW newspapers and periodicals. He continued his writing career under the pen name T-Bone Slim until his death. Huhta lived his last years in New York, where he worked as a deck scow captain. In May 1942, he was found drowned in New York’s East River, and he was almost forgotten for several decades. For more on the unresolved questions surrounding T-Bone Slim’s death, please visit our project blog and read Saku Pinta’s two-part text ”Who Killed T-Bone Slim”, Part I and Part II.

In the late 2010s, musician John Westmoreland, a relative of Slim’s, discovered his ”Uncle Matt’s” writing career as T-Bone Slim. Around the same time, academic interest in Slim, who had a Finnish background, began to grow, and his relatives and researchers found each other through T-Bone Slim studies. The research continued in a project funded by the Kone Foundation, which brought together John Westmoreland and scholars from Finland, the UK, the US, Canada, and Australia. Kirsti Salmi-Niklander is the Principal Investigator of the project. We collected the T-Bone Slim materials that the researchers had gathered from various archives and organized them into a corpus to enhance accessibility for others interested in the subject. In total, the materials come from 14 archives in five countries on three continents: the United States, Canada, Finland, Sweden and Australia.

The corpus encompasses a total of 1294 texts written by T-Bone Slim and published in English in IWW periodicals. However, Slim also wrote in Finnish on occasion and sometimes used Swedish. The corpus also includes Slim’s surviving manuscripts.

The texts written by T-Bone Slim are a gold mine for researchers. Slim used language cleverly, combining different genres and means of expression. In addition, the historical, literary and cultural references found in the texts provide an opportunity to examine the IWW movement, transnational migration and history in the United States from diverse perspectives. The language employed in the texts is rich, insightful, and even playful, and may be of interest to linguists. As the material comprises both published and unpublished texts, it offers insights into both the editorial processes of political publishing and the writing practices of an individual author.

Within the framework of the project, I have examined the literary practices and literacy acquisition of Finnish migrant-settlers, as well as Slim’s use of genres, from a semiotic perspective. Notably, Slim’s texts exhibit multilingualism in both background and content, incorporating intertextuality and multimodality across various genres and oral-literary practices. Such practices are evident, for example, in his song lyrics. In typical IWW style, Slim wrote lyrics addressing social injustices to popular song tunes known to readers. The lyrics were thus written to be sung, with the aim of provoking the reader/singer to reflect on the message of the lyrics. As Owen Clayton, a collaborator on our project, has observed, T-Bone Slim sought to activate and engage readers through language and words. I, too, am continually amazed and delighted by Slim’s skilful written expression.

How is your research related to Kielipankki?

In the early stages of the project, we thought long and hard about a suitable repository for the T-Bone Slim corpus and database. Our priority was to find a long-term storage solution that would ensure the materials’ widespread accessibility. Equally important was the need for the corpus to be explored and analysed through digital humanities methods.

The T-Bone Slim corpus and database will be published in April 2024 in Kielipankki, which fulfills all our storage and access requirements. The collection consists of photographic and microfilm scans of the original materials (newspapers, periodicals and manuscripts) with transcriptions and a database. The database includes all the texts in the corpus accompanied by metadata (date of publication, publication, title of the text, archive from which the material was collected, language, etc.). Additionally, we have experimented with abstracting data for a subset of the materials. For example, the people and places mentioned by T-Bone Slim and information about the poems or songs contained in the texts are listed in the abstracted data. The purpose of the database is to facilitate data navigation and serve as a foundation for more detailed abstraction of the data by other researchers.

T-Bone Slim Corpus and Database Launching Event

Welcome to the Resurrection – T-Bone Slim Corpus and Database Launching Event on Monday May 20th, 2024 at 15:00–17:00. The launching event is open to the public and the program can be followed both via Zoom and on-site at the Finnish Literature Society (Hallituskatu 1, Helsinki). More information and registration for remote participants.

Publications

Apajalahti, Eeva-Lotta et al. (2022). ”Ihmistieteelliset näkökulmat metsiin tuottavat tietoa moninaisista metsäsuhteista ja niiden tulevaisuuksista.” Vuosilusto 14(2022): 13–51. Available: https://lusto.fi/wp-content/uploads/2022/12/Lusto-Vuosilusto14.pdf.

Leiwo, Lotta (2024). ”When One’s Life Becomes the Field. Assessing the Field in Collaborative Autoethnography.” Marburg Journal of Religion 25(1). https://doi.org/10.17192/mjr.2024.25.8693.

Leiwo, Lotta (2023). ”Luontokin näkyy olevan köyhälistöä vastaan” Luonto kolmantena tilana Toveritar-lehden paikkakuntakirjeissä 1916–1917. Master’s thesis. Helsinki: University of Helsinki. http://urn.fi/URN:NBN:fi:hulib-202305302306.

Leiwo, Lotta (2023). ”Suomen koloniaalin osallisuuden kontekstit haltuun: Hoegaerts, Josephine, Tuire Liimatainen, Laura Hekanaho ja Elizabeth Peterson (toim.). 2022. Finnishness, Whiteness and Coloniality.” Elore, 30(2), 142–147. Book review. https://doi.org/10.30666/elore.137470.

Mäkelä, Heidi Henriikka, Leiwo, Lotta, Linkola, Hannu ja Rinne, Jenni (2023). ”The spiritual forest: an ethnographic exploration on Finnish forest yoga and the forest landscape.” Landscape Research. https://doi.org/10.1080/01426397.2023.2268550.

Entries from the Research Project’s Blog

Leiwo, Lotta (2023). ”T-Bone Slim Database – Final Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 18.12.2023. https://blogs.helsinki.fi/tboneslim/2023/12/18/t-bone-slim-database-final-steps/.

Leiwo, Lotta (2023). ”T-Bone Slim Database – Next Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 22.6.2023. https://blogs.helsinki.fi/tboneslim/2023/06/22/t-bone-slim-database-next-steps/.

Salmi-Niklander, Kirsti (2023). ”’T-Bone Slim’ eli Matti V. Huhta ajatteli ja kirjoitti kahdella kielellä kulkurielämästä ja työläisten oikeuksista” ’Vähäisiä lisiä’ Blog. Published 12.5.2023. https://www.finlit.fi/ajankohtaista/blogi/t-bone-slim-eli-matti-v-huhta-ajatteli-ja-kirjoitti-kahdella-kielella-kulkurielamasta-ja-tyolaisten-oikeuksista/.

Clayton, Owen (2023). ”Technocracy and T-Bone Slim’s Break with Ralph Chaplin” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 1.3.2023. https://blogs.helsinki.fi/tboneslim/2023/03/01/technocracy-and-t-bone-slims-break-with-ralph-chaplin/.

Dalbello, Marija (2022). ”From my Archival ‘Digs’, part I. Finding Slim!” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 12.12.2022. https://blogs.helsinki.fi/tboneslim/2022/12/12/finding-slim/.

Pinta, Saku (2022). ”T-Bone Slim’s Forgotten Finnish-Language Writings in the IWW Press” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 20.10.2022. https://blogs.helsinki.fi/tboneslim/2022/10/20/t-bone-slims-forgotten-finnish-language-writings-in-the-iww-press/.

Leiwo, Lotta (2022). ”T-Bone Slim Database – First Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 5.10.2022. https://blogs.helsinki.fi/tboneslim/2022/10/05/t-bone-slim-database-first-steps/.

 

Researcher of the Month: Harri Uusitalo

Harri Uusitalo
Photo: Timo Tuovinen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Harri Uusitalo tells us about his research using various types of Finnish-language corpora from different time periods.

Who are you?

I am Harri Uusitalo, postdoctoral researcher at the University of Turku. I am a researcher of the Finnish language, and I am currently working at the School of History, Culture and Arts Studies in the interdisciplinary projects Fauna et Flora Fennica and Disappeared, Endangered and Newly Arrived Species: The Human Relationship with the Changing Biodiversity of the Baltic Sea. In the research groups, we examine the historical relationship of the Finnish people with nature.

What is your research topic?

I have studied Finnish texts from different periods, from the time of Agricola to the present day. My doctoral thesis focused on the legal language of the 17th century, and more recently I have been fascinated by environmental themes and ecolinguistic perspectives.

How is your research related to Kielipankki?

Together with my colleagues, I have used the Kielipankki data in some of my research. For example, Karita Suomalainen and I used the Suomi24 corpus and the Korp tool to investigate how Finnish people identify and discuss invasive alien species. With Duha Elsayed and Heidi Salmi, I used the Morpho-Syntactic Database of Mikael Agricola’s Works to study the translative form of the A-infinitive in Agricola’s works.

In my future research, I will certainly make use of many other corpora in Kielipankki, such as the Corpus of Old Literary Finnish, the Corpus of Early Modern Finnish and the Newspaper and Periodical Corpus of the National Library of Finland.

Publications

Uusitalo Harri, Lähdesmäki Heta, Sonck-Rautio Kirsi, Latva Otto, Salmi Hannu & Alenius Teija (forthcoming): Alien Plants between Practices and Representations: the Cases of European Spruce and Beach Rose in Finland. Plant Perspectives.

Uusitalo Harri & Suomalainen Karita 2023: Ecolinguistic Approach to Online Finnish Discourse on Invasive Alien Species. Language@Internet 21. https://www.languageatinternet.org/articles/2023/uusitalo

Elsayed Duha, Salmi Heidi & Uusitalo Harri 2022: A-infinitiivin translatiivi Mikael Agricolan teksteissä. Sananjalka 64. Suomen Kielen Seura, Turku. DOI: 10.30673/sja.107377

Researcher of the Month: Tanja Säily

Tanja Säily
Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tanja Säily tells us about her research on the English language, which combines corpus linguistics, digital humanities and historical sociolinguistics.

Who are you?

I am Tanja Säily, Assistant Professor in English Language at the University of Helsinki.

What is your research topic?

I study variation and change in the English language from a sociolinguistic perspective. My research combines corpus linguistics, digital humanities and historical sociolinguistics. I frequently collaborate with other linguists and historians, and I develop new methods with data scientists and language technologists. I analyse sociolinguistic variation especially in linguistic productivity, such as the use of neologisms. I have also studied gendered styles and factors influencing the rate of language change.

How is your research related to Kielipankki?

In my research, I use English text corpora, which I have also deposited in Kielipankki for myself and others to use. I am currently studying the productivity of various linguistic constructions in the Corpus of Historical American English (e.g. Säily & Vartiainen, forthcoming). I have been using this corpus with the Korp tool and have also downloaded it to my own computer.

I have prepared openly available teaching materials on the methods of historical corpus linguistics for graduate students and other interested parties. They are included in the Method Bank for Linguistics, and the Early Modern English section of the Helsinki Corpus of English Texts used in the exercises can be found in Kielipankki.

Publications

Here are a few of my most recent publications; the entire list can be found at https://tanjasaily.fi/publications/

Accepted. Säily, Tanja, Martin Hilpert & Jukka Suomela. New approaches to investigating change in derivational productivity: Gender and internal factors in the development of -ity and -ness, 1600–1800. Patricia Ronan, Theresa Neumaier, Lisa Westermayer, Andreas Weilinghoff & Sarah Buschfeld (eds.), Crossing boundaries through corpora: Innovative approaches to corpus linguistics (Studies in Corpus Linguistics). Amsterdam: John Benjamins.

Accepted. Säily, Tanja & Turo Vartiainen. Historical linguistics. Michaela Mahlberg & Gavin Brooks (eds.), Bloomsbury handbook of corpus linguistics. London: Bloomsbury.

Accepted. Säily, Tanja, Turo Vartiainen, Harri Siirtola & Terttu Nevalainen. Changing styles of letter-writing? Evidence from 400 years of early English letters in a POS-tagged corpus. Luisella Caon, Moragh Gordon & Thijs Porck (eds.), Unlocking the history of English: Pragmatics, prescriptivism and text types (Current Issues in Linguistic Theory). Amsterdam: John Benjamins.

2023. Landert, Daniela, Tanja Säily & Mika Hämäläinen. TV series as disseminators of emerging vocabulary: Non-codified expressions in the TV Corpus. ICAME Journal 47(1): 63–79. DOI: 10.2478/icame-2023-0004

2022. Rodríguez-Puente, Paula, Tanja Säily & Jukka Suomela. New methods for analysing diachronic suffix competition across registers: How -ity gained ground on -ness in Early Modern English. International Journal of Corpus Linguistics 27(4): 506–528. Special issue, Corpus studies of language through time, ed. by Tony McEnery, Gavin Brookes & Isobelle Clarke. DOI: 10.1075/ijcl.22014.rod

2021. Säily, Tanja, Eetu Mäkelä & Mika Hämäläinen. From plenipotentiary to puddingless: Users and uses of new words in early English letters. Mika Hämäläinen, Niko Partanen & Khalid Alnajjar (eds.), Multilingual Facilitation, 153–169. Helsinki: University of Helsinki. DOI: 10.31885/9789515150257.15

2020. Mäkelä, Eetu, Krista Lagus, Leo Lahti, Tanja Säily, Mikko Tolonen, Mika Hämäläinen, Samuli Kaislaniemi & Terttu Nevalainen. Wrangling with non-standard data. Sanita Reinsone, Inguna Skadiņa, Anda Baklāne & Jānis Daugavietis (eds.), Proceedings of the Digital Humanities in the Nordic Countries 5th Conference, Riga, Latvia, October 21–23, 2020 (CEUR Workshop Proceedings 2612), 81–96. Aachen: CEUR-WS.org. DHN 2020 Best Paper Award. http://ceur-ws.org/Vol-2612/paper6.pdf

2020. Nevalainen, Terttu, Tanja Säily, Turo Vartiainen, Aatu Liimatta & Jefrey Lijffijt. History of English as punctuated equilibria? A meta-analysis of the rate of linguistic change in Middle English. Journal of Historical Sociolinguistics 6(2): article 20190008. Special issue, Comparative Sociolinguistic Perspectives on the Rate of Linguistic Change, ed. by Terttu Nevalainen, Tanja Säily & Turo Vartiainen. DOI: 10.1515/jhsl-2019-0008

2019. Hill, Mark J., Ville Vaara, Tanja Säily, Leo Lahti & Mikko Tolonen. Reconstructing intellectual networks: From the ESTC’s bibliographic metadata to historical material. Costanza Navarretta, Manex Agirrezabal & Bente Maegaard (eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, Copenhagen, Denmark, March 6–8, 2019 (CEUR Workshop Proceedings 2364), 201–219. Aachen: CEUR-WS.org. DHN 2019 Best Paper Award. http://ceur-ws.org/Vol-2364/19_paper.pdf

2018. Säily, Tanja. Change or variation? Productivity of the suffixes -ness and -ity. Terttu Nevalainen, Minna Palander-Collin & Tanja Säily (eds.), Patterns of Change in 18th-century English: A Sociolinguistic Approach (Advances in Historical Sociolinguistics 8), 197–218. Amsterdam: John Benjamins. DOI: 10.1075/ahs.8

Researcher of the Month: Liisa Mustanoja

Liisa Mustanoja
Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Liisa Mustanoja tells us about her research on sociolinguistics. With the help of a longitudinal corpus, it is possible to observe changes in the spoken language of the same people at different points in time.

Who are you?

I am Liisa Mustanoja, PhD, from Tampere. I work as a University Lecturer of Finnish Language in the Unit of Languages at the Faculty of Information Technology and Communication, University of Tampere. From January 2024, I will be the Head of the Unit of Languages for the next five years. I am also an Associate Professor of Finnish at the University of Oulu, specialising in sociolinguistics.

What is your research topic?

So far, all my research fits under the large umbrella of sociolinguistics. I am interested in the relationship between language and society, especially in all forms of change, upheaval and movement. In my doctoral research, I examined the change of the spoken language of Tampere at the level of the idiolect. This was a so-called real-time panel survey, in which I examined the language of the same people in the light of two points in time. Later, together with my colleagues, I have extended the study to the spoken language of Helsinki, and we have also included a third time point. The focus has largely been on the phonetic and formal structure of the language, but the data has also allowed for a sociophonetic approach. In one article, for example, we investigated changes in pitch over time.

In addition to the path of variation studies, I am interested in the interface between spoken and written language, and this has provided me with another research direction, namely the study of letter writing. I have investigated, both on my own and together with Finnish language students, the correspondence during the Second World War. As there was no other means of communication during the war, everyone took up their pen, regardless of age, profession or educational background. Although this correspondence resource is old, it has provided essential insights into the importance of human contact in times of crisis, as well as into everyday life and humanity in the midst of world turmoil.

How is your research related to Kielipankki?

For some time now, Kielipankki has made accessible the Longitudinal Corpus of Finnish Spoken in Helsinki, which has provided me and my colleagues with an important source of data for studying language change. This corpus will hopefully be joined in the coming months by a little sister, the Longitudinal data of Tampere spoken language. Previously, recordings of the spoken language of Tampere had been made in the 1970s and 1990s. In 2019, I started a third round of data collection in Tampere, which has been continued by students up to the present day. Thanks to the funding I received from FIN-CLARIN, I have also been able to hire some temporary help to work on the material. Everything is now in place, except for the final paperwork. The transfer and archiving of personal speech data has its own complications, but Kielipankki is by far the best possible repository for this valuable longitudinal data. On the eve of handing over the material, it feels like there should be more material and it should be more complete, and the transcripts should be revised countless more times. But really, every little addition to Kielipankki is a great gift to the research community. And once even a part of the resource is opened up, someone else also has the opportunity to join the transcription work if they want to!

Among the resources in Kielipankki, I would also like to mention the Suomi24 Corpus, which is well suited to student work. Nowadays, when data protection matters are demanding, it is a relief to be able to direct students to these ready-made resources. For me, too, there are still many new things to wonder about in Kielipankki. My interest in wartime letters, for example, has recently led me to Kalle Päätalo’s Iijoki series, and I have been quite surprised by the research potential of this cornucopia.

Publications

Mustanoja Liisa, O’Dell Michael & Lappalainen Hanna, 2022: Helsinkiläis- ja tamperelaispuhujien äänenkorkeuden muutokset 1970-luvulta 2010-luvulle. Puhe ja kieli. https://doi.org/10.23997/pk.121404

Kuparinen Olli, Santaharju Jenni, Leino Unni, Mustanoja Liisa & Peltonen Jaakko 2022: Katomuotojen eteneminen hd-yhtymässä Helsingin puhekielessä. Virittäjä 126, s. 316–338. https://doi.org/10.23982/vir.100585

Kuparinen Olli, Peltonen Jaakko, Mustanoja Liisa, Leino Unni & Santaharju Jenni, 2021: Lects in Helsinki Finnish – a probabilistic component modeling approach. Language Variation and Change. https://doi.org/10.1017/S0954394521000041

Lappalainen Hanna, Mustanoja Liisa & O’Dell Michael, 2019: Miten ja milloin yksilön kieli muuttuu? Helsinkiläisidiolektien muutos ja muutoksen tutkimuksen menetelmät. Virittäjä 123, s. 550–581. https://doi.org/10.23982/vir.67808

Kuparinen Olli, Mustanoja Liisa, Peltonen Jaakko, Santaharju Jenni & Leino Unni, 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. Sananjalka 61. s. 30–56. https://doi.org/10.30673/sja.80056

Mustanoja Liisa, 2018: Sydämellisiä kirjeitä talvisodasta. Hämäläisten sotilaiden kiitoskirjeet aikansa kielen ja kirjeenvaihtokulttuurin heijastajina. Sisko Brunni, Niina Kunnas, Santeri Palviainen ja Jari Sivonen (toim.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia humaniora ouluensia 16. Oulu, s. 251–285. https://urn.fi/URN:ISBN:9789526221120

Mustanoja Liisa (toim.), 2017: Arjen sirpaleita ja suuria tunteita: Kirjeet sodan sanoittajina ja ihmissuhteiden ylläpitäjinä 1939–1944. Tampere Studies in Language, Translation and Literature B5. Tampereen yliopisto. https://urn.fi/URN:ISBN:978-952-03-0527-7

Mustanoja Liisa, 2011: Idiolekti ja sen muuttuminen: reaaliaikatutkimus Tampereen puhekielestä. Tampere: Tampere University Press. https://urn.fi/urn:isbn:978-951-44-8417-9

Researcher of the Month: Tiina Onikki-Rantajääskö

Tiina Onikki-Rantajääskö
Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tiina Onikki-Rantajääskö tells us about the principles of the Helsinki Term Bank for the Arts and Sciences (HTB) and invites interested experts to join the collaborative terminology work.

Who are you?

I am Tiina Onikki-Rantajääskö, Professor of Finnish at the University of Helsinki. I also lead the Helsinki Term Bank for the Arts and Sciences (HTB).

What is your research topic?

I am generally interested in how vocabulary and grammatical structures construe linguistic meaning and how they function in relation to the wider textual context. Most of my published research is related to the local cases of the Finnish language. Currently, I am delighted to see how younger researchers aim to combine qualitative and quantitative research in the project Platforms and Rhetorical Group Strategies (in Finnish, ”Alustat ja retoriset ryhmästrategiat”), run by me and Eetu Mäkelä and funded by Kone Foundation. I am particularly interested in discovering whether some constructions can indicate broader discourse structures. However, during this winter, I am spending most of my time on my duties as the Finnish Language Rapporteur, appointed by the Ministry of Justice.

How is your research related to Kielipankki?

I tend to use the Finnish language resources in Kielipankki whenever I need information about the context of a word or grammatical element. Many of the corpora that I have used in the past can now be found in Kielipankki, such as the HS.fi News and Comments Corpus that was compiled in one of my earlier projects.

In addition, the Helsinki Term Bank for the Arts and Sciences (HTB) is part of the FIN-CLARIAH Research Infrastructure, together with Kielipankki. This is reflected in the fact that the online service of the HTB is also accessible via Kielipankki. The HTB also has an employee funded through the FIN-CLARIAH project (FIRI funding from the Research Council of Finland). There is a need for collaboration in the field of language technologies.

The contents of the Helsinki Term Bank for the Arts and Sciences (HTB) are still in the construction phase. We are constantly working to involve more and more researchers from different disciplines in the terminology work and to invite new disciplines to join the HTB. Defining scientific terms and providing other background information on concepts require expertise in each field. Therefore, the selected method is niche-sourcing of experts, supported by our project planner. The aim is to promote the multilingualism of science in addition to providing openly accessible information describing the formation of scientific knowledge and facilitating the utilization of science. Scientific concepts are at the heart of research. Multilingualism can be promoted by offering translation equivalents for terms in different languages. The Finnish language is in focus, since the aim is to support Finnish as a language of science. However, it is possible to present definitions and concept pages in languages other than Finnish. The term bank thus opens up opportunities for international collaboration. Especially for multilingual and multidisciplinary research groups, the term bank provides an opportunity to shape the common terminological ground. All interested experts are welcome to participate.

My research interests in the Helsinki Term Bank for the Arts and Sciences (HTB) include the presentation of background knowledge frames and the emergence of prototypicality, as well as collaborative interactions: the network of experts in the HTB and the online service interact and form a field of action that differs from traditional research projects.

Publications

Enqvist, Johanna & Tiina Onikki-Rantajääskö & Kaarina Pitkänen-Heikkilä 2021: Terminology work as open, communal and collaborative crowdsourcing practice of academic communities. – Terminology 27:1, pp. 56–79. DOI: 10.1075/term.00058.enq

Jaakola, Minna & Tiina Onikki-Rantajääskö (eds.) 2023: The Finnish Case System: Cognitive Linguistic Perspectives. Helsinki: SKS. DOI: 10.21435/sflin.23

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Vetenskapstermbanken i Finland i samhällets tjänst. – Publikation Nordterm 2023.

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Tieteen termipankki tieteentekemisen ytimessä. – Kieliviesti 2/2023.

Onikki-Rantajääskö, Tiina & Harri Kettunen 2023: Vuosi 2022 Tieteen termipankissa: Laajenemista uusille aihealueille ja tunnustuspalkintoja avoimen tieteen edistämisestä. – Tieteen termipankin blogi. Helmikuu/2023. https://blogs.helsinki.fi/tieteentermipankki/2023/02/16/vuosi-2022-tieteen-termipankissa-laajenemista-uusille-aihealueille-ja-tunnustuspalkintoja-avoimen-tieteen-edistamisesta/

Researcher of the Month: Aleksi Sahala

Aleksi Sahala
Photo: Marianne Ough

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Aleksi Sahala tells us about his research on the development and application of Natural Language Processing (NLP) methods for annotating and analyzing ancient text data.

Who are you?

I am Aleksi Sahala, a postdoctoral researcher in Assyriology and Language Technology. I am currently working at the University of Helsinki in an Academy of Finland-funded project, “The Origins of Emesal”, where our goal is to use computational methods to investigate how Emesal, the only known language variety of Sumerian, came to be and evolved over time.

I did my master’s degree in Assyriology and Computational Linguistics, and in 2021 I finished my PhD thesis “Contributions to Computational Assyriology”. In 2022, I was a visiting scholar at the University of California, Berkeley, and in 2024 I will visit the University of Innsbruck in Austria. I have also worked in close co-operation with the Centre of Excellence in Ancient Near Eastern Empires at the University of Helsinki.

What is your research topic?

My research focuses on the development and application of NLP (Natural Language Processing) methods for annotating and analyzing ancient text data. My particular interest lies in the Mesopotamian cuneiform texts written in Sumerian (3200 BCE – 100 CE) and Akkadian (2500 BCE – 100 CE). Analysis of Sumerian and Akkadian texts is challenging not only due to data sparsity and the fragmentary nature of the primary sources, but also due to the complexity of the cuneiform writing system and the inflectional morphology. In theory, most words can occur in several thousand different forms, each of which can also be spelled in several different ways.

My main focus has been the development of a pipeline that can linguistically annotate raw transliterations of cuneiform texts so that these texts can be used for data analysis and visualization. This allows for the analysis of thousands of transliterated texts simultaneously and, for example, the visualization and study of how different words, concepts or entities are related to each other on a larger scale. Although Assyriologists have digitized over 20,000 Akkadian and over 100,000 Sumerian texts in various text corpora, these texts have mostly been studied qualitatively through close reading. By applying a more computational approach, it becomes easier to reveal larger patterns within specific groups of texts.

I have developed a finite-state morphology for Akkadian (BabyFST), as well as a language-independent neural lemmatizer and tagger with special support for cuneiform languages (BabyLemmatizer). In addition, I have built a word-embedding-based tool for analyzing semantic relationships of words in sparse and fragmentary data sets (PMI Embeddings).
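As a rough illustration of the idea behind PMI-based embeddings, the following minimal sketch computes pointwise mutual information (PMI) scores from co-occurrence counts within a fixed window. It is not the implementation used in PMI Embeddings: the function name `pmi_table`, the window size and the toy English tokens are all assumptions made for this example, and the dimensionality-reduction step that would turn such scores into dense vectors is left out.

```python
import math
from collections import Counter


def pmi_table(sentences, window=2):
    """Return {(word, context): PMI} from symmetric window co-occurrences.

    Hypothetical helper for illustration only.
    """
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    # PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )
    return {
        (word, context): math.log2(
            (n * total) / (word_counts[word] * context_counts[context])
        )
        for (word, context), n in pair_counts.items()
    }


# Toy data standing in for lemmatized texts (invented for this example).
toy_sentences = [
    ["king", "built", "temple"],
    ["king", "built", "palace"],
    ["scribe", "wrote", "tablet"],
]
scores = pmi_table(toy_sentences, window=1)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Pairs that co-occur more often than their individual frequencies would predict receive a positive score, which is what makes the measure useful for exploring semantic relationships in small corpora.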

My current project focuses on Emesal, a liturgical variant of the Sumerian language, which is only attested in writing after Sumerian was no longer used as a vernacular. Although it is known that Emesal was used in liturgical contexts, such as lamentations, and occasionally to indicate the direct speech of goddesses and women, its origins and evolution are still widely debated. None of the Emesal texts were written entirely in this language variant; they were written in Sumerian, and Emesal words were used only here and there as keywords to indicate that the current line or passage should be read in this dialect. The rules behind this code switching, if any ever existed, remain largely unknown. We hope that a larger-scale analysis of Emesal texts could reveal patterns that explain exactly what kinds of environments triggered the use of Emesal words, how this language variant was introduced into written documents, and how it evolved over its 2,000-year history.

How is your research related to Kielipankki?

Kielipankki has been co-operating with the Centre of Excellence in Ancient Near Eastern Empires by annotating cuneiform texts and publishing them in the Korp concordance service. My responsibilities have been to collect these data sets and convert them into a Korp-compatible format, and to develop tools for annotating them and harmonizing them with the existing resources so that they can be used together efficiently for quantitative analysis.

Recently, we have been working on the harmonization, lemmatization and tagging of Achemenet, a collection of Neo-Babylonian administrative and legal documents.

Publications

Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S., & Lindén, K. (2019). Aššur and his friends: a statistical analysis of neo-assyrian texts. Journal of Cuneiform Studies, 71(1), 159–180. http://hdl.handle.net/10138/303986

Alstola, T., Jauhiainen, H., Svärd, S., Sahala, A., & Lindén, K. (2023). Digital Approaches to Analyzing and Translating Emotion: What Is Love?. In The Routledge Handbook of Emotions in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/348398

Bennet, E. & Sahala, A. (2023). Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023. http://hdl.handle.net/10138/565513

Ihalainen, P. & Sahala, A. (2020). Evolving Conceptualisations of Internationalism in the UK Parliament. Digital Histories, 199.

Luukko, M., Sahala, A., Hardwick, S., & Lindén, K. (2020). Akkadian treebank for early neo-assyrian royal inscriptions. In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories. The Association for Computational Linguistics. http://hdl.handle.net/10138/322305

Sahala, A. J. A. (2017). Johdatus sumerin kieleen. Suomen itämainen seura.

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). BabyFST: Towards a finite-state based computational model of ancient babylonian. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3886–3894). http://hdl.handle.net/10138/317691

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). Automated phonological transcription of Akkadian cuneiform text. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association (ELRA). http://hdl.handle.net/10138/317688

Sahala, A. (2021). Contributions to Computational Assyriology. PhD Thesis. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-7416-1

Sahala, A., & Töyräänvuori, J. (2022). Kirjoitustaidon kehittyminen. In Svärd, S. & Töyräänvuori, J. (eds.), Muinaisen Lähi-idän imperiumit. Kadonneiden suurvaltojen kukoistus ja tuho, pp. 49–69. Gaudeamus, Helsinki.

Sahala, A., & Svärd, S. (2022). Language technology approach to “seeing” in Akkadian. In The Routledge Handbook of the Senses in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/339256

Sahala, A., Alstola, T., Valk, J., & Lindén, K. (2023, June). Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction. In Selected papers from the CLARIN Annual Conference 2022 (pp. 111–119). http://hdl.handle.net/10138/563733

Sahala, A. & Lindén, K. (2023). A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Svärd, S., Jauhiainen, H., Sahala, A., & Lindén, K. (2018). Semantic Domains in Akkadian Texts. CyberResearch on the Ancient Near East and Neighboring Regions. Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving, 2, 224–256. http://hdl.handle.net/10138/241805

Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A., & Lindén, K. (2020). Fear in akkadian texts: New digital perspectives on lexical semantics. In The Expression of Emotions in Ancient Egypt and Mesopotamia (pp. 470–502). Brill. http://hdl.handle.net/10138/328017

Tools

  • BabyLemmatizer, an OpenNMT-based neural lemmatizer and tagger. Pretrained models are available for Ancient Greek, Latin and various cuneiform languages.
  • BabyFST, a finite-state morphology of Akkadian, specifically the Babylonian dialect.
  • PMI-Embeddings, a hyper-parametrized tool for creating PMI+SVD-based word embeddings from sparse or fragmentary data sets.

Corpora

More information

 


Researcher of the Month: Anna Dmitrieva

Anna Dmitrieva
Anna Dmitrieva (standing) with Aleksandra Konovalova (sitting), co-creators of the Parallel Corpus of Finnish and Easy-to-read Finnish. Photo: Anna Dmitrieva

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anna Dmitrieva tells us about her research on text simplification. Computational methods and the compiling of parallel corpora are an integral part of her work.

Who are you?

I am Anna Dmitrieva, a doctoral researcher at HELSLANG, the Doctoral Programme in Language Studies at the University of Helsinki.

What is your research topic?

My main field of interest is text simplification. I have studied computational linguistics since 2012, when I started my studies for the Bachelor’s degree. Since then, I have been involved in many projects related to natural language processing (NLP), but text simplification has been my main focus during my doctoral studies.

Text simplification is a process of making a text “easier”. A simplified text should be more readable and accessible to a broader audience. In NLP, text simplification can be viewed as a monolingual machine translation problem. We train models that are capable of translating or transforming texts, taking a source text in a particular language and producing a “simpler” version of the text in the same language. This task typically requires a lot of parallel data, where there is a corresponding “easy” target text for each source text.

I work with languages that do not have a lot of simplification data, make datasets for them, and train simplification models. During my time as a doctoral researcher, I have made Russian and Finnish text simplification datasets and models. I am also investigating controlled text simplification, the task of manipulating certain linguistic properties in the output of the simplification model.
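One common way to make simplification controllable is to prepend a control token to each source text during training, so that the model learns to associate the token with a property of the target. The sketch below illustrates this with a single property, the character-length ratio between the target and the source. The token format `<LENGTH_RATIO_x.xx>`, the bucket size and the function names are assumptions made for this example and do not describe the exact setup of the models mentioned above.

```python
def length_ratio_token(source: str, target: str, step: float = 0.05) -> str:
    """Build a control token encoding the target/source character-length ratio.

    Hypothetical helper: the token format and bucket size are illustrative.
    """
    if not source:
        raise ValueError("source text must not be empty")
    ratio = len(target) / len(source)
    bucket = round(ratio / step) * step  # snap to the nearest multiple of `step`
    return f"<LENGTH_RATIO_{bucket:.2f}>"


def with_control_token(source: str, target: str) -> str:
    """Prefix the source with the token so a model can learn to condition on it."""
    return f"{length_ratio_token(source, target)} {source}"


# Toy training pair (invented for this example).
standard = "The committee postponed its decision until further notice."
easy = "The committee will decide later."
print(with_control_token(standard, easy))
```

At inference time, the same kind of token could be chosen by hand, for example to request an output that is roughly half the length of the input.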

How is your research related to Kielipankki?

As a student at a Finnish university, I naturally thought of making a Finnish simplification model. Since there were no parallel simplification corpora for Finnish, I had to make one myself. The most obvious choice for the data source was Yle Easy-to-read Finnish News: the news items exist in text form, have been around for a relatively long time, and have equivalents in “regular” Finnish. It was a relief to know that I didn’t have to scrape the news myself using Yle’s API because all the archives are already in Kielipankki.

However, I had to solve the problem of aligning Easy Finnish and Standard Finnish news. I performed automatic alignment, but there was no gold-standard test set of document pairs for testing the quality of the alignments. This is where my friend Aleksandra Konovalova (University of Turku) stepped in and helped me, evaluating 1919 pairs of documents herself. Together, we created the Parallel Corpus of Finnish and Easy-to-read Finnish, which is now available in Kielipankki. Currently, I am adding more document pairs and creating a sentence-aligned version, which will hopefully also be made available via Kielipankki when completed.

Publications

Dmitrieva, A. & Konovalova, A. Creating a parallel Finnish—Easy Finnish dataset from news articles. Jun 2023, Proceedings of the 1st Workshop on Open Community-Driven Machine Translation. Esplá-Gomis, M., Forcada, M., Kuzman, T., Ljubešić, N., van Noord, R., Ramírez-Sánchez, G., Tiedemann, J. & Toral, A. (eds.). Universitat d’Alacant, p. 21-26 6 p. https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27

Dmitrieva, A. Automatic text simplification of Russian texts using control tokens. May 2023, Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Piskorski, J., Marcińczuk, M., Nakov, P. et al. (eds.). Stroudsburg: Association for Computational Linguistics (ACL), p. 70-77 8 p. DOI: 10.18653/v1/2023.bsnlp-1.9

Dmitrieva, A. The role of language technology in accessible communication research. Jun 2023, Emerging Fields in Easy Language and Accessible Communication Research. Deilen, S., Hansen-Schirra, S., Hernández Garrido, S., Maaß, C. & Tardel, A. (eds.). Frank & Timme, p. 319-338 20 p. (Easy – Plain – Accessible; vol. 14). https://researchportal.helsinki.fi/fi/publications/the-role-of-language-technology-in-accessible-communication-resea

Corpora

More information

 


Researcher of the Month: Sampo Pyysalo

Sampo Pyysalo
Photo: Pasi Leino / University of Turku

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampo Pyysalo tells us about his research on natural language processing. Openly available large language models are necessary for developing tools similar to ChatGPT also for smaller languages, such as Finnish.

Who are you?

I’m Sampo Pyysalo, University Research Fellow at the TurkuNLP group of the University of Turku.

What is your research topic?

My research is on machine learning approaches to natural language processing, with particular focus on processing Finnish text and analyzing biomedical domain scientific literature. A lot of my recent work revolves around training large neural network models, including general “foundation” models such as FinBERT and FinGPT as well as task-specific models such as a named entity recognition model for Finnish. I also work on data, both compiling raw text resources for the unsupervised training of foundation models and running manual annotation efforts to create resources for supervised training, such as the Turku NER and TurkuONE corpora.

Large neural language models are central to a lot of state-of-the-art natural language processing and the basis for tools such as ChatGPT, but most such models focus on English and many of the best models are not publicly available. We believe that openly available Finnish models such as FinBERT and FinGPT are necessary to enable the creation of tools for processing Finnish language with comparable capabilities to tools available for English.

How is your research related to Kielipankki?

Creating large language models from scratch requires billions of words of text, and collections of Finnish text of this size are not readily available. To compile sufficiently large corpora for language model training, we have drawn on various sources, including web crawls and resources available through Kielipankki such as the Yle News Archive, the Finnish News Agency Archive (STT) and the Suomi 24 Corpus. We also distribute resources created by TurkuNLP through Kielipankki, among other channels.

In the near future, we hope that we will be able to provide access to the full text resources used to create our models for research purposes through Kielipankki to improve the replicability of our work and to make it easier for future efforts to create models for Finnish.

Publications

J. Luoma & LH. Chang & F. Ginter & S. Pyysalo. 2021. Fine-grained Named Entity Annotation for Finnish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 135–144, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden. https://aclanthology.org/2021.nodalida-main.14

A. Virtanen & J. Kanerva & R. Ilo & J. Luoma & J. Luotolahti & T. Salakoski & F. Ginter & S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. In CoRR, abs/1912.07076. https://doi.org/10.48550/arXiv.1912.07076

Corpora

More information

  • TurkuNLP group of the University of Turku
  • FinBERT, a version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group
  • FinGPT, generative GPT-3-like models for Finnish
  • Finnish NER, a Named Entity Recognition system for Finnish (based on FinBERT and a new NER annotation layer of the UD_Finnish-TDT treebank)

 


Researcher of the Month: Nobufumi Inaba

Nobufumi Inaba
Photo: Krista Teeri

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Nobufumi Inaba tells us about a corpus that he is preparing, which contains a text from the year 1526 and is an interesting source for researchers studying language change.

Who are you?

I am Nobufumi Inaba, Senior Researcher at the Archive of Finnish and Finno-Ugric languages at the University of Turku. The Archive is part of the Department of Finnish and Finno-Ugric Languages and it has only been operating under this name for a couple of years. The Finnish language part of the Archive, for which I am responsible, was formerly known as the Syntax Archive. Many Finnish language researchers are probably familiar with the corpus of the same name. I have been involved in the planning and implementation of, for example, technical solutions for the projects in our department and for the corpora produced in our Archive. I have also created tools to be used internally by our corpus teams.

What is your research topic?

I have been interested in studying language change and its causes. In my dissertation, I investigated the roots of the so-called dative genitive in Finnish and my research data consisted mostly of texts from old literary languages. In recent years, I have been studying the phenomenon of leaving out the inflection of words in Finnish. My data consists of chat conversations in a location-based game community and of the speech recordings I collected at the game locations.

Currently, I am investigating old literary language again. I am preparing a corpus of the 1526 Swedish New Testament, one of the source texts used by Mikael Agricola. This New Testament has been seen as a symbol of the beginning of the Modern Swedish period. The forthcoming corpus is intended to support the study of the language of Agricola’s works. The importance of the text is not merely symbolic. In my opinion, this earlier New Testament text is a much more valuable source for those interested in linguistic changes than the whole Bible of 1541 (Gustav Vasas bibel). It does not seem to contain regulated language, in contrast to the whole Bible, which includes many attempts to regulate and harmonize linguistic elements all the way from vocabulary to syntax. Moreover, the 1526 New Testament contains a striking number of elements from spoken language, which the 1541 Bible largely attempted to eliminate. The preliminary coding of the text to facilitate annotation is now complete, and I expect to start the annotation work in the autumn of 2023.

How is your research related to Kielipankki?

We have had a good division of labour with Kielipankki ever since the days of the Syntax Archive. The University of Turku produces language resources that are published via Kielipankki for the use of the scientific community. The Finnish Dialect Corpus of the Syntax Archive and The Morpho-Syntactic Database of Mikael Agricola’s Works, produced in cooperation with the Institute for the Languages of Finland, as well as the ArkiSyn corpus, an important annotated collection of contemporary Finnish produced at the University of Turku, have all been published via the Korp service in Kielipankki. Naturally, Kielipankki will also be the publication site for the Swedish-language New Testament corpus that I am currently working on.

Publications

Nobufumi Inaba (2015). Suomen datiivigenetiivin juuret vertailevan menetelmän valossa. Suomalais-Ugrilaisen Seuran toimituksia 272. https://www.sgr.fi/fi/items/show/78

Language resources

More information

 


Researcher of the Month: Niina Kunnas

Niina Kunnas
Photo: Mikko Törmänen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Niina Kunnas tells us about her research on minority languages, including Meänkieli.

Who are you?

I am Niina Kunnas, Associate Professor of Finnish language and University Lecturer at the University of Oulu. I also hold a part-time position as Professor of Finnish language at Sámi Allaskuvla in Kautokeino, Norway.

What is your research topic?

My research falls within sociolinguistics, folk linguistics and minority language research. I have examined linguistic variation, language perceptions and situational variation in minority languages, among other things.

How is your research related to Kielipankki?

In recent years, Kielipankki has been involved in my research in a number of ways. Firstly, in 2019, I collected a corpus of spoken Meänkieli together with my students; it was recorded from the outset with the intention of making it available to researchers via Kielipankki. The corpus contains spoken Meänkieli from several municipalities in the Meänkieli-speaking area, and its collection was encouraged by Heikki Paunonen. Some of the interviewees are the same as those recorded in the 1990s. Paunonen also recorded speech from the same parishioners in the 1960s, so the material as a whole makes it possible to carry out a three-round follow-up study of spoken Meänkieli.

I have also recently made use of the corpus Iijoki, the University of Oulu Päätalo Collection, on the Korp server. The corpus contains all the novels in the Iijoki series written by Kalle Päätalo and has a size of over 5 million tokens. Together with Liisa Mustanoja and Maija Saviniemi, I will use this data in our study of the function and the associated affects of the Viena Karelian episodes in the Iijoki series. The corpus has allowed us to search the data rapidly, and the results of the study will be published in an article that will appear in a volume with the working title Päättymätön savotta. Analyyseja Kalle Päätalon tuotannosta (Timberwork without End. Analyses of Kalle Päätalo’s works).

Publications

Kunnas, Niina 2019: Karjalan kieli Oulun seudulla. – Harri Mantila, Maija Saviniemi & Niina Kunnas (eds.), Oulu kieliyhteisönä. 144–199. Helsinki: Suomalaisen Kirjallisuuden Seura.

Saviniemi, Maija, Kunnas, Niina, Mantila, Harri, Paukkunen, Ulla & Rajala, Elina 2019: Oulua havainnoimassa. – Harri Mantila, Maija Saviniemi & Niina Kunnas (eds.), Oulu kieliyhteisönä. 276–318. Helsinki: Suomalaisen Kirjallisuuden Seura.

Vaattovaara, Johanna, Kunnas, Niina & Saviniemi, Maija 2018: Stadi imitoituna. – Sisko Brunni, Niina Kunnas, Santeri Palviainen & Jari Sivonen (eds.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia Humaniora Ouluensia 16. Oulun yliopisto. http://jultika.oulu.fi/files/isbn9789526221120.pdf

Kunnas, Niina 2018: Viena Karelians as observers of dialect differences in their heritage language. – Marjatta Palander, Helka Riionheimo & Vesa Koivisto (eds.), On the border of language and dialect. 123–155. Studia Fennica Linguistica 21. Helsinki: Suomalaisen Kirjallisuuden Seura.

Language resources

More information

 


Researcher of the Month: Mikael Varjo

Mikael Varjo
Photo: Emmi Saari

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikael Varjo tells us about his research on zero-subject constructions in the ArkiSyn corpus containing everyday Finnish conversation.

Who are you?

I am Mikael Varjo and I am currently working as a university teacher at the University of Turku. In March 2023, I defended my doctoral thesis on zero-subject constructions, also at the University of Turku. My interests are diverse, ranging from teaching and researching Finnish as a second and foreign language to research in usage-based syntax.

What is your research topic?

In my doctoral thesis I examine zero-subject constructions (zero person in the subject position) in Finnish everyday conversation. I have extracted my data from the morphosyntactically annotated ArkiSyn corpus, which I also helped to build as a project researcher in 2015–2016 before starting my own dissertation.

Previous research on the zero person has been quite qualitatively oriented. My research aims to fill this methodological gap by combining two approaches: quantitative corpus linguistics and qualitative interactional linguistics. In my research, I examine the characteristics, variation, contexts of use, and functions of zero-subject constructions in spoken interaction. My research reveals that the grammatical and semantic features typically associated with the zero person also distinguish the subcategories of zero-subject constructions. The differences between subcategories are also linked to the tasks the constructions have in interaction. Typically, zero-subject constructions are used for expressing stance towards something that is under discussion, (joint) planning, sharing of experiences, feelings and desires, or for giving directives.

How is your research related to Kielipankki?

The ArkiSyn corpus is available in Kielipankki. In addition, Kielipankki provided important support in the early stages of my doctoral studies as I was taking my first steps in language technology, natural language processing and automatic text processing. Converting zero-subject constructions extracted from the ArkiSyn corpus into a format that was easy to process and met the needs of my dissertation required a lot of learning over the years. With the help of Kielipankki’s methodological course Corpus Clinic, I was able to get started in the autumn of 2015.

Publications

Varjo, Mikael. 2022. Greater than zero? A study of referentially open and specific necessity constructions in Finnish everyday conversation. Eesti Ja Soome-Ugri Keeleteaduse Ajakiri. Journal of Estonian and Finno-Ugric Linguistics, 13(2), 5–46. https://doi.org/10.12697/jeful.2022.13.2.01

Suomalainen, Karita & Mikael Varjo. 2020. When personal is interpersonal. Organizing interaction with deictically open personal constructions in Finnish everyday conversations. Journal of Pragmatics, 168, 98–118. https://doi.org/10.1016/j.pragma.2020.06.003

Varjo, Mikael. 2019. It Takes All Kinds to Make a Zero: Employing Multiple Correspondence Analysis to Categorize an Open Personal Construction in Conversational Finnish. Corpus Linguistics Research, 5, 55–87. https://doi.org/10.18659/clr.2019.5.03

Varjo, Mikael & Karita Suomalainen. 2018. From zero to ‘you’ and back: A mixed methods study comparing the use of two open personal constructions in Finnish. Nordic Journal of Linguistics, 41(3), 333–366. https://doi.org/10.1017/s0332586518000215

Language resources

More information

 


Researcher of the Month: Rosa González Hautamäki

Rosa González Hautamäki
Photo: Ville Hautamäki

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Rosa González Hautamäki tells us about her research on within-speaker variation and the effects of voice modifications. The AVOID corpus, which she collected in collaboration with the Computational Speech group at UEF, is a valuable resource for studying human-induced voice modifications.

Who are you?

I am Rosa González Hautamäki, a postdoctoral researcher at the Research Unit of Logopedics (RULOGO) at the University of Oulu, and a visiting researcher at the School of Humanities at the University of Eastern Finland. I hold a Ph.D. in Computer Science and maintain ongoing collaborations with the School of Computing at the University of Eastern Finland and the Human Language Technology lab at the National University of Singapore (NUS).

What is your research topic?

My research focuses on within-speaker variation in the context of speaker recognition. Speech is a complex signal that varies due to several factors, such as age, health, emotional state, and more, so it is expected that a speaker won’t utter the same phrase in exactly the same way multiple times. During my doctoral studies, I studied the effects of voice modifications on the performance of voice comparisons carried out by listeners or automatic systems. My initial research focused on mimicry and voice disguise, considering that some speakers may not be cooperative when interacting with speaker recognition systems. Our research showed that even simple techniques to disguise one’s voice could cause degradation in the performance of automatic systems, while also making the task of speaker comparison challenging for listeners.

Since then, my studies on within-speaker variation have focused on identifying the factors that impact the performance of speaker verification, including deliberate and non-deliberate voice modifications. These findings have also been important in analyzing speech in other speech technology tasks, such as speech spoofing attacks and auditory speech perception. Exploring the factors that impact system decisions can help in making them more reliable.

Currently, my research on speech analysis involves using machine learning models with data from evaluations used to identify developmental language disorders in children. I am excited to be part of a motivated group of researchers who are exploring speech and interventions that can support those working with the development of children’s speech.

How is your research related to Kielipankki?

During my doctoral research, I collaborated with the Computational Speech group at the University of Eastern Finland to collect a dataset for the study of voice disguise. Kielipankki provided crucial support by offering information necessary for the collection and preparation of the corpus, as well as for its publication as a resource. The resulting dataset, called the Age-related Voice Disguise (AVOID) corpus, contains voice recordings of Finnish speakers in their modal voice and attempting age disguise.

In one study, we used the AVOID corpus to analyze the impact of changes in selected acoustical features on automatic speaker recognition systems, and found that the difference in long-term fundamental frequency (F0) was the most detrimental factor to speaker recognition performance, even when the automatic system uses spectral features.

In another study using the AVOID corpus, we evaluated the effectiveness of age stereotypes as a voice disguise strategy in speaker comparisons. Listeners estimated both the speaker’s chronological age and the intended age (speakers attempted child and elderly voices). For female speakers, the age estimates for the disguised voices moved closer to the target age, while for male speakers, the estimates shifted in the direction of the target voice only for elderly voices.

Overall, the AVOID corpus is a valuable resource for studying human-induced voice modifications, and we expect that further research will help make systems more robust to disguised voices.

Publications

González Hautamäki, R., Hautamäki, V., and Kinnunen, T. (2019). ”On Limits of Automatic Speaker Verification: Explaining Degraded Recognizer Score Through Acoustic Changes Resulting from Voice Disguise”, The Journal of the Acoustical Society of America 146, 693. https://doi.org/10.1121/1.5119240

González Hautamäki, R., Sahidullah, Md., Hautamäki, V., and Kinnunen, T. (2017). ”Acoustical and perceptual study of voice disguise by age modification in speaker verification”, Speech Communication, Volume 95, Pages 1-15, https://doi.org/10.1016/j.specom.2017.10.002

González Hautamäki, R., Sahidullah, Md., Kinnunen, T., and Hautamäki, V. (2016). ”Age-Related Voice Disguise and its Impact in Speaker Verification Accuracy”, Odyssey: The Speaker and Language Recognition Workshop, Bilbao, Spain, pages 277-282, http://dx.doi.org/10.21437/Odyssey.2016-40

González Hautamäki, R., Kanervisto, A., Hautamäki, V., and Kinnunen, T. (2018). ”Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification”, Odyssey: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, pages 320-326, http://dx.doi.org/10.21437/Odyssey.2018-45

Language resources

More information

 

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.


Researcher of the Month: Johanna Vaattovaara

Johanna Vaattovaara
Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Johanna Vaattovaara tells us about her research projects on language awareness and language attitudes.

Who are you?

I am Johanna Vaattovaara, professor of Finnish language in the Languages Unit at the Faculty of Information Technology and Communication Sciences, Tampere University.

What is your research topic?

My research falls within sociolinguistics and language ideology research, mainly the study of language awareness and attitudes. I have also done research on linguistic variation and language change, and for these topics various corpora have proven to be very valuable resources. Corpora have also been useful in the creation of language attitude study designs. In recent years, for example, I have used the Suomi24 corpus in various ways in studies where I have investigated, together with Elizabeth Peterson and also with Ylva Biri and Turo Hiltunen, the integration of English expressions into Finnish language use.

How is your research related to Kielipankki?

So far, I have used the Suomi24 corpus in Kielipankki, especially Suomi24 2016H2. Currently, I am launching the research project Arkisuomien kielitietoisuudet ja muutos (Societal awareness of linguistic variation and change), funded by the Kone Foundation (2023–25). During the project, we will collect language awareness and attitude data using different methods, such as a nationwide survey, and we plan to distribute the data via Kielipankki.

In the past, I have distributed data through the archives of the Institute for the Languages of Finland (Kotus). The data that I collected for my dissertation is also available from Kotus. It consists of interviews with a group of high school graduates in Pello, Tornionlaakso (Torne Valley). In the post-doc phase, I collected reaction and interview data in the lobby of the Finnish Science Centre, Heureka, in the project Helsingin suomea – monimuotoisuus, sosiaalinen identiteetti ja kielelliset asenteet kaupunkiympäristössä, led by Marja-Leena Sorjonen and funded by the Academy of Finland in 2009–2012. This corpus of metalinguistic material can also be obtained from Kotus.

Publications

Peterson, E., Hiltunen, T., Vaattovaara, J. 2022. A place for pliis in Finnish: A discourse-pragmatic variation account of position. – Elizabeth Peterson, Turo Hiltunen & Joseph Kern (eds.), Discourse-Pragmatic Variation and Change: Theory, Innovations, Contact, pp. 272–292. Cambridge University Press. DOI: 10.1017/9781108864183.015

Peterson, E., Biri, Y., Vaattovaara, J. 2022. Grammatical and social structures of English-sourced swear words in Finnish discourse. – Martín-Solano, R. & San Segundo, R. (eds.), Corpus linguistics and Anglicisms, pp. 49–70. Peter Lang Publishing. DOI: 10.3726/b19222

Vaattovaara, J. & Peterson, E. 2019. Same old paska or new shit? On the stylistic boundaries and social meaning potentials of a loanword in Finnish. – Ampersand 6/2019 (Special Issue, E. Zenner, A. Calude & L. Rosseel (eds.), Lexical borrowing as expression of culture, identity and attitude – empirical investigations into the social meaning potential of loanwords.) DOI: 10.1016/j.amper.2019.100057

Vaattovaara, J. 2012. Spatial concerns for the study of social meaning of linguistic variables – an experimental approach. – Hanna Lehti-Eklund, Camilla Lindholm & Caroline Sandström (eds.), Folkmålsstudier : Meddelanden från Föreningen för Nordisk Filologi 2012/50, pp. 175–209. https://journal.fi/folkmalsstudier/article/view/82136

Nuolijärvi, Pirkko & Vaattovaara, Johanna 2011. De-standardisation in progress in Finnish society? – T. Kristiansen & N. Coupland (eds.), Standard Languages and Language Standards in a Changing Europe, pp. 67–74. Oslo: Novus Forlag. http://omp.novus.no/index.php/novus/catalog/view/3/5/163

Vaattovaara, Johanna 2009. Meän tapa puhua: Tornionlaakso pellolaisnuorten subjektiivisena paikkana ja murrealueena. Helsinki: Suomalaisen Kirjallisuuden Seura (304 pp.). Suomalaisen Kirjallisuuden Seuran toimituksia 1224. http://urn.fi/URN:ISBN:978-952-222-100-1



Researcher of the Month: Noora Hoffrén

Noora Hoffrén
Photo: Essi Ekman

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Noora Hoffrén tells us about her PhD research on constructed action in Finnish Sign Language and Finnish.

Who are you?

I am Noora Hoffrén, a sign language interpreter and a doctoral researcher. I am working on my PhD thesis at the Sign Language Centre (SLC) in the Department of Language and Communication Studies at the University of Jyväskylä.

What is your research topic?

The topic of my dissertation is showing by enacting, i.e. constructed action. When a speaker or signer is immersed in the role of another character and displays the character’s thoughts, speech, emotions or actions, he or she is constructing action. Constructed action is not always obvious or overt. Often, especially in signed languages, it is so closely integrated into the language that it is hard to discern. In my research, I am studying constructed action in both Finnish Sign Language and Finnish. My dissertation is part of the ongoing ShowTell project at the University of Jyväskylä.

How is your research related to Kielipankki?

As my research data, I will use the Corpus of Finnish Sign Language, part of which is already available for download in Kielipankki (CFINSL). In addition to videos that are recorded from multiple angles, the database contains basic annotations and metadata. The fact that such a corpus exists allows us to study constructed action in the best possible way.

My aim is to collect a video corpus of spoken Finnish, parallel to the Finnish Sign Language material, and to deposit the corpus in Kielipankki. The Finnish video corpus will be collected in pairs from six native speakers of Finnish. The methods that are used to collect the material will be similar to those used to collect the Finnish Sign Language corpus, for example, using multiple cameras during filming sessions and using the same elicitation materials (e.g. ’The Snowman’ and ’Frog, Where Are You?’ picture books).

Publications

Hoffrén, Noora 2019. Kuvailevien viittomien ja konstruoidun toiminnan yhteispeli. Master’s thesis. University of Jyväskylä. Available: http://urn.fi/URN:NBN:fi:jyu-201910144419


Contact information

Technical maintenance of Kielipankki:
kielipankki (at) csc.fi
tel. 09 4572001

Matters related to resources and other content:
fin-clarin (at) helsinki.fi
tel. 029 4129317

More detailed contact information