Researcher of the Month: Heidi Niva

Photo: Emmi Pollari

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Heidi Niva tells us about her research on Finnish grammatical phenomena and introduces a Vepsian-Finnish dictionary project. In a joint study, she also aims to evaluate the corpus of online discussions as a source for language researchers.

Who are you?

I am Heidi Niva, a postdoctoral researcher of the Finnish language. I am currently a substitute lecturer of Finnish language and culture at the University of Helsinki. I am also actively involved in the LOST DOC collective, a community for postdoc language researchers.

What is your research topic?

Both in my dissertation and afterwards, grammatical phenomena have been the focus of my research. Among other things, I have studied the structures that are used to express futurity in Finnish. Now I am involved in a joint project where we study the structures expressing avertivity, i.e. non-realization of events. I am also working on a project where we aim to compile a Vepsian-Finnish dictionary. Vepsian, also known as Veps, is an endangered language related to Finnish and spoken south of Lake Onega (Ääninen). In addition to the dictionary project, I am also doing research on adpositional structures in the Veps language.

How is your research related to Kielipankki?

In my research on Finnish grammar, I am less interested in normativity than in how people actually use linguistic structures and what types of meanings and connotations these structures can convey. For this purpose, I have used the resources in Kielipankki: the Suomi24 Sentences Corpus 2001-2020 for the study of Modern Finnish, and the corpora of Early Modern Finnish and Old Literary Finnish for the study of the older forms of the language. I am also currently using the Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s and the Finnish News Agency Archive Corpus.

In fact, the Suomi24 Sentences Corpus 2001-2020 is itself the subject of our joint research with Max Wahlström and Olli Silvennoinen. What is interesting about this corpus is that it largely represents informal language use but still differs from spoken language in its linguistic features. In addition, the corpus is a diverse source in terms of the formality of language use and the occurrence of linguistic phenomena, as these seem to be influenced by the various topics of discussion and their styles of expression. In our forthcoming article, we will critically examine what kind of source the Suomi24 corpus actually is for a language researcher.

Publications

Niva, Heidi 2022: Suomen progressiivirakenne intentioiden ja ennakoinnin ilmaisuissa. Helsinki: Helsingin yliopisto. Available: http://urn.fi/URN:ISBN:978-951-51-8727-7

Niva, Heidi 2024: Tulen muistamaan hänet aina. Tulla V-mAAn vääjäämättömän tulevaisuuden ilmaisukeinona. Virittäjä 128(2), 238–263. DOI: 10.23982/vir.126878

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers of Social Sciences and Humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Krister Lindén

Photo: Juhani Jokinen

Kielipankki – The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Krister Lindén, the Director of the Language Bank, describes how researchers in Humanities can benefit from the use of artificial intelligence in their corpus-based research.

Who are you?

I am Krister Lindén. At the University of Helsinki, I am Research Director for Language Technology at the Department of Digital Humanities, and Deputy Team Leader at the Centre of Excellence for Ancient Near Eastern Empires. For national research infrastructures, I am the Director of the Language Bank of Finland, the National Coordinator of FIN-CLARIN, and the PI of FIN-CLARIAH. At the EU level, I am Chair of the National Coordinators Forum of CLARIN, a research infrastructure for the humanities and social sciences, and a member of the CLARIN Legal Issues Committee (CLIC).

What is your research topic?

I have always been interested in language technology and its application and, due to my involvement in the Language Bank, increasingly also in the prerequisites for developing and applying technology:

How can we use data to answer a broad range of research questions in the humanities and social sciences?
Where can we obtain development and test data to develop and evaluate our data processing methods?
Under what conditions can data be shared with other researchers so that they can verify the proclaimed performance of the methods?

An independent evaluation of methods is important to ensure progress and to make sure that we find the best methods in each case. If only a preliminary evaluation is needed, and a small-scale experiment is sufficient, you can give ChatGPT a few examples to see how it copes with the task. If there is insufficient data to reliably use a statistical method, and the task requires a high-precision method, it may be quicker to use manually developed methods. On the other hand, if there is enough data, a suitable machine learning method is available, and the performance of the processing environment is sufficient, this combination often provides the most reproducible development path.

All the above development paths are data-driven and require data to be shared with other researchers for replication. In previous years, there has been strong enthusiasm for completely open-source datasets. While this is still a desirable goal, there are many datasets that, for one reason or another, cannot be made available to everyone. Gradually, our community of researchers, together with lawmakers, has succeeded in developing a legal framework for data access that is open enough for academic researchers to study the data and verify the results in a relatively straightforward way, while keeping the data accessible to an audience small enough not to put personal data at risk or infringe copyrights.

A new development need is to create a method that lets researchers in the humanities and social sciences discuss with an AI the content of the datasets they deposit in the Language Bank.

How is your research related to Kielipankki?

The Language Bank provides both a platform for tool development and an opportunity to show how different types of research-oriented datasets can be shared with other researchers in a safe and legal way.

Recent publications

Jauhiainen, T., Zampieri, M., Baldwin, T. C., & Linden, K. (2024). Automatic Language Identification in Texts. (Synthesis Lectures on Human Language Technologies). Springer. https://doi.org/10.1007/978-3-031-45822-4

Jauhiainen, T., Piitulainen, J., Axelson, E., Dieckmann, U., Lennes, M., Niemi, J., Rueter, J., & Linden, K. (2024). Investigating Multilinguality in the Plenary Sessions of the Parliament of Finland with Automatic Language Identification. In D. Fišer, M. Eskevich, & D. Bordon (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): ParlaCLARIN IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (pp. 48-56). (International conference on computational linguistics), (LREC proceedings). European Language Resources Association (ELRA). https://researchportal.helsinki.fi/files/312866811/ArtikkeliJulkaistu.pdf

Sahala, A., & Linden, K. (2023). BabyLemmatizer 2.0 – A Neural Pipeline for POS-tagging and Lemmatizing Cuneiform Languages. In A. Anderson, S. Gordin, B. Li, Y. Liu, & M. C. Passarotti (Eds.), Proceedings of the Ancient Language Processing Workshop associated with the 14th International Conference on Recent Advances in Natural Language Processing, RANLP 2023 (pp. 203-212). INCOMA. https://aclanthology.org/2023.alp-1.23

Linden, K., Niemi, J., & Kontino, T. (Eds.) (2023). CLARIN Annual Conference Proceedings 2023. (CLARIN Annual Conference Proceedings). CLARIN ERIC. https://researchportal.helsinki.fi/files/298353929/CE-2023-2328_CLARIN2023_ConferenceProceedings.pdf

Lindén, K., Ruokolainen, T., Hämäläinen, L., & Harviainen, J. T. (2023). Ethically Archiving a Hard-to-Access Massive Research Data Set in the Language Bank of Finland: The Finnish Dark Web Marketplace Corpus (FINDarC). In M. M. Rantanen, S. Westerstrand, O. Sahlgren, & J. Koskinen (Eds.), Proceedings of the Conference on Technology Ethics 2023 – Tethics 2023 (pp. 114-131). (CEUR Workshop Proceedings; Vol. 3582). CEUR-WS.org. https://researchportal.helsinki.fi/files/295005165/FP_10.pdf

Kamocki, P., Linden, K., Puksas, A., & Kelli, A. (2023). EU Data Governance Act: Outlining a Potential Role for CLARIN. In T. Erjavec, & M. Eskevich (Eds.), Selected papers from the CLARIN Annual Conference 2022 (pp. 57-65). (Linköping Electronic Conference Proceedings; No. 198). CLARIN ERIC. https://doi.org/10.3384/ecp198006

Linden, K., Jauhiainen, T., & Hardwick, S. (2023). FinnSentiment: A Finnish Social Media Corpus for Sentiment Polarity Annotation. Language Resources and Evaluation, 57(2), 581-609. https://doi.org/10.1007/s10579-023-09644-5

Axelson, E., Hardwick, S., & Linden, K. (2023). HFST Training Environment and Recent Additions. In A. Hurskainen, K. Koskenniemi, & T. P. (Eds.), Rule-Based Language Technology (pp. 60-69). (NEALT Monograph Series; No. 2[1]). Northern European Association for Language Technology. http://hdl.handle.net/10062/89595

Researcher of the Month: Juraj Šimko

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juraj Šimko tells us about his research on speech articulation and prosody. The Phonetics and Speech Synthesis Research Group at the University of Helsinki also aims to use large language models for finding answers to certain theoretical questions related to speech.

Who are you?

I am a University Lecturer in Phonetics and have been working at the University of Helsinki since 2013. Prior to that, I studied and worked at several universities in Slovakia, Ireland and Germany, and I spent several years as a Language Specialist at Microsoft. I currently also hold an Honorary Professorship at the Indian Institute of Technology in Guwahati. My background is in Maths, Cognitive Science and Phonetics.

I am a member of the Phonetics and Speech Synthesis Research Group at the Department of Digital Humanities, but I am currently also involved in an ERC Advanced grant (to Professor Alice Turk) called Planning the Articulation of Spoken Utterances at the University of Edinburgh, where we investigate and model cognitive processes behind speech production and articulation.

What is your research topic?

I am passionate about human speech research. Besides speech articulation, my own main research interest, as well as our Group’s, is speech prosody: essentially, all those melodic, rhythmic and emotional aspects of speech that go beyond the linguistic message that we pass on when we speak. In our current project, Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis, we are working on a novel speech synthesis architecture inspired by Predictive Processing, an influential theoretical and modelling paradigm of human cognition. Of course, the first obvious aim is to produce world-class speech synthesis, and our team has indeed been creating state-of-the-art Finnish and Finland Swedish synthesis systems. But we also want to treat the huge language models that drive technological applications as statistical representations of the speech material used for their training, and use them to answer theoretical questions related to speech. These questions concern, among other things, the distribution and evolution of accents and dialects, the relationship between sociolinguistics and prosody, and prosodic patterns in politicians’ parliamentary speeches.

How is your research related to Kielipankki?

In order to do all that, we need quite a lot of data. Some of it we create ourselves, with invaluable assistance from Kielipankki experts: we have designed and recorded the FinSyn corpus of high-quality speech material intended for speech technology applications, primarily for speech synthesis. The corpus contains ~75 hours of studio-quality recordings from three voice talents, two of them speaking Finnish and one Finland Swedish. This corpus will appear as part of the Kielipankki collection. Our work on dialects and sociolinguistics relies heavily on other Kielipankki corpora, primarily the groundbreaking Donate Speech (Lahjoita puhetta) Corpus and the Aalto Finnish Parliament ASR Corpus.

Recent publications

Törö, T., Suni, A. and Šimko, J. (2024). Analysis of regional variants in a vast corpus of Finnish spontaneous speech using a large-scale self-supervised model, Proceedings of Speech Prosody 2024, Leiden, Netherlands.

Vainio, M., Suni, A., Šimko, J. and Kakouros, S. (2024). The Power of Prosody and Prosody of Power: An Acoustic Analysis of Finnish Parliamentary Speech, Proceedings of Speech Prosody 2024, Leiden, Netherlands.

Elie, B., Šimko, J. and Turk, A. (2024). Optimization-based modeling of Lombard speech articulation: Supraglottal characteristics. JASA Express Letters, 4(1). https://doi.org/10.1121/10.0024364

Kakouros, S., Šimko, J., Vainio, M. and Suni, A. (2023). Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody, Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France. https://doi.org/10.21437/SSW.2023-20

Šimko, J., Törö, T., Vainio, M. and Suni, A. (2023). Prosody under control: Controlling prosody in text-to-speech synthesis by adjustments in latent reference space, Proceedings of the 18th International Congress of Phonetic Sciences, Prague, Czech Republic. http://hdl.handle.net/10138/565382

Šimko, J., Adigwe, A., Suni, A. and Vainio, M. (2022). A Hierarchical Predictive Processing Approach to Modelling Prosody, Proceedings of the 11th International Conference on Speech Prosody, Lisbon, Portugal. https://doi.org/10.21437/SpeechProsody.2022-86

Researcher of the Month: Lotta Leiwo

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Lotta Leiwo tells us about her research in folkloristics, digging into the life and work of Finnish-American T-Bone Slim.

Who are you?

I am Lotta Leiwo, a doctoral researcher at the University of Helsinki, where I am studying for a PhD in history and cultural heritage. My dissertation in Folklore Studies examines the political role and nature-related rhetoric of Finnish-American women in the Finnish Socialist Federation (FSF) in the early 20th century. My main research data consists of FSF documents and a newspaper called Toveritar. Toveritar, a mouthpiece of the FSF, targeted women and was edited and written mainly by women.

Prior to my doctoral project, I worked for two years as a research assistant on the project T-Bone Slim and the transnational poetics of the migrant left in North America (Kone Foundation 2022–2023). My main responsibility in this international project was the construction of the T-Bone Slim corpus and database. During the project, I wrote my Master’s thesis on Finnish socialist women in North America and found the topic for my dissertation.

What is your research topic?

In the T-Bone Slim project, an international research team studied the life and literary works of the second-generation Finnish American Matti Valentinpoika Huhta (1882–1942), also known as T-Bone Slim. Huhta was born in Ashtabula, Ohio, to a Finnish family that had emigrated from Kälviä, Central Ostrobothnia. He spent his childhood and youth in Finnish communities in the US, working as a dock worker and as a correspondent for the local chapter of the temperance movement. In the 1910s, Huhta abandoned his family and took up life as a ’hobo’ or itinerant worker. By the 1920s, Huhta had become radicalised, joining the Industrial Workers of the World (IWW) and becoming a columnist for IWW newspapers and periodicals. He continued his writing career under the pen name T-Bone Slim until his death. Huhta lived his last years in New York, where he worked as a deck scow captain. In May 1942, he was found drowned in New York’s East River, and he was almost forgotten for several decades. For further exploration of the unresolved questions surrounding T-Bone Slim’s death, please visit our project blog and read Saku Pinta’s two-part text ”Who Killed T-Bone Slim”, Part I and Part II.

In the late 2010s, musician John Westmoreland, a relative of Slim’s, discovered his ”Uncle Matt’s” writing career as T-Bone Slim. Around the same time, academic interest in Slim, who had a Finnish background, began to grow, and his relatives and researchers found each other over T-Bone Slim Studies. The research continued in a project funded by the Kone Foundation, which brought together John Westmoreland and scholars from Finland, the UK, the US, Canada, and Australia. Kirsti Salmi-Niklander is the Principal Investigator of the project. We brought together the T-Bone Slim materials that the researchers had gathered from various archives and organized them into a corpus to enhance accessibility for others interested in the subject. In total, the materials come from 14 archives across three continents and five countries – the United States, Canada, Finland, Sweden and Australia.

The corpus encompasses a total of 1294 texts written by T-Bone Slim and published in English in IWW periodicals. Slim also wrote in Finnish on occasion and sometimes used Swedish. In addition, the corpus includes the surviving manuscripts written by Slim.

The texts written by T-Bone Slim are a gold mine for researchers. Slim used language cleverly, combining different genres and means of expression. In addition, the historical, literary and cultural references found in the texts provide an opportunity to examine the IWW movement, transnational migration and history in the United States from diverse perspectives. The language employed in the texts is rich, insightful, and even playful, and may be of interest to linguists. As the material comprises both published and unpublished texts, it offers insights into both the editorial processes of political publishing and the writing practices of an individual author.

Within the framework of the project, I have examined the literary practices and literacy acquisition of Finnish migrant-settlers, as well as Slim’s use of genres, from a semiotic perspective. Notably, Slim’s texts exhibit multilingualism in both background and content, incorporating intertextuality and multimodality across various genres and oral-literary practices. Such practices are evident, for example, in his song lyrics. In typical IWW style, Slim wrote lyrics addressing social injustices to popular song tunes known to readers. The lyrics were thus written to be sung, with the aim of provoking the reader/singer to reflect on the message of the lyrics. As Owen Clayton, a collaborator on our project, has observed, T-Bone Slim sought to activate and engage readers through language and words. I, too, am continually amazed and delighted by Slim’s skilful written expression.

How is your research related to Kielipankki?

In the early stages of the project, we thought long and hard about a suitable repository for the T-Bone Slim corpus and database. Our priority was to find a long-term storage solution for the materials that would ensure the materials’ widespread accessibility. Equally important was the need for the corpus to be explored and analysed through digital humanities methods.

The T-Bone Slim corpus and database will be published in April 2024 in Kielipankki, which fulfills all our storage and access requirements. The collection consists of photographic and microfilm scans of the original materials (newspapers, periodicals and manuscripts) with transcriptions and a database. The database includes all the texts in the corpus accompanied by metadata (date of publication, publication, title of the text, archive from which the material was collected, language, etc.). Additionally, we have experimented with abstracting data for a subset of the materials. For example, the people and places mentioned by T-Bone Slim and information about the poems or songs contained in the texts are listed in the abstracted data. The purpose of the database is to facilitate data navigation and serve as a foundation for more detailed abstraction of the data by other researchers.

T-Bone Slim Corpus and Database Launching Event

Welcome to the Resurrection – T-Bone Slim Corpus and Database Launching Event on Monday May 20th, 2024 at 15:00–17:00. The launching event is open to the public and the program can be followed both via Zoom and on-site at the Finnish Literature Society (Hallituskatu 1, Helsinki). More information and registration for remote participants.

Publications

Apajalahti, Eeva-Lotta et al. (2022). ”Ihmistieteelliset näkökulmat metsiin tuottavat tietoa moninaisista metsäsuhteista ja niiden tulevaisuuksista.” Vuosilusto 14(2022): 13–51. Available: https://lusto.fi/wp-content/uploads/2022/12/Lusto-Vuosilusto14.pdf.

Leiwo, Lotta (2024). ”When One’s Life Becomes the Field. Assessing the Field in Collaborative Autoethnography.” Marburg Journal of Religion 25(1). https://doi.org/10.17192/mjr.2024.25.8693.

Leiwo, Lotta (2023). ”Luontokin näkyy olevan köyhälistöä vastaan” Luonto kolmantena tilana Toveritar-lehden paikkakuntakirjeissä 1916–1917. Master’s thesis. Helsinki: University of Helsinki. http://urn.fi/URN:NBN:fi:hulib-202305302306.

Leiwo, Lotta (2023). ”Suomen koloniaalin osallisuuden kontekstit haltuun: Hoegaerts, Josephine, Tuire Liimatainen, Laura Hekanaho ja Elizabeth Peterson (toim.). 2022. Finnishness, Whiteness and Coloniality.” Elore, 30(2), 142–147. Book review. https://doi.org/10.30666/elore.137470.

Mäkelä, Heidi Henriikka, Leiwo, Lotta, Linkola, Hannu and Rinne, Jenni (2023). ”The spiritual forest: an ethnographic exploration on Finnish forest yoga and the forest landscape.” Landscape Research. https://doi.org/10.1080/01426397.2023.2268550.

Corpora

T-Bone Slim Corpus, source (Kielipankki)

T-Bone Slim Corpus, Westmoreland materials (Metashare)

Entries from the Research Project’s Blog

Leiwo, Lotta (2023). ”T-Bone Slim Database – Final Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. 18.12.2023. https://blogs.helsinki.fi/tboneslim/2023/12/18/t-bone-slim-database-final-steps/.

Leiwo, Lotta (2023). ”T-Bone Slim Database – Next Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 22.6.2023. https://blogs.helsinki.fi/tboneslim/2023/06/22/t-bone-slim-database-next-steps/.

Salmi-Niklander, Kirsti (2023). ”’T-Bone Slim’ eli Matti V. Huhta ajatteli ja kirjoitti kahdella kielellä kulkurielämästä ja työläisten oikeuksista” ’Vähäisiä lisiä’ Blog. Published 12.5.2023. https://www.finlit.fi/ajankohtaista/blogi/t-bone-slim-eli-matti-v-huhta-ajatteli-ja-kirjoitti-kahdella-kielella-kulkurielamasta-ja-tyolaisten-oikeuksista/.

Clayton, Owen (2023). ”Technocracy and T-Bone Slim’s Break with Ralph Chaplin” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 1.3.2023. https://blogs.helsinki.fi/tboneslim/2023/03/01/technocracy-and-t-bone-slims-break-with-ralph-chaplin/.

Dalbello, Marija (2022). ”From my Archival ‘Digs’, part I. Finding Slim!” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 12.12.2022. https://blogs.helsinki.fi/tboneslim/2022/12/12/finding-slim/.

Pinta, Saku (2022). ”T-Bone Slim’s Forgotten Finnish-Language Writings in the IWW Press” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 20.10.2022. https://blogs.helsinki.fi/tboneslim/2022/10/20/t-bone-slims-forgotten-finnish-language-writings-in-the-iww-press/.

Leiwo, Lotta (2022). ”T-Bone Slim Database – First Steps.” ’T-Bone Slim and the transnational poetics of the migrant left in North America’ Research Project’s Blog. Published 5.10.2022. https://blogs.helsinki.fi/tboneslim/2022/10/05/t-bone-slim-database-first-steps/.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Harri Uusitalo

Photo: Timo Tuovinen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Harri Uusitalo tells us about his research using various types of Finnish-language corpora from different time periods.

Who are you?

I am Harri Uusitalo, a postdoctoral researcher at the University of Turku. I am a researcher of the Finnish language, and I am currently working at the School of History, Culture and Arts Studies on the interdisciplinary projects Fauna et Flora Fennica and Disappeared, Endangered and Newly Arrived Species: The Human Relationship with the Changing Biodiversity of the Baltic Sea. In these research groups, we examine the historical relationship of the Finnish people with nature.

What is your research topic?

I have studied Finnish texts from different periods, from the time of Agricola to the present day. My doctoral thesis focused on the legal language of the 17th century, and more recently I have been fascinated by environmental themes and ecolinguistic perspectives.

How is your research related to Kielipankki?

Together with my colleagues, I have used the Kielipankki data in some of my research. For example, Karita Suomalainen and I used the Suomi24 corpus and the Korp tool to investigate how Finnish people identify and discuss invasive alien species. With Duha Elsayed and Heidi Salmi, I used the Morpho-Syntactic Database of Mikael Agricola’s Works to study the translative form of the A-infinitive in Agricola’s works.

In my future research, I will certainly make use of many other corpora in Kielipankki, such as the Corpus of Old Literary Finnish, the Corpus of Early Modern Finnish and the Newspaper and Periodical Corpus of the National Library of Finland.

Publications

Uusitalo Harri, Lähdesmäki Heta, Sonck-Rautio Kirsi, Latva Otto, Salmi Hannu & Alenius Teija (forthcoming): Alien Plants between Practices and Representations: the Cases of European Spruce and Beach Rose in Finland. Plant Perspectives.

Uusitalo Harri & Suomalainen Karita 2023: Ecolinguistic Approach to Online Finnish Discourse on Invasive Alien Species. Language@Internet 21. https://www.languageatinternet.org/articles/2023/uusitalo

Elsayed Duha, Salmi Heidi & Uusitalo Harri 2022: A-infinitiivin translatiivi Mikael Agricolan teksteissä. Sananjalka 64. Suomen Kielen Seura, Turku. DOI: 10.30673/sja.107377

Corpora and tools

The Korp tool

The Suomi24 resource group

The Morpho-Syntactic Database of Mikael Agricola’s Works

The Corpus of Old Literary Finnish (VKS)

The Corpus of Early Modern Finnish (VNSK)

The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version (KLK)

More information

Fauna et Flora Fennica (FaFFe) project

Disappeared, Endangered and Newly Arrived Species: The Human Relationship with the Changing Biodiversity of the Baltic Sea (HumBio) project

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Tanja Säily

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tanja Säily tells us about her research on the English language, which combines corpus linguistics, digital humanities and historical sociolinguistics.

Who are you?

I am Tanja Säily, Assistant Professor in English Language at the University of Helsinki.

What is your research topic?

I study variation and change in the English language from a sociolinguistic perspective. My research combines corpus linguistics, digital humanities and historical sociolinguistics. I frequently collaborate with other linguists and historians, and I develop new methods with data scientists and language technologists. I analyse sociolinguistic variation especially in linguistic productivity, such as the use of neologisms. I have also studied gendered styles and factors influencing the rate of language change.

How is your research related to Kielipankki?

In my research, I use English text corpora, which I have also deposited in Kielipankki for myself and others to use. I am currently studying the productivity of various linguistic constructions in the Corpus of Historical American English (e.g. Säily & Vartiainen, forthcoming). I have been using this corpus with the Korp tool and have also downloaded it to my own computer.

I have prepared openly available teaching materials on the methods of historical corpus linguistics for graduate students and other interested parties. They are included in the Method Bank for Linguistics, and the Early Modern English section of the Helsinki Corpus of English Texts used in the exercises can be found in Kielipankki.

Publications

Here are a few of my most recent publications; the entire list can be found at https://tanjasaily.fi/publications/

Accepted. Säily, Tanja, Martin Hilpert & Jukka Suomela. New approaches to investigating change in derivational productivity: Gender and internal factors in the development of -ity and -ness, 1600–1800. Patricia Ronan, Theresa Neumaier, Lisa Westermayer, Andreas Weilinghoff & Sarah Buschfeld (eds.), Crossing boundaries through corpora: Innovative approaches to corpus linguistics (Studies in Corpus Linguistics). Amsterdam: John Benjamins.

Accepted. Säily, Tanja & Turo Vartiainen. Historical linguistics. Michaela Mahlberg & Gavin Brookes (eds.), Bloomsbury handbook of corpus linguistics. London: Bloomsbury.

Accepted. Säily, Tanja, Turo Vartiainen, Harri Siirtola & Terttu Nevalainen. Changing styles of letter-writing? Evidence from 400 years of early English letters in a POS-tagged corpus. Luisella Caon, Moragh Gordon & Thijs Porck (eds.), Unlocking the history of English: Pragmatics, prescriptivism and text types (Current Issues in Linguistic Theory). Amsterdam: John Benjamins.

2023. Landert, Daniela, Tanja Säily & Mika Hämäläinen. TV series as disseminators of emerging vocabulary: Non-codified expressions in the TV Corpus. ICAME Journal 47(1): 63–79. DOI: 10.2478/icame-2023-0004

2022. Rodríguez-Puente, Paula, Tanja Säily & Jukka Suomela. New methods for analysing diachronic suffix competition across registers: How -ity gained ground on -ness in Early Modern English. International Journal of Corpus Linguistics 27(4): 506–528. Special issue, Corpus studies of language through time, ed. by Tony McEnery, Gavin Brookes & Isobelle Clarke. DOI: 10.1075/ijcl.22014.rod

2021. Säily, Tanja, Eetu Mäkelä & Mika Hämäläinen. From plenipotentiary to puddingless: Users and uses of new words in early English letters. Mika Hämäläinen, Niko Partanen & Khalid Alnajjar (eds.), Multilingual Facilitation, 153–169. Helsinki: University of Helsinki. DOI: 10.31885/9789515150257.15

2020. Mäkelä, Eetu, Krista Lagus, Leo Lahti, Tanja Säily, Mikko Tolonen, Mika Hämäläinen, Samuli Kaislaniemi & Terttu Nevalainen. Wrangling with non-standard data. Sanita Reinsone, Inguna Skadiņa, Anda Baklāne & Jānis Daugavietis (eds.), Proceedings of the Digital Humanities in the Nordic Countries 5th Conference, Riga, Latvia, October 21–23, 2020 (CEUR Workshop Proceedings 2612), 81–96. Aachen: CEUR-WS.org. DHN 2020 Best Paper Award. http://ceur-ws.org/Vol-2612/paper6.pdf

2020. Nevalainen, Terttu, Tanja Säily, Turo Vartiainen, Aatu Liimatta & Jefrey Lijffijt. History of English as punctuated equilibria? A meta-analysis of the rate of linguistic change in Middle English. Journal of Historical Sociolinguistics 6(2): article 20190008. Special issue, Comparative Sociolinguistic Perspectives on the Rate of Linguistic Change, ed. by Terttu Nevalainen, Tanja Säily & Turo Vartiainen. DOI: 10.1515/jhsl-2019-0008

2019. Hill, Mark J., Ville Vaara, Tanja Säily, Leo Lahti & Mikko Tolonen. Reconstructing intellectual networks: From the ESTC’s bibliographic metadata to historical material. Costanza Navarretta, Manex Agirrezabal & Bente Maegaard (eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, Copenhagen, Denmark, March 6–8, 2019 (CEUR Workshop Proceedings 2364), 201–219. Aachen: CEUR-WS.org. DHN 2019 Best Paper Award. http://ceur-ws.org/Vol-2364/19_paper.pdf

2018. Säily, Tanja. Change or variation? Productivity of the suffixes -ness and -ity. Terttu Nevalainen, Minna Palander-Collin & Tanja Säily (eds.), Patterns of Change in 18th-century English: A Sociolinguistic Approach (Advances in Historical Sociolinguistics 8), 197–218. Amsterdam: John Benjamins. DOI: 10.1075/ahs.8

Corpora and teaching materials

The Corpus of Historical American English (COHA)

Helsinki Corpus of English Texts, Early Modern English section

Historical Corpus Linguistics (The Method Bank for Linguistics)

More information

Homepage: https://tanjasaily.fi

ORCID: https://orcid.org/0000-0003-4407-8929

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Liisa Mustanoja

Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Liisa Mustanoja tells us about her research on sociolinguistics. With the help of a longitudinal corpus, it is possible to observe changes in the spoken language of the same people at different points in time.

Who are you?

I am Liisa Mustanoja, PhD, from Tampere. I work as a University Lecturer of Finnish Language in the Unit of Languages at the Faculty of Information Technology and Communication, University of Tampere. From January 2024, I will be the Head of the Unit of Languages for the next five years. I am also an Associate Professor of Finnish at the University of Oulu, specialising in sociolinguistics.

What is your research topic?

So far, all my research fits under the large umbrella of sociolinguistics. I am interested in the relationship between language and society, especially in all forms of change, upheaval and movement. In my doctoral research, I examined the change of the spoken language of Tampere at the level of the idiolect. This was a so-called real-time panel survey, in which I examined the language of the same people in the light of two points in time. Later, together with my colleagues, I have extended the study to the spoken language of Helsinki, and we have also included a third time point. The focus has largely been on the phonetic and formal structure of the language, but the data has also allowed for a sociophonetic approach. In one article, for example, we investigated changes in pitch over time.

In addition to the path of variation studies, I am interested in the interface between spoken and written language, and this has provided me with another research direction, namely the study of letter writing. I have investigated – both on my own as well as together with Finnish language students – the correspondence during the Second World War. As there was no other means of communication during the war, everyone took up their pen, regardless of age, profession or educational background. Although this correspondence resource is old, it has provided essential insights into the importance of human contact in times of crisis, as well as into everyday life and humanity in the midst of world turmoil.

How is your research related to Kielipankki?

For some time now, Kielipankki has made accessible the Longitudinal Corpus of Finnish Spoken in Helsinki, which has provided me and my colleagues with an important source of data for studying language change. This corpus will hopefully be joined in the coming months by a little sister, the Longitudinal data of Tampere spoken language. Previously, recordings of the spoken language of Tampere had been made in the 1970s and 1990s. In 2019, I started a third round of data collection in Tampere, which has been continued by students up to the present day. Thanks to the funding I received from FIN-CLARIN, I have also been able to hire some temporary help to work on the material. Everything is now in place, except for the final paperwork. The transfer and archiving of personal speech data has its own complications, but Kielipankki is by far the best possible repository for this valuable longitudinal data. On the eve of handing over the material, it feels like there should be more material and it should be more complete, and the transcripts should be revised countless more times. But really, every little addition to Kielipankki is a great gift to the research community. And once even a part of the resource is opened up, others also have the opportunity to join the transcription work if they want to!

Among the resources in Kielipankki, I would also like to mention the Suomi24 Corpus, which is well suited to student work. Nowadays, when data protection matters are demanding, it is a relief to be able to direct students to these ready-made resources. For me, too, there are still many new things to discover in Kielipankki. My interest in wartime letters, for example, has recently led me to Kalle Päätalo’s Iijoki series, and I have been quite surprised by the research potential of this cornucopia.

Publications

Mustanoja Liisa, O’Dell Michael & Lappalainen Hanna, 2022: Helsinkiläis- ja tamperelaispuhujien äänenkorkeuden muutokset 1970-luvulta 2010-luvulle. Puhe ja kieli. https://doi.org/10.23997/pk.121404

Kuparinen Olli, Santaharju Jenni, Leino Unni, Mustanoja Liisa & Peltonen Jaakko, 2022: Katomuotojen eteneminen hd-yhtymässä Helsingin puhekielessä. Virittäjä 126, s. 316–338. https://doi.org/10.23982/vir.100585

Kuparinen Olli, Peltonen Jaakko, Mustanoja Liisa, Leino Unni & Santaharju Jenni, 2021: Lects in Helsinki Finnish – a probabilistic component modeling approach. Language Variation and Change. https://doi.org/10.1017/S0954394521000041

Lappalainen Hanna, Mustanoja Liisa & O’Dell Michael, 2019: Miten ja milloin yksilön kieli muuttuu? Helsinkiläisidiolektien muutos ja muutoksen tutkimuksen menetelmät. Virittäjä 123, s. 550–581. https://doi.org/10.23982/vir.67808

Kuparinen Olli, Mustanoja Liisa, Peltonen Jaakko, Santaharju Jenni & Leino Unni, 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. Sananjalka 61, s. 30–56. https://doi.org/10.30673/sja.80056

Mustanoja Liisa, 2018: Sydämellisiä kirjeitä talvisodasta. Hämäläisten sotilaiden kiitoskirjeet aikansa kielen ja kirjeenvaihtokulttuurin heijastajina. Sisko Brunni, Niina Kunnas, Santeri Palviainen ja Jari Sivonen (toim.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia humaniora ouluensia 16. Oulu, s. 251–285. https://urn.fi/URN:ISBN:9789526221120

Mustanoja Liisa (toim.), 2017: Arjen sirpaleita ja suuria tunteita: Kirjeet sodan sanoittajina ja ihmissuhteiden ylläpitäjinä 1939–1944. Tampere Studies in Language, Translation and Literature B5. Tampereen yliopisto. https://urn.fi/URN:ISBN:978-952-03-0527-7

Mustanoja Liisa, 2011: Idiolekti ja sen muuttuminen: reaaliaikatutkimus Tampereen puhekielestä. Tampere: Tampere University Press. https://urn.fi/urn:isbn:978-951-44-8417-9

Corpora

The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s)

Longitudinal data of Tampere spoken language

Suomi 24 resource group

Iijoki, the University of Oulu Päätalo collection

More information

Faculty of Information Technology and Communication Sciences | Tampere Universities community (tuni.fi)

The Linguistic Study of Wartime Correspondence | Tampere Universities community (tuni.fi)

Researcher of the Month: Tiina Onikki-Rantajääskö

Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tiina Onikki-Rantajääskö tells us about the principles of the Helsinki Term Bank for the Arts and Sciences (HTB) and invites interested experts to join the collaborative terminology work.

Who are you?

I am Tiina Onikki-Rantajääskö, Professor of Finnish at the University of Helsinki. I also lead the Helsinki Term Bank for the Arts and Sciences (HTB).

What is your research topic?

I am generally interested in how vocabulary and grammatical structures construe linguistic meaning and how they function in relation to the wider textual context. Most of my published research is related to the local cases of the Finnish language. Currently, I am delighted to see how younger researchers aim to combine qualitative and quantitative research in the project Platforms and Rhetorical Group Strategies (in Finnish, “Alustat ja retoriset ryhmästrategiat”), run by me and Eetu Mäkelä and funded by Kone Foundation. I am particularly interested in discovering whether some constructions can indicate broader discourse structures. However, during this winter, I am spending most of my time on my duties as the Finnish Language Rapporteur, appointed by the Ministry of Justice.

How is your research related to Kielipankki?

I tend to use the Finnish language resources in Kielipankki whenever I need information about the context of a word or grammatical element. Many of the corpora that I have used in the past can now be found in Kielipankki, such as the HS.fi News and Comments Corpus that was compiled in one of my earlier projects.

In addition, the Helsinki Term Bank for the Arts and Sciences (HTB) is part of the FIN-CLARIAH Research Infrastructure, together with Kielipankki. This is reflected in the fact that the online service of the HTB is also accessible via Kielipankki. The HTB also has an employee funded through the FIN-CLARIAH project (FIRI funding from the Research Council of Finland). There is a need for collaboration in the field of language technologies.

The contents of the Helsinki Term Bank for the Arts and Sciences (HTB) are still in the construction phase. We are constantly working to involve more and more researchers from different disciplines in the terminology work and to invite new disciplines to join the HTB. Defining scientific terms and providing other background information on concepts require expertise in each field. Therefore, the selected method is niche-sourcing of experts, supported by our project planner. The aim is to promote the multilingualism of science in addition to providing openly accessible information describing the formation of scientific knowledge and facilitating the utilization of science. Scientific concepts are at the heart of research. Multilingualism can be promoted by offering translation equivalents for terms in different languages. The Finnish language is in focus, since the aim is to support Finnish as a language of science. However, it is possible to present definitions and concept pages in languages other than Finnish. The term bank thus opens up opportunities for international collaboration. Especially for multilingual and multidisciplinary research groups, the term bank provides an opportunity to shape the common terminological ground. All interested experts are welcome to participate.

My research interests in the Helsinki Term Bank for the Arts and Sciences (HTB) include the presentation of background knowledge frames and the emergence of prototypicality, as well as collaborative interactions: the network of experts in the HTB and the online service interact and form a field of action that differs from traditional research projects.

Publications

Enqvist, Johanna & Tiina Onikki-Rantajääskö & Kaarina Pitkänen-Heikkilä 2021: Terminology work as open, communal and collaborative crowdsourcing practice of academic communities. – Terminology 27:1, pp. 56–79. DOI: 10.1075/term.00058.enq

Jaakola, Minna & Tiina Onikki-Rantajääskö (eds.) 2023: The Finnish Case System: Cognitive Linguistic Perspectives. Helsinki: SKS. DOI: 10.21435/sflin.23

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Vetenskapstermbanken i Finland i samhällets tjänst. – Publikation Nordterm 2023.

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Tieteen termipankki tieteentekemisen ytimessä. – Kieliviesti 2/2023.

Onikki-Rantajääskö, Tiina & Harri Kettunen 2023: Vuosi 2022 Tieteen termipankissa: Laajenemista uusille aihealueille ja tunnustuspalkintoja avoimen tieteen edistämisestä. – Tieteen termipankin blogi. Helmikuu/2023. https://blogs.helsinki.fi/tieteentermipankki/2023/02/16/vuosi-2022-tieteen-termipankissa-laajenemista-uusille-aihealueille-ja-tunnustuspalkintoja-avoimen-tieteen-edistamisesta/

Corpora

HS.fi News and Comments Corpus

More information

Helsinki Term Bank for the Arts and Sciences (HTB)

Instructions for a new expert for joining the collaborative terminology work

FIN-CLARIAH Research Infrastructure

Researcher of the Month: Aleksi Sahala

Photo: Marianne Ough

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Aleksi Sahala tells us about his research on the development and application of Natural Language Processing (NLP) methods for annotating and analyzing ancient text data.

Who are you?

I am Aleksi Sahala, a postdoc researcher in Assyriology and Language Technology. I am currently working at the University of Helsinki in the Academy of Finland-funded project “The Origins of Emesal”, where our goal is to use computational methods to investigate how Emesal, the only known language variety of Sumerian, came to be and evolved over time.

I did my master’s degree in Assyriology and Computational Linguistics, and in 2021 I finished my PhD thesis “Contributions to Computational Assyriology”. In 2022, I was a visiting scholar at the University of California, Berkeley, and in 2024 I will visit the University of Innsbruck in Austria. I have also worked in close co-operation with the Centre of Excellence in Ancient Near Eastern Empires at the University of Helsinki.

What is your research topic?

My research focuses on the development and application of NLP (Natural Language Processing) methods for annotating and analyzing ancient text data. My particular interest lies in the Mesopotamian cuneiform texts written in Sumerian (3200 BCE – 100 CE) and Akkadian (2500 BCE – 100 CE). Analysis of Sumerian and Akkadian texts is challenging not only because of data sparsity and the fragmentary nature of the primary sources, but also because of the complexity of the cuneiform writing system and inflectional morphology. In theory, most words can occur in several thousand different forms, each of which can also be spelled in several different ways.

My focal point has been on the development of a pipeline that is able to linguistically annotate raw transliterations of cuneiform texts so that these texts can be used for data analysis and visualization. This allows for the analysis of thousands of transliterated texts simultaneously and, for example, the visualization and study of how different words, concepts or entities are related to each other on a larger scale. Although Assyriologists have digitized over 20,000 Akkadian and over 100,000 Sumerian texts in various text corpora, these texts have mostly been studied qualitatively by close-reading. By applying a more computational approach, it becomes easier to reveal larger patterns within specific groups of texts.

I have developed a finite-state morphology for Akkadian (BabyFST), as well as a language-independent neural lemmatizer and tagger with special support for cuneiform languages (BabyLemmatizer). In addition, I have built a word-embedding-based tool for analyzing semantic relationships of words in sparse and fragmentary data sets (PMI Embeddings).
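To give a rough idea of what PMI-based word vectors look like, here is a minimal sketch in Python. It is not the implementation of the PMI Embeddings tool: the function names, the window size and the toy sentences are all assumptions made for illustration, and the sketch covers only the positive PMI weighting of co-occurrence counts, leaving out any dimensionality reduction step.

```python
import math
from collections import Counter, defaultdict


def ppmi_vectors(sentences, window=2):
    """Return {word: {context: PPMI}} from symmetric-window co-occurrence counts."""
    pair_counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, sent[j])] += 1
    total = sum(pair_counts.values())
    # Marginal counts; the window is symmetric, so row and column sums coincide.
    word_counts = Counter()
    for (word, _), n in pair_counts.items():
        word_counts[word] += n
    vectors = defaultdict(dict)
    for (word, ctx), n in pair_counts.items():
        pmi = math.log(n * total / (word_counts[word] * word_counts[ctx]))
        if pmi > 0:  # keep only positive associations (PPMI)
            vectors[word][ctx] = pmi
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Invented toy data, not taken from any real corpus.
toy_sentences = [["king", "rules", "land"], ["queen", "rules", "land"]]
toy_vectors = ppmi_vectors(toy_sentences)
similarity = cosine(toy_vectors["king"], toy_vectors["queen"])
```

In this toy example, "king" and "queen" occur in identical contexts, so their vectors point in the same direction. With sparse and fragmentary texts the counts are far less clean, which is why real tools need additional smoothing and tuning.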

My current project focuses on Emesal, a liturgical variant of the Sumerian language, which is only attested in writing after Sumerian was no longer used as a vernacular. Although it is known that Emesal was used in liturgical contexts, such as lamentations, and occasionally to indicate the direct speech of goddesses and women, its origins and evolution are still widely debated. None of the Emesal texts were written entirely in this language variant; they were written in Sumerian, and Emesal words were used only here and there as keywords to indicate that the current line or passage should be read in this dialect. The rules behind this code-switching, if such rules ever existed, remain largely unknown. We hope that a larger-scale analysis of Emesal texts could reveal patterns that explain exactly what kinds of environments triggered the use of Emesal words, how this language variant was introduced in written documents and how it evolved over its 2,000-year history.

How is your research related to Kielipankki?

Kielipankki has been co-operating with the Centre of Excellence in Ancient Near Eastern Empires by annotating cuneiform texts and publishing them in the Korp concordance service. My responsibilities have been collecting these data sets and converting them into a Korp-compatible format, as well as developing tools for annotating them and harmonizing them with the existing resources in such a way that they can be used efficiently together for quantitative analysis.

Recently, we have been working on the harmonization, lemmatization and tagging of Achemenet, a collection of Neo-Babylonian administrative and legal documents.

Publications

Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S., & Lindén, K. (2019). Aššur and his friends: a statistical analysis of neo-assyrian texts. Journal of Cuneiform Studies, 71(1), 159–180. http://hdl.handle.net/10138/303986

Alstola, T., Jauhiainen, H., Svärd, S., Sahala, A., & Lindén, K. (2023). Digital Approaches to Analyzing and Translating Emotion: What Is Love?. In The Routledge Handbook of Emotions in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/348398

Bennet, E. & Sahala, A. (2023). Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023. http://hdl.handle.net/10138/565513

Ihalainen, P. & Sahala, A. (2020). Evolving Conceptualisations of Internationalism in the UK Parliament. Digital Histories, 199.

Luukko, M., Sahala, A., Hardwick, S., & Lindén, K. (2020). Akkadian treebank for early neo-assyrian royal inscriptions. In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories. The Association for Computational Linguistics. http://hdl.handle.net/10138/322305

Sahala, A. J. A. (2017). Johdatus sumerin kieleen. Suomen itämainen seura.

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). BabyFST: Towards a finite-state based computational model of ancient babylonian. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3886–3894). http://hdl.handle.net/10138/317691

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). Automated phonological transcription of Akkadian cuneiform text. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association (ELRA). http://hdl.handle.net/10138/317688

Sahala, A. (2021). Contributions to Computational Assyriology. PhD Thesis. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-7416-1

Sahala, A., & Töyräänvuori, J. (2022). Kirjoitustaidon kehittyminen. In Svärd, S. & Töyräänvuori, J. (eds.), Muinaisen Lähi-idän imperiumit. Kadonneiden suurvaltojen kukoistus ja tuho, s. 49–69. Gaudeamus, Helsinki.

Sahala, A., & Svärd, S. (2022). Language technology approach to “seeing” in Akkadian. In The Routledge Handbook of the Senses in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/339256

Sahala, A., Alstola, T., Valk, J., & Lindén, K. (2023, June). Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction. In Selected papers from the CLARIN Annual Conference 2022 (pp. 111–119). http://hdl.handle.net/10138/563733

Sahala, A. & Lindén, K. (2023). A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Svärd, S., Jauhiainen, H., Sahala, A., & Lindén, K. (2018). Semantic Domains in Akkadian Texts. CyberResearch on the Ancient Near East and Neighboring Regions. Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving, 2, 224–256. http://hdl.handle.net/10138/241805

Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A., & Lindén, K. (2020). Fear in akkadian texts: New digital perspectives on lexical semantics. In The Expression of Emotions in Ancient Egypt and Mesopotamia (pp. 470–502). Brill. http://hdl.handle.net/10138/328017

Tools

BabyLemmatizer, OpenNMT based neural lemmatizer and tagger. Pretrained models available for Ancient Greek, Latin and various cuneiform languages.

BabyFST, Finite-state morphology of Akkadian, specifically Babylonian dialect.

PMI-Embeddings, Hyper-parametrized tool for creating PMI+SVD based word embeddings from sparse or fragmentary data sets.

Corpora

Open Richly Annotated Cuneiform Corpus (Oracc)

Achemenet Babylonian texts

More information

Centre of Excellence in Ancient Near Eastern Empires (ANEE)

Researcher of the Month: Anna Dmitrieva

Anna Dmitrieva (standing) with Aleksandra Konovalova (sitting), co-creators of the Parallel Corpus of Finnish and Easy-to-read Finnish. Photo: Anna Dmitrieva

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anna Dmitrieva tells us about her research on text simplification. Computational methods and the compiling of parallel corpora are an integral part of her work.

Who are you?

I am Anna Dmitrieva, a doctoral researcher at HELSLANG, the Doctoral Programme in Language Studies at the University of Helsinki.

What is your research topic?

My main field of interest is text simplification. I have studied computational linguistics since 2012, when I started my studies for the Bachelor’s degree. Since then, I have been involved in many projects related to natural language processing (NLP), but text simplification has been my main focus during my doctoral studies.

Text simplification is a process of making a text “easier”. A simplified text should be more readable and accessible to a broader audience. In NLP, text simplification can be viewed as a monolingual machine translation problem. We train models that are capable of translating or transforming texts, taking a source text in a particular language and producing a “simpler” version of the text in the same language. This task typically requires a lot of parallel data, where there is a corresponding “easy” target text for each source text.

I work with languages that do not have a lot of simplification data, make datasets for them, and train simplification models. During my time as a doctoral researcher, I have made Russian and Finnish text simplification datasets and models. I am also investigating controlled text simplification, the task of manipulating certain linguistic properties in the output of the simplification model.
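One common way to set up controlled simplification is to prepend a control token to each source text during training, so that the model learns to associate the token with a property of the output. The sketch below illustrates this idea in Python for a single property, the length ratio between the simplified and the original text. The token format, the bucket size and the function names are assumptions made for illustration and are not taken from my models.

```python
def length_ratio_token(source: str, target: str, step: float = 0.25) -> str:
    """Bucket the target/source character-length ratio into a control token."""
    ratio = len(target) / max(len(source), 1)
    bucket = round(ratio / step) * step
    return f"<LEN_{bucket:.2f}>"


def add_control_token(source: str, target: str) -> str:
    """Training-time model input: control token followed by the source text."""
    return f"{length_ratio_token(source, target)} {source}"


# Invented toy pair: the "simplified" text is half as long as the source.
example_input = add_control_token("abcdefghij", "abcde")
```

At inference time there is no target text, so the token is chosen by the user instead, for example a smaller ratio to ask for a shorter output.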

How is your research related to Kielipankki?

As a Finnish university student, I have naturally thought of making a Finnish simplification model. Since there were no parallel simplification corpora for Finnish, I had to make one myself. The most obvious choice for the data source was Yle Easy-to-read Finnish News: they exist in the form of text, have been around for a relatively long time, and have equivalents in “regular” Finnish. It was a relief to know that I didn’t have to scrape the news myself using Yle’s API because all the archives are already in Kielipankki.

However, I had to solve the problem of aligning Easy Finnish and Standard Finnish news. I performed automatic alignment, but there was no gold-standard test set of document pairs to test the quality of the alignments. This is where my friend Aleksandra Konovalova (University of Turku) stepped in and helped me, evaluating 1919 pairs of documents herself. Together, we created the Parallel Corpus of Finnish and Easy-to-read Finnish, which is now available in Kielipankki. Currently, I am adding more document pairs and creating a sentence-aligned version, which will hopefully also be made available via Kielipankki when completed.
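Manual judgements like these can be turned into a simple quality estimate for the automatic alignment. The following Python sketch assumes a hypothetical data format in which each reviewed pair is stored as (Easy Finnish document ID, Standard Finnish document ID, whether the reviewer accepted the pair). The IDs and the function name are invented for illustration and do not describe the corpus files.

```python
def alignment_precision(judgements):
    """Share of automatically aligned pairs that the reviewer marked as correct."""
    judgements = list(judgements)
    if not judgements:
        return 0.0
    correct = sum(1 for _, _, is_correct in judgements if is_correct)
    return correct / len(judgements)


# Invented toy judgements: three of four proposed pairs were accepted.
toy_judgements = [
    ("easy-1", "std-17", True),
    ("easy-2", "std-40", True),
    ("easy-3", "std-08", False),
    ("easy-4", "std-23", True),
]
precision = alignment_precision(toy_judgements)
```

This measures only how many of the proposed pairs are right. Finding out how many true pairs the aligner missed would require a separate check.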

Publications

Dmitrieva, A. & Konovalova, A. Creating a parallel Finnish—Easy Finnish dataset from news articles. Jun 2023, Proceedings of the 1st Workshop on Open Community-Driven Machine Translation. Esplá-Gomis, M., Forcada, M., Kuzman, T., Ljubešić, N., van Noord, R., Ramírez-Sánchez, G., Tiedemann, J. & Toral, A. (eds.). Universitat d’Alacant, pp. 21–26. https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27

Dmitrieva, A. Automatic text simplification of Russian texts using control tokens. May 2023, Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Piskorski, J., Marcińczuk, M., Nakov, P. et al. (eds.). Stroudsburg: Association for Computational Linguistics (ACL), pp. 70–77. DOI: 10.18653/v1/2023.bsnlp-1.9

Dmitrieva, A. The role of language technology in accessible communication research. Jun 2023, Emerging Fields in Easy Language and Accessible Communication Research. Deilen, S., Hansen-Schirra, S., Hernández Garrido, S., Maaß, C. & Tardel, A. (eds.). Frank & Timme, pp. 319–338. (Easy – Plain – Accessible; vol. 14). https://researchportal.helsinki.fi/fi/publications/the-role-of-language-technology-in-accessible-communication-resea

Corpora

Yle News Archive in Kielipankki

Parallel Corpus of Finnish and Easy-to-read Finnish

More information

HELSLANG – The Doctoral Programme in Language Studies at the University of Helsinki

Researcher of the Month: Sampo Pyysalo

Photo: Pasi Leino / University of Turku

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampo Pyysalo tells us about his research on natural language processing. Openly available large language models are necessary for developing tools similar to ChatGPT also for smaller languages, such as Finnish.

Who are you?

I’m Sampo Pyysalo, University Research Fellow at the TurkuNLP group of the University of Turku.

What is your research topic?

My research is on machine learning approaches to natural language processing, with particular focus on processing Finnish text and analyzing biomedical domain scientific literature. A lot of my recent work revolves around training large neural network models, including general “foundation” models such as FinBERT and FinGPT as well as task-specific models such as a named entity recognition model for Finnish. I also work on data, both compiling raw text resources for the unsupervised training of foundation models and running manual annotation efforts to create resources for supervised training, such as the Turku NER and TurkuONE corpora.

Large neural language models are central to a lot of state-of-the-art natural language processing and the basis for tools such as ChatGPT, but most such models focus on English and many of the best models are not publicly available. We believe that openly available Finnish models such as FinBERT and FinGPT are necessary to enable the creation of tools for processing Finnish language with comparable capabilities to tools available for English.

How is your research related to Kielipankki?

Creating large language models from scratch requires billions of words of text, and collections of Finnish of this size are not readily available. To compile sufficiently large corpora for language model training we have drawn on various sources, including web crawls and resources available through Kielipankki such as the Yle News Archive, the Finnish News Agency Archive (STT) and the Suomi 24 Corpus. We also distribute resources created by TurkuNLP through Kielipankki among other channels.

In the near future, we hope that we will be able to provide access to the full text resources used to create our models for research purposes through Kielipankki to improve the replicability of our work and to make it easier for future efforts to create models for Finnish.

Publications

J. Luoma & LH. Chang & F. Ginter & S. Pyysalo. 2021. Fine-grained Named Entity Annotation for Finnish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 135–144, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden. https://aclanthology.org/2021.nodalida-main.14

A. Virtanen & J. Kanerva & R. Ilo & J. Luoma & J. Luotolahti & T. Salakoski & F. Ginter & S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. In CoRR, abs/1912.07076. https://doi.org/10.48550/arXiv.1912.07076

Corpora

Turku NER Corpus (data available via GitHub)

TurkuONE Corpus (data available via GitHub)

The Yle News Archive resource group in Kielipankki

The Finnish News Agency Archive resource group in Kielipankki

The Suomi 24 Corpus resource group in Kielipankki

More information

TurkuNLP group of the University of Turku

FinBERT, a version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group

FinGPT, generative GPT-3-like models for Finnish

Finnish NER, a Named Entity Recognition system for Finnish (based on FinBERT and a new NER annotation layer of the UD_Finnish-TDT treebank)


Researcher of the Month: Nobufumi Inaba

Photo: Krista Teeri

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Nobufumi Inaba tells us about a corpus that he is preparing, which contains a text from the year 1526 and is an interesting source for researchers studying language change.

Who are you?

I am Nobufumi Inaba, Senior Researcher at the Archive of Finnish and Finno-Ugric languages at the University of Turku. The Archive is part of the Department of Finnish and Finno-Ugric Languages and it has only been operating under this name for a couple of years. The Finnish language part of the Archive, for which I am responsible, was formerly known as the Syntax Archive. Many Finnish language researchers are probably familiar with the corpus of the same name. I have been involved in the planning and implementation of, for example, technical solutions for the projects in our department and for the corpora produced in our Archive. I have also created tools to be used internally by our corpus teams.

What is your research topic?

I have been interested in studying language change and its causes. In my dissertation, I investigated the roots of the so-called dative genitive in Finnish and my research data consisted mostly of texts from old literary languages. In recent years, I have been studying the phenomenon of leaving out the inflection of words in Finnish. My data consists of chat conversations in a location-based game community and of the speech recordings I collected at the game locations.

Currently, I am investigating old literary language again. I am preparing a corpus of the 1526 Swedish New Testament, one of the source texts used by Mikael Agricola. This New Testament has been seen as a symbol of the beginning of the Modern Swedish period. The forthcoming corpus is intended to support the study of the language of Agricola’s works. The importance of the text is not merely symbolic. In my opinion, this earlier New Testament text is a much more valuable source for those interested in linguistic changes than the whole Bible of 1541 (Gustav Vasas bibel). Unlike the whole Bible, which includes many attempts to regulate and harmonize linguistic elements all the way from vocabulary to syntax, it does not seem to contain regulated language. Moreover, the 1526 New Testament contains a striking number of elements from spoken language, which the 1541 Bible largely attempted to eliminate. The preliminary coding of the text in order to facilitate annotation is now complete and I expect to start the annotation work in the autumn of 2023.

How is your research related to Kielipankki?

We have had a good division of labour with Kielipankki ever since the days of the Syntax Archive. The University of Turku produces language resources that are published via Kielipankki for the use of the scientific community. The Finnish Dialect Corpus of the Syntax Archive and The Morpho-Syntactic Database of Mikael Agricola’s Works, produced in cooperation with the Institute for the Languages of Finland, as well as the ArkiSyn corpus, an important annotated collection of contemporary Finnish produced at the University of Turku, have all been published via the Korp service in Kielipankki. Naturally, Kielipankki will also be the publication site for the Swedish-language New Testament corpus that I am currently working on.

Publications

Nobufumi Inaba (2015). Suomen datiivigenetiivin juuret vertailevan menetelmän valossa. Suomalais-Ugrilaisen Seuran toimituksia 272. https://www.sgr.fi/fi/items/show/78

Language resources

The Finnish Dialect Corpus of the Syntax Archive, Helsinki Korp Version

The Morpho-Syntactic Database of Mikael Agricola’s Works

ArkiSyn Database of Finnish Conversational Discourse

More information

Archive of Finnish and Finno-Ugric Languages (University of Turku)


Researcher of the Month: Niina Kunnas

Photo: Mikko Törmänen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Niina Kunnas tells us about her research on minority languages such as Meänkieli.

Who are you?

I am Niina Kunnas, Associate Professor of Finnish language and University Lecturer at the University of Oulu. I also hold a part-time position as Professor of Finnish language at Sámi Allaskuvla in Kautokeino, Norway.

What is your research topic?

My research falls within sociolinguistics, folk linguistics and minority language research. I have examined linguistic variation, language perceptions and situational variation in minority languages, among other things.

How is your research related to Kielipankki?

In recent years, Kielipankki has been involved in my research in a number of ways. Firstly, in 2019, I collected a corpus of spoken Meänkieli together with my students, which was recorded from the outset with the intention of making it available to researchers via Kielipankki. The corpus contains spoken Meänkieli from several municipalities in the Meänkieli-speaking area, and its collection has been encouraged by Heikki Paunonen. Some of the interviewees are the same as those previously recorded in the 1990s. Paunonen also recorded speech from the same parishioners in the 1960s, so the material as a whole makes it possible to carry out a three-round follow-up study of spoken Meänkieli.

I have also recently made use of the Iijoki, the University of Oulu Päätalo Collection corpus on the Korp server. The corpus contains all the novels in the Iijoki series written by Kalle Päätalo and has a size of over 5 million tokens. Together with Liisa Mustanoja and Maija Saviniemi, I will use this data in our study of the function and the associated affects of the Viena Karelian episodes in the Iijoki series. The corpus has allowed us to search data rapidly, and the results of the study will be published in an article that will appear in a volume with the working title Päättymätön savotta. Analyyseja Kalle Päätalon tuotannosta (Timberwork without End. Analyses of Kalle Päätalo’s works).

Publications

Kunnas, Niina 2019: Karjalan kieli Oulun seudulla. – Harri Mantila, Maija Saviniemi & Niina Kunnas (toim.), Oulu kieliyhteisönä. 144–199. Helsinki: Suomalaisen Kirjallisuuden Seura.

Saviniemi, Maija, Kunnas, Niina, Mantila, Harri, Paukkunen, Ulla & Rajala, Elina 2019: Oulua havainnoimassa. – Harri Mantila, Maija Saviniemi & Niina Kunnas (toim.), Oulu kieliyhteisönä. 276–318. Helsinki: Suomalaisen Kirjallisuuden Seura.

Vaattovaara, Johanna, Kunnas, Niina & Saviniemi, Maija 2018: Stadi imitoituna. – Sisko Brunni, Niina Kunnas, Santeri Palviainen & Jari Sivonen (toim.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia Humaniora Ouluensia 16. Oulun yliopisto. http://jultika.oulu.fi/files/isbn9789526221120.pdf

Kunnas, Niina 2018: Viena Karelians as observers of dialect differences in their heritage language. – Marjatta Palander, Helka Riionheimo & Vesa Koivisto (eds.), On the border of language and dialect. 123–155. Studia Fennica Linguistica 21. Helsinki: Suomalaisen Kirjallisuuden Seura.

Language resources

Iijoki, the University of Oulu Päätalo collection

Corpus of Spoken Meänkieli

More information

Sámi Allaskuvla (Sámi University of Applied Sciences)

University of Oulu, Faculty of Humanities


Researcher of the Month: Mikael Varjo

Photo: Emmi Saari

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikael Varjo tells us about his research on zero-subject constructions in the ArkiSyn corpus containing everyday Finnish conversation.

Who are you?

I am Mikael Varjo and I am currently working as a university teacher at the University of Turku. In March 2023, I defended my doctoral thesis on zero-subject constructions, also at the University of Turku. My interests are diverse, ranging from teaching and researching Finnish as a second and foreign language to research in usage-based syntax.

What is your research topic?

In my doctoral thesis I examine zero-subject constructions (zero person in the subject position) in Finnish everyday conversation. I have extracted my data from the morphosyntactically annotated ArkiSyn corpus, which I also helped to build as a project researcher in 2015–2016 before starting my own dissertation.

Previous research on the zero person has been quite qualitatively oriented. My research aims to fill this methodological gap by combining two approaches: quantitative corpus linguistics and qualitative interactional linguistics. In my research, I examine the characteristics, variation, contexts of use, and functions of zero-subject constructions in spoken interaction. My research reveals that the grammatical and semantic features typically associated with the zero person also distinguish the subcategories of zero-subject constructions. The differences between subcategories are also linked to the tasks the constructions have in interaction. Typically, zero-subject constructions are used for expressing stance towards something that is under discussion, (joint) planning, sharing of experiences, feelings and desires, or for giving directives.

How is your research related to Kielipankki?

The ArkiSyn corpus is available in Kielipankki. In addition, Kielipankki provided important support in the early stages of my doctoral studies as I was taking my first steps in language technology, natural language processing and automatic text processing. Converting zero-subject constructions extracted from the ArkiSyn corpus into a format that was easy to process and met the needs of my dissertation required a lot of learning over the years. With the help of Kielipankki’s methodological course Corpus Clinic, I was able to get started in the autumn of 2015.

Publications

Varjo, Mikael. 2022. Greater than zero? A study of referentially open and specific necessity constructions in Finnish everyday conversation. Eesti Ja Soome-Ugri Keeleteaduse Ajakiri. Journal of Estonian and Finno-Ugric Linguistics, 13(2), 5–46. https://doi.org/10.12697/jeful.2022.13.2.01

Suomalainen, Karita & Mikael Varjo. 2020. When personal is interpersonal. Organizing interaction with deictically open personal constructions in Finnish everyday conversations. Journal of Pragmatics, 168, 98–118. https://doi.org/10.1016/j.pragma.2020.06.003

Varjo, Mikael. 2019. It Takes All Kinds to Make a Zero: Employing Multiple Correspondence Analysis to Categorize an Open Personal Construction in Conversational Finnish. Corpus Linguistics Research, 5, 55–87. https://doi.org/10.18659/clr.2019.5.03

Varjo, Mikael & Karita Suomalainen. 2018. From zero to ‘you’ and back: A mixed methods study comparing the use of two open personal constructions in Finnish. Nordic Journal of Linguistics, 41(3), 333–366. https://doi.org/10.1017/s0332586518000215

Language resources

ArkiSyn Database of Finnish Conversational Discourse

More information

Courses and training organized by the Language Bank of Finland


Researcher of the Month: Rosa González Hautamäki

Photo: Ville Hautamäki

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Rosa González Hautamäki tells us about her research on within-speaker variation and the effects of voice modifications. The AVOID corpus, which she collected in collaboration with the Computational Speech group at UEF, is a valuable resource for studying human-induced voice modifications.

Who are you?

I am Rosa González Hautamäki, a postdoctoral researcher at the Research Unit of Logopedics (RULOGO) at the University of Oulu, and a visiting researcher at the School of Humanities at the University of Eastern Finland. I hold a Ph.D. in Computer Science and maintain ongoing collaborations with the School of Computing at the University of Eastern Finland and the Human Language Technology lab at the National University of Singapore (NUS).

What is your research topic?

My research focuses on within-speaker variation in the context of speaker recognition. Speech is a complex signal that varies due to several factors, such as age, health, emotional state, and more, so it is expected that a speaker won’t utter the same phrase in exactly the same way multiple times. During my doctoral studies, I studied the effects of voice modifications on the performance of voice comparisons carried out by listeners or automatic systems. My initial research focused on mimicry and voice disguise, considering that some speakers may not be cooperative when interacting with speaker recognition systems. Our research showed that even simple techniques to disguise one’s voice could cause degradation in the performance of automatic systems, while also making the task of speaker comparison challenging for listeners.

Since then, my studies on within-speaker variation have focused on identifying the factors that impact the performance of speaker verification, including deliberate and non-deliberate voice modifications. These findings have also been important in analyzing speech in other speech technology tasks, such as speech spoofing attacks and auditory speech perception. Exploring the factors that impact system decisions can help in making them more reliable.

Currently, my research on speech analysis involves using machine learning models with data from evaluations used to identify developmental language disorders in children. I am excited to be part of a motivated group of researchers who are exploring speech and interventions that can support those working with the development of children’s speech.

How is your research related to Kielipankki?

During my doctoral research, I collaborated with the Computational Speech group at the University of Eastern Finland to collect a dataset for the study of voice disguise. Kielipankki provided crucial support by offering information necessary for the collection and preparation of the corpus, as well as for its publication as a resource. The resulting dataset, called the Age-related Voice Disguise (AVOID) corpus, contains voice recordings of Finnish speakers in their modal voice and attempting age disguise.

In one study, we used the AVOID corpus to analyze the impact of changes in selected acoustical features on automatic speaker recognition systems, and found that the difference in long-term fundamental frequency (F0) was the most detrimental factor to speaker recognition performance, even when the automatic system uses spectral features.

In another study using the AVOID corpus, we evaluated the effectiveness of age stereotypes as a voice disguise strategy in speaker comparisons. Listeners estimated both the speaker’s chronological and intended age (attempting child and elderly voices), and results showed that the age estimations for intended voices for female speakers were more accurate towards the target age, while for male speakers, age estimations corresponded to the direction of the target voice only for elderly voices.

Overall, the AVOID corpus is a valuable resource for studying human-induced voice modifications, and we expect that further research will help make systems more robust to disguised voices.

Publications

González Hautamäki, R., Hautamäki, V., and Kinnunen, T. (2019). “On Limits of Automatic Speaker Verification: Explaining Degraded Recognizer Score Through Acoustic Changes Resulting from Voice Disguise”, The Journal of the Acoustical Society of America 146, 693. https://doi.org/10.1121/1.5119240

González Hautamäki, R., Sahidullah, Md., Hautamäki, V., and Kinnunen, T. (2017). “Acoustical and perceptual study of voice disguise by age modification in speaker verification”, Speech Communication, Volume 95, Pages 1–15, https://doi.org/10.1016/j.specom.2017.10.002

González Hautamäki, R., Sahidullah, Md., Kinnunen, T., and Hautamäki, V. (2016). “Age-Related Voice Disguise and its Impact in Speaker Verification Accuracy”, Odyssey: The Speaker and Language Recognition Workshop, Bilbao, Spain, pages 277–282, http://dx.doi.org/10.21437/Odyssey.2016-40

González Hautamäki, R., Kanervisto, A., Hautamäki, V., and Kinnunen, T. (2018). “Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification”, Odyssey: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, pages 320–326, http://dx.doi.org/10.21437/Odyssey.2018-45

Language resources

Corpus of Age-related Voice Disguise (AVOID)

More information

Computational Speech group at the University of Eastern Finland

Research Unit of Logopedics (RULOGO) at the University of Oulu

School of Humanities at the University of Eastern Finland

School of Computing at the University of Eastern Finland

Human Language Technology lab at the National University of Singapore


Researcher of the Month: Johanna Vaattovaara

Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Johanna Vaattovaara tells us about her research projects on language awareness and language attitudes.

Who are you?

I am Johanna Vaattovaara, professor of Finnish language in the Languages Unit at the Faculty of Information Technology and Communication Sciences, Tampere University.

What is your research topic?

My research topics fall within sociolinguistics and language ideology research, mainly language awareness and attitude research. I have also done research on linguistic variation and language change, and for these topics various corpora have proven to be very valuable resources. Corpora have also been useful in the creation of language attitude study designs. In recent years, for example, I have used the Suomi24 corpus in various ways in studies where I have investigated, together with Elizabeth Peterson and also with Ylva Biri and Turo Hiltunen, the integration of English expressions into Finnish language use.

How is your research related to Kielipankki?

So far, I have used the Suomi24 corpus in Kielipankki, especially Suomi24 2016H2. Currently, I am launching the research project Arkisuomien kielitietoisuudet ja muutos (Societal awareness of linguistic variation and change), funded by the Kone Foundation (2023–25). During the project, we will collect language awareness and attitude data using different methods, including nationwide survey data that we plan to distribute via Kielipankki.

In the past, I have distributed data through the archives of the Institute for the Languages of Finland (Kotus). The data that I collected for my dissertation is also available from Kotus. The data consists of interviews with a group of high school graduates in Pello, Tornionlaakso (Torne Valley). In the post-doc phase, I collected reaction and interview data in the lobby of the Finnish Science Centre, Heureka, in the project Helsingin suomea – monimuotoisuus, sosiaalinen identiteetti ja kielelliset asenteet kaupunkiympäristössä, led by Marja-Leena Sorjonen and funded by the Academy of Finland in 2009–2012. This corpus of metalinguistic material can also be obtained from Kotus.

Publications

Peterson, E., Hiltunen, T., Vaattovaara, J. 2022. A place for pliis in Finnish: A discourse-pragmatic variation account of position. – Elizabeth Peterson, Turo Hiltunen & Joseph Kern (eds.), Discourse-Pragmatic Variation and Change: Theory, Innovations, Contact, pp. 272–292. Cambridge University Press. DOI: 10.1017/9781108864183.015

Peterson, E., Biri, Y., Vaattovaara, J. 2022. Grammatical and social structures of English-sourced swear words in Finnish discourse. – Martín-Solano, R. & San Segundo, R. (eds.), Corpus linguistics and Anglicisms, pp. 49–70. Peter Lang Publishing. DOI: 10.3726/b19222

Vaattovaara, J. & Peterson, E. 2019. Same old paska or new shit? On the stylistic boundaries and social meaning potentials of a loanword in Finnish. – Ampersand 6/2019 (Special Issue, E. Zenner, A. Calude & L. Rosseel (eds.), Lexical borrowing as expression of culture, identity and attitude – empirical investigations into the social meaning potential of loanwords.) DOI: 10.1016/j.amper.2019.100057

Vaattovaara, J. 2012. Spatial concerns for the study of social meaning of linguistic variables – an experimental approach. – Hanna Lehti-Eklund, Camilla Lindholm & Caroline Sandström (eds.), Folkmålsstudier : Meddelanden från Föreningen för Nordisk Filologi 2012/50, pp. 175–209. https://journal.fi/folkmalsstudier/article/view/82136

Nuolijärvi, Pirkko & Vaattovaara, Johanna 2011. De-standardisation in progress in Finnish society? – T. Kristiansen & N. Coupland (eds.), Standard Languages and Language Standards in a Changing Europe, pp. 67–74. Oslo: Novus Forlag. http://omp.novus.no/index.php/novus/catalog/view/3/5/163

Vaattovaara, Johanna 2009. Meän tapa puhua: Tornionlaakso pellolaisnuorten subjektiivisena paikkana ja murrealueena. Helsinki: Suomalaisen Kirjallisuuden Seura (304 pp.). Suomalaisen Kirjallisuuden Seuran toimituksia 1224. http://urn.fi/URN:ISBN:978-952-222-100-1

More information

Suomi 24 resource group in Kielipankki

Societal awareness of linguistic variation and change project (2023–25)

Institute for the Languages of Finland (Kotus)


Researcher of the Month: Noora Hoffrén

Photo: Essi Ekman

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Noora Hoffrén tells us about her PhD research on constructed action in Finnish Sign Language and Finnish.

Who are you?

I am Noora Hoffrén, a sign language interpreter and a doctoral researcher. I am working on my PhD thesis at the Sign Language Centre (SLC) in the Department of Language and Communication Studies at the University of Jyväskylä.

What is your research topic?

The topic of my dissertation is showing by enacting, i.e. constructed action. When a speaker or signer is immersed in the role of another character and displays the character’s thoughts, speech, emotions or actions, he or she is constructing action. Constructed action is not always obvious or overt. Often, especially in signed languages, constructed action is so closely integrated into the language that it is difficult to discern. In my research, I am studying constructed action in both Finnish Sign Language and Finnish. My dissertation is part of the ongoing ShowTell project at the University of Jyväskylä.

How is your research related to Kielipankki?

As my research data, I will use the Corpus of Finnish Sign Language, part of which is already available for download in Kielipankki (CFINSL). In addition to videos that are recorded from multiple angles, the database contains basic annotations and metadata. The fact that such a corpus exists allows us to study constructed action in the best possible way.

My aim is to collect a video corpus of spoken Finnish, parallel to the Finnish Sign Language material, and to deposit the corpus in Kielipankki. The Finnish video corpus will be collected in pairs from six native speakers of Finnish. The methods that are used to collect the material will be similar to those used to collect the Finnish Sign Language corpus, for example, using multiple cameras during filming sessions and using the same elicitation materials (e.g. ‘The Snowman’ and ‘Frog, Where Are You?’ picture books).

Publications

Hoffrén, Noora 2019. Kuvailevien viittomien ja konstruoidun toiminnan yhteispeli. Master’s thesis. University of Jyväskylä. Available: http://urn.fi/URN:NBN:fi:jyu-201910144419

More information

Corpus of Finnish Sign Language in Kielipankki (CFINSL)

ProGram data. The stories Snowman and Frog, where are you?

Sign Language Centre (SLC)

ShowTell project (2021–25)


Researcher of the Month: Maria Sarhemaa

Photo: K-Art Foto

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Maria Sarhemaa tells us about her research on the appellativization of first names in Finnish. Online discussions are a fruitful source of data for studying informal or colloquial language use.

Who are you?

I am Maria Sarhemaa, a doctoral researcher in Finnish language at the University of Helsinki. Currently, I am working on my thesis on a grant from the Kone Foundation.

What is your research topic?

I am doing research on the appellativization of first names in Finnish, i.e. on words that typically belong to the informal registers of the language and originate from a first name. These include yrjö meaning ‘vomiting’ and jonne meaning a certain kind of teenage boy, but there are also compound words with an appellativized first name as part of the word, such as baarimikko ‘bartender’. In my dissertation research, I am exploring appellativization as a linguistic phenomenon in Finnish, and in the sub-publications I will examine compound words with an appellativized part, the expressions uuno, tauno and urpo meaning ‘stupid’, and the construction jonnet ei muista ‘teenagers cannot remember’.

How is your research related to Kielipankki?

I collected data from the Suomi24 corpus in Kielipankki for my article on uuno, tauno and urpo. The Suomi24 corpus is a fruitful source of data for my research topic, as appellativized expressions are used extensively, particularly in informal language, and the language used in Suomi24 is often colloquial. I have also collected data from the same corpus for my forthcoming article on the jonnet ei muista construction and for a study on the jonne appellative that I am conducting with Lasse Hämäläinen, PhD.

Publications

Hämäläinen, Lasse & Sarhemaa, Maria 2022: Jonnen jäljillä: Appellatiivisen jonnen alkuvaiheet verkkokeskusteluaineistojen valossa. Sananjalka 64, 255–269. https://doi.org/10.30673/sja.114194

Sarhemaa, Maria 2021: Tavan tauno uunoilee urpokaupungissa: Nimien Uuno, Tauno ja Urpo appellatiivistuminen ja appellatiivien käyttö Suomi24-keskustelupalstalla. Sananjalka 63, 103–129. https://doi.org/10.30673/sja.107278

More information

The Suomi24 Corpus

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Therese Lindström Tiedemann

Photo: Tove Tiedemann

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Therese Lindström Tiedemann tells us about her research on Swedish as a second language. There is a definite need to continue developing Finland-Swedish corpora to ensure that Finland-Swedish is also included in future studies of the Swedish language.

Who are you?

My name is Therese Lindström Tiedemann and I am a university lecturer in Swedish language at the University of Helsinki. In addition to the Swedish language, I also work on general linguistics. I wrote my PhD thesis on the history of grammaticalisation as a concept in linguistics, i.e. within the history of linguistics.

What is your research topic?

In recent years, most of my research has been on Swedish as a second language. In my research I often use corpus linguistic methods. Together with colleagues, I have also tried to use crowdsourcing. I also do research on other topics such as grammaticalisation, the history of linguistics, the teaching of grammar and metalinguistic knowledge.

How is your research related to Kielipankki?

I have used Kielipankki’s resources mainly in connection with my research on Swedish as a second language and in the context of teaching. For instance, I have used the Swedish subcorpus of the Topling corpus. Currently, I am managing our faculty’s part of the Digisvenska project, where we are creating a text corpus from the Digital Matriculation Examination in B1-Swedish in Finland. B1-Swedish is Swedish as a second language, learnt from year 6 (or year 7 in the old curriculum). We aim to study how the exam corresponds to the curriculum, and the fairness and transparency of the test results. Among other things, we will study how lexical breadth, in the form of lexical variation (cf. vocabulary size), relates to scores and marks in the exams. We will also look at verb conjugation, adverbial clause modifiers and linguistic accuracy, measured as closeness to the norm.
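
As a rough illustration of what a measure of lexical variation can look like, the sketch below computes a simple type-token ratio for two invented Swedish sentences. This is an assumption made for illustration only: the interview does not say which measure the Digisvenska project uses, and the function name and example texts are hypothetical.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return the share of distinct word forms (types) among all word tokens."""
    # Lowercase and split on non-word characters; \w also matches å, ä and ö.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented example texts (not taken from any exam corpus).
essay_a = "jag har en katt och jag har en hund"
essay_b = "jag har en katt men min syster äger två hundar"

print(type_token_ratio(essay_a))  # 6 types / 9 tokens
print(type_token_ratio(essay_b))  # 10 types / 10 tokens
```

A plain type-token ratio is sensitive to text length, so a real study would probably use a length-corrected measure and lemmatised tokens rather than raw word forms.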

A few years ago, I tried to study the Swedish word nog (lit. ‘enough’) using the Sinebrychoff corpus together with Jan Lindström. However, in the end the work needed to be done primarily with a more comprehensive text version of the corpus and not with the version available in Korp.

Swedish-language resources in Finland need developing

I also have a more general interest in the Swedish-language resources available in Kielipankki because of my research on Swedish, because I teach students in Scandinavian languages, and because I often use corpus-based methods. This is why it is important for me to know which corpora I can recommend to students and how they can be used. There is definitely a need to continue developing Finland-Swedish corpora to ensure that we can describe Finland-Swedish (Sw. ”finlandssvenska”) in a similar way to how we can describe Swedish as spoken in Sweden (Sw. ”sverigesvenska”), and that Finland-Swedish is also included in future studies of the Swedish language. In the Finnish context, we can also see that some corpora contain both Finnish and Swedish. There is a need to consider the best way to study how and when Swedish is used in these corpora, and whether this is representative of how Swedish is used in these contexts in Finland. This applies, for example, to the corpus of parliamentary plenary sessions (Eduskunnan täysistunnot), where Swedish words are currently only tagged as foreign words. This limits the research that can be done on this part of the data. At the same time, we can clearly see that Swedish words dominate the list of words tagged as foreign words in the plenary sessions. It would be interesting to see whether the Swedish parts could be annotated as Swedish, which would make it easier to study them from a Swedish perspective.

Besides the Swedish-language resources, I also have an interest in interoperability between different corpora and resources, transparency of research data and comparability between different sources for the Swedish language. Many of the Swedish-language corpora are available via Språkbanken Text (Sweden), and we need to be able to compare the corpora at Kielipankki with these. I therefore see a need for information about how comparable these corpora are, and whether corpora in Kielipankki have been annotated in the same way. This is important to ensure that Finland-Swedish and other Swedish corpora located in Finland can be compared with Swedish corpora located in Sweden. This could give Finland-Swedish and second language Swedish (L2 Swedish) with Finnish as the first language (L1) a clear and fair place in research on Swedish and L2 Swedish in general.

As part of my work on corpora, my colleagues and I have also checked how well the automatic annotation works, especially on material produced by L2 speakers. We have checked the annotation of coursebook texts (written by L1 speakers but aimed at, or selected for, L2 learners), texts written by L2 learners, and texts written by L2 speakers that have been ”normalised” (e.g. with standardised spelling) to facilitate annotation, queries and comparisons. The results showed that texts written by learners are often annotated less well, but not always. Lemmatisation, word class tagging and sense disambiguation were good enough to be used in studies of L2 Swedish, even though sense disambiguation was more problematic than the first two. There were bigger problems with dependency analysis (cf. clause analysis, parsing), and multiword expressions also proved to be problematic, especially in learner writing. Still, multiword annotation was good enough for us to conclude that we can use it in our work. The user should, however, know that something may have been missed and that the multiword annotation is based on the expressions included in the Saldo lexicon and on how they are listed there. The results also showed that there was sometimes disagreement about whether a preposition should be seen as part of the expression or not.

I am very happy to see that more Swedish corpora have been added to Kielipankki in the last few years. I hope that even more Swedish corpora will be added to Kielipankki in the future and that they will be annotated in the same way as the Swedish corpora in Språkbanken Text (Sweden). I also hope that information about the data will be made accessible in such a way that students and researchers can easily find comparable material and know how representative the material is of a certain type of language (e.g. a dialect, newspaper writing).

Recently finished projects and some future steps

In the coming years I will be working on a project on pseudonymisation of linguistic data (Mormor Karl är 27 år). Pseudonymisation means that information such as names of people and places is replaced with pseudonyms in the data when it might reveal who wrote the text. In this project we will study how pseudonymisation affects research data in the humanities. This is an important step towards open, reusable data, which is needed for reproducibility and for replication studies on data already collected, while at the same time protecting people’s identity.
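
To make the idea concrete, here is a minimal sketch of name replacement in Python. It is not the project's method: the function name, the replacement table and the example sentence are all hypothetical, and a real pipeline would first have to detect the names automatically and keep the pseudonyms consistent with the surrounding text.

```python
import re


def pseudonymise(text: str, replacements: dict[str, str]) -> str:
    """Replace each listed name with its pseudonym, matching whole words only."""
    if not replacements:
        return text
    # Try longer names first so that a short name never pre-empts a longer one.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda match: replacements[match.group(0)], text)


# Invented example; the names and pseudonyms are illustrative assumptions.
original = "Karl bor i Lund. Karl är 27 år."
print(pseudonymise(original, {"Karl": "Erik", "Lund": "Umeå"}))
```

Because the same name is always mapped to the same pseudonym, references to one person stay linked across the text, which matters for many linguistic analyses.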

In connection with the project L2 profiles – Development of lexical and grammatical competences in immigrant Swedish, which I have just finished together with Elena Volodina (University of Gothenburg), we have released CoDeRooMor, a dataset with manual morphological annotation of lexemes that occur in materials aimed at learners of Swedish as a second language or produced by speakers of Swedish as a second language. This resource has now been updated and will be released as part of the resource Swedish L2 profiles during 2023. Swedish L2 profiles is a resource where you can search for e.g. a word, a tense, a morpheme or a word formation pattern to see how it is used at different proficiency levels (according to CEFR, the Common European Framework of Reference for Languages, Council of Europe), both in course books for Swedish as a second language and in learner essays from different CEFR levels. The resources which we have created are part of Språkbanken Text (Sweden), and are or will be openly accessible.

I have also been involved in the development of an annotation tool in relation to research on Swedish (Legato) and in the use of the CALL platform Lärka for the teaching of syntactic functions, word classes and semantic roles. The CALL platform Lärka is something I have used in teaching grammar, which meant that I could give feedback to the developers from that perspective. Together with Volodina I have also used the platform to collect anonymous data to study what students often get right or wrong when they practise these categories, useful in connection to research on metalinguistic knowledge and the ability to analyse Swedish grammatically.

Apart from research related to Kielipankki’s resources and areas of interest, I am also the current project manager of Finland Swedish Online (FSO), an online course in Finland Swedish created at the University of Helsinki and based on an Icelandic model (Icelandic Online). FSO is currently part of SAFMORIL, one of the K-Centres within CLARIN. One of my aims has been that FSO would not only support language learning but also make it possible to study language acquisition, by tracing the development of learners in FSO if they grant access to that information. (Icelandic Online has done research on this based on their data.)

References

Alfter, D., Borin, L., Pilán, I., Lindström Tiedemann, T. & Volodina, E. 2019a. Lärka: From Language learning platform to infrastructure for research and language learning. In: Selected papers from the CLARIN Annual Conference 2018. Linköping: Linköping university press. 14pp. http://www.ep.liu.se/ecp/159/001/ecp18159001.pdf

Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2019b. LEGATO: A flexible lexicographic annotation tool. In: Hartmann, M. & Plank, B. (eds.), The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa): Proceedings of the conference. Linköping: Linköping University Electronic Press. pp. 382–388. http://hdl.handle.net/10138/306297

Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2021. Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts vs Non-Experts. Northern European Journal of Language Technology, 7 (1): 35pp. https://doi.org/10.3384/nejlt.2000-1533.2021.3128

Arnbjörnsdóttir, B., Friðriksdóttir, K., & Bédi, B. 2020. Icelandic Online: twenty years of development, evaluation, and expansion of an LMOOC. CALL for widening participation: short papers from EUROCALL 2020, 13.

Borin, L., Forsberg, M. & Lönngren, L. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4): 1191–1211. https://doi.org/10.1007/s10579-013-9233-4

Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching and assessment. https://rm.coe.int/1680459f97

Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion Volume with new descriptors. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989

Council of Europe. 2020. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion volume. https://rm.coe.int/common-european-framework-of-reference-for-languages-learning-teaching/16809ea0d4

Friðriksdóttir, K. 2021. The effect of tutor-specific and other motivational factors on student retention on Icelandic Online. Computer Assisted Language Learning, 34(5-6), 663-684.

Lenardič, J., Lindström Tiedemann, T. & Fišer, D. 2018. Overview of L2 corpora and resources. CLARIN report. CLARIN ERIC. https://office.clarin.eu/v/CE-2018-1202-L2-corpora-report.pdf

Lindström, J. & Lindström Tiedemann, T. 2020. ”Ni minnes nog hvilka jag menar”: Subjektiva och intersubjektiva aspekter av modaladverbet nog. In: Lehti-Eklund, H. & Silén, B. (eds.), Handel med konst. Språk och dialog i Paul Sinebrychoffs brevsamling från sekelskiftet 1900. Helsinki: Svenska litteratursällskapet. pp. 293–323. http://hdl.handle.net/10138/315043

Lindström, J. & Lindström Tiedemann, T. 2018. Subjektivt och intersubjektivt nog: Om grammatikalisering och bruk i ljuset av Paul Sinebrychoffs brevväxling kring 1900. In: Lönnroth, H, Haagensen, B., Kvist, M. & Sandvad West, K. (eds.) Studier i svensk språkhistoria 14. Vaasa: University of Vaasa. pp. 180–197. http://hdl.handle.net/10138/243079

Lindström [Tiedemann], T. 2004. The History of the Concept of Grammaticalisation. Unpublished PhD thesis, University of Sheffield. https://etheses.whiterose.ac.uk/1437/

Lindström Tiedemann, T., Alfter, D. & Volodina, E. 2022. CEFR-nivåer och svenska flerordsuttryck. In: Björklund, S., Haagensen, B., Nordman, M. & Westerlund, A. (eds.), Svenskan i Finland 19. Vasa: Svensk-österbottniska samfundet. pp. 218–233. https://urn.fi/URN:ISBN:978-952-69650-5-5

Lindström Tiedemann, T., Lenardič, J. & Fišer, D. 2018. L2 learner corpus survey: towards improved verifiability, reproducability and inspiration in learner corpus research. CLARIN annual conference, Pisa. https://office.clarin.eu/v/CE-2018-1292-CLARIN2018_ConferenceProceedings.pdf

Lindström Tiedemann, T., Volodina, E. & Jansson, H. 2016. Lärka – ett verktyg för träning av språkterminologi och grammatik. LexicoNordica, 23: 161–181. https://tidsskrift.dk/lexn/article/view/111823

Prentice, J., Håkansson, C, Lindström Tiedemann, T., Pilán, I. & Volodina, E. 2021. Language learning and teaching with Swedish FrameNet++: two examples. In: Dannélls, D., Borin, L. & Friberg Heppin, K. (eds.), The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications. Amsterdam: Benjamins. pp. 303–329. https://doi.org/10.1075/nlp.14.12pre

Stemle, E. W., Boyd, A., Jansen, M., Lindström Tiedemann, T., Mikelić Preradović, N., Rosen, A., Rosén, D. & Volodina, E. 2019. Working together towards an ideal infrastructure for language learner corpora. In: Abel, A., Glaznieks, A., Lyding, V. & Nicolas, L. (eds.) Widening the Scope of Learner Corpus Research: Selected papers from the fourth learner corpus research conference. Louvain-la-Neuve: Presses universitaires de Louvain. http://hdl.handle.net/10138/311309

Volodina, E., Alfter, D., Lindström Tiedemann, T., Lauriala, M.S. & Piipponen, D. H. 2022. Reliability of Automatic Linguistic Annotation: Native vs Non-native Texts. In: Monachini, M. & Eskevich, M. (eds.), Selected papers from the CLARIN Annual Conference 2021. Linköping: Linköping University Electronic Press. pp. 151–167. https://doi.org/10.3384/ecp18914

Volodina, E., Mohammed, Y. A. & Lindström Tiedemann, T. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. Proceedings of the 23rd Nordic conference on computational linguistics (NoDaLiDa). Linköping. pp. 178–189. http://hdl.handle.net/10138/339476

Volodina, E. & Lindström Tiedemann, T. 2014. Evaluating students’ metalinguistic knowledge with Lärka. Swedish Language Technology Conference, Uppsala. http://hdl.handle.net/10138/347397

Finland-Swedish language resources

The Swedish Subcorpus of Topling – Paths in Second Language Acquisition

The Swedish Sub-corpus of the Letters of Paul Sinebrychoff, Kielipankki Version

The University of Helsinki’s Swedish E-thesis, Korp Version

Finland Swedish Online – språkkurs

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Marja-Liisa Helasvuo

Photo: Lyyra Virtanen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Marja-Liisa Helasvuo tells us about the digital language resources that have been compiled at the University of Turku. The collaboration with others in the same field has now evolved into a full-scale infrastructure of language data and resources.

Who are you?

I am Marja-Liisa Helasvuo, professor of Finnish language at the University of Turku. I studied Finnish language and general linguistics at the University of Helsinki, and I did my PhD in linguistics at the University of California, Santa Barbara. I have always been particularly interested in spoken language, and in my doctoral thesis I examined spoken Finnish from a crosslinguistic perspective.

What is your research topic?

My research has focused on grammar and human interaction. I have investigated a wide variety of data: everyday conversations between adults or between adults and children, online conversations, and other computer-mediated interactions. I have also studied written texts, from the oldest Finnish texts to more recent ones. I have explored a wide range of grammatical topics with the help of these resources.

I work at the Department of Finnish and Finno-Ugric Languages at the University of Turku. We have produced several digital corpora, starting from The Finnish Dialect Corpus of the Syntax Archive, whose compilation began in 1967. It is the first Finnish language corpus that has been directly compiled into a machine-readable format.

Since the Dialect Corpus, several others have followed: the Agricola Corpus, which contains all the works of Mikael Agricola from the 16th century, the Advanced Finnish Learners’ Corpus (LAS2) and the Corpus of Academic Finnish (LAS1). These are all grammatically coded and they are available in Kielipankki – the Language Bank of Finland (LAS1 will be available soon). In addition, we have produced several resources for Finno-Ugric languages. These materials have been collected in the Archive of Finnish and Finno-Ugric Languages. As we have produced many language resources in our organization, we also have many researchers who are interested in conducting corpus-based research. It’s always easy to ask a colleague for assistance when figuring out which corpus to use to study a particular topic.

Recently, we have been increasingly collaborating with the TurkuNLP research group. We established the UTU-Digilang infrastructure, which includes not only the Archive of Finnish and Finno-Ugric Languages, but also the Digilang portal, the Digilang longterm storage, and the TurkuNLP research group with its language resources and data tools. This collaboration has been very rewarding and I have learned a lot from it. I would like to see more collaboration of this kind in the future as well.

How is your research related to Kielipankki?

I have used language corpora in almost all my research. Many of these resources are available in Kielipankki.

I have been working on the ArkiSyn Corpus, which is available in Kielipankki. We received funding for the project from the Kone Foundation, which helped us to build a morphosyntactically annotated corpus. You can easily search it for all occurrences of a given word (e.g. all forms of the verb ajatella, ’think’) or all occurrences of a given grammatical form (e.g. all forms of the past tense).
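
The two kinds of search mentioned above, by word and by grammatical form, can be sketched on a toy annotated sample. The `Token` structure, its field names and the sample tokens are assumptions made for illustration; they do not reproduce ArkiSyn's actual annotation scheme or the query syntax of Kielipankki's search tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # word form as transcribed
    lemma: str  # dictionary form
    tense: str  # "past", "present", or "" for non-verbs


# Invented toy sample, not taken from the corpus.
tokens = [
    Token("mä", "minä", ""),
    Token("ajattelin", "ajatella", "past"),
    Token("että", "että", ""),
    Token("se", "se", ""),
    Token("tuli", "tulla", "past"),
    Token("ajattelen", "ajatella", "present"),
]


def forms_by_lemma(sample: list[Token], lemma: str) -> list[str]:
    """All occurrences of a given word, whatever its inflected form."""
    return [t.form for t in sample if t.lemma == lemma]


def forms_by_tense(sample: list[Token], tense: str) -> list[str]:
    """All occurrences of a given grammatical form, whatever the word."""
    return [t.form for t in sample if t.tense == tense]


print(forms_by_lemma(tokens, "ajatella"))  # every form of the verb 'think'
print(forms_by_tense(tokens, "past"))      # every past-tense verb form
```

The point of morphosyntactic annotation is that both questions become simple lookups on the annotation layer rather than pattern matching on raw text.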

Recently, my research has focused in particular on different kinds of fixed expressions, which occur frequently and mostly in the same form. For example, ajatella ‘think’ is a very common verb in everyday Finnish conversation. It almost always occurs in the 1st person singular and in the past tense (ajattelin ‘I thought’). When we compared the results of the corpus search with the corresponding passages in the audio recordings, we found that although the expressions were transcribed as ‘I thought’, they were in fact phonetically quite eroded. In most cases, the expression occurred in the form maattet. The first person singular pronoun minä ‘I’ was reduced to the m sound at the beginning, the first and second syllables of the verb (ajat) were fused together (aat), and a reduced form of the word että ‘that’ was attached to the end. This type of phonetic reduction and crystallization of usage into a particular form is very common in fixed expressions.

In addition to ArkiSyn, I have also used the Suomi24 Corpus, the Agricola Corpus, The Finnish Dialect Corpus of the Syntax Archive and newspaper materials. The different corpora allow for different research topics.

Publications

Laury, Ritva, Marja-Liisa Helasvuo & Janica Rauma 2020. “When an expression becomes fixed: mä ajattelin että ‘I thought that’ in spoken Finnish”. – Ritva Laury & Tsuyoshi Ono (eds.), Fixed Expressions: Building language structure and social action, pp. 133–166. Pragmatics & Beyond New Series 315. Amsterdam: John Benjamins. DOI: http://dx.doi.org/10.1075/pbns.315.06lau

Helasvuo, Marja-Liisa 2019. “Free NPs as units”. Special issue “On the Notion of Unit in the Study of Human Languages”, guest editors Tsuyoshi Ono, Ritva Laury & Ryoko Suzuki. Studies in Language 43:2:301–328. DOI: http://dx.doi.org/10.1075/sl.16064.hel

Laury, Ritva & Marja-Liisa Helasvuo 2016. “Disclaiming epistemic access with ‘know’ and ‘remember’ in Finnish”. Special Issue on “Grammar and negative epistemics in talk-in-interaction”, guest editors Jan Lindström, Yael Maschler and Simona Pekarek Doehler. Journal of Pragmatics 106 (2016): 80–96. DOI: http://dx.doi.org/10.1016/j.pragma.2016.07.005

Helasvuo, Marja-Liisa & Aki-Juhani Kyröläinen 2016. “Choosing between zero and pronominal subject: Modeling subject expression in the 1st person singular in Finnish conversation”. Corpus Linguistics and Linguistic Theory 12(2):263–299. DOI: http://dx.doi.org/10.1515/cllt-2015-0066

More information

The Finnish Dialect Corpus of the Syntax Archive

The Morpho-Syntactic Database of Mikael Agricola’s Works

The Advanced Finnish Learners’ Corpus (LAS2)

The Corpus of Academic Finnish (LAS 1)

Archive of Finnish and Finno-Ugric Languages

ArkiSyn Database of Finnish Conversational Discourse

Suomi24 resource group

UTU-Digilang – Digital language resources and language technology tools

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.
