Researcher of the Month: Tanja Säily

Tanja Säily
Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tanja Säily tells us about her research on the English language, which combines corpus linguistics, digital humanities and historical sociolinguistics.

Who are you?

I am Tanja Säily, Assistant Professor in English Language at the University of Helsinki.

What is your research topic?

I study variation and change in the English language from a sociolinguistic perspective. My research combines corpus linguistics, digital humanities and historical sociolinguistics. I frequently collaborate with other linguists and historians, and I develop new methods with data scientists and language technologists. I analyse sociolinguistic variation especially in linguistic productivity, such as the use of neologisms. I have also studied gendered styles and factors influencing the rate of language change.
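
As a rough illustration of what measuring productivity can involve, the sketch below counts tokens, types and hapax legomena (words occurring only once) for a suffix, and computes a hapax-based productivity index, one common measure in corpus linguistics. It is a minimal sketch and not necessarily the method used in the research described here: the function name, the toy sample and the naive suffix matching (which would also catch a word like "witness") are all assumptions made for illustration.

```python
from collections import Counter


def productivity_measures(tokens, suffix):
    """Count tokens, types and hapax legomena for words ending in `suffix`.

    Hypothetical helper: matching is a naive string test, so real work
    would need lemmatisation and manual pruning of false hits.
    """
    counts = Counter(t.lower() for t in tokens if t.lower().endswith(suffix))
    n_tokens = sum(counts.values())
    n_types = len(counts)
    n_hapaxes = sum(1 for c in counts.values() if c == 1)
    # Hapax-based productivity: share of the suffix's tokens that are hapaxes.
    p = n_hapaxes / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "hapaxes": n_hapaxes, "P": p}


# Toy sample, invented for illustration only.
sample = "her kindness and generosity met his darkness with kindness and civility".split()
ness = productivity_measures(sample, "ness")
ity = productivity_measures(sample, "ity")
print(ness)
print(ity)
```

Because all of these counts depend on corpus size, comparing subcorpora of different sizes (for example, letters by women and by men) requires extra care that this sketch does not attempt.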

How is your research related to Kielipankki?

In my research, I use English text corpora, which I have also deposited in Kielipankki for myself and others to use. I am currently studying the productivity of various linguistic constructions in the Corpus of Historical American English (e.g. Säily & Vartiainen, forthcoming). I have been using this corpus with the Korp tool and have also downloaded it to my own computer.

I have prepared openly available teaching materials on the methods of historical corpus linguistics for graduate students and other interested parties. They are included in the Method Bank for Linguistics, and the Early Modern English section of the Helsinki Corpus of English Texts used in the exercises can be found in Kielipankki.

Publications

Here are a few of my most recent publications; the entire list can be found at https://tanjasaily.fi/publications/

Accepted. Säily, Tanja, Martin Hilpert & Jukka Suomela. New approaches to investigating change in derivational productivity: Gender and internal factors in the development of -ity and -ness, 1600–1800. Patricia Ronan, Theresa Neumaier, Lisa Westermayer, Andreas Weilinghoff & Sarah Buschfeld (eds.), Crossing boundaries through corpora: Innovative approaches to corpus linguistics (Studies in Corpus Linguistics). Amsterdam: John Benjamins.

Accepted. Säily, Tanja & Turo Vartiainen. Historical linguistics. Michaela Mahlberg & Gavin Brookes (eds.), Bloomsbury handbook of corpus linguistics. London: Bloomsbury.

Accepted. Säily, Tanja, Turo Vartiainen, Harri Siirtola & Terttu Nevalainen. Changing styles of letter-writing? Evidence from 400 years of early English letters in a POS-tagged corpus. Luisella Caon, Moragh Gordon & Thijs Porck (eds.), Unlocking the history of English: Pragmatics, prescriptivism and text types (Current Issues in Linguistic Theory). Amsterdam: John Benjamins.

2023. Landert, Daniela, Tanja Säily & Mika Hämäläinen. TV series as disseminators of emerging vocabulary: Non-codified expressions in the TV Corpus. ICAME Journal 47(1): 63–79. DOI: 10.2478/icame-2023-0004

2022. Rodríguez-Puente, Paula, Tanja Säily & Jukka Suomela. New methods for analysing diachronic suffix competition across registers: How -ity gained ground on -ness in Early Modern English. International Journal of Corpus Linguistics 27(4): 506–528. Special issue, Corpus studies of language through time, ed. by Tony McEnery, Gavin Brookes & Isobelle Clarke. DOI: 10.1075/ijcl.22014.rod

2021. Säily, Tanja, Eetu Mäkelä & Mika Hämäläinen. From plenipotentiary to puddingless: Users and uses of new words in early English letters. Mika Hämäläinen, Niko Partanen & Khalid Alnajjar (eds.), Multilingual Facilitation, 153–169. Helsinki: University of Helsinki. DOI: 10.31885/9789515150257.15

2020. Mäkelä, Eetu, Krista Lagus, Leo Lahti, Tanja Säily, Mikko Tolonen, Mika Hämäläinen, Samuli Kaislaniemi & Terttu Nevalainen. Wrangling with non-standard data. Sanita Reinsone, Inguna Skadiņa, Anda Baklāne & Jānis Daugavietis (eds.), Proceedings of the Digital Humanities in the Nordic Countries 5th Conference, Riga, Latvia, October 21–23, 2020 (CEUR Workshop Proceedings 2612), 81–96. Aachen: CEUR-WS.org. DHN 2020 Best Paper Award. http://ceur-ws.org/Vol-2612/paper6.pdf

2020. Nevalainen, Terttu, Tanja Säily, Turo Vartiainen, Aatu Liimatta & Jefrey Lijffijt. History of English as punctuated equilibria? A meta-analysis of the rate of linguistic change in Middle English. Journal of Historical Sociolinguistics 6(2): article 20190008. Special issue, Comparative Sociolinguistic Perspectives on the Rate of Linguistic Change, ed. by Terttu Nevalainen, Tanja Säily & Turo Vartiainen. DOI: 10.1515/jhsl-2019-0008

2019. Hill, Mark J., Ville Vaara, Tanja Säily, Leo Lahti & Mikko Tolonen. Reconstructing intellectual networks: From the ESTC’s bibliographic metadata to historical material. Costanza Navarretta, Manex Agirrezabal & Bente Maegaard (eds.), Proceedings of the Digital Humanities in the Nordic Countries 4th Conference, Copenhagen, Denmark, March 6–8, 2019 (CEUR Workshop Proceedings 2364), 201–219. Aachen: CEUR-WS.org. DHN 2019 Best Paper Award. http://ceur-ws.org/Vol-2364/19_paper.pdf

2018. Säily, Tanja. Change or variation? Productivity of the suffixes -ness and -ity. Terttu Nevalainen, Minna Palander-Collin & Tanja Säily (eds.), Patterns of Change in 18th-century English: A Sociolinguistic Approach (Advances in Historical Sociolinguistics 8), 197–218. Amsterdam: John Benjamins. DOI: 10.1075/ahs.8

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Liisa Mustanoja

Liisa Mustanoja
Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Liisa Mustanoja tells us about her research on sociolinguistics. With the help of a longitudinal corpus, it is possible to observe changes in the spoken language of the same people at different points in time.

Who are you?

I am Liisa Mustanoja, PhD, from Tampere. I work as a University Lecturer of Finnish Language in the Unit of Languages at the Faculty of Information Technology and Communication, University of Tampere. From January 2024, I will be the Head of the Unit of Languages for the next five years. I am also an Associate Professor of Finnish at the University of Oulu, specialising in sociolinguistics.

What is your research topic?

So far, all my research fits under the large umbrella of sociolinguistics. I am interested in the relationship between language and society, especially in all forms of change, upheaval and movement. In my doctoral research, I examined change in the spoken language of Tampere at the level of the idiolect. This was a so-called real-time panel study, in which I examined the language of the same people at two points in time. Later, together with my colleagues, I extended the study to the spoken language of Helsinki, and we also included a third time point. The focus has largely been on the phonetic and formal structure of the language, but the data has also allowed for a sociophonetic approach. In one article, for example, we investigated changes in pitch over time.

In addition to the path of variation studies, I am interested in the interface between spoken and written language, and this has provided me with another research direction, namely the study of letter writing. I have investigated – both on my own as well as together with Finnish language students – the correspondence during the Second World War. As there was no other means of communication during the war, everyone took up their pen, regardless of age, profession or educational background. Although this correspondence resource is old, it has provided essential insights into the importance of human contact in times of crisis, as well as into everyday life and humanity in the midst of world turmoil.

How is your research related to Kielipankki?

For some time now, Kielipankki has made accessible the Longitudinal Corpus of Finnish Spoken in Helsinki, which has provided me and my colleagues with an important source of data for studying language change. This corpus will hopefully be joined in the coming months by a little sister, the Longitudinal data of Tampere spoken language. Previously, recordings of the spoken language of Tampere had been made in the 1970s and 1990s. In 2019, I started a third round of data collection in Tampere, which has been continued by students up to the present day. Thanks to the funding I received from FIN-CLARIN, I have also been able to hire some temporary help to work on the material. Everything is now in place, except for the final paperwork. The transfer and archiving of personal speech data has its own complications, but Kielipankki is by far the best possible repository for this valuable longitudinal data. On the eve of handing over the material, it feels like there should be more material and it should be more complete, and the transcripts should be revised countless more times. But really, every little addition to Kielipankki is a great gift to the research community. And once even a part of the resource is opened up, someone else also has the opportunity to join the transcription work if they want to!

Among the resources in Kielipankki, I would also like to mention the Suomi24 Corpus, which is well suited to student work. Nowadays, when data protection matters are demanding, it is a relief to be able to direct students to these ready-made resources. For me, too, there are still many new things to discover in Kielipankki. My interest in wartime letters, for example, has recently led me to Kalle Päätalo’s Iijoki series, and I have been quite surprised by the research potential of this cornucopia.

Publications

Mustanoja Liisa, O’Dell Michael & Lappalainen Hanna, 2022: Helsinkiläis- ja tamperelaispuhujien äänenkorkeuden muutokset 1970-luvulta 2010-luvulle. Puhe ja kieli. https://doi.org/10.23997/pk.121404

Kuparinen Olli, Santaharju Jenni, Leino Unni, Mustanoja Liisa & Peltonen Jaakko, 2022: Katomuotojen eteneminen hd-yhtymässä Helsingin puhekielessä. Virittäjä 126, pp. 316–338. https://doi.org/10.23982/vir.100585

Kuparinen Olli, Peltonen Jaakko, Mustanoja Liisa, Leino Unni & Santaharju Jenni, 2021: Lects in Helsinki Finnish – a probabilistic component modeling approach. Language Variation and Change. https://doi.org/10.1017/S0954394521000041

Lappalainen Hanna, Mustanoja Liisa & O’Dell Michael, 2019: Miten ja milloin yksilön kieli muuttuu? Helsinkiläisidiolektien muutos ja muutoksen tutkimuksen menetelmät. Virittäjä 123, pp. 550–581. https://doi.org/10.23982/vir.67808

Kuparinen Olli, Mustanoja Liisa, Peltonen Jaakko, Santaharju Jenni & Leino Unni, 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. Sananjalka 61, pp. 30–56. https://doi.org/10.30673/sja.80056

Mustanoja Liisa, 2018: Sydämellisiä kirjeitä talvisodasta. Hämäläisten sotilaiden kiitoskirjeet aikansa kielen ja kirjeenvaihtokulttuurin heijastajina. Sisko Brunni, Niina Kunnas, Santeri Palviainen & Jari Sivonen (eds.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia humaniora ouluensia 16. Oulu, pp. 251–285. https://urn.fi/URN:ISBN:9789526221120

Mustanoja Liisa (ed.), 2017: Arjen sirpaleita ja suuria tunteita: Kirjeet sodan sanoittajina ja ihmissuhteiden ylläpitäjinä 1939–1944. Tampere Studies in Language, Translation and Literature B5. Tampereen yliopisto. https://urn.fi/URN:ISBN:978-952-03-0527-7

Mustanoja Liisa, 2011: Idiolekti ja sen muuttuminen: reaaliaikatutkimus Tampereen puhekielestä. Tampere: Tampere University Press. https://urn.fi/urn:isbn:978-951-44-8417-9

Researcher of the Month: Tiina Onikki-Rantajääskö

Tiina Onikki-Rantajääskö
Photo: Veikko Somerpuro

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tiina Onikki-Rantajääskö tells us about the principles of the Helsinki Term Bank for the Arts and Sciences (HTB) and invites interested experts to join the collaborative terminology work.

Who are you?

I am Tiina Onikki-Rantajääskö, Professor of Finnish at the University of Helsinki. I also lead the Helsinki Term Bank for the Arts and Sciences (HTB).

What is your research topic?

I am generally interested in how vocabulary and grammatical structures construe linguistic meaning and how they function in relation to the wider textual context. Most of my published research is related to the local cases of the Finnish language. Currently, I am delighted to see how younger researchers aim to combine qualitative and quantitative research in the project Platforms and Rhetorical Group Strategies (in Finnish, “Alustat ja retoriset ryhmästrategiat”), run by me and Eetu Mäkelä and funded by the Kone Foundation. I am particularly interested in discovering whether some constructions can indicate broader discourse structures. However, during this winter, I am spending most of my time on my duties as the Finnish Language Rapporteur, appointed by the Ministry of Justice.

How is your research related to Kielipankki?

I tend to use the Finnish language resources in Kielipankki whenever I need information about the context of a word or grammatical element. Many of the corpora that I have used in the past can now be found in Kielipankki, such as the HS.fi News and Comments Corpus that was compiled in one of my earlier projects.

In addition, the Helsinki Term Bank for the Arts and Sciences (HTB) is part of the FIN-CLARIAH Research Infrastructure, together with Kielipankki. This is reflected in the fact that the online service of the HTB is also accessible via Kielipankki. The HTB also has an employee funded through the FIN-CLARIAH project (FIRI funding from the Research Council of Finland). There is a need for collaboration in the field of language technologies.

The contents of the Helsinki Term Bank for the Arts and Sciences (HTB) are still in the construction phase. We are constantly working to involve more and more researchers from different disciplines in the terminology work and to invite new disciplines to join the HTB. Defining scientific terms and providing other background information on concepts require expertise in each field. Therefore, the selected method is niche-sourcing of experts, supported by our project planner. The aim is to promote the multilingualism of science in addition to providing openly accessible information describing the formation of scientific knowledge and facilitating the utilization of science. Scientific concepts are at the heart of research. Multilingualism can be promoted by offering translation equivalents for terms in different languages. The Finnish language is in focus, since the aim is to support Finnish as a language of science. However, it is possible to present definitions and concept pages in languages other than Finnish. The term bank thus opens up opportunities for international collaboration. Especially for multilingual and multidisciplinary research groups, the term bank provides an opportunity to shape the common terminological ground. All interested experts are welcome to participate.

My research interests in the Helsinki Term Bank for the Arts and Sciences (HTB) include the presentation of background knowledge frames and the emergence of prototypicality, as well as collaborative interactions: the network of experts in the HTB and the online service interact and form a field of action that differs from traditional research projects.

Publications

Enqvist, Johanna, Tiina Onikki-Rantajääskö & Kaarina Pitkänen-Heikkilä 2021: Terminology work as open, communal and collaborative crowdsourcing practice of academic communities. – Terminology 27:1, pp. 56–79. DOI: 10.1075/term.00058.enq

Jaakola, Minna & Tiina Onikki-Rantajääskö (eds.) 2023: The Finnish Case System: Cognitive Linguistic Perspectives. Helsinki: SKS. DOI: 10.21435/sflin.23

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Vetenskapstermbanken i Finland i samhällets tjänst. – Publikation Nordterm 2023.

Kettunen, Harri & Tiina Onikki-Rantajääskö (forthcoming): Tieteen termipankki tieteentekemisen ytimessä. – Kieliviesti 2/2023.

Onikki-Rantajääskö, Tiina & Harri Kettunen 2023: Vuosi 2022 Tieteen termipankissa: Laajenemista uusille aihealueille ja tunnustuspalkintoja avoimen tieteen edistämisestä. – Tieteen termipankin blogi, February 2023. https://blogs.helsinki.fi/tieteentermipankki/2023/02/16/vuosi-2022-tieteen-termipankissa-laajenemista-uusille-aihealueille-ja-tunnustuspalkintoja-avoimen-tieteen-edistamisesta/

Researcher of the Month: Aleksi Sahala

Aleksi Sahala
Photo: Marianne Ough

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Aleksi Sahala tells us about his research on the development and application of Natural Language Processing (NLP) methods for annotating and analyzing ancient text data.

Who are you?

I am Aleksi Sahala, a postdoctoral researcher in Assyriology and Language Technology. I am currently working at the University of Helsinki in the Academy of Finland-funded project “The Origins of Emesal”, where our goal is to use computational methods to investigate how Emesal, the only known language variety of Sumerian, came to be and evolved over time.

I did my master’s degree in Assyriology and Computational Linguistics, and in 2021 I finished my PhD thesis “Contributions to Computational Assyriology”. In 2022, I was a visiting scholar at the University of California, Berkeley, and in 2024 I will visit the University of Innsbruck in Austria. I have also worked in close co-operation with the Centre of Excellence in Ancient Near Eastern Empires at the University of Helsinki.

What is your research topic?

My research focuses on the development and application of NLP (Natural Language Processing) methods for annotating and analyzing ancient text data. My particular interest lies in the Mesopotamian cuneiform texts written in Sumerian (3200 BCE – 100 CE) and Akkadian (2500 BCE – 100 CE). Analysis of Sumerian and Akkadian texts is challenging not only due to data sparsity and the fragmentary nature of the primary sources, but also due to the complexity of the cuneiform writing system and inflectional morphology. In theory, most words can occur in several thousand different forms, each of which can also be spelled in several different ways.

My focal point has been on the development of a pipeline that is able to linguistically annotate raw transliterations of cuneiform texts so that these texts can be used for data analysis and visualization. This allows for the analysis of thousands of transliterated texts simultaneously and, for example, the visualization and study of how different words, concepts or entities are related to each other on a larger scale. Although Assyriologists have digitized over 20,000 Akkadian and over 100,000 Sumerian texts in various text corpora, these texts have mostly been studied qualitatively by close-reading. By applying a more computational approach, it becomes easier to reveal larger patterns within specific groups of texts.

I have developed a finite-state morphology for Akkadian (BabyFST), as well as a language-independent neural lemmatizer and tagger with special support for cuneiform languages (BabyLemmatizer). In addition, I have built a word-embedding-based tool for analyzing semantic relationships of words in sparse and fragmentary data sets (PMI Embeddings).

My current project focuses on Emesal, a liturgical variant of the Sumerian language, which is only attested in writing after Sumerian was no longer used as a vernacular. Although it is known that Emesal was used in liturgical contexts, such as lamentations, and occasionally to indicate direct speech of goddesses and women, its origins and evolution are still widely debated. No Emesal text was written entirely in this language variant. Instead, the texts were written in Sumerian, with Emesal words used here and there as keywords to indicate that the current line or passage should be read in this dialect. The rules behind this code-switching, if such ever existed, remain largely unknown. We hope that a larger-scale analysis of Emesal texts could reveal patterns that explain exactly what kinds of environments triggered the use of Emesal words, how the use of this language variant was introduced in written documents, and how it evolved over its 2,000-year history.

How is your research related to Kielipankki?

Kielipankki has been co-operating with the Centre of Excellence in Ancient Near Eastern Empires by annotating cuneiform texts and publishing them in the Korp concordance service. My responsibilities have been collecting these data sets, converting them into a Korp-compatible format, and developing tools for annotating them and harmonizing them with the existing resources in such a way that they can be used efficiently together for quantitative analysis.

Recently, we have been working on the harmonization, lemmatization and tagging of Achemenet, a collection of Neo-Babylonian administrative and legal documents.

Publications

Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S., & Lindén, K. (2019). Aššur and his friends: A statistical analysis of Neo-Assyrian texts. Journal of Cuneiform Studies, 71(1), 159–180. http://hdl.handle.net/10138/303986

Alstola, T., Jauhiainen, H., Svärd, S., Sahala, A., & Lindén, K. (2023). Digital Approaches to Analyzing and Translating Emotion: What Is Love?. In The Routledge Handbook of Emotions in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/348398

Bennet, E. & Sahala, A. (2023). Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023. http://hdl.handle.net/10138/565513

Ihalainen, P. & Sahala, A. (2020). Evolving Conceptualisations of Internationalism in the UK Parliament. Digital Histories, 199.

Luukko, M., Sahala, A., Hardwick, S., & Lindén, K. (2020). Akkadian treebank for early Neo-Assyrian royal inscriptions. In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories. The Association for Computational Linguistics. http://hdl.handle.net/10138/322305

Sahala, A. J. A. (2017). Johdatus sumerin kieleen. Suomen itämainen seura.

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). BabyFST: Towards a finite-state based computational model of ancient Babylonian. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3886–3894). http://hdl.handle.net/10138/317691

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). Automated phonological transcription of Akkadian cuneiform text. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association (ELRA). http://hdl.handle.net/10138/317688

Sahala, A. (2021). Contributions to Computational Assyriology. PhD Thesis. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-7416-1

Sahala, A., & Töyräänvuori, J. (2022). Kirjoitustaidon kehittyminen. In Svärd, S. & Töyräänvuori, J. (eds.), Muinaisen Lähi-idän imperiumit. Kadonneiden suurvaltojen kukoistus ja tuho, pp. 49–69. Gaudeamus, Helsinki.

Sahala, A., & Svärd, S. (2022). Language technology approach to “seeing” in Akkadian. In The Routledge Handbook of the Senses in the Ancient Near East. Taylor & Francis. http://hdl.handle.net/10138/339256

Sahala, A., Alstola, T., Valk, J., & Lindén, K. (2023, June). Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction. In Selected papers from the CLARIN Annual Conference 2022 (pp. 111–119). http://hdl.handle.net/10138/563733

Sahala, A. & Lindén, K. (2023). A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Svärd, S., Jauhiainen, H., Sahala, A., & Lindén, K. (2018). Semantic Domains in Akkadian Texts. CyberResearch on the Ancient Near East and Neighboring Regions. Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving, 2, 224–256. http://hdl.handle.net/10138/241805

Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A., & Lindén, K. (2020). Fear in Akkadian texts: New digital perspectives on lexical semantics. In The Expression of Emotions in Ancient Egypt and Mesopotamia (pp. 470–502). Brill. http://hdl.handle.net/10138/328017

Tools

  • BabyLemmatizer, an OpenNMT-based neural lemmatizer and tagger. Pretrained models are available for Ancient Greek, Latin and various cuneiform languages.
  • BabyFST, a finite-state morphology of Akkadian, specifically the Babylonian dialect.
  • PMI-Embeddings, a hyper-parametrized tool for creating PMI+SVD-based word embeddings from sparse or fragmentary data sets.
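
To give a rough idea of what PMI+SVD-based word embeddings involve, the sketch below builds a word co-occurrence matrix, converts it to positive pointwise mutual information (PPMI) and reduces it with a truncated singular value decomposition. It is a minimal sketch using NumPy and does not reproduce the PMI-Embeddings tool or its interface: the function names, the window size, the choice of PPMI and the placeholder English tokens are all assumptions made for illustration.

```python
import numpy as np


def ppmi_svd_embeddings(sentences, window=2, dim=2):
    """Return (vocab, embeddings) from tokenised sentences (hypothetical helper)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # Symmetric co-occurrence counts within +/- `window` tokens.
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1

    # PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ); unseen pairs give -inf or nan.
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row * col))
    # Positive PMI: keep only finite, positive associations.
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD: keep the `dim` strongest components as word vectors.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    dim = min(dim, len(vocab))
    return vocab, u[:, :dim] * s[:dim]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Placeholder tokens standing in for lemmatised text; not real corpus data.
sentences = [
    ["king", "builds", "temple"],
    ["king", "rebuilds", "temple"],
    ["scribe", "writes", "tablet"],
]
vocab, emb = ppmi_svd_embeddings(sentences, window=2, dim=2)
vec = dict(zip(vocab, emb))
print(cosine(vec["builds"], vec["rebuilds"]), cosine(vec["builds"], vec["scribe"]))
```

Count-based approaches like this are often considered when data are sparse, because they have few parameters to estimate; whether that holds for a given corpus is an empirical question.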

Researcher of the Month: Anna Dmitrieva

Anna Dmitrieva
Anna Dmitrieva (standing) with Aleksandra Konovalova (sitting), co-creators of the Parallel Corpus of Finnish and Easy-to-read Finnish. Photo: Anna Dmitrieva

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anna Dmitrieva tells us about her research on text simplification. Computational methods and the compiling of parallel corpora are an integral part of her work.

Who are you?

I am Anna Dmitrieva, a doctoral researcher at HELSLANG, the Doctoral Programme in Language Studies at the University of Helsinki.

What is your research topic?

My main field of interest is text simplification. I have studied computational linguistics since 2012, when I started my studies for the Bachelor’s degree. Since then, I have been involved in many projects related to natural language processing (NLP), but text simplification has been my main focus during my doctoral studies.

Text simplification is a process of making a text “easier”. A simplified text should be more readable and accessible to a broader audience. In NLP, text simplification can be viewed as a monolingual machine translation problem. We train models that are capable of translating or transforming texts, taking a source text in a particular language and producing a “simpler” version of the text in the same language. This task typically requires a lot of parallel data, where there is a corresponding “easy” target text for each source text.

I work with languages that do not have a lot of simplification data, make datasets for them, and train simplification models. During my time as a doctoral researcher, I have made Russian and Finnish text simplification datasets and models. I am also investigating controlled text simplification, the task of manipulating certain linguistic properties in the output of the simplification model.
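
One way to picture controlled simplification is with control tokens: a short tag prepended to each source text that tells the model which property the output should have. The sketch below derives such a tag from the length ratio of a training pair. It is a minimal sketch under stated assumptions: the token name `LENGTH_RATIO`, the bucket size and the example sentences are hypothetical, and the datasets and models described here may use different properties and formats.

```python
def length_ratio_token(source, target, step=0.25):
    """Bucket the target/source character-length ratio into a control token.

    Hypothetical format: the ratio is rounded to the nearest multiple of
    `step`, so similar pairs share a token.
    """
    ratio = len(target) / max(len(source), 1)
    bucket = round(ratio / step) * step
    return f"<LENGTH_RATIO_{bucket:.2f}>"


def add_control_token(source, target):
    """Return a (tagged source, target) training pair."""
    return f"{length_ratio_token(source, target)} {source}", target


# Invented example pair, for illustration only.
tagged_source, target = add_control_token(
    "The committee has decided to postpone the vote.",
    "The vote is later.",
)
print(tagged_source)
```

At training time the token is computed from the reference simplification; at inference time the user chooses the token, for example a smaller ratio to ask for a shorter output.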

How is your research related to Kielipankki?

As a student at a Finnish university, I naturally thought of making a Finnish simplification model. Since there were no parallel simplification corpora for Finnish, I had to make one myself. The most obvious choice for the data source was Yle Easy-to-read Finnish News: the news items exist in the form of text, have been around for a relatively long time, and have equivalents in “regular” Finnish. It was a relief to know that I didn’t have to scrape the news myself using Yle’s API because all the archives are already in Kielipankki.

However, I had to solve the problem of aligning Easy Finnish and Standard Finnish news. I performed automatic alignment, but there was no gold-standard test set of document pairs for testing the quality of the alignments. This is where my friend Aleksandra Konovalova (University of Turku) stepped in and helped me, evaluating 1919 pairs of documents herself. Together, we created the Parallel Corpus of Finnish and Easy-to-read Finnish, which is now available in Kielipankki. Currently, I am adding more document pairs and creating a sentence-aligned version, which will hopefully also be made available via Kielipankki when completed.

Publications

Dmitrieva, A. & Konovalova, A. Creating a parallel Finnish-Easy Finnish dataset from news articles. Jun 2023, Proceedings of the 1st Workshop on Open Community-Driven Machine Translation. Esplá-Gomis, M., Forcada, M., Kuzman, T., Ljubešić, N., van Noord, R., Ramírez-Sánchez, G., Tiedemann, J. & Toral, A. (eds.). Universitat d’Alacant, pp. 21–26. https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27

Dmitrieva, A. Automatic text simplification of Russian texts using control tokens. May 2023, Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023). Piskorski, J., Marcińczuk, M., Nakov, P. et al. (eds.). Stroudsburg: Association for Computational Linguistics (ACL), pp. 70–77. DOI: 10.18653/v1/2023.bsnlp-1.9

Dmitrieva, A. The role of language technology in accessible communication research. Jun 2023, Emerging Fields in Easy Language and Accessible Communication Research. Deilen, S., Hansen-Schirra, S., Hernández Garrido, S., Maaß, C. & Tardel, A. (eds.). Frank & Timme, pp. 319–338. (Easy – Plain – Accessible; vol. 14). https://researchportal.helsinki.fi/fi/publications/the-role-of-language-technology-in-accessible-communication-resea

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.


Researcher of the Month: Sampo Pyysalo

Sampo Pyysalo
Photo: Pasi Leino / University of Turku

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampo Pyysalo tells us about his research on natural language processing. Openly available large language models are necessary for developing tools similar to ChatGPT also for smaller languages, such as Finnish.

Who are you?

I’m Sampo Pyysalo, University Research Fellow at the TurkuNLP group of the University of Turku.

What is your research topic?

My research is on machine learning approaches to natural language processing, with a particular focus on processing Finnish text and analyzing scientific literature in the biomedical domain. A lot of my recent work revolves around training large neural network models, including general “foundation” models such as FinBERT and FinGPT as well as task-specific models such as a named entity recognition model for Finnish. I also work on data, both compiling raw text resources for the unsupervised training of foundation models and running manual annotation efforts to create resources for supervised training, such as the Turku NER and TurkuONE corpora.

Large neural language models are central to a lot of state-of-the-art natural language processing and the basis for tools such as ChatGPT, but most such models focus on English and many of the best models are not publicly available. We believe that openly available Finnish models such as FinBERT and FinGPT are necessary to enable the creation of tools for processing Finnish language with comparable capabilities to tools available for English.

How is your research related to Kielipankki?

Creating large language models from scratch requires billions of words of text, and collections of Finnish of this size are not readily available. To compile sufficiently large corpora for language model training, we have drawn on various sources, including web crawls and resources available through Kielipankki such as the Yle News Archive, the Finnish News Agency Archive (STT) and the Suomi24 Corpus. We also distribute resources created by TurkuNLP through Kielipankki, among other channels.

In the near future, we hope that we will be able to provide access to the full text resources used to create our models for research purposes through Kielipankki to improve the replicability of our work and to make it easier for future efforts to create models for Finnish.

Publications

J. Luoma, L.-H. Chang, F. Ginter & S. Pyysalo. 2021. Fine-grained Named Entity Annotation for Finnish. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 135–144, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden. https://aclanthology.org/2021.nodalida-main.14

A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter & S. Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. In CoRR, abs/1912.07076. https://doi.org/10.48550/arXiv.1912.07076

More information

  • TurkuNLP group of the University of Turku
  • FinBERT, a version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group
  • FinGPT, generative GPT-3-like models for Finnish
  • Finnish NER, a Named Entity Recognition system for Finnish (based on FinBERT and a new NER annotation layer of the UD_Finnish-TDT treebank)

 


Researcher of the Month: Nobufumi Inaba

Nobufumi Inaba
Photo: Krista Teeri

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Nobufumi Inaba tells us about a corpus that he is preparing, which contains a text from the year 1526 and is an interesting source for researchers studying language change.

Who are you?

I am Nobufumi Inaba, Senior Researcher at the Archive of Finnish and Finno-Ugric languages at the University of Turku. The Archive is part of the Department of Finnish and Finno-Ugric Languages and it has only been operating under this name for a couple of years. The Finnish language part of the Archive, for which I am responsible, was formerly known as the Syntax Archive. Many Finnish language researchers are probably familiar with the corpus of the same name. I have been involved in planning and implementing, among other things, technical solutions for the projects in our department and for the corpora produced in our Archive. I have also created tools to be used internally by our corpus teams.

What is your research topic?

I have been interested in studying language change and its causes. In my dissertation, I investigated the roots of the so-called dative genitive in Finnish and my research data consisted mostly of texts from old literary languages. In recent years, I have been studying the phenomenon of leaving out the inflection of words in Finnish. My data consists of chat conversations in a location-based game community and of the speech recordings I collected at the game locations.

Currently, I am investigating old literary language again. I am preparing a corpus of the 1526 Swedish New Testament, one of the source texts used by Mikael Agricola. This New Testament has been seen as a symbol of the beginning of the Modern Swedish period. The forthcoming corpus is intended to support the study of the language of Agricola’s works. The importance of the text is not merely symbolic. In my opinion, this earlier New Testament text is a much more valuable source for those interested in linguistic changes than the whole Bible of 1541 (Gustav Vasas bibel). Unlike the whole Bible, which includes many attempts to regulate and harmonize linguistic elements all the way from vocabulary to syntax, it does not seem to contain regulated language. Moreover, the 1526 New Testament contains a striking number of elements from spoken language, which the 1541 Bible largely attempted to eliminate. The preliminary coding of the text in order to facilitate annotation is now complete and I expect to start the annotation work in the autumn of 2023.

How is your research related to Kielipankki?

We have had a good division of labour with Kielipankki ever since the days of the Syntax Archive. The University of Turku produces language resources that are published via Kielipankki for the use of the scientific community. The Finnish Dialect Corpus of the Syntax Archive and The Morpho-Syntactic Database of Mikael Agricola’s Works, produced in cooperation with the Institute for the Languages of Finland, as well as the ArkiSyn corpus, an important annotated collection of contemporary Finnish produced at the University of Turku, have all been published via the Korp service in Kielipankki. Naturally, Kielipankki will also be the publication site for the Swedish-language New Testament corpus that I am currently working on.

Publications

Nobufumi Inaba (2015). Suomen datiivigenetiivin juuret vertailevan menetelmän valossa. Suomalais-Ugrilaisen Seuran toimituksia 272. https://www.sgr.fi/fi/items/show/78


Researcher of the Month: Niina Kunnas

Niina Kunnas
Photo: Mikko Törmänen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Niina Kunnas tells us about her research on minority languages, including Meänkieli.

Who are you?

I am Niina Kunnas, Associate Professor of Finnish language and University Lecturer at the University of Oulu. I also hold a part-time position as Professor of Finnish language at Sámi Allaskuvla in Kautokeino, Norway.

What is your research topic?

My research falls within sociolinguistics, folk linguistics and minority language research. I have examined linguistic variation, language perceptions and situational variation in minority languages, among other things.

How is your research related to Kielipankki?

In recent years, Kielipankki has been involved in my research in a number of ways. Firstly, in 2019, I collected a corpus of spoken Meänkieli together with my students; it was recorded from the outset with the intention of making it available to researchers via Kielipankki. The corpus contains spoken Meänkieli from several municipalities in the Meänkieli-speaking area, and its collection was encouraged by Heikki Paunonen. Some of the interviewees are the same people who were recorded in the 1990s. Paunonen also recorded speech from the same parishioners in the 1960s, so the material as a whole makes it possible to carry out a three-round follow-up study of spoken Meänkieli.

I have also recently made use of the Iijoki, the University of Oulu Päätalo Collection corpus on the Korp server. The corpus contains all the novels in the Iijoki series written by Kalle Päätalo and comprises over 5 million tokens. Together with Liisa Mustanoja and Maija Saviniemi, I will use this data in our study of the function and the associated affects of the Viena Karelian episodes in the Iijoki series. The corpus has allowed us to search the data rapidly, and the results of the study will be published in an article that will appear in a volume with the working title Päättymätön savotta. Analyyseja Kalle Päätalon tuotannosta (Timberwork without End. Analyses of Kalle Päätalo’s works).

Publications

Kunnas, Niina 2019: Karjalan kieli Oulun seudulla. – Harri Mantila, Maija Saviniemi & Niina Kunnas (toim.), Oulu kieliyhteisönä. 144–199. Helsinki: Suomalaisen Kirjallisuuden Seura.

Saviniemi, Maija, Kunnas, Niina, Mantila, Harri, Paukkunen, Ulla & Rajala, Elina 2019: Oulua havainnoimassa. – Harri Mantila, Maija Saviniemi & Niina Kunnas (toim.), Oulu kieliyhteisönä. 276–318. Helsinki: Suomalaisen Kirjallisuuden Seura.

Vaattovaara, Johanna, Kunnas, Niina & Saviniemi, Maija 2018: Stadi imitoituna. – Sisko Brunni, Niina Kunnas, Santeri Palviainen & Jari Sivonen (toim.), Kuinka mahottomasti nää tekkiit. Juhlakirja Harri Mantilan 60-vuotispäivän kunniaksi. Studia Humaniora Ouluensia 16. Oulun yliopisto. http://jultika.oulu.fi/files/isbn9789526221120.pdf

Kunnas, Niina 2018: Viena Karelians as observers of dialect differences in their heritage language. – Marjatta Palander, Helka Riionheimo & Vesa Koivisto (eds.), On the border of language and dialect. 123–155. Studia Fennica Linguistica 21. Helsinki: Suomalaisen Kirjallisuuden Seura.


Researcher of the Month: Mikael Varjo

Mikael Varjo
Photo: Emmi Saari

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikael Varjo tells us about his research on zero-subject constructions in the ArkiSyn corpus containing everyday Finnish conversation.

Who are you?

I am Mikael Varjo and I am currently working as a university teacher at the University of Turku. In March 2023, I defended my doctoral thesis on zero-subject constructions, also at the University of Turku. My interests are diverse, ranging from teaching and researching Finnish as a second and foreign language to research in usage-based syntax.

What is your research topic?

In my doctoral thesis I examine zero-subject constructions (zero person in the subject position) in Finnish everyday conversation. I have extracted my data from the morphosyntactically annotated ArkiSyn corpus, which I also helped to build as a project researcher in 2015–2016 before starting my own dissertation.

Previous research on the zero person has been quite qualitatively oriented. My research aims to fill this methodological gap by combining two approaches: quantitative corpus linguistics and qualitative interactional linguistics. In my research, I examine the characteristics, variation, contexts of use, and functions of zero-subject constructions in spoken interaction. My research reveals that the grammatical and semantic features typically associated with the zero person also distinguish the subcategories of zero-subject constructions. The differences between subcategories are also linked to the tasks the constructions have in interaction. Typically, zero-subject constructions are used for expressing stance towards something that is under discussion, (joint) planning, sharing of experiences, feelings and desires, or for giving directives.

How is your research related to Kielipankki?

The ArkiSyn corpus is available in Kielipankki. In addition, Kielipankki provided important support in the early stages of my doctoral studies as I was taking my first steps in language technology, natural language processing and automatic text processing. Converting zero-subject constructions extracted from the ArkiSyn corpus into a format that was easy to process and met the needs of my dissertation required a lot of learning over the years. With the help of Kielipankki’s methodological course Corpus Clinic, I was able to get started in the autumn of 2015.

Publications

Varjo, Mikael. 2022. Greater than zero? A study of referentially open and specific necessity constructions in Finnish everyday conversation. Eesti Ja Soome-Ugri Keeleteaduse Ajakiri. Journal of Estonian and Finno-Ugric Linguistics, 13(2), 5–46. https://doi.org/10.12697/jeful.2022.13.2.01

Suomalainen, Karita & Mikael Varjo. 2020. When personal is interpersonal. Organizing interaction with deictically open personal constructions in Finnish everyday conversations. Journal of Pragmatics, 168, 98–118. https://doi.org/10.1016/j.pragma.2020.06.003

Varjo, Mikael. 2019. It Takes All Kinds to Make a Zero: Employing Multiple Correspondence Analysis to Categorize an Open Personal Construction in Conversational Finnish. Corpus Linguistics Research, 5, 55–87. https://doi.org/10.18659/clr.2019.5.03

Varjo, Mikael & Karita Suomalainen. 2018. From zero to ‘you’ and back: A mixed methods study comparing the use of two open personal constructions in Finnish. Nordic Journal of Linguistics, 41(3), 333–366. https://doi.org/10.1017/s0332586518000215


Researcher of the Month: Rosa González Hautamäki

Rosa González Hautamäki
Photo: Ville Hautamäki

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Rosa González Hautamäki tells us about her research on within-speaker variation and the effects of voice modifications. The AVOID corpus, which she collected in collaboration with the Computational Speech group at UEF, is a valuable resource for studying human-induced voice modifications.

Who are you?

I am Rosa González Hautamäki, a postdoctoral researcher at the Research Unit of Logopedics (RULOGO) at the University of Oulu, and a visiting researcher at the School of Humanities at the University of Eastern Finland. I hold a Ph.D. in Computer Science and maintain ongoing collaborations with the School of Computing at the University of Eastern Finland and the Human Language Technology lab at the National University of Singapore (NUS).

What is your research topic?

My research focuses on within-speaker variation in the context of speaker recognition. Speech is a complex signal that varies due to several factors, such as age, health, emotional state, and more, so it is expected that a speaker won’t utter the same phrase in exactly the same way multiple times. During my doctoral studies, I studied the effects of voice modifications on the performance of voice comparisons carried out by listeners or automatic systems. My initial research focused on mimicry and voice disguise, considering that some speakers may not be cooperative when interacting with speaker recognition systems. Our research showed that even simple techniques to disguise one’s voice could cause degradation in the performance of automatic systems, while also making the task of speaker comparison challenging for listeners.

Since then, my studies on within-speaker variation have focused on identifying the factors that impact the performance of speaker verification, including deliberate and non-deliberate voice modifications. These findings have also been important in analyzing speech in other speech technology tasks, such as speech spoofing attacks and auditory speech perception. Exploring the factors that impact system decisions can help in making them more reliable.

Currently, my research on speech analysis involves using machine learning models with data from evaluations used to identify developmental language disorders in children. I am excited to be part of a motivated group of researchers who are exploring speech and interventions that can support those working with the development of children’s speech.

How is your research related to Kielipankki?

During my doctoral research, I collaborated with the Computational Speech group at the University of Eastern Finland to collect a dataset for the study of voice disguise. Kielipankki provided crucial support by offering information necessary for the collection and preparation of the corpus, as well as for its publication as a resource. The resulting dataset, called the Age-related Voice Disguise (AVOID) corpus, contains voice recordings of Finnish speakers in their modal voice and attempting age disguise.

In one study, we used the AVOID corpus to analyze the impact of changes in selected acoustical features on automatic speaker recognition systems, and found that the difference in long-term fundamental frequency (F0) was the most detrimental factor to speaker recognition performance, even when the automatic system uses spectral features.

In another study using the AVOID corpus, we evaluated the effectiveness of age stereotypes as a voice disguise strategy in speaker comparisons. Listeners estimated both the speaker’s chronological age and the intended age (speakers attempted child and elderly voices). The results showed that, for female speakers, the age estimates for the intended voices were closer to the target age, while for male speakers the age estimates moved in the direction of the target voice only for elderly voices.

Overall, the AVOID corpus is a valuable resource for studying human-induced voice modifications, and we expect that further research will help make systems more robust to disguised voices.

Publications

González Hautamäki, R., Hautamäki, V., and Kinnunen, T. (2019). “On Limits of Automatic Speaker Verification: Explaining Degraded Recognizer Score Through Acoustic Changes Resulting from Voice Disguise”, The Journal of the Acoustical Society of America 146, 693. https://doi.org/10.1121/1.5119240

González Hautamäki, R., Sahidullah, Md., Hautamäki, V., and Kinnunen, T. (2017). “Acoustical and perceptual study of voice disguise by age modification in speaker verification”, Speech Communication, Volume 95, Pages 1-15, https://doi.org/10.1016/j.specom.2017.10.002

González Hautamäki, R., Sahidullah, Md., Kinnunen, T., and Hautamäki, V. (2016). “Age-Related Voice Disguise and its Impact in Speaker Verification Accuracy”, Odyssey: The Speaker and Language Recognition Workshop, Bilbao, Spain, pages 277-282, http://dx.doi.org/10.21437/Odyssey.2016-40

González Hautamäki, R., Kanervisto, A., Hautamäki, V., and Kinnunen, T. (2018). “Perceptual Evaluation of the Effectiveness of Voice Disguise by Age Modification”, Odyssey: The Speaker and Language Recognition Workshop, Les Sables d’Olonne, France, pages 320-326, http://dx.doi.org/10.21437/Odyssey.2018-45


Researcher of the Month: Johanna Vaattovaara

Johanna Vaattovaara
Photo: Antti Yrjönen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Johanna Vaattovaara tells us about her research projects on language awareness and language attitudes.

Who are you?

I am Johanna Vaattovaara, professor of Finnish language in the Languages Unit at the Faculty of Information Technology and Communication Sciences, Tampere University.

What is your research topic?

My research falls within sociolinguistics and language ideology research, mainly the study of language awareness and attitudes. I have also done research on linguistic variation and language change, and for these topics various corpora have proven to be very valuable resources. Corpora have also been useful in the creation of language attitude study designs. In recent years, for example, I have used the Suomi24 corpus in various ways in studies where I have investigated, together with Elizabeth Peterson and also with Ylva Biri and Turo Hiltunen, the integration of English expressions into Finnish language use.

How is your research related to Kielipankki?

So far, I have used the Suomi24 corpus in Kielipankki, especially Suomi24 2016H2. Currently, I am launching the research project Arkisuomien kielitietoisuudet ja muutos (Societal awareness of linguistic variation and change), funded by the Kone Foundation (2023–25). During the project, we will collect language awareness and attitude data using different methods, such as a nationwide survey, and we plan to distribute the data via Kielipankki.

In the past, I have distributed data through the archives of the Institute for the Languages of Finland (Kotus). The data that I collected for my dissertation is also available from Kotus. It consists of interviews with a group of high school graduates in Pello, Tornionlaakso (Torne Valley). In the post-doc phase, I collected reaction and interview data in the lobby of the Finnish Science Centre, Heureka, in the project Helsingin suomea – monimuotoisuus, sosiaalinen identiteetti ja kielelliset asenteet kaupunkiympäristössä, led by Marja-Leena Sorjonen and funded by the Academy of Finland in 2009–2012. This corpus of metalinguistic material can also be obtained from Kotus.

Publications

Peterson, E., Hiltunen, T., Vaattovaara, J. 2022. A place for pliis in Finnish: A discourse-pragmatic variation account of position. – Elizabeth Peterson, Turo Hiltunen & Joseph Kern (eds.), Discourse-Pragmatic Variation and Change: Theory, Innovations, Contact, pp. 272–292. Cambridge University Press. DOI: 10.1017/9781108864183.015

Peterson, E., Biri, Y., Vaattovaara, J. 2022. Grammatical and social structures of English-sourced swear words in Finnish discourse. – Martín-Solano, R. & San Segundo, R. (eds.), Corpus linguistics and Anglicisms, pp. 49–70. Peter Lang Publishing. DOI: 10.3726/b19222

Vaattovaara, J. & Peterson, E. 2019. Same old paska or new shit? On the stylistic boundaries and social meaning potentials of a loanword in Finnish. – Ampersand 6/2019 (Special Issue, E. Zenner, A. Calude & L. Rosseel (eds.), Lexical borrowing as expression of culture, identity and attitude – empirical investigations into the social meaning potential of loanwords.) DOI: 10.1016/j.amper.2019.100057

Vaattovaara, J. 2012. Spatial concerns for the study of social meaning of linguistic variables – an experimental approach. – Hanna Lehti-Eklund, Camilla Lindholm & Caroline Sandström (eds.), Folkmålsstudier : Meddelanden från Föreningen för Nordisk Filologi 2012/50, pp. 175–209. https://journal.fi/folkmalsstudier/article/view/82136

Nuolijärvi, Pirkko & Vaattovaara, Johanna 2011. De-standardisation in progress in Finnish society? – T. Kristiansen & N. Coupland (eds.), Standard Languages and Language Standards in a Changing Europe, pp. 67–74. Oslo: Novus Forlag. http://omp.novus.no/index.php/novus/catalog/view/3/5/163

Vaattovaara, Johanna 2009. Meän tapa puhua: Tornionlaakso pellolaisnuorten subjektiivisena paikkana ja murrealueena. Helsinki: Suomalaisen Kirjallisuuden Seura (304 pp.). Suomalaisen Kirjallisuuden Seuran toimituksia 1224. http://urn.fi/URN:ISBN:978-952-222-100-1


Researcher of the Month: Noora Hoffrén

Noora Hoffrén
Photo: Essi Ekman

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Noora Hoffrén tells us about her PhD research on constructed action in Finnish Sign Language and Finnish.

Who are you?

I am Noora Hoffrén, a sign language interpreter and a doctoral researcher. I am working on my PhD thesis at the Sign Language Centre (SLC) in the Department of Language and Communication Studies at the University of Jyväskylä.

What is your research topic?

The topic of my dissertation is showing by enacting, i.e. constructed action. When a speaker or signer is immersed in the role of another character and displays the character’s thoughts, speech, emotions or actions, he or she is constructing action. Constructed action is not always obvious or overt. Often, especially in signed languages, it is so closely integrated into the language that it is difficult to discern. In my research, I am studying constructed action in both Finnish Sign Language and Finnish. My dissertation is part of the ongoing ShowTell project at the University of Jyväskylä.

How is your research related to Kielipankki?

As my research data, I will use the Corpus of Finnish Sign Language, part of which is already available for download in Kielipankki (CFINSL). In addition to videos that are recorded from multiple angles, the database contains basic annotations and metadata. The fact that such a corpus exists allows us to study constructed action in the best possible way.

My aim is to collect a video corpus of spoken Finnish, parallel to the Finnish Sign Language material, and to deposit the corpus in Kielipankki. The Finnish video corpus will be collected in pairs from six native speakers of Finnish. The methods used to collect the material will be similar to those used for the Finnish Sign Language corpus, for example, using multiple cameras during filming sessions and using the same elicitation materials (e.g. the picture books ‘The Snowman’ and ‘Frog, Where Are You?’).

Publications

Hoffrén, Noora 2019. Kuvailevien viittomien ja konstruoidun toiminnan yhteispeli. Master’s thesis. University of Jyväskylä. Available: http://urn.fi/URN:NBN:fi:jyu-201910144419


Researcher of the Month: Maria Sarhemaa

Maria Sarhemaa
Photo: K-Art Foto

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Maria Sarhemaa tells us about her research on the appellativization of first names in Finnish. Online discussions are a fruitful source of data for studying informal or colloquial language use.

Who are you?

I am Maria Sarhemaa, a doctoral researcher in Finnish language at the University of Helsinki. Currently, I am working on my thesis on a grant from the Kone Foundation.

What is your research topic?

I am doing research on the appellativization of first names in Finnish, i.e. on words that originate from a first name and typically belong to the informal registers of the language. These include yrjö, meaning ‘vomiting’, and jonne, meaning a certain kind of teenage boy, but there are also compound words with an appellativized first name as part of the word, such as baarimikko ‘bartender’. In my dissertation research, I am exploring appellativization as a linguistic phenomenon in Finnish, and in the sub-publications I will examine compound words with an appellativized part, the expressions uuno, tauno and urpo meaning ‘stupid’, and the construction jonnet ei muista ‘teenagers cannot remember’.

How is your research related to Kielipankki?

I collected data from the Suomi24 corpus in Kielipankki for my article on uuno, tauno and urpo. The Suomi24 corpus is a fruitful source of data for my research topic, as appellativized expressions are used extensively, particularly in informal language, and the language used in Suomi24 is often colloquial. I have also collected data from the same corpus for my forthcoming article on the jonnet ei muista construction and for a study on the jonne appellative that I am conducting with Lasse Hämäläinen, PhD.

Publications

Hämäläinen, Lasse & Sarhemaa, Maria 2022: Jonnen jäljillä: Appellatiivisen jonnen alkuvaiheet verkkokeskusteluaineistojen valossa. Sananjalka 64, 255–269. https://doi.org/10.30673/sja.114194

Sarhemaa, Maria 2021: Tavan tauno uunoilee urpokaupungissa: Nimien Uuno, Tauno ja Urpo appellatiivistuminen ja appellatiivien käyttö Suomi24-keskustelupalstalla. Sananjalka 63, 103–129. https://doi.org/10.30673/sja.107278

More information

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

Researcher of the Month: Therese Lindström Tiedemann

Therese Lindström Tiedemann
Photo: Tove Tiedemann

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Therese Lindström Tiedemann tells us about her research on Swedish as a second language. There is a definite need to continue developing Finland-Swedish corpora to ensure that Finland-Swedish is also included in future studies of the Swedish language.

Who are you?

My name is Therese Lindström Tiedemann and I am a university lecturer in the Swedish Language at the University of Helsinki. In addition to the Swedish language, I also work on general linguistics. I wrote my PhD thesis on the history of grammaticalisation as a concept in linguistics, i.e. within the history of linguistics.

What is your research topic?

In recent years, most of my research has been on Swedish as a second language. In my research I often use corpus linguistic methods. Together with colleagues, I have also tried to use crowdsourcing. I also do research on other topics such as grammaticalisation, the history of linguistics, the teaching of grammar and metalinguistic knowledge.

How is your research related to Kielipankki?

I have used Kielipankki’s resources mainly in connection with my research on Swedish as a second language and in the context of teaching. For instance, I have used the Swedish subcorpus of the Topling corpus. Currently, I am managing our faculty’s part of the Digisvenska project, where we are creating a text corpus from the Digital Matriculation Examination in B1-Swedish in Finland (Swedish as a second language, learnt from year 6, or from year 7 in the old curriculum). We aim to study how the exam relates to the curriculum, and the fairness and transparency of the test results. Among other things, we will study how lexical breadth in the form of lexical variation (cf. vocabulary size) relates to scores and marks in the exams. We will also look at verb conjugation and adverbial clause modifiers, as well as linguistic accuracy, i.e. how close the language is to the norm.
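
As a rough illustration of what a measure of lexical variation can look like, the sketch below computes a simple type-token ratio (distinct word forms divided by all word forms) in Python. This is only an assumption for illustration: the interview does not say which measure the Digisvenska project uses, and the function name and the sample sentence are made up.

```python
import re


def type_token_ratio(text):
    """Return distinct word forms divided by all word forms (0.0 for empty text)."""
    # Lowercase and split on word characters; a real study would more
    # likely count lemmas from an annotated corpus than raw word forms.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up learner sentence, used only to show the calculation.
essay = "Jag bor i Helsingfors och jag studerar svenska i Helsingfors"
ttr = type_token_ratio(essay)
print(f"{ttr:.2f}")
```

A plain type-token ratio tends to fall as texts get longer, so comparisons across essays of different lengths would need a length-corrected measure.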

A few years ago, I tried to study the Swedish word nog (lit. ‘enough’) using the Sinebrychoff corpus together with Jan Lindström. However, in the end the work needed to be done primarily with a more comprehensive text version of the corpus and not with the version available in Korp.

Swedish-language resources in Finland need developing

I also have a more general interest in the Swedish-language resources available in Kielipankki because of my research on Swedish and teaching students in Scandinavian languages, and since I often use corpus-based methods. This is why it is important for me to know which corpora I can recommend students to use and how they can be used. There is definitely a need to continue developing Finland-Swedish corpora to ensure that we can describe Finland-Swedish (Sw. ”finlandssvenska”) in a similar way to how we can describe Swedish as spoken in Sweden (Sw. ”sverigesvenska”), and that Finland-Swedish is also included in future studies of the Swedish language. In the Finnish context, we can also see that some corpora contain both Finnish and Swedish. There is a need to consider the best way to study how and when Swedish is used in these corpora, and whether this is representative of how Swedish is used in these contexts in Finland. This applies, for example, to the corpus of parliamentary plenary sessions (Eduskunnan täysistunnot), where Swedish words are currently only tagged as foreign words. This impedes research possibilities on this part of the data. However, at the same time, we can clearly see that Swedish words top and dominate the list of words tagged as foreign words in the plenary sessions. It would be interesting to see these parts treated as Swedish, and whether it might somehow be possible to annotate the Swedish parts as Swedish, thus facilitating the study of them from a Swedish perspective.

Besides the Swedish-language resources, I also have an interest in interoperability between different corpora and resources, transparency of research data and comparability between different sources for the Swedish language. Since many Swedish-language corpora are available via Språkbanken Text (Sweden), and since we need to be able to compare the corpora in Kielipankki with them, I see a need for information about how comparable these corpora are, and whether corpora in Kielipankki have been annotated in the same way. This is important to ensure that Finland-Swedish and other Swedish corpora located in Finland can be compared with Swedish corpora located in Sweden. This could give Finland-Swedish and second language Swedish (L2 Swedish) with Finnish as the first language (L1) a clear and fair place in research on Swedish and L2 Swedish in general.

As part of my work on corpora, my colleagues and I have also checked how well the automatic annotation works, especially on material produced by L2 speakers. We have checked the annotation of coursebook texts (written by L1 speakers but aimed at, or selected for, L2 learners), texts written by L2 learners and texts written by L2 speakers and ”normalised” (i.e. with standardised spelling, for instance) to facilitate annotation, queries and comparisons. The results showed that texts written by learners are often, but not always, less well annotated. Lemmatisation, word class tagging and sense disambiguation were good enough to be used in studies of L2 Swedish, even though sense disambiguation was more problematic than the first two. There were bigger problems with dependency analysis (cf. clause analysis, parsing), and multiword expressions also proved to be problematic, especially in learner writings. Still, multiword annotation was good enough to allow us to conclude that we can use it in our work, although the user should know that something may have been missed and that the multiword annotation is based on the expressions which are part of the Saldo lexicon, and on how they have been listed in Saldo. The results showed that sometimes there was disagreement regarding whether a preposition should be seen as part of the expression or not.

I am very happy to see that more Swedish corpora have been added to Kielipankki in the last few years. I hope that in the future even more Swedish corpora will be added to Kielipankki, that they will be annotated in the same way as the Swedish corpora in Språkbanken Text (Sweden), and that information about the data will be made accessible in such a way that students and researchers can easily find comparable material and know how representative the material is for a certain type of language (e.g. a dialect, newspaper writings).

Recently finished projects and some future steps

In the coming years I will be working on a project on pseudonymisation of linguistic data (Mormor Karl är 27 år). Pseudonymisation means that some information, such as names of people, places, etc., is changed to pseudonyms in the data when this information might reveal who wrote the text. In this project we will study how pseudonymisation affects research data in the humanities. This is an important step in the work on open, reusable data, which is needed for reproducibility and for replication studies to be possible on data already collected, while at the same time protecting people’s identity.

In connection to the project which I have just finished together with Elena Volodina, University of Gothenburg (L2 profiles – Development of lexical and grammatical competences in immigrant Swedish) we have released a dataset with manual morphological annotation of lexemes which are present in materials aimed at learners of Swedish as a second language or produced by speakers of Swedish as a second language (CoDeRooMor). This resource has now been updated and will be released as part of the resource Swedish L2 profiles during 2023. Swedish L2 profiles is a resource where you can search for e.g. a word, a tense, a morpheme or a word formation pattern to see how this is used at different proficiency levels (according to CEFR, the Common European Framework of Reference for Languages, Council of Europe) both in course books for Swedish as a second language and in learner essays from different CEFR-levels. The resources which we have created are part of Språkbanken Text (Sweden), but are or will be openly accessible.

I have also been involved in the development of an annotation tool in relation to research on Swedish (Legato) and in the use of the CALL platform Lärka for the teaching of syntactic functions, word classes and semantic roles. The CALL platform Lärka is something I have used in teaching grammar, which meant that I could give feedback to the developers from that perspective. Together with Volodina I have also used the platform to collect anonymous data to study what students often get right or wrong when they practise these categories, useful in connection to research on metalinguistic knowledge and the ability to analyse Swedish grammatically.

Apart from research related to Kielipankki’s resources and areas of interest, I am also the current project manager of Finland Swedish Online (FSO), an online course in Finland Swedish created at the University of Helsinki based on an Icelandic model (Icelandic Online). FSO is currently part of SAFMORIL, one of the K-Centres within CLARIN. One of my aims has been that FSO would not only support the learning of a language but also offer a possibility to study language acquisition, by seeing if it is possible to trace the development of learners in FSO if they grant access to that information. (Icelandic Online has done research on this based on their data.)

References

Alfter, D., Borin, L., Pilán, I., Lindström Tiedemann, T. & Volodina, E. 2019a. Lärka: From Language learning platform to infrastructure for research and language learning. In: Selected papers from the CLARIN Annual Conference 2018. Linköping: Linköping university press. 14pp. http://www.ep.liu.se/ecp/159/001/ecp18159001.pdf

Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2019b. LEGATO: A flexible lexicographic annotation tool. In: Hartmann, M. & Plank, B. (eds.), The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa): Proceedings of the conference. Linköping: Linköping University Electronic Press. pp. 382–388. http://hdl.handle.net/10138/306297

Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2021. Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts vs Non-Experts. Northern European Journal of Language Technology, 7 (1): 35pp. https://doi.org/10.3384/nejlt.2000-1533.2021.3128

Arnbjörnsdóttir, B., Friðriksdóttir, K., & Bédi, B. 2020. Icelandic Online: twenty years of development, evaluation, and expansion of an LMOOC. CALL for widening participation: short papers from EUROCALL 2020, 13.

Borin, L., Forsberg, M. & Lönngren, L. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4): 1191–1211. https://doi.org/10.1007/s10579-013-9233-4

Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching and assessment. https://rm.coe.int/1680459f97

Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion Volume with new descriptors. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989

Council of Europe. 2020. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion volume. https://rm.coe.int/common-european-framework-of-reference-for-languages-learning-teaching/16809ea0d4

Friðriksdóttir, K. 2021. The effect of tutor-specific and other motivational factors on student retention on Icelandic Online. Computer Assisted Language Learning, 34(5-6), 663-684.

Lenardič, J., Lindström Tiedemann, T. & Fišer, D. 2018. Overview of L2 corpora and resources. CLARIN report. CLARIN ERIC. https://office.clarin.eu/v/CE-2018-1202-L2-corpora-report.pdf

Lindström, J. & Lindström Tiedemann, T. 2020. ”Ni minnes nog hvilka jag menar”: Subjektiva och intersubjektiva aspekter av modaladverbet nog. In: Lehti-Eklund, H. & Silén, B. (eds.), Handel med konst. Språk och dialog i Paul Sinebrychoffs brevsamling från sekelskiftet 1900. Helsinki: Svenska litteratursällskapet. pp. 293–323. http://hdl.handle.net/10138/315043

Lindström, J. & Lindström Tiedemann, T. 2018. Subjektivt och intersubjektivt nog: Om grammatikalisering och bruk i ljuset av Paul Sinebrychoffs brevväxling kring 1900. In: Lönnroth, H, Haagensen, B., Kvist, M. & Sandvad West, K. (eds.) Studier i svensk språkhistoria 14. Vaasa: University of Vaasa. pp. 180–197. http://hdl.handle.net/10138/243079

Lindström [Tiedemann], T. 2004. The History of the Concept of Grammaticalisation. Unpublished PhD thesis, University of Sheffield. https://etheses.whiterose.ac.uk/1437/

Lindström Tiedemann, T., Alfter, D. & Volodina, E. 2022. CEFR-nivåer och svenska flerordsuttryck. In: Björklund, S., Haagensen, B., Nordman, M. & Westerlund, A. (eds.), Svenskan i Finland 19. Vasa: Svensk-österbottniska samfundet. pp. 218–233. https://urn.fi/URN:ISBN:978-952-69650-5-5

Lindström Tiedemann, T., Lenardič, J. & Fišer, D. 2018. L2 learner corpus survey: towards improved verifiability, reproducability and inspiration in learner corpus research. CLARIN annual conference, Pisa. https://office.clarin.eu/v/CE-2018-1292-CLARIN2018_ConferenceProceedings.pdf

Lindström Tiedemann, T., Volodina, E. & Jansson, H. 2016. Lärka – ett verktyg för träning av språkterminologi och grammatik. LexicoNordica, 23: 161–181. https://tidsskrift.dk/lexn/article/view/111823

Prentice, J., Håkansson, C, Lindström Tiedemann, T., Pilán, I. & Volodina, E. 2021. Language learning and teaching with Swedish FrameNet++: two examples. In: Dannélls, D., Borin, L. & Friberg Heppin, K. (eds.), The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications. Amsterdam: Benjamins. pp. 303–329. https://doi.org/10.1075/nlp.14.12pre

Stemle, E. W., Boyd, A., Jansen, M., Lindström Tiedemann, T., Mikelić Preradović, N., Rosen, A., Rosén, D. & Volodina, E. 2019. Working together towards an ideal infrastructure for language learner corpora. In: Abel, A., Glaznieks, A., Lyding, V. & Nicolas, L. (eds.) Widening the Scope of Learner Corpus Research: Selected papers from the fourth learner corpus research conference. Louvain-la-Neuve: Presses universitaires de Louvain. http://hdl.handle.net/10138/311309

Volodina, E., Alfter, D., Lindström Tiedemann, T., Lauriala, M.S. & Piipponen, D. H. 2022. Reliability of Automatic Linguistic Annotation: Native vs Non-native Texts. In: Monachini, M. & Eskevich, M. (eds.), Selected papers from the CLARIN Annual Conference 2021. Linköping: Linköping University Electronic Press. pp. 151–167. https://doi.org/10.3384/ecp18914

Volodina, E., Mohammed, Y. A. & Lindström Tiedemann, T. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. Proceedings of the 23rd Nordic conference on computational linguistics (NoDaLiDa). Linköping. pp. 178–189. http://hdl.handle.net/10138/339476

Volodina, E. & Lindström Tiedemann, T. 2014. Evaluating students’ metalinguistic knowledge with Lärka. Swedish Language Technology Conference, Uppsala. http://hdl.handle.net/10138/347397

Finland-Swedish language resources

 
Researcher of the Month: Marja-Liisa Helasvuo

Marja-Liisa Helasvuo
Photo: Lyyra Virtanen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Marja-Liisa Helasvuo tells us about the digital language resources that have been compiled at the University of Turku. The collaboration with others in the same field has now evolved into a full-scale infrastructure of language data and resources.

Who are you?

I am Marja-Liisa Helasvuo, professor of Finnish language at the University of Turku. I studied Finnish language and general linguistics at the University of Helsinki, and I did my PhD in linguistics at the University of California, Santa Barbara. I have always been particularly interested in spoken language, and in my doctoral thesis I examined spoken Finnish from a crosslinguistic perspective.

What is your research topic?

My research has focused on grammar and human interaction. I have investigated a wide variety of data: everyday conversations between adults or between adults and children, online conversations, and other computer-mediated interactions. I have also studied written texts, from the oldest Finnish texts to more recent ones. I have explored a wide range of grammatical topics with the help of these resources.

I work at the Department of Finnish and Finno-Ugric Languages at the University of Turku. We have produced several digital corpora, starting from The Finnish Dialect Corpus of the Syntax Archive, whose compilation began in 1967. It is the first Finnish language corpus that has been directly compiled into a machine-readable format.

Since the Dialect Corpus, several others have followed: the Agricola Corpus, which contains all the works of Mikael Agricola from the 16th century, the Advanced Finnish Learners’ Corpus (LAS2) and the Corpus of Academic Finnish (LAS1). These are all grammatically coded and they are available in Kielipankki – the Language Bank of Finland (LAS1 will be available soon). In addition, we have produced several resources for Finno-Ugric languages. These materials have been collected in the Archive of Finnish and Finno-Ugric Languages. As we have produced many language resources in our organization, we also have many researchers who are interested in conducting corpus-based research. It’s always easy to ask a colleague for assistance when figuring out which corpus to use to study a particular topic.

Recently, we have been increasingly collaborating with the TurkuNLP research group. We established the UTU-Digilang infrastructure, which includes not only the Archive of Finnish and Finno-Ugric Languages, but also the Digilang portal, the Digilang longterm storage, and the TurkuNLP research group with its language resources and data tools. This collaboration has been very rewarding and I have learned a lot from it. I would like to see more collaboration of this kind in the future as well.

How is your research related to Kielipankki?

I have used language corpora in almost all my research. Many of these resources are available in Kielipankki.

I have been working on the ArkiSyn Corpus, which is available in Kielipankki. We received funding for the project from the Kone Foundation, which helped us to build a morphosyntactically annotated corpus. You can easily search it for all occurrences of a given word (e.g. all forms of the verb ajatella, ’think’) or all occurrences of a given grammatical form (e.g. all forms of the past tense).

Recently, my research has focused in particular on different kinds of fixed expressions, which occur frequently and mostly in the same form. For example, the verb ajatella ’think’ is a very common verb in everyday Finnish conversation. It almost always occurs in the 1st person singular and the tense of the expression is the past tense (ajattelin ’I thought’). When we compared the results of the corpus search with the corresponding passages in the audio recordings, we found that although the expressions were transcribed as ’I thought’, they were in fact phonetically quite eroded. In most cases, the expression occurred in the form maattet. The first person singular pronoun minä ‘I’ was reduced to the m sound at the beginning, the first and second syllable of the verb ’think’ (ajat) were fused together (aat). The reduced form of the word että ’that’ had stuck at the end. This type of phonetic reduction and crystallization of usage into a particular form is very common in fixed expressions.

In addition to ArkiSyn, I have also used the Suomi24 Corpus, the Agricola Corpus, The Finnish Dialect Corpus of the Syntax Archive and newspaper materials. The different corpora allow for different research topics.

Publications

Laury, Ritva, Marja-Liisa Helasvuo & Janica Rauma 2020. “When an expression becomes fixed: mä ajattelin että ‘I thought that’ in spoken Finnish”. – Ritva Laury & Tsuyoshi Ono (eds.), Fixed Expressions: Building language structure and social action, pp. 133–166. Pragmatics & Beyond New Series 315. Amsterdam: John Benjamins. DOI: http://dx.doi.org/10.1075/pbns.315.06lau

Helasvuo, Marja-Liisa 2019. “Free NPs as units”. Special issue “On the Notion of Unit in the Study of Human Languages”, guest editors Tsuyoshi Ono, Ritva Laury & Ryoko Suzuki. Studies in Language 43:2:301–328. DOI: http://dx.doi.org/10.1075/sl.16064.hel

Laury, Ritva & Marja-Liisa Helasvuo 2016. “Disclaiming epistemic access with ‘know’ and ‘remember’ in Finnish”. Special Issue on “Grammar and negative epistemics in talk-in-interaction”, guest editors Jan Lindström, Yael Maschler and Simona Pekarek Doehler. Journal of Pragmatics 106 (2016): 80–96. DOI: http://dx.doi.org/10.1016/j.pragma.2016.07.005

Helasvuo, Marja-Liisa & Aki-Juhani Kyröläinen 2016. “Choosing between zero and pronominal subject: Modeling subject expression in the 1st person singular in Finnish conversation”. Corpus Linguistics and Linguistic Theory 12(2):263–299. DOI: http://dx.doi.org/10.1515/cllt-2015-0066

Researcher of the Month: Marjatta Palander

Marjatta Palander
Photo: Satu Kokkonen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Marjatta Palander tells us about her research on the dialects of the Karelian language. The Karelian language speech corpora that were compiled in her research projects will be available via Kielipankki.

Who are you?

I am Marjatta Palander, Professor Emerita of Finnish Language at the School of Humanities of the University of Eastern Finland. I am the leader of the recently finished research project KATVE (Migration and linguistic differentiation: Karelian in Tver and Finland), which was funded by the Academy of Finland.

What is your research topic?

During my career, I have done research mostly on the Eastern dialects of Finnish, but in the 2000s, I have also studied Karelian in two research projects. The FINKA project (2011–2014) focused on the dialects of Border Karelia. The KATVE project (2018–2022) investigated the differences and similarities between the dialects of Karelian in Border Karelia and Tver. These Karelian dialects are descended from the common Southern Karelian dialect of the Karelian Proper, which was still spoken in the area of present-day Eastern Finland in the early 17th century. After the Swedish conquest of Eastern Finland, most of the Karelian-speaking population of the region fled to Russia, as far as Tver. Since then, the Karelians of Tver have lived without contact with other Karelians. In the KATVE project, we have examined the differentiation of dialects that has occurred in the course of around 350 years.

Our research concerns, among other things, the features of sentence structure, possessive forms and vocabulary. We are also investigating to what extent people with a Border Karelian background and people with a Tver Karelian background can understand each other’s dialects. In my own research, I have examined Karelians’ linguistic awareness using folk-linguistic methods. In addition, I have investigated the temporal variation in one Border Karelian idiolect, for which we have recordings spanning 17 years.

How is your research related to Kielipankki?

In the research projects of the 2010s and 2020s, we have compiled three Karelian language speech corpora, which include recorded dialect interviews and their transcriptions produced by FU transcription. The Border Karelian corpus (119 hours) is based on interviews recorded in the 1960s and 1970s, preserved at the Institute for the Languages of Finland (Kotus). The Tver Karelian corpus 1957–1971 (approx. 30 h) was also compiled from recordings at the Institute for the Languages of Finland. The more recent Tver Karelia is represented in the Tver Karelian corpus 2016–2019 (ca. 15 h), which was compiled by researchers from the KATVE project and Karelian language students on our field trips. All the corpora have been submitted to the Language Bank in order to provide researchers with more electronic data on Karelian, which is an endangered minority language.

Research

Palander, Marjatta 2015. Rajakarjalaistaustaisten ja muiden suomalaisten käsityksiä karjalasta. Virittäjä, 119(1), 34–66. Available: https://journal.fi/virittaja/article/view/41260

Palander, Marjatta & Mäkisalo, Jukka 2022. Reaaliaikatutkimus rajakarjalaisidiolektista. Virittäjä, 126(3), 339–368.

Palander, Marjatta & Riionheimo, Helka 2018. Miten Raja-Karjalan murre eroaa suomesta? Rajakarjalaistaustaiset pohjoiskarjalaiset kuuntelutestissä. Sananjalka, 60(60.), 49–70. DOI: 10.30673/sja.69997

Riionheimo, Helka & Palander, Marjatta 2017. Rajakarjalainen kuuntelutesti: havainnoijina suomen kielen yliopisto-opiskelijat. Lähivõrdlusi/Lähivertailuja 27, 212–241. Eesti rakenduslingvistika ühing. Tallinn. DOI: 10.5128/LV27.07

Uusitupa, Milla, Koivisto, Vesa & Palander, Marjatta 2017. Raja-Karjalan murteet ja raja-alueiden kielimuotojen nimitykset. Virittäjä 121(1), 67–106. Available: https://journal.fi/virittaja/article/view/53121

Researcher of the Month: Benjamin Schweitzer

Benjamin Schweitzer
Photo: Grit Ruhland

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Benjamin Schweitzer tells us about his research on the Finnish special language of art music. Corpus linguistics enables the researcher to study the topic from several points of view.

Who are you?

I am a German composer, translator and linguist (in biographical order). I studied composition, music theory and orchestra conducting – at the Sibelius Academy in Helsinki, among other places – and have since worked mainly as a freelance artist with some additional work as lecturer and concert organiser. In the early 2000s, I started translating from Finnish to German – mostly historical and musicological non-fiction, but also opera librettos and short stories.

In my forties, I entered a second career path and studied Fennistics and Scandinavistics in Greifswald and Tartu. When I received my MA degree in 2018, I already had the feeling that this wouldn’t be the end of my linguistic ambitions. I was very happy to get the opportunity to continue soon afterwards with a PhD project: I am now employed as a researcher at the Department of Finnish Studies of the University of Greifswald and working on my PhD thesis within the framework of an International Research Training Group called Baltic Peripeties. My supervisor is Professor Marko Pantermöller.

What is your research topic?

I am researching the Finnish special language of art music from several points of view. My first aspect is historical-systematical: I am trying to show how a special language of a field emerged which, as a cultural practice, was itself imported to Finland. What happened spontaneously and what came about as the result of language planning and maintenance? Which terms were adapted, where did the language community succeed in inventing ”originally” Finnish words, and which structural problems had to be overcome in the process?

The second aspect concerns the transition from terms to texts, from words to narration: Which challenges did Finnish critics and musicologists face when writing about music in Finnish? Which models did they follow, and are there structurally ”typically Finnish” ways to write about music?

The third and most complex aspect is a discourse-linguistic approach: What kind of intertextual relations can be found in Finnish texts about (Finnish) music? How does this discourse reveal national auto- and heterostereotypes? And how is art music as a core element of Finnish ”cultural identity” reflected in the writing about music since the beginning of the 20th century?

How is your research related to Kielipankki?

Corpus linguistics plays an important role in my research, even though I am probably employing a somewhat nonstandard approach. Within the official taxonomy, my research might qualify as corpus-based or corpus-oriented, but I would maybe prefer the attribute corpus-aware. In my research, I am mainly looking at longer passages or even entire texts, from which I extract key words, collocations and discourse-semantic frames. This means that my analytical approach is clearly qualitative. Nevertheless, if I want to find out when and in which context certain key words or concepts first appeared, how they were distributed diachronically and how big or small their impact was, I also need to look at bulk material from a quantitative angle.

This is where Kielipankki enters. I mainly use the Newspaper and Periodical Corpus of the National Library of Finland (KLK), which contains not only a huge collection of daily papers until the mid-20th century but also early music journals, which are an invaluable source. Basically, I use corpus analysis to test, back up and extend research hypotheses which often arise from one single finding in a text, or even an ”I know that there must be something somewhere around here” gut feeling. That can, to name a concrete example, be a question like ”Since when does the co-occurrence of ’Sibelius’ and ’alkuvoima’ appear? Does the corpus provide evidence for the assumption that it became a fixed collocation, and if so, when?”

To this end, I mainly use the extended search in Korp to identify co-occurrences within comparatively large units (paragraphs), because a simple left/right neighbour search wouldn’t reveal much – especially not in the complicated syntax of early modern Finnish writing on music, which is often closer to literary works than to factual non-fiction style. The corpus excerpts can then be used for further investigation, e.g. for qualitative data analysis, but sometimes also to generate new hypotheses. I have to admit that more than once I have found a needle in a haystack – e.g. an interesting text that I might otherwise have overlooked – by browsing through my corpus search results.
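
The idea of a paragraph-level co-occurrence search can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: it does not use Korp or its query language, the function name `cooccurrence_by_year` is hypothetical, the sample paragraphs are invented, and matching on word-initial stems is a crude stand-in for proper lemmatisation of inflected Finnish forms.

```python
import re
from collections import Counter


def cooccurrence_by_year(paragraphs, stem_a, stem_b):
    """Count, per year, the paragraphs in which both stems occur.

    paragraphs: iterable of (year, text) pairs (assumed input format).
    stem_a, stem_b: lowercase word-initial stems, e.g. "alkuvoima".
    """
    counts = Counter()
    for year, text in paragraphs:
        # Lowercase and split into word tokens; \w also matches ä, ö, å.
        tokens = re.findall(r"\w+", text.lower())
        has_a = any(tok.startswith(stem_a) for tok in tokens)
        has_b = any(tok.startswith(stem_b) for tok in tokens)
        if has_a and has_b:
            counts[year] += 1
    # Return the years in chronological order.
    return dict(sorted(counts.items()))


# Invented sample data, not taken from any corpus.
sample = [
    (1902, "Sibeliuksen sinfoniassa soi alkuvoima."),
    (1902, "Konsertti pidettiin yliopiston juhlasalissa."),
    (1915, "Alkuvoimaa on Sibeliuksen musiikissa yhä."),
    (1915, "Sibelius johti itse orkesteria."),
]

print(cooccurrence_by_year(sample, "sibeli", "alkuvoima"))
```

Counting whole paragraphs rather than adjacent words reflects the point made above: the two terms may be separated by long stretches of text and still belong to the same statement.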

Publications

Schweitzer, Benjamin 2019. Musikinstrumentenbezeichnungen im Finnischen: Historisch-systematischer Überblick, Varianten und Verstetigung. MA thesis. Universität Greifswald. Available: urn:nbn:de:gbv:9-oa-000003-2

More information

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.

Researcher of the Month: Mikko Laitinen

Mikko Laitinen
Photo: Olli Laitinen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikko Laitinen tells us about his recent work on social media datasets, which also allow researchers to explore social networks.

Who are you?

I am Mikko Laitinen, professor of English Language and Culture at the School of Humanities at the University of Eastern Finland and one of the PIs of the national Digital Humanities research infrastructure consortium, FIN-CLARIAH.

What is your research topic?

I am a sociolinguist, which means that I am interested in the use of language in different situations and as a social phenomenon. As a researcher, I have worked with small and structured corpora as well as with large and computationally intensive mass data, but always with some background variables through which language use has been examined. The corpora have been both synchronic snapshots and diachronic cross-sections through time.

Recently, my research team has been working a lot with various Twitter datasets. We are now building a large, representative and continuously updated benchmark corpus that follows language use in near real time on this social media platform. This kind of ”digital observatory”, which offers us a means to monitor language use in society, is useful, for example, as a background for language policy discussions. What is more, if it is combined with illustrative visualisations in a more comprehensible format, it may also increase people’s interest in language research in general. Twitter is an interesting resource because, despite its limited text length, it has extremely rich metadata that allow us to explore people’s language use in social networks, for example.

How is your research related to Kielipankki?

I think it is great that we have all these resources collected and accessible in one place and through one easy-to-use interface. This is a great service for students and researchers! I have personally used the English language resources the most, including the COHA and COCA corpora, and I have downloaded the English lingua franca corpus (ELFA) to my own computer. I also occasionally check the Suomi24 corpus for some interesting phenomena.

Publications

Laitinen, Mikko. 2020. Empirical perspectives on English as a lingua franca (ELF) grammar. World Englishes 39:3, 1–16. DOI: 10.1111/weng.12482

Laitinen, Mikko, Masoud Fatemi & Jonas Lundberg. 2020. Size matters: Digital social networks and language change. Frontiers in Artificial Intelligence 3:46. DOI: 10.3389/frai.2020.00046

Laitinen, Mikko. 2018. Placing ELF among the varieties of English: Observations from typological profiling. In Sandra Deshors (ed.), Modelling World Englishes in the 21st Century: Assessing the Interplay of Emancipation and Globalization of ESL varieties, 109–131. Amsterdam: John Benjamins. DOI: 10.1075/veaw.g61.05lai

Laitinen, Mikko & Magnus Levin. 2016d. On the globalization of English: Observations of subjective progressives in present-day Englishes. In Elena Seoane & Cristina Suárez-Gómez (eds.), World Englishes: New Theoretical and Methodological Considerations, 229–252. (Varieties of English around the World G57). Amsterdam: John Benjamins. DOI: 10.1075/veaw.g57.10lai

Lundberg, Jonas & Mikko Laitinen. 2020b. Twitter trolls: a linguistic profile of anti-democratic discourse. Language Sciences 79. DOI: 10.1016/j.langsci.2019.101268

More information

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.

Researcher of the Month: Filip Ginter

Filip Ginter
Photo: Filip Ginter

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Filip Ginter tells us about his work with the TurkuNLP research group.

Who are you?

I am Filip Ginter and I am an associate professor of language technology at the University of Turku. I am also presently the longest-serving member of the TurkuNLP research group. I am a computer scientist by training, profoundly enjoying the many unique challenges human language poses.

What is the focus of your research?

Blessed with neither patience nor a long attention span, I have managed to dip into quite a few research topics over the years with our TurkuNLP team. We started off with scientific literature mining, but then branched into more general development of various NLP tools and resources. I’ve always had a soft spot for Finnish and chose to contribute especially to Finnish NLP, perhaps to give back to the society which so generously hosted me for my PhD research. My personally most important – or at least most visible – undertaking was the Turku Dependency Treebank, which later on became one of the first treebanks in the super-successful Universal Dependencies (UD) initiative and allowed TurkuNLP to be an important member of the UD community from Day 1. The treebank was also the basis for TurkuNLP’s relatively widely used line of statistical dependency parsers for Finnish. I am proud that this work helped to bring Finnish into the results tables of ACL papers and to close the gap to much more studied languages, at least in terms of parsing accuracy.

Recently, I of course could not help but jump on board the deep learning tsunami. TurkuNLP’s previous work on crawling the Finnish Internet and gathering billions of words of Finnish paid off when it became a crucial part of the training corpus of the FinBERT model. If you have recently done any machine learning on the Finnish language, it is quite likely you used this model to squeeze out those extra few percentage points of accuracy. The story of FinBERT is a story of having plenty of language data ready at the right moment and shows the importance of gathering and maintaining language resources. You never know when you will next need a few billion words of Finnish.

And where do I go from here? I see it as my goal to bring to Finnish, one way or another, most of the tools, tasks, and resources that the bigger languages have. Think about question answering, summarization, semantic search, paraphrase models and many other NLP tasks not yet properly covered for Finnish. If they can exist for English, then they should also exist for Finnish. We are living in exciting times in NLP, and we now have many more opportunities to make it happen than we had even five years ago. And of course, with the LUMI supercomputer around the corner, you can expect new exciting language models from the TurkuNLP workshop.

Apart from these more or less mainstream NLP projects, I have had several, I dare say, successful collaborations in the field of digital humanities, in particular with historians. I enjoyed these projects as they challenged us with interesting technical and algorithmic problems to solve.

How is your research related to Kielipankki?

Perhaps my most visible contribution to the Language Bank is the Finnish dependency parser (of course there were many of us working on it in TurkuNLP), which is used by the Language Bank to make data more accessible to researchers. The most recent version of the parser brings about a substantial improvement in accuracy on all levels of analysis. One day, when the legislation catches up with present-day language technology needs, I also hope to see our Internet Parsebank and other large-scale web-based data contributed to the Language Bank.

Naturally, we have used the Language Bank’s resources extensively here in TurkuNLP, perhaps most of all the Suomi24 corpus, in various research projects as well as in language model training. We have also benefited enormously from the Newspaper and Periodical OCR Corpus of the National Library of Finland in our work with the historians.

I cannot stress enough how important it is for Finnish NLP that we all contribute open datasets and free tools and models to the Language Bank and also maintain our edge in terms of computational resources, with LUMI being the perfect example.

Publications

J. Kanerva & F. Ginter & S. Pyysalo 2020. Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task. Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies. DOI: 10.18653/v1/2020.iwpt-1.17

J. Kanerva & F. Ginter & T. Salakoski 2020. Universal Lemmatizer: A Sequence to Sequence Model for Lemmatizing Universal Dependencies Treebanks. Natural Language Engineering. DOI: 10.1017/S1351324920000224

J. Kanerva & F. Ginter & N. Miekka & A. Leino & T. Salakoski 2018. Turku Neural Parser Pipeline: An End-to-End System for the CoNLL 2018 Shared Task. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. DOI: 10.18653/v1/K18-2013

A. Vesanto & A. Nivala & T. Salakoski & H. Salmi & F. Ginter 2017. A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa). https://aclanthology.org/W17-0249

More information

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.

Researcher of the Month: Sampsa Holopainen

Sampsa Holopainen
Photo: Laura Horváth

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampsa Holopainen tells us about his research on the history of the Uralic languages.

Who are you?

My name is Sampsa Holopainen, and I am a researcher of the history of the Uralic languages. I am currently working as a recipient of an APART-GSK Fellowship of the Austrian Academy of Sciences at the Finno-Ugrian department of the University of Vienna. I completed my doctoral studies at the University of Helsinki; my PhD defence was in December 2019.

What is your research topic?

My current research topic is the history of Hungarian or, more widely, the history of the Ugric languages (including also Khanty and Mansi): historical phonology, etymology and loanword research. I am investigating these topics in my current project (2021–2023) Hungarian historical phonology reexamined (with special focus on Ugric vocabulary and Iranian loanwords). In my earlier work I have done research on the etymology of the other Uralic languages too, especially on the Indo-Iranian and other Indo-European lexical influence on the various Uralic languages. In 2019–2021, I worked on Finnic etymology in particular in the project Suomen vanhimman sanaston etymologinen verkkosanakirja (The digital etymological dictionary of the oldest vocabulary of Finnish) at the University of Helsinki. This project is led by Dr. Santeri Junttila and funded by the Kone Foundation.

How is your research related to Kielipankki?

As a part of my current project I am developing an etymological database of the shared vocabulary of Hungarian, Khanty and Mansi (the vocabulary traditionally reconstructed into the Ugric proto-language) and of the early Iranian loanwords of Hungarian; the database is built into the Sanat-wiki that is maintained by Kielipankki. These vocabulary layers are investigated critically and the results are presented in word-articles, and the database will also later include tables illustrating the developments of historical phonology. The database forms only part of my current research work, but it gives a good opportunity to publish research results and observations quickly and openly.

My database is based on a much larger etymological database of the Finnic languages, which has been developed in Santeri Junttila’s project Suomen vanhimman sanaston etymologinen verkkosanakirja (The digital etymological dictionary of the oldest vocabulary of Finnish). Docent Petri Kallio, MA Juha Kuokkala and MA Juho Pystynen have also worked on this project. The project is still active, but I am no longer involved in it as a full-time researcher. I think that this project is especially significant, as it has produced the excellent Wiki-database of etymology that has served as the basis of further projects on etymology, such as my own current project at the University of Vienna. The Wiki-database makes it easy to update research results and forms a good platform for researchers to communicate.

Publications

Holopainen, Sampsa 2022: Uralilaisen lingvistisen paleontologian ongelmia – mitä sanasto voi kertoa kulttuurista? – Kaheinen, Kaisla & Leisiö, Larisa & Erkkilä, Riku & Qiu, Toivo E.H. (toim.), Hämeenmaalta Jamalille: kirja Tapani Salmiselle 07.04.2022. Helsinki: Helsingin yliopiston kirjasto. 101–114. DOI: 10.31885/9789515180858.9

Holopainen, Sampsa 2021: On the question of substitution of palatovelars in Indo-European loanwords into Uralic. – Suomalais-Ugrilaisen Seuran Aikakauskirja 98. 197–233. DOI: 10.33340/susa.95365

Junttila, Santeri & Holopainen, Sampsa & Pystynen, Juho 2020: Digital Etymological Dictionary of the Oldest Vocabulary of Finnish. – Rasprave 46, 2. 733–747. DOI: 10.31724/rihjj.46.2.15

More information

 

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.
