Researcher of the Month: Juho Leinonen

Juho Leinonen
Photo: Petteri Haapaniemi

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juho Leinonen tells us about his research on automatic speech recognition, speech alignment and chatbots.

Who are you?

My name is Juho Leinonen and I am completing my PhD studies in the Speech Recognition research group led by Mikko Kurimo at Aalto University. I started my PhD studies in 2017 after a couple of years of work in industry.

What is your research topic?

The topic of my Master’s thesis was automatic speech recognition for the Sámi language, and I can build on this experience in my PhD work as well. My current research concerns chatbots and the forced alignment of speech, and it still relies on language models and acoustic models, both of which are also required in automatic speech recognition. In speech recognizers, language models help to recognize words that are pronounced in an unclear or ambiguous way, whereas chatbots need language models for generating new text. Language models can also be applied to assessing the quality of text generated by bots. The process becomes circular: in order to evaluate the results in a reliable way, we need to understand what high-quality text is like, but the same understanding is a prerequisite for generating text in the chatbot. This constitutes a philosophical problem as well as an engineering one.
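As a rough illustration of how a language model can score the quality of text, the following minimal sketch trains a toy bigram model and computes perplexity, where a lower value means the model finds the text more expected. The function names, the add-one smoothing and the tiny English training sentences are assumptions made for this example only; they do not describe the models used in the research discussed here.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams):
    """Per-token perplexity under an add-one smoothed bigram model (toy setup)."""
    vocab_size = len(unigrams) + 1  # one extra slot reserved for unseen words
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

# Hypothetical training data, for illustration only.
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
unigrams, bigrams = train_bigram_model(corpus)
fluent = perplexity("the cat sat on the rug", unigrams, bigrams)
scrambled = perplexity("rug the on sat cat the", unigrams, bigrams)
print(fluent < scrambled)
```

The circularity described above shows up even here: the score is only as trustworthy as the text the model was trained on.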

The goal in traditional speech recognition is to find the text that corresponds to the audio recording as well as possible. When developing a speech recognizer, previously aligned speech data is first required in order to train the acoustic models. Aligning text with speech is actually routine work in speech recognition. However, speech alignment would be a useful functionality for researchers in other fields as well, and it is hardly possible for everyone to become a speech recognition professional before they can get started with their own research. During the past year, I have packaged the speech recognition and alignment tools used in our research group into a toolkit that would be as easy to share as possible. I am also searching for good measures that could be used for assessing the quality of the alignment. My goal is to find out which acoustic models or features produce the best alignment, and in what sort of situations it is possible or worthwhile to use the models trained on major languages for aligning minority languages. This research has also opened up the world of language researchers for me, since I am trying to adapt the tool to suit their purposes as well as possible.
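One simple way to measure alignment quality is to compare automatically produced word boundaries with a manually checked reference. The sketch below is only an illustration under stated assumptions: the `(word, start, end)` tuple layout, the function names and the 20 ms tolerance are hypothetical choices for this example, not the measures used in the toolkit described above.

```python
def boundary_errors(reference, predicted):
    """Absolute start/end differences (in seconds) between paired word segments."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("alignments must be non-empty and cover the same words")
    errors = []
    for (ref_word, ref_start, ref_end), (word, start, end) in zip(reference, predicted):
        if ref_word != word:
            raise ValueError(f"word mismatch: {ref_word!r} vs {word!r}")
        errors.append(abs(start - ref_start))
        errors.append(abs(end - ref_end))
    return errors

def alignment_report(reference, predicted, tolerance=0.02):
    """Return (mean boundary error, share of boundaries within the tolerance)."""
    errors = boundary_errors(reference, predicted)
    mean_error = sum(errors) / len(errors)
    within = sum(e <= tolerance for e in errors) / len(errors)
    return mean_error, within

# Hypothetical segments: (word, start time, end time) in seconds.
reference = [("hyvää", 0.00, 0.40), ("päivää", 0.45, 0.95)]
predicted = [("hyvää", 0.01, 0.41), ("päivää", 0.50, 0.95)]
mean_error, within = alignment_report(reference, predicted)
print(f"mean boundary error: {mean_error * 1000:.1f} ms, within tolerance: {within:.0%}")
```

Reporting both numbers is a deliberate choice in this sketch: the mean is sensitive to a few badly misplaced boundaries, while the within-tolerance share shows how many boundaries are usable as they stand.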

How is your research related to Kielipankki?

On the spur of the moment, I ended up testing the Finnish speech recognizer developed by our group for aligning the Giellagas corpus of Northern Sámi. This project gave me the idea of cross-language alignment that is described in my latest publication (Leinonen, Virpioja & Kurimo, 2021). An alignment tool developed for one language can thus possibly be applied to aligning speech and text in other languages as well, provided that the sound and writing systems of the languages are sufficiently similar. In the future, I will also be utilizing other previously aligned speech corpora that are in the Language Bank of Finland. The automatic speech aligner that I have used in my research is now also available for other researchers as part of the Aalto University Automatic Speech Recognition System (Aalto-ASR v.2) that has been installed in the Puhti computing environment at CSC.

For training chatbots, I also use the Suomi24 corpus available in the Language Bank. It may seem strange to use the sort of language found in online discussion forums for “training” purposes. However, huge amounts of text are required in order to train useful language models, and finding suitable material in sufficiently large quantities is very difficult.

Publications related to Kielipankki

Leinonen, J., Smit, P., Virpioja, S., & Kurimo, M. (2017). New baseline in automatic speech recognition for Northern Sámi. In International Workshop on Computational Linguistics for the Uralic Languages (pp. 89-99). https://doi.org/10.18653/v1/W18-0208

Leino, K., Leinonen, J., Singh, M., Virpioja, S., & Kurimo, M. (2020). FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics. In Interspeech (pp. 429-433). International Speech Communication Association. https://doi.org/10.21437/Interspeech.2020-2511

Leinonen, J., Virpioja, S., & Kurimo, M. (2021, May). Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. http://hdl.handle.net/10138/330758

 


The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.

Researcher of the Month: Okko Räsänen

Okko Räsänen
Photo: Jonne Renvall/Tampere University

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Okko Räsänen tells us about his research on the computational modeling of infant language development.

Who are you?

I am Okko Räsänen, Associate Professor and Academy Research Fellow at the Unit of Computing Sciences of Tampere University, where I also lead the Speech and Cognition research group. Before moving to Tampere, I worked at the Department of Signal Processing and Acoustics at Aalto University, where I am Docent in Speech Processing.

What is your research topic?

The main topics of my research are the computational modeling of infants’ early language acquisition and the speech that infants hear. Our aim is to understand the principles of information processing that underlie language learning: What sort of transformations and processing steps does the speech signal undergo in the human brain so that the individual can learn to comprehend it, and how can we build similar language capabilities into artificial intelligence systems? We are interested in what sort of linguistic structures can be acquired in a language-independent and unsupervised manner from speech and from the rest of the sensory information that is available to children. We also study the learning mechanisms and presuppositions that must be included in the models in order for the learning to succeed. Another interesting question is what kind of language input and other multisensory information infants are generally able to hear and perceive during their early language development, and to what extent the acquisition of linguistic structures (e.g., sounds and words) is supported by the amount, quality, and multisensory nature of the input.

In addition to computational models, we have developed practical tools for the automated analysis of large child-centered audio data sets, which can help us to better understand the characteristics of the speech heard by children. The data sets typically consist of day-long recordings made with wearable microphones in children’s natural acoustic and linguistic environments. For example, in the recently completed international collaboration project Analyzing Child Language Experiences around the World, we analyzed about 14,000 hours of child-centered audio material in order to study children’s early language experiences in various linguistic and cultural settings. Our next goal is to turn our analysis results into publications.

Computational research in language learning is multidisciplinary and interesting work, but on the other hand, it is also challenging. In order to work with speech signals and to model human learning processes, an in-depth command of signal processing and machine learning methods is required. In addition, however, it is important to have a good understanding of phonetics, early language development and the functioning of human cognition, so as to make it possible to reconcile the new models and methods with theory and data from language development research.

In addition to research on language acquisition, my research team develops various analysis methods for speech, e.g., for evaluating the health condition or the emotional state of a given speaker. My group is also involved in the development of smart wearables for babies for the clinical assessment and monitoring of their neurophysiological and motor development (as part of the Academy of Finland’s Health from Science research program). Moreover, I work on many other themes in speech technology, cognitive science, and signal analytics based on machine learning. Often, the signal processing and machine learning methods that are used in speech technology are also well suited for processing a wide variety of time series data.

How is your research related to Kielipankki?

In my research, I have used the FinDialogue corpus that is currently on its way to the Language Bank of Finland, and many other corpora that are provided by the Language Bank are also familiar to me. I am looking forward to the publication of the speech material collected during the Donate Speech campaign for research use. In my opinion, the Language Bank is also a viable publication channel for any new data that we may create during our research in the future.

Publications related to Kielipankki

Khorrami, K. & Räsänen, O. (2021). Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? – A computational investigation. Language Development Research, https://doi.org/10.34842/w3vw-s845

Räsänen, O., Seshadri, S., Lavechin, M., Cristia, A., & Casillas, M. (2021). ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Behavior Research Methods, 53, 818–835, https://doi.org/10.3758/s13428-020-01460-x.

Räsänen, O., Doyle, G., & Frank, M. C. (2018). Pre-linguistic segmentation of speech into syllable-like units. Cognition, 171, 130–150, https://doi.org/10.1016/j.cognition.2017.11.003.

Kakouros, S., Salminen, N. & Räsänen, O. (2018). Making predictable unpredictable with style — Behavioral and electrophysiological evidence for the critical role of prosodic expectations in the perception of prominence in speech. Neuropsychologia, 109, 181–199, https://doi.org/10.1016/j.neuropsychologia.2017.12.011.

Räsänen, O., Kakouros, S. & Soderstrom, M. (2018). Is infant-directed speech interesting because it is surprising? — Linking properties of IDS to statistical learning and attention at the prosodic level. Cognition, 178, 193–206, https://doi.org/10.1016/j.cognition.2018.05.015.

Rasilo, H. & Räsänen, O. (2017). An online model of vowel imitation learning. Speech Communication, 86, 1–23, https://doi.org/10.1016/j.specom.2016.10.010.

Räsänen, O. & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4), 792–829, https://doi.org/10.1037/a0039702.

 

 


Researcher of the Month: Olli Kuparinen

Olli Kuparinen
Photo: Ilona Lehtonen

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Olli Kuparinen tells us about his research on language variation and change, for which he has used the Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s), the Samples of Spoken Finnish and the Finnish Dialect Syntax Archive.

Who are you?

I am Olli Kuparinen, Doctor of Philosophy in Finnish language. In my doctoral dissertation, which I defended in June 2021, I studied change in the Finnish spoken in Helsinki and theories of language change. My dissertation was written in the multidisciplinary research group Kippo, and the study was funded by the Kone Foundation.

What is your research topic?

I study the variation and change in spoken Finnish as well as the theories that are utilized in sociolinguistics. My research methods have for the most part been statistical.

My dissertation scrutinized the change in the Finnish spoken in Helsinki from the 1970s to the 2010s. The real-time corpus covering three time points enabled me to study the concrete changes in Helsinki as well as to test theories that have been drafted on the basis of studies of one or two time points. Studying three time points calls into question, for instance, the practicality of the patterns of change put forth by William Labov.

In my postdoctoral research I will examine the variation in Finnish dialects and the ways variation is discussed in works on dialects.

How is your research related to Kielipankki?

In my dissertation I used the Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s), which consists of interviews of Helsinki natives from the 1970s, 1990s and 2010s. The interviews are available as sound files in the Language Bank. Many of the interviews have also been transcribed. In my dissertation I focused mainly on the transcriptions.

During my work on Helsinki Finnish I have also utilized the Samples of Spoken Finnish as a test corpus for different statistical models. I also plan to use the corpus in my postdoctoral research, in which I study variation in Finnish dialects. The great benefit of the corpus is that it has been translated into standard Finnish. This makes it possible, for instance, to apply different machine learning algorithms to the corpus in order to scrutinize the topics of the interviews.

I also plan to use the Finnish Dialect Syntax Archive as a supplement for the Samples of Spoken Finnish in my postdoctoral work.

Publications related to Kielipankki

Kuparinen, Olli 2018: Infinitiivien variaatio ja muutos Helsingissä. – Virittäjä 122 s. 29–52. https://doi.org/10.23982/vir.65310

Kuparinen, Olli 2021: Muutoksen mekanismit. Kolmen aikapisteen reaaliaikatutkimus Helsingin puhekielestä. Tampereen yliopiston väitöskirjat 428. Tampere: Tampereen yliopisto 2021. http://urn.fi/URN:ISBN:978-952-03-1990-8 

Kuparinen, Olli – Mustanoja, Liisa – Peltonen, Jaakko – Santaharju, Jenni – Leino, Unni 2019: Muutosmallit kolmen aikapisteen pitkittäisaineiston valossa. – Sananjalka 61 s. 30–56. https://doi.org/10.30673/sja.80056

Kuparinen, Olli – Peltonen, Jaakko – Mustanoja, Liisa – Leino, Unni – Santaharju, Jenni 2021: Lects in Helsinki Finnish: a probabilistic component modeling approach. – Language Variation and Change. https://doi.org/10.1017/s0954394521000041


Researcher of the Month: Karita Suomalainen

Karita Suomalainen
Photo: Heidi Suomalainen

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Karita Suomalainen tells us about her research in interactional linguistics, for which she has used the ArkiSyn Database of Finnish Conversational Discourse, the Finnish Dialect Syntax Archive and the Suomi24 Sentences Corpus 2001-2017.

Who are you?

I am Karita Suomalainen, Doctor of Philosophy in Finnish language. I defended my doctoral dissertation in December 2020, and at the moment, I am working as a university teacher at the University of Turku. During the academic year 2021–2022 I will be visiting Aarhus University in Denmark as a postdoctoral researcher, with a grant that I received from the Finnish Academy of Science and Letters via the Foundations’ Post Doc Pool (Säätiöiden post doc -pooli).

What is your research topic?

My main research interests lie in the area of interactional linguistics. My research concerns the way different grammatical structures are used in interactional contexts. In particular, I have worked on the use of different referential expressions.

My doctoral dissertation examined the second person singular, focusing on the variation of its use in Finnish everyday conversations. My study revealed that, in addition to referring to and addressing the recipient, the second person singular forms can also be used in fixed expressions (e.g., tietsä ‘(do) you know’) or to create open reference, so that they do not refer exclusively to the addressee, but rather describe interpersonal or generic experiences or states of affairs; similar uses of the second person singular can be found in many other languages. My current postdoctoral project deals with the grammaticalization of verbal constructions expressing person. The goal of the project is to describe the use, development and status of these expressions in Finnish and to compare them to similar expressions in Danish.

In collaboration with Ritva Laury and Anna Vatanen, I have also worked on the use of the Finnish se että construction in spoken language. In addition, I have examined the linguistic features of online hate speech together with Simo Määttä and Ulla Tuomarla.

How is your research related to Kielipankki?

Most of my research is actually based on data that is also available in the corpora of Kielipankki – the Language Bank of Finland. My doctoral dissertation was part of the project “Arkisyn: Morphosyntactically coded database of conversational Finnish” (funded by Kone Foundation). The project produced a morphosyntactically annotated corpus of everyday Finnish conversations that is also available in Kielipankki (ArkiSyn Database of Finnish Conversational Discourse, Helsinki Korp Version). The corpus enables research on morphosyntactic phenomena in conversational data, and this feature has been very useful in my own research. I have also used the Finnish Dialect Syntax Archive, which makes it possible to examine older spoken language from a diachronic perspective. It is also possible to listen to the samples of the data, and that feature has been especially useful for a spoken language researcher like me. I appreciate that Kielipankki also hosts spoken language corpora – I know that coding such data is not always a very simple task.

In our research on online hate speech, Simo Määttä, Ulla Tuomarla and I have analyzed a discussion thread found within the Suomi24 corpus available in Kielipankki. Our study was based on the qualitative analysis of a particular case, but it would be interesting to use corpus data for a more comprehensive study. However, it turned out in our project that it is difficult to define specific lexical or grammatical search criteria that could be used for locating samples of hate speech. New solutions would need to be considered in order to extend the analysis.

Publications related to Kielipankki

Suomalainen, Karita (2020): Kuka sinä on? Tutkimus yksikön 2. persoonan käytöstä ja käytön variaatiosta suomenkielisissä arkikeskusteluissa [Who is ‘you’? On the use of the second person singular in Finnish everyday conversations]. Annales Universitatis Turkuensis C 499. Doctoral dissertation. http://urn.fi/URN:ISBN:978-951-29-8238-7

Suomalainen, Karita – Vatanen, Anna – Laury, Ritva (2020): The Finnish se että initiated expressions: NPs or not? In Sandra Thompson & Tsuyoshi Ono (eds.), The ‘Noun Phrase’ across Languages. An emergent unit in interaction, 12–41. Typological Studies in Language 128. Amsterdam: John Benjamins. https://doi.org/10.1075/tsl.128.02suo

Määttä, Simo – Suomalainen, Karita – Tuomarla, Ulla (2020): Maahanmuuttovastaisen ideologian ja ryhmäidentiteetin rakentuminen Suomi24-keskustelussa [Constructing anti-immigration ideology and group identity in an online conversation thread on the Suomi24 discussion board]. Virittäjä 124 (2), 190–216. https://doi.org/10.23982/vir.81931


Researcher of the Month: Mila Oiva

Mila Oiva
Photo: Mila Oiva

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mila Oiva tells us about her research in Cultural History, including the making of the Yves Montand in the USSR interviews.

Who are you?

My name is Mila Oiva. I’m a Cultural Historian and I work as a Senior Researcher at CUDAN Open Lab at Tallinn University. CUDAN is a Horizon 2020-funded cultural data analytics initiative that studies cultural phenomena by integrating qualitative and quantitative approaches from humanities, social sciences, network science, complexity science and beyond.

What is your research topic?

I study how knowledge and assumptions circulate, and how the communication tools in use affect the way knowledge moves and takes shape. For example, I have studied the global circulation of news in 19th-century newspapers (https://oceanicexchanges.org/) and the circulation of popular interpretations of history in Russian-language web discussions in the 2010s (https://sites.utu.fi/pseudohistoria/en/). In addition, I have explored the construction and reception of the tour of the French-Italian singer-actor Yves Montand to the Soviet Union in 1956-57 in the context of the Cold War. All these studies, which I have carried out in collaboration with my colleagues, demonstrate in an interesting way how our assumptions are built simultaneously as global phenomena and as local interpretations of them.

How is your research related to Kielipankki?

I am about to publish, at the Language Bank of Finland, the collection of oral history interviews that we made for our book Yves Montand in the USSR. Cultural Diplomacy and Mixed Messages (Palgrave Macmillan 2021), so that it can be used for research and teaching purposes. Historians still share their data relatively rarely, but I think that the dataset can also be useful for other scholars and students interested in the memories of Soviet popular culture. Furthermore, this year marks the 100th anniversary of Montand’s birth, and publishing memories concerning his Soviet tour is a good way to celebrate it!

Publications related to Kielipankki

Oiva, Mila, Hannu Salmi, and Bruce Johnson. Yves Montand in the USSR: Cultural Diplomacy and Mixed Messages. Palgrave Macmillan, 2021. https://doi.org/10.1007/978-3-030-69048-9.

Fridlund, Mats, Mila Oiva, and Petri Paju, eds. Digital Readings of History. History Research in the Digital Era. Helsinki: Helsinki University Press, 2020. https://doi.org/10.33134/HUP-5.

Oiva, Mila, Asko Nivala, Hannu Salmi, Otto Latva, Marja Jalava, Jana Keck, Laura Martínez Domínguez, and James Parker. “Spreading News in 1904. The Media Coverage of Nikolay Bobrikov’s Shooting.” Media History 25, no. 3 (August 11, 2019): 1–17. https://doi.org/10.1080/13688804.2019.1652090.

 

 

 


Researcher of the Month: Gwenaëlle Bauvois

Gwenaëlle Bauvois 
Photo: Gwenaëlle Bauvois

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Gwenaëlle Bauvois tells us about her research based on various media data sources, including the Plenary Sessions of the Parliament of Finland, Downloadable Version 1 available via Kielipankki.

Who are you?

I am a researcher based at the Centre for Research of Ethnic Relations and Nationalism (CEREN) at the Swedish School of Social Science, University of Helsinki. I hold a PhD in Sociology.

What is your research topic?

I am interested in right-wing populism, countermedia, reinformation, hybrid media and post-truth. My interest in these phenomena was really sparked in 2015 after the Charlie Hebdo events, and I have been working on these topics since then.

Niko Pyrhönen
Photo: Niko Pyrhönen
Tuukka Ylä-Anttila
Photo: Ilkka Vuorinen

In the years 2016–2019, my colleagues Niko Pyrhönen and Tuukka Ylä-Anttila and I were involved in a research project called “Mobilizing ‘the Disenfranchised’ in Finland, France and the United States. Post-truth public stories in the transnational hybrid media space”. We studied how countermedia mobilizes a “disenfranchised” community of people who are losing trust in the mainstream media. ‘Countermedia’ refers to partisan media that oppose conventional media and the establishment. For this project, we collected data from online media located in Finland, France and the United States.

Some of the results of our project were published in our co-authored article Politicization of migration in the countermedia style: A computational and qualitative analysis of populist discourse (2019). In this paper, we set out to investigate whether countermedia style is also used in the arena of ‘high politics’ – in this case the Parliament of Finland – and if so, how and by whom. The results of our computational and qualitative analysis of media data from Helsingin Sanomat and MV Lehti (2015-2017) and of the Plenary Sessions of the Parliament of Finland (years 2015-2016) showed that countermedia style expressions are indeed used in parliamentary debates, especially by the populist right-wing Finns Party, during debates on the “refugee crisis”.

How is your research related to Kielipankki?

As one of our data sets for this research, we used the minutes from the years 2015-2016 that were included in the Plenary Sessions of the Parliament of Finland, Downloadable Version 1, available via the Language Bank of Finland. The selected subset of the data contains the full transcripts of 183 parliamentary sessions and 6819 speeches that we analyzed computationally and qualitatively.

Publications related to Kielipankki

Tuukka Ylä-Anttila, Gwenaëlle Bauvois & Niko Pyrhönen (2019). Politicization of migration in the countermedia style: A computational and qualitative analysis of populist discourse. Discourse, Context & Media, 32: 1–8. Available: https://doi.org/10.1016/j.dcm.2019.100326.

 

 


Researcher of the Month: Heikki Rasilo

Heikki Rasilo
Photo: Jessie Dupont

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Heikki Rasilo tells us about his use of the Aalto University DSP Course Conversation Corpus for his research related to speech production.

Who are you?

I am Heikki Rasilo, a postdoc researcher in the Artificial Intelligence Lab at Vrije Universiteit Brussel, Belgium. I got my PhD as a joint degree between VUB and Aalto University in 2017. After working in the private sector for a couple of years, I received a research grant from the Ulla Tuominen Foundation, through the Finnish Foundations’ Post Doc Pool (Säätiöiden post doc -pooli), for continuing my research.

What is your research topic?

Since the beginning of my PhD studies, my main research focus has been on physical speech production and on its learning mechanisms. How do human children learn to articulate and imitate the speech of their parents while using their own vocal tracts, which are of a very different size and shape? The acoustic properties of adult and infant speech are different as well, and it is difficult to compare them directly. Nevertheless, children learn to articulate their mother tongue, and I am interested in whether the articulatory learning process can also affect the way in which we recognize and comprehend speech. Perhaps one of the reasons why we understand speech better than machines is that we know the physical mechanism through which speech is produced.

I am currently investigating whether the acoustic representations of speech that are formed in learning speech articulation could also be utilized in automatic speech recognition. The amount of recorded speech data that is required in order to train the world’s best speech recognizers is vast, and human children are not likely to encounter a similar amount of speech during their speech acquisition process. Therefore, it must be possible to learn to understand speech with smaller amounts of data, and physical articulation may play a role in the learning process.

How is your research related to Kielipankki?

In a study that was published last year, I trained a neural network to simultaneously recognize both phonemes and physical articulation from speech. The hypothesis was that the articulatory learning would shape the representations the network would learn, and that these new representations could also be helpful in recognizing phonemes. For the experiment, I needed some recorded speech as well as articulatory information related to it. In the Language Bank of Finland, I found the Aalto University DSP Course Conversation Corpus that contained a sufficient amount of Finnish speech material including phonemic transcriptions. From the transcriptions, I was able to generate coarse synthetic articulatory data by using a Finnish speech synthesizer. The results of the experiment were promising – the articulatory learning did shape the speech representations in ways that can enhance phoneme recognition.
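Training one network on two tasks at once is commonly done by adding the two losses together, so that both tasks shape the shared representations. The sketch below shows only that combination step under assumptions of my own: the cross-entropy and mean-squared-error terms, the `artic_weight` parameter and all function names are hypothetical illustrations and are not taken from the published study.

```python
import math

def cross_entropy(probabilities, target_index):
    """Negative log-probability assigned to the correct phoneme class."""
    return -math.log(probabilities[target_index])

def mean_squared_error(predicted, target):
    """Average squared difference between predicted and target articulator values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def joint_loss(phoneme_probs, phoneme_target, artic_pred, artic_target, artic_weight=0.5):
    """Phoneme loss plus a weighted articulatory loss (the weight is an assumed value)."""
    phoneme_term = cross_entropy(phoneme_probs, phoneme_target)
    artic_term = mean_squared_error(artic_pred, artic_target)
    return phoneme_term + artic_weight * artic_term

# Hypothetical outputs for one speech frame: three phoneme classes and
# two articulatory parameters (for example, tongue height and lip opening).
loss = joint_loss([0.7, 0.2, 0.1], 0, [0.4, 0.9], [0.5, 1.0])
print(loss)
```

Setting `artic_weight` to zero reduces this to ordinary phoneme classification, which is one way to compare a network trained with and without the articulatory task.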

In my previous research, I have also used the CAREGIVER Corpus (available via ELRA), which consists of simple sentences and their orthographic transcriptions. Together with Academy Research Fellow Okko Räsänen, I used the corpus to investigate certain algorithms for learning word-meaning mappings, word segmentation and acoustic patterns related to words.

Publications related to Kielipankki

Rasilo, H. (2020). Phonemic learning based on articulatory-acoustic speech representations. In S. Denison, M. Mack, Y. Xu, & B.C. Armstrong (Eds.), Proceedings of the 42nd Annual Conference of the Cognitive Science Society (pp. 2203–2209). Cognitive Science Society. Available at: https://cogsci.mindmodeling.org/2020/papers/0512/index.html

Rasilo, H. & Räsänen, O. (2017). An online model for vowel imitation learning. Speech Communication, 86, 1–23. Available at: https://doi.org/10.1016/j.specom.2016.10.010

Räsänen, O. & Rasilo, H. (2015). A joint model of word segmentation and meaning acquisition through cross-situational learning. Psychological Review, 122(4), 792–829. Available at: https://psycnet.apa.org/doi/10.1037/a0039702

Rasilo, H. & Räsänen, O. (2015). Weakly-supervised word learning is improved by an active online algorithm. Proceedings of the 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, pp. 1561–1565. Available at: https://www.isca-speech.org/archive/interspeech_2015/i15_1561.html

 

 

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.

All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.

Researcher of the Month: Emmi Lahti

Emmi Lahti
Photo: Julius Jaakola

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Emmi Lahti tells us about her research that is related to rhetoric and discourse studies and based on the Suomi 24 Corpus (2016H2), available via Kielipankki.

Who are you?

My name is Emmi Lahti and I am a grant researcher at the University of Helsinki. I finished my doctoral dissertation in the discipline of Finnish language in 2019. I am especially interested in argumentation and rhetoric as well as in critical discourse analysis. I am fascinated by the various ways in which language participates in the social construction of reality.

What is your research topic?

In my dissertation research, I analyzed the rhetoric of discussions on immigration. As data, I used immigration-related discussion threads on Suomi 24 from the year 2015. In particular, I investigated the linguistic construction of various groups, the types of arguments and argumentation strategies used, and the ways of showing agreement or disagreement with other participants in the discussions.

The results of the study showed how mutual solidarity and support are expressed by the like-minded discussants who are opposed to immigration and how these participants construct a common view of the world and common argumentation.

How is your research related to Kielipankki?

In my doctoral study, I utilized the Suomi 24 corpora available in Kielipankki – the Language Bank of Finland. The Suomi 24 Sentences Corpus (2016H2) can be used via the Korp user interface in Kielipankki, and the corresponding data referred to as the Suomi 24 Corpus (2016H2) can be downloaded for research purposes. In my study, I ended up selecting the downloadable version of the corpus from which I collected 117 discussion threads for my analysis.

Publications related to Kielipankki

Lahti, Emmi (2019). Maahanmuuttokeskustelun retoriikkaa. Doctoral dissertation. Helsinki: University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5707-2

 


Researcher of the Month: Mats Fridlund

Mats Fridlund
Photo: Mats Fridlund

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Professor Mats Fridlund tells us about his research that is related to digital history and supported by the tools and corpora available via Kielipankki in Finland and via Språkbanken in Sweden.

Who are you?

I am Associate Professor of History of Science & Ideas and Deputy Director of the Centre for Digital Humanities at the University of Gothenburg. By training, I am a diploma engineer in Engineering Physics and hold a PhD in History of Technology from KTH Royal Institute of Technology in Stockholm. From 2013 to 2018, I worked in Finland as Associate Professor in the History of Industrialization at Aalto University.

What is your research topic?

As a historian of science, technology and innovation and an emerging digital historian, my research is focused on infrastructure history and on the political history of technology.

Within infrastructure history, I initially did research on the role of users in the development of electric power and telecommunications systems, while during the last couple of years I have broadened these interests towards digital infrastructures. I especially focus on how academic users such as historical researchers have changed their professional practices to take advantage of the affordances of new digital infrastructures such as those made possible by the Language Bank of Finland. Connected to this is also my most recent interest in digital humanities.

Since 2012 I have been involved in various efforts in Finland and Sweden to develop digital humanities in general and digital history in particular. I have been principal investigator of two Kone Foundation funded projects to develop and strengthen Finnish digital history (see Paju et al. 2020). Since 2019 I have been deputy director of the Centre for Digital Humanities at the University of Gothenburg, where I get several opportunities to put these interests into practice together with language technologists and engineers, developing new digital infrastructures for scholars in the humanities and social sciences and for the wider public.

My current research on the political history of technology is focused on the global history of the technology of terrorism from the late 18th century until the present. I currently lead two research projects on the history of terrorism. The first is “Things for living with terror: a global history of the materialities of urban terror and security”, funded by the Swedish Riksbankens Jubileumsfond. The second is the large research project “Terrorism in Swedish politics (SweTerror): A multimodal study of the configuration of terrorism in parliamentary debates, legislation and policy networks in Sweden 1968–2018”, which is part of the digital humanities DIGARV research program initiated by the Government of Sweden and financed by the Swedish Research Council, Riksbankens Jubileumsfond and the Royal Swedish Academy of Letters, History and Antiquities. In SweTerror I collaborate with the National Language Bank (Språkbanken) in Sweden to analyse and make digitally accessible the text and audio corpora of the political debates of the Swedish Parliament.

How is your research related to Kielipankki?

As a part of my research on the history of terrorism I use various large digital text corpora to analyse various media discourses to trace the historical emergence of terrorism as a political and cultural phenomenon. One of the projects that I am currently involved in is conducted together with language technologists from Swedish Språkbanken and with support from Swe-Clarin where we analyse historical Swedish-language newspaper corpora accessible through two national CLARIN B-centers: the National Language Bank (Nationella språkbanken) in Sweden and the Language Bank of Finland (Kielipankki) to determine how the modern meaning of terrorism emerged from the 18th century. This research is part of an initiative of Swe-Clarin to develop genuine interdisciplinary collaboration between researchers in humanities and language technology, using e-science tools for large-scale corpus studies. Thus, the project combines history domain knowledge and language technology expertise to evaluate and expand on earlier research claims regarding the historical meanings associated with terrorism in Swedish and Finnish contexts.

Primarily, we are interested in testing the hypothesis that sub-state terrorism’s modern meaning was not yet established in the 19th century but primarily restricted to Russian terrorism. Using a cross-border comparative approach we explore overlapping national discourses on terrorism. By using the Korp tool, installed in the Swedish as well as in the Finnish language banks, we have been able to efficiently investigate terrorism-related words and their historical contexts to show a more complex image of the history of terrorism in the Nordic countries, especially the meanings associated with salient state terrorism and various forms of ethnic sub-state terrorisms within Great Power empires, i.e. Finnish terrorism within the Russian empire, Macedonian terrorism within the Ottoman empire and Indian terrorism in the British empire. Together with Finnish historians of terrorism and language technologists, we are planning to extend the analysis to the wider Finnish context via the corresponding Finnish-language newspaper corpora in Kielipankki. Furthermore, the study allows us to develop the concrete practices of cross-border comparative studies by utilizing the extensive corpus resources of Swe-Clarin and FIN-CLARIN. There are great opportunities for researchers in the humanities and language technologists to conduct cross-disciplinary, comparative big data studies on national online newspaper corpora.

Kielipankki has also been important not just through the tools it provides but also in other, less direct ways in my work on strengthening digital humanities research in Finland. In 2018, as Principal Investigator of the Kone Foundation project “From Roadmap to Roadshow: A collective demonstration & information project to strengthen Finnish digital history”, I organized a roadshow to the six Finnish universities of Oulu, Jyväskylä, Eastern Finland, Turku, Tampere and Helsinki. At each university we arranged a one-day digital history methods workshop with lectures and hands-on sessions led by experienced digital historians, language technologists and information technology specialists from Finland, Sweden and the United States. Among them was Kielipankki’s application specialist Tero Aalto, who contributed a much-appreciated lecture on “Digital Methods in Language Research”. The great enthusiasm that the roadshow lectures generated among Finnish historians led to an unplanned expansion and continuation of this project. In May 2018, together with my two postdoctoral researchers Mila Oiva and Petri Paju, I organized a workshop where we matched up historians curious about digital humanities with language technologists and information technology specialists to jointly explore, develop and conduct digital history research projects. In December 2020 several of these project ideas are published as peer-reviewed research articles in one of the first Open Access books of Helsinki University Press, Digital Histories: Emergent Approaches within the New Digital History, edited by myself together with Mila Oiva and Petri Paju.

Publications related to Kielipankki

Mats Fridlund, Leif-Jöran Olsson, Daniel Brodén & Lars Borin, 2019 “Trawling for Terrorists: A Big Data Analysis of Conceptual Meanings and Contexts in Swedish Newspapers, 1780–1926”, in Melvin Wevers, Mohammed Hasanuzzaman, Gaël Dias, Marten Düring, & Adam Jatowt, eds. Proceedings of the 5th International Workshop on Computational History (HistoInformatics 2019) co-located with the 23rd International Conference on Theory and Practice of Digital Libraries (TPDL 2019) Oslo, Norway, September 12th, 2019, CEUR-WS vol. 2461 (Aachen: CEUR-WS.org, 2019), 1-10. http://ceur-ws.org/Vol-2461/paper_5.pdf

Mats Fridlund, Leif-Jöran Olsson, Daniel Brodén & Lars Borin, 2020 “Trawling the Gulf of Bothnia of News: A Big Data Analysis of the Emergence of Terrorism in Swedish and Finnish Newspapers, 1780–1926”, in Costanza Navarretta & Maria Eskevich, eds. Proceedings of CLARIN Annual Conference 2020 (Virtual edition: CLARIN, 2020), 61-65. https://office.clarin.eu/v/CE-2020-1738-CLARIN2020_ConferenceProceedings.pdf

Mats Fridlund, Mila Oiva, & Petri Paju, eds., 2020 Digital Histories: Emergent Approaches within the New Digital History (Helsinki: Helsinki University Press, 2020), 3-18. https://doi.org/10.33134/HUP-5

Mats Fridlund, 2020 “Digital History 1.5: A Middle Way between Normal and Paradigmatic Digital Historical Research”, in Mats Fridlund, Mila Oiva, & Petri Paju, eds., Digital Histories: Emergent Approaches within the New Digital History (Helsinki: Helsinki University Press, 2020), 69-87. https://doi.org/10.33134/HUP-5

Paju, Petri & Mila Oiva. “Digitaalisen historiantutkimuksen opetuskiertue”, Historiallinen Aikakauskirja 1/2019, pp. 89-94.

Petri Paju, Mila Oiva & Mats Fridlund, 2020 “Digital and Distant Histories: Emergent Approaches within the New Digital History”, in Mats Fridlund, Mila Oiva, & Petri Paju, eds., Digital Histories: Emergent Approaches within the New Digital History (Helsinki: Helsinki University Press, 2020), 3-18. https://doi.org/10.33134/HUP-5

 


Researcher of the Month: Tommi Jauhiainen

Tommi Jauhiainen
Photo: Heidi Jauhiainen

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tommi Jauhiainen works as a Project Planning Officer in Kielipankki and he is currently starting his two-year post doc. Here, Tommi tells us about his research related to some language resources in Kielipankki.

Who are you?

I am Tommi Jauhiainen and at the moment I work as a Project Planning Officer in Kielipankki. At the beginning of 2021, I will start as a postdoctoral researcher on a grant from the Finnish Research Impact Foundation.

What is your research topic?

During the past ten years, my research has focused on language identification of text. On this topic, I completed my Master’s thesis in 2010 and my PhD dissertation in 2019. Language identification refers to the comparison of a text written in an unknown language to a set of given languages. A similar method can also be used to classify texts by subject area, for example.

The difficulty of language identification varies greatly depending on the situation. The task is easy in case there are only a few clearly different languages to choose from, such as Finnish and Swedish, and if the texts are reasonably long, for example several sentences. In case there are hundreds of languages to choose from, if the languages are close to each other (e.g. Kven and Meänkieli) and/or if the texts are short (e.g. single words only), it may be very difficult to identify the language.
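As a rough illustration of comparing an unknown text against a set of given languages, the sketch below scores a text with character trigram counts and add-one smoothing. This is a generic textbook-style approach, not the method developed in the research described here; the tiny training sentences and all function names are made up for the example, and a real system would be trained on far more data.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def train(samples, n=3):
    """Build one n-gram count table per language from its training text."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}

def score(text, models, n=3):
    """Log-probability of the text under each language model (add-one smoothing)."""
    vocab_size = len(set().union(*models.values()))
    grams = char_ngrams(text, n)
    scores = {}
    for lang, counts in models.items():
        denom = sum(counts.values()) + vocab_size + 1
        scores[lang] = sum(math.log((counts[g] + 1) / denom) for g in grams)
    return scores

def identify(text, models, n=3):
    """Return the language whose model gives the text the highest score."""
    scores = score(text, models, n)
    return max(scores, key=scores.get)

# Toy training data (illustrative only)
SAMPLES = {
    "fi": "tämä on suomenkielinen lause ja se kertoo kielestä sekä sen tunnistamisesta",
    "sv": "det här är en svensk mening och den handlar om språk och hur man känner igen det",
}
MODELS = train(SAMPLES)

print(identify("kielen tunnistaminen on helppoa", MODELS))
print(identify("jag känner igen språket", MODELS))
```

With only two clearly different languages and a sentence-length input, even this toy model has plenty of distinguishing trigrams to work with. A single short word yields only a handful of trigrams, which is one way to see why short texts and closely related languages make the task harder.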

Last year, our extensive survey of automatic language identification in texts was published in the Journal of Artificial Intelligence Research. We are also currently working on a textbook on the same topic. The book is expected to be published in the “Synthesis Lectures on Human Language Technologies” series by Morgan & Claypool in late 2021.

During and after my PhD research, I have participated in several international shared tasks that have focused on distinguishing between very close languages or dialects. In 2018, we won the shared tasks focusing on Swiss German dialects and Indo-Aryan languages, and last year we won a shared task focusing on different versions of Mandarin Chinese. I am also a member of the ”Ancient Near Eastern Empires” Centre of Excellence, in which context I have studied how cuneiform texts written in different dialects of Akkadian and Sumerian could be distinguished from one another. I organized an international shared task on this topic last year, and the winner was a Canadian research team using deep learning.

In the forthcoming “Language Identification of Speech and Text” project, funded by the Finnish Research Impact Foundation, I will move towards the study of language identification in speech, in addition to text. Until now, the research fields of speech and text language identification have been relatively separate from each other, and my intention is to bring more collaboration between them.

How is your research related to Kielipankki?

Most of my PhD research was done in the Finno-Ugric Languages and Internet project, which was part of the FIN-CLARIN research group that maintains Kielipankki. In the project, we searched the Internet for websites written in small Uralic languages, created a portal site for them, and compiled sentence corpora from the texts they contained. During the processes of harvesting the web and creating the sentence corpora, we used automatic language identification as part of the workflow. The portal site, Wanca, is now part of the tools maintained by Kielipankki, and the Wanca 2016 corpora can be found in Kielipankki in three different versions. The Wanca 2017 corpora are being used in the ongoing ULI (Uralic Language Identification) shared task, and the corpora will be published next year.

Publications related to Kielipankki:

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2015). The Finno-Ugric Languages and the Internet project. In First International Workshop on Computational Linguistics for Uralic Languages: Proceedings of the Workshop (Vol. 2, pp. 87–98). (Septentrio Conference Series; Vol. 2015, No. 2). Septentrio Academic Publishing. https://doi.org/10.7557/scs.2015.2

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2015). Language Set Identification in Noisy Synthetic Multilingual Documents. In Computational Linguistics and Intelligent Text Processing (Vol. Part I, pp. 633-643). (Lecture Notes in Computer Science; Vol. 9041). Springer International Publishing AG. https://doi.org/10.1007/978-3-319-18111-0_48

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2016). HeLI, a Word-Based Backoff Method for Language Identification. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects: VarDial3, Osaka, Japan, December 12 2016 (pp. 153-162). https://www.aclweb.org/anthology/W16-4820/

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2017). Evaluation of language identification methods using 285 languages. In 21st Nordic Conference of Computational Linguistics: Proceedings of the Conference (pp. 183-191). (Linköping Electronic Conference Proceedings; No. 31). Linköping University Electronic Press. https://www.aclweb.org/anthology/W17-0221/

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). Iterative Language Model Adaptation for Indo-Aryan Language Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 66-75). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-3907

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). HeLI-based Experiments in Swiss German Dialect Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 254-262). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-3929

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2019). Wanca in Korp: Text corpora for underresourced Uralic languages. In Proceedings of the Research data and humanities (RDHUM) 2019 conference : data, methods and tools (pp. 21-40). Studia Humaniora Ouluensia; No. 17. University of Oulu.

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2019). Language Model Adaptation for Language and Dialect Identification of Text. Natural Language Engineering, 25(5), 561-583. https://doi.org/10.1017/S135132491900038X

Jauhiainen, T. (2019). Language identification in texts. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5131-5

Jauhiainen, T., Jauhiainen, H., Alstola, T., & Linden, K. (2019). Language and Dialect Identification of Cuneiform Texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 89-98). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1409/

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2019). Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 178-187). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1419/

Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2019). Automatic Language Identification in Texts: A Survey. Journal of Artificial Intelligence Research, 65, 675-782. https://doi.org/10.1613/jair.1.11675

Zampieri, M., Malmasi, S., Scherrer, Y., Samardžic, T., Tyers, F., Silfverberg, M. P., Klyueva, N., Pan, T-L., Huang, C-R., Ionescu, R. T., Butnaru, A., & Jauhiainen, T. S. (2019). A Report on the Third VarDial Evaluation Campaign. In Proceedings of the (pp. 1-16). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1401/

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2020). Building Web Corpora for Minority Languages. In Proceedings of the 12th Web as Corpus Workshop (pp. 23-32). The Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.wac-1.4

Gaman, M., Hovy, D., Ionescu, R. T., Jauhiainen, H., Jauhiainen, T., Linden, K., Ljubešić, N., Partanen, N., Purschke, C., Scherrer, Y., & Zampieri, M. (Accepted/In press). A Report on the VarDial Evaluation Campaign 2020. In Proceedings of VarDial 2020

Jauhiainen, T., Jauhiainen, H., Partanen, N., & Linden, K. (Accepted/In press). Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora. In Proceedings of VarDial 2020 https://arxiv.org/pdf/2008.12169.pdf

Lindgren, M., Jauhiainen, T., & Kurimo, M. (2020). Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets. In Proceedings of Interspeech 2020 (pp. 467-471) http://www.interspeech2020.org/uploadfile/pdf/Mon-1-11-5.pdf

 

 


Researcher of the Month: Tommi Kurki

Tommi Kurki

Photo: Kaisla Kurki

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Adjunct Professor, Senior Lecturer Tommi Kurki from the University of Turku tells us about how he makes use of the resources provided by Kielipankki.

Who are you?

I am Adjunct Professor in the Finnish language at the University of Turku and work there as a senior lecturer. My fields of expertise include sociolinguistics, especially language variation and change (in the Finnish language) and methodology in sociolinguistics. Currently, I am the principal investigator in Digilang, an infrastructure project where the digital linguistic research materials of the School of Languages and Translation Studies in the University of Turku are collected, organized and developed (see Kurki & al. 2018).

What is your research topic and how is it related to the Language Bank of Finland?

I am interested in several linguistic topics, most of which have been connected with language change. In my early undergraduate years, I got familiar with longitudinal corpora, and this is probably why I have been interested in many types of Finnish corpora, and especially longitudinal ones, ever since. I have used at least the Follow-up Study of Dialects of Finnish corpus, the Finnish Dialect Syntax Archive, Samples of Spoken Finnish and the Digital Morphology Archives. When examining variation in Finnish, I have usually dealt with phonological, morphophonological and morphological features, but during the past few years I have tried to extend my scope to prosodic features as well.

Linguistic corpora have, moreover, been an essential part of my career: collecting and processing material, compiling and developing corpora. In the 1990s, I was recruited as a trainee to the Finnish Dialect Follow-up Project conducted by Kotus (the former Research Institute for the Languages of Finland, currently the Institute for the Languages of Finland). In the project, I wrote my MA thesis (1998a) and two research reports (1998b, 1999) as a young researcher in Kotus. As part of the Follow-up Project, I also completed my doctoral thesis (2005), which dealt with the mechanisms of language change as well as the methodology of studying language change.

Until today, all the projects directed by me have been connected with spoken language and linguistic corpora. ”Linguistic Variation in the Province of Satakunta in the 21st Century” is a sociolinguistic project funded by the Finnish Cultural Foundation. In this project, over 200 local speakers were recorded, representing various age groups and 16 municipalities in Satakunta. Currently, this data is being morphologically and syntactically annotated. The corpus is to be made available in the Language Bank of Finland during the next few years. The data from this project and from the Samples of Spoken Finnish corpus (available in the Language Bank) have been analyzed for instance in Kurki & al., 2011.

The Regional and Social Variation in Finnish Prosody Project is funded by the Kone Foundation and the Digilang project, and it was started in 2013 by me and my colleague Dr Tommi Nieminen (see for example Kurki & al. 2014). In this project, we compiled a sociophonetic corpus where speakers recorded their voices over the Internet in elicitation tasks. Representative sets of data from this corpus are being segmented and annotated. The objective of this project is to examine the prosody of Finnish and to pay more attention to regional and social variation than before. This corpus will also be available in the Language Bank of Finland in a few years.

Apart from my research projects, the Language Bank of Finland has been an integral part of my work as a lecturer and supervisor. When I was working in the Syntax Archive, one of my most important tasks was to introduce students to different linguistic corpora and to help them find good material for their BA and MA theses. Suitable examples and materials for my students were easy to find when I was giving courses on Finnish dialects and dialectology or on corpus linguistics. All the corpus projects I am running at the moment were originally planned so as to make the collected data available via the Language Bank of Finland. As a speech and language research expert, I have also participated in designing the Donate Speech campaign (by Vake) in collaboration with Professor Mikko Kurimo (from Aalto University) and the Language Bank of Finland.

Publications related to the resources:

Kurki, Tommi 1998a: Kui Kuivlahdel puhuta? Eurajoen vanhan murteen ja puhekielen vertailua sekä ikäryhmittäisten ja sukupuolikohtaisten erojen tarkastelua. Pro gradu ja suomen murteiden seuruuhankkeen osatutkimus (118 sivua + 39 liitesivua). Turun yliopisto, suomen kieli.

Kurki, Tommi 1998b: Kielellinen vaihtelu ja muutos Alastaron murteessa. Kotimaisten kielten tutkimuskeskuksen seuruuhankkeen tutkimusraportti. (79 sivua + 35 liitesivua). Helsinki: Kotus.

Kurki, Tommi 1999: Kielellinen vaihtelu ja muutos Pälkäneen murteessa. Kotimaisten kielten tutkimuskeskuksen seuruuhankkeen tutkimusraportti.  (114 sivua + 51 liitesivua). Helsinki: Kotus.

Kurki, Tommi 2005: Yksilön ja ryhmän kielen reaaliaikainen muuttuminen. Kielenmuutosten seuraamisesta ja niiden tarkastelussa käytettävistä menetelmistä. SKST 1036. SKS, Helsinki.

Kurki, Tommi, Siitonen, Kirsti, Väänänen, Milja, Ivaska, Ilmari & Ekberg, Jari 2011: Ensi havaintoja Satakuntalaisuus puheessa ‐hankkeesta. Sananjalka 53, 83–108. DOI: https://doi.org/10.30673/sja.86706.

Kurki, Tommi – Nieminen, Tommi – Kallio, Heini & Behravan, Hamid 2014: Uusi puhesuomen variaatiota tarkasteleva hanke. Katse kohti prosodisia ilmiöitä. – Sananjalka 56 s. 186–195. URN: http://urn.fi/urn:nbn:fi:ele-1733815.

Kurki, Tommi – Inaba, Nobufumi – Kaivapalu, Annekatrin – Koponen, Maarit – Laippala, Veronika – Leblay, Christophe – Luutonen, Jorma – Mutta, Maarit – Nikulin, Markku & Reunanen, Elisa 2018: Digilang – Turun yliopiston digitaalisia kieliaineistoja kehittämässä. – Proceedings of the Research Data and Humanities (RDHum) 2019 Conference: Data, Methods and Tools, p. 41–56. Studia Humaniora Ouluensia 17. Oulu: University of Oulu. URN: http://urn.fi/urn:isbn:9789526223216.

 

 


Researcher of the Month: Saana Svärd

Photo: Lauri Laine

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Associate Professor in Ancient Near Eastern Studies at the University of Helsinki Saana Svärd tells us about how she makes use of the resources provided by Kielipankki.

Who are you?

I am Saana Svärd, Associate Professor in Ancient Near Eastern Studies and the director of the Academy of Finland funded Centre of Excellence in Ancient Near Eastern Empires, ANEE.

What is your research topic?

I’m originally an assyriologist, which means that I research different kinds of historical phenomena by using primary sources, e.g. the cuneiform texts from the ancient Near East. In my research I have especially investigated what we can deduce about the status and roles of women in ancient Mesopotamia. This research is still ongoing, but during the last four years I have concentrated on developing digital humanities approaches in my field. There exist hundreds of thousands of texts from the ancient Near East, and even though they are only partly available in digital form, there is a lot of digital research material available.

In the team I lead (ANEE team 1), we have conducted diverse research by combining the methods of language technology and Assyriology. This type of language-technological research is new in the field of ancient Near Eastern studies and there is great potential for new discoveries. Our latest article is related to fear. How is fear depicted in the cuneiform texts? We created a semantic field from five verbs usually translated as “to fear,” as well as their derivatives. Among other things, the results showed that the vocabulary depicting fear in this ancient Semitic language (called Akkadian) is very diverse. Different lexemes for fear were used in different text genres and some of the fear-words were reserved for very specialized use. For example, the word pirittu “terror” is found almost exclusively in a certain type of prayer.

How is your research related to Kielipankki?

Kielipankki is essential for my research. Our digitized cuneiform sources originate from the Open Richly Annotated Cuneiform Corpus portal (ORACC), but they are also available in Kielipankki (oracc). By using the Korp tool from Kielipankki, we have been able to efficiently investigate interesting words and their contexts. We are able to get interesting results about the semantic dimensions of an individual word or concept by using the language technology tools we have created, but the results always need to be investigated in their original context. Korp makes this possible in an easy way. Furthermore, Korp search results also provide links to the original texts in ORACC, so that researchers can follow the links all the way to a photograph of the cuneiform tablet, if needed.

Publications related to the resources:

Svärd, Saana, Tero Alstola, Heidi Jauhiainen, Aleksi Sahala, and Krister Lindén. Fear in Akkadian Texts. In S.-W. Hsu and J. Llop-Radua (eds.), The Expression of Emotions in Ancient Egypt and Mesopotamia. Culture and History of the Ancient Near East (CHANE), 116. Brill. Coming out in December 2020 (https://brill.com/view/title/57151)

Tero Alstola, Shana Zaia, Aleksi Sahala, Heidi Jauhiainen, Saana Svärd, Krister Lindén. 2019. “Aššur and His Friends: A Statistical Analysis of Neo-Assyrian Texts.” Journal of Cuneiform Studies 71, pp. 159–180. https://doi.org/10.1086/703859

Saana Svärd, Heidi Jauhiainen, Aleksi Sahala, Krister Lindén 2018 ”Semantic Domains in Akkadian Texts” in Vanessa Juloux, Amy Gansell, & Alessandro di Ludovico, (eds.) CyberResearch on the Ancient Near East and Neighboring Regions: Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving. Digital Biblical Studies 2. Brill: Leiden, pp 224-256. DOI: https://doi.org/10.1163/9789004375086_009


Researcher of the Month: Tuomo Hiippala

Photo: Veikko Somerpuro

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Assistant Professor of English Language and Digital Humanities in the Department of Languages at the University of Helsinki Tuomo Hiippala tells us about how he makes use of the resources provided by Kielipankki.

Who are you?

My name is Tuomo Hiippala and I have been Assistant Professor of English Language and Digital Humanities in the Department of Languages at the University of Helsinki since 2018.

What is your research topic?

My research focuses on multimodality, that is, the way human communication relies on appropriate combinations of expressive resources. The phenomenon of multimodality is increasingly acknowledged as an inherent feature of human communication. To exemplify, face-to-face interaction involves constant coordination of spoken language, gestures, gaze and posture, whereas page-based documents regularly use combinations of written language, photographs, diagrams and layout to communicate with their readers.

Just which expressive resources are combined and how depends largely on the communicative situation in question. Humans engage in a wide range of communicative situations every day and negotiate them largely without problems. I am interested in the underlying principles of multimodal communication that help humans navigate this diversity. This understanding, however, is not likely to be achieved without empirical research, which is currently slowed down by the lack of large corpora with rich annotations. For this reason, I am especially interested in developing computational approaches that would enable conducting empirical research on multimodality at scale.

How is your research related to Kielipankki?

I have published two multimodal corpora that are distributed through Kielipankki: one related to my doctoral dissertation (GeM-HTB) and another created as a part of my recent research project (AI2D-RST). I find the service to be extremely useful for long-term storage and easy access to the data, and plan to continue sharing multimodal corpora created in future projects.

Publications related to the resources:

Tuomo Hiippala (2016) Helsingin kaupungin matkailuesitteiden multimodaalinen korpus. Terra 128(2): 75-85.

Tuomo Hiippala, Malihe Alikhani, Jonas Haverinen, Timo Kalliokoski, Evanfiya Logacheva, Serafina Orekhova, Aino Tuomainen, Matthew Stone, John A. Bateman (2020) AI2D-RST: A multimodal corpus of 1000 primary school science diagrams. arXiv: arXiv:1912.03879


Researcher of the Month: Jenny Tarvainen

Photo: Inka Huuskonen

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Jenny Tarvainen, a graduate of the University of Jyväskylä, tells us about how she makes use of the International Corpus of Learner Finnish (ICLFI) and the Suomi 24 Corpus, both provided by Kielipankki.

Who are you?

I am Jenny Tarvainen. In January 2019, I graduated from the University of Jyväskylä with a master’s degree, majoring in Finnish language. At the moment I teach Finnish to immigrants, and I intend to start my doctoral studies in the near future. I was drawn into corpus research already during my bachelor’s studies, and I do not expect this interest to change. The Language Bank of Finland, Kielipankki, has become quite familiar to me over the years.

What is your research or development work topic?

My Master’s thesis (Tarvainen 2018) presented a comparative corpus study on the phraseological features of the verb SAADA (to gain) in native Finnish and learner Finnish. The aim was to find out, with Contrastive Interlanguage Analysis (CIA), how the usage of the verb SAADA by Finnish language learners differs from how native speakers use this verb. To address these differences, I focused on the word forms and the meanings in the cotext of the verb. I also studied the correlation between these forms and meanings with statistical methods. An interesting finding was that the correlation between the forms and the meanings was stronger in the usage of those studying Finnish as a foreign language than in the texts by native speakers, i.e. a specific form of the verb SAADA appeared in the learner language more often with a specific meaning found in the cotext. For example, the discussion around the verb form saavat (they get) most probably focuses on family or people in general, whereas the themes found around the base form saada are place, direction and area.

During my studies and after graduating I have also worked as a research assistant in research projects led by Jarmo Jantunen, Professor of Finnish Language at the University of Jyväskylä. The research projects study how homosexual and heterosexual people are discussed in the media (Jantunen 2018) and what kind of discourses arise when the discussion concerns different cities in the Metropolitan area (forthcoming). During these research projects I have learned about Computer-Assisted Discourse Studies (CADS). At the moment I am working on the research plan for my application for doctoral studies in the autumn.
Corpora will provide data for my research in the future, too: I intend to use machine learning to study discourses related to the Metropolitan area in the Suomi 24 Corpus.

How is your research related to Kielipankki?

For the Master’s thesis I compiled the data from the International Corpus of Learner Finnish (ICLFI).
The corpus comprises texts written by students of Finnish as a foreign language, which have been categorized according to the reference levels of the Common European Framework of Reference for Languages (CEFR). I used the texts of the advanced students because the reference data was compiled from texts by native Finnish speakers. The variety of texts (essays, summaries, emails, job applications…) made it possible to study learner language widely instead of studying features that are typical of only a specific genre, or the impact of a specific native tongue.

The Suomi 24 Corpus provided by Kielipankki has offered data for the other studies. It has been possible to sample smaller subcorpora from the data based on the search results, such as subcorpora on homosexual and heterosexual people and subcorpora on the different cities in the Metropolitan area, which provide access to the discourses in question.

Publications related to the resource

Tarvainen, Jenny 2018: SAADA-verbin fraseologiaa: vertaileva korpustutkimus oppijan- ja natiivikielestä. Master’s thesis. University of Jyväskylä. https://jyx.jyu.fi/handle/123456789/59273?show=full
Jantunen, Jarmo H. 2018: Homot ja heterot Suomi24:ssä: analyysi digitaalisista diskursseista. Puhe ja kieli, 38(1), 3–22. https://doi.org/10.23997/pk.65488

 


Researcher of the Month: Sam Hardwick

Photo: Bess Hardwick

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sam Hardwick, project researcher at the University of Helsinki, tells us about developing some of the tools provided by the Language Bank, Kielipankki.

Who are you?

I’m a freelance consultant, researcher and programmer. I started in language technology at the University of Helsinki in a research software project called HFST. We developed code for computational morphology, which ended up being used in e.g. inflecting dictionaries and spellcheckers for languages with extensive morphology (like Finnish, Sámi and Greenlandic). Since then I’ve worked on the technical side of various infrastructure and research projects, and done private consulting work.

What is your research or development work topic?

Right now I’m involved with publishing a sentiment corpus for Finnish. This is a collection of texts gathered from social media with their sentiment – whether they are positive, neutral or negative – annotated by humans. This will be the basis for automatic sentiment classification for future corpora and tools.

I’m also involved with the ANEE project, helping to make a treebank for Akkadian, which again will be the basis of an automatic annotation tool. Hopefully we’ll ultimately be able to automatically annotate more of the texts in this ancient language.

How is the development work related to Kielipankki?

I’ve done a lot of development work directly for Kielipankki. For example, right now I’m planning an API for accessing corpora directly from code. NLP applications are more and more the domain of general machine learning people, not just language experts, and there’s a lot of interest in our data and resources.

Publications related to the resources or tools:

Hardwick, S., Enqvist, E. J., Onikki-Rantajääskö, T. A., & Linden, B.K. J. (2018). Tieteen kansallinen termipankki (TTP) ja tiedonlouhinnan apuneuvot. Poster (in Finnish) at the Annual Conference of Linguistics, Helsinki, Finland.

I’ve published demonstrations for various bits of code and analysis, some of it perhaps comprehensible in English, here: https://www.kielipankki.fi/tools/demo/

 


Researcher of the Month: Anna Puupponen

Photo: Tapio Laitinen

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anna Puupponen, postdoctoral researcher at the University of Jyväskylä, tells us about how she makes use of the resources Corpus of Finnish Sign Language and “ProGram data: The stories Snowman and Frog, where are you?” in her research.

Who are you?

I am Anna Puupponen and I am working as a postdoctoral researcher at the Sign Language Centre at the University of Jyväskylä. I finalized my PhD in May 2019 and at the moment I am continuing my postdoctoral research on Finnish Sign Language (FinSL).

What is your research topic?

My doctoral research focused on a relatively understudied area of sign language linguistics: signers’ head and body movements. In the PhD project I studied actions of the signer’s head and body and the role that these actions play in the structuring of language, in interaction and in the transfer of meaning.

I am currently doing research within various projects at the Sign Language Centre, focusing on embodied communication in signed situations, the similarities and the differences between the signing of adults and children, the sign language processing through neuroimaging, and the signing fluency of native signers and sign language learners.

How is the research work related to Kielipankki?

Several multimodal resources that I have helped to compile and have used in my research have been published in the Language Bank of Finland, Kielipankki. A corpus comprising signed stories, the Snowfrog corpus (ProGram data: The stories Snowman and Frog, where are you?), was published in 2016 and the first part of the Corpus of Finnish Sign Language (Corpus FinSL) in 2019. In linguistic research on sign languages, corpus data can be seen as having an especially central role. Sign languages often have a weak status in society, they lack well-developed institutional standards, and their transmission from one generation to the next is disrupted. In building descriptions and grammars of sign languages, it is important to study language-internal variation from extensive data sets. Sign language corpora are also important for the development of sign language teaching.

This data-driven approach played a central role in my PhD project. I used the sign language corpora published in Kielipankki in studies where I focused on the sequences of actions of the head and body, and the semiotic features of these sequences, in signed narratives and conversations. As the Snowfrog corpus and Corpus FinSL are very similar to the relevant corpora published on Swedish Sign Language with respect to the principles of compilation, I could also conduct a comparative study between Finnish and Swedish Sign Language in my doctoral research.

Currently I’m using Corpus FinSL in a research project where we focus on the depictive language use of signers of different ages. The first part of The Corpus of Finnish Sign Language published in Kielipankki comprises signed narratives and discussions from 21 signers aged between 18 and 29 years. In the project we analyse Corpus FinSL data as well as data from children using FinSL collected in the VIKKE project hosted by the Sign Language Centre.

Publications related to the resource:

Puupponen, A. (2019). Understanding nonmanuality: A study on the actions of the head and body in Finnish Sign Language. PhD dissertation. University of Jyväskylä.
Puupponen, A. (2019). Towards understanding nonmanuality: A semiotic treatment of signers’ head movements. Glossa: a journal of general linguistics 4(1): 39. 1–39. DOI: https://doi.org/10.5334/gjgl.709
Jantunen, T.; Mesch, J.; Puupponen, A. & Laaksonen, J. (2016). On the rhythm of head movements in Finnish and Swedish Sign Language sentences. In Proceedings of Speech Prosody 2016 [organized at Boston University, May 31–June 3, 2016], pp. 850–853
Press release of Anna Puupponen’s dissertation on the website of the University of Jyväskylä.

The developer’s perspective on the Corpus of Finnish Sign Language was presented in the interview with Juhana Salonen in May 2020.

 


Researcher of the Month: Juhana Salonen

Photo: Hanna-Kaisa Hämäläinen

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Juhana Salonen, project researcher at the University of Jyväskylä, tells us about publishing the Corpus of Finnish Sign Language.

Who are you?

My name is Juhana Salonen and I work as a project researcher in the Sign Language Centre of the University of Jyväskylä. I’m responsible for the corpus work of Finland’s national sign languages (Finnish and Finland-Swedish Sign Language). Majoring in Finnish Sign Language, I graduated with an M. Phil. in the fall of 2012.

What is your research topic?

Together with the team, I am working on an infrastructure for research on the corpora of both sign languages. I have been working in the corpus project since 2014, during which time we have filmed a total of 103 native sign language users from all over Finland. I acted as a guide in the filming sessions, where I was able to follow informants’ conversations and narrations up close while they were recorded from a total of seven different camera angles. The result was over 700 hours of video footage. After the data collection and editing, the video material was annotated using the ELAN program (Eudico Linguistic Annotator). The annotation was carried out by distinguishing utterances from the signed text stream at both the sign and sentence levels. The signs were identified with the help of ID-glosses that are connected online to a lexical database of the Finnish Signbank, and the sentences were translated into Finnish. We have tried to make the annotation of the large dataset as systematic as possible, so that the data can be used by different researchers for a range of research objectives.

How is the research related to Kielipankki?

The primary goals of our corpus work are to preserve the data in the long term, and to publish various parts of it, which will be done in accordance with the informants’ research consent and the terms of data protection legislation. The Language Bank has provided an excellent setting for achieving our goals, for which we are very grateful. The first subset of the Corpus of Finnish Sign Language (Corpus FinSL) was transferred to the Language Bank in March 2019. Corpus FinSL comprises approximately 14.5 hours of video material from 21 signers, together with textual annotations and metadata. The material is divided into two subcorpora (Corpus of Finnish Sign Language: elicited narratives and Corpus of Finnish Sign Language: conversations), the first of which is publicly available and the second of which requires a research plan and personal access rights, in accordance with the RES license of the Language Bank. The published data has already been exploited both in research on Finnish Sign Language and in teaching, which is only the prelude to a great leap forward in the field of sign language, for example in terms of the development of both learning materials and the social status of the language.

Publications related to the resource:

· Salonen, J., Puupponen, A., Takkinen, R. & Jantunen, T. (2019). Suomen viittomakielten korpusta rakentamassa [Building the corpus of Finland’s sign languages]. In Jantunen, Jarmo Harri; Brunni, Sisko; Kunnas, Niina; Palviainen, Santeri; Västi, Katja (Eds.) Proceedings of the Research data and humanities (RDHUM) 2019 conference: data, methods and tools, Studia Humaniora Ouluensia, 17. Oulu: Oulun yliopisto, 83-98. http://urn.fi/urn:isbn:9789526223216

· The Corpus of Finnish Sign Language (Corpus FinSL) in the Language Bank: http://urn.fi/urn:nbn:fi:lb-2019012321

· Homepage of the corpus work on Finland’s sign languages: http://r.jyu.fi/AB7

 


Researcher of the Month: Mikhail Mikhailov

Photo: University of Helsinki

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikhail Mikhailov, Professor of Translation Studies at Tampere University, tells us about how he makes use of the resources ParFin, Finnish-Russian Parallel Corpus of Literary Texts and ParRus 2016, Russian-Finnish Parallel Corpus of Literary Texts.

Who are you?

I am Mikhail Mikhailov, Professor of Translation Studies (Finnish and Russian) at Tampere University.

What is your research topic?

I collect and study multilingual text corpora with an emphasis on parallel corpora. Several language corpora were collected under my supervision, e.g. ParRus (Russian-Finnish corpus of literary texts), ParFin (Finnish-Russian corpus of literary texts), FiRuLex (Russian-Finnish comparable corpus of legal texts), PEST (Parallel Electronic Corpus of State Treaties, Finnish-Russian-Swedish-English), MLCCA (Multilingual Corpus of Contracts and Agreements). I also develop corpus management software. My research is on the border between linguistics and translation studies. I am trying to find out what the difference is between texts initially written in language X and texts translated into language X. I am working with the language pair Finnish-Russian and to some extent with other pairs like Russian-English.

How is the research work related to Kielipankki?

I have been collaborating with the Language Bank for quite a long time. FIN-CLARIN has supported some of my corpus projects: ParRus 2016, Russian-Finnish Parallel Corpus of Literary Texts; ParFin, Finnish-Russian Parallel Corpus of Literary Texts; and, recently, MLCCA, which will be published in the Language Bank.

Publications related to the resource you have used:

Mikhailov Mikhail, Cooper Robert. (2016). Corpus Linguistics for Translation and Contrastive Studies: a guide for research. London and New York: Routledge.
Mikhailov, Mikhail. (2019). The Extent of Similarity: comparing texts by their frequency lists. In Jantunen, Jarmo Harri et al. (eds.) Proceedings of the Research Data and Humanities (RDHum) 2019 Conference: Data, Methods And Tools. Oulu: Oulun yliopisto, 159-178. (Studia humaniora ouluensia 17).
Mikhailov Mikhail. (2017). Are Classical Principles of Corpus Compiling Applicable to Parallel Corpora of Literary Texts?. In Zybatow Lew N, Stauder Andy, Ustaszewski Michael (eds.) Translation Studies and Translation Practice: Proceedings of the 2nd International TRANSLATA Conference, 2014 Part 1. Frankfurt am Main, Bern, Bruxelles, New York, Oxford, Warszawa, Wien: Peter Lang, 151-157. (Forum Translationswissenschaft 19).

 


Researcher of the Month: Markus Mattila

Photo: Markus Mattila

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Markus Mattila, an MA graduate of Åbo Akademi University, tells us about how he makes use of the resource The Suomi 24 Sentences Corpus (2017H2) (beta).

Who are you?

I am Markus Mattila. I graduated last year from Åbo Akademi University with Master’s degrees in Finnish language and English language and literature. I also have a Master of Economic Sciences degree from before. At the moment, I am working as a substitute teacher and planning to take up postgraduate studies.

What is your research topic?

In my MA thesis in Finnish language, I studied language change, focusing on situational idioms containing possessive suffixes and, to be more precise, the agreement between the possessive suffix and the subject of the clause. In standard usage, the possessive suffix should agree with the subject, as in olen huolissani vs. *olen huolissaan. The research questions investigated in my thesis were:
• How common is the use of non-agreeing possessive suffixes in the first person singular in certain situational idioms?
• Have there been any changes in the proportion of non-agreeing forms – i.e. forms contrary to the usual norms – during the period under investigation in this study?
• Do the studied idioms differ from one another with respect to how common the use of the non-agreeing possessive suffix is?
Based on a pilot study, the expressions selected for further investigation were olla huolissaan [to be worried], olla pahoillaan [to be sorry] and olla innoissaan [to be excited]. In order to answer my research questions, I conducted a corpus study comprising three time periods: 2001–2006, 2007–2011 and 2012–2017. The statistical significance of the results was tested using cross tabulation and Pearson’s chi-square (χ²) test.
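As a rough illustration of the kind of test mentioned above, the following Python sketch computes Pearson’s chi-square statistic for a cross tabulation of time period against agreeing and non-agreeing forms. The counts are hypothetical placeholders invented for this example, not figures from the thesis, and the function name chi_square is likewise only illustrative.

```python
# Hypothetical counts for illustration only; these are NOT results from the thesis.
# Rows: time periods; columns: (agreeing forms, non-agreeing forms).
observed = {
    "2001-2006": (950, 50),
    "2007-2011": (930, 70),
    "2012-2017": (900, 100),
}


def chi_square(table):
    """Return (statistic, degrees of freedom) for Pearson's chi-square
    test of independence on a cross tabulation given as {row: counts}."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for r, row in enumerate(rows):
        for c, obs in enumerate(row):
            # Expected count if period and form were independent.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


stat, dof = chi_square(observed)
print(f"chi-square = {stat:.2f}, df = {dof}")
```

With three periods and two forms there are two degrees of freedom, so a statistic above the critical value of 5.991 would indicate a statistically significant difference between the periods at the 0.05 level.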

How is the research work related to Kielipankki?

A corpus-based study was the best possible method for researching such a rare phenomenon. Since language change takes place more often in spoken language than in the more controlled and stable written language, I chose to take my research data from the vast Suomi24 corpus provided by Kielipankki. The corpus consists of all discussions in the discussion forum Suomi24 between 2001 and 2017. These discussions, which are unofficial and written under a pseudonym, are a lot closer to spoken language than texts in official documents, news articles or literature, and are thus a very useful resource for investigating a research topic of this kind.
The specific resource in my research was The Suomi 24 Sentences Corpus (2017H2) (beta), which I first used as a whole to gain an overall picture of the data. After that, I divided the messages into the aforementioned time periods in order to study the possible changes. The corpus data was retrieved with the web-based Korp concordancer tool available at Kielipankki, which I found simple and pleasant to use. One factor contributing to this positive experience was the excellent technical support provided, for which I would like once more to express my gratitude to the personnel concerned.

Publications related to the resource you have used:

Mattila, M. (2019): “Olen pahoillani ja huolissaan”: Tutkimus persoonakongruenssista olotilanilmausidiomeissa Suomi24-korpuksessa 2001–2017, Master’s thesis (Pro gradu). Åbo Akademi. http://www.urn.fi/URN:NBN:fi-fe2019062421760

 


Researcher of the Month: Anita Nuopponen

Photo: Harri Huusko

 

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Anita Nuopponen, professor in technical communication at the University of Vaasa, tells us about how she makes use of the resource The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version.

Who are you?

I am Anita Nuopponen, professor in technical communication from the School of Marketing and Communication, Communication Studies, University of Vaasa.

What is your research topic?

I have once again returned to terminology research, the subject of my dissertation in 1994. My special interest still concerns relations between concepts. The typology I have created for them is still relevant, since there is a need to distinguish between various types of concept relations in information systems and digitalization initiatives. Some of the relation types in the classification are going to be included in the next version of the international Terminology Standard ISO 704. The second current research area I am focusing on – also related to conceptual relations – is developing a systematic concept analysis method that makes use of the relations. I am currently working on an article on conceptual analysis as an aid to research work and also on a collaborative article on terminological methods in teaching special languages to students in various fields. Both will appear in the VAKKI Symposium series.

How is the research work related to Kielipankki?

At the moment I am on research leave, working partly on FIN-CLARIN initiative funding. The aim is to create content for Kielipankki similar to what I have maintained on my Terminology Forum site since 1994. I have thus returned to continue the work I started years ago! I am now looking for online vocabularies and glossaries available in Finnish in various fields and compiling a link list of them, with the aim of depositing glossaries in Kielipankki’s collections where possible. Interested parties in various fields, including teachers, enterprises, associations and other organizations, have compiled glossaries covering their own fields and published them online. Many people could benefit from these if only they were available. Not all glossaries end up in TSK’s TEPA term bank or the Helsinki Term Bank for the Arts and Sciences. Many valuable resources disappear when the creator of the vocabulary changes jobs or retires, or when a company’s website is renewed.

I became familiar with Kielipankki in connection with my presentation “Vaikeasti käsitettävä käsitteen käsite” [The concept of the concept is difficult to comprehend] at the Annual Conference on Linguistics in 2015. I used the data from the year 2000 included in The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version to study the definitions of concept, how the word concept is used, and how concepts are addressed, focusing mainly on general language. The word concept is frequently used in general language, and its function follows the personal intuition of each writer. Often that intuition matches the definition used in terminological research and given in general-language dictionaries. However, as early as the next sentence it may be confused with word, term or even phenomenon. This often happens in scientific writing, too.

Publications related to the resource you have used:

The paper on concepts mentioned above is yet to be published. The present project on Terminology Forum has not yet resulted in related publications, but there are presentations and articles from various contexts on making use of the internet in collecting and disseminating terminological resources. (Publication list: http://lipas.uwasa.fi/~atn/AnitaNuopponen/index.html)

 

