Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mietta Lennes tells us about her PhD study and about her work in FIN-CLARIN.
My name is Mietta Lennes and I work as a Project Planning Officer for the FIN-CLARIN consortium, which is coordinated by the University of Helsinki. I help and advise researchers and students with various problems related to managing, analyzing and publishing language corpora. In addition, I teach online courses in corpus linguistics, speech analysis and data management. I am a phonetician by training.
My forthcoming doctoral dissertation deals with the link between the phonetic variability and the frequencies of words in casual spoken Finnish. For instance, it is already known that, in any language, words that occur often tend to be shorter than words that occur rarely. However, the phonetic phenomena that may affect this situation can only be studied with a sufficiently large corpus. Furthermore, the speech recordings must be of high technical quality so as to allow for reliable acoustic-phonetic measurements.
For phonetic analysis, I have used a corpus called The FinINTAS Corpus of Spontaneous and Read-aloud Finnish Speech, and especially its subcorpus FinDialogue, which contains conversational speech. The corpus will be available in the Language Bank of Finland when I finish my PhD study. The FinINTAS corpus was mainly collected during the international INTAS 00-915 project and the associated Finnish projects, in which the phonetic properties of reading aloud were compared with those of spontaneous speech. In practice, I was responsible for planning and coordinating the speech recordings and the annotation work of the corpus. Several students in Phonetics and Finnish from Helsinki as well as from St. Petersburg participated in these efforts. Together, we gradually managed to annotate the corpus comprehensively enough to produce some publications.
In my PhD study, I also needed information about the frequencies of word forms in spoken Finnish. The number of word tokens in the FinDialogue corpus alone was too small for this purpose, and there were no suitable corpora available in the Language Bank of Finland at that time. Fortunately, the transcripts of the 1970s subcorpus of what is now called the Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s) (Helpuhe1) happened to be available on the server of the Department of Linguistics of the University of Helsinki, and I was able to use these texts. The Helsinki spoken material was even somewhat similar in style to the FinDialogue corpus. However, the transcription practices of the material collected in the 1970s had varied a great deal, so I needed to manually edit and harmonize the texts in order to calculate at least approximate word frequencies. Looking back on this messy project, it feels great to know that all three subcorpora of the Longitudinal Corpus of Finnish Spoken in Helsinki – both the audio recordings and their aligned transcripts – have more recently been deposited in the Language Bank of Finland, thanks to the research group of Hanna Lappalainen.
A researcher who collects language material often runs into the fact that a huge mass of texts or a collection of audio recordings alone does not directly provide the desired answers. I have learned from experience in many projects that it is easy to make audio and video recordings of speech, but it takes a lot of planning and hard work to collect the material systematically and then to prepare, organize, transcribe and annotate the files, which tends to be much more time-consuming. The researcher should also carefully describe the data and the analysis methods. One needs to make sure that it will be possible to make gradual improvements to the study and to reuse the data later.
Even if the corpus has been properly created, some manual labour or tailored automatic methods may be necessary in order to perform the specific analysis that is required to answer the research question. In this detective work, a collection of services like Kielipankki, together with the entire network of researchers within FIN-CLARIN, can be extremely valuable. I believe that, in the future, versatile skills in data management will become an increasingly important part of any researcher’s competence.
My own work in FIN-CLARIN is interesting and varied. It feels great to be able to help a student or a researcher solve a technical problem related to his or her research or to discover a tool that matches the purpose. Together with the entire Kielipankki team and the co-operating partners of FIN-CLARIN, we also brainstorm and develop new services that can be provided via Kielipankki for the researchers’ benefit.
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers in Finland to use, to refine, to preserve and to share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive.