Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mikko Kurimo tells us about his research on automatic speech recognition.
For my PhD dissertation 25 years ago, I developed neural network algorithms to make automatic speech recognition more accurate and more robust. In order to train statistical models for recognizing speech sounds, it is necessary to utilize large amounts of speech material where the sounds are aligned with the corresponding text. At that time, very few such corpora were available. Thus, the research team had to collect and process the data themselves. When we developed automatic methods for aligning speech and text, it became possible to utilize larger data sets such as audiobooks and radio and television news (e.g., FBC – The Finnish Broadcast Corpus) in training the Finnish speech recognizer.
However, sufficient accuracy cannot be reached just by modeling individual speech sounds, since they do not appear separately in speech and in practice they are modified to fit in the word and sentence context. Therefore, the speech recognizer must also be provided with a model of the language in question. On the basis of the language model, the recognizer decides which words and sentences are represented by the observed speech sound sequences. To train the language model, huge quantities of text are required that should also contain a large variety of examples of different types of language use. For training the Finnish speech recognizer, we have used, e.g., the Finnish Text Collection (FTC).
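The interplay between the acoustic scores and the language model can be illustrated with a minimal sketch. It is a toy example under stated assumptions and not the Aalto-ASR implementation: the add-one smoothed bigram model, the function names, the training sentences and the acoustic scores are all invented for illustration.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams in whitespace-tokenized training text."""
    unigrams, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])                  # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # word-pair counts
    return unigrams, bigrams, len(vocab)


def lm_log_prob(sentence, lm):
    """Add-one smoothed bigram log-probability of a sentence."""
    unigrams, bigrams, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )


def pick_hypothesis(hypotheses, lm, lm_weight=1.0):
    """Choose the text with the best combined acoustic + language score.

    hypotheses: list of (text, acoustic_log_score) pairs (hypothetical values).
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * lm_log_prob(h[0], lm))
    return best[0]


# Toy training text and two acoustically similar candidate transcriptions.
lm = train_bigram_lm([
    "recognize speech",
    "speech recognition works",
    "we recognize speech",
])
hypotheses = [("recognize speech", -10.0), ("wreck a nice beach", -9.5)]
print(pick_hypothesis(hypotheses, lm))
```

In this toy setting the acoustic scores alone slightly favour the second candidate, but the language model has never seen its word sequence and tips the decision toward "recognize speech". Real recognizers search over far larger hypothesis spaces and use much stronger language models.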
When it is possible to automatically convert read-aloud speech and dictation into text with sufficient accuracy, this technology can be used in dictation services as well as in many other useful applications, such as transcribing planned speeches or respeaking presentations or television programmes. However, I am even more interested in the natural and spontaneous speech that we all use in our everyday conversations and storytelling. Since free speech is the most efficient means of communication for humans, it is of utmost importance to have an automatic speech recognizer that can understand this kind of speech when developing Artificial Intelligence systems that are to communicate with people.
The challenges in training models of conversational speech lie in the huge amount of variation in speech and in the limited availability of carefully transcribed resources of natural speech that are suited for training the recognizers. Since written language differs from spoken language in many ways, it is in practice necessary to create the text resources by transcribing speech first.
When training the first conversational speech recognizer, we used the FinDialogue corpus in addition to the DSPCON corpus we collected ourselves. The language models were trained with specific portions of conversations in written format that were found to be similar to spoken language according to the aforementioned spoken corpora.
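The idea of picking written text that resembles spoken language can be sketched as follows. This is one simple way to do it, assuming a unigram model trained on transcribed speech and ranking by per-word cross-entropy. It is not necessarily the selection method used in the work described above, and the function names and example sentences are invented.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Word counts from in-domain (transcribed speech) sentences."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model):
    """Average negative log-probability per word, add-one smoothed."""
    counts, total = model
    words = sentence.lower().split()
    vocab_size = len(counts) + 1  # one extra slot for unseen words
    return -sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    ) / len(words)


def select_similar(candidates, spoken_sentences, keep=0.5):
    """Keep the fraction of candidates that look most like the spoken data."""
    model = train_unigram(spoken_sentences)
    non_empty = [s for s in candidates if s.split()]
    ranked = sorted(non_empty, key=lambda s: cross_entropy(s, model))
    return ranked[: max(1, int(len(ranked) * keep))]


# Invented examples: a tiny spoken corpus and a mixed pool of written text.
spoken = ["yeah well i dunno", "well yeah i guess so"]
candidates = [
    "well i guess so",
    "the government submitted a proposal",
    "yeah i dunno",
    "net sales increased significantly",
]
selected = select_similar(candidates, spoken, keep=0.5)
print(selected)
```

With these toy inputs, the two colloquial sentences are kept and the two formal ones are discarded. A practical system would use stronger language models and much larger corpora.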
At the moment, we are preparing two new corpora of free speech for publication: an extension of the Plenary Sessions of the Parliament of Finland and the speech material collected in the Donate Speech campaign. Both corpora contain approximately 4000 hours of speech, which clearly exceeds the total amount that was included in all previously published Finnish speech corpora that were suitable for training automatic speech recognizers. I am confident that the new data will enable us to significantly improve the automatic speech recognizer we have developed at Aalto University (Aalto-ASR), whose most recent version (Aalto-ASR 2.1) is currently available via the Language Bank of Finland.
Mikko Kurimo (1997). Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models. PhD thesis, Helsinki University of Technology, Espoo, Finland.
Mikko Kurimo, Vesa Siivola, Teemu Hirsimäki, Janne Pylkkönen, Reima Karhila, Peter Smit, Seppo Enarvi, André Mansikkaniemi, Matti Varjokallio, Ulpu Remes, Heikki Kallasjoki, Sami Keronen, Katri Leino, Ville T. Turunen & Kalle Palomäki (author names in no particular order, except the project leader is first). 2000–2016. AaltoASR open source large-vocabulary continuous speech recognition system, Aalto University.
Seppo Enarvi & Mikko Kurimo (2013). Studies on Training Text Selection for Conversational Finnish Language Modeling. In Proceedings of the 10th International Workshop on Spoken Language Translation (IWSLT), Heidelberg, Germany, pp. 256–263. Available: http://urn.fi/URN:NBN:fi:aalto-201708036342.
André Mansikkaniemi, Peter Smit & Mikko Kurimo (2017). Automatic Construction of the Finnish Parliament Speech Corpus. Proceedings of Interspeech 2017, Vol. 8, pp. 3762–3766. Available: https://doi.org/10.21437/Interspeech.2017-1115
Juho Leinonen, Sami Virpioja & Mikko Kurimo (2021). Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa). Linköping University Electronic Press. Available: http://hdl.handle.net/10138/330758
Peter Smit, Sami Virpioja & Mikko Kurimo (2021). Advances in subword-based HMM-DNN speech recognition across languages. Computer Speech & Language, Vol. 66. Available: https://doi.org/10.1016/j.csl.2020.101158
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.