Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tommi Jauhiainen works as a Project Planning Officer in Kielipankki and he is currently starting his two-year post doc. Here, Tommi tells us about his research related to some language resources in Kielipankki.
I am Tommi Jauhiainen, and at the moment I work as a Project Planning Officer in Kielipankki. At the beginning of 2021, I will start as a postdoctoral researcher on a grant from the Finnish Research Impact Foundation.
During the past ten years, my research has focused on language identification of text. I completed my Master’s thesis on this topic in 2010 and my PhD dissertation in 2019. Language identification means comparing a text written in an unknown language against a set of known languages and deciding which of them it is most likely written in. A similar method can also be used to classify texts by subject area, for example.
The difficulty of language identification varies greatly depending on the situation. The task is easy when there are only a few clearly different languages to choose from, such as Finnish and Swedish, and when the texts are reasonably long, for example several sentences. When there are hundreds of languages to choose from, when the languages are close to each other (e.g. Kven and Meänkieli) and/or when the texts are short (e.g. single words only), it may be very difficult to identify the language.
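To make the task concrete, the following is a minimal sketch of one common approach: scoring a text against character trigram models with add-one smoothing. It is an illustration only and is not the HeLI method described in the publications below. The function names (`char_ngrams`, `train`, `identify`) and the two toy training texts are assumptions made up for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language from {language: training text}."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def identify(text, models, n=3):
    """Return the language whose model gives the text the highest log-probability."""
    # Shared vocabulary size, used for add-one (Laplace) smoothing.
    vocab_size = len(set().union(*models.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data (hypothetical; real systems use far larger corpora).
SAMPLES = {
    "fin": "tämä on suomenkielinen lause ja se kertoo kielen tunnistamisesta. "
           "kielipankki on palvelu tutkijoille, jotka käyttävät kieliaineistoja.",
    "swe": "det här är en svensk mening och den handlar om språkidentifiering. "
           "språkbanken är en tjänst för forskare som använder språkresurser.",
}

models = train(SAMPLES)
print(identify("tämä lause on suomea", models))
print(identify("den här meningen är svensk", models))
```

With two clearly different languages and inputs of a few words, even this toy model separates the examples. With a single short word, only a handful of trigrams contribute to the score, which is one way to see why short texts and closely related languages make the task much harder.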
Last year, our extensive survey of automatic language identification in texts was published in the Journal of Artificial Intelligence Research. We are also currently working on a textbook on the same topic. The book is expected to be published in the “Synthesis Lectures on Human Language Technologies” series by Morgan & Claypool in late 2021.
During and after my PhD research, I have participated in several international shared tasks that have focused on distinguishing between very close languages or dialects. In 2018, we won the shared tasks focusing on Swiss German dialects and Indo-Aryan languages, and last year we won a shared task focusing on different versions of Mandarin Chinese. I am also a member of the “Ancient Near Eastern Empires” Centre of Excellence, in which context I have studied how cuneiform texts written in different dialects of Akkadian and Sumerian could be distinguished from one another. I organized an international shared task on this topic last year, and the winner was a Canadian research team using deep learning.
In the forthcoming “Language Identification of Speech and Text” project, funded by the Finnish Research Impact Foundation, I will move towards the study of language identification in speech, in addition to text. Until now, the research fields of speech and text language identification have been relatively separate from each other, and my intention is to bring more collaboration between them.
Most of my PhD research was done in the Finno-Ugric Languages and the Internet project, which was part of the FIN-CLARIN research group that maintains Kielipankki. In the project, we searched the Internet for websites written in small Uralic languages, created a portal site for them, and compiled sentence corpora from the texts they contained. While harvesting the web and creating the sentence corpora, we used automatic language identification as part of the workflow. The portal site, Wanca, is now part of the tools maintained by Kielipankki, and the Wanca 2016 corpora can be found in Kielipankki in three different versions. The Wanca 2017 corpora are being used in the ongoing ULI (Uralic Language Identification) shared task, and the corpora will be published next year.
Jauhiainen, H., Jauhiainen, T., & Linden, K. (2015). The Finno-Ugric Languages and the Internet project. In First International Workshop on Computational Linguistics for Uralic Languages: Proceedings of the Workshop (Vol. 2, pp. 87–98). (Septentrio Conference Series; Vol. 2015, No. 2). Septentrio Academic Publishing. https://doi.org/10.7557/scs.2015.2
Jauhiainen, T., Linden, K., & Jauhiainen, H. (2015). Language Set Identification in Noisy Synthetic Multilingual Documents. In Computational Linguistics and Intelligent Text Processing (Vol. Part I, pp. 633-643). (Lecture Notes in Computer Science; Vol. 9041). Springer International Publishing AG. https://doi.org/10.1007/978-3-319-18111-0_48
Jauhiainen, T., Linden, K., & Jauhiainen, H. (2016). HeLI, a Word-Based Backoff Method for Language Identification. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects: VarDial3, Osaka, Japan, December 12 2016 (pp. 153-162). https://www.aclweb.org/anthology/W16-4820/
Jauhiainen, T., Linden, K., & Jauhiainen, H. (2017). Evaluation of language identification methods using 285 languages. In 21st Nordic Conference of Computational Linguistics: Proceedings of the Conference (pp. 183-191). (Linköping Electronic Conference Proceedings; No. 31). Linköping University Electronic Press. https://www.aclweb.org/anthology/W17-0221/
Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). Iterative Language Model Adaptation for Indo-Aryan Language Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 66-75). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-3907
Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). HeLI-based Experiments in Swiss German Dialect Identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 254-262). The Association for Computational Linguistics. http://aclweb.org/anthology/W18-3929
Jauhiainen, H., Jauhiainen, T., & Linden, K. (2019). Wanca in Korp: Text corpora for underresourced Uralic languages. In Proceedings of the Research data and humanities (RDHUM) 2019 conference: data, methods and tools (pp. 21-40). (Studia Humaniora Ouluensia; No. 17). University of Oulu.
Jauhiainen, T., Linden, K., & Jauhiainen, H. (2019). Language Model Adaptation for Language and Dialect Identification of Text. Natural Language Engineering, 25(5), 561-583. https://doi.org/10.1017/S135132491900038X
Jauhiainen, T. (2019). Language identification in texts. University of Helsinki. http://urn.fi/URN:ISBN:978-951-51-5131-5
Jauhiainen, T., Jauhiainen, H., Alstola, T., & Linden, K. (2019). Language and Dialect Identification of Cuneiform Texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 89-98). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1409/
Jauhiainen, T., Jauhiainen, H., & Linden, K. (2019). Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 178-187). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1419/
Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2019). Automatic Language Identification in Texts: A Survey. Journal of Artificial Intelligence Research, 65, 675-782. https://doi.org/10.1613/jair.1.11675
Zampieri, M., Malmasi, S., Scherrer, Y., Samardžić, T., Tyers, F., Silfverberg, M. P., Klyueva, N., Pan, T-L., Huang, C-R., Ionescu, R. T., Butnaru, A., & Jauhiainen, T. S. (2019). A Report on the Third VarDial Evaluation Campaign. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 1-16). The Association for Computational Linguistics. https://www.aclweb.org/anthology/W19-1401/
Jauhiainen, H., Jauhiainen, T., & Linden, K. (2020). Building Web Corpora for Minority Languages. In Proceedings of the 12th Web as Corpus Workshop (pp. 23-32). The Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.wac-1.4
Gaman, M., Hovy, D., Ionescu, R. T., Jauhiainen, H., Jauhiainen, T., Linden, K., Ljubešić, N., Partanen, N., Purschke, C., Scherrer, Y., & Zampieri, M. (Accepted/In press). A Report on the VarDial Evaluation Campaign 2020. In Proceedings of VarDial 2020.
Jauhiainen, T., Jauhiainen, H., Partanen, N., & Linden, K. (Accepted/In press). Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora. In Proceedings of VarDial 2020. https://arxiv.org/pdf/2008.12169.pdf
Lindgren, M., Jauhiainen, T., & Kurimo, M. (2020). Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets. In Proceedings of Interspeech 2020 (pp. 467-471). http://www.interspeech2020.org/uploadfile/pdf/Mon-1-11-5.pdf
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Humanities of the University of Helsinki.