Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Therese Lindström Tiedemann tells us about her research on Swedish as a second language. There is a definite need to continue developing Finland-Swedish corpora to ensure that Finland-Swedish is also included in future studies of the Swedish language.
My name is Therese Lindström Tiedemann and I am a university lecturer in the Swedish Language at the University of Helsinki. In addition to the Swedish language, I also work on general linguistics. I wrote my PhD thesis on the history of grammaticalisation as a concept in linguistics, i.e. within the history of linguistics.
In recent years, most of my research has been on Swedish as a second language. In my research I often use corpus linguistic methods. Together with colleagues, I have also tried to use crowdsourcing. I also do research on other topics such as grammaticalisation, the history of linguistics, the teaching of grammar and metalinguistic knowledge.
I have used Kielipankki’s resources mainly in connection with my research on Swedish as a second language and in the context of teaching. For instance, I have used the Swedish subcorpus of the Topling corpus. Currently, I am managing our faculty’s part of the Digisvenska project, where we are creating a text corpus from the Digital Matriculation Examination in B1-Swedish in Finland (Swedish as a second language, learnt from year 6, or from year 7 in the old curriculum). We aim to study how the exam correlates with the curriculum, as well as the fairness and transparency of the test results. Among other things, we will study how lexical breadth in the form of lexical variation (cf. vocabulary size) relates to scores and marks in the exams. We will also look at verb conjugation and adverbial clause modifiers, and at linguistic accuracy, that is, how close the learners’ language is to the norm.
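As a rough illustration of what a lexical variation measure can look like, the sketch below computes a simple type-token ratio (distinct word forms divided by running words). This is only one common measure and is not necessarily the one used in Digisvenska; the function names and the very simple tokeniser are assumptions made for the example.

```python
import re


def tokenize(text):
    # Hypothetical, simplistic tokeniser: lowercase the text and keep
    # runs of letters (including å, ä, ö), dropping digits and punctuation.
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(text):
    # Lexical variation as distinct word forms (types) divided by
    # running words (tokens). Returns 0.0 for a text without tokens.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented example sentence, not taken from any corpus.
example = "Jag har en katt och jag har en hund"
ratio = type_token_ratio(example)
```

Note that the type-token ratio is sensitive to text length, so texts of very different lengths are not directly comparable with it.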
A few years ago, I tried to study the Swedish word nog (lit. ‘enough’) using the Sinebrychoff corpus together with Jan Lindström. However, in the end the work needed to be done primarily with a more comprehensive text version of the corpus and not with the version available in Korp.
I also have a more general interest in the Swedish-language resources available in Kielipankki because of my research on Swedish and teaching students in Scandinavian languages, and since I often use corpus-based methods. This is why it is important for me to know which corpora I can recommend to students and how they can be used. There is definitely a need to continue developing Finland-Swedish corpora to ensure that we can describe Finland-Swedish (Sw. ”finlandssvenska”) in a similar way to how we can describe Swedish as spoken in Sweden (Sw. ”sverigesvenska”), and that Finland-Swedish is also included in future studies of the Swedish language. In the Finnish context, we can also see that some corpora contain both Finnish and Swedish. There is a need to consider the best way to study how and when Swedish is used in these corpora, and whether this is representative of how Swedish is used in these contexts in Finland. This applies, for example, to the corpus of parliamentary plenary sessions (Eduskunnan täysistunnot), where Swedish words are currently only tagged as foreign words. This limits the possibilities for research on this part of the data. At the same time, however, we can clearly see that Swedish words dominate the list of words tagged as foreign words in the plenary sessions. It would be interesting to see these parts treated as Swedish, and to find out whether it might be possible to annotate the Swedish parts as Swedish, thus facilitating the study of them from a Swedish perspective.
Besides the Swedish-language resources, I also have an interest in interoperability between different corpora and resources, transparency of research data and comparability between different sources for the Swedish language. Since many Swedish-language corpora are available via Språkbanken Text (Sweden), and since we need to be able to compare corpora in Kielipankki with these, I see a need for information about how comparable these corpora are, and whether corpora in Kielipankki have been annotated in the same way. This is important to ensure that Finland-Swedish and other Swedish corpora located in Finland can be compared with Swedish corpora located in Sweden. This could give Finland-Swedish and second language Swedish (L2 Swedish) with Finnish as the first language (L1) a clear and fair place in research on Swedish and L2 Swedish in general.
As part of my work on corpora, my colleagues and I have also checked how well the automatic annotation works, especially on material produced by L2 speakers. We have checked the annotation of coursebook texts (written by L1 speakers but aimed at, or selected for, L2 learners), texts written by L2 learners, and texts written by L2 speakers and then ”normalised” (i.e. with standardised spelling, for instance) to facilitate annotation, queries and comparisons. The results showed that texts written by learners are often less well annotated, but not always. Lemmatisation, word class tagging and sense disambiguation were good enough to be used in studies of L2 Swedish, even though sense disambiguation was more problematic than the first two. There were bigger problems with dependency analysis (cf. clause analysis, parsing), and multiword expressions also proved to be problematic, especially in learner writings. Still, multiword annotation was good enough to allow us to conclude that we can use it in our work, although the user should know that something may have been missed and that the multiword annotation is based on the expressions which are part of the Saldo lexicon and on how they have been listed in Saldo. The results also showed that there was sometimes disagreement about whether a preposition should be seen as part of the expression or not.
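A check of this kind can be thought of as comparing automatic annotations with manually verified ones, token by token. The sketch below shows a plain accuracy calculation under that assumption. It is not the evaluation procedure used in the study, and the function name and the tag values in the example are purely illustrative.

```python
def annotation_accuracy(gold, predicted):
    # Share of tokens where the automatic label equals the manual label.
    # Both lists must be aligned token by token.
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented word class labels for a four-token sentence.
manual_tags = ["PN", "VB", "DT", "NN"]
automatic_tags = ["PN", "VB", "NN", "NN"]
accuracy = annotation_accuracy(manual_tags, automatic_tags)
```

The same comparison could be run separately for each annotation layer (lemma, word class, sense) and for each text type, which is what makes differences between L1 and L2 material visible.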
I am very happy to see that more Swedish corpora have been added to Kielipankki in the last few years. I hope that even more Swedish corpora will be added to Kielipankki in the future and that they will be annotated in the same way as the Swedish corpora in Språkbanken Text (Sweden). I also hope that information about the data will be made accessible in such a way that students and researchers can easily find comparable material and know how representative the material is of a certain type of language (e.g. a dialect, newspaper writings).
In the coming years I will be working on a project on pseudonymisation of linguistic data (Mormor Karl är 27 år). Pseudonymisation means that some information in the data, such as names of people, places, etc., is replaced with pseudonyms when this information might reveal who wrote the text. In this project we will study how pseudonymisation affects research data in the humanities. This is an important step in the work on open, reusable data, which is needed for reproducibility and for replication studies to be possible on data already collected, while at the same time protecting people’s identity.
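The basic replacement step described above can be sketched as follows. This is a minimal illustration that assumes the sensitive names have already been identified and listed by hand; detecting them automatically is the hard part and is not shown. The function name, the mapping and the example sentence are invented for the example and do not come from the project.

```python
import re


def pseudonymise(text, replacements):
    # Replace each listed name with its pseudonym, consistently across
    # the text. `replacements` maps original names to pseudonyms.
    if not replacements:
        return text
    # Longest names first, so that a longer name wins over a shorter
    # one that it starts with.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    return pattern.sub(lambda match: replacements[match.group(0)], text)


# Invented example; the mapping would normally be kept separately and securely.
mapping = {"Karl": "Anders", "Vasa": "Borgå"}
result = pseudonymise("Karl bor i Vasa. Karl är 27 år.", mapping)
```

Because only whole words are matched, a name such as Karlsson is left untouched by the entry for Karl. Inflected forms and misspellings, which are common in learner texts, would need separate handling.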
In connection with the project which I have just finished together with Elena Volodina, University of Gothenburg (L2 profiles – Development of lexical and grammatical competences in immigrant Swedish), we have released a dataset with manual morphological annotation of lexemes which are present in materials aimed at learners of Swedish as a second language or produced by speakers of Swedish as a second language (CoDeRooMor). This resource has now been updated and will be released as part of the resource Swedish L2 profiles during 2023. Swedish L2 profiles is a resource where you can search for e.g. a word, a tense, a morpheme or a word formation pattern to see how it is used at different proficiency levels (according to CEFR, the Common European Framework of Reference for Languages, Council of Europe), both in course books for Swedish as a second language and in learner essays at different CEFR levels. The resources which we have created are part of Språkbanken Text (Sweden), and they are or will be openly accessible.
I have also been involved in the development of an annotation tool in relation to research on Swedish (Legato) and in the use of the CALL platform Lärka for the teaching of syntactic functions, word classes and semantic roles. The CALL platform Lärka is something I have used in teaching grammar, which meant that I could give feedback to the developers from that perspective. Together with Volodina I have also used the platform to collect anonymous data to study what students often get right or wrong when they practise these categories, useful in connection to research on metalinguistic knowledge and the ability to analyse Swedish grammatically.
Apart from research related to Kielipankki’s resources and areas of interest, I am also the current project manager of Finland Swedish Online (FSO), an online course in Finland Swedish created at the University of Helsinki based on an Icelandic model (Icelandic Online). FSO is currently part of SAFMORIL, one of the K-Centres within CLARIN. One of my aims has been that FSO would not only support the learning of a language but also offer a possibility to study language acquisition, by seeing if it is possible to trace the development of learners in FSO if they grant access to that information. (Icelandic Online has done research on this based on their data.)
Alfter, D., Borin, L., Pilán, I., Lindström Tiedemann, T. & Volodina, E. 2019a. Lärka: From Language learning platform to infrastructure for research and language learning. In: Selected papers from the CLARIN Annual Conference 2018. Linköping: Linköping university press. 14pp. http://www.ep.liu.se/ecp/159/001/ecp18159001.pdf
Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2019b. LEGATO: A flexible lexicographic annotation tool. In: Hartmann, M. & Plank, B. (eds.), The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa): Proceedings of the conference. Linköping: Linköping University Electronic Press. pp. 382–388. http://hdl.handle.net/10138/306297
Alfter, D., Lindström Tiedemann, T. & Volodina, E. 2021. Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts vs Non-Experts. Northern European Journal of Language Technology, 7 (1): 35pp. https://doi.org/10.3384/nejlt.2000-1533.2021.3128
Arnbjörnsdóttir, B., Friðriksdóttir, K., & Bédi, B. 2020. Icelandic Online: twenty years of development, evaluation, and expansion of an LMOOC. CALL for widening participation: short papers from EUROCALL 2020, 13.
Borin, L., Forsberg, M. & Lönngren, L. 2013. SALDO: a touch of yin to WordNet’s yang. Language Resources and Evaluation, 47(4): 1191–1211. https://doi.org/10.1007/s10579-013-9233-4
Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching and assessment. https://rm.coe.int/1680459f97
Council of Europe. 2018. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion Volume with new descriptors. https://rm.coe.int/cefr-companion-volume-with-new-descriptors-2018/1680787989
Council of Europe. 2020. Common European Framework of Reference for Languages: Learning, teaching and assessment. Companion volume. https://rm.coe.int/common-european-framework-of-reference-for-languages-learning-teaching/16809ea0d4
Friðriksdóttir, K. 2021. The effect of tutor-specific and other motivational factors on student retention on Icelandic Online. Computer Assisted Language Learning, 34(5-6), 663-684.
Lenardič, J., Lindström Tiedemann, T. & Fišer, D. 2018. Overview of L2 corpora and resources. CLARIN report. CLARIN ERIC. https://office.clarin.eu/v/CE-2018-1202-L2-corpora-report.pdf
Lindström, J. & Lindström Tiedemann, T. 2020. ”Ni minnes nog hvilka jag menar”: Subjektiva och intersubjektiva aspekter av modaladverbet nog. In: Lehti-Eklund, H. & Silén, B. (eds.), Handel med konst. Språk och dialog i Paul Sinebrychoffs brevsamling från sekelskiftet 1900. Helsinki: Svenska litteratursällskapet. pp. 293–323. http://hdl.handle.net/10138/315043
Lindström, J. & Lindström Tiedemann, T. 2018. Subjektivt och intersubjektivt nog: Om grammatikalisering och bruk i ljuset av Paul Sinebrychoffs brevväxling kring 1900. In: Lönnroth, H, Haagensen, B., Kvist, M. & Sandvad West, K. (eds.) Studier i svensk språkhistoria 14. Vaasa: University of Vaasa. pp. 180–197. http://hdl.handle.net/10138/243079
Lindström [Tiedemann], T. 2004. The History of the Concept of Grammaticalisation. Unpublished PhD thesis, University of Sheffield. https://etheses.whiterose.ac.uk/1437/
Lindström Tiedemann, T., Alfter, D. & Volodina, E. 2022. CEFR-nivåer och svenska flerordsuttryck. In: Björklund, S., Haagensen, B., Nordman, M. & Westerlund, A. (eds.), Svenskan i Finland 19. Vasa: Svensk-österbottniska samfundet. pp. 218–233. https://urn.fi/URN:ISBN:978-952-69650-5-5
Lindström Tiedemann, T., Lenardič, J. & Fišer, D. 2018. L2 learner corpus survey: towards improved verifiability, reproducability and inspiration in learner corpus research. CLARIN annual conference, Pisa.
Lindström Tiedemann, T., Volodina, E. & Jansson, H. 2016. Lärka – ett verktyg för träning av språkterminologi och grammatik. LexicoNordica, 23: 161–181. https://tidsskrift.dk/lexn/article/view/111823
Prentice, J., Håkansson, C, Lindström Tiedemann, T., Pilán, I. & Volodina, E. 2021. Language learning and teaching with Swedish FrameNet++: two examples. In: Dannélls, D., Borin, L. & Friberg Heppin, K. (eds.), The Swedish FrameNet++: Harmonization, integration, method development and practical language technology applications. Amsterdam: Benjamins. pp. 303–329. https://doi.org/10.1075/nlp.14.12pre
Stemle, E. W., Boyd, A., Jansen, M., Lindström Tiedemann, T., Mikelić Preradović, N., Rosen, A., Rosén, D. & Volodina, E. 2019. Working together towards an ideal infrastructure for language learner corpora. In: Abel, A., Glaznieks, A., Lyding, V. & Nicolas, L. (eds.) Widening the Scope of Learner Corpus Research: Selected papers from the fourth learner corpus research conference. Louvain-la-Neuve: Presses universitaires de Louvain.
Volodina, E., Alfter, D., Lindström Tiedemann, T., Lauriala, M.S. & Piipponen, D. H. 2022. Reliability of Automatic Linguistic Annotation: Native vs Non-native Texts. In: Monachini, M. & Eskevich, M. (eds.), Selected papers from the CLARIN Annual Conference 2021. Linköping: Linköping University Electronic Press. pp. 151–167.
Volodina, E., Mohammed, Y. A. & Lindström Tiedemann, T. 2021. CoDeRooMor: A new dataset for non-inflectional morphology studies of Swedish. Proceedings of the 23rd Nordic conference on computational linguistics (NoDaLiDa). Linköping. pp. 178–189. http://hdl.handle.net/10138/339476
Volodina, E. & Lindström Tiedemann, T. 2014. Evaluating students’ metalinguistic knowledge with Lärka. Swedish Language Technology Conference, Uppsala. http://hdl.handle.net/10138/347397
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.
All previously published Language Bank researcher interviews are stored in the Researcher of the Month archive. This article is also published on the website of the Faculty of Arts of the University of Helsinki.