
D2.5.2: Analysis and annotation tools for learner performances

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.5: Report on Finnish as a second language learners automated analysis and annotation tools
Date of reporting: 28-11-2023

Report author: Ari Huhta (University of Jyväskylä)
Contributors: Jenny Tarvainen, Ida Toivanen, Sirkku Kronholm, Mika Halttunen (University of Jyväskylä)
Deliverable location: (so far only in Google Drive)

Description

Work in WP2.5 was divided into two stages. In 2022, tools used for the automated analysis of texts written by native speakers of Finnish were reviewed in collaboration with WP3.2. To investigate how these tools perform on texts written by learners of Finnish as a second language (L2), selected tools were tested on texts collected in previous projects. The texts represented different proficiency levels of the Common European Framework of Reference for Languages (CEFR), as assessed by trained raters.

Testing focused on two promising tools, Finnish Tagtools (Language Bank) and Turku-neural-parser-pipeline (TurkuNPP). Both tools use machine learning with pre-trained language models. They perform, among other things, segmentation, lemmatization and morphological tagging of Finnish texts. In addition, TurkuNPP provides information about universal dependency relations. The tools were tested with L2 Finnish learners’ texts rated at several CEFR levels. Only a few texts could be analysed, for various technical and other reasons. Even so, it was clear that the tools do not function well on learner performances, as various mistakes in the texts often confuse the processing. Typical L2 Finnish characteristics, such as mixing back and front vowels (kavelin vs. kävelin), can cause incorrect lemmatization and/or tagging. In some cases, however, the tools remain faithful to the learner language form and give a lemma based on the inflected learner form rather than the targeted Finnish lemma (e.g. lumihannen → lumihansi, not lumihanki). As language learning researchers have started to see learner language as a valuable language variety, this can be seen as a positive characteristic, but useful tools should give both learner language lemmas and targeted lemmas. A poster presentation of these findings was given at the autumn symposium of the Finnish Association of Applied Linguistics (AFinLA) in October 2022.
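To illustrate the point that a useful tool could report both the learner form and a targeted form, the following is a minimal sketch in Python. It is not how Finnish Tagtools or TurkuNPP work. It assumes that back/front vowel mixing is the only deviation, and the function names and the toy lexicon are hypothetical; a real tool would query a morphological analyser instead of a word list.

```python
from itertools import product

# Back/front vowel pairs that can be interchanged in L2 Finnish writing.
VOWEL_PAIRS = {"a": "ä", "o": "ö", "u": "y"}
# Make the mapping symmetric: a <-> ä, o <-> ö, u <-> y.
SWAPS = {**VOWEL_PAIRS, **{front: back for back, front in VOWEL_PAIRS.items()}}


def vowel_variants(word):
    """Return every spelling obtained by swapping back and front vowels."""
    options = [(ch, SWAPS[ch]) if ch in SWAPS else (ch,) for ch in word.lower()]
    return {"".join(chars) for chars in product(*options)}


def target_candidates(word, lexicon):
    """Keep the learner form and list the variants found in the lexicon."""
    matches = sorted(vowel_variants(word) & set(lexicon))
    return {"learner_form": word, "target_candidates": matches}


# Toy lexicon for illustration only (assumption, not a real resource).
TOY_LEXICON = {"kävelin", "talo", "pöytä"}
result = target_candidates("kavelin", TOY_LEXICON)
print(result)
```

The number of variants doubles with each swappable vowel, so this brute-force approach is only practical for single word forms; it is meant to show the idea of keeping both forms side by side, not to propose a production method.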

In the second stage, in 2023, a study has been conducted to build models for classifying learner language into CEFR levels and to investigate the resources needed to establish strong deep-learning-based L2 Finnish research in the future. This will facilitate, for example, the design of automated tools for learner language detection for pedagogical and assessment purposes, and contribute to the development of textual models for Finnish. Specifically, the study investigates (1) whether the currently available CEFR-annotated datasets are sufficient for training deep learning models, (2) how the trained models perform on new data, (3) whether pretraining on learner language with a masked language modeling (MLM) objective improves model performance, and (4) whether the model performs equally well across all CEFR levels.

Four CEFR-annotated datasets of written Finnish as a second or foreign language were used: the International Corpus of Learner Finnish (ICLFI), the Advanced Finnish Learner’s Corpus (LAS2), and two young learner corpora, one from the cross-sectional Cefling project and one from the longitudinal Topling project.

The state-of-the-art Finnish BERT model FinBERT base was used and compared against FinBERT large. To inspect the effect of pretraining with a masked language modeling (MLM) objective, models trained with and without pretraining were compared. The models were evaluated on test data extracted from all four datasets. The evaluation metrics include accuracy, F1-score, recall and precision; for each metric, the average over five folds is computed. An article based on the study is currently in preparation.
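The fold-averaged evaluation described above can be sketched as follows. This is an illustration under stated assumptions, not the study's actual evaluation code: the report does not say how precision, recall and F1 are averaged across CEFR levels, so macro-averaging over the levels present in a fold is assumed here, and the function names and toy predictions are hypothetical.

```python
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]  # CEFR levels used as class labels


def fold_metrics(gold, predicted, labels=LEVELS):
    """Accuracy plus macro-averaged precision, recall and F1 for one fold."""
    pairs = list(zip(gold, predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        if tp + fp + fn == 0:
            continue  # level neither annotated nor predicted in this fold
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(g == p for g, p in pairs) / len(pairs),
        "precision": sum(precisions) / len(precisions),
        "recall": sum(recalls) / len(recalls),
        "f1": sum(f1s) / len(f1s),
    }


def average_over_folds(folds):
    """Average each metric over a list of (gold, predicted) folds."""
    per_fold = [fold_metrics(gold, predicted) for gold, predicted in folds]
    return {name: sum(m[name] for m in per_fold) / len(per_fold) for name in per_fold[0]}


# Two toy folds for brevity (made-up labels); the study averages over five.
toy_folds = [
    (["A1", "A2", "B1", "B1"], ["A1", "A2", "B1", "A2"]),
    (["A1", "B2"], ["A1", "B2"]),
]
averaged = average_over_folds(toy_folds)
print(averaged)
```

Collecting the per-level precision, recall and F1 values before averaging also makes it straightforward to inspect whether performance differs between CEFR levels, which is the fourth question the study asks.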

Events / presentations:

Sirkku Kronholm & Ari Huhta: Automaattisten tekstityökalujen kehittäminen oppijankieliseen aineistoon. Poster presentation. AFinLA autumn symposium. Helsinki. 27.-29.10.2022. https://www.helsinki.fi/assets/drupal/2022-10/AFinLA2022_FINALFINAL_Timetable_A3.pdf
