TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies

The TallVocabL2Fi dataset comprises responses from 15 participants to a “tall” 12,000-word self-rating task on a 5-point scale and a 100-word confirmatory word translation task. The 15 participants were split by native language (5 English, 4 Hungarian and 6 Russian) and by self-reported CEFR reading level (5 B1, 4 B2, 5 C1 and 2 C2). The data was gathered through a website from paid participants resident in Finland over a period of 3 months, from September to November 2021. In total there are 180 thousand word knowledge self-rating responses and 1.5 thousand word translation responses.

The dataset is unique in combining a tall data collection setup, in which responses are collected for many words, with learners of varied backgrounds, Finnish prompt words, and triangulation with a word translation test. The dataset can be used for vocabulary acquisition research in general, but it is particularly suited to evaluating Vocabulary Inventory Prediction (VIP), including techniques based on Computer-Adaptive Testing (CAT). The dataset is relational/tabular. It is distributed as a series of TSV files along with a SQL schema exported from DuckDB.
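As a rough illustration of working with tab-separated response data of this kind, the following minimal Python sketch computes the mean self-rating per participant. The column names (participant_id, word, rating) and the sample rows are hypothetical stand-ins, not the dataset's actual schema; the real file and column names are given in the readme and the SQL schema distributed with the data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for one TSV file of self-rating responses.
# The column names and rows are assumptions made for illustration only;
# consult the dataset's readme and schema for the real ones.
SAMPLE_TSV = (
    "participant_id\tword\trating\n"
    "1\tkissa\t5\n"
    "1\ttalo\t4\n"
    "1\tsäde\t2\n"
    "2\tkissa\t5\n"
    "2\ttalo\t1\n"
)


def mean_rating_by_participant(tsv_file):
    """Return {participant_id: mean rating} read from a TSV file object."""
    totals = defaultdict(lambda: [0, 0])  # participant -> [sum, count]
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        entry = totals[row["participant_id"]]
        entry[0] += int(row["rating"])
        entry[1] += 1
    return {pid: total / count for pid, (total, count) in totals.items()}


means = mean_rating_by_participant(io.StringIO(SAMPLE_TSV))
print(means)
```

With a real file, the same function could be given a handle opened with open(path, encoding="utf-8", newline="") in place of the in-memory sample.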

Further information about the schema and the collection process is available in the readme included with the data, and in the accompanying publication: Robertson, F., Chang, L., & Söyrinki, S. (2022). TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners’ Vocabulary. In Language Resources and Evaluation Conference (LREC 2022).

  This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051702