TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies

The TallVocabL2Fi dataset comprises responses from 15 participants to a “tall” 12,000-word self-rating task on a 5-point scale and a 100-word confirmatory word translation task. The 15 participants were grouped by native language (5 English, 4 Hungarian and 6 Russian) and by self-reported CEFR reading level (5 B1, 4 B2, 5 C1 and 2 C2). The data was gathered through a website from paid participants resident in Finland over a period of 3 months, from September to November 2021. In total there are 180 thousand word knowledge self-rating responses and 1.5 thousand word translation responses.

The dataset is unique in its combination of the tall data collection setup, where responses are collected for many words, the varied backgrounds of the learners, the use of Finnish prompt words, and the triangulation with a word translation test. The dataset can be used for vocabulary acquisition research in general, but it is particularly suited to evaluating the task of Vocabulary Inventory Prediction (VIP), including techniques based on Computer-Adaptive Testing (CAT). The dataset is relational/tabular. It is distributed as a series of TSV files along with an SQL schema exported from DuckDB.
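Because the data is distributed as TSV files, it can be read with standard tooling. The following is a minimal sketch only, using Python's standard csv module. The column names (participant_id, word, rating) and the sample rows are hypothetical stand-ins, not the dataset's actual schema; the real table and column names are documented in the readme and the SQL schema included with the data.

```python
import csv
import io
from collections import Counter


def count_responses_per_participant(tsv_file, participant_column="participant_id"):
    """Count rows per participant in a tab-separated file with a header row.

    The default column name is an assumption made for illustration.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[row[participant_column]] += 1
    return counts


# Tiny in-memory stand-in for one of the distributed TSV files (made-up rows).
sample = io.StringIO(
    "participant_id\tword\trating\n"
    "p01\ttalo\t5\n"
    "p01\tkissa\t4\n"
    "p02\ttalo\t3\n"
)
counts = count_responses_per_participant(sample)
print(dict(counts))
```

With a real file, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"), where path points at one of the downloaded TSV files.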

Further information about the schema and the collection process is available in the readme included with the data, and in the accompanying publication: Robertson, F., Chang, L., & Söyrinki, S. (2022). TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners’ Vocabulary. In Language Resources and Evaluation Conference (LREC 2022).

Latest versions/subcorpora:
TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies
Metadata and license
Attribution instructions
Download the resource
Search for all versions of this resource in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051702

Testipiste Corpus

Testipiste is a language assessment centre for adult migrants. This corpus contains texts written by 2397 different persons, 3 texts from each person. It also contains assignments and other related texts. The essays include, among other things, information on the starting level of their authors, as defined by Testipiste.

Latest versions/subcorpora:
Testipiste Corpus, source
Metadata and license
Attribution instructions
The resource will be available soon
Search for all versions in META-SHARE

Different versions/subcorpora of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021061603

DIALUKI – Diagnosing reading and writing in a second or foreign language

The project studies the diagnosis of reading and writing abilities in a second or foreign language. It seeks to identify the cognitive features which predict a learner’s strengths and weaknesses in those areas. The project brings together scholars from applied linguistics, psychology and assessment to engage in multidisciplinary work and to develop innovative ways of diagnosing the development of second and foreign language abilities.

More information on the corpus: https://www.jyu.fi/dialuki

Latest versions/subcorpora:
DIALUKI – Diagnosing reading and writing in a second or foreign language
Metadata and license
Attribution instructions
The resource will be made available in Korp
Search for all versions in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021061602

CEFLING Project Corpus

The corpus contains writing performances in Finnish as a second language and English as a foreign language, collected from comprehensive school students (grades 7–9) in the project CEFLING – Linguistic Basis of the Common European Framework for L2 English and L2 Finnish. It includes data from several hundred learners, with 4–5 writing tasks from each learner, together with background information and self-assessments of proficiency.

More information:
https://www.jyu.fi/hytk/fi/laitokset/kivi/tutkimus/hankkeet/paattyneet-tutkimushankkeet/cefling/en/cefling

Latest versions/subcorpora:
CEFLING Project Corpus
Metadata and license
Attribution instructions
The resource will be available soon
Search for all versions in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021061601

The Advanced Finnish Learners’ Corpus

The Advanced Finnish Learners’ Corpus (in Finnish: Edistyneiden suomenoppijoiden korpus) consists mainly of texts written by non-native MA students of Finnish language. At the end of 2009 it consisted of the following:

– digitized exam essays,
– digitized theses,
– other digitized academic writings.

The subcorpora containing digitized exam essays (esseet) and course papers (tentit) have been made available at http://korp.csc.fi/

Important: Due to the nature of the material, the resource should be handled with care in order to respect the privacy of the personal data. If samples of the data are published, they must be anonymized according to best practices.

More information on the corpus: https://www.utu.fi/fi/yliopisto/humanistinen-tiedekunta/suomen-kieli-ja-suomalais-ugrilainen-kielentutkimus/lauseopin-arkisto

Latest versions/subcorpora:
The Advanced Finnish Learners’ Corpus
Metadata and license
Attribution instructions
Select the corpus in Korp
The Advanced Finnish Learners’ Corpus, Downloadable Version
Metadata and license
Attribution instructions
Download the resource
Search for all versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051907

International Corpus of Learner Finnish (ICLFI)

The International Corpus of Learner Finnish (ICLFI) is a corpus of written learner language.

The corpus is morphologically annotated. The texts have been written by students of Finnish as a foreign language from various language backgrounds. They have been compiled with the help of Finnish language teachers around the world.

The corpus contains texts written by basic, independent, and proficient learners of Finnish, and the texts are analyzed according to the Common European Framework of Reference for Languages (CEFR). The ICLFI comprises a variety of both non-fictional (e.g. essays, argumentative texts) and fictional texts (e.g. narratives, letters). In addition, the corpus provides information on a large number of variables concerning the linguistic background of the learner, the learning task, the learning context, etc.

Important: Due to the nature of the material, the resource should be handled with care in order to respect the privacy of the personal data. If samples of the data are published, they must be anonymized according to best practices.

Latest versions/subcorpora:
International Corpus of Learner Finnish (ICLFI)
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for all versions in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051906

Studentsvenska 79-80 Corpus

The corpus contains Swedish language essays / compositions written by Finnish-speaking students taking the Matriculation examination in 1979-80.

Latest versions/subcorpora:
Studentsvenska 79-80 Corpus
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for these versions in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042605

Topling – Paths in Second Language Acquisition

Topling is a research project of the Department of Languages and Centre for Applied Language Studies at the University of Jyväskylä. It is financed by the Academy of Finland (2010-2013) and the University of Jyväskylä. It makes use of the data and results of an earlier project called Cefling.

The main objective of the project is to compare cross-sectional and longitudinal sequences of the acquisition of writing skills in Finnish, English and Swedish as second languages in the Finnish educational system (incl. adults).

The cross-sectional data (1,194 samples for L2 Finnish and 3,154 for L2 English, on a variety of tasks) already exists, rated for level and coded for analysis, with similar data available for Swedish. The longitudinal data (incl. language use outside school) will be collected during this project.

The corpus can be used e.g. to better understand and predict problems in the development of writing proficiency in foreign/second languages.

Latest versions/subcorpora:
The English Subcorpus of Topling – Paths in Second Language Acquisition
Metadata and license
Attribution instructions
Select the corpus in Korp
The Finnish Subcorpus of Topling – Paths in Second Language Acquisition
Metadata and license
Attribution instructions
Select the corpus in Korp
The Swedish Subcorpus of Topling – Paths in Second Language Acquisition
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for these versions in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092406
