TallVocabL2Fi: Measurements of 15 L2 Finnish learners' vocabularies

TallVocabL2Fi: Mitat 15 S2-opiskelijan sanavarastosta

Short name: tallvocabl2fi

Metadata: http://urn.fi/urn:nbn:fi:lb-2022041921

License: CC0 (CC-ZERO) 1.0

See complete license details in the separate LICENSE.txt or under http://urn.fi/urn:nbn:fi:lb-2022041923



# Definitions

 * TSV: Tab Separated Values; A tabular file format with one row per line,
   and with columns/fields separated by tab characters
 * CSV: Comma Separated Values, as above but with a comma as the field separator
 * DuckDB: An embedded analytical-relational database available at
   [DuckDB.org](https://duckdb.org/)
 * SQL: Structured Query Language; A programming language for querying and
   manipulating relational databases. Supported by DuckDB.
 * Primary key: Relational database terminology referring to
   a unique identifier for a row in the current table.
 * Foreign key: As above but a reference to a row in another table
 * Synthetic primary key: A "made up" identifier for each row, e.g. consecutive
   numbers 1, 2, 3, ...
 * CEFR: The Common European Framework of Reference for Languages. In CEFR, the
   skills of speaking, writing, listening comprehension and reading
   comprehension are given coarse-grained levels from A1 to C2. The scale can
   be understood through a [CEFR self-assessment
   grid](https://www.coe.int/en/web/portfolio/self-assessment-grid).
 * YKI: Short for *Yleinen kielitutkinto*; The main general language
   certificate administered in Finland. We refer by default to the Finnish
   language certificate here.

# Description

This resource has been deposited at the Language Bank of Finland, where it has
been assigned the identifier
[urn:nbn:fi:lb-2022041921](http://urn.fi/urn:nbn:fi:lb-2022041921).

The TallVocabL2Fi dataset comprises responses from 15 participants to
a "tall" 12,000-word, 5-point-scale self-rating response task and a 100-word
confirmatory word translation response task.

The dataset is unique in its combination of the tall data collection setup,
where responses are collected for many words, the varied backgrounds of the
learners, the use of Finnish prompt words, and the triangulation with a word
translation test.

The dataset can be used for vocabulary acquisition research in general, but it
is particularly suited to evaluation of the task of Vocabulary Inventory
Prediction (VIP) including techniques based on Computer-Adaptive Testing (CAT).

The dataset is relational/tabular. It is distributed as a series of TSV files
along with a SQL schema exported from DuckDB.

The 15 participants were split by native language (5 English, 4 Hungarian and
6 Russian) and by self-reported CEFR reading level (5 B1, 4 B2, 5 C1 and 2 C2).
The data was gathered through a website from paid participants resident in
Finland over a period of 3 months from September to November 2021. In total
there are 180 thousand word knowledge self-rating responses and 1.5 thousand
word translation responses.

# Dataset format, schema and coding

## Formats

There are two formats available. The *simple format* contains only the response
data from the self-rating test and the marks from the translation task, and is
meant for quick analyses. The *relational format* is the detailed release
format and contains the full dataset. Both formats are TSV based. All text is
encoded as UTF-8.

## Coding

### 5-point self-assessment scale

The 5-point self-assessment scale was presented to the respondents as follows:

 * 1: I have never seen the word before
 * 2: I have probably seen the word before, but don't know the meaning
 * 3: I have definitely seen the word before, but don't know the meaning / I have
    tried to learn the word but have forgotten the meaning
 * 4: I probably know the word's meaning or am able to guess
 * 5: I absolutely know the word's meaning

### Translation task marking scale

The marking scale for the translation task is as follows:

 * 1: Completely incorrect answer
 * 1b: No answer
 * 2: In some way partially correct but also incorrect and misleading with
   regards to the meaning it would provide within a text. Maximum score for
   partial compound.
 * 3: Correct enough that it may help understanding a text. Maximum for
   a response with the wrong part-of-speech or which seems to result from
   parsing a compound with the wrong head.
 * 4: Not quite correct, but unlikely to impede understanding
 * 5: Completely correct

(For further detail on both scales, refer to the 2022 LREC publication.)

### CEFR levels

All CEFR levels in the data set are coded as integers, following an extended
version of the conventions of YKI. The coding is as follows:

 * 1: A1
 * 2: A2
 * 3: B1
 * 4: B2
 * 5: C1
 * 6: C2
 * 7: Native speaker
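
When analysing the data it can be convenient to translate these integer codes
back into labels. The following is a minimal Python sketch; the names
`CEFR_LABELS` and `cefr_label` are illustrative and not part of the dataset,
and treating an empty field as a missing level is an assumption.

```python
# Integer coding of CEFR levels as listed above.
CEFR_LABELS = {
    1: "A1",
    2: "A2",
    3: "B1",
    4: "B2",
    5: "C1",
    6: "C2",
    7: "Native speaker",
}


def cefr_label(code):
    """Return the CEFR label for an integer code, or None if it is missing."""
    # Assumption: a missing level shows up as None or an empty string.
    if code is None or code == "":
        return None
    return CEFR_LABELS[int(code)]
```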

### Languages

Languages are encoded with a [2-character ISO 639-1:2002 language
code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes).

### Sessions, dates and times

Responses are divided into sessions. Sessions are designed as streams of
consecutive events which include responses, form input events such as key
presses, and window focus and blur events. Whenever there is a gap exceeding
5 minutes between events, the session is considered timed out and a new session
is created.
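
The timeout rule can be illustrated with a short Python sketch. This is not
the code used by the data collection website, and the raw event timestamps are
not part of the dataset; the function name and the example timestamps are
made up for illustration.

```python
SESSION_TIMEOUT_SECS = 5 * 60  # gap after which a session is considered timed out


def split_into_sessions(event_times_secs):
    """Group sorted event timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous event
    exceeds SESSION_TIMEOUT_SECS.
    """
    sessions = []
    for t in event_times_secs:
        if sessions and t - sessions[-1][-1] <= SESSION_TIMEOUT_SECS:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Illustrative timestamps only, not taken from the dataset.
example = [0, 20, 50, 400, 410, 1000]
print(split_into_sessions(example))  # [[0, 20, 50], [400, 410], [1000]]
```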

All dates are measured as offsets from the date on which the first response is
received from the participant. For example, if the participant gives their
first response on the 1st, all activity on that date is indicated as day 0, and
all activity on the 2nd, day 1 and so forth. All dates are recorded in the
webserver's time zone, which was UTC (*not* Helsinki time).
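
As a sketch of how such day offsets relate to timestamps, the following Python
snippet computes the offset from calendar dates in UTC. The function name and
the example timestamps are hypothetical; the dataset itself only contains the
resulting day numbers.

```python
from datetime import datetime, timezone


def day_offset(first_response, event):
    """Number of calendar days (in UTC) between the first response and an event."""
    first_day = first_response.astimezone(timezone.utc).date()
    event_day = event.astimezone(timezone.utc).date()
    return (event_day - first_day).days


# Illustrative timestamps only, not taken from the dataset.
first = datetime(2021, 9, 1, 22, 30, tzinfo=timezone.utc)
later = datetime(2021, 9, 2, 0, 15, tzinfo=timezone.utc)
print(day_offset(first, first))  # 0
print(day_offset(first, later))  # 1
```

Note that the offset follows the calendar date rather than elapsed time, so two
events less than 24 hours apart can fall on different days.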

All times are measured from the webserver. Only time intervals are given in the
dataset. All times are either given in seconds (secs) or microseconds (usecs,
millionths of a second). In both cases rounding is to the nearest integer. 

## Simple format

The simple format has the following columns:

 * `participant`: Numerical identifier for the participant (synthetic primary key)
 * `word`: The word the participant was asked to self-assess their knowledge of
   or to translate (always lower case)
 * `time_usecs`: The time from the word being sent to the participant until
   receiving the response in microseconds
 * `rating`: The rating on the 5-point self-assessment scale
 * `mark`: The mark on the translation task marking scale. This is only
   available for the 100 words included in the translation task, i.e. 1 in
   120 of the 12,000 words.
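
The simple format can be read with any TSV reader. Below is a minimal sketch
using only the Python standard library. It assumes that the file starts with a
header row and that `mark` is left empty where no mark is available; both
assumptions should be checked against the actual file. The sample rows are
invented for illustration.

```python
import csv
import io

# Illustrative rows only; real values come from the simple-format TSV file.
# Assumes a header row and an empty `mark` field where no mark is available.
sample = (
    "participant\tword\ttime_usecs\trating\tmark\n"
    "1\ttalo\t1500000\t5\t5\n"
    "1\tkissa\t2250000\t4\t\n"
)


def read_simple(fileobj):
    """Parse simple-format rows into dicts with typed values."""
    rows = []
    for row in csv.DictReader(fileobj, delimiter="\t"):
        rows.append({
            "participant": int(row["participant"]),
            "word": row["word"],
            "time_secs": int(row["time_usecs"]) / 1_000_000,
            "rating": int(row["rating"]),
            # Kept as a string because the marking scale includes "1b".
            "mark": row["mark"] or None,
        })
    return rows


rows = read_simple(io.StringIO(sample))
print(rows[0]["word"], rows[0]["time_secs"], rows[0]["mark"])
```

For a file on disk, pass an open file object instead of the `io.StringIO`
wrapper, opened with `encoding="utf-8"` and `newline=""`.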

## Relational format

The relational format has been exported from [DuckDB](https://duckdb.org/). It
can be used most easily by reimporting it into a DuckDB database. Note that
DuckDB has good interoperability, including with Python & Pandas and R Data
Frames. The following command, run in the same working directory as this file,
will import the dataset into a DuckDB database called `tallvocabl2fi.duckdb`:

    $ duckdb tallvocabl2fi.duckdb "IMPORT DATABASE 'relational'"

Since the data is stored in TSV files, it can also be loaded by any other
software supporting TSV[^1].
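
For example, a table can be read without DuckDB using the Python standard
library, as long as the delimiter is set to a tab despite the `.csv` file
extension. This is a minimal sketch; `load_table` is an illustrative helper,
and the path in the usage comment assumes the files sit in the `relational`
directory.

```python
import csv
from pathlib import Path


def load_table(path):
    """Read a tab-separated file into a list of rows (lists of strings)."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))


# Hypothetical usage; adjust the path to where the files were unpacked:
# rows = load_table("relational/selfassess_response.csv")
```

The rows are returned as plain strings; the column types can be looked up in
`schema.sql`.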

![An entity-relationship diagram showing the relationship between the tables,
and a selection of their columns.](erd.svg)

The schema can be viewed in `erd.svg` and the data types can be found in `schema.sql`,
which will be used by DuckDB if you choose to load the data there.

The tables making up the relational format of the dataset are:

 * Participant, `participant.csv`
   * Purpose: Gives the basic information about each participant.
   * Columns:
     * `id`: Synthetic primary key
     * `cefr_selfassess_speaking`: Self-assessed speaking CEFR level in Finnish, 1-6
     * `cefr_selfassess_writing`: Self-assessed writing CEFR level in Finnish, 1-6
     * `cefr_selfassess_listening_comprehension`: Self-assessed listening comprehension CEFR level in Finnish, 1-6
     * `cefr_selfassess_reading_comprehension`: Self-assessed reading comprehension CEFR level in Finnish, 1-6
     * `cefr_proof_speaking`: Speaking CEFR level in Finnish according to proof document, 1-6
     * `cefr_proof_writing`: Writing CEFR level in Finnish according to proof document, 1-6
     * `cefr_proof_listening_comprehension`: Listening comprehension CEFR level in Finnish according to proof document, 1-6
     * `cefr_proof_reading_comprehension`: Reading comprehension CEFR level in Finnish according to proof document, 1-6
     * `lived_in_finland`: Years lived in Finland as a whole number
     * `proof_age`: Age of proof. Coding:
       * `lt1`: Less than one year, < 1yr
       * `lt3`: Less than three years, < 3yr
       * `lt5`: Less than five years, < 5yr
       * `gte5`: Greater than or equal to five years, >= 5yr
     * `proof_type`: The type of proof. Coding:
       * `yki_intermediate`: The intermediate YKI qualification which confers the levels less than 3, 3 or 4
       * `yki_advanced`: The advanced YKI qualification which confers the levels less than 5, 5 or 6
       * `course_english_degree`: Completion of a course taken as part of an international degree programme taught in English
       * `completed_finnish_upper_secondary`: Completion of upper-secondary school level education in Finnish
       * `completed_finnish_degree`: Completion of a higher or further education qualification in Finnish
       * `other`: Another type of proof
     * `miniexam_time_secs`: The time taken to complete the
       miniexam/translation task, measured as the sum of the times of all
       miniexam sessions. Given in seconds.
     * `miniexam_day`: The day the miniexam was completed on
 * Participant language, `participant_language.csv`
   * Purpose: Gives each language known by the participant, and the level at
     which it is known, including their native language, but not Finnish which
     is given in the "Participant" table.
   * Columns:
     * `participant_id`: Foreign key to "Participant" table
     * `language`: Language encoded with a 2-character ISO 639-1 code.
     * `level`: Estimated overall CEFR level, 1-7. Native language is included here as 7.
 * Self-assessment session, `selfassess_session.csv`
   * Purpose: Gives information about the sessions in which the self-assessment was completed.
   * Columns:
     * `id`: Synthetic primary key
     * `participant_id`: Foreign key to "Participant" table
     * `device`: The device the session was completed on. Coding:
       * `mobile`: Mobile phone
       * `tablet`: Tablet computer
       * `pc`: Personal computer
       * `unknown`: The detection process failed
     * `time_secs`: The time for the session to be completed as measured from
       the first to last event in the session event stream
     * `day`: The day on which the session started
 * Self-assessment response, `selfassess_response.csv`
   * Purpose: Gives the word-level response for each participant
   * Columns:
     * `session_id`: Foreign key to "Self-assessment session" table
     * `word`: The word the participant was asked to self-assess their
       knowledge of (always lower case)
     * `time_usecs`: The time from the word being sent to the participant until
       receiving the response in microseconds
     * `rating`: The rating on the 5-point self-assessment scale
 * Mini-exam (translation task) mark, `miniexam_mark.csv`
   * Purpose: Gives the mark for participants' responses to the translation
     task. Please refer to the 2022 LREC publication for the full details of
     the marking process and the combination of marks.
   * Columns:
     * `miniexam_response_id`: Foreign key to "Mini-exam (translation task)
       response" table
     * `marker`: The marker/marking session. Coding:
       * `ann1`: The marking from annotator 1
       * `ann2`: The marking from annotator 2
       * `corr`: A corrected/agreed mark between the two annotators (only
         for some of the disagreeing marks)
       * `final`: A combination of all three of the above to reach a final
         mark. Given for every response. **Typically only this mark is used.**
     * `mark`: Mark given according to the marking scale
 * Mini-exam (translation task) response, `miniexam_response.csv`
   * Purpose: Gives the response of the participant to each word in the
     translation task
   * Columns:
     * `id`: Synthetic primary key
     * `participant_id`: Foreign key to "Participant" table
     * `word`: The word the participant was asked to translate
     * `type`: The type of response the respondent opted to give. Coding:
       * `trans_defn`: A translation or definition
       * `topic`: The topic of the word
       * `donotknow`: The respondent specifies that they cannot give any
         response because they do not know the word at all
     * `lang`: Language of response. Either:
       * `en`: English
       * `fi`: Finnish
       * `hu`: Hungarian (only available for native Hungarian speakers)
       * `ru`: Russian (only available for native Russian speakers)
     * `response`: The plain-text response to the task

# Source data

The creation of the word list (see 2022 LREC publication) is based on data
sourced from the following other resources:

 * Huovilainen, T. (2018). *Psycholinguistic Descriptives* [text corpus].
   Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2018081601
 * Ylönen, T., Wiktionary contributors (2021). *Kaikki.org.* Retrieved from
   http://kaikki.org/

# Publication

Further information is available in the accompanying publication:

Robertson, F., Chang, L., & Söyrinki, S. (2022).
TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners' Vocabulary.
In *Language Resources and Evaluation Conference* (LREC 2022).

# License

This resource is licensed under the CC0 (CC-ZERO) 1.0 license, available at
[https://creativecommons.org/publicdomain/zero/1.0/](https://creativecommons.org/publicdomain/zero/1.0/)
and also included in LICENSE.txt.

If you make direct use of this resource in academic work, please cite the above
publication.

[^1]: Note that the data is in TSV, not CSV even though the files end with
`.csv`. This is a quirk of DuckDB's export functionality.
