Samples of Spoken Finnish, VRT version
Suomen kielen näytteitä, VRT-versio

Short name: skn-vrt
URN: http://urn.fi/urn:nbn:fi:lb-2021112221
License: CC-BY
Licensor: Institute for the Languages of Finland
Distributor: The Language Bank of Finland / FIN-CLARIN


Description

This package contains the transcript data for the Samples of Spoken
Finnish in the VRT (VeRticalized Text) format as used in the Language
Bank of Finland. The data corresponds to that in Korp, except that
obsolete LAT links have been removed.

Please see also http://urn.fi/urn:nbn:fi:lb-201407141 for more
information on the Samples of Spoken Finnish corpus in general.

The data has been automatically annotated using an old version of the
Turku Dependency Parser Pipeline (TDPP) from Turku NLP, based on
manually added standard Finnish word forms of the original dialect
words.

The directory "vrt" contains the data split into 99 VRT files so that
each original sample is in its own file. The file name contains the
number of the sample and the parish; e.g., SKN01a_Suomussalmi.vrt.
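
As a small illustration, the sample identifier and the parish can be
recovered from a file name. The pattern below is only inferred from the
single example name above, and the function name split_sample_name is
hypothetical; other file names in the package may need a looser pattern.

```python
import re
from pathlib import Path

# Pattern assumed from the one example name SKN01a_Suomussalmi.vrt:
# "SKN", digits, an optional lowercase letter, "_", then the parish.
NAME_RE = re.compile(r"^(SKN\d+[a-z]?)_(.+)$")

def split_sample_name(path):
    """Return (sample_id, parish) for a VRT file path, or None if no match."""
    match = NAME_RE.match(Path(path).stem)
    return (match.group(1), match.group(2)) if match else None

print(split_sample_name("vrt/SKN01a_Suomussalmi.vrt"))
```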

The VRT files contain XML-style tags for nested structural markup
(texts, paragraphs and sentences) and associated annotations
(metadata) as attributes. Each token is on its own line, with its
attributes separated by TAB characters. In addition, the files contain XML-style
comment lines at the beginning (and end).
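
The line types described above could be told apart roughly as follows.
This is a minimal sketch, not an official reader: it assumes that a
line starting with "<" is always markup (that is, that a literal "<" in
token values is escaped), and the function name classify_line and the
sample lines are made up for illustration.

```python
def classify_line(line):
    """Return "blank", "comment", "end_tag", "start_tag" or "token"."""
    text = line.rstrip("\n")
    if not text.strip():
        return "blank"
    if text.startswith("<!--"):
        return "comment"
    if text.startswith("</"):
        return "end_tag"
    # Assumption: structural tags contain no TAB and end with ">".
    if text.startswith("<") and text.endswith(">") and "\t" not in text:
        return "start_tag"
    return "token"

# Invented miniature input with only two positional attributes.
sample = [
    "<!-- #vrt positional-attributes: word lemma -->",
    '<text name="x">',
    '<sentence id="1">',
    "talo\ttalo",
    "</sentence>",
    "</text>",
]
kinds = [classify_line(line) for line in sample]
print(kinds)
```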

The first comment line lists the names of the token (positional)
attributes (as used internally in Korp) in the order they are listed
for each token:

<!-- #vrt positional-attributes: word original normalized comment id ref lemma lemmacomp pos msd dephead deprel nertag nerbio -->

- word: standard Finnish word form
- original: original dialectal form (detailed transcription)
- normalized: rough dialectal form without diacritics
- comment: note on the word
- id: the number of the token in the sentence
- ref: the number of the token in the sentence as used for dependency heads
- lemma: base form of the standard Finnish word form
- lemmacomp: base form with compound boundaries marked with a "|"
- pos: part of speech
- msd: morphological analysis (morpho-syntactic description)
- dephead: dependency head number, referring to attribute ref (0 if no head)
- deprel: dependency relation
- nertag: name tag
- nerbio: "B": begins a name; "I": within a name; "O": outside a name
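
A token line can be turned into a name-to-value mapping by reading the
attribute names from the first comment line. The sketch below assumes
the comment has exactly the form shown above; the function names and
the token line are invented for illustration and are not taken from the
corpus.

```python
import re

HEADER_RE = re.compile(r"<!--\s*#vrt positional-attributes:\s*(.*?)\s*-->")

def read_attr_names(line):
    """Return the list of positional attribute names, or None if no match."""
    match = HEADER_RE.match(line.strip())
    return match.group(1).split() if match else None

def parse_token(line, names):
    """Split a TAB-separated token line and pair the values with the names."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(names, values))

header = ("<!-- #vrt positional-attributes: word original normalized comment "
          "id ref lemma lemmacomp pos msd dephead deprel nertag nerbio -->")
names = read_attr_names(header)

# Made-up token line (hypothetical values, not from the corpus).
token_line = "\t".join(["talo", "talo", "talo", "_", "1", "1", "talo",
                        "talo", "_", "_", "0", "_", "_", "O"])
token = parse_token(token_line, names)
print(token["word"], token["lemma"], token["dephead"])
```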

Note that the base form, part of speech, morphological analysis,
dependency relations and name information have been added by programs
and not manually corrected, so they contain errors. See also
https://www.kielipankki.fi/tuki/korp-tdt/ for some information (in
Finnish) on these annotations produced by TDPP.

A missing value for a token attribute is indicated by an underscore
("_").
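
When processing tokens, the underscore can be mapped to an explicit
missing value. The following sketch assumes the token has already been
read into a dictionary keyed by attribute name; the function name
clean_token is hypothetical, and treating id, ref and dephead as
integers is a choice made here based on the attribute descriptions
above.

```python
INT_ATTRS = {"id", "ref", "dephead"}  # described above as numbers

def clean_token(token):
    """Map "_" to None and convert numeric attributes to int where possible."""
    cleaned = {}
    for key, value in token.items():
        if value == "_":
            cleaned[key] = None
        elif key in INT_ATTRS and value.isdigit():
            cleaned[key] = int(value)
        else:
            cleaned[key] = value
    return cleaned

# Invented token fragment for illustration.
print(clean_token({"word": "talo", "comment": "_", "ref": "1", "dephead": "0"}))
```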

Each VRT file contains a single text element (structure) with the
following attributes:

<text name="SKN01a_Suomussalmi" title="Suomussalmen murretta (vihko 1)" editor="Alpo Räisänen" parish="Suomussalmi" dialect_group="Kainuu" dialect_region="Savolaismurteet" date="1978" datefrom="19780101" dateto="19781231" timefrom="000000" timeto="235959" _geo_parish="|Suomussalmi;FI;64.88685;28.90778|">

- name: name of the file
- title: title of the sample leaflet
- editor: editor of the sample
- parish: dialect parish
- dialect_group: dialect group
- dialect_region: dialect region
- date: year of publication of the sample leaflet

The attributes datefrom, dateto, timefrom, timeto and _geo_parish are
used internally in Korp.
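
The attributes of a start tag such as the one above can be read with a
regular expression. This is a sketch under assumptions: attribute
values are taken to use XML-style escaping, and the interpretation of
_geo_parish as "name;country;latitude;longitude" items delimited by "|"
is only inferred from the example value, not from a specification. The
function names are hypothetical.

```python
import re
from xml.sax.saxutils import unescape

ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')
TAG_RE = re.compile(r"<([A-Za-z_][\w.-]*)(\s.*)?>$")

def parse_start_tag(line):
    """Return (element_name, attribute_dict) for a start tag line, else None."""
    match = TAG_RE.match(line.strip())
    if not match:
        return None
    extra = {"&quot;": '"', "&apos;": "'"}
    attrs = {key: unescape(value, extra)
             for key, value in ATTR_RE.findall(match.group(2) or "")}
    return match.group(1), attrs

def parse_geo(value):
    """Split a _geo_ value into dicts (field meanings are an assumption)."""
    places = []
    for item in value.strip("|").split("|"):
        if not item:
            continue
        name, country, lat, lon = item.split(";")
        places.append({"name": name, "country": country,
                       "lat": float(lat), "lon": float(lon)})
    return places

example = ('<text name="SKN01a_Suomussalmi" title="Suomussalmen murretta (vihko 1)" '
           'editor="Alpo Räisänen" parish="Suomussalmi" dialect_group="Kainuu" '
           'dialect_region="Savolaismurteet" date="1978" datefrom="19780101" '
           'dateto="19781231" timefrom="000000" timeto="235959" '
           '_geo_parish="|Suomussalmi;FI;64.88685;28.90778|">')
tag, attrs = parse_start_tag(example)
print(tag, attrs["parish"], parse_geo(attrs["_geo_parish"]))
```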

Each paragraph element corresponds to one turn of either an
interviewer or interviewee, with the following attributes:

<paragraph id="1" speaker="AR" sex="NA" role="muu">

- id: paragraph number
- speaker: speaker initials
- sex: "M" (male), "N" (female; Finnish "nainen") or "NA" (not known)
- role: "haastateltava" (interviewee) or "muu" (usually interviewer)

Each sentence element has the following attributes:

<sentence id="1" origid="s1" beg="00:00:0.20" duration="1.59 s">

- id: sentence number
- origid: sentence identifier in the original data
- beg: sentence begin time in the original recording
- duration: duration of the sentence in the recording
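
The time values can be converted to seconds, for example to compute the
end time of a sentence. The sketch below assumes that beg has the form
hours:minutes:seconds and that duration is a number followed by " s",
as in the example tag above; other forms are not handled. The function
names are hypothetical.

```python
def beg_to_seconds(beg):
    """Convert a value like "00:00:0.20" to seconds (assumed h:m:s form)."""
    hours, minutes, seconds = beg.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def duration_to_seconds(duration):
    """Convert a value like "1.59 s" to seconds by dropping the unit."""
    return float(duration.split()[0])

start = beg_to_seconds("00:00:0.20")
end = start + duration_to_seconds("1.59 s")
print(f"sentence spans {start:.2f} to {end:.2f} seconds")
```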

In addition, tokens recognized as name, numeral or temporal
expressions are enclosed in ne elements with the following attributes:

<ne name="Kiantajärven" fulltype="EnamexLocGpl" ex="ENAMEX" type="LOC" subtype="GPL" placename="Kiantajärven" placename_source="ner">

- name: name (or numeral or temporal expression)
- fulltype: full expression type
- ex: category: "ENAMEX" (name), "NUMEX" (number), "TIMEX" (time)
- type: main type of expression
- subtype: subtype of expression
- placename: name if it is a place name, empty otherwise
- placename_source: "ner" if the name is a place name, empty otherwise
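
As an illustration, the place names marked in a file could be collected
from the ne start tags. This is a minimal sketch: it assumes each ne
start tag is on its own line and begins with "<ne ", the function name
place_names is hypothetical, and the second tag in the sample is
invented to show an expression without a place name.

```python
import re

ATTR_RE = re.compile(r'([\w.-]+)="([^"]*)"')

def place_names(lines):
    """Yield the non-empty placename attribute of each <ne ...> start tag."""
    for line in lines:
        if line.startswith("<ne "):
            attrs = dict(ATTR_RE.findall(line))
            if attrs.get("placename"):
                yield attrs["placename"]

sample = [
    '<ne name="Kiantajärven" fulltype="EnamexLocGpl" ex="ENAMEX" type="LOC" '
    'subtype="GPL" placename="Kiantajärven" placename_source="ner">',
    "</ne>",
    # Invented tag with empty values, for illustration only.
    '<ne name="X" fulltype="" ex="TIMEX" type="" subtype="" '
    'placename="" placename_source="">',
    "</ne>",
]
found = list(place_names(sample))
print(found)
```

Note that the value is the form that occurs in the text, so names may
appear in inflected forms.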
