Resource title (English): The Suomi24 Corpus 2001-2017, VRT version 1.3 

Resource title (Finnish): Suomi24-korpus 2001-2017, VRT-versio 1.3 

Shortname: suomi24-2001-2017-vrt-v1-3 

Metadata: http://urn.fi/urn:nbn:fi:lb-2020021801

Rightholder: Aller Media Oy

License: ACA-NC
The complete license is available at http://urn.fi/urn:nbn:fi:lb-20150304151

A copy of the license is included in LICENSE.txt. The license details
may be subject to change, so before downloading the resource, please
refer to the latest version of the license at the above link.

Resource group page: http://urn.fi/urn:nbn:fi:lb-2022011221


Short description

The corpus contains all the texts available in the discussion forums
of the Suomi24 online social networking website from 1 January 2001 to
31 December 2017. The data was tokenized, converted to VRT format and
annotated at the Language Bank of Finland.

The base data is the same as for the VRT version 1.0, but it was
re-parsed to correct major discrepancies in dependency annotations
resulting from a mistake in processing the first version of the
corpus. The messages have also been reordered and their attributes
augmented so that this data set corresponds to the data in Korp. VRT
version 1.2 also includes sentence-level sentiment polarity
annotation. VRT version 1.3 includes annotations for names and
identified languages as well as minor metadata corrections.

The entire corpus in the VRT format is downloadable for academic
research purposes.


Detailed description

This data set is an annotated VRT version of a full database dump of
the content of the Suomi24 discussion forums
(https://keskustelu.suomi24.fi) from 1 January 2001 to 31 December
2017 from Aller Media, received in June 2018. The data set excludes
data from closed or hidden discussion topics.

The data was tokenized, transformed to VRT format and
morpho-syntactically annotated for FIN-CLARIN in the CSC Taito
environment with ad-hoc FIN-CLARIN VRT Tools scripts running e.g. the
UDPipe tokenizer (finnish-tdt model, with post-processors) and the
already old dependency analysis tools and models from Turku NLP group
(TDPP scripts adapted for VRT in the language bank, models used as
they were). The messages were then reordered and augmented with
derived attributes. Later, sentence sentiment polarity was annotated
by a sentiment classifier trained on the FinnSentiment corpus (see
https://arxiv.org/pdf/2012.02613.pdf), names were recognized by the
FiNER tagger, a part of Finnish Tagtools 1.6
(http://urn.fi/urn:nbn:fi:lb-2024021401), and sentence languages were
identified with HeLI-OTS 2.0
(https://urn.fi/urn:nbn:fi:lb-2024040301).

For the VRT version 1.1, the VRT files have been re-generated from the
Corpus Workbench data in the Korp concordancing service of the
Language Bank of Finland (https://korp.csc.fi/), so they correspond
exactly to the data in the Korp service. Although the base data is
the same as for the VRT version 1.0, it has been re-parsed to correct
major discrepancies in dependency annotations resulting from a mistake
in processing the first version of the corpus, and the messages have
been reordered and their attributes augmented. Please see the end of
this file for more details on the changes.

In VRT version 1.2, a sentiment polarity attribute has been added to
each sentence.

In VRT version 1.3, identified-language attributes have been added to
each sentence, paragraph and text, and name attributes have been
added. In addition, spurious spaces have been removed from some text
attributes, and some attributes have been renamed. Please see the end
of this file for more details on the changes.

The data has been divided into files by the year, corresponding to the
subcorpora in Korp. The messages within each year are sorted by
thread, and threads are sorted by the timestamp of the first message
of the thread. Messages within a thread are sorted in thread order:
each message is followed by the direct comments to it (recursively),
sorted by their timestamp. Threads that span several years have
been split by the year.
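
The thread order described above can be sketched as follows. This is
only an illustration, not the script used to produce the data: it
assumes that the messages of one thread are given as dictionaries with
the keys comment_id (integer), parent_comment_id (integer) and
datetime (string), and that all parents are present in the input. The
function name thread_order is hypothetical.

```python
from collections import defaultdict


def thread_order(messages):
    """Order the messages of one thread: each message is followed by
    its direct comments (recursively), sorted by timestamp."""
    children = defaultdict(list)
    starts = []
    for msg in messages:
        if msg["comment_id"] == 0:
            # comment_id 0 marks the thread start message
            starts.append(msg)
        else:
            # parent_comment_id 0 means a direct comment to the start
            children[msg["parent_comment_id"]].append(msg)
    for replies in children.values():
        # "YYYY-MM-DD HH:MM:SS" strings sort chronologically
        replies.sort(key=lambda m: m["datetime"])
    ordered = []
    stack = list(reversed(starts))
    while stack:
        msg = stack.pop()
        ordered.append(msg)
        # push replies reversed so that the earliest is popped first
        stack.extend(reversed(children.get(msg["comment_id"], [])))
    return ordered
```

Because threads are split by year, a comment whose parent is in the
file of another year would be dropped by this sketch; real use would
need to handle such orphans explicitly.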

The original data contained 143 messages with a dummy timestamp
(1970-01-01 00:00:00), which has been replaced in this data set with a
timestamp interpolated from the timestamps of the surrounding
messages.

Messages appear as text elements that contain paragraph elements that
contain sentence elements that contain a sequence of annotated tokens.
Thread titles appear both as an attribute in each message and as a
paragraph in the first message of the thread.
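
As a rough illustration of this structure, the following sketch
classifies the lines of a VRT file and counts the elements and tokens.
It relies on the fact that a bare "<" occurs only in markup. The
element names "paragraph" and "sentence" in the sample, the shortened
positional-attributes comment and the function names are assumptions
made for the example.

```python
from collections import Counter
import re

TAG_NAME = re.compile(r"</?([A-Za-z_][A-Za-z0-9_]*)")


def classify_line(line):
    """Return (kind, value) for one line of VRT."""
    line = line.rstrip("\n")
    if line.startswith("<!--"):
        return "comment", line
    if line.startswith("</"):
        return "end", TAG_NAME.match(line).group(1)
    if line.startswith("<"):
        return "start", TAG_NAME.match(line).group(1)
    # Anything else is a token line with tab-separated fields
    return "token", line.split("\t")


def count_elements(lines):
    """Count start tags by element name, and token lines."""
    counts = Counter()
    for line in lines:
        kind, value = classify_line(line)
        if kind == "start":
            counts[value] += 1
        elif kind == "token":
            counts["(tokens)"] += 1
    return counts


# Invented sample, not actual corpus content
sample = [
    "<!-- #vrt positional-attributes: word ref lemma -->",
    '<text msg_type="thread_start">',
    '<paragraph type="title">',
    '<sentence id="1">',
    "Honda\t1\tHonda",
    "</sentence>",
    "</paragraph>",
    "</text>",
]
counts = count_elements(sample)
```

For a real file, count_elements can be given an open file object
(opened with encoding="utf-8") instead of a list, so that the file is
processed one line at a time.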

The text elements contain the following essential attributes:
- msg_type: "thread_start" or "comment"
- thread_id: thread identifier (number)
- comment_id: comment identifier (number; 0 if thread start message)
- msg_id: constructed message id (thread_id:comment_id)
- parent_comment_id: parent-comment identifier (0 if thread start
  message or if parent is the thread start message)
- quoted_comment_id: quoted-comment identifier (0 if no quotation)
- date: creation date (2019-01-11)
- time: creation time (16:55:26)
- datetime: combined creation date and time (2019-01-11 16:55:26)
- thread_start_datetime: creation date and time of the thread start
  message (2001-01-01 01:30:00)
- parent_datetime: creation date and time of the parent comment
  (2001-01-01 01:30:00, empty for thread start messages)
- datetime_approximated: whether the date and time were approximated
  based on the surrounding messages (the original was 1970-01-01
  00:00:00)
- author: user nickname
- author_logged_in: whether author was logged in (y, n)
- author_nick_registered: whether nickname was registered (y, n)
- title: thread title from starting message
- topic_names: hierarchical topic (discussion area) name, top level
  first, levels separated by " &gt; " ("Ajoneuvot ja liikenne &gt;
  Autot &gt; Automerkit &gt; Honda")
- topic_names_set: topic level names as a set ("|Ajoneuvot ja
  liikenne|Automerkit|Autot|Honda|")
- topic_name_top: top-level topic name ("Ajoneuvot ja liikenne")
- topic_name_leaf: bottom-level topic name ("Honda")
- topic_adultonly: whether the topic is for adults only (y, n)
- empty: whether the original message was completely empty (y, n)
- sum_lang: the ISO 639-3 codes of languages identified in the
  sentences of the text and the number of sentences in each language
  (see the sentence attribute lang below for some more information)
  ("|fin:37|izh:1|und:1|")
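
The attributes of a start tag can be read for example as in the
following sketch, which also decodes the character entities and
parses a sum_lang value into a dictionary. The function names and the
attribute values of the sample line (such as the thread identifier)
are invented for illustration.

```python
import html
import re

ATTR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_start_tag(line):
    """Return the attributes of a start tag as a dict of decoded strings."""
    return {name: html.unescape(value) for name, value in ATTR.findall(line)}


def parse_sum_lang(value):
    """Parse a value such as "|fin:37|izh:1|und:1|" into a dict."""
    counts = {}
    for item in value.strip("|").split("|"):
        if item:
            code, _, number = item.rpartition(":")
            counts[code] = int(number)
    return counts


# Invented sample line, not actual corpus content
line = (
    '<text msg_type="comment" thread_id="123" '
    'title="&quot;Civic&quot; &amp; muut" '
    'topic_names="Ajoneuvot ja liikenne &gt; Autot &gt; Automerkit &gt; Honda" '
    'sum_lang="|fin:37|izh:1|und:1|">'
)
attrs = parse_start_tag(line)
languages = parse_sum_lang(attrs["sum_lang"])
levels = attrs["topic_names"].split(" > ")
```

All attribute values are returned as strings; numeric attributes such
as thread_id need to be converted separately if needed.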

The following text element attributes can be derived from other
attributes, are included mostly for backward-compatibility (although
often renamed) or are otherwise less essential:
- title_orig: original title with possible leading, trailing and
  multiple consecutive spaces preserved
- author_orig: original author with possible leading, trailing and
  multiple consecutive spaces preserved
- topic_names_orig: original hierarchical topic (discussion area)
  name, with double spaces preserved
- datefrom, dateto: creation date (20190111)
- timefrom, timeto: creation time (165526)
- author_v1: user nickname as it incorrectly was in the Korp version
  1.0 of the corpus (different from the VRT version 1.0); this
  attribute is completely missing from those texts that lacked the
  author attribute in the Korp version 1.0 of the corpus
- author_nick_type: "anonymous" or "registered" (same information as
  in author_nick_registered but with different values)
- author_signed_status: whether nickname was registered and the author
  logged in (-1, 0, 1):
  - 1: logged in, registered nick
  - -1: logged in, anonymous nick
  - 0: not logged in, anonymous nick
  This information is also available in the separate attributes
  author_logged_in and author_nick_registered.
- author_name_type: always "user_nickname"
- topic_nums: comma-separated topic numbers, from bottom to top
  ("3258,1109,6254,2")
- topic_nums_set: topic numbers as a set ("|3258|1109|6254|2|")
- filename_vrt: the name of the VRT file containing the message during
  processing
- filename_orig: the name of the VRT file containing the message in
  the VRT version 1.0 of the corpus
- origfile_textnum: the number of the corresponding text element in
  the VRT file in the VRT version 1.0 (1-based)
- id: same as msg_id (the previous name of msg_id)
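
As an illustration of how the derivable attributes relate to the
others, the following sketch recomputes some of them from a dictionary
of decoded text attribute values. It is an assumption-laden
reconstruction, not the code used to produce the data: the function
names are hypothetical, and the members of topic_names_set are assumed
to be sorted by code point, which is consistent with the example value
above but not otherwise confirmed.

```python
import re

# author_signed_status -> (author_logged_in, author_nick_registered)
SIGNED_STATUS = {"1": ("y", "y"), "-1": ("y", "n"), "0": ("n", "n")}


def clean_spaces(value):
    """Remove leading and trailing spaces and collapse runs of spaces."""
    return re.sub(r" +", " ", value).strip(" ")


def derive_attributes(attrs):
    """Recompute derivable text attributes from the other attributes."""
    logged_in, registered = SIGNED_STATUS[attrs["author_signed_status"]]
    topic_names = clean_spaces(attrs["topic_names_orig"])
    levels = topic_names.split(" > ")
    return {
        "msg_id": f'{attrs["thread_id"]}:{attrs["comment_id"]}',
        "datetime": f'{attrs["date"]} {attrs["time"]}',
        "datefrom": attrs["date"].replace("-", ""),
        "timefrom": attrs["time"].replace(":", ""),
        "author": clean_spaces(attrs["author_orig"]),
        "title": clean_spaces(attrs["title_orig"]),
        "author_logged_in": logged_in,
        "author_nick_registered": registered,
        "author_nick_type": "registered" if registered == "y" else "anonymous",
        "topic_names": topic_names,
        "topic_name_top": levels[0],
        "topic_name_leaf": levels[-1],
        # Assumption: set members sorted by code point
        "topic_names_set": "|" + "|".join(sorted(set(levels))) + "|",
    }


# Invented sample values, not actual corpus content
derived = derive_attributes({
    "thread_id": "123", "comment_id": "0",
    "date": "2019-01-11", "time": "16:55:26",
    "author_signed_status": "-1",
    "author_orig": " Matti  M ", "title_orig": "Honda  Civic",
    "topic_names_orig": "Ajoneuvot ja liikenne > Autot > Automerkit >  Honda",
})
```

Comparing the result with the attribute values actually present in a
text element is one way to check a reader implementation.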

Paragraph attributes:
- type: "title" or "body"
- sum_lang: the ISO 639-3 codes of languages identified in the
  sentences of the paragraph and the number of sentences in each
  language (see the sentence attribute lang below for some more
  information) ("|fin:2|und:1|", ordered by number of occurrences,
  tied codes in alphabetic order)
- id: running number of the paragraph within the subcorpus

Sentence attributes:
- lang: ISO 639-3 code of the language of the sentence as identified
  by HeLI-OTS 2.0; "und" for non-language data
- lang_conf: a confidence value of the language identification
  provided by HeLI-OTS
- sentiment_polarity: sentiment polarity of the sentence: "pos",
  "neut" or "neg"
- polarity: an alias of (and the older name for) sentiment_polarity
- id: running number of the sentence within the subcorpus
- _skip: "|finnish-nertag|" if the sentence was not annotated with
  names; completely missing otherwise ("|" in Korp)

In addition to these elements to which all tokens belong, name (and
time and number) expressions recognized by FiNER 1.6 are enclosed in
"ne" elements with the following attributes:
- name: the name enclosed by the element, possibly multi-word; for
  name expressions, the last word is the base form of the last token,
  whereas the preceding ones are word forms
- fulltype: the complete type of the name as recognized by FiNER
  ("EnamexOrgCrp")
- ex: the main category of the expression: "ENAMEX" (name), "TIMEX"
  (time expression) or "NUMEX" (numerical expression)
- type: the broad type of the expression ("ORG")
- subtype: the finer type of the expression ("CRP")
- placename: same as the value for "name" if the name is recognized as
  a place name, empty otherwise
- placename_source: "ner" if the name is recognized as a place name,
  empty otherwise

Nested name expressions are enclosed in "ne1" and "ne2" elements with
the same attributes as "ne". "ne1" elements occur only within "ne" and
"ne2" only within "ne1".

The order of the attributes in the element start tags is arbitrary but
fixed.

The original data contained 19,378 completely empty messages. To
preserve their information in the VRT data, a lone underscore was
added as their content, with the appropriate annotations. The
attribute "empty" of these texts has the value "y".

The first line of each VRT file is a special comment that names the
positional attributes (tab-separated fields) in order:

<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel spaces initid lex/ nertag2 nertags2/ nerbio2 -->

- word: surface form of the token
- ref: the number of the token in the sentence
- lemma: base form
- lemmacomp: base form with compound-boundary markers (vertical bars)
  separating compound parts
- pos: part of speech
- msd: morpho-syntactic description
- dephead: dependency head number (0 if no head)
- deprel: dependency relation
- spaces: spaces around (or within) the token in the original data
  (from tokenizer)
- initid: running number (from tokenizer; largely redundant with ref)
- lex/: lemgram, a combination of base form and a part-of-speech tag,
  surrounded by vertical bars
- nertag2: maximal name information produced by FiNER, of the form
  CategoryTypSbt-X, where CategoryTypSbt is the full type of the name
  (see above) and X is one of "B" (the first word of a multi-word
  name), "E" (the last word of a multi-word name) or "F" (a
  single-word name)
- nertags2/: name information produced by FiNER, including possible
  nested names: values CategoryTypSbt-X-N separated by vertical bars,
  where CategoryTypSbt and X are as in "nertag2" and N is the nesting
  level (0, 1 or 2), with 0 being the outermost (maximal) name
- nerbio2: a different kind of name information produced by FiNER for
  maximal names: B-TYP (the first word of a name with broad type
  TYP), I-TYP (a subsequent word of a name with type TYP) or O
  (outside a name)
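
A token line can be mapped to the attribute names given in the first
line of the file for example as in the following sketch. It assumes
that a trailing slash in an attribute name marks a value consisting of
items separated and surrounded by vertical bars, as described for
"lex/" and "nertags2/" above. The function names are hypothetical, and
the field values of the sample token line are invented for
illustration.

```python
import html

HEADER_PREFIX = "<!-- #vrt positional-attributes:"


def positional_names(first_line):
    """Extract the attribute names from the special first-line comment."""
    line = first_line.strip()
    if not (line.startswith(HEADER_PREFIX) and line.endswith("-->")):
        raise ValueError("not a positional-attributes comment")
    return line[len(HEADER_PREFIX):-len("-->")].split()


def parse_token(line, names):
    """Map the tab-separated fields of a token line to attribute names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(names):
        raise ValueError("unexpected number of fields")
    token = {}
    for name, field in zip(names, fields):
        if name.endswith("/"):
            # Assumed set-valued: "|a|b|" -> ["a", "b"], "|" -> []
            items = [html.unescape(v) for v in field.split("|") if v]
            token[name[:-1]] = items
        else:
            token[name] = html.unescape(field)
    return token


header = (
    "<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd "
    "dephead deprel spaces initid lex/ nertag2 nertags2/ nerbio2 -->"
)
# Invented sample token line, not actual corpus content
sample = (
    "Honda\t1\tHonda\tHonda\tN\t_\t0\tROOT\t_\t1\t|Honda..n.1|"
    "\tEnamexOrgCrp-F\t|EnamexOrgCrp-F-0|\tB-ORG"
)
names = positional_names(header)
token = parse_token(sample, names)
```

Reading the names from the comment instead of hard-coding them keeps
the reader usable if the set or order of positional attributes differs
between versions of the data.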

Since the parser produced some multi-rooted analyses anyway, the long
sentences that were parsed in shorter shreds were left multi-rooted
when the shreds were put back together.

The three characters < > & appear as &lt; &gt; &amp; everywhere
(because in bare form they are used for the markup), and the double
quotation mark " appears as &quot; in text attribute values. Attribute
values are always enclosed in double quotation marks.

Otherwise all content is encoded as UTF-8. Spurious control characters
were interpreted (for example, most of the C1 block was apparently
intended as Microsoft CP-1252) or removed, space characters were
normalized, BIDI markers were simply removed, U+FDD3 NONCHARACTER was
replaced with HYPHEN, and SHY was either made HYPHEN or removed,
depending on context. However, Unicode normalization was not applied,
ligatures and unassigned code points were not considered, and
private-use characters were preserved as such; we may learn to do
better.

No attempt was made to normalize the various characters used or abused
for quotation marks, apostrophes, or dashes.

The values of the text attributes "author", "title", "topic_names",
"topic_name_leaf" and "topic_names_set" were cleaned up so that they
do not contain leading, trailing or multiple consecutive spaces. The
original values are preserved in attributes "author_orig",
"title_orig" and "topic_names_orig", respectively. The original values
of "topic_name_leaf" and "topic_names_set" can be derived from
"topic_names_orig".

Note also that the base forms (positional attribute "lemma")
unintentionally include 77 values with a trailing space, resulting
from the base form with compound-boundary markers ("lemmacomp") ending
in vertical bar preceded by a space (" |"), typically ": |", even
though a trailing vertical bar should not have been interpreted as a
compound-boundary marker. This may be corrected in a future version of
the data.
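
Until then, the affected values can be handled when reading the data,
for example as in the following sketch. It assumes that "lemma" is
"lemmacomp" with the vertical bars removed, as the description above
suggests; the function names are hypothetical.

```python
def lemma_from_lemmacomp(lemmacomp):
    """Remove compound-boundary markers (assumed relation to lemma)."""
    return lemmacomp.replace("|", "")


def repair_lemma(lemma):
    """Remove trailing spaces from a base form, unless nothing would remain."""
    return lemma.rstrip(" ") or lemma


examples = [lemma_from_lemmacomp(v) for v in ("auto|merkki", ": |")]
repaired = [repair_lemma(v) for v in examples]
```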

A small number of particularly problematic message bodies, some
apparently not text at all, were identified by ocular inspection and
mostly removed. Over-long "words" were shortened, partly for
processing reasons. Both are marked with "REDACTED" in the data.

Each VRT file contains a couple of informational XML-style comment
lines ("<!-- ... -->") at the beginning and end of the file.


Differences from VRT version 1.2

- Names have been annotated with the positional attributes "nertag2",
  "nertags2" and "nerbio2" and structures "ne", "ne1" and "ne2" and
  their attributes.

- Sentence languages have been identified and annotated with the
  attributes "lang" and "lang_conf" (and "_skip"), with aggregate
  values in paragraph and text attributes "sum_lang".

- The sentence attribute "polarity" has been renamed to the more
  appropriate "sentiment_polarity", but "polarity" is kept as an alias
  for backward-compatibility.

- The values of text attributes "author" and "title" have been cleaned
  up by removing leading, trailing and multiple consecutive spaces.
  The original values are preserved in attributes "author_orig" and
  "title_orig".

- The values of text attributes "topic_names", "topic_names_set" and
  "topic_name_leaf" have been cleaned up by removing the spurious
  spaces in " Työpaikkailmoitukset" and "Ravinto  ja ruokavaliot". The
  original value of "topic_names" is preserved in attribute
  "topic_names_orig"; the other two attributes have no corresponding
  original-value attribute but their original values can be derived
  from "topic_names_orig".

- The text attribute "id" has been aliased to "msg_id" for
  forward-compatibility.


Differences from VRT version 1.1

- The attribute polarity has been added to sentences for sentiment
  polarity, with values "pos", "neut" and "neg".


Differences from VRT version 1.0

- The data has been re-parsed to correct major discrepancies in
  dependency annotations resulting from a mistake in processing the
  first version of the corpus.

- Many text attributes have been renamed:
  - type -> msg_type (and its value "thread" -> "thread_start")
  - thread -> thread_id
  - comment -> comment_id
  - parent -> parent_comment_id
  - quote -> quoted_comment_id
  - nick -> author
  - signed -> author_signed_status
  - topics -> topic_nums

- A number of (derived) text attributes have been added, including
  topic names corresponding to the topic numbers in the previous
  version; see above for the attributes in this version.

- The positional attribute lemma with compound-boundary markers (|)
  has been renamed as lemmacomp, and a new attribute lemma has been
  added without the markers.

- The positional attribute lex (lemgram) has been added.

- Base forms containing doubly XML-encoded &, < or > have been
  corrected (e.g., &amp;amp; -> &amp;).

- The data has been divided into files and sorted differently. All the
  messages of each year are in a single file, thread start messages
  and comments in the same files. The messages within each year are
  sorted by thread, threads by the timestamp of the first message of
  the thread, and messages within a thread in thread order.

- An underscore has been added as the content of completely empty
  messages.

- Dummy timestamps "1970-01-01 00:00:00" have been replaced with
  timestamps interpolated from surrounding messages.

- Extra (non-initial) positional-attributes comments have been
  removed.
  
  
For further information, please contact fin-clarin@helsinki.fi .