Suomi24-2017H2
2019-01-11

This is a full database dump of Suomi24 up to the end of the year
2017 from Aller Media, received in June 2018, transformed to VRT form
and morpho-syntactically annotated for FIN-CLARIN in the CSC Taito
environment by Jussi Piitulainen, using ad hoc scripts followed by a
new setup of the UDPipe tokenizer and the old dependency analysis
tools and models from Turku NLP (now part of FIN-CLARIN VRT Tools). A
number of hidden messages were in the database dump by accident; they
are not included in the final data set.

The 99 files contain up to a million messages each, split from the
original 24 files into a, b, c, ... parts. The year in a file name
appears to be the year *after* the contents of the file.

- threads*.vrt: messages that start discussion threads
- comments*.vrt: further messages in discussion threads

Messages appear as text elements that contain paragraph elements that
contain sentence elements that contain a sequence of annotated
tokens. Thread titles appear both as an attribute in each message and
as a paragraph in the starting message.

Text attributes:

- type: "thread" or "comment"
- thread: thread identifier (number)
- comment: comment identifier (number; 0 in the thread-starting message)
- parent: parent-comment identifier (or 0)
- quote: quoted-comment identifier (or 0)
- date, time, datetime: creation time (format: 2019-01-11 16:55:26)
- nick: user nickname
- signed: whether the nick was registered (-1, 0, 1)
- title: thread title from the starting message
- topics: comma-separated discussion-area numbers

A small number of dates may still be dummy (1970-01-01).

The meanings of the three values of "signed" are:

- logged in, registered nick: 1
- logged in, anonymous nick: -1
- not logged in, anonymous nick: 0

Paragraph attributes:

- type: "title" or "body"

A number of text elements contain nothing. These are genuinely empty
messages. A dummy token may be inserted in a later version, for VRT
reasons; in this version they are left as they are.

The VRT files contain comments that give names to the tab-separated
fields, in order; due to the processing history, these may also
appear in mid-file, but not in mid-sentence:

- word: surface form of the token
- lemma, pos, msd: base form, "part-of-speech", "morpho-syntactic
  description"
- ref, dephead, deprel: dependency analysis (token number, head
  number or 0, relation)
- spaces: spaces after the token in the original data (from the
  tokenizer)
- initid: running number (redundant with ref; this is from the
  tokenizer)

Since the parser produced some multi-rooted analyses anyway, the long
sentences that were parsed in shorter shreds were left multi-rooted
when the shreds were put back together.

The three characters < > & appear as &lt; &gt; &amp; everywhere
(because in bare form they are used for the markup), and the two
quotation characters " ' appear as &quot; &apos; in attribute values.
Otherwise all content is encoded as UTF-8.

Spurious control characters were interpreted (for example, most of
the C1 block was apparently intended as Microsoft CP-1252) or
removed, space characters were normalized, BIDI markers were simply
removed, and SHY was either made a HYPHEN or removed, depending on
context. However, Unicode normalization was not done, nor were
ligatures considered, nor unassigned code points; we may learn to do
better. No attempt was made to normalize the various characters used
or abused as quotation marks, apostrophes, or dashes.
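For concreteness, here is a minimal Python sketch of a reader for
files of this shape; it is not part of the distribution. The exact
syntax of the field-name comments is an assumption (the
"#vrt positional-attributes" convention); the element nesting, the
attribute quoting, and the character escaping follow the description
above.

    import re

    # The five entities used in the data, as described above.
    ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&",
                "&quot;": '"', "&apos;": "'"}

    def unescape(value):
        # A single pass, so that "&amp;lt;" becomes "&lt;", not "<".
        return re.sub("&lt;|&gt;|&amp;|&quot;|&apos;",
                      lambda mo: ENTITIES[mo.group()], value)

    def read_messages(stream):
        """Yield (attributes, tokens) for each text element, with each
        token a dict keyed by the field names currently in force."""
        names, attrs, tokens = [], None, None
        for line in stream:
            line = line.rstrip("\n")
            if line.startswith("<!--"):
                # Field-name comment, assumed here to be of the form
                # <!-- #vrt positional-attributes: word lemma ... -->
                mo = re.match(r"<!-- #vrt positional-attributes: (.+) -->",
                              line)
                if mo:
                    names = mo.group(1).split()
            elif line.startswith("<text"):
                attrs = {name: unescape(value)
                         for name, value
                         in re.findall(r'(\w+)="([^"]*)"', line)}
                tokens = []
            elif line == "</text>":
                yield attrs, tokens
            elif line.startswith("<"):
                pass  # paragraph and sentence tags; token lines never
                      # start with a bare "<" because content is escaped
            elif line:
                tokens.append(dict(zip(names,
                                       map(unescape,
                                           line.split("\t")))))

For example, the genuinely empty messages in a file could then be
counted with sum(not tokens for _, tokens in read_messages(open(name))).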
A small number of particularly problematic message bodies, some
apparently not text at all, were identified by ocular inspection and
mostly removed. Over-long "words" were shortened, partly for
processing reasons. Both the removed bodies and the shortened words
are marked with "REDACTED" in the data.

2019-01-11 Jussi Piitulainen, FIN-CLARIN