Suomi24-2017H2
2019-01-11

This is a full database dump of Suomi24 up to the end of the year
2017 from Aller Media, received in June 2018, transformed to VRT form
and morpho-syntactically annotated for FIN-CLARIN in the CSC Taito
environment by Jussi Piitulainen, using ad hoc scripts followed by a
new setup of the UDPipe tokenizer and the old dependency analysis
tools and models from Turku NLP (now part of FIN-CLARIN VRT Tools). A
number of hidden messages were in the database dump by accident; they
are not included in the final data set.

The 99 files contain up to a million messages each, split from the
original 24 files into a, b, c, ... parts. The year in a file name
appears to be the year *after* the contents of the file.

- threads*.vrt: messages that start discussion threads
- comments*.vrt: further messages in discussion threads

Messages appear as text elements that contain paragraph elements that
contain sentence elements that contain a sequence of annotated
tokens. Thread titles appear both as an attribute in each message and
as a paragraph in the starting message.

Text attributes:

- type: "thread" or "comment"
- thread: thread identifier (number)
- comment: comment identifier (number; 0 in the thread-starting message)
- parent: parent-comment identifier (or 0)
- quote: quoted-comment identifier (or 0)
- date, time, datetime: creation time (format: 2019-01-11 16:55:26)
- nick: user nickname
- signed: whether the nick was registered (-1, 0, 1)
- title: thread title from the starting message
- topics: comma-separated discussion-area numbers

A small number of dates may still be dummy (1970-01-01).

The meanings of the three values of "signed" are:

- logged in, registered nick: 1
- logged in, anonymous nick: -1
- not logged in, anonymous nick: 0

Paragraph attributes:

- type: "title" or "body"

A number of text elements contain nothing. These are genuinely empty
messages. A dummy token may be inserted in a later version, for VRT
reasons; in this version they are left as they are.

The VRT files contain comments that give names to the tab-separated
fields, in order; due to the processing history, these may also
appear in mid-file, but not in mid-sentence:

- word: surface form of the token
- lemma, pos, msd: base form, "part-of-speech", "morpho-syntactic
  description"
- ref, dephead, deprel: dependency analysis (token number, head
  number or 0, relation)
- spaces: spaces after the token in the original data (from the
  tokenizer)
- initid: running number (redundant with ref; this is from the
  tokenizer)

Since the parser produced some multi-rooted analyses anyway, the long
sentences that were parsed in shorter shreds were left multi-rooted
when the shreds were put back together.

The three characters < > & appear as &lt; &gt; &amp; everywhere
(because in bare form they are used for the markup), and the two
quotation characters " ' appear as &quot; &apos; in attribute values.
Otherwise all content is encoded as UTF-8.

Spurious control characters were interpreted (for example, most of
the C1 block was apparently intended as Microsoft CP-1252) or
removed, space characters were normalized, BIDI markers were simply
removed, and SHY was either made a HYPHEN or removed, depending on
context. However, Unicode normalization was not done, nor were
ligatures considered, nor unassigned code points; we may learn to do
better. No attempt was made to normalize the various characters used
or abused as quotation marks, apostrophes, or dashes.
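For concreteness, here is a minimal Python sketch of a reader for
files of this shape; it is not part of the distribution. The exact
syntax of the field-name comments is an assumption (the
"#vrt positional-attributes" convention); the element nesting, the
attribute quoting, and the character escaping follow the description
above.

    import re

    # The five entities used in the data, as described above.
    ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&",
                "&quot;": '"', "&apos;": "'"}

    def unescape(value):
        # A single pass, so that "&amp;lt;" becomes "&lt;", not "<".
        return re.sub("&lt;|&gt;|&amp;|&quot;|&apos;",
                      lambda mo: ENTITIES[mo.group()], value)

    def read_messages(stream):
        """Yield (attributes, tokens) for each text element, with each
        token a dict keyed by the field names currently in force."""
        names, attrs, tokens = [], None, None
        for line in stream:
            line = line.rstrip("\n")
            if line.startswith("<!--"):
                # Field-name comment, assumed here to be of the form
                # <!-- #vrt positional-attributes: word lemma ... -->
                mo = re.match(r"<!-- #vrt positional-attributes: (.+) -->",
                              line)
                if mo:
                    names = mo.group(1).split()
            elif line.startswith("<text"):
                attrs = {name: unescape(value)
                         for name, value
                         in re.findall(r'(\w+)="([^"]*)"', line)}
                tokens = []
            elif line == "</text>":
                yield attrs, tokens
            elif line.startswith("<"):
                pass  # paragraph and sentence tags; token lines never
                      # start with a bare "<" because content is escaped
            elif line:
                tokens.append(dict(zip(names,
                                       map(unescape,
                                           line.split("\t")))))

For example, the genuinely empty messages in a file could then be
counted with sum(not tokens for _, tokens in read_messages(open(name))).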
A small number of particularly problematic message bodies, some
apparently not text at all, were identified by ocular inspection and
mostly removed. Over-long "words" were shortened, partly for
processing reasons. Both the removed bodies and the shortened words
are marked with "REDACTED" in the data.

2019-01-11 Jussi Piitulainen, FIN-CLARIN