The Suomi24 Corpus 2001–2017, VRT version 1.1

Persistent identifier: http://urn.fi/urn:nbn:fi:lb-2020021801
Licence: CLARIN ACA +NC 1.0: http://urn.fi/urn:nbn:fi:lb-20150304151

Short description

The corpus contains all the texts available in the discussion forums of the Suomi24 online social networking website from 1 January 2001 to 31 December 2017. The corpus was tokenized and then annotated by Jussi Piitulainen. The base data is the same as for the VRT version 1.0, but it was re-parsed to correct major discrepancies in dependency annotations resulting from a mistake in processing the first version of the corpus. The messages have also been reordered and their attributes augmented so that this data set corresponds to the data in Korp. The entire corpus in the VRT format is downloadable for academic research purposes.

Detailed description

This data set is an annotated VRT version of a full database dump of the content of the Suomi24 discussion forums (https://keskustelu.suomi24.fi) from 1 January 2001 to 31 December 2017 from Aller Media, received in June 2018. The data set excludes data from closed or hidden discussion topics.

The data was tokenized, transformed to VRT format and morpho-syntactically annotated for FIN-CLARIN in the CSC Taito environment by Jussi Piitulainen using ad-hoc scripts followed by a new setup of the UDPipe tokenizer and the old dependency analysis tools and models (TDPP) from Turku NLP (now part of FIN-CLARIN VRT Tools). The messages were then reordered and augmented with derived attributes by Jyrki Niemi.

For the VRT version 1.1, the VRT files have been re-generated from the Corpus Workbench data in the Korp concordancing service of the Language Bank of Finland (https://korp.csc.fi/), so they correspond exactly to the data in the Korp service.
Although the base data is the same as for the VRT version 1.0, it has been re-parsed to correct major discrepancies in dependency annotations resulting from a mistake in processing the first version of the corpus, and the messages have been reordered and their attributes augmented. Please see the end of this file for more details on the changes.

The data has been divided into files by year, corresponding to the subcorpora in Korp. The messages within each year are sorted by thread and threads by the timestamp of the first message of the thread. Messages within a thread are sorted in thread order: each message is followed by the direct comments to it (recursively), sorted by their timestamp. Threads that span several years have been split by year.

The original data contained 143 messages with a dummy timestamp (1970-01-01 00:00:00), which has been replaced in this data set with a timestamp interpolated from the timestamps of the surrounding messages.

Messages appear as text elements that contain paragraph elements that contain sentence elements that contain a sequence of annotated tokens. Thread titles appear both as an attribute in each message and as a paragraph in the first message of the thread.
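The thread order described above can be illustrated with a minimal Python sketch. It is not the script used to build the corpus. It assumes that the messages of one thread are available as dicts keyed by the text attribute names, with string values as in VRT, and that a comment whose parent is not among the given messages is placed directly under the thread start; the latter is a choice made only for this sketch.

```python
from collections import defaultdict


def thread_order(messages):
    """Return the messages of one thread in thread order.

    Each message is a dict with the string-valued keys msg_type,
    comment_id, parent_comment_id and datetime. Datetimes of the form
    "YYYY-MM-DD HH:MM:SS" sort chronologically as plain strings.
    """
    starts = [m for m in messages if m["msg_type"] == "thread_start"]
    comments = [m for m in messages if m["msg_type"] != "thread_start"]
    known_ids = {m["comment_id"] for m in comments}

    # Group comments under their parent; "0" stands for the thread start.
    # Assumption of this sketch: a comment whose parent is not present
    # is treated as a direct comment to the thread start.
    children = defaultdict(list)
    for m in comments:
        parent = m["parent_comment_id"]
        children[parent if parent in known_ids else "0"].append(m)
    for siblings in children.values():
        siblings.sort(key=lambda m: m["datetime"])

    # Depth-first traversal: each message is followed by its direct
    # comments (recursively), earliest first.
    ordered = list(starts)
    stack = list(reversed(children["0"]))
    while stack:
        m = stack.pop()
        ordered.append(m)
        stack.extend(reversed(children.get(m["comment_id"], [])))
    return ordered
```

An explicit stack is used instead of recursion so that very deep comment chains do not hit the Python recursion limit.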

The text elements contain the following essential attributes:

- msg_type: "thread_start" or "comment"
- thread_id: thread identifier (number)
- comment_id: comment identifier (number; 0 if thread start message)
- id: constructed message id (thread_id:comment_id)
- parent_comment_id: parent-comment identifier (0 if no parent)
- quoted_comment_id: quoted-comment identifier (0 if no quotation)
- date: creation date (2019-01-11)
- time: creation time (16:55:26)
- datetime: combined creation date and time (2019-01-11 16:55:26)
- thread_start_datetime: creation date and time of the thread start message (2001-01-01 01:30:00)
- parent_datetime: creation date and time of the parent comment (2001-01-01 01:30:00, empty for thread start messages)
- datetime_approximated: whether the date and time were approximated based on the surrounding messages (the original was 1970-01-01 00:00:00)
- author: user nickname
- author_logged_in: whether author was logged in (y, n)
- author_nick_registered: whether nickname was registered (y, n)
- title: thread title from starting message
- topic_names: hierarchical topic (discussion area) name, top level first, levels separated by " > " ("Ajoneuvot ja liikenne > Autot > Automerkit > Honda")
- topic_names_set: topic level names as a set ("|Ajoneuvot ja liikenne|Automerkit|Autot|Honda|")
- topic_name_top: top-level topic name ("Ajoneuvot ja liikenne")
- topic_name_leaf: bottom-level topic name ("Honda")
- topic_adultonly: whether the topic is for adults only (y, n)
- empty: whether the original message was completely empty (y, n)

The following text element attributes can be derived from other attributes, are included mostly for backward-compatibility (although often renamed) or are otherwise less essential:

- datefrom, dateto: creation date (20190111)
- timefrom, timeto: creation time (165526)
- author_v1: user nickname as it incorrectly was in the Korp version 1.0 of the corpus (different from the VRT version 1.0); this attribute is completely missing from those texts that lacked the author attribute in the Korp version 1.0 of the corpus
- author_nick_type: "anonymous" or "registered" (same information as in author_nick_registered but with different values)
- author_signed_status: whether nickname was registered and the author logged in (-1, 0, 1):
  - 1: logged in, registered nick
  - -1: logged in, anonymous nick
  - 0: not logged in, anonymous nick
  This information is also available in the separate attributes author_logged_in and author_nick_registered.
- author_name_type: always "user_nickname"
- topic_nums: comma-separated topic numbers, from bottom to top ("3258,1109,6254,2")
- topic_nums_set: topic numbers as a set ("|3258|1109|6254|2|")
- filename_vrt: the name of the VRT file containing the message during processing
- filename_orig: the name of the VRT file containing the message in the VRT version 1.0 of the corpus
- origfile_textnum: the number of the corresponding text element in the VRT file in the VRT version 1.0 (1-based)

Paragraph attributes:

- type: "title" or "body"
- id: running number of the paragraph within the subcorpus

Sentence attributes:

- id: running number of the sentence within the subcorpus

The order of the attributes in the element start tags is arbitrary but fixed.

The original data contained 19,378 completely empty messages. To preserve their information in the VRT data, a lone underscore was added as their content, with the appropriate annotations. The attribute "empty" of these texts has the value "y".
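As an illustration of how the text attributes above relate to each other, the following Python sketch parses a text start tag into a dict, splits topic_names into levels and recomputes author_signed_status from author_logged_in and author_nick_registered. The function names and the sample tag in the usage note are invented for this example and are not part of the corpus tools. The sketch relies only on attribute values being enclosed in double quotation marks and on special characters being stored as XML character entities.

```python
import html
import re

# name="value" pairs; values are always in double quotation marks.
ATTR_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_text_tag(line):
    """Parse a <text ...> start tag into a dict of decoded attribute values."""
    return {name: html.unescape(value) for name, value in ATTR_RE.findall(line)}


def split_topic_names(topic_names):
    """Split a topic_names value into its levels, top level first."""
    return topic_names.split(" > ")


def signed_status(attrs):
    """Recompute author_signed_status from the two separate attributes.

    Returns None for the combination that the attribute description
    does not list (not logged in, registered nick).
    """
    logged_in = attrs["author_logged_in"] == "y"
    registered = attrs["author_nick_registered"] == "y"
    if logged_in and registered:
        return 1
    if logged_in:
        return -1
    if not registered:
        return 0
    return None
```

For example, for a made-up tag such as `<text author_logged_in="y" author_nick_registered="n" topic_names="Ajoneuvot ja liikenne &gt; Autot">`, `signed_status(parse_text_tag(tag))` returns -1, and `split_topic_names` gives the top-level name as its first item and the leaf name as its last.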

The first line of each VRT file is a special comment that names the positional attributes (tab-separated fields) in order:

- word: surface form of the token
- lemma, lemmacomp, pos, msd: base form, base form with compound-boundary markers, "part-of-speech", "morpho-syntactic description"
- ref, dephead, deprel: dependency analysis (number, head number or 0, relation)
- spaces: spaces after the token in the original data (from tokenizer)
- initid: running number (redundant with ref, this is from tokenizer)
- lex/: lemgram, a combination of base form and a part-of-speech tag

Since the parser produced some multi-rooted analyses anyway, the long sentences that were parsed in shorter shreds were left multi-rooted when the shreds were put back together.

The three characters < > & appear as &lt; &gt; &amp; everywhere (because in bare form they are used for the markup), and the double quotation mark " appears as &quot; in attribute values. Attribute values are always enclosed in double quotation marks.

Otherwise all content is encoded as UTF-8. Spurious control characters were interpreted (for example, most of the C1 block was apparently intended as Microsoft CP-1252) or removed, space characters were normalized, BIDI markers were simply removed, U+FDD3 NONCHARACTER was replaced with HYPHEN, and SHY was either made HYPHEN or removed, depending on context. However, normalization was not done, nor ligatures considered, nor unassigned code points, and private-use characters were preserved as such; we may learn to do better. No attempt was made to normalize the various characters used or abused for quotation marks, apostrophes, or dashes.

Note that some values of the text attributes "author" and "title" contain two (or more) consecutive spaces. Moreover, one value of the attribute "topic_name_leaf" contains a leading space and one value a double space between words: " Työpaikkailmoitukset" and "Ravinto ja ruokavaliot", respectively.
This is also reflected in the attributes "topic_names" and "topic_names_set". These may be normalized in a future version of the data.

Note also that the base forms (positional attribute "lemma") unintentionally include 77 values with a trailing space, resulting from the base form with compound-boundary markers ("lemmacomp") ending in a vertical bar preceded by a space (" |"), typically ": |", even though a trailing vertical bar should not have been interpreted as a compound-boundary marker. This may be corrected in a future version of the data.

A small number of particularly problematic message bodies, some apparently not text at all, were identified by visual inspection and mostly removed. Over-long "words" were shortened, partly for processing reasons. Both are marked with "REDACTED" in the data.

Each VRT file contains a couple of informational XML-style comment lines ("<!-- ... -->") at the beginning and end of the file.

Differences from VRT version 1.0

- The data has been re-parsed to correct major discrepancies in dependency annotations resulting from a mistake in processing the first version of the corpus.
- Many text attributes have been renamed:
  - type -> msg_type (and its value "thread" -> "thread_start")
  - thread -> thread_id
  - comment -> comment_id
  - parent -> parent_comment_id
  - quote -> quoted_comment_id
  - nick -> author
  - signed -> author_signed_status
  - topics -> topic_nums
- A number of (derived) text attributes have been added, including topic names corresponding to the topic numbers in the previous version; see above for the attributes in this version.
- The positional attribute lemma with compound-boundary markers (|) has been renamed as lemmacomp, and a new attribute lemma has been added without the markers.
- The positional attribute lex (lemgram) has been added.
- Base forms containing doubly XML-encoded &, < or > have been corrected (e.g., &amp;amp; -> &amp;).
- The data has been divided into files and sorted differently. All the messages of each year are in a single file, thread start messages and comments in the same files. The messages within each year are sorted by thread, threads by the timestamp of the first message of the thread, and messages within a thread in thread order.
- An underscore has been added as the content of completely empty messages.
- Dummy timestamps "1970-01-01 00:00:00" have been replaced with timestamps interpolated from surrounding messages.
- Extra (non-initial) positional-attributes comments have been removed.
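
Reading the data

The token format described in this file can be read with a few lines of Python. The following is a minimal sketch and not part of the FIN-CLARIN VRT Tools. It assumes that the sentence start and end tags are spelled <sentence ...> and </sentence>, that every line beginning with "<" is markup or a comment (bare "<" never occurs in token fields, as they are stored as character entities), and that the caller supplies the positional attribute names in the order given by the first line of the file. The function name read_sentences is invented for this example.

```python
import html


def read_sentences(lines, names):
    """Yield each sentence as a list of dicts, one dict per token.

    lines: an iterable of VRT lines, for example an open file.
    names: positional attribute names in file order (supplied by the
           caller; this sketch does not parse the first-line comment).
    """
    sentence = None
    for line in lines:
        line = line.rstrip("\n")  # keep other whitespace: fields may end in a space
        if line.startswith("<sentence"):
            sentence = []
        elif line.startswith("</sentence"):
            if sentence is not None:
                yield sentence
            sentence = None
        elif line.startswith("<") or not line:
            continue  # other markup, comment or blank line
        elif sentence is not None:
            # Fields are tab-separated; decode &lt; &gt; &amp; &quot;.
            fields = [html.unescape(field) for field in line.split("\t")]
            sentence.append(dict(zip(names, fields)))
```

A typical call would be `for sentence in read_sentences(open(path, encoding="utf-8"), names): ...`, where path and names are supplied by the user. Only the newline is stripped from each line because some base forms are documented to end in a space.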