The Suomi24 Corpus 2001-2017, VRT version 1.1 published in the Download service

In January, we published a new version of Suomi24 2017H2 in the Korp service under the name ”The Suomi24 Sentences Corpus 2001-2017, Korp version 1.1”. We have now published the corresponding new version in the Download service as ”The Suomi24 Corpus 2001-2017, VRT version 1.1”.

The old 2017H2 version has the following problems, which have been fixed in the new version. The old version:

– contains major discrepancies in dependency annotations, resulting from a mistake in the parsing process;
– has only numeric topics (discussion area numbers) for messages, not topic names;
– lacks several other text attributes;
– has many text attributes with different names;
– lacks base forms without compound-boundary markers;
– has base forms with doubly XML-encoded &, < and > (e.g., “&amp;lt;” instead of “&lt;” for <);
– has the data divided into 99 files with at most a million messages each, instead of 17 files, one for each year;
– has thread start messages and their comments in different files;
– has the messages sorted according to the thread or comment id (number), instead of having all the messages of a thread (within a year) consecutively in thread order;
– contains some completely empty messages (text elements with no tokens);
– contains 143 messages with the dummy timestamp “1970-01-01 00:00:00”; and
– contains extra (non-initial) positional-attributes comments.
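
Three of the problems listed above (doubly XML-encoded characters, empty messages and dummy timestamps) are easy to check for mechanically. The following Python code is only a rough sketch of such a check, not the procedure that was actually used to produce version 1.1. It assumes that the VRT data has one token per line with tab-separated attributes, that each message is enclosed in a text element, and that the timestamp appears as an attribute value on the text start tag. The function names check_vrt and fix_double_encoding are made up for this illustration.

```python
import re

# A doubly encoded character looks like "&amp;lt;" where "&lt;" was intended.
DOUBLE_ENCODED = re.compile(r"&amp;(amp|lt|gt);")
DUMMY_TIMESTAMP = "1970-01-01 00:00:00"


def fix_double_encoding(value):
    """Remove one extra level of XML encoding, e.g. "&amp;lt;" -> "&lt;"."""
    return DOUBLE_ENCODED.sub(r"&\1;", value)


def check_vrt(lines):
    """Count three of the listed problems in an iterable of VRT lines.

    Assumption: structural tags and comments start with "<", and every
    other non-empty line is a token line.
    """
    counts = {"double_encoded": 0, "empty_texts": 0, "dummy_timestamps": 0}
    tokens_in_text = 0
    for line in lines:
        line = line.rstrip("\n")
        if line == "<text>" or line.startswith("<text "):
            # A new message starts: reset the token counter.
            tokens_in_text = 0
            if DUMMY_TIMESTAMP in line:
                counts["dummy_timestamps"] += 1
        elif line.startswith("</text>"):
            # A message with no token lines is completely empty.
            if tokens_in_text == 0:
                counts["empty_texts"] += 1
        elif line and not line.startswith("<"):
            tokens_in_text += 1
            if DOUBLE_ENCODED.search(line):
                counts["double_encoded"] += 1
    return counts
```

Note that a pattern-based repair like fix_double_encoding cannot tell a processing error from a message whose author really typed the characters “&lt;”, so it should be applied only to fields known to be affected, such as the base forms.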