This page gives detailed information on the individual versions of Suomi24. It explains data extensions and format changes and points out how each version differs from its predecessors.
This page documents differences between the parts of the Suomi24 2001–2023 corpus collection and a history of the most significant changes, both in Korp and the VRT version. Differences from the older versions of the Suomi24 corpora (Suomi24-2001-2014-korp, Suomi24-korp-2016H2) are not documented.
The parts included in the Suomi24 2001–2023 Korp corpora are the following:
The corresponding parts in the Suomi24 2001–2023 VRT corpora are the following:
Please see the Suomi24 resource group page for more information.
As of 2025-04-08, the different parts of Suomi24 corpora differ from each other in the ways listed below.
The data in the Korp and VRT versions are in principle the same but not all information present in the data is shown in the Korp user interface.
In general, the format of the data of Suomi24 2021–2023 is mostly the same as that of Suomi24 2018–2020. The few differences are due to differences in the original source data or in processing the data.
Some Unicode characters may have been treated differently in the different parts of the corpus.
In Suomi24 2021–2023, a token boundary has been added after a punctuation mark immediately followed by an upper-case letter, whereas in 2001–2017 and 2018–2020, such tokens have not been split, also resulting in missing sentence breaks.
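The tokenization difference can be illustrated with a minimal sketch. The regular expression and the function name `split_fused_token` are illustrative assumptions, not the actual rule used when processing the corpus; the sketch only shows the kind of boundary described above, added after a punctuation mark that is immediately followed by an upper-case letter.

```python
import re

# Hypothetical pattern: a zero-width position that follows sentence
# punctuation and precedes an upper-case letter (ASCII plus Å, Ä, Ö).
_BOUNDARY = re.compile(r"(?<=[.!?])(?=[A-ZÅÄÖ])")


def split_fused_token(token: str) -> list[str]:
    """Split a token at each such boundary; other tokens are returned whole."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return _BOUNDARY.split(token)


print(split_fused_token("loppu.Uusi"))
```

In 2001–2017 and 2018–2020, a token such as the one in the example would be left unsplit, which is also why the sentence break between the two words is missing there.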
Text attribute only in Suomi24 2001–2017 and 2018–2020:
author_orig: same as author in 2018–2020; in 2001–2017, the original author nickname, which may contain leading, trailing or multiple consecutive spaces and is normalized in attribute author
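The relationship between the original and the normalized nickname might be sketched as follows. The function name `normalize_nickname` is hypothetical, and the sketch assumes that only the plain space character is affected; how other whitespace characters were treated is not stated here.

```python
import re


def normalize_nickname(author_orig: str) -> str:
    """Strip leading and trailing spaces and collapse runs of spaces."""
    stripped = author_orig.strip(" ")
    # Replace two or more consecutive spaces with a single space.
    return re.sub(r" {2,}", " ", stripped)


print(normalize_nickname("  Matti   Meikäläinen "))
```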
Sentence attribute only in Suomi24 2001–2017 and 2018–2020:
polarity: renamed to the more descriptive sentiment_polarity and retained in 2001–2017 and 2018–2020 as an alias for backward compatibility
Sentence attribute only in Suomi24 2018–2020:
lang_v1: the code for the language identified for the sentence by HeLI-OTS 1.1 (lang in Suomi24 2018–2020, VRT version 1.0); can differ from that identified by HeLI-OTS 2.0
Text attributes only in Suomi24 2018–2020 and 2021–2023:
hierarchy_id
thread_closed
user_id
Text attributes only in Suomi24 2001–2017 (either the information they contained was not available in the later source data or they were not applicable to the data):
author_v1: not applicable
author_nick_type: whether nickname was registered
author_signed_status: whether nickname was registered and the author
topic_nums: comma-separated topic numbers
topic_nums_set: topic numbers as a set
In 2021–2023, the values of id attributes are unique identifiers composed of pseudo-random parts, whereas in 2001–2017 and 2018–2020, the text id is the same as msg_id, and paragraph and sentence ids are running numbers of the elements within the subcorpus (file).
In 2018–2020 and 2021–2023, values for the positional attribute lemma for compound words may differ from those in 2001–2017, as they are intended to be more natural, without lemmatizing all compound parts of the word.
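Code that reads several parts of the collection may need to allow for attributes that exist in only some of them. The following is a minimal sketch, assuming that structural attributes appear as `name="value"` pairs on VRT start-tag lines; the helper names and the sample tag lines are hypothetical, and escaped characters in attribute values are not handled.

```python
import re
from typing import Optional

# Matches name="value" pairs on a structure start-tag line.
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_attrs(tag_line: str) -> dict[str, str]:
    """Return the attributes of a start tag as a dictionary."""
    return dict(_ATTR.findall(tag_line))


def sentence_sentiment(attrs: dict[str, str]) -> Optional[str]:
    """Prefer sentiment_polarity; fall back to the older name polarity."""
    return attrs.get("sentiment_polarity", attrs.get("polarity"))


def original_author(attrs: dict[str, str]) -> Optional[str]:
    """Use author_orig where present, otherwise author."""
    return attrs.get("author_orig", attrs.get("author"))


# Made-up tag lines, for illustration only.
print(sentence_sentiment(parse_attrs('<sentence id="12" polarity="neg">')))
print(original_author(parse_attrs('<text author="nimi">')))
```

The fallback order reflects the attribute descriptions above: the newer or more specific name is tried first, and the name available in the other parts is used when it is missing.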
The Suomi24 Sentences Corpus 2001–2023, Korp version was made available in Korp as a release candidate:
Changes to the sentence sentiment polarity attribute:
Renamed from sentence_polarity to sentence_sentiment_polarity.
The Korp representation of the text attribute containing the identified language of a sentence was changed as follows:
The Suomi24 corpus collection in Korp was extended with the discussions from 2018–2020 (The Suomi24 Sentences Corpus 2018–2020, Korp version).
The Suomi24 2001–2017 Sentences Corpus, Korp version was updated to version 1.2, in which each sentence has an attribute for sentiment polarity (positive, neutral, negative). The polarity information has been produced by a classifier trained on the FinnSentiment corpus.
The Suomi24 2001–2017 Sentences Corpus, Korp version was updated to version 1.1, in which writer information was corrected. The following modifications were made to the corpus:
The deficient writer nickname information of the previous corpus version is available in the Korp advanced search and Korp API as the attribute text_author_v1, but it is not shown in the sidebar and you cannot calculate statistics based on it.
The text attributes of the Suomi24 2017H2 corpus were restored to the ones before correcting the dependency parses in December 2019:
This also affected the users of the Korp API.
The discrepancies noticed in September 2019 in the dependency parses and relations of the Suomi24 2017H2 corpus were corrected.
The Suomi24 Sentences Corpus, version 2017H2 was made available in Korp as a beta test version. The corpus covered the discussions in the Suomi24 discussion forum site from 1 January 2001 to 31 December 2017. The corpus also contained messages missing from the previous version, but no messages removed from the Suomi24 site nor messages in closed discussion topics.
Differences from the previous version (Suomi24 2016H2):
The following updates were made to The Suomi24 Corpus 2001–2017, VRT version 1.3 (suomi24-2001-2017-vrt-v1-3), The Suomi24 Corpus 2018–2020, VRT version 1.1 (suomi24-2018-2020-vrt-v1-1) and The Suomi24 Corpus 2021–2023, VRT version (suomi24-2021-2023-vrt):
nertag2, nertag2 and nerbio2 and structures ne, ne1 and ne2 and their attributes.
lang and lang_conf (and _skip), with aggregate values in paragraph and text attributes sum_lang.
polarity was renamed to the more appropriate sentiment_polarity, but polarity was kept as an alias for backward-compatibility.
author and title were cleaned up by removing leading, trailing and multiple consecutive spaces. (In 2018–2020 and 2021–2023, author had no such spaces.) The original values were preserved in attributes author_orig (omitted from 2021–2023) and title_orig.
topic_names, topic_names_set and topic_name_leaf were cleaned up by removing the spurious space in Työpaikkailmoitukset and Ravinto ja ruokavaliot. The original value of topic_names was preserved in attribute topic_names_orig; the other two attributes have no corresponding original-value attribute but their values can be inferred from topic_names.
msg_id is intended to replace id, but id has been preserved for backward-compatibility and its values are the same as before in 2001–2017 and 2018–2020.
The Suomi24 Corpus 2018–2020, VRT version (suomi24-2018-2020-vrt) was published, with data from the years 2018–2020.
The Suomi24 Corpus 2001–2017, VRT version 1.2 (suomi24-2001-2017-vrt-1-2) was published:
polarity was added to sentences for sentiment polarity, with values pos, neut and neg.
The Suomi24 Corpus 2001–2017, VRT version 1.1 (suomi24-2001-2017-vrt-v1-1) was published, with the following additions and changes:
type -> msg_type (and its value thread -> thread_start)
thread -> thread_id
comment -> comment_id
parent -> parent_comment_id
quote -> quoted_comment_id
nick -> author
signed -> author_signed_status
topics -> topic_nums
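Code written against the older attribute names could be adapted with a simple mapping. The following is a minimal sketch of the renames listed above; the names `RENAMED_ATTRS` and `upgrade_attrs` are hypothetical, and the sample input is made up for illustration.

```python
# Old attribute name -> new attribute name, as listed above.
RENAMED_ATTRS = {
    "type": "msg_type",
    "thread": "thread_id",
    "comment": "comment_id",
    "parent": "parent_comment_id",
    "quote": "quoted_comment_id",
    "nick": "author",
    "signed": "author_signed_status",
    "topics": "topic_nums",
}


def upgrade_attrs(old: dict[str, str]) -> dict[str, str]:
    """Rename attributes; names not in the mapping are kept as they are."""
    new = {RENAMED_ATTRS.get(name, name): value for name, value in old.items()}
    # The value "thread" of the former attribute type became "thread_start".
    if new.get("msg_type") == "thread":
        new["msg_type"] = "thread_start"
    return new


print(upgrade_attrs({"type": "thread", "nick": "nimi", "title": "Otsikko"}))
```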
lemma with compound-boundary markers (|) were renamed as lemmacomp, and a new attribute lemma was added without the markers.
lex (lemgram) was added.
&, < or > were corrected (e.g., &amp; -> &).
1970-01-01 00:00:00 were replaced with timestamps interpolated from surrounding messages.
This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2025040201
Last modified on 2025-04-10