

Suomi24: Versions and Updates

This page describes the individual versions of Suomi24 in detail. It explains data extensions and format changes and highlights how each version differs from its predecessors.

This page documents differences between the parts of the Suomi24 2001–2023 corpus collection and a history of the most significant changes, both in Korp and the VRT version. Differences from the older versions of the Suomi24 corpora (Suomi24-2001-2014-korp, Suomi24-korp-2016H2) are not documented.

The parts of the Suomi24 2001–2023 corpus collection

The parts included in the Suomi24 2001–2023 Korp corpora are the following:

The corresponding parts in the Suomi24 2001–2023 VRT corpora are the following:

Please see the Suomi24 resource group page for more information.

Differences between the parts of the Suomi24 2001–2023 corpus collection

As of 2025-04-08, the different parts of Suomi24 corpora differ from each other in the ways listed below.

The data in the Korp and VRT versions are in principle the same but not all information present in the data is shown in the Korp user interface.

In general, the format of the data of Suomi24 2021–2023 is mostly the same as that of Suomi24 2018–2020. The few differences are due to differences in the original source data or in processing the data.

Characters

Some Unicode characters may have been treated differently in the different parts of the corpus.

Tokenization

In Suomi24 2021–2023, a token boundary has been added after a punctuation mark that is immediately followed by an upper-case letter. In 2001–2017 and 2018–2020, such tokens were not split, which also resulted in missing sentence breaks.
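
As a rough illustration of the rule described above, the following Python sketch splits a token after a punctuation mark that is immediately followed by an upper-case letter. The function name and the set of punctuation marks are assumptions made for this example; the tokenizer actually used for the corpus may behave differently.

```python
# Hypothetical illustration only; not the tokenizer used for Suomi24.
PUNCTUATION = ".!?"  # assumed set of punctuation marks that trigger a split


def split_after_punctuation(token: str) -> list[str]:
    """Split a token after punctuation directly followed by an upper-case letter."""
    parts = []
    start = 0
    for i in range(1, len(token)):
        if token[i - 1] in PUNCTUATION and token[i].isupper():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


print(split_after_punctuation("loppu.Uusi"))  # split into two tokens
print(split_after_punctuation("esim.pieni"))  # lower-case letter follows: no split
```

Under this rule, a string such as loppu.Uusi is split so that a sentence break can be placed between the two parts, whereas in the older parts of the collection it would remain a single token.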

Text (structural) attributes

Text attribute only in Suomi24 2001–2017 and 2018–2020:

  • author_orig: same as author in 2018–2020; in 2001–2017, the original author nickname that may contain leading, trailing or multiple consecutive spaces, normalized to attribute author

Sentence attribute only in Suomi24 2001–2017 and 2018–2020:

  • polarity: renamed to the more descriptive sentiment_polarity and retained in 2001–2017 and 2018–2020 as an alias for backward-compatibility

Sentence attribute only in Suomi24 2018–2020:

  • lang_v1: the code for the language identified for the sentence by HeLI-OTS 1.1 (lang in Suomi24 2018–2020, VRT version 1.0); can differ from that identified by HeLI-OTS 2.0

Text attributes only in Suomi24 2018–2020 and 2021–2023:

  • hierarchy_id
  • thread_closed
  • user_id

Text attributes only in Suomi24 2001–2017 (either the information they contained was not available in the later source data or they were not applicable to the data):

  • author_v1: not applicable
  • author_nick_type: whether nickname was registered
  • author_signed_status: whether nickname was registered and the author logged in
  • topic_nums: comma-separated topic numbers
  • topic_nums_set: topic numbers as a set

In 2021–2023, the values of id attributes are unique identifiers composed of pseudo-random parts, whereas in 2001–2017 and 2018–2020, the text id is the same as msg_id, and paragraph and sentence ids are running numbers of the elements within the subcorpus (file).

Word (positional) attributes

In 2018–2020 and 2021–2023, values for the positional attribute lemma for compound words may differ from those in 2001–2017, as they are intended to be more natural, without lemmatizing all compound parts of the word.

History of updates of Suomi24 in Korp

2025-04-12: Extended with 2021–2023; added name and language annotations; minor changes

The Suomi24 Sentences Corpus 2001–2023, Korp version was made available in Korp as a release candidate:

  • The collection was extended with the discussions of 2021–2023 (The Suomi24 Sentences Corpus 2021–2023, Korp version), with over 15 million messages (texts) and nearly 474 million tokens.
  • Suomi24 Sentences Corpora 2001–2017 and 2018–2020 were updated with annotations of names recognized with FiNER 1.6 and languages of sentences identified with HeLI-OTS 2.0.
  • Spurious spaces were removed from topic names, titles and author nicknames in these corpora.
  • Because of these additions and changes, the version numbers of the older parts were incremented: 2001–2017 to 1.3 and 2018–2020 to 1.1.

2024-11-12: Sentence sentiment polarity; change in representation of identified languages

Changes to the sentence sentiment polarity attribute:

  • The text attribute name sentence polarity was changed to sentence sentiment polarity, as the established meaning of sentence polarity refers to syntactic polarity.
  • The internal name of the attribute was changed from sentence_polarity to sentence_sentiment_polarity.
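
Saved CQP expressions that refer to the old internal attribute name need to be updated after this change. The following Python sketch shows one way to do that. It assumes that the expressions refer to the attribute by its internal name (for example with the _. prefix for structural attributes); the function name and the example expression are hypothetical.

```python
import re

OLD_NAME = "sentence_polarity"
NEW_NAME = "sentence_sentiment_polarity"


def update_polarity_attribute(cqp: str) -> str:
    """Replace the old internal attribute name with the new one in a CQP expression."""
    # \b makes sure that only the whole attribute name is matched,
    # not a longer name that merely ends in "sentence_polarity".
    return re.sub(rf"\b{OLD_NAME}\b", NEW_NAME, cqp)


old_query = '[_.sentence_polarity = "neg"]'  # assumed example expression
print(update_polarity_attribute(old_query))
```

Applying the function to an already updated expression leaves it unchanged, because the new name does not contain the old name as a substring.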

The Korp representation of the text attribute containing the identified language of a sentence was changed as follows:

  • A language is always represented by its three-letter ISO 639-3 code even if the code had a translation in Korp.
  • If a language code has a translation, it is shown as a tooltip in the sidebar of the KWIC result when hovering over the code.
  • A language code in the KWIC sidebar is a link to the page of the language in question on the SIL’s ISO 639-3 site.
  • The extended search has a selection list for language codes.
  • The attribute label includes the language code standard (ISO 639-3). The internal representation of the attribute is unchanged, so the language codes can be used in the CQP expressions of the advanced search as before.

2021-11-05: Extended with 2018–2020

The Suomi24 corpus collection in Korp was extended with the discussions from 2018–2020 (The Suomi24 Sentences Corpus 2018–2020, Korp version).

2021-04-21: Sentence sentiment polarity added

The Suomi24 2001–2017 Sentences Corpus, Korp version was updated to version 1.2, in which each sentence has an attribute for sentiment polarity (positive, neutral, negative). The polarity information has been produced by a classifier trained on the FinnSentiment corpus.

2020-02-20: Writer information corrected; corpus name changed

The Suomi24 2001–2017 Sentences Corpus, Korp version was updated to version 1.1, in which writer information was corrected. The following modifications were made to the corpus:

  • All messages have the writer nickname information also in the years 2009–2012 and 2014, in which a large number of messages were previously completely missing a writer nickname.
  • Writer nicknames now contain the characters ', " and & literally instead of their XML character entity references. (Search results may have shown these characters correctly, but they could not be searched for.)
  • In the name of the corpus, 2017H2 was replaced with the year range 2001–2017, which indicates the extent of the corpus more clearly. The full name of the corrected version of the corpus is The Suomi24 Sentences Corpus 2001–2017, Korp version 1.1.

The deficient writer nickname information of the previous corpus version is available in the Korp advanced search and Korp API as the attribute text_author_v1, but it is not shown in the sidebar and you cannot calculate statistics based on it.

2020-01-24: Text attributes restored

The text attributes of the Suomi24 2017H2 corpus were restored to the ones before correcting the dependency parses in December 2019:

  • Some text attributes had been left out: writer nickname, registered nickname, discussion thread start timestamp, file name, parent timestamp and message is completely empty.
  • Some text attributes had got values differing from previous ones: yes/no-valued attributes and paragraph type.

This also affected the users of the Korp API.

2019-12-19: Dependency parses corrected

The discrepancies noticed in September 2019 in the dependency parses and relations of the Suomi24 2017H2 corpus were corrected.

2019-02-18: New corpus version: 2017H2

The Suomi24 Sentences Corpus, version 2017H2 was made available in Korp as a beta test version. The corpus covered the discussions in the Suomi24 discussion forum site from 1 January 2001 to 31 December 2017. The corpus also contained messages missing from the previous version, but no messages removed from the Suomi24 site nor messages in closed discussion topics.

Differences from the previous version (Suomi24 2016H2):

  • The corpus has been divided into subcorpora by the year.
  • Within each year, the messages of a discussion thread are contiguous, and the comments to a message follow the message commented on. Discussion threads are sorted by the time of the first message of the thread during that year.
  • Text attributes have been renamed and the form of some attribute values differs from that in the previous version.
  • The titles of discussion threads are also part of the text content, which allows searching from them using, e.g., lemmas. In contrast, the writer nicknames are not part of the text content.
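
The ordering of messages described above can be sketched in Python as follows. This is only an illustration: the field names (msg_id, thread_id, parent_id, timestamp) are hypothetical, sibling comments are assumed to be ordered by timestamp, and the code is not the one used for processing the corpus.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    msg_id: str
    thread_id: str
    parent_id: Optional[str]  # None for a thread start message
    timestamp: str            # ISO 8601 string, so string order is time order


def order_messages(messages: list[Message]) -> list[Message]:
    """Order the messages of one year by thread and by comment structure."""
    threads = defaultdict(list)
    for msg in messages:
        threads[msg.thread_id].append(msg)

    ordered = []
    # Threads are sorted by the time of their first message within the year.
    for thread in sorted(threads.values(),
                         key=lambda msgs: min(m.timestamp for m in msgs)):
        ids = {m.msg_id for m in thread}
        children = defaultdict(list)
        for msg in sorted(thread, key=lambda m: m.timestamp):
            # A message whose parent is not in this year's data is treated
            # as a top-level message of the thread (an assumption).
            parent = msg.parent_id if msg.parent_id in ids else None
            children[parent].append(msg)
        # Depth-first traversal: comments follow the message commented on.
        stack = list(reversed(children[None]))
        while stack:
            msg = stack.pop()
            ordered.append(msg)
            stack.extend(reversed(children[msg.msg_id]))
    return ordered
```

With this ordering, the messages of a thread stay contiguous and each comment chain can be read in sequence directly after the message it responds to.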

History of updates of downloadable Suomi24 VRT data

2025-04-12: Extended with 2021–2023; added name and language annotations; minor changes

The following updates were made to The Suomi24 Corpus 2001–2017, VRT version 1.3 (suomi24-2001-2017-vrt-v1-3), The Suomi24 Corpus 2018–2020, VRT version 1.1 (suomi24-2018-2020-vrt-v1-1) and The Suomi24 Corpus 2021–2023, VRT version (suomi24-2021-2023-vrt):

  • The data was extended with the messages in Suomi24 from the years 2021–2023 (suomi24-2021-2023-vrt).
  • Names were annotated with the positional attributes nertag2, nertag2 and nerbio2 and structures ne, ne1 and ne2 and their attributes.
  • Sentence languages were identified and annotated with the attributes lang and lang_conf (and _skip), with aggregate values in paragraph and text attributes sum_lang.
  • The sentence attribute polarity was renamed to the more appropriate sentiment_polarity, but polarity was kept as an alias for backward-compatibility.
  • The values of text attributes author and title were cleaned up by removing leading, trailing and multiple consecutive spaces. (In 2018–2020 and 2021–2023, author had no such spaces.) The original values were preserved in attributes author_orig (omitted from 2021–2023) and title_orig.
  • The values of text attributes topic_names, topic_names_set and topic_name_leaf were cleaned up by removing the spurious space in Työpaikkailmoitukset and Ravinto ja ruokavaliot. The original value of topic_names was preserved in attribute topic_names_orig; the other two attributes have no corresponding original-value attribute but their values can be inferred from topic_names.
  • The text attribute msg_id is intended to replace id, but id has been preserved for backward-compatibility and its values are the same as before in 2001–2017 and 2018–2020.
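
Because the older attribute names are kept as aliases, code that reads the VRT data can prefer the new names and fall back to the old ones. The Python sketch below illustrates this. The helper names and the example tag lines are hypothetical, and attribute values are returned as they appear in the data, without decoding XML character entities.

```python
import re

# Matches name="value" pairs in a VRT structure start tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_attrs(tag_line: str) -> dict[str, str]:
    """Return the attributes of a structure start tag as a dictionary."""
    return dict(ATTR_RE.findall(tag_line))


def message_id(text_attrs: dict[str, str]):
    """Prefer msg_id and fall back to id if msg_id is absent."""
    return text_attrs.get("msg_id", text_attrs.get("id"))


def sentiment(sentence_attrs: dict[str, str]):
    """Prefer sentiment_polarity and fall back to the alias polarity."""
    return sentence_attrs.get("sentiment_polarity",
                              sentence_attrs.get("polarity"))


# Made-up example lines, not taken from the corpus
text_attrs = parse_attrs('<text id="42" msg_id="42" title="Otsikko">')
sentence_attrs = parse_attrs('<sentence id="1" polarity="neut">')
print(message_id(text_attrs), sentiment(sentence_attrs))
```

Reading the identifier from msg_id also avoids depending on id, whose values are formed differently in 2021–2023 than in the older parts.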

2021-11-10: Extended with 2018–2020

The Suomi24 Corpus 2018–2020, VRT version (suomi24-2018-2020-vrt) was published, with data from the years 2018–2020.

2021-04-20: suomi24-2001-2017-vrt-v1-2: Added sentence sentiment polarity

The Suomi24 Corpus 2001–2017, VRT version 1.2 (suomi24-2001-2017-vrt-v1-2) was published:

  • The attribute polarity was added to sentences for sentiment polarity, with values pos, neut and neg.

2020-03-09: suomi24-2001-2017-vrt-v1-1: Corrected dependency annotations, renamed text attributes, added attributes

The Suomi24 Corpus 2001–2017, VRT version 1.1 (suomi24-2001-2017-vrt-v1-1) was published, with the following additions and changes:

  • The data was re-parsed to correct major discrepancies in dependency annotations resulting from a mistake in processing the first version of the corpus.
  • Many text attributes were renamed:
    • type -> msg_type (and its value thread -> thread_start)
    • thread -> thread_id
    • comment -> comment_id
    • parent -> parent_comment_id
    • quote -> quoted_comment_id
    • nick -> author
    • signed -> author_signed_status
    • topics -> topic_nums
  • A number of (derived) text attributes were added, including topic names corresponding to the topic numbers in the previous version; see above for the attributes in this version.
  • The positional attribute lemma with compound-boundary markers (|) was renamed as lemmacomp, and a new attribute lemma was added without the markers.
  • The positional attribute lex (lemgram) was added.
  • Base forms containing doubly XML-encoded &, < or > were corrected (e.g., &amp;amp; -> &amp;).
  • The data was divided into files and sorted differently. All the messages of each year are in a single file, thread start messages and comments in the same files. The messages within each year are sorted by thread, threads by the timestamp of the first message of the thread, and messages within a thread in thread order.
  • An underscore was added as the content of completely empty messages.
  • Dummy timestamps 1970-01-01 00:00:00 were replaced with timestamps interpolated from surrounding messages.
  • Extra (non-initial) positional attributes comments were removed.
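
For code written against the first version of the data, the text attribute renames listed above can be expressed as a simple mapping. The following Python sketch is an illustration only: the function name is hypothetical, and it handles just the renamed attribute names and the one renamed value documented above, not any other changes in the data.

```python
# Old text attribute name -> new name in VRT version 1.1
TEXT_ATTR_RENAMES = {
    "type": "msg_type",
    "thread": "thread_id",
    "comment": "comment_id",
    "parent": "parent_comment_id",
    "quote": "quoted_comment_id",
    "nick": "author",
    "signed": "author_signed_status",
    "topics": "topic_nums",
}


def upgrade_text_attrs(old_attrs: dict[str, str]) -> dict[str, str]:
    """Rename old-style text attributes to their version 1.1 names."""
    new_attrs = {TEXT_ATTR_RENAMES.get(name, name): value
                 for name, value in old_attrs.items()}
    # The message type value "thread" was renamed to "thread_start".
    if new_attrs.get("msg_type") == "thread":
        new_attrs["msg_type"] = "thread_start"
    return new_attrs


print(upgrade_text_attrs({"type": "thread", "nick": "nimimerkki"}))
```

Attributes that are not in the mapping are passed through unchanged.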

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2025040201

Last modified on 2025-04-10
