Finnish conversational chat corpus

The corpus contains 85 Finnish chat dialogs collected during 2019-2020. The 62 participants were university staff, university students and high schoolers. For more detailed information, see the paper cited below.

Please cite the following paper when using the corpus: K. Leino, J. Leinonen, M. Singh, S. Virpioja and M. Kurimo. "FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics." INTERSPEECH, 2020.

Link: https://github.com/aalto-speech/FinChat

Latest versions/subcorpora:
Finnish conversational chat corpus, source
Metadata and license
Attribution instructions
Download the resource
Search for all versions of this resource in META-SHARE

Different versions/subcorpora of this corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022060901

Suomi 24 resource group

Latest versions and variants:
The Suomi 24 Sentences Corpus 2001-2020, Korp version
Metadata and license
Citation instructions
Open the resource in Korp
(including the years 2001-2017 and the update 2018-2020)
The Suomi 24 Corpus 2001-2020, VRT version
Metadata and license
Citation instructions
Download the resource
(including the years 2001-2017 and the update 2018-2020)
The Suomi 24 Sentences Corpus 2018-2020, Korp-version
Metadata and license
Citation instructions
Open the resource in Korp
The Suomi24 Corpus 2018-2020, VRT version
Metadata and license
Citation instructions
Download the resource
The Suomi24 Sentences Corpus 2001-2017, Korp version 1.2
Metadata and license
Citation instructions for this version
Open the resource in Korp
The Suomi24 Corpus 2001-2017, VRT version 1.1
Metadata and license
Citation instructions for this version
Download the resource
Search for all available versions

The resource consists of the discussions posted on the Suomi 24 discussion forum. The content has been annotated with automatic methods and stored in VRT format.

Via the Korp service, you can run versatile search queries on the content and obtain various statistics and visualizations (see Korp instructions).

Without logging in to Korp, you can see the items matching your search criteria only as brief excerpts. Each word token in the concordance links to the original message and discussion thread on the Suomi 24 discussion platform, if they are still available there. Researchers who need to view the wider context around the matching items can log in.

In addition to the corpus versions that are available in Korp, the corresponding full-text documents are available to logged-in researchers in VRT format, either in the CSC computing environment or as downloadable packages via the download service of Kielipankki. Using the computing environment requires a CSC user account. Please note, however, that using the full-text data efficiently usually requires some technical and programming skills. The Korp service provides many opportunities for studying and analyzing the Suomi 24 corpus, so you should first check whether Korp is suitable for your purpose.
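As a rough illustration of the kind of processing the full-text data calls for, the following Python sketch reads VRT-style input. It assumes the common VRT layout of one token per line with tab-separated attributes, with structural markup on lines that start with `<`. The tag name `sentence`, the three attribute columns, the function name and the sample text are assumptions made for this example and are not taken from the Suomi 24 documentation.

```python
import io


def read_vrt_sentences(lines):
    """Yield each sentence as a list of token rows (lists of attribute strings)."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<"):
            # Structural markup line; a closing sentence tag ends the sentence.
            if line.startswith("</sentence"):
                if sentence:
                    yield sentence
                sentence = []
            continue
        # Token line: attributes are separated by tabs.
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


# Hypothetical sample with assumed columns: word form, lemma, part of speech.
SAMPLE = (
    '<text title="example">\n'
    "<sentence>\n"
    "Hyvää\thyvä\tA\n"
    "päivää\tpäivä\tN\n"
    "</sentence>\n"
    "</text>\n"
)

sentences = list(read_vrt_sentences(io.StringIO(SAMPLE)))
words = [token[0] for token in sentences[0]]
print(words)
```

For a real package, the same generator could be fed an opened file object instead of the in-memory sample; which attribute sits in which column should be checked against the metadata record of the version in question.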

 

Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2022011221

FinnSentiment

FinnSentiment is a Finnish social media corpus for sentiment polarity annotation. It is a data set of 27,000 sentences, each annotated independently with sentiment polarity by three native annotators. The corpus and its creation are documented in https://arxiv.org/pdf/2012.02613.pdf.

Latest versions/subcorpora:
FinnSentiment 1.1, source
Metadata and license
Attribution instructions
Download the resource
FinnSentiment, source
Metadata and license
Attribution instructions
Download the resource
Search for these versions in META-SHARE

Different versions of this corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081106

Corpus of Global Web-Based English

The Corpus of Global Web-Based English (GloWbE) contains about 1.8 billion words in 1.8 million texts from web pages in the United States, Great Britain, Australia, India, and 16 other countries. About 60% of the texts come from blogs.

For the general terms and conditions for this and other corpora from BYU, please see https://www.corpusdata.org/restrictions.asp

More information on the BYU corpora at Kielipankki

Latest versions/subcorpora:
Corpus of Global Web-Based English – Kielipankki Korp version 2017H1
Metadata and license
Attribution instructions
Select the corpus in Korp
Corpus of Global Web-Based English – Kielipankki download version 2017H1
Metadata and license
Attribution instructions
Download the resource
Search for all versions in META-SHARE

Different versions/subcorpora of this corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2017061927

SFNET Corpus

The corpus contains written Finnish-language discussions from the SFNET Internet discussion forum from 2002-2003.

Latest versions/subcorpora:
SFNET Corpus
Metadata and license
Attribution instructions
Download the resource
A copy of this version is available in the computing environment.
SFNET Corpus, Helsinki Korp Version
Metadata and license
Attribution instructions
Resource will be available soon
Search for all versions in META-SHARE

Different versions/subcorpora of this corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021052501

The HS.fi News and Comments Corpus

The HS.fi News and Comments Corpus contains the domestic news articles from the Helsingin Sanomat website and the comments on them from 5.9.2011 to 4.9.2012. The corpus starts with the first news item of 5.9.2011 and ends with a news item published on the morning of 3.9.2012, together with the comments published on the website by 5.9.2012.

Latest versions/subcorpora:
The HS.fi News and Comments Corpus
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for all versions in META-SHARE

Different versions/subcorpora of this corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051910

Ylilauta Corpus

The corpus contains text from discussions of the Ylilauta online discussion board from 2012 to 2014. Short fragments from the discussions, e.g. sentences or paragraphs, are publicly available in Kielipankki – the Language Bank of Finland.

Latest versions/subcorpora:
Ylilauta Corpus
Metadata and license
Attribution instructions
Select the corpus in Korp
The Downloadable Version of the Ylilauta Corpus
Metadata and license
Attribution instructions
Download the resource
A copy of this version is available in the computing environment.
Search for these versions in META-SHARE

Different versions/subcorpora of this corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042602

Suomi 24 resource group

Latest versions:
The Suomi 24 Sentences Corpus 2001-2020, Korp version
Metadata and license
Citation instructions
Open the resource in Korp
(includes both the years 2001-2017 and the update 2018-2020)
The Suomi 24 Corpus 2001-2020, VRT version
Metadata and license
Citation instructions
Download the resource
(includes both the years 2001-2017 and the update 2018-2020)
The Suomi 24 Sentences Corpus 2018-2020, Korp version
Metadata and license
Citation instructions
Open the resource in Korp
The Suomi24 Corpus 2018-2020, VRT version
Metadata and license
Citation instructions
Download the resource
The Suomi24 Sentences Corpus 2001-2017, Korp version 1.2
Metadata and license
Citation instructions for this version
Open the resource in Korp
The Suomi24 Corpus 2001-2017, VRT version 1.1
Metadata and license
Citation instructions for this version
Download the resource
Search for other available versions

The resource consists of discussions collected from the Suomi 24 forum. The content has been parsed with automatic methods and stored in VRT format.

The Suomi 24 corpus offered through Korp supports versatile searches, and the search results can be summarized statistically or visualized in various ways (see the Korp instructions).

For users who are not logged in, Korp shows the search hits found in the text content as short excerpts. Each hit links to the original message and discussion thread on the Suomi 24 server, if these still exist. If needed, a researcher can also see a wider context by logging in to Korp.

In addition to the corpus version shown in Korp, the corresponding full-text data in VRT format is available to logged-in researchers in the CSC computing environment or as a download to their own computer from the Kielipankki download service. Using the computing environment requires a user account granted by CSC. Note that managing and efficiently processing the full-text data usually requires some technical expertise and programming skills. Korp also offers many possibilities for studying the Suomi 24 data, so it is worth first checking whether it suits your purpose.

PID of this page: http://urn.fi/urn:nbn:fi:lb-2017021630
