The TV Corpus – Kielipankki version

This resource contains a copy of the original TV Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 325 million words of data from 75,000 TV episodes from 1950 to 2018. The TV scripts come from several English-speaking countries (US, UK and 4 other dialects), which makes it possible to compare very informal language across these countries. The corpus is related to many other corpora of English, formerly known as the "BYU Corpora".

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The TV Corpus – Kielipankki version, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112415

Corpus of American Soap Operas – Kielipankki version

This resource contains a copy of the original Corpus of American Soap Operas (SOAP), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 100 million words of data from 22,000 transcripts of American soap operas from 2001 to 2012, which makes it a useful resource for studying very informal language. The corpus is related to many other corpora of English, formerly known as the "BYU Corpora".

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
Corpus of American Soap Operas – Kielipankki version, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112410

News on the Web – Kielipankki version

This resource contains a copy of the original News on the Web corpus (NOW), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains data from web-based newspapers and magazines in 20 different English-speaking countries from January 2010 to 31 May 2021. The corpus is related to many other corpora of English, formerly known as the "BYU Corpora".

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
News on the Web – Kielipankki version 2021-05, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112405

The Intelligent Web Corpus – Kielipankki version

This resource contains a copy of the original Intelligent Web Corpus (iWeb), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 14 billion words in 22 million web pages. The data was collected in 2017 from around 100,000 of the world's most widely used English-language websites.

The corpus is related to many other corpora of English, formerly known as the "BYU Corpora".

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Intelligent Web Corpus – Kielipankki version, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112310

Tekstiks – Speech recognition: speech to text

Automated speech transcription service and a user interface for transcription editing.

Tekstiks.ee is a public speech recognition service for Estonian and Finnish. The editor and frontend for use with speech recognisers have been developed by TalTech’s Laboratory of Language Technology. TalTech’s own Estonian ASR has been integrated into it, as has Kielipankki’s ASR service, which uses speech recognition models developed at Aalto University.

The system is fully automated and can process multiple files in parallel. The average processing time is about half of the recording’s length.

If you use this system for research, please cite the following article in your publications: Olev, Aivo; Alumäe, Tanel. "Estonian Speech Recognition and Transcription Editing Service". Baltic J. Modern Computing, Vol. 10 (2022), No. 3, pp. 409–421. https://doi.org/10.22364/bjmc.2022.10.3.14

Latest version:  

tekstiks
Metadata and license
Attribution instructions

Access the service
Look for all versions of this tool in META-SHARE  

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112802

The Movie Corpus – Kielipankki version

This resource contains a copy of the original Movie Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 200 million words from about 25,000 movies from 1930 to 2018. The movie scripts come from several English-speaking countries and include English from the US, the UK and 4 other dialects. The corpus is related to many other corpora of English, formerly known as the "BYU Corpora".

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Movie Corpus – Kielipankki version, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112305

The Coronavirus Corpus – Kielipankki version

This resource contains a copy of the original Coronavirus Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains data on the medical, social, cultural, and economic impact of the coronavirus (COVID-19) from online magazines and newspapers in 20 different English-speaking countries from 1 January 2020 to 31 May 2021. The corpus is related to many other corpora of English, formerly known as the "BYU Corpora".

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Coronavirus Corpus – Kielipankki version 2021-05, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions or subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022111705

Donate Speech datasets (puhelahjat) for commercial use

Are you a researcher? The Donate Speech datasets for research use can be found on a separate page.

Note: The content descriptions and size information of the data packages are based on a preliminary estimate and may be revised if necessary.

The following packages of this resource are offered for commercial use:
Donate Speech Corpus: Sample
Metadata
A free sample containing 40 randomly selected audio files, their transcripts as raw text and as alignment files, and the available background information on the recordings and speakers. The total duration of the recordings is approximately 35 minutes.
Price: Free sample

Obtain access rights

(The download link will appear here)

Donate Speech: Selected data
Metadata
This collection contains five different subsets that were selected at Aalto University specifically for the development, training and testing phases of automatic speech recognition. The total duration of the recordings is approximately 131 hours.

Price: 1000 €

Obtain access rights

(The download link will appear here)

Donate Speech: Annotated data
Metadata
This collection contains the transcribed recordings belonging to version 1 of the complete dataset, the transcripts as raw text and as alignment files, and the background information on the recordings and speakers. The total duration of the recordings is approximately 1600 hours.
Price: 5000 €

Obtain access rights

(The download link will appear here)

Donate Speech: Complete dataset (version 1)
Metadata
This collection includes all transcribed and untranscribed recordings belonging to version 1 of the dataset, the transcripts as raw text and as alignment files, and the background information on the recordings and speakers. The total duration of the recordings is approximately 3200 hours.
Attribution instructions for this version
Price: 10,000 €

Obtain access rights

Download the resource

Contents of the resource

The Donate Speech Corpus, also known as Puhelahjat, was compiled in a campaign implemented by Vake, Yle and the University of Helsinki, launched on 16 June 2020, in which anyone who speaks at least some Finnish could donate their own speech via an easy-to-use browser or mobile application. The dataset is unique in that it has been collected from the very beginning as transparently as possible for restricted use by both researchers and companies, with the aim of safeguarding the data protection of the speech donors throughout the entire life cycle of the dataset.

Various packages of the dataset are being made available in the Language Bank's download service, where researchers and companies that have been granted access can use them. Further information is available from lahjoita-puhetta@helsinki.fi.

How to obtain access to the resource? Instructions for companies

Note: These instructions are still being updated!

In accordance with the terms of use of the Puhelahjat dataset, access rights can also be granted to companies. A written agreement on the use of the Puhelahjat dataset is made with each company, after which access to the dataset can be granted to a representative authorized by the company.

  1. Companies interested in using the dataset can contact lahjoita-puhetta@helsinki.fi.
  2. The general terms of the license agreements for companies can be viewed here.
  3. Before purchasing a paid dataset, a company can get access to a small sample dataset ("Donate Speech Corpus: Sample") free of charge. The same terms of use apply to processing the sample dataset as to the paid versions of the dataset.
  4. Once the license agreement has been made, the representative authorized by the company can apply for access to either the sample or the actual dataset in the Language Bank Rights service (LBR).
    The service requires the applicant to authenticate electronically, either with an identity provided via eDuuni or with a user account issued by an academic organization that belongs to one of the trust federations. If necessary, the applicant can create an eDuuni identity and use it to log in to the service. Verifying the identity requires an email address that is in the applicant's own use.
    Note: Creating an eDuuni identity is free of charge! The company therefore does not need to purchase any other services offered via eDuuni.
  5. In the access application, the company must state the public title of its project and provide a link to the public privacy notice concerning the processing of the personal data contained in the dataset. This information is published on the Language Bank's website.
  6. The license fee specified in the agreement must be paid before access to a paid dataset can be granted. Payment instructions are available from lahjoita-puhetta@helsinki.fi.
  7. Once the access application has been approved, the person who submitted it gets access to the dataset with the user account that was used to make the application.

Last updated: 16.11.2022

Persistent identifier of this page: urn:nbn:fi:lb-2022111628

Donate Speech datasets (puhelahjat) for research use

Donate Speech datasets for commercial use: see further details on a separate page.

Versions of this resource:
Donate Speech Corpus, version 1.0
Metadata
License (for researchers)

Attribution instructions for this version
Apply for access rights (researchers only)

+PRIV: This resource contains personal data.
Submit a public notice about personal data processing

Download the resource
Donate Speech Corpus: Sample
Metadata
License (for researchers)
Attribution instructions for this version

(The download link will appear here)
Donate Speech Corpus: Training data (100h)
Metadata
License (for researchers)
Attribution instructions for this version

(The download link will appear here)
Donate Speech Corpus: Test data (10h)
Metadata
License (for researchers)
Attribution instructions for this version

(The download link will appear here)
Donate Speech Corpus: Development data (10h)
Metadata
License (for researchers)
Attribution instructions for this version

(The download link will appear here)
Donate Speech Corpus: Multi-transcriber test data (1h)
Metadata
License (for researchers)
Attribution instructions for this version

(The download link will appear here)
Donate Speech Corpus: Test data from multi-transcriber speakers (10h)
Metadata
License (for researchers)
Attribution instructions for this version

(The download link will appear here)
Look for other available versions

Contents of the resource

The Donate Speech Corpus, short name Puhelahjat, was compiled in the Donate Speech campaign implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki, launched on 16 June 2020, in which anyone who speaks Finnish could choose to donate their own speech in order to promote language research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.

The first version of the audio material, with a total duration of approximately 3200 hours, was built from the speech samples donated by spring 2021. In 2021, approximately 1600 hours of these recordings were transcribed by hand, and the resulting text transcriptions were aligned with the corresponding recordings using automatic methods.

Version 1.0, the first full version of the dataset, is available in the Language Bank's download service, where researchers, and later also companies, that have been granted access can use it. Subsets of the same dataset, selected for example for the development of automatic speech recognition, are also offered as separate packages; the contents and citation practices of each can be found in the metadata record of the version in question.

The Donate Speech collection is also intended to be updated and expanded later, once enough new donations have accumulated. New versions are also created as researchers or companies continue to transcribe and otherwise annotate the existing recordings.

How to obtain access to the resource?

Use of the Puhelahjat dataset is subject to permission. The research use of all subsets in the Puhelahjat group is covered by the same license, which also includes resource-specific data protection conditions.

Research use

  1. Researchers can apply for the right to use the data via the usual application procedure in the Language Bank Rights service (see instructions).
  2. Already when applying, the researcher should take into account the resource-specific terms of use, including the data protection conditions, since the research must be feasible within their limits also as regards the processing of personal data; see license (for researchers).
  3. Before starting to process the data, the researcher must use the form to submit, for publication by the Language Bank, a plain-language title of the project and a link to the public privacy notice concerning the processing of personal data.
  4. With a single application, an approved researcher gets access to the entire Donate Speech Corpus and to its different versions and subsets.

The instructions for commercial use can be found on a separate page.

 


Last updated: 16.11.2022

Persistent identifier of this page: urn:nbn:fi:lb-2022102122

Donate Speech datasets (puhelahjat) for research use


Donate Speech datasets for commercial use: further details will be available soon.

Versions of this resource:
Donate Speech Corpus, version 1.0
Metadata
License (for researchers)
Attribution instructions
Apply for access rights, academic research use only

+PRIV: This resource contains personal data.
Submit public information about personal data processing

Download the resource
Donate Speech Corpus: Sample
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Training data (100h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Test data (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Development data (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Multi-transcriber test data (1h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Test data from multi-transcriber speakers (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Look for other versions of this resource

 

Contents of the resource

The Donate Speech Corpus, abbreviated Puhelahjat, was compiled in the Donate Speech campaign implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki, launched on June 16, 2020. During the project, anyone who speaks some Finnish had the opportunity to donate their own speech in order to promote language research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.

The first version of the audio material includes the speech samples that were donated by spring 2021. The total duration of the recordings in this version is approximately 3,200 hours. In 2021, approximately 1,600 hours of the recordings were transcribed by hand, and the resulting transcriptions were aligned with the corresponding audio recordings using automatic methods.

Version 1.0 of the dataset is available in the download service for researchers who have been granted access. Some subsets of the complete dataset (selected, for instance, for the development of automatic speech recognition) will also be made available as separate download packages. The description and the citation practices of each subset can be found in the corresponding metadata records.

The Donate Speech datasets can be updated later, for instance after a sufficient number of new donations has accumulated. New versions can also be created as researchers or companies continue to transcribe and annotate the existing recordings more extensively.

How to obtain access to use the material?

The research use of the Donate Speech Corpus and any of its subsets is subject to the license of the resource. Note that the license also includes resource-specific data protection conditions.

Research use

  1. Researchers can apply for the right to use the data via the usual application procedure in the Language Bank Rights system (see instructions).
  2. When applying for access, the researcher must take into account the license requirements, including the resource-specific data protection terms and conditions regarding the processing of personal data, see license (for researchers).
  3. Before starting to process the data, the researcher must submit the title of the project and the link to the public Privacy Notice regarding the processing of personal data in their project (see the e-form).
  4. When the application is approved, the researcher can access the entire Donate Speech Corpus as well as all versions and subsets of the resource.

The instructions for commercial use can be found on a separate page.

 


Last updated: 27.10.2022

 

Persistent identifier of this page: urn:nbn:fi:lb-2022102121
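
The identifier above is given as a bare URN, while other resource pages in this listing give the same kind of identifier as a resolvable link of the form http://urn.fi/urn:nbn:fi:lb-…. A minimal sketch of that conversion in Python follows. It assumes that a bare URN resolves through the same http://urn.fi/ prefix and that identifiers consist of `urn:nbn:fi:lb-` followed by digits, as all examples on these pages do; the function name and the validation pattern are illustrative only.

```python
import re

# Assumed shape of a Language Bank URN, based only on the examples on these pages.
URN_PATTERN = re.compile(r"^urn:nbn:fi:lb-\d+$")


def urn_to_url(urn):
    """Turn a bare URN into a link using the http://urn.fi/ resolver prefix."""
    urn = urn.strip()
    if not URN_PATTERN.match(urn):
        raise ValueError(f"unexpected identifier format: {urn!r}")
    return "http://urn.fi/" + urn


print(urn_to_url("urn:nbn:fi:lb-2022102121"))
```

For the identifier of this page, the sketch prints http://urn.fi/urn:nbn:fi:lb-2022102121.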

Installing and using dockerized tools (finnish-postag, finnish-nertag, …)

Some tools are available as Docker images. They can be used without installing any other dependencies (except for Docker). At this time the images are replacements for the command-line versions of these tools, meaning that they’re used via stdin and stdout, but they can also be run in an application server as a web service.

For now, the available tools are finnish-nertag, finnish-postag and finnish-tokenize.

Installation

The images are available on the Language Bank’s Docker Hub account, and may be installed as follows:

sudo docker pull kielipankki/finnish-nertag:latest

(Or finnish-postag, etc.)

Usage

The resulting containers communicate via stdin and stdout, so you could test them like this:

$ sudo docker run --rm -i kielipankki/finnish-nertag <<< 'Pekingin olympialaiset 2008'
Pekingin <EnamexEvtXxx>
olympialaiset
2008 </EnamexEvtXxx>

They understand the same command-line options as the underlying tools:

$ sudo docker run --rm -i kielipankki/finnish-nertag --bio <<< 'Pekingin olympialaiset 2008'
Pekingin B-MISC
olympialaiset I-MISC
2008 I-MISC

$ sudo docker run --rm -i kielipankki/finnish-nertag --show-analyses <<< 'Pekingin olympialaiset 2008'
Pekingin peking [POS=NOUN][PROPER=PROPER][NUM=SG][CASE=GEN] [PROP=GEO] <EnamexEvtXxx>
olympialaiset olympialaiset [POS=NOUN][NUM=PL][CASE=NOM] _
2008 2008 [POS=NUMERAL][SUBCAT=CARD] _ </EnamexEvtXxx>
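
Because the `--bio` output shown above has one token and one tag per line, it is easy to post-process. The following Python sketch groups consecutive B-/I- tagged tokens into entity spans. It assumes only the whitespace-separated "token tag" layout of the example output; lines in any other shape are treated as lying outside an entity. The helper names `run_nertag` and `bio_to_spans` are made up for illustration, and `run_nertag` merely wraps the `docker run` command from this page (prefix it with `sudo` if your setup requires that).

```python
import subprocess


def run_nertag(text):
    """Run the finnish-nertag container with --bio and return its stdout.

    Wraps the docker command shown on this page; needs Docker and the image.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", "kielipankki/finnish-nertag", "--bio"],
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout


def bio_to_spans(output):
    """Group 'token TAG' lines into (entity text, label) pairs."""
    spans = []
    tokens, label = [], None
    for line in output.splitlines():
        parts = line.split()
        # Anything that is not exactly "token tag" counts as outside an entity.
        tag = parts[1] if len(parts) == 2 else "O"
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and label != tag[2:]
        )
        if starts_new:
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [parts[0]], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(parts[0])
        else:
            if tokens:
                spans.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


# The --bio example output from this page, used here instead of a live call.
SAMPLE = "Pekingin B-MISC\nolympialaiset I-MISC\n2008 I-MISC\n"
print(bio_to_spans(SAMPLE))
```

On the sample above this prints `[('Pekingin olympialaiset 2008', 'MISC')]`. With Docker available, `bio_to_spans(run_nertag('Pekingin olympialaiset 2008'))` would feed live output through the same grouping.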

TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies

The TallVocabL2Fi dataset comprises responses from 15 participants to a "tall" 12,000-word 5-point scale self-rating response task and a 100-word confirmatory word translation response task. The 15 participants were split by native language, 5 English, 4 Hungarian and 6 Russian, and self-reported CEFR reading level, 5 B1, 4 B2, 5 C1 and 2 C2. The data was gathered through a website from paid participants resident in Finland over a period of 3 months from September to November 2021. In total there are 180 thousand word knowledge self-rating responses and 1.5 thousand word translation responses.

The dataset is unique in its combination of the tall data collection set up, where responses are collected for many words, the varied backgrounds of the learners, the use of Finnish prompt words, and the triangulation with a word translation test. The dataset can be used for vocabulary acquisition research in general, but it is particularly suited to evaluation of the task of Vocabulary Inventory Prediction (VIP) including techniques based on Computer-Adaptive Testing (CAT). The dataset is relational/tabular. It is distributed as a series of TSV files along with a SQL schema exported from DuckDB.
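
Since the data is distributed as TSV files, it can be inspected with standard tooling before the full schema is loaded into a database. The Python sketch below is only an illustration: the column names (`participant_id`, `word`, `rating`), the rating values and the sample rows are hypothetical stand-ins, not the actual schema, which is described in the readme shipped with the data. It computes, for each participant, the share of words self-rated at or above a chosen threshold.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real file names and columns are in the dataset readme.
SAMPLE_TSV = (
    "participant_id\tword\trating\n"
    "1\tkissa\t5\n"
    "1\tjuhannus\t3\n"
    "2\tkissa\t4\n"
    "2\tjuhannus\t1\n"
)


def known_share(tsv_file, threshold=4):
    """Return {participant: share of words rated >= threshold}."""
    totals = defaultdict(int)
    known = defaultdict(int)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        pid = row["participant_id"]
        totals[pid] += 1
        if int(row["rating"]) >= threshold:
            known[pid] += 1
    return {pid: known[pid] / totals[pid] for pid in totals}


print(known_share(io.StringIO(SAMPLE_TSV)))
```

With the sample rows this prints `{'1': 0.5, '2': 0.5}`. To use a real file, pass an open file handle instead of the `io.StringIO` object and adjust the column names to the published schema.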

Further information about the schema and the collection process is available in the readme included with the data, and in the accompanying publication: Robertson, F., Chang, L., & Söyrinki, S. (2022). TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners’ Vocabulary. In Language Resources and Evaluation Conference (LREC 2022).

Latest versions/subcorpora:  
TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies
Metadata and license
Attribution instructions
Download the resource
Search for all versions of this resource in META-SHARE  

  This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051702

Finnish conversational chat corpus

The corpus contains 85 Finnish chat dialogs collected during 2019-2020. The 62 participants were university staff, university students and high school students. For more detailed information, see the article listed below.

Please cite the following paper when using the corpus: K. Leino, J. Leinonen, M. Singh, S. Virpioja and M. Kurimo. ”FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics.” INTERSPEECH. 2020.

Link: https://github.com/aalto-speech/FinChat

Latest versions/subcorpora:  
Finnish conversational chat corpus, source
Metadata and license
Attribution instructions
Download the resource
Search for all versions of this resource in META-SHARE  

Different versions or subcorpora of this corpus are, or will be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022060901

Elias Lönnrot Letters Online

The corpus consists of the correspondence of Elias Lönnrot with private individuals as well as institutions from 1823 until Lönnrot’s death. Elias Lönnrot (1802–1884) was the creator of the Kalevala, a medical doctor and a professor of language. The letters and drafts of letters belong to the Archive of the Finnish Literature Society and have been transliterated for the project Elias Lönnrot’s Letters Online, http://lonnrot.finlit.fi/omeka/.

 

Latest versions/subcorpora:  
Elias Lönnrot Letters Online, source
Metadata and license
Attribution instructions
Download the resource
The Finnish sub-corpus of Elias Lönnrot Letters Online – Kielipankki version
Metadata and license
Attribution instructions
The resource will be available soon
The Swedish sub-corpus of Elias Lönnrot Letters Online – Kielipankki version
Metadata and license
Attribution instructions
The resource will be available soon
Search for all versions of this resource in META-SHARE  

Different versions or subcorpora of this corpus are, or will be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051701

Erzya and Moksha Extended Corpora (ERME)

ERME contains predominantly Erzya and Moksha literature. It consists of several media publications from the 19th to the 20th century. ERME was mapped in Saransk in 1997-2004, while in Helsinki it has been mapped since 2004. The most basic format used is XML, with a granularity extending to chapter level. The goal is to create corpora with a granularity extending to word level. At sentence level contextual translation is used (English or Finnish translation), while at word level there is morphological encoding, corresponding to each context. Preliminary morphological analysis is carried out using HFST-based transducers, which have been developed in the Giellatekno infrastructure of the University of Tromsø.

The grammatical analysis and labeling comply with the practices developed in the Giellatekno infrastructure of the University of Tromsø. These practices are applied in the documentation of several Uralic languages.

Amount of processed material: more than a million words. The amount of the processed material is to be increased subsequently.

Latest versions/subcorpora:  
Erzya and Moksha Extended Corpora (ERME), Korp Version
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for all versions of this resource in META-SHARE  

Different versions or subcorpora of this corpus are, or will be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022052001

Giellatekno, the Research group for Saami language technology

Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. It focusses on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. The group also extends its activities to other under-resourced languages, particularly Circumpolar and Uralic languages. Analyses and tools are designed to make it easier for other minority language societies to develop the language technology constituting a prerequisite for a language to survive in modern society.

Open the website

Dictionaries of Giellatekno

Find a selection of Giellatekno’s dictionaries gathered under Dictionaries of Neahttadigisánit

 

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022022301

Martti Rapola’s 19th century vocabulary

Martti Rapola (1891–1972), a distinguished researcher of Old Literary Finnish and Finnish dialects, compiled extensive material on 19th-century Literary Finnish, which he organized according to its significance. From these excerpts, collected in the 1930s and 1950s, Rapola’s 19th-century vocabulary was created, comprising a total of 44,000 headwords. Rapola made use of this material in many articles published in the 1940s and 1950s and in a selection published in 1960, named ’Sanojemme ensiesiintymiä Agricolasta Yrjö-Koskiseen’, which, as the name implies, contains a vocabulary established in Literary Finnish.

The material published online is based on the original headwords, a selection of which has been entered into a database. It contains information about a total of 5,600 words, divided into 1,070 concepts. This is about a quarter of the original data.

Latest versions/subcorpora:  
Martti Rapola’s 19th century vocabulary, Sanat version
Metadata and license
Attribution instructions
Open the resource in Sanat
Search for all versions in META-SHARE  

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022021805

Place Names in Slang

This resource contains the results of a competition to collect place names used in colloquial language. The competition was held from 18 August to 3 November 2003 in schools in Espoo, Helsinki, Kauniainen and Vantaa. It was organized by Stadin slangi ry, the Institute for the Languages of Finland and Helsingin Sanomat.

The whole collection from the competition – about 14,500 names – is organized by name as well as by school. In addition to the names, other information given by the pupils has been published: the official name of the place, its location, example sentences, and additional information such as the origin of the name and its use.

Latest versions/subcorpora:  
Place Names in Slang
Metadata and license
Attribution instructions
Open the website
Place Names in Slang, Sanat version
Metadata and license
Attribution instructions
Open the resource in Sanat
Search for all versions in META-SHARE  

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022021804

Digital collections of Kotus

The website offers a collection of links to all digitally and publicly available language resources of the Institute for the Languages of Finland.

Open the website

 

Examples of language resources available in the service:

Dictionary of Finnish dialects

Dictionary of Old Literary Finnish

Etymological Database of the Sami Languages

Etymological Reference Database

Frequencies of Early Modern Finnish Words

Frequencies of Old Literary Finnish Words

Frequency list of Written Finnish Word Forms

Headword List of the Karelian Dictionary

Modern Finnish Word List

Names of Countries in Seven Languages

 

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022020901

 

HeLI-OTS

HeLI-OTS (off-the-shelf) is a language identifier with language models for 200 languages. The program reads <infile>, classifies the language of each line as one of the 200 languages it knows, and writes the results, one ISO 639-3 code per line, to <outfile>. Using one core of a 2021 laptop and around 3 gigabytes of memory, it can identify about 3,000 sentences per second.
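Because the output has one ISO 639-3 code per input line, the two files can be read side by side. The following Python sketch is not part of HeLI-OTS; it only illustrates how such output might be post-processed. The function names and the sample lines and codes are hypothetical and invented for illustration (they are not real HeLI-OTS output). The time estimate simply reuses the figure of about 3,000 sentences per second quoted above.

```python
from collections import Counter
from itertools import zip_longest


def pair_lines_with_codes(text_lines, code_lines):
    """Pair each input line with the language code on the same line number.

    Assumes the identifier wrote exactly one code per input line.
    """
    pairs = []
    for text, code in zip_longest(text_lines, code_lines):
        if text is None or code is None:
            raise ValueError("input and output have different numbers of lines")
        pairs.append((text.rstrip("\n"), code.strip()))
    return pairs


def count_languages(pairs):
    """Count how many lines were assigned to each ISO 639-3 code."""
    return Counter(code for _, code in pairs)


def estimated_seconds(n_lines, lines_per_second=3000):
    """Rough processing time, assuming the quoted throughput holds."""
    return n_lines / lines_per_second


# Hypothetical sample data standing in for the contents of <infile> and <outfile>.
sample_text = ["Hyvää huomenta", "Good morning", "God morgon", "Hello again"]
sample_codes = ["fin", "eng", "swe", "eng"]

pairs = pair_lines_with_codes(sample_text, sample_codes)
counts = count_languages(pairs)
for code, n in counts.most_common():
    print(code, n)
```

With real files, the two lists would be replaced by the lines read from <infile> and <outfile>; the length check guards against the files being out of step.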

Producing and publishing this software has been partly supported by the Finnish Research Impact Foundation’s Tandem Industry Academia funding, in cooperation with Lingsoft.

Read more about HeLI-OTS, Off-the-shelf Language Identifier for Text

Latest versions/subcorpora:  
HeLI-OTS 1.4
Metadata and license
Attribution instructions
Open the website
HeLI-OTS 1.3
Metadata and license
Attribution instructions
Open the website
Look for all versions in META-SHARE  

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022011801