Donate Speech datasets (puhelahjat) for research use

For research use, the following versions of this resource are forthcoming:
Donate Speech Corpus
Metadata
License (for researchers)
Attribution instructions
Apply for access rights, academic research use only (the application form will be opened soon)
NB: One application will give access to the complete dataset.

+PRIV: This resource contains personal data.
Submit public information about personal data processing

(The download link will appear here)
Donate Speech Corpus: Sample
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Training data (100h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Test data (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Development data (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Multi-transcriber test data (1h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Test data from multi-transcriber speakers (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Look for other versions of this resource

 


Last updated: 15.6.2022

TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies

The TallVocabL2Fi dataset comprises responses from 15 participants to a "tall" 12,000-word self-rating task on a 5-point scale and a 100-word confirmatory word translation task. The 15 participants were split by native language (5 English, 4 Hungarian and 6 Russian) and by self-reported CEFR reading level (5 B1, 4 B2, 5 C1 and 2 C2). The data was gathered through a website from paid participants resident in Finland over a period of 3 months, from September to November 2021. In total there are 180 thousand word knowledge self-rating responses and 1.5 thousand word translation responses.

The dataset is unique in combining a tall data collection setup, in which responses are collected for many words, with learners of varied backgrounds, Finnish prompt words, and triangulation with a word translation test. It can be used for vocabulary acquisition research in general, but it is particularly suited to evaluating Vocabulary Inventory Prediction (VIP), including techniques based on Computer-Adaptive Testing (CAT). The dataset is relational/tabular. It is distributed as a series of TSV files along with a SQL schema exported from DuckDB.
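Because the data is distributed as TSV files, it can be read with standard tooling. The following is a minimal sketch only: the column names (`participant`, `word`, `rating`) and the sample rows are hypothetical stand-ins, and the real table and column names should be taken from the readme and the SQL schema shipped with the data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real files and column names are
# documented in the readme and SQL schema distributed with the dataset.
SAMPLE_TSV = (
    "participant\tword\trating\n"
    "1\ttalo\t5\n"
    "1\tkissa\t4\n"
    "2\ttalo\t3\n"
)


def mean_rating_per_participant(tsv_text):
    """Return {participant: mean self-rating} from tab-separated text."""
    ratings = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        ratings[row["participant"]].append(int(row["rating"]))
    return {p: sum(values) / len(values) for p, values in ratings.items()}


means = mean_rating_per_participant(SAMPLE_TSV)
print(means)
```

For the full data, the same function could be applied to the contents of one of the distributed TSV files once the actual column names have been substituted.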

Further information about the schema and the collection process is available in the readme included with the data, and in the accompanying publication: Robertson, F., Chang, L., & Söyrinki, S. (2022). TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners’ Vocabulary. In Language Resources and Evaluation Conference (LREC 2022).

Latest versions/subcorpora:  
TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies
Metadata and license
Attribution instructions
Download the resource
Search for all versions of this resource in META-SHARE  

  This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051702

Finnish conversational chat corpus

The corpus contains 86 Finnish chat dialogs collected during 2019-2020. The 62 participants were university staff, university students and high schoolers. For more detailed information, see the article cited below.

Please cite the following paper when using the corpus: K. Leino, J. Leinonen, M. Singh, S. Virpioja and M. Kurimo. "FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics." INTERSPEECH. 2020.

Link: https://github.com/aalto-speech/FinChat

Latest versions/subcorpora:  
Finnish conversational chat corpus, source
Metadata and license
Attribution instructions
The resource will be available soon
Search for all versions of this resource in META-SHARE  

Different versions/subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the individual versions are in the list above.

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022060901

Elias Lönnrot Letters Online

The corpus consists of the correspondence of Elias Lönnrot with private individuals as well as institutions from 1823 until Lönnrot’s death. Elias Lönnrot (1802–1884) was the creator of the Kalevala, a medical doctor and a professor of language. The letters and drafts of letters belong to the Archive of the Finnish Literature Society and have been transliterated for the project Elias Lönnrot’s Letters Online, http://lonnrot.finlit.fi/omeka/.

 

Latest versions/subcorpora:  
Elias Lönnrot Letters Online, source
Metadata and license
Attribution instructions
The resource will be available soon
The Finnish sub-corpus of Elias Lönnrot Letters Online – Kielipankki version
Metadata and license
Attribution instructions
The resource will be available soon
The Swedish sub-corpus of Elias Lönnrot Letters Online – Kielipankki version
Metadata and license
Attribution instructions
The resource will be available soon
Search for all versions of this resource in META-SHARE  

Different versions/subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the individual versions are in the list above.

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051701

Erzya and Moksha Extended Corpora (ERME)

ERME contains predominantly Erzya and Moksha literature. It consists of several media publications from the 19th to the 20th century. ERME was mapped in Saransk in 1997-2004, while in Helsinki it has been mapped since 2004. The most basic format used is XML, with a granularity extending to chapter level. The goal is to create corpora with a granularity extending to word level. At sentence level contextual translation is used (English or Finnish translation), while at word level there is morphological encoding, corresponding to each context. Preliminary morphological analysis is carried out using HFST-based transducers, which have been developed in the Giellatekno infrastructure of the University of Tromsø.

The grammatical analysis and labeling comply with the practices developed in the Giellatekno infrastructure of the University of Tromsø. These practices are applied in the documentation of several Uralic languages.

Amount of processed material: more than a million words. More material is to be processed later.

Latest versions/subcorpora:  
Erzya and Moksha Extended Corpora (ERME), Korp Version
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for all versions of this resource in META-SHARE  

Different versions/subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the individual versions are in the list above.

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022052001

Giellatekno, the Research group for Saami language technology

Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. It focusses on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. The group also extends its activities to other under-resourced languages, particularly Circumpolar and Uralic languages. Analyses and tools are designed to make it easier for other minority language societies to develop the language technology constituting a prerequisite for a language to survive in modern society.

Open the website

Dictionaries of Giellatekno

Find a selection of Giellatekno’s dictionaries gathered under Dictionaries of Neahttadigisánit

 

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022022301

Martti Rapola’s 19th century vocabulary

Martti Rapola (1891–1972), a distinguished researcher of Old Literary Finnish and Finnish dialects, compiled extensive material on 19th-century Literary Finnish, which he organized according to its significance. From these excerpts, collected in the 1930s and 1950s, Rapola’s 19th-century vocabulary was created, comprising a total of 44,000 headwords. Rapola made use of this material in many articles published in the 1940s and 1950s and in a selection published in 1960, named ’Sanojemme ensiesiintymiä Agricolasta Yrjö-Koskiseen’, which, as the name implies, contains vocabulary established in Literary Finnish.

The material published online is based on the original headwords, a selection of which has been entered into a database. It contains information about a total of 5600 words, divided into 1070 concepts. This is about a quarter of the original data.

Latest versions/subcorpora:  
Martti Rapola’s 19th century vocabulary, Sanat version
Metadata and license
Attribution instructions
Open the resource in Sanat
Search for all versions in META-SHARE  

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022021805

Place Names in Slang

This resource contains the results of a competition for collecting colloquial place names. The competition was held from 18 August to 3 November 2003 in schools in Espoo, Helsinki, Kauniainen and Vantaa. It was organized by Stadin slangi ry, the Institute for the Languages of Finland and Helsingin Sanomat.

The whole collection from the competition, about 14,500 names, is organized both by name and by school. In addition to the names, other information given by the pupils was published: the official name of the place, its location, example sentences, and additional details such as the origin of the name and its use.

Latest versions/subcorpora:  
Place Names in Slang
Metadata and license
Attribution instructions
Open the website
Place Names in Slang, Sanat version
Metadata and license
Attribution instructions
Open the resource in Sanat
Search for all versions in META-SHARE  

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022021804

Digital collections of Kotus

The website offers a collection of links to all digital, publicly available language resources of the Institute for the Languages of Finland.

Open the website

 

Examples of language resources available in the service:

Dictionary of Finnish dialects

Dictionary of Old Literary Finnish

Etymological Database of the Sami Languages

Etymological Reference Database

Frequencies of Early Modern Finnish Words

Frequencies of Old Literary Finnish Words

Frequency list of Written Finnish Word Forms

Headword List of the Karelian Dictionary

Modern Finnish Word List

Names of Countries in Seven Languages

 

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022020901

 

HeLI-OTS

HeLI-OTS (off-the-shelf) is a language identifier with language models for 200 languages. The program reads <infile>, classifies the language of each line as one of the 200 languages it knows, and writes the results, one ISO 639-3 code per line, to <outfile>. It can identify c. 3000 sentences per second using one core of a 2021 laptop and around 3 gigabytes of memory.
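Since the output has one ISO 639-3 code per input line, the two files can be joined line by line. The sketch below illustrates only that post-processing step; it does not run HeLI-OTS, the function name is hypothetical, and the sample codes are illustrative values rather than actual program output.

```python
from collections import Counter


def pair_lines_with_codes(input_lines, code_lines):
    """Pair each input line with the language code on the same output line."""
    if len(input_lines) != len(code_lines):
        raise ValueError("input and output must have the same number of lines")
    return [(text.rstrip("\n"), code.strip())
            for text, code in zip(input_lines, code_lines)]


# Illustrative data standing in for the contents of <infile> and <outfile>.
sentences = ["Hyvää huomenta\n", "Good morning\n", "God morgon\n"]
codes = ["fin\n", "eng\n", "swe\n"]  # assumed example codes, not real output

pairs = pair_lines_with_codes(sentences, codes)
counts = Counter(code for _, code in pairs)
print(pairs)
print(counts)
```

With real files, the two lists would come from reading <infile> and <outfile> with `readlines()`.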

Producing and publishing this software has been partly supported by the Finnish Research Impact Foundation's Tandem Industry Academia funding, in cooperation with Lingsoft.

Latest versions/subcorpora:  
HeLI-OTS 1.3
Metadata and license
Attribution instructions
Open the website
HeLI-OTS 1.2
Metadata and license
Attribution instructions
Open the website
Look for all versions in META-SHARE  

Detailed information on the content, user rights and license of each version can be found in its metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022011801

Suomi 24 resource group

Latest versions and variants:  
The Suomi 24 Sentences Corpus 2001-2020, Korp version
Metadata and license
Citation instructions
Open the resource in Korp
(including the years 2001-2017 and the update 2018-2020)
The Suomi 24 Corpus 2001-2020, VRT version
Metadata and license
Citation instructions
Download the resource
(including the years 2001-2017 and the update 2018-2020)
The Suomi 24 Sentences Corpus 2018-2020, Korp-version
Metadata and license
Citation instructions
Open the resource in Korp
The Suomi24 Corpus 2018-2020, VRT version
Metadata and license
Citation instructions
Download the resource
The Suomi24 Sentences Corpus 2001-2017, Korp version 1.2
Metadata and license
Citation instructions for this version
Open the resource in Korp
The Suomi24 Corpus 2001-2017, VRT version 1.1
Metadata and license
Citation instructions for this version
Download the resource
Search for all available versions  

The resource consists of the discussions posted on the Suomi 24 discussion forum. The content has been annotated with automatic methods and stored in VRT format.

Via the Korp service, it is possible to perform versatile search queries from the content and to obtain various statistics and visualizations (see Korp instructions).

Without logging in to Korp, you can see the items matching your search criteria as brief excerpts only. Each word token in the concordance carries a link to the original message and discussion thread on the Suomi 24 discussion platform, if they are still available there. Researchers who need to view the wider context around the matching items can log in.

In addition to the corpus versions that are available in Korp, the corresponding full text documents are available for logged-in researchers in VRT format either on the CSC computing environment or as downloadable packages via the download service of Kielipankki. In order to use the computing environment, researchers need a CSC user account. Please note, however, that in order to use the full text data efficiently, some technical and programming skills are usually required. The Korp service provides many opportunities for studying and analyzing the Suomi 24 corpus, so it is recommended that you first make sure whether Korp is suitable for your purpose.
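As an illustration of the kind of programming the full-text data calls for, here is a minimal sketch of reading sentences from VRT (one token per line with tab-separated attributes, and XML-like tags marking structures such as sentences). The attribute names and their order (`word`, `lemma`) and the sample text are assumptions; the actual attributes of each corpus are described in its documentation.

```python
# Simplified, made-up VRT sample; real files carry more attributes per token.
SAMPLE_VRT = """<text title="example">
<sentence>
Hei\thei
maailma\tmaailma
</sentence>
</text>
"""


def read_vrt_sentences(lines, attrs=("word", "lemma")):
    """Collect tokens per <sentence> as dicts keyed by attribute name."""
    sentences, current = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<sentence"):
            current = []                      # a new sentence starts
        elif line.startswith("</sentence"):
            if current is not None:
                sentences.append(current)     # the sentence is complete
            current = None
        elif line.startswith("<") or not line:
            continue                          # other structural tags, blanks
        elif current is not None:
            current.append(dict(zip(attrs, line.split("\t"))))
    return sentences


vrt_sentences = read_vrt_sentences(SAMPLE_VRT.splitlines())
print(vrt_sentences)
```

Passing an open file object instead of `SAMPLE_VRT.splitlines()` lets the same function process a downloaded file line by line.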

 

Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2022011221

Corpus Title

Current versions of this resource: 
Corpus Title, Korp version
Metadata and license
Attribution instructions
Select the corpus in Korp
Corpus Title, download version
Metadata and license
PRIV: See privacy guidelines
Attribution instructions
Apply for rights to access the resource
Download the resource
Look for other versions of this resource

Information about the removal of the LAT version of this corpus in November 2020

Due to technical reasons, the LAT service (lat.csc.fi) will be discontinued in the Language Bank of Finland as of November 30, 2020. After this, the LAT version of this corpus will no longer be available. However, the content will be made available for download. In case you urgently need the downloadable data, please contact us.

Corpus contents

The corpus consists of…

Other details about the content and the terms and conditions regarding the different corpus versions are available in the corresponding metadata records.

Example queries from the Korp version of this corpus


Privacy guidelines

Corpus XYZ contains personal data. When using the corpus, follow the personal data guidelines provided by the Language Bank of Finland. Below, you can find a description of the types of personal data that are included in the corpus as well as details on additional specific restrictions that you need to comply with when processing the personal data in question.

[This part should contain the description and corpus-specific restrictions regarding the processing of the personal data in the corpus, as stated by the data controller in the deposition license agreement.]

Nimiarkisto

Nimiarkisto.fi is a portal with the most important digital resources of names and named entities collected from and archived in Finland. The service is offered by the Institute for the Languages of Finland.

Open the website

User Guidelines (in Finnish)

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021111902

finsentiment

Finsentiment estimates a sentiment (positive, negative, or neutral) for each sentence in the input text, and also for the input text as a whole.

The sentiment analysis relies on three resources:

  1. Word embeddings calculated from a corpus of Finnish text.
  2. Product reviews harvested from the Internet.
  3. A word-based convolutional neural network with 100 kernels each of sizes 2, 3, 4 and 5 words. The neural network is trained to predict the rating associated with product reviews, and the prediction it gives to the input text is converted to a sentiment.
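The shape of the network in item 3 can be sketched in plain Python. This is not the tool's implementation: the embedding size, the random weights, the ReLU activation and the max-over-time pooling are assumptions borrowed from common word-based CNN text classifiers. Only the kernel widths (2, 3, 4 and 5 words) and the count of 100 kernels per width come from the description above.

```python
import random

KERNEL_SIZES = (2, 3, 4, 5)  # window widths in words (from the description)
KERNELS_PER_SIZE = 100       # kernels per width (from the description)
EMBED_DIM = 8                # hypothetical; real embeddings are larger


def make_kernels(rng):
    """Random weights for every kernel; a trained model would learn these."""
    return {
        k: [[rng.uniform(-1, 1) for _ in range(k * EMBED_DIM)]
            for _ in range(KERNELS_PER_SIZE)]
        for k in KERNEL_SIZES
    }


def conv_features(embeddings, kernels):
    """Apply each kernel to every window of k words, keep its maximum."""
    features = []
    for k, bank in kernels.items():
        # Flatten each window of k consecutive word vectors into one vector.
        windows = [
            [x for vec in embeddings[i:i + k] for x in vec]
            for i in range(len(embeddings) - k + 1)
        ]
        for weights in bank:
            responses = [
                max(0.0, sum(w * x for w, x in zip(weights, window)))  # ReLU
                for window in windows
            ]
            # Max-over-time pooling; 0.0 if the text is shorter than k words.
            features.append(max(responses, default=0.0))
    return features


rng = random.Random(0)
kernels = make_kernels(rng)
sentence = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(6)]
features = conv_features(sentence, kernels)
print(len(features))  # 4 widths x 100 kernels = 400 pooled features
```

In a classifier of this kind, the 400 pooled features would feed a final layer that predicts the review rating, which is then mapped to a sentiment.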

At the moment this tool is available as a demo version.

Open the website

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110405

Terminology Forum

Terminology Forum is a global non-profit information forum for freely available terminological information online, created by experts and enthusiasts in various fields. The Forum was established in 1994 and is maintained by the University of Vaasa, Finland.

Open the website

The related corpus Terminology Forum Glossaries (selection), source is available for download in the download service of Kielipankki.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110404

ELAN

ELAN is a program for transcribing and annotating audio and video files, offered by The Language Archive. It can also be used for searching locally stored collections of annotated material.

With ELAN, a user can add an unlimited number of textual annotations to audio and/or video recordings. An annotation can be a sentence, word or gloss, a comment, translation or a description of any feature observed in the media. Annotations can be created on multiple layers, called tiers. Tiers can be hierarchically interconnected. An annotation can either be time-aligned to the media or it can refer to other existing annotations. The content of annotations consists of Unicode text and annotation documents are stored in an XML format (EAF).
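Because EAF is XML, annotation documents can be read with a standard XML parser. The sketch below uses a heavily simplified sample document (real EAF files also contain a header, linguistic type definitions and further attributes) and a hypothetical helper that lists the time-aligned annotations of each tier with their start and end times in milliseconds.

```python
import xml.etree.ElementTree as ET

# Simplified sample for illustration; not a complete, valid EAF file.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
                            TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hyvää päivää</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>good day</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def time_aligned_annotations(eaf_text):
    """Return (tier id, start ms, end ms, text) for time-aligned annotations."""
    root = ET.fromstring(eaf_text)
    # Map time slot ids to millisecond values; some slots may have no value.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            rows.append((
                tier.get("TIER_ID"),
                slots.get(ann.get("TIME_SLOT_REF1")),
                slots.get(ann.get("TIME_SLOT_REF2")),
                ann.findtext("ANNOTATION_VALUE", default=""),
            ))
    return rows


aligned = time_aligned_annotations(SAMPLE_EAF)
print(aligned)
```

The second tier in the sample shows the other annotation kind mentioned above: it is not time-aligned itself but refers to an existing annotation, so the helper skips it.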

The program is available for Windows, Mac and Linux and source code is open for developers. Installation instructions and further details about the software can be found on the project website.

Metadata, license and citation instructions

User guidelines in Finnish

User guidelines in English

Install the tool

 

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110402

Finnish BERT (FinBERT)

A version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.

FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls.

TurkuNLP

For more information, see the FinBERT project page

Install (GitHub)

FinBERT Kielipankki version: Kielipankki offers a version of Google’s BERT deep transfer learning model for Finnish. It is installed in CSC’s Puhti cluster and can be used via the pytorch 1.4 module. For details see /appl/data/kielipankki/bert_models/README.txt

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110401

Transkribus

Transkribus is a comprehensive platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents.

Open the website

User instructions

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110305

Semantic similarity of words (word2vec)

The tool is developed by the Turku NLP group for analyzing the semantic similarity of words.

Online demo

Documentation

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110304