The following versions of this resource are forthcoming for researchers: | |
---|---|
Donate Speech Corpus Metadata License (for researchers) Attribution instructions for this version |
A single application gives a researcher access to all versions and subcorpora of the resource. +PRIV: The resource contains personal data. Submit a public notice on the processing of personal data (The download link for the resource will appear here) |
Apply for access rights (researchers only; the application form is forthcoming)
Donate Speech Corpus: Sample Metadata License (for researchers) Attribution instructions for this version |
(The download link for the resource will appear here) |
Donate Speech Corpus: Training data (100h) Metadata License (for researchers) Attribution instructions for this version |
(The download link for the resource will appear here) |
Donate Speech Corpus: Test data (10h) Metadata License (for researchers) Attribution instructions for this version |
(The download link for the resource will appear here) |
Donate Speech Corpus: Development data (10h) Metadata License (for researchers) Attribution instructions for this version |
(The download link for the resource will appear here) |
Donate Speech Corpus: Multi-transcriber test data (1h) Metadata License (for researchers) Attribution instructions for this version |
(The download link for the resource will appear here) |
Donate Speech Corpus: Test data from speakers transcribed multiple times (10h) Metadata License (for researchers) Attribution instructions for this version |
(The download link for the resource will appear here) |
Look for other available versions |
Last updated: 15.6.2022
For research use, the following versions of this resource are forthcoming: | |
---|---|
Donate Speech Corpus Metadata License (for researchers) Attribution instructions |
NB: One application will give access to the complete dataset. +PRIV: This resource contains personal data. Submit public information about personal data processing (The download link will appear here) |
Apply for access rights, academic research use only (the application form will be opened soon)
Donate Speech Corpus: Sample Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Training data (100h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Test data (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Development data (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Multi-transcriber test data (1h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Test data from multi-transcriber speakers (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Look for other versions of this resource |
Last updated: 15.6.2022
The TallVocabL2Fi dataset comprises responses from 15 participants to a ”tall” 12000-word 5-point scale self-rating response task and a 100-word confirmatory word translation response task. The 15 participants were split by native language, 5 English, 4 Hungarian and 6 Russian, and self-reported CEFR reading level, 5 B1, 4 B2, 5 C1 and 2 C2. The data was gathered through a website from paid participants resident in Finland over a period of 3 months from September to November 2021. In total there are 180 thousand word knowledge self-rating responses and 1.5 thousand word translation responses.
The dataset is unique in its combination of the tall data collection set up, where responses are collected for many words, the varied backgrounds of the learners, the use of Finnish prompt words, and the triangulation with a word translation test. The dataset can be used for vocabulary acquisition research in general, but it is particularly suited to evaluation of the task of Vocabulary Inventory Prediction (VIP) including techniques based on Computer-Adaptive Testing (CAT). The dataset is relational/tabular. It is distributed as a series of TSV files along with a SQL schema exported from DuckDB.
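Since the data is distributed as TSV files, it can be read with standard tooling. The following is a minimal sketch using only the Python standard library. The column names (`participant_id`, `word`, `rating`) and the sample rows are hypothetical placeholders, not the actual schema, which is documented in the readme and the SQL schema shipped with the data.

```python
import csv
import io
from collections import Counter


def read_tsv(handle):
    """Parse a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical in-memory stand-in for one of the TSV files; the real
# file names and column names come from the dataset's own schema.
sample = io.StringIO(
    "participant_id\tword\trating\n"
    "1\ttalo\t5\n"
    "1\tkissa\t4\n"
    "2\ttalo\t3\n"
)

rows = read_tsv(sample)

# Count how many self-rating responses each participant gave.
ratings_per_participant = Counter(row["participant_id"] for row in rows)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`.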
Further information about the schema and the collection process is available in the readme included with the data, and in the accompanying publication: Robertson, F., Chang, L., & Söyrinki, S. (2022). TallVocabL2Fi: An Extensive Mapping of 15 Finnish L2 Learners’ Vocabulary. In Language Resources and Evaluation Conference (LREC 2022).
Latest versions/subcorpora: | |
TallVocabL2Fi: Measurements of 15 L2 Finnish learners’ vocabularies Metadata and license Attribution instructions |
Download the resource |
Search for all versions of this resource in META-SHARE |
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051702
The corpus contains 86 Finnish chat dialogs collected during 2019-2020. The 62 participants were university staff, university students and high school students. For more detailed information, see the article listed below.
Please cite the following paper when using the corpus: K. Leino, J. Leinonen, M. Singh, S. Virpioja and M. Kurimo. ”FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics.” INTERSPEECH. 2020.
Link: https://github.com/aalto-speech/FinChat
Latest versions/subcorpora: | |
Finnish conversational chat corpus, source Metadata and license Attribution instructions |
The resource will be available soon |
Search for all versions of this resource in META-SHARE |
Different versions or subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022060901
The corpus consists of the correspondence of Elias Lönnrot with private individuals as well as institutions from 1823 until Lönnrot’s death. Elias Lönnrot (1802–1884) was the creator of the Kalevala, a medical doctor and a professor of language. The letters and drafts of letters belong to the Archive of the Finnish Literature Society and have been transliterated for the project Elias Lönnrot’s Letters Online, http://lonnrot.finlit.fi/omeka/.
Latest versions/subcorpora: | |
Elias Lönnrot Letters Online, source Metadata and license Attribution instructions |
resource will be available soon |
The Finnish sub-corpus of Elias Lönnrot Letters Online – Kielipankki version Metadata and license Attribution instructions |
resource will be available soon |
The Swedish sub-corpus of Elias Lönnrot Letters Online – Kielipankki version Metadata and license Attribution instructions |
resource will be available soon |
Search for all versions of this resource in META-SHARE |
Different versions or subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022051701
ERME contains predominantly Erzya and Moksha literature. It consists of several media publications from the 19th to the 20th century. ERME was mapped in Saransk in 1997-2004, while in Helsinki it has been mapped since 2004. The most basic format used is XML, with a granularity extending to chapter level. The goal is to create corpora with a granularity extending to word level. At sentence level contextual translation is used (English or Finnish translation), while at word level there is morphological encoding, corresponding to each context. Preliminary morphological analysis is carried out using HFST-based transducers, which have been developed in the Giellatekno infrastructure of the University of Tromsø.
The grammatical analysis and labeling comply with the practices developed in the Giellatekno infrastructure of the University of Tromsø. These practices are applied in the documentation of several Uralic languages.
Amount of processed material: more than a million words. The amount of the processed material is to be increased subsequently.
Latest versions/subcorpora: | |
Erzya and Moksha Extended Corpora (ERME), Korp Version Metadata and license Attribution instructions |
Select the corpus in Korp |
Search for all versions of this resource in META-SHARE |
Different versions or subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022052001
Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. It focusses on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. The group also extends its activities to other under-resourced languages, particularly Circumpolar and Uralic languages. Analyses and tools are designed to make it easier for other minority language societies to develop the language technology constituting a prerequisite for a language to survive in modern society.
Find a selection of Giellatekno’s dictionaries gathered under Dictionaries of Neahttadigisánit
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022022301
Martti Rapola (1891–1972), a distinguished researcher of Old Literary Finnish and Finnish Dialects, compiled extensive material on 19th-century Literary Finnish, which he organized according to its significance. From these pickings made in the 1930s and 1950s, Rapola’s 19th-century vocabulary was created, comprising a total of 44,000 headwords. Rapola made use of this material in many articles published in the 1940s and 1950s and in a selection published in 1960, named ’Sanojemme ensiesiintymiä Agricolasta Yrjö-Koskiseen’, which, as the name implies, contains a vocabulary established in Literary Finnish.
The material published online is based on the original headwords, which have been selectively submitted as a database. It contains information about a total of 5600 words, divided into 1070 concepts. This is about a quarter of the original data.
Latest versions/subcorpora: | |
Martti Rapola’s 19th century vocabulary, Sanat version Metadata and license Attribution instructions |
Open the resource in Sanat |
Search for all versions in META-SHARE |
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022021805
This resource contains the results of a competition for collecting colloquial place names. The competition was held from 18.8. to 3.11.2003 in schools in Espoo, Helsinki, Kauniainen and Vantaa. It was organized by Stadin slangi ry, the Institute for the Languages of Finland and Helsingin Sanomat.
The whole collection from the competition, about 14 500 names, is organized both by name and by school. In addition to the names, other information given by the pupils has been published: the official name of the place, its location, example sentences, and further details such as the origin of the name and its use.
Latest versions/subcorpora: | |
Place Names in Slang Metadata and license Attribution instructions |
Open the website |
Place Names in Slang, Sanat version Metadata and license Attribution instructions |
Open the resource in Sanat |
Search for all versions in META-SHARE |
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022021804
The website offers a collection of links to all digitally and publicly available language resources of the Institute for the Languages of Finland.
Examples of language resources available in the service:
Dictionary of Finnish dialects
Dictionary of Old Literary Finnish
Etymological Database of the Sami Languages
Etymological Reference Database
Frequencies of Early Modern Finnish Words
Frequencies of Old Literary Finnish Words
Frequency list of Written Finnish Word Forms
Headword List of the Karelian Dictionary
Names of Countries in Seven Languages
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022020901
HeLI-OTS (off-the-shelf) is a language identifier with language models for 200 languages. The program reads <infile>, classifies the language of each line as one of the 200 languages it knows, and writes the results, one ISO 639-3 code per line, into <outfile>. It can identify about 3000 sentences per second using one core of a 2021 laptop and around 3 gigabytes of memory.
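Because the output file holds one ISO 639-3 code per input line, the results can be joined back to the input by line position. The sketch below illustrates that post-processing step only; it does not call HeLI-OTS itself. The function names are hypothetical and the sample lines and codes are hand-written illustrations, not actual output of the tool.

```python
from collections import Counter


def pair_lines_with_languages(input_lines, code_lines):
    """Pair each input line with the language code on the same line number."""
    if len(input_lines) != len(code_lines):
        raise ValueError("input and output must have the same number of lines")
    return [
        (text.rstrip("\n"), code.strip())
        for text, code in zip(input_lines, code_lines)
    ]


def language_counts(pairs):
    """Count how many lines were assigned to each ISO 639-3 code."""
    return Counter(code for _, code in pairs)


# Illustrative stand-ins for the contents of <infile> and <outfile>.
input_lines = ["Hyvää huomenta\n", "Good morning\n", "Hyvää yötä\n"]
code_lines = ["fin\n", "eng\n", "fin\n"]

pairs = pair_lines_with_languages(input_lines, code_lines)
counts = language_counts(pairs)
```

With real files, both lists would come from reading `<infile>` and `<outfile>` with `open(...).readlines()`.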
Producing and publishing this software has been partly supported by the Finnish Research Impact Foundation's Tandem Industry Academia funding, in cooperation with Lingsoft.
Latest versions/subcorpora: | |
HeLI-OTS 1.3 Metadata and license Attribution instructions |
Open the website |
HeLI-OTS 1.2 Metadata and license Attribution instructions |
Open the website |
Look for all versions in META-SHARE |
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022011801
Latest versions and variants: | |
The Suomi 24 Sentences Corpus 2001-2020, Korp version Metadata and license Citation instructions |
Open the resource in Korp (including the years 2001-2017 and the update 2018-2020) |
The Suomi 24 Corpus 2001-2020, VRT version Metadata and license Citation instructions |
Download the resource (including the years 2001-2017 and the update 2018-2020) |
The Suomi 24 Sentences Corpus 2018-2020, Korp-version Metadata and license Citation instructions |
Open the resource in Korp |
The Suomi24 Corpus 2018-2020, VRT version Metadata and license Citation instructions |
Download the resource |
The Suomi24 Sentences Corpus 2001-2017, Korp version 1.2 Metadata and license Citation instructions for this version |
Open the resource in Korp |
The Suomi24 Corpus 2001-2017, VRT version 1.1 Metadata and license Citation instructions for this version |
Download the resource |
Search for all available versions |
The resource consists of the discussions posted on the Suomi 24 discussion forum. The content has been annotated with automatic methods and stored in VRT format.
Via the Korp service, it is possible to perform versatile search queries from the content and to obtain various statistics and visualizations (see Korp instructions).
Without logging in via Korp, you can see the items matching your search criteria as brief excerpts only. At each word token in the concordance, you can find a link to the original message and discussion thread on the original Suomi 24 discussion platform, in case they are still available there. If required, researchers can also log in in case they need to view the wider context around the matching items.
In addition to the corpus versions that are available in Korp, the corresponding full text documents are available for logged-in researchers in VRT format either on the CSC computing environment or as downloadable packages via the download service of Kielipankki. In order to use the computing environment, researchers need a CSC user account. Please note, however, that in order to use the full text data efficiently, some technical and programming skills are usually required. The Korp service provides many opportunities for studying and analyzing the Suomi 24 corpus, so it is recommended that you first make sure whether Korp is suitable for your purpose.
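As an indication of the kind of programming involved, here is a minimal sketch of reading VRT (verticalized text) data in Python. It assumes the usual VRT layout: one token per line with tab-separated attributes, and XML-style structural tags on their own lines. The attribute columns (word form, lemma, part of speech) and the sample content are hypothetical; the actual attributes of each corpus are listed in its metadata. The sketch also assumes that token lines never begin with `<`, that is, that such characters are escaped in the data.

```python
def parse_vrt(lines):
    """Collect sentences as lists of token tuples from VRT-formatted lines."""
    sentences = []
    current = None  # tokens of the sentence being read, or None outside one
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<sentence"):
            current = []
        elif line.startswith("</sentence"):
            if current is not None:
                sentences.append(current)
            current = None
        elif line.startswith("<"):
            # Other structural tags and comment lines are skipped here.
            continue
        elif current is not None:
            current.append(tuple(line.split("\t")))
    return sentences


# Hand-written sample, not taken from the corpus.
sample_vrt = [
    '<text title="example">\n',
    "<paragraph>\n",
    '<sentence id="1">\n',
    "Hei\thei\tINTJ\n",
    "maailma\tmaailma\tNOUN\n",
    "</sentence>\n",
    "</paragraph>\n",
    "</text>\n",
]

sentences = parse_vrt(sample_vrt)
```

Each sentence comes back as a list of tuples, one per token, which can then be filtered or counted as needed.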
Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2022011221
Current versions of this resource: | |
Corpus Title, Korp version Metadata and license Attribution instructions | Select the corpus in Korp |
Corpus Title, download version Metadata and license PRIV: See privacy guidelines Attribution instructions | Apply for rights to access the resource Download the resource |
Look for other versions of this resource |
Due to technical reasons, the LAT service (lat.csc.fi) will be discontinued in the Language Bank of Finland as of November 30, 2020. After this, the LAT version of this corpus will no longer be available. However, the content will be made available for download. In case you urgently need the downloadable data, please contact us.
The corpus consists of…
Other details about the content and the terms and conditions regarding the different corpus versions are available in the corresponding metadata records.
Corpus XYZ contains personal data. When using the corpus, follow the personal data guidelines provided by the Language Bank of Finland. Below, you can find a description of the types of personal data that are included in the corpus as well as details on additional specific restrictions that you need to comply with when processing the personal data in question.
[This part should contain the description and corpus-specific restrictions regarding the processing of the personal data in the corpus, as stated by the data controller in the deposition license agreement.]
Nimiarkisto.fi is a portal with the most important digital resources of names and named entities collected from and archived in Finland. The service is offered by the Institute for the Languages of Finland.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021111902
Finsentiment estimates a sentiment (positive, negative, or neutral) for each sentence in the input text, and also for the input text as a whole.
The sentiment analysis relies on three resources:
At the moment this tool is available as a demo version.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110405
Terminology Forum is a global non-profit information forum for freely available terminological information online, created by experts and enthusiasts in various fields. The Forum was established in 1994 and is maintained by the University of Vaasa, Finland.
The related corpus Terminology Forum Glossaries (selection), source is available for download in the download service of Kielipankki.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110404
ELAN is a program for transcribing and annotating audio and video files, offered by The Language Archive. It can also be used for searching locally stored collections of annotated material.
With ELAN, a user can add an unlimited number of textual annotations to audio and/or video recordings. An annotation can be a sentence, word or gloss, a comment, translation or a description of any feature observed in the media. Annotations can be created on multiple layers, called tiers. Tiers can be hierarchically interconnected. An annotation can either be time-aligned to the media or it can refer to other existing annotations. The content of annotations consists of Unicode text and annotation documents are stored in an XML format (EAF).
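Since EAF is XML, annotation documents can be inspected with a generic XML parser. The sketch below reads time-aligned annotations from a small hand-written EAF fragment using the Python standard library. The fragment is heavily reduced and is not a complete, valid EAF document, and the function name is hypothetical. Annotations that refer to other annotations rather than to the media timeline are not handled.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written EAF-like fragment for illustration only.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker1" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hyvää päivää</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(eaf_text):
    """Return (tier id, start ms, end ms, text) for time-aligned annotations."""
    root = ET.fromstring(eaf_text)
    # Map each time slot id to its time value in milliseconds.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    result = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            result.append((
                tier.get("TIER_ID"),
                slots.get(ann.get("TIME_SLOT_REF1")),
                slots.get(ann.get("TIME_SLOT_REF2")),
                ann.findtext("ANNOTATION_VALUE", default=""),
            ))
    return result


annotations = read_aligned_annotations(SAMPLE_EAF)
```

To read a file saved by ELAN, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.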
The program is available for Windows, Mac and Linux and source code is open for developers. Installation instructions and further details about the software can be found on the project website.
Metadata, license and citation instructions
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110402
A version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.
FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls.
For more information, see the FinBERT project page.
FinBERT Kielipankki version: Kielipankki offers a version of Google’s BERT deep transfer learning model for Finnish. It is installed in CSC’s Puhti cluster and can be used via the pytorch 1.4 module. For details see /appl/data/kielipankki/bert_models/README.txt
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110401
Transkribus is a comprehensive platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110305
The tool is developed by the Turku NLP group for analyzing the semantic similarity of words.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110304