This page lists the device identifiers (clientID) in the Donate Speech (Lahjoita puhetta) resource family whose corresponding files users must delete from all versions of the data. Look for any filenames that start with one of the listed clientIDs followed by an underscore, and remove them. If you discover that any of the datasets in Kielipankki still contain files that were marked for removal, please inform us immediately.
clientID | Kielipankki removal date (Corpus versions may be affected if accessed before this) |
---|---|
clt0010617 | Removal in progress – please check the files you downloaded! |
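The removal step described above can be sketched in Python as follows. This is a minimal illustration, not an official Kielipankki tool: it assumes your local copy of the data is an unpacked directory tree, and the names `REMOVED_CLIENT_IDS`, `find_files_to_remove` and `remove_files` are hypothetical. The list of clientIDs must be kept in sync with the table on this page by hand.

```python
from pathlib import Path

# clientIDs marked for removal, copied by hand from the table on this page.
REMOVED_CLIENT_IDS = ["clt0010617"]


def find_files_to_remove(root, client_ids):
    """Return all files under `root` whose name starts with '<clientID>_'."""
    # Appending the underscore avoids matching longer IDs that merely share
    # a prefix (for example 'clt00106170_...').
    prefixes = tuple(f"{cid}_" for cid in client_ids)
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.startswith(prefixes)
    )


def remove_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report them."""
    for path in paths:
        if dry_run:
            print(f"would remove: {path}")
        else:
            path.unlink()
            print(f"removed: {path}")
    return len(paths)
```

A cautious way to use the sketch is to call `remove_files(find_files_to_remove("path/to/your/copy", REMOVED_CLIENT_IDS))` first, review the printed list, and only then repeat the call with `dry_run=False`. The directory path here is a placeholder. Files inside still-compressed download packages are not covered and need to be handled separately.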
Last updated: 2.2.2023
Please note that the descriptions and size information are based on our current estimates and may be updated at a later stage.
For companies and non-academic organizations, the following versions of this resource are currently available or forthcoming: | |
---|---|
Donate Speech Corpus: Sample (Metadata). A free sample that contains a randomly selected set of 40 audio files and their corresponding transcripts as plain text files and as annotation files including time alignments. The metadata regarding the recorded samples and the background details supplied by the speakers (if available) are also included. The total duration of the audio files is about 35 minutes. | Price: Free of charge. See instructions. |
Donate Speech: Selected dataset (Metadata). This resource contains five different subsets that were selected at Aalto University especially for developing, training and testing ASR systems. The total duration of the audio files is about 131 hours. | Price: 1000 €. See instructions. |
Donate Speech: Annotated dataset (Metadata). This resource contains all the annotated audio files, their transcriptions as raw text files and annotation files, and the background information regarding the recordings and speakers. The total duration of the audio files is about 1600 hours. | Price: 5000 €. See instructions. |
Donate Speech: Complete dataset, version 1 (Metadata). The Complete dataset (version 1) includes the Annotated dataset (and the Selected dataset and the Sample). In addition, the Complete dataset also includes the audio files that were not transcribed or annotated. | Price: 10 000 €. See instructions. |
The first version of the Donate Speech Corpus (Puhelahjat) is a collection of speech recordings accumulated during the Donate Speech campaign between 16.6.2020 and 14.9.2021.
The resource contains a total of about 3200 hours of speech recordings, out of which about 1600 hours have been transcribed. The resource also includes information about the elicitation tasks for which each of the speech samples was donated in the original campaign, and the background details that were voluntarily provided by speech donors.
The resource is available via the download service of the Language Bank of Finland under restricted terms and conditions. The services of the Language Bank are directed at academic researchers. For companies and non-academic organizations, access to Puhelahjat datasets may be acquired for a fee. Further details can be requested by email at lahjoita-puhetta@helsinki.fi.
NB: These instructions are still subject to change.
In accordance with the specific terms and conditions of the Puhelahjat resource, it is also possible to grant access to the data for commercial and non-academic purposes. However, in this case, a separate license agreement between the University of Helsinki and the company or organization is required. When the agreement is signed and the payment has been made, access can be granted to the representative authorized by the user organization.
When applying for the use of paid material, it must be shown that the license fee has been paid.
Last updated: 23.12.2022
Persistent Identifier of this page: urn:nbn:fi:lb-2022111627
Donate Speech datasets for commercial use: see further details on another page
Important information for all users of this resource: Removal requests
Versions of this resource: | |
---|---|
Donate Speech Corpus, version 1.0: Metadata, License (for researchers), Attribution instructions | Academic research use only. Apply for access rights. +PRIV: This resource contains personal data. Submit public information about personal data processing. Download the resource |
Donate Speech Corpus: Sample: Metadata, License (for researchers), Attribution instructions | Download the resource |
Donate Speech Corpus: Training data (100h): Metadata, License (for researchers), Attribution instructions | (The download link will appear here) |
Donate Speech Corpus: Test data (10h): Metadata, License (for researchers), Attribution instructions | (The download link will appear here) |
Donate Speech Corpus: Development data (10h): Metadata, License (for researchers), Attribution instructions | (The download link will appear here) |
Donate Speech Corpus: Multi-transcriber test data (1h): Metadata, License (for researchers), Attribution instructions | (The download link will appear here) |
Donate Speech Corpus: Test data from multi-transcriber speakers (10h): Metadata, License (for researchers), Attribution instructions | (The download link will appear here) |
Look for other versions of this resource | |
The Donate Speech Corpus, abbreviated Puhelahjat, was compiled in the Donate Speech campaign implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki, launched on June 16, 2020. During the project, anyone who speaks some Finnish had the opportunity to donate their own speech in order to promote language research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.
The first version of the audio material includes the speech samples that were donated by spring 2021. The total duration of the recordings in this version is approximately 3200 hours. In 2021, approximately 1,600 hours of the recordings were transcribed by hand, and the resulting transcriptions were aligned with the corresponding audio recordings using automatic methods.
Version 1.0 of the dataset is available in the download service to researchers who have been granted access. Some subsets of the complete dataset (selected, for instance, for the development of automatic speech recognition) will also be made available as separate download packages. The description and the citation practices of each subset can be found in the corresponding metadata records.
The Donate Speech datasets may be updated later, for instance after a sufficient number of new donations has accumulated. New versions may also be created as researchers or companies continue to transcribe and annotate the existing recordings more extensively.
The research use of the Donate Speech Corpus and any of its subsets is subject to the license of the resource. Note that the license also includes resource-specific data protection conditions.
The instructions for commercial use can be found on a separate page.
Last updated: 27.10.2022
Persistent identifier of this page: urn:nbn:fi:lb-2022102121
The Language Bank of Finland (Kielipankki) is working together with the Finnish Broadcasting Company (Yle) and the Finnish State Development Company (Vake Oy) in the Donate Speech campaign (Lahjoita puhetta). Experts from Aalto University and the University of Turku have also participated in the project.
The goal is to gather 10,000 hours of ordinary, casual Finnish speech that can be used for studying language as well as for developing technology and services that work readily in Finnish. In this project, particular attention has been paid to allowing both academic and commercial use of the material under given terms.
Speech is donated via a web browser or mobile app that offers a selection of tasks with fun themes to inspire and encourage you to talk. The app was developed by Solita.
All variants of spoken Finnish are welcome, including the speech of second-language Finnish learners. As long as you speak some Finnish and can understand the Finnish information and instructions in the app, you can donate!
The speech material donated during the campaign will be stored in the Language Bank of Finland (Kielipankki), coordinated by the University of Helsinki.
The speech material can be redistributed to individual researchers, universities and research organizations or private companies that need it for studying language or artificial intelligence, for developing AI solutions or for higher education purposes related to the aforementioned areas.
Read more about processing personal data in the Donate Speech campaign (in Finnish) and the privacy practices of the Language Bank of Finland.
The Language Bank of Finland will begin redistributing the speech data when a sufficient amount of material has been donated and when the appropriate application process is in place. For academic researchers, the use of the data will be free of charge, similarly to the rest of the services of the Language Bank of Finland. For commercial use, a fee will probably be charged in order to cover handling costs. Details about the pricing will be provided at a later stage.
You can find some examples of research topics in the Researcher of the Month archive of the Language Bank of Finland.
Please contact us by email at lahjoita-puhetta (ATT) kielipankki.fi.