Donate Speech datasets (puhelahjat) for research use

Donate Speech datasets for commercial use: further details will be available soon.

Versions of this resource:
Donate Speech Corpus, version 1.0
Metadata
License (for researchers)
Attribution instructions
Apply for access rights, academic research use only

+PRIV: This resource contains personal data.
Submit public information about personal data processing

Download the resource
Donate Speech Corpus: Sample
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Training data (100h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Test data (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Development data (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Multi-transcriber test data (1h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Donate Speech Corpus: Test data from multi-transcriber speakers (10h)
Metadata
License (for researchers)
Attribution instructions
(The download link will appear here)
Look for other versions of this resource

Contents of the resource

The Donate Speech Corpus, abbreviated Puhelahjat, was compiled in the Donate Speech campaign, which was launched on June 16, 2020 and implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki. During the project, anyone who spoke some Finnish had the opportunity to donate their own speech in order to promote language research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.

The first version of the audio material includes the speech samples that were donated by spring 2021. The total duration of the recordings in this version is approximately 3,200 hours. In 2021, approximately 1,600 hours of the recordings were transcribed by hand, and the resulting transcriptions were aligned with the corresponding audio recordings using automatic methods.

Version 1.0 of the dataset is available in the download service to researchers who have been granted access. Some subsets of the complete dataset (selected, for instance, for the development of automatic speech recognition) will also be made available as separate download packages. The description and the citation practices of each subset can be found in the corresponding metadata records.

The Donate Speech datasets can be updated later, for instance after a sufficient number of new donations has accumulated. New versions can also be created as researchers or companies continue to transcribe and annotate the existing recordings more extensively.

How to obtain access to use the material?

The research use of the Donate Speech Corpus and any of its subsets is subject to the license of the resource. Note that the license also includes resource-specific data protection conditions.

Research use

  1. Researchers can apply for the right to use the data via the usual application procedure in the Language Bank Rights system (see instructions).
  2. When applying for access, the researcher must take into account the license requirements, including the resource-specific data protection terms and conditions regarding the processing of personal data; see the license (for researchers).
  3. Before starting to process the data, the researcher must submit the title of the project and the link to the public Privacy Notice regarding the processing of personal data in their project (see the e-form).
  4. When the application is approved, the researcher can access the entire Donate Speech Corpus as well as all versions and subsets of the resource.

The instructions for commercial use can be found on a separate page.

Last updated: 27.10.2022

Persistent identifier of this page: urn:nbn:fi:lb-2022102121