Donate Speech (puhelahjat): Datasets for companies and non-academic organizations


Are you a researcher? Information about the Donate Speech datasets for academic research use is available on another page.

Please note that the descriptions and size information are based on our current estimates and may be updated at a later stage.

For companies and non-academic organizations, the following versions of this resource are currently available or forthcoming:
Donate Speech Corpus: Sample
Metadata
A free sample containing 40 randomly selected audio files with their transcripts, provided both as plain text files and as annotation files with time alignments. The metadata on the recorded samples and the background details supplied by the speakers (where available) are also included. The total duration of the audio files is about 35 minutes.
Reference instructions for this version
Price: Free of charge

See instructions.

Download the resource

Donate Speech: Selected dataset
Metadata
This resource contains five subsets that were selected at Aalto University specifically for developing, training and testing automatic speech recognition (ASR) systems. The total duration of the audio files is about 131 hours.
Reference instructions for this version
Price: 1000 €

See instructions.

Download the resource

Donate Speech: Annotated dataset
Metadata
This resource contains all the annotated audio files, their transcriptions as raw text files and annotation files, and the background information regarding the recordings and speakers. The total duration of the audio files is about 1600 hours.
Reference instructions for this version
Price: 5000 €

See instructions.

Download the resource

Donate Speech: Complete dataset, version 1
Metadata
The Complete dataset (version 1) includes the Annotated dataset, the Selected dataset and the Sample. It also includes the audio files that were not transcribed or annotated.
Reference instructions for this version
Price: 10 000 €

See instructions.

Download the resource

Contents of the datasets

The first version of the Donate Speech Corpus (Puhelahjat) is a collection of speech recordings accumulated during the Donate Speech campaign between 16.6.2020 and 14.9.2021.

The resource contains a total of about 3200 hours of speech recordings, of which about 1600 hours have been transcribed. The resource also includes information about the elicitation task for which each speech sample was donated in the original campaign, and the background details that speech donors voluntarily provided.

The resource is available via the download service of the Language Bank of Finland under restricted terms and conditions. The services of the Language Bank are directed at academic researchers. For companies and non-academic organizations, access to Puhelahjat datasets may be acquired for a fee. Further details can be requested by email at lahjoita-puhetta@helsinki.fi.

How to obtain access to the material: instructions for companies and non-academic organizations

NB: These instructions are still subject to change.

In accordance with the specific terms and conditions of the Puhelahjat resource, access to the data can also be granted for commercial and non-academic purposes. In this case, however, a separate license agreement between the University of Helsinki and the company or organization is required. Once the agreement has been signed and the payment has been made, access can be granted to the representative authorized by the user organization.

  1. Companies and organizations interested in using the data may contact us for further information at lahjoita-puhetta@helsinki.fi.
  2. A copy of the general terms included in the agreements is provided online for reference, see http://urn.fi/urn:nbn:fi:lb-2022060130.
  3. Before acquiring a paid dataset, the company may obtain access to a small sample of the material free of charge. However, access to the sample is subject to the same terms and conditions as the paid versions of the material, and an agreement is needed.
  4. When the agreement has been signed, the representative authorized by the company/organization may apply for access to the desired dataset (either to the free sample or to one of the paid datasets) via the Language Bank Rights (LBR) system. The representative may log in by using an eDuuni identity.
  5. In connection with the application, the company applying for the right of use must provide a public link to their Privacy Notice (or similar document) regarding the processing of the personal data contained in the material. This information will be published on the website of the Language Bank.
    Instructions for publishing the Privacy Notice
  6. The license fee must be paid before access to the resource can be granted. Instructions for payment can be requested by email at lahjoita-puhetta@helsinki.fi.
  7. When the application for access is approved in the Language Bank Rights (LBR) system, the applicant can access the data with the same user identity that was used in the application process.

Applicants for paid material must show that the license fee has been paid.


Last updated: 20.4.2023

 

Persistent Identifier of this page: urn:nbn:fi:lb-2022111627

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
