Aalto Finnish Parliament ASR Corpus 2008-2020

This corpus is extracted from the Finnish parliament plenary session transcripts and videos by the
Aalto Speech Recognition group. The original session transcripts and videos are available at the web
portals of the Parliament of Finland (avoindata.eduskunta.fi and verkkolahetys.eduskunta.fi). The
corpus is split into three parts:
1. 2015-2020 set
2. 2008-2016 set
3. Development and evaluation sets

A non-overlapping combination of the 2008-2016 set and the 2015-2020 set form a training set of size:
– 1 422 318 sample pairs
– 3 130 hours of speech
– 19 356 831 word tokens

All audio files in this corpus are single-channel wavs with sample rate 16 kHz and 16-bit precision.
The transcript files (.trn) are plain text files.

Latest versions/subcorpora:
Aalto Finnish Parliament ASR Corpus 2008-2020
icon-info-circle Metadata and license
icon-quote-right Attribution instructions
Download the resource
Search for these versions in META-SHARE

Of this language corpus different versions are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found from the list above.

Detailed information on the content of each version, user rights and licenses can be found from it’s specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081105

The Helsinki Korp Europarl Bilingual Corpora

The Helsinki Korp Europarl Bilingual Corpora are:

The Helsinki Korp Europarl Finnish-English Corpus
The Helsinki Korp Europarl Finnish-Swedish Corpus
The Helsinki Korp Europarl Finnish-German Corpus
The Helsinki Korp Europarl Finnish-French Corpus
The Helsinki Korp Europarl Finnish-Spanish Corpus
The Helsinki Korp Europarl Finnish-Estonian Corpus

The corpora contain texts of the Europarl Parallel Corpus v7.

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. For this purpose matching items were extracted and labeled with corresponding document IDs. By using a preprocessor, sentence boundaries were identified. The data was sentence aligned by using a tool based on the Church and Gale algorithm.

For more information on the Europarl Parallel Corpus see http://urn.fi/urn:nbn:fi:lb-20140730195 and http://www.statmt.org/europarl/

Latest versions/subcorpora:
The Helsinki Korp Europarl Bilingual Corpora
icon-info-circle Metadata and license
icon-quote-right Attribution instructions
Select the corpus in Korp
Search for all versions in META-SHARE

Of this language corpus different versions/subcorpora are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found from the list above.

Detailed information on the content of each version, user rights and licenses can be found from it’s specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021052403

Suomeksi

Plenary Sessions of the Parliament of Finland

The latest versions:  
Plenary Sessions of the Parliament of Finland, Kielipankki Korp Version 1.5
icon-info-circle Metadata and license
icon-quote-right How to cite this version
Open the corpus in Korp icon-question-circle
Plenary Sessions of the Parliament of Finland, Downloadable Version 1.5
icon-info-circle Metadata and license
icon-quote-rightHow to cite this version
Download the corpus
Locate other versions of the same resource  

Plenary Sessions of the Parliament of Finland contains audio and video recordings of the parliamentary sessions and the transcripts that have been aligned with the audio. Both the media files and the original transcripts have been obtained directly from the online public services of the Parliament. The content is openly available via the Language Bank of Finland without logging in.

Via the Korp service in the Language Bank of Finland, it is possible to perform various kinds of content searches on the corpus and to calculate statistics from the results. The turns of different speakers have been separated in the text. In the Extended search tab in Korp, it is possible to delimit searches on the basis of the speaker’s name, the parliamentary group or the role of the speaker.

In the search results of this corpus version in Korp, there are also links to the corresponding utterances in the original video. If you wish, you may download the ELAN/EAF annotation files and the audio files in the downloadable version of the corpus for further processing. Moreover, the original videos and transcripts can also be located in the online services of the Parliament of Finland.

The text in the original transcripts has been aligned with the audio recordings by automatic methods. The technological expertise in the alignment process was provided by Aalto University. In those audio portions where a matching text was not found in the transcript, an automatic speech recognizer was used in order to provide a tentative transcript. Thus, it is important to remember that the text in the Korp version of the corpus is not error-free and it may not always fully correspond to the original transcript.

Further information about the contents of the different corpus versions can be found in their metadata records.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-201407305

Search the Language Bank Portal:
Juho Leinonen
Researcher of the Month: Juho Leinonen

 

Tulevat tapahtumat

  1. CLARIN Annual Conference 2021

    27.9.2021 10.0029.9.2021 16.15

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317

More contact information