# Aalto Finland Swedish Parliament ASR Corpus 2015-2020
Short name: `sv-fi-parliament-asr`
Persistent Identifier of this resource: http://urn.fi/urn:nbn:fi:lb-2022052004

This corpus is extracted from the transcripts and the videos of Swedish speeches held during plenary sessions in the Parliament of Finland. The extraction was done by the Aalto Speech Recognition group. The original session transcripts and videos are available at the web
portals of the Parliament of Finland (avoindata.eduskunta.fi and verkkolahetys.eduskunta.fi).

This corpus consists of:
 - 3889 sample pairs
 - 6.4 hours of speech
 - 333 483 word tokens
 
This dataset is considerably smaller than its Finnish counterpart (http://urn.fi/urn:nbn:fi:lb-2021051903) because Swedish speeches are rare in the Parliament of Finland: only around 1% of all speeches are held in Swedish.

All audio files in this corpus are single-channel WAV files with a 16 kHz sample rate and 16-bit precision.
The transcript files (.trn) are plain text files.
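
The audio format stated above can be checked with Python's standard `wave` module. The following is a minimal sketch, not part of the corpus tooling; the function names `matches_corpus_format` and `duration_seconds` are hypothetical and chosen only for illustration.

```python
import wave


def matches_corpus_format(wav_path: str) -> bool:
    """Return True if the file is single-channel, 16 kHz, 16-bit PCM."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getframerate() == 16000
            and wav.getsampwidth() == 2  # 2 bytes = 16 bits per sample
        )


def duration_seconds(wav_path: str) -> float:
    """Return the length of the audio in seconds."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Running `matches_corpus_format` over the downloaded files is a quick way to confirm that nothing was resampled or converted along the way.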

The tools and code used to produce this corpus:
 - Preprocessing and postprocessing: https://github.com/aalto-speech/sv-fi-parliament-tools
 - Decoding and segmentation: Kaldi, https://github.com/kaldi-asr/kaldi

### Data

This corpus contains samples of Swedish speech (.wav) and their corresponding transcripts (.trn) from sessions
between 1/2015 and 104/2020. Many sessions in this range are not present in the dataset. In most cases, this is because they did not contain any Swedish speeches. In a few cases, it is because either the recording or the transcription was unavailable.

Samples are grouped by session. Each filename is formed from the following components:

> Filename (Kaldi-compatible utterance id): `<mpid>-<session_number>-<session_year>-<startsec>-<endsec>`
>
> e.g.: `00941-050-2015-00214257-00214804`

Further details:

|      Component     |                                                               Definition                                                               |
|:------------------:|:--------------------------------------------------------------------------------------------------------------------------------------:|
| `<mpid>`           | The unique Member of Parliament identifier given to the MPs in the parliament's public databases.                                      |
| `<session_number>` | A running number given to the plenary session which together with the working year uniquely identifies the session.                    |
| `<session_year>`   | The parliamentary working year of the session. In election years, the working year differs from the calendar year.                     |
| `<startsec>`       | The start timestamp of the segment in the full plenary session audio. Format is seconds + two decimals, e.g. `00186868` = 1868.68 s.   |
| `<endsec>`         | Like the start timestamp, this marks the end timestamp of the segment in the original audio.                                           |
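
As an illustration of the naming scheme, the utterance id can be split into its components with a few lines of Python. This is a minimal sketch, not part of the corpus tooling; the names `UtteranceId` and `parse_utterance_id` are hypothetical. The MPID is kept as a string so that leading zeros survive; whether `speaker-id-mapping.csv` uses the same zero-padding is not stated here, so treat that as an assumption to check.

```python
from typing import NamedTuple


class UtteranceId(NamedTuple):
    mpid: str            # kept as a string to preserve leading zeros
    session_number: int
    session_year: int
    start_sec: float     # seconds from the start of the session audio
    end_sec: float


def parse_utterance_id(utt_id: str) -> UtteranceId:
    """Split an utterance id into its five hyphen-separated components."""
    mpid, number, year, start, end = utt_id.split("-")
    # Timestamps are written as integers in hundredths of a second.
    return UtteranceId(mpid, int(number), int(year), int(start) / 100, int(end) / 100)


utt = parse_utterance_id("00941-050-2015-00214257-00214804")
duration = utt.end_sec - utt.start_sec  # segment length in seconds
```

For the example filename, the segment starts at 2142.57 s and ends at 2148.04 s in session 50/2015, so it is about 5.5 seconds long.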

The samples are machine-extracted, so some inaccuracies remain in them. The audio quality
also varies.

### Note about MPIDs
A mapping between each MPID and the MP's name is provided in `speaker-id-mapping.csv`.

---
## License
See the `LICENSE.md` file.

---
## Contact
Authors: Otto-Ville Raitolahti, Anja Virkkunen, and Mikko Kurimo of the Aalto Speech Recognition Group
Contact via kielipankki@csc.fi

