Samples of Spoken Finnish

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Changes

PLEASE NOTE: The downloadable version of this data set was re-packaged on 31.01.2023, because some information was found to be missing from the earlier download packages.
The following data were added:

Four preface texts (’saate’) from individual books in the original printed series ”Suomen kielen näytteitä”, in PDF format
PDF files with general information for each of the 50 municipalities
WAV audio files for municipalities 9-14

Detailed list of the added files

Content and structure

The corpus Samples of Spoken Finnish (SKN corpus) is based on the series of dialect books of the same name published by the Institute for the Languages of Finland between 1978 and 2000 (see Samples of Spoken Finnish). Fifty books were published in total, each containing about two hours of dialect transcriptions. The municipalities selected for the series represent a comprehensive range of dialect areas. The material was mainly taken from the recordings of the Finnish Audio Recordings Archive. From the original SKN series, a data set has been created that contains both the recordings and the transcribed text, which has been aligned with the audio. The corpus is divided into fifty sections, one for each municipality and its previously published dialect book. Two dialect samples are generally available for each section.

The text was manually aligned with the audio in fragments that roughly match sentence or utterance boundaries. The corpus is searchable based on the text, and it is possible to listen to the audio fragments that match the search results.

The SKN corpus contains a total of 696 376 transcribed words, of which 684 977 words have been assigned a ”normalized” form (”normalized” referring to the corresponding form in Standard Spoken Finnish, as opposed to the original form in the local dialect in question). Note that the ”normalization” is not necessarily unambiguous, even though efforts were made to take into account the meaning of the word in context. No normalized form was given for incomplete or unclear words. For a description of the principles of the normalization process, see the document yleiskielistys_skn.pdf (in Finnish) in the root folder of the corpus.

Several versions of the data are available (see listing above).

More information on the recordings and annotations

As the original interviews were recorded under varying conditions and the tapes were digitised at a later date, many of the audio recordings in this dataset contain background noise and occasional other disturbances, and the sound quality varies between recordings. The audio files in WAV format are single channel (mono), sampled at 44100 Hz with 16-bit resolution.
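The stated audio format can be checked on a downloaded file with a few lines of Python. The following is only a minimal sketch using the standard `wave` module; the function names and the `EXPECTED` dictionary are hypothetical helpers introduced for illustration, and the sketch assumes the files are plain PCM WAV files.

```python
import wave

# Expected properties as stated in the corpus description.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate": 44100}


def describe_wav(path):
    """Return the basic format properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),  # 2 bytes = 16 bits
            "frame_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_expected(props):
    """True if the file is mono, 16-bit and sampled at 44100 Hz."""
    return all(props[key] == value for key, value in EXPECTED.items())
```

For example, `matches_expected(describe_wav(path))` should return `True` for a file that follows the description above, where `path` points to one of the downloaded WAV files.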

The LAT version of the data was phased out in November 2020

The LAT platform of the Language Bank was discontinued at the end of 2020. Although the Samples of Spoken Finnish resource can no longer be accessed through the LAT interface, all the content previously available on LAT is available for download. The annotated speech samples can be accessed on a local computer using tools such as ELAN and Praat.

Content of the annotation files in EAF format

Each audio recording of the original material has a corresponding annotation file in EAF format (e.g. SKN01a_Suomussalmi.eaf). Once you have downloaded the EAF file and the corresponding audio file to your local computer (see the downloadable version of the data), you can open them for editing in ELAN. If ELAN does not automatically find the media file (the WAV audio) linked to the EAF file in the directory where you placed it, you can locate the WAV file manually. Once you save the EAF file again, ELAN will find the associated audio file automatically the next time you open the EAF file on the same computer.

The EAF annotation files of this corpus contain several annotation layers or tiers. One tier contains the transcripts of the utterances, ”sentences” or similar passages uttered by the speaker in question, and another tier contains the roughly ”normalized” versions of the transcribed passages. The alignment of the transcripts and the audio was intended to facilitate searching, browsing and listening. The alignment is not completely accurate, and not all pauses have been marked. In addition to the tiers of transcribed speech, the annotation file also contains tiers of word tokens, where the original and the roughly normalized forms of individual words were aligned with each other. Please note that the individual word tokens were not aligned with the audio; they are only intended to facilitate more complex content searches in ELAN.
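Since EAF files are XML documents, the tiers can also be read outside ELAN. The sketch below uses Python's standard `xml.etree.ElementTree` and relies on the general EAF element names (`TIME_SLOT`, `TIER`, `ALIGNABLE_ANNOTATION`, `ANNOTATION_VALUE`). The function name is hypothetical, and it is an assumption that the annotations in the SKN files are stored as alignable annotations; the files themselves should be inspected to confirm this.

```python
import xml.etree.ElementTree as ET


def read_aligned_annotations(eaf_path):
    """Yield (tier_id, start_ms, end_ms, text) for alignable annotations."""
    root = ET.parse(eaf_path).getroot()

    # TIME_ORDER maps time slot ids to millisecond offsets.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }

    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            # Slots without a TIME_VALUE yield None for start or end.
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            yield tier.get("TIER_ID"), start, end, text
```

Any time values found on the word token tiers should not be read as positions in the audio, since only the sentence-level passages were aligned with the recordings.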

TextGrid files corresponding to EAF files are also available and can be used with Praat. The TextGrid file must be opened and viewed together with the corresponding WAV audio file in Praat (since the audio file is not automatically opened with the annotation file in Praat, unlike ELAN).

The alignment of the audio and the text was originally done by importing XML-formatted documents with the help of a Praat script into TextGrid-formatted annotation files, which were then converted into the EAF format by another Praat script.

In ELAN, a ”Linguistic type” was defined for each annotation layer in the EAF files, which allows the user to define focused searches that only match, e.g., those tiers that contain the normalized word forms. For technical reasons, hierarchical relationships between annotation layers and linguistic types were not originally defined for the SKN corpus files. Thus, if you wish to edit the annotations in ELAN, please note that the annotation tiers are independent of each other: if you move annotations of the type ”normalised word”, or their boundaries, for example, the changes will not automatically be reflected in the corresponding units in the other tiers. It might sometimes be easier to use Praat for making the changes to the TextGrid annotation files, since Praat allows co-located annotation boundaries to be moved in sync. Alternatively, you can first manually create a hierarchy between the annotation tiers in your ELAN corpus version by creating new versions of the linguistic types (Type: Add linguistic type…) and then using the command Tier: Change parent of tier… in ELAN.
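To see which tiers belong to which linguistic type in a given file, the `LINGUISTIC_TYPE_REF` attribute of each `TIER` element can be read directly from the EAF XML. This is a minimal sketch with a hypothetical function name; the exact type names stored in the SKN files are assumed to match those quoted on this page and should be checked against the actual data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def tiers_by_linguistic_type(eaf_path):
    """Map each linguistic type name to the ids of the tiers that use it."""
    root = ET.parse(eaf_path).getroot()
    grouped = defaultdict(list)
    for tier in root.iter("TIER"):
        grouped[tier.get("LINGUISTIC_TYPE_REF")].append(tier.get("TIER_ID"))
    return dict(grouped)
```

Looking up a type name such as ”normalised word” in the returned dictionary then gives the ids of the tiers a focused search would cover.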

Searching based on annotations

It is possible to search the transcripts of the corpus via the Korp service.

Searches can also be performed in ELAN, where you can make use of the different types of annotation tiers, in addition to the transcribed text. The annotation tier types ”original sentence” and ”original word” represent the original transcript, and ”normalised sentence” and ”normalised word” represent the preliminary normalized versions of these. The normalized form of some sentences is also accompanied by additional notes, which are given in the ”note for normalised word” layer.

The annotation layers related to the interviewer’s speech are indicated as ”interviewer”. All other layers relate to the speech of either the interviewees or other people present at the time of the recording.
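A comparable search can be sketched outside ELAN by filtering tiers on their linguistic type and matching the annotation text. The function below is hypothetical and only illustrative. It assumes that interviewer tiers can be recognised by the word ”interviewer” in the tier id, which this page does not confirm; adjust the check to whatever the files actually use.

```python
import xml.etree.ElementTree as ET


def search_tiers(eaf_path, query, linguistic_type, skip_interviewer=True):
    """Return (tier_id, text) pairs whose text contains `query`."""
    root = ET.parse(eaf_path).getroot()
    needle = query.casefold()
    hits = []
    for tier in root.iter("TIER"):
        if tier.get("LINGUISTIC_TYPE_REF") != linguistic_type:
            continue
        tier_id = tier.get("TIER_ID", "")
        # Assumption: interviewer tiers carry "interviewer" in their tier id.
        if skip_interviewer and "interviewer" in tier_id.casefold():
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = value.text or ""
            if needle in text.casefold():
                hits.append((tier_id, text))
    return hits
```

For example, `search_tiers(path, "talo", "normalised sentence")` would return the normalized passages containing the string ”talo” in the interviewees' tiers of one file. The match is a case-insensitive substring test, so it is much simpler than the structured searches available in ELAN or Korp.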

Resource creators

The original audio material was edited by Sakari Pietarila. The original transcripts have been published in dialect books, the prefaces of which are attached to the corresponding sections of the corpus as PDF documents. The text and audio were aligned at Kotus by My Sjöholm, Pauliina Liuska and Olli Miettinen. The normalization of the transcripts was performed by Maria Vilkuna, Pauliina Liuska and Pinja Ruponen at Kotus. The audio recordings and the annotation files were converted for the LAT system by Mietta Lennes.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023012601

Last modified on 2025-05-09