The Finnish Dialect Corpus of the Syntax Archive (la-murre)



Currently available versions of this resource

Shortname | Name and metadata | License | Location | Cite | Resource group and help | Apply | Publication year | Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname | Name and metadata | License | Formats | Support level | Contact Person | Resource group and help | Location | Other information

Resource information

The Finnish Dialect Corpus of the Syntax Archive is a collection of material produced in collaboration between the University of Turku and the Institute for the Languages of Finland (Kotus, formerly the Research Institute for the Languages of Finland) from interview recordings in The Tape Archive of the Finnish Language and the Finnish Language Recording Archive of the University of Turku. The recordings were transcribed and grammatically annotated between 1976 and 1984. The grammatical analysis, which was carried out manually using numerical codes, has since been converted into a structured format and supplemented with word forms, and corrections and harmonizations have been made.

The Finnish Dialect Corpus of the Syntax Archive at the Language Bank of Finland contains both audio recordings and transcribed text. The text and audio are aligned in segments of sentences or other suitable lengths. The corpus can be searched based on the text, and the corresponding audio samples can be played back. Searches based on grammatical codes and lemmas can be performed in the Korp system.

The material consists of 142 dialect samples representing 132 localities across Finland, including a number of localities in ceded Karelia. Most localities are represented by a single sample, which is usually an interview with one speaker lasting about an hour. The interviews are generally similar to those in the Samples of Spoken Finnish (SKN) corpus. Some localities have two shorter samples. In some samples, there are two interviewees.

There are slightly over a million word tokens in the data (1 194 163 according to Korp, and over 887 000 grammatically analyzed word tokens produced by interviewees), 67 894 sequences marked as sentences (in Korp; approximately 54 500 sentences from interviewees), and 166 608 sentences distinguished and analyzed according to syntactic criteria.

The Finnish Dialect Corpus of the Syntax Archive overlaps slightly with the SKN corpus. For example, the Kiihtelysvaara interview (SKN14a) is the same in both corpora. However, the transcription in this corpus is coarser than in SKN.

The Finnish Dialect Corpus of the Syntax Archive has long been used in research and theses, initially via searches conducted by the archive staff and, more recently, via the search interface developed by Nobufumi Inaba. Due to modifications and corrections, older search results may differ slightly from newer ones. The early stages of the material and the coding system are described in Lauseopin arkiston opas ("Guide to the Syntax Archive") by Osmo Ikola (Lauseopin arkiston julkaisuja 1, Turku: University of Turku, 1985).

The basic work on the material was carried out at the University of Turku between 1976 and 1984. The text and audio were aligned at Kotus by My Sjöholm, Pauliina Liuska, Matti Uusivirta, and Maria Vilkuna, while Pauliina Liuska and Maria Vilkuna were responsible for the structure and corrections.

Content corresponding to the previous LAT version of the material is now available in the Language Bank download service

The Language Bank LAT platform was discontinued at the end of 2020, and this material is no longer accessible via the LAT service. The corresponding content is available in downloadable format. The data can therefore be further explored and processed using tools such as ELAN and Praat. Please note that a VRT version of The Finnish Dialect Corpus of the Syntax Archive is also available for download. This version does not include the original audio files or annotation files.
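As a rough illustration of working with the VRT version, the following Python sketch counts token lines and sentence elements in VRT-style input. It assumes the general VRT layout (one token per line with tab-separated attributes, and XML-like structural tags on their own lines) and a structural element named "sentence"; the actual element names and attribute columns of this corpus are not stated here and should be checked against the downloaded files. The sample text is invented for the example and is not taken from the corpus.

```python
import io


def count_vrt(lines):
    """Count token lines and <sentence> start tags in VRT-style input.

    Assumptions (not confirmed for this corpus): structural tags start
    with "<" on their own line, literal "<" in tokens is escaped, and
    sentences are marked with an element named "sentence".
    """
    tokens = 0
    sentences = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("<"):
            # Structural tag or comment line, not a token.
            if line.startswith("<sentence ") or line.startswith("<sentence>"):
                sentences += 1
            continue
        tokens += 1  # any other non-empty line is one token
    return tokens, sentences


# Invented sample: word form and lemma separated by a tab (hypothetical columns).
sample = """<text title="example">
<sentence id="1">
no\tno
se\tse
oli\tolla
</sentence>
<sentence id="2">
niin\tniin
</sentence>
</text>
"""

tokens, sentences = count_vrt(io.StringIO(sample))
print(tokens, sentences)
```

For a downloaded file, the same function can be given an open file object, for example `count_vrt(open(path, encoding="utf-8"))`, where `path` is whatever location the VRT file was saved to.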

Instructions for using the downloadable version

Since the interviews were recorded under varying conditions, the recordings may contain background noise and other disturbances, and the sound level may vary. The transcription is aligned with the audio only to facilitate searching, browsing, and listening. The alignment is therefore not entirely accurate, and not all pauses have been marked.

To process the annotation files, you usually also need the corresponding WAV audio files so that you can listen to the samples. EAF annotation files can be opened for editing with the ELAN program. TextGrid files corresponding to the EAF files are also available and can be used with the Praat program. Store the EAF or TextGrid file for a given interview sample in the same directory as the corresponding WAV audio file.
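EAF files are XML, so the aligned segments can also be read outside ELAN. The sketch below uses only the Python standard library and relies on the general EAF layout (TIME_SLOT elements with time values, typically in milliseconds, and ALIGNABLE_ANNOTATION elements inside TIER elements). The tier name "speaker" and the sample content are invented for the example; the tier names actually used in this corpus are not stated here and should be checked in the files themselves.

```python
import io
import xml.etree.ElementTree as ET


def read_eaf_segments(source):
    """Return (tier_id, start, end, text) tuples for time-aligned annotations.

    `source` is a file name or file object containing EAF XML.
    Start and end are the TIME_VALUE integers of the referenced time slots
    (None if a slot has no time value).
    """
    root = ET.parse(source).getroot()

    # Map time slot ids to their time values.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        if value is not None:
            slots[slot.get("TIME_SLOT_ID")] = int(value)

    segments = []
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            segments.append((tier_id, start, end, text))
    return segments


# Invented minimal EAF-style document; the tier name is hypothetical.
sample = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="speaker">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>no se oli</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""

segments = read_eaf_segments(io.StringIO(sample))
for tier_id, start, end, text in segments:
    print(tier_id, start, end, text)
```

The start and end values can then be used to locate the matching stretch in the WAV file stored alongside the annotation file, keeping in mind that the alignment is approximate.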

Annotations visible in Korp

The annotations visible in Korp (word classes, morphological features, sentence structure) are described on a separate page (in Finnish).

 


This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2025091110

Last modified on 2025-11-07