FIN-CLARIAH D2.3.2: Aligning and retrieving

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.3: Report on Aligning and retrieving
Date of reporting: 13-11-2023

Report authors: Jack Rueter, Erik Axelson (University of Helsinki)
Contributors: Aleksei Ivanov (University of Tartu), Niko Partanen (University of Helsinki)
Deliverable location: Christmas Gospel text-to-speech in four Uralic languages

Description

The «Christmas Gospel text-to-speech in four Uralic languages» (short name: xmas-gospel-tts) is a collection of .txt, .wav and .vrt files with a variety of alignments that can be used in Korp searches. The collection is intended as a demo showing how parallel multilingual spoken materials can be donated to the Language Bank of Finland and implemented there.

Background

A model for Massively Multilingual Speech (MMS, CC-BY-NC 4.0) has recently been developed at Facebook (Meta). It supports hundreds of languages, and its coverage for automatic speech recognition (ASR), text-to-speech (TTS) and language identification (LID) is documented by Meta.

The documentation at Meta includes 16 of the approximately 32 Uralic languages or language forms spoken today. Of the eight Uralic languages with coverage in all three categories (ASR, TTS and LID), we chose three: Komi-Zyrian (kpv), Karelian (krl) and Erzya (myv). We then selected one additional language, Olonets-Karelian (olo, also known as Livvi), one of the 16 languages with no coverage in any of the three categories. Our choice of a fourth language was motivated by the fact that Karelian and Olonets-Karelian share much the same character-to-sound correspondence, and that Olonets-Karelian might actually be the source of the digital data grouped under the umbrella term Karelian.

The .txt files represent a segment of an existing parallel corpus, Parallel Biblical Verses for Uralic Studies (PaBiVUS), which is described in Metashare with a CC-BY-NC license. The segment or mini parallel corpus here is the Christmas Gospel (Luke 2:1–20), which is well known in Finland.

The .wav files have been produced as a text-to-speech exercise with a Python script by Aleksei Ivanov, Niko Partanen and Jack Rueter, utilizing the model for MMS built at Facebook (see above).
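As a rough illustration of such a text-to-speech step (this is not the authors' script), the sketch below assumes that the MMS TTS checkpoints are available through the Hugging Face transformers library under names of the form facebook/mms-tts-<ISO code>, and that Olonets-Karelian text is synthesized with the Karelian model. The function names and the file naming scheme are hypothetical.

```python
from pathlib import Path

# Assumption: MMS TTS checkpoints follow the "facebook/mms-tts-<iso>" naming
# on the Hugging Face Hub. Olonets-Karelian (olo) has no MMS coverage, so this
# sketch assumes the Karelian (krl) model is reused for it.
MODEL_LANG = {"kpv": "kpv", "krl": "krl", "myv": "myv", "olo": "krl"}


def checkpoint_for(lang: str) -> str:
    """Return the (assumed) checkpoint name used for a language code."""
    return f"facebook/mms-tts-{MODEL_LANG[lang]}"


def wav_path_for(out_dir: Path, lang: str, verse: int) -> Path:
    """Hypothetical naming scheme: one .wav file per language and verse."""
    return out_dir / f"luke_2_{verse:02d}_{lang}.wav"


def synthesize(text: str, lang: str, out_path: Path) -> None:
    """Synthesize one verse to a .wav file with an MMS (VITS) model."""
    # Heavy dependencies are imported here so the helpers above work without them.
    import scipy.io.wavfile
    import torch
    from transformers import AutoTokenizer, VitsModel

    name = checkpoint_for(lang)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = VitsModel.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (batch, samples)
    scipy.io.wavfile.write(
        str(out_path),
        rate=model.config.sampling_rate,
        data=waveform.squeeze(0).cpu().numpy(),
    )
```

Calling synthesize once per verse and language would yield one .wav file per verse, which is the granularity needed for a verse-level alignment.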

The .vrt files contain morpho-syntactically annotated versions of the Christmas Gospel texts, which have subsequently been inspected and manually corrected. The annotation drew on three components: analysers for Erzya, Komi-Zyrian, Karelian and Olonets-Karelian, built with Helsinki Finite-State Technologies (HFST) and under continual development at Saami Language Technology (GiellaLT), based at the Norwegian Arctic University in Tromsø; Constraint Grammar (CG) methods as documented at the University of Southern Denmark; and a Universal Dependencies tool, Annotatrix.
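For readers unfamiliar with the format, a .vrt (verticalized text) file places one token per line with tab-separated annotations, wrapped in XML-like structural tags. The following is a minimal sketch of how one annotated verse could be serialized. The column order, the attribute names and the Token class are assumptions made for illustration, not the layout of the actual files.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Token:
    word: str
    lemma: str
    pos: str
    head: int    # position of the syntactic head within the verse (0 = root)
    deprel: str  # dependency relation label


def verse_to_vrt(verse_id: str, lang: str, tokens: list[Token]) -> str:
    """Serialize one verse as a VRT fragment (hypothetical column layout)."""
    lines = [f"<sentence id={quoteattr(verse_id)} lang={quoteattr(lang)}>"]
    for position, tok in enumerate(tokens, start=1):
        fields = [tok.word, tok.lemma, tok.pos, str(position), str(tok.head), tok.deprel]
        # Escape &, < and > so that token lines cannot be mistaken for tags.
        lines.append("\t".join(escape(field) for field in fields))
    lines.append("</sentence>")
    return "\n".join(lines) + "\n"
```

Keeping the verse identifier as an attribute of the structural tag is what lets the same verse be matched across languages and against its sound file.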

The demo provides two facets of searchability on the Korp server. First, there is parallel corpus searchability, as found in the PaBiVUS corpus: the .vrt-coded verses of the Christmas Gospel are linked to each other across languages, and their dependencies have been automatically annotated and subsequently manually corrected. Second, the text content of each verse is linked to its sound file (.wav), which allows for a sentence-to-utterance alignment of the kind found, for example, in the Finnish Parliament materials, where timestamps are the equivalents of our verse identifiers.
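The two kinds of linking can be pictured as a single index keyed by verse identifier. The sketch below is an illustration under stated assumptions and does not reflect how Korp stores alignments: the AlignedVerse class, the function names and the file naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AlignedVerse:
    verse_id: str
    lang: str
    text: str
    audio: Path  # sound file holding the utterance for this verse


def build_index(
    texts: dict[str, dict[str, str]], audio_dir: Path
) -> dict[str, dict[str, AlignedVerse]]:
    """Map verse_id -> language -> verse text plus its sound file.

    `texts` maps a language code to {verse_id: verse text}.
    """
    index: dict[str, dict[str, AlignedVerse]] = {}
    for lang, verses in texts.items():
        for verse_id, text in verses.items():
            # Hypothetical naming scheme: <verse_id>_<lang>.wav
            audio = audio_dir / f"{verse_id}_{lang}.wav"
            index.setdefault(verse_id, {})[lang] = AlignedVerse(verse_id, lang, text, audio)
    return index


def parallel_verses(
    index: dict[str, dict[str, AlignedVerse]], verse_id: str, source_lang: str
) -> list[AlignedVerse]:
    """Return the same verse in every language other than `source_lang`."""
    by_lang = index[verse_id]
    return [by_lang[lang] for lang in sorted(by_lang) if lang != source_lang]
```

A search hit in one language can then be followed to the corresponding verse in the other languages via parallel_verses, and to its utterance via the audio field.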

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
