# -*- coding : UTF-8 -*-

Christmas Gospel text-to-speech in four Uralic languages, source

Metadata: http://urn.fi/urn:nbn:fi:lb-2023110601
Licence: CC-BY-NC (http://urn.fi/urn:nbn:fi:lb-2023110603)
Resource shortname: xmas-gospel-tts-src

This is the xmas-gospel-tts-src/ directory. It contains a set of
.txt, .wav and linked morpho-syntactically annotated .vrt files pertaining to
the Finnish Christmas Gospel, LUK.2.1–20, in four languages.

The .txt files cover four languages: Komi-Zyrian (kpv), Erzya (myv),
Karelian (krl) and Olonets-Karelian (olo, also known as Livvi). These texts
are all included in the Parallel Biblical Verses for Uralic Studies (PaBiVUS)
corpus, metashare: http://urn.fi/urn:nbn:fi:lb-2020021121.

The .wav files have been produced by Aleksei Ivanov, Niko Partanen and
Jack Rueter using Facebook's Massively Multilingual Speech (MMS) models
(https://ai.meta.com/blog/multilingual-model-speech-recognition/),
which support hundreds of languages
(https://huggingface.co/facebook/mms-tts#supported-languages).

While Komi-Zyrian (kpv), Erzya (myv) and Karelian (krl) are reported as
having coverage for automatic speech recognition (ASR), text-to-speech (TTS)
and language identification (LID), Olonets-Karelian is not mentioned at all.
The Olonets-Karelian .wav files have been produced using the krl TTS, as the
writing systems of both language forms have basically the same
character-to-sound correspondence.
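
The fallback from olo to krl can be sketched as follows. This is a minimal
illustration and not the code in the notebook: it assumes the Hugging Face
transformers port of MMS TTS (VitsModel), checkpoints named
facebook/mms-tts-<iso>, and the hypothetical helper names tts_model_id and
synthesize.

```python
# Sketch only: assumes the Hugging Face `transformers` port of MMS TTS and
# checkpoints named "facebook/mms-tts-<iso>"; not the notebook's actual code.

# olo is not listed for MMS TTS, so it borrows the krl model.
TTS_FALLBACK = {"olo": "krl"}


def tts_model_id(iso_lang: str) -> str:
    """Return the (assumed) checkpoint name for a language code."""
    return "facebook/mms-tts-" + TTS_FALLBACK.get(iso_lang, iso_lang)


def synthesize(text: str, iso_lang: str, wav_path: str) -> None:
    """Synthesize `text` and write the audio to `wav_path`."""
    # Imported lazily so tts_model_id works without these packages installed.
    import scipy.io.wavfile
    import torch
    from transformers import AutoTokenizer, VitsModel

    model_id = tts_model_id(iso_lang)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform  # shape: (1, n_samples)

    scipy.io.wavfile.write(
        wav_path,
        rate=model.config.sampling_rate,
        data=waveform[0].cpu().numpy(),
    )
```

Calling synthesize downloads the model on first use, so it needs network
access and the torch, transformers and scipy packages.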

The file names follow a pattern that maps directly onto the identifiers
used for correlating Bible verses in the PaBiVUS corpus:
xmas-gospel + ISO 639-3 language code + three-letter abbreviation of the
book of the Bible + chapter number + verse number + file extension.
For example:
xmas-gospel-krl-LUK.2.1.txt
xmas-gospel-krl-LUK.2.1.wav
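
As an illustration, a file name of this shape could be split into its parts
as below. The names VerseFile and parse_name are hypothetical, and the
optional verse range (for example 4–5) is an assumption; it may not occur in
any actual file name.

```python
import re
from typing import NamedTuple


class VerseFile(NamedTuple):
    lang: str      # ISO 639-3 code, e.g. "krl"
    book: str      # three-letter book abbreviation, e.g. "LUK"
    chapter: int
    verses: str    # "1", or a range such as "4–5" (assumed to be possible)
    ext: str       # "txt" or "wav"


# The verse part accepts an optional range written with an en dash or hyphen.
_NAME = re.compile(
    r"^xmas-gospel-(?P<lang>[a-z]{3})-(?P<book>[A-Z0-9]{3})"
    r"\.(?P<chapter>\d+)\.(?P<verses>\d+(?:[–-]\d+)?)\.(?P<ext>txt|wav)$"
)


def parse_name(name: str) -> VerseFile:
    """Split a file name such as xmas-gospel-krl-LUK.2.1.txt into its parts."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return VerseFile(
        m["lang"], m["book"], int(m["chapter"]), m["verses"], m["ext"]
    )
```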

Inside the .vrt files, the link elements allow sentences to be linked in
parallel between languages. The link element «<link id=":LUK.2.1:">» links
sentence elements that have the same «id» attribute in each language. The
link element «<link id=":LUK.2.4–5:">», in contrast, links the Komi-Zyrian
sentence element «<sentence id=":LUK.2.4–5:">» to the sentence elements
«<sentence id=":LUK.2.4:">» and «<sentence id=":LUK.2.5:">» of the other
languages.
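
One way to resolve such a merged id is to expand it into the single-verse
ids it covers. The sketch below is only an illustration; the function name
expand_id is hypothetical, and it assumes that a range never crosses a
chapter boundary.

```python
import re
from typing import List

# ":BOOK.chapter.verse:" with an optional "–last" (en dash or hyphen) range.
_ID = re.compile(r"^:([A-Z0-9]{3})\.(\d+)\.(\d+)(?:[–-](\d+))?:$")


def expand_id(link_id: str) -> List[str]:
    """Expand ":LUK.2.4–5:" into the single-verse ids it spans."""
    m = _ID.match(link_id)
    if m is None:
        raise ValueError(f"unexpected id: {link_id!r}")
    book, chapter, first, last = m.groups()
    last = last or first  # a plain id covers exactly one verse
    return [
        f":{book}.{chapter}.{verse}:"
        for verse in range(int(first), int(last) + 1)
    ]
```

With this, the Komi-Zyrian sentence «:LUK.2.4–5:» can be matched against the
two sentences «:LUK.2.4:» and «:LUK.2.5:» in each of the other languages.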

The names of the .wav files correspond to «<utterance>» element ids and can
be linked directly to the sentence elements with the same ids in the .vrt files.
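
A rough sketch of that lookup is given below. It assumes that the id is the
verse reference from the file name wrapped in colons, and that the .vrt files
follow the usual VRT layout with one token per line and the word form in the
first tab-separated column. Both function names are hypothetical.

```python
import re
from typing import List


def sentence_id_from_wav(wav_name: str) -> str:
    """Map "xmas-gospel-krl-LUK.2.1.wav" to the assumed id ":LUK.2.1:"."""
    m = re.match(r"^xmas-gospel-[a-z]{3}-(.+)\.wav$", wav_name)
    if m is None:
        raise ValueError(f"unexpected file name: {wav_name!r}")
    return f":{m.group(1)}:"


def sentence_tokens(vrt_text: str, sentence_id: str) -> List[str]:
    """Collect the word forms of the sentence element with the given id."""
    tokens = []
    inside = False
    for line in vrt_text.splitlines():
        if line.startswith("<sentence "):
            inside = f'id="{sentence_id}"' in line
        elif line.startswith("</sentence>"):
            inside = False
        elif inside and line and not line.startswith("<"):
            # Token line: keep only the first column (the word form).
            tokens.append(line.split("\t")[0])
    return tokens
```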

This directory contains the following subdirectories:
     /vrt_with_links/
     /xmas-gospel-kpv/
     /xmas-gospel-krl/
     /xmas-gospel-myv/
     /xmas-gospel-olo/
     /misc/
     
The subdirectory /vrt_with_links/ contains linked and morpho-syntactically annotated VRT files:
    myv-2006_LUK.2.1–20_with_links.vrt (Erzya)
    kpv-2008_LUK.2.1–20_with_links.vrt (Komi-Zyrian)
    krl-2011_LUK.2.1–20_with_links.vrt (Karelian)
    olo-2003_LUK.2.1–20_with_links.vrt (Olonets-Karelian)

The subdirectory /xmas-gospel-kpv/ contains two sub-subdirectories:

    /kpv_txt/
    /kpv_wav/

The subdirectory /xmas-gospel-krl/ contains two sub-subdirectories:

    /krl_txt/
    /krl_wav/

The subdirectory /xmas-gospel-myv/ contains two sub-subdirectories:

    /myv_txt/
    /myv_wav/

The subdirectory /xmas-gospel-olo/ contains two sub-subdirectories:

    /olo_txt/
    /olo_wav/

These directories contain .txt and .wav files. Each .txt file contains plain
text on a single line, and the .wav files have been generated using TTS.
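
Given this layout, each verse text can be paired with its audio file roughly
as follows. This is a sketch: the function name verse_pairs is hypothetical,
and it assumes that a .txt file and its .wav file share the same base name.

```python
from pathlib import Path
from typing import Iterator, Tuple

LANGS = ("kpv", "krl", "myv", "olo")


def verse_pairs(root: Path, lang: str) -> Iterator[Tuple[Path, Path]]:
    """Yield (txt, wav) paths for every verse that has both files."""
    txt_dir = root / f"xmas-gospel-{lang}" / f"{lang}_txt"
    wav_dir = root / f"xmas-gospel-{lang}" / f"{lang}_wav"
    for txt in sorted(txt_dir.glob("*.txt")):
        # Same base name, different extension (assumed).
        wav = wav_dir / (txt.stem + ".wav")
        if wav.exists():
            yield txt, wav
```

For example, looping over LANGS and calling verse_pairs(Path("."), lang) from
inside the xmas-gospel-tts-src/ directory would visit every language in turn.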

The subdirectory /misc/ contains a copy of the script used for producing
the .wav files:
    xmas_gospel_tts.ipynb

