Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 2.3: Report on Licensing interpretation sessions
Date of reporting: 22-02-2022
Report author: Jack Rueter (University of Helsinki)
Contributors: Varpu Vehomäki, Mietta Lennes (University of Helsinki)
The European Parliament makes essentially two sets of audio materials available online: the debates and videos associated with the sittings, on the one hand, and a smaller set of «News in Brief» podcasts, on the other. The latter set consists of thematically consistent recordings without transcriptions, each one to nearly four minutes in length. The former consists of original-language audios and their transcriptions, together with the correlating audios of all the interpretations.
We have selected the former set as the target of our investigation, and have found that acquisition of the desired materials might be carried out semi-automatically. The timestamps for original-language audios and their transcriptions can be harvested from the European Parliament | Plenary sitting | Debates and videos document directory, but access to the audios themselves is more problematic.
Original audios and their interpretations are readily accessible online, but their acquisition does not parallel the direct download offered for the News-in-Brief podcasts. Instead, a request with timestamps and a language choice must be made for each audio. Thus, in order to acquire both an original English audio and its correlating Finnish interpretation, two requests must be sent. An optimal audio set might then consist of one request for the recording of an entire sitting as an original-language audio, complemented by a second request for the correlating Finnish interpretation audio sharing the same timestamps.
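The pairing of requests described above can be sketched as follows. This is a minimal illustration only: the AudioRequest structure, its field names, the language labels "original" and "fi", and the example timestamps are all assumptions made for this sketch and do not reflect the actual request interface of the European Parliament site.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AudioRequest:
    """One audio request: a time span plus a language choice (hypothetical structure)."""
    start: datetime
    end: datetime
    language: str  # assumed labels: "original" for floor audio, "fi" for Finnish


def paired_requests(start, end, interpretation_lang="fi"):
    """Build the two requests needed for one span: original audio and its interpretation."""
    if end <= start:
        raise ValueError("end must be later than start")
    original = AudioRequest(start, end, "original")
    interpretation = AudioRequest(start, end, interpretation_lang)
    return original, interpretation


# Illustrative timestamps only; real values would be harvested from the sitting records.
sitting_start = datetime(2023, 1, 17, 9, 0)
sitting_end = datetime(2023, 1, 17, 12, 0)
original_req, finnish_req = paired_requests(sitting_start, sitting_end)
```

Keeping both requests in one pair makes it harder for the original and interpretation audios to drift apart in their timestamps, which the later alignment depends on.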
To investigate the possibility of acquiring larger collections of the datasets, a request has been filed through the citizens’ enquiries contact form for the original-language audios and their correlating interpretation audios in Finnish, Swedish, English and Estonian for the third week of January, 2023. Although our interest lies in the procurement of Finnish interpretation audios for original English audios and transcriptions, this more extensive request is seen to be helpful in establishing procedures for future acquisitions and in testing speech recognition technologies already used and developed in Finland.
It has been ascertained that materials for constructing a demo corpus can be acquired using semi-automated means. Such a corpus would contain aligned audios and transcriptions of original-language speech, along with correlating interpretation audios and their transcriptions derived through speech recognition. To this end, the alignment of English originals with Finnish interpretation audios and their speech-recognized transcriptions from a single sitting might serve as an illustrative example on which to build later.
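One possible way to pair original-language segments with interpretation segments is by timestamp overlap, sketched below. The report does not specify an alignment method, so this is an assumption for illustration: the Segment structure, the placeholder texts and the overlap heuristic are all hypothetical, and a real alignment would also need to account for the delay with which interpreters follow the speaker.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timed stretch of speech with its transcription (hypothetical structure)."""
    start: float  # seconds from the start of the sitting
    end: float
    text: str


def overlap(a, b):
    """Return the number of seconds the two segments share (0.0 if none)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def align(originals, interpretations):
    """Pair each original segment with the interpretation segment it overlaps most."""
    pairs = []
    for orig in originals:
        best = max(interpretations, key=lambda seg: overlap(orig, seg), default=None)
        if best is not None and overlap(orig, best) > 0:
            pairs.append((orig, best))
        else:
            pairs.append((orig, None))  # no interpretation overlaps this segment
    return pairs


# Placeholder segments; the interpretation lags the original by two seconds here.
originals = [Segment(0, 10, "original segment 1"), Segment(10, 20, "original segment 2")]
interpretations = [Segment(2, 12, "interpretation 1"), Segment(12, 22, "interpretation 2")]
pairs = align(originals, interpretations)
```

Unmatched originals are kept with None rather than dropped, so gaps in the interpretation remain visible in the resulting corpus.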