
D2.3.1: Remote access to text data repositories

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.3: Report on Remote access to text data repositories
Date of reporting: 30-09-2025

Report authors: Tommi Jauhiainen (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024071601 and urn:nbn:fi:lb-2025081401 

Description

In this work package, we aimed to provide infrastructure for translation and interpretation research, both in machine translation and in translation studies, by enhancing our access to remote text data repositories. During the project, we focused on improving our access to three significant external sources of text data: the Parliament of Finland, the Finnish Broadcasting Company (Yle), and the institutional repositories managed by Finnish universities.

With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API and published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. The Korp version of the resource is currently being prepared in the resource publishing pipeline of the Language Bank of Finland (LBF). For future updates of this resource, we plan to collaborate with the Parlamenttisampo project and jointly maintain the software components used to extract and parse the dataset provided by the API.

Similarly, we published a new source version of the Yle Finnish News Archive, covering the years 2022-2024: urn:nbn:fi:lb-2025081401. We have worked on streamlining the publishing pipeline for regularly updated resources, which include both the Parliament and Yle datasets. Preliminary investigations indicate that the best throughput will be achieved by creating a customized pipeline for each resource, with checklists tailored to make creating and publishing new versions as easy as possible.

We have also created a semi-automated system that can be used to harvest all PDF-formatted publications from the institutional repositories managed by Finnish universities. Automated harvesting was made possible by the widespread use of DSpace software as the backend of these repositories. We are further developing automated methods to determine which types of language resources can be published based on this collection. The licenses under which the texts have been published vary considerably, and we aim to publish them as openly as possible.
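This report does not describe how the harvesting system is implemented. As an illustration only, the following minimal sketch assumes that a repository exposes the OAI-PMH interface that DSpace provides, and that its Dublin Core records carry dc:format and dc:rights values; both are assumptions, and the record identifiers and sample response below are hypothetical. The sketch parses one ListRecords response, keeps the records that declare a PDF format, and returns the resumption token needed to request the next page.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response; a real harvester would fetch this over HTTP.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:123/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:format>application/pdf</dc:format>
          <dc:rights>CC BY 4.0</dc:rights>
          <dc:language>fi</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:repo.example.org:123/2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:format>text/html</dc:format>
          <dc:language>sv</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()


def parse_list_records(xml_text):
    """Return (records, resumption_token) for one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS),
            "formats": [e.text for e in record.iterfind(".//dc:format", NS)],
            "rights": [e.text for e in record.iterfind(".//dc:rights", NS)],
            "languages": [e.text for e in record.iterfind(".//dc:language", NS)],
        })
    # An empty or missing token means there are no further pages.
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


def pdf_records(records):
    """Keep only the records that declare a PDF format."""
    return [r for r in records if "application/pdf" in r["formats"]]


if __name__ == "__main__":
    records, token = parse_list_records(SAMPLE)
    for rec in pdf_records(records):
        print(rec["identifier"], rec["rights"], rec["languages"])
    print("next page token:", token)
```

Keeping the rights statements alongside each record reflects the point above that licenses vary considerably: a harvester along these lines could later group publications by license before deciding how openly each group can be published.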
 
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

 
