D3.1.2: Workflow automation and version syncing

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.1: Report on Comprehensive data versioning
Date of reporting: 22-09-2025

Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Keywords for the deliverable page: versioning, updates, differences

Description

The versioning mechanism has been tested thoroughly with a daily update schedule. This is far more frequent than necessary, since the data set changes relatively rarely and a monthly update schedule is envisaged. We have added improvements to better serve the Elastic Search use case, to make it easier to track the provenance of the dataset and to improve the reliability of snapshot creation. Below we describe in more detail how the dataset serves the selected use cases.

Using the data set as a source for newer versions of the KLK dataset in Kielipankki.

To create ”The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1] (”KLK” for short) from this data set[2], the original Python scripts[3] need to be changed. At present they operate on directories extracted from zip files obtained directly from the National Library of Finland (NLF). We decided not to use these files directly for two reasons:

The files are in an internal format of the National Library and contain data which is not available publicly via the API of NLF, in this case the TIFF archive versions of scanned newspapers.
The TIFF files are very large and would significantly impact download times and storage requirements.

Contrary to the original plan, we decided not to create a working proof of concept, but to explain below the steps needed to adapt the present scripts to the new format. One major change is to operate on the zip files instead of a POSIX file structure. Especially on HPC filesystems like Lustre, working on zip files is much more efficient than extracting the small files contained in them. Concretely, Python’s zipfile module[4] can be used to search for METS files within the downloaded zip files in /scratch/project_2006633/nlf-harvester/zip on Puhti. The METS files of a specific binding are contained in the ”mets” directory of that binding. The corresponding OCR data can then be found in the ”alto” directory on the same level.

The example of binding 19712 below illustrates how finding METS files (in the ”mets” directory) leads to the respective OCR data (in the ”alto” directory on the same level as the ”mets” directory).

1/19/197/1971/19712/19712/mets/19712_METS.xml
1/19/197/1971/19712/19712/alto/00001.xml
1/19/197/1971/19712/19712/alto/00002.xml
…
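
The lookup described above can be sketched with Python’s zipfile module as follows. This is a minimal sketch, not the adapted KLK scripts: the function names pair_mets_with_alto and iter_bindings are hypothetical, and the code assumes that every binding has exactly one METS file and that all members follow the ”mets”/”alto” layout shown in the listing.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def pair_mets_with_alto(member_names):
    """Map each METS member name to the sorted ALTO member names of the same binding.

    Assumption: one METS file per binding, laid out as <binding>/mets/*.xml
    with the OCR pages in <binding>/alto/*.xml.
    """
    mets_by_binding = {}
    alto_by_binding = defaultdict(list)
    for name in member_names:
        path = PurePosixPath(name)
        if path.suffix != ".xml":
            continue  # skips directory entries and non-XML members
        binding_root = str(path.parent.parent)
        if path.parent.name == "mets":
            mets_by_binding[binding_root] = name
        elif path.parent.name == "alto":
            alto_by_binding[binding_root].append(name)
    return {
        mets: sorted(alto_by_binding[root])
        for root, mets in mets_by_binding.items()
    }


def iter_bindings(zip_path):
    """Yield (METS bytes, list of ALTO page bytes) per binding, read straight from the zip."""
    with zipfile.ZipFile(zip_path) as archive:
        pairs = pair_mets_with_alto(archive.namelist())
        for mets_name, alto_names in pairs.items():
            mets_xml = archive.read(mets_name)
            alto_pages = [archive.read(name) for name in alto_names]
            yield mets_xml, alto_pages
```

Because the members are read directly from the archive, no small files are written to the Lustre filesystem. A script would call iter_bindings once per zip file in the download directory; how the zip files are named there is not covered here.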

A minor issue was observed: before using the dataset for the next version of ”KLK”, we need to request that a collection of periodicals (marked ”aikakausi”) be added to the dataset; at present we only download newspapers (marked ”sanomalehti”).

Using the dataset as a basis for an Elastic Search instance containing NLF data

Another use case for the data is the Elastic Search based tool developed in the previous FIN-CLARIAH development round in WP4.3[5]. In that use case the NLF data is converted to JSON suitable as input for an Elastic Search engine, and it is important to keep the engine in sync with changes in the data set. While we already provide versions, comparing these versions is resource intensive. To make comparison easier, we introduced a ”log” directory (/scratch/project_2006633/nlf-harvester/log/) containing listings of the additions and deletions performed during each synchronisation, as well as general information about snapshot runs. We also made it easy to refer to a specific version of the dataset by tagging it with the hash used in the restic backup. Since the changes from one version to another can be large (e.g. if NLF publishes a new version of the OCR’d scans), the resources on HPC login nodes are not sufficient to generate snapshots using restic. For that reason restic is now run as an HPC job on a compute node with adequate resources.
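
As an illustration of how a consumer such as the Elastic Search indexer might use such listings, the sketch below groups added and deleted paths by binding, so that only the affected bindings need to be re-checked. The format of the files in the ”log” directory is not described in this report: the sketch assumes, hypothetically, that a listing is plain text with one path per line in the same ”mets”/”alto” layout as inside the zip files. The function names binding_id and summarise_changes are likewise illustrative only.

```python
from collections import Counter
from pathlib import PurePosixPath


def binding_id(member_path):
    """Return the binding directory name for a path like .../19712/alto/00001.xml, else None."""
    path = PurePosixPath(member_path.strip())
    if path.parent.name in ("mets", "alto"):
        return path.parent.parent.name
    return None


def summarise_changes(added_paths, deleted_paths):
    """Map each affected binding to a tuple (number of added files, number of deleted files).

    Assumption: both arguments are iterables of path strings, for example the
    lines of an additions listing and of a deletions listing.
    """
    added = Counter(filter(None, map(binding_id, added_paths)))
    deleted = Counter(filter(None, map(binding_id, deleted_paths)))
    return {
        binding: (added[binding], deleted[binding])
        for binding in sorted(added.keys() | deleted.keys())
    }
```

The result only says which bindings were touched and by how much; whether a binding is re-indexed or dropped from the index is left to the consumer, since that depends on what remains in the data set.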

Summary and Outlook

The goal of this work package was to create a consistent download framework for publicly available newspaper data from the NLF. To achieve this we used Apache Airflow for task automation and Restic for versioning. It turned out that Apache Airflow is not designed to handle a large number of potentially long-running tasks at once, so we had to find compromises to reduce the number of tasks.

We ran the download pipeline on a daily basis for a few weeks without issue and are now confident that Airflow can be run on a monthly basis to update the dataset. Restic turned out to be a reliable tool for versioning. The versioning to Allas makes it possible to free space on Puhti in case the data set is not in active use after the end of the project. It also makes it possible to stage the data set to other environments, like personal laptops or the LUMI supercomputer. Long-term funding for keeping the data on Allas still needs to be worked out.

References

[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401

[2] See the Harvester documentation for details.

[3] https://github.com/CSCfi/Kielipankki-utilities/tree/master/corp/klk-alto

[4] Introduction to the Python zipfile module: https://realpython.com/python-zipfile/

[5] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
