
FIN-CLARIAH D3.1.2: Ingestion framework

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.1: Report on Ingestion framework
Date of reporting: 2023-02

Report authors: Johanna Lilja (National Library of Finland), Tuula Pääkkönen (National Library of Finland)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Description

A basic concept for downloading the data is in place. Apache Airflow has been chosen as the workflow management technology; it was evaluated and found fit for purpose. A script has been created that downloads the METS XML files and then the ALTO XML files via Airflow. A CSC project with the necessary quota has been created. Download of the dataset (METS, ALTO) started in January 2023. Two areas for improvement have been identified: download speed, and the METS file paths, which need post-processing. The next steps have been agreed between NLF and CSC, and the fruitful collaboration continues.
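The post-processing of METS file paths mentioned above could look roughly like the following Python sketch. It is an illustration only, not the code in the deliverable repository. The sample METS fragment, the `USE="alto"` group label, the `file://./` path form, and the function names `normalize_path` and `alto_paths` are all assumptions made for the example. Only the METS and XLink namespaces come from the published standards.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical METS fragment; real NLF files may be structured differently.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="alto">
      <mets:file ID="ALTO00001">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./alto/00001.xml"/>
      </mets:file>
      <mets:file ID="ALTO00002">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./alto/00002.xml"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="images">
      <mets:file ID="IMG00001">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./images/00001.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def normalize_path(href):
    """Strip an assumed 'file://' scheme and a leading './' from a path."""
    prefix = "file://"
    if href.startswith(prefix):
        href = href[len(prefix):]
    if href.startswith("./"):
        href = href[2:]
    return href


def alto_paths(mets_xml):
    """Return normalized relative paths of the ALTO files listed in a METS document."""
    root = ET.fromstring(mets_xml)
    paths = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        # Assumption: ALTO files sit in a file group labelled USE="alto".
        if (group.get("USE") or "").lower() != "alto":
            continue
        for flocat in group.iter(f"{{{METS_NS}}}FLocat"):
            href = flocat.get(f"{{{XLINK_NS}}}href")
            if href:
                paths.append(normalize_path(href))
    return paths


if __name__ == "__main__":
    for path in alto_paths(SAMPLE_METS):
        print(path)
```

In a workflow of this kind, each normalized path would presumably be joined to a download base URL before the ALTO files are fetched; that base URL is not specified here.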

More information

FIN-CLARIAH WP3.1 presentation from DARIAH-FI workshop on November 9th, 2022.
