Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Automated metadata of archival data from NAF
Date of reporting: 04-06-2025
Report authors: Venla Poso (JYU), Ida Toivanen (JYU)
Contributors: Antero Holmila (JYU), Venla Poso (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Ilkka Jokipii (NAF)
Deliverable locations:
The National Archives of Finland (NAF) has been digitising its material at an increasing pace. For example, in 2019 it began piloting a mass digitisation project that aims to digitise over 135 kilometres of archival data. The aim of deliverable D3.3.1 was to develop machine learning methods for generating metadata, such as document type and journal number, from OCR-scanned archival materials to facilitate their analysis and information extraction. The goal has been to generate metadata that makes large and varied data collections within the archives more usable. The development process has included creating deep learning (DL) models for named entity recognition (work started in 2022–2023) and for document type classification (2024–2025).
The research on archival data started with developing named-entity recognition (for example, of journal numbers) for state authority archives. This involved (1) publishing annotation guidelines to aid the annotation process and recognize the properties of archival data [1], and (2) DL modelling based on annotated archival data [2,3]. In addition to publishing a DL model trained with the annotated data [3], we evaluated an archival text model against a Finnish text model to determine how strongly noise affects real-life cases and the actual workings of the models [2].
Developing document type classification for noisy and diverse archival data has involved collecting and annotating a new benchmark dataset from openly available archival data (to be published) and evaluating different DL model architectures for the task. As a result, we released an image-based model that classifies scanned documents into seven categories: cover page, card index, map, picture, running text, table or form, and newspaper (https://huggingface.co/jyu-digihum/findoctype). Our future work will entail adding a multimodal dimension to the current framework.
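As an illustration of the final classification step only, the following minimal Python sketch shows how seven raw class scores for one scanned page could be turned into a category label and a probability. It is not the released model's code: the function names, the order of the categories, and the example scores are assumptions made for this sketch, and only the seven category names come from the report.

```python
import math

# Category names are taken from the report; their ORDER here is an assumption
# and may not match the label mapping of the released model.
DOCUMENT_TYPES = [
    "cover page",
    "card index",
    "map",
    "picture",
    "running text",
    "table or form",
    "newspaper",
]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_document_type(scores):
    """Return the most probable category and its probability."""
    if len(scores) != len(DOCUMENT_TYPES):
        raise ValueError("expected one score per document type")
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return DOCUMENT_TYPES[best], probabilities[best]


# Hypothetical scores for one scanned page (not real model output).
example_scores = [0.1, -1.2, 0.3, 0.0, 2.5, 1.1, -0.4]
label, confidence = predict_document_type(example_scores)
print(label, round(confidence, 3))
```

In practice the probability could be used to flag low-confidence pages for manual review, although the report does not state whether such a threshold is applied.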
Development has been conducted in cooperation with the National Archives of Finland.
[1] Poso, V., Välisalo, T., Toivanen, I., Lipsanen, M., Kukkohovi, L., Kytöaho, R., Palander, S., Pohjola, M., Laitinen, V., Föhr, A., Abdelamir, A., & Niemi, J. (2025). NER annotation guidelines for archival data. University of Jyväskylä. https://urn.fi/URN:NBN:fi:jyu-202501291584
[2] Toivanen, I., Poso, V., Lipsanen, M., & Välisalo, T. (2025). Developing named-entity recognition for state authority archives. In O. Holownia, & E. S. Sigurðarson (Eds.), DHNB2024 Conference Post-Proceedings (7). University of Oslo Library. Digital Humanities in the Nordic and Baltic Countries Publications. https://doi.org/10.5617/dhnbpub.12262
[3] Poso, V., Lipsanen, M., Toivanen, I., & Välisalo, T. (2024). Making Sense of Bureaucratic Documents: Named Entity Recognition for State Authority Archives. In Archiving 2024 Final Program and Proceedings (pp. 6-10). Society for Imaging Science & Technology. Archiving, 21. https://doi.org/10.2352/issn.2168-3204.2024.21.1.2
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.