
D3.2.2: Annotation & analysis tools for NARC data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.2: Report on annotation & analysis tools for NARC data
Date of reporting: 08-11-2023

Report authors: Venla Poso (University of Jyväskylä), Ida Toivanen (University of Jyväskylä), Tanja Välisalo (University of Jyväskylä), Antero Holmila (University of Jyväskylä)

Deliverable location: Released soon.

Description

Named entity recognition (NER) model for state authority archival data.

The National Archives of Finland started a mass digitisation project in 2019 with the aim of digitising over 135 kilometres of archival data. We identified a need for an advanced method for extracting information from unstructured and noisy text, which would make the data more accessible and could enable innovative uses of it in the research sector. The process comprised two end-user questionnaires, the creation of annotation guidelines, manual annotation, inter-annotator agreement testing and model development.

This process resulted in a NER model that identifies ten entity categories (person, organisation, date, location, geopolitical location, nationalities/religious and political groups, event, product, journal number and Finnish business identity code). Journal number and Finnish business identity code are newly established entity categories derived from the responses to the two questionnaires, whereas the others rely on existing NER models. The model obtains comparable results on non-OCR’d data while significantly improving named entity recognition results on OCR’d state authority archival data.

Development was conducted in cooperation with the National Archives of Finland and their DALAI project.

Links

Version 0.1: https://huggingface.co/Kansallisarkisto/finbert-ner
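As a rough illustration of how the released model might be used, the sketch below loads it through the Hugging Face `transformers` token-classification pipeline and groups the predicted entities by category. The model identifier is taken from the link above. The helper names (`load_tagger`, `group_entities`), the score threshold and the example sentence are assumptions made for this sketch, not part of the deliverable, and the label strings the model actually emits should be checked against its model card.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Model identifier taken from the Version 0.1 link.
MODEL_ID = "Kansallisarkisto/finbert-ner"


def load_tagger(model_id: str = MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so that group_entities() works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def group_entities(
    predictions: Iterable[Mapping], min_score: float = 0.0
) -> Dict[str, List[str]]:
    """Group aggregated pipeline predictions by entity category.

    Each prediction is expected to carry the keys produced by an aggregated
    token-classification pipeline: "entity_group", "score" and "word".
    Predictions scoring below min_score (a hypothetical cut-off) are dropped.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for pred in predictions:
        if float(pred["score"]) >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Example use (requires transformers, a backend such as PyTorch, and network access):
#   tagger = load_tagger()
#   text = "Urho Kekkonen vieraili Helsingissä 5.3.1960."
#   print(group_entities(tagger(text), min_score=0.5))
```

Grouping by category keeps the output close to how the ten entity categories are described above. For noisy OCR'd text, a higher threshold trades recall for precision.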

Publications

Poso, Venla, Tanja Välisalo, Ida Toivanen, Antero Holmila, and Jari Ojala. 2023. “Untapped Data Resources. Applying NER for Historical Archival Records of State Authorities”. Digital Humanities in the Nordic and Baltic Countries Publications 5 (1). Oslo, Norway: 55-69. DOI: 10.5617/dhnbpub.10650
