D4.2.2: Parliament of Finland Ontology

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.2: Report on Parliament of Finland Ontology
Date of reporting: 2023-11-09

Report author: Eero Hyvönen (Aalto University)
Contributors: Eero Hyvönen (PI), Laura Sinikallio, Petri Leskinen, Senka Drobac, Jouni Tuominen, Matti La Mela, Mikko Koho, Esko Ikkala, Minna Tamper, Rafael Leal, Heikki Rantala

Deliverable locations:

The Parliament of Finland Ontology, with populated data and data services, was published on Feb 14, 2023, under the CC BY 4.0 license on the following platforms:

  1. Automatically daily updating CSV files of speeches in CSC’s Allas data service: https://a3s.fi/parliamentsampo/speeches/csv/index.html
  2. XML versions of speeches in Parla-CLARIN format (updated less frequently): https://a3s.fi/parliamentsampo/speeches/xml/index.html
  3. A ParlaMint-format sub-corpus of the speeches, published on 9.11.2023 in the repository of the pan-European ParlaMint II project: https://nl.ijs.si/et/tmp/ParlaMint/Repo/?C=M;O=D
  4. A CSV file of the Members of the Parliament and other speakers in CSC’s Allas data service: https://a3s.fi/parliamentsampo/actors/csv/index.html
  5. Speech data combined with the Parliament’s ontology, actors and other entities as linked data in RDF Turtle format, available 1) on the Linked Data Finland platform’s dataset page https://www.ldf.fi/dataset/semparl and 2) as a data dump on the Zenodo.org data service: https://doi.org/10.5281/zenodo.7636420
  6. Semantic portal Parlamenttisampo.fi
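
As a minimal sketch of how the CSV deliverables listed above might be consumed, the following Python snippet parses speech rows with the standard library and counts speeches per speaker. The column names (`speaker`, `date`, `content`) and the sample rows are assumptions made for illustration only; the header of the actual files should be checked before adapting this.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for one downloaded speeches CSV file.
# The real column names may differ; inspect the file header first.
SAMPLE_CSV = """speaker,date,content
Speaker A,2023-02-14,First example speech.
Speaker B,2023-02-14,Second example speech.
Speaker A,2023-02-15,Third example speech.
"""


def count_speeches_by_speaker(csv_text, speaker_column="speaker"):
    """Return a Counter mapping each speaker to the number of speech rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[speaker_column] for row in reader)


counts = count_speeches_by_speaker(SAMPLE_CSV)
for speaker, n in counts.most_common():
    print(speaker, n)
```

For a real file, the text of the downloaded CSV would be passed in place of `SAMPLE_CSV`, and `speaker_column` would be set to whatever the file's header actually uses.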

Software tools

New declarative version of Sampo-UI framework used in the ParliamentSampo portal:
https://github.com/SemanticComputing/sampo-ui

Learning materials

Open online video lecture course:
https://seco.cs.aalto.fi/teaching/sw-introduction/index.html

Tutorial materials and video on using the Sampo-UI framework for portal development:
https://seco.cs.aalto.fi/tools/sampo-ui/

Events organized

2023

2022

Publications

2023

Senka Drobac, Laura Sinikallio and Eero Hyvönen: An OCR Pipeline for Transforming Parliamentary Debates into Linked Data: Case ParliamentSampo – Parliament of Finland on the Semantic Web. Digital Humanities in the Nordic and Baltic Countries Publication, DHNB2023 Conference Proceedings, vol. 5, no. 1, University of Oslo Library, Norway, 2023.

Eero Hyvönen: Parlamenttisampo avaa eduskunnan miljoona puhetta ja kansanedustajien verkostot kaikkien tutkittaviksi. Tieteessä tapahtuu, vol. 41, no. 1, Tieteellisten seurain valtuuskunta (TSV), 2023.

Eero Hyvönen, Petri Leskinen and Jouni Tuominen: A Data-driven Approach to Create an Ontology of Parliamentary Work: Case Parliament of Finland on the Semantic Web. Proceedings of SWODCH 2023. Semantic Web and Ontology Design for Cultural Heritage. Co-located with the 22nd International Semantic Web Conference (ISWC 2023) in Athens, Greece, CEUR Workshop Proceedings, Vol-3540, November, 2023.

Eero Hyvönen, Laura Sinikallio, Petri Leskinen, Senka Drobac, Rafael Leal, Matti La Mela, Jouni Tuominen, Henna Poikkimäki and Heikki Rantala: Plenary Speeches of the Parliament of Finland as Linked Open Data and Data Services. Joint Proceedings of the Second International Workshop on Knowledge Graph Generation From Text and the First International BiKE Challenge co-located with 20th Extended Semantic Conference (ESWC 2023), pp. 1-20, CEUR Workshop Proceedings, Vol. 3447, August, 2023.

Henna Poikkimäki, Petri Leskinen and Eero Hyvönen: Applying Network and Bibliometric Analyses to Mentions of Politicians in Plenary Speeches: Case ParliamentSampo – Parliament of Finland on the Semantic Web. August, 2023. Submitted for evaluation.

Eero Hyvönen, Petri Leskinen and Heikki Rantala: Integrating Faceted Search with Data Analytic Tools in the User Interface of ParliamentSampo – Parliament of Finland on the Semantic Web. Proceedings of ESWC 2023, poster and demo papers, Springer-Verlag, June, 2023.

Eero Hyvönen: Creating and Using a National Linked Open Data Infrastructure for Cultural Heritage Applications and Digital Humanities Research: Lessons Learned. DARIAH Annual Event 2023, abstracts of papers, DARIAH-EU, June, 2023.

Eero Hyvönen: Creating and Using a Linked Open Ontology and Data Infrastructure for Digital Humanities in Finland: Lessons Learned 2003-2023. June, 2023. Under review.

Minna Tamper, Laura Sinikallio, Jouni Tuominen and Eero Hyvönen: Transforming Linguistically Annotated Finnish Parliamentary Debates Into the Parla-CLARIN Format. Digital Humanities in the Nordic and Baltic Countries Seventh Conference (DHNB 2023), Book of Abstracts (Sofie Gilbert and Annika Rockenberger (eds.)), pp. 118, University of Oslo Library, Oslo, Norway, March, 2023.

Eero Hyvönen: How to Create a National Cross-domain Ontology and Linked Data Infrastructure and Use It on the Semantic Web. Programming and Data Infrastructure in Digital Humanities, Book of Abstracts, pp. 7, High Performance Computing Centre, University of Évora, Portugal, March, 2023.

Eero Hyvönen, Petri Leskinen, Laura Sinikallio, Senka Drobac, Rafael Leal, Matti La Mela, Jouni Tuominen, Henna Poikkimäki and Heikki Rantala: ParliamentSampo Infrastructure for Publishing the Plenary Speeches and Networks of Politicians of the Parliament of Finland as Open Data Services. Aalto University, Dept. of Computer Science, February, 2023. Paper published at the publication event of the ParliamentSampo data service and portal.

Eero Hyvönen: How to Create a National Cross-domain Ontology and Linked Data Infrastructure and Use It on the Semantic Web. Semantic Web – Interoperability, Usability, Applicability, IOS Press, 2023. Forthcoming.

Eero Hyvönen: Digital Humanities on the Semantic Web: Sampo Model and Portal Series. Semantic Web – Interoperability, Usability, Applicability, vol. 14, no. 4, pp. 729-744, IOS Press, 2023.

2022

Henna Poikkimäki, Petri Leskinen, Minna Tamper and Eero Hyvönen: Analyses of Networks of Politicians Based on Linked Data: Case ParliamentSampo – Parliament of Finland on the Semantic Web. Semantic Web and Ontology Design for Cultural Heritage (SWODCH 2022), Turin, Italy, Proceedings, CEUR WS Proceedings, 2022. Accepted.

Eero Hyvönen, Laura Sinikallio, Petri Leskinen, Matti La Mela, Jouni Tuominen, Kimmo Elo, Senka Drobac, Mikko Koho, Esko Ikkala, Minna Tamper, Rafael Leal and Joonas Kesäniemi: Linked Data Approach for Studying Parliamentary Speeches and Networks of Politicians in Finland 1907-2021 (long paper). Digital Humanities 2022, Conference Abstracts, July 25-29, 2022, Online, Tokyo, Japan, University of Tokyo, pp. 254-257, ADHO, July, 2022.

Matti La Mela, Fredrik Norén and Eero Hyvönen (eds.): Proceedings of the Digital Parliamentary Data in Action (DiPaDA 2022) Workshop. CEUR Workshop Proceedings, vol. 3133, May, 2022.

Eero Hyvönen, Laura Sinikallio, Petri Leskinen, Matti La Mela, Jouni Tuominen, Kimmo Elo, Senka Drobac, Mikko Koho, Esko Ikkala, Minna Tamper, Rafael Leal and Joonas Kesäniemi: Finnish Parliament on the Semantic Web: Using ParliamentSampo Data Service and Semantic Portal for Studying Political Culture and Language. Digital Parliamentary data in Action (DiPaDA 2022), Workshop at the 6th Digital Humanities in Nordic and Baltic Countries Conference, long paper, pp. 69-85, CEUR Workshop Proceedings, Vol. 3133, May, 2022.

Minna Tamper, Rafael Leal, Laura Sinikallio, Petri Leskinen, Jouni Tuominen and Eero Hyvönen: Extracting Knowledge from Parliamentary Debates for Studying Political Culture and Language. Proceedings of the 1st International Workshop on Knowledge Graph Generation From Text and the 1st International Workshop on Modular Knowledge co-located with 19th Extended Semantic Conference (ESWC 2022) (Sanju Tiwari, Nandana Mihindukulasooriya, Francesco Osborne, Dimitris Kontokostas, Jennifer D’Souza and Mayank Kejriwal (eds.)), vol. 3184, pp. 70-79, CEUR WS, May, 2022. International Workshop on Knowledge Graph Generation from Text (TEXT2KG 2022).

Matti La Mela, Fredrik Norén and Eero Hyvönen: Digital Parliamentary Data in Action (DiPaDA 2022): Introduction. Proceedings of the Digital Parliamentary Data in Action (DiPaDA 2022) Workshop, CEUR Workshop Proceedings, Vol. 3133, May, 2022.

Laura Sinikallio: Eduskunnan täysistuntojen pöytäkirjojen muuntaminen semanttiseksi dataksi ja julkaiseminen verkkopalveluna (Transformation of the Debates of the Parliament of Finland into Semantic Data and a Data Service) (in Finnish), University of Helsinki, Department of Computer Science, February, 2022. MSc Thesis.

Esko Ikkala, Eero Hyvönen, Heikki Rantala and Mikko Koho: Sampo-UI: A Full Stack JavaScript Framework for Developing Semantic Portal User Interfaces. Semantic Web – Interoperability, Usability, Applicability, vol. 13, no. 1, pp. 69-84, January, 2022. Online version published in 2021, print version in 2022.

D5.2.2: Educational material

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 5.2: Report on Educational material
Date of reporting: 29-11-2023

Report author: Sanna Kumpulainen (Tampere University)
Contributors: Sanna Kumpulainen, Jaakko Peltonen, Anna Sendra Toset, Elina Late, Farid Alijani (Tampere University)
Deliverable location: Deliverable location will be made public during December 2023 through the DARIAH-FI website

Description

This deliverable is a living document that gathers information on the educational materials relevant to the DARIAH-FI research infrastructure and its resources, such as documentation created to help users work with the different tools, datasets, and workflows, and guidance on which courses might help them use the resources more efficiently. The document also includes an overview of the state of digital humanities and computational social sciences education in Finland, including links to relevant courses and programmes at the bachelor, master, and doctoral levels.

To create this deliverable, we used data from an internal survey on digital humanities and computational social sciences education in Finland conducted among the members of the DARIAH-FI research infrastructure, as well as information provided by the different work packages on their respective resources (e.g., location, related educational materials). Since some educational materials are still under development and have not yet been released, the document is expected to be made public during December 2023 through the DARIAH-FI website.

D5.1.3: Protocol for collecting workshop data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 5.1: Report on Protocol for collecting workshop data
Date of reporting: 29-11-2023

Report author: Sanna Kumpulainen (Tampere University)
Contributors: Sanna Kumpulainen, Jaakko Peltonen, Anna Sendra Toset, Elina Late, Farid Alijani (Tampere University)
Deliverable location: https://doi.org/10.5281/zenodo.10217404

Description

The protocol for collecting workshop data is available at: https://doi.org/10.5281/zenodo.10217404

This document is intended to serve as an initial guide for collecting user experience data from workshops and training sessions related to the resources developed by the FIN-CLARIAH consortium. In this context, the deliverable includes recommendations for designing the study and for setting up the data collection process, as well as information for creating protocols, informed consents, and other similar documents related to collecting user experience data.

To create this document, we used data collected from semi-structured interviews (n=34) with potential end-users of DARIAH-FI conducted between September 2022 and February 2023, as well as information gathered via a selected narrative review of different resources related to research design. This deliverable must be read in conjunction with the specific instructions provided by the institutions where the studies that collect user experience data take place.

Note: An updated version of this document might be released during Q1/2024.

FIN-CLARIAH D4.3.3: Representative Twitter dataset(s) of user-generated texts and metadata

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.3: Report on Representative Twitter dataset(s) of user-generated texts and metadata
Date of reporting: 25-11-2023

Report author: Mikko Laitinen (University of Eastern Finland)
Contributors: Masoud Fatemi, Mehrdad Salimi, Paula Rautionaho (all from the University of Eastern Finland)
Deliverable location: https://nts-csc.rahtiapp.fi/ The URL is currently open for researchers, and we will add authentication to it in the spring of 2024.

Description

The WP’s main objective was to develop a representative dataset of social media data from Twitter from the five Nordic countries. The underlying idea is that social media applications offer a promising and extremely large source of data for a range of disciplines in the social sciences and humanities (SSH) today, but research activities are often hindered by a lack of technical knowledge in collecting, pre-processing and analysing very large datasets. During the funding period, we expanded the data collection substantially when it became clear that the future of this data collection route was increasingly uncertain. All the materials were collected during the period when the academic application programming interface (API) of this social media platform was still open; later, when the company changed its name to X, the API was closed down. In hindsight, the decision to store large amounts of material from various geographic settings turned out to be a wise move, because this subproject has now saved 12.5 years of material for future research.

The project activities so far have consisted of two parts:

  1. Collecting data: Masoud Fatemi has been in charge of the data collection. Our dataset initially focused on the Nordic region, but we decided to expand this considerably when it became clear that the API would be closed after changes in ownership of the platform. In addition to our original data, we expanded the data collection to social network information in the Nordic region, the United States, the United Kingdom, and Australia, together with partners from the Australian Digital Observatory from the Queensland University of Technology and the University of Queensland. Basic information on the datasets is shown in Table 1 below; they range in size from nearly 800 million words to nearly 4 billion words in the US and the Australian networks. The datasets cover slightly different time frames between 2006 and May 2023. A substantial part of the NTS data will be shared via the Language Bank of Finland during 2023–24.

Table 1. Social media datasets collected in 2022–2023 in this subproject

  2. An easy-to-use graphic interface: The second part consisted of designing an easy-to-use graphic interface for accessing the material and for carrying out basic analyses and visualizations of the NTS data. The interface is currently in the piloting phase and can be accessed at https://nts-csc.rahtiapp.fi/. It currently has only partial data, but we aim to add all the NTS data to the CSC by spring 2025. Mehrdad Salimi was hired for this task in June 2022, and his contract runs until May 2024, by which time the interface will be fully functional.

This WP has reached its objectives and succeeded in creating a national niche within the Finnish DH sphere. We have a good team that combines expertise from sociolinguistics and computer sciences, and we are able to develop digital tools for a range of audiences.

For 2024–2025, we aim at continuing the work, and adding a graphic interface for accessing network information and combining this network information with textual searches.

D2.5.2: Analysis and annotation tools for learner performances

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.5: Report on Finnish as a second language learners automated analysis and annotation tools
Date of reporting: 28-11-2023

Report author: Ari Huhta (University of Jyväskylä)
Contributors: Jenny Tarvainen, Ida Toivanen, Sirkku Kronholm, Mika Halttunen (University of Jyväskylä)
Deliverable location: (so far only in Google Drive)

Description

Work in WP2.5 divides into two stages. In 2022, tools used for automated analysis of texts written by native speakers of Finnish were reviewed in collaboration with WP3.2. To investigate how the tools perform with texts written by learners of Finnish as a second language (L2), texts collected in previous projects were used to test certain tools. The texts represented different proficiency levels defined in the Common European Framework of Reference for Languages (CEFR), based on assessments by trained raters.

Testing focused on two promising tools, Finnish Tagtools (Language Bank) and Turku-neural-parser-pipeline (TurkuNPP). Both tools utilize machine learning with pre-trained language models. The tools perform, e.g., segmentation, lemmatization and morphological tagging for Finnish texts. In addition, TurkuNPP provides information about universal dependency relations. The tools were tested with L2 Finnish learners’ texts evaluated at several CEFR levels. Only a few texts could be analysed due to various technical and other reasons. However, it was clear that the tools do not function well on learner performances, with various mistakes often confusing the processing. Typical L2 Finnish characteristics, like mixing back and front vowels (kavelin vs. kävelin), can cause incorrect lemmatization and/or tagging. In some cases, however, the tools are faithful to learner language forms and give the lemma based on the inflected learner language form rather than the targeted Finnish lemma (e.g. lumihannen → lumihansi, not lumihanki). As language learning researchers have started to see learner language as a valuable language variant, this can be seen as a positive characteristic, but useful tools should give both learner language lemmas and targeted lemmas. A poster presentation of these findings was given at the annual conference of the Finnish Association of Applied Linguistics in November 2022.

In the second stage in 2023, a study has been conducted to build models for classifying learner language into CEFR levels and to investigate the resources needed to establish strong deep learning based L2 Finnish research in the future. This will facilitate, e.g., designing automated tools for learner language detection for pedagogical and assessment purposes and contributing to the development of textual models for Finnish. Specifically, the study investigates (1) whether the currently available CEFR-annotated datasets are sufficient for training deep learning models, (2) how the trained models perform with new data, (3) whether pretraining on learner language with a masked language modeling (MLM) objective improves model performance, and (4) whether the model performs equally well across all CEFR levels.

Four CEFR-annotated datasets of written Finnish as a second or foreign language were used: International Corpus of Learner Finnish (ICLFI), The Advanced Finnish Learner’s Corpus (LAS2), and two young learner corpora from the cross-sectional Cefling and the longitudinal Topling projects.

The state-of-the-art Finnish BERT model, FinBERT base, was used and tested against FinBERT large. To inspect the effect of pretraining with the masked language modeling (MLM) objective, models trained with and without pretraining were compared. The models were evaluated with test data extracted from all four datasets. The evaluation metrics include accuracy, F1-score, recall and precision. For model evaluation, an average value over five folds is computed for each evaluation metric. An article based on the study is currently in preparation.
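
The evaluation scheme described above (accuracy, precision, recall and F1, each averaged over folds) can be sketched as follows. This is an illustrative standard-library sketch, not the study's code: macro-averaging over classes is an assumption, since the report does not state which averaging was used, and the CEFR labels in the toy data are invented.

```python
from statistics import mean


def macro_scores(y_true, y_pred):
    """Compute accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
    }


def average_over_folds(fold_results):
    """Average each metric over a list of per-fold result dictionaries."""
    return {key: mean(r[key] for r in fold_results) for key in fold_results[0]}


# Toy example: two folds with invented CEFR labels (the study uses five folds).
fold_a = macro_scores(["A1", "A2", "B1", "B1"], ["A1", "B1", "B1", "B1"])
fold_b = macro_scores(["A1", "A2", "B1"], ["A1", "A2", "B1"])
averaged = average_over_folds([fold_a, fold_b])
print(averaged)
```

Reporting the per-class scores alongside the averages would also address the study's fourth question, whether performance is equal across CEFR levels.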

Events / presentations:

Sirkku Kronholm & Ari Huhta: Automaattisten tekstityökalujen kehittäminen oppijankieliseen aineistoon. Poster presentation. AFinLA autumn symposium. Helsinki. 27.-29.10.2022. https://www.helsinki.fi/assets/drupal/2022-10/AFinLA2022_FINALFINAL_Timetable_A3.pdf

FIN-CLARIAH D4.1.4: R/Python module

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.1: Report on R/Python module
Date of reporting: 22-11-2023

Report author: Julia Matveeva (University of Turku), Leo Lahti (University of Turku)
Contributors: Pyry Kantanen (University of Turku), Akewak Jeba (University of Turku)
Deliverable location: https://github.com/fennicahub/fennica

Description

  1. Python module: We have developed a Python script utilizing Pandas, designed to selectively extract MARC fields from the raw data. This script allows for the extraction of fields individually or in batches, which are then saved in CSV format. The Python module is available at the following URL: https://github.com/fennicahub/fennica/tree/master/inst/examples/field_picking.
  2. R module: The R module, known as the Fennica-R package, functions as an algorithmic toolkit designed explicitly for transparent quantitative analysis of the Finnish national bibliography, Fennica, and its metadata. Initially deployed to harmonize a subset of 70,000 entries, the module has recently been updated to support the analysis of a more extensive dataset, now encompassing 1 million entries, including a subset for the period 1809-1917. The CSV files generated by the Python module feed into further harmonization via the Fennica package.
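
The field-picking idea in item 1 can be illustrated with a short sketch. It is not the project's script: the published module uses Pandas, whereas this sketch uses only the standard library, and the record layout (dictionaries keyed by MARC field and subfield codes such as `245a`), the helper names and the sample values are all assumptions for illustration.

```python
import csv
import io

# Hypothetical records: each maps MARC field/subfield codes to values.
# The layout of the real raw data may differ.
RECORDS = [
    {"id": "1", "100a": "Author One", "245a": "First title", "260c": "1850"},
    {"id": "2", "245a": "Second title", "260c": "1901"},
]


def pick_fields(records, fields, id_key="id"):
    """Keep only the requested fields; missing fields become empty strings."""
    return [
        {id_key: record.get(id_key, ""), **{f: record.get(f, "") for f in fields}}
        for record in records
    ]


def to_csv(rows, fieldnames):
    """Serialise the picked rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Extract one field on its own, then a batch of fields.
single = to_csv(pick_fields(RECORDS, ["245a"]), ["id", "245a"])
batch = to_csv(pick_fields(RECORDS, ["100a", "260c"]), ["id", "100a", "260c"])
print(single)
print(batch)
```

Writing one CSV per field, or one per batch, keeps the downstream harmonization steps independent of each other.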

The Fennica-R package is publicly accessible at https://github.com/fennicahub/fennica. See the package README for an up-to-date link to outputs generated by the package.

FIN-CLARIAH D3.5.2: Text network analysis of political texts

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.5: Report on Text network analysis of political texts
Date of reporting: 22-11-2023

Report author: Kimmo Elo (University of Turku)
Contributors: Kimmo Elo, Veronika Laippala, Otto Tarkka, Pyry Kantanen, Markus Korhonen (all from the University of Turku)

Deliverable location:

  • KWIC-tool: http://finparl-01.utu.fi/apps/KWIC/
  • TNA-tool: http://finparl-01.utu.fi/apps/TNA/

Both URLs will be opened for public use on December 19, 2023.

Description

The WP’s main objective was to develop tools based on network analysis for the analysis of political texts. The following two (2) tools are now available for public use (as beta releases, see below):

  1. KWIC tool for the FinParl corpus: This tool provides a user interface for querying word embeddings with the KWIC (Key Word In Context) method. The tool offers a simple yet intuitive user interface built with R Shiny, with which the user can query key word embeddings of the FinParl corpus of plenary debates of the Finnish parliament (eduskunta) and use the KWIC results to inspect n-grams and to visualise key word embeddings as text networks.
  2. TNA tool for the analysis of speeches of Finnish MPs: This tool will provide functionalities for vocabulary-based content analysis of political speeches. The user selects an MP and can then study 1) a timeline of the MP’s plenary speeches, 2) a word cloud of at most the 500 words most used by the MP, and 3) a speaker-to-concept network consisting of the 50 most frequently used concepts of the selected MP and of his/her most similar colleagues (similarity measured as word-based cosine similarity).
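
Two of the techniques named above, KWIC concordancing and word-based cosine similarity, can be sketched in a few lines of Python. This is a simplified illustration rather than the tools' implementation (which is built with R Shiny): the function names, the context window size and the toy token lists are assumptions, and the embedding component of the KWIC tool is not shown.

```python
import math
from collections import Counter


def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each keyword hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits


def cosine_similarity(words_a, words_b):
    """Cosine similarity between the word-frequency vectors of two speakers."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data invented for illustration.
tokens = "the budget debate continued and the budget passed".split()
hits = kwic(tokens, "budget")
similarity = cosine_similarity(["tax", "tax", "health"], ["tax", "health", "health"])
print(hits)
print(similarity)
```

Ranking all other MPs by this similarity to the selected MP and keeping the top few would give the "most similar colleagues" used in the speaker-to-concept network.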

The dataset these deliverables are based on covers a timespan from 1990 to 2021. The WP 3.5 uses a tailored FinParl corpus consisting of all plenary speeches of the Finnish eduskunta since 1907. The data and apps are located on data servers of the University of Turku.

Overall, WP 3.5 has reached its most central objectives and succeeded in creating a well-functioning, active, multi-disciplinary collaboration network within the University of Turku. This network brings together expertise from social sciences and computational linguistics and is well capable of developing tools for a wide audience. Team dynamics are good, and regular internal meetings are used to discuss current issues, problems, and solutions.

The deliverables to be published in December 2023 should, however, be considered project milestones only. From 2024 onwards, development of both tools will continue within the FIRI consortium project “LAWPOL”. The next milestone is Q2/2024, by which time these tools are expected to be fully integrated into the first release of the LAWPOL Digital Workbench for Political and Legal Studies, which brings together the most important political materials related to Finnish democracy.

FIN-CLARIAH D3.3.2: R package for data concept network

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.3: Report on R package for data concept network
Date of reporting: 2023-11

Report author: Maria Valaste (University of Helsinki)
Contributors: Adeline Clarke (University of Helsinki), Ida Toivanen (University of Jyväskylä), Jani-Matti Tirkkonen (University of Eastern Finland), Jaakko Peltonen (Tampere University)
Deliverable location: Several repositories in Github are published (see below).

Description

The aim of WP3.3 is to enhance the utilization of unstructured qualitative textual data in the context of Finnish surveys with the use of a concept network tool. The purpose of this toolbox is to build a bridge from social science researchers with limited NLP coding experience to the text analytics methods and processes of the computational NLP community that might be useful for understanding the results of their surveys.

This deliverable is an R package that brings together tools for analysing responses to open-ended questions in Finnish-language data. In addition to the analysis tools, the R package contains a sample dataset to familiarise the user with the functions of the package. It builds on the previous deliverable 3.3.1.

Repositories

https://github.com/DARIAH-FI-Survey-Concept-Network (public)
https://github.com/DARIAH-FI-Survey-Concept-Network/finnishsurveytext (will be published)

D3.2.2: Annotation & analysis tools for NARC data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.2: Report on annotation & analysis tools for NARC data
Date of reporting: 08-11-2023

Report authors: Venla Poso (University of Jyväskylä), Ida Toivanen (University of Jyväskylä), Tanja Välisalo (University of Jyväskylä), Antero Holmila (University of Jyväskylä)

Deliverable location: Released soon.

Description

Named entity recognition (NER) model for state authority archival data.

The National Archives of Finland started a mass digitisation project in 2019, which aims to digitise over 135 kilometres of archival data. We identified a need for an advanced method for extracting information from unstructured and noisy text, which will make the data more accessible and potentially generate innovative uses of the data in the research sector. The process included two questionnaires to the end-users, creation of annotation guidelines, manual annotation, inter-annotator agreement testing and model development.

This process resulted in a NER model, which identifies ten different entity categories (person, organisation, date, location, geopolitical location, nationalities/religious and political groups, event, product, journal number and Finnish business identity code). Journal number and Finnish business code are newly established named entities derived from the responses to two questionnaires, as opposed to the others which rely on existing NER models. The model obtains comparable results with non-OCR’d data while significantly improving named entity recognition results when tested with OCR’d state authority archival data.
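
To show how token-level NER output of this kind is typically turned into entity mentions, here is a minimal sketch that groups BIO-tagged tokens into spans. The BIO tagging scheme, the label strings (`PERSON`, `LOC`, `DATE`) and the example sentence are assumptions for illustration; the labels actually emitted by the model should be checked against its model card.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans.

    An I- tag that does not continue an open span of the same type is
    treated like an O tag and dropped.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Invented example sentence with hypothetical labels.
tokens = ["Maija", "Virtanen", "wrote", "to", "Helsinki", "in", "1956"]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-LOC", "O", "B-DATE"]
entities = bio_to_entities(tokens, tags)
print(entities)
```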

Development was conducted in cooperation with the National Archives of Finland and their DALAI project.

Links

Version 0.1: https://huggingface.co/Kansallisarkisto/finbert-ner

Publications

Poso, Venla, Tanja Välisalo, Ida Toivanen, Antero Holmila, and Jari Ojala. 2023. “Untapped Data Resources. Applying NER for Historical Archival Records of State Authorities”. Digital Humanities in the Nordic and Baltic Countries Publications 5 (1). Oslo, Norway: 55-69. DOI: 10.5617/dhnbpub.10650

D2.4.3.3: Initializing terminology collections

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.4: Report on Initializing terminology collection
Date of reporting: 2023-11

Report author: Harri Kettunen (UHEL)
Contributors: Tiina Onikki-Rantajääskö (UHEL)
Deliverable location: The Helsinki Term Bank for the Arts and Sciences – Tieteen termipankki

Description

Since the start of 2023, 613 new concept pages have been created at the Helsinki Term Bank for the Arts and Sciences (HTB) in the following fields: Archaeology, Botany, Classical Studies, Environmental Sciences, Geology, Geophysics, History, Language Technology, Linguistics, Literary Studies, Martial Arts Studies, Media And Communication Studies, Mesoamerican Studies, North American Studies, Nutritional Sciences, Open Science, Philosophy, Religion Studies, Semiotics, and Theology.

Furthermore, updates have been made to the database on 1,806 concept pages in the following fields: Aesthetics, Archaeology, Art History, Astronomy, Behavioral Sciences, Biology, Botany, Classical Studies, Clean Energy Research, Digital Humanities, Educational Sciences, Environmental Sciences, Epidemiology, Folklore, Food Sciences, Geology, Geophysics, Heritage Research, History, Language Technology, Law, Linguistics, Literary Studies, Martial Arts Studies, Media And Communication Studies, Mesoamerican Studies, Microbiology, Mycology, North American Studies, Open Science, Philosophy, Seismology, Semiotics, Språkvetenskap, Study Of Religions, Sustainability Science, Terminologiako Bankos, Terminology, Theology, Translation Studies, Veterinary Medicine, and Zoology.

In addition, the fields of Anthropology, Contaminated Land Studies, Gender Studies, Mathematics, and Urban Studies are working offline until there is a critical mass of terminology to be published at the HTB. Furthermore, terminology work has been agreed upon to be carried out in the following fields: Arctic Research, Asian Studies, Geography, Military Sciences, and Physiology. A multidisciplinary group has also been established for meta scientific terminology of transdisciplinarity.

All in all, 613 new concept pages have been created and 1,806 existing concept pages have been updated. Since the beginning of the year, the volume of the new additions and edits totals 341,175 bytes, or approximately 280,000 characters, which corresponds to ca. 100 A4-size pages. The total number of concept pages as of November 26, 2023, is 45,010.

FIN-CLARIAH D2.3.2: Aligning and retrieving

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.3: Report on Aligning and retrieving
Date of reporting: 13-11-2023

Report author: Jack Rueter, Erik Axelson (University of Helsinki)
Contributors: Aleksei Ivanov (University of Tartu), Niko Partanen (University of Helsinki)
Deliverable location: Christmas Gospel text-to-speech in four Uralic languages

Description

The «Christmas Gospel text-to-speech in four Uralic languages» (shortname: xmas-gospel-tts) is a collection of .txt, .wav and .vrt files with a variety of alignments used in Korp searches. The collection is intended as a demo showing how to donate parallel multilingual spoken materials to the Language Bank of Finland and how to implement them there.

Background

A model for Massively Multilingual Speech (MMS, CC-BY-NC 4.0) has recently been developed at Facebook (Meta), with language support for hundreds of languages whose automatic speech recognition (ASR), text to speech (TTS) and language identification (LID) coverage is documented here.

The documentation at Meta includes 16 of approximately 32 Uralic languages or language forms spoken today. We chose three languages, Komi-Zyrian (kpv), Karelian (krl) and Erzya (myv), of the eight Uralic languages with coverage for the three categories of ASR, TTS and LID, and then we selected one additional language, Olonets-Karelian (olo, aka Livvi), one of the 16 languages lacking coverage for any of the three categories. Our choice of a fourth language was motivated by the fact that Karelian and Olonets-Karelian share much the same character-to-sound correlation and that the latter might actually be the source of digital information under the umbrella term Karelian.

The .txt files represent a segment of an existing parallel corpus, Parallel Biblical Verses for Uralic Studies (PaBiVUS), which is described in Metashare with a CC-BY-NC license. The segment or mini parallel corpus here is the Christmas Gospel (Luke 2:1–20), which is well known in Finland.

The .wav files have been produced as a text-to-speech exercise with a Python script by Aleksei Ivanov, Niko Partanen and Jack Rueter, utilizing the model for MMS built at Facebook (see above).

The .vrt files contain morpho-syntactically annotated versions of the Christmas Gospel texts, which have subsequently been inspected and manually corrected. The annotation used three components: analysers for Erzya, Komi-Zyrian, Karelian and Olonets-Karelian, built with Helsinki Finite-State Technologies (HFST) and under continual development at Saami Language Technology (GiellaLT), based at the Norwegian Arctic University in Tromsø; Constraint Grammar (CG) methods as documented at the University of Southern Denmark; and a Universal Dependencies tool, Annotatrix.

The demo provides two facets of searchability on the Korp server. First, there is parallel corpus searchability, as found in the PaBiVUS corpus, i.e., there are links between .vrt coded verses of the Christmas Gospel with automatically annotated and subsequently manually corrected dependencies. Second, the text content of each verse is linked with the sound file (.wav), which allows for a sentence-to-utterance alignment as found, for example, in the Finnish Parliament materials, where timestamps would be the equivalents of our verse identifiers.

<< List of all deliverables

D2.2.2: Speech recognition for L2 update

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.2: Report on Speech recognition for L2
Date of reporting: 2023-11

Report author: Yaroslav Getman (Aalto University)
Contributors: Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grosz, T., Clara, A., Kurimo, M. (Aalto University); Salvi, G., Svendsen, T. (Norwegian University of Science and Technology); Strömbergsson, S. (Karolinska Institutet); Smolander, A., Ylinen, S. (Tampere University); von Zansen, A., Hilden, R., Linden, K. (University of Helsinki); Kallio, H., Kuronen, M., Huhta, A., Kronholm, S. (University of Jyväskylä)

Deliverable location: Aalto Speech Research | Multi-task wav2vec2

Description

Systems trained to perform automatic speech recognition (ASR) and pronunciation rating for child L2 Finnish are available on HuggingFace Hub. The links to the models are collected on this GitHub page and the methods are described in [1] and [2].

ASR systems for L2 learners of Finnish are described in [3] and [4]. The scripts used to train the models are available on this GitHub page and the data will be released in Kielipankki here and here.

References

[1] Getman, Y., Al-Ghezi, R., Grosz, T., Kurimo, M. (2023) Multi-task wav2vec2 Serving as a Pronunciation Training System for Children. Proc. 9th Workshop on Speech and Language Technology in Education (SLaTE), 36-40, doi: 10.21437/SLaTE.2023-8
[2] Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grosz, T., Kurimo, M., Salvi, G., Svendsen, T., Strömbergsson, S., Smolander, A., Ylinen, S. (2023) Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children. IEEE Access, vol. 11, pp. 86025-86037, 2023, doi: 10.1109/ACCESS.2023.3304274
[3] Kurimo, M., Getman, Y., Voskoboinik, E., Al-Ghezi, R., Kallio, H., Kuronen, M., von Zansen, A., Hilden, R., Kronholm, S., Huhta, A., Linden, K. (2023) New data, benchmark and baseline for L2 speaking assessment for low-resource languages. Proc. 9th Workshop on Speech and Language Technology in Education (SLaTE), 166-170, doi: 10.21437/SLaTE.2023-32
[4] Al-Ghezi, R., Voskoboinik, K., Getman, Y., von Zansen, A., Kallio, H., Clara, A., Kuronen, M., Huhta, A., Hilden, R. (in review) Automatic speaking assessment of Spontaneous L2 Finnish and Swedish. Language Assessment Quarterly.

<< List of all deliverables

D1.3.4: QA pair corpora

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on QA pair corpora
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

  1. Manually annotated English QA dataset

    100 manually annotated documents for question-answer pairs from a random sample of the documents labelled as having the QA label from the English web-scale dataset Falcon-refinedWeb. The dataset is split into 40 dev and 60 test, and includes 345 questions and 192 answers.

  2. Manually annotated Finnish QA dataset

    218 manually annotated documents for QA pairs from a random sample of the documents labelled as having the QA label from the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset is split into train, dev and test with 100, 50 and 68 documents respectively. The dataset includes 376 questions and 333 answers.

  3. ChatGPT-annotated Finnish QA dataset

    3,424 ChatGPT-annotated documents for QA pairs from a random sample of the documents labelled as having the QA label from the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset has been used only for training. The dataset includes 2,919 questions and 2,491 answers.

The first three datasets have been used in the training and testing of the QA pair extraction model introduced in report D1.3.3. They do not necessarily include QA pairs, because the documents were annotated only for text spans that are either a question or an answer, without regard to whether a span formed part of a pair. The data for the first three datasets can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/annotated-data

  4. Corpus of QA pairs retrieved from web-scale datasets

    QA pairs retrieved by the QA pair retrieval pipeline from several different corpora: the Finnish Parsebank, CC-Fi, mC4-Fi and the English Falcon-refinedWeb. After low-quality pairs were discarded, the corpus includes almost 200K retrieved pairs from 125K documents. The final pairs can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/qa_predicted_final_files

The publication details will be updated later (work submitted for LREC-COLING 2024).

<< List of all deliverables

FIN-CLARIAH D3.1.4: Incremental update process

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.1: Report on Ingestion framework
Date of reporting: 2023-11

Report author: Johanna Lilja (National Library of Finland), Tuula Pääkkönen (National Library of Finland)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Description

The OAI-PMH API of the National Library is regularly queried for changes in the dataset. If such changes occur (additions/deletions), the respective files are added to or deleted from the downloaded dataset as needed and another snapshot is created (see D3.1.3). Support for deleted bindings is still under development.
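The update step described above can be sketched as follows. This is a minimal illustration in Python, not the code of the kielipankki-nlf-harvester repository: the `Change` record, the `apply_changes` function and the binding identifiers are hypothetical names, and the changes are assumed to have already been parsed from the OAI-PMH responses (where a deleted record is signalled by `status="deleted"` in its header).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One change reported by the harvest (hypothetical record type)."""
    binding_id: str
    deleted: bool  # True when the record header carries status="deleted"


def apply_changes(snapshot, changes):
    """Return a new snapshot (set of binding ids) with the changes applied.

    The previous snapshot is left untouched, so every run yields a new
    snapshot next to the old one.
    """
    current = set(snapshot)
    for change in changes:
        if change.deleted:
            current.discard(change.binding_id)  # no error if already absent
        else:
            current.add(change.binding_id)
    return frozenset(current)


# Illustrative data only; the ids are made up.
previous = frozenset({"binding-1", "binding-2"})
changes = [
    Change("binding-3", deleted=False),
    Change("binding-1", deleted=True),
]
latest = apply_changes(previous, changes)
print(sorted(latest))
```

Keeping snapshots immutable is one possible design choice here: it makes it easy to compare two consecutive snapshots or to fall back to the previous one if a harvest turns out to be incomplete.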

More information

FIN-CLARIAH WP3.1 presentation from DARIAH-FI workshop on December 1st, 2023.

<< List of all deliverables

D1.3.3: Models for retrieving QA pairs from the web

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on Models for retrieving QA pairs from the web
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

Our pipeline to retrieve question-answer pairs from text corpora includes two transformer models: one for extracting documents with likely QA pairs from web-crawled corpora, and another one for extracting the actual QA pairs from the documents.

The model for QA document identification is a cross-lingual sequence classification model trained on register-annotated data in English and Finnish as well as on unpublished Swedish and French versions. It is specifically fine-tuned to predict whether or not a document (a piece of text) includes something related to questions and answers.

The model for QA pair extraction is a token classification model (for English and Finnish) which predicts whether a token in the text belongs to a question, answer or other and then splits the text into QA pairs based on those predictions and aggregation strategies. This model is used on the documents labelled as having something related to questions and answers.
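The splitting step can be illustrated with a small Python sketch. It assumes that per-token predictions are already available as the labels `Q`, `A` and `O`, and it uses one simple aggregation strategy (merge contiguous tokens with the same label, then pair each question span with the next answer span). The label names, the function names and the strategy are assumptions for illustration; the aggregation strategies actually used in the pipeline are not described in this report.

```python
from itertools import groupby


def merge_spans(tokens, labels):
    """Merge contiguous tokens that share a label into (label, text) spans,
    dropping everything labelled "O" (other)."""
    spans = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "O":
            spans.append((label, " ".join(token for token, _ in group)))
    return spans


def pair_spans(spans):
    """Pair each question span with the first answer span that follows it."""
    qa_pairs = []
    question = None
    for label, text in spans:
        if label == "Q":
            question = text  # a later question replaces an unanswered one
        elif label == "A" and question is not None:
            qa_pairs.append((question, text))
            question = None
    return qa_pairs


# Hand-made example; in practice the labels would come from the model.
tokens = "Welcome ! What is Korp ? A corpus search tool . Thanks".split()
labels = ["O", "O", "Q", "Q", "Q", "Q", "A", "A", "A", "A", "A", "O"]
pairs = pair_spans(merge_spans(tokens, labels))
print(pairs)
```

Under this strategy a question without a following answer, or an answer without a preceding question, produces no pair, which is consistent with annotated documents not necessarily containing complete QA pairs.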

The publication details will be updated later (work submitted for LREC-COLING 2024).

Links

  • The model for QA document identification: https://huggingface.co/TurkuNLP/xlmr-qa-register
  • Corpus of Online Registers of English (CORE): https://github.com/TurkuNLP/CORE-corpus
  • FinCORE corpus: https://github.com/TurkuNLP/FinCORE_full
  • Multilingual register annotations: https://github.com/TurkuNLP/multilingual-register-labeling/tree/master/register-annotations
  • The model for QA pair extraction (English): https://huggingface.co/TurkuNLP/xlmr-qa-extraction-en
  • The model for QA pair extraction (Finnish): https://huggingface.co/TurkuNLP/xlmr-qa-extraction-fi

<< List of all deliverables

    D1.2.2: Transcription Service for Finnish Interviews

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 1.2: Transcription Service for Finnish Interviews
    Date of reporting: 2023-10

    Report author: Martin Matthiesen (CSC)
    Contributors: Anssi Moisio (Aalto), Sam Hardwick (CSC), Niko Partanen (National Library), Aivo Olev (Tallinn University of Technology)
    Deliverable location: https://tekstiks.ee (Finnish)

    Description

    The transcription service is split into two parts: the end-user frontend is hosted at the University of Tallinn, Estonia, at https://tekstiks.ee, and the speech recognition backend is hosted at CSC – IT Center for Science in Finland. For details and usage instructions, see https://www.kielipankki.fi/arkisto/resource-info/tools-for-speech-analysis-and-annotation/

    The source code is available on GitHub.

    References:

    Olev, A; Alumäe, T. (2022). Estonian Speech Recognition and Transcription Editing Service. Baltic J. Modern Computing, Vol. 10 (3), pp. 409–421. DOI: 10.22364/bjmc.2022.10.3.14

    Moisio, A; Porjazovski, D; Rouhe, A; Getman, Y; Virkkunen, A; AlGhezi, R; Lennes, M; Grósz, T; Lindén, K & Kurimo, M (2022). Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks. Language Resources and Evaluation. DOI: 10.1007/s10579-022-09606-3

    Moisio, A. (2022). Lahjoita puhetta baseline Kaldi ASR model (1.2). Zenodo. DOI: 10.5281/zenodo.7101543

    << List of all deliverables

    FIN-CLARIAH D3.5.1: Text network analysis of political texts

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 3.5: Report on Text network analysis of political texts
    Date of reporting: 06-06-2023

    Report author: Kimmo Elo (University of Turku)
    Contributors: Kimmo Elo, Veronika Laippala, Otto Tarkka (University of Turku)
    Deliverable location: None so far; the R Shiny GUI and GitHub repository will be made public in Q3/2023.

    Description

    The WP’s main objective is to develop tools based on network analysis for the analysis of political texts. The tools will be made available both via a web-interface and as dedicated R packages. Three (3) tools are currently under development:

    1. A KWIC tool for the FinParl corpus: This tool provides a user interface for querying word embeddings with the KWIC (Key Word In Context) method. The tool offers a simple yet intuitive user interface built with R Shiny, with which the user can analyse key word embeddings of the FinParl corpus of plenary debates of the Finnish parliament (eduskunta). A beta version of this tool is already in testing; the release is planned for Q3/2023.
    2. A tool for semantic and text network analysis and visualisations: Building on the KWIC tool, this tool will provide functionalities for vocabulary-based content analysis of political text, for the comparison of different text networks, and for dynamic text network analysis with a set of visualisation tools. These tools are currently under active development and testing; the production phase is expected to be completed in Q3/2023.
    3. A tool for analysing text reuse: This tool will offer functionalities to identify and analyse structural similarities of vocabulary-based text networks. Such structural patterns can help us to identify how phrases or longer text passages are re-used over time. The tool will also provide capabilities to identify patterns in concept embedding, a widely used strategy in political texts to frame different issues in the same (or similar) context. This tool is currently in planning; active development and coding are expected to be completed in Q4/2023.
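For readers unfamiliar with the KWIC method mentioned in item 1, the following minimal Python sketch shows the basic idea: every occurrence of a keyword is listed together with a fixed window of words to its left and right. It is only an illustration on a made-up sentence; the function name `kwic` and the `window` parameter are assumptions, and the WP's own tool is built with R Shiny and is not reproduced here.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence
    of `keyword` in `tokens`, matching case-insensitively."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Made-up example sentence, not taken from the FinParl corpus.
text = "The parliament debated the budget and the parliament approved the budget"
hits = kwic(text.split(), "budget", window=2)
for left, keyword, right in hits:
    print(f"{left:>15} | {keyword} | {right}")
```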

    All these tools will be developed for and tested with the FinParl-corpus consisting of all plenary speeches of the Finnish eduskunta since 1907. All tools will access a tailored dataset maintained on a server at the University of Turku.

    The FinParl-corpus used by this WP is structured according to the ParlaMint XML schema, so that, at least in theory, the tools should be compatible with all corpora following the same ParlaMint schema. Our plan, however, is not to restrict the analytical tools to use with the FinParl-corpus only. Instead, the tools will be designed to work with tidy data, and the WP provides tools to access relevant resources and to convert the working data into tidy data for further analysis.

    Overall, the WP is proceeding quite well and mostly on schedule. We have a small yet active research team that brings together expertise from social sciences and computational linguistics and is capable of developing tools for a wide audience. The team dynamics are good, and regular internal meetings are used to discuss current issues, problems, and solutions. The WP also benefits from a large FIRI research grant from the Academy of Finland, which covers the years 2023–2025 and gives us greater room for manoeuvre in planning the WP's future development.

    << List of all deliverables

    D2.1.2: Licensing agreements for special categories

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 2.1: Report on Licensing agreements for special categories of personal data
    Date of reporting: 2023-06

    Report author: Mietta Lennes (UHEL)
    Contributors: Sirpa Kovanen, Krister Lindén (UHEL)
    Deliverable location: Deposition license agreement template

    Description

    The deposition license agreement template of the Language Bank of Finland allows for the deposition of resources that contain personal data (cf. D2.1.1: Licensing agreements for personal data). In addition, some research datasets may also include personal data belonging to special categories. Such data reveals the person’s racial or ethnic origin, political opinions, religion or philosophical beliefs, trade union membership, data concerning health, sexual orientation or activity, or genetic and biometric data for identifying the person.

    Personal data belonging to special categories are considered sensitive. In some cases, it is not possible to completely remove the sensitive data without making the entire resource unusable regarding the research purpose. However, it may still be possible to deposit the resource (or some version of it) with the Language Bank, given that sufficient and proportionate safety measures are applied.

    Preparing for the deposition of a sensitive dataset

    Before the resource can be deposited, the data controller regarding the original purpose of use (in practice, usually, the depositing researchers themselves) must conduct a preliminary risk assessment and a Data Protection Impact Assessment (DPIA) if appropriate. In this process, the researchers should primarily follow the instructions of their home organization. For convenience, the Language Bank also provides an instruction page for the preliminary evaluation of data protection.

    Before depositing, the researchers are responsible for minimizing the amount of personal data, and especially the sensitive information, to the extent that is possible and proportionate with regard to the research purpose. To keep the deposited content accessible and useful for other researchers, some documentation of the pseudonymization process can be included in the metadata of the resource.

    Additional data protection terms and conditions

    For resources containing personal data, the resource-specific data protection terms and conditions and the description of the categories of personal data in the resource are included in an annex of the deposition license agreement with the Language Bank. In the same annex, it is possible for the data controller to specify further requirements, in case the processing of personal data contained in the resource is seen to involve risks that call for a particularly high level of information security.

    Protective measures applicable to sensitive datasets

    Currently, the Language Bank offers the following protective measures that can be applied to sensitive datasets:

    1. Access management in the restricted license category (RES): Based on application, access to the resource can be restricted to individual researchers who have produced an acceptable research plan, matching the original research purpose of the resource.
    2. Data protection terms and conditions: When submitting their application, each user of the resource must accept the license of the resource in question, including the resource-specific data protection terms and conditions recorded in the deposition license agreement by the original data controller. The license is persistently available via the metadata record of the resource and the license information is also included in the data package that is provided to end-users via the Language Bank.
    3. Data encryption: In the case of sensitive datasets, the package can be stored in an encrypted form, and the package can be re-encrypted by the Language Bank on an individual basis for each recipient, to ensure that only the authorized user can decrypt and access the package content after downloading. The very first dataset applying this safety measure is the Finnish Dark Web Marketplace Corpus (findarc), published on 30 May 2023. To make the encrypted dataset more accessible, the Language Bank offers instructions for using GPG keys.
    4. Sensitive Data (SD) services at CSC – IT Center for Science: The Language Bank is currently preparing to start using the SD platform for making sensitive datasets available to researchers and research teams who need a secure environment for reusing a given resource (see further details about the SD services at CSC). The aforementioned Finnish Dark Web Marketplace Corpus will be used as a test case.

    The Language Bank is also collaborating with the DELAD Task Force in CLARIN. DELAD focusses on sharing corpora of disordered speech that often contain, e.g., health-related data and data from children.

     


    Last updated: 2023-06-06

    << List of all deliverables

    D5.1.2: Log Data Collection and Analysis

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 5.1: Report on Log Data Collection and Analysis
    Date of reporting: 05-06-2023

    Report authors: Sanna Kumpulainen, Jaakko Peltonen, Farid Alijani (Tampere University)
    Contributors: Sanna Kumpulainen, Jaakko Peltonen, Farid Alijani, Anna Sendra Toset (Tampere University)
    Deliverable location: GitHub repository

    Description

    In general, the goal of WP5.1 is to design and develop methods that enable analysis of log data from systems in the FIN-CLARIAH infrastructure and that are usable for other compatible systems. The analysis of log data can serve purposes such as monitoring the use of the systems and recommending content to end-users.

    As one of the deliverables and initial attempts, we conducted a comprehensive study on the utility of the log data to investigate the feasibility of developing both user-based and item-based recommender systems which could be potentially deployed for end-users in the future.

    Secondly, as a proof of concept we have developed a collaborative recommender system to assist information retrieval in digital libraries, based on log data gathered from use of the libraries. The developed recommender system combines collaborative and content-based recommendation. It has been initially developed with similarity search approaches, and is extensible to various inference schemes including neural approaches in future work.
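To make the similarity-search idea concrete, here is a minimal Python sketch of the collaborative part only: items are represented by the users who viewed them, and unseen items are scored by their cosine similarity to the items a user has already viewed. The log format (pairs of user id and item id), the function names and the example data are assumptions for illustration; the sketch does not reproduce the project's GitHub code and leaves out the content-based component.

```python
from collections import defaultdict
from math import sqrt


def item_vectors(log):
    """Build item -> {user: view count} vectors from (user, item) events."""
    vectors = defaultdict(lambda: defaultdict(int))
    for user, item in log:
        vectors[item][user] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(log, user, top_n=3):
    """Rank unseen items by summed similarity to the user's viewed items."""
    vectors = item_vectors(log)
    seen = {item for u, item in log if u == user}
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        score = sum(cosine(vectors[candidate], vectors[s]) for s in seen)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up log events; real logs would need parsing and anonymisation first.
log = [
    ("u1", "newspaper-A"), ("u1", "newspaper-B"),
    ("u2", "newspaper-A"), ("u2", "newspaper-B"), ("u2", "journal-C"),
    ("u3", "journal-C"), ("u3", "journal-D"),
]
suggestions = recommend(log, "u1")
print(suggestions)
```

In this toy log, user u1 shares two viewed items with u2, so the item that u2 also viewed is suggested, while an item viewed only by the unrelated user u3 receives a zero score and is left out.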

    In the proof-of-concept recommender system, we are currently using the National Library of Finland (NLF) dataset (digi.kansalliskirjasto.fi), including metadata of the collection, description, preservation and accessibility of Finland’s printed national heritage as digitized materials. The proof of concept is easily extensible to comparable log files of other digital libraries, and similar approaches can be applied to other DARIAH-FI collections. We maintain an open-access GitHub repository for public use. It has been tailored primarily to the SLURM clusters provided by CSC, which offer data storage and massive computational resources.

    << List of all deliverables

    D3.2.1: Pipeline for transferring archival data

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 3.2: Report on Pipeline for transferring archival data
    Date of reporting: 02-06-2023

    Report author: Tanja Välisalo (University of Jyväskylä)
    Contributors: Ida Toivanen (University of Jyväskylä), Venla Poso (University of Jyväskylä)
    Deliverable location: https://www.jyu.fi/hytk/fi/tutkimus/infrastruktuurit/fin-clariah/tyopaketit/d3-2-1-pipeline.pdf/

    Description

    The process of transferring large amounts of data from the Finnish National Archives to a research institute needs a defined technical process as well as licences and agreements. Two types of data transfer cases for state authority archives have been identified depending on the status of the documents: (1) documents are in the ‘storage’ phase (actively used by the state authority); (2) documents are in the ‘archival’ phase (long-term storage). A pilot data transfer project using the type of documents in case 1 has been conducted. Based on the pilot, demands for licences and agreements have been identified. An ideal licensing/agreement process has been described.
