<< List of all deliverables

FIN-CLARIAH D4.3.1: Subsetting tool

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.3: Report on Subsetting tool
Date of reporting: 14-11-2022

Report author: Eetu Mäkelä (University of Helsinki)
Contributors: Ville Vaara (University of Helsinki)
Deliverable location: Internal

Description

The prototype version of the subsetting tool is at https://github.com/hsci-r/octavo/. This prototype has been, and continues to be, used successfully in multiple research projects. At the same time, the prototype is 1) not as easily updatable and 2) not as easily maintainable as we’d like. Both of these hindrances are mainly caused by the tool hooking into the Lucene search library on multiple levels of interfaces (mostly according to whichever interface provided the most efficient way to implement each functionality), which considerably increases system complexity. Additionally, some of the integrations are on very low levels, where interface stability between versions is considerably lower.

In order to overcome these deficiencies, WP4.3 has been evaluating whether a production version of the tool could be built on top of Elasticsearch, which is also based on Lucene, but offers APIs and interfaces on a much higher level of abstraction and standardisation. The idea here is that if the same functionalities could be built using Elasticsearch, 1) there would be much less API surface between the custom and standard parts of the system, and 2) the remaining extension points would be more standard, widely documented, stable and well understood.

In pursuit of this, the WP has a) catalogued the Lucene extension points that the current prototype uses, b) catalogued which functionalities rely on which extension points, and rated these functionalities based on how important they have been for actual users in the associated research projects, and c) correspondingly gone over the extension points and possibilities offered by Elasticsearch. Next, these need to be brought together and aligned with each other to reach a go/no-go decision on whether a sufficient number of the functionalities rated as important can be developed using only the well-documented extension points of Elasticsearch, and thus whether we should go ahead with the actual reimplementation of the tool using that framework.
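The alignment step described above can be pictured as a small data-structure exercise. The following is only a minimal sketch under stated assumptions, not the WP's actual procedure: the names `Functionality`, `coverage`, the 1 to 3 importance scale and the placeholder identifiers (`func_1`, `ext_a`, and so on) are all hypothetical, and no claim is made about which real Lucene extension points have Elasticsearch counterparts.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Functionality:
    """One catalogued functionality of the prototype (hypothetical model)."""
    name: str
    importance: int  # assumed scale: 1 (low) to 3 (high)
    lucene_extension_points: FrozenSet[str]


def coverage(
    functionalities: Iterable[Functionality],
    portable_extension_points: FrozenSet[str],
    min_importance: int = 3,
) -> Tuple[List[str], List[str]]:
    """Split the important functionalities into portable and blocked ones.

    A functionality counts as portable only if every extension point it
    relies on is in the set judged to have an Elasticsearch equivalent.
    """
    portable, blocked = [], []
    for f in functionalities:
        if f.importance < min_importance:
            continue  # not rated important enough to affect the decision
        if f.lucene_extension_points <= portable_extension_points:
            portable.append(f.name)
        else:
            blocked.append(f.name)
    return portable, blocked
```

With such a mapping in place, the go/no-go question reduces to whether the blocked list is short enough to accept.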

According to the original plan, getting to the point where a decision could be made was slated for Q3/2022. However, due to delays in hiring, we are only now at the point where the constituent sides of the background reports are completed and working out their alignment can begin. At present, we expect to be able to make the go/no-go decision itself within a month from now.

FIN-CLARIAH D4.1.1: Harmonized version of the Finnish National Bibliography (FNB)

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.1: Report on Harmonized FNB
Date of reporting: 14-11-2022

Report author: Leo Lahti (University of Turku)
Contributors: Pyry Kantanen (University of Turku)
Deliverable location: Internal

Description

Digital metadata collections are valuable for cataloguing and information retrieval. They provide structured data that has a foreseen impact on developing methods, applications and tools, and they are increasingly recognized as a potential research object that allows large-scale statistical comparisons, albeit often only after substantial harmonization, enrichment, and curation.

We currently have the National Bibliography of Finland, with metadata of Finnish printings, audiovisual material and web material spanning several centuries up to the present, available through kansalliskirjasto.finna.fi and as linked open data at data.nationallibrary.fi. WP4.1 delivers a harmonized version of the Finnish National Bibliography (FNB) and releases the data, code, workflows, and analysis tools under an open license. D4.1.1 (Q3/2022) suffered from recruitment delays, but the research assistant has been actively working on the project and on CSC integration, and it seems realistic that this deliverable can be completed by the end of Q4/2022.

FIN-CLARIAH D3.3.1: Qualitative survey data concept network

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.3: Report on Qualitative survey data concept network
Date of reporting: 14-11-2022

Report author: Krista Lagus (University of Helsinki)
Contributors: Rachel Bryant, Maria Litova, Tuukka Oikarinen, Joni Oksanen, Maria Valaste (University of Helsinki), Sakari Taipale, Ida Toivanen, Tomi Oinas (University of Jyväskylä), Jani-Matti Tirkkonen (University of Eastern Finland), Jaakko Peltonen (Tampere University)
Deliverable location: https://github.com/DARIAH-FI-Survey-Concept-Network

Description

The objective of WP3.3 is to make better use of unstructured qualitative textual data in the context of Finnish surveys by means of a concept network tool. The toolbox is intended to build a bridge from social science researchers with little NLP coding experience towards the computational NLP community’s text analytics methods and processes that might be useful for understanding the results of their surveys.

Currently, the concept network tool consists of multiple use cases for the exploratory analysis of open-ended survey responses, implemented as separate processing streams. Use cases for the streams are being defined based on working with the pilot data sets. Five pilot data sets have been obtained for explorative methodological work to facilitate tool development. So far, analysis and development work has begun on three; the remaining two will be used for testing the tools during 2023. We are likely to reach the final version of deliverable 3.3.1 in Q4/2022 or Q1/2023. At the moment, we have pushed the deadline to Q4/2022 because obtaining the data sets, analysing the data and exploring methods have taken more time than was originally anticipated. Development of the toolset (process pipeline) is in progress, but it is not yet ready to be released, as it needs harmonising, testing and documenting.

FIN-CLARIAH D3.1.2: Ingestion framework

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.1: Report on Ingestion framework
Date of reporting: 2022-12

Report author: Johanna Lilja (National Library of Finland), Tuula Pääkkönen (National Library of Finland)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Description

A basic concept for how the data is downloaded exists. The technology for workflow management has been chosen (Apache Airflow). A script has been created for downloading the METS XML files and then the ALTO XML files via Airflow. A CSC project has been created with the necessary data requests.
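The "METS first, then ALTO" order follows from how such packages are usually organised: the METS file lists the files that belong to a binding, so it has to be read before the ALTO files can be fetched. The following is a minimal sketch of that lookup step only, using the Python standard library; it is not the harvester's actual code and does not involve Airflow. The function name `alto_hrefs`, the file group ID `ALTOGRP` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard namespace URIs for METS and XLink.
METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def alto_hrefs(mets_xml: str, group_id: str = "ALTOGRP") -> List[str]:
    """Return the xlink:href of every file in the given METS file group.

    The default group ID is an assumption; check it against real data.
    """
    root = ET.fromstring(mets_xml)
    hrefs = []
    for group in root.iter(f"{{{METS_NS}}}fileGrp"):
        if group.get("ID") != group_id:
            continue
        for flocat in group.iter(f"{{{METS_NS}}}FLocat"):
            href = flocat.get(f"{{{XLINK_NS}}}href")
            if href:
                hrefs.append(href)
    return hrefs


# Hypothetical, heavily shortened METS document used only for this example.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp ID="ALTOGRP">
      <mets:file ID="ALTO00001">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./alto/00001.xml"/>
      </mets:file>
      <mets:file ID="ALTO00002">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./alto/00002.xml"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp ID="IMGGRP">
      <mets:file ID="IMG00001">
        <mets:FLocat LOCTYPE="URL" xlink:href="file://./img/00001.jp2"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

if __name__ == "__main__":
    for href in alto_hrefs(SAMPLE_METS):
        print(href)
```

In a workflow manager, a lookup like this would typically sit between the task that downloads the METS file and the tasks that download the individual ALTO files.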

More information

FIN-CLARIAH WP3.1 presentation from DARIAH-FI workshop on November 9th, 2022.

D3.1.1: Initial NLF Data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.1: Report on Initial NLF Data
Date of reporting: 2022-09

Report author: Johanna Lilja (National Library of Finland), Tuula Pääkkönen (National Library of Finland)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Description

A basic concept for how the data is downloaded exists. The technology for workflow management has been chosen (Apache Airflow). A script has been created for downloading the METS XML files and then the ALTO XML files via Airflow. A CSC project has been created with the necessary data requests.

More information

FIN-CLARIAH WP3.1 presentation from DARIAH-FI workshop on November 9th, 2022.

D1.3.1: Corpora of non-standard language

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on Corpora of non-standard language
Date of reporting: 2022-09

Report author: Veronika Laippala (UTU)
Contributors: Veronika Laippala, Filip Ginter, Sampo Pyysalo, Anni Eskelinen, Anna Salmela (UTU)
Deliverable location: turkunlp.org | github.com/TurkuNLP

Description

1) Text quality data

2) Register (genre) annotations for Oscar

3) Toxic language use for Finnish

  • Toxic language can be defined as rude, disrespectful language, likely to make someone leave a discussion
  • Toxic language data and models for Finnish are to be published in early 2023 (being submitted to NoDaLiDa)
  • They will be available at github.com/TurkuNLP and as a Hugging Face dataset

D2.4.3.1: Initializing terminology collections

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.4: Report on Initializing terminology collection
Date of reporting: 2022-09

Report author: Harri Kettunen (UHEL)
Contributors: Tiina Onikki-Rantajääskö (UHEL)
Deliverable location: The Helsinki Term Bank for the Arts and Sciences – Tieteen termipankki

Description

During the first nine months of the project, the Helsinki Term Bank for the Arts and Sciences has initiated terminology work in the following new fields: behavioral sciences, mathematics, Mesoamerican studies, North American Indigenous studies, and theology, of which mathematics and theology are initially working offline and will publish their terminology work at a later date. Furthermore, terminology work has been agreed upon this year in the following fields: Arctic research, Asian studies, gender studies, geography, military sciences, nutritional sciences, and physiology. In addition, an interdisciplinary working group has been established for metascientific terminology. The group is composed of researchers from different fields and from various universities in Finland.

New concept pages have been created in the following fields in 2022: art history, behavioral sciences, botany, classical studies, digital humanities, forensics, genealogical studies, geology, history, law, linguistics, martial arts studies, Mesoamerican studies, open science, philosophy, physics, religion studies, social psychology, sustainability studies, and translation studies. All in all, 584 entirely new concept pages have been created since January 2022.

Furthermore, the database has been updated in the following fields: aesthetics, archaeology, art history, astronomy, behavioral sciences, biology, biotechnology, botany, classical studies, digital humanities, education, environmental sciences, film and television studies, geology, heritage studies, history, Indigenous studies, language technology, law, linguistics, literary studies, martial arts studies, Mesoamerican studies, meteorology, open science, performing arts, philosophy, physics, plain language research, religion studies, semiotics, social psychology, terminology, and translation studies. In total, 2,111 existing concept pages have been updated and 1,114 new terms have been added. The total number of concept pages as of October 30, 2022, is 47,664.

Between 1 January and 29 October 2022, there were 546,202 users, of whom 524,066 were new users. The total number of sessions was 855,533, with 1,509,599 page views.

Awards

The Helsinki Term Bank for the Arts and Sciences was granted the following awards in 2022:

  • Finnish Open Educational Practice Award (May 2022)
  • The University of Helsinki Open Science Award (October 2022)

It should be noted that the project coordinator, Harri Kettunen, started at the beginning of March 2022. The first step in establishing a new field is to gather and activate an expert group, and it takes time before guidance of the voluntary terminology work results in concept pages in the database. Thus, networking has been the main activity of the project coordinator.

D2.4.1: Term discovery procedures

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.4: Report on Term discovery procedures
Date of reporting: 2022-09

Report author: Krister Lindén (UHEL)
Contributors: Sam Hardwick, Harri Kettunen (UHEL)
Deliverable location: Kielipankin työkaludemot (kielipankki.fi) | Käsitelouhinta

Description

Concept mining: The glossary (in this case, the Helsinki Term Bank for the Arts and Sciences) is used in conjunction with the reference corpus (FTC newspaper data) to find, in the target data (theses from different faculties), terms related to existing terms as well as new terms specific to the target data.
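One common way to find terms specific to a target corpus is to compare relative frequencies against a reference corpus. The following is a minimal sketch of that general idea only; it is not a description of how the Käsitelouhinta demo works, and it does not attempt the "related terms" part of the task. The function name `specific_terms`, the add-one smoothing and the `min_ratio` threshold are all assumptions chosen for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def specific_terms(
    target_tokens: Iterable[str],
    reference_tokens: Iterable[str],
    glossary: Set[str],
    min_ratio: float = 2.0,
) -> Tuple[List[str], List[str]]:
    """Return (known, new) terms that are over-represented in the target.

    A term qualifies when its smoothed relative frequency in the target
    data is at least `min_ratio` times that in the reference corpus.
    Known terms are already in the glossary; new terms are not.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    target_total = sum(target.values())
    reference_total = sum(reference.values())
    vocabulary_size = len(set(target) | set(reference))

    qualifying = []
    for term, count in target.items():
        # Add-one smoothing keeps terms unseen in the reference finite.
        target_rel = (count + 1) / (target_total + vocabulary_size)
        reference_rel = (reference[term] + 1) / (reference_total + vocabulary_size)
        if target_rel / reference_rel >= min_ratio:
            qualifying.append(term)

    known = sorted(t for t in qualifying if t in glossary)
    new = sorted(t for t in qualifying if t not in glossary)
    return known, new
```

Real term extraction would also need tokenisation, lemmatisation and multi-word handling, which matter a great deal for Finnish and are left out here.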

D1.2.1: Forced-Alignment Service

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.2: Report on Forced-Alignment Service
Date of reporting: 2022-09

Report author: Martin Matthiesen (CSC)
Contributors: Juho Leinonen (Aalto), Sam Hardwick, Mietta Lennes (UHEL)
Deliverable location: Language Bank Tools Demos (kielipankki.fi) | Forced Alignment

Description

The forced alignment tool provides time stamps for the transcribed words or utterances of an audio file. The tool can be used on puhti.csc.fi, and a web interface can be accessed on the Language Bank Demo Tools page, which is included in the list of tools at kielipankki.fi.

The source code for the original forced aligner is provided on GitHub, https://github.com/aalto-speech/finnish-forced-alignment, and the Docker image on which the tool is based can be found on Dockerhub, https://hub.docker.com/r/juholeinonen/kaldi-align. The specific endpoints for the forced aligner versions installed in the Language Bank of Finland are included in the code repository at https://github.com/Traubert/kielipankki-services, under services/finnish-forced-align.

References:

finnish-forced-alignment: J. Leinonen, S. Virpioja and M. Kurimo. “Grapheme-Based Cross-Language Forced Alignment: Results with Uralic Languages.” NoDaLiDa, 2021.

D1.1.1: Updating LBF resource selection

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.1: Report on Updating LBF resource selection
Date of reporting: 2022-09

Report author: Jussi Piitulainen (UHEL)
Contributors: Ute Dieckmann, Varpu Vehomäki, Krister Lindén, Mietta Lennes (UHEL)
Deliverable location: Corpora | Kielipankki

Description

The Kielipankki data sets are available through the appropriate channels: the download service, the Korp concordance engine, and a data directory in the Puhti computing environment. The data sets have persistent identifiers and are documented in public metadata records, resource family pages, and resource group pages.

We are in the process of updating data sets (Suomi24, STT newswire) with Universal Dependencies (UD2) annotations in addition to the previous annotation model. We are also using automatic language identification to separate the Finnish and Swedish texts in a large new batch of the National Library newspaper corpus (KLK). Data sets in the ingestion pipeline are being documented and prioritized to become available in the appropriate Kielipankki channels.

D2.1.1: Licensing agreements for personal data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.1: Report on Licensing agreements for personal data
Date of reporting: 2022-09

Report author: Mietta Lennes (UHEL)
Contributors: Sirpa Kovanen, Krister Lindén (UHEL)
Deliverable location: Deposition license agreement template

Description

The deposition license agreement template of the Language Bank of Finland has been thoroughly updated in order to allow for the deposition of resources that contain personal data. The template now includes a new annex where the data processing terms and conditions regarding personal data can be included.

The deposition agreement contains both general and resource-specific terms and conditions according to which the Language Bank may distribute a given resource. The template can be used when depositing a new resource in the Language Bank of Finland. The completed document is to be signed by the rightholder(s), by the controller of the personal data (if applicable), and by the Language Bank of Finland, legally represented by the University of Helsinki. In order to make the administrative procedure faster, the Language Bank has also started using a system for electronic signing.

Specific details of the license terms and conditions are always agreed separately for each individual resource. When planning the deposition of research data in the Language Bank of Finland, depositors should contact FIN-CLARIN so that the situation regarding the material and the practical possibilities for distributing the data can be checked and, if required, discussed together. In order to make the discussions easier, it is recommended to first submit a request to create the preliminary metadata record for the new resource.