FIN-CLARIAH Deliverables

This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).

FIN-CLARIAH Funding period 2026-2029

Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.

Module 1: Natural Language Processing (NLP)

The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)

W1.1 Text processing and annotation environments

To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)

D1.1.1 Support common CLARIN formats like TEI. (CSC/Martin Matthiesen) 2026-12
D1.1.2 Convert VRT to TEI and showcase the result in a compatible web interface like the KorAP platform used in German CLARIN. (CSC/Martin Matthiesen) 2027-07
D1.1.3 Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen) 2028-04
D1.1.4 Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen) 2029-10

W1.2 Speech processing and annotation

To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)

D1.2.1 Updated backend of existing ASRs (CSC/Sam Hardwick) 2026-10
D1.2.2 A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats) 2027-08
D1.2.3 Support additional future models and make the processing pipeline transparent, so that its suitability for data with elevated security requirements is easy to evaluate (CSC/Sam Hardwick) 2028-06
D1.2.4 Expansion and upgrade of Oulu Clarin-D centre to C or B status; provision of access to additional language resources sourced from multimedia social media content. (OU/Steven Coats) 2029-11

W1.3 Video processing and annotation

To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)

D1.3.1 Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes) 2026-06
D1.3.2 Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick) 2027-08
D1.3.3 Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes) 2028-09
D1.3.4 Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes) 2029-10

Module 2: Language Research Infrastructure (LRI)

This module takes care of the specialised language processing needs in the fields of language-based research. (L:UHEL/ARTS Krister Lindén)

W2.1 Processing Research Data

To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)

D2.1.1 Document the current options for using other processing environments, such as the supercomputers provided by CSC, and their fitness for purpose. (CSC/Martin Matthiesen) 2026-05
D2.1.2 Propose a proof-of-concept to address issues found in D2.1.1. (CSC/Martin Matthiesen) 2027-09
D2.1.3 Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen) 2028-06
D2.1.4 Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes) 2029-11

W2.2 Training environments

To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelsson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)

D2.2.1 Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelsson) 2026-12
D2.2.2 Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelsson) 2027-12
D2.2.3 Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelsson) 2028-06
D2.2.4 Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelsson) 2029-08

W2.3 Translation and Interpretation

To provide infrastructure for translation and interpretation research on fact checking and verification of LLM output. (L:UHEL/ARTS Tommi Jauhiainen; P:CSC; C:UTA, UEF)

D2.3.1 Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen) 2026-05
D2.3.2 Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) 2027-06
D2.3.3 Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) 2028-08
D2.3.4 A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen) 2029-10

W2.4 Terminology

To provide infrastructure for the terminology work in the Helsinki Term Bank for the Arts and Sciences (HTB) and related terminology development projects. (L:UHEL/ARTS Tiina Onikki; C:UVAASA)

D2.4.1 Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics. 2026-09
D2.4.2 Initiate and develop terminology groups on geography, social geography, and environmental sciences. 2027-12
D2.4.3 Initiate and develop terminology groups on social policy, economics, and political science. 2028-05
D2.4.4 Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences. 2029-11

Module 3: Structuring Data

This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)

W3.1 Data Management

To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)

D3.1.1 Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models 2026-05
D3.1.2 Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models 2027-09
D3.1.3 Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements 2028-04
D3.1.4 Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements 2029-10

W3.2 Data Ingestion

To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)

D3.2.1 Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and the FIN-CLARIAH infrastructure. (NLF/FINNA/Riitta Peltonen) 2026-11
D3.2.2 Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitation of dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen) 2027-06
D3.2.3 Ingestion of in-copyright publications/web archive. Building a research environment for legal deposit material. (NLF/Aija Vahtola) 2028-12
D3.2.4 Ingestion of in-copyright publications/web archive. Piloting the research environment for legal deposit material with researchers. (NLF/Aija Vahtola) 2029-11

W3.3 Enrichment

To enable the systematic and detailed analysis of noisy datasets in different formats and thereby open up new possibilities for SSH research. (All the deliverables set to 2029 also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UTU Veronika Laippala; P:UEF, JYU, OU, UHEL/ARTS, UHEL/SOC, Aalto; C:UHEL/NLF)

D3.3.1 Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti) 2029-11
D3.3.2 Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola) 2027-11
D3.3.3 Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala) 2029-11
D3.3.4 Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila) 2029-11
D3.3.5 Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa) 2029-11

Module 4: Analyzing Structured Data

The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)

W4.1 Analytical Support for computational SSH

To enable researchers to utilise large born-digital data effectively and to focus on analysis rather than on the technical details of handling data that is often high-volume and high-velocity. (All the deliverables also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UEF Mikko Laitinen; P:JYU, OU, UHEL/SOC; C:UHEL/NLF)

D4.1.1 Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki) 2029-11
D4.1.2 Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen) 2029-11
D4.1.3 Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste) 2029-11
D4.1.4 Analysis tools for multimodal livestream data (JYU/Raine Koskimaa) 2029-11

Module 5: Information Interaction (IIA)

Interaction refers to the need 1) to collect information on how researchers interact with the RI in order to develop the tools and services accordingly, and 2) to offer education and consultation on how researchers can enhance their work by using the infrastructure, thus increasing the RI’s active user base. (L:TAU Sanna Kumpulainen)

W5.1 Evidence-Based Infrastructure Development

To maintain a close dialogue with the user community to ensure the best possible development of the RI. (L:TAU Sanna Kumpulainen; P:UHEL/ARTS; C:UHEL/NLF, UTU, CSC, UHEL/SOC, AALTO, JYU, UEF, OU)

D5.1.1 Community engagement: Researchers using LLMs as research tools. (TAU/Sanna Kumpulainen) 2026-06
D5.1.2 Educational resources for infrastructure tools and data. (TAU/Sanna Kumpulainen) 2027-11
D5.1.3 Community engagement: User interaction with multimodal data. (TAU/Sanna Kumpulainen) 2028-06
D5.1.4 Evidence-based infrastructure development: User experience and the feedback instrument. (TAU/Sanna Kumpulainen) 2029-11


Last modified on 2024-11-11
