This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).
Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.
The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)
To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)
D1.1.1 | Support common CLARIN formats like TEI (CSC/Martin Matthiesen). | 2026-12 |
D1.1.3 | Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen) | 2028-04 |
D1.1.4 | Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen) | 2029-10 |
To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)
D1.2.1 | Updated backend of existing ASRs (CSC/Sam Hardwick) | 2026-10 |
D1.2.2 | A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats) | 2027-08 |
D1.2.3 | Support additional future models and make the processing pipeline transparent, so that its suitability for data with elevated security requirements is easy to evaluate (CSC/Sam Hardwick) | 2028-06 |
To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)
D1.3.1 | Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes) | 2026-06 |
D1.3.2 | Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick) | 2027-08 |
D1.3.3 | Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes) | 2028-09 |
D1.3.4 | Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes) | 2029-10 |
To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)
D2.1.1 | Document the current options for using other processing environments, such as supercomputers provided by CSC, and their fitness for purpose. (CSC/Martin Matthiesen) | 2026-05 |
D2.1.2 | Propose a proof-of-concept to address issues found in D2.1.1. (CSC/Martin Matthiesen) | 2027-09 |
D2.1.3 | Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen) | 2028-06 |
D2.1.4 | Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes) | 2029-11 |
To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelsson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)
D2.2.1 | Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelsson) | 2026-12 |
D2.2.2 | Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelsson) | 2027-12 |
D2.2.3 | Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelsson) | 2028-06 |
D2.2.4 | Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelsson) | 2029-08 |
D2.3.1 | Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen) | 2026-05 |
D2.3.2 | Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) | 2027-06 |
D2.3.3 | Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) | 2028-08 |
D2.3.4 | A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen) | 2029-10 |
D2.4.1 | Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics. | 2026-09 |
D2.4.2 | Initiate and develop terminology groups on geography, social geography, and environmental sciences. | 2027-12 |
D2.4.2 | Initiate and develop terminology groups on social policy, economics, and political science. | 2028-05 |
D2.4.3 | Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences. | 2029-11 |
This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)
To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)
D3.1.1 | Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models | 2026-05 |
D3.1.2 | Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models | 2027-09 |
D3.1.3 | Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements | 2028-04 |
D3.1.4 | Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements | 2029-10 |
To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)
D3.2.1 | Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and FIN-CLARIAH-infrastructure. (NLF/FINNA/Riitta Peltonen) | 2026-11 |
D3.2.2 | Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitation of dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen) | 2027-06 |
D3.2.3 | Ingestion of in-copyright publications/webarchive. Building a research environment for legal deposit material (NLF/Aija Vahtola) | 2028-12 |
D3.2.4 | Ingestion of in-copyright publications/webarchive. Piloting the research environment for legal deposit material with researchers (NLF/Aija Vahtola) | 2029-11 |
D3.3.1 | Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti) | 2029-11 |
D3.3.2 | Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola) | 2027-11 |
D3.3.3 | Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala) | 2029-11 |
D3.3.4 | Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila) | 2029-11 |
D3.3.5 | Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa) | 2029-11 |
The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)
D4.1.1 | Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki) | 2029-11 |
D4.1.2 | Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen) | 2029-11 |
D4.1.3 | Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste) | 2029-11 |
D4.1.4 | Analysis tools for multimodal livestream data (JYU/Raine Koskimaa) | 2029-11 |
D5.1.1 | Community engagement: Researchers using LLMs as research tools. (TAU/Sanna Kumpulainen) | 2026-06 |
D5.1.2 | Educational resources for infrastructure tools and data. (TAU/Sanna Kumpulainen) | 2027-11 |
D5.1.4 | Evidence-based infrastructure development: User experience and the feedback instrument. (TAU/Sanna Kumpulainen) | 2029-11 |
Last modified on 2024-11-11