D2.2.1: Transformer training for specialised data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.2: Report on Transformer training for specialised data
Date of reporting: 09-06-2025

Report author: Erik Axelson (University of Helsinki)
Contributors: Ghent Center for Digital Humanities [1] & Language and Translation Technology Team (LT3) [2] (Ghent University); Sam Hardwick, Katri Tegel (CSC)
Deliverable location: N/A

Description

In this work package, we aim to create a self-study course implemented as Jupyter Notebooks. Its purpose is to teach how to build a language model from scratch in the CSC computing environment, using one or more existing resources of the Language Bank of Finland without being limited to them. To this end, we have tested two resources using CSC’s Noppe [3] service. One is an external resource developed within the CLS – Computational Literary Studies Project (2020-2025) [4]. The other is CSC’s Aitta [5] inference service, for which CSC also offers the course “Aitta – LLM Inference” in Noppe.

The CLSInfra repository [6] hosts the work done within CLS on Natural Language Processing pipelines for the DH community. The pipelines are demonstrated with Jupyter Notebooks, which we have tested in CSC’s Noppe service. We have reported the problems we encountered to the CLSInfra team, who have fixed all the issues reported so far. We will continue to work through the notebooks, with the aim of running all of them in Noppe. Later, we can adapt them, for example, to Finnish or to minority languages such as the Sami languages, other Finno-Ugric languages, or Finland Swedish.

CSC’s “Aitta – LLM Inference” course uses large language models available in the Aitta inference service. We have tested creating keys for accessing language models in Aitta, used the keys in Noppe, and run the course exercises. Aitta already offers some models, and planned features include the ability for users to upload models and create embeddings themselves. These features will later make it possible to use our own materials.

We plan to have our own course environment ready at the beginning of fall 2025.

[1] Ghent Center for Digital Humanities: https://www.ghentcdh.ugent.be/
[2] Language and Translation Technology Team (LT3): https://lt3.ugent.be/
[3] Noppe: https://noppe.2.rahtiapp.fi/
[4] Computational Literary Studies Project (2020-2025): https://clsinfra.io/
[5] Aitta: https://staging-aitta.2.rahtiapp.fi/public
[6] The CLSInfra repository: https://github.com/GhentCDH/CLSinfra

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.2.2: Ingestion of heritage and societal data from Sampo

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.2: Report on Ingestion of heritage and societal data from Sampo
Date of reporting: 04-06-2025

Report author: Eero Hyvönen (Aalto University)
Contributors:
Deliverable location: Linked Data Finland (LDF.fi), several online data services, CSC Allas (ParliamentSampo), zenodo.org (several submissions of data dumps), various web portal URL-addresses

New Sampo systems:

CoinSampo
A new Sampo system based on archaeological data from the Cultural Heritage Agency.

LetterSampo
A new large Sampo system in use, covering some 1.3 million letters by nearly 120,000 historical people from 1700 fonds.

OperaSampo
OperaSampo finished:

ArtSampo
First demo version using LLMs finished:

  • Data service and KG: Available at https://ldf.fi (to be opened)
  • Data publication: Available at https://zenodo.org (to be opened)
  • Portal: https://artsampo.demo.seco.cs.aalto.fi/ (to be opened)

New Projects and Applications

A new follow-up project proposal for LetterSampo to the Research Council of Finland (2025), with the Sibelius Academy, on analyzing the textual contents of historical letters using LLMs, NLP, and KGs (Anne Kauppala, Eero Hyvönen).

New joint work on applying the DARIAH-FI Sampo infrastructure:
1) VU University, Amsterdam, PH-Sampo
2) Geneva Graduate Institute, Switzerland, applying ParliamentSampo to, e.g., United Nations speeches
3) Nomisma.org, NomismaSampo
4) the British Museum, PASampo
5) Heritage Practice Communities network, HPC Sampo
6) University of Latvia, Nobel Prize Sampo, DBLP Sampo

Maintenance of ParliamentSampo
ParliamentSampo was updated with data on the new parliament (2023–2024). New semantic data on interruptions and laughter in the parliament was added. This was reported by YLE in the prime-time TV news and in an online article.

SampoSampo – Connecting Everything to Everything Else
A first demonstration of a new kind of data linking service, inspired by the international VIAF.org service of national libraries, was created.

Education: tutorials

Tutorial: How to create a Linked Open Data service and semantic portal for your Cultural Heritage data. (in English), November 28, 2024.

Tutorial organized at the Digital Humanities in the Nordic and Baltic Countries 2025 conference (DHNB 2025):

DHNB 2025 Tutorial, Tartu, Estonia: How to create a Linked Open Data service and semantic portal for your Cultural Heritage data. (in English), March 4, 2025.

Publication Events 2024–2025

Publications 2024–2025

Research articles related to the research above

    2025: 13 papers
    2024: 31 papers

One related dissertation accepted in 2024 (by Petri Leskinen) and one manuscript in pre-examination in 2025 (by Heikki Rantala), in addition to several MSc works.

D3.3.1: Automated metadata of archival data from NAF

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Automated metadata of archival data from NAF
Date of reporting: 04-06-2025

Report authors: Venla Poso (JYU), Ida Toivanen (JYU)
Contributors: Antero Holmila (JYU), Venla Poso (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Ilkka Jokipii (NAF)

Deliverable locations:

Description

The National Archives of Finland has been digitising its material at an increasing pace. For example, in 2019 it started piloting a mass digitisation project that aims to digitise over 135 kilometres of archival data. The aim of deliverable D3.3.1 was to develop machine learning methods for generating metadata, such as document type and journal number, from OCR-scanned archival materials to facilitate their analysis and information extraction. The goal has been to generate metadata that helps make large and varied data collections within the archives more usable. The development process has included creating deep learning (DL) models for named entity recognition (work started in 2022–2023) and for document type classification (2024–2025).

The research on archival data started with developing named-entity recognition (for example, of journal numbers) for state authority archives by (1) publishing annotation guidelines that aid the annotation process and account for the properties of archival data [1], and (2) DL modelling based on annotated archival data [2,3]. In addition to publishing a DL model trained with the annotated data [3], we evaluated an archival text model against a Finnish text model to determine how large an effect noise has in real-life cases and on the actual workings of the models [2].

Developing document type classification for noisy and diverse archival data has included collecting and annotating a new benchmark dataset from openly available archival data (to be published) and evaluating different DL model architectures for the task. As a result, we released an image-based model that classifies scanned documents into seven categories: cover page, card index, map, picture, running text, table or form, and newspaper (https://huggingface.co/jyu-digihum/findoctype). Our future work will entail adding a multimodal dimension to the current framework.

Development has been conducted in cooperation with the National Archives of Finland.

Publications

[1] Poso, V., Välisalo, T., Toivanen, I., Lipsanen, M., Kukkohovi, L., Kytöaho, R., Palander, S., Pohjola, M., Laitinen, V., Föhr, A., Abdelamir, A. & Niemi, J. (2025). NER annotation guidelines for archival data. University of Jyväskylä. URN: https://urn.fi/URN:NBN:fi:jyu-202501291584

[2] Toivanen, I., Poso, V., Lipsanen, M., & Välisalo, T. (2025). Developing named-entity recognition for state authority archives. In O. Holownia, & E. S. Sigurðarson (Eds.), DHNB2024 Conference Post-Proceedings (7). University of Oslo Library. Digital Humanities in the Nordic and Baltic Countries Publications. https://doi.org/10.5617/dhnbpub.12262

[3] Poso, V., Lipsanen, M., Toivanen, I., & Välisalo, T. (2024). Making Sense of Bureaucratic Documents: Named Entity Recognition for State Authority Archives. In Archiving 2024 Final Program and Proceedings (pp. 6-10). Society for Imaging Science & Technology. Archiving, 21. https://doi.org/10.2352/issn.2168-3204.2024.21.1.2

D3.2.1: Ingestion of structured data from FINNA

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.2: Report on Ingestion of structured data from FINNA
Date of reporting: 04-06-2025

Report author: Joona Manner (National Library of Finland, Finna Unit)
Contributors: Joona Manner, Juha Luoma, Julia Isotalo, Riitta Peltonen, Päivi Maria Pihlaja (National Library of Finland)
Deliverable location: https://github.com/NatLibFi/Finna-API-image-file-downloader

Description

The aim of the deliverable was to improve researchers’ access to vast image collections and related metadata for data-intensive research. Finna is a national infrastructure and discovery service maintained and developed by the National Library of Finland and providing access to collections of almost 500 libraries, archives and museums.

In this deliverable, we enhance Finna’s data reuse services to meet researchers’ needs and improve the technical features of the Finna Application Programming Interface (API) service. The deliverable contributes to the objective of connecting the research infrastructure to accruing data sources, enhancing researchers’ access to open data and enabling workflow automation.

We planned the technical improvements and guidance materials in consultation with researchers and other stakeholders, including an open survey questionnaire in August 2024 and a collaborative workshop in September 2024, which involved researchers from both the social sciences and humanities, as well as the IT Centre for Science (CSC).

The new API image file and metadata download system includes command-line Node.js scripts that allow users to download high-resolution images, with related metadata in JSON format, from Finna’s material providers based on a Finna search.

The script enables the downloading of thousands of high-resolution images without triggering Finna’s data rate limiter, which is itself necessary to protect Finna’s infrastructure against malicious attacks.
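
The idea of staying under a rate limit can be illustrated with a short sketch. The downloader itself is written in Node.js, so the Python below is only an illustration of the general technique and not the project's code. The function name `fetch_throttled`, the injected `fetch` callable and the `min_interval` default are assumptions made for this example; Finna's actual limits are not stated in this report.

```python
import time
from typing import Callable, Iterable, List, Optional


def fetch_throttled(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_interval: float = 1.0,  # hypothetical spacing, not Finna's real limit
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Call fetch(url) for each URL, starting requests at least
    min_interval seconds apart."""
    results: List[bytes] = []
    last_start: Optional[float] = None
    for url in urls:
        if last_start is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        results.append(fetch(url))
    return results
```

In practice, `fetch` could be any function that downloads one file, for example a small wrapper around `urllib.request.urlopen`. The `clock` and `sleep` parameters exist only so that the pacing can be checked without real waiting.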

Finna’s API image file and metadata download system will in the future also automatically create a report on possibly missing image files, which will help users and organisations solve these issues, improving Finna’s content quality in the long run.

The automated Node.js script requires an API key that users can generate with their personal Finna account. Creating a Finna account requires email confirmation. The API key feature will be available in autumn 2025. Until then, keys are provided on demand for individual research purposes.

The project has been in line with many of the National Library’s strategic objectives, including the objective of the Finna vision to promote the use of data as a resource.

Instructions in GitHub:
https://github.com/NatLibFi/Finna-API-image-file-downloader/releases/download/Demo_for_Workshops/Finna_API_instructions.pdf

Instructions will also be added under the Finna service guidance materials:
https://www.kiwi.fi/display/Finna/Finna+API+Documentation+In+English

The new features were presented and tested at the following events:

D3.3.2: Automated harmonisation and enrichment of metadata

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Automated harmonisation and enrichment of metadata
Date of reporting: 18-03-2025

Report authors: Akewak Jeba (University of Turku), Leo Lahti (University of Turku)
Contributors: Julia Matveeva (University of Turku), Muluh Geraldson (University of Turku)
Deliverable location: github.com/fennicahub (see below for specific outputs)

Keywords: data science, metadata, bibliographies, enrichment

Description

This deliverable provides resources for gathering, harmonizing, enriching, and summarizing structured metadata from the National Library of Finland, in particular the National Bibliography Fennica. The open data and workflows can be used in research, training, and outreach. Further metadata resources are available for complementary cultural heritage from archives, libraries, museums, and other actors. This deliverable expands the scope of the metadata collections that are seamlessly interlinked with the statistical environment, enhancing the integration of Finna and Finto with Fennica.

Earlier work with Fennica, including metadata harmonization and visualization workflows, is described in FIN-CLARIAH (2022–23) Deliverable D4.1.3, which focused on preparing and publishing the cleaned Fennica dataset along with interactive tools for analysis and presentation.

This deliverable consists of the following resources:

1. A systematic approach to retrieving Finna metadata into open computing environments is implemented as the open source R package finna. It uses the REST API and the OAI-PMH API for data retrieval. The release version is available through the CRAN repository.

2. Data science methods to enrich structured metadata from Finna and Fennica, based on actor cross-linking, are provided via the finto R package. The package provides fluent access to the Finto keyword service (finto.fi) from the R statistical environment and allows interaction with the Finto service. Examples of Fennica author enrichment using Kanto/Finto are available in the package vignette.

3. Data analysis and visualization techniques to support the research use of cultural heritage metadata collections are provided via the finna package and demonstrated in the package vignettes. Geospatial analysis and visualization of metadata from Finna and Fennica is further supported by the maintained geofi package.
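
As background for the OAI-PMH retrieval mentioned in item 1, the sketch below shows how a harvesting request is typically formed and how paging works. It is written in Python for illustration and does not show the interface of the finna R package. The verb `ListRecords`, the parameters `metadataPrefix`, `set` and `resumptionToken`, and the XML namespace come from the OAI-PMH 2.0 protocol; the base URL `https://example.org/oai` and the function names are placeholders assumed for this example.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def next_resumption_token(response_xml: str) -> Optional[str]:
    """Return the token for the next page, or None on the last page."""
    token = ET.fromstring(response_xml).find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would request the first URL, read the token from the response, and repeat with `resumption_token` set until `next_resumption_token` returns `None`.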

Resource links:

  • finna, an open source R package for collecting cultural metadata using the Finna API.
  • finto, an open source R package for retrieving vocabulary data and for enriching the metadata using the Finto API.
  • geofi, an open source R package for accessing and visualising Finnish geospatial data, whose maintenance this deliverable supports.

D4.1.2: Analysis Tools for Multimodal Born-digital Social Media

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on analysis tools for multimodal born-digital social media: Nordic Tweet Stream (NTS)
Date of reporting: 18-12-2024

Report author: Mikko Laitinen (UEF)
Contributors: Paula Rautionaho (UEF), Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location: https://nordictweetstream.fi/

Description

The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region covering over 11 years from 2013 to 2023. It is accessible through a graphic interface that allows users to search, subset, visualize, and download extremely large-scale user-generated data from one social media application.

The objective of this digital interface is to enable easy access to and distribution of born-digital data for basic research. We have recently witnessed the closing down of free access to various digital sources because of the APIcalypse (Bruns 2019) and feel that, despite restrictive measures by social media giants, it is extremely important to store cultural heritage from social media. We operate according to the FAIR Data Principles, which aim at making data findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).

The NTS provides data spanning from January 2013 to May 2023, encompassing over 900 million tokens from more than 73 million messages, generated by nearly 900,000 individuals. The dataset includes content in 73 languages. The largest languages are Swedish (c. 31 %), English (c. 26 %) and Finnish (c. 13 %). Detailed information on the material can be found on the Statistics pages of the interface.
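
To give a feel for these figures, the short calculation below derives two rough averages from them. The inputs are the rounded bounds quoted above ("over", "more than", "nearly"), so the results indicate orders of magnitude only and are not official corpus statistics; the variable names are chosen for this example.

```python
# Rounded figures quoted in the description above.
tokens = 900_000_000    # "over 900 million tokens"
messages = 73_000_000   # "more than 73 million messages"
users = 900_000         # "nearly 900,000 individuals"

tokens_per_message = tokens / messages
messages_per_user = messages / users

print(f"{tokens_per_message:.1f} tokens per message")
print(f"{messages_per_user:.1f} messages per person")
```

On these inputs, an average message is roughly a dozen tokens long, and an average account contributes on the order of eighty messages over the collection period.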

The NTS dataset is intended for use by researchers across various disciplines, including sociolinguistics, dialectology, social sciences, and cultural studies. It can serve as both primary data and supplementary material alongside structured corpus data. This interface is designed for users seeking quick access to the data. Advanced users, however, may prefer to utilize the download function to retrieve the data for further processing in other environments.

Publications

Laitinen, M., Lundberg, J., Levin, M., & Martins, R. M. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data. In DHN 2018 Digital Humanities in the Nordic Countries 3rd Conference: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference Helsinki, Finland, pp. 349–362. https://erepo.uef.fi/handle/123456789/6697

Events

NTS presented in the following event:

References

  • Bruns, Axel. 2019. After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566, doi: 10.1080/1369118X.2019.1637447
  • Wilkinson, M. D. et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. doi:10.1038/sdata.2016.18

 
D4.1.6: Enrich survey data with register data and unstructured text

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Enrich survey data with register data and unstructured text
Date of reporting: 12-12-2024

Report authors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Contributors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Deliverable location: https://cran.r-project.org/web/packages/finnsurveytext/index.html

Description

The finnsurveytext R package has been developed to aid researchers in analyzing responses to open-ended survey questions and other structured text data. This user-friendly tool facilitates reproducible analysis of text data by providing features such as summarizing response properties, identifying frequent words and phrases, visualizing responses, and generating concept network plots. The second version of the package, released in August 2024, integrates with the widely-used R package survey, allowing for survey design to be incorporated into the analysis. Although originally designed for analyzing text in Finnish, the package is versatile and can be used for text analysis in other languages as well.
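
The kind of summary described above, identifying frequent words and phrases, can be illustrated with a minimal sketch. finnsurveytext is an R package, so the Python below does not show its API; the function names, the simple regular-expression tokeniser and the sample answers are all assumptions made for this example, and no lemmatisation or survey weighting is attempted.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (toy tokeniser)."""
    return re.findall(r"\w+", text.lower())


def top_ngrams(responses: Iterable[str], n: int = 1, k: int = 3) -> List[Tuple[str, int]]:
    """Return the k most frequent n-word sequences across all responses."""
    counts: Counter = Counter()
    for response in responses:
        tokens = tokenize(response)
        # Slide a window of n tokens over each response separately,
        # so that phrases never span two different answers.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


# Hypothetical open-ended answers, invented for this example.
answers = ["Hyvä ruoka", "hyvä palvelu", "Hyvä ruoka ja palvelu"]
print(top_ngrams(answers, n=1, k=1))
print(top_ngrams(answers, n=2, k=1))
```

Counting per response keeps phrase boundaries intact, which matters for short survey answers where each response is an independent unit.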

The finnsurveytext R package was released on CRAN with two updates. The package is available on CRAN, and additional material is available on the website. An article on the package has been written; it is available on Zenodo and under review for the new DARIAH publication.

The results of the work package were presented at two events: an invited lecture at the Workshop on Survey Statistics 2024, held in Poznan, Poland, on 26–30 August, and a presentation at the Statistics Sweden and Örebro University Summer School 2024 on 28 August.

FIN-CLARIAH Deliverables

This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).

FIN-CLARIAH Funding period 2026-2029

Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.

Module 1: Natural Language Processing (NLP)

The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)

W1.1 Text processing and annotation environments

To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)

D1.1.1 Support common CLARIN formats like TEI (CSC/Martin Matthiesen). 2026-12
D1.1.2 Convert VRT to TEI and showcase the result in a compatible web interface like the KorAP platform used in German CLARIN. (CSC/Martin Matthiesen) 2027-07
D1.1.3 Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen) 2028-04
D1.1.4 Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen) 2029-10

W1.2 Speech processing and annotation

To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)

D1.2.1 Updated backend of existing ASRs (CSC/Sam Hardwick) 2026-10
D1.2.2 A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats) 2027-08
D1.2.3 Support additional future models and make the processing pipeline transparent for easy evaluation of suitability for data with elevated security requirements (CSC/Sam Hardwick) 2028-06
D1.2.4 Expansion and upgrade of Oulu Clarin-D centre to C or B status; provision of access to additional language resources sourced from multimedia social media content. (OU/Steven Coats) 2029-11

W1.3 Video processing and annotation

To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)

D1.3.1 Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes) 2026-06
D1.3.2 Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick) 2027-08
D1.3.3 Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes) 2028-09
D1.3.4 Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes) 2029-10

Module 2: Language Research Infrastructure (LRI)

This module takes care of the specialised language processing needs in the fields of language-based research. (L:UHEL/ARTS Krister Lindén)

W2.1 Processing Research Data

To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)

D2.1.1 Document the current options and fitness for purpose to use other processing environments, like supercomputers provided by CSC. (CSC/Martin Matthiesen) 2026-05
D2.1.2 Propose a proof-of-concept to address issues found in D2.1.1. (CSC/Martin Matthiesen) 2027-09
D2.1.3 Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen) 2028-06
D2.1.4 Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes) 2029-11

W2.2 Training environments

To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)

D2.2.1 Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelson) 2026-12
D2.2.2 Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelson) 2027-12
D2.2.3 Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelson) 2028-06
D2.2.4 Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelson) 2029-08

W2.3 Translation and Interpretation

To provide infrastructure for translation and interpretation research on fact checking and verification of LLM output. (L:UHEL/ARTS Tommi Jauhiainen; P:CSC; C:UTA, UEF)

D2.3.1 Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen) 2026-05
D2.3.2 Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) 2027-06
D2.3.3 Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) 2028-08
D2.3.4 A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen) 2029-10

W2.4 Terminology

To provide infrastructure for the terminology work in the Helsinki Term Bank for the Arts and Sciences (HTB) and related terminology development projects. (L:UHEL/ARTS Tiina Onikki; C:UVAASA)

D2.4.1 Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics. 2026-09
D2.4.2 Initiate and develop terminology groups on geography, social geography, and environmental sciences. 2027-12
D2.4.3 Initiate and develop terminology groups on social policy, economics, and political science. 2028-05
D2.4.4 Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences. 2029-11

Module 3: Structuring Data

This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)

W3.1 Data Management

To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)

D3.1.1 Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models 2026-05
D3.1.2 Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models 2027-09
D3.1.3 Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements 2028-04
D3.1.4 Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements 2029-10

W3.2 Data Ingestion

To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)

D3.2.1 Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and the FIN-CLARIAH infrastructure. (NLF/FINNA/Riitta Peltonen) 2026-11
D3.2.2 Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitating dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen) 2027-06
D3.2.3 Ingestion of in-copyright publications/web archive. Building a research environment for legal deposit material. (NLF/Aija Vahtola) 2028-12
D3.2.4 Ingestion of in-copyright publications/web archive. Piloting the research environment for legal deposit material with researchers. (NLF/Aija Vahtola) 2029-11

W3.3 Enrichment

To enable the systematic and detailed analysis of noisy datasets in different formats and thereby provide unseen possibilities for SSH research. (All the deliverables set to 2029 also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UTU Veronika Laippala; P:UEF, JYU, OU, UHEL/ARTS, UHEL/SOC, Aalto; C:UHEL/NLF)

D3.3.1 Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti) 2029-11
D3.3.2 Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola) 2027-11
D3.3.3 Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala) 2029-11
D3.3.4 Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila) 2029-11
D3.3.5 Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa) 2029-11

Module 4: Analyzing Structured Data

The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)

W4.1 Analytical Support for computational SSH

To enable researchers to utilise large born-digital data effectively and to focus on analysis rather than on the technical details of handling data that is often high-volume and high-velocity. (All the deliverables also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UEF Mikko Laitinen; P:JYU, OU, UHEL/SOC; C:UHEL/NLF)

D4.1.1 Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki) 2029-11
D4.1.2 Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen) 2029-11
D4.1.3 Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste) 2029-11
D4.1.4 Analysis tools for multimodal livestream data (JYU/Raine Koskimaa) 2029-11

Module 5: Information Interaction (IIA)

Interaction refers to the need 1) to collect information on how researchers interact with the RI in order to develop the tools and services accordingly, and 2) to offer education and consultation on how researchers can enhance their work by using the infrastructure, thus increasing the RI’s active user base. (L:TAU Sanna Kumpulainen)

W5.1 Evidence-Based Infrastructure Development

To provide a close dialogue with the user community to ensure the best possible development of the RI. (L:TAU Sanna Kumpulainen; P:UHEL/ARTS; C:UHEL/NLF, UTU, CSC, UHEL/SOC, AALTO, JYU, UEF, OU)

D5.1.1 Community engagement: Researchers using LLMs as research tools (TAU/Sanna Kumpulainen) 2026-06
D5.1.2 Educational resources for infrastructure tools and data (TAU/Sanna Kumpulainen) 2027-11
D5.1.3 Community engagement: User interaction with multimodal data (TAU/Sanna Kumpulainen) 2028-06
D5.1.4 Evidence-based infrastructure development: User experience and the feedback instrument (TAU/Sanna Kumpulainen) 2029-11


D2.1.1: Integrate environment for personal data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.1: Report on Integrate environment for personal data
Date of reporting: 30-09-2024

Report authors: Mietta Lennes (UH)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/sd-services/

Keywords for the deliverable page: sensitive data; confidential data; secure desktop; SD services

Description

In case a research dataset contains special categories of personal data or other types of confidential information that cannot be removed without hampering the research purpose, it may be necessary to use a secure environment for processing the data (cf. Deliverable 2.1.2 of the previous funding period of FIN-CLARIAH 2022-2023).

CSC – IT Center for Science provides Sensitive Data services for sharing and analyzing data securely from a web browser. The sensitive data files can be encrypted and uploaded via SD Connect, where they are available to the secure desktop instances of the members of the same project. The virtual machines for the secure desktops are configured and accessed via SD Desktop.

It is also possible to install and use special tools in the SD Desktop environment. Researchers who need to process audio and video material securely can now conveniently install tools such as ELAN (video and audio) or Praat (audio) for viewing, editing, annotating, querying and analyzing their data, or well-known command-line tools such as Whisper (automatic speech recognition), as part of their workflow in the secure environment. For faster access to audio and video files, an external volume can be selected when configuring the virtual machine.

We will continue testing, documenting and improving the functionalities of the SD Desktop with the users of the Language Bank. We are also looking into the possibility of the Language Bank using SD Desktop instances for providing individual users with restricted access to specific sensitive datasets. The SD services are still under active development and the remaining issues can be addressed in collaboration with the experts at CSC.

For researchers in the SSH fields, the step-by-step instructions for using the Sensitive Data services are now maintained on a support page in the online portal of the Language Bank of Finland.

 

 


D1.2.1: Data collection for minority languages

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 1.2: Data collection for minority languages
Date of reporting: 26-09-2024

Report authors: Martin Matthiesen (CSC)
Contributors: Wilhelmina Dyster (UH), Sjur Moshagen, Katri Hiovain-Asikainen (UiT)
Deliverable location: n/a

Keywords for the deliverable page: Finland-Swedish, Sámi

Description

In this workpackage, data is collected for two groups of minority languages: Swedish as spoken in Finland, and the Sámi languages spoken in Norway, Sweden and Finland.

The data collected during the Donera Prat campaign[1] is currently being transcribed manually. This work is expected to be ready by November 2024. The planned date for releasing the data for research is January 2025.

The data collection for Sámi languages focuses on the University of Tromsø and on the national broadcasting companies of the Nordic countries where the languages are spoken (NRK[2], SVT[3], YLE[4]). The broadcasters already have some of their Sámi material subtitled in a Sámi language and in their respective national languages, which makes it a valuable resource for research.

We reached a general understanding that the Language Bank of Finland can serve as the main sharing organisation for Sámi data, and we have already made test transfers of data from SVT and Tromsø. YLE’s Sámi data is available via KAVI[5]. Before the data can be shared via the Language Bank of Finland, technical and legal hurdles need to be overcome. On the technical side, we have already reached broad agreement: for example, the data from the various sources will be shared with little or no modification, and KAVI and Aalto University already have experience of collaborating on the LUMI supercomputer. The legal side seems to be the bigger challenge: NRK, SVT and YLE are currently investigating the legal implications of sharing their data via the Language Bank of Finland.

[1] Donera Prat https://svenska.yle.fi/a/7-10009203

[2] Norwegian Television: https://www.nrk.no/about/

[3] Swedish Television: https://omoss.svt.se/about-svt.html

[4] Finnish Television: https://yle.fi/aihe/about-yle

[5] The Finnish National Audio Visual Institute, https://kavi.fi/en/


D3.1.1: Comprehensive data versioning

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.1: Report on Comprehensive data versioning
Date of reporting: 25-09-2024

Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Keywords for the deliverable page: versioning, updates, differences

Description

The versioning mechanism has been tested with new data from the National Library. We discovered that we will likely need to change how data is packaged into zip files in order to avoid unnecessary growth of the versions stored in Allas.

Two potential users of the data, Erik Axelson and Ville Vaara (both UH), have been interviewed. Both interviews are summarized below.

Using the data set as a potential source for newer versions of the KLK dataset in Kielipankki. (Erik Axelson)

In 2024, FIN-CLARIN published a new version of ”The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1] (klk-fi-v2-1874-vrt for short). This version was created using data obtained directly from the National Library, since our harvesting mechanism was not quite ready at the start of the project. The NLF source data was extracted, tokenized, syntactically annotated and converted to the VRT format[3]. A list of the included publications was compiled[4], along with end user notes that document inconsistencies found after publication[5]. FIN-CLARIN has well-established processes for obtaining new copies from the National Library, and these copies are in a different internal format than the data provided in this workpackage[2]. However, the differences are small and the data is well suited as a basis for the next iteration. Since a new version of klk-fi-v2-1874-vrt is not planned during this project, we will demonstrate the required changes with a proof of concept.

Using the dataset as a basis for an Elasticsearch instance containing NLF data (Ville Vaara)

Another use case for the data is the Elasticsearch-based tool developed in WP4.3 of the previous FIN-CLARIAH development round[6]. In that use case, the NLF data is converted to JSON suitable as input for an Elasticsearch engine. When considering newer versions, it became clear that an easy way of finding the differences between versions would be a useful addition to the present implementation. The dataset is presently 10 TB in size. Comparing two datasets of that size (the present version and an earlier one) is something that should be done once, during the update, and provided to users as a service, which would make it easier to update indexes.

Next steps

Moving forward, we need to investigate the unnecessary growth of the versions and add functionality that makes incremental updates of derived datasets (as in the Elasticsearch case mentioned above) easier by providing the differences between versions in a machine-readable form. In deliverable 3.1.2, we will demonstrate the changes with working code.
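
As an illustration of what machine-readable differences between two versions could look like, the following is a minimal sketch and not the harvester's actual implementation. It assumes that each version is available as a directory tree of files; the function names file_digest, build_manifest and diff_manifests are hypothetical. Each version is summarised once as a manifest of content hashes, so later comparisons only touch the manifests rather than the full data.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    sha = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (by relative POSIX path) to its content digest."""
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """List the paths that were added, removed or changed between two manifests."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }
```

The result of diff_manifests is a plain dictionary, so it could be serialised (for example as JSON) and published alongside each version, letting a derived index re-process only the listed paths.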

References

[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401

[2] See the Harvester documentation for details.

[3] Introduction to VRT: http://urn.fi/urn:nbn:fi:lb-2023020121

[4] List of publications: http://urn.fi/urn:nbn:fi:lb-2023092801

[5] End user notes: http://urn.fi/urn:nbn:fi:lb-2023101001

[6] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

Donera prat (Lahjoita puhetta)

The Donera prat campaigns for Finnish and Finland-Swedish speech ended on 6.3.2024. Many thanks to all donors!

Since 16 June 2020, Yle, the former Vake Oy (Valtion kehitysyhtiö; currently Ilmastorahasto Oy) and the University of Helsinki have run the Lahjoita puhetta campaign for collecting Finnish speech. In a smaller Donera prat campaign that started in 2021, Finland-Swedish speech has also been collected. During the first year of the Finnish campaign, more than 3000 hours of speech were donated. Recently, however, very few donations have come in.

The donation campaigns for Finnish and Finland-Swedish speech have now ended. The datasets will be organised and stored by the Language Bank of Finland (Kielipankki). Through the Language Bank of Finland, researchers and companies can get access to the Donate Speech datasets under specific conditions. We hope that the data will help both researchers and companies to build better models of Finnish and Finland-Swedish speech and to develop future services that can easily be used in Finnish and Finland-Swedish.

Updated: 6.3.2024


D1.1.2: Ingesting new unstructured resources

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.1: Report on ingesting new unstructured resources
Date of reporting: 30-11-2023

Report authors: Mietta Lennes, Jussi Piitulainen (University of Helsinki)
Contributors: Ute Dieckmann, Erik Axelson, Jyrki Niemi, Jack Rueter, Tommi Jauhiainen, Krister Lindén (University of Helsinki)
Deliverable location: Corpora and tools available via the Language Bank of Finland

Keywords for the deliverable page: corpus, data set, automatic language identification

Description

The Newspaper and Periodical Corpus of the National Library of Finland was extended with a significant amount of new material from the National Library. The new version was organized according to the automatically identified language of each sentence. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (klk-fi-v2), consisting of more than 22 billion word tokens, was published in Korp in summer 2023. It consists of the text elements that contain at least one ”fin” sentence (from the new material, from the previous version of klk-fi, and from the previous klk-sv). Moreover, the summary attributes indicate the frequency distribution of languages within each text and each paragraph. An extended version of the Swedish sub-corpus (klk-sv-v2) has been compiled in a similar way (any ”swe” in a text), but the Swedish data is currently still waiting for the rest of the annotations to be completed. For details of the reorganization process of the National Library data according to language, see Jauhiainen et al. 2022.

The HeLI-OTS language identification tool was adapted for the format used in the Language Bank of Finland, together with a post-processor written to correct the identification of each sentence within its context. Another new tool was written to partition the corpus, first by the main identified languages, then by the year of publication.
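
The selection and partitioning logic described above can be sketched roughly as follows. This is an illustrative simplification, not the actual tool: it assumes that every sentence has already been assigned a language code, and the record layout (dictionaries with the keys id, year and sentence_langs) and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def language_distribution(sentence_langs):
    """Count how many sentences of one text were identified as each language."""
    return Counter(sentence_langs)


def partition_by_year(texts, lang):
    """Select the texts with at least one sentence in `lang`, grouped by year.

    Each text is a dict with the (assumed) keys "id", "year" and
    "sentence_langs", the last being a list of per-sentence language codes.
    """
    by_year = defaultdict(list)
    for text in texts:
        if lang in text["sentence_langs"]:
            by_year[text["year"]].append(text["id"])
    return dict(by_year)


# Hypothetical sample records, for illustration only.
texts = [
    {"id": "t1", "year": 1850, "sentence_langs": ["fin", "swe", "fin"]},
    {"id": "t2", "year": 1850, "sentence_langs": ["swe"]},
    {"id": "t3", "year": 1860, "sentence_langs": ["swe", "fin"]},
]
fin_by_year = partition_by_year(texts, "fin")
```

In this sketch a text that mixes languages ends up in more than one sub-corpus, which mirrors the "at least one sentence" criterion stated above; the per-text counts from language_distribution correspond to the summary attributes.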

As a demonstration of ingesting resources including parallel spoken material in multiple languages, the corpus Christmas Gospel text-to-speech in four Uralic languages was prepared and made available for searching and playback via Korp (for details on this effort, see D2.3.2).

Other corpora published in Korp during the years 2022-23 include, e.g., the Finnish News Agency Archive 1992-2018, Kielipankki Korp Version; Corpus of Contemporary American English (COCA) – Kielipankki Korp version 2020 and Erzya and Moksha Extended Corpora (ERME) version 2, Korp.

In addition, various downloadable resources were published, e.g., Corpus of Contemporary American English – Kielipankki VRT version 2020; FinnTreeBank 1, 2 and 3; Word embeddings trained with word2vec from the Finnish Text Collection; The Coronavirus Corpus (Mark Davies, english-corpora.org) – Kielipankki version 2021-05; and The Finnish Dark Web Marketplace Corpus.

During the project, the resource publication pipeline of the Language Bank of Finland has been refined and documented. The structure of the pipeline was first presented at the CLARIN Annual Conference in 2022 and described in the conference proceedings (Dieckmann & al., 2023, see below).

Publications

  • Jauhiainen, T., Piitulainen, J., Axelson, E., Lindén, K. (2022) Language diversity in the newspaper and periodical corpus of the National Library of Finland. Poster presented at Digital Research Data and Human Sciences (DRDHum), 1.-3.12.2022, Jyväskylä, Finland.
  • Dieckmann, U., Lennes, M., Piitulainen, J., Niemi, J., Axelson, E., Jauhiainen, T., Lindén, K. (2023) The Pipeline for Publishing Resources in the Language Bank of Finland. Erjavec, T., Eskevich, M. (editors), Selected Papers from the CLARIN Annual Conference 2022, pp. 33-43. Linköping University Electronic Press.


DX.Y.Z: Title of Deliverable

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP x.y: Report on <topic of the deliverable>
Date of reporting: dd-mm-2025

Report authors: Firstname Lastname (Organization)
Contributors: Firstname Lastname (Organization)
Deliverable location: <link to, e.g., a GitHub repository, or other external location that includes further information or relevant content>

Keywords for the deliverable page: (any relevant keywords separated with semicolons; for search engines etc.)

Description

The description text (max. 3000 characters) may include the following, if applicable:

  • Links to external resources
  • Publications, if any (including DOI)
  • Events, if any (including links)
  • Insert the following text to the end of your report: FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

The publication-ready deliverable should be emailed as a MS Word document (or similar) to wilhelmina.dyster (ATT) helsinki.fi, Cc:krister.linden (ATT) helsinki.fi.

Deadline for deliverables due 2025-09: Send the content for your deliverable page by 22.09.2025.

FIN-CLARIAH Deliverables


This page showcases the project deliverables (see template and instructions for reporting).

FIN-CLARIAH Funding period 2024-2025
FIN-CLARIAH Funding period 2022-2023 (Completed)

FIN-CLARIAH Funding period 2024-2025

Module 1: Natural Language Processing (NLP)

W1.1 Text processing and annotation environments

D1.1.1 Named-entity annotation 2024-09
D1.1.2 Ingesting new unstructured resources 2025-12

W1.2 Speech processing and annotation

D1.2.1 Data collection for minority languages 2024-09
D1.2.2 Transcription service for minority languages 2025-09

W1.3 Video processing and annotation

D1.3.1 Tools and guidelines for video processing 2025-06

Module 2: Language Research Infrastructure (LRI)

W2.1 Personal and Copyrighted Research Data

D2.1.1 Integrate environment for personal data 2024-09
D2.1.2 Framework for processing copyrighted data for verification of research 2025-09

W2.2 Training environments

D2.2.1 Transformer training for specialised data 2025-06 (originally 2024-12)
D2.2.2 Transformer adaptation for specialised data 2025-12

W2.3 Translation and Interpretation

D2.3.1 Remote access to text data repositories 2025-09 (originally 2024-12)
D2.3.2 Remote access to video data repositories 2025-12

W2.4 Terminology

D2.4.1 Term definition discovery procedures 2024-09
D2.4.2 Initializing terminology collections 2025-12

Module 3: Structuring Data

W3.1 Data Management

D3.1.1 Comprehensive data versioning 2024-09
D3.1.2 Workflow automation and version syncing 2025-09

W3.2 Data Ingestion

D3.2.1 Ingestion of structured data from Finna (NLF) 2025-06 (originally 2025-03)
D3.2.2 Ingestion of heritage and societal data from Sampo 2025-06
D3.2.3 Ingestion of multimodal societal data from the Web 2025-12

W3.3 Enrichment

D3.3.1 Automated metadata of archival data from NAF 2025-06 (originally 2025-03)
D3.3.2 Automated harmonisation and enrichment of metadata 2025-03 (originally 2024-12)
D3.3.3 Machine-learning-based enrichment of social media 2025-09 (originally 2025-06)
D3.3.4 Machine-learning-based enrichment of textual and audio-visual social media contents 2025-11
D3.3.5 Forensic linguistics corpus and search interface C.R.I.M.E 2025-09
D3.3.6 Reliable image labelling with computer vision 2025-09

Module 4: Analyzing Structured Data

W4.1 Analytical Support for computational SSH

D4.1.1 Analysis of video stream interactions with AI solutions 2025-09 (originally 2025-06)
D4.1.2 Analysis Tools for Multimodal Born-digital Social Media 2024-12
D4.1.3 Analysis of interactions and regional language variation in social media 2025-12
D4.1.4 Analysis of multimodal properties of naturalistic speech 2025-12
D4.1.5 Analysis of multimodal cultural heritage 2025-12
D4.1.6 Enrich survey data with register data and unstructured text 2025-06

Module 5: Information Interaction (IIA)

W5.1 Evidence-Based Infrastructure Development

D5.1.1 Community engagement: multimodal societal data researchers 2024-09
D5.1.2 Community engagement: multimodal heritage researchers 2025-06
D5.1.3 Evidence-based infrastructure development 2024-12
D5.1.4 Educational resource development 2025-12


FIN-CLARIAH Funding period 2022-2023

Completed

Module 1: Natural Language Processing (NLP)

W1.1 Text processing and annotation environments

D1.1.1 Updating LBF resource selection 2022-09
D1.1.2 Ingesting new unstructured resources 2023-12

W1.2 Speech processing and annotation

D1.2.1 Forced-Alignment Service 2022-09
D1.2.2 Transcription Service for Finnish Interviews 2023-09

W1.3 Noise-tolerant NLP

D1.3.1 Corpora of non-standard language 2022-09
D1.3.2 System for detecting toxic language 2023-06
D1.3.3 Models for retrieving QA pairs from the web 2023-09
D1.3.4 QA pair corpora 2023-12

Module 2: Language Research Infrastructure

W2.1 Social Data Science

D2.1.1 Licensing agreements for personal data 2022-09
D2.1.2 Licensing agreements for special categories 2023-06

W2.2 Learners’ Assessment Environments

D2.2.1 Speech recognition for L2 2022-12
D2.2.2 Speech recognition for L2 update 2023-12

W2.3 Translation and Interpretation

D2.3.1 Licensing interpretation sessions 2022-12
D2.3.2 Aligning and retrieving 2023-12

W2.4 Terminology

D2.4.1 Term discovery procedures 2022-09
D2.4.2 Terminology application 2023-06
D2.4.3.1 Initializing terminology collections 2022-09
D2.4.3.2 Initializing terminology collections 2023-06
D2.4.3.3 Initializing terminology collections 2023-12

W2.5 Solutions for better use of language learner performances in research

D2.5.1 Test performances storage 2022-12
D2.5.2 Analysis and annotation tools for learner performances 2023-12

Module 3: Structuring Data

W3.1 Increasingly automated ingestion of material

D3.1.1 Initial NLF data 2022-09
D3.1.2 Ingestion framework 2022-12
D3.1.3 Versioning support 2023-06
D3.1.4 Incremental update process 2023-12

W3.2 AI solutions to better use of National Archives mass digitisation services

D3.2.1 Pipeline for transferring archival data 2023-06 (originally 2022-12)
D3.2.2 Annotation & analysis tools for NARC data 2023-12

W3.3 AI solutions to better use of textual qualitative survey data

D3.3.1 Qualitative survey data concept network 2022-09
D3.3.2 R package for data concept network 2023-12 (originally 2023-09)

W3.4 Developing analysis methods for real-time chats in gameplay streams

D3.4.1 Livestream data collector 2022-12

W3.5 Developing analysis methods for text network analysis of political texts

D3.5.1 Text network analysis of political texts 2023-06 (originally 2022-12)
D3.5.2 Text network analysis of political texts 2023-12 (originally 2023-09)

Module 4: Analyzing Structured Data

W4.1 Metadata harmonization and analysis

D4.1.1 Harmonized FNB 2022-09
D4.1.2 Harmonization code 2022-12
D4.1.3 Visualisation workflow 2023-06
D4.1.4 R/Python module 2023-12

W4.2 Linked Open Data Services

D4.2.1 LDF knowledge extraction tools 2022-12
D4.2.2 Parliament of Finland Ontology 2023-12

W4.3 Subsetting data

D4.3.1 Subsetting tool 2022-09
D4.3.2 Statistical overviews and bias detection 2023-06
D4.3.3 Representative Twitter dataset 2023-12

Module 5: Information Interaction

W5.1 Evidence-based RI development

D5.1.1 User experience questionnaire 2022-09
D5.1.2 Log data collection and analysis 2023-06
D5.1.3 Protocol for collecting workshop data 2023-12

W5.2 Education and dissemination

D5.2.1 Actor network 2022-12
D5.2.2 Educational material 2023-12


Kielipankki Live

Kielipankki Live is a series of online events in which researchers are interviewed and current topics related to the Language Bank of Finland (Kielipankki) are discussed. The presentations recorded at the events are published afterwards on YouTube (see the links under the previous events). To stay up to date on Kielipankki Live events and other Language Bank news, subscribe to the newsletter!

Next Kielipankki Live: 14.12.2020, 13:00-15:00

Main topic: Research data containing speech and the related data protection practices
Expect expert guests and discussion! The presentations are given in English, but questions can also be asked in Finnish. The event starts at 13:00 and ends flexibly, at 15:00 at the latest.

Programme

  • Mietta Lennes: Current matters at the Language Bank
  • Krister Lindén: A brief on legal questions concerning language resources
  • Interview with Rosa González Hautamäki and Tomi Kinnunen: Experiences of collecting and sharing the AVOID corpus and other speech data for speech technology research
  • Satu Saalasti: The DELAD project aims at sharing corpora of disordered speech with researchers
  • Aleksi Rossi: A brief status update on the Lahjoita puhetta campaign
  • Questions & Answers: Ask the Language Bank staff and experts
  • Open discussion

Registration

Register for the event using this form by 11.12.2020 at the latest. When registering, you can submit questions to the guest researchers and the Language Bank experts. You can also ask questions and join the discussion during the event.

A link for joining on the Zoom platform will be sent before the event begins to everyone who has registered in advance. After advance registration has closed, you can still get the joining link by sending an email to fin-clarin [AT] helsinki.fi.

Kielipankki Live events are recorded

Please note that Kielipankki Live events are recorded and that the key parts of the video recording are published online afterwards. If you do not want your image or voice to appear in the recording, please keep your camera and microphone switched off during the event. You can also take part in the discussion via chat. The names and contact details of the participants are not published.


All Kielipankki Live events

  • 14.12.2020, 13:00-15:00 (Register for the event)
  • 24.8.2020

The XLVI Annual Conference of Linguistics (XLVI Kielitieteen päivät), 16-18 May 2019

will be organized in Joensuu by the University of Eastern Finland. The theme of the conference is language, life and society. The Language Bank of Finland will be present during the conference, and especially on the morning of Friday 17 May you might notice some people wearing a pale blue t-shirt with a happy piglet… Come and talk to us, visit our stand or see our presentations!

Preliminary schedule of the presentations related to the Language Bank of Finland:

  • Thursday 16.5. 16:30 room AG106 / Selkokielen työpaja (Klaara-verkosto):
    Kielipankin selkosuomen aineistot (The Easy-to-read Finnish corpora in the Language Bank of Finland; Hanna Westerlund)
  • Friday 17.5. 10:00-10:30 room AG101:
    Kielipankin kiertue 2019: Työkalut, aineistot ja muut palvelut (Kielipankki Roadshow 2019: Tools, corpora and other services; Mietta Lennes)

Updated programme and further information about the Annual Conference of Linguistics

Come and meet Kielipankki, the Language Bank of Finland, at its stand during the conference!

Introduction to the Language Bank of Finland at the workshop “Digital Parliamentary data and research”

Friday 3 May at 12.00
Aalto University (Otaniemi), CS-Building, Room T4 / A238 (Konemiehentie 2)

The aim of the workshop was to discuss the novel digital parliamentary datasets (in particular those of the Parliament of Finland), their use in research, the related research resources and tools, and their future development for researchers as well as for citizens and the media. FIN-CLARIN and the Korp version 1.1 of the Plenary Sessions of the Parliament of Finland, available in the Language Bank of Finland, were also presented during the afternoon.

Mietta Lennes: FIN-CLARIN and Parliamentary Data in Kielipankki – the Language Bank of Finland (PowerPoint / PDF slides)

Further information including the programme of the workshop can be found at https://www.helsinki.fi/en/helsinki-centre-for-digital-humanities/workshop-digital-parliamentary-data-and-research.


Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317

More contact information