
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.2: Report on Transformer adaptation for specialised data
Date of reporting: 25-11-2025
Report author: Erik Axelson, Jack Rueter (University of Helsinki)
Contributors: Jack Rueter (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: N/A
In this work package, we aim to provide an MCP server that facilitates linking finite-state (FST) tools with LLMs for less technically oriented users.
MCP (Model Context Protocol) provides a powerful new opportunity to bring large language model (LLM) capabilities into the research and learning of low-resource languages by creating a bridge between rule-based, finite-state linguistic tools and modern LLM-based chatbots. By hosting HFST [1] analyzers and open-source dictionaries designed and authored by individuals and teams at GiellaLT [2] and Apertium [3], accessed through the UralicNLP [4] library, on an MCP server, even users with no technical background — and working from a laptop or cellphone — can access lemmatizers, morphological analyzers, and translation dictionaries for dozens of minority languages. This approach opens the door to more inclusive language technology, making advanced tools available to communities that have historically lacked computer-aided support.
We have familiarized ourselves with running a local MCP server from a laptop and have run into memory issues. A so-called free server with larger memory at CSC would provide an ideal solution for individual users, as the server would host the model. Some language communities might want their specific language data to be kept private, i.e., differentiated access to this material would be needed. The Language Bank of Finland is making plans for the installation of an MCP service to allow extensive testing.
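As a minimal illustration of the idea, the following Python sketch exposes UralicNLP lemmatization and analysis as MCP tools. It assumes the official MCP Python SDK (FastMCP) and that the relevant GiellaLT/Apertium models have been downloaded with uralicApi.download(); the server and tool names are illustrative only.

from mcp.server.fastmcp import FastMCP          # official MCP Python SDK
from uralicNLP import uralicApi                 # UralicNLP wrappers for HFST models

mcp = FastMCP("uralic-tools")                   # illustrative server name

@mcp.tool()
def lemmatize(word: str, language: str = "sms") -> list[str]:
    """Return the possible lemmas of a word, e.g. for Skolt Sami ('sms')."""
    # Requires the language model to be installed first: uralicApi.download(language)
    return uralicApi.lemmatize(word, language)

@mcp.tool()
def analyze(word: str, language: str = "sms") -> list[str]:
    """Return the morphological analyses produced by the HFST analyzer."""
    return [str(analysis) for analysis in uralicApi.analyze(word, language)]

if __name__ == "__main__":
    mcp.run()                                   # serves over stdio by default

A chatbot connected to such a server could then call these tools on demand, so the rule-based analyses remain authoritative while the LLM handles the conversational interface.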
[1] HFST – Helsinki Finite-State Technology
[2] GiellaLT – an infrastructure for rule-based language technology aimed at minority and indigenous languages
[3] Apertium – a free/open-source machine translation platform
[4] UralicNLP – an NLP library for Uralic languages
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.1: Report on Framework for processing copyrighted data for verification of research
Date of reporting: 28-11-2025
Report authors: Mietta Lennes (UH)
Contributors: Sirpa Kovanen (UH), Krister Lindén (UH), Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/data-management/dela/
Keywords for the deliverable page: copyrighted data, personal data, social media data, data protection, safeguards
Researchers in Social Sciences and Humanities often need to use data collected from social media platforms. Currently, the reuse of social media data for research purposes is legally challenging. Part of the content originating from social media is usually protected by copyright or related rights. Social media postings (often including images and videos) may also contain personal data. The terms of use of social media platforms tend to be volatile and non-transparent, and individual permissions cannot be requested due to the large number of potential rightholders and data subjects.
Since neither the related EU regulations nor the Finnish legislation are well established in current legal practice, the possibilities for depositing research data from social media must be considered on a case-by-case basis. It may be possible to archive data obtained from social media and make it available for restricted purposes under certain conditions, according to Section 13 b of the Finnish Copyright Act (Tekijänoikeuslaki 13 b §), concerning data mining.
Two social media datasets have been suggested for deposition to the Language Bank of Finland: Finnish presidential elections 2024 in social media (somepressa24), collected by researchers at UHEL, and Nordic Tweet Stream 2013-2023 (nts), collected by a team at UEF; both teams participate in the FIN-CLARIAH project. Using the potential redistribution of these two resources as an example, the legal advisors at UHEL reviewed the current legal risks and restrictions. The negotiations for depositing the first dataset are nearly complete: the dataset is to be delivered to the Language Bank in December 2025 and made available under a RES category license in early 2026. After the first experiences with somepressa24 at UHEL, we aim for a similar deposition agreement with UEF regarding the nts dataset.
The Language Bank of Finland offers frameworks, instructions and technical solutions for deposition agreements and end-user licenses, for access management (the Language Bank Rights system at CSC), and for data encryption or secure processing in a restricted environment where necessary (the SD services at CSC). Step-by-step instructions for using the Sensitive Data services (cf. Deliverable 2.1.1), including the secure SD Desktop environment, are now available in both Finnish and English for researchers in Social Sciences and Humanities. The Language Bank also collects and shares links to the privacy notices published by the users of the Language Bank.
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Advanced analytic social media tools and data
Date of reporting: 26-11-2025
Report author: Mikko Laitinen (UEF)
Contributors: Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location:
Keywords: social media corpora; social network tools; ego networks; gender
Our work has resulted in four massive social media corpora built from a single social media application. The purpose is to enable research access to large-scale, curated social media data, which is often a bottleneck in SSH (Laitinen & Rautionaho 2025). The four datasets are named Digital Social Network Corpora (DSN), as they consist not only of user-generated texts but also of detailed information about people’s social networks. They cover four geographic areas: Australia (DSN Ozzie), the Nordic countries (DSN Nordic), the United Kingdom (DSN British), and the United States (DSN America).
In total, they include 19,345 ego networks, each consisting of a central node (ego), its directly connected neighbors (alters), and the connections between the alters. These networks were filtered using a semi-automated method to target what we call genuine human accounts, meaning that we aimed to exclude accounts with unusual network qualities, such as bots, celebrities, politicians, organizations, and businesses. Recreating a dataset comparable to the DSN corpora under the current paid data access policies of the social media application (X) would cost over 3 million euros and take around 58 years.
The resulting datasets are extremely large but contain carefully curated social networks with user-generated textual material. The network datasets contain material from 829,608 users, and the data range from 2006 to 2023. Altogether, they contain more than 700 million messages and nearly 10 billion words keyed in by users.
With their detailed structure, massive size, and coverage over 17 years, the DSN corpora support new research and enable re-examining old questions in the humanities. A case in point is the role of weak ties in the spread of innovations, where prior empirical evidence in sociolinguistics comes from ethnographic observations based on very small networks. One clear limitation of ethnographic network investigations is that participant observation methods are limited to networks of 30–50 individuals. The networks in the DSN corpora are substantially larger and close to average human networks in general, making it possible to investigate a variety of networks of different sizes and structures.
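To make the notion of an ego network concrete, the sketch below extracts one from a toy graph with the networkx Python library; the graph and node names are invented for illustration and are not drawn from the DSN corpora.

import networkx as nx

# Toy follower graph; in the DSN corpora the nodes would be (pseudonymised) users.
G = nx.Graph()
G.add_edges_from([
    ("ego", "a"), ("ego", "b"), ("ego", "c"),   # ego-alter ties
    ("a", "b"), ("b", "c"),                     # ties between alters
])

# radius=1 keeps the ego, its alters, and all edges among them.
ego_net = nx.ego_graph(G, "ego", radius=1)

print(ego_net.number_of_nodes(), ego_net.number_of_edges())  # 4 nodes, 5 edges
print(nx.density(ego_net))   # one simple structural measure per network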
Publications:
Laitinen, Mikko & Paula Rautionaho. 2025. Reuse of social media data in corpus linguistics. International Journal of Corpus Linguistics. doi: 10.1075/ijcl.24136.lai
Masoud Fatemi & Mikko Laitinen. 2025. From tweets to networks: Introducing four large network-based social media corpora. CLARIN Annual Conference Proceedings, 2025. Ed by Cristina Crisot and Thalassia Kontino. Vienna, Austria, 2025. pp. 100–104. (https://www.clarin.eu/sites/default/files/CLARIN2025_ConferenceProceedings.pdf)
Events:
CLARIN 2025 conference Vienna 30 Sept – 2 October 2025 (https://www.clarin.eu/event/2025/clarin-annual-conference-2025)
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Machine-learning-based enrichment of textual and audio-visual social media contents
Date of reporting: 20-11-2025
Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable locations:
Keywords: video clip analysis; multimodal; MLLM; video summarization; data enrichment; Twitch
The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip contains a vast amount of diverse information: the visual action, the auditory context from caster commentary, and the text-based reactions from the live chat, all representing dense and valuable data for understanding online communities. However, the sheer volume and complexity of this data create a need for efficient analysis tools. Our previous tools have focused on chat analysis or chat content detection [1, 2].
This deliverable presents a continuation of the D4.1.1 tool for the automated understanding and enrichment of such clips. The tool is powered by state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family, guided by a multi-step Chain-of-Thought prompt. The prompt instructs the MLLM to focus on data enrichment, systematically analyzing the clip’s metadata, audio-visual content, and chat log, and to produce a JSON file.
This structured JSON data is organized into three parts. The analysis begins with the audiovisual analysis of the video content: it identifies the key entities involved, logs chronological actions in the video, transcribes the on-screen text, and breaks down the caster commentary into key quotes and emotional tones. Next, the “chat reaction” section summarizes how the audience reacted, including the community jargon used, and provides a glossary explaining its cultural meaning. Finally, the “causal synthesis” connects the two modalities: it provides a narrative summary explaining why the clip matters and establishes direct causal links between the audiovisual triggers and the exact chat reactions they caused.
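For illustration, the shape of the output might resemble the following sketch (written here as a Python dictionary mirroring the JSON; all field names and values are invented examples rather than the tool’s exact schema).

# Illustrative shape of the enrichment JSON produced per clip
# (field names are examples only, not the exact schema used by the tool).
example_output = {
    "audiovisual_analysis": {
        "key_entities": ["streamer", "opposing team"],
        "chronological_actions": [
            {"timestamp": "00:12", "action": "streamer wins a 1v3 fight"},
        ],
        "on_screen_text": ["VICTORY"],
        "caster_commentary": {
            "key_quotes": ["No way he pulled that off!"],
            "emotional_tone": "excited",
        },
    },
    "chat_reaction": {
        "reaction_summary": "chat spams hype emotes after the fight",
        "glossary": {"PogChamp": "emote expressing excitement or surprise"},
    },
    "causal_synthesis": {
        "narrative_summary": "An unlikely 1v3 win triggers a wave of hype in chat.",
        "causal_links": [
            {"trigger": "1v3 fight at 00:12", "reaction": "PogChamp spam"},
        ],
    },
}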
All generated analyses are automatically saved and accessible within the video_descriptions category of the data viewer section.
Publications
[1] Jari Lindroos, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Ida Toivanen. ”From PogChamps to Insights: Detecting Original Content in Twitch Chat.” In Hawaii International Conference on System Sciences, pp. 2542-2551. Hawaii International Conference on System Sciences, 2025. https://doi.org/10.24251/hicss.2025.308
[2] Jari Lindroos, Ida Toivanen, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Sami Äyrämö. ”Participant profiling on Twitch based on chat activity and message content.” In International GamiFIN Conference, pp. 18-29. CEUR Workshop Proceedings, 2025. https://ceur-ws.org/Vol-4012/paper18.pdf
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.2: Report on Ingestion of multimodal societal data from the Web
Date of reporting: 20-11-2025
Report authors: Matti Nelimarkka (University of Helsinki), Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Matti Nelimarkka (University of Helsinki), Denis Davydov (University of Helsinki), Anita Braida (University of Helsinki), Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable locations:
Keywords for the deliverable page: Twitch, YouTube, chat data, video data
This deliverable focuses on infrastructures for the acquisition of multimodal and societal data harvested from the web. The task includes the implementation and maintenance of data collection tools for the most popular Finnish discussion forums, YouTube, and Twitch. This deliverable contains two parts: part A, conducted by the Centre for Social Data Science, University of Helsinki, and part B, by the University of Jyväskylä.
PART A: FINNISH DISCUSSION FORUMS
To ensure that researchers have access beyond global platforms (where data collection is a shared global concern), the University of Helsinki builds and maintains forum scrapers that extract user-generated content from sources including vauva.fi, kaksplus.fi, and the comment sections of yle.fi and hs.fi. These can be used through a command-line interface that produces the content as a CSV file for further analysis. We also provided modifications to the 4CAT platform (https://4cat.nl/) to ensure it correctly handles Finnish-language content.
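As a rough illustration of the scraper-to-CSV workflow (not the actual implementation; the URL and HTML selectors below are hypothetical placeholders), the general pattern in Python looks like this:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical thread URL and CSS selectors; the real scrapers encode
# per-forum parsing rules for vauva.fi, kaksplus.fi, yle.fi and hs.fi comments.
THREAD_URL = "https://example-forum.fi/thread/12345"

html = requests.get(THREAD_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for post in soup.select("div.post"):                      # placeholder selector
    rows.append({
        "author": post.select_one(".author").get_text(strip=True),
        "timestamp": post.select_one(".timestamp").get_text(strip=True),
        "text": post.select_one(".body").get_text(" ", strip=True),
    })

with open("thread_12345.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "timestamp", "text"])
    writer.writeheader()
    writer.writerows(rows)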
PART B: YOUTUBE CHAT COLLECTOR & TWITCH VIDEO COLLECTOR
The team from the University of Jyväskylä presents a continuation of the deliverable for the Twitcher data collection tool. We present newly added features, such as the option to collect chat data from YouTube from either live or past broadcasts. The collected YouTube chat data can be viewed in the data viewer section and is also automatically saved in CSC Allas. We also implemented the option to collect videos from past Twitch broadcasts for use with the video clip analysis tool presented in D3.3.4 and D4.1.1.
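For illustration, chat collection from live or past broadcasts can be done in Python, for example with the chat-downloader package (shown here only as an illustrative approach, not necessarily the library used by Twitcher):

from chat_downloader import ChatDownloader

# Works for both live streams and past broadcasts on YouTube and Twitch;
# the URL below is a placeholder.
url = "https://www.youtube.com/watch?v=VIDEO_ID"

chat = ChatDownloader().get_chat(url)
for message in chat:
    # Each message is a dict with fields such as author, message and timestamp.
    author = message.get("author", {}).get("name")
    print(message.get("time_text"), author, message.get("message"))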
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Analysis of multimodal cultural heritage
Date of reporting: 20-11-2025
Report author: Ilkka Lähteenmäki (University of Oulu)
Contributor: Ilkka Lähteenmäki (University of Oulu)
Deliverable location: 10.5281/zenodo.17700648
This paper examines whether historians and cultural heritage researchers can justifiably depend on multimodal AI systems for accessing large visual collections, from a social epistemology point of view. Building on Inkeri Koskinen’s “necessary trust view” and Jakob Ortmann’s account of task-specific epistemic reliance, it argues that digital history and cultural heritage form a non-typical setting for the current social epistemology of AI. In contrast to the physical sciences, where AI tools such as AlphaFold are embedded in long-standing evaluation regimes and well-defined tasks, historical research involves open-ended, exploratory questions, fuzzy and historically shifting concepts, and interpretive practices centred on individual researchers and small teams.
The paper uses examples from recent proposals for using multimodal AI for text-to-image, image-to-text and image-to-image retrieval, and for AI-assisted metadata generation and “distant viewing” of images. It shows how hopes for a multimodal turn in digital humanities confront the essential epistemic opacity of deep neural networks and the difficulty of evaluating reliability for complex, open-ended retrieval tasks. Three suggested mitigation strategies are discussed: critical analyses of models and training data; historically informed reflection on bias and concept change; and fine-tuning or post-processing of models for specific purposes. From a social epistemology perspective, each strategy encounters limits when generalised to research infrastructure meant to support many corpora, tasks and user communities.
The paper then turns to approaches that argue for using multimodality theory to design metadata schemas and guide AI-based annotation. It shows how this is an attempt to shift epistemic trust from AI systems back to scholars (at least partially) in an effort to make use of the developing technology. However, this reopens old debates between theories of meaning. Especially with image data, the theoretical discussion of how the meanings of images should be established, and whether these theories are implementable in computational models, needs to be explored. A couple of examples from contemporary photography and medieval manuscript research illustrate both the potential of AI-supported exploration and the need for additional contextual and theoretical work to render outputs historically interpretable.
The central claim is that, given the essential epistemic opacity of AI, justified epistemic dependence in history and cultural heritage research currently needs to be organised around situated, task-specific, and accountable uses of multimodal models rather than around general-purpose models. The options for research infrastructures to establish trust are therefore to focus on building mechanisms for task-specific reliability assessment, or to embed trusted, identifiable human agents or institutions between users and models.
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 1.2: Report on Transcription service for minority languages
Date of reporting: 24-11-2025
Report authors: Martin Matthiesen (CSC)
Contributors: Yaroslav Getman, Tamas Grosz (Aalto), Sam Hardwick (CSC)
Deliverable location: https://github.com/CSCfi/Kielipankki-utilities/tree/master/asr/apptainer
Keywords for the deliverable page: Finland-Swedish, Sámi
An Automatic Speech Recognition model for Northern Sámi (henceforth ”Sámi ASR model”) [1] has been created at Aalto University. The model has been packaged into a container [2] at CSC, which can be used in the user’s preferred computing environment, as well as in CSC’s Secure Desktop environment [3] for processing sensitive data.
The packaging process, which can be repurposed for other wav2vec models and/or models available via Hugging Face [4], is documented in the Language Bank’s GitHub [5] repository.
At the time of writing, the model for Finland-Swedish is still under development at Aalto University. It will be packaged as soon as it becomes available.
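Outside the container, the underlying model [1] can also be tried directly from Hugging Face. Below is a minimal sketch using the transformers and torchaudio libraries, assuming the model ships with a standard Wav2Vec2 processor and expects 16 kHz mono audio (the audio file name is a placeholder).

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "GetmanY1/wav2vec2-large-sami-cont-pt-22k-finetuned"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load an audio file and resample to the 16 kHz expected by wav2vec 2.0 models.
waveform, sample_rate = torchaudio.load("sami_speech.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding to a transcript.
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])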
[1] Sámi ASR model: https://huggingface.co/GetmanY1/wav2vec2-large-sami-cont-pt-22k-finetuned
[2] https://www.kielipankki.fi/tools/sami-asr/
[3] In SD Desktop the tool can be installed using the ”auto-apptainer” tool.
[4] https://en.wikipedia.org/wiki/Hugging_Face
[5] https://github.com/CSCfi/Kielipankki-utilities/tree/master/asr/apptainer
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.3: Report on Remote access to video data repositories
Date of reporting: 21-11-2025
Report authors: Tommi Jauhiainen, Erik Axelson (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024102501 and urn:nbn:fi:lb-2025081401
In this work package, we aimed to provide infrastructure for translation and interpretation research, both in machine translation and in translation studies, by enhancing our access to remote video data repositories. During the project, we focused on improving our access to the Parliament of Finland.
With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API. We published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. Currently, the Korp version of the resource is being prepared in the Language Bank of Finland (LBF) resource publishing pipeline under urn:nbn:fi:lb-2024102501. The original metadata includes timestamps, which will enable direct links from the Korp service to the video material available on the Parliament servers. The Korp version will contain approximately 7,000,000 tokens corresponding to about 4,500 hours of video data.
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Analysis of multimodal properties of naturalistic speech
Date of reporting: 12-11-2025
Report author: Steven Coats (University of Oulu)
Contributors: Alessandro Basile (Sorbonne Nouvelle University, France), Cameron Morin (University of Paris-Cité, France), Robert Fuchs (University of Bonn, Germany)
Deliverable location: Online search interface: https://ycsep.corpora.li (on Zenodo).
Downloadable static corpus: https://doi.org/10.7910/DVN/B7JRID
Keywords: Singapore English, Corpus Linguistics, YouTube, World Englishes, Podcasts
Recent advances in streaming protocols and automatic speech recognition (ASR) have enabled large-scale spoken language corpora, yet research on Singapore English remains constrained by small or text-based datasets. The YouTube Corpus of Singapore English Podcasts (YCSEP) addresses this gap with 620 hours of transcribed, diarized speech from over 1,300 podcast episodes by Singapore-based content creators. YCSEP supports the empirical analysis of phonetics, morphosyntax, and discourse, enabling the study of low-frequency features like discourse particles and reduplication. The dataset reflects informal, spontaneous speech from diverse speakers and facilitates investigation into nativization and endonormative stabilization processes in postcolonial English. Built using a pipeline of yt-dlp, WhisperX, and Pyannote, YCSEP offers robust empirical grounding for linguistic features such as verb complementation and modality. It also contributes to broader theoretical discussions on areal norms and construction grammar in World Englishes.
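For illustration, the core of such a pipeline might look as follows in Python; the episode URL, model choices and the Hugging Face token are placeholders, and the actual YCSEP pipeline involves additional metadata collection and post-processing.

import yt_dlp
import whisperx

URL = "https://www.youtube.com/watch?v=VIDEO_ID"   # placeholder episode URL
HF_TOKEN = "hf_..."                                # needed for Pyannote diarization
device = "cuda"

# 1. Download the audio track with yt-dlp and convert it to WAV.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "episode.%(ext)s",
    "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "wav"}],
}
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([URL])

# 2. Transcribe and word-align with WhisperX.
model = whisperx.load_model("large-v2", device)
audio = whisperx.load_audio("episode.wav")
result = model.transcribe(audio, batch_size=16)
align_model, metadata = whisperx.load_align_model(result["language"], device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Speaker diarization (Pyannote via WhisperX) and speaker assignment.
diarizer = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=device)
result = whisperx.assign_word_speakers(diarizer(audio), result)

for seg in result["segments"]:
    print(seg.get("speaker"), seg["start"], seg["text"])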
The corpus is available in two versions: An online search engine, through which transcripts and audio are accessible and downloadable (https://ycsep.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/B7JRID).
Related publication:
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. 2025. The YouTube Corpus of Singapore English Podcasts. English World-Wide. https://doi.org/10.1075/eww.25018.coa
Related presentations:
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the Mutations du Discours Numérique Seminar. Arras, France, April 22nd, 2025. https://calenda.org/1204680; https://adum.fr/script/formations.pl?mod=3633487&site=l
Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the 8th Conference of the International Society for the Linguistics of English. Santiago de Compostela, Spain, September 3rd, 2025. https://isle8conference.com/
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.3: Report on Remote access to text data repositories
Date of reporting: 30-09-2025
Report authors: Tommi Jauhiainen (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024071601 and urn:nbn:fi:lb-2025081401
In this work package, we aimed to provide infrastructure for translation and interpretation research, both in machine translation and in translation studies, by enhancing our access to remote text data repositories. During the project, we focused on improving our access to three significant external sources of text data: the Parliament of Finland, the Finnish Broadcasting Company (Yle), and the various institutional repositories managed by Finnish universities.
With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API and published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. Currently, the Korp version of the resource is being prepared in the resource publishing pipeline of the Language Bank of Finland (LBF). For future updates of this resource, we plan to collaborate with the Parlamenttisampo project and to jointly maintain the software components used to extract and parse the API-provided dataset.
Similarly, we published a new source version of the Yle Finnish News Archive, covering the years 2022-2024: urn:nbn:fi:lb-2025081401. We have worked on streamlining the publishing pipeline for resources that are regularly updated, which include both the Parliament and Yle datasets. Preliminary investigations indicate that the best throughput will be achieved by creating a customized pipeline for each resource with checklists tailored to make the creation and publishing of new versions as easy as possible.
We have also created a semi-automated system that can be used to harvest all PDF-formatted publications from the institutional repositories managed by Finnish universities. Automated harvesting was made possible by the widespread use of the DSpace software as the backend of these repositories. We are further developing automated methods to determine which types of language resources can be published based on this collection. The licenses under which the texts have been published vary considerably, and we aim to publish the resources as openly as possible.
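As an illustration of the general approach (not the project’s own harvester), DSpace repositories expose an OAI-PMH endpoint whose metadata can be read, for example, with the Sickle Python library; the endpoint URL below is a placeholder.

from sickle import Sickle

# Placeholder endpoint; DSpace installations typically expose OAI-PMH
# under a path such as /oai/request.
OAI_ENDPOINT = "https://repository.example-university.fi/oai/request"

sickle = Sickle(OAI_ENDPOINT)
records = sickle.ListRecords(metadataPrefix="oai_dc", ignore_deleted=True)

for record in records:
    meta = record.metadata                     # Dublin Core fields as lists
    title = meta.get("title", [""])[0]
    rights = meta.get("rights", [])            # licence information, if provided
    identifiers = meta.get("identifier", [])   # often includes the item's URL
    print(title, rights, identifiers)
    # A second step would resolve each item page and download its PDF bitstreams.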
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Reliable Enrichment of Visual Data
Date of reporting: 29-09-2025
Report authors: Matti Nelimarkka (University of Helsinki)
Contributors: Anton Berg (University of Helsinki), Leonardo Negri (University of Helsinki)
Deliverable location: https://github.com/uh-soco/coslab-core and https://github.com/uh-dcm/coslab-gui
Image recognition services, such as Amazon Rekognition, Google Vision and Azure AI Vision, allow anyone to label image content; however, their outputs vary across services (ref to image as data book). The cross-service label agreement score (COSLAB) allows researchers to quantitatively compare labels across services and determine which of the output labels are reliable. This allows researchers to use these outputs in their research and addresses a common critique of the scholarly use of such services (ref to image as data book).
The objective of this work was to (a) devise a method to assess the reliability of labels and (b) develop a graphical user interface allowing non-technical users to conduct this analysis. This objective aims to make image recognition tools available for humanities scholars and social scientists.
The underlying COSLAB method was originally developed in Berg & Nelimarkka (2023), which found no systematic differences in quality across different kinds of image datasets, suggesting that image recognition services can, overall, be used, particularly for explorative image analysis.
The graphical user interface provides a non-technical frontend to the image labelling services and the COSLAB calculations. The drag-and-drop interface allows sending images to the image recognition services and then calculates per-label scores indicating whether the different services recognised similar things. The final output, containing both the per-image labels and the COSLAB scores, can be exported, e.g., to Microsoft Excel, allowing researchers to use the results in their analysis tool of choice.
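As an illustration of the underlying idea, a simplified agreement measure (not necessarily the exact COSLAB formula) can be computed by comparing the label sets that different services return for the same image, e.g. with pairwise Jaccard similarity and a per-label agreement share:

from itertools import combinations

# Example labels for one image from three services (invented for illustration).
labels = {
    "rekognition": {"dog", "pet", "animal", "grass"},
    "google_vision": {"dog", "canine", "grass", "plant"},
    "azure_vision": {"dog", "outdoor", "grass"},
}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Pairwise agreement between services for this image.
for s1, s2 in combinations(labels, 2):
    print(s1, s2, round(jaccard(labels[s1], labels[s2]), 2))

# A per-label score: the share of services that returned the label.
all_labels = set().union(*labels.values())
per_label = {lab: sum(lab in s for s in labels.values()) / len(labels)
             for lab in all_labels}
print(per_label["dog"], per_label["outdoor"])   # 1.0 vs. about 0.33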
Berg, A., & Nelimarkka, M. (2023). Do you see what I see? Measuring the semantic differences in image‐recognition services’ outputs. In Journal of the Association for Information Science and Technology (Vol. 74, Issue 11, pp. 1307–1324). Wiley. https://doi.org/10.1002/asi.24827
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Analysis of video stream interactions with AI solutions
Date of reporting: 22-09-2025
Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable location: https://collector-twitcher.2.rahtiapp.fi/Video_clip_summary
Keywords: video clip analysis; multimodal; MLLM; video summarization; Twitch
The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip contains a wealth of multimodal information: the visual action of the gameplay, the auditory context from the caster commentary, and the text-based reactions from the live chat, all of which represent dense and valuable information for understanding online communities and digital entertainment. However, the sheer volume and complexity of this data create a need for efficient analysis tools. Our previous tools have focused on chat analysis or chat content detection [1, 2], which, however, do not cover the diverse nature of Twitch content thoroughly enough. The primary challenge lies in the multimodal nature of the data: Twitch data is characterised by a wide range of dynamic scenes, dense on-screen information, and a complex interaction between the visual gameplay, audio commentary, and a massive chat audience. A true understanding of a Twitch clip requires not just the perception of events within each modality but the synthesis of their interplay. This creates a clear research gap for tools that can comprehensively understand and summarize the information within these complex multimedia clips.
This deliverable presents a tool for the automated understanding and summarization of such clips. The tool utilizes state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family. It helps the user generate a chronological summary of the key audio-visual events, a thematic analysis of chat reactions, and an overall summary based on the video and chat input. The process is guided by a structured Chain-of-Thought prompt.
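Below is a minimal sketch of this kind of MLLM call using the google-generativeai Python SDK; the model name, file paths, and prompt are illustrative only, and the actual tool uses a more elaborate multi-step Chain-of-Thought prompt.

import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload the clip; the Files API processes video before it can be prompted on.
video = genai.upload_file(path="twitch_clip.mp4")
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

chat_log = open("clip_chat.txt", encoding="utf-8").read()

prompt = (
    "Step 1: describe the key audio-visual events in chronological order.\n"
    "Step 2: summarize the main themes of the chat reactions below.\n"
    "Step 3: write an overall summary combining both.\n\n"
    f"Chat log:\n{chat_log}"
)

model = genai.GenerativeModel("gemini-1.5-pro")   # illustrative model choice
response = model.generate_content([video, prompt])
print(response.text)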
[1] Jari Lindroos, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Ida Toivanen. ”From PogChamps to Insights: Detecting Original Content in Twitch Chat.” In Hawaii International Conference on System Sciences, pp. 2542-2551. Hawaii International Conference on System Sciences, 2025. https://doi.org/10.24251/hicss.2025.308
[2] Jari Lindroos, Ida Toivanen, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Sami Äyrämö. ”Participant profiling on Twitch based on chat activity and message content.” In International GamiFIN Conference, pp. 18-29. CEUR Workshop Proceedings, 2025. https://ceur-ws.org/Vol-4012/paper18.pdf
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Machine learning-based enrichment of social media
Date of reporting: 22-09-2025
Report authors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)
Contributors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)
Deliverable location:
Keywords: machine learning; social media; web registers; register variation
Web-crawled datasets have become invaluable resources for SSH research, supporting diverse fields including corpus linguistics, digital humanities, and computational social science. However, publicly available web datasets like HPLT 2.0 and FineWeb provide only basic metadata about their contents, such as document URLs and crawl dates, which limits their research potential. Enriching these noisy collections with contextual metadata would greatly improve their value for SSH research.
In this deliverable, we focus on automatically identifying social media text varieties in web datasets using machine learning. We publish the resources described below.
We approach the web text classification problem using the framework of register variation (Egbert and Biber 2018; Biber and Conrad 2019), where “register” denotes a text variety associated with a particular situational context, such as News report or Recipe. We use the 25-class web register taxonomy developed by Skantsi and Laippala (2023) to label 3 million randomly selected documents from the HPLT 2.0 corpus (Burchell et al. 2025) in English, Finnish, and Swedish (1M samples each). This automatic labeling uses the multilingual BGE-M3 model (Chen et al. 2024), fine-tuned for register classification following Henriksson et al. (2024).
From this 3M-document sample, we then select a social media subset by choosing documents labeled with any of the following three registers: Narrative Blog, Opinion Blog, or Interactive Discussion. We also include so-called “hybrids”, i.e., documents assigned more than one register label, such as Narrative Blog + Recipe. This process yields a dataset of approximately 113,000 English, 290,000 Finnish, and 335,000 Swedish social media documents, with Narrative Blogs being the most common category across all languages.
To further analyze the contents of the identified social media documents, we apply HDBSCAN clustering (McInnes et al. 2017) on their semantic vector representations, revealing meaningful thematic subgroups within some register categories. For instance, applying keyword analysis on the clusters, we identify hand-crafting and cooking themes in hybrid documents labeled Narrative Blog + How-to/Instructional. We develop simple logistic regression classifiers trained on these thematic clusters, allowing SSH researchers to first categorize text by register, then select social media registers of interest, and finally identify specific thematic subgroups where applicable.
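A condensed sketch of this clustering-and-classification step is shown below; loading BGE-M3 via sentence-transformers is an assumption made for illustration, and the placeholder document list stands in for a realistically sized register-labeled collection.

import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder: in practice, `texts` holds the documents of a social media
# register of interest (e.g. Narrative Blog + How-to/Instructional hybrids).
texts = ["document one ...", "document two ..."]

# 1. Semantic vector representations (BGE-M3 loaded via sentence-transformers).
encoder = SentenceTransformer("BAAI/bge-m3")
embeddings = encoder.encode(texts, normalize_embeddings=True)

# 2. Density-based clustering to find thematic subgroups; -1 marks noise.
clusterer = hdbscan.HDBSCAN(min_cluster_size=50, metric="euclidean")
cluster_labels = clusterer.fit_predict(embeddings)

# 3. A simple classifier trained on the discovered clusters, so new documents
#    can be assigned to a thematic subgroup without re-clustering.
mask = cluster_labels != -1
clf = LogisticRegression(max_iter=1000).fit(embeddings[mask], cluster_labels[mask])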
References
Biber, Douglas, and Susan Conrad. 2019. Register, Genre, and Style. Cambridge: Cambridge University Press.
Burchell, Laurie, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova et al. 2025. “An expanded massive multilingual dataset for high-performance language technologies.” arXiv e-prints: arXiv-2503.
Chen, Jianlv, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. “Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.” arXiv preprint arXiv:2402.03216.
Egbert, Jesse, and Douglas Biber. 2018. Register Variation Online. Cambridge: Cambridge University Press.
Henriksson, Erik, Amanda Myntti, Saara Hellstrom, Anni Eskelinen, Selcen Erten-Johansson and Veronika Laippala. 2024. “Automatic register identification for the open web using multilingual deep learning.” arXiv preprint arXiv:2406.19892.
McInnes, Leland, John Healy, and Steve Astels. 2017. “hdbscan: Hierarchical density based clustering.” J. Open Source Softw. 2:11, 205.
Skantsi, Valtteri, and Veronika Laippala. 2023. “Analyzing the unrestricted web: The finnish corpus of online registers.” Nordic Journal of Linguistics 48:1, 1-31.
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Forensic Linguistics Corpus and Search Interface C.R.I.M.E
Date of reporting: 01-09-2025
Report authors: Steven Coats (University of Oulu)
Contributors: Dana Roemling (University of Birmingham)
Deliverable location: Online search interface: https://forensic.corpora.li (DOI)
Keywords: Forensic linguistics; corpus linguistics, YouTube, investigative interviews
CRIME is the Corpus of Recorded Investigative, Media, and Evidence-based proceedings, a structured, searchable resource comprising audio and ASR-generated transcripts from investigative interviews, courtroom interactions, and related media. Collected from publicly available YouTube sources in accordance with the text and data mining provisions of EU copyright law, the corpus addresses a critical gap in current research: the lack of large-scale, real-world datasets that integrate reliable transcripts with corresponding audio.
Previous studies often rely on limited data, constraining generalizability and hindering methodological innovation. By enabling detailed analysis of linguistic, phonetic, pragmatic, and discourse-level features, CRIME supports interdisciplinary research in linguistics, law, psychology, and computational modeling. Potential applications include the identification of language patterns associated with interviewing strategies and outcomes, as well as leveraging large language models to explore affective and interactional dynamics.
This resource offers substantial potential to inform both academic inquiry and evidence-based practices in investigative interviewing and broader criminal justice contexts. The corpus is available in two versions: An online search engine, powered by BlackLab, through which transcripts and audio are accessible and downloadable (https://forensic.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/MLMB6E).
Coats, Steven and Dana Roemling. 2025. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. In: Fábián, Annamária and Igor Trost (eds.), Impulses and Approaches to Computer-Mediated Communication Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities, 45-49. University of Bayreuth, Germany. https://www.cmc2025.uni-bayreuth.de/pool/dokumente/CMC-2025-Proceedings-2.pdf
Coats, Steven and Dana Roemling. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the Aston University Institute for Forensic Linguistics Research Seminar. Birmingham, UK, April 24th, 2025.
Coats, Steven and Dana Roemling. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the 12th International Conference on CMC and Social Media Corpora for the Humanities (CMC-Corpora 2025). Bayreuth, Germany, September 5th, 2025. https://www.cmc2025.uni-bayreuth.de/en/
Roemling, Dana and Steven Coats. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. Presentation at the 7th General ILLA Conference. Kaunas, Lithuania, September 5th, 2025. https://conferences.vdu.lt/etn/general-illa-conference/
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.1: Report on Comprehensive data versioning
Date of reporting: 22-09-2025
Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester
Keywords for the deliverable page: versioning, updates, differences
The versioning mechanism has been rigorously tested with a daily update schedule, which is far more frequent than necessary, considering that the data set changes relatively rarely; a monthly update schedule is envisaged. We have added improvements to better serve the Elasticsearch use case, to make it easier to track the provenance of the dataset, and to improve the reliability of snapshot creation. Below, we describe in more detail how the dataset serves the selected use cases.
To create ”The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT” [1] (”KLK”, for short) using this data set [2], the original Python scripts [3] need to be changed. Presently, they operate on directories extracted from zip files obtained directly from the National Library of Finland (NLF). We decided not to use these files directly for two reasons:
Contrary to the original plan, we opted not to create a working proof-of-concept but to explain below the steps needed to adapt the present scripts to the new format. One major change is to operate on the zip files instead of a POSIX file structure. Especially on HPC filesystems like Lustre, working on zip files is much more efficient than extracting the small files contained in them. Concretely, Python’s zipfile module [4] can be used to search for METS files within the downloaded zip files in /scratch/project_2006633/nlf-harvester/zip on Puhti. The METS files of a specific binding are contained in the ”mets” directory of that binding, and the corresponding OCR data can then be found in the ”alto” directory on the same level.
The example of binding 19712 below illustrates how finding a METS file (in the ”mets” directory) leads to the respective OCR data (in the ”alto” directory at the same level as the METS file).
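A minimal sketch of this lookup with Python’s zipfile module, assuming the directory layout described above (the zip file name is an illustrative placeholder):

import zipfile
from pathlib import PurePosixPath

# Illustrative archive name; the actual zip names under
# /scratch/project_2006633/nlf-harvester/zip depend on the harvester run.
zip_path = "/scratch/project_2006633/nlf-harvester/zip/example_collection.zip"

with zipfile.ZipFile(zip_path) as zf:
    # Find METS files for each binding (e.g. 19712) without extracting the archive.
    mets_members = [n for n in zf.namelist()
                    if PurePosixPath(n).parent.name == "mets" and n.endswith(".xml")]
    for mets_name in mets_members:
        binding_dir = PurePosixPath(mets_name).parent.parent   # e.g. .../19712
        # The corresponding OCR pages live in the sibling "alto" directory.
        alto_members = [n for n in zf.namelist()
                        if PurePosixPath(n).parent == binding_dir / "alto"]
        mets_xml = zf.read(mets_name)          # parse the METS metadata here
        for alto_name in alto_members:
            alto_xml = zf.read(alto_name)      # parse the ALTO OCR data here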
A minor issue was observed: before using the dataset for the next version of ”KLK”, we need to request that a collection of periodicals (marked ”aikakausi”) be added to the dataset; presently, we only download newspapers (marked ”sanomalehti”).
Another use case for the data is the Elasticsearch-based tool developed in the previous FIN-CLARIAH development round in WP4.3 [5]. In that use case, the NLF data is converted to JSON suitable as input for an Elasticsearch engine, and it is important to keep the search engine in sync with changes in the data set. While we already provide versions, comparing these versions is resource-intensive. To make comparison easier, we introduced a ”log” directory (/scratch/project_2006633/nlf-harvester/log/) containing listings of the additions and deletions performed during each synchronisation, as well as general information about snapshot runs. We also made it easy to refer to a specific version of the dataset by tagging it with the hash used in the restic backup. Since the changes from one version to another can potentially be large (e.g. if NLF publishes a new version of the OCR’d scans), the resources on HPC login nodes are not sufficient to generate snapshots using restic. For that reason, restic is now run as an HPC job on a compute node with adequate resources.
Summary and Outlook
The goal of this work package was to create a consistent download framework for publicly available newspaper data from the NLF. To achieve this, we used Apache Airflow for task automation and restic for versioning. It turned out that Apache Airflow is not designed to handle a large number of potentially long-running tasks at once, so we had to find compromises to reduce the number of tasks.
We ran the download pipeline on a daily basis for a few weeks without issue and are now confident that Airflow can be run on a monthly basis to update the dataset. Restic turned out to be a reliable tool for versioning. The versioning to Allas makes it possible to free space on Puhti in case the data set is not in active use after the end of the project. It also makes it possible to stage the data set to other environments, such as personal laptops or the LUMI supercomputer. Long-term funding for keeping the data on Allas still needs to be worked out.
[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401
[2] See the Harvester documentation for details.
[3] https://github.com/CSCfi/Kielipankki-utilities/tree/master/corp/klk-alto
[4] Introduction to the python zipfile module: https://realpython.com/python-zipfile/
[5] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.2: Report on Transformer training for specialised data
Date of reporting: 09-06-2025
Report author: Erik Axelson (University of Helsinki)
Contributors: Ghent Center for Digital Humanities [1] & Language and Translation Technology Team (LT3) [2] (Ghent University); Sam Hardwick, Katri Tegel (CSC)
Deliverable location: N/A
In this work package, we aim to create a self-study course implemented as Jupyter Notebooks. Its purpose is to teach how to build a language model from scratch in the CSC computing environment using one or more existing resources of the Language Bank of Finland, but not limited to them. For this purpose, we have tested two resources using the Noppe [3] service of CSC. One is an external resource developed in the framework of the CLS – Computational Literary Studies Project (2020-2025) [4]. The other is CSC’s Aitta [5] inference service, for which CSC also offers a course, ”Aitta – LLM Inference”, in Noppe.
The CLSInfra repository [6] hosts the work done in the framework of CLS on Natural Language Processing pipelines for the DH community. The pipelines are demonstrated with Jupyter Notebooks, and we have tested them in the Noppe service of CSC. Problems encountered have been reported to the CLSInfra team, and they have fixed the issues reported so far. We will continue to go through the Notebooks, and we aim to run all of them in the Noppe service. Later, we can modify them, for example, for Finnish or for minority languages such as the Sámi languages, other Finno-Ugric languages, or Finland-Swedish.
CSC’s ”Aitta – LLM Inference” course uses large language models available in the Aitta inference service. We have tested creating keys to access language models in Aitta and managed to use them in Noppe to run the exercises. Aitta already offers some models to use, and future features will include the ability for users to upload models and create embeddings themselves. These features will later make it possible to use our own materials.
We plan to have our own course environment ready at the beginning of fall 2025.
[1] Ghent Center for Digital Humanities: https://www.ghentcdh.ugent.be/
[2] Language and Translation Technology Team (LT3): https://lt3.ugent.be/
[3] Noppe: https://noppe.2.rahtiapp.fi/
[4] Computational Literary Studies Project (2020-2025): https://clsinfra.io/
[5] Aitta: https://staging-aitta.2.rahtiapp.fi/public
[6] The CLSInfra repository: https://github.com/GhentCDH/CLSinfra
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.2: Report on Ingestion of heritage and societal data from Sampo
Date of reporting: 04-06-2025
Report author: Eero Hyvönen (Aalto University)
Contributors: –
Deliverable location: Linked Data Finland (LDF.fi), several online data services, CSC Allas (ParliamentSampo), zenodo.org (several submissions of data dumps), various web portal URL-addresses
CoinSampo
A new Sampo system based on archaeological data from the Cultural Heritage Agency.
LetterSampo Finland
A large new Sampo system is in use, covering some 1.3 million letters by nearly 120,000 historical people from 1700 fonds.
OperaSampo
OperaSampo was finished.
ArtSampo
A first demo version using LLMs was finished.
A new follow-up project proposal for LetterSampo Finland was prepared for the Research Council of Finland (2025) together with the Sibelius Academy, on analyzing the textual contents of historical letters using LLMs, NLP, and knowledge graphs (Anne Kauppala, Eero Hyvönen).
New joint works on applying the DARIAH-FI Sampo infra:
1) VU University Amsterdam, PH-Sampo; 2) the Geneva Graduate Institute, Switzerland, applying ParliamentSampo to, e.g., United Nations speeches; 3) Nomisma.org, NomismaSampo; 4) the British Museum, PASampo; 5) the Heritage Practice Communities network, HPC Sampo; 6) the University of Latvia, Nobel Prize Sampo and DBLP Sampo.
Maintenance of ParliamentSampo
The ParliamentSampo data was updated with data related to the new parliament 2023–2024, and new semantic data regarding interruptions and laughter in the parliament was added. This was reported by YLE in prime-time TV news and in an article on the web.
SampoSampo – Connecting Everything to Everything Else
A first demonstration of a new kind of data linking service, inspired by the international VIAF.org service of national libraries, was created.
Tutorial: How to create a Linked Open Data service and semantic portal for your Cultural Heritage data. (in English), November 28, 2024.
Tutorial organized at the Digital Humanities in the Nordic and Baltic Countries 2025 conference (DHNB 2025):
DHNB 2025 Tutorial, Tartu, Estonia: How to create a Linked Open Data service and semantic portal for your Cultural Heritage data. (in English), March 4, 2025.
Research articles related to the research above
One related dissertation was accepted in 2024 (by Petri Leskinen) and one manuscript is in pre-examination in 2025 (by Heikki Rantala), in addition to several MSc theses.
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Automated metadata of archival data from NAF
Date of reporting: 04-06-2025
Report authors: Venla Poso (JYU), Ida Toivanen (JYU)
Contributors: Antero Holmila (JYU), Venla Poso (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Ilkka Jokipii (NAF)
Deliverable locations:
The National Archives of Finland has been digitising its material at an increasing pace; for example, it started piloting a mass digitisation project in 2019 with the aim of digitising over 135 kilometres of archival data. The aim of deliverable D3.3.1 was to develop machine learning methods for generating metadata, such as document type and journal number, from OCR-scanned archival materials to facilitate their analysis and information extraction. The goal has been to generate metadata that helps to make the large and heterogeneous data collections within the archives more usable. The development process has included creating deep learning (DL) models for named entity recognition (work started in 2022–2023) and for document type classification (2024–2025).
The work on archival data started with developing named-entity recognition (for example, for journal numbers) for state authority archives via (1) publishing annotation guidelines to aid the annotation process and to capture the properties of archival data [1], and (2) DL modelling based on annotated archival data [2,3]. In addition to publishing a DL model trained with the annotated data [3], we evaluated an archival text model against a Finnish text model to determine how strongly noise affects models in real-life use cases [2].
The process of developing document type classification for noisy and diverse archival data has included collecting and annotating a new benchmark dataset from openly available archival data (to be published) and evaluating different DL model architectures for the task. As a result, we released an image-based model that classifies scanned documents into seven categories: cover page, card index, map, picture, running text, table or form, and newspaper (https://huggingface.co/jyu-digihum/findoctype). Our future work will entail adding a multimodal dimension to the current framework.
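For illustration, the released model can be tried, for example, with the transformers pipeline API, assuming it is compatible with the standard image-classification pipeline (the model card should be checked for the exact usage):

from transformers import pipeline
from PIL import Image

# Minimal sketch; assumes the released model works with the standard
# image-classification pipeline.
classifier = pipeline("image-classification", model="jyu-digihum/findoctype")

image = Image.open("scanned_page.jpg")   # a scanned archival document
for prediction in classifier(image):
    # Each prediction is a dict with a label and a score; the labels follow
    # the seven document-type categories described above.
    print(prediction["label"], round(prediction["score"], 3))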
Development has been conducted in cooperation with the National Archives of Finland.
[1] Poso, V., Välisalo, T., Toivanen, I., Lipsanen, M., Kukkohovi, L., Kytöaho, R., Palander, S., Pohjola, M., Laitinen, V., Föhr, A., Abdelamir, A. & Niemi, J. (2025). NER annotation guidelines for archival data. University of Jyväskylä. URN: https://urn.fi/URN:NBN:fi:jyu-202501291584
[2] Toivanen, I., Poso, V., Lipsanen, M., & Välisalo, T. (2025). Developing named-entity recognition for state authority archives. In O. Holownia, & E. S. Sigurðarson (Eds.), DHNB2024 Conference Post-Proceedings (7). University of Oslo Library. Digital Humanities in the Nordic and Baltic Countries Publications. https://doi.org/10.5617/dhnbpub.12262
[3] Poso, V., Lipsanen, M., Toivanen, I., & Välisalo, T. (2024). Making Sense of Bureaucratic Documents: Named Entity Recognition for State Authority Archives. In Archiving 2024 Final Program and Proceedings (pp. 6-10). Society for Imaging Science & Technology. Archiving, 21. https://doi.org/10.2352/issn.2168-3204.2024.21.1.2
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.2: Report on Ingestion of structured data from FINNA
Date of reporting: 04-06-2025
Report author: Joona Manner (National Library of Finland, Finna Unit)
Contributors: Joona Manner, Juha Luoma, Julia Isotalo, Riitta Peltonen, Päivi Maria Pihlaja (National Library of Finland)
Deliverable location: https://github.com/NatLibFi/Finna-API-image-file-downloader
The aim of the deliverable was to improve researchers’ access to vast image collections and related metadata for data-intensive research. Finna is a national infrastructure and discovery service maintained and developed by the National Library of Finland, providing access to the collections of almost 500 libraries, archives and museums.
In this deliverable, we enhance Finna’s data reuse services to meet researchers’ needs and improve the technical features of the Finna Application Programming Interface (API) service. The deliverable contributes to the objective of connecting the research infrastructure to accruing data sources, enhancing researchers’ access to open data and enabling workflow automation.
We planned the technical improvements and guidance materials in consultation with researchers and other stakeholders, including an open survey questionnaire in August 2024 and a collaborative workshop in September 2024, which involved researchers from the social sciences and humanities as well as CSC – IT Center for Science.
The new API image file and metadata download system consists of command-line Node.js scripts that allow users to download high-resolution images, with related metadata in JSON format, from Finna’s material providers based on a Finna search.
The scripts enable the downloading of thousands of high-resolution images without triggering Finna’s data rate limiter, which is necessary to prevent malicious attacks on Finna’s infrastructure.
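For illustration only, below is a minimal Python sketch of the kind of paged search-and-download flow that the Node.js scripts implement; it assumes the public Finna search API at https://api.finna.fi/v1/search, and the field names and image URL prefix should be checked against the Finna API documentation. A real downloader would also throttle its requests to respect the rate limiter mentioned above.

import requests

# Assumed public search endpoint; see the Finna API documentation for details.
API = "https://api.finna.fi/v1/search"
params = {
    "lookfor": "Helsinki harbour",        # illustrative search terms
    "field[]": ["id", "title", "images"],
    "limit": 20,
    "page": 1,
}
resp = requests.get(API, params=params, timeout=30)
resp.raise_for_status()

for record in resp.json().get("records", []):
    for img_path in record.get("images", []):
        img_url = "https://api.finna.fi" + img_path   # assumed URL prefix
        data = requests.get(img_url, timeout=60).content
        fname = record["id"].replace("/", "_") + ".jpg"
        with open(fname, "wb") as f:
            f.write(data)
        # The related metadata can be stored alongside, e.g. as a JSON file.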
In the future, Finna’s API image file and metadata download system will also automatically create a report on possibly missing image files, which will help users and organisations resolve these issues, improving Finna’s content quality in the long run.
The automated Node.js scripts require an API key that users can generate with their personal Finna account. Creating a Finna account requires email confirmation. The API key feature will be available in autumn 2025; until then, keys are provided on demand for individual research purposes.
The project has been in line with many of the National Library’s strategic objectives, including the objective of the Finna vision to promote the use of data as a resource.
Instructions in GitHub:
https://github.com/NatLibFi/Finna-API-image-file-downloader/releases/download/Demo_for_Workshops/Finna_API_instructions.pdf
Instructions will also be added under the Finna service guidance materials:
https://www.kiwi.fi/display/Finna/Finna+API+Documentation+In+English
The new features were presented and tested at several events.
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Automated harmonisation and enrichment of metadata
Date of reporting: 18-03-2025
Report authors: Akewak Jeba (University of Turku), Leo Lahti (University of Turku)
Contributors: Julia Matveeva (University of Turku), Muluh Geraldson (University of Turku)
Deliverable location: github.com/fennicahub (see below for specific outputs)
Keywords: data science, metadata, bibliographies, enrichment
This deliverable provides resources for gathering, harmonizing, enriching, and summarizing structured metadata from the Finnish National Library, in particular the National Bibliography Fennica. The open data and workflows can be used in research, training, and outreach. Further metadata resources are available for complementary cultural heritage from archives, libraries, museums, and other actors. This deliverable expands the scope of the metadata collections that are seamlessly interlinked with the statistical programming environment, enhancing the integration of Finna and Finto with Fennica.
Earlier work with Fennica, including metadata harmonization and visualization workflows, is described in FIN-CLARIAH (2022–23) Deliverable D4.1.3, which focused on preparing and publishing the cleaned Fennica dataset along with interactive tools for analysis and presentation.
This deliverable consists of the following resources:
1. A systematic approach to retrieving Finna metadata into open computing environments is implemented as the open-source finna software package. It uses the REST API and the OAI-PMH API for data retrieval. The release version is available through the CRAN repository.
2. Data science methods to enrich structured metadata from Finna and Fennica, based on actor cross-linking, are provided via the finto R package. This provides fluent access to the Finto keyword service (finto.fi) from the R statistical environment and allows interaction with the Finto service. Examples of Fennica author enrichment using Kanto/Finto are available via the package vignette.
3. Data analysis and visualization techniques to support the research use of cultural heritage metadata collections are provided via the finna package and demonstrated in the package vignettes. Geospatial analysis and visualization of metadata from Finna and Fennica is further supported by the maintained geofi package.
Resource links:
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
