D1.3.1: Develop licensing and protection schemes for sharing sign language data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 367751
Start date: 01-01-2026
Duration: 24 months

WP 1.3: Report on developing licensing and protection schemes for sharing sign language data
Date of reporting: 11-06-2026

Report author: Mietta Lennes (University of Helsinki)
Deliverable location: https://www.kielipankki.fi/corpora/resource-families-fin-clarin/sign-language-resources/

Keywords for the deliverable page: sign language; personal data; video processing; sensitive data; SD Desktop

Description

Several resource groups containing sign language material are available via the Language Bank of Finland. The first of these was The Kipo Corpus (2010 The Language Policy Programme for the National Sign Languages in Finland), published openly in 2015. In 2016, 2019 and 2024-2025, large numbers of annotated sign language recordings were published in the CFINSL and CFSTS resource groups of Finnish and Finland-Swedish Sign Language. The sign language corpora can be found on the website of the Language Bank, under the Sign Language Resource Family.

Most sign language resources contain personal data, as the signers are identifiable in the video recordings on the basis of their face, physical appearance and movements. In free signing and signed conversation, the signers may also refer to other people. The data usually cannot be anonymized for research purposes. Because of the personal data, decisions on the appropriate end-user licenses and data protection schemes must be based on the information given to the data subjects (the participating signers) and on an evaluation of the potential risks vs. benefits of processing the types of data in question.

For sign language communities, it is often desirable to make some language data publicly available. By informing the participating signers in an appropriate way, it is possible to publish the content openly, given that the publication is not considered harmful to the people involved. Some of the above-mentioned resources were made publicly available via the Language Bank of Finland, whereas others are only available for research purposes upon application.

The depositing organization is generally responsible for setting the terms and conditions on how the personal data can be processed and redistributed. If protection is needed, the Language Bank offers options for managing and restricting access to the data via federated academic login (CLARIN ACA type licenses) or individual access granted upon application (CLARIN RES type licenses).

For additional protection, it is even possible to share the data in packages that are separately encrypted for individual users, or the data can be made accessible via SD Desktop provided by CSC. However, these two options are currently not used for sign language data. Encrypting large amounts of video files is time-consuming and would often be out of proportion to the protection requirements, since encrypted data would still need to be decrypted for the actual research use. SD Desktop offers a secure environment for analyzing and processing data, but its current tools and technical properties may not yet be sufficient for the convenient playback, annotation and analysis of sign language videos. We are collaborating with CSC to investigate the possibilities for adding tools to SD Desktop that would enable users to run useful analyses in manual or batch mode and to produce data that can be safely exported from the secure environment. For further details on sensitive data, see Deliverable 2.1.1 and the support page regarding sensitive data in the Language Bank.

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 367751.


D2.3.1: Develop policies for processing and sharing translation memories

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 367751
Start date: 01-01-2026
Duration: 24 months

WP 2.3: Report on developing policies for processing and sharing translation memories
Date of reporting: 11-06-2026

Report author: Mietta Lennes (University of Helsinki)
Deliverable location: https://www.kielipankki.fi/support/data-management/dela/

Keywords for the deliverable page: translation memory, machine translation

Description

A translation memory is a bilingual or multilingual database of previous translations of text segments of varying size. In some cases, a translation memory (or a translation memory manager/system) can also refer to a language-technological tool that helps translators by suggesting translations on the basis of similar, previously translated segments of text. (For the definitions in Finnish, see https://tieteentermipankki.fi/wiki/Language_Technology:translation-memory.)
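The suggestion step described above can be illustrated with a minimal sketch. It assumes a toy in-memory list of segment pairs and uses the similarity ratio from Python's standard `difflib` module as a stand-in for the fuzzy matching of a real translation memory system. The segment pairs, the `suggest` function and the 0.7 threshold are invented for illustration and do not describe any particular tool.

```python
from difflib import SequenceMatcher

# Toy translation memory: (source segment, target segment) pairs.
# The entries are invented for illustration only.
TM = [
    ("Open the file.", "Avaa tiedosto."),
    ("Save the file.", "Tallenna tiedosto."),
    ("Close the window.", "Sulje ikkuna."),
]

def suggest(segment, memory, threshold=0.7):
    """Return (score, source, target) for stored segments similar to `segment`, best first."""
    scored = []
    for source, target in memory:
        # Character-level similarity between the new segment and a stored source segment.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score >= threshold:
            scored.append((score, source, target))
    return sorted(scored, reverse=True)

matches = suggest("Open the files.", TM)
for score, source, target in matches:
    print(f"{score:.2f}  {source}  ->  {target}")
```

Production systems typically match on tokens or subsegments and weigh formatting and terminology as well; the character-level ratio here only shows the principle of retrieving earlier translations for similar input.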

Professional translators use translation memories as part of their workflow. Translation memories help translators maintain consistent terminology across documents and make their work significantly faster. Translation memories are thus a valuable source for research in language technology, linguistics, terminology and translation studies. For example, with machine learning techniques, translation memories could be used for analyzing specific types of translation solutions, for enriching and extending the existing data, or for extending the translation solutions to other languages. However, for reasons such as copyright, translation memories can often only be used and shared within a company or an organization, and it can be difficult to share them with a larger research community.

If a translation memory is publicly available, or if a deposition agreement can be reached with the rightholders about the appropriate restrictions of use for research purposes, it is possible to provide access to the data via the Language Bank of Finland. The translation memory database can be made available as a downloadable package. It is also possible to consider different platforms for accessing and querying the content, e.g., as a parallel corpus via the Korp concordancer, or as a lexical resource via the Karp platform. Karp is not yet available in the Language Bank of Finland, but the Language Bank will discuss the possibilities of installing it (see the original Karp in Språkbanken, the Language Bank of Sweden). If the translation memory includes very sensitive content that requires additional protection, it is also possible to use the secure SD Desktop environment for providing access to the data (cf. Deliverable 2.1.1 and the support page regarding sensitive data in the Language Bank).


FIN-CLARIAH Deliverables (2022-2023)


This page outlines the project deliverables for 2022-2023 (completed).

FIN-CLARIAH Funding period 2022-2023 (completed)

Module 1: Natural Language Processing (NLP)
Module 2: Language Research Infrastructure
Module 3: Structuring Data
Module 4: Analyzing Structured Data
Module 5: Information Interaction
- W5.1 Evidence-based RI development
- W5.2 Education and dissemination

Module 1: Natural Language Processing (NLP)

W1.1 Text processing and annotation environments

D1.1.1	Updating LBF resource selection	2022-09
D1.1.2	Ingesting new unstructured resources	2023-12

W1.2 Speech processing and annotation

D1.2.1	Forced-Alignment Service	2022-09
D1.2.2	Transcription Service for Finnish Interviews	2023-09

W1.3 Noise-tolerant NLP

D1.3.1	Corpora of non-standard language	2022-09
D1.3.2	System for detecting toxic language	2023-06
D1.3.3	Models for retrieving QA pairs from the web	2023-09
D1.3.4	QA pair corpora	2023-12

Module 2: Language Research Infrastructure

W2.1 Social Data Science

D2.1.1	Licensing agreements for personal data	2022-09
D2.1.2	Licensing agreements for special categories	2023-06

W2.2 Learners’ Assessment Environments

D2.2.1	Speech recognition for L2	2022-12
D2.2.2	Speech recognition for L2 update	2023-12

W2.3 Translation and Interpretation

D2.3.1	Licensing interpretation sessions	2022-12
D2.3.2	Aligning and retrieving	2023-12

W2.4 Terminology

D2.4.1	Term discovery procedures	2022-09
D2.4.2	Terminology application	2023-06
D2.4.3.1	Initializing terminology collections	2022-09
D2.4.3.2	Initializing terminology collections	2023-06
D2.4.3.3	Initializing terminology collections	2023-12

W2.5 Solutions for better use of language learner performances in research

D2.5.1	Test performances storage	2022-12
D2.5.2	Analysis and annotation tools for learner performances	2023-12

Module 3: Structuring Data

W3.1 Increasingly automated ingestion of material

D3.1.1	Initial NLF data	2022-09
D3.1.2	Ingestion framework	2022-12
D3.1.3	Versioning support	2023-06
D3.1.4	Incremental update process	2023-12

W3.2 AI solutions to better use of National Archives mass digitisation services

D3.2.1	Pipeline for transferring archival data	~~2022-12~~ 2023-06
D3.2.2	Annotation & analysis tools for NARC data	2023-12

W3.3 AI solutions to better use of textual qualitative survey data

D3.3.1	Qualitative survey data concept network	2022-09
D3.3.2	R package for data concept network	~~2023-09~~ 2023-12

W3.4 Developing analysis methods for real-time chats in gameplay streams

D3.4.1	Livestream data collector	2022-12

W3.5 Developing analysis methods for text network analysis of political texts

D3.5.1	Text network analysis of political texts	~~2022-12~~ 2023-06
D3.5.2	Text network analysis of political texts	~~2023-09~~ 2023-12

Module 4: Analyzing Structured Data

W4.1 Metadata harmonization and analysis

D4.1.1	Harmonized FNB	2022-09
D4.1.2	Harmonization code	2022-12
D4.1.3	Visualisation workflow	2023-06
D4.1.4	R/Python module	2023-12

W4.2 Linked Open Data Services

D4.2.1	LDF knowledge extraction tools	2022-12
D4.2.2	Parliament of Finland Ontology	2023-12

W4.3 Subsetting data

D4.3.1	Subsetting tool	2022-09
D4.3.2	Statistical overviews and bias detection	2023-06
D4.3.3	Representative Twitter dataset	2023-12

Module 5: Information Interaction

W5.1 Evidence-based RI development

D5.1.1	User experience questionnaire	2022-09
D5.1.2	Log data collection and analysis	2023-06
D5.1.3	Protocol for collecting workshop data	2023-12

W5.2 Education and dissemination

D5.2.1	Actor network	2022-12
D5.2.2	Educational material	2023-12


FIN-CLARIAH Deliverables (2024-2025)


This page outlines the project deliverables for 2024-2025 (completed).

FIN-CLARIAH Funding period 2024-2025

Module 1: Natural Language Processing (NLP)
Module 2: Language Research Infrastructure (LRI)
Module 3: Structuring Data
Module 4: Analyzing Structured Data
- W4.1 Analytical Support for computational SSH
Module 5: Information Interaction (IIA)
- W5.1 Evidence-Based Infrastructure Development

Module 1: Natural Language Processing (NLP)

W1.1 Text processing and annotation environments

D1.1.1	Named-entity annotation	2024-09
D1.1.2	Ingesting new unstructured resources	2025-11

W1.2 Speech processing and annotation

D1.2.1	Data collection for minority languages	2024-09
D1.2.2	Transcription service for minority languages	~~2025-09~~ 2025-11

W1.3 Video processing and annotation

D1.3.1	Tools and guidelines for video processing	2025-06

Module 2: Language Research Infrastructure (LRI)

W2.1 Personal and Copyrighted Research Data

D2.1.1	Integrate environment for personal data	2024-09
D2.1.2	Framework for processing copyrighted data for verification of research	~~2025-09~~ 2025-11

W2.2 Training environments

D2.2.1	Transformer training for specialised data	~~2024-12~~ 2025-06
D2.2.2	Transformer adaptation for specialised data	2025-12

W2.3 Translation and Interpretation

D2.3.1	Remote access to text data repositories	~~2024-12~~ 2025-09
D2.3.2	Remote access to video data repositories	2025-12

W2.4 Terminology

D2.4.1	Term definition discovery procedures	2024-09
D2.4.2	Initializing terminology collections	2025-12

Module 3: Structuring Data

W3.1 Data Management

D3.1.1	Comprehensive data versioning	2024-09
D3.1.2	Workflow automation and version syncing	2025-09

W3.2 Data Ingestion

D3.2.1	Ingestion of structured data from Finna (NLF)	~~2025-03~~ 2025-06
D3.2.2	Ingestion of heritage and societal data from Sampo	2025-06
D3.2.3	Ingestion of multimodal societal data from the Web	2025-12

W3.3 Enrichment

D3.3.1	Automated metadata of archival data from NAF	~~2025-03~~ 2025-06
D3.3.2	Automated harmonisation and enrichment of metadata	~~2024-12~~ 2025-03
D3.3.3	Machine-learning-based enrichment of social media	~~2025-06~~ 2025-09
D3.3.4	Machine-learning-based enrichment of textual and audio-visual social media contents	2025-11
D3.3.5	Forensic linguistics corpus and search interface C.R.I.M.E	2025-09
D3.3.6	Reliable image labelling with computer vision	2025-09

Module 4: Analyzing Structured Data

W4.1 Analytical Support for computational SSH

D4.1.1	Analysis of video stream interactions with AI solutions	~~2025-06~~ 2025-09
D4.1.2	Analysis Tools for Multimodal Born-digital Social Media	2024-12
D4.1.3	Advanced analytic social media tools and data	2025-12
D4.1.4	Analysis of multimodal properties of naturalistic speech	2025-12
D4.1.5	Analysis of multimodal cultural heritage	2025-12
D4.1.6	Enrich survey data with register data and unstructured text	2025-06

Module 5: Information Interaction (IIA)

W5.1 Evidence-Based Infrastructure Development

D5.1.1	Community engagement: multim. societal data researchers	2024-09
D5.1.2	Community engagement: multim. heritage researchers	2025-06
D5.1.3	Evidence-based infrastructure development	2024-12
D5.1.4	Educational resource development	2025-12


D1.1.2: Ingesting new unstructured resources

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 2024-01-01
Duration: 24 months

Report author: Jussi Piitulainen (UHEL)
WP 1.1: Report on Ingesting new unstructured resources
Date of reporting: 2024-11-28
Contributors: Jussi Piitulainen, Jyrki Niemi, Jack Rueter, Erik Axelson, Ute Dieckmann, Mietta Lennes, Tommi Jauhiainen (UHEL), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/corpora/

Keywords for the deliverable page: conversion; annotation; interoperability; VRT; UralicUD; Korp; Mink

Description

The Language Bank of Finland receives and obtains text resources in different formats ranging from plain text documents to text enriched with complex annotations and document-level metadata. We aim to ensure that the material is made available to researchers in formats that are usable and interoperable. For text corpora, the Language Bank particularly supports and promotes VRT (VeRticalized Text) as an interchange format by developing, maintaining and utilizing the set of open-source VRT Tools for converting, enriching and ingesting resources containing text. All currently supported formats can be found via the Standards Information System of CLARIN.
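As a rough illustration of what a verticalized text looks like, the following sketch serializes tokenized sentences into a VRT-like layout: structural tags on their own lines and one token per line with tab-separated annotation fields. It is a simplified sketch, not the VRT Tools implementation. The `to_vrt` function, the attribute names and the sample sentence are assumptions made for this example, and the leading comment line that names the positional attributes follows the convention as we understand it.

```python
from xml.sax.saxutils import escape, quoteattr

def to_vrt(title, sentences, attributes=("word", "lemma", "pos")):
    """Serialize tokenized sentences as VRT-like text: one token per line, tab-separated fields."""
    # Comment line naming the token-level fields (assumed convention, see lead-in).
    lines = ["<!-- #vrt positional-attributes: " + " ".join(attributes) + " -->"]
    lines.append(f"<text title={quoteattr(title)}>")
    for sentence in sentences:
        lines.append("<sentence>")
        for token in sentence:
            # Escape &, < and > so that token lines cannot be confused with tags.
            lines.append("\t".join(escape(token[name]) for name in attributes))
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"

# Invented sample sentence with hypothetical annotations.
sample = [[
    {"word": "Kissa", "lemma": "kissa", "pos": "NOUN"},
    {"word": "nukkuu", "lemma": "nukkua", "pos": "VERB"},
    {"word": ".", "lemma": ".", "pos": "PUNCT"},
]]
vrt = to_vrt("Example", sample)
print(vrt)
```

Because every token occupies exactly one line, further annotation layers can be added as extra tab-separated columns without changing the structure of the file.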

The Suomi24 resource group was extended with the discussions from the years 2021-2023 (The Suomi24 Corpus 2021-2023, VRT version, and The Suomi24 Sentences Corpus 2021-2023, Korp version). Moreover, the complete corpora The Suomi24 Sentences Corpus 2001-2023, Korp version, and The Suomi24 Corpus 2001-2023, VRT version, now include named-entity and identified-language annotations. The Ylenews resource group was also extended with material from the years 2022-2024, which was made available for download (Yle Finnish News Archive 2022-2024, source). The Korp version of this extension will be published soon.

The Language Bank contributes to the Universal Dependencies (UD) project in order to maintain validity and coverage of the treebanks not only for Finnish but also more generally for Finnic, Finno-Ugric and Uralic languages (Uralic UD). Samples of languages in these groups will also be included in the text resources licensed by the Institute for Bible Translation and in other multilingual text collections that are currently being processed for publication.

In addition to other corpora, the Language Bank participated in publishing several resources prepared by the Ancient Near Eastern Empires (ANEE) research group, including Oracc, Achemenet and Babylonian Administrative and Legal Texts (BALT), available via Korp with linkage from their corresponding lexical networks.

The Trankit toolbox (see Nguyen et al. 2021), a recommended replacement for the old dependency parsers by the Turku NLP group, was installed in the CSC Puhti environment. In our tests, Trankit proved robust for the kind of morpho-syntactic annotation of pre-segmented Finnish that we need for the existing KLK and Suomi24 corpora. Once adapted for the CWB-VRT format, Trankit would be used to re-annotate the existing corpora with the Universal Dependencies (UD2) features and dependency syntax. Trankit could also be adapted for the segmentation of paragraphs into sentences and tokens, and it adds support for many languages other than Finnish.

The Mink platform, developed by Språkbanken Text in Sweden, was test-installed by the Language Bank. Mink allows users to process their own text corpora and to access the result via a private Korp instance. After the new version of the Korp platform is officially published at the Language Bank, it will be possible to make Mink available for wider use by the community. Support for user authentication in Mink is to be added in the year 2026.

The Language Bank participates in the recently launched CLARIN PressMint project that aims to compile a multilingual, comparable, annotated, translated and interoperable set of corpora of European historical newspapers by using a common TEI format. For PressMint, we will transform the out-of-copyright data from our existing KLK corpora (newspapers and magazines from the National Library) from the CWB-VRT format to the appropriate TEI format.

References

VRT format
Finnish tagtools 1.6 – tokenization, lemmatization, named-entity recognition, …
HeLI-OTS 2.0 – off-the-shelf language identifier with language models for 220 languages
Trankit 1.0.0
Nguyen, Minh Van, Viet Lai, Amir Pouran Ben Veyseh and Thien Huu Nguyen (2021). Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.
Mink

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant number 358720.


D2.2.2: Transformer adaptation for specialised data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.2: Report on Transformer adaptation for specialised data
Date of reporting: 25-11-2025

Report authors: Erik Axelson, Jack Rueter (University of Helsinki)
Contributors: Jack Rueter (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: N/A

Description

In this work package, we aim to provide an MCP server that makes it easier for less technically oriented users to link finite-state (FST) tools with LLMs.

MCP (Model Context Protocol) provides a powerful new opportunity to bring large language model (LLM) capabilities into the research and learning of low-resource languages by creating a bridge between rule-based, finite-state linguistic tools and modern LLM-based chatbots. When HFST [1] analyzers and open-source dictionaries, designed and authored by individuals and teams at GiellaLT [2] and Apertium [3], are hosted through UralicNLP [4] libraries on an MCP server, even users with no technical background, working from a laptop or cellphone, can access lemmatizers, morphological analyzers, and translation dictionaries for dozens of minority languages. This approach opens the door to more inclusive language technology, making advanced tools available to communities that have historically lacked computer-aided support.
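To make the bridging idea more concrete, the sketch below shows how a server might answer a tool call that asks for the morphological analysis of a word form. It is a simplified, self-contained illustration and not the planned service: a small dictionary with invented analyses stands in for an HFST analyzer accessed through UralicNLP, the tool name `analyze_word` is hypothetical, and the request handling only imitates the JSON-RPC 2.0 style of an MCP `tools/call` message instead of using an actual MCP SDK.

```python
import json

# Toy lexicon standing in for an HFST morphological analyzer (invented analyses).
TOY_ANALYSES = {
    "kissat": ["kissa+N+Pl+Nom"],
    "talossa": ["talo+N+Sg+Ine"],
}

def analyze_word(word):
    """Return the analyses for a word form, or an empty list if the form is unknown."""
    return TOY_ANALYSES.get(word.lower(), [])

# Registry of tools the server exposes; the tool name is hypothetical.
TOOLS = {"analyze_word": analyze_word}

def handle_request(raw):
    """Handle one JSON-RPC 2.0 request in the style of an MCP `tools/call` message."""
    request = json.loads(raw)
    params = request.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if request.get("method") != "tools/call" or tool is None:
        # Simplification: one error code for both unknown methods and unknown tools.
        error = {"code": -32601, "message": "Method or tool not found"}
        return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), "error": error})
    output = tool(**params.get("arguments", {}))
    # The tool result is returned as a text content item.
    result = {"content": [{"type": "text", "text": json.dumps(output)}]}
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), "result": result})

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "analyze_word", "arguments": {"word": "kissat"}},
})
response = json.loads(handle_request(request))
print(response["result"]["content"][0]["text"])
```

In this arrangement the chatbot decides when to call the tool, while the linguistic knowledge stays in the rule-based analyzer behind the server, so the end user needs no local installation.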

We have experimented with running a local MCP server on a laptop and have run into memory issues. A so-called free server at CSC with more memory would be an ideal solution for individual users, as the server would host the model. Some language communities may want to keep their language data private, which would require separate access arrangements for this material. The Language Bank of Finland is planning the installation of an MCP service to allow extensive testing.

[1] HFST – Helsinki Finite-State Technology
[2] GiellaLT – an infrastructure for rule-based language technology aimed at minority and indigenous languages
[3] Apertium – a free/open-source machine translation platform
[4] UralicNLP – an NLP library for Uralic languages

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.


D2.1.2: Framework for processing copyrighted data for verification of research

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.1: Report on Framework for processing copyrighted data for verification of research
Date of reporting: 28-11-2025

Report author: Mietta Lennes (UH)
Contributors: Sirpa Kovanen (UH), Krister Lindén (UH), Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/data-management/dela/

Keywords for the deliverable page: copyrighted data, personal data, social media data, data protection, safeguards

Description

Researchers in Social Sciences and Humanities often need to use data collected from social media platforms. Currently, the reuse of social media data for research purposes is legally challenging. Part of the content originating from social media is usually protected by copyright or related rights. Social media postings (often including images and videos) may also contain personal data. The terms of use of social media platforms tend to be volatile and non-transparent, and individual permissions cannot be requested due to the large numbers of potential rightholders and data subjects.

Since neither the related EU regulations nor the Finnish legislation are well established in current legal practice, the possibilities for depositing research data from social media must be considered on a case-by-case basis. It may be possible to archive data obtained from social media and make it available for restricted purposes under certain conditions, according to Section 13 b of the Finnish Copyright Act (i.e., Tekijänoikeuslaki 13 b §), concerning data mining.

Two social media datasets have been suggested for deposition to the Language Bank of Finland: Finnish presidential elections 2024 in social media (somepressa24), collected by researchers at UHEL, and Nordic Tweet Stream 2013-2023 (nts), collected by a team at UEF. Both teams participate in the FIN-CLARIAH project. Using the potential redistribution of these two resources as an example, the legal advisors at UHEL reviewed the current legal risks and restrictions. The negotiations for depositing the first dataset are nearly complete, and the dataset is to be delivered to the Language Bank in December 2025 and made available under a RES category license in early 2026. After the first experiences with somepressa24 at UHEL, we aim for a similar deposition agreement with UEF regarding the nts dataset.

The Language Bank of Finland offers frameworks, instructions and technical solutions for deposition agreements and end-user licenses, for access management (the Language Bank Rights system at CSC), and for data encryption or secure processing in a restricted environment if necessary (SD services at CSC). Step-by-step instructions for using the Sensitive Data services (cf. Deliverable 2.1.1), including the secure SD Desktop environment, are now available both in Finnish and in English for researchers in Social Sciences and Humanities. The Language Bank also collects and shares the links to the privacy notices published by the users of the Language Bank.

Events

Presentation ”Find, use and deposit research data and tools via Kielipankki – The Language Bank of Finland” by Mietta Lennes at FIN-CLARIAH Roadshow, Vaasa, 14.3.2025
Presentation ”Licenses and data protection in the Language Bank of Finland” by Mietta Lennes at Rajapinta meet-up for researchers in Social Sciences, Helsinki/online, 27.5.2025
Discussion in the working group ”Agreements for the reuse of social media and interview data” at FIN-CLARIAH Meeting, Helsinki, 28.11.2025


D4.1.3: Advanced analytic social media tools and data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Advanced analytic social media tools and data
Date of reporting: 26-11-2025

Report author: Mikko Laitinen (UEF)
Contributors: Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location:

Keywords: social media corpora; social network tools; ego networks; gender

Description

Our work has resulted in four massive social media corpora built from one social media application. The purpose is to enable research access to large-scale and curated social media data, the lack of which is often a bottleneck in SSH (Laitinen & Rautionaho 2025). The four datasets are named Digital Social Network Corpora (DSN), as they consist not only of user-generated texts but also of detailed information on people’s social networks. They cover four geographic areas: Australia (DSN Ozzie), the Nordic countries (DSN Nordic), the United Kingdom (DSN British), and the United States (DSN America).

In total, they include 19,345 ego networks, each consisting of a central node (ego), its directly connected neighbors (alters), and the connections between the alters. These networks were filtered using a semi-automated method to target what we call genuine human accounts, meaning that we aimed to exclude accounts with unusual network qualities, such as bots, celebrities, politicians, organizations, and businesses. Under the current paid data access policies and access limits of the social media application (X), recreating a dataset comparable to the DSN corpora would cost over 3 million euros and take around 58 years.
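The ego network structure described above can be sketched in a few lines of Python. This is a generic illustration, not the semi-automated method used for compiling the DSN corpora: it assumes that connections are given as an undirected edge list, and the `ego_network` function and the sample account names are invented for the example.

```python
from collections import defaultdict

def ego_network(edges, ego):
    """Return (alters, ties): the ego's direct neighbours and all ties among ego and alters."""
    neighbours = defaultdict(set)
    for a, b in edges:
        # Assumption: ties are treated as undirected.
        neighbours[a].add(b)
        neighbours[b].add(a)
    alters = set(neighbours[ego])
    members = alters | {ego}
    # Keep only ties whose both ends belong to the ego network.
    ties = {tuple(sorted((a, b))) for a, b in edges if a in members and b in members}
    return alters, ties

# Invented sample graph: "c" and "d" are not directly connected to the ego.
edges = [("ego", "a"), ("ego", "b"), ("a", "b"), ("b", "c"), ("c", "d")]
alters, ties = ego_network(edges, "ego")
print(sorted(alters))
print(sorted(ties))
```

Ties between alters, such as the one between "a" and "b" here, are what distinguish an ego network from a plain list of contacts, since they make it possible to measure properties such as network density.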

The resulting datasets are extremely large but contain carefully curated social networks with user-generated textual material. The network datasets contain material from 829,608 users, and the data range from 2006 to 2023. Altogether, they contain more than 700 million messages and nearly 10 billion words keyed in by users.

With their detailed structure, massive size, and coverage over 17 years, the DSN corpora support new research and enable re-examining old questions in the humanities. A case in point is the role of weak ties in the spread of innovations, where prior empirical evidence in sociolinguistics comes from ethnographic observations based on very small networks. One clear limitation of ethnographic network investigations is that participant observation methods are limited to networks of 30–50 individuals. The networks in the DSN corpora are substantially larger and close to average human networks in general, making it possible to investigate a variety of networks of different sizes and structures.

Publications:
Laitinen, Mikko & Paula Rautionaho. 2025. Reuse of social media data in corpus linguistics. International Journal of Corpus Linguistics. doi: 10.1075/ijcl.24136.lai

Fatemi, Masoud & Mikko Laitinen. 2025. From tweets to networks: Introducing four large network-based social media corpora. CLARIN Annual Conference Proceedings, 2025. Ed. by Cristina Crisot and Thalassia Kontino. Vienna, Austria, 2025. pp. 100–104. (https://www.clarin.eu/sites/default/files/CLARIN2025_ConferenceProceedings.pdf)

Events:
CLARIN 2025 conference Vienna 30 Sept – 2 October 2025 (https://www.clarin.eu/event/2025/clarin-annual-conference-2025)


D3.3.4: Machine-learning-based enrichment of textual and audio-visual social media contents

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Machine-learning-based enrichment of textual and audio-visual social media contents
Date of reporting: 20-11-2025

Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)

Deliverable locations:

Keywords: video clip analysis; multimodal; MLLM; video summarization; data enrichment; Twitch

Description

The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip contains a vast amount of diverse information: the visual action, the auditory context from caster commentary, and the text-based reactions from the live chat, all representing dense and valuable data for understanding online communities. However, the sheer volume and complexity of this data create a need for efficient analysis tools. Our previous tools have focused on chat analysis or chat content detection [1, 2].

This deliverable continues the work on the tool from deliverable D4.1.1, extending it to the automated understanding and enrichment of such clips. The tool is powered by state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family, guided by a multi-step Chain-of-Thought prompt. This prompt instructs the MLLM to focus on data enrichment by systematically analyzing the clip’s metadata, audio-visual content, and chat log, and to produce a JSON file.

This structured JSON data is organized into three parts. The first part is the audiovisual analysis of the video content. It identifies all key entities involved, logs the actions in the video chronologically, transcribes the on-screen text, and breaks down caster commentary into key quotes and emotional tones. Next, the “chat reaction” section shows how the audience reacted, including the jargon used by the community, and provides a glossary explaining its cultural meaning. Finally, the “causal synthesis” connects these two modalities. It provides a narrative summary explaining why the clip matters and establishes direct causal links between the audiovisual triggers and the exact chat reactions they caused.
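The three-part organization described above might look roughly like the following record. This is only a sketch of the general shape: all key names and sample values are hypothetical and were chosen for illustration, and the actual JSON schema produced by the tool may differ. The small helper checks that a record contains all three top-level sections.

```python
import json

# Hypothetical top-level section names, modelled on the three parts described above.
REQUIRED_SECTIONS = ("audiovisual_analysis", "chat_reaction", "causal_synthesis")

# Invented example record; field names and values are for illustration only.
example = {
    "audiovisual_analysis": {
        "entities": ["caster", "player"],
        "actions": [{"time": "00:05", "event": "player wins the round"}],
        "on_screen_text": ["VICTORY"],
        "commentary": [{"quote": "What a play!", "tone": "excited"}],
    },
    "chat_reaction": {
        "reactions": [{"time": "00:06", "message": "PogChamp"}],
        "glossary": {"PogChamp": "emote expressing excitement"},
    },
    "causal_synthesis": {
        "summary": "The round win triggered a burst of excited chat messages.",
        "links": [{"trigger": "player wins the round", "reaction": "PogChamp"}],
    },
}

def missing_sections(record):
    """Return the names of the required top-level sections that are absent from a record."""
    return [name for name in REQUIRED_SECTIONS if name not in record]

# Round-trip through JSON text, as the record would be when read from a file.
record = json.loads(json.dumps(example))
print(missing_sections(record))
```

A check of this kind can be useful when processing model output in bulk, since a generated file may occasionally lack a section.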

All generated analyses are automatically saved and accessible within the video_descriptions category of the data viewer section.

Publications

[1] Jari Lindroos, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Ida Toivanen. ”From PogChamps to Insights: Detecting Original Content in Twitch Chat.” In Hawaii International Conference on System Sciences, pp. 2542-2551. Hawaii International Conference on System Sciences, 2025. https://doi.org/10.24251/hicss.2025.308

[2] Jari Lindroos, Ida Toivanen, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Sami Äyrämö. ”Participant profiling on Twitch based on chat activity and message content.” In International GamiFIN Conference, pp. 18-29. CEUR Workshop Proceedings, 2025. https://ceur-ws.org/Vol-4012/paper18.pdf

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.2.3: Ingestion of multimodal societal data from the Web

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.2: Report on Ingestion of multimodal societal data from the Web
Date of reporting: 20-11-2025

Report authors: Matti Nelimarkka (University of Helsinki), Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Matti Nelimarkka (University of Helsinki), Denis Davydov (University of Helsinki), Anita Braida (University of Helsinki), Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)

Deliverable locations:

Finnish forum scrapers https://github.com/uh-dcm/finnish-forum-scrapers
4CAT for Finnish language https://github.com/uh-dcm/4cat_fi
Youtube Chat Collector https://collector-twitcher.2.rahtiapp.fi/YouTube%F0%9F%94%B4_chat_collect
Twitch and Youtube Data Viewer https://collector-twitcher.2.rahtiapp.fi/Data_viewer
Twitch Video Collector https://collector-twitcher.2.rahtiapp.fi/Collect_videos
JYU-digihum https://github.com/JYU-digihum

Keywords for the deliverable page: Twitch, YouTube, chat data, video data

Description

This deliverable focuses on infrastructures for acquiring multimodal and societal data harvested from the web. The task includes the implementation and maintenance of data collection tools for the most popular Finnish discussion forums, YouTube, and Twitch. This deliverable contains two parts: part A, conducted by the Centre for Social Data Science, University of Helsinki, and part B, conducted by the University of Jyväskylä.

PART A: FINNISH DISCUSSION FORUMS

To ensure that researchers have access to data beyond global platforms (where data collection is a shared global concern), the University of Helsinki builds and maintains forum scrapers that extract user-generated content from sites including vauva.fi and kaksplus.fi and from the comments on yle.fi and hs.fi. These can be used through a command line interface, which produces the content as a CSV file for further analysis. We also provided modifications to the 4CAT platform (https://4cat.nl/) to ensure it operates correctly with the Finnish language.

PART B: YOUTUBE CHAT COLLECTOR & TWITCH VIDEO COLLECTOR

The team from the University of Jyväskylä presents a continuation of the deliverable for the Twitcher data collector tool. We present newly added features, such as the option to collect chat data from YouTube from either live or past broadcasts. The collected YouTube chat data can be viewed in the data viewer section and is automatically saved in CSC Allas. We also implemented the option to collect videos from past Twitch broadcasts for use with the video clip analysis tool presented in D3.3.4 and D4.1.1.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.5: Establishing Trust and Reliance on AI in History and Cultural Heritage Research – A social epistemology based view of the challenges of epistemically dependable multimodal AI systems for accessing collections.

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of multimodal cultural heritage
Date of reporting: 20-11-2025

Report author: Ilkka Lähteenmäki (University of Oulu)
Contributor: Ilkka Lähteenmäki (University of Oulu)
Deliverable location: 10.5281/zenodo.17700648

Description

This paper examines, from the point of view of social epistemology, whether historians and cultural heritage researchers can justifiably depend on multimodal AI systems for accessing large visual collections. Building on Inkeri Koskinen’s “necessary trust view” and Jakob Ortmann’s account of task-specific epistemic reliance, it argues that digital history and cultural heritage form an atypical setting for the current social epistemology of AI. In contrast to the physical sciences, where AI tools such as AlphaFold are embedded in long-standing evaluation regimes and well-defined tasks, historical research involves open-ended, exploratory questions, fuzzy and historically shifting concepts, and interpretive practices centred on individual researchers and small teams.

The paper uses examples from recent proposals for using multimodal AI for text-to-image, image-to-text and image-to-image retrieval, and for AI-assisted metadata generation and “distant viewing” of images. It shows how hopes for a multimodal turn in digital humanities confront the essential epistemic opacity of deep neural networks and the difficulty of evaluating reliability for complex open-ended retrieval tasks. Three suggested mitigation strategies are discussed: critical analyses of models and training data; historically informed reflection on bias and concept change; and fine-tuning or post-processing of models for specific purposes. From a social epistemology perspective, each strategy encounters limits when generalised to research infrastructure meant to support many corpora, tasks and user communities.

The paper then turns to approaches that argue for using multimodality theory to design metadata schemas and guide AI-based annotation. It shows how this is an attempt to shift epistemic trust from AI systems back to scholars (at least partially) in an effort to make use of the developing technology. However, this revives old debates between theories of meaning. Especially with image data, the theoretical questions of how the meanings of images should be established, and whether these theories can be implemented in computational models, need to be explored. A couple of examples from contemporary photography and medieval manuscript research illustrate both the potential of AI-supported exploration and the need for additional contextual and theoretical work to render outputs historically interpretable.

The central claim is that, given the essential epistemic opacity of AI, justified epistemic dependence in history and cultural heritage research currently appears to need to be organised around situated, task-specific, and accountable uses of multimodal models rather than general-purpose models. The options for research infrastructures seeking to establish trust are therefore to focus on building mechanisms for task-specific reliability assessment, or to embed trusted, identifiable human agents or institutions between users and models.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D1.2.1: Transcription service for minority languages

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 1.2: Report on Transcription service for minority languages
Date of reporting: 24-11-2025
Report authors: Martin Matthiesen (CSC)
Contributors: Yaroslav Getman, Tamas Grosz (Aalto), Sam Hardwick (CSC)
Deliverable location: https://github.com/CSCfi/Kielipankki-utilities/tree/master/asr/apptainer

Keywords for the deliverable page: Finland-Swedish, Sámi

Description

An Automatic Speech Recognition model for Northern Sámi (henceforth ”Sámi ASR model”)[1] has been created at Aalto University. The model has been packaged into a container[2] at CSC, which may be used in the user’s preferred computing environment, and also in CSC’s Secure Desktop environment[3] for processing sensitive data.

The packaging process, which can be repurposed for other wav2vec models and/or models available via Huggingface[4], is documented in the Language Bank’s GitHub[5] repository.

At the time of writing, the model for Finland-Swedish is still under development at Aalto University. It will be packaged as soon as it becomes available.

[1] Sámi ASR model: https://huggingface.co/GetmanY1/wav2vec2-large-sami-cont-pt-22k-finetuned

[2] https://www.kielipankki.fi/tools/sami-asr/

[3] In SD Desktop the tool can be installed using the ”auto-apptainer” tool.

[4] https://en.wikipedia.org/wiki/Hugging_Face

[5] https://github.com/CSCfi/Kielipankki-utilities/tree/master/asr/apptainer

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D2.3.2: Remote access to video data repositories

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.3: Report on Remote access to video data repositories
Date of reporting: 21-11-2025

Report authors: Tommi Jauhiainen, Erik Axelson (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024102501 and urn:nbn:fi:lb-2025081401

Description

With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API. We published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. Currently, the Korp version of the resource is being prepared in the Language Bank of Finland (LBF) resource publishing pipeline under urn:nbn:fi:lb-2024102501. The original metadata includes timestamps, which will enable direct links from the Korp service to the video material available on the Parliament servers. The Korp version will contain approximately 7,000,000 tokens corresponding to about 4,500 hours of video data.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.4: Analysis of multimodal properties of naturalistic speech: The YouTube Corpus of Singapore English Podcasts

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of multimodal properties of naturalistic speech
Date of reporting: 12-11-2025

Report author: Steven Coats (University of Oulu)
Contributors: Alessandro Basile (Sorbonne Nouvelle University, France), Cameron Morin (University of Paris-Cité, France), Robert Fuchs (University of Bonn, Germany)
Deliverable location: Online search interface: https://ycsep.corpora.li (on Zenodo).

Downloadable static corpus: https://doi.org/10.7910/DVN/B7JRID

Keywords: Singapore English, Corpus Linguistics, YouTube, World Englishes, Podcasts

Description

Recent advances in streaming protocols and automatic speech recognition (ASR) have enabled large-scale spoken language corpora, yet research on Singapore English remains constrained by small or text-based datasets. The YouTube Corpus of Singapore English Podcasts (YCSEP) addresses this gap with 620 hours of transcribed, diarized speech from over 1,300 podcast episodes by Singapore-based content creators. YCSEP supports the empirical analysis of phonetics, morphosyntax, and discourse, enabling the study of low-frequency features like discourse particles and reduplication. The dataset reflects informal, spontaneous speech from diverse speakers and facilitates investigation into nativization and endonormative stabilization processes in postcolonial English. Built using a pipeline of yt-dlp, WhisperX, and Pyannote, YCSEP offers robust empirical grounding for linguistic features such as verb complementation and modality. It also contributes to broader theoretical discussions on areal norms and construction grammar in World Englishes.

The corpus is available in two versions: An online search engine, through which transcripts and audio are accessible and downloadable (https://ycsep.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/B7JRID).

Related publication:

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. 2025. The YouTube Corpus of Singapore English Podcasts. English World-Wide. https://doi.org/10.1075/eww.25018.coa

Related presentations:

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the Mutations du Discours Numérique Seminar. Arras, France, April 22nd, 2025. https://calenda.org/1204680; https://adum.fr/script/formations.pl?mod=3633487&site=l

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the 8th Conference of the International Society for the Linguistics of English. Santiago de Compostela, Spain, September 3rd, 2025. https://isle8conference.com/

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D2.3.1: Remote access to text data repositories

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.3: Report on Remote access to text data repositories
Date of reporting: 30-09-2025

Report authors: Tommi Jauhiainen (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024071601 and urn:nbn:fi:lb-2025081401

Description

In this work package, we aimed to provide infrastructure for translation and interpretation research, both in machine translation and in translation studies, by enhancing our access to remote text data repositories. During the project, we focused on improving our access to three significant external sources of text data: the Parliament of Finland, the National Broadcasting company (Yle), and the various institutional repositories managed by the Finnish Universities.

With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API and published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. Currently, the Korp version of the resource is being prepared in the resource publishing pipeline of the Language Bank of Finland (LBF). For future updates of this resource, we plan to collaborate with the Parlamenttisampo and maintain the software components used to extract and parse the API-provided dataset together.

Similarly, we published a new source version of the Yle Finnish News Archive, covering the years 2022-2024: urn:nbn:fi:lb-2025081401. We have worked on streamlining the publishing pipeline for resources that are regularly updated, which include both the Parliament and Yle datasets. Preliminary investigations indicate that the best throughput will be achieved by creating a customized pipeline for each resource with checklists tailored to make the creation and publishing of new versions as easy as possible.

We have also created a semi-automated system that can be used to harvest all PDF-formatted publications from the institutional repositories managed by Finnish Universities. Automated harvesting was made possible by the widespread use of DSpace software as the backend of these repositories. We are further developing automated methods to determine the types of language resources that can be published based on this collection. The licenses under which the texts have been published vary considerably, and we aim to publish them as openly as possible.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.3.6: Reliable image labelling with computer vision

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Reliable Enrichment of Visual Data
Date of reporting: 29-09-2025

Report authors: Matti Nelimarkka (University of Helsinki)
Contributors: Anton Berg (University of Helsinki), Leonardo Negri (University of Helsinki)
Deliverable location: https://github.com/uh-soco/coslab-core and https://github.com/uh-dcm/coslab-gui

Description

Image recognition services, such as Amazon Rekognition, Google Vision and Azure AI Vision, allow anyone to label image content; however, their outputs vary by service (ref to image as data book). The cross-service label agreement score (COSLAB) allows researchers to quantitatively compare labels across services and determine which of the output labels are reliable. This allows researchers to use these outputs in their research and addresses a common critique of the scholarly use of such services (ref to image as data book).

The objective of this work was to (a) devise a method to assess the reliability of labels and (b) develop a graphical user interface allowing non-technical users to conduct this analysis. This objective aims to make image recognition tools available for humanities scholars and social scientists.

The underlying COSLAB score was originally developed in Berg & Nelimarkka (2023), which found no systematic differences in quality across different kinds of image datasets. This suggests that image recognition services can generally be used, particularly for exploratory image analysis.

The graphical user interface provides a non-technical frontend to image labelling services and COSLAB calculations. The drag & drop interface allows users to send images to image recognition services and then calculates per-label scores, indicating whether different image recognition services recognised similar things. The final output, containing both the per-image labels and COSLAB scores, can be exported e.g. to Microsoft Excel. This allows researchers to further use the results in their analysis tool of choice.
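To give a feel for what a per-label cross-service score expresses, the sketch below computes a deliberately simplified agreement value in Python. It is not the COSLAB computation, which is defined in Berg & Nelimarkka (2023) and in the linked repositories; it is an exact-match stand-in, and the function name, service names and labels are all hypothetical.

```python
def label_agreement(labels_by_service):
    """Score each (service, label) pair by how many other services agree.

    Simplified stand-in, NOT the published COSLAB method: a label's score is
    the share of the *other* services that returned the same label after
    lower-casing and trimming whitespace. Exact string match only.
    """
    normalised = {
        service: {label.strip().lower() for label in labels}
        for service, labels in labels_by_service.items()
    }
    n_other = len(normalised) - 1
    scores = {}
    for service, labels in normalised.items():
        for label in labels:
            hits = sum(
                1
                for other, other_labels in normalised.items()
                if other != service and label in other_labels
            )
            # With a single service there is nothing to agree with.
            scores[(service, label)] = hits / n_other if n_other else 0.0
    return scores


# Hypothetical labels for one image from three unnamed services.
scores = label_agreement({
    "service_a": ["Dog", "Grass", "Pet"],
    "service_b": ["dog", "animal"],
    "service_c": ["dog", "grass"],
})
for key in sorted(scores):
    print(key, scores[key])
```

In this toy example, a label returned by every service scores 1.0 and a label returned by only one service scores 0.0. Exact matching would treat near-synonyms such as "pet" and "animal" as disagreement, which is one reason a real agreement measure needs more than string comparison.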

Publications

Berg, A., & Nelimarkka, M. (2023). Do you see what I see? Measuring the semantic differences in image‐recognition services’ outputs. In Journal of the Association for Information Science and Technology (Vol. 74, Issue 11, pp. 1307–1324). Wiley. https://doi.org/10.1002/asi.24827

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.1: Analysis of video stream interactions with AI solutions

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of video stream interactions with AI solutions
Date of reporting: 22-09-2025

Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable location: https://collector-twitcher.2.rahtiapp.fi/Video_clip_summary

Keywords: video clip analysis; multimodal; MLLM; video summarization; Twitch

Description

The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip contains a wide variety of multimodal information: the visual action of the gameplay, the auditory context from the caster commentary, and the text-based reactions from the live chat, which together represent dense and valuable information for understanding online communities and digital entertainment. However, the sheer volume and complexity of this data create a need for efficient tools for its analysis. Our previous tools have focused on chat analysis or chat content detection [1, 2], which do not cover the diverse nature of Twitch content thoroughly enough. The primary challenge lies in the multimodal nature of the data. Twitch data is characterised by a wide range of dynamic scenes, dense on-screen information, and a complex interaction between the visual gameplay, the audio commentary, and a massive chat audience. A true understanding of a Twitch clip requires not just the perception of events within each modality but the synthesis of their interplay. This creates a clear research gap for tools that can comprehensively understand and summarize the information within these complex multimedia clips.

This deliverable presents a tool for the automated understanding and summarization of such clips. The tool uses state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family. Guided by a structured Chain-of-Thought-based prompt, it helps the user generate a chronological summary of the key audio-visual events, a thematic analysis of chat reactions, and an overall summary from the video and chat input.

Publications

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.3.3: Machine learning-based enrichment of social media

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Machine learning-based enrichment of social media
Date of reporting: 22-09-2025

Report authors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)
Contributors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)

Deliverable location:

Keywords: machine learning; social media; web registers; register variation

Description

Web-crawled datasets have become invaluable resources for SSH research, supporting diverse fields including corpus linguistics, digital humanities, and computational social science. However, publicly available web datasets like HPLT 2.0 and FineWeb provide only basic metadata about their contents, such as document URLs and crawl dates, which limits their research potential. Enriching these noisy collections with contextual metadata would greatly improve their value for SSH research.

In this deliverable, we focus on automatically identifying social media text varieties in web datasets, using machine learning. We publish the following resources:

A multilingual classifier for labeling web documents by their register (or genre), including social media categories such as blogs and forums.
Social media subtype classifiers for English, Finnish, and Swedish for identifying thematic groups within social media registers (e.g. travel topics within Narrative Blogs).
Datasets labeled with register and fine-grained social media subtype metadata.
A demonstration pipeline and tutorial on Google Colab
A code repository on GitHub

We approach the web text classification problem using the framework of register variation (Egbert and Biber 2018; Biber and Conrad 2019), where “register” denotes a text variety associated with a particular situational context, such as News report or Recipe. We use the 25-class web register taxonomy developed by Skantsi and Laippala (2023) to label 3 million randomly selected documents from the HPLT 2.0 corpus (Burchell et al. 2025) in English, Finnish, and Swedish (1M samples each). This automatic labeling uses the multilingual BGE-M3 model (Chen et al. 2024), fine-tuned for register classification following Henriksson et al. (2024).

From this 3M document sample we then select a social media subset by choosing documents labeled with any of the following three registers: Narrative Blog, Opinion Blog, or Interactive Discussion. We also include so-called “hybrids” – documents assigned to more than one register label, such as Narrative blog + Recipe. This process yields a dataset of approximately 113,000 English, 290,000 Finnish, and 335,000 Swedish social media documents, with Narrative Blogs being the most common category across all languages.
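The selection step described above can be sketched in Python as follows. The three register names come from the text; the record layout (a dictionary with an "id" and a list of "registers") and the function names are assumptions made for illustration and do not reflect the format of the published datasets.

```python
# Register names taken from the description above.
SOCIAL_MEDIA_REGISTERS = {"Narrative Blog", "Opinion Blog", "Interactive Discussion"}


def is_social_media(registers):
    """True if any assigned register is one of the social media registers.

    A hybrid document carries more than one label; it qualifies as soon as
    one of its labels is a social media register.
    """
    return any(register in SOCIAL_MEDIA_REGISTERS for register in registers)


def select_social_media(documents):
    """Keep only the documents labeled with a social media register."""
    return [doc for doc in documents if is_social_media(doc["registers"])]


# Hypothetical labeled documents; the layout is an assumption.
documents = [
    {"id": "doc1", "registers": ["Narrative Blog"]},
    {"id": "doc2", "registers": ["News report"]},
    {"id": "doc3", "registers": ["Narrative Blog", "Recipe"]},  # hybrid
    {"id": "doc4", "registers": ["Interactive Discussion"]},
]

subset = select_social_media(documents)
print([doc["id"] for doc in subset])
```

Treating the labels as a list rather than a single value keeps hybrids in the subset without any special casing.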

To further analyze the contents of the identified social media documents, we apply HDBSCAN clustering (McInnes et al. 2017) on their semantic vector representations, revealing meaningful thematic subgroups within some register categories. For instance, applying keyword analysis on the clusters, we identify hand-crafting and cooking themes in hybrid documents labeled Narrative Blog + How-to/Instructional. We develop simple logistic regression classifiers trained on these thematic clusters, allowing SSH researchers to first categorize text by register, then select social media registers of interest, and finally identify specific thematic subgroups where applicable.

References

Biber, Douglas, and Susan Conrad. 2019. Register, Genre, and Style. Cambridge: Cambridge University Press.

Burchell, Laurie, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova et al. 2025. “An expanded massive multilingual dataset for high-performance language technologies.” arXiv e-prints: arXiv-2503.

Chen, Jianlv, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. “Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.” arXiv preprint arXiv:2402.03216.

Egbert, Jesse, and Douglas Biber. 2018. Register Variation Online. Cambridge: Cambridge University Press.

Henriksson, Erik, Amanda Myntti, Saara Hellstrom, Anni Eskelinen, Selcen Erten-Johansson and Veronika Laippala. 2024. “Automatic register identification for the open web using multilingual deep learning.” arXiv preprint arXiv:2406.19892.

McInnes, Leland, John Healy, and Steve Astels. 2017. “hdbscan: Hierarchical density based clustering.” J. Open Source Softw. 2:11, 205.

Skantsi, Valtteri, and Veronika Laippala. 2023. “Analyzing the unrestricted web: The Finnish corpus of online registers.” Nordic Journal of Linguistics 48:1, 1-31.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.3.5: Forensic Linguistics Corpus and Search Interface C.R.I.M.E

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Forensic Linguistics Corpus and Search Interface C.R.I.M.E
Date of reporting: 01-09-2025

Report authors: Steven Coats (University of Oulu)
Contributors: Dana Roemling (University of Birmingham)
Deliverable location: Online search interface: https://forensic.corpora.li (DOI)

Keywords: Forensic linguistics; corpus linguistics; YouTube; investigative interviews

Description

CRIME is the Corpus of Recorded Investigative, Media, and Evidence-based proceedings, a structured, searchable resource comprising audio and ASR-generated transcripts from investigative interviews, courtroom interactions, and related media. Collected from publicly available YouTube sources according to the provisions of the EU Data Mining Act, the corpus addresses a critical gap in current research: the lack of large-scale, real-world datasets that integrate reliable transcripts with corresponding audio.

Previous studies often rely on limited data, constraining generalizability and hindering methodological innovation. By enabling detailed analysis of linguistic, phonetic, pragmatic, and discourse-level features, CRIME supports interdisciplinary research in linguistics, law, psychology, and computational modeling. Potential applications include the identification of language patterns associated with interviewing strategies and outcomes, as well as leveraging large language models to explore affective and interactional dynamics.

This resource offers substantial potential to inform both academic inquiry and evidence-based practices in investigative interviewing and broader criminal justice contexts. The corpus is available in two versions: An online search engine, powered by BlackLab, through which transcripts and audio are accessible and downloadable (https://forensic.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/MLMB6E).

Related publication:

Coats, Steven and Dana Roemling. 2025. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. In: Fábián, Annamária and Igor Trost (eds.), Impulses and Approaches to Computer-Mediated Communication Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities, 45-49. University of Bayreuth, Germany. https://www.cmc2025.uni-bayreuth.de/pool/dokumente/CMC-2025-Proceedings-2.pdf

D3.1.2: Workflow automation and version syncing

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.1: Report on Comprehensive data versioning
Date of reporting: 22-09-2025

Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Keywords for the deliverable page: versioning, updates, differences

Description

The versioning mechanism has been rigorously tested with a daily update schedule, which is far more frequent than necessary, considering that the data set changes relatively rarely and a monthly update schedule is envisaged. We have added improvements to better serve the Elastic Search use case, to make it easier to track the provenance of the dataset, and to improve the reliability of snapshot creation. Below we describe in more detail how the dataset serves the selected use cases.

Using the data set as a source for newer versions of the KLK dataset in Kielipankki.

To create ”The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1] (”KLK”, for short) using this data set[2], the original Python scripts[3] need to be changed. Currently they operate on directories extracted from zip files obtained directly from the National Library of Finland (NLF). We decided not to use these files directly for two reasons:

The files are in an internal format of the National Library and contain data which is not available publicly via the API of NLF, in this case the TIFF archive versions of scanned newspapers.
The TIFF files are very large and would significantly impact download times and storage requirements.

Contrary to the original plan, we opted in the end not to create a working proof of concept, but to explain below the steps needed to adapt the present scripts to the new format. One major change is to operate on the zip files instead of a POSIX file structure. Especially in HPC filesystems like Lustre, working on zip files is much more efficient than extracting the small files contained in them. Concretely, Python’s zipfile module[4] can be used to search for METS files within the downloaded zip files in /scratch/project_2006633/nlf-harvester/zip on Puhti. The METS files of a specific binding are contained in the ”mets” directory of said binding. The corresponding OCR data can then be found in the ”alto” directory on the same level.

The example of binding 19712 below illustrates how finding METS files (in the ”mets” directory) leads to the respective OCR data (in the ”alto” directory on the same level as the ”mets” directory).

1/19/197/1971/19712/19712/mets/19712_METS.xml
1/19/197/1971/19712/19712/alto/00001.xml
1/19/197/1971/19712/19712/alto/00002.xml
…
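
The lookup described above could be sketched in Python roughly as follows. This is a minimal illustration, not the adapted KLK scripts: the function names group_bindings and iter_bindings are hypothetical, and the sketch assumes that the member names inside each zip file follow the layout of the binding 19712 example (a METS file ending in _METS.xml in a “mets” directory, with the ALTO pages in a sibling “alto” directory).

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_bindings(member_names):
    """Group zip member names by binding directory.

    Assumes the layout <binding>/mets/<id>_METS.xml and <binding>/alto/<page>.xml,
    as in the binding 19712 example; other members are ignored.
    """
    bindings = defaultdict(lambda: {"mets": None, "alto": []})
    for name in member_names:
        path = PurePosixPath(name)
        if path.suffix.lower() != ".xml" or len(path.parts) < 3:
            continue
        kind = path.parent.name                # "mets" or "alto"
        binding_dir = str(path.parent.parent)  # directory holding both
        if kind == "mets" and path.name.endswith("_METS.xml"):
            bindings[binding_dir]["mets"] = name
        elif kind == "alto":
            bindings[binding_dir]["alto"].append(name)
    for entry in bindings.values():
        entry["alto"].sort()  # keep pages in page order
    return dict(bindings)


def iter_bindings(zip_file):
    """Yield (binding_dir, mets_bytes, alto_pages) without extracting to disk."""
    with zipfile.ZipFile(zip_file) as archive:
        grouped = group_bindings(archive.namelist())
        for binding_dir, entry in grouped.items():
            if entry["mets"] is None:
                continue  # ALTO pages without a METS file are skipped
            mets_xml = archive.read(entry["mets"])
            alto_pages = [archive.read(name) for name in entry["alto"]]
            yield binding_dir, mets_xml, alto_pages
```

Reading the members with archive.read keeps all file access inside the zip file, so no small files are created on the Lustre filesystem. The existing scripts would then receive the METS and ALTO content as bytes instead of file paths.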

One minor issue was observed: at present we only download newspapers (marked “sanomalehti”). Before the dataset can be used for the next version of “KLK”, we need to request that the collection of periodicals (marked “aikakausi”) be added to it.

Using the dataset as a basis for an Elastic Search instance containing NLF data

Another use case for the data is the Elastic Search based tool developed in WP4.3 of the previous FIN-CLARIAH development round[5]. In that use case, the NLF data is converted to JSON suitable as input for an Elastic Search engine, and it is important to keep the engine in sync with changes in the dataset. While we already provide versions, comparing these versions is resource intensive. To make comparison easier, we introduced a “log” directory (/scratch/project_2006633/nlf-harvester/log/) containing listings of the additions and deletions performed during each synchronisation, as well as general information about snapshot runs. We also made it easy to refer to a specific version of the dataset by tagging it with the hash used in the restic backup. Since the changes from one version to the next can be large (e.g. if NLF publishes a new version of the OCR’d scans), the resources on HPC login nodes are not sufficient to generate snapshots using restic. For that reason, restic is now run as an HPC job on a compute node with adequate resources.
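
As a rough illustration of how such listings could be used to keep an index in sync, the sketch below turns a list of added and a list of deleted paths into request lines for the Elasticsearch bulk API. Several things here are assumptions and not part of the harvester: the listings are assumed to contain one path per line, the index name “nlf” and the use of the path as document ID are placeholders, and load_document stands for whatever conversion from NLF data to JSON the tool already performs.

```python
import json


def read_listing(lines):
    """Return the non-empty, stripped lines of a listing (assumed: one path per line)."""
    return [line.strip() for line in lines if line.strip()]


def bulk_actions(added, deleted, load_document, index="nlf"):
    """Yield NDJSON lines for the Elasticsearch bulk API.

    `index` and the path-as-ID scheme are hypothetical; `load_document` is a
    caller-supplied function that maps a path to a JSON-serialisable document.
    """
    # Deletions first, so that a path that was removed and re-added ends up indexed.
    for path in deleted:
        yield json.dumps({"delete": {"_index": index, "_id": path}})
    for path in added:
        # An "index" action is followed by the document source on the next line.
        yield json.dumps({"index": {"_index": index, "_id": path}})
        yield json.dumps(load_document(path))
```

The resulting lines would be joined with newlines, terminated with a final newline, and sent to the bulk endpoint. Only the paths named in the listings are touched, so two full dataset versions never need to be compared.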

Summary and Outlook

The goal of this work package was to create a consistent download framework for publicly available newspaper data from the NLF. To achieve this, we used Apache Airflow for task automation and restic for versioning. It turned out that Apache Airflow is not designed to handle a large number of potentially long-running tasks at once, so we had to find compromises to reduce the number of tasks.

We ran the download pipeline daily for a few weeks without issues and are now confident that Airflow can be run on a monthly basis to update the dataset. Restic turned out to be a reliable tool for versioning. Storing the versions in Allas makes it possible to free space on Puhti if the dataset is not in active use after the end of the project. It also makes it possible to stage the dataset to other environments, such as personal laptops or the LUMI supercomputer. Long-term funding for keeping the data on Allas still needs to be worked out.

References

[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401

[2] See the Harvester documentation for details.

[3] https://github.com/CSCfi/Kielipankki-utilities/tree/master/corp/klk-alto

[4] Introduction to the Python zipfile module: https://realpython.com/python-zipfile/

[5] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

Last modified on 2026-06-05
