Event invitation: Register for the workshop to be held on 22 April 2026

Name: Conference: CMC and Social Media Corpora for the Humanities (CMC-Corpora)
Start: 2026-08-27T00:00:00+03:00
End: 2026-08-28T23:59:59+03:00
Location: Oulu

Welcome to the workshop Digital Language Sovereignty – Euskadi–Finland AI & Language, which brings together researchers and experts from Finland and the Basque Country (Euskadi) to discuss current developments and future opportunities in artificial intelligence and language technology. The event is free of charge and will be held on the University of Helsinki City Centre Campus on 22 April 2026 from 9:00 to 16:00.

During the day, we will discuss topics such as:

digital language sovereignty
low-resource languages
language models, datasets and research infrastructures
private-sector perspectives and practical applications

The event is aimed at researchers and postgraduate students, language experts, representatives of public administration, and experts working with AI and language models.

Read more here: https://www.kielipankki.fi/tapahtumat/digital-language-sovereignty-euskadi-finland-ai-language-workshop/

Register on the event page: https://euskorpora.eus/en/evento/workshop-digital-language-sovereignty-euskadi-finland-ai-language-workshop/

A new SKS publication: They call it syntax. Data-based approaches to Finnish dialects

A Finnish-language volume Sanovat syntaksiksi – Aineistopohjaisia tutkimuksia murteiden lauseopista (”They call it syntax. Data-based approaches to Finnish dialects”) has been published in the series Suomalaisen Kirjallisuuden Seuran Toimituksia. The volume presents recent research on the syntax of Finnish dialects and offers useful information about the history of the field as well as the spoken-language corpora available for research. The work also refers to several resources familiar to users of Kielipankki – the Language Bank of Finland, such as:

The volume also provides interesting background information on how these corpora have been compiled and the various stages they have gone through over the course of their existence.

Sanovat syntaksiksi – Aineistopohjaisia tutkimuksia murteiden lauseopista is openly available in digital form on the SKS (Finnish Literature Society) website: https://doi.org/10.21435/skst.1505

The Finland-Swedish Lahjoita puhetta (Donate Speech) corpus, Donera prat, featured in a Svenska Yle news story

Krister Lindén, Research Director of Kielipankki, was interviewed by Svenska Yle in November 2025.

Read the news story on the Svenska Yle website

D2.2.2: Transformer adaptation for specialised data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.2: Report on Transformer adaptation for specialised data
Date of reporting: 25-11-2025

Report authors: Erik Axelson, Jack Rueter (University of Helsinki)
Contributors: Jack Rueter (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: N/A

Description

In this work package, we aim to provide an MCP server that links finite-state (FST) tools with LLMs, so that less technically oriented users can use them together.

MCP (Model Context Protocol) provides a powerful new opportunity to bring large language model (LLM) capabilities into the research and learning of low-resource languages by creating a bridge between rule-based, finite-state linguistic tools and modern LLM-based chatbots. When HFST [1] analyzers and open-source dictionaries, authored by individuals and teams at GiellaLT [2] and Apertium [3], are hosted through UralicNLP [4] libraries on an MCP server, even users with no technical background, working from a laptop or cellphone, can access lemmatizers, morphological analyzers, and translation dictionaries for dozens of minority languages. This approach opens the door to more inclusive language technology, making advanced tools available to communities that have historically lacked computer-aided support.

We have familiarized ourselves with running a local MCP server on a laptop and have run into memory issues. A free server with more memory at CSC would be an ideal solution for individual users, as the server would host the model. Some language communities might want to keep their language data private, which would require separate access arrangements for that material. The Language Bank of Finland is planning to install an MCP service to allow extensive testing.

[1] HFST – Helsinki Finite-State Technology
[2] GiellaLT – an infrastructure for rule-based language technology aimed at minority and indigenous languages
[3] Apertium – a free/open-source machine translation platform
[4] UralicNLP – an NLP library for Uralic languages

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D2.1.2: Framework for processing copyrighted data for verification of research

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.1: Report on Framework for processing copyrighted data for verification of research
Date of reporting: 28-11-2025

Report author: Mietta Lennes (UH)
Contributors: Sirpa Kovanen (UH), Krister Lindén (UH), Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/data-management/dela/

Keywords for the deliverable page: copyrighted data, personal data, social media data, data protection, safeguards

Description

Researchers in Social Sciences and Humanities often need to use data collected from social media platforms. Currently, the reuse of social media data for research purposes is legally challenging. Some part of the content originating from social media is usually protected by copyright or related rights. Social media postings (often including images and videos) may also contain personal data. The terms of use of social media platforms tend to be volatile and non-transparent, and individual permissions cannot be requested due to the large numbers of potential rightholders and data subjects.

Since neither the related EU regulations nor the Finnish legislation are well established in current legal practice, the possibilities for depositing research data from social media must be considered on a case-by-case basis. It may be possible to archive data obtained from social media and make it available for restricted purposes under certain conditions, according to Section 13 b of the Finnish Copyright Act (i.e., Tekijänoikeuslaki 13 b §), concerning data mining.

Two social media datasets have been proposed for deposit in the Language Bank of Finland: Finnish presidential elections 2024 in social media (somepressa24), collected by researchers at UHEL, and Nordic Tweet Stream 2013-2023 (nts), collected by a team at UEF. Both teams participate in the FIN-CLARIAH project. Using the potential redistribution of these two resources as an example, the legal advisors at UHEL reviewed the current legal risks and restrictions. The negotiations for depositing the first dataset are nearly complete, and the dataset is to be delivered to the Language Bank in December 2025 and made available under a RES category license in early 2026. After the first experiences with somepressa24 at UHEL, we aim for a similar deposition agreement with UEF regarding the nts dataset.

The Language Bank of Finland offers frameworks, instructions and technical solutions for deposition agreements and end-user licenses, for access management (the Language Bank Rights system at CSC), and for data encryption or secure processing in a restricted environment if necessary (SD services at CSC). Step-by-step instructions for using the Sensitive Data services (cf. Deliverable 2.1.1), including the secure SD Desktop environment, are now available in both Finnish and English for researchers in Social Sciences and Humanities. The Language Bank also collects and shares the links to the privacy notices published by the users of the Language Bank.

Events

Presentation ”Find, use and deposit research data and tools via Kielipankki – The Language Bank of Finland” by Mietta Lennes at FIN-CLARIAH Roadshow, Vaasa, 14.3.2025
Presentation ”Licenses and data protection in the Language Bank of Finland” by Mietta Lennes at Rajapinta meet-up for researchers in Social Sciences, Helsinki/online, 27.5.2025
Discussion in the working group ”Agreements for the reuse of social media and interview data” at FIN-CLARIAH Meeting, Helsinki, 28.11.2025

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.3: Advanced analytic social media tools and data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Advanced analytic social media tools and data
Date of reporting: 26-11-2025

Report author: Mikko Laitinen (UEF)
Contributors: Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location:

Keywords: social media corpora; social network tools; ego networks; gender

Description

Our work has resulted in four massive social media corpora built from one social media application. The purpose is to enable research access to large-scale, curated social media data, the lack of which is often a bottleneck in SSH (Laitinen & Rautionaho 2025). The four datasets are named Digital Social Network Corpora (DSN), as they consist not only of user-generated texts but also of detailed information on people’s social networks. They cover four geographic areas: Australia (DSN Ozzie), the Nordic countries (DSN Nordic), the United Kingdom (DSN British), and the United States (DSN America).

In total, they include 19,345 ego networks, each consisting of a central node (ego), its directly connected neighbors (alters), and the connections between the alters. These networks were filtered using a semi-automated method to target what we call genuine human accounts, meaning that we aimed to exclude accounts with unusual network qualities, such as bots, celebrities, politicians, organizations, and businesses. Under the current paid data access policies of the social media application (X), recreating a dataset comparable to the DSN corpora would cost over 3 million euros and take around 58 years.
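The ego network structure described above can be illustrated with a minimal sketch. It assumes an undirected edge list with made-up node names; the actual DSN data format and the filtering method are not described in this report, so the function name and data below are purely hypothetical.

```python
from itertools import combinations


def ego_network(edges, ego):
    """Return (alters, alter_edges) for `ego` from an undirected edge list."""
    # Build an adjacency map from the undirected edge list.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    # Alters are the nodes directly connected to the ego.
    alters = neighbours.get(ego, set()) - {ego}
    # Keep only the ties that connect two alters to each other.
    alter_edges = {
        frozenset((a, b))
        for a, b in combinations(sorted(alters), 2)
        if b in neighbours[a]
    }
    return alters, alter_edges


# Hypothetical toy data: "d" is tied to "c" but not to the ego,
# so it is neither an alter nor part of any alter-alter tie.
edges = [("ego", "a"), ("ego", "b"), ("ego", "c"), ("a", "b"), ("c", "d")]
alters, alter_edges = ego_network(edges, "ego")
```

In this toy example the ego has three alters, and only one tie (between "a" and "b") connects alters to each other.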

The resulting datasets are extremely large but contain carefully curated social networks with user-generated textual material. The network datasets contain material from 829,608 users, and the data range from 2006 to 2023. Altogether, they contain more than 700 million messages and nearly 10 billion words keyed in by users.

With their detailed structure, massive size, and coverage over 17 years, the DSN corpora support new research and enable re-examining old questions in the humanities. A case in point is the role of weak ties in the spread of innovations, where prior empirical evidence in sociolinguistics comes from ethnographic observations based on very small networks. One clear limitation of ethnographic network investigations is that participant observation methods are limited to networks of 30–50 individuals. The networks in the DSN corpora are substantially larger and close to average human networks in general, making it possible to investigate a variety of networks of different sizes and structures.

Publications:
Laitinen, Mikko & Paula Rautionaho. 2025. Reuse of social media data in corpus linguistics. International Journal of Corpus Linguistics. doi: 10.1075/ijcl.24136.lai

Fatemi, Masoud & Mikko Laitinen. 2025. From tweets to networks: Introducing four large network-based social media corpora. CLARIN Annual Conference Proceedings, 2025. Ed. by Cristina Crisot and Thalassia Kontino. Vienna, Austria, 2025. pp. 100–104. (https://www.clarin.eu/sites/default/files/CLARIN2025_ConferenceProceedings.pdf)

Events:
CLARIN 2025 conference Vienna 30 Sept – 2 October 2025 (https://www.clarin.eu/event/2025/clarin-annual-conference-2025)

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.3.4: Machine-learning-based enrichment of textual and audio-visual social media contents

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Machine-learning-based enrichment of textual and audio-visual social media contents
Date of reporting: 20-11-2025

Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)

Deliverable locations:

Keywords: video clip analysis; multimodal; MLLM; video summarization; data enrichment; Twitch

Description

The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip contains a vast amount of diverse information: the visual action, the auditory context from caster commentary, and the text-based reactions from the live chat, all representing dense and valuable data for understanding online communities. However, the sheer volume and complexity of this data create a need for efficient analysis tools. Our previous tools have focused on chat analysis or chat content detection [1, 2].

This deliverable continues the tool from deliverable D4.1.1 for the automated understanding and enrichment of such clips. The tool is powered by state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family, guided by a multi-step Chain-of-Thought prompt. This prompt instructs the MLLM to focus on data enrichment, systematically analyzing the clip’s metadata, audio-visual content, and chat log, and producing a JSON file.

This structured JSON data is organized into three parts. The first is an audiovisual analysis of the video content: it identifies all key entities involved, logs the actions in the video chronologically, transcribes the on-screen text, and breaks down caster commentary into key quotes and emotional tones. Next, the “chat reaction” section shows how the audience reacted to the jargon used by the community and provides a glossary explaining the cultural meaning behind it. Finally, the “causal synthesis” connects these two modalities. It provides a narrative summary explaining why the clip matters and establishes direct causal links between the audiovisual triggers and the exact chat reactions they caused.
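As a rough illustration of the three-part structure, the following sketch checks whether a model response contains all three top-level sections. The key names are assumptions made for this example; the tool's actual JSON schema is not given in this report.

```python
import json

# Hypothetical key names; the tool's real schema may differ.
REQUIRED_SECTIONS = ("audiovisual_analysis", "chat_reaction", "causal_synthesis")


def missing_sections(raw_json):
    """Return the names of required top-level sections absent from the output."""
    data = json.loads(raw_json)
    return [name for name in REQUIRED_SECTIONS if name not in data]


# A made-up, incomplete response that lacks the causal synthesis part.
example = json.dumps({
    "audiovisual_analysis": {
        "entities": [], "actions": [], "on_screen_text": [], "commentary": []
    },
    "chat_reaction": {"reactions": [], "glossary": {}},
})
print(missing_sections(example))
```

A check of this kind could be run before a generated analysis is saved, so that incomplete model outputs are noticed early.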

All generated analyses are automatically saved and accessible within the video_descriptions category of the data viewer section.

Publications

[1] Jari Lindroos, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Ida Toivanen. ”From PogChamps to Insights: Detecting Original Content in Twitch Chat.” In Hawaii International Conference on System Sciences, pp. 2542-2551. Hawaii International Conference on System Sciences, 2025. https://doi.org/10.24251/hicss.2025.308

[2] Jari Lindroos, Ida Toivanen, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Sami Äyrämö. ”Participant profiling on Twitch based on chat activity and message content.” In International GamiFIN Conference, pp. 18-29. CEUR Workshop Proceedings, 2025. https://ceur-ws.org/Vol-4012/paper18.pdf

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D3.2.3: Ingestion of multimodal societal data from the Web

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.2: Report on Ingestion of multimodal societal data from the Web
Date of reporting: 20-11-2025

Report authors: Matti Nelimarkka (University of Helsinki), Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Matti Nelimarkka (University of Helsinki), Denis Davydov (University of Helsinki), Anita Braida (University of Helsinki), Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)

Deliverable locations:

Finnish forum scrapers https://github.com/uh-dcm/finnish-forum-scrapers
4CAT for Finnish language https://github.com/uh-dcm/4cat_fi
Youtube Chat Collector https://collector-twitcher.2.rahtiapp.fi/YouTube%F0%9F%94%B4_chat_collect
Twitch and Youtube Data Viewer https://collector-twitcher.2.rahtiapp.fi/Data_viewer
Twitch Video Collector https://collector-twitcher.2.rahtiapp.fi/Collect_videos
JYU-digihum https://github.com/JYU-digihum

Keywords for the deliverable page: Twitch, YouTube, chat data, video data

Description

This deliverable focuses on infrastructures for acquiring multimodal and societal data harvested from the web. The task includes the implementation and maintenance of data collection tools for the most popular Finnish discussion forums, YouTube, and Twitch. This deliverable contains two parts: part A, conducted by the Centre for Social Data Science, University of Helsinki, and part B, conducted by the University of Jyväskylä.

PART A: FINNISH DISCUSSION FORUMS

To ensure that researchers have access to data beyond the global platforms (where data collection is a shared global concern), the University of Helsinki builds and maintains forum scrapers that extract user-generated content from Finnish sites, including vauva.fi and kaksplus.fi, as well as comments on yle.fi and hs.fi. The scrapers are used through a command-line interface and output the content as a CSV file for further analysis. We also provided modifications to the 4CAT platform (https://4cat.nl/) to ensure that it operates correctly with the Finnish language.

PART B: YOUTUBE CHAT COLLECTOR & TWITCH VIDEO COLLECTOR

The team from the University of Jyväskylä presents a continuation of the deliverable for the Twitcher data collector tool. New features include the option to collect chat data from YouTube, from either live or past broadcasts. The collected YouTube chat data can be viewed in the data viewer section and is automatically saved in CSC Allas. We also implemented the option to collect videos from past Twitch broadcasts, in support of the video clip analysis tool presented in D3.3.4 and D4.1.1.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.5: Establishing Trust and Reliance on AI in History and Cultural Heritage Research – A social epistemology based view of the challenges of epistemically dependable multimodal AI systems for accessing collections.

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of multimodal cultural heritage
Date of reporting: 20-11-2025

Report author: Ilkka Lähteenmäki (University of Oulu)
Contributor: Ilkka Lähteenmäki (University of Oulu)
Deliverable location: 10.5281/zenodo.17700648

Description

This paper examines, from the point of view of social epistemology, whether historians and cultural heritage researchers can justifiably depend on multimodal AI systems for accessing large visual collections. Building on Inkeri Koskinen’s “necessary trust view” and Jakob Ortmann’s account of task-specific epistemic reliance, it argues that digital history and cultural heritage form an atypical setting for the current social epistemology of AI. In contrast to the physical sciences, where AI tools such as AlphaFold are embedded in long-standing evaluation regimes and well-defined tasks, historical research involves open-ended, exploratory questions, fuzzy and historically shifting concepts, and interpretive practices centred on individual researchers and small teams.

The paper uses examples from recent proposals for using multimodal AI for text-to-image, image-to-text and image-to-image retrieval, and for AI-assisted metadata generation and “distant viewing” of images. It shows how hopes for a multimodal turn in digital humanities confront the essential epistemic opacity of deep neural networks and the difficulty of evaluating reliability for complex, open-ended retrieval tasks. Three suggested mitigation strategies are discussed: critical analyses of models and training data; historically informed reflection on bias and concept change; and fine-tuning or post-processing of models for specific purposes. From a social epistemology perspective, each strategy encounters limits when generalised to research infrastructure meant to support many corpora, tasks and user communities.

The paper then turns to approaches that argue for using multimodality theory to design metadata schemas and guide AI-based annotation. It shows how this is an attempt to shift epistemic trust from AI systems back to scholars, at least partially, in an effort to make use of the developing technology. However, this reopens old debates between theories of meaning. With image data in particular, it remains to be explored how the meanings of images should be established and whether these theories can be implemented in computational models. A couple of examples from contemporary photography and medieval manuscript research illustrate both the potential of AI-supported exploration and the need for additional contextual and theoretical work to render outputs historically interpretable.

The central claim is that, given the essential epistemic opacity of AI, justified epistemic dependence in history and cultural heritage research currently appears to need to be organised around situated, task-specific, and accountable uses of multimodal models rather than general-purpose models. The options for research infrastructures for establishing trust are therefore to focus on building mechanisms for task-specific reliability assessment or to embed trusted, identifiable human agents or institutions between users and models.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D1.2.1: Transcription service for minority languages

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 1.2: Report on Transcription service for minority languages
Date of reporting: 24-11-2025
Report author: Martin Matthiesen (CSC)
Contributors: Yaroslav Getman, Tamas Grosz (Aalto), Sam Hardwick (CSC)
Deliverable location: https://github.com/CSCfi/Kielipankki-utilities/tree/master/asr/apptainer

Keywords for the deliverable page: Finland-Swedish, Sámi

Description

An Automatic Speech Recognition model for Northern Sámi[1] (henceforth ”Sámi ASR model”) has been created at Aalto University. The model has been packaged into a container[2] at CSC, which may be used in the user’s preferred computing environment, and also in CSC’s Secure Desktop environment[3] for processing sensitive data.

The packaging process, which can be repurposed for other wav2vec models and/or models available via Hugging Face[4], is documented in the Language Bank’s GitHub[5] repository.

At the time of writing, the model for Finland-Swedish is still under development at Aalto University. It will be packaged as soon as it becomes available.

[1] Sámi ASR model: https://huggingface.co/GetmanY1/wav2vec2-large-sami-cont-pt-22k-finetuned

[2] https://www.kielipankki.fi/tools/sami-asr/

[3] In SD Desktop the tool can be installed using the ”auto-apptainer” tool.

[4] https://en.wikipedia.org/wiki/Hugging_Face

[5] https://github.com/CSCfi/Kielipankki-utilities/tree/master/asr/apptainer

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D2.3.2: Remote access to video data repositories

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.3: Report on Remote access to video data repositories
Date of reporting: 21-11-2025

Report authors: Tommi Jauhiainen, Erik Axelson (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024102501 and urn:nbn:fi:lb-2025081401

Description

With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API. We published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. Currently, the Korp version of the resource is being prepared in the Language Bank of Finland (LBF) resource publishing pipeline under urn:nbn:fi:lb-2024102501. The original metadata includes timestamps, which will enable direct links from the Korp service to the video material available on the Parliament servers. The Korp version will contain approximately 7,000,000 tokens corresponding to about 4,500 hours of video data.

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.4: Analysis of multimodal properties of naturalistic speech: The YouTube Corpus of Singapore English Podcasts

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of multimodal properties of naturalistic speech
Date of reporting: 12-11-2025

Report author: Steven Coats (University of Oulu)
Contributors: Alessandro Basile (Sorbonne Nouvelle University, France), Cameron Morin (University of Paris-Cité, France), Robert Fuchs (University of Bonn, Germany)
Deliverable location: Online search interface: https://ycsep.corpora.li (on Zenodo).

Downloadable static corpus: https://doi.org/10.7910/DVN/B7JRID

Keywords: Singapore English, Corpus Linguistics, YouTube, World Englishes, Podcasts

Description

Recent advances in streaming protocols and automatic speech recognition (ASR) have enabled large-scale spoken language corpora, yet research on Singapore English remains constrained by small or text-based datasets. The YouTube Corpus of Singapore English Podcasts (YCSEP) addresses this gap with 620 hours of transcribed, diarized speech from over 1,300 podcast episodes by Singapore-based content creators. YCSEP supports the empirical analysis of phonetics, morphosyntax, and discourse, enabling the study of low-frequency features like discourse particles and reduplication. The dataset reflects informal, spontaneous speech from diverse speakers and facilitates investigation into nativization and endonormative stabilization processes in postcolonial English. Built using a pipeline of yt-dlp, WhisperX, and Pyannote, YCSEP offers robust empirical grounding for linguistic features such as verb complementation and modality. It also contributes to broader theoretical discussions on areal norms and construction grammar in World Englishes.

The corpus is available in two versions: An online search engine, through which transcripts and audio are accessible and downloadable (https://ycsep.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/B7JRID).

Related publication:

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. 2025. The YouTube Corpus of Singapore English Podcasts. English World-Wide. https://doi.org/10.1075/eww.25018.coa

Related presentations:

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the Mutations du Discours Numérique Seminar. Arras, France, April 22nd, 2025. https://calenda.org/1204680; https://adum.fr/script/formations.pl?mod=3633487&site=l

Coats, Steven, Carmelo Alessandro Basile, Cameron Morin, and Robert Fuchs. The YouTube Corpus of Singapore English Podcasts. Presentation at the 8th Conference of the International Society for the Linguistics of English. Santiago de Compostela, Spain, September 3rd, 2025. https://isle8conference.com/

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

The Finnish and Finland-Swedish Lahjoita puhetta (Donate Speech) corpora and the LUMI AI Factory featured in a Yle news story

Krister Lindén, Research Director of Kielipankki, was interviewed by Yle on 13 October 2025.

Read the news story on the Yle website

D2.3.1: Remote access to text data repositories

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.3: Report on Remote access to text data repositories
Date of reporting: 30-09-2025

Report authors: Tommi Jauhiainen (University of Helsinki)
Contributors: Erik Axelson, Ute Dieckmann, Heidi Jauhiainen, Mietta Lennes, Jussi Piitulainen (University of Helsinki), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: urn:nbn:fi:lb-2024071601 and urn:nbn:fi:lb-2025081401

Description

In this work package, we aimed to provide infrastructure for translation and interpretation research, both in machine translation and in translation studies, by enhancing our access to remote text data repositories. During the project, we focused on improving our access to three significant external sources of text data: the Parliament of Finland, the national broadcasting company (Yle), and the various institutional repositories managed by Finnish universities.

With the cooperation of the Finnish Parliament, we deepened our understanding of the Parliament API and published a source version of a dataset containing speeches from plenary sessions from 2015 to 2023: urn:nbn:fi:lb-2024071601. Currently, the Korp version of the resource is being prepared in the resource publishing pipeline of the Language Bank of Finland (LBF). For future updates of this resource, we plan to collaborate with the Parlamenttisampo and maintain the software components used to extract and parse the API-provided dataset together.

Similarly, we published a new source version of the Yle Finnish News Archive, covering the years 2022-2024: urn:nbn:fi:lb-2025081401. We have worked on streamlining the publishing pipeline for resources that are regularly updated, which include both the Parliament and Yle datasets. Preliminary investigations indicate that the best throughput will be achieved by creating a customized pipeline for each resource with checklists tailored to make the creation and publishing of new versions as easy as possible.

We have also created a semi-automated system that can be used to harvest all PDF-formatted publications from the institutional repositories managed by Finnish Universities. Automated harvesting was made possible by the widespread use of DSpace software as the backend of these repositories. We are further developing automated methods to determine the types of language resources that can be published based on this collection. The licenses under which the texts have been published vary considerably, and we aim to publish them as openly as possible.
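One way such harvesting could work is sketched below. It assumes that the repositories expose the standard OAI-PMH interface that DSpace supports; the report does not say which interface the project's system actually uses. The base URL and sample response are made up, and whether PDF links appear in `dc:identifier` depends on how each repository is configured.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH and Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def pdf_links(response_xml):
    """Collect dc:identifier values that end in .pdf from a ListRecords response."""
    root = ET.fromstring(response_xml.strip())
    links = []
    for record in root.iter(OAI + "record"):
        for ident in record.iter(DC + "identifier"):
            text = (ident.text or "").strip()
            if text.lower().endswith(".pdf"):
                links.append(text)
    return links


# Hypothetical response from a hypothetical repository.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:identifier>https://repo.example.org/bitstream/1/thesis.pdf</dc:identifier>
          <dc:identifier>https://repo.example.org/handle/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("https://repo.example.org/oai/request")
links = pdf_links(SAMPLE)
```

A real harvester would also need to follow OAI-PMH resumption tokens to page through large result sets and to record each item's license, both of which are omitted here.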

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.


D3.3.6: Reliable image labelling with computer vision

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Reliable Enrichment of Visual Data
Date of reporting: 29-09-2025

Report authors: Matti Nelimarkka (University of Helsinki)
Contributors: Anton Berg (University of Helsinki), Leonardo Negri (University of Helsinki)
Deliverable location: https://github.com/uh-soco/coslab-core and https://github.com/uh-dcm/coslab-gui

Description

Image recognition services, such as Amazon Rekognition, Google Vision and Azure AI Vision, allow anyone to label image content; however, their outputs vary from service to service (ref to image as data book). The cross-service label agreement score (COSLAB) allows researchers to compare labels quantitatively across services and determine which of the output labels are reliable. This lets researchers use these outputs in their work and addresses a common criticism of the scholarly use of such services (ref to image as data book).

The objective of this work was to (a) devise a method to assess the reliability of labels and (b) develop a graphical user interface that allows non-technical users to conduct this analysis. The aim is to make image recognition tools accessible to humanities scholars and social scientists.

COSLAB was originally developed in Berg & Nelimarkka (2023). That study found no systematic differences in label quality across different kinds of image datasets, which suggests that image recognition services can be used overall, particularly for exploratory image analysis.

The graphical user interface provides a non-technical frontend to the image labelling services and the COSLAB calculations. Through a drag-and-drop interface, users send images to the image recognition services; the tool then calculates per-label scores that indicate whether the different services recognised similar things. The final output, containing both the per-image labels and the COSLAB scores, can be exported, for example to Microsoft Excel, so that researchers can continue working with the results in their analysis tool of choice.
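The idea of a per-label agreement score can be illustrated with a deliberately simplified sketch. The published method compares labels semantically (see Berg & Nelimarkka 2023); the code below substitutes exact string matching after lower-casing, so it is not the COSLAB formula. The function name, the service keys and the example labels are all hypothetical.

```python
def exact_match_agreement(labels_by_service):
    """Score each label by the share of *other* services that returned it too.

    Simplifying assumption: two labels agree only if they are identical
    after lower-casing. This is not the published COSLAB formula.
    """
    normalised = {
        service: {label.strip().lower() for label in labels}
        for service, labels in labels_by_service.items()
    }
    scores = {}
    for service, labels in normalised.items():
        others = [found for name, found in normalised.items() if name != service]
        for label in labels:
            hits = sum(label in found for found in others)
            scores[(service, label)] = hits / len(others) if others else 0.0
    return scores


# Hypothetical labels returned for a single image by three services.
example = {
    "rekognition": ["Dog", "Grass"],
    "google_vision": ["dog", "lawn"],
    "azure_vision": ["dog", "grass"],
}
scores = exact_match_agreement(example)
```

In this toy input, "dog" is returned by every service and scores 1.0, while "lawn" is returned by only one and scores 0.0. A semantic comparison could treat "lawn" and "grass" as close, which is what exact matching misses.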

Publications

Berg, A., & Nelimarkka, M. (2023). Do you see what I see? Measuring the semantic differences in image‐recognition services’ outputs. Journal of the Association for Information Science and Technology, 74(11), 1307–1324. https://doi.org/10.1002/asi.24827



D4.1.1: Analysis of video stream interactions with AI solutions

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Analysis of video stream interactions with AI solutions
Date of reporting: 22-09-2025

Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (JYU), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable location: https://collector-twitcher.2.rahtiapp.fi/Video_clip_summary

Keywords: video clip analysis; multimodal; MLLM; video summarization; Twitch

Description

The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip carries several kinds of information at once: the visual action of the gameplay, the auditory context from the caster commentary, and the text-based reactions from the live chat. Together these form a dense and valuable source for understanding online communities and digital entertainment. However, the sheer volume and complexity of this data creates a need for efficient analysis tools. Our previous tools focused on chat analysis or chat content detection [1, 2], which does not cover the diverse content on Twitch thoroughly enough.

The primary challenge lies in the multimodal nature of the data. Twitch data is characterized by a wide range of dynamic scenes, dense on-screen information, and a complex interaction between the visual gameplay, the audio commentary, and a massive chat audience. Truly understanding a Twitch clip requires not just perceiving the events within each modality but also synthesizing their interplay. This leaves a clear research gap for tools that can comprehensively understand and summarize the information within these complex multimedia clips.

This deliverable presents a tool for the automated understanding and summarization of such clips. The tool uses state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family. From the video and chat input, it helps the user generate a chronological summary of the key audio-visual events, a thematic analysis of chat reactions, and an overall summary. The generation is guided by a structured Chain-of-Thought-based prompt.
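To make the notion of a structured, step-by-step prompt more concrete, here is a minimal sketch of how such a prompt could be assembled for one clip. The tool's actual prompt is not reproduced in this report, so the wording of the steps, the function name `build_clip_prompt` and the chat tuple format are assumptions made purely for illustration. No model is called.

```python
def build_clip_prompt(chat_messages):
    """Assemble a step-by-step prompt for one clip and its chat log.

    chat_messages: iterable of (seconds_from_start, user, text) tuples.
    The wording of the steps is illustrative, not the tool's actual prompt.
    """
    chat_log = "\n".join(
        f"[{seconds:6.1f}s] {user}: {text}" for seconds, user, text in chat_messages
    )
    steps = [
        "List the key audio-visual events of the clip in chronological order.",
        "Group the chat reactions into themes and link each theme to the events.",
        "Write an overall summary that combines the video and the chat.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "You are given a short Twitch clip and its live chat log.\n"
        "Reason through the following steps one at a time:\n"
        f"{numbered}\n\n"
        f"Chat log:\n{chat_log}"
    )


# Hypothetical chat messages for a single clip.
prompt = build_clip_prompt([(1.5, "viewer_a", "what a play"), (2.0, "viewer_b", "GG")])
```

The three steps mirror the three outputs described above; the resulting text would be sent to the model together with the video.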

Publications



D3.3.3: Machine learning-based enrichment of social media

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Machine learning-based enrichment of social media
Date of reporting: 22-09-2025

Report authors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)
Contributors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)

Deliverable location:

Keywords: machine learning; social media; web registers; register variation

Description

Web-crawled datasets have become invaluable resources for SSH research, supporting diverse fields including corpus linguistics, digital humanities, and computational social science. However, publicly available web datasets like HPLT 2.0 and FineWeb provide only basic metadata about their contents, such as document URLs and crawl dates, which limits their research potential. Enriching these noisy collections with contextual metadata would greatly improve their value for SSH research.

In this deliverable, we focus on automatically identifying social media text varieties in web datasets, using machine learning. We publish the following resources:

A multilingual classifier for labeling web documents by their register (or genre), including social media categories such as blogs and forums.
Social media subtype classifiers for English, Finnish, and Swedish for identifying thematic groups within social media registers (e.g. travel topics within Narrative Blogs).
Datasets labeled with register and fine-grained social media subtype metadata.
A demonstration pipeline and tutorial on Google Colab.
A code repository on GitHub.

We approach the web text classification problem using the framework of register variation (Egbert and Biber 2018; Biber and Conrad 2019), where “register” denotes a text variety associated with a particular situational context, such as News report or Recipe. We use the 25-class web register taxonomy developed by Skantsi and Laippala (2023) to label 3 million randomly selected documents from the HPLT 2.0 corpus (Burchell et al. 2025) in English, Finnish, and Swedish (1M samples each). This automatic labeling uses the multilingual BGE-M3 model (Chen et al. 2024), fine-tuned for register classification following Henriksson et al. (2024).

From this 3M-document sample, we then select a social media subset by choosing documents labeled with any of the following three registers: Narrative Blog, Opinion Blog, or Interactive Discussion. We also include so-called “hybrids”, that is, documents assigned more than one register label, such as Narrative Blog + Recipe. This process yields a dataset of approximately 113,000 English, 290,000 Finnish, and 335,000 Swedish social media documents, with Narrative Blogs being the most common category across all languages.
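The selection step described above can be sketched as a simple filter over register labels. The document structure (dictionaries with `id` and `registers` keys) and the exact label strings are assumptions for illustration; the published datasets may encode registers differently.

```python
SOCIAL_MEDIA_REGISTERS = {"Narrative Blog", "Opinion Blog", "Interactive Discussion"}


def select_social_media(documents):
    """Keep documents that carry at least one social media register label.

    Hybrids (documents with several labels) are kept as long as one of
    their labels is a social media register.
    """
    return [
        doc for doc in documents
        if SOCIAL_MEDIA_REGISTERS.intersection(doc["registers"])
    ]


# Hypothetical labeled documents.
sample = [
    {"id": "a", "registers": ["News report"]},
    {"id": "b", "registers": ["Narrative Blog", "Recipe"]},  # hybrid
    {"id": "c", "registers": ["Opinion Blog"]},
    {"id": "d", "registers": []},
]
selected = select_social_media(sample)
```

Here documents "b" and "c" are kept, while the news report and the unlabeled document are dropped.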

To further analyze the contents of the identified social media documents, we apply HDBSCAN clustering (McInnes et al. 2017) on their semantic vector representations, revealing meaningful thematic subgroups within some register categories. For instance, applying keyword analysis on the clusters, we identify hand-crafting and cooking themes in hybrid documents labeled Narrative Blog + How-to/Instructional. We develop simple logistic regression classifiers trained on these thematic clusters, allowing SSH researchers to first categorize text by register, then select social media registers of interest, and finally identify specific thematic subgroups where applicable.

References

Biber, Douglas, and Susan Conrad. 2019. Register, Genre, and Style. Cambridge: Cambridge University Press.

Burchell, Laurie, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova et al. 2025. “An expanded massive multilingual dataset for high-performance language technologies.” arXiv e-prints: arXiv-2503.

Chen, Jianlv, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. “BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.” arXiv preprint arXiv:2402.03216.

Egbert, Jesse, and Douglas Biber. 2018. Register Variation Online. Cambridge: Cambridge University Press.

Henriksson, Erik, Amanda Myntti, Saara Hellström, Anni Eskelinen, Selcen Erten-Johansson, and Veronika Laippala. 2024. “Automatic register identification for the open web using multilingual deep learning.” arXiv preprint arXiv:2406.19892.

McInnes, Leland, John Healy, and Steve Astels. 2017. “hdbscan: Hierarchical density based clustering.” Journal of Open Source Software 2:11, 205.

Skantsi, Valtteri, and Veronika Laippala. 2023. “Analyzing the unrestricted web: The Finnish corpus of online registers.” Nordic Journal of Linguistics 48:1, 1-31.



D3.3.5: Forensic Linguistics Corpus and Search Interface C.R.I.M.E

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Forensic Linguistics Corpus and Search Interface C.R.I.M.E
Date of reporting: 01-09-2025

Report authors: Steven Coats (University of Oulu)
Contributors: Dana Roemling (University of Birmingham)
Deliverable location: Online search interface: https://forensic.corpora.li; static download: https://doi.org/10.7910/DVN/MLMB6E

Keywords: forensic linguistics; corpus linguistics; YouTube; investigative interviews

Description

CRIME is the Corpus of Recorded Investigative, Media, and Evidence-based proceedings, a structured, searchable resource comprising audio and ASR-generated transcripts from investigative interviews, courtroom interactions, and related media. Collected from publicly available YouTube sources according to the provisions of the EU Data Mining Act, the corpus addresses a critical gap in current research: the lack of large-scale, real-world datasets that integrate reliable transcripts with corresponding audio.

Previous studies often rely on limited data, constraining generalizability and hindering methodological innovation. By enabling detailed analysis of linguistic, phonetic, pragmatic, and discourse-level features, CRIME supports interdisciplinary research in linguistics, law, psychology, and computational modeling. Potential applications include the identification of language patterns associated with interviewing strategies and outcomes, as well as leveraging large language models to explore affective and interactional dynamics.

This resource offers substantial potential to inform both academic inquiry and evidence-based practices in investigative interviewing and broader criminal justice contexts. The corpus is available in two versions: an online search engine, powered by BlackLab, through which transcripts and audio can be accessed and downloaded (https://forensic.corpora.li), and a static, text-only, downloadable version containing transcripts and metadata in tabular form (https://doi.org/10.7910/DVN/MLMB6E).

Related publication:

Coats, Steven and Dana Roemling. 2025. CRIME: The Corpus of Recorded Investigative, Media, and Evidence-based Proceedings. In: Fábián, Annamária and Igor Trost (eds.), Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities, 45-49. University of Bayreuth, Germany. https://www.cmc2025.uni-bayreuth.de/pool/dokumente/CMC-2025-Proceedings-2.pdf

D3.1.2: Workflow automation and version syncing

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.1: Report on Comprehensive data versioning
Date of reporting: 22-09-2025

Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Keywords: versioning; updates; differences

Description

The versioning mechanism has been rigorously tested with a daily update schedule. This is far more frequent than needed: the dataset changes relatively rarely, and a monthly update schedule is envisaged. We have added improvements to better serve the Elastic Search use case, to make it easier to track the provenance of the dataset, and to improve the reliability of snapshot creation. Below we describe in more detail how the dataset serves the selected use cases.

Using the data set as a source for newer versions of the KLK dataset in Kielipankki.

To create ”The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1] (”KLK” for short) from this dataset[2], the original Python scripts[3] need to be changed. At present they operate on directories extracted from zip files obtained directly from the National Library of Finland (NLF). We decided not to use these files directly for two reasons:

The files are in an internal format of the National Library and contain data which is not available publicly via the API of NLF, in this case the TIFF archive versions of scanned newspapers.
The TIFF files are very large and would significantly impact download times and storage requirements.

Contrary to the original plan, we decided not to create a working proof of concept; instead, we explain below the steps needed to adapt the present scripts to the new format. One major change is to operate on the zip files instead of a POSIX file structure. Especially on HPC file systems like Lustre, working on zip files is much more efficient than extracting the small files contained in them. Concretely, Python’s zipfile module[4] can be used to search for METS files within the downloaded zip files in /scratch/project_2006633/nlf-harvester/zip on Puhti. The METS files of a specific binding are contained in the ”mets” directory of that binding. The corresponding OCR data can then be found in the ”alto” directory on the same level.

The example of binding 19712 below illustrates how finding a METS file (in the ”mets” directory) leads to the respective OCR data (in the ”alto” directory on the same level as the ”mets” directory).

1/19/197/1971/19712/19712/mets/19712_METS.xml
1/19/197/1971/19712/19712/alto/00001.xml
1/19/197/1971/19712/19712/alto/00002.xml
…
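The lookup described above can be sketched with Python's standard zipfile module. This is a minimal illustration, not the adapted KLK scripts: the function names are hypothetical, and the only layout assumed is the ”mets” and ”alto” sibling directories shown in the listing. Only the member listing of the archive is read; nothing is extracted to disk.

```python
import zipfile
from pathlib import PurePosixPath


def pair_mets_with_alto(member_names):
    """Map each METS file to the ALTO pages in the sibling "alto" directory."""
    alto_by_binding = {}
    mets_files = []
    for name in member_names:
        path = PurePosixPath(name)
        if path.suffix != ".xml":
            continue  # skips directory entries and non-XML members
        if path.parent.name == "alto":
            alto_by_binding.setdefault(path.parent.parent, []).append(name)
        elif path.parent.name == "mets":
            mets_files.append(name)
    return {
        mets: sorted(alto_by_binding.get(PurePosixPath(mets).parent.parent, []))
        for mets in mets_files
    }


def bindings_in_zip(zip_path):
    """Read only the member listing of a zip file; nothing is extracted."""
    with zipfile.ZipFile(zip_path) as archive:
        return pair_mets_with_alto(archive.namelist())


# The member names of binding 19712 from the listing above.
names = [
    "1/19/197/1971/19712/19712/mets/19712_METS.xml",
    "1/19/197/1971/19712/19712/alto/00001.xml",
    "1/19/197/1971/19712/19712/alto/00002.xml",
]
pairs = pair_mets_with_alto(names)
```

Individual members can then be read directly from the archive with `ZipFile.open(name)`, which avoids creating many small files on the file system.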

One minor issue was observed: before using the dataset for the next version of ”KLK”, we need to request that a collection of periodicals (marked ”aikakausi”) be added to the dataset. At present we only download newspapers (marked ”sanomalehti”).

Using the dataset as a basis for an Elastic Search instance containing NLF data

Another use case for the data is the Elastic Search based tool developed in WP4.3 of the previous FIN-CLARIAH development round[5]. There, the NLF data is converted to JSON suitable as input for an Elastic Search engine, and it is important to keep the engine in sync with changes in the dataset. While we already provide versions, comparing these versions is resource intensive. To make comparison easier, we introduced a ”log” directory (/scratch/project_2006633/nlf-harvester/log/) containing listings of the additions and deletions performed during each synchronisation, as well as general information about snapshot runs. We also made it easy to refer to a specific version of the dataset by tagging it with the hash used in the restic backup. Since the changes from one version to another can be large (e.g. if NLF publishes a new version of the OCR’d scans), the resources on HPC login nodes are not sufficient to generate snapshots with restic. For that reason, restic is now run as an HPC job on a compute node with adequate resources.
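As a rough sketch of how a search index could follow these listings instead of comparing whole versions, the function below turns the additions and deletions of one synchronisation run into two work lists. The format of the log files is not described here, so the sketch assumes the paths have already been read into plain Python lists; the function name and the example paths are hypothetical.

```python
def plan_index_updates(additions, deletions):
    """Return (paths_to_index, paths_to_remove) for one synchronisation run.

    Assumption: a path listed as both deleted and added was replaced, so it
    is re-indexed rather than removed.
    """
    added = set(additions)
    removed = set(deletions) - added
    return sorted(added), sorted(removed)


# Hypothetical listings from a single run.
to_index, to_remove = plan_index_updates(
    additions=["a/alto/00001.xml", "b/alto/00001.xml"],
    deletions=["b/alto/00001.xml", "c/alto/00001.xml"],
)
```

Only the listed paths need to be touched in the index, which keeps the cost of an update proportional to the size of the change rather than the size of the dataset.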

Summary and Outlook

The goal of this work package was to create a consistent download framework for publicly available newspaper data from the NLF. To achieve this, we used Apache Airflow for task automation and Restic for versioning. It turned out that Apache Airflow is not designed to handle a large number of potentially long-running tasks at once, so we had to find compromises to reduce the number of tasks.

We ran the download pipeline on a daily basis for a few weeks without issue and are now confident that Airflow can be run on a monthly basis to update the dataset. Restic turned out to be a reliable tool for versioning. Versioning to Allas makes it possible to free space on Puhti if the dataset is not in active use after the end of the project. It also makes it possible to stage the dataset to other environments, such as personal laptops or the LUMI supercomputer. Long-term funding for keeping the data on Allas still needs to be worked out.

References

[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401

[2] See the Harvester documentation for details.

[3] https://github.com/CSCfi/Kielipankki-utilities/tree/master/corp/klk-alto

[4] Introduction to the Python zipfile module: https://realpython.com/python-zipfile/

[5] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)


Login to the services of the Language Bank of Finland has been updated

Login to the Korp service and the download service of the Language Bank of Finland has been updated, and the login page should now also look clearer than before. The Language Bank Rights service (LBR, https://lbr.csc.fi) will also be added to the new login system later this year.

The login page shows both domestic and international login methods. The options are Haka, eduGAIN, CLARIN and Eduuni.

Haka, eduGAIN or CLARIN login is usually required for using resources that are available for academic research purposes under CLARIN ACA licenses.

The Eduuni login method is intended primarily for non-academic users, for example for limited commercial use of certain resources.

With any of the login methods, you can apply for access to resources in the Language Bank Rights service.


Contact information

Technical maintenance of the Language Bank of Finland:
kielipankki (at) csc.fi
tel. 09 4572001

Matters related to resources and other content:
fin-clarin (at) helsinki.fi
tel. 029 4129317
