
D3.2.3: Ingestion of multimodal societal data from the Web

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.2: Report on Ingestion of multimodal societal data from the Web
Date of reporting: 20-11-2025

Report authors: Matti Nelimarkka (University of Helsinki), Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Matti Nelimarkka (University of Helsinki), Denis Davydov (University of Helsinki), Anita Braida (University of Helsinki), Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)

Deliverable locations:

Keywords for the deliverable page: Twitch, YouTube, chat data, video data

Description

This deliverable focuses on infrastructures for the acquisition of multimodal societal data harvested from the web. The task includes the implementation and maintenance of data collection tools for the most popular Finnish discussion forums, YouTube, and Twitch. The deliverable has two parts: Part A was conducted by the Centre for Social Data Science, University of Helsinki, and Part B by the University of Jyväskylä.

PART A: FINNISH DISCUSSION FORUMS

To ensure that researchers have access to data beyond global platforms (where data collection is a shared global concern), the University of Helsinki builds and maintains forum scrapers that extract user-generated content from sites including vauva.fi and kaksplus.fi, as well as comments on yle.fi and hs.fi. The scrapers are used through a command line interface, which writes the collected content to a CSV file for further analysis. We also provided modifications to the 4CAT platform (https://4cat.nl/) to ensure that it operates correctly with the Finnish language.
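As an illustration of the "further analysis" step, the following is a minimal sketch of how a CSV export could be summarised in Python. It does not reflect the actual output schema of the scrapers: the column names `source` and `text`, the function name `count_posts_per_source`, and the sample rows are all hypothetical and should be replaced with whatever the real export contains.

```python
import csv
import io
from collections import Counter


def count_posts_per_source(csv_file, source_column="source"):
    """Count rows per value of `source_column` in a CSV export.

    The column name is a hypothetical placeholder; check the header of the
    file that the scraper actually produces.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or source_column not in reader.fieldnames:
        raise ValueError(f"column {source_column!r} not found in CSV header")
    return Counter(row[source_column] for row in reader)


# Made-up sample rows standing in for a real export.
sample = io.StringIO(
    "source,text\n"
    "vauva.fi,first post\n"
    "vauva.fi,second post\n"
    "hs.fi,a comment\n"
)
counts = count_posts_per_source(sample)
print(counts.most_common())
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`, assuming the export is UTF-8 encoded.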

PART B: YOUTUBE CHAT COLLECTOR & TWITCH VIDEO COLLECTOR

The team from the University of Jyväskylä presents a continuation of the deliverable for the Twitcher data collector tool. New features include the option to collect chat data from YouTube, from either live or past broadcasts. The collected YouTube chat data can be viewed in the data viewer section and is automatically saved in CSC Allas. We also implemented the option to collect videos from past Twitch broadcasts, in connection with the video clip analysis tool presented in D3.3.4 and D4.1.1.

 

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
