
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 3.3: Report on Machine-learning-based enrichment of textual and audio-visual social media contents
Date of reporting: 20-11-2025
Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable locations:
Keywords: video clip analysis; multimodal; MLLM; video summarization; data enrichment; Twitch
The proliferation of short-form video on livestreaming platforms like Twitch presents a significant challenge for multimodal content analysis. Each clip contains a vast amount of diverse information: the visual action, the auditory context from caster commentary, and the text-based reactions from the live chat, all representing dense and valuable data for understanding online communities. However, the sheer volume and complexity of this data create a need for efficient analysis tools. Our previous tools have focused on chat analysis and chat-content detection [1, 2].
This deliverable presents a continuation of the tool introduced in deliverable D4.1.1 for the automated understanding and enrichment of such clips. The tool is powered by state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family, guided by a multi-step Chain-of-Thought prompt. This prompt instructs the MLLM to focus on data enrichment, systematically analyzing the clip’s metadata, audio-visual content, and chat log, and to produce a structured JSON file.
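To illustrate the approach, the following is a minimal sketch of how such a multi-step Chain-of-Thought prompt might be assembled in Python. The step wording, function name, and metadata fields are illustrative assumptions, not the project’s actual prompt.

```python
# Sketch: assembling a multi-step Chain-of-Thought prompt for clip enrichment.
# The step texts and structure below are illustrative assumptions only.

ANALYSIS_STEPS = [
    "Step 1: Analyze the audio-visual content: identify key entities, "
    "log chronological actions, transcribe on-screen text, and break "
    "caster commentary into key quotes and emotional tones.",
    "Step 2: Analyze the chat log: describe audience reactions, flag "
    "community jargon, and build a glossary of its cultural meaning.",
    "Step 3: Synthesize: write a narrative summary of why the clip matters "
    "and link each audiovisual trigger to the chat reaction it caused.",
]


def build_prompt(clip_metadata: dict) -> str:
    """Combine clip metadata with the chained analysis steps."""
    header = (
        "You are analyzing a Twitch clip. Work through the steps below "
        f"in order and return a single JSON object.\nMetadata: {clip_metadata}"
    )
    return "\n\n".join([header, *ANALYSIS_STEPS])


prompt = build_prompt({"title": "Example clip", "category": "Example game"})
```

With a Gemini SDK, a prompt like this would then be sent to the model alongside the uploaded video file and chat log, with the model configured to return JSON; the exact invocation depends on the SDK and model version used.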
This structured JSON data is organized into three parts. The analysis begins with the audiovisual content of the video: it identifies the key entities involved, logs the chronological actions in the video, transcribes the on-screen text, and breaks caster commentary down into key quotes and emotional tones. Next, the “chat reaction” section describes how the audience reacted, identifies the jargon used by the community, and provides a glossary explaining its cultural meaning. Finally, the “causal synthesis” section connects these two modalities: it provides a narrative summary explaining why the clip matters and establishes direct causal links between audiovisual triggers and the exact chat reactions they caused.
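A minimal sketch of what this three-part JSON structure could look like, together with a simple validity check. The key names and example values are illustrative assumptions; the tool’s actual schema may differ.

```python
import json

# Illustrative example of the three-part output structure described above.
# All key names and values are assumptions, not the tool's exact schema.
example_output = json.loads("""
{
  "audiovisual_analysis": {
    "key_entities": ["caster", "player"],
    "chronological_actions": ["..."],
    "on_screen_text": ["..."],
    "commentary": [{"quote": "...", "emotional_tone": "excited"}]
  },
  "chat_reaction": {
    "audience_reaction": "...",
    "glossary": {"PogChamp": "emote expressing excitement or surprise"}
  },
  "causal_synthesis": {
    "narrative_summary": "...",
    "causal_links": [{"trigger": "...", "reaction": "..."}]
  }
}
""")

REQUIRED_SECTIONS = {"audiovisual_analysis", "chat_reaction", "causal_synthesis"}


def is_valid(analysis: dict) -> bool:
    """Check that all three top-level sections are present."""
    return REQUIRED_SECTIONS <= analysis.keys()
```

A validation step of this kind is useful in practice because MLLM output is not guaranteed to conform to the requested structure, so malformed responses can be caught before they enter the data viewer.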
All generated analyses are automatically saved and accessible within the video_descriptions category of the data viewer section.
Publications
[1] Jari Lindroos, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Ida Toivanen. “From PogChamps to Insights: Detecting Original Content in Twitch Chat.” In Hawaii International Conference on System Sciences, pp. 2542-2551. Hawaii International Conference on System Sciences, 2025. https://doi.org/10.24251/hicss.2025.308
[2] Jari Lindroos, Ida Toivanen, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Sami Äyrämö. “Participant profiling on Twitch based on chat activity and message content.” In International GamiFIN Conference, pp. 18-29. CEUR Workshop Proceedings, 2025. https://ceur-ws.org/Vol-4012/paper18.pdf
FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
