
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 4.1: Report on Analysis of video stream interactions with AI solutions
Date of reporting: 22-09-2025
Report authors: Jari Lindroos (JYU), Raine Koskimaa (JYU)
Contributors: Jari Lindroos (University of Jyväskylä), Raine Koskimaa (JYU), Ida Toivanen (JYU), Tanja Välisalo (NAF), Jaakko Peltonen (TAU)
Deliverable location: https://collector-twitcher.2.rahtiapp.fi/Video_clip_summary
Keywords: video clip analysis; multimodal; MLLM; video summarization; Twitch
The proliferation of short-form video on livestreaming platforms such as Twitch presents a significant challenge for multimodal content analysis. Each clip carries several kinds of information at once: the visual action of the gameplay, the auditory context of the caster commentary, and the text-based reactions in the live chat. Together these form a dense and valuable source of information for understanding online communities and digital entertainment. However, the sheer volume and complexity of this data create a need for efficient analysis tools. Our previous tools have focused on chat analysis or chat content detection [1, 2], which do not seem to cover the diverse nature of Twitch content thoroughly enough. The primary challenge lies in the multimodal nature of the data: Twitch clips feature a wide range of dynamic scenes, dense on-screen information, and a complex interaction between the visual gameplay, the audio commentary, and a massive chat audience. Truly understanding a Twitch clip requires not only perceiving the events within each modality but also synthesizing their interplay. This leaves a clear research gap for tools that can comprehensively understand and summarize the information in these complex multimedia clips.
This deliverable presents a tool for the automated understanding and summarization of such clips. The tool uses state-of-the-art Multimodal Large Language Models (MLLMs) from the Google Gemini family. From the video and chat input, it generates a chronological summary of the key audio-visual events, a thematic analysis of chat reactions, and an overall summary. The generation is guided by a structured Chain-of-Thought-based prompt.
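To illustrate what a structured, step-by-step prompt of this kind could look like, the following is a minimal Python sketch. It is not the prompt used by the tool, which this report does not reproduce. The names ChatMessage, STEPS, format_chat, and build_prompt, the wording of the steps, and the chat log layout are all assumptions made for illustration. The sketch only assembles the prompt text; the call that sends the video and prompt to a Gemini model is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ChatMessage:
    """One chat line. Field names are illustrative assumptions."""
    offset_s: float  # seconds from the start of the clip
    user: str
    text: str


# Hypothetical wording: the three outputs named in the report, phrased as
# ordered reasoning steps so that later steps can build on earlier ones.
STEPS = [
    "Step 1 (chronological summary): list the key audio-visual events "
    "in the order they occur.",
    "Step 2 (chat analysis): group the chat reactions into themes and "
    "relate them to the events from Step 1.",
    "Step 3 (overall summary): combine Steps 1 and 2 into a short "
    "summary of the clip.",
]


def format_chat(messages):
    """Render chat messages as time-ordered lines, one message per line."""
    ordered = sorted(messages, key=lambda m: m.offset_s)
    return "\n".join(f"[{m.offset_s:.1f}s] {m.user}: {m.text}" for m in ordered)


def build_prompt(messages):
    """Assemble the instruction text that would accompany the video clip."""
    parts = [
        "You are analysing a Twitch clip. The video is attached and the "
        "chat log follows.",
        "Work through the steps in order and answer under one heading per step.",
        *STEPS,
        "Chat log:",
        format_chat(messages) or "(no chat messages)",
    ]
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [
        ChatMessage(12.0, "viewer_b", "PogChamp"),
        ChatMessage(3.5, "viewer_a", "here we go"),
    ]
    print(build_prompt(demo))
```

Ordering the steps this way lets the chat analysis refer back to the events identified in the first step, which mirrors the need to synthesize the interplay between modalities rather than summarize each in isolation.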
[1] Jari Lindroos, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Ida Toivanen. “From PogChamps to Insights: Detecting Original Content in Twitch Chat.” In Hawaii International Conference on System Sciences, pp. 2542-2551. Hawaii International Conference on System Sciences, 2025. https://doi.org/10.24251/hicss.2025.308
[2] Jari Lindroos, Ida Toivanen, Jaakko Peltonen, Tanja Välisalo, Raine Koskimaa, and Sami Äyrämö. “Participant profiling on Twitch based on chat activity and message content.” In International GamiFIN Conference, pp. 18-29. CEUR Workshop Proceedings, 2025. https://ceur-ws.org/Vol-4012/paper18.pdf
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
