
D3.3.3: Machine learning-based enrichment of social media

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.3: Report on Machine learning-based enrichment of social media
Date of reporting: 22-09-2025

Report authors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)
Contributors: Erik Henriksson (University of Turku), Tuomas Lundberg (University of Turku), Veronika Laippala (University of Turku)

Deliverable location:

Keywords: machine learning; social media; web registers; register variation

Description

Web-crawled datasets have become invaluable resources for research in the social sciences and humanities (SSH), supporting fields such as corpus linguistics, digital humanities, and computational social science. However, publicly available web datasets like HPLT 2.0 and FineWeb provide only basic metadata about their contents, such as document URLs and crawl dates, which limits their research potential. Enriching these noisy collections with contextual metadata would greatly improve their value for SSH research.

In this deliverable, we focus on automatically identifying social media text varieties in web datasets, using machine learning. We publish the following resources:

We approach the web text classification problem using the framework of register variation (Egbert and Biber 2018; Biber and Conrad 2019), where “register” denotes a text variety associated with a particular situational context, such as News report or Recipe. We use the 25-class web register taxonomy developed by Skantsi and Laippala (2023) to label 3 million randomly selected documents from the HPLT 2.0 corpus (Burchell et al. 2025) in English, Finnish, and Swedish (1M samples each). This automatic labeling uses the multilingual BGE-M3 model (Chen et al. 2024), fine-tuned for register classification following Henriksson et al. (2024).
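As a minimal sketch of how multi-label register predictions could be decoded, the following assumes that the classifier emits one independent score (logit) per register and that a label is assigned whenever its sigmoid probability reaches a threshold. The shortened label list, the example logits, and the 0.5 threshold are illustrative assumptions, not values taken from the fine-tuned model.

```python
import math

# Hypothetical subset of register labels; the actual taxonomy has 25 classes.
REGISTERS = [
    "News report",
    "Recipe",
    "Narrative Blog",
    "Opinion Blog",
    "Interactive Discussion",
]


def sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_registers(logits, threshold=0.5):
    """Return every register whose sigmoid probability reaches the threshold.

    Zero labels, one label, or several labels (a "hybrid") are all possible.
    The threshold value is an assumption made for this sketch.
    """
    if len(logits) != len(REGISTERS):
        raise ValueError("expected one logit per register")
    return [
        name
        for name, logit in zip(REGISTERS, logits)
        if sigmoid(logit) >= threshold
    ]


# Made-up logits for a single document.
example_logits = [-3.2, 1.4, 2.1, -0.7, -2.5]
print(decode_registers(example_logits))  # ['Recipe', 'Narrative Blog']
```

Because each label is decided independently, a document can receive several registers at once, which is what makes hybrid labels possible.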

From this 3M-document sample, we then select a social media subset by choosing documents labeled with any of the following three registers: Narrative Blog, Opinion Blog, or Interactive Discussion. We also include so-called “hybrids” – documents assigned to more than one register label, such as Narrative Blog + Recipe. This process yields a dataset of approximately 113,000 English, 290,000 Finnish, and 335,000 Swedish social media documents, with Narrative Blogs being the most common category across all languages.
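The selection rule described above can be sketched as a simple filter: a document is kept if at least one of its predicted labels is a social media register, which also retains hybrids. The record layout (`id`, `registers`) and the example documents are hypothetical and only serve the illustration.

```python
# The three social media registers named in the text.
SOCIAL_MEDIA_REGISTERS = {"Narrative Blog", "Opinion Blog", "Interactive Discussion"}


def is_social_media(labels):
    """True if any predicted label is one of the social media registers."""
    return not SOCIAL_MEDIA_REGISTERS.isdisjoint(labels)


def select_social_media(documents):
    """Keep documents with at least one social media register, hybrids included."""
    return [doc for doc in documents if is_social_media(doc["registers"])]


# Hypothetical labeled documents; the field names are assumptions for this sketch.
documents = [
    {"id": "doc-1", "registers": ["Narrative Blog"]},
    {"id": "doc-2", "registers": ["News report"]},
    {"id": "doc-3", "registers": ["Narrative Blog", "Recipe"]},  # hybrid
    {"id": "doc-4", "registers": []},  # no register assigned
]

subset = select_social_media(documents)
print([doc["id"] for doc in subset])  # ['doc-1', 'doc-3']
```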

To further analyze the contents of the identified social media documents, we apply HDBSCAN clustering (McInnes et al. 2017) to their semantic vector representations, revealing meaningful thematic subgroups within some register categories. For instance, by applying keyword analysis to the clusters, we identify hand-crafting and cooking themes in hybrid documents labeled Narrative Blog + How-to/Instructional. We develop simple logistic regression classifiers trained on these thematic clusters, allowing SSH researchers to first categorize text by register, then select social media registers of interest, and finally identify specific thematic subgroups where applicable.

References

Biber, Douglas, and Susan Conrad. 2019. Register, Genre, and Style. Cambridge: Cambridge University Press.

Burchell, Laurie, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova et al. 2025. “An expanded massive multilingual dataset for high-performance language technologies.” arXiv e-prints: arXiv-2503.

Chen, Jianlv, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. “BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.” arXiv preprint arXiv:2402.03216.

Egbert, Jesse, and Douglas Biber. 2018. Register Variation Online. Cambridge: Cambridge University Press.

Henriksson, Erik, Amanda Myntti, Saara Hellström, Anni Eskelinen, Selcen Erten-Johansson, and Veronika Laippala. 2024. “Automatic register identification for the open web using multilingual deep learning.” arXiv preprint arXiv:2406.19892.

McInnes, Leland, John Healy, and Steve Astels. 2017. “hdbscan: Hierarchical density based clustering.” Journal of Open Source Software 2:11, 205.

Skantsi, Valtteri, and Veronika Laippala. 2023. “Analyzing the unrestricted web: The Finnish corpus of online registers.” Nordic Journal of Linguistics 48:1, 1-31.

 
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
