
D4.1.3: Advanced analytic social media tools and data

Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Advanced analytic social media tools and data
Date of reporting: 26-11-2025

Report author: Mikko Laitinen (UEF)
Contributors: Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location:

Keywords: social media corpora; social network tools; ego networks; gender

Description

Our work has resulted in four massive social media corpora built from one social media application. The purpose is to enable research access to large-scale, curated social media data, as such access is often a bottleneck in SSH (Laitinen & Rautionaho 2025). The four datasets are named Digital Social Network Corpora (DSN), as they consist not only of user-generated texts but also of detailed information about people’s social networks. They cover four geographic areas: Australia (DSN Ozzie), the Nordic countries (DSN Nordic), the United Kingdom (DSN British), and the United States (DSN America).

In total, they include 19,345 ego networks, each consisting of a central node (ego), its directly connected neighbors (alters), and the connections between the alters. These networks were filtered using a semi-automated method to target what we call genuine human accounts; that is, we aimed to exclude accounts with unusual network properties, such as bots, celebrities, politicians, organizations, and businesses. Under the current paid data access policies of the social media application (X), recreating a dataset comparable to the DSN corpora would cost over 3 million euros and take around 58 years.
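The ego network structure described above (ego, alters, and the connections between alters) can be illustrated with a minimal sketch. It is not the DSN pipeline: the edge list, the account names, and the helper functions `ego_network` and `density` are hypothetical, and connections are treated as undirected for simplicity, whereas follower relations on the platform may be directed.

```python
from itertools import combinations


def ego_network(edges, ego):
    """Return (alters, alter_edges) for `ego` from an undirected edge list.

    Hypothetical helper for illustration only; assumes undirected ties.
    """
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Alters are the nodes directly connected to the ego.
    alters = neighbors.get(ego, set())

    # Keep only the connections that link two alters to each other.
    alter_edges = {
        frozenset((u, v))
        for u, v in combinations(sorted(alters), 2)
        if v in neighbors[u]
    }
    return alters, alter_edges


def density(alters, alter_edges):
    """Share of possible alter-alter ties that are actually present."""
    n = len(alters)
    possible = n * (n - 1) // 2
    return len(alter_edges) / possible if possible else 0.0


# Toy data (invented): "d" is connected to "c" but not to the ego,
# so it falls outside the ego network.
edges = [("ego", "a"), ("ego", "b"), ("ego", "c"), ("a", "b"), ("c", "d")]
alters, alter_edges = ego_network(edges, "ego")
print(sorted(alters), len(alter_edges), density(alters, alter_edges))
```

A measure such as the density of ties among alters is one example of a network quality that could, in principle, help distinguish ordinary accounts from unusual ones; the actual filtering criteria used for the DSN corpora are not specified here.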

The resulting datasets are extremely large but contain carefully curated social networks with user-generated textual material. The network datasets contain material from 829,608 users, and the data range from 2006 to 2023. Altogether, they contain more than 700 million messages and nearly 10 billion words keyed in by users.

With their detailed structure, massive size, and coverage over 17 years, the DSN corpora support new research and enable re-examining old questions in the humanities. A case in point is the role of weak ties in the spread of innovations, where prior empirical evidence in sociolinguistics comes from ethnographic observations based on very small networks. One clear limitation of ethnographic network investigations is that participant observation methods are limited to networks of 30–50 individuals. The networks in the DSN corpora are substantially larger and close to average human networks in general, making it possible to investigate a variety of networks of different sizes and structures.

Publications:
Laitinen, Mikko & Paula Rautionaho. 2025. Reuse of social media data in corpus linguistics. International Journal of Corpus Linguistics. doi: 10.1075/ijcl.24136.lai

Fatemi, Masoud & Mikko Laitinen. 2025. From tweets to networks: Introducing four large network-based social media corpora. CLARIN Annual Conference Proceedings, 2025. Ed. by Cristina Crisot and Thalassia Kontino. Vienna, Austria. pp. 100–104. (https://www.clarin.eu/sites/default/files/CLARIN2025_ConferenceProceedings.pdf)

Events:
CLARIN Annual Conference 2025, Vienna, 30 September – 2 October 2025 (https://www.clarin.eu/event/2025/clarin-annual-conference-2025)
 
 
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.
