
FIN-CLARIAH D4.3.3: Representative Twitter dataset(s) of user-generated texts and metadata

Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.3: Report on Representative Twitter dataset(s) of user-generated texts and metadata
Date of reporting: 25-11-2023

Report author: Mikko Laitinen (University of Eastern Finland)
Contributors: Masoud Fatemi, Mehrdad Salimi, Paula Rautionaho (all from the University of Eastern Finland)
Deliverable location: https://nts-csc.rahtiapp.fi/ (currently open to researchers; authentication will be added in spring 2024)


The WP’s main objective was to develop a representative dataset of Twitter data from the five Nordic countries. The underlying idea is that social media applications offer a promising and extremely large source of data for a range of disciplines in the social sciences and humanities (SSH), but research is often hindered by a lack of technical expertise in collecting, pre-processing and analysing very large datasets. During the funding period, we expanded the data collection substantially when it became clear that the future of this data collection route was increasingly uncertain. All the materials were collected while the platform's academic application programming interface (API) was still open. Later, when the company changed its name to X, the API was closed down. In hindsight, the decision to store large amounts of material from various geographic settings proved wise, because this subproject has now saved 12.5 years of material for future research.

The project activities so far have consisted of two parts:

  1. Collecting data: Masoud Fatemi has been in charge of the data collection. Our dataset initially focused on the Nordic region, but we decided to expand it considerably when it became clear that the API would be closed after the change in ownership of the platform. In addition to our original data, we expanded the collection to cover social network information from the Nordic region, the United States, the United Kingdom, and Australia, the last together with partners from the Australian Digital Observatory at the Queensland University of Technology and the University of Queensland. Basic information on the datasets is shown in Table 1 below. They range in size from nearly 800 million words to nearly 4 billion words in the US and the Australian networks. The datasets cover slightly different time frames between 2006 and May 2023. A substantial part of the NTS data will be shared via the Language Bank of Finland during 2023–24.

Table 1. Social media datasets collected in 2022–2023 in this subproject

  2. An easy-to-use graphic interface: The second part consisted of designing an easy-to-use graphic interface for accessing the material and for carrying out basic analyses and visualizations of the NTS data. The interface is currently in the piloting phase and can be accessed at https://nts-csc.rahtiapp.fi/. It currently contains only part of the data, but we aim to add all the NTS data to the CSC by spring 2025. Mehrdad Salimi was hired for this task in June 2022, and his contract runs until May 2024, by which time the interface will be fully functional.

This WP has reached its objectives and succeeded in creating a national niche within the Finnish DH sphere. We have a good team that combines expertise from sociolinguistics and computer science, and we are able to develop digital tools for a range of audiences.

For 2024–2025, we aim to continue the work by adding a graphic interface for accessing network information and for combining this network information with textual searches.
