D4.1.2: Analysis Tools for Multimodal Born-digital Social Media

Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on analysis tools for multimodal born-digital social media: Nordic Tweet Stream (NTS)
Date of reporting: 18-12-2024

Report author: Mikko Laitinen (UEF)
Contributors: Paula Rautionaho (UEF), Masoud Fatemi (UEF), Mehrdad Salimi (UEF)
Deliverable location: https://nordictweetstream.fi/


The Nordic Tweet Stream (NTS) is a monitor corpus of geolocated tweets and associated metadata from the Nordic region, covering more than ten years from 2013 to 2023. It is accessible through a graphical interface that allows users to search, subset, visualize, and download extremely large-scale user-generated data from one social media application.

The objective of this digital interface is to enable easy access to and distribution of born-digital data for basic research. We have recently witnessed the closing down of free access to various digital sources because of the APIcalypse (Bruns 2019) and feel that, despite restrictive measures by social media giants, it is extremely important to store cultural heritage from social media. We operate according to the FAIR Data Principles, which aim to make data findable, accessible, interoperable, and reusable (Wilkinson et al. 2016).

The NTS provides data spanning from January 2013 to May 2023, encompassing over 900 million tokens from more than 73 million messages, generated by nearly 900,000 individuals. The dataset includes content in 73 languages. The most frequent languages are Swedish (c. 31%), English (c. 26%) and Finnish (c. 13%). Detailed information on the material can be found on the Statistics pages of the interface.

The NTS dataset is intended for use by researchers across various disciplines, including sociolinguistics, dialectology, social sciences, and cultural studies. It can serve as both primary data and supplementary material alongside structured corpus data. This interface is designed for users seeking quick access to the data. Advanced users, however, may prefer to utilize the download function to retrieve the data for further processing in other environments.


Laitinen, M., Lundberg, J., Levin, M., & Martins, R. M. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data. In DHN 2018 Digital Humanities in the Nordic Countries 3rd Conference: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference Helsinki, Finland, pp. 349–362. https://erepo.uef.fi/handle/123456789/6697


NTS presented in the following event:


  • Bruns, Axel. 2019. After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society, 22(11), 1544–1566, doi: 10.1080/1369118X.2019.1637447
  • Wilkinson, M. D. et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3, 160018. doi:10.1038/sdata.2016.18

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.

D4.1.6: Enrich survey data with register data and unstructured text

Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 4.1: Report on Enrich survey data with register data and unstructured text
Date of reporting: 12-12-2024

Report authors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Contributors: Adeline Clarke (University of Helsinki), Maria Valaste (University of Helsinki)
Deliverable location: https://cran.r-project.org/web/packages/finnsurveytext/index.html


The finnsurveytext R package has been developed to aid researchers in analyzing responses to open-ended survey questions and other structured text data. This user-friendly tool facilitates reproducible analysis of text data by providing features such as summarizing response properties, identifying frequent words and phrases, visualizing responses, and generating concept network plots. The second version of the package, released in August 2024, integrates with the widely used R package survey, allowing survey design to be incorporated into the analysis. Although originally designed for analyzing text in Finnish, the package is versatile and can also be used for text analysis in other languages.
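As a rough illustration of what identifying frequent words and phrases involves, the following Python sketch counts the most common words and word pairs in a handful of responses. It does not use finnsurveytext or reproduce its API: the function names `tokenize` and `top_ngrams` and the sample responses are hypothetical, and survey weights, lemmatisation and stopword removal are left out.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a response and split it into word tokens (letters only)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_ngrams(responses, n=1, k=3):
    """Return the k most frequent n-grams across all responses."""
    counts = Counter()
    for response in responses:
        tokens = tokenize(response)
        # Slide a window of length n over the tokens of a single response,
        # so that n-grams never span two different responses.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Hypothetical open-ended survey responses (Finnish).
responses = [
    "Ilmastonmuutos huolestuttaa minua",
    "Ilmastonmuutos huolestuttaa ja työttömyys pelottaa",
    "Työttömyys huolestuttaa",
]

print(top_ngrams(responses, n=1, k=3))  # most frequent words
print(top_ngrams(responses, n=2, k=3))  # most frequent word pairs
```

Counting within each response separately keeps phrases from being formed across the boundary between two respondents' answers.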

The finnsurveytext R package was released on CRAN with two updates, and additional material is available on the website. An article on the package has been written; it is available on Zenodo and is under review for the new DARIAH publication.

The results of the work package were presented at two events: an invited lecture at the Workshop on Survey Statistics 2024, held in Poznan, Poland, from 26 to 30 August, and a presentation at the Statistics Sweden and Örebro University Summer School 2024 on 28 August.


Language Data Space and ALT-EDIC in Finland

What is the European Language Data Space (LDS)?

The EU is in the process of creating an internal market for all types of data. The aim is to ensure that data can be shared from one stakeholder to another within the region, in accordance with the EU legislation. Data sharing requires interactive networks – data spaces – that can connect data providers and users, and offer a platform for them to communicate, make contracts and trade with each other.

All the upcoming European data spaces will be developed in line with the European Data Strategy. There are development plans for data spaces for approximately 15 different strategic fields. According to the vision, data spaces will allow for the commercialisation and more efficient re-use of data. This will benefit not only commercial stakeholders in the EU, but also EU citizens by providing them with better digital services, for example. In addition, researchers could gain access to new types of data and materials, which could boost basic research and increase opportunities for product development and innovation.

The European Language Data Space (shortened: LDS) is an ecosystem for the sharing and commercialisation of language data, such as text and speech data, and for the development of large language models and language-centric Artificial Intelligence. The Language Data Space is being developed and coordinated by the LDS Consortium, which was established in early 2023 with the support of the European Commission. The first phase of the LDS will last three years and during this period, the technical and legal framework for the operation of the common language data platform will be established in cooperation with the various stakeholders.

The work on the language data space will also be driven forward by ALT-EDIC, the language technology alliance of EU member states established in early 2024. In particular, ALT-EDIC aims to ensure the development of EU-based large language models.

The Language Data Space will be built partly on top of existing networks and language technology infrastructures. Sitra’s publication Snapshot of Finnish data spaces (2024) summarises well the current situation in Finland with regard to language technologies and the Language Data Space (in Finnish).

LDS Workshop in Finland and elsewhere in Europe

In spring 2024, the LDS Consortium launched a series of country-specific workshops to share information about the possibilities of the common Language Data Space, and to reach as many stakeholders in each member country as possible. The workshops are organised in collaboration with local institutions. In April 2024, Finland had the honour of being the first EU member state to host an LDS workshop. The event was organised locally by the University of Helsinki. More information on workshops in other EU countries and upcoming LDS events can be found on the Language Data Space website.

The Finnish LDS workshop gave organisations and companies in Finland an opportunity to exchange ideas on the possibilities and challenges of a common platform and marketplace for language models and data. Philippe Gelin from the European Commission and Georg Rehm from DFKI in Germany gave remote presentations. In the panel discussions (see photos below), partners from the LAREINA project, coordinated by the University of Helsinki, shared their views on the importance of language data and on challenges related to the availability and technical quality of data and to copyright constraints. Without access to electronic data of sufficient quality and scope, it is difficult to develop language models for speakers of small and medium-sized languages.

After the LDS workshop, Finland initiated the membership process to join ALT-EDIC as an observer member. After summer 2024, the full membership of Finland in ALT-EDIC was confirmed for the next three years. The administrative representative of Finland in ALT-EDIC is the Ministry of Transport and Communications, with whom the University of Helsinki aims to maintain an active dialogue.

LDS invites businesses and other stakeholders to join the user group

The Language Data Space invites European stakeholders to join the LDS User Group. The group includes commercial stakeholders from different sectors as well as representatives of public administration and research. News from the remote meeting of the LDS User Group in November 2024 is available online. To join the LDS User Group, fill in the form on the LDS website. Language data providers and users, as well as language model developers, are particularly welcome to join the group.

At the end of 2024, the Language Data Space is entering its pilot phase, in which the Language Bank of Finland is also actively involved. The aim is to test the pilot version of the LDS platform in Finland and to collect user feedback. The Language Bank of Finland is also planning to organise a workshop in spring 2025 on the Language Data Space, ALT-EDIC and copyright issues. We will announce the event on our website and through the LAREINA project.

All photos in the article: Jyrki Niemi / University of Helsinki




Finnish News Agency Archive (1992-): License of full text versions will be terminated on 21.2.2025

According to a notice from the rightholder, the end-user license of the full-text versions of the Finnish News Agency Archive will be terminated on 21 February 2025. If you were granted the right to use the full-text versions via the Language Bank of Finland, you must stop using the resources in question and remove them from your devices by that deadline (see the license link above). Users who have access rights to the full-text versions were also notified by email on 21 November 2024.

Please note that the termination of the license only affects the full-text versions of the resource! You may continue using those versions of the Finnish News Agency Archive that only show restricted contexts (e.g., the Korp versions of the archive in the Language Bank) or where the order of the sentences has been scrambled.

This page outlines the project deliverables for 2026-2029 (see template and instructions for reporting).

FIN-CLARIAH Funding period 2026-2029

Each WP has a leader (L:) and one or more participants from the consortium partners (P:) and collaborators (C:). The WP leader and participants contribute to the work in the WP. Collaborators are test users providing feedback, evaluation and beta testing of the deliverables.

Module 1: Natural Language Processing (NLP)

The module handles the basic language processing when a new resource is licensed from the rights holder, integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms. These processes need to be upgraded in view of recent developments in transformer technology, LLMs and AI. (L:UHEL/ARTS Krister Lindén)

W1.1 Text processing and annotation environments

To streamline and consolidate the text annotation in the RI components. (L:UHEL/ARTS Jussi Piitulainen; P:CSC; C:UEF, UTU, AALTO)

D1.1.1 Support common CLARIN formats like TEI. (CSC/Martin Matthiesen) 2026-12
D1.1.2 Convert VRT to TEI and showcase the result in a compatible web interface like the KorAP platform used in German CLARIN. (CSC/Martin Matthiesen) 2027-07
D1.1.3 Apply new technologies such as LLMs for ingesting accruing data sets and improving annotation of existing data sets. (UHEL/ARTS/Jussi Piitulainen) 2028-04
D1.1.4 Develop metadata interoperability of FIN-CLARIAH resources for other infrastructures like ALT-EDIC (UHEL/ARTS/Jussi Piitulainen) 2029-10

W1.2 Speech processing and annotation

To provide automated speech recognition with an emphasis on recognizing, classifying and annotating everyday speech and dialects. (L:CSC Sam Hardwick; P:UHEL/ARTS; C:AALTO, Kotus, OU, UTU, UEF, UHEL/SOC, UHEL/NLF)

D1.2.1 Updated backend of existing ASRs (CSC/Sam Hardwick) 2026-10
D1.2.2 A pipeline for the automated collection, processing, transcription and annotation (e.g. diarization and demographic annotation) of multimodal social media data. (OU/Steven Coats) 2027-08
D1.2.3 Support for additional future models and make the processing pipeline transparent for easy evaluation of suitability for data with elevated security requirements (CSC/Sam Hardwick) 2028-06
D1.2.4 Expansion and upgrade of Oulu Clarin-D centre to C or B status; provision of access to additional language resources sourced from multimedia social media content. (OU/Steven Coats) 2029-11

W1.3 Video processing and annotation

To simplify researcher use, management, annotation and sharing of collections of video recordings. (L:UHEL/ARTS Mietta Lennes; P:CSC; C:JYU, OU)

D1.3.1 Develop licensing and protection schemes for sharing sign language data (UHEL/ARTS/Mietta Lennes) 2026-06
D1.3.2 Data handling model for the entry and removal of large amounts of video data for research (CSC/Sam Hardwick) 2027-08
D1.3.3 Inventory and installation of tools for automated annotation of video and sign language data with LLM technologies (UHEL/ARTS/Mietta Lennes) 2028-09
D1.3.4 Inventory and installation of tools for accessing video and sign language data (UHEL/ARTS/Mietta Lennes) 2029-10

Module 2: Language Research Infrastructure (LRI)

This module takes care of the specialised language processing needs in the fields of language-based research. (L:UHEL/ARTS Krister Lindén)

W2.1 Processing Research Data

To share language resources and tools for datasets containing personal or copyrighted data. (L:CSC Martin Matthiesen; P:UHEL/ARTS; C:UHEL/SOC, UTU)

D2.1.1 Document the current options for using other processing environments, such as the supercomputers provided by CSC, and their fitness for purpose. (CSC/Martin Matthiesen) 2026-05
D2.1.2 Propose a proof of concept to address issues found in D2.1.1. (CSC/Martin Matthiesen) 2027-09
D2.1.3 Pilot a processing pipeline with a real research use case, e.g. KAVI audio data. (CSC/Martin Matthiesen) 2028-06
D2.1.4 Protected processing and sharing of matriculation essays for research. (UHEL/ARTS/Mietta Lennes) 2029-11

W2.2 Training environments

To provide interactive online training environments for humanities scholars for creating specialised processing modules from LLMs. (L:UHEL/ARTS Erik Axelson; P:CSC; C:AALTO, JYU, UTU, OU, Kotus)

D2.2.1 Training environment for DH scholars applying LLMs to annotation of text resources (UHEL/ARTS Erik Axelson) 2026-12
D2.2.2 Training environment for DH scholars applying LLMs to annotation of audio resources (UHEL/ARTS Erik Axelson) 2027-12
D2.2.3 Training environment for DH scholars applying LLMs to annotation of video resources (UHEL/ARTS Erik Axelson) 2028-06
D2.2.4 Training environment for DH scholars applying LLMs to annotation of multimodal resources (UHEL/ARTS Erik Axelson) 2029-08

W2.3 Translation and Interpretation

To provide infrastructure for translation and interpretation research on fact checking and verification of LLM output. (L:UHEL/ARTS Tommi Jauhiainen; P:CSC; C:UTA, UEF)

D2.3.1 Develop policies for processing and sharing translation memories (UHEL/ARTS Tommi Jauhiainen) 2026-05
D2.3.2 Install pipeline for automated cleaning and transcription of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) 2027-06
D2.3.3 Provide access to transcriptions of multilingual audio and video data (UHEL/ARTS Tommi Jauhiainen) 2028-08
D2.3.4 A pipeline for the automated collection, processing, transcription and annotation of multilingual media (UHEL/ARTS Tommi Jauhiainen) 2029-10

W2.4 Terminology

To provide infrastructure for the terminology work in the Helsinki Term Bank for the Arts and Sciences (HTB) and related terminology development projects. (L:UHEL/ARTS Tiina Onikki; C:UVAASA)

D2.4.1 Initiate and develop terminology groups on biology, microbiology, ecology, evolutionary biology, biotechnology, and genetics. 2026-09
D2.4.2 Initiate and develop terminology groups on geography, social geography, and environmental sciences. 2027-12
D2.4.3 Initiate and develop terminology groups on social policy, economics, and political science. 2028-05
D2.4.4 Initiate and develop terminology groups on sociology, psychology, social psychology, and educational sciences. 2029-11

Module 3: Structuring Data

This module standardises efforts in data capture and provides resources and incentives for collaboration by processing unstructured text and metadata with different areas of Digital Humanities (DH) as use cases. (L:UHEL/ARTS Mikko Tolonen)

W3.1 Data Management

To significantly upgrade the data management, versioning and workflow automation capabilities that underlie the whole infrastructure for data ingestion. (L:CSC Anni Järvenpää; P:UHEL/ARTS; C:UHEL/NLF, UHEL/SOC, NAF, OU, JYU)

D3.1.1 Upgrading the base data storage, access and processing infrastructure to handle the large volumes of multimodal data needed to both train and use foundational models 2026-05
D3.1.2 Upgrading the data workflow automation and versioning capabilities to handle the large volumes of multimodal data needed to both train and use foundational models 2027-09
D3.1.3 Second upgrade of the base data infrastructure to account for the rapidly changing systems and requirements 2028-04
D3.1.4 Second upgrade of the workflow and versioning to account for the rapidly changing systems and requirements 2029-10

W3.2 Data Ingestion

To improve the RI by connecting it to accruing data sources. (L:UHEL/NLF Johanna Lilja; P:Aalto, OU, JYU, UHEL/ARTS; C:CSC)

D3.2.1 Ingestion of visual cultural heritage. Validation of the API solution and further development of the interoperability between Finna and the FIN-CLARIAH infrastructure. (NLF/FINNA/Riitta Peltonen) 2026-11
D3.2.2 Ingestion of new types of data. More comprehensive engagement of the cultural heritage organisations that provide new types of data, and facilitation of dialogue between them and researchers. (NLF/FINNA/Riitta Peltonen) 2027-06
D3.2.3 Ingestion of in-copyright publications/web archive. Building a research environment for legal deposit material. (NLF/Aija Vahtola) 2028-12
D3.2.4 Ingestion of in-copyright publications/web archive. Piloting the research environment for legal deposit material with researchers. (NLF/Aija Vahtola) 2029-11

W3.3 Enrichment

To enable the systematic and detailed analysis of noisy datasets in different formats and thereby provide unseen possibilities for SSH research. (All the deliverables set to 2029 also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UTU Veronika Laippala; P:UEF, JYU, OU, UHEL/ARTS, UHEL/SOC, Aalto; C:UHEL/NLF)

D3.3.1 Statistical methods for denoising and enrichment of structured cultural heritage data (UTU/Leo Lahti) 2029-11
D3.3.2 Neuro-symbolic tools based on Generative AI and LLMs for enriching metadata (Aalto/Annastiina Ahola) 2027-11
D3.3.3 Using foundational models to deeply enrich and sample from massive but noisy, multilingual web data (UTU/Veronika Laippala) 2029-11
D3.3.4 Multimodal modelling for deep enrichment of archival documents (JYU/Antero Holmila) 2029-11
D3.3.5 Multimodal modelling for the deep enrichment of livestream data (JYU/Raine Koskimaa) 2029-11

Module 4: Analyzing Structured Data

The module will develop the technical services needed to support data-intensive SSH research on the various types of raw data. (L:UHEL/ARTS Mikko Tolonen)

W4.1 Analytical Support for computational SSH

To enable researchers to utilise large born-digital data effectively and to focus on analysis rather than on the technical details of handling data that is often high in volume and velocity. (All the deliverables also have sub-deliverables. However, for presentation clarity, only the overall development strand names and final deliverables are shown.) (L:UEF Mikko Laitinen; P:JYU, OU, UHEL/SOC; C:UHEL/NLF)

D4.1.1 Analytical and conceptual tools for multimodal cultural heritage analysis. (OU/Ilkka Lähteenmäki) 2029-11
D4.1.2 Develop a national digital ecosystem (“Nordic Digital Observatory”) for effective use of large-scale social media data in fundamental research (UEF/Mikko Laitinen) 2029-11
D4.1.3 Analysis tools for Social Science data from multiple data sources (UHEL/SOC/Maria Valaste) 2029-11
D4.1.4 Analysis tools for multimodal livestream data (JYU/Raine Koskimaa) 2029-11

Module 5: Information Interaction (IIA)

Interaction refers to the need 1) to collect information on how researchers interact with the RI in order to develop the tools and services accordingly, and 2) to offer education and consultation on how researchers can enhance their work by using the infrastructure, thus increasing the RI’s active user base. (L:TAU Sanna Kumpulainen)

W5.1 Evidence-Based Infrastructure Development

To provide a close dialogue with the user community to ensure the best possible development of the RI. (L:TAU Sanna Kumpulainen; P:UHEL/ARTS; C:UHEL/NLF, UTU, CSC, UHEL/SOC, AALTO, JYU, UEF, OU)

D5.1.1 Community engagement: Researchers using LLMs as research tools. (TAU/Sanna Kumpulainen) 2026-06
D5.1.2 Educational resources for infrastructure tools and data. (TAU/Sanna Kumpulainen) 2027-11
D5.1.3 Community engagement: User interaction with multimodal data. (TAU/Sanna Kumpulainen) 2028-06
D5.1.4 Evidence-based infrastructure development: User experience and the feedback instrument. (TAU/Sanna Kumpulainen) 2029-11

D2.1.1: Integrate environment for personal data

Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.1: Report on Integrate environment for personal data
Date of reporting: 30-09-2024

Report authors: Mietta Lennes (UH)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/support/sd-services/

Keywords for the deliverable page: sensitive data; confidential data; secure desktop; SD services


If a research dataset contains special categories of personal data or other types of confidential information that cannot be removed without hampering the research purpose, it may be necessary to use a secure environment for processing the data (cf. Deliverable 2.1.2 of the previous funding period of FIN-CLARIAH 2022-2023).

CSC – IT Center for Science provides Sensitive Data services for sharing and analyzing data securely from a web browser. The sensitive data files can be encrypted and uploaded via SD Connect, where they are available to the secure desktop instances of the members of the same project. The virtual machines for the secure desktops are configured and accessed via SD Desktop.

It is also possible to install and use special tools in the SD Desktop environment. Researchers who need to process audio and video material securely can now conveniently install tools such as ELAN (video and audio) or Praat (audio) for viewing, editing, annotating, querying and analyzing their data, or well-known command-line tools such as Whisper (automatic speech recognition), as part of their workflow in the secure environment. For faster access to audio and video files, an external volume can be selected when configuring the virtual machine.

We will continue testing, documenting and improving the functionalities of the SD Desktop with the users of the Language Bank. We are also looking into the possibility of the Language Bank using SD Desktop instances for providing individual users with restricted access to specific sensitive datasets. The SD services are still under active development and the remaining issues can be addressed in collaboration with the experts at CSC.

For researchers in the SSH fields, the step-by-step instructions for using the Sensitive Data services are now maintained on a support page in the online portal of the Language Bank of Finland.



D1.2.1: Data collection for minority languages

Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 1.2: Data collection for minority languages
Date of reporting: 26-09-2024

Report authors: Martin Matthiesen (CSC)
Contributors: Wilhelmina Dyster (UH), Sjur Moshagen, Katri Hiovain-Asikainen (UiT)
Deliverable location: n/a

Keywords for the deliverable page: Finland-Swedish, Sámi


In this work package, data is collected for minority languages: Swedish spoken in Finland and the Sámi languages spoken in Norway, Sweden and Finland.

Data collected during the Donera Prat campaign[1] is currently being transcribed manually. This work is expected to be ready by November 2024. The planned release date of the data for research is January 2025.

The data collection for the Sámi languages focuses on the University of Tromsø and on the broadcasting companies of the Nordic countries where the languages are spoken (NRK[2], SVT[3], YLE[4]). The national broadcasters already have some of their Sámi data subtitled in a Sámi language and in their respective national languages, which makes it a valuable resource for research.

We reached a general understanding that the Language Bank of Finland can serve as the main sharing organisation for Sámi data, and we have already made test transfers of data from SVT and Tromsø. YLE’s Sámi data is available via KAVI[5]. Before the data can be shared via the Language Bank of Finland, we need to overcome technical and legal hurdles. On the technical side, we have already reached broad agreement: for example, we will share the data from the various sources with few or no changes, and KAVI and Aalto University already have experience of collaborating on the LUMI supercomputer. The legal side seems to be the bigger challenge. NRK, SVT and YLE are currently investigating the legal implications of sharing their data via the Language Bank of Finland.

[1] Donera Prat https://svenska.yle.fi/a/7-10009203

[2] Norwegian Television: https://www.nrk.no/about/

[3] Swedish Television: https://omoss.svt.se/about-svt.html

[4] Finnish Television: https://yle.fi/aihe/about-yle

[5] The Finnish National Audio Visual Institute, https://kavi.fi/en/

D3.1.1: Comprehensive data versioning

Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 3.1: Report on Comprehensive data versioning
Date of reporting: 25-09-2024

Report authors: Martin Matthiesen (CSC)
Contributors: Erik Axelson, Eetu Mäkelä, Ville Vaara (UH), Sam Hardwick, Anni Järvenpää (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Keywords for the deliverable page: versioning, updates, differences


The versioning mechanism has been tested with new data from the National Library. We discovered that we will likely need to change how data is packaged into zip files in order to avoid unnecessary growth of the versions stored in Allas.

Interviews were conducted with two potential users of the data, Erik Axelson and Ville Vaara (both UH). Both interviews are summarized below.

Using the dataset as a potential source for newer versions of the KLK dataset in Kielipankki (Erik Axelson)

In 2024, FIN-CLARIN published a new version of “The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT”[1], klk-fi-v2-1874-vrt for short. This version was created using data obtained directly from the National Library, since our harvesting mechanism was not yet ready at the start of the project. The NLF source data was extracted, tokenized, syntactically annotated and converted to the VRT format[3]. A list of the included publications was compiled[4], as were end user notes documenting inconsistencies found after publication[5]. FIN-CLARIN has well-established processes for obtaining new copies from the National Library, and these copies are in a different internal format than the data provided in this work package[2]. However, the differences are small, and the data is well suited as a basis for the next iteration. Since a new version of klk-fi-v2-1874-vrt is not planned during this project, we will demonstrate the changes needed with a proof of concept.

Using the dataset as a basis for an Elasticsearch instance containing NLF data (Ville Vaara)

Another use case for the data is the Elasticsearch-based tool developed in WP4.3 of the previous FIN-CLARIAH development round[6]. In that use case, the NLF data is converted to JSON suitable as input for an Elasticsearch engine. When newer versions were considered, it became clear that an easy way of finding the differences between versions would be a reasonable addition to the present implementation. The dataset is presently 10 TB in size. Comparing two datasets of that size (the present version and an earlier one) is something that should be done once, during the update, and provided to users as a service, which would make index updates easier.

Next steps

Moving forward, we need to investigate the unnecessary growth of the versions and add functionality that makes incremental updates of derived datasets (as in the Elasticsearch case mentioned above) easier by providing the differences between versions in a machine-readable form. In deliverable 3.1.2, we will demonstrate the changes with working code.
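
As a rough illustration of what machine-readable differences between two versions could look like, the following Python sketch compares two manifests that map file paths to checksums and reports added, removed and changed files as JSON. The manifest layout, the function names and the sample paths are assumptions made for this example only; they do not describe the actual implementation of the harvester or of the planned deliverable.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map every file under `root` (as a relative path) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Compare two {path: checksum} manifests and list the differences."""
    old_paths, new_paths = set(old), set(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
        "changed": sorted(
            path for path in old_paths & new_paths if old[path] != new[path]
        ),
    }


# Hypothetical manifests for two versions; the paths and checksums are
# placeholders, not real NLF data.
old_version = {
    "1771/issue-001.xml": "aaa",
    "1771/issue-002.xml": "bbb",
    "1772/issue-001.xml": "ccc",
}
new_version = {
    "1771/issue-001.xml": "aaa",
    "1771/issue-002.xml": "bbb-corrected",
    "1773/issue-001.xml": "ddd",
}

delta = diff_manifests(old_version, new_version)
print(json.dumps(delta, indent=2))
```

With a delta of this kind computed once per update, a derived dataset such as a search index would only need to reprocess the listed files instead of comparing the full versions itself.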


[1] National Library of Finland. The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT [data set]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2024060401

[2] See the Harvester documentation for details.

[3] Introduction to VRT: http://urn.fi/urn:nbn:fi:lb-2023020121

[4] List of publications: http://urn.fi/urn:nbn:fi:lb-2023092801

[5] End user notes: http://urn.fi/urn:nbn:fi:lb-2023101001

[6] See Deliverable 4.3.2 of FIN-CLARIAH 2022-2023. The current implementation can be found here: https://dariahfi-es.2.rahtiapp.fi (access available upon request)

Change in metadata platform

The Language Bank of Finland maintains metadata records of all the resources it distributes. Each individual resource version has its own metadata record with a persistent identifier.

For providing the metadata records, the Language Bank has been using a platform called META-SHARE, but the system is no longer supported. All our currently existing metadata records have been moved to COMEDI, a service hosted by a Norwegian CLARIN centre, CLARINO Bergen. The persistent identifiers of the metadata records curated by the Language Bank of Finland now point to the corresponding records on the COMEDI system.

Please note that, although the metadata records now look a bit different, the content and location of the actual language resources remain unchanged.

Mylly will be discontinued on 17th June 2024

Due to very low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down when CSC’s cloud services move to the new version of Rahti during summer 2024. Mylly will be available until 17th June 2024. Due to the short notice, we will keep the users’ data for three months after the shutdown.

If you wish to download your data, you can do so yourself by 17th June, or by contacting the CSC service desk within three months of that date.

In case you wish to utilise the tool scripts from Mylly on other services (e.g., Puhti or CSC Notebooks), the software will still be available on GitHub.

Introducing: LAREINA project (funded by Business Finland)

An article presenting the LAREINA – Language Resource Infrastructure for AI (2023–25) project has been published on the website of the University of Helsinki. The LAREINA project is funded by Business Finland and implemented by Aalto University and the University of Helsinki as part of Tietoevry’s Veturi programme. The project involves companies and public sector organisations as partners.

The LAREINA project develops speech recognition and speech synthesis for Finnish, Finland-Swedish and the Sámi languages. The project partners will test the components in different tasks and in areas such as call centres and machine translation. The LAREINA project aims to ensure that high-quality speech interfaces and speech-based AI services are also available for speakers of small languages.

The outputs of the LAREINA project will be published under an open licence that also allows commercial use, and they will also be available through the Language Bank of Finland (Kielipankki).

Read more about the LAREINA project on the University of Helsinki website: ”Speech-based AI services needed for small languages as well – researchers support companies in product development” (Published on 11.04.2024)

Visit the LAREINA project webpage: https://www.kielipankki.fi/business/lareina/

Transnational Access Grants: Calls for Applications

CLARIN is a consortium partner in the Advancing FronTier Research In the Arts and hUManities (ATRIUM) project.

The ATRIUM project invites researchers to apply to participate in Transnational Access training visits to support their research. ATRIUM’s Transnational Access (TNA) scheme offers researchers the possibility to apply for a fully funded placement at several different partner organisations to access expert knowledge and advice from leading Data Management organisations across Europe.

The TNA scheme aims to recruit and support approximately 200 Arts and Humanities researchers with mentorship and access to knowledge, data and tools from 14 different institutions across Europe. Researchers who are successful in their applications will be supported to visit the infrastructure providers in our consortium in person, benefiting from direct contact, knowledge sharing and network building. In total, 388 weeks of Transnational Access will be provided during the ATRIUM project.

There are two types of TNA applications:

  • Individual Access – These are individual applications based on a specific research topic proposed by the applicant that match the specialisms of the host organisation.
  • Summer School Access – These are fixed events during the year that provide access for a group of researchers based on a set of predetermined specialised topics.

The first collection date is 31 May, 2024, and applicants will be notified by 28 June, 2024. Calls for applications will be issued several times per year throughout the duration of the project (March 2024 to December 2028).

Individual Access applications will be offered on a rolling basis with a deadline every three months. Summer Schools will be offered 1 to 2 times a year with a fixed deadline 3 to 4 months ahead of the scheduled event.

Visit www.atrium-research.eu for more information.

The Lahjoita puhetta campaign has ended

A big thank you to all donors!

Yle, the University of Helsinki and the state development company Vake (later Ilmastorahasto Oy) jointly ran the Lahjoita puhetta campaign for collecting Finnish-language speech, which had been running since 16 June 2020. More than 3,000 hours of speech donations were collected during the first year. In recent years and months, however, donations have come in only sporadically. From 2021 onwards, Finland-Swedish was also collected through the smaller Donera prat campaign.

Both collection campaigns have now been closed. The data will be organised and stored in the Language Bank of Finland (Kielipankki), through which researchers and companies can obtain speech data for their use under certain conditions. We hope that the data will help researchers and companies to create better models of Finnish speech and to develop future services that work smoothly in Finnish.



