
Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 2024-01-01
Duration: 24 months
Report author: Jussi Piitulainen (UHEL)
WP 1.1: Report on Ingesting new unstructured resources
Date of reporting: 2024-11-28
Contributors: Jussi Piitulainen, Jyrki Niemi, Jack Rueter, Erik Axelson, Ute Dieckmann, Mietta Lennes, Tommi Jauhiainen (UHEL), Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: https://www.kielipankki.fi/corpora/
Keywords for the deliverable page: conversion; annotation; interoperability; VRT; UralicUD; Korp; Mink
The Language Bank of Finland receives and obtains text resources in different formats ranging from plain text documents to text enriched with complex annotations and document-level metadata. We aim to ensure that the material is made available to researchers in formats that are usable and interoperable. For text corpora, the Language Bank particularly supports and promotes VRT (VeRticalized Text) as an interchange format by developing, maintaining and utilizing the set of open-source VRT Tools for converting, enriching and ingesting resources containing text. All currently supported formats can be found via the Standards Information System of CLARIN.
The Suomi24 resource group was extended with the discussions from the years 2021-2023 (The Suomi24 Corpus 2021-2023, VRT version, and The Suomi24 Sentences Corpus 2021-2023, Korp version). Moreover, the complete resources The Suomi24 Sentences Corpus 2001-2023, Korp version, and The Suomi24 Corpus 2001-2023, VRT version, now include named-entity and identified-language annotations. The Ylenews resource group was also extended with material from the years 2022-2024, which was made available for download (Yle Finnish News Archive 2022-2024, source). The Korp version of this extension will be published soon.
The Language Bank contributes to the Universal Dependencies (UD) project in order to maintain validity and coverage of the treebanks not only for Finnish but also more generally for Finnic, Finno-Ugric and Uralic languages (Uralic UD). Samples of languages in these groups will also be included in the text resources licensed by the Institute for Bible Translation and in other multilingual text collections that are currently being processed for publication.
In addition to other corpora, the Language Bank participated in publishing several resources prepared by the Ancient Near Eastern Empires (ANEE) research group, including Oracc, Achemenet and Babylonian Administrative and Legal Texts (BALT), available via Korp with linkage from their corresponding lexical networks.
The Trankit toolbox (see Nguyen et al. 2021), a recommended replacement for the old dependency parsers by the Turku NLP group, was installed in the CSC Puhti environment. In our tests, Trankit proved robust for the kind of morpho-syntactic annotation of pre-segmented Finnish that we need for the existing KLK and Suomi24 corpora. Once adapted for the CWB-VRT format, Trankit will be used to re-annotate the existing corpora with the Universal Dependencies (UD2) features and dependency syntax. Trankit could also be adapted for the segmentation of paragraphs into sentences and tokens, and it adds support for many languages other than Finnish.
The Mink platform, developed by Språkbanken Text in Sweden, was test-installed by the Language Bank. Mink allows users to process their own text corpora and to access the results via a private Korp instance. After the new version of the Korp platform is officially published at the Language Bank, it will be possible to make Mink available for wider use by the community. Support for user authentication in Mink is to be added in 2026.
The Language Bank participates in the recently launched CLARIN PressMint project that aims to compile a multilingual, comparable, annotated, translated and interoperable set of corpora of European historical newspapers by using a common TEI format. For PressMint, we will transform the out-of-copyright data from our existing KLK corpora (newspapers and magazines from the National Library) from the CWB-VRT format to the appropriate TEI format.
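As a rough illustration of the kind of transformation involved, the following minimal Python sketch turns a small VRT fragment into TEI-like `s` and `w` elements. It is not the actual conversion used for PressMint: the sample data, the function name `vrt_to_tei`, the assumption that the first two columns hold word form and lemma, the `sentence` tag name, and the choice of `w` for every token (including punctuation) are all assumptions made for this example, and the real target TEI schema is not specified here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape


def vrt_to_tei(vrt_text):
    """Convert a VRT fragment into a TEI-like <body> element (illustrative only)."""
    body = ET.Element("body")
    sentence = None
    for line in vrt_text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("<sentence"):
            # Assumption: sentences are marked with <sentence> ... </sentence>.
            sentence = ET.SubElement(body, "s")
        elif line.startswith("</sentence"):
            sentence = None
        elif line.startswith("<"):
            # Other structural tags and comments are ignored in this sketch.
            continue
        elif sentence is not None:
            # Token line: tab-separated positional attributes.
            # Assumption: column 1 is the word form, column 2 (if any) the lemma,
            # and &lt; &gt; &amp; are entity-encoded in the token fields.
            fields = [unescape(field) for field in line.split("\t")]
            word = ET.SubElement(sentence, "w")
            word.text = fields[0]
            if len(fields) > 1:
                word.set("lemma", fields[1])
    return body


# Hypothetical sample input, made up for this example.
SAMPLE = "\n".join([
    '<text title="Example">',
    "<sentence>",
    "Kissa\tkissa",
    "nukkuu\tnukkua",
    ".\t.",
    "</sentence>",
    "</text>",
])

if __name__ == "__main__":
    print(ET.tostring(vrt_to_tei(SAMPLE), encoding="unicode"))
```

A production conversion would additionally need to carry over document-level metadata and the remaining positional attributes, and to follow whatever element and attribute conventions the project's TEI schema prescribes.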
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant number 358720.
Last modified on 2025-11-27
