Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months
WP 2.2: Report on Transformer training for specialised data
Date of reporting: 09-06-2025
Report author: Erik Axelson (University of Helsinki)
Contributors: Ghent Center for Digital Humanities [1] & Language and Translation Technology Team (LT3) [2] (Ghent University); Sam Hardwick, Katri Tegel (CSC)
Deliverable location: N/A
In this work package, we aim to create a self-study course implemented as Jupyter Notebooks. Its purpose is to teach participants how to build a language model from scratch in the CSC computing environment, using one or more existing resources of the Language Bank of Finland but not limited to them. To this end, we have tested two resources using CSC’s Noppe [3] service. One is an external resource developed within the CLS – Computational Literary Studies Project (2020-2025) [4]. The other is CSC’s Aitta [5] inference service, for which CSC also offers the course “Aitta – LLM Inference” in Noppe.
The CLSInfra repository [6] hosts the work done within CLS on Natural Language Processing pipelines for the DH community. The pipelines are demonstrated with Jupyter Notebooks, which we have tested in CSC’s Noppe service. We have reported the problems we encountered to the CLSInfra team, who have fixed the issues we have reported so far. We will continue to go through the Notebooks, with the aim of running all of them in Noppe. Later, we can adapt them, for example, for Finnish or for minority languages such as Sami languages, other Finno-Ugric languages, or Finland Swedish.
CSC’s “Aitta – LLM Inference” course uses large language models available in the Aitta inference service. We have tested creating access keys for the language models in Aitta, used them successfully in Noppe, and run the course exercises. Aitta already offers some models, and planned features include the ability for users to upload models and create embeddings themselves. These features will later make it possible to use our own materials.
We plan to have our own course environment ready at the beginning of fall 2025.
[1] Ghent Center for Digital Humanities: https://www.ghentcdh.ugent.be/
[2] Language and Translation Technology Team (LT3): https://lt3.ugent.be/
[3] Noppe: https://noppe.2.rahtiapp.fi/
[4] Computational Literary Studies Project (2020-2025): https://clsinfra.io/
[5] Aitta: https://staging-aitta.2.rahtiapp.fi/public
[6] The CLSInfra repository: https://github.com/GhentCDH/CLSinfra
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 358720.