
Project: FIN-CLARIAH
Grant agreement: Research Council of Finland no. 367751
Start date: 01-01-2026
Duration: 24 months
WP 3.1: Report on upgrading the base data storage
Date of reporting: 29-05-2026
Report authors: Anni Järvenpää (CSC)
Contributors: Sam Hardwick, Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester
The newspapers and journals dataset from the National Library of Finland has been made available on Roihu, CSC’s next-generation national supercomputer. At the time of writing, Roihu, and hence the dataset, is accessible only to pilot users; Roihu will be opened to all users at the end of the pilot phase in June 2026. The current version of the dataset remains available on Puhti until its end of life (EOL), but future monthly version updates will be performed on Roihu. The documentation has been updated accordingly. Other versions of the dataset can be accessed in Allas as before.
Compared to Puhti, Roihu has significantly greater processing power in terms of both CPU and GPU performance, and the bandwidth of its main data storage is ten times higher. Together, these factors make Roihu much better suited to using the dataset in computationally intensive tasks such as training foundation models.
The data retention policy for this work package has also been defined: previous versions will be retained for three years. Deletion of old versions according to the new policy will be implemented in deliverable 3.1.2; currently, no versions are old enough to warrant deletion.
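The retention rule can be illustrated with a short sketch. It is not the pipeline's implementation, which is planned for deliverable 3.1.2. It assumes that each dataset version is identified by its release date and that "older than three years" means strictly before the same calendar day three years earlier. The function name `expired_versions` and its parameters are hypothetical.

```python
from datetime import date


def expired_versions(version_dates, today, retention_years=3):
    """Return, oldest first, the version dates that fall outside the retention period."""
    try:
        cutoff = today.replace(year=today.year - retention_years)
    except ValueError:
        # today is 29 February and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - retention_years, day=28)
    # Assumption: a version is expired only if it is strictly older than the cutoff.
    return sorted(d for d in version_dates if d < cutoff)
```

Called with the reporting date of this document, `expired_versions([date(2026, 3, 1), date(2026, 2, 1)], date(2026, 5, 29))` returns an empty list, in line with the observation that no versions are yet old enough to be deleted.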
Other minor improvements to the data pipeline and the dataset itself were also introduced. Most notably for end users, a bug in deleting files associated with bindings that have been removed from the source data was addressed. The bug prevented the April and May 2026 updates, but now that a workaround is in place, the next update should complete normally.
Future deliverables will further improve the processes for updating and accessing the dataset. For example, the dataset will be moved from the general scratch storage space on Roihu to a separate dataset project. Detailed information about data projects has not been published yet, but they are intended for exactly this kind of longer-term availability and sharing of datasets. In relation to the file deletion bug described above, consistent use of ZIP64 format extensions will be ensured in the ZIP files that make up the dataset, so that the local and central directory file headers are always in sync. Usability of the dataset via LUMI AI Factory is also an improvement target for deliverable 3.1.2.
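One way to check the ZIP64 consistency mentioned above is sketched below. This is an illustration, not the check used in the pipeline. It assumes that "in sync" means that the ZIP64 extended information extra field (header ID 0x0001) is either present in both the local file header and the central directory entry of a member, or absent from both. The function names `has_zip64_extra` and `find_zip64_mismatches` are hypothetical.

```python
import struct
import zipfile

ZIP64_EXTRA_ID = 0x0001  # header ID of the ZIP64 extended information field
LOCAL_HEADER = struct.Struct("<4s5H3I2H")  # fixed 30-byte part of a local file header


def has_zip64_extra(extra: bytes) -> bool:
    """Return True if an extra-field block contains a ZIP64 record."""
    pos = 0
    while pos + 4 <= len(extra):
        field_id, size = struct.unpack_from("<HH", extra, pos)
        if field_id == ZIP64_EXTRA_ID:
            return True
        pos += 4 + size  # skip the 4-byte record header and its payload
    return False


def find_zip64_mismatches(stream) -> list[str]:
    """List members whose local and central headers disagree on ZIP64 use.

    `stream` is a seekable binary file object positioned anywhere in a ZIP file.
    """
    # ZipInfo.extra holds the extra field as stored in the central directory.
    with zipfile.ZipFile(stream) as archive:
        members = archive.infolist()

    mismatches = []
    for info in members:
        # Read the local file header directly from the underlying stream.
        stream.seek(info.header_offset)
        fields = LOCAL_HEADER.unpack(stream.read(LOCAL_HEADER.size))
        if fields[0] != b"PK\x03\x04":
            raise zipfile.BadZipFile(f"bad local header for {info.filename}")
        name_len, extra_len = fields[-2], fields[-1]
        stream.seek(name_len, 1)  # skip the file name
        local_extra = stream.read(extra_len)
        if has_zip64_extra(local_extra) != has_zip64_extra(info.extra):
            mismatches.append(info.filename)
    return mismatches
```

A file opened with `open(path, "rb")` can be passed as `stream`. An empty result means that every member uses ZIP64 extensions consistently under the assumption stated above. The sketch compares only the presence of the field and does not compare the sizes recorded in it.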
The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Research Council of Finland under grant number 367751.
