FIN-CLARIAH Roadmap

Name of research infrastructure (RI): FIN-CLARIAH
Name of international RI: CLARIN ERIC (National Member), DARIAH ERIC (Cooperating Partner / Aalto & UHEL)
Stage of RI life cycle: operation
Applicant organisation (head organisation): University of Helsinki (Faculty of Arts/ARTS, Faculty of Social Sciences/SOC, National Library of Finland/NLF) (UHEL)
Other participating research organisations (consortium parties): CSC – IT Centre for Science Ltd., Aalto and Tampere (TAU) Universities, Universities of Eastern Finland (UEF), Jyväskylä (JYU), Turku (UTU) and Vaasa (UVA), Institute for the Languages of Finland (Kotus), and the National Archives of Finland (NARC)
Name of RI director: Krister Lindén, UHEL
Name of RI vice director: Mikko Tolonen, UHEL

Summary

FIN-CLARIAH is an RI for Social Sciences and Humanities (SSH) that comprises two components, FIN-CLARIN and DARIAH-FI. Taking as its core the well-established best practices developed in the language resource and language research-based FIN-CLARIN, FIN-CLARIAH seeks to significantly broaden the scope of infrastructural support in two major new directions: first, to reach beyond language materials into other kinds of materials in the form of structured and multimodal big data; and second, to cater to a broader range of SSH fields. This will be organised so that FIN-CLARIN continues to break new ground in supporting research based on language data, while DARIAH-FI engages researchers from a broad range of disciplines to develop infrastructure for big, heterogeneous SSH data. Beyond collaborating at the boundaries where their missions overlap, both components will share facilities for the management and negotiation of material rights, for technical access, as well as for hosting documentation, tools and services.

By applying for joint roadmap status as FIN-CLARIAH, we aim to significantly improve access to RI and utilisation of resources across the SSH disciplines. FIN-CLARIN is a mature infrastructure with a high-quality service model through its online service centre Kielipankki – The Language Bank of Finland (www.kielipankki.fi). Our vision is that this part of the infrastructure keeps integrating common tools and resources for processing language and language-related data. At the same time, national surveys and evidence from research projects have shown a need for infrastructure also supporting other types of data and questions. FIN-CLARIN’s focus on centrally provided resources for language-based research will be complemented by DARIAH-FI’s wider discipline base and bottom-up approach to data and service creation. Our vision is that the two RI components are complementary and will provide a workflow in which research groups from anywhere in Finland can get peer support and resources for applying digital methods regardless of their SSH field. And, once their research advances, they will be able to communicate their derivative data and new tools, some of which will end up in the Language Bank, a national distribution channel for harmonised data and benchmarked tools.

Figure 1. National and/or international operations of the RI. CLARIN ERIC (Common Language Resource and Technology Infrastructure) and DARIAH ERIC (Digital Research Infrastructure for the Arts and Humanities)

FIN-CLARIAH can be described on three levels according to Fig. 1, in which it is the common RI for its two components, which function as the national nodes of the corresponding international RIs. Of these, Finland is already a member of CLARIN ERIC. State membership in DARIAH ERIC can only be achieved by inclusion on the national roadmap. Hence, FIN-CLARIAH offers the Language Bank (Kielipankki) as a national RI service centre for the SSH field based on resources and services in cooperation with its national RI members and collaborators.

FIN-CLARIAH provides potential for world-class research and scientific breakthroughs via its support to Centres of Excellence as well as several other projects funded by the Academy of Finland (AoF). The infrastructure also offers services to various ERC-funded projects and Horizon 2020 projects. With a new service level, we will ensure that the renewal of academic excellence in the whole SSH community in Finland is supported by access to data, tools and knowledge about digital research. The research groups and users currently supported by the FIN-CLARIAH facilities, as well as researcher training and teaching activities, are detailed in Section 1.

FIN-CLARIAH is of broad national interest and enhances its international impact through its national consortium consisting of all the relevant universities as well as CSC, Kotus, NLF and NARC, and through the international CLARIN ERIC and DARIAH ERIC consortia, which help the national consortia focus on their local resources while reaching international coverage through sharing and cooperation. FIN-CLARIAH has a long-term plan for scientific goals, maintenance, financing and utilisation as detailed in Sections 2 and 3.

FIN-CLARIAH provides access to resources that are too extensive for individual research groups to manage on their own, e.g., the NLF collection of 200 years of newspapers and periodicals, the several decades of Parliamentary records, the written and spoken news collections of the national broadcasting company YLE, and the born-digital and handwritten document collections of NARC. In cooperation with the affiliated research groups, FIN-CLARIAH acquires access to and supports relevant new cutting-edge resources and technology, such as neural-network-based tools for video, speech, and text processing, as outlined in Section 4.

FIN-CLARIAH offers SSH data, methods, services and platforms that are openly and easily accessible to researchers, industry and other actors: it facilitates privileged access to its closed resources for academic researchers and provides access to its open resources for industry researchers and citizen scientists as well. Our long-term aim is also to develop a network of digital SSH research hubs at all Finnish universities. This means that there will be an access point to knowledge about digital research on all SSH campuses in Finland, ensuring that researchers with particular needs will find the right contacts, data and tools even if these do not exist at their own university. FIN-CLARIAH has a plan for access to and preservation of collected data and materials as well as a risk management plan outlined in Sections 5 and 6.

1 Scientific and educational relevance of the RI

FIN-CLARIAH develops infrastructure services for SSH. CLARIN, which offers language-based data and tools, is a highly relevant RI for SSH, where nearly 80 % of all data is unstructured text.[1] DARIAH’s strength lies in bringing together individual expertise and developments in state-of-the-art, digitally-enabled research across the SSH field, and scaling these activities across Europe. CLARIN and DARIAH are close collaborators focusing on user involvement and knowledge sharing to promote the uptake of tools and resources.

1.1 Science

The primary goal of FIN-CLARIAH is to collect, provide access to and facilitate collaboration on tools and databases consisting of millions of documents, as well as research data produced by SSH researchers in Finland, to be used throughout CLARIN and DARIAH for research and education. The collections span different periods, genres and regions as well as different modalities such as text, audio, pictures and video. The relevance of the RI lies in the amount and diversity of the material as well as in the seamless access for researchers and students through web services and application programming interfaces. Persistently available collections enable SSH scholars to reach a level of repeatability and replicability of research results similar to that common in the natural sciences. Many claims based on intuition can be supported or rejected by broader and more objective evidence. This improves the quality of research and makes the research-based teaching of hundreds of researchers, teachers and advanced students more rewarding.

Going forward, FIN-CLARIAH focuses on three SSH infrastructure goals:

  1. Access to data, including
    • digitised, printed and handwritten textual and numeric data in large quantities;
    • born-digital data, including user-generated social media data;
    • maps, images, audio, video, sensory capture, register data and combinations of data types;
    • various metadata that accompany and/or describe other primary data.
  2. Availability of tools such as
    • machine learning techniques on large-scale datasets to gain insights not possible through traditional quantitative and qualitative research approaches;
    • high-powered computing, which enables immense datasets to be mined and modelled efficiently;
    • open research software ecosystems and reproducible research tools.
  3. Open collaboration and international best practices
    • ensuring data quality, completeness, and efficient data documentation and sharing according to the FAIR principles;
    • ensuring long-term sustainability of data and re-use of data-sets;
    • supporting researchers and students of data-driven SSH disciplines in questions of ethics and best practice in data management;
    • applying and developing new co-creative methods of collaboration;
    • advancing robust mechanisms and practices that encourage and enable researchers to document and publish their tools and methodological solutions in a re-usable and interoperable way.

These goals are primarily promoted through SSH research projects, for which FIN-CLARIAH is the key RI which coordinates and facilitates the dissemination of resulting data, tools and best practices. The new service level will enable research communities to steer the course of the development of digital research with respect to their particular interests of knowledge. The interests of research in different areas of SSH fields are vastly different from each other. Digital history, for example, has particular needs with respect to data and tools access, development and sharing that are currently not met at the infrastructural level. Within FIN-CLARIAH, we can assist the SSH research communities in building new common services needed by the whole community. This is a lengthy process for which targeted infrastructural support is needed.

2013–2018                        Total   JUFO 3   JUFO 2   JUFO 1
Linguistics                      3 307      265    1 110    1 932
Humanities (excl. Linguistics)   9 732    1 015    3 286    5 431
Social sciences                 21 716    2 414    6 402   12 900
Total                           34 755    3 694   10 798   20 263
Table 1: Number of peer reviewed publications from 2013 to 2018 according to the merit of publication channel with JUFO 3 (highest) to JUFO 1 (basic level). (retrieved from juuli.fi[2] on April 19, 2020)

The FIN-CLARIAH organisations published a total of 34 755 peer-reviewed publications in the SSH field during 2013–2018 (i.e. almost 30% of the publication volume in all fields of science), an annual average of approximately 5 800. For an overview of the publication channels according to merit in the relevant fields, see Table 1. In these fields, the use of digitised sources and methods is increasingly important. FIN-CLARIAH is needed to maintain or increase this research output.

FIN-CLARIN: The FIN-CLARIN consortium includes universities with active teaching and research in linguistics or language technology. The resources, i.e. corpora and tools, are provided by the FIN-CLARIN RI for research in these fields; most of them are multi-purpose resources which are also used in other SSH fields as well as in computer science for machine learning and AI development. It can be argued that practically all of the recent research in the field of Linguistics and Language Technology in the FIN-CLARIN member universities in Finland has benefited from the FIN-CLARIN infrastructure directly or indirectly.

FIN-CLARIN offers persistent identifiers (PIDs, provided by NLF) for the resources that its members provide as in-kind contributions. These resources are integrated with the collections at the Language Bank, where they can be searched by content and annotations. Many of the resources have been scientifically documented in journal papers or technical reports by the depositors before they were deposited. In practice, references to such resources are often made to these scientific publications and not to FIN-CLARIN or the Language Bank. This can be compared with the general practice in the field of referring to books and articles by their authors and not by the library from which they are borrowed. The appreciation for resource PIDs is slowly growing for practical purposes, in tandem with the practice of referring to scientific publications by their DOI. However, the practice of citing the resource PID may take some time to become established as long as no explicit scientific merit is gained by publishing resources and the scientific practice is to refer to an original publication documenting the resource.

According to Google Scholar, publications by FIN-CLARIN resource depositors were cited more than 14 000 times during 2013–2018, and the publications of the Language Bank staff were cited 1 327 times. As pointed out above, the SSH infrastructure is rarely mentioned as a source from which the resources are available, but the names of specific resources available through the Language Bank are mentioned frequently: “OPUS” 2 040 times, “Suomi24” 1 050, “Tieteen termipankki” 872, “ORACC” 452, and “TOPLING” 210 times. These mention counts are of course only indicative, as they include international mentions of FIN-CLARIN resources but lack mentions in non-public repositories. For concrete examples of how the resources have been used for research purposes, see the Researcher of the Month archive at https://www.kielipankki.fi/language-bank/researcher-of-the-month-archive/

The most widely applied results using tools and datasets developed by or deposited in the Language Bank have been achieved by annotating contemporary and historical text with base forms and named entities, which are the basic units for content search and analysis in all SSH fields. Other important achievements are large-scale annotations of morphological, syntactic and/or discourse structure for linguistic research in the national languages as well as Sámi and other Finno-Ugric languages. In addition, general-purpose technologies for language identification, speech recognition, machine translation and language understanding of Finnish and Finno-Ugric languages have been developed using the Language Bank resources. In particular, we would like to emphasise three successes: in speech recognition, Aalto University took first place in the Multi-Genre Broadcast Challenge in 2017; in syntactic parsing, the University of Turku took second place in the CoNLL 2018 shared task on multilingual parsing; and in language identification, software developed by the Language Bank staff at the University of Helsinki took first place in the VarDial Language and Dialect Identification challenges in 2018.

As datasets are openly available, the field in general has moved in an evidence-based direction where research results need to be based on openly accessible resources and to be credibly reproducible in order for the research to be accepted. Another general trend is towards big data collected centrally allowing specialized resources to be extracted from the big datasets and complemented or contrasted with additional data collected by the researchers themselves for special purposes. In particular, we would like to draw attention to the following scientifically significant publications and breakthroughs for the development of the field:

  1. Smit, P., Virpioja, S., & Kurimo, M. (2017, August). Improved Subword Modeling for WFST-Based Speech Recognition. In INTERSPEECH (pp. 2551-2555).
  2. Kanerva, J., Ginter, F., Miekka, N., Leino, A., & Salakoski, T. (2018). Turku neural parser pipeline: An end-to-end system for the CoNLL 2018 shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics.
  3. Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2018). Automatic Language Identification in Texts: A Survey. arXiv preprint arXiv:1804.08186. Final publication in 2019, in Journal of Artificial Intelligence Research, 65, 675-782.

Currently SSH researchers have centralized access to more than 200 large datasets (https://www.kielipankki.fi/corpora/) from various SSH fields in the Language Bank web services (https://www.kielipankki.fi/tools/). SSH researchers can benefit from the tools available in the Language Bank to carry out collaborative work using, e.g., Korp for content search, Mylly for interactive processing, and Download for data dumps, while avoiding significant amounts of work with regard to cleaning data and setting up local services.

DARIAH-FI: Scholarly practices in SSH disciplines are transforming: research groups increasingly collect or generate their own data and create tools for data management, processing, and analysis to address the complex questions posed by a growing amount of diverse data. The advancement of Digital Humanities (DH) and Social Data Science (SDS) in Finland requires the construction of a national RI that is researcher-driven and specifically adapted to the needs of the Finnish SSH research community, and that removes the onus of infrastructure development from individual research teams. DARIAH-FI will fill this gap by pooling project-based tools and services into an open, virtual service cluster. The platform generates new multidisciplinary opportunities for quantitative as well as qualitative analysis of multimodal data. Close cooperation with DARIAH-EU will help ensure the long-term sustainability of services, improve the efficiency of research by taking advantage of the substantial synergies in the development and use of digital resources, and significantly improve the international collaboration and scientific breakthrough potential of the Finnish SSH research community.

The RI component is in the initial phase of its development. The new service will focus on research areas currently lacking support for the use of digital data and tools. Crucially, DARIAH-FI expands beyond textual data, enabling researchers to take a multimodal approach to large datasets and to use different data types ranging from textual data to register materials, maps, images, sound and even sensory data. We are witnessing crucial shifts in research practice, which will pose challenges not only for database management but also for statistical and computational analysis.

A good example of this is digital history, which as an area of research has gathered enough momentum in Finland over the past decade that data and tools developed in different research groups can now also be shared at the national level. Another illustrative example is computational sociolinguistics, which is opening up new ground for research on sociolinguistic variability. It draws on computational techniques to understand language variation and change and its social embedding. Studies in this field utilise AI tools and advanced quantitative methods to analyse large and complex datasets, many of which have been harvested from born-digital sources.

To serve such needs, DARIAH-FI aims to create national peer-to-peer workflows and protocols connected to a platform for creating, enriching and sharing large, digitized datasets. The vision for DARIAH-FI is to enable a similar process for other digital SSH research as well, especially in the SSH fields not yet fully served by FIN-CLARIN and the Language Bank. Data collection and processing tools require significant investments in time and resources, and new methods of generating, transferring, combining and annotating data create massive and complex datasets that require powerful statistical methods and visual data exploration. Currently, this work is mostly carried out by individual research teams, resulting in a) insufficient resources for infrastructure building, b) lack of sustainable solutions and therefore c) duplication of work between researchers. DARIAH-FI will deploy, benchmark and further develop tools and services on top of CSC’s national infrastructure. It will facilitate access to heterogeneous data sets and computational methods, and help researchers avoid overlapping work related to data cleaning and analysis by taking advantage of synergies between research teams and existing data analytics tools, for instance regarding geospatial information, language processing, or standard machine learning tasks. Research groups will contribute new and targeted analysis methods, which can be readily accessed and used through the CSC environment without the need for local installations. By building generic, sustainable and interoperable open access workflows and tools, and by connecting Finnish universities to leading international infrastructure networks, DARIAH-FI can significantly speed up the research process and ensure that resources spent on SSH infrastructure are used as efficiently as possible.

1.2 Landscape

In November 2015, Finland joined CLARIN ERIC, which provides the technical framework and the norms and practices enabling the use of language-based resources for a large number of international researchers in academia in the EU. This signalled a commitment to develop and maintain CLARIN ERIC and FIN-CLARIN as its national node until further notice. UHEL and Aalto have been DARIAH-EU Cooperating Partners since 2017; the establishment of DARIAH-FI is a necessary next step towards Finland joining DARIAH-EU as a country member.

FIN-CLARIN is currently on the national RI roadmap of Finland and serves as Finland’s national node in CLARIN ERIC, which is one of the landmark RIs of the European Strategy Forum on RIs (ESFRI). DARIAH-FI aims to become the national node of the ESFRI landmark RI DARIAH ERIC, a network to enhance and support digitally enabled research and teaching across the Arts and Humanities.

Links to the national research and infrastructure community: The Steering Group of FIN-CLARIAH has representatives from the original FIN-CLARIN consortium members (UHEL, CSC, Kotus, Aalto, TAU, UEF, JYU, UTU, UVA and OU), augmented with representatives from NLF and NARC, and thus represents the whole SSH field in Finland. With their approx. 6 000 staff and 40 000 students, the members of FIN-CLARIAH provide a large Finnish user base for the national and international resources.

The members of the original FIN-CLARIN consortium agreed as early as 2007 to deposit their language-based resources in the Language Bank as their common repository, which also facilitates access to restricted resources for researchers in Finland and the EU. In 2018, FIN-CLARIN concluded a cooperation agreement with the RIs in Finland benefitting the Social Sciences, i.e. CESSDA (Council of European Social Science Data Archives) and its national branch FSD (Tietoarkisto), ESS (European Social Survey), and SHARE (Survey of Health, Ageing and Retirement in Europe), concerning researcher training, long-term storage, and archiving, ideally also comprising a register of consenting potential research participants.

FIN-CLARIAH relies on the CSC RI computing capacity for processing its datasets. In addition, CSC connects FIN-CLARIAH to national and European computational infrastructure projects. The Language Bank Rights web service for granting access to restricted resources is maintained in cooperation with the ELIXIR RI (an RI for Biocomputing).

FIN-CLARIAH also collaborates with other Finnish memory organisations and data providers, e.g. Finnish Heritage Agency, Institute for the Languages of Finland, Finnish Literature Society, Society of Swedish Literature in Finland, National Land Survey of Finland and Statistics Finland.

FIN-CLARIAH coordinating activities to support national and international networks: FIN-CLARIAH activities can be seen from two different angles. FIN-CLARIAH focuses both on tools and datasets for top-level research, and on specific research areas and their international connections. Essentially the infrastructure bridges the gap between the two so that key tools and datasets become available for the specific research areas supported by FIN-CLARIAH. The support for top-level research nationally is presented in Table 2.

Data and processing type | Primary data and/or tool providers | Relevant top-level research activities | Examples of international infrastructure cooperation by FIN-CLARIAH partners
Text and data processing and annotation environments | All FIN-CLARIAH partners | Aging, CM, COMHIS, PapyGreek, Adaptation, FoTran, Units, Ndebele, Inari Places, OcEx, SemFields, Viral Culture, Urko | EOSC (European Open Science Cloud), SSHOC (SSH Open Cloud), RDA (Research Data Alliance), AARC (Authentication and Authorisation for Research and Collaboration project), PRACE (Partnership for Advanced Computing in Europe), NeIC (Nordic e-Infrastructure Collaboration)
Speech processing and annotation | Aalto University, Universities of Eastern Finland, Helsinki, Turku and Oulu, Kotus, NLF | III, DLT, ELA | ICSI (International Computer Science Institute, USA), NINJAL (National Institute for Japanese Language and Linguistics)
Video and picture processing and annotation | Aalto University, University of Eastern Finland, JYU, Kotus, National Library, YLE, NARC, UHEL | MIND, ETD, MeMAD, COMET, Mutable, GameCult, MCG | University of Surrey (UK), New York University (USA)
Manuscripts and historical documents | Kotus, JYU, NLF, NARC, UHEL | ANEE, CSTT, STRATAS, DEMLANG | READ co-op (Virtual Research Environment for the automated recognition, transcription, and indexing of handwritten archival documents)
National Linked Open Data and Semantic Web infrastructure and applications | NARC, NLF, Kotus, Finnish Heritage Agency, Finnish Literature Society, National Land Survey of Finland, Parliament of Finland, Ministry of Justice, Aalto, UHEL | MMM, SemParl, SuALT, ARIADNEPlus | Univ. of Oxford (UK), Univ. of Pennsylvania (USA), Colorado Univ. (USA), IRHT Paris (France), Keio University (Japan), etc.
Table 2: Supporting top-level national research

(CoE = Finnish Centre of Excellence, ERC = European Research Council, EU = EU Horizon 2020 Project, DH = AoF DH project, PD = AoF post-doctoral project, AP = AoF project)

  • CoE ANEE (Ancient Near Eastern Empires, Svärd), CSTT (Changes in Sacred Texts and Traditions, Nissinen), III (Intersubjectivity in Interaction, Sorjonen), GameCult (Game and culture studies, Koskimaa), MCG (Music Cognition, Toiviainen), Aging (Ageing and Care)
  • ERC PapyGreek (Digital Grammar of Greek Documentary Papyri, Vierros), Adaptation (Linguistic Adaptation, Sinnemäki), FoTran (Found in Translation, Tiedemann), Sensotra (Sensory Transformations and Transgenerational Environmental Relationships in Europe, 1950-2020, Järviluoma)
  • EU MeMAD (Automated video description and translation, Kurimo), ARIADNEPlus (Data infrastructure for archeology, Hyvönen)
  • DH CM (Citizen Mindscapes, Lagus), DLT (Digital Language Typology: mining from the surface to the core, Vainio), COMHIS (Computational History and the Transformation of Public Discourse in Finland, 1640–1910, Tolonen), STRATAS (Sociolinguistic research on language change, Nordlund), MMM (Mapping Manuscript Migrations, Hyvönen), SemParl (Semantic Parliament, Hyvönen), SuALT (Finnish Archaeological Finds Recording Linked Open Database, Hyvönen)
  • PD ELA (Early language acquisition, Räsänen), ETD (Embodied Task Dynamics, Simko)
  • AP MIND (Cross-modal connections, between speech communication, hand gestures and perception, Tiippana), COMET (Motion in time, Leino), DEMLANG (Democracy and language, Palander-Collin), Ndebele (Language change in Ndebele, Aunio), SemFields (Semantic fields, Lindén), Mutable (Interpretation for visually impaired, Hirvonen), Units (Sub-sentential units, Helasvuo), Inari Places (Multi-lingual place names in Inari, Valtonen), KATVE (Carelian language in Finland and Tver, Palander), OcEx (Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840–1914), Viral Culture (Viral Culture in Early Nineteenth-Century Europe, Salmi), Urko (Uralic triangulation, Onkamo)

FIN-CLARIAH actively supports high-profile research projects at UHEL with tools and technology, e.g., the AoF-funded Centre of Excellence on Ancient Near Eastern Empires, as well as several AoF-funded DH projects. The infrastructure also offers support to various ERC-funded UHEL projects in the SSH field, e.g., Gulag Echoes in the “multicultural prison” (Judith Pallot), Crosslocations in the Mediterranean (Sarah Green), FoTran: Found in Translation (Jörg Tiedemann), Digital Grammar of Greek Documentary Papyri (Marja Vierros) and Linguistic Adaptation (Kaius Sinnemäki), as well as to Horizon 2020 projects, e.g. NewsEye.

In addition to UHEL, FIN-CLARIAH also supports researchers at other research organisations in Finland in a range of disciplines including but not limited to social and political sciences (e.g. the AoF project Citizen Mindscapes, UHEL/UTU/UEF; Tackling Biases and Bubbles in Participation, Strategic Research Council at AoF, UHEL/UTA/UTU/Etla Economic Research/Finnish Institute for Health and Welfare), social and economic history, historical demography, economics and sociology (e.g. the AoF project Contextualising Finnish Early Modern Economy (1500-1860), JYU), game and cultural studies (e.g. the AoF Centre of Excellence in Game Culture Studies, TAU/JYU/UTU), and sociolinguistics and dialectology (e.g. the AoF project Migration and Linguistic Diversification, UEF).

Collaboration with the international/national research field and infrastructure community: The EOSC project is one of the largest EU funded e-Infrastructure initiatives aiming at building a collaborative data infrastructure allowing researchers to share data within and between communities and enabling them to carry out their research effectively. User communities throughout Europe are engaged in projects supporting the EOSC vision. CLARIN ERIC is a partner of the EOSC hub, in which CSC has a leading role. DARIAH-EU is a partner in the SSH Open Cloud (SSHOC; https://www.sshopencloud.eu/) which is part of EOSC.

FIN-CLARIN is a part of the European CLARIN effort to build an international network of trusted CLARIN centres for offering language-based materials in multiple languages to researchers throughout the world. Currently CLARIN has the following 21 members: Austria, Bulgaria, Croatia, Cyprus, the Czech Republic, Denmark, Estonia, Finland, Germany, Greece, Hungary, Iceland, Italy, Latvia, Lithuania, the Netherlands, Norway, Poland, Portugal, Slovenia, and Sweden with France, South Africa and the UK as observers as well as an individual centre in the US.

DARIAH-FI is represented by UHEL and Aalto as Cooperating Partners in the European DARIAH RI. DARIAH has 19 members and several Cooperating Partners in 8 non-member countries. DARIAH-FI will provide all Finnish universities and research institutes full access to the services provided by DARIAH-EU to improve their opportunities for international collaboration and networking. DARIAH-FI will be developed in alignment with the SSHOC Marketplace to ensure the long-term sustainability of its services (https://www.sshopencloud.eu/ssh-open-marketplace).

Finland is currently participating in the COST Action (CA) 18209 (“Nexus Linguarum”) that aims to promote synergies across Europe between linguists, computer scientists, terminologists, and other stakeholders in industry and society, in order to investigate and extend the area of linguistic data science. FIN-CLARIAH is represented in the Management Committee of the CA by Prof. Eero Hyvönen and Dr. Jouni Tuominen as members and by MA Mietta Lennes and MSc Minna Tamper as substitutes.

FIN-CLARIAH collaborates with the international READ Cooperative. Benchmark institutions for identifying best practices and exchanging perspectives include the Berkeley Institute for Data Science and the Digital Methods Initiative at the University of Amsterdam. Other international collaborations include DH in the Nordic Countries (DHN), the European Association for DH (EADH), and various international networks using digitised sources and databases (e.g. ClioInfra and EH-net in the field of economic history). FIN-CLARIAH participates in the work of the International Internet Preservation Consortium through NLF.

Nationally, FIN-CLARIAH collaborates closely with the AoF CoE on Ancient Near Eastern Empires, which tests deep learning and neural network methods in DH research, in order to develop machine learning methods suitable for efficient use on small datasets. The RI also serves the Finnish Centre for Artificial Intelligence (FCAI, an AoF Flagship involving UHEL, Aalto and the Technology Development Centre VTT Oy) with datasets for language-centric AI. For other key national collaborators such as the FIN-CLARIAH consortium members, and how the project benefits from their cooperation, see the description of Project collaborators in the Application form.

The interdisciplinary nature of DH and SDS also provides opportunities for collaboration beyond SSH disciplines, in particular with computational sciences. Emerging new research projects in digital history for example are highly interdisciplinary. A recent development that also needs infrastructural support is that interdisciplinary collaboration leads to formation of research groups where the backgrounds of the collaborators vary from humanities to computer science. This will have a profound impact on the whole SSH research culture and needs infrastructural support starting from the management of such groups and access to the right kind of communicative tools. In addition, medical science has benefited from large digitised historical databases when studying hereditary diseases as well as new opportunities for data mining freeform patient records.

Usage by other research fields and RIs: While the tools and materials in the Language Bank primarily serve researchers of various SSH branches, they also serve scholars in computer science. In computer science, FIN-CLARIAH serves the areas of data science, statistics, visualisation, machine learning and AI, forming an integral part of modern DH, social and legal science research. In particular, language technology-related AI research crucially depends on access to substantial language resources.

While the various RIs of the Social Sciences provide survey data of societal phenomena, the Language Bank provides raw data for indicators of societal phenomena in the form of historical newspapers, which can be mined for data covering periods before such data was systematically collected by the other RIs. Contemporary sources in the Language Bank like discussion forums such as Suomi24 offer similar opportunities.

The curated data and the know-how developed by CLARIN for user authentication and authorization as well as the procedures for clearing intellectual property rights represent state-of-the-art best-practice and can be used for negotiating and distributing digital materials created or collected by human activity also in other infrastructures. A significant benefit of the FIN-CLARIAH collaboration is that the entire SSH field in Finland can benefit from the FIN-CLARIN expertise in these issues, e.g. through resource sharing with CESSDA and the Finnish Data Archive. FIN-CLARIAH has a common interest with CESSDA regarding transliterated audio and video interviews, jointly providing access to such data via the Language Bank. Similar needs for an authorization system for the life science datasets exist in ELIXIR. CSC develops the shared technology in the form of REMS–a resource entitlement system.

Other actors in the landscape: Libraries also constitute an RI providing access to electronic publications, but currently their main business is to provide reading access to whole works, whereas the main function of FIN-CLARIAH is to offer processing access to datasets. When the Digital Single Market Directive is implemented nationally, libraries can also provide access to electronic publications for text and data mining for research purposes. Researchers will need a secured repository, e.g. the Language Bank, to store and communicate their datasets.

Google has digitised works, but they are not normally available for further processing as research materials. There are also data repositories, such as ELDA (Evaluations and Language resources Distribution Agency) and LDC (Linguistic Data Consortium), which provide copies of language-based materials for a fee also to industry. Their service is complementary to FIN-CLARIAH, which primarily offers datasets free of charge to academia.

1.3 Added value

Added value for science: FIN-CLARIAH reduces the time researchers spend on data collection and preprocessing by providing a centralized platform that researchers and students can access through their workstations and where they can locate research materials in huge collections, seamlessly get the necessary permissions, and start doing research immediately. This improves efficiency and quality: 1) With remote access to large data collections and processing environments, many problems which used to take weeks can now be solved in hours or minutes, and new regularities and anomalies are easier to discover than with traditional tools and datasets. 2) Many claims which were based on intuition can be verified by more objective and broader evidence. SSH research will become more reproducible as the materials on which the arguments were based are accessible to other researchers for verification of the claims.

Through its emphasis on researcher-driven prototyping of new data science methods and open collaboration models using CSC and the Language Bank as a virtual hub, FIN-CLARIAH a) accelerates and supports the adoption of the latest data science methods and best practices in SSH research and teaching; b) helps strengthen the connections between research groups nationally and internationally; and c) promotes collaborative practices where researchers benefit from each other’s work and become co-creators of methods and applications instead of relying only on existing tools.

While striving for multidisciplinary work, FIN-CLARIAH fosters new SSH practices. It allows researchers to manage and analyse all kinds of SSH data, whether pre-existing or researcher-produced, including combinations of large amounts of multimodal data, and helps them deal with sensitive or otherwise restricted data. Robust computational solutions tailored to SSH requirements create unique opportunities for combining qualitative and quantitative approaches in research fields that have often focused on one or the other, and for studying phenomena at all levels of granularity from single instances to diachronic and spatial variation in large datasets. The infrastructural investment will pay off as synergies in data sharing and methods development, e.g. commonly agreed application programming interfaces (aka APIs), data formats, and data science tools; and as cutting-edge research that would otherwise not be possible.

FIN-CLARIAH creates new opportunities for quantitative as well as qualitative analysis in SSH disciplines, thus contributing to the scientific breakthrough potential of its users. E.g., the READ Transkribus handwritten text recognition technology revolutionises access to historical handwritten documents, providing these collections in fully searchable and computer-readable form and thus enabling new approaches to historical handwritten documents, including tabular numeric data and population data. Furthermore, the Semantic Computing technologies developed at Aalto underlying the core ontologies and the methods for annotation, linking, and publishing harmonised datasets, represent the most up-to-date developments in Linked Data-based DH research and have received several scientific awards such as the LODLAM Technical Challenge Open Data prize in 2017 and the Open Finland Challenge award (Public Services & Active Citizens) in 2015.

FIN-CLARIAH also adds value to the datasets it provides by offering advanced content search through uniform web interfaces and well-defined application programming interfaces (APIs), so that diverse datasets can be accessed and hypotheses tested in a uniform way. The Language Bank currently provides roughly 5 million documents containing 13 billion words of Finnish, 3.5 billion words of Swedish, and 3.5 billion words of other languages as well as several hundred hours of spoken data, which are made available for download, content search or online processing. Data in Finno-Ugric languages with no representation in the EU are also made available through the Language Bank. For the convenience of Finnish researchers, the Language Bank hosts some internationally available tools and datasets for other languages. Moreover, FIN-CLARIAH offers international researchers centralized access to SSH resources in Finland.

Added value for education: FIN-CLARIAH arranges courses and skill development in cooperation with the master level and doctoral programmes of the member universities and their joint national training networks. Advanced students, PhD students, and researchers are typical users of digital language resources. As part of the university curricula at UHEL and JYU, FIN-CLARIN offers open online courses on analysis and processing methods of data containing text and speech, including terminology work and concept analysis.

FIN-CLARIAH also arranges a national bi-annual RDHum Conference.[3] CLARIN ERIC and DARIAH ERIC support the annual international Helsinki Digital Humanities Hackathon at UHEL. CLARIN and DARIAH offer training on the EU level as part of their Knowledge Sharing Infrastructures[4] and promote international cooperation through mobility grants and funding for Master Classes, Summer Schools and Training events promoting the uptake of CLARIN and DARIAH standards and the spreading of best practices.

FIN-CLARIAH offers training for Bachelor, Master level and PhD students in the use of language-based materials accessible through the Language Bank. The researchers and PhD students learn to locate suitable language-based materials in the vast number of datasets and tools in the Language Bank and the CLARIN and DARIAH repositories around Europe. In addition, the students learn how to prepare their own materials to be compatible with CLARIN and DARIAH standards and how to submit them to FIN-CLARIAH for future use by other scholars. In cooperation with the Doctoral Programme in Language Studies, an annual FIN-CLARIN course on Data Management and Annotation is arranged by UHEL. The course is offered for MA and PhD students in other programmes and universities as well. A similar course is also organized in cooperation with CESSDA for its researchers and PhD students in social sciences. The NLF also offers data clinics for university students and researchers on technical and legal aspects of using digitised publications and organizes informal meetings and symposia for researchers using materials offered by the library.

FIN-CLARIAH’s work on ethics, access and openness contributes to curriculum building at the national level and promotes both ethically enlightened (https://vastuullinentiede.fi/en) research practices as well as open science. Education also benefits from the development and sharing of digital resources through electronic notebooks, interactive web applications, and dedicated open data platforms. Data science and open data resources provided by FIN-CLARIAH, directly coupled with ongoing research projects, can be utilised in academic teaching on all levels. Tutorials, learning modules and other types of widely applicable content can be shared, accessed and embedded in a large number of courses.

Year                         2013   2014   2015   2016   2017   2018
BA students                 25656  24384  22953  21849  20595  19938
MA students                 14100  14142  14487  14697  14820  15030
Licentiate or PhD students   5241   5232   5094   4824   4656   4524
Total                       44997  43758  42534  41370  40071  39492
Table 3: Number of students in SSH, who comprise approx. 28% of all students within the eight FIN-CLARIAH member universities. (Data from Vipunen – Education Statistics Finland, https://vipunen.fi/, retrieved on April 20, 2020.)

FIN-CLARIAH has a potential user base in the research and research-based teaching of hundreds of researchers and teachers, as well as their students. There are thousands of licentiate or PhD students, about fifteen thousand Master level students and about twenty thousand Bachelor students in the SSH fields in Finland alone, as shown in Table 3, annually resulting in approx. 3500 degrees in the Humanities and 3000 degrees in the Social Sciences, of which approx. 130 in each field are doctoral degrees.

The Helsinki Term Bank for the Arts and Sciences (HTB) (https://tieteentermipankki.fi) is a multidisciplinary project which aims to gather a permanent terminological database for all fields of research in Finland. HTB is mentioned in the Programme of the Prime Minister’s Office for implementing the National Language Strategy. HTB has a significant impact on all basic and advanced researcher training and education as well as on the education of corresponding subjects in secondary schools by providing background information for teaching materials and science education. The potential user base exceeds 100 000 teachers and students from all areas of science and humanities, which manifests itself in an average of 2000 daily users of the HTB.

FIN-CLARIN has offered international courses on morphology and language processing using its technology for morphologically rich languages, Helsinki Finite-State Technology (HFST, https://en.wikipedia.org/wiki/HFST). Currently, online training courses are offered at the EU level to PhD students and researchers via a CLARIN Knowledge Centre as part of the Knowledge Sharing Infrastructure within CLARIN ERIC. For this purpose, FIN-CLARIN has established a distributed K-Centre on Systems and Frameworks for Morphologically Rich Languages (SAFMORIL, https://www.kielipankki.fi/safmoril) in cooperation with Norway, Sweden, Latvia and Lithuania.

DARIAH-FI will facilitate the spread of digital competences across Finnish universities to enable them to compete at the highest international level in data-intensive SSH research. DARIAH-FI will support domestic researcher mobility by joint development workshops and enhanced collaboration opportunities enabled by the infrastructure, while active collaboration with DARIAH-EU promotes international researcher mobility and, together with DARIAH-FI’s open source practices, creates international impact.

2 Wide and versatile impact

2.1 Impact for society at large

FIN-CLARIAH will have a wide impact on Finnish business and industry, society and employment, knowledge and innovation ecosystems and new business initiatives.

The availability of sufficient amounts of language data is necessary for building adequate language technology for any language. For more information, see e.g. the Finnish Language White Paper.[5] The Language Bank caters primarily to academia, but also aims to remedy the situation for business and industry in Finland. Efforts to make the tools and data collections in the Language Bank available to industry for a fee are under consideration by Vake (https://vake.fi/), the Finnish State Development Company, which is interested in providing seed funding for developing Finnish language components for AI for the benefit of Finnish industry, building partly on the resources already available in the Language Bank. Business Finland (https://www.businessfinland.fi/) aims to provide funding for language technology and AI application development for market needs.

FIN-CLARIAH improves the competitiveness of the Finnish society through commercial research groups and software industry by providing data for developing innovations in AI and machine learning. These are new types of commercial research and development requiring millions of documents with thousands of millions of words and hundreds of hours of speech data. The development of language resources for commercial AI research is pursued in cooperation with the Technology Industries of Finland and the Finnish Centre for Artificial Intelligence (FCAI). Through industry collaboration, FIN-CLARIAH facilitates education of the next generation of data scientists, who will be well-versed in the diverse methodological needs of SSH disciplines and therefore equipped for employment in a versatile job market in government and industry. Data-intensive SSH research can be expected to lead to new innovations and spin-offs, for instance by facilitating the use of open data or analysis infrastructures as part of commercial knowledge and innovation ecosystems.

Such new business initiatives include, e.g., the emergence of consumer behaviour-aware services, statistical barometers, development of national scanning and optical character recognition services, improved knowledge of history with more historical resources on the internet, improved historiography of ancestry, validated support for student assessments, corpus-informed development of teaching and study materials, artificial training environments for oral skills, semi-automated assessment environments for oral skills, language identification for migrants, machine-assisted integration services for foreigners, training of interpreter working skills, interpretation within multilingual organizations, support for revival and sustainability of endangered or minority languages, development of specialized dictionaries, support for lexicographical work, terminology skills in industry, linguistic guidelines for computer programs and systems, improved quality of user-interface planning and localization, etc.

CLARIN and DARIAH set de facto standards for resources to improve equal access. The resources developed or provided through FIN-CLARIAH enable national research in SSH big data, quantitative studies of speech and video material, technology-based research of interpretation and translation, as well as e.g. research in automated video and speech commentaries to fully implement the web accessibility directive.[6] By providing ethical guidance and sustainable access mechanisms to cultural heritage collections, FIN-CLARIAH can also help safeguard the rights of indigenous peoples.

FIN-CLARIAH’s emphasis on open licensing also guarantees that project outcomes will be widely usable in society. Many prototypes using the Linked Data Finland infrastructure are already openly available: 1) the BookSampo system (in Finnish public libraries) had 2 million users in 2019, 2) WarSampo had 690 000 users, and there have been tens of thousands of users of 3) BiographySampo, 4) NameSampo and 5) the Finto service (NLF) based on the ONKI prototype.[7] FIN-CLARIAH is also actively providing parallel language and terminology resources to the European Commission (EC) through the European Language Resource Coordination (http://www.lr-coordination.eu) to improve the public EC machine translation services.

2.2 Impact for Finland

Membership in the international RIs CLARIN ERIC and DARIAH ERIC is important for Finland.

CLARIN ERIC makes digital language resources available to scholars, researchers, students, and citizen-scientists in all its member countries through public or single sign-on access.[8] As each CLARIN member country focuses on its own national resources, CLARIN ERIC supports scholars who want to engage in cutting-edge data-driven research by adding a transnational aspect to the resource sharing. CLARIN offers long-term solutions and technology services for deploying, connecting, analysing and sustaining digital language data and tools. This is manifested through three CLARIN priority areas: Uptake, Technical Infrastructure and Knowledge Sharing:

  1. Uptake by researchers: CLARIN offers a central entry point for researchers interested in language resources and technologies and stimulates CLARIN-wide uptake activities.
  2. Technical infrastructure: CLARIN has constructed a sound and stable technical basis to support the sharing of language data and tools across institutional, disciplinary and international borders. Besides enriching and strengthening this infrastructure, CLARIN works towards increasing interoperability within the CLARIN ecosystem. This requires efforts by the tool and the data providers coordinated by the National Coordinators’ Forum and the Standing Committee for CLARIN Technical Centres.
  3. Knowledge Sharing Infrastructure: CLARIN is also an ecosystem for the exchange of knowledge and expertise. This Knowledge Sharing Infrastructure works as ‘glue’ for the various communities engaged with CLARIN, and consists of measures and facilities for transferring knowledge between parties involved in the construction, operation and use of the infrastructure. Important instruments are the so-called CLARIN knowledge centres (K-centres), which bring together expertise in a certain domain, topic, data modality, etc.

CLARIN ERIC also promotes inter- and transnational cooperation through mobility grants (https://www.clarin.eu/content/clarin-for-researchers). FIN-CLARIN provides added CLARIN value for Finland by facilitating education and cooperation between universities, research groups and researchers internationally as documented in Table 4. Many of the international partners are active in CLARIN ERIC member consortia.

Research area | Primary FIN-CLARIN members | Examples of international research partners of FIN-CLARIN members
Social Science Research in large datasets | Universities of Helsinki and Turku, Tampere University | University of Duisburg-Essen (Germany, CLARIN), University of Hong Kong (China)
Learners’ assessment environments | Aalto University, Universities of Jyväskylä, Helsinki, Oulu and Turku, Kotus | Lancaster University (UK, CLARIN), Penn State University (US), Reitaku University (Japan), Tallinn University (Estonia, CLARIN), Educational Testing Service (US)
Translation and Interpretation | Universities of Eastern Finland and Helsinki, Tampere University | Charles University (Czech Republic, CLARIN), Uppsala University (Sweden, CLARIN)
Dictionary Development and Lexicographical research | Kotus, Universities of Eastern Finland and Helsinki, Tampere University | Moscow Pedagogical University, Russian Academy of Science, University of Tromsø (Norway, CLARIN), University of Hamburg (Germany, CLARIN)
Terminology Development and Processing of Language for Special Needs | Universities of Vaasa, Helsinki and Turku | Eurac Research (Italy), NHH (Norwegian School of Economics, Norway), Isof (Institute for Language and Folklore in Sweden, CLARIN), CBS (Copenhagen Business School, Denmark)
Table 4: Providing infrastructure to facilitate CLARIN cooperation in research

FIN-CLARIAH will also align its activities with the DARIAH ERIC infrastructure. DARIAH ERIC services of particular interest to Finland and Finnish research and society include DARIAH Working Groups in strategic areas defined by DARIAH’s Virtual Competence Centres as well as opportunities for collaboration through DARIAH Regional Hubs and with DARIAH service providers and partner institutions in other countries. DARIAH requires in-kind contributions from its members and provides access to these tools, and plays an active role in the development of the SSH Open Marketplace, which will help researchers discover tools and services. #dariahTeach provides open educational resources for digital arts and humanities while DARIAH-CAMPUS, currently under development, is a new discovery framework and hosting platform for CLARIN and DARIAH learning resources. DARIAH-FI will provide all Finnish universities and research institutes full access to the services provided by DARIAH-EU, thus improving their opportunities for international collaboration and networking through, e.g., the Open Science Hub.

3 Ownership, financing, know-how and organisation

3.1 Ownership

Ownership and location of the RI: The host of the FIN-CLARIAH RI is UHEL, and the main RI computing facilities are hosted by CSC. The Faculty of Arts at UHEL currently carries the overall responsibility for the RI as the host of the Director and Vice Director of FIN-CLARIAH at the University of Helsinki. The FIN-CLARIAH consortium consists of two national RI components, FIN-CLARIN and DARIAH-FI. The FIN-CLARIN RI members signed a Consortium Agreement (Konsortiosopimus[9]) in 2016 to join the RI, detailing their responsibilities, the steering activities (Työjärjestys) and the economy (Yhteenveto panostuksista) of FIN-CLARIN that the parties have agreed on. The DARIAH-FI RI members have signed letters of intent to participate in a consortium.

Host organizations’ support for and strategic alignment with the RI: The UHEL Strategic Plan 2021-2030[10] names openness as one of the university’s strategic choices and places strong emphasis on consolidating research and learning infrastructures. The Faculty of Arts has accordingly named the utilisation of a wide range of data, methods and advanced infrastructure as one of its four orientations. The Language Policy of UHEL mentions the HTB as a means to create a foundation for implementing the strategic aim of multilingualism and parallel language use. DH at UHEL has received funding through strategic profiling (PROFI2: Helsinki Centre for Digital Humanities HELDIG). Coordinated by the Faculty of Arts, HELDIG serves the entire central campus by offering methods and materials that enable new research approaches and disciplinary development, e.g. state-of-the-art Linked Open Data services and Semantic Computing. The Faculty of Social Sciences coordinates a profiling action which tackles the societal implications of digitalisation (PROFI4: Inequality, Wellbeing and Security INEQ), and hosts the Centre for Social Data Science. FIN-CLARIAH will ideally complement the recently founded Helsinki Institute for Social Sciences and Humanities (HSSH), which functions as a local hub for research infrastructure and will host the UHEL functions of FIN-CLARIAH.

CSC – IT Center for Science Ltd., a national centre for IT expertise owned by the Finnish state and higher education institutions, supports FIN-CLARIAH as part of its strategic aims to 1) enable world-class data management and computing by providing resources and supporting computational ecosystems; 2) build services and collaborations to maximise the value of data; and 3) help customers and collaborators to leverage AI.

Consortium member and collaborator support for and strategic alignment with the RI: Aalto University (Aalto) and UHEL have been strategic partners since 2016 and collaborated on several DH projects between 2002 and 2020. Five Aalto schools engage in DH research as part of Aalto’s research strategy to engage in multidisciplinary collaboration to find unique solutions for the benefit of industry and society.

FIN-CLARIAH will directly contribute to two strategic areas of the University of Turku (UTU), Digital Futures and Cultural Memory and Social Change. UTU hosts The Archives of History, Culture and Arts Studies and supports their digitisation. Through FIN-CLARIAH, this infrastructure can be linked to other similar collections in Finland. UTU is known as a locus of compiling language corpora and digital curation in the humanities. It hosts the Agricola portal, an open platform for DH content, and the Digilang portal for language corpora compiled at UTU.

At Tampere University’s (TAU) Faculty of Information Technology and Communication Sciences, technology and the humanities come together in a unique way. FIN-CLARIAH will contribute to TAU’s research mission to ensure socially responsible digitalisation and transformation of work. The FIRE (Finnish Information Retrieval Experts) group at TAU is internationally established as conducting high-quality research on task-based and interactive information retrieval. The COSSTT research group (Corpus-based Studies of Specialised Texts, Translations and Terminology) is collecting multilingual text corpora with an emphasis on parallel and comparable corpora of language for special purposes, as well as multilingual terminological databases. The SMiLE research group (Statistical Machine Learning and Exploratory Data Analysis) is creating methods and algorithms for analysis of latent trends and structures in text corpora.

FIN-CLARIAH strengthens the University of Jyväskylä’s (JYU) profiling areas Crisis Redefined, Cyber Security Research, Ageing and Care, and Research Collegium for Language in Changing Societies, all of which incorporate DH in their profiles. JYU also hosts two AoF Centres of Excellence: Game Culture Studies and Ageing and Care, which will both use and be involved in developing FIN-CLARIAH. JYU recently launched a Digital Programme to harness the benefits of digitalisation in research.

The University of Eastern Finland’s (UEF) current strategy highlights interdisciplinary research and the use of digital methods and big data. UEF has established numerous professorial and post-doctoral positions in digital orientations and data sciences. FIN-CLARIAH is closely connected to such AoF and European Research Council funded UEF research areas as Borders, Mobilities and Cultural Encounters and Learning in Digitised Society.

The University of Oulu (OU) offers expertise, e.g., in Finnic minority and regional languages, Northern varieties of Finnish, learners’ corpora, corpus linguistics, corpus methodology, Saami linguistics, Saami language technology, training researchers and other professionals who have a profound knowledge of Saami language and culture. The University of Oulu provides the Giellagas Corpus of Spoken Saami Language and an error-annotated version of the International Corpus of Learner Finnish, ICLFI.

The University of Vaasa (UVA) offers expertise, e.g., in terminology, especially conceptual analysis, research in language for special purposes as well as discourse analysis. The University of Vaasa maintains WasaTerm – Terminologisia sanastoja, Terminology Forum.

The Institute for the Languages of Finland (Kotus) is devoted to the study and language planning of Finnish and Swedish. It also coordinates the activities of the Saami, Romani, and Sign Language Boards. In FIN-CLARIAH, the Institute for the Languages of Finland offers expertise, e.g., in internet access to scientific dictionaries, other dictionaries, lexicons and wordlists, resources for linguistic and onomasiological maintenance, text, speech and video corpora.

The National Library of Finland (NLF) contribution to FIN-CLARIAH is in line with its strategic goals to promote open science and improve the preconditions of cultural and social research through DH cooperation.

The National Archives of Finland (NARC) is a co-founder of the EU-wide READ Cooperative using neural networks to convert digitised handwritten and printed historical documents into machine-readable and searchable format.

3.2 Funding base

FIN-CLARIAH consists of the national RI components FIN-CLARIN and DARIAH-FI. As an existing national RI, FIN-CLARIN has previous funding and established in-kind contributions from its members, whereas DARIAH-FI as a new RI is currently consolidating its funding base. The two RI components have common as well as separate development and upgrade needs. For this reason, two separate financial plans are outlined while making sure that there is no overlap in the foreseen development activities.

FIN-CLARIN previous funding and current funding base: Around 1996-1997 the Language Bank began storing and providing access to language resources as a cooperation project between UHEL and CSC. During 2006, the FIN-CLARIN consortium was established as an advisory board for the Language Bank, and FIN-CLARIN received strategic funding from UHEL comparable to approx. 1 M€/year in terms of the full-cost model until 2012. FIN-CLARIN has received funding for 2013 from the AoF for the FIN-CLARIN BUILD, for 2014 for the FIN-CLARIN-INVEST, and during 2015-2016 for the FIN-CLARIN-INVEST2 projects. The most recent funding 2017-2019 has been for the FIN-CLARIN UPGRADE project. The full-cost model funding during the five previous years has been approx. 1 M€/year with 30% own funding by the AoF Project consortium members.

The FIN-CLARIN host organization UHEL has committed to financing its own share of the costs for the FIN-CLARIAH project covering the years 2021-2030. Similar commitments have been made by the other participating organizations.

The FIN-CLARIN members provide in-kind contributions in the form of datasets and tools, which they produce or create as part of their normal activities with external or internal funding and which they have agreed to deposit and make available through the Language Bank. An estimate of the work invested by the FIN-CLARIN members during the previous five years amounts to approx. 15 M€ and is foreseen to continue at a level of 3 M€ annually[11] according to the estimate from December 2015 with an average increase of 2% annually due to increased salary costs.

Financial plan 2022-2030 for FIN-CLARIN: UHEL and CSC are responsible for the CLARIN ERIC coordination activities and the development of the Language Bank with a projected cost breakdown as shown in Table 5. The budget of the FIN-CLARIN members for language resource development is estimated at a level of 3 M€ annually excluding funding for any Centres of Excellence, DH, or other AoF projects, which roughly contribute another 3 M€ annually to the field of research. In total, this projects to an investment of 54 M€ in the field over the next nine years. The full-cost budget for the coordination of FIN-CLARIN and the upgrading of the Language Bank over the nine-year period is estimated at approximately 1 M€ annually, i.e. 9 M€ over the roadmap period. The funding requested from the AoF, approximately 6.5 M€ (Table 5), amounts to 10% [= 6.5/(9*6+9)] of the total investment.

k€         2022  2023  2024  2025  2026  2027  2028  2029  2030  Tot.
UHEL/ARTS   757   778   800   799   821   644   638   655   674  6566
CSC         280   285   289   294   297   301   306   311   315  2677
Tot.       1036  1063  1089  1093  1118   945   944   966   989  9243
Own         311   319   327   328   335   284   283   290   297  2773
AoF         725   744   762   765   783   662   661   676   692  6470
Tot.       1036  1063  1089  1093  1118   945   944   966   989  9243
Table 5: Funding Plan in k€ for 2022-2030 for FIN-CLARIN
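
The split between own funding and AoF funding in Table 5 can be checked with a few lines of Python. The sketch below is illustrative only: the figures are copied from Table 5, while the names (own_share, check_rows) and the 1 k€ rounding tolerance are assumptions introduced here, not part of the funding plan itself.

```python
# Figures in k€ copied from Table 5 (FIN-CLARIN funding plan 2022-2030).
YEARS = list(range(2022, 2031))
TOTAL = [1036, 1063, 1089, 1093, 1118, 945, 944, 966, 989]
OWN = [311, 319, 327, 328, 335, 284, 283, 290, 297]
AOF = [725, 744, 762, 765, 783, 662, 661, 676, 692]


def own_share(own, total):
    """Return own funding as a fraction of the full-cost total."""
    return own / total


def check_rows(tolerance_keur=1):
    """Return the years where Own + AoF differs from Tot. by more than the tolerance.

    The default tolerance of 1 k€ is an assumption that allows for rounding
    of the individual table cells.
    """
    return [
        year
        for year, own, aof, total in zip(YEARS, OWN, AOF, TOTAL)
        if abs(own + aof - total) > tolerance_keur
    ]


if __name__ == "__main__":
    for year, own, total in zip(YEARS, OWN, TOTAL):
        print(f"{year}: own share {own_share(own, total):.1%}")
    print("Years outside rounding tolerance:", check_rows())
```

Under these assumptions the yearly own share stays close to the 30% level mentioned for the earlier AoF projects, and the two funding rows add up to the yearly totals within rounding.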

Financial plan 2022-2027 for DARIAH-FI: The total budget for the DARIAH-FI construction phase 2022–2027 is shown in Table 6. Further development and upgrading costs (2028–) are expected to be lower than the initial construction costs. The budget of the DARIAH-FI members for infrastructure development is estimated to be approx. 0.7 M€ annually on average to be funded by the AoF over the first six years. The total funding of research in SSH by Finnish foundations is estimated at 37 M€ for the Social Sciences and 36 M€ for the Humanities[12], which together contribute roughly 73 M€ annually to the SSH fields of research. The requested infrastructure funding from the AoF for the coordination and construction of DARIAH-FI over a six-year period is estimated at 1% [= 4/(6*73+4)] of the total investment in the SSH field in Finland.
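
The bracketed share estimates in the two financial plans follow the same pattern: the AoF request divided by the sum of the other annual funding over the period and the RI budget term. A minimal Python sketch of that arithmetic is given below; the function name aof_share and its parameter names are hypothetical, and the inputs are the rounded M€ figures quoted in the text.

```python
def aof_share(aof_request_meur, other_annual_meur, years, ri_budget_meur):
    """AoF request as a fraction of the total investment in the field.

    total investment = other annual funding * number of years + RI budget term
    All amounts are in M€. This mirrors the bracketed formulas in the text
    and is an illustration, not an official costing method.
    """
    total_investment = other_annual_meur * years + ri_budget_meur
    return aof_request_meur / total_investment


# FIN-CLARIN, 2022-2030: 6.5 / (9*6 + 9)
FIN_CLARIN_SHARE = aof_share(6.5, other_annual_meur=6, years=9, ri_budget_meur=9)

# DARIAH-FI, 2022-2027: 4 / (6*73 + 4)
DARIAH_FI_SHARE = aof_share(4, other_annual_meur=73, years=6, ri_budget_meur=4)

if __name__ == "__main__":
    print(f"FIN-CLARIN: {FIN_CLARIN_SHARE:.1%}")
    print(f"DARIAH-FI:  {DARIAH_FI_SHARE:.1%}")
```

Rounded to whole percentages, the two results correspond to the 10% and 1% stated in the financial plans.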

Based on a continuous evaluation of the RI and the needs of the research community, further external development funding will be sought from Finnish foundations. The funding contribution applied for from the AoF is required for building analysis tools and user interfaces, and for managing the construction of the infrastructure extension. DARIAH-FI will need a full-time coordinator to ensure that the resources allocated to member organisations are used efficiently to construct a unified infrastructure.

k€         2022  2023  2024  2025  2026  2027  Tot.
UHEL/ARTS   239   125   100   108    77    46   696
UHEL/SOC    151   199   112   108    77    46   693
UHEL/NLF     96    97   192    90    64    39   578
Aalto       139   143   146   100    71    43   643
CSC         126   128   129    90    65    65   604
TAU          96   168   165   100    71    43   643
UEF         142   143   144   100    71    43   643
JYU         132   173   123   100    71    43   642
UTU         164   183    82   100    85    51   665
Tot.       1286  1358  1194   896   654   419  5806
Own         386   407   358   269   196   126  1742
AoF         900   951   836   627   458   293  4065
Table 6: Funding Plan in k€ for 2022-2027 for DARIAH-FI

AoF funding cost structure: The funding is needed for investment costs, i.e. for the acquisition of materials and systems and for the design of new services for the FIN-CLARIAH infrastructure, and for the significant upgrade and extension of existing parts of the infrastructure. Service costs for UHEL consist mainly of special devices for recording or scanning or fees related to such activities. For CSC, the costs include infrastructure services and license fees. Annual traveling costs and other expenses cover approximately 1-2 trips abroad per person at approximately 1500€/trip. The annual inflation rate for salaries and other costs is projected at 1-3 %.

3.3 Know-how

Competence of key persons: The FIN-CLARIAH Director, FIN-CLARIN National Coordinator, Research Director, PI, Dr Krister Lindén serves as Chair of the Strategy and Management Board of CLARIN and as Vice Chair of the CLARIN National Coordinators’ Forum. He also participates in the CLARIN Legal Issues Committee and in the CLARIN Interoperability Committee. He has regular contacts with the members of FIN-CLARIAH and other actors in the field as well as with Nordic, European and other institutions and colleagues facilitating co-operation and co-development of resources. He has experience as CEO and CTO of the commercial company Lingsoft Inc. with successful application and completion of several EU projects. Lindén is very familiar with current methods and branches within language and speech technology. Lindén has directed a number of AoF-funded research projects and is Vice Team Leader of the Centre of Excellence for Ancient Near Eastern Empires. In addition to having developed software for processing resources for the national languages of Finland, he has published more than 100 peer-reviewed scientific publications. He is also on the scientific advisory boards of NLF and Kotus as well as several commercial companies. He also serves as National Anchor Point for the European Language Resource Consortium and the European Language Grid.

The Vice Director Mikko Tolonen is Associate Professor of DH at the University of Helsinki. He is the PI of the Helsinki Computational History Group at the Helsinki Centre for DH (HELDIG) and has worked as Professor of Research on Digital Resources at the National Library of Finland. He is the chair of DH in the Nordic Countries, a member of the board of directors for the European Association for DH, and on the scientific advisory board of CSC. In 2016, along with his research group, he was awarded an Open Science and Research Award by the Finnish Ministry of Education and Culture. He was the PI for Finland in the Humanities at Scale (DESIR 2017-19) project furthering DARIAH-EU’s aim to integrate digitally-enabled research in the Arts and Humanities in Europe. He also serves on the CLARIN SSH Expert Panel.

Competence of current other infrastructure personnel in terms of the implementation: Martin Matthiesen (CSC) has a degree in Language Technology and has previously worked in systems administration for a private language technology company. He is currently Chair of the CLARIN AAI Taskforce and Chair of the Standing Committee for Service Centres. Tommi Jauhiainen (UHEL) has a doctoral degree in Language Technology and strong experience in software engineering projects from his time as the information systems manager at the National Library, where he had the technical responsibility for the FinELib and Finna RIs on the National Roadmap of Finland. Mietta Lennes (UHEL) has a degree in Phonetics and has later specialized in user interaction and online teaching. She is also a member of the CLARIN User Involvement Committee. Jyrki Niemi (UHEL) has a degree in Language Technology and has later specialized in programming and the Korp concordancing system. Jussi Piitulainen (UHEL) has a doctoral degree in Language Technology and has specialized in processing large datasets. Erik Axelson (UHEL) has a degree in Speech Processing and has later specialized in programming and software applications. Tero Aalto (CSC) has a degree in Language Technology, has worked in CLARIN since its early stages, and is a member of the CLARIN PID taskforce.

Long-term commitment of the hosting organizations to develop the RI: Director, PI Krister Lindén works full-time for FIN-CLARIAH. He also carries the overall responsibility for the development and promotion of the Language Bank. Vice Director, Prof Mikko Tolonen works part-time for FIN-CLARIAH. Both have permanent positions at the UHEL Department of DH, Faculty of Arts. The above-mentioned infrastructure personnel at UHEL and CSC also have permanent positions. UHEL is committed to developing the Helsinki Institute for Social Sciences and Humanities (HSSH) as a local coordinator of research infrastructure and a node of FIN-CLARIAH.

Staff development through international mobility: Since the RI is built within a rapidly changing research environment, regular staff training is of utmost importance. CLARIN ERIC offers an exchange programme for visiting other CLARIN Centres in Europe for learning specific skills related to resources, and how to implement or install them in a national setting. The programme is also open for promoting national solutions to other CLARIN Centres by invitation. A good candidate for such mobility is the REMS system developed by CSC when implemented in other CLARIN Centres. This type of mobility is partly funded by CLARIN ERIC.

Recruitment: Temporary replacements or additional staff need to be hired when upgrading and developing the infrastructure. All new recruitments are handled through open, international calls with explicit criteria for each task to ensure the hiring of best-practice expertise. FIN-CLARIAH commits to the principles of the European Charter for Researchers and the Code of Conduct for the Recruitment of Researchers, i.e. open, transparent and merit-based recruitment.

3.4 Organisational structure

FIN-CLARIAH was established in 2020 based on the two national infrastructure components FIN-CLARIN and DARIAH-FI. The Director of FIN-CLARIAH is PI Krister Lindén, UHEL/ARTS, and the Vice-Director is Prof Mikko Tolonen, UHEL/ARTS. FIN-CLARIAH has an Executive Committee consisting of the Directors and National Coordinators for the daily operation of the RI.

FIN-CLARIAH has a Steering Group with at most two representatives from each member organisation in addition to the Director and Vice-Director of the RI. FIN-CLARIAH also has an Advisory Board with representatives from its members, collaborators and key stakeholders. The Director is Chair of the Steering Group and the Vice Director is Chair of the Advisory Board. The Steering Group and the Advisory Board have semi-annual meetings.

Member representatives in the FIN-CLARIAH Steering Group

  • Aalto Eero Hyvönen, Professor of Semantic Media Technology
  • Aalto Mikko Kurimo, Associate Professor of Speech and Language Processing
  • CSC Pekka Lehtovuori, Director of Services for Computational research
  • Kotus Ulla-Maija Forsberg, Director of the Institute for the Languages of Finland
  • JYU Ari Huhta, Professor of Language Assessment
  • JYU Jari Ojala, Professor of Comparative Business History
  • NARC Päivi Happonen, Deputy Director General at the National Archives of Finland
  • NARC Maria Kallio, Senior Officer at the National Archives of Finland
  • TAU Mihail Mihailov, Professor of Translation Science
  • TAU Sanna Kumpulainen, Associate Professor of Information Studies
  • UEF Mikko Laitinen, Professor of English Language
  • UEF Jukka Mäkisalo, University Lecturer of Translation Studies
  • UHEL/NLF Johanna Lilja, Service Director of the Research Library of the NLF
  • UHEL/SOC Krista Lagus, Professor of Computational Social Science
  • UTU Marja-Liisa Helasvuo, Professor of Finnish Language
  • UTU Hannu Salmi, Academy Professor and Professor of Cultural History
  • OU Tiina Keisanen, Professor of English Language
  • OU Jari Sivonen, Professor of Finnish Language
  • UVA Merja Koskela, Professor of Applied Linguistics
  • UVA Niina Nissilä, University Lecturer of Marketing and Communication

Division of labour

The Faculty of Arts at UHEL carries the overall responsibility for the FIN-CLARIAH RI as the host of the Director and Vice Director. UHEL will in particular be responsible for curating, preparing and updating language-based materials into installable databases in the Language Bank according to CLARIN and DARIAH standards. Curating material includes informing the depositors and helping them to enforce formatting standards and licensing conventions for language tools and materials. In addition, UHEL will further develop or acquire tools for processing and automatically annotating language-based materials to be provided through the Language Bank and CSC. UHEL will also have coordinating responsibility for information dissemination for using tools and materials as well as FIN-CLARIAH researcher training through online courses and websites.

CSC provides and is responsible for the technical infrastructure of FIN-CLARIAH as well as providing technological expertise for developing the services. CSC’s role includes installation, maintenance, development and tailoring of the language-based resource services available for the Finnish research community, as well as their technical user support. CSC offers the computational, storage and other platforms for the Language Bank, installs new versions of databases and software as they become available, and assists in tailoring and developing processing and annotation tools and researcher workflows. CSC is responsible for operating the authentication and authorization infrastructure, allocates user space and provides access. CSC participates in the integration and harmonization of CLARIN and DARIAH services. In particular, CSC provides services as a CLARIN B Service Centre. CSC provides training for its services and assists UHEL in providing training for the Language Bank services to the FIN-CLARIAH members.

Other FIN-CLARIAH members develop tools and workflows as well as curate materials to be tailored to CLARIN and DARIAH standards and centrally provided through the Language Bank and the CSC services. All FIN-CLARIAH members provide advice locally on how to use FIN-CLARIAH services in their research and education.

FIN-CLARIN: FIN-CLARIN is the national node of CLARIN ERIC. The FIN-CLARIN members form the Steering Group of FIN-CLARIN. Chair of the Steering Group is the FIN-CLARIN National Coordinator, Research Director, Dr Krister Lindén, UHEL. Vice Chair is Dr Pekka Lehtovuori, CSC. UHEL and CSC provide national and international coordination of FIN-CLARIN as well as national implementation through the Language Bank.

FIN-CLARIN provides a centralized Service Centre, the Language Bank, hosting collaboration platforms, e.g., Tieteen termipankki, and web services, e.g. Korp, The Mill, etc. The FIN-CLARIN Steering Group meets semi-annually for evaluating the deliverables, and the Language Bank project teams consisting of staff from UHEL and CSC meet on a monthly basis to coordinate the development. CLARIN ERIC has monthly meetings for reporting on the progress within the national consortia.

DARIAH-FI: DARIAH-FI aims to become the national node of DARIAH ERIC and is led by a Steering Group comprising a National Coordinator and consortium member representatives. The coordinator will manage the construction through biweekly video conference team meetings and will be located at UHEL. The DARIAH-FI Steering Group will meet quarterly and is composed of experts in DH, SDS as well as machine learning and computational methods, and will receive further expert support from an International Advisory Board, including DARIAH-EU representatives. Chair of the DARIAH-FI Steering Group is Prof Mikko Tolonen, UHEL, and Vice Chair Dr Aleksi Kallio, CSC.

DARIAH-FI is developed in collaboration between several of its member or partner organisations, ensuring bottom-up influence and accessibility. CSC will host the majority of the services, although some services hosted at NARC or NLF may not be moved for legal or contractual reasons.

4 Research infrastructure activities

A majority of the FIN-CLARIAH investments are related to licensing negotiations for tools and datasets, followed by the tool integration and adaptation for inclusion into the Language Bank and CSC Services, or dataset processing for inclusion in the Language Bank databases, which serve a dual purpose of providing content for research in SSH as well as material for various types of language research. Large datasets and general-purpose tools will be acquired pre-emptively to support top-level research activities in the field. Open source tools will primarily be acquired from the national or international infrastructure partners within the CLARIN and DARIAH networks but also from outside in cooperation with the FIN-CLARIAH members to ensure that new tools are suitable for their intended task, e.g. through adaptation to the Finnish language. FIN-CLARIAH also specifically adapts some of the tools to the needs of the top-level research units and the FIN-CLARIAH members. Innovative and high-quality research-based specialized tools and datasets will become available as research projects mature, at which point they can be integrated into the Language Bank and the CSC Services. Key resources not developed by any of the FIN-CLARIAH collaborators will need to be developed centrally within the Language Bank. The current publicly available web services and databases provided via the Language Bank and the FIN-CLARIAH members are listed on the web pages www.kielipankki.fi/tools/ and www.kielipankki.fi/corpora/.

4.1 Life cycle

The lifecycle of an RI is determined by its components, which age for three main reasons:

  1. updates of supporting hardware and operating systems
  2. advent of new technologies prompting the acquisition of upgraded tools and
  3. shifts in research focus of a majority of the users

The hardware updates are announced well in advance and are easy to foresee by adopting good software and database installation practices making reinstallation as smooth as possible. The advent of new technologies and new versions of software or databases is a regular occurrence, and upgrades need to be scheduled at regular intervals to keep the RI from degrading or becoming irrelevant. Shifts in research interests can be gleaned from day-to-day interaction with top-level research projects and responded to by trying out new tools and data collections at the request of top-level researchers or their representatives in the Steering Groups and Advisory Boards of FIN-CLARIAH, FIN-CLARIN and DARIAH-FI.

For SSH scholars, data collections are important vehicles for research, so a balance has to be found between providing research data for top-level special interest and mainstream research. Large data collections of general interest can often be used as reference material complementing more specialized material.

It is important to keep frequently used tools and data collections up-to-date, while still introducing prioritized tools and data collections with related training activities. Unused tools can be decommissioned and databases without recurring users put into long-term storage to be retrieved only for replicability and verifiability. The decision to decommission or update is relevant when a system update prompts reinstallation.

Currently, FIN-CLARIAH services are embodied in the Language Bank (https://www.kielipankki.fi/) with research and collaboration platforms like Tieteen termipankki – The Helsinki Term Bank for the Arts and Sciences (https://tieteentermipankki.fi/), a corpus content search engine like Korp (https://korp.csc.fi), an interactive data processing and visualization environment The Mill (https://www.kielipankki.fi/support/mylly/) as well as in the download service (https://www.kielipankki.fi/download). They have an increasing user community among top-level researchers asking for updates and additions to the data collections. The services will need to be tailored towards the additional user community targeted by DARIAH-FI and updated with new tools and resources for the existing FIN-CLARIN user base.

DARIAH-FI will seek funding to progress from its current planning phase to a fully operational RI within the first two years of development (2022-23). During the construction phase, it will recruit a National Coordinator and other necessary staff, develop KPIs and collaboration procedures within FIN-CLARIAH and, pending approval by the FIRI committee, apply for DARIAH-EU country membership. Concretely, it will integrate the existing best practices and partial infrastructures of its participants, identify gaps and develop solutions to close them. In its consolidation phase (2024-), DARIAH-FI will lobby for local investments to develop a national network of FIN-CLARIAH Open Science Hubs at the participating institutions. These will then take on maintenance of the mature parts of the infrastructure, while the core of DARIAH-FI focuses on engaging in further international development e.g. through the EU programme Horizon Europe. Throughout, RI development will be guided by continuous monitoring of KPIs, stakeholder needs and changes in the operating environment.

FIN-CLARIAH combines the mature research infrastructure of FIN-CLARIN and the emerging infrastructure of DARIAH-FI. This will enable the renewal of high-quality digital research in SSH in Finland by supporting the various SSH research communities in taking charge of their own data and tool development, ensuring that the focus remains on their actual knowledge interests.

Exit strategy: If the user-base erodes, i.e. the services offered by the RI are better provided in some other way according to top-level researchers, the RI should be decommissioned. The dismantling of the RI requires that datasets be transferred to long-term storage to be accessible for the verifiability of conducted research. Computing power and memory hardware can be reallocated and operational staff can be terminated or reassigned to other duties. No date has currently been set for when the RI is to conclude its activities as there is no indication that the RI is becoming obsolete in the foreseeable future.

4.2 Responsibility and sustainable development in RI activities

Good research practice: FIN-CLARIAH provides guidelines on how to collect data and how to document the process (in Finnish https://www.kielipankki.fi/tuki/keruuvaiheen-luvat/) to ensure reuse, openness and ethical processing of the data. For promoting good research practices and for ethically collecting and sharing research data, FIN-CLARIAH provides researcher training outlined in Section 1.3.

Good governance: Some of the datasets may contain personal or copyrighted data intended only for research purposes. Such materials cannot be published openly without explicit permission, and access may require that the user specifies the purpose of research and commits to appropriate safeguards for the data. An authorization workflow system called the Language Bank Rights has been developed by CSC for appropriately handling such requests for research data with restricted access. Even when reference data is available in FIN-CLARIAH, new research ideas may require that researchers collect additional copyrighted or personal data. To facilitate the deposition of such data in the RI for sharing, FIN-CLARIN currently provides guidance to its user base before the researchers start collecting new data, as this may call for careful ethical and practical considerations. This best practice can be extended to the whole field of SSH through DARIAH-FI.

Sustainability in the RI operations according to the UN goals: By offering online resources and collaboration websites FIN-CLARIAH enhances sustainability in promoting working methods free from the restrictions of place and time. A core value of FIN-CLARIAH is to promote Quality Education by offering access to online teaching programs on how to use tools and datasets for all researchers and, resources permitting, to citizen scientists as well. FIN-CLARIAH aims at promoting Gender Equality with a particular view to facilitating child rearing with equal time for maternity and paternity leave to level the impact during recruitment. It may also have implications for collecting SSH data representing all genders. Decent Work and Economic Growth will be supported by facilitating development of human-machine interaction through language-centric AI, e.g. allowing command of AI in repetitive or heavy lifting tasks. Investment in RI promotes Industry, Innovation, and Infrastructure development by enabling more efficient research in academia and industry alike when datasets can be reused and reference data is readily available so that only incremental data specific to a research problem needs to be collected. Reduced Inequalities are promoted by enabling equal access to service for minorities in Finland. The RI supports the development and documentation of the Sámi and Romani languages, the sign languages used in Finland, as well as technologies for picture description for the visually impaired and speech recognition for the hard of hearing. Peace, Justice and Strong Institutions are promoted through facilitated access to the Parliamentary records offering opportunities to keep the political parties accountable for promises and actions.

RI assessment of own carbon footprint: The Language Bank uses the CSC supercomputer LUMI, which will be located in Kajaani. CSC is committed to sustainable development. For example, the upcoming LUMI system will have a negative carbon footprint of -13500 t CO2eq/year. LUMI runs on 100% renewable electricity and its waste heat will cover 20% of Kajaani’s district heating capacity, which is currently generated using fossil fuels. For details, see https://www.csc.fi/web/atcsc/-/lumi-tulee-vuoden-paasta

In addition, FIN-CLARIAH will predominantly use email and video conference equipment for national and international collaboration. We aim to avoid unnecessary traveling to conferences in order to minimize the carbon footprint of the participants, while recognising that a limited amount of face-to-face meetings between individuals promotes understanding and smooth cooperation especially in the introductory phases. Going to large conferences in the fields of research, where many key persons are present for multiple purposes, is also an effective way of minimising overall traveling, while each video conference meeting corresponds to a quantifiable reduction in traveling.

4.3 Long-term perspective and dynamism

Services and users: The main users of FIN-CLARIAH are the SSH researchers. Their primary infrastructure need is for access to resources such as data and tools. Massive amounts of freely available data allow the researchers to spend more time on their research questions and to devote less time to collecting and preparing their own resources. However, it is not enough just to make data available. FIN-CLARIAH data needs to be in formats that allow access with the researchers’ preferred tools, and for this purpose FIN-CLARIAH will build APIs and data converters while extending and enriching its collections.

The FIN-CLARIAH domain within the area of SSH covers:

  1. any form of language-based or language-related data (in contrast to structured numerical data collected by questionnaires), e.g., text, speech, video, multisensory or multilingual data
  2. other forms of multimodal data on societal and cultural phenomena (derived from collections in galleries, libraries, museums, or archives), e.g. metadata collections, data networks, supporting data sets, statistical data, refined archival data
  3. benchmark data to promote method and software development
  4. acquisition of tools and refined data sets developed by researchers and research groups according to FIN-CLARIAH interoperability standards

The long-term vision is that FIN-CLARIAH will be the key IT infrastructure for the SSH area in Finland. The FIN-CLARIAH mission is to offer interoperable resources for the SSH area and to share best practices in order to develop norms and guidelines to achieve this. FIN-CLARIAH will thus support and advance multidisciplinary research co-operation, open collaboration and the renewal of SSH research culture.

The mission will be implemented through the FIN-CLARIAH strategy to:

  1. offer established interoperable digital tools and data sets on common openly available platforms
  2. develop norms and guidelines for how to make new resources accessible and interoperable
  3. assist the research community by acquiring prioritized big and accruing data sets
  4. engage in prioritized pilot projects and developer networks to integrate specialized resources to renew the infrastructure
  5. support open collaboration models by providing a shared development platform for researchers

A survey of digital practices in the Arts and Humanities in 2016 in Finland found that over 90% of the respondents considered improved access to existing digital resources important for their research; digitisation of resources and improved access to digital tools was important to over 80% of the respondents.[13] The survey indicated a clear demand for the support that FIN-CLARIN has been offering.

In a survey conducted at UHEL in autumn 2019 with 356 respondents among the PIs in SSH, researchers working in DH, SDS, audio-visual research or experimental research voiced many urgent needs that can be used to guide what parts of FIN-CLARIAH require upgrades or development. For instance, there is a need to manage and re-use large datasets originating from various sources, such as archives, registers and social media. Researchers require tools for harvesting online data, processing textual data, automatic speech recognition and for automating other mechanical steps in their data processing. The increasing size of datasets within SSH needs to be addressed with regard to long-term preservation.

The survey also highlighted that best practices for the legal and ethical aspects of data management in light of the new data protection legislation are called for. FIN-CLARIN is currently updating its general deposition agreement framework nationally and within CLARIN ERIC, which will have an impact on the communication of SSH research data between research groups in Finland. Technical solutions, e.g., remote desktop services provided by CSC are available for handling protective measures for personal data, but the solutions need to be integrated into the FIN-CLARIAH infrastructure along with dedicated training and information sharing activities.

Usage of the existing RI developed by FIN-CLARIN: The users of the infrastructure are defined in terms of academic researchers in various SSH fields in Finland and abroad, but FIN-CLARIAH also serves other fields of research. Top-level research groups served by the Language Bank are presented in Section 1.2. To monitor usage of publicly available resources, FIN-CLARIAH uses Google Analytics. For our publicly available tools, we also monitor downloads and sample requests.

Year  Total Users  Increase  Total Sessions  Increase
2016        5 471                     9 750
2017        7 784     +42 %          14 361     +47 %
2018       10 141     +30 %          17 121     +19 %
2019       14 368     +42 %          23 130     +35 %
Table 7: Research usage of the RI public resources in 2016-2019 (Google Analytics)

According to Google Analytics, the usage of the publicly available resources of the Language Bank has increased by approx. 35% annually with a total increase since 2015 of approx. 250% as shown in Table 7. In addition, there are around 400 registered users requiring access to restricted Language Bank resources, and Tieteen termipankki has approximately 900 registered expert volunteers updating it. In 2019, individual Language Bank services like Korp had 247 267 user requests, OPUS had 182 510 mainly international users, and Tieteen termipankki had 1.3 M user sessions, of which more than 30% were research-related, also serving important educational interests.
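
The year-over-year increases reported in Table 7 can be reproduced from the user and session counts. The following Python sketch is illustrative; the counts are copied from Table 7, and the helper name yearly_increase is an assumption introduced here.

```python
# Counts copied from Table 7 (Google Analytics, 2016-2019).
USERS = {2016: 5471, 2017: 7784, 2018: 10141, 2019: 14368}
SESSIONS = {2016: 9750, 2017: 14361, 2018: 17121, 2019: 23130}


def yearly_increase(series):
    """Return the percentage increase over the previous year, rounded to whole percent."""
    years = sorted(series)
    return {
        year: round((series[year] / series[previous] - 1) * 100)
        for previous, year in zip(years, years[1:])
    }


if __name__ == "__main__":
    for label, series in (("Users", USERS), ("Sessions", SESSIONS)):
        for year, pct in yearly_increase(series).items():
            print(f"{label} {year}: +{pct} %")
```

The first year in each series has no predecessor in the table and therefore no increase figure, which matches the empty cells for 2016.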

Hundreds of students and researchers attend the online courses, tutorials and presentations offered by the Language Bank each year, as shown in Table 8. The courses are already integrated in several study programs within the University of Helsinki, and the introductory course Corpus Linguistics and Statistical Methods regularly receives students from several universities nationally and internationally. The resulting growth in course attendance is already testing the limits of what the Language Bank can currently handle, despite primarily using online facilities. The Language Bank also frequently organizes roadshows at its member organizations. The national tour in 2017 accounts for the peak in presentation attendance. The Language Bank gives guest lectures at thematic research seminars arranged by FIN-CLARIN members to promote datasets and tools.

Year (events / people)        2013      2014      2015      2016      2017        2018
Courses                       2 / 16    3 / 69    4 / 104   5 / 184   4 / 213     6 / 330
Presentations and lectures    2 / 40    2 / 35    1 / 40    7 / 160   19 / 454    7 / 173
Table 8: Number of courses and presentations given by the Language Bank of Finland and the total number of participants involved in these activities.

The Language Bank handles about 500–1000 email messages annually sent by individuals to the helpdesk address, as seen in Table 9. The Language Bank can also be reached for support by phone or via social media channels such as Twitter and Facebook. The users often ask detailed questions about managing written or spoken data created in their research projects or about using various tools and services for their research. General information and instructions are available via the online portal of the Language Bank. As more questions have readily available answers in the portal, the number of email messages is decreasing.

                                                                              (annual figures)                    All years
Number of e-mail messages sent by individuals to the Language Bank helpdesk    699   778   701   1088   877   447    4590
Average number of messages / month                                             58    65    58    91     73    37     64
Number of unique senders                                                       132   163   208   127    200   136    878
Table 9: Number of messages sent by individuals to the FIN-CLARIN service address fin-clarin@helsinki.fi.

Extending the user base: The total number of students and researchers in SSH in Finland is approximately 40 000 (as shown in Table 3), so with the current number of annual users at approximately 14 000, there is potential for doubling the number of users of the Language Bank. The size of the user base will depend on the engagement of the SSH community, the quality of the services, the training provided and the users’ ability to interact with the data. FIN-CLARIAH aims to facilitate collaboration between computationally oriented research groups and more traditional SSH researchers. Quality control of the services offered and educational activities to engage potential users are important aspects of FIN-CLARIAH activities. Usage will increase as a new generation of researchers educated in digital methods adopts the RI as a critical research and collaboration tool.

FIN-CLARIN: The focus of the coordinating activities in FIN-CLARIN will be on making data available in well-known interoperable formats with CLARIN-standardized annotations. This will require active engagement with the growing user communities as well as dedicated interactive seminars for finding common solutions.

FIN-CLARIN will continue extending the FIN-CLARIN collections with data and tools requested or developed by the user communities. The most urgent need is to upgrade the processing capacity of the RI as we get approximately 35 new databases from individual researchers or research teams annually, but we are currently only able to process and integrate 25 on average. The need to streamline our processes pertains to all types of data. Another overarching theme is the processing of non-standard language as manifested in social media, historical newspapers and spoken language resources, which requires adapting tools for automatically annotating and enriching the data.
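
The gap between intake and processing capacity can be made concrete with a small calculation. The following is a minimal sketch in Python: the intake (35 databases per year) and current capacity (25 per year) are taken from the paragraph above, whereas the starting backlog and the upgraded capacity of 45 per year are purely hypothetical values chosen for illustration.

```python
def project_backlog(initial_backlog, intake_per_year, capacity_per_year, years):
    """Return the backlog of unprocessed databases at the end of each year.

    The backlog grows by the yearly intake, shrinks by the yearly processing
    capacity, and can never fall below zero.
    """
    backlog = initial_backlog
    history = []
    for _ in range(years):
        backlog = max(0, backlog + intake_per_year - capacity_per_year)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Current situation: 35 received, 25 processed per year.
    # A starting backlog of 0 is a hypothetical assumption.
    print(project_backlog(0, 35, 25, 5))   # [10, 20, 30, 40, 50]

    # Hypothetical upgrade to 45 per year, starting from a backlog of 50.
    print(project_backlog(50, 35, 45, 5))  # [40, 30, 20, 10, 0]
```

At the current rates the backlog grows by about 10 databases per year, so it can only shrink once the processing capacity exceeds the annual intake.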

The coordinating activities are seen from two different perspectives as detailed in Section 1.2. In the following work packages (WP), we separately describe for each WP the main impact of the foreseen upgrading interventions concerning data collections and tools requested by the FIN-CLARIN user communities. The WPs correspond to the two coordinating perspectives and are grouped into two modules: Module 1, Natural Language Processing, comprises WPs 1-3, which are oriented towards language data types, and Module 2, Language Research Infrastructure, comprises WPs 4-8, which support specific areas of the SSH field predominantly in need of language-based data.

Module 1: Natural Language Processing (NLP). This module aims to take care of all the basic language processing that is needed when a new resource is integrated into the infrastructure and made available through various distribution channels such as metadata servers, content search facilities and collaboration platforms.

WP1 Text processing and annotation environments: The foreseen impact is to improve our processes so that we make annotated data available faster than we get new data in order to catch up with the backlog of deposited resources. Another goal is to improve our search to support document reuse identification and background sentiment analysis. Due to non-standard language, the search is less effective in historical and social media resources but can be improved by recent neural network and machine learning methods. These methods apply to e.g. transliterated ancient cuneiform text on clay tablets as well as to modern social media forum discussions in Suomi24.

WP2 Speech processing and annotation: The foreseen impact is to provide automated speech recognition with an emphasis on recognizing and classifying everyday speech and dialectal variants, while supporting interactive editing and annotation of speech data. The ultimate goal is to provide access for researchers to all audio-visual data from the Parliament, the YLE Living Archive, the National Library, and NARC.

WP3 Video and picture processing and annotation: The foreseen impact is to provide automated description of videos, films and pictures in order to make them searchable based on their visual content. For the deaf, automated sign language recognition and synthesis is the ultimate goal. In addition, the visually impaired can use tools for video description of the visual content of television programs.

Module 2: Language Research Infrastructure (LRI). This module takes care of the specialized language processing needs in the fields of research supported by FIN-CLARIN.

WP4 Social Data Science: The foreseen impact for the scholars in the field is to create a research frontier by connecting scholars in language technology, data analysis and SSH. The goal is to document the cultural contexts of writing and reading blog discussions in order to understand their role in the Finnish and international context and in the daily life of their users, as well as to foster research by incorporating a data analysis toolbox for social scientists. Another goal is to provide visualization methods and tools, and to demonstrate their value.

WP5 Learners’ assessment environments: The foreseen impact is to provide automated annotation of language learners’ spoken and written performances. The ultimate goal is to develop and enhance the automated annotation of errors and other linguistic features in the spoken and written texts, which will allow automated assessment of learners’ skill level and, thus, make it possible to combine impartial computer-based assessment with human assessors’ judgments, for example, in the Matriculation Examination.

WP6 Translation and Interpretation: The foreseen impact is to provide infrastructure for translation and interpretation research both in machine translation and in translation studies. An important aspect of this is the search and retrieval of translation samples, i.e. bilingual samples in parallel corpora and monolingual samples in related corpora in different languages. Reasonably large samples of text in languages other than Finnish are a goal, but access to speech data in other languages is also necessary.

WP7 Lexicographical research: The foreseen impact is to provide high-quality dictionaries for special interests or less-resourced languages on the internet by supporting a platform for crowdsourcing. This can be realized by integrating language technology tools for mining the internet for neologisms and their contexts in order to provide a web-service environment for combining corpora and dictionaries. In addition, all digital corpora and dictionaries of Kotus will be made available online.

WP8 Terminology and Processing of Language for Special Needs: The foreseen impact is to provide infrastructure for the terminology work in Tieteen termipankki as well as to support development of tools for research in language for special needs and plain language. We can do this by displaying terminology in context and by providing other tools for finding prototypical or defining contexts for terms in scientific text corpora. In addition, we need to facilitate terminology documentation and development in Tieteen termipankki.

DARIAH-FI: The goal of DARIAH-FI is to build on the success of FIN-CLARIN by expanding the infrastructure to cover non-language material and by serving a broader range of needs across the social sciences and humanities. To accomplish this, instead of starting from a blank slate, DARIAH-FI seeks to unify, host and develop on a national level the partial best practices that already exist across research groups and organization-level infrastructure projects. Therefore, for the DARIAH-FI modules of 3) SSH Big Data, 4) Analytica, and 5) Information Interaction, some work is already in progress, while other activities are in the planning stage of development. The modules will be developed in collaboration between member and partner organisations: UHEL, UEF, JYU and NARC share responsibility for SSH Big Data; Aalto, CSC and UHEL for Analytica; and TAU, UTU, NLF, FSD and Oulu for Information Interaction. CSC will host the majority of the services, although some services hosted at NARC or NLF may not be moved for legal or contractual reasons. DARIAH-FI fosters a vision of a national collaborative network of Open Science Hubs that will coordinate efforts towards the openness of research practices, strengthen the open science aspects of existing infrastructures and help to develop and expand open collaboration models. One example of such open science hubs is HELDIG (UHEL), which facilitates the transformation of research culture towards data-intensive collaboration in SSH fields.

Module 3: SSH Big Data (BD). Researchers currently have access to some digitised datasets (e.g. NLF’s digitised historical newspapers as language resources available through FIN-CLARIN) and can search their content online or download them as data dumps. However, research across the fields of social science and the humanities will need both to look at these resources from new, non-linguistic perspectives and to supplement them with other kinds of data, ranging from library catalogues to societal registries, from social network data to historical correspondence network information, and from multimodal to sensor data. Currently, competences for collecting and analysing such data are scattered in individual research groups and universities. SSH Big Data will standardise efforts in data capture and dissemination and provide resources and incentives for collaboration. The module will work closely with Module 4: Analytica to create a “sandbox” environment for data-intensive collaboration.

WP 9 Multimodal historical data. Harmonization of metadata collections ranging from library catalogues to museum collections will let researchers study the materiality of different types of objects and model the development of different data types across time. Particular attention will be paid to access to general-purpose data sets with respective metadata that can support multiple research projects, such as standard geographic information, e.g. digital maps. Such supporting data sets need to be gathered only once, with possible later updates, and can thereafter serve the research community for a long time.

WP 10 Online media and social media data. Online media and social media data capture and analysis tools are to be modified and developed to enable research on local communities using local languages. Machine learning will facilitate the classification of large amounts of heterogeneous social media data.

WP 11 Monitoring human behaviour through sensory and movement capture. The goal of the WP is to facilitate nationally distributed efforts and multidisciplinary online collaboration between institutions and research teams conducting experimental research on human behaviour.

Module 4: Analytica. The Analytica module will develop the technical services needed to support data-intensive SSH research on the types of raw data sourced in Modules 1 and 3 above. Because they were not created for research, many datasets are severely biased and cannot be used for research as is. Tooling is needed to uncover and document this bias, and to enable researchers to filter trusted representative subsets from the larger whole. Similarly, tools are needed to close the gap between the crude signals in the data and the nuanced objects of interest for SSH research. To this end, the module will deploy, benchmark and further develop common tools and services enabling access to (WP 12), documentation, clean-up, refinement and enrichment (WP 13), and analysis of the data (WP 14).

Various tools have been developed by the consortium partners, including the COMHIS ecosystem (comhis.github.io) for historical metadata and text mining, the rOpenGov developer network (ropengov.github.io) for computational social science and open government data analytics, and the SAMPO ontology services. Social sciences at JYU also offer expertise on quantitative research methodology and data analysis. Analytica will build on these existing tools and partial workflows, and will integrate them as best-practice end-to-end workflows on top of CSC’s national infrastructure including e.g. the Language Bank. This will increase overall capability and rigour, as currently many Finnish projects need to build these workflows on their own, limiting them to targeting only some parts of the full process needed for trustworthy research. The module will also support and provide input to the customisation of CSC’s infrastructure to better match the needs of SSH workflows.

WP 12 Data access. Analytica facilitates data access from various original sources in researcher-friendly formats. Quality assurance and monitoring tools are provided to make the data access infrastructure reliable. Analytica also aims to operate between the data providers and CSC to aggregate various data types used in SSH research, and is further supported by the participating research institutions, which provide pre-processed data sets and methods as well as active user feedback.

WP 13 Data refinement. Automatic workflows supported in the CSC environment refine data, create indexes, uncover biases, train models and perform other operations to transform original data sets to provide value for research.

WP 14 Data analysis. Tailoring analysis tools to the needs of SSH research enables wider adoption of digital methods and renewal of SSH disciplines.

Module 5: Information Interaction (IIA). ‘Interaction’ has a two-fold meaning in this module. Firstly, it refers to the necessity of collecting information on how researchers interact with the RI in order to develop the tools and services accordingly (WP 15). Secondly, it refers to the need to offer education and consultation on how researchers can enhance their work by using the infrastructure, thus increasing the RI’s active user base (WP 16). Information Interaction collects RI performance data and designs and develops tools, protocols, services and learning environments to support RI development and to respond to the need for education and guidance on RI use.

WP 15 Evidence-based RI development. Keeping a close dialogue between users and developers of the infrastructure and collecting user data logs enables continuous monitoring of RI performance and evidence-based development of tools and protocols. A careful analysis reveals critical points for the development of the research infrastructure and highlights educational needs.

WP 16 Education and dissemination. Researcher education and guidelines for best practices help to grow and support the FIN-CLARIAH user base and promote Open Science in data-intensive SSH research. As an essential premise for data-intensive SSH research, FIN-CLARIAH will develop a standardised ethical guide and an up-to-date procedure on legal issues on the maintenance, access and dissemination of in- and out-of-copyright data, including digitised cultural heritage materials as well as born-digital, potentially sensitive content.

Open access policies: FIN-CLARIAH is committed to providing the widest possible access to all its resources. To promote open access, it is important that funding agencies require that research data be made as openly accessible as possible through an RI when granting research project funding. FIN-CLARIAH provides a platform (https://kielipankki.fi/download), where individual teams can distribute moderately-sized data and tools. After logging in to the CSC computing environment, SSH researchers find the FIN-CLARIAH data readily available in the Language Bank directory and the FIN-CLARIAH tools preinstalled.

Where possible, the already existing data distribution infrastructure will be utilised. The data is mainly offered for academic research, but some datasets are available for commercial purposes as well. Each dataset is associated with a license that specifies user rights. The licenses vary by dataset since the data originates from various sources, and the licenses are not always modifiable. Open licenses such as CC-BY or MIT will be preferred whenever possible. For user access guidelines, see https://www.kielipankki.fi/support/.

FIN-CLARIN has created a contractual framework for CLARIN for opening up a variety of language-based materials including multimedia with speech and video clips in addition to text samples, allowing researchers free access to materials used as statistics, excerpts, or full documents. FIN-CLARIN also offers dataset reference instructions as an incentive for depositing resources as openly as possible in the Language Bank, which will be extended to the larger SSH field through FIN-CLARIAH. For more policy details, see the Appendix “Data Management Plan”. FIN-CLARIAH members also contribute their software and datasets via international open research software hubs such as rOpenSci and Zenodo.

Many resources are already freely and openly available, but some have copyright restrictions or restrictions due to personal data protection. The restricted materials can be used as statistics and in short excerpts, and through the FIN-CLARIN agreement with Kopiosto, many copyrighted works can also be downloaded and shared. The new EU directive on Text and Data Mining will offer additional possibilities when implemented in Finland in 2021.

Finnish users have access to language-based materials in other CLARIN Centres, and foreign users can access materials in the Finnish CLARIN Centre. On the European level, CLARIN and DARIAH are involved in EOSC. Information about FIN-CLARIAH metadata is openly available on the metashare.csc.fi website and through vlo.clarin.eu. In general, there will be no charge on RI or data use for scientific purposes. However, the READ/Transkribus extension has an annual fee for institutions such as universities, and other software or datasets may also require license fees.

5 Digital platforms and data

5.1 Data management policy

FIN-CLARIAH is a data-centric service cluster that enables researchers to process, analyse, and share digital resources. As an RI, it differs from standard data repositories and providers, such as the FSD and NLF, which focus on data management but offer few tools for in-depth analysis or research collaboration. The RI is designed to support researcher-driven development, benchmarking, and deployment. The services will be implemented on CSC’s national service for data management and computing. The RI’s data management policy is described in the separate Appendix “Data Management Plan”.

5.2 Digitalisation and data intensity

Research data in SSH is produced in heterogeneous research groups, which increases the need for a centrally coordinated infrastructure, including shared practices and support for finding, accessing and sharing data in a streamlined fashion. The digitalisation of research and increasing data intensity were the driving forces behind establishing the Language Bank already in 1996. This was later formalized nationally with the first Letter of Intent to build a FIN-CLARIN consortium in 2006. With continuing digitalisation, we look forward to extending the services through FIN-CLARIAH to SSH big data along the lines above.

6 Risk management plan

Most of the activities in FIN-CLARIAH relate to coordinating the acquisition of tools and datasets and making them available for the FIN-CLARIAH research communities. This requires negotiations with rights-holders, which is best done centrally to provide access for everyone on similar conditions and to save time for research at the FIN-CLARIAH member organizations and other user groups. Sometimes the process can be lengthy, taking years to secure modest- or no-cost solutions that eventually give access for research purposes only. To mitigate the risk of failure, several negotiations on both similar and different resources need to be on-going at the same time. The new TDM directive will over time foster a more positive attitude towards research use among rights-holders, provided adequate protective measures are applied and shown to be effective.

Datasets can be processed only after they have become accessible. To mitigate potential delays in making datasets available to researchers, even minor processing is acceptable initially. The annotation of the datasets can be upgraded or further processed when appropriate tools become available.

Tools need to be adapted to the FIN-CLARIAH technical environment and made available through standardized interfaces as web services. Generic tools may need to be adapted to a specific language as well as tailored to provide output in CLARIN and DARIAH-approved standard form. To mitigate the risk of not having a tool operational as soon as needed, the level of standardization of the input and output can be upgraded with time. For critical tools, developing an open source tool with the essential features within FIN-CLARIAH is an option if no open source tool is readily available.

A dynamically developing infrastructure can become complex and challenging to maintain in the long term, so specific attention needs to be paid to support and maintenance of critical resources. Collecting user data logs enables continuous monitoring of RI performance and evidence-based development of tools and protocols. A careful analysis reveals critical points for the development of the research infrastructure and highlights educational needs. Internal evaluation processes and researcher-driven procedures for developing the various components of FIN-CLARIAH ensure that the infrastructure evolves to address the changing needs of researchers.

DARIAH-FI will be developed as part of the European DARIAH ERIC infrastructure and will seek further funding from national research foundations as well as international sources, including the forthcoming EU Framework Programmes. In the event that DARIAH-FI fails to secure FIRI funding to kick-start construction, the consortium will continue to build collaboration as a network of local infrastructure projects and seek external funding through e.g. Nordic collaboration.

As FIN-CLARIAH aims to adopt data and software developed by a number of already established local infrastructures, obtaining sufficient coherence and synergies requires active synchronisation between projects and platforms to achieve interoperability. To mitigate the risk, the development efforts will be coordinated through biweekly meetings of the National Coordinators with the leaders of on-going Work Packages, as well as biweekly meetings of the RI Directors and the National Coordinators. The National Coordinators will also participate in monthly meetings with CLARIN ERIC and DARIAH ERIC. In the semi-annual meetings of the Steering Groups the outcomes of related national and international projects will be monitored and discussed.


[1] http://roadmap2018.esfri.eu/media/1066/esfri-roadmap-2018.pdf (p. 108)

[2] The Juuli portal (https://juuli.fi/) is maintained by the Ministry of Education and Culture.

[3] https://www.oulu.fi/suomenkieli/node/55261

[4] https://www.clarin.eu/content/knowledge-sharing and https://teach.dariah.eu, https://campus.dariah.eu

[5] http://www.meta-net.eu/whitepapers/volumes/finnish

[6] Web Accessibility Directive (EU) 2016/2102, https://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016L2102

[7] 1) http://kirjasampo.fi/ 2) http://sotasampo.fi/ 3) http://biografiasampo.fi/ 4) http://nimisampo.fi/ 5) http://finto.fi/en/

[8] https://www.clarin.eu/content/vision-and-strategy

[9] https://www.kielipankki.fi/organisaatio/fin-clarin/konsortiosopimus/

[10] https://www.helsinki.fi/en/university/strategic-plan-2021-2030

[11] http://www.helsinki.fi/finclarin/konsortio/FIN-CLARIN-panostus-yhteenveto.pdf

[12] https://www.hbl.fi/artikel/utredning-stiftelserna-delar-ut-nastan-halv-miljard-euro-medicin-far-mest/

[13] https://www.helsinki.fi/sites/default/files/atoms/files/dariah_web_survey_chapter_finland.pdf (Matres 2016)
