If personal data are processed in your research project and the processing involves high risks, you are required by data processing regulations to carry out a data protection impact assessment (DPIA) before starting to process the personal data. The higher the risks involved in the processing, the more carefully you need to protect the data. Consider which protection measures and methods you can use to minimize or eliminate the risks.
The list of questions on this page is intended to help you plan your research project. You can use the questions to make a preliminary assessment of the risks that may be involved in the processing of personal data in your research. A data protection impact assessment is likely to be required if you answer ”yes” to more than one of the ten questions. Please note that the interpretations of the questions may vary in practice, and the individual criteria mentioned under each question are suggestions only.
When processing personal data, you should primarily follow the instructions given by the data controller. Therefore, you must always check with your home organization whether and how you are required to carry out the data protection impact assessment.
Further information on the data protection impact assessment is available on the website of the Office of the Data Protection Ombudsman.
Processing can be considered as large-scale processing if, for example:
Sensitive or highly personal data includes:
Last updated 6.9.2021
Research participants should be given sufficient details regarding the study for which personal data are to be collected. It is recommended that the following brochure be used as a supplement to the rest of the information that is provided to the research participants. The brochure includes basic information on the Language Bank of Finland and on the process of storing research materials for further use in the long term.
Last updated 10.5.2022
URN for this page: http://urn.fi/urn:nbn:fi:lb-2020081522
Always comply with these guidelines when processing corpora obtained from the Language Bank of Finland that contain personal data.
Corpora stored in the Language Bank of Finland that contain personal data have the following label in their licence:
PRIV: There are personal data in the resource.
The licence details of individual corpora can be found in the corpora listing of the Language Bank of Finland next to the corpus in question as well as in its metadata, which can be accessed using the persistent identifier assigned to the corpus (i.e., the URN address included in the citation instructions).
All corpora labelled PRIV contain a separate description of the resource-specific data protection terms and conditions, including the following details:
The creation of resource-specific data protection pages is currently in progress. In case you discover that a separate description of the data protection terms and conditions for a specific corpus is not yet available and you cannot find corresponding information in the metadata of the resource, please request clarification from the FIN-CLARIN service address: fin-clarin(at)helsinki.fi.
When using a PRIV-labelled corpus, you undertake to process the personal data included in it confidentially, carefully and solely for the purpose for which you were granted access to the corpus. Further restrictions are described in the resource-specific data protection terms and conditions that are published along with the corpus-specific license.
When processing corpora that contain personal data, please apply sufficient protective measures in accordance with the instructions provided by your own organisation. Special care is needed when processing corpora that contain sensitive personal data (also known as special categories of personal data).
When starting to process a corpus obtained through the Language Bank of Finland that contains personal data for the purposes of new research or another purpose, you and/or your home organisation assume the role of data controller for the corpus. Among other responsibilities, the controller is obliged to demonstrate the lawfulness of the processing of personal data, when necessary.
The instructions provided by your own organisation must be observed in the first instance when processing personal data. If instructions provided by your home organisation are unavailable, you can familiarise yourself, for example, with the Data Management Guidelines published by the Finnish Social Science Data Archive when planning the processing.
As the controller, you must usually draw up a privacy notice on the processing of personal data. Comply with the instructions provided by your own organisation in this instance as well. When drawing up a privacy notice, you can utilise the privacy notice associated with the original corpus, or the description of the personal data included in it.
When starting to use a corpus stored in the Language Bank of Finland that contains personal data, first publish the privacy notice pertaining to your purpose of processing, for example, on a website provided by your organisation. You can share a short title of your project that is understandable to the general public as well as a link to the openly available privacy notice by using this form. We publish this information on the Language Bank of Finland website to make it available to anyone interested in the purposes for which the corpus is used.
Comply with the guidelines of your own organisation. When necessary, you can view examples of protective measures employed by the Language Bank of Finland and other potential measures which you may need when processing personal data.
Personal data must also be processed responsibly and in compliance with good ethics when creating scientific publications and presentations based on corpora.
When reporting on the results of scientific research, personal data must be, as a rule, removed or redacted, for example, by pseudonymisation and by classifying data subjects’ age, domicile and other details into more extensive categories so that study participants cannot be identified on the basis of such details or by combining them with other data.
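The two techniques mentioned above, pseudonymisation and classifying details into broader categories, can be sketched roughly as follows. This is only an illustration: the function names, the ten-year band width and the "S1", "S2" label scheme are assumptions made for the example, not requirements of the Language Bank of Finland. Whether a given level of generalisation actually prevents identification has to be judged for each dataset.

```python
from typing import Dict, Iterable


def age_band(age: int, width: int = 10) -> str:
    """Replace an exact age with a broader band, e.g. 34 -> "30-39".

    The band width of 10 years is an assumed default.
    """
    if age < 0 or width <= 0:
        raise ValueError("age must be non-negative and width positive")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def pseudonymise(identifiers: Iterable[str]) -> Dict[str, str]:
    """Map each distinct identifier to a neutral label (S1, S2, ...).

    Labels are assigned in order of first appearance. The mapping itself
    is personal data and must be stored separately and protected.
    """
    mapping: Dict[str, str] = {}
    for identifier in identifiers:
        if identifier not in mapping:
            mapping[identifier] = f"S{len(mapping) + 1}"
    return mapping


if __name__ == "__main__":
    print(age_band(34))
    print(pseudonymise(["anna", "ben", "anna"]))
```

Note that details generalised in this way can still identify a person when combined with other data, so the result should be reviewed against the whole sample before publication.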
In certain cases, presenting scientific research results requires the presentation of data that contain personal data. For example, it may be necessary to link short individual samples from the corpus to a scientific article, or a specific section must be presented in connection with a conference presentation. However, carefully consider the potential impact on and risk to the study subjects, their family members or others close to them associated with publishing or presenting samples that contain personal data. The scope of the samples intended for publication must not exceed the scientific purposes, and all unnecessary personal data must be removed or pseudonymised from the samples using appropriate means.
Please also note that if the study subjects have been, for some reason, clearly informed that no personal data associated with them will be published, and the sample to be published cannot be fully anonymised, a separate consent for publishing the sample must be requested from the subjects.
If a PRIV-labelled corpus, which requires access rights, is to be processed for more than one purpose – for example, if at a later date there is a wish to carry out a new study not directly connected to the previous topic – access rights must be applied for from the Language Bank of Finland separately for each purpose. Naturally, all grounds for the processing must be stated in the privacy notice(s).
If you come across personal data which you believe should not be included in a corpus based on its description, please report the matter immediately to the Language Bank of Finland and/or directly to the controller of the data. This also applies to instances where you suspect that personal data have, for some reason, fallen into the wrong hands.
Last updated 30.8.2021
It is important to cite language resources in a coherent way. This will enable other researchers to replicate your research, and the authors or developers of the resource can receive credit for their work.
When you use a language resource (a corpus or a tool) that is available via the Language Bank of Finland, please adhere to the citation instructions provided by the Language Bank. This way, you provide an accurate reference to the exact version of the resource. In the Language Bank of Finland, every resource version has a unique persistent identifier that is always included in the reference. The identifier ensures that the resource can be accessed and the study can be replicated in the future even if the location of the resource changes.
The license conditions of many corpora and tools require the users to provide a reference to the resource in question. In this case, the license terms will usually mention the BY condition (Attribution; Nimeä in Finnish). A reference is systematically required for all language resources that are licensed for academic use (CLARIN ACA) or for individual use (CLARIN RES). Even openly licensed language resources may require appropriate citation (e.g., Creative Commons Attribution and other open licenses).
By providing a reference to the Language Bank of Finland and to its language resources, you can help FIN-CLARIN keep track of the usage of its corpora and services and maintain the Language Bank of Finland.
Reference instructions for individual corpus versions or variants can be found by clicking the quotation mark icon on the Corpora list of the Language Bank of Finland.
The reference instructions are also mentioned in the metadata of each language resource. The metadata of the corpora that are available via the Language Bank of Finland are stored and distributed on the META-SHARE service. The metadata record of a specific language resource can always be accessed with the persistent identifier that is included in the citation instructions, or by clicking on the corpus title on the corpus list of the Language Bank. In the metadata record, the link to the reference instructions can usually be found in the Documentation section. In some cases, the citation instructions are directly available in the Attribution Details field. The metadata record also provides details on the corpus-specific license.
For corpus versions that are offered via the Korp concordancing service, the link to the citation instructions is available in the corpus information frame that pops up when the mouse cursor is moved over a corpus title in the corpus selection menu, as well as under the corpus details in the information column on the right when an individual search result is selected in the concordance view.
In case the resource is available via the download service of the Language Bank of Finland, it includes a file called README containing the persistent identifier of that particular resource version.
As an example, here are the reference instructions to the language resource titled Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2:
University of Helsinki (2017). Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2 [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2017091901
Note that the exact formatting practices of data references may vary in different publications. In any case, it is best to include the details that are given in the citation instructions provided by the Language Bank of Finland. When you are writing scientific journal articles or producing other research output, you may need to check the publication-specific instructions to see whether it is customary to include data sources in the bibliography or to create a separate list for them.
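The pattern of the example reference above can be assembled from a handful of metadata fields. The sketch below is illustrative only: the class and field names are assumptions, and the layout simply mirrors the single example given here. Always check the citation instructions of the resource itself and the style of the target publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceCitation:
    """Fields visible in the example reference (names are hypothetical)."""
    author: str
    year: int
    title: str
    resource_type: str
    publisher: str
    pid_url: str  # persistent identifier in resolvable form


def format_citation(c: ResourceCitation) -> str:
    """Render the fields in the layout used by the example reference."""
    return (
        f"{c.author} ({c.year}). {c.title} [{c.resource_type}]. "
        f"{c.publisher}. Retrieved from {c.pid_url}"
    )


example = ResourceCitation(
    author="University of Helsinki",
    year=2017,
    title=(
        "Corpus of Finnish Magazines and Newspapers "
        "from the 1990s and 2000s, Version 2"
    ),
    resource_type="text corpus",
    publisher="Kielipankki",
    pid_url="http://urn.fi/urn:nbn:fi:lb-2017091901",
)

if __name__ == "__main__":
    print(format_citation(example))
```

Keeping the persistent identifier as its own field makes it harder to drop it by accident when the reference is reformatted for another citation style.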
In case you wish to refer to the Language Bank of Finland as a collection of services, please use the web address www.kielipankki.fi.
A presentation of the FIN-CLARIN consortium on the web portal of the Language Bank of Finland: http://urn.fi/urn:nbn:fi:lb-2014120212
The general reference instructions of CLARIN ERIC and CLARIN services can be found under CLARIN Frequently Asked Questions.
When you deposit a corpus or a tool to be distributed via the Language Bank of Finland maintained by FIN-CLARIN, your work will gain more visibility and your resource will be available for users. Many Finnish research funding organizations recommend that all research data containing language be deposited with the Language Bank of Finland.
If a corpus or tool is readily available, it will be used and cited more often. A unique, persistent identifier and citation instructions are assigned to each resource that is distributed via the Language Bank of Finland. This makes it easy for you and others to refer to your resource in publications. The language resources you deposited can also be included in your CV.
In some cases, it is not possible to make a resource openly available. The terms and conditions for distributing your resource will be agreed by us with you. If necessary, it is possible to restrict access to the resource for identified users only, or to individual users who are granted access based on a research plan they presented. In the latter case, access rights can be managed conveniently in our online service called Language Bank Rights.
University of Helsinki Data Support will hold an event focusing on the opening of research data. The event ”What it takes: Open your research data” takes place on 26 March 2020 at 13–15 at Think Corner Stage (Yliopistonkatu 4).
This online course can support you in managing, annotating and analyzing your language material when you start working on your MA thesis or PhD project. A similar course has been previously offered by FIN-CLARIN under the title Corpus Clinic. Please note that the number of participants is restricted.
FIN-CLARIN is a distributed infrastructure. The FIN-CLARIN partners handle production of language technology and resources relatively independently. As an infrastructure, FIN-CLARIN has three primary strategies to promote good data management: training, support, and infrastructure design. FIN-CLARIN as a consortium gives guidance and requirements for publishing datasets and open access publishing.
The data and technology produced by FIN-CLARIN can be divided into two parts:
The researchers and data owners using FIN-CLARIN have ultimate responsibility for the type 1 data management. However, FIN-CLARIN shares technical expertise and good scientific practices. The DMP allows FIN-CLARIN to reach a significant number of users within Finland. To this end, FIN-CLARIN makes information on data management a key part of its activities.
To facilitate data management, FIN-CLARIN requires partners to follow open access policies. It provides this DMP, which sets out general principles within the infrastructure, as well as detailed Data Management guidelines that take into account the specific policies and environments of each partner and give partner-specific and discipline-specific advice.
In this section, we describe our support for managing the tools and data from research projects using the FIN-CLARIN infrastructure.
The FIN-CLARIN partners already have individual data management practices. Through the collaboration in FIN-CLARIN, the lessons from these programs can be spread among institutions in the same way as FIN-CLARIN supports knowledge transfer.
As FIN-CLARIN partners already have their own data management support and policies, they will be adhered to. These are published on the web and listed in Table 1. All of them take into account data management throughout the data lifecycle, from planning to archiving and reuse.
Furthermore, each partner will name a data management contact person, who is in charge of enforcing this policy at his or her location. They will provide partner-specific user support and training.
FIN-CLARIN is an infrastructure, and thus day-to-day data management must ultimately be done by end-users. In order to ensure that users follow the DMPol, we recommend initial training in data management. For projects applying for restricted resources, we require an initial data management plan before access is granted. This ensures that users consider data management as part of their research.
Primarily, we recommend that users follow the existing partner data management guidelines with respect to openness and dissemination. When possible, we extend and improve these guidelines with focus on scientific research and seek to resolve any potential conflicts between policies.
The consortium expects compliance with the data management guidelines, and all partner sites provide support to their users of the infrastructure. We continually update the guidelines with current best practices and the latest recommended services.
FIN-CLARIN as an infrastructure covers the data mid-life cycle, i.e. the actual storage and computation. However, the FIN-CLARIN support services cover all stages of the data lifecycle. FIN-CLARIN is committed to open science principles and open publishing. The data guidelines further explain the recommendations of implementing this and how to leverage existing services.
FIN-CLARIN provides a landing page on data management with information on data management specific to language technology and resources. It contains both new information and links to CLARIN ERIC as well as national and partner-specific information, including local contacts. This information is available for central use as well as for partner-specific documentation.
The consortium publishes the data management policy and practical data guidelines on its website, where it also collates advice on data management practices.
FIN-CLARIN recommends open science principles and open access publishing. For implementing this, FIN-CLARIN recommends using suitable existing services such as DMPTuuli for project data management planning, the upcoming national digital preservation service portfolio for research, as well as international services such as Zenodo and EUDAT. See Table 2.
All relevant publications must be reported according to each organization’s guidelines, ensuring they are sent to the national VIRTA publication reporting system. In general, this is done through the university reporting systems, which are also used for our internal reporting and acknowledgement of the infrastructure. All publications produced using the FIN-CLARIN infrastructure should include a reference to the technologies and resources provided by the infrastructure. The references to be used are the persistent identifiers given by the infrastructure through the reference service, e.g. Digital Morphology Archives.
Research data that is prepared for sharing has to be stored either in the IDA service, or in a similar organizational/national/international archive. When choosing a storage and sharing service, the user must consider legal and ethical issues and ensure that the stability and availability of the chosen service are suitable for long-term storage. The service must also give a persistent identifier for the datasets so that there is a way to refer to the resource. We recommend that all datasets are stored in a service from which they can be shared and that they are licensed so that others can use them (e.g. Creative Commons licenses).
Licenses must be chosen with the legal and ethical aspects of the data in mind. Software and databases have their own license recommendations. For open datasets, we recommend Creative Commons BY 4.0 and for open metadata CC0. The former requires attribution to the original creator, and the latter waives all rights, ensuring maximum visibility for metadata.
All tools and datasets must be described in META-SHARE or Virtual Language Observatory. If a resource is described in another service, there must be a reference to either of these descriptions (for instance with a persistent identifier). The description of a dataset has to include administrative, technical and descriptive metadata according to current standards. The goal is to ensure good discoverability of the resource and adequate levels of information for others to evaluate the possibility for reuse. In addition to the metadata description, a link to the resource and possible license information has to be included. All resources produced using FIN-CLARIN should include a reference to the infrastructure service.
FIN-CLARIN provides a variety of tools for integration with the data life cycle.
Each partner provides core day-to-day data storage for research activities. In general, this storage space is large and fast, but not backed up. A smaller home directory space is provided for backed up code and critical configurations. For large data storage, the partner storage locations must be used for back-up unless a separate agreement is made with CSC. Each partner offers integration with its local resources. Data can be stored both in individual user folders, or in group folders for collaboration. In all cases, data is protected by file system permissions.
FIN-CLARIN has installed commands in its computing environment, which allow direct access to the IDA storage service. This allows direct staging to and from long-term storage. The ePouta cloud service provides several tiers of data storage: default, non-backed up disks, high-performance IO storage, and normal backed up storage.
By far, the hardest data management problems are on the end-user side, where we assist through our consortium training processes. Support is provided both nationally and locally, with the consortium serving as a conduit for best practices to be shared. Local support staff is able to provide the most useful support. The goal is to nationalize local best practices as well as promote CLARIN ERIC standards.
As part of its data management activities, FIN-CLARIN organizes data management planning events and roadshows for all partners. These provide targeted training and support so that the partners can better support their researchers. The consortium makes use of the Open science training material as well as CSC’s existing training framework, including training on data science and data management.
In this section, we outline the data produced by FIN-CLARIN in the daily operation of its services.
FIN-CLARIN internal data consists primarily of status, usage, and job statistics, which are mainly useful for reporting on and developing FIN-CLARIN services. Data is collected automatically by FIN-CLARIN services as a normal part of the usage of the FIN-CLARIN platforms and services. For example, the search tools contain records of all executed searches. This provides data automatically in a structured and interoperable form. All software and automated configurations are considered data.
Since the FIN-CLARIN centralized service setup is automatic, the software stack collecting the data is known and reproducible. Because all data comes from standard open source systems, documentation and structuring is automatic. We will prefer the standard forms from these systems when releasing the data, and defer most documentation to the authoritative upstream sources by linking. Data quality matches that of the CLARIN infrastructure: the data documents the actual performance of the systems.
FIN-CLARIN operational data is backed up as a part of normal operations of systems of this scale. The total size of the data is small relative to the capacity of the FIN-CLARIN systems.
The relevant operational data is non-personal and FIN-CLARIN can release it independently. Usage data may be released only in a sufficiently anonymous and aggregated form. Partners, in conjunction with guidelines produced by FIN-CLARIN, will conduct anonymization.
The infrastructure data is reported as part of the annual reporting. Summaries are also included in FIN-CLARIN and partner reports.
Software and other code is made available from the CSC organization account under appropriate free software and open source licenses (e.g. MIT, GPLv3+).
A copy of the deposited item is placed in the backed-up long-term preservation system of the repository. The item is read from the storage from time to time to ensure that the deposited item is still accessible and readable with existing software. In case of difficulties, a recovery procedure is invoked.
The course is held in English from 9.11.2018 to 26.04.2019. The registration deadline has been extended to 23 November 2018; until then, you can join the course area on Moodle. Welcome aboard!
A language resource consists of three parts at the minimum:
In addition, a language resource may have its own license page and instructions, if needed. In case several members of a single language resource family share license terms, only one license information document is produced. Language resource specific instruction pages describe only such specific features related to the said resource’s usage that have not been covered in the applicable tool’s or another application’s general instructions.
All parts of a language resource are referred to using persistent identifiers (PID). The Language Bank of Finland uses both the URN and Handle systems. Of these two, URN is more common in the Nordic countries and Handle is more widespread globally. At the Language Bank, URNs and Handles have a 1:1 mapping, e.g. hdl:11113/lb-201710212 and urn:nbn:fi:lb-201710212 point to the same page.
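The 1:1 mapping can be illustrated with a small conversion sketch. It assumes that the mapping is a plain prefix swap, with the prefixes hdl:11113/ and urn:nbn:fi: taken from the single example above; the function names are hypothetical, and the http://urn.fi/ resolver address is the one used in the citation examples on these pages.

```python
URN_PREFIX = "urn:nbn:fi:"     # taken from the example identifier
HANDLE_PREFIX = "hdl:11113/"   # taken from the example identifier
URN_RESOLVER = "http://urn.fi/"


def urn_to_handle(urn: str) -> str:
    """Swap the URN prefix for the Handle prefix (assumed prefix swap)."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a urn:nbn:fi identifier: {urn!r}")
    return HANDLE_PREFIX + urn[len(URN_PREFIX):]


def handle_to_urn(handle: str) -> str:
    """Swap the Handle prefix for the URN prefix (assumed prefix swap)."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a hdl:11113 identifier: {handle!r}")
    return URN_PREFIX + handle[len(HANDLE_PREFIX):]


def urn_to_url(urn: str) -> str:
    """Build the resolvable web address for a URN."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a urn:nbn:fi identifier: {urn!r}")
    return URN_RESOLVER + urn


if __name__ == "__main__":
    print(urn_to_handle("urn:nbn:fi:lb-201710212"))
    print(handle_to_urn("hdl:11113/lb-201710212"))
    print(urn_to_url("urn:nbn:fi:lb-201710212"))
```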
A persistent identifier in the Language Bank means that the user can rely on the information referred to by the identifier to remain accessible, even if the language resource’s location changes. The new location is accessible either directly (the identifier points directly to the new location) or indirectly (the identifier points at a page with information about the location of the old version and how to continue using it as well as how to access the new version).
Persistent identifiers have two main functions:
A language resource may have several different variants (i.e. versions) that form a language resource family.
Examples of language resource families:
In all aforementioned cases, it is important that the language resource’s user be able to unambiguously refer to the applicable resource at present as well as in the future. This is why each version always has its own abbreviation, metadata page and location. On the other hand, a language resource family may share a license or instruction page.
To see how the Language Bank fares in relation to RDA recommendations, see the commented RDA Data Versioning Working Group report.
A new version of a corpus is generated when the corpus’s content changes significantly. What constitutes a significant change is defined individually for each corpus. If the corpus description does not specify otherwise, such changes that may substantially affect research results or that are not easily reversible are considered significant. All non-significant changes are recorded in the change log in the corpus’s metadata.
Examples of non-significant changes:
If a new version of a corpus is generated, its relation to the previous versions is recorded in META-SHARE. The new version receives a new PID and a new META-SHARE record. In the META-SHARE record, the new and old versions are linked with the IsNewVersionOf and IsPreviousVersionOf relations (see below).
In case the previous version is no longer relevant to research, the new version replaces it in the Language Bank’s corpus list. The kielipankki.fi/<abbreviation> links also always point at the most recent versions. However, PIDs are always preserved. They point at either the old version or relevant information (”tombstone page”) about how to obtain it or how queries executed in the old version can be reproduced in the new version.
Suomi24: The corpus is updated biannually. The versions’ abbreviations follow the format Suomi24-<year><year half>, e.g. Suomi24-2016H1. Newer versions always contain the previous versions, and queries can be reproduced by defining the period accordingly.
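The Suomi24 abbreviation scheme can be parsed and compared with a short sketch. It assumes a four-digit year followed by H1 or H2; only H1 appears in the example above, so H2 is an assumption, as are all the names used here.

```python
import re
from typing import NamedTuple


class Suomi24Version(NamedTuple):
    """A biannual version; tuples compare by year first, then half."""
    year: int
    half: int  # 1 or 2 (H2 is assumed by analogy with H1)


_PATTERN = re.compile(r"^Suomi24-(\d{4})H([12])$")


def parse_suomi24_version(abbreviation: str) -> Suomi24Version:
    """Parse an abbreviation such as "Suomi24-2016H1"."""
    match = _PATTERN.match(abbreviation)
    if match is None:
        raise ValueError(f"unexpected abbreviation: {abbreviation!r}")
    return Suomi24Version(int(match.group(1)), int(match.group(2)))


def contains(newer: str, older: str) -> bool:
    """True if `newer` is the same version as `older` or a later one.

    Relies on the statement that newer versions always contain the
    previous versions.
    """
    return parse_suomi24_version(newer) >= parse_suomi24_version(older)


if __name__ == "__main__":
    print(parse_suomi24_version("Suomi24-2016H1"))
    print(contains("Suomi24-2017H2", "Suomi24-2016H1"))
```

A query made against an older version could then be reproduced in any version for which `contains` is true by restricting the search period accordingly.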
New corpora receive new version numbers, e.g. helpuhe-v2. META-SHARE contains a description of the difference between the new and the old version. The old version is archived if need be, and PIDs point at a ”tombstone page”.
The Language Bank does not delete the deposited language resources without their owner’s consent.
Two versions or variations of a language resource, e.g. a corpus packaged in different ways. Downloadable versions are usually considered the ”OriginalFormOf” VariantForms.
The language resource is derived from another, e.g. a frequency lexicon or a language model.
The language resource is a previous / newer version of the related resource.
E.g., version 1 points to version 2 using IsPreviousVersionOf. Example: lehdet90ff-v1.
The language resource is a part of another (broader resource or collection). Can be used e.g. for parts of a serial corpus.
The corpus is a continuation of another. The content is different but the compilation method is the same.
The tool that was used in creating the corpus, e.g. a parser.
The language resource family shares metadata, e.g. a license or description.
The shared ”roof” metadata points to the more specific metadata using the IsMetadataFor relation, and the more specific metadata points back to the shared ”roof” metadata using the HasMetadata relation (see DataCite Metadata Working Group 2016, page 37). Example: ceal.
Shared metadata has no direct link to the language resource’s content.
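The paired relations described above (IsNewVersionOf with IsPreviousVersionOf, and IsMetadataFor with HasMetadata) lend themselves to a simple consistency check. The sketch below is not a description of how META-SHARE stores relations: the triple representation, the function name and the identifier lehdet90ff-v2 are assumptions made for illustration, and the check presumes that both directions of a pair are meant to be recorded.

```python
from typing import Dict, Set, Tuple

# (source resource, relation name, target resource)
Relation = Tuple[str, str, str]

# Only the relation pairs explicitly named in the text are listed here.
INVERSE: Dict[str, str] = {
    "IsNewVersionOf": "IsPreviousVersionOf",
    "IsPreviousVersionOf": "IsNewVersionOf",
    "IsMetadataFor": "HasMetadata",
    "HasMetadata": "IsMetadataFor",
}


def missing_inverses(relations: Set[Relation]) -> Set[Relation]:
    """Return the back-pointing relations that are absent from the set."""
    missing: Set[Relation] = set()
    for source, name, target in relations:
        inverse = INVERSE.get(name)
        if inverse is None:
            continue  # relation has no paired inverse in this sketch
        back = (target, inverse, source)
        if back not in relations:
            missing.add(back)
    return missing


if __name__ == "__main__":
    recorded = {
        # "lehdet90ff-v2" is a hypothetical identifier for illustration.
        ("lehdet90ff-v1", "IsPreviousVersionOf", "lehdet90ff-v2"),
    }
    print(missing_inverses(recorded))
```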
If none of the relations described above applies, other possible relations can be found in the DataCite documentation (DataCite Metadata Working Group 2016). Using relation terminology other than DataCite’s is not permitted.
DataCite Metadata Working Group. (2016, from page 37 onwards). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.0. DataCite e.V. http://doi.org/10.5438/0012