
D2.1.2: Licensing agreements for special categories

Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.1: Report on Licensing agreements for special categories of personal data
Date of reporting: 2023-06

Report author: Mietta Lennes (UHEL)
Contributors: Sirpa Kovanen, Krister Lindén (UHEL)
Deliverable location: Deposition license agreement template


The deposition license agreement template of the Language Bank of Finland allows for the deposition of resources that contain personal data (cf. D2.1.1: Licensing agreements for personal data). Some research datasets may also include personal data belonging to special categories. Such data reveal the person’s racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, or comprise data concerning health, sexual orientation or activity, or genetic and biometric data for identifying the person.

Personal data belonging to special categories are considered sensitive. In some cases, it is not possible to completely remove the sensitive data without making the entire resource unusable for the research purpose. However, it may still be possible to deposit the resource (or some version of it) with the Language Bank, provided that sufficient and proportionate safety measures are applied.

Preparing for the deposition of a sensitive dataset

Before the resource can be deposited, the data controller for the original purpose of use (in practice, usually the depositing researchers themselves) must conduct a preliminary risk assessment and, where appropriate, a Data Protection Impact Assessment (DPIA). In this process, the researchers should primarily follow the instructions of their home organization. For convenience, the Language Bank also provides an instruction page for the preliminary evaluation of data protection.

Before depositing, the researchers are responsible for minimizing the amount of personal data, and especially the sensitive information, to the extent that is possible and proportionate with regard to the research purpose. To keep the deposited content accessible and useful for other researchers, some documentation of the pseudonymization process can be included in the metadata of the resource.

Additional data protection terms and conditions

For resources containing personal data, the resource-specific data protection terms and conditions and the description of the categories of personal data in the resource are included in an annex of the deposition license agreement with the Language Bank. In the same annex, it is possible for the data controller to specify further requirements, in case the processing of personal data contained in the resource is seen to involve risks that call for a particularly high level of information security.

Protective measures applicable to sensitive datasets

Currently, the Language Bank offers the following protective measures that can be applied to sensitive datasets:

  1. Access management in the restricted license category (RES): Based on application, access to the resource can be restricted to individual researchers who have produced an acceptable research plan, matching the original research purpose of the resource.
  2. Data protection terms and conditions: When submitting their application, each user of the resource must accept the license of the resource in question, including the resource-specific data protection terms and conditions recorded in the deposition license agreement by the original data controller. The license is persistently available via the metadata record of the resource and the license information is also included in the data package that is provided to end-users via the Language Bank.
  3. Data encryption: In the case of sensitive datasets, the package can be stored in an encrypted form, and the Language Bank can re-encrypt the package individually for each recipient, to ensure that only the authorized user can decrypt and access the package content after downloading. The first dataset to apply this safety measure is the Finnish Dark Web Marketplace Corpus (findarc), published on 30 May 2023. To make the encrypted dataset more accessible, the Language Bank offers instructions for using GPG keys.
  4. Sensitive Data (SD) services at CSC – IT Center for Science: The Language Bank is currently preparing to start using the SD platform for making sensitive datasets available to researchers and research teams who need a secure environment for reusing a given resource (see further details about the SD services at CSC). The aforementioned Finnish Dark Web Marketplace Corpus will be used as a test case.
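
As an illustration of the per-recipient re-encryption described in item 3, the following is a minimal sketch only, not the Language Bank's actual procedure. It assumes that GnuPG is installed, that the operator's key can decrypt the stored package, and that the recipient's public key has already been imported and trusted. The function names, the file name `findarc.zip.gpg`, the key ID `ABCD1234` and the `outbox` directory are hypothetical. The decrypted stream is piped directly into the second gpg process, so no plaintext copy is written to disk.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List, Tuple


def build_reencrypt_commands(
    package: str, recipient_key: str, out_dir: str = "outbox"
) -> Tuple[List[str], List[str]]:
    """Build the two gpg command lines for re-encrypting one package.

    All names are hypothetical; paths are treated as POSIX-style paths.
    """
    source = PurePosixPath(package)
    # "findarc.zip.gpg" for key "ABCD1234" -> "outbox/findarc.zip.ABCD1234.gpg"
    target = PurePosixPath(out_dir) / f"{source.stem}.{recipient_key}.gpg"
    # Without --output, gpg --decrypt writes the plaintext to standard output.
    decrypt = ["gpg", "--batch", "--decrypt", str(source)]
    # Without a file argument, gpg --encrypt reads from standard input.
    encrypt = [
        "gpg", "--batch", "--yes",
        "--output", str(target),
        "--recipient", recipient_key,
        "--encrypt",
    ]
    return decrypt, encrypt


def reencrypt(package: str, recipient_key: str, out_dir: str = "outbox") -> None:
    """Pipe the decrypted package straight into encryption for one recipient."""
    decrypt, encrypt = build_reencrypt_commands(package, recipient_key, out_dir)
    producer = subprocess.Popen(decrypt, stdout=subprocess.PIPE)
    consumer = subprocess.run(encrypt, stdin=producer.stdout)
    producer.stdout.close()
    if producer.wait() != 0 or consumer.returncode != 0:
        raise RuntimeError("re-encryption failed")


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch needs no keys.
    for command in build_reencrypt_commands("findarc.zip.gpg", "ABCD1234"):
        print(" ".join(command))
```

The recipient would then decrypt the downloaded file with their own private key, for example with `gpg --output findarc.zip --decrypt findarc.zip.ABCD1234.gpg` (file names again hypothetical).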

The Language Bank is also collaborating with the DELAD Task Force in CLARIN. DELAD focusses on sharing corpora of disordered speech that often contain, e.g., health-related data and data from children.


Last updated: 2023-06-06
