Preliminary evaluation of data protection

If personal data are processed in your research project and the processing involves high risks, data protection regulations require you to carry out a data protection impact assessment (DPIA) before you start processing the personal data. The higher the risks involved in the processing, the more carefully you need to protect the data. Consider which protection measures and methods you can use to minimize or eliminate the risks.

The list of questions on this page is intended to help you plan your research project. You can use the questions to make a preliminary assessment of the risks that may be involved in the processing of personal data in your research. A data protection impact assessment is likely to be required if you answer ”yes” to more than one of the ten questions. Please note that the interpretations of the questions may vary in practice, and the individual criteria mentioned under each question are suggestions only.

When processing personal data, you should primarily follow the instructions given by the data controller. Therefore, you must always check with your home organization whether and how you are required to carry out the data protection impact assessment.

Further information on data protection impact assessments is available on the website of the Office of the Data Protection Ombudsman.

Preliminary evaluation questions

1. Will personal data be processed on a large scale?

Processing can be considered as large-scale processing if, for example:

  • There are more than 10 000 research participants/data subjects
  • A large amount of data about the same individual is collected
  • Data is collected about a large portion of the members of a specific group (for example, a large portion of the members of a small ethnic group or the employees of a certain employer)
  • The processing is permanent or long in duration
  • The processing is geographically extensive

2. Will sensitive or highly personal data be processed?

Sensitive or highly personal data includes:

  • Data concerning health
  • Location data (monitoring the movement of a person)
  • Genetic data
  • Biometric data for the purpose of identifying a person
  • Racial or ethnic origin
  • Political opinions
  • Religious or philosophical beliefs
  • Trade union membership
  • Sex life or sexual orientation
  • Data concerning criminal convictions or offences
  • Financial data that might be used for payment fraud
  • Electronic communication (such as emails)
  • Data otherwise considered as very personal (such as notes and diaries)

3. Will there be exceptions to the following rights of data subjects:

  • Informing participants about the project
  • Right to receive copies of data processed about the participant
  • Right to rectify inaccurate personal data
  • Right to restriction of processing
  • Right to object to the processing of personal data (for example, if the processing takes place in a public place, discussion board etc. where data subjects cannot avoid the collection of data)

4. Will data from multiple datasets be combined in a way that is unpredictable to the data subjects?

  • For example, combining data collected for two different purposes or data held by two different data controllers

5. Will the research involve the processing of data concerning individuals who are in a vulnerable position and for whom it may be difficult to exercise the rights of data subjects?

  • e.g., children, the elderly, asylum seekers and patients

6. Will the processing involve automated decision-making (meaning a decision with no human involvement) and/or profiling that may produce significant effects on the participant?

  • Significant effects or legal effects may include exclusion, discrimination, significant impact on privacy, determining the compensation of a participant on the basis of automated decisions etc.

7. Will personal data be used for evaluation or scoring of participants?

  • For example, assessing or predicting disease/health risks or creating a profile based on an individual’s behavior

8. Does the research involve systematic monitoring of the participants?

9. Will new technology be used for processing of personal data in an innovative way?

  • Will data be collected or processed in a novel way?
  • Are the consequences of the use of the new technology unknown?

10. If the research material/data were published or leaked to the public, could this cause significant harm to data subjects?

  • e.g., threat of violence or persecution
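
The rule of thumb given above (a data protection impact assessment is likely to be required if you answer "yes" to more than one of the ten questions) can be sketched in a few lines of Python. This is only an illustration: the function name and the shape of the input are assumptions made for this example, and the result does not replace the instructions of your home organization.

```python
def dpia_likely_required(answers):
    """Return True if more than one of the ten questions is answered "yes".

    `answers` is a hypothetical mapping from question number (1-10)
    to True ("yes") or False ("no").
    """
    if set(answers) != set(range(1, 11)):
        raise ValueError("expected answers to questions 1-10")
    yes_count = sum(1 for value in answers.values() if value)
    return yes_count > 1


# Example: "yes" to questions 2 and 5 only.
example = {number: number in (2, 5) for number in range(1, 11)}
print(dpia_likely_required(example))  # True
```

Because the interpretation of the individual questions may vary in practice, a single "yes" can still justify a closer look even though the sketch returns False in that case.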

Last updated 6.9.2021

Brochure about the Language Bank of Finland for research participants

Research participants should be given sufficient details regarding the study for which personal data are to be collected. It is recommended that the following brochure be used as a supplement to the rest of the information that is provided to the research participants. The brochure includes basic information on the Language Bank of Finland and on the process of storing research materials for further use in the long term.

Last updated 10.5.2022

What it takes: Open your research data

University of Helsinki Data Support organizes a webinar on opening research data on 25 March at 13-15. Four Finnish data archives and service providers, including the Language Bank of Finland, will introduce themselves and tackle questions related to how research data can be opened.
Read more and join the webinar via Zoom!

Guidelines for processing corpora stored in the Language Bank of Finland that contain personal data

URN for this page: http://urn.fi/urn:nbn:fi:lb-2020081522

Always comply with these guidelines when processing corpora obtained from the Language Bank of Finland that contain personal data.

Does the corpus contain personal data?

Corpora stored in the Language Bank of Finland that contain personal data have the following label in their licence:

PRIV: There are personal data in the resource.

The licence details of individual corpora can be found in the corpora listing of the Language Bank of Finland next to the corpus in question as well as in its metadata, which can be accessed using the persistent identifier assigned to the corpus (i.e., the URN address included in the citation instructions).

Resource-specific data protection terms and conditions

All corpora labelled PRIV contain a separate description of the resource-specific data protection terms and conditions, including the following details:

  • Data controller of the personal data that are distributed via the Language Bank of Finland
  • Types of personal data and data subject groups included in the corpus 
  • Description of the purposes for which the corpus can be further distributed by the Language Bank of Finland
  • Restrictions regarding the location and transfer of the personal data to countries outside Finland
  • Further processing instructions pertaining to personal data in the specific corpus, if any

The creation of resource-specific data protection pages is currently in progress. In case you discover that a separate description of the data protection terms and conditions for a specific corpus is not yet available and you cannot find corresponding information in the metadata of the resource, please request clarification from the FIN-CLARIN service address: fin-clarin(at)helsinki.fi.

How to process corpora that contain personal data?

By using the corpora stored in the Language Bank of Finland, you undertake to comply with the general terms of use of the Language Bank of Finland as well as corpus-specific special terms. 

When using a PRIV-labelled corpus, you undertake to process the personal data included in it confidentially, carefully and solely for the purpose for which you were granted access to the corpus. Further restrictions are described in the resource-specific data protection terms and conditions that are published along with the corpus-specific license.

  • If you are granted access to a corpus on the basis of a personal application and you have presented a research plan or a similar description of the purpose in connection with the application, you can use the corpus only for the purpose stated. Additional restrictions that apply to individual corpora are stated in the resource-specific license and data protection terms and conditions.
  • If you gain access to a corpus without a separate application, but access requires logging in as a researcher or student, the corpus can be processed only for research and teaching purposes. Additional restrictions that apply to individual corpora are stated in the resource-specific license and data protection terms and conditions.

When processing corpora that contain personal data, please apply sufficient protective measures in accordance with the instructions provided by your own organisation. Special care is needed when processing corpora that contain sensitive personal data (also known as special categories of personal data).

Carry out your duties as the data controller

When you start processing a corpus that contains personal data and was obtained through the Language Bank of Finland, whether for new research or for another purpose, you and/or your home organisation assume the role of data controller for the corpus. Among other responsibilities, the controller is obliged to demonstrate the lawfulness of the processing of personal data, when necessary.

The instructions provided by your own organisation must be observed in the first instance when processing personal data. If instructions provided by your home organisation are unavailable, you can familiarise yourself, for example, with the Data Management Guidelines published by the Finnish Social Science Data Archive when planning the processing.

Remember to draw up a privacy notice

As the controller, you must usually draw up a privacy notice on the processing of personal data. Comply with the instructions provided by your own organisation in this instance as well. When drawing up a privacy notice, you can utilise the privacy notice associated with the original corpus, or the description of the personal data included in it.

When starting to use a corpus stored in the Language Bank of Finland that contains personal data, first publish the privacy notice pertaining to your purpose of processing, for example, on a website provided by your organisation. You can share a short title of your project that is understandable to the general public as well as a link to the openly available privacy notice by using this form. We publish this information on the Language Bank of Finland website to make it available to anyone interested in the purposes for which the corpus is used.

Apply proportionate protective measures

Comply with the guidelines of your own organisation. When necessary, you can view examples of protective measures employed by the Language Bank of Finland and other potential measures which you may need when processing personal data. 

Personal data in scientific presentations and publications

Personal data must also be processed responsibly and in compliance with good ethics when creating scientific publications and presentations based on corpora.

When reporting on the results of scientific research, personal data must be, as a rule, removed or redacted, for example, by pseudonymisation and by classifying data subjects’ age, domicile and other details into more extensive categories so that study participants cannot be identified on the basis of such details or by combining them with other data.
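
As a rough illustration of the two techniques mentioned above, the following Python sketch replaces names with running pseudonyms and exact ages with ten-year bands. The record fields (`name`, `age`), the function names and the band width are assumptions made for this example only. Coarser categories alone do not guarantee that a participant cannot be identified when details are combined, so the output still needs to be reviewed case by case.

```python
def age_band(age, width=10):
    """Map an exact age to a coarser band such as "30-39"."""
    if age < 0:
        raise ValueError("age must be non-negative")
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def pseudonymise(records):
    """Replace names with running pseudonyms and ages with age bands.

    The same name always receives the same pseudonym within one call.
    `records` is a hypothetical list of dicts with "name" and "age" keys.
    """
    pseudonyms = {}
    result = []
    for record in records:
        name = record["name"]
        if name not in pseudonyms:
            pseudonyms[name] = f"Speaker {len(pseudonyms) + 1}"
        result.append(
            {"speaker": pseudonyms[name], "age": age_band(record["age"])}
        )
    return result


sample = [
    {"name": "Participant A", "age": 34},
    {"name": "Participant B", "age": 61},
    {"name": "Participant A", "age": 34},
]
for row in pseudonymise(sample):
    print(row)
```

The mapping from names to pseudonyms is kept only inside the function here; if it has to be stored, it is itself personal data and needs to be protected accordingly.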

In certain cases, presenting scientific research results requires the presentation of data that contain personal data. For example, it may be necessary to link short individual samples from the corpus to a scientific article, or a specific section must be presented in connection with a conference presentation. However, carefully consider the potential impact on and risk to the study subjects, their family members or others close to them associated with publishing or presenting samples that contain personal data. The scope of the samples intended for publication must not exceed the scientific purposes, and all unnecessary personal data must be removed or pseudonymised from the samples using appropriate means.

Please also note that if the study subjects have been, for some reason, clearly informed that no personal data associated with them will be published, and the sample to be published cannot be fully anonymised, a separate consent for publishing the sample must be requested from the subjects.

Several purposes? 

If a PRIV-labelled corpus, which requires access rights, is to be processed for more than one purpose – for example, if at a later date there is a wish to carry out a new study not directly connected to the previous topic – access rights must be applied for from the Language Bank of Finland separately for each purpose. Naturally, all grounds for the processing must be stated in the privacy notice(s).

Errors and misconduct

If you come across personal data which you believe should not be included in a corpus based on its description, please report the matter immediately to the Language Bank of Finland and/or directly to the controller of the data. This also applies to instances where you suspect that personal data have, for some reason, fallen into the wrong hands.

Privacy practices of the Language Bank of Finland

Last updated 30.8.2021

How to cite the Language Bank of Finland and FIN-CLARIN

It is important to cite language resources in a coherent way. This will enable other researchers to replicate your research, and the authors or developers of the resource can receive credit for their work.

When you use a language resource (a corpus or a tool) that is available via the Language Bank of Finland, please adhere to the citation instructions provided by the Language Bank. This way, you provide an accurate reference to the exact version of the resource. In the Language Bank of Finland, every resource version has a unique persistent identifier that is always included in the reference. The identifier exists in order to ensure that the resource can be accessed and the study can be replicated in the future even if the location of the resource changes.

The license conditions of many corpora and tools require the users to provide a reference to the resource in question. In this case, the license terms will usually mention the BY condition (Attribution; Nimeä in Finnish). A reference is systematically required for all language resources that are licensed for academic use (CLARIN ACA) or for individual use (CLARIN RES). Even openly licensed language resources may require appropriate citation (e.g., Creative Commons Attribution and other open licenses).

By providing a reference to the Language Bank of Finland and to its language resources, you can help FIN-CLARIN keep track of the usage of its corpora and services and maintain the Language Bank of Finland.

Citing a corpus that is available via the Language Bank of Finland

Reference instructions for individual corpus versions or variants can be found at the quotation mark icon on the Corpora list of the Language Bank of Finland.

The reference instructions are also mentioned in the metadata of each language resource. The metadata of the corpora that are available via the Language Bank of Finland are stored and distributed on the META-SHARE service. The metadata record of a specific language resource can always be accessed with the persistent identifier that is included in the citation instructions, or by clicking on the corpus title on the corpus list of the Language Bank. In the metadata record, the link to the reference instructions can usually be found in the Documentation section. In some cases, the citation instructions are directly available in the Attribution Details field. The metadata record also provides details on the corpus-specific license.

For corpus versions that are offered via the Korp concordancing service, the link to the citation instructions is available in the corpus information frame that pops up when the mouse cursor is moved over a corpus title in the corpus selection menu, as well as under the corpus details in the information column on the right when an individual search result is selected in the concordance view.

In case the resource is available via the download service of the Language Bank of Finland, it includes a file called README containing the persistent identifier of that particular resource version.

Reference format

As an example, here are the reference instructions to the language resource titled Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2:

University of Helsinki (2017). Corpus of Finnish Magazines and Newspapers from the 1990s and 2000s, Version 2 [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2017091901

Note that the exact formatting practices of data references may vary between publications. In any case, it is best to include the details that appear in the citation instructions provided by the Language Bank of Finland. When you are writing scientific journal articles or producing other research output, you may need to check the publication-specific instructions to see whether it is customary to include data sources in the bibliography or to create a separate list for them.
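
The example reference above follows a regular pattern: author, year, title, resource type in square brackets, publisher, and the persistent identifier. The following Python sketch assembles a reference string in that pattern. The function and parameter names are hypothetical and the pattern is inferred from the single example shown here, so always check the result against the citation instructions of the resource in question.

```python
def format_reference(author, year, title, resource_type, pid_url,
                     publisher="Kielipankki"):
    """Build a reference string in the pattern of the example above."""
    return (
        f"{author} ({year}). {title} [{resource_type}]. "
        f"{publisher}. Retrieved from {pid_url}"
    )


reference = format_reference(
    author="University of Helsinki",
    year=2017,
    title=("Corpus of Finnish Magazines and Newspapers "
           "from the 1990s and 2000s, Version 2"),
    resource_type="text corpus",
    pid_url="http://urn.fi/urn:nbn:fi:lb-2017091901",
)
print(reference)
```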

References to the Language Bank of Finland, FIN-CLARIN or CLARIN

The address of the Language Bank of Finland (Kielipankki)

In case you wish to refer to the Language Bank of Finland as a collection of services, please use the web address www.kielipankki.fi.

Refer to the FIN-CLARIN consortium

A presentation of the FIN-CLARIN consortium on the web portal of the Language Bank of Finland: http://urn.fi/urn:nbn:fi:lb-2014120212

Refer to CLARIN ERIC

The general reference instructions of CLARIN ERIC and CLARIN services can be found under CLARIN Frequently Asked Questions.

More information about citing data

Why should you deposit your resource with the Language Bank of Finland?

When you deposit a corpus or a tool to be distributed via the Language Bank of Finland maintained by FIN-CLARIN, your work will gain more visibility and your resource will be available for users. Many Finnish research funding organizations recommend that all research data containing language be deposited with the Language Bank of Finland.

If a corpus or tool is readily available, it will be used and cited more often. A unique, persistent identifier and citation instructions are assigned to each resource that is distributed via the Language Bank of Finland. This makes it easy for you and others to refer to your resource in publications. The language resources you deposited can also be included in your CV.

In some cases, it is not possible to make a resource openly available. We will agree with you on the terms and conditions for distributing your resource. If necessary, it is possible to restrict access to the resource to identified users only, or to individual users who are granted access based on a research plan they have presented. In the latter case, access rights can be managed conveniently in our online service called Language Bank Rights.

Inform the Language Bank of Finland about a forthcoming language resource

What it takes: open your research data

University of Helsinki Data Support will hold an event focusing on the opening of research data. The event ”What it takes: Open your research data” takes place on 26 March 2020 at 13–15 at Think Corner Stage (Yliopistonkatu 4).

Read more…

Data Clinic 2019-20

This online course can support you in managing, annotating and analyzing your language material when you start working with your MA thesis or PhD project. A similar course has been previously offered by FIN-CLARIN under the title Corpus Clinic. Please note that the number of participants is restricted. Read more…

FIN-CLARIN Data management policy (DMPol)

1 FIN-CLARIN overview and division of responsibilities

FIN-CLARIN is a distributed infrastructure. The FIN-CLARIN partners handle production of language technology and resources relatively independently. As an infrastructure, FIN-CLARIN has three primary strategies to promote good data management: training, support, and infrastructure design. FIN-CLARIN as a consortium gives guidance and requirements for publishing datasets and open access publishing.

The data and technology produced by FIN-CLARIN can be divided into two parts:

  1. Data sets and technologies developed by scientific research projects of the FIN-CLARIN partners. These resources are owned and handled by the FIN-CLARIN partners. The partners are responsible for their own storage and open access according to their guidelines and policies. The resources are language-oriented, and are highly valuable from a scientific point of view.
  2. FIN-CLARIN internal data on infrastructure usage, including technical information and statistics. This data is relatively small in scope, and is handled by CSC – IT Center for Science Ltd. following the processes and requirements set by CLARIN ERIC and EGI (the European Grid Infrastructure). This quantitative data is valuable for further development of FIN-CLARIN.

The researchers and data owners using FIN-CLARIN have ultimate responsibility for the type 1 data management. However, FIN-CLARIN shares technical expertise and good scientific practices. The DMP allows FIN-CLARIN to reach a significant number of users within Finland. To this end, FIN-CLARIN makes information on data management a key part of its activities.

To facilitate data management, FIN-CLARIN requires partners to follow open access policies. It provides this DMP, which sets out general principles within the infrastructure, and detailed data management guidelines that take into account the specific policies and environments of each partner and give partner- and discipline-specific advice.

2 Management of FIN-CLARIN data developed by partner research projects

In this section, we describe our support for managing the tools and data from research projects using the FIN-CLARIN infrastructure.

2.1 Existing data management policies and activities of partners

The FIN-CLARIN partners already have individual data management practices. Through the collaboration in FIN-CLARIN, the lessons from these programs can be spread among institutions in the same way as FIN-CLARIN supports knowledge transfer.

As FIN-CLARIN partners already have their own data management support and policies, those policies will be adhered to. They are published on the web and listed in Table 1. All of them take into account data management throughout the data lifecycle, from planning to archiving and reuse.

Table 1. Data management policies and guidelines of FIN-CLARIN partners

Furthermore, each partner will name a data management contact person, who is in charge of enforcing this policy at his or her location. They will provide partner-specific user support and training.

2.2 FIN-CLARIN Data management principles and guidelines

FIN-CLARIN is an infrastructure, and thus day-to-day data management must ultimately be done by end-users. In order to ensure that users follow the DMPol, we recommend initial training in data management. For projects applying for restricted resources, we require an initial data management plan before access is granted. This ensures that users consider data management as part of their research.

Primarily, we recommend that users follow the existing partner data management guidelines with respect to openness and dissemination. When possible, we extend and improve these guidelines with focus on scientific research and seek to resolve any potential conflicts between policies.

The consortium expects compliance with the data management guidelines, and all partner sites provide support to their users of the infrastructure. We continually update the guidelines with current best practices and the latest recommended services.

FIN-CLARIN as an infrastructure covers the data mid-life cycle, i.e. the actual storage and computation. However, the FIN-CLARIN support services cover all stages of the data lifecycle. FIN-CLARIN is committed to open science principles and open publishing. The data guidelines further explain the recommendations of implementing this and how to leverage existing services.

FIN-CLARIN provides a landing page on data management with information on data management specific to language technology and resources. It contains both new information and links to CLARIN ERIC as well as national and partner-specific information, including local contacts. This information is available for central use as well as for partner-specific documentation.

The consortium publishes the data management policy as well as practical data guidelines and collates advice on data management practices on its website.

FIN-CLARIN recommends open science principles and open access publishing. For implementing this, FIN-CLARIN recommends using suitable existing services such as DMPTuuli for project data management planning, the upcoming national digital preservation service portfolio for research, as well as international services such as Zenodo and EUDAT. See Table 2.

Table 2. Recommended and supported data management solutions

All relevant publications must be reported according to each organization’s guidelines, ensuring they are sent to the national VIRTA publication reporting system. In general, this is done through the university reporting systems, which are also used for our internal reporting and acknowledgement of the infrastructure. All publications produced using the FIN-CLARIN infrastructure should include a reference to the technologies and resources provided by the infrastructure. The references to be used are the persistent identifiers given by the infrastructure through the reference service, e.g. Digital Morphology Archives.

Research data that is prepared for sharing has to be stored either in the IDA service, or in a similar organizational/national/international archive. When choosing a storage and sharing service, the user must consider legal and ethical issues and ensure that the stability and availability of the chosen service are suitable for long-term storage. The service must also give a persistent identifier for the datasets so that there is a way to refer to the resource. We recommend that all datasets are stored in a service from which they can be shared and that they are licensed so that others can use them (e.g. Creative Commons licenses).

The choice of licenses has to be done taking into account legal and ethical aspects of the data. Software and databases have their own license recommendations. For open datasets, we recommend Creative Commons BY 4.0 and for open metadata CC0. The former requires attribution to the original creator and the latter waives all rights ensuring maximum visibility for metadata.

All tools and datasets must be described in META-SHARE or Virtual Language Observatory. If a resource is described in another service, there must be a reference to either of these descriptions (for instance with a persistent identifier). The description of a dataset has to include administrative, technical and descriptive metadata according to current standards. The goal is to ensure good discoverability of the resource and adequate levels of information for others to evaluate the possibility for reuse. In addition to the metadata description, a link to the resource and possible license information has to be included. All resources produced using FIN-CLARIN should include a reference to the infrastructure service.

2.3 FIN-CLARIN-provided data management tools

FIN-CLARIN provides a variety of tools for integration with the data life cycle.

Each partner provides core day-to-day data storage for research activities. In general, this storage space is large and fast, but not backed up. A smaller home directory space is provided for backed up code and critical configurations. For large data storage, the partner storage locations must be used for back-up unless a separate agreement is made with CSC. Each partner offers integration with its local resources. Data can be stored both in individual user folders, or in group folders for collaboration. In all cases, data is protected by file system permissions.

FIN-CLARIN has installed commands in its computing environment, which allow direct access to the IDA storage service. This allows direct staging to and from long-term storage. The ePouta cloud service provides several tiers of data storage: default, non-backed up disks, high-performance IO storage, and normal backed up storage.

2.4 Implementing data management training and instruction

By far, the hardest data management problems are on the end-user side, where we assist through our consortium training processes. Support is provided both nationally and locally, with the consortium serving as a conduit for best practices to be shared. Local support staff is able to provide the most useful support. The goal is to nationalize local best practices as well as promote CLARIN ERIC standards.

As part of its data management activities, FIN-CLARIN organizes data management planning events and roadshows for all partners, providing targeted training and support so that the partners can better support their researchers. The consortium makes use of the Open science training material as well as CSC’s existing training framework, including training on data science and data management.

3 Management of FIN-CLARIN internal data

In this section, we outline the data produced by FIN-CLARIN in the daily operation of its services.

3.1 Types of data

FIN-CLARIN internal data mainly contains status, usage, and job statistics. This is primarily useful for reporting and development of FIN-CLARIN services. Data is collected automatically by FIN-CLARIN services as a normal part of the usage of the FIN-CLARIN platforms and services. For example, the search tools contain records of all executed searches. This provides data automatically in a structured and interoperable form. All software and automated configurations are considered data.

3.2 Documentation and quality

Since the FIN-CLARIN centralized service setup is automatic, the software stack collecting the data is known and reproducible. Because all data comes from standard open source systems, documentation and structuring are automatic. We will prefer the standard forms from these systems when releasing the data, and defer most documentation to the authoritative upstream sources by linking. Data quality matches that of the CLARIN infrastructure: the data documents the actual performance of the systems.

3.3 Storage and backup

FIN-CLARIN operational data is backed up as a part of normal operations of systems of this scale. The total size of the data is small relative to the capacity of the FIN-CLARIN systems.

3.4 Ethics and legal compliance

The relevant operational data is non-personal and FIN-CLARIN can release it independently. Usage data may be released only in a sufficiently anonymous and aggregated form. Partners, in conjunction with guidelines produced by FIN-CLARIN, will conduct anonymization.

3.5 Data sharing and long-term preservation

The infrastructure data is reported as part of the annual reporting. Summaries are also included in FIN-CLARIN and partner reports.

Software and other code is made available from the CSC organization account under appropriate free software and open source licenses (e.g. MIT, GPLv3+).

A copy of the deposited item is placed in the backed-up long-term preservation system of the repository. The item is read from the storage from time to time to ensure that the deposited item is still accessible and readable with existing software. In case of difficulties, a recovery procedure is invoked.

4 References

Registration period for the course Corpus Clinic

The course is organized in English during 9.11.2018 – 26.04.2019. The registration deadline has been extended to 23rd November 2018; until then, you can join the course area on Moodle. Welcome aboard!


Life cycle and metadata model of language resources

Parts of a language resource

A language resource consists of three parts at the minimum:

In addition, a language resource may have its own license page and instructions, if needed. If several members of a single language resource family share license terms, only one license information document is produced. Resource-specific instruction pages describe only those features of the resource’s usage that are not covered in the general instructions of the applicable tool or other application.

Persistent identifiers

All parts of a language resource are referred to using persistent identifiers (PID). The Language Bank of Finland uses both the URN and Handle systems. Of these two, URN is more common in the Nordic countries and Handle is more widely used globally. At the Language Bank, URNs and Handles have a 1:1 mapping, e.g. hdl:11113/lb-201710212 and urn:nbn:fi:lb-201710212 point to the same page.
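
The 1:1 mapping can be sketched as a pair of string conversions. This is a minimal sketch, assuming that every identifier follows the pattern of the example above (Handle prefix 11113, URN namespace urn:nbn:fi:); the function names are hypothetical and not part of any Language Bank tool.

```python
# Prefixes taken from the example identifiers above; assumed to hold generally.
HANDLE_PREFIX = "hdl:11113/"
URN_PREFIX = "urn:nbn:fi:"


def handle_to_urn(handle: str) -> str:
    """Convert e.g. 'hdl:11113/lb-201710212' to 'urn:nbn:fi:lb-201710212'."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"unexpected Handle format: {handle!r}")
    return URN_PREFIX + handle[len(HANDLE_PREFIX):]


def urn_to_handle(urn: str) -> str:
    """Convert e.g. 'urn:nbn:fi:lb-201710212' to 'hdl:11113/lb-201710212'."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"unexpected URN format: {urn!r}")
    return HANDLE_PREFIX + urn[len(URN_PREFIX):]
```

Because the mapping is one-to-one, converting in one direction and back returns the original identifier.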

A persistent identifier in the Language Bank means that the user can rely on the information referred to by the identifier to remain accessible, even if the language resource’s location changes. The new location is accessible either directly (the identifier points directly to the new location) or indirectly (the identifier points at a page with information about the location of the old version and how to continue using it as well as how to access the new version).

Persistent identifiers have two main functions:

  • To ensure accessibility of information if its location changes (e.g. if the corpora in Korp have been migrated elsewhere).
  • To retain information about past language resources when continuing to provide the old version is not practical (e.g. for financial reasons).

Language resource versions

A language resource may have several different variants (i.e. versions) that form a language resource family.

Examples of language resource families:

  • Different parsers’ morphological analysis results for a single corpus.
  • Text version of an audio or video corpus (manually or automatically generated).
  • Accumulating corpus: the content is almost identical but one version has more or newer content.
  • Repaired corpus: flaws in a corpus have been identified and fixed manually or automatically.

In all aforementioned cases, it is important that the language resource’s user be able to unambiguously refer to the applicable resource at present as well as in the future. This is why each version always has its own abbreviation, metadata page and location. On the other hand, a language resource family may share a license or instruction page.

To see how the Language Bank fares in relation to RDA recommendations, see the commented RDA Data Versioning Working Group report.

When is a new version generated?

A new version of a corpus is generated when the corpus’s content changes significantly. What constitutes a significant change is defined individually for each corpus. Unless the corpus description specifies otherwise, changes that may substantially affect research results or that are not easily reversible are considered significant. All non-significant changes are recorded in the change log in the corpus’s metadata.

Examples of non-significant changes:

  • A single article in a large conversation corpus has to be removed at an informant’s request. In this case, providing the previous version would not be possible in the first place.
  • Some hand-written tags in a large corpus have been found to contain a typographical error.
  • A corpus has been automatically converted from Latin-1 to UTF-8 character encoding. The old encoding remains accessible in the archive.

How is a new version generated?

If a new version of a corpus is generated, its relation to the previous versions is recorded in META-SHARE. The new version receives a new PID and a new META-SHARE record. In the META-SHARE record, the new and old versions are linked with the IsNewVersionOf and IsPreviousVersionOf relations (see below).
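
The linking step can be illustrated with a small in-memory model. This is a sketch only: the Record class, its fields, and add_new_version are hypothetical and do not reflect the actual META-SHARE schema or API, and the PIDs in the usage lines are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a metadata record of one corpus version."""
    abbreviation: str
    pid: str
    # Each relation is stored as (relation name, PID of the related record).
    relations: list = field(default_factory=list)


def add_new_version(old: Record, abbreviation: str, pid: str) -> Record:
    """Create a record for a new version and link both records reciprocally."""
    if pid == old.pid:
        raise ValueError("a new version must receive a new PID")
    new = Record(abbreviation, pid)
    new.relations.append(("IsNewVersionOf", old.pid))
    old.relations.append(("IsPreviousVersionOf", new.pid))
    return new


# Usage with placeholder PIDs:
v1 = Record("corpus-v1", "pid-placeholder-1")
v2 = add_new_version(v1, "corpus-v2", "pid-placeholder-2")
```

After the call, v2 points to v1 with IsNewVersionOf and v1 points to v2 with IsPreviousVersionOf, so either record leads to the other.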

In case the previous version is no longer relevant to research, the new version replaces it in the Language Bank’s corpus list. The kielipankki.fi/<abbreviation> links also always point at the most recent versions. However, PIDs are always preserved. They point at either the old version or relevant information (”tombstone page”) about how to obtain it or how queries executed in the old version can be reproduced in the new version.

Accumulating corpora

Suomi24: The corpus is updated biannually. The versions’ abbreviations follow the format Suomi24-<year><year half>, e.g. Suomi24-2016H1. Newer versions always contain the previous versions, and queries can be reproduced by defining the period accordingly.
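
The naming scheme can be sketched as follows. This is a minimal sketch, assuming that H1 covers January to June and H2 covers July to December, which the description above does not state explicitly; the function name is hypothetical.

```python
from datetime import date


def suomi24_abbreviation(day: date) -> str:
    """Build the abbreviation of the Suomi24 version covering the given date."""
    # Assumption: the first half-year ends on 30 June.
    half = 1 if day.month <= 6 else 2
    return f"Suomi24-{day.year}H{half}"
```

For a date in March 2016, the function returns Suomi24-2016H1, matching the example above.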

Other corpora

New versions of a corpus receive new version numbers, e.g. helpuhe-v2. META-SHARE contains a description of the difference between the new and the old version. The old version is archived if need be, and PIDs point at a ”tombstone page”.

Preservation of language resources

The Language Bank does not delete the deposited language resources without their owner’s consent.

Common language resource relations

IsVariantFormOf / IsOriginalFormOf

Two versions or variations of a language resource, e.g. a corpus packaged in different ways. Downloadable versions are usually considered the original form (IsOriginalFormOf) of the variant forms.

IsDerivedFrom / IsSourceOf

The language resource is derived from another, e.g. a frequency lexicon or a language model.

IsPreviousVersionOf / IsNewVersionOf

The language resource is a previous / newer version of the related resource.

E.g. version 1 points to version 2 using IsPreviousVersionOf. Example: lehdet90ff-v1.

IsPartOf / HasPart

The language resource is a part of another (broader resource or collection). Can be used e.g. for parts of a serial corpus.

IsContinuedBy / Continues

The corpus is a continuation of another. The content is different but the compilation method is the same.

IsCompiledBy / Compiles

The tool that was used in creating the corpus, e.g. a parser.

IsMetadataFor / HasMetadata

The language resource family shares metadata, e.g. a license or description.

The shared ”roof” metadata points to the more specific metadata using the IsMetadataFor relation, and the more specific metadata points back to the shared ”roof” metadata using the HasMetadata relation (see [1], page 37). Example: ceal.

Shared metadata has no direct link to the language resource’s content.

Other relations

If none of the relations described above applies, other possible relations can be found at DataCite ([1]). Using relation terminology other than DataCite’s is not permitted.
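
The relation pairs described on this page can be collected into a lookup table, so that the reciprocal relation is found whenever one direction is recorded. This is a sketch; the table lists only the pairs named on this page rather than the full DataCite vocabulary, and the names INVERSE and inverse_of are hypothetical.

```python
# Relation pairs as listed on this page; each entry is (relation, reciprocal).
_PAIRS = [
    ("IsVariantFormOf", "IsOriginalFormOf"),
    ("IsDerivedFrom", "IsSourceOf"),
    ("IsPreviousVersionOf", "IsNewVersionOf"),
    ("IsPartOf", "HasPart"),
    ("IsContinuedBy", "Continues"),
    ("IsCompiledBy", "Compiles"),
    ("IsMetadataFor", "HasMetadata"),
]

# Build a two-way lookup so either member of a pair maps to the other.
INVERSE = {}
for forward, backward in _PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward


def inverse_of(relation: str) -> str:
    """Return the reciprocal relation, rejecting names outside the table."""
    if relation not in INVERSE:
        raise ValueError(f"relation not in the table: {relation!r}")
    return INVERSE[relation]
```

Rejecting unknown names mirrors the rule that relation terminology outside DataCite's is not permitted, although a complete check would need the full DataCite list.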

Sources

[1] DataCite Metadata Working Group. (2016, from page 37 onward). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.0. DataCite e.V. http://doi.org/10.5438/0012