Creating a metadata record on META-SHARE
This document describes the minimum set of details that should be included in a metadata record that is published on META-SHARE.
NB: To request the creation of a metadata record for a new/forthcoming resource, Language Bank users are requested to fill in this form:
Submit information about a language resource to Kielipankki
Download the Excel helper document for ingesting the submitted metadata for a new resource (GitHub, New_resource.xlsx)
How to create a metadata record on META-SHARE
* = The metadata field is mandatory, i.e., at least some relevant information should be filled in right from the start.
(*) = The metadata field is required by the Language Bank at a later stage, but it is not mandatory for publishing the initial version of the metadata record.
- *Resource name (Required: Identification): The full name or title by which the resource is known.
- Preferably, the resource name should be available both in English and in Finnish. Swedish is to be avoided, since it is not well supported and might be confusing.
- The resource name may also include an extension at the end, identifying a specific variant or version of the original content, e.g., ”The Suomi24 Corpus 2001-2017, VRT version 1.1”. See resource naming conventions.
- The element can be repeated for the different language versions using the ”lang” attribute to specify the language.
- *Description (Required: Identification): Provides the description of the resource in prose.
- The element can be repeated for the different language versions using the ”lang” attribute to specify the language. Provide a Description at least in English. A Finnish translation is usually also provided.
- The Description field should begin with the following type of sentence (English):
”This resource [is | will be] available [via Korp | for download] in Kielipankki – the Language Bank of Finland[, see Access location].”
A standard expression can then be used for finding all resources that are distributed via Kielipankki.
- Do not repeat URNs or other links that are already mentioned elsewhere in the metadata!
- Do not mention physical links like ”korp.csc.fi/download”. (If really necessary, use more current aliases: kielipankki.fi/download; kielipankki.fi/lataus.)
- In the text, try to describe things like what, why, how much, created how, by whom, etc. Preferably, the corpus owner/depositor should write the first draft of the English and Finnish versions.
- If external documentation is available, you may use a standard note like ”For links to further information, please see Documentation.”
- Please enter the keyword/hashtag to be used for the resource on its own line at the end of the Description(s). See the instructions in the naming conventions.
- Do not include Change Log information in the Description (see instructions under Documentation).
- (*)Resource short name (Required: Identification): The short form (abbreviation, acronym, etc.) used to identify the resource.
- Do not localize the short name! It is usually in English. It is used to create directory names, or it may be part of a filename.
- Do not use spaces. Use only lowercase letters. See the resource naming conventions.
- Formatting the short names for different versions and variants of the same resource group:
- If the resource is the source version (the first version obtained by the Language Bank), end the shortname with ”-src”, as in urn:nbn:fi:lb-2017070501
- If the resource is the Korp version, end the shortname with ”-korp”, as in urn:nbn:fi:lb-2019120403
- If the resource is the VRT-file exported from Korp, intended for the download service, end the shortname with ”-vrt”, as in urn:nbn:fi:lb-2019052701
- If the resource is a scrambled version, insert a ”-s-” in the shortname before the -src, -korp, or -vrt, as in urn:nbn:fi:lb-2019120404
- If the resource is a parallel version of another corpus, insert ”-par-” in the shortname, roughly as in urn:nbn:fi:lb-2019042605. (That example could be improved by placing ”-par-” after the date information, such as ”-2018-”, and before the previously mentioned suffixes.)
- (*)Url (Required: Identification): The ”Access location” of the resource. Usually a URN, but not necessarily. Can be a download location or a URL pointing to Korp.
- (*)Identifier (Required: Identification): The unique citable URN is the primary ID of this particular version or variant of the resource. The URN is used to refer to the resource from various services, e.g., from the corresponding end-user license pages, from the Language Bank Rights system, and from the list of resources on the Language Bank website.
- For the time being, we add ”http://urn.fi/” in front of the URN, in order to make the link clickable via META-SHARE.
- The URN may be requested by the Language Bank and added to the Identifier field of the metadata record as soon as
- the resource already exists, and
- a preliminary assessment has been made that the resource can probably be included in the Language Bank, even if a deposition agreement or a specific license has not yet been fully confirmed.
- (*)Resource creator (Recommended: Resource Creation): This field should contain the names of the people who are to be cited as the ”authors” of the resource.
- (Until now, the field has not been used consistently, but the information should be added for future reference.)
- The same author/creator names should be mentioned in the citation instructions that are also provided via the list of Corpora/Resources on the Language Bank website.
- *Distribution (Required: Identification): Specify the license details.
- Minimally, select ”Under Negotiation”, in case the resource is not yet available (and/or nothing further is known).
- Select ”Available – Restricted Use” in case the resource is available and access to the resource is restricted in some way (CLARIN ACA, CLARIN RES, or similar).
- Select ”Available – Unrestricted Use” in case the resource is available and access to the resource is not restricted in any way (CLARIN PUB, Creative Commons licenses, or similar).
- If the general license category (such as CLARIN RES) or a specific license (e.g., CC-BY) is already known to apply to the final resource, the Licence (see below) can be added at an early stage – even if Availability is technically still ”Under Negotiation”.
- (*)Licence: Select the appropriate license category and the individual terms and conditions that are applicable. Note, however, that the license cannot be specified fully via META-SHARE, and this is why we always additionally include a persistent reference to the license page under Documentation.
- Note that it is in principle possible to define the metadata for several different licenses for a given resource, but this feature is not systematically used by the Language Bank. (It might, however, be useful in case the same content is available, e.g., for research use as well as for commercial use. To avoid confusion, such parallel license documents should be very clearly separated in the Documentation section for the different types of users and purposes.)
- (*)Attribution text: Information regarding the recommended/required way of citing the resource.
- If the resource is not yet available in the Language Bank, the automatic citation instructions will not yet be found on the Language Bank website. Meanwhile, for the convenience of the resource creator/depositor, we can offer to include the corresponding citation format in the Attribution text field.
- When the resource is available in the Language Bank, the Attribution text should include the text: ”see Documentation” (English), ”katso Documentation” (Finnish). At the same time, the link to the automatically generated citation instructions should be added under Documentation.
- (*)Licensor: In case an organization or a person (or both) specifically licenses the resource to be distributed by the Language Bank, they should be listed here.
- This field can list the parties who signed the deposition license agreement with the University of Helsinki (that represents the Language Bank).
- In the case of resources that contain personal data, the Licensors should include the Data Controller.
- It is recommended that this field is filled in when possible.
- (*)Distribution rights holder: This field is relevant only in cases of ACA and RES licenses. It has not been systematically used by the Language Bank, but it is recommended to include this information for future reference, if possible.
- The Distribution rights holder is usually the University of Helsinki, in case the license was given to the Language Bank in a deposition license agreement (for resources including personal data, this is the default option 1 in agreements made after 2021) and the Language Bank is not required to ask the original rights holder’s permission for granting access to the resource.
- In some cases of RES licenses, the deposition license agreement may have been made so that the original depositor/Data Controller remains in charge of distribution and the Language Bank is only a Data Processor. In this case, the Distribution rights holder is the same entity as the Licensor.
- (*)Ipr holder: Regardless of the licensing process, the IPR holders of the resource can be listed here.
- It is recommended that this field is filled in (if possible and relevant).
- (*)Availability start/end date:
- (*)Availability start date can be used for keeping records of when the resource was first made available via the Language Bank.
- The date should be added at the time of publishing the resource.
- NB: Previously, this field has not been used generally, and accurate information may be lacking for many older resources.
- In addition, in case there are specific terms in the license agreement, for instance an embargo period that allows the Language Bank to keep the resource available to users during a specified time period only, the availability start and/or end dates can be defined here even before the resource is actually available.
- *Contact person (Required: Identification): Contact information for inquiries about the resource, e.g., regarding access to the resource or obtaining further information about the content.
- In case a specific contact person cannot be specified for the resource, please use the existing contact records for Language Bank helpdesk, ”FIN-CLARIN User support email@example.com” or ”User support at CSC – IT Center for Science Ltd. The Language Bank of Finland firstname.lastname@example.org”.
- Documentation (Recommended: Resource Documentation): Further information items that can include brief pieces of text or more elaborate references to external documentation of the resource. The Language Bank regularly includes the following details:
- (*)documentInfoType: License. A reference to the end-user license page in the Language Bank (see the internal instructions on creating and updating license pages). This information should be added as soon as the license details have been confirmed with the resource depositor.
- (*)documentInfoType: A reference to resource group page on the Language Bank website.
- The resource group page collects together all the current versions and variants of the same resource as well as the available instructions, manuals and pieces of further documentation that may be available for the group of resources.
- Especially for large groups of resources, it makes sense to create a documentInfoType item once.
- The title of the document should be ”Resource group page: <shortname>”. This will make it easy to reuse the link for other resources in the same group on META-SHARE.
- The URL of the document should be the URN of the (English) resource group page.
- (*)documentUnstructured: Citation instructions. A reference to the automatically generated citation instructions on the Language Bank website. This information is added at the time when the resource is published in the Language Bank. (example)
- In the documentUnstructured dialog box, insert the text: How to cite / Viittausohje: https://www.kielipankki.fi/viittaus/?key=[CORPUS-URN]&lang=en (in the citation link, replace the text [CORPUS-URN] with the URN of the resource as given in the Identifier field, but without the leading ”http://urn.fi/”).
- Test the citation link and make sure it works. If it does not, make sure that the information in the Portal is correct. (Instead of the URN, the corpus shortname can alternatively be used in the citation link, in case the URN does not work for some reason.)
- (*)documentUnstructured: Change Log. (See example)
- This field can be used for keeping track of
- major changes in the metadata (such as modifying the corpus name/title) or
- minor, ”backwards compatible” changes in the resource itself (in case the changes are not significant enough to create a completely new metadata record).
- Note that there is a size limit of 1000 characters in the documentUnstructured type of field, so you should add a new CHANGE LOG item if required.
- Example of a ”Change Log”:
<date1>: what changed;
<date2>: what changed item1
* what changed item2
* what changed item3;
<date3>: what changed
- Unfortunately, the content inserted in documentUnstructured fields will be displayed by META-SHARE without any line breaks or other formatting. (This is not very nice, but it is the best META-SHARE can do at the moment.)
- Do not use the Required: Revision or Recommended: Version/Revision fields for storing the log details.
- Version (Recommended: Version > Version+Revision): These fields are not consistently used by the Language Bank. They may include information about the most recent version of the resource, or potentially about frequent updates to the resource, if planned. If versions or revisions are made, you should include a comment about them in the Change Log (as explained above, see Documentation).
- NB: this section describes the versioning and revision of the resource itself.
- The corresponding field for metadata revisions is Required: Metadata > Revision.
- (*)Relations (Recommended: Relations > Show): The Relation fields must be used extensively in case the resource group includes several versions or variants of the resource.
- Each relation describes a specific relation that the current resource has with regard to a ”target” resource. For instance, (the current resource) IsNewVersionOf ”<Name of the target resource>, <URN of the target resource>”.
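The short-name conventions and the citation link format described above are mechanical enough to check with a small script. The following is only a sketch, not a Language Bank tool: the function names (`check_short_name`, `describe_short_name`, `citation_link`) and the example short names are hypothetical, and the checks cover only the rules stated in this document.

```python
# Hypothetical helpers; they encode only the rules stated in the text above.
VARIANT_SUFFIXES = ("-src", "-korp", "-vrt")
URN_RESOLVER_PREFIX = "http://urn.fi/"
CITATION_BASE = "https://www.kielipankki.fi/viittaus/"


def check_short_name(short_name):
    """Return a list of problems found in a short name (empty if none)."""
    problems = []
    if any(ch.isspace() for ch in short_name):
        problems.append("contains spaces")
    if short_name != short_name.lower():
        problems.append("contains uppercase letters")
    return problems


def describe_short_name(short_name):
    """Classify a short name by the variant markers described above."""
    variant = next(
        (s[1:] for s in VARIANT_SUFFIXES if short_name.endswith(s)), None
    )
    scrambled = variant is not None and short_name.endswith("-s-" + variant)
    parallel = "-par-" in short_name
    return {"variant": variant, "scrambled": scrambled, "parallel": parallel}


def citation_link(identifier, lang="en"):
    """Build the citation link from the value of the Identifier field.

    The resolver prefix is removed first, as the instructions require.
    """
    if identifier.startswith(URN_RESOLVER_PREFIX):
        identifier = identifier[len(URN_RESOLVER_PREFIX):]
    return f"{CITATION_BASE}?key={identifier}&lang={lang}"


# Made-up short names, used for illustration only.
print(check_short_name("My Corpus"))
print(describe_short_name("mycorpus-s-korp"))
print(citation_link("http://urn.fi/urn:nbn:fi:lb-2017070501"))
```

The generated link still needs to be tested by hand, as instructed above, since the script cannot tell whether the information in the Portal is correct.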
Corpus Text Info
(to be defined)
Corpus Audio Info
(to be defined)
Corpus Video Info
(to be defined)
How to create a new metadata record on META-SHARE
Before starting to create a new metadata record on META-SHARE, please make sure that a record does not already exist for the resource in question.
Download the Excel helper document for ingesting the submitted metadata for a new resource (GitHub, New_resource.xlsx)
- Log in to META-SHARE
- NB: Login is available only to a specified group of metadata editors, not to all users, since the metadata are centrally curated by the Language Bank.
- Go to Manage Resources: Manage your own resources.
- At the top right, click on Add Resource +.
- Select a Corpus (for instance).
- Select all the media types that are relevant for the corpus.
- NB: If you select too many media types, or if you forget one of the types that the resource includes, these cannot be edited later (without exporting and importing the metadata record in a potentially tedious way).
- Add at least the required metadata (see the instructions above) and save the record. (Do not try to edit several metadata records at the same time, since this may result in strange behaviour.)
- When ready to go public with the initial metadata record, go back to Manage Resources: Manage your own resources.
- Tick the box in front of the metadata record that you need to publish.
- From the drop-down menu above the list, select ”Ingest”. This includes the record in the internal META-SHARE database.
- Again, tick the box in front of the metadata record that you need to publish.
- From the drop-down menu above the list, select ”Publish”. This publishes the record on META-SHARE and makes the metadata visible to users and available to external services for automatic harvesting.
- To get the physical link of the metadata record, open it once more in the editor and click ”View on site ->” in the top right corner, or search for the metadata record via the META-SHARE browser. When the record is displayed, you can copy the link from the address bar.
- You can then go and request a URN (Identifier) that points to the metadata record.
Last updated: 20.5.2022
Life cycle and metadata model of language resources
Parts of a language resource
A language resource consists of three parts at the minimum:
In addition, a language resource may have its own license page and instructions, if needed. In case several members of a single language resource family share license terms, only one license information document is produced. Language resource specific instruction pages describe only such specific features related to the said resource’s usage that have not been covered in the applicable tool’s or another application’s general instructions.
All parts of a language resource are referred to using persistent identifiers (PIDs). The Language Bank of Finland uses both the URN and Handle systems. Of these two, URN is more common in the Nordic countries and Handle is more widely used globally. At the Language Bank, URNs and Handles have a 1:1 mapping, e.g., hdl:11113/lb-201710212 and urn:nbn:fi:lb-201710212 point to the same page.
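The 1:1 mapping can be illustrated with a small conversion helper. This is a sketch that assumes the mapping is a plain prefix swap, as the pair of example identifiers above suggests; the function names are hypothetical and the two prefixes are taken from that example only.

```python
# Prefixes copied from the example identifiers in the text above.
URN_PREFIX = "urn:nbn:fi:"
HANDLE_PREFIX = "hdl:11113/"


def urn_to_handle(urn):
    """Convert a urn:nbn:fi identifier to a Handle (assumed prefix swap)."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError(f"not a urn:nbn:fi identifier: {urn!r}")
    return HANDLE_PREFIX + urn[len(URN_PREFIX):]


def handle_to_urn(handle):
    """Convert a hdl:11113 Handle back to a URN (assumed prefix swap)."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a hdl:11113 identifier: {handle!r}")
    return URN_PREFIX + handle[len(HANDLE_PREFIX):]


print(urn_to_handle("urn:nbn:fi:lb-201710212"))
print(handle_to_urn("hdl:11113/lb-201710212"))
```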
A persistent identifier in the Language Bank means that the user can rely on the information referred to by the identifier to remain accessible, even if the language resource’s location changes. The new location is accessible either directly (the identifier points directly to the new location) or indirectly (the identifier points at a page with information about the location of the old version and how to continue using it as well as how to access the new version).
Persistent identifiers have two main functions:
- To ensure accessibility of information if its location changes (e.g. if the corpora in Korp have been migrated elsewhere).
- To retain information about past language resources when continuing to provide the old version is not practical (e.g. for financial reasons).
Language resource versions
A language resource may have several different variants (i.e. versions) that form a language resource family.
Examples of language resource families:
- Different parsers’ morphological analysis results for a single corpus.
- Text version of an audio or video corpus (manually or automatically generated)
- Accumulating corpus: the content is almost identical but one version has more or newer content.
- Repaired corpus: flaws in a corpus have been identified and fixed manually or automatically.
In all aforementioned cases, it is important that the language resource’s user be able to unambiguously refer to the applicable resource at present as well as in the future. This is why each version always has its own abbreviation, metadata page and location. On the other hand, a language resource family may share a license or instruction page.
To see how the Language Bank fares in relation to RDA recommendations, see the commented RDA Data Versioning Working Group report.
When is a new version generated?
A new version of a corpus is generated when the corpus’s content changes significantly. What constitutes a significant change is defined individually for each corpus. If the corpus description does not specify otherwise, such changes that may substantially affect research results or that are not easily reversible are considered significant. All non-significant changes are recorded in the change log in the corpus’s metadata.
Examples of non-significant changes:
- A single article in a large conversation corpus has to be removed at an informant’s request. In this case, providing the previous version would not be possible in the first place.
- Some hand-written tags in a large corpus have been found to contain a typographical error.
- A corpus has been automatically converted from Latin-1 to UTF-8 character encoding. The old encoding remains accessible in the archive.
How is a new version generated?
If a new version of a corpus is generated, its relation to the previous versions is recorded in META-SHARE. The new version receives a new PID and a new META-SHARE record. In the META-SHARE record, the new and old versions are linked with the IsNewVersionOf and IsPreviousVersionOf relations (see below).
In case the previous version is no longer relevant to research, the new version replaces it in the Language Bank’s corpus list. The kielipankki.fi/<abbreviation> links also always point at the most recent versions. However, PIDs are always preserved. They point at either the old version or relevant information (”tombstone page”) about how to obtain it or how queries executed in the old version can be reproduced in the new version.
Suomi24: The corpus is updated biannually. The versions’ abbreviations follow the format Suomi24-<year><year half>, e.g. Suomi24-2016H1. Newer versions always contain the previous versions, and queries can be reproduced by defining the period accordingly.
New corpora receive new version numbers, e.g. helpuhe-v2. META-SHARE contains a description of the difference between the new and the old version. The old version is archived if need be, and PIDs point at a ”tombstone page”.
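The Suomi24 abbreviation format above can be derived from a date. The sketch below assumes that H1 covers January to June and H2 covers July to December, which this document does not state explicitly; the function name is hypothetical.

```python
from datetime import date


def suomi24_abbreviation(day):
    """Return the Suomi24-<year><year half> abbreviation for a given date.

    Assumption: H1 = January-June, H2 = July-December.
    """
    half = 1 if day.month <= 6 else 2
    return f"Suomi24-{day.year}H{half}"


print(suomi24_abbreviation(date(2016, 3, 1)))
```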
Preservation of language resources
The Language Bank does not delete the deposited language resources without their owner’s consent.
Common language resource relations
IsVariantFormOf / IsOriginalFormOf
Two versions or variations of a language resource, e.g. a corpus packaged in different ways. Downloadable versions are usually considered the ”OriginalFormOf” VariantForms.
IsDerivedFrom / IsSourceOf
The language resource is derived from another, e.g. a frequency lexicon or a language model.
IsPreviousVersionOf / IsNewVersionOf
The language resource is a previous / newer version of the related resource.
E.g., version 1 points to version 2 using IsPreviousVersionOf. Example: lehdet90ff-v1.
IsPartOf / HasPart
The language resource is a part of another (broader resource or collection). Can be used e.g. for parts of a serial corpus.
IsContinuedBy / Continues
The corpus is a continuation of another. The content is different but the compilation method is the same.
IsCompiledBy / Compiles
The tool that was used in creating the corpus, e.g. a parser.
IsMetadataFor / HasMetadata
The language resource family shares metadata, e.g. a license or description.
The shared ”roof” metadata points to the more specific metadata using the IsMetadataFor relation, and the more specific metadata points back to the shared ”roof” metadata using the HasMetadata relation (see DataCite Metadata Working Group 2016, page 37). Example: ceal.
Shared metadata has no direct link to the language resource’s content.
If none of the relations described above applies, other possible relations can be found in the DataCite documentation (DataCite Metadata Working Group 2016). Using relation terminology other than DataCite’s is not permitted.
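Because the relations listed above come in inverse pairs, a script that records one direction can derive the other. The following is a sketch under stated assumptions: the pair table is copied from the headings above, the names `inverse_relation` and `link_versions` are hypothetical, and plain dictionaries with made-up identifiers stand in for metadata records (this is not the META-SHARE API).

```python
# Relation pairs as listed in this document.
RELATION_PAIRS = [
    ("IsVariantFormOf", "IsOriginalFormOf"),
    ("IsDerivedFrom", "IsSourceOf"),
    ("IsPreviousVersionOf", "IsNewVersionOf"),
    ("IsPartOf", "HasPart"),
    ("IsContinuedBy", "Continues"),
    ("IsCompiledBy", "Compiles"),
    ("IsMetadataFor", "HasMetadata"),
]

# Lookup table that works in both directions.
INVERSE = {}
for left, right in RELATION_PAIRS:
    INVERSE[left] = right
    INVERSE[right] = left


def inverse_relation(relation):
    """Return the relation to record on the target resource."""
    try:
        return INVERSE[relation]
    except KeyError:
        raise ValueError(f"unknown relation: {relation!r}") from None


def link_versions(old, new):
    """Record both directions of a version relation on two record dicts."""
    old.setdefault("relations", []).append(("IsPreviousVersionOf", new["urn"]))
    new.setdefault("relations", []).append(
        (inverse_relation("IsPreviousVersionOf"), old["urn"])
    )


# Made-up identifiers, used for illustration only.
v1 = {"urn": "urn:nbn:fi:lb-example-v1"}
v2 = {"urn": "urn:nbn:fi:lb-example-v2"}
link_versions(v1, v2)
print(v1["relations"])
print(v2["relations"])
```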
DataCite Metadata Working Group. (2016, from page 37 onwards). DataCite Metadata Schema Documentation for the Publication and Citation of Research Data. Version 4.0. DataCite e.V. http://doi.org/10.5438/0012