We use the CEAL parallel corpus here as an example: urn:nbn:fi:lb-2020012801
For some corpora, the official name has already been decided in the agreement. Check this before publishing a resource: the metadata in Metashare can be older than the agreement and may therefore contain an incorrect name. The name in the agreement can be adjusted to suit the version of the resource being published.
The name of the corpus has been decided in the agreement to be: ”Englantilaisen ja amerikkalaisen kirjallisuuden klassikoita Kersti Juvan suomentamina, englanti-suomi rinnakkaiskorpus”.
No English version of the name was given, so the name was translated as ”Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus”.
The different versions of the resource get various fixed terms attached to the name. The three basic versions we provide are named as follows:
If the published version is the original deposited resource, or a minimally corrected one, distributed through the Download service, we add the word ”source”, ”lähdemateriaali”, or ”källmaterial” to the name.
If the published version is the Korp version, we add the word ”Korp” to the name.
If the published version is the VRT package exported from Korp, we add the acronym ”VRT” to the name. The VRT package can be available from the Download service or, for example, only from the Puhti shell.
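The three rules above can be sketched as a small lookup in Python. This is only an illustration: the names `VERSION_TERMS` and `versioned_name` are hypothetical, and the comma separator between the name and the term is an assumption, since the guideline does not state how the term is attached to the name.

```python
# Terms attached to the resource name for each basic version, per language.
# Korp and VRT use the same term in every language.
VERSION_TERMS = {
    "source": {"en": "source", "fi": "lähdemateriaali", "sv": "källmaterial"},
    "korp": {"en": "Korp", "fi": "Korp", "sv": "Korp"},
    "vrt": {"en": "VRT", "fi": "VRT", "sv": "VRT"},
}


def versioned_name(base_name: str, version: str, lang: str = "en") -> str:
    """Return base_name with the fixed term for the given version appended.

    The ", " separator is an assumption made for this sketch.
    """
    try:
        term = VERSION_TERMS[version][lang]
    except KeyError:
        raise ValueError(f"unknown version/language: {version!r}/{lang!r}") from None
    return f"{base_name}, {term}"
```

Calling `versioned_name(name, "source", "fi")` would append ”lähdemateriaali” to the Finnish name, while an unknown version or language raises a `ValueError` instead of silently producing a malformed name.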
If the version of the resource is somehow scrambled to allow less restrictive licencing, we add the word ”scrambled”, ”sekoitettu”, or ”blandad” to the name before the word indicating the basic version. Example:
If the corpus is a copy (or version) of a resource published through another service already, we add the words ’Kielipankki version’ or ’Kielipankin versio’ to the name. Example:
If the corpus is part of a growing or otherwise changing resource, we add a date to the name, which indicates the most recent data in the corpus. Example:
The shortname is used in Metashare, the Portal, the Download service, and IDA. Shortnames are written entirely in lowercase letters. The allowed characters are ”a–z”, ”0–9” and ”-”.
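The character rule can be checked mechanically. A minimal sketch in Python, where the names `SHORTNAME_RE` and `is_valid_shortname` are hypothetical and the requirement of at least one character is an assumption:

```python
import re

# Lowercase ASCII letters, digits and hyphens only; at least one character
# is required (the minimum length is an assumption of this sketch).
SHORTNAME_RE = re.compile(r"[a-z0-9-]+")


def is_valid_shortname(shortname: str) -> bool:
    """Return True if shortname uses only the allowed characters."""
    return SHORTNAME_RE.fullmatch(shortname) is not None
```

With this check, a name containing an underscore or an uppercase letter is rejected, which helps keep shortnames distinct from Korp identifiers.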
The source version is indicated by ”-src”, Korp version by ”-korp”, and the VRT version by ”-vrt”.
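The suffix rule can be sketched as a simple mapping. The names `VERSION_SUFFIXES` and `versioned_shortname` are hypothetical, and the base shortname ”ceal-par” used in the comment is assumed from the IDA folder name rather than stated as the official shortname:

```python
# Shortname suffix for each basic version of a resource.
VERSION_SUFFIXES = {"source": "-src", "korp": "-korp", "vrt": "-vrt"}


def versioned_shortname(base_shortname: str, version: str) -> str:
    """Append the version suffix, e.g. "ceal-par" + "-src"."""
    if version not in VERSION_SUFFIXES:
        raise ValueError(f"unknown version: {version!r}")
    return base_shortname + VERSION_SUFFIXES[version]
```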
In Korp, make the relation to shortname as clear as possible, for example: ”ceal_par”. Korp source in IDA: ceal-par/ceal_par_korp_20150323.tgz
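One way to keep the Korp identifier visibly related to the shortname is to derive it mechanically. The sketch below assumes that hyphens are replaced with underscores and that the number in the archive name is a date in YYYYMMDD form; both are inferred from the single example above, and the function names `korp_id` and `ida_korp_source_path` are hypothetical.

```python
from datetime import date


def korp_id(shortname: str) -> str:
    """Derive a Korp identifier by replacing hyphens with underscores (assumed rule)."""
    return shortname.replace("-", "_")


def ida_korp_source_path(shortname: str, stamp: date) -> str:
    """Build an IDA path of the form <shortname>/<korp_id>_korp_<YYYYMMDD>.tgz.

    The pattern is inferred from one example and is not an official rule.
    """
    return f"{shortname}/{korp_id(shortname)}_korp_{stamp:%Y%m%d}.tgz"
```

For the shortname ”ceal-par” and the date 23 March 2015, this reproduces the path given in the example.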
On social media, the hashtags begin with ”lb_”, indicating a Language Bank resource. ”lb_” is followed by the common name of the resource family, such as:
If a new version of the corpus is created by adding to or modifying the original texts, the version information is added after the name of the corpus. Example:
If we modify the attribute information in the VRT file, for example by re-parsing with a new parser and not including the output of the old one, we add the version information after the word VRT (or Korp, etc.). Examples: