Language resource naming conventions

We use the CEAL parallel corpus here as an example: urn:nbn:fi:lb-2020012801

Resource name

For some corpora, the official name has already been decided in the agreement. Check this before publishing a resource: the preliminary metadata in COMEDI may have been created before the agreement was finalised and may therefore contain an incorrect name. The name in the agreement can be adjusted to suit the version of the resource being published.

The name of the corpus has been decided in the agreement to be: ”Englantilaisen ja amerikkalaisen kirjallisuuden klassikoita Kersti Juvan suomentamina, englanti-suomi rinnakkaiskorpus”.

Translations

No English version of the name was given, so the name was translated as ”Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus”.

Different versions

The different versions of the resource get various fixed terms attached to the name. The three basic versions we provide are named as follows:

If the published version is the original deposited resource, or a minimally corrected one, distributed through the Download service, we add the word ”source”, ”lähdemateriaali”, or ”källmaterial” to the name.

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, source

If the published version is the Korp version, we add the word ”Korp” to the name.

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, Korp

If the published version is the VRT package exported from Korp, we add the acronym ”VRT” to the name. The VRT package may be available from the Download service or, for example, only on the Puhti shell.

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, VRT

Additions

If the version of the resource has been scrambled in some way to allow less restrictive licensing, we add the word ”scrambled”, ”sekoitettu”, or ”blandad” to the name before the word indicating the basic version. Example:

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, scrambled, Korp

If the corpus is a copy (or version) of a resource published through another service already, we add the words ’Kielipankki version’ or ’Kielipankin versio’ to the name. Example:

  • The Finnish sub-corpus of the Classics Library of the National Library of Finland – Kielipankki version, VRT

If the corpus is part of a growing or otherwise changing resource, we add a date to the name. The date indicates how recent the newest data in the corpus is. Example:

  • The Coronavirus Corpus – Kielipankki version 2021-05, source
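The rules above for composing a long name can be sketched in Python. This is only an illustration, not an official tool: the function name `compose_long_name`, its parameters, and the lookup tables are hypothetical. The sketch also assumes that a date is placed directly after the name (or after the words ”Kielipankki version”), as in the one example above. No Swedish form of ”Kielipankki version” is given in this guideline, so none is included.

```python
# Fixed terms taken from the guideline text; the table layout is an assumption.
VERSION_TERMS = {
    "source": {"en": "source", "fi": "lähdemateriaali", "sv": "källmaterial"},
    "korp": {"en": "Korp", "fi": "Korp", "sv": "Korp"},
    "vrt": {"en": "VRT", "fi": "VRT", "sv": "VRT"},
}
SCRAMBLED_TERMS = {"en": "scrambled", "fi": "sekoitettu", "sv": "blandad"}
KIELIPANKKI_TERMS = {"en": "Kielipankki version", "fi": "Kielipankin versio"}


def compose_long_name(base, version, lang="en", scrambled=False,
                      kielipankki_copy=False, date=None):
    """Build a long resource name from a base name and the fixed terms."""
    name = base
    if kielipankki_copy:
        # Copy of a resource already published through another service.
        name += " – " + KIELIPANKKI_TERMS[lang]
    if date:
        # Growing or changing resource: date of the most recent data.
        name += " " + date
    parts = [name]
    if scrambled:
        # "scrambled" goes before the word indicating the basic version.
        parts.append(SCRAMBLED_TERMS[lang])
    parts.append(VERSION_TERMS[version][lang])
    return ", ".join(parts)


print(compose_long_name("The Coronavirus Corpus", "source",
                        kielipankki_copy=True, date="2021-05"))
```

The call at the end reproduces the Coronavirus Corpus example name shown above.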

Resource shortname

The shortname is used in COMEDI, the Portal, the Download service, and in IDA. Shortnames are written completely in lowercase letters. Characters allowed are ”a–z”, ”0–9” and ”-”.

The source version is indicated by ”-src”, Korp version by ”-korp”, and the VRT version by ”-vrt”.

Examples:

  • Long name: Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, VRT
  • Shortname: ceal-par-vrt
  • In Download: ceal-par-vrt/ceal-par-vrt.zip (the zip file should contain a directory with the same name, minus the .zip extension: ceal-par-vrt)

  • Long name: The Coronavirus Corpus – Kielipankki version 2021-05, source
  • Shortname: coronavirus-2021-05-src
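A minimal sketch of how the shortname rules could be checked automatically. The helper names (`is_valid_shortname`, `basic_version`, `download_path`) are hypothetical and not part of any Kielipankki tool. The Download path pattern is inferred from the single example above, and the optional trailing ”-v<number>” is allowed so that versioned shortnames such as ceal-par-vrt-v2 are also recognised.

```python
import re

# Allowed characters: lowercase a-z, digits 0-9 and the hyphen.
SHORTNAME_RE = re.compile(r"[a-z0-9-]+")

# Basic version suffix, optionally followed by a version marker like "-v2".
SUFFIX_RE = re.compile(r"-(src|korp|vrt)(?:-v\d+)?$")
SUFFIX_MEANING = {"src": "source", "korp": "Korp", "vrt": "VRT"}


def is_valid_shortname(shortname):
    """Return True if the shortname uses only the allowed characters."""
    return SHORTNAME_RE.fullmatch(shortname) is not None


def basic_version(shortname):
    """Return 'source', 'Korp' or 'VRT', or None if no suffix is found."""
    match = SUFFIX_RE.search(shortname)
    return SUFFIX_MEANING[match.group(1)] if match else None


def download_path(shortname):
    """Path of the zip package in Download (pattern assumed from the example)."""
    return f"{shortname}/{shortname}.zip"


for name in ("ceal-par-vrt", "coronavirus-2021-05-src", "CEAL_par"):
    print(name, is_valid_shortname(name), basic_version(name))
```

The third name in the loop is a deliberately invalid one: it contains uppercase letters and an underscore, so the check rejects it.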

Korp-name

In Korp, make the relation to the shortname as clear as possible, for example: ”ceal_par”. Korp source in IDA: ceal-par/ceal_par_korp_20150323.tgz
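One way to keep the Korp name close to the shortname is to derive it mechanically. The sketch below is an assumption, not a stated rule: it drops the version suffix and replaces hyphens with underscores, which happens to reproduce ”ceal_par”. The IDA file name pattern is likewise guessed from the single example above, reading the number as a date in YYYYMMDD form. The function names are hypothetical.

```python
import re
from datetime import date

# Version suffix of a shortname, e.g. "-vrt" or "-korp-v2".
SUFFIX_RE = re.compile(r"-(src|korp|vrt)(-v\d+)?$")


def family_shortname(shortname):
    """Strip the version suffix: 'ceal-par-vrt' becomes 'ceal-par'."""
    return SUFFIX_RE.sub("", shortname)


def korp_name(shortname):
    """Derive a Korp name by replacing hyphens with underscores (assumed rule)."""
    return family_shortname(shortname).replace("-", "_")


def ida_korp_package(shortname, packaged_on):
    """Path of the Korp source package in IDA (pattern assumed from one example)."""
    base = family_shortname(shortname)
    return f"{base}/{korp_name(shortname)}_korp_{packaged_on:%Y%m%d}.tgz"


print(korp_name("ceal-par-vrt"))
print(ida_korp_package("ceal-par-korp", date(2015, 3, 23)))
```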

Social Media (SOME) hashtags

In SOME, the hashtags begin with ”lb_”, indicating a Language Bank resource. ”lb_” is followed by the common name of the resource family, such as:

  • lb_ceal

Versioning

If a new version of the corpus is created by adding to or modifying the original texts, the version information is added after the name of the corpus. Example:

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus version 2, VRT
  • ceal-par-v2-vrt

If we modify the attribute information in the VRT file, for example by re-parsing the corpus with a new parser and leaving out the old parse, we add the version information after the word VRT (or Korp, etc.). Examples:

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, VRT version 2
  • ceal-par-vrt-v2
  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus version 2, VRT version 2
  • ceal-par-v2-vrt-v2
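The two kinds of version markers can be combined as in the following sketch. It is illustrative only: the function and parameter names are hypothetical, with `text_version` standing for a change to the original texts and `annotation_version` for a change to the attribute information.

```python
def versioned_shortname(base, kind, text_version=None, annotation_version=None):
    """Build a shortname such as 'ceal-par-v2-vrt-v2'."""
    parts = [base]
    if text_version is not None:
        # Texts changed: the marker goes after the corpus name.
        parts.append(f"v{text_version}")
    parts.append(kind)
    if annotation_version is not None:
        # Attributes changed: the marker goes after the version word.
        parts.append(f"v{annotation_version}")
    return "-".join(parts)


def versioned_long_name(base, kind_term, text_version=None, annotation_version=None):
    """Build the matching long name, e.g. '... version 2, VRT version 2'."""
    name = base if text_version is None else f"{base} version {text_version}"
    term = (kind_term if annotation_version is None
            else f"{kind_term} version {annotation_version}")
    return f"{name}, {term}"


print(versioned_shortname("ceal-par", "vrt", text_version=2))
print(versioned_shortname("ceal-par", "vrt", annotation_version=2))
print(versioned_shortname("ceal-par", "vrt", text_version=2, annotation_version=2))
```

The three calls correspond to the three shortname examples listed above.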