Language resource naming conventions

We use the CEAL parallel corpus here as an example: urn:nbn:fi:lb-2020012801

Resource name

Some corpora already have their official name fixed in the agreement. Check this before publishing a resource. The metadata in Metashare can be older than the agreement and may therefore contain an incorrect name. The name in the agreement can be adjusted to suit the version of the resource being published.

The name of the corpus has been decided in the agreement to be: ”Englantilaisen ja amerikkalaisen kirjallisuuden klassikoita Kersti Juvan suomentamina, englanti-suomi rinnakkaiskorpus”.

Translations

No English version of the name was given, so the name was translated as ”Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus”.

Different versions

The different versions of the resource get various fixed terms attached to the name. The four basic versions we provide are named as follows:

  • If the published version is the original deposited resource, or a minimally corrected one, distributed through the download-service, we add the word ”source”, ”lähdemateriaali”, or ”källmaterial” to the name.
    • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, source
  • If the published version is the Korp version, we add the word ”Korp” to the name.
    • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, Korp
  • If the published version is the VRT package exported from Korp, we add the acronym ”VRT” to the name. The VRT can be accessible from the download service or, for example, only through the Puhti shell.
    • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, VRT
  • If the published version is the LAT version, we add the acronym ”LAT” to the name.

If the version of the resource has been scrambled in some way to allow less restrictive licensing, we add the word ”scrambled”, ”sekoitettu”, or ”blandad” to the name before the word indicating the basic version. Example:

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, scrambled, Korp
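
The composition rule above can be sketched in a few lines of Python. This is only an illustration, not a tool used by the Language Bank: the function name long_name and the constant BASIC_VERSIONS are hypothetical, and only the English terms are handled.

```python
# Fixed terms for the four basic versions (English only, for brevity).
BASIC_VERSIONS = {"source", "Korp", "VRT", "LAT"}

def long_name(base_name, basic_version, scrambled=False):
    """Append the fixed terms to the base name.

    The optional "scrambled" term goes before the basic version term.
    """
    if basic_version not in BASIC_VERSIONS:
        raise ValueError(f"unknown basic version: {basic_version!r}")
    parts = [base_name]
    if scrambled:
        parts.append("scrambled")
    parts.append(basic_version)
    return ", ".join(parts)

CEAL = ("Classics of English and American Literature as translated by "
        "Kersti Juva, English-Finnish parallel corpus")

print(long_name(CEAL, "Korp", scrambled=True))
```

Calling long_name(CEAL, "VRT") gives the plain VRT name from the list above, and the scrambled call reproduces the example name ending in ”scrambled, Korp”.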

Resource shortname

The shortname is used in Metashare, the Portal, the Download service, and IDA. Shortnames are written entirely in lowercase letters. The allowed characters are ”a–z”, ”0–9” and ”-”.

The source version is indicated by ”-src”, the Korp version by ”-korp”, and the VRT version by ”-vrt”.

Examples:

  • Long name: Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, VRT
  • Shortname: ceal-par-vrt
  • In Download: ceal-par-vrt/ceal-par-vrt.zip (the zip should contain a directory with the same name, without the .zip extension: ceal-par-vrt)
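
As a rough sketch, the character rule and the download layout could be checked like this. The helper names is_valid_shortname and download_path are made up for this example, and the regular expression only encodes the allowed characters stated above, nothing more.

```python
import re

# Allowed characters per the convention: a-z, 0-9 and "-".
SHORTNAME_RE = re.compile(r"[a-z0-9-]+")

def is_valid_shortname(name):
    """Return True if the name uses only the allowed characters."""
    return SHORTNAME_RE.fullmatch(name) is not None

def download_path(shortname):
    """Build the path used in the example: <shortname>/<shortname>.zip"""
    return f"{shortname}/{shortname}.zip"

print(is_valid_shortname("ceal-par-vrt"))  # True
print(is_valid_shortname("CEAL_par"))      # False: uppercase and underscore
print(download_path("ceal-par-vrt"))       # ceal-par-vrt/ceal-par-vrt.zip
```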

Korp-name

In Korp, make the relation to the shortname as clear as possible, for example: ”ceal_par”. Korp source in IDA: ceal-par/ceal_par_korp_20150323.tgz
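
One simple way to keep that relation visible is to replace the hyphens of the shortname with underscores. This mapping is an assumption inferred from the single example here (”ceal-par” and ”ceal_par”), not a stated rule, and the function name korp_name is hypothetical.

```python
def korp_name(shortname_base):
    """Derive a Korp-style name from a shortname base.

    Assumption: hyphens simply become underscores, as in the
    ceal-par / ceal_par example.
    """
    return shortname_base.replace("-", "_")

print(korp_name("ceal-par"))  # ceal_par
```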

Social Media (SOME) hashtags

In social media, hashtags begin with ”lb_”, indicating a Language Bank resource. ”lb_” is followed by the common name of the resource family, such as:

  • lb_ceal

Versioning

If a new version of the corpus is created by adding to or modifying the original texts, the version information is added after the name of the corpus. Example:

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus version 2, VRT
  • ceal-par-v2-vrt

If we modify the attribute information in the VRT file, for example by re-parsing with a new parser and not including the old parse, we add the version information after the word VRT (or Korp, etc.). Examples:

  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus, VRT version 2
  • ceal-par-vrt-v2
  • Classics of English and American Literature as translated by Kersti Juva, English-Finnish parallel corpus version 2, VRT version 2
  • ceal-par-v2-vrt-v2
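
The two version positions can be illustrated with a small sketch. The function versioned_shortname and its parameter names are hypothetical; it only reproduces the shortname patterns shown in the examples above.

```python
def versioned_shortname(base, suffix, corpus_version=None, format_version=None):
    """Combine a shortname base, a version suffix and optional version numbers.

    corpus_version: the original texts were added to or modified,
                    so the marker goes right after the base.
    format_version: only the attribute information changed,
                    so the marker goes after the suffix (vrt, korp, ...).
    """
    parts = [base]
    if corpus_version is not None:
        parts.append(f"v{corpus_version}")
    parts.append(suffix)
    if format_version is not None:
        parts.append(f"v{format_version}")
    return "-".join(parts)

print(versioned_shortname("ceal-par", "vrt", corpus_version=2))
# ceal-par-v2-vrt
print(versioned_shortname("ceal-par", "vrt", format_version=2))
# ceal-par-vrt-v2
print(versioned_shortname("ceal-par", "vrt", corpus_version=2, format_version=2))
# ceal-par-v2-vrt-v2
```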