Corpus data publication for download at the Language Bank

Status 27.1.2020: DRAFT

Introduction

This guideline describes the minimal steps needed to prepare a corpus data publication for download at the Language Bank of Finland.

Name, short name and version

The corpus needs

  • a name (e.g. ”The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s) Downloadable Version”),
  • a version (if applicable: Major.Minor.Patch, e.g. 1.1.2, or year plus part of year, e.g. 2017H2 for the second half of 2017),
  • a short name containing version information (e.g. ”helpuhe1-dl”).
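
The two version schemes above can be checked mechanically. The following is a minimal sketch, not an official validator: the function name is hypothetical, and it assumes that the "part of year" suffix is always a half-year (H1 or H2), since 2017H2 is the only example given.

```python
import re

# Major.Minor.Patch, e.g. "1.1.2"
MAJOR_MINOR_PATCH = re.compile(r"\d+\.\d+\.\d+")
# Year plus part of year, e.g. "2017H2".
# Assumption: only half-years (H1, H2) are used as the suffix.
YEAR_PART = re.compile(r"\d{4}H[12]")


def is_valid_version(version: str) -> bool:
    """Return True if the string matches one of the two version schemes."""
    return bool(
        MAJOR_MINOR_PATCH.fullmatch(version) or YEAR_PART.fullmatch(version)
    )
```

If other part-of-year suffixes are in use, the second pattern would need to be extended accordingly.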

If an older version of the same corpus exists, decide whether to update the existing metadata description or to create new metadata. See our Lifecycle Model for details. The name of the corpus is shown in the ”Description” column of the page korp.csc.fi/download/, and the text should link to the metadata page at metashare.csc.fi. The name is essentially the same as the metadata long name of the corpus, shortened if the long name is too long. If the directory does not have a metadata page, just create a descriptive name for it (e.g. the semfinlex corpus has subcorpora that are grouped under a common directory).

The package

  • The package must contain only the relevant data: no .tmp directories or other leftovers.
  • The format is zip.
  • Zip file names start with the short name, omitting the redundant ”-dl”: ”short name-specifier.zip” (e.g. ”helpuhe1-annotations.zip”).
  • Packages need to contain subdirectories to extract to, usually named after the short name. The zip’s root directory should contain only directories, no files.
  • The subdirectory contains a README.txt and, optionally, a LICENSE.txt.
  • Upload the package to puhti.csc.fi:/proj/clarin/download/preview/ and inform kielipankki@csc.fi.
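
The packaging rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Language Bank's own tooling: the function names are hypothetical, and "no .tmp directories" is interpreted here as skipping any file or directory whose name ends in ".tmp".

```python
import zipfile
from pathlib import Path


def zip_file_name(short_name: str, specifier: str) -> str:
    """Build "<short name without -dl>-<specifier>.zip"."""
    if short_name.endswith("-dl"):
        short_name = short_name[: -len("-dl")]
    return f"{short_name}-{specifier}.zip"


def build_package(source_dir: Path, zip_path: Path, top_dir: str) -> None:
    """Zip the files in source_dir so that they all extract under top_dir/."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            relative = path.relative_to(source_dir)
            # Assumption: skip anything whose name, or the name of any
            # directory above it, ends in ".tmp".
            if any(part.endswith(".tmp") for part in relative.parts):
                continue
            if path.is_file():
                # Prefix every entry with top_dir so that the zip root
                # contains only a directory, never a file.
                archive.write(path, (Path(top_dir) / relative).as_posix())
```

For example, zip_file_name("helpuhe1-dl", "annotations") gives "helpuhe1-annotations.zip", matching the naming example in the list above.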

The license

The package must have a license that tells users what they can and cannot do with the corpus. Less restrictive licenses are preferred. The license should be stated in README.txt or in a separate LICENSE.txt file.

README.txt

The README.txt should contain at least the name of the corpus, the META-SHARE description, and a PID pointing to the META-SHARE article describing the resource. The license can be given in README.txt or in a separate LICENSE.txt. README.txt should also contain a short description of the corpus, including the directory and file naming scheme if the package contains several directories or files.
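
As an illustration, a README.txt with these items could be assembled as in the sketch below. The function name and the field labels ("Version:", "Metadata PID:", "License:") are assumptions made for this example; the guideline does not prescribe a particular layout.

```python
def render_readme(name, version, pid, license_name, description):
    """Assemble README.txt text from the required items."""
    lines = [
        name,
        "",
        # The labels below are illustrative, not a prescribed format.
        f"Version: {version}",
        f"Metadata PID: {pid}",
        f"License: {license_name}",
        "",
        description,
        "",
    ]
    return "\n".join(lines)
```

The description argument is the place for the short corpus description and, where relevant, the directory and file naming scheme.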

Descriptive metadata

The descriptive metadata describes a specific instance of the corpus. It is not a manual, but it helps a user searching for corpora to determine whether the corpus is worth downloading. The PID pointing to the metadata is the persistent identifier of the corpus version in question. The metadata in turn points to the download location of the corpus and explains where the manual can be found (e.g. inside the package or on a separate web page). Every update gets a new version number. The PID of the metadata needs to be mentioned in the README.txt of the downloadable packages.

Checklist

A quick reminder of the topics above.

  • Name
  • Version
  • License
  • Clean package in zip format
    • Check with unzip -l after zipping
  • Descriptive metadata (metashare.csc.fi)
  • PIDs (at least one to metadata)
  • README.txt contains
    • License (alternative: separate LICENSE.txt)
    • PID to metadata
    • Short description of corpus
    • Version number
  • Finalized packages uploaded to puhti.csc.fi:/proj/clarin/download/preview/
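
Some of the package-related items in this checklist can also be checked automatically, in addition to inspecting the listing with unzip -l. The sketch below is one possible approach, not an official tool: the function name is hypothetical, the ".tmp" rule is interpreted as "any name ending in .tmp", and README.txt is assumed to be UTF-8 text.

```python
import zipfile
from typing import List


def check_package(zip_path: str, pid: str) -> List[str]:
    """Return a list of problems found in the package (empty if none)."""
    problems = []
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        top_dirs = set()
        for name in names:
            parts = name.rstrip("/").split("/")
            # A file entry without any "/" sits directly in the zip root.
            if len(parts) == 1 and not name.endswith("/"):
                problems.append(f"file in zip root: {name}")
            else:
                top_dirs.add(parts[0])
            if any(part.endswith(".tmp") for part in parts):
                problems.append(f"temporary entry: {name}")
        for top in sorted(top_dirs):
            readme = f"{top}/README.txt"
            if readme not in names:
                problems.append(f"missing {readme}")
            elif pid not in archive.read(readme).decode("utf-8", errors="replace"):
                problems.append(f"{readme} does not mention the metadata PID")
    return problems
```

An empty result only means that none of these particular problems were found; the name, version, license and metadata still need to be checked by hand.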

Korp version vs. download package

A case example: The semfinlex corpus was first published in Korp with beta status and advertised in Korp. After it had been available for testing for two weeks, the beta status was removed, and no backward-incompatible changes to the corpus were allowed from then on. The download packages were created at this point. The corpus (including the freshly generated download packages) was then advertised to a wider audience in the portal.

Language

Most corpora have their name, README, metadata, etc. in English, but some are in Finnish.

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4140599 / +358 29 4129317