[Importing corpus data to Korp: technical documentation]

The computing environment for corpus processing

The primary environment for processing corpora to be imported to Korp is CSC’s computing environment. You may also process corpora on your own Linux workstation, but that is recommended mainly for corpora with free licences. Please note that the previously used local Korp test server nyklait-09-01.hum.helsinki.fi is being phased out, so it is no longer a recommended corpus processing or Korp environment.

Korp itself cannot be run in the computing environment, however, so to test your corpus in practice you need to do it elsewhere, on the Korp server. The Korp test servers are currently not functional. You may set up the Korp frontend on your own Linux server, but testing new corpora locally is currently difficult, since the Korp frontend assumes that all configured corpora are present on the Korp (backend) server used. (We have plans to fix this, though.)

If you wish to process corpora or install Korp locally, please note that you need a development version of Corpus Workbench (CWB), at least version 3.4.9, but preferably the latest one. (Korp does not work with the “stable” version 3.0.) See the IMS Open Corpus Workbench (CWB) page for information on accessing the CWB Subversion repository. In addition to the cwb section of the repository, you will also need cwb-perl to be able to use the korp-make script; cwb-doc contains documentation for both importing corpora and using the query language CQP.
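
The version requirement above can be checked with a short script. The following is only a sketch: it assumes that the cwb-config tool installed with CWB is on your PATH and that its --version option prints the bare version number, and it relies on the -V option of GNU sort. The function name version_at_least is made up for this example.

```shell
#!/bin/sh
# Succeed if version $1 is at least version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: cwb-config is on PATH and prints only the version number.
if command -v cwb-config >/dev/null 2>&1; then
    installed=$(cwb-config --version)
    if version_at_least "$installed" 3.4.9; then
        echo "CWB $installed is recent enough for Korp"
    else
        echo "CWB $installed is too old; at least 3.4.9 is needed"
    fi
else
    echo "cwb-config not found; is CWB installed and on your PATH?"
fi
```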

[TODO: More information on how to install CWB and other tools locally.]

Processing corpora in the computing environment

Directory structure

The directories related to Korp and corpora are under the CLARIN project directory /proj/clarin/, most of them under /proj/clarin/korp/. Note that you need to be in the user group clarin to access the directories. In the directory names below, corpus stands for the name of the corpus in question. The relevant directories are the following, relative to /proj/clarin:

Directory                               Description

Corpus data directories
korp/corpora/src/corpus/                Corpus data files for the corpus corpus in the source (non-VRT) format
korp/corpora/data/corpus/               CWB data file for corpus
korp/corpora/registry/                  CWB registry files for corpora
korp/corpora/pkgs/corpus/               Korp package files for corpus
korp/corpora/log/                       Log files, in particular for korp-make
korp/corpora/vrt/corpus/                VRT and other generated files for corpus
vrt-in/                                 VRT files to be parsed and NER-tagged
vrt-out/                                Parsed and NER-tagged VRT files produced from those in vrt-in

Code and other directories
korp/cwb/bin/                           Executables for the CWB
korp/git-work/Kielipankki-konversio/    A working copy of the Kielipankki-konversio GitHub repository, kept up-to-date with the repository. (The older directory name korp/git-work/korp-corpimport/ may also be used.)
korp/scripts/                           A symbolic link to korp/git-work/Kielipankki-konversio/scripts/ containing many general-purpose corpus processing scripts
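
Before starting to work, it may be useful to verify that you are in the group clarin and that the directories are visible to you. The following is a minimal sketch of such a check; the function name in_group and the selection of directories tested are only illustrative.

```shell
#!/bin/sh
# Succeed if the current user belongs to the group named in $1.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group clarin; then
    echo "You are a member of the group clarin"
else
    echo "You are not in the group clarin, so /proj/clarin is probably inaccessible"
fi

# Report the directories from the table that cannot be seen.
for dir in korp/corpora/registry korp/cwb/bin korp/scripts; do
    [ -d "/proj/clarin/$dir" ] || echo "Missing or inaccessible: /proj/clarin/$dir"
done
```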

Setting up your environment

To reduce the amount of typing when running corpus processing scripts, you should add the following directories to your PATH: /proj/clarin/korp/cwb/bin, /proj/clarin/korp/scripts and, if you use corpus-specific scripts for the corpus corpus, /proj/clarin/korp/git-work/Kielipankki-konversio/corp/corpus:


  PATH=$PATH:/proj/clarin/korp/cwb/bin:/proj/clarin/korp/scripts:/proj/clarin/korp/git-work/Kielipankki-konversio/corp/corpus

Alternatively, you may also use your own working copy of the Kielipankki-konversio GitHub repository, in particular if you use your own private branch for new conversion scripts before pushing them to the public repository.
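
If you switch between the shared working copy and your own, the PATH setting can be parametrized, for example in your shell startup file. The following is a sketch only: the variable names KONVERSIO_DIR and corpus_name, the placeholder value mycorpus and the function path_append are hypothetical and not part of the Korp tools. Note that it takes the scripts directory from the working copy itself instead of the symbolic link /proj/clarin/korp/scripts.

```shell
#!/bin/sh
# Hypothetical variables: point KONVERSIO_DIR at your own working copy if
# you have one, and set corpus_name to the corpus you are processing.
KONVERSIO_DIR=${KONVERSIO_DIR:-/proj/clarin/korp/git-work/Kielipankki-konversio}
corpus_name=${corpus_name:-mycorpus}

# Append directory $1 to PATH unless it is already listed there.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;
        *) PATH="$PATH:$1" ;;
    esac
}

path_append /proj/clarin/korp/cwb/bin
path_append "$KONVERSIO_DIR/scripts"
path_append "$KONVERSIO_DIR/corp/$corpus_name"
export PATH
```

Checking for duplicates keeps PATH from growing each time the startup file is read again.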

You may also use your own work directory /wrk/username/ in the computing environment for processing corpora, or for smaller corpora, your home directory. It is easiest if you set up under /wrk/username/corpora/ a subdirectory structure similar to that under /proj/clarin/korp/corpora/, as in the above table. In that case, you need to set the following environment variables to simplify corpus processing:


  export CORPUS_ROOT=/wrk/username/corpora
  export CORPUS_REGISTRY=$CORPUS_ROOT/registry
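
The subdirectory structure itself can be created in one go. The following sketch assumes that the subdirectory names directly under /proj/clarin/korp/corpora/ in the table above (src, data, registry, pkgs, log and vrt) are the ones needed; the function name make_corpus_dirs is made up for this example.

```shell
#!/bin/sh
# Create a corpus directory tree under $1, mirroring the layout of
# /proj/clarin/korp/corpora/, and export the related variables.
make_corpus_dirs() {
    CORPUS_ROOT=$1
    CORPUS_REGISTRY=$CORPUS_ROOT/registry
    export CORPUS_ROOT CORPUS_REGISTRY
    for sub in src data registry pkgs log vrt; do
        mkdir -p "$CORPUS_ROOT/$sub" || return 1
    done
}
```

For example, make_corpus_dirs /wrk/username/corpora would set up the tree in your work directory. The variables are exported so that the scripts and CWB tools started from the shell can see them.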

Please keep in mind that neither /proj/clarin nor your personal work directory is backed up, so valuable data and scripts should be copied elsewhere. Also note that files that have not been used in 90 days are deleted automatically from the personal work directory.
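
To see which files might be approaching the limit, you can list files that have not been accessed for a while. This is only a sketch: it assumes that “used” refers to the access time of a file, which the automatic clean-up may or may not use as its criterion, and the function name stale_files is made up for this example.

```shell
#!/bin/sh
# List regular files under directory $1 whose last access was
# more than $2 days ago.
stale_files() {
    find "$1" -type f -atime +"$2" -print
}
```

For example, stale_files /wrk/username 80 would list the files that have not been accessed in more than 80 days, which leaves some time for copying them elsewhere.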
