Importing corpus data to Korp: technical documentation

Korp is the web-based corpus search interface used in Kielipankki – The Language Bank of Finland. Korp has been developed, and continues to be developed, by Språkbanken at the University of Gothenburg. It is built on top of the IMS Open Corpus Workbench (CWB) corpus search software.

End users, corpus owners and corpus compilers cannot themselves add corpora to Korp or modify existing ones; this is done by the staff of FIN-CLARIN and Kielipankki. The amount of work required depends on the format of the original corpus data: the closer the format is to the input format of CWB, the less work is required.
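To give a rough idea of what "close to the input format of CWB" means, the following minimal sketch writes tokenized text in a verticalized (one token per line) layout with tab-separated token attributes and XML-style structure tags. The function name `to_vrt`, the structure names `text` and `sentence`, and the attribute set (word, lemma, part of speech) are assumptions made for illustration only; the attributes and structures actually used for a given corpus in Korp may differ.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vrt(title, sentences):
    """Return a verticalized-text string for one text.

    `sentences` is a list of sentences; each sentence is a list of
    (word, lemma, pos) tuples. The structure and attribute names are
    hypothetical, chosen only for this illustration.
    """
    lines = [f"<text title={quoteattr(title)}>"]
    for sentence in sentences:
        lines.append("<sentence>")
        for token in sentence:
            # One token per line, attributes separated by tabs;
            # escape &, < and > so tokens cannot be mistaken for tags.
            lines.append("\t".join(escape(field) for field in token))
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [[("Dogs", "dog", "N"), ("bark", "bark", "V")]]
    print(to_vrt("Example", example), end="")
```

Data that already has one token per line with its annotations needs little more than this kind of wrapping, whereas running text must first be tokenized and possibly annotated.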

The Korp production server is hosted at CSC. New corpora will be prepared in CSC’s computing environment and, possibly in the future, on a dedicated Korp test server at CSC. (The old local test server is being phased out, so its use is deprecated.) At present, corpora and corpus configurations can be installed on the Korp production server by Kielipankki staff at CSC and by Jyrki Niemi.

The following pages contain documentation for importing corpora to the Korp corpus search service. Most of the documentation is fairly technical.

Note that this documentation is currently under construction. For more information, you may refer to the old Finnish documentation, but please note that it is outdated in places and incomplete.