Installing and updating Korp corpora and configuration on CSC’s Korp server

The scripts referred to in the instructions below are in the Kielipankki-konversio GitHub repository. They are also in the directory /v/korp/scripts on korp.csc.fi.

Updating Korp production server and its corpora

From the perspective of users, the corpora and the search interface of the production Korp server korp.csc.fi should preferably be updated at times when Korp is not used much. If the update does not take long, a service break is usually not needed. If a service break is needed, the users should be informed of it via the Korp newsdesk.

A simple way to ensure that Korp’s background CGI script processes are not running at the beginning of an update is to prepend pgrep korp || to the commands below; for example:

   pgrep korp || korp-install-corpus.sh corpus

If the command prints process ids, Korp processes were running and the corpus was not installed. However, this approach does not guarantee that no Korp processes are started while the installation script is running. Moreover, it also prevents installation while Korp CGI scripts are being run directly (via the Korp API), without the search interface.
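The pgrep guard above checks only once. As a minimal sketch, the check could be retried a few times before giving up. The function name run_when_idle and its parameters are hypothetical and not part of the Korp scripts, and the limitation described above still applies: Korp processes may start after the check has passed.

```shell
# Hypothetical helper (not part of the Korp scripts):
# run_when_idle PATTERN MAX_TRIES DELAY COMMAND [ARG...]
# Runs COMMAND once pgrep finds no process matching PATTERN,
# checking at most MAX_TRIES times, DELAY seconds apart.
run_when_idle() {
    pattern=$1
    max_tries=$2
    delay=$3
    shift 3
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if ! pgrep "$pattern" > /dev/null; then
            # No matching process right now: run the command
            # and pass on its exit status
            "$@"
            return
        fi
        echo "Try $try/$max_tries: processes matching '$pattern' are running" >&2
        try=$((try + 1))
        # Wait before the next check, but not after the last one
        [ "$try" -le "$max_tries" ] && sleep "$delay"
    done
    # Gave up: the command was never run
    return 1
}

# Example (assumed usage): check up to 10 times, 30 seconds apart
#   run_when_idle korp 10 30 korp-install-corpus.sh corpus
```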

A user who has Korp open during an update needs to reload the Korp page after the update for the page to refresh.

Installing corpora

A corpus should be installed before installing the accompanying changes to the corpus configuration. It is easiest to install and update corpora with the script korp-install-corpora.sh from corpus packages created with the script korp-make-corpus-package.sh (or korp-make):

   korp-install-corpus.sh [--package-dir=package_directory] corpus|corpus_package

The installation script accepts as arguments the names of either logical corpora (corpus) or corpus package files (corpus_package). If the argument is corpus, the installation script searches the corpus package directory and its subdirectories for the most recent package for corpus. Likewise, if the name of corpus_package has no directory components, the script searches for corpus_package in package_directory and its subdirectories.

The installation script installs the Corpus Workbench files in the corpus package and uploads data into the Korp MySQL database. Uploading data into the MySQL database requires defining the MySQL username and password for the Korp database either in the configuration file (~/.my.cnf) or via the environment variables KORP_MYSQL_USER and KORP_MYSQL_PASSWORD.
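Before starting an installation, it may be useful to check that one of the two credential sources is available. The following is a sketch only: the function name korp_mysql_credential_source is hypothetical, and the order of the checks (environment first, then file) is an assumption, not necessarily the precedence used by the installation script.

```shell
# Hypothetical pre-flight check (not part of the Korp scripts).
# Prints where MySQL credentials for Korp would be found:
# "environment", "file" or "none" (with a non-zero exit status).
# The optional argument overrides the path of the configuration file.
korp_mysql_credential_source() {
    cnf=${1:-$HOME/.my.cnf}
    if [ -n "${KORP_MYSQL_USER:-}" ] && [ -n "${KORP_MYSQL_PASSWORD:-}" ]; then
        # Both environment variables are set and non-empty
        echo "environment"
    elif [ -r "$cnf" ]; then
        # The file is readable; its contents are not validated here
        echo "file"
    else
        echo "none"
        return 1
    fi
}

# Example (assumed usage):
#   korp_mysql_credential_source > /dev/null || echo "No MySQL credentials" >&2
```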

The installation script installs the newest corpus package file it finds if it is newer than the package previously installed for the corpus or if the corpus has not yet been installed at all. (You can specify the option --force to reinstall an already-installed package.) Information on the corpus packages installed with the script is in the file /v/corpora/korp_installed_corpora.list. Each line of the file shows the timestamp of the installation, the name of the (logical) corpus, the name of the corpus package file, the timestamp of the corpus package and the username of the installer. (Corpora installed or updated by other means than this script are not shown in the file.)
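The list file can be queried to see which package was last installed for a corpus. The sketch below assumes that the fields on each line are separated by tabs and that lines are appended in chronological order; neither is confirmed by the description above, so check the actual file first. The function name last_install is hypothetical.

```shell
# Hypothetical helper (not part of the Korp scripts):
# last_install CORPUS [LIST_FILE]
# Prints the last line in LIST_FILE whose second field is CORPUS.
# ASSUMPTIONS: tab-separated fields; newest entries at the end.
last_install() {
    corpus=$1
    list=${2:-/v/corpora/korp_installed_corpora.list}
    awk -F '\t' -v c="$corpus" '
        $2 == c { last = $0 }    # remember the latest matching line
        END {
            if (last == "") exit 1    # corpus not found in the list
            print last
        }
    ' "$list"
}

# Example (assumed usage): show only the package file name (third field)
#   last_install corpus | cut -f3
```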

It is not necessary to download corpus packages onto the Korp server, as long as they can be retrieved over the network via SSH. In this case, package_directory begins with the name of the server followed by a colon and the directory on the server, for example, --package-dir=puhti:/proj/clarin/korp/corpora/pkgs. Alternatively, the server name and colon can be prepended to corpus_package if it contains a full path.
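To illustrate the server:directory convention, the following sketch splits such a value into its host and directory parts. It is not taken from the installation script: the function name split_package_dir is hypothetical, and treating values that begin with a slash as local paths is an assumption.

```shell
# Hypothetical illustration (not part of the Korp scripts):
# split_package_dir SPEC
# Prints "host=HOST dir=DIR"; HOST is empty for a local directory.
split_package_dir() {
    spec=$1
    case $spec in
        /*)
            # ASSUMPTION: an absolute path is local even if it contains a colon
            printf 'host= dir=%s\n' "$spec"
            ;;
        *:*)
            # Text before the first colon is the server, the rest the directory
            printf 'host=%s dir=%s\n' "${spec%%:*}" "${spec#*:}"
            ;;
        *)
            printf 'host= dir=%s\n' "$spec"
            ;;
    esac
}

# Example:
#   split_package_dir puhti:/proj/clarin/korp/corpora/pkgs
```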

Clearing Korp search cache

If existing corpora have been updated, the Korp search cache /v/korp/cache should be cleared. Unfortunately, it is not easy to clear only files corresponding to the updated corpora, so it is safest to clear the whole cache. You can do that with the script korp-clear-cache.sh:

  korp-clear-cache.sh info count query timespan wordpicture

Restricted corpora (CLARIN ACA, CLARIN RES)

If access to a corpus is restricted (licence category ACA or RES) and the corpus package does not already contain the licence information (as packages nowadays should), the information needs to be specified in the MySQL database korp_auth for each physical corpus. It can be added with the script /v/korp/authing/auth as follows:

   /v/korp/authing/auth licence_category CORPUS

Here licence_category is either ACA or RES, and CORPUS is the uppercase id of the physical corpus.
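When a logical corpus consists of several physical corpora, the command has to be repeated for each of them. The sketch below only prints the commands, so they can be reviewed before being run; the function name auth_commands is hypothetical, and the corpus ids in the example are placeholders.

```shell
# Hypothetical helper (not part of the Korp scripts):
# auth_commands LICENCE_CATEGORY CORPUS_ID...
# Prints one /v/korp/authing/auth command per physical corpus,
# converting each corpus id to uppercase. Nothing is executed.
auth_commands() {
    licence=$1
    shift
    case $licence in
        ACA|RES) ;;
        *)
            echo "licence category must be ACA or RES" >&2
            return 1
            ;;
    esac
    for corpus in "$@"; do
        corpus_id=$(printf '%s' "$corpus" | tr '[:lower:]' '[:upper:]')
        printf '/v/korp/authing/auth %s %s\n' "$licence" "$corpus_id"
    done
}

# Example (placeholder corpus ids): review the output, then pipe it to sh
#   auth_commands ACA corpus_a corpus_b
#   auth_commands ACA corpus_a corpus_b | sh
```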

Updating the Korp search interface

After modifying Korp configuration (for example, adding a new corpus), it is simplest to update the Korp search interface directly from the GitHub repository with the script korp-install.sh. This requires that the changes have been committed and pushed into the repository. The script is run as follows:

   korp-install.sh frontend[:version] [target]

Here target is the name of a directory relative to the top Web content directory of the server, /var/www/html, which contains the main production Korp. First install your configuration changes to a test instance (see below), and install them to the production Korp only after verifying that they work in the test instance. When installing to target, the installed Korp search interface is visible at https://korp.csc.fi/target/.

version is the version (commit) to be retrieved from the GitHub repository (a “revision” in Git terms). It is typically the name of a branch or a tag in the repository. However, it may also be relative to the most recent commit of a branch or tag: for example, master~2 means the commit two steps before the most recent commit of the master branch, typically its third-newest commit. The default version is master, that is, the most recent commit in the master branch.

To update the production Korp, version needs to be master and target / (a single slash).
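The rule for production updates can be sketched as a small guard that builds the korp-install.sh command line and refuses any version other than master for the target /. The function name korp_install_command is hypothetical, and the function only prints the command instead of running it.

```shell
# Hypothetical helper (not part of the Korp scripts):
# korp_install_command VERSION TARGET
# Prints the korp-install.sh command for the given version and target,
# but refuses to combine the production target (/) with a version
# other than master.
korp_install_command() {
    version=$1
    target=$2
    if [ "$target" = "/" ] && [ "$version" != "master" ]; then
        echo "production (target /) must be installed from master" >&2
        return 1
    fi
    printf 'korp-install.sh frontend:%s %s\n' "$version" "$target"
}

# Example (assumed usage): test instance first, then production
#   korp_install_command my-branch test-name
#   korp_install_command master /
```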

The script creates a backup copy of the version of the Korp search interface currently in use. The backup can be restored by using the option --revert:

   korp-install.sh --revert frontend [target]

Here target must be the same as when updating the search interface.

Note that the script creates only a single backup copy, so that the backup created by a new update overwrites the previous backup.

The script also retrieves and installs possibly updated Korp news from the GitHub repository.

The script korp-install.sh works only for updating the Korp JavaScript search interface. The CGI scripts need to be updated by hand.

Creating a test instance of the Korp search interface

You can create personal or corpus-specific test instances of the Korp search interface on the Korp server. To create a test instance accessible at https://korp.csc.fi/test-name/:

  1. Copy the Korp base files to /var/www/html/test-name:
        cp -dpr /var/www/html/test-base /var/www/html/test-name

  2. Install the Korp search interface from the Git repository branch branch:
        korp-install.sh frontend:branch test-name


Note that the test instances and the main production Korp share corpus data, so please be careful when testing updates to existing corpora. It is often best to create the updated corpora under new names and possibly rename them later.
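The two steps for creating a test instance can be combined into one function that also refuses to overwrite an existing directory. This is a sketch under assumptions: the function name create_test_instance and the environment variables KORP_WEBROOT and KORP_INSTALL are hypothetical and exist only to make the sketch easy to try out outside the Korp server. The cp options -dpr are the ones used in step 1 and require a cp that supports -d, such as GNU cp.

```shell
# Hypothetical helper (not part of the Korp scripts):
# create_test_instance NAME BRANCH
# Copies the Korp base files to a new test instance directory and
# installs the search interface from BRANCH into it.
# KORP_WEBROOT and KORP_INSTALL are hypothetical overrides for the
# Web content directory and the installation command.
create_test_instance() {
    name=$1
    branch=$2
    webroot=${KORP_WEBROOT:-/var/www/html}
    installer=${KORP_INSTALL:-korp-install.sh}
    if [ -e "$webroot/$name" ]; then
        # Do not copy into (or over) an existing instance
        echo "$webroot/$name already exists" >&2
        return 1
    fi
    # Step 1: copy the base files; step 2: install the search interface
    cp -dpr "$webroot/test-base" "$webroot/$name" &&
        "$installer" "frontend:$branch" "$name"
}

# Example (assumed usage):
#   create_test_instance test-name branch
```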
