The scripts referred to in the instructions below are in the Kielipankki-konversio GitHub repository. They are also in the directory
From the perspective of users, the corpora and the search interface of the production Korp server
korp.csc.fi should preferably be updated at times when Korp is not used much. If the update does not take long, a service break is usually not needed. If a service break is needed, the users should be informed of it via the Korp newsdesk.
A simple way to ensure that Korp’s background CGI script processes are not running at the beginning of an update is to prepend the commands below with
pgrep korp ||; for example:
pgrep korp || korp-install-corpora.sh korpus
If the command outputs process ids, the corpus has not been installed. However, this approach does not guarantee that Korp processes are not started during the run time of the installation script. Moreover, this also prevents installation when Korp CGI scripts are run directly (via the Korp API), without the search interface.
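Since a one-off pgrep check simply skips the installation when Korp is busy, a small wrapper that polls until no matching process is running may be more convenient. The following is a minimal sketch, not part of the Kielipankki scripts: the function name run_when_idle, the retry limit and the polling interval are assumptions made for illustration. Like the one-line form, it cannot prevent Korp processes from starting while the installation is running.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Kielipankki scripts):
# run a command once no process matching a name pattern is running.
# Usage: run_when_idle PATTERN MAX_TRIES INTERVAL_SECONDS COMMAND [ARG ...]
run_when_idle () {
    pattern=$1
    max_tries=$2
    interval=$3
    shift 3
    try=1
    # pgrep exits with status 0 if at least one process matches
    while pgrep "$pattern" > /dev/null; do
        if [ "$try" -ge "$max_tries" ]; then
            echo "Processes matching '$pattern' still running; giving up" >&2
            return 1
        fi
        try=$((try + 1))
        sleep "$interval"
    done
    # No matching process found: run the given command
    "$@"
}
```

For example, run_when_idle korp 10 30 korp-install-corpora.sh korpus would check up to ten times, 30 seconds apart, before giving up.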
A user who has Korp open during an update needs to reload the Korp page after the update for the page to refresh.
A corpus should be installed before installing the accompanying changes to the corpus configuration. It is easiest to install and update corpora with the script
korp-install-corpora.sh from corpus packages created with the script
korp-install-corpora.sh [--package-dir=package_directory] corpus|corpus_package …
The installation script accepts as arguments the names of either logical corpora (corpus) or corpus package files (corpus_package). If the argument is corpus, the installation script searches the corpus package directory and its subdirectories for the most recent package for corpus. Likewise, if the name of corpus_package has no directory components, the script searches for corpus_package in package_directory and its subdirectories.
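The search for the most recent package can be pictured roughly as follows. This is only a sketch of the idea, not the installation script's actual logic: the file name pattern CORPUS_*.tgz, the function name newest_package and the use of file modification time to decide which package is newest are all assumptions.

```shell
#!/bin/sh
# Sketch only: print the most recently modified package file for a corpus
# found in a package directory or its subdirectories.
# The file name pattern CORPUS_*.tgz is an assumption, not necessarily
# the actual naming scheme of the corpus packages.
newest_package () {
    package_dir=$1
    corpus=$2
    # ls -t lists the files it is given newest first (by modification time);
    # head keeps only the first, that is, the newest one
    find "$package_dir" -type f -name "${corpus}_*.tgz" -exec ls -t {} + |
        head -n 1
}
```

Calling newest_package with a package directory and a corpus name prints the path of the newest matching file, or nothing if no file matches.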
The installation script installs the Corpus Workbench files in the corpus package and uploads data into the Korp MySQL database. Uploading data into the MySQL database requires defining the MySQL username and password for the Korp database either in the configuration file (
~/.my.cnf) or via the environment variables
The installation script installs the newest corpus package file it finds if it is newer than the package previously installed for the corpus or if the corpus has not yet been installed at all. (You can specify the option
--force to reinstall an already-installed package.) Information on the corpus packages installed with the script is in the file
/v/corpora/korp_installed_corpora.list. Each line of the file shows the timestamp of the installation, the name of the (logical) corpus, the name of the corpus package file, the timestamp of the corpus package and the username of the installer. (Corpora installed or updated by means other than this script are not shown in the file.)
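To check when and from which package a corpus was last installed, the list file can be filtered with standard tools. The sketch below assumes that the fields are tab-separated, in the order described above, and that new lines are appended at the end of the file; neither assumption is confirmed here, and the function name last_install is made up for illustration.

```shell
#!/bin/sh
# Sketch only: show the last recorded installation of a corpus.
# Assumptions: fields are tab-separated in the order
# install time, corpus, package file, package time, installer,
# and newer installations appear on later lines.
last_install () {
    list_file=$1
    corpus=$2
    awk -F '\t' -v corpus="$corpus" '
        $2 == corpus { last = $0 }   # later lines override earlier ones
        END { if (last != "") print last }
    ' "$list_file"
}
```

For example, last_install /v/corpora/korp_installed_corpora.list korpus would print the most recent line recorded for the corpus korpus, or nothing if the corpus has not been installed with the script.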
It is not necessary to download corpus packages onto the Korp server, as long as they can be retrieved over the network via SSH. In this case, package_directory begins with the name of the server followed by a colon and the directory on the server; for example,
--package-dir=puhti:/proj/clarin/korp/corpora/pkgs. Alternatively, the server name and colon can be prepended to corpus_package if it contains a full path.
If existing corpora have been updated, the Korp search cache
/v/korp/cache should be cleared. Unfortunately, it is not easy to clear only files corresponding to the updated corpora, so it is safest to clear the whole cache. You can do that with the script
korp-clear-cache.sh info count query timespan wordpicture
If access to the corpus is restricted (licence category ACA or RES) and the corpus package does not already contain the licence information (as packages nowadays should), the licence category needs to be specified in the MySQL database
korp_auth for each physical corpus. The information can be added with the script
/v/korp/authing/auth as follows:
/v/korp/authing/auth licence_category CORPUS
Here licence_category is either ACA or
RES, and CORPUS is the uppercase id of the physical corpus.
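When a logical corpus consists of several physical corpora, the command has to be repeated for each of them. The following wrapper is a minimal sketch of one way to do that; the function name set_licence_category and the AUTH_CMD variable are hypothetical, and only the /v/korp/authing/auth invocation itself comes from the instructions above.

```shell
#!/bin/sh
# Sketch only: set the same licence category for several physical corpora.
# AUTH_CMD defaults to the script named in the text; set it to "echo"
# for a dry run. The wrapper and variable names are hypothetical.
AUTH_CMD=${AUTH_CMD:-/v/korp/authing/auth}

set_licence_category () {
    licence_category=$1
    shift
    case $licence_category in
        ACA|RES) ;;
        *) echo "Licence category must be ACA or RES" >&2; return 2 ;;
    esac
    for corpus_id in "$@"; do
        # The auth script expects the uppercase id of the physical corpus
        upper_id=$(printf '%s' "$corpus_id" | tr '[:lower:]' '[:upper:]')
        $AUTH_CMD "$licence_category" "$upper_id" || return 1
    done
}
```

With AUTH_CMD=echo, calling set_licence_category ACA corp_a corp_b only prints the arguments that would be passed to the auth script, which is a cheap way to check the corpus ids first.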
After modifying Korp configuration (for example, adding a new corpus), it is simplest to update the Korp search interface directly from the GitHub repository with the script
korp-install.sh. This requires that the changes have been committed and pushed into the repository. The script is run as follows:
korp-install.sh frontend[:version] [target]
Here target is the name of a directory relative to the top Web content directory of the server,
/var/www/html, which contains the main production Korp. First install your configuration changes to a test instance (see below), and install them to the production Korp only after testing that they work. When installing to target, the installed Korp search interface is visible at
version is the version (commit) to be retrieved from the GitHub repository (a “refspec” in Git terms). It is typically the name of a branch or a tag in the repository. However, it may also be relative to the most recent commit of a branch or tag: for example,
master~2 means the third-newest commit of the
master branch. The default version is
master, that is, the most recent commit in the
To update the production Korp, version needs to be
master and target
/ (a single slash).
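To see how a relative version such as master~2 resolves, the following sketch builds a throwaway Git repository with three empty commits. It uses only standard Git commands and does not touch the Korp repository; the temporary repository and the commit messages are made up for illustration.

```shell
#!/bin/sh
# Illustration only: resolve "master~2" in a temporary repository.
repo=$(mktemp -d)
git init -q "$repo"
# Make sure the branch is called master regardless of Git's default name
git -C "$repo" symbolic-ref HEAD refs/heads/master
for msg in first second third; do
    git -C "$repo" -c user.name=demo -c user.email=demo@example.org \
        -c commit.gpgsign=false commit -q --allow-empty -m "$msg"
done
# master is the newest commit; master~2 is two steps back from it
newest=$(git -C "$repo" log -1 --format=%s master)
third_newest=$(git -C "$repo" log -1 --format=%s master~2)
# prints: master -> third, master~2 -> first
echo "master -> $newest, master~2 -> $third_newest"
rm -rf "$repo"
```

In the same way, korp-install.sh frontend:master~2 with a test target would install the search interface as it was two commits before the current tip of master.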
The script creates a backup copy of the version of the Korp search interface currently in use. The backup can be restored by using the option
korp-install.sh --revert frontend [target]
Here target must be the same as when updating the search interface.
Note that the script creates only a single backup copy, so that the backup created by a new update overwrites the previous backup.
The script also retrieves and installs possibly updated Korp news from the GitHub repository.
You can create personal or corpus-specific test instances of the Korp search interface on the Korp server. To create a test instance accessible at
cp -dpr /var/www/html/test-base /var/www/html/test-name
korp-install.sh frontend:branch test-name
Note that the test instances and the main production Korp share corpus data, so please be careful when testing updates to existing corpora. It is often best to install the updated corpora under new names, and possibly rename them later.