[Importing corpus data to Korp: technical documentation]

Using Git for the Korp source code and corpus conversion scripts

Version control and Korp

A version control system keeps track of who made (committed) what changes, where and when. It saves the stages of development of a file or a collection of files. This is particularly useful when files are modified by several people, possibly in parallel: version control makes it easier to view and often also to merge the changes made by different people. Different versions of a file can be compared and, if necessary, the content of a file can be restored to a previous state.

The Korp frontend, the Korp backend and the corpus conversion scripts are version-controlled with Git. In contrast, the corpus data itself is not. Typically, a version control system is used to store only files that are not generated automatically from other files. For example, compiled program files and corpus files imported to Korp are not stored in version control. In addition, Git is not well suited to storing very large files, such as corpus files.

The Git version control system

For the version control of Korp, we use Git, which is an open-source, distributed version control system. Git is available for Unix/Linux, Mac OS X and Windows.

Git can be used in CSC’s computing environment and on the Korp server at CSC. Please note that the default Git is an old version (1.7.1). Although it can be used for Korp repositories, you can switch to a newer one with the command module load git/2.12.1. (Version 1.9.2 is also available.) If you process corpora on your own computer, you should have Git installed.

At its core, Git is command-line software, and this page contains instructions on using Git on the command line. However, there are a number of different GUIs for Git, and Git is supported (to various degrees) in Emacs and development environments such as Eclipse. The GitHub site used for hosting the Korp Git repositories also has a separate client, GitHub Desktop, available for Windows and Mac OS X.

The Git command-line commands are of the form git command [options] [arguments]. The command git help command gives help (a manual page) on the command git command. In addition, git help tutorial shows the Git tutorial that is a part of the Git documentation.

It may also be helpful to read at least parts of more comprehensive Git guides, in particular if you are new to version control or Git.

The Git repositories for Korp

Kielipankki currently has four Git repositories (collections of files) related to Korp, all hosted on GitHub:

Kielipankki-utilities: conversion scripts, other corpus processing scripts and other utility scripts
Kielipankki-korp-frontend: the Korp frontend, a fork of Språkbanken’s korp-frontend with the corpora of Kielipankki and some modifications
Kielipankki-korp-backend: the Korp backend, a fork of Språkbanken’s korp-backend with Kielipankki’s modifications
Kielipankki-annotlab: the Kielipankki annotation laboratory

[2020-02-27: Note that the repositories Kielipankki-korp-frontend and Kielipankki-korp-backend were previously named korp-frontend and korp-backend, respectively. The repository Kielipankki-utilities is the public successor of the previous private Kielipankki-konversio repository.]

The first three repositories are public, the last one private. If you cannot access them or if you can only read them but need write access, please ask Martin Matthiesen for an invitation. (You also need a GitHub account first.)

In order to access a repository, you need to clone it to get a copy of it, your own workspace (also called working tree or working directory):

git clone git@github.com:CSCfi/repository_name.git

where repository_name is one of Kielipankki-utilities, Kielipankki-korp-frontend, Kielipankki-korp-backend or Kielipankki-annotlab. Cloning with this SSH URL requires that you have set up SSH key-based authentication for GitHub (see the GitHub instructions on that). Alternatively, you can use an HTTPS URL for cloning (see GitHub instructions):

git clone https://github.com/CSCfi/repository_name.git

The above git clone command creates the subdirectory repository_name in your current working directory (unless you explicitly specify a different directory name as an additional argument after the repository URL). The subdirectory contains the files of the repository in an editable form, and its subdirectory .git contains the local copy of the Git repository. You can make changes to your local copy, commit them and later push them to the main repository.

Before making changes to the files, you should add your name and email address to your Git configuration:

git config --global user.name "Firstname Lastname"
git config --global user.email "email.address@example.fi"
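To check which values Git will use, the settings can be read back with git config --get. The following is a minimal sketch that does this in a throwaway repository; it uses repository-local settings (without --global) so that running it does not touch your real configuration, and the temporary directory is only an assumption for illustration.

```shell
set -eu
# Work in a scratch repository so that nothing outside it is modified.
cd "$(mktemp -d)"
git init -q

# Without --global, the settings apply to this repository only.
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

# Read the values back.
name=$(git config --get user.name)
email=$(git config --get user.email)
echo "$name <$email>"
```

With --global, the same values are stored in your personal Git configuration and apply to all your repositories.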

The content of Korp Git repositories [TODO]

The central content of the Korp Git repositories is the following:

[TODO: Add more details below.]


The source code for the Korp frontend in JavaScript, including HTML and CSS files and their Jade and SCSS sources.



Basic Git usage for the Korp repositories

Committing changes

After making changes, such as adding the configuration of a new corpus, the added or changed files need to be committed to the repository; that is, the changes are recorded in the version control system along with the current date and time and your name and email address. This creates a new commit (sometimes called revision, version or changeset), which is done with the command

git commit filename

If filename is a directory name, Git commits all the changes in the directory and its subdirectories, so to commit all the changes in the files of the current directory and its subdirectories, use the command

git commit .

You can also commit all the changed (and removed) files in the whole repository with the command git commit -a.

The git commit command opens an editor, in which you should write a concise description of the change. By convention, the first line of the commit message is a summary that should preferably be shorter than 50 characters. A more detailed description, if needed, can be written after the summary and a blank line. Our convention is to write commit messages in English.

In general, you should try to group changes logically so that a single commit would not contain many changes unrelated to each other.

To add a completely new file to the repository, use the command

git add filename

This command only marks filename to be added by the next commit, so to actually commit the file, you need to run git commit (without arguments). git add can also be used to mark changes in files already in the repository to be committed. So if the new file should be part of a commit that also contains changes to other files, mark those files with git add filename as well and then run git commit.

If you have made changes unrelated to each other to a file without committing them in between, you may consider committing the changes separately. To mark only certain changes to be committed, use the command

git add -p filename

Git will then ask for each change made to filename if you wish to mark it to be committed. Other changes will be left outside the commit. After marking the changes, commit them with git commit as usual.
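The commands above can be tried out safely in a throwaway repository. The following is a minimal sketch under stated assumptions: the file name corpus_config.txt and the commit messages are hypothetical, and the option -m is used to give the commit message on the command line instead of opening an editor.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

# A completely new file must first be marked with git add.
echo "corpus: example" > corpus_config.txt
git add corpus_config.txt
# The first -m is the summary line, the second the more detailed description.
git commit -q -m "Add example corpus configuration" \
    -m "A longer description of the change goes here."

# A change to a file already in the repository can be committed by naming the file.
echo "title: Example" >> corpus_config.txt
git commit -q -m "Add title to example corpus configuration" corpus_config.txt

commit_count=$(git rev-list --count HEAD)
summary=$(git log -1 --format=%s)
echo "$commit_count commits, latest: $summary"
```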

Viewing changes

To show the files changed in your workspace with regard to the previous commit and the files marked to be committed, use the command

git status

To view the actual changes, use the command

git diff

To see the changes marked to be committed, use

git diff --cached

The previous commits, their descriptions, dates and authors are shown with the command

git log

With the option -p, git log also shows the changes in the commit with regard to the previous commit.

Instead of version numbers, Git identifies commits with SHA-1 checksums, which consist of 40 hexadecimal digits. A commit identifier may be abbreviated to a unique prefix, often 7 digits.

The current commit may be named or tagged with the command

git tag --annotate name

This typically makes sense when the version is somehow significant, such as a publicly announced update.
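As an illustration of the viewing commands, the following sketch lists the changed and the marked files, shows an abbreviated commit identifier and tags a commit. It runs in a scratch repository; the file name notes.txt and the tag name example-release are hypothetical, and the options --name-only and -m are used only to keep the output short and to avoid opening an editor.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

echo "one" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

echo "two" >> notes.txt
unstaged=$(git diff --name-only)          # changed but not yet marked
git add notes.txt
staged=$(git diff --cached --name-only)   # marked to be committed
git commit -q -m "Extend notes"

full_id=$(git rev-parse HEAD)             # full commit identifier
short_id=$(git rev-parse --short HEAD)    # abbreviated unique prefix

# Tag the current commit; -m gives the tag message on the command line.
git tag --annotate -m "Example release" example-release
tagged_id=$(git rev-parse "example-release^{commit}")
echo "$short_id is tagged as example-release"
```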

Undoing changes

To discard the changes in a file and to return it to the state it was in at the most recent commit, use

git checkout -- file ...

To discard all the changes in your workspace, replace file with the top directory of your workspace. Please note that after that you cannot recover the discarded changes.

To make your workspace correspond to a previous state (commit), use the command git reset. Please see git help reset first.

If you have not yet published your changes (see below), you can discard them completely, but published changes cannot easily be undone completely. However, you can make a new commit that reverts the changes with the command git revert; see git help revert for more information.
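The difference between discarding an uncommitted change and reverting a committed one can be sketched as follows. This is a minimal example in a scratch repository with a hypothetical file settings.txt; the option --no-edit makes git revert use its default commit message without opening an editor.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

echo "good" > settings.txt
git add settings.txt
git commit -q -m "Add settings"

# An uncommitted change is discarded for good with git checkout.
echo "mistake" >> settings.txt
git checkout -- settings.txt
after_checkout=$(cat settings.txt)

# A committed change is undone by a new commit that reverts it.
echo "bad idea" >> settings.txt
git commit -q -m "Add a bad idea" settings.txt
git revert --no-edit HEAD > /dev/null
after_revert=$(cat settings.txt)

commit_count=$(git rev-list --count HEAD)
echo "content: $after_revert, commits: $commit_count"
```

Note that the reverted commit stays in the history: the history contains the original commit, the unwanted commit and the commit reverting it.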

Publishing changes

Once you have committed changes to your own clone of the repository and wish to publish them (in practice, to FIN-CLARIN employees), you should first get the changes made by others to the master repository with the command

git pull origin

If there are changes and if they conflict with your changes (at least from Git’s point of view), you will need to edit the conflicting files. You can find the conflicting places by searching for the string <<<<< (see git help merge for more information). Once you have resolved the conflicts, you need to run the following commands (topdir is the top directory of your workspace):

git add -u topdir
git commit

After that, you can actually publish the changes to the master repository:

git push origin

It may be worthwhile to do git pull once in a while even if you are not yet about to publish your changes, since that reduces the chances of a large number of conflicts.
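The pull-then-push cycle can be rehearsed locally without touching GitHub. In the sketch below, a local bare repository main.git stands in for the main repository, and the two clones first and second stand in for two people; all these names are hypothetical. The options --no-rebase and --no-edit are assumptions added for the example: recent Git versions may ask you to choose how to reconcile diverging histories, and --no-edit accepts the default merge message without opening an editor.

```shell
set -eu
cd "$(mktemp -d)"
# Identity for the scratch commits, given as environment variables.
export GIT_AUTHOR_NAME="Firstname Lastname" GIT_AUTHOR_EMAIL="email.address@example.fi"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# A local bare repository stands in for the main repository.
git init -q --bare main.git
git --git-dir=main.git symbolic-ref HEAD refs/heads/master

# First clone: make the initial commit and publish it.
git clone -q main.git first
cd first
git symbolic-ref HEAD refs/heads/master
echo "a" > a.txt && git add a.txt && git commit -q -m "Add a.txt"
git push -q origin master
cd ..

# Second clone: commit locally without publishing yet.
git clone -q main.git second
cd second
echo "b" > b.txt && git add b.txt && git commit -q -m "Add b.txt"

# Meanwhile, the first clone publishes another change.
cd ../first
echo "c" > c.txt && git add c.txt && git commit -q -m "Add c.txt"
git push -q origin master

# The second clone must first get the changes of others, then publish its own.
cd ../second
git pull -q --no-rebase --no-edit origin master
git push -q origin master
cd ..

published=$(git --git-dir=main.git ls-tree --name-only master | tr '\n' ' ')
echo "published files: $published"
```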


Branches

Named branches allow independent features to be developed in parallel without disturbing each other. If you think that a feature will need more than one commit, it is recommended that you make a new branch for it. New corpus configurations for the Korp frontend should always be added first to branches of their own, to make testing easier and to avoid affecting the production Korp in case the configuration does not work. The changes made to different branches are usually eventually merged to the main or default branch, named master. The Korp Git repositories also contain other public branches. New branches are by default private (present only in your own copy of the repository), unless you publish them to the main repository with the command git push origin branch.

A typical use case for branches is to develop code in your own private branch, such as develop or dev. Once the code is stable enough and tested, the branch is merged to the master branch. In addition, branches can be used to develop individual features. The changes made to these feature or topic branches are typically first merged to the development branch and then to the master branch. But for adding Korp corpus configurations, it probably suffices to have a branch for each corpus or set of related corpora, or even a single personal branch, if you only work on a single corpus at a time. For ease of testing on the Korp server, the corpus configuration branches should be published.

It is easy to make a new branch branch_name in Git with the command

git branch branch_name

To change to branch branch_name, use the command

git checkout branch_name

This changes the content of the files tracked by Git in your workspace to correspond to their content in the branch branch_name. If you have uncommitted changes in your workspace that would be overwritten by the branch change, Git declines to change branches. You will need to commit the changes or stash them before being able to change branches. Stashing means putting the uncommitted changes temporarily aside; it is done with the command

git stash save [message]

Please see git help stash for more information.
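The interplay of branches, uncommitted changes and stashing can be sketched in a scratch repository as follows. The branch name new_corpus and the file config.txt are hypothetical; git stash pop, which brings the stashed changes back, is described in git help stash.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

echo "base" > config.txt
git add config.txt
git commit -q -m "Add config"

# Make a new branch, change to it and commit a change there.
git branch new_corpus
git checkout -q new_corpus
echo "new corpus settings" >> config.txt
git commit -q -m "Configure new corpus" config.txt

# An uncommitted change that the branch change would overwrite ...
echo "work in progress" >> config.txt
# ... makes Git decline to change branches.
if git checkout -q master 2>/dev/null; then switched=yes; else switched=no; fi

# Put the change aside, visit master, come back and restore the change.
git stash save "WIP on new corpus" > /dev/null
git checkout -q master
on_master=$(cat config.txt)
git checkout -q new_corpus
git stash pop > /dev/null
restored=$(tail -n 1 config.txt)
current=$(git rev-parse --abbrev-ref HEAD)
echo "on $current, restored: $restored"
```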

To see the existing branches, use the command

git branch

The current branch is marked with an asterisk. To see the last commit (summary) of each branch, use the command

git branch -v

Changes made to another branch may be merged to the currently active branch with the command

git merge branch

This may cause conflicts, which need to be resolved manually, similarly to when running git pull.

Alternatively, if the changed branch is not public, or if the changes made to it relative to the master branch have not been published (pushed), you may run the following command in the changed branch (assuming that the main branch is master):

git rebase master

git merge and git rebase typically produce different commit histories. git merge retains the branches to be merged as such and combines them with a new commit. In contrast, git rebase reapplies the commits of the current branch on top of the changes made to the master branch after the current branch was separated from it, which creates a linear commit history in this respect. Both approaches have their advantages and disadvantages in different situations. However, please keep in mind that git rebase should be used only if the changed branch is private or if the changes made to it relative to the master branch have not been published (pushed) yet.

One possible workflow for merging branch branch to master is to rebase it first on master and then merge:

git checkout branch
git rebase master
git checkout master
git merge branch

This workflow probably works best if you only have a few changes in branch with respect to master.
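The effect of this workflow on the commit history can be checked in a scratch repository. In the sketch below, the branch name feature and the file names are hypothetical; after rebasing, the merge only moves master forward, so the history stays linear and contains no merge commit.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

echo "base" > base.txt
git add base.txt
git commit -q -m "Add base"
git branch feature

# master and feature diverge: one commit on each.
echo "m" > master.txt
git add master.txt
git commit -q -m "Change on master"
git checkout -q feature
echo "f" > feature.txt
git add feature.txt
git commit -q -m "Change on feature"

# Rebase the (unpublished) branch on master, then merge it to master.
git rebase -q master
git checkout -q master
git merge -q feature

merge_commits=$(git rev-list --merges --count HEAD)
subjects=$(git log --format=%s | tr '\n' '|')
echo "merge commits: $merge_commits, history: $subjects"
```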

File timestamps

Git commits are accompanied by the timestamp of creating the commit, but Git by design neither stores nor retains file timestamps. For example, the files changed when doing git pull or changing branches get the current time as their timestamp. Similarly, when cloning a Git repository, the initial timestamp of the files will be the cloning time. Even though saving timestamps might be useful in some cases, this is a feature of Git that we will have to live with.
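If you need to know when a file was last changed, one option is to ask Git for the date of the most recent commit that touched the file instead of looking at the file timestamp. The following is a minimal sketch in a scratch repository; the file name convert.sh is hypothetical, and the commit date is fixed with the environment variable GIT_COMMITTER_DATE only to make the example reproducible.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email.address@example.fi"

echo "echo converting" > convert.sh
git add convert.sh
# Normally Git records the current time; here the date is fixed for the example.
GIT_COMMITTER_DATE="2020-02-27 12:00:00 +0200" git commit -q -m "Add conversion script"

# Committer date (YYYY-MM-DD) of the most recent commit that changed the file.
last_changed=$(git log -1 --format=%cd --date=short -- convert.sh)
echo "convert.sh was last changed in a commit made on $last_changed"
```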



