[Importing corpus data to Korp: technical documentation]
A version control system keeps track of who made (committed) what changes, where and when. It saves the stages of development of a file or a collection of files. This is particularly useful when files are modified by several people, possibly in parallel: version control makes it easier to view and often also merge the changes made by different people. Different versions can be compared and, if necessary, the content of a file can be restored to a previous state.
The Korp frontend and backend and corpus conversion scripts are version controlled using the Git version control system. In contrast, the corpus data itself is not. Typically, a version control system is used to store only files that are not generated automatically from other files. For example, compiled program files and corpus files imported to Korp are not stored in version control. In addition, Git is not well suited to storing very large files, such as corpus files.
For the version control of Korp, we use Git, which is an open-source, distributed version control system. Git is available for Unix/Linux, Mac OS X and Windows.
Git can be used in CSC’s computing environment and on the Korp server at CSC. Please note that the default Git is an old version (1.7.1); although it can be used for Korp repositories, you can switch to a newer one with the command
module load git/2.12.1. (Version 1.9.2 is also available.) If you process corpora on your own computer, you should have Git installed.
At its core, Git is command-line software, and this page contains instructions on using Git on the command line. However, there are a number of different GUIs for Git, and Git is supported (to various degrees) in Emacs and development environments such as Eclipse. The GitHub site used for hosting the Korp Git repositories also has a separate client, GitHub Desktop, available for Windows and Mac OS X.
The Git command-line commands are of the form
git command [options] [arguments]. The command
git help command gives help (a manual page) on the command
git command. In addition,
git help tutorial shows the Git tutorial that is a part of the Git documentation.
It may also be helpful to read at least parts of more comprehensive Git guides, in particular if you are new to version control or Git:
Kielipankki currently has four Git repositories (collections of files) related to Korp, all hosted on GitHub:
[2020-02-27: Note that the repositories
Kielipankki-korp-backend were previously named
korp-backend, respectively. The repository
Kielipankki-utilities is the public successor of the previous private
The first three repositories are public, the last one private. If you cannot access them or if you can only read them but need write access, please ask Martin Matthiesen for an invitation. (You also need a GitHub account first.)
In order to access a repository, you need to clone it to get a copy of it, your own workspace (also called working tree or working directory):
git clone git@github.com:CSCfi/repository_name.git
where repository_name is one of
Kielipankki-annotlab. You will need to enter your GitHub account password unless you set up SSH key-based authentication for GitHub (see the GitHub instructions on that). Alternatively, you can use an HTTPS URL for cloning (see GitHub instructions):
git clone https://github.com/CSCfi/repository_name.git
The git clone command creates, in your current working directory, the subdirectory repository_name (unless you explicitly specify a different directory name as an additional argument after the repository URL). This subdirectory contains the files of the repository in an editable form, and its subdirectory
.git contains the local copy of the Git repository. You can make changes to your local copy, commit them and later push them to the main repository.
Before making changes to the files, you should add your name and email address to the Git configuration:
git config --global user.name "Firstname Lastname"
git config --global user.email "email@example.com"
The central contents of the Korp Git repositories are the following:
[TODO: Add more details below.]
After making changes, such as adding the configuration of a new corpus, the added or changed files need to be committed to the repository, that is, the changes are recorded in the version control system along with the current date and time and your name and email address. This creates a new commit (sometimes called revision, version or changeset), which is done with the command
git commit filename …
If filename is a directory name, Git commits all the changes in the directory and its subdirectories, so to commit all the changes in the files of the current directory and its subdirectories, use the command
git commit .
You can also commit all the changed (and removed) files in the whole repository with the command
git commit -a.
The git commit command opens an editor, in which you should write a concise description of the change. By convention, the first line of the commit message is a summary that should preferably be shorter than 50 characters. An optional, more detailed description can be written after the summary and a blank line. Our convention is to write commit messages in English.
In general, you should try to group changes logically so that a single commit would not contain many changes unrelated to each other.
To add a completely new file to the repository, use the command
git add filename
This command only marks filename to be added by the next commit, so to actually commit the file, you need to do
git commit (without arguments).
git add can also be used to mark changes in files already in the repository to be committed, so if the new file should be a part of a commit containing changes to other files, you should mark them with
git add filename first and then do
git commit.
If you have made changes unrelated to each other to a file without committing them in between, you may consider committing the changes separately. To mark only certain changes to be committed, use the command
git add -p filename
Git will then ask for each change made to filename if you wish to mark it to be committed. Other changes will be left outside the commit. After marking the changes, commit them with
git commit as usual.
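The basic cycle of adding and committing can be sketched as follows. This is only an illustration in a throwaway repository: the file name config.txt and the commit messages are made up, and the option -m (which gives the commit message on the command line instead of opening an editor) is used only to keep the example non-interactive.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical, not a Korp repository)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email@example.com"

# Add a completely new file and commit it
printf 'corpus: example\n' > config.txt
git add config.txt
git commit -q -m "Add example corpus configuration"

# Change the file, mark the change to be committed and commit it
printf 'mode: default\n' >> config.txt
git add config.txt
git commit -q -m "Add mode to example configuration"

# List the commits made so far, newest first
git --no-pager log --oneline
```

In everyday use you would normally leave out -m and write the message in the editor, as described above.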
To show the files changed in your workspace with regard to the previous commit and the files marked to be committed, use the command
git status
To view the actual changes, use the command
git diff
To see the changes marked to be committed, use
git diff --cached
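The difference between changes marked to be committed and other changes can be tried out in a throwaway repository. The sketch below is only an illustration; the file name notes.txt and its contents are made up.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email@example.com"

printf 'first line\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# One change is marked to be committed, another one is left unmarked
printf 'second line\n' >> notes.txt
git add notes.txt
printf 'third line\n' >> notes.txt

git status                      # lists notes.txt both as marked and as not marked
git --no-pager diff             # shows only the unmarked change (third line)
git --no-pager diff --cached    # shows only the marked change (second line)
```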
The previous commits, their descriptions, dates and authors are shown with the command
git log
With the option
-p,
git log also shows the changes in the commit with regard to the previous commit.
Instead of version numbers, Git identifies commits with SHA-1 checksums, which are written as 40 hexadecimal digits. The commit identifier may be abbreviated to a unique prefix, often 7 digits.
The current commit may be named or tagged with the command
git tag --annotate name
This typically makes sense when the version is somehow significant, such as a publicly announced update.
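As an illustration, the following sketch shows a full and an abbreviated commit identifier and tags a commit in a throwaway repository. The tag name v-example and the file name are made up, and -m is used only to give the tag message without opening an editor.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email@example.com"

printf 'corpus: example\n' > config.txt
git add config.txt
git commit -q -m "Add example corpus configuration"

# Full commit identifier and an abbreviated, unique prefix of it
full_id=$(git rev-parse HEAD)
short_id=$(git rev-parse --short HEAD)
echo "$full_id"
echo "$short_id"

# Tag the current commit with an annotated tag
git tag --annotate -m "Example release" v-example

git --no-pager log -p -1    # the latest commit together with its changes
git tag                     # lists the existing tags
```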
To discard the changes in a file and to return it to the state it was in at the most recent commit, use
git checkout -- file ...
To discard all the changes in your workspace, replace file with the top directory of your workspace. Please note that after that you cannot recover the discarded changes.
To make your workspace correspond to a previous state (commit), use the command
git reset. Please see
git help reset first.
If you have not yet published your changes (see below), you can discard your changes completely, but published changes cannot be undone completely in any easy way. However, you can make a new commit that reverts the changes with the command
git revert; see
git help revert for more information.
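The two ways of undoing described above, discarding uncommitted changes and reverting a commit, can be sketched in a throwaway repository as follows. The file name settings.txt and its contents are made up, and --no-edit is used only so that git revert does not open an editor for the commit message.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email@example.com"

printf 'stable\n' > settings.txt
git add settings.txt
git commit -q -m "Add settings"

# Discard an uncommitted change: the file returns to its committed state
printf 'experiment\n' >> settings.txt
git checkout -- settings.txt

# Commit a change and then undo it with a new commit that reverts it
printf 'mistake\n' >> settings.txt
git commit -q -m "Add a line by mistake" settings.txt
git revert --no-edit HEAD

# The mistaken commit stays in the history, followed by the reverting commit
git --no-pager log --oneline
cat settings.txt
```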
Once you have committed changes to your own clone of the repository and wish to publish them (in practice, to FIN-CLARIN employees), you should first get the changes made by others to the master repository with the command
git pull origin
If there are changes and if they conflict with your changes (at least from Git’s point of view), you will need to edit the conflicting files. You can find the conflicting places by searching for the string
<<<<<<< (see
git help merge for more information). Once you have resolved the conflicts, you need to run the following commands (topdir is the top directory of your workspace):
git add -u topdir
git commit
After that, you can actually publish the changes to the master repository:
git push origin
It may be worthwhile to do
git pull once in a while even if you are not yet about to publish your changes, since that reduces the chances of a large number of conflicts.
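The pull-then-push cycle can be tried out without GitHub by letting a local bare repository stand in for the main repository. The sketch below is only an illustration under that assumption: the directory names (seed, main.git, first, second) and the file names are made up. The setting pull.rebase false is not part of the instructions above; it is included because newer Git versions ask you to choose between merging and rebasing when the histories have diverged.

```shell
set -e
base=$(mktemp -d)
cd "$base"

# Helper (hypothetical) that sets the committer identity in the current repository
identify() {
    git config user.name "Firstname Lastname"
    git config user.email "email@example.com"
}

# A local bare repository stands in for the main repository on GitHub
git init -q seed
(cd seed && identify && printf 'readme\n' > readme.txt \
    && git add readme.txt && git commit -q -m "Initial commit")
git clone -q --bare seed main.git

# Two workspaces cloned from the same main repository
git clone -q main.git first
git clone -q main.git second

# The first workspace commits and publishes a change
cd "$base/first"
identify
printf 'a\n' > a.txt
git add a.txt
git commit -q -m "Add a.txt"
git push -q origin

# The second workspace commits its own change, then pulls before pushing
cd "$base/second"
identify
git config pull.rebase false    # merge when pulling (assumption, see above)
printf 'b\n' > b.txt
git add b.txt
git commit -q -m "Add b.txt"
git pull -q --no-edit origin    # gets a.txt and merges it with the local commit
git push -q origin              # publishes b.txt and the merge
```

Here the two changes touch different files, so the pull merges them without conflicts.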
Named branches allow independent features to be developed in parallel without disturbing each other. If you think that a feature will need more than one commit, it is recommended that you make a new branch for it. New corpus configurations for the Korp frontend should always first be added to branches of their own, to make testing easier and to avoid affecting the production Korp in case the configuration does not work. The changes made to different branches are usually eventually merged to the main or default branch, named
master. In addition, the Korp Git repositories also contain other public branches. New branches are by default private (present only in your own copy of the repository), unless you publish them to the main repository with the command
git push origin branch.
A typical use case for branches is to develop code in your own private branch, such as
dev. Once the code is stable enough and tested, the branch is merged to the master branch. In addition, branches can be used to develop individual features. The changes made to these feature or topic branches are typically first merged to the development branch and then to the master branch. But for adding Korp corpus configurations, it probably suffices to have a branch for each corpus or set of related corpora, or even a single personal branch, if you only work on a single corpus at a time. For ease of testing on the Korp server, the corpus configuration branches should be published.
It is easy to make a new branch branch_name in Git with the command
git branch branch_name
To change to branch branch_name, use the command
git checkout branch_name
This changes the content of the files tracked by Git in your workspace to correspond to their content in the branch branch_name. If you have uncommitted changes in your workspace that would be overwritten by the branch change, Git declines to change branches. You will need to commit the changes or stash them before being able to change branches. Stashing means putting the uncommitted changes temporarily aside; it is done with the command
git stash save [message]
See
git help stash for more information.
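Creating a branch, changing branches and stashing can be sketched together in a throwaway repository. The branch name new_corpus, the file name and the stash message are made up, and the name of the main branch is read from Git instead of being assumed to be master.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email@example.com"

printf 'corpus: old\n' > config.txt
git add config.txt
git commit -q -m "Add configuration"
main_branch=$(git symbolic-ref --short HEAD)    # name of the main branch

# Make a new branch, change to it and commit a change there
git branch new_corpus
git checkout -q new_corpus
printf 'corpus: new\n' >> config.txt
git commit -q -m "Add new corpus" config.txt

# An uncommitted change would block the branch change, so stash it first
printf 'unfinished\n' >> config.txt
git stash save -q "unfinished work"
git checkout -q "$main_branch"

# Return to the branch and take the stashed change back into use
git checkout -q new_corpus
git stash pop -q
git branch -v
```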
To see the existing branches, use the command
git branch
The current branch is marked with an asterisk. To see the last commit (summary) of each branch, use the command
git branch -v
Changes made to another branch may be merged to the currently active branch with the command
git merge branch
This may cause conflicts, which need to be resolved manually, similarly to when running
git pull.
Alternatively, if the changed branch is not public or the changes made to it with relation to the master branch have not been published (pushed), you may use (in the changed branch and assuming that the main branch is
master) the command
git rebase master
git merge and
git rebase typically produce different commit histories:
git merge retains the branches to be merged as such and combines them with a new commit. In contrast,
git rebase replays the commits of the current branch on top of the master branch. In practice, this adds to the current branch the changes made to the master branch after the current branch was separated from it, which creates a linear commit history in this respect. Both approaches have their advantages and disadvantages in different situations. However, please keep in mind that
git rebase should be used only if the changed branch is private or if the changes made to it with relation to the master branch have not been published (pushed) yet.
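The different histories can be compared in a throwaway repository with two private topic branches, one updated with git merge and the other with git rebase. This is only a sketch: the branch names merged_topic and rebased_topic and the file names are made up, and the name of the main branch is read from Git instead of being assumed to be master.

```shell
set -e
# Throwaway repository in a temporary directory (hypothetical)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Firstname Lastname"
git config user.email "email@example.com"

printf 'base\n' > base.txt
git add base.txt
git commit -q -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

# Two topic branches start from the same commit
git branch merged_topic
git branch rebased_topic

# The main branch gets a new commit after the branches were separated from it
printf 'main\n' > main.txt
git add main.txt
git commit -q -m "Change on the main branch"

# Each topic branch gets a commit of its own
git checkout -q merged_topic
printf 'm\n' > m.txt
git add m.txt
git commit -q -m "Change on merged_topic"
git checkout -q rebased_topic
printf 'r\n' > r.txt
git add r.txt
git commit -q -m "Change on rebased_topic"

# git merge combines the two lines of development with a new merge commit
git checkout -q merged_topic
git merge -q --no-edit "$main_branch"

# git rebase replays the topic commit on top of the main branch: linear history
git checkout -q rebased_topic
git rebase -q "$main_branch"

git --no-pager log --oneline --graph merged_topic
git --no-pager log --oneline --graph rebased_topic
```

The graph of merged_topic shows two lines joined by a merge commit, whereas the graph of rebased_topic is a single line.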
One possible workflow for merging branch branch to
master is to rebase it first on master and then merge:
git checkout branch
git rebase master
git checkout master
git merge branch
This workflow probably works best if you only have a few changes in branch with respect to
master.
Git commits are accompanied by the timestamp of creating the commit, but Git by design does not store or retain file timestamps. For example, the files changed when doing
git pull or changing branches get the current time as their timestamp. Similarly, when cloning a Git repository, the initial timestamp of the files will be the cloning time. Even though saving timestamps might be useful in some cases, this is a feature of Git that we will have to live with.