Meeting on the use of Git repositories and GitHub/GitLab in the Language Bank of Finland

When, Where, Who

Date: 3.12.2018 9.30-12.00 at CSC.

Present: Mietta, Jussi, Jyrki (some comments below), Sam, Tero, João (from 10:25), Martin (notes).

Agenda

Present situation

Let us look at a few examples of the 12 repos more or less in use.

Korp (Jyrki)

The following 2 repos are obsolete:

  • Kielipankki-korp-frontend -> Archive
  • Kielipankki-korp-backend -> soon Archive

Our Korp development happens here:

The forks have master and dev branches that supersede the now obsolete ”Kielipankki-korp” versions.

The sb/dev and sb/master branches follow Språkbanken’s original branches. Jyrki plans to port our changes to these branches so that pull requests are possible. [Jyrki: Or, to be exact, I intend to port our changes to separate topic branches based on sb/master or sb/dev.]

CSCfi/Kielipankki (Martin)

The repo is for internal use only and contains a few things that do not belong there. Those directories (the Praat scripts and the commandline scripts) will be removed. The relevant directories are:

  • FIN-CLARIN-Administration
    • The kielipankki.fi/corpora list
    • The PID master file (for URN/Handle)
    • Extended metadata for lehet90ff
  • servers
    • all (first attempt at common roles)
    • pid (our PID handling)
    • portal (www.kielipankki.fi)
    • webanno (kielipankki.fi/webanno)
    • www (intended to contain common configurations / roles for all web servers)
  • scripts
    • Internal scripts, now documented in README.md files
  • csr will be moved to scripts/csr

CSCfi/Kielipankki-palvelut (Martin)

The public ”Kielipankki-palvelut” contains two main parts:

  • commandline
    • Our software portfolio like hfst, ffmpeg, kaldi, etc.
    • Can be compiled both on Taito and for use in Mylly2
  • servers
    • sanat.csc.fi, including 3 wikis
      • nimiarkisto (for Kotus)
      • termipankki.fi
      • sanat.csc.fi
    • metalb
      • syncmeta, our OAI-PMH endpoint

CSCfi/Kielipankki-konversio (Jyrki/Jussi/all)

We discussed parts of ”Kielipankki-konversio”. The main decisions are listed below:

  • The repo will be renamed to ”Kielipankki-conversion” and made public.
    [Jyrki: (1) Unfortunately, the repository contains some corpus data, which will have to be removed before making it public. Or maybe it would be better to keep the current repository private and archive it and make a new repository with the corpus data filtered out. I can take care of the filtering. I will also create a Jira ticket for the issue.]
    [Jyrki: (2) Did we actually decide that the new name will be “Kielipankki-conversion”, and not something (that I find) more descriptive like ”Kielipankki-corpus-conversion” or ”Kielipankki-corpus-processing”? Even though the latter are longer, they might be easier to find for people outside the Language Bank looking for corpus processing tools. On the other hand, some tools in the repo might be used for other purposes than corpus processing, so too narrowly descriptive naming might not be desirable, either. And descriptive labels like “corpus processing” might also help people to find the repository.]
  • Jussi will move his VRT scripts in ”prevrt” to a new top-level directory ”vrt-tools”.

Documentation

We agreed that we need a minimum of documentation. Therefore most directories in a repo should have a README.md describing what is in them. In some cases it might even be possible to generate the README.md files.
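As an illustration of how a README.md could be generated, here is a minimal Python sketch. It is not something the meeting decided on: the function names `render_readme` and `write_readme_if_missing`, the layout of the generated file and the "TODO describe" placeholders are all assumptions made for this example. The sketch only lists what is in a directory; the actual descriptions would still have to be written by hand.

```python
from pathlib import Path


def render_readme(directory: Path) -> str:
    """Return README.md text that lists the visible entries of `directory`.

    Hypothetical helper: the layout below is an assumption, not an agreed format.
    """
    lines = [
        f"# {directory.name}",
        "",
        "Contents (generated listing; descriptions are placeholders):",
        "",
    ]
    for entry in sorted(directory.iterdir(), key=lambda p: p.name.lower()):
        # Skip hidden entries (e.g. .git) and the README itself.
        if entry.name.startswith(".") or entry.name == "README.md":
            continue
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"- `{entry.name}{suffix}`: TODO describe")
    return "\n".join(lines) + "\n"


def write_readme_if_missing(directory: Path) -> bool:
    """Write a generated README.md unless one already exists.

    Returns True if a file was written, False if an existing README.md was kept.
    """
    readme = directory / "README.md"
    if readme.exists():
        return False
    readme.write_text(render_readme(directory), encoding="utf-8")
    return True
```

Calling `write_readme_if_missing(Path("scripts"))` would create a stub only where no README.md exists, so hand-written documentation is never overwritten.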

The public repositories should have GitHub topics in Finnish and English so that they can be found more easily. [Jyrki: I think that the private ones could also have topics. Who decides what topics to use and who adds them?]

We also agreed that reliable documentation will require effort on our part, such as documentation reviews; otherwise it gets neglected too often. We made no formal decision on how to proceed.
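One possible way to keep the README.md requirement from being forgotten is a small check that reports directories without one. The following Python sketch is only an assumption about how such a check might look; the names `directories_missing_readme` and `main` are hypothetical, and the notes record no decision to adopt anything like this.

```python
from pathlib import Path


def directories_missing_readme(root: Path) -> list[Path]:
    """Return the immediate subdirectories of `root` that lack a README.md."""
    missing = []
    for entry in sorted(root.iterdir()):
        # Only visible directories are checked; files and e.g. .git are ignored.
        if not entry.is_dir() or entry.name.startswith("."):
            continue
        if not (entry / "README.md").is_file():
            missing.append(entry)
    return missing


def main(argv: list[str]) -> int:
    """Print each offending directory; return 1 if any were found, else 0."""
    root = Path(argv[0]) if argv else Path(".")
    missing = directories_missing_readme(root)
    for directory in missing:
        print(f"missing README.md: {directory}")
    return 1 if missing else 0
```

A script could finish with `sys.exit(main(sys.argv[1:]))` (after `import sys`), so that a non-zero exit status signals missing documentation to whatever tool runs it.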

Gitlab for the Language Bank? (João)

We use GitHub. Should we start using Gitlab?

GitHub and Gitlab do similar things: both are well suited for version control and Continuous Integration (CI). GitHub uses travis-ci.org for CI, whereas Gitlab has CI built in.

The Language Bank does not yet use Gitlab. We discussed several reasons why we might start using it, since we have access to a Neic-hosted instance at https://source.coderefinery.org/.

The main argument for starting to use Gitlab is its easier CI framework and very good support for Continuous Delivery (”CD”, i.e. automatic deployment). The points against:

  • Added complexity
    • Almost everyone needs separate Gitlab accounts.
    • Another GUI to master
    • GitHub and Gitlab repositories often need syncing; this is a commercial feature ($20/month/user) and is not supported by the Neic instance.
    • Semimanual syncing is possible but needs work.
  • Our need for CD is not that high.
  • CSC’s GitHub comes with access to Travis.
  • The HFST-team uses Travis already.
  • The Travis syntax is not massively more complicated than Gitlab’s CI/CD syntax.

Decision: We focus on GitHub and Travis CI. João will follow Gitlab developments as part of his involvement with Neic CodeRefinery and will create some test instances, but we will not use Gitlab for the Language Bank for the time being.

Effective usage of GitHub/lab (all)

What to move to Gitlab

Nothing at the moment, since we do not actively use Gitlab.

Effective forking and branching

We had to skip this part.

Automated tests

We will look into GitHub and travis-ci.org. A good, simple example is https://github.com/CSCfi/ansible-workshops/blob/master/.travis.yml (the script performs a syntax check on an Ansible script).
