[Importing corpus data to Korp: technical documentation]

Overview of processing a corpus for Korp

Adding a new text corpus to Korp typically consists of the following steps. The list covers only the steps of corpus processing proper. It does not cover negotiating licence agreements, acquiring the corpus data, creating metadata for the corpus or actually publishing the corpus. [TODO: Should we maybe also add some of those steps?] More details on each step are provided on the linked pages.

  1. Retrieve corpus data. This may involve receiving the data as an email attachment or via Funet FileSender, downloading a file from a specified location in the IDA storage service, or harvesting data from the Web.
  2. Upload the original corpus data to IDA, unless this has already been done. The data should be packaged appropriately.
  3. Preprocess the data before converting to VRT. This may involve OCR’ing PDF files to text, converting the character encoding or fixing apparent errors.
  4. Convert the data to the VRT format, which Korp uses as its corpus input format. If the conversion script repository already contains a script for the same or a similar input format, prefer using it, either as is or modified. Otherwise, you may need to write a custom script for the input format.
  5. Parse the VRT and recognize named entities. Pass the generated VRT to Jussi Piitulainen for parsing. The parsing step also adds morphosyntactic, part-of-speech and named entity annotations. This currently applies only to corpora in (standard) Finnish.
  6. Encode VRT data into the CWB database format and create a Korp package for the corpus. This can often be done with a single command, which also generates certain data required for the Korp database.
  7. Add corpus configuration to the Korp frontend. This consists of making changes to the Korp configuration and translation files on your own configuration branch of the Korp frontend repository and committing the changes.
  8. Install the corpus package and configuration. At this stage, the corpus configuration should be installed on a separate test instance of the Korp frontend. If you do not have access to the Korp server, ask someone who has the required rights to do the installation.
  9. Test the corpus in Korp. Check that the corpus shows up and works as expected in the Korp test instance.
  10. Inform others of the corpus and request feedback. You should inform at least fin-clarin (at) helsinki.fi and the original corpus owner or compiler if applicable. If you get feedback, you may need to redo some of the previous steps.
  11. Install the corpus configuration on the production Korp once the corpus works as desired in the test instance. Again, you may need the help of someone with the appropriate rights.
  12. Upload the corpus package to the IDA storage service.
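
As a rough illustration of steps 3 and 4, the following Python sketch decodes a plain-text file from a legacy character encoding and writes it out as minimal VRT: one token per line, with XML-style tags marking texts, paragraphs and sentences. It is not one of the scripts in the conversion script repository. The function name text_to_vrt, the structure names (text, paragraph, sentence), the title attribute, the default Latin-1 encoding and the naive regular-expression tokenization are all assumptions made for this example. A real conversion should preferably reuse an existing script and follow the conventions agreed for the corpus.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def text_to_vrt(raw: bytes, title: str, encoding: str = "latin-1") -> str:
    """Convert raw plain text to a minimal VRT string (illustrative only).

    Assumptions: blank lines separate paragraphs, sentences end in
    '.', '!' or '?', and tokens are word runs or single punctuation marks.
    """
    # Step 3 (preprocessing): decode the legacy encoding into Unicode.
    text = raw.decode(encoding)

    # Step 4 (conversion): build the VRT line by line.
    lines = [f"<text title={quoteattr(title)}>"]
    for para in re.split(r"\n\s*\n", text.strip()):
        if not para.strip():
            continue  # skip empty paragraphs (e.g. empty input)
        lines.append("<paragraph>")
        # Naive sentence split after sentence-final punctuation.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            tokens = re.findall(r"\w+|[^\w\s]", sent)
            if not tokens:
                continue
            lines.append("<sentence>")
            # Encode &, < and > so that tokens cannot be mistaken for tags.
            lines.extend(escape(tok) for tok in tokens)
            lines.append("</sentence>")
        lines.append("</paragraph>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "Hyvää päivää! Tämä on testi.".encode("latin-1")
    print(text_to_vrt(sample, title="Demo"), end="")
```

The output of such a script contains only the word form of each token. Further positional attributes, such as lemmas and parts of speech, would be added as tab-separated columns by the parsing in step 5.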