VRT format in a nutshell

This page describes the contents of a VRT-formatted document in simple terms, for example for a user who downloads resources in VRT format and needs to use them. It covers the VRT format as used in the Language Bank of Finland; other providers may follow slightly different conventions.

VRT (VeRticalized Text) is the input format for the IMS Open Corpus Workbench (CWB) software underlying Korp. VRT is a token-oriented columnar text format: each token (word) is on its own line together with its possible annotation attributes (positional attributes), such as lemma, part of speech, morphological analysis and syntactic relation, separated by tabs. The structure of the text is represented with XML-style tags on their own lines. Start tags may contain XML-style attributes for the structure (structural attributes), which may vary between corpora. In contrast to XML, VRT does not require a single root element (structural attribute), so a VRT input may consist of a sequence of texts, for example.
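
The line-based structure described above can be illustrated with a minimal reader sketch in Python. The sample data, the attribute names and the function name read_vrt are made up for illustration and are not part of any official tool; the sketch assumes that every structural tag sits alone on its line and that attribute values are enclosed in double quotes.

```python
import re

# Hypothetical sample; element attributes and token columns are illustrative only.
SAMPLE = (
    '<text title="Example">\n'
    '<sentence id="1">\n'
    'Dogs\t1\tdog\n'
    'bark\t2\tbark\n'
    '</sentence>\n'
    '</text>\n'
)

START_TAG = re.compile(r'^<([A-Za-z_][\w.-]*)((?:\s+[\w.-]+="[^"]*")*)\s*>$')
END_TAG = re.compile(r'^</([A-Za-z_][\w.-]*)>$')
ATTR = re.compile(r'([\w.-]+)="([^"]*)"')


def read_vrt(lines):
    """Yield ("start", name, attrs), ("end", name, None) or ("token", fields, None)."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<!--"):
            continue  # skip empty lines and comment lines
        m = START_TAG.match(line)
        if m:
            # structural start tag with its structural attributes
            yield ("start", m.group(1), dict(ATTR.findall(m.group(2))))
            continue
        m = END_TAG.match(line)
        if m:
            yield ("end", m.group(1), None)
            continue
        # anything else is a token line: tab-separated positional attributes
        yield ("token", line.split("\t"), None)


events = list(read_vrt(SAMPLE.splitlines()))
for event in events:
    print(event)
```

Because VRT needs no single root element, the reader yields a flat stream of events rather than building a tree; a file containing several text elements in sequence is handled the same way.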

(There is another, more technical description of VRT documents for internal use and for resource depositors at https://www.kielipankki.fi/development/korp/corpus-input-format/.)

The data at the character level:

  • the data is UTF-8-encoded Unicode;
  • the characters &, < and > are encoded in XML/HTML style as &amp;, &lt; and &gt;, and in addition, the double quote (") is encoded as &quot; in structural attribute values; and
  • lines end in a single line-feed character (Unix-style).
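
The character encoding rules above can be sketched as a pair of helper functions. The names escape_vrt and unescape_vrt are hypothetical; the sketch handles only the four entities listed above and assumes no other character references occur in the data.

```python
def unescape_vrt(value):
    """Decode the four entities used in VRT data."""
    # &amp; is decoded last so that "&amp;lt;" becomes "&lt;" and not "<".
    return (
        value.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", '"')
        .replace("&amp;", "&")
    )


def escape_vrt(value, in_attribute=False):
    """Encode special characters; double quotes only in structural attribute values."""
    # & is encoded first so that the entities produced below are not re-encoded.
    value = value.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if in_attribute:
        value = value.replace('"', "&quot;")
    return value


print(unescape_vrt("AT&amp;T &lt;b&gt;"))
print(escape_vrt('a "b" < c', in_attribute=True))
```

When reading a file, opening it with encoding="utf-8" and newline="\n" in Python's open() matches the UTF-8 and line-feed conventions listed above.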

Structural elements are typically text, paragraph and sentence. Not all corpora have the paragraph level, though, and corpora can also have other structures, such as clause or ne (named entity).

The positional attributes (given in the first line of the VRT file as a comment) are often the following:

  • word form
  • the number of the token within the sentence
  • lemma
  • lemmacomp (lemma with compound boundaries marked)
  • part of speech
  • lex/ (lemgram; a combination of lemma and part of speech)
  • morphological analysis (morphosyntactic description)
  • dependency relation
  • dependency head number (the token number of the head word within the sentence)
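
As a rough sketch, the attribute names from the first-line comment can be used to label the columns of each token line. The exact syntax of the comment is not specified on this page; the form "<!-- #vrt positional-attributes: ... -->" used below, the attribute names and the sample values are all assumptions made for illustration, so check the first line of the actual file before relying on them.

```python
import re

# Assumed form of the first-line comment; verify against the real file.
NAMES_LINE = re.compile(r"^<!--\s*#vrt\s+positional-attributes:\s*(.*?)\s*-->$")


def parse_attribute_names(first_line):
    """Return the list of positional attribute names, or None if the line does not match."""
    m = NAMES_LINE.match(first_line.rstrip("\n"))
    return m.group(1).split() if m else None


def token_to_dict(line, names):
    """Pair the tab-separated fields of a token line with the attribute names."""
    fields = line.rstrip("\n").split("\t")
    return dict(zip(names, fields))


# Hypothetical header and token line
header = "<!-- #vrt positional-attributes: word ref lemma pos lex/ -->"
names = parse_attribute_names(header)
token = token_to_dict("Dogs\t1\tdog\tN\t|dog..nn.1|", names)
print(names)
print(token)
```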

An attribute may have multiple values. A multi-valued attribute (a feature set) is represented by separating the values with vertical bars and adding a vertical bar at the beginning and end of the whole value; for example, |Adj|Noun|Verb|. The name of a multi-valued positional attribute is suffixed with a slash in the positional attributes comment line. An empty multi-valued attribute value (no values in the set) is denoted by a single vertical bar.
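
The multi-value notation can be converted to and from a Python list with a small sketch like the following. The function names parse_feature_set and format_feature_set are hypothetical, and the sketch assumes that individual values never contain a vertical bar themselves.

```python
def parse_feature_set(value):
    """Split a multi-valued attribute such as "|Adj|Noun|Verb|" into a list."""
    if not (value.startswith("|") and value.endswith("|")):
        raise ValueError(f"not a multi-valued attribute value: {value!r}")
    if value == "|":
        return []  # a single bar denotes the empty set
    return value[1:-1].split("|")


def format_feature_set(values):
    """Join a list of values back into the bar-delimited notation."""
    if not values:
        return "|"
    return "|" + "|".join(values) + "|"


print(parse_feature_set("|Adj|Noun|Verb|"))
print(parse_feature_set("|"))
print(format_feature_set(["Adj", "Noun"]))
```

A column can be recognized as multi-valued from the trailing slash in its attribute name (for example lex/), so the parser only needs to be applied to those columns.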

VRT files may contain additional comment lines beginning with <!--. VRT extracted from Korp CWB data has a couple of such lines at the beginning and end of each file.

