[Importing corpus data to Korp: technical documentation]

The Korp corpus input format

This page contains information on the input formats VRT and HRT used for Kielipankki’s Korp text corpus search service. The information is primarily aimed at Kielipankki’s staff importing corpora to Korp, but it may also be useful for corpus providers if they can affect the corpus format. For further information, please contact fin-clarin [at] helsinki.fi.

The section VRT (Kielipankki flavour) in brief below describes briefly the important characteristics of VRT for those who wish or need to produce corpus data in the VRT format.

The separate document VRT format in a nutshell describes the contents of a VRT formatted document in simple terms, e.g., for a user who downloads and needs to use resources in VRT format.

Introduction to corpus input formats

The original format of corpora to be imported to Korp may vary widely. The data may be plain text, HTML, a tabular format such as CoNLL-X or an XML format such as TEI. In any case, it is important that the data format is as consistent as possible, since consistency makes corpus processing faster and the result better.

The data in the original format is converted to VRT (VeRticalized Text), possibly via HRT (HoRizontal Text). VRT is the input format for the IMS Open Corpus Workbench (CWB) software underlying Korp, whereas HRT is an untokenized intermediate format.

VRT is a token-oriented columnar text format: each token (word) is on its own line together with its possible annotation attributes (positional attributes) separated by tabs. The structure of the text is represented with XML-style tags (structural attributes) on their own lines. Start tags may contain XML-style attributes (structural attribute annotations) for the structure.

Note that this document describes the VRT format as used in Kielipankki (the Language Bank of Finland) for Korp and downloadable data, with some additional conventions and constraints compared with the more general input format recognized by CWB. Also note that even though not all the requirements and recommendations for the input format described in this document are mandated by CWB, violating them may make parts of the corpus data impossible or more difficult to search in Korp and may make the VRT data more difficult to process with other tools.

VRT (Kielipankki flavour) in brief

This section describes briefly the important characteristics of the Kielipankki flavour of VRT for those who wish or need to produce corpus data in the VRT format. For more details and an example, please see the following sections.

Tokens

  • Each token on its own line
  • Token (positional) attributes separated by single tabs
  • All tokens have the same number of attributes in the same order
  • Word form (word) as the first attribute; the rest can vary
  • Use underscore _ for empty or missing attribute values
  • Declare attribute names with a positional-attributes comment line before the first token; e.g.: <!-- #vrt positional-attributes: word lemma pos msd -->

Structures (elements)

  • Delimited by XML-style tags, each on its own line, without leading or trailing spaces
  • Start tags may contain attributes
  • Standard structures are text, paragraph (optional) and sentence
  • A text is a logical unit with common characteristics and metadata
  • Each token must be enclosed in a sentence enclosed in a text
  • If the corpus has paragraphs, each sentence should be enclosed in a paragraph
  • Intermediate and intra-sentence structures are allowed: e.g. chapter between text and paragraph, or clause or ne (name expression) within sentence; they need not cover all tokens
  • No root element required (unlike in XML): a corpus is typically a sequence of texts
  • Structures can cross (unlike in XML): e.g. sentence and page; however, avoid this if not needed
  • Nesting structures of the same type is allowed but complicates searches

Attributes of structures

  • The attributes of a structure contain metadata for the structure
  • All structures of the same type in a corpus should have the same attributes, preferably listed in the same order
  • Attribute values enclosed in double quotation marks
  • Empty values as empty strings
  • A single space between attributes; no spaces around equals signs between attribute names and values
  • Structural attributes with a special meaning in Korp
    • text structures: datefrom, dateto, timefrom, timeto: The creation date and time interval of the original text: yyyymmdd for dates, hhmmss for times; all empty if the creation time is not known (see below for more information)
    • text, paragraph and sentence structures: id: An identifier of the structure, unique within a single corpus

Attribute and structure names

  • The names of structures and positional and structural attributes may consist of lower-case a–z and 0–9, attribute names also of underscores _; a name cannot begin with a digit
  • Attributes with names beginning with an underscore are considered private or internal

Attribute values

  • In general, positional and structural attribute values may be any free-form character strings (see below for restrictions of and recommendations on character content)
  • Some support for integers; no special support for floating-point values
  • Multi-valued (feature-set) attributes
    • Represent ambiguity or uncertainty
    • A vertical bar | at the beginning and end and separating individual values; e.g. |Adj|Noun|Verb|
    • An empty set is represented by a single vertical bar |
    • The order of individual values is not significant
    • Replace vertical bars in individual values with e.g. the Unicode broken bar U+00A6
  • Ranked values
    • An extension of feature-set attributes in Korp
    • Each individual value has a suffix of a colon : followed by a number (integer or float); e.g. |Adj:0.7|Noun:0.22|Verb:0.08|
    • The numbers may denote probabilities or some other ranking of the values

Character content

  • All content in UTF-8
  • Preferably Unix-style line endings (bare LF, U+000A)
  • No control codes (including tabs and any line separating characters) in attribute values
  • Encode <, > and & as &lt;, &gt; and &amp; in positional and structural attribute values
  • Encode " as &quot; in structural attribute values
  • Use literal Unicode characters instead of XML numeric character references (&#nnnn;, &#xhhhh;) and HTML character entity references, such as &auml;
  • Spaces (U+0020) and no-break spaces (NBSP, U+00A0) can be used in attribute values
    • Remove leading and trailing spaces and NBSPs and convert multiple consecutive ones to single ones
    • Values consisting of only spaces and NBSPs should be emptied completely; remove tokens consisting only of them
  • Remove soft hyphens (U+00AD)

Other content

  • Single-line XML-style comments <!-- … --> are allowed and ignored (no leading or trailing spaces on the line)
  • Special VRT comments <!-- #vrt key: value --> contain information used and generated by Kielipankki VRT Tools

Parallel corpora

  • Each language (or other parallel part) of a parallel corpus encoded separately
  • Each part should have link structures with id attributes
    • Structures with the same value for id in different parts are linked with each other
    • A link structure may cover one or more sentences or paragraphs, for example

VRT example

The following is an example of the VRT format as used by Korp. Tab characters are represented as → followed by spaces. Structural elements are text, paragraph and sentence, and the positional attributes are word form (word), the number of the token within the sentence (ref), lemma (lemma), lemma with compound boundaries marked (lemmacomp), part of speech (pos), morphological analysis (msd), dependency head number (dephead) and dependency relation (deprel).


<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel -->
<text filename="EuroParl Corpus/fi-en/fi/ep-00-01-17.txt" title="" datefrom="20000117" dateto="20000117" timefrom="000000" timeto="235959">
<paragraph id="1">
<sentence id="1" line="2">
Istuntokauden→	1→	istuntokausi→	istunto#kausi→	N→	N Gen Sg→	2→	obj
uudelleenavaaminen→	2→	uudelleenavaaminen→	uudelleen#avaaminen→	N→	N Nom Sg→	0→	main
</sentence>
</paragraph>
<paragraph id="2">
<sentence id="2" line="4">
Julistan→	1→	julistaa→	julistaa→	V→	V Prs Act Sg1→	0→	main
perjantaina→	2→	perjantai→	perjantai→	N→	N Ess Sg→	1→	advl
joulukuun→	3→	joulukuu→	joulu#kuu→	N→	N Gen Sg→	5→	attr
17.→	4→	17.→	17.→	Num→	Num Digit→	5→	attr
päivänä→	5→	päivä→	päivä→	N→	N Ess Sg→	1→	advl
...
.→	26→	.→	.→	Punct→	Punct→	-→	-
</sentence>
</paragraph>
...
</text>

Note that you need to use double quotation marks around the structural attribute annotation values. Even though the CWB encoder also understands single quotes, other tools processing VRT data assume double quotes.

The first line of the example is a positional attributes comment listing the names of the token (positional) attributes in the order they appear in the data. This is an extension of the Kielipankki VRT format.
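As an illustration of how the positional attributes comment can be used when reading VRT, the following Python sketch maps each token line to a dictionary keyed by the declared attribute names. The function name read_tokens and the fallback to a single word attribute when no comment has been seen are assumptions made for this example; they are not part of Kielipankki VRT Tools.

```python
import re

# The positional-attributes comment line as described in this document.
POS_ATTR_RE = re.compile(r"^<!-- #vrt positional-attributes: (.+?) -->$")


def read_tokens(lines):
    """Yield one dict per token line, keyed by the declared attribute names."""
    names = ["word"]  # assumed fallback if no comment precedes the tokens
    for line in lines:
        line = line.rstrip("\r\n")
        match = POS_ATTR_RE.match(line)
        if match:
            names = match.group(1).split()
        elif line and not line.startswith("<"):
            # A literal < in a token is encoded as &lt;, so any line
            # beginning with < is a tag or a comment, not a token.
            yield dict(zip(names, line.split("\t")))
```

For the example above, the first token would be returned with word "Istuntokauden", ref "1", lemma "istuntokausi" and so on.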

Note that the initial input VRT often contains only the attribute word (word form) and the rest are added in the annotation process in Kielipankki.

In contrast to XML, VRT does not require a single root element (structural attribute), so a VRT input may consist of a sequence of texts, for example. Another difference from XML is that VRT allows crossing structural attributes; for example:


<page id="p1">
...
<sentence id="s8">
...
</page>
<page id="p2">
...
</sentence>
...
</page>

However, crossing structures make it impossible to use XML tools for VRT data, so they should be avoided unless they are necessary.
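Whether a VRT file contains crossing structures can be checked with a simple stack of open structures. The following Python sketch reports each end tag that closes a structure other than the innermost open one; the function name find_crossings and the simplified tag pattern are assumptions made for this illustration.

```python
import re

# Simplified pattern for start and end tags on their own lines.
TAG_RE = re.compile(r"^<(/?)([a-z][a-z0-9_]*)[ >]")


def find_crossings(lines):
    """Return (line number, closed name, innermost open name) for each
    end tag that does not close the innermost open structure."""
    open_names = []
    crossings = []
    for number, line in enumerate(lines, start=1):
        match = TAG_RE.match(line)
        if not match:
            continue  # token line, comment or empty line
        is_end, name = match.group(1) == "/", match.group(2)
        if not is_end:
            open_names.append(name)
        elif open_names and open_names[-1] == name:
            open_names.pop()
        elif name in open_names:
            crossings.append((number, name, open_names[-1]))
            # Remove the most recently opened structure of this name.
            index = len(open_names) - 1 - open_names[::-1].index(name)
            del open_names[index]
    return crossings
```

For the page and sentence example above, both the first </page> and the </sentence> would be reported, whereas properly nested input yields an empty list.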

As the VRT format resembles XML, an XML format may be a good basis for corpus data to be imported to Korp. However, note that nesting the same structural attribute type in VRT is somewhat cumbersome, so it would be better to avoid having a clause inside a clause, for example.

Completely empty lines in the VRT input should be avoided. Even though leading and trailing spaces in attribute values are stripped in the encoding phase, they should preferably be stripped from the VRT input. The VRT input may contain XML-style comments <!-- ... --> that are ignored, but each comment must be on its own line: multi-line comments are not recognized. An XML declaration at the beginning of a file is ignored. (All these require explicit options to the CWB encoding program, but it probably makes sense to use them.)

For more information on the VRT format in general, please refer to the CWB Corpus Encoding Tutorial in the CWB documentation. Some Korp-specific information is also found on Språkbanken’s Korp backend information page.

Character encoding and character content

The input data for Korp must be UTF-8-encoded Unicode. If the original data is not in UTF-8, you need to convert it, using e.g. iconv. To do that, you need to know the original character encoding of the data. In a bad case, the data may be a mix of two or more 8-bit encodings, such as ISO 8859-1 and Windows-1252, which complicates converting the encoding correctly; in the worst case, the character encoding may already have been converted incorrectly.
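If the conversion is done in a script instead of with iconv, a minimal Python sketch could look like the following. It assumes that data which is not valid UTF-8 is in a single known 8-bit encoding (here Windows-1252 as a hypothetical default); it does not handle mixed or already incorrectly converted data, which need corpus-specific treatment.

```python
def to_utf8(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw as UTF-8, decoding with the fallback encoding
    if the input is not already valid UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: the whole input is in the fallback encoding.
        # Bytes undefined in that encoding raise UnicodeDecodeError again.
        text = raw.decode(fallback)
    return text.encode("utf-8")
```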

The characters & and < in the data need to be encoded as XML predefined entities &amp; and &lt;. Other XML predefined entities may also be used: &quot; for " (straight ASCII double quotation mark), &apos; for ' (straight ASCII single quotation mark) and &gt; for >, but they are mandatory only if the value of a structural attribute enclosed in quotes contains the same type of quote.

In contrast, do not use the numeric character references of XML &#nnnn; and &#xhhhh;, nor HTML character entity references, such as &auml;, since the CWB encoder treats them literally. Instead, use the corresponding UTF-8-encoded Unicode characters directly.
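The encoding rules above can be sketched in Python as two small helper functions, one for positional attribute values and one for structural attribute values enclosed in double quotes. The function names are assumptions made for this example.

```python
def escape_token_value(value: str) -> str:
    """Encode &, < and > in a positional attribute value."""
    # Replace & first so that the entities added below are not re-encoded.
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace(">", "&gt;"))


def escape_struct_value(value: str) -> str:
    """Encode &, <, > and the double quote in a structural attribute value."""
    return escape_token_value(value).replace('"', "&quot;")
```

For example, escape_struct_value('R&D "lab"') gives R&amp;D &quot;lab&quot;.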

The line endings in the VRT input may be either Unix- or Windows-style (bare LF or CR+LF), but Unix-style bare LF is preferred.

VRT data may not contain tabs or any line-separating characters anywhere else than as separators of positional attributes and lines, respectively. Moreover, the data should not contain any other control characters (characters in the ranges U+0000…U+001F and U+007F…U+009F), nor preferably the Unicode line and paragraph separators (U+2028, U+2029). They should be stripped from the data, or if their presence is essential, encoded in a corpus-specific way. In addition, soft hyphens (U+00AD) should also be removed.

The space (U+0020) and no-break space (NBSP, U+00A0) may be used in the values of positional and structural attributes between other characters. However, spaces and NBSPs at the beginning and end of an attribute value should be stripped, multiple consecutive spaces and NBSPs should be converted to a single one, and values consisting of only spaces and NBSPs should be emptied. In particular, tokens consisting of only spaces and NBSPs should be removed.

The Unicode characters FIGURE SPACE (U+2007) and NARROW NO-BREAK SPACE (U+202F) (and also THIN SPACE, U+2009, when used as a thousands separator) should be converted to NBSPs and treated as above. Other Unicode spaces should be converted to and treated as plain spaces. (Information on Unicode spaces.)
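The character clean-up described in the preceding paragraphs could be sketched in Python as follows for a single attribute value. The function name clean_value is hypothetical, and two details are assumptions not specified above: a thin space is taken to be a thousands separator only when it stands between two digits, and a run of mixed spaces and NBSPs is reduced to its first character.

```python
import re
import unicodedata

NBSP = "\u00a0"
# Control characters, Unicode line and paragraph separators, soft hyphen.
REMOVE_RE = re.compile(r"[\u0000-\u001f\u007f-\u009f\u2028\u2029\u00ad]")
THIN_BETWEEN_DIGITS_RE = re.compile(r"(?<=\d)\u2009(?=\d)")
SPACE_RUN_RE = re.compile(r"[ \u00a0]+")


def clean_value(value: str) -> str:
    """Normalize the character content of one attribute value."""
    value = REMOVE_RE.sub("", value)
    value = THIN_BETWEEN_DIGITS_RE.sub(NBSP, value)
    value = value.replace("\u2007", NBSP).replace("\u202f", NBSP)
    # Any other Unicode space separator (category Zs) becomes a plain space.
    value = "".join(
        " " if unicodedata.category(ch) == "Zs" and ch != NBSP else ch
        for ch in value
    )
    # Collapse each run of spaces and NBSPs to its first character.
    value = SPACE_RUN_RE.sub(lambda m: m.group(0)[0], value)
    return value.strip(" " + NBSP)
```

A token whose word form becomes empty after this clean-up would then be dropped by the caller.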

Positional (token) attributes

The positional token attributes (columns) may be, for example, the following (in a dependency-parsed corpus with named entities marked):

  • word form (word)
  • the number of the token within the sentence (ref)
  • lemma (lemma)
  • lemma with compound boundaries marked (lemmacomp)
  • part of speech (pos)
  • morphological analysis (morphosyntactic description) (msd)
  • dependency head (the number of the word within the sentence) (dephead)
  • dependency relation (deprel)
  • named entity tag (nertag)

The names of positional attributes are specified at the beginning of VRT data (before the first token line) via a positional-attributes comment indicating the order of the attributes; for example:


<!-- #vrt positional-attributes: word ref lemma lemmacomp pos msd dephead deprel -->

With the exception of the word form, which should be first, the attributes can be in some other order, as long as all the tokens in a corpus have the same attributes in the same order. If a corpus does not have an attribute, it is left out. If some tokens have an attribute and some others do not, a single underscore (_) should preferably be used to denote the empty value, even though a completely empty value is also allowed. Even if the missing values are completely omitted, the attribute-separating tabs must be present to keep the attribute alignment correct.
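As a minimal sketch of the above, the following Python function writes one token line so that every token gets the same number of tab-separated attributes and missing values are marked with an underscore. The function name format_token and the representation of a token as a dictionary of strings are assumptions made for this example.

```python
def format_token(token: dict, names: list) -> str:
    """Return a VRT token line with the attributes in the order of names.

    Missing or empty values are written as a single underscore.
    """
    values = [token.get(name) or "_" for name in names]
    return "\t".join(values)
```

For example, format_token({"word": "Julistan", "lemma": "julistaa"}, ["word", "lemma", "pos"]) produces the three columns Julistan, julistaa and _ separated by tabs.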

An attribute may have multiple values, in which case it is referred to as a feature-set attribute in CWB. Multi-valued attributes can be used to represent ambiguity or uncertainty, for example. A multi-valued attribute is represented by separating the values by vertical bars and adding vertical bars at the beginning and end of the whole value; for example, |Adj|Noun|Verb|. An empty feature set attribute value (no values in the set) is denoted by a single vertical bar |.

If a value of a feature set would itself contain the vertical bar, it should be replaced with another character, such as the Unicode broken bar U+00A6 ¦.

An extension of feature-set attributes in Korp are ranked attributes. Their values are feature-set values in which each individual value has a suffix consisting of a colon : followed by a number (an integer or a float); e.g. |Adj:0.7|Noun:0.22|Verb:0.08|. The numbers may denote probabilities or some other ranking of the values.
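The feature-set and ranked value notations can be illustrated with the following Python sketch. The function names are hypothetical; parse_ranked assumes that every individual value has a numeric suffix after its last colon.

```python
def format_feature_set(values) -> str:
    """Join values into a feature-set value such as |Adj|Noun|Verb|."""
    # A vertical bar inside a value is replaced with the broken bar U+00A6.
    cleaned = [value.replace("|", "\u00a6") for value in values]
    # An empty set results in a single vertical bar.
    return "|" + "".join(value + "|" for value in cleaned)


def parse_ranked(value: str) -> dict:
    """Split a ranked value such as |Adj:0.7|Noun:0.3| into a dict."""
    result = {}
    for item in value.strip("|").split("|"):
        if item:
            name, _, rank = item.rpartition(":")
            result[name] = float(rank)
    return result
```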

Structures

Korp recognizes and uses three levels of structures: text, paragraph and sentence. In the input VRT, they are represented with structural attributes of the same names (text, paragraph and sentence). Paragraphs are optional, but texts and sentences are required.

Sentence is the (default) context of a match in the KWIC concordance in Korp, whereas the enclosing paragraph is shown in the context view. (If a corpus has no paragraphs, the context view also shows sentences.) A text is a logical unit with common characteristics and metadata, such as the same writer and timestamp. It may be, for example, a single message on a discussion forum, an article in a magazine or a complete novel. It is a matter of decision what constitutes a text; for example, in OCR’d newspapers and magazines without article boundaries marked, a text might be one issue or one page.

In addition to texts, paragraphs and sentences, the VRT input may also contain other structures, even though they are seen in Korp only via their possible attributes. The structures may be at any level, for example, chapters containing paragraphs, or clauses within sentences. If the original data has other structures, they should preferably be preserved in the VRT format. However, please note that some VRT processing tools currently do not correctly handle VRT data containing structures within sentences.

Note that all tokens should be within text and sentence structures; otherwise Korp will not show them and some processing tools, such as the parser, will fail. Moreover, if the data has paragraphs, each sentence should be inside a paragraph. For example, if the text contains headings, they should be enclosed in sentence structures (and also in paragraph structures if the data has paragraphs), perhaps with the attribute type having the value heading, instead of inventing a distinct structure for headings.
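The requirement that every token be inside a text and a sentence can be checked before importing the data. The following Python sketch returns the line numbers of tokens that are outside either structure; the function name and the simplified tag pattern are assumptions for this illustration, and the check does not verify that the structures are properly nested.

```python
import re

# Simplified pattern for start and end tags on their own lines.
TAG_RE = re.compile(r"^<(/?)([a-z][a-z0-9_]*)[ >]")


def tokens_outside_structures(lines, required=("text", "sentence")):
    """Return the numbers of token lines not inside all required structures."""
    depth = {name: 0 for name in required}
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if line.startswith("<"):
            # Tag, comment or XML declaration; only tags affect the depth.
            match = TAG_RE.match(line)
            if match and match.group(2) in depth:
                depth[match.group(2)] += -1 if match.group(1) else 1
        elif line and any(count < 1 for count in depth.values()):
            bad.append(number)
    return bad
```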

Attributes of structures

The attributes of structural elements can mostly be free-form, but the same attribute names should be used for the same information across corpora; see below for commonly used attribute names. The metadata for a complete text should be represented as attributes of the text start tag. The names of attributes should be in English.

If the creation date (and time) of the original text is known, it is represented (as local time) in the attributes datefrom, dateto, timefrom and timeto of the structure text. The values of datefrom and dateto must be of the form yyyymmdd (or empty), and the values of timefrom and timeto of the form hhmmss (or empty). If the full creation date is known but not the time, the values of datefrom and dateto are the same, the value of timefrom is 000000, and that of timeto 235959. If only the year is known, the value of datefrom should be yyyy0101 and dateto yyyy1231. If the creation date is unknown, the attributes should be left empty. No ad-hoc values may be used, such as marking uncertainty with a question mark.
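The rules above for the date and time attributes can be sketched in Python as follows. The function name date_attrs is hypothetical. Using the full-day times 000000 and 235959 also when only the year is known is an assumption of this sketch, since the text above specifies them explicitly only for a known date without a time.

```python
import datetime


def date_attrs(date=None, year=None) -> dict:
    """Return datefrom, dateto, timefrom and timeto for a text.

    date is a datetime.date if the full date is known; year is an
    integer if only the year is known; both None if undated.
    """
    if date is not None:
        datefrom = dateto = f"{date.year:04d}{date.month:02d}{date.day:02d}"
    elif year is not None:
        datefrom, dateto = f"{year:04d}0101", f"{year:04d}1231"
    else:
        # Unknown creation date: all four attributes are left empty.
        return dict.fromkeys(("datefrom", "dateto", "timefrom", "timeto"), "")
    return {"datefrom": datefrom, "dateto": dateto,
            "timefrom": "000000", "timeto": "235959"}
```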

If you need separate, human-readable date and time attributes for the corpus, you can use date_orig and time_orig containing values extracted from the original corpus data. For uniformity across corpora, date_iso and time_iso can be used to represent the date and time in the long ISO formats yyyy-mm-dd and hh:mm:ss.

In particular, dependency-parsed corpora require each sentence structure to have the attribute id, whose value is unique within the corpus.

All the structures of the same type in a corpus should have the same attributes. If the value of an attribute is empty for some structure, it should still be represented as attrname="". Although the order of the attributes in the VRT does not matter, they should preferably be in the same order in all the structures of a corpus. An alphabetic (lexicographic) order is recommended.
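A start tag following these conventions could be generated with a Python sketch such as the following: the attributes are written in alphabetic order, separated by single spaces, with empty values as empty strings and the special characters encoded. The function name format_start_tag is an assumption made for this example.

```python
def format_start_tag(name: str, attrs: dict) -> str:
    """Return a VRT start tag with the attributes in alphabetic order."""
    parts = [name]
    for key in sorted(attrs):
        value = attrs[key] or ""  # a missing value becomes attrname=""
        # Encode & first, then <, > and the double quote.
        value = (value.replace("&", "&amp;").replace("<", "&lt;")
                      .replace(">", "&gt;").replace('"', "&quot;"))
        parts.append(f'{key}="{value}"')
    return "<" + " ".join(parts) + ">"
```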

The names of structures and their attributes in VRT may only contain the characters a–z (lowercase only), 0–9 and _ (underscore). (CWB also allows a - (hyphen), but not all VRT tools can handle it.) The names may not begin with a digit. Moreover, do not use the underscore in the names of structures; in their attributes it may be used without problems. You should also avoid using the following reserved words of the CQP query language as structure or (positional) attribute names:

asc ascending by cat cd collocate contains cut def define delete desc descending diff difference discard dump exclusive exit expand farthest foreach group host inclusive info inter intersect intersection join keyword left leftmost macro maximal match matchend matches meet MU nearest no not NULL off on randomize reduce RE reverse right rightmost save set show size sleep sort source subset TAB tabulate target target[0-9] to undump union unlock user where with within without yes
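The naming rules and the list of reserved words can be combined into a simple check, sketched below in Python. The function name is hypothetical. The sketch compares names with the reserved words exactly as listed above (so the upper-case ones never match a valid lower-case name) and expands target[0-9] to target0 … target9; both choices are assumptions of this example.

```python
import re

RESERVED = set("""
asc ascending by cat cd collocate contains cut def define delete desc
descending diff difference discard dump exclusive exit expand farthest
foreach group host inclusive info inter intersect intersection join keyword
left leftmost macro maximal match matchend matches meet MU nearest no not
NULL off on randomize reduce RE reverse right rightmost save set show size
sleep sort source subset TAB tabulate target to undump union unlock user
where with within without yes
""".split()) | {f"target{digit}" for digit in range(10)}

STRUCT_NAME_RE = re.compile(r"[a-z][a-z0-9]*")   # no underscores
ATTR_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*")   # underscores allowed


def is_valid_name(name: str, is_structure: bool = False) -> bool:
    """Check a structure or attribute name against the rules above."""
    pattern = STRUCT_NAME_RE if is_structure else ATTR_NAME_RE
    return bool(pattern.fullmatch(name)) and name not in RESERVED
```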

A structural attribute may also be a multi-valued (feature set) or ranked attribute, in which case its values are represented in the same way as the values of a positional feature-set or ranked attribute; for example, nertag="|LocGpl|LocPpl|" or nertag="|LocGpl:0.8|LocPpl:0.2|".

Commonly used structural attributes

The following structural attributes occur in several corpora and should be used for future corpora with similar information. The attributes whose names are in bold are required by Korp. The format of the value is shown for attributes with fixed-format values. The list is not exhaustive, so if you think your corpus has attributes that have probably been used before, you may ask for advice on naming them.

Structure | Attribute | Description | Value
text | datefrom | The starting date of original creation or publication of the text | Format yyyymmdd; empty if undated; if only year is known, use yyyy0101
text | dateto | The ending date of original creation or publication of the text | Format yyyymmdd; empty if undated; if only year is known, use yyyy1231
text | timefrom | The starting time of day of original creation or publication of the text | Format hhmmss; empty if undated; if only the creation date is known, use 000000
text | timeto | The ending time of day of original creation or publication of the text | Format hhmmss; empty if undated; if only the creation date is known, use 235959
text | date_orig | Creation or publication date of the text | Free-form date in the original format present in the data; may contain ranges and indicators of uncertainty
text | time_orig | Creation or publication time of the text | Free-form time in the original format present in the data; may contain time zone information
text | datetime_orig | Combined creation or publication date and time of the text | Free-form date and time in the original format present in the data
text | date_iso | Creation or publication date of the text | Long ISO format yyyy-mm-dd
text | time_iso | Creation or publication time of the text | Long ISO format hh:mm:ss
text | title | Title of the text |
text | author | Author of the text |
text | translator | Translator of the text |
text | year | Writing or publication year of the text |
text | filename | Name of the source file for the text |
text | url | URL of a human-readable version of the text, possibly with context |
text | issue | Magazine or newspaper issue |
text | lang | Language of the text | Preferably a three-letter ISO 639-2 code
text | subject | Subject of the text |
text | publisher | Publisher of the text |
text | wordcount | The number of words in the text, excluding punctuation marks |
paragraph | id | An identifier for the paragraph |
paragraph | type | The type of the paragraph | For example, paragraph, heading
sentence | id | An identifier unique within a corpus; required for dependency-parsed corpora |

[TODO: Extend the list]

Attributes and their values in the Korp search interface

The Korp search interface requires information on the structures and attributes used in a corpus. Each corpus should preferably be accompanied by a list of the annotation attributes of tokens and in particular of the structural attributes, together with their brief labels or descriptions in at least Finnish, preferably also in English and Swedish. (For Swedish corpora, the attribute labels may be only in Swedish and for corpora in other languages only in English.)

If the set of values of an attribute is fixed and relatively small, such as parts of speech, Korp’s extended search may have a selection list for it with human-readable names of the values (e.g., N = noun). It would be good to have a list of such names for attribute values as well.

Parallel corpora

Each language (or otherwise parallel part) of a parallel corpus is encoded separately. The alignment between different languages is marked by using the same value for the id attribute of the aligned structures. It is now preferable to use the separate alignment structure link, which marks alignment units that may span several sentences, for example. It is possible to add the alignment id attribute to paragraphs or sentences, as in a number of existing corpora, but using link enables better compatibility with Språkbanken’s Korp.

CWB allows one-to-one, one-to-many, many-to-one, many-to-many and crossing alignments, but the alignment regions must be contiguous.

It is also possible to use alignments specified separately either as correspondences of aligned regions by token positions in the corpus or as alignment beads referring to the attributes of aligned structures. [TODO: Add examples]

[TODO: Add information about word alignment (optional).]

CWB limitations

Note the following technical limitations of the CWB, which affect the VRT files (Appendix B of the CWB Corpus Encoding Tutorial):

  • The maximum number of tokens in a corpus is 2,147,483,647 (2^31 – 1). However, in practice, it is better to keep the size of a (physical) corpus smaller, less than 500,000,000 tokens. Larger corpora should be split into subcorpora.
  • The maximum length of an attribute value is 4,095 bytes. Note that a character in the UTF-8 encoding may occupy more than one byte.
  • The maximum size of the lexicon for a positional or structural attribute (approximately the combined length of all the distinct values of the attribute) is 2,147,483,647 bytes.
  • The maximum length of a line in the VRT input is 65,536 bytes.
  • The maximum length of the name of an input file is 1,024 bytes.
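
The byte-based limits above can be checked for each token line before encoding the corpus. The following Python sketch uses the limits as listed above; the function name is hypothetical, and whether the line limit includes the line terminator is not specified here, so the sketch measures the line without it.

```python
MAX_VALUE_BYTES = 4095   # maximum length of an attribute value
MAX_LINE_BYTES = 65536   # maximum length of a line in the VRT input


def check_token_line(line: str) -> list:
    """Return descriptions of the length limits a token line exceeds."""
    line = line.rstrip("\r\n")
    problems = []
    # Lengths are measured in UTF-8 bytes, not in characters.
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        problems.append("line too long")
    for index, value in enumerate(line.split("\t")):
        if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
            problems.append(f"attribute {index} too long")
    return problems
```

For example, a value of 2,048 ä characters takes 4,096 bytes in UTF-8 and therefore exceeds the limit even though it has fewer than 4,095 characters.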