
The Korp corpus input format

This page contains information on the input format required for Kielipankki’s Korp text corpus search service. The information is primarily aimed at Kielipankki’s staff importing corpora to Korp, but it may also be useful for corpus providers if they can affect the corpus format. For further information, please contact fin-clarin [at] helsinki.fi.

The original format of corpora to be imported to Korp may vary widely. The data may be plain text, HTML, in a tabular format such as CoNLL-X, or in an XML format such as TEI. In any case, it is important that the data format is as consistent as possible, since consistency makes corpus processing faster and the result better.

VRT file format

The input format for the IMS Open Corpus Workbench (CWB) software underlying Korp is VRT (VeRticalized Text). VRT is a token-oriented columnar text format: each token (word) is on its own line together with its possible annotation attributes (positional attributes), such as lemma, part of speech, morphological analysis and syntactic relation, separated by tabs. The structure of the text is represented with XML-style tags (structural attributes) on their own lines. Start tags may contain XML-style attributes for the structure.

Note that not all the requirements and recommendations for the input format described in this document are mandated by CWB. Nevertheless, violating them may make parts of the corpus data impossible or more difficult to search in Korp, and may make the VRT data more difficult to process with other tools. Keep in mind that CWB often does not even warn about violations.

The following is an example of the VRT format as used by Korp. Tab characters are represented as →. Structural elements are text, paragraph and sentence, and the positional attributes are word form, the number of the token within the sentence, lemma, lemma with compound boundaries marked, part of speech, morphological analysis, dependency head number and dependency relation.



<text filename="EuroParl Corpus/fi-en/fi/ep-00-01-17.txt" title="" datefrom="20000117" dateto="20000117" timefrom="000000" timeto="235959">
<paragraph id="1">
<sentence id="1" line="2">
Istuntokauden→	1→	istuntokausi→	istunto#kausi→	N→	N Gen Sg→	2→	obj
uudelleenavaaminen→	2→	uudelleenavaaminen→	uudelleen#avaaminen→	N→	N Nom Sg→	0→	main
</sentence>
</paragraph>
<paragraph id="2">
<sentence id="2" line="4">
Julistan→	1→	julistaa→	julistaa→	V→	V Prs Act Sg1→	0→	main
perjantaina→	2→	perjantai→	perjantai→	N→	N Ess Sg→	1→	advl
joulukuun→	3→	joulukuu→	joulu#kuu→	N→	N Gen Sg→	5→	attr
17.→	4→	17.→	17.→	Num→	Num Digit→	5→	attr
päivänä→	5→	päivä→	päivä→	N→	N Ess Sg→	1→	advl
...
.→	26→	.→	.→	Punct→	Punct→	-→	-
</sentence>
</paragraph>
...
</text>

Note that you should use double quotation marks around the structural attribute values. Even though the CWB encoder also understands single quotes, some other tools processing VRT data might assume double quotes.
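The example above can be read with a few lines of Python. The following is only a minimal sketch, not part of Korp or CWB: the function name read_sentences is hypothetical, and the code assumes double-quoted attribute values, no nested sentence structures and that a literal < in a token is encoded as &lt;, so that every line beginning with < is a tag or a comment.

```python
import re

# Matches name="value" pairs in a start tag (double quotes assumed).
ATTR_RE = re.compile(r'([a-z_-][a-z0-9_-]*)="([^"]*)"')


def read_sentences(lines):
    """Yield (sentence attributes, token rows) for each sentence in VRT lines."""
    attrs, tokens, inside = {}, [], False
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("<sentence"):
            attrs, tokens, inside = dict(ATTR_RE.findall(line)), [], True
        elif line.startswith("</sentence>"):
            yield attrs, tokens
            inside = False
        elif line.startswith("<") or not line:
            continue  # other structural tags, comments and empty lines
        elif inside:
            # One token per line; positional attributes are separated by tabs.
            tokens.append(line.split("\t"))
```

Each yielded token row is a list of column values in the order in which they appear in the VRT data; the names of the columns are not available from the VRT itself.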

In contrast to XML, VRT does not require a single root element (structural attribute), so a VRT input may consist of a sequence of texts, for example. Another difference from XML is that VRT allows crossing structural attributes; for example:


<page id="p1">
...
<sentence id="s8">
...
</page>
<page id="p2">
...
</sentence>
...
</page>

However, using crossing structures makes it impossible to use XML tools for VRT data, so they should be avoided unless they are necessary.
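Crossing structures can be detected by tracking the open structures in a stack. The following is a hedged sketch with a hypothetical function name, find_crossing; it assumes that every line beginning with < followed by a lowercase name (or by / and a name) is a structural tag.

```python
import re

# A start or end tag at the beginning of a line; comments (<!--) do not match.
TAG_RE = re.compile(r"^<(/?)([a-z_-][a-z0-9_-]*)[ >]")


def find_crossing(lines):
    """Return (line number, name) for end tags not closing the innermost open structure."""
    stack, problems = [], []
    for number, line in enumerate(lines, start=1):
        match = TAG_RE.match(line)
        if not match:
            continue  # token line, comment or empty line
        closing, name = match.groups()
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append((number, name))
            if name in stack:
                # Drop the innermost open structure of this name and go on.
                index = len(stack) - 1 - stack[::-1].index(name)
                del stack[index]
    return problems
```

For the page and sentence example above, the sketch reports the end tag of the first page and the end tag of the sentence, since neither closes the innermost open structure.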

As the VRT format resembles XML, an XML format may be a good basis for corpus data to be imported to Korp. However, note that nesting the same structural attribute type in VRT is somewhat cumbersome, so it would be better to avoid having a clause inside a clause, for example.

Completely empty lines in the VRT input should be avoided within sentences. Even though leading and trailing spaces in attribute values are stripped in the encoding phase, they should preferably be stripped in the VRT input. The VRT input may contain XML-style comments <!-- ... --> that are ignored, but each comment must be on its own line and multi-line comments are not recognized. An XML declaration at the beginning of a file is ignored. (All these require explicit options to the CWB encoding program, but it probably makes sense to use them.)

For more information on the VRT format, please refer to the CWB Corpus Encoding Tutorial in the CWB documentation (a possibly more up-to-date local copy, retrieved from the CWB version control, is also available). Some Korp-specific information is also found on Språkbanken’s Korp backend information page.

Character encoding and character content

The input data for Korp must be UTF-8-encoded Unicode. If the original data is not in UTF-8, you need to convert it, using e.g. iconv. To do that, you need to know the original character encoding of the data. In a bad case, it may be a mix of two or more 8-bit encodings, such as ISO 8859-1 and Windows-1252, which complicates converting the encoding correctly; and in the worst case, the character encoding may already have been converted incorrectly.
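As an alternative to iconv, the conversion can be sketched in Python. The function name to_utf8 and the choice of Windows-1252 as the fallback are assumptions made for this illustration only; the sketch works on a whole file at a time and does not repair mixed or already incorrectly converted data.

```python
def to_utf8(raw, fallback="cp1252"):
    """Return UTF-8 bytes for raw bytes that are either UTF-8 or in the fallback encoding."""
    try:
        # Valid UTF-8 is kept as it is.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is in the single fallback encoding.
        # Decoding raises an error for bytes undefined in that encoding.
        text = raw.decode(fallback)
    return text.encode("utf-8")
```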

The characters & and < in the data need to be encoded as XML predefined entities &amp; and &lt;. Other XML predefined entities may also be used: &quot; for " (straight ASCII double quotation mark), &apos; for ' (straight ASCII single quotation mark) and &gt; for >, but they are mandatory only if the value of a structural attribute enclosed in quotes contains the same type of quote.

In contrast, do not use the numeric character references of XML &#nnnn; and &#xhhhh;, nor HTML character entity references, such as &auml;, since the CWB encoder treats them literally. Instead, use the corresponding UTF-8-encoded Unicode characters directly.
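The two rules above can be combined into one step when the source data is derived from HTML or XML. The following is a sketch under that assumption; escape_vrt is a hypothetical name. Note that html.unescape follows HTML5 rules and also resolves some references without a trailing semicolon, so it should not be applied to data in which a sequence such as &copy is meant literally.

```python
import html


def escape_vrt(text, in_attribute=False):
    """Resolve character references, then encode & and < (and " in attribute values)."""
    # &auml; and &#228; become the character itself; &amp; becomes a bare &.
    text = html.unescape(text)
    # Encode & first so that the & of &lt; is not encoded again.
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    if in_attribute:
        # Needed in structural attribute values enclosed in double quotes.
        text = text.replace('"', "&quot;")
    return text
```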

The line endings in the VRT input may be either Unix- or Windows-style (bare LF or CR+LF), but Unix-style bare LF should be preferred.

VRT data may not contain tabs or any line-separating characters anywhere else than as separators of positional attributes and lines, respectively. Moreover, the data should not contain any other control characters (characters in the ranges U+0000…U+001F and U+007F…U+009F), nor preferably the Unicode line and paragraph separators (U+2028, U+2029). They should be stripped from the data, or if their presence is essential, encoded in a corpus-specific way. In addition, soft hyphens (U+00AD) should also be removed.

The no-break space (NBSP, U+00A0) may be used in the values of positional and structural attributes between other characters. However, NBSPs at the beginning and end of an attribute value should be stripped, multiple consecutive NBSPs should be converted to a single one, and values consisting of only NBSPs should be emptied. In particular, tokens consisting of only NBSPs should be removed. The Unicode characters FIGURE SPACE (U+2007) and NARROW NO-BREAK SPACE (U+202F) (and also THIN SPACE, U+2009, when used as a thousands separator) should be converted to NBSPs and treated as above. Other Unicode spaces should be converted to plain spaces and treated as such. (Information on Unicode spaces.)
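The character clean-up described in the two paragraphs above could be sketched as follows for a single attribute value. The function name clean_value is hypothetical, and treating a THIN SPACE between two digits as a thousands separator is a simplifying assumption. Since the sketch removes all control characters, it would have to be applied before any corpus-specific encoding that itself uses control characters.

```python
import re

NBSP = "\u00a0"
# Control characters, line and paragraph separators and the soft hyphen.
REMOVE_RE = re.compile(r"[\x00-\x1f\x7f-\x9f\u2028\u2029\xad]")
# THIN SPACE between digits (assumed to be a thousands separator).
THOUSANDS_THIN_RE = re.compile(r"(?<=\d)\u2009(?=\d)")
# FIGURE SPACE and NARROW NO-BREAK SPACE.
NBSP_LIKE_RE = re.compile(r"[\u2007\u202f]")
# The remaining Unicode space characters.
OTHER_SPACE_RE = re.compile(r"[\u1680\u2000-\u2006\u2008-\u200a\u205f\u3000]")


def clean_value(value):
    """Normalize the control characters and spaces of one attribute value."""
    value = REMOVE_RE.sub("", value)
    value = THOUSANDS_THIN_RE.sub(NBSP, value)
    value = NBSP_LIKE_RE.sub(NBSP, value)
    value = OTHER_SPACE_RE.sub(" ", value)
    value = re.sub(NBSP + "+", NBSP, value)  # collapse consecutive NBSPs
    return value.strip(" " + NBSP)  # strip leading and trailing spaces and NBSPs
```

A token whose word form becomes empty after this clean-up would then be removed from the VRT data.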

Positional (token) attributes

The positional token attributes (columns) may be, for example, the following (in a dependency-parsed corpus with named entities marked):

  • word form
  • the number of the token within the sentence
  • lemma
  • lemma with compound boundaries marked
  • part of speech
  • morphological analysis (morphosyntactic description)
  • dependency head (the number of the word within the sentence)
  • dependency relation
  • named entity tag

The names of positional attributes are not specified in the VRT data; instead, each attribute is recognized by its position (column). The attributes are assigned names at the corpus encoding stage.

With the exception of the word form, which should be first, the attributes can be in any order, as long as all the tokens in a corpus have the same attributes in the same order.

If a corpus does not have an attribute, it is left out. If some tokens have an attribute and some others do not, the missing values can be either left completely empty, or a single underscore (_) may be used to denote the empty value. Even if the missing values were completely omitted, the attribute-separating tabs need to be present, to keep the attribute alignment correct.
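A token line can be brought to a fixed number of columns as in the following sketch. The function name normalize_token_line is hypothetical, and the sketch assumes that any completely missing columns are the last ones, which need not hold for real data.

```python
def normalize_token_line(line, n_attrs, empty="_"):
    """Pad a token line to n_attrs tab-separated columns, marking empty values."""
    fields = line.rstrip("\r\n").split("\t")
    if not fields[0]:
        raise ValueError("the word form (first column) must not be empty")
    if len(fields) > n_attrs:
        raise ValueError(f"expected at most {n_attrs} columns, got {len(fields)}")
    # Assumption: missing columns are trailing ones.
    fields += [""] * (n_attrs - len(fields))
    # Use empty="" to leave missing values completely empty.
    return "\t".join(value or empty for value in fields)
```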

An attribute may have multiple values, in which case it is referred to as a feature set attribute in CWB. Multi-valued attributes can be used to represent ambiguity or uncertainty, for example. A multi-valued attribute is represented by separating the values by vertical bars and adding vertical bars at the beginning and end of the whole value; for example, |Adj|Noun|Verb|. An empty feature set attribute value (no values in the set) is denoted by a single vertical bar.

If a value of a feature set would itself contain the vertical bar, it should be replaced with another character. If it is replaced with the Unicode control character U+0083, it can be searched and shown literally as a vertical bar in Korp.
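The feature set notation can be produced and read as in the following sketch; the function names format_feature_set and parse_feature_set are hypothetical.

```python
def format_feature_set(values):
    """Format values as a feature set: |a|b|, or a single | for the empty set."""
    # A vertical bar within a value is replaced with U+0083.
    cleaned = [value.replace("|", "\u0083") for value in values if value]
    if not cleaned:
        return "|"
    return "|" + "|".join(cleaned) + "|"


def parse_feature_set(value):
    """Return the list of values in a feature set attribute value."""
    return [item.replace("\u0083", "|")
            for item in value.strip("|").split("|") if item]
```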

Structures

Korp recognizes and uses three levels of structures: text, paragraph and sentence. In the input VRT, they are represented with structural attributes of the same names (text, paragraph and sentence). Paragraphs are optional, but texts and sentences are required.

The sentence is the (default) context of a match in the KWIC concordance in Korp, whereas the enclosing paragraph is shown in the context view. (If a corpus has no paragraphs, the context view also shows sentences.) A text is a logical unit with common characteristics and metadata, such as the same writer and timestamp. It may be, for example, a single message on a discussion forum, an article in a magazine or a complete novel. It is a matter of decision what constitutes a text; for example, in OCR’d newspapers and magazines without article boundaries marked, a text might be one issue or one page.

In addition to texts, paragraphs and sentences, the VRT input may also contain other structures, even though they are seen in Korp only via their possible attributes. The structures may be at any level, for example, chapters containing paragraphs, or clauses within sentences. If the original data has other structures, they should preferably be preserved in the VRT format. However, please note that some VRT processing scripts currently do not correctly handle VRT data containing structures within sentences.

Note that all tokens should be within sentence structures; otherwise Korp will not show them and some processing tools, such as the parser, will fail. For example, if the text contains headings, they should be enclosed in sentence structures, with perhaps the attribute type having the value heading, instead of inventing a distinct structure for headings.

Attributes of structures

The attributes of structural elements can mostly be free-form, but the same attribute names should be used for the same information across corpora; see below for commonly used attribute names. The metadata for a complete text should be represented as attributes of the text start tag. The names of attributes should be in English.

If the creation date (and time) of the original text is known, it is represented (as local time) in the attributes datefrom, dateto, timefrom and timeto of the structure text. The values of datefrom and dateto should be of the form yyyymmdd, and the values of timefrom and timeto of the form hhmmss. If the full creation date is known but not the time, the values of datefrom and dateto are the same, the value of timefrom is 000000, and that of timeto 235959. If only the year is known, the value of datefrom should be yyyy0101 and dateto yyyy1231. If the creation date is unknown, the attributes should be left empty. No ad-hoc values may be used, such as marking uncertainty with a question mark.

If you need separate, human-readable date and time attributes for the corpus, you can use date_orig and time_orig containing values extracted from the original corpus data. For uniformity across corpora, date_iso and time_iso can be used to represent the date and time in the long ISO formats yyyy-mm-dd and hh:mm:ss.
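The rules for datefrom, dateto, timefrom and timeto can be sketched as follows. The function name date_attributes is hypothetical. Two points are assumptions not stated above: an exactly known time is represented with the same value in timefrom and timeto, and a text of which only the year is known gets the full-day time range.

```python
def date_attributes(year=None, month=None, day=None, hhmmss=None):
    """Return datefrom, dateto, timefrom and timeto values for a text start tag."""
    if year is None:
        # Unknown creation date: all the attributes are left empty.
        return {"datefrom": "", "dateto": "", "timefrom": "", "timeto": ""}
    if month is None and day is None:
        # Only the year is known.
        datefrom, dateto = f"{year:04d}0101", f"{year:04d}1231"
    elif month is not None and day is not None:
        datefrom = dateto = f"{year:04d}{month:02d}{day:02d}"
    else:
        raise ValueError("partial dates other than year-only are not covered here")
    if hhmmss is None:
        timefrom, timeto = "000000", "235959"
    else:
        timefrom = timeto = hhmmss  # assumption: exact time as a zero-length range
    return {"datefrom": datefrom, "dateto": dateto,
            "timefrom": timefrom, "timeto": timeto}
```

For example, date_attributes(2000, 1, 17) gives the date and time values used in the text start tag of the VRT example earlier on this page.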

In particular, dependency-parsed corpora require each sentence structure to have the attribute id, whose value is unique within the corpus.

All the structures of the same type in a corpus should have the same attributes. If the value of an attribute is empty for some structure, it should still be represented as attrname="". Although the order of the attributes in the VRT does not matter, they should preferably be in the same order in all the structures of a corpus. An alphabetic (lexicographic) order is recommended.
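A start tag following these recommendations could be generated as in the following sketch; the function name start_tag is hypothetical. The attributes are sorted by name, empty values are kept as attrname="", and the values are enclosed in double quotes with &, < and " encoded.

```python
def start_tag(name, attrs):
    """Return a VRT start tag with the attributes in alphabetic order."""
    parts = [name]
    for key in sorted(attrs):
        # A missing value (None) is written as an empty value.
        value = "" if attrs[key] is None else str(attrs[key])
        value = (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;"))
        parts.append(f'{key}="{value}"')
    return "<" + " ".join(parts) + ">"
```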

The names of structures and their attributes in VRT may only contain the characters a–z (lowercase only), 0–9, - (hyphen) and _ (underscore). The names may not begin with a digit. Moreover, you should avoid the underscore in the names of structures; in the names of their attributes it may be used without problems. In addition, the following reserved words of the CQP query language may not be used as structure or attribute names:

asc ascending by cat cd collocate contains cut def define delete desc descending diff difference discard dump exclusive exit expand farthest foreach group host inclusive info inter intersect intersection join keyword left leftmost macro maximal match matchend matches meet MU nearest no not NULL off on randomize reduce RE reverse right rightmost save set show size sleep sort source subset TAB tabulate target target[0-9] to undump union unlock user where with within without yes
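The naming rules can be checked mechanically, as in the following sketch with the hypothetical function name name_problems. The reserved words are compared exactly as listed above, so the uppercase entries can never match a name that is otherwise valid; whether CQP treats them case-insensitively is not covered here.

```python
import re

# The reserved words listed above; target[0-9] is handled separately.
RESERVED = set("""
asc ascending by cat cd collocate contains cut def define delete desc
descending diff difference discard dump exclusive exit expand farthest
foreach group host inclusive info inter intersect intersection join keyword
left leftmost macro maximal match matchend matches meet MU nearest no not
NULL off on randomize reduce RE reverse right rightmost save set show size
sleep sort source subset TAB tabulate target to undump union unlock user
where with within without yes
""".split())
NAME_RE = re.compile(r"[a-z_-][a-z0-9_-]*")


def name_problems(name, is_structure=False):
    """Return a list of problems in a structure or attribute name."""
    problems = []
    if not NAME_RE.fullmatch(name):
        problems.append("invalid characters or leading digit")
    if name in RESERVED or re.fullmatch(r"target[0-9]", name):
        problems.append("reserved word of the CQP query language")
    if is_structure and "_" in name:
        problems.append("underscore in a structure name")
    return problems
```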

A structural attribute may also be a multi-valued (feature set) attribute, in which case its values are represented in the same way as the values of a positional feature set attribute; for example, nertag="|LocGpl|LocPpl|".

Commonly used structural attributes

The following structural attributes occur in several corpora and should be used for future corpora with similar information. The attributes whose names are in bold are required by Korp. The format of the value is shown for attributes with fixed-format values. The list is not exhaustive, so if you think that your corpus has attributes that have probably been used in earlier corpora, you may ask for advice on naming them.

Structure | Attribute | Description | Value
text | datefrom | The starting date of original creation or publication of the text | Format yyyymmdd; empty if undated; if only year is known, use yyyy0101
text | dateto | The ending date of original creation or publication of the text | Format yyyymmdd; empty if undated; if only year is known, use yyyy1231
text | timefrom | The starting time of day of original creation or publication of the text | Format hhmmss; empty if undated; if only the creation date is known, use 000000
text | timeto | The ending time of day of original creation or publication of the text | Format hhmmss; empty if undated; if only the creation date is known, use 235959
text | date_orig | Creation or publication date of the text | Free-form date in the original format present in the data; may contain ranges and indicators of uncertainty
text | time_orig | Creation or publication time of the text | Free-form time in the original format present in the data; may contain time zone information
text | datetime_orig | Combined creation or publication date and time of the text | Free-form date and time in the original format present in the data
text | date_iso | Creation or publication date of the text | Long ISO format yyyy-mm-dd
text | time_iso | Creation or publication time of the text | Long ISO format hh:mm:ss
text | title | Title of the text |
text | author | Author of the text |
text | translator | Translator of the text |
text | year | Writing or publication year of the text |
text | filename | Name of the source file for the text |
text | url | URL of a human-readable version of the text, possibly with context |
text | issue | Magazine or newspaper issue |
text | lang | Language of the text | Preferably a three-letter ISO 639-2 code
text | subject | Subject of the text |
text | publisher | Publisher of the text |
text | wordcount | The number of words in the text, excluding punctuation marks |
paragraph | id | An identifier for the paragraph |
paragraph | type | The type of the paragraph | For example, paragraph, heading
sentence | id | An identifier unique within a corpus; required for dependency-parsed corpora |

[TODO: Extend the list]

Attributes and their values in the Korp search interface

The Korp search interface requires information on the structures and attributes used in a corpus. Each corpus should preferably be accompanied by a list of the annotation attributes of tokens and in particular the structural attributes, together with their brief labels or descriptions in at least Finnish, preferably also in English and Swedish. (For Swedish corpora, the attribute labels may be only in Swedish and for corpora in other languages only in English.)

If the set of values of an attribute is fixed and relatively small, such as parts of speech, Korp’s extended search may offer a selection list for it with human-readable names of the values (e.g., N = noun). It would be good to have a list of such names for attribute values as well.

Parallel corpora

Each language (or otherwise parallel part) of parallel corpora is encoded separately. The alignment between different languages is marked by using the value for the id attribute of the aligned structures. It is now preferable to use the separate alignment structure link marking alignment units that may span several sentences, for example. It is possible to add the alignment id attribute to paragraphs or sentences, as in a number of existing corpora, but using link enables better compatibility with Språkbanken’s Korp.

CWB allows one-to-one, one-to-many, many-to-one, many-to-many and crossing alignments, but the alignment regions must be contiguous.

It is also possible to use alignments specified separately either as correspondences of aligned regions by token positions in the corpus or as alignment beads referring to the attributes of aligned structures. [TODO: Add examples]

[TODO: Add information about word alignment (optional).]

CWB limitations

Note the following technical limitations of the CWB, which affect the VRT files (Appendix B of the CWB Corpus Encoding Tutorial):

  • The maximum number of tokens in a corpus is 2,147,483,647 (2^31 – 1) tokens. However, in practice, it is better to keep the size of a (physical) corpus smaller, less than 500,000,000 tokens. Larger corpora should be split to subcorpora.
  • The maximum length of an attribute value is 4,095 bytes. Note that a character in the UTF-8 encoding may occupy more than one byte.
  • The maximum size of the lexicon for a positional or structural attribute (approximately the combined length of all the distinct values of the attribute) is 2,147,483,647 bytes.
  • The maximum length of a line in the VRT input is 65,536 bytes.
  • The maximum length of the name of an input file is 1,024 bytes.
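The per-line limits above can be checked before encoding, as in the following sketch. The function name limit_problems is hypothetical; the sketch measures lengths in UTF-8 bytes without the line ending, and it checks only the values of positional attributes on token lines, not the attribute values within structural tags.

```python
MAX_LINE_BYTES = 65_536   # maximum length of a VRT input line
MAX_VALUE_BYTES = 4_095   # maximum length of an attribute value


def limit_problems(line):
    """Return a list of CWB length limit violations for one VRT line."""
    problems = []
    line = line.rstrip("\r\n")
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        problems.append("line too long")
    if not line.startswith("<"):
        # Token line: check each positional attribute value.
        for column, value in enumerate(line.split("\t"), start=1):
            if len(value.encode("utf-8")) > MAX_VALUE_BYTES:
                problems.append(f"column {column} value too long")
    return problems
```

Since ä occupies two bytes in UTF-8, a value of 2,048 such characters already exceeds the limit of 4,095 bytes.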