Korp user guide

Korp is a Web-based tool that allows its user to search for keywords in text corpora (typically grammatically parsed) and to generate concordances. Korp gives its users access to extensive collections of texts in Finnish and Finland Swedish. Many of these corpora can be accessed without logging in whereas others are only accessible to logged-in users. In some cases individual access rights are also required (instructions for applying).

Please note that this manual has been written specifically for the version of Korp used by the Language Bank of Finland and may not be fully applicable to other versions. Korp is developed at Språkbanken of the University of Gothenburg in Sweden, and their Korp site contains mostly Swedish-language materials. The Norwegian version of Korp contains corpora in Norwegian and Saami. Instructions for Korp as used by the Swedish Språkbanken are also available (in Swedish only; note that these instructions may not be entirely applicable to the Finnish version of Korp).

Start Korp:
https://korp.csc.fi

General information

The Korp graphical search interface can be accessed with a Web browser that has JavaScript enabled. Korp works best on Firefox and Chrome, whereas some of its functionality does not work on Internet Explorer.

Choosing the environment

You can switch between different modes by clicking on the links at the top of the Korp GUI. Different modes contain different types of corpora and have slightly different search features.

Four modes are available at the moment:

  • Finnish: the default environment; contains corpora and corpus samples of both written and spoken Modern Finnish as well as texts in Early Modern Finnish and Old Literary Finnish.
  • Swedish: texts primarily in Finland Swedish in addition to some Swedish-language fragments of parallel corpora.
  • Other languages: texts in Finno-Ugric languages and fragments of parallel corpora in various languages.
  • Parallel texts: parallel corpora, bilingual and multilingual text collections – each search result is shown with its translation in the parallel text

Selecting a corpus

To the right of the Korp logo is the corpus selection bar that lets you select the corpora you want to search. It may say e.g.

4 of 923 corpora selected – 76.53M of 8,74G tokens

Clicking on the bar opens the corpus selection menu where you can select all the relevant corpora. The corpora are arranged hierarchically in a treelike structure. You can see the list of corpora in a branch by clicking on the triangles at the beginning of each line.

If you hover the cursor over the name of a corpus or a branch, an info box displaying the total number of sentences and tokens in the corpus or branch in question appears. (Note that the number of tokens also includes punctuation.)

The “Select all” button at the top of the menu selects all corpora listed in the menu. The “Select none” button clears all selections.

Please do not select all corpora in the Finnish or Parallel mode, since currently the KWIC result cannot be obtained if all the corpora are selected. In the Finnish mode, the search works if you select all other corpora except “1990- ja 2000-luvun suomalaisia aikakaus- ja sanomalehtiä”, for example.

The search function works best when all of the selected corpora have similar annotations.

Choosing the language

The language of the Korp interface (Finnish, Swedish, English) can be chosen in the upper-left corner of the GUI.

These instructions describe the English-language interface.

Search types

Korp allows three types of searches: simple, extended and advanced. The type of search is selected by clicking on the respective tab above the search box. Simple search finds individual tokens. Extended search makes it possible to refer to several consecutive words and their attributes at once. In advanced search you can type a CQP query directly.

Word pictures can only be viewed for simple search (see below).

Simple search

In simple search you can search for a word form in the corpora by entering it in the search box. The search box has an auto-complete feature that suggests keywords together with their parts of speech in parentheses (note that this only works for POS-tagged corpora). The keywords displayed in grey are words that do not occur in any of the selected corpora. If a POS-tagged keyword in the list is selected, all word forms whose dictionary form and part of speech match those of the selected keyword will be included in the search results. The word picture feature only works when a POS-tagged keyword is selected, i.e. entering the keyword and its POS manually does not suffice.

The following options can be selected for simple search:

  • initial: also search for word forms beginning with the input string
  • final: also search for word forms ending in the input string
  • case-insensitive: ignore letter case, treat uppercase and lowercase letters identically
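
The effect of these options can be sketched in a few lines of Python. This is only an illustration of the matching behaviour described above, not Korp's actual implementation: the function name `matches` and its parameters are hypothetical, and the sketch assumes that each selected option widens the search, so an exact match always counts.

```python
def matches(word_form, query, initial=False, final=False, case_insensitive=False):
    """Return True if word_form would be matched by query under the given options.

    Hypothetical helper that illustrates the option semantics only.
    """
    if case_insensitive:
        # Treat uppercase and lowercase letters identically.
        word_form, query = word_form.lower(), query.lower()
    if word_form == query:
        return True
    if initial and word_form.startswith(query):
        # "initial": also accept word forms beginning with the input string.
        return True
    if final and word_form.endswith(query):
        # "final": also accept word forms ending in the input string.
        return True
    return False


words = ["talo", "Talo", "talossa", "kerrostalo"]
print([w for w in words if matches(w, "talo")])
print([w for w in words if matches(w, "talo", initial=True)])
print([w for w in words if matches(w, "talo", final=True, case_insensitive=True)])
```

With no options selected only the exact form talo is accepted; "initial" adds talossa, and "final" together with "case-insensitive" accepts talo, Talo and kerrostalo.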

Extended Search

Extended Search allows the user to search not only for individual word forms, but also for sequences of consecutive words. The values of the attributes of each keyword in the sequence can be defined individually.

Examples

The examples below are further explained on a separate page, including screen shots.

  • for all the occurrences of the verb olla (‘to be’), choose “base form”, “is” and then enter olla
  • for any inflected form of any adjective, choose “part-of-speech”, “is”, and “adjective”.
  • for all the occurrences of the illative forms of the word talo ‘house’ (taloon, taloihin, taloomme etc.) in a corpus annotated with the TDT (Turku Dependency Treebank) parser (in Finnish) (e.g. KLK):
    1. choose “base form”, “is” and type talo
    2. add a new search condition by clicking the plus sign (+) in the lower left corner of the search box
    3. choose “msd”, “contains”, then enter “Ill” (as in “illative case”) in the text box.

The attribute menu may contain many other attributes that can be selected to narrow down queries. The set of available search criteria varies from corpus to corpus. It often makes sense to search one corpus at a time, since this makes the search results more readable and less ambiguous. An alternative is to search only corpora whose annotations are of the same type.

The query can be supplemented with additional search criteria by clicking the plus sign (+) in the lower left corner of the keyword box (in which case the matched words should fulfil both criteria) or by clicking on the word “or” above the plus sign (in which case matched words have to satisfy either one of the criteria).

Sequences of consecutive words

Words can be added to the sequence by clicking on the plus sign (+) on the right side of the rightmost keyword box. You can define the search criteria and attributes for each word of the sequence individually. To remove a word from the sequence, click on the × sign in the upper-right corner of the keyword box.

Any element in the sequence can be repeated by clicking on the cogwheel symbol in the lower right corner of the keyword box, then selecting “Repeat” in the menu, and finally entering the range of the number of repeats allowed. For instance “Repeat 0 to 1 times” means that the element can occur in a matched sequence only once or not at all. The repeated element can also be an unspecified word, which allows the user to search for phrases where other words can occur between the keywords: this can be done by selecting “word” and “is” in the drop-down menus and leaving the search box empty.

Note that Advanced Search can slow down or even crash if the query is too complex and if a large corpus has been selected.

Tip: The query as specified on the Extended Search tab also appears automatically on the Advanced Search tab (see below) as a CQP query which can then be modified for more specific results.
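
To give an idea of what such a CQP query looks like, the sketch below assembles query strings for two of the examples above and for one made-up word sequence. It is an illustration under assumptions, not Korp's own code: the helper names `token` and `repeat` are hypothetical, and the attribute names `lemma` (for the base form) and `msd`, as well as the rendering of “contains” as the regular expression `.*Ill.*`, are assumed here and may differ between corpora. The query shown on the Advanced Search tab is the authoritative form.

```python
def token(*conditions):
    """Build one CQP token expression; all conditions must hold (joined with &)."""
    return "[" + " & ".join(conditions) + "]"


def repeat(token_expr, low, high):
    """Allow a token expression to repeat between low and high times."""
    return f"{token_expr}{{{low},{high}}}"


# All occurrences of the verb olla: base form is "olla".
olla = token('lemma = "olla"')

# Illative forms of talo: base form is "talo" and msd contains "Ill".
talo_illative = token('lemma = "talo"', 'msd = ".*Ill.*"')

# Made-up sequence: a form of olla, at most one unspecified word,
# then an illative form of talo.
sequence = " ".join([olla, repeat(token(), 0, 1), talo_illative])

print(olla)
print(talo_illative)
print(sequence)
```

An empty token expression `[]` stands for an unspecified word, so `[]{0,1}` corresponds to “Repeat 0 to 1 times” applied to a keyword box whose search field is left empty.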

Advanced Search

In advanced search, the search criteria and the keywords are expressed as a CQP query. You can e.g. search for dependencies in dependency-parsed corpora in ways not supported by the extended search.

More information on and examples of CQP queries

Descriptions of syntactic and morphological annotations of corpora

  • The set of tags used for annotation varies from corpus to corpus.
  • Some of the corpora have not been annotated at all, i.e. queries can only refer to their text content.
  • Most corpora available in Korp (such as KLK) have been annotated automatically by the Turku Dependency Treebank (TDT) parser (in Finnish). Note that the annotations may contain errors.
  • The FinnTreeBank corpus uses attributes and annotations that are slightly different from those produced by TDT.
  • The annotations used by other corpora are usually described in the corpus descriptions.

Search result views

You can choose the way results are displayed by clicking on one of the three tabs: “KWIC” (default), “Statistics” and “Word picture”.

KWIC (concordance)

The concordance view lists all sentences containing a match, with the matched sequence highlighted in bold text. The default format is the KWIC (Key Word in Context) concordance, where each sentence is displayed on its own line and the matched words are aligned vertically in the middle. The view can be scrolled horizontally if some of the sentences are long. Entire paragraphs can be viewed by clicking on the “Show context” link at the top of the concordance view. The matched words are highlighted as before, but they are not aligned vertically like in the KWIC view.

Each matched word in a sentence is listed as a separate result on its own line.

The total number of hits in the selected corpora is displayed at the top of the concordance view. The coloured horizontal bar next to the number illustrates the number of hits in each corpus. The name of the corpus and the number of hits can be viewed by hovering the cursor over a section of the bar. Clicking on the section takes you to the first page with results from that corpus. You can move to a specific page by clicking on the page numbers below the bar.

A word in a sentence can be highlighted by clicking on it with the mouse or by moving around with the arrow keys. Information about the properties of the highlighted word as well as the sentence and/or the text where it occurs is displayed in the info box on the right-hand side of the concordance view. With dependency-parsed data, the head word of the highlighted word is shown against a pink background.

Statistics

The statistics view shows the total number of occurrences for each matched word in the results as well as the number of occurrences in individual corpora. The results are sorted by the properties of the selected word or text. The numbers of occurrences are shown as relative frequencies per million tokens, a common measure in corpus linguistics, and (in parentheses) as absolute frequencies.

By clicking on the button in the Statistics view, you can also see the frequency data plotted as a Trend Diagram. The relative frequency shown in the Trend Diagram is always tied to a specific time period (e.g. year, month or day). It is calculated as the number of search results matching the time period divided by the number of tokens in all selected corpora, multiplied by one million. Note that token counts also include punctuation marks.

Corpora contain time information at different levels of granularity, e.g. some record the day on which a word was written, whereas others only specify the year. In the latter case the yearly frequency is mapped equally onto every day of the year.

Consider two corpora: Corpus A has 10 hits out of 10,000 tokens for the whole year 2016 and corpus B has 2 hits for 3 July 2016 out of 1,000 tokens and no hits for the other days of the year. In this case the Trend Diagram would be shown as follows:

For 3 July 2016:

Absolute hits: 12 (= 10 + 2)
Relative frequency: 1,090.91 (= (10 + 2) / (10,000 + 1,000) * 1,000,000)

For any other day in 2016 the result would be:

Absolute hits: 10
Relative frequency: 1,000.00 (= 10 / 10,000 * 1,000,000)
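
The calculation in this example can be reproduced with a short Python sketch. The data structures and the function names `covers` and `trend_point` are hypothetical and follow only the worked example above: a corpus dated by year contributes its hits and tokens to every day of that year, and a corpus dated by day contributes only to its own day. Korp's actual implementation may differ.

```python
from datetime import date

# Hypothetical corpus records taken from the example above.
# "period" is either a year (int) or an exact day (date).
CORPORA = [
    {"name": "A", "period": 2016, "hits": 10, "tokens": 10_000},
    {"name": "B", "period": date(2016, 7, 3), "hits": 2, "tokens": 1_000},
]


def covers(period, day):
    """True if a corpus period (year or exact day) applies to the given day."""
    if isinstance(period, date):
        return period == day
    return period == day.year


def trend_point(corpora, day):
    """Return (absolute hits, relative frequency per million tokens) for one day."""
    relevant = [c for c in corpora if covers(c["period"], day)]
    hits = sum(c["hits"] for c in relevant)
    tokens = sum(c["tokens"] for c in relevant)
    relative = hits / tokens * 1_000_000 if tokens else 0.0
    return hits, relative


for day in (date(2016, 7, 3), date(2016, 1, 1)):
    hits, relative = trend_point(CORPORA, day)
    print(day, hits, f"{relative:,.2f}")
```

For 3 July 2016 this gives 12 hits and a relative frequency of 1,090.91; for 1 January 2016 it gives 10 hits and 1,000.00, matching the figures above.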

Word picture

The word picture view shows the words most typically associated with the searched word based on their dependency relations in all the selected corpora. The typicality of words is not based directly on their frequency but on their mutual information, which is described in more detail below.

The word picture feature only works under the following conditions:

  • the selected corpora contain dependency annotations, and
  • you use the simple search and search for either a single word form or a word and its part of speech selected from the autocompletion list (the part of speech may not be typed manually).

Note that corpora have been parsed programmatically, so the dependency annotations also have errors, including parsing errors, errors caused by unrecognized and thus unlemmatized words, and errors caused by incorrectly recognized words, in which case the part of speech of the word may be incorrect.

The dependency relations shown in the word picture depend on the part of speech of the searched word. The most typical words for each different dependency relation (based on mutual information) appear in their own “box”: for example, for verbs, the subject, object and adverbial, and for nouns, the premodifier, postmodifier and the verbs as whose subject (“Word verb”) or object (“Verb word”) the word occurs. For each word, the word picture shows its number (rank), the word itself and the total absolute frequency of the word in the selected corpora.

The document icon following the absolute frequency is a link opening a new concordance result tab that shows the sentences containing the word in the dependency relation in question with the original search word. In the concordance, you can see what kinds of syntactic structures the word picture information is based on. The value of the mutual information underlying the word order is displayed as a tooltip by hovering the mouse cursor over the absolute frequency (e.g., “mi: 58.35”).

On the right side of the word picture view, you can select whether an (approximate) part of speech is shown for the words, as well as the maximum number of words to be displayed for each dependency relation.

In Korp’s word picture, mutual information measures how typically different words are in a certain dependency relation with the searched word, such as the subjects of the searched verb. If a word is a common subject for many verbs, its mutual information with regard to the searched verb may be smaller than that of another word which is not as common as the subject of the searched verb but which nevertheless occurs as the subject of the searched verb much more frequently than as the subject of other verbs.

The word picture uses a measure called Lexicographer’s Mutual Information (LMI). Compared with the usual mutual information measure, it seeks to reduce the weight of low-frequency words. Typically, the (lexicographer’s) mutual information is calculated based on the frequencies of the words in the entire corpus, and is used as one metric in searching for collocations (words that typically occur together; see e.g., here or here).

However, for the word picture of Korp, the mutual information LMI(A,B) between two words A and B is computed among the words in one box of the word picture (e.g., the subjects of a verb), that is, for one dependency relation type Rel as follows ((x, Rel, y) denotes word x in dependency relation Rel with word y):

n = the frequency of the dependency relations (x, Rel, y) corresponding to a “box” in the selected corpora (for any words x and y)
nA = the frequency of the dependency relations (A, Rel, y) in the selected corpora, that is, word A in relation Rel with any word y
nB = the frequency of the dependency relations (x, Rel, B) in the selected corpora, that is, any word x in relation Rel with word B
nAB = the frequency of the dependency relations (A, Rel, B) in the selected corpora: how many times word A occurs in relation Rel with word B

LMI(A,B) = nAB * log2 ((n * nAB) / (nA * nB))

(log2 is the base-two logarithm.)

Depending on the type of the dependency relation, either A or B may be the searched word and the other is another word appearing in a word picture box.
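
The formula can be written out as a small Python function. This is a minimal sketch of the definition given above, not Korp's own code: the function name `lmi`, the handling of a zero pair frequency, and all counts in the usage example are assumptions made for illustration.

```python
import math


def lmi(n, n_a, n_b, n_ab):
    """Lexicographer's Mutual Information as defined above.

    n    -- frequency of all relations (x, Rel, y)
    n_a  -- frequency of relations (A, Rel, y)
    n_b  -- frequency of relations (x, Rel, B)
    n_ab -- frequency of relations (A, Rel, B)
    """
    if n_ab == 0:
        # Assumption: a pair that never occurs together gets a score of 0.
        return 0.0
    return n_ab * math.log2((n * n_ab) / (n_a * n_b))


# Invented counts for one relation type, with B as the searched verb
# and A as a candidate subject word.
common_word = lmi(n=1_000_000, n_a=50_000, n_b=2_000, n_ab=100)
specific_word = lmi(n=1_000_000, n_a=500, n_b=2_000, n_ab=40)

print(common_word)
print(specific_word)
```

In this invented example the first word occurs with the searched verb more often (100 times against 40), but because it is a frequent subject of verbs in general it receives a lower score than the second word, which occurs mostly with the searched verb.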

Displaying search results

Below the search box are three drop-down menus for modifying the way results are displayed. The first two affect the concordance view:

  • hits per page: the number of hits displayed at a time (25–1000, 25 by default)
  • sort within corpora: sorts the results from each corpus in one of the following ways:
    • not sorted: the results are shown in the same order in which they appear in the corpus
    • matched word(s): sort by the matched words alphabetically
    • left context: sort results alphabetically by the left-hand context
    • right context: sort results alphabetically by the right-hand context
    • random: a random order (note that the order is random only within each selected corpus, not across corpora)

The third menu affects the statistics view. In this menu, you can select the attribute by which the statistics are compiled. The statistics are calculated for word forms by default, in which case the table shows the distribution of word forms in the results. By selecting e.g. “part-of-speech”, the user can view the number and distribution of different parts of speech in the results.
