Korp user guide

Korp is a Web-based tool that allows its user to search for keywords in text corpora (typically grammatically parsed) and to generate concordances. Korp gives its users access to extensive collections of texts in Finnish and Finland Swedish. Many of these corpora can be accessed without logging in whereas others are only accessible to logged-in users. In some cases individual access rights are also required (instructions for applying).

Please note that this manual has been written specifically for the version of Korp used by the Language Bank of Finland and may not be fully applicable to other versions. Korp is developed at Språkbanken of the University of Gothenburg in Sweden, and their Korp site contains mostly Swedish-language materials. The Norwegian version of Korp contains corpora in Norwegian and Saami. Språkbanken also provides instructions for its own version of Korp (in Swedish only); note that these instructions may not be entirely applicable to the Finnish version of Korp.

Start Korp

General information

The Korp graphical search interface can be accessed with a Web browser that has JavaScript enabled. Korp works best on Firefox and Chrome, whereas some of its functionality does not work on Internet Explorer.

Choosing the environment

You can switch between different modes by clicking on the links at the top of the Korp GUI. Different modes contain different types of corpora and have slightly different search features.

Four modes are available at the moment:

  • Finnish: the default environment; contains corpora and corpus samples of both written and spoken Modern Finnish as well as texts in Early Modern Finnish and Old Literary Finnish.
  • Swedish: texts primarily in Finland Swedish in addition to some Swedish-language fragments of parallel corpora.
  • Other languages: texts in Finno-Ugric languages and fragments of parallel corpora in various languages.
  • Parallel texts: parallel corpora, bilingual and multilingual text collections – each search result is shown with its translation in the parallel text

Selecting a corpus

To the right of the Korp logo is the corpus selection bar, which lets you select the corpora you want to search. It may say e.g.

4 of 923 corpora selected – 76.53M of 8,74G tokens

Clicking on the bar opens the corpus selection menu where you can select all the relevant corpora. The corpora are arranged hierarchically in a treelike structure. You can see the list of corpora in a branch by clicking on the triangles at the beginning of each line.

If you hover the cursor over the name of a corpus or a branch, an info box displaying the total number of sentences and tokens in the corpus or branch in question appears. (Note that the number of tokens also includes punctuation.)

The “Select all” button at the top of the menu selects all corpora listed in the menu. The “Select none” button clears all selections.

Please do not select all corpora in the Finnish or Parallel mode, since currently the KWIC result cannot be obtained if all the corpora are selected. In the Finnish mode, the search works if you select all other corpora except “1990- ja 2000-luvun suomalaisia aikakaus- ja sanomalehtiä”, for example.

The search function works best when all of the selected corpora have similar annotations.

Choosing the language

The language of the Korp interface (Finnish, Swedish, English) can be chosen in the upper-left corner of the GUI.

These instructions describe the English-language interface.

Search types

Korp allows three types of searches: simple, extended and advanced. The type of search is selected by clicking on the respective tab above the search box. Simple search finds individual tokens. Extended search makes it possible to refer to several consecutive words and their attributes at once. In advanced search you can type a CQP query directly.

Word pictures can only be viewed for simple search (see below).

Simple search

In simple search you can search for a word form in the corpora by entering it in the search box. The search box has an auto-complete feature that suggests keywords together with their parts of speech in parentheses (note that this only works for POS-tagged corpora). The keywords displayed in grey are words that do not occur in any of the selected corpora. If a POS-tagged keyword in the list is selected, all word forms whose dictionary form and part of speech match those of the selected keyword will be included in the search results. The word picture feature only works when a POS-tagged keyword is selected from the list, i.e. typing the keyword and its POS manually does not suffice.

The following options can be selected for simple search:

  • initial: also search for word forms beginning with the input string
  • final: also search for word forms ending in the input string
  • case-insensitive: ignore letter case, treat uppercase and lowercase letters identically
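
The effect of these options can be illustrated with a small Python sketch. It is not Korp's implementation: the function name `matches_simple` is hypothetical, and the behaviour when both "initial" and "final" are selected (matching the input string anywhere inside a word form) is an assumption.

```python
def matches_simple(word_form, query, initial=False, final=False,
                   case_insensitive=False):
    """Return True if word_form would match query under the given options."""
    if case_insensitive:
        # casefold() normalises letter case for caseless comparison
        word_form, query = word_form.casefold(), query.casefold()
    if initial and final:
        # Assumption: both options together match the string anywhere
        return query in word_form
    if initial:
        return word_form.startswith(query)
    if final:
        return word_form.endswith(query)
    return word_form == query


print(matches_simple("talossa", "talo", initial=True))    # True
print(matches_simple("kerrostalo", "talo", final=True))   # True
print(matches_simple("Talo", "talo"))                     # False
```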

Extended Search

Extended Search allows the user to search not only for individual word forms, but also for sequences of consecutive words. The values of the attributes of each keyword in the sequence can be defined individually.

Examples

The examples below are further explained on a separate page, including screen shots.

  • for all the occurrences of the verb olla (‘to be’), choose “base form”, “is” and then enter olla
  • for any inflected form of any adjective, choose “part-of-speech”, “is”, and “adjective”.
  • for all the occurrences of the illative forms of the word talo ‘house’ (taloon, taloihin, taloomme etc.) in a corpus annotated with the TDT (Turku Dependency Treebank) parser (in Finnish) (e.g. KLK):
    1. choose “base form”, “is” and type talo
    2. add a new search condition by clicking the plus sign (+) in the lower left corner of the search box
    3. choose “msd”, “contains”, then enter “Ill” (as in “illative case”) in the text box.

The attribute menu may contain many other attributes that can be selected to narrow down queries. The set of available search criteria varies from corpus to corpus. It often makes sense to search from one corpus at a time since it makes the search results more readable and less ambiguous. An alternative is to only search from corpora whose annotations are of the same type.

The query can be supplemented with additional search criteria by clicking the plus sign (+) in the lower left corner of the keyword box (in which case the matched words should fulfil both criteria) or by clicking on the word “or” above the plus sign (in which case matched words have to satisfy either one of the criteria).
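
How conditions combine can be sketched as the construction of a CQP token expression, in which `&` joins criteria that must all hold and `|` joins alternatives. The helper names below are hypothetical, and the attribute names (`lemma`, `msd`, `word`) as well as the rendering of “contains” as a regular expression are assumptions; the Advanced Search tab shows the exact query that Korp generates for a given corpus.

```python
def condition(attribute, operator, value):
    """Render one search condition as a CQP attribute test.

    The value is inserted verbatim, so it must not contain double
    quotes or regular-expression metacharacters.
    """
    if operator == "is":
        return f'{attribute} = "{value}"'
    if operator == "contains":
        # Assumption: "contains" is expressed as a substring regex
        return f'{attribute} = ".*{value}.*"'
    raise ValueError(f"unsupported operator: {operator}")


def all_of(*conditions):
    """Criteria added with the plus sign: every one must hold."""
    return "(" + " & ".join(conditions) + ")"


def any_of(*conditions):
    """Criteria added with 'or': at least one must hold."""
    return "(" + " | ".join(conditions) + ")"


def token(expression):
    """Wrap an expression in the brackets that denote one token."""
    return f"[{expression}]"


# The illative forms of 'talo' from the example list above
print(token(all_of(condition("lemma", "is", "talo"),
                   condition("msd", "contains", "Ill"))))
# [(lemma = "talo" & msd = ".*Ill.*")]
```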

Sequences of consecutive words

Words can be added to the sequence by clicking on the plus sign (+) on the right side of the rightmost keyword box. You can define the search criteria and attributes for each word of the sequence individually. To remove a word from the sequence, click on the × sign in the upper-right corner of the keyword box.

Any element in the sequence can be repeated by clicking on the cogwheel symbol in the lower right corner of the keyword box, then selecting “Repeat” in the menu, and finally entering the range of the number of repeats allowed. For instance “Repeat 0 to 1 times” means that the element can occur in a matched sequence only once or not at all. The repeated element can also be an unspecified word, which allows the user to search for phrases where other words can occur between the keywords: this can be done by selecting “word” and “is” in the drop-down menus and leaving the search box empty.
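
In CQP, a repetition range is written in braces after a token, and an unspecified word is an empty pair of brackets. The sketch below builds such a sequence as a string; the helper names are hypothetical, and the attribute name `word` and the example words are assumptions chosen for illustration.

```python
ANY_WORD = "[]"  # an unspecified word: a token with no conditions


def repeat(token_expr, minimum, maximum):
    """Allow token_expr to occur between minimum and maximum times."""
    if minimum < 0 or maximum < minimum:
        raise ValueError("expected 0 <= minimum <= maximum")
    return f"{token_expr}{{{minimum},{maximum}}}"


def sequence(*tokens):
    """Consecutive tokens are written one after another."""
    return " ".join(tokens)


# 'ei' followed by 'ole', with at most one other word in between
print(sequence('[word = "ei"]', repeat(ANY_WORD, 0, 1), '[word = "ole"]'))
# [word = "ei"] []{0,1} [word = "ole"]
```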

Note that Extended Search can slow down or even crash if the query is too complex and a large corpus has been selected.

Tip: The query as specified on the Extended Search tab also appears automatically on the Advanced Search tab (see below) as a CQP query which can then be modified for more specific results.

Advanced Search

In advanced search, the search criteria and the keywords are expressed as a CQP query. You can e.g. search for dependencies in dependency-parsed corpora in ways not supported by the extended search.

More information on and examples of CQP queries

Descriptions of syntactic and morphological annotations of corpora

  • The set of tags used for annotation varies from corpus to corpus.
  • Some of the corpora have not been annotated at all, i.e. queries can only refer to their text content.
  • Most corpora available in Korp (such as KLK) have been annotated automatically by the Turku Dependency Treebank (TDT) parser (in Finnish). Note that the annotations may contain errors.
  • The FinnTreeBank corpus uses attributes and annotations that are slightly different from those produced by TDT.
  • The annotations used by other corpora are usually described in the corpus descriptions.

Search result views

You can choose the way results are displayed by clicking on one of the three tabs: “KWIC” (default), “Statistics” and “Word picture”.

KWIC (concordance)

The concordance view lists all sentences containing a match, with the matched sequence highlighted in bold text. The default format is the KWIC (Key Word in Context) concordance, where each sentence is displayed on its own line and the matched words are aligned in the middle, directly above one another. The view can be scrolled horizontally if some of the sentences are long. Entire paragraphs can be viewed by clicking on the “Show context” link at the top of the concordance view. The matched words are highlighted as before, but they are not aligned vertically as in the KWIC view.
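
The KWIC layout can be illustrated with a minimal Python sketch that pads the left-hand context so that every match starts in the same column. The function name and the `(left, match, right)` representation of a hit are hypothetical and are not taken from Korp.

```python
def format_kwic(hits, width=30):
    """Format (left, match, right) triples as aligned KWIC lines.

    width is the number of characters kept on each side of the match
    and is assumed to be positive.
    """
    lines = []
    for left, match, right in hits:
        # Keep the part of the left context nearest to the match and
        # right-align it so that all matches start in the same column
        left = left[-width:].rjust(width)
        lines.append(f"{left}  {match}  {right[:width]}")
    return lines


hits = [
    ("Menin", "taloon", "eilen"),
    ("Hän asuu isossa", "talossa", "nyt"),
]
for line in format_kwic(hits, width=15):
    print(line)
```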

Each matched word in a sentence is listed as a separate result on its own line.

The total number of hits in the selected corpora is displayed at the top of the concordance view. The coloured horizontal bar next to the number illustrates the number of hits in each corpus. The name of the corpus and the number of hits can be viewed by hovering the cursor over a section of the bar. Clicking on the section takes you to the first page with results from that corpus. You can move to a specific page by clicking on the page numbers below the bar.

A word in a sentence can be highlighted by clicking on it with the mouse or by moving around with the arrow keys. Information about the properties of the highlighted word as well as the sentence and/or the text where it occurs is displayed in the info box on the right-hand side of the concordance view. With dependency-parsed data, the head word of the highlighted word is shown against a pink background.

Statistics

The statistics view shows the total number of occurrences for each matched word in the results as well as the number of occurrences in individual corpora. The results are sorted by the properties of the selected word or text. Occurrences are shown as relative frequencies per million tokens, a common measure in corpus linguistics, and (in parentheses) as absolute frequencies.
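
The relationship between the two figures can be sketched in Python as follows. The function names and the exact cell layout are hypothetical; only the per-million formula comes from the description above.

```python
def per_million(hits, tokens):
    """Relative frequency: hits per one million tokens."""
    if tokens <= 0:
        raise ValueError("the corpus must contain at least one token")
    return hits * 1_000_000 / tokens


def format_cell(hits, tokens):
    """Relative frequency followed by the absolute count in parentheses."""
    return f"{per_million(hits, tokens):.1f} ({hits})"


print(format_cell(5, 2_000_000))  # 2.5 (5)
```

Relative frequencies make corpora of different sizes comparable: 5 hits in a corpus of two million tokens and 50 hits in a corpus of twenty million tokens both correspond to 2.5 hits per million.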

The relative frequency shown in the Trend Diagram is always tied to a specific time period (e.g. year, month or day). It is calculated as the number of hits matching the time period divided by the number of tokens that the selected corpora contain for that period, multiplied by one million. Note that the token counts also include punctuation marks.

Corpora record time information at different granularities: some register the day on which a word was written, others only specify the year. In the latter case the yearly frequency is mapped equally onto every day of the year.

Consider two corpora: corpus A has 10 hits out of 10 000 tokens for the whole year 2016, and corpus B has 2 hits out of 1000 tokens on 3.7.2016 and no hits on the other days of the year. In this case the Trend Diagram would show the following:

For 3.7.2016:

Absolute hits: 12 (= 10 + 2)
Relative frequency: 1090.91 (= (10 + 2) / (10 000 + 1000) * 1 000 000)

For any other day in 2016:

Absolute hits: 10
Relative frequency: 1000.00 (= 10 / 10 000 * 1 000 000)
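
The worked example above can be reproduced with a short Python sketch. The record layout (a `period` that is either a year or a single date, with the hits and tokens dated to it) and the function names are assumptions made for illustration and do not describe Korp's internal data model.

```python
from datetime import date


def covers(period, day):
    """A dated record covers only its own day; a yearly record covers
    every day of that year."""
    if isinstance(period, date):
        return period == day
    return period == day.year


def daily_trend(records, day):
    """Return (absolute hits, relative frequency per million) for a day."""
    matching = [r for r in records if covers(r["period"], day)]
    hits = sum(r["hits"] for r in matching)
    tokens = sum(r["tokens"] for r in matching)
    relative = hits * 1_000_000 / tokens if tokens else 0.0
    return hits, round(relative, 2)


records = [
    {"period": 2016, "hits": 10, "tokens": 10_000},              # corpus A
    {"period": date(2016, 7, 3), "hits": 2, "tokens": 1_000},    # corpus B
]

print(daily_trend(records, date(2016, 7, 3)))  # (12, 1090.91)
print(daily_trend(records, date(2016, 1, 1)))  # (10, 1000.0)
```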

Word picture

The word picture view shows the words most commonly associated with the keyword by dependency in all of the selected corpora. The “commonness” of a word does not derive directly from its frequency but from a statistical measure known as mutual information.
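
This guide does not give the exact formula that Korp uses. As an illustration of why such a measure differs from raw frequency, the sketch below computes pointwise mutual information, one common variant, which may not be the variant used by Korp. The function and parameter names are hypothetical.

```python
import math


def pmi(pair_count, keyword_count, other_count, total):
    """Pointwise mutual information (in bits) of a keyword and a word
    linked to it by a dependency relation.

    pair_count: how often the two words occur together in the relation
    keyword_count, other_count: how often each word occurs on its own
    total: the total number of relation instances considered
    """
    if min(pair_count, keyword_count, other_count, total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(pair_count * total / (keyword_count * other_count))


# A rare word that co-occurs 10 times scores higher than a very frequent
# word that co-occurs 50 times, because the latter is expected by chance.
print(pmi(10, 100, 100, 10_000) > pmi(50, 100, 5_000, 10_000))  # True
```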

The word picture can only be viewed if

  • dependency analysis has been performed on the selected corpus, and
  • you use the simple search and either type a single word form or choose a POS-tagged keyword from the auto-completion drop-down menu (i.e. do not type in the keyword and POS manually).

Displaying search results

Below the search box are three drop-down menus for modifying the way results are displayed. The first two affect the concordance view:

  • hits per page: the number of hits displayed at a time (25–1000, 25 by default)
  • sort within corpora: sorts the results from each corpus in one of the following ways:
    • not sorted: the results are shown in the same order in which they appear in the corpus
    • matched word(s): sort by the matched words alphabetically
    • left context: sort results alphabetically by the left-hand context
    • right context: sort results alphabetically by the right-hand context
    • random: a random order (note that the order is random only within each selected corpus, not across corpora)
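
The sorting options can be sketched in Python as follows. The hit representation (a dictionary with `corpus`, `left`, `match` and `right` keys) and the function name are hypothetical. The sketch compares whole context strings without regard to letter case, which is a simplification: it does not follow Finnish or Swedish alphabetical order and may differ from the way Korp compares contexts.

```python
import random

SORT_KEYS = {
    "matched word(s)": lambda hit: hit["match"].casefold(),
    "left context": lambda hit: hit["left"].casefold(),
    "right context": lambda hit: hit["right"].casefold(),
}


def sort_hits(hits, order="not sorted", seed=None):
    """Reorder hits within each corpus; the corpus order is preserved."""
    if order not in SORT_KEYS and order not in ("not sorted", "random"):
        raise ValueError(f"unknown sort order: {order}")
    # Group hits by corpus, keeping corpora in order of first appearance
    by_corpus = {}
    for hit in hits:
        by_corpus.setdefault(hit["corpus"], []).append(hit)
    rng = random.Random(seed)
    result = []
    for corpus_hits in by_corpus.values():
        if order == "random":
            rng.shuffle(corpus_hits)  # random only within this corpus
        elif order in SORT_KEYS:
            corpus_hits.sort(key=SORT_KEYS[order])
        result.extend(corpus_hits)
    return result
```

Because each corpus is reordered separately, all hits from the first corpus still precede all hits from the second one, whichever option is chosen.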

The third menu affects the statistics view. In this menu, you can select the attribute by which the statistics are compiled. The statistics are calculated for word forms by default, in which case the table shows the distribution of word forms in the results. By selecting e.g. “part-of-speech”, the user can view the number and distribution of different parts of speech in the results.
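
Compiling statistics by an attribute amounts to counting the matched tokens grouped by the value of that attribute. The following Python sketch illustrates the idea; the token representation and the attribute names `word` and `pos` are assumptions made for illustration.

```python
from collections import Counter


def distribution(matched_tokens, attribute="word"):
    """Count the matched tokens by the value of one attribute."""
    return Counter(token[attribute] for token in matched_tokens)


matched = [
    {"word": "taloon", "pos": "N"},
    {"word": "talossa", "pos": "N"},
    {"word": "taloon", "pos": "N"},
]
print(distribution(matched))          # Counter({'taloon': 2, 'talossa': 1})
print(distribution(matched, "pos"))   # Counter({'N': 3})
```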


Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4140599 / +358 29 4129317