Korp is a Web-based tool that allows users to search for keywords in text corpora (typically grammatically parsed) and to generate concordances. Korp gives its users access to extensive collections of texts in Finnish and Finland Swedish. Many of these corpora can be accessed without logging in, whereas others are only accessible to logged-in users. In some cases individual access rights are also required (see the instructions for applying).
Please note that this manual has been written specifically for the version of Korp used by the Language Bank of Finland and may not be fully applicable to other versions. Korp is developed at Språkbanken of the University of Gothenburg in Sweden, and their Korp site contains mostly Swedish-language materials. The Norwegian version of Korp contains corpora in Norwegian and Saami. Instructions for Korp as used by the Swedish Språkbanken are also available (in Swedish only; note that these instructions may not be entirely applicable to the Finnish version of Korp).
The Korp graphical search interface can be accessed by a Web browser that has JavaScript enabled. Korp works best on Firefox and Chrome, whereas some of its functionality does not work on Internet Explorer.
You can switch between different modes by clicking on the links at the top of the Korp GUI. Different modes contain different types of corpora and have slightly different search features.
Four modes are available at the moment:
To the right of the Korp logo is the corpus selection bar, which lets you select the corpora you want to search. It may say e.g.
4 of 923 corpora selected – 76.53M of 8,74G tokens
Clicking on the bar opens the corpus selection menu where you can select all the relevant corpora. The corpora are arranged hierarchically in a treelike structure. You can see the list of corpora in a branch by clicking on the triangles at the beginning of each line.
If you hover the cursor over the name of a corpus or a branch, an info box displaying the total number of sentences and tokens in the corpus or branch in question appears. (Note that the number of tokens also includes punctuation.)
The “Select all” button at the top of the menu selects all corpora listed in the menu. The “Select none” button clears all selections.
Please do not select all corpora in the Finnish or Parallel mode, since currently the KWIC result cannot be obtained if all the corpora are selected. In the Finnish mode, the search works if you select all other corpora except “1990- ja 2000-luvun suomalaisia aikakaus- ja sanomalehtiä”, for example.
The search function works best when all of the selected corpora have similar annotations.
The language of the Korp interface (Finnish, Swedish, English) can be chosen in the upper-left corner of the GUI.
These instructions describe the English-language interface.
Korp allows three types of searches: simple, extended and advanced. The type of search is selected by clicking on the respective tab above the search box. Simple search finds individual tokens. Extended search makes it possible to refer to several consecutive words and their attributes at once. In advanced search you can type a CQP query directly.
Word pictures can only be viewed for simple search (see below).
In simple search you can search for a word form in the corpora by entering it in the search box. The search box has an auto-complete feature that suggests keywords together with their parts of speech in parentheses (note that this only works for POS-tagged corpora). The keywords displayed in grey are words that do not occur in any of the selected corpora. If a POS-tagged keyword in the list is selected, all word forms whose dictionary form and part of speech match those of the selected keyword will be included in the search results. The word picture feature only works when a POS-tagged keyword is selected from the list, i.e. entering the keyword and its POS manually does not suffice.
The following options can be selected for simple search:
Extended Search allows the user to search not only for individual word forms, but also for sequences of consecutive words. The values of the attributes of each keyword in the sequence can be defined individually.
The examples below are further explained on a separate page, including screen shots.
The attribute menu may contain many other attributes that can be selected to narrow down queries. The set of available search criteria varies from corpus to corpus. It often makes sense to search one corpus at a time, since this makes the search results more readable and less ambiguous. An alternative is to search only corpora whose annotations are of the same type.
The query can be supplemented with additional search criteria by clicking the plus sign (+) in the lower left corner of the keyword box (in which case the matched words should fulfil both criteria) or by clicking on the word “or” above the plus sign (in which case matched words have to satisfy either one of the criteria).
Words can be added in the sequence of words by clicking on the plus sign (+) on the right side of the rightmost keyword box. You can define the search criteria and attributes for each word of the sequence individually. To remove a word from the sequence, click on the × sign in the upper-right corner of the keyword box.
Any element in the sequence can be repeated by clicking on the cogwheel symbol in the lower right corner of the keyword box, then selecting “Repeat” in the menu, and finally entering the range of the number of repeats allowed. For instance “Repeat 0 to 1 times” means that the element can occur in a matched sequence only once or not at all. The repeated element can also be an unspecified word, which allows the user to search for phrases where other words can occur between the keywords: this can be done by selecting “word” and “is” in the drop-down menus and leaving the search box empty.
Note that Advanced Search can slow down or even crash if the query is too complex and if a large corpus has been selected.
Tip: The query as specified on the Extended Search tab also appears automatically on the Advanced Search tab (see below) as a CQP query which can then be modified for more specific results.
In advanced search, the search criteria and the keywords are expressed as a CQP query. You can e.g. search for dependencies in dependency-parsed corpora in ways not supported by the extended search.
More information on and examples of CQP queries
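The extended-search concepts described earlier (criteria combined with "and", alternative criteria joined with "or", and repeated or unspecified words) correspond to standard CQP constructs. The following Python sketch is only an illustration of that correspondence. The helper names `token` and `sequence` are hypothetical, and the attribute names and values (`word`, `lemma`, `pos`, "talo", "N") are assumptions that depend on the annotations of the selected corpus. The query that Korp itself generates on the Advanced Search tab may be formatted differently.

```python
def token(*alternatives, repeat=None):
    """Render one keyword box as a CQP token expression (illustrative only).

    Each alternative is a dict of attribute -> value; the criteria inside
    one dict are joined with '&' (all must hold), and several dicts are
    joined with '|' (any one may hold). Calling token() with no criteria
    gives the unspecified token []. Values are treated by CQP as regular
    expressions, so special characters would need escaping.
    """
    groups = []
    for criteria in alternatives:
        parts = [f'{attr} = "{value}"' for attr, value in criteria.items()]
        groups.append(" & ".join(parts))
    if len(groups) > 1:
        body = " | ".join(f"({g})" for g in groups)
    else:
        body = "".join(groups)
    expr = f"[{body}]"
    if repeat is not None:
        low, high = repeat
        expr += f"{{{low},{high}}}"  # e.g. {0,1} for "Repeat 0 to 1 times"
    return expr


def sequence(*tokens):
    """Join token expressions into a query for consecutive words."""
    return " ".join(tokens)


# Hypothetical example: a noun with the dictionary form "talo", optionally
# followed by any one word, followed by the word form "on" or "oli".
query = sequence(
    token({"lemma": "talo", "pos": "N"}),
    token(repeat=(0, 1)),
    token({"word": "on"}, {"word": "oli"}),
)
print(query)
```

A string built this way could be pasted into the advanced search box, provided that the selected corpora actually have the attributes used in the query.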
You can choose the way results are displayed by clicking on one of the three tabs: “KWIC” (default), “Statistics” and “Word picture”.
The concordance view lists all sentences containing a match, with the matched sequence highlighted in bold text. The default format is the KWIC (Key Word in Context) concordance, where each sentence is displayed on its own line and the matched words are aligned vertically in the middle. The view can be scrolled horizontally if some of the sentences are long. Entire paragraphs can be viewed by clicking on the “Show context” link at the top of the concordance view. The matched words are highlighted as before, but they are not aligned vertically as in the KWIC view.
Each matched word in a sentence is listed as a separate result on its own line.
The total number of hits in the selected corpora is displayed at the top of the concordance view. The coloured horizontal bar next to the number illustrates the number of hits in each corpus. The name of the corpus and the number of hits can be viewed by hovering the cursor over a section of the bar. Clicking on the section takes you to the first page with results from that corpus. You can move to a specific page by clicking on the page numbers below the bar.
A word in a sentence can be highlighted by clicking on it with the mouse or by moving around with the arrow keys. Information about the properties of the highlighted word as well as the sentence and/or the text where it occurs is displayed in the info box on the right-hand side of the concordance view. With dependency-parsed data, the head word of the highlighted word is shown against a pink background.
The statistics view shows the total number of occurrences for each matched word in the results as well as the number of occurrences in individual corpora. The results are sorted by the properties of the selected word or text. The numbers of occurrences are shown as relative frequencies per million tokens, a common measure in corpus linguistics, and (in parentheses) as absolute frequencies.
The relative frequency shown in the Trend Diagram is always tied to a specific time period (e.g. year, month or day). It is calculated as the number of hits matching the time period, divided by the number of tokens of all selected corpora, multiplied by one million. Note that token counts also include punctuation marks.
Corpora record time information at different granularities: some register the day on which a word was written, whereas others specify only the year. In the latter case the yearly frequency is mapped equally onto every day of the year.
Consider two corpora: corpus A has 10 hits out of 10 000 tokens for the whole year 2016, and corpus B has 2 hits out of 1000 tokens on 3.7.2016 and no hits on the other days of the year. In this case the Trend Diagram would show the following:
For 3.7.2016:
Absolute hits: 12 (= 10 + 2)
Relative frequency: 1090.91 (= (10 + 2) / (10 000 + 1000) * 1 000 000)
For any other day in 2016:
Absolute hits: 10
Relative frequency: 1000.00 (= 10 / 10 000 * 1 000 000)
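The arithmetic of this example can be reproduced with a short Python sketch. It is not Korp's implementation: the names `Record`, `covers` and `trend_point` are hypothetical, and the sketch assumes, as the worked example implies, that a corpus dated only by year contributes its full hit and token counts to every day of that year, while a corpus dated by day contributes only on that day.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """Hits and tokens of one corpus for one time period (hypothetical)."""
    hits: int
    tokens: int
    year: int
    day: Optional[date] = None  # None: only the year is known


def covers(record: Record, day: date) -> bool:
    """A day-level record covers one day; a year-level record covers
    every day of its year (assumption taken from the worked example)."""
    if record.day is not None:
        return record.day == day
    return record.year == day.year


def trend_point(records, day: date):
    """Return (absolute hits, relative frequency per million tokens)."""
    active = [r for r in records if covers(r, day)]
    hits = sum(r.hits for r in active)
    tokens = sum(r.tokens for r in active)
    relative = hits * 1_000_000 / tokens if tokens else 0.0
    return hits, relative


records = [
    Record(hits=10, tokens=10_000, year=2016),                      # corpus A
    Record(hits=2, tokens=1_000, year=2016, day=date(2016, 7, 3)),  # corpus B
]

for d in (date(2016, 7, 3), date(2016, 7, 4)):
    hits, relative = trend_point(records, d)
    print(d.isoformat(), hits, f"{relative:.2f}")
```

For 3.7.2016 the sketch gives 12 hits and a relative frequency of 1090.91; for any other day in 2016 it gives 10 hits and 1000.00, matching the figures in the example.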
The word picture view shows the words most commonly associated with the keyword by dependency in all of the selected corpora. The “commonness” of a word does not derive directly from its frequency but from a statistical measure known as mutual information.
The word picture can only be viewed if
Below the search box are three drop-down menus for modifying the way results are displayed. The first two affect the concordance view:
The third menu affects the statistics view. In this menu, you can select the attribute by which the statistics are compiled. The statistics are calculated for word forms by default, in which case the table shows the distribution of word forms in the results. By selecting e.g. “part-of-speech”, the user can view the number and distribution of different parts of speech in the results.