Instructions for moving from Lemmie to Korp

The corpora that were previously in the WWW-Lemmie corpus tool are now available in the Korp corpus search interface of the Language Bank of Finland. This document intends to ease the transition from Lemmie to Korp by comparing and contrasting the two tools. Please refer to the Korp user guide for more information on Korp.

Even though Korp does not currently offer all the features of Lemmie, it has some additional features (such as a trend diagram), allows building many queries visually and searches faster than Lemmie. If Korp does not satisfy your corpus processing needs or if you find it inconvenient, please tell us what you are missing from Lemmie, so that we can better prioritize our Korp development efforts. Some new features for Korp are already being planned.

Korp is being developed mainly at Språkbanken (the Swedish Language Bank) of the University of Gothenburg, and they have their own Korp site. Unless otherwise noted, this document refers to the Korp installation of the Language Bank of Finland (Kielipankki), hosted at CSC. Korp’s user interface is localized into Finnish, Swedish and English. This document mostly refers to the English texts.

Accessing the corpora

In order to use the FTC and FSTC corpora in Korp, you need to have access rights to them and to log in to Korp. Even if you already have access rights to the corpora, you may need to reapply for them in Language Bank Rights. You log in to Korp via the “Log in” link in the cog menu in the upper right corner of the Korp window. Once you have logged in, this is indicated in the upper right corner of the Korp window. If you have logged in but Korp’s corpus selector still shows a lock icon instead of a checkbox before the name of the corpus, you lack the required access rights.

Korp currently has four different modes with different selections of corpora. You change the active mode via the links in the top left corner of the Korp window. FTC is in the Finnish mode (as “Suomen kielen tekstikokoelma”), whereas FSTC (as “Finlandssvenska textkorpus (UHLCS)”) and Svenska Parole are in the Swedish mode. The Other languages mode will contain the Helsinki Corpus of Swahili 2 (HCS2), which contains more data than the Swahili corpora in Lemmie.

To select the corpora to use, open the corpus selection menu by clicking the corpus selection bar on the right side of the Korp logo. You probably should first clear the default selections by clicking “Select none”. The corpora are arranged hierarchically in a treelike folder structure. You can select all the subcorpora in a folder (branch), such as “Suomen kielen tekstikokoelma” (FTC), by ticking the checkbox in front of the name, or you can open the folder and select individual corpora or sub-folders. From a technical point of view, the leaf nodes are corpora for Korp, and the folders are corpus collections, even though they may constitute logical corpora.

Korp does not currently remember the corpus selection or other settings between sessions. However, you can copy the Korp URL, which contains the corpus selection and query information, and paste it to a text document. By pasting the URL back to the browser address bar later, you can restore (most of) Korp’s state. Note that if you had selected corpora requiring individual access rights, you need to be logged in to Korp first for the URL to work.

Alternatively, if you have already logged in to Korp and have the appropriate access rights, you can use the following direct links to select all the corpora in the collections (using the Korp interface in English):

FTC: https://korp.csc.fi/#?corpus=ftc&lang=en
FSTC: https://korp.csc.fi/?mode=swedish#?corpus=fstc&lang=en
Svenska Parole: https://korp.csc.fi/?mode=swedish#?corpus=parole_sv&lang=en
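
The three links above share one pattern: an optional mode parameter, followed by a fragment that carries the corpus id and the user interface language. As a minimal sketch, assuming that this pattern also holds for other collections (the helper name korp_link is hypothetical and not part of Korp), such links could be generated as follows:

```python
BASE_URL = "https://korp.csc.fi/"


def korp_link(corpus, mode=None, lang="en"):
    """Build a direct Korp link in the same shape as the links listed above.

    corpus: corpus (collection) id as it appears in the links, e.g. "ftc"
    mode:   None for links without a mode parameter (as for FTC),
            or a mode name such as "swedish"
    lang:   user interface language code
    """
    mode_part = f"?mode={mode}" if mode else ""
    return f"{BASE_URL}{mode_part}#?corpus={corpus}&lang={lang}"


print(korp_link("ftc"))
print(korp_link("fstc", mode="swedish"))
```

The links still require that you have logged in and have the access rights, as described above.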

Comparison of the features of Lemmie and Korp

User-added and user-corrected features in corpora

Korp does not have any facility comparable to the user addition files of Lemmie, and it probably never will. If you wish to correct errors in the corpus annotations or add annotations of your own, you need to download query results, correct or augment them and process them further with other tools.

Search dialog

Korp has three different search modes, which differ in the way the corpus query is specified: simple, extended and advanced. Each of them produces the results as a KWIC concordance and frequency statistics.

Note that Korp does not currently compute collocation tables. Support for computing collocations has been requested, but so far we have no schedule for implementing the feature.

KWIC concordance result

In Korp, the context of the match in the KWIC view is the whole sentence, with the words matching the query in bold (called “node” in Lemmie), aligned at the centre. The KWIC can be scrolled horizontally. If a sentence contains multiple matches, each of them is shown as a separate line.

Unlike Lemmie, Korp does not show a line number for the KWIC lines. The source corpus name is shown on top of the KWIC result and whenever the source corpus changes within a result page.

Korp’s concordance only shows the word forms of each sentence: you cannot currently choose any other attribute (such as the base form) to be shown. You can click on a word to show the features of the word and the containing text (document) in the side bar on the right side of the concordance. You can also navigate in the concordance with arrow keys.

Showing match context

You can see the whole paragraphs containing the matches in the concordance by clicking “Show context” on top of the KWIC result. In contrast to Lemmie, this does not open a separate window for the paragraph context but changes the presentation of the KWIC result view. Unlike in Lemmie, you cannot currently view the whole document (or “division of text”). If you need the whole document view, please request the feature.

Sorting the concordance

Korp allows sorting the concordance only by the matching words (tokens), the left context or the right context (by the three words to the left or right of the match, respectively). Sorting is always based on the word form (“sort feature” in Lemmie) and sort direction is always ascending. As in Lemmie, sorting is case-sensitive. You need to choose the sort option before making a search.

The default order is “unsorted”, meaning that the matches come in the KWIC in the order they appear in the corpus. The fifth option is “random”, which orders the result randomly within each corpus.

Paging the result

You can choose the number of hits shown in the KWIC view at a time (by default 25, maximum 1000). You can change the page using the pager above the concordance. The result is sorted within each corpus, not only within the result page (as is the case with Lemmie’s “result chunks”).

Downloading the concordance result

You can download the KWIC result page as an Excel spreadsheet (“Annot” for line per token, “Ref” for sentence per token), a NooJ file or a JSON file. To download the result, click the corresponding icon (or choose from a drop-down list, depending on the Korp version) at the bottom right corner of the KWIC result. An XML format is currently not available, but more download formats may be added later. Note that only the result page currently being shown can be downloaded, not all the results.

Unlike Lemmie, Korp does not have access to your home directory on CSC servers, so you cannot save results there.

Printing the concordance result

Korp has no separate print command, but you can use the printing functionality of your Web browser for the KWIC result. In the KWIC view, long lines will be truncated in the printout.

Unfortunately, the statistics result cannot be printed.

Frequency results

Korp shows the relative and absolute frequencies of the values of a chosen attribute in the matches found by the query. The attribute is chosen in the “compile based on” drop-down menu. The first frequency column (“Total”) contains the joint frequencies for all the chosen corpora, and the subsequent columns contain the frequencies for each individual corpus. In this respect, the Korp statistics result resembles Lemmie’s split frequency results, but without the limitations. The first row (“Σ”) shows the sum frequency of all the values.

Each cell in the frequency table contains two numbers: the first is the relative frequency per one million tokens, whereas the second figure in parentheses is the absolute frequency.
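
The relation between the two numbers in a cell can be illustrated with a short calculation. This is only a sketch of the arithmetic: the figures are made up, the function name is hypothetical, and how Korp rounds the displayed value is not covered here.

```python
def per_million(absolute_frequency, corpus_size_tokens):
    """Relative frequency: occurrences per one million tokens."""
    return absolute_frequency * 1_000_000 / corpus_size_tokens


# Made-up figures: 150 matches in a corpus of 3,000,000 tokens.
absolute = 150
relative = per_million(absolute, 3_000_000)

# Same layout as a table cell: relative frequency, then absolute in parentheses.
print(f"{relative:.1f} ({absolute})")
```

Because the relative frequency is normalized by corpus size, it can be compared across corpora of different sizes, whereas the absolute frequency cannot.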

You can sort the frequency table by any column by clicking the column header. Clicking the header again reverses the sort direction.

You can view a KWIC concordance of the matches for an attribute value by clicking the value (row label). However, this might not work completely correctly if the search matches multiple tokens or if attribute values contain spaces (typically text attributes).

Unlike in Lemmie, you currently cannot narrow down the frequency table in Korp by specifying the minimum absolute or relative frequency to be shown.

Downloading the statistics result

You can download either the absolute or relative frequencies of the statistics table in a comma- (semicolon-) or tab-separated-values format.
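
Since the frequency table cannot be cut off by a minimum frequency within Korp, one option is to filter the downloaded file afterwards. The following is a minimal sketch under stated assumptions: the column layout (a header row, the attribute value in the first column and the total frequency in the second) and the sample data are hypothetical, so check them against an actual downloaded file before relying on the code.

```python
import csv
import io


def filter_by_min_frequency(lines, min_frequency, delimiter=";"):
    """Return the header row and the rows whose total is >= min_frequency.

    Assumed (hypothetical) layout: the first row is a header, the first
    column holds the attribute value and the second column the total.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    kept = [row for row in reader if float(row[1]) >= min_frequency]
    return header, kept


# Made-up sample standing in for a downloaded semicolon-separated file.
sample = io.StringIO(
    "value;Total;corpus_a\n"
    "talo;120;80\n"
    "talossa;45;30\n"
    "talolta;3;1\n"
)
header, rows = filter_by_min_frequency(sample, min_frequency=10)
for row in rows:
    print(row[0], row[1])
```

For a real file, pass an open file object instead of the sample, for example open(path, newline="", encoding="utf-8"), and use delimiter="\t" for the tab-separated format.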

Printing the statistics result

Unfortunately, the statistics result cannot be printed from Korp. You need to download the result and print it from a spreadsheet application, such as Microsoft Excel or LibreOffice Calc.

Collocation results

Korp does not currently compute collocation tables.

Even though it is no proper substitute for collocation tables, for some purposes you might make do with a list of the words that occur immediately before or after a search word (or expression), together with their frequencies. You can compute such a list in Korp as follows. In the extended search, add a token search criterion “word” “is” “<any word>” (the default when adding a new token box) before or after the token criterion box for the word whose collocates you wish to list. Equivalently, in the advanced search, add [] before or after the search word. For “Statistics: compile based on”, choose the attribute you wish to list. The statistics result then lists the multi-word expressions containing the main search word. You can then download the result and process it in another application.
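
The workaround above could be sketched as follows: one helper writes the advanced-search query with [] on the chosen side, and another tallies the neighbouring words from the resulting two-word expressions. Both function names are hypothetical, the (expression, frequency) layout of the rows is an assumption about how you have read in the downloaded result, and the sample rows are made up.

```python
from collections import Counter


def neighbour_query(word, side="left"):
    """CQP query matching `word` plus the token immediately before or after it."""
    target = f'[word = "{word}"]'
    return f"[] {target}" if side == "left" else f"{target} []"


def tally_neighbours(rows, word, side="left"):
    """Sum the frequencies of each word adjacent to `word`.

    rows: (expression, absolute_frequency) pairs, where expression is a
          two-word string such as "iso talo" (assumed layout).
    """
    counts = Counter()
    for expression, frequency in rows:
        tokens = expression.split()
        if len(tokens) != 2:
            continue  # skip rows that are not exactly two tokens
        if side == "left":
            neighbour, node = tokens
        else:
            node, neighbour = tokens
        if node == word:
            counts[neighbour] += frequency
    return counts


print(neighbour_query("talo"))
rows = [("iso talo", 12), ("uusi talo", 7), ("iso talo", 3)]
print(tally_neighbours(rows, "talo").most_common())
```

Note that the search word is inserted into the query as is; because attribute values in CQP are regular expressions, words containing special characters would need escaping. Such a tally covers only the immediately adjacent position, which is why it is not a full substitute for a collocation table.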

Comparing searches

Korp has a feature for comparing search results, but it is different from that of Lemmie. It shows prominent values of a chosen attribute for the matches found by two searches, based on the log-likelihood ratio.

To be able to compare search results in Korp, you need to save them: after selecting corpora and specifying search criteria, press the downward arrow beside the “Search” button, enter a name for your search and press “Save”. (You need not make the actual search.) When you have at least two saved searches, choose the “Compare” search tab, select two named searches you have saved to compare with each other and the attribute with reference to which you wish to compare the searches.

The comparison result shows side by side at most the 30 most prominent attribute values from the two searches. The result is in descending order of the log-likelihood ratio value, and the number shown for each attribute value is its absolute frequency. The log-likelihood value is illustrated by the bar in the background of the row, and you get the exact value by hovering the mouse pointer over the absolute frequency. By clicking a row, you should get a concordance of the given values, although at present that does not always work.
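
To give an idea of what a log-likelihood value expresses, the sketch below uses a two-term formulation that is common in corpus linguistics for comparing a frequency across two samples. It is an illustration only: the exact computation Korp performs is not described in this document and may differ, and the function name and the figures are hypothetical.

```python
import math


def log_likelihood(freq1, total1, freq2, total2):
    """Log-likelihood ratio for one value observed in two result sets.

    freq1, freq2:   absolute frequency of the value in each search result
    total1, total2: total number of counted items in each search result
    """
    combined = freq1 + freq2
    expected1 = total1 * combined / (total1 + total2)
    expected2 = total2 * combined / (total1 + total2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # the term 0 * ln(0) is taken as 0
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Same proportion in both results: no difference to speak of.
print(log_likelihood(10, 1000, 20, 2000))
# Clearly more frequent in the first result: a larger value.
print(log_likelihood(30, 1000, 10, 1000))
```

The larger the value, the less likely it is that the difference between the two frequencies is due to chance, which is why the most prominent values are listed first.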

Comparing searches does not currently offer an option to download the result.

Settings dialog

Korp differs from Lemmie in that it has no persistent settings or a separate settings tab. Instead, corpora are selected from the corpus selector, other search settings are specified directly in the search tabs, and the language of the user interface is chosen at the top right of the Korp window.

Korp currently has no other way of saving corpus or search settings than copying the Korp URL from the Web browser’s address bar and pasting it to a text file, from which it can later be copied and pasted back to the browser. The ability to save settings within Korp has been requested, and we will be investigating it.

Some of the settings in Lemmie cannot currently be simulated in Korp, whereas others can, although in some cases in a non-obvious way.

Query restriction

By default, Korp restricts multi-word queries within a sentence. In the extended and advanced search modes, you can also choose to restrict a query within a paragraph.

The other elements to which Lemmie can restrict queries are mapped to the paragraph type in Korp. For example, to restrict a query in Korp to a title, add an extra search criterion specifying “paragraph type” “is” “head” to one of the token boxes in the extended search, or in the advanced search, append & _.paragraph_type = "head" within one set of square brackets (the prefix _. is required when referring to metadata attributes in CQP queries).

Concordance settings

Korp always shows at least the sentence containing the match: you cannot specify a narrower context.

As mentioned earlier, in Korp you cannot choose the sort keys, directions and features.

At present, the concordance result in Korp always displays the word form only.

Frequency table settings

In Korp, you can dynamically choose the sort order in the statistics result view by clicking column labels.

You cannot cut off the result by frequency.

In Korp, you select the feature(s) to display and to compile statistics based on from the drop-down menu “Statistics: compile based on” at the bottom of the search pane.

My Results dialog

Korp has no facility for saving query results, and probably never will in the same way as Lemmie had. You need to re-perform a query to get the same results.

My Corpora dialog

The corpus settings in Korp are not persistent: you need to select anew the corpora you wish to use every time after you have logged in to Korp or after changing the mode (such as Finnish or Swedish).

You cannot create your own custom corpora in Korp. However, you can limit your search to documents matching given metadata criteria by specifying the metadata criteria in the extended or advanced search as a part of the search criteria of any of the tokens of the query, which is somewhat unintuitive. In the extended search, choose one of the text attributes, a condition and the desired value for comparison. In the advanced search, add the corresponding criterion, joined with an & (denoting and) within any set of square brackets (see below for details on the attribute names). Adding the criterion has to be repeated for each new query.

You cannot easily add individual documents to the set of documents searched or remove them from it. However, to add a document, you can specify an additional or criterion that the Lemmie document id may be a given one, and to remove a document, you can specify an and criterion that the Lemmie document id is not a given one.
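
As a rough sketch of how such criteria could be combined in the advanced search, the helper below joins a token condition, a metadata condition and lists of document ids to add or remove into one token expression. The helper name, the publisher value and the document ids are hypothetical; the metadata attributes are written with the _. prefix that CQP queries require for them.

```python
def token_with_documents(token_condition, metadata_condition,
                         add_ids=(), remove_ids=()):
    """Build one CQP token expression restricted to a set of documents.

    token_condition:    condition on the token, e.g. 'word = "talo"'
    metadata_condition: condition selecting documents by metadata
    add_ids:            Lemmie document ids to admit in addition (or)
    remove_ids:         Lemmie document ids to leave out (and not)
    """
    documents = f"({metadata_condition})"
    for doc_id in add_ids:
        documents += f' | _.text_lemmie_id = "{doc_id}"'
    expression = f"{token_condition} & ({documents})"
    for doc_id in remove_ids:
        expression += f' & _.text_lemmie_id != "{doc_id}"'
    return f"[{expression}]"


print(token_with_documents(
    'word = "talo"',
    '_.text_publisher = "Example Publisher"',
    add_ids=["doc1"],
    remove_ids=["doc2"],
))
```

The parentheses keep the or criteria grouped with the metadata condition, so that the added document still has to satisfy the token condition. As noted above, the whole criterion has to be included again in each new query.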

Documentation

A link to the Korp user guide is in the cog menu at the top right corner of the Korp window.

Search syntax

Korp has three different ways of specifying corpus queries: simple, extended and advanced search. Korp’s simple search is simpler than Lemmie’s simple syntax, and the advanced search corresponds to Lemmie’s advanced syntax but is more expressive. In the extended search, a query is built semi-graphically; it has no equivalent in Lemmie.

The following subsections contrast Korp’s simple and advanced search to Lemmie’s simple and advanced syntax, respectively.

Simple syntax

In Korp’s simple search, you can search either for a single word form or a fixed phrase (consecutive word forms), or for a single base form with a given part of speech. For word form searches, you can specify that the result may also contain words of which the search words are prefixes or suffixes (that is, that the search words are truncated forms), or that the search is done case-insensitively.

To make a base-form query, you need to choose the base form and its part of speech from the auto-completion drop-down list that is displayed when you write a word in the input field. The part of speech is shown in parentheses.

For all other needs, you need to use Korp’s extended or advanced search; for example, to search for words with single arbitrary characters, for phrases with some words allowing truncation and some not, for words with a number of arbitrary words between them and for negation expressions.

Advanced syntax

In Korp’s advanced search, you write your corpus query using the CQP query language. (CQP (Corpus Query Processor) is the query engine of Corpus Workbench, which underlies Korp, and the name is also used for its query language.) CQP is in many ways similar to Lemmie’s advanced syntax, but they have subtle differences, briefly explained below. CQP also has a number of advanced features not found in Lemmie. For more information on CQP, please refer to the guide to Korp’s advanced search.

As in Lemmie, the query parameters (or token expressions) for a single token in CQP are written in square brackets. As in Lemmie, the empty brackets [] denote any token, and required and disqualifying features (attributes in Korp) are specified as [key="value"] and [key!="value"], respectively.

In CQP, the feature values in quotation marks are full-fledged (extended) regular expressions. As in Lemmie, a full stop . denotes any single character, but to specify truncation (any number (including zero) of any characters), you need to write .* (a period followed by an asterisk). Please see Korp’s advanced search guide for more information on the available regular expressions.

To require or disqualify more than one attribute in a token, the attribute specifications need to be separated by an ampersand (and), unlike in Lemmie: [key1="value1" & key2="value2"]. In Korp, you may also specify disjunctions (or) requiring that any one of the given attribute conditions hold by separating the attribute specifications by a vertical bar (|). Attribute specifications can also be grouped by parentheses, and an exclamation mark (!) denotes negation.

A token expression (query parameter) can be repeated using an iterator expression similar to that in Lemmie: [key="value"]{n,m}, which allows repeating the given token expression n to m times. The values of n and m can be larger than the limit of 9 in Lemmie. CQP also allows other token-level regular expression constructs.
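
The pieces described above (several attributes joined with an ampersand, and a repeated empty token expression) can be put together into a query for two words with a limited number of arbitrary words between them. The following is a minimal sketch with hypothetical helper names; the attribute values are examples only, and the part-of-speech value in particular is an assumption about the tag set.

```python
def token(**attributes):
    """One CQP token expression; several attributes are joined with &."""
    conditions = " & ".join(
        f'{name} = "{value}"' for name, value in attributes.items()
    )
    return f"[{conditions}]"


def any_tokens(minimum, maximum):
    """Between `minimum` and `maximum` arbitrary tokens: []{n,m}."""
    return f"[]{{{minimum},{maximum}}}"


query = " ".join([
    token(lemma="talo", pos="N"),  # base form with a part of speech
    any_tokens(0, 3),              # up to three arbitrary tokens
    token(word="on"),              # a specific word form
])
print(query)
```

The values are passed through unchanged, so they are interpreted as regular expressions in the same way as when typed directly into the advanced search.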

Attribute names and values

The attribute names that Korp uses are partly different from the feature names in Lemmie:

Lemmie feature	Korp attribute	Meaning
`wf`	`word`	word form
`bf`	`lemma`	base form
`pos`	`pos`	part of speech
`msd`	`msd`	morphosyntactic description

Attribute value sets are the same as in Lemmie.

The attribute id (a unique identifier of a word token occurrence) is not currently available in Korp. If you need it, please request it.

The Korp versions of the corpora do not currently have separate attributes corresponding to the individual features of the morphosyntactic description (case, definitiveness, deponent, derivationSuff, extra, gender, grade, modality, number, person, possSuff, tense, voice). When you want to search for the value of a feature, you need to specify a value for the attribute msd, which contains all the morphosyntactic features. For example, to search for words in the ablative case in the Finnish corpora (FTC), you can write the CQP token expression [msd=".*Abl.*"]. If a value is a substring of another value, you should use (.* )? before the value and ( .*)? after it instead of .*; for example, to search for words in the comitative case, use the token expression [msd="(.* )?Com( .*)?"] (with simple .*, the expression would also match the grade comparative (Comp)). If you would like to have the individual attributes, please request them.
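
The difference between the loose pattern and one that accepts the value only as a whole space-delimited feature can be tried out with Python’s regular expressions, which behave in the same way for these simple patterns. This is a sketch under assumptions: the msd strings below are made-up examples with space-separated features, and the function name is hypothetical. In CQP, a pattern has to match the whole attribute value, which re.fullmatch imitates.

```python
import re


def cqp_value_matches(pattern, value):
    """Imitate CQP matching: the pattern must cover the whole value."""
    return re.fullmatch(pattern, value) is not None


# Made-up msd values with space-separated features.
comitative = "N Com Pl"
comparative = "A Comp Sg Nom"

loose = ".*Com.*"            # Com anywhere, also inside Comp
strict = "(.* )?Com( .*)?"   # Com only as a whole feature

for name, pattern in (("loose", loose), ("strict", strict)):
    print(name,
          cqp_value_matches(pattern, comitative),
          cqp_value_matches(pattern, comparative))
```

The optional groups allow the feature to be the first, the last or the only one in the value, while the space inside each group ensures that the feature name is not merely a part of a longer name.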

In contrast to Lemmie, attribute values are matched case-sensitively by default in Korp. If you need case-insensitive matching, append %c to an attribute specification; for example, [word="on" %c].

Metadata attributes

In addition to the token attributes listed above, you can also refer to document-, paragraph- and sentence-level metadata attributes in the CQP queries. They are referred to in the form [_.element_attribute="value"], where element is the name of the element (text, paragraph or sentence) and attribute the attribute name. Note that the literal underscore and full stop before the name are obligatory. The following metadata attributes are available:

Attribute	Description
text_title	Document title
text_creator	Document creator (author)
text_publisher	Document publisher
text_wordcount	The number of words in the document
text_lemmie_id	The id of the document in Lemmie
text_lang	The language of the document: one of `fin`, `swe` or `eng`
text_date	The date of the text in the format yyyy-mm-dd
text_filename	The name of the original XML file in the Language Bank of Finland
text_rights	The licence type of the document: `A` or `B`
text_contributor	Document contributor
text_source	Document source
text_lemmie_corpus	Corpus id in Lemmie
text_subject	Document subject
paragraph_type	Type of paragraph: the XML element type corresponding to the paragraph
sentence_within	The XML elements above the sentence, from top to bottom

Note that not all documents have values for all the attributes.

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
