Korp: Advanced search

The Korp corpus search interface has an advanced search mode in which you write your corpus query in the CQP query language. (CQP (Corpus Query Processor) is the query language of Corpus Workbench, which underlies Korp.) CQP queries can be used when searching for phenomena (such as dependencies) that cannot be expressed in Korp’s other search modes.

The properties of the CQP query language are described in greater detail in the CQP Tutorial. However, Korp’s advanced search only accepts individual CQP queries and does not support CQP commands that are used to process search results. Furthermore, CQP queries are not terminated with a semicolon in Korp.

The CQP query expressions corresponding to the currently active queries in the simple and extended search modes are displayed above the CQP input box and they can be copied to the input box to be modified. The user can also browse through their previous queries in the Search history menu in the upper-right corner of the Korp GUI, repeat a search, and copy the CQP query to the text box.

If the CQP query has a syntax error, Korp will (unfortunately) simply display the uninformative error message “An error occurred” without specifying the nature of the error.

Common word attributes

CQP queries refer to a token’s attributes by their internal names, which are normally not displayed anywhere in the Korp interface (the menus and info boxes only show the full names of attributes in the interface language). The names of some of the essential attributes, however, are the same in all corpora, although unannotated corpora naturally lack any annotation attributes.

The names and meanings of some common attributes:

Name	Meaning
`word`	word form (surface form)
`lemma`	the base form of the word
`lemmacomp`	the base form of the word with compound part boundaries marked (usually `|` in TDT-annotated corpora, `#` in others); not available in all annotated corpora
`pos`	part of speech
`msd`	morphological analysis a.k.a. morphosyntactic description
`ref`	the running number of the word within the sentence
`dephead`	the running number of the dependency head within the sentence
`deprel`	the word’s dependency relation to its head

Individual corpora may have other attributes in addition to the ones listed here.

Attribute values, on the other hand, are subject to greater variation since certain values are encoded differently depending on the annotation tool used. The POS and dependency relation tags used by certain corpora are described in the following pages (currently in Finnish only).

FinnTreeBank (FTB)
Corpora analysed with the Turku Dependency Treebank (TDT) parser (e.g., the newspaper and periodical collection of the National Library of Finland, Suomi24 discussion forum)

Note that the examples on this page use the part-of-speech and dependency relation codes of FinnTreeBank. The most common part-of-speech codes in TDT are the same, but the dependency relation codes differ more. For example, in FTB, a subject is subj, whereas in TDT, it is most often nsubj but also csubj, nsubj-cop or csubj-cop, and an object is obj in FTB but dobj in TDT.

CQP basics

Search criteria of individual tokens (attribute constraints)

The search criteria, i.e. attribute constraints, of individual tokens are the basic components of a CQP query. The search criteria, which are (usually) written in square brackets, specify the values that a token’s attributes must have in order for it to match the query. If the search criterion only refers to the token’s surface form (the attribute word), it can simply be written between quotes without brackets or an attribute name.

Examples

CQP expression	Meaning
`"kieli"`	word form “kieli”
`[lemma="kieli"]`	tokens whose base form is “kieli”
`[pos="N"]`	tokens with the part-of-speech tag “N” i.e. nouns

The search criterion for an individual token can consist of several attribute constraints joined by the following logical operators: & (conjunction, AND), | (disjunction, OR), ! (negation, NOT), and -> (implication). Attribute constraints can be grouped by using parentheses. The comparison operator != (inequality) can also be used in addition to =.

Examples

CQP expression	Meaning
`[lemma="kuusi" & pos="N"]`	tokens with the base form “kuusi” and the POS tag “N”
`[lemma="kieli" & !(deprel="subj" | deprel="obj")]`	tokens with the base form “kieli” and with a dependency relation other than subject or object
`[lemma="kieli" & word!=lemma]`	tokens with the base form “kieli” whose surface form is different from the base form

You can refer to the value of a token’s attribute on either side of the operator as shown above.
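
To make the logic of the operators concrete, the following Python sketch evaluates constraints like the ones above against a few hand-written tokens. It is only a model of the semantics, not how Korp or CQP works internally. The token dictionaries, their attribute values and the function names are invented for illustration, and plain string equality stands in for CQP's regular expression matching, which gives the same result for these literal values.

```python
# Toy tokens; the attribute values are invented for illustration.
tokens = [
    {"word": "kieli", "lemma": "kieli", "pos": "N", "deprel": "subj"},
    {"word": "kielen", "lemma": "kieli", "pos": "N", "deprel": "attr"},
    {"word": "kuusi", "lemma": "kuusi", "pos": "Num", "deprel": "attr"},
]

def implies(a, b):
    # "a -> b" is false only when a is true and b is false.
    return (not a) or b

def not_subj_or_obj(t):
    # Models [lemma="kieli" & !(deprel="subj" | deprel="obj")]
    return t["lemma"] == "kieli" and not (t["deprel"] == "subj" or t["deprel"] == "obj")

def form_differs_from_lemma(t):
    # Models [lemma="kieli" & word!=lemma]
    return t["lemma"] == "kieli" and t["word"] != t["lemma"]

def kuusi_implies_noun(t):
    # Models [lemma="kuusi" -> pos="N"]
    return implies(t["lemma"] == "kuusi", t["pos"] == "N")

print([t["word"] for t in tokens if not_subj_or_obj(t)])          # ['kielen']
print([t["word"] for t in tokens if form_differs_from_lemma(t)])  # ['kielen']
print([t["word"] for t in tokens if kuusi_implies_noun(t)])       # ['kieli', 'kielen']
```

Note how the implication matches every token whose base form is not “kuusi”: the constraint only rules out tokens that are “kuusi” without being nouns.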

Regular expressions

The values written between quotes are regular expressions and may contain common regular expression constructs. For example, "kiel[it].*" matches all tokens starting with “kieli” or “kielt”, and [lemma="suur.+"] matches tokens whose base form consists of “suur” followed by one or more characters.

CQP regular expressions support the following constructs:

Construct	Description	Examples
alphanumeric symbols	match themselves
`.`	any single symbol
`[…]`	a set or range of symbols: any of the symbols inside the brackets	`[aeiouyäö]` matches a single Finnish vowel symbol and `[a-hw-z]` all the letters from `a` to `h` and `w` to `z`.
`[^…]`	the complement of a set or range of symbols, none of the symbols inside the brackets	`[^abcw-z]` matches any symbol except the letters `a`, `b`, `c`, `w`, `x`, `y`, and `z`.
RS	concatenation: a substring matched by the expression R immediately followed by a substring matched by the expression S	`[a-z][0-9]` matches a lowercase letter followed by a digit
`(…)`	grouping
R`*`	repeat zero or more times; R can be a single character, a set of characters, or parentheses containing a regular expression	`a.*` matches all strings starting with an `a`, while `a(bc)*` matches the strings `a`, `abc`, `abcbc`, `abcbcbc` etc.
R`+`	repeat once or more	`goo+d` matches the strings `good`, `goood`, `gooood` etc.
R`{`n`}`	repeat exactly n times
R`{`m`,`n`}`	repeat m to n times
R`?`	optionality (repeat zero or one time)	`favou?rite` matches `favorite` and `favourite`
R`|`S	alternatives; match R or S	`apple|orange` matches the strings `apple` and `orange`; `(read|writ|watch)ing` matches `reading`, `writing`, `watching`
`\`c	escaping; escapes a special character	`\.` matches a literal full stop
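
The constructs in the table can be tried out outside Korp. The sketch below uses Python's `re` module as a stand-in, on the assumption that it behaves like CQP's regular expression engine for these basic constructs and that CQP requires the expression to match the whole attribute value (hence `re.fullmatch`). The helper name `matching` and the word list are made up for illustration.

```python
import re

def matching(pattern, values):
    # Keep the values that the pattern matches in full, as a stand-in
    # for how a CQP attribute constraint is assumed to be evaluated.
    return [v for v in values if re.fullmatch(pattern, v)]

words = ["kieli", "kieltä", "kielen", "kiel",
         "favorite", "favourite",
         "good", "goood", "gd",
         "a", "abcbc", "abcb"]

print(matching(r"kiel[it].*", words))  # ['kieli', 'kieltä']
print(matching(r"favou?rite", words))  # ['favorite', 'favourite']
print(matching(r"goo+d", words))       # ['good', 'goood']
print(matching(r"a(bc)*", words))      # ['a', 'abcbc']
```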

Multi-word search and token-level regular expressions

The simplest way to search for multi-word sequences is to write a CQP query where the search criteria of each word are written consecutively, separated by spaces:

CQP query	Meaning
`"mistä" "syystä"`	the consecutive word forms “mistä”, “syystä”
`"mistä" [pos!="N"]`	the word “mistä” followed by a word that is not a noun
`[pos="A" & deprel="attr"] [pos!="N"]`	an adjective attribute followed by a word that is not a noun

A single search criterion (usually written in square brackets) can be thought of as a single keyword box in Korp’s Extended Search, where you can define several different attribute conditions. Just like the boxes in Extended Search can be concatenated to search for a sequence of words, the search criteria can be concatenated in a CQP query. However, a CQP query may contain groupings that cannot be expressed in the Extended Search.

Multi-word CQP queries are always restricted to words in the same sentence in Korp. The entire matched sequence of tokens is displayed in bold in the search results.

Some regular expression constructs can also be used at the token level, where the basic units are the search criteria of individual tokens. The following constructs are supported: the repetitions ?, *, +, {n} and {m,n}; alternatives |; and grouping with parentheses. An empty pair of brackets [] matches any token and is therefore analogous to the full stop . in character-level regular expressions.

Examples:

CQP query	Meaning
`[lemma="maailma"] []* [lemma="kieli"]`	a form of the word “maailma” followed by a form of the word “kieli” with any number of words in between
`("siksi" | "sen" "vuoksi") [deprel!="subj"]{1,5} [deprel="main"]`	“siksi” or “sen vuoksi” followed by one to five non-subjects and a main verb
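
When such queries are generated by a script, the token-level repetition can be used to limit the distance between two words instead of allowing any number of words with `[]*`. The following Python sketch builds a query string of that kind. The helper names `lemma` and `within` are hypothetical and not part of Korp or CQP, and the resulting string is meant to be pasted into the CQP input box.

```python
def lemma(value):
    # Constraint on the base form, e.g. [lemma="kieli"].
    return f'[lemma="{value}"]'

def within(first, second, max_gap):
    # Allow at most max_gap arbitrary tokens ([]) between the two criteria.
    return f"{first} []{{0,{max_gap}}} {second}"

query = within(lemma("maailma"), lemma("kieli"), 3)
print(query)  # [lemma="maailma"] []{0,3} [lemma="kieli"]
```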

Searching for dependency relations

When searching for dependency relations, you may wish to find e.g. constructions where a token is subordinate to another token (its main word, i.e. head). The relevant attributes are ref (the running number of the token within the sentence), dephead (the running number of the head) and deprel (the dependency relation).

When searching for dependency relations, a CQP feature that allows comparing attributes of different tokens comes in handy. The tokens have to be labelled before they can be referred to in a query. This is done by adding a label in front of the search criteria, e.g. a:[deprel="subj"]. The values of the attributes of a labelled token can then be referred to in the search criteria of other tokens in the form label.attribute, e.g. [dephead=a.ref].

CQP query	Meaning
`a:[deprel="main"] []* [lemma="kieli" & deprel="subj" & dephead=a.ref]`	the word “kieli” as the subject of a preceding main verb (labelled as `a`), with zero or more words between the two
`a:[deprel="attr"] []* [deprel="subj" & ref=a.dephead]`	a subject and a preceding attribute (labelled as `a`), with zero or more words between the two
`a:[deprel="subj"] b:[dephead=a.ref] c:[dephead=b.ref] [dephead=c.ref]`	a subject followed by a sequence of three words (with each word being subordinate to the preceding word)

The entire matched sequence of words is displayed as a hit in the results.
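
What a label reference such as ref=a.dephead checks can be illustrated with a small Python model. The sketch below is not how CQP evaluates queries; it simply applies the conditions of the query a:[deprel="attr"] []* [deprel="subj" & ref=a.dephead] to one hand-annotated sentence. The sentence, its annotation values and the function name are invented for illustration.

```python
# A hand-annotated toy sentence; the values are invented and only
# imitate the FTB-style codes used on this page.
sentence = [
    {"ref": 1, "word": "Suomen", "deprel": "attr", "dephead": 2},
    {"ref": 2, "word": "kieli", "deprel": "subj", "dephead": 3},
    {"ref": 3, "word": "muuttuu", "deprel": "main", "dephead": 0},
    {"ref": 4, "word": "nopeasti", "deprel": "advl", "dephead": 3},
]

def attr_before_subject(tokens):
    # Token a must be an attribute; some later token must be a subject
    # whose running number equals a's dephead (ref=a.dephead).
    hits = []
    for i, a in enumerate(tokens):
        if a["deprel"] != "attr":
            continue
        for t in tokens[i + 1:]:
            if t["deprel"] == "subj" and t["ref"] == a["dephead"]:
                hits.append((a["word"], t["word"]))
    return hits

print(attr_before_subject(sentence))  # [('Suomen', 'kieli')]
```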

Note that the ordering of search criteria in a CQP query defines the order of words in a matched sequence of words. For instance, if you want to find a verb and its subject regardless of the order in which they appear in the sentence, you either have to make two separate queries, with the subject preceding the verb in the first one and the verb preceding the subject in the other, or combine these two expressions with the token-level | operator.
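
The combined form can be produced mechanically from the two search criteria. The Python sketch below shows one way to do so; the helper name `either_order` is hypothetical, and the generated query has not been tested against Korp.

```python
def either_order(first, second):
    # Join both orderings with the token-level | operator; []* allows
    # any number of tokens between the two criteria.
    return f"({first} []* {second}) | ({second} []* {first})"

subject = '[deprel="subj"]'
verb = '[pos="V" & deprel="main"]'
print(either_order(subject, verb))
```

This simplified query only fixes the two possible word orders. It does not check that the subject actually depends on the matched verb.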

In a token’s search criteria, it is only possible to refer to a labelled token that precedes it. Referring to a labelled token that follows a given token requires the use of a global restriction. The global restriction is written at the end of the query after a double colon (::), and it can refer to the attributes of any labelled token.

Examples:

CQP query	Meaning
`s:[lemma="jäsen" & deprel="subj"] []* v:[pos="V" & deprel="main"] []* o:[lemma="tuki" & deprel="obj"] :: s.dephead = v.ref & o.dephead = v.ref`	the subject “jäsen” (`s`) followed by (possibly after a number of words) the main verb (`v`), followed by the object (`o`) “tuki”

The earlier examples can also be rewritten with a global constraint, e.g.

a:[deprel="subj"] b:[] c:[] d:[] :: b.dephead=a.ref & c.dephead=b.ref & d.dephead=c.ref

matches the exact same sequence as

a:[deprel="subj"] b:[dephead=a.ref] c:[dephead=b.ref] [dephead=c.ref]

Other examples with global restrictions

The global constraints mentioned above make it possible to search for constructions where the same word appears several times, but the word can be any word. For example:

CQP query	Meaning
`a:[] "ja" b:[] :: a.word=b.word`	the word “ja” with the same word (surface form) on both sides
`a:[lemma!="olla"] b:[] :: a.lemma=b.lemma & a.word!=b.word`	two different forms of the same word (not “olla”), with one immediately following the other
`a:[word!="[-.,:;()]"] []* b:[] []* c:[] :: a.word=b.word & b.word=c.word`	the sentence contains the same word form (not punctuation) three or more times (warning: the search will be very slow and might not yield any results at all when searching in large corpora)
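
The effect of a global restriction can be modelled in a few lines of Python. The sketch below mimics the first query in the table, a:[] "ja" b:[] :: a.word=b.word, on a plain list of word forms. It is an illustration of the condition being checked, not of how CQP evaluates it, and the function name and the example words are made up.

```python
def same_word_around_ja(words):
    # Look at every window of three consecutive tokens: the middle one
    # must be "ja" and the outer two must have identical surface forms.
    hits = []
    for i in range(len(words) - 2):
        a, mid, b = words[i], words[i + 1], words[i + 2]
        if mid == "ja" and a == b:
            hits.append((a, mid, b))
    return hits

print(same_word_around_ja(["yhä", "uudestaan", "ja", "uudestaan", "ja", "taas"]))
# [('uudestaan', 'ja', 'uudestaan')]
```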