Korp Web service (API)

Introduction

The Korp corpus search interface of Kielipankki (The Language Bank of Finland) is accompanied by and based on a publicly accessible Web service. Other programs can use this service to retrieve the data used by the Korp search interface, so it serves as an API for Korp. The Korp Web service also offers some features not currently available in the Korp search interface, such as pre-queries that limit the scope of the main query to structures (sentences, paragraphs or the like) also matching the pre-queries.

This document describes the commands available in the Korp API, their parameters and the content of the data they return. This document is largely based on the Korp Web service documentation of Språkbanken (The Swedish Language Bank at the University of Gothenburg).

Accessing the Web service

The address for Kielipankki’s main Korp Web service is

https://korp.csc.fi/cgi-bin/korp.cgi

The Korp Web service is accessed using standard HTTP(S) GET requests, so the URLs for API calls are of the form

https://korp.csc.fi/cgi-bin/korp.cgi?command=…&corpus=…&…

Here, command is the Web service command to execute and corpus is the list of corpora that the command targets. These parameters are required by most of the commands.

The Web service returns data in JSON (JavaScript Object Notation) format, which shows up as text in the Web browser.
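
As a minimal sketch, a URL of the above form could be assembled in Python as follows. The helper name korp_url and the corpus id EXAMPLE_CORPUS are hypothetical and used only for illustration. The sketch only builds the URL; fetching it (for example with urllib.request.urlopen) and decoding the JSON is left to the caller.

```python
from urllib.parse import urlencode

KORP_URL = "https://korp.csc.fi/cgi-bin/korp.cgi"


def korp_url(command, corpora=(), **params):
    """Build a GET URL for a Korp Web service command."""
    query = {"command": command}
    if corpora:
        # Several corpus ids are passed as one comma-separated value.
        query["corpus"] = ",".join(c.upper() for c in corpora)
    query.update(params)
    # safe="," leaves commas unescaped; other special characters are
    # percent-encoded.
    return KORP_URL + "?" + urlencode(query, safe=",")


# EXAMPLE_CORPUS is a placeholder, not a real corpus id.
url = korp_url("info", ["example_corpus"])
print(url)
```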

In addition, the KWIC search results can be downloaded from Korp in various other formats using the download service at

https://korp.csc.fi/cgi-bin/korp_download.cgi

To get an idea of how the results of a Korp search could be obtained via the API, you can open a network traffic monitor in your Web browser’s developer tools and see how the Korp search interface interacts with the API. Look for calls to korp.cgi or korp_download.cgi. You can copy their URLs to the browser address bar, modify their parameters and see the effect. (The parameter querydata often appears in the URL but it is not needed for the Web service.)

Corpus Workbench

Korp is based on the (IMS Open) Corpus Workbench (CWB) and uses the query language of its Corpus Query Processor (CQP). It may be helpful to know the basics of CQP for using the Korp Web service. Please note that the CQP queries in Korp are single query expressions: you cannot use the CQP commands for counting, sorting, naming results or setting options, for example.

Corpus attributes

Corpora in Korp (CWB) consist of tokens (words and punctuation marks). Each token has named positional attributes, at least a word form. In addition, sequences of tokens may be grouped by structural attributes that correspond roughly to XML elements and their attributes.

The positional and structural attributes vary from corpus to corpus but some attributes are common to many corpora. Even if the attribute names are the same, it does not guarantee that the attribute value sets are the same.

Common positional (token) attributes for corpora are:

  • word: word form
  • lemma: base form
  • lemmacomp: base form with compound boundary marked
  • pos: part of speech
  • msd: morpho-syntactic description (morphological analysis)
  • ref: the number of the token in the sentence (one-based)
  • dephead: the number of the dependency head of the token
  • deprel: dependency relation with dephead
  • lex: a ”lemgram” for the word: lemma..pos.n

The use of lemgrams originates from the Swedish corpus markup, where it serves as a sense identifier. For Finnish corpora, lemgrams are constructed artificially. They use a different (fixed) set of part-of-speech tags and the sense number n is always 1.
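
A lemgram of the form lemma..pos.n can be taken apart with plain string operations. The following is a small sketch; the function name parse_lemgram and the sample lemgram are made up for illustration, and the actual part-of-speech tags depend on the corpus.

```python
def parse_lemgram(lemgram):
    """Split a lemgram of the form lemma..pos.n into (lemma, pos, n)."""
    # Split from the right, so that dots inside the lemma do no harm.
    lemma, _, rest = lemgram.rpartition("..")
    pos, _, sense = rest.rpartition(".")
    return lemma, pos, int(sense)


# A made-up lemgram; not taken from any particular corpus.
print(parse_lemgram("kissa..nn.1"))
```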

Common structural attributes are text, paragraph and sentence, corresponding to divisions of the text. The attribute values associated with these structures are represented by the structural attributes struct_attr; for example, text_title for the title of a text and sentence_id for a sentence identifier.

Almost all corpora have the structural attributes text_datefrom, text_dateto, text_timefrom and text_timeto, which correspond to the creation date and time of the text, represented in the formats yyyymmdd and hhmmss, respectively. If the exact date is known, the values of text_datefrom and text_dateto are the same. If only the year is known, text_datefrom is yyyy0101 and text_dateto is yyyy1231; if the year and month are known, text_datefrom is the first day of the month and text_dateto is the last one. If the time is known, text_timefrom and text_timeto are the same; otherwise text_timefrom is 000000 and text_timeto is 235959. An unknown date is represented by an empty string in all these attributes.
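
The date conventions can be illustrated with a short helper that computes the from and to values for a date known at year, month or day precision. This is only a sketch of the convention described above; the function name date_range is hypothetical.

```python
import calendar


def date_range(year, month=None, day=None):
    """Return (datefrom, dateto) as yyyymmdd strings for a possibly partial date."""
    if month is None:
        # Only the year is known: the whole year.
        return f"{year:04d}0101", f"{year:04d}1231"
    if day is None:
        # Year and month are known: first to last day of the month.
        last_day = calendar.monthrange(year, month)[1]
        return (f"{year:04d}{month:02d}01",
                f"{year:04d}{month:02d}{last_day:02d}")
    # The exact date is known: both values are the same.
    exact = f"{year:04d}{month:02d}{day:02d}"
    return exact, exact


print(date_range(1999))
print(date_range(2000, 2))
print(date_range(2001, 5, 17))
```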

Conventions used in this document

The parameters of Korp Web service commands are described as follows:

  • a = …: a is a required parameter
  • [?] b = x: the optional parameter b takes value x
  • [+] c = …: the parameter c takes multiple values separated with commas

The above list would be represented as URL parameters as follows:

?a=…&b=x&c=…,…

The properties of the JSON objects returned by the Korp Web service are described so that the following JSON:

{
   "a": …,
   name: {
     "b": "y",
     "c": …
   }
}

is described as follows:

  • a: … (description)
  • [+] name: name is a variable property name, typically the name (id) of a corpus
    • b: y
    • c: … (description)

The [+] above for name indicates that the property may be repeated multiple times, obviously with different property names. If a value is an array of objects, it is mentioned explicitly.

Common features of the commands

All commands take the following optional parameters:

  • [?] indent = num: Format the resulting JSON with indentation step num. The default is to return the JSON in a compact form, with no indentation or line breaks.
  • [?] callback = string: Enclose the resulting JSON in string(…) (for some AJAX calls, for example).
  • [?] cache = true: Use a cached result if available; if not, store the result in the Korp query cache for future queries.

If a command causes an error, it returns JSON with property ERROR:

  • ERROR
    • type: The type of the error
    • value: Error message

All commands also return the real time (as opposed to CPU time) it took to execute the command:

  • time: Run time in seconds
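
A client will typically want to check for the ERROR property before using a result. The following is a minimal sketch of such a check; the names KorpError and decode_response are hypothetical, and the sample responses are hand-written to follow the shapes described above, not actual server output.

```python
import json


class KorpError(Exception):
    """Raised when a response contains the ERROR property."""


def decode_response(text):
    """Decode a JSON response and raise KorpError if it reports an error."""
    data = json.loads(text)
    if "ERROR" in data:
        err = data["ERROR"]
        raise KorpError(f"{err.get('type')}: {err.get('value')}")
    return data


# Hand-written sample responses (not real server output).
ok = decode_response('{"time": 0.012}')
print("run time:", ok["time"])

try:
    decode_response(
        '{"ERROR": {"type": "SomeError", "value": "some message"}, "time": 0.001}')
except KorpError as exc:
    print("request failed:", exc)
```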

Information commands

General information

Retrieve information about the available corpora and the CQP version used.

Parameters:

  • [?] command = info

Returns:

  • corpora: Comma-separated list of corpora available on the Korp server. The corpora are shown as upper-case corpus ids.
  • protected_corpora: An array of the names (ids) of the corpora that are protected
  • cqp-version: The CQP version used on the server

Please note that trying to access protected corpora (corpora requiring authentication for access) via the Korp Web service results in an error.

Example
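
A request for general information needs no parameters other than command=info. The sketch below shows one way to pick the unprotected corpora out of the result. The sample data and the function name accessible_corpora are made up for illustration. Since the description above calls corpora a comma-separated list, the code accepts either a string or an array; which of the two the server actually returns is not assumed here.

```python
def accessible_corpora(info):
    """Return the corpus ids in an info result that are not protected."""
    corpora = info["corpora"]
    if isinstance(corpora, str):
        # Accept a comma-separated string as well as an array.
        corpora = corpora.split(",")
    protected = set(info.get("protected_corpora", []))
    return [c for c in corpora if c not in protected]


# Hand-written sample result; the corpus ids are placeholders.
sample_info = {
    "corpora": ["CORPUS_A", "CORPUS_B", "CORPUS_C"],
    "protected_corpora": ["CORPUS_B"],
}
print(accessible_corpora(sample_info))
```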

Corpus information

Retrieve information on one or more corpora and their attributes.

Parameters:

  • [?] command = info
  • [+] corpus = List of upper-case corpus ids

Returns:

  • corpora
    • [+] corpus: Information on the corpus corpus (corpus in uppercase)
      • attrs
        • p: Comma-separated list of positional (word) attributes in corpus corpus
        • s: Comma-separated list of structural (text) attributes in corpus corpus. Attributes with a simple name without underscores designate structures and they have no particular values: for example, sentence for a sentence. Attributes with names of the form struct_attr containing underscores designate the attribute attr of the structure struct and they have values; for example, sentence_id for the identifier of a sentence.
        • [?] a: Comma-separated list of alignment attributes (for parallel corpora)
      • info
        • Charset: Character encoding of the corpus
        • Size: The number of tokens in the corpus
        • Sentences: The number of sentences in the corpus
        • Updated: The date of last update in ISO format yyyymmdd
    • total_size: The total number of tokens in the above corpora
    • total_sentences: The total number of sentences in the above corpora

Examples:
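
The naming convention of the structural attributes in attrs.s can be used to separate plain structures from their attributes. The following is a small sketch under that convention; the function name split_structural is hypothetical and the sample list uses attribute names mentioned in this document.

```python
def split_structural(attrs_s):
    """Split structural attribute names into structures and their attributes."""
    structures = [name for name in attrs_s if "_" not in name]
    attributes = {}
    for name in attrs_s:
        if "_" in name:
            # struct_attr: the part before the first underscore is the structure.
            struct, attr = name.split("_", 1)
            attributes.setdefault(struct, []).append(attr)
    return structures, attributes


sample_s = ["text", "text_title", "text_datefrom", "sentence", "sentence_id"]
structures, attributes = split_structural(sample_s)
print(structures)
print(attributes)
```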

KWIC concordance

Perform a KWIC concordance search for one or more corpora.

Parameters:

  • command = query
  • [+] corpus = Corpus id in uppercase
  • cqp = CQP query expression
  • [?] cqpn = (n is an integer) Additional CQP query expressions limiting the scope of the matches: the query results are only shown for structures (as specified with defaultwithin or within) that contain a match for all the queries. See below for more information.
  • start = The number of the first hit to include in the concordance (starting from 0)
  • end = The number of the last hit to include in the concordance
  • [?] defaultcontext = n struct: The default context (n struct elements) to show around the match: typically 1 sentence or 1 paragraph to show only the containing sentence or paragraph. You can also use n words to show n words around the match, disregarding structure boundaries.
  • [?+] context = corpus:n struct: The context to show for corpus corpus instead of the default
  • [?+] show = The positional attributes to show for tokens (from the list of attrs.p returned by the info for a corpus), and also the structures whose opening and closing is to be shown within tokens (from the list of attrs.s returned by the info for a corpus, typically structure names without an underscore)
  • [?+] show_struct = The structural attributes to show (from the list of attrs.s returned by the info for a corpus)
  • [?] cut = The maximum number of hits to search
  • [?] defaultwithin = struct: Limit search within the structural attribute struct
  • [?+] within = corpus:struct: Limit search in corpus corpus within struct instead of the structure given with defaultwithin
  • [?] sort = Sort criterion for the search results within each corpus: one of keyword (the searched word), left (left context), right (right context) or random (random order)
  • [?] random_seed = n: Use n as the seed for the random number generator, to get a reproducible random order with sort=random
  • [?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora

The word form is always shown in the concordance, even if show=word is not specified.

The additional CQP query parameters cqpn can be used to simulate order-independent conjunction of search criteria: it does not matter in which order the matches for the separate CQP queries appear in the text structure. This contrasts with a single CQP query, which always specifies the order in which the matching tokens must appear in the text. Note that the result will only indicate match positions for the largest-numbered query (the number of the unnumbered parameter cqp is 0); the rest are considered pre-queries limiting the scope of the matches.
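
As an illustration of a pre-query, the following sketch builds a query URL that asks for matches of one lemma in sentences that also contain another lemma, in either order. The corpus id EXAMPLE_CORPUS and the lemmas are placeholders; the parameter names are those listed above.

```python
from urllib.parse import urlencode

params = {
    "command": "query",
    "corpus": "EXAMPLE_CORPUS",    # placeholder corpus id
    "cqp": '[lemma = "kissa"]',    # query 0: acts as a pre-query
    "cqp1": '[lemma = "koira"]',   # largest-numbered query: its matches are reported
    "defaultwithin": "sentence",   # both queries must match in the same sentence
    "defaultcontext": "1 sentence",
    "show": "lemma,pos",
    "start": 0,
    "end": 9,
}
# urlencode percent-encodes brackets, quotes and other special characters.
url = "https://korp.csc.fi/cgi-bin/korp.cgi?" + urlencode(params)
print(url)
```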

Returns:

  • hits: The total number of hits
  • corpus_hits
    • [+] corpus: Number of hits for corpus
  • kwic: An array of KWIC rows with the following properties:
    • corpus: Corpus name in uppercase
    • match: Information on the match (of the main CQP query only):
      • start: The start position (word) of the match on the KWIC row
      • end: The end position (word) of the match on the KWIC row
      • position: Global corpus position (token number from the beginning of the corpus) for the match
    • tokens: An array of tokens on the KWIC row. Each token is an object, whose properties are the positional attributes specified in the parameter show (if they exist in the corpus in question). If structural attributes (structures) are specified in show, their opening and closing are shown in the property structs of the first and last token of the structure, respectively: the property structs.open lists all the structures opening before the token and structs.close the structures closing after the token.
    • structs:
      • [+] struct: The value of the structural attribute struct specified in the parameter show_struct for the first token of the matching row.
    • [?] aligned: For parallel corpora only
      • aligned_corpus: A list of aligned tokens in aligned_corpus.

Examples:
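
The following sketch formats one KWIC row of a query result as a line with the match in brackets. The sample row is hand-written to follow the structure described above, and the function name format_kwic_row is hypothetical. Note the assumption that match.end is exclusive (it points just past the last token of the match); this is not stated in the description above and should be checked against actual results.

```python
def format_kwic_row(row):
    """Format a KWIC row as 'CORPUS: left [match] right'."""
    words = [token["word"] for token in row["tokens"]]
    start = row["match"]["start"]
    # Assumption: "end" is exclusive, like a Python slice end.
    end = row["match"]["end"]
    left = " ".join(words[:start])
    match = " ".join(words[start:end])
    right = " ".join(words[end:])
    return f"{row['corpus']}: {left} [{match}] {right}"


# Hand-written sample row; not real corpus data.
sample_row = {
    "corpus": "EXAMPLE_CORPUS",
    "match": {"start": 1, "end": 2, "position": 1234},
    "tokens": [
        {"word": "Pieni", "lemma": "pieni"},
        {"word": "kissa", "lemma": "kissa"},
        {"word": "nukkuu", "lemma": "nukkua"},
        {"word": ".", "lemma": "."},
    ],
}
print(format_kwic_row(sample_row))
```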

Statistics

Frequency information

Count the absolute and relative frequencies of one or more attributes for a CQP query.

Parameters:

  • command = count (or count_all for counting statistics for all tokens)
  • cqp = CQP query expression (not applicable to count_all)
  • [?] cqpn = Additional CQP query expressions (n is an integer); see the description above for KWIC concordance (not applicable to count_all)
  • [+] groupby = Positional and/or structural attributes according to which to group the results
  • [+] corpus = Corpus names in uppercase
  • [?] defaultwithin = struct: Limit search within the structural attribute struct
  • [?+] within = corpus:struct: Limit search in corpus corpus within struct instead of the structure given with defaultwithin
  • [?+] ignore_case = Attributes for which case is ignored
  • [?] start = The number of the first row (groupby attribute value) to return
  • [?] end = The number of the last row to return
  • [?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora

You should use the command count_all for counting statistics for all the tokens in one or more corpora: it is optimized and much faster in this task than count. count_all takes the same arguments as count, except for cqp (and cqpn).

Returns:

  • corpora:
    • [+] corpus: The frequencies for corpus
      • absolute: Absolute frequencies
        • [+] attribute1/attribute2/…: Absolute frequency for the given combination of values of the attributes specified in the groupby parameter
      • relative: Relative frequencies
        • [+] attribute1/attribute2/…: Relative frequency for the given combination of attribute values
      • sums: Sums of all the attribute values for the corpus
        • absolute: Sum of absolute frequencies
        • relative: Sum of relative frequencies
    • total: The total frequencies for all corpora in the same format as above for individual corpora
    • count: The total number of different values

Examples:
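
The frequency object returned for a corpus can be turned into a ranked table with a few lines of code. The sketch below works on a single frequency object with the properties absolute, relative and sums; the sample values are invented, the function name top_values is hypothetical, and the example assumes a request with groupby=lemma,pos, so that the property names have the form lemma/pos.

```python
def top_values(freqs, n=3):
    """Return the n most frequent (value, absolute, relative) rows."""
    absolute = freqs["absolute"]
    relative = freqs["relative"]
    # Sort by descending absolute frequency, then by value for a stable order.
    ranked = sorted(absolute, key=lambda value: (-absolute[value], value))
    return [(value, absolute[value], relative[value]) for value in ranked[:n]]


# Invented sample data in the shape described above.
sample_freqs = {
    "absolute": {"kissa/N": 12, "koira/N": 30, "kissa/A": 1},
    "relative": {"kissa/N": 1.2, "koira/N": 3.0, "kissa/A": 0.1},
    "sums": {"absolute": 43, "relative": 4.3},
}
for value, abs_freq, rel_freq in top_values(sample_freqs, n=2):
    lemma, pos = value.split("/")
    print(lemma, pos, abs_freq, rel_freq)
```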

Trend diagram

Get the frequencies of one or more expressions over time.

Parameters:

  • command = count_time
  • cqp = CQP query expression
  • [?] subcqpn = Subquery of the CQP query above; n is a number; see below for further information.
  • [+] corpus = Corpus names in uppercase
  • [?] granularity = Temporal granularity of the result: y (year; the default), m (month) or d (day)
  • [?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora

If one or more subcqpn is specified, return the frequency information also for these queries.

The result is returned both by corpus and as a total for all corpora.

Returns:

  • corpora:
    • [+] corpus: The frequencies for corpus: an array of objects, one for the main CQP query and each subquery
      • [?] cqp: The sub-CQP query in question (not returned for the main query)
      • absolute:
        • [+] date: Absolute frequency for date
      • relative:
        • [+] date: Relative frequency for date
      • sums:
        • absolute: Sum of absolute frequencies
        • relative: Sum of relative frequencies
  • combined: The combined frequencies for all the corpora in corpora in the above format

Examples:
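
The per-corpus result of count_time can be converted into time series suitable for plotting. The following sketch assumes that the value for a corpus is an array of objects as described above and that texts with an unknown date, if present, appear under an empty-string date; both points are assumptions to verify against actual results. The sample data and the function name series_by_query are made up.

```python
def series_by_query(entries):
    """Map each (sub)query to a date-sorted list of (date, absolute frequency)."""
    result = {}
    for entry in entries:
        # The main query has no "cqp" property; label it separately.
        label = entry.get("cqp", "(main query)")
        absolute = entry["absolute"]
        # Skip the empty-string date (assumed to mean an unknown date).
        result[label] = sorted((d, n) for d, n in absolute.items() if d)
    return result


# Invented sample data for granularity y (year).
sample_entries = [
    {"absolute": {"2001": 4, "1999": 2, "": 1}},
    {"cqp": '[pos = "N"]', "absolute": {"2001": 3, "1999": 1}},
]
series = series_by_query(sample_entries)
print(series)
```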

Log-likelihood comparison

Compare the search results of two sets of corpora using log-likelihood.

Parameters:

  • command = loglike
  • set1_cqp = CQP query expression for set 1
  • set2_cqp = CQP query expression for set 2
  • [+] groupby = Positional and/or structural attributes according to whose values to group the results
  • [+] set1_corpus = Corpus names in uppercase for set 1
  • [+] set2_corpus = Corpus names in uppercase for set 2
  • [?] max = The maximum number of results
  • [?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora

The command may be used to compare two different queries (or the same query) on two different sets of corpora (or the same set) as long as both sets of corpora have the attributes listed in the groupby parameter.

Returns:

  • average: average value for log-likelihood
  • loglike
    • [+] value: Log-likelihood for the value of the groupby attributes.
  • set1
    • [+] value: Absolute frequency of value in set 1
  • set2
    • [+] value: Absolute frequency of value in set 2

If the parameter groupby contains more than one attribute name, the values above have them separated by slashes (value1/value2/…).

Examples:
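
For orientation, the following sketch computes the commonly used two-corpus log-likelihood statistic from the absolute frequencies of a value in the two sets and the total frequencies of the sets. Whether the Korp Web service computes its values in exactly this way (for example, regarding sign or the totals used) is an assumption not confirmed by this document; the function name is hypothetical.

```python
from math import log


def log_likelihood(freq1, total1, freq2, total2):
    """Two-corpus log-likelihood for a value with frequencies freq1 and freq2.

    total1 and total2 are the total frequencies in set 1 and set 2.
    """
    # Expected frequencies if the value were equally common in both sets.
    expected1 = total1 * (freq1 + freq2) / (total1 + total2)
    expected2 = total2 * (freq1 + freq2) / (total1 + total2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero frequency contributes nothing
            ll += observed * log(observed / expected)
    return 2 * ll


print(log_likelihood(10, 1000, 10, 1000))  # equal relative frequencies
print(log_likelihood(20, 1000, 5, 1000))   # clearly different frequencies
```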

Word picture

Word picture table

Retrieve the most frequent dependency relations in which a lemgram or word form occurs.

Parameters:

  • command = relations
  • [+] corpus = Corpus name in uppercase
  • word = The lemgram or word form to search
  • [?] type = Search type: word (word form; the default) or lemgram
  • [?] min = Minimum frequency to be shown
  • [?] max = The maximum number of results (0 = no limit)
  • [?] incremental = true: Return results incrementally (as soon as the results for each corpus are ready) for a search from multiple corpora

Returns:

  • relations: An array of relations with the following properties:
    • source: List of sources, which are strings of the form CORPUS:id, where CORPUS is a corpus id and id an internal relation id; to be used as an input parameter to the command relations_sentences
    • dep: Dependent lemgram (or word form)
    • depextra: Dependent prefix (not used in Finnish corpora)
    • deppos: Dependent part of speech (mapped to the SUC2 tagset)
    • freq: Number of occurrences
    • head: Head lemgram (or word form)
    • headpos: Head part of speech (mapped to the SUC2 tagset)
    • mi: Lexicographer’s mutual information value
    • rel: Dependency relation (mapped to the Swedish treebank dependency labels)

Word picture hits

Retrieve the sentences in which a dependency relation occurs. The dependency relation is often from the word picture.

Parameters:

  • command = relations_sentences
  • [+] source = Strings of the form CORPUS:id, where CORPUS is a corpus id and id an internal relation id, as returned in the source value by the command relations
  • head = The lemgram of the head word
  • rel = Dependency relation
  • [?] dep = Dependent lemgram
  • [?] depextra = Dependent prefix (not used in Finnish corpora)
  • [?] start = The number of the first hit to include in the concordance (starting from 0)
  • [?] end = The number of the last hit to include in the concordance

Returns:

The command returns a structure of the same type as the basic KWIC concordance returned by query.
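
The two word picture commands are meant to be chained: a relation object returned by relations provides the parameter values for relations_sentences. The sketch below builds such a follow-up URL. The sample relation, including its relation label and lemgrams, is invented, and the function name relation_hits_url is hypothetical.

```python
from urllib.parse import urlencode

KORP_URL = "https://korp.csc.fi/cgi-bin/korp.cgi"


def relation_hits_url(relation, start=0, end=9):
    """Build a relations_sentences URL for a relation returned by relations."""
    params = {
        "command": "relations_sentences",
        # The list of CORPUS:id strings is passed as one comma-separated value.
        "source": ",".join(relation["source"]),
        "head": relation["head"],
        "rel": relation["rel"],
        "start": start,
        "end": end,
    }
    if relation.get("dep"):
        params["dep"] = relation["dep"]  # optional parameter
    return KORP_URL + "?" + urlencode(params)


# Invented sample relation in the shape described above.
sample_relation = {
    "source": ["EXAMPLE_CORPUS:123"],
    "head": "nukkua..vb.1",
    "dep": "kissa..nn.1",
    "rel": "SS",
    "freq": 7,
}
hits_url = relation_hits_url(sample_relation)
print(hits_url)
```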

Download KWIC search results

The KWIC concordance search results can be downloaded (exported) in various formats using the Web service at

https://korp.csc.fi/cgi-bin/korp_download.cgi

The main parameters the service takes are the following:

  • [?] query_params = The parameters to korp.cgi command query for generating a KWIC result; if specified, korp.cgi is called to generate the result.
  • [?] query_result = The Korp query result (in JSON) to format; overrides query_params
  • [?] format = The format to which to convert the result; default: json (JSON). See below for further information.
  • [?] filename_format = A format specification (template) for the (suggested) name of the file to generate; may contain the following format keys: {cqpwords} (the words in the CQP query), {start} (the number of the first hit), {end} (the number of the last hit), {date} (the date of the query as yyyymmdd), {time} (the time of the query as hhmmss), {ext} (file name extension based on the format); default: korp_kwic_{cqpwords}_{date}_{time}.{ext}
  • [?] filename = The (suggested) name of the file to generate; overrides filename_format.

The service can either take the query results returned by the main Korp Web service command query in the parameter query_result or take the parameters for the command query, pass them to the main Web service and use the query results it returns. If neither query_params nor query_result is specified, the service assumes that the parameters contain parameters for the main Korp Web service command query to perform a CQP query.

The result is formatted according to format and possible additional parameters for the format.

Formats

The currently supported values for the format parameter are

  • json = JSON (default): The original JSON format returned by the main Korp Web service
  • nooj = NooJ format
  • annot (= tokens) = Linguistic annotations in a tabular format: line per token
  • sentences = Sentence per row, the word forms of the sentence in one column and each text attribute in its own column, and query information at the end of the file (but see below for a variant)
  • ref (= bibref) = Bibliographical reference in a tabular format: the whole sentence on one line and metadata information on the following lines

The format sentences has a variant which has the information on the whole result repeated for each row, instead of only once at the end of the file, and a field containing all the lemmas of a sentence. This variant is selected by giving the value lemmas-resultinfo to the parameter subformat. You can further customize the format with the parameters described further below.

The tabular formats annot, sentences and ref are usually followed by a comma and a physical format that specifies how the tabular data is represented:

  • tsv = TSV (tab-separated values) (default): fields (columns) separated by tabs, field values not quoted, Unix-style line endings (LF)
  • csv = CSV (comma-separated values): fields separated by commas, all field values in double quotes (also numeric values), literal double quotes doubled, DOS/Windows-style line endings (CR+LF)
  • xls = Excel 97 XLS spreadsheet

Examples:
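
As a sketch, a download URL can be built by passing the query parameters directly to the download service and combining the logical and physical format with a comma. This relies on the behaviour described above for requests with neither query_params nor query_result. The corpus id and the query are placeholders, and the function name download_url is hypothetical.

```python
from urllib.parse import urlencode

DOWNLOAD_URL = "https://korp.csc.fi/cgi-bin/korp_download.cgi"


def download_url(query_params, logical_format, physical_format=None):
    """Build a download URL from query parameters and a format."""
    fmt = logical_format
    if physical_format is not None:
        # For example "sentences,csv".
        fmt = f"{logical_format},{physical_format}"
    params = dict(query_params, format=fmt)
    # safe="," keeps the comma in the format value unescaped.
    return DOWNLOAD_URL + "?" + urlencode(params, safe=",")


query = {
    "command": "query",
    "corpus": "EXAMPLE_CORPUS",  # placeholder corpus id
    "cqp": '[word = "kissa"]',
    "start": 0,
    "end": 9,
}
csv_url = download_url(query, "sentences", "csv")
print(csv_url)
```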

The physical format of TSV and CSV can be customized via the following parameters:

  • delimiter = the string separating fields (columns)
  • newline = the end-of-line character(s) (literally; does not accept C-style escape sequences)
  • quote = the quote character enclosing field values
  • replace_quote = the character(s) with which to replace quote characters in field values
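
With the default settings, the downloaded TSV and CSV data can be read with Python's standard csv module. The sketch below parses two hand-written samples that follow the default physical formats described above; the field names and values are illustrative only and do not reproduce an actual download.

```python
import csv
import io

# CSV: all values in double quotes, literal quotes doubled, CR+LF line ends.
csv_sample = '"corpus_name","tokens"\r\n"EXAMPLE_CORPUS","Hän sanoi ""kissa"" ."\r\n'
csv_rows = list(csv.reader(io.StringIO(csv_sample, newline="")))
print(csv_rows)

# TSV: tab-separated, values not quoted, LF line ends.
tsv_sample = "hit_num\ttokens\n0\tPieni kissa nukkuu .\n"
tsv_rows = list(csv.reader(io.StringIO(tsv_sample, newline=""),
                           delimiter="\t", quoting=csv.QUOTE_NONE))
print(tsv_rows)
```

If the download was requested with non-default delimiter, quote or newline values, the corresponding csv.reader options need to be adjusted to match.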

The annot, sentences and ref formats use the parameters structs and attrs to specify which structural and positional attributes should be shown in the result. They take a comma-separated list of the following values:

  • name = Show attribute name.
  • * = Show all attributes listed in the Korp Web service parameter show_struct for structs and show for attrs.
  • + = As above, but show only those that actually occur in the corpora from which the results come.
  • -name = Omit attribute name; used after * or + to omit an attribute.
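
The meaning of these values can be illustrated with a small resolver that expands such a list into the attributes to show. This is only one reading of the rules above, written for illustration, and not the implementation used by the service; the function name resolve_attrs and the sample attribute sets are hypothetical.

```python
def resolve_attrs(spec, shown, occurring):
    """Expand a structs/attrs value into a list of attribute names.

    spec: comma-separated value of the parameter (e.g. "*,-msd")
    shown: attributes listed in show or show_struct of the query
    occurring: attributes that actually occur in the result corpora
    """
    result = []
    for item in spec.split(","):
        if item == "*":
            candidates = list(shown)
        elif item == "+":
            candidates = [name for name in shown if name in occurring]
        elif item.startswith("-"):
            # Omit an attribute added earlier in the list.
            result = [name for name in result if name != item[1:]]
            continue
        else:
            candidates = [item]
        # Keep the order and avoid duplicates.
        result.extend(name for name in candidates if name not in result)
    return result


shown = ["lemma", "pos", "msd"]
occurring = {"lemma", "msd"}
print(resolve_attrs("*,-msd", shown, occurring))
print(resolve_attrs("+", shown, occurring))
```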

The tabular formats can be customized with a number of formatting parameters, including the following:

  • sentence_fields = a comma-separated list of the names of the fields of sentences to display; available values include:
    • hit_num = the number of the hit across all pages of hits for the query (zero-based)
    • sentence_num = the number of the sentence in this file (zero-based)
    • corpus_name = name of the corpus
    • tokens = all tokens (typically, word forms) of the sentence, separated by a string that can be changed via the parameter token_sep
    • left_context = the tokens of the sentence to the left of the match, formatted in the same way as tokens
    • match = the tokens of the sentence that are a part of the match, formatted as tokens
    • right_context = the tokens of the sentence to the right of the match, formatted as tokens
    • attrs_type = the positional attribute attr for each of the tokens of the sentence, formatted as tokens, where attrs is a pluralized form of attribute name attr (note that attr needs to be listed in the parameter sentence_token_attrs) and type is one of all (all tokens of the sentence), left_context, match or right_context; for example, lemmas_all is a list of the lemma attribute of all the tokens of a sentence and poses_match is a list of the pos attribute of the tokens of the match
    • aligned = for parallel corpora, the tokens of the sentence aligned with the match; use ?aligned to include the field only if the result contains aligned sentences
    • structs = all the structural attributes of the sentence (and containing elements), each formatted by default as name: value and separated by semicolons
    • *structs = expanded to all the structural attributes of the sentence (and containing elements), each in its own field
    • struct_attr = the value of the structural attribute struct_attr for the sentence
    • params = Korp query parameters, formatted as name=value, separated by semicolons
    • date = the date and time of the query in the format YYYY-MM-DD hh:mm:ss
    • urn = URN of the corpus
    • metadata_linktype = link to metadata, where linktype is one of urn (bare URN), url (URL) or link (the URN as a URL if URN is available, otherwise the URL)
    • licence_linktype = link to licence information for the corpus (linktype as above)
    • licence_name = the name of corpus licence

    Note that values for the corpus information fields urn, metadata_linktype, licence_linktype and licence_name are currently not available in the API for all corpora. In sentence_fields, the field names may be prefixed with a question mark ? to include the field in the result only if any of the corpora in the result has a value for the field.

  • match_open = the string to be added to (by default, before) the first token of the match in the sentence fields tokens, match, attrs_all and attrs_match (default: the empty string)
  • match_close = the string to be added to (by default, after) the last token of the match (in the sentence fields as for match_open) (default: the empty string)

For example, the default value for sentence_fields in the format sentences is corpus,?urn,?metadata_link,?licence_name,?licence_link,match_pos,left_context,match,right_context,?aligned,*structs, whereas that for the variant lemmas-resultinfo is hit_num,corpus,tokens,lemmas_all,?aligned,*structs,?urn,?metadata_link,?licence_name,date,hitcount,?korp_url,params.


Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4140599 / +358 29 4129317