Corpus of Historical American English (COHA) -- Kielipankki
downloadable versions

IPR holder: Prof. Mark Davies, Brigham Young University

Please read the end-user licence of the corpus in the file
LICENCE.txt.

More information on this corpus version is available on META-SHARE:
http://urn.fi/urn:nbn:fi:lb-2017061926

The data is available in four formats, each in its own subfolder:

  db/   - Relational database as text:
  - Three tables: corpus (tokens), lexicon and sources (the latter two
    in the shared/ subfolder).
  - Fields separated by tabs.
  - More information: https://www.corpusdata.org/database.asp
  - One corpus file per decade covering all genres.

  wlp/  - Word/Lemma/PoS:
  - One token per line; word, lemma and part of speech as
    tab-separated fields.
  - Each text in its own file: file names contain a genre id (one of
    fic[tion], mag[azine], news[paper] or nf (non-fiction)), year and
    text id.

  text/ - Linear text:
  - The text id and all the tokens of the text on the same line.
  - No lemma or part-of-speech annotations.
  - Punctuation marks separated from words.
  - Each text in its own file; the file names are as in wlp/.

  vrt/  - VeRticalized Text (VRT):
  - Each token on its own line; texts, paragraphs and sentences marked
    with XML-style tags with attributes.
  - Converted from the WLP format.
  - The input format for Corpus Workbench (and Korp).

In addition, the subfolder shared/ contains the source text
information (metadata) and lexicon.

The first three formats are as provided by Mark Davies (except that
file names have been altered); more information on them can be found
at https://www.corpusdata.org/formats.asp
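
As an illustration of the Word/Lemma/PoS layout described above, the
following minimal Python sketch reads token lines into named tuples.
The names Token and read_wlp and the sample lines are hypothetical, and
the sketch assumes exactly three tab-separated fields per token line;
check the actual files before relying on it.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


def read_wlp(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token for each line with three tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            # Assumption: anything else (e.g. an empty line) is skipped.
            continue
        yield Token(*fields)


# Made-up sample lines, not taken from the corpus.
sample = ["The\tthe\tat\n", "dogs\tdog\tnn2\n", "\n"]
tokens = list(read_wlp(sample))
```

To read a real file, an open file object can be passed in place of the
sample list, for example read_wlp(open(path, encoding="utf-8")); the
encoding is likewise an assumption.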

For more information on the VRT format in general, please see
https://www.kielipankki.fi/development/korp/corpus-input-format/
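
A VRT file can be read with a few lines of code. The Python sketch
below groups tokens into sentences and keeps the attributes of the
enclosing text. The element names (text, sentence), the attribute
names in the sample and the function name read_vrt_sentences are
assumptions made for illustration; the actual names should be checked
against the VRT files.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Matches name="value" pairs inside a structural start tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def read_vrt_sentences(
    lines: Iterable[str],
) -> Iterator[Tuple[Dict[str, str], List[Tuple[str, ...]]]]:
    """Yield (text attributes, token tuples) for each sentence."""
    text_attrs: Dict[str, str] = {}
    sentence = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("<text"):
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<sentence"):
            sentence = []
        elif line.startswith("</sentence"):
            if sentence is not None:
                yield text_attrs, sentence
            sentence = None
        elif line.startswith("<") and line.endswith(">"):
            # Other structural tags (e.g. paragraphs) are ignored here.
            continue
        elif line and sentence is not None:
            # Token line: tab-separated positional attributes.
            sentence.append(tuple(line.split("\t")))


# Made-up sample, not taken from the corpus.
sample = [
    '<text id="t1" year="1850" genre="fic">\n',
    "<paragraph>\n",
    "<sentence>\n",
    "The\tthe\tat\n",
    "dogs\tdog\tnn2\n",
    "</sentence>\n",
    "</paragraph>\n",
    "</text>\n",
]
sentences = list(read_vrt_sentences(sample))
```

For querying the corpus rather than just reading it, Corpus Workbench
or Korp is the intended tool; the sketch is only meant for quick
inspection of the files.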

Each folder contains a zip file for the data files of each decade from
the 1810s to the 2000s. The zip files for the Word/Lemma/PoS and linear text
formats contain one file for each source text, the database format has
all the texts in a single file, and the VRT format has one file for
each genre (fiction, magazine, newspaper and non-fiction),
corresponding to the subcorpora in the Korp version of the corpus.

Please note that in every 200 words of the data, 10 words have been
replaced with @ characters to comply with US fair use law; see
https://www.corpusdata.org/limitations.asp
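
Replacing 10 words in every 200 means that about 5% of the tokens are
masked, which matters when computing frequencies. The Python sketch
below measures the masked share of a token list. It assumes that each
removed word appears as a single "@" token, which should be verified
against the data; the function name masked_fraction is hypothetical.

```python
from typing import Iterable


def masked_fraction(tokens: Iterable[str]) -> float:
    """Return the share of tokens that consist of a single "@"."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    masked = sum(1 for token in tokens if token == "@")
    return masked / len(tokens)


# Made-up sample: 190 ordinary tokens followed by 10 masked ones.
sample = ["word"] * 190 + ["@"] * 10
fraction = masked_fraction(sample)
```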

The original corpus data is searchable at https://corpus.byu.edu/coha/
and a Kielipankki version of the corpus data is searchable in Korp at
http://urn.fi/urn:nbn:fi:lb-2017061934

For further information, please contact kielipankki@csc.fi.