Corpus of Historical American English (COHA) -- Kielipankki downloadable versions

IPR holder: Prof. Mark Davies, Brigham Young University

Please read the end-user licence of the corpus in the file LICENCE.txt.

More information on this corpus version is available on META-SHARE:
http://urn.fi/urn:nbn:fi:lb-2017061926

The data is available in four formats, each in its own subfolder:

db/ - Relational database as text:
  - Three tables: corpus (tokens), lexicon and sources (the latter two
    in the shared/ subfolder).
  - Fields separated by tabs.
  - More information: https://www.corpusdata.org/database.asp
  - One corpus file per decade covering all genres.

wlp/ - Word/Lemma/PoS:
  - One token per line; word, lemma and part of speech as tab-separated
    fields.
  - Each text in its own file: file names contain a genre id (one of
    fic[tion], mag[azine], news[paper] or nf (non-fiction)), year and
    text id.

text/ - Linear text:
  - The text id and all the tokens of the text on the same line.
  - No lemma or part-of-speech annotations.
  - Punctuation marks separated from words.
  - Each text in its own file; the file names are the same as in wlp/.

vrt/ - VeRticalized Text (VRT):
  - Each token on its own line; texts, paragraphs and sentences marked
    with XML-style tags with attributes.
  - Converted from the WLP format.
  - The input format for Corpus Workbench (and Korp).

In addition, the subfolder shared/ contains the source text information
(metadata) and lexicon.

The first three formats are as provided by Mark Davies (except that file
names have been altered); more information on them can be found at
https://www.corpusdata.org/formats.asp

For more information on the VRT format in general, please see
https://www.kielipankki.fi/development/korp/corpus-input-format/

Each folder contains a zip file for the data files of each decade from
the 1810s to the 2000s.
The zip files for the Word/Lemma/PoS and linear text formats contain one
file for each source text, the zip files for the database format have
all the texts in a single file, and the zip files for the VRT format
have one file for each genre (fiction, magazine, newspaper and
non-fiction), corresponding to the subcorpora in the Korp version of the
corpus.

Please note that in the data, 10 words out of every 200 have been
replaced with @ characters to comply with the US Fair Use Law; see
https://www.corpusdata.org/limitations.asp

The original corpus data is searchable at https://corpus.byu.edu/coha/
and a Kielipankki version of the corpus data is searchable in Korp at
http://urn.fi/urn:nbn:fi:lb-2017061934

For further information, please contact kielipankki@csc.fi.
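
As an illustration of the Word/Lemma/PoS format described above, the
following Python sketch reads tab-separated token lines and reports the
share of tokens that have been replaced with @. It is not part of the
corpus distribution. The function names read_wlp and masked_share are
hypothetical, and the sketch assumes that each token line has exactly
three fields and that a replaced token has the single character @ in its
word field; please check both assumptions against the actual files.

```python
import io


def read_wlp(lines):
    """Yield (word, lemma, pos) tuples from WLP-style lines.

    Assumption: each token line has exactly three tab-separated fields.
    Lines with any other number of fields are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # blank or unexpected line
        yield tuple(fields)


def masked_share(tokens):
    """Return the fraction of tokens whose word field is exactly "@".

    Assumption: a replaced word is represented by a lone "@".
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0
    masked = sum(1 for word, _lemma, _pos in tokens if word == "@")
    return masked / len(tokens)


if __name__ == "__main__":
    # Invented sample lines, not taken from the corpus; the tags are
    # placeholders only.
    sample = "The\tthe\tat\n@\t@\tfu\ndog\tdog\tnn1\n"
    tokens = list(read_wlp(io.StringIO(sample)))
    print(len(tokens), masked_share(tokens))
```

For a real file, an open file object (for example one opened from an
extracted wlp/ text file with a suitable encoding) can be passed to
read_wlp in place of the io.StringIO sample.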