Corpus of Contemporary American English (COCA) -- Kielipankki downloadable versions

IPR holder: Prof. Mark Davies, Brigham Young University

Please read the end-user licence of the corpus in the file LICENCE.txt.

More information on this corpus version is available on META-SHARE:
http://urn.fi/urn:nbn:fi:lb-2017061923

The data is available in four formats, each in its own subfolder:

db/ - Relational database as text:
  - Three tables: corpus (tokens), lexicon and sources (the latter two
    are in the shared/ subfolder).
  - Fields separated by tabs.
  - More information: https://www.corpusdata.org/database.asp

wlp/ - Word/Lemma/PoS:
  - One token per line; word, lemma and part of speech as
    tab-separated fields.
  - Texts marked with ##.

text/ - Linear text:
  - The text id and all the tokens of the text on the same line.
  - No lemma or part-of-speech annotations.
  - Punctuation marks separated from words.

vrt/ - VeRticalized Text (VRT):
  - Each token on its own line; texts, paragraphs and sentences marked
    with XML-style tags with attributes.
  - Converted from the WLP format.
  - The input format for Corpus Workbench (and Korp).

In addition, the subfolder shared/ contains the source text information
(metadata) and the lexicon.

The first three formats are as provided by Mark Davies (except that file
names have been altered); more information on them can be found at
https://www.corpusdata.org/formats.asp

For more information on the VRT format in general, please see
https://www.kielipankki.fi/development/korp/corpus-input-format/

Each folder contains one zip file of data files for each of the five
subcorpora: academic, fiction, magazine, newspaper and spoken. The zip
files for the database, Word/Lemma/PoS and linear text formats contain
one file per year, whereas each VRT zip file contains a single file.

Please note that the COCA extension files covering 2012-2015 are
incorporated in the same zip files. (The extension file for 2012 is
marked as "2012b".)

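
As an illustration of the Word/Lemma/PoS layout described above, the
following is a minimal Python sketch of a reader. It is not part of the
corpus distribution. It assumes that each token line has exactly three
tab-separated fields (word, lemma, part of speech) and that a line
starting with ## begins a new text, with the text id following the ##
marker. The names Token and read_wlp are hypothetical. Please check the
assumptions against the actual files before relying on the sketch.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Token(NamedTuple):
    text_id: Optional[str]  # id of the enclosing text, if one was seen
    word: str
    lemma: str
    pos: str


def read_wlp(lines: Iterable[str]) -> Iterator[Token]:
    """Yield tokens from WLP-style lines.

    Assumptions (not confirmed by the corpus documentation):
    - a line starting with "##" opens a new text and the first item
      after the marker is the text id;
    - every other non-empty line is word<TAB>lemma<TAB>pos.
    """
    text_id: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("##"):
            parts = line[2:].split()
            text_id = parts[0] if parts else None
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed layout
        word, lemma, pos = fields
        yield Token(text_id, word, lemma, pos)


if __name__ == "__main__":
    sample = ["##1001\n", "The\tthe\tat\n", "cats\tcat\tnn2\n"]
    for token in read_wlp(sample):
        print(token)
```

The function takes any iterable of lines, so an open file object can be
passed directly; the sample ids and tags in the usage part are made up
for the example.
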
Please note that 10 out of every 200 words in the data have been
replaced with @ characters to comply with the US Fair Use Law; see
https://www.corpusdata.org/limitations.asp

The original corpus data is searchable at https://corpus.byu.edu/coca/
and a Kielipankki version of the corpus data is searchable in Korp at
http://urn.fi/urn:nbn:fi:lb-2017061933

For further information, please contact kielipankki@csc.fi.
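
Because 10 out of every 200 words are replaced, roughly 5 % of the
tokens in a text can be expected to be masked, and word sequences
(such as n-grams) should not be counted across a masked position. The
following Python sketch shows one way of handling this. It is only an
illustration: it assumes that each replaced word appears as a single
token consisting of exactly "@", and the names MASK, masked_fraction
and unmasked_spans are hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence

# Assumption: every replaced word is a single token that is exactly "@".
MASK = "@"


def masked_fraction(tokens: Sequence[str]) -> float:
    """Return the share of tokens equal to the mask token."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok == MASK) / len(tokens)


def unmasked_spans(tokens: Iterable[str]) -> Iterator[List[str]]:
    """Yield maximal runs of consecutive tokens that contain no mask."""
    span: List[str] = []
    for tok in tokens:
        if tok == MASK:
            if span:
                yield span
                span = []
        else:
            span.append(tok)
    if span:
        yield span


if __name__ == "__main__":
    tokens = "a b @ @ c d e @ f".split()
    print(masked_fraction(tokens))
    print(list(unmasked_spans(tokens)))
```

Counting n-grams within each span separately avoids treating the words
on either side of a masked stretch as adjacent.
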