Corpus of Contemporary American English - Kielipankki download version 2020

shortname: coca-dl-2020
metadata: http://urn.fi/urn:nbn:fi:lb-2022102101
IPR holder: Prof. Mark Davies, Brigham Young University
license: CLARIN RES

The complete license is available at
http://urn.fi/urn:nbn:fi:lb-2017072503
A copy of the license is included in LICENSE.txt.
The license details may be subject to change, so before downloading
the resource, please refer to the latest version of the license at
the above link.

The corpus is available in three versions (text per line, token per
line, a relational database), each in its own archive file:

coca-dl-2020-text.zip
- text files (text_*.txt): each text on its own line
  - @@textID, space-separated tokens (words, punctuation marks)
- lexicon.txt
- sources.txt

coca-dl-2020-db.zip
- three database tables as text with tab-separated fields
- text files (db_*.txt): textID tokenID wordID
- lexicon.txt
- sources.txt
- https://www.corpusdata.org/database.asp

coca-dl-2020-wlp.zip
- text files: each token is on its own line
  - @@textID, Word, Lemma, PoS (tab-separated)
- lexicon.txt
- sources.txt

Each archive contains the *same* lexicon and sources files:
- lexicon.txt: wordID word lemma PoS (tab-separated)
- sources.txt: textID ... (tab-separated)

The files are as provided by Mark Davies; more information on them
can be found at https://www.corpusdata.org/formats.asp

The corpus consists of files in eight genres (academic, blogs,
fiction, magazine, newspaper, spoken, TV/movies, web pages). Six of
the genres are stored by year (1990-2019) in 30 files each, while
blogs and web pages are stored in 34 consecutively numbered files
(01-34).

Please note that the data has 10 words every 200 words replaced with
@ characters to comply with the US Fair Use Law; see
https://www.corpusdata.org/limitations.asp

The original corpus data is searchable at https://corpus.byu.edu/coca/

For further information, please contact fin-clarin@helsinki.fi.
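
Example: reading the text and database versions

The following is a minimal Python sketch, not part of the original
distribution, of how the file layouts described above might be read.
It assumes that each text line begins with "@@" immediately followed
by the textID and a space, that fields in db_*.txt and lexicon.txt are
separated by single tabs, and that wordIDs are numeric. The function
names and the sample rows (including the IDs and PoS tags) are made up
for illustration; check the real files for header lines and encoding
before relying on this.

```python
import csv
import io


def parse_text_line(line):
    """Split one line of a text_*.txt file into (text_id, tokens).

    Assumption: the line starts with '@@' + textID, followed by
    space-separated tokens.
    """
    head, _, rest = line.strip().partition(" ")
    if not head.startswith("@@"):
        raise ValueError("line does not start with @@textID")
    return head[2:], rest.split()


def load_lexicon(handle):
    """Map wordID -> (word, lemma, PoS) from tab-separated lexicon rows."""
    lexicon = {}
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 4 or not row[0].isdigit():
            continue  # skip header or malformed rows (assumption)
        lexicon[row[0]] = (row[1], row[2], row[3])
    return lexicon


def words_from_db(handle, lexicon):
    """Yield (textID, word) for each db row of textID, tokenID, wordID."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 3:
            continue  # skip short or empty rows (assumption)
        text_id, _token_id, word_id = row[:3]
        word, _lemma, _pos = lexicon.get(word_id, ("<unknown>", "", ""))
        yield text_id, word


# Hypothetical sample data standing in for the real files.
sample_text = "@@1001 The cat sat @ @ @ on the mat ."
text_id, tokens = parse_text_line(sample_text)
masked = sum(1 for token in tokens if token == "@")  # count masked tokens

sample_lexicon = io.StringIO("1\tthe\tthe\tat\n2\tcat\tcat\tnn1\n")
sample_db = io.StringIO("1001\t1\t1\n1001\t2\t2\n")
lexicon = load_lexicon(sample_lexicon)
pairs = list(words_from_db(sample_db, lexicon))
```

With the real data, the io.StringIO objects would be replaced by files
opened from the unpacked archives. Tokens equal to "@" are the masked
words mentioned above, so they are best counted or skipped explicitly
rather than treated as ordinary words.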