Corpus of Contemporary American English - Kielipankki download version 2020

shortname: coca-dl-2020

metadata: http://urn.fi/urn:nbn:fi:lb-2022102101

IPR holder: Prof. Mark Davies, Brigham Young University

license: CLARIN RES
The complete license is available at http://urn.fi/urn:nbn:fi:lb-2017072503

A copy of the license is included in LICENSE.txt. The license details
may be subject to change, so before downloading the resource, please
refer to the latest version of the license at the above link.

The corpus is available in three versions (text per line, relational
database, token per line), each in its own archive file:

coca-dl-2020-text.zip
- text files (text_*.txt): each text on its own line
- each line: @@textID followed by space-separated tokens (words,
  punctuation marks)
- lexicon.txt
- sources.txt
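
As a rough illustration of the text-per-line format, the following
Python sketch splits one line into its text ID and token list. The
function name and the sample line are made up for this example, and the
sketch assumes that the @@textID marker is separated from the first
token by a space.

```python
def parse_text_line(line):
    """Split one line of a text_*.txt file into (text_id, tokens).

    Assumption: the line starts with an @@textID marker, followed by
    a space and then the space-separated tokens of the text.
    """
    head, _, rest = line.strip().partition(" ")
    if not head.startswith("@@"):
        raise ValueError("line does not start with an @@textID marker")
    # Drop the leading "@@" and split the remainder on whitespace.
    return head[2:], rest.split()


# Made-up sample line, not taken from the corpus.
sample = "@@12345 This is a sample text , with tokens ."
text_id, tokens = parse_text_line(sample)
```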

coca-dl-2020-db.zip
- three database tables as text with tab-separated fields
- text files (db_*.txt): textID tokenID wordID
- lexicon.txt
- sources.txt
- more information: https://www.corpusdata.org/database.asp

coca-dl-2020-wlp.zip
- text files: each token on its own line
- @@textID, Word, Lemma, PoS (tab-separated)
- lexicon.txt
- sources.txt

Each archive contains the *same* lexicon and sources files:
- lexicon.txt: wordID word lemma PoS (tab-separated)
- sources.txt: textID ... (tab-separated)
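
The field lists above suggest that a row of the database version
(textID, tokenID, wordID) can be resolved to a word form by looking up
its wordID in lexicon.txt. The following Python sketch shows one way to
do this. The function names and sample rows are hypothetical, and the
sketch assumes that the files contain no header lines and that the
fields appear in the order listed above; both should be checked against
the actual files.

```python
def load_lexicon(lines):
    """Map wordID -> (word, lemma, PoS) from tab-separated rows."""
    lexicon = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or incomplete rows
        word_id, word, lemma, pos = fields[:4]
        lexicon[word_id] = (word, lemma, pos)
    return lexicon


def resolve_db_rows(lines, lexicon):
    """Yield (textID, tokenID, word, lemma, PoS) for each db row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        text_id, token_id, word_id = fields[:3]
        # Unknown wordIDs resolve to None values instead of failing.
        word, lemma, pos = lexicon.get(word_id, (None, None, None))
        yield text_id, token_id, word, lemma, pos


# Made-up sample rows, not taken from the corpus.
lexicon_sample = ["1\tthe\tthe\tat\n", "2\tDogs\tdog\tnn2\n"]
db_sample = ["500\t1\t1\n", "500\t2\t2\n"]

lexicon = load_lexicon(lexicon_sample)
rows = list(resolve_db_rows(db_sample, lexicon))
```

With real data, the sample lists would be replaced by open file
objects, for example open("lexicon.txt", encoding=...), using whatever
encoding the files turn out to have.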

The files are as provided by Mark Davies; more information on them can
be found at https://www.corpusdata.org/formats.asp

The corpus consists of files in eight genres (academic, blogs,
fiction, magazine, newspaper, spoken, TV/movies, web pages). Six of
the genres are stored by year (1990-2019) in 30 files each, while
blogs and web pages are stored in 34 consecutively numbered files
(01-34).

Please note that in every 200 words of the data, 10 words have been
replaced with @ characters to comply with the US Fair Use Law; see
https://www.corpusdata.org/limitations.asp
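
Ten words in every 200 amounts to 10 / 200 = 5 % of the tokens. The
following Python sketch estimates that share for a list of tokens. It
assumes that each replaced word appears as a token consisting only of @
characters, which should be verified against the data; the function
names and the sample tokens are made up for this example.

```python
def is_masked(token):
    """Return True if the token consists only of @ characters."""
    return token != "" and set(token) == {"@"}


def masked_share(tokens):
    """Return the fraction of tokens that are masked (0.0 if empty)."""
    if not tokens:
        return 0.0
    return sum(is_masked(t) for t in tokens) / len(tokens)


# Made-up sample: 190 ordinary tokens followed by 10 masked ones.
tokens = ["word"] * 190 + ["@"] * 10
share = masked_share(tokens)
```

Frequency counts and n-gram statistics computed from the download
version may need to take the masked tokens into account.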

The original corpus data is searchable at https://corpus.byu.edu/coca/

For further information, please contact fin-clarin@helsinki.fi .