The Coronavirus Corpus (Mark Davies, english-corpora.org) - Kielipankki version 2021-05, source

shortname: coronavirus-ecorg-2021-05-src
metadata: http://urn.fi/urn:nbn:fi:lb-2022111701
IPR holder: Prof. Mark Davies, Professor of Linguistics (retired)
license: CLARIN RES

The complete license is available at
http://urn.fi/urn:nbn:fi:lb-2022111703
A copy of the license is included in LICENSE.txt.

The license details may be subject to change, so before downloading
the resource, please refer to the latest version of the license at the
above link.

The corpus is available in three versions (text per line, token per
line, relational database), each in its own archive file:

coronavirus-ecorg-2021-05-src-text.zip
  - text files (text_*.txt): each text on its own line
    - @@textID, space-separated tokens (words, punctuation marks)
  - lexicon.txt, lexicon-21-04.txt, lexicon-21-05.txt
  - sources.txt, sources-21-04.txt, sources-21-05.txt

coronavirus-ecorg-2021-05-src-db.zip
  - three database tables as text with tab-separated fields
    - text files (db_*.txt): textID tokenID wordID
    - lexicon.txt, lexicon-21-04.txt, lexicon-21-05.txt
    - sources.txt, sources-21-04.txt, sources-21-05.txt
  - https://www.corpusdata.org/database.asp

coronavirus-ecorg-2021-05-src-wlp.zip
  - text files: each token on its own line
    - @@textID, Word, Lemma, PoS (tab-separated)
  - lexicon.txt, lexicon-21-04.txt, lexicon-21-05.txt
  - sources.txt, sources-21-04.txt, sources-21-05.txt

Each archive contains the *same* lexicon and sources files:
  - lexicon*.txt: wordID word lemma PoS (tab-separated)
  - sources*.txt: textID ... (tab-separated)

The files are as provided by Mark Davies; more information on them can
be found at https://www.corpusdata.org/formats.asp

Please note that in the data, 10 words out of every 200 have been
replaced with @ characters to comply with US fair use law; see
https://www.corpusdata.org/limitations.asp

The original corpus data is searchable at
https://www.english-corpora.org/corona/

For further information, please contact fin-clarin@helsinki.fi.
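
Example: reading the text and database formats

The following Python sketch illustrates how the formats described above
might be read. It is not part of the original distribution. It assumes
that each line of a text_*.txt file begins with "@@" immediately
followed by the textID, that lexicon and db_* rows have exactly the
tab-separated fields listed above, and that each masked word appears as
a single "@" token. The function names and the sample values (IDs,
words, PoS tags) are invented for illustration; check them against the
actual files and https://www.corpusdata.org/formats.asp before relying
on them.

```python
def parse_text_line(line):
    """Split one line of a text_*.txt file into (text_id, tokens).

    Assumption: the line starts with "@@<textID>" followed by
    space-separated tokens.
    """
    parts = line.split()
    if not parts or not parts[0].startswith("@@"):
        raise ValueError("line does not start with an @@textID marker")
    return parts[0][2:], parts[1:]


def count_masked(tokens):
    """Count tokens that are a bare "@" (assumed to mark a masked word)."""
    return sum(1 for token in tokens if token == "@")


def load_lexicon(lines):
    """Map wordID -> (word, lemma, PoS) from tab-separated lexicon lines.

    IDs are kept as strings, so a header row (if the real file has one)
    is stored as an ordinary, harmless entry instead of raising an error.
    """
    lexicon = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        word_id, word, lemma, pos = fields[:4]
        lexicon[word_id] = (word, lemma, pos)
    return lexicon


def join_db_rows(db_lines, lexicon):
    """Yield (textID, tokenID, word, lemma, PoS) for each db_*.txt row.

    wordIDs missing from the lexicon are reported as "?" placeholders.
    """
    for line in db_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        text_id, token_id, word_id = fields[:3]
        word, lemma, pos = lexicon.get(word_id, ("?", "?", "?"))
        yield text_id, token_id, word, lemma, pos


# Invented sample data, shaped like the formats described above.
SAMPLE_TEXT_LINE = "@@1234 Cases rose @ @ this week ."
SAMPLE_LEXICON = ["1\tthe\tthe\tat\n", "2\tcases\tcase\tnn2\n"]
SAMPLE_DB = ["1234\t1\t1\n", "1234\t2\t2\n", "1234\t3\t99\n"]

text_id, tokens = parse_text_line(SAMPLE_TEXT_LINE)
lexicon = load_lexicon(SAMPLE_LEXICON)
rows = list(join_db_rows(SAMPLE_DB, lexicon))
```

With real data, the lists of sample lines would be replaced by open
file objects, for example open("lexicon.txt", encoding="utf-8"); the
encoding is also an assumption to verify against the files.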