The corpus concerned: Corpus of Contemporary American English – Kielipankki Korp version 2020
During the publication process, it was noticed that the data contains 72 texts whose ids do not occur in sources.txt, which means that those texts lack metadata.
Of these, 71 texts are in the subcorpus ’TV and movies’ and 1 text is in the subcorpus ’News’. These texts also lack date information (text_datefrom and text_dateto).
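The check described above can be sketched as a comparison between the text ids found in the data and the ids listed in sources.txt. The following is only a minimal illustration, not the procedure actually used during publication: the function name is hypothetical, and it assumes that sources.txt is tab-separated with the text id in the first column.

```python
def find_ids_without_metadata(text_ids, source_lines):
    """Return the sorted text ids that have no entry in the metadata listing.

    Assumption: each metadata line is tab-separated and its first
    field is the text id. Adjust the parsing if the real layout differs.
    """
    known_ids = set()
    for line in source_lines:
        first_field = line.rstrip("\n").split("\t")[0].strip()
        if first_field:
            known_ids.add(first_field)
    # A set comprehension removes duplicate ids before sorting.
    return sorted({tid for tid in text_ids if tid not in known_ids})


# Made-up ids and metadata lines, for illustration only.
example_ids = ["1", "2", "3"]
example_sources = ["1\tNEWS\t1990\n", "3\tTV\t2001\n"]
print(find_ids_without_metadata(example_ids, example_sources))
```

Texts reported by such a check have no metadata to draw on, so attributes such as text_datefrom and text_dateto cannot be filled in for them.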
Moreover, all the texts in the subcorpora ’Blog’ and ’Web’ are dated to 2012, even though they might actually have been written earlier. The reason is that the Blog and General web page texts are a subset of the texts from the US in the GloWbE corpus. The web pages were collected in December 2012.
More information about the data can be found here: https://www.english-corpora.org/coca/
Last updated: 11.10.2023
This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2023101002