coca-2020: Notes for the user

The corpus concerned: Corpus of Contemporary American English – Kielipankki Korp version 2020

During the publication process, it was noticed that the data contains 72 texts whose ids do not occur in sources.txt, which means that those texts lack metadata.
The affected texts are 71 texts in the subcorpus ‘TV and movies’ and 1 text in the subcorpus ‘News’. They also lack date information (text_datefrom and text_dateto).
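
The check described above can be sketched in Python as follows. This is only a minimal illustration, not the procedure used when the corpus was published: the function names and the example ids are made up, and it assumes that a missing date appears as an absent or empty text_datefrom / text_dateto value, which the notes above do not specify.

```python
from typing import Dict, Iterable, List


def ids_without_metadata(text_ids: Iterable[str], source_ids: Iterable[str]) -> List[str]:
    """Return the text ids that have no matching id in the metadata list."""
    known = set(source_ids)
    # Collect each missing id once and sort for a stable result.
    return sorted({tid for tid in text_ids if tid not in known})


def lacks_date(text_attrs: Dict[str, str]) -> bool:
    """True if either date attribute is absent or empty (assumed representation)."""
    return not text_attrs.get("text_datefrom") or not text_attrs.get("text_dateto")


# Made-up ids, for illustration only; real ids come from the corpus
# data and from sources.txt.
corpus_ids = ["1001", "1002", "1003"]
metadata_ids = ["1001", "1003"]
missing = ids_without_metadata(corpus_ids, metadata_ids)
```

When working with the corpus, a filter such as lacks_date can be used to set the affected texts aside before any analysis that depends on dates.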

Moreover, all texts in the subcorpora ‘Blog’ and ‘Web’ are dated to 2012, even though they may actually have been written earlier. The reason is that the Blog and General web page texts are a subset of the US texts in the GloWbE corpus. The web pages were collected in December 2012.

More information about the data can be found here:

Last updated: 11.10.2023

This page has a persistent identifier:
