Corpus of Global Web-Based English (GloWbE) - Kielipankki downloadable versions

IPR holder: Prof. Mark Davies, Brigham Young University

Please read the end-user licence of the corpus in the file LICENCE.txt.

More information on this corpus version is available on META-SHARE:
http://urn.fi/urn:nbn:fi:lb-2017061929

The data is available in four formats, each in its own subfolder:

db/ - Relational database as text:
  - Three tables: corpus (tokens), lexicon and sources (the latter two
    in the shared/ subfolder).
  - Fields separated by tabs.
  - More information: https://www.corpusdata.org/database.asp

wlp/ - Word/Lemma/PoS:
  - One token per line; word, lemma and part of speech as
    tab-separated fields.
  - Texts marked with ##.

text/ - Linear text:
  - The text id and all the tokens of the text on the same line.
  - No lemma or part-of-speech annotations.
  - Punctuation marks separated from words.

vrt/ - VeRticalized Text (VRT):
  - Each token on its own line; texts, paragraphs and sentences marked
    with XML-style tags with attributes.
  - Converted from the WLP format.
  - The input format for Corpus Workbench (and Korp).

In addition, the subfolder shared/ contains the source text information
(metadata) and lexicon.

The first three formats are as provided by Mark Davies (except that file
names have been altered); more information on them can be found at
https://www.corpusdata.org/formats.asp

For more information on the VRT format in general, please see
https://www.kielipankki.fi/development/korp/corpus-input-format/

Each folder contains a zip file for each country, identified by a
two-letter country code, except that GB and US have a separate file for
each genre (blog and general) because of their size. Each zip file
contains one or more files per genre.

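As an illustration of the wlp/ layout described above, the following
Python sketch groups token lines into texts. It assumes that a line
beginning with ## starts a new text and carries the text id, and that
every other line holds word, lemma and part of speech separated by tabs.
The function name parse_wlp and the sample tokens and tags are made up
for this example and are not taken from the corpus; please check the
actual files and https://www.corpusdata.org/formats.asp before relying
on it.

```python
import io


def parse_wlp(lines):
    """Group WLP token lines into texts keyed by text id.

    Assumption: a line starting with "##" opens a new text and the
    characters after "##" (up to the first tab) are the text id.
    """
    texts = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        if line.startswith("##"):
            current = line[2:].split("\t")[0].strip()
            texts.setdefault(current, [])
            continue
        fields = line.split("\t")
        if current is None or len(fields) < 3:
            continue  # ignore lines that do not fit the assumed layout
        word, lemma, pos = (field.strip() for field in fields[:3])
        texts[current].append((word, lemma, pos))
    return texts


# Hypothetical sample data, for illustration only.
sample = "##12345\nThe\tthe\tat\ncats\tcat\tnn2\n"
texts = parse_wlp(io.StringIO(sample))
```

The function accepts any iterable of lines, so an open file object from
an extracted zip file could be passed in place of the in-memory sample.
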
Please note that 10 out of every 200 words in the data have been
replaced with @ characters to comply with the US Fair Use Law; see
https://www.corpusdata.org/limitations.asp

The original corpus data is searchable at https://corpus.byu.edu/glowbe/
and a Kielipankki version of the corpus data is searchable in Korp at
http://urn.fi/urn:nbn:fi:lb-2017061935

For further information, please contact kielipankki@csc.fi.
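
Regarding the @ replacement noted above: 10 words out of 200 amounts to
about 5 % of the tokens. The Python sketch below computes that share for
a list of tokens. It assumes that each replaced word appears as a single
token consisting of one @ character, which should be verified against
the data; the function name masked_share and the sample tokens are made
up for this example.

```python
def masked_share(tokens):
    """Return the fraction of tokens that are the placeholder "@"."""
    if not tokens:
        return 0.0
    masked = sum(1 for token in tokens if token == "@")
    return masked / len(tokens)


# Hypothetical sample: 190 ordinary tokens followed by 10 replaced ones.
sample_tokens = ["word"] * 190 + ["@"] * 10
share = masked_share(sample_tokens)
```

Frequency counts computed from the downloadable data may need to take
this share into account when compared with the searchable online
versions.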