The N-grams of the Newspaper and Periodical Corpus of the National Library of Finland

The National Library of Finland has digitized a large proportion of Finland’s Finnish and Swedish newspapers, magazines, and periodicals published between 1820 and 2000 (Finnish) and between 1770 and 1940 (Swedish). This resource contains sets of unigrams, bigrams and trigrams extracted from a corpus that has been compiled from the digitized newspapers by the University of Helsinki.

The resource consists of plain-text files encoded in UTF-8, each containing a list of n-grams ordered by frequency from highest to lowest. Each line in a file consists of two or more fields separated by a whitespace character. The first field gives the absolute frequency of a unique n-gram, and the remaining fields contain the tokens (strings of non-whitespace characters) of the n-gram itself. Letter case has been preserved: uppercase letters have not been converted to lowercase. Punctuation characters are treated as separate tokens except when they are part of an abbreviation (“etc.”, “mm.”) or when they separate a case ending or an enclitic from an abbreviation or a sign (“EU:ssa”, “%:iin”), as per the typographic principles of standard Finnish. The n-grams have been computed across sentence boundaries for each decade (from the 1820s to the 2000s for Finnish and from the 1770s to the 1940s for Swedish) as well as for the entire corpus, with unigrams, bigrams and trigrams in separate files.
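The line format described above can be read with a few lines of Python. The following is a minimal sketch, not an official tool of the resource: the function names `parse_ngram_lines` and `fold_case` are hypothetical, and the sample lines are made up for illustration rather than taken from the actual files. It assumes only what is stated above, namely that the first whitespace-separated field is the frequency and the remaining fields are the tokens.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

NgramEntry = Tuple[int, Tuple[str, ...]]


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[NgramEntry]:
    """Yield (frequency, tokens) pairs from lines of an n-gram file."""
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        try:
            frequency = int(fields[0])
        except ValueError:
            continue  # skip lines whose first field is not a number
        yield frequency, tuple(fields[1:])


def fold_case(entries: Iterable[NgramEntry]) -> Counter:
    """Sum frequencies of n-grams that differ only in letter case."""
    totals: Counter = Counter()
    for frequency, tokens in entries:
        totals[tuple(token.lower() for token in tokens)] += frequency
    return totals


# Made-up bigram lines in the described format (illustration only).
sample = [
    "500 on ,",
    "120 On ,",
    "45 EU:ssa on",
    "",
]

entries = list(parse_ngram_lines(sample))
folded = fold_case(entries)
```

To process a downloaded file, the same function can be given an open file object, for example one opened with `open(path, encoding="utf-8")`. Because the resource keeps the original letter case, a step such as `fold_case` is needed only if case-insensitive counts are wanted.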

Since the source material has been digitized by means of optical character recognition (OCR), the resource also contains erroneous word forms and non-word strings of characters. Furthermore, due to the long time span of the original corpus, the resource contains several lexical items and spelling variants that have since become obsolete in standard Finnish and standard Swedish.

The resource will be updated in the future as improvements are made to the source material.

The data is derived from The Newspaper and Periodical Corpus of the National Library of Finland

Latest versions/subcorpora:
The Finnish N-grams 1820-2000 of the Newspaper and Periodical Corpus of the National Library of Finland
Metadata and license
Attribution instructions
Download the resource
The Swedish N-grams 1770-1940 of the Newspaper and Periodical Corpus of the National Library of Finland
Metadata and license
Attribution instructions
Download the resource
Search for these versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool, or they are offered by another member organisation of FIN-CLARIN. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021091407
