Jyväskylä Corpus of Middle French

A digitized corpus for the study of the lexis and syntax of Middle French and for text editions. The corpus consists of 14 documents and 430 000 words. It comprises prose, novels, plays and lyrical poetry from the period 1300-1550.

More information on the corpus can be found here (in Finnish).

Latest versions/subcorpora:
Jyväskylä Corpus of Middle French
Metadata and license
Attribution instructions
Access the corpus in Puhti
Search for all versions in META-SHARE

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023040501

Oulu Corpus

Latest versions/subcorpora:
Oulu Corpus
Metadata and license
Attribution instructions
Apply for access

This version is available via the computing environment Puhti

Search for all versions in META-SHARE

Content

The Oulu Corpus is a research corpus of Standard Finnish in the 1960s. The original material was collected by a group led by Prof. Pauli Saukkonen at the University of Oulu. The original corpus project aimed to collect a representative sample of the Standard Finnish used in 1960s media in order to create a frequency dictionary of Finnish. The annotated text material was converted into SGML format by the Institute for the Languages of Finland in 1997.

The resource is available via the Puhti computing environment. Access rights can be granted for research use by individual application.


Last updated: 10.5.2023

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023040502

The Wikipedia Corpus (Mark Davies, english-corpora.org) – Kielipankki version

This resource contains a copy of the original Wikipedia Corpus, provided by Mark Davies on 4 June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains the full text of Wikipedia from the year 2014, with 1.9 billion words in more than 4.4 million articles. The corpus is related to many other corpora of English, formerly known as the “BYU Corpora”.

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Wikipedia Corpus (Mark Davies, english-corpora.org) – Kielipankki version, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions/subcorpora of this language corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032905

Triangle of Aspects Analysis of Frozen

This data set includes an analysis of the original, English-language version and of the Dutch-language version (as released in the Netherlands) of the nine songs with lyrics from the Disney film Frozen. This analysis employs the triangle of aspects, an analytical model developed specifically for translation research into songs from musical films. The collection of these data is part of the licensor’s PhD project, tentatively titled “Musical, visual and verbal aspects of animated film song dubbing: A case study of Disney’s Frozen” (projected for publication in early 2020). This data set comprises nine PDF files, one for each song, as well as a Word document that summarizes the findings and provides copyright notices.

Note that the resource shall be removed from the CLARIN Service on 21 December 2023. This time limit shall be conveyed to users of the resource upon downloading, and the user shall commit to removing the downloaded resource from his/her devices and other storage facilities governed by the user on or before 21 December 2023.

Latest versions/subcorpora:
Triangle of Aspects Analysis of Frozen
Metadata and license
Attribution instructions
Download the resource
Search for all versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032104

Collection of OTA Texts in Public Use

This is a snapshot of the Oxford Text Archive, for testing purposes. For more up-to-date versions of the archive see http://ota.ox.ac.uk/
The snapshot is available in Kielipankki – the Language Bank of Finland (puhti.csc.fi, /appl/data/kielipankki/ota), see Access rights.

Latest versions/subcorpora:
Collection of OTA Texts in Public Use
Metadata and license
Attribution instructions
Access the corpus in Puhti
Search for all versions in META-SHARE

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032101

UHLCS corpus collection

The University of Helsinki Language Corpus Server (UHLCS) is a multilingual data bank and data server that was located at the Department of General Linguistics, University of Helsinki. In September 2007, the UHLCS was moved to CSC (the Finnish IT Center for Science). The UHLCS, which is maintained by the University of Helsinki, was founded late in 1980. At present, the UHLCS contains computer corpora from more than 50 languages, including samples of minority languages and extensive corpora representing different text types. In 2000, the corpora from the Uralic, Turkic, Tungusic, Mongolic, Chukotko-Kamchatkan, Iranian and North-East Caucasian languages were edited for public use with the financial support of the Max Planck Institute for Evolutionary Anthropology, Leipzig. In summer 2003, the basis for the metadata descriptions of the corpora was prepared with the financial support of the ECHO project (ECHO = European Cultural Heritage Online). The UHLCS also provides tools for analyzing the corpora. The use of most of the corpora is restricted to research and teaching.

The following corpora are available in Kielipankki – the Language Bank of Finland (puhti.csc.fi, access rights instructions).

Latest versions/subcorpora:  

Chuvash Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

English Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Corpus of Erzya and Moksha Mordvin Literature and Journals and Komi Zyrian Literature (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Erzya and Moksha Mordvin Word List Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Estonian Corpus 1 (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Estonian Corpus 2 (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Finnish Corpus (Bibles) (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Finnish Corpus (Literature) (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

The Helsinki Korp Version of the Finland-Swedish Text Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Korp

The Taito Version of the Finland-Swedish Text Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Ingrian Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Khanty Corpus (North Khanty, Corpora and Translations) (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Komi Zyrian Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Latin Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Lude (Ludian) Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Nenets Corpus (Tundra Nenets) (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

North Saami Corpus (Literature) (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

North Saami Corpus (Sámikultuvradoaibmagotti smiehttamush) (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Quantifiers and Quantification in Finnish and Languages Spoken in the Central Volga–Kama Region (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

The Susanne Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Ume Saami Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Uralic, Turkic, Indo-Iranian and Mongol languages; languages of Siberia and Caucasia (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Uzbek-English Dictionary (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Lists of Words Corpus (UHLCS)
Metadata and license
Attribution instructions
Access the corpus in Puhti

Search for all versions in META-SHARE

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023030901

The Intelligent Web Corpus (Mark Davies, english-corpora.org) – Kielipankki version

This resource contains a copy of the original Intelligent Web Corpus (iWeb), provided by Mark Davies on 4 June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 14 billion words in 22 million web pages. The data were collected in 2017 from around 100,000 of the most widely used English-language websites in the world.

The corpus is related to many other corpora of English, formerly known as the “BYU Corpora”.

More information on Mark Davies’ corpora at Kielipankki.

Latest versions/subcorpora:
The Intelligent Web Corpus (Mark Davies, english-corpora.org) – Kielipankki version, source
Metadata and license
Attribution instructions
The corpus will be available soon
Search for all versions in META-SHARE

Different versions/subcorpora of this language corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112310

Psycholinguistic Descriptives

This material comprises a dataset and a query tool for acquiring commonly used psycholinguistic descriptives for Finnish words. The dataset is based on six large corpora from sources such as magazines, newspapers, movie and TV series subtitles, encyclopedia topics and Internet discussions.
The material includes word surface form frequencies, lemma frequencies, syllable frequencies and letter n-gram frequencies. In addition, the query tool can be used to acquire descriptives such as orthographic neighbors for lists of words.
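
To make two of the descriptives named above concrete, here is a minimal Python sketch. It is not the query tool shipped with the resource, and the definitions it uses are assumptions: letter n-grams are counted weighted by word frequency, and an orthographic neighbor is taken to be a word of the same length that differs in exactly one letter position. The word list and its frequencies are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy word list with made-up frequencies (not taken from the dataset).
WORD_FREQ = {"kala": 120, "kana": 80, "kasa": 15, "sana": 200, "talo": 300, "kalat": 40}


def letter_ngram_freqs(word_freq, n):
    """Count letter n-grams, weighting each occurrence by the word's frequency."""
    counts = Counter()
    for word, freq in word_freq.items():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += freq
    return counts


def orthographic_neighbors(target, vocabulary):
    """Return words of the same length that differ from target in exactly one letter.

    This substitution-only definition is an assumption; the resource may use another.
    """
    return sorted(
        w for w in vocabulary
        if len(w) == len(target) and sum(a != b for a, b in zip(w, target)) == 1
    )


bigrams = letter_ngram_freqs(WORD_FREQ, 2)
neighbors = orthographic_neighbors("kala", WORD_FREQ)
print(bigrams.most_common(3))
print(neighbors)
```

With real data, the frequency dictionary would be loaded from the downloaded resource; its file layout is described in the resource's own documentation and is not assumed here.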

Latest versions/subcorpora:
Psycholinguistic Descriptives
Metadata and license
Attribution instructions
Download the resource
Search for these versions in META-SHARE

Different versions of this language corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081601

Opusparcus: Open Subtitles Paraphrase Corpus for Six Languages

Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

The data in Opusparcus have been extracted from OpenSubtitles2016 (http://opus.nlpl.eu/OpenSubtitles2016.php), which is in turn based on data from http://www.opensubtitles.org.

For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been annotated manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.

Opusparcus is available for download at the Language Bank of Finland. The README file in the download folder contains detailed descriptions of the data sets.

Please cite the following paper in any work that utilizes any part of the Opusparcus corpus:
Mathias Creutz (2018). Open Subtitles Paraphrase Corpus for Six Languages. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference (LREC 2018), 7-12 May, Miyazaki, Japan.

Latest versions/subcorpora:
Opusparcus: Open Subtitles Paraphrase Corpus for Six Languages (version 1.0)
Metadata and license
Attribution instructions
Download the resource
Search for these versions in META-SHARE

Different versions of this language corpus are, or may in the future be, published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081203

Finnish OpenSubtitles 2017

The corpus contains Finnish subtitles for movies and TV series from http://www.opensubtitles.org.

The corpus is a derivative of the [OPUS OpenSubtitles2018](http://opus.nlpl.eu/OpenSubtitles2018.php) multilingual corpus. Information on the material processing up to sentence splitting can be found in the original publication, Lison & Tiedemann (2016). The corpus has been tokenized and annotated with morpho-syntactic analysis produced with the [Turku Dependency Parser](http://turkunlp.github.io/Finnish-dep-parser/).

P. Lison and J. Tiedemann (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Latest versions/subcorpora:
Finnish OpenSubtitles 2017, Kielipankki Korp Version
Metadata and license
Attribution instructions
Select the corpus in Korp
Finnish OpenSubtitles 2017, source
Metadata and license
Attribution instructions
Download the resource
Finnish OpenSubtitles 2017, VRT
Metadata and license
Attribution instructions
Download the resource
Search for these versions in META-SHARE

Different versions of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081202

New Year’s Speeches of the Presidents of the Republic of Finland

This corpus contains the New Year’s speeches given by the Presidents of the Republic of Finland in 1935-2007.

More information on the corpus: http://kaino.kotus.fi/korpus/teko/meta/presidentti/presidentti_coll_rdf.xml

Latest versions/subcorpora:
New Year’s Speeches of the Presidents of the Republic of Finland
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for these versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051202

FinStud86 Corpus

The corpus contains Finnish-language essays (compositions) written by Finnish-speaking students taking the matriculation examination in 1986.

Latest versions/subcorpora:
FinStud86 Corpus
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for these versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042604

Corpus of Finnish Matriculation Examination Essays from 1994, 1999 and 2004

The corpus contains Finnish essays written by students taking the matriculation examination in 1994, 1999 and 2004.

Latest versions/subcorpora:
Corpus of Finnish Matriculation Examination Essays from 1994, 1999 and 2004
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for these versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021042603

FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts

The corpus contains legal texts in Russian and Finnish arranged as a comparable text corpus. More information can be found at https://mustikka.uta.fi.

Latest versions/subcorpora:
The Finnish Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts
Metadata and license
Attribution instructions
Select the corpus in Korp
The Russian Sub-corpus of FiRuLex, Russian-Finnish Comparable Corpus of Legal Texts
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for these versions in META-SHARE

Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content, user rights and licenses of each version can be found in that version’s metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092402
