Important: The right to use the full-text versions of the STT news archive ends on 21.2.2025
The news archive of the Finnish News Agency (Suomen Tietotoimisto, STT) contains the Finnish-language newswire articles that STT has sent to its media customers since 1992. The vast majority of the articles are news stories, ranging in length from very short ”news flashes” to news telegrams and longer news stories. The articles are classified by department (domestic, foreign, economy, politics, culture, entertainment and sports) and by their associated metadata (IPTC subject terms or keywords and, to some extent, location classifications). The archive also includes other material created or forwarded by STT, such as news planning lists sent to customers, sports results, guest columns and press releases.
More detailed information on the content of each version of the resource can be found in its metadata record. The metadata record also gives the access rights and licenses of the resource.
License change 2024-11-21: The rightholder has announced that the license covering the full-text versions of the STT news archive ends on 21.2.2025. If you have been granted access to the full-text versions of the STT news archive via the Language Bank of Finland (Kielipankki), the license terms require you to stop using those resources and to remove them from your devices within a three-month transition period, that is, by 21.2.2025 (see the license link above). Users who were previously granted access have also been notified by email.
Please note that the right of use ends only for the full-text versions of the STT news archive! You may still use those versions of the STT news archive in which only limited contexts are available at a time (e.g. the Korp versions of the STT news archive in Kielipankki) or in which the sentence order of the text has been scrambled. Kielipankki aims to make replacement versions of the resource available through its download service in the near future.
Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2018121001
Important: The license of the full-text versions of the Finnish News Agency Archive will be terminated on 21.2.2025
The Finnish News Agency Archive corpus comprises newswire articles in Finnish sent to media outlets by the Finnish News Agency (STT) since 1992.
Most of the material is news articles that vary from short “news flashes” to telegrams and longer articles. News articles are categorized by department (domestic, foreign, economy, politics, culture, entertainment and sports) as well as by metadata (IPTC subject categories or keywords and location data). The archive also includes other material STT has created or forwarded such as news planning lists, sports results, analysis articles and press releases.
Further details of each version of the resource are maintained in the metadata record, findable via the persistent identifier (see the link at the resource title).
License change 2024-11-21: According to a notice from the rightholder, the end-user license of the full-text versions of the Finnish News Agency Archive will be terminated on 21st February 2025. If you were granted the right to use the full-text versions via the Language Bank of Finland, you must stop using the resources in question and remove them from your devices by that deadline (see the license link above). Users who have access rights to the full-text versions were also notified by email on 21st November 2024.
Please note that the termination of the license only affects the full-text versions of the resource! You may continue using those versions of the Finnish News Agency Archive that only show restricted contexts (e.g., the Korp versions of the archive in the Language Bank) or where the order of the sentences has been scrambled. The Language Bank is already working on new downloadable versions that can be made available under the public license.
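To illustrate what a version with scrambled sentence order means in practice, here is a minimal Python sketch. It is only an assumption-based illustration: the function name `scramble_sentences` and the `seed` parameter are hypothetical, and the procedure the Language Bank actually uses to produce its scrambled versions is not described on this page.

```python
import random
from typing import List, Optional


def scramble_sentences(sentences: List[str], seed: Optional[int] = None) -> List[str]:
    """Return a copy of the sentence list in random order.

    Hypothetical helper for illustration only; it is not the Language
    Bank's actual scrambling procedure.
    """
    rng = random.Random(seed)      # local generator, so global state is untouched
    shuffled = list(sentences)     # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    article = [
        "First sentence of an article.",
        "Second sentence of an article.",
        "Third sentence of an article.",
    ]
    for sentence in scramble_sentences(article, seed=1):
        print(sentence)
```

Each sentence stays intact and remains usable for word- and sentence-level study, but the running text of the original article can no longer be read in order.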
Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2023072121
This corpus contains newspapers and magazines from Finland starting from 1770, compiled by the National Library of Finland.
NB: The Finnish acronym for the corpora The Newspaper and Periodical OCR Corpus of the National Library of Finland used to be ”Digilib”. Currently, however, the acronym ”klk” and the short names klk-fi-1874-dl and klk-fi-1920-dl are recommended instead.
Latest versions/subcorpora: | |
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 (1771-1874), VRT Metadata and license Attribution instructions | Download the resource |
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, VRT Metadata and license Attribution instructions | Download the resource |
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp Metadata and license Attribution instructions Example queries in Korp | Select the corpus in Korp |
The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Metadata and license Attribution instructions | Select the corpus in Korp |
The Swedish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version Metadata and license Attribution instructions | Select the corpus in Korp |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Metadata and license Attribution instructions | Download the resource |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) Metadata and license Attribution instructions | Download the resource |
The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1771–1879, VRT Metadata and license Attribution instructions | Download the resource |
The Newspaper and Periodical Corpus of the National Library of Finland, Swedish sub-corpus, 1880–1948, scrambled, VRT Metadata and license Attribution instructions | Download the resource |
Different versions/subcorpora of this corpus are published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
Word-level collections of uni-, bi- and trigrams have been created from the KLK data and are available for download. They are published as separate data sets:
The N-grams of the Newspaper and Periodical Corpus of the National Library of Finland
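As an illustration of what word-level uni-, bi- and trigram collections contain, the following minimal Python sketch counts n-grams from already tokenized sentences. The function names and the toy tokens are hypothetical. The sketch does not reproduce how the KLK n-gram data sets were actually built, which is not described on this page.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def word_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous word n-grams of length n from a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sentence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences: Iterable[List[str]], max_n: int = 3) -> Dict[int, Counter]:
    """Count n-grams for n = 1..max_n, never crossing sentence boundaries."""
    counts: Dict[int, Counter] = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            counts[n].update(word_ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Toy input; real corpus text would first need tokenization.
    toy_sentences = [
        ["suomen", "kansalliskirjasto", "helsinki"],
        ["suomen", "kansalliskirjasto"],
    ]
    for n, counter in count_ngrams(toy_sentences).items():
        print(n, counter.most_common(2))
```

Counting within sentence boundaries is a design choice made for this sketch. Because n-grams are drawn from OCR output, recognition errors in the source text would appear as spurious n-gram types.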
The corpora consist mainly of digitized versions of texts originally printed on paper. The physical papers have been scanned, and optical character recognition (OCR) has been performed on the resulting images. The digitized material spans a long period and contains different kinds of texts, writing styles and fonts. Some parts of the material are harder to scan than others, and the physical condition of the original texts also varies. The OCR techniques used have varied as well, and some of the texts may have undergone manual post-correction. As a result, some parts of the corpora are of very poor quality while others are of good quality. We have collected a list of publications related to OCR quality and collection processing.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092404
Last updated: 19.6.2024
The Yle News Archive contains Yle’s Finnish news articles from 2011 onwards and Swedish news articles from 2012 onwards. The archives are cumulative, and the versions available through the Language Bank of Finland are listed on this page.
Several different versions of these resources are published in the Language Bank of Finland.
Details on the content and license of each version are available via COMEDI.
Last updated: 02.10.2024
This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2023072122
This resource contains a copy of the original News on the Web corpus (NOW), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains data from web-based newspapers and magazines in 20 different English-speaking countries from Jan 2010 to 31 May 2021. The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
News on the Web (Mark Davies, english-corpora.org) – Kielipankki version 2021-05, source Metadata and license Attribution instructions | The corpus will be available soon |
Different versions/subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112405
The Coronavirus Corpus contains data on the medical, social, cultural, and economic impact of the coronavirus (COVID-19) from online magazines and newspapers in 20 different English-speaking countries from 1 Jan 2020 to 31 May 2021. The original version is provided by Mark Davies via the corpus interface at english-corpora.org. The Language Bank of Finland offers a ”snapshot” version of the corpus under a restricted academic license that is available for users affiliated with a university in Finland.
See also the rest of the corpora from english-corpora.org that are available at the Language Bank of Finland.
For the description of an individual corpus version, please see the metadata record (click on the link at the corpus title).
More information about all corpora from english-corpora.org that are available via the Language Bank
For the license text of an individual corpus, click on the license image in the corpus list, or see the metadata record (click on the link at the corpus title). Note that specific additional terms and conditions apply to this and other corpora from BYU; see https://www.corpusdata.org/restrictions.asp. The link is included in the official license.
This page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022111705
The corpus contains different volumes of four magazines: Suomen Kuvalehti, Historiallinen Aikakauskirja, Lakimies and Suomi.
Suomen Kuvalehti’s volumes: 1917, 1925, 1935, 1945, 1955, 1965, 1972 (approximately 5.4 million tokens).
Historiallinen Aikakauskirja’s volumes: 1917, 1920, 1925, 1935, 1945.
Lakimies’ volumes: 1917, 1920, 1925, 1935, 1945, 1955, 1965, 1972.
Suomi’s volumes: 1917, 1920, 1923, 1935, 1938.
The corpus is made up of two parts: one whose OCR (optical character recognition) output has been checked and one whose OCR output has not been checked. The checked part comprises 670,000 tokens and contains one 1935 issue from Historiallinen Aikakauskirja, Lakimies and Suomi, as well as 4 issues of Suomen Kuvalehti from each of the years mentioned above (1917, 1925, 1935, 1945, 1955, 1965 and 1972). These issues were chosen so that the texts would be evenly distributed across the year.
Latest versions/subcorpora: | |
The Magazine Corpus of the Institute for the Languages of Finland, revised Metadata and license Attribution instructions | Select the corpus in Korp |
The Magazine Corpus of the Institute for the Languages of Finland, unrevised Metadata and license Attribution instructions | Select the corpus in Korp |
The Downloadable Version of the Magazine Corpus of the Institute for the Languages of Finland, revised Metadata and license Attribution instructions | The resource will be available soon |
The Downloadable Version of the Magazine Corpus of the Institute for the Languages of Finland, unrevised Metadata and license Attribution instructions | The resource will be available soon |
Different versions of this corpus are published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-201407301
Jehovah’s Witnesses’ Bible-based magazines: ’Awake!’, ’The Watchtower’, ’The Watchtower – Study Edition’ and ’The Watchtower – Study Edition (Simplified)’
The material was harvested from https://www.jw.org/en/library/magazines/ for the years 2010-2016 in all available languages.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021061821
The corpus contains issues of the ’Karjalan Sanomat’ newspaper published in 2012-2014.
Latest versions/subcorpora: | |
The Karelian Finnish Newspaper Corpus Metadata and license Attribution instructions | Select the corpus in Korp |
Different versions/subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021052401
The HS.fi News and Comments Corpus contains the domestic news published on the Helsingin Sanomat website, together with the reader comments on them, from 5.9.2011 to 4.9.2012. The corpus starts with the first news item of 5.9.2011 and ends with a news item published in the morning of 3.9.2012 and the comments published on the website by 5.9.2012.
Latest versions/subcorpora: | |
The HS.fi News and Comments Corpus Metadata and license Attribution instructions | Select the corpus in Korp |
Different versions/subcorpora of this corpus are, or may in the future be, published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021051910
NB: The Finnish acronym for this corpus used to be ”Digilib”, but from 23.11.2021 onwards the acronym ”klk” and the short names klk-fi-1874-dl and klk-fi-1920-dl are recommended instead. These corpora can now be found on the resource group page of The Newspaper and Periodical Corpus of the National Library of Finland, Kielipankki Version.
This corpus consists of the OCR results of the material in the corpus of publications digitized by the National Library of Finland.
The material published before 1875 is so old that any copyrights in it must have expired before 2015. For the material published from 1875 to 1920, note that parts of the resource are copyright-protected.
Latest versions/subcorpora: | |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1771-1874) Metadata and license Attribution instructions | Download the resource |
The Newspaper and Periodical OCR Corpus of the National Library of Finland (1875-1920) Metadata and license Attribution instructions | Download the resource |
Different versions/subcorpora of this corpus are published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-202104142
This resource contains entire newspaper and magazine articles published in Finnish in the 1990s and 2000s. The goal was to create a contemporary dataset of magazines and newspapers of various origins, such as scientific journals, regional newspapers, company internal circulations, and trade union member journals. A detailed list of all magazines and newspapers contained in this resource can be found here.
This resource contains Easy-to-read content: ’Leija’ and ’Selkosanomat/Selkouutiset’.
Different versions/subcorpora of this corpus are published in the Language Bank of Finland. A copy of this data is also available in CSC’s computing environment.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021032304
The Finnish Language Text Collection (Suomen kielen tekstikokoelma) is a selection of electronic Finnish texts from the 1990s. The collection contains texts from newspapers, journals and books. See the content details in Finnish.
All of the material is available for academic research use. A large part of the texts is also available for commercial use.
The collection was compiled by the Institute for the Languages of Finland, the Department of General Linguistics of the University of Helsinki and the Foreign Languages Department of the University of Joensuu.
Latest versions/subcorpora: | |
The Downloadable Version of the Finnish Text Collection Metadata and license Attribution instructions | Download the resource |
The Downloadable Version of the Finnish Text Collection – Commercial Use Metadata and license Attribution instructions | Download the resource |
The Helsinki Korp Version of the Finnish Text Collection Metadata and license Attribution instructions | Select the corpus in Korp |
Different versions/subcorpora of this corpus are published in the Language Bank of Finland.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-201403268
Yle’s Finnish-language news archive contains news articles from 2011 onwards, and the Swedish-language news archive from 2012 onwards. The resources are cumulative, and information on the versions available in the Language Bank of Finland (Kielipankki) is published on this resource page.
Various versions of the resources are prepared at Kielipankki and are available in Kielipankki’s download service and/or in the Korp concordance service.
More detailed information on the content of each version of the resource can be found in its metadata record. The metadata record also gives the access rights and licenses of the resource.
Last updated: 02.10.2024
PID of this resource group page: http://urn.fi/urn:nbn:fi:lb-2021020901
Last modified 2025-01-16