This data set includes an analysis of the original, English-language version and of the Dutch-language version (as released in the Netherlands) of the nine songs with lyrics from the Disney film Frozen. This analysis employs the triangle of aspects, an analytical model developed specifically for translation research into songs from musical films. The collection of these data is part of the licensor’s Ph.D. project, tentatively titled “Musical, visual and verbal aspects of animated film song dubbing: A case study of Disney’s Frozen” (projected for publication in early 2020). This data set comprises 9 PDF files, one for each song, as well as a Word document that summarizes the findings and provides copyright notices.
Note that the resource shall be removed from the CLARIN Service on 21 December 2023. This time limit shall be conveyed to users of the resource upon downloading, and the user shall commit to removing the downloaded resource from his/her devices and other storage facilities governed by the user on or before 21 December 2023.
Latest versions/subcorpora: | |
Triangle of Aspects Analysis of Frozen Metadata and license Attribution instructions |
Download the resource |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032104
This is a snapshot of the Oxford Text Archive, for testing purposes. For more up-to-date versions of the archive see http://ota.ox.ac.uk/
The snapshot is available in Kielipankki – the Language Bank of Finland (puhti.csc.fi, /appl/data/kielipankki/ota), see Access rights.
Latest versions/subcorpora: | |
Collection of OTA Texts in Public Use Metadata and license Attribution instructions |
Access the corpus in Puhti |
Search for all versions in META-SHARE |
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023032101
The University of Helsinki Language Corpus Server (UHLCS) is a multilingual data bank and data server which has been located at the Department of General Linguistics, the University of Helsinki. In September 2007, the UHLCS was moved to CSC (the Finnish IT Center for Science). The UHLCS, which is maintained by the University of Helsinki, was founded late in 1980. At present, the UHLCS contains computer corpora from more than 50 languages, including samples of minority languages and extensive corpora representing different text types. In 2000, the corpora from the Uralic, Turkic, Tungusic, Mongolic, Chukotko-Kamchatkan, Iranian and North-East Caucasian languages were edited for public use with the financial support of the Max Planck Institute for Evolutionary Anthropology, Leipzig. In summer 2003, the basis for the metadata descriptions of the corpora was prepared with the financial support of the ECHO project (ECHO = European Cultural Heritage Online). The UHLCS also provides tools that can be used for analyzing the corpora. The use of most of the corpora is restricted to research and teaching.
The following corpora are available in Kielipankki – the Language Bank of Finland (puhti.csc.fi, access rights instructions).
Latest versions/subcorpora: | |
Chuvash Corpus (UHLCS) |
Access the corpus in Puhti |
English Corpus (UHLCS) |
Access the corpus in Puhti |
Corpus of Erzya and Moksha Mordvin Literature and Journals and Komi Zyrian Literature (UHLCS) |
Access the corpus in Puhti |
Erzya and Moksha Mordvin Word List Corpus (UHLCS) |
Access the corpus in Puhti |
Estonian Corpus 1 (UHLCS) |
Access the corpus in Puhti |
Estonian Corpus 2 (UHLCS) |
Access the corpus in Puhti |
Finnish Corpus (Bibles) (UHLCS) |
Access the corpus in Puhti |
Finnish Corpus (Literature) (UHLCS) |
Access the corpus in Puhti |
The Helsinki Korp Version of the Finland-Swedish Text Corpus (UHLCS) |
Access the corpus in Korp |
The Taito Version of the Finland-Swedish Text Corpus (UHLCS) |
Access the corpus in Puhti |
Ingrian Corpus (UHLCS) |
Access the corpus in Puhti |
Khanty Corpus (North Khanty, Corpora and Translations) (UHLCS) |
Access the corpus in Puhti |
Komi Zyrian Corpus (UHLCS) |
Access the corpus in Puhti |
Latin Corpus (UHLCS) |
Access the corpus in Puhti |
Lude (Ludian) Corpus (UHLCS) |
Access the corpus in Puhti |
Nenets Corpus (Tundra Nenets) (UHLCS) |
Access the corpus in Puhti |
North Saami Corpus (Literature) (UHLCS) |
Access the corpus in Puhti |
North Saami Corpus (Sámikultuvradoaibmagotti smiehttamush) (UHLCS) |
Access the corpus in Puhti |
Quantifiers and Quantification in Finnish and Languages Spoken in the Central Volga–Kama Region (UHLCS) |
Access the corpus in Puhti |
The Susanne Corpus (UHLCS) |
Access the corpus in Puhti |
Ume Saami Corpus (UHLCS) |
Access the corpus in Puhti |
Uralic, Turkic, Indo-Iranian and Mongol languages; languages of Siberia and Caucasia (UHLCS) |
Access the corpus in Puhti |
Uzbek-English Dictionary (UHLCS) |
Access the corpus in Puhti |
Lists of Words Corpus (UHLCS) |
Access the corpus in Puhti |
Search for all versions in META-SHARE |
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023030901
word2vec is a tool developed by the Turku NLP group for analyzing the semantic similarity of words.
This resource collection contains word embeddings trained with word2vec from various corpora. The embedding file is in a simple and easily parsed textual format produced by word2vec. The first line in the file gives the vocabulary size and dimension. Each line after that begins with a vocabulary item, followed by a space, followed by 128 floating point numbers (represented textually) each followed by a space.
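The file layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not an official loader: the function names `read_word2vec_text` and `cosine_similarity` are hypothetical, and the two-word sample with 3-dimensional vectors is invented for illustration (the files described here use 128 dimensions). It assumes that vocabulary items contain no spaces.

```python
import io
import math
from typing import Dict, List, TextIO, Tuple


def read_word2vec_text(handle: TextIO) -> Tuple[int, int, Dict[str, List[float]]]:
    """Read embeddings in the textual word2vec format described above."""
    # First line: vocabulary size and vector dimension.
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in handle:
        # Each number is followed by a space, so strip the trailing
        # space and newline before splitting.
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = [float(v) for v in values]
    return vocab_size, dim, vectors


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors, a common similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented sample in the same layout (3 dimensions instead of 128).
sample = "2 3\nkissa 1.0 0.0 0.0 \nkoira 0.0 1.0 0.0 \n"
vocab_size, dim, vectors = read_word2vec_text(io.StringIO(sample))
print(vocab_size, dim, cosine_similarity(vectors["kissa"], vectors["koira"]))
```

For a downloaded file, the same function could be called on a handle from `open(path, encoding="utf-8")`; the UTF-8 encoding is an assumption and should be checked against the metadata record of the resource.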
Latest versions/subcorpora: | |
Word embeddings trained with word2vec from the Finnish Text Collection Metadata and license Attribution instructions |
Download the resource |
Word embeddings trained with word2vec from the Suomi24 corpus Metadata and license Attribution instructions |
Download the resource |
Search for all versions of this resource in META-SHARE |
Different versions/subcorpora of this language corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022041401
Latest versions/subcorpora: | |
Uralic UD v2.10, Kielipankki Korp version |
Select the corpus in Korp |
Search for all versions in META-SHARE |
The latest version of this corpus contains Universal Dependencies version 2.10 for the following Uralic languages: Erzya, Estonian, Finnish, Hungarian, Karelian, Komi-Permyak, Komi-Zyrian, Livvi, Moksha, North Sami and Skolt Sami.
Treebanks and their licenses:
Erzya (JR); CC BY-SA 4.0
Estonian (EDT, EWT); CC BY-NC-SA 4.0
Finnish (FTB, OOD, PUD, TDT); FTB: CC BY 4.0, other: CC BY-SA 4.0
Hungarian (Szeged); CC BY-NC-SA 3.0
Karelian (KKPP); CC BY-SA 4.0
Komi-Permyak (UH); CC BY-SA 4.0
Komi-Zyrian (IKDP, Lattice); CC BY-SA 4.0
Livvi (KKPP); CC BY-SA 4.0
Moksha (JR); CC BY-SA 4.0
North Sami (Giella); CC BY-SA 4.0
Skolt Sami (Giellagas); CC BY-SA 4.0
Universal Dependencies v2.10 License Agreement
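Because the treebanks carry different licenses, it may help to keep the list above in a machine-readable form when selecting data. The following is a small sketch in Python that only restates the list; the names `TREEBANK_LICENSES` and `noncommercial_treebanks` are hypothetical, and the check relies on the Creative Commons convention that "NC" in a license name stands for NonCommercial.

```python
from typing import Dict, List, Tuple

# (language, treebank) -> license, copied from the list above.
TREEBANK_LICENSES: Dict[Tuple[str, str], str] = {
    ("Erzya", "JR"): "CC BY-SA 4.0",
    ("Estonian", "EDT"): "CC BY-NC-SA 4.0",
    ("Estonian", "EWT"): "CC BY-NC-SA 4.0",
    ("Finnish", "FTB"): "CC BY 4.0",
    ("Finnish", "OOD"): "CC BY-SA 4.0",
    ("Finnish", "PUD"): "CC BY-SA 4.0",
    ("Finnish", "TDT"): "CC BY-SA 4.0",
    ("Hungarian", "Szeged"): "CC BY-NC-SA 3.0",
    ("Karelian", "KKPP"): "CC BY-SA 4.0",
    ("Komi-Permyak", "UH"): "CC BY-SA 4.0",
    ("Komi-Zyrian", "IKDP"): "CC BY-SA 4.0",
    ("Komi-Zyrian", "Lattice"): "CC BY-SA 4.0",
    ("Livvi", "KKPP"): "CC BY-SA 4.0",
    ("Moksha", "JR"): "CC BY-SA 4.0",
    ("North Sami", "Giella"): "CC BY-SA 4.0",
    ("Skolt Sami", "Giellagas"): "CC BY-SA 4.0",
}


def noncommercial_treebanks() -> List[Tuple[str, str]]:
    """Return the treebanks whose license includes the NonCommercial clause."""
    return sorted(key for key, lic in TREEBANK_LICENSES.items() if "-NC" in lic)


print(noncommercial_treebanks())
```

The license agreement linked above remains the authoritative source; this table is only a convenience for filtering.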
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022061003
Latest versions/subcorpora: | |
Parallel Bible Verses for Uralic Studies, Korp |
Select the corpus in Korp |
Search for all versions in META-SHARE |
These parallel corpora consist of Biblical verses (historical and contemporary, 1821–2019) from Erzya, Moksha, Olonets-Karelian (Livvi), Dvina-Karelian (North Karelian), Khanty, Komi-Permyak, Komi-Zyrian, Mansi, Udmurt and Veps. As regards the newer translations, the majority come from the Institute for Bible Translation in Helsinki, Finland, as originally organized for the University of Helsinki Language Corpus Server (UHLCS). Finnish and Russian translations are also included.
The purpose of these parallel corpora is to further the study of translation in Uralic minority languages. At the same time, they provide an opportunity to follow changes in the lexical and syntactic strategies used in different versions of Biblical verses in one language, or to compare lexicon and structure between languages. Lemmatization and morphological analyses are provided for all languages except Dvina-Karelian, Khanty, Veps and Russian; for the remaining languages, accuracy should improve as disambiguation resources are developed. The Finnish texts have been analyzed with TNPP (Turku Neural Parser Pipeline), which includes lemmatization, morphological analysis and syntactic annotation. The texts in Erzya and Moksha also have lemmatization, morphological analysis and syntactic annotation.
The 27 books of the New Testament are included for the following languages:
Additionally, the following books are included:
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023030902
Latest versions: | |
The Helsinki Korp Version of Samples of Spoken Finnish Metadata and license Attribution instructions |
Select the corpus in Korp |
Samples of Spoken Finnish, VRT Version Metadata and license Attribution instructions |
Download the resource |
Samples of Spoken Finnish, Downloadable Version (includes audio recordings and annotations) Metadata and license Attribution instructions |
Download the resource |
The Helsinki LAT Version of Samples of Spoken Finnish (PHASED OUT IN DECEMBER 2020) Metadata and license Attribution instructions |
(discontinued; downloadable version available) |
Search for all versions of this resource in META-SHARE |
This corpus consists of audio samples with annotation on 50 Finnish dialects, based on the dialect book series of the same name published by the Institute for the Languages of Finland between 1978 and 2000 (Suomen kielen näytteitä).
PLEASE NOTE: The downloadable data was re-packaged on 31.01.2023, because some information was found to be missing in the former packages.
The following data was added:
– Four preface texts (’saate’) for the individual parts of the book series in PDF format
– PDF files with general information for each of the 50 municipalities
– wav files for the municipalities 9-14
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023012601
Please note that the descriptions and size information are based on our current estimates and may be updated at a later stage.
For companies and non-academic organizations, the following versions of this resource are currently available or forthcoming: | |
---|---|
Donate Speech Corpus: Sample Metadata A free sample that contains a randomly selected set of 40 audio files and their corresponding transcripts as plain text files and as annotation files including time alignments. The metadata regarding the recorded samples and the background details supplied by the speakers (if available) are also included. The total duration of the audio files is about 35 minutes. |
Price: Free of charge
See instructions. |
Donate Speech: Selected dataset Metadata This resource contains five different subsets that were selected at Aalto University especially for developing, training and testing ASR systems. The total duration of the audio files is about 131 hours. |
Price: 1000 €
See instructions. |
Donate Speech: Annotated dataset Metadata This resource contains all the annotated audio files, their transcriptions as raw text files and annotation files, and the background information regarding the recordings and speakers. The total duration of the audio files is about 1600 hours. |
Price: 5000 €
See instructions. |
Donate Speech: Complete dataset, version 1 Metadata The Complete dataset (version 1) includes the Annotated dataset (and the Selected dataset and the Sample). In addition, the Complete dataset also includes the audio files that were not transcribed or annotated. |
Price: 10 000 €
See instructions. |
The first version of the Donate Speech Corpus (Puhelahjat) is a collection of speech recordings accumulated during the Donate Speech campaign between 16.6.2020 and 14.9.2021.
The resource contains a total of about 3200 hours of speech recordings, out of which about 1600 hours have been transcribed. The resource also includes information about the elicitation tasks for which each of the speech samples was donated in the original campaign, and the background details that were voluntarily provided by speech donors.
The resource is available via the download service of the Language Bank of Finland under restricted terms and conditions. The services of the Language Bank are directed at academic researchers. For companies and non-academic organizations, access to Puhelahjat datasets may be acquired for a fee. Further details can be requested by email at lahjoita-puhetta@helsinki.fi.
NB: These instructions are still subject to change.
In accordance with the specific terms and conditions of the Puhelahjat resource, it is also possible to grant access to the data for commercial and non-academic purposes. However, in this case, a separate license agreement between the University of Helsinki and the company or organization is required. When the agreement is signed and the payment has been made, access can be granted to the representative authorized by the user organization.
When applying for the use of paid material, it must be shown that the license fee has been paid.
Last updated: 8.3.2023
Persistent Identifier of this page: urn:nbn:fi:lb-2022111627
A demo service where you can try out automatic speech transcription and edit the automatically generated transcript via a browser interface.
Note: For the time being, this service is intended only for trial use with individual audio files. The service is not dimensioned for processing large data sets, and for data protection reasons, confidential speech recordings should not be processed in it.
Latest version: | |
Tekstiks |
Access the service |
Look for other versions of this tool in META-SHARE |
Tekstiks.ee is a web browser-based speech recognition service for transcribing speech in, for example, Estonian or Finnish.
The Tekstiks service is part of the international CLARIN cooperation. The text editor for editing transcripts and the interface for running speech recognisers were developed at the Laboratory of Language Technology at the Tallinn University of Technology (TalTech). The service is connected to TalTech's own speech recogniser for Estonian and to a speech recogniser provided through the Language Bank of Finland, which makes the speech recognition models developed at Aalto University, for Finnish among other languages, available in the Tekstiks service.
The system can process several files simultaneously. The average recognition time is about half of the total duration of the recording being processed (as of November 2022). In the browser interface, a Finnish or English view can be selected instead of the Estonian one.
First, the user needs to create a local user account on a server in Estonia managed by the Tallinn University of Technology. A working email address is enough to create the account; in addition, the user provides a name and chooses a password. The audio files to be processed are uploaded to the Tekstiks server in Estonia. A logged-in user can manage and delete the files they have uploaded to the Tekstiks server.
If Finnish speech recognition is selected and started in the Tekstiks service, the speech recordings are transferred over the network to a CSC-hosted server in Finland, where they are processed. The recognised text is transferred from the CSC server back to the Tekstiks server in Estonia, where the user can further edit the text and, if they wish, download it. Currently the supported download format is .docx (MS Word document).
Please note that the security level of this service, which is in test use, is not yet sufficient for processing confidential speech data.
Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2022122021
This resource contains a copy of the original TV corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 325 million words of data in 75,000 TV episodes from 1950 to 2018. The TV scripts come from several different English-speaking countries (US, UK, 4 other dialects), which makes it possible to compare very informal language across these countries. The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
The TV Corpus – Kielipankki version, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112415
This resource contains a copy of the original Corpus of American Soap Operas (SOAP), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 100 million words of data from 22,000 transcripts of American soap operas from the years 2001-2012, and it is a useful resource for studying very informal language. The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
Corpus of American Soap Operas – Kielipankki version, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112410
This resource contains a copy of the original News on the Web corpus (NOW), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains data from web-based newspapers and magazines in 20 different English-speaking countries from Jan 2010 to 31 May 2021. The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
News on the Web – Kielipankki version 2021-05, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112405
This resource contains a copy of the original The Intelligent Web Corpus (iWeb), provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 14 billion words in 22 million web pages. The data was taken in 2017 from around 100,000 of the most widely-used websites (for English) in the world.
The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
The Intelligent Web Corpus – Kielipankki version, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112310
A demo service where you can try out automatic speech transcription and edit the automatically generated transcript via a browser interface.
Note: For the time being, this service is intended for trial use with individual audio files only. The service is not designed to handle large amounts of data and should not be used to handle confidential speech recordings for data protection reasons.
Latest version: | |
Access the service | |
Look for all versions of this tool in META-SHARE |
Tekstiks.ee is a web browser-based speech recognition service for transcribing speech in Estonian or Finnish.
The Tekstiks service is part of the international CLARIN cooperation. The text editor for editing transcripts and the interface for running speech recognition tools have been developed at the Laboratory of Language Technology at the Tallinn University of Technology (TalTech). TalTech’s own speech recogniser for the Estonian language is connected to the service, as well as the speech recogniser for Finnish provided through the Language Bank of Finland, which uses speech recognition models developed at Aalto University.
The system can handle several files simultaneously. The average processing time is about half of the recording’s length (in November 2022). The language of the browser interface can be set to English or Finnish instead of Estonian.
First, you need to create a local username on a server managed by the Tallinn University of Technology in Estonia. To create an account, all you need is a working email address, a user name and a password. The audio files to be processed are uploaded to the Tekstiks server in Estonia. The logged-in user can manage and delete files uploaded to the Tekstiks server.
If Finnish speech recognition is selected and activated in the Tekstiks service, the speech recordings are transferred over the network to a CSC-hosted server in Finland, where they are processed. The recognised text is transferred from the CSC server back to the Tekstiks server in Estonia, where the user can further edit the text and, if they wish, download it. Currently the supported download format is .docx (MS Word document).
Please note that the level of security of this test service is currently not sufficient to handle confidential speech data.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112802
This resource contains a copy of the original Movie Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 200 million words from about 25,000 movies from the years 1930-2018. The movie scripts come from several different English-speaking countries and include English from the US, UK and 4 other dialects. The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
The Movie Corpus – Kielipankki version, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112305
This resource contains a copy of the original Coronavirus Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains data on the medical, social, cultural, and economic impact of the coronavirus (COVID-19) from online magazines and newspapers in 20 different English-speaking countries from 1 Jan 2020 to 31 May 2021. The corpus is related to many other corpora of English, formerly known as the ”BYU Corpora”.
More information on Mark Davies’ corpora at Kielipankki.
Latest versions/subcorpora: | |
The Coronavirus Corpus – Kielipankki version 2021-05, source Metadata and license Attribution instructions |
The corpus will be available soon |
Search for all versions in META-SHARE |
Different versions/subcorpora of this language corpus are (or might be in the future) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. The links to the different versions can be found in the list above.
Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022111705
Are you a researcher? The Donate Speech datasets for academic research use can be found on another page.
Note: The content descriptions and size information of the data packages are based on a preliminary estimate and may be refined if necessary.
The following packages of this resource are offered to companies and non-academic organizations: | |
---|---|
Donate Speech Corpus: Sample Metadata A free sample that contains 40 randomly selected audio files, their transcripts as raw text and as alignment files, and the available background information on the recordings and speakers. The total duration of the recordings is about 35 minutes. |
Price: Free sample Apply for access rights Download the resource |
Donate Speech: Selected dataset Metadata This collection contains five different subsets that were selected at Aalto University especially for the development, training and testing stages of automatic speech recognition. The total duration of the recordings is about 131 hours. |
Price: 1000 € Apply for access rights The data package is in preparation; the download link will appear here |
Donate Speech: Annotated dataset Metadata This collection contains all the transcribed recordings belonging to version 1 of the resource, the transcripts as raw text and as alignment files, and the background information on the recordings and speakers. The total duration of the recordings is about 1600 hours. |
Price: 5000 € Apply for access rights The data package is in preparation; the download link will appear here |
Donate Speech: Complete dataset (version 1) Metadata This collection includes all the transcribed and untranscribed recordings belonging to version 1 of the resource, the transcripts as raw text and as alignment files, and the background information on the recordings and speakers. The total duration of the recordings is about 3200 hours. Attribution instructions |
Price: 10 000 € Apply for access rights Download the resource |
The Donate Speech Corpus, or Puhelahjat, was compiled in the Donate Speech campaign, which was launched on 16.6.2020 and carried out by Vake, Yle and the University of Helsinki. In the campaign, anyone who speaks at least some Finnish has been able to donate their own speech via an easy-to-use browser or mobile application. The resource is unique in that, from the very beginning, it has been collected as transparently as possible for restricted use by both researchers and companies, with the aim of safeguarding the data protection of the speech donors throughout the life cycle of the resource.
Various packages of the resource are available in the download service of the Language Bank of Finland, where authorized researchers, companies and non-academic organizations can access them. The services of the Language Bank are primarily directed at researchers only. For companies and non-academic organizations, use of the resource is subject to a fee, with the exception of the sample. Further information is available at lahjoita-puhetta@helsinki.fi.
Note: These instructions are still being updated.
In accordance with the terms of use of the Puhelahjat resource, access rights can also be granted to companies or non-academic organizations. A written agreement on the use of the desired dataset is made with each non-academic user organization. Once the obligations under the agreement have been fulfilled, access to the resource can be granted to the representative authorized by the company.
Last updated: 8.3.2023
Persistent identifier of this page: urn:nbn:fi:lb-2022111628
Important information for users of this resource: Removal requests
Versions of this resource: | |
---|---|
Donate Speech Corpus, version 1.0 Metadata License (for researchers) Attribution instructions |
(for researchers only; a single application gives access to all versions of the resource) Apply for access rights +PRIV: This resource contains personal data. Submit a public notice about personal data processing Download the resource |
Donate Speech Corpus: Sample Metadata License (for researchers) Attribution instructions |
Download the resource |
Donate Speech Corpus: Training data (100h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Test data (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Development data (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Multi-transcriber test data (1h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Test data from speakers transcribed multiple times (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Search for other available versions |
The Donate Speech Corpus, Puhelahjat for short, was compiled in the Donate Speech campaign, which was launched on 16.6.2020 and carried out by Vake Oy (subsequently Ilmastorahasto), Yle and the University of Helsinki. In the campaign, anyone who speaks Finnish has been able to donate their own speech to advance linguistic research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.
The first version of the audio resource was built from the speech samples donated by spring 2021, and its total duration is about 3200 hours. In 2021, about 1600 hours of these recordings were transcribed manually, and the resulting textual transcripts were aligned with the corresponding recordings using automatic methods.
The first proper version of the resource, version 1.0, is available in the download service of the Language Bank of Finland, where authorized researchers, and later also companies, can access it. Subsets of the same resource, selected for purposes such as developing automatic speech recognition, are also offered as separate packages; their contents and citation practices are described in the metadata record of each version.
The intention is also to update and extend the Donate Speech resource collection later, once enough new donations have accumulated. New versions will also be made as researchers or companies continue transcribing and otherwise annotating the existing recordings.
Use of the Puhelahjat resource requires permission. Research use of all the subsets in the Puhelahjat group is covered by the same license, which also includes resource-specific data protection terms.
Instructions for commercial use can be found on a separate page.
Last updated: 23.12.2022
Persistent identifier of this page: urn:nbn:fi:lb-2022102122
Donate Speech datasets for commercial use: see further details on another page
Important information for all users of this resource: Removal requests
Versions of this resource: | |
---|---|
Donate Speech Corpus, version 1.0 Metadata License (for researchers) Attribution instructions |
(academic research use only) Apply for access rights +PRIV: This resource contains personal data. Submit public information about personal data processing Download the resource |
Donate Speech Corpus: Sample Metadata License (for researchers) Attribution instructions |
Download the resource |
Donate Speech Corpus: Training data (100h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Test data (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Development data (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Multi-transcriber test data (1h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Donate Speech Corpus: Test data from multi-transcriber speakers (10h) Metadata License (for researchers) Attribution instructions |
(The download link will appear here) |
Look for other versions of this resource |
The Donate Speech Corpus, abbreviated Puhelahjat, was compiled in the Donate Speech campaign implemented by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki, launched on June 16, 2020. During the project, anyone who speaks some Finnish had the opportunity to donate their own speech in order to promote language research and the development of language technology. The donated speech was recorded via an easy-to-use browser or mobile application.
The first version of the audio material includes the speech samples that were donated by spring 2021. The total duration of the recordings in this version is approximately 3,200 hours. In 2021, approximately 1,600 hours of the recordings were transcribed by hand, and the resulting transcriptions were aligned with the corresponding audio recordings using automatic methods.
Version 1.0 of the dataset is available in the download service for researchers who have been granted access. Some subsets of the complete dataset (selected, for instance, for the development of automatic speech recognition) will also be made available as separate download packages. The description and the citation practices of each subset can be found in the corresponding metadata records.
The Donate Speech datasets can be updated later, for instance after a sufficient number of new donations has accumulated. New versions can also be created as researchers or companies continue to transcribe and annotate the existing recordings more extensively.
The research use of the Donate Speech Corpus and any of its subsets is subject to the license of the resource. Note that the license also includes resource-specific data protection conditions.
The instructions for commercial use can be found on a separate page.
Last updated: 27.10.2022
Persistent identifier of this page: urn:nbn:fi:lb-2022102121
Some tools are available as Docker images. They can be used without installing any other dependencies (except for Docker). At this time the images are replacements for the command-line versions of these tools, meaning that they’re used via stdin and stdout, but they can also be run in an application server as a web service.
For now, the available tools are finnish-nertag, finnish-postag and finnish-tokenize.
The images are available on the Language Bank’s Dockerhub account, and may be installed as follows:
sudo docker pull kielipankki/finnish-nertag:latest
(Or finnish-postag, etc.)
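To fetch all three images in one go, the pull command above can be wrapped in a small loop. This is only a sketch: the helper names image_refs and pull_all are made up for illustration, and the image references are assumed to follow the kielipankki/<tool>:latest pattern of the command above for each of the listed tools.

```shell
# Tool names as listed above; the "kielipankki/" prefix and ":latest" tag
# are assumed to follow the pattern of the pull command shown above.
TOOLS="finnish-nertag finnish-postag finnish-tokenize"

# Print one image reference per tool, e.g. kielipankki/finnish-nertag:latest
image_refs() {
    for tool in $TOOLS; do
        printf 'kielipankki/%s:latest\n' "$tool"
    done
}

# Pull every image in turn; stop at the first failure.
pull_all() {
    for ref in $(image_refs); do
        sudo docker pull "$ref" || return 1
    done
}

# Usage (requires Docker and network access):
#   pull_all
```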
The resulting containers communicate via stdin and stdout, so you could test them like this:
$ sudo docker run --rm -i kielipankki/finnish-nertag <<< 'Pekingin olympialaiset 2008'
Pekingin <EnamexEvtXxx>
olympialaiset
2008 </EnamexEvtXxx>
They understand the same command-line options as the underlying tools:
$ sudo docker run --rm -i kielipankki/finnish-nertag --bio <<< 'Pekingin olympialaiset 2008'
Pekingin B-MISC
olympialaiset I-MISC
2008 I-MISC
$ sudo docker run --rm -i kielipankki/finnish-nertag --show-analyses <<< 'Pekingin olympialaiset 2008'
Pekingin peking [POS=NOUN][PROPER=PROPER][NUM=SG][CASE=GEN] [PROP=GEO] <EnamexEvtXxx>
olympialaiset olympialaiset [POS=NOUN][NUM=PL][CASE=NOM] _
2008 2008 [POS=NUMERAL][SUBCAT=CARD] _ </EnamexEvtXxx>
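Because the containers write plain text to stdout, their output can be piped into ordinary shell filters. The sketch below groups the --bio output shown above into one line per entity. The function name bio_to_entities is hypothetical, and the sketch assumes each line holds a token and a tag separated by whitespace; any tag that does not start with B- or I- is treated as ending the current entity.

```shell
# Group B-/I- tagged tokens (one "token<whitespace>tag" pair per line,
# as in the --bio output above) into "LABEL<TAB>entity text" lines.
bio_to_entities() {
    awk '
        function emit() {
            if (text != "") print label "\t" text
            text = ""; label = ""
        }
        NF < 2 { emit(); next }             # blank or untagged line ends an entity
        {
            tag = $NF; token = $1
            if (tag ~ /^B-/) {              # a new entity starts here
                emit()
                label = substr(tag, 3); text = token
            } else if (tag ~ /^I-/ && text != "") {
                text = text " " token       # continue the open entity
            } else {
                emit()                      # any other tag closes the entity
            }
        }
        END { emit() }
    '
}

# Usage (requires Docker and the finnish-nertag image):
#   printf '%s\n' 'Pekingin olympialaiset 2008' \
#     | sudo docker run --rm -i kielipankki/finnish-nertag --bio \
#     | bio_to_entities
```

Keeping the grouping in a separate filter leaves the container invocation unchanged, so the same function can be reused on saved output files.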