Nordic podcast database (PLIS)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Location	Resource group and help	Other information

Resource information

This resource consists of a Nordic database of podcasts and their transcriptions. The database was originally compiled for a comparative study of pragmatic borrowings from English in the Nordic languages and Finnish. It contains material in Danish, Finnish, Icelandic, Norwegian and Swedish (both Finland Swedish and Swedish as spoken in Sweden). The material was collected in 2025 and dates mainly from 2024.
The material has been annotated according to the written-language grammatical conventions of each language. More recent borrowings from English and instances of code-switching are highlighted.

License and access

Some versions of this resource are publicly available (PUB), while others require you to log in as an academic user (ACA) or to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
(Some versions of this resource may contain personal data (license condition +PRIV). The license may then include additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.)
(A copy of some versions of this resource may also be available directly in the computing environment (see the Location column).)

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2026040104

The Longitudinal Corpus of Finnish Spoken in Tampere (1970s, 1990s and 2010s) (tampuhe)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The material consists of interview recordings collected for a sociolinguistic longitudinal study. The roots of the study lie in the project ‘Nykysuomalaisen puhekielen murros’ (The Transformation of Contemporary Finnish Colloquial Language), which was launched in the 1970s. As part of the project, extensive urban colloquial language data was collected in four Finnish university cities: Tampere, Helsinki, Turku, and Jyväskylä.
The longitudinal corpus of Tampere colloquial language is similar in its implementation to the longitudinal corpus of Helsinki colloquial language (http://urn.fi/urn:nbn:fi:lb-2021052503). In both Helsinki and Tampere, follow-up rounds were conducted in the 1990s and 2010s, partly with the same interviewees.
The follow-up material also makes it possible to study changes in spoken language and dialects over time.

License and access

This resource requires you to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
All versions of this resource contain personal data (license condition +PRIV). The license includes additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2026012021

The Longitudinal Corpus of Finnish Spoken in Helsinki (1970s, 1990s and 2010s) (helpuhe)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The Longitudinal Corpus of Finnish Spoken in Helsinki consists of audio-recorded individual interviews with native Helsinki residents of different ages. The material was collected in three decades: in 1972–74, 1991–92 and 2013.

Discontinuation of the LAT version of this resource at the end of 2020

The LAT platform of the Language Bank was discontinued in 2020. This resource can no longer be accessed through the LAT interface, but the content previously available on LAT is still available for download. The material can therefore still be explored and processed with tools such as ELAN.

Corpus structure

The corpus is divided into three main parts according to the decade of recording: the 1970s, the 1990s and the 2010s. For sociolinguistic research purposes, the 1970s material is divided by the interviewee’s district of residence (S = Sörnäinen, T = Töölö). In the sub-corpora collected later, this district division was no longer maintained for young speakers; there the S and T codes refer rather to the speakers’ educational background (S = vocational school student, T = upper secondary school student). Each sub-corpus is further divided by the interviewee’s age group (1 = oldest, 2 = middle-aged, 3 = young).

The sub-corpora partly include the same interviewees. In 1991–1992, 29 interviewees from the two youngest age groups of the 1970s study were reached, and a new group of young speakers (16 interviewees) was added. In the follow-up project carried out in 2013, 27 of the 1990s informants were interviewed, and a new group of young speakers (16 in total) was added. As in the earlier rounds, the material was collected through individual interviews. For 13 informants, this was already the third interview.

Speakers are marked with the code F (female) or M (male) followed by a running number. These codes also identify the same individuals across all three sub-corpora: a given speaker is always referred to by the same code when appearing in more than one sub-corpus.
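
The coding scheme described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and combining the S/T letter with the age-group digit into a label such as T1 is an assumption suggested by the example file name 1970-T1M2C_1.eaf, not a documented rule.

```python
import re
from typing import NamedTuple

# Labels taken from the corpus description. The S/T letter means district of
# residence in the 1970s data and educational background for young speakers
# in the later rounds, so the letter itself is returned uninterpreted.
AGE_GROUPS = {1: "oldest", 2: "middle-aged", 3: "young"}
GENDERS = {"F": "female", "M": "male"}


class Speaker(NamedTuple):
    gender: str
    number: int


def parse_speaker(code: str) -> Speaker:
    """Parse a speaker code such as 'F12' or 'M2' (F/M + running number)."""
    match = re.fullmatch(r"([FM])(\d+)", code)
    if match is None:
        raise ValueError(f"not a speaker code: {code!r}")
    return Speaker(GENDERS[match.group(1)], int(match.group(2)))


def parse_group(code: str) -> tuple[str, str]:
    """Parse an assumed group label such as 'T1' (S/T letter + age group)."""
    match = re.fullmatch(r"([ST])([123])", code)
    if match is None:
        raise ValueError(f"not a group label: {code!r}")
    return match.group(1), AGE_GROUPS[int(match.group(2))]
```

Because the same speaker code is used in every sub-corpus, the parsed code could serve as a key for following one informant across the three decades.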

Contents of the sub-corpora

For the 1970s sub-corpus, the transcripts are mostly available only as unaligned text files (.txt) together with separate audio files (.wav) that cover the whole interview. The transcribed portion covers about half an hour of each interview. A small part of the transcripts has also been aligned with the audio.
For the 1990s sub-corpus, some interviews are available as audio files only and others with aligned transcripts, as in the 1970s part. Note that the transcription style differs somewhat from that of the 1970s sub-corpus.
The 2010s sub-corpus was transcribed directly in alignment with the audio files.

The annotations aligned with the audio files of all three sub-corpora are available both in the .eaf format used by ELAN and in the .TextGrid format used by Praat.

Version 1 (helpuhe1):

In the Longitudinal Corpus of Finnish Spoken in Helsinki project, carried out in 2013, a sub-corpus representing the 2010s was collected and the material acquired earlier in the 1970s and 1990s was processed into a digital corpus, which considerably improves its usability. The longitudinal corpus consists of digital audio files, which can be listened to in full, and of associated transcripts, which in this version cover about half an hour of each interview. In a large part of the material the transcripts are aligned with the corresponding positions in the audio files, so that searches can be made on the transcript and the matching audio passages can be listened to directly. In addition, keywords have been aligned with a large part of the audio, which also enables topic-based searches, for example for the needs of cultural and historical research.

Version 2 (helpuhe-v2):

The second version of the corpus contains updates to the annotation files of the 1970s, 1990s and 2010s sub-corpora: either new transcripts for audio files that were not transcribed at all in the first version or, for the 1970s sub-corpus, aligned versions of the old transcripts. Some earlier transcripts have also been updated, or a longer stretch of the recording has been transcribed. In total, annotations for 83 audio files have been updated or added. No new recordings were collected for this version.

Instructions for use

The .eaf annotation files aligned with the audio files of the corpus can be browsed, and the audio samples listened to, online on the LAT platform. First click the file you want in the left-hand window, e.g. 1970-T1M2C_1.eaf, and then click the ”view node” button, either in the pop-up menu or at the top of the right-hand window.

Transcripts and other annotation

The transcripts and other annotation of the material are available in the TextGrid format used by Praat and in the EAF format used by ELAN. The annotation files can be used online on the LAT platform, or they can be downloaded to your own computer and opened for editing in either ELAN or Praat. In either case, the corresponding WAV audio file must also be downloaded to accompany the annotation file.

The audio files on the LAT platform (WAV and M4A) can be listened to with the ”view node” command and downloaded one at a time with the ”download” command, also without the annotation. The WAV audio files are single channel (mono) and were sampled in 16-bit format at 44100 Hz. These WAV versions should be used if you want to browse and edit the annotations on your own computer. The M4A audio files are lossily compressed and smaller in file size. They were produced from the original WAV files mainly for listening and use over the network with the Annex tool.
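
As a quick sanity check after downloading, the stated audio properties (mono, 16-bit, 44100 Hz) can be verified with Python’s standard wave module. This is only a sketch; the function names are hypothetical and the file path is whatever WAV file you have downloaded.

```python
import wave

# Properties stated in the corpus description: mono, 16-bit, 44100 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate": 44100}


def wav_properties(path):
    """Read channel count, sample width (in bytes) and frame rate of a WAV file."""
    with wave.open(str(path), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate": wav.getframerate(),
        }


def matches_corpus_format(path):
    """Return True if the file has the properties listed in EXPECTED."""
    return wav_properties(path) == EXPECTED
```

A file that fails the check may have been converted or resampled after download, which is worth knowing before comparing it with the aligned annotations.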

Note: The interviews were recorded under varying conditions, and the oldest tapes in particular were digitised only later. The recordings may contain background noise and other disturbances, and their sound level may vary.

Content of the annotation files

Audio-aligned transcripts (.eaf, .TextGrid) are available for the entire 2010s sub-corpus and for parts of the 1990s and 1970s sub-corpora. Searches can thus be made on the transcript, and the audio passage roughly corresponding to each search result can be listened to. Part of the 1970s and 1990s material, however, is available only as separate text (.txt) and audio (.wav) files.

The alignment of the transcript and the audio is intended to facilitate searching, browsing and listening. It is therefore not completely accurate, and not all pauses have necessarily been marked.

Downloading files to your own computer

Files can be downloaded from LAT one at a time with the download command (right-click the file, or select it by clicking, which makes the button appear at the top of the page). Alternatively, all the files belonging to a single interview can be downloaded as one package by first selecting the green ”bag” that corresponds to the session and then clicking the Download all resources button. It is advisable to download at least the EAF or TextGrid file together with the corresponding WAV audio file and to place them in the same directory on your computer. Downloading the M4A file is usually not worthwhile, because it was produced only for listening in a web browser.
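
To check that every downloaded annotation file has its audio next to it, a small helper along the following lines may be useful. It is a sketch that assumes an annotation file and its WAV file share the same base name, which the description does not state explicitly; the function name is hypothetical.

```python
from pathlib import Path

ANNOTATION_SUFFIXES = {".eaf", ".textgrid"}


def annotations_missing_audio(directory):
    """List annotation files that have no same-named .wav file in the directory."""
    folder = Path(directory)
    files = list(folder.iterdir())
    # Base names of all WAV files present (assumed naming convention).
    wav_stems = {p.stem for p in files if p.suffix.lower() == ".wav"}
    return sorted(
        p.name
        for p in files
        if p.suffix.lower() in ANNOTATION_SUFFIXES and p.stem not in wav_stems
    )
```

An empty result means each EAF or TextGrid file in the directory has a WAV counterpart under the assumed naming convention.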

Older corpus versions and audio file packages can be downloaded from the download service of the Language Bank.

Searching based on annotations on the LAT platform (and in ELAN)

The types of the annotation tiers can be used for searching with the Trova tool (click the helpuhe node and select annotation content search). At the top of the Trova window, you can tick the types of annotation files to be searched: ELAN .eaf files, Praat .TextGrid files and/or unaligned raw text .txt files.

Older corpus versions can also be searched with ELAN installed on your own computer. In that case, the whole corpus or a sub-corpus must first be downloaded from the download service of the Language Bank. In ELAN, you can use the function Search: Structured Search Multiple eaf, which works on the same principle as the Trova tool of the LAT service. The search domain (Define Domain) is set in ELAN to the directory or directories into which the corpus packages were extracted.

Further instructions on searching with ELAN will be added later.

The type of the annotation tiers containing the interviewers’ speech (Tier type) is interviewer speech, whereas all tiers of the type speech contain the speech of either the actual interviewees or other people present at the recording. By restricting a Single Layer or Multiple Layer search to tiers of a given type, you can look for matches in the speech of the interviewees only or of the interviewers only. In the 1970s material, the interviewer’s initials are shown, but in the 1990s and 2010s material the interviewer’s turns are marked simply with the letter H.

Part of the material has been coded thematically, i.e. indexed with keywords according to the topic of conversation. The keywords are marked in one to three annotation tiers alongside the passage that deals with the topic. In the annotation files these tiers are named asiasana1, asiasana2 and asiasana3. Keywords can be searched by selecting Tier type: thematic keyword as the type of the target tier.

A few annotation files also contain marked passages of reported speech (Tier type: reference) and names (Tier type: name).
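
Outside ELAN, the same kind of type-restricted search can be approximated with a few lines of Python, since EAF files are XML. The sketch below relies on the general EAF structure (TIER elements carrying TIER_ID and LINGUISTIC_TYPE_REF attributes, with the text in ANNOTATION_VALUE elements); the function name is hypothetical, and it is an assumption that the tier type names listed above appear verbatim as LINGUISTIC_TYPE_REF values in the corpus files.

```python
import xml.etree.ElementTree as ET


def search_eaf(path, query, tier_type):
    """Return (tier id, annotation text) pairs whose text contains the query.

    Only tiers whose LINGUISTIC_TYPE_REF equals tier_type are searched,
    e.g. "speech" or "interviewer speech". Matching is case-insensitive.
    """
    root = ET.parse(path).getroot()
    hits = []
    for tier in root.iter("TIER"):
        if tier.get("LINGUISTIC_TYPE_REF") != tier_type:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            text = value.text or ""
            if query.lower() in text.lower():
                hits.append((tier.get("TIER_ID"), text))
    return hits
```

Calling the function once with ”speech” and once with ”interviewer speech” separates interviewee hits from interviewer hits, similar in spirit to restricting a layer search in ELAN.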

Resource creators

The Helsinki colloquial language data project was launched by Prof. Terho Itkonen at the University of Helsinki. From 1976 onwards the project was led by Prof. Heikki Paunonen. The 1970s sub-corpus was collected under the direction of Itkonen and Paunonen. The material of the 1990s sub-corpus was collected in 1991–92, with Prof. Heikki Paunonen continuing as project leader. In the follow-up project carried out in 2013 and funded by the Kone Foundation, the 2010s sub-corpus was collected; the interviews and transcription were carried out by Saila Marttila, Sanni Surkka and Suvi Syrjänen, students of Finnish working as research assistants. The project was led by Hanna Lappalainen of the Department of Finnish, Finno-Ugrian and Scandinavian Studies at the University of Helsinki. The thematic coding of the material was designed and carried out by Pauliina Latvala, PhD, who worked in the project as a grant-funded researcher.

More information on the Longitudinal Corpus of Finnish Spoken in Helsinki project

Corpus versions

The first version of the corpus, helpuhe1, can be downloaded as file packages from the download service of the Language Bank (http://urn.fi/urn:nbn:fi:lb-2014073041).

The second version of the corpus, with updated annotations (helpuhe-v2, http://urn.fi/urn:nbn:fi:lb-2016041424), will be made available later as download packages.

A version of the material for use via the Korp service of the Language Bank is also in preparation.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2025120402

Corpus of Finnish Sign Language

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Location	Resource group and help	Other information

Resource information

Further details of each version of the resource are maintained in the metadata record, findable via the persistent identifier (see the link at the resource title).

More detailed information on the videos in the second part of this corpus can be found here.

Background information on the signers:

cfinsl (pdf)
cfinsl-p2 (pdf)
cfinsl-p3 (pdf)

Important notes

License change (6 December 2024): In accordance with the deposition agreement updated in autumn 2024, resource-specific data protection terms have been added to the licenses of this resource.

License and access

Some versions of this resource are publicly available (PUB), while others require you to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
The versions of this resource contain personal data (license condition +PRIV). When processing personal data, you must follow the resource-specific data protection terms and conditions. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2021092401

Corpus for the study of Language and Gender in Mexico and Spain (CoLaGe) (colage)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The data were collected in Valencia, Spain (2021–2022) and Guadalajara, Mexico (2022–2023) as part of the research project Gender, society, and language use: evidence from Mexico and Spain, funded by the Kone Foundation. The objective was to create a comparable corpus of spoken Spanish from each city, enabling the study of the interconnections between speaker gender, societal gender roles and expectations, and variation in spoken language by combining sociolinguistic and social psychological methodologies.

The data consist of sociolinguistic interviews, divided into parts in which gender is or is not activated as a discourse topic, and two role plays simulating conflictive situations, with the informant playing one role and the interviewer the other. The informants represent a middle-class socioeconomic background and are divided into two age groups, 30–40 and 60–70. A thorough description of the data and the sociolinguistic variables is available with the data.

License and access

To use the resource versions that contain audio material, you are required to apply for individual access rights (RES). Some versions of this resource only contain text and you can access them by logging in as an academic user (ACA).
Note that all versions of this resource (may) contain personal data (license condition +PRIV). Therefore, the licenses also include resource-specific data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.
To see the resource-specific license text, click on the license image of the resource version.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2024030607

The Giellagas Corpus of Spoken Saami Languages (giellagas)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The Giellagas Corpus of Spoken Saami Languages includes three subcorpora of Saami languages spoken in Finland: Samples of Northern Saami (currently available, see above), and Aanaar (Inari) Saami and Skolt Saami, both of which will be made available at a later stage.

Further details of each version of the resource are maintained in the metadata record, findable via the persistent identifier (see the link at the resource title).

License and access

To access the versions of this resource, the user is required to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
Some/all versions of this resource may contain personal data (license condition +PRIV). The license may then include additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2025021321

Follow-up Study of Dialects of Finnish

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Removed versions of this resource

These resource versions are no longer available in the Language Bank of Finland.

Shortname	Name and metadata	License	Cite	Resource group and help	Publication year

Resource information

Content corresponding to the previous LAT version of the material is available in the Language Bank download service

The Language Bank LAT platform was discontinued at the end of 2020, and this material is no longer accessible via the LAT service. However, the corresponding content is available in downloadable format. The data can therefore be further explored and processed using tools such as ELAN and Praat.

Resource content

The follow-up study of Finnish dialects was started in 1989. It is a sociolinguistic and dialectological longitudinal study carried out in cooperation with the universities. The goal of the project has been to study the development of regional dialects in real time in 10 rural municipalities at 10-year intervals. The municipalities chosen represent the traditional main dialect groups of Finnish. In each municipality, a total of 15 speakers have served as informants. The external variables used in the study include age and sex: in each municipality, speakers of three generations (O = old, M = middle-aged, Y = young), both men and women, have been studied. The data have been collected using the traditional dialect interview method, and the study has focused on phonological and morphological features. The second round of the project was started in 1999 and completed in 2007.

License and access

This resource requires you to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
This resource contains personal data (license condition +PRIV). The license includes additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2023082101

Samples of Spoken Finnish

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Changes

PLEASE NOTE: The downloadable version of this data set was re-packaged on 31.01.2023, because some information was found to have been missing in the former download packages.
The following data were added:

Four preface texts (’saate’) from individual books in the original printed series ”Suomen kielen näytteitä”, in PDF format
PDF files with general information for each of the 50 municipalities
WAV audio files for municipalities 9–14

Detailed list of the added files

Content and structure

The corpus Samples of Spoken Finnish (SKN corpus) is based on the series of dialect books of the same name published by the Institute for the Languages of Finland between 1978 and 2000 (see Samples of Spoken Finnish). 50 books were published in total, each containing about two hours of dialect transcriptions. The municipalities selected for the series represent a comprehensive range of dialect areas. The material was mainly taken from the recordings of the Finnish Audio Recordings Archive. From the original SKN series, a data set has been created containing both the recordings and the transcribed text that has been aligned with the audio. The corpus was divided into fifty sections according to municipality and to the previously published dialect books. Two dialect samples are generally available for each section.

The text was manually aligned with the audio in fragments that roughly match sentence or utterance boundaries. The corpus is searchable based on the text, and it is possible to listen to the audio fragments that match the search results.

The SKN corpus contains a total of 696 376 transcribed words, of which 684 977 words have been assigned a ”normalized” form (”normalized” referring to the corresponding form in Standard Spoken Finnish, as opposed to the original form in the local dialect in question). Note that the ”normalization” is not necessarily unambiguous, even though efforts were made to take into account the meaning of the word in context. A normalized form was not included for incomplete or unclear words. For a description of the principles of the normalization process, see the document yleiskielistys_skn.pdf (in Finnish) under the root folder of the corpus.
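
The two word counts above imply how much of the corpus lacks a normalized form. A short Python calculation, using only the figures given in the description, makes this explicit:

```python
# Word counts as stated in the corpus description.
total_words = 696_376
normalized_words = 684_977

# Words without a normalized form (incomplete or unclear words).
missing = total_words - normalized_words
share = normalized_words / total_words

print(f"{missing} words lack a normalized form ({share:.1%} coverage)")
# prints "11399 words lack a normalized form (98.4% coverage)"
```

Searches restricted to normalized forms therefore miss about 1.6 % of the word tokens.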

Several versions of the data are available (see listing above).

More information on the recordings and annotations

As the original interviews were recorded under varying conditions and the tapes were digitised at a later date, many of the audio recordings in this dataset contain background noise and occasional other disturbances, and the sound quality of the recordings may vary. The audio files in WAV format are single channel (mono) and they were sampled in 16-bit format at 44100 Hz.

The LAT version of the data was phased out in November 2020

The LAT platform of the Language Bank was discontinued at the end of 2020. Although the Samples of Spoken Finnish resource can no longer be accessed through the LAT interface, all the content previously available on LAT is available for download. The annotated speech samples can be accessed on a local computer using tools such as ELAN and Praat.

Content of the annotation files in EAF format

Each audio recording of the original material corresponds to an annotation file in EAF format (e.g. SKN01a_Suomussalmi.eaf). Once you have downloaded the EAF file and the corresponding audio file to your local computer (see the downloadable version of the data), you can open them for editing with ELAN. If ELAN does not automatically find the media file (the WAV audio) linked to the EAF in the directory where you placed it, you can locate the WAV file manually. Once you save the EAF file again, ELAN will find the associated audio file on the same computer the next time you open it.

The EAF annotation files of this corpus contain several annotation layers or tiers. One tier contains the transcripts of the utterances, ”sentences” or similar passages uttered by the speaker in question, and another tier contains the roughly ”normalized” editions of the transcribed passages. The alignment of the transcripts and the audio was intended to facilitate searching, browsing and listening. The alignment is not completely accurate, and not all pauses have been marked. In addition to the tiers of transcribed speech, the annotation file also contains tiers of word tokens, where the original and the roughly normalized forms of individual words were aligned with each other. Please note that the individual word tokens were not aligned with the audio, but they were only intended to facilitate more complex content searches in ELAN.

TextGrid files corresponding to EAF files are also available and can be used with Praat. The TextGrid file must be opened and viewed together with the corresponding WAV audio file in Praat (since the audio file is not automatically opened with the annotation file in Praat, unlike ELAN).

The alignment of the audio and the text was originally done by importing XML-formatted documents with the help of a Praat script into TextGrid-formatted annotation files, which were then converted into the EAF format by another Praat script.

In ELAN, a ”Linguistic type” was defined for each annotation layer in the EAF files, which allows the user to define focused searches that would only match, e.g., those tiers that contain the normalized word forms. For technical reasons, hierarchical relationships between annotation layers and linguistic types were not originally defined for the SKN corpus files. Thus, if you wish to edit the annotations in ELAN, please note that the annotation tiers are independent from each other, i.e. if you move annotations of the type ”normalised word”, or their boundaries, for example, the changes will not automatically be reflected in the corresponding units in the other tiers. It might sometimes be easier to use Praat for making the changes to the TextGrid annotation files, since it is possible in Praat to move co-located annotation boundaries in sync. Alternatively, you can first manually create a hierarchy between the annotation tiers in your ELAN corpus version by creating new versions of the linguistic types (Type: Add linguistic type…) and then by using the command Tier: Change parent of tier… in ELAN.

Searching based on annotations

It is possible to search the transcripts of the corpus via the Korp service.

Searches can also be performed in ELAN, where you can make use of the different types of annotation tiers, in addition to the transcribed text. The annotation tier types ”original sentence” and ”original word” represent the original transcript, and ”normalised sentence” and ”normalised word” represent the preliminary normalized translations of these. The standardized form of some sentences is also accompanied by additional notes, which are described in the ”note for normalised word” layer.

The annotation layers related to the interviewer’s speech are indicated as ”interviewer”. All other layers relate to the speech of either the interviewees or other people present at the time of the recording.

Resource creators

The original audio material was edited by Sakari Pietarila. The original transcripts have been published in dialect books, the prefaces of which are attached to the corresponding sections of the corpus as PDF documents. The text and audio were aligned at Kotus by My Sjöholm, Pauliina Liuska and Olli Miettinen. The normalization of the transcripts was performed by Maria Vilkuna, Pauliina Liuska and Pinja Ruponen at Kotus. The audio recordings and the annotation files were converted for the LAT system by Mietta Lennes.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2023012601

The TV Corpus (Mark Davies, english-corpora.org) – Kielipankki version

Currently available versions of this resource group

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource group

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

This resource contains a copy of the original TV Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at english-corpora.org. The corpus contains 325 million words of data in 75,000 TV episodes from 1950 to 2018. The TV scripts come from several different English-speaking countries (US, UK, 4 other dialects), which makes it possible to compare very informal language across these countries.

More information about all corpora from english-corpora.org that are available via the Language Bank

License and access

For the license text of an individual corpus, click on the license image in the corpus list, or see the metadata record (click on the link at the corpus title). Note that specific additional terms and conditions apply to this and other corpora from BYU, see https://www.corpusdata.org/restrictions.asp. The link is included in the official license.

Korp versions

Some of the corpus versions are available for searching via the Korp concordancer tool (click on the link under ’Location’).
Access to the Korp versions requires academic login via a university in Finland.

Downloadable versions

Access to the downloadable corpora mentioned above is restricted to researchers affiliated with member universities of the FIN-CLARIN consortium in Finland. Download access can usually be granted to graduate or postgraduate students who need the corpora for an MA thesis or a PhD dissertation.
To obtain access to restricted corpora, please submit an application via the Language Bank Rights service (after logging in to the LBR service, search the catalogue for ’Mark Davies’ downloadable corpora at Kielipankki’).
To access the download service, click on the link under ’Location’, or see the metadata record for the link.

This page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112415

Corpus of American Soap Operas (Mark Davies, english-corpora.org) – Kielipankki version

Currently available versions of this resource group

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource group

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

This resource contains a copy of the original Corpus of American Soap Operas (SOAP), provided by Mark Davies on 4th June 2021 via the corpus service at english-corpora.org. The corpus contains 100 million words of data from 22,000 transcripts of American soap operas from the years 2001-2012, and it is a useful resource for studying very informal language.

More information about all corpora from english-corpora.org that are available via the Language Bank

License and access

Korp versions

Some of the corpus versions are available for searching via the Korp concordancer tool (click on the link under ’Location’).
Access to the Korp versions requires academic login via a university in Finland.

Downloadable versions

Access to the downloadable corpora mentioned above is restricted to researchers affiliated with member universities of the FIN-CLARIN consortium in Finland. Download access can usually be granted to graduate or postgraduate students who need the corpora for an MA thesis or a PhD dissertation.
To obtain access to restricted corpora, please submit an application via the Language Bank Rights service (after logging in to the LBR service, search the catalogue for ’Mark Davies’ downloadable corpora at Kielipankki’).
To access the download service, click on the link under ’Location’, or see the metadata record for the link.

This page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112410

The Movie Corpus (Mark Davies, english-corpora.org) – Kielipankki version

Currently available versions of this resource group

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Resource information

This resource contains a copy of the original Movie Corpus, provided by Mark Davies on 4th June 2021 via the corpus service at https://www.english-corpora.org. The corpus contains 200 million words from about 25,000 movies from the years 1930-2018. The movie scripts come from several different English-speaking countries and include English from the US, UK and 4 other dialects.

More information about all corpora from english-corpora.org that are available via the Language Bank

License and access

Korp versions

Some of the corpus versions are available for searching via the Korp concordancer tool (click on the link under ’Location’).
Access to the Korp versions requires academic login via a university in Finland.

Downloadable versions

Access to the downloadable corpora mentioned above is restricted to researchers affiliated with member universities of the FIN-CLARIN consortium in Finland. Download access can usually be granted to graduate or postgraduate students who need the corpora for an MA thesis or a PhD dissertation.
To obtain access to restricted corpora, please submit an application via the Language Bank Rights service (after logging in to the LBR service, search the catalogue for ’Mark Davies’ downloadable corpora at Kielipankki’).
To access the download service, click on the link under ’Location’, or see the metadata record for the link.

This page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022112305

Lahjoita puhetta (Donate Speech, puhelahjat): Corpora for companies and non-academic organizations

Are you a researcher? The Lahjoita puhetta corpora for academic research use can be found on a separate page.

Note: The content descriptions and size information of the corpus packages are based on a preliminary estimate and may be refined if necessary.

The following packages of this corpus are offered to companies and non-academic organizations:
Lahjoita puhetta corpus: Sample Metadata A free sample containing 40 randomly selected audio files, their transcripts as raw text and as alignment files, and the available background information on the recordings and speakers. The total duration of the recordings is about 35 minutes. How to cite this version	Price: Free sample Apply for access Download the corpus
Lahjoita puhetta: Selected subset Metadata This collection contains five subsets that were selected at Aalto University specifically for the development, training and testing phases of automatic speech recognition. The total duration of the recordings is about 131 hours. How to cite this version	Price: 1000 € * Apply for access Download the corpus
Lahjoita puhetta: Annotated corpus Metadata This collection contains the transcribed recordings that belong to version 1 of the full corpus, the transcripts as raw text and as alignment files, and the background information on the recordings and speakers. The total duration of the recordings is about 1600 hours. How to cite this version	Price: 5000 € * Apply for access Download the corpus
Lahjoita puhetta: Full corpus (version 1) Metadata This collection includes all transcribed and untranscribed recordings that belong to version 1 of the corpus, the transcripts as raw text and as alignment files, and the background information on the recordings and speakers. The total duration of the recordings is about 3200 hours. How to cite this version	Price: 10,000 € * Apply for access Download the corpus

* Value-added tax (VAT) of 25.5% is added to the prices.

Corpus content

The Lahjoita puhetta corpus, also known as Puhelahjat, was compiled in the Lahjoita puhetta (Donate Speech) campaign, launched on 16 June 2020 and carried out by Vake, Yle and the University of Helsinki. In the campaign, anyone with at least some knowledge of Finnish could donate their own speech through an easy-to-use browser or mobile application. The corpus is unique in that, from the outset, it has been collected as transparently as possible for restricted use by both researchers and companies, with the aim of safeguarding the data protection of the speech donors throughout the life cycle of the corpus.

Various packages of the corpus are available in the Kielipankki download service, where researchers, companies and non-academic organizations that have been granted permission can access them. The services of Kielipankki (the Language Bank of Finland) are primarily intended for researchers only. For companies and non-academic organizations, use of the corpus is subject to a fee, with the exception of the sample. Further information is available from lahjoita-puhetta@helsinki.fi.

Guidelines used for transcribing the annotated corpus (pdf)

How to get access to the corpus: instructions for companies

Note: These instructions are still being updated.

Under the terms of use of the Puhelahjat corpus, access rights can also be granted to companies or non-academic organizations. A written agreement on the use of the desired corpus is made with each non-academic user. Once the obligations under the agreement have been fulfilled, access to the corpus can be granted to a representative authorized by the company.

Companies interested in using the corpus can contact lahjoita-puhetta@helsinki.fi.
The general terms of the license agreements for companies can be viewed here.
Before acquiring a paid corpus, a company can get access to a small sample free of charge (”Lahjoita puhetta corpus: Sample”). The same terms of use apply to processing the sample as to the paid versions of the corpus, so a separate agreement is required.
Once the license agreement has been made, the representative authorized by the company can apply for access to either the sample or the actual corpus in the Language Bank Rights service (LBR, Kielipankin oikeudet).
The service requires the applicant to authenticate electronically with an identity provided through eDuuni or with a user account issued by an academic organization that belongs to one of the trust federations. If necessary, the applicant can create an eDuuni identity and use it to log in to the service. Confirming the identity requires an email address that is in the applicant’s own use.
Note: Creating an eDuuni identity is free of charge, so the company does not need to buy any other services offered through eDuuni.
In connection with the access application, the company must state the public title of its project and provide a link to a public privacy notice on the processing of the personal data contained in the corpus. This information is published on the Kielipankki website.
Instructions for writing a privacy notice
The license fee specified in the agreement must be paid before access to a paid corpus can be granted. Payment instructions are available from lahjoita-puhetta@helsinki.fi.
Once the access application has been approved, the person who submitted it gains access to the corpus with the user account that was used for the application.

This page has a Persistent Identifier: urn:nbn:fi:lb-2022111628

Lahjoita puhetta (Donate Speech, puhelahjat): Corpora for research use

Lahjoita puhetta corpora for companies and non-academic organizations: see the separate page for more information.

Important information for users of the corpus: Removal requests

Versions of the corpus:
Lahjoita puhetta corpus, version 1.0 Metadata License (for researchers) How to cite this version	Apply for access (researchers only; a single application gives access to all versions of the corpus) +PRIV: The corpus contains personal data. Submit a public notice on the processing of personal data Download the corpus
Lahjoita puhetta corpus: Sample Metadata License (for researchers) How to cite this version	Download the corpus
Lahjoita puhetta: Selected subset Metadata License (for researchers) How to cite this version	Download the corpus
Lahjoita puhetta corpus: Training data (100h) Metadata License (for researchers) How to cite this version	Downloadable as part of the Selected subset, see above
Lahjoita puhetta corpus: Test data (10h) Metadata License (for researchers) How to cite this version	Downloadable as part of the Selected subset, see above
Lahjoita puhetta corpus: Development data (10h) Metadata License (for researchers) How to cite this version	Downloadable as part of the Selected subset, see above
Lahjoita puhetta corpus: Test data from multiple transcribers (1h) Metadata License (for researchers) How to cite this version	Downloadable as part of the Selected subset, see above
Lahjoita puhetta corpus: Test data from speakers transcribed multiple times (10h) Metadata License (for researchers) How to cite this version	Downloadable as part of the Selected subset, see above
Find other available versions

Corpus content

The Lahjoita puhetta corpus, short name Puhelahjat, was compiled in the Lahjoita puhetta (Donate Speech) campaign, launched on 16 June 2020 and carried out by Vake Oy (later Ilmastorahasto), Yle and the University of Helsinki. In the campaign, anyone who knows Finnish could choose to donate their own speech to advance language research and the development of language technology. The donated speech was recorded through an easy-to-use browser or mobile application.

The speech samples donated by spring 2021 were used to build the first version of the audio corpus, with a total duration of about 3200 hours. In 2021, about 1600 hours of these recordings were transcribed manually, and the resulting text transcripts were aligned with the corresponding recordings using automatic methods.

The first proper version of the corpus, 1.0, is available in the Kielipankki download service, where researchers who have been granted permission, and later also companies, can access it. In addition, subsets of the same corpus, selected for example for the development of automatic speech recognition, are offered as separate packages. Their content and citation practices are given in the metadata record of each corpus version.

The Lahjoita puhetta corpus collection is also intended to be updated and extended later, once enough new donations have accumulated. New versions will also be made as researchers or companies continue transcribing and otherwise annotating the existing recordings.

Guidelines used for transcribing the annotated corpus (pdf)

How to get access to the corpus

Use of the Puhelahjat corpus requires permission. The same license, which also includes corpus-specific data protection terms, applies to research use of all subsets in the Puhelahjat group.

Research use

Researchers can apply for access to the corpus through the standard application procedure in the Language Bank Rights service (see instructions).
Already at the application stage, researchers should take into account the corpus-specific terms of use, including the data protection terms. The research must be feasible within these limits, also with regard to the processing of personal data; see the license (for researchers).
Before starting to process the corpus, the researcher must use the form to submit, for publication by Kielipankki, a generally understandable title for the project and a link to a public privacy notice on the processing of personal data.
With a single application, an approved researcher gains access to the entire Lahjoita puhetta corpus and its various versions and subsets.

Instructions for company use are on a separate page.

Last updated: 7 March 2024

This page has a Persistent Identifier: urn:nbn:fi:lb-2022102122

Aalto Finnish Parliament ASR Corpus 2008-2020 (fi-parliament-asr)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

This corpus is extracted from the Finnish parliament plenary session transcripts and videos by the
Aalto Speech Recognition group. The original session transcripts and videos are available at the web
portals of the Parliament of Finland (avoindata.eduskunta.fi and verkkolahetys.eduskunta.fi).

The Finnish corpus is split into three parts:
1. 2015-2020 set
2. 2008-2016 set
3. Development and evaluation sets

A non-overlapping combination of the 2008-2016 set and the 2015-2020 set forms a training set of the following size:
– 1 422 318 sample pairs
– 3 130 hours of speech
– 19 356 831 word tokens
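To put the training set figures into perspective, the short Python sketch below derives a few averages from them. It only uses the three totals listed above; since the number of hours is presumably rounded, the results should be read as approximations.

```python
# Totals as listed above for the combined Finnish training set.
SAMPLE_PAIRS = 1_422_318
SPEECH_HOURS = 3_130
WORD_TOKENS = 19_356_831

# Average length of one audio/transcript pair, in seconds.
avg_seconds_per_sample = SPEECH_HOURS * 3600 / SAMPLE_PAIRS
# Average number of word tokens in one transcript.
avg_words_per_sample = WORD_TOKENS / SAMPLE_PAIRS
# Average number of word tokens per hour of speech.
words_per_hour = WORD_TOKENS / SPEECH_HOURS

print(f"{avg_seconds_per_sample:.2f} seconds per sample")
print(f"{avg_words_per_sample:.2f} word tokens per sample")
print(f"{words_per_hour:.0f} word tokens per hour")
```

With these totals, an average sample is roughly 8 seconds long and contains about 14 word tokens.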

The Finland Swedish corpus contains:
– 3889 sample pairs
– 6.4 hours of speech
– 333 483 word tokens

All audio files in this corpus are single-channel WAV files with a sample rate of 16 kHz and 16-bit precision.
The transcript files (.trn) are plain text files.
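The stated audio format can be checked with Python's standard `wave` module. The following is a minimal sketch, not part of the corpus distribution: the function name `check_wav_format` is hypothetical, and the demonstration builds a one-second silent WAV in memory instead of reading a corpus file.

```python
import io
import wave


def check_wav_format(source, channels=1, rate=16000, sample_width_bytes=2):
    """Return True if the WAV has the expected channel count, rate and width.

    `source` may be a file path or a binary file-like object.
    A sample width of 2 bytes corresponds to 16-bit precision.
    """
    with wave.open(source, "rb") as wav:
        return (
            wav.getnchannels() == channels
            and wav.getframerate() == rate
            and wav.getsampwidth() == sample_width_bytes
        )


# Build one second of 16 kHz, 16-bit mono silence in memory for the demo.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

demo_ok = check_wav_format(buffer)
print(demo_ok)
```

For a downloaded file, the same function can be called with the file path as `source`.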

License and access

All versions of this resource are available publicly (PUB).
Click on the license image to see the resource-specific license text.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021081105

Finnish Proverb Collection (sananparsikokoelma)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The collection includes dialectal proverbs collected from various areas in the 1930s. This is a resource of Kotimaisten kielten keskus, the Institute for the Languages of Finland. For more information please see https://kaino.kotus.fi/korpus/sp/meta/sp_coll_rdf.xml.

This resource contains only a part of the 1.4 million proverbs collected in different regions of Finland. The National Archives of Finland have digitized quite a few of the handwritten cards containing proverbs. The digitized cards are available in jpg format at http://digi.narc.fi/digi/dosearch.ka?sartun=385077.KA

License and access

This resource is available publicly (PUB).
Click on the license image to see the resource-specific license text.

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2021081104

The Corpus of Beserman Udmurt, Kielipankki Version

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The Corpus of Beserman Udmurt comprises 65 000 tokens. The Beserman dialect of Udmurt is used in daily communication by approximately 2 000 speakers (according to the 2010 census). The Beserman live in the basin of the Cheptsa river in the Republic of Udmurtia and in the Kirov Oblast of the Russian Federation. In the scientific literature, Beserman is considered a dialect of the Udmurt language, characterized by an unusual combination of specifically Beserman phenomena (concentrated in vocabulary and phonetics) with certain traits of Northern and Southern Udmurt dialects, mostly morphological and phonological. The dialect remains the main means of everyday communication in Beserman villages, at least for the older generation.

The texts contained in the corpus have been collected in the villages of Shamardan (109 texts of 117), Vortsa (4 of 117), Malaya Yunda (1 of 117) and Zhuvam (3 of 117) in the Republic of Udmurtia in the years 2003-2015. There are 33 informants in total. The texts have been recorded, transcribed and grammatically annotated in the SIL FieldWorks software. The corpus contains narratives, life stories, dialogues, recipes, and recordings of psycholinguistic experiments. Each sentence is provided with interlinear glossing (according to the Leipzig Glossing Rules) and translation. Both the full text version with audio files and the corpus version are available at http://beserman.ru/corpus/search/?interface_language=en

License and access

The versions of this resource are available publicly (PUB).
Click on the license image to see the resource-specific license text.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021052406

Corpus of Age-related Voice Disguise (avoid)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

This corpus includes normal and age-related disguised speech uttered by 60 native Finnish speakers (31 females and 29 males). The speakers were asked to read the same text fragments several times, in their modal voice and in two disguised voices, first pretending to be an elderly speaker and then pretending to be a child. The texts consisted of the Finnish translations of The Rainbow Passage and The North Wind and the Sun, and two selected English sentences from the TIMIT[1] corpus (SA1, SA2). The corpus includes samples of 78 different sentences per speaker (66 Finnish, 12 English). The speech was recorded simultaneously with a portable recorder with a close-talking microphone and with two smartphone applications, yielding a total of 14040 audio files (3 * 4680). The material was recorded in summer 2015 in order to study the effect of voice disguise on automatic speaker recognition.
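The file count follows directly from the figures in the description. As a small worked check in Python, assuming that every speaker produced every sentence and that each of the three recording devices saved one file per utterance:

```python
# Figures taken from the corpus description above.
SPEAKERS = 60
FINNISH_SENTENCES = 66
ENGLISH_SENTENCES = 12
RECORDING_DEVICES = 3  # one portable recorder + two smartphone applications

sentences_per_speaker = FINNISH_SENTENCES + ENGLISH_SENTENCES
utterances = SPEAKERS * sentences_per_speaker
total_audio_files = utterances * RECORDING_DEVICES

print(sentences_per_speaker, utterances, total_audio_files)
```

This reproduces the 78 sentences per speaker, the 4680 utterances and the 14040 audio files mentioned in the text.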

Data protection policy for this corpus: http://urn.fi/urn:nbn:fi:lb-2018121021

Guidelines for processing corpora containing personal data in the Language Bank of Finland: http://urn.fi/urn:nbn:fi:lb-2020081522

License and access

This resource requires you to apply for individual access rights (RES).
Click on the license image to see the resource-specific license text.
This resource contains personal data (license condition +PRIV). The license includes additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021052405

ArkiSyn Database of Finnish Conversational Discourse (arkisyn)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

The Arkisyn corpus contains Finnish everyday conversations which have been morphologically and syntactically annotated. The data comes from the Conversation Analysis Archive at the University of Helsinki and the Finnish language Recording Archive at the University of Turku.

License and access

All versions of this resource are available publicly (PUB).
Click on the license image to see the resource-specific license text.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2014073026

Aalto University DSP Course Conversation Corpus (dspcon)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

Aalto University DSP Course Conversation Corpus contains transcribed recordings of Finnish conversations by Digital Signal Processing course students in Aalto University, Finland, from 2013 onwards. The intention has been to use the data to build better models for automatic speech recognition of conversational Finnish.

The corpus includes audio files, manually created word-level transcripts, and phone-level alignments generated using the Aalto ASR system.

License and access

All versions of this resource require you to log in as an academic user (ACA).
Click on the license image to see the resource-specific license text.
Some/all versions of this resource may contain personal data (license condition +PRIV). The license may then include additional data protection terms and conditions that you must follow. If processing personal data, maintain a public Privacy Notice regarding your project and provide the link to the Language Bank of Finland, see instructions.
Some/all versions of this resource are available in the computing environment (see column ’Location’).

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2015101901

Finnish Broadcast Corpus (Suomalainen radio- ja tv-korpus)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Location	Resource group and help	Other information

Content corresponding to the earlier LAT version of this resource is now available in the Kielipankki download service

The LAT platform of Kielipankki was decommissioned at the end of 2020, and this resource can no longer be accessed through the LAT view. The corresponding content is available in downloadable form, so the resource can still be studied and processed with, for example, ELAN and Praat.

Resource information

The Finnish Broadcast Corpus is divided into two main parts: FBC-1 and FBC-2.

The Finnish Broadcast Corpus 1 (FBC-1) contains 65 radio and TV recordings broadcast by YLE, the Finnish Broadcasting Company, during the year 2003. Parts of the audio and video material have been annotated either manually or automatically at various levels, e.g., utterance (orthographic transcript), word and phone. FBC-1 was compiled under an initiative called Integrated Resources for Speech Technology and Spoken Language Research in Finland, funded by the Academy of Finland. It is CSC’s first multimodal corpus.

Details of the size of FBC-2 are being updated.

The material in the FBC-1 represents four categories:
* Radio monologues
– broadcast telegraph news (24 × 3 minutes, Nov. 2003)
– broadcast lectures of the week (8 × 14 minutes).
* Radio dialogues
– unfinished recordings of the Moninaisuusfoorumi event (5 × 1h).
* TV monologues
– broadcast main news read by Arvi Lind and Eeva Polttila (15 × 30 minutes, September – November 2003), including the very last news telecast by Arvi Lind on October 15, 2003.
* TV dialogues
– broadcast Aamu-TV programs (13 × ca. 12 minutes, 2003).

Formats:
* WAV audio format
* HQ_Pure audio format (44.1–48 kHz) (supported by the Puh-Editor, which is now obsolete)
* HQ_Pure audio format (16 kHz) (supported by the Puh-Editor, which is now obsolete)
* MPEG2 video

Funding Project:
Puheteknologian ja puheentutkimuksen yhteiset resurssit Suomessa, Integrated Resources for Speech Technology and Spoken Language Research in Finland (SA-Puhe)
Funding Type: National Funds
Funder: Academy of Finland
Funding Country: Finland
Project duration: 01/01/2002 – 12/31/2004

License and access

This version of the resource requires you to apply for individual access rights (RES). Apply for access
Click on the license image to see the resource-specific license text.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-201403265

Last modified 2025-05-14

Contact information

Technical maintenance of Kielipankki:
kielipankki (at) csc.fi
tel. 09 4572001

Matters related to resources and other content:
fin-clarin (at) helsinki.fi
tel. 029 4129317

More detailed contact information