
At kielipankki.fi/future/mink, a browser-based tool called Mink is available, where users logged in via Haka can upload their own text materials for processing. The file formats supported by Mink include plain text (UTF-8), XML (where the analysis pipeline preserves the structures), Microsoft Word (.docx), Open Document (.odt), PDF, and CoNLL-U.
You can run advanced searches on your own text corpora in the Korp environment, which is accessible through the Mink service. If necessary, texts can first be automatically parsed and annotated in Mink, which improves Korp’s search capabilities. For now, the Mink platform supports lemmatization (i.e., the reduction of words to their base forms) as well as morphological and dependency-based syntactic analysis for Finnish, Swedish, and English text, and named-entity recognition for English text. In addition to viewing the results in Korp, you can also save the results of the analysis to your own computer.
With Mink, users can prepare, test, and explore their own Korp corpus. For now, only the user themselves can access the materials they have transferred to Mink’s Korp environment. However, separate arrangements may be made to make the corpus available to other researchers through the Language Bank’s shared Korp service. At a later stage, the plan is to make it possible to share the data stored in Mink, for example, with members of one’s own research group.
For now, more detailed instructions on how to use Mink can be found on the Swedish Språkbanken website. Please note that the Mink environment developed by Språkbanken has been slightly adapted for users of the Finnish Kielipankki, so not all features may work in exactly the same way in both Mink services.
The Mink platform is currently being further developed, and the Language Bank welcomes feedback on its functionality; see contact information.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026042421
At kielipankki.fi/future/mink, a browser-based tool called Mink is available, to which users logged in via Haka can upload their own text materials for processing. The data formats supported in Mink are plain text (UTF-8), XML (whose structures the analysis pipeline preserves), Microsoft Word (.docx), Open Document (.odt), PDF and CoNLL-U.
You can run advanced searches on your own text materials in the Korp environment shown inside the Mink service. If needed, the texts can first be automatically parsed and annotated in Mink, which improves Korp’s search options. For now, the Mink platform supports lemmatization (i.e. reducing words to their base forms) as well as morphological and dependency-syntactic analysis for Finnish, Swedish and English text, and named-entity recognition for English text. In addition to Korp, the analysis results can also be saved back to your own computer.
With Mink, users can thus prepare, test and explore their own Korp corpus. For now, only the user can access the material they have transferred to Mink’s Korp environment. However, it can be separately agreed that the corpus is made available to other researchers through Kielipankki’s shared Korp service. At a later stage, the intention is that material in Mink could be shared, for example, with members of one’s own research group.
For now, more detailed instructions for using Mink can be found on the website of the Swedish Språkbanken. Note that the Mink environment developed at Språkbanken has been somewhat adapted for users of the Finnish Kielipankki, so not all features necessarily work the same way in both Mink services.
The Mink environment is still being developed, and Kielipankki welcomes feedback on how Mink works; see the contact information.
The persistent identifier of this resource group page: http://urn.fi/urn:nbn:fi:lb-2026042422
The tool parses running Finnish text using TurkuNLP’s TNPP and visualises the result with the CoNLL-U viewer from the University of Groningen.
The text is first parsed into a dependency parse tree in CoNLL-U format, and then visualised with dependency arrows that connect words in a sentence with each other.
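To make the CoNLL-U step more concrete, the following minimal Python sketch reads one sentence in CoNLL-U format and lists the head-to-dependent arcs that a viewer would draw as arrows. The sample sentence and its analysis are hand-written for illustration (they are not actual TNPP output), and the function name read_arcs is hypothetical.

```python
# Hand-written CoNLL-U sample (an assumption for illustration, not TNPP output).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
SAMPLE = """\
# text = Koira nukkuu.
1\tKoira\tkoira\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_
2\tnukkuu\tnukkua\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_arcs(conllu):
    """Return (head_form, relation, dependent_form) triples for one sentence."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        rows.append(cols)
    forms = {int(cols[0]): cols[1] for cols in rows}
    forms[0] = "ROOT"  # HEAD value 0 marks the root of the sentence
    return [(forms[int(cols[6])], cols[7], cols[1]) for cols in rows]


arcs = read_arcs(SAMPLE)
for head, relation, dependent in arcs:
    print(f"{head} --{relation}--> {dependent}")
```

Each printed line corresponds to one dependency arrow from a head word to its dependent.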
NOTE: This tool is currently available as a demo version.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026031901
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Whisper can be installed on an SD Desktop virtual machine with the SD Software installer.
The version provided for SD Desktop is based on Faster-Whisper-XXL.
After installation, Whisper is available as a command-line tool in SD Desktop.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026020504
This web service takes as input a media file with a speech signal and a text file with the corresponding orthographic transcript, and computes a word segmentation as well as a phonetic segmentation and labeling.
The tools were developed at the Institute for Phonetics and Speech Processing in Munich, in the context of CLARIN-D.
For more information see the tutorial.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026020503
These command-line tools implement composable manipulations of segmented and annotated text in VRT (verticalized text) format, which is related to the Corpus WorkBench used as the back end of the Korp concordance engine.
The basic function of the VRT tools is to preserve previous annotations, including structural markup that may contain valuable information about the text units, without the underlying tools even knowing that their input sentences are extracted from such context. New annotations from an underlying tool are added to their proper place in the input document.
The major innovation in FIN-CLARIN VRT is the use of names for the fields, which are only positional in the basic format. In the basic format the declaration of names is only a comment, but these VRT tools use it extensively.
For more information, see the README.
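As a rough illustration of named fields, the Python sketch below reads the field names from a leading comment and uses them to turn each tab-separated token line into a dictionary. The exact comment syntax shown in the sample, the sample data, and the function name read_tokens are assumptions made for this sketch; the README remains the authoritative description of the format.

```python
import re

# Hand-written sample; the exact form of the names comment is an assumption.
SAMPLE = """\
<!-- #vrt positional-attributes: word lemma pos -->
<text title="Example">
<sentence id="1">
Koira\tkoira\tNOUN
nukkuu\tnukkua\tVERB
.\t.\tPUNCT
</sentence>
</text>
"""

NAMES_COMMENT = re.compile(r"<!-- #vrt positional-attributes: (.+?) -->")


def read_tokens(vrt):
    """Return (field_names, tokens), where each token is a name-to-value dict."""
    names = None
    tokens = []
    for line in vrt.splitlines():
        match = NAMES_COMMENT.match(line)
        if match:
            names = match.group(1).split()
        elif not line or line.startswith("<"):
            continue  # this sketch skips structural markup instead of preserving it
        elif names is None:
            raise ValueError("token line found before any field-name comment")
        else:
            tokens.append(dict(zip(names, line.split("\t"))))
    return names, tokens


names, tokens = read_tokens(SAMPLE)
print(names)
print([token["lemma"] for token in tokens])
```

With named fields, a tool can ask for "lemma" directly instead of relying on the column position, so adding or reordering annotations does not break it.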
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026020502
Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP).
Trankit can process inputs which are untokenized (raw) or pretokenized strings, at both sentence and document level.
This tool is installed in CSC’s computing environment (’module load trankit’).
The current version is Trankit v1.0.0.
For more details, please see Trankit’s Documentation.
Currently, Trankit supports the following tasks:
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026011402
Kielipankki – The Language Bank of Finland has now been connected to CLARIN’s Federated Content Search.
Currently connected are the Finnish-language and Swedish-language subcorpora of the National Library of Finland’s newspaper and periodical collection as well as the latest Suomi24, which amounts to about 27 billion tokens out of the total of 37 billion tokens available in Korp.
You can try the tool here: https://contentsearch.clarin.eu/
More information about CLARIN’s FCS can be found here.
Kielipankki – The Language Bank of Finland is now connected to CLARIN’s Federated Content Search.
Presently connected are the Finnish and the Swedish subcorpora of The Newspaper and Periodical Corpus of the National Library of Finland as well as the latest Suomi24, which together account for about 27 billion of the 37 billion tokens available in Korp.
You can try the tool here: https://contentsearch.clarin.eu/
For more information about the CLARIN Federated Content Search, please see here.
Finnish-nertag is a named entity recogniser for Finnish. This tool implements a pipeline in which FiNER is the NER tagging stage. Users can install the tools on their systems or run them in the local directory without installing.
FiNER is a rule-based named-entity recognition tool for Finnish, developed at the University of Helsinki for the FIN-CLARIN consortium. It uses tools based on the CRF-based tagger FinnPos, the Finnish morphology package OmorFi, and the FinnTreeBank corpus for tokenization and morphological analysis, and a set of pattern-matching (pmatch) rules for recognizing and categorizing proper names and other expressions in plaintext input.
The pattern-matching rules are built and compiled using the Helsinki Finite-State Technology toolkit.
More information and technical documentation can be found here.
Finnish-nertag is offered in CSC’s computing environment. It is also available for download as part of the software package finnish-tagtools, whose current version number is 1.6.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2025021801
Kielipankki’s Korp service was moved to a new server on 12 November 2024. At the same time, small fixes and changes were made to Korp; they are listed below. We apologize that some features were out of order for a long time.
If something does not work as before, please send feedback either via the feedback form or by email to fin-clarin (at) helsinki.fi.
Fixes and changes:
sentence_polarity → sentence_sentiment_polarity.
The news items in Korp’s news window contain some further details on these changes.
The Korp service of the Language Bank of Finland was moved to a new server on 12 November 2024. Korp also got a few minor fixes and changes listed below. We apologize for some features having been broken for a long time.
If something does not work as before, please send feedback either via the feedback form or by email to fin-clarin (at) helsinki.fi.
Fixes and changes:
sentence_polarity to sentence_sentiment_polarity.
For some more details, please see the corresponding news items on the Korp newsdesk.
Due to very low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down when CSC’s cloud services move to the new version of Rahti during summer 2024. Mylly will be available until 17 June 2024. Due to the short notice, we will keep the users’ data for three months after the shutdown.
If you wish to download your data, you can do it yourself by 17 June, or by contacting the CSC service desk within three months after that.
In case you wish to utilise the tool scripts from Mylly on other services (e.g., Puhti or CSC Notebooks), the software will still be available on GitHub.
Due to low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down when CSC’s cloud services move to the new version of Rahti during summer 2024. Mylly will remain available until 17 June 2024. Because of the tight schedule, we aim to keep users’ data for a further 3 months after that.
If you want to retrieve your data from Mylly, you can download it yourself until 17 June, or during the following three months by contacting the CSC service desk.
If you want to use Mylly’s tool scripts on other platforms (e.g. Puhti or CSC Notebooks), the scripts will remain available on GitHub.
GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.
The GiellaLT website contains the technical documentation of the GiellaLT infrastructure, developed and used by Divvun and Giellatekno.
It is an open source website providing analysers and tools for a wide range of languages, as well as a ready-made setup for adding more languages.
The Language Bank of Finland is currently in the process of evaluating the state of development of GiellaLT’s analysers for individual languages in relation to text data being annotated for the Korp search engine.
Read more about the details and findings of the evaluation performed by Jack Rueter.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050301
GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. The web site of GiellaLT offers language models (transducers) for a wide range of languages. Writing documentation for each language repository is an ongoing effort, and part of the development process.
The GiellaLT infrastructure, with its implementation of finite-state tools, allows people working with different languages to make use of technological solutions that, otherwise, might require several years of individual development. It is here that descriptions for many of the Uralic languages have been initialized and developed as both financed projects and the work of language technology enthusiasts.
The GiellaLT infrastructure makes it possible to reuse finite-state descriptions and even encourages it. Thus, contributing to the enhancement of the finite-state tools at GiellaLT, when extending the annotation of corpora on the Language Bank of Finland’s Korp server, is beneficial to the search engine users as well.
On this page, we will evaluate the state of development of analysers for individual languages in relation to text data being annotated for the Korp search engine. This evaluation will therefore be aligned with the annotation of upcoming corpora, such as a new extended version of PaBiVUS (Parallel Biblical Verses for Uralic Studies). The objective is to increase the lemmatization, morphological and syntactic annotation coverage not previously offered for non-majority languages in the parallel corpus. So, here we will provide an illustrative depiction of each individual finite-state description and what steps have been made for improvement. This might be seen as enhanced but not complete coverage of various genres as we go.
The evaluations will tend to illustrate the capacities of the analysers, which do have equivalent generators, but the possible overproductivity of these generators is presently not the focus of these evaluations. In time, attention will also be drawn towards the description of the disambiguation of morphological analyses, which is made possible in the open-source GiellaLT infrastructure. The enhanced descriptions, housed in GiellaLT, will serve as a contribution by the Language Bank of Finland to the shared responsibility for improving the coverage of lesser-described languages and the NLP addressing them. Thus, the resulting analysers will be available for building within the GiellaLT infrastructure or the UralicNLP Python, Java and .NET libraries available through GitHub or the Language Bank of Finland.
For more details, see the complete description of the analyser enhancement by Jack Rueter.
Please follow this link for a follow-up on the analyser enhancement by Jack Rueter.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050302
The NTS is a multilingual monitor corpus containing geolocated tweets and related metadata from the Nordic countries. Altogether, it contains nearly 74 million messages from hundreds of thousands of user accounts in Denmark, Finland, Iceland, Norway and Sweden. The NTS data cover the period between January 2013 and May 2023 and were collected with the Twitter Academic API, which has since been closed.
The purpose of the NTS is to facilitate basic research in SSH. The NTS has an easy-to-use graphical user interface that supports quick access to the data, so that researchers can focus on analysing it. The dataset enables various types of research. For example, it is possible to study public discussion and sentiment concerning events in recent history (e.g. the COVID-19 pandemic, the NATO membership process, etc.). The dataset is also a resource for sociolinguistic research and for scholars of multilingualism.
Visit the website.
If you use the NTS interface and make use of the results in your publications, please cite the recently published article, which is available online:
[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.
The persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2024041502
The NTS is a multilingual monitor corpus of geolocated tweets and associated metadata from the Nordic region. Altogether, it contains nearly 74 million messages from hundreds of thousands of user accounts from Denmark, Finland, Iceland, Norway, and Sweden. The NTS data cover the period between January 2013 and May 2023 and were collected using the Twitter Academic API, which is now closed.
The purpose of the NTS is to facilitate fundamental research in the social sciences and humanities (SSH). The NTS comes with an easy-to-use graphical interface that supports quick data access so that researchers can focus on data analysis. The dataset enables various types of research. For instance, it is possible to study public discourses and sentiment concerning events in recent history (e.g., the COVID-19 pandemic, the NATO membership process, etc.). The dataset is also a resource for sociolinguistic research and for scholars of multilingualism.
Please visit the website.
If you use the NTS interface and use the findings in your publications, please cite the recent paper, which is available online:
[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024041501
Have you been looking for a tool for identifying the language of the sentences in a text?
Check out the latest version of HeLI-OTS, 1.5: https://www.kielipankki.fi/tools/heli-ots/
Have you been looking for a tool that can identify the language of individual sentences in text?
Take a look at HeLI-OTS version 1.5: https://www.kielipankki.fi/tools/heli-ots/
Last modified on 2024-03-26
