Mink

Mink is a browser-based tool, available at kielipankki.fi/future/mink, to which users logged in via Haka can upload their own text materials for processing. Mink supports plain text (UTF-8), XML (whose structure the analysis pipeline preserves), Microsoft Word (.docx), Open Document (.odt), PDF, and CoNLL-U files.

You can run advanced searches on your own text corpora in the Korp environment embedded in the Mink service. If necessary, texts can first be automatically parsed and annotated in Mink, which improves Korp’s search capabilities. For now, Mink supports lemmatization (reducing words to their base forms) and morphological and dependency-syntactic analysis for Finnish, Swedish, and English text, as well as named entity recognition for English text. Besides exploring the analysis results in Korp, you can also download them to your own computer.

With Mink, users can thus prepare, test, and explore their own Korp corpus. For now, only the user who uploaded the materials can access them in Mink’s Korp environment. However, separate arrangements can be made to make a corpus available to other researchers through the Language Bank’s shared Korp service. At a later stage, the plan is to allow data stored in Mink to be shared, for example, with members of one’s own research group.

For now, more detailed instructions on how to use Mink can be found on the Swedish Språkbanken website. Please note that the Mink environment developed by Språkbanken has been slightly adapted for users of the Finnish Kielipankki, so not all features may work in exactly the same way in both Mink services.

The Mink platform is currently being further developed, and the Language Bank welcomes feedback on its functionality; see contact information.

Access Mink

Mink (Språkbanken Text)

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026042421

Mink: analysing your own materials and exporting them to Korp


At kielipankki.fi/future/mink you can use Mink, a browser-based tool to which users logged in via Haka can upload their own text materials for processing. The material formats supported in Mink are plain text (UTF-8), XML (whose structures the analysis pipeline preserves), Microsoft Word (.docx), Open Document (.odt), PDF and CoNLL-U.

You can run advanced searches on your own text materials in the Korp environment shown inside the Mink service. If needed, the texts can first be automatically parsed and annotated in Mink, which improves Korp’s search options. For the time being, the Mink platform supports lemmatization (reducing words to their base forms) and morphological and dependency-syntactic analysis for Finnish, Swedish and English text, as well as named entity recognition for English text. Besides viewing them in Korp, you can also save the analysis results back to your own computer.

With Mink, a user can thus prepare, try out and explore their own Korp corpus. For the time being, only the user can access the material they have transferred to Mink’s Korp environment. However, delivering the corpus to other researchers through Kielipankki’s shared Korp service can be agreed on separately. At a later stage, the intention is that material stored in Mink could be shared, for example, with the members of one’s own research group.

For the time being, more detailed instructions for using Mink can be found on the website of the Swedish Språkbanken. Note that the Mink environment developed at Språkbanken has been adapted to some extent for the users of the Finnish Kielipankki, so not all features necessarily work in the same way in both Mink services.

The Mink environment is still under development, and Kielipankki welcomes feedback on how well Mink works; see contact information.

Open Mink

Mink (Språkbanken Text)

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026042422

finnish-parse

The tool parses running Finnish text using TurkuNLP’s TNPP (Turku Neural Parser Pipeline) and visualises the result with the CoNLL-U viewer from the University of Groningen.

The text is first parsed into a dependency parse tree in CoNLL-U format, and then visualised with dependency arrows that connect words in a sentence with each other.
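To make the intermediate format concrete, here is a minimal Python sketch of how dependency arcs can be read from CoNLL-U output. It is not the tool's own code: the sample sentence and its analysis are hand-written for illustration, and the function name read_arcs is hypothetical. The only thing taken from the CoNLL-U format itself is the column layout (ten tab-separated columns, with the word form in column 2, the head index in column 7 and the relation in column 8).

```python
# Hand-written sample in CoNLL-U layout (illustrative, not real parser output).
SAMPLE = """\
# text = Koira juoksee.
1\tKoira\tkoira\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_
2\tjuoksee\tjuosta\tVERB\t_\tMood=Ind|Number=Sing|Person=3\t0\troot\t_\tSpaceAfter=No
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_arcs(conllu: str):
    """Return (dependent, relation, head) triples for one CoNLL-U sentence."""
    rows = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        rows.append(cols)
    # Map token index to word form; index 0 is the artificial root.
    forms = {int(cols[0]): cols[1] for cols in rows}
    forms[0] = "ROOT"
    return [(cols[1], cols[7], forms[int(cols[6])]) for cols in rows]


arcs = read_arcs(SAMPLE)
for dependent, relation, head in arcs:
    print(f"{dependent} --{relation}--> {head}")
```

Each printed line corresponds to one arrow that a viewer would draw between two words of the sentence.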

NOTE: This tool is currently available as a demo version.

Access to the demo version.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026031901

Whisper

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

Whisper home page

Whisper can be installed on an SD Desktop virtual machine with the SD Software installer.

The version provided for SD Desktop is based on Faster-Whisper-XXL.

After installation, Whisper is available as a command-line tool in SD Desktop.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026020504

WebMAUS

This web service inputs a media file with a speech signal and a text file with a corresponding orthographic transcript, and computes a word segmentation and a phonetic segmentation and labeling.

The tools were developed at the Institute for Phonetics and Speech Processing in Munich, in the context of CLARIN-D.

For more information see the tutorial.

Access the web service

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026020503

VRT tools

These command-line tools implement composable manipulations of segmented and annotated text in VRT (verticalized text) format, a format related to the Corpus Workbench, which is used in the back end of the Korp concordance engine.

The basic function of the VRT tools is to preserve previous annotations, including structural markup that may contain valuable information about the text units, without the underlying tools even knowing that their input sentences are extracted from such context. New annotations from an underlying tool are added to their proper place in the input document.

The major innovation in FIN-CLARIN VRT is the use of names for fields that are purely positional in the basic format. In the basic format, the declaration of the names is only a comment, but the VRT tools rely on it extensively.
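As an illustration of named fields, the following Python sketch reads token lines into dictionaries keyed by the names declared in a leading comment. It is a sketch under assumptions, not part of the VRT tools: the exact spelling of the declaration comment, the sample document and the function name read_tokens are all assumed here, so check the README for the actual convention.

```python
import re

# ASSUMPTION: the field names are declared in a comment of this shape.
NAMES_RE = re.compile(r"<!--\s*#vrt positional-attributes:\s*(.*?)\s*-->")

# Hand-written sample document (illustrative only).
SAMPLE = """\
<!-- #vrt positional-attributes: word lemma pos -->
<text title="Example">
<sentence id="1">
Koirat\tkoira\tNOUN
juoksevat\tjuosta\tVERB
.\t.\tPUNCT
</sentence>
</text>
"""


def read_tokens(vrt: str):
    """Return one dict per token line, keyed by the declared field names."""
    names = None
    tokens = []
    for line in vrt.splitlines():
        match = NAMES_RE.match(line)
        if match:
            names = match.group(1).split()
        elif not line or line.startswith("<"):
            continue  # structural markup is not a token line
        else:
            if names is None:
                raise ValueError("no field names declared before the first token")
            tokens.append(dict(zip(names, line.split("\t"))))
    return tokens


tokens = read_tokens(SAMPLE)
print([token["lemma"] for token in tokens])
```

Looking fields up by name rather than by column position means a tool keeps working when another tool has added or reordered columns.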

For more information see the README

Access on GitHub

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026020502

Trankit

Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP).

Trankit can process inputs which are untokenized (raw) or pretokenized strings, at both sentence and document level.

This tool is installed in CSC’s computing environment (‘module load trankit’).

The current version is Trankit v1.0.0.

For more details, please see Trankit’s Documentation.

Currently, Trankit supports the following tasks:

Sentence segmentation.
Tokenization.
Multi-word token expansion.
Part-of-speech tagging.
Morphological feature tagging.
Dependency parsing.
Named entity recognition.
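The sketch below shows one way to flatten the result of a tagging and parsing run into table rows. It does not call Trankit itself, so it runs without the library or its models: the nested dictionary is hand-written and only assumed to mirror the shape of Trankit's output (sentences containing tokens, with multi-word tokens carrying an "expanded" list), and the function name to_rows is hypothetical. Check Trankit's Documentation for the authoritative output format.

```python
# With Trankit installed, a document like the one below would typically come from:
#   from trankit import Pipeline
#   doc = Pipeline("finnish").posdep("...")
# Here a hand-written stand-in is used instead (ASSUMED output shape).
doc = {
    "sentences": [
        {
            "id": 1,
            "tokens": [
                {"id": 1, "text": "Dogs", "upos": "NOUN", "head": 2, "deprel": "nsubj"},
                {"id": 2, "text": "run", "upos": "VERB", "head": 0, "deprel": "root"},
            ],
        }
    ]
}


def to_rows(document):
    """Flatten a parsed document into (sentence, word, upos, head, deprel) rows."""
    rows = []
    for sentence in document["sentences"]:
        for token in sentence["tokens"]:
            # A multi-word token is assumed to list its words under "expanded";
            # an ordinary token stands for itself.
            for word in token.get("expanded", [token]):
                rows.append(
                    (
                        sentence["id"],
                        word["text"],
                        word.get("upos"),
                        word.get("head"),
                        word.get("deprel"),
                    )
                )
    return rows


rows = to_rows(doc)
for row in rows:
    print(*row, sep="\t")
```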

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2026011402

Kielipankki now connected to FCS

Kielipankki – The Language Bank of Finland has now been connected to CLARIN’s Federated Content Search.

Currently connected are the Finnish-language and Swedish-language subcorpora of the National Library of Finland’s newspaper and periodical collection as well as the latest Suomi24, which amounts to about 27 billion of the 37 billion tokens available in Korp in total.

You can try the tool here: https://contentsearch.clarin.eu/

More information about CLARIN’s FCS can be found here.

Kielipankki now connected to FCS

Kielipankki – The Language Bank of Finland is now connected to CLARIN’s Federated Content Search.

Currently connected are the Finnish and Swedish subcorpora of The Newspaper and Periodical Corpus of the National Library of Finland as well as the latest Suomi24, which together account for about 27 billion of the 37 billion tokens available in Korp.

You can try the tool here: https://contentsearch.clarin.eu/

For more information about the CLARIN Federated Content Search, please see here.

finnish-nertag

Finnish-nertag is a named entity recogniser for Finnish. It implements a pipeline in which FiNER is the NER tagging stage. Users can install the tools on their own systems or run them from a local directory without installing them.

FiNER is a rule-based named-entity recognition tool for Finnish, developed at the University of Helsinki for the FIN-CLARIN consortium. It uses tools based on the CRF-based tagger FinnPos, the Finnish morphology package OmorFi, and the FinnTreeBank corpus for tokenization and morphological analysis, and a set of pattern-matching (pmatch) rules for recognizing and categorizing proper names and other expressions in plaintext input.

The pattern-matching rules are built and compiled using the Helsinki Finite-State Technology toolkit.

More information and technical documentation can be found here.

Finnish-nertag is offered in CSC’s computing environment. It is also available for download as part of the software package finnish-tagtools, whose current version number is 1.6.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2025021801

Korp moved to a new server with minor fixes and changes

Kielipankki’s Korp service was moved to a new server on 12 November 2024. In this connection, minor fixes and changes were made to Korp; they are listed below. We apologize that some features were out of order for a long time.

If something does not work as before, please send feedback either via the feedback form or by email to fin-clarin (at) helsinki.fi.

Fixes and changes:

The time interval selector of the extended search (text attribute “aikaväli”, time interval) works again.
The presentation of the text attributes containing the identified languages of a sentence, paragraph and text has been changed. The changes apply to version 2 of the Finnish-language newspapers of the National Library’s newspaper collection (KLK) and to the Suomi24 2018–2020 corpus. The changes are as follows:
- A language is always shown as a three-letter ISO 639-3 language code.
- If a translation exists for the language code, the name of the language is shown as a tooltip in the concordance sidebar when you move the cursor over the code.
- In the concordance sidebar, the language code is a link to the page of that language on SIL’s ISO 639-3 site.
- The extended search has a selection list for the language codes of the sentence language.
- The attribute name shows the language code standard (ISO 639-3).
In the Suomi24 2001–2020 corpus, the text attribute name “virkkeen polaarisuus” (sentence polarity) has been changed to “virkkeen tunnesävyn polaarisuus” (sentence sentiment polarity), and the internal name of the attribute (used, for example, in the advanced search) has been changed from sentence_polarity to sentence_sentiment_polarity.
In the dialect corpus of the Syntax Archive, the ELFA corpus (English as a Lingua Franca in Academic Settings) and the ScotsCorr corpus, the results of the extended search include matches in which punctuation marks and annotations represented as tokens may occur between the tokens explicitly specified in the extended search. Such tokens therefore do not need to be taken into account separately in the search condition of the extended search. This feature existed in the “old Korp” (Korp 5), which was shut down in June 2024.
The ScotsCorr corpus finally works in this Korp version. In addition, the text attribute name “käsiala (toissijainen)” (script type (secondary)) is now shown correctly.
The video links of the Reitti A-siipeen corpus (Reittidemo) work again.

The news items in Korp’s news window contain some further details about these changes.

Korp moved to a new server, with some fixes and changes

The Korp service of the Language Bank of Finland was moved to a new server on 12 November 2024. Korp also got a few minor fixes and changes listed below. We apologize for some features having been broken for a long time.

If something does not work as before, please send feedback either via the feedback form or by email to fin-clarin (at) helsinki.fi.

Fixes and changes:

The time interval selector (text attribute time interval) in the extended search works again.
The representation of the text attributes containing the identified language(s) of a sentence, paragraph and text has been changed. The changes affect the Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2 and the Suomi24 2018–2020 corpus. The internal representations of the attributes are intact, so they can be used in the CQP expressions of the advanced search as before. The changes are the following:
- A language is always represented by its three-letter ISO 639-3 code.
- If a language code has a translation, it is shown as a tooltip in the sidebar of the KWIC result when hovering over the code.
- A language code in the KWIC sidebar is a link to the page of the language in question on the SIL’s ISO 639-3 site.
- The extended search has a selection list for language codes (sentence only).
- The attribute label includes the language code standard (ISO 639-3).
In the Suomi24 2001–2020 corpus, the text attribute name sentence polarity has been changed to sentence sentiment polarity, and the internal name of the attribute (used e.g. in the advanced search) has been changed from sentence_polarity to sentence_sentiment_polarity.
In The Finnish Dialect Corpus of the Syntax Archive (LA-murre), The Corpus of English as a Lingua Franca in Academic Settings (ELFA) and ScotsCorr, the search results of the extended search include matches with punctuation marks and annotations represented as tokens between the tokens explicitly specified in the extended search. Such tokens thus need not be explicitly taken into account in the extended search expression. This feature was present in the “old Korp” (Korp 5) that was shut down in June 2024.
The ScotsCorr corpus finally works in this Korp version. In addition, the name of the text attribute script type (secondary) is now shown correctly.
The video links in the Route to A wing Corpus (Reittidemo) work again.
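Saved CQP expressions that still use the old internal attribute name from the list above need to be updated. The following Python sketch shows one way to do that; only the two attribute names come from this page. The function name update_cqp, the example query, its `_.` prefix and the value "positive" are assumptions made for illustration.

```python
import re

# Old internal attribute name -> new internal attribute name (from the list above).
RENAMED = {"sentence_polarity": "sentence_sentiment_polarity"}

# Match the old names only as whole identifiers, not inside longer names.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMED)) + r")\b")


def update_cqp(query: str) -> str:
    """Replace renamed attribute names in a saved CQP expression."""
    return _PATTERN.sub(lambda match: RENAMED[match.group(1)], query)


# Hypothetical saved query; the "_." prefix and the value are assumed.
old_query = '[word = "hyvä" & _.sentence_polarity = "positive"]'
new_query = update_cqp(old_query)
print(new_query)
```

Because the new name does not contain the old one as a whole identifier, running the function on an already updated query leaves it unchanged.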

For some more details, please see the corresponding news items on the Korp newsdesk.

Mylly will be discontinued on 17th June 2024

Due to very low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down when CSC’s cloud services move to the new version of Rahti during summer 2024. Mylly will be available until 17th June 2024. Because of the short notice, we will keep users’ data for three months after the shutdown.

In case you wish to download your data, you can do it yourself by 17th June or by contacting CSC service desk within three months.

In case you wish to utilise the tool scripts from Mylly on other services (e.g., Puhti or CSC Notebooks), the software will still be available on GitHub.

The Mylly service will be closed on 17 June 2024

Due to low usage, the Mylly service (https://mylly.rahtiapp.fi) will be shut down in connection with the move of CSC’s cloud services to the new version of Rahti during summer 2024. Mylly remains available until 17 June 2024. Because of the tight schedule, we aim to keep users’ materials for a further 3 months after that date.

If you want to retrieve the materials you had in Mylly, you can download them yourself until 17 June, or during the following three months by contacting CSC’s service desk.

If you want to use Mylly’s tool scripts on other platforms (e.g. Puhti or CSC Notebooks), the scripts will remain available on GitHub.

GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology.

The GiellaLT website contains the technical documentation of the GiellaLT infrastructure, developed and used by Divvun and Giellatekno.

It is an open source website providing analysers and tools for a wide range of languages, as well as a ready-made setup for adding more languages.

Testing and enhancement of language models (transducers) from GiellaLT

The Language Bank of Finland is currently in the process of evaluating the state of development of GiellaLT’s analysers for individual languages in relation to text data being annotated for the Korp search engine.

Read more about the details and findings of the evaluation performed by Jack Rueter.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050301

Testing and enhancement of language models (transducers) from GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. The web site of GiellaLT offers language models (transducers) for a wide range of languages. Writing documentation for each language repository is an ongoing effort, and part of the development process.

Analyser enhancement

The GiellaLT infrastructure, with its implementation of finite-state tools, allows people working with different languages to make use of technological solutions that, otherwise, might require several years of individual development. It is here that descriptions for many of the Uralic languages have been initialized and developed as both financed projects and the work of language technology enthusiasts.
The GiellaLT infrastructure makes it possible to reuse finite-state descriptions and even encourages it. Thus, contributing to the enhancement of the finite-state tools at GiellaLT, when extending the annotation of corpora on the Language Bank of Finland’s Korp server, is beneficial to the search engine users as well.

On this page, we will evaluate the state of development of analysers for individual languages in relation to text data being annotated for the Korp search engine. This evaluation will therefore be aligned with the annotation of upcoming corpora, such as a new extended version of PaBiVUS (Parallel Biblical Verses for Uralic Studies). The objective is to increase the coverage of lemmatization and of morphological and syntactic annotation, which has not previously been offered for non-majority languages in the parallel corpus. We will therefore provide an illustrative depiction of each individual finite-state description and of the steps taken to improve it. The result should be seen as enhanced, though not complete, coverage of various genres as the work progresses.

The evaluations will tend to illustrate the capacities of the analysers, which do have equivalent generators, but the possible overproductivity of these generators is presently not the focus of these evaluations. In time, attention will also be drawn to the description of the disambiguation of morphological analyses, which is made possible in the open-source GiellaLT infrastructure. The enhanced descriptions, housed in GiellaLT, will serve as a contribution by the Language Bank of Finland to the shared responsibility for improving the coverage of lesser-described languages and the NLP addressing them. Thus, the resulting analysers will be available for building within the GiellaLT infrastructure or through the UralicNLP Python, Java and .NET libraries, available on GitHub or from the Language Bank of Finland.

For more details see the complete description on the analyser enhancement by Jack Rueter.

Evaluations of analysers for individual languages:

Please follow this link for a follow-up on the analyser enhancement by Jack Rueter.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050302

Nordic Tweet Stream (NTS) search and visualization interface


The NTS is a multilingual monitor corpus containing geolocated tweets and associated metadata from the Nordic countries. In all, it contains nearly 74 million messages from hundreds of thousands of user accounts in Denmark, Finland, Iceland, Norway and Sweden. The NTS data cover the period between January 2013 and May 2023, and they were collected with the Twitter Academic API, which has since been closed.

The purpose of the NTS is to facilitate basic research in SSH. The NTS has an easy-to-use graphical user interface that supports quick access to the data, so that researchers can concentrate on analysing it. The dataset makes different types of research possible. For example, it is possible to study public discussion and sentiment about events in recent history (e.g. the COVID-19 pandemic, the NATO membership process, etc.). The dataset is also a resource for sociolinguistic research and for researchers of multilingualism.

Visit the website.

More information about the NTS

If you use the NTS interface and make use of the results in your publications, please cite the following article, which is available online:
[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.

This page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024041502

Nordic Tweet Stream (NTS) search & visualization interface


The NTS is a multilingual monitor corpus of geolocated tweets and associated metadata from the Nordic region. Altogether, it contains nearly 74 million messages from hundreds of thousands of user accounts from Denmark, Finland, Iceland, Norway, and Sweden. The NTS data cover the period between January 2013 and May 2023 and were collected using the Twitter Academic API, which is now closed.

The purpose of the NTS is to facilitate fundamental research in the social sciences and humanities (SSH). The NTS comes with an easy-to-use graphical interface that supports quick data access so that researchers can focus on data analysis. The dataset enables various types of research. For instance, it is possible to study public discourse and sentiment concerning events in recent history (e.g., the COVID-19 pandemic and the NATO membership process). The dataset is also a resource for sociolinguistic research and for scholars of multilingualism.

Please visit the website.

About NTS

If you use the NTS interface and report the findings in your publications, please cite the following paper, which is available online:
[1] Laitinen, Mikko, Jonas Lundberg, Magnus Levin & Rafael Martins. 2018. The Nordic Tweet Stream: A Dynamic Real-Time Monitor Corpus of Big and Rich Language Data, Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org, online CEUR-WS.org/Vol-2084/short10.pdf.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024041501

HeLI-OTS 1.5: an automatic language identifier for 200 different languages

Have you been looking for a tool that could identify the language of the sentences in a text?
Check out the latest version of HeLI-OTS, 1.5: https://www.kielipankki.fi/tools/heli-ots/

HeLI-OTS 1.5 – an off-the-shelf language identifier for 200 languages

Have you been looking for a tool that can identify the language of individual sentences in text?
Take a look at HeLI-OTS version 1.5: https://www.kielipankki.fi/tools/heli-ots/

Last modified on 2024-03-26


Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317

More contact information