HeLI-OTS

HeLI-OTS (off-the-shelf) is a language identifier with language models for 200 languages. The program reads <infile>, classifies the language of each line as one of the 200 languages it knows, and writes the results, one ISO 639-3 code per line, to <outfile>. Using one core of a 2021 laptop and around 3 gigabytes of memory, it can identify about 3000 sentences per second.
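Because the output holds exactly one code per input line, results can be matched back to the input by line number. The following is a minimal Python sketch of that post-processing step only; it does not call HeLI-OTS itself. The function names and the toy data are illustrative assumptions, and in practice the two lists would be read from <infile> and <outfile>.

```python
from collections import Counter
from itertools import zip_longest


def pair_lines_with_codes(lines, codes):
    """Pair each input line with the code found on the same line of the output."""
    pairs = []
    for text, code in zip_longest(lines, codes):
        if text is None or code is None:
            raise ValueError("input and output have different numbers of lines")
        pairs.append((text.rstrip("\n"), code.strip()))
    return pairs


def language_distribution(codes):
    """Count how many lines were assigned to each ISO 639-3 code."""
    return Counter(code.strip() for code in codes)


# Toy data standing in for the contents of <infile> and <outfile> (assumed, not real output).
sample_lines = ["Hyvää huomenta", "Good morning", "God morgon", "Kiitos paljon"]
sample_codes = ["fin", "eng", "swe", "fin"]

pairs = pair_lines_with_codes(sample_lines, sample_codes)
distribution = language_distribution(sample_codes)
print(distribution.most_common())
```

The length check guards against a truncated output file, which would otherwise silently shift every label.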

Producing and publishing this software has been partly supported by Tandem Industry Academia funding from the Finnish Research Impact Foundation, in cooperation with Lingsoft.

Latest versions/subcorpora:  
HeLI-OTS 1.2
Metadata and license
Attribution instructions
Open the website
Look for all versions in META-SHARE  

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022011801

Suomi 24 resource group

Latest versions and variants:  
The Suomi 24 Sentences Corpus 2001-2020, Korp version (BETA)
Metadata and license
Citation instructions
Open the resource in Korp (BETA)
(including the years 2001-2017 and the update 2018-2020)
The Suomi 24 Sentences Corpus 2018-2020, Korp-version (BETA)
Metadata and license
Citation instructions
Open the resource in Korp (BETA)
The Suomi24 Sentences Corpus 2001-2017, Korp version 1.2
Metadata and license
Citation instructions for this version
Open the resource in Korp
The Suomi24 Corpus 2001-2017, VRT version 1.1
Metadata and license
Citation instructions for this version
Download the resource
Search for all available versions  

The resource consists of the discussions posted on the Suomi 24 discussion forum. The content has been annotated with automatic methods and stored in VRT format.

Via the Korp service, it is possible to run versatile search queries on the content and to obtain various statistics and visualizations (see Korp instructions).

Without logging in to Korp, you can see the items matching your search criteria as brief excerpts only. Each word token in the concordance links to the original message and discussion thread on the Suomi 24 discussion platform, if they are still available there. Researchers who need to view the wider context around the matching items can log in.

In addition to the corpus versions that are available in Korp, the corresponding full text documents are available for logged-in researchers in VRT format, either in the CSC computing environment or as downloadable packages via the download service of Kielipankki. In order to use the computing environment, researchers need a CSC user account. Please note, however, that using the full text data efficiently usually requires some technical and programming skills. The Korp service provides many opportunities for studying and analyzing the Suomi 24 corpus, so it is recommended that you first check whether Korp is suitable for your purpose.
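As a rough idea of the programming involved, VRT files generally hold one token per line with tab-separated annotations, wrapped in XML-like structural tags. The Python sketch below reads such lines into sentences. The tag names (text, sentence), the attribute names and the column order are assumptions chosen for illustration; the actual ones are defined in the documentation of each corpus version.

```python
import re

# Matches name="value" pairs inside a structural tag.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vrt(lines, positional_names):
    """Collect sentences from VRT lines as (text attributes, tokens) pairs."""
    sentences = []
    text_attrs = {}
    tokens = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("<text"):
            text_attrs = dict(ATTR_RE.findall(line))
        elif line.startswith("<sentence"):
            tokens = []
        elif line.startswith("</sentence"):
            sentences.append((text_attrs, tokens))
            tokens = None
        elif line.startswith("<"):
            continue  # other structural tags and comments are ignored in this sketch
        elif tokens is not None:
            # A token line: one tab-separated value per positional attribute.
            tokens.append(dict(zip(positional_names, line.split("\t"))))
    return sentences


# Toy input; the attribute values and columns are made up for the example.
sample = [
    '<text title="Example thread" date="2015-03-01">',
    '<sentence id="1">',
    "Hyvää\thyvä\tA",
    "huomenta\thuomen\tN",
    "</sentence>",
    "</text>",
]

sentences = parse_vrt(sample, ["word", "lemma", "pos"])
attrs, first_sentence = sentences[0]
print(attrs["title"], [token["word"] for token in first_sentence])
```

For real corpus files, the same function can be given an open file object instead of a list, so that the data is processed line by line without loading it all into memory.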

 

Persistent identifier of this page: http://urn.fi/urn:nbn:fi:lb-2022011221

Corpus Title

Current versions of this resource: 
Corpus Title, Korp version
Metadata and license
Attribution instructions
Select the corpus in Korp
Corpus Title, download version
Metadata and license
PRIV: See privacy guidelines
Attribution instructions
Apply for rights to access the resource
Download the resource
Look for other versions of this resource

Information about the removal of the LAT version of this corpus in November 2020

Due to technical reasons, the LAT service (lat.csc.fi) will be discontinued in the Language Bank of Finland as of November 30, 2020. After this, the LAT version of this corpus will no longer be available. However, the content will be made available for download. In case you urgently need the downloadable data, please contact us.

Corpus contents

The corpus consists of…

Other details about the content and the terms and conditions regarding the different corpus versions are available in the corresponding metadata records.

Example queries from the Korp version of this corpus


Privacy guidelines

Corpus XYZ contains personal data. When using the corpus, follow the personal data guidelines provided by the Language Bank of Finland. Below, you can find a description of the types of personal data that are included in the corpus as well as details on additional specific restrictions that you need to comply with when processing the personal data in question.

[This part should contain the description and corpus-specific restrictions regarding the processing of the personal data in the corpus, as stated by the data controller in the deposition license agreement.]

Nimiarkisto

Nimiarkisto.fi is a portal with the most important digital resources of names and named entities collected from and archived in Finland. The service is offered by the Institute for the Languages of Finland.

Open the website

User Guidelines (in Finnish)

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021111902

finsentiment

Finsentiment estimates a sentiment (positive, negative, or neutral) for each sentence in the input text, and also for the input text as a whole.

The sentiment analysis relies on three resources:

  1. Word embeddings calculated from a corpus of Finnish text.
  2. Product reviews harvested from the Internet.
  3. A word-based convolutional neural network with 100 kernels each of sizes 2, 3, 4 and 5 words. The neural network is trained to predict the rating associated with product reviews, and the prediction it gives to the input text is converted to a sentiment.
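The convolutional layer in item 3 can be sketched in plain Python as follows. This is an illustration only, not the tool's implementation: the kernel widths and counts come from the description above, while the embedding size, the random (untrained) weights and the use of max-over-time pooling are assumptions. The final layer that predicts a rating, and the conversion from rating to sentiment, are left out because the description does not specify them.

```python
import random

KERNEL_WIDTHS = (2, 3, 4, 5)  # from the description above
KERNELS_PER_WIDTH = 100       # from the description above
EMBEDDING_DIM = 8             # assumed for the sketch; real embeddings are much larger


def make_kernels(rng):
    """Create random, untrained kernels: one weight per word position and embedding dimension."""
    return {
        width: [
            [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in range(width)]
            for _ in range(KERNELS_PER_WIDTH)
        ]
        for width in KERNEL_WIDTHS
    }


def convolve_and_pool(embedded, kernels):
    """Slide every kernel over the word sequence and keep its maximum response."""
    if len(embedded) < max(kernels):
        raise ValueError("the sketch needs at least as many words as the widest kernel")
    features = []
    for width, bank in kernels.items():
        positions = len(embedded) - width + 1
        for kernel in bank:
            responses = [
                sum(
                    kernel[offset][dim] * embedded[start + offset][dim]
                    for offset in range(width)
                    for dim in range(EMBEDDING_DIM)
                )
                for start in range(positions)
            ]
            features.append(max(responses))  # max-over-time pooling (assumed)
    return features


rng = random.Random(0)
kernels = make_kernels(rng)
# A seven-word "sentence" of random vectors standing in for real word embeddings.
sentence = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in range(7)]
features = convolve_and_pool(sentence, kernels)
print(len(features))  # 400: four kernel widths times 100 kernels each
```

Each kernel of width k looks at k consecutive words at a time, so the four widths together capture patterns spanning two to five words.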

At the moment this tool is available as a demo version.

Open the website

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110405

Terminology Forum

Terminology Forum is a global non-profit information forum for freely available terminological information online, created by experts and enthusiasts in various fields. The Forum was established in 1994 and is maintained by the University of Vaasa, Finland.

Open the website

The related corpus Terminology Forum Glossaries (selection), source is available for download in the download service of Kielipankki.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110404

ELAN

ELAN is a program for transcribing and annotating audio and video files, offered by The Language Archive. It can also be used for searching locally stored collections of annotated material.

User guidelines in Finnish

User guidelines in English

Install the tool

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110402

Finnish BERT (FinBERT)

A version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group. The model can be fine-tuned to achieve state-of-the-art results for various Finnish natural language processing tasks.

FinBERT has been pre-trained for 1 million steps on over 3 billion tokens (24B characters) of Finnish text drawn from news, online discussion, and internet crawls.

TurkuNLP

For more information, see the FinBERT project page

Install (GitHub)

FinBERT Kielipankki version: Kielipankki offers a version of Google’s BERT deep transfer learning model for Finnish. It is installed in CSC’s Puhti cluster and can be used via the pytorch 1.4 module. For details see /appl/data/kielipankki/bert_models/README.txt

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110401

Transkribus

Transkribus is a comprehensive platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents.

Open the website

User instructions

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110305

Semantic similarity of words (word2vec)

The tool is developed by the Turku NLP group for analyzing the semantic similarity of words.

Online demo

Documentation

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110304

WebAnno

WebAnno is a general-purpose web-based annotation tool for a wide range of linguistic annotations, including various layers of morphological, syntactic, and semantic annotations. Additionally, custom annotation layers can be defined, allowing WebAnno to be used for non-linguistic annotation tasks as well.

WebAnno is a multi-user tool supporting different roles such as annotator, curator, and project manager. The progress and quality of annotation projects can be monitored and measured in terms of inter-annotator agreement. Multiple annotation projects can be conducted in parallel.

More about WebAnno

The Language Bank of Finland’s instance of WebAnno

See the documentation

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110303

Mylly

Mylly is a versatile data analysis platform with interactive visualizations and workflows. It can be used to build workflows with a variety of tools, including morphosyntactic parsing, character set conversion and speech recognition.

Open the website

About Mylly

Mylly User Guide

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110302

Sparv Pipeline

Sparv, Språkbanken’s text analysis tool, is a multilingual toolkit provided by the Swedish Språkbanken for parsing and annotating text in various languages.

User manual

Latest Sparv release on GitHub

Sparv GUI

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021110301

Lääketutka

Lääketutka, “the Medicine Radar”, a real-time, open web service, provides analytics about health, medicine and symptom-related discussions in the Suomi24 discussion forum. It allows anyone to discover connections between drugs, symptoms and dosages – as they appear in the discussion data.

This content search tool was developed within a data science project by Futurice’s Chilicorn Fund and Citizen Mindscapes with data provided by Aller.

Access the website

More information about the project

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101305

CLARIN Federated Content Search

This tool lets you run a single centralized query across all the resources provided by CLARIN centers.

The Aggregator application is a part of the CLARIN-FCS common federated content search infrastructure. It serves as a user interface for performing queries on CLARIN resources and displaying the search results. The Aggregator communicates with components called endpoints, which are provided as a service by all centres that participate in the federated content search. Each endpoint provides access to one or more searchable resources. The user can select a specific resource or resources, based on the resource name or on the language, or search through all of them. The content of these resources is searched with the query supplied to the endpoint. Each endpoint returns its results for the query, and the Aggregator collects the responses from all the endpoints and displays them to the user.
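The fan-out and collection pattern described above can be sketched in Python as follows. This is a simplified illustration, not the Aggregator's code: the Endpoint class, its search method and the substring matching are hypothetical stand-ins, and in-memory objects replace the network requests that the real infrastructure sends to endpoints.

```python
from concurrent.futures import ThreadPoolExecutor


class Endpoint:
    """Hypothetical stand-in for a centre's endpoint exposing searchable resources."""

    def __init__(self, name, resources):
        self.name = name
        self.resources = resources  # resource name -> list of text snippets

    def search(self, query, selected=None):
        """Return (endpoint, resource, snippet) hits, optionally limited to selected resources."""
        hits = []
        for resource, snippets in self.resources.items():
            if selected is not None and resource not in selected:
                continue
            hits.extend((self.name, resource, s) for s in snippets if query in s)
        return hits


def aggregate(endpoints, query, selected=None):
    """Send the same query to every endpoint in parallel and merge the responses."""
    with ThreadPoolExecutor() as pool:
        responses = pool.map(lambda endpoint: endpoint.search(query, selected), endpoints)
        return [hit for response in responses for hit in response]


# Toy endpoints with made-up resources.
endpoints = [
    Endpoint("centre-a", {"news": ["the cat sat", "a dog ran"]}),
    Endpoint("centre-b", {"forum": ["cat videos"], "books": ["no match here"]}),
]

all_hits = aggregate(endpoints, "cat")
forum_hits = aggregate(endpoints, "cat", selected={"forum"})
print(all_hits)
print(forum_hits)
```

Querying the endpoints in parallel keeps the total waiting time close to that of the slowest endpoint rather than the sum of all of them.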

Access the FCS Aggregator

Content Search Tutorial

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101304

Virtual Language Observatory CLARIN

The Virtual Language Observatory (VLO) faceted browser was developed within CLARIN as a means to explore linguistic resources, services and tools available within CLARIN and related communities. Its aim is to provide an easy-to-use interface that allows a uniform search and discovery process for a large number of resources from a wide variety of domains and providers.

Open the website

Access the faceted search

User guidelines

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101303

Text reuse in the Swedish-language press, 1645-1918

This is a search engine for searching and analyzing text reuse clusters in the Swedish-language press from 1645 to 1918. It covers material from Finland, Sweden, and also the United States. The search engine is offered by the Society of Swedish Literature in Finland (SLS).

Open the website: https://textreuse.sls.fi/

Guidelines for using the search engine

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101302

Digital collections from the National Library of Finland

With this download service, offered by the National Library of Finland, it is possible to download magazines and newspapers, small printed works, books, handwritten documents and notes, as long as they are free of copyright. Newspapers are downloadable until the end of the year 1918, magazines and small printed works until the end of the year 1910. Other material is downloadable according to its specific copyright information, and the time limit can vary between the different kinds of materials and works.

Open the website: digi.kansalliskirjasto.fi

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101301

Texthammer

Texthammer is a search and analysis toolkit for parallel corpora provided by the University of Tampere.

For more details please see the user manual (pdf).

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101111

Wanca

Wanca is a portal for websites in Uralic languages. It offers a collection of links to web pages written in various Uralic languages. The pages have been found using the automated system developed in the SUKI project.

The SUKI project promotes the smaller Uralic languages, which means that links to pages written in Hungarian, Finnish or Estonian are not collected. Wanca is the result of the Language Programme of the Kone Foundation for small Finno-Ugric languages.

In Kielipankki, the Language Bank of Finland, the resource Wanca 2016 is available as a Korp version and for download.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021101110