This resource collection contains word embeddings trained with word2vec from various corpora.
The embedding file uses the simple, easily parsed text format produced by word2vec. The first line of the file gives the vocabulary size and the vector dimension. Each subsequent line begins with a vocabulary item, followed by a space and then 128 floating-point numbers in textual form, each followed by a space.
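The format described above can be read with a few lines of Python. The following is only a minimal sketch: the function name `read_word2vec_text` is hypothetical, and the sample data uses 3 dimensions and made-up values instead of the 128 dimensions of the actual files.

```python
import io


def read_word2vec_text(stream):
    """Parse word2vec's textual format from an open text stream.

    Returns a dict mapping each vocabulary item to a list of floats.
    """
    # Header line: "<vocabulary size> <dimension>".
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])  # vocab_size is read but not enforced

    vectors = {}
    for line in stream:
        line = line.rstrip()  # drops the newline and the trailing space
        if not line:
            continue
        # Split from the right so exactly `dim` numbers are separated from the word.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Toy input with made-up values (assumption: not taken from the real resource).
sample = io.StringIO("2 3\nkissa 0.1 0.2 0.3 \nkoira -0.4 0.5 0.6 \n")
embeddings = read_word2vec_text(sample)
print(len(embeddings), embeddings["kissa"])
```

For a downloaded file, the same function could be called on a stream opened with `open(path, encoding="utf-8")`, assuming the file is UTF-8 encoded.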
Latest versions/subcorpora:
Word embeddings trained with word2vec from the Finnish Text Collection | Metadata and license | Attribution instructions | Download the resource
Word embeddings trained with word2vec from the Suomi24 corpus | Metadata and license | Attribution instructions | Download the resource
Search for all versions of this resource in META-SHARE
Several versions of this language resource are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool. Links to the individual versions can be found in the list above.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022041401
Latest versions/subcorpora:
FinnONTO – ONKI | Metadata and license | Open the website
Look for all versions of this resource in META-SHARE
The ONKI service contains Finnish and international ontologies, vocabularies and thesauri needed for publishing content cost-efficiently on the Semantic Web. ONKI is published and maintained by the Semantic Computing Research Group (SeCo). It is part of the ongoing project to build a national Semantic Web infrastructure for Finland (FinnONTO).
The service offers ontologies in categories such as:
– General upper ontology
– Museum artifacts
– Music
– Design
– Health
– Photography
– Agriculture
– Government
– Literature
– Linguistics
– Literary research
– Cultural research
– Economics
– Seafaring
– Military
All the ontologies are being merged into a single ontology that covers every category, the Finnish Collaborative Holistic Ontology (KOKO).
Most of the ontologies are multilingual. In the General upper ontology, the names of concepts are in Finnish, Swedish and English, while in the Linguistics ontology, for example, the languages used are Finnish, Swedish, English, German and Estonian.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021093001
The Helsinki Term Bank for the Arts and Sciences (HTB) is a multidisciplinary project that aims to compile a permanent terminological database for all fields of research in Finland. The project has created this Semantic MediaWiki platform, which offers a collaborative environment: anyone can freely use it and also take part in the discussion about terms.
The Helsinki Term Bank for the Arts and Sciences | Metadata and license | Attribution instructions | Open the website
Detailed information on the content, user rights and licenses can be found in the metadata record.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092002
The Finnish WordNet is a lexical database for Finnish. It is a part of the FIN-CLARIN infrastructure project.
FinnWordNet is licensed under the Creative Commons Attribution (CC-BY) 3.0 licence. As a derivative of the Princeton WordNet, FinnWordNet is also subject to the Princeton WordNet licence.
FinnWordNet contains words (nouns, verbs, adjectives and adverbs) grouped by meaning into synonym groups representing concepts. These synonym groups are linked to each other with relations such as hyponymy and antonymy, creating a semantic network.
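The structure described above, synonym groups linked by typed relations, can be pictured as a small graph. The sketch below is purely illustrative: the `Synset` class, the identifiers, and the toy Finnish entries are assumptions made for this example and do not reflect FinnWordNet's actual data format or contents.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A concept: a group of synonymous words plus typed links to other synsets."""
    synset_id: str
    words: list
    relations: dict = field(default_factory=dict)  # relation name -> list of synset ids


def transitive_hyponyms(synsets, start_id):
    """Collect every synset reachable from start_id through 'hyponym' links."""
    found = set()
    stack = list(synsets[start_id].relations.get("hyponym", []))
    while stack:
        sid = stack.pop()
        if sid in found:
            continue  # guard against cycles and repeated visits
        found.add(sid)
        stack.extend(synsets[sid].relations.get("hyponym", []))
    return sorted(found)


# Toy network (hypothetical ids and entries, for illustration only).
network = {
    "animal": Synset("animal", ["eläin"], {"hyponym": ["dog", "cat"]}),
    "dog": Synset("dog", ["koira", "hauva"], {"hyponym": ["puppy"]}),
    "cat": Synset("cat", ["kissa"]),
    "puppy": Synset("puppy", ["koiranpentu"]),
}

print(transitive_hyponyms(network, "animal"))
```

Other relations, such as antonymy, would fit the same `relations` dictionary under their own keys.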
FinnWordNet can be used in language technology research and applications. It can also be used interactively as an electronic thesaurus.
The first version of FinnWordNet was created by having professional translators translate the words of the original English (Princeton) WordNet (version 3.0) into Finnish.
Detailed information: http://www.kielipankki.fi/corpora/finnwordnet/
Latest versions/subcorpora:
The Downloadable Version of the Finnish WordNet | Metadata and license | Attribution instructions | Download the resource
The Sanat Version of the Finnish WordNet | Metadata and license | Attribution instructions | Open the corpus in Sanat
Search for these versions in META-SHARE
Different versions of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Sanat Dictionary Service. Links to the individual versions can be found in the list above.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2014052714
The database of Finnish semantic frames is based on the original English-language FrameNet housed at the International Computer Science Institute in Berkeley, California. The Finnish FrameNet project started by collecting 90,592 frame examples from the original Berkeley FrameNet. The examples represented 866 different frames and the elements that evoke them.
The FinnFrameNet project is a part of the FIN-CLARIN consortium.
Latest versions/subcorpora:
Finnish FrameNet | Metadata and license | Attribution instructions | Open the corpus in Sanat
The Sanat Version of the Finnish FrameNet | Metadata and license | Attribution instructions | Open the corpus in Sanat
The Sanat Version of the Finnish TransFrameNet | Metadata and license | Attribution instructions | Open the corpus in Sanat
Search for these versions in META-SHARE
Different versions of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Sanat Dictionary Service. Links to the individual versions can be found in the list above.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021091403
This resource offers a multilingual Bidirectional Encoder Representations from Transformers (BERT) model trained from scratch on three languages: Finnish, Estonian, and English. It can be used for various NLP classification tasks in these three languages, supporting both monolingual and multilingual/cross-lingual (knowledge transfer) tasks. Whole-word masking was used during data preparation and training; the model was trained for 40 epochs with sequence length 128 and for another 4 epochs with sequence length 512. The FinEst BERT model published here is in PyTorch format.
Corpora used:
Finnish – STT articles, CoNLL 2017 shared task, Ylilauta downloadable version
Estonian – Ekspress Meedia articles, CoNLL 2017 shared task
English – English Wikipedia
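The whole-word masking mentioned in the description can be illustrated with a short sketch. It assumes WordPiece-style tokens in which continuation pieces start with `##`, and it uses a 15% masking probability as a placeholder; the tokenization of the sample sentence and the function name `whole_word_mask` are hypothetical, and the actual FinEst BERT preprocessing may differ (for example, standard BERT training also replaces some selected tokens with random tokens or leaves them unchanged, which this sketch omits).

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=None, mask_token="[MASK]"):
    """Mask tokens so that all pieces of a word are masked together or not at all."""
    rng = random.Random(seed)

    # Group token indices into words: a piece starting with "##" continues
    # the previous word (assumed WordPiece convention).
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    for group in words:
        # One random draw per word, applied to every piece of that word.
        if rng.random() < mask_prob:
            for i in group:
                masked[i] = mask_token
    return masked


# Hypothetical tokenization of a short Finnish phrase.
tokens = ["kieli", "##pankki", "on", "avoin"]
print(whole_word_mask(tokens, mask_prob=0.5, seed=0))
```

The point of the per-word draw is that a model can never recover a masked piece simply by looking at the unmasked remainder of the same word.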
Latest versions/subcorpora:
FinEst BERT | Metadata and license | Attribution instructions | Download the resource
Search for these versions in META-SHARE
Different versions of this language corpus are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or the Korp concordance tool, or they are offered by another member organisation of FIN-CLARIN. Links to the individual versions can be found in the list above.
Detailed information on the content, user rights and licenses of each version can be found in its metadata record in META-SHARE.
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021091402
Team 1 of the Centre of Excellence in Ancient Near Eastern Empires (ANEE) has created a lexical portal that functions as a graphic semantic dictionary. Via this portal, users can explore semantic networks for one or more words of interest. By following the links, they can also trace attestations back to the dataset in Korp and from there to the Open Richly Annotated Cuneiform Corpus (Oracc).
Website of ANEE Lexical Networks v. 2.0
Latest versions/subcorpora:
ANEE Lexical Networks v. 2.0 – the dataset | Metadata and license | Attribution instructions | Open the website
Archived versions:
ANEE lexical portal of Akkadian: fastText | Metadata and license | Attribution instructions | Open the website
ANEE lexical portal of Akkadian: PMI | Metadata and license | Attribution instructions | Open the website
ANEE lexical portal of Akkadian: dataset | Metadata and license | Attribution instructions | Open the website
This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021082001