wordvec – Word embeddings trained with word2vec

This resource collection contains word embeddings trained with word2vec from various corpora.

The embedding file is in a simple and easily parsed textual format produced by word2vec. The first line in the file gives the vocabulary size and dimension. Each line after that begins with a vocabulary item, followed by a space, followed by 128 floating point numbers (represented textually) each followed by a space.
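The format described above can be read with a few lines of code. The following is a minimal sketch, not an official loader: the function name `read_word2vec_text` and the three-dimensional sample data are illustrative assumptions, and the vector length is taken from the header line rather than hard-coded.

```python
import io


def read_word2vec_text(stream):
    """Parse embeddings in the textual word2vec format described above.

    Returns a dict mapping each vocabulary item to a list of floats.
    """
    # Header line: "<vocabulary size> <dimension>"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in stream:
        # Drop the newline and the space that follows the last number.
        parts = line.rstrip("\n").rstrip(" ").split(" ")
        if parts == [""]:
            continue  # skip empty lines
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = [float(v) for v in values]

    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} items, found {len(vectors)}")
    return vectors


# Tiny in-memory sample (hypothetical data; the real files are far larger).
sample = io.StringIO("2 3\nkissa 0.1 0.2 0.3 \nkoira -0.4 0.5 0.6 \n")
embeddings = read_word2vec_text(sample)
print(embeddings["kissa"])
```

For a downloaded file, the same function could be called on an opened text file, for example `open(path, encoding="utf-8")`, assuming the file is UTF-8 encoded.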

Latest versions/subcorpora:  
Word embeddings trained with word2vec from the Finnish Text Collection
Metadata and license
Attribution instructions
Download the resource
Word embeddings trained with word2vec from the Suomi24 corpus
Metadata and license
Attribution instructions
Download the resource
Search for all versions of this resource in META-SHARE  

Several versions of this language resource are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022041401

FinnONTO

Latest versions/subcorpora:  
FinnONTO – ONKI
Metadata and license
Open the website
Look for all versions of this resource in META-SHARE  

The ONKI service contains Finnish and international ontologies, vocabularies and thesauri needed for publishing content cost-efficiently on the Semantic Web. ONKI is published and maintained by the Semantic Computing Research Group (SeCo). It is part of the ongoing project to build a national semantic web infrastructure for Finland (FinnONTO).

The service offers ontologies in categories such as:
– General upper ontology
– Museum artifacts
– Music
– Design
– Health
– Photography
– Agriculture
– Government
– Literature
– Linguistics
– Literary research
– Cultural research
– Economics
– Seafaring
– Military

All the ontologies are being merged into a single ontology covering every category, called the Finnish Collaborative Holistic Ontology (KOKO).

Most of the ontologies are multilingual. In the General upper ontology, the names of concepts are in Finnish, Swedish and English, while in the Linguistics ontology, for example, the languages used are Finnish, Swedish, English, German and Estonian.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021093001

The Helsinki Term Bank for the Arts and Sciences

The Helsinki Term Bank for the Arts and Sciences (HTB) is a multidisciplinary project which aims to gather a permanent terminological database for all fields of research in Finland. The project has created this Semantic MediaWiki platform, which offers a collaborative environment. This means that anyone can freely use it and also participate in the discussion about terms.

The Helsinki Term Bank for the Arts and Sciences
Metadata and license
Attribution instructions
Open the website

Detailed information on the content, user rights and licenses can be found in the metadata record.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021092002

Finnish WordNet

The Finnish WordNet is a lexical database for Finnish. It is a part of the FIN-CLARIN infrastructure project.

FinnWordNet is licensed under the Creative Commons Attribution (CC-BY) 3.0 licence. As a derivative of the Princeton WordNet, FinnWordNet is also subject to the Princeton WordNet licence.

FinnWordNet contains words (nouns, verbs, adjectives and adverbs) grouped by meaning into synonym groups representing concepts. These synonym groups are linked to each other with relations such as hyponymy and antonymy, creating a semantic network.
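The structure described above, synonym groups linked by typed relations, can be pictured as a small graph. The sketch below is only an illustration of that idea: the `Synset` class, the `related` helper, the identifiers and the sample entries are all hypothetical and do not reflect FinnWordNet's actual file format or identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Synset:
    """A synonym group: several words that express one concept."""
    synset_id: str   # hypothetical identifier, not a real FinnWordNet ID
    pos: str         # part of speech: noun, verb, adjective or adverb
    lemmas: list[str]
    # relation name -> identifiers of the target synonym groups
    relations: dict[str, list[str]] = field(default_factory=dict)


def related(network: dict[str, Synset], synset_id: str, relation: str) -> list[Synset]:
    """Follow one relation type from a synonym group to its neighbours."""
    targets = network[synset_id].relations.get(relation, [])
    return [network[target] for target in targets]


# Toy network with invented entries: "eläin" (animal) has the hyponym "koira" (dog).
network = {
    "toy-animal": Synset("toy-animal", "noun", ["eläin"], {"hyponym": ["toy-dog"]}),
    "toy-dog": Synset("toy-dog", "noun", ["koira", "hauva"], {"hypernym": ["toy-animal"]}),
}

for synset in related(network, "toy-animal", "hyponym"):
    print(synset.lemmas)
```

Storing relations as identifier lists rather than object references keeps the toy network easy to serialise; a real application would more likely read the relations from the distributed data files.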

FinnWordNet can be used in language technology research and applications. It can also be used interactively as an electronic thesaurus.

The first version of FinnWordNet was created by having the words of the original English (Princeton) WordNet (version 3.0) translated into Finnish by professional translators.

Detailed information: http://www.kielipankki.fi/corpora/finnwordnet/

Latest versions/subcorpora:
The Downloadable Version of the Finnish WordNet
Metadata and license
Attribution instructions
Download the resource
The Sanat Version of the Finnish WordNet
Metadata and license
Attribution instructions
Open the corpus in Sanat
Search for these versions in META-SHARE

Different versions of this language resource are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Sanat Dictionary Service. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2014052714

Finnish FrameNet

The database of Finnish semantic frames is based on the original English-language FrameNet housed at the International Computer Science Institute in Berkeley, California. The Finnish FrameNet project started by collecting 90,592 frame examples from the original Berkeley FrameNet. The examples represented 866 different frames and the elements that evoke them.

The FinnFrameNet project is a part of the FIN-CLARIN consortium.

Latest versions/subcorpora:
Finnish FrameNet
Metadata and license
Attribution instructions
Open the corpus in Sanat
The Sanat Version of the Finnish FrameNet
Metadata and license
Attribution instructions
Open the corpus in Sanat
The Sanat Version of the Finnish TransFrameNet
Metadata and license
Attribution instructions
Open the corpus in Sanat
Search for these versions in META-SHARE

Different versions of this language resource are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Sanat Dictionary Service. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021091403

FinEst BERT

This resource offers a multilingual Bidirectional Encoder Representations from Transformers (BERT) model trained from scratch on three languages: Finnish, Estonian, and English. It can be used for various NLP classification tasks in these three languages, and it supports both monolingual and multilingual/cross-lingual (knowledge transfer) tasks. Whole-word masking was used during data preparation and training; the model was trained for 40 epochs with sequence length 128 and for another 4 epochs with sequence length 512. The FinEst BERT model published here is in PyTorch format.

Corpora used:
Finnish – STT articles, CoNLL 2017 shared task, Ylilauta downloadable version
Estonian – Ekspress Meedia articles, CoNLL 2017 shared task
English – English Wikipedia
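To illustrate the whole-word masking mentioned in the description above: when a word is split into several subword pieces, either all of its pieces are masked or none are. The sketch below is a simplified illustration and not the FinEst BERT training code. It assumes the common WordPiece convention in which continuation pieces start with `##`, and the function names, the 15% masking probability and the sample tokens are assumptions made for the example.

```python
import random


def group_wordpieces(tokens):
    """Group token indices so that each group covers one whole word.

    Assumes continuation pieces are marked with a leading "##".
    """
    groups = []
    for index, token in enumerate(tokens):
        if token.startswith("##") and groups:
            groups[-1].append(index)   # continuation of the previous word
        else:
            groups.append([index])     # start of a new word
    return groups


def whole_word_mask(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Mask whole words: every piece of a selected word is replaced.

    Simplification: selected pieces are always replaced by the mask token;
    real pre-training setups often also keep or randomise some of them.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    for group in group_wordpieces(tokens):
        if rng.random() < mask_prob:
            for index in group:
                masked[index] = mask_token
    return masked


# Hypothetical tokenisation of a short Finnish phrase.
tokens = ["kieli", "##pankki", "on", "avoin"]
print(whole_word_mask(tokens, mask_prob=0.5, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance makes the selection reproducible, which is convenient when inspecting the behaviour on small inputs.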

Latest versions/subcorpora:
FinEst BERT
Metadata and license
Attribution instructions
Download the resource
Search for these versions in META-SHARE

Different versions of this language resource are (or may in the future be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool, or they are offered by another member organisation of FIN-CLARIN. Links to the different versions can be found in the list above.

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021091402

ANEE Lexical Networks

Team 1 of the Centre of Excellence in Ancient Near Eastern Empires (ANEE) has created a lexical portal that functions as a graphic semantic dictionary. Via this portal, the user can explore semantic networks for one or more words of interest. By following the links, one can also trace attestations back to the dataset in Korp and from there to the Open Richly Annotated Cuneiform Corpus (Oracc).

Website of ANEE Lexical Networks v. 2.0

Latest versions/subcorpora:  
ANEE Lexical Networks v. 2.0 – the dataset
Metadata and license
Attribution instructions
Open the website
Archived versions:
ANEE lexical portal of Akkadian: fastText
Metadata and license
Attribution instructions
Open the website
ANEE lexical portal of Akkadian: PMI
Metadata and license
Attribution instructions
Open the website
ANEE lexical portal of Akkadian: dataset
Metadata and license
Attribution instructions
Open the website
Look for all versions in META-SHARE  

Detailed information on the content of each version, user rights and licenses can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2021082001

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317

More contact information