Word embeddings trained with word2vec from the Finnish Text Collection

Metadata: http://urn.fi/urn:nbn:fi:lb-2022041405
Licence: CC-BY (https://creativecommons.org/licenses/by/4.0)
Resource shortname: ftc-wordvec

This package contains word embeddings trained with word2vec from newspaper text
in Kielipankki's Finnish Text Collection (FTC)
(http://urn.fi/urn:nbn:fi:lb-2016050206). The following files were used:

aamulehti.tar.gz
demari.tar.gz
hameensanomat.tar.gz
hyvinkaansanomat.tar.gz
iltalehti.tar.gz
kangasalansanomat.tar.gz
karjalainen.tar.gz
kauppalehti.tar.gz
keskisuomalainen.tar.gz
optio.tar.gz
suomenkuvalehti.tar.gz
tekniikanmaailma.tar.gz
turunsanomat.tar.gz

The embeddings were trained on the lemmas from the text annotations rather
than on surface forms. Inflected forms like "koiralta" are therefore absent
from the vocabulary and are all represented by the base form "koira".

All lemmas were also converted to lowercase. So names like "Niinistö" are
represented as "niinistö".
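
Because the vocabulary holds only lowercased lemmas, a query string has to be
normalized in the same way before it is looked up. The sketch below is only an
illustration: the function name lookup and the toy dictionary are hypothetical,
and the vectors are made-up two-dimensional values, not data from this package.
Lowercasing is handled here, but lemmatization is not; an inflected form must
be reduced to its base form with a separate tool first.

```python
from typing import Dict, List, Optional


def lookup(vectors: Dict[str, List[float]], lemma: str) -> Optional[List[float]]:
    """Return the vector for a lemma, lowercasing it first; None if absent."""
    return vectors.get(lemma.lower())


# Toy stand-in for the real embeddings (hypothetical values).
toy = {"koira": [0.1, 0.2], "niinistö": [0.3, 0.4]}

name_vector = lookup(toy, "Niinistö")      # found under the key "niinistö"
inflected_vector = lookup(toy, "koiralta")  # None: only the lemma "koira" exists
```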

The embedding file ftc-wordvec.txt contains 247 305 entries. The dimension of
the vector space is 100.

The embedding file is in a simple, easily parsed textual format produced by
word2vec. The first line of the file gives the vocabulary size and dimension.
Each line after that begins with a vocabulary item, followed by a space,
followed by 100 floating point numbers (represented textually), each followed
by a space.
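
The format described above can be read with a few lines of standard-library
Python. This is a minimal sketch, not an official loader: the function name
read_word2vec_text is hypothetical, and the sample uses a made-up
three-dimensional vocabulary instead of the real 100-dimensional data.

```python
import io
from typing import Dict, IO, List, Tuple


def read_word2vec_text(stream: IO[str]) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse word2vec's textual format from an open text stream."""
    # Header line: "<vocabulary size> <dimension>"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in stream:
        # rstrip() drops the newline and the space after the last number.
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Made-up sample in the same layout (2 entries, dimension 3).
sample = io.StringIO("2 3\nkoira 0.1 0.2 0.3 \nkissa 0.4 0.5 0.6 \n")
size, dim, vecs = read_word2vec_text(sample)
```

To read the real file, an open file object can be passed instead of the
sample, for example open("ftc-wordvec.txt", encoding="utf-8"). The UTF-8
encoding is an assumption; it is not stated in this README.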

For efficient processing, the file ftc-wordvec.bin contains a binary
representation of the embedding file.
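
This README does not specify the layout of the binary file. The sketch below
assumes the layout written by the original word2vec tool in binary mode: a
textual header line with the vocabulary size and dimension, then for each entry
the word, a space, the vector as 32-bit floats, and a newline. Little-endian
byte order and UTF-8 encoding of the words are further assumptions, and the
function name read_word2vec_binary is hypothetical. If ftc-wordvec.bin uses a
different layout, this reader will not work.

```python
import io
import struct
from typing import BinaryIO, Dict, Tuple


def read_word2vec_binary(
    stream: BinaryIO,
) -> Tuple[int, int, Dict[str, Tuple[float, ...]]]:
    """Read the (assumed) word2vec binary layout from an open binary stream."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    fmt = "<%df" % dim  # dim little-endian 32-bit floats (assumption)
    nbytes = struct.calcsize(fmt)

    vectors: Dict[str, Tuple[float, ...]] = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("unexpected end of data while reading a word")
            if ch == b" ":
                break  # the space ends the word
            if ch != b"\n":  # skip the newline that ends the previous entry
                word.extend(ch)
        data = stream.read(nbytes)
        if len(data) != nbytes:
            raise EOFError("unexpected end of data while reading a vector")
        vectors[word.decode("utf-8")] = struct.unpack(fmt, data)
    return vocab_size, dim, vectors


# Made-up sample in the assumed layout (2 entries, dimension 3).
sample_bytes = (
    b"2 3\n"
    + b"koira " + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n"
    + b"kissa " + struct.pack("<3f", 1.0, 0.25, -0.5) + b"\n"
)
bin_size, bin_dim, bin_vecs = read_word2vec_binary(io.BytesIO(sample_bytes))
```

For the real file, a file object opened in binary mode, such as
open("ftc-wordvec.bin", "rb"), would take the place of the in-memory sample.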