Word embeddings trained with word2vec from the Suomi24 corpus

Metadata: http://urn.fi/urn:nbn:fi:lb-2022061701
Licence: CC-BY (https://creativecommons.org/licenses/by/4.0)
Resource shortname: suomi24-wordvec

This package contains word embeddings trained with word2vec on Finnish
Internet forum discussions from the Suomi24 corpus
(http://urn.fi/urn:nbn:fi:lb-2020021801).

The embeddings were trained on the lemmas from the text annotations instead of
surface forms. Inflected forms like "koiralta" are therefore absent and are
all represented by the base form "koira".

The embedding file suomi24-wordvec.txt contains 633 758 entries. The dimension
of the vector space is 128.

The embedding file is in the plain-text format produced by word2vec, which is
easy to parse. The first line of the file gives the vocabulary size and the
dimension. Each subsequent line begins with a vocabulary item, followed by a
space, followed by 128 floating-point numbers in textual form, each followed
by a space.
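
As an illustration of the layout described above, the following Python sketch
parses the textual format using only the standard library. The function name
read_word2vec_text is hypothetical and not part of this package, and the
sample data is made up (dimension 3 instead of 128) so that the example runs
without the real file.

```python
import io
from typing import Dict, List, TextIO, Tuple


def read_word2vec_text(stream: TextIO) -> Tuple[int, int, Dict[str, List[float]]]:
    """Parse word2vec textual embeddings from an open text stream.

    Hypothetical helper for illustration; not shipped with the package.
    """
    # Header line: "<vocabulary size> <dimension>"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for line in stream:
        line = line.rstrip()  # drop the trailing space and the newline
        if not line:
            continue
        # Split from the right so that exactly `dim` numbers are taken and
        # everything before them is kept as the vocabulary item.
        fields = line.rsplit(" ", dim)
        if len(fields) != dim + 1:
            raise ValueError(f"expected {dim} values in line: {line[:40]!r}")
        vectors[fields[0]] = [float(x) for x in fields[1:]]
    return vocab_size, dim, vectors


# Made-up sample in the same layout as the embedding file.
sample = io.StringIO("2 3\nkoira 0.1 -0.2 0.3 \nkissa 0.4 0.5 -0.6 \n")
vocab_size, dim, vectors = read_word2vec_text(sample)
```

To read the real file, the same function could be given a stream opened with
open("suomi24-wordvec.txt", encoding="utf-8"); the UTF-8 encoding is an
assumption, as it is not stated here. Holding all 633 758 vectors as Python
lists takes a lot of memory, so an array library may be preferable in
practice.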

For efficient processing, the file suomi24-wordvec.bin contains a binary
representation of the embedding file.
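
The layout of the binary file is not specified here. The Python sketch below
assumes that it follows the binary output of the original word2vec tool: a
textual header line with the vocabulary size and dimension, then for each
entry the vocabulary item terminated by a space, followed by the vector as
32-bit floats and a newline. Little-endian byte order and UTF-8 encoding are
further assumptions, and the function name read_word2vec_binary is
hypothetical. The sample bytes are made up so that the example runs without
the real file.

```python
import io
import struct
from typing import BinaryIO, Dict, List, Tuple


def read_word2vec_binary(stream: BinaryIO) -> Tuple[int, int, Dict[str, List[float]]]:
    """Read embeddings in the (assumed) word2vec binary layout."""
    # Header line is plain text: b"<vocabulary size> <dimension>\n"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors: Dict[str, List[float]] = {}
    for _ in range(vocab_size):
        # Read the vocabulary item byte by byte up to the separating space.
        word = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading an item")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                word.extend(ch)
        # Assumption: `dim` little-endian 32-bit floats follow the item.
        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        vectors[word.decode("utf-8")] = list(struct.unpack(f"<{dim}f", data))
    return vocab_size, dim, vectors


# Made-up sample with two entries of dimension 2.
sample_bytes = (
    b"2 2\n"
    + b"koira " + struct.pack("<2f", 1.0, -0.5) + b"\n"
    + b"kissa " + struct.pack("<2f", 0.25, 2.0) + b"\n"
)
bin_vocab_size, bin_dim, bin_vectors = read_word2vec_binary(io.BytesIO(sample_bytes))
```

If the assumption about the layout holds, the real file could be read by
passing a stream opened with open("suomi24-wordvec.bin", "rb"). Libraries that
support the word2vec binary format should then be able to load it as well.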
