FinEst BERT

finestbert

Metadata PID: http://urn.fi/urn:nbn:fi:lb-2020061201

Licensed under CC BY 4.0


FinEst BERT is a multilingual cased BERT base model trained on three
languages: Finnish, Estonian and English. Whole-word masking was used
during data preparation and training; the model was trained for 40
epochs with sequence length 128 and for another 4 epochs with sequence
length 512. The FinEst BERT model published here is in PyTorch format.
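As a rough illustration of the whole-word masking mentioned above: with WordPiece tokenization, a word may be split into several pieces (continuation pieces start with "##"), and whole-word masking masks all pieces of a selected word together rather than individual pieces. The sketch below is illustrative only, under stated assumptions; the function name, masking probability, and details are not taken from the actual FinEst BERT preprocessing code.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, rng=None):
    """Minimal sketch of whole-word masking over WordPiece tokens.

    When a word is selected for masking, every WordPiece belonging to
    that word is replaced by [MASK], not just a single piece.
    Returns the masked token list and a label list holding the
    original tokens at masked positions (None elsewhere).
    """
    rng = rng or random.Random(0)
    # Group token indices into words: a piece starting with "##"
    # continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:          # mask the whole word at once
                labels[i] = masked[i]
                masked[i] = "[MASK]"
    return masked, labels
```

For example, if "embedding" is tokenized as ["embed", "##ding"] and that word is selected, both pieces become [MASK] together.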

Corpora used:
Finnish - STT articles, CoNLL 2017 shared task, Ylilauta downloadable version
Estonian - Ekspress Meedia articles, CoNLL 2017 shared task
English - English Wikipedia

More information is available in the article "FinEst BERT and
CroSloEngual BERT: less is more in multilingual models" by Matej Ulčar
and Marko Robnik-Šikonja, published in the proceedings of the TSD 2020
conference.

"FinEst BERT" model by Matej Ulčar and Marko Robnik-Šikonja is
published under Creative Commons Attribution 4.0 International (CC BY
4.0) license.
