Erzya and Moksha Extended Corpora (ERME)

Currently available versions of this resource

Shortname	Name and metadata	License	Location	Cite	Resource group and help	Apply	Publication year	Support level

Upcoming versions of this resource

These resource versions are not yet available in the Language Bank of Finland.

Shortname	Name and metadata	License	Formats	Support level	Contact Person	Resource group and help	Location	Other information

Resource information

ERME

ERME contains predominantly original Erzya and Moksha literature. It consists of several media publications from the 19th to the 20th century. ERME was mapped in Saransk from 1997 to 2004 and has been mapped in Helsinki since 2004. The most basic format used is XML, with granularity extending to the chapter level. The goal is to create corpora with granularity extending to the word level and bibliographic references at the sentence level.

The new version contains the literature found in the older instance and has grown markedly. While the old version was merely text divided into sentences, the new version adds lemmatization and dependencies. At the sentence level, a contextual translation (English or Finnish) may be present, while at the word level there is morphological encoding corresponding to each context. Preliminary morpho-syntactic analysis is carried out using HFST-based transducers together with Constraint Grammar disambiguation and function and dependency tagging, which have been developed in the Giellatekno infrastructure of the University of Tromsø.

The grammatical analysis and labeling comply with the practices developed in the Giellatekno infrastructure of the University of Tromsø. These practices are applied in the documentation of several Uralic languages.

The amount of processed material will be increased over time.

ERME-PSLA

While ERME contains predominantly, if not solely, original Erzya and Moksha literature, ERME-PSLA (Paragraph Segmentation Low Annotation) contains both original and translated texts. The most basic format used is XML, with granularity set at the piece level and then automatically extended to the sentence level. The goal is to create corpora with source metadata indicating authors, titles, translators, genre, collectors, etc., and, where possible, geo-identifiers and time stamps, so that the language of each individual piece (article) can be readily compared to fieldwork documentation of these language forms from various eras.
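The step from piece-level to sentence-level granularity described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the element names (`corpus`, `piece`, `p`, `s`), the metadata attributes, and the punctuation-based splitting rule are all hypothetical and do not reflect the actual ERME-PSLA schema or the segmentation tools used for the corpus.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical piece-level document. Element and attribute names are
# illustrative assumptions, not the actual ERME-PSLA schema.
SAMPLE = """<corpus>
  <piece author="A. Author" title="Sample title" genre="article" year="1956">
    <p>First sentence. Second sentence! Third one?</p>
  </piece>
</corpus>"""

# Naive sentence boundary: whitespace preceded by '.', '!' or '?'.
# Real segmentation would need to handle abbreviations, quotes, etc.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split a paragraph string into a list of sentence strings."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def segment(xml_text):
    """Wrap each sentence of every <p> in a numbered <s> element."""
    root = ET.fromstring(xml_text)
    for piece in root.iter("piece"):
        for p in piece.findall("p"):
            sentences = split_sentences(p.text or "")
            p.text = None  # the text now lives inside the <s> children
            for i, sent in enumerate(sentences, start=1):
                s = ET.SubElement(p, "s", id=str(i))
                s.text = sent
    return root


root = segment(SAMPLE)
sentences = [s.text for s in root.iter("s")]
```

Because the piece-level metadata stays on the enclosing element, each sentence can still be traced back to its author, title, genre, and year after segmentation.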

Content of ERME-PSLA:

Moksha-language texts from the Mokša journal

Time range 1956 – 2000

Download a list of all works

Erzya-language texts from the Surań tolt and Sâtko journals

Time range 1956 – 2001

Download a list of all works

License and access

All versions of this resource are available publicly (PUB).

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022052001

Last modified on 2026-07-13