Erzya and Moksha Extended Corpora (ERME)

ERME contains predominantly original Erzya and Moksha literature. It consists of several media publications from the 19th and 20th centuries. The material was mapped in Saransk in 1997-2004 and has been mapped in Helsinki since 2004. The most basic format used is XML, with granularity down to the chapter level. The goal is to create corpora with granularity down to the word level and with bibliographic references at the sentence level.

The new version contains the literature found in the older version and has grown markedly. While the old version was merely text segmented into sentences, the new version adds lemmatization and dependency annotation. At the sentence level, a contextual translation (English or Finnish) may be present; at the word level, each word carries a morphological encoding corresponding to its context. Preliminary morpho-syntactic analysis is carried out using HFST-based transducers together with Constraint Grammar disambiguation and function and dependency tagging, all of which have been developed in the Giellatekno infrastructure of the University of Tromsø.
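
The layering described above (sentence-level translation, word-level lemma, morphological encoding and dependency) can be sketched as follows. This is only an illustration under stated assumptions: the element and attribute names (`sentence`, `w`, `lemma`, `msd`, `head`, `deprel`, `translation_en`) are hypothetical and are not taken from the actual ERME files, and the word forms, lemmas and tags are placeholders rather than real Erzya or Moksha data.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only
# and are NOT the ERME schema. Forms, lemmas and tags are placeholders.
SAMPLE = """
<chapter id="c1">
  <sentence id="s1" translation_en="placeholder translation">
    <w lemma="lemma1" msd="N+Sg+Nom" head="2" deprel="SUBJ">form1</w>
    <w lemma="lemma2" msd="V+Ind+Prs+Sg3" head="0" deprel="ROOT">form2</w>
  </sentence>
</chapter>
"""


def read_sentences(xml_text):
    """Return one dict per sentence, each holding its word-level annotations."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sent in root.iter("sentence"):
        tokens = [
            {
                "form": w.text,
                "lemma": w.get("lemma"),
                "msd": w.get("msd"),          # morphological encoding
                "head": int(w.get("head")),   # 0 marks the root in this sketch
                "deprel": w.get("deprel"),
            }
            for w in sent.iter("w")
        ]
        sentences.append(
            {
                "id": sent.get("id"),
                "translation": sent.get("translation_en"),
                "tokens": tokens,
            }
        )
    return sentences


if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        print(sentence["id"], sentence["translation"])
        for token in sentence["tokens"]:
            print("  ", token["form"], token["lemma"], token["msd"], token["deprel"])
```

The actual attribute inventory of each version is documented in its metadata record and may differ from this sketch.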

The grammatical analysis and labeling comply with the practices developed in the Giellatekno infrastructure of the University of Tromsø. These practices are applied in the documentation of several Uralic languages.

The amount of processed material will be increased over time.

Latest versions/subcorpora:  
Erzya and Moksha Extended Corpora (ERME) version 2, Korp
Metadata and license
Attribution instructions
Select the corpus in Korp
Search for all versions of this resource in META-SHARE  

Different versions/subcorpora of this corpus are (or will be) published in the Language Bank of Finland. The versions are available through the Language Bank Download Service and/or through the Korp concordance tool. Links to the different versions can be found in the list above.

Detailed information on the content of each version, its user rights and its license can be found in its specific metadata record in META-SHARE.

This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2022052001
