Testing and enhancement of language models (transducers) from GiellaLT

GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. The web site of GiellaLT offers language models (transducers) for a wide range of languages. Writing documentation for each language repository is an ongoing effort, and part of the development process.

Analyser enhancement

The GiellaLT infrastructure, with its implementation of finite-state tools, allows people working with different languages to make use of technological solutions that, otherwise, might require several years of individual development. It is here that descriptions for many of the Uralic languages have been initialized and developed as both financed projects and the work of language technology enthusiasts.
The GiellaLT infrastructure makes it possible to reuse finite-state descriptions and even encourages it. Thus, contributing to the enhancement of the finite-state tools at GiellaLT, when extending the annotation of corpora on the Language Bank of Finland’s Korp server, is beneficial to the search engine users as well.

On this page, we will evaluate the state of development of analysers for individual languages in relation to text data being annotated for the Korp search engine. This evaluation will therefore be aligned with the annotation of upcoming corpora, such as a new extended version of PaBiVUS (Parallel Biblical Verses for Uralic Studies). The objective is to increase the lemmatization, morphological and syntactic annotation coverage not previously offered for non-majority languages in the parallel corpus. So, here we will provide an illustrative depiction of each individual finite-state description and what steps have been made for improvement. This might be seen as enhanced but not complete coverage of various genre as we go.

The evaluations will tend to illustrate the capacities of the analysers, which do have equivalent generators, but the possible overproductivity of these generators is presently not the focus of these evaluations. In time, attention will be also drawn towards the description of the disambiguation of morphological analyses, which is made possible in the open-source GiellaLT infrastructure. The enhanced descriptions, housed in GiellaLT, will serve as a contribution by the Language Bank of Finland in the shared responsibilities towards improved coverage of lesser described languages and NLP addressing them. Thus, the resulting analysers will available for building within the GiellaLT infrastructure or the UralicNLP python, java and .net libraries available through Github or the Language Bank of Finland.

For more details see the complete description on the analyser enhancement by Jack Rueter.

Evaluations of analysers for individual languages:

 


This resource group page has a Persistent Identifier: http://urn.fi/urn:nbn:fi:lb-2024050302

Search the Language Bank Portal:
Juraj Šimko
Researcher of the Month: Juraj Šimko

 

Upcoming events


Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317

More contact information