
FIN-CLARIAH D4.3.1: Subsetting tool

Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.3: Report on Subsetting tool
Date of reporting: 14-11-2022

Report author: Eetu Mäkelä (University of Helsinki)
Contributors: Ville Vaara (University of Helsinki)
Deliverable location: Internal


The prototype version of the subsetting tool is available at https://github.com/hsci-r/octavo/. It has been, and continues to be, used successfully in multiple research projects. However, the prototype is neither as easy to update nor as easy to maintain as we would like. Both problems stem mainly from the way the tool hooks into the Lucene search library at multiple interface levels (mostly choosing whichever interface offered the most efficient way to implement each functionality), which considerably increases system complexity. In addition, some of the integrations are at very low levels, where interfaces are considerably less stable between versions.

To overcome these deficiencies, WP4.3 has been evaluating whether a production version of the tool could be built on top of Elasticsearch, which is also based on Lucene but offers APIs and interfaces at a much higher level of abstraction and standardisation. The idea is that if the same functionalities could be built using Elasticsearch, 1) there would be much less API surface between the custom and standard parts of the system, and 2) the remaining extension points would be more standard, more stable, more widely documented and better understood.

To this end, the WP has a) catalogued the Lucene extension points that the current prototype uses, b) catalogued which functionalities rely on which extension points and rated those functionalities by how important they have been to actual users in the associated research projects, and c) surveyed the corresponding extension points and possibilities offered by Elasticsearch. Next, these results need to be aligned with each other to reach a go/no-go decision: can a sufficient number of the functionalities rated as important be developed using only the well-documented extension points of Elasticsearch, and should we therefore go ahead with reimplementing the tool on that framework?
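As an illustration only, the alignment step described above could be sketched as follows in Python. Everything in the sketch is a hypothetical assumption and not part of the WP's actual reports: the names Functionality, coverable and go_no_go, the 1 to 3 importance scale, the 0.8 threshold and the placeholder extension point names. It shows one possible way to combine the three catalogues into a single coverage figure.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Functionality:
    """One prototype functionality (hypothetical record layout)."""
    name: str
    importance: int  # assumed scale: 1 (low) to 3 (high)
    lucene_extension_points: frozenset[str]


def coverable(func: Functionality, es_equivalents: dict[str, str | None]) -> bool:
    """True if every Lucene extension point the functionality relies on
    has a documented Elasticsearch counterpart in the mapping."""
    return all(
        es_equivalents.get(point) is not None
        for point in func.lucene_extension_points
    )


def go_no_go(
    functionalities: list[Functionality],
    es_equivalents: dict[str, str | None],
    min_importance: int = 3,      # assumed cut-off for "rated as important"
    required_share: float = 0.8,  # assumed threshold for "sufficient number"
) -> tuple[bool, float]:
    """Return (decision, share of important functionalities coverable)."""
    important = [f for f in functionalities if f.importance >= min_importance]
    if not important:
        return True, 1.0
    covered = [f for f in important if coverable(f, es_equivalents)]
    share = len(covered) / len(important)
    return share >= required_share, share


# Placeholder data: none of these names come from the real catalogues.
functionalities = [
    Functionality("functionality_a", 3, frozenset({"point_1"})),
    Functionality("functionality_b", 3, frozenset({"point_1", "point_2"})),
    Functionality("functionality_c", 1, frozenset({"point_3"})),
]
es_equivalents = {
    "point_1": "es_counterpart_1",  # has an Elasticsearch counterpart
    "point_2": None,                # no counterpart found
    "point_3": None,
}

decision, share = go_no_go(functionalities, es_equivalents)
print(f"go: {decision}, coverable share of important functionalities: {share:.2f}")
```

In practice the threshold and the importance cut-off would be judgment calls made by the WP, and a real alignment would probably also weigh how much effort each Elasticsearch extension point requires.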

According to the original plan, the point at which a decision could be made was to be reached in Q3/2022. However, due to delays in hiring, the background reports on both sides have only now been completed, so work on aligning them can begin. At present, we expect to be able to make the go/no-go decision itself within a month.
