
D1.1.1: Named-Entity Annotation

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 2024-01-01
Duration: 24 months

Report author: Jussi Piitulainen (UHEL)
WP 1.1: Report on Named-Entity Annotation
Date of reporting: 2024-09-26
Contributors: Jussi Piitulainen, Jyrki Niemi (UHEL), Sam Hardwick (CSC)
Deliverable location:

Keywords for the deliverable page: named-entity; finnish-nertag; VRT; Suomi24

Description

Name-like phrases are annotated in the Suomi24 2001–2020 VRT corpus in the Language Bank of Finland, using the computational resources of CSC. The new annotations come in the three output formats of the finnish-nertag 1.6 tool: maximally long identified names, names nested within those, and the BIO (begin, inside, outside) format for the maximal names.

All 20 years have already been processed with the tool. A small number of triply nested annotations required correction, for which a post-processing tool was written. All years are pending the addition of structural markup tags for each maximal name.

The final annotations are expected to be available in the Language Bank both through the Korp search engine and as a new downloadable version of the corpus in October 2024.

As an example of the tag format, below is a VRT fragment (from the year 2010 data) where “Turun hallinto-oikeudelle” is recognized as a maximally long name, with “Turun” as a shorter name nested inside it. There can even be a third nesting level. (The example is a projection to just the word and the new fields; base forms and other morpho-syntactic annotations remain in the corpus.)

word                 nertag2         nertags2/                            nerbio2
joka                 _               |                                    O
jätetään             _               |                                    O
Turun                EnamexOrgCrp-B  |EnamexOrgCrp-B-0|EnamexLocPpl-F-1|  B-ORG
hallinto-oikeudelle  EnamexOrgCrp-E  |EnamexOrgCrp-E-0|                   I-ORG
ensi                 _               |                                    O
maanantaina          _               |                                    O
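
The nested-name field in the fragment above can be read mechanically. The following Python sketch splits such a value into its parts. It assumes, from the example alone, that each item has the form type-marker-level, that level 0 is the maximal name and higher levels are names nested in it, and that an empty set is written as a bare |. The names NerTag and parse_nertags are illustrative only and are not part of finnish-nertag or of the corpus tools.

```python
from typing import NamedTuple


class NerTag(NamedTuple):
    name_type: str  # e.g. "EnamexOrgCrp"
    marker: str     # position marker as it appears in the data, e.g. "B", "E", "F"
    level: int      # assumed: 0 = maximal name, 1 = name nested in it, ...


def parse_nertags(value: str) -> list[NerTag]:
    """Split a set-valued field such as '|EnamexOrgCrp-B-0|EnamexLocPpl-F-1|'."""
    tags = []
    for item in value.strip("|").split("|"):
        if not item:
            continue  # a bare '|' is taken to mean the empty set
        # Split from the right so that hyphens in the type name would survive.
        name_type, marker, level = item.rsplit("-", 2)
        tags.append(NerTag(name_type, marker, int(level)))
    return tags


if __name__ == "__main__":
    for tag in parse_nertags("|EnamexOrgCrp-B-0|EnamexLocPpl-F-1|"):
        print(tag.level, tag.name_type, tag.marker)
```

Sorting the result by level would list the maximal name first, followed by any names nested in it.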

The numbers of maximally long names identified in the years 2001–2010 (roughly half of the corpus) are as follows, obtained by counting the BIO start tags (the B of BIO). The BIO tags classify the recognized names into six types; the other formats provide a finer classification.

Start tag (BIO)   frequency
B-PER            22 416 185
B-PRO            17 347 958
B-LOC            14 271 499
B-ORG             9 088 301
B-MISC            4 419 947
B-DATE            2 590 846
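
Counts of this kind can be obtained with a few lines of Python. The sketch below is not the script used for this report. It assumes tab-separated VRT token lines, structural lines that begin with <, and the BIO tag in the last field, as in the projection shown earlier. The function name count_bio_starts is hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_bio_starts(lines: Iterable[str], bio_field: int = -1) -> Counter:
    """Count B-* tags in one field of tab-separated VRT token lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # skip empty lines and structural markup
        tag = line.split("\t")[bio_field]
        if tag.startswith("B-"):
            counts[tag] += 1  # one start tag per maximal name
    return counts


if __name__ == "__main__":
    sample = [
        "<sentence>",
        "Turun\tEnamexOrgCrp-B\t|EnamexOrgCrp-B-0|EnamexLocPpl-F-1|\tB-ORG",
        "hallinto-oikeudelle\tEnamexOrgCrp-E\t|EnamexOrgCrp-E-0|\tI-ORG",
        "ensi\t_\t|\tO",
        "</sentence>",
    ]
    for tag, frequency in count_bio_starts(sample).most_common():
        print(tag, frequency)
```

Because every maximal name has exactly one start tag, counting the B tags counts names rather than tokens.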

The annotation work was facilitated by a new preprocessing tool that hides from the finnish-nertag tool those input sentences that, on empirical evidence, might induce extreme resource consumption (usually excessive time, sometimes excessive space, either of which leads to a crash). Some of these sentences originate in trollish behaviour on the discussion forum; others are not really ordinary sentences at all. Some may have been segmented unhelpfully, possibly because of missing punctuation marks or missing spaces.
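
The general idea of hiding sentences from the tagger can be sketched as follows. This is not the actual preprocessing tool, and the report does not state its criteria: the two thresholds and the names looks_risky and partition are purely hypothetical stand-ins for whatever empirical tests the real tool applies.

```python
from typing import Callable, Sequence

Sentence = Sequence[str]

# Hypothetical thresholds; the real criteria are not given in the report.
MAX_TOKENS = 200
MAX_TOKEN_LENGTH = 100


def looks_risky(tokens: Sentence) -> bool:
    """Guess whether a sentence might be too expensive for the tagger."""
    too_long = len(tokens) > MAX_TOKENS
    has_huge_token = any(len(token) > MAX_TOKEN_LENGTH for token in tokens)
    return too_long or has_huge_token


def partition(
    sentences: Sequence[Sentence],
    is_risky: Callable[[Sentence], bool] = looks_risky,
) -> tuple[list[tuple[int, Sentence]], list[tuple[int, Sentence]]]:
    """Split sentences into those passed to the tagger and those hidden.

    Each sentence keeps its original index so that the tagged and the
    hidden sentences can be merged back in corpus order afterwards.
    """
    passed, hidden = [], []
    for index, tokens in enumerate(sentences):
        (hidden if is_risky(tokens) else passed).append((index, tokens))
    return passed, hidden


if __name__ == "__main__":
    corpus = [["joka", "jätetään"], ["a"] * 201, ["x" * 101]]
    passed, hidden = partition(corpus)
    print("passed:", [index for index, _ in passed])
    print("hidden:", [index for index, _ in hidden])
```

Keeping the original indices is one simple way to restore the hidden sentences to their places once the remaining sentences have been tagged.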

In addition to the names, the corpus was annotated with HeLI-OTS 2.0 language identification for each sentence, with summaries in the paragraph and text elements.

References

The FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant number 358720.
