
D4.2.1: LDF Knowledge Extraction Tools

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.2: Report on LDF Knowledge Extraction Tools
Date of reporting: 2023-03-13

Report author: Eero Hyvönen (Aalto University)
Contributors: Rafael Leal, Minna Tamper (Aalto University); Jouni Tuominen (University of Helsinki)
Deliverable location: Data services: https://nlp.ldf.fi (not yet opened for public use) | Ontology services: https://onki.fi | Tools: https://github.com/SemanticComputing

Description

The NLP toolset used in the Sampos is being developed into a reusable web service and website under nlp.LDF.fi. The focus is on automatic named-entity recognition, keyword extraction, and multi-classification into predefined categories. In addition, the Anoppi tool for pseudonymizing text in compliance with the GDPR is being developed and deployed. Anoppi is being deployed by the Legal Register Centre of Finland (Oikeusrekisterikeskus). An API has also been developed for it and has been used by Statistics Finland (Tilastokeskus).
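To make the idea of pseudonymization concrete, the following is a minimal sketch in Python. It is not Anoppi's implementation or API: the function name `pseudonymize` and the `[PERSON_n]` placeholder format are hypothetical, and the list of names is assumed to come from an earlier named-entity recognition step. It only shows the core requirement that the same name always maps to the same placeholder.

```python
import re


def pseudonymize(text, names):
    """Replace each listed name with a stable placeholder such as [PERSON_1].

    Hypothetical helper for illustration only; `names` is assumed to be the
    output of a separate named-entity recognition step.
    """
    mapping = {}
    for name in names:
        # The first occurrence of a name fixes its placeholder number.
        mapping.setdefault(name, f"[PERSON_{len(mapping) + 1}]")
    if not mapping:
        return text, mapping
    # Try longer names first so that a full name wins over a shorter name
    # that happens to be contained in it.
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in alternatives))
    replaced = pattern.sub(lambda match: mapping[match.group(0)], text)
    return replaced, mapping
```

Calling `pseudonymize("Matti Meikäläinen met Maija.", ["Matti Meikäläinen", "Maija"])` returns the rewritten text together with the name-to-placeholder mapping. Exact string matching of this kind does not handle inflected Finnish name forms, which is one reason a real tool needs linguistic analysis rather than plain substitution.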

Work on developing reusable data transformation pipelines for the different Sampos has continued.

Work now focuses on the Secompling toolset for extracting knowledge from texts. It has already been used in ParliamentSampo, LawSampo, and WarMemoirSampo. Secompling (from SeCo Computational Linguistics) is an open-source Finnish NLP library written in Python. Its aim is to aggregate new and third-party tools into a comprehensive, integrated, and easy-to-use package. At this point, the library contains the following tools:

  • Lemmatization (based on TurkuNLP’s Neural parser, with heuristic corrections)
  • Named entity extraction (via TurkuNLP’s NER tool and Stanford’s Stanza)
  • Keyword extraction (Annif and TF-IDF)
  • An unsupervised classification tool using keywords and word embeddings
  • A relevance-feedback-based search engine using keywords and word embeddings
  • Language detection using different methods (via langdetect, lingua and pycld2)
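As an illustration of the TF-IDF keyword extraction mentioned in the list above, here is a minimal, self-contained sketch in Python. It does not use Secompling and does not reflect its API: the function names `tokenize` and `tfidf_keywords` are hypothetical, and the tokenizer is a plain regular expression, whereas a real pipeline for Finnish would presumably score lemmatized tokens.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (no lemmatization)."""
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(documents, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency times inverse document frequency; a term that
        # occurs in every document gets log(1) = 0 and drops to the bottom.
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        results.append([term for term, _ in ranked[:top_n]])
    return results
```

For the three toy documents "parliament debated the budget", "parliament passed the law", and "the court applied the law", calling `tfidf_keywords(docs, top_n=2)` ranks "budget" and "debated" highest for the first document, because they occur in no other document, while "the" scores zero everywhere.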

The next step is the creation of a Deep Neural Network tool for Named Entity Linking using Wikipedia and Wikidata. The dataset for this task is under development and is to be released separately.

Secompling can be installed as a Python package. The repository https://version.aalto.fi/gitlab/seco/secompling contains more information. Note that the library is a research project under heavy development. Bug reports are much appreciated.

Publications

Minna Tamper: From Text to Knowledge: Methods, Tools, and Applications for Digital Humanities Based on Linked Data. (in English), Aalto University, Department of Computer Science, February, 2023. PhD Thesis.

Senka Drobac, Laura Sinikallio and Eero Hyvönen: An OCR Pipeline for Transforming Parliamentary Debates into Linked Data: Case ParliamentSampo – Parliament of Finland on the Semantic Web. 2022. Paper under peer review.

Arttu Oksanen, Eero Hyvönen, Minna Tamper, Jouni Tuominen, Henna Ylimaa, Katja Löytynoja, Matti Kokkonen and Aki Hietanen: A Tool for Pseudonymization of Textual Documents for Digital Humanities Research and Publication. AI4LEGAL-KGSUM 2022: Artificial Intelligence Technologies for Legal Documents and Knowledge Graph Summarization 2022, vol. 3257, pp. 12-21, CEUR Workshop Proceedings, August, 2022.

Minna Tamper, Rafael Leal, Laura Sinikallio, Petri Leskinen, Jouni Tuominen and Eero Hyvönen: Extracting Knowledge from Parliamentary Debates for Studying Political Culture and Language. Proceedings of the 1st International Workshop on Knowledge Graph Generation From Text and the 1st International Workshop on Modular Knowledge co-located with 19th Extended Semantic Web Conference (ESWC 2022) (Sanju Tiwari, Nandana Mihindukulasooriya, Francesco Osborne, Dimitris Kontokostas, Jennifer D’Souza and Mayank Kejriwal (eds.)), vol. 3184, pp. 70-79, CEUR WS, May, 2022. International Workshop on Knowledge Graph Generation from Text (TEXT2KG 2022).

Arttu Oksanen, Minna Tamper, Jouni Tuominen, Aki Hietanen and Eero Hyvönen: A Tool for Pseudonymization of Textual Documents for Digital Humanities Research and Publication. 6th Digital Humanities in Nordic and Baltic Countries Conference, poster paper, book of abstracts, pp. 107-108, March, 2022.

Laura Sinikallio: Eduskunnan täysistuntojen pöytäkirjojen muuntaminen semanttiseksi dataksi ja julkaiseminen verkkopalveluna (Transformation of the Debates of the Parliament of Finland into Semantic Data and a Data Service). (in Finnish), University of Helsinki, Department of Computer Science, February, 2022. MSc Thesis.

Minna Tamper, Jouni Tuominen and Eero Hyvönen: Extending the Finnish Linked Data Infrastructure with Natural Language Processing Services in FIN-CLARIAH. DHNB 2022 The 6th Digital Humanities in Nordic and Baltic Countries Conference, pp. 443-446, CEUR Workshop Proceedings, Vol. 3232, 2022.
