D1.1.1: Named-Entity Annotation

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 2024-01-01
Duration: 24 months

Report author: Jussi Piitulainen (UHEL)
WP 1.1: Report on Named-Entity Annotation
Date of reporting: 2024-09-26
Contributors: Jussi Piitulainen, Jyrki Niemi (UHEL), Sam Hardwick (CSC)
Deliverable location:

Keywords for the deliverable page: named-entity; finnish-nertag; VRT; Suomi24

Description

Name-like phrases are annotated in the Suomi24 2001–2020 VRT corpus in the Language Bank of Finland, using the computational resources of CSC. The new annotations come in the three output formats of the finnish-nertag 1.6 tool: maximally long identified names, names nested in those, and the BIO (begin, inside, outside) format for the maximal names.

All 20 years have already been processed with the tool. A small number of triply nested annotations required correction, for which a post-processing tool was written. All years are pending the addition of structural markup tags for each maximal name.

The final annotations are expected to be available in the Language Bank both through the Korp search engine and as a new downloadable version of the corpus in October 2024.

As an example of the tag format, below is a VRT fragment (from the 2010 data) where ”Turun hallinto-oikeudelle” is recognized as a maximally long name with ”Turun” as a shorter name nested inside. There can even be a third nesting level. (The example is a projection to just the word and the new fields; base forms and other morpho-syntactic annotations remain in the corpus.)

word                 nertag2         nertags2/                            nerbio2
joka                 _               |                                    O
jätetään             _               |                                    O
Turun                EnamexOrgCrp-B  |EnamexOrgCrp-B-0|EnamexLocPpl-F-1|  B-ORG
hallinto-oikeudelle  EnamexOrgCrp-E  |EnamexOrgCrp-E-0|                   I-ORG
ensi                 _               |                                    O
maanantaina          _               |                                    O
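
To make the nested format concrete, the following minimal Python sketch splits a nertags2/ value into its entries. It assumes, based only on the fragment above, that each entry has the shape type-position-level, with entries delimited by vertical bars; the function name parse_nertags is hypothetical and not part of finnish-nertag.

```python
def parse_nertags(field):
    """Split a value such as '|A-B-0|C-F-1|' into (type, position, level) triples.

    Assumption: every entry ends in '-<position>-<nesting level>', as in the
    fragment above. An empty set is written as a single '|'.
    """
    entries = [entry for entry in field.split("|") if entry]
    parsed = []
    for entry in entries:
        # Split from the right so that hyphens inside the type name survive.
        name_type, position, level = entry.rsplit("-", 2)
        parsed.append((name_type, position, int(level)))
    return parsed


# Values copied from the example fragment.
turun = parse_nertags("|EnamexOrgCrp-B-0|EnamexLocPpl-F-1|")
outside = parse_nertags("|")

for name_type, position, level in turun:
    print(level, name_type, position)
```

Level 0 corresponds to the maximal name and level 1 to the name nested inside it, which is how ”Turun” ends up with two entries.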

The numbers of maximally long names identified in the years 2001–2010 (roughly half of the corpus), obtained by counting the BIO start tags (the B of BIO), are as follows. The BIO tags classify the recognized names into six types, with a finer classification provided by the other formats.

Start tag (BIO)   Frequency
B-PER            22 416 185
B-PRO            17 347 958
B-LOC            14 271 499
B-ORG             9 088 301
B-MISC            4 419 947
B-DATE            2 590 846
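
Counts of this kind can be reproduced along the following lines. This is a minimal sketch, not the script used for the table: it assumes tab-separated VRT token lines, structural markup lines starting with "<", and the BIO tag in the last column (as in the projected example above). The function name count_start_tags is hypothetical.

```python
from collections import Counter


def count_start_tags(vrt_lines, bio_column=-1):
    """Count BIO start tags (B-...) in an iterable of VRT lines.

    Assumptions: token lines are tab-separated, structural lines start
    with '<', and the BIO tag sits in column `bio_column`.
    """
    counts = Counter()
    for line in vrt_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):
            continue  # skip blank lines and structural markup
        tag = line.split("\t")[bio_column]
        if tag.startswith("B-"):
            counts[tag] += 1
    return counts


# The example fragment, wrapped in a sentence element for illustration.
sample = [
    "<sentence>",
    "joka\t_\t|\tO",
    "jätetään\t_\t|\tO",
    "Turun\tEnamexOrgCrp-B\t|EnamexOrgCrp-B-0|EnamexLocPpl-F-1|\tB-ORG",
    "hallinto-oikeudelle\tEnamexOrgCrp-E\t|EnamexOrgCrp-E-0|\tI-ORG",
    "ensi\t_\t|\tO",
    "maanantaina\t_\t|\tO",
    "</sentence>",
]
counts = count_start_tags(sample)
print(counts.most_common())
```

Only the B tags are counted, so a two-word name such as the one in the fragment contributes exactly one maximal name.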

The annotation work was facilitated by writing a new preprocessing tool that hides from the finnish-nertag tool such input sentences that might, empirically, induce extreme resource consumption (usually excessive time, sometimes excessive space, both leading to a crash). Some of these sentences originate in trollish behaviour in the discussion forum, some are otherwise not really ordinary sentences at all. Some may have been segmented in a less than helpful way, possibly due to missing punctuation marks or missing spaces.
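
The report does not state which criteria the preprocessing tool uses. Purely as an illustration of the general idea, the sketch below sets aside sentences that exceed two hypothetical limits (token count and longest token) so that only the remaining sentences would be passed to the tagger. The thresholds and the names is_risky and partition are assumptions made for this example.

```python
# Hypothetical limits; the actual tool's criteria are not given in the report.
MAX_TOKENS = 200        # assumed cap on sentence length in tokens
MAX_TOKEN_CHARS = 100   # assumed cap on the length of a single token


def is_risky(tokens, max_tokens=MAX_TOKENS, max_token_chars=MAX_TOKEN_CHARS):
    """Return True if a tokenized sentence exceeds either assumed limit."""
    if len(tokens) > max_tokens:
        return True
    return any(len(token) > max_token_chars for token in tokens)


def partition(sentences):
    """Split sentences into (to_tag, hidden), keeping their original indices."""
    to_tag, hidden = [], []
    for index, tokens in enumerate(sentences):
        target = hidden if is_risky(tokens) else to_tag
        target.append((index, tokens))
    return to_tag, hidden


sentences = [
    ["joka", "jätetään", "Turun", "hallinto-oikeudelle"],  # ordinary sentence
    ["a"] * 500,                                           # far too many tokens
    ["x" * 300],                                           # one enormous token
]
to_tag, hidden = partition(sentences)
print(len(to_tag), "to tag,", len(hidden), "hidden")
```

Keeping the original indices makes it possible to put the hidden sentences back in place, without name annotations, after tagging.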

In addition to the names, the corpus was also annotated with HeLI-OTS 2.0 language identification of each sentence and summaries in paragraph and text elements.

References

FIN-CLARIAH project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant number 358720.

D2.4.1: Term definition discovery procedures

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 2.4: Report on Term definition discovery procedures
Date of reporting: 16-09-2024

Report author: Harri Kettunen (UHEL)
Contributors: Tiina Onikki-Rantajääskö (UHEL)
Deliverable location: The Helsinki Term Bank for the Arts and Sciences – Tieteen termipankki

Description

Since the start of 2024, 309 new concept pages (terminology articles) have been created and 1,078 concept pages have been updated in the database of the Helsinki Term Bank for the Arts and Sciences (HTB). New concept pages have been created in the following fields: Art History, Educational Sciences, Environmental Sciences, Geology, Linguistics, Martial Arts Studies, Mesoamerican Studies, Nutritional Sciences, Open Science, Philosophy, and Theology. The total number of concept pages as of September 16, 2024, is 45,436.

The 1,078 updated concept pages are in the following fields: Aesthetics, Biology, Botany, Educational Sciences, Folklore, Geology, History, Indigenous Studies, Language Technology, Linguistics, Literary Studies, Martial Arts Studies, Open Science, Performing Arts, Philosophy, Translation Studies, and Veterinary Medicine.

In addition, the fields of Anthropology, Studies, Gender Studies, Mathematics, and Urban Studies are working offline until there is a critical mass of terminology to be published at the HTB. Furthermore, it has been agreed that terminology work will be carried out in the following fields: Arctic Research, Asian Studies, Geography, Military Sciences, and Physiology. A multidisciplinary group has also been established for the metascientific terminology of transdisciplinarity.

In 2024, we have also started to develop semi-automated processes for detecting terms and their relevant definition contexts in valid academic text genre corpora such as E-thesis in cooperation with Antti Kanner and Jussi Piitulainen (UHEL).

Furthermore, HTB has started cooperation with Aalto University on the terminology of the fields of study at Aalto. These will include the following, representing all the disciplines at Aalto University: Civil Engineering, Energy Technology, Geoinformatics, Mechanical Engineering, Real Estate Economics, Spatial Planning and Transportation Engineering, Water and Environmental Engineering, Accounting, Business Law, Economics, Entrepreneurship, Finance, Information Systems Science, International Business, Logistics, Management Science, Marketing, Organization and Management, Organizational Communication, Chemistry, Biotechnology, Chemical Engineering, Processing of Materials, Materials Science, Bioproduct Technology, Neuroscience and Biomedical Engineering, Mathematics and Statistics, Systems and Operations Research, Engineering Physics, Computer Science, Industrial Engineering and Management, Automation and Control Engineering, Robotics and Autonomous Systems, Electronic and Digital Systems, Biosensing and Bioelectronics, Electrical Power and Energy Engineering, Electronics, Photonics and Nanotechnology, Radio Science and Engineering, Space Science and Technology, Signal Processing and Data Analytics, Acoustics and Speech Technology, Communications Engineering and Networking Technology, Interactive Systems, Art Education, Contemporary Art, New Media, Photography, Visual Communication Design, Visual Culture, Design, Film and Television, Costume Design, Architecture, and Landscape and Urbanism. HTB will coordinate a meeting with the Doctoral Schools of Aalto University on October 3rd, 2024, to plan the implementation of future terminology work.

HTB has also been working in close cooperation with the Institute for the Languages of Finland (Kotus) on the names of languages of the world. During the first half of 2024 we have had four meetings with Kotus (Elina Wihuri and Ulla Onkamo), along with the consultant of the project, Lyle Campbell (Professor, Department of Linguistics, University of Hawai’i at Mānoa).

The coordinator of the HTB, in cooperation with the Teachers’ Academy of the University of Helsinki, organized a session on January 29th titled “Kansalliskielten asema korkeakouluopetuksessa” (“The role of national languages in higher education”), featuring the following presenters: Johanna Komppa (Senior University Lecturer of Finnish Language, UHEL), Tiina Onikki-Rantajääskö (Professor of Finnish Language, UHEL), Janne Saarikivi (part-time professor of Saami language, University of Tromsø), Mikko Laitinen (Professor of English Language, University of Eastern Finland), and Sirpa Leppänen (Emerita Professor of English Language, University of Jyväskylä).

HTB has also had presentations at the following conferences, seminars, and other events:

  • Helsingin yliopiston tohtorikoulutusfoorumi (Doctoral Education Forum, University of Helsinki)
  • Kääntämisen ja tulkkauksen maisteriohjelma (Master’s Programme in Translation and Interpreting, University of Helsinki)
  • Sukupuolentutkimuksen seura (Society for Gender Studies)
  • Kielitieteen päivät (Finnish Conference of Linguistics, University of Jyväskylä)
  • Research Infrastructure sub-group of the University of Helsinki Operational Structure and Management System (TOIJO)
  • Digital Humanities in the Nordic and Baltic Countries: Conference, Reykjavik
  • Opettajien Akatemian OER – Avoimet oppimisresurssit (Teachers’ Academy of the University of Helsinki: Open Educational Resources)

Publications:

Kettunen, Harri (2024). Monitieteisyydestä tieteidenvälisyyteen: Joukkoistaminen Tieteen termipankin terminologiatyössä. Sosiaalilääketieteellinen Aikakauslehti 2024: 61: 448–450.

Kettunen, Harri & Tiina Onikki-Rantajääskö (2024) Vetenskapstermbanken i Finland i samhällets tjänst. Språk i Norden / Sprog i Norden 2024. https://tidsskrift.dk/sin/article/view/144160

 
D5.1.1: Community engagement: multim. societal data researchers

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 358720
Start date: 01-01-2024
Duration: 24 months

WP 5.1: Report on community engagement of multimodal societal data researchers
Date of reporting: 20-08-2024

Report author: Sanna Kumpulainen, Anna Sendra Toset (Tampere University)
Contributors: Elina Late, Jaakko Peltonen, Farid Alijani (Tampere University)
Deliverable location: N/A

Description

The main objective of this deliverable is to widen the user base of FIN-CLARIAH by specifically targeting multimodal societal data researchers, both by organizing training workshops on different RI tools and data and by conducting explicit user monitoring of the facility.

To this end, since the start of 2024 we have hosted two training workshops for researchers on the resources of the facility in collaboration with other WPs and organized two participatory workshops for improving services related to the RI, starting with research data management. The events included:

Both training workshops required participants to be working with or be interested in social media data, while both participatory workshops required participants to be working with social media data and/or visual materials – thus complying in both cases with the aim of this deliverable.

Likewise, given that community engagement should be continuous, we expect to organize more training and/or participatory workshops during 2025.

Beyond the organization of these events, members of WP 5.1 also took part in the in-person FIN-CLARIAH Meeting in Helsinki on June 10, where the goal was to reflect on how SSH research will be affected by AI and how the RI should prepare for this.

D4.2.2: Parliament of Finland Ontology

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.2: Report on Parliament of Finland Ontology
Date of reporting: 2023-11-09

Report author: Eero Hyvönen (Aalto University)
Contributors: Eero Hyvönen (PI), Laura Sinikallio, Petri Leskinen, Senka Drobac, Jouni Tuominen, Matti La Mela, Mikko Koho, Esko Ikkala, Minna Tamper, Rafael Leal, Heikki Rantala

Deliverable locations:

The Parliament of Finland Ontology, with populated data and data services, was published on 14 February 2023 under the CC BY 4.0 license on the following platforms:

  1. CSV files of speeches, automatically updated daily, in CSC’s Allas data service: https://a3s.fi/parliamentsampo/speeches/csv/index.html
  2. XML versions of speeches in Parla-CLARIN format (updated less frequently): https://a3s.fi/parliamentsampo/speeches/xml/index.html
  3. A ParlaMint-format sub-corpus of this, published on 9 November 2023 in the repository of the pan-European ParlaMint II project: https://nl.ijs.si/et/tmp/ParlaMint/Repo/?C=M;O=D
  4. A CSV file of the Members of the Parliament and other speakers in CSC’s Allas data service: https://a3s.fi/parliamentsampo/actors/csv/index.html
  5. Speech data combined with the Parliament’s ontology, actors and other entities as linked data in RDF Turtle format, available 1) on the Linked Data Finland platform’s dataset page https://www.ldf.fi/dataset/semparl and 2) as a data dump on the Zenodo.org data service https://doi.org/10.5281/zenodo.7636420.
  6. Semantic portal Parlamenttisampo.fi

Software tools

A new declarative version of the Sampo-UI framework, used in the ParliamentSampo portal:
https://github.com/SemanticComputing/sampo-ui

Learning materials

Open online video lecture course:
https://seco.cs.aalto.fi/teaching/sw-introduction/index.html

Tutorial materials and video on using Sampo-UI frameworks for portal development:
https://seco.cs.aalto.fi/tools/sampo-ui/

Events organized

2023

2022

Publications

2023

Senka Drobac, Laura Sinikallio and Eero Hyvönen: An OCR Pipeline for Transforming Parliamentary Debates into Linked Data: Case ParliamentSampo – Parliament of Finland on the Semantic Web. Digital Humanities in the Nordic and Baltic Countries Publication, DHNB2023 Conference Proceedings, vol. 5, no. 1, University of Oslo Library, Norway, 2023.

Eero Hyvönen: Parlamenttisampo avaa eduskunnan miljoona puhetta ja kansanedustajien verkostot kaikkien tutkittaviksi. Tieteessä tapahtuu, vol. 41, no. 1, Tieteellisten seurain valtuuskunta (TSV), 2023.

Eero Hyvönen, Petri Leskinen and Jouni Tuominen: A Data-driven Approach to Create an Ontology of Parliamentary Work: Case Parliament of Finland on the Semantic Web. Proceedings of SWODCH 2023. Semantic Web and Ontology Design for Cultural Heritage. Co-located with the 22nd International Semantic Web Conference (ISWC 2023) in Athens, Greece, CEUR Workshop Proceedings, Vol. 3540, November, 2023.

Eero Hyvönen, Laura Sinikallio, Petri Leskinen, Senka Drobac, Rafael Leal, Matti La Mela, Jouni Tuominen, Henna Poikkimäki and Heikki Rantala: Plenary Speeches of the Parliament of Finland as Linked Open Data and Data Services. Joint Proceedings of the Second International Workshop on Knowledge Graph Generation From Text and the First International BiKE Challenge co-located with 20th Extended Semantic Web Conference (ESWC 2023), pp. 1-20, CEUR Workshop Proceedings, Vol. 3447, August, 2023.

Henna Poikkimäki, Petri Leskinen and Eero Hyvönen: Applying Network and Bibliometric Analyses to Mentions of Politicians in Plenary Speeches: Case ParliamentSampo – Parliament of Finland on the Semantic Web. August, 2023. Submitted for evaluation.

Eero Hyvönen, Petri Leskinen and Heikki Rantala: Integrating Faceted Search with Data Analytic Tools in the User Interface of ParliamentSampo – Parliament of Finland on the Semantic Web. Proceedings of ESWC 2023, poster and demo papers, Springer-Verlag, June, 2023.

Eero Hyvönen: Creating and Using a National Linked Open Data Infrastructure for Cultural Heritage Applications and Digital Humanities Research: Lessons Learned. DARIAH Annual Event 2023, abstracts of papers, DARIAH-EU, June, 2023.

Eero Hyvönen: Creating and Using a Linked Open Ontology and Data Infrastructure for Digital Humanities in Finland: Lessons Learned 2003-2023. June, 2023. Under review.

Minna Tamper, Laura Sinikallio, Jouni Tuominen and Eero Hyvönen: Transforming Linguistically Annotated Finnish Parliamentary Debates Into the Parla-CLARIN Format. Digital Humanities in the Nordic and Baltic Countries Seventh Conference (DHNB 2023), Book of Abstracts (Sofie Gilbert and Annika Rockenberger (eds.)), p. 118, University of Oslo Library, Oslo, Norway, March, 2023.

Eero Hyvönen: How to Create a National Cross-domain Ontology and Linked Data Infrastructure and Use It on the Semantic Web. Programming and Data Infrastructure in Digital Humanities, Book of Abstracts, p. 7, High Performance Computing Centre, University of Évora, Portugal, March, 2023.

Eero Hyvönen, Petri Leskinen, Laura Sinikallio, Senka Drobac, Rafael Leal, Matti La Mela, Jouni Tuominen, Henna Poikkimäki and Heikki Rantala: ParliamentSampo Infrastructure for Publishing the Plenary Speeches and Networks of Politicians of the Parliament of Finland as Open Data Services. Aalto University, Dept. of Computer Science, February, 2023. Paper published at the publication event of the ParliamentSampo data service and portal.

Eero Hyvönen: How to Create a National Cross-domain Ontology and Linked Data Infrastructure and Use It on the Semantic Web. Semantic Web – Interoperability, Usability, Applicability, IOS Press, 2023. Forthcoming.

Eero Hyvönen: Digital Humanities on the Semantic Web: Sampo Model and Portal Series. Semantic Web – Interoperability, Usability, Applicability, vol. 14, no. 4, pp. 729-744, IOS Press, 2023.

2022

Henna Poikkimäki, Petri Leskinen, Minna Tamper and Eero Hyvönen: Analyses of Networks of Politicians Based on Linked Data: Case ParliamentSampo – Parliament of Finland on the Semantic Web. Semantic Web and Ontology Design for Cultural Heritage (SWODCH 2022), Turin, Italy, Proceedings, CEUR WS Proceedings, 2022. Accepted.

Eero Hyvönen, Laura Sinikallio, Petri Leskinen, Matti La Mela, Jouni Tuominen, Kimmo Elo, Senka Drobac, Mikko Koho, Esko Ikkala, Minna Tamper, Rafael Leal and Joonas Kesäniemi: Linked Data Approach for Studying Parliamentary Speeches and Networks of Politicians in Finland 1907-2021 (long paper). Digital Humanities 2022, Conference Abstracts, July 25-29, 2022, Online, Tokyo, Japan, University of Tokyo, pp. 254-257, ADHO, July, 2022.

Matti La Mela, Fredrik Norén and Eero Hyvönen (eds.): Proceedings of the Digital Parliamentary Data in Action (DiPaDA 2022) Workshop. CEUR Workshop Proceedings, vol. 3133, May, 2022.

Eero Hyvönen, Laura Sinikallio, Petri Leskinen, Matti La Mela, Jouni Tuominen, Kimmo Elo, Senka Drobac, Mikko Koho, Esko Ikkala, Minna Tamper, Rafael Leal and Joonas Kesäniemi: Finnish Parliament on the Semantic Web: Using ParliamentSampo Data Service and Semantic Portal for Studying Political Culture and Language. Digital Parliamentary data in Action (DiPaDA 2022), Workshop at the 6th Digital Humanities in Nordic and Baltic Countries Conference, long paper, pp. 69-85, CEUR Workshop Proceedings, Vol. 3133, May, 2022.

Minna Tamper, Rafael Leal, Laura Sinikallio, Petri Leskinen, Jouni Tuominen and Eero Hyvönen: Extracting Knowledge from Parliamentary Debates for Studying Political Culture and Language. Proceedings of the 1st International Workshop on Knowledge Graph Generation From Text and the 1st International Workshop on Modular Knowledge co-located with 19th Extended Semantic Web Conference (ESWC 2022) (Sanju Tiwari, Nandana Mihindukulasooriya, Francesco Osborne, Dimitris Kontokostas, Jennifer D’Souza and Mayank Kejriwal (eds.)), vol. 3184, pp. 70-79, CEUR WS, May, 2022. International Workshop on Knowledge Graph Generation from Text (TEXT2KG 2022).

Matti La Mela, Fredrik Norén and Eero Hyvönen: Digital Parliamentary Data in Action (DiPaDA 2022): Introduction. Proceedings of the Digital Parliamentary Data in Action (DiPaDA 2022) Workshop, CEUR Workshop Proceedings, Vol. 3133, May, 2022.

Laura Sinikallio: Eduskunnan täysistuntojen pöytäkirjojen muuntaminen semanttiseksi dataksi ja julkaiseminen verkkopalveluna (Transformation of the Debates of the Parliament of Finland into Semantic Data and a Data Service; in Finnish). University of Helsinki, Department of Computer Science, February, 2022. MSc Thesis.

Esko Ikkala, Eero Hyvönen, Heikki Rantala and Mikko Koho: Sampo-UI: A Full Stack JavaScript Framework for Developing Semantic Portal User Interfaces. Semantic Web – Interoperability, Usability, Applicability, vol. 13, no. 1, pp. 69-84, January, 2022. Online version published in 2021, print version in 2022.

D5.2.2: Educational material

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 5.2: Report on Educational material
Date of reporting: 29-11-2023

Report author: Sanna Kumpulainen (Tampere University)
Contributors: Sanna Kumpulainen, Jaakko Peltonen, Anna Sendra Toset, Elina Late, Farid Alijani (Tampere University)
Deliverable location: DARIAH-FI: Educational material

Description

This deliverable is a living document that gathers information on the educational materials relevant to the DARIAH-FI research infrastructure and its resources, such as documentation created to help in using the different tools, datasets, and workflows, and guidance on which courses might help in using the resources more efficiently. The document also includes an overview of the state of digital humanities and computational social sciences education in Finland, including links to relevant courses and programmes at the bachelor, master, and doctoral levels.

To create this deliverable, we used data from an internal survey on digital humanities and computational social sciences education in Finland conducted among the members of the DARIAH-FI research infrastructure, as well as information provided by the different work packages on their respective resources (e.g., location, related educational materials). Since some educational materials are still under development and have not yet been released, the document is expected to be made public during December 2023 through the DARIAH-FI website.

D5.1.3: Protocol for collecting workshop data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 5.1: Report on Protocol for collecting workshop data
Date of reporting: 29-11-2023

Report author: Sanna Kumpulainen (Tampere University)
Contributors: Sanna Kumpulainen, Jaakko Peltonen, Anna Sendra Toset, Elina Late, Farid Alijani (Tampere University)
Deliverable location: https://doi.org/10.5281/zenodo.10217404

Description

The protocol for collecting workshop data is available at: https://doi.org/10.5281/zenodo.10217404

This document is intended to serve as an initial guide for collecting user experience data from workshops and training sessions related to the resources developed by the FIN-CLARIAH consortium. To that end, the deliverable includes recommendations for designing the study and for setting up the data collection process, as well as information for creating protocols, informed consent forms, and other similar documents related to collecting user experience data.

To create this document, we used data collected from semi-structured interviews (n=34) with potential end-users of DARIAH-FI conducted between September 2022 and February 2023, as well as information gathered via a selective narrative review of different resources related to research design. This deliverable must be read in conjunction with the specific instructions provided by the institutions where the studies that collect user experience data take place.

Note: An updated version of this document might be released during Q1/2024.

FIN-CLARIAH D4.3.3: Representative Twitter dataset(s) of user-generated texts and metadata

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.3: Report on Representative Twitter dataset(s) of user-generated texts and metadata
Date of reporting: 25-11-2023

Report author: Mikko Laitinen (University of Eastern Finland)
Contributors: Masoud Fatemi, Mehrdad Salimi, Paula Rautionaho (all from the University of Eastern Finland)
Deliverable location: https://nordictweetstream.fi/ The URL is currently open for researchers, and we will add authentication to it in the spring of 2024.

Description

The WP’s main objective was to develop a representative dataset of social media data from Twitter from the five Nordic countries. The underlying idea is that various social media applications offer a promising and extremely large source of data for a range of disciplines in the social sciences and humanities (SSH) today, but research activities are often hindered by a lack of technical knowledge in collecting, pre-processing and analysing very large datasets. During the funding period, we expanded the data collection substantially when it became clear that the future of the data collection route was increasingly uncertain. All the materials were collected while the academic application programming interface (API) of the platform was still open; later, when the company changed its name to X, the API was closed down. In hindsight, the decision to store large amounts of material from various geographic settings turned out to be a wise move, because this subproject has now saved 12.5 years of material for future research.

The project activities so far have consisted of two parts:

  1. Collecting data: Masoud Fatemi has been in charge of the data collection. Our dataset initially focused on the Nordic region, but we decided to expand it considerably when it became clear that the API would be closed after changes in the ownership of the platform. In addition to our original data, we expanded the data collection to social network information in the Nordic region, the United States, the United Kingdom, and Australia, together with partners from the Australian Digital Observatory from the Queensland University of Technology and the University of Queensland. Basic information on the datasets is shown in Table 1 below; they range in size from nearly 800 million words to nearly 4 billion words in the US and the Australian networks. The datasets cover slightly different time frames from 2006 to May 2023. A substantial part of the NTS data will be shared via the Language Bank of Finland during 2023–24.

Table 1. Social media datasets collected in 2022–2023 in this subproject

  2. An easy-to-use graphic interface: The second part consisted of designing an easy-to-use graphic interface for accessing the material and for carrying out basic analysis and visualizations of the NTS data. The interface is currently in the piloting phase and can be accessed at https://nordictweetstream.fi/. It currently has only partial data, but we aim to add all the NTS data to the CSC by spring 2025. Mehrdad Salimi was hired for this task in June 2022, and his contract runs until May 2024, by which time the interface will be fully functional.

This WP has reached its objectives and succeeded in creating a national niche within the Finnish DH sphere. We have a good team that combines expertise from sociolinguistics and computer sciences, and we are able to develop digital tools for a range of audiences.

For 2024–2025, we aim to continue the work by adding a graphic interface for accessing network information and combining this network information with textual searches.

D2.5.2: Analysis and annotation tools for learner performances

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.5: Report on Finnish as a second language learners automated analysis and annotation tools
Date of reporting: 28-11-2023

Report author: Ari Huhta (University of Jyväskylä)
Contributors: Jenny Tarvainen, Ida Toivanen, Sirkku Kronholm, Mika Halttunen (University of Jyväskylä)
Deliverable location: (so far only in Google Drive)

Description

Work in WP2.5 is divided into two stages. In 2022, tools used for automated analysis of texts written by native speakers of Finnish were reviewed in collaboration with WP3.2. To investigate how the tools perform with texts written by learners of Finnish as a second language (L2), texts collected in previous projects were used to test certain tools. The texts represented different proficiency levels defined in the Common European Framework of Reference for Languages (CEFR), based on assessments by trained raters.

Testing focused on two promising tools, Finnish Tagtools (Language Bank) and Turku-neural-parser-pipeline (TurkuNPP). Both tools utilize machine learning with pre-trained language models. The tools perform e.g. segmentation, lemmatization and morphological tagging for Finnish texts. In addition, TurkuNPP provides information about universal dependency relations. The tools were tested with L2 Finnish learners’ texts evaluated at several CEFR levels. Only a few texts could be analysed due to various technical and other reasons. However, it was clear that the tools do not function well on learner performances, with various mistakes often confusing the processing. Typical L2 Finnish characteristics, like mixing back and front vowels (kavelin vs. kävelin), can cause incorrect lemmatization and/or tagging. However, in some cases, the tools are faithful to learner language forms and are able to give the lemma based on the inflected learner language form rather than giving the targeted Finnish lemma (e.g. lumihannen → lumihansi, not lumihanki). As language learning researchers have started to see learner language as a valuable language variant, this can be seen as a positive characteristic, but useful tools should give both learner language lemmas and targeted lemmas. A poster presentation of these findings was given at the annual conference of the Finnish Association of Applied Linguistics in November 2022.

In the second stage in 2023, a study has been conducted to build models for classifying learner language into CEFR levels and to investigate the resources needed to establish strong deep-learning-based L2 Finnish research in the future. This will facilitate e.g. designing automated tools for learner language detection for pedagogical and assessment purposes and contribute to the development of textual models for Finnish. Specifically, the study investigates (1) whether the currently available CEFR-annotated datasets are sufficient for training deep learning models, (2) how the trained models perform with new data, (3) whether pretraining on learner language with the masked language modeling (MLM) objective improves model performance, and (4) whether the model performs equally well across all CEFR levels.

Four CEFR-annotated datasets of written Finnish as a second or foreign language were used: the International Corpus of Learner Finnish (ICLFI), the Advanced Finnish Learner’s Corpus (LAS2), and two young learner corpora from the cross-sectional Cefling and the longitudinal Topling projects.

The state-of-the-art Finnish BERT model, FinBERT base, was used and tested against FinBERT large. To inspect the effect of pretraining (with the masked language modeling (MLM) objective), models trained with and without pretraining were compared. The models were evaluated with test data extracted from all four datasets. The evaluation metrics include accuracy, F1-score, recall and precision. For model evaluation, an average value over five folds for each evaluation metric is computed. An article based on the study is currently in preparation.
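
The fold-averaged evaluation can be illustrated with a minimal Python sketch. It is not the study's evaluation code: it assumes macro-averaging over CEFR levels for precision, recall and F1 (the report does not say which averaging was used), and the function names and toy labels are hypothetical.

```python
from statistics import mean


def fold_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 for one fold."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
    }


def average_over_folds(folds):
    """Average each metric over folds given as (y_true, y_pred) pairs."""
    per_fold = [fold_metrics(y_true, y_pred) for y_true, y_pred in folds]
    return {name: mean(m[name] for m in per_fold) for name in per_fold[0]}


# Toy predictions for two folds (invented labels, for illustration only).
folds = [
    (["A1", "A2", "B1"], ["A1", "A2", "B1"]),
    (["A1", "A2"], ["A1", "A1"]),
]
averaged = average_over_folds(folds)
print(averaged)
```

With five folds, the list passed to average_over_folds would simply contain five (y_true, y_pred) pairs. Reporting per-level precision and recall alongside the averages would also address question (4) above.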

Events / presentations:

Sirkku Kronholm & Ari Huhta: Automaattisten tekstityökalujen kehittäminen oppijankieliseen aineistoon. Poster presentation. AFinLA autumn symposium. Helsinki. 27.-29.10.2022. https://www.helsinki.fi/assets/drupal/2022-10/AFinLA2022_FINALFINAL_Timetable_A3.pdf

FIN-CLARIAH D4.1.4: R/Python module

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 4.1: Report on R/Python module
Date of reporting: 22-11-2023

Report author: Julia Matveeva (University of Turku), Leo Lahti (University of Turku)
Contributors: Pyry Kantanen (University of Turku), Akewak Jeba (University of Turku)
Deliverable location: https://github.com/fennicahub/fennica

Description

  1. Python module: We have developed a Python script utilizing Pandas, designed to selectively extract MARC fields from the raw data. This script allows for the extraction of fields individually or in batches, which are then saved in CSV format. The Python module is available at the following URL: https://github.com/fennicahub/fennica/tree/master/inst/examples/field_picking.
  2. R module: The Fennica-R package functions as an algorithmic toolkit designed explicitly for transparent quantitative analysis of the Finnish national bibliography, Fennica, and its metadata. Initially deployed to harmonize a subset of 70,000 entries, the module has recently been updated to facilitate the analysis of a more extensive dataset, now encompassing 1 million entries, including a subset for the period 1809-1917. The CSV files generated by the Python module are instrumental in further harmonization processes via the Fennica package.
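The field-picking step of the Python module can be illustrated with a short sketch. This is not the script from the repository: it uses only the standard library `csv` module instead of Pandas, and the record layout (a dictionary from MARC tag to a list of values), the function names and the `|` separator for repeated fields are all assumptions made for illustration.

```python
import csv
import io


def pick_fields(records, tags):
    """Keep only the requested MARC tags from each record.

    records: list of dicts mapping a MARC tag (e.g. "245") to a list of
    values; this layout is hypothetical. Repeated fields are joined with "|".
    """
    return [
        {tag: "|".join(record.get(tag, [])) for tag in tags}
        for record in records
    ]


def write_csv(rows, tags, handle):
    """Write the picked fields as CSV, one column per tag."""
    writer = csv.DictWriter(handle, fieldnames=tags)
    writer.writeheader()
    writer.writerows(rows)


# Usage with two made-up records; a batch is simply a longer list of tags.
records = [
    {"245": ["Kalevala"], "260": ["Helsinki", "1835"], "041": ["fin"]},
    {"245": ["Seitsemän veljestä"], "041": ["fin"]},
]
rows = pick_fields(records, ["245", "260"])
buffer = io.StringIO()
write_csv(rows, ["245", "260"], buffer)
```

A file opened with `open(path, "w", newline="")` can be passed in place of the `StringIO` buffer to save the result to disk.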

The Fennica-R package is publicly accessible at https://github.com/fennicahub/fennica. See the package README for an up-to-date link to outputs generated by the package.

<< List of all deliverables

FIN-CLARIAH D3.5.2: Text network analysis of political texts

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.5: Report on Text network analysis of political texts
Date of reporting: 22-11-2023

Report author: Kimmo Elo (University of Turku)
Contributors: Kimmo Elo, Veronika Laippala, Otto Tarkka, Pyry Kantanen, Markus Korhonen (all from the University of Turku)

Deliverable location:

  • KWIC-tool: http://finparl-01.utu.fi/apps/KWIC/
  • TNA-tool: http://finparl-01.utu.fi/apps/TNA/

Both URLs will be opened for public use on December 19, 2023.

Description

The WP’s main objective was to develop tools based on network analysis for the analysis of political texts. The following two (2) tools are now available for public use (as beta releases, see below):

  1. KWIC tool for FinParl corpus: This tool provides a user interface to query word embeddings with KWIC (Key Word In Context) method. The tool offers a simple, yet intuitive user interface built with R Shiny, with which the user can query key word embeddings of the FinParl corpus of plenary debates of the Finnish parliament (eduskunta) and use the KWIC results to inspect n-grams and to visualise key word embeddings as text networks.
  2. TNA tool for the analysis of speeches of Finnish MPs: This tool will provide functionalities for vocabulary-based content analysis of political speeches. The user selects an MP and can then study 1) a timeline of the MP’s plenary speeches, 2) a wordcloud of at most the 500 words most used by the MP, as well as 3) a speaker-to-concept network consisting of the 50 most frequently used concepts of the selected MP and of his/her most similar colleagues (similarity measured as word-based cosine similarity).
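Word-based cosine similarity between speakers can be sketched as follows. The tools themselves are built with R Shiny; this Python sketch only illustrates the measure, and the function names, the use of raw word counts as vectors and the input layout (speaker name mapped to a list of words) are assumptions.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two bags of words (raw counts)."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(target, speeches, top_n=3):
    """Rank the other speakers by similarity to the target speaker.

    speeches: dict mapping a speaker name to a list of words (hypothetical).
    """
    scores = {
        speaker: cosine_similarity(speeches[target], words)
        for speaker, words in speeches.items()
        if speaker != target
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A real application would probably weight or filter the vocabulary first (for example by removing function words), which this sketch leaves out.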

The dataset these deliverables are based on covers a timespan from 1990 to 2021. The WP 3.5 uses a tailored FinParl corpus consisting of all plenary speeches of the Finnish eduskunta since 1907. The data and apps are located on data servers of the University of Turku.

Overall, WP 3.5 has reached its most central objectives and succeeded in creating a well-functioning, active, multi-disciplinary collaboration network within the University of Turku. This network brings together expertise from social sciences and computational linguistics and is well capable of developing tools for a wide audience. Team dynamics are good, and regular internal meetings are used to discuss current issues, problems, and solutions.

The deliverables to be published in December 2023 should, however, be considered project milestones only. From 2024 onwards, the development of both tools will continue within the FIRI consortium project “LAWPOL”. The next milestone is Q2/2024, by which these tools are expected to be fully integrated in the first release of the LAWPOL Digital Workbench for Political and Legal Studies, which brings together the most important political materials related to Finnish democracy.

<< List of all deliverables

FIN-CLARIAH D3.3.2: R package for data concept network

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.3: Report on R package for data concept network
Date of reporting: 2023-11

Report author: Maria Valaste (University of Helsinki)
Contributors: Adeline Clarke (University of Helsinki), Ida Toivanen (University of Jyväskylä), Jani-Matti Tirkkonen (University of Eastern Finland), Jaakko Peltonen (Tampere University)
Deliverable location: Several repositories in Github are published (see below).

Description

The aim of WP3.3 is to enhance the utilization of unstructured qualitative textual data in the context of Finnish surveys with the use of a concept network tool. The purpose of this toolbox is to build a bridge from not-very-NLP-coding-apt social science researchers towards the computational NLP community’s text analytics methods and processes that might be useful for understanding the results of their survey.

This deliverable is an R package, which brings together tools for the analysis of open-ended questions in Finnish-language data. In addition to the analysis tools, the R package contains a sample dataset to familiarise the user with the functions of the package. The package builds on the previous deliverable 3.3.1.

Repositories

https://github.com/DARIAH-FI-Survey-Concept-Network (public)
https://github.com/DARIAH-FI-Survey-Concept-Network/finnishsurveytext (will be published)

<< List of all deliverables

D3.2.2: Annotation & analysis tools for NARC data

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.2: Report on annotation & analysis tools for NARC data
Date of reporting: 08-11-2023

Report authors: Venla Poso (University of Jyväskylä), Ida Toivanen (University of Jyväskylä), Tanja Välisalo (University of Jyväskylä), Antero Holmila (University of Jyväskylä)

Deliverable location: Released soon.

Description

Named entity recognition (NER) model for state authority archival data.

The National Archives of Finland started a mass digitisation project in 2019, with the aim of digitising over 135 kilometres of archival data. We identified a need for an advanced method of extracting information from unstructured and noisy text, which will make the data more accessible and potentially generate innovative uses of the data in the research sector. The process included two questionnaires to the end-users, creation of annotation guidelines, manual annotation, inter-annotator agreement testing and model development.

This process resulted in a NER model, which identifies ten different entity categories (person, organisation, date, location, geopolitical location, nationalities/religious and political groups, event, product, journal number and Finnish business identity code). Journal number and Finnish business code are newly established named entities derived from the responses to two questionnaires, as opposed to the others which rely on existing NER models. The model obtains comparable results with non-OCR’d data while significantly improving named entity recognition results when tested with OCR’d state authority archival data.

Development was conducted in cooperation with the National Archives of Finland and their DALAI project.

Links

Version 0.1: https://huggingface.co/Kansallisarkisto/finbert-ner

Publications

Poso, Venla, Tanja Välisalo, Ida Toivanen, Antero Holmila, and Jari Ojala. 2023. “Untapped Data Resources. Applying NER for Historical Archival Records of State Authorities”. Digital Humanities in the Nordic and Baltic Countries Publications 5 (1). Oslo, Norway: 55-69. DOI: 10.5617/dhnbpub.10650

<< List of all deliverables

D2.4.3.3: Initializing terminology collections

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.4: Report on Initializing terminology collection
Date of reporting: 2023-11

Report author: Harri Kettunen (UHEL)
Contributors: Tiina Onikki-Rantajääskö (UHEL)
Deliverable location: The Helsinki Term Bank for the Arts and Sciences – Tieteen termipankki

Description

Since the start of 2023, 613 new concept pages have been created at the Helsinki Term Bank for the Arts and Sciences (HTB) in the following fields: Archaeology, Botany, Classical Studies, Environmental Sciences, Geology, Geophysics, History, Language Technology, Linguistics, Literary Studies, Martial Arts Studies, Media And Communication Studies, Mesoamerican Studies, North American Studies, Nutritional Sciences, Open Science, Philosophy, Religion Studies, Semiotics, and Theology.

Furthermore, updates have been made to the database on 1,806 concept pages in the following fields: Aesthetics, Archaeology, Art History, Astronomy, Behavioral Sciences, Biology, Botany, Classical Studies, Clean Energy Research, Digital Humanities, Educational Sciences, Environmental Sciences, Epidemiology, Folklore, Food Sciences, Geology, Geophysics, Heritage Research, History, Language Technology, Law, Linguistics, Literary Studies, Martial Arts Studies, Media And Communication Studies, Mesoamerican Studies, Microbiology, Mycology, North American Studies, Open Science, Philosophy, Seismology, Semiotics, Språkvetenskap, Study Of Religions, Sustainability Science, Terminologiako Bankos, Terminology, Theology, Translation Studies, Veterinary Medicine, and Zoology.

In addition, the fields of Anthropology, Contaminated Land Studies, Gender Studies, Mathematics, and Urban Studies are working offline until there is a critical mass of terminology to be published at the HTB. Furthermore, terminology work has been agreed upon to be carried out in the following fields: Arctic Research, Asian Studies, Geography, Military Sciences, and Physiology. A multidisciplinary group has also been established for meta scientific terminology of transdisciplinarity.

All in all, 613 new concept pages have been created and 1,806 existing concept pages have been updated. Since the beginning of the year, the volume of the new additions and edits totals 341,175 bytes, which is approximately 280,000 characters or ca. 100 A4-size pages. The total number of concept pages as of November 26, 2023, is 45,010.

<< List of all deliverables

FIN-CLARIAH D2.3.2: Aligning and retrieving

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.3: Report on Aligning and retrieving
Date of reporting: 13-11-2023

Report author: Jack Rueter, Erik Axelson (University of Helsinki)
Contributors: Aleksei Ivanov (University of Tartu), Niko Partanen (University of Helsinki)
Deliverable location: Christmas Gospel text-to-speech in four Uralic languages

Description

The «Christmas Gospel text-to-speech in four Uralic languages» (shortname: xmas-gospel-tts) is a collection of .txt, .wav and .vrt files with a variety of alignments used in Korp searches. The collection is intended as a demo showing how to donate parallel multilingual spoken materials to the Language Bank of Finland and how to implement them there.

Background

A model for Massively Multilingual Speech (MMS, CC-BY-NC 4.0) has recently been developed at Facebook (Meta), with language support for hundreds of languages whose automatic speech recognition (ASR), text to speech (TTS) and language identification (LID) coverage is documented here.

The documentation at Meta includes 16 of approximately 32 Uralic languages or language forms spoken today. We chose three languages, Komi-Zyrian (kpv), Karelian (krl) and Erzya (myv), of the eight Uralic languages with coverage for the three categories of ASR, TTS and LID, and then we selected one additional language, Olonets-Karelian (olo, aka Livvi), one of the 16 languages lacking coverage for any of the three categories. Our choice of a fourth language was motivated by the fact that Karelian and Olonets-Karelian share much the same character-to-sound correlation and that the latter might actually be the source of digital information under the umbrella term Karelian.

The .txt files represent a segment of an existing parallel corpus, Parallel Biblical Verses for Uralic Studies (PaBiVUS), which is described in Metashare with a CC-BY-NC license. The segment or mini parallel corpus here is the Christmas Gospel (Luke 2:1–20), which is well known in Finland.

The .wav files have been produced as a text-to-speech exercise with a Python script by Aleksei Ivanov, Niko Partanen and Jack Rueter, utilizing the model for MMS built at Facebook (see above).

The .vrt files contain morpho-syntactically annotated versions of the Christmas Gospel texts, which have been subsequently inspected and manually corrected. The annotation used analysers built with Helsinki Finite-State Technologies (HFST) under continual development at Saami Language Technology (GiellaLT), based at the Norwegian Arctic University, in Tromsø: (Erzya; Komi-Zyrian; Karelian; Olonets-Karelian); Constraint Grammar (CG) methods as documented at the University of Southern Denmark, and a Universal Dependencies tool, Annotatrix.

The demo provides two facets of searchability on the Korp server. First, there is parallel corpus searchability, as found in the PaBiVUS corpus, i.e., there are links between .vrt coded verses of the Christmas Gospel with automatically annotated and subsequently manually corrected dependencies. Second, the text content of each verse is linked with the sound file (.wav), which allows for a sentence-to-utterance alignment as found, for example, in the Finnish Parliament materials, where timestamps would be the equivalents of our verse identifiers.

<< List of all deliverables

D2.2.2: Speech recognition for L2 update

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 2.2: Report on Speech recognition for L2
Date of reporting: 2023-11

Report author: Yaroslav Getman (Aalto University)
Contributors: Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grosz, T., Clara, A., Kurimo, M. (Aalto University); Salvi, G., Svendsen, T. (Norwegian University of Science and Technology); Strömbergsson, S. (Karolinska Institutet); Smolander, A., Ylinen, S. (Tampere University); von Zansen, A., Hilden, R., Linden, K. (University of Helsinki); Kallio, H., Kuronen, M., Huhta, A., Kronholm, S. (University of Jyväskylä)

Deliverable location: Aalto Speech Research | Multi-task wav2vec2

Description

Systems trained to perform automatic speech recognition (ASR) and pronunciation rating for child L2 Finnish are available on HuggingFace Hub. The links to the models are collected on this GitHub page and the methods are described in [1] and [2].

ASR systems for L2 learners of Finnish are described in [3] and [4]. The scripts used to train the models are available on this GitHub page and the data will be released in Kielipankki here and here.

References

[1] Getman, Y., Al-Ghezi, R., Grosz, T., Kurimo, M. (2023) Multi-task wav2vec2 Serving as a Pronunciation Training System for Children. Proc. 9th Workshop on Speech and Language Technology in Education (SLaTE), 36-40, doi: 10.21437/SLaTE.2023-8
[2] Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grosz, T., Kurimo, M., Salvi, G., Svendsen, T., Strömbergsson, S., Smolander, A., Ylinen, S. (2023) Developing an AI-Assisted Low-Resource Spoken Language Learning App for Children. IEEE Access, vol. 11, pp. 86025-86037, 2023, doi: 10.1109/ACCESS.2023.3304274
[3] Kurimo, M., Getman, Y., Voskoboinik, E., Al-Ghezi, R., Kallio, H., Kuronen, M., von Zansen, A., Hilden, R., Kronholm, S., Huhta, A., Linden, K. (2023) New data, benchmark and baseline for L2 speaking assessment for low-resource languages. Proc. 9th Workshop on Speech and Language Technology in Education (SLaTE), 166-170, doi: 10.21437/SLaTE.2023-32
[4] Al-Ghezi, R., Voskoboinik, K., Getman, Y., von Zansen, A., Kallio, H., Clara, A., Kuronen, M., Huhta, A., Hilden, R. (in review) Automatic speaking assessment of Spontaneous L2 Finnish and Swedish. Language Assessment Quarterly.

<< List of all deliverables

D1.3.4: QA pair corpora

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on QA pair corpora
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

  1. Manually annotated English QA dataset

    100 manually annotated documents for question-answer pairs from a random sample of the documents labelled as having the QA label from the English web-scale dataset Falcon-refinedWeb. The dataset is split into 40 dev and 60 test, and includes 345 questions and 192 answers.

  2. Manually annotated Finnish QA dataset

    218 manually annotated documents for QA pairs from a random sample of the documents labelled as having the QA label from the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset is split into train, dev and test with 100, 50 and 68 documents respectively. The dataset includes 376 questions and 333 answers.

  3. ChatGPT-annotated Finnish QA dataset

    3,424 ChatGPT-annotated documents for QA pairs from a random sample of the documents labelled as having the QA label from the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset has been only used for training. The dataset includes 2,919 questions and 2,491 answers.

The first three datasets have been used in the training and testing of the QA pair extraction model introduced in report D1.3.3. They do not necessarily include QA pairs: the documents were annotated by marking text spans as either a question or an answer, without taking into account whether the spans formed a pair. The data for the first three datasets can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/annotated-data

  4. Corpus of QA pairs retrieved from web-scale datasets

    QA pairs retrieved by the QA pair retrieval pipeline from several different corpora: the Finnish Parsebank, CC-Fi, mC4-Fi and the English Falcon-refinedWeb. The corpus includes almost 200K retrieved pairs from 125K documents after discarding low-quality pairs. The final pairs can be found here: https://github.com/TurkuNLP/register-qa/tree/main/Turku-WebQA

The publication details will be updated later (work submitted for LREC-COLING 2024).

<< List of all deliverables

FIN-CLARIAH D3.1.4: Incremental update process

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 3.1: Report on Ingestion framework
Date of reporting: 2023-11

Report author: Johanna Lilja (National Library of Finland), Tuula Pääkkönen (National Library of Finland)
Contributors: Martin Matthiesen (CSC)
Deliverable location: https://github.com/CSCfi/kielipankki-nlf-harvester

Description

The OAI-PMH API of the National Library is regularly queried for changes in the dataset. If such changes occur (additions or deletions), the respective files are added to or deleted from the downloaded dataset as needed and another snapshot is created (see D3.1.3). Support for deleted bindings is still under development.
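The update step can be illustrated with a minimal Python sketch. It is not the harvester's implementation (see the repository for that): the function names, the sample response and the identifier strings are hypothetical. The sketch relies only on the OAI-PMH 2.0 convention that a record header carries `status="deleted"` when the record has been removed.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up response to a selective harvest; the identifiers are hypothetical.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>binding:1</identifier></header>
    <header status="deleted"><identifier>binding:2</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""


def split_changes(response_xml):
    """Split the record headers of a response into added and deleted ids."""
    root = ET.fromstring(response_xml)
    added, deleted = [], []
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        if header.get("status") == "deleted":
            deleted.append(identifier)
        else:
            added.append(identifier)
    return added, deleted


def apply_changes(snapshot, added, deleted):
    """Return the next snapshot: the previous ids plus additions, minus deletions."""
    return (set(snapshot) | set(added)) - set(deleted)


added, deleted = split_changes(SAMPLE)
next_snapshot = apply_changes({"binding:2", "binding:3"}, added, deleted)
```

In a real harvest, the response would come from a request limited with the `from` argument to changes since the previous snapshot, and the identifiers would be mapped to the files to download or remove.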

More information

FIN-CLARIAH WP3.1 presentation from DARIAH-FI workshop on December 1st, 2023.

<< List of all deliverables

D1.3.3: Models for retrieving QA pairs from the web

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on Models for retrieving QA pairs from the web
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

Our pipeline to retrieve question-answer pairs from text corpora includes two transformer models: one for extracting documents with likely QA pairs from web-crawled corpora, and another one for extracting the actual QA pairs from the documents.

The model for QA document identification is a cross-lingual sequence classification model. It is trained on register-annotated data in English and Finnish, as well as on unpublished Swedish and French versions, and is specifically fine-tuned to predict whether a document (a piece of text) includes something related to questions and answers.

The model for QA pair extraction is a token classification model (for English and Finnish) which predicts whether a token in the text belongs to a question, answer or other and then splits the text into QA pairs based on those predictions and aggregation strategies. This model is used on the documents labelled as having something related to questions and answers.
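The step from token-level predictions to QA pairs can be sketched in Python as follows. This is not the pipeline's actual aggregation strategy, which the report does not specify: the label names `Q`, `A` and `O`, the function names, and the rule of pairing each question with the next answer span are assumptions made for illustration.

```python
from itertools import groupby


def label_spans(tokens, labels):
    """Merge runs of consecutive tokens with the same label into text spans.

    labels use the hypothetical tag set "Q" (question), "A" (answer) and
    "O" (other); spans labelled "O" are dropped.
    """
    spans = []
    position = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != "O":
            spans.append((label, " ".join(tokens[position:position + length])))
        position += length
    return spans


def pair_spans(spans):
    """Pair each question span with the first answer span that follows it."""
    pairs = []
    question = None
    for label, text in spans:
        if label == "Q":
            question = text
        elif label == "A" and question is not None:
            pairs.append((question, text))
            question = None
    return pairs
```

Under this rule, a question that is followed by another question before any answer is discarded, as is an answer with no preceding question.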

The publication details will be updated later (work submitted for LREC-COLING 2024).

Links

  • The model for QA document identification: https://huggingface.co/TurkuNLP/xlmr-qa-register
  • Corpus of Online Registers of English (CORE): https://github.com/TurkuNLP/CORE-corpus
  • FinCORE corpus: https://github.com/TurkuNLP/FinCORE_full
  • Multilingual register annotations: https://github.com/TurkuNLP/multilingual-register-labeling/tree/master/register-annotations
  • The model for QA pair extraction (English): https://huggingface.co/TurkuNLP/xlmr-qa-extraction-en
  • The model for QA pair extraction (Finnish): https://huggingface.co/TurkuNLP/xlmr-qa-extraction-fi

<< List of all deliverables

    D1.2.2: Transcription Service for Finnish Interviews

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 1.2: Transcription Service for Finnish Interviews
    Date of reporting: 2023-10

    Report author: Martin Matthiesen (CSC)
    Contributors: Anssi Moisio (Aalto), Sam Hardwick (CSC), Niko Partanen (National Library), Aivo Olev (Tallinn University of Technology)
    Deliverable location: https://tekstiks.ee (Finnish)

    Description

    The transcription service is split into two parts: the end-user frontend is hosted at the University of Tallinn, Estonia, at https://tekstiks.ee, and the speech recognition backend is hosted at CSC – IT Center for Science in Finland. For details and usage instructions, see https://www.kielipankki.fi/arkisto/resource-info/tools-for-speech-analysis-and-annotation/

    The source code is available on Github.

    References:

    Olev, A; Alumäe, T. (2022). Estonian Speech Recognition and Transcription Editing Service. Baltic J. Modern Computing, Vol. 10 (3), pp. 409–421. DOI: 10.22364/bjmc.2022.10.3.14

    Moisio, A; Porjazovski, D; Rouhe, A; Getman, Y; Virkkunen, A; AlGhezi, R; Lennes, M; Grósz, T; Lindén, K & Kurimo, M (2022). Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks. Language Resources and Evaluation. DOI: 10.1007/s10579-022-09606-3

    Moisio, A. (2022). Lahjoita puhetta baseline Kaldi ASR model (1.2). Zenodo. DOI: 10.5281/zenodo.7101543

    << List of all deliverables

    FIN-CLARIAH D3.5.1: Text network analysis of political texts

    Project: FIN-CLARIAH
    Grant agreement: Academy of Finland no. 345610
    Start date: 01-01-2022
    Duration: 24 months

    WP 3.5: Report on Text network analysis of political texts
    Date of reporting: 06-06-2023

    Report author: Kimmo Elo (University of Turku)
    Contributors: Kimmo Elo, Veronika Laippala, Otto Tarkka (University of Turku)
    Deliverable location: None so far, R Shiny GUI and GitHub repository will be made public in Q3/2023.

    Description

    The WP’s main objective is to develop tools based on network analysis for the analysis of political texts. The tools will be made available both via a web-interface and as dedicated R packages. Three (3) tools are currently under development:

    1. A KWIC tool for FinParl corpus: This tool provides a user interface to query word embeddings with KWIC (Key Word In Context) method. The tool offers a simple, yet intuitive user interface built with R Shiny, with which the user can analyse key word embeddings of the FinParl corpus of plenary debates of the Finnish parliament (eduskunta). A beta version of this tool is already in testing; the release is planned for Q3/2023.
    2. A tool for semantic and text network analysis and visualisations: Building on the KWIC tool, this tool will provide functionalities for vocabulary-based content analysis of political text, for the comparison of different text networks, as well as for dynamic text network analysis with a set of visualisation tools. These tools are currently under active development and testing; the production phase is expected to be completed in Q3/2023.
    3. A tool for analysing text reuse: This tool will offer functionalities to identify and analyse structural similarities of vocabulary-based text networks. Such structural patterns can help us to identify how phrases or longer text passages are re-used over time. The tool will also provide capabilities to identify patterns in concept embedding, a widely used strategy in political texts to frame different issues in the same (or similar) context. This tool is currently in planning, the active development and coding is expected to be completed in Q4/2023.

    All these tools will be developed for and tested with the FinParl-corpus consisting of all plenary speeches of the Finnish eduskunta since 1907. All tools will access a tailored dataset maintained on a server at the University of Turku.

    The FinParl-corpus used by this WP is structured according to the ParlaMint XML schema, so that – at least theoretically – the tools should be compatible with all corpora following the same ParlaMint schema. Our plan, however, is not to limit the analytical tools for the use with FinParl-corpus only. Instead, the tools will be designed to work with tidy data, and the WP provides tools to access relevant resources and to convert the working data in tidy data for further analysis.

    Overall, the WP is proceeding quite well and mostly on schedule. We have a small, yet active research team that brings together expertise from social sciences and computational linguistics and is capable of developing tools for a wide audience. Team dynamics are good, and regular internal meetings are used to discuss current issues, problems, and solutions. The WP also benefits from a big FIRI research grant of the Academy of Finland covering the years 2023–2025, which allows us more room for manoeuvre in planning the WP’s future development.
