D1.3.3: Models for retrieving QA pairs from the web

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on Models for retrieving QA pairs from the web
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

Our pipeline for retrieving question-answer (QA) pairs from text corpora consists of two transformer models: one identifies documents that are likely to contain QA pairs in web-crawled corpora, and the other extracts the actual QA pairs from those documents.

The model for QA document identification is a cross-lingual sequence classification model. It is trained on register-annotated data in English and Finnish, as well as on unpublished versions of Swedish and French data, and is fine-tuned specifically to predict whether a document (a piece of text) contains content related to questions and answers.
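The document identification step can be sketched as a simple filter. This is only an illustration under stated assumptions, not the project's implementation: `score_fn` is a hypothetical stand-in for the classifier that returns two logits (not-QA, QA) per document, `toy_score` is a toy heuristic used only to make the example runnable, and the 0.5 threshold is an arbitrary choice.

```python
import math
from typing import Callable, Iterable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_qa_documents(
    documents: Iterable[str],
    score_fn: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> List[str]:
    """Keep the documents whose predicted QA probability reaches the threshold.

    `score_fn` is a hypothetical callable returning [not_qa_logit, qa_logit].
    """
    kept = []
    for doc in documents:
        _not_qa, qa = softmax(score_fn(doc))
        if qa >= threshold:
            kept.append(doc)
    return kept


def toy_score(doc: str) -> List[float]:
    """Toy stand-in for the classifier: question marks raise the QA logit."""
    return [1.0, 2.0 * doc.count("?")]


docs = ["What is X? X is Y.", "Plain news text."]
qa_docs = filter_qa_documents(docs, toy_score)
```

In practice `score_fn` would wrap the fine-tuned classifier, and the threshold would be tuned on held-out data to balance precision against recall.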

The model for QA pair extraction is a token classification model (for English and Finnish). It predicts whether each token in the text belongs to a question, an answer, or other text; the text is then split into QA pairs based on those predictions and on aggregation strategies. This model is applied to the documents that the first model labels as containing content related to questions and answers.
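One possible aggregation strategy can be sketched as follows. This is a minimal illustration, not the strategy used in the project: the label names `QUESTION`, `ANSWER` and `OTHER` are assumptions, the per-token labels are taken as given rather than predicted, and each question is simply paired with the next answer span that follows it.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def group_spans(tokens: Sequence[str], labels: Sequence[str]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens that share a label into (label, text) spans."""
    spans = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        spans.append((label, " ".join(tokens[pos:pos + length])))
        pos += length
    return spans


def extract_qa_pairs(tokens: Sequence[str], labels: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each question span with the next answer span that follows it."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    pairs = []
    question = None
    for label, text in group_spans(tokens, labels):
        if label == "QUESTION":
            question = text  # a later question replaces an unanswered one
        elif label == "ANSWER" and question is not None:
            pairs.append((question, text))
            question = None
        # OTHER spans, and answers with no preceding question, are skipped
    return pairs


tokens = "Intro : What is FIN-CLARIAH ? A research infrastructure . Bye".split()
labels = ["OTHER"] * 2 + ["QUESTION"] * 4 + ["ANSWER"] * 4 + ["OTHER"]
pairs = extract_qa_pairs(tokens, labels)
```

Here `pairs` holds one tuple, pairing the question span with the answer span. A real aggregation strategy would also need to handle subword tokens and isolated mislabelled tokens inside a span, which this sketch ignores.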

The publication details will be added later (the work has been submitted to LREC-COLING 2024).

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4129317
