D1.3.4: QA pair corpora

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on QA pair corpora
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

Manually annotated English QA dataset

100 documents manually annotated for question-answer (QA) pairs, drawn as a random sample from the documents carrying the QA label in the English web-scale dataset Falcon-refinedWeb. The dataset is split into 40 dev and 60 test documents and includes 345 questions and 192 answers.

Manually annotated Finnish QA dataset

218 documents manually annotated for QA pairs, drawn as a random sample from the documents carrying the QA label in the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset is split into train, dev and test sets of 100, 50 and 68 documents, respectively, and includes 376 questions and 333 answers.

ChatGPT-annotated Finnish QA dataset

3,424 documents annotated for QA pairs by ChatGPT, drawn as a random sample from the documents carrying the QA label in the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset has been used only for training and includes 2,919 questions and 2,491 answers.

The three datasets above were used to train and test the QA pair extraction model introduced in report D.1.3.3. They do not necessarily contain complete QA pairs: the annotation marked text spans as either a question or an answer, without taking into account whether a span had a matching counterpart. The data for these three datasets can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/annotated-data
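To illustrate what span-level annotation without pairing means in practice, the following is a minimal sketch in Python. It assumes a hypothetical annotation format of character offsets with a QUESTION or ANSWER label; the function name, the label names and the whitespace tokenisation are illustrative assumptions and are not taken from the repository, whose actual data format may differ.

```python
import re
from typing import List, Tuple

# Hypothetical annotation format (an assumption, not the repository's format):
# (start_char, end_char, label) with an exclusive end offset.
Span = Tuple[int, int, str]


def spans_to_token_labels(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Label each whitespace-delimited token with the span covering it, or "O"."""
    labelled = []
    for match in re.finditer(r"\S+", text):
        label = "O"
        for start, end, span_label in spans:
            # A token takes a span's label only when it lies fully inside the span.
            if start <= match.start() and match.end() <= end:
                label = span_label
                break
        labelled.append((match.group(), label))
    return labelled


# A document containing both a question span and an answer span.
paired_text = "Where is Turku? It is in Finland. Thanks!"
paired_spans = [(0, 15, "QUESTION"), (16, 33, "ANSWER")]
paired_labels = spans_to_token_labels(paired_text, paired_spans)

# A document containing only a question span: valid under span-level
# annotation, even though it yields no complete QA pair.
unpaired_text = "Where is Turku? Thanks!"
unpaired_spans = [(0, 15, "QUESTION")]
unpaired_labels = spans_to_token_labels(unpaired_text, unpaired_spans)
```

Because each span is labelled independently, a document can contribute questions without answers, which is consistent with the question counts exceeding the answer counts in all three datasets.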

Corpus of QA pairs retrieved from web-scale datasets

QA pairs retrieved by the QA pair retrieval pipeline from several corpora: the Finnish Parsebank, CC-Fi and mC4-Fi, and the English Falcon-refinedWeb. After low-quality pairs were discarded, the corpus includes almost 200K retrieved pairs from 125K documents. The final pairs can be found here: https://github.com/TurkuNLP/register-qa/tree/main/Turku-WebQA

The publication details will be updated later (work submitted to LREC-COLING 2024).
