Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months
WP 1.3: Report on QA pair corpora
Date of reporting: 02-11-2023
Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP
100 documents manually annotated for question-answer (QA) pairs, drawn as a random sample from the documents carrying the QA label in the English web-scale dataset Falcon-refinedWeb. The dataset is split into 40 dev and 60 test documents, and includes 345 questions and 192 answers.
218 documents manually annotated for QA pairs, drawn as a random sample from the documents carrying the QA label in the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset is split into train, dev and test with 100, 50 and 68 documents respectively. The dataset includes 376 questions and 333 answers.
3,424 documents annotated by ChatGPT for QA pairs, drawn as a random sample from the documents carrying the QA label in the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset has been used only for training. The dataset includes 2,919 questions and 2,491 answers.
The first three datasets were used to train and test the QA pair extraction model introduced in report D.1.3.3. They do not necessarily contain complete QA pairs: the annotation marked text spans as either a question or an answer, without taking into account whether a span belonged to a pair. The data for the first three datasets can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/annotated-data
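The figures reported for the three annotated datasets can be cross-checked with a short script. The following is a minimal sketch that only restates the numbers given above; the class name, the dataset labels and the helper function are hypothetical and do not come from the repository, and the script does not read the released files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedDataset:
    """Figures for one annotated dataset, copied from this report."""
    name: str          # hypothetical label, not a file or repository name
    documents: int
    questions: int
    answers: int
    splits: dict       # split name -> number of documents


DATASETS = [
    AnnotatedDataset("english_manual", 100, 345, 192,
                     {"dev": 40, "test": 60}),
    AnnotatedDataset("finnish_manual", 218, 376, 333,
                     {"train": 100, "dev": 50, "test": 68}),
    AnnotatedDataset("finnish_chatgpt", 3424, 2919, 2491,
                     {"train": 3424}),
]


def split_total(dataset: AnnotatedDataset) -> int:
    """Sum the document counts over all splits of a dataset."""
    return sum(dataset.splits.values())


# Every document should belong to exactly one split.
for ds in DATASETS:
    assert split_total(ds) == ds.documents, ds.name

total_documents = sum(ds.documents for ds in DATASETS)
total_questions = sum(ds.questions for ds in DATASETS)
total_answers = sum(ds.answers for ds in DATASETS)

print(f"documents: {total_documents}")
print(f"questions: {total_questions}")
print(f"answers:   {total_answers}")
```

In every dataset the number of questions exceeds the number of answers, which is consistent with spans having been annotated independently rather than as pairs.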
QA pairs retrieved by the QA pair retrieval pipeline from several corpora: the Finnish Parsebank, CC-Fi and mC4-Fi, and the English Falcon-refinedWeb. The QA pair corpus includes almost 200K retrieved pairs from 125K documents after low-quality pairs were discarded. The final pairs can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/qa_predicted_final_files
The publication details will be updated later (work submitted to LREC-COLING 2024).