
D1.3.4: QA pair corpora

Project: FIN-CLARIAH
Grant agreement: Academy of Finland no. 345610
Start date: 01-01-2022
Duration: 24 months

WP 1.3: Report on QA pair corpora
Date of reporting: 02-11-2023

Report author: Anni Eskelinen (UTU)
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, Sampo Pyysalo (UTU)
Deliverable location: https://github.com/TurkuNLP/register-qa | https://huggingface.co/TurkuNLP

Description

  1. Manually annotated English QA dataset

    100 documents manually annotated for question-answer (QA) pairs, drawn as a random sample from the documents carrying the QA label in the English web-scale dataset Falcon-refinedWeb. The dataset is split into 40 dev and 60 test documents and includes 345 questions and 192 answers.

  2. Manually annotated Finnish QA dataset

    218 documents manually annotated for QA pairs, drawn as a random sample from the documents carrying the QA label in the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset is split into train, dev and test sets of 100, 50 and 68 documents respectively, and includes 376 questions and 333 answers.

  3. ChatGPT-annotated Finnish QA dataset

    3,424 documents annotated for QA pairs by ChatGPT, drawn as a random sample from the documents carrying the QA label in the Finnish web-scale datasets Parsebank, CC-Fi and mC4-Fi. The dataset has been used only for training and includes 2,919 questions and 2,491 answers.

The first three datasets were used to train and test the QA pair extraction model introduced in report D1.3.3. They do not necessarily contain complete QA pairs: text spans were annotated as either a question or an answer, without taking into account whether a span had a matching counterpart. The data for these three datasets can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/annotated-data
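Because questions and answers were annotated independently, the question and answer counts of each annotated dataset differ. The following Python sketch tabulates the figures reported above and checks that the split sizes add up to the stated document totals. The class name, field names and short dataset labels are illustrative assumptions and do not reflect the file format used in the repository.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedDataset:
    """Figures copied from this report; names are illustrative only."""
    name: str
    splits: dict      # split name -> number of documents
    questions: int    # annotated question spans
    answers: int      # annotated answer spans

    @property
    def documents(self) -> int:
        # Total number of documents across all splits.
        return sum(self.splits.values())

    @property
    def surplus_questions(self) -> int:
        # Question spans that cannot have a one-to-one answer span.
        return self.questions - self.answers


DATASETS = [
    AnnotatedDataset("English manual", {"dev": 40, "test": 60}, 345, 192),
    AnnotatedDataset("Finnish manual", {"train": 100, "dev": 50, "test": 68}, 376, 333),
    AnnotatedDataset("Finnish ChatGPT", {"train": 3424}, 2919, 2491),
]


def summarize(datasets):
    """Return (name, documents, questions, answers, surplus) per dataset."""
    return [
        (d.name, d.documents, d.questions, d.answers, d.surplus_questions)
        for d in datasets
    ]


if __name__ == "__main__":
    for name, docs, questions, answers, surplus in summarize(DATASETS):
        print(f"{name}: {docs} documents, {questions} questions, "
              f"{answers} answers, {surplus} more questions than answers")
```

In all three datasets there are more question spans than answer spans, which is consistent with the datasets containing spans that do not form complete pairs.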

  4. Corpus of QA pairs retrieved from web-scale datasets

    QA pairs retrieved by the QA pair retrieval pipeline from several corpora: the Finnish Parsebank, CC-Fi and mC4-Fi, and the English Falcon-refinedWeb. After low-quality pairs were discarded, the corpus includes almost 200K retrieved pairs from 125K documents. The final pairs can be found here: https://github.com/TurkuNLP/register-qa/tree/main/token-classification/qa_predicted_final_files

The publication details will be updated later (work submitted to LREC-COLING 2024).
