Corpus pipeline description

Phase 0: Finding resources

[Project leader, Planning officer]

  • Are there ongoing projects collecting or creating resources?
  • Incoming inquiries about IPR, GDPR, data archiving
  • Incoming questions about Korp, Mylly and other tools
  • Conferences, seminars, roadshows, other collaborative events
  • Open Data, or other similar projects
  • Updates, new versions of existing resources

Phase 1: Identifying corpora

[Project leader, Planning officer]

  1. Contact the researcher or the IPR holder of the corpora. KP-1172 01.
  2. Provide information about licensing and agreements.
  3. Discuss the roles in the deposition process (IPR holder, Licensor, Distribution rights holder) and the restrictions (Availability, Referencing etc.) to determine the agreement and the license.
  4. Collect the minimum set of data about the resource (name in English and Finnish, short description, info about languages, text/speech).
  5. Help with IPR and GDPR issues if required.

Phase 2: Metadata

[Planning officer, CSC]

  1. Create an initial META-SHARE file, KP-1171 02. See the Metadata checklist for some pointers.
  2. Register a URN that points to the metadata of the corpus. KP-1246 02b; instructions: see the README.md of https://github.com/CSCfi/Kielipankki/tree/master/FIN-CLARIN-Administration NB: the URN starts working only on the day after you generate it. Note on the final slash: leave it out for Korp, include it for download.
  3. Publish the metadata file in META-SHARE.
  4. Add the metadata of the forthcoming resource to the src_new worksheet of the FIN-CLARIN-Administration/KP_Aineistot.xlsx spreadsheet (in GitHub: sync it in GitHub Desktop, commit your modification, then sync it again).
  5. Copy the content of the obj_new worksheet of the FIN-CLARIN-Administration/KP_Aineistot.xlsx to the Kielipankki portal. Instructions are in the defs (variables, quick help) worksheet of the same spreadsheet.
  6. Create or update the resource group page in the Portal. Examples:
    1. https://www.kielipankki.fi/aineistot/eduskunta/
    2. https://www.kielipankki.fi/aineistot/ylenews/
  7. If you need to make changes in the metadata in META-SHARE, document it by creating an unstructured document with the content CHANGE LOG + date + short description of the change. Previously the changes were documented in the metadata version descriptions or in the resource descriptions.
  8. When entering numbers (e.g. size), do not include punctuation or spaces.
  9. If the META-SHARE article is out of date, the links no longer point anywhere, and the contact persons either can no longer be contacted or reply that the resource no longer exists, create a tombstone page in the portal under kielipankki.fi/corpora/archive or tools/archive as appropriate. Take a screenshot of the META-SHARE article and add it to the tombstone page. For an example, see the decommissioning of WWW-Lemmie, http://urn.fi/urn:nbn:fi:lb-20140730123 , where the resource URN now points to the tombstone (the old META-SHARE article address was http://metashare.csc.fi/repository/browse/www-lemmie/aff491b8fccc11e18b49005056be118e2f69c385f23b4ad0a8042a073d009f4d/ ).
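
The rule in step 8 (sizes without punctuation or spaces) can be checked mechanically before the value is entered. The following is a minimal Python sketch, not part of the existing tooling: the function name normalize_size is hypothetical, and it assumes that sizes are whole counts (tokens, sentences, files), so every non-digit character can simply be dropped.

```python
import re


def normalize_size(raw: str) -> str:
    """Return only the digits of a size value.

    Assumption: the value is a whole count, so thousands separators
    (spaces, commas, periods) carry no meaning and can be removed.
    """
    digits = re.sub(r"\D", "", raw)
    if not digits:
        raise ValueError(f"no digits found in size value: {raw!r}")
    return digits


if __name__ == "__main__":
    for value in ("1 234 567", "1,234,567", "1.234.567"):
        print(value, "->", normalize_size(value))
```

A value with a real decimal part would be mangled by this sketch, so such values need to be handled by hand.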

Phase 3: Agreements

[Project leader, Planning officer]

3a. Define the license conditions

  • Does the material include personal data? If yes and the material cannot be completely anonymized, ask or help the Data Controller to prepare the required documentation, and be ready to take this into account in further processing of the received data.
  • Check copyright restrictions. Remember the agreement with Kopiosto.
  • If the material cannot be publicly available, is it possible to publish several versions of the material with different licenses? (e.g., restricted context vs. full text; scrambled sentences or paragraphs; anonymized transcriptions vs. original audio with annotations)

3b. Prepare the deposition agreement

  1. Reach an agreement with the IPR holder about the license of the resource. If third parties are involved, also make sure that the IPR holder has agreements with them; these should then be attached to the deposition agreement. KP-1247 05a.
  2. Edit and send a tentative deposition agreement to the IPR holder. KP-1248 05b.
  3. Once the IPR holder has accepted the deposition agreement, ask them to print two copies, sign both and send them to you by mail. KP-1249 05c.
  4. Get the signature of the head of the Department of Digital Humanities (1.1.2018 onwards) on the deposition agreement. KP-1306 / KP-1304 05d.
  5. Scan the deposition agreement & place it into IDA (FIN-CLARIN Administration/agreements).
  6. Archive the deposition agreement paper version (the binder FIN-CLARIN Tallennussopimukset).
  7. Send the other copy of the signed deposition agreement to the IPR holder by mail.

Phase 4: Retrieving corpus data

[Planning officer]

  1. Ask the IPR holder to send you the data. This may involve receiving the data as an email attachment or via Funet FileSender.
  2. Upload the data to IDA (corpora). General guidelines (for browser): https://openscience.fi/ida-browser. The data should be packaged appropriately.
  3. Define where the corpus will be available (Download/Korp).
  4. Define the priority of the corpus. KP-1302 04a.
  5. Define the preliminary workload on each step required for the publication of the corpus. KP-1239.
  6. Define the initial publication schedule based on the priority and the workload. KP-1303 04b.
  7. Inform the FIN-CLARIN team for speech / text corpora to start the conversion for Download/Korp.

Phase 5-1: Resource conversion for Download

KP-1307, [FIN-CLARIN speech/text corpora teams]

Instructions in detail: https://www.kielipankki.fi/development/corpus-data-publication-for-download-at-the-language-bank/

  1. Check the license. (Is download allowed? If so, is it PUB, ACA or RES?)
  2. Define the format of the data to be published in Download. Typical options:
    1. WAV, EAF (from ELAN)
    2. VRT (from Korp)
    3. TXT, PDF (raw formats)
  3. Define the shortname; see the naming conventions.
  4. Create the metadata in META-SHARE.
  5. Create README.txt (refer to the license and include the URN pointing to META-SHARE).
  6. Create the zip file. Use the shortname (without ”-dl”) as the name of both the zip file and its internal top-level directory. Structure:
    1. short-name.zip:
      1. short-name/README.txt
      2. short-name/short-name/data… (including possible sub directories)
  7. Prepare the upload on Puhti.
    1. mkdir /scratch/clarin/download_preview/<short-name>
    2. Prepare the directory as it should look in Download:
      1. the zip file as created above
      2. README.txt/license.txt as contained in the zip file.
  8. Upload the data.
    1. This requires root rights on korp.csc.fi. A detailed technical description is so far available only in CSC’s intranet (”Download” service).
  9. Check uploaded data
    1. https://kielipankki.fi/download/
      1. ”name” is ”short-name” (or as agreed)
      2. Description has the correct name (possibly slightly shortened) and links via URN to META-SHARE.
      3. ”name” links to subdirectory:
        1. subdirectory contains zip files as agreed (for ACA often license acceptance pages that need to be approved before download)
        2. subdirectory has uncompressed README.txt and (sometimes) separate license.txt information from within zip file.
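
The packaging in step 6 could be scripted roughly as follows. This is a minimal Python sketch, not the team's actual tooling: the function names package_name and build_package are hypothetical, and the sketch follows the structure listed in step 6 literally (README.txt directly under the top-level directory, the data under a second directory with the same short name).

```python
import zipfile
from pathlib import Path


def package_name(shortname: str) -> str:
    """Drop a trailing '-dl' so the zip file and its top-level directory share one name."""
    return shortname[:-3] if shortname.endswith("-dl") else shortname


def build_package(shortname: str, readme: Path, data_dir: Path, out_dir: Path) -> Path:
    """Write <name>.zip into out_dir and return its path.

    Layout inside the archive (assumed from the structure in step 6):
      <name>/README.txt
      <name>/<name>/<data files, including subdirectories>
    """
    name = package_name(shortname)
    zip_path = out_dir / f"{name}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(readme, f"{name}/README.txt")
        for path in sorted(data_dir.rglob("*")):
            if path.is_file():
                relative = path.relative_to(data_dir).as_posix()
                archive.write(path, f"{name}/{name}/{relative}")
    return zip_path
```

For example, build_package("my-corpus-dl", Path("README.txt"), Path("data"), Path("out")) would write out/my-corpus.zip, provided the out directory exists. Keep out_dir outside data_dir so that the archive does not try to include itself.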

Phase 5-2: Resource conversion for Korp

KP-1309 12. [FIN-CLARIN text corpora team]

  1. Decide if the corpus should be split into subcorpora.
  2. Decide the identifier of the corpus and its possible subcorpora. See naming conventions.
  3. Preprocess the data before converting to VRT. This may involve OCR’ing PDF files to text, converting the character encoding or fixing apparent errors.
  4. Convert the data to the VRT format used as Korp corpus input format. If a script exists for the same or a similar format in the conversion script repository, preferably use it as such or modified, but writing custom scripts for the input format may also be required.
  5. Validate the VRT data and otherwise verify its correctness with the validator.
  6. Parse the VRT data and recognize named entities: run the parser and the named-entity recognizer on the VRT data (if the tools exist for the language of the corpus). This adds morphosyntactic, part-of-speech and named-entity annotations, and currently applies only to corpora in (standard) Finnish.
  7. Run korp-make (see the scripts in GitHub) on the (parsed and NER-tagged) VRT data to make a corpus package. This command encodes the VRT data into the CWB database format and creates a Korp package for the corpus.
  8. Optional: For parallel corpora, run korp-make for each aligned language but do not package them; add alignment information. Package all the languages to a single package (korp-make-package.sh).
  9. Add corpus configuration to the Korp frontend. This consists of making changes to the Korp configuration and translation files on your own branch of the Korp frontend repository and committing the changes:
    1. Add the corpus configuration to Korp’s configuration file (config.js, modes/modename_mode.js).
    2. Add translations of corpus attribute names and values to the translation files (translations/corpora-{fi,en,sv}.js).
    3. Commit the configuration changes to the korp-frontend repository in GitHub.
  10. Install the corpus package and configuration. At this stage, the corpus configuration should be installed on a separate test instance of the Korp frontend. If you do not have access to the Korp server, ask someone who has the rights to do it.
  11. Test the corpus in Korp. Check that the corpus shows up and works as expected in the Korp test instance.
  12. Inform others of the corpus and request feedback. You should inform at least fin-clarin (at) helsinki.fi and the original corpus owner or compiler if applicable. If you get feedback, you may need to redo some of the previous steps.
  13. Install the corpus configuration to the production Korp, once the corpus works in Korp as desired. Install the corpus package (korp-install-corpora.sh). Install the changes to the Korp configuration from the GitHub repository (korp-install.sh). Again, you may need the help of someone with the appropriate rights.
  14. Upload the corpus package to the IDA storage service.
  15. Add a piece of news on the corpus to Korp’s newsdesk.
  16. Organize a test group among the Language Bank project and have them test the data and the information about it in Korp. Include the data depositor in the test group.
  17. A beta phase is recommended for all corpora, especially for those whose data depositor is interested in testing the data. A typical beta period lasts two weeks, but the depositor can ask for a different length. During the beta period, the access location is available in META-SHARE and the corpus is published in the portal clearly marked as ”beta”.

Description of preprocessing: See the documentation in GitHub
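
To illustrate what the conversion in step 4 produces, here is a minimal Python sketch of VRT output: structural elements as XML-like tags on their own lines and one token per line. It is not one of the scripts in the conversion script repository. The function name to_vrt and the element and attribute names (text, sentence, title, id) are assumptions chosen for illustration, and only a single positional attribute (the word form) is written.

```python
from xml.sax.saxutils import escape


def to_vrt(title: str, sentences: list[list[str]]) -> str:
    """Render tokenized sentences as a VRT string.

    Characters that are special in XML (&, <, >) are escaped in tokens;
    double quotes are additionally escaped in the attribute value.
    """
    safe_title = escape(title, {'"': "&quot;"})
    lines = [f'<text title="{safe_title}">']
    for number, tokens in enumerate(sentences, start=1):
        lines.append(f'<sentence id="{number}">')
        lines.extend(escape(token) for token in tokens)
        lines.append("</sentence>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_vrt("Example", [["Hei", "maailma", "!"]]), end="")
```

Further positional attributes, such as those added by the parser in step 6, would appear as extra tab-separated columns on each token line.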

Phase 6: URN for Publication and internal records

[CSC, Planning officer]

  1. Generate a URN for the location of the resource (same as in phase 2 step 2).
  2. Add the URN to the Url field of the META-SHARE metadata file of the resource.
  3. Update the resource’s META-SHARE metadata file accordingly. Make sure that the link to the attribution details of the resource is also added to the metadata file. IMPORTANT: if the resource has an ACA or a RES license, you have to
    a) create license pages for it in the Kielipankki portal in both Finnish and English, and add links to these pages in the Documentation section of the META-SHARE metadata file (KP-1316 08);
    b) add the resource to LBR (KP-1317 09);
    c) move on to the next step only after the resource has been added to LBR.
  4. Update the KP_Aineistot.xlsx accordingly (cut info from the src_new worksheet & paste it to src_prod, then update the relevant info). Make sure to delete the row of the resource from the obj_new worksheet. KP-1174 04c.
  5. Copy the content of the obj_prod worksheet to the Kielipankki portal. Select Tuo/Import: Syötä käsin/Import manually, CSV, Korvaa/Replace. Instructions are in the defs (variables, quick help) worksheet of the same spreadsheet. Sort the table in alphabetical order based on the ID of the corpus. (Preview does not give the right result, so for testing the best method is to re-create the Test table and publish it.) KP-1174 04c.
  6. Consider creating a resource info page (tietosivu) or resource family page in the portal, with links to META-SHARE and KP_Aineistot.
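
Phase 2 step 2 notes that the final slash is left out for Korp and included for download. Assuming this refers to the address that the URN points to, the rule can be applied with a small helper before the URN is registered. This is a minimal Python sketch; the function name normalize_target and the service labels "korp" and "download" are hypothetical.

```python
def normalize_target(url: str, service: str) -> str:
    """Apply the final-slash rule to a URN target address.

    Assumed rule: download addresses end with a slash, Korp addresses do not.
    """
    stripped = url.rstrip("/")
    if service == "download":
        return stripped + "/"
    if service == "korp":
        return stripped
    raise ValueError(f"unknown service: {service!r}")


if __name__ == "__main__":
    # example.org addresses are placeholders, not real resource locations.
    print(normalize_target("https://example.org/download/short-name", "download"))
    print(normalize_target("https://example.org/korp/", "korp"))
```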

Phase 7: Information dissemination on publication and use

[Planning officer, FIN-CLARIN speech/text corpora teams]

  1. Publish the news in Kielipankki’s portal both in Finnish and in English.
  2. Inform the IPR holder that the resource has been published.
  3. Publish the news in the next Kielipankki newsletter.
  4. Publish user profiles in the series Researcher of the Month.
  5. Provide user statistics to the IPR holder.

Phase 8: Updating resources

[Planning officer, FIN-CLARIN speech/text corpora teams, CSC]

  1. Record the feedback from the users and identify needs for updating.
  2. Comply with the version management guidelines.
  3. Create plans for updating resources: which resources will be updated at regular intervals?