Sensitive Data: EOSC Pilot

Present: Satu Saalasti, Mietta Lennes (HY), Harri Hirvonsalo (CSC, via Zoom), João da Silva (CSC), Martin Matthiesen (CSC, notes)

Time, place: 30.10.2018 12-15:30 CSC, Espoo

 

Thematic summary

In this meeting we tried to look at the issue from all known angles. This memo identifies the main issues and records the state of the discussion.

The research data (Satu)

The data will consist of video recordings of children aged 7-10. The children will have speech impairments and will be recorded in three situations:

  1. Initial assessment (before treatment)
  2. Intervention (during treatment)
  3. Final assessment (how well did the treatment work?)

In all stages the children will be producing preselected utterances.

In addition to videos showing the children’s faces, there will also be ultrasound recordings of tongue movements. The data will be used to assess the effectiveness of the treatment method.

Data re-use can happen for two main reasons:

  1. Reproducibility of Satu’s research
  2. Other research

While the goal is to allow for both, it might be easier to define a process that allows the reproducibility of existing research.

Decisions

  • We decided to keep both use cases in mind but to prioritise reproducibility if a choice has to be made.
  • Data collection is planned for spring 2019.
  • By January 2019, Satu will provide mock-up data that better shows what to expect.

The ethics committee

Before data can be collected, Satu needs to get approval from an ethics committee. Since the final implementation of the data management is not yet known, the ethics committee will not be able to decide on its appropriateness. Without a decision, data cannot be collected.

Our approach needs to be specific enough to get preliminary approval, but general enough to keep flexibility in the implementation.

Decisions

We decided to seek preliminary approval for the data management by describing the planned system and requesting approval to use it, provided that we implement it as planned.

Metadata

We discussed whether the system should be able to offer sensitive metadata under certain conditions. Descriptive metadata as shown in B2SHARE or META-SHARE should be public. Sensitive descriptions of the dataset can be moved to be part of the dataset itself. Rationale: Users must be able to search the metadata to assess whether the dataset is useful for their needs. Such descriptive metadata should never need to contain sensitive information.

Decisions

  • We decided that descriptive metadata as shown in B2SHARE or META-SHARE must be public.
  • Sensitive descriptions of the dataset shall be moved to be part of the dataset itself.

Existing solutions

B2SHARE/EUDAT

Harri showed B2SHARE, EUDAT’s self-depositing repository. The existing B2SHARE instance will not be used for this pilot, but the underlying software (with modifications) will. Sensitive data can be processed in two places: CSC’s ePouta (a secure cloud) and TSD in Oslo, Norway (also a secure cloud). TSD also offers download of the data; ePouta does not. B2SHARE supports OAI-PMH, so metadata export is possible.
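As a rough illustration of what metadata export over OAI-PMH involves, the sketch below builds a harvesting request URL. The verb ListRecords and the metadata prefix oai_dc come from the OAI-PMH standard. The base URL and the set name are hypothetical placeholders and not the endpoints of B2SHARE or of the pilot system.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a given endpoint."""
    # 'verb' and 'metadataPrefix' are required arguments of ListRecords.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # 'set' is optional and restricts harvesting to one collection.
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, for illustration only.
print(build_list_records_url("https://repository.example.org/oai2d"))
```

Only public descriptive metadata would be exposed this way; sensitive descriptions stay inside the dataset, as decided above.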

The Language Bank

Martin showed an example from the Language Bank (ELFA) and the application and approval process in the REMS2-based Language Bank Rights (LBR). So far the Language Bank has a simple approval process: a good reason to use the restricted resources is enough. Data is shown in Korp and/or made available in the Download service. The Language Bank supports OAI-PMH export, but not yet import.

Decisions

No decision was made on the level of integration of the solution with the Language Bank. Because of the nature of the data, integration can be low.

The approval process

We discussed the final approval process, assuming that we have all the other parts in place. Who would approve access? Satu? The ethics committee? A research group?

While LBR can accommodate more complex processes, it was unclear what this process would look like.

Decisions

Satu will discuss this issue with her research group.

Re-using the data

If a user has the right to process the data, there are two basic use cases. Both approaches have advantages and disadvantages, summarized below.

The data stays in the secure location and the user processes it there.

Advantages

  • Copying can be made difficult.
  • Accidental disclosure of data is almost impossible.
  • If connected to HPC: Scalable processing resources are available.
  • No copying of possibly large amounts of data is needed.

Disadvantages

  • Remote desktop requires an internet connection.
  • Risk of lag (bad for video processing).
  • Maintenance of a secure software stack is a challenge.
  • Hardware is possibly not adequate (e.g. missing/wrong GPUs).
  • If the environment is too hard to use because of software/hardware limitations, users will look for ways to circumvent security. That threat is real; consider the ForMin data leak.

The user downloads the data and processes it in her own secure location.

Advantages

  • Resourcing and software stack are under the user’s control.
  • Datacenter does not need to provide computing resources.
  • Data can be used offline after download.

Disadvantages

  • Requires more trust. A copy of the data has been made.
  • Requires copying of possibly large amounts of data.

At the moment CSC’s ePouta offers Remote Desktop access, and TSD offers shell access and download (and possibly Remote Desktop as well; Harri will check). TSD is not a real option at this point, since there are no plans to store the sensitive data outside of Finland.

Decisions

We keep both options in mind for now. If we need to prioritize, we prioritize the first option, “ePouta/Remote Desktop”.

Next steps

  • Satu to provide mock-up data by January 2019.
  • Harri, João, Martin to meet in mid-November to discuss technical options.