The Allas data storage of the Language Bank

Introduction

Allas is a new data storage service based on the object storage architecture. Data is stored as objects in so-called ”buckets”. Buckets are somewhat similar to directories, but their names have to be unique across the whole system. In other words, there can only be one bucket named ”test” in Allas.

In Allas, there are only buckets and objects. Directories are simulated by the Swift tools, but the path is really part of the object name. Thus, ”suomi-24/src/comment-1.txt” and ”suomi-24/src/comment-2.txt” are two objects named in a way that Swift and other tools can group together, but there is no such thing as a ”suomi-24” directory.
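The following minimal Python sketch illustrates how a tool can present a flat list of object names as if it were a directory tree. It is not how Swift is implemented; the function name list_level and the object ”README.txt” are assumptions made up for the illustration.

```python
def list_level(object_names, prefix="", delimiter="/"):
    """Return (pseudo-directories, objects) found directly under `prefix`."""
    dirs, objects = set(), []
    for name in object_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # The name continues past a delimiter: show it as a "directory".
            dirs.add(prefix + head + delimiter)
        else:
            # No further delimiter: this is an object at the current level.
            objects.append(name)
    return sorted(dirs), sorted(objects)


# Hypothetical contents of one bucket: just a flat list of object names.
names = [
    "suomi-24/src/comment-1.txt",
    "suomi-24/src/comment-2.txt",
    "README.txt",
]

print(list_level(names))
print(list_level(names, prefix="suomi-24/src/"))
```

The ”suomi-24/” entry in the first listing exists only because two object names happen to start with that prefix. If those objects are deleted, the ”directory” disappears with them.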

Allas also does not, by default, record who uploads the data, and it does not support fine-grained access control. All buckets associated with a project (in our case, ”clarin”) can be deleted by anyone with write access. Buckets are private by default but can be made publicly available by any group member.

Data management

Puhti no longer provides permanent work storage, which forces us to rethink data management. We will need to separate valuable primary data, tools and generated data more clearly.

Quick start and conventions

To get started, run the following commands in CSC’s computing environment:

  • module load allas
  • allas-conf clarin
    • (type your CSC password)
  • Check the connection with a-list

Remember the following conventions:

  • Prefix bucket names with the project name and a dash, in our case ”clarin-”.
  • Use Swift tools or a-tools to access Allas. Do not use the S3 tools, as they are incompatible with Swift.

Swift tools

The Swift tools are intended for more advanced use of Allas. They allow you to upload and download directories and individual files.

A-tools

CSC has created a set of tools to make Allas access a bit easier: a-tools. Tools such as ”a-put” and ”a-get” can convert entire directories into compressed tar archives and put/get them to/from Allas. The tools are basically wrappers around Swift and other tools such as Rclone.
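The packing step can be pictured with Python’s standard tarfile module. This is only a sketch of the idea of turning a directory into a single compressed archive; it does not reproduce what a-put actually does. The gzip compression and the function name pack_directory are assumptions chosen for the illustration.

```python
import tarfile
from pathlib import Path


def pack_directory(directory, archive_path):
    """Pack `directory` into a gzip-compressed tar archive; return its member names."""
    directory = Path(directory)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores paths relative to the directory's own name,
        # so the archive does not contain the full local path.
        tar.add(str(directory), arcname=directory.name)
    # Re-open the archive to list what was stored in it.
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return sorted(tar.getnames())
```

Packing a directory this way yields one file, which would then be stored as one object instead of one object per file.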

Allas vs. IDA vs. Puhti’s file system

When should you use Allas, when IDA and when Puhti’s file system, like /scratch/clarin or /projappl/clarin?

Puhti’s file system has no permanent large data storage. The only permanent directory apart from your home directory is /projappl/clarin, which has a limited quota and is intended for shared software. /scratch/clarin/ is intended for data you are currently working on; old data is removed from it regularly.

IDA

IDA is for data that cannot be easily recreated or reacquired from other sources, such as raw language data from depositors.

Allas

The use cases for Allas are still developing. Objects in Allas can be served in a massively parallel manner, which should make massively parallel processing of data easier in the future. The most likely use cases are:

  • Backup of data from a CSC server that is not a clear case for IDA.
  • Backup of intermediate data created on Puhti for later reference.

Puhti

/scratch/clarin/ (1 TB)

The work directory. Data is removed regularly (see CSC’s documentation).

/projappl/clarin/ (50 GB)

Space for tools that we want to share across the project but not (yet) provide to all users. The tools should be version-controlled.

Home directory (10 GB)

Personal files and tools.

Bucket naming

Especially when you use Swift, you should consider how to organize your buckets. Swift has simple commands for uploading and downloading entire buckets, so the best structure depends on your use case.

Say you have two corpora, named ”Suomi24” and ”Ylilauta”. In both cases, there are the subdirectories ”src” and ”vrt”. You can now create one bucket named ”clarin-corpora” with objects named ”Suomi24/src/…” (and obviously ”Suomi24/vrt/…”) and ”Ylilauta/src/…” and so on, or two buckets ”clarin-corpora-Suomi24” and ”clarin-corpora-Ylilauta” with objects in them named ”src/…” and ”vrt/…”. (The ”corpora” part would be a convention to give a hint about the content.)
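The two layouts described above can be compared with a small Python sketch that maps a local relative path to a bucket name and an object name. The function names and the file names ”a.txt” and ”b.vrt” are hypothetical; the ”clarin-” prefix and the ”corpora” part follow the conventions in this document.

```python
def one_bucket(path, project="clarin", category="corpora"):
    """Layout 1: a single shared bucket; the corpus name stays in the object name."""
    return f"{project}-{category}", path


def bucket_per_corpus(path, project="clarin", category="corpora"):
    """Layout 2: one bucket per corpus; the corpus name moves into the bucket name."""
    corpus, _, rest = path.partition("/")
    return f"{project}-{category}-{corpus}", rest


# Hypothetical local files, given relative to the corpus root directory.
for path in ["Suomi24/src/a.txt", "Ylilauta/vrt/b.vrt"]:
    print(one_bucket(path), bucket_per_corpus(path))
```

With the first layout, downloading the whole bucket fetches every corpus at once. With the second, each corpus can be fetched or deleted as a unit.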

One consideration is size: Suomi24-sized corpora should probably each have a bucket of their own, whereas several small corpora can be grouped together in one bucket.
