Archiving of directory trees

Introduction

This document describes how to create tar or zip files with safeguards against data corruption and how to test that a file in the archive is indeed identical to the file on disk. This document assumes that the reader has a basic knowledge of tar, zip, bzip2 and gzip.

What to use when

For Language Bank internal use (e.g. snapshots of data in IDA), tar with bzip2 compression is recommended, since bzip2 compresses better than gzip. If the package is intended for the general public (e.g. to be put in the download service), use zip, since tar cannot be assumed to be available on Windows. Use zip even if the downloaded package is to be used on Mac/Linux only, since users might want to inspect the package on Windows first.

Archiving using tar

Compressing

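First create (c) an uncompressed tar file (f), using verify (W) and showing verbose output (v). The archive is left uncompressed at this stage because GNU tar cannot verify compressed archives: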

$ tar cfWv package.tar directory/

You should get no errors.

Then compress using bzip2:

$ bzip2 package.tar

This replaces package.tar with package.tar.bz2, which you can then copy (e.g. to IDA).
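
The two steps above can be sketched as one shell function. This is only a minimal sketch, assuming GNU tar (for the W option) and bzip2 are installed; the function name make_verified_package is hypothetical, and the verbose flag (v) is left out so that the only output is the name of the finished package.

```shell
# make_verified_package is a hypothetical helper name, not an existing tool.
# Usage: make_verified_package package.tar directory/
# Run it from the parent of the directory, so tar can find the originals
# again when it verifies the archive.
make_verified_package() {
    archive=$1
    directory=$2

    # Create (c) the uncompressed archive (f) and verify (W) it against the
    # files on disk. Stop on any error so a bad archive is never compressed.
    tar cfW "$archive" "$directory" || return 1

    # Compress; bzip2 replaces package.tar with package.tar.bz2.
    bzip2 "$archive" || return 1

    echo "$archive.bz2"
}
```

Stopping at the first failing step means that a package.tar.bz2 only appears if the verification passed.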

Checking/Uncompressing

If you doubt the integrity of the package, list (t) its contents, which makes tar read through the whole archive:

$ tar tfjv package.tar.bz2

Unpack using:

$ tar xfjv package.tar.bz2
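
As a possible alternative to listing the archive, bzip2 can test the compressed file directly without printing the contents. The sketch below assumes a standard bzip2 installation; check_bz2 is a hypothetical helper name. It checks only the compressed data, not whether the files match those on disk.

```shell
# check_bz2 is a hypothetical helper name.
# Usage: check_bz2 package.tar.bz2
check_bz2() {
    # bzip2 -t decompresses in memory and checks the stored checksums;
    # nothing is written to disk.
    if bzip2 -t "$1" 2>/dev/null; then
        echo "$1: compressed data is intact"
    else
        echo "$1: integrity check FAILED" >&2
        return 1
    fi
}
```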

Archiving using zip

Compressing

Archive recursively:

$ zip -r package.zip directory/

Note that zip uses the encoding of the underlying filesystem by default. CSC’s computing environment uses the desired UTF-8, and most recent Linux versions should as well. On Windows and macOS you are likely to run into problems.

At least on the command line you can try the following to force UTF-8 encoding:

$ zip -r -UN=UTF8 package.zip directory/

7-Zip on Windows does not seem to work with UTF-8. The bottom line: compress on Linux, e.g. in CSC’s computing environment, and avoid umlauts in file names.

Checking/Uncompressing

To check the integrity:

$ unzip -t package.zip

To check an individual file, compare the CRC-32 value listed by unzip with the one computed from the file on disk:

$ unzip -v package.zip filename
$ crc32 filename

To uncompress:

$ unzip package.zip

Missing tools

There is no tool to recursively compare compressed tar or zip archives with a filesystem.
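
For tar archives, GNU tar's compare mode (d, also spelled --compare or --diff) may cover part of this gap: it reads the archive and reports members that differ from the files on disk. The sketch below assumes GNU tar and must be run from the directory in which the archive was created, so that the member paths resolve; compare_package is a hypothetical helper name. Files that exist on disk but not in the archive are not reported, and metadata differences such as a changed modification time also count as differences.

```shell
# compare_package is a hypothetical helper name.
# Usage: compare_package package.tar.bz2   (run where "directory/" lives)
compare_package() {
    # d = compare archive members with the filesystem, f = archive file,
    # j = bzip2. GNU tar exits with a non-zero status if any member
    # differs from, or is missing on, the disk.
    if tar dfj "$1"; then
        echo "archive matches the files on disk"
    else
        echo "differences found" >&2
        return 1
    fi
}
```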

Tar does not natively store checksums; a workaround is described at https://serverfault.com/questions/120582/creating-a-tar-file-with-checksums-included.
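
One simple variant of such a workaround is to store a checksum manifest inside the directory before archiving it. The following is a minimal sketch and not necessarily the method from the linked answer; it assumes GNU coreutils (sha256sum) and uses hypothetical names for the helpers and for the manifest file, SHA256SUMS.

```shell
# write_manifest and check_manifest are hypothetical helper names.

# Before archiving: record a SHA-256 checksum for every file under the
# directory. The manifest is stored in the directory itself, so it
# travels inside the package.
write_manifest() {
    (
        cd "$1" || exit 1
        find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS
    )
}

# After unpacking: recompute the checksums and compare them with the
# manifest. Only mismatches are printed; the exit status is non-zero
# if any file differs or is missing.
check_manifest() {
    (
        cd "$1" || exit 1
        sha256sum --quiet -c SHA256SUMS
    )
}
```

Usage would be write_manifest directory before running tar, and check_manifest directory after unpacking.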

Zip would be fine, since it stores a CRC-32 checksum for each file, but it has no method to quickly compare a zip archive against a directory structure.
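
A loop over the archive members can approximate such a comparison. This is a rough sketch, assuming Info-ZIP's unzip and cmp are available; compare_zip is a hypothetical helper name. It compares file contents only, checks only files that are present in the archive, and may be confused by member names containing wildcard characters such as * or [.

```shell
# compare_zip is a hypothetical helper name.
# Usage: compare_zip package.zip   (run where "directory/" lives)
compare_zip() {
    differing=$(
        # unzip -Z1 lists one member name per line.
        unzip -Z1 "$1" | while IFS= read -r member; do
            # Skip directory entries, which end in a slash.
            case $member in */) continue ;; esac
            # unzip -p writes the member to stdout; cmp compares that
            # stream with the copy on disk (and fails if it is missing).
            if ! unzip -p "$1" "$member" </dev/null | cmp -s - "$member"; then
                echo "$member"
            fi
        done
    )
    if [ -n "$differing" ]; then
        echo "Files that differ from the archive or are missing on disk:"
        echo "$differing"
        return 1
    fi
    echo "all archived files match the files on disk"
}
```

Because every member is decompressed in full, this is slower than a CRC-based check would be.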
