Archiving of directory trees


This document describes how to create tar or zip files with safeguards against data corruption and how to test that a file in the archive is indeed identical to the file on disk. This document assumes that the reader has a basic knowledge of tar, zip, bzip2 and gzip.

What to use when

For Language Bank internal use (e.g. snapshots of data in IDA), tar with bzip2 compression is recommended, since bzip2 compresses better than gzip. If the package is intended for the general public (e.g. to be put in the download service), use zip, since tar may not be available on Windows. Use zip even if the downloaded package is to be used on Mac/Linux only, since users might want to inspect the package on Windows first.

Archiving using tar


First create (c) an uncompressed tar file (f), using verify (W) and showing verbose output (v). The archive has to be uncompressed at this stage because tar cannot verify compressed archives.

$ tar cfWv package.tar directory/

You should get no errors.

Then compress using bzip2:

$ bzip2 package.tar

This produces package.tar.bz2, which you can then copy elsewhere (e.g. to IDA).
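
To confirm that a copy of the package arrived intact, a checksum can be recorded before copying and checked afterwards. The following is only a minimal sketch, assuming GNU coreutils' sha256sum is installed; the helper names record_checksum and verify_checksum are made up for this example.

```shell
# Hypothetical helpers (not part of tar or bzip2), assuming sha256sum from GNU coreutils.

record_checksum() {
    # $1 = package file name, e.g. package.tar.bz2
    # Writes the checksum to a sidecar file, e.g. package.tar.bz2.sha256
    sha256sum "$1" > "$1.sha256"
}

verify_checksum() {
    # Run in the directory that holds both the copied package and its .sha256 file.
    # sha256sum -c exits with a non-zero status if the contents no longer match.
    sha256sum -c "$1.sha256"
}
```

Run record_checksum package.tar.bz2 in the directory holding the package, copy package.tar.bz2.sha256 along with the package, and run verify_checksum package.tar.bz2 in the destination directory.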


If you doubt the integrity of the package, list (t) its contents; to do so, tar has to read through the whole bzip2-compressed (j) archive:

$ tar tfjv package.tar.bz2
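
In a script, the exit status is easier to act on than a long listing. The following is a minimal sketch, assuming GNU tar and bzip2 are installed; check_package is a hypothetical helper name, not an existing command.

```shell
# Hypothetical helper: succeed only if the compressed package can be read from start to end.
check_package() {
    # $1 = package file, e.g. package.tar.bz2
    # bzip2 -t performs a trial decompression and checks the CRCs stored in the bzip2 stream.
    bzip2 -t "$1" || return 1
    # Listing makes tar parse every member header; the listing itself is discarded.
    tar tjf "$1" > /dev/null
}
```

For example, check_package package.tar.bz2 && echo "package looks intact" prints the message only when both checks pass.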

Unpack (x) using:

$ tar xfjv package.tar.bz2

Archiving using zip


Archive recursively:

$ zip -r package.zip directory/

Note that zip uses the encoding of the underlying filesystem by default. CSC’s computing environment uses the desired UTF-8, and most recent Linux versions should as well. You may run into problems on Windows and macOS.

At least on the command line, you can try to force UTF-8 encoding:

$ zip -r -UN=UTF8 package.zip directory/

7-Zip on Windows does not seem to work with UTF-8. The bottom line: compress on Linux, e.g. in CSC’s computing environment, and avoid umlauts in file names.
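
Before archiving, it can help to list the file names that might cause such problems. The following is a minimal sketch, assuming a POSIX-style find; list_non_ascii_names is a hypothetical helper name.

```shell
# Hypothetical helper: list paths whose names contain anything other than printable ASCII.
list_non_ascii_names() {
    # $1 = directory to scan
    # LC_ALL=C makes find treat names as raw bytes, so every byte of a UTF-8 umlaut
    # falls outside the range from space to tilde and the name is reported.
    LC_ALL=C find "$1" -name '*[! -~]*'
}
```

An empty result from list_non_ascii_names directory/ suggests that all names are plain ASCII. Names containing control characters are reported as well.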


To check the integrity:

$ unzip -t package.zip

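To check individual files, compare the CRC-32 that unzip lists for each archived file with the CRC-32 of the corresponding file on disk (filename):

$ unzip -v package.zip
$ crc32 filename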

To uncompress:

$ unzip package.zip


Missing tools

There is no tool to recursively compare compressed tar or zip archives with a filesystem.
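
For tar archives, GNU tar's compare mode (--compare, also spelled --diff or -d) may narrow this gap somewhat: it reports archive members that differ from the files on disk. It does not report files that exist on disk but are missing from the archive. The following is a minimal sketch, assuming GNU tar, which normally detects the compression by itself when reading an archive; compare_tar_with_disk is a hypothetical helper name.

```shell
# Hypothetical helper: compare every member of a tar archive with the files on disk.
# Run it in the directory from which the archive was created.
compare_tar_with_disk() {
    # $1 = archive, e.g. package.tar or package.tar.bz2
    # tar --compare exits with a non-zero status if any member differs or is missing on disk.
    if tar --compare --file="$1"; then
        echo "all archive members match the files on disk"
    else
        echo "differences found or tar failed, see the messages above" >&2
        return 1
    fi
}
```

Files added to the directory after archiving go unnoticed, so this check only covers one direction of the comparison.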

Tar does not natively store checksums of file contents; here’s a workaround:
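
One possible workaround, sketched here under assumptions and not necessarily the one originally intended, is to write a checksum manifest next to the directory and put both into the archive. The sketch assumes GNU coreutils and findutils (sha256sum, sort -z, xargs -0 -r); the helper names write_manifest and verify_manifest are hypothetical.

```shell
# Hypothetical helpers for a per-file checksum manifest stored alongside the data.

write_manifest() {
    # $1 = directory to archive, $2 = manifest file to create (keep it outside $1)
    # Sorting keeps the manifest stable between runs; -print0/-z/-0 cope with unusual file names.
    find "$1" -type f -print0 | sort -z | xargs -0 -r sha256sum > "$2"
}

verify_manifest() {
    # $1 = manifest file; run in the directory into which the archive was unpacked.
    # --quiet prints only the files that fail; the exit status is non-zero on any mismatch.
    sha256sum --quiet -c "$1"
}
```

A possible sequence is write_manifest directory directory.sha256, then tar cfWv package.tar directory.sha256 directory/, and after unpacking verify_manifest directory.sha256 in the directory that contains the unpacked files.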

Zip would be fine, but it has no method to quickly compare a zip archive against a directory structure.
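
A slow but simple substitute is to unpack the archive into a scratch directory and compare the two trees with diff -r. The following is a minimal sketch, assuming unzip, diff and mktemp are available and that the archive was created with zip -r package.zip directory/ from the parent directory; compare_zip_with_dir is a hypothetical helper name.

```shell
# Hypothetical helper: unpack a zip archive into a scratch directory and diff it against the original tree.
# Assumes the archive members start with the directory's own name, e.g. "directory/...".
compare_zip_with_dir() {
    # $1 = zip archive, $2 = directory the archive was created from
    scratch=$(mktemp -d) || return 2
    if unzip -q "$1" -d "$scratch"; then
        # diff -r exits with 0 only if both trees hold the same files with the same contents.
        diff -r "$scratch/$(basename "$2")" "$2"
        status=$?
    else
        status=2
    fi
    rm -rf "$scratch"
    return "$status"
}
```

For example, compare_zip_with_dir package.zip directory/ && echo identical prints the message only when nothing differs. The check needs enough free disk space for a full unpacked copy.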
