This document describes how to create tar or zip files with safeguards against data corruption and how to test that a file in the archive is indeed identical to the file on disk. This document assumes that the reader has a basic knowledge of tar, zip, bzip2 and gzip.
For Language Bank internal use (e.g. snapshots of data in IDA), tar with bzip2 compression is recommended, since bzip2 compresses better than gzip. If the package is intended for the general public (e.g. to be put in the download service), use zip, since tar is not reliably available on Windows. Use zip even if the downloaded package is to be used on Mac/Linux only, since users might want to inspect the package on Windows first.
First create (c) an uncompressed tar file (f) using verify (W) and showing verbose output (v). The archive is left uncompressed at this stage because tar cannot verify compressed archives:
$ tar cfWv package.tar directory/
You should get no errors.
Then compress using bzip2:
$ bzip2 package.tar
You get package.tar.bz2.
You can copy this package (e.g. to IDA).
If you doubt the integrity of the package, list (t) its contents:
$ tar tfjv package.tar.bz2
Unpack using
$ tar xfjv package.tar.bz2
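Where tar itself is not at hand, the readability check above can be approximated in Python. This is a minimal sketch using only the standard library; the function name tar_bz2_is_readable is made up for this example. It reads every regular file in the archive to the end, so it shows only that the archive decompresses cleanly, not that its contents match the files on disk.

```python
import tarfile


def tar_bz2_is_readable(archive_path):
    """Return True if every regular file in the archive can be read to the end."""
    try:
        with tarfile.open(archive_path, "r:bz2") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                # Reading the whole member forces full decompression, which is
                # where a damaged or truncated bzip2 stream raises an error.
                stream = archive.extractfile(member)
                while stream.read(1024 * 1024):
                    pass
        return True
    except (tarfile.TarError, OSError, EOFError):
        return False


if __name__ == "__main__":
    print(tar_bz2_is_readable("package.tar.bz2"))
```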
To create a zip file, add the directory recursively (r):
$ zip -r package.zip directory/
Note that zip uses the encoding of the underlying filesystem by default. CSC’s computing environment uses the desired UTF-8, and most recent Linux versions should as well. You may run into problems with file names on Windows and macOS.
At least on the command line you can try to force UTF-8 encoding:
$ zip -r -UN=UTF8 package.zip directory/
7-Zip on Windows does not seem to work with UTF-8. The bottom line: compress on Linux, e.g. in CSC’s computing environment, and avoid umlauts in file names.
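To find file names that may cause such trouble before packaging, a directory can be scanned for non-ASCII characters. The following is a small sketch under the assumption that "avoid umlauts" means keeping all file and directory names pure ASCII; the function name non_ascii_names is hypothetical.

```python
import os


def non_ascii_names(directory):
    """Return relative paths under `directory` whose own name is not pure ASCII."""
    offenders = []
    for current, dirnames, filenames in os.walk(directory):
        for name in dirnames + filenames:
            if not name.isascii():
                full_path = os.path.join(current, name)
                offenders.append(os.path.relpath(full_path, directory))
    return sorted(offenders)


if __name__ == "__main__":
    for path in non_ascii_names("directory"):
        print(path)
```

An empty result means that every name in the tree is plain ASCII and should survive a round trip through zip on any platform.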
To check the integrity:
$ unzip -t package.zip
To check an individual file, compare the CRC-32 that unzip lists for it
$ unzip -v package.zip filename
with the CRC-32 of the file on disk:
$ crc32 filename
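The same single-file comparison can be scripted, which avoids comparing the two values by eye. This is a sketch using Python's standard zipfile and zlib modules; the function names are invented for this example, and member_name must be the path exactly as it is stored in the archive (e.g. directory/filename).

```python
import zipfile
import zlib


def crc32_of_file(path, chunk_size=1024 * 1024):
    """Compute the CRC-32 of a file on disk without loading it whole."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def crc32_in_zip(zip_path, member_name):
    """Return the CRC-32 that the zip archive stores for one member."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.getinfo(member_name).CRC


def matches_archive(zip_path, member_name, file_path):
    """True if the stored CRC-32 equals the CRC-32 of the file on disk."""
    return crc32_in_zip(zip_path, member_name) == crc32_of_file(file_path)
```

For example, matches_archive("package.zip", "directory/filename", "directory/filename") should return True for an unmodified file.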
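There is no single tool to recursively compare compressed tar and zip archives alike with a filesystem. For tar archives, GNU tar's compare mode (d) checks the archive members against the files on disk.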
Tar does not natively store checksums of file contents; here’s a workaround: https://serverfault.com/questions/120582/creating-a-tar-file-with-checksums-included.
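One simple way to carry checksums along with a tar file is to write a manifest into the directory before archiving it. The sketch below is an illustration of that idea and not necessarily the approach described behind the link above; the manifest name SHA256SUMS and the function names are assumptions made for this example.

```python
import hashlib
import os

MANIFEST_NAME = "SHA256SUMS"  # hypothetical file name, chosen for this example


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(directory):
    """Write one 'digest  relative/path' line per file into the directory."""
    lines = []
    for current, dirnames, filenames in os.walk(directory):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            full_path = os.path.join(current, name)
            relative = os.path.relpath(full_path, directory).replace(os.sep, "/")
            if relative == MANIFEST_NAME:
                continue  # do not checksum an older copy of the manifest itself
            lines.append(f"{sha256_of_file(full_path)}  {relative}\n")
    manifest_path = os.path.join(directory, MANIFEST_NAME)
    with open(manifest_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.writelines(lines)
    return manifest_path
```

The lines follow the format that GNU sha256sum writes, so after unpacking the manifest can be checked by running sha256sum -c SHA256SUMS inside the unpacked directory.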
Zip would be fine, but it has no comparison method to quickly compare a zip archive against a directory structure.
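Because zip stores a CRC-32 for every member, such a comparison can be scripted. The following is a rough sketch using only Python's standard library, not an existing tool; the function names are hypothetical. It assumes that base_directory is the directory from which zip -r package.zip directory/ was run, so that member names resolve relative to it, and that file names were stored in an encoding Python's zipfile decodes correctly.

```python
import os
import zipfile
import zlib


def crc32_of_file(path, chunk_size=1024 * 1024):
    """Compute the CRC-32 of a file on disk without loading it whole."""
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def compare_zip_to_directory(zip_path, base_directory):
    """Return sorted (member, problem) pairs; an empty list means no differences."""
    problems = []
    archived = set()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            archived.add(info.filename)
            disk_path = os.path.join(base_directory, *info.filename.split("/"))
            if not os.path.isfile(disk_path):
                problems.append((info.filename, "missing on disk"))
            elif (os.path.getsize(disk_path) != info.file_size
                  or crc32_of_file(disk_path) != info.CRC):
                problems.append((info.filename, "content differs"))
    # Look for files on disk that never made it into the archive, but only
    # below the top-level names that the archive itself contains.
    for top in sorted({name.split("/", 1)[0] for name in archived}):
        for current, _dirnames, filenames in os.walk(os.path.join(base_directory, top)):
            for name in filenames:
                relative = os.path.relpath(os.path.join(current, name), base_directory)
                member = relative.replace(os.sep, "/")
                if member not in archived:
                    problems.append((member, "missing in archive"))
    return sorted(problems)


if __name__ == "__main__":
    for member, problem in compare_zip_to_directory("package.zip", "."):
        print(f"{problem}: {member}")
```

CRC-32 detects accidental corruption well but is not a cryptographic checksum, so this check is suitable for spotting copy or storage errors rather than deliberate tampering.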