
HFST-SweNER - Swedish named-entity recognizer using HFST Pmatch
===============================================================

Version 0.9.3 (beta)


Introduction
------------

This package contains a beta version of HFST-SweNER, a rule-based
named-entity recognizer (NER) system for Swedish, implemented using a
pipeline of HFST Pmatch (pattern matching) finite-state transducers
(FSTs).

The recognizer is based on (was converted from) the original
implementation in Flex and Perl, developed by Dimitrios Kokkinakis at
the University of Gothenburg. The recognizer was converted to use HFST
Pmatch at the University of Helsinki.

Please note that this is a beta release and the recognizer has a few
known bugs and deficiencies; see below.


Package contents
----------------

This package contains the following files and directories:

    configure - A script for configuring HFST-SweNER
    Makefile.in - A template for a makefile for (re)compiling
        HFST-SweNER
    README - This file
    INSTALL - Generic configuration and installation instructions
    pmatch/ - Precompiled Pmatch FSTs
    scripts/ - Auxiliary scripts for compiling and running the
        recognizer and for processing and comparing its output
    src/ - Pmatch source files
    src/flex/ - Original Flex source files, slightly modified before
        conversion
    src/gazetteer/ - Original gazetteer (name database) files
    src/gazetteer-pm/ - Gazetteer files converted for HFST Pmatch,
        only needed at compile time
    hfst-bin/ - Pre-built, statically linked binaries of HFST Pmatch
        and other required HFST tools
    build-aux/ - Auxiliary scripts for configuring and installing the
        system


Prerequisites
-------------

The makefile and scripts for compiling and running HFST-SweNER require
a Linux or a similar Unix-type system with several GNU tools.

For compiling and running the NER pipeline, you need a recent version
of the HFST Pmatch tool, as packaged in HFST version 3.8.2 or newer,
and a few other HFST tools. This package comes bundled with pre-built,
mostly statically linked binaries of the required tools for 64-bit x86
GNU/Linux, but they might not work on older systems. Alternatively,
and for other platforms, you can download HFST from SourceForge:

    http://sourceforge.net/projects/hfst/files/hfst/

Or, to compile the latest revision of HFST yourself, check it out from
the Subversion repository, configure, compile and install it:

    svn checkout svn://svn.code.sf.net/p/hfst/code/trunk/hfst3
    cd hfst3
    ./configure [options]
    scripts/generate-cc-files.sh
    make && make install

The hfst-swener script (alias runNer-pm) for running HFST-SweNER
requires Bash, iconv and Perl 5.x.

The makefile for compiling HFST-SweNER requires GNU Make, GNU M4 and
Perl 5.x.

The auxiliary scripts require Python 2.6.x or 2.7.x.


Installation
------------

HFST-SweNER uses Autoconf-based configuration and installation. The
file `INSTALL' contains generic configuration and installation
instructions. To configure the system, execute

    ./configure [options] [HFSTDIR=DIR]

in the top directory.

If you build HFST-SweNER on a 64-bit x86 GNU/Linux system, the
configuration uses the packaged HFST binaries by default. To use an
existing HFST installation instead, specify the option
`--without-bundled-hfst-tools'. To use HFST binaries in a directory
DIR not in `$PATH', specify `HFSTDIR=DIR' on the command line.

Perhaps the most relevant of the standard `configure' options is
`--prefix=DIR' for specifying DIR as the directory prefix under which
to install HFST-SweNER (default: `/usr/local').

For more information about `configure' options, please run
`./configure --help' or refer to the generic instructions in the file
`INSTALL'.
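
As an illustration, the options described above could be combined as
in the following sketch. The directories are hypothetical
placeholders, and whether `--without-bundled-hfst-tools' needs to be
given together with `HFSTDIR=DIR' is an assumption; check
`./configure --help' for the authoritative behaviour.

```shell
# Hypothetical locations; adjust them to your system.
prefix="$HOME/hfst-swener"   # where HFST-SweNER would be installed
hfstdir="/opt/hfst/bin"      # existing HFST binaries, not in $PATH

# Collect the configure arguments as positional parameters.
set -- --prefix="$prefix" --without-bundled-hfst-tools HFSTDIR="$hfstdir"

if [ -x ./configure ]; then
    # In the top directory of the package: run the real configuration.
    ./configure "$@"
else
    # Elsewhere: only show the command that would be run.
    echo "Would run: ./configure $*"
fi
```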

After configuring the system, run

    make

If you have not made any changes to the Pmatch source files, `make'
only generates some scripts with configuration information added.

To check that the system works as expected, run

    make check

This currently runs only a very simple test.

To install the system in the installation directory, run

    make install


Running
-------

The whole HFST-SweNER pipeline can be run with the Bash script
`hfst-swener' in directory `scripts'. The basic usage of the script
is:

    hfst-swener [options] [input files] [> output]

For more detailed usage information and a description of the options, run

    hfst-swener --help

If input files are not specified, the script reads from the standard
input. The default input character encoding is UTF-8; another encoding
can be specified with option `--input-encoding'. The encodings
supported are those of the `iconv' program.
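
The following sketch shows the two input modes described above. The
sample sentence and file names are made up for illustration, and the
`--input-encoding=ISO-8859-1' form (option and value joined with `=')
is an assumption modelled on `--progdir=HFSTDIR'; `hfst-swener --help'
shows the exact syntax. The recognizer is run only if `hfst-swener' is
found in `$PATH'.

```shell
workdir=$(mktemp -d)

# Illustrative one-sentence input, written as UTF-8.
printf 'Anna bor i Göteborg.\n' > "$workdir/sample-utf8.txt"

# Make an ISO-8859-1 copy to demonstrate a non-default input encoding.
iconv -f UTF-8 -t ISO-8859-1 "$workdir/sample-utf8.txt" \
    > "$workdir/sample-latin1.txt"

if command -v hfst-swener >/dev/null 2>&1; then
    # UTF-8 text read from standard input (the default encoding).
    hfst-swener < "$workdir/sample-utf8.txt" > "$workdir/out-utf8.txt"
    # ISO-8859-1 text read from a file; the =VALUE form is an assumption.
    hfst-swener --input-encoding=ISO-8859-1 \
        "$workdir/sample-latin1.txt" > "$workdir/out-latin1.txt"
else
    echo "hfst-swener not found in PATH; skipping the recognizer runs" >&2
fi
```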

By default, the script uses the HFST Pmatch found in the HFST binary
directory specified for `configure' (or found in `$PATH'). To use HFST
Pmatch residing elsewhere, you can either specify the option
`--progdir=HFSTDIR' where HFSTDIR is the directory, or set the value
of the environment variable `NER_BINDIR_PMATCH' to HFSTDIR.
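
For example, the environment variable could be set once per shell
session, as sketched below. The directory `/opt/hfst/bin' is a
hypothetical placeholder for wherever your HFST binaries reside.

```shell
# Hypothetical directory containing the HFST Pmatch binaries.
hfstdir="/opt/hfst/bin"

# Applies to every later hfst-swener run started from this shell.
export NER_BINDIR_PMATCH="$hfstdir"

# Per-invocation alternative using the option instead
# (input.txt is a placeholder file name):
#   hfst-swener --progdir="$hfstdir" input.txt
```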

By default, hfst-swener writes its output to standard output. If the
option `--output-to-file' is specified, hfst-swener produces its
output for input file FILE to a file named `FILE.ner-pm' in the
current directory, unless otherwise specified with options
`--output-name' and `--output-dir'.

The output contains named entities marked with XML-style tags of the
same kind as the original implementation. The output character
encoding is UTF-8.

The script can optionally generate intermediate files for the output
of each recognizer in the pipeline; see options `--tee', `--names',
`--name-options', `--diff', `--diff-only', `--clean-diff'.

If you are short of memory (less than 4 GiB), you can specify the
option `--all-tempfiles', so that the recognizer and correction filter
of each recognition stage are run separately with a temporary file in
between, not piped. If you have plenty of memory (24 GiB or more) and
can run at least 600 processes simultaneously, you can specify the
option `--no-tempfiles' to use pipelines also between recognition
stages.
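
The thresholds above can be expressed as a small helper function. This
is only a sketch: the function name `choose_tempfile_option' is
hypothetical and not part of the package, and the caller is assumed to
supply the total memory in KiB and the maximum number of simultaneous
processes.

```shell
# Print the option suggested by the thresholds described above, or
# nothing if the default behaviour is appropriate.
#   $1: total memory in KiB
#   $2: maximum number of simultaneous processes
choose_tempfile_option() {
    mem_kib=$1
    max_procs=$2
    gib=$((1024 * 1024))    # number of KiB in one GiB
    if [ "$mem_kib" -lt $((4 * gib)) ]; then
        # Less than 4 GiB: use temporary files everywhere.
        echo "--all-tempfiles"
    elif [ "$mem_kib" -ge $((24 * gib)) ] && [ "$max_procs" -ge 600 ]; then
        # 24 GiB or more and at least 600 processes: pipes everywhere.
        echo "--no-tempfiles"
    fi
}

# Example: a machine with 2 GiB of memory and a limit of 1000 processes.
choose_tempfile_option $((2 * 1024 * 1024)) 1000
```

On Linux, the total memory in KiB can typically be read from the
`MemTotal' line of `/proc/meminfo'.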

With the option `--flex', the script can also be used to run the
original implementation of the Swedish NER system.


Recompiling
-----------

If you modify the Pmatch source files, they need to be recompiled
for the changes to take effect. You can recompile only the changed
files by running `make pmatch' in the top directory.

With the default settings, HFST-SweNER Pmatch FSTs compile relatively
fast, although the slowest ones may take 10 minutes or more, depending
on the speed of the computer.


Gazetteers
----------

The gazetteer source files are in the directory `src/gazetteer'. They
are in the format used in the original Perl implementation of the
gazetteer lookup.

If you modify the gazetteers, you need to recompile the gazetteer
lookup FSTs by running `make pmatch' for the changes to take effect
(see above).

If you want to use gazetteer files residing elsewhere, you can
override the makefile variable `GAZETTEER_SRCDIR'. For example:

    make pmatch GAZETTEER_SRCDIR=~/ner/gazetteer

The gazetteer files should nevertheless be named `nameDb1.txt',
`nameDb2.txt' and `nameDb3.txt' for one-, two- and three-word names,
respectively.
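
Before recompiling with a custom gazetteer directory, it may be useful
to check that the three expected files are present. The following
sketch does that; the function name `check_gazetteer_dir' is
hypothetical and not part of the package, and `~/ner/gazetteer' is the
example directory used above.

```shell
# Return 0 if directory $1 contains nameDb1.txt, nameDb2.txt and
# nameDb3.txt; otherwise report each missing file and return 1.
check_gazetteer_dir() {
    dir=$1
    status=0
    for n in 1 2 3; do
        if [ ! -f "$dir/nameDb$n.txt" ]; then
            echo "missing: $dir/nameDb$n.txt" >&2
            status=1
        fi
    done
    return $status
}

gazdir="$HOME/ner/gazetteer"    # example directory; adjust as needed

# Recompile only if all three gazetteer files are in place.
if check_gazetteer_dir "$gazdir"; then
    make pmatch GAZETTEER_SRCDIR="$gazdir"
fi
```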

The files in the directory `src/gazetteer-pm' are intermediate
gazetteer files for HFST Pmatch. They are only used at compile time,
not at run time.


Known bugs and deficiencies
---------------------------

In general, most of the names recognized by the original
implementation are also recognized by this implementation, but not
all of them.

There is little documentation.


Contact information
-------------------

If you have questions, comments or bug reports, please contact the
authors by email:

    Krister Lindén, NER project leader, krister.linden@helsinki.fi
    Jyrki Niemi, NER developer and packager, jyrki.niemi@helsinki.fi
    Sam Hardwick, HFST Pmatch developer, sam.hardwick@iki.fi
