# Finnish-tagtools-src

The official source distribution of finnish-tagtools is from the HFST
project, which has a source repository here:

https://github.com/hfst/hfst/tree/master/scripts/finnish-tagtools

The source repository does not document how some of the data files are
generated, however, and this distribution is intended to remedy that.
The one thing this distribution does _not_ provide is the procedure for
the lemma weighting, which is derived from a corpus that is not freely
accessible (namely, the FTC Finnish newspaper corpus). We do provide a
precalculated relative frequency table, however.

## OmorFi

We rely on OmorFi, https://github.com/flammie/omorfi. OmorFi has
progressed a lot since finnish-tagtools was originally built. The
release tag https://github.com/flammie/omorfi/releases/tag/20180511 is
contemporaneous with the finnish-tagtools releases.

OmorFi has its own dependencies and build process, which we do not
document here, but its outputs are used to generate two components of
finnish-tagtools: the analyzer and the tokenizer.

## The morphological analyzer

This is the file `tag/omorfi.tagtools.optcap.hfst`. It generates
candidate morphological analyses for the wordforms passed to it by the
tokenizer. Generating it is somewhat convoluted: it involves some
manual steps and corpus data that is not freely accessible, so we
describe the process step by step here.
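Before starting, it may help to confirm that the HFST command-line tools used in the steps below are on `PATH`. The following is only a minimal sketch; the function name `check_tools` is hypothetical and not part of this distribution.

```shell
# Hypothetical helper: report which of the named tools are not on PATH.
# Returns 0 if all tools were found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (tool names taken from the commands in this document):
# check_tools hfst-pmatch2fst hfst-compose hfst-strings2fst \
#     hfst-regexp2fst hfst-minimize hfst-fst2fst || exit 1
```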

First, OmorFi is used to build a morphological analyzer,
`omorfi.analyse.hfst`. Its analysis tokens are relabeled by composing
it with `relabel.pmscript` (included), a `pmatch` script:

```
hfst-pmatch2fst < relabel.pmscript | hfst-compose -1 omorfi.analyse.hfst -2 - > omorfi.analyse.relabeled.hfst
```

This is then composed with `word_id_fix.pmscript` (included), which
modifies the `WORD_ID=` tags produced by omorfi:

```
hfst-pmatch2fst word_id_fix.pmscript | hfst-compose -1 omorfi.analyse.relabeled.hfst -2 - > omorfi.analyse.relabeled_word_id_fixed.hfst
```

The result is then composed with `lemmafreq.hfst`. `lemmafreq.hfst` is
generated from a corpus that is not freely accessible, but
`lemmafreq.txt`, a list of lemma-weight pairs, is distributed with
this documentation. The script that produced it is too closely tied to
the structure of the corpus to be useful as a whole, but essentially,
lemma counts are collected and normalized so that the most frequent
lemma has weight 1.0, and every other lemma has weight
most_common_count / lemma_count.
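As a rough illustration of that normalization, the sketch below applies the ratio literally as stated above. It is not the original script: the function name `normalize_lemma_counts` and the tab-separated `lemma<TAB>count` input format are assumptions, and the distributed `lemmafreq.txt` may have been produced with additional steps.

```shell
# Hypothetical sketch: read "lemma<TAB>count" lines (assumed format) and
# print "lemma<TAB>weight", where weight = most_common_count / lemma_count.
normalize_lemma_counts() {
    awk -F '\t' '
        {
            # Sum counts per lemma and track the largest total seen so far.
            count[$1] += $2
            if (count[$1] > max) max = count[$1]
        }
        END {
            for (lemma in count)
                printf "%s\t%.4f\n", lemma, max / count[lemma]
        }
    ' "$@" | LC_ALL=C sort
}

# Usage, with a hypothetical counts file:
# normalize_lemma_counts lemma_counts.tsv > lemmafreq.txt
```

Under this scheme the most frequent lemma gets weight 1.0 and rarer lemmas get larger weights, so a lemma seen only once receives the highest weight in the table.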

The highest weight (generally the weight shared by all lemmas that
occurred in the corpus only once) is used as the default weight for
OOV words. Lemmas with that weight can therefore be filtered out of
the list, which is then compiled with `hfst-strings2fst`. We also
filter out candidate lemmas containing the `:` character, which are
never correct and which we would otherwise have to handle specially:

```
grep -v ":" lemmafreq.txt | grep -v "18.1152419635" | hfst-strings2fst -j > lemmafreq.hfst # 18.1152... is the highest weight
```
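The command above hard-codes the default weight. If the table is ever regenerated, that value will change, so one option is to derive it from the file instead. The sketch below assumes `lemmafreq.txt` holds tab-separated `lemma<TAB>weight` lines; the function name `filter_lemmafreq` is hypothetical.

```shell
# Hypothetical sketch: drop lemmas containing ":" and all entries that
# carry the highest weight found in the file (assumed "lemma<TAB>weight").
filter_lemmafreq() {
    # The file is read twice: the first pass finds the maximum weight,
    # the second pass prints the lines that survive both filters.
    awk -F '\t' '
        NR == FNR { if (($2 + 0) > max) max = $2 + 0; next }
        index($1, ":") == 0 && ($2 + 0) != max
    ' "$1" "$1"
}

# Usage:
# filter_lemmafreq lemmafreq.txt | hfst-strings2fst -j > lemmafreq.hfst
```

The maximum weight still has to be supplied as the default in the composition step that follows, so it is worth printing or recording it when the table changes.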

Once `lemmafreq.hfst` has been produced, it is composed with the
result of the previous step, together with a catch-all branch that
carries the default weight:

```
echo '[[@bin"lemmafreq.hfst" "\t" ?*] | [?*]::18.1152419635] .o. @bin"omorfi.analyse.relabeled_word_id_fixed.hfst"' | hfst-regexp2fst > omorfi.analyse.relabeled_word_id_fixed_lemmaweighted.hfst
```

Then the `BLACKLIST=` tags are removed, which gives the final
`omorfi.tagtools.hfst`:

```
echo '[[\"[BLACKLIST="]* (["[BLACKLIST=" [\"]"]+ "]" ]:0) [\"[BLACKLIST="]*]*' | hfst-regexp2fst | hfst-compose -1 omorfi.analyse.relabeled_word_id_fixed_lemmaweighted.hfst -2 - | hfst-minimize | hfst-fst2fst -t > omorfi.tagtools.hfst
```
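To make the effect of the tag-removal regular expression easier to see, here is a rough string-level equivalent using `sed`. It is only an illustration and not part of the build: the function name `strip_blacklist_tags` and the sample analysis string are hypothetical.

```shell
# Hypothetical illustration: delete every "[BLACKLIST=...]" tag from an
# analysis string, leaving all other tags untouched. As in the regular
# expression above, the tag value must be at least one character long.
strip_blacklist_tags() {
    sed 's/\[BLACKLIST=[^]][^]]*]//g'
}

# Usage, with a made-up analysis string:
# printf '%s\n' '[WORD_ID=talo][BLACKLIST=FOO][UPOS=NOUN]' | strip_blacklist_tags
```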

Finally we apply the `OptCap` function to it, to include capitalised
words:

```
echo 'set need-separators off regex OptCap(@bin"omorfi.tagtools.hfst", U);' | hfst-pmatch2fst -v > omorfi.tagtools.optcap.hfst
```

## The tokenizer

This is the file `tag/omorfi_tokenize.pmatch`. The sources for it are
distributed and documented elsewhere in HFST's repository, see

https://github.com/hfst/hfst/tree/master/scripts/tokenization/omorfi-tokenize

## The NER rules

These are the files `tag/proper_tagger_ph1.pmatch` and
`tag/proper_tagger_ph2.pmatch`.

Named-entity recognition is done by applying a series of `hfst-pmatch`
rules to the output of finnish-postag. finnish-postag in turn uses
finnpos-label, the morphological tagger, which mostly operates as a
disambiguator on the output of the analyzer. The rules are
collectively called FiNer-rules, and their source distribution is at

https://github.com/Traubert/FiNer-rules

The main documentation is

https://github.com/Traubert/FiNer-rules/blob/master/finer-readme.md

The version of the rules that was current as of this distribution is
1.6.0, and that version is included with this distribution as
`FiNer-rules-1.6.0.zip`.
