This is the README file for TRmorph (version 0.2.1, 2011-02-03)

This directory contains SFST sources for TRmorph, a Turkish
morphological analyzer. For the latest version of the analyzer and
this document please refer to: http://www.let.rug.nl/coltekin/trmorph/

The analyzer is distributed under GPL [1] (version 2 or later). 
(NOTE: The lexicon provided is based on wordlists from Zemberek [3],
however, it has been heavily modified. Note that, Zemberek is
distributed under a different free license, MDL.) 

You need SFST [2] to compile and use this analyzer. The Makefile
provided compiles compiles all the sources and produces the binary
'trmorph.a'. Assuming you have fst-compiler-utf8 from SFST in your
path, in POSIX systems typing 'make' should be enough to get started.
It turns out that it can also be compiled with HFST [3] (thanks to
Francis Tyers and Tommi Pirinen). To compile using HFST tools, set
FSTC envirionment to 'hfst', a command like 'FSTC=hfst make' should do
the trick. The resulting FSA will be usable with the HFST tools. The
rest of this document assumes that you are using SFST.

$fst-mor trmorph.a
analyze> evlerimizdekilerinki
ev<n><pl><p1p><loc><ki><pl><gen><ki>
ev<n><pl><p1p><loc><ki><pl><gen><ki><3s>
analyze> 
generate> ev<n><plu>
evler

Note that the current output may be noisier than above. This is
intentional left noisy for ease of debugging. Modify last line of
trmorph.fst to remove unwanted symbols from analysis level.

The analyzer is constantly under revision, and I'd be happy to hear
about corrections, comments, bug reports about the analyzer. I may
also be interested to help with adaptation for a particular purpose.
Please feel free to contact me at <c.coltekin@rug.nl>.

* Lexical entries

All the lexicon files are under lexicon/. The files contain lexical
items with related part of speech. The special lexicon "misc" contains
items that go through analysis unaltered. This is used for frequent
exceptions that are difficult (or costly) to handle.

The regular entries are all lowercase. In current version all symbols
are single word, i.e. does not contain any white spaces. Besides basic
alphabet, we have a number of special symbols in the lexicon.

The symbols <p><t><c><g><k><K> are likely to be found root-final
positions. These are rewritten as p/b, t/d, ç/c, g/ğ, k/ğ, k/y
respectively. The simple rule is that they are re-written as voiceless
counterparts (first of the alternatives) if no suffix is attached, or
a suffix that starts with a consonant is added. Otherwise the voiced
alternative is used. Another set of specially marked letters are the
long vowels, or vowels before a palatalized consonant. These affect
the way vowel harmony proceeds. These symbols are capitalized versions
of the vowel prefixed by `p'. For example, saat needs to be written as
sa<pA>t in the lexicon. These are simply mapped to their corresponding
surface symbols, but vowel harmony considers them as back vowels.

Other special symbols in the lexical entries include <del>, <dup>, and
<dels>. First one is used before the vowels that disappear (or if one
sees it the other way around, vowel epenthesis) on the surface  if
followed by a suffix staring with a vowel. For example burun ->
burn-um . <dup> is placed before the (root-final) consonants that gets
duplicated on the surface (hak -> hakk-ı). The last one <dels>, is
used after a word that may delete the buffer `s' in following suffix.
This only happens with a few words of foreign origin, more famously
cami -> cami-i. However, it seems this is disappearing in modern
Turkish, a Google search for `camisi' gives significantly higher hits
than `camii'. The <dels> rule is implemented as optional, so both
deleted, and overt buffer s will be produced.

Another lexical symbol is <compn>, which is added after one type of
compound nouns, such as `ayçiçeği', these compounds receive and -(s)I
suffix when they are bare, but in case they get a plural suffix, it
precedes -(s)I. So, these words are written without -(s)I, hence
`ayçiçeği' should be to the lexicon as `ayçiçek<compn>'. 

The last type of lexical symbols are related to verb irregularities
with causative forms, and aorist tense. The first one has a number of
alternatives: <caus_dir><caus_t><caus_it><caus_ir><caus_ar><caus_art>
the verbs that are marked with <caus_irreg> is not allowed to get a
causative marker. Unmarked entries get -t if they end in l, r, or a
vowel, otherwise -DIr. Only the verb roots that get aorist -Ar is
marked. The others get -Ir by default. Two more markers <rcp> and
<rfl> are used for verbs that get reciprocal and reflexive markers
respectively. If multiple irregularities are marked, it should follow
this order: <rcp>, <rfl>, <caus_*> and <aor_ar>.

* Derivations

All root forms in the lexicon are read in deriv.fst. This file tries
to model the most productive derivational suffixes that attache to the
root forms. The choice of derivations in this file is based on their
occurrences in various corpora. Mostly these are really productive
ones, such as -lA suffix that makes verbs from nominals. A more
complete list of suffixes are included in deriv_unproductive.fst.
However, the words formed by these morphemes should already be in the
lexicon. This file is prepared for a very aggressive segmentation
experiment. For most purposes this list is useless, and even harmful
as it will over-generate and increase the ambiguity.

As well as the choice of which derivational suffixes to include in the
analyzer, the number of allowed derivational suffixes are also a
practical choice. In principle, most of the productive derivational
suffixes attach to non-root word forms as well. Hence, one may apply
them recursively. The other extend is expecting the all the
derivations to be specified in the lexicon. Former will cause heavy
over-generation, the letter will require lots of predictable (possibly
automatically generated) entries in the lexicon. The choice likely
depends on the application. The current version allows only three
derivational suffixes.

Besides normal derivations, we allow a `zero derivation' from all
nominal types to nouns. Even though this makes every adjective or
adverb to have multiple analyses, it seems to make more sense
linguistically.

deriv.fst writes multiple `stem automata' files, which are later read
by trmorpoh.fst.

It should also be noted that some of the `productive' derivational
morphemes that attaches after the inflections are included in
trmorph.fst.

* Inflectional morphology

The nominal and verbal inflections are implemented in trmorph.fst.
trmorph.fst reads fst-lexicons that are produced by deriv.fst and
produces the final acceptable word forms. Both nominal and verbal
inflections are handled here. 

There is also a separate file num.fst which deals with most of the
number forms alphabetic or numeric.

For details, the reader is referred to the comments provided in this
file.

After forming one of the acceptable word forms, trmorph.fst passes the
surface to phon.fst, which implements the morpho-phonological
alternations.

* Phonological alternations

All phonological alternations are implemented by the files in phon
directory. The result is combined by phon.fst and used by trmorph.fst.

The alternations are implemented as a chain of transducers that work
on the previous one's output. As such, order of application in
phon.fst is important. The files are prefixed by a number that
indicates their order in the pipeline. 

These files also contain comments, and they are better and up to date
sources of information.

* Sources of Ambiguity (currently partially documented)

TRMorph mostly follows the description of the Turkish morphology from
Göksel & Kerslake (2005). There are a few persistent sources of
ambiguity that analyzer outputs without any attempt to resolve it. The
following is a list of these ambiguities:

    - Any nominal is analyzed as both as a nominal, and as a nominal
      verb. As in, 
            doktor<n>
            doktor<n><3s>
      The first analysis is valid for the cases like,
            doktor burada. `the doctor is here'.
      The second analysis works in 
            (benim) babam doktor. `my father is a doctor'.
      
      This formation is allowed for all nominal inflections as well:
            babam dotorlarla `may father is with the doctors'

    - Another additional source of ambiguity is the way the nominal
      inflections are implemented. The adjectives and adverbs are
      allowed to become a `noun' with an zero-derivation, hence, the
      adjective `hasta' is analyzed as
            hasta<adj>
            hasta<adj><Djn_0><n>
            hasta<adj><Djn_0><n><3s>
      In practice, the first analysis should be correct for,
            hasta adam  `(the) ill man'
      The second, 
            hasta geldi  `(the) ill (person) arrived'
      And similar to above, the 
            babam hasta `my father is ill'


* List of all Analysis Symbols (currently partially documented)

    See doc/analysis-symbols.txt
 
* Misc. Notes

This morphological analyzer is a result of a long term intermittent
work. Initially it was started to do segmentation and analysis of part
of the CHILDES data. Later on I needed a gold standards for
morphological segmentation experiments, and so on. Consequently, this
is not a `designed' morphological analyzer, it grew out of particular
need. It parses about 90% of the words (tokens) in METU corpus, which
is probably close to best that can be done (there is quite some noise
in the data). 

During my informal talks with some researchers, I felt that there is a
need for a freely accessible and modifiable morphological analyzer for
Turkish. Hence, I decided to put some more effort on it to make it
possibly useful for others. Probably I'll make some more effort time
to time to improve it further. The TODO file in this directory is my
list of items that needs to be done to improve the system. If you need
a feature, or see a bug and not able to do it yourself, please ask.

In general: please feel free to contact me on any issue regarding this
analyzer.

Çağrı Çöltekin
c.coltekin@rug.nl

-------------------------
[1] http://www.gnu.org/licenses/gpl.html
[2] http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html
[3] http://code.google.com/p/zemberek/,  
