demotokens: a simple tokenizer of plain text


Mylly: Demo → Simply Tokenize Plain Text
Input file
one plain text file
Character encoding (default UTF-8)
Output files
tokens.txt (duplicate)


demotokens writes the leftmost-longest sequences of
word characters (tokens) in the input file to the
output files.

Word characters consist of letters, digits, and the
underscore. Each token is written on its own line together with
its line number and a running token counter, with a single tab as
a field separator.
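
The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual implementation: the function names are hypothetical, and the sketch assumes that line numbers and token numbers both start at 1 and that the token counter runs over the whole file.

```python
import re

# \w+ matches the leftmost-longest runs of word characters
# (letters, digits, and the underscore).
WORD = re.compile(r"\w+")


def tokenize(text):
    """Yield (token, line_number, token_number) for each token in text.

    Assumption: both numbers start at 1, and the token counter runs
    over the whole input rather than restarting on each line.
    """
    token_number = 0
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in WORD.finditer(line):
            token_number += 1
            yield match.group(), line_number, token_number


def format_tokens(text):
    """Return the tokens as tab-separated lines, one token per line."""
    return "".join(
        f"{token}\t{line_number}\t{token_number}\n"
        for token, line_number, token_number in tokenize(text)
    )
```

For example, format_tokens("Hello, world!\nfoo_bar 42") yields four lines, one each for Hello, world, foo_bar, and 42, with punctuation and whitespace dropped.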

This simple tool is meant for initial testing and demonstration
of the platform. The output format is usable by other tools in the
Demo category.


Input consists of one plain text file and its
character encoding.

Input file (text.txt)
assumed to be plain text
Character encoding
UTF-8 (default)


Output consists of two files with identical content. Each is a
tokenized version of the input file.

Each line consists of a token, followed by its line number and
token number, with a single tab as the field separator.
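
A downstream script could read this format back as follows. This is a minimal sketch assuming exactly three tab-separated fields per line; parse_token_line is a hypothetical helper, not part of the platform.

```python
def parse_token_line(line):
    """Split one output line into (token, line_number, token_number).

    Assumption: each line holds exactly three tab-separated fields,
    and the two numeric fields are decimal integers.
    """
    token, line_number, token_number = line.rstrip("\n").split("\t")
    return token, int(line_number), int(token_number)
```

A line that does not have exactly three fields raises ValueError, which makes malformed input easy to notice.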

See also

Eventually, a better simple tokenizer in the Text category and
linguistic analysis tools in the Parsing category.


The tool should not allow non-UTF-8 character encodings. The
parameter should be kept only so that it appears as documentation
in the workflow.

