Name

demotokens.py — a simple tokenizer of plain text

Synopsis

Mylly: Demo → Simply Tokenize Plain Text
Input file
one plain text file
Parameters
Character encoding (default UTF-8)
Output files
tokens.tsv
tokens.txt (duplicate)

Description

demotokens writes the leftmost-longest sequences of
word characters (tokens) in the input file to the
output files.

Word characters consist of letters, digits, and the
underscore. Each token is written on its own line together with
its line number and a running token counter, with a single tab as
a field separator.
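The behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the actual source of demotokens.py: it assumes that "word characters" correspond to Python's \w class, that line numbers start at 1, and that the token counter runs across the whole file. The names tokenize and format_rows are hypothetical.

```python
import re

# Assumption: "word characters" are what Python's \w matches
# (letters, digits, and the underscore).
WORD = re.compile(r"\w+")


def tokenize(lines):
    """Yield (token, line_number, token_number) for each token.

    Line numbers are assumed to start at 1; the token counter
    runs over the whole input, not per line.
    """
    counter = 0
    for line_number, line in enumerate(lines, start=1):
        # finditer scans left to right and \w+ is greedy, so each
        # match is a leftmost-longest run of word characters.
        for match in WORD.finditer(line):
            counter += 1
            yield match.group(), line_number, counter


def format_rows(lines):
    """Return one tab-separated output row per token."""
    return ["\t".join((tok, str(ln), str(k))) for tok, ln, k in tokenize(lines)]


if __name__ == "__main__":
    for row in format_rows(["Hello, world!", "foo_bar 42"]):
        print(row)
```

Punctuation and whitespace are simply skipped, so "Hello, world!" yields the two tokens Hello and world.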

This simple tool is meant for initial testing and demonstration
of the platform. The output format is usable by other tools in the
Demo category.

Input

Input consists of one plain text file and its
character encoding.

Input file (text.txt)
assumed to be plain text
Character encoding
UTF-8 (default)
Latin-1
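As a rough illustration of what the encoding parameter means, the sketch below decodes the raw bytes of an input file with the chosen encoding. It is an assumption about how such a parameter is typically applied, not a description of the tool's internals; the function name decode_text is hypothetical.

```python
def decode_text(data, encoding="utf-8"):
    """Decode raw input bytes using the given character encoding.

    The default mirrors the tool's default (UTF-8); "latin-1" is
    Python's name for the Latin-1 option.
    """
    return data.decode(encoding)


if __name__ == "__main__":
    raw = "päivä".encode("latin-1")
    print(decode_text(raw, "latin-1"))
```

Decoding Latin-1 bytes as UTF-8 usually fails or produces wrong characters, which is why the encoding has to match the file.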

Output

Output consists of two files with identical content: a tokenized
version of the input file.

tokens.tsv
tokens.txt
Each line consists of a token, followed by its line number and
token number, with a single tab as the field separator.
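A downstream script could read one line of this format back as follows. This is a minimal sketch that assumes exactly three tab-separated fields per line, in the order token, line number, token number; the function name parse_token_line is hypothetical.

```python
def parse_token_line(line):
    """Split one output line into (token, line_number, token_number)."""
    # Assumption: exactly three fields separated by single tabs.
    token, line_number, token_number = line.rstrip("\n").split("\t")
    return token, int(line_number), int(token_number)


if __name__ == "__main__":
    print(parse_token_line("world\t1\t2\n"))
```

Because tokens contain only word characters, a tab can never occur inside a token, so splitting on tabs is safe under these assumptions.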

See also

Eventually, a better simple tokenizer in the Text category, and
linguistic analysis tools in the Parsing category.

Bugs

The tool should not allow non-UTF-8 character encodings. The
parameter should be kept only so that the encoding appears as
documentation in the workflow history.
