Name
demotokens.py — a simple tokenizer of plain text
Synopsis
- Mylly: Demo → Simply Tokenize Plain Text
- Input file
- one plain text file
- Parameters
- Character encoding (default UTF-8)
- Output files
- tokens.tsv
- tokens.txt (duplicate)
Description
demotokens writes the leftmost-longest sequences of
word characters (tokens) in the input file to the
output files.
Word characters consist of letters, digits, and the
underscore. Each token is written on its own line together with
its line number and a running token counter, with a single tab as
a field separator.
This simple tool is meant for initial testing and demonstration
of the platform. The output format is usable by other tools in the
Demo category.
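The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration of the documented behaviour, not the actual demotokens.py source: leftmost-longest runs of word characters (letters, digits, underscore) are matched with the regex `\w+`, and each token is emitted with its line number and a running token counter.

```python
import re

# Leftmost-longest runs of word characters: letters, digits, underscore.
WORD = re.compile(r'\w+')

def tokenize(lines):
    """Yield (token, line_number, token_number) for each token."""
    counter = 0
    for lineno, line in enumerate(lines, start=1):
        for match in WORD.finditer(line):
            counter += 1
            yield match.group(), lineno, counter

def write_tokens(infile, outfile, encoding='UTF-8'):
    """Write one tab-separated record per token, as described above."""
    with open(infile, encoding=encoding) as inf, \
         open(outfile, 'w', encoding='UTF-8') as outf:
        for token, lineno, tokno in tokenize(inf):
            print(token, lineno, tokno, sep='\t', file=outf)
```

The function and parameter names here (`tokenize`, `write_tokens`) are illustrative only; the real tool is invoked through the Mylly interface rather than as a library.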
Input
Input consists of one plain text file and its
character encoding.
- Input file (text.txt)
- assumed to be plain text
- Character encoding
- UTF-8 (default)
- Latin-1
Output
Output consists of two files with identical content: a tokenized
version of the input file.
- tokens.tsv
- tokens.txt
- Each line consists of a token, followed by its line number and
  token number, with a single tab as field separator.
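As an illustration (using a hypothetical input, not an example from the tool itself), an input file whose only line is "Hello, world!" would produce output files beginning:

```
Hello	1	1
world	1	2
```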
See also
Eventually, a better simple tokenizer in the Text category, and
linguistic analysis tools in the Parsing category.
Bugs
The tool should not allow non-UTF-8 character encodings. The
parameter is kept only so that the chosen encoding appears as
documentation in the workflow history.