demotokens.py — a simple tokenizer of plain text
- Mylly: Demo → Simply Tokenize Plain Text
- Input file
- one plain text file
- Character encoding (default UTF-8)
- Output files
- tokens.txt (duplicate)
demotokens writes the leftmost-longest sequences of
word characters (tokens) in the input file to the output file.
Word characters consist of letters, digits, and the
underscore. Each token is written on its own line together with
its line number and a running token counter, with a single tab as
a field separator.
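
The behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual source of demotokens.py: the names WORD, tokenize and format_tokens are hypothetical, and it assumes that both the line number and the token counter start at 1 and that the fields appear in the order token, line number, token number.

```python
import re

# In Python 3 string patterns, \w matches letters, digits, and the underscore.
# finditer scans left to right and \w+ is greedy, so every match is a
# leftmost-longest run of word characters.
WORD = re.compile(r"\w+")


def tokenize(lines):
    """Yield (token, line_number, token_number) for each token in lines.

    Assumption: line numbers and the running token counter both start at 1.
    """
    counter = 0
    for line_number, line in enumerate(lines, start=1):
        for match in WORD.finditer(line):
            counter += 1
            yield match.group(), line_number, counter


def format_tokens(lines):
    """Return one tab-separated output line per token (without newlines)."""
    return [
        "\t".join((token, str(line_number), str(token_number)))
        for token, line_number, token_number in tokenize(lines)
    ]
```

For example, format_tokens(["Hello, world!"]) would produce the two lines "Hello", 1, 1 and "world", 1, 2, each with its fields separated by single tabs.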
This simple tool is meant for initial testing and demonstration
of the platform. The output format is usable by other tools in the
Input consists of one plain text file and its character encoding.
- Input file (text.txt)
- assumed to be plain text
- Character encoding
- UTF-8 (default)
Output consists of a tokenized version of the input file, written
as two files with identical content.
- Each line consists of a token, followed by its line number and
token number, with a single tab as the field separator.
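
A downstream tool could read this format back roughly as follows. This is only a sketch under the format stated above (three tab-separated fields per line); the function name parse_token_line is hypothetical and not part of demotokens.

```python
def parse_token_line(line):
    """Split one output line into (token, line_number, token_number).

    Assumes exactly three tab-separated fields, as described in the text;
    a line with any other number of fields raises ValueError.
    """
    token, line_number, token_number = line.rstrip("\n").split("\t")
    return token, int(line_number), int(token_number)
```

Since tokens consist only of word characters, a token can never contain a tab, so splitting on tabs is unambiguous.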
Eventually, a better simple tokenizer should appear in the Text
category, and linguistic analysis tools in the Parsing category.
The tool should not allow non-UTF-8 character encodings. The
parameter should be kept only so that it appears as documentation
in the workflow.