Psycholinguistic Descriptives

Introduction

This material comprises a dataset of word frequencies from six
different corpora and a simple query tool for extracting commonly
used psycholinguistic descriptives for given words. The word
frequency tables have been filtered to better reflect actual word
frequencies.

For further and up-to-date information on the dataset, see:
http://urn.fi/urn:nbn:fi:lb-2018081601

This dataset is licensed under the Creative Commons Attribution 4.0
International License: https://creativecommons.org/licenses/by/4.0/

Requirements
The word frequency data is provided as .csv tables and can therefore 
be used with any program.
The query tool requires Python 3 as well as the following modules:
Pandas (https://pandas.pydata.org/, version 0.20 or newer)
FinnSyll (https://pypi.org/project/FinnSyll/)
Both are available through the 'pip' Python package manager 
(https://pip.pypa.io/en/stable/installing/).
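Because the tables are plain .csv files, they can also be loaded directly 
with Pandas. A minimal sketch; the column names and counts below are 
invented for illustration, so check the headers of the actual tables:

```python
import io
import pandas as pd

# Tiny stand-in for one of the frequency tables; the real column
# names may differ -- check the .csv headers in the dataset.
csv_text = """token,pos,frequency
talo,N,1234
juosta,V,567
"""
df = pd.read_csv(io.StringIO(csv_text))

# Relative frequency per million tokens, within this toy table.
df["per_million"] = df["frequency"] / df["frequency"].sum() * 1e6
print(df)
```

In practice you would replace the in-memory text with 
pd.read_csv("path/to/table.csv").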

Query tool
The query tool can be used to obtain descriptives for a list of words. 
At this time the descriptives include:
1. surface or lemma frequencies:
     - corpus specific relative frequencies
     - relative frequency in the total sum of chosen corpora
     - average relative frequency in the chosen corpora
2. syllable information
     - identity
     - count
     - frequencies
     - average frequency
3. letter 2-gram and 3-gram average frequency
4. orthographic neighbour information (words within Hamming distance of 1)
     - identity
     - count
Help on how to use the query tool can be found in the program's --help 
message.
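The orthographic neighbour definition used above (Hamming distance of 1, 
i.e. same length and exactly one differing character) can be sketched as 
follows; the function name and toy vocabulary are made up for illustration 
and are not part of the query tool:

```python
def hamming_neighbours(word, vocabulary):
    """Return words in `vocabulary` that have the same length as
    `word` and differ from it in exactly one character position."""
    return [
        w for w in vocabulary
        if len(w) == len(word)
        and sum(a != b for a, b in zip(w, word)) == 1
    ]

# Toy vocabulary: only 'kassa' is a Hamming-distance-1 neighbour
# of 'kissa'; 'kisa' and 'kissat' have a different length, and
# 'kirja' differs in two positions.
vocab = {"kissa", "kassa", "kisa", "kirja", "kissat"}
print(hamming_neighbours("kissa", vocab))  # -> ['kassa']
```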

Frequency tables
Separate frequency tables are provided for lemmas and surface forms. 
Both have been composed with the same methods. Tokens with the same 
written form are considered unique if they do not share the same 
part-of-speech tag. Tokens and part-of-speech tags have been extracted 
from texts parsed with the Finnish Dependency Parser 
(http://turkunlp.github.io/Finnish-dep-parser/). Tokens have been 
filtered (see below) to make the frequency values better reflect actual 
word frequencies.
The filtered frequency tables for the surface forms were used to 
calculate letter 2-gram and 3-gram as well as syllable frequencies. 
These frequencies were first calculated and normalized per corpus and 
then averaged across corpora to reduce the effect of different corpus sizes.
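The normalize-then-average scheme described above can be illustrated with 
a toy example; the counts and 2-gram labels below are invented, not taken 
from the dataset:

```python
import pandas as pd

# Toy counts of letter 2-grams in two corpora of very different sizes
# (illustrative numbers only).
counts = pd.DataFrame(
    {"corpus_a": [50, 30, 20], "corpus_b": [5000, 1000, 4000]},
    index=["ta", "al", "lo"],
)

# 1) Normalize within each corpus so each column sums to 1,
# 2) then average across corpora, so that a large corpus
#    does not dominate the combined estimate.
normalized = counts / counts.sum(axis=0)
average = normalized.mean(axis=1)
print(average)  # ta: 0.5, al: 0.2, lo: 0.3
```

Averaging the raw counts instead would let corpus_b swamp corpus_a, 
which is exactly the corpus-size effect the per-corpus normalization 
avoids.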

The corpora used in making the word frequency tables:
The Suomi24 Corpus (S24): http://urn.fi/urn:nbn:fi:lb-2017021630
Newspaper and Periodical Corpus of the National Library of Finland (KLK, 
only from 1980 onwards): http://urn.fi/urn:nbn:fi:lb-2016050302
Finnish Magazines and Newspapers from the 1990s and 2000s (LEHDET): 
http://urn.fi/urn:nbn:fi:lb-2017091901
Finnish Wikipedia 2017 (WIKI): http://urn.fi/urn:nbn:fi:lb-2018060401
Finnish OpenSubtitles 2017 (OPENSUB): http://urn.fi/urn:nbn:fi:lb-2018060403

Data retrieved from the web for the word frequency tables (REDDIT):
The Reddit community /r/Suomi: https://old.reddit.com/r/Suomi/

Token filtering:
1. Remove tokens longer than 30 characters.
2. Remove tokens categorized as punctuation, symbols or foreign words.
3. Convert all tokens to lowercase.
4. Per corpus, remove tokens whose occurrence count falls below a limit. 
The limit is decided manually, but corresponds to a relative frequency 
of approximately 0.01 tokens per million for each corpus.
5. Remove tokens that contain characters not in the regex set: 
[0-9abcdefghijklmnopqrsštuvwxyzžåäö\-\'\:\.].
6. Remove tokens where "special" characters (regex: [0-9\-\'\:\.]) make 
up more than 75% of all characters.
7. Remove tokens that are present in only a single corpus.
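The per-token parts of the filtering (steps 1, 3, 5 and 6) can be sketched 
as a single predicate; steps 2, 4 and 7 need part-of-speech tags and 
corpus-level counts, so they are omitted. This is an illustrative 
reconstruction, not the actual filtering code:

```python
import re

# Allowed characters (step 5) and "special" characters (step 6),
# taken from the regex sets given above.
ALLOWED = re.compile(r"^[0-9abcdefghijklmnopqrsštuvwxyzžåäö\-\'\:\.]+$")
SPECIAL = re.compile(r"[0-9\-\'\:\.]")

def keep_token(token):
    """Return True if `token` survives filtering steps 1, 3, 5 and 6."""
    token = token.lower()                    # step 3
    if len(token) > 30:                      # step 1
        return False
    if not ALLOWED.match(token):             # step 5 (also rejects "")
        return False
    n_special = len(SPECIAL.findall(token))  # step 6
    if n_special / len(token) > 0.75:
        return False
    return True
```

For example, keep_token("Kissa") passes, while "12:34" is rejected 
because all five of its characters are in the special set.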

Filtering results:

Surface forms:
		N tokens (millions)	Unique tokens
S24:
Pre		2278.5			43539346
Post		2088.4 (-8.3%)		983682 (-97.7%)

KLK:
Pre		122.0			6299766
Post		101.4 (-16.9%)		1032223 (-83.6%)

LEHDET:
Pre		136.3			8580856
Post		108.2 (-20.6%)		1066805 (-87.6%)

WIKI:
Pre		83.3			4044413
Post		61.4 (-26.3%)		856182 (-78.8%)

REDDIT:
Pre		38.2			1966899
Post		30.2 (-21.0%)		512325 (-74.0%)

OPENSUB:
Pre		267.6			3430478
Post		196.4 (-26.6%)		664655 (-80.6%)

TOTAL:
Pre		2926.0			56292881
Post		2586.1 (-11.6%)		1539918 (-97.3%)

Lemma forms:
		N tokens (millions)	Unique tokens
S24:
Pre		2278.4			31964747
Post		2119.3 (-7.0%)		408915 (-98.7%)

KLK:
Pre		121.9			4077418
Post		103.2 (-15.3%)		475322 (-88.3%)

LEHDET:
Pre		136.3			6184952
Post		110.1 (-19.2%)		501829 (-91.9%)

WIKI:
Pre		83.3			2468355
Post		62.7 (-24.8%)		443610 (-82.0%)

REDDIT:
Pre		38.2			1070158
Post		30.7 (-19.6%)		215858 (-79.8%)

OPENSUB:
Pre		267.6			1692284
Post		198.7 (-25.8%)		287327 (-83.0%)

TOTAL:
Pre		2925.8			41938288
Post		2624.6 (-10.3%)		747720 (-98.2%)

Known issues:
The S24, KLK and LEHDET corpora are parsed with an older version of the 
Turku Dependency Parser than the WIKI, REDDIT and OPENSUB corpora. 
Because of this, the part-of-speech tags have a few clear discrepancies. 
For example, the lemma 'ensimmäinen' is considered a numeral by the older 
version, while the newer version tags it (correctly) as an adjective. If 
part-of-speech tags are not relevant, the POS class information can be 
ignored in the query tool with the '-pc IGNORE' argument; this will 
collapse all instances of identical written forms.
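The effect of collapsing identical written forms can be illustrated with 
Pandas; the column names and counts below are invented for illustration 
and are not the query tool's internals:

```python
import pandas as pd

# Toy lemma table where the same written form received different POS
# tags from the two parser versions (illustrative numbers only).
df = pd.DataFrame({
    "lemma": ["ensimmäinen", "ensimmäinen", "talo"],
    "pos": ["Num", "Adj", "N"],
    "frequency": [120, 80, 500],
})

# Collapsing identical written forms, in the spirit of '-pc IGNORE':
# drop the POS distinction and sum the frequencies per written form.
collapsed = df.groupby("lemma", as_index=False)["frequency"].sum()
print(collapsed)
```

Here the two 'ensimmäinen' rows merge into one row with frequency 200.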

For questions about the data gathering, filtering or the query tool, 
email: tatu.huovilainen@helsinki.fi
