Ajatella, miettiä, pohtia, harkita corpus

Description

The amph micro-corpus consists of 3404 occurrences in total of the four most common Finnish THINK lexemes, ajatella, miettiä, pohtia, and harkita ‘think, reflect, ponder, consider’.

These occurrences were extracted from a corpus consisting of two months’ worth (January–February 1995) of written text from Helsingin Sanomat (1995), Finland’s major daily newspaper, and six months’ worth (October 2002 – April 2003) of written discussion in two newsgroups of the SFNET (2002-2003) Internet discussion forum, one on (personal) relationships (sfnet.keskustelu.ihmissuhteet) and one on politics (sfnet.keskustelu.politiikka). The newspaper corpus consisted of 3,304,512 words of body text, excluding headers and captions (as well as punctuation tokens), and included 1,750 occurrences of the studied THINK verbs. The Internet corpus comprised 1,174,693 words of body text, excluding quotes of previous postings as well as punctuation tokens, and included 1,654 occurrences of the studied THINK verbs. The overall frequencies of the individual THINK lexemes in the corpus were 1492 for ajatella, 812 for miettiä, 713 for pohtia, and 387 for harkita.
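The counts quoted above can be cross-checked and normalized with a few lines of Python. This is only an illustrative sketch: the variable and function names are made up for the example and are not part of the amph scripts or R functions. It uses nothing beyond the figures given in this description.

```python
# Figures quoted in the corpus description; all names here are illustrative only.
SUBCORPORA = {
    "Helsingin Sanomat (1995)": {"words": 3_304_512, "think_verbs": 1_750},
    "SFNET (2002-2003)": {"words": 1_174_693, "think_verbs": 1_654},
}
LEXEMES = {"ajatella": 1492, "miettiä": 812, "pohtia": 713, "harkita": 387}


def per_million(count: int, words: int) -> float:
    """Normalize a raw count to occurrences per million words."""
    return count / words * 1_000_000


# Count the occurrences two ways: by subcorpus and by lexeme.
total_by_source = sum(s["think_verbs"] for s in SUBCORPORA.values())
total_by_lexeme = sum(LEXEMES.values())

# Relative frequency of the THINK verbs in each subcorpus.
rates = {
    name: per_million(s["think_verbs"], s["words"])
    for name, s in SUBCORPORA.items()
}

# Share of each lexeme among all THINK occurrences.
shares = {lexeme: n / total_by_lexeme for lexeme, n in LEXEMES.items()}

print("total by source:", total_by_source)
print("total by lexeme:", total_by_lexeme)
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per million words")
for lexeme, share in shares.items():
    print(f"{lexeme}: {share:.1%}")
```

Both totals come to the 3404 occurrences stated for the micro-corpus. The normalized rates show that the THINK verbs are well over twice as frequent per word in the newsgroup text as in the newspaper text.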

The corpus contents were first analyzed automatically, both syntactically and morphologically, using a computational implementation of Functional Dependency Grammar (Tapanainen and Järvinen 1997; Järvinen and Tapanainen 1997) for Finnish, namely the FI-FDG parser (Connexor 2007). After this, all instances of the studied THINK lexemes together with their syntactic arguments were manually validated, corrected where necessary, and subsequently supplemented with semantic classifications. In addition, some extra-linguistic features (newspaper section or specific newsgroup, author ID when available, unique document index) were incorporated whenever they could be identified and extracted from the original corpora.

For each occurrence of the four selected THINK verbs in the original research corpora, the amph micro-corpus contains all relevant contextual features, including the verb itself, analyzed at the aforementioned morphological, syntactic and semantic levels in the immediate sentential context, as well as all pertinent extralinguistic features. In addition, the amph micro-corpus includes scripts for processing this data, R functions for its statistical analysis, and a comprehensive set of the ensuing results as R format data tables.

Research based on the amph micro-corpus is presented in Arppe (2007, 2008, submitted).

Version and Size

Version: 0.9
Size: 777288 kB

Content and Structure

CDPS/

Full set of COGNITION verbs and their single-word definitions in CD-Perussanakirja (Haarala et al. 1997), supplemented with relative frequencies as calculated on the basis of FTC 2001.

Documents/

Documents describing research based on the amph corpus and data, plus the entire e-mail correspondence concerning the manual annotation of the original corpus data.

HS+SFNET/

Final annotated versions of the two subparts (HS and SFNET) of the original corpus data, including the corpus in its entirety (without any exclusions of headers, captions, quoted passages in Internet discussion etc.) and the portions of the corpus data extracted for the actual research (where headers, captions, and Internet quotations have been excluded).

R_data/

Basic R format data tables to be used in the statistical analyses using the functions in R_functions/.

The file AMPH.dataset.R contains in R format a comprehensive set of all original data and functions as well as all the ensuing results (as covered in Arppe, submitted).

The files THINK.data and THINK.data.extra contain the essential data used in the statistical analyses in Arppe (submitted).

R_functions/

Text form files containing the code for R functions to be used in the statistical analysis of the data.

AMPH.functions.R contains the most important of these functions in R format.

R_plots/

R functions with which the graphs and diagrams in Arppe (submitted) have been calculated and plotted, as well as the resultant PDF files.

R_results/

A selection of files containing the results of individual statistical analyses using the R functions. The comprehensive set of these results is incorporated in R_data/AMPH.dataset.R.

Scripts/

Shell scripts which have been used to process and annotate the raw data in the original corpus sources and convert the resultant linguistic analyses into the R format data tables (THINK.data and THINK.data.extra) for statistical treatment.

Statistics/

Various statistics concerning the extralinguistic characteristics of the research data, the accuracy of its automated analysis as well as the consistency of its manual annotation.

Directory in CSC’s computing environment

/appl/data/kielipankki/amph

Directory Listing (sizes in kB)

60 CDPS/
5428 Documents/
535416 HS+SFNET/
77136 R_data/
176 R_functions/
3140 R_plots/
140040 R_results/
324 Scripts/
15588 Statistics/

Access Rights and Conditions

Academic research

References

Making Bibliographical Reference to the Material

amph 2008. A micro-corpus of 3404 occurrences of the four most common Finnish THINK lexemes, ajatella, miettiä, pohtia, and harkita, in Finnish newspaper and Internet newsgroup discussion texts, containing extracts and linguistic analysis of the relevant context in the original corpus data, scripts for processing this data, R functions for its statistical analysis, as well as a comprehensive set of ensuing results as R data tables. Compiled and analyzed by Antti Arppe. Available on-line at URL: http://www.csc.fi/english/research/software/amph/

Other References

Arppe, Antti (2007). Multivariate methods in corpus-based lexicography: A study of synonymy in Finnish. The Fourth Biennial Corpus Linguistics 2007 Conference, July 28-30, 2007, Birmingham, UK. Available on-line at the /l/kielipankki/amph/Documents/ directory and at URL: http://www.ling.helsinki.fi/~aarppe/Publications/CLC07_Arppe.pdf

Arppe, Antti (2008). Linguistic choices vs. probabilities – how much and what can linguistic theory explain? Pre-proceedings of the International Conference on Linguistic Evidence. Tübingen, Germany, 31.1.-2.2.2008. Available on-line at the /l/kielipankki/amph/Documents/ directory and at URL: http://www.ling.helsinki.fi/~aarppe/Publications/LE2008_extended_abstract_Arppe.pdf

Arppe, Antti (submitted). Univariate, bivariate and multivariate methods in corpus-based lexicography. A study of synonymy. Doctoral Dissertation. Available on-line at the /l/kielipankki/amph/Documents/ directory.

Connexor (2007). List of morphological, surface-syntactic and functional syntactic features used in the linguistic analysis. [Web documentation] URL: http://www.connexor.com/demo/doc/fifdg3-tags.html (visited 29.5.2007) and URL: http://www.connexor.com/demo/doc/enfdg3-tags.html (visited 5.6.2007).

Järvinen, Timo and Pasi Tapanainen (1997). A Dependency Parser for English. TR-1, Technical Reports of the Department of General Linguistics, University of Helsinki, Finland.

Haarala, Risto and Marja Lehtinen (Editors) (1997). CD-Perussanakirja. Helsinki: Edita.

Tapanainen, Pasi and Timo Järvinen (1997). A non-projective dependency parser. In: Proceedings of the 5th Conference on Applied Natural Language Processing, April 1997, Washington, D.C., Association for Computational Linguistics, pp. 64-71.
