ScotsCorr Korp Guide:
Using the Helsinki Corpus of Scottish Correspondence 1540–1750 (ScotsCorr) in the Korp Concordance Service

Anneli Meurman-Solin 2016–2017 (with technical details by Jyrki Niemi)

Anneli Meurman-Solin, Research Unit for the Study of Variation, Contacts and Change in English (VARIENG), Department of Modern Languages, University of Helsinki. The Helsinki Corpus of Scottish Correspondence 1540–1750 (2017) [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-201411071

The Helsinki Corpus of Scottish Correspondence 1540–1750 (abbreviated as ScotsCorr) is available in the Korp concordance service of Kielipankki – The Language Bank of Finland. This ScotsCorr Korp Guide aims to complement the generic Korp User guide by focusing on (1) features in Korp specific to ScotsCorr, and (2) questions the users may have concerning data retrieval methods applicable to a historical corpus reflecting a very high degree of variation and variability.

Any developments in the type and function of the tools available in Korp relevant to ScotsCorr will be recorded in new editions of this ScotsCorr Korp Guide. However, the ScotsCorr Manual will remain as published in 2016.

Changes to the Korp version of ScotsCorr

The following changes have taken place in the Korp version of ScotsCorr since ScotsCorr was published in Korp in June 2017. The list excludes changes affecting the Korp service in general.

  • 2018-01-23: The attribute selection list in the extended search lists “word” only once.
  • 2018-01-19: Searching for any punctuation mark works in the extended search.

Accessing ScotsCorr

The ScotsCorr corpus is available for licensed users. Usually, academic users are identified as such by their home university. If this is not the case, please apply for recognition as an academic user in the Language Bank Rights (LBR) application. Please see the access support page for instructions. Language Bank Rights is available only for members of organisations of the Haka and eduGAIN identity federations and for researchers with a CLARIN or CSC user account. If you cannot find your home organisation when trying to log in, you can apply for a CLARIN user account. If you experience log-in problems, please contact kielipankki (at) csc.fi. For further information, please consult the access support page.

Once you have access rights to ScotsCorr, you need to log in to the Korp service to access ScotsCorr. You can either use the Log in to Korp link in the dialogue window Accessing restricted corpora, which is presented when you access ScotsCorr directly via the URN http://urn.fi/urn:nbn:fi:lb-2016121607, or choose Log in from the cog menu in the top right corner of the Korp page. The log-in links direct you first to a page on which you can choose your organisation and then to the actual log-in page provided by your organisation. After a successful login, you are directed back to Korp, and the top right side of the Korp page shows Log out user@domain, where user@domain is your user name.

ScotsCorr in Korp

The Helsinki Corpus of Scottish Correspondence 1540–1750 in Korp consists of nine subcorpora representing four time-periods of approximately 50 years: 1540–1599, 1600–1649, 1650–1699 and 1700–1749. Each of the four periods is divided into two subcorpora by gender. In addition, there is a small subcorpus containing letters by members of the royal court dating from the second half of the sixteenth century and the first decades of the seventeenth century:

  • Royal
  • Male 1540–1599
  • Female 1540–1599
  • Male 1600–1649
  • Female 1600–1649
  • Male 1650–1699
  • Female 1650–1699
  • Male 1700–1749
  • Female 1700–1749

To select ScotsCorr in Korp, you can use the direct ScotsCorr Korp URN http://urn.fi/urn:nbn:fi:lb-2016121607, which redirects to Korp with ScotsCorr preselected. (The URN will remain the same and redirect to ScotsCorr even if the Korp URL changes.) If you are not already logged in to Korp, Korp presents the dialogue window Accessing restricted corpora, which offers links for logging in and for applying for access rights (an academic status). If you are logged in to Korp but are not recognised as an academic user, the dialogue offers a link for applying for recognition as an academic user.

Alternatively, you can select the corpus manually: choose the mode Other languages on the top left of the Korp page, and in the corpus selector, first clear the preselected corpora by clicking Select none and then tick ScotsCorr in the folder English / Englanti. (Even though Korp allows selecting other corpora at the same time, it would probably make little sense with ScotsCorr and is not recommended.) If you see a lock icon instead of a tick-box in front of ScotsCorr, either you are not logged in to Korp or you have no access rights to ScotsCorr. When you click the lock icon or the corpus name, the dialogue window Accessing restricted corpora appears.

In planning data retrieval processes for a particular search, you should keep in mind that Korp permits you to choose the whole corpus by ticking ScotsCorr in the corpus selector or to select one or more of the nine subcorpora.
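As an illustration of how the subcorpora are organised, the following Python sketch maps the year of a letter and the gender/rank of its writer to the subcorpus names listed above. The function name subcorpus_label and the assumption that every letter by a royal writer belongs to the Royal subcorpus are made up for this example; they are not part of Korp.

```python
# Period boundaries as given in this guide.
PERIODS = [(1540, 1599), (1600, 1649), (1650, 1699), (1700, 1749)]


def subcorpus_label(year, writer_rank):
    """Return the subcorpus name for a letter (hypothetical helper).

    writer_rank is one of "male", "female" or "royal".
    Assumption: all letters by royal writers are in the Royal subcorpus.
    """
    if writer_rank == "royal":
        return "Royal"
    if writer_rank not in ("male", "female"):
        raise ValueError(f"unknown gender/rank: {writer_rank!r}")
    for start, end in PERIODS:
        if start <= year <= end:
            return f"{writer_rank.capitalize()} {start}–{end}"
    raise ValueError(f"year {year} is outside the four periods")


# The sidebar example in this guide (a male writer, 1544) falls in the first period.
print(subcorpus_label(1544, "male"))
```

A helper of this kind may be useful when you decide which subcorpora to tick for a search restricted to one period or one group of writers.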

A large majority of the data can be defined by geographical criteria, these letters representing a particular locality or region of Scotland. However, some writers remain unlocalised for various reasons (see the ScotsCorr Manual). In addition, letters by professional writers such as lawyers and members of the clergy form a category of their own. Consult the table below.

Geographically defined male and female writers, professional male writers, and unlocalised male and female writers in ScotsCorr.
| Category | 1540–1599 | 1600–1649 | 1650–1699 | 1700–1749 | Total | % | N Informants | N Letters |
|---|---|---|---|---|---|---|---|---|
| Male | 37,501 | 146,334 | 72,029 | 32,490 | 288,354 | 69.0 | 289 | 915 |
| Male Professional | 5,293 | 10,847 | 4,826 | 7,250 | 28,216 | 6.8 | 23 | 55 |
| Male unlocalised | 2,720 | 2,578 | 2,574 | 1,734 | 9,606 | 2.3 | 28 | 30 |
| Female | 2,190 | 30,167 | 34,264 | 13,545 | 80,166 | 19.2 | 103 | 313 |
| Female unlocalised | 278 | 1,946 | 2,777 | 1,154 | 6,155 | 1.5 | 15 | 22 |
| Court | 3,669 | 1,543 | | | 5,212 | 1.2 | 8 | 27 |
| Total | 51,651 | 193,415 | 116,470 | 56,173 | 417,709 | 100.0 | 466 | 1,362 |
| % | 12.4 | 46.3 | 27.9 | 13.4 | 100.0 | | | |

The size of the ScotsCorr corpus is approximately 0.4 million words, or 0.5 million tokens including punctuation marks and the editor’s comments. There are 1,362 letters by 466 informants altogether. For detailed information, please consult the ScotsCorr Manual and the auxiliary data files listed below in section Documentation.
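The percentages in the table above can be reproduced from the word counts. The short Python sketch below recomputes the share of each category from the totals given in the table; the variable names are illustrative only.

```python
# Total word counts per category, copied from the table in this guide.
WORD_COUNTS = {
    "Male": 288_354,
    "Male Professional": 28_216,
    "Male unlocalised": 9_606,
    "Female": 80_166,
    "Female unlocalised": 6_155,
    "Court": 5_212,
}

total = sum(WORD_COUNTS.values())

# Share of each category as a percentage of the whole corpus, to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in WORD_COUNTS.items()}

for name, share in shares.items():
    print(f"{name}: {share} %")
```

The same calculation can be applied to the counts in the auxiliary files when you need to normalise frequencies for a subset of the corpus.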

Documentation

The ScotsCorr Auxiliary Data site offers detailed information about the corpus in the following PDF files:

To assess the validity and relevance of a letter or a group of letters for a particular research project, please consult informant catalogues on the ScotsCorr Auxiliary Data site. Quantitative information is provided in the auxiliary files ScotsCorr Word Counts by Individual and Locality and ScotsCorr Quantitative Data.

Letter features: language-external variables

Each letter is characterised by the variables in the following table. The language-external parameters in the first column appear, with their values, in the info box (sidebar) on the right-hand side of the concordance view. The Korp attribute identifiers can be used in CQP query expressions in Korp’s Advanced search mode, and they are shown as attribute labels in the downloaded KWIC files.

Parameters for features of letters
| Parameter | Reference code | Korp attribute identifier | Content type |
|---|---|---|---|
| Writer | %IM/IF/IR | text_from | name |
| Addressee | %AM/AF/AR | text_to | name |
| Year | yyyy | text_year | yyyy |
| Date | %DA | text_date | yyyy Month dd; may contain a range or indications of uncertainty |
| Description | | text_fraser | summary |
| writer’s origin | %LC | text_lcinf | region |
| larger region | | text_largeregion | larger region |
| place of writing | %LC | text_lclet | place name |
| gender/rank of writer | %IM/IF/IR | text_wgr | male/female/royal |
| gender/rank of addressee | %AM/AF/AR | text_agr | male/female/royal |
| hand (primary) | %HD1 | text_lettertype | autograph/non-autograph |
| script type (primary) | %HD1 | text_scripttype | secretary/italic/non-secretary |
| hand (secondary) | %HD2 | text_lettertype2 | non-autograph |
| script type (secondary) | %HD2 | text_scripttype2 | secretary/italic/non-secretary |
| number of words | %WC | text_wc | word count |
| text identification | # | text_id | number |
| file name | %FN | text_fn | title, e.g. 12Sutherland (12th Earl of Sutherland) + date, or surname, first name + date |
| catalogue number | %MS | text_ms | reference in the manuscript archives |
| previous editions | %BI | text_bi | reference to a previous edition |
| transcription information | %ST | text_st | transcription based on the manuscript itself in situ or its xerox copy |

This information makes selective searches possible: by restricting a search to particular parameter values, you can create a subset of ScotsCorr that contains only the data valid for a particular research question.
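To show how the Korp attribute identifiers might be combined with a word search in the Advanced search, here is a minimal Python sketch that assembles a CQP token expression as a string. The helper cqp_token is hypothetical. The `_.` prefix before a text attribute name is our understanding of how Korp refers to text-level attributes inside a token expression; please check it against the CQP query that Korp itself displays for an equivalent Extended search before relying on it.

```python
def cqp_token(word_regexp, **text_attrs):
    """Build one CQP token expression (hypothetical helper).

    word_regexp is used as the value of the 'word' attribute as given,
    so regular-expression special characters must already be escaped.
    text_attrs maps Korp attribute identifiers (e.g. text_wgr) to values.
    Assumption: values contain no double quotation marks.
    """
    conditions = [f'word = "{word_regexp}"']
    for name, value in text_attrs.items():
        # Assumed Korp convention: text attributes are written as _.name
        conditions.append(f'_.{name} = "{value}"')
    return "[" + " & ".join(conditions) + "]"


# Full and contracted forms of 'your' in letters by female writers
# from the larger region 'Central'.
query = cqp_token(r"your|yo\*ur%", text_wgr="female", text_largeregion="Central")
print(query)
```

The resulting string can be pasted into the Advanced search box. Building the expression in code is mainly useful when the list of variants is long or when the same search is repeated with different parameter values.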

An example of the information shown in the sidebar:

writer: Henry Stewart, 1st Lord Methven
addressee: Mary of Lorraine, Queen Dowager
year: 1544
date: 1544 November 25
description: unspecified
writer’s origin (region): Perthshire
larger region: Central
place of writing: unspecified
gender/rank of writer: male
gender/rank of addressee: royal
hand (primary): autograph
script type (primary): secretary
hand (secondary): information unavailable
script type (secondary): information unavailable
number of words: 349
text id: 95
file name: 1LMethven5441125
NLS/NRS catalogue number: NRS SP2/2
previous editions: previously edited by Annie I. Cameron in the Correspondence of Mary of Lorraine, 93
transcription information: a copy in the CSC archive

Please note that the abbreviation CSC here refers to the compiler’s archive of MS copies deposited among the VARIENG papers in Metsätalo (basement), Unioninkatu 40, 00014 University of Helsinki.

Challenges of historical data

The degree of variation is very high in texts representing sixteenth-, seventeenth-, and early eighteenth-century Scottish English, especially as numerous writers of early epistolary prose are untrained or quite inexperienced. This high degree of variation is reflected in morphology and syntax as well as spelling. As illustrated in the ScotsCorr Manual and in the online articles by Meurman-Solin in Studies in Variation, Contacts and Change in English, volume 14, the degree of variation is considerably higher in the original manuscripts on which the ScotsCorr corpus is based than in the previously published editions that Scots dictionaries usually draw on.

Therefore, rather than search by individual tokens, you are advised to create a list of the spelling variants of a particular word, including morphological variants when appropriate. This is possible by browsing through the ScotsCorr Word-list available in Extended search (see section Word-list below).

As described in the ScotsCorr Manual and ScotsCorr Symbols and Comments, in the digitised transcripts of the original manuscripts some word-forms may have comments of various kinds attached to them. These alert the user to problems of validity caused by a particular occurrence of the search word being ambiguous, blurred, or partly damaged in the manuscript. You should browse through such commented items to assess their relevance and validity as data in the study of a particular research topic.

Expansions of contracted word-forms are explicitly indicated in ScotsCorr, and ‘the contracted part’ is put between the symbols *%. For example, the contracted variant of your is frequently yo~ in the manuscript, ~ indicating a flourish, and this variant is transcribed as yo*ur%. The element *ur% is used as an emic representation of all the possible variant realisations that the flourish could be a ‘substitute’ for. The use of a fixed representation is considered variety-neutral and makes the data retrieval process smoother, but you will have to keep the full word-forms and the contracted forms in separate categories, never forgetting to make a distinction between them in their linguistic analysis. There is no linguistic justification for selecting a particular expansion rather than some other variant; the choice is merely based on such pragmatic concerns as retrievability (for further information, see the ScotsCorr Manual 4.3.4).
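When you analyse search results outside Korp, the *…% notation makes it straightforward to keep full and contracted forms apart. The following Python sketch is one possible way of doing so; the function names are illustrative and not part of Korp or of the ScotsCorr documentation.

```python
import re

# An expansion is the text between * and %, e.g. *ur% in yo*ur%.
EXPANSION = re.compile(r"\*([^*%]*)%")


def is_contracted(token):
    """True if the token contains at least one marked expansion."""
    return EXPANSION.search(token) is not None


def expanded_parts(token):
    """Return the expanded parts of a token, e.g. ['ur'] for yo*ur%."""
    return EXPANSION.findall(token)


def without_markers(token):
    """Remove the * and % markers but keep the expansion text.

    Useful only for grouping variants under one headword; the contracted
    and full forms should still be analysed as separate categories.
    """
    return EXPANSION.sub(r"\1", token)


for form in ["your", "yo*ur%", "*con%sider"]:
    print(form, is_contracted(form), expanded_parts(form), without_markers(form))
```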

Please note that in the Korp version of ScotsCorr, the symbol for a word-final flourish or loop is the straight typewriter quotation mark " instead of the curly typographic quotation mark ”, so that it is easier to type in searches. In contrast, the apostrophe is the curly one ’ (closing single quotation mark). You have a few options for entering it in a Korp search, depending on your operating system and keyboard layout. On Windows, you can type Alt+0146 (hold down the Alt key while typing 0 1 4 6 on the number pad), whereas on a Mac, you can use Option+Shift+]. Alternatively, you can find such an apostrophe elsewhere (for example, here: ’), select and copy it, and paste it into the Korp search.
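If you prepare search strings in a script or a text file, the two conventions described above can be applied automatically. The sketch below is an assumption about a convenient workflow rather than a feature of Korp: it replaces a straight apostrophe with the curly apostrophe (U+2019) and a curly closing double quotation mark (U+201D) with the straight one. The function name to_korp_search_form is hypothetical.

```python
CURLY_APOSTROPHE = "\u2019"    # ’ as used for the apostrophe in ScotsCorr in Korp
CURLY_DOUBLE_QUOTE = "\u201d"  # ” not used for the flourish in the Korp version
STRAIGHT_DOUBLE_QUOTE = '"'    # symbol for a word-final flourish in Korp


def to_korp_search_form(text):
    """Convert apostrophes and flourish symbols to the Korp conventions."""
    return (
        text.replace("'", CURLY_APOSTROPHE)
            .replace(CURLY_DOUBLE_QUOTE, STRAIGHT_DOUBLE_QUOTE)
    )


# Prints the curly apostrophe, ready to be copied into a Korp search.
print(to_korp_search_form("'"))
```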

How to search

Three search types are available in Korp: Simple, Extended, and Advanced. The type of search is selected by clicking on the respective tab above the search box. Basically, Simple search finds individual tokens, Extended search makes it possible to restrict the search based on metadata variables and allow multiple variants for tokens, and Advanced search permits writing a CQP query directly. For ScotsCorr, the Extended search may often be the most relevant because it permits choosing the word forms to be searched from the Word-list (see section Word-list below).

In all the search types, you perform the search by clicking the “Search” button below the search criteria.

The following subsections outline briefly the features in Simple and Extended search. For Advanced search, please consult the Korp advanced search guide, and for further help on searching with Korp, the Korp user guide.

Simple search

In Simple search, you write the word or words for which you wish to search in the Search box directly below the search tabs. (Please note that the auto-completed words suggested do not currently work for ScotsCorr.) You may search for multiple strictly consecutive words by separating the words with a space. To search for words beginning or ending with the given string, tick the box in front of “initial part” or “final part”, respectively. You may also wish to tick “case-insensitive” so that the search also finds words containing uppercase letters. In Simple search, it is not possible to constrain the search based on language-external variables.

Extended search

In Extended search, a search condition is composed of one or more “token boxes”, each of which corresponds to the search criteria for a single token in a sequence of tokens. Each search criterion has three parts: the attribute (or variable) whose value is to be tested, the test condition, and the value for the attribute. The attribute and test condition are selected from a list, whereas the value may be written in a text box or selected from a list, depending on the attribute. In ScotsCorr, the possible attributes are “word” and the various language-external variables for each letter, listed under “Text attributes” in the attribute list. (Constraints on language-external variables are added to token boxes in Korp, even though that may not be intuitive.) For attributes with a selection list for the value, the available test conditions are “is” and “is not”. Words and other attributes with a text box for the value also allow the test conditions “begins with”, “contains”, “ends with”, “regexp” (for “regular expression”), and “not regexp”. Regular expressions may be used to specify compactly e.g. alternatives and repetition within a word by using certain characters with a special meaning; please refer to the section on regular expressions in the Korp advanced search guide. The names of the test conditions are probably self-explanatory.

Additional search criteria for a token may be added by clicking the “or” (disjunctive criteria) or the plus icon (conjunctive criteria) at the lower left corner of the token box. From the cog menu at the lower right corner of the token box you may choose to repeat the same criteria for several consecutive tokens. (Since the letters in ScotsCorr are not divided into sentences, the other two options anchoring the token at the beginning or end of a sentence in the cog menu are not relevant.)

To search for multiple consecutive tokens (words), add a new token box by clicking the plus icon on the right of the token box and specify the search criteria. To allow any token between two tokens, keep the “<any word>” in the search box. Note that in ScotsCorr, a search in Extended search finds token sequences which may contain non-word tokens (tokens consisting solely of non-alphanumeric characters or comments enclosed in curly brackets) between tokens matching the criteria specified in the search boxes.

To constrain a search based on language-external variables (attributes), add “and” criteria specifying the desired attributes and their values to one of the token boxes.

Search results

The main search result type in Korp is a KWIC concordance. Before performing a search, you may specify how many hits you wish to see on a single page (default 25, maximum 1,000) and how to sort the concordance. The available sorting options are “not sorted” (occurrences in the order in which they are in the corpus data; the default), “matched word(s)” (sort by the occurrence itself), “left context” (sort by the three words preceding the match, from right to left), “right context” (sort by the three words following the match) and “random” (a random but fixed order). Note that the occurrences are sorted separately within each ScotsCorr subcorpus, not across all the subcorpora.
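The “left context” option can be made concrete with a small Python sketch. It sorts hits by the three words preceding the match, compared from right to left, and sorts each subcorpus separately, as described above. The data structure, the function names, and the plain string comparison are assumptions made for the example; Korp’s own collation rules may differ.

```python
def left_context_key(left_context, n=3):
    """Sort key: the last n words before the match, nearest word first."""
    return tuple(reversed(left_context[-n:]))


def sort_by_left_context(hits):
    """Sort hits within each subcorpus, keeping the subcorpus order.

    Each hit is a dict with the keys 'subcorpus' and 'left'
    (the list of words preceding the match).
    """
    groups = {}
    for hit in hits:
        groups.setdefault(hit["subcorpus"], []).append(hit)
    result = []
    for group in groups.values():
        result.extend(sorted(group, key=lambda h: left_context_key(h["left"])))
    return result


# Placeholder tokens (w1, w2, ...) stand in for real corpus words.
hits = [
    {"subcorpus": "A", "left": ["w1", "w2", "w9"]},
    {"subcorpus": "A", "left": ["w5", "w6", "w3"]},
    {"subcorpus": "B", "left": ["w7", "w8", "w1"]},
]
for hit in sort_by_left_context(hits):
    print(hit["subcorpus"], hit["left"])
```

Because the word nearest to the match is compared first, hits sharing the same immediately preceding word end up next to each other, which is usually what is wanted when looking for recurrent patterns before the search word.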

The KWIC concordance shows the tokens matching the search condition in a context of two lines before and after the line containing the hit in a letter in ScotsCorr. (The line breaks of the ScotsCorr letters are not explicitly shown in Korp, but the corpus data shows them as backslashes (\). If a line breaks within a word (a token containing a backslash), that word is regarded as part of the line containing the initial part of the word.) By clicking “Show context” above the concordance, you can see all the occurrences in the context of the whole letters in which the occurrences have been attested. The matching words are shown in bold in the concordance. Note that if a single letter contains multiple occurrences, the letter is shown separately for each occurrence.
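When you process tokens outside Korp, for example in a downloaded concordance, a word that is broken across two manuscript lines contains a backslash. The Python sketch below detects such tokens and removes the line-break marker so that the word can be compared with other variants. It assumes that a backslash inside a token marks nothing other than a line break; the function names and the example token are hypothetical.

```python
LINE_BREAK = "\\"  # a single backslash character


def breaks_across_lines(token):
    """True if the token is a word containing a line break."""
    return LINE_BREAK in token and token != LINE_BREAK


def without_line_breaks(token):
    """Remove line-break markers from inside a token."""
    return token.replace(LINE_BREAK, "")


# Invented example: 'consider' divided between two lines after 'con'.
token = "con\\sider"
print(breaks_across_lines(token), without_line_breaks(token))
```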

In addition to the KWIC concordance, there is another search result view available in ScotsCorr in Korp: Statistics. However, since the statistics are calculated based on the raw query results, with no possibility of choosing the occurrences relevant to a particular research question, it might be of little use for ScotsCorr. Moreover, the “Map” and “Name classification” result tabs are not functional for ScotsCorr.

Word-list

The main search tool designed for data retrieval in the ScotsCorr corpus is the Word-list in Extended search. Start a search for items relevant to a particular research question by choosing Extended search and clicking the list-resembling icon in the Search box. (If the list icon is not visible, try first choosing an attribute other than ‘word’ from the attribute list and then choosing ‘word’ again. This happens if you first use another corpus in Korp and then switch to ScotsCorr.) This Word-list comprises all the words, word-forms, independent comments, numerals, and punctuation marks occurring in the ScotsCorr corpus. Each entry is followed by a number in parentheses indicating the total absolute frequency of the item. (The total absolute frequency is calculated from the whole ScotsCorr corpus even when only some subcorpora are selected.) The items appear in the following order:

  • Words and word-forms with an initial lower- or upper-case character in alphabetical order, those with a symbol marking a flourish in word-initial position appearing after the full word-forms (e.g., variants of *con%sider after those of consider).
  • Words with an uncertain beginning: words and word-forms with an initial question-mark or question-marks in the order from 1 to 3 question-marks and, secondly, by the first fully legible part of a word or word-form in alphabetical order
  • Independent comments (comments in curly brackets without an arrowhead)
  • Numerals
  • Punctuation marks

To make the process of creating lists of selected words as smooth as possible, the Word-list has been divided into sections (e.g., items beginning with an a or A) by the first letter of the word. Words with an uncertain beginning, independent comments, numerals, and punctuation marks each form a section of their own. A section can be opened by clicking the section title or the blue triangle in front of it, and it can be closed by clicking again. The section title is followed by two numbers in the form “(n; f)”, where n is the number of words selected in the section and f the total absolute frequency of the selected words. You get an explanation for the numbers in a tooltip by hovering the mouse pointer over them.

Please note that sections in the word-list containing a large number of words may be slow to open. Keeping several large sections open at the same time slows down the word-list, so it is advisable to close a section before opening a new one. The largest sections (words beginning with C and S) contain approximately 4,500 words each.

A relevant item in the Word-list is selected by putting a tick in the square preceding the item. As a result, the item will appear in bold in the Word-list and, together with its frequency, in the “Selected words” list at the top of the Word-list. When further occurrences considered relevant are selected, they will appear in the Selected words list in the order in which they appear in the Word-list, not in the order in which they have been selected. During the selection process, information about the total number of selected items and their total absolute frequency is kept up-to-date.

Below the Selected words list are three buttons: “Done”, “Clear” and “Cancel”. Clicking “Done” will transfer the selected items to the Search box and close the Word-list. Korp automatically adds a vertical bar (|) between the different items and a backslash (\) before such special symbols as * and ?. Clicking “Clear” will clear the list in the Selected words box, whereas clicking “Cancel” will cancel the selections and close the Word-list, saving, however, those items that were in the Search box before you reopened the Word-list to start a new round of making selections.

When selecting more than one word in the Word-list, the search condition is automatically set to “regexp” and should not be changed for the search to work.
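The regular expression that the Word-list places in the Search box can also be assembled by hand, for instance when you keep your variant lists in a file. The following Python sketch joins the variants with vertical bars and puts a backslash before special symbols. The exact set of symbols that Korp escapes is not listed in this guide, so the set used here is an assumption, and the function names are hypothetical.

```python
# Assumed set of regular-expression special characters to escape.
SPECIAL = set("\\.*+?|()[]{}^$")


def escape_item(item):
    """Put a backslash before each special character in one word-form."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in item)


def wordlist_regexp(items):
    """Join the escaped word-forms with vertical bars."""
    return "|".join(escape_item(item) for item in items)


# 'your' and 'yo*ur%' are taken from this guide; '?our' is an invented
# example of a form with an uncertain beginning.
print(wordlist_regexp(["your", "yo*ur%", "?our"]))
```

The output can be pasted into the Search box of a token box whose test condition is set to “regexp”.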

If you first specify a search condition and then open the Word-list, all the words that match the search condition are selected in the Word-list. For example, if you select “word” “contains” commen, all the words containing commen are selected in the Word-list. If you then click the “Done” button, the search condition will change to “regexp” and the Search box will contain all the words listed separated by vertical bars, even if you did not change the selections. To retain the original search condition, for example, if you only wished to review via the Word-list which words match the search condition and what their frequencies are, you should click “Cancel”.

In creating lists of variant word-forms for a particular search, you should keep in mind that relevant items can be found in various sections and positions in the Word-list. For example, variants of the word change may occur in both section C and section S. In order to identify all the relevant word-forms, you may find it necessary to consult dictionaries of Scots, the online Dictionary of the Scots Language (DSL) in particular. (The book-like icon beside the word search box in the advanced search is a link to the DSL, and the upper part of the Word-list also contains a link to the DSL.) However, since the DSL is based on editions, which have usually resorted to some degree of normalisation (see the ScotsCorr Manual 4.2; Meurman-Solin 2013a), it does not cite all the variants attested in the diplomatically edited manuscripts of ScotsCorr. In the compiler’s experience, the dictionary will provide indispensable information about the main variants, but browsing through the relevant sections and subsections to find the rest of the variants will usually prove necessary. Even though the creation of lists of variants to be searched may be a somewhat time-consuming task, the benefits of having all the rich variety of authentic data as evidence can hardly be exaggerated.

We recommend a process of the following kind:

  • study the history of a particular word focusing on patterns of variation recorded in DSL
  • choose the option of extended search in ScotsCorr in Korp
  • open the Word-list
  • tick the relevant variants taking into account that they may appear in more than one particular section in the Word-list
  • create a concordance based on the selected variants in the Search box
  • browse through the concordance to check that all the instances are relevant for your research

To guarantee a smooth data retrieval process, it is advisable to resort to searching one linguistic feature at a time, in other words, occurrences of words and word-forms that share a particular linguistic property. This is recommended especially for searches focusing on high-frequency items or those that have a particularly high number of variants in the corpus.

As pointed out above, the Word-list provides information about the absolute frequency of each variant in the ScotsCorr corpus. This permits you to take frequency into account in the planning of your search. When you study a high-frequency item, it is advisable to restrict the size of a single search by such language-external variables as date (e.g., time-period) or gender/rank (e.g., male, female, royal) or by linguistic criteria (e.g., morphological properties or word-form properties such as full forms versus contracted forms).

Analysing the concordance

The concordance view in Korp is static: you cannot edit it. To be able to exclude occurrences that are not examples of the word or feature being studied, you need to download the concordance as a file onto your own computer and then edit the file. The concordance can be downloaded by choosing the desired format from the selection list below the concordance and then pressing the “Download KWIC” button.

Currently, the most suitable formats for many ScotsCorr users may be “Sentence per row, match and contexts separated” combined with “Excel (XLS)”: it provides an Excel spreadsheet with one row for each KWIC concordance row. (The first row contains column headings.) The columns of the spreadsheet contain the number of the hit (starting from 0), (sub)corpus identifier, words in the left context of the match, the matched word(s), words in the right context of the match, and a column for each of the language-external variables (text attributes) listed above, labelled with Korp attribute identifiers. Each row also contains information on the whole query.
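A downloaded concordance can also be filtered with a short script instead of a spreadsheet program. The Python sketch below assumes that the concordance has been saved as tab-separated text with the column headings in the first row. The column heading match and the sample rows are invented for the example; only the text attribute identifiers (text_wgr, text_year) come from the table of parameters above, so check the headings in your own file.

```python
import csv
import io


def keep_rows(rows, **wanted):
    """Keep the rows whose columns have exactly the wanted values."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in wanted.items())
    ]


# Invented sample data standing in for a downloaded file.
sample = (
    "match\ttext_wgr\ttext_year\n"
    "your\tfemale\t1612\n"
    "yo*ur%\tmale\t1544\n"
    'abc"\tfemale\t1701\n'
)

# QUOTE_NONE: the straight quotation mark is a flourish symbol in ScotsCorr,
# so it must be read as an ordinary character, not as a field delimiter.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

female_rows = keep_rows(rows, text_wgr="female")
for row in female_rows:
    print(row["match"], row["text_year"])
```

For a real file, replace io.StringIO(sample) with an opened file object, for example open(path, newline="", encoding="utf-8"), where the path and the encoding depend on how the file was saved.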

It is at this stage that you can also analyse all the occurrences that have a word-related comment in curly brackets with a left or right arrowhead attached to them. These word-related comments appear as the first piece of information in the column on the right side of the concordance view and function as attributes.

For detailed information about these word-related attributes, you should consult the ScotsCorr manual and the document ScotsCorr Symbols and Comments. The attributes usually provide information about problems affecting legibility, such as a particular character or part of the manuscript letter being blurred, damaged, or torn, this information permitting you to decide whether the commented occurrence can be considered valid data in a particular study.

Similarly, information about language-external parameters in the info box on the right-hand side of the concordance view permits you to decide whether to include or exclude particular occurrences by language-external properties.

Contact

The Language Bank's technical support:
kielipankki (at) csc.fi
tel. +358 9 4572001

Requests related to language resources:
fin-clarin (at) helsinki.fi
tel. +358 29 4140599 / +358 29 4129317