klk-fi: Notes for the user

The corpus concerned: The Finnish Sub-corpus of the Newspaper and Periodical Corpus of the National Library of Finland version 2, Korp

The klk-fi-v2 data contains a number of inconsistencies that were noticed while preparing the data for publication in Korp.

  1. 378,659 issues (of 1,406,104) of 361 publications (of 4,207) have a title of the form MF + number, e.g. MF64602 (5,464 distinct titles of this form). In these cases, the attributes publ_title, issue_title and label all contain this “title”, so the user needs to use the publication id instead of the title to search all issues of such a publication. The affected publications include major newspapers such as Helsingin Sanomat: all of its issues from 1930–1945 have such a title, as do some issues from 1923, 1926, 1928 and 2017.
  2. The publ_title attribute of some issues of Etelä-Suomen Sanomat (2014 and 2015) and Itä-Häme (2015) contains a date and issue number, which this attribute should not normally contain. The issue_title of these issues is “Omaan kotiin”, which differs from publ_title. The OAI-PMH API only gives Etelä-Suomen Sanomat as the title, probably based on the ISSN.
  3. The fk + number publication ids sometimes have uppercase FK, so they are considered different ids: for example Filmiaitta has 121 issues with fk00275 and 75 issues with FK00275.
  4. Some publication ids (ISSNs or fk numbers) cover issues with clearly different publication titles: e.g., 0355-6913 has both Aamulehti (13,885 issues) and Hämeen Sanomat (12). At least in this case, the title Hämeen Sanomat seems to be incorrect and should be Aamulehti.
  5. Some titles have typos and inconsistent spelling or capitalization; e.g. Helginsin Sanomat (6 issues), Helsigin Sanomat (11), Helsinginsanomat (558), Helsingin sanomat (1).
  6. 4 issues with 3 different publication ids have the title Unknown. Based on the publication ids, these could be Finsk Tidskrift (2 issues), Apu (1) and probably a government report series (ISSN 0784-5367) (1).
  7. The same title may occur with and without a trailing full stop or (a space and) a slash: e.g. Uudenkaupungin Sanomat (1,335 issues with a full stop, 2,917 without) and Idän tähti (3 issues with a slash, 25 without).
  8. Sometimes a title contains a colon (separating a subtitle) that is preceded but not followed by a space; e.g. Apu :ajanvietelukemisto. The same publication may have some issues with the space after the colon and some without.
  9. Some titles contain the literal string “\u000d\u000a”, apparently marking a line break; e.g. Käsityö- ja teollisuuslehti : Suomen teollisuusvaliokunnan \u000d\u000aäänenkannattaja.
  10. One issue of Åbo Underrättelser from 1882 has its ISSN 0785-398X as the title.
  11. The 42 issues of Lasten-lehti have publication id “host” instead of the correct ISSN 0355-8320.
  12. Sometimes a title includes a subtitle or publisher information: the same publication may have issues with and without this information, e.g. Finsk Tidskrift (261 issues), Finsk tidskrift :kultur, ekonomi, politik / (13), Finsk tidskrift : kultur, ekonomi, politik / Föreningen Granskaren. (1).
  13. Three titles have a double space in the middle: Historiallinen Aikakauskirja (4 issues), Uusi Suometar (6), Haminan Lehti (15).
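
Two of the inconsistencies above (items 1 and 3) can be worked around when post-processing search results. The following is a minimal Python sketch, not part of the corpus tooling: the function names are hypothetical, and it assumes that a placeholder title consists of MF followed only by digits and that an fk id consists of fk or FK followed only by digits.

```python
import re

# Assumption: a placeholder title is "MF" followed only by digits, e.g. MF64602.
MF_TITLE = re.compile(r"MF\d+")

# Assumption: an fk id is "fk" in any letter case followed only by digits.
FK_ID = re.compile(r"[Ff][Kk]\d+")


def is_placeholder_title(title: str) -> bool:
    """Return True if the title looks like a placeholder of the form MF + number."""
    return MF_TITLE.fullmatch(title.strip()) is not None


def normalize_publ_id(publ_id: str) -> str:
    """Lowercase fk + number ids; leave other ids (such as ISSNs) unchanged."""
    if FK_ID.fullmatch(publ_id):
        return publ_id.lower()
    return publ_id
```

With these helpers, FK00275 and fk00275 map to the same id, while an ISSN such as 0785-398X keeps its uppercase check character.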

Some of the differences in titles may reflect actual differences in the titles of the original issues, but not all of them.
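
For grouping issues whose titles differ only in the surface details described in items 5, 7, 8, 9 and 13, one option is to compare titles by a normalized key. The sketch below is only an illustration under stated assumptions: title_key is a hypothetical helper, and the choice of which differences to ignore is ours, not something defined by the corpus.

```python
import re


def title_key(title: str) -> str:
    """Return a normalized key for grouping title variants (hypothetical helper)."""
    # Replace the literal escape string "\u000d\u000a" (item 9) with a space.
    t = title.replace("\\u000d\\u000a", " ")
    # Put exactly one space on each side of a colon (item 8).
    t = re.sub(r"\s*:\s*", " : ", t)
    # Collapse runs of whitespace, including double spaces (item 13).
    t = re.sub(r"\s+", " ", t).strip()
    # Drop one trailing full stop or slash, with any preceding space (item 7).
    t = re.sub(r"\s*[./]$", "", t)
    # Ignore capitalization differences (item 5).
    return t.casefold()
```

The key is meant for comparison only, not for display. It does not correct typos such as Helsigin Sanomat, it does not strip subtitles or publisher information (item 12), and it may merge variants that reflect real differences in the original issues.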

The inconsistencies listed above originate from the source data. In addition, contrary to information given elsewhere, the data contains 633 texts (pages) in 102 issues that do not contain any sentence identified as Finnish; this is due to a mistake in processing the data.


Last updated: 10.10.2023

This page has a persistent identifier: http://urn.fi/urn:nbn:fi:lb-2023101001
