<TITLE: Parallel Corpora of Literary Texts
ACADEMIC DOMAIN: humanities
DISCIPLINE: translation studies
EVENT TYPE: doctoral defence presentation
FILE ID: UDEFP050
NOTES: continued in UDEFD050; mostly read from notes

RECORDING DURATION: 18 min 14 sec

RECORDING DATE: 27.9.2003

NUMBER OF PARTICIPANTS: unknown

NUMBER OF SPEAKERS: 1

S1: NATIVE-SPEAKER STATUS: Russian; ACADEMIC ROLE: junior staff; GENDER: male; AGE: 31-50>


<S1> <FOREIGN> kustos </FOREIGN> my esteemed opponent ladies and gentlemen , according to the university rules an introductory lecture should be within the field of the dissertation but shouldn't present the findings or the research work itself , so in my lecture i i'm not going to retell the whole dissertation i'd rather concentrate just on one question raised in it , how parallel corpora of literary texts can be used , corpus based research is very popular nowadays , corpora are used as a source of empirical data for grammarians and lexicographers and as a test field for theoretical models , use of corpora speed up research considerably use of computer , makes it possible to obtain data which was almost impossible to obtain using traditional methods , monolingual corpora are more or less common nowadays there is more or less stable tradition and er it is more or less clear how to use them , the situation with multilingual and parallel corpora is much worse not many language pairs are represented , corpora are relatively small in size and implications of multilingual corpora are not as clear as it seems to be , i will talk about parallel corpora by which i understand collections of original texts in one language and translations of these texts into another language , i shall use an exa- as an example parrus a russian-finnish parallel corpus of literary texts , which is one of the practical results of my dissertation , nowadays parallel corpora are considered as a very valuable source of data for high technologies like machine translation or translation memory systems , i will not speak about these applications of parallel corpora because what high technologies need are corpora of special texts and parrus is a collection of er literary texts and is not likely to suit for these purposes , parallel literary texts may be rather used in humanities lexicography linguistics translation studies cultural studies et cetera mhm , when my research was in the initial phase i 
had an illusion that a parallel corpus is a very valuable source of data er source of data for dictionary compiling it would be possible to obtain parallel concordances and to find out how words are translated in real life this should be the way to get rid of wrong equivalents and to find better equivalents , however parallel corpora do not seem to be as useful in lexicography as monolingual corpora are <NAME> in her recent book gives quite a long list of corpora used for dictionary compiling but there is not a single parallel corpus mentioned , the problem is that text corpora for lexicographical purposes should be very large , like BNC or bank of english , ELRA association mentions only one parallel corpus which can be regarded as large it is german-french reciprocal parallel corpus its size er is 15 million words per language however it is significantly smaller than a large per- mo- mon- monolingual corpus for example the size of BNC is 100 million running words , unfortunately text corpora of just several million running words in size are not very helpful for lexicograph- lexicographers for example the size of parrus is about 2.2 million running words per language the russian subcorpus produced a word list of about 42,000 words only 30,000 occurred ten times or more and the finnish subcorpus around 31,000 11,000 occurred ten times or more this means that parallel corpora like parrus can be used only for compiling basic dictionaries not more than 15-25,000 words in size and it is not easy to collect a (xx) parallel corpus there are certain technical issues to solve and besides it is difficult to get parallel texts for certain language pairs fields text types so it seems that it is possible to collect large parallel corpora only for so-called major languages which produce texts in immense numbers and many of these texts are translated into other languages one would expect that parallel corpora provide a researcher with good translation equivalents of course in 
many cases it's quite easy to monitor which equivalents are preferred by translators . for example russian-finnish dictionary by <NAME> and <NAME> suggests the following five finnish equivalents for the russian word <FOREIGN> prostor </FOREIGN> , in the meaning expanse vast space it is <FOREIGN> lakeus aukea aava avaruus </FOREIGN> , in the meaning freedom it is <FOREIGN> vapaus </FOREIGN> . er the word <FOREIGN> prostor </FOREIGN> occurs in parrus 45 times there are 19 different finnish equivalents for this word ten of which are used more than once . er it is interesting that the most frequently used equivalent is <FOREIGN> tila </FOREIGN> or its derivatives which is not even mentioned in the dictionary <FOREIGN> vnutrennye komnaty svintsitskikh byli zagormazhdeny (xx) vynesennemi iz gostini i zala dlya bolshego prostora </FOREIGN> <FOREIGN> svintsitskin sisähuoneet olivat täpötäynnä huonekaluja ja ise- esineitä jotka oli kannettu vierashuoneesta ja isosta salista jotta niihin tulisi tarpeeksi tilaa </FOREIGN> the word <FOREIGN> tila </FOREIGN> doesn't exactly match the meaning of <FOREIGN> prostor </FOREIGN> it doesn't have in its semantics freedom component which is essential for the russian word <FOREIGN> tila </FOREIGN> means only space , still in some contexts or this equivalent might be useful and should be at least mentioned , the next common equivalent is <FOREIGN> väljä väljyys </FOREIGN> and some derivatives <FOREIGN> gorodok byl nevelik s lyubogo mesta v nem tut zhe za povorotom otkryvalos khmuraya step , temnoe nebo prostory voiny prostory revolyutsii </FOREIGN> <FOREIGN> kaupunki oli pieni sen joka paikasta levittäytyivät näkyviin synkkä aro tumma taivas sodan väljyydet vallankumouksen väljyydet </FOREIGN> the semantics of this word is much closer to <FOREIGN> prostor </FOREIGN> the <FOREIGN> suomen kielen perussanakirja </FOREIGN> defines <FOREIGN> väljä </FOREIGN> as <FOREIGN> liikkumatila mahdollisuuksia tulkinnanvara salliva vapaa </FOREIGN> but it should be stated 
that the translation in above mentioned example is too literal and doesn't sound well , many of the equivalents from the parallel concordance do not exactly match the russian word they can be broader or narrower in their meanings , it is easy to see from the concordance that equivalents from parallel corpora belong to their contexts and some of them cannot be used as dictionary equivalents for example one of the parrus equivalents for <FOREIGN> prostor </FOREIGN> is <FOREIGN> maailma </FOREIGN> another thing which is easy to notice is influence of dictionaries , dictionaries influence translators and sometimes make them use unsuitable or even wrong equivalents . er for example most russian finnish dictionaries suggest <FOREIGN> ylioppilas </FOREIGN> and <FOREIGN> opiskelija </FOREIGN> as equivalents for the russian word <FOREIGN> student </FOREIGN> finnish word <FOREIGN> ylioppilas </FOREIGN> did indeed mean university student in the 19th and beginning of the 20th century but in modern finnish it is used in most cases in a quite different meaning high school graduate , and sometimes as part of a title for example <FOREIGN> humanistisen tiedekunnan ylioppilas </FOREIGN> or in idioms like <FOREIGN> ikuinen ylioppilas </FOREIGN> but quite occasionally in the meaning university student <FOREIGN> opiskelija </FOREIGN> looks much more suitable as an eqi- equivalent for <FOREIGN> student </FOREIGN> although its meaning is broader however data from parrus shows that translators prefer <FOREIGN> ylioppilas </FOREIGN> <FOREIGN> a khotite vot takikh (xx) studenta vospitivat po golovke gladyat po golovke </FOREIGN> <FOREIGN> tuollaisiako te olette hän tönäisi ylioppilasta kasvatetaan silitellään päätä </FOREIGN> the reason the reasons of this dominance of the word <FOREIGN> ylioppilas </FOREIGN> in the language of translations can be as follows , first some of the russian texts are 19th century classics many examples come from dostoyevsky's crime and punishment where raskolnikov is 
<FOREIGN> ylioppilas </FOREIGN> er another reason might be that translators want their translations look as translations , or that the third and <SIC> quitely </SIC> probable might be that translators just follow the dictionary so although parallel corpora can be useful as in bilingual lexicography they should be considered only as one of the sources of raw lexicographical data which needs further checking from other sources like monolingual text corpora or other types of multilingual corpora , besides some dictionary equivalents might be totally unsuitable for translation and vice versa many translation equivalents from the corpora cannot be used in bilingual dictionaries , anyway translators translate whole texts not separate words so translation equivalent is to certain extent a relative concept , still it becomes possible to speed up some phases of lexicographical work , for example simple bilingual glossaries can be extracted from aligned parallel texts the methods of extraction are in most cases based on mutual information (xx) or on string similarity , er these methods work quite well on english-french german-french english-swedish german-swedish and some other pairs of related languages , i checked if it is possible to use the similar methods on russian-finnish texts as well , although results are not very encouraging only about 2000 equivalent pairs were found and the search was performed on a list of about 7000 words still it is possible to use the method even with our unusual pair of languages . 
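Editorial illustration (not part of the recording): the glossary extraction mentioned above is described only as being based on mutual information or string similarity. The following is a minimal sketch of the mutual-information variant, assuming the texts are already aligned segment by segment and tokenised. The function name `pmi_glossary`, the `min_cooccurrence` threshold and the four toy segment pairs are hypothetical and are not taken from ParRus or from the dissertation.

```python
import math
from collections import Counter


def pmi_glossary(aligned_pairs, min_cooccurrence=2):
    """Pick, for each source word, the target word with the highest
    pointwise mutual information (PMI) over aligned segments.

    aligned_pairs: list of (source_tokens, target_tokens), one per segment.
    Returns {source_word: (target_word, pmi)}.
    """
    n = len(aligned_pairs)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        # Count each word once per segment (presence, not frequency).
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update((s, t) for s in src_set for t in tgt_set)

    best = {}
    for (s, t), c in pair_count.items():
        if c < min_cooccurrence:
            continue  # too rare to trust; PMI overrates rare pairs
        # PMI = log2( P(s, t) / (P(s) * P(t)) ), probabilities over segments.
        pmi = math.log2(c * n / (src_count[s] * tgt_count[t]))
        if s not in best or pmi > best[s][1]:
            best[s] = (t, pmi)
    return best


# Invented toy data: transliterated Russian segments and Finnish segments.
toy_pairs = [
    (["dom", "bolshoi"], ["talo", "suuri"]),
    (["dom", "staryi"], ["talo", "vanha"]),
    (["step", "bolshoi"], ["aro", "suuri"]),
    (["step", "staryi"], ["aro", "vanha"]),
]
glossary = pmi_glossary(toy_pairs)
```

On inflected languages such as Russian and Finnish a sketch like this would need lemmatisation or stemming first, since each word form is counted as a separate item; that is one plausible reason why such methods yield fewer pairs than on English-French material.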
parallel corpora of literary texts are more likely to be used for studying translation process strategies which translators use , their attitude to the source text et cetera , parallel corpora are used for training of translators and as a source of data for translators although the latter sometimes prefer original data rather than translations er translations by other translators , another important application of parallel corpora is to study relations between source and target texts what is lost in translation , what are the reasons of these losses , is it possible to minimise some of them , it also becomes possible to study influence of target language on source language , the parrus corpus is untagged which limited my research with a tagged corpus it would have been possible to study what forms are used for translating certain grammatical forms of source language what are the reasons of over or under-representing of certain feature et cetera an untagged corpus can serve these purposes as well but the amount of manual work increases considerably , which makes (on the) case studies possible only formally definable features can be taken into account graphical word sentence letter , punctuation mark et cetera , for studying language of translations it is important to compare translated texts to texts initially written in the same language , that's why i needed original finnish texts for my research the texts used for this purpose were original literary finnish texts from the savonlinna corpus of translated finnish , now i'll tell about some typical features revealed during the research , it was discovered that translators try to keep in their translations paragraphs of the original text whenever it is possible , numbers of paragraphs in original russian texts and their finnish translations are with some exceptions quite close , sometimes they almost coincide , this tendency made it possible to develop text aligner which works on paragraph level it works more or less 
smoothly using only paragraph lengths . er it was also found that even commas colons and semicolons are often left intact you can see that the frequencies of question mark in parrus originals and translations are very close . er influence on punctuation of the source texts makes punctuation of translations from russian into finnish different than that of original finnish texts , only comma statistics do not seem to be significantly different in er original and translated texts all the rest differ dramatically , this phenomenon is most likely connected with influence of the syntax of the source language it seems that translator does his or her best to translate texts sentence by sentence whenever it is possible and to use similar syntactic constructions as you can see in the example on the screen . er differences in frequencies of some common words are also quite likely to have something to do with influence of source language , for example nouns <FOREIGN> mies </FOREIGN> and <FOREIGN> nainen </FOREIGN> occur in original texts more often than in translations while <FOREIGN> ihminen </FOREIGN> is more typical for translations , the reason might be in total absence of grammatical gender in finnish , in original finnish texts er sexless words like <FOREIGN> ihminen </FOREIGN> and personal pronoun <FOREIGN> hän </FOREIGN> er may some- er sometimes cause problems , er that er for example person's sex remains remains unknown or reference becomes unclear er so the translator has to take care in while translating , some examples <FOREIGN> pribezhal uzhitel molodoi eshche chelovek uvzhaemyi v derevne </FOREIGN> <FOREIGN> opettaja nuori mies arvostettu kylässä tule tuli juosten </FOREIGN> in this example the translator had to translate the russian word <FOREIGN> chelovek </FOREIGN> by <FOREIGN> mies </FOREIGN> so that the finnish reader would know that the teacher was a man not a woman . 
<FOREIGN> eto nizko vozmutilsya volon vy chelovek bednyi ved vy chelovek bednyi , bufetchik vtyanul golovu v plechi tak chto stalo vidno chto on chelovek bednyi </FOREIGN> <FOREIGN> sehän on alhaista (xx) kiihtyi te köyhä ihminen olettehan te köyhä ihminen kahvilanpitäjä veti päätään hartioiden väliin niin että selvästi huomasi hänen olevan köyhän ihmisen </FOREIGN> here the translator uses the word <FOREIGN> ihminen </FOREIGN> and it is impossible at least from this context to know whether the bartender was a man or a woman , so russian source text makes translator use pronouns or words with no reference to gender more often and this may make translated texts less readable than original finnish texts . er adjective <FOREIGN> pieni </FOREIGN> is more common in for original finnish texts than for translations from russian into finnish , it is also surprising that finnish word er <FOREIGN> pieni </FOREIGN> occurs in finnish translations more often than its russian equivalent <FOREIGN> malenkii </FOREIGN> in russian source texts the frequency of the russian adjective is lower because there is another word <FOREIGN> nebolshoi </FOREIGN> which is quite close in its meaning to <FOREIGN> malenkii </FOREIGN> this adjective can be also translated into finnish with the adjective <FOREIGN> pieni </FOREIGN> er that's why the frequency of <FOREIGN> pieni </FOREIGN> is higher than that of <FOREIGN> malenkii </FOREIGN> . 
<FOREIGN> (xx) vzdrognul obernulsya i uvidel za soboi kakogo-to nebolshogo tolstyaka , kak pokazalsya koshyachei fizionomiei </FOREIGN> <FOREIGN> (xx) säpsähti kääntyi ja näki edessään jonkun pienen ja paksun olennon jolla oli kissan naama </FOREIGN> diminutives like <FOREIGN> dom domik stol stolik </FOREIGN> are quite typical for the russian language use of this use of these derivatives also lessens the need for the word <FOREIGN> malenkii </FOREIGN> although one can certainly say <FOREIGN> malenkii domik </FOREIGN> in some cases er russian diminutives are translated with the help of adjective <FOREIGN> pieni </FOREIGN> sometimes information about the size and speaker's attitude to the subject is omitted , <FOREIGN> ya poedu , starichok pristuknul sukhim kulachkom po stolu </FOREIGN> <FOREIGN> minäkö minä lähden ukko kopsautti kuivan pienen nyrkkinsä pöytään </FOREIGN> here diminutive is translated , ah <FOREIGN> i verno (xx) u menya tolko chto ne rai berezka znaesh lipov tsvetu pchelki bzhzh </FOREIGN> <FOREIGN> minun mehiläistarhani er se on melkein kuin paratiisi koivut lehtii ja lehmukset kukkii ja mehiläiset surisee </FOREIGN> here diminutive is left untranslated , it's quite easy to find other traces of influence of the source language er on the target language parallel corpus study makes it possible to define what kind of interferences are typical for a given pair of languages (what for) is it possible to provide translations with no imprint of the source language no it isn't , however if translator knows how the given source language influences the given ta- the given target language he or she might avoid unnecessary interference or emphasise some kinds of interference if needed so it looks that parallel corpora of literary texts are more important for studying translations and language of translations rather than for dictionary compiling i now call upon you professor <NAME S3> opponent appointed by the faculty of humanities to present whatever critical comments you consider 
my dissertation calls for </S1>
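
Editorial illustration (not part of the recording): the presentation mentions a paragraph-level text aligner that works from paragraph lengths alone, without describing its algorithm. The sketch below shows one common way such an aligner could be built, using dynamic programming over length-based costs in the general spirit of length-based alignment. It is not the speaker's implementation. The function name `align_by_length`, the bead shapes and the penalty values are assumptions chosen for illustration, and both inputs are assumed to be non-empty lists of positive lengths.

```python
def align_by_length(src_lens, tgt_lens, skip_penalty=3.0, merge_penalty=1.0):
    """Align two paragraph sequences using only paragraph lengths.

    src_lens, tgt_lens: paragraph lengths (e.g. in characters).
    Returns a list of beads (source_indices, target_indices).
    """
    # Scale target lengths so that both texts have the same total length.
    ratio = sum(src_lens) / sum(tgt_lens)

    def mismatch(src_part, tgt_part):
        # Relative length difference of a bead; 0.0 when lengths agree.
        s_len, t_len = sum(src_part), sum(tgt_part) * ratio
        return abs(s_len - t_len) / max(s_len, t_len)

    # Bead shapes: (source paragraphs, target paragraphs, fixed penalty).
    shapes = [(1, 1, 0.0), (1, 0, skip_penalty), (0, 1, skip_penalty),
              (2, 1, merge_penalty), (1, 2, merge_penalty)]

    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0

    # Forward dynamic programme: best[i][j] is the cheapest way to align
    # the first i source paragraphs with the first j target paragraphs.
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj, penalty in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                cost = penalty
                if di and dj:
                    cost += mismatch(src_lens[i:ni], tgt_lens[j:nj])
                if best[i][j] + cost < best[ni][nj]:
                    best[ni][nj] = best[i][j] + cost
                    back[ni][nj] = (i, j)

    # Trace the cheapest path back from the end of both texts.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Invented lengths: the translator merged the first two paragraphs.
example = align_by_length([100, 120, 80], [220, 80])
```

Because translators tend to preserve paragraph boundaries, most beads in such an alignment are expected to be one-to-one, so the penalties for merged or skipped paragraphs can be set fairly high.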
