<TITLE: Parallel Corpora of Literary Texts
ACADEMIC DOMAIN: humanities
DISCIPLINE: translation studies
EVENT TYPE: doctoral defence discussion
FILE ID: UDEFD050
NOTES: continuation of UDEFP050

RECORDING DURATION: 108 min 16 sec

RECORDING DATE: 27.9.2003

NUMBER OF PARTICIPANTS: unknown

NUMBER OF SPEAKERS: 3

S1: NATIVE-SPEAKER STATUS: Russian; ACADEMIC ROLE: junior staff; GENDER: male; AGE: 31-50

S2: NATIVE-SPEAKER STATUS: Finnish; ACADEMIC ROLE: senior staff; GENDER: male; AGE: 51-over

S3: NATIVE-SPEAKER STATUS: Swedish; ACADEMIC ROLE: senior staff; GENDER: male; AGE: 31-50>



<PRESENTATION UDEFP050 by S1>

<S3> okay so i have to <S1> [mhm-hm] </S1> [switch] screens then <SETTING UP EQUIPMENT, P:06> ah <S1> mhm-hm  </S1> there we are , er <COUGH> so <P:06> it's er relevant to the things i'm going to say right now and also to the questions that i will have that i am , well my background is relevant to this so i think that that's one reason i put it up here and another reason is that er it has to go somewhere NLP means natural language processing in in swedish it's <FOREIGN> språkvetenskaplig databehandling </FOREIGN> and er my my background is in NLP in computational linguistics in corpus linguistics from that point of view and er that has to a great extent informed the way i read your dissertation and it certainly has to a great extent influenced what i will and will not say and ask about it er . and i'll start by putting your work into a kind of landscape in into the landscape of corpus linguistics because i i think actually that it will help everybody here to understand what it's all about you can't see this really well i guess [maybe i should] </S3>
<S1> [i can] @@ </S1>
<S3> okay , er . it's er this this is er a figure that i made a few years back when i myself had to think about these matters it was prompted by the circumstance that when when you read things that say they belong in the area of corpus linguistics you find that they can be very very different address very very different issues and be written by people from very very different backgrounds and and just to bring order into this area for myself i i made this figure i've i've also put it into something that i wrote er at about that time so so this er figure oval in the middle is supposed to be corpus linguistics , what what you find is actually that corpus linguistics is er something that has two ends so to speak , it's er i- it it it's something that belongs squarely in a kind of applied empirical or well applied is the wrong word but in in a kind of empirical linguistics this would be the right or if if i look at it it's the right hand side of this figure but it's also something that belongs not even in linguistics at all you might say but rather in computer science and that's the left end of that same oval , er and this er , er the the left end of the oval or the left end of the continuum that corpus linguistics is is often called empirical natural language processing and it is is often er done by computer scientists and and similar people with a similar background whereas the the right hand end of this continuum is is pursued by people working in languages in language departments , er and and and this er accounts for some conflicts that we find in in in this area taken as a whole . er . 
and er and er yeah one more thing before i before i go on to the next slide er my la- to to make it clear my background is somewhere in the left hand side of this this er oval i'm i i'm i come from computational linguistics but i guess one of the reasons that i was asked to be an opponent here is that i have also background in slavic studies i studied russian and polish i have a background in in finno-ugric studies i've studied finnish and hungarian so so it's i @@ in in in in that sense maybe i'm not too one-sidedly computational linguistic here , so as to the general remarks ah the la- the lamp in the ceiling obscures some of the headings here but it says general remarks one er my er er as to my general remarks about this dissertation i could say then probably that it's er i'm referring back to to the previous slide that this is corpus linguistics in the on on the right hand side this is the kind of linguistic linguistically informed corpus linguistics rather than than empirical natural language processing that is presented in this dissertation er it's also of course translation studies corpus linguistics as applied in translation studies and er you have to say also generally that as this kind of corpus linguistics the work that <NAME S1> has done has first of all produced a very valuable resource resource is the word you use about these things that is the parallel corpus of russian and finnish , er and as those of you who have read the dissertation know at least and as you might know anyway this is not a very common language combination in this in this context , most parallel corpora in the world tend to have english as one of the languages and and they are almost always er parallel corpora among say well the- the- there was this er notion of standard average european that sapir once talked about and and and the parallel corpora tend to be tend to be made up from standard average european language pairs in general you find french-german corpora you find 
swedish-german and so on but but finnish-russian is is unusual er er the , well there there are english-chinese and english-japanese materials and so on for economical (resources that is) er so so so this is a very very valuable outcome of of er the dissertation work , er <COUGH> it's also another very valuable outcome of the dissertation work is as far as i am competent to judge this because like i said translation research is not my primary area of of competence but it certainly seems to me that it's a very very good piece of translation research translation studies using these parallel materials and and <NAME S1> makes a very very good case for how parallel corpora can and should be used in this kind of research , there are some er well i i i call them case studies because each of them could certainly be made into several dissertation or other book length studies in their own right i your your emphasis has been more on showing that this is something that you can do with parallel corpora and that it's a good thing to do with parallel corpora because you get interesting results from it er so you have looked at such issues as the length of translations compared to originals which is er which is both interesting and and yielded interesting results and you also had a kind of original take on how to measure length in a in a reliable way which isn't all that easy when you think about it a bit should you should you compare paragraphs (well we) talk about paragraphs (xx) in your introductory lecture should you talk about words okay but words are different in different languages what you count well letters or characters but then you adjust it er with with er a an informa- information theoretical notion w- which which i think is a is a was was a very good idea , er <COUGH> you've also looked at the text structural transfer as we could call it paragraphs is one example punctuation is another and there are certainly other other text structural factors that you could look at 
but but you you show that this is parallel corpora are very suitable for these kinds of of investigations , you've looked at various aspects of the lexical characteristics of of translated language , maybe i should have written it but then it would have been two lines and it wouldn't have fitted in the slide , so so translations here means the tran- texts and translated language , er as as as compared to the original and as compared to to original texts in the target language which and and and you you show convincingly that the parallel corpora are very very valuable for that kind of research , you also investigated the relationship between , well parallel texts translations or translation equivalents as evidenced in parallel texts er and relationship between those and er translation equivalents as given in bilingual dictionaries where you also have some interesting results among them are that it seems that the bilingual dictionaries tend to influence translators the fact that you have er a translation equivalent available in the dictionary prompts possibly anyway a translator to use that rather than than some other possibly more appropriate equivalent we'll er come back to that issue , a bit , so er summing up these two slides as as thought i i i i i kind of saw this thesis as well consisting of three separate parts well not separate to but it's a it's a whole but but there are are three discernible parts in it and i've just talked about two of them that is the creation of the resource the parallel corpus i've talked about the well actually third part of it which which was the case studies the investigation of translated language using this er this parallel corpus the the middle part is er . 
ah , is a , sorry i'm i'm getting lost here with the slides , but the middle the middle part is er the program the er er computer environment where you can work with a parallel corpora in order to to investigate er in- investigate the , er the translations translation equivalents and other other features of translations and er . just . yeah and i'll i'll get back to that on the next slide i think be- because because that in in my in my opinion the the the middle part is er i the the the part that has to do with computational linguistics in a in a in a wider sense is is perhaps the the weakest part of the of the whole work in in in a in a specific sense not and i i hasten to add that it's it's not weak in the sense that the results that you get from it are bad it's not weak in the sense that the program is doesn't do what it should it's it's weak in the sense that er it er in in a way it represents in in the field of computational linguistics a kind of old-fashioned way of doing these things there's there's happened a lot er in er well the the methods that you use were first proposed at the end of the 50s and so on and and er were still generally used up until the end of the 70s or something like that but in the 35 years after that quite a lot occurred in the field of computational linguistics which could in some cases help or or improve what the the tools that you are building and we'll return a bit to these issues as well and this is and and i say again it does not mean that your program works badly or that it's bad it's this i- this is more about the kind of theoretical foundations of of the program and and also about the possibilities for developing it further and and we'll get back to that , er . 
now i'm er kind of ad-libbing i think i'm not following these <S1> @@ </S1> slavishly but i so so the the important the just to highlight some some important results of this dissertation work er summing them up on the slide the the corpus itself is is it's extremely an extremely valuable resource i would say it's it's er a lot of work has gone into its compilation and and certainly it should produce a lot of high quality research in the future as well that's that's my belief it should anyway @@ er the idea of about using information theoretical notions for adjusting the length of texts in when when comparing them er the that you found by empirically that the the the there is a method using word inter-language word similarity as a heuristic for finding translation equivalents and what this means is basically that if two languages the there are two cases if two languages are closely related then you will find words that are similar and mean the same thing in those two languages er . and on the other hand if two languages are are in a wide sense culturally related then you will find words that have a common origin because they are loan words for instance in the two languages which mean the same thing so you find this international vocabulary for instance in in many many languages and you f- and and and this has been used in in many parallel corpus projects in order to find er translation equivalents and and it works wonderfully with word pairs like swedish and german for instance for for both reasons actually there's a they're they're quite closely related so you have words like <FOREIGN> båt </FOREIGN> and <FOREIGN> boot </FOREIGN> or whatever and things like that just off off my head off top of my head er but you also have a large stock of common international vocabulary which are are quite which is quite similar , but you found that comparing russian and finnish you didn't get very many of these pairs at all that's not a good heuristic for finding translation 
equivalents that's what T-E means there by the way so that's that's a very interesting result and you and you instead resorted to distributional characteristics for for finding these equivalents which which were much much better erm that er you also found all these structural factors that tend to be preserved in translations you found that translatio- translators are often bound to dictionaries and you also say something which i find very very interesting and which i will return to as well but well translationese er which some people have called translated language as opposed to original language is not a unitary phenomenon you you rather say that there is er finnish translated from russian shows certain certain traits finnish translated from english shows certain traits for at least that's how i read you er finnish translated from german shows certain traits and and er there are very very few traits that those three varieties have in common as compared to to finnish not translated from anything that is original finnish and er and er this this too is a is a very interesting claim and and i i wi- wi- will get back to it and discuss it a bit more later on er , and this should actually if i've read the instructions correctly , er m- m- m- thi- this kind of concludes the general overview now i should get on with methodological and general questions but i should sit down is that right @@ okay <GETTING SEATED, P:06> so , er although it would sometimes feel better if i'm allowed to to walk around <S2> @@ </S2> it it is it okay </S3>
<S2> yes it's allowed </S2>
<S3> so there are some methodological musings as i've called them for some (xx) methodological remarks that i would like to to make , and there there's first of all a general remark which is not in no way unique to this work which is general at least to the field of corpus linguistics it's also kind of endemic to computational linguistics and parts of computer science and this is the issue of repu- reproducibility of results , one one of the demands that you put on on on a scientific investigation is that it's presented in such a way and that and and in other ways arranged so that anybody could reproduce the results and in the world of corpus linguistics one of the great obstacles to re- to this re-producibility is er er in issues of intellectual property that's what I-P means it's a very common common er acronym or abbreviation in in in this er context er which which basically means copyright in the case of of corpus material so so one one big obstacle to to anybody being able to reproduce your results is that anybody won't have access or does not have access to the corpus for copyright reasons , there could be other reasons as well i mean if if you decided to keep the programs er proprietary for instance but you have actually presented them in the dissertation so that's not a problem but but and so i- so i- rai- to- er <COUGH> er i i will like to raise this as a general methodological er matter er issue which which er which i- which is er potentially very serious for for the field of corpus linguistics , er <COUGH> there is another er methodological matter which er crops up now and then in the dissertation and that's er tha- that's more of a a a technical issue actually there is a a s- s- sometimes you you kind of blur the distinction between what what should go into the database and what should be shown to the user of the database because those are two fairly separate <S1> [mhm] </S1> [matters] actually so so you discuss for instance the the thing on page 56 is 
as far as i remember the year of er of er publication where you have a where you have a discussion from which and you really i really didn't see whether you actually have the the exact year of publication in the database <S1> i have </S1> okay so that's be- be- be- because the point is that you c- you can always <S1> mhm </S1> you can always remove information <S1> mhm </S1> or or for for the purposes of displaying it but but er okay so that that takes care of that , er the other one i think is about er alignment where you state that a paragraph is a good <S1> mhm  </S1> good context for for showing but but but you you have that discussion as a reason for not doing sentence alignment basically but but the point is you you can have a sentence aligned corpus and still still choose to display er search hits in in paragraph context there there's no contradiction there so so so that's not really and er , er and and and there's also i think a bit of the same er confusion in in your discussion about the SGML XML HTML <S1> mhm </S1> and so on because because one of one of the important things about these (xx) are that ther- there ha- has been a tendency towards er distinguishing between what's in the computer and what's shown to the user as far as possible so so that's tha- that's a kind of general methodological matter that that you you just should think about <S1>  mhm </S1> and the and the third third remark then is that well i said it already you you don't really take into account the the field of language technology or computational linguistics in in the in the serious sense here which is which is er what on on one reading what chapter three is all about <S1> mhm </S1> but that's er debatable of course . er , okay , mhm so then i will proceed to what's what does it say here it says er proceed to details . and er . 
i have a number of questions to you they are of very very varying er characters some of them are fairly fairly short and simple some of them will require other dissertations in their own <S1> mhm </S1> right to to <S1> [@@] </S1> [@@ answer] properly er , the this is the first set of questions also deals you could say with a kind of methodological matter because i i noticed when reading your chapter about corpus both the background chapter and the and the chapter on corpus composition compilation representativity and related matters that er you the the the category of well literary text which which i've i've used as the term for <FOREIGN> (xudozestvennyx text) </FOREIGN> that that you have in in the title of of the dissertation , they are er accorded a somewhat special status the the literature the literary works are are given a a somewhat special status you you contrast them with er what you call normative language on page 35 which is er a you say the kind of language that you find in ordinary written language corpora at least balanced written language corpora where where you've tried to collect a representative er sample of the well normative language the written language er at at a certain time and er well th- the brown and and LOB corpora are are the oldest i guess examples of this for for english er , british national corpus might be something like that it's it's already debatable but but but but but you but you but you make that distinction anyway that that the literary texts are different from this kind of language , you also say that the literary texts are important for the language community and for the culture that's that's that's the next page actually , er you also single out these texts in in a in a kind of negative way or implicit way by contrasting them to in- insignificant works of of of literature that's un- un- i i don't remember the exact formulation of the they're not (xx) <S1> mhm </S1> written by by authors that nobody knows about and so on so 
there's there there's there are there are definitely different kinds of literary texts as well and and also which i think which i find maybe the most interesting that that literary texts are characterised as well transcending the bounds in gra- of grammar and style they which which i take to mean that they display unusual linguistic creativity you find you you find you you'll be surprised linguistically speaking when reading literary texts <S1> mhm </S1> er of of this kind as opposed to to any text <S1> mhm  </S1> so so i i have i have some questions about this notion of of literary texts and literary works and and their special status which which are mainly perhaps of a methodological nature , so er first of all well it says question one when there are <S1> mhm </S1> three questions here <S1> mhm </S1> but still you you'll have to accept that @@ er so d- do you have more than er a priori opinion to support this high assessment this high re- regard in which you seem to hold these literary texts and their importance for for for what you're doing their importance for corpus work are are there kind of empirical factors that you could point to which which also would single out these works as being a a category <S1> [mhm] </S1> [by themselves] by by itself rather than than than something else so that's i guess my first question <COUGH> </S3>
<S1> well er so my interest er literary texts er and er language of literary texts well actually it comes from er russian er linguistics and from traditions in er russian linguistics actually for dictionary compiling er i guess for example for <FOREIGN> (xx) </FOREIGN> a lot of er literary texts er were used er and er were almost can be sai- can be said one of er main sources for this large dictionary er and er in er russian linguistics for example <NAME S1> defines distinguishes between er <FOREIGN> (xx) </FOREIGN> and <FOREIGN> (xx) </FOREIGN> so lit- lan- <FOREIGN> (xx) </FOREIGN> in his er opinion is er something what er brown corpus calls would call normative language <FOREIGN> (xx) </FOREIGN> is the language of literature and in his er works he tries to distinguish between er these two things so i've i guess i er was er to large extent influenced by his views , then i started to do my own corpus research i noticed that the western linguistics er pays little or no interest at all to the literary texts , language of the news language of the newspapers er language of er er documents is considered and er in the brown corpus for example er there are literary texts but er the principle and if i understand correctly was to take some third rate or second rate authors which are not widely u- yu- widely known which are not on best-sellers list and so on so it is just er another (xx) and there i tried to stay somewhere in between er but er actually now i think er during last year my view started to change but er maybe these changes and er er interests to er special texts as well are maybe not er yet reflected in the dissertation this is my answer </S1>
<S3> yeah the er the the kind of methodological issue that i'm talking about here is that i the- there is a problem here in the sense that if you are presented with two texts <S1> mhm </S1> and er and er can you formulate objective criteria for saying that this is a literary text and this is a literary text too or <S1> [mhm] </S1> [for saying that] this is a literary text but this is not a literary text or none of them are literary texts <S1> oh </S1> and that's and and and that's the kind of thing i'm asking for what how how would you that's that's actually the first <S1> [mhm] </S1> [parenthesis] here what would such evidence look like what how how how would you discern or how would you program a computer to look for literary texts as a as opposed to non-literary texts in this sense , is it possible </S3>
<S1> well i think yes it is possible because er well in cultural studies er well i've been er communicating with some scholars doing research in there and er cultural studies er er they er now are now talking very often about strong texts and weak texts , strong texts are texts which er produce er which influence the culture so which er have some impact er there are phrases borrowed from it and er people er refer to them quote them often for example like in russian literature is <FOREIGN> (xx) </FOREIGN> everybody knows er idioms from this er play this play is just cut into idioms er like er <FOREIGN> (xx) </FOREIGN> or <FOREIGN> (xx) </FOREIGN> and so on and so forth so it lives er its own life and er er so er er it should be if a corpus of literary texts is compiled er compiled some of sort of electronic anthologies are collected er it ma- it should include er these strong texts and er i think it's quite possible to er define them for example er a to study if er there are phrases from them quoted or to make er questionnaires er among er native speakers and ask them do you know this book er have you read this book <S3> mhm  </S3> have you heard about this book and select er on this criteria <S3> mhm </S3> but they do not exist @@ </S1>
<S3> no no but no no thank you no that's that's a good answer but on the other hand i have to (xx) then the bible should be in most of these collections </S3>
<S1> yes [looks like that] @@ </S1>
<S3> [or the quran depending on the (context)] or the culture but because that's er i mean you find stuff from the bible all over the place <S1> [mhm yes] </S1> [in western or] in christianity </S3>
<S1> [yes there's (for exam-)] </S1>
<S3> [not just western christianity] </S3>
<S1> for example er this intertextual er studies er they need er such things er to study what er quotations travel from the book to another <S3> [mhm] </S3> [allusions] of different kinds and er i read about some studies then in this er field </S1>
<S3> so that's you're talking about a kind of intersubjectivity here maybe in a sense it's it's not exactly a computer decidable thing <S1> [mhm] </S1> [but] but still decidable i i <S1> yes </S1> i i mean i i i accept that </S3>
<S1> well i i read about one algorithm er it er they were studying bi- bible quotations <S3> mhm </S3> and and it was quite an interesting algorithm it wasn't based on any linguistic just on overlapping <S3> [mhm] </S3> [on er] some er words er overlapping in different er texts er and producing them to the user then because it actually it's very difficult to decide whether it is quotation allusion or just coincidence <S3> yeah </S3> whether person is quoting it or not </S1>
<S3> er well the third parenthesis is maybe not so relevant here then because it depend on what you're there's this this er psychologist eleanor rosch you might <S1> mhm </S1> know about <S1> prototypes </S1> yeah so so maybe i i was just thinking that maybe we're talking prototypes here </S3>
<S1> not exactly </S1>
<S3> so so s- there it there are distinct categories a text is a literary text or it's not it's not it's it's not a set of defining characteristics so so could for instance a text be a literary text without having this creative language use it could be quite i can imagine i that there are some quite well-known swedish authors who who use very simple language actually and and they're still they're still highly regarded and they're not creative in the sense that </S3>
<S1> yes and this thing is very difficult to decide whether it's creative creative or not for example i think that it's brilliant and very creative and and there's a lot and another one said it's er not okay </S1>
<S3> okay no bec- because you you actually you actually talk about that as a kind of or i i read you as talking about that as a kind <S1> mhm </S1> of defining characteristic whereas a characteristic of these literary <S1> [mhm] </S1> [texts] and that's that's why i'm asking here because er another another thing that's or or my my next question , and this is also because i come from a <S1> [mhm] </S1> [kind of] linguistic background well nowadays anyway <S1> [mhm] </S1> [rather] than this er languages background , so what are you investigating er i i i mean apart from from translations and so on what what is it that you're investigating when you choose to work with these full length i call them high-brow here <S1> mhm </S1> literary texts rather than other texts <S1> [mhm] </S1> [types] in linguistic terms what what are you wha- wha- what what would you be investigating in those texts that you wouldn't be investigating in other other kinds of texts or or text collections </S3>
<S1> mm , well er , i think er to some extent er there are the same things with er which can be found in er any text collection so every every text is words words words and so it's possible to find words everywhere to find er question marks and commas and so on er but er if er we're talking about high-brow texts er they are er influencing er er other media so they are quoted they are reproduced and so on and so they mi- might be er interesting not only for linguistic studies but a- in linguistic studies it might be interesting from the point of view of er violation of some principles because if you were studying a normal standard text corpus were interested in the norm <S3> [mhm] </S3> [were] f- looking up er certain words er look er up er their frequencies how they are used and what and so on and so forth and er kinds o- in the kind of these texts er might be more interesting to er find er what is wrong there <S3> mhm </S3> what is different because er if er it is a literary text er it should er to some extenc- extent er violate the rules so there might be some metaphors some unusual words some words er which <S3> mhm </S3> er the author invented and so on so this is one thing that might be studied another thing is er er well the er cultural background and so on richness of vocabulary or and so on so it's another direction er and er the last very er pragmatic direction is er to find er the standard translation for example er er some er worker translator perhaps or scholar er needs er the correct er needs the standard translation of this quotation [you just can't look it] </S1>
<S3> [(xx)] </S3>
<S1> up here in the parallel text </S1>
<S3> you mean this important quotation [and cultural (xx)] </S3>
<S1> [yes they (xx)] er for example wants to er find [a quotation (xx)] </S1>
<S3> [(xx)] </S3>
<S1> from dostoyevsky for example some er something about er well . <FOREIGN> (xx) </FOREIGN> he's curious how it sounds in finnish and wants to quote it properly or if he's er translating some other text and er found the the quotation and er when he has two options er to translate it himself or to find the existing translation which is a better solution , by the way i asked er er eero balk is er the translator from russian into <S3> [mhm] </S3> [finnish] well-known translator i was er i met him at one seminar in h- in helsinki and i er asked him about well the practice what er shall the translator do if he's translating a fiction text and er comes across er a we- some well-known translations from <S3>  mhm </S3> bible or from er <S3> [mhm] </S3> [some] classical literature russian or foreign for example if goethe is er er quoted or tolstoy is quoted what shall you do and er and er he answered that well the best er solution would be to find it the existing standard translation and to use it but we can't (xx) usually we translate ourselves so that might be useful in this field that's the practical (one) with the er the research </S1>
<S3> no that's a very good point <S1> mhm </S1> i didn't think of it but every time you see on the subtitles in television <S1> mhm </S1> when they mangle <S1> mhm </S1> the shakespeare <S1> mhm </S1> [quote or something like that by by] </S3>
<S1> [yeah , yeah (xx)] </S1>
<S3> @re- recreating@ the <S1> mhm </S1> translation it <S1> mhm </S1> into into swedish in this case so you @you you@ realise that that's a problem of course , er . okay </S3>
<S1> even the titles of the books <S3> yeah right </S3> for example master and margarita in finnish is <FOREIGN> saatana saapuu moskovaan </FOREIGN> </S1>
<S3> mhm yeah okay , er but again you return to the matter of the kind of linguistic rule breaking <S1> [mhm] </S1> [linguistic] creativity of these works and and and and another thing so so apparently you seem to think that this is not important trait in these kinds of works which i'm happy about because that doesn't then my third question here is not invalidated by <S1> @@ </S1> your answer to the first question er , because one one thing that you might think given that these literary works are , have a kind of special role in the source language community and have a kind of special way of dealing with language a a more creative way of dealing with language does this hold for their translations as well as a rule </S3>
<S1> well i'm afraid not <S3> okay </S3> and this is er the weak point of translations er while i was er studying these translations er of course er well i er i confess that i didn't read all the translations which <S3> mhm </S3> there are in the corpus but of course i studied er a lot of er examples and er now i think er that er the best idea to read er the book er is to read it in the original <S3> mhm </S3> because er er well there are different translations but in most cases er they er doing some standard works there are many cases they either misunderstand or maybe i don't understand the the translation properly because here is the culture proble- cultural problem i er give er in er one of the chapters er an epigraph from er dostoyevsky well when dostoyevsky says that er it is so easy to translate into russian from western languages and it is so hard to translate from russian into western languages and i think that's just an illusion because it was easier for him to judge er how er french german or er italian texts were translated into russian er he could er er well while he would judge the quality of er these texts another thing he couldn't er er exactly er er tell er whether the translation was really correct whether everything was conveyed there and er it is much easier to judge er the translation into a foreign language because er you can er see that this is missing and this is also missing but also er in this case also it might be wrong because er the audience the target group fortunately are not speakers er for example translation into finnish from russian are not done for russians speaking finnish <S3> no </S3> they're done for the finns who do not speak russian or they speak russian but not er so well as to read russian er works in original or for reading both and comparing them and so on but actually er i had an impression that sometimes er a sort of mechanistic approach and er it's difficult to say whether the music of the original is preserved 
whether all the metaphors are okay <S3> mhm </S3> for example this example i gave in my lecture about <FOREIGN> prostor </FOREIGN> the equivalent <FOREIGN> väljyys </FOREIGN> well actually it looks like a good equivalent but er er the translation looks er very mechanistic and er actually the native speakers says that er the style is er isn't okay so this is i think an interesting thing but er i think er the translations should be er er well the translator should attempt to reach this er level <S3> mhm </S3> so that it would be at least er comparable to the er original text and er might be these er corpora might help to some extent so let us see what er happens </S1>
<S3> now one the a- a- again i have a kind of a methodological motive for asking this because if it would be the case <S1> mhm </S1> that these exactly this kind of work would be translated in a way to which is which is then er by its very nature deviant because it w- would be translated into something that's again by its very nature deviant then you don't know what you're comparing are you comparing are you comparing traits of translated texts or are you comparing traits of of er well i w- w- , an- another another thing that that i at least took you to be saying was that that er literary works are individually different as well <S1> mhm </S1> that is and and that and and that means and that would mean that what you're looking at in the translated works are kind of individual differences rather than differences caused by the fact that they've been translated even if you compared them with with original literary works do you do you see what i mean <S1> [yes yes] </S1> [that that] that there's a potential additional variable to to account for if this would be the case but you say that it's not so @@ </S3>
<S1> well to some extent er er i found er interesting things for example er there was one er study i i've been i have been doing for some time but i didn't include it into the dissertation is the problem does the translator has his or her own style and er <S3> ah </S3> er there were er different translations done by er translations done by er same translators were compared and it was difficult to find er something in common because for example if you're taking the texts er done by the same author there exist the same methods to er er how to find out some style prints er and the author's finger prints in style and so on there are some methods there are er er compa- er vocabularies are compared the grammar use of certain grammar forms and constructions and so on and er er this works more or less consistently although there are different results like <FOREIGN> <BOOK TITLE> </FOREIGN> <S3> yeah </S3> some experts say that it was done by <NAME> some say that <NAME> and both use er some methods er for the er research but er with er , translation it's er different what they found out so er the same translator translating different er texts is , changes style so the style might be quite different the vocabulary is different the grammar might be different so in most cases he er acts like a mirror <S3> okay that's er mhm </S3> but that also could be studied and er , but not much of personality (traits) </S1>
<S3> okay . mhm okay , right so let's go on , now i turn to another <S1> mhm </S1> topic er you say or it is said that you say in this dissertation actually that it's hard to use internal effects to classification parameters <S1> mhm </S1> in your database because er well and and you er you give basically two reasons that the corpus consists of full length literary works <S1> mhm </S1> that they in themselves <S1> mhm </S1> have many sub-parts or s- <S1> mhm </S1> segments that could be classified differently you you get some amount of dialogue for instance in er in er in er something that is otherwise monological narrative and so on er and also that the texts were not available in electronic form prior to their inclusion to the corpus you say something <S1> [mhm] </S1> [like] but i'm er i'm actually wondering i don't know if this is er obvious answer to this question but wouldn't it make sense to to apply a more fine-grained internal classification to the texts so that you do get <S1> [mhm] </S1> [into] the corpus that are available in electronic form because you you chose basically one i think internal parameter <S1> mhm </S1> which was which was year of publication </S3>
<S1> well it's also external </S1>
<S3> sorry sorry okay </S3>
<S1> er internal i think the genre </S1>
<S3> the genre yeah right </S3>
<S1> (the only) thing and it's very vague [(xx)] </S1>
<S3> [sorry sorry sorry] er </S3>
<S1> yes i've been thinking about it er and er , and i think er it has to be studied later er when i er have texts already in electronic form it is possible to study their structure their different er features er , and i i i think but it should be a separate research er and quite a long one long lasting but <S3> yeah </S3> er i agree that er that's very important </S1>
<S3> i think you might have answered this one <S1> mhm </S1> already now so so i'm asking here <S1> [mhm] </S1> [specifically] about parts of texts because it could be interesting i guess to study dialogue in <S1> [mhm] </S1> [different] authors' <S1> [mhm] </S1> [for] instance but i mean you need to find those parts , er how would you do it maybe that's something that you haven't ans- answered to yet the last question here </S3>
<S1> well yes <COUGH> i i think er first thing is er that er well during my he- research i . there was always a problem er using full er texts <S3> mhm </S3> because er i was never sure that er the data is really representative i was never sure that something wasn't influencing like er with this case for the word <FOREIGN> (xx) </FOREIGN> there were just plenty of er examples from dostoyevsky's crime and punishment <S3> yeah </S3> which er could change all of the picture er so actually i er think that er in future it would be er possible to er use this full text corpus for compiling er the samples corpus and the samples er could be not er fixed length samples perhaps but er some kind of a . some er chunks of text which can er act independently like a chapter <S3> [mhm] </S3> [or] some paragraphs er er that should be er discussed er what kind of this er chunk would be er active in the , er the these samples and er this sample is easier to classify er easier to study its er internal features and the internal features well could be many of them invented just like er , some vocabulary index for example or er a kind of discourse er </S1>
<DISC CHANGE>
<S3> there is work in er well in the area of information retrieval which which could be very relevant to you the the this er there is something called text segmentation <S1> mhm-hm </S1> which which tries beca- because i mean i mean information retrieval problem you know web web search <S1> [mhm] </S1> [engines] and so on is that er as as it usually works today you'll get a whole document <S1> [mhm] </S1> [as a] as a as the result of your sea- the query <S1> mhm </S1> but what you actually want is the relevant portion of a <S1> [mhm] </S1> [document] so people have been looking into ways of automatically segmenting longer texts <S1> [mhm] </S1> [into] kind of coherent pieces <S1> mhm </S1> and these are topically coherent in the sense that <S1> mhm </S1> they are about the same thing <S1> mhm </S1> but i mean you you can see how <S1> mhm </S1> you could be able to apply this also to <S1> mhm </S1> literary works and and maybe try to see <S1> [mhm] </S1> [more or] less automatically where you have dialogue for instance <S1> mhm </S1> or where you're changing topics from for instance changing scenes or <S1> mhm </S1> something like that so <S1> mhm </S1> so so i i i think you should look into that @@ </S3>
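A minimal sketch of the kind of text segmentation S3 points to, assuming a deliberately simplified lexical-cohesion heuristic rather than any particular published algorithm: consecutive paragraphs stay in one segment while they share vocabulary, and a new segment starts when the similarity drops. The function names, the threshold value and the toy paragraphs are hypothetical and do not come from the dissertation.

```python
import math
import re
from collections import Counter


def bag_of_words(paragraph):
    """Lower-cased word counts for one paragraph."""
    return Counter(re.findall(r"\w+", paragraph.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment(paragraphs, threshold=0.1):
    """Group consecutive paragraphs into topically coherent segments.

    A new segment starts whenever a paragraph's similarity to the
    preceding paragraph falls below `threshold` (an assumed value that
    would need tuning on real literary text).
    """
    segments = []
    for i, paragraph in enumerate(paragraphs):
        if i == 0:
            segments.append([paragraph])
            continue
        similarity = cosine(bag_of_words(paragraphs[i - 1]),
                            bag_of_words(paragraph))
        if similarity < threshold:
            segments.append([paragraph])
        else:
            segments[-1].append(paragraph)
    return segments
```

On running text such a heuristic would probably need stop-word filtering or lemmatisation before the similarity scores mean much; detecting dialogue specifically would need additional cues that this sketch does not attempt.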
<S1> mhm yes and also the full text er also er is a useful th- thing to study because one can never see er how you can er change it and perhaps er er it would be interesting to create random corpora from it er <S3> yeah </S3> or er as a whole it might be also needed er for looking for quotations (xx) <S3> mhm </S3> say or study the this given work , and title </S1>
<P:05>
<S3> er a- here comes a question directly without the <S1> mhm </S1> background er it seems to me that er some of the functionality and i notice now when you show me that you spell it with a C actually <S1> mhm </S1> not with a K and in the english version <S1> that's the way </S1> @@ er is available in in something called two-step which you also talk about </S3>
<S1> yes i heard about it some time [ago] </S1>
<S3> [so] so did you did you ever consider working with that program instead , it's it's it also works paragraph alignments for instance it's also it claims to be able to handle unicode for instance the full range <S1> mhm </S1> of both character sets and so on <S1> mhm </S1> and and if if not why not @@ </S3>
<S1> mhm well i read about this pro- program some time ago and er i think er it was er quite er interesting and maybe some functions are the same but er , and it might be interesting to contact them and to talk about how they are doing this and how we are doing this , so but but actually if there is er something what i did myself is always easier to use because er well actually i er usually try to well if er there is a bicycle in the store and it might be nice to buy it or it might be nice to <SIC> invite </SIC> it myself and er have the pleasure of invit- inventing it and er , actually i . i have such er impression that er while er i'm doing some software on my own er i am more free er to make what i need if i need to develop some new functions i can develop them for example this er punctuation marks er i er think er well at least the software i know usually ignore these marks and you can't get information on their use and er so does my software but i managed to er write a function which doesn't ignore it and calculate these frequencies so i don't need to go into some forum asking if somebody has the program for calculating question marks so so this is my er just my point er that i again i do not insist [just my style] </S1>
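A minimal sketch of the sort of function S1 describes here, one that counts punctuation marks instead of discarding them. The set of marks, the function names and the per-thousand-words normalisation are assumptions made for illustration; the recording does not say how S1's own software does it.

```python
from collections import Counter

# Hypothetical set of marks to track; the recording does not list them.
PUNCTUATION = set(".,;:!?")


def punctuation_frequencies(text):
    """Count each tracked punctuation mark in the text."""
    return Counter(ch for ch in text if ch in PUNCTUATION)


def per_thousand_words(text):
    """Normalise the counts per 1,000 whitespace-separated tokens."""
    n_words = len(text.split())
    if n_words == 0:
        return {}
    counts = punctuation_frequencies(text)
    return {mark: 1000 * n / n_words for mark, n in counts.items()}
```

Normalising by text length is one way to make the counts comparable between an original and its translation, which usually differ in length.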
<S3> [no no] no there are levels of complexity and and it's also i i i have to reply to you then that there is there's a phenomenon which in english even has got a name it's called re-inventing the flat tyre that <S1> [mhm] </S1> [which is] which is apt considering your <S1> [mhm] </S1> [bicycle] analogy because of- often when people write programs on their own and not having for instance a a computer science background they they miss some obvious things <S1> [mhm] </S1> [which] which </S3>
<S1> again and again </S1>
<S3> again and again yes (true) so so that that's always the danger and how serious that is depends on the kind of complexity of the task so finding question marks i think if you you you're probably quite right in your assessment that it's easier to to cook something yourself <S1> mhm </S1> than than because it it's it's a matter of well hours before you have it working probably rather than than spending a day or two going out on the web asking people trying to find something that's very simple but when it comes to to larger software development efforts then er and and especially if if large corpora are involved which benefit from all kinds of clever algorithms and optimisations that computer scientists have been working away at for <S1> mhm </S1> for decades <S1> yes </S1> sorting for ins- well that's built-in often so you <S1> mhm </S1> so you benefit from it <S1> mhm </S1> anyway er that that's that's when it becomes <S1> mhm </S1> perhaps more productive to to to see what's around already <S1> mhm </S1> er </S3>
<S1> yes but er i guess that er in a s- er large projects at least er what i read er usually if there is a large project er they do their software by [themselves] <S3> [partly] i mean that's er er an an </S3> some clause i think was done <S3> [yeah] </S3> [just for] this er BNC </S1>
<S3> but that's a long time ago it's er the the the clause i mean <S1> mhm  </S1> so , so it's a the the the there's a quite a lot of reuse actually in computational linguistics world today not not as much as you would like you're writing about i mean people still have this not not invented here syndrome i'm i i @i don't like that@ at all </S3>
<S1> and actually i also tried to exchange er algorithms and er exchange er codes er with er my er one of my colleagues he gave me and er these codes for lemmatiser <S3> yeah </S3> er i tried to use it and er well it didn't succeed and er actually i had some questions about his lemmatiser which were it wasn't doing quite what i needed and so finally i tried for couple of days er then i gave up and er did my own </S1>
<S3> mhm no no there are [very many issues about inter-] </S3>
<S1> [it's er so it's] </S1>
<S3> interoperability and and things like that which which you you rightly point to fr- @from an@ empirical point of view </S3>
<S1> but anyway it's always nice to er cooperate if it is possible </S1>
<S3> yeah sure but i mean open open sources are very <S1> mhm </S1> good development <S1> mhm </S1> i think in this <S1> mhm </S1> respect which <S1> mhm </S1> which means that you can get software and you're <S1> mhm  </S1> allowed to modify it to your heart's content <S1> mhm </S1> provided you give the same <S1> mhm </S1> kind of a er <S1> mhm </S1> conditions for the next user of the software so that's that's very good st- there's still an effort in (xx) understanding what it does of course , right er you also say and now we're going into technicalities i think and er in order to use the a f- a fairly obvious kind of candidate for for a lemmatiser in finnish <S1>  mhm </S1> would be the two-level or or or sys- a system built with with two-level morphology koskenniemi built one in 1983 for his dissertation or a little bit earlier of course er but you say i think that you would need to devote a lot of work to lexicon building or something like <S1> yes yes </S1> that (xx) but it seems to me that you have already done a lot of lexicon building which is more or less equi- equivalent in a sense to what what [(xx)] </S3>
<S1> [not exactly] i because i have what i have done are just grammar classes according to <FOREIGN> perussanakirja </FOREIGN> and er <S3> mhm </S3> they were just grammar classes and i have er grammar tables which er <S3> yeah </S3> are work as simple rules quite primitive rules er but er in er koskenniemi's grammar you have to consider every word and for every word give er some er information so it could be different database structure and er , well actually i er tried to find the quick er solution <S3> yeah </S3> so that i won't er work on it for two or three years <S3> oh </S3> although it might be er interesting the same things er there are lots of er , er not lots of course there are certain . er approaches er which would be er very useful and er very interesting like er for example in er russian linguistics er is er this <NAME>'s model <S3> [yeah mhm] </S3> [<FOREIGN> (xx) </FOREIGN>] that is er interesting to use er but er it demands a lot of material and i know people who work in it they tell that er every entry in the dicti- in the er glossary which is used by the er computer it takes er well at least several pages because er <S3> [yeah] </S3> [everything] about this word should be mentioned its er government er its er er the preposition it takes er the preposition it doesn't take its semantics and so on finally it works but it's er very fragile so it er finally er er some er in some trickier context it doesn't work and and er also i er talked with er <NAME> who is working on er tree banks <S3> yeah </S3> russian tree banks er they're doing a syntactic er parsing and it's based on this er <NAME>'s theory and er he said that it er well they have some interesting results but er how tagging you know parsing looks like that the parser studies the phrase builds the tree then the tree is presented to the user the user corrects it and er <S3> mhm </S3> they can't work without the user and a very high qualified user who can <S3> mhm </S3> understand if the tree 
was okay and if he can't then the er the sentence should be s- cut into two pieces and er procedure er continued so i i and actually i was afraid that er if i tried to use koskenniemi's approach i would have to study every word word by word and would [take a lot of time] </S1>
<S3> [mhm] er i i i i i think there is more in common with what you have done and what and and one of these descriptions than you than you <S1> [maybe it is maybe] </S1> [say is] it's it's this (let's let) , fair enough , er , another thing that er , i , won- i was wondering about is that you use of the term gold standard <S1> mhm </S1> which you mentioned well i i i put in the page numbers i don't i don't remember which section it <S1> [mhm] </S1> [was but] er in a slightly non-standard way <S1> maybe @maybe@ </S1> and and what what's the role of the gold standard in your work when you're when you're <S1> mhm </S1> testing methods for finding translation equivalents </S3>
<S1> well i just actually now i think that i could do without it as well and to try just er from the ad hoc work er on er this er search of the equivalents er but er when i was er starting to study this subject er i thought that it might be interesting at first to to do a sub-standard list and to study then er how er the tests are running on this er list but now i think i could do without it as well <S3> mhm </S3> decided finally to write what is done and is what is done and it should tell about this flat tyre </S1>
<S3> yeah well what what er the way i understand this term is that you have a not only a standard list you also have a s- list of of already annotated equivalents <S1> yes yes </S1> so that's what you have </S3>
<S1> yes so i i had er i compiled er two lists finnish and russian <S3> yeah  </S3> er in the given er corridor of frequencies <S3> [yeah] </S3> [(with them)] and then er they were just er compared er on the base of er frequencies <S3>  yeah </S3> i think if there were many overlaps it was registered if not er not and then i looked if there were many equivalents in this group and er what should be er the test to find out if it is an equivalent or not <S3> yeah </S3> so just i needed sort of raw material to , to test er the existing coefficients <FOREIGN> (xx) </FOREIGN> and so on </S1>
<S3> so okay so so and and just to stay on this topic a bit longer , the the the usual way it's it's it's used is that you also automatically can determine how good your method is <S1> yes </S1> so you just you just running on the gold standard <S1> mhm </S1> without without the links and circuit and then <S1> mhm </S1> see how many of them are of the links actually <S1> mhm </S1> coincide </S3>
<S1> well actually is this is what i've done <S3> okay </S3> so i've <S3> [no i'm] </S3> [just er] i have this and i ran different tests <S3> mhm </S3> and er well there were some diagrams in this work and it shows that er well <NAME> coefficient would have worked but i er stuck to this <NAME> coe- coefficient and er it looked that it doesn't cut too er many , er right er pairs doesn't <S3> [yeah mhm] </S3> [throw out] everything but still there are no not many wrong pairs but some tests worked quite badly i don't understand why </S1>
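The evaluation S3 outlines, running a method without the links and then checking how many proposed links coincide with the gold standard, can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, pairs are plain (Finnish, Russian) tuples, and the example pairs in the usage comment are invented toy data, not taken from the dissertation.

```python
def evaluate_against_gold(proposed, gold):
    """Compare proposed translation pairs with hand-annotated gold pairs.

    Returns (precision, recall, correct_pairs):
      precision = share of proposed pairs that are in the gold standard
      recall    = share of gold pairs that the method found
    """
    proposed, gold = set(proposed), set(gold)
    correct = proposed & gold
    precision = len(correct) / len(proposed) if proposed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall, sorted(correct)


# Toy usage with invented, transliterated pairs:
# gold = {("talo", "dom"), ("kirja", "kniga"), ("talvi", "zima")}
# proposed = [("talo", "dom"), ("kirja", "zima")]
# evaluate_against_gold(proposed, gold)
```

Reporting both figures separates the two concerns S1 raises: a test that throws out too many right pairs has low recall, and one that lets wrong pairs through has low precision.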
<S3> mhm so and did you do the same with the other frequency band as well [(xx)] </S3>
<S1> [yes yes] later when i choo- chose this er er <NAME> coefficient i tested it er on a larger band and er by the way on this er first er band er before i started i had er an idea that er er low f- er well of course that low frequencies er low frequency words er makes no sense er to try to find the mutual information because of all the <S3> yeah </S3> if the word occurs once in er the finnish text and once in the russian text er it's not likely that they are equivalents of course even if they're in the same context , there could be er [just er (xx)] </S1>
<S3> [mhm well (xx)] </S3>
<S1> well that depends <S3> yeah </S3> by the way if it was a one sentence phra- a one a one word sentence and another one word sentence and they were aligned <S3> yeah </S3> er we we could er claim er that they're er equivalents although th- they're equ- equivalents in this context because er sometimes er no dictionary will er <S3> [okay yeah] </S3> [take it] yes and er , here . well (doesn't matter) </S1>
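S1's point about low-frequency words and mutual information can be illustrated with a small sketch. It assumes pointwise mutual information computed over aligned segment pairs, with a minimum-frequency cut-off; the coefficient S1 actually settled on is anonymised in the recording, so this is not a reconstruction of it. The function name, the `min_freq` default and the toy data in the test are hypothetical.

```python
import math
from collections import Counter


def pmi_pairs(aligned, min_freq=2):
    """Pointwise mutual information for word pairs in aligned segments.

    `aligned` is a list of (source_tokens, target_tokens) tuples, one per
    aligned segment. Frequencies are counted per segment. Words seen in
    fewer than `min_freq` segments are skipped, since a single
    co-occurrence gives a high score on almost no evidence.
    """
    n = len(aligned)
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in aligned:
        src, tgt = set(src_tokens), set(tgt_tokens)
        src_freq.update(src)
        tgt_freq.update(tgt)
        pair_freq.update((a, b) for a in src for b in tgt)

    scores = {}
    for (a, b), joint in pair_freq.items():
        if src_freq[a] < min_freq or tgt_freq[b] < min_freq:
            continue
        # PMI = log2( p(a, b) / (p(a) * p(b)) ), with p = count / n
        scores[(a, b)] = math.log2(joint * n / (src_freq[a] * tgt_freq[b]))
    return scores
```

The cut-off would also drop the aligned one-word sentences S1 mentions, so a case like that would need separate handling.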
<S3> okay so er we we're about half way through so do you think we should go . <WHISPERING> (xx) </WHISPERING> </S3>
<S2> <WHISPERING> (xx) </WHISPERING> </S2>
<S3> yes certainly er i'm just wondering (xx) not finished yet , okay er okay so here's the background to my next two questions in chapter three again <S1> @a-ha@ </S1> @i guess@ there is a discussion about the part of speech composition of the translation equivalents <S1> mhm </S1> that you found when you looked at distribution <S1> [mhm] </S1> [because] the other ones weren't all that interesting <S1> mhm </S1> the o- the ones that you found <S1> mhm </S1> with the er string string similarity <S1>  yes </S1> we've discussed that already <S1> mhm </S1> and you also talk a bit about the possible reasons for the predominance of nouns and adjectives among these er equivalents that you find <S1> mhm </S1> and and one thing i wonder i don't know if i- if this is relevant (at all) i think it's a bit relevant though <S1> mhm </S1> is er well first of all did you find many equivalences where the part of speech was not preserved across languages </S3>
<S1> yes er there were of course <S3> yeah </S3> because it's quite typical to translate er adjective with a noun into er from russian into finnish <S3> mhm  </S3> for example or as a part of a compound for example <FOREIGN> (xx) </FOREIGN> <FOREIGN> talviaika </FOREIGN> <S3> mhm mhm </S3> it's a compound or a er , whatever <P:06> and so on and so it's quite often when er it is not preserved and er in this [case mhm] </S1>
<S3> [are] are those given in your diagrams </S3>
<S1> er . no er i er i suggested that er the normal this translation equivalent pair er should be the one which could er be registered in the dictionary so if er the part of speech <S3> [okay] </S3> [is confused] it was considered partially correct </S1>
<S3> okay yeah i see <S1> it isn't [(xx)] </S1> [right] so i missed it er , no no that's a reasonable definition <S1> mhm </S1> er an- another question that i wondered a bit about because er it's true that you have a predominance of nouns <S1> mhm </S1> in your translation equivalents <S1> mhm </S1> but if i mean it's it's it's a kind of common experience in corpus linguistics that most words in corpora are nouns , so <S1> mhm </S1> and and so so so one thing that you wonder here what was the part of speech di- distribution among the original candidates , <S1> [mhm] </S1> [is is] is the part of speech distribution among the equivalents noticeably or significantly different from that because you i i i'll bet you that you'll find a predominance of nouns among the original 7290 words as well , well er i <S1> [it was difficult] </S1> [i i one (xx)] i i i want i want to keep high <S1> [@@] </S1> [@@] high stakes but <S1> [@@ right] </S1> [@@] er i i i i think i think that's true mostly that that that nouns <S1> mhm  </S1> predominate fairly heavily in in in any text almost any text </S3>
<S1> maybe maybe but er er er this er thing was er er at the present stage it was er quite difficult to check and er even er this er this information in the table on er page 162 is just er pleri- preliminary results <S3> mhm </S3> because sometimes er it was difficult to er find out what part of speech is er this er lemma er part er part of speech of the lemma <S3> yes </S3> as for example you have <FOREIGN> (xx) </FOREIGN> it could be both an adjective <S3> [yeah] </S3> [or a noun] depends on the context lemmatiser is context-free so we have problems here , and er er here so er it was er difficult to check this so i didn't even try but er my idea is er yes i think you are right and er that nouns are a lot but er er there are er in this statistics er i think adjectives are much less than they are in the <S3> okay </S3> text and verbs the besides with er verbs the results were especially poor i think because er of the aspect [there was there was a (xx) mhm] </S1>
<S3> [oh yes that's that's what you say and that's why it would have been very good to have the original (xx)] because that would strengthen your statement <S1> mhm </S1> if you could show that <S1> mhm </S1> verbs are 20 per cent of the <S1> [mhm] </S1> [original] but only six or <S1> [mhm] </S1> [whatever] per cent of the er equivalents found because otherwise <S1> mhm </S1> you you might think that oh maybe it was only six per cent verbs in the original which er which <S1> mhm </S1> is er not maybe not likely but conceivable anyway so so that it's always good to eliminate all other sources when you when you try to er explain so </S3>
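The check S3 asks for, comparing the part-of-speech shares among the equivalents found with the shares among the original candidates, could be sketched as follows. This assumes each lemma carries a single tag (which S1 notes is not always decidable with a context-free lemmatiser), that both lists are non-empty, and that every tag among the equivalents also occurs among the candidates. The function names are hypothetical and the counts in the test are invented toy figures, not results from the dissertation.

```python
from collections import Counter


def pos_shares(tags):
    """Relative frequency of each part-of-speech tag in a non-empty list."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def chi_square_goodness_of_fit(equivalent_tags, candidate_tags):
    """Pearson chi-square statistic for the equivalents' tag counts.

    Expected counts come from the candidate list's tag shares, scaled to
    the number of equivalents. A large value suggests the equivalents do
    not simply mirror the candidates' part-of-speech distribution.
    """
    observed = Counter(equivalent_tags)
    reference = pos_shares(candidate_tags)
    n = sum(observed.values())
    statistic = 0.0
    for tag, share in reference.items():
        expected = share * n
        statistic += (observed.get(tag, 0) - expected) ** 2 / expected
    return statistic
```

The statistic would then be compared with a chi-square critical value for (number of tags minus one) degrees of freedom; with small expected counts an exact test would be the safer choice.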
<S2> should we now er </S2>
<S3> yes so </S3>
<S2> er i announce er 15 minutes break , cu- coffee and tea are served on in the restaurant of the second floor <FOREIGN> viidentoista minuutin tauko, kahvia ravintolassa </FOREIGN> </S2>
<15 MINUTE BREAK>
<S2> <FOREIGN> väitöstilaisuus jatkuu </FOREIGN> defence of the dissertation continues </S2>
<S3> okay <COUGH> er so going on then , okay now we return to the matter of translationese and er one of the most interesting claims that you make in the dissertation says basically that there is no such thing or at least not to any very interesting extent such a thing as translationese in general you rather we should talk about particular <SIC> translationeses </SIC> or whatever er for specific target languages translated from specific source languages so well we i i mentioned this already so <S1> [mhm] </S1> [you know] what it's all about and you say that on er at least a couple of places in the dissertation as far as i can i can see , so one one thing i wonder because er even though you talk about this and and you discuss it to some extent , it , er given that there is some discussion about translationese and the literature on on on <S1> [mhm] </S1> [translations] and translation studies so so i i wonder how firm your conclusions are about this <S1> mhm </S1> and and what on what do you base them have you compared translations or <S1> [mhm mhm] </S1> [(xx)] and and and to what extent have you compared translations with <S1> [mhm] </S1> [each other] and with the target language so in in original form </S3>
<S1> [(xx)] <S3> [oh] okay </S3> well actually my research er is dealing only with er at least at this moment is dealing with translations from russian into finnish and i haven't got much material to com- for example english-russian or russian-english or finnish-russian translations er but er still i think that er this er , this er could be er regarded as at least as a good start er because er there are lot of there is a lot of research done for example during savonlinna project they were comparing translations from different languages into finnish and every time there were some er new different things found and different influences of course there are some general things which are typical for any translation er but er they're just er too general to get er some er practical er use of that so like er the dictionary isn't as rich as dictionaries more standard than er the language of er language of er translations as more standard than language of original texts and so on but er they're just general tendencies and for the given pair of languages you might get er some more exact information or precise information and it's more interesting to study <S3> mhm </S3> this assumption </S1>
<S3> okay , er but you've been able to compare for instance translations into finnish from i mean i in a i in in not not i mean a in detail but in a general sense <S1> mhm </S1> translations into finnish from english with your <S1> mhm </S1> with your transformatory translations in that sense but with the translations that you've been working with <S1> mhm </S1> and and and you find a small overlap is that <S1> mhm </S1> is is is is that how to interpret what what you said <S1> mhm mhm </S1> okay er <COUGH> , when er what one , i didn't mention this at the beginning but one of the areas apart from corpus linguistics that i've been working in is er well the general area of computer assisted language learning utilising language technology and in that context you often look at er learner language learner corpora for instance <S1> [mhm] </S1> [there are] there are there is a number of learner corpora and and er now that you state that translated language really is only interesting well i'm i'm i'm now putting your st- this statement in <S1> mhm </S1> maybe a bit more in in a little bit more firm words than you would anyway it's it's language er , y- y- y- y- y- you should look at a particular language pair and and a particular direction as well and and er in in this connection you come to think about the idea of transfer in language learning for instance <S1> mhm </S1> which which has been debated that's true in the the second language acquisition community but the at least some researchers think that there is some truth to to or or some content in the in the notion of transfer , so er is there a relationship between translated language and learner language i mean what w- w- if if you looked at , because i also happen to know that you teach russian to finns <S1> [@@] </S1> [@@] which which makes the question even <S1> [yes mhm] </S1> [more relevant] it it i- i- d- d- do you find any commonalities or are [those def- definitely] </S3>
<S1> [yeah there should be] </S1>
<S3> are are are are they completely different or </S3>
<S1> mhm mhm yes at least er one thing is er clear about it is that there is certain thing in common between learners' corpora and er translation corpora is that in both cases er language is influenced by another language difference well the reasons are different so er for example er translators of course they translated er their in their er in most cases they translated to their er own language into their mother tongue er and er er this er influence of the er source language is er a little different but it would be interesting to compare and er but what pair what kind of er , well finnish learner , for example if er er russian students study finnish the finnish language and compare it to the finnish of the finnish translators [translating from russian] </S1>
<S3> [yeah certainly that would be the] </S3>
<S1> it would be quite an interesting turn of the research and actually i myself didn't think about it it would be , [(xx)] </S1>
<S3> [of course yeah] of course it it it's it's not really relevant that you teach russian because you're looking at finnish [certainly i'm i i i was i i i made a wrong correction here yeah] </S3>
<S1> [yes but actually i think there should be] there should be something but quite a different er types of interference (xx) the first one is er just mistakes which are done or over-presenting of certain features like er using er certain er tenses er more often <S3> mhm </S3> or using some cases in the wrong way or er but actually it would be easier to predict er what kind of interferences is er in the translation <S3> mhm </S3> yes </S1>
<S3> er that actually brings me to a question that i didn't put on the slide which occurred to me right now and er <COUGH> which also is er . when when you look at learner corpora <S1> mhm </S1> then one one of the things that you can do , which you couldn't do earlier is look at things like tendencies <S1> mhm </S1> which which i i thank you for mentioning here you you can look at learner language and see how is it different not not at the errors because errors that that's one thing that you can find as well but you can look at learner language and see how are there tendencies which you can discern when you compare it to native language and and you and you might see that er . the er s- learners of english having swedish as their native language tend to overuse er let's say phrasal verbs or something like that <S1> mhm </S1> it's phrasal verbs are not wrong <S1> mhm </S1> they they're valid english <S1> [mhm] </S1> [constructions] or mi- they they you might find if you if you have a way of looking at syntax you might find that they they less often have a subject in the first position in the sentence than the native speakers because in well swedish is a V-2 language like <S1> mhm </S1> german is a (xx) you can move anything into the first position and so on which which you can in english as well but you don't do as often and so on <S1> mhm </S1> and er <COUGH> well o- one thing that occurred to me was that er when you when you look at , translated language in your corpus it seems to me that you m- please correct me if i'm wrong that you're you're being atomistic still you're not looking at tendencies you're looking at individual lexical items you're not looking at say er er usage acronyms and things like that <S1> mhm </S1> is is that something that you plan to do @@ to put it , to ask in a nice way </S3>
<S1> well sometimes later maybe </S1>
<S3> because that that's that's that that would bring it closer to the way that learner corpora are are investigated </S3>
<S1> there's one thing that i i was un- unhappy about in my work is that er they're just general er or just er cases er this word and that word and then <S3> yeah  </S3> i i think in general it's to some er extent the problem of er corpus linguistics er as such so if er just er . in er many works er just er certain examples are studied <S3> [mhm] </S3> [lots of] them lots of empirical material but no <S3> mhm </S3> er general theories er <S3> yeah </S3> arrived at , and so i think it is an important @thing to do@ </S1>
<S3> yeah yeah and and i think actually you've pointed to something that has to do with my first well the second <S1> [mhm] </S1> [actually] slide where the the that kind of corpus linguistics that you are talking about er suffers from a lack of language technology i would say <S1> mhm </S1> that's why you look at at lexical items and <S1> mhm </S1> so because you don't have the tools for doing for instance syntactic analysis <S1> [mhm] </S1> [on any] level which you would need <S1> mhm </S1> in order to look at syntactic patterns obviously <S1> of course </S1> and and and i i usually try to , try to kind of , er push for the , a greater use of of of these <S1> [mhm] </S1> [language technology] tools in in that kind of corpus linguistics as well and for instance <S1>  mhm </S1> investigating learner corpora and investigating translation corpora and so on so , er that's that's that's what we should do i think [(xx)] </S3>
<S1> [yes it's it's besides (xx)] this lack of tools and <S3> yeah </S3> is the there are lots of things which er should be done but er cannot be done at the present stage and finally there are just word studies sometimes it is possible to catch er some grammar form like comitative <S3> well </S3> it was possible because it's easy to define with er russian it is more difficult because er certain er endings er could be anything </S1>
<S3> yeah now one of my points is that these tools actually exist <S1> mhm  </S1> it's it's only from from the the way that the <S1> [mhm mhm mhm] </S1> [corpus linguistics field] looks at er the right hand end really doesn't know much about what's going on at the left hand end that's that's what happens and the <S1> yes </S1> left hand end er er some of these tools are developed for quite different purposes than than doing linguistic investigations but it could be applied very well it's not , it's not painless i i'm i'm not saying it's painless @@ <S1> @yes@ </S1> <COUGH> because the- the- they're often very as as far as user interfaces and so on are concerned the they're they're often very immature the <S1> mhm </S1> their research tools <S1> mhm </S1> built by com- computer scientists and <S1> mhm </S1> it's , well <S1> yes [(xx)] </S1> [you know about] these things </S3>
<S1> to some extent it could be a problem that er er on one er side are er linguists know quite er little about er computational technologies and er sometimes they even can't program on the other er wing are er er mathematicians and programmers who work with language but know nothing about linguistics <S3> mhm yeah </S3> and er , <S3> [well] </S3> [so we] have to develop something in the middle </S1>
<S3> yeah well it's er that's quite true , alright so let's go on er you said er and er , well that er translators tend to the dictionary (xx) or at least i read you as <S1> mhm </S1> stated that and and and actually in your in in your introductory lecture you partly answered this question already but but i'm putting it to you anyway so , er if translators are dictionary-bound and this is a matter of methodology to some extent again it's a hen and egg thing actually which is which is ultimately a matter of methodology , er , wouldn't parallel corpora if they reflect this dictionary-boundness in translators be a , be a bad basis for new and better dictionaries because it would kind of be be reproducing this <S1> mhm </S1> reliance on older dictionaries <S1> mhm </S1> by looking in parallel corpora for for translation equivalents which <S1> mhm </S1> which i'm i'm i'm i'm putting this again in a bit <S1> mhm </S1> simplified terms @@ <S1> @@ </S1> bu- but or or how could these parallel corpora play a role in dictionary making i mean it's or or w- w- what would you do instead </S3>
<S1> mhm well my idea is that er er it isn't possible to use parallel corpora as the only resource <S3> mhm </S3> if you take just parallel corpus and use the equivalents which er translators use we might get some very reliable data and some quite unreliable data and something will be quite wrong er but er if we also use er monolingual corpora large monolingual corpora of two languages and er <S3> mhm  </S3> er check on them for example if we have er er this er <FOREIGN> (xx) </FOREIGN> word <S3> mhm </S3> we get er equivalents like er <FOREIGN> tila väljä </FOREIGN> and so on <S3> mhm </S3> er and then er we could er check er the usage of this word on a large finnish monolingual corpus and er find out er the functioning of these words do they suit as equivalents and so on and this way it might correct some of the skewedness of er parallel corpora , so er general idea is that er it couldn't be the only resource and another thing is that not everything could be caught by parallel corpora because there are some things er some words you need in er everyday speech but never occur which never occur in translations there are some genres or types of texts which are never translated but still they exist and they should be included there er so i think er in this way </S1>
<S3> mhm yeah i'm glad that you mentioned the possibility of using kind of <S1> mhm </S1> , original corpora in both languages because that's er i i i i think that's a promising way to and i and i believe even that using lan- language technology to some extent automate that process that is finding the contexts of the <S1> mhm </S1> terms that you're looking for <S1> mhm </S1> or find terms that used in similar contexts <S1> [mhm] </S1> [in that] language i i i don't i don't know how it's how you're supposed to do it but you can see that it's it it might be possible it's an interesting topic for research anyway </S3>
<S1> well one one of the people in the audience is working @on the problem@ and might <S3> mhm </S3> give us an answer some time ago <S3> okay </S3> in some years . so actually this er multilingual corpora er well some some scholars call them comparable corpora <S3>  yeah right </S3> some call comparable corpora something like <S3> [yeah i know] </S3> [savonlinna corpus] so there are some @lots of er problems with terminology@ but actually they should be er viewed (at least) as another resource which could be <S3> [mhm] </S3> [very] useful and helpful </S1>
<S3> yes and now we come to the the really difficult questions . what would you do differently if you were to do this work all over again [@@] </S3>
<S1> [@@] well er first well one thing i i regretted about is er well the vastness of the field so er i er had to do many things and compile them into one book and er lots of er questions remain questions and er no answers were got and so on so perhaps er i would if i would have been starting it all over again i would er , er take er i wouldn't take literary texts i would have taken for example er some j- like students' translations we have er a lot of students at the department er they're translating from russian into finnish from finnish into russian and er there are er third year students translating there are fourth year students translating there are students er , are close to their master's degree and they do er translating so er it would be er quite possible to get er quite an interesting material on er again it would be a sort of learner parallel corpus <S3> yeah would </S3> a new type of corpus <S3> mhm </S3> and would be er quite easy to er deal with er , copyright problems , by the way <S3> yeah yeah [(xx)] </S3> [it's quite easy] er just the form and nothing more er and er er er size of the material wouldn't be as large as here it would be quite a small corpus which would be er possible to operate possible to do all these things which i did but er also might be possible to get a tagged version and er to er do a deep research that's i think er what i could have done </S1>
<S3> done instead yeah but these these are finns learning russian so they </S3>
<S1> both we have both groups </S1>
<S3> russians learning finnish as well oh </S3>
<S1> there are there are russian students some of them present here @@ there are finnish students also </S1>
<S3> oh it would be a very [(xx)] </S3>
<S1> [yet so] we could get very valuable er information er </S1>
<S3> oh yes certainly </S3>
<S1> in written or in er interpreting as well . <S3> yeah </S3> yeah but it would have helped er the department to some extent [(xx)] </S1>
<S3> [but this] has helped the department as well , don't be too too modest <S1> [@@] </S1> [here] er and then my last question , but i mean now now you've veered off in a completely different direction but at at the end that is in the in in your answer to the previous question which wasn't even on the map of the dissertation but at the end of the dissertation or at the end of the of the main text anyway you you say that there are some some things that you could do to continue the work that you started in the dissertation er what would you do first that's that's the essence of the (applying) </S3>
<S1> well one thing is er already been done so we have a student who is er working now as a <S3> yeah </S3> research assistance er assistant at our department er and er he is marking transitive verbs in the lemmatiser <S3> okay </S3> and so we er because er i er i've written about er the problems which er i get er well in this er in finnish lemmatiser er i initially had only grammar tags only grammar markers <S3>  mhm </S3> er and it is difficult from the grammar marker er even to er decide what part of speech is it because in finnish er for example nouns and adjectives all belong to the same well there are in every class everything can be found like er <FOREIGN>  talo </FOREIGN> and <FOREIGN> jalo </FOREIGN> <S3> mhm </S3> or <FOREIGN> lyhyt kevyt ja er kevät </FOREIGN> for example in the same class and so on er <FOREIGN> olut </FOREIGN> and er <FOREIGN> kevyt </FOREIGN> er and so er he's now er he did some er marking of the adjectives and er nouns and now there is another challenge er transitive verbs it is very important for er for example er er , age- er these er agentive er participles this <S3> [mhm mhm yeah] </S3> [<FOREIGN> agenttipartisiippi </FOREIGN>] because it can be formed only from er transitive verbs like <S3> okay mhm mhm </S3> <FOREIGN> kantama lukema </FOREIGN> for example er so er i think er i have to continue works on this er lemmatiser and er both at the same time on russian lemmatisation and er work on er context sensitive er lemmatisation test some methods <S3> [mhm] </S3> [er] and er improve er the analysis er itself er that could be and another thing is er sentence based alignment <S3> yeah right </S3> . and many many more @@ </S1>
<S3> alright that's <S1> [@@] </S1> [(what you)] first yeah i know it's (xx) . okay that was the last question i have a few summing up or winding down slides as well so well i'll just </S3>
<S1> @stand up position@ </S1>
<S3> okay </S3>
<S1 AND S3 STAND UP>
<S3> so and this this is actually a kind of continuation of the first , first er review of the dissertation but now it's against against the background of of all this questioning and <S1> mhm </S1> answering going back and forth erm so what i would say as a kind of summary and and i've written it here too that that there's a really really impressive empirical part of this that is the third part so and and there there's so much more you can do with it as well i mean you you you've kind of shown the way , or there are many many interesting investigations that you can that you can carry out in translation studies in lexicography and if you ever start collecting these essays written by st- russian and finnish students then it's also applied linguistics definitely , er you have this unique parallel corpus resource wh- wh- where i really hope that you'll be able to resolve the intellectual property er issues so that it can become generally available at least in for the research community er er bec- because i mean there there are many many uses that you can put it to <S1> mhm </S1> if it's generally available <S1> mhm </S1> even even not foreseen by you or me or by anybody else <S1> mhm </S1> but the- it must be available , er there is one thing though and and we actually discussed this before before this defence that the resource will be more useful i already mentioned the (xx) publications but but i i also think it's it's pretty urgent that you try to make it available in some kind of standard format mm i mean it's it's one thing that that your database your program has an internal representation that's much more efficient <S1> mhm </S1> but you should consider being able to export <S1> [mhm] </S1> [corpus data] into for instance well this is the <COUGH> this is the corp- corpus encoding standard <S1> mhm </S1> in in XML this is the text encoding initiative which are both kind of standards for representing corpus materials and er even though you er you say something 
about standards you you quote <NAME> on page 72 <S1> mhm </S1> who is in gothenburg as well <S1> [mhm yes @@] </S1> [actually] er but but i think there there are two kinds of standards in in this work there there is the structural mark-up <S1> mhm </S1> which which is pretty well established and and which is mainly what what these address i mean things like like er text headers where does the text begin wha- where are the paragraphs <S1> mhm yes </S1> where where are new pages <S1> mhm </S1> and we- y- a- and then there's the issue of of er well linguistic annotation which is much more contentious you might say it's <S1> mhm </S1> it's er so so so you but but but you should go at least for the first kind of mark-up <S1> mhm </S1> because that's that makes for easier exchange of resources <S1> yes </S1> so so that's a kind of advice , er what's the third one then yeah i don't know if well you you know about the taj mahal i guess <S1> mhm </S1> which is er er a grave actually in agra and the surrounding area which is where taj mahal is located there is a kind of a er local industry in producing this kind of decorated white marble which which the taj mahal is built from as well so so you find white marble inlaid with various precious and semi-precious stones and things like that which which is very very beautiful and very very expensive and and if you ever have the chance to look into the places where they make these things you notice that there is no there is no high technology involved it's good old-fashioned handicraft and and and my point with this is that i i told you several times that the kind of language technology in here is is a bit old-fashioned and so and so on <S1> mhm </S1> but that doesn't really reduce the beauty of the result , you can do the same thing by hand the the thing is that that when when you when you want to go industrial you might want to consider other ways of producing the same thing if you're not selling marble table tops to tourists at 
10,000 dollars but rather want to sell them to everybody in the local k-mart or whatever <S1> mhm </S1> at at ten dollars then you need other means of production so that's it i guess i think this was the last one yes that's my summing up of it </S3>
<S1> well i would like to thank thank my right and honourable opponent professor <NAME S3> both for evaluating my dissertation on the stage of er evaluation and for giving positive er evaluation er positive statement and for he agreed to read it in russian as it was i guess that er you spent a lot of time reading it and it's more difficult to read than than swedish or english er and er i thank you for very valuable and useful remarks er which er put er of course er use in the future thank you . well and it's not yet over i now invite er those members of the audience who may wish to bring up any comments or questions concerning my dissertation to request the floor from the <FOREIGN> kustos </FOREIGN> </S1>
<P:13>
<S2> <FOREIGN> väitöstilaisuus on päättynyt </FOREIGN> the defence of the dissertation is now concluded </S2>
