DD: Damien Deighan
PD: Philipp Diesinger
PK: Philipp Koehn
DD: This is the Data Science Conversations Podcast with Damien Deighan and Dr Philipp Diesinger featuring cutting edge AI and data science research from the world’s leading academic minds, so you can expand your knowledge and grow your career. This show is sponsored by Data Science Talent. Welcome to part two of our conversation with Professor Philipp Koehn, who is one of the world’s leading authorities on machine translation and is currently a professor at Johns Hopkins University. In this episode we delve into the commercial applications of machine translation. We talk about some of the open source tools available and we also take a look at what to expect in the field in the future.
PD: You’ve talked about translations from an input language to an output language. How about translations within a given language, are those a viable topic as well? Say between different accents, or from a jargon like medical English to more understandable casual English?
PK: There is a whole… it’s not as big a field, because it’s not clear what the applications really are. So, what you describe there is maybe an application, but it’s not really as big and pressing an application as machine translation is. This is called paraphrasing: how can you reword things? There’s a group here, Transform Ganza, that spent a long time working on paraphrasing and built up a big paraphrasing database. They built a new paraphraser now that actually works like a machine translation system. So the whole task is then to translate from English to English, and basically it is trained as a multilingual machine translation system where you just throw in all kinds of language pairs, so it learns to map between any languages, and you can also force it to learn to map between English and English.
There’s no real training data in that space actually; there’s not much text written with the same meaning. The only thing I can think of is maybe different translations of books that exist. In literature they often go very far from the original meaning. The goals are things like style, levels of formality and readability; that’s probably the dimension that does make sense. So, there’s the Simple Wikipedia, and maybe you can train from complex Wikipedia to Simple Wikipedia. It’s a much tougher field: A, it’s not quite clear what the application is and B, there’s no ground truth, so it’s much harder to evaluate.
PD: What typical data sets are you using to train your machine translation [inaudible 00:02:34]
PK: Everything we can get our hands on. So, when I started in machine translation, the group I was with worked on Arabic and Chinese. I didn’t like that because I couldn’t read either and I wanted to work on German-English, and I came across the fact that the European Parliament has its public debates out on the web. And they translate them into all the official languages, which back then were 13 different languages; now they’re 28 different languages. So you can just download all the webpages. They’re automatically marked up: here’s speaker so and so and he says this. And then you have the French translation of the same speech, and so it’s very easy to figure out which blocks of text belong to which blocks of text. It’s not that hard to then break it down to the sentence level.
So, that is a big data resource that was used for a long time; it’s about 50 million words of translated text for all the big, official EU languages. And a lot of the other publicly available datasets are similar. There is a big United Nations corpus but it’s only for Chinese, Arabic, Spanish, French, English. One interesting corpus is OpenSubtitles. People currently like to translate; I think this also comes from the days when first you pirate all the TV shows, then you want to watch them in Chinese but you don’t understand them because it’s all in English. So, people actually first create subtitles and then translate the subtitles. So, there’s actually a vast reservoir, in the order of hundreds of millions of words, of translated subtitles.
For a lot of the languages, quality is a bit shady. So for this kind of material, we have a big project right now, running now for three or four years, still with the University of Edinburgh where I was before, and other groups in Spain, you know, [inaudible 00:04:22], where we go out on the web and crawl any website there is. This is something Google has been doing since the very beginning when they got engaged in machine translation. They had the advantage that they had already downloaded the entire internet because of their search engine. We were always a bit envious: they do better because they have more data. And at some point, I thought we should stop with the whinging. We have access to the internet, we can download everything, so that’s what we did. We just downloaded hundreds of thousands of websites and tried to find translated text on them.
So, that’s another large resource. So you usually find some goldmines of good data, where text is consistently translated and nicely formatted, and you know which source text matches which target text. Or you just go out on the web and crawl whatever you can get. The amounts of data we’re talking about there are, for the biggest language pairs, in the order of billions of words; that is probably more than you can read in your lifetime. But then it does taper off towards the next 50, next 100 languages, where there’s just not that much at present at all.
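[Illustrative sketch, not part of the conversation: the block matching described above for the European Parliament pages can be approximated in a few lines of Python. The function names, the (speaker_id, text) turn format and the sample sentences are assumptions made for this example; it is a minimal sketch, not the pipeline used to build the actual corpus.]

```python
import re

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def align_debate(source_turns, target_turns):
    """Pair speaker turns by position, then pair their sentences one-to-one.

    Each turn is a (speaker_id, text) tuple. Turns whose speaker ids differ,
    or whose sentence counts differ, are skipped rather than guessed at.
    """
    pairs = []
    for (src_speaker, src_text), (tgt_speaker, tgt_text) in zip(source_turns, target_turns):
        if src_speaker != tgt_speaker:
            continue
        src_sents = split_sentences(src_text)
        tgt_sents = split_sentences(tgt_text)
        if len(src_sents) == len(tgt_sents):
            pairs.extend(zip(src_sents, tgt_sents))
    return pairs

# Hypothetical sample data, invented for this sketch.
english = [
    ("speaker_1", "The debate is open. I give the floor to the rapporteur."),
    ("speaker_2", "Thank you, Mr President."),
]
german = [
    ("speaker_1", "Die Aussprache ist eröffnet. Ich erteile dem Berichterstatter das Wort."),
    ("speaker_2", "Vielen Dank, Herr Präsident."),
]

for src, tgt in align_debate(english, german):
    print(src, "|||", tgt)
```

[Skipping turns with unequal sentence counts keeps the sketch simple but throws data away; dedicated sentence aligners, such as length-based methods, handle one-to-two and two-to-one matches instead.]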
PD: You mentioned using the parallel data from the European Parliament. Is that language not very specific, and would it not have an impact on the translation?
PK: It depends on what the application is. We’ve been organising competitions for machine translation in the academic community for the last 15 years. And there we use news stories as a test set, because it’s tough. I mean it has a very broad domain. It can talk about NMB Sports, it might be natural disasters, might be political events. And it’s relatively complex language; average sentence length is like 25 or 30 words. For that, we found the European Parliament proceedings very useful because they talk about the same subjects, but it’s a particular type of language.
Another area people currently get very excited about is translation of speech and spoken language, and spoken language is very different from written language. Even parliament proceedings are formally spoken, except that they are then transcribed, all the disfluencies and non-grammaticalities are removed and it’s cleanly formatted text. Often it’s just direct speech, so there is a real mismatch between the data you need to train those engines and the data you actually have from official publications. And that’s a real problem.
PD: What role does infrastructure and technology play for machine translation?
PK: So, it’s been, for a while, pretty compute heavy because we’re talking about data sizes in the gigabytes as training data. In [inaudible 00:07:00] I think we were always in the situation where, as I always like to say, a grad student at a university could do meaningful research. They had access to all the data; all the data is publicly available. There was a lot of open source code so you can actually just download the software, run it and then work on improvements. And the machines you needed were just typical modern computers and a single machine was good enough.
This did change a bit with the advent of neural networks, because suddenly you need GPU servers and this seems to be a field where just by throwing more compute resources at it you actually do get better results. You can build more complex models. You can measure the complexity of neural networks by how many layers they have, so you can build models with five, six, seven layers but you can also build models with twenty layers, except they train many, many times slower. For the big language pairs where you do have a billion word corpus, yes, training does take weeks.
It’s become a little bit of a problem, because now suddenly we, in academia, are struggling a bit with competing with the big industry labs, which easily have a thousand GPUs. What does a GPU cost by itself? $1,000 or $2,000. You need to put it in a computer, so a computer with four GPUs costs $10,000. So, academic institutions, I mean we have a pretty big lab here at Hopkins because we have the Center for Language and Speech Processing; it’s just a large research centre. But even so, we have about 100 or 200 GPUs available for 50 PhD students. And that’s a crunch, they’re always overloaded.
So there is certain research we cannot do in academia anymore because we just can’t run the amount of experiments that can be run in industry. One thing I read somewhere in a paper really threw me off. It wasn’t in machine translation, it was in language modelling, where it was like, oh yes, this model trained in a week, on 1,000 GPUs, and I was like, oh my god, that’s just the end of it. And there’s this paper, GPT-3, that maybe some listeners are familiar with; that’s a big language model trained on massive amounts of data. I tried to compute how many machines they used but it’s in the order of 50,000 GPU days. And that’s just insanity. Even people in the industry were like, wow that’s insanity.
But even there they showed it hasn’t converged yet, and hasn’t finished yet. If you would throw more compute at it, and throw bigger corpora at it, you would still probably get a little bit better. So, yes, we currently are definitely in a situation where compute is a limitation and what you can practically do is limiting what kind of models you can develop.
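[Illustrative sketch, not part of the conversation: the compute figures quoted above can be put side by side with some back-of-the-envelope arithmetic. The numbers are the rough ones mentioned in the answer; the function names are invented for this example, and the calculation assumes perfectly linear scaling and exclusive use of every GPU, which real clusters do not deliver.]

```python
def gpu_days(num_gpus, days):
    # Total compute, measured as GPUs multiplied by days of wall-clock time.
    return num_gpus * days

def wall_clock_days(total_gpu_days, num_gpus):
    # Idealised: assumes the job scales linearly across all GPUs.
    return total_gpu_days / num_gpus

# "This model trained in a week, on 1,000 GPUs."
week_on_industry_cluster = gpu_days(1000, 7)
print(week_on_industry_cluster)            # 7000

# A run "in the order of 50,000 GPU days".
large_run = 50_000
print(wall_clock_days(large_run, 1000))    # 50.0 days on 1,000 GPUs
print(wall_clock_days(large_run, 200))     # 250.0 days on a 200-GPU academic lab

# Hardware cost of 1,000 GPUs at roughly $10,000 per four-GPU machine.
machines = 1000 // 4
print(machines * 10_000)                   # 2500000
```

[Under these assumptions, a run that occupies an industry cluster for under two months would monopolise a 200-GPU lab for most of a year, which is the gap being described.]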
PD: How does academic research translate into industry applications?
PK: There are different motivations for researchers in academia. Ultimately students and researchers work on what’s fun. We are a little bit guided by big funding projects. In the US, DARPA has been funding machine translation for quite a while and they’re interested in, basically, understanding foreign language text and more recently also in the languages that are not covered by Google. So in the last project I was involved in, we had to translate from Somali and whatever, Ethiopian languages. Whatever languages Google Translate doesn’t exist for. So, that drives some research. Since we’ve been organising this shared task on news translation, that drives a lot of research; that’s kind of the dataset that scores around.
Generally, in machine translation research, academics are not that concerned about the end application. And there are quite different end applications. So, you mentioned Google Translate; that’s the challenge of translating web pages into something understandable. The bar is actually not that high because it just needs to be understandable. It can have mistakes, that’s fine, but it just needs to be understandable. Facebook has a similar problem: people post stuff in different languages and the translation needs to be understandable. They have a bit of a tougher time because what people write in Facebook messages is not as polished as what people write on webpages. So you have creative language and made up terms and acronyms and emoticons all over the place and a lot of inside jokes in the language; that’s just really hard to translate.
The big commercial application for translation is actually completely different. This is mostly companies who want to globalise their products: they have to translate marketing materials, they have to translate documentation of products. At Omniscien Technologies, one of the big areas they work on is the translation of subtitles, so movies and TV shows. So, if you want to sell Hollywood TV shows into the Indonesian market, then you at least need to have Indonesian subtitles. That quality bar is much higher, because you’re delivering translations to someone who just expects to read it without problems and if there are errors, they’re going to be annoyed. They’re not going to be appeased by saying, okay that is machine translation, deal with it.
So, this is still something where having humans involved, nowadays mostly post-editing the machine translation, has become the practice in the field. Not just translating from scratch, but post-editing machine translation is a key part of it.
PD: Could you give our audience a brief overview of what’s available in terms of open source in the field of machine translation?
PK: We have a very nice culture in machine translation of having things publicly available. You can literally go to a website without even registering for it. A really good website for that is OPUS, from [inaudible 00:12:44] at the University of Helsinki, who basically collected all kinds of parallel data for all kinds of language pairs. So, data is publicly available and tools are also publicly available. So, in [s.l. statistic MT 00:12:53] I was leading this Moses machine translation project, which was the dominant tool for machine translation back then. Nowadays there are several implementations of neural machine translation; it seems like every year or every half a year a new one pops out and they’re all publicly available. So, literally the system that Facebook uses in their research, in their AI lab, which is called fairseq, you can actually download the entire source code and run the same code on your machines.
To temper a little bit this impression that, because of the resources used, academia cannot do interesting stuff anymore: I don’t think that’s literally true. I think there are certain things we can’t do, like high resource, massive explorations of different model architectures; that is kind of off limits. But 90% of the good ideas still come out of academia. And it’s just a much broader field of people and a much higher pressure to generate papers and new ideas and publish them. And it did shift a little bit, in that most of academia nowadays works on low resource languages, language pairs where you don’t have much data, and, I mean, still all the problems that exist, for instance shift of domain and subject matter and style and so on. And different ways to train it. Because there’s not just parallel data: what can you do with monolingual data? There’s a whole exciting area of machine translation that asks, can we learn to translate if we actually don’t have any translated text? Here’s a pile of English, here’s a pile of German, can you learn to translate from just seeing that? And it seems to be somewhat successful. It’s not as good as having translated text but it’s there.
Anyway, there’s lots of other kind of challenging ideas and interesting ideas out there.
PD: How far do you think machine translation is away from passing a Turing test?
PK: I think ultimately, on achieving flawless translation, I’m not making any predictions about whether we can or cannot reach that. Machine translation has some interesting history of overselling and under-delivering and going through various hype cycles. Being very aware of that, in the statistical machine translation days we were very careful not to promise too much. Then all these deep learning people showed up and they just made extravagant claims about parity with human translation and so on, and everybody else was cringing.
So, I think a good measure of machine translation is always, is it good enough for a particular purpose? So, if I go to a French newspaper website, and there’s a French newspaper story about whatever President Macron is doing, and I run it through Google Translate and I can perfectly understand the story. Maybe there is some detail here and there missing, and maybe it’s an artefact of me not knowing too much about the intricate system of French politics. But that’s good enough. If I want to buy a metro ticket in Paris and the translation of the website allows me to buy it, or if I can go to a foreign country and order a pizza or whatever, or ask for directions or have a conversation with someone even, it doesn’t have to be perfect, it has to be good enough.
That’s one measure. The other measure is, does it make professional human translators, who ultimately are going to produce high quality translation, more productive? So, if you can make them twice as fast, that saves an enormous amount of money. And that’s always kind of the measuring stick; it’s not like either you solve the problem or you don’t, and before that it’s useless. It’s more like the better it gets, the more users there are for it and so on. But ultimately, we’re dealing with language and it is an AI-hard problem. So if you could actually do translation perfectly, you could just construct any kind of intelligence test as a translation challenge and just basically write a story in a way that you can only translate it correctly if you understand the deep meaning of the story.
And you can always check it. I have this example: whenever I visit my uncle and his daughters, I don’t know who is my favourite cousin. They are the daughters of the uncle, so the cousins are female. Cousin in English is not gendered, but if I translate that into German, I have to pick a gender for that. And from the story it’s clear that they’re female cousins.
This kind of world knowledge that is required to do this reasoning, I mean that’s kind of the deep AI knowledge that we don’t have right now and we don’t have now in our neural models either. So, any claims currently of human parity are questionable. And maybe one reason why people make, or can make, these kinds of claims is because of, what are you actually comparing against? You don’t compare against the ideal of a human doing a perfect translation; you compare against some crowdsourced translator who didn’t really care, who just ran things through Google Translate, fixed up some words and submitted that.
So, yes. We may be able to beat that, but we’re not close to perfect translation and we don’t have to be. And also, I think we started with that, it’s an impossible task anyway. So, no matter what translation you produce, there’s always going to be someone that says, no that’s just not right, there’s a mistake and I don’t like that.
DD: These are kind of unrelated topics, but looking to the future, so Elon Musk reckons that he’ll have a neural link available to be implanted in people’s brains within the next 18 to 24 months.
PK: Is he also promising self-driving cars every month and then it never happens?
DD: Five years ago, yes. So, maybe this is part of the hype cycle, but do you ever see a situation where we do have the microchips in the brain and we can literally download French straight into the human brain? Then all of us are out of a job, especially you, Philipp?
PK: Oh, I’m not a neuroscientist, I’m not sure exactly how much you can manipulate the brain. Arguably we already are a kind of hybrid cyborg: we all have our cell phones, we don’t remember phone numbers any more. Or meeting schedules. The phone beeps and says, okay I have to talk to this person now. So, we already offloaded some of our memory capacity to machines. If you look at, I don’t know, back to machine translation, if you look at the process of, you know, people post-editing machine translation, that is almost a man-machine interaction that’s taking place. But yes, I’m not a big believer in microchips in the brain doing anything useful anytime soon.
DD: Your recent book that was published in June, at least in Europe, it came out in June this year, Neural Machine Translation, do you want to just basically highlight who that’s suitable for and what you’ve attempted to do with that book?
PK: Yes, so that book came out now exactly 10 years after my first book, which was Statistical Machine Translation. So, this is Neural Machine Translation and it kind of emerged out of, yes, I need to update the book because now all these neural methods are out there. But now neural methods have completely overtaken the field, so that everything we’ve done before is not interesting anymore. So, the book starts out as a general introduction into what the problem of translation is, the history of machine translation, the history of neural network research, what the problem is, what the applications are. The core of the book is kind of the core technology, how it works, and there it’s kind of directed at graduate students.
I’m teaching a class now at Hopkins here, in the fall, on machine translation and that is going to be the text book for the class, where every lecture is one chapter. So, it’s geared towards people who actually implement the code and understand the code and understand the models mathematically and so on. And it goes kind of in the later chapters to all the open challenges, some of which we address, like adapting models to different domains, how they actually represent words as individual [inaudible 00:21:10] maybe character sequences and so on and so on.
There’s also one chapter which is something we didn’t talk about: how do we interpret the model? So, we build all these very complicated neural machine translation models, and we don’t really know how they do it. And I’m still amazed that they work at all, given that they’re really guided by very simple principles, but how they do it is one of these big questions. Yes, you can look at the models but all you’re going to see is millions of numbers, so that’s not very informative. So how can you maybe probe them, can you know what is in these different representation states? So it goes through all these layers of representations. Do syntactic representations emerge? Do semantic representations emerge?
What would be especially useful, if it makes mistakes, is to figure out why it made the mistake and how can I fix it? And we don’t really have any good answers to that right now. There’s a whole interesting subfield on interpreting these models, so I’m talking about that a little bit in the book too. How to visualise some of the internal processes, or how can you understand the internal workings of these models.
DD: What are the implications for us going forward if we don’t figure out what the black box is doing or why it’s doing it?
PK: You can have a very cynical take on the whole thing. There’s certainly people in the field of natural language processing, and it’s either called natural language processing or computational linguistics, so they’re somewhat synonymous but the intent is different. Computational linguistics ultimately has the idea that we’re going to learn something about language. So, we seem to be, I mean it’s putting it a bit crass or extreme but maybe they say, we’re almost close to solving machine translation and we haven’t learned anything about language. There’s something going on inside these models that does language well enough to be effective. But we actually don’t know. We don’t suddenly know what syntax is, and grammar and morphology. We didn’t learn anything about it. It’s just, we shove data into a black box and it spits something out. What’s going on inside is actually a big mystery to us. That’s a sad note to end on, isn’t it?
DD: Well I don’t think it is necessarily sad, it’s probably a field all of its own.
PK: Yes, it’s putting it a bit extreme. I mean obviously a lot of people are really interested in figuring out what’s inside and you can; it’s better than brain scanning because you actually have a super clear picture of what’s going on inside these models.
PD: You mentioned the black box field of the neural networks you’re training, do you see evidence that language might be an emerging property of a complex system?
PK: No, I think it’s a very interesting question, like, yes, what does it actually say, for instance, about image recognition or language? I mean we have all these kinds of physics envy of reducing the world to a few formulas. And I mean, that doesn’t seem to work for problems like language, where you just have a few rules and that’s it. And what’s then the answer? I mean we can discover principles that are true 90% of the time. There’s a German saying: the rule is proven by its contradiction. There’s always an exception that proves the rule, and language is a lot like this. I think this is definitely one of the giant challenges of the field, trying to understand what’s going on. Because it’s also important from an engineering perspective. Because right now if something goes wrong, what can you do? Maybe you have some intuition about changing some parameter settings or more layers or more training data or massaging your training data a bit better. And none of them really go to the root of the problem, they just accommodate it.
DD: You’ve trained hundreds of these models now, neural networks, what’s your best hunch if you had to make a guess about how the black box is figuring all this stuff out?
PK: Well, there are some things that it seems to be doing similarly to human translators and to how our original models did this. So, there’s definitely a sense that when it produces an output word, it creates a link to the source word that is most important. In the original models it was a bit easier to visualise, for instance, word alignments. But yes, what drives these decisions is less clear. There are people who looked at these intermediate layers, and as I said earlier, you can map certain properties of the representations to our classical senses of, this is a noun or a verb; does it discover that some things are nouns and verbs? And apparently there is something like this that does emerge; you can learn from the representations, oh yes, it detected this is a noun.
Although it’s not saying noun inside the model, it goes through some of these stages. Actually we had a discussion in our reading group here two weeks ago, where people were arguing about, should the encoder be bigger or the decoder be bigger? And one of the questions was like, where does it learn to reorder? Where does the reordering kick in? Where does it know it has to change the sentence order? And we don’t know. It’s an interesting thing to explore but it’s at least an actionable question that we can maybe trace down.
A lot of these interpretation questions are just a bit like, I want to understand what it does. That’s not a very clearly defined question, that’s not a question you can answer. I mean what kind of answer do you want? You have to be a bit more specific about what kind of answer you want, and that’s where everything always gets a bit hazy and shaky. Yes, what explanation actually would satisfy you?
DD: Just a couple more questions. One on personal productivity, because your output over the years, in terms of publications, has been prolific. Have you got any words of inspiration or advice for aspiring academics or indeed data scientists about how you manage all the things you manage?
PK: Well I am a professor at a university, so my students do all the work. That’s one way to put it. Definitely, if you look at the publications, most of them are authored by the students. When I was a student, yes, and the advice I give to students is do the things you actually care about and are interested in. That’s my experience with students, it’s like herding cats. They have their ideas, it’s hard to tell them do this and that, they do what they want. And that’s important as a PhD student, and as a researcher, that you find the things you’re interested in.
Another piece of good advice: there’s always a tension between the giant challenge and the low hanging fruit, and you need to balance that somewhat. There are some things that you already know are probably going to work and are easy to do, and you should do them. And then there are things that are the giant challenges. So, one risk is to only pursue the giant challenges and then after years and years never really have anything to show for it. So, you have to break it up into smaller pieces. And yes, work on multiple projects at the same time. Some might get stuck.
In research the big challenge is to always say, after months of work, this is not going anywhere and I should just stop doing it. That’s a hard decision to make. Admitting that these months of work are actually completely useless.
DD: I think that’s a big challenge in business as well: once a certain amount of time and money has been sunk into a project, no one wants to close the lid on it. So, that brings us sadly to the end of this episode. Thank you so much to Professor Koehn and Dr Diesinger for their fantastic questions, answers and many, many insights. Thanks also to you for listening, we really appreciate it. If you enjoyed the show, would you mind helping us out by subscribing and leaving us a review on your preferred platform? You can also connect with us and give us some feedback on the usual social media channels, particularly LinkedIn and Twitter. And we look forward to having your company again on our next episode.