
Episode 2

Part 2

Philipp Koehn (Part 2) – How Neural Networks Have Transformed Machine Translation

Philipp Koehn



It’s Part 2 of our conversation with Professor Philipp Koehn of Johns Hopkins University. Professor Koehn is one of the world’s leading experts in the field of machine translation and NLP.

In this episode we delve into commercial applications of machine translation and the open source tools available, and take a look at what to expect from the field in the future.

Show Notes


  • Typical datasets used for training models
  • The role of infrastructure and technology in Machine Translation
  • How academic research in machine translation has made its way into industry applications
  • Overview of the open source tools available for machine translation
  • The future of machine translation, and whether it can pass a Turing test


Philipp Koehn’s latest book – Neural Machine Translation – View Link

Omniscien Technologies – Leading Enterprise Provider of machine translation services – View Link

Open Source tools:


Translated texts (parallel data) for training:


– Paracrawl


Two papers mentioned about excessive use of computing power to train NLP models:

In this series

Episode 1

Philipp Koehn (Part 1) – How Neural Networks Have Transformed Machine Translation

Philipp Koehn

This is Part 1 of our conversation with Professor Philipp Koehn who is one of the world’s leading experts in the field of Machine Translation & NLP.



DD – Damien Deighan

PK – Philipp Koehn

PD – Philipp Diesinger


DD  (00:06):

Welcome to the Data Science Conversations podcast. My name is Damien Deighan, and I’m here with my co-host, Dr. Philipp Diesinger.

Our guest, Professor Koehn, completed his PhD in computer science in 2003 at the University of Southern California. He then went on to a postdoc position at MIT, and in 2014 became a professor at Johns Hopkins University.

He has been a prolific publisher of over a hundred academic papers in the field of machine translation. He has also authored two seminal textbooks on machine translation, the latest of which was released in June 2020. Beyond academia, his contributions have impacted many industry applications, such as Google Translate. He is currently the chief scientist and a shareholder of Omniscien, one of the world’s leading providers of enterprise translation products and services.

He has also worked directly with Facebook to help them make machine translation technology available for their next 100 languages. He is undoubtedly a pioneer and world leader in the field of machine translation and AI. Professor Koehn, thank you so much for joining us.

DD (01:45):

So today’s discussion will cover a lot of ground. We are going to look at the evolution of machine translation from rule-based, statistical and phrase-based methods right the way up to the current landscape, which is dominated by neural networks.

From there, we will talk about how to use training data effectively, the increasingly important role of technical infrastructure, commercial applications in industry, and finally, an introduction to the best open source tools available for those of you who want to explore further.

The context of all of this is of course machine translation. However, the methods that we’re going to discuss are highly relevant to many of the problems faced in other areas of natural language processing. The core principles of Professor Koehn’s contributions to the field of neural networks have data science applications beyond language.

But before we get into that, let’s briefly introduce the bigger picture. Professor Koehn, what is machine translation?

PK (02:50):

I mean, it’s a very easy problem to explain. You have documents in all kinds of languages in the world, and you would like to know what they say, and therefore you need a translation. That’s a process humans have been doing forever, and it’s always been one of the holy grails of natural language processing. I would say that over the last 20 or 30 years it has definitely gone from being just ridiculously bad to actually pretty useful. There are even some crazy claims about human parity (we might want to talk about that later), but the quality is actually quite impressive for languages where we have a lot of resources.

DD (03:27):

Why have you dedicated your life to studying machine translation?

PK (03:31):

So when I started out studying computer science, I was interested in machine learning. Originally I actually did mostly machine learning for machine learning’s sake. That was the early nineties, when a crazy idea like neural networks was around, and my master’s thesis at that time was actually on neural networks. I realized relatively quickly that it’s kind of bad to work on machine learning for its own sake, without having a problem. It’s much easier to have a problem to work on. At that time, text processing in general emerged as a really good problem for machine learning, because you have data and there are real practical problems to work on. So when I did my PhD, machine translation was the topic of my thesis advisor, Kevin Knight. It’s a real meaty problem and there is data, so you can actually do machine learning. And it’s a somewhat feasible task, where there is at least some idea about what the correct input and the correct output should be.

DD (04:26):

You very neatly touched on the problem with neural machine translation. How would you describe the fundamental problem faced by practitioners in the field? I’m hinting at the very humorous way you described it in the opening of your new book.

PK (04:42):

So when I ask people what they think about machine translation and show them the output, I get wildly differing opinions. Some people say it’s all terrible, and some people are super impressed. So I started the book with this example of ‘Sitzpinkler’, which is kind of a comical word for a wimp. Another one is ‘Warmduscher’, which means someone who takes warm showers. ‘Sitzpinkler’ means a man who sits down to pee, which is kind of a strange insult. These words have their time and place in a culture that is working out what it means to be a man. And another one is ‘Frauenversteher’, a man who understands women. I mean, these are all things that are kind of expected of a modern man, but were maybe not characteristic of the culture 50 years ago.

So these are words that are well known in German culture, and you can insult someone with them, but if you actually encounter them in a text and have to translate into English, you would have to explain basically the long story I just told.

So what do you do? Are you going to say, ‘Oh, you are a warm shower taker’, or ‘you are someone who pees sitting down’? Nobody would understand what that is supposed to mean. So the really adequate translation is maybe just ‘you’re a wimp’, because that’s what is really meant. But something is lost in translation. That’s the crucial problem of translation: it’s ultimately an impossible task. There’s always some nuance or cultural connotation, and you’re always going to lose something. So there’s endless debate about what a good translation is, and translators argue with each other and don’t like each other’s translations.

So it’s a somewhat ill-defined problem, but compared to other problems in natural language processing it is relatively well defined. Take other problems like summarization: give me the summary of this book. I mean, you’re not going to get much agreement from anybody about what that should be. But in translation, at least if you’re not super critical, you can say: okay, that’s a passable translation, that’s an acceptable translation, and that isn’t.

PD (06:48):

Thanks for the great introduction to machine translation and its challenges, Philipp. Many formal approaches to machine translation tend to center around the two concepts of adequacy and fluency. Can you talk us through that?

PK (07:00):

So there are always two competing goals for translation. On the one hand, you want to produce text that is so fluent that you don’t even notice it’s translated, that reads as if it was written in the native language. That’s fluency. The other one is adequacy: you want to have the same meaning as the source. Sometimes they conflict, and sometimes one is more important than the other. For instance, if you think about translation of literature, fluency is more important than adequacy.

It’s more important that this is still an enjoyable book that captures the spirit of the original; it doesn’t have to get all the facts right. Say I write a story in an American newspaper saying this town has the population of Nebraska. Maybe in America people still know what that means, but if I translate that into Chinese, I could literally translate it as ‘this town has the size of Nebraska’, except the entire readership will have no idea what that is.

So maybe you should then compare it to a Chinese city whose size people in China know, so they can actually make some kind of sense of it. That was the intention of the author: to give some kind of understandable concept of the size. But obviously, if you translate Nebraska with, I don’t know, Wuhan, that’s probably not the same size. That is obviously a mistranslation in terms of adequacy, but in terms of fluency and intended meaning it’s probably the right thing to do. In statistical machine translation, we actually had two different components in the models that modelled these two aspects separately. We had a language model that looked at sentences and checked: is this a fluent sentence? It prefers fluent sentences over non-fluent sentences. Then there was a translation model that just looked at how well things map. We basically balanced them, and we always tuned them towards whatever the application goal was.
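The balancing of a language model against a translation model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate sentences, the probabilities and the weights `w_lm` and `w_tm` are all made up, and real statistical systems combined more components than these two.

```python
import math

# Hypothetical candidate translations with made-up model probabilities.
# p_lm: how fluent the language model finds the output sentence.
# p_tm: how well the translation model thinks the output maps to the source.
CANDIDATES = {
    "this town has the size of Nebraska": {"p_lm": 0.02, "p_tm": 0.60},
    "this town is about as big as Wuhan": {"p_lm": 0.10, "p_tm": 0.05},
}

def score(probs, w_lm, w_tm):
    """Weighted sum of the log-probabilities of the two components."""
    return w_lm * math.log(probs["p_lm"]) + w_tm * math.log(probs["p_tm"])

def best_translation(candidates, w_lm, w_tm):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=lambda c: score(candidates[c], w_lm, w_tm))

# Tuning the weights towards adequacy or towards fluency changes the winner.
adequacy_first = best_translation(CANDIDATES, w_lm=0.5, w_tm=2.0)
fluency_first = best_translation(CANDIDATES, w_lm=2.0, w_tm=0.5)
```

With these invented numbers, weighting the translation model more heavily picks the literal Nebraska sentence, while weighting the language model more heavily picks the Wuhan sentence.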

PD (08:46):

How do you quantify the performance of such machine translation systems? What kind of metrics are useful for that?

PK (08:53):

I mean, that bridges to the next problem. The way I described it, these are components of the model; they have a meaning within the model and they are used within the model, but ultimately the goal is to produce translations. And that brings up the question: how do you evaluate what a good translation is? We could spend the entire hour talking about evaluation of machine translation. We have an engineering problem: we want to build machine translation systems, tune them and change them, and we want to measure immediately how it goes. So we need an automatic metric to evaluate how good a machine translation is, and whether this system is better than the other.

PD (09:26):

So if I understand you correctly, performance metrics are more important internally to train and develop the models.

PK (09:32):

There’s the infamous BLEU score that is used in machine translation. Technically it’s a little bit too complicated to explain super straightforwardly. Ideally you would like to know how many words are wrong, but if you just count how many words are wrong, you also have to consider word order, so that’s not easy to do. So the BLEU score looks at how many words are right, but also at which pairs, triplets and four-word sequences are right. You compare against a human translation, so basically a machine translation system that is better than another one produces output that is more similar to an already existing human translation. That’s how automatic metrics work. It all comes down to a measure of similarity, and there’s a whole industry coming up with metrics for how to measure the similarity of machine translation output to a human translation.
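The idea of counting matching words, pairs, triplets and four-word sequences can be sketched as follows. This is a simplified, sentence-level approximation written for illustration: it uses a single reference, no smoothing, and a brevity penalty for short output, whereas BLEU as normally reported is computed over a whole test set. The function names and example sentences are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all word sequences of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU-style score of one candidate against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # An n-gram only counts as often as it appears in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Penalize output that is shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

REFERENCE = "the cat sat on the mat today"
score_a = simple_bleu("the cat sat on the mat now", REFERENCE)
score_b = simple_bleu("today the mat sat on the cat", REFERENCE)
```

The second candidate contains every reference word but in scrambled order, so it has no matching four-word sequence and scores lower than the first, which gets one word wrong but keeps the order.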

PK (10:26):

When we organize evaluation campaigns, we also frequently do human evaluation of machine translation, which is ultimately what gives the whole idea of automatic metrics some credence. Unfortunately that’s even trickier, because at that point you don’t have any ground truth anymore. If you have one method of human evaluation that says system A is better and another method of human evaluation that says system B is better, what are you going to do? For a long time we have had two standards. We have people just look at two sentences, one from system A and one from system B, and ask them which one is better.

They’ll disagree with each other. They even disagree with themselves if you show them the same sentence pairs an hour later. Typically, if there are flaws, there are different flaws in the two sentences, so why is this flaw worse than that flaw? Is wrong word order worse than dropping a word, or is a grammatical error worse? So as I said, we could spend a whole hour discussing how to evaluate machine translation. We have had a reasonably good, useful setup for the last 20 years, since the BLEU score was invented. We have these scores; everybody criticises them all the time, but they are still used. They have definitely helped guide the development of machine translation.

PD (11:38):

One of the earlier ideas of machine translation was to split the problem into three categories, a lexical, a syntactic, and a semantic problem.  Is this still a valid approach in the age of neural networks?

PK (11:49):

The short answer is no. The slightly longer answer is yes. Before the whole statistical wave hit about 20 or 30 years ago, there was this grand vision of machine translation being an application that guides the development of better natural language processing. That also involves understanding language, and the idea was that we go through various processing stages. We start with, you know, part-of-speech tagging (what are the nouns, what are the verbs), handling morphology, and detecting syntactic structure: here is a noun phrase, here is a clause, there’s a subclause. Then beyond that, each clause has some kind of meaning, and we have meaning representations.

Ultimately the vision was always to have some meaning representation that is beyond all language. If you take a source language, map it to that language-independent meaning representation and then generate from that, you can build machine translation systems for every language pair.

PK (12:50):

You just need to build an analyzer and a generator for each language. That was the vision of rule-based systems, moving towards interlingua systems. The statistical revolution that happened 20 years ago was the first to throw all of this by the wayside and say: okay, it’s just a word mapping problem. We just have to find source words and map them to target words, and we have to have some kind of model of reordering, but it’s all tied to words. So it was a very superficial model. It just looked at word sequences.

The output of that was generally good, except that it was often not very grammatical, because the models only looked at very short windows. Back then, for instance, to check whether something was fluent language, we only looked at word sequences five words at a time.
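The limitation of checking fluency through short windows can be illustrated with a toy sketch. The assumptions here are loud ones: the window is three words rather than five to keep the example small, the "model" is just a set of windows seen in a two-sentence hypothetical corpus, and real n-gram language models assign probabilities rather than a yes/no answer.

```python
def windows(sentence, n):
    """All n-word windows of a sentence, padded with start and end markers."""
    tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train(corpus, n):
    """Collect every window that occurs in the training sentences."""
    seen = set()
    for sentence in corpus:
        seen.update(windows(sentence, n))
    return seen

def locally_fluent(sentence, seen, n):
    """True if every window of the sentence was seen in training."""
    return all(w in seen for w in windows(sentence, n))

N = 3  # the transcript mentions five words; three keeps the toy corpus small
CORPUS = [
    "the man who lives next door is a doctor",
    "i met the man who lives next door",
]
SEEN = train(CORPUS, N)

# Every local window looks fine, yet the sentence never reaches a main verb.
fragment_ok = locally_fluent("the man who lives next door", SEEN, N)
scrambled_ok = locally_fluent("door next lives who man the", SEEN, N)
```

The fragment passes the check because each window was seen somewhere in training, even though the sentence as a whole is incomplete, while the scrambled sentence is rejected at its first window.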

PK (13:33):

And of course then, you know, you might reach the end of the sentence without ever having encountered the verb of the sentence, which is somewhat crucial. The output read very fluently locally, but then suddenly the sentence ended and it didn’t make sense.

So there was pretty strong pressure to say: well, that’s because it doesn’t have any understanding of syntactic structure. So we developed statistical systems that actually built syntactic structures. There was longstanding research in natural language processing on building parse trees, which is what we call these syntactic representations that have all the things I mentioned: noun phrases, clauses, and so on. And we were actually pretty successful. That work was initially done on Chinese – English, where word order and structure are much trickier. We were also really successful in German – English, which was long a really hard problem for natural language processing.

PK (14:23):

Although German and English are closely related languages, the syntax is vastly different. I mean, German has the verb at the end of the sentence. That’s not entirely true, but you know, something still has to reorder that verb and put it in the right place, and the traditional models were not very good at that. So in the statistical days we seemed to be on the way to building linguistically better models, but they kept getting more complicated, because suddenly you had to build tree structures. Then there was all this talk about semantic representations and graph structures that are even more difficult. And that’s when the neural machine translation wave hit, and it again started out by saying that we just have a word mapping problem: there’s a sequence of words coming in and a sequence of output words being produced.

And I like to joke that these days people don’t even believe in the linguistic concept of a word anymore: let’s just say there’s a sequence of characters coming in and a sequence of characters coming out. And there is a pretty serious effort to build models that really treat this as a byte sequence coming in and a byte sequence coming out, with no notion of any kind of linguistic understanding behind that.

PD (15:39):

So machine translation moved from rule-based systems more and more to statistical approaches. This is something that we have seen happening in many other fields in the last decade with the introduction of machine learning methods. What do you think are the reasons for this? Is it that the complexity of neural graphs matches language very well? Is it the availability of data or hardware?

PK (15:59):

There are various aspects to that. I mean, the turn to data-driven methods in natural language processing happened pretty much in parallel to what I just described for machine translation. Take other problems in natural language processing, for instance analyzing syntactic structure: parsing was often done with handwritten rules. You know, you can just write a rule: a sentence is subject, verb, object; a subject is a noun phrase; a noun phrase is determiner, adjective, noun. That all sounds very natural. Except that if you look at actual texts, every other sentence is something that violates these very basic ideas about how language looks. So in the nineties the approach changed: you just annotate sentences with their syntactic trees, and then we learn from those. In general, why is that so successful for language processing?

PK (16:47):

Why has it completely overtaken the field? It’s just because this is a field where you actually have data. In translation especially, you get all your training data for free: all you need is translated text, and people translate stuff all the time. So people generate your training data all the time, just because that’s a natural activity that people do. I mean, there are many other problems where that’s not the case. If you look at image classification, people don’t go around and, just for the fun of it, label ‘this is a dog’ and ‘this is a cat’. I mean, maybe they write captions, but all of that is kind of shaky. Translation is such an inherent human task that people have always been doing it, so we get the training data. It’s extremely rare that we actually annotate training data ourselves. We just try to go out and find translated text on the internet or in public repositories.

PD (17:38):

Humans also have different ways of learning language. As children we learn by listening, but later in school we are taught a more formal, rule-based approach.

PK (17:47):

I mean, I’m not a linguist, and we don’t, for instance, have a good linguistic theory of what the structure of language is. It is as you describe: as kids, we just listen to language. But then we also go through a phase where we go to school and are told: no, this is the wrong grammar, you’re making grammar mistakes here. We are basically taught some rules. And, you know, especially if you look at a language like German, you have to get your cases right. I still remember from school: all the word endings have to be right. So you learn a rule. So is language driven by rules, or is language just a mess where somebody says something and people repeat it? It seems to be a mix of both. There seems to be some structure, and, you know, that’s where Chomsky comes in, with this famous claim of language being recursive. So it has some structure, but then people can also just say, you know, crazy things, and then others repeat them.

So I hear from my kids now something like, what was it? ‘He do be vibing though’, or something like that. And that’s apparently something you can say, although it’s not really grammatical in English. You know, ‘he do be vibing though’ is not proper grammar, but that’s what people say, and people repeat it. Then it becomes part of the language. So what is language? Is language driven by some normative rules, or is it just something that people pick up, with things borrowed from other languages?

PD (19:21):

You described how the input of machine translation models changed over time, from phrases to words to subwords. And as you said, there are now attempts to process even sequences of characters. What has been driving this process?

PK  (19:33):

So the fundamental problem in language is that everything is incredibly ambiguous. Words are ambiguous, words have different meanings, and syntactic structures might be ambiguous. I mean, the classic example for words is ‘bank’: is it a river bank, or is it the bank where you keep your money? ‘Interest’ is another great word: something could be interesting, or it could be the interest rate, or having an interest in a company.

So words are ambiguous, and structure is too. An example is ‘I eat steak with ketchup’ versus ‘I eat steak with a knife’: that is a structural difference in what ‘with’ attaches to, the steak or the action of eating.

So there is always ambiguity. That’s why it is hard for a computer, because it has to resolve the ambiguity. How can a computer ever tell the difference between a financial bank and a river bank? They’re both just ‘bank’: you know, a character sequence, four letters.

PK (20:21):

So what can it do with that? The answer is that it can do the same thing we do: just look at the surrounding context. When I say ‘river bank’ and ‘money bank’, you’re not going to confuse what I mean by the word ‘bank’ in each case. It was just the preceding word that told you that, and that’s what machines do: they look at the preceding words. That’s also what drives a lot of the use of different units of language. If you do phrase translation, you translate groups of words, and a group of words like ‘interest rate’ is much, much less ambiguous. It has a very clear meaning, while the word ‘interest’ is very ambiguous and the word ‘rate’ is very ambiguous. So if you translate them independently, it is very confusing.

PK (21:04):

If you translate them as a phrase, it’s very clear. So that’s one aspect. When you get to subwords and character sequences, that usually comes up with the issue of morphology. If you just say every word form is a different word with a different meaning, what do you do with ‘car’ and ‘cars’? They would be different words with different meanings. But shouldn’t you be able to share some knowledge, just like we do as humans?

If I tell you a new word, but I only tell it to you in the singular, you will still understand what it means when I suddenly use the plural, because I just added the letter ‘s’. So that is what’s driving this a bit: we need to get away from representing ‘car’ and ‘cars’ as completely different things. If you look at character sequences, you can see that they’re very similar, and that should help.
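The point about ‘car’ and ‘cars’ sharing knowledge can be illustrated with a small subword segmentation sketch. The vocabulary below is hand-written and hypothetical; in practice subword vocabularies are learned from data (byte-pair encoding is one common method), and the greedy longest-match strategy shown here is just one simple way to apply such a vocabulary.

```python
# Hypothetical subword vocabulary, written by hand for this example.
SUBWORDS = {"car", "s", "translat", "ion", "ing", "e"}

def segment(word, vocab):
    """Split a word into the longest known pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

singular = segment("car", SUBWORDS)
plural = segment("cars", SUBWORDS)
longer = segment("translations", SUBWORDS)
```

Here the plural is represented as the singular piece plus an ‘s’ piece, so whatever a model learns about ‘car’ is available when it sees ‘cars’.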

PD  (21:53):

You have been describing the character sequences that are used by machine translation models. Is this one of the reasons the field is currently relying so heavily on recurrent neural networks?

PK (22:03):

So you have an input sentence, and you have an output sentence. For the output sentence, you try to predict one word at a time, and when you predict one word at a time, you have all the previous words that you generated to help disambiguate. So what drives the decision to produce the next word is obviously the input sentence, but also all the previous words that you have produced. So it’s kind of a recurrent process. I mean, that’s the big question: is language recursive, or is it just a sequence?

There are good reasons to believe that it’s heavily influenced by being a sequence. When we understand language, we always receive it linearly. I mean, we listen to things word by word, and we read things word by word. We don’t look at the entire sentence, look for the verb and then branch out again. We just see it as a sequence. So it should be modelled as a sequence-generating task too, where you produce one word after the other. That also makes it a bit more feasible. I mean, you can’t just predict the whole sentence, because there is an infinite number of sentences, but you can predict one word at a time. You still have a fairly large vocabulary: in actual real texts, hundreds of thousands of words. That drives some of the ideas like breaking up infrequent words into subwords to make it computationally a bit more feasible. But this recurrent process of producing one word at a time seems to suit language pretty well.
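Producing a sentence one word at a time can be sketched with a toy table of next-word probabilities. Everything in the table is invented, and the sketch conditions only on the previous word; a real translation model would condition on the input sentence and on all previously produced words. Picking the single most probable word at each step (greedy decoding) is also a simplification.

```python
import math

# Hypothetical next-word probabilities, conditioned only on the previous word.
NEXT_WORD = {
    "<s>": {"the": 0.7, "a": 0.3},
    "the": {"man": 0.6, "dog": 0.4},
    "a": {"man": 0.5, "dog": 0.5},
    "man": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"sleeps": 0.5, "</s>": 0.5},
    "sleeps": {"</s>": 1.0},
}

def greedy_decode(table, max_len=10):
    """Generate words one at a time, always taking the most probable next word."""
    words, prev, log_prob = [], "<s>", 0.0
    for _ in range(max_len):
        word, p = max(table[prev].items(), key=lambda kv: kv[1])
        # The sentence probability is the product of the per-word probabilities.
        log_prob += math.log(p)
        if word == "</s>":
            break
        words.append(word)
        prev = word
    return words, math.exp(log_prob)

sentence, probability = greedy_decode(NEXT_WORD)
```

The model never has to score the infinite set of whole sentences directly; it only ever chooses among the vocabulary entries for the next position.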

PD (23:33):

Can you talk a little bit more about the type of neural network you’re using, or its internal structure? I understand that there is a part that encodes the input sentence, and then another part that decodes it into the output sentence.

PK (23:46):

There’s definitely a lot of talk in the neural network world, when applying it to language, about semantics and meaning. You can still frame the problem as: you have an input sentence and you try to get the meaning of that input sentence, and then from that meaning you generate the output sentence. As I said earlier, in the rule-based days, and slightly also in the statistical days towards the end, this was done explicitly. People actually put in representations that much more closely mirrored our understanding of meaning, or at least of syntactic structure and things like that. In neural networks, there are claims that this kind of meaning emerges in the middle of the process of going from an input sentence to an output sentence. We do break up the problem of translation into an encoder that just looks at the source sentence and does some processing on the source, and then a decoder that, starting from that processing, generates the output sentence.

PK (24:44):

The very first neural machine translation models used recurrent neural networks. These are neural networks that at each step take as input the previous state and a new input word. So they kind of walk through the sentence and say: OK, what’s the probability that I start the sentence with the word ‘the’? That’s a certain probability, so let’s predict that. Then, given where we are in the sentence now, what’s the probability that the next word is ‘man’? And it just learns these things. So this is very similar to what I talked about earlier with language models: how likely are certain words to follow each other?

If you do this, you learn which words fit in the sequence. This also helps you figure out how words fit into a sentence, and an ambiguous word is then represented not only by the actual word token but also by the surrounding context. So you had these recurrent neural networks running left to right and right to left over the sentence, which then looked at each word not in isolation but given its left context and given its right context.

PK (25:59):

You have several layers of these recurrent neural networks over the input. So then you have more refined representations of each word, informed by its surrounding context. Then you translate from that, and you again produce one word at a time. That’s where the decoder comes in.

So what I just described, hopefully somewhat clearly, is the recurrent neural networks that got neural machine translation started five years ago, which is now ancient days. But for the last two or three years we have had a different model called the transformer, which is not a very informative name. A slightly more informative name is self-attention. The idea is that we model words in the context of the other words, and we do this very explicitly. We look at each word and ask: how does it relate to all the other words in the sentence? So we learn some weights. The word itself is obviously most important, but maybe some of the surrounding words are more important than others.

We refine the representation of the word given the surrounding words, and we go through layers of that. So this is the self-attention transformer approach, and we have the same thing on the source and the target side.
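The weighting of each word against all the other words can be sketched as a single self-attention step. This is a bare-bones illustration under stated assumptions: the word vectors are made up, and the learned query, key and value projections, multiple heads and stacked layers of an actual transformer are all left out.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """One attention step over word vectors, without learned projections."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sentence, itself included.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The refined representation is a weighted mix of all word vectors.
        mixed = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up two-dimensional vectors for a three-word sentence.
EMBEDDINGS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined, attention = self_attention(EMBEDDINGS)
```

With these vectors the first word gives the largest weight to itself, somewhat less to the similar second word, and the least to the dissimilar third word. In a trained model the weights come from learned parameters, so a word does not always attend most strongly to itself.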

PD (27:06):

Can you give us an idea of typical things that can go wrong with machine translation and which methods are used to validate the outputs?

PK (27:13):

So validation is – how well does your system do?

What we do there is leave aside a bunch of sentences, translate them, and check how well the output matches whatever the human translator produced. We can measure this, as I said before, with BLEU scores, but we can also measure what probability the model gave to each word in the human translation. We can force it to produce the human output and then see how well it scores that.

So what can go wrong? One interesting thing about the neural machine translation approach is that its types of errors differ quite a lot from those of the statistical methods. Statistical methods, because they had a very narrow window in what they looked at when they translated something, often produced ungrammatical and very incoherent output.

PK (28:05):

If you gave them a sentence that had a lot of unusual words or was difficult for other reasons, the translation was often just gibberish and hard to read. The neural model, because it is basically a generative language model, just produces word sequences.

It almost always produces beautiful sentences. It’s just that sometimes the sentences have nothing to do with the input anymore. There were some newspaper articles asking why Google Translate produces what sound like biblical prophecies. That’s something we’ve actually seen in our own experiments too.

So if you don’t have much data, what do you have? You have the Bible, you have the Koran; that’s the kind of data you get for hundreds of languages. So you train your model on that, and then you suddenly want to translate tweets. What the model seems to be doing there is: I have no idea what that input is, I don’t know what to do with it, but here’s a beautiful sentence I’ve seen in training, so let me just tell you that.

PK (28:59):

And it just produces output that is completely unrelated. So there’s a much bigger problem of it producing things that semantically have absolutely nothing to do with the input. We call these hallucinations, where it just comes up with stuff that was not in the source at all. That’s a real problem, because it’s also hidden by the fact that the output looks beautiful.

So if you’re a naive user of machine translation and you have a Chinese document, you translate it and read it and think: oh, this is beautiful English text. But you don’t really know whether it’s actually a translation, because you’re fooled by it being such beautiful text, and you are less inclined to check whether it actually translates all the words right.

PK (29:44):

So we’ve gotten much better at fluency, at actually producing beautiful output text, but there is still the problem of adequacy: do we actually translate the words correctly, and do we handle ambiguous words well? If an ambiguous word occurs in a sentence with a meaning that is not the most common one, and the surrounding context is not clear enough to give that away, at least for the machine, then it often gets that wrong.

So it produces a beautiful sentence that doesn’t actually mean what the source sentence means. That’s a real problem, because it’s much more misleading than before. Previously you just got gibberish output, so you didn’t trust it. Now you get beautiful output, and nothing tells you not to trust it.

DD (30:28):

And this brings Part 1 of our conversation with Professor Philipp Koehn to a close. We will be back in a week’s time with Part 2. We’d love to extend the reach of the show to as many people as possible, so please do subscribe on your favourite podcast platform. And if you enjoyed the show, please leave us a review.