A Monthly Column for Word Lovers
Machine Translation Dreams
"A serious artist doesn't start with a kitschy piece of error-ridden bilgewater and then patch it up here and there to produce a work of high art. That's not the nature of art. And translation is an art." —Douglas Hofstadter
The noted author Douglas Hofstadter wrote a piece for the Atlantic last month called The Shallowness of Google Translate, about the general topic of machine translation (that is, translation of human languages by computer). Hofstadter is not a linguist by training, but he's a cognitive scientist, a polyglot, and a brilliant scholar. The article is an interesting read, and recommended to all. Of the many great points that Hofstadter hits on, one struck a chord that continues to reverberate in the Lounge, and that is the persistent but not very productive idea that more data gives computers the capacity to do better translation.
Accurate and dependable machine translation has long been a dream among linguists and computer scientists alike. It is the subject of an extraordinary amount of research (Google Scholar has more than three million hits on the term), and that research has certainly paid off in improvements that are now available to all. Probably everyone reading this has used either Google Translate or some other bilingual or multilingual translation service online to help them past a stumble in a foreign language. For that simple task (let's say, for example, "how do you say ice skate in Swedish?"), machine translation is usually dependable. But if you step up the game, that is, if you want to translate something as complicated as an actual sentence or, heaven forbid, a longer passage held together by the rules of logic, grammar and syntax, machine translation is an odd, baffling, and sometimes hilarious mashup of hits and misses, as some of Hofstadter's examples show.
We live in the era of Big Data, and because massive amounts of formatted data are available to anyone with a computer and an internet connection, it's tempting to look to Big Data as the solution to any of a number of problems, including machine translation. The great advantage of Big Data is the evidential support it provides. Most things that happen have happened sometime before, and we now seem to have enough data about everything that has ever happened to get a statistical picture of how likely it is that a particular thing will happen again, given similar conditions. But that simple scenario also points up the weakness of throwing a problem at Big Data. The fact that something happens most of the time is certainly no assurance that it happens all the time, or that it happens at the particular time that you are interested in learning about.
Here's an example. If you look at hundreds of examples of vetted, gold-standard human translations of the consecutive words red tape from English into German, you find a handful of variations, but more often than not, red tape is translated as Bürokratie. Armed with evidence like this, Google Translate feels confident enough to translate for you. Ask it to translate the sentence "We sealed the package with red tape" and you get:
Wir haben das Paket mit Bürokratie versiegelt.
And if you dutifully return to Google Translate to render this sentence back into English, you get:
We have sealed the package with bureaucracy.
To characterize the failing in general terms: Google Translate takes a statistical, Bayesian approach to a situational problem, and that's sensible on the face of it: look at the statistical record of instances of a phenomenon and on that basis, make an informed guess about possible outcomes when a similar phenomenon arises. Or in more technical talk, consider "how the conditional probability of a set of possible causes for a given observed event can be computed from knowledge of the probability of each cause and the conditional probability of the outcome of each cause." That's the restatement of Bayes' Theorem in the VT. Here's what goes wrong: Google Translate doesn't look beyond the simple juxtaposition of the words red and tape to characterize the situation. The translation algorithm simply homes in on the collocation, has the computational equivalent of an aha moment, and plugs in the translation that works most of the time. Except it doesn't work here. Why not?
As Hofstadter puts it, "Google Translate isn't familiar with situations, period. It's familiar solely with strings composed of words composed of letters." It's not as if the "situations" that clue up a reader about the correct meaning of red tape in this sentence are absent: there are two excellent cues. One is the presence of the verb seal, which collocates more than randomly with tape; the other is the presence of the noun package, which is also found with greater than random frequency in the vicinity of tape. Could a computer not make these small inferential leaps and "realize" that the red tape in this sentence is literally red tape?
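For readers who like to see the arithmetic, here is a toy sketch in Python of the difference between plugging in the most frequent rendering and letting the two cues weigh in through Bayes' Theorem. It is not a description of how Google Translate works internally. Every count below is invented for illustration, the rendering Klebeband (adhesive tape) and the extra cue word government are assumptions added for the example, and the function names are hypothetical.

```python
import math

# Invented counts: how often each German rendering of "red tape" appears
# in an imaginary set of 100 human translations.
SENSE_COUNTS = {"Bürokratie": 90, "Klebeband": 10}

# Invented counts: in how many of those translations a cue word appeared
# in the same sentence, broken down by rendering.
CUE_COUNTS = {
    "Bürokratie": {"seal": 1, "package": 2, "government": 40},
    "Klebeband": {"seal": 6, "package": 7, "government": 1},
}


def most_frequent(sense_counts):
    """Pick the rendering seen most often, ignoring the sentence entirely."""
    return max(sense_counts, key=sense_counts.get)


def posterior(cues, sense_counts, cue_counts):
    """Naive Bayes: P(sense | cues) is proportional to
    P(sense) times the product of P(cue | sense) over the cues."""
    total = sum(sense_counts.values())
    log_scores = {}
    for sense, n in sense_counts.items():
        score = math.log(n / total)  # prior: how common the rendering is
        for cue in cues:
            seen = cue_counts[sense].get(cue, 0)
            # Add-one smoothing so an unseen cue never zeroes out a sense.
            score += math.log((seen + 1) / (n + 2))
        log_scores[sense] = score
    # Turn the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {s: math.exp(v - top) for s, v in log_scores.items()}
    norm = sum(weights.values())
    return {s: w / norm for s, w in weights.items()}


if __name__ == "__main__":
    print(most_frequent(SENSE_COUNTS))
    print(posterior(["seal", "package"], SENSE_COUNTS, CUE_COUNTS))
```

With these made-up numbers, the frequency-only choice is Bürokratie, while the version that takes seal and package into account prefers Klebeband. The catch is that someone had to decide in advance which cue words matter for this one phrase and supply counts for them; the sketch knows nothing about packages, sealing, or tape.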
The challenge that computers (and their human programmers) face is that analyzing the situational aspects of "strings composed of words"—in other words, the pragmatics of such strings—is something extremely easy for humans to do, and nigh impossible for computers to do. Humans are natural language machines. We invented and have evolved natural languages and they are therefore ideally suited for communication between minds embedded in the sensate skin envelopes we are all born into. All of the components that enable us to interpret natural language are in-built and highly developed. Today's computers of even the greatest sophistication, on the other hand, simply do not have ways to emulate this complexity of what is very literally the human condition.
Will computers get there eventually? Possibly, but probably not until after going down many other algorithmic rabbit holes. Because a computer is not a mind, much less an embodied mind, it will never understand language; it will not equate words with the physical or abstract things they represent, because it has no notion of that relationship, or of meaning. The most computers can attempt is to decode language, that is, to map its coded message into their own form of coding, or the coding of another language. Computers do not read text (which suggests some understanding of it); they process text. Understanding and translation require access to material that is partly encoded in language, but also largely encoded in the minds of those who invented language (namely, humans), and computers don't yet have a way of gaining access to this.