A Monthly Column for Word Lovers
Make Sense: Word Games With a Purpose
There's a new game in town. Actually, there are a number of new games, all of them about words. They're all at wordrobe.org, and they give you an opportunity to test your language skills and aptitude, as well as to advance the cause of science. The games at wordrobe.org are GWAPs, that is, games with a purpose. I talked about GWAPs briefly a couple of years ago when I wrote about the Google Image Labeler. The GWAPs at wordrobe.org help researchers develop valuable training data for Natural Language Processing (NLP) which, in a nutshell, is the science of trying to get computers to process language the way humans do, only better and faster.
One of the GWAPs at wordrobe is called "Senses." The "Senses" game asks you to do deliberately a task that you perform automatically hundreds or thousands of times a day; a task that computer types call Word Sense Disambiguation, or WSD. In reading and listening, humans perform WSD automatically and on the fly. When you hear someone say "Storen walked him on five pitches" (and perhaps with the proviso that you are familiar with baseball), you know immediately that the walk in this sentence is a special transitive sense of the verb in which the subject of the sentence is always the pitcher and the object is the batter. You also know that the pitches referred to are pitches that the pitcher made, and not soccer pitches, sales pitches, pitches in a musical scale, or any of the other senses of pitch.
Your cleverness in instantly decoding a sentence like this is something that even the most advanced computers (or rather, their programmers) can only look on with longing and envy, because computers lack at least two important things that you have: real-world knowledge, and the ability to infer context from various cues. In the absence of such useful faculties, a computer has to go about WSD in a much clunkier way: for example, in the sentence used above, a computer may have to test all the senses it knows of "walk" and then all the senses it knows of "pitch," reject all the ones that don't work for any obvious syntactic reasons, and then see if what's left makes any sense. The saving grace is that computers can do that whole operation in even less time than it takes you to do it the human way.
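For the curious, here is a minimal Python sketch of that clunky try-everything approach. The tiny sense inventory, its feature names, and the scoring rule are all invented for illustration; real WSD systems draw on far larger lexicons and statistical evidence.

```python
from itertools import product

# Toy sense inventory. The glosses, features, and domains are assumptions
# made up for this sketch, not entries from any real lexicon.
SENSES = {
    "walk": [
        {"gloss": "move on foot", "transitive": False, "domain": None},
        {"gloss": "take (an animal) out for exercise", "transitive": True, "domain": None},
        {"gloss": "give (a batter) a base on balls", "transitive": True, "domain": "baseball"},
    ],
    "pitch": [
        {"gloss": "throw of the ball to a batter", "domain": "baseball"},
        {"gloss": "playing field", "domain": "soccer"},
        {"gloss": "sales presentation", "domain": "business"},
        {"gloss": "frequency of a musical note", "domain": "music"},
    ],
}


def disambiguate(verb, noun, verb_has_object):
    """Try every verb/noun sense pair, drop syntactic misfits, keep the best fit."""
    # Step 1: test all combinations, rejecting verb senses whose
    # transitivity does not match the sentence.
    survivors = [
        (v, n)
        for v, n in product(SENSES[verb], SENSES[noun])
        if v["transitive"] == verb_has_object
    ]

    # Step 2: see whether what is left "makes sense". Here that simply
    # means the two senses share a subject domain.
    def score(pair):
        v, n = pair
        return 1 if v["domain"] is not None and v["domain"] == n["domain"] else 0

    best_verb, best_noun = max(survivors, key=score)
    return best_verb["gloss"], best_noun["gloss"]


# "Storen walked him on five pitches": walk has an object ("him").
print(disambiguate("walk", "pitch", verb_has_object=True))
```

With this toy data, the only surviving pair that shares a domain is the baseball sense of each word, which is the reading a human fan reaches without any of the bookkeeping.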
The challenge of words like walk and pitch in NLP is that they're richly polysemous: they have many meanings, and computers aren't equipped like we are to pick out the right ones. A related problem in NLP, and particularly in English, is that we have many words that mean generally the same thing. English, rather shockingly, has more than 100 verbs that can mean "walk." For a sampling you can look at one of the verb categories in VerbNet, a large database used in NLP that classifies verbs by their behavior. On the page for run there is a group of verbs from which a human, with very little context, can infer a vast amount of critical information: when we hear verbs like clamber, goose-step, hobble, lope, parade, sashay, sidle, skulk, tiptoe, trudge, or waddle, we know already that these are species of walking and that, give or take a bit of context, we are reading or hearing about a situation involving a human (or other creature) that has and is using two (or more) legs. Computers, however, lack this real-world knowledge and the sense apparatus that enables humans to develop it, and so they must be painstakingly programmed to learn, for example, that when someone hobbles it is a kind of limping, which is in turn a kind of walking. In cases like this the database behind the Visual Thesaurus is also tremendously helpful because a computer can traverse a couple of nodes between hobble and walk to get an idea of their connection.
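To make the idea of traversing nodes concrete, here is a small Python sketch using a breadth-first search. The "kind of" links below are a hand-built, hypothetical fragment; they are not the actual data behind the Visual Thesaurus or VerbNet.

```python
from collections import deque

# Hypothetical "is a kind of" links, invented for this sketch.
KIND_OF = {
    "hobble": ["limp"],
    "limp": ["walk"],
    "tiptoe": ["walk"],
    "waddle": ["walk"],
    "walk": ["move"],
}


def path_to(word, target):
    """Return the shortest chain of 'kind of' links from word to target, or None."""
    queue = deque([[word]])
    seen = {word}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Follow each link one step up toward more general verbs.
        for parent in KIND_OF.get(path[-1], []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


print(path_to("hobble", "walk"))  # ['hobble', 'limp', 'walk']
```

The two hops from hobble to walk are the "couple of nodes" a program would cross. The links run in one direction only, so asking for a path from walk to hobble returns None.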
How do you know that the him in "Storen walked him on five pitches" is a person and not a dog or a horse — each of which would invoke a different sense of walk? Presumably, you would know this from context, because the antecedent of "him" would have been mentioned earlier and you would have no trouble matching it up. Here again, computers can be somewhat clueless in figuring out which anaphor goes with which earlier noun in a sentence, and so another game on wordrobe, "Pointers," helps computers to develop better rules for resolving pronoun references.
Another problem of English NLP is addressed by a different wordrobe game called "Twins." A common feature of English is the lack of any distinction between the dictionary forms of related verbs and nouns. Walk and pitch are both examples of this and there are thousands of others: picture, carpet, chair, cup. You only have to look around you to see dozens of objects whose names also function as verbs in English. The problem is compounded by some inflected forms being equally ambiguous. In other words, walks might be the plural of the noun walk, or it might be the third person singular present of the verb walk. It's not a problem for the average human, because context nearly always makes clear whether a word is a verb or a noun. Not so clear for computers, however, which rely on rules to assign a part of speech to a word in context. Have a look at these two sentences:
The guidebook suggests cliff walks along the seashore.
Mother suggests Cliff walks along the seashore.
A computational parser, relying on fairly standard rules, might have a hard time deciding on the status of walks in both sentences above. Humans, on the other hand, don't miss a beat in correctly understanding both sentences with sufficient context. The game "Twins" presents players with sentences containing words that can function as nouns or verbs and asks them to say which it is.
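A short Python sketch shows how brittle a hand-written rule for this can be. The capitalization rule below is an assumption invented for illustration, not a rule taken from any real parser or from wordrobe.

```python
def guess_tag(sentence, target="walks"):
    """Guess whether target is a NOUN or a VERB from the word before it."""
    words = sentence.rstrip(".").split()
    i = words.index(target)
    previous = words[i - 1] if i > 0 else ""
    # Assumed rule: a capitalized word in mid-sentence is probably a name,
    # so the target is probably its verb. Otherwise, treat the target as
    # the second half of a noun compound.
    if i > 1 and previous[:1].isupper():
        return "VERB"
    return "NOUN"


print(guess_tag("The guidebook suggests cliff walks along the seashore."))
print(guess_tag("Mother suggests Cliff walks along the seashore."))
print(guess_tag("She walks to work."))
```

The rule happens to separate the two seashore sentences, tagging the first as a noun and the second as a verb. It then tags walks in "She walks to work." as a noun, because a capital letter at the start of a sentence tells it nothing. Human judgments of the kind "Twins" collects are what help replace guesses like this with better-informed ones.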
All of the games at wordrobe appeal to people who love words, and many such people are writers and editors. You may wonder, as you play these games, whether you are contributing to a cause that will eventually put you out of a job, in a brave new world in which computers do all the writing and editing. You're probably safe for now; natural language is an extremely complex communication tool, and today humans are by far the most reliable and expert natural language processors.