The idea that we all have a soul mate out there somewhere is a popular cultural meme. Words seem to have soul mates as well, judging by the way that they mate for life. But such word unions are not always marked with ceremony, the way human ones are, and this makes some of the hookups a bit difficult to document and validate.
Word marriages (and let's call them collocations, to give them their proper name) stand the best chance of being officially recognized if they're candidates for lemmatization: that is, if they constitute a union of a kind that could be a citation form in a dictionary. There is a prejudice at work here, because dictionaries – in English, anyway – only lemmatize parts of speech. So word pairs that have the credential of being a noun, a verb, an adjective, and so forth, are the ones that get their foot in the door of recognition. If they're lucky, fingers poised over keyboards (or dusty old papers) the world over will devote some effort to determining whether this pairing of words deserves a place in a dictionary headword list, which is a good beginning for many things that will follow: definition, research into origin, documentation of usage, and indexing by giant information extraction programs.
Collocations that don't constitute a part of speech – say, for example, adverb + adjective, verb + adverb, noun + verb, verb + noun – often languish in the dustbin of history, with no one giving particular attention to when, where, or how they first found each other. Compound nouns (noun + noun, adjective + noun) that are regarded as transparent – that is, no more than the sum of their parts – also do not make the grade of the dictionary headword list, and usually do not merit study by lexicographers and other word detectives.
Should we care about this? Yes and no. There are huge numbers of collocations that are a sort of foregone conclusion (sun shines; eat food; absolutely certain; vanish completely) and we don't care when or whence they originated because it seems inevitable that they would get together, and their presence in a text may not signify much. Such collocations are usually chunks of language that accrue naturally over time to both the native speaker's and the learner's lexicon. But the huge middle ground of compound nouns whose meaning may or may not be obvious (depending on context or whom you ask) deserves some thinking about.
What about foregone conclusion, which we used above? This is a lucky pair of words, because it was none other than William Shakespeare who first married them in print (in Othello). Foregone conclusion is not quite a transparent compound, the participle foregone being somewhat unusual in the attributive position. But the Shakespeare connection is undoubtedly what gave foregone conclusion legs. A collocation's first citation in the works of a noted author is like having a celebrity guest at your wedding: you get a lot more attention than you might otherwise merit.
Have you ever taken a sentimental journey? Some readers are already hearing Doris Day in their heads. Others may be flashing back to a college literature class, in which they read Laurence Sterne's 1768 novel entitled A Sentimental Journey Through France and Italy. The collocation sentimental journey doesn't merit dictionary inclusion because it's regarded as transparent, but it is nonetheless very popular: indeed, it passes the test of many a popular collocation by having its own disambiguation page on Wikipedia. It rises above the level of what we'd call a mere collocation; because of its popularity and frequency, we can call it a concept.
In today's world of data and information overload, concepts are as hot as collocations. The latter, as you might guess, are often a representation of the former, and data miners integrate collocations into their algorithms as a way to facilitate high-speed collection of key information from vast documents, or from a big bucket of small documents on the Internet. The other day we test-drove a commercially available API (that's application programming interface) to see how this works.
We turned the program loose on an article in Wired about how defense companies have built a virtual reality simulation based on the raid on the Bin Laden compound in Pakistan. The API was supposed to send us back key concepts in the article. Here's what we got.
How successful is it? Not bad: the returned list tells you at a glance that the article contains information, possibly useful to you, about the Abbottabad compound, ginormous flatscreens, Navy SEALs, infamous terrorists, and badass phylacteries. Badass phylacteries? Well, they might be a look, but they seem a few catwalk appearances short of a concept. Perhaps they're something that "product manager joel" wears.
We started thinking about this problem with collocations constituting nouns – when they're significant, when they're not – on the basis of a particular collocation: auspicious pair. Does it ring a bell? Would you call it a transparent compound? Should it be flagged by a program crawling over text at lightning speed as worthy of attention? As so often with collocations, the answers to these questions depend on whom you ask. For most readers, "auspicious pair" is probably just an adjective and a noun. But if you're a student of Buddhism it immediately conjures a story from the Buddha's life in which, by virtue of his supernormal mental powers, he recognized the approach of his two future chief disciples when they were coming to him for the first time:
And the Blessed One saw them, Sâriputta and Moggalâna, coming from afar; on seeing them he thus addressed the Bhikkhus: "There, O Bhikkhus, two companions arrive...; these will be a pair of true pupils, a most distinguished, auspicious pair."
Now, the curious thing is that if you Google "auspicious pair," you get a lot of hits on this very story or an allusion to it. But you also get some hits that suggest how, and how far, this rather obscure collocation has traveled. Not very far, by the look of it: nearly all the hits on auspicious pair have some connection with Asia or with religion. Under what circumstances, in a given text, should a collocation like auspicious pair be flagged as interesting or significant?
The art of automated information extraction is a huge growth industry now, and though not exactly in its infancy, it still has miles to go before it reliably returns what you're looking for while leaving behind what you're not. We talked a few months ago in the Lounge about Google's publication of massive databanks of collocations, and these should go a long way towards helping researchers trace the careers of various "bigrams" (that's two-word collocations), "trigrams" (three), and so forth. But algorithm-crafters still have a thorny problem to deal with: finding a way to determine programmatically how context influences which collocations may be significant, so that a killer API of the future will know, after a lightning scan through a text, which bigrams are truly auspicious pairs.
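For readers curious what a first pass at the problem looks like, here is a minimal sketch in Python. It is not the method of the API we tried or of any product mentioned here. It counts adjacent word pairs in a text and ranks them by pointwise mutual information (PMI), a standard association measure that rewards pairs occurring together more often than their individual frequencies would predict. The function name bigram_pmi, the min_count cutoff, and the sample text are our own illustrative assumptions.

```python
import math
import re
from collections import Counter


def bigram_pmi(text, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative sketch only: tokenization is a crude regex, and pairs are
    counted across sentence boundaries.
    """
    words = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    total_words = len(words)
    total_pairs = max(len(words) - 1, 1)

    scores = {}
    for (first, second), count in bigrams.items():
        # Skip rare pairs; PMI is unreliable for very low counts.
        if count < min_count:
            continue
        p_pair = count / total_pairs
        p_first = unigrams[first] / total_words
        p_second = unigrams[second] / total_words
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample text, invented for this illustration.
sample = (
    "The auspicious pair arrived. The pair sat down. "
    "An auspicious pair is rare. The day was long and the day was warm."
)
ranked = bigram_pmi(sample)
for pair, score in ranked:
    print(" ".join(pair), round(score, 2))
```

On this toy sample, "day was" scores higher than "auspicious pair," which is the thorny problem in miniature: the statistic sees only how often words keep company, and knows nothing about whether the reader is a student of Buddhism.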