A Monthly Column for Word Lovers
To Lump or Not to Lump?
Publishers of dictionaries today face a major dilemma: how can they justify continuing to devote the tremendous resources required to produce and distribute a dictionary in book form when an increasing number of people (a number that grows with each new birth in a developed country) will probably never need to use or own a paper dictionary? It's a question that has gotten publishers' attention but that is far from being resolved. There is considerable prestige associated with publishing a reputable dictionary in book form, and a certain amount of chagrin in having to stop doing so.
One tactic to keep the paper dictionary on the bookstore shelf — whether it is a stopgap or permanent solution remains to be seen — is to continue to publish the paper dictionary at a nominal loss while making money in other ways on the dictionary database: through related low-production-cost titles, electronic publishing, and licensing of the data for various uses. A promising but also problematic use of dictionary data is in natural language processing (NLP), wherein computers are fed bucketfuls of language with the expectation that they will be able to do something useful with it faster and more efficiently than humans do. A dictionary database can supply a full inventory of the meanings of words that, in theory, can aid a computer in word sense disambiguation (WSD): that is, determining which of many senses of a particular word is intended in a given context.
This interface between language database and machine is a busy place that we've visited before in the Lounge. Our visit this month takes us into a subject that is never far from the lexicographer's heart, and one that is especially problematic for computers when they deal with language: lumping and splitting.
You're probably aware of the phenomena of lumping and splitting in dictionaries and in the VT, even if you don't think of them in these terms. Splitting is the easy one: have a look at fin, for example. It's a good example of a polysemous word, that is, one with many senses. Dictionary writers and users both find it useful to split a word like fin into multiple senses that are reasonably distinct from each other. The VT's definitions of fin, both the noun and the less frequent verb, are what we call splitty in the trade: each sense designates essentially one thing.
Lumping, on the other hand, is the grouping together of related meanings of a word under a single sense. Look, for example, at recall. It has some splitty senses, but one particularly lumpy one: "cause one's (or someone else's) thoughts or attention to return from a reverie or digression". This kind of lumping is typical in dictionaries, and makes it possible for them to weigh no more than about five pounds when delivered. This definition of recall contains three or's, and can generate eight distinct definitions:
cause one's thoughts to return from a reverie
cause one's thoughts to return from a digression
cause one's attention to return from a reverie
cause one's attention to return from a digression
cause someone's thoughts to return from a reverie
cause someone's thoughts to return from a digression
cause someone's attention to return from a reverie
cause someone's attention to return from a digression
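The arithmetic behind that list (three binary choices, so 2 × 2 × 2 = 8 readings) can be sketched in a few lines of Python. This is only an illustration: the `slots` layout and the `expand` helper are names invented here, and the wording of the alternatives follows the list above.

```python
from itertools import product

# Each slot of the lumped definition lists its alternatives.
# A slot with a single entry is fixed text; each "or" adds a second entry.
slots = [
    ["cause"],
    ["one's", "someone's"],
    ["thoughts", "attention"],
    ["to return from a"],
    ["reverie", "digression"],
]


def expand(slots):
    """Return every definition obtained by picking one alternative per slot."""
    return [" ".join(choice) for choice in product(*slots)]


definitions = expand(slots)
for definition in definitions:
    print(definition)
print(len(definitions))  # three two-way choices give 2 * 2 * 2 = 8
```

Adding a fourth "or" to the definition would double the count again, which is one way to see how much a single lumped sense can compress.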
The lumping in this case is not terribly problematic, because the things lumped (thoughts and attention, reverie and digression, reflexive use and transitive use) are not worlds apart from each other: they are not things of an entirely different nature. But take another definition, not from the VT but from a leading British dictionary, of the noun clipping:
something cut out or trimmed off, especially an article from a newspaper
This definition contains two time-honored lumping tools: or, which we've noted already, and especially, a definition code-word indicating that among the meanings lumped in a sense, one particular meaning is far more frequently found than the others.
This sort of dictionary-speak is not a challenge to most natural language users, that is, human beings: you take what you need and leave the rest, and you probably find in the definition of clipping what you're looking for because you have a context that tells you pretty quickly which meaning is closest to the one you seek. A computer, on the other hand, is not a natural language user: at best it's an artificial user of natural language, and dictionary-speak is not exactly natural language. The lumping in the above definition of clipping can be somewhat pernicious with regard to NLP, because it passes over a "real world" distinction that the human mind makes automatically but that a computer would not know to make. A clipping that you "cut out" is usually one that you want to keep; a clipping that you "trim off" (nail clippings, lawn clippings) is one that you typically discard. So the definition seamlessly lumps two entirely different kinds of things: one that you separate in order to keep, one that you separate in order to throw away.
When a computer is processing text at a fast clip — say, five thousand words a minute or so — how is it going to decide, on the basis of a definition like "something cut out or trimmed off, especially an article from a newspaper" which meaning of clipping is the one intended? Let's look at some typical uses of the word:
The newspaper sense is indeed the most frequent, and any sentence that contains "newspaper" as well as "clipping" would probably be safely sorted by the computer into the proper sense. But beyond this, the definition does not provide many cues to a naïve user like a computer about which sense of clipping might be meant. The upshot, if we may go from the particular to the general in one step, is that traditional dictionary definitions — the kind that humans have been happily dealing with for hundreds of years — are often not very useful to a computer processing text.
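To make the point concrete, here is a minimal sketch of a naive word-overlap matcher, loosely in the spirit of gloss-overlap approaches to WSD such as the simplified Lesk method. It is an illustration under stated assumptions, not a description of any real NLP system: the two "split" glosses and both test sentences are invented for this example, and the stopword list and the names `tokens` and `best_sense` are arbitrary choices.

```python
import re

# The lumped definition quoted in the column.
LUMPED = {
    "clipping": "something cut out or trimmed off, especially an article from a newspaper",
}

# Hypothetical split glosses, invented here for comparison only
# (they are not taken from any dictionary).
SPLIT = {
    "clipping (kept)": "an article cut out from a newspaper or magazine in order to keep it",
    "clipping (discarded)": "a piece trimmed off something, such as a nail, hedge, or lawn, and thrown away",
}

# A small, arbitrary stopword list; a real system would use a fuller one.
STOPWORDS = {
    "a", "an", "the", "of", "or", "off", "out", "from", "in", "to",
    "it", "and", "as", "such", "something",
}


def tokens(text):
    """Lowercase the text and return its set of words, minus stopwords."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def best_sense(sentence, senses):
    """Score each sense by words shared between its gloss and the sentence.

    Returns (name of the best-scoring sense, or None if nothing overlaps,
    dict of all scores).
    """
    context = tokens(sentence)
    scores = {name: len(context & tokens(gloss)) for name, gloss in senses.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


# Invented example sentences.
newspaper_sentence = "She kept a newspaper clipping about the election in her wallet."
lawn_sentence = "He swept the lawn clippings into a bag."

for sentence in (newspaper_sentence, lawn_sentence):
    print(sentence)
    print("  lumped:", best_sense(sentence, LUMPED))
    print("  split: ", best_sense(sentence, SPLIT))
```

With the lumped definition, only the sentence that happens to contain "newspaper" finds any cue at all; the lawn sentence shares no content words with it. The invented split glosses give the matcher something to hold on to in both cases, though exact word matching of this kind remains very brittle.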
So here's the dilemma for dictionary publishers, rephrased: can dictionary definition language be made more computer-friendly, as a way of ensuring that dictionary-making is sustainably profitable in the future? And would doing this detract from the usefulness of dictionary definitions for their core (if not very profit-generating) audience, namely human beings? The short answers to these questions are, respectively, "yes" and "yes" — and this of course does not resolve the dilemma, it only perpetuates it. Later this month, at the meeting of the Dictionary Society of North America, we'll address the question, and next month in the Lounge we'll unpack the answers to these questions in a little more detail.