Language Lounge

A Monthly Column for Word Lovers

To Lump or Not to Lump?

Publishers of dictionaries today face a major dilemma: how can they justify continuing to devote the tremendous resources required to produce and distribute a dictionary in book form when an increasing number of people (a number that grows with each new birth in a developed country) will probably never have the need to use or own a paper dictionary? It's a question that has gotten publishers' attention but that is far from being resolved. There is considerable prestige associated with publishing a reputable dictionary in book form, and a certain amount of chagrin in having to stop doing so.

One tactic to keep the paper dictionary on the bookstore shelf — whether it is a stopgap or permanent solution remains to be seen — is to continue to publish the paper dictionary at a nominal loss while making money in other ways on the dictionary database: through related low-production-cost titles, electronic publishing, and licensing of the data for various uses. A promising but also problematic use of dictionary data is in natural language processing (NLP), wherein computers are fed bucketfuls of language with the expectation that they will be able to do something useful with it faster and more efficiently than humans do. A dictionary database can supply a full inventory of the meanings of words that, in theory, can aid a computer in word sense disambiguation (WSD): that is, determining which of many senses of a particular word is intended in a given context. 

This interface between language database and machine is a busy place that we've visited before in the Lounge. Our visit this month takes us into a subject that is never far from the lexicographer's heart, and one that is especially problematic for computers when they deal with language: lumping and splitting.

You're probably aware of the phenomena of lumping and splitting in dictionaries and in the VT, even if you don't think of them in these terms. Splitting is the easy one: have a look at fin, for example. It's a good example of a polysemous word, that is, one with many senses. Dictionary writers and users both find it useful to split a word like fin into multiple senses that are reasonably distinct from each other. The VT's definitions of fin, both the noun and the less frequent verb, are what we call splitty in the trade: each sense designates essentially one thing.

Lumping, on the other hand, is the grouping together of related meanings of a word under a single sense. Look, for example, at recall. It has some splitty senses, but one particularly lumpy one: "cause one's (or someone else's) thoughts or attention to return from a reverie or digression". This kind of lumping is typical in dictionaries, and makes it possible for them to weigh no more than about five pounds when delivered. This definition of recall contains three or's, and can generate eight distinct definitions:

cause one's thoughts to return from a reverie
cause one's thoughts to return from a digression
cause one's attention to return from a reverie
cause one's attention to return from a digression
cause someone's thoughts to return from a reverie
cause someone's thoughts to return from a digression
cause someone's attention to return from a reverie
cause someone's attention to return from a digression
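
The arithmetic behind this expansion is simply three two-way choices multiplied together. A minimal Python sketch follows; the template string and the variable names are our own illustrative assumptions, not anything drawn from a real dictionary database:

```python
from itertools import product

# Each "or" in the lumped definition offers two alternatives.
# The wording is transcribed from the expansion listed above.
possessors = ["one's", "someone's"]
objects = ["thoughts", "attention"]
states = ["reverie", "digression"]

def expand_definition():
    """Return every split definition hidden inside the lumped one."""
    return [
        f"cause {who} {what} to return from a {state}"
        for who, what, state in product(possessors, objects, states)
    ]

definitions = expand_definition()
for d in definitions:
    print(d)
print(len(definitions))  # 2 * 2 * 2 = 8
```

Each additional or in a definition doubles the count, which is one way of seeing why lumping keeps a dictionary's bulk manageable.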

The lumping in this case is not terribly problematic, because the things lumped (thoughts and attention, reverie and digression, reflexive use and transitive use) are not worlds apart from each other: they are not things of an entirely different nature. But take another definition, not from the VT but from a leading British dictionary, of the noun clipping:

 something cut out or trimmed off, especially an article from a newspaper

This definition contains two time-honored lumping tools: or, which we've noted already, and especially: a definition code-word for indicating that among the meanings lumped in a sense, one particular meaning is far more frequently found than others.

This sort of dictionary-speak is not a challenge to most natural language users, that is, human beings: you take what you need and leave the rest, and you probably find in the definition of clipping what you're looking for because you have a context that tells you pretty quickly which meaning is closest to the one you seek. A computer, on the other hand, is not a natural language user: at best it's an artificial user of natural language, and dictionary-speak is not exactly natural language. The lumping in the above definition of clipping can be somewhat pernicious with regard to NLP: it passes over a "real world" distinction that the human mind makes automatically, but that a computer would not know to make: a clipping that you "cut out" is usually one that you want to keep; a clipping that you "trim off" (nail clippings, lawn clippings) is one that you typically discard. So the definition seamlessly lumps two entirely different kinds of things: one that you separate in order to keep, one that you separate in order to throw away.

When a computer is processing text at a fast clip (say, five thousand words a minute or so), how is it going to decide, on the basis of a definition like "something cut out or trimmed off, especially an article from a newspaper", which meaning of clipping is the one intended? Let's look at some typical uses of the word:

 someone anonymously sent us a newspaper clipping , dated a few years back. It was about how  
blow. Then I explained to him about the  clipping  I'd received from South Africa, the article 
 of contemporary newspaper and magazine  clippings  of the famous events, plus another fine 
ing a fence. No one gets three years for clipping ." She made another quick turn, this time  
een Julius and that woman in the Ledger  clippings  ?"   `I wouldn't have, but he showed me 
er death notice among the old newspaper  clippings  that Miss Grant had collected. Interestingly 
 built, modern suburb. He had with him a clipping from the local newspaper giving the names  
ews. But he did not just throw away the  clippings  . He spliced together all the gaps, the 
ht,' whispered Ellie,'I save my toenail  clippings  and leave them in his sock drawer.''I heard 
eafed through the thick file of Delafoy  clippings  and found my piece near the top. Delafoy 
 dangerous, for blood, like hair or nail clippings , can form a link between you and any forces  
 a very important weapon. "They collect  clippings  , they distribute material in small ways 
ing the lawn while a third raked up the  clippings  . Two herons flew above the distant stream 
 the same day. Impatient with newspaper  clippings  , she set off again to Wolfrats-hausen towards 
tractive shape. Hedges will need regular clipping . Not suitable for growing in pots. Harvesting  
reathe, plant a lettuce, throw the lawn  clippings  onto the compost heap, start the car, fell 
irdresser, with a palmful of straw-blond clippings , had smilingly informed Grillo that he  
ight!" cackled the ancient, stowing the  clipping  away. `And I'll tell you some more. I know 
the Yusufzai clan. Khalil still had the  clippings  from the Melbourne Age, with pictures and 

The newspaper sense is indeed the most frequent, and any sentence that contains "newspaper" as well as "clipping" would probably be safely sorted by the computer into the proper sense. But beyond this, the definition does not provide many cues to a naïve user like a computer about which sense of clipping might be meant. The upshot, if we may go from the particular to the general in one step, is that traditional dictionary definitions — the kind that humans have been happily dealing with for hundreds of years — are often not very useful to a computer processing text.
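
To make the point concrete, here is a rough Python sketch of the kind of cue matching that the definition alone permits. The sense labels, the cue sets, and the function name guess_sense are all hypothetical; the only cues used are content words that actually occur in the definition quoted above, which is why so few contexts get resolved. It is a toy under those assumptions, not a description of how any real WSD system works.

```python
import re

# Hypothetical sense inventory. The cue words come only from the
# definition "something cut out or trimmed off, especially an article
# from a newspaper"; a real system would need far richer evidence.
SENSE_CUES = {
    "kept (cut out)": {"article", "newspaper", "cut"},
    "discarded (trimmed off)": {"trimmed"},
}

def guess_sense(context):
    """Pick the sense whose cue words overlap most with the context.

    Returns None when no cue word from the definition appears at all.
    """
    words = set(re.findall(r"[a-z']+", context.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_sense("He had with him a clipping from the local newspaper"))
print(guess_sense("a third raked up the clippings"))
```

The first context is sorted into the "kept" sense on the strength of "newspaper"; the second, a lawn example that any human reader resolves at once, gets no answer because nothing in the definition mentions lawns, rakes, nails, or hair.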

So here's the dilemma for dictionary publishers, rephrased: can dictionary definition language be made more computer-friendly, as a way of ensuring that dictionary-making is sustainably profitable in the future? And would doing this detract from the usefulness of dictionary definitions for their core (if not very profit-generating) audience, namely human beings? The short answers to these questions are, respectively, "yes" and "yes". This of course does not resolve the dilemma; it only perpetuates it. Later this month, at the meeting of the Dictionary Society of North America, we'll address the question, and next month in the Lounge we'll unpack the answers to these questions in a little more detail.

Orin Hargraves is an independent lexicographer and contributor to numerous dictionaries published in the US, the UK, and Europe. He is also the author of Mighty Fine Words and Smashing Expressions (Oxford), the definitive guide to British and American differences, and Slang Rules! (Merriam-Webster), a practical guide for English learners. In addition to writing the Language Lounge column, Orin also writes for the Macmillan Dictionary Blog.
