The Culturome and the Lexicographer

Google Labs released the Google Books corpus and its Ngram viewer a couple of months ago, coincident with the publication of a paper in the journal Science (free registration required), explaining and illustrating various proposed uses for the data and the tool. These events spawned a bevy of observations great and small: among the great there have been both Ben Zimmer and Dennis Baron, here at the VT; numerous commentators on Language Log; and perhaps most eloquently and thoroughly, Geoff Nunberg in the Chronicle of Higher Education. Among the small, the Twitterverse is now atwitter with the hashtag #ngram, which, if you follow it, will allow you to view a number of entertaining but often specious graphs displaying and comparing the historical trajectories of various terms. Herewith, a medium-sized observation about the data and tools that Google has released, and their reported usefulness to lexicographers.

The paper published in Science devotes several column inches to the implications of the data for dictionaries and for study of the lexicons of various languages. We were struck by a statement in the paper that purports to explain the failure of dictionaries to cover less frequent words in the lexicon:

This gap between dictionaries and the lexicon results from a balance that every dictionary must strike: it must be comprehensive enough to be a useful reference, but concise enough to be printed, shipped, and used. As such, many infrequent words are omitted.

This seems, to us, a peculiarly dated and 20th-century notion to be appearing in an up-to-date scientific article. Or rather, it explains the failure of dictionaries to fully document the lexicon historically, though it hardly does so now. The number of dictionaries that are being "printed, shipped, and used" is ever declining, and has been for some time. The print dictionary's one dependable growth area — ESL dictionaries — is not much concerned with the absence of words like aridification and deletable from their headword lists. So in fact, the limitations imposed by the printing press should no longer be an excuse for failing to treat the lexicon fully and comprehensively. What the availability of the Google data really does is to drive another nail into the coffin of the print dictionary as a comprehensive tool for the serious student of the language.

Interestingly, there are in fact two modern sources where seekers after definitions of obscure words may find satisfaction for their curiosity. Neither of them is a "printed, shipped" dictionary but both are dictionaries of sorts and both are free. One is Wiktionary, the online wiki dictionary, and the other is Wordnik, a website that aggregates dictionary definitions (including those from Wiktionary) and other word data, such as examples and images. Both of these sites in fact have some data for aridification and deletable, which are two examples used in the Science article of words not appearing in major dictionaries.

We do not mean to suggest that Wordnik and Wiktionary offer treatments superior to those that may be found in professionally edited print dictionaries, or that the age of the printed dictionary is now at an end. That idea, in any case, has already been aired in the Lounge, replete with many nostalgic comments about the joys of holding these big old books in your lap. But the easy availability of so much new lexical information in the Google data, and the inadequacy of print dictionaries in relation to it, leave very little to recommend the print approach to dictionaries as a vehicle for the investment of resources. Dictionary publishers would do better to put their money into products that take full advantage of new technology and data, combined with their historical expertise in interpreting and presenting lexical information.

The Science article discusses the disparities between the full lexicon and dictionaries' coverage of it with numerous detailed examples, leading eventually to this conclusion:

Our results suggest that culturomic tools will aid lexicographers in at least two ways: (i) finding low-frequency words that they do not list; and (ii) providing accurate estimates of current frequency trends to reduce the lag between changes in the lexicon and changes in the dictionary.

We have no quibble with this well-documented observation, but we would substitute "dictionary publishers" for "lexicographers." Lexicographers usually work with a much more granular level of lexical information than the Google datasets or Ngram graphs provide, and the rather mind-numbing tasks noted above can be completed much more effectively by computer programs than by lexicographers — even though lexicographers are very good at mind-numbing tasks!

To illustrate the somewhat limited nature of the graphical presentation of Google book data from the point of view of the lexicographer, here's a graph of the word keyboard — a word that even nonlexicographers know has undergone a number of career enhancements and extensions since its introduction to English in the early 19th century:

A lexicographer tasked with creating or updating a dictionary entry for keyboard would find nothing here that he or she did not know intuitively: namely, that keyboard long ago escaped the narrow confines of its musical origins, and that it appears in an ever-increasing number of places because of its use with electronic devices. The huge uptick in usage around 1980, we would guess, reflects the development and success of the personal computer. The fluctuations in the use of keyboard after the mid 1980s are somewhat surprising and might bear investigation by someone particularly interested in the word's career. But nothing in the graph would be of much help to anyone crafting a definition of keyboard, or deciding how many and what parts of speech and senses it ought to be divided into.

Here, by contrast, is a Word Sketch of the noun keyboard, from a corpus that is tagged for part-of-speech:

With this sort of data, a lexicographer can easily identify everything pertinent there is to know about keyboard as a noun: its most frequent and most salient uses, its most typical collocates in every possible grammatical function, the frequency and nature of prepositional phrases that follow it or include it — along with the opportunity to view thousands of uses of it in context. A similar Word Sketch can be drawn up for the verb keyboard.

Aside from the suggestions of the Science authors noted above, we suspect that there are many rich pickings for dictionary makers in this vast trove of Google's data. One area that we think is especially promising is 2-grams, or to put it another way, two-word collocations. A vast number of compound nouns (and smaller numbers of compound adjectives and verbs) merit appearances in dictionaries because they are not understandable by combining the meanings of their parts, especially to language learners, who may lack the instincts about a language that native speakers possess. But it is easy for these terms to fly under the radar of headword gatherers because hitherto there have been no easy tools for collecting them. Take, for example, guestbook, or if you prefer, guest book.

If you're a native speaker, you probably know both the original meaning (a book for guests to sign at an event or attraction) and the newer meaning, which may have influenced the spike in usage in about 1990: a web page that provides a place for visitors to record their names and observations. If you're not a native speaker, you may have no idea what a guestbook is: a book about guests? By guests? For the use of guests?

Guestbook is clearly a word that deserves a dictionary definition, but it still does not appear in many modern dictionaries, and it did not make it into the OED until 2004, even though it had been around for 150 years before that. As the graph above shows, the two-word spelling guest book has been more common than the solid compound guestbook from the get-go, and it continues to be. If dictionary makers had been on the lookout for the two-word form earlier, it might have been given a place in dictionaries far sooner. We expect that many such cases lurk in the Google data; those with the resources to download and exploit the Google datasets may discover a lexical gold mine.
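For readers who want to try this kind of prospecting themselves, here is a minimal sketch in Python of the idea described above: look for two-word sequences whose run-together form also turns up as a single word. The function name compound_candidates is hypothetical, and the counts are invented for illustration; they are not taken from the Google datasets, which would first have to be downloaded and tallied into tables like these.

```python
def compound_candidates(unigram_counts, bigram_counts, min_count=1):
    """List two-word forms whose solid (run-together) spelling is also attested.

    unigram_counts maps a word to its count; bigram_counts maps a
    (first, second) pair to its count. Both are assumed to have been
    tallied from a corpus beforehand.
    """
    results = []
    for (first, second), open_count in bigram_counts.items():
        solid = first + second
        solid_count = unigram_counts.get(solid, 0)
        # Keep the pair only if both spellings occur often enough.
        if open_count >= min_count and solid_count >= min_count:
            results.append((f"{first} {second}", solid, open_count, solid_count))
    # Most frequent candidates (both spellings combined) come first.
    results.sort(key=lambda row: row[2] + row[3], reverse=True)
    return results


# Invented counts, for illustration only.
unigrams = {"guestbook": 40, "keyboard": 900, "guest": 500, "book": 2000}
bigrams = {("guest", "book"): 120, ("key", "board"): 15, ("the", "book"): 3000}

candidates = compound_candidates(unigrams, bigrams)
for open_form, solid_form, open_count, solid_count in candidates:
    print(open_form, open_count, "|", solid_form, solid_count)
```

With these toy figures, "the book" drops out because "thebook" never occurs as a single word, while "guest book" and "key board" surface as pairs worth a lexicographer's second look. A list like this only nominates candidates; deciding which of them deserve an entry remains a human judgment.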

Orin Hargraves is an independent lexicographer and contributor to numerous dictionaries published in the US, the UK, and Europe. He is also the author of Mighty Fine Words and Smashing Expressions (Oxford), the definitive guide to British and American differences, and Slang Rules! (Merriam-Webster), a practical guide for English learners.

Tuesday February 1st 2011, 7:00 AM
Comment by: Thelma J. (FPO AP, AP)
or perhaps printed dictionaries could combine to provide meanings in more than one language!
Tuesday February 1st 2011, 1:40 PM
Comment by: Bill M. (Hermosa Beach, CA)
It seems to me the distinction between paper and "on line" is an artificial one - it's really just a choice of the recording mechanism. The main difference is the ability to charge for it. The actual recording medium and delivery mechanism will continue to evolve. Some day you will have the ability to pick a definition from a wireless hook up between a cloud and a chip in your brain and you will debate the difference between that and the memory on your smart communication device.
Saturday February 5th 2011, 12:16 PM
Comment by: Jane B. (Winnipeg Canada)
For some, at least, there could be a problem in choosing an accurate online source as there is at least as much misinformation as information out there. One objective of language instruction should be this, evaluating sources.

I make use of online searching almost entirely now, but I did grow up with print dictionaries and am accustomed to seeking and seeing all the offerings of a good definition.

