Word Routes

Exploring the pathways of our lexicon

All Aboard the "Chunking" Express

This Sunday's New York Times Magazine was a special issue on education, with a focus on education technology. I used the opportunity to write an On Language column that explored new theoretical approaches to language learning that are having important practical applications in the English-language classroom.

The insights that are being put into practice have to do with "chunking" — the way that we learn and process language in prefabricated strings of words, or "lexical chunks." Native speakers of a language like English take for granted how much we rely on these chunks, and we tend not to appreciate their significance in the creation of linguistic fluency. But acquiring competency in a language isn't all about mastering rules of grammar and finding words to fill the functional slots, despite the syntactic emphasis in formal linguistics that has been championed by Noam Chomsky and his followers. A counter-current in linguistics since the 1960s has focused on what the late British scholar John Sinclair called "the idiom principle," or the tendency of certain words to cluster together with certain other words in their vicinity.

When Sinclair and like-minded linguists were first pioneering this approach to language, the technological tools to bear it out were limited. What was required was the construction of massive "corpora" (databases of texts) that could be analyzed to determine "collocations" (high-frequency combinations of words and phrases). The Brown Corpus was state-of-the-art in the '60s, and contained about a million words in total. Sinclair helped build a much larger corpus known as the Bank of English, which was used by the COBUILD project (a joint effort by the Collins publishing company and the University of Birmingham) for the express purpose of helping learners of English.

After Collins used COBUILD to create its series of learner's dictionaries, other dictionary publishers such as Cambridge and Oxford followed suit. Additional corpus projects include the British National Corpus, the American National Corpus, the International Corpus of English, and the latest entry, the Corpus of Contemporary American English, compiled by Mark Davies at Brigham Young University. The cutting-edge corpora of today, containing hundreds of millions or even billions of words, make the Brown Corpus of the '60s look quaint.

What all of these corpora tell us is which "chunks" are the most salient in the language — from common collocations (shrug goes with shoulders, for instance), all the way up to longer idioms and conventional expressions in everyday interaction. The teaching of English — especially to non-native learners — is drawing lessons from all of this research, moving away from traditional pedagogy that primarily emphasizes grammatical rules and lists of individual vocabulary words. Now learners can appreciate how the meanings of words are revealed by their frequent neighbors in both spoken and written texts.
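The statistical idea behind spotting collocations like *shrug shoulders* is to compare how often two words appear together against how often chance alone would predict. A common measure for this is pointwise mutual information (PMI). Here is a minimal sketch in plain Python on an invented toy corpus; the `collocations` helper and the sample text are illustrative, not part of any of the corpus tools mentioned above:

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    High-PMI pairs co-occur more often than their individual
    frequencies would predict -- the statistical signature of a chunk."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # ignore pairs too rare to score reliably
        p_xy = count / (n - 1)
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = ("she shrugged her shoulders . he shrugged his shoulders . "
          "she nodded . he nodded . she shrugged her shoulders .").split()

for pair, score in collocations(corpus)[:3]:
    print(pair, round(score, 2))
```

Even on this tiny sample, the *shrugged her shoulders* chunk floats to the top, because those words cluster together far more tightly than words like *she* or the period, which attach to many different neighbors. Real corpus tools apply the same logic at the scale of hundreds of millions of words, with windows wider than adjacent pairs.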

These issues first piqued my interest when I was working as editor for American dictionaries at Oxford University Press. You can read an interview I did with the Visual Thesaurus here where I described how corpus research was transforming the way that lexicographers make dictionaries. Not long after that, I came over to the Visual Thesaurus as executive producer, and I brought my interest in corpus-based approaches to language with me. Our vocabulary analysis tool, VocabGrabber, relies on a large corpus to establish which words are the most relevant in a text. We are continuing to use corpus findings to develop new vocabulary-building features both on the Visual Thesaurus and our sister site, Vocabulary.com.

If you want to hear more about "chunking" and its applications for the teaching of English, check out the video chat I had with the linguist John McWhorter on Bloggingheads. And if you stick around after the chunking talk, you'll hear us discuss a range of other language-related topics, including Obama's proficiency in Indonesian and the authenticity of the dialogue on "Mad Men."



Ben Zimmer is language columnist for The Wall Street Journal and former language columnist for The Boston Globe and The New York Times Magazine. He has worked as editor for American dictionaries at Oxford University Press and as a consultant to the Oxford English Dictionary. In addition to his regular "Word Routes" column here, he contributes to the group weblog Language Log. He is also the chair of the New Words Committee of the American Dialect Society.

Comments from our users:

Monday September 20th 2010, 7:23 AM
Comment by: Geoff A. (United Kingdom of Great Britain and Northern Ireland)
This 'chunks' theory throws light on how people could manage (I think in simpler times?) to live a normal, fully active life on a vocabulary of 500 words. I read this figure somewhere years ago and had assumed their discourse must have been severely limited. My mistake was to see these people drawing on a list of 500 words, the kind of list so familiar to language students, divided into 'Home', 'Farm', 'Pub', etc. But language works in chunks, not individual words, and those 500 words combined into maybe thousands of chunks that enabled those people to talk about issues that went beyond the home, the farm, and even the pub.

As a matter of interest, what is the vocabulary thought to be of the average American? Is there a huge difference between city and rural dwellers, or between graduates and non-graduates? Or are such figures not available?
Monday September 20th 2010, 8:07 AM
Comment by: Ravi K.
Thanks for the article! Does anyone know where I can find a corpus of such chunks in the Spanish language?
Monday September 20th 2010, 8:57 AM
Comment by: Federico E. (Camuy, PR)
Ravi: I can think of two places, one of them freely accessible, the other not. The first is run by the Spanish Academy itself (the Real Academia Española). It's a corpus called Corpus de Referencia del Español Actual (CREA). I use it all the time, and it's rather good (even if not as complete as COCA in English). When you use CREA, keep in mind that it's case-sensitive. Here's the URL: http://corpus.rae.es/creanet.html.

The second source I recommend is the only collocations dictionary in Spanish I know of. It was very avant-garde when it came out precisely because of that (an abridged version was published later). It's called Redes. Diccionario combinatorio del español contemporáneo.

Hope this helps.
Monday September 20th 2010, 1:50 PM
Comment by: christiane P. (Paris, Afghanistan)
Thanks for the article! This is all very new to me; I don't yet follow all of it, but I promise myself to go back and listen to it again. I have seen so many recordings to listen to, good for my information and my lessons. These aids are a gift.
Monday September 27th 2010, 10:27 PM
Comment by: James D. (Edmond, OK)
Thank you so much for bringing functional linguistics and SFL-based corpus linguistics into the discussion!

I'm certainly on board with SFL and have been for a good while.

