Word Routes

Exploring the pathways of our lexicon

All Aboard the "Chunking" Express

This Sunday's New York Times Magazine was a special issue on education, with a focus on education technology. I used the opportunity to write an On Language column that explored new theoretical approaches to language learning that are having important practical applications in the English-language classroom.

The insights that are being put into practice have to do with "chunking" — the way that we learn and process language in prefabricated strings of words, or "lexical chunks." Native speakers of a language like English take for granted how much we rely on these chunks, and we tend not to appreciate their significance in the creation of linguistic fluency. But acquiring competency in a language isn't all about mastering rules of grammar and finding words to fill the functional slots, despite the syntactic emphasis in formal linguistics that has been championed by Noam Chomsky and his followers. A counter-current in linguistics since the 1960s has focused on what the late British scholar John Sinclair called "the idiom principle," or the tendency of certain words to cluster together with certain other words in their vicinity.

When Sinclair and like-minded linguists were first pioneering this approach to language, the technological tools to bear it out were limited. What was required was the construction of massive "corpora" (databases of texts) that could be analyzed to determine "collocations" (high-frequency combinations of words and phrases). The Brown Corpus was state-of-the-art in the '60s, and contained about a million words in total. Sinclair helped build a much larger corpus known as the Bank of English, which was used by the COBUILD project (a joint effort by the Collins publishing company and the University of Birmingham) for the express purpose of helping learners of English.

After Collins used COBUILD to create its series of learner's dictionaries, other dictionary publishers such as Cambridge and Oxford followed suit. Additional corpus projects include the British National Corpus, the American National Corpus, the International Corpus of English, and the latest entry, the Corpus of Contemporary American English, compiled by Mark Davies at Brigham Young University. The cutting-edge corpora of today, containing hundreds of millions or even billions of words, make the Brown Corpus of the '60s look quaint.

What all of these corpora tell us is which "chunks" are the most salient in the language — from common collocations (shrug goes with shoulders, for instance), all the way up to longer idioms and conventional expressions in everyday interaction. The teaching of English — especially to non-native learners — is drawing lessons from all of this research, moving away from traditional pedagogy that primarily emphasizes grammatical rules and lists of individual vocabulary words. Now learners can appreciate how the meanings of words are revealed by their frequent neighbors in both spoken and written texts.
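The statistical idea behind spotting collocations like *shrug shoulders* is to compare how often two words appear together against how often chance alone would predict. A common measure for this is pointwise mutual information (PMI). Here is a minimal sketch in plain Python on an invented toy corpus; the `collocations` helper and the sample text are illustrative, not part of any of the corpus tools mentioned above:

```python
import math
from collections import Counter

def collocations(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information:
    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ).
    High-PMI pairs co-occur more often than their individual
    frequencies would predict -- the statistical signature of a chunk."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # ignore pairs too rare to score reliably
        p_xy = count / (n - 1)
        p_x, p_y = unigrams[x] / n, unigrams[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = ("she shrugged her shoulders . he shrugged his shoulders . "
          "she nodded . he nodded . she shrugged her shoulders .").split()

for pair, score in collocations(corpus)[:3]:
    print(pair, round(score, 2))
```

Even on this tiny sample, the *shrugged her shoulders* chunk floats to the top, because those words cluster together far more tightly than words like *she* or the period, which attach to many different neighbors. Real corpus tools apply the same logic at the scale of hundreds of millions of words, with windows wider than adjacent pairs.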

These issues first piqued my interest when I was working as editor for American dictionaries at Oxford University Press. You can read an interview I did with the Visual Thesaurus here where I described how corpus research was transforming the way that lexicographers make dictionaries. Not long after that, I came over to the Visual Thesaurus as executive producer, and I brought my interest in corpus-based approaches to language with me. Our vocabulary analysis tool, VocabGrabber, relies on a large corpus to establish which words are the most relevant in a text. We are continuing to use corpus findings to develop new vocabulary-building features both on the Visual Thesaurus and our sister site, Vocabulary.com.

If you want to hear more about "chunking" and its applications for the teaching of English, check out the video chat I had with the linguist John McWhorter on Bloggingheads. And if you stick around after the chunking talk, you'll hear us discuss a range of other language-related topics, including Obama's proficiency in Indonesian and the authenticity of the dialogue on "Mad Men."



Ben Zimmer is language columnist for The Wall Street Journal and former language columnist for The Boston Globe and The New York Times Magazine. He has worked as editor for American dictionaries at Oxford University Press and as a consultant to the Oxford English Dictionary. In addition to his regular "Word Routes" column here, he contributes to the group weblog Language Log. He is also the chair of the New Words Committee of the American Dialect Society.

Comments from our users:

Monday September 20th 2010, 7:23 AM
Comment by: Geoff A. (United Kingdom of Great Britain and Northern Ireland)
This 'chunks' theory throws light on how people could manage (I think in simpler times?) to live a normal, fully active life on a vocabulary of 500 words. I read this figure somewhere years ago and had assumed their discourse must have been severely limited. My mistake was to see these people drawing on a list of 500 words, the kind of list so familiar to language students, divided into 'Home', 'Farm', 'Pub', etc. But language works in chunks, not individual words, and those 500 words combined into maybe thousands of chunks that enabled those people to talk about issues that went beyond the home, the farm, and even the pub.

As a matter of interest, what is the vocabulary thought to be of the average American? Is there a huge difference between city and rural dwellers, or between graduates and non-graduates? Or are such figures not available?
Monday September 20th 2010, 8:07 AM
Comment by: Ravi K.
Thanks for the article! Does anyone know where I can find a corpus of such chunks in the Spanish language?
Monday September 20th 2010, 8:57 AM
Comment by: Federico E. (Camuy, PR)
Ravi: I can think of two places, one of them freely accessible, the other not. The first is run by the Spanish Academy itself (the Real Academia Española). It's a corpus called Corpus de Referencia del Español Actual (CREA). I use it all the time, and it's rather good (even if not as complete as COCA in English). When you use CREA, keep in mind that it's case-sensitive. Here's the URL: http://corpus.rae.es/creanet.html.

The second source I recommend is the only collocations dictionary in Spanish I know of. It was very avant-garde when it came out precisely because of that (an abridged version was published later). It's called Redes. Diccionario combinatorio del español contemporáneo.

Hope this helps.
Monday September 20th 2010, 1:50 PM
Comment by: christiane P. (Paris, Afghanistan)
Thanks for the article! This is all very new to me; I don't yet follow all of it, but I promise myself to go back and listen to it again. I have seen so many recordings to listen to, good for my information and my lessons. These aids are a gift.
Monday September 27th 2010, 10:27 PM
Comment by: James D. (Edmond, OK)
Thank you so much for bringing functional linguistics and SFL-based corpus linguistics into the discussion!

I'm certainly on board with SFL and have been for a good while.

