Behind the Dictionary
Lexicographers Talk About Language
Inside the OED, Part 1: The Wisdom of Crowds
Ever wonder how work is done at the Oxford English Dictionary, the world's largest and most prestigious English-language dictionary project? We got the inside story from none other than Jesse Sheidlower, OED editor at large, who works on North American materials out of the dictionary's New York office. In the first installment of our three-part interview, Jesse explains how the OED's North American Reading Program operates. (Note the firmly American spelling of "Program"!) The reading programs (or programmes) have been radically transformed by the digital revolution, but at the same time they still follow the traditions set down 150 years ago by James Murray, the dictionary's first editor. As Jesse explains, the OED relied on "the wisdom of crowds" for the gathering of historical evidence long before the age of Wikipedia.
VT: One of your responsibilities is overseeing the North American Reading Program. Could you describe what that is and how it works?
JS: The OED is a historical dictionary, which means that for every sense of every word it contains quotations from chiefly written sources, showing how that word has been used over time. Originally the way that you would get these quotations, which are called citations, was that you simply read a wide variety of texts. And any time you come across an interesting word, you write it down on a slip of paper. You do this for enough years and read enough texts and you will eventually have a very, very large file with slips of paper in it that shows how the word has been used throughout its history. You take a file of these, you sort them into order, you divide them up into senses. And you have your dictionary there, based on the evidence that's in front of you.
In a way, it's one of the earliest collaborative projects, of the kind that Wikipedia and things like that are thought to be now: these books were read by a very large number of people, thousands of people spread all over the world. Readers would take books either that they were interested in reading or that were assigned to them that would illustrate some time period of English or a particular subject area. They would find the interesting words and send them in. So everyone was contributing the words that they found to the OED.
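The slip-based workflow Jesse describes (collect citations, sort them into order, divide them into senses) can be sketched as a small data structure. This is only an illustrative sketch, not the OED's actual system: the `Slip` class, the `arrange` function, the sense labels, and the two placeholder slips are all hypothetical. Only the 1979 quotation comes from the interview itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Slip:
    """One citation slip (hypothetical structure, for illustration only)."""
    headword: str
    year: int
    quotation: str
    source: str
    sense: str  # label assigned by an editor; the labels below are invented


def arrange(slips):
    """Sort slips chronologically, then divide them up by sense."""
    by_sense = defaultdict(list)
    for slip in sorted(slips, key=lambda s: s.year):
        by_sense[slip.sense].append(slip)
    return dict(by_sense)


slips = [
    Slip("so", 1979, "God, you're so the opposite.", "Manhattan (film)", "intensifier"),
    # The next two slips are invented placeholders, not real citations.
    Slip("so", 1850, "(placeholder quotation)", "(placeholder source)", "to such a degree"),
    Slip("so", 1700, "(placeholder quotation)", "(placeholder source)", "to such a degree"),
]

entry = arrange(slips)
for sense, cites in entry.items():
    print(sense, [c.year for c in cites])
```

Sorting before grouping means each sense's list is already in date order, so its first element is the earliest evidence on file for that sense.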
VT: Nowadays that would be called "crowdsourcing."
JS: Yes, and this process still goes on with the North American Reading Program and the OED's other reading programmes. There is one in the UK, one for world English, one for scientific sources, one devoted just to pre-1800 material. And each of them has slightly different goals. But in general the idea is that you're reading a lot of sources and trying to come up with interesting words.
One of the things that has changed over time is that 150 years ago, or even 25 years ago, the only way of assembling this kind of material was to pretty much read a text through and find examples and write them down. Now with the tremendous growth in online databases, it's very easy to find large numbers of good examples, even from published sources from just about any time period in English, by looking at online sources. So it's no longer necessary to have a reading program to find a word.
James Murray's classic example of the difficulties of reading programs is that people tend to notice unusual words and they don't notice usual words. So when he was first starting to edit, he noticed that in the files there were five examples of the word abuse but 50 examples of the word abusement. This does not mean that abusement is 10 times more common. It means that any time you come across a word that's unusual, of course you'll write it down. But it will never occur to you to write down a word like abuse because it's so common. So then when you're working on abuse, you have this problem where the evidence you have in front of you is not sufficient. In the old days, you'd have to either use a text-based concordance, such as was written for the Bible or Shakespeare or a very small number of other sources, or just read through and hope you can randomly find an example of this word from the time period you need. Now you can go online and punch up every example of abuse published in an English source in the entire eighteenth century, for instance.
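Murray's abuse/abusement problem is a sampling bias, and a toy simulation may make it concrete. Everything here is an assumption made for illustration: the corpus is invented, and `reader_slips` is a hypothetical model in which a reader always notes an unusual word but notes a common word only once in every `note_every` occurrences.

```python
from collections import Counter

# Invented toy corpus: "abuse" is ten times as frequent as "abusement".
corpus = ["abuse"] * 40 + ["abusement"] * 4 + ["the"] * 100


def reader_slips(tokens, common_words, note_every=20):
    """Hypothetical reader: unusual words are always written down,
    common words only once per `note_every` occurrences."""
    slips = []
    seen = Counter()
    for tok in tokens:
        seen[tok] += 1
        if tok not in common_words or seen[tok] % note_every == 0:
            slips.append(tok)
    return slips


slip_counts = Counter(reader_slips(corpus, common_words={"abuse", "the"}))
full_counts = Counter(corpus)

print("slips:    ", slip_counts["abuse"], "abuse vs", slip_counts["abusement"], "abusement")
print("full text:", full_counts["abuse"], "abuse vs", full_counts["abusement"], "abusement")
```

In this sketch the slip file holds more examples of abusement than of abuse, even though the underlying text has the opposite ratio. A full-text count, like the database search Jesse mentions, does not have that bias.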
VT: So what's the point of having a reading program then, if you can do all of this simply by searching online databases now?
JS: The nature of the reading program has changed over time. Reading in order to find a particular example of any given word is no longer that important a goal, because you can find these online. The things that you want now are, first of all, identifying new words, or in particular new senses. And this is something that's very hard to do online. Even if you know what you're looking for, it can be hard to find.
To take one example, there's the so-called "Gen X so," where the word so is used to emphasize words that typically don't allow for comparison — something like, "Blogging is so 2004," or, "You are so not going to discuss that with me." This is something where it's very hard to find examples online because even if you can imagine a frame in which this can appear, the word so is so common that you're either going to find extremely narrow things because you're required to search so narrowly, or you're going to miss things that are out there because you don't know to search for them or you can't easily construct a search for them. You get 10,000 examples and only one of them might be the thing you're interested in. A reading program person would identify this as a new sense. And the examples of so you have in a database will reflect this, rather than the 9,999 other examples that you're not interested in.
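The search problem described here can be sketched with two regular expressions. This is a rough illustration, not how OED editors actually search: the first three sentences are quoted in this interview, the other three are invented ordinary uses of so, and both patterns are assumptions chosen for the example.

```python
import re

sentences = [
    "Blogging is so 2004.",                           # "Gen X so"
    "You are so not going to discuss that with me.",  # "Gen X so"
    "God, you're so the opposite.",                   # "Gen X so"
    "I was so tired that I fell asleep.",             # invented ordinary use
    "She said so yesterday.",                         # invented ordinary use
    "It was not so bad.",                             # invented ordinary use
]

# Broad search: every occurrence of the word "so".
broad = re.compile(r"\bso\b", re.IGNORECASE)

# Narrow search: one imagined frame ("so not", "so 2004", "so '86").
narrow = re.compile(r"\bso\s+(?:not\b|\d{4}\b|'\d{2}\b)", re.IGNORECASE)

broad_hits = [s for s in sentences if broad.search(s)]
narrow_hits = [s for s in sentences if narrow.search(s)]

print(len(broad_hits), "broad hits;", len(narrow_hits), "narrow hits")
for s in narrow_hits:
    print("  ", s)
```

The broad pattern returns every sentence, including the ordinary uses. The narrow pattern returns only relevant examples but misses "so the opposite," which does not fit the imagined frame. That is the trade-off a human reader avoids by recognizing the new sense directly.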
VT: Just for the record, what's the earliest recorded use of "Gen X so"?
JS: I found an example from 1979, in Woody Allen's movie Manhattan: "'He's a big Bergman fan, you know?' 'Oh, please! God, you're so the opposite.'" But the canonical example is from the movie Heathers: "Grow up, Heather, bulimia's so '86."
VT: What are some of the new ways that the reading program is working to bring together collaborators in the search for citations?
JS: We have a project that we started a number of years ago devoted to science fiction terms, where volunteer moderators are running a website devoted to science fiction terms in the OED, with examples of the words and definitions and discussions of how they're used. Enthusiasts can add words to that, which will then be added to the OED's database and eventually either appear as part of OED entries or just be on the website for people to see.
This was the perfect example of a kind of field where people are extremely devoted to the subject and very knowledgeable about it. They're able to find examples of things that would be very hard to find without this kind of specialist knowledge, and they contribute commentary to it that would be very hard to do without lots of specialist research. It appears as a separate website, Science Fiction Citations. Many of the entries are now being used as part of the OED itself. And there's been a book published out of it called Brave New Words, which is effectively a historical glossary of science fiction terms taken from the website.
I think it's a good example of how this kind of work can benefit everyone involved. The people who are contributing get to have a website devoted to their subject that's very detailed and is shared for everyone to be able to see. The OED has extremely high-quality research that would be hard to get in any other way. The world at large has a free site, where they can see tons of information about the vocabulary of science fiction. And it seems like a win-win situation for everyone. So this sort of project is something that we've been thinking about expanding into other areas. The problem is, it is time-consuming to set up and you do need to have moderators to run it. But there's no reason why you couldn't have forums devoted to this and people actively discussing particular words and how they're used. Any sort of specialist area could benefit from such an approach.