Word Routes

Exploring the pathways of our lexicon

Tracking Dialects on Twitter: What's Coo and What's Koo?

In last Sunday's New York Times, I wrote about how researchers are using Twitter to build huge linguistic datasets in order to answer all sorts of interesting analytical questions. Some are looking at the emotional responses of Libyans to unfolding events like the death of Qaddafi, while others are tracking the distribution of regional patterns in American English. This latter research area, Twitter dialectology, is just getting off the ground, but the results are already quite intriguing.

One study by researchers at Carnegie Mellon University received a fair amount of press attention earlier this year. For the Times article, I spoke to the lead researcher, Jacob Eisenstein, who co-authored a groundbreaking study (with Brendan O'Connor, Noah A. Smith, and Eric P. Xing). Eisenstein and his colleagues used Twitter's "streaming API" to download enormous quantities of text for filtering and analysis. For the non-techies, API stands for "application programming interface." Twitter's API has two settings: "Firehose," which streams all public messages, and "Gardenhose," a 10-percent sample of all messages. Even with the "Gardenhose" setting, there are still millions of tweets per day that can be collected by researchers.

Over the course of a week last year, the CMU team gathered 380,000 messages from 9,500 users, selecting messages from within the continental United States. They could pinpoint users' geographical coordinates by focusing only on messages that are geocoded, something that's possible to do on many mobile devices. After narrowing down the stream in this way, they took a look at about 5,000 different words from these messages, and determined that about a quarter of them couldn't be found in a spell-checking dictionary. If you've spent some time on Twitter, that proportion of non-dictionary words will seem about right: unusual slang, non-standard spellings, and online-only abbreviations are pretty much the norm for many users.
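For readers who like to see the moving parts, here is a minimal Python sketch of the filtering steps just described: keep only geocoded messages inside the continental U.S., collect a vocabulary, and measure what share of it is missing from a dictionary. It is not the CMU team's code. The message format (a dict with "text" and "coords"), the bounding box, the tokenizer, and the choice to take the most frequent words as the vocabulary are all assumptions made for illustration.

```python
import re
from collections import Counter

# Rough bounding box for the continental U.S. (an assumption for this sketch,
# not a figure taken from the study).
LAT_MIN, LAT_MAX = 24.5, 49.5
LON_MIN, LON_MAX = -125.0, -66.9


def in_continental_us(lat, lon):
    """Return True if the coordinates fall inside the rough bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def non_dictionary_share(messages, dictionary, top_n=5000):
    """Return (share, words) for vocabulary items missing from `dictionary`.

    `messages` is a hypothetical list of dicts with "text" and "coords" keys,
    where "coords" is a (lat, lon) tuple or None for messages without a geotag.
    """
    counts = Counter()
    for msg in messages:
        coords = msg.get("coords")
        if coords is None:
            continue  # skip messages that are not geocoded
        lat, lon = coords
        if not in_continental_us(lat, lon):
            continue  # skip messages sent from outside the bounding box
        counts.update(tokenize(msg["text"]))

    # Assumption: the vocabulary is the top_n most frequent word types.
    vocab = [word for word, _ in counts.most_common(top_n)]
    if not vocab:
        return 0.0, []
    unknown = [word for word in vocab if word not in dictionary]
    return len(unknown) / len(vocab), unknown
```

Given a real word list as `dictionary`, the first value returned is the figure of interest: the fraction of the sampled vocabulary that a spell-checker would not recognize.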

Those non-standard written forms showed some interesting regional patterning. Spelling cool as coo or koo turns out to be a California thing. (The initial study reported that coo was clustered in southern California and koo in northern California. Later, though, Eisenstein ran the same test on a larger dataset and found the northern/southern distinction wasn't as prominent as they originally surmised, though both forms are distinctly Californian.) Suttin beats out sumthin as a non-standard spelling of something in the New York area, where one would also find uu outdoing yu as a way to write you. I briefly wondered if old-school New Yorkers pronounce uu as youse, but old-school New Yorkers probably don't frequent Twitter very much.
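The kind of comparison behind claims like "suttin beats out sumthin in the New York area" can be sketched as a simple tally of competing spellings per region. This is only a raw-count illustration under stated assumptions, not the statistical method the researchers used: the (region, text) input format and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def variant_counts_by_region(messages, variants):
    """Count how often each spelling in `variants` appears in each region.

    `messages` is a hypothetical iterable of (region, text) pairs.
    """
    wanted = set(variants)
    table = defaultdict(Counter)
    for region, text in messages:
        for token in text.lower().split():
            if token in wanted:
                table[region][token] += 1
    return table


def leading_variant(table, region):
    """Return the most frequent variant in `region`, or None if none was seen."""
    counts = table.get(region)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With a handful of invented messages, usage might look like this:

```python
sample = [
    ("New York", "suttin is wrong"),
    ("New York", "suttin else"),
    ("New York", "sumthin new"),
    ("California", "that is coo"),
]
table = variant_counts_by_region(sample, ["suttin", "sumthin", "coo", "koo"])
print(leading_variant(table, "New York"))
```

Raw counts like these would need to be normalized by how much each region tweets before anyone drew conclusions from them.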

As research on Twitter dialects progresses, more research tools will likely become publicly available so that everyone can join in on the fun. One such tool called Lexicalist was created by the computational linguist David Bamman. Through an analysis of the Twitter stream, Lexicalist can create maps of the U.S. showing how usage varies from state to state. (You can read about how the project came together in a guest post by Bamman on Language Log.)

Lexicalist is a great start, but I'm hankering for something that drills down deeper than the state level. Take the word jawn, a hallmark of Philadelphia slang. I talked about jawn in an interview with AV Club Philadelphia in May; it's a variant of joint, from hip-hop slang, originally meaning "something good" but extended to refer to all sorts of people and things. And jawn also features prominently in a wonderful piece on regional Twitter slang by Maud Newton that appeared last Friday in The Awl. (It was a happy coincidence that Newton's Awl post and my Times article came out at about the same time, as she was able to take a longer and more personal look at issues I could only touch on briefly.) The map for jawn on Lexicalist shows the highest concentration in Delaware, not Pennsylvania, because little Delaware is entirely in the Philadelphia orbit, while much of central and western Pennsylvania isn't so Philly-centric and thus would be less likely to use jawn. So we'd need a more fine-grained analytical tool to get at patterns in different urban areas rather than just states.
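One way to picture a finer-grained tool is to group geocoded messages into small latitude/longitude grid cells instead of states, and compute a word's usage rate per cell. The sketch below is an illustration only and says nothing about how Lexicalist works. The (lat, lon, text) input format, the half-degree cell size, and the function names are assumptions.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_degrees=0.5):
    """Map coordinates to a grid cell identified by a pair of integers."""
    return (math.floor(lat / cell_degrees), math.floor(lon / cell_degrees))


def usage_rate_by_cell(messages, word, cell_degrees=0.5):
    """Return {cell: share of that cell's messages containing `word`}.

    `messages` is a hypothetical iterable of (lat, lon, text) tuples.
    """
    totals = Counter()
    hits = Counter()
    for lat, lon, text in messages:
        cell = grid_cell(lat, lon, cell_degrees)
        totals[cell] += 1
        if word in text.lower().split():
            hits[cell] += 1
    # Cells with no matching messages get a rate of 0.0.
    return {cell: hits[cell] / totals[cell] for cell in totals}
```

With cells this small, Philadelphia and Pittsburgh land in different cells, so a word concentrated around one city is not averaged away across the whole state.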

And there are plenty of other research possibilities one could think of. On The Economist's Johnson blog, Robert Lane Greene suggested that Twitter tools might be useful in determining which parts of the U.S. are most susceptible to British lexical imports. And there's some new research going on now by Eisenstein, Bamman, and their colleague at Stanford University, Tyler Schnoebelen, on the role that gender plays in the variability of different forms on Twitter, including emoticons. (See my Language Log post for more.) We're still in the early days of Twitterology, but there are many fascinating prospects on the horizon.


Ben Zimmer is executive editor of Vocabulary.com and the Visual Thesaurus. He is language columnist for The Wall Street Journal and former language columnist for The Boston Globe and The New York Times Magazine. He has worked as editor for American dictionaries at Oxford University Press and as a consultant to the Oxford English Dictionary. In addition to his regular "Word Routes" column here, he contributes to the group weblog Language Log. He is also the chair of the New Words Committee of the American Dialect Society.

Comments from our users:

Friday November 4th 2011, 8:45 PM
Comment by: Ellen M.
I'm from Chicago, but have lived in the Bay Area for nearly 30 years.
The correct Twitter/text spelling of youse is uz.
You're welcome.
Saturday November 5th 2011, 7:15 AM
Comment by: Ellis D. (London United Kingdom)
I am a surgeon who has practiced in London UK for more than 50 years. I have listened to London English vary almost from street to street (Shaw's Pygmalion!) and year to year. I was interested in the Californian 'Coo' as I have heard the terminal 'L' disappear from the speech of most areas in London.
Thus: A meal>miaow>mee' and St Paul's>St Pow's.
This is speech, of course, rather than writing.

My question is: do the Californians actually also drop the 'L' and say 'coo' when they speak, or only text it?

Ellis
Sunday November 6th 2011, 8:10 AM
Comment by: Ben Zimmer (New York, NY), Visual Thesaurus Contributor, Visual Thesaurus Moderator
Ellis: The phenomenon you're describing is known as l-vocalization. In the U.S., it shows up frequently in the mid-Atlantic States and in African American Vernacular English (see Ben Trawick-Smith's post here for more). In parts of California, l-vocalization extends beyond AAVE speakers -- Lauren Hall-Lew found that it was common among Asian Americans in San Francisco (PDF of a poster presentation here).
Sunday November 6th 2011, 10:11 AM
Comment by: Roger Dee (Haslett, MI), Top 10 Commenter
Of greatest interest to me. Even in the 1940s, the way you said the "og" (frog, fog, etc.) immediately cast your regional origin as the Detroit/Toronto axis or not, and along with it the cultural opprobrium that might or might not be associated with it.
