Exploring the pathways of our lexicon
Tracking Dialects on Twitter: What's Coo and What's Koo?
In last Sunday's New York Times, I wrote about how researchers are using Twitter to build huge linguistic datasets in order to answer all sorts of interesting analytical questions. Some are looking at the emotional responses of Libyans to unfolding events like the death of Qaddafi, while others are tracking the distribution of regional patterns in American English. This latter research area, Twitter dialectology, is just getting off the ground, but the results are already quite intriguing.
One study by researchers at Carnegie Mellon University received a fair amount of press attention earlier this year. For the Times article, I spoke to the lead researcher, Jacob Eisenstein, who co-authored a groundbreaking study (with Brendan O'Connor, Noah A. Smith, and Eric P. Xing). Eisenstein and his colleagues used Twitter's "streaming API" to download enormous quantities of text for filtering and analysis. For the non-techies, API stands for "application programming interface." Twitter's API has two settings: "Firehose," which streams all public messages, and "Gardenhose," a 10-percent sample of all messages. Even with the "Gardenhose" setting, there are still millions of tweets per day that can be collected by researchers.
Over the course of a week last year, the CMU team gathered 380,000 messages from 9,500 users, selecting messages from within the continental United States. They could pinpoint users' geographical coordinates by focusing only on messages that are geocoded, something that's possible to do on many mobile devices. After narrowing down the stream in this way, they took a look at about 5,000 different words from these messages, and determined that about a quarter of them couldn't be found in a spell-checking dictionary. If you've spent some time on Twitter, that proportion of non-dictionary words will seem about right: unusual slang, non-standard spellings, and online-only abbreviations are pretty much the norm for many users.
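For the technically curious, the filtering steps described above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the CMU team's actual code: the message layout (a dict with "text" and "coords" keys), the bounding box standing in for the continental United States, the tokenizer, and the choice to take the most frequent words are all hypothetical.

```python
import re
from collections import Counter

# Rough bounding box for the continental U.S. (an assumption for illustration;
# it also clips small parts of Canada and Mexico).
LAT_MIN, LAT_MAX = 24.5, 49.5
LON_MIN, LON_MAX = -125.0, -66.5


def in_continental_us(lat, lon):
    """True if the coordinates fall inside the rough bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def tokenize(text):
    """Lowercase the text, drop URLs, @mentions and #hashtags, keep word tokens."""
    cleaned = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    return re.findall(r"[a-z]+(?:'[a-z]+)?", cleaned)


def out_of_dictionary_share(messages, dictionary, top_n=5000):
    """Return (share, words) for the top_n most frequent words that are
    missing from `dictionary`, using only geocoded messages in the box.

    `messages` is an iterable of dicts with a "text" string and a "coords"
    value that is either None or a (latitude, longitude) pair
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for msg in messages:
        coords = msg.get("coords")
        if coords is None:
            continue  # not geocoded, so its location is unknown
        lat, lon = coords
        if not in_continental_us(lat, lon):
            continue
        counts.update(tokenize(msg["text"]))
    vocab = [word for word, _ in counts.most_common(top_n)]
    unknown = [word for word in vocab if word not in dictionary]
    share = len(unknown) / len(vocab) if vocab else 0.0
    return share, unknown
```

Given a real word list as `dictionary` (any set of lowercase words would do), the returned share is the proportion of frequent words a spell-checker would not recognize.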
Those non-standard written forms showed some interesting regional patterning. Spelling cool as coo or koo turns out to be a California thing. (The initial study reported that coo was clustered in southern California and koo in northern California. Later, though, Eisenstein ran the same test on a larger dataset and found the northern/southern distinction wasn't as prominent as they originally surmised, though both forms are distinctly Californian.) Suttin beats out sumthin as a non-standard spelling of something in the New York area, where one would also find uu outdoing yu as a way to write you. I briefly wondered if old-school New Yorkers pronounce uu as youse, but old-school New Yorkers probably don't frequent Twitter very much.
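One simple way to compare competing spellings region by region is to count how often each variant appears in a region and express it as a share of all the variants seen there. The sketch below does just that. It is a plain counting comparison assumed here for illustration, not a description of the statistical method used in the study, and the function name and the (region, token) input format are hypothetical.

```python
from collections import Counter, defaultdict


def variant_shares(tagged_tokens, variants):
    """Compute, per region, each variant's share among the competing variants.

    `tagged_tokens` is an iterable of (region, token) pairs and `variants`
    is a set of competing spellings, e.g. {"suttin", "sumthin"}.
    Regions where none of the variants occur are left out of the result.
    """
    counts = defaultdict(Counter)
    for region, token in tagged_tokens:
        if token in variants:
            counts[region][token] += 1

    shares = {}
    for region, region_counts in counts.items():
        total = sum(region_counts.values())
        shares[region] = {v: region_counts[v] / total for v in variants}
    return shares
```

A spelling that "beats out" its rival in a given area would show up here as the variant with the larger share for that region, though with small counts the shares can swing widely and should be read with caution.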
As research on Twitter dialects progresses, more research tools will likely become publicly available so that everyone can join in on the fun. One such tool called Lexicalist was created by the computational linguist David Bamman. Through an analysis of the Twitter stream, Lexicalist can create maps of the U.S. showing how usage varies from state to state. (You can read about how the project came together in a guest post by Bamman on Language Log.)
Lexicalist is a great start, but I'm hankering for something that drills down deeper than the state level. Take the word jawn, a hallmark of Philadelphia slang. I talked about jawn in an interview with AV Club Philadelphia in May; it's a variant of joint, from hip-hop slang, originally meaning "something good" but extended to refer to all sorts of people and things. And jawn also features prominently in a wonderful piece on regional Twitter slang by Maud Newton that appeared last Friday in The Awl. (It was a happy coincidence that Newton's Awl post and my Times article came out at about the same time, as she was able to take a longer and more personal look at issues I could only touch on briefly.) The map for jawn on Lexicalist shows the highest concentration in Delaware, not Pennsylvania, because little Delaware is entirely in the Philadelphia orbit, while much of central and western Pennsylvania isn't so Philly-centric and thus would be less likely to use jawn. So we'd need a more fine-grained analytical tool to get at patterns in different urban areas rather than just states.
And there are plenty of other research possibilities one could think of. On The Economist's Johnson blog, Robert Lane Greene suggested that Twitter tools might be useful in determining which parts of the U.S. are most susceptible to British lexical imports. And there's some new research going on now by Eisenstein, Bamman, and their colleague at Stanford University, Tyler Schnoebelen, on the role that gender plays in the variability of different forms on Twitter, including emoticons. (See my Language Log post for more.) We're still in the early days of Twitterology, but there are many fascinating prospects on the horizon.