Word Count

Writers Talk About Writing

Halt, wHo gOEs tHErE?

Ever wonder what those squiggly words are that you have to spell in order to get past security on many websites? They're called CAPTCHAs, and Mike Pope, a technical writer and editor at Microsoft, has the full story on them.

Odds are good, I'd venture, that you've tried to register for a website or enter a comment somewhere, only to be stymied by something like this:

If you're like me, you might have wondered whether they have to make it that hard to read. Well... yes. In fact, for this example you see these wonky words precisely because they're hard to read.

Let's back up. This challenge is generically referred to as a CAPTCHA. It's supposed to ensure that someone using a website (registering or commenting, say) is a person and not an automated process — a bot. For example, a spammer might use bots to try to sign up for many email accounts in order to send spam, or to leave blog comments that direct readers to the spammer's website. CAPTCHAs are supposed to thwart this.

The term CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart," coined at Carnegie Mellon University. It's helpful to know that a Turing test derives from Alan Turing's philosophical work in artificial intelligence and his Imitation Game, in which someone tries to guess whether they're interacting with a human or a machine. (Technically, CAPTCHA could just be CAPT, since "to tell Computers and Humans Apart" is the point of a Turing test. But the term CAPTCHA was intended to invoke the word capture, so it's something of a backronym.) Interestingly, the university originally wanted to trademark the term, but has given up on that.

Speaking linguistically, most CAPTCHAs rely on people's ability to read even highly distorted text. (Consider our remarkable ability to read the vast range of people's handwriting and their idiosyncratic letterforms.) CAPTCHAs use many techniques to distort text in order to disguise it. Here are a couple more examples:

The primary reason why CAPTCHA text is so distorted is that spamming software is pretty good at reading text, even in images. (For example, a page on the web describes the success that some academics have had in devising CAPTCHA-cracking software.)

The example I have at the beginning is a special type of CAPTCHA known as a reCAPTCHA. In a reCAPTCHA, the reason for the distortion is, in effect, the opposite of the usual reason. In reCAPTCHAs, the text isn't distorted so that optical character recognition (OCR) software can't read it; it's distorted because OCR software can't read it.

Basically, the reCAPTCHA project is helping scan old books and newspapers. The story is this: Google has been steadily working to digitize old texts. During the scanning process, words are identified that might not have been scanned correctly. (In older texts, for example, up to 20% of the words might be misread during scanning.)

The reCAPTCHA software picks words from among these "suspicious" terms and presents them (after some additional protective distortion) as part of a CAPTCHA, as you can see in the example earlier. The test contains a control word (in the example, "any"), which users have to get right; the other is an unknown term that users are helping to decipher. The same unknown words are presented to multiple users, and if several humans agree on a word, the word is marked as known in the database.
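For the programmers in the audience, the control-word-plus-voting idea described above can be sketched in a few lines of Python. This is only an illustration, not reCAPTCHA's actual code: the names UnknownWord and check_challenge are made up for this example, and the threshold of three matching answers is my assumption, since "several humans" is as specific as the description gets.

```python
from collections import Counter

# Assumption: "several" is taken to mean 3 matching answers.
AGREEMENT_THRESHOLD = 3


class UnknownWord:
    """A scanned word that the OCR software could not read with confidence."""

    def __init__(self):
        self.votes = Counter()  # transcription -> number of users who typed it
        self.known_as = None    # filled in once enough users agree

    def record(self, guess):
        """Count one user's transcription and mark the word known on agreement."""
        self.votes[guess] += 1
        if self.known_as is None and self.votes[guess] >= AGREEMENT_THRESHOLD:
            self.known_as = guess


def check_challenge(control_answer, control_guess, unknown, unknown_guess):
    """Pass the user only if the control word is right.

    The transcription of the unknown word is counted only for users who
    pass, so answers from bots (or careless typists) do not become votes.
    """
    passed = control_guess.strip().lower() == control_answer.lower()
    if passed:
        unknown.record(unknown_guess.strip().lower())
    return passed


# Four users see the control word "any" next to the same unknown word.
word = UnknownWord()
for guess in ["upon", "upon", "up0n", "upon"]:
    check_challenge("any", "any", word, guess)
print(word.known_as)  # upon
```

The design choice worth noticing is that the control word does double duty: it keeps the bots out, and it also decides whose reading of the unknown word is worth counting.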

It's an ingenious way to harness humans' ability to decipher text in the service of what's generally considered a good cause. According to the reCAPTCHA people, after one year, the project had deciphered 440 million terms, and one number I found suggests that reCAPTCHA presents 60 million words to users every day. (You can read more technical details about the reCAPTCHA process in a paper (PDF) that was published in Science magazine back in 2008.)

Text-based CAPTCHAs are not without their problems, however. The most obvious one is that although we readers have astounding abilities to make out text, sometimes the text is simply too distorted. To solve this problem, most CAPTCHAs let you click a button to get a new word.

Another problem is that text-based CAPTCHAs are not very accessible – that is, they're not usable by people with sight disabilities. (Government entities are generally obliged to create websites that meet strict accessibility standards.) A common workaround is to offer an audio alternative. For example, reCAPTCHA will read four words out loud, with background noise as a kind of audio version of the text distortion, and you pass the test if you type the words correctly.

There's also internationalization. Many text-based CAPTCHAs have an English bias. In practice, this is not a huge problem, because the challenge words are for all intents and purposes random characters. Still, non-English users do have slightly more trouble with distorted English words than native speakers. And effectiveness aside, website owners do occasionally ask whether the words being presented by CAPTCHAs could be in their own language. (For reCAPTCHA and its database of scanned words, the answer is no.)

A less serious problem is that any randomly chosen sequence of letters or words will sooner or later produce terms that range from amusing to suggestive to downright offensive. (I once was presented with the words "continuing waffling," which I decided not to take as commentary on my personality.) If your sense of humor inclines that way, you might want to search the web for "funny captcha fail," which reveals a number of websites devoted to tittering at poor choices for CAPTCHA text.

Text is the most common basis for CAPTCHAs, but it's not the only one. There are CAPTCHAs based on simple arithmetic tests, on matching text to pictures (same problem with accessibility), on identifying pictures of dogs and cats, on using the mouse to draw, and many more.

I did find that understanding the purpose and challenges for CAPTCHAs, and especially the work being done in the reCAPTCHA project, has made me a lot more tolerant of the wacky text that I am asked to transcribe. Now whenever I hear someone complain about how hard the CAPTCHA is, I respond with "Say, do you know about this digitizing project? Let me explain..."

Mike Pope has been a technical writer and editor for nearly 30 years. He has worked at Microsoft and Amazon, and currently works at Tableau Software. You can read more at Mike's Web Log and Evolving English II.
