A Monthly Column for Word Lovers
Pride and Prejudice and Natural Language Processing
Everyone in the Language Lounge is a Jane Austen fan. I make a point of rereading one of her novels every year, something I have never considered worth the time for any other author. I've written about her a couple of times in earlier columns. Recently I've been delighted to discover that many others are improving their acquaintance with Jane, via a field of endeavor in which I have spent a fair amount of time: natural language processing (NLP), the use of computers to analyze text.
Nineteenth-century writings hold a couple of obvious charms for NLP researchers: one is that so much of the work is now long out of copyright, and thus free from the cumbersome business of obtaining permissions to do anything with it. Many 19th-century texts are also widely digitized already, and so available in multiple machine-readable formats. Among her contemporaries, Austen's oeuvre is also attractive for its compact size; she completed only six novels in her sadly short writing life.
As I read Jane Austen, the question that is ever in the back of my mind is, how did she do it? How did she write novels two hundred years ago that today seem as fresh as the day they were written, and that still deeply engage audiences whose lives are necessarily incomparably different from those of the genteel English folk of her day? Surprisingly, computers are quite helpful in discovering some of the aspects of Austen's writing that make it distinctive in the wide field of English fiction, and it is a pleasure for me to see that computational linguists have tuned their engines to look into questions like this.
A project at the University of Nebraska called Austen Said looks at patterns of diction and lexicon in Austen's novels. The project began with a question similar to one you might ask yourself: are there patterns of language in Austen's novels that are distinct from what you find among her contemporaries or elsewhere later in the genre of English fiction? This project, by the use of careful hand-coding of the text and some algorithmic classification, has turned up a trove of discoveries. Some of these, once you see them, seem quite intuitive but would be difficult to arrive at definitively by any means other than computational. For example: Mr. Darcy, in Pride and Prejudice, utters a number of words that no other character uses, including accusations, carelessness, defiance, faithful, indirect, lessen, liberally, meanly, and purposely. If you are among the many who swoon just at the mention of his name, perhaps you have been charmed by his high-flown speech. Another interesting tidbit is that in classifying the type of characters in the six novels, researchers found that "there is one cad in each novel" and "there are more fools than any other sort of character."
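For the curious, the idea of finding words that only one character utters can be sketched in a few lines of Python. This is a toy illustration, not the Austen Said project's method: it assumes the hard part, attributing each line of dialogue to its speaker, has already been done, and the function name `words_unique_to` and the sample lines are invented for the example.

```python
import re
from collections import defaultdict

def words_unique_to(speaker, utterances):
    """Return, in alphabetical order, the words that only `speaker` uses.

    `utterances` is a list of (speaker_name, text) pairs. Attributing
    dialogue to speakers is assumed to have been done beforehand.
    """
    vocab = defaultdict(set)
    for name, text in utterances:
        # Lowercase and keep runs of letters and apostrophes as words.
        vocab[name].update(re.findall(r"[a-z']+", text.lower()))

    # Pool the vocabulary of everyone except the chosen speaker.
    others = set()
    for name, words in vocab.items():
        if name != speaker:
            others |= words

    return sorted(vocab[speaker] - others)

# Made-up lines for illustration only; these are not quotations from the novel.
sample = [
    ("Darcy", "I will not lessen my defiance"),
    ("Elizabeth", "I will not hear of it"),
    ("Darcy", "You speak meanly of it"),
]
print(words_unique_to("Darcy", sample))
```

On a real novel the interesting work lies in the speaker attribution, which is where the project's careful hand-coding comes in; the set arithmetic itself is the easy part.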
One of Austen's great contributions is her development of free indirect discourse (FID), in which she "renders not merely the point of view of a given character, but gives the flavor of a character's speech or thought." As an example, in Emma, a meeting between Emma and Jane Fairfax turns out to be unusually vexing for Emma and the third-person narrator (ubiquitous in Austen's novels) observes that "this amiable, upright, perfect Jane Fairfax was apparently cherishing very reprehensible feelings." But the reader knows it is not the narrator who so characterizes Jane Fairfax; it is Emma. This very economical technique is rare in fiction before Austen, but it abounds in her novels. In the words of the project's researchers: "Austen's discovery of what FID could do was comparable in the history of the novel to the discovery of the atomic bomb in the history of warfare; thereafter, things were never the same, and FID became a basic feature of the novel as genre."
Julia Silge, a data scientist at Stack Overflow, who has "a PhD in astrophysics and an abiding love for Jane Austen," has directed her wizardry at compiling a database of Austen's novels for use by computational linguists, and she's run a number of interesting analyses of her own. Among them is a study of bigrams (two-word sequences) in the novels, looking specifically at contrasts between words (most often verbs) that follow the pronouns he and she. The thesis is that this might give a clue to differences in gender roles. A number of words occur with about equal probability after the pronouns, but there are some standouts, as shown in the graph below.
Silge writes, "Women in Austen's novels do things like remember, read, feel, resolve, long, hear, dare, and cry. Men, on the other hand, in these novels do things like stop, take, reply, come, marry, and know. Women in Austen's world can be funny and smart and unconventional, but she plays with these ideas within a cultural context where they act out gendered roles." When set out in such clear writing, this may again seem like an obvious conclusion about Austen, but I think it is also part of her great appeal. Austen's women are severely confined to the roles that their culture prescribed for them, but they rarely seem terribly oppressed by these restraints. At a time when few gave a thought to the notion of equal rights or gender parity, Austen's heroines often find a way to triumph, even while enjoying far fewer rights and privileges than their male counterparts.
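The kind of bigram comparison described above can be sketched roughly as follows. This is not Silge's code; it is a minimal Python illustration that assumes a crude tokenizer, ignores sentence boundaries, and uses add-one smoothing so that a word seen after only one pronoun doesn't produce a division by zero. The function names and the sample text are invented for the example.

```python
import math
import re
from collections import Counter

def words_after_pronouns(text):
    """Count the word that immediately follows each 'he' and 'she'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    after = {"he": Counter(), "she": Counter()}
    # Walk over adjacent token pairs (bigrams); sentence breaks are ignored.
    for first, second in zip(tokens, tokens[1:]):
        if first in after:
            after[first][second] += 1
    return after

def log_ratio(after, word):
    """Smoothed log2 ratio of how often `word` follows 'she' versus 'he'.

    Positive values lean toward 'she', negative toward 'he'.
    The add-one smoothing is an assumption made for this sketch.
    """
    she_total = sum(after["she"].values())
    he_total = sum(after["he"].values())
    she_rate = (after["she"][word] + 1) / (she_total + 1)
    he_rate = (after["he"][word] + 1) / (he_total + 1)
    return math.log2(she_rate / he_rate)

# Made-up text for illustration only; not drawn from the novels.
toy = "She remembered the letter. He stopped at the door. She felt it. He replied. She remembered."
counts = words_after_pronouns(toy)
print(counts["she"].most_common())
print(log_ratio(counts, "remembered") > 0)  # leans toward 'she'
print(log_ratio(counts, "stopped") < 0)     # leans toward 'he'
```

Run over the full text of the six novels rather than a toy sentence, sorting words by this ratio would surface the standouts at either end, while words used about equally after both pronouns would cluster near zero.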
For me the most telling look at Austen's work came last year in an analysis by Kathleen Flynn and Josh Katz, published in a New York Times article called The Word Choices That Explain Why Jane Austen Endures. Using an annotated database of English novels published between 1710 and 1920, they analyzed the books' lexicons and then looked at the statistical patterns that emerged, based on semantic categories assigned to words in the novels. Here's a snapshot of the lexicons of English novels, contrasting their semantic content along two axes of word meaning:
I've drawn a box around Austen's six novels to help accentuate how much of an outlier she was, and surely still is, in the language of the novel. She writes somewhat abstractly about emotions in a way that, as the graphic shows, is completely different from that of any writer before or after her. This approach is, to my mind, extremely efficient and compact: she manages to say a lot with few words about something that we all spend a huge amount of time doing, namely experiencing our emotions. And she does it with a sharp analytical intelligence that we can rarely bring to our own relationships or to our own feelings. I think this kind of writing is best exemplified in passages like this one, from the end of a chapter in Sense and Sensibility:
As for Marianne, on the pangs which so unhappy a meeting must already have given her, and on those still more severe which might await her in its probable consequence, she could not reflect without the deepest concern. Her own situation gained in the comparison; for while she could ESTEEM Edward as much as ever, however they might be divided in future, her mind might be always supported. But every circumstance that could embitter such an evil seemed uniting to heighten the misery of Marianne in a final separation from Willoughby—in an immediate and irreconcilable rupture with him.
While I know that this is exactly the kind of passage that makes some readers' eyes glaze over, or sends them running for the exits, I think that it exemplifies the true genius of Austen: her insightful intelligence and her adroitness with English grammar and vocabulary, combined with her thorough understanding of and sympathy with her characters' predicaments and their responses to them, enable her to bring them to life in a way that no other writer has been able to duplicate.