Word Count

Writers Talk About Writing

Data-Based: The Jargon of Big Data

One aspect of the computing world that we're all deeply involved in (whether we realize it or not) is the specialized field of databases. In this article, I thought it might be fun to look at the terminology of that involvement from the database's point of view.

The term database itself is a relative newcomer to English; the first OED citation is from 1962. (Of course, the idea of storing data in a structured way goes back eons.) Nowadays we spell the term as one word, but it was originally two: data base (think customer base or knowledge base). The OED traces the term's transformation from two words to the hyphenated data-base (ca. 1974) to a single word by the mid-1980s.

In a database, information is stored in tables that have rows (or records) and columns (or fields). You can think of a spreadsheet as a (very) simple kind of database. Each record in a table generally has a primary key that uniquely identifies that record. You personally are represented by many, many primary keys in databases around the world: your Social Security number, bank account numbers, insurance numbers, credit-card numbers, and so on. Here's an example of another primary key (in both computer- and human-readable formats) that you see on almost everything you can buy in a store:

To you, a UPC. To the database, a primary key.
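
For readers who like to see the idea in action, here is a minimal sketch of a primary key at work. It assumes Python's built-in sqlite3 module and a made-up products table; the table name, the product names, and the helper add_product are all hypothetical, and the UPC digits are sample values only.

```python
import sqlite3

# Hypothetical table and sample values, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (upc TEXT PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products VALUES ('012345678905', 'Canned soup')")

def add_product(upc, name):
    """Try to insert a product; report whether the key was accepted."""
    try:
        conn.execute("INSERT INTO products VALUES (?, ?)", (upc, name))
        return True
    except sqlite3.IntegrityError:
        # A primary key must be unique, so a repeated UPC is rejected.
        return False

first = add_product("036000291452", "Tissues")
duplicate = add_product("036000291452", "Some other item")
product_count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

The second attempt fails because the database refuses to let two records share one key, which is exactly what makes the key useful for identifying a record.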

You've undoubtedly filled in forms that have a little box for each letter or number, like this example from a United States W-9 form:

Examples of fixed-length fields.

This is a clue that the information is going into a fixed-length field in the database. In the old days when storage was a major cost factor, data could only be stored in fixed-length fields. You might still see a remnant of that in postal mail or on an airline boarding pass where your name or address is oddly abbreviated or even truncated. Today databases generally allow variable-length fields, and we don't see many forms any more where there aren't enough little boxes for our name or address.
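
As a rough illustration of why names get truncated, here is a small Python sketch of what squeezing a value into a fixed-length field amounts to. The width of 8 and the function name to_fixed_width are assumptions made up for this example, not taken from any real system.

```python
NAME_WIDTH = 8  # hypothetical field width, chosen only for this example

def to_fixed_width(value, width=NAME_WIDTH):
    """Cut the value to the field width, then pad short values with spaces."""
    return value[:width].ljust(width)

short_name = to_fixed_width("Ann")          # padded out to 8 characters
long_name = to_fixed_width("Christopher")   # cut off after 8 characters
```

Every stored value ends up exactly the same length, which is simple to store but unkind to anyone with a long name.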

Most databases in the last 40 years have been relational databases. In this arrangement, the data is normalized, meaning that it's broken out into separate tables (linked to one another by their primary keys) to eliminate redundancy. Almost any time you see something coded in a form (medical diagnostic codes, lists of occupations, even age groups in a survey), you're seeing evidence of a normalized database, where each code is an entry in a relational table. This is also why you are limited, sometimes frustratingly, to just the choices that the form provides.
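
Here is one way normalization might look in practice, sketched with Python's built-in sqlite3 module. The occupations and people tables, their columns, and the names in them are invented for illustration. Each person stores only an occupation code, and the title lives once in its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign-key links when asked to.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical lookup table: each occupation title is stored exactly once.
conn.execute("CREATE TABLE occupations (code INTEGER PRIMARY KEY, title TEXT)")
conn.execute("""
    CREATE TABLE people (
        id INTEGER PRIMARY KEY,
        name TEXT,
        occupation_code INTEGER REFERENCES occupations(code)
    )
""")
conn.executemany("INSERT INTO occupations VALUES (?, ?)",
                 [(1, "Editor"), (2, "Programmer")])
conn.executemany("INSERT INTO people (name, occupation_code) VALUES (?, ?)",
                 [("Ana", 1), ("Ben", 2), ("Cy", 1)])

# Joining the tables puts the human-readable title back next to each name.
rows = conn.execute("""
    SELECT p.name, o.title
    FROM people p JOIN occupations o ON o.code = p.occupation_code
    ORDER BY p.name
""").fetchall()

# A code that is not in the lookup table is refused.
try:
    conn.execute("INSERT INTO people (name, occupation_code) VALUES ('Dee', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The refused insert at the end is the database-side version of a form that offers only a fixed list of choices.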

The primary functions for a database go by the acronym CRUD: create, read, update, and delete. (This term gives some editors the willies, as you can imagine.) Most database work consists of these CRUD operations over and over. When you buy something with a credit card or register at a website, a database somewhere creates a new record. When you view a catalog online, the database reads data in order to display it. If you change a password, the database performs an update. If you unsubscribe from a mailing list, a database performs a delete operation (we hope) to remove you from that database.
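
The four CRUD operations map neatly onto four SQL statements. The sketch below assumes Python's built-in sqlite3 module and a hypothetical subscribers table; the email address and the placeholder "hash" strings are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE subscribers (email TEXT PRIMARY KEY, password_hash TEXT)")

# Create: registering adds a new record.
conn.execute("INSERT INTO subscribers VALUES (?, ?)", ("pat@example.com", "hash-v1"))

# Read: viewing data fetches records without changing them.
after_create = conn.execute("SELECT email, password_hash FROM subscribers").fetchall()

# Update: changing a password modifies the existing record.
conn.execute("UPDATE subscribers SET password_hash = ? WHERE email = ?",
             ("hash-v2", "pat@example.com"))
after_update = conn.execute("SELECT password_hash FROM subscribers WHERE email = ?",
                            ("pat@example.com",)).fetchone()[0]

# Delete: unsubscribing removes the record.
conn.execute("DELETE FROM subscribers WHERE email = ?", ("pat@example.com",))
after_delete = conn.execute("SELECT COUNT(*) FROM subscribers").fetchone()[0]
```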

These terms have undergone some novel changes in the lexicon of databases. Parallelism with creation, selection, and deletion has spawned the noun updation, even though the word update already fulfills this role. And people have invented the term upsert for the common case of "update the record if it exists, otherwise create (insert) it."
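
An upsert can be sketched by hand as "try the update, and insert only if nothing was updated." The version below assumes Python's built-in sqlite3 module and a hypothetical settings table; the function name upsert is my own. Many database systems offer their own single-statement syntax for the same idea, which this sketch does not use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute("CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT)")

def upsert(conn, key, value):
    """Update the record if it exists, otherwise insert it."""
    cur = conn.execute("UPDATE settings SET value = ? WHERE key = ?", (value, key))
    if cur.rowcount == 0:
        # No existing row was changed, so the record must be new.
        conn.execute("INSERT INTO settings (key, value) VALUES (?, ?)", (key, value))
        return "inserted"
    return "updated"

first_call = upsert(conn, "theme", "light")
second_call = upsert(conn, "theme", "dark")
stored = conn.execute("SELECT key, value FROM settings").fetchall()
```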

If you order tickets online, you don't want someone to buy seats out from under you while you ponder. This is the database problem of concurrency control, or making sure that multiple users don't try to change the same record at the same time. One approach is optimistic concurrency. Here, the database lets anyone view records. Only when an update is submitted does the database check whether the record has been changed. If you've ever ordered something only to discover during check-out that the item has gone out of stock, you might have experienced the downside of optimistic concurrency. The alternative approach is pessimistic concurrency, where the database locks a record so that no one else can touch it until it's explicitly unlocked. As you might guess, the ticket-site model is pessimistic, given the high likelihood of collisions between customers all hoping for the same seats. And that's why you have a limited amount of time to make your choice and check out — to minimize the time that a record is locked.
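
One common way to implement optimistic concurrency is to keep a version number on each record and to accept an update only if the version is still the one the user originally saw. The sketch below assumes that approach, Python's built-in sqlite3 module, and a hypothetical items table with one unit in stock; the shoppers and function names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one item with a single unit in stock, at version 0.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 1, 0)")

def read_item(item_id):
    """Anyone may look; nothing is locked."""
    return conn.execute("SELECT stock, version FROM items WHERE id = ?",
                        (item_id,)).fetchone()

def buy_one(item_id, stock_seen, version_seen):
    """Succeed only if the record is unchanged since it was read."""
    cur = conn.execute(
        "UPDATE items SET stock = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (stock_seen - 1, item_id, version_seen))
    return cur.rowcount == 1

# Two shoppers view the same record before either one checks out.
alice_view = read_item(1)
bob_view = read_item(1)

alice_ok = buy_one(1, *alice_view)  # first update wins
bob_ok = buy_one(1, *bob_view)      # version has moved on, so this one fails
final_stock = read_item(1)[0]
```

The second shopper's failed update is the out-of-stock surprise at check-out. A pessimistic design would instead have locked the record when the first shopper started, so the second would have had to wait.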

Another problem that databases have to solve is that of atomicity. A classic example is transferring money from one bank account to another. This requires two database updates (one debit, one credit); obviously, you don't want the debit to succeed but the credit to fail. To ensure this, databases "wrap" the updates in a transaction, which is a set of operations that succeed or fail together. If all the operations succeed, the database performs a commit, which makes the changes permanent. If any constituent operation fails, the database performs a rollback. Think about ordering something from a third-party vendor via Amazon, and you get a sense of the complexity of distributed transaction processing, where your order involves updates in different databases on different computers belonging to different companies.
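
The bank-transfer case can be sketched as follows, assuming Python's built-in sqlite3 module and a hypothetical accounts table with a rule that balances cannot go negative. The account names, amounts, and the function transfer are invented. The credit is deliberately done first so that, when the debit fails, you can see the rollback undo the credit too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the CHECK rule makes an overdraft fail.
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("checking", 100), ("savings", 0)])
conn.commit()

def transfer(src, dst, amount):
    """Credit and debit succeed together or fail together."""
    try:
        with conn:  # commits if the block finishes, rolls back if it raises
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))

small_ok = transfer("checking", "savings", 30)
after_small = balances()
large_ok = transfer("checking", "savings", 500)  # would overdraw checking
after_large = balances()
```

After the failed transfer, both balances are exactly what they were before it started, even though the credit had briefly gone through.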

These days, a term on the minds of many computer people is big data, which refers to the enormous quantity of diverse data that we collectively generate (one source estimates 2.5 quintillion bytes a day). Given how much we all — individuals and corporations and governments — now want to store, you can imagine some of the challenges involved in capturing, managing, and analyzing this big data.

Naturally, there is a great deal more to databases. Even experienced programmers typically defer database work to a database administrator (DBA), whose specialized job involves the intricacies of not just creating databases, but indexing for improved performance, mirroring (real-time duplication of data for safety), scalability, and other critical tasks to keep everything running smoothly.
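
To give a flavor of just one of those tasks, here is a small sketch of indexing, assuming Python's built-in sqlite3 module and a hypothetical customers table filled with made-up rows. It asks SQLite how it plans to run the same lookup before and after an index is created. The exact wording of the plan varies between SQLite versions, so the sketch only collects the text rather than relying on a particular phrasing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and generated sample rows, for illustration only.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany("INSERT INTO customers (email, name) VALUES (?, ?)",
                 [(f"user{i}@example.com", f"User {i}") for i in range(1000)])

def query_plan():
    """Return SQLite's description of how it would run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM customers WHERE email = ?",
        ("user500@example.com",)).fetchall()
    # The last column of each row holds the human-readable detail.
    return " ".join(row[-1] for row in rows)

plan_before = query_plan()
conn.execute("CREATE INDEX idx_customers_email ON customers (email)")
plan_after = query_plan()
```

Once the index exists, the plan mentions idx_customers_email, meaning the database can jump to the matching record instead of reading every row.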

Give a thought to all this the next time you surf the web, order things online, or fill in a form. Chances are good that there's a database waiting back there, ready to translate your action into its own terms.

Mike Pope has been a technical writer and editor for nearly 30 years. He has worked at Microsoft and Amazon, and currently works at Tableau Software. You can read more at Mike's Web Log and Evolving English II.

Comments from our users:

Monday February 6th 2012, 2:25 AM
Comment by: Roxanne L. (Paris France)
Very informative and made so accessible for simple folks - Thank you.
Monday February 6th 2012, 8:25 AM
Comment by: Roger Dee (Haslett, MI), Top 10 Commenter
Excellent overview, Mike!
When you start thinking about the complexities of online commercial transactions, let alone airline databases, it seems of a different order of complexity theory to consider!
When my wife starts ordering, re-deciding, re-ordering items on QVC with the apparently limitless scope of choices, and THEN calls the assisting customer service operator, can you imagine it?
It's more than the human brain can contemplate!
Monday February 6th 2012, 9:53 AM
Comment by: Graeme Roberts (Pittsford, NY)
Great article. I love the way that technical terms are applied to other fields, although they are often misused. Creativity rides on the application of principles from one field to another.
Monday February 6th 2012, 11:04 AM
Comment by: Susan C.
It's ironic that circa 1974, the term database was hyphenated but today, almost 40 years later, many databases still aren't designed to handle names with hyphens. Also, while variable-length fields are more common, I'm amazed at how many are still limited to lengths that only accommodate "John Smith" who lives at "30 Elm Street" ...

Great article, Mike!
