Writers Talk About Writing
Data-Based: The Jargon of Big Data
One aspect of the computing world that we're all deeply involved in (whether we realize it or not) is the specialized field of databases. In this article, I thought it might be fun to look at the terminology of that involvement from the database's point of view.
The term database itself is a relative newcomer to English; the first OED cite is for 1962. (Of course, the idea of storing data in a structured way goes back eons.) Nowadays we spell the term as one word, but it was originally two: data base (think customer base or knowledge base). The OED traces the term's transformation from two words to the hyphenated data-base (ca. 1974) to a single word by the mid-1980s.
In a database, information is stored in tables that have rows (or records) and columns (or fields). You can think of a spreadsheet as a (very) simple kind of database. Each record in a table generally has a primary key that uniquely identifies that record. You personally are represented by many, many primary keys in databases around the world: your Social Security number, bank account numbers, insurance numbers, credit-card numbers, and so on. Here's an example of another primary key (in both computer- and human-readable formats) that you see on almost everything you can buy in a store:
To you, a UPC. To the database, a primary key.
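To make the idea of a primary key concrete, here is a minimal sketch using Python's built-in sqlite3 module. The product table, its column names, and the UPC values are all invented for illustration; the point is only that the database refuses a second row with the same key.

```python
import sqlite3

# Hypothetical product table; the UPC values used here are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    " upc TEXT PRIMARY KEY,"          # the primary key: at most one row per UPC
    " name TEXT NOT NULL,"
    " price_cents INTEGER NOT NULL)"
)


def add_product(upc, name, price_cents):
    """Insert a row; return False if a row with this UPC already exists."""
    try:
        conn.execute(
            "INSERT INTO product VALUES (?, ?, ?)", (upc, name, price_cents)
        )
        return True
    except sqlite3.IntegrityError:
        # The database itself rejects a duplicate primary key.
        return False


def find_product(upc):
    """Look up a single row by its primary key; None if there is no match."""
    return conn.execute(
        "SELECT upc, name, price_cents FROM product WHERE upc = ?", (upc,)
    ).fetchone()


add_product("012345678905", "Canned soup", 199)
```

Calling add_product a second time with the same UPC returns False, which is the "uniquely identifies" part of the definition at work.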
You've undoubtedly filled in forms that have a little box for each letter or number, like this example from a United States W-9 form:
Examples of fixed-length fields.
This is a clue that the information is going into a fixed-length field in the database. In the old days, when storage was a major cost factor, data was typically stored in fixed-length fields. You might still see a remnant of that in postal mail or on an airline boarding pass where your name or address is oddly abbreviated or even truncated. Today databases generally allow variable-length fields, and we don't see many forms anymore where there aren't enough little boxes for our name or address.
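As a rough sketch of why names get chopped, here is what a fixed-length record might look like in Python. The field widths (12, 8, and 5 characters) and the function names are assumptions chosen for the example, not a real file format.

```python
def to_fixed_width(value: str, width: int) -> str:
    """Truncate or space-pad value so it is exactly width characters long."""
    return value[:width].ljust(width)


def make_record(last_name: str, first_name: str, zip_code: str) -> str:
    # Hypothetical layout: 12 characters for the last name, 8 for the
    # first name, 5 for the ZIP code, so every record is 25 characters.
    return (
        to_fixed_width(last_name, 12)
        + to_fixed_width(first_name, 8)
        + to_fixed_width(zip_code, 5)
    )
```

Short values are padded with spaces and long ones silently lose their tail, which is the boarding-pass effect described above.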
Most databases in the last 40 years have been relational databases. In this arrangement, the data is normalized, meaning that it's broken out into separate tables (related via primary keys) to eliminate redundancy. Almost any time you see something coded in a form (medical diagnostic codes, lists of occupations, even age groups in a survey), you're seeing evidence of a normalized database, where each code is an entry in a relational table. This is also why you are limited, sometimes frustratingly, to just the choices that the form provides.
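Here is a small sketch of that arrangement, again with Python's sqlite3 module. The occupation and respondent tables, their codes, and the helper functions are hypothetical. Each respondent row stores only a code, the label lives once in the lookup table, and a foreign-key constraint is what restricts you to the listed choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Lookup table: each occupation label is stored exactly once.
conn.execute("CREATE TABLE occupation (code TEXT PRIMARY KEY, label TEXT NOT NULL)")
# Main table: stores the short code, not the label.
conn.execute(
    "CREATE TABLE respondent ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " occupation_code TEXT NOT NULL REFERENCES occupation(code))"
)
conn.executemany(
    "INSERT INTO occupation VALUES (?, ?)",
    [("ED", "Editor"), ("DBA", "Database administrator")],
)


def add_respondent(name, occupation_code):
    """Return False if the code is not one of the choices in the lookup table."""
    try:
        conn.execute(
            "INSERT INTO respondent (name, occupation_code) VALUES (?, ?)",
            (name, occupation_code),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def occupation_of(name):
    """Join the two tables to turn the stored code back into a readable label."""
    row = conn.execute(
        "SELECT o.label FROM respondent r"
        " JOIN occupation o ON o.code = r.occupation_code"
        " WHERE r.name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else None
```

In this sketch add_respondent("Pat", "ED") succeeds, while a code that isn't in the occupation table is rejected by the database rather than by the application.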
The primary functions for a database go by the acronym CRUD — create, read, update, and delete. (This term gives some editors the willies, as you can imagine.) Most database work consists of these CRUD operations over and over. When you buy something with a credit card or register at a website, a database somewhere creates a new record. When you view a catalog online, the database reads data in order to display it. If you change a password, the database performs an update. If you unsubscribe from a mailing list, a database performs a delete operation (we hope) to remove you from that database.
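The four CRUD operations map fairly directly onto four SQL statements. The sketch below uses Python's sqlite3 module with a made-up subscriber table; the function names simply mirror the acronym and are not from any particular library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriber (email TEXT PRIMARY KEY, password_hash TEXT NOT NULL)"
)


def create(email, password_hash):
    # C: registering at a website creates a new record.
    conn.execute("INSERT INTO subscriber VALUES (?, ?)", (email, password_hash))


def read(email):
    # R: viewing data reads a record; returns None if there is no such row.
    return conn.execute(
        "SELECT email, password_hash FROM subscriber WHERE email = ?", (email,)
    ).fetchone()


def update(email, new_password_hash):
    # U: changing a password updates a record; returns the number of rows changed.
    cur = conn.execute(
        "UPDATE subscriber SET password_hash = ? WHERE email = ?",
        (new_password_hash, email),
    )
    return cur.rowcount


def delete(email):
    # D: unsubscribing deletes a record; returns the number of rows removed.
    cur = conn.execute("DELETE FROM subscriber WHERE email = ?", (email,))
    return cur.rowcount
```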
These terms have undergone some novel changes in the lexicon of databases. Parallelism with creation, selection, and deletion has spawned the noun updation, even though the word update already fulfills this role. And people have invented the term upsert for the common case of "update the record if it exists, otherwise create (insert) it."
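For the curious, here is roughly what an upsert can look like in practice. This sketch assumes SQLite 3.24 or later, which supports an ON CONFLICT ... DO UPDATE clause on INSERT; other databases spell the same idea differently. The inventory table and function names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")


def upsert_quantity(sku, quantity):
    """Insert the row if the SKU is new; otherwise overwrite its quantity."""
    conn.execute(
        "INSERT INTO inventory (sku, quantity) VALUES (?, ?) "
        # 'excluded' refers to the row that the INSERT tried to add.
        "ON CONFLICT(sku) DO UPDATE SET quantity = excluded.quantity",
        (sku, quantity),
    )


def quantity_of(sku):
    row = conn.execute(
        "SELECT quantity FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()
    return row[0] if row else None
```

Calling upsert_quantity twice with the same SKU leaves a single row holding the second value, with no need for the caller to check first whether the record exists.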
If you order tickets online, you don't want someone to buy seats out from under you while you ponder. This is the database problem of concurrency control, or making sure that multiple users don't try to change the same record at the same time. One approach is optimistic concurrency. Here, the database lets anyone view records. Only when an update is submitted does the database check whether the record has been changed. If you've ever ordered something only to discover during check-out that the item has gone out of stock, you might have experienced the downside of optimistic concurrency. The alternative approach is pessimistic concurrency, where the database locks a record so that no one else can touch it until it's explicitly unlocked. As you might guess, the ticket-site model is pessimistic, given the high likelihood of collisions between customers all hoping for the same seats. And that's why you have a limited amount of time to make your choice and check out — to minimize the time that a record is locked.
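One common way to implement optimistic concurrency is a version number on each record, and the sketch below takes that approach; it is an assumption on my part, not the only technique. The seat table and function names are hypothetical. Anyone can read a seat, but an update succeeds only if the version is still the one the buyer saw.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seat ("
    " id TEXT PRIMARY KEY,"
    " holder TEXT,"                        # NULL while the seat is unsold
    " version INTEGER NOT NULL DEFAULT 0)"  # bumped on every successful change
)
conn.execute("INSERT INTO seat (id) VALUES ('A1')")


def read_seat(seat_id):
    """Return (holder, version); reading takes no lock at all."""
    return conn.execute(
        "SELECT holder, version FROM seat WHERE id = ?", (seat_id,)
    ).fetchone()


def claim_seat(seat_id, buyer, version_seen):
    """Succeed only if nobody has changed the record since it was read."""
    cur = conn.execute(
        "UPDATE seat SET holder = ?, version = version + 1"
        " WHERE id = ? AND version = ?",
        (buyer, seat_id, version_seen),
    )
    return cur.rowcount == 1  # 0 rows changed means someone else got there first
```

If two buyers both read version 0, the first claim_seat call wins and bumps the version to 1; the second call finds no row at version 0 and returns False, which is the out-of-stock surprise at checkout.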
Another problem that databases have to solve is that of atomicity. A classic example is transferring money from one bank account to another. This requires two database updates (one debit, one credit); obviously, you don't want the debit to succeed but the credit to fail. To ensure this, databases "wrap" the updates in a transaction, which is a set of operations that succeed or fail together. If all the operations succeed, the database performs a commit, which makes the changes permanent. If any constituent operation fails, the database performs a rollback. Think about ordering something from a third-party vendor via Amazon, and you get a sense of the complexity of distributed transaction processing, where your order involves updates in different databases on different computers belonging to different companies.
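The bank-transfer example can be sketched in a few lines with Python's sqlite3 module, where using the connection as a context manager commits if the block finishes and rolls back if an exception escapes it. The account table, the account names, and the choice of LookupError to signal a missing account are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?)", [("checking", 10000), ("savings", 5000)]
)
conn.commit()


def transfer(src, dst, amount_cents):
    """Debit src and credit dst as one unit; return True only if both happened."""
    try:
        with conn:  # commit on success, rollback if an exception escapes the block
            debit = conn.execute(
                "UPDATE account SET balance_cents = balance_cents - ? WHERE id = ?",
                (amount_cents, src),
            )
            if debit.rowcount != 1:
                raise LookupError(f"no such account: {src}")
            credit = conn.execute(
                "UPDATE account SET balance_cents = balance_cents + ? WHERE id = ?",
                (amount_cents, dst),
            )
            if credit.rowcount != 1:
                # The debit above has already run; raising here undoes it.
                raise LookupError(f"no such account: {dst}")
        return True
    except LookupError:
        return False


def balance(account_id):
    return conn.execute(
        "SELECT balance_cents FROM account WHERE id = ?", (account_id,)
    ).fetchone()[0]
```

A transfer to an account that doesn't exist returns False and leaves the source balance untouched, because the rollback discards the debit that had already been applied.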
These days, a term on the minds of many computer people is big data, which refers to the enormous quantity of diverse data that we collectively generate (one source estimates 2.5 quintillion bytes a day). Given how much we all — individuals and corporations and governments — now want to store, you can imagine some of the challenges involved in capturing, managing, and analyzing this big data.
Naturally, there is a great deal more to databases. Even experienced programmers typically defer database work to a database administrator (DBA), whose specialized job involves the intricacies of not just creating databases, but indexing for improved performance, mirroring (real-time duplication of data for safety), scalability, and other critical tasks to keep everything running smoothly.
Give a thought to all this the next time you surf the web, order things online, or fill in a form. Chances are good that there's a database waiting back there, ready to translate your action into its own terms.