The Absolute Minimum Programmers Need to Know About Databases
Generally speaking, modern programming has three primary tiers: the visual presentation, the business logic layer, and the persistence engine. To be effective, and therefore hirable, most programmers are strong in one of the first two, and we classify them accordingly as front-end or back-end programmers. However, a quick glance at popular technical blogs will reveal that the third tier, the persistence engine, is fraught with suspicion and misunderstanding.
Therefore, we are going to talk about databases and outline the minimum that any good programmer needs to know about how databases are effectively used. We take for granted that almost all back-end programmers are familiar with the mechanics of actually talking to a database, as persisting and retrieving data is one of the primary use cases for the middle tier. We will focus instead on how to think of the database in the overall architecture of a system, and on how and why the database functions as it does.
Generally, one of the most frequently voiced complaints about databases is that they are too complex and that what most programmers want is a pure data store. It should be easy to store any arbitrary object, get an identifier, and retrieve the object at a later time: just sort of a very large persistent hash map. The rest of the complexity in dealing with databases just makes it harder for the application programmer to elegantly express their vision in code. This line of thinking has expressed itself in many alternative data store technologies over the last 30 years or so: object databases, shared-nothing architectures, and MapReduce. These technologies solve a number of data access problems, but they have not supplanted the standard SQL database, whose relational model has been around since 1970. While these newer technologies provide unique and speedy access to data, they do not solve the class of problems that SQL is uniquely suited for.
The primary database architecture that most programmers will run across will be some brand of SQL database. The vast majority of these databases will be SQL92 compliant. This means, mostly, that the underlying core of the databases will all support the same syntax. This has become truer over time as database vendors strive to make their platforms more open and easy to migrate to.
Compliance with the standard makes intermediate access libraries like JDBC or ODBC possible, as the libraries must only take into consideration slight variations between database engines. The SQL language is divided into two primary sets: Data Definition Language (DDL) and Data Manipulation Language (DML). DDL is the primary method for creating objects within a SQL database; within the language are the tools to create tables, indexes, sequences, and other artifacts. DML, on the other hand, is used to manipulate the data that is placed in the structures created by DDL.
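The DDL/DML split can be seen concretely. The following sketch uses Python's built-in sqlite3 module purely for illustration; the table and index names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define structures -- a table and an index.
conn.execute("CREATE TABLE animal_table (id INTEGER PRIMARY KEY, species TEXT)")
conn.execute("CREATE INDEX idx_species ON animal_table (species)")

# DML: manipulate the data held in those structures.
conn.execute("INSERT INTO animal_table (id, species) VALUES (1, 'feline')")
conn.execute("UPDATE animal_table SET species = 'canine' WHERE id = 1")
rows = conn.execute("SELECT species FROM animal_table WHERE id = 1").fetchall()
print(rows)  # [('canine',)]
```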
From a technical standpoint, a SQL-compliant database needs to be ACID:
- Atomicity: each transaction is all or nothing.
- Consistency: each transaction moves the database from one valid state to another.
- Isolation: concurrent transactions do not interfere with one another.
- Durability: when a transaction is committed it stays that way.
This presupposes a number of different properties that make SQL databases useful. There is a default transaction model that allows for the rollback of content if there is a failure mid-transaction. Multi-threaded behavior is handled gracefully. These features show the difference between standard file I/O and database access as far as data stability is concerned.
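The rollback behavior is easiest to see with a concrete failure. This is a minimal sketch using Python's sqlite3 module; the accounts table and the simulated crash are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

# A transfer must debit one account and credit the other -- all or nothing.
try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    # Simulate a failure mid-transaction, before the matching credit runs.
    raise RuntimeError("crash before credit")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
    conn.commit()
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(100,), (50,)] -- the half-finished transfer was rolled back
```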
A very simplified visual model of a SQL database table is that of a two-dimensional matrix or grid. Each row consists of a number of elements or columns. This structure will look much like an Excel spreadsheet. Data is added to a table using the INSERT statement, retrieved via the SELECT statement, and removed using the DELETE statement. At this level, there is not that much difference between an individual database table and a delimited flat file. To retrieve data from the table, the programmer uses a SELECT statement with a WHERE clause to define which rows should be returned. For instance:
SELECT * FROM animal_table WHERE id = 1;
This will return the record in the table with the identifier 1.
SELECT * FROM animal_table WHERE species = 'feline';
This will return all records in the table where the column species is equal to 'feline'.
The two examples above show the primary difference between the access modalities that we see with databases. The first, where a unique identifier is provided and a single record is retrieved, is very similar to a hash map. I call this access model the Online Transaction Processing (OLTP) model. The second example returns a group of records, sometimes from multiple tables, based on non-unique selection criteria. I will call this the Decision Support (DS) model. The difference between these two models, I suspect, is the main cause of confusion about databases in general. The OLTP model is inherently algebraic: you can think of it as a function, where a value is passed in and a value is returned. This maps quite nicely onto the general conceptual model of coding languages. DS usage of the database is more akin to set theory than it is to algebra. In my experience, 80% of databases are OLTP, while 20% can be described as DS. And while other persistence technologies can be used quite effectively in the OLTP case, they fall down in the DS case.
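The two modalities can be contrasted in a few lines. This sketch assumes Python's sqlite3 module and reuses the animal_table example; the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal_table (id INTEGER PRIMARY KEY, species TEXT)")
conn.executemany("INSERT INTO animal_table VALUES (?, ?)",
                 [(1, 'feline'), (2, 'canine'), (3, 'feline'), (4, 'feline')])

# OLTP-style access: unique key in, single record out -- behaves like a hash map.
one = conn.execute("SELECT * FROM animal_table WHERE id = 1").fetchone()

# DS-style access: non-unique criteria in, a set of records out.
felines = conn.execute(
    "SELECT * FROM animal_table WHERE species = 'feline'").fetchall()

print(one)           # (1, 'feline')
print(len(felines))  # 3
```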
So, if we can visualize a table as a matrix of rows and columns, we can then describe the columns. SQL supports a large number of data types that can be mapped to columns. For the most part these can be sorted into four different types:
- Numeric
  - a. Large Integer
  - b. Small Integer
  - c. Float
  - d. Real
  - e. Double
  - f. Numeric
- Character
  - a. Fixed Size
  - b. Variable Size
  - c. Unicode
- Date/Time
  - a. Date
  - b. Time
  - c. Timestamp
- Vendor-Specific
  - a. Defined by each vendor
A point of concern while creating a database is that database vendors support each of these data types differently. There is not much that can be done about this, but remember that all of these data types will be rationalized into the data types supported by your interface library. The list above is a fairly generic representation of the data types you will find in most modern SQL-compliant databases.
In order to minimize the amount of space taken up by each row within the database, a database table should be defined with the smallest column type that will support all use cases. For instance, a column that holds a price could be defined as a FLOAT or as a NUMERIC(9,2). The NUMERIC type would be preferred because its internal representation is an integer, which is smaller than the internal representation of the float. Another example would be string values; in most cases it makes sense to define character fields as variable rather than fixed. However, as we learned during the Y2K fiasco, shortening data types just to save space has long-term and unforeseen consequences.
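The float problem is easy to demonstrate. The following Python sketch shows why an exact, integer-backed numeric type is preferred over a binary float for a price column:

```python
from decimal import Decimal

# A price column defined as FLOAT inherits binary floating point's
# inability to represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False -- accumulated representation error

# An exact decimal type, analogous to NUMERIC(9,2), has no such drift.
exact_total = Decimal("0.1") + Decimal("0.2")
print(exact_total == Decimal("0.3"))  # True
```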
One last note on column types: for the most part, one must expect that values within a column might be null. This is perfectly acceptable, but it causes huge problems if one blindly tries to use a value retrieved from a database row without checking for that condition first. If a column should never be null, it can be modified with the NOT NULL constraint to ensure that a value is always entered. Another issue to watch out for is that there is a difference between a null value and an empty string (""); this will cause no end of confusion unless you guard against it.
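A short sketch of the null pitfalls, again using Python's sqlite3 module with invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, nickname TEXT)")
conn.execute("INSERT INTO pets VALUES (1, NULL), (2, '')")

rows = conn.execute("SELECT id, nickname FROM pets ORDER BY id").fetchall()
print(rows)  # [(1, None), (2, '')] -- NULL arrives as None, not as ''

# Guard before use: None breaks string operations that '' survives.
labels = [(n.upper() if n is not None else "(unknown)") for _, n in rows]

# NOT NULL pushes the check into the database itself.
conn.execute("CREATE TABLE owners (name TEXT NOT NULL)")
try:
    conn.execute("INSERT INTO owners (name) VALUES (NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True -- the database refused the null value
```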
The above discussion describes the basic data container within SQL, the table. In order to find content within the table more quickly, we need to introduce the concept of the index. Without indexes, to find content within a table, the server would have to evaluate all rows within the table against the search criteria. This is known as a serial scan. As a table becomes larger, the amount of time necessary to do a serial scan of the table becomes longer. Indexes provide a way for the server to quickly identify qualifying entries, without having to evaluate the entire table. Indexes come in two varieties: unique and non-unique. A unique index allows for one, and only one, value to be added in a column for the entire table and it uniquely identifies the record. A special case of the unique index is one that cannot be null. This would be the primary key for the record. A non-unique index would allow for multiple rows to have the same column value and if that value were queried, we would expect 0 to N records returned.
Indexes can be thought of as secondary tables that are ordered to provide quick binary search access to the data. In the simplest case, an index based on an integer would be ordered from 0 to N. When a request is made for an indexed value, the server looks at the ordered index table, using a binary search to quickly find the qualifying entry if it exists. Once the entry in the index is found, the matching record in the primary table can be retrieved.
Indexes can be built from individual fields or multiple fields. An index created from multiple fields is called a concatenated key, and creating a unique concatenated key allows the database table to enforce uniqueness across combinations of column values. One must also take into account the order of the columns within a concatenated key. If the key were built from two columns, for instance name and year, one would still be able to search by name alone and get the value from the index. However, if one searched by year alone, the index would not come into play.
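The leading-column behavior can be verified directly. This sketch uses Python's sqlite3 module and its EXPLAIN QUERY PLAN facility; the papers table and column names are invented, and the exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (name TEXT, year INTEGER, title TEXT)")
# A concatenated (composite) key on (name, year).
conn.execute("CREATE INDEX idx_name_year ON papers (name, year)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite can use an index; the
    # human-readable detail is the last column of the first row.
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][3]

# Leading column alone: the index is usable.
print(plan("SELECT * FROM papers WHERE name = 'Codd'"))
# e.g. SEARCH papers USING INDEX idx_name_year (name=?)

# Trailing column alone: the index cannot be used; the table is scanned.
print(plan("SELECT * FROM papers WHERE year = 1970"))
# e.g. SCAN papers
```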
At this point we have created database tables and indexed them—allowing for the storage and retrieval of a homogeneous set of content. In most cases you will have to deal with multiple entities that will be stored in multiple tables. We will now look at how these entities can relate to one another.
Database normalization is the process of structuring columns and tables in such a way as to reduce redundancy and dependency. This can be more simply stated in that for each fact within a database, there is one, and only one, way to know that fact. The amount of normalization that can be performed against a dataset is typically defined as one of four normal forms:
- First Normal Form: Table represents a relation and has no repeating elements.
- Second Normal Form: Table eliminates duplicate data.
- Third Normal Form: All non-key values provide information about the key, the whole key, and nothing but the key.
- Fourth Normal Form: Each row represents one and only one fact.
We can skip the rather academic discussion of the normal forms, but basically we want to make sure that each row in each table is as unique as possible. We also want to eliminate any possibility that one fact can be represented in multiple places. In order to create a normalized database we have to be able to create relationships between entities, which can be formed by using primary and foreign keys.
Let’s take an example of a pet hospital. Our initial cut at the database shows the following:
| Hospital Name | Address | Phone Numbers | Animals Treated |
| --- | --- | --- | --- |
| Dave’s Pet Emporium | 567 Example St. | 415-001-2222, 415-611-5551 | dogs, mice, snakes |
| Frank’s House of Cats | 987 Alley St. | 415-545-5555 | cats |
| Large Animal Hospital | 863 Triangle St. | 415-000-0101 | horses, cows, elephants |
| Pet Care | 123 Street Ave. | 415-101-1111 | dogs, cats, lizards |
The requirements here are that each hospital has a name, an address, and multiple phone numbers, and that each treats multiple kinds of animals. For this table to meet even the minimum level of normalization, we must first remove the repeating groups (the comma-separated phone numbers and animal lists). To do this, we would create one table defining the hospital, another for phone numbers, and yet another for animals treated.
One hallmark of our practice is that we would want to choose a primary key that did not impart any information about the entity itself. For this, we would want to introduce a numeric primary key for each entity not defined by its relationship with another table. First, we would create the hospital entity:
| Primary Key | Hospital Name | Address |
| --- | --- | --- |
| 1 | Dave’s Pet Emporium | 567 Example St. |
| 2 | Frank’s House of Cats | 987 Alley St. |
| 3 | Large Animal Hospital | 863 Triangle St. |
| 4 | Pet Care | 123 Street Ave. |
Next, we would want to create a table linking 1 to N telephone numbers with the hospital:
| Phone Number Primary | Hospital Primary | Phone Number |
| --- | --- | --- |
| 1 | 1 | 415-001-2222 |
| 2 | 1 | 415-611-5551 |
| 3 | 2 | 415-545-5555 |
| 4 | 3 | 415-000-0101 |
| 5 | 4 | 415-101-1111 |
Similarly, we would do the same for animals:
| Animal Primary | Hospital Primary | Animal |
| --- | --- | --- |
| 1 | 1 | dogs |
| 2 | 1 | mice |
| 3 | 1 | snakes |
| 4 | 2 | cats |
| 5 | 3 | horses |
| 6 | 3 | cows |
| 7 | 3 | elephants |
| 8 | 4 | dogs |
| 9 | 4 | cats |
| 10 | 4 | lizards |
As you can see, there is additional normalization that could be performed on the animal table to decrease the dependence on the animal names.
Once the entities above are created, we would then define the primary keys for each table (our unique identifier). In order to then create our relationships, we would define foreign keys in the dependent tables. In this case, hospital primary within the phone number and animal tables would be foreign keys pointing at the hospital primary in the hospital table. This gives us two things: one, it gives a direct link from the primary table to the dependent table and two, it makes certain that no dependent record can be created for which there is no primary.
It is best practice to create an index, either unique or non-unique, on the dependent side of the foreign key.
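Putting the keys together: a minimal sketch of the hospital/phone-number relationship using Python's sqlite3 module. Note that SQLite only enforces foreign keys when the pragma is switched on; most other engines enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only with this pragma

conn.execute("CREATE TABLE hospital (hospital_pk INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.execute("""CREATE TABLE phone_number (
                    phone_pk INTEGER PRIMARY KEY,
                    hospital_pk INTEGER NOT NULL REFERENCES hospital (hospital_pk),
                    phone TEXT)""")
# Index the dependent side of the foreign key, as recommended above.
conn.execute("CREATE INDEX idx_phone_hospital ON phone_number (hospital_pk)")

conn.execute("INSERT INTO hospital VALUES (1, ?, ?)",
             ("Dave's Pet Emporium", "567 Example St."))
conn.execute("INSERT INTO phone_number VALUES (1, 1, '415-001-2222')")  # parent exists

# A dependent record with no matching primary is rejected outright.
try:
    conn.execute("INSERT INTO phone_number VALUES (2, 99, '415-555-0000')")
    orphan_allowed = True
except sqlite3.IntegrityError:
    orphan_allowed = False
print(orphan_allowed)  # False
```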
We have now defined the basic building blocks for a relational database: content containers (tables), an indexing mechanism for fast retrieval, and a formal way to describe the relationships between entities.
It is safe to say that the slowest thing in any system architecture is the movement of the disk head on the database server. This is measured in milliseconds rather than nanoseconds. A great deal of thought needs to be put into the design of a database server to provide quick, multi-threaded access to content.
In a physical database layout, care needs to be taken to reduce I/O. This can be done in several ways. Data can be physically separated from indexes, which keeps the disk head from moving between the data and the index. If this can be done, different disk heads can be used to simultaneously read data and index content. The perfect physical database layout would have a separate I/O bus for each table and index.
When content is being selected, you want to keep from moving between the index and the underlying content as much as possible. To do this sometimes it is advantageous to create covering indexes—these indexes contain all the data necessary to satisfy a specific set of queries. If all the data elements are available within the index itself, then it is not necessary to move from the index back to the main table to resolve any discriminators that are mentioned in the select statement.
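A covering index can be demonstrated with the query planner. This sketch assumes Python's sqlite3 module; the exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal_table (id INTEGER PRIMARY KEY, species TEXT, name TEXT)")
# Every column the query touches lives in the index: a covering index.
conn.execute("CREATE INDEX idx_species_name ON animal_table (species, name)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM animal_table WHERE species = 'feline'"
).fetchall()[0][3]
print(detail)
# e.g. SEARCH animal_table USING COVERING INDEX idx_species_name (species=?)
# -- the main table is never touched, so the disk head never leaves the index.
```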
It is in your best interest to keep index chains as short as possible. For instance, you might be tempted to index the gender column of a database table; however, this would form very long index chains because the data within the column is not diverse. The longer the chains, the less useful the index will be in the query process. If you absolutely must index columns with low diversity (low cardinality), some database engines offer special index types for the purpose, for instance the bitmap index in Oracle.
Author: Adam Trissel is the Head of Java Development at SolutionSet, where he helped found and build the successful social media professional service group focused on Jive software.