Thursday, February 18, 2010

Primary Keys in MySQL

I was reading an AJAX tutorial the other day, and something that the author said caught my eye:

"All we really need is the title, but I always provide a primary key for any table that I create."

Why add it if you know for a fact that you don't need it? In this case, as with every case I've seen so far, the primary key in question was an auto_increment integer. I have long maintained that while this is necessary in most tables, it does not necessarily belong in every table. What you really need is a unique identifier. Without this, all you have is a jumble of data that really doesn't make sense to be stored in a database.

But that unique identifier doesn't always need to be a counter. Let's take a look at a couple of examples. Consider the following hypothetical user table:

username VARCHAR(50) NOT NULL,
realname VARCHAR(50) NOT NULL,
password VARCHAR(50) NOT NULL,
PRIMARY KEY (username)

It makes sense for a username to be unique, because the username/password (or other token) combination will be required for access to this application. If you need to refer to this table from another table, you can just refer to the username, because it is unique and identifiable. I might note that MySQL itself does not use an auto_increment field for its own user table.

There are potential problems with this, however. When you change a username in this table, you need to change it in any tables that reference it as well. Additionally, you are taking up a little extra (if negligible) space in referring tables. An auto_increment uses an integer, which takes up less space, and will not change under normal circumstances. It should be noted that Unix-style operating systems use a UID to identify not only users, but file and process ownership. The username itself is rarely used by the system for anything other than making things more human-readable.

Let me show you a table structure that I've been working on this morning. I have a need to store recipes in a database, but because a recipe contains varaiable numbers of ingredients and directions, it doesn't make sense to try and store each in its own field in a single recipe table. Instead, I have broken out my structure into three separate tables:

CREATE TABLE recipes (
source VARCHAR(255) NOT NULL,
preheat_qty INT(4),
preheat_unit ENUM('F','C'),
yield_qty INT(4),
yield_unit INT(11) NOT NULL,
PRIMARY KEY (recipe_id)

CREATE TABLE recipe_ingredients (
recipe_id INT(11) UNSIGNED NOT NULL,
rank INT(4),
qty INT(4) NOT NULL,
unit INT(11) NOT NULL,
PRIMARY KEY (recipe_id, rank)

CREATE TABLE recipe_directions (
recipe_id INT(11) UNSIGNED NOT NULL,
rank INT(4),
direction TEXT,
PRIMARY KEY (recipe_id, rank)

The recipe itself is referenced by a unique counter, which is usual. INT(11) contains far more unique identifiers as I expect to need for that table. But there will be multiple ingredients per recipe, and multiple directions per recipe. Even if I only store recipes with no more than 4 ingredients and 4 steps (which is unlikely), I need 4 times as many unique identifiers per table.

I already have to store the recipe_id, to keep from orphaning the data. It's important to store what order each step is in, because if they got out of order, the recipe would quickly becaome confusing. It's nice to store the order of ingredients in the order in which they'll be used too, and many people write recipes with this in mind. I've called this field "rank". Since I already have those two fields, and they already uniquely identify the rows, why not officially make them primary keys together? MySQL allows it, and I'm going to make use of it.

I propose that whenever you create a table, you take a moment to consider whether or not you actually need a counter. Most of the time you will, but not always. Get out of the habit of doing things because that's the way you've always done them, and get into the habit of doing things because you've thought about them and have made an informed decision specific to the situation at hand.