Friday, May 8, 2009

Soundex

Today at work we were talking about spelling issues that have come up with phone reps, while trying to look up customer names in our database. For instance, it might sound like the customer is saying her name is Cassy, when in fact she spells it Cassie or Cassi. I used to know a guy who spelled his name Eron (instead of the more traditional Aaron).

How does one deal with this problem? Some of the solutions that were coming up were pretty scary-sounding, and started to drift into areas of serious security concerns. I finally spoke up and said something to the effect of, "if I could offer a suggestion, how do you all feel about looking up records by soundex?" I suddenly had three very interested faces looking in my direction.

Soundex is something that any serious genealogist is well familiar with, but for some reason it doesn't seem to be that common in the programming world. That's a real shame, because it's so useful. And fortunately for Perl and MySQL developers, tools already exist for you.

First, the basics:
  • Take a word, and write down the first letter of that word.

  • Drop all of the vowels remaining in the word.

  • Remove all duplicate letters (i.e. LL becomes L).

  • If "H" or "W" separate two letters with the same soundex code, the consonant to the right is ignored.

  • Convert the remaining consonants to numbers:

    • b, f, p, v => 1

    • c, g, j, k, q, s, x, z => 2

    • d, t => 3

    • l => 4

    • m, n => 5

    • r => 6

  • Save only the first three numbers. If you run out before you reach three digits, pad with zeros.

Using this code, the name "Joseph" would be encoded as "J100". "McAllister" would be "M242". See how easy that is?

MySQL has a couple of built-in functions that utilize soundex. The first, amazingly enough, is the "SOUNDEX()" function. This function doesn't give a standard soundex value; it actually encodes the whole word, giving you a minimum of 4 characters, but an arbitrary maximum. For instance:

mysql> select soundex('mcallister');
+-----------------------+
| soundex('mcallister') |
+-----------------------+
| M24236 |
+-----------------------+
1 row in set (0.00 sec)

If you want it to return a standard soundex with a maximum of four characters, you can use the "SUBSTRING()" function:

mysql> select substring(soundex('mcallister'),1,4);
+--------------------------------------+
| substring(soundex('mcallister'),1,4) |
+--------------------------------------+
| M242 |
+--------------------------------------+
1 row in set (0.00 sec)

MySQL also has a built-in function called "SOUNDS LIKE" that actually performs a SOUNDEX() function in the background:

mysql> select distinct first_name from customer where first_name sounds like 'corey' limit 10;
+------------+
| first_name |
+------------+
| Corey |
| cherie |
| Cory |
| Carrie |
| corri |
| cheri |
| cesar |
| CHERRY |
| Cierra |
| cora |
+------------+
10 rows in set (0.00 sec)

On one hand, we got a few names that most definitely weren't "corey". But on the other hand, the query was inherently case-insenstive, and offered a lot of wiggle room.

Perl also has a couple of Soundex modules available in CPAN, such as Text::Soundex. This module is also pretty easy to use:

use Text::Soundex;
print soundex("Ashcraft"), "\n"; # prints: A226
print soundex_nara("Ashcraft"), "\n"; # prints: A261

That second function, "soundex_nara()", is pretty important for doing US Census work, since the National Archives and Records Administration (NARA) uses a slightly different encoding scheme. Theirs is probably the version that most genealogists are going to be familiar with.

This is something that's definitely handy for looking up a lot of data, when the actual spelling is a little fuzzy. Maybe it will find its way into your database queries at some point now.

1 comment:

  1. And if you wanted to dump MySQL for a *real* open source relational database engine, there are functions available for PostgreSQL as well that do this kind of stuff as well:

    http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html

    ReplyDelete

Comments for posts over 14 days are moderated