blog.josephhall.com: May 2009

Saturday, May 30, 2009

Replacing AWStats

Don't get me wrong; I've been a big fan of AWStats for years. It's great for giving me a high-level overview of how my sites are doing, traffic-wise. The day-by-day breakdown is nice, and if I don't mind looking at the big picture for my site a month at a time (which is usually good), then it's perfect. But once I started working on a more complex setup than just a few low-traffic blogs, it began to show its limitations. Let me tell you what I mean.

In our current setup at work (or at least the one that we've been moving toward), we have three web servers. Each server is behind a load balancer, and serves exactly the same sites. When I started monitoring, we had 200+ sites active. We wanted to have stats for each domain on each server, plus the stats for each domain across all servers, plus the stats for all domains on each individual server.

AWStats requires one separate config file per domain, per server, plus another separate config file per domain for all servers, plus a separate config for the server-wide stats. That means we're looking at 800+ separate config files. I wrote a script to automatically generate all of these for me, but they're still a pain to keep track of.

Before I parse out the log files, I need to combine multiple log files into massive composite log files (domain-wide and server-wide), in chronological order so that AWStats doesn't choke. As you can imagine, this requires a considerable amount of resources.
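For illustration, here is a minimal Python sketch of that merge step. It is not the script used in this setup. It assumes each input log is already in chronological order and uses the usual Apache timestamp format in square brackets, and the names `log_time` and `merge_logs` are made up for the example.

```python
import heapq
import re
from datetime import datetime

# Matches the bracketed timestamp, e.g. [30/May/2009:10:00:00 -0600]
STAMP = re.compile(r"\[([^\]]+)\]")

def log_time(line):
    """Parse the timestamp out of one log line (assumed Apache-style)."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    # %b expects English month abbreviations (the default C locale).
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*streams):
    """Lazily merge already-sorted streams of log lines into one sorted stream."""
    return heapq.merge(*streams, key=log_time)

# Usage with in-memory lists; open file objects work the same way.
web1 = [
    '10.0.0.1 - - [30/May/2009:10:00:00 -0600] "GET / HTTP/1.1" 200 512\n',
    '10.0.0.1 - - [30/May/2009:10:00:05 -0600] "GET /a HTTP/1.1" 200 512\n',
]
web2 = [
    '10.0.0.2 - - [30/May/2009:10:00:02 -0600] "GET /b HTTP/1.1" 200 512\n',
]
for line in merge_logs(web1, web2):
    print(line, end="")
```

Because `heapq.merge` only holds one pending line per input, memory use stays small, though every line still has to be parsed and rewritten once.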

AWStats also maintains a flat data file per config file per month, all in the same directory. Assuming we keep just a year's worth of data, we're looking at 9600+ separate files in that directory. At this point, one wonders why we can't just store everything in a database.
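To see where those numbers come from, here is a rough Python sketch that enumerates the config "scopes" described above. It assumes exactly 200 domains and 3 servers (the real count was 200+), and the domain names, server names, and `config_scopes` helper are placeholders invented for the example.

```python
from itertools import product

def config_scopes(domains, servers):
    """List every (domain, server) combination that needs its own config."""
    scopes = list(product(domains, servers))    # each domain on each server
    scopes += [(d, "all") for d in domains]     # each domain across all servers
    scopes += [("all", s) for s in servers]     # all domains on each server
    return scopes

domains = ["site%d.example" % n for n in range(200)]  # placeholder names
servers = ["web1", "web2", "web3"]                    # placeholder names

scopes = config_scopes(domains, servers)
print(len(scopes))       # 803 config files
print(len(scopes) * 12)  # 9636 monthly data files per year
```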

Now that you've seen the kind of management nightmare that I have to deal with, with just the features that AWStats does have, I think you can see where my frustrations begin. But I can't be content just being unhappy with existing features; I want new features too. And anybody that's ever looked at the main awstats.pl file knows how much of a beast it is to figure out just what's going on, much less change anything.

I ended up adding some glue of my own to make things a little easier. Remember, not only do I have hundreds of config files and thousands of data files, I also have to pull them up in the first place. As it turns out, the script that I wrote to automate building config files also builds an HTML file with links to all of the AWStats pages in it. I even added a quick ping script to periodically hit each domain, and place a colored dot next to the domain name to indicate its active status (green is good, red is bad, blue and yellow are proprietary indicators that a site responded, but not in a normal manner). I also have other indicators set up to tell me things like whether a site has an SSL cert, today's hits so far, and yesterday's hits: green up arrows if today's hits are higher than yesterday's, red down arrows if they are lower, yellow right arrows if the traffic is the same. You get the idea.

Maybe I'm just selfish, but I want a stats program that can handle all of this gracefully, and for not a lot of money. Free (as in freedom, and beer) is ideal. But since I ultimately decided to write my own, it certainly wasn't free as in time. But it was fun, and I learned a lot about AWStats while I was at it.

Incidentally, it turns out AWStats does support clusters kind of like our load balancer setup, but I didn't find that out until I spent a lot of time in the various parts of it. And I haven't taken the time to figure out how they do it; I already had a solution in mind anyway.

I hope this post gives you an idea of why I started thinking about replacing the great AWStats, a program which I still love and respect. If you're interested in looking at my code so far, I have packed it up for public consumption. Information about its operation and shortcomings is in the README.

Clicky!

Friday, May 22, 2009

Not So Inspirational Thoughts

I believe I've mentioned before that my primary responsibilities at work are as the systems administrator. It's something that I've dabbled in with several previous companies, and taught a lot of classes on at my last job. Now I do it full time.

I was talking to my brother about it some months ago, explaining a few of the concepts of server security that I've either learned from others before me, or have picked up on my own. We might even have talked a little bit about social engineering and corporate security. After a while, my brother said to me, "so basically, good security is a fascist regime."

I thought about it for a moment, and said, "yeah, I guess it is."

Yesterday I mentioned this conversation to my boss. He laughed and said, "yeah, that makes perfect sense. In fact, any mechanical process that is a democracy is a failure."

Hope you all enjoy those thoughts as we head into the Memorial Day weekend. Me, I've got a few company servers to enforce martial law on.

Friday, May 8, 2009

Soundex

Today at work we were talking about spelling issues that have come up with phone reps, while trying to look up customer names in our database. For instance, it might sound like the customer is saying her name is Cassy, when in fact she spells it Cassie or Cassi. I used to know a guy who spelled his name Eron (instead of the more traditional Aaron).

How does one deal with this problem? Some of the solutions that were coming up were pretty scary-sounding, and started to drift into areas of serious security concerns. I finally spoke up and said something to the effect of, "if I could offer a suggestion, how do you all feel about looking up records by soundex?" I suddenly had three very interested faces looking in my direction.

Soundex is something that any serious genealogist is well familiar with, but for some reason it doesn't seem to be that common in the programming world. That's a real shame, because it's so useful. And fortunately for Perl and MySQL developers, tools already exist for you.

First, the basics:

Take a word, and write down the first letter of that word.

Drop all of the vowels remaining in the word.

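Remove all adjacent duplicate letters (e.g. LL becomes L). Letters that share a code in the table below count as duplicates of each other, and that includes the first letter.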

If "H" or "W" separate two letters with the same soundex code, the consonant to the right is ignored.

Convert the remaining consonants to numbers:

b, f, p, v => 1

c, g, j, k, q, s, x, z => 2

d, t => 3

l => 4

m, n => 5

r => 6

Save only the first three numbers. If you run out before you reach three digits, pad with zeros.

Using this code, the name "Joseph" would be encoded as "J100". "McAllister" would be "M242". See how easy that is?
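If it helps to see the steps spelled out, here is a minimal Python sketch of the rules as listed above. The name `simple_soundex` and its `digits` parameter are made up for this example. It drops the uncoded letters (vowels, plus H, W, and Y) before collapsing duplicates, which is the order the steps are given in; other implementations order these steps differently and can disagree on some names.

```python
# Letter groups and their codes, taken from the table above.
_GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}

def simple_soundex(word, digits=3):
    """Encode a word as its first letter plus `digits` code digits.

    Pass digits=None to keep every digit (still padded to at least three).
    """
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = []
    prev = _CODES.get(letters[0])   # the first letter's code counts for duplicates
    for ch in letters[1:]:
        code = _CODES.get(ch)
        if code is None:
            continue                # uncoded letter: dropped, does not split duplicates
        if code != prev:
            encoded.append(code)
        prev = code
    body = "".join(encoded)
    if digits is not None:
        body = body[:digits]
    width = 3 if digits is None else digits
    return letters[0].upper() + body.ljust(width, "0")

print(simple_soundex("McAllister"))               # M242
print(simple_soundex("mcallister", digits=None))  # M24236
```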

MySQL has a couple of built-in functions that utilize soundex. The first, amazingly enough, is the "SOUNDEX()" function. This function doesn't give a standard soundex value; it actually encodes the whole word, giving you a minimum of 4 characters, but an arbitrary maximum. For instance:


mysql> select soundex('mcallister');
+-----------------------+
| soundex('mcallister') |
+-----------------------+
| M24236                | 
+-----------------------+
1 row in set (0.00 sec)

If you want it to return a standard soundex with a maximum of four characters, you can use the "SUBSTRING()" function:


mysql> select substring(soundex('mcallister'),1,4);
+--------------------------------------+
| substring(soundex('mcallister'),1,4) |
+--------------------------------------+
| M242                                 | 
+--------------------------------------+
1 row in set (0.00 sec)

MySQL also has a built-in operator called "SOUNDS LIKE" that compares the SOUNDEX() values of its two operands in the background:


mysql> select distinct first_name from customer where first_name sounds like 'corey' limit 10;
+------------+
| first_name |
+------------+
| Corey      | 
| cherie     | 
| Cory       | 
| Carrie     | 
| corri      | 
| cheri      | 
| cesar      | 
| CHERRY     | 
| Cierra     | 
| cora       | 
+------------+
10 rows in set (0.00 sec)

On one hand, we got a few names that most definitely weren't "corey". But on the other hand, the query was inherently case-insensitive, and offered a lot of wiggle room.
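The same kind of lookup can be sketched outside the database. The Python below is only an illustration, not MySQL's implementation: it groups a made-up list of customer names by an untruncated soundex-style key, so that a lookup returns every name sharing a key with the search term. The helpers `sound_key` and `build_index` and the sample names are assumptions for the example.

```python
from collections import defaultdict

_GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
_CODES = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}

def sound_key(name):
    """First letter plus all code digits, padded to at least three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key, prev = [], _CODES.get(letters[0])
    for ch in letters[1:]:
        code = _CODES.get(ch)
        if code is None:
            continue            # vowels, h, w, y carry no code
        if code != prev:
            key.append(code)
        prev = code
    return letters[0].upper() + "".join(key).ljust(3, "0")

def build_index(names):
    """Map each sound key to the names that produce it."""
    index = defaultdict(list)
    for name in names:
        index[sound_key(name)].append(name)
    return index

customers = ["Corey", "Cory", "Carrie", "cesar", "Cassie", "Cassy", "Cassi", "Aaron", "Eron"]
index = build_index(customers)

print(index[sound_key("corey")])  # ['Corey', 'Cory', 'Carrie', 'cesar']
print(index[sound_key("Cassy")])  # ['Cassie', 'Cassy', 'Cassi']
```

One caveat the sketch makes visible: because the first letter is kept as-is, "Aaron" (A650) and "Eron" (E650) land under different keys, so soundex alone will not match names that start with different letters.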

Perl also has a couple of Soundex modules available on CPAN, such as Text::Soundex. This module is also pretty easy to use:


use Text::Soundex;
print soundex("Ashcraft"), "\n";       # prints: A226
print soundex_nara("Ashcraft"), "\n";  # prints: A261

That second function, "soundex_nara()", is pretty important for doing US Census work, since the National Archives and Records Administration (NARA) uses a slightly different encoding scheme. Theirs is probably the version that most genealogists are going to be familiar with.

This is definitely handy for looking up a lot of data when the actual spelling is a little fuzzy. Maybe it will find its way into your database queries at some point.