computer geek gone chef and back again
 
 


graciously hosted by neverblock
http://www.neverblock.com/

blog.josephhall.com


Saturday, March 13, 2010

Getting Started with Cassandra

12 days ago, I read an article by Matt Asay which briefly mentioned Cassandra, Facebook's NoSQL offering. I had heard of it before, but hadn't really looked into it. For some reason, Matt's article caused me to look into it again. Within a couple of hours, I was evangelising it to a few friends. Matt's article pointed out that Facebook, Digg and Twitter had all started using it, and as I researched it, it seemed that Digg's and Twitter's migrations to it had taken only a few days. Last night, one of the people that I had been hyping it to sent me a thing from Reddit, posted only 11 days after Matt's article, talking about how they had just finished a 10-day migration to Cassandra.

What is Cassadra?

In order to understand Cassandra, it would first make sense to talk about exactly what this NoSQL movement is.

What is NoSQL?

NoSQL is a nickname that arose sometime last year to describe a series of increeasingly popular database management systems (DBMS) that do not use SQL as an interface, as has been common in database servers for at least a couple of decades. Indeed, SQL is hardly the issue here at all. If you were to migrate from MySQL to PostgreSQL, chances are you would still have to update several of your queries in order to be compatible with the new DBMS. It's almost like changing languages anyway, except that it's more like the differences between Canadian French and Hatian Creole: both French, but not pure French, and there are enough differences to matter.

Switching from SQL to NoSQL is a little more like switching from English to Japanese. Both are languages which accomplish the same goal (interperson communication), but both take very different approaches. They have different keywords, different grammars, and some would argue that Japanese is a much more precise and efficient language. One might even bring scalability into the discussion, as both languages have had the opportunity to grow. One might argue that English has done so sloppily, borrowing from odd places, whereas Japanese has done so with a little less mess, for instance, adding an entire syllabic infrastructure called katakana in order to handle foreign words, among other things.

Both SQL and NoSQL are DBMSs. They both hold data admirably, but whereas most SQL servers were originally built before the idea of database clusters was common, NoSQL servers were introduced around the time that database clusters were becoming a necessity in many infrastructures. This provided NoSQL servers with the ability to consider this concern during the design stages, rather than having to patch it in later. Some of the names that you will see in the NoSQL world are BigTable, Dynamo, HBase, Hadoop, CouchDB, and Cassandra.

So, What is Cassandra?

Cassandra is a NoSQL DBMS written by Facebook. It was open sourced in 2008, and added to the Apache Project in 2009. It is fault-tolerant, decentralized, and "eventually consistent", meaning that when data is added to the database, there is a propagation period before that data is available to all of the nodes in the cluster. A more famous database model that is also eventually consistent is DNS: zone records are updated higher up in the DNS tree, and then trickle down to relevant servers in an organized fashion. This used to take 72 hours or more in DNS, but these days takes closer to an hour. With Cassandra, it is more likely to take a few seconds. This means that your applications must be written with this consideration in mind,

Rather than using SQL, Cassandra uses a system of key/value pairs. This is not a new concept to most programmers, whether they refer to them as libraries, associative arrays or hashes. The concept should be immediately familiar to any Perl programmer, and possibly even more comfortable to anyone who has ever worked with JSON. One major difference is that each name/value pair is also timestamped. So a column, as it were, in Cassandra is comprised of a name/value/timestamp set. For example:

{
name: "email",
value: "test@test.com",
timestamp: 1259991135887
}

Cassandra also has what's called a SuperColumn, which is a grouping of columns, much like a hash of hashes in Perl. For example:

{
name: "person",
value: {
realname: { name: "realname", value: "Billy Bob Test", timestamp: 1259991135887 },
email: { name: "email", value: "test@test.com", timestamp: 1259991135887 },
ircnick: { name: "ircnick", value: "billybobtest", timestamp: 1259991135887 }
}
}

That's all the technical detail that I'm going to go into at the moment, largely because there's already so many great articles out there to get you started, but also because I'm new, and still know just enough to be dangerous (mostly to myself). But I am going to link to a few of those articles for you, if you're interested enough now to check them out.

Bearing in mind that I'm a Perl guy, here are the links that I've already sent out to a couple of friends in email, which may or may not cover your language of choice.

For information about Cassandra and some theories behind it, you'll want to take a look at these links:

http://incubator.apache.org/cassandra/
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
http://bryanpendleton.blogspot.com/2010/03/following-links-to-cassandra.html
http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/ (Ruby examples included)

When you're ready to install it and start playing with it, you'll want to read these, in roughly this order.

http://dustyreagan.com/installing-cassandra-on-ubuntu-linux/
http://wiki.apache.org/cassandra/CassandraCli
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
http://search.cpan.org/~lbrocard/Net-Cassandra-0.35/lib/Net/Cassandra.pm (For the Perl guys)

As I continue to explore and learn Cassandra, you may see an article here and there about it on my blog. I'm pretty excited about it, and while I see no reason to completely abandon SQL databases (they all have their uses, many of which Cassandra is likely not well-suited for), I think that I'm likely to use Cassandra as a major component on an upcoming project.


Friday, March 12, 2010

Interpreting Recipe Input

As some of you know, I've been working on recipe software off and on for a few years. It keeps finding its way to the backburner, mostly because other things have always taken priority, but also because doing it The Right Way (TM) is pretty intimidating.

A few weeks ago I started The Latest Attempt. I started simple, and had few eyeballs look at it. A couple of days ago, I went back to the beginning of my blog and started transcribing recipes into the simple interface that I had. I ended up finding several issues, most of which I don't think most of my testers ever encountered. I thought I'd lay them out here, and let them percolate in my brain.

I'm going to use an example that wasn't in my early archives, but that I've been playing with lately. The original is here. I blogged about a version of mine here. Hopefully Food Network won't mind if I reprint the original here, because I'm going to tweak it a little of the sake of demonstration.

6 squares unsweetened chocolate
3/4 cup unsalted butter
2 cups sugar
3 eggs
1 teaspoon pure vanilla extract
1 cup unbleached allpurpose flour
1 cup chopped nuts (optional)

Words like "pure" and "unbleached" sound pretty specific. But it starts with "6 squares unsweetened chocolate". What size is a "square"? The experienced baker will tell you that unsweetened chocolate is often measured in one-ounce squares. But it presents the first problem what I've encountered, and that at least one tester came across too:

Non-Standard Measurements

"two sticks butter". "one can olives". "one package spinach". These are all arbitrary sizes, that don't necessarily mean what you think they mean. Okay, so in America, butter comes in 4 oz sticks. That's something that we can rely upon. But one can of olives? Are we talking about the little 4 oz cans of sliced or chopped olives? Or are we talking about a 15 oz can of whole olives? How about the spinach? I've seen fresh spinach come in packages ranging from a few ounces to a couple of pounds, and who's to say we're not talking about frozen spinach? This actually leads into the next issue:

Inspecific Ingredients
What kind of butter? Salted or unsalted? A professional chef would never cook with salted butter. Joe Q. America, who knows? And I've already brought up the issues with olives and spinach. But the issue that I found here was actually in trying to categorize the food items, with minimal effort on the user's behalf. I've been using the USDA SR22 this time around, along with some auto-suggest AJAX code, to try and link up what the user has been looking for with something that the user can use for things like nutritional charts. The SQL looks something like this:

select NDB_No, Shrt_Desc from ABBREV where Shrt_Desc like '%butter%' limit 10;

Offhand, it seems reasonable that this would give us results like "unsalted butter", "salted butter", etc. But the SR22 isn't in an order that is condusive that that kind of thing. Instead, we get a result like this:

+--------+------------------------------------------------------------+
| NDB_No | Shrt_Desc |
+--------+------------------------------------------------------------+
| 42291 | PEANUT BUTTER,RED NA |
| 42307 | MARGARINE-LIKE,BUTTER-MARGARINE BLEND,80% FAT,STK,WO/ SALT |
| 42309 | MARGARINE-LIKE,VEG OIL-BUTTER SPRD,RED CAL,TUB,W/ SALT |
| 43214 | BUTTER REPLCMNT,WO/FAT,PDR |
| 11866 | SQUASH,WNTR,BUTTERNUT,CKD,BKD,W/SALT |
| 11867 | SQUASH,WNTR,BUTTERNUT,FRZ,CKD,BLD,W/SALT |
| 11372 | POTATOES,SCALLPD,HOME-PREPARED W/BUTTER |
| 11373 | POTATOES,AU GRATIN,HOME-PREPARED FROM RECIPE USING BUTTER |
| 11381 | POTATOES,MSHD,DEHYD,PREP FR GRNLS WO/MILK,WHL MILK&BUTTER |
| 11385 | POTATOES,AU GRATIN,DRY MIX,PREP W/H2O,WHL MILK&BUTTER |
+--------+------------------------------------------------------------+

Only one of those even remotely resembles what I'm looking for, and honestly, I don't want it. This problem is easily, if tediously solved: all I need to do is create a new database, of commonly used ingredients, and query it first, and then supplement it with the second database. Oog. Let's move onto the really tricky stuff.

Storing Measurements

Let's go back to our brownie recipe. One of the ingredients is 3/4 cup butter. Databases don't store fractions in any mathematically-usable format. Do we store it as a VARCHAR to maintain integrity with what the user entered, and then convert it later? Or do we store it as a decimal, and then store a flag saying whether it was entered as a fraction or a decimal? Or do we store it as a decimal, and assume that it will always be displayed to ther user as a fraction (which is usually what the user wants)? For the sake of argument, let's go with decimals as the storage mechanism, and not worry about anything else for now. In MySQL, we might have something that looks like this:

...SNIP...
amount DECIMAL(10,2),
unit VARCHAR(10),
...SNIP...

Measurement Ranges

Brownie recipe again. One of those ingredients, the nuts, is optional. It's technically a garnish, if an internal one. It's already subjective (walnuts? peanuts? pecans?), which means we can fudge a little on the amount too. Depending on how much you like your nut of choice, let's say you might opt for anywhere from 3/4 cup to 1 1/2 cups. This presents another problem with database management, at the very least. Maybe we can solve it by breaking the amount into two different fields?

...SNIP...
amount_min DECIMAL(10,2),
amount_max DECIMAL(10,2),
unit VARCHAR(10),
...SNIP...

Intermixed Measurement Units

Let's play with the sugar a little. A lot of people (like me) like to swap out some of the white sugar with brown sugar. Let's say that after careful testing and tweaking, I come up with the following measurements:

1 cup + 2 Tbsp white sugar
3/4 cup + 2 Tbsp brown sugar

Oh man. How do you store that? We could convert everything down to the lowest common denominator, Tbsp in this case, and then upscale when we display it back to the user. In the database, we would have something like:

18 Tbsp white sugar
14 Tbsp brown sugar

Of course, now we have to write code to scale this back to a reasonable measurement, since 18 Tbsp of anything is just weird, even just for behind-the-scenes storage. I think I would rather convert everything to a common unit, store it, and then convert it back. And I don't think any (non-metric) unit of measurement is more versatile than the ounce. Now we can store the sugar as:

9 oz white sugar
7 oz brown sugar

Of course, this leads us to the next problem:

Weight vs Volume

In the metric world, this isn't an issue, because everything is stored in either some version of grams or some version of litres. But in the Imperial system favored in America, an ounce could mean weight or it could mean volume. With some ingredients, this isn't a big deal. "A pint's a pound, the whole world round", right? Makes sense, since a pint is 16 ounces by volume and a pound is 16 ounces by weight. Well, not exactly (1 pint == 1.043 pounds), but close enough for the home cook.

In the above example, we know that we're referring to volume, if only because we know that we started with cups. But if I were looking at the recipe without any context, I personally would start off by thinking that it was a weight measurement. Some would assume it was volume. Even for the home cook, this is kind of significant, since a cup of sugar weighs a little over 7 ounces, not 8 ounces. In the aforementioned example, we could just store a flag that states whether this unit is by weight, volume, count, etc. If a user specified ounce, without any context, we'd have to store it as "unspecified", until it became important to the user (say, for nutritional data).

Of course, this is all backend stuff. Let's talk about another major problem:

Dealing With Users

If you can force the user to use your own forms, custom-tailored to suit your database, you can force them to do everything properly. And any UI designer worth his or her salt can tell you, when you start forcing users to what you want, you start losing users to your competitor. Your competitor's software may or may not do an inferior job, but if it makes your users feel better, that's what your users will use.

I've heard a lot of people complain about recipe software. From my limited experience, when a user gets tired of their recipe software (as most inevitably will), they will switch back to using a word processor, or possibly a spreadsheet. So it seems to me that the best way to get somebody to use your recipe software is to make it feel as much as possible like using a word processor. That means using a lot of fuzzy logic to convert what your users type into something that your software can use.

As you can see, writing recipe software The Right Way (TM) presents itself with a lot of issues. I haven't figured out how to handle most of them, and one of the biggest problems seems to be that some issues are dependent upon other issues. Well, I'll get it figured out.


Homemade Server Rack

I actually built about a year ago, and never bothered to post it. I mentioned it to a couple of people this week, and thought I would post it for them:



I had a couple of 2U servers laying around that I was planning on installing, but that's kind of a weird form factor for setting up around the home. Fortunately, we had some laminate boards laying around, left over from a water-damaged built-it-yourself stand-alone closet that was in the basement when we moved in. I tossed the water damaged parts, but held onto the rest of the boards, just for something like this.

I used 4 shelves, and built a box about the right form factor. I already had some casters (not shown in the photo, because they're underneath) so I attached them to the bottom to easily move the box around. The only parts that I ended up buying were the mounting racks for the servers. It's just not a part that I had laying around the house. It's also not a part you can find at Lowes, but it's easy to find online. I bought mine from Star Case. I figured that a 10U set should last me for a while.


Thursday, February 18, 2010

Primary Keys in MySQL

I was reading an AJAX tutorial the other day, and something that the author said caught my eye:

"All we really need is the title, but I always provide a primary key for any table that I create."

Why add it if you know for a fact that you don't need it? In this case, as with every case I've seen so far, the primary key in question was an auto_increment integer. I have long maintained that while this is necessary in most tables, it does not necessarily belong in every table. What you really need is a unique identifier. Without this, all you have is a jumble of data that really doesn't make sense to be stored in a database.

But that unique identifier doesn't always need to be a counter. Let's take a look at a couple of examples. Consider the following hypothetical user table:

CREATE TABLE users (
username VARCHAR(50) NOT NULL,
realname VARCHAR(50) NOT NULL,
password VARCHAR(50) NOT NULL,
PRIMARY KEY (username)
);

It makes sense for a username to be unique, because the username/password (or other token) combination will be required for access to this application. If you need to refer to this table from another table, you can just refer to the username, because it is unique and identifiable. I might note that MySQL itself does not use an auto_increment field for its own user table.

There are potential problems with this, however. When you change a username in this table, you need to change it in any tables that reference it as well. Additionally, you are taking up a little extra (if negligible) space in referring tables. An auto_increment uses an integer, which takes up less space, and will not change under normal circumstances. It should be noted that Unix-style operating systems use a UID to identify not only users, but file and process ownership. The username itself is rarely used by the system for anything other than making things more human-readable.

Let me show you a table structure that I've been working on this morning. I have a need to store recipes in a database, but because a recipe contains varaiable numbers of ingredients and directions, it doesn't make sense to try and store each in its own field in a single recipe table. Instead, I have broken out my structure into three separate tables:

CREATE TABLE recipes (
recipe_id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
source VARCHAR(255) NOT NULL,
preheat_qty INT(4),
preheat_unit ENUM('F','C'),
yield_qty INT(4),
yield_unit INT(11) NOT NULL,
PRIMARY KEY (recipe_id)
);

CREATE TABLE recipe_ingredients (
recipe_id INT(11) UNSIGNED NOT NULL,
rank INT(4),
name VARCHAR(255) NOT NULL,
qty INT(4) NOT NULL,
unit INT(11) NOT NULL,
PRIMARY KEY (recipe_id, rank)
);

CREATE TABLE recipe_directions (
recipe_id INT(11) UNSIGNED NOT NULL,
rank INT(4),
direction TEXT,
PRIMARY KEY (recipe_id, rank)
);

The recipe itself is referenced by a unique counter, which is usual. INT(11) contains far more unique identifiers as I expect to need for that table. But there will be multiple ingredients per recipe, and multiple directions per recipe. Even if I only store recipes with no more than 4 ingredients and 4 steps (which is unlikely), I need 4 times as many unique identifiers per table.

I already have to store the recipe_id, to keep from orphaning the data. It's important to store what order each step is in, because if they got out of order, the recipe would quickly becaome confusing. It's nice to store the order of ingredients in the order in which they'll be used too, and many people write recipes with this in mind. I've called this field "rank". Since I already have those two fields, and they already uniquely identify the rows, why not officially make them primary keys together? MySQL allows it, and I'm going to make use of it.

I propose that whenever you create a table, you take a moment to consider whether or not you actually need a counter. Most of the time you will, but not always. Get out of the habit of doing things because that's the way you've always done them, and get into the habit of doing things because you've thought about them and have made an informed decision specific to the situation at hand.


Friday, February 12, 2010

Buzz off social networks!

Buzz off social networks!

Posted using ShareThis

Dedicated to eightyeight.


Thursday, February 11, 2010

Google Reader: Disable Sharing

Don't get me wrong, I like Google Reader. Every other RSS reader that I've tried to use is comprised of at least 90% suck, whereas Reader only seems to be about 10% suck. Most of that suck has to do with its social networking "features".

I was surprised yesterday to find a follow invite from my brother, in Google Reader. This is because a) he hates social networking at least twice as much as I do, and b) I was entirely certain (and later confirmed) that he's never opened Google Reader. As I poked around, I discovered that I was somehow following 27 people. Four of them had apparently accepted an invite from me, and the rest had been sent invites that had not yet been responded to.

Let me be clear: I have never sent a follow request in Google Reader. Google automatically sending invites to people on my behalf feel a little like when I hear people say things like, "Oh sure, Joseph would love to come to your <INSERT EVENT HERE>." Would it kill you to ask me first? One of the people who had been sent an invite was a person who had sent me a business proposition, which I respectfully declined. That was to be the end of my contact with her, but Google was so kind as to send her a Google Reader invite for me. Gee, thanks.

Those who have tried to unfollow have probably found yourself frustrated with the fact that there is no interface inside of Reader to unfollow a person, or to turn off sharing altogether. Worry not, I figured it out.

  • From inside Google Reader, click the "Settings" drop-down in the top-right corner.

  • Select "Google Account Settings" (not Reader settings). This will open a page where you can edit your profile.

  • Click on "Edit Profile". There's nothing useful on this page, so click "Save" or "Cancel" at the bottom (intuitive, isn't it?).

  • The next page has a link that says, "<YOUR NAME> has X followers" and another link that says, "<YOUR NAME> is following X". You can't stop people from following you, but you can click that second link and stop following people. These are the ones that Google sent invites to without your permission.


Of course, it would be nice for Google to give you the option to disable sharing altogether, so that you could just use Reader as a regular RSS reader. One day I'll find one of those that doesn't suck.


Wednesday, February 10, 2010

Google Buzz

People are probably sick of me complaining about stuff like this, so I'll keep it short.

Google Buzz showed up in my Gmail account this morning. I tried it a little bit, read some feeds, gave it a shot. At the moment I find it more annoying than Twitter, so I've turned it off. You can too!

Clicky!

Maybe I'll give it another shot later. For now I've shoved it back into the depths from whence it came. It can share rent with its other dejected roommate, Google Wave.


Saturday, January 30, 2010

My Drive Array

A couple of years ago I got ahold of an old ATX computer that I intended to use as a file server. Unfortunately, there were a few problems with it. The biggest problem was that the drive cage for the smaller drives was missing. Smaller problems like an underpowered power supply and limited onboard IDE adapters were fixed with things like a new 600W power supply, and extra IDE expansion cards. As it turns out, Linux had no problem with two onboard adapters and two more cards (two adapters per card). With IDE's master/slave setup, that brought me up to two DVD burners (/dev/scd0 to /dev/scd1) and six hard drives (/dev/hda to /dev/hdf). But the drive cage, that was a problem.

Fortunately, like many American bakers and pastry chefs, I spent a lot of time at the hardware store. And believe it or not, the roofing section at Lowes carries a simple solution: roofing ties. Not TILES, but TIES (no "L"). Behold, the drive array, now connected to my Thinkpad (click to embiggen):



Yes, dear readers, the file server is dead. It seems to have developed memory issues in its old age, leading to its untimely demise. Not Alzheimers, but some form of dementia. A close-up on the array itself:



Sadly, my Thinkpad does not have an external IDE adapter of any kind. But USB to IDE adapters are relatively cheap and easy to find. I can't access every drive at once, but that's not a big deal at the moment.

Lowes has several different sizes of roofing ties, and several different styles. I used two different sizes, both completely flat, but with rows of holes exactly the same width as a standard 3.5" internal drive. One is five holes high and one is three holes high.



If you're going to do this, you'll be buying them in sets of two each. And while you may be tempted to stack three drives in one 3-high roofing tie set, fight the urge! You need airflow between the drives, or they will overheat. I speak from experience. Limit yourself to three drives for the 5-high ties, and only use the 3-high ties for connecting sets of 5-high ties. Look back at photo #2 to see what I mean.

Speaking of heat issues, it's not a bad idea to point a fan at these if you're going to have them all on at once. It's not a big deal with one or two USB-connected drives, but with all six drives that I have in my array, I always had a fan going.


Thursday, January 28, 2010

Importing the USDA SR22 into MySQL

Some of you who were interested in my post on importing SR21 may have been waiting for this one. I know it's been a few months since SR22 came out, but I didn't have a need to import it until just now. There are changes to this new version, but they are minimal, and you may have already patched the code yourself.

Specifically, two new fields were added to the ABBREVS table: Vit_D_mcg and Vit_D_IU. This brings the total column count in that table from 51 to 53. That number is the only change in the Perl file, and those two columns were the only additions to the SQL file. With those in place, I was able to import the new database without a problem. For the lazy and/or efficient, here are the new versions of the files:

import_sr22.pl
sr22.sql

Based on the reponse to my last post, I expect more troubleshooting questions on this post. For those who know what you're doing, you can stop reading now. Everyone else, check here before asking.

  • You need to have at least Perl 5.8.6 installed.

  • You need to have the Perl DBI installed, and DBD::mysql.

    • In RHEL/CentOS/Fedora, these packages should be called perl-DBI and perl-DBD-MySQL.

    • In Ubuntu/Debian, these packages should be called libdbi-perl and libdbd-mysql-perl.

  • When you download the files, make sure you save the Perl script as import_sr22.pl, not import_sr22.txt.

  • This script assumes you've downloaded the abbreviated file. It is a separate download from the full version, so make sure you don't miss it.


I think that covers all the questions I was asked previously. For those who are interested, The Eloquent Geek posted a non-Perl way of doing this on the last post. I haven't tried it myself, but for your reference:

For those who do not want to use the perl here is how you import the data from the command line client:
load data infile '/file_path/TABLE_NAME.txt' into table TABLE_NAME fields terminated by '^' optionally enclosed by '~' ;

Just substitute the table name per table.


Wednesday, January 27, 2010

iPad

Well, Apple's new tablet was finally announced today. Aside from a slew of jokes concerning the name, it would seem that the majority of the reviews can be summed up in one word:

"Seriously?"

Not all reviews were negative, of course. One that I read can be summed up in a few more words: "Look! It's a giant iPhone! Isn't that so cool?" Let me put this into perspective. A cute little baby, at the size that you would expect a cute little baby to be, is cute. The same baby, but the size of a house, is no longer cute. It is scary.

I'm not going to bother linking to any reviews of Apple's attempt to make Google look good in comparison, just like I'm not going to bother buying one when it comes out. That's all on you.

Okay, tell you what. Let me Google that for you.