Saturday, March 13, 2010

Getting Started with Cassandra

12 days ago, I read an article by Matt Asay which briefly mentioned Cassandra, Facebook's NoSQL offering. I had heard of it before, but hadn't really looked into it. For some reason, Matt's article caused me to look into it again. Within a couple of hours, I was evangelising it to a few friends. Matt's article pointed out that Facebook, Digg and Twitter had all started using it, and as I researched it, it seemed that Digg's and Twitter's migrations to it had taken only a few days. Last night, one of the people that I had been hyping it to sent me a thing from Reddit, posted only 11 days after Matt's article, talking about how they had just finished a 10-day migration to Cassandra.

What is Cassadra?

In order to understand Cassandra, it would first make sense to talk about exactly what this NoSQL movement is.

What is NoSQL?

NoSQL is a nickname that arose sometime last year to describe a series of increeasingly popular database management systems (DBMS) that do not use SQL as an interface, as has been common in database servers for at least a couple of decades. Indeed, SQL is hardly the issue here at all. If you were to migrate from MySQL to PostgreSQL, chances are you would still have to update several of your queries in order to be compatible with the new DBMS. It's almost like changing languages anyway, except that it's more like the differences between Canadian French and Hatian Creole: both French, but not pure French, and there are enough differences to matter.

Switching from SQL to NoSQL is a little more like switching from English to Japanese. Both are languages which accomplish the same goal (interperson communication), but both take very different approaches. They have different keywords, different grammars, and some would argue that Japanese is a much more precise and efficient language. One might even bring scalability into the discussion, as both languages have had the opportunity to grow. One might argue that English has done so sloppily, borrowing from odd places, whereas Japanese has done so with a little less mess, for instance, adding an entire syllabic infrastructure called katakana in order to handle foreign words, among other things.

Both SQL and NoSQL are DBMSs. They both hold data admirably, but whereas most SQL servers were originally built before the idea of database clusters was common, NoSQL servers were introduced around the time that database clusters were becoming a necessity in many infrastructures. This provided NoSQL servers with the ability to consider this concern during the design stages, rather than having to patch it in later. Some of the names that you will see in the NoSQL world are BigTable, Dynamo, HBase, Hadoop, CouchDB, and Cassandra.

So, What is Cassandra?

Cassandra is a NoSQL DBMS written by Facebook. It was open sourced in 2008, and added to the Apache Project in 2009. It is fault-tolerant, decentralized, and "eventually consistent", meaning that when data is added to the database, there is a propagation period before that data is available to all of the nodes in the cluster. A more famous database model that is also eventually consistent is DNS: zone records are updated higher up in the DNS tree, and then trickle down to relevant servers in an organized fashion. This used to take 72 hours or more in DNS, but these days takes closer to an hour. With Cassandra, it is more likely to take a few seconds. This means that your applications must be written with this consideration in mind,

Rather than using SQL, Cassandra uses a system of key/value pairs. This is not a new concept to most programmers, whether they refer to them as libraries, associative arrays or hashes. The concept should be immediately familiar to any Perl programmer, and possibly even more comfortable to anyone who has ever worked with JSON. One major difference is that each name/value pair is also timestamped. So a column, as it were, in Cassandra is comprised of a name/value/timestamp set. For example:

{
name: "email",
value: "test@test.com",
timestamp: 1259991135887
}

Cassandra also has what's called a SuperColumn, which is a grouping of columns, much like a hash of hashes in Perl. For example:

{
name: "person",
value: {
realname: { name: "realname", value: "Billy Bob Test", timestamp: 1259991135887 },
email: { name: "email", value: "test@test.com", timestamp: 1259991135887 },
ircnick: { name: "ircnick", value: "billybobtest", timestamp: 1259991135887 }
}
}

That's all the technical detail that I'm going to go into at the moment, largely because there's already so many great articles out there to get you started, but also because I'm new, and still know just enough to be dangerous (mostly to myself). But I am going to link to a few of those articles for you, if you're interested enough now to check them out.

Bearing in mind that I'm a Perl guy, here are the links that I've already sent out to a couple of friends in email, which may or may not cover your language of choice.

For information about Cassandra and some theories behind it, you'll want to take a look at these links:

http://incubator.apache.org/cassandra/
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
http://bryanpendleton.blogspot.com/2010/03/following-links-to-cassandra.html
http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/ (Ruby examples included)

When you're ready to install it and start playing with it, you'll want to read these, in roughly this order.

http://dustyreagan.com/installing-cassandra-on-ubuntu-linux/
http://wiki.apache.org/cassandra/CassandraCli
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
http://search.cpan.org/~lbrocard/Net-Cassandra-0.35/lib/Net/Cassandra.pm (For the Perl guys)

As I continue to explore and learn Cassandra, you may see an article here and there about it on my blog. I'm pretty excited about it, and while I see no reason to completely abandon SQL databases (they all have their uses, many of which Cassandra is likely not well-suited for), I think that I'm likely to use Cassandra as a major component on an upcoming project.

Friday, March 12, 2010

Interpreting Recipe Input

As some of you know, I've been working on recipe software off and on for a few years. It keeps finding its way to the backburner, mostly because other things have always taken priority, but also because doing it The Right Way (TM) is pretty intimidating.

A few weeks ago I started The Latest Attempt. I started simple, and had few eyeballs look at it. A couple of days ago, I went back to the beginning of my blog and started transcribing recipes into the simple interface that I had. I ended up finding several issues, most of which I don't think most of my testers ever encountered. I thought I'd lay them out here, and let them percolate in my brain.

I'm going to use an example that wasn't in my early archives, but that I've been playing with lately. The original is here. I blogged about a version of mine here. Hopefully Food Network won't mind if I reprint the original here, because I'm going to tweak it a little of the sake of demonstration.

6 squares unsweetened chocolate
3/4 cup unsalted butter
2 cups sugar
3 eggs
1 teaspoon pure vanilla extract
1 cup unbleached allpurpose flour
1 cup chopped nuts (optional)

Words like "pure" and "unbleached" sound pretty specific. But it starts with "6 squares unsweetened chocolate". What size is a "square"? The experienced baker will tell you that unsweetened chocolate is often measured in one-ounce squares. But it presents the first problem what I've encountered, and that at least one tester came across too:

Non-Standard Measurements

"two sticks butter". "one can olives". "one package spinach". These are all arbitrary sizes, that don't necessarily mean what you think they mean. Okay, so in America, butter comes in 4 oz sticks. That's something that we can rely upon. But one can of olives? Are we talking about the little 4 oz cans of sliced or chopped olives? Or are we talking about a 15 oz can of whole olives? How about the spinach? I've seen fresh spinach come in packages ranging from a few ounces to a couple of pounds, and who's to say we're not talking about frozen spinach? This actually leads into the next issue:

Inspecific Ingredients
What kind of butter? Salted or unsalted? A professional chef would never cook with salted butter. Joe Q. America, who knows? And I've already brought up the issues with olives and spinach. But the issue that I found here was actually in trying to categorize the food items, with minimal effort on the user's behalf. I've been using the USDA SR22 this time around, along with some auto-suggest AJAX code, to try and link up what the user has been looking for with something that the user can use for things like nutritional charts. The SQL looks something like this:

select NDB_No, Shrt_Desc from ABBREV where Shrt_Desc like '%butter%' limit 10;

Offhand, it seems reasonable that this would give us results like "unsalted butter", "salted butter", etc. But the SR22 isn't in an order that is condusive that that kind of thing. Instead, we get a result like this:

+--------+------------------------------------------------------------+
| NDB_No | Shrt_Desc |
+--------+------------------------------------------------------------+
| 42291 | PEANUT BUTTER,RED NA |
| 42307 | MARGARINE-LIKE,BUTTER-MARGARINE BLEND,80% FAT,STK,WO/ SALT |
| 42309 | MARGARINE-LIKE,VEG OIL-BUTTER SPRD,RED CAL,TUB,W/ SALT |
| 43214 | BUTTER REPLCMNT,WO/FAT,PDR |
| 11866 | SQUASH,WNTR,BUTTERNUT,CKD,BKD,W/SALT |
| 11867 | SQUASH,WNTR,BUTTERNUT,FRZ,CKD,BLD,W/SALT |
| 11372 | POTATOES,SCALLPD,HOME-PREPARED W/BUTTER |
| 11373 | POTATOES,AU GRATIN,HOME-PREPARED FROM RECIPE USING BUTTER |
| 11381 | POTATOES,MSHD,DEHYD,PREP FR GRNLS WO/MILK,WHL MILK&BUTTER |
| 11385 | POTATOES,AU GRATIN,DRY MIX,PREP W/H2O,WHL MILK&BUTTER |
+--------+------------------------------------------------------------+

Only one of those even remotely resembles what I'm looking for, and honestly, I don't want it. This problem is easily, if tediously solved: all I need to do is create a new database, of commonly used ingredients, and query it first, and then supplement it with the second database. Oog. Let's move onto the really tricky stuff.

Storing Measurements

Let's go back to our brownie recipe. One of the ingredients is 3/4 cup butter. Databases don't store fractions in any mathematically-usable format. Do we store it as a VARCHAR to maintain integrity with what the user entered, and then convert it later? Or do we store it as a decimal, and then store a flag saying whether it was entered as a fraction or a decimal? Or do we store it as a decimal, and assume that it will always be displayed to ther user as a fraction (which is usually what the user wants)? For the sake of argument, let's go with decimals as the storage mechanism, and not worry about anything else for now. In MySQL, we might have something that looks like this:

...SNIP...
amount DECIMAL(10,2),
unit VARCHAR(10),
...SNIP...

Measurement Ranges

Brownie recipe again. One of those ingredients, the nuts, is optional. It's technically a garnish, if an internal one. It's already subjective (walnuts? peanuts? pecans?), which means we can fudge a little on the amount too. Depending on how much you like your nut of choice, let's say you might opt for anywhere from 3/4 cup to 1 1/2 cups. This presents another problem with database management, at the very least. Maybe we can solve it by breaking the amount into two different fields?

...SNIP...
amount_min DECIMAL(10,2),
amount_max DECIMAL(10,2),
unit VARCHAR(10),
...SNIP...

Intermixed Measurement Units

Let's play with the sugar a little. A lot of people (like me) like to swap out some of the white sugar with brown sugar. Let's say that after careful testing and tweaking, I come up with the following measurements:

1 cup + 2 Tbsp white sugar
3/4 cup + 2 Tbsp brown sugar

Oh man. How do you store that? We could convert everything down to the lowest common denominator, Tbsp in this case, and then upscale when we display it back to the user. In the database, we would have something like:

18 Tbsp white sugar
14 Tbsp brown sugar

Of course, now we have to write code to scale this back to a reasonable measurement, since 18 Tbsp of anything is just weird, even just for behind-the-scenes storage. I think I would rather convert everything to a common unit, store it, and then convert it back. And I don't think any (non-metric) unit of measurement is more versatile than the ounce. Now we can store the sugar as:

9 oz white sugar
7 oz brown sugar

Of course, this leads us to the next problem:

Weight vs Volume

In the metric world, this isn't an issue, because everything is stored in either some version of grams or some version of litres. But in the Imperial system favored in America, an ounce could mean weight or it could mean volume. With some ingredients, this isn't a big deal. "A pint's a pound, the whole world round", right? Makes sense, since a pint is 16 ounces by volume and a pound is 16 ounces by weight. Well, not exactly (1 pint == 1.043 pounds), but close enough for the home cook.

In the above example, we know that we're referring to volume, if only because we know that we started with cups. But if I were looking at the recipe without any context, I personally would start off by thinking that it was a weight measurement. Some would assume it was volume. Even for the home cook, this is kind of significant, since a cup of sugar weighs a little over 7 ounces, not 8 ounces. In the aforementioned example, we could just store a flag that states whether this unit is by weight, volume, count, etc. If a user specified ounce, without any context, we'd have to store it as "unspecified", until it became important to the user (say, for nutritional data).

Of course, this is all backend stuff. Let's talk about another major problem:

Dealing With Users

If you can force the user to use your own forms, custom-tailored to suit your database, you can force them to do everything properly. And any UI designer worth his or her salt can tell you, when you start forcing users to what you want, you start losing users to your competitor. Your competitor's software may or may not do an inferior job, but if it makes your users feel better, that's what your users will use.

I've heard a lot of people complain about recipe software. From my limited experience, when a user gets tired of their recipe software (as most inevitably will), they will switch back to using a word processor, or possibly a spreadsheet. So it seems to me that the best way to get somebody to use your recipe software is to make it feel as much as possible like using a word processor. That means using a lot of fuzzy logic to convert what your users type into something that your software can use.

As you can see, writing recipe software The Right Way (TM) presents itself with a lot of issues. I haven't figured out how to handle most of them, and one of the biggest problems seems to be that some issues are dependent upon other issues. Well, I'll get it figured out.

Homemade Server Rack

I actually built about a year ago, and never bothered to post it. I mentioned it to a couple of people this week, and thought I would post it for them:



I had a couple of 2U servers laying around that I was planning on installing, but that's kind of a weird form factor for setting up around the home. Fortunately, we had some laminate boards laying around, left over from a water-damaged built-it-yourself stand-alone closet that was in the basement when we moved in. I tossed the water damaged parts, but held onto the rest of the boards, just for something like this.

I used 4 shelves, and built a box about the right form factor. I already had some casters (not shown in the photo, because they're underneath) so I attached them to the bottom to easily move the box around. The only parts that I ended up buying were the mounting racks for the servers. It's just not a part that I had laying around the house. It's also not a part you can find at Lowes, but it's easy to find online. I bought mine from Star Case. I figured that a 10U set should last me for a while.