Saturday, March 13, 2010

Getting Started with Cassandra

Twelve days ago, I read an article by Matt Asay which briefly mentioned Cassandra, Facebook's NoSQL offering. I had heard of it before, but hadn't really looked into it. For some reason, Matt's article prompted me to take a closer look. Within a couple of hours, I was evangelising it to a few friends. Matt's article pointed out that Facebook, Digg and Twitter had all started using it, and as I researched it, it seemed that Digg's and Twitter's migrations to it had taken only a few days. Last night, one of the people I had been hyping it to sent me a post from Reddit, published only 11 days after Matt's article, describing how they had just finished a 10-day migration to Cassandra.

What is Cassandra?

In order to understand Cassandra, it would first make sense to talk about exactly what this NoSQL movement is.

What is NoSQL?

NoSQL is a nickname that arose sometime last year to describe a series of increasingly popular database management systems (DBMS) that do not use SQL as their interface, even though SQL has been the norm in database servers for at least a couple of decades. The SQL language itself is hardly the issue here, though. If you were to migrate from MySQL to PostgreSQL, chances are you would still have to update several of your queries in order to be compatible with the new DBMS. It's almost like changing languages anyway, except that it's more like the differences between Canadian French and Haitian Creole: both French, but not pure French, and there are enough differences to matter.

Switching from SQL to NoSQL is a little more like switching from English to Japanese. Both are languages which accomplish the same goal (interpersonal communication), but they take very different approaches. They have different keywords and different grammars, and some would argue that Japanese is a much more precise and efficient language. One might even bring scalability into the discussion, as both languages have had the opportunity to grow. One might argue that English has done so sloppily, borrowing from odd places, whereas Japanese has done so with a little less mess. For instance, it added an entire syllabic infrastructure called katakana in order to handle foreign words, among other things.

Both SQL and NoSQL servers are DBMSs. They both hold data admirably, but whereas most SQL servers were originally built before the idea of database clusters was common, NoSQL servers were introduced around the time that database clusters were becoming a necessity in many infrastructures. This let the designers of NoSQL servers account for clustering during the design stages, rather than having to patch it in later. Some of the names that you will see in the NoSQL world are BigTable, Dynamo, HBase, Hadoop, CouchDB, and Cassandra.

So, What is Cassandra?

Cassandra is a NoSQL DBMS originally developed at Facebook. It was open sourced in 2008, and accepted into the Apache Incubator in 2009. It is fault-tolerant, decentralized, and "eventually consistent", meaning that when data is added to the database, there is a propagation period before that data is available to all of the nodes in the cluster. A more famous database model that is also eventually consistent is DNS: zone records are updated higher up in the DNS tree, and then trickle down to relevant servers in an organized fashion. This used to take 72 hours or more in DNS, but these days takes closer to an hour. With Cassandra, it is more likely to take a few seconds. This means that your applications must be written with this consideration in mind, since a read that arrives right after a write may still return the old data.
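To make "eventually consistent" a little more concrete, here is a toy simulation in Python. It is only a sketch of the idea and says nothing about how Cassandra actually moves data around; ToyCluster and its write, read and propagate methods are names I made up for illustration.

```python
class ToyCluster:
    """A pretend cluster: each node is just a dict (hypothetical, not Cassandra)."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]
        self.pending = []  # writes that have not reached the other nodes yet

    def write(self, node, key, value):
        # The write lands on one node first and is queued for the rest.
        self.nodes[node][key] = value
        self.pending.append((key, value))

    def read(self, node, key):
        # A node can only answer with what it has seen so far.
        return self.nodes[node].get(key)

    def propagate(self):
        # Stand-in for the propagation period: copy queued writes everywhere.
        for key, value in self.pending:
            for store in self.nodes:
                store[key] = value
        self.pending.clear()


cluster = ToyCluster(3)
cluster.write(0, "email", "test@test.com")
before = cluster.read(2, "email")  # node 2 has not seen the write yet
cluster.propagate()
after = cluster.read(2, "email")   # now every node agrees
print(before, after)
```

The point for application code is the gap between before and after: anything that reads right after a write has to cope with getting the older answer.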

Rather than using SQL, Cassandra uses a system of key/value pairs. This is not a new concept to most programmers, whether they refer to them as dictionaries, associative arrays or hashes. The concept should be immediately familiar to any Perl programmer, and possibly even more comfortable to anyone who has ever worked with JSON. One major difference is that each name/value pair is also timestamped. So a column, as it were, in Cassandra consists of a name/value/timestamp set. For example:

{
name: "email",
value: "test@test.com",
timestamp: 1259991135887
}
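For a rough idea of why the timestamp matters, here is a minimal sketch in Python. It assumes that when two versions of the same column meet, the one with the newer timestamp wins, which is how Cassandra reconciles conflicting writes. Column and resolve are names I invented for this example; they are not part of any Cassandra client library.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """A name/value/timestamp set, mirroring the example above."""
    name: str
    value: str
    timestamp: int  # same units as the example above


def resolve(a: Column, b: Column) -> Column:
    """Return the newer of two versions of one column (hypothetical helper)."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    # On a tie this sketch simply keeps the first argument.
    return a if a.timestamp >= b.timestamp else b


old = Column("email", "test@test.com", 1259991135887)
new = Column("email", "billy@test.com", 1259991139999)  # made-up later write
winner = resolve(old, new)
print(winner.value)
```

The order of the arguments does not matter here; whichever version carries the larger timestamp is the one that survives.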

Cassandra also has what's called a SuperColumn, which is a grouping of columns, much like a hash of hashes in Perl. For example:

{
name: "person",
value: {
realname: { name: "realname", value: "Billy Bob Test", timestamp: 1259991135887 },
email: { name: "email", value: "test@test.com", timestamp: 1259991135887 },
ircnick: { name: "ircnick", value: "billybobtest", timestamp: 1259991135887 }
}
}
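If the hash-of-hashes comparison helps, the same shape can be sketched with plain Python dictionaries. This is only an in-memory illustration of the structure above, not a call into Cassandra; the helper names column, super_column and plain_values are mine.

```python
TS = 1259991135887  # the timestamp used in the examples above


def column(name, value, timestamp=TS):
    """Build one name/value/timestamp column as a dict."""
    return {"name": name, "value": value, "timestamp": timestamp}


def super_column(name, columns):
    """Group columns under one name, keyed by each column's own name."""
    return {"name": name, "value": {c["name"]: c for c in columns}}


def plain_values(sc):
    """Strip the timestamps and return a simple name -> value mapping."""
    return {name: col["value"] for name, col in sc["value"].items()}


person = super_column("person", [
    column("realname", "Billy Bob Test"),
    column("email", "test@test.com"),
    column("ircnick", "billybobtest"),
])

# Reaching one value means going through two levels of keys.
print(person["value"]["email"]["value"])
print(plain_values(person))
```

Note that every inner column keeps its own timestamp, so the columns inside a SuperColumn can be updated independently of one another.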

That's all the technical detail that I'm going to go into at the moment, largely because there are already so many great articles out there to get you started, but also because I'm new, and still know just enough to be dangerous (mostly to myself). But I am going to link to a few of those articles for you, if you're interested enough now to check them out.

Bearing in mind that I'm a Perl guy, here are the links that I've already sent out to a couple of friends in email, which may or may not cover your language of choice.

For information about Cassandra and some theories behind it, you'll want to take a look at these links:

http://incubator.apache.org/cassandra/
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
http://bryanpendleton.blogspot.com/2010/03/following-links-to-cassandra.html
http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/ (Ruby examples included)

When you're ready to install it and start playing with it, you'll want to read these, in roughly this order.

http://dustyreagan.com/installing-cassandra-on-ubuntu-linux/
http://wiki.apache.org/cassandra/CassandraCli
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
http://search.cpan.org/~lbrocard/Net-Cassandra-0.35/lib/Net/Cassandra.pm (For the Perl guys)

As I continue to explore and learn Cassandra, you may see an article here and there about it on my blog. I'm pretty excited about it, and while I see no reason to completely abandon SQL databases (they all have their uses, many of which Cassandra is likely not well-suited for), I think that I'm likely to use Cassandra as a major component on an upcoming project.
