blog.josephhall.com

Thursday, May 21, 2015

The Garden's Coming Along

It's hard to see with a couple of them, but it looks like my seed potatoes have all officially sprouted.

This is a very ambitious garden. I have 10 garden boxes, 4 of which are 4'x5', and the others 4'x4'. With luck, I'll have far more produce in the fall than I know what to do with, and will be able to share my bounty with the neighbors.

I'll post more photos of the garden in a bit.

Monday, May 18, 2015

Speaking at YAPC::NA 2015

For those who are interested, I will be speaking at YAPC::NA (Yet Another Perl Conference, North America) next month. My talk is entitled A Series of Unfortunate Requests. If you know me, you know I'm a REST API junkie, so I'm pretty excited to talk about it at this conference.

Sunday, May 17, 2015

Finland!

I just got back from Helsinki, Finland!

I am now convinced that Helsinki is one of the great cities of the world. It's a beautiful place, with friendly people, and bonus for me as an American: almost everybody speaks English! In fact, in my two weeks there, the only person I spoke with who did not seem to know any English was a bus driver.

I was working onsite with a customer there, including 4 of the nicest guys I've ever met. We did some work with SaltStack + OpenStack, and I for one had a thoroughly enjoyable time on the project.

Because I was there for two weeks, I got to spend a weekend there. Saturday I went wandering around my part of the city, visiting the Kamppi Center and the Forum Mall across the street. I also went to the Kiasma modern art museum, and saw thoroughly fascinating exhibits on Ismo Kajander, and "Elements".

I loved the bus system while I was there, though I admittedly only used one route, to get to and from work. My host was kind enough to set me up in the Kamppi Scandic, which was in a pretty central location. I should note though that, outside of one cab driver who seemed to have vision problems ("is this a 4 or a 6?"), all of the drivers that I observed tended to drive very conservatively, defensively, and safely. This is in stark contrast to driving in Utah, where it often feels like every driver is out for blood.

If you ever get a chance to visit Finland, I highly recommend it. It's a beautiful country with beautiful, friendly people.

5 Years

It's been over 5 years since I've posted here, and it's time for that to change. I have a lot to say, and I find myself in need of a place to say it. So I've resurrected this blog, and intend to start posting on a regular basis again.

A lot has happened in the past 5 years, and it will affect what I post here. I love cooking as much as ever, and a large chunk of the content will still be about food, albeit more from an engineering point of view. I have also worked for SaltStack for over 2 1/2 years, and you can expect to see some posts here about Salt.

My old layout was incredibly incompatible with the current version of Blogger, so I'm using one of the default templates for now. Hopefully I'll have time to get something up soon that's more me.

Thursday, April 15, 2010

Using the find command

Today I was working on a problem with a coworker, and I saw him start to type the following:


find . | grep

I said, "no, I'm not going to let you do that. You need to do it the right way." (Yes, I do that sort of thing. Don't act like you didn't expect it.)

Don't get me wrong, there's nothing wrong with grep. It's a fine tool, and I encourage people to constantly try to become better with it. But it has its place, and this was not it. Using the find command properly in this case would result in less typing, and would spawn one fewer process. It may not be important for a one-line command, but in a larger script it might be more significant. Better to get in the right habit now, so that when you do find yourself working on that big script, you do it right the first time.

The find command is extremely powerful. Unlike locate, which uses a pre-built database of files and paths on your system, find searches your filesystem in real time, paying more attention to the individual files, and their properties. It may be slower than locate, but it's more accurate and far more flexible.

I see people use find mostly to search by filename, but it has plenty of other options. Let's start with filename and build from there. There are two relevant options:


-name
-iname

They are identical, except that -iname is case insensitive. Since files on a *nix system are traditionally all lowercase, you might want to save yourself a few processor cycles and just go with -name. If there's a chance that case may be an issue, use -iname instead.


find -name 'myfile.txt'

Using quotes is not strictly necessary with most filenames, but it's a good habit to get into. Keep in mind that by default, find searches by exact filenames. If you're not sure what the extension is, or you want to look anywhere in the filename, you can use globs:


find -iname '*myfile*'

The above commands will process files recursively, inside the current working directory. If you want to search a different directory, you need to specify it before any other options:


find /etc -name passwd

There are a few subtleties of find that you will encounter. They're not usually a big deal, but they can be annoying sometimes. For instance, find does not sort its results. If I expect a lot of results, I generally pipe it through the sort command. It also isn't very good at searching its own results, which is where grep can come in handy:


find / -name 'Net' | sort | grep -i perl

Now that you have the basics of find, let's explore some of the other options. Two that I use extensively are:


-ok
-exec

Again, these options are identical in purpose, but there is an important difference in how they behave. Both of them will execute a command on each file found, but -ok will ask for permission first (for each file) while -exec will just do it.


find /etc -name '*conf' -exec mv {} {}.orig \;

First off, everything between -exec (or -ok) and \; is the command that you want to run. Make sure you escape that semicolon at the end with a backslash, or you'll be sorry. The {} is a placeholder for the filename that was found by find. In this case, we're actually going to be performing a series of commands that looks like this:


mv /etc/httpd/conf/httpd.conf /etc/httpd/conf/httpd.conf.orig
mv /etc/httpd/conf.d/ssl.conf /etc/httpd/conf.d/ssl.conf.orig
mv /etc/httpd/conf.d/perl.conf /etc/httpd/conf.d/perl.conf.orig
...SNIP...

You're not limited to just searching filenames by glob. The find command does actually have support for regular expressions, using the following:


-regex
-iregex

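I don't think I need to tell you that -iregex is the case-insensitive version of -regex. If you already know how to use grep, this isn't much of a stretch, with one catch: the pattern has to match the entire path, not just a piece of the filename, which is why this example starts with .* to soak up the leading directories: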


find -regex ".*deskto."

The find command also supports boolean logic:


-and (or -a)
-or  (or -o)
-not

Let's combine these with a couple more options from the man page:


touch /tmp/jayceweb.tar
find / -user jayce -and -group apache -exec tar --remove-files -rf /tmp/jayceweb.tar {} \;
bzip2 /tmp/jayceweb.tar

This is the sort of command a person might run if they found a user on their system that they didn't trust, and wanted to quarantine all of their web files. First we make an empty tar file, then we add the suspicious files to it, removing them once they've been archived, and finally we compress the result. The archive has to stay uncompressed while find is running, because tar can't append (-r) to a compressed archive. It assumes that the user that owns the files is jayce, and the group that owns the files is apache. You could also make use of:


-uid
-gid
-nouser
-nogroup

That's probably enough of a primer to get you started. Now would be a good time to check the man page for some of the other myriad options that you can use to check by date stamp(s), file size, file type and even permissions. A little practice with this powerful command will save you time and energy, and increase your productivity like you won't believe.

Saturday, March 13, 2010

Getting Started with Cassandra

12 days ago, I read an article by Matt Asay which briefly mentioned Cassandra, Facebook's NoSQL offering. I had heard of it before, but hadn't really looked into it. For some reason, Matt's article caused me to look into it again. Within a couple of hours, I was evangelising it to a few friends. Matt's article pointed out that Facebook, Digg and Twitter had all started using it, and as I researched it, it seemed that Digg's and Twitter's migrations to it had taken only a few days. Last night, one of the people that I had been hyping it to sent me a thing from Reddit, posted only 11 days after Matt's article, talking about how they had just finished a 10-day migration to Cassandra.

What is Cassandra?

In order to understand Cassandra, it would first make sense to talk about exactly what this NoSQL movement is.

What is NoSQL?

NoSQL is a nickname that arose sometime last year to describe a series of increasingly popular database management systems (DBMS) that do not use SQL as an interface, as has been common in database servers for at least a couple of decades. Indeed, SQL is hardly the issue here at all. If you were to migrate from MySQL to PostgreSQL, chances are you would still have to update several of your queries in order to be compatible with the new DBMS. It's almost like changing languages anyway, except that it's more like the differences between Canadian French and Haitian Creole: both French, but not pure French, and there are enough differences to matter.

Switching from SQL to NoSQL is a little more like switching from English to Japanese. Both are languages which accomplish the same goal (interpersonal communication), but they take very different approaches. They have different keywords, different grammars, and some would argue that Japanese is a much more precise and efficient language. One might even bring scalability into the discussion, as both languages have had the opportunity to grow. One might argue that English has done so sloppily, borrowing from odd places, whereas Japanese has done so with a little less mess, for instance, adding an entire syllabic infrastructure called katakana in order to handle foreign words, among other things.

Both SQL and NoSQL are DBMSs. They both hold data admirably, but whereas most SQL servers were originally built before the idea of database clusters was common, NoSQL servers were introduced around the time that database clusters were becoming a necessity in many infrastructures. This provided NoSQL servers with the ability to consider this concern during the design stages, rather than having to patch it in later. Some of the names that you will see in the NoSQL world are BigTable, Dynamo, HBase, Hadoop, CouchDB, and Cassandra.

So, What is Cassandra?

Cassandra is a NoSQL DBMS written by Facebook. It was open sourced in 2008, and added to the Apache Project in 2009. It is fault-tolerant, decentralized, and "eventually consistent", meaning that when data is added to the database, there is a propagation period before that data is available to all of the nodes in the cluster. A more famous database model that is also eventually consistent is DNS: zone records are updated higher up in the DNS tree, and then trickle down to relevant servers in an organized fashion. This used to take 72 hours or more in DNS, but these days takes closer to an hour. With Cassandra, it is more likely to take a few seconds. This means that your applications must be written with this consideration in mind.

Rather than using SQL, Cassandra uses a system of key/value pairs. This is not a new concept to most programmers, whether they refer to them as dictionaries, associative arrays or hashes. The concept should be immediately familiar to any Perl programmer, and possibly even more comfortable to anyone who has ever worked with JSON. One major difference is that each name/value pair is also timestamped. So a column, as it were, in Cassandra consists of a name/value/timestamp set. For example:


{
    name: "email",
    value: "test@test.com",
    timestamp: 1259991135887
}

Cassandra also has what's called a SuperColumn, which is a grouping of columns, much like a hash of hashes in Perl. For example:


{
    name: "person",
    value: {
        realname: { name: "realname", value: "Billy Bob Test", timestamp: 1259991135887 },
        email: { name: "email", value: "test@test.com", timestamp: 1259991135887 },
        ircnick: { name: "ircnick", value: "billybobtest", timestamp: 1259991135887 }
    }
}
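
To make the shape of these structures a little more concrete, here is a toy in-memory model in Python. It is not Cassandra's API, just a sketch: the Column and SuperColumn classes and the put method are names I made up, the timestamp is assumed to be milliseconds since the epoch, and the "newest timestamp wins" rule is my understanding of why each column carries a timestamp in the first place.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """A name/value/timestamp triple, mirroring the example above."""
    name: str
    value: str
    timestamp: int  # assumed to be milliseconds since the epoch


@dataclass
class SuperColumn:
    """A named group of columns, much like a hash of hashes in Perl."""
    name: str
    columns: dict = field(default_factory=dict)

    def put(self, column: Column) -> None:
        # Keep whichever copy of the column carries the newest timestamp.
        current = self.columns.get(column.name)
        if current is None or column.timestamp > current.timestamp:
            self.columns[column.name] = column


person = SuperColumn("person")
person.put(Column("email", "test@test.com", 1259991135887))
person.put(Column("email", "old@test.com", 1259991100000))  # older write, ignored
person.put(Column("ircnick", "billybobtest", 1259991135887))
```

After these three writes, person.columns still holds test@test.com for the email column, because the second write carried an older timestamp.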

That's all the technical detail that I'm going to go into at the moment, largely because there are already so many great articles out there to get you started, but also because I'm new, and still know just enough to be dangerous (mostly to myself). But I am going to link to a few of those articles for you, if you're interested enough now to check them out.

Bearing in mind that I'm a Perl guy, here are the links that I've already sent out to a couple of friends in email, which may or may not cover your language of choice.

For information about Cassandra and some theories behind it, you'll want to take a look at these links:

http://incubator.apache.org/cassandra/
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
http://bryanpendleton.blogspot.com/2010/03/following-links-to-cassandra.html
http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/ (Ruby examples included)

When you're ready to install it and start playing with it, you'll want to read these, in roughly this order.

http://dustyreagan.com/installing-cassandra-on-ubuntu-linux/
http://wiki.apache.org/cassandra/CassandraCli
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model
http://search.cpan.org/~lbrocard/Net-Cassandra-0.35/lib/Net/Cassandra.pm (For the Perl guys)

As I continue to explore and learn Cassandra, you may see an article here and there about it on my blog. I'm pretty excited about it, and while I see no reason to completely abandon SQL databases (they all have their uses, many of which Cassandra is likely not well-suited for), I think that I'm likely to use Cassandra as a major component on an upcoming project.

Friday, March 12, 2010

Interpreting Recipe Input

As some of you know, I've been working on recipe software off and on for a few years. It keeps finding its way to the back burner, mostly because other things have always taken priority, but also because doing it The Right Way (TM) is pretty intimidating.

A few weeks ago I started The Latest Attempt. I started simple, and had a few eyeballs look at it. A couple of days ago, I went back to the beginning of my blog and started transcribing recipes into the simple interface that I had. I ended up finding several issues, most of which I don't think my testers ever encountered. I thought I'd lay them out here, and let them percolate in my brain.

I'm going to use an example that wasn't in my early archives, but that I've been playing with lately. The original is here. I blogged about a version of mine here. Hopefully Food Network won't mind if I reprint the original here, because I'm going to tweak it a little for the sake of demonstration.


6 squares unsweetened chocolate
3/4 cup unsalted butter
2 cups sugar
3 eggs
1 teaspoon pure vanilla extract
1 cup unbleached all-purpose flour
1 cup chopped nuts (optional)

Words like "pure" and "unbleached" sound pretty specific. But it starts with "6 squares unsweetened chocolate". What size is a "square"? The experienced baker will tell you that unsweetened chocolate is often measured in one-ounce squares. But it presents the first problem that I've encountered, and that at least one tester came across too:

Non-Standard Measurements

"two sticks butter". "one can olives". "one package spinach". These are all arbitrary sizes, that don't necessarily mean what you think they mean. Okay, so in America, butter comes in 4 oz sticks. That's something that we can rely upon. But one can of olives? Are we talking about the little 4 oz cans of sliced or chopped olives? Or are we talking about a 15 oz can of whole olives? How about the spinach? I've seen fresh spinach come in packages ranging from a few ounces to a couple of pounds, and who's to say we're not talking about frozen spinach? This actually leads into the next issue:

Nonspecific Ingredients

What kind of butter? Salted or unsalted? A professional chef would never cook with salted butter. Joe Q. America, who knows? And I've already brought up the issues with olives and spinach. But the issue that I found here was actually in trying to categorize the food items, with minimal effort on the user's part. I've been using the USDA SR22 this time around, along with some auto-suggest AJAX code, to try to link up what the user is looking for with something that can be used for things like nutritional charts. The SQL looks something like this:


select NDB_No, Shrt_Desc from ABBREV where Shrt_Desc like '%butter%' limit 10;

Offhand, it seems reasonable that this would give us results like "unsalted butter", "salted butter", etc. But the SR22 isn't in an order that is conducive to that kind of thing. Instead, we get a result like this:


+--------+------------------------------------------------------------+
| NDB_No | Shrt_Desc                                                  |
+--------+------------------------------------------------------------+
| 42291  | PEANUT BUTTER,RED NA                                       | 
| 42307  | MARGARINE-LIKE,BUTTER-MARGARINE BLEND,80% FAT,STK,WO/ SALT | 
| 42309  | MARGARINE-LIKE,VEG OIL-BUTTER SPRD,RED CAL,TUB,W/ SALT     | 
| 43214  | BUTTER REPLCMNT,WO/FAT,PDR                                 | 
| 11866  | SQUASH,WNTR,BUTTERNUT,CKD,BKD,W/SALT                       | 
| 11867  | SQUASH,WNTR,BUTTERNUT,FRZ,CKD,BLD,W/SALT                   | 
| 11372  | POTATOES,SCALLPD,HOME-PREPARED W/BUTTER                    | 
| 11373  | POTATOES,AU GRATIN,HOME-PREPARED FROM RECIPE USING BUTTER  | 
| 11381  | POTATOES,MSHD,DEHYD,PREP FR GRNLS WO/MILK,WHL MILK&BUTTER  | 
| 11385  | POTATOES,AU GRATIN,DRY MIX,PREP W/H2O,WHL MILK&BUTTER      | 
+--------+------------------------------------------------------------+

Only one of those even remotely resembles what I'm looking for, and honestly, I don't want it. This problem is easily, if tediously, solved: all I need to do is create a new database of commonly used ingredients, query it first, and then supplement it with results from the SR22. Oog. Let's move on to the really tricky stuff.
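
For what it's worth, here is roughly how I picture that two-step lookup, sketched in Python. To keep it self-contained it uses an in-memory SQLite database rather than MySQL, the COMMON_INGREDIENT table and the suggest function are hypothetical names, and the ABBREV table only holds two of the rows shown above.

```python
import sqlite3


def suggest(conn, term, limit=10):
    """Return (id, description) rows: curated ingredients first, then SR22."""
    pattern = f"%{term}%"
    curated = conn.execute(
        "SELECT id, name FROM COMMON_INGREDIENT WHERE name LIKE ? "
        "ORDER BY length(name) LIMIT ?",
        (pattern, limit),
    ).fetchall()
    remaining = limit - len(curated)
    usda = []
    if remaining > 0:
        # Only fall back to the big USDA table for whatever slots are left.
        usda = conn.execute(
            "SELECT NDB_No, Shrt_Desc FROM ABBREV WHERE Shrt_Desc LIKE ? "
            "ORDER BY NDB_No LIMIT ?",
            (pattern, remaining),
        ).fetchall()
    return curated + usda


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE COMMON_INGREDIENT (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ABBREV (NDB_No TEXT PRIMARY KEY, Shrt_Desc TEXT);
    INSERT INTO COMMON_INGREDIENT (name) VALUES ('unsalted butter'), ('salted butter');
    INSERT INTO ABBREV VALUES ('42291', 'PEANUT BUTTER,RED NA');
    INSERT INTO ABBREV VALUES ('11866', 'SQUASH,WNTR,BUTTERNUT,CKD,BKD,W/SALT');
""")
results = suggest(conn, "butter", limit=3)
```

Sorting the curated matches by length is just a cheap way to float the shortest, most generic names to the top; a real auto-suggest would probably want something smarter.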

Storing Measurements

Let's go back to our brownie recipe. One of the ingredients is 3/4 cup butter. Databases don't store fractions in any mathematically-usable format. Do we store it as a VARCHAR to maintain integrity with what the user entered, and then convert it later? Or do we store it as a decimal, and then store a flag saying whether it was entered as a fraction or a decimal? Or do we store it as a decimal, and assume that it will always be displayed to the user as a fraction (which is usually what the user wants)? For the sake of argument, let's go with decimals as the storage mechanism, and not worry about anything else for now. In MySQL, we might have something that looks like this:


...SNIP...
amount DECIMAL(10,2),
unit VARCHAR(10),
...SNIP...
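
Here is a rough Python sketch of the round trip that the decimal approach implies: parse what the user typed, round it to fit a DECIMAL(10,2) column, and snap it back to a kitchen-friendly fraction for display. The function names and the choice of 8 as the largest denominator are my own assumptions.

```python
from decimal import Decimal
from fractions import Fraction


def parse_amount(text):
    """Turn '3/4', '1 1/2', or '0.75' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def to_storage(amount):
    """Round to two decimal places, matching a DECIMAL(10,2) column."""
    exact = Decimal(amount.numerator) / Decimal(amount.denominator)
    return exact.quantize(Decimal("0.01"))


def to_display(stored, max_denominator=8):
    """Snap the stored decimal to the nearest fraction a cook would recognize."""
    amount = Fraction(stored).limit_denominator(max_denominator)
    whole, rest = divmod(amount, 1)
    if rest == 0:
        return str(whole)
    return f"{whole} {rest}" if whole else str(rest)
```

The snapping step matters for amounts like 1/3 cup, which gets stored as 0.33 and would otherwise come back as 33/100.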

Measurement Ranges

Brownie recipe again. One of those ingredients, the nuts, is optional. It's technically a garnish, if an internal one. It's already subjective (walnuts? peanuts? pecans?), which means we can fudge a little on the amount too. Depending on how much you like your nut of choice, let's say you might opt for anywhere from 3/4 cup to 1 1/2 cups. This presents another problem with database management, at the very least. Maybe we can solve it by breaking the amount into two different fields?


...SNIP...
amount_min DECIMAL(10,2),
amount_max DECIMAL(10,2),
unit VARCHAR(10),
...SNIP...
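
On the input side, filling those two fields could be as simple as the Python sketch below. It assumes the user writes ranges with "to" or a hyphen, and treats a plain amount as a range whose minimum and maximum are the same; parse_range is a hypothetical helper, not anything from my actual code.

```python
import re
from fractions import Fraction


def parse_amount(text):
    """Turn '3/4' or '1 1/2' into an exact Fraction."""
    return sum(Fraction(part) for part in text.split())


def parse_range(text):
    """Return (amount_min, amount_max); a single amount yields min == max."""
    parts = re.split(r"\s+to\s+|\s*-\s*", text.strip(), maxsplit=1)
    low = parse_amount(parts[0])
    high = parse_amount(parts[-1])
    if low > high:
        raise ValueError(f"range is backwards: {text!r}")
    return low, high
```

Storing a fixed amount as min == max means every query can treat the two columns the same way, without a special case for ingredients that have no range.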

Intermixed Measurement Units

Let's play with the sugar a little. A lot of people (like me) like to swap out some of the white sugar with brown sugar. Let's say that after careful testing and tweaking, I come up with the following measurements:


1 cup + 2 Tbsp white sugar
3/4 cup + 2 Tbsp brown sugar

Oh man. How do you store that? We could convert everything down to the lowest common denominator, Tbsp in this case, and then upscale when we display it back to the user. In the database, we would have something like:


18 Tbsp white sugar
14 Tbsp brown sugar

Of course, now we have to write code to scale this back to a reasonable measurement, since 18 Tbsp of anything is just weird, even just for behind-the-scenes storage. I think I would rather convert everything to a common unit, store it, and then convert it back. And I don't think any (non-metric) unit of measurement is more versatile than the ounce. Now we can store the sugar as:


9 oz white sugar
7 oz brown sugar
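
The convert, store, convert-back cycle might look something like this in Python. It is only a sketch: it covers three US volume units, the function names are made up, and the display steps (quarter cups, whole tablespoons, quarter teaspoons) are an assumption about what a cook would want to read.

```python
from fractions import Fraction

# US customary volume units, expressed in fluid ounces.
FL_OZ_PER_UNIT = {"cup": Fraction(8), "Tbsp": Fraction(1, 2), "tsp": Fraction(1, 6)}

# Display units from largest to smallest, with the smallest step allowed for each.
DISPLAY_STEPS = [("cup", Fraction(1, 4)), ("Tbsp", Fraction(1)), ("tsp", Fraction(1, 4))]


def format_amount(amount):
    """Render Fraction(3, 2) as '1 1/2' and Fraction(3, 4) as '3/4'."""
    whole, rest = divmod(amount, 1)
    if rest == 0:
        return str(whole)
    return f"{whole} {rest}" if whole else str(rest)


def to_fl_oz(text):
    """Convert '1 cup + 2 Tbsp' into a single fluid-ounce quantity."""
    total = Fraction(0)
    for term in text.split("+"):
        *amount_parts, unit = term.split()
        amount = sum(Fraction(part) for part in amount_parts)
        total += amount * FL_OZ_PER_UNIT[unit]
    return total


def from_fl_oz(fl_oz):
    """Break a fluid-ounce quantity back into the largest sensible units."""
    pieces = []
    remaining = Fraction(fl_oz)
    for unit, step in DISPLAY_STEPS:
        step_size = FL_OZ_PER_UNIT[unit] * step
        count = (remaining // step_size) * step
        if count:
            pieces.append(f"{format_amount(count)} {unit}")
            remaining -= count * FL_OZ_PER_UNIT[unit]
    # Anything smaller than a quarter teaspoon is silently dropped.
    return " + ".join(pieces)
```

With this, to_fl_oz("3/4 cup + 2 Tbsp") gives 7, and from_fl_oz(7) turns it back into "3/4 cup + 2 Tbsp".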

Of course, this leads us to the next problem:

Weight vs Volume

In the metric world, this isn't an issue, because everything is stored in either some version of grams or some version of litres. But in the Imperial system favored in America, an ounce could mean weight or it could mean volume. With some ingredients, this isn't a big deal. "A pint's a pound, the whole world round", right? Makes sense, since a pint is 16 ounces by volume and a pound is 16 ounces by weight. Well, not exactly (1 pint == 1.043 pounds), but close enough for the home cook.

In the above example, we know that we're referring to volume, if only because we know that we started with cups. But if I were looking at the recipe without any context, I personally would start off by thinking that it was a weight measurement. Some would assume it was volume. Even for the home cook, this is kind of significant, since a cup of sugar weighs a little over 7 ounces, not 8 ounces. In the aforementioned example, we could just store a flag that states whether this unit is by weight, volume, count, etc. If a user specified ounce, without any context, we'd have to store it as "unspecified", until it became important to the user (say, for nutritional data).
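
The flag idea could be sketched in Python like this. The UnitKind enum, the unit table, and both functions are hypothetical, and the 7 ounces per cup passed in the usage example is just the rough figure for sugar mentioned above, not a value from any nutritional database.

```python
from enum import Enum


class UnitKind(Enum):
    WEIGHT = "weight"
    VOLUME = "volume"
    COUNT = "count"
    UNSPECIFIED = "unspecified"


UNIT_KINDS = {
    "cup": UnitKind.VOLUME,
    "tbsp": UnitKind.VOLUME,
    "tsp": UnitKind.VOLUME,
    "fl oz": UnitKind.VOLUME,
    "lb": UnitKind.WEIGHT,
    "g": UnitKind.WEIGHT,
    "oz": UnitKind.UNSPECIFIED,  # could be either until the user says otherwise
}


def classify_unit(unit):
    """Guess what kind of measurement a unit string describes."""
    if not unit:
        return UnitKind.COUNT  # "3 eggs" has no unit at all
    return UNIT_KINDS.get(unit.strip().lower(), UnitKind.UNSPECIFIED)


def volume_to_weight_oz(fl_oz, weight_oz_per_cup):
    """Convert fluid ounces to ounces by weight, given an ingredient's density."""
    return fl_oz / 8 * weight_oz_per_cup


sugar_weight = volume_to_weight_oz(9, 7.0)  # the 9 fl oz of white sugar from earlier
```

For the white sugar above, that works out to a bit under 8 ounces by weight, which is exactly the sort of gap that makes an unlabeled "9 oz" ambiguous.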

Of course, this is all backend stuff. Let's talk about another major problem:

Dealing With Users

If you can force the user to use your own forms, custom-tailored to suit your database, you can force them to do everything properly. But as any UI designer worth his or her salt can tell you, when you start forcing users to do what you want, you start losing users to your competitor. Your competitor's software may or may not do an inferior job, but if it makes your users feel better, that's what your users will use.

I've heard a lot of people complain about recipe software. From my limited experience, when a user gets tired of their recipe software (as most inevitably will), they will switch back to using a word processor, or possibly a spreadsheet. So it seems to me that the best way to get somebody to use your recipe software is to make it feel as much as possible like using a word processor. That means using a lot of fuzzy logic to convert what your users type into something that your software can use.
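
As a first stab at that kind of fuzzy logic, here is a small Python sketch that takes a line typed the way it would appear in a word processor and pulls out an amount, a unit, an ingredient, and a trailing note. The unit list is deliberately tiny, and the regular expression and parse_line function are illustrative guesses rather than the code I'm actually running.

```python
import re
from fractions import Fraction

# A small, deliberately incomplete unit vocabulary (an assumption for this sketch).
KNOWN_UNITS = {
    "cup", "cups", "teaspoon", "teaspoons", "tsp", "tablespoon", "tablespoons",
    "tbsp", "ounce", "ounces", "oz", "square", "squares", "stick", "sticks",
}

LINE_RE = re.compile(
    r"""^\s*
        (?P<amount>\d+\s+\d+/\d+ | \d+/\d+ | \d+(?:\.\d+)?)?  # 1 1/2, 3/4, 2, 1.5
        \s*
        (?P<rest>.*?)                                          # unit + ingredient
        \s*
        (?:\((?P<note>[^)]*)\))?                               # trailing (note)
        \s*$""",
    re.VERBOSE,
)


def parse_line(line):
    """Split one free-text ingredient line into its parts."""
    match = LINE_RE.match(line)
    amount_text = match.group("amount")
    amount = sum(Fraction(p) for p in amount_text.split()) if amount_text else None
    words = match.group("rest").split(None, 1)
    unit = None
    if words and words[0].lower().rstrip(".") in KNOWN_UNITS:
        unit = words.pop(0)  # first word is a unit we recognize
    return {
        "amount": amount,
        "unit": unit,
        "ingredient": " ".join(words),
        "note": match.group("note"),
    }
```

Run against the brownie recipe, parse_line("1 cup chopped nuts (optional)") yields an amount of 1, a unit of "cup", an ingredient of "chopped nuts", and a note of "optional", while "3 eggs" comes back with no unit at all.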

As you can see, writing recipe software The Right Way (TM) presents a lot of issues. I haven't figured out how to handle most of them, and one of the biggest problems seems to be that some issues are dependent upon other issues. Well, I'll get it figured out.