Sunday, August 17, 2008

nzbsize Revisited: slurp

My nzbsize program was nice and all, at least for small nzb files. Then I downloaded one that was big. It was real big. It indexed a 64gb Usenet post. I didn't realize when I downloaded it that it was that big. When I tried to see how big it was, nzbsize took forever. I glanced at the source code and knew exactly what was wrong right away. No, I'm not super-smart. It just isn't a long program.

The culprit is something that I think most Perl programmers take for granted. Let's say we want to work with a file. Perl's file operation commands are quick. But since Perl was originally designed as a reporting language (so they say), it normally operates on only one line at a time. This makes sense. We pull in a line of data, we parse it, add it to the report, and continue with the next line.
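
That bread-and-butter, one-record-per-pass pattern looks something like this (a generic sketch; the report file name is made up):

open INPUT, "<report.txt";
while ( my $line = <INPUT> ) {
    chomp $line;
    # parse the line and add it to the running totals here
}
close INPUT;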

But sometimes we want to work with multiple lines at once. Parsing HTML is a great example of this, since tags can easily span multiple lines. I think most Perl programmers, when faced with this situation, prefer to slurp in the whole file at once. I suspect a lot of Perl programmers have had a subroutine sitting in their arsenal for years that just pulls in an entire file at one time. I called mine readfile. It looked kind of like this:

sub readfile {
    my ( $filename ) = @_;
    open INPUT, "<$filename";
    my @contents = <INPUT>;
    close INPUT;
    return join '', @contents;
}

This became so common that it finally got a ready-made answer: Perl 6's design includes a built-in called slurp, and CPAN modules like File::Slurp (and Perl6::Slurp) bring the same idea to Perl 5. But Perl 6 isn't exactly here yet, and an extra module isn't something I can count on having installed everywhere.
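
For what it's worth, the module version looks something like this with File::Slurp's read_file (just an illustration; it's not what my script ends up using):

use File::Slurp qw(read_file);

my $whole_file = read_file( 'Lots o Data.nzb' );   # one big string in scalar context
my @lines      = read_file( 'Lots o Data.nzb' );   # or a list of lines in list context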

So I just renamed my own sub to slurp. At one point I showed a piece of my code to Tene, who gave me an optimized version of the slurp sub:

sub slurp {
    my ( $file_name ) = @_;
    local $/;
    open INPUT, "<$file_name";
    return <INPUT>;
}

Perl programmers love optimizations. Tene told me once that Perl was about making life easier for the programmer, not necessarily the compiler, but I can't help but notice a lot of Perl programmers doing a lot of optimizations designed to make the compiler do less work. Most Perl programmers that I know thrive on this. Then again, I suspect that Perl programmers eventually start to think the same way as the compiler too (a side effect of the language itself, maybe?), and that makes for tight code.

When it comes down to it, my code still processed the file a line at a time. Sure, it shoved the whole thing into an array in one line of code, and then returned it joined into a string in another, but it was still doing a list operation, twice. Tene's optimization undefines $/, the input record separator, so Perl ignores line breaks and the whole file comes back as a single "line", which is really just one scalar. Nice.
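
If you want to see the difference for yourself, the core Benchmark module can compare the two approaches. Here's a rough sketch (the sub names and test file are just stand-ins for the comparison):

#!/usr/bin/perl

use strict;
use Benchmark qw(cmpthese);

my $test_file = 'Lots o Data.nzb';    # stand-in for whatever file you want to test

sub slurp_join {                      # the old readfile way: list of lines, then join
    my ( $file_name ) = @_;
    open INPUT, "<$file_name";
    my @contents = <INPUT>;
    close INPUT;
    return join '', @contents;
}

sub slurp_local {                     # Tene's way: undef $/ and read once
    my ( $file_name ) = @_;
    local $/;
    open INPUT, "<$file_name";
    my $contents = <INPUT>;
    close INPUT;
    return $contents;
}

cmpthese( -3, {
    join_lines  => sub { my $data = slurp_join( $test_file ) },
    local_slash => sub { my $data = slurp_local( $test_file ) },
} );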

Now, think about my original nzbsize script. It slurped in a file, then it went through that file line by line looking for a byte value, which it then added to a grand total. Even using Tene's superior slurp sub, I was still processing the entire file a second time. In fact, since I was using a while statement and removing each byte count one by one, it had to reprocess the entire file again for each byte count that it encountered. Even on a smallish nzb file, that's going to be a lot of work for the compiler. On a file that indexes 64gb worth of data, well, on my little notebook it took somewhere around half an hour. That's just not good code.
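
The guts of the old version boiled down to something like this, using the slurp sub from above (a sketch of the idea from memory, not the exact original):

my $file  = slurp( $ARGV[0] );    # the whole nzb in one scalar
my $bytes = 0;

# every successful substitution rescans and rewrites the one big string
while ( $file =~ s{bytes="(.*?)"}{} ) {
    $bytes += $1;
}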

So I went back to the way I should have done it in the first place. When it comes down to it, I'm running a report. And that report needs to be analyzed one line at a time. The new nzbsize program looks like this:

#!/usr/bin/perl

use strict;
use Number::Bytes::Human qw(format_bytes);

my $file_name = $ARGV[0];
my $bytes;

open INPUT, "<$file_name";
while ( my $line = <INPUT> ) {
    $bytes += $1 if $line =~ m{bytes="(.*?)"};
}
close INPUT;

my $hr = format_bytes( $bytes );

print "$ARGV[0]: $hr\n";
exit;

This new piece of code reports the size of the indexed download in under a second. That's one 18mb file indexing 64gb worth of data, and the processor didn't even flinch.

jhall@bourdain nzb$ time nzbsize Lots\ o\ Data.nzb
Lots o Data.nzb: 64G

real 0m0.564s
user 0m0.440s
sys 0m0.020s
jhall@bourdain nzb$

Don't take things like slurp for granted. In the short run it may save a little typing, but in the long run it may cause serious performance issues. It's not just a matter of using the right tool for the right job, it's a matter of using the right version of the right tool. A claw hammer, a ball-peen hammer and a rubber mallet may look similar and the interface is pretty much the same, but I wouldn't use the rubber mallet to drive nails, and I wouldn't use the claw hammer to tap my new bathroom towel rack into place.

1 comment:

  1. I've been fiddling with Perl on and off for a short while, and only as I felt like it, so I am hardly an expert.
    I read somewhere early on (could have been one of the intro tutorials in Perldoc) a caution about reading entire files into memory at once and that this only was reasonable for small files. So that's the only place I've ever used it.
    Slurping by manipulation of $/ really comes into its own for reading in and pattern matching within paragraphs or other records one at a time instead of line by line. See perldocs.

    If you want to process a whole file in one line of code without using while, look at Perl's grep or map.

