Monday, September 29, 2008

Weighted Keywords

I was thinking back today to an old company that I used to work for. As far as I know, the company went under a while ago, and even if they were still around, it was a long time ago and I never signed any non-disclosure or non-compete agreements, so I'm thinking there's no harm in talking about one of the concepts behind the product that they offered. Maybe somebody will have some use for it.

The idea was simple: a family-safe Internet filter. It wasn't just supposed to handle pornography. It had several other categories that it looked at, including gambling, shopping, games, hate and violence, even lingerie and the like. It would filter sites based on a black list (sites that we knew always matched a category), a white list (sites that we knew would never match a category) and sites that scored high enough using a weighted keyword list.

The black and white lists have always been common in blocking software. If a site is on a black list, it's bad, end of story. If it's on the white list, it's safe to look at. The weighted keywords were really what was important. A team of people looked at various sites that they knew to be bad (or, in our case, to match a certain category) and found keywords that were more likely to indicate whether a site matched a category.

Seeing an opportunity to automate the process, I wrote a script that would Google for a specific query (related to a specific category), hit the first 100 pages that were returned, and count how many times each keyword appeared across all of those pages. It wasn't long before I even created a list of "commonly-used words" that were pretty much useless to count ("the", "of", "and", etc.). I saved the results in a series of text files, including both the keyword and the count, and sent those files to the team leader. To this day I will never understand why he didn't think the count was important, but he liked having the words. It only took me a few minutes to write the script, but it saved him hours of trouble.

I never found out how they actually weighted the words. I assume they made a judgment call based on how relevant they thought each word to be. In other words, the data was completely subjective rather than statistical. This makes sense to a degree. Lots of sites with adult content are likely to contain the word "breast". But there are also sites, CNN included, that publish articles on breast cancer, which a parent would probably consider okay for their child to view. The word "breast" might get a score, but it will be a low score. The appearance of "XXX" or some profanity related to adult material is going to receive a much higher score, because "safe" sites like CNN are far less likely to have those words appear on their pages. If a page reached a high enough score for a particular category, it could be blocked.
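To make the idea concrete, here is a rough sketch of how I imagine the scoring working. The weights, the threshold and the category are all made up for illustration; I never saw the real values the company used.

#!/usr/bin/perl
# Rough sketch of weighted keyword scoring. The weights and the
# threshold below are invented for illustration only.

use strict;
use warnings;

# Hypothetical weights for the "adult" category: generic words score
# low, explicit ones score high.
my %weight = (
    breast => 1,
    xxx    => 10,
);

my $threshold = 25;    # made-up cutoff for blocking a page

# Sum the weight of every keyword occurrence on the page.
sub page_score {
    my ( $text ) = @_;
    my $score = 0;
    for my $word ( split /\s+/, lc $text ) {
        $word =~ s/\W//g;
        $score += $weight{$word} if exists $weight{$word};
    }
    return $score;
}

# Page text comes in on stdin, e.g. piped from elinks -dump.
my $page = do { local $/; <STDIN> };
my $score = page_score( $page );
print $score >= $threshold ? "block (score $score)\n" : "allow (score $score)\n";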

Subjective data might be relevant, but I think that statistical data is far more important than this team leader gave it credit for. But I wonder how much of that statistical work can be automated. I would still want to throw away certain keywords. Articles and other filler words ("of", "and", "the", etc.) can safely be ignored, because they will likely match all categories. I'm also not interested in numbers. Chunks of text that only contain digits can be tossed. I would probably even add prepositions to the list.

That leaves us with several other very generic words that I'm afraid to throw away. Do I throw away the word "cool"? Maybe not so much if the page is talking about climate or weather. Then again, it's such a generic word otherwise ("that casino was cool", "that beach party was cool", "that fight was cool", "that Perl script was cool") that maybe it will just confuse things anyway. I haven't decided yet how to handle those words.

Once we've thrown away the overly-generic keywords, we're left with a bunch of words that may or may not be relevant. Tagging a bunch of pages as the same category might help, which is what I did for that team leader: 100 pages' worth of keywords that were returned when I searched for something related to gambling, or shopping. Rather than seeing a specific word show up 3 times on one page, maybe it showed up 73 times across 100 pages. But it seems to me that maybe we could get the computer to do a little more work for us. It would take longer, but might produce more accurate results.
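If I were doing it again today, the aggregation step might look something like this. It leaves out the common-words filter for brevity, and the list of URLs on the command line is just an assumption; in practice it would be pages already known to belong to the same category.

#!/usr/bin/perl
# Sketch: sum keyword counts across a set of pages that all belong to
# one category (say, 100 gambling-related URLs passed on the command
# line). The common-words filter is left out for brevity.

use strict;
use warnings;

my @urls = @ARGV;

my %total;    # keyword => count across every page in the set
for my $url ( @urls ) {
    # Let elinks strip out the HTML, the same trick as the script below.
    my $text = `elinks -dump -no-numbering -no-references $url`;
    for my $word ( split /\s+/, lc $text ) {
        $word =~ s/\W//g;
        next if $word eq '' || $word !~ /\D/;    # skip blanks and pure numbers
        $total{$word}++;
    }
}

# Print the keywords with the highest totals first.
for my $word ( sort { $total{$b} <=> $total{$a} } keys %total ) {
    print "$word\t$total{$word}\n";
}

Run against 100 URLs from the same category, the totals start to look like raw material for the weights above.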

Tagging is a big buzzword right now. Sites like Amazon are allowing users to add their own tags to items to build up relevancy databases. It probably took a few weeks' worth of manpower to write the code, but once it was up and running they had millions of users literally performing free labor for Amazon. Now when those users search for something, assuming it was properly tagged, the likelihood of something relevant being returned is increased. One way to look at it is as a community effort. In Amazon's case, I would also look at it as free labor.

On a much smaller scale, Firefox 3 now supports tagging bookmarks. Unfortunately, their effort is little more than an afterthought and their implementation has little to no actual usefulness. When you "organize bookmarks", you can sort by tag. That's it. FF3 has no built-in tools to make any more use of tags. It's almost as bad as tagging in Blogger. The effort was poor enough that I'm quite honestly surprised that they bothered in the first place. It would have been far better to adopt GMail's label scheme, but I'm sure there are plenty of reasons why that would not be feasible (starting with the fact that Mozilla really seems to love storing bookmarks in an inherently-limiting HTML file).

Still, the tags are available. And there are plenty of other social bookmarking sites that handle tags somewhat better. If you are diligent in properly tagging your bookmarks, you're off to a good start: you have a set of data from which to work. That means you're already ahead of me, since I haven't bothered much with Firefox's poor bookmark tag support. But that doesn't mean I haven't tossed together some Perl code to start counting words.

This code makes use of the elinks program, which can conveniently strip out the HTML from a web page and render it as plain text, much the same way as you might see it in a regular browser, minus the images. It uses a file called common-words.txt which contains a series of articles and prepositions, one per line. When it finishes, it dumps the word count to the screen. It does nothing else at the moment, but it might be useful to you.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

# Read an entire file into a single string.
sub slurp {
    my ( $file ) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    local $/;
    return scalar <$fh>;
}

# Build a lookup hash of the common words (articles, prepositions, etc.) to skip.
my %common_words;
$common_words{$_} = 1 for split /\n/, slurp('common-words.txt');

# Let elinks fetch the page and strip out the HTML for us.
my $contents = `elinks -dump -no-numbering -no-references $ARGV[0]`;
my %words = wordcount( $contents );
print Dumper \%words;

exit;

sub wordcount {
    my ( $text ) = @_;
    my %word_count;
    $text =~ s/^\s+//s;
    while ( $text =~ s/(.*?)\s+//s ) {
        my $key = lc $1;
        $key =~ s/\W//g;                 # strip punctuation
        next if $key eq '';
        next if $common_words{$key};     # skip the common words
        next if $key !~ /\D/;            # skip strings that are all digits
        $word_count{$key}++;
    }
    return %word_count;
}

Here are my thoughts. When a page is bookmarked and tagged, do a word count. Save the word count in a database and associate it with the tag and the page. As you diligently tag and wordcount pages, the database will become more useful. After a certain point in time, when you look at a page, the database should have enough information to suggest what tag or tags might be most appropriate. The more pages are tagged properly, the more accurate the computer's suggestions will become. Start sharing the database with enough users that are also diligently tagging, and the time it takes to produce accurate results will decrease.
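Here is a minimal sketch of what that database might look like, using SQLite through DBI. The table layout, the function names and the scoring are all assumptions on my part, and the word counts could come straight from the wordcount() sub above.

#!/usr/bin/perl
# Sketch of the tag-suggestion idea: store per-tag word counts in SQLite
# and score new pages against them. The schema and the scoring here are
# assumptions, not a finished design.

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:SQLite:dbname=tags.db', '', '',
                        { RaiseError => 1 } );

$dbh->do( q{
    CREATE TABLE IF NOT EXISTS tag_words (
        tag   TEXT    NOT NULL,
        word  TEXT    NOT NULL,
        count INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY ( tag, word )
    )
} );

# Record the word counts of a page you just tagged.
# $words is a hashref of word => count, e.g. the output of wordcount().
sub record_page {
    my ( $tag, $words ) = @_;
    for my $word ( keys %$words ) {
        my $rows = $dbh->do(
            'UPDATE tag_words SET count = count + ? WHERE tag = ? AND word = ?',
            undef, $words->{$word}, $tag, $word );
        if ( $rows == 0 ) {
            $dbh->do(
                'INSERT INTO tag_words ( tag, word, count ) VALUES ( ?, ?, ? )',
                undef, $tag, $word, $words->{$word} );
        }
    }
}

# Suggest tags for a new page by summing, per tag, the stored counts of
# every word that appears on the page. Returns [ tag, score ] pairs,
# best match first.
sub suggest_tags {
    my ( $words ) = @_;
    my @page_words = keys %$words;
    return [] unless @page_words;
    my $placeholders = join ',', ('?') x @page_words;
    return $dbh->selectall_arrayref( qq{
        SELECT tag, SUM(count) AS score
          FROM tag_words
         WHERE word IN ($placeholders)
         GROUP BY tag
         ORDER BY score DESC
         LIMIT 5
    }, undef, @page_words );
}

The straight sum is just the simplest thing that could work; weighting rare words more heavily than generic ones would probably give better suggestions, for the same reasons discussed above.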

This can certainly be used to construct a family-safe Internet filter, but it would require that the surfer(s) look at a lot of inappropriate material. And my guess is that somebody who spends that much time looking at inappropriate material isn't incredibly interested in a "family-safe Internet experience". What I think it is useful for is helping users easily and accurately manage their own bookmarks. I guess that also raises the question: is it worth that much effort for one person to handle their own bookmarks that way? I guess it depends on how much you surf.

Let me know if you have any other thoughts on this. I think it could potentially be useful, especially if implemented with a group of people.

1 comment:

  1. Have you looked at Dan's Guardian? It's a Squid-based proxy server that does dynamic content web filtering. It uses a lot of the concepts you describe, at least as I understand it. I haven't looked too deeply at the implementation though. I've used it on occasion and it works reasonably well although not perfectly.

    It seems to me that you don't really need to decide which words are so common that they should be ignored. When you scan your corpus, any word that occurs as commonly in the sample as it does in the baseline sample could be eliminated. It's only the unique ones that matter anyway. I believe that's all standard Bayesian theory.


