Blogathon Cheating
Update: Sweet. I won the first ever “George Washington ‘I Cannot Tell A Lie, I Hacked Your Contest’ Script Kiddie Special Achievement Award for Outstanding Accomplishment of Incredibly Trivial Ends”. Perl rox0rs.
When Eric announced that there would be a prize for the person who could provide the longest list of sequential header images from his blog during Blogathon, I immediately thought, “cool - I could automate that!” And then, almost as immediately, I thought, “hey - that would be cheating,” followed by “well, I’ll just disqualify myself. This will be too much fun to pass up.”
So, the night before the Blogathon began, I made sure to have the latest version of WWW::Mechanize (a Perl module from the always friendly and generous Andy Lester) and set about writing a script that I could quickly modify to read the actual HTML Eric would use during the Blogathon. I figured I’d just grab the first image in the HTML, and that would probably cover it. Before the Blogathon began, this did, indeed, grab his masthead image. Confidence was high.
In the morning, when it all began, I found that my script was getting an image, but not the one I wanted. It was getting an image from the first post in the page.
Turns out, Eric pulled a switcheroo, and was placing the header image with CSS, not HTML. The script I wrote parses HTML just fine, but doesn’t care much for CSS. Uh-oh.
Fortunately, he was loading the image from a static location, and making the hourly switch by replacing the image file with a different one, using the same name. Well. That makes it too easy - it’s just a call to a single, unchanging URL; every hour, the image there will be different. So I threw away my script and replaced it with this:
wget http://www.ericsiegmund.com/fireant/images/blogathon2006/bg.jpg
Run that every hour via cron, and I’ll have my collection of images, to sort through at my leisure.
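For the curious, an hourly cron entry for this might look something like the sketch below. The directory name is made up for illustration; it is not the one actually used. Because the file name on the server never changes, wget's default behavior of saving repeat downloads as bg.jpg.1, bg.jpg.2, and so on keeps each hour's copy separate.

```
# m h dom mon dow  command
# Hypothetical directory; runs at the top of every hour.
0 * * * * cd "$HOME/blogathon" && wget -q http://www.ericsiegmund.com/fireant/images/blogathon2006/bg.jpg
```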
Now that I had it working, I e-mailed Eric to explain my scheme and recuse myself from the competition, confident that if this isn’t cheating, I don’t know what is.
After a few hours (OK, maybe ten hours) of quietly retrieving the images, I went to look at the folder. There was the first one, which I’d saved manually, then the second, which the script got. Then about ten more. Those ten were all copies of the second one; I had a collection of Texans in Hatchbacks! The same Texan in the same hatchback, every time.
Not what I was looking for.
Now, I’d had no reply to my e-mail. I figured Eric was too busy live-blogging from the HEB to read it. (The HEB is a grocery store, but it’s apparently quite a grocery store, as they allow blogging and gadget hackery to take place inside.) Apparently not. I’m guessing he got it, and figured he’d mess with me. Since I had kind of started it, this was entirely appropriate, and a welcome surprise!
So, I dug out the original script, and tweaked it to make it more Eric-proof. The new version (a) gets the blog’s index page, (b) finds the URL of the stylesheet it uses, (c) gets the stylesheet, (d) gets the URL of the image used for the background of the CSS ‘header’ div, (e) gets the image itself, saved with a timestamp, and (f) does all of this once per hour, scheduled with cron. Oh, and it uses a Safari identification string (user-agent), so it wouldn’t call attention to itself with a casual look through the server logs, should Eric decide to look.
And I left the other one running, so it *did* show up in the server logs - this way, it’s not obvious that I changed anything.
I wrote it as the last thing before going to bed (yes, I’d ignored the output of the original attempt for far too long. It was a Saturday! A day of rest! It looked fine on the surface!), so it’s a little wordy, and a bit messy in places.
But it was a great exercise in writing a more resilient approach, and in writing regular expressions (that’s the part that looks like gibberish, with the slashes and splats and brackets and parens). And, more importantly, it worked all night as I slept!
Here’s the Perl:
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
# create a mech that calls itself "Safari", so we aren't obvious in server logs
my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; '
    . 'de-de) AppleWebKit/125.5.7 (KHTML, like Gecko) Safari/125.12' );
my $now = timestamp();
# default bg image url
my $bg_url = 'http://www.ericsiegmund.com/fireant/bg.jpg';
# url to start on
my $url = 'http://www.ericsiegmund.com/fireant/';
# name to store image under
my $bg_file = 'bg' . $now . '.jpg';
$mech->get( $url );
my $page = $mech->content();
# In the content of $url, get the stylesheet's URL:
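# Look for something like:
#   <link rel="stylesheet" href="[[grab this part]]">
my ($css) = $page =~ /link\s+rel\s*=\s*['"]?stylesheet['"]?\s+href\s*=\s*['"]?([^'"]+)/is;
die "No stylesheet link found in $url\n" unless defined $css;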
# get the stylesheet, store the whole thing as a string in $sheet
# and split it up into individual lines, one per element of the @sheet array
#
$mech->get($css);
my $sheet = $mech->content();
my @sheet = split/\n/, $sheet;
foreach (@sheet){ # go through the lines of the stylesheet, one by one.
    # We're looking for something like this:
    # #header {
    #     background: #[[variable color]] url([[grab this part]]) no-repeat top right;
    #     border-bottom: 1px gray solid;
    #     width: 100%;
    #     height: 150px;
    #     text-align: center;
    # }
    if( /#header[\s{]+/ .. m|}| ){ # only look while we're in the '#header' section
        # (below) look for a line that contains 'background' and a URL inside url(...).
        # that URL is where we can find the image.
        # \( matches the literal paren of url(...); the capture stops at the closing paren.
        if( $_ =~ m|background[^(]+\(['"]?(http://www\.ericsiegmund\.com/[^'")\s]+)|i ){
            $bg_url = $1; # $1 is what we found in the capturing parens, above
                          # that is, the URL of the background image
        }
    }
}
# get the image, and store it in the file whose name we made up at the start
$mech->get( $bg_url, ':content_file' => $bg_file );
# script ends here - below is just the timestamp routine, used to create the filename
sub timestamp{
    my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();
    # never can remember the order of those things.
    # sloppy copy & paste, yes...
    $year = $year + 1900;
    $mon = sprintf("%2.2d", $mon + 1);
    $mday = sprintf("%2.2d", $mday);
    $hour = sprintf("%2.2d", $hour);
    $min = sprintf("%2.2d", $min);
    $sec = sprintf("%2.2d", $sec);
    return $year . $mon . $mday . '_' . $hour . $min . $sec;
}
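If you want to poke at the regular-expression part without hitting anyone's server, here is a minimal offline sketch of the stylesheet step. It is a pared-down illustration, not the script that actually ran: the sample stylesheet, its image paths, and the header_bg_url helper are all made up for the example, and the URL pattern is loosened to accept any http:// address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample stylesheet, invented for illustration only.
my $sheet = <<'CSS';
body {
    background: #fff;
}
#header {
    background: #336699 url(http://www.ericsiegmund.com/fireant/images/example/bg.jpg) no-repeat top right;
    height: 150px;
}
#footer {
    background: #000 url(http://www.ericsiegmund.com/fireant/images/example/footer.jpg);
}
CSS

# Return the background image URL from the '#header' rule, or undef if none.
# (header_bg_url is a made-up helper name, not part of any module.)
sub header_bg_url {
    my ($css) = @_;
    my $found;
    foreach my $line ( split /\n/, $css ) {
        # The range operator is true from the '#header' line through the
        # next line containing '}'. Note that it remembers its state between
        # calls, so this sketch assumes the '#header' rule is always closed.
        if ( $line =~ /#header[\s{]+/ .. $line =~ /}/ ) {
            # Capture what sits inside url(...), optionally quoted.
            if ( $line =~ m{background[^(]+\(\s*['"]?(http://[^'")\s]+)}i ) {
                $found = $1;
            }
        }
    }
    return $found;
}

print header_bg_url($sheet), "\n";
```

Run as-is, this should print the bg.jpg address from the #header rule and ignore the #footer one, since the footer's background line falls outside the range the flip-flop covers.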




