BeanQuest

July 30, 2006

Blogathon Cheating

Filed under: perl, pointlessness, web-ish stuff — Brian @ 11:35 am

Update: Sweet. I won the first ever “George Washington ‘I Cannot Tell A Lie, I Hacked Your Contest’ Script Kiddie Special Achievement Award for Outstanding Accomplishment of Incredibly Trivial Ends”. Perl rox0rs.

When Eric announced that there would be a prize for the person who could provide the longest list of sequential header images from his blog during Blogathon, I immediately thought, “cool – I could automate that!” And then, almost as immediately, I thought, “hey – that would be cheating,” followed by “well, I’ll just disqualify myself. This will be too much fun to pass up.”

So, the night before the Blogathon began, I made sure to have the latest version of WWW::Mechanize (a Perl module from the always friendly and generous Andy Lester) and set about writing a script that I could quickly modify to read the actual HTML Eric would use during the Blogathon. I figured I’d just grab the first image in the HTML, and that would probably cover it. Before the Blogathon began, this did, indeed, grab his masthead image. Confidence was high.

In the morning, when it all began, I found that my script was getting an image, but not the one I wanted. It was getting an image from the first post in the page.

Turns out, Eric pulled a switcheroo, and was placing the header image with CSS, not HTML. The script I wrote parses HTML just fine, but doesn’t care much for CSS. Uh-oh.

Fortunately, he was loading the image from a static location, and making the hourly switch by replacing the image file with a different one, using the same name. Well. That makes it too easy – it’s just a call to a single, unchanging URL; every hour, the image there will be different. So I threw away my script and replaced it with this:

wget http://www.ericsiegmund.com/fireant/images/blogathon2006/bg.jpg

Run that every hour via cron, and I’ll have my collection of images, to sort through at my leisure.

Now that I had it working, I e-mailed Eric to explain my scheme and recuse myself from the competition, confident that if this isn’t cheating, I don’t know what is.

After a few hours (OK, maybe ten hours) of quietly retrieving the images, I went to look at the folder. There was the first one, which I’d saved manually, then the second, which the script got. Then about ten more. Those ten were all copies of the second one; I had a collection of Texans in Hatchbacks! The same Texan in the same hatchback, every time.

Not what I was looking for.

Now, I’d had no reply to my e-mail. I figured Eric was too busy live-blogging from the HEB to read it. (The HEB is a grocery store, but it’s apparently quite a grocery store, as they allow blogging and gadget hackery to take place inside.) Apparently not. I’m guessing he got it, and figured he’d mess with me. Since I had kind of started it, this was entirely appropriate, and a welcome surprise!

So, I dug out the original script, and tweaked it to make it more Eric-proof. The new version (a) gets the blog’s index page, (b) finds the URL of the stylesheet it uses, (c) gets the stylesheet, (d) gets the URL of the image used for the background of the CSS ‘header’ div, and (e) gets the image itself and saves it under a timestamped filename. Cron runs the whole thing (f) once per hour. Oh, and it uses a Safari identification string (user-agent), so it wouldn’t call attention to itself with a casual look through the server logs, should Eric decide to look.

And I left the other one running, so it *did* show up in the server logs – this way, it’s not obvious that I changed anything.

I wrote it as the last thing before going to bed (yes, I’d ignored the output of the original attempt for far too long. It was a Saturday! A day of rest! It looked fine on the surface!), so it’s a little wordy, and a bit messy in places.

But it was a great exercise in writing a more resilient approach, and in writing regular expressions (that’s the part that looks like gibberish, with the slashes and splats and brackets and parens). And, more importantly, it worked all night as I slept!

Here’s the Perl:

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;

# create a mech that calls itself "Safari", so we aren't obvious in server logs
my $mech	= WWW::Mechanize->new( agent => 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X;'
.  'de-de) AppleWebKit/125.5.7 (KHTML, like Gecko) Safari/125.12' );
my $now 	= timestamp();
# default bg image url
my $bg_url 	= 'http://www.ericsiegmund.com/fireant/bg.jpg';
# url to start on
my $url 	= 'http://www.ericsiegmund.com/fireant/';
# name to store image under
my $bg_file = 'bg' . $now . '.jpg';


$mech->get( $url );
my $page = $mech->content();

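# In the content of $url, get the stylesheet's URL:
# Look for <link rel="stylesheet" href="[[grab this part]]">
#
my ($css) = $page =~ /link\s+rel\s*=\s*['"]?stylesheet['"]?\s+href\s*=\s*['"]?([^'"]+)/is;
die "No stylesheet link found in $url\n" unless defined $css;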


# get the stylesheet, store the whole thing as a string in $sheet
# and split it up into individual lines, one per element of the @sheet array
#
$mech->get($css);
my $sheet = $mech->content();
my @sheet = split/\n/, $sheet;

foreach (@sheet){		# go through the lines of the stylesheet, one by one.
# We're looking for something like this:
# #header   { 
#	background: #[[variable color]] url([[grab this part]]) no-repeat top right;
#	border-bottom: 1px gray solid;
#	width: 100%; 
#	height: 150px; 
#	text-align: center; 
#	}

	if( /#header[\s{]+/ .. m|}| ){	# only look while we're in the '#header' section
		# (below) look for a line that contains 'background' and a URL.
		# that URL is where we can find the image
		if( $_ =~ m|background[^(]+\(\s*['"]?(http://www\.ericsiegmund\.com/[^'")\s]+)|i ){
			$bg_url = $1;	# $1 is what we found in the parens, above
							# that is, the URL of the background image
		}
	}
}

# get the image, and store it in the file whose name we made up at the start
$mech->get( $bg_url, ":content_file" => $bg_file );

# script ends here - below is just the timestamp routine, used to create the filename
sub timestamp{
	my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime();
# never can remember the order of those things.
# sloppy copy & paste, yes...
	$year 	= $year + 1900;
	$mon 	= sprintf("%2.2d", $mon + 1);
	$mday 	= sprintf("%2.2d", $mday);
	$hour 	= sprintf("%2.2d", $hour);
	$min 	= sprintf("%2.2d", $min);
	$sec 	= sprintf("%2.2d", $sec);

	return $year . $mon . $mday . '_' . $hour . $min . $sec;

}
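
For anyone who wants to poke at the two pattern-matching steps without hitting Eric's server, here is a minimal sketch that runs them against canned text. The sample HTML and CSS, and the www.example.com URLs in them, are made up for illustration and are not Eric's actual markup. The first regex is close to the one in the script, just spread out with the /x flag so the pieces can be labeled; the second is a loosened variant that accepts any URL inside url(...), which is an assumption on my part about what a more general version might look like.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, invented for this example.
my $page = <<'HTML';
<html><head>
<link rel="stylesheet" href="http://www.example.com/blog/styles.css" type="text/css" />
</head><body><img src="post1.jpg" /></body></html>
HTML

my $sheet = <<'CSS';
body { background: #fff url(http://www.example.com/images/tile.gif); }
#header {
	background: #336699 url(http://www.example.com/images/bg.jpg) no-repeat top right;
	height: 150px;
	}
#footer { background: url(http://www.example.com/images/foot.jpg); }
CSS

# Step (b): capture the href of the stylesheet link.
my ($css) = $page =~ m{
    link \s+ rel \s* = \s* ['"]? stylesheet ['"]?   # rel=stylesheet, quotes optional
    \s+ href \s* = \s* ['"]?                        # href=, quotes optional
    ( [^'"\s>]+ )                                   # the URL itself
}xis;

# Step (d): walk the stylesheet line by line. The range operator (..) in
# scalar context switches on at the '#header' line and off again at the
# first line containing '}', so only the #header rule is examined.
my $bg_url;
for ( split /\n/, $sheet ) {
    if ( /#header[\s{]+/ .. /}/ ) {
        # background ... url( <capture everything up to the closing paren> )
        if ( m{background[^(]+\(\s*['"]?([^'")\s]+)}i ) {
            $bg_url = $1;
        }
    }
}

print "stylesheet: $css\n";
print "background: $bg_url\n";
```

Run as-is, this prints the styles.css URL and then the bg.jpg URL; the tile.gif and foot.jpg backgrounds are skipped because they sit outside the #header block.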

7 Comments »

  1. Hey, I’d say anyone willing to work that hard for the prize deserves it! The alternative, staying up nearly all night, didn’t work for me because. Man! I had no idea all of that was even possible, let alone how it is done!

    Comment by Gwynne — July 30, 2006 @ 11:41 pm

  2. It wasn’t all that much work, really – this kind of stuff is fun for me. And I’m quite certain it evades the purpose of paying attention to the blogathon.

    I was going to e-mail the list to Eric, and then post a report here questioning the carbon-to-silicon ratio in the body of the entrant (a la Tour de France), but couldn’t figure out a good punchline.

    Comment by Brian — July 31, 2006 @ 6:10 am

  3. LOL! ;-)

    Comment by Gwynne — July 31, 2006 @ 11:04 am

  4. I’m guessing he got it, and figured he’d mess with me.

    You give me FAR too much credit, Brian! I was totally clueless, and would have remained that way. But you did an excellent job in solving this problem, and I’m also impressed with your honesty in publicizing it. Say, maybe you deserve some kind of award! ;-)

    Comment by Eric — July 31, 2006 @ 4:09 pm

  5. And the moral to this tale is: never admit your hack until the hacking’s done.

    And don’t mess with Texas. That too.

    Comment by Jim — July 31, 2006 @ 7:42 pm

  6. That was awfully clever of you! You totally earned your award! :)

    Comment by Rachel — July 31, 2006 @ 9:59 pm

  7. Well, Jim, you know what they say, “If at first you don’t succeed, cheat better next time.”

    Thanks, Rachel! I don’t know how clever it was, but that imaginary battle of wits sure was fun.

    Comment by Brian — August 1, 2006 @ 6:17 am

