
Thursday, February 17th, 2005

HTTP_REFERER spamming: the mob found my website

Like most webmasters, I keep track of the websites that link to this one. In the jargon of my people this is called “referer logging,” which is short for “HTTP_REFERER logging,” a term I include here for the benefit of GoogleBot.

Starting a few months ago, my referer logs became worthless; they were filled with sites that couldn’t possibly be linking to mine: paris-hilton-video.blogspot.com, www.texas-holdem-poker-downloads-4u.info, viagra.hosting4u.gb.com. In other words, even though those sites did not contain links to debris.com, my logs looked as if hundreds of people per day were clicking through from there to here.

Why would anyone bother to fake clickthroughs? Because some websites automatically display the URLs other readers have clicked through from. The gambling and porn site owners are hoping debris.com will automatically display, and link to, their URLs.
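
To make the incentive concrete, here is a minimal sketch (Python CGI, purely illustrative, and nothing like debris.com’s actual code) of the kind of page the spammers are fishing for: one that records each visitor’s HTTP_REFERER and republishes the whole list as live links.

```python
# Hypothetical "recent referers" CGI page -- the sort of thing referer
# spammers hope to find. File name and layout are made up for illustration.
import os
import html

LOG = "recent_referers.txt"  # hypothetical storage file

referer = os.environ.get("HTTP_REFERER", "")
if referer:
    with open(LOG, "a") as f:
        f.write(referer + "\n")

print("Content-Type: text/html\n")
print("<h2>Readers clicked through from:</h2><ul>")
if os.path.exists(LOG):
    with open(LOG) as f:
        for url in f:
            url = html.escape(url.strip())
            # Every forged referer becomes a real, crawlable link --
            # exactly what the spammer wants Google to count.
            print(f'<li><a href="{url}">{url}</a></li>')
print("</ul>")
```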

It’s all Google’s fault. Google’s PageRank system counts inbound links as relevance votes: the more sites link to website X, the more relevant website X must be. So, if a million weblogs link to paris-hilton-viagra-holdem-poker.org, then paris-hilton-viagra-holdem-poker.org will show up high in Google’s search results for any search on related terms.
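
For reference, the simplified PageRank formula from Brin and Page’s original paper shows why inbound links are worth spamming for: every additional linking page contributes a positive term to the target’s score. (Google’s live ranking is more complicated than this, but the principle holds.)

```latex
% Simplified PageRank from the original paper. d is a damping factor
% (typically 0.85), T_1 ... T_n are the pages linking to A, and C(T_i)
% is the number of outbound links on page T_i.
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```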

So, some unknown fuckwit, or collective of fuckwits, operates software that hammers on my site (and countless others, I’m sure), with the page requests faked to make it look as if readers are clicking through from various gambling and porn and pharmaceutical sites, in a lame attempt to raise their PageRank scores.
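
The forgery itself is trivial. Here’s a sketch of what one of these robots’ requests amounts to, with placeholder URLs standing in for the real target and the real spamvertised site:

```python
# Sketch of a referer-spamming request. Both URLs are placeholders.
import urllib.request

req = urllib.request.Request(
    "http://example.com/some-page",
    headers={
        # The forged header: no human ever clicked through from this site.
        "Referer": "http://paris-hilton-viagra-holdem-poker.example/",
        # Pretend to be a normal browser so the hit looks legitimate.
        "User-Agent": "Mozilla/4.0 (compatible)",
    },
)
urllib.request.urlopen(req)  # the target server dutifully logs the fake referer
```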

There are numerous problems with this strategy:

  1. My site doesn’t display referers, so no benefit has ever been realized by the spammers.
  2. 90% of the spamvertised URLs get shut down within a day anyway (e.g. last night’s variation, http://www.nutzu.com/internet-poker.html). So even if my site did automatically display referers, the spamvertised sites would be dead before Google’s spiders ever counted the links as PageRank votes.

The fact that the strategy is a failure doesn’t make it any less of a hassle for me. My ISP recently began charging me surplus-bandwidth fees, because all the sites I host are serving more data than I projected or paid for. Yet a measurable percentage of the bytes served by this site were not actually being seen by humans. I’m paying for the traffic generated by the referer-spammers’ software robots.

Preventing this abuse requires daily maintenance, because the spamvertised URLs change frequently. A few general keywords like poker, holdem, and viagra catch most new attacks: an hourly scheduled script scans the recent logs for those keywords and adds any matching referring domains to my blacklist. Every second or third day, I manually examine the logs in search of new attacks that don’t happen to match any of the keywords I’ve already defined.
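
Here’s a rough sketch of that hourly job, assuming an Apache combined-format log; the file paths, the regex, and the keyword list are illustrative rather than my actual configuration.

```python
# Scan recent access-log lines for spammy referers and append any new
# offending domains to a blacklist file. Paths are hypothetical.
import re

KEYWORDS = ("poker", "holdem", "viagra")
LOGFILE = "/var/log/apache/access_log"      # assumed combined log format
BLACKLIST = "/etc/referer-blacklist.txt"    # hypothetical blacklist file

try:
    with open(BLACKLIST) as f:
        blocked = {line.strip().lower() for line in f if line.strip()}
except FileNotFoundError:
    blocked = set()

new_domains = set()
with open(LOGFILE) as f:   # a real cron job would look at only the last hour
    for line in f:
        # In the combined format, the referer is the second-to-last quoted field.
        m = re.search(r'"(https?://([^/"]+)[^"]*)" "[^"]*"$', line)
        if not m:
            continue
        url, domain = m.group(1).lower(), m.group(2).lower()
        if domain not in blocked and any(k in url for k in KEYWORDS):
            new_domains.add(domain)

if new_domains:
    with open(BLACKLIST, "a") as f:
        for d in sorted(new_domains):
            f.write(d + "\n")
```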

So now when these robot scripts pound on my site, instead of serving up 15-20k of glorious debris.com content, the software engine that generates these pages returns a brief error message.
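
In sketch form (again assuming a CGI-style page generator, not the actual debris.com engine), the check amounts to this:

```python
# Refuse to build the page if the request's referer domain is blacklisted.
import os
import sys
from urllib.parse import urlparse

BLACKLIST = "/etc/referer-blacklist.txt"  # same hypothetical file as above

with open(BLACKLIST) as f:
    blocked = {line.strip().lower() for line in f if line.strip()}

referer_host = urlparse(os.environ.get("HTTP_REFERER", "")).netloc.lower()

if referer_host in blocked:
    # A couple hundred bytes instead of 15-20k of page content.
    print("Status: 403 Forbidden")
    print("Content-Type: text/plain\n")
    print("Referer spam detected. Go away.")
    sys.exit(0)

# ...otherwise fall through to normal page generation...
```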

Frankly, the bandwidth savings are minuscule compared to the amount consumed by people abusing the MP3s and graphics. But they’re next in line.


posted to channel: Colophon
updated: 2005-02-18 23:43:16
