Protecting the Glype browse.php page and blocking abusive bots by their useragent
Sometimes you may find your proxy being abused by search engine crawlers such as Googlebot or Yandex, along with the many other assorted bots lurking around the net. When these bots use up all of your site's resources and memory and cause problems, you may have little choice but to block them.
I made a quick little script to help detect and block abusive bots after seeing this post at Netbuilders: "Need help google bot going crazy".
Robots.txt - First line of defence
Restricting bots, spiders and crawlers in your robots.txt is one of the first steps you should take to secure your proxy site against possible abuse. Bear in mind, though, that rules in the robots.txt file which disallow crawlers from using the browse.php page will generally only stop the crawlers sent out by the major search engines from accessing proxified pages.
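As a rough illustration, a rule of this kind might look like the fragment below. It assumes Glype is installed in the web root, so that the proxy page is reachable at /browse.php; adjust the path if your script lives in a subdirectory.

```
# Applies to every crawler that chooses to honour robots.txt
User-agent: *
# Prefix match, so this also covers /browse.php?u=... style URLs
Disallow: /browse.php
```

The file has to be served from the root of the site (for example /robots.txt) for crawlers to find it.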
The main problem with robots.txt is that it's not actually a (legal) requirement for anyone to follow those rules.
Search engine companies may check robots.txt because most webmasters expect them to follow its rules and are very vocal in condemning companies that don't.
So while you might expect all bots to follow robots.txt, and want to punish those that don't, there aren't really any 'laws' that require a bot to check it.
In practice, not all bots will bother to check robots.txt before devouring all the pages on your site or running amok through your proxy.
Some bots will even do the opposite of what you want by reading the robots.txt file and using it as a guide to find pages or directories on your site. They will check your rules and then try to access every page and directory that is disallowed. For that reason it's not a great idea to list anything in robots.txt that you really don't want bots or hackers to try to access. Use a .htaccess file to enforce restrictions instead.
This is not a guide to using robots.txt, though. In this guide we'll take a look at a much more effective method of controlling abuse: using PHP to detect abusive bots by their useragent and block them from accessing the Glype browse.php page.
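To give a feel for the idea, here is a minimal sketch of a useragent check. It is not the script mentioned above, only an illustration under some assumptions: the function name is_blocked_agent and the $blockedAgents list are hypothetical, the list entries are example substrings that you would replace with whatever shows up in your own logs, and the check is assumed to sit at the very top of browse.php, before any proxy code runs.

```php
<?php
// Hypothetical blocklist: lowercase substrings to look for in the useragent.
// These two entries are examples only; build your own list from your logs.
$blockedAgents = array('googlebot', 'yandexbot');

// Returns true when the useragent contains any blocked substring.
// stripos() makes the comparison case-insensitive.
function is_blocked_agent($userAgent, array $blockedAgents)
{
    foreach ($blockedAgents as $needle) {
        if ($needle !== '' && stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    return false;
}

// Some clients send no useragent header at all, so fall back to an empty string.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (is_blocked_agent($userAgent, $blockedAgents)) {
    // Refuse the request before the proxy does any work on it.
    header('HTTP/1.1 403 Forbidden');
    exit('Access denied.');
}
```

Keep in mind that the useragent is supplied by the client and can be faked, so a check like this only catches bots that identify themselves honestly.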