TulipTools Internet Business Owners and Online Sellers Community

Full Version: Banned for Being Badly Behaved Idiots...
We banned the following annoying spiders from accessing the directory site that TulipTools shares a server with (and here's how you can ban them too):

RufusBot: it obeys robots.txt, so you can ban it by adding this to your robots.txt file:

Code:
User-agent: RufusBot
Disallow: /

Microsoft URL Control - 6.00.8862: it doesn't obey robots.txt, but it can be banned if you have Apache mod_rewrite enabled by adding this to your .htaccess file:

Code:
Options +FollowSymlinks
RewriteEngine On
RewriteBase /

RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control"
RewriteRule .* - [F,L]
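
Since the RewriteCond pattern is a regular expression, several rogue agents can be blocked with one rule set by chaining conditions with [NC,OR]. A rough sketch only, assuming the Options/RewriteEngine lines from the block above are already in place (the second agent here is just the EasyDL downloader covered below):

Code:
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "EasyDL" [NC]
# the last condition has no [OR]; [F,L] returns 403 Forbidden and stops processing
RewriteRule .* - [F,L]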

EasyDL/3.04: another one that doesn't obey robots.txt. Ban it by adding the following to your .htaccess file:

Code:
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from d57-8-78.home.cgocable.net
Deny from 24.57.249.53
</Limit>
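
Apache's Deny directive also accepts partial addresses (the first one to three octets), so if the same downloader keeps coming back from nearby addresses the ban can be widened to the whole range. A sketch only; whether you really want to block the entire 24.57.x.x range is an assumption:

Code:
<Limit GET POST>
Order Allow,Deny
Allow from all
# partial address: denies everything in 24.57.x.x
Deny from 24.57
</Limit>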

These are all spam-harvesting bots: ban them all by adding the following to your .htaccess file:

Code:
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot

<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
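
Since SetEnvIfNoCase takes a regular expression, the same list can also be collapsed into a single line if you prefer; this is just the block above rewritten, not an addition to it:

Code:
# same bots as above, combined into one pattern with alternation
SetEnvIfNoCase User-Agent "^(EmailSiphon|EmailWolf|ExtractorPro|CherryPicker|NICErsPRO|Teleport|EmailCollector)" bad_bot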




Add some more spidering idiots to the list  Smile but these we can't ban.

We turned off the Apache web server for a few minutes a little while ago: A. just to annoy TulipTools users and snicker at the thought of all of you getting the white 'can't connect' error messages, and B. so we could stop and check our log files to see exactly who (as in which idiot spiders) has been arriving en masse the past few days in the early morning and again in the late afternoon, causing our server to slow to a crawl or become inaccessible for short periods of time (because they're hitting the MySQL database with 250-300 simultaneous requests).
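
(A gentler way to see which spiders are hitting a site is to split crawler traffic into its own log file instead of stopping Apache. A rough sketch for httpd.conf; the user-agent list and log file name are just placeholders:)

Code:
# tag requests from known crawlers and write them to a separate log
SetEnvIfNoCase User-Agent "(Slurp|msnbot|Googlebot)" spider
CustomLog logs/spider.log combined env=spider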

The winners of the Spiderboinktard Award are... Yahoo Slurp and Yahoo Slurp China, who both arrive (with lots of little children) at the same time... and about 5 minutes after they arrive, LookSmart's WiseNutJob shows up with a few friends to join the party... grrrrr... and none of them appear to be obeying the rules.

Tongue2
:Smile

http://help.yahoo.com/help/us/ysearch/sl...rp-03.html

How can I reduce the number of requests you make on my web site?

There is a Yahoo! Slurp-specific extension to robots.txt which allows you to set a lower limit on our crawler request rate.

You can add a "Crawl-delay: xx" instruction, where "xx" is the delay in seconds between successive crawler accesses. If the crawler rate is a problem for your server, you can set the delay to 5 or 20 seconds, or whatever value is comfortable for your server.

Setting a crawl-delay of 20 seconds for Yahoo! Slurp would look something like:


   
Code:
User-agent: Slurp
Crawl-delay: 20

rose Wrote:There is a Yahoo! Slurp-specific extension to robots.txt which allows you to set a lower limit on our crawler request rate.

We added the Crawl-delay for both Slurp and MSNbot Sunday night. We started out at 5 seconds for Slurp, which didn't work, upped it to 10, which still didn't work, and then increased it to 15. We survived the heavy morning Slurp crawl unscathed today.

We didn't make it through last night though. We restarted Apache 5 times last evening to shake the spiders. Slurp, Slurp China, MSNbot, Googlebot, WiseNutBot, Lycos, Dir.com, and a few others all arrived at once: 40 spiders unleashed on our database, using the cache to do metasearches on the net. Hundreds of searches a minute, thousands of database queries a minute  :Smile  On the bright side, 112,000 pages of that site are now indexed in Yahoo and the number is increasing daily.
We survived the spiders last night.   

A Yahoo search for site:----.com now yields 126,000 results, a search for ----.com yields 487,000 results, and a search for link: www.----.com yields 128,000 results.  I'm happy.  Smile

TulipTools is another story: site:tt.com 227, community.tt.com 271, and link:tt.com 5,310.  Search engines have been slow to index the forums on any site where we've used this forum software.

The Crawl-delay extension also works for MSNbot.  Delay times can be 5 to 120 seconds; we're using 40 on the search site (the example below shows the 120-second maximum).

   
Code:
User-agent: msnbot
Crawl-delay: 120
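
Put together, a robots.txt using the values mentioned in this thread (15 seconds for Slurp and 40 for msnbot, rather than the 120-second maximum shown above) would look roughly like this:

Code:
User-agent: Slurp
Crawl-delay: 15

User-agent: msnbot
Crawl-delay: 40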


Quote:We survived the spiders last night.

Thumbsup
A related article on sites with a large number of pages having trouble with ill-mannered spiders slowing their servers to a snail's pace:

Quote:After banning spiders from crawling its site last week, WebmasterWorld has been delisted from Google and MSN, and is sure enough to be delisted soon from Yahoo...

Brett Tabke of WebmasterWorld explains “We have pushed the limits of page delivery, banning, ip based, agent based, and down right cloaking to avoid the rogue bots - but it is becoming an increasingly difficult problem to control.”

full article: http://www.searchenginejournal.com/index.php?p=2560

Quote:WebmasterWorld head Brett Tabke decided to ban all search spiders including those from the major search engines in an effort to combat bandwidth loss and server sluggishness due to rogue spiders. Brett figured he had about 60 days until he'd see pages get dropped. It took two.

As of this moment, site:webmasterworld.com at Google shows NO pages being listed from the site. Prior to the ban, about 2 million pages were listed.

... this is indicative of Google manually pulling everything about the site from Google.

full article: http://blog.searchenginewatch.com/blog/051123-093904
When I read the title, I thought you were going to tell us YOU were banned from somewhere.  Wink
Updates:

A.  We did not survive last night's invasion of the spiders.  At one point we had 97  :blinkie: spiders simultaneously playing around on our directory/metasearch site.  Any of you who visited that site or this site after 9 pm EST last night no doubt noticed the sites were unresponsive--now you know why. Angryfire

B.  There's no doubt now that Google has responded to WebmasterWorld's banning of googlebot by doing a manual ban of the Internet's 279th busiest web site (as ranked by Alexa).  Google has removed the site from its directory, all 2 million previously indexed site pages are now removed from Google Search, and WW now has a PR of 0 across all Google data centers.