Bots That Suck… Bandwidth, That Is!
So you’re supposed to get indexed in search engines and directories, to increase traffic to your site, right? Only not all search engines are “good” search engines, and some directories are also more enemy than friend.
Usually, when a search engine "spiders" your site, it follows some rules to make sure it does not harm your site. Spidering just means that the search bot crawls through your site looking for juicy bits of content, and then throws them into its index.
Robots have rules. When they come to your site, they are supposed to stop in and check with your robots.txt file, just to see if you have any instructions for them. And then they are supposed to OBEY the rules you give them.
But some bots don't play nice. It is important to note that ALL of the major search engines SAY they respect the robots.txt file, but even the ones that say they do often don't follow it perfectly. That includes Google and some of the other biggies: they'll MOSTLY follow it, but they sometimes disobey.
Bad bots either don't look at it at all, or they look but don't pay attention to what it says. When you get a bot you don't like on your site, the first suggestion you'll hear is to block it in the robots.txt file. But that is often a waste of time, since the bot is not obligated to obey the rules in that file. To most bots, the robots.txt file is merely a set of suggestions.
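For reference, the "rules" in a robots.txt file are just a few plain-text lines. Here is a minimal sketch – "SomeBadBot" is a made-up name standing in for whatever crawler you want to turn away, and the /admin/ folder is just an example:

User-agent: SomeBadBot
Disallow: /

User-agent: *
Disallow: /admin/

A well-behaved bot reads this first and stays out of everything you've disallowed. A bad bot reads it and crawls those pages anyway, or never requests the file at all.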
Some bots are so aggressive that the only option is to block them through an .htaccess file. Unlike a robots.txt file, the rules in an .htaccess file are enforced by the server itself, so the bot does not get a vote. You can set it up so that the bot simply cannot access your site at all, and it has no choice in the matter.
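On an Apache server, that kind of block can look something like the following. This is only a sketch, assuming Apache with mod_rewrite enabled, and "SomeBadBot" again stands in for the actual user-agent string the bot sends:

# .htaccess: refuse every request whose user agent contains "SomeBadBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} SomeBadBot [NC]
RewriteRule .* - [F,L]

The [F] flag makes the server answer with 403 Forbidden, so the request is refused before any page content (or bandwidth) goes out the door.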
So what does a bad bot do?
A bad bot can do several things that you probably don't want happening on your site:
1. Suck bandwidth. This is common with badly written search bots. They just thrash your site, pulling up page after page, sometimes endlessly indexing, sometimes following all the links and then looking for things that are NOT linked. A bad bot can suck several gigs of bandwidth in a single session (this is an astronomical amount!).
2. Scrape content. Most bots do pull some text to display as a summary in their search results. A bad one will scrape much more than that sample text, and may reprint your content in unauthorized ways that constitute copyright violation. When images as well as text are scraped, this takes MUCH more bandwidth.
It is important to point out that bandwidth consumed by bad bots almost NEVER gives you any kind of return. Bad search bots pretty much DON’T send you site traffic.
Typically the first indication you'll have is escalating bandwidth usage that does not correlate with a proportional increase in traffic. On many hosting accounts this is serious, because if your bandwidth exceeds a certain amount, your account may be suspended, or your host may charge you more. So bandwidth consumption with no return benefit is not a good thing.
Two bots that we have encountered recently are the Cuil bot and the Twenga bot. Both absolutely TRAMPLE a site and suck HUGE amounts of bandwidth, but send pretty much NO traffic in return. The Cuil people are at least polite about posting the IPs that you can block. Twenga is not, and it has to be stopped using an .htaccess file. Neither one has a reputation for abiding by the robots.txt file either.
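If a bot does publish the IP addresses it crawls from, you can refuse those addresses in the same .htaccess file. Below is a sketch using the older Apache 2.2-style access directives; the addresses are documentation placeholders, not Cuil's or Twenga's real IPs, so substitute the ranges the bot actually uses:

# .htaccess: block specific crawler IPs, allow everyone else
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 203.0.113.45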
Both of these consumed about as much bandwidth as 50,000 site visitors would. And Twenga's consumption seems to increase by the month.
We recommend that you do NOT submit a site to either of these engines. If you see escalating bandwidth usage on your site, do some checking to see whether you've been hit. Twenga may show up as an unknown bot with VERY high bandwidth usage. Cuil may show up with Admin URLs in your referrers – in other words, referral URLs that PEOPLE could not actually have come from.
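One way to do that checking is to total up the bytes served per user agent straight from your raw access log. The Python sketch below assumes a standard "combined"-format log at a made-up filename; adjust the path and the pattern for your own server before trusting its numbers:

import re
from collections import defaultdict

# Combined log format: host ident user [time] "request" status bytes "referer" "user-agent"
LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-) "[^"]*" "([^"]*)"')

usage = defaultdict(int)  # bytes sent, keyed by user-agent string

with open("access.log") as log:  # filename is an assumption
    for line in log:
        match = LINE.match(line)
        if not match:
            continue
        host, size, agent = match.groups()
        usage[agent] += 0 if size == "-" else int(size)

# Print the ten heaviest bandwidth consumers
for agent, total in sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{total / 1048576:9.1f} MB  {agent}")

An unfamiliar user agent sitting near the top of that list, with almost no matching entries in your visitor stats, is exactly the pattern described above.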
There are other bad search bots as well, which show up in similar ways. Usually, a Google search for the bot by name will tell you pretty quickly whether other people are having similar problems.
Malicious bots (not just badly behaved search bots) can do this too. In both cases, the key warning sign is rising bandwidth consumption. It is almost always dramatic enough to raise a red flag, and even when it builds slowly over one to three months, it is still impossible to miss.