Robots.txt File – The Power Unleashed

Whether you are a web veteran or a rookie, I am confident that you have heard of the robots.txt file. You have probably heard myths, conflicting information, and advice on how to use it. You may even have been told to abandon it altogether. Who is right?
By: Simon Birch
Full article here: http://www.seomarketingforums.com/

I’m here to tell you. First things first: the robots.txt file was designed to tell bots how to behave on your site, specifying what information they can get and what information they can’t. It is a simple text file that is very easy to create once you understand the proper format. This system is called the Robots Exclusion Standard. An example of a robots.txt file can be found at: http://www.webmarketingnow.com/

User-agent

The User-agent line specifies the robot. For example:

User-agent: googlebot

You may also use the wildcard character ‘*’ to specify all robots:

User-agent: *

You can find user agent names in your own logs by checking for requests to robots.txt. Most major search engines have names for their spiders. Here is a partial list:

Googlebot
MSN Robot
Yahoo! Slurp
Google AdSense Robot
Noxtrumbot
Xenu Link Sleuth

Disallow:

The second part of a robots.txt file consists of Disallow: directive lines. Just because a Disallow statement is present doesn’t mean the bot(s) are completely barred from the site. These lines can specify files and/or directories. For example, if you want to instruct spiders not to download private.htm, you would enter:

Disallow: /private.htm

You can also specify directories:

Disallow: /cgi-bin/

This will block spiders from your cgi-bin directory.

Some webmasters are nervous about listing a directory to exclude in the robots.txt file, since that gives hackers a reason to try to get into that folder. You can exclude a folder without giving out its full name. For example, if the directory you want to exclude is “secret”, you could add:

Disallow: /sec

This would disallow spiders from indexing any folder beginning with “sec”, so look at your directory structure before implementing this; it would also disallow a folder named “secondary”.

There is also a wildcard nature to the Disallow directive. The standard dictates that /temp would disallow both /temp.html and /temp/index.html (the file temp and the files in the temp directory will not be indexed). If you leave the Disallow line blank, it indicates that ALL files may be retrieved.

At least one Disallow line must be present for each User-agent directive for the file to be correctly formatted. If you don’t do this, the file will not be compliant, and chances are the bots will not read it correctly or will simply ignore the entire file. Yahoo! has been known to do this. A completely empty robots.txt file is treated the same as if it were not present. Also, over 80% of people who complain that bots are not obeying their robots.txt file have syntax errors, so the file isn’t being read at all.

Comments

Any line in the robots.txt file that begins with # is considered a comment line and is ignored. The standard allows for comments at the end of directive lines, but this is really bad formatting style and I don’t recommend it. Example:

Disallow: temp # Disallowing access to the temp folder.

Some spiders will not interpret the above line correctly and will instead attempt to disallow ‘temp#comment’. Instead, format the line as follows:

# Disallowing access to the temp folder
Disallow: /temp/

That makes for a cleaner looking robots.txt file.
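Since syntax errors are the most common reason the file gets ignored, it is worth testing your robots.txt before relying on it. Here is a quick sketch of one way to do that with Python’s built-in urllib.robotparser module; the rules and URLs below are just placeholders, so substitute the contents of your own file and your own pages.

from urllib.robotparser import RobotFileParser

# Placeholder rules; paste in the contents of your own robots.txt instead.
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch() returns True if the named user agent may retrieve the URL.
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))        # expect True
print(parser.can_fetch("Googlebot", "http://www.example.com/cgi-bin/form.cgi"))  # expect False

If the answers are not what you expect, re-check the format rules above before blaming the bots.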
Examples

The following allows all robots to visit all files, because the wildcard ‘*’ specifies all robots:

User-agent: *
Disallow:

Want to keep all robots out? Use this one:

User-agent: *
Disallow: /

Want to keep out just one bot? Let’s deny Ask (it’s a good idea anyway):

User-agent: Teoma
Disallow: /

Keep bots out of your cgi-bin and images folders:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

How about keeping out just the Google Images bot, while allowing other image bots free roam of your site?

User-agent: Googlebot-Image
Disallow: /images/

Note: the above only covers the /images/ directory. If you have images elsewhere on your site, the image bot will still get them.

This one bans Email Harvester from all files on the server:

User-agent: emailharvester
Disallow: /

This one keeps googlebot from getting at the cloaking.htm file:

User-agent: googlebot
Disallow: /cloaking.htm

If you create a page that is perfect for Yahoo!, but you don’t want Google to see it:

User-Agent: Googlebot
Disallow: /yahoo-page.html

Before a “light bulb” goes off in your head and you realize the examples above make User-Agent based cloaking possible, don’t go down that road. That is known as “poor man’s cloaking.” It may work for a little while, but you will get nailed hard, and getting the domain back into the index is a painful and long process. It just isn’t worth it.

Common Questions about the Robots.txt File

Q: Why should I use it when I can use the meta-robots tag instead?
A: First of all, the meta-robots tag is not consistently supported by search engines; in the testing I have done it is often not read. All the major engines and most of the minor engines look for the robots.txt file and do their best to obey it. The same is not true of the meta-robots tag. Also, if you do use the meta-robots tag, don’t bother with “index,follow”; that is already the default behavior.

Q: What if I don’t use the robots.txt file? What is the worst that can happen?
A: According to my testing, when a site that has been online for 12 months or longer adds a robots.txt file and makes no other changes, the site is indexed an average of 14% deeper than it was before. Skip the file and you simply miss out on that.

Q: Where do I place the robots.txt file?
A: In the root directory of your server. In other words, in the same place as the index.html file for your home page.

Q: What are some things that I would want to exclude from the robots?
A: Here are a few examples:

- Any folder that is “off limits” to the public eye and that you have not (for whatever reason) password protected
- Print-friendly versions of pages (to avoid the duplicate content filter)
- Images, to protect them and to avoid spidering problems
- CGI-BIN (programming code)

Also review your weblogs, find spiders that you don’t want coming to your site, and deny them (a log-summary sketch follows at the end of this article). I always look at the data transferred, and my cutoff is 10,000 KB or more per month; anything less than that is not worth your time. The following is a dump from one of our servers over a 30 day period:

Spider                           Number of Hits   Data Transferred (KB)
MSNBot                           12,473           259,687
Yahoo                            10,548           193,983
GoogleBot                         5,768           138,447
Ask Jeeves robot                  4,623           113,023
LinksManager Link Checker Bot     1,356            31,698
Xenu link checker                 1,061            20,209
Alexa                             6,740            17,711
wisenut robot                       578            10,317

What would I deny in the list above? Honestly, I would deny Ask Jeeves. The bot uses a ton of resources, and the amount of referral traffic from Ask is so low that it doesn’t make a “fair trade”. You would also want to deny the LinksManager bot, which is not just a resource hog but will fill your inbox with spam link requests from garbage sites looking for reciprocal links. I would also deny WiseNut. WiseNut was a good idea that never quite made it.
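If you want to pull numbers like those in the table above out of your own logs, here is a rough Python sketch of how you might do it. It assumes the common Apache/Nginx “combined” log format; the access.log file name and the list of bot substrings are placeholders to swap for whatever actually shows up in your logs.

import re
from collections import defaultdict

# Matches the common "combined" log format; adjust the pattern if your server logs differently.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings that identify the spiders you care about; add or remove entries as needed.
BOTS = ["msnbot", "slurp", "googlebot", "teoma", "linksmanager", "xenu"]

hits = defaultdict(int)
kilobytes = defaultdict(float)

with open("access.log") as log:  # placeholder log file name
    for line in log:
        match = LOG_LINE.match(line)
        if not match:
            continue
        agent = match.group("agent").lower()
        size = match.group("bytes")
        for bot in BOTS:
            if bot in agent:
                hits[bot] += 1
                if size != "-":
                    kilobytes[bot] += int(size) / 1024
                break

# Report spiders that moved 10,000 KB or more over the period, per the cutoff above.
for bot in sorted(kilobytes, key=kilobytes.get, reverse=True):
    if kilobytes[bot] >= 10000:
        print(f"{bot}: {hits[bot]:,} hits, {kilobytes[bot]:,.0f} KB")

Anything that clears the cutoff and sends you next to no referral traffic is a candidate for a Disallow block of its own.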