A robots.txt file is a simple, plain text file that you store on your website. Its purpose is to give instructions to robots (also known as "spiders", the programs that retrieve content for search engines like Google and Fast), detailing what they should not index on a website. If you are unable to create or use a robots.txt file, you might find this meta tags tutorial useful.
A robots.txt file (a document detailing the robots.txt exclusion standard is available) is always stored in the root of your site, and is always named in lower case. For example, if a website at http://www.addedbytes.com/ had a robots.txt file it would be found at http://www.addedbytes.com/robots.txt - and only there. Spiders will always search for it in the root of a domain, and will never look for it elsewhere. You cannot specify a different name or location for a robots.txt file.
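Because the location is fixed, the robots.txt address for any site can be derived from any page's URL. A minimal sketch in Python (the function name and example URLs here are just for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the one and only place a site's robots.txt can live:
    /robots.txt at the root of the page's scheme and host."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.addedbytes.com/some/deep/page.htm"))
# http://www.addedbytes.com/robots.txt
```

Note that the path of the original page is discarded entirely; only the scheme and host survive.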
A robots.txt file should be viewed like a list of recommendations. By including one, you are asking the spiders that visit your site to ignore certain things that you would prefer not to be indexed, but they are not obliged to pay attention to that. If you really do not want things indexed, it is far better to disallow access with server-side programming than a robots.txt file.
Writing a robots.txt File
A robots.txt file is a list of instructions. Each instruction is divided into two parts. The first, "User-agent" (case-sensitive), tells robots reading the file which robots should pay attention to the instructions that follow. Usually, this will be a "*", which is a wild card meaning "all robots". The wild card character can only be used in this context, except in the case of Googlebot, which does support it in other places (see User-Agent Specific Commands).
Following this line specifying a user agent are the rules themselves, which must appear on the lines immediately after the "User-agent" instruction. There can be no blank lines within a set of instructions, and there must be at least one blank line separating sets of instructions. The instructions are usually of the format: "Disallow: /folder/" or "Disallow: /file.htm". There can be only one instruction per line, and you should avoid putting spaces before the instructions (this isn't specifically allowed or disallowed, but it is probably best not to take the risk).
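Python's standard library happens to include a parser for exactly this format, which is a convenient way to see how a robot interprets a set of rules. A quick sketch, using a hypothetical rule set and example URLs:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical rule set in the format described above
rules = [
    "User-agent: *",
    "Disallow: /folder/",
    "Disallow: /file.htm",
]

parser = RobotFileParser()
parser.parse(rules)

# Anything under /folder/ and the named file are off-limits; the rest is fine
print(parser.can_fetch("SomeBot", "http://www.example.com/folder/page.htm"))  # False
print(parser.can_fetch("SomeBot", "http://www.example.com/index.htm"))        # True
```

This only tells you how a well-behaved parser reads your file; as noted above, a robot is free to ignore it.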
Anything following a hash symbol "#" is considered a comment and ignored. At least, according to the standards. Rumours abound, though, that in the past some engines have ignored a line with a hash symbol on it wherever it is placed, so you may want to place each comment on a line by itself.
For example, the following robots.txt file is technically valid:
# My robots.txt file
User-agent: *
Disallow: /folder/ # My private folder
Disallow: /file.htm # My private file
If you want to prevent robots from indexing anything at all on your site, you could add the following to your robots.txt file:
User-agent: *
Disallow: /
If you want to prevent all robots, except for a particular one or two, from accessing a folder, you could write a file like this, which will allow GoogleBot to index everything on your site, but prevent all other robots from accessing the folder called, imaginatively, "folder":
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /folder/
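You can confirm that a parser reads rules like these the way you intend. A sketch using Python's standard library parser (the other bot's name and the site are placeholders):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: googlebot",
    "Disallow:",
    "",
    "User-agent: *",
    "Disallow: /folder/",
])

# googlebot matches the first set of rules; an empty Disallow blocks nothing
print(parser.can_fetch("googlebot", "http://www.example.com/folder/page.htm"))  # True
# Any other robot falls through to the "*" rules and is kept out of /folder/
print(parser.can_fetch("SomeOtherBot", "http://www.example.com/folder/page.htm"))  # False
```

Note the blank line between the two sets of instructions, as required by the format described earlier.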
Please note: Many people believe that it is necessary to define the robot-specific rules before the general rules. This is not required by the robots.txt exclusion standard; however, there is no evidence of it causing problems, so it may be worth doing if there is a small chance it will help things work as you intend.
Once you have written a robots.txt file, it is often a good idea to run it through a validator to check for errors, as they may do considerable harm if they prevent your site from being indexed. SearchEngineWorld's robots.txt validator is the most proficient of those available, or if you prefer, there is a validator that understands more unusual commands like Crawl-delay available as well.
This is the robots.txt file for AddedBytes.com. As you can see, I have disallowed the indexing of a few files, but not many. Specifically, I have asked Google not to index "404.php", which is the page a user is redirected to if a page is not found, and "friend.php", which is linked to from every page, but is there to allow users to refer friends to the site, and so should not really be indexed.
User-agent: *
Disallow: /404.php
Disallow: /friend.php
This file, from eBay, is again quite short, and simply specifies a few folders that should not be indexed:
User-agent: *
Disallow: /help/confidence/
Disallow: /help/policies/
Disallow: /disney/
As you can see, Google will still list pages excluded by robots.txt, as Google is still aware they exist. However, Google will not index the content of the page and the page will not show up in searches except where a search includes the address of the excluded page.
Blank robots.txt files
It may be that you do not want to prevent spiders from indexing anything on your site. If that is the case, you should still add a robots.txt file, but an empty one, of this format:
User-agent: *
Disallow:
This prevents your server from returning a 404 error every time a spider requests the robots.txt file. Adding at least a blank robots file is basically just good practice, but not essential.
You may be thinking that adding the addresses of folders you do not wish robots to index is a good way to prevent spiders from accidentally indexing sensitive areas of your site, like an administration area. While this is true, remember that anybody at all can view your robots.txt file, and therefore find the address(es) you'd rather were not indexed. If that includes your administration area, you may have saved them the trouble of searching for it.
There have been websites online with unprotected administration areas, hidden in unusually named folders for "security" reasons, whose owners then added the name of that folder to their robots.txt file, opening up their admin area to anyone who wanted to have a poke around.
You must also be careful when writing your robots.txt file. Robots will usually err on the side of caution. If they do not recognise a command, they may well assume you meant them to stay away. Syntax errors in a robots.txt file can prevent your entire site from being indexed, so check it thoroughly before uploading it!
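Short of running a full validator, even a small script can catch the most common slips (a missing colon, a misspelled field) before you upload. A rough sketch; it only knows the fields discussed in this article, so treat anything it flags as a prompt to double-check rather than a verdict:

```python
# Fields this article covers; a real validator would know more
KNOWN_FIELDS = {"user-agent", "disallow", "crawl-delay"}

def check_robots(text):
    """Return a list of suspicious lines in a robots.txt file."""
    problems = []
    for number, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # discard comments
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: no ':' separator")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unrecognised field '{field}'")
    return problems

print(check_robots("User-agent: *\nDisallow: /folder/"))   # [] - nothing suspicious
print(check_robots("User-agent: *\nDissalow: /folder/"))   # flags the misspelled field
```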
User-Agent Specific Commands
Googlebot has no extra commands specific to it, however it is allegedly a little brighter than the average crawler. Googlebot will supposedly understand wild card characters (*) in the "Disallow" field of the robots.txt file. However, Googlebot is the only engine even rumoured to be able to do this, so you would be wise to avoid using wild cards in the disallow field wherever possible.
MSNBot and Slurp
User-Agent: msnbot
Crawl-Delay: 10

User-Agent: Slurp
Crawl-Delay: 10
The above code is specific to MSN's spider, "MSNBot", and Inktomi's spider, "Slurp", and instructs the spiders to wait the specified amount of time, in seconds (10 seconds above, default is 1 second if not specified) before requesting another page from your site. MSNBot and Slurp have been known to index some sites very heavily, and this allows webmasters to slow down their indexing speed.
You could technically use this command with a user agent of "*" as well - the robots.txt exclusion standard instructs robots to just ignore commands they do not understand. However, if a robot sees something it does not understand in a robots.txt file, it may simply not index your site. If you use the "Crawl-Delay" command, you would be wiser to specify the user agents it should apply to.
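Python's standard library parser also understands this extension (via crawl_delay, available since Python 3.6), which makes it easy to confirm the delay reads as intended. A sketch, with a placeholder name for the other bot:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: msnbot",
    "Crawl-delay: 10",
])

print(parser.crawl_delay("msnbot"))        # 10
print(parser.crawl_delay("SomeOtherBot"))  # None - no delay specified for it
```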
List of User-Agent Names
- Google: "googlebot"
- Google's Image Search: "Googlebot-Image"
- MSN: "msnbot"
- Inktomi: "Slurp"
- AllTheWeb: "fast"
- AskJeeves: "teomaagent1" or "directhit"
- Lycos: "lycos"