A robots.txt file is a simple, plain text file that you store on your website. Its purpose is to give instructions to robots (also known as "spiders", programs that retrieve content for search engines like Google and Fast) detailing what they should not index on a website. If you are unable to create or use a robots.txt file, you might find this meta tags tutorial useful.
A robots.txt file (a document detailing the robots.txt exclusion standard is available) is always stored in the root of your site, and is always named in lower case. For example, if a website at http://www.addedbytes.com/ had a robots.txt file it would be found at http://www.addedbytes.com/robots.txt - and only there. Spiders will always search for it in the root of a domain, and will never ever look for it elsewhere. You cannot specify a different name or location for a robots.txt file.
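Because the location is fixed, the address of a site's robots.txt file can be worked out from the address of any page on that site. The following is a minimal sketch in Python; the function name robots_txt_url is made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.addedbytes.com/seo/robots-txt-file/"))
# http://www.addedbytes.com/robots.txt
```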
A robots.txt file should be viewed like a list of recommendations. By including one, you are asking the spiders that visit your site to ignore certain things that you would prefer not to be indexed, but they are not obliged to pay attention to that. If you really do not want things indexed, it is far better to disallow access with server-side programming than a robots.txt file.
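As a rough illustration of the server-side approach, the sketch below uses Python's standard http.server module to refuse requests for a folder unless the visitor supplies a password. The folder name, user name and password are hypothetical placeholders, and a real site would more likely rely on the access controls built into its web server or framework.

```python
import base64
import hmac
from http.server import BaseHTTPRequestHandler

PROTECTED_PREFIX = "/private/"             # hypothetical folder to protect
USERNAME, PASSWORD = "admin", "change-me"  # hypothetical credentials


def is_authorised(path, auth_header):
    """Return True if a request for path may proceed."""
    if not path.startswith(PROTECTED_PREFIX):
        return True  # public content needs no credentials
    token = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
    expected = "Basic " + token
    # Compare in constant time so the check does not leak timing information.
    return hmac.compare_digest((auth_header or "").encode(), expected.encode())


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not is_authorised(self.path, self.headers.get("Authorization")):
            # Robots and people alike are refused, whatever robots.txt says.
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="private"')
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")
```

To try it locally, the handler can be passed to http.server.HTTPServer, for example HTTPServer(("127.0.0.1", 8000), Handler).serve_forever().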
Writing a robots.txt File
A robots.txt file is a list of instructions. Each set of instructions is divided into two parts. The first, a "User-agent" line, tells robots reading the file which robots should pay attention to the instructions that follow. (The exclusion standard treats field names as case-insensitive, but sticking to the conventional capitalisation is the safest choice.) Usually, the value will be a "*", which is a wild card meaning "all robots". The wild card character can only be used in this context, except in the case of Googlebot, which reportedly supports it in other places (see User-Agent Specific Commands).
Following this line specifying a user agent are the rules themselves. The rules that apply to a defined user agent must be defined on the lines following the "User-agent" instruction. There can be no blank lines within each set of instructions, and there must be at least one blank line separating sets of instructions. The instructions are usually of the format: "Disallow: /folder/" or "Disallow: /file.htm". There can only be one instruction per line, and you should really avoid putting spaces before the instructions (though this isn't specifically allowed or disallowed, it is probably best to avoid taking a risk).
Anything following a hash symbol "#" is considered a comment and ignored. At least, according to the standards. Rumours abound, though, that in the past some engines have ignored a line with a hash symbol on it wherever it is placed, so you may want to place each comment on a line by itself.
For example, the following robots.txt file is technically valid:
# My robots.txt file
User-agent: *
Disallow: /folder/ # My private folder
Disallow: /file.htm # My private file
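To see how a robot might read the file above, here is a small sketch using the RobotFileParser class from Python's standard library. The robot name "examplebot" and the page addresses are invented for the illustration, and real search engine spiders use their own parsers, which may behave differently.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
# My robots.txt file
User-agent: *
Disallow: /folder/ # My private folder
Disallow: /file.htm # My private file
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())  # comments are stripped while parsing

print(parser.can_fetch("examplebot", "/folder/page.htm"))  # False
print(parser.can_fetch("examplebot", "/file.htm"))         # False
print(parser.can_fetch("examplebot", "/index.htm"))        # True
```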
If you want to prevent robots from indexing anything at all on your site, you could add the following to your robots.txt file:
User-agent: *
Disallow: /
If you want to prevent all robots, except for a particular one or two, from accessing a folder, you could write a file like this, which will allow GoogleBot to index everything on your site, but prevent all other robots from accessing the folder called, imaginatively, "folder":
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /folder/
Please note: Many people believe that it is necessary to define the robot-specific rules before the general rules. This is not necessary according to the robots.txt exclusion standard. However, there is no evidence of it causing problems, so it may be worth doing if there is even a small chance it will help things to work as you intend.
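The effect of the googlebot exception, and of the order of the two sets of rules, can be checked with Python's standard RobotFileParser. This is only a sketch: "examplebot" is an invented robot name, and the result shows how this one parser behaves, not how any particular search engine does.

```python
from urllib.robotparser import RobotFileParser

SPECIFIC_FIRST = """\
User-agent: googlebot
Disallow:

User-agent: *
Disallow: /folder/
"""

GENERAL_FIRST = """\
User-agent: *
Disallow: /folder/

User-agent: googlebot
Disallow:
"""


def load(text):
    """Parse robots.txt text and return the ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


for text in (SPECIFIC_FIRST, GENERAL_FIRST):
    rules = load(text)
    print(rules.can_fetch("googlebot", "/folder/page.htm"),
          rules.can_fetch("examplebot", "/folder/page.htm"))
# Both orderings print: True False
```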
Once you have written a robots.txt file, it is often a good idea to run it through a validator to check for errors, as they may do considerable harm if they prevent your site from being indexed. SearchEngineWorld's robots.txt validator is the most proficient of those available, or if you prefer, there is a validator that understands more unusual commands like Crawl-delay available as well.
Example Files
This is the robots.txt file for AddedBytes.com. As you can see, I have disallowed the indexing of a few files, but not many. Specifically, I have asked all robots not to index "404.php", which is the page a user is redirected to if a page is not found, and "friend.php", which is linked to from every page, but is there to allow users to refer friends to the site, and so should not really be indexed.
User-agent: *
Disallow: /404.php
Disallow: /friend.php
This file, from eBay, is again quite short, and simply specifies a few folders that should not be indexed:
User-agent: *
Disallow: /help/confidence/
Disallow: /help/policies/
Disallow: /disney/
Google will still list pages excluded by robots.txt, because it still knows they exist. However, Google will not index the content of those pages, and they will not show up in searches except where a search includes the address of the excluded page.
Blank robots.txt files
It may be that you do not want to prevent spiders from indexing anything on your site. If that is the case, you should still add a robots.txt file, but an empty one, of this format:
User-agent: *
Disallow:
This prevents spiders from generating a 404 error when the robots.txt file isn't found. It is basically just good practice to add a blank robots file, at the least, but not essential.
Be Careful
You may be thinking that adding the addresses of folders you do not wish robots to index is a good way to prevent spiders from accidentally indexing sensitive areas of your site, like an administration area. While this is true, remember that anybody at all can view your robots.txt file, and therefore find the address(es) you'd rather were not indexed. If that includes your administration area, you may have saved them the trouble of searching for it.
There have been websites with unprotected administration areas, hidden in unusually named folders for "security" reasons, whose owners added the name of the folder to their robots.txt file, opening up the admin area to anyone who wanted to have a poke around.
You must also be careful when writing your robots.txt file. Robots will usually err on the side of caution. If they do not recognise a command, they may well assume you meant them to stay away. Syntax errors in a robots.txt file can prevent your entire site from being indexed, so check it thoroughly before uploading it!
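As a rough idea of the kind of checks worth making before uploading, here is a small sketch of a checker in Python. It is not a substitute for a proper validator: the function name lint_robots_txt is invented, and it only knows the three field names used in this article.

```python
# Field names used in this article; a real validator would know more.
KNOWN_FIELDS = {"user-agent", "disallow", "crawl-delay"}


def lint_robots_txt(text):
    """Return a list of (line number, problem) pairs found in text."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            in_record = False  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0]  # drop any comment
        if not line.strip():
            continue  # comment-only line
        if line[0].isspace():
            problems.append((number, "space before the instruction"))
        field, separator, _value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append((number, "missing ':' after the field name"))
        elif field not in KNOWN_FIELDS:
            problems.append((number, f"unrecognised field '{field}'"))
        elif field == "user-agent":
            in_record = True
        elif not in_record:
            problems.append((number, f"'{field}' before any User-agent line"))
    return problems


bad = "Disallow: /a\nUser-agent: *\nDisalow: /b\nNoindex /c\n"
for number, problem in lint_robots_txt(bad):
    print(number, problem)
```

Run on the deliberately broken text above, it reports lines 1, 3 and 4.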
User-Agent Specific Commands
GoogleBot
Googlebot has no extra commands specific to it. However, it is allegedly a little brighter than the average crawler: Googlebot will supposedly understand wild card characters (*) in the "Disallow" field of the robots.txt file. Googlebot is the only engine even rumoured to be able to do this, so you would be wise to avoid using wild cards in the disallow field wherever possible.
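To make the idea of a wild card in a "Disallow" rule concrete, the sketch below shows one way such a rule could be matched against a page address. It assumes that "*" stands for any run of characters and that a rule matches from the start of the path; it is not taken from Googlebot or any other crawler, and the function name is invented.

```python
import re


def wildcard_rule_matches(pattern, path):
    """Return True if path is covered by a rule that may contain '*'."""
    # Escape each literal chunk, then let '*' stand for "anything".
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    # re.match anchors at the start only, so the rule acts as a prefix.
    return re.match(regex, path) is not None


print(wildcard_rule_matches("/folder/", "/folder/page.htm"))  # True
print(wildcard_rule_matches("/*.pdf", "/docs/report.pdf"))    # True
print(wildcard_rule_matches("/*.pdf", "/docs/report.htm"))    # False
```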
MSNBot and Slurp
User-Agent: msnbot
Crawl-Delay: 10

User-Agent: Slurp
Crawl-Delay: 10
The above code is specific to MSN's spider, "MSNBot", and Inktomi's spider, "Slurp", and instructs the spiders to wait the specified amount of time, in seconds (10 seconds above, default is 1 second if not specified) before requesting another page from your site. MSNBot and Slurp have been known to index some sites very heavily, and this allows webmasters to slow down their indexing speed.
You could technically use this command with a user agent of "*" as well - the robots.txt exclusion standard instructs robots to just ignore commands they do not understand. However, if a robot sees something it does not understand in a robots.txt file, it may just not index your site. If using the "Crawl-Delay" command, you would be wiser to specify the user agents it should apply to.
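For anyone writing their own well-behaved robot, Python's standard RobotFileParser (version 3.6 or later) can read the "Crawl-Delay" value back out of a file. This is a sketch only; "examplebot" is an invented name for a robot that the file does not mention.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-Agent: msnbot
Crawl-Delay: 10

User-Agent: Slurp
Crawl-Delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

print(parser.crawl_delay("msnbot"))      # 10
print(parser.crawl_delay("Slurp"))       # 10
print(parser.crawl_delay("examplebot"))  # None, no delay was requested
```

A polite robot would then pause for that many seconds between requests, for example with time.sleep.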
List of User-Agent Names
- Google: "googlebot"
- Google's Image Search: "Googlebot-Image"
- MSN: "msnbot"
- Inktomi: "Slurp"
- AllTheWeb: "fast"
- AskJeeves: "teomaagent1" or "directhit"
- Lycos: "lycos"
26 Comments
Good article, Dave. I think many people, after writing their first robots.txt file wonder if they've done it right. Hope you don't mind my adding a URL, but you can validate your robots.txt file here: http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
#1, Garrick Saito, United States, 20 July 2004.
I don't mind at all, Garrick, and will add a clickable link to the article now. Thank you.
#2, Dave Child, United Kingdom, 21 July 2004.
I really like your article, it's a clear and comprehensive resource about robots.txt files. Can I advise you to add a link to the official homepage of Robots.txt Exclusion standard? It's an important resource for webmasters: http://www.robotstxt.org/wc/norobots.html
Here is also an up-to-date robots.txt checker that handles Crawl-delay command and uncommon syntax errors: http://tool.motoricerca.info/robots-checker.phtml
#3, Mike Garret, Italy, 1 August 2004.
Thanks Mike, I have added both of those links to the article.
#4, Dave Child, United Kingdom, 2 August 2004.
At the moment I have no robots file.
Until I understand better what it is all about, is it better to just have no robots file and allow the search engines to crawl however they wish to ~ if at all?
Cheers, Jonathan
#5, jonathan, United Kingdom, 21 August 2004.
Great article. I'd mention that some of the logfile analyzer programs like Analog (http://www.analog.cx/) have files that list many of the robot agent strings in current use, then the reports can break out traffic by browsers and by what robots are visiting.
That info can also be useful when writing your robots.txt file.
Note also that googling an unfamiliar agent string will usually bring up sites listing robots, with info about what that robot is doing.
#6, Jeff Wilkinson, United States, 5 May 2005.
Let me add another robot id tag for anyone who does not want to see his/her website archived by the WayBack Machine at http://www.archive.org.
Just add the following 2 lines to your robots.txt.
User-agent: ia_archiver
Disallow: /
#7, Jan Krejci, Czech Republic, 15 July 2005.
Great Info!
#8, Vijay, India, 9 November 2005.
Im new at all this, i read this article but im still kinda lost and its probably me. ha! where do you put this robot.txt file at on your ftp directory? in what folder? sorry if im a pain, just a newbie!!! with no one to help i keep gettin that damn archive.org on my site....i dont want any at all....how do i get rid of all and add it is there a tutorial, i use ipb....
#9, Suz, United States, 22 July 2006.
I run a portal on an engine. That means, all pages look like /Default.aspx?something. So, if I want to restrict the robots from visiting /Default.aspx?ctl=manager, may I Disallow it like that? In other words, do the robots understand query strings, or will I lose all my listings?
Thanks!
Ulu
#10, Ulu, Russian Federation, 6 August 2006.
Good question, Ulu.
Also just a heads-up Dave, the link to the Robots.txt validator is broken, or rather it leads to a redirect which leads to the webmasterworld donate (login?) page.
Love your site, good article.
#11, Sean, Canada, 11 September 2006.
please suggest me i m developing this company site and i m nil in knowledge of SEO but i know what is seo? but this company site to come in first page of google search engine when a person search with a keyword of OUTSOURCE and then it should come in first page please give suggestion in this how i have to proceed in design so that they can reach that position.
#12, mohammad ali, India, 19 October 2006.
Mohammad,
There is one sure fire way to "come in first page of google search engine when a person search with a keyword of OUTSOURCE".
* * * BUY THE KEYWORD * * *
#13, Commentar, Netherlands, 31 October 2006.
This article really helps me in writing robots.txt file.
#14, Haris khan, Australia, 24 September 2007.
Fix that link Dave.
http://www.addedbytes.com/seo/robots-txt-file/
That's a free tip :0)
#15, James Ryddel, United Kingdom, 18 October 2007.
can i know where to put robot.txt?
#16, Anonymous, China, 27 October 2007.
I instructed robot.txt to disallow indexing my website, but google still picked up and indexed all the pages even src and images forbidden directories. Did anybody have the same experience?
#17, Julian, United States, 6 December 2007.
I have had people stealing from and downloading my entire website. I have exact copies of it with a few minor changes and them claiming it as their own. I have put a great amount of time and money into creating it. I am looking for something to lock the pages or encrypt them so they can not be stolen. I understand that if you encrypt the html with a program that your page or website can not be crawled or indexed. Is this true? Any suggestions on how to protect my work and still be indexed so my site can be seen? Are there any programs that you would suggest? Thanks!
#18, Caren, United States, 3 February 2008.
first off great site, I've stumbled upon you many times in the past and have always found great info here.
secondly can you help me with this problem:
I'm trying to disallow a directory /clients/ but allow only one file in the directory /clients/index.php
is this possible with
allow /clients/index.php
disallow /clients
does it matter the order or will this just not work at all.
brent @ mimoymima.com (http://www.mimoymima.com)
#19, Brent Lagerman, United States, 7 April 2008.
caren, google ".htaccess rippers". It's another file you put in your root, it stops bad boys ripping off your site.
#20, johan, United Kingdom, 13 June 2008.
Thanks for this article. Now I'm ready to make my own robots.txt. Just because I'm curious: can someone answer the question of brent?
#21, Christian, Austria, 6 April 2009.
I just made a site with some amazon ads on it which are iframes (and they don't validate). Do they hurt my site ranking? And if so how can I use a robot.txt file to tell them not to index the iframe?
Sorry if this is a dumb question I am a total newbie.
#22, Muffy, United States, 23 April 2009.
Great Article!
Simple, understandable and practical.
Thanks!
#23, Ami Daniel, Unknown, 29 May 2009.
Hi, NEW here! I was just poking around, did a site Optimization, Meta Tag Checker, etc...then, heard of the ROBOT FILE...
Can anyone help me write this, or one, that is? And where would I put it first, then to add to site. Hope I worded this right...to anyone out there!!!
Thanks!
#24, Jan, US, 4 April 2010.
You can prevent search engines from crawling into iframes by adding the following line(s) in your robots.txt file:
Disallow: http://www.example.com/ExamplePage.htm
#25, KG Karaoke, USA, 7 September 2010.
Do I still need a robots.txt file if I already use a WordPress plugin such as WordPress SEO by Yoast? I'm never sure of how much (if at all) this overrides what you can do with robots.txt.
#26, Katie Keith, United Kingdom, 20 December 2012.