What is Robots.txt? How It Works? What is REP?

Written by Alfonso Muñoz on July 4, 2008 – 11:00 am

Surely you have heard about a file named robots.txt at some point. The name sounds scary, because robots are complex machines, but don’t worry: we are not going to meet any. This file has been part of how search engines work for many years.

We must start by talking about the Robots Exclusion Protocol, also known as REP. This protocol gives webmasters some control over which parts of a site (pages) search engines may crawl and which they may not. Keep in mind that it is a convention that well-behaved crawlers follow voluntarily, not an access control, and that a blocked page can still appear in search results if other sites link to it. There are several ways to achieve this goal, but the best known is the robots.txt file.
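To make the idea concrete, here is a minimal sketch of what a polite crawler does with these rules before it fetches anything. It uses Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` name, the `crawlable` helper and the example.com URLs are all made up for illustration; a real crawler would download the rules from the site instead of using a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real crawler would download
# them from the site's /robots.txt instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawlable(urls, robots_txt, agent="ExampleBot"):
    """Return only the URLs that the given rules let `agent` fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


urls = [
    "http://www.example.com/",
    "http://www.example.com/private/notes.html",
    "http://www.example.com/about.html",
]
allowed = crawlable(urls, ROBOTS_TXT)
print(allowed)
```

The crawler simply drops the URL under /private/ from its list. Nothing on the server enforces this; it only works because the crawler chooses to check.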

Why would we want to hide some parts of our websites from search engines?

There are several reasons. Some important ones are:

  • We don’t want to pass Google juice (link value) to pages like login or admin pages. It is a waste of PageRank (PR).
  • We want to avoid duplicate content, such as the category and tag sections.
  • We don’t want our image files to show up in searches, which helps reduce hotlinking. Hotlinking happens when, for instance, you publish a picture and another website uses it in its content but, instead of copying the file, links directly to the copy on your server, wasting your bandwidth.

How can I create my robots.txt? What content should it have?

This is a plain text (.txt) file, so you can create it with Notepad. If you use WordPress and have installed the Google Sitemap Generator plugin, you have the option of creating it automatically.

The file can be as short or as long as you want, but you don’t need to make it complicated. Basically there are two kinds of rules:

  • Allow: something. This means that you allow access to something, which could be a page, a file, a directory or a file extension.
  • Disallow: something. Similar to the point above, but in this case you deny access.

You can disallow access to a whole directory but use the Allow directive to allow access to a specific file in that directory. Let’s see my file, which is quite simple:
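Before looking at it, here is a small sketch of the "disallow a directory, allow one file" case, checked with Python's standard `urllib.robotparser`. The /downloads/ directory, the catalog.pdf file and the `ExampleBot` name are hypothetical. One assumption to be aware of: Python's parser applies the first rule that matches, so the Allow line is written before the Disallow line. Other crawlers may resolve conflicts differently (for example by preferring the most specific path), and this ordering works in both cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the whole directory except one file.
# Allow comes first because Python's parser uses the first match.
RULES = """\
User-agent: *
Allow: /downloads/catalog.pdf
Disallow: /downloads/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """True if the rules above let `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


results = {
    path: is_allowed(path)
    for path in (
        "/downloads/catalog.pdf",  # the explicit exception
        "/downloads/other.zip",    # rest of the directory
        "/index.html",             # not mentioned in the rules
    )
}
for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```

Paths that no rule mentions, like /index.html here, are allowed by default.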

Disallow: /wp-admin/   There is no point in letting the crawler enter here.
Disallow: /tag/     This is to avoid duplicate content.
Disallow: /category/     The same, to avoid duplicate content.
Disallow: /go/    This directory contains something very interesting that I’m going to talk about soon.

# BEGIN XML-SITEMAP-PLUGIN    This is a comment; crawlers ignore it.
Sitemap: http://www.binaryant.com/sitemap.xml.gz    This line tells the crawler where to find the sitemap.
# END XML-SITEMAP-PLUGIN    This is a comment; crawlers ignore it.

You can take a look at my robots.txt here. Other bloggers disallow many more parts, but for me this is enough. The decision depends on each site and its owner.
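If you want to check what a file like this actually blocks, you can test it offline. The sketch below feeds the rules from the listing above (without my notes) to Python's standard `urllib.robotparser`. Two things in it are assumptions and not part of the listing: the `User-agent: *` line, which I added because rules have to belong to a user-agent group and Python's parser ignores Disallow lines that appear without one, and the sample paths being tested, which are invented. The `site_maps()` method needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Rules from the listing in this post, without the notes.
# ASSUMPTION: the "User-agent: *" line is not shown in the listing;
# it is added here so that the Disallow lines belong to a group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /tag/
Disallow: /category/
Disallow: /go/

# BEGIN XML-SITEMAP-PLUGIN
Sitemap: http://www.binaryant.com/sitemap.xml.gz
# END XML-SITEMAP-PLUGIN
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical paths, chosen only to exercise the rules.
checks = {
    path: parser.can_fetch("ExampleBot", path)
    for path in (
        "/wp-admin/options.php",
        "/tag/seo/",
        "/category/seo/",
        "/2008/07/some-post/",
    )
}
sitemaps = parser.site_maps()  # Python 3.8+

for path, ok in checks.items():
    print(path, "allowed" if ok else "blocked")
print("Sitemaps:", sitemaps)
```

Only the last path, which stands in for a normal post, comes back as allowed. The sitemap URL is reported separately because the Sitemap line is not tied to any user-agent group.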


Posted in SEO

3 Comments to “What is Robots.txt? How It Works? What is REP?”

  1. Kim Woodbridge Says:

    Thanks for this tutorial. I am definitely going to work on mine later. I have one question - where does it go? At the root level of your website?

  2. Alfonso Muñoz Says:

    Hi Kim! I had the comments in moderation mode because I was away on a short holiday. I’m glad to hear that this article helped you.

    About your question: yes, you have to put this file at the root level of your site :) .

  3. Kim Woodbridge Says:

    Hi Alfonso,

    Thanks! And I knew via Twitter that you were away - I hope you enjoyed your trip.

    Kim
