A Guide to Robots.txt

Posted on: September 17, 2016 by Dimitar Ivanov

If you've ever used an SEO tool, you've probably seen a positive score for having a robots.txt file, or a negative one for a missing robots.txt. Although the presence or absence of robots.txt is not a ranking signal in the SERPs, it is still an important part of the SEO effort to optimize a website.

What is robots.txt?

It's a plain text file used to instruct search engines whether to include a web resource in their index or to stay away from it. The robots.txt file must be placed in the web root directory of a host, i.e. /robots.txt
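For instance, whichever page a crawler starts from, the robots.txt location is derived from the host's root. A quick sketch with Python's standard urllib.parse (example.com is a placeholder host):

```python
from urllib.parse import urljoin

# robots.txt always lives at the web root of the host,
# no matter which page the crawler started from
page = "https://example.com/blog/2016/post.html"  # placeholder URL
print(urljoin(page, "/robots.txt"))
# → https://example.com/robots.txt
```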

What is a robot?

A robot is an automatic program or service that crawls the web and gathers information later used by search engines to update their indexes and provide relevant search results. When a robot visits a website, it tries to find and read the robots.txt file. Based on the instructions found there, the crawler will either add a web page or resource to its index or stay away from it. All search engines (e.g. Google, Baidu, Yandex, DuckDuckGo, Bing, Yahoo!, etc.) have their own robots. Other common names for a robot are: bot, spider and crawler.

Robots.txt syntax

The robots exclusion standard defines the following directives:

  • User-agent - the name of a web crawler. A wildcard * stands for all robots.
  • Disallow - specifies paths that must not be accessed by the given robots. If no path is specified, this directive has no effect.
  • Allow - specifies paths that may be accessed by the given robots, typically as an exception within an otherwise disallowed path. If no path is specified, this directive has no effect.

Robots.txt examples

  • Give all robots access to the whole website:
    User-agent: *
    Disallow:
    
  • Refuse all robots access to the whole website:
    User-agent: *
    Disallow: /
    
  • Prevent all robots from accessing given folders or single files:
    User-agent: *
    Disallow: /uploads/
    Disallow: /img/private.jpg
    
  • Disallow specific robots from accessing a web resource:
    User-agent: Googlebot
    User-agent: Baiduspider
    Disallow: /cgi-bin/
    
  • Prevent a specific robot from accessing the whole website:
    User-agent: YandexBot
    Disallow: /
    
  • Instruct multiple user-agents with different rules:
    User-agent: *
    Disallow: /images/
    
    User-agent: Slurp
    Disallow: /
    
  • Tell all robots to stay away from the whole website except the home page:
    User-agent: *
    Disallow: /
    Allow: /index.html
    
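You can sanity-check rules like the ones above with Python's standard urllib.robotparser module. A small sketch, using paths that mirror the earlier examples:

```python
from urllib.robotparser import RobotFileParser

# Rules mirroring the "given folders or single files" example above
rules = """
User-agent: *
Disallow: /uploads/
Disallow: /img/private.jpg
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The home page is not covered by any Disallow rule
print(rp.can_fetch("*", "/index.html"))         # → True
# Anything under /uploads/ is blocked for all robots
print(rp.can_fetch("*", "/uploads/photo.jpg"))  # → False
```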

Robots Meta tag

To instruct web spiders, it is also possible to use an HTML meta tag. Note that this applies only to HTML documents; e.g. you can't use a meta tag for images, stylesheets or scripts.

<!-- Do not index this web page -->
<meta name="robots" content="noindex">

<!-- Index the content and do follow links -->
<meta name="robots" content="index,follow">

Robots header

Instead of the HTML meta tag, you could use the HTTP header X-Robots-Tag, sent from your web server via an .htaccess file or by a dynamic language such as PHP, Python, Ruby, etc.

X-Robots-Tag: noindex
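For example, assuming an Apache server with mod_headers enabled, an .htaccess file could attach the header to every response (a minimal sketch, not a complete configuration):

```apache
<IfModule mod_headers.c>
    # Ask crawlers not to index any response served from this directory
    Header set X-Robots-Tag "noindex"
</IfModule>
```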

Conclusion

When you need to hide a portion of a website from Google or other search engines, the right solution is to use a robots.txt file. But be careful with robots.txt: a small mistake can cost a lot, and could even keep a whole website out of the index. So after any change, it's recommended to always validate the robots.txt using a tool or service.


If you have questions about robots.txt, leave a comment below. Don't forget to share this article if you think it's worth others knowing about. Thanks so much for reading!
