A Guide to Robots.txt
If you have ever used an SEO tool, you have probably seen a positive score for having a robots.txt file, or a negative one for a missing robots.txt. Although the presence or absence of robots.txt is not a ranking signal in the SERPs, it is still an important part of the SEO effort to optimize a website.
What is robots.txt?
It's a plain text file used to instruct search engines whether to include a web resource in their index or to stay away from it. The robots.txt file must be placed in the web root directory of a host, e.g. https://example.com/robots.txt.
What is a robot?
A robot is an automatic program or service that crawls the web and gathers information later used by search engines to update their indexes and provide relevant search results. When a robot visits a website, it tries to find and read the robots.txt file. Based on the instructions found there, the crawler will add a web page or resource to its index or will stay away from it. All search engines (e.g. Google, Baidu, Yandex, DuckDuckGo, Bing, Yahoo!, etc.) have their own robots. Other common names for a robot are bot, spider, and crawler.
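To see this decision in code, here is a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard-library `urllib.robotparser`. The rules and the crawler name "MyBot" are made up for illustration, and the rules are parsed from a string rather than fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# A crawler downloads robots.txt and consults it before requesting pages.
# Here the rules come from a string instead of the network, and "MyBot"
# is a made-up crawler name.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The crawler checks each URL against the parsed rules.
print(parser.can_fetch("MyBot", "https://example.com/private/notes.html"))  # False
print(parser.can_fetch("MyBot", "https://example.com/index.html"))          # True
```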
The robots exclusion standard defines the following directives:
- User-agent - the name of a web crawler. The wildcard * stands for all robots.
- Disallow - specifies paths that must not be accessed by the given robots. If no path is specified, this directive has no effect.
- Allow - specifies paths that may be accessed by the given robots. If no path is specified, this directive has no effect.
- Gives all robots access to the whole website:

  User-agent: *
  Disallow:

- Refuses all robots access to the whole website:

  User-agent: *
  Disallow: /

- Prevents all robots from accessing given folders or single files:

  User-agent: *
  Disallow: /uploads/
  Disallow: /img/private.jpg

- Disallows specific robots from accessing a web resource:

  User-agent: Googlebot
  User-agent: Baiduspider
  Disallow: /cgi-bin/

- Prevents a specific robot from accessing the whole website:

  User-agent: YandexBot
  Disallow: /

- Instructs multiple user-agents with different rules:

  User-agent: *
  Disallow: /images/

  User-agent: Yahoo! Slurp
  Disallow: /

- Tells all robots to stay away from the whole website except the home page:

  User-agent: *
  Disallow: /
  Allow: /index.html
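Rules like the multiple-user-agent example above can be verified programmatically. A small sketch using Python's standard-library `urllib.robotparser` (the bot names are the ones from the example; the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Verify the multiple-user-agent example with Python's standard-library
# parser. Rules are parsed from a string, so no network access is needed.
rules = """\
User-agent: *
Disallow: /images/

User-agent: Yahoo! Slurp
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Yahoo! Slurp is kept out everywhere; other bots only out of /images/
print(parser.can_fetch("Yahoo! Slurp", "https://example.com/about.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/images/a.png"))   # False
print(parser.can_fetch("Googlebot", "https://example.com/about.html"))     # True
```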
Robots Meta tag
Web spiders can also be instructed with an HTML meta tag. Note that this applies only to HTML documents, i.e. you can't use a meta tag for images, stylesheets, or scripts.
<!-- Do not index this web page -->
<meta name="robots" content="noindex">

<!-- Index the content and follow the links -->
<meta name="robots" content="index,follow">
Instead of the HTML meta tag you could use the X-Robots-Tag HTTP header, sent from your web server via a .htaccess file or from a dynamic language such as PHP or Python.
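As a rough sketch of the header approach, here is a tiny Python standard-library server that attaches the X-Robots-Tag header to its responses. The handler name, port, and page content are illustrative, not a production setup:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch: send the robots directive as an HTTP response header instead of
# a meta tag. Unlike the meta tag, a header also works for non-HTML
# resources such as images or PDFs.
class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Same values as the robots meta tag
        self.send_header("X-Robots-Tag", "noindex")
        self.end_headers()
        self.wfile.write(b"<p>Not for search engines.</p>")

# To serve: HTTPServer(("localhost", 8000), NoIndexHandler).serve_forever()
```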
When you need to hide a portion of a website from Google or other search engines, the right solution is to use a robots.txt file. But be careful with robots.txt: a small mistake can cost a lot, and could even keep the whole website out of the index. So after any change it's recommended to validate the robots.txt file with a tool or service.
If you have questions about robots.txt, leave a comment below. Don't forget to share this article if you think it's worth others knowing about. Thanks so much for reading!