
The Ultimate Guide to Robots.txt for SEO

Do you want a search engine crawler to access only certain folders on your site? As an SEO, you may need exactly that.

Perhaps you want to block some pages of your website from being crawled, or keep search engines out of an entire directory. You can do that as well.

Webmasters have a variety of tricks up their sleeves for instructing search engine crawlers on how to crawl the pages of their website. Chief among these is the robots.txt file.

In this article, I am going to discuss all the nitty-gritty of the robots.txt file.


Let’s explore the topic in depth.

What Is a Robots.txt file?

A robots.txt file is a text file that follows a strict, machine-readable syntax intelligible to search engine spiders. These spiders are also called robots, hence the name of the file.

The file is created by webmasters to direct web robots on how to crawl the webpages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that determines how robots are supposed to crawl a website.

The REP also regulates how robots should access and index content and serve it to users. It includes directives such as subdirectory-, page-, or site-wide instructions for how search engines should treat links.

Basic Format of a Robots.txt File

User-agent: [user-agent name]

Disallow: /[URL string not to be crawled]

How does the Robots.txt file work?

A site owner has to direct web crawlers in certain situations. For this reason, they put their robots.txt file in the root directory of their site, e.g. https://www.yoursitename.com/robots.txt.

Bots that follow the REP will fetch and read this file before requesting any other file from the site. If the site doesn’t have a robots.txt file, the crawler will crawl the entire site; in the absence of the file, the crawler assumes the webmaster didn’t give any specific directions.
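If you are curious how an REP-compliant client reads this file, here is a minimal sketch using Python’s standard urllib.robotparser module. The domain is the placeholder from above, and the page URL is purely hypothetical:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.yoursitename.com/robots.txt")  # the file must live at the site root
parser.read()  # fetches and parses the file, just like an REP-compliant bot would

# True if the named user-agent is allowed to crawl the given URL
print(parser.can_fetch("Googlebot", "https://www.yoursitename.com/some-page.html"))

Note that this built-in parser implements only the basic protocol, so more advanced rules such as wildcards may need a dedicated library.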

Robots.txt file is made up of two basic parts:

  1. User-agent
  2. Disallow

User-Agent in Detail

User-agent is the name given to the spider that is being addressed. The user-agent line always has to come before the directive lines, and you have to follow this order for each set of directives. A very basic robots.txt file looks like this:

User-agent: Googlebot

Disallow: /

These directives instruct the user-agent to stay away from the entire server. As a result, it won’t crawl any page on the website.

If you want to instruct multiple robots, create a set of user-agent and disallow directives for each one.

User-agent: Googlebot

Disallow: /

User-agent: Bingbot

Disallow: /

The above directives tell both Google’s and Bing’s user-agents to avoid crawling the entire site.

If you instead want to allow crawling of your entire server, the directive should look like this:

User-agent: *

Disallow:

List of the most common search engine user-agents

Search Engine   User-Agent             Field
Baidu           baiduspider            General
Baidu           Baiduspider-image      Images
Baidu           Baiduspider-mobile     Mobile
Bing            Bingbot                General
Bing            msnbot                 General
Google          Googlebot-image        Images
Google          Googlebot-news         News
Google          Googlebot-video        Video
Google          Mediapartners-Google   AdSense
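As an illustration, you can combine one of these user-agents with a disallow rule. The folder name below is only a hypothetical example; replace it with a real directory on your site:

User-agent: Googlebot-image

Disallow: /photos/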


Disallow in Detail

Disallow is the second part of the robots.txt file. The directive forbids spiders from crawling certain webpages. You can set multiple disallow lines within each set of directives, but each set addresses only one user-agent.

Bots treat an empty disallow value as a directive that nothing is disallowed. As a result, they will crawl the entire site.

To block crawlers from crawling a specific page, use the webpage’s relative path in the disallow line:

User-agent: *

Disallow: /directory/page.html

You can block access to whole directories in the same way:

User-agent: *

Disallow: /folder1/

Disallow: /folder2/

Furthermore, a robots.txt file can block bots from crawling certain file types. This can be done by using a wildcard together with the file extension in the disallow line:

User-agent: *

Disallow: /*.ppt

Disallow: /duplicatecontent/copy*.html
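Keep in mind that a pattern like /*.ppt matches any URL that merely contains “.ppt”. Major crawlers such as Googlebot and Bingbot also recognize a trailing $ that matches only the end of a URL, so a stricter version of the first rule would be:

Disallow: /*.ppt$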

Pros and Cons of Using Robots.txt

Pros

  • Each website has an allowance for how many pages a search engine spider will crawl. SEO experts call this the crawl budget. You can spend this budget in the best way possible by blocking low-value sections of your site from the search engine, so that the budget is used on the sections that matter (see the example below).
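For example, a small sketch of this idea might look like the following. The folder names are purely hypothetical placeholders for low-value sections of a site:

User-agent: *

Disallow: /internal-search/

Disallow: /print-versions/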

Cons

  • Although you can block search engine crawlers from accessing certain web pages, you cannot stop those URLs from showing up in the SERPs; a disallowed page can still be indexed if other sites link to it.

Summary

The robots.txt file is one of the basic ways to tell a search engine where it can and can’t go on your website. In this article, I discussed everything you need to know about this useful file.
