Robots.txt is a text file that provides instructions to search engine robots about which pages they can and cannot crawl on a website.
Here's an example of what a Robots.txt file looks like:
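The paths and sitemap URL below are placeholders for illustration:

    # Applies to all crawlers
    User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php

    Sitemap: https://www.example.com/sitemap.xml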
What is Robots.txt?
Robots.txt is a text file that instructs search engines on which pages of a site they are permitted or prohibited from crawling. It employs "Allow" and "Disallow" directives to control access for specific bots or all bots.
Why Use Robots.txt?
Optimize Crawling:
By using robots.txt, website owners can control which pages search engines can crawl. This helps ensure that search engines focus on the most important content and avoid indexing unnecessary or less relevant pages.
Block Unwanted Pages:
The robots.txt file can be used to prevent search engines from accessing private, duplicate, or internal pages, such as admin panels or test sites, thereby maintaining privacy and avoiding potential indexing of non-essential content.
Hide Resources:
robots.txt can also be utilized to exclude specific files, such as PDFs, images, or videos, from search results. This helps keep these resources private or ensures that other, more important content is prioritized in search results.
How Does Robots.txt Work?
A robots.txt file sets out rules for search engine bots about which URLs they are permitted to crawl. When a bot visits a site, it first looks for the robots.txt file and follows its directives before crawling anything else. Here’s a simple example:
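In this minimal illustration, every crawler is told to stay out of one placeholder directory and is free to crawl everything else:

    User-agent: *
    Disallow: /private/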
Key Directives
User-Agent Directive
The "User-agent" directive specifies which search engine crawler (bot) the rules apply to and is the first line in each directive block. For example, to prevent Googlebot from accessing the WordPress admin page, the directive would be:
Disallow Directive
The "Disallow" directive specifies which parts of a site should not be accessed by search engine crawlers. Multiple "Disallow" directives can be included within a block to restrict access to various sections. If the "Disallow" line is left blank, it indicates that no pages are restricted, thereby granting crawlers full access to the site.
Examples:
- To allow all bots to crawl the entire site (see the first block below)
- To block all bots from crawling the site (see the second block below)
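Both blocks below address every crawler with the wildcard user-agent; only the Disallow value differs:

    User-agent: *
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow value (first block) places no restrictions, while a single slash (second block) blocks the entire site.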
Allow Directive
The "Allow" directive enables search engines to crawl specific subdirectories or pages within a directory that is otherwise disallowed. This is particularly useful for granting access to certain pages while blocking others in the same directory. For example, if Googlebot is to be blocked from accessing all blog posts except one, the directive would be configured as follows:
This setup blocks Googlebot from crawling all pages in the /blog/ directory, except for /blog/specific-post.html, which is explicitly allowed.
Sitemap Directive
The "Sitemap" directive specifies the location of the XML sitemap, directing search engines to the pages intended for indexing. This directive can be placed anywhere in the robots.txt file, whether at the top or bottom.
For optimal results, it is also advisable to submit the XML sitemap directly through each search engine's webmaster tools. Although search engines can discover the sitemap independently, direct submission speeds up the indexing process, ensuring that important pages are crawled and indexed more efficiently.
How to Find a Robots.txt File
The robots.txt file is hosted on the site's server, at the root of the domain, like any other file. To view the robots.txt file of any site, one can type the full homepage URL and append /robots.txt at the end.
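For example, for a site at the placeholder domain www.example.com, the file would be found at https://www.example.com/robots.txt.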
Crawl-Delay Directive
The "Crawl-delay" directive specifies a delay, in seconds, between successive crawl requests from bots to help prevent server overload and slowdowns. It's important to note that Google no longer supports this directive. For Googlebot crawl rate adjustments, use Google Search Console. However, Bing and Yandex still honor this directive. To implement a delay (e.g., 10 seconds) for a bot, you would configure it as follows:
How to Create a Robots.txt File
If a robots.txt file does not already exist, creating one is straightforward. One can use a robots.txt generator tool or create the file manually. Here are the steps to follow:
- Create the Robots.txt File
Create a .txt file using any text editor or browser (it is advisable to avoid Word, as it saves files in a proprietary format that can introduce stray characters). Name the document robots.txt, as it must be named exactly this way to function correctly.
- Add Directives
Crawlers disregard any lines that do not match a supported directive. For example, to prevent Google from crawling the /clients/ directory, which is for internal use only, the first directive group would be as follows:
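    User-agent: Googlebot
    Disallow: /clients/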
If additional instructions for Google are needed, they should be included on separate lines below.
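For instance, with a second, purely illustrative path added to the Googlebot group:

    User-agent: Googlebot
    Disallow: /clients/
    Disallow: /not-for-google/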
Once the specific instructions for Google are complete, a new directive group can be created for all search engines to prevent them from crawling the /archive/ and /support/ directories, which are private:
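    User-agent: *
    Disallow: /archive/
    Disallow: /support/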
After finalizing the directives, the sitemap can be added. The completed robots.txt file would look like this:
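(The /not-for-google/ path and the sitemap URL are placeholders.)

    User-agent: Googlebot
    Disallow: /clients/
    Disallow: /not-for-google/

    User-agent: *
    Disallow: /archive/
    Disallow: /support/

    Sitemap: https://www.example.com/sitemap.xml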