What is a robots.txt file?
A robots.txt file tells search engines your website’s rules of engagement. A big part of SEO is sending the right signals to search engines, and the robots.txt file is one of the ways to communicate your crawling preferences to them.
Search engines regularly check a website’s robots.txt file to see if there are any instructions for crawling the website. We call these instructions directives.
If there’s no robots.txt file present or if there are no applicable directives, search engines will crawl the entire website.
Although all major search engines respect the robots.txt file, they may choose to ignore it, or parts of it. While directives in the robots.txt file are a strong signal to search engines, it’s important to remember that they are a set of optional instructions rather than a mandate.
Why should you care about robots.txt?
The robots.txt file plays an essential role from an SEO point of view. It tells search engines how they can best crawl your website.
Using the robots.txt file you can prevent search engines from accessing certain parts of your website, prevent duplicate content issues, and give search engines helpful tips on how they can crawl your website more efficiently.
Be careful when making changes to your robots.txt though: this file has the potential to make big parts of your website inaccessible to search engines.
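Before you publish changes, it can pay to test the proposed file against the URLs you care about. Below is a minimal sketch using Python’s standard urllib.robotparser module; the robots.txt content, domain, and URLs are made-up placeholders, and the Disallow: /p rule is deliberately (and dangerously) broad:

from urllib import robotparser

# Hypothetical change: a single, overly broad Disallow rule.
proposed_robots_txt = """\
User-agent: *
Disallow: /p
"""

# Placeholder URLs that must stay crawlable after the change.
important_urls = [
    "https://www.example.com/",
    "https://www.example.com/products/",
]

parser = robotparser.RobotFileParser()
parser.parse(proposed_robots_txt.splitlines())

for url in important_urls:
    if parser.can_fetch("*", url):
        print(f"OK: {url} stays crawlable")
    else:
        print(f"Warning: {url} would be blocked for all crawlers")

In this sketch the homepage stays crawlable, but /products/ is caught by the broad /p rule, which is exactly the kind of surprise you want to catch before going live.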
What does a robots.txt file look like?
An example of what a simple robots.txt file for a WordPress website may look like:
User-agent: *
Disallow: /wp-admin/
Let’s explain the anatomy of a robots.txt file based on the example above:
- User-agent: the user-agent indicates for which search engines the directives that follow are meant.
- *: this indicates that the directives are meant for all search engines.
- Disallow: this is a directive indicating what content is not accessible to the user-agent.
- /wp-admin/: this is the path which is inaccessible for the user-agent.
In summary: this robots.txt file tells all search engines to stay out of the /wp-admin/ directory.
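If you want to verify this programmatically, the sketch below uses Python’s standard urllib.robotparser module against the example file (the domain is a placeholder):

from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# /wp-admin/ is off limits for every crawler; everything else is fine.
print(parser.can_fetch("Googlebot", "https://www.example.com/wp-admin/"))   # False
print(parser.can_fetch("Googlebot", "https://www.example.com/blog/post/"))  # True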
Let’s analyze the different components of robots.txt files in more detail:
- User-agent
- Disallow
- Allow
- Sitemap
- Crawl-delay
User-agent in robots.txt
Each search engine should identify itself with a user-agent. Google’s robots identify as Googlebot, Yahoo’s robots as Slurp, Bing’s robot as Bingbot, and so on.
The user-agent record defines the start of a group of directives. All directives in between the first user-agent and the next user-agent record are treated as directives for the first user-agent.
Directives can apply to specific user-agents, but they can also be applicable to all user-agents. In that case, a wildcard is used: User-agent: *.
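The sketch below illustrates this grouping with Python’s urllib.robotparser and a made-up robots.txt: Googlebot follows its own group, while every other crawler falls back to the wildcard group:

from urllib import robotparser

# Hypothetical robots.txt with one group for Googlebot and a wildcard group.
robots_txt = """\
User-agent: Googlebot
Disallow: /not-for-google/

User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot matches its own group, so only /not-for-google/ is off limits to it.
print(parser.can_fetch("Googlebot", "/not-for-google/"))  # False
print(parser.can_fetch("Googlebot", "/private/"))         # True
# Any other crawler falls back to the wildcard group.
print(parser.can_fetch("Bingbot", "/private/"))           # False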
Disallow directive in robots.txt
You can tell search engines not to access certain files, pages or sections of your website. This is done using the Disallow directive. The Disallow directive is followed by the path that should not be accessed. If no path is defined, the directive is ignored.
Example
User-agent: *
Disallow: /wp-admin/
In this example all search engines are told not to access the /wp-admin/ directory.
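The rule that an empty Disallow path is ignored can be checked with a small sketch (Python’s urllib.robotparser again, with placeholder paths):

from urllib import robotparser

def allowed(robots_txt, url, user_agent="*"):
    # Helper: parse a robots.txt string and check a single URL path.
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A Disallow directive with no path does not block anything.
print(allowed("User-agent: *\nDisallow:", "/any-page/"))                          # True

# A Disallow directive with a path blocks that path and everything below it.
print(allowed("User-agent: *\nDisallow: /wp-admin/", "/wp-admin/settings.php"))  # False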
Allow directive in robots.txt
The Allow directive is used to counteract a Disallow directive. The Allow directive is supported by Google and Bing. Using the Allow and Disallow directives together you can tell search engines they can access a specific file or page within a directory that’s otherwise disallowed. The Allow directive is followed by the path that can be accessed. If no path is defined, the directive is ignored.
Example
User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/
In the example above, search engines are not allowed to access the /media/ directory, except for the file /media/terms-and-conditions.pdf.
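You can confirm this behavior with Python’s urllib.robotparser (placeholder paths; note that the standard-library parser evaluates rules in file order, which for this particular file yields the same outcome as Google’s longest-match rule):

from urllib import robotparser

robots_txt = """\
User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "/media/terms-and-conditions.pdf"))  # True: explicitly allowed
print(parser.can_fetch("*", "/media/brochure.pdf"))              # False: /media/ is disallowed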
Important: when using Allow and Disallow directives together, be sure not to use wildcards since this may lead to conflicting directives.
Example of conflicting directives
User-agent: *
Allow: /directory
Disallow: *.html
Search engines will not know what to do with the URL http://www.domain.com/directory.html. It’s unclear to them whether they’re allowed to access it. When directives aren’t clear to Google, it goes with the least restrictive directive, which in this case means it would in fact access http://www.domain.com/directory.html.
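For reference, here is a simplified, unofficial sketch of how Google’s documented precedence (the longest matching rule path wins, and on a tie the least restrictive rule, i.e. Allow, wins) resolves this example; the wildcard handling is deliberately minimal:

import re

def rule_matches(rule_path, url_path):
    # '*' matches any sequence of characters, '$' anchors the end of the URL path.
    pattern = re.escape(rule_path).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None

def is_allowed(url_path, rules):
    # rules is a list of (directive, path) tuples, in file order.
    matching = [(directive, path) for directive, path in rules
                if rule_matches(path, url_path)]
    if not matching:
        return True  # no applicable rule means crawling is allowed
    # Longest (most specific) rule path wins; Allow beats Disallow on a tie.
    matching.sort(key=lambda rule: (len(rule[1]), rule[0] == "Allow"), reverse=True)
    return matching[0][0] == "Allow"

rules = [("Allow", "/directory"), ("Disallow", "*.html")]
print(is_allowed("/directory.html", rules))  # True: the Allow rule is the longer match

Even so, it’s safer to avoid such conflicts altogether than to rely on how any particular crawler resolves them.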
A separate line for each directive
Each directive should be on a separate line, otherwise search engines may get confused when parsing the robots.txt file.
Example of incorrect robots.txt file
Avoid a robots.txt file like this:
User-agent: * Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/
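For comparison, the same rules written correctly, with each directive on its own line:
User-agent: *
Disallow: /directory-1/
Disallow: /directory-2/
Disallow: /directory-3/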
How important is the robots.txt for SEO?
In general, the robots.txt file is very important for SEO purposes. For larger websites in particular, the robots.txt file is essential for giving search engines clear instructions about what content not to access.