The Complete Guide to robots.txt for SEO

Your robots.txt file tells search engines what they can and cannot crawl. Get it wrong and you risk blocking your entire site from Google — here's how to use it correctly.

By RnkRocket Team
May 28, 2026
13 min read

Key Takeaways

  • robots.txt controls which parts of your site search engine crawlers can access — it does not control indexing, only crawling (Google Search Central)
  • A single misconfigured robots.txt can de-index your entire website from Google — this is one of the most common and catastrophic technical SEO errors
  • Blocking a URL in robots.txt does not prevent it from appearing in Google's index if other sites link to it — use noindex for true indexing control
  • RnkRocket's site intelligence crawl checks your robots.txt for misconfigurations and blocked important pages automatically

Few files on your website carry as much risk as robots.txt. It is typically tiny — sometimes just a handful of lines — yet a single typo or misunderstanding can cause Google to stop crawling your entire site, leading to pages disappearing from search results within days.

We have seen this happen more times than we care to count. A developer adds a disallow rule before a site launch and forgets to remove it. An SEO plugin generates a robots.txt with an overly aggressive block. The result is always the same: traffic collapses, pages vanish from search, and the culprit takes days to identify because no one thinks to check a three-line text file.

This guide explains exactly how robots.txt works, how to write it correctly, and how to verify it is not quietly working against you.


What Is robots.txt?

robots.txt is a plain text file placed in the root of your website (e.g. `https://yourdomain.com/robots.txt`) that communicates instructions to automated crawlers, including search engine bots like Googlebot, Bingbot, and others.

It follows the Robots Exclusion Protocol, a convention that web robots have followed since the mid-1990s and that was formalised as RFC 9309 in 2022. The protocol is voluntary rather than enforced: crawlers choose to honour it. Most reputable crawlers do; malicious scrapers generally do not.

The file uses a simple syntax:

```
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://yourdomain.com/sitemap.xml
```

  • `User-agent` specifies which bot the rules apply to (`*` means all bots)
  • `Disallow` specifies paths the bot should not crawl
  • `Allow` (supported by Google) overrides a disallow for a specific sub-path
  • `Sitemap` declares the location of your XML sitemap
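To make the structure concrete, here is a minimal Python sketch that splits a robots.txt file into user-agent groups. The function name `parse_robots` and the layout of the returned data are illustrative assumptions, not part of any library, and the sketch handles only the four directives listed above.

```python
def parse_robots(text):
    """Split robots.txt text into rule groups plus a list of sitemap URLs.

    Illustrative sketch: handles User-agent, Allow, Disallow and Sitemap
    and silently ignores every other line.
    """
    groups = []      # each group: {"agents": [...], "rules": [(kind, path), ...]}
    sitemaps = []
    current = None
    last_was_agent = False

    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()

        if field == "user-agent":
            # Consecutive User-agent lines share a single group.
            if not last_was_agent:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value.lower())
            last_was_agent = True
        elif field in ("allow", "disallow"):
            if current is not None:
                current["rules"].append((field, value))
            last_was_agent = False
        elif field == "sitemap":
            sitemaps.append(value)   # Sitemap lines are independent of groups

    return groups, sitemaps


example = """\
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://yourdomain.com/sitemap.xml
"""
groups, sitemaps = parse_robots(example)
print(groups)
print(sitemaps)
```

Real parsers deal with more edge cases (byte-order marks, unknown directives, size limits), so treat this as a reading aid rather than a validator.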

The Critical Distinction: Crawling vs Indexing

This is the most important concept in this entire guide, and the one most often confused.

Blocking crawling does not block indexing.

If you disallow a URL in robots.txt, Googlebot will not crawl that URL. But if other websites link to that URL, Google may still discover it, list it in its index, and show it in search results — just without being able to read its content. This can result in pages appearing in Google with no title or description, just the URL and a snippet saying "A description for this result is not available because of this site's robots.txt."

This is the exact opposite of what most people intend when they block a URL. If you want a page kept out of the index, you need a `noindex` directive on the page itself — not a robots.txt rule.

The practical consequence:

| Goal | Correct approach |
| --- | --- |
| Prevent Googlebot crawling a section | robots.txt Disallow |
| Prevent a page appearing in search results | noindex meta tag or X-Robots-Tag header |
| Both not crawled and not indexed | Not achievable with robots.txt alone: put noindex on the page and leave it crawlable (the bot must be able to crawl it to see the noindex) |

You cannot effectively noindex a page you are blocking in robots.txt — because Google cannot crawl the page to read the noindex directive.
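The interaction between the two controls can be written out as a small decision function. This is a simplified model of the behaviour described in this section, not a statement of how Google implements it; the function name and the wording of the outcomes are assumptions made for illustration.

```python
def indexing_outcome(disallowed_in_robots, has_noindex):
    """Describe what a search engine can do with a URL under each combination."""
    if disallowed_in_robots:
        # The page is never fetched, so a noindex on it is never seen.
        return "not crawled; may still be indexed from external links"
    if has_noindex:
        return "crawled; kept out of the index"
    return "crawled; eligible for indexing"


for blocked in (True, False):
    for noindex in (True, False):
        print(f"blocked={blocked}, noindex={noindex}: {indexing_outcome(blocked, noindex)}")
```

Note that both rows with `blocked=True` give the same outcome: once crawling is blocked, the noindex makes no difference.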


robots.txt Syntax in Detail

User-agent Targeting

You can write rules for all bots or specific ones:

```
# Rules for all crawlers
User-agent: *
Disallow: /private/

# Rules only for Googlebot
User-agent: Googlebot
Disallow: /google-specific-block/

# Rules only for Bingbot
User-agent: Bingbot
Allow: /
```

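Each group starts with one or more `User-agent` lines, and the rules that follow apply to those agents until the next group begins. You can list multiple `User-agent` lines at the top of a group if you want the same rules to apply to several named bots. A crawler obeys only the most specific group that names it: in the example above, Googlebot follows its own group and ignores the `*` rules, so `/private/` is not blocked for Googlebot unless you repeat that rule in the Googlebot group.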
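A short Python sketch of group selection may help here. It assumes groups are already parsed into `(agent_names, rules)` tuples and matches crawler names exactly, which is a simplification: real crawlers match on product tokens. The name `select_group` is hypothetical.

```python
def select_group(groups, crawler_name):
    """Return the rules that apply to crawler_name.

    groups: list of (agent_names, rules) tuples in file order.
    A group that names the crawler takes priority; the '*' rules are used
    only when no named group matches, and are never merged in.
    """
    name = crawler_name.lower()
    named_rules, star_rules = [], []
    found_named = False
    for agents, rules in groups:
        lowered = [agent.lower() for agent in agents]
        if name in lowered:
            found_named = True
            named_rules.extend(rules)
        elif "*" in lowered:
            star_rules.extend(rules)
    return named_rules if found_named else star_rules


groups = [
    (["*"], [("disallow", "/private/")]),
    (["Googlebot"], [("disallow", "/google-specific-block/")]),
    (["Bingbot"], [("allow", "/")]),
]
print(select_group(groups, "Googlebot"))    # only the Googlebot rules
print(select_group(groups, "DuckDuckBot"))  # falls back to the * rules
```

The practical consequence is that adding a named group for a bot silently removes it from the `*` group, so any shared rules have to be repeated.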

Path Matching Rules

robots.txt uses simple path prefix matching:

  • `Disallow: /admin/` blocks everything under `/admin/`
  • `Disallow: /admin` also blocks `/admins` and `/administration` — note the trailing slash matters
  • `Disallow: /` blocks your entire site from being crawled — the most dangerous rule possible
  • `Disallow:` (empty value) means allow everything — the same as not having a disallow rule
  • `Allow: /` explicitly allows everything

Google also supports basic wildcard patterns:

  • `*` matches any sequence of characters: `Disallow: /*.pdf` blocks any URL whose path contains `.pdf`
  • `$` matches the end of the URL: `Disallow: /*.pdf$` blocks only URLs that end in `.pdf`
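One way to see how prefix matching and the two wildcards behave is to translate a pattern into a regular expression. The sketch below is an approximation written for this guide, with hypothetical function names and example paths; it does not handle the empty `Disallow:` value and is not Google's own matcher.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    '*' matches any run of characters and a trailing '$' anchors the end of
    the URL. Everything else is literal and matched from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start, which gives prefix-matching semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/admin/", "/administration"))                # False
print(matches("/admin", "/administration"))                 # True
print(matches("/*.pdf", "/files/report.pdf?download=1"))    # True
print(matches("/*.pdf$", "/files/report.pdf?download=1"))   # False
print(matches("/*.pdf$", "/files/report.pdf"))              # True
```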

Order of Rules

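When `Allow` and `Disallow` rules conflict for the same URL, Google applies the most specific rule, meaning the one with the longest matching path, regardless of the order the rules appear in the file. If two matching rules are equally specific, the `Allow` wins.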

```
Disallow: /images/
Allow: /images/public/
```

In this example, `/images/private/photo.jpg` is blocked, but `/images/public/logo.jpg` is allowed.
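The precedence rule can be sketched in a few lines of Python. This version assumes plain prefix rules with no wildcards, and `is_allowed` is a hypothetical helper name rather than a library function.

```python
def is_allowed(rules, path):
    """Apply longest-match precedence to a list of (kind, prefix) rules.

    kind is "allow" or "disallow". The longest matching prefix wins, an
    allow wins a tie, and a path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue   # empty values and non-matching prefixes are ignored
        is_allow = kind == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [("disallow", "/images/"), ("allow", "/images/public/")]
print(is_allowed(rules, "/images/private/photo.jpg"))  # False
print(is_allowed(rules, "/images/public/logo.jpg"))    # True
print(is_allowed(rules, "/about/"))                    # True
```

Because the decision depends on path length rather than position, reordering the two rules in the file would give the same results.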


What to Block with robots.txt

Here is a practical guide to what is typically worth blocking and what is not.

Commonly Blocked Sections

Admin and login pages. These should not appear in search results, and blocking crawling saves crawl budget.

```
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

Note the Allow for `admin-ajax.php`: some WordPress themes and plugins load front-end content through this endpoint, so Googlebot needs to be able to fetch it to render those pages fully.

Staging and development environments. If you have a staging site on a subdirectory or subdomain, block it entirely. Better still, password-protect it.

Internal search results pages. If your site has search functionality (e.g. `/search/?q=example`), blocking the search result pages prevents Googlebot from crawling thousands of near-duplicate pages. This is a crawl budget concern for larger sites.

```
Disallow: /search/
```

Duplicate content from URL parameters. Faceted navigation on e-commerce sites can generate thousands of near-duplicate URLs (e.g. `/products/?colour=red&size=medium`). Blocking these from crawling helps focus Google's attention on your canonical product pages.

Account and checkout pages. Pages behind login, checkout flows, and order confirmation pages have no value in search results.

What You Should Not Block

Pages you want indexed. This sounds obvious, but it is the most common mistake. We regularly see sites blocking their service pages, blog posts, or product categories — usually an accident from a blanket disallow rule.

Pages with noindex directives. If a page has noindex, you do not need to block it in robots.txt. In fact, blocking it means Google cannot crawl the page and therefore cannot see the noindex — potentially leaving it indexable from external links.

CSS, JavaScript, and image files. In the early days of SEO, blocking CSS and JS was common to speed up crawls. This is now actively harmful — Google needs to render your pages to understand them, and blocking your stylesheets and scripts prevents Google from seeing your pages as users do. Google explicitly recommends allowing crawlers access to all resources needed to render pages.

Case study: blocked CSS and JavaScript causing rendering failures. A web design agency in Leeds rebuilt a client's website on a modern JavaScript framework and, during the migration, carried over a legacy robots.txt that included `Disallow: /assets/`, a directory that happened to contain both the CSS and JavaScript bundles for the new site. Visually the site worked fine for users, but Googlebot could not render any page correctly. In Google's cached versions, every page appeared as unstyled plain text with broken navigation. Over the following month, rankings for their top 15 service pages dropped an average of 12 positions. The fix was a single line removal in robots.txt. Within two weeks of the change, Google re-rendered the pages correctly and rankings began recovering. Google's rendering documentation specifically warns against blocking resources that Googlebot needs to render pages, and the URL Inspection tool in Search Console can show how Googlebot rendered a page and which resources it could not load.

Common robots.txt Directives Reference

| Directive | Purpose | Example |
| --- | --- | --- |
| `User-agent: *` | Apply rules to all crawlers | `User-agent: *` |
| `User-agent: Googlebot` | Apply rules only to Google's crawler | `User-agent: Googlebot` |
| `Disallow: /path/` | Block crawling of a directory | `Disallow: /admin/` |
| `Disallow: /` | Block crawling of the entire site | `Disallow: /` |
| `Disallow:` (empty) | Allow crawling of everything | `Disallow:` |
| `Allow: /path/` | Override a Disallow for a sub-path (Google-supported) | `Allow: /wp-admin/admin-ajax.php` |
| `Disallow: /*.pdf$` | Block URLs matching a wildcard pattern | Blocks all PDF file URLs |
| `Sitemap:` | Declare sitemap location for all crawlers | `Sitemap: https://example.com/sitemap.xml` |
| `Crawl-delay: 10` | Request a delay between requests (honoured by Bing, ignored by Google) | `Crawl-delay: 10` |
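For a sense of how a crawler reads the last two directives, Python's standard library `urllib.robotparser` exposes both values. The robots.txt content below is made up for the example, and `site_maps()` requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file combining directives from the reference table.
parser = RobotFileParser()
parser.parse("""\
User-agent: *
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
""".splitlines())

delay = parser.crawl_delay("Bingbot")   # seconds requested between fetches
sitemaps = parser.site_maps()           # list of declared sitemap URLs
print(delay)
print(sitemaps)
```

Whether a crawler acts on the delay is up to that crawler; the parser only reports what the file asks for.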

For a thorough specification of robots.txt syntax and behaviour, see the Yoast guide to robots.txt and the RFC 9309 standard which formalised the protocol.


The Most Dangerous robots.txt Mistakes

Blocking Your Entire Site

```
User-agent: *
Disallow: /
```

This single rule tells every crawler to stay out of your entire site. It is sometimes added intentionally during development (valid) but catastrophic if left in place after launch. Always check your live robots.txt after a site migration or relaunch.
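Because this mistake is so easy to miss, a launch checklist can include an automated check for it. The following Python sketch looks only for a bare `Disallow: /` inside the `*` group; the function name is hypothetical, and the check deliberately ignores Allow overrides and groups for named crawlers.

```python
def blocks_entire_site(robots_text):
    """Return True if the '*' group contains a bare 'Disallow: /'.

    Rough pre-launch check: it does not evaluate Allow overrides or
    groups that target named crawlers.
    """
    in_star_group = False
    last_was_agent = False
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                in_star_group = False          # a new group starts here
            in_star_group = in_star_group or value == "*"
            last_was_agent = True
        else:
            last_was_agent = False
            if field == "disallow" and value == "/" and in_star_group:
                return True
    return False


print(blocks_entire_site("User-agent: *\nDisallow: /\n"))         # True
print(blocks_entire_site("User-agent: *\nDisallow: /admin/\n"))   # False
```

Running a check like this against the live file after every deployment catches the "left over from staging" case described above.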

Forgetting the Trailing Slash

`Disallow: /admin` blocks `/admin`, `/admins`, `/administration` — any URL starting with that string. `Disallow: /admin/` blocks only pages within the `/admin/` directory. This distinction causes unexpected blocks more often than you might expect.

Using robots.txt for Security

If you have pages with genuinely sensitive content, robots.txt is not the place to protect them. The file is publicly visible — anyone can read it at `yourdomain.com/robots.txt`. In fact, it is sometimes used by attackers to find hidden admin paths. Use proper authentication for sensitive pages, not robots.txt.

Blocking CDN-served Resources

If your images, CSS, or JavaScript are served from a CDN subdomain (e.g. `cdn.yourdomain.com`), the robots.txt at your main domain does not apply. Each subdomain has its own robots.txt. A CDN subdomain with no robots.txt is fine, because crawlers treat a missing file as no restrictions, but if it serves a blocking robots.txt, crawlers will not be able to fetch your resources.
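To find which robots.txt governs a given resource, take the scheme and host of the resource URL and append `/robots.txt`. A minimal Python sketch, with a hypothetical helper name and example asset path:

```python
from urllib.parse import urlsplit


def robots_url_for(resource_url):
    """Return the robots.txt URL that governs resource_url.

    robots.txt is scoped to scheme and host (including any port), so a
    subdomain never inherits the rules of the main domain.
    """
    parts = urlsplit(resource_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url_for("https://yourdomain.com/services/"))
print(robots_url_for("https://cdn.yourdomain.com/assets/app.js"))
```

Listing the distinct hosts your pages load CSS, JavaScript, and images from, then checking each host's robots.txt, covers the CDN case.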

Conflicting Rules Between robots.txt and Canonical Tags

If you block a URL in robots.txt but it has a canonical pointing elsewhere, Google cannot follow the canonical because it cannot crawl the page. This creates orphaned canonical tags that Google ignores — a subtle but real issue for sites with complex URL management.


How to Test Your robots.txt

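Google Search Console: robots.txt Report

Google Search Console has a robots.txt report under Settings > robots.txt. It shows the robots.txt files Google found for your site, when each was last fetched, and any warnings or errors in them. The standalone robots.txt Tester that let you test arbitrary URLs has been retired; to check whether a specific URL is blocked, use the URL Inspection tool.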

Direct URL Test

Simply visit `https://yourdomain.com/robots.txt` in a browser. If you get a 404, you have no robots.txt file, which means all crawlers can access everything (not necessarily a problem for small sites, but worth knowing). If you get a 200 response, review the content carefully.
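The status code matters beyond 200 and 404. The Python sketch below maps response codes to the broad crawler behaviour set out in RFC 9309; the server-error case comes from that RFC rather than from this guide, individual crawlers may differ in the details, and the function name is an assumption.

```python
def interpret_robots_status(status_code):
    """Map the HTTP status of /robots.txt to a crawler's likely behaviour.

    Follows the broad categories in RFC 9309; individual crawlers may
    handle specific codes differently.
    """
    if 200 <= status_code < 300:
        return "parse the file and apply its rules"
    if 300 <= status_code < 400:
        return "follow the redirect, then re-evaluate"
    if 400 <= status_code < 500:
        return "treat as no robots.txt: everything may be crawled"
    if 500 <= status_code < 600:
        return "treat the whole site as disallowed until the file is reachable"
    return "unexpected status: investigate"


for code in (200, 301, 404, 503):
    print(code, "->", interpret_robots_status(code))
```

The 5xx row is the one to watch: a server that errors on `/robots.txt` can pause crawling of an otherwise healthy site.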

Third-party Validators

Tools like Merkle's robots.txt tester allow you to paste in your file and test specific URLs against it, including wildcard rules.

RnkRocket's Automated Audit

RnkRocket checks your robots.txt as part of its site intelligence crawl — flagging rules that block important pages, missing sitemap declarations, and syntax errors. This runs automatically, so you will be alerted to changes that could affect your crawlability.

For context on how robots.txt fits into a broader technical SEO strategy, see Technical SEO Explained and our guide to XML Sitemaps.


robots.txt and Crawl Budget

For most small business websites, crawl budget — the number of pages Google will crawl on any given visit — is not a meaningful concern. Google crawls small sites comprehensively regardless.

Crawl budget becomes relevant when your site has tens of thousands of URLs. In those cases, using robots.txt to block low-value URLs (internal search results, parameter-generated duplicates, admin pages) concentrates Google's crawl capacity on your valuable pages.

The alternative — allowing Google to crawl thousands of near-duplicate search result pages — wastes crawl capacity that could be directed at your actual content.

For a small business with a few dozen to a few hundred pages, optimising for crawl budget through robots.txt is unnecessary. Focus instead on clean URL structures and proper internal linking.


A Baseline robots.txt for Small Business Sites

For most small business websites — particularly those running on WordPress — this is a sensible baseline:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
```

This blocks the WordPress admin, login page, and site search results (which create near-duplicate pages), while declaring your sitemap location for all crawlers. Everything else is left accessible.
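Before deploying a file like this, it can help to write down which URLs you expect to stay crawlable and check them against the rules. The Python sketch below does that for the baseline using longest-prefix matching; the example paths such as `/services/` are hypothetical, and the matcher is a simplified approximation that ignores wildcards.

```python
BASELINE_RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/wp-login.php"),
    ("disallow", "/?s="),
    ("disallow", "/search/"),
]


def crawlable(path, rules=BASELINE_RULES):
    """Longest matching prefix wins, allow wins ties, no match means allowed."""
    matching = [
        (len(prefix), kind == "allow")
        for kind, prefix in rules
        if path.startswith(prefix)
    ]
    # max() compares length first, then the boolean (True beats False).
    return max(matching)[1] if matching else True


# Expected behaviour for a handful of representative URLs.
EXPECTED = {
    "/": True,
    "/services/": True,
    "/wp-admin/": False,
    "/wp-admin/admin-ajax.php": True,
    "/wp-login.php": False,
    "/?s=plumber": False,
    "/search/plumber/": False,
}

for path, expected in EXPECTED.items():
    assert crawlable(path) == expected, path
print("baseline behaves as expected")
```

Keeping the expectations next to the rules means a later edit that accidentally blocks a service page fails the check instead of failing in search results.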

If you run Shopify, Squarespace, or another hosted platform, the platform generates a robots.txt for you and it is typically sensible out of the box. Check it, but you rarely need to change it unless you have specific needs.


Frequently Asked Questions

Does robots.txt affect my rankings directly?

No — robots.txt affects crawling, not ranking signals directly. However, if you accidentally block important pages from being crawled, those pages cannot be ranked. The indirect effect on rankings can be severe if misconfigurations prevent your key pages from being indexed.

Can I block specific bots while allowing Google?

Yes. Use named user-agents to target specific bots. For example, to block SEO scrapers while allowing Google:

```
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
```

Note that blocking SEO tool crawlers does not prevent those tools from showing data about your site if they have already crawled it previously.
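You can confirm the effect of a file like this with Python's standard library `urllib.robotparser`. Its matching is simpler than Google's (it does not implement Google's wildcard or longest-match behaviour), so treat it as a sanity check for plain rules like these rather than a full simulation. The URL being tested is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_crawl(bot, url="https://yourdomain.com/services/"):
    """Report whether the named bot may fetch url under ROBOTS_TXT."""
    return parser.can_fetch(bot, url)


for bot in ("AhrefsBot", "SemrushBot", "Googlebot"):
    print(bot, can_crawl(bot))
```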

What happens if I have no robots.txt file at all?

Nothing bad. With no robots.txt file, all crawlers are permitted to access all pages. A missing robots.txt file returns a 404, which crawlers interpret as "no restrictions". This is fine for most small sites.

Can robots.txt stop my content from being scraped?

No. robots.txt is a voluntary protocol. Legitimate crawlers follow it, but malicious scrapers typically do not. If you need to protect content from scraping, you need technical measures like authentication, rate limiting, or CAPTCHAs — not robots.txt.

How quickly does Googlebot respond to robots.txt changes?

Google generally caches robots.txt for up to 24 hours, so changes are typically picked up within a few hours to a day. However, if you block pages that Google previously crawled, it may take several days or a few weeks for the effect to show in search results. If you have inadvertently blocked pages, fix the robots.txt immediately and use the URL Inspection tool in Search Console to request recrawling.


RnkRocket automatically checks your robots.txt for common mistakes — including rules that accidentally block your most important pages. See what RnkRocket finds on your site.


We use cookies to measure visits and improve RnkRocket. Accept analytics cookies or continue with essential only. Cookie policy

Not getting calls from Google? Find out why. See how it works →
Skip to main content

The Complete Guide to robots.txt for SEO

Your robots.txt file tells search engines what they can and cannot crawl. Get it wrong and you risk blocking your entire site from Google — here's how to use it correctly.

By RnkRocket Team
May 28, 2026
13 min read
The Complete Guide to robots.txt for SEO

Key Takeaways

  • robots.txt controls which parts of your site search engine crawlers can access — it does not control indexing, only crawling (Google Search Central)
  • A single misconfigured robots.txt can de-index your entire website from Google — this is one of the most common and catastrophic technical SEO errors
  • Blocking a URL in robots.txt does not prevent it from appearing in Google's index if other sites link to it — use noindex for true indexing control
  • RnkRocket's site intelligence crawl checks your robots.txt for misconfigurations and blocked important pages automatically

Few files on your website carry as much risk as robots.txt. It is typically tiny — sometimes just a handful of lines — yet a single typo or misunderstanding can cause Google to stop crawling your entire site, leading to pages disappearing from search results within days.

We have seen this happen more times than we care to count. A developer adds a disallow rule before a site launch and forgets to remove it. An SEO plugin generates a robots.txt with an overly aggressive block. The result is always the same: traffic collapses, pages vanish from search, and the culprit takes days to identify because no one thinks to check a three-line text file.

This guide explains exactly how robots.txt works, how to write it correctly, and how to verify it is not quietly working against you.


What Is robots.txt?

robots.txt is a plain text file placed in the root of your website (e.g. `https://yourdomain.com/robots.txt\`) that communicates instructions to automated crawlers — including search engine bots like Googlebot, Bingbot, and others.

It follows the Robots Exclusion Protocol, a convention that web robots have followed since the mid-1990s. Unlike many areas of SEO, the protocol is technically a standard rather than a binding rule — crawlers choose to follow it as a courtesy. Most reputable crawlers do; malicious scrapers generally do not.

The file uses a simple syntax:

``` User-agent: * Disallow: /admin/ Allow: /admin/public/ Sitemap: https://yourdomain.com/sitemap.xml ```

  • `User-agent` specifies which bot the rules apply to (`*` means all bots)
  • `Disallow` specifies paths the bot should not crawl
  • `Allow` (supported by Google) overrides a disallow for a specific sub-path
  • `Sitemap` declares the location of your XML sitemap

The Critical Distinction: Crawling vs Indexing

This is the most important concept in this entire guide, and the one most often confused.

Blocking crawling does not block indexing.

If you disallow a URL in robots.txt, Googlebot will not crawl that URL. But if other websites link to that URL, Google may still discover it, list it in its index, and show it in search results — just without being able to read its content. This can result in pages appearing in Google with no title or description, just the URL and a snippet saying "A description for this result is not available because of this site's robots.txt."

This is the exact opposite of what most people intend when they block a URL. If you want a page kept out of the index, you need a `noindex` directive on the page itself — not a robots.txt rule.

The practical consequence:

GoalCorrect Approach
Prevent Googlebot crawling a sectionrobots.txt Disallow
Prevent a page appearing in search resultsnoindex meta tag or X-Robots-Tag header
Both: not crawled and not indexednoindex on the page (bot must be able to crawl it to see the noindex)

You cannot effectively noindex a page you are blocking in robots.txt — because Google cannot crawl the page to read the noindex directive.


robots.txt Syntax in Detail

User-agent Targeting

You can write rules for all bots or specific ones:

```

Rules for all crawlers

User-agent: * Disallow: /private/

Rules only for Googlebot

User-agent: Googlebot Disallow: /google-specific-block/

Rules only for Bingbot

User-agent: Bingbot Allow: / ```

Each `User-agent` line starts a new block. Rules apply to the agent specified until the next `User-agent` line. You can have multiple `User-agent` lines in a single block if you want the same rules to apply to multiple named bots.

Path Matching Rules

robots.txt uses simple path prefix matching:

  • `Disallow: /admin/` blocks everything under `/admin/`
  • `Disallow: /admin` also blocks `/admins` and `/administration` — note the trailing slash matters
  • `Disallow: /` blocks your entire site from being crawled — the most dangerous rule possible
  • `Disallow:` (empty value) means allow everything — the same as not having a disallow rule
  • `Allow: /` explicitly allows everything

Google also supports basic wildcard patterns:

  • `` matches any sequence of characters: `Disallow: /.pdf` blocks all PDF files
  • `$` matches end of URL: `Disallow: /*.pdf$` blocks URLs ending in `.pdf` specifically

Order of Rules

When `Allow` and `Disallow` rules conflict for the same URL, Google uses the most specific rule. If specificity is equal, the `Allow` wins.

``` Disallow: /images/ Allow: /images/public/ ```

In this example, `/images/private/photo.jpg` is blocked, but `/images/public/logo.jpg` is allowed.


What to Block with robots.txt

Here is a practical guide to what is typically worth blocking and what is not.

Commonly Blocked Sections

Admin and login pages. These should not appear in search results, and blocking crawling saves crawl budget.

``` Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php ```

Note the Allow for `admin-ajax.php` — WordPress requires this to be crawlable for certain frontend features to work.

Staging and development environments. If you have a staging site on a subdirectory or subdomain, block it entirely. Better still, password-protect it.

Internal search results pages. If your site has search functionality (e.g. `/search/?q=example`), blocking the search result pages prevents Googlebot from crawling thousands of near-duplicate pages. This is a crawl budget concern for larger sites.

``` Disallow: /search/ ```

Duplicate content from URL parameters. Faceted navigation on e-commerce sites can generate thousands of near-duplicate URLs (e.g. `/products/?colour=red&size=medium`). Blocking these from crawling helps focus Google's attention on your canonical product pages.

Account and checkout pages. Pages behind login, checkout flows, and order confirmation pages have no value in search results.

What You Should Not Block

Pages you want indexed. This sounds obvious, but it is the most common mistake. We regularly see sites blocking their service pages, blog posts, or product categories — usually an accident from a blanket disallow rule.

Pages with noindex directives. If a page has noindex, you do not need to block it in robots.txt. In fact, blocking it means Google cannot crawl the page and therefore cannot see the noindex — potentially leaving it indexable from external links.

CSS, JavaScript, and image files. In the early days of SEO, blocking CSS and JS was common to speed up crawls. This is now actively harmful — Google needs to render your pages to understand them, and blocking your stylesheets and scripts prevents Google from seeing your pages as users do. Google explicitly recommends allowing crawlers access to all resources needed to render pages.

Case study: blocked CSS and JavaScript causing rendering failures. A web design agency in Leeds rebuilt a client's website on a modern JavaScript framework and, during the migration, carried over a legacy robots.txt that included `Disallow: /assets/` — which happened to contain both the CSS and JavaScript bundles for the new site. Visually the site worked fine for users, but Googlebot could not render any page correctly. In Google's cached versions, every page appeared as unstyled plain text with broken navigation. Over the following month, rankings for their top 15 service pages dropped an average of 12 positions. The fix was a single line removal in robots.txt. Within two weeks of the change, Google re-rendered the pages correctly and rankings began recovering. Google's rendering documentation specifically warns against blocking resources that Googlebot needs to render pages, and their Mobile-Friendly Test tool can confirm whether blocked resources are affecting page rendering.

Common robots.txt Directives Reference

DirectivePurposeExample
`User-agent: *`Apply rules to all crawlers`User-agent: *`
`User-agent: Googlebot`Apply rules only to Google's crawler`User-agent: Googlebot`
`Disallow: /path/`Block crawling of a directory`Disallow: /admin/`
`Disallow: /`Block crawling of the entire site`Disallow: /`
`Disallow:` (empty)Allow crawling of everything`Disallow:`
`Allow: /path/`Override a Disallow for a sub-path (Google-supported)`Allow: /wp-admin/admin-ajax.php`
`Disallow: /*.pdf$`Block URLs matching a wildcard patternBlocks all PDF file URLs
`Sitemap:`Declare sitemap location for all crawlers`Sitemap: https://example.com/sitemap.xml\`
`Crawl-delay: 10`Request a delay between requests (honoured by Bing, ignored by Google)`Crawl-delay: 10`

For a thorough specification of robots.txt syntax and behaviour, see the Yoast guide to robots.txt and the RFC 9309 standard which formalised the protocol.


The Most Dangerous robots.txt Mistakes

Blocking Your Entire Site

``` User-agent: * Disallow: / ```

This single rule tells every crawler to stay out of your entire site. It is sometimes added intentionally during development (valid) but catastrophic if left in place after launch. Always check your live robots.txt after a site migration or relaunch.

Forgetting the Trailing Slash

`Disallow: /admin` blocks `/admin`, `/admins`, `/administration` — any URL starting with that string. `Disallow: /admin/` blocks only pages within the `/admin/` directory. This distinction causes unexpected blocks more often than you might expect.

Using robots.txt for Security

If you have pages with genuinely sensitive content, robots.txt is not the place to protect them. The file is publicly visible — anyone can read it at `yourdomain.com/robots.txt`. In fact, it is sometimes used by attackers to find hidden admin paths. Use proper authentication for sensitive pages, not robots.txt.

Blocking CDN-served Resources

If your images, CSS, or JavaScript are served from a CDN subdomain (e.g. `cdn.yourdomain.com`), the robots.txt at your main domain does not apply. Each subdomain has its own robots.txt. If your CDN subdomain has no robots.txt (or has a blocking one), crawlers may not be able to access your resources.

Conflicting Rules Between robots.txt and Canonical Tags

If you block a URL in robots.txt but it has a canonical pointing elsewhere, Google cannot follow the canonical because it cannot crawl the page. This creates orphaned canonical tags that Google ignores — a subtle but real issue for sites with complex URL management.


How to Test Your robots.txt

Google Search Console — robots.txt Tester

Google Search Console has a built-in robots.txt tester under Settings > robots.txt. It shows you the current file, allows you to test any URL to see if it would be blocked, and highlights syntax errors.

Direct URL Test

Simply visit `https://yourdomain.com/robots.txt\` in a browser. If you get a 404, you have no robots.txt file — which means all crawlers can access everything (not necessarily a problem for small sites, but worth knowing). If you get a 200 response, review the content carefully.

Third-party Validators

Tools like Merkle's robots.txt tester allow you to paste in your file and test specific URLs against it, including wildcard rules that Search Console's tester can occasionally mishandle.

RnkRocket's Automated Audit

RnkRocket checks your robots.txt as part of its site intelligence crawl — flagging rules that block important pages, missing sitemap declarations, and syntax errors. This runs automatically, so you will be alerted to changes that could affect your crawlability.

For context on how robots.txt fits into a broader technical SEO strategy, see Technical SEO Explained and our guide to XML Sitemaps.


robots.txt and Crawl Budget

For most small business websites, crawl budget — the number of pages Google will crawl on any given visit — is not a meaningful concern. Google crawls small sites comprehensively regardless.

Crawl budget becomes relevant when your site has tens of thousands of URLs. In those cases, using robots.txt to block low-value URLs (internal search results, parameter-generated duplicates, admin pages) concentrates Google's crawl capacity on your valuable pages.

The alternative — allowing Google to crawl thousands of near-duplicate search result pages — wastes crawl capacity that could be directed at your actual content.

For a small business with a few dozen to a few hundred pages, optimising for crawl budget through robots.txt is unnecessary. Focus instead on clean URL structures and proper internal linking.


A Baseline robots.txt for Small Business Sites

For most small business websites — particularly those running on WordPress — this is a sensible baseline:

``` User-agent: * Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php Disallow: /wp-login.php Disallow: /?s= Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml ```

This blocks the WordPress admin, login page, and site search results (which create near-duplicate pages), while declaring your sitemap location for all crawlers. Everything else is left accessible.

If you run Shopify, Squarespace, or another hosted platform, the platform generates a robots.txt for you and it is typically sensible out of the box. Check it, but you rarely need to change it unless you have specific needs.


Frequently Asked Questions

Does robots.txt affect my rankings directly?

No — robots.txt affects crawling, not ranking signals directly. However, if you accidentally block important pages from being crawled, those pages cannot be ranked. The indirect effect on rankings can be severe if misconfigurations prevent your key pages from being indexed.

Can I block specific bots while allowing Google?

Yes. Use named user-agents to target specific bots. For example, to block SEO scrapers while allowing Google:

```
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
```

Note that blocking SEO tool crawlers does not prevent those tools from showing data about your site if they have already crawled it previously.
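A quick way to confirm that per-bot rules behave as intended is Python's standard-library `urllib.robotparser`. The sketch below parses the rules from a string rather than fetching a live file, and the URL being tested is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; any path on the site gives the same answer for this file.
test_url = "https://yourdomain.com/services/"
results = {
    agent: parser.can_fetch(agent, test_url)
    for agent in ("AhrefsBot", "SemrushBot", "Googlebot")
}
for agent, allowed in results.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Bear in mind that `urllib.robotparser` applies the first matching rule within a group instead of the most specific one. That makes no difference for a file like this with one rule per group, but it can disagree with Google on files that mix `Allow` and `Disallow`.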

What happens if I have no robots.txt file at all?

Nothing bad. With no robots.txt file, all crawlers are permitted to access all pages. A missing robots.txt file returns a 404, which crawlers interpret as "no restrictions". This is fine for most small sites.
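The status code returned for `/robots.txt` matters because crawlers treat the response classes differently. The sketch below summarises the handling laid out in RFC 9309. The function name and the returned strings are illustrative, and individual crawlers add their own exceptions on top of this.

```python
def crawl_policy_for_status(status_code: int) -> str:
    """Map the HTTP status of /robots.txt to the crawl policy in RFC 9309."""
    if 200 <= status_code < 300:
        return "parse the file and apply its rules"
    if 300 <= status_code < 400:
        return "follow the redirect, then re-evaluate"
    if 400 <= status_code < 500:
        # Includes 404: the file is "unavailable", so nothing is restricted.
        return "no restrictions"
    if 500 <= status_code < 600:
        # The file is "unreachable": the crawler must assume a full disallow.
        return "treat the whole site as disallowed"
    return "unspecified"


for code in (200, 404, 503):
    print(code, "->", crawl_policy_for_status(code))
```

The practical takeaway is that a missing file is harmless, whereas a server error on `/robots.txt` can pause crawling of the whole site until the error clears.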

Can robots.txt stop my content from being scraped?

No. robots.txt is a voluntary protocol. Legitimate crawlers follow it, but malicious scrapers typically do not. If you need to protect content from scraping, you need technical measures like authentication, rate limiting, or CAPTCHAs — not robots.txt.

How quickly does Googlebot respond to robots.txt changes?

Google typically fetches and processes robots.txt changes within a few hours to a day. However, if you block pages that Google has already crawled, it may take several days or a few weeks for their listings to degrade or drop out of search results. If you have blocked pages inadvertently, fix the robots.txt immediately and use the URL Inspection tool in Search Console to request recrawling.


RnkRocket automatically checks your robots.txt for common mistakes — including rules that accidentally block your most important pages. See what RnkRocket finds on your site.

matches end of URL: `Disallow: /*.pdf The Complete robots.txt SEO Guide 2026 | RnkRocket

We use cookies to measure visits and improve RnkRocket. Accept analytics cookies or continue with essential only. Cookie policy

Not getting calls from Google? Find out why. See how it works →
Skip to main content

The Complete Guide to robots.txt for SEO

Your robots.txt file tells search engines what they can and cannot crawl. Get it wrong and you risk blocking your entire site from Google — here's how to use it correctly.

By RnkRocket Team
May 28, 2026
13 min read
The Complete Guide to robots.txt for SEO

Key Takeaways

  • robots.txt controls which parts of your site search engine crawlers can access — it does not control indexing, only crawling (Google Search Central)
  • A single misconfigured robots.txt can de-index your entire website from Google — this is one of the most common and catastrophic technical SEO errors
  • Blocking a URL in robots.txt does not prevent it from appearing in Google's index if other sites link to it — use noindex for true indexing control
  • RnkRocket's site intelligence crawl checks your robots.txt for misconfigurations and blocked important pages automatically

Few files on your website carry as much risk as robots.txt. It is typically tiny — sometimes just a handful of lines — yet a single typo or misunderstanding can cause Google to stop crawling your entire site, leading to pages disappearing from search results within days.

We have seen this happen more times than we care to count. A developer adds a disallow rule before a site launch and forgets to remove it. An SEO plugin generates a robots.txt with an overly aggressive block. The result is always the same: traffic collapses, pages vanish from search, and the culprit takes days to identify because no one thinks to check a three-line text file.

This guide explains exactly how robots.txt works, how to write it correctly, and how to verify it is not quietly working against you.


What Is robots.txt?

robots.txt is a plain text file placed in the root of your website (e.g. `https://yourdomain.com/robots.txt\`) that communicates instructions to automated crawlers — including search engine bots like Googlebot, Bingbot, and others.

It follows the Robots Exclusion Protocol, a convention that web robots have followed since the mid-1990s. Unlike many areas of SEO, the protocol is technically a standard rather than a binding rule — crawlers choose to follow it as a courtesy. Most reputable crawlers do; malicious scrapers generally do not.

The file uses a simple syntax:

``` User-agent: * Disallow: /admin/ Allow: /admin/public/ Sitemap: https://yourdomain.com/sitemap.xml ```

  • `User-agent` specifies which bot the rules apply to (`*` means all bots)
  • `Disallow` specifies paths the bot should not crawl
  • `Allow` (supported by Google) overrides a disallow for a specific sub-path
  • `Sitemap` declares the location of your XML sitemap

The Critical Distinction: Crawling vs Indexing

This is the most important concept in this entire guide, and the one most often confused.

Blocking crawling does not block indexing.

If you disallow a URL in robots.txt, Googlebot will not crawl that URL. But if other websites link to that URL, Google may still discover it, list it in its index, and show it in search results — just without being able to read its content. This can result in pages appearing in Google with no title or description, just the URL and a snippet saying "A description for this result is not available because of this site's robots.txt."

This is the exact opposite of what most people intend when they block a URL. If you want a page kept out of the index, you need a `noindex` directive on the page itself — not a robots.txt rule.

The practical consequence:

GoalCorrect Approach
Prevent Googlebot crawling a sectionrobots.txt Disallow
Prevent a page appearing in search resultsnoindex meta tag or X-Robots-Tag header
Both: not crawled and not indexednoindex on the page (bot must be able to crawl it to see the noindex)

You cannot effectively noindex a page you are blocking in robots.txt — because Google cannot crawl the page to read the noindex directive.


robots.txt Syntax in Detail

User-agent Targeting

You can write rules for all bots or specific ones:

```

Rules for all crawlers

User-agent: * Disallow: /private/

Rules only for Googlebot

User-agent: Googlebot Disallow: /google-specific-block/

Rules only for Bingbot

User-agent: Bingbot Allow: / ```

Each `User-agent` line starts a new block. Rules apply to the agent specified until the next `User-agent` line. You can have multiple `User-agent` lines in a single block if you want the same rules to apply to multiple named bots.

Path Matching Rules

robots.txt uses simple path prefix matching:

  • `Disallow: /admin/` blocks everything under `/admin/`
  • `Disallow: /admin` also blocks `/admins` and `/administration` — note the trailing slash matters
  • `Disallow: /` blocks your entire site from being crawled — the most dangerous rule possible
  • `Disallow:` (empty value) means allow everything — the same as not having a disallow rule
  • `Allow: /` explicitly allows everything

Google also supports basic wildcard patterns:

  • `` matches any sequence of characters: `Disallow: /.pdf` blocks all PDF files
  • `$` matches end of URL: `Disallow: /*.pdf$` blocks URLs ending in `.pdf` specifically

Order of Rules

When `Allow` and `Disallow` rules conflict for the same URL, Google uses the most specific rule. If specificity is equal, the `Allow` wins.

``` Disallow: /images/ Allow: /images/public/ ```

In this example, `/images/private/photo.jpg` is blocked, but `/images/public/logo.jpg` is allowed.


What to Block with robots.txt

Here is a practical guide to what is typically worth blocking and what is not.

Commonly Blocked Sections

Admin and login pages. These should not appear in search results, and blocking crawling saves crawl budget.

``` Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php ```

Note the Allow for `admin-ajax.php` — WordPress requires this to be crawlable for certain frontend features to work.

Staging and development environments. If you have a staging site on a subdirectory or subdomain, block it entirely. Better still, password-protect it.

Internal search results pages. If your site has search functionality (e.g. `/search/?q=example`), blocking the search result pages prevents Googlebot from crawling thousands of near-duplicate pages. This is a crawl budget concern for larger sites.

``` Disallow: /search/ ```

Duplicate content from URL parameters. Faceted navigation on e-commerce sites can generate thousands of near-duplicate URLs (e.g. `/products/?colour=red&size=medium`). Blocking these from crawling helps focus Google's attention on your canonical product pages.

Account and checkout pages. Pages behind login, checkout flows, and order confirmation pages have no value in search results.

What You Should Not Block

Pages you want indexed. This sounds obvious, but it is the most common mistake. We regularly see sites blocking their service pages, blog posts, or product categories — usually an accident from a blanket disallow rule.

Pages with noindex directives. If a page has noindex, you do not need to block it in robots.txt. In fact, blocking it means Google cannot crawl the page and therefore cannot see the noindex — potentially leaving it indexable from external links.

CSS, JavaScript, and image files. In the early days of SEO, blocking CSS and JS was common to speed up crawls. This is now actively harmful — Google needs to render your pages to understand them, and blocking your stylesheets and scripts prevents Google from seeing your pages as users do. Google explicitly recommends allowing crawlers access to all resources needed to render pages.

Case study: blocked CSS and JavaScript causing rendering failures. A web design agency in Leeds rebuilt a client's website on a modern JavaScript framework and, during the migration, carried over a legacy robots.txt that included `Disallow: /assets/` — which happened to contain both the CSS and JavaScript bundles for the new site. Visually the site worked fine for users, but Googlebot could not render any page correctly. In Google's cached versions, every page appeared as unstyled plain text with broken navigation. Over the following month, rankings for their top 15 service pages dropped an average of 12 positions. The fix was a single line removal in robots.txt. Within two weeks of the change, Google re-rendered the pages correctly and rankings began recovering. Google's rendering documentation specifically warns against blocking resources that Googlebot needs to render pages, and their Mobile-Friendly Test tool can confirm whether blocked resources are affecting page rendering.

Common robots.txt Directives Reference

DirectivePurposeExample
`User-agent: *`Apply rules to all crawlers`User-agent: *`
`User-agent: Googlebot`Apply rules only to Google's crawler`User-agent: Googlebot`
`Disallow: /path/`Block crawling of a directory`Disallow: /admin/`
`Disallow: /`Block crawling of the entire site`Disallow: /`
`Disallow:` (empty)Allow crawling of everything`Disallow:`
`Allow: /path/`Override a Disallow for a sub-path (Google-supported)`Allow: /wp-admin/admin-ajax.php`
`Disallow: /*.pdf$`Block URLs matching a wildcard patternBlocks all PDF file URLs
`Sitemap:`Declare sitemap location for all crawlers`Sitemap: https://example.com/sitemap.xml\`
`Crawl-delay: 10`Request a delay between requests (honoured by Bing, ignored by Google)`Crawl-delay: 10`

For a thorough specification of robots.txt syntax and behaviour, see the Yoast guide to robots.txt and the RFC 9309 standard which formalised the protocol.


The Most Dangerous robots.txt Mistakes

Blocking Your Entire Site

``` User-agent: * Disallow: / ```

This single rule tells every crawler to stay out of your entire site. It is sometimes added intentionally during development (valid) but catastrophic if left in place after launch. Always check your live robots.txt after a site migration or relaunch.

Forgetting the Trailing Slash

`Disallow: /admin` blocks `/admin`, `/admins`, `/administration` — any URL starting with that string. `Disallow: /admin/` blocks only pages within the `/admin/` directory. This distinction causes unexpected blocks more often than you might expect.

Using robots.txt for Security

If you have pages with genuinely sensitive content, robots.txt is not the place to protect them. The file is publicly visible — anyone can read it at `yourdomain.com/robots.txt`. In fact, it is sometimes used by attackers to find hidden admin paths. Use proper authentication for sensitive pages, not robots.txt.

Blocking CDN-served Resources

If your images, CSS, or JavaScript are served from a CDN subdomain (e.g. `cdn.yourdomain.com`), the robots.txt at your main domain does not apply. Each subdomain has its own robots.txt. If your CDN subdomain has no robots.txt (or has a blocking one), crawlers may not be able to access your resources.

Conflicting Rules Between robots.txt and Canonical Tags

If you block a URL in robots.txt but it has a canonical pointing elsewhere, Google cannot follow the canonical because it cannot crawl the page. This creates orphaned canonical tags that Google ignores — a subtle but real issue for sites with complex URL management.


How to Test Your robots.txt

Google Search Console — robots.txt Tester

Google Search Console has a built-in robots.txt tester under Settings > robots.txt. It shows you the current file, allows you to test any URL to see if it would be blocked, and highlights syntax errors.

Direct URL Test

Simply visit `https://yourdomain.com/robots.txt\` in a browser. If you get a 404, you have no robots.txt file — which means all crawlers can access everything (not necessarily a problem for small sites, but worth knowing). If you get a 200 response, review the content carefully.

Third-party Validators

Tools like Merkle's robots.txt tester allow you to paste in your file and test specific URLs against it, including wildcard rules that Search Console's tester can occasionally mishandle.

RnkRocket's Automated Audit

RnkRocket checks your robots.txt as part of its site intelligence crawl — flagging rules that block important pages, missing sitemap declarations, and syntax errors. This runs automatically, so you will be alerted to changes that could affect your crawlability.

For context on how robots.txt fits into a broader technical SEO strategy, see Technical SEO Explained and our guide to XML Sitemaps.


robots.txt and Crawl Budget

For most small business websites, crawl budget — the number of pages Google will crawl on any given visit — is not a meaningful concern. Google crawls small sites comprehensively regardless.

Crawl budget becomes relevant when your site has tens of thousands of URLs. In those cases, using robots.txt to block low-value URLs (internal search results, parameter-generated duplicates, admin pages) concentrates Google's crawl capacity on your valuable pages.

The alternative — allowing Google to crawl thousands of near-duplicate search result pages — wastes crawl capacity that could be directed at your actual content.

For a small business with a few dozen to a few hundred pages, optimising for crawl budget through robots.txt is unnecessary. Focus instead on clean URL structures and proper internal linking.


A Baseline robots.txt for Small Business Sites

For most small business websites — particularly those running on WordPress — this is a sensible baseline:

``` User-agent: * Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php Disallow: /wp-login.php Disallow: /?s= Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml ```

This blocks the WordPress admin, login page, and site search results (which create near-duplicate pages), while declaring your sitemap location for all crawlers. Everything else is left accessible.

If you run Shopify, Squarespace, or another hosted platform, the platform generates a robots.txt for you and it is typically sensible out of the box. Check it, but you rarely need to change it unless you have specific needs.


Frequently Asked Questions

Does robots.txt affect my rankings directly?

No — robots.txt affects crawling, not ranking signals directly. However, if you accidentally block important pages from being crawled, those pages cannot be ranked. The indirect effect on rankings can be severe if misconfigurations prevent your key pages from being indexed.

Can I block specific bots while allowing Google?

Yes. Use named user-agents to target specific bots. For example, to block SEO scrapers while allowing Google:

``` User-agent: AhrefsBot Disallow: /

User-agent: SemrushBot Disallow: /

User-agent: * Allow: / ```

Note that blocking SEO tool crawlers does not prevent those tools from showing data about your site if they have already crawled it previously.

What happens if I have no robots.txt file at all?

Nothing bad. With no robots.txt file, all crawlers are permitted to access all pages. A missing robots.txt file returns a 404, which crawlers interpret as "no restrictions". This is fine for most small sites.

Can robots.txt stop my content from being scraped?

No. robots.txt is a voluntary protocol. Legitimate crawlers follow it, but malicious scrapers typically do not. If you need to protect content from scraping, you need technical measures like authentication, rate limiting, or CAPTCHAs — not robots.txt.

How quickly does Googlebot respond to robots.txt changes?

Google typically fetches and processes robots.txt changes within a few hours to a day. However, if you block pages that Google previously crawled and cached, it may take several days or a few weeks for those pages to disappear from the index. If you have inadvertently blocked pages, fix the robots.txt immediately and use the URL inspection tool in Search Console to request recrawling.


Related Reading


RnkRocket automatically checks your robots.txt for common mistakes — including rules that accidentally block your most important pages. See what RnkRocket finds on your site.

Related Posts

XML Sitemaps Explained: Why They Matter and How to Create One
Technical SEO

XML Sitemaps Explained: Why They Matter and How to Create One

An XML sitemap tells search engines which pages exist on your site and when they were last updated. Here's everything small businesses need to know to create and maintain one correctly.

Technical SEO
Crawlability
Indexing
+1 more
RnkRocket Team
May 25, 202612 min read
Duplicate Content: What It Is and How to Fix It
Technical SEO

Duplicate Content: What It Is and How to Fix It

Duplicate content confuses search engines, splits your ranking signals across multiple URLs, and can cause Google to index the wrong version of your page. Here is how to identify it and fix it properly.

Technical SEO
Crawlability
Indexing
+1 more
RnkRocket Team
May 11, 202613 min read
Page Speed Optimisation: A Practical Guide for Non-Developers
Technical SEO

Page Speed Optimisation: A Practical Guide for Non-Developers

Slow pages cost you rankings and customers. This practical guide explains page speed optimisation in plain English — with specific fixes you can implement without touching a line of code.

Core Web Vitals
Site Speed
Technical SEO
+1 more
RnkRocket Team
May 4, 202615 min read
blocks URLs ending in `.pdf` specifically<\/li>\n<\/ul>\n

Order of Rules<\/h3>\n

When `Allow` and `Disallow` rules conflict for the same URL, Google uses the most specific rule. If specificity is equal, the `Allow` wins.<\/p>\n

```\nDisallow: /images/\nAllow: /images/public/\n```<\/p>\n

In this example, `/images/private/photo.jpg` is blocked, but `/images/public/logo.jpg` is allowed.<\/p>\n


\n

What to Block with robots.txt<\/h2>\n

Here is a practical guide to what is typically worth blocking and what is not.<\/p>\n

Commonly Blocked Sections<\/h3>\n

Admin and login pages.<\/strong> These should not appear in search results, and blocking crawling saves crawl budget.<\/p>\n

```\nDisallow: /wp-admin/\nAllow: /wp-admin/admin-ajax.php\n```<\/p>\n

Note the Allow for `admin-ajax.php` — WordPress requires this to be crawlable for certain frontend features to work.<\/p>\n

Staging and development environments.<\/strong> If you have a staging site on a subdirectory or subdomain, block it entirely. Better still, password-protect it.<\/p>\n

Internal search results pages.<\/strong> If your site has search functionality (e.g. `/search/?q=example`), blocking the search result pages prevents Googlebot from crawling thousands of near-duplicate pages. This is a crawl budget concern for larger sites.<\/p>\n

```\nDisallow: /search/\n```<\/p>\n

Duplicate content from URL parameters.<\/strong> Faceted navigation on e-commerce sites can generate thousands of near-duplicate URLs (e.g. `/products/?colour=red&size=medium`). Blocking these from crawling helps focus Google's attention on your canonical product pages.<\/p>\n

Account and checkout pages.<\/strong> Pages behind login, checkout flows, and order confirmation pages have no value in search results.<\/p>\n

What You Should Not Block<\/h3>\n

Pages you want indexed.<\/strong> This sounds obvious, but it is the most common mistake. We regularly see sites blocking their service pages, blog posts, or product categories — usually an accident from a blanket disallow rule.<\/p>\n

Pages with noindex directives.<\/strong> If a page has noindex, you do not need to block it in robots.txt. In fact, blocking it means Google cannot crawl the page and therefore cannot see the noindex — potentially leaving it indexable from external links.<\/p>\n

CSS, JavaScript, and image files.<\/strong> In the early days of SEO, blocking CSS and JS was common to speed up crawls. This is now actively harmful — Google needs to render your pages to understand them, and blocking your stylesheets and scripts prevents Google from seeing your pages as users do. Google explicitly recommends<\/a> allowing crawlers access to all resources needed to render pages.<\/p>\n

Case study: blocked CSS and JavaScript causing rendering failures.<\/strong> A web design agency in Leeds rebuilt a client's website on a modern JavaScript framework and, during the migration, carried over a legacy robots.txt that included `Disallow: /assets/` — which happened to contain both the CSS and JavaScript bundles for the new site. Visually the site worked fine for users, but Googlebot could not render any page correctly. In Google's cached versions, every page appeared as unstyled plain text with broken navigation. Over the following month, rankings for their top 15 service pages dropped an average of 12 positions. The fix was a single line removal in robots.txt. Within two weeks of the change, Google re-rendered the pages correctly and rankings began recovering. Google's rendering documentation<\/a> specifically warns against blocking resources that Googlebot needs to render pages, and their Mobile-Friendly Test tool can confirm whether blocked resources are affecting page rendering.<\/p>\n

Common robots.txt Directives Reference<\/h3>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Directive<\/th>Purpose<\/th>Example<\/th><\/tr><\/thead>
`User-agent: *`<\/td>Apply rules to all crawlers<\/td>`User-agent: *`<\/td><\/tr>
`User-agent: Googlebot`<\/td>Apply rules only to Google's crawler<\/td>`User-agent: Googlebot`<\/td><\/tr>
`Disallow: /path/`<\/td>Block crawling of a directory<\/td>`Disallow: /admin/`<\/td><\/tr>
`Disallow: /`<\/td>Block crawling of the entire site<\/td>`Disallow: /`<\/td><\/tr>
`Disallow:` (empty)<\/td>Allow crawling of everything<\/td>`Disallow:`<\/td><\/tr>
`Allow: /path/`<\/td>Override a Disallow for a sub-path (Google-supported)<\/td>`Allow: /wp-admin/admin-ajax.php`<\/td><\/tr>
`Disallow: /*.pdf The Complete robots.txt SEO Guide 2026 | RnkRocket

We use cookies to measure visits and improve RnkRocket. Accept analytics cookies or continue with essential only. Cookie policy

Not getting calls from Google? Find out why. See how it works →
Skip to main content

The Complete Guide to robots.txt for SEO

Your robots.txt file tells search engines what they can and cannot crawl. Get it wrong and you risk blocking your entire site from Google — here's how to use it correctly.

By RnkRocket Team
May 28, 2026
13 min read
The Complete Guide to robots.txt for SEO

Key Takeaways

  • robots.txt controls which parts of your site search engine crawlers can access — it does not control indexing, only crawling (Google Search Central)
  • A single misconfigured robots.txt can de-index your entire website from Google — this is one of the most common and catastrophic technical SEO errors
  • Blocking a URL in robots.txt does not prevent it from appearing in Google's index if other sites link to it — use noindex for true indexing control
  • RnkRocket's site intelligence crawl checks your robots.txt for misconfigurations and blocked important pages automatically

Few files on your website carry as much risk as robots.txt. It is typically tiny — sometimes just a handful of lines — yet a single typo or misunderstanding can cause Google to stop crawling your entire site, leading to pages disappearing from search results within days.

We have seen this happen more times than we care to count. A developer adds a disallow rule before a site launch and forgets to remove it. An SEO plugin generates a robots.txt with an overly aggressive block. The result is always the same: traffic collapses, pages vanish from search, and the culprit takes days to identify because no one thinks to check a three-line text file.

This guide explains exactly how robots.txt works, how to write it correctly, and how to verify it is not quietly working against you.


What Is robots.txt?

robots.txt is a plain text file placed in the root of your website (e.g. `https://yourdomain.com/robots.txt\`) that communicates instructions to automated crawlers — including search engine bots like Googlebot, Bingbot, and others.

It follows the Robots Exclusion Protocol, a convention that web robots have followed since the mid-1990s. Unlike many areas of SEO, the protocol is technically a standard rather than a binding rule — crawlers choose to follow it as a courtesy. Most reputable crawlers do; malicious scrapers generally do not.

The file uses a simple syntax:

``` User-agent: * Disallow: /admin/ Allow: /admin/public/ Sitemap: https://yourdomain.com/sitemap.xml ```

  • `User-agent` specifies which bot the rules apply to (`*` means all bots)
  • `Disallow` specifies paths the bot should not crawl
  • `Allow` (supported by Google) overrides a disallow for a specific sub-path
  • `Sitemap` declares the location of your XML sitemap

The Critical Distinction: Crawling vs Indexing

This is the most important concept in this entire guide, and the one most often confused.

Blocking crawling does not block indexing.

If you disallow a URL in robots.txt, Googlebot will not crawl that URL. But if other websites link to that URL, Google may still discover it, list it in its index, and show it in search results — just without being able to read its content. This can result in pages appearing in Google with no title or description, just the URL and a snippet saying "A description for this result is not available because of this site's robots.txt."

This is the exact opposite of what most people intend when they block a URL. If you want a page kept out of the index, you need a `noindex` directive on the page itself — not a robots.txt rule.

The practical consequence:

GoalCorrect Approach
Prevent Googlebot crawling a sectionrobots.txt Disallow
Prevent a page appearing in search resultsnoindex meta tag or X-Robots-Tag header
Both: not crawled and not indexednoindex on the page (bot must be able to crawl it to see the noindex)

You cannot effectively noindex a page you are blocking in robots.txt — because Google cannot crawl the page to read the noindex directive.


robots.txt Syntax in Detail

User-agent Targeting

You can write rules for all bots or specific ones:

```

Rules for all crawlers

User-agent: * Disallow: /private/

Rules only for Googlebot

User-agent: Googlebot Disallow: /google-specific-block/

Rules only for Bingbot

User-agent: Bingbot Allow: / ```

Each `User-agent` line starts a new block. Rules apply to the agent specified until the next `User-agent` line. You can have multiple `User-agent` lines in a single block if you want the same rules to apply to multiple named bots.

Path Matching Rules

robots.txt uses simple path prefix matching:

  • `Disallow: /admin/` blocks everything under `/admin/`
  • `Disallow: /admin` also blocks `/admins` and `/administration` — note the trailing slash matters
  • `Disallow: /` blocks your entire site from being crawled — the most dangerous rule possible
  • `Disallow:` (empty value) means allow everything — the same as not having a disallow rule
  • `Allow: /` explicitly allows everything

Google also supports basic wildcard patterns:

  • `` matches any sequence of characters: `Disallow: /.pdf` blocks all PDF files
  • `$` matches end of URL: `Disallow: /*.pdf$` blocks URLs ending in `.pdf` specifically

Order of Rules

When `Allow` and `Disallow` rules conflict for the same URL, Google uses the most specific rule. If specificity is equal, the `Allow` wins.

``` Disallow: /images/ Allow: /images/public/ ```

In this example, `/images/private/photo.jpg` is blocked, but `/images/public/logo.jpg` is allowed.


What to Block with robots.txt

Here is a practical guide to what is typically worth blocking and what is not.

Commonly Blocked Sections

Admin and login pages. These should not appear in search results, and blocking crawling saves crawl budget.

``` Disallow: /wp-admin/ Allow: /wp-admin/admin-ajax.php ```

Note the Allow for `admin-ajax.php` — WordPress requires this to be crawlable for certain frontend features to work.

Staging and development environments. If you have a staging site on a subdirectory or subdomain, block it entirely. Better still, password-protect it.

Internal search results pages. If your site has search functionality (e.g. `/search/?q=example`), blocking the search result pages prevents Googlebot from crawling thousands of near-duplicate pages. This is a crawl budget concern for larger sites.

``` Disallow: /search/ ```

Duplicate content from URL parameters. Faceted navigation on e-commerce sites can generate thousands of near-duplicate URLs (e.g. `/products/?colour=red&size=medium`). Blocking these from crawling helps focus Google's attention on your canonical product pages.

Account and checkout pages. Pages behind login, checkout flows, and order confirmation pages have no value in search results.

What You Should Not Block

Pages you want indexed. This sounds obvious, but it is the most common mistake. We regularly see sites blocking their service pages, blog posts, or product categories — usually an accident from a blanket disallow rule.

Pages with noindex directives. If a page has noindex, you do not need to block it in robots.txt. In fact, blocking it means Google cannot crawl the page and therefore cannot see the noindex — potentially leaving it indexable from external links.

CSS, JavaScript, and image files. In the early days of SEO, blocking CSS and JS was common to speed up crawls. This is now actively harmful — Google needs to render your pages to understand them, and blocking your stylesheets and scripts prevents Google from seeing your pages as users do. Google explicitly recommends allowing crawlers access to all resources needed to render pages.

Case study: blocked CSS and JavaScript causing rendering failures. A web design agency in Leeds rebuilt a client's website on a modern JavaScript framework and, during the migration, carried over a legacy robots.txt that included `Disallow: /assets/`, a directory that happened to contain both the CSS and JavaScript bundles for the new site. Visually the site worked fine for users, but Googlebot could not render any page correctly. In Google's cached versions, every page appeared as unstyled plain text with broken navigation. Over the following month, rankings for their top 15 service pages dropped an average of 12 positions. The fix was a single line removal in robots.txt. Within two weeks of the change, Google re-rendered the pages correctly and rankings began recovering. Google's rendering documentation specifically warns against blocking resources that Googlebot needs to render pages, and the URL Inspection tool in Search Console can confirm whether blocked resources are affecting page rendering.

Common robots.txt Directives Reference

| Directive | Purpose | Example |
| --- | --- | --- |
| `User-agent: *` | Apply rules to all crawlers | `User-agent: *` |
| `User-agent: Googlebot` | Apply rules only to Google's crawler | `User-agent: Googlebot` |
| `Disallow: /path/` | Block crawling of a directory | `Disallow: /admin/` |
| `Disallow: /` | Block crawling of the entire site | `Disallow: /` |
| `Disallow:` (empty) | Allow crawling of everything | `Disallow:` |
| `Allow: /path/` | Override a Disallow for a sub-path (Google-supported) | `Allow: /wp-admin/admin-ajax.php` |
| `Disallow: /*.pdf$` | Block URLs matching a wildcard pattern | Blocks all PDF file URLs |
| `Sitemap:` | Declare sitemap location for all crawlers | `Sitemap: https://example.com/sitemap.xml` |
| `Crawl-delay: 10` | Request a delay between requests (honoured by Bing, ignored by Google) | `Crawl-delay: 10` |
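To see how a wildcard pattern such as `/*.pdf$` behaves, it can help to translate it into a regular expression. The sketch below is one way to do that, assuming `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names `pattern_to_regex` and `matches` are hypothetical, and real crawlers may differ in edge cases.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters; a trailing `$` anchors the
    pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))             # True
print(matches("/*.pdf$", "/files/report.pdf?download=1"))  # False
print(matches("/admin/", "/admin/users"))                  # True
```

The second call shows why the `$` matters: with the anchor in place, a PDF URL carrying a query string is no longer matched by the rule.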

For a thorough specification of robots.txt syntax and behaviour, see the Yoast guide to robots.txt and the RFC 9309 standard which formalised the protocol.


The Most Dangerous robots.txt Mistakes

Blocking Your Entire Site

```
User-agent: *
Disallow: /
```

This single rule tells every crawler to stay out of your entire site. It is sometimes added intentionally during development (valid) but catastrophic if left in place after launch. Always check your live robots.txt after a site migration or relaunch.
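A post-launch check for this mistake is easy to automate. The following is a rough sketch of such a lint, assuming the file contents are already available as a string. The function name `blocks_entire_site` is hypothetical, and the check deliberately ignores any `Allow` rules that might partially override the block.

```python
def blocks_entire_site(robots_txt, agent="*"):
    """Return True if the group for `agent` contains a bare `Disallow: /`."""
    in_group = False        # are we inside a group that names `agent`?
    reading_agents = False  # are we in a run of consecutive User-agent lines?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                # A User-agent line after rule lines starts a new group.
                in_group = False
                reading_agents = True
            if value == agent:
                in_group = True
        else:
            reading_agents = False
            if in_group and field == "disallow" and value == "/":
                return True
    return False


print(blocks_entire_site("User-agent: *\nDisallow: /"))           # True
print(blocks_entire_site("User-agent: *\nDisallow: /wp-admin/"))  # False
```

Running a check like this against the live file after every deployment catches a leftover development block before crawlers act on it.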

Forgetting the Trailing Slash

`Disallow: /admin` blocks `/admin`, `/admins`, `/administration` — any URL starting with that string. `Disallow: /admin/` blocks only pages within the `/admin/` directory. This distinction causes unexpected blocks more often than you might expect.
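The difference comes down to plain prefix matching, which a short Python sketch makes concrete. The sample paths and the helper name `blocked_by` are made up for illustration.

```python
paths = ["/admin", "/admin/", "/admin/users", "/admins", "/administration"]


def blocked_by(pattern, candidates):
    """Return the candidate paths that start with `pattern` (prefix match)."""
    return [p for p in candidates if p.startswith(pattern)]


print(blocked_by("/admin", paths))
# ['/admin', '/admin/', '/admin/users', '/admins', '/administration']
print(blocked_by("/admin/", paths))
# ['/admin/', '/admin/users']
```

Note that with the trailing slash, the bare `/admin` URL itself is no longer blocked, which may or may not be what you intend.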

Using robots.txt for Security

If you have pages with genuinely sensitive content, robots.txt is not the place to protect them. The file is publicly visible — anyone can read it at `yourdomain.com/robots.txt`. In fact, it is sometimes used by attackers to find hidden admin paths. Use proper authentication for sensitive pages, not robots.txt.

Blocking CDN-served Resources

If your images, CSS, or JavaScript are served from a CDN subdomain (e.g. `cdn.yourdomain.com`), the robots.txt at your main domain does not apply. Each subdomain has its own robots.txt. If your CDN subdomain has no robots.txt (or has a blocking one), crawlers may not be able to access your resources.
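Since the governing file is always the one on the same host as the resource, you can work out which robots.txt applies to any asset URL. This is a small sketch using Python's standard `urllib.parse`; the function name `robots_url_for` and the sample URLs are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(resource_url):
    """Return the robots.txt URL that governs `resource_url`."""
    parts = urlsplit(resource_url)
    # Keep the scheme and host; replace the path and drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://cdn.yourdomain.com/css/site.css?v=3"))
# https://cdn.yourdomain.com/robots.txt
print(robots_url_for("https://yourdomain.com/services/"))
# https://yourdomain.com/robots.txt
```

Running this over the asset URLs a page loads gives you the list of robots.txt files worth checking, one per host.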

Conflicting Rules Between robots.txt and Canonical Tags

If you block a URL in robots.txt but it has a canonical pointing elsewhere, Google cannot follow the canonical because it cannot crawl the page. This creates orphaned canonical tags that Google ignores — a subtle but real issue for sites with complex URL management.


How to Test Your robots.txt

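Google Search Console: robots.txt Report

Google Search Console includes a robots.txt report under Settings > robots.txt. It shows the robots.txt files Google has found for your site, when each was last fetched, and any parsing warnings or errors. The standalone robots.txt Tester that previously let you test individual URLs has been retired; to check whether a specific URL is blocked, run it through the URL Inspection tool instead.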

Direct URL Test

Simply visit `https://yourdomain.com/robots.txt` in a browser. If you get a 404, you have no robots.txt file, which means all crawlers can access everything (not necessarily a problem for small sites, but worth knowing). If you get a 200 response, review the content carefully.
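The same check can be scripted if you look after several sites. Below is a minimal sketch using Python's standard `urllib`; the function names `robots_status` and `describe` are hypothetical, and only the two status codes discussed above are interpreted.

```python
import urllib.error
import urllib.request


def robots_status(url, timeout=10):
    """Fetch `url` and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is still what we want.
        return err.code


def describe(status):
    """Map a status code to the interpretation given in this guide."""
    if status == 404:
        return "no robots.txt: crawlers may access everything"
    if status == 200:
        return "robots.txt found: review its rules"
    return "unexpected status: investigate manually"


# Example (requires network access):
# print(describe(robots_status("https://yourdomain.com/robots.txt")))
```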

Third-party Validators

Tools like Merkle's robots.txt tester allow you to paste in your file and test specific URLs against it, including URLs matched by wildcard rules.

RnkRocket's Automated Audit

RnkRocket checks your robots.txt as part of its site intelligence crawl — flagging rules that block important pages, missing sitemap declarations, and syntax errors. This runs automatically, so you will be alerted to changes that could affect your crawlability.

For context on how robots.txt fits into a broader technical SEO strategy, see Technical SEO Explained and our guide to XML Sitemaps.


robots.txt and Crawl Budget

For most small business websites, crawl budget — the number of pages Google will crawl on any given visit — is not a meaningful concern. Google crawls small sites comprehensively regardless.

Crawl budget becomes relevant when your site has tens of thousands of URLs. In those cases, using robots.txt to block low-value URLs (internal search results, parameter-generated duplicates, admin pages) concentrates Google's crawl capacity on your valuable pages.

The alternative — allowing Google to crawl thousands of near-duplicate search result pages — wastes crawl capacity that could be directed at your actual content.

For a small business with a few dozen to a few hundred pages, optimising for crawl budget through robots.txt is unnecessary. Focus instead on clean URL structures and proper internal linking.


A Baseline robots.txt for Small Business Sites

For most small business websites — particularly those running on WordPress — this is a sensible baseline:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
```

This blocks the WordPress admin, login page, and site search results (which create near-duplicate pages), while declaring your sitemap location for all crawlers. Everything else is left accessible.

If you run Shopify, Squarespace, or another hosted platform, the platform generates a robots.txt for you and it is typically sensible out of the box. Check it, but you rarely need to change it unless you have specific needs.


Frequently Asked Questions

Does robots.txt affect my rankings directly?

No — robots.txt affects crawling, not ranking signals directly. However, if you accidentally block important pages from being crawled, those pages cannot be ranked. The indirect effect on rankings can be severe if misconfigurations prevent your key pages from being indexed.

Can I block specific bots while allowing Google?

Yes. Use named user-agents to target specific bots. For example, to block SEO scrapers while allowing Google:

```
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
```

Note that blocking SEO tool crawlers does not prevent those tools from showing data about your site if they have already crawled it previously.
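You can sanity-check per-bot rules like these locally with Python's standard `urllib.robotparser`. Treat it as a rough check only: this module is not Google's parser, it applies rules in file order rather than by specificity, and it does not support wildcards. For simple whole-site rules like the ones above, though, it is enough to confirm which user-agents are blocked. The test URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether each crawler may fetch a sample page.
for agent in ("AhrefsBot", "SemrushBot", "Googlebot"):
    print(agent, parser.can_fetch(agent, "https://yourdomain.com/services/"))
# AhrefsBot False
# SemrushBot False
# Googlebot True
```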

What happens if I have no robots.txt file at all?

Nothing bad. With no robots.txt file, all crawlers are permitted to access all pages. A missing robots.txt file returns a 404, which crawlers interpret as "no restrictions". This is fine for most small sites.

Can robots.txt stop my content from being scraped?

No. robots.txt is a voluntary protocol. Legitimate crawlers follow it, but malicious scrapers typically do not. If you need to protect content from scraping, you need technical measures like authentication, rate limiting, or CAPTCHAs — not robots.txt.

How quickly does Googlebot respond to robots.txt changes?

Google typically fetches and processes robots.txt changes within a few hours to a day. However, if you block pages that Google previously crawled and cached, it may take several days or a few weeks for those pages to disappear from the index. If you have inadvertently blocked pages, fix the robots.txt immediately and use the URL inspection tool in Search Console to request recrawling.



RnkRocket automatically checks your robots.txt for common mistakes — including rules that accidentally block your most important pages. See what RnkRocket finds on your site.
