The Complete Guide to robots.txt for SEO
Your robots.txt file tells search engines what they can and cannot crawl. Get it wrong and you risk blocking your entire site from Google — here's how to use it correctly.

Key Takeaways
- robots.txt controls which parts of your site search engine crawlers can access — it does not control indexing, only crawling (Google Search Central)
- A single misconfigured robots.txt can de-index your entire website from Google — this is one of the most common and catastrophic technical SEO errors
- Blocking a URL in robots.txt does not prevent it from appearing in Google's index if other sites link to it — use noindex for true indexing control
- RnkRocket's site intelligence crawl checks your robots.txt for misconfigurations and blocked important pages automatically
Few files on your website carry as much risk as robots.txt. It is typically tiny — sometimes just a handful of lines — yet a single typo or misunderstanding can cause Google to stop crawling your entire site, leading to pages disappearing from search results within days.
We have seen this happen more times than we care to count. A developer adds a disallow rule before a site launch and forgets to remove it. An SEO plugin generates a robots.txt with an overly aggressive block. The result is always the same: traffic collapses, pages vanish from search, and the culprit takes days to identify because no one thinks to check a three-line text file.
This guide explains exactly how robots.txt works, how to write it correctly, and how to verify it is not quietly working against you.
What Is robots.txt?
robots.txt is a plain text file placed in the root of your website (e.g. `https://yourdomain.com/robots.txt`) that communicates instructions to automated crawlers, including search engine bots like Googlebot and Bingbot.
It follows the Robots Exclusion Protocol, a convention that web robots have followed since the mid-1990s and that was formalised as RFC 9309 in 2022. The protocol is a standard rather than an enforcement mechanism: crawlers choose to follow it as a courtesy. Most reputable crawlers do; malicious scrapers generally do not.
The file uses a simple syntax:
```
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://yourdomain.com/sitemap.xml
```
- `User-agent` specifies which bot the rules apply to (`*` means all bots)
- `Disallow` specifies paths the bot should not crawl
- `Allow` (supported by Google) overrides a disallow for a specific sub-path
- `Sitemap` declares the location of your XML sitemap
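To see how a crawler might read these directives, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The domain, paths, and the `ExampleBot` name are placeholders. One caveat: Python's parser applies rules in file order rather than Google's most-specific-match logic, so this sketch only checks URLs where the two approaches agree.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" has no dedicated group, so the "*" rules apply to it.
blocked = not parser.can_fetch("ExampleBot", "https://yourdomain.com/admin/settings")
open_page = parser.can_fetch("ExampleBot", "https://yourdomain.com/blog/hello-world")
sitemaps = parser.site_maps()  # list of declared sitemap URLs, or None

print("admin blocked:", blocked)
print("blog crawlable:", open_page)
print("sitemaps:", sitemaps)
```

`site_maps()` requires Python 3.8 or later.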
The Critical Distinction: Crawling vs Indexing
This is the most important concept in this entire guide, and the one most often confused.
Blocking crawling does not block indexing.
If you disallow a URL in robots.txt, Googlebot will not crawl that URL. But if other websites link to that URL, Google may still discover it, list it in its index, and show it in search results — just without being able to read its content. This can result in pages appearing in Google with no title or description, just the URL and a snippet saying "A description for this result is not available because of this site's robots.txt."
This is the exact opposite of what most people intend when they block a URL. If you want a page kept out of the index, you need a `noindex` directive on the page itself — not a robots.txt rule.
The practical consequence:
| Goal | Correct Approach |
|---|---|
| Prevent Googlebot crawling a section | robots.txt Disallow |
| Prevent a page appearing in search results | noindex meta tag or X-Robots-Tag header |
| Both: not crawled and not indexed | noindex on the page (bot must be able to crawl it to see the noindex) |
You cannot effectively noindex a page you are blocking in robots.txt — because Google cannot crawl the page to read the noindex directive.
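As a rough illustration of the two places a `noindex` can live, the sketch below checks a response's headers and HTML for the directive. The function and class names are hypothetical, and it is deliberately simplified: it ignores user-agent-scoped header values such as `googlebot: noindex`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers, html):
    """Return True if the headers or the HTML carry a noindex directive."""
    header_directives = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives += [d.strip().lower() for d in value.split(",")]
    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_directives or "noindex" in meta.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex({}, page))                                      # meta tag route
print(has_noindex({"X-Robots-Tag": "noindex"}, "<html></html>"))  # header route
```

Either signal only works if the crawler is allowed to fetch the page in the first place, which is why the URL must not be disallowed in robots.txt.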
robots.txt Syntax in Detail
User-agent Targeting
You can write rules for all bots or specific ones:
```
# Rules for all crawlers
User-agent: *
Disallow: /private/

# Rules only for Googlebot
User-agent: Googlebot
Disallow: /google-specific-block/

# Rules only for Bingbot
User-agent: Bingbot
Allow: /
```
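A `User-agent` line starts a new group, and the rules that follow apply to that agent until the next group begins. Consecutive `User-agent` lines at the top of a group share its rules, which lets you apply the same rules to several named bots. A crawler obeys only the most specific group that matches it: in the example above, Googlebot follows its own group and ignores the `*` rules.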
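The group selection can be checked with Python's standard-library `urllib.robotparser`. This is a sketch rather than a model of any particular search engine; `SomeOtherBot` and the report URL are made-up placeholders. It shows that a bot with its own group is not bound by the `*` group.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Rules for all crawlers
User-agent: *
Disallow: /private/

# Rules only for Googlebot
User-agent: Googlebot
Disallow: /google-specific-block/

# Rules only for Bingbot
User-agent: Bingbot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://yourdomain.com/private/report.html"
bots = ("Googlebot", "Bingbot", "SomeOtherBot")

# Each bot is matched to its own group first; only unnamed bots fall back to "*".
results = {bot: parser.can_fetch(bot, url) for bot in bots}
for bot, allowed in results.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Only the unnamed bot is kept out of `/private/` here. If you want a rule to apply to Googlebot as well, it has to be repeated inside the Googlebot group.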
Path Matching Rules
robots.txt uses simple path prefix matching:
- `Disallow: /admin/` blocks everything under `/admin/`
- `Disallow: /admin` also blocks `/admins` and `/administration` — note the trailing slash matters
- `Disallow: /` blocks your entire site from being crawled — the most dangerous rule possible
- `Disallow:` (empty value) means allow everything — the same as not having a disallow rule
- `Allow: /` explicitly allows everything
Google also supports basic wildcard patterns:
- `*` matches any sequence of characters: `Disallow: /*.pdf` blocks any URL containing `.pdf`
- `$` matches end of URL: `Disallow: /*.pdf$` blocks URLs ending in `.pdf` specifically
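One way to reason about these two wildcards is to translate a pattern into a regular expression. The helper below is a hypothetical sketch of that idea, not Google's implementation, and the sample paths are invented.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters; a trailing `$` anchors the end of the URL.
    Without `$`, the pattern only has to match the start of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then rejoin the pieces with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the robots.txt pattern applies to the given path."""
    return pattern_to_regex(pattern).match(path) is not None


for pattern in ("/*.pdf", "/*.pdf$"):
    for path in ("/files/report.pdf", "/files/report.pdf?download=1"):
        print(pattern, path, matches(pattern, path))
```

The unanchored pattern also catches the URL with a query string, while the `$` version only matches URLs that end in `.pdf`.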
Order of Rules
When `Allow` and `Disallow` rules conflict for the same URL, Google uses the most specific rule, meaning the one with the longest matching path. If two matching rules are equally specific, the less restrictive `Allow` wins.
```
Disallow: /images/
Allow: /images/public/
```
In this example, `/images/private/photo.jpg` is blocked, but `/images/public/logo.jpg` is allowed.
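The resolution logic can be sketched in a few lines of Python. This is an illustrative model that assumes plain prefix rules without wildcards; `is_allowed` is a hypothetical helper, not part of any library.

```python
def is_allowed(rules, path):
    """Resolve Allow/Disallow conflicts by specificity.

    rules: list of (directive, path_prefix) tuples, directive being "allow" or "disallow".
    The longest matching prefix wins; on a tie, "allow" wins.
    A path that matches no rule is allowed.
    """
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are ignored
        is_allow = directive == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/images/"), ("allow", "/images/public/")]
for path in ("/images/private/photo.jpg", "/images/public/logo.jpg", "/about/"):
    print(path, "allowed" if is_allowed(rules, path) else "blocked")
```

Because the outcome depends on rule length rather than position, reordering the two lines in the file would not change the result under this model.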
What to Block with robots.txt
Here is a practical guide to what is typically worth blocking and what is not.
Commonly Blocked Sections
Admin and login pages. These should not appear in search results, and blocking crawling saves crawl budget.
```
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```
Note the Allow for `admin-ajax.php` — WordPress requires this to be crawlable for certain frontend features to work.
Staging and development environments. If you have a staging site on a subdirectory or subdomain, block it entirely. Better still, password-protect it.
Internal search results pages. If your site has search functionality (e.g. `/search/?q=example`), blocking the search result pages prevents Googlebot from crawling thousands of near-duplicate pages. This is a crawl budget concern for larger sites.
```
Disallow: /search/
```
Duplicate content from URL parameters. Faceted navigation on e-commerce sites can generate thousands of near-duplicate URLs (e.g. `/products/?colour=red&size=medium`). Blocking these from crawling helps focus Google's attention on your canonical product pages.
Account and checkout pages. Pages behind login, checkout flows, and order confirmation pages have no value in search results.
What You Should Not Block
Pages you want indexed. This sounds obvious, but it is the most common mistake. We regularly see sites blocking their service pages, blog posts, or product categories — usually an accident from a blanket disallow rule.
Pages with noindex directives. If a page has noindex, you do not need to block it in robots.txt. In fact, blocking it means Google cannot crawl the page and therefore cannot see the noindex — potentially leaving it indexable from external links.
CSS, JavaScript, and image files. In the early days of SEO, blocking CSS and JS was common to speed up crawls. This is now actively harmful — Google needs to render your pages to understand them, and blocking your stylesheets and scripts prevents Google from seeing your pages as users do. Google explicitly recommends allowing crawlers access to all resources needed to render pages.
Case study: blocked CSS and JavaScript causing rendering failures. A web design agency in Leeds rebuilt a client's website on a modern JavaScript framework and, during the migration, carried over a legacy robots.txt that included `Disallow: /assets/`. That directory happened to contain both the CSS and JavaScript bundles for the new site. Visually the site worked fine for users, but Googlebot could not render any page correctly. In Google's cached versions, every page appeared as unstyled plain text with broken navigation. Over the following month, rankings for their top 15 service pages dropped an average of 12 positions. The fix was a single line removal in robots.txt. Within two weeks of the change, Google re-rendered the pages correctly and rankings began recovering. Google's rendering documentation specifically warns against blocking resources that Googlebot needs to render pages. The URL Inspection tool in Search Console can confirm whether blocked resources are affecting page rendering; the standalone Mobile-Friendly Test tool that used to serve this purpose was retired in December 2023.
Common robots.txt Directives Reference
| Directive | Purpose | Example |
|---|---|---|
| `User-agent: *` | Apply rules to all crawlers | `User-agent: *` |
| `User-agent: Googlebot` | Apply rules only to Google's crawler | `User-agent: Googlebot` |
| `Disallow: /path/` | Block crawling of a directory | `Disallow: /admin/` |
| `Disallow: /` | Block crawling of the entire site | `Disallow: /` |
| `Disallow:` (empty) | Allow crawling of everything | `Disallow:` |
| `Allow: /path/` | Override a Disallow for a sub-path (Google-supported) | `Allow: /wp-admin/admin-ajax.php` |
| `Disallow: /*.pdf$` | Block URLs matching a wildcard pattern | Blocks all PDF file URLs |
| `Sitemap:` | Declare sitemap location for all crawlers | `Sitemap: https://example.com/sitemap.xml` |
| `Crawl-delay: 10` | Request a delay between requests (honoured by Bing, ignored by Google) | `Crawl-delay: 10` |
For a thorough specification of robots.txt syntax and behaviour, see the Yoast guide to robots.txt and the RFC 9309 standard which formalised the protocol.
The Most Dangerous robots.txt Mistakes
Blocking Your Entire Site
```
User-agent: *
Disallow: /
```
This single rule tells every crawler to stay out of your entire site. It is sometimes added intentionally during development (valid) but catastrophic if left in place after launch. Always check your live robots.txt after a site migration or relaunch.
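A post-launch check for this mistake can be automated. The sketch below uses Python's standard-library `urllib.robotparser` on the file contents; `blocks_entire_site` is a hypothetical helper name, and fetching the live file is left out so the example stays self-contained.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt, user_agent="*"):
    """Return True if the homepage is off-limits to the given crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


launch_blocker = "User-agent: *\nDisallow: /\n"
harmless = "User-agent: *\nDisallow:\n"

print(blocks_entire_site(launch_blocker))  # the development-time rule left in place
print(blocks_entire_site(harmless))        # an empty Disallow allows everything
```

Running a check like this against the live file after every migration or relaunch is one way to catch the rule before search engines act on it.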
Forgetting the Trailing Slash
`Disallow: /admin` blocks `/admin`, `/admins`, `/administration` — any URL starting with that string. `Disallow: /admin/` blocks only pages within the `/admin/` directory. This distinction causes unexpected blocks more often than you might expect.
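The difference is easy to see with plain prefix matching. The paths below are invented examples, and `blocked_by` is a hypothetical helper that models a single Disallow value.

```python
def blocked_by(rule_path, paths):
    """Return the paths that a single Disallow value would catch (plain prefix match)."""
    return [p for p in paths if p.startswith(rule_path)]


paths = ["/admin", "/admin/users", "/admins", "/administration", "/about"]

without_slash = blocked_by("/admin", paths)
with_slash = blocked_by("/admin/", paths)

print("Disallow: /admin  ->", without_slash)
print("Disallow: /admin/ ->", with_slash)
```

Note that under this model the slash version does not catch the bare `/admin` URL itself, only what sits beneath it.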
Using robots.txt for Security
If you have pages with genuinely sensitive content, robots.txt is not the place to protect them. The file is publicly visible — anyone can read it at `yourdomain.com/robots.txt`. In fact, it is sometimes used by attackers to find hidden admin paths. Use proper authentication for sensitive pages, not robots.txt.
Blocking CDN-served Resources
If your images, CSS, or JavaScript are served from a CDN subdomain (e.g. `cdn.yourdomain.com`), the robots.txt at your main domain does not apply. Each subdomain has its own robots.txt. If the CDN subdomain serves a robots.txt that blocks crawling, crawlers cannot fetch those resources, even when your main robots.txt allows everything. If the subdomain has no robots.txt at all, crawlers treat it as unrestricted.
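To find out which robots.txt governs a given resource, derive it from the resource's own scheme and host. The following is a small sketch with placeholder URLs; `robots_url_for` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(resource_url):
    """Return the robots.txt URL that governs the given resource URL."""
    parts = urlsplit(resource_url)
    # robots.txt always lives at the root of the same scheme and host.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://yourdomain.com/blog/post",
    "https://cdn.yourdomain.com/css/site.css",
):
    print(url, "->", robots_url_for(url))
```

Listing the distinct hosts your pages load resources from, then checking each host's robots.txt, covers the CDN case described above.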
Conflicting Rules Between robots.txt and Canonical Tags
If you block a URL in robots.txt but it has a canonical pointing elsewhere, Google cannot follow the canonical because it cannot crawl the page. This creates orphaned canonical tags that Google ignores — a subtle but real issue for sites with complex URL management.
How to Test Your robots.txt
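Google Search Console robots.txt Report
Google Search Console has a built-in robots.txt report under Settings > robots.txt. It shows the robots.txt files Google has found for your site, when each was last fetched, and any warnings or syntax errors. To check whether a specific URL is blocked, run it through the URL Inspection tool. The older standalone robots.txt Tester, which let you test arbitrary URLs against the file, has been retired.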
Direct URL Test
Simply visit `https://yourdomain.com/robots.txt` in a browser. If you get a 404, you have no robots.txt file, which means all crawlers can access everything (not necessarily a problem for small sites, but worth knowing). If you get a 200 response, review the content carefully.
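If you script this check, the status code is the first thing to interpret. The sketch below maps status classes to their consequence, following the general scheme in RFC 9309. The 5xx branch goes beyond what this guide covers, individual crawlers may deviate (for example in how they treat 429), and the function name is hypothetical.

```python
def interpret_robots_status(status_code):
    """Map the HTTP status of a robots.txt request to its likely crawling consequence."""
    if 200 <= status_code < 300:
        return "parse the file and apply its rules"
    if 400 <= status_code < 500:
        return "no restrictions: treated as if no robots.txt exists"
    if 500 <= status_code < 600:
        return "server error: crawlers may treat the whole site as disallowed"
    return "other: follow redirects or investigate manually"


for code in (200, 404, 503):
    print(code, "->", interpret_robots_status(code))
```

The practical takeaway is that a robots.txt URL returning server errors deserves the same urgency as a bad rule inside the file.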
Third-party Validators
Tools like Merkle's robots.txt tester allow you to paste in your file and test specific URLs against it, including wildcard rules.
RnkRocket's Automated Audit
RnkRocket checks your robots.txt as part of its site intelligence crawl — flagging rules that block important pages, missing sitemap declarations, and syntax errors. This runs automatically, so you will be alerted to changes that could affect your crawlability.
For context on how robots.txt fits into a broader technical SEO strategy, see Technical SEO Explained and our guide to XML Sitemaps.
robots.txt and Crawl Budget
For most small business websites, crawl budget — the number of pages Google will crawl on any given visit — is not a meaningful concern. Google crawls small sites comprehensively regardless.
Crawl budget becomes relevant when your site has tens of thousands of URLs. In those cases, using robots.txt to block low-value URLs (internal search results, parameter-generated duplicates, admin pages) concentrates Google's crawl capacity on your valuable pages.
The alternative — allowing Google to crawl thousands of near-duplicate search result pages — wastes crawl capacity that could be directed at your actual content.
For a small business with a few dozen to a few hundred pages, optimising for crawl budget through robots.txt is unnecessary. Focus instead on clean URL structures and proper internal linking.
A Baseline robots.txt for Small Business Sites
For most small business websites — particularly those running on WordPress — this is a sensible baseline:
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
```
This blocks the WordPress admin, login page, and site search results (which create near-duplicate pages), while declaring your sitemap location for all crawlers. Everything else is left accessible.
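Before deploying a baseline like this, it can help to run it against a short list of URLs you expect to be crawlable and a list you expect to be blocked. The sketch below is a simplified model: it assumes a single `User-agent: *` group, plain prefix rules, and longest-match resolution. The page paths in both lists are invented examples.

```python
BASELINE = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /?s=
Disallow: /search/

Sitemap: https://yourdomain.com/sitemap.xml
"""


def parse_rules(robots_txt):
    """Collect (is_allow, prefix) pairs; assumes a single `User-agent: *` group."""
    rules = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules


def is_allowed(rules, path):
    """Longest matching prefix wins; Allow wins ties; no match means allowed."""
    matching = [(len(prefix), is_allow) for is_allow, prefix in rules if path.startswith(prefix)]
    return max(matching)[1] if matching else True


rules = parse_rules(BASELINE)
must_crawl = ["/", "/services/", "/blog/robots-txt-guide/", "/wp-admin/admin-ajax.php"]
must_block = ["/wp-admin/options.php", "/wp-login.php", "/?s=shoes", "/search/shoes"]

problems = [p for p in must_crawl if not is_allowed(rules, p)]
problems += [p for p in must_block if is_allowed(rules, p)]
print(problems or "baseline behaves as expected")
```

Keeping the two lists alongside your deployment scripts turns an accidental blanket disallow into a failed check rather than a traffic drop.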
If you run Shopify, Squarespace, or another hosted platform, the platform generates a robots.txt for you and it is typically sensible out of the box. Check it, but you rarely need to change it unless you have specific needs.
Frequently Asked Questions
Does robots.txt affect my rankings directly?
No — robots.txt affects crawling, not ranking signals directly. However, if you accidentally block important pages from being crawled, those pages cannot be ranked. The indirect effect on rankings can be severe if misconfigurations prevent your key pages from being indexed.
Can I block specific bots while allowing Google?
Yes. Use named user-agents to target specific bots. For example, to block SEO scrapers while allowing Google:
```
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: *
Allow: /
```
Note that blocking SEO tool crawlers does not prevent those tools from showing data about your site if they have already crawled it previously.
What happens if I have no robots.txt file at all?
Nothing bad. With no robots.txt file, all crawlers are permitted to access all pages. A missing robots.txt file returns a 404, which crawlers interpret as "no restrictions". This is fine for most small sites.
Can robots.txt stop my content from being scraped?
No. robots.txt is a voluntary protocol. Legitimate crawlers follow it, but malicious scrapers typically do not. If you need to protect content from scraping, you need technical measures like authentication, rate limiting, or CAPTCHAs — not robots.txt.
How quickly does Googlebot respond to robots.txt changes?
Google typically fetches and processes robots.txt changes within a few hours to a day. However, if you block pages that Google previously crawled and cached, it may take several days or a few weeks for those pages to disappear from the index. If you have inadvertently blocked pages, fix the robots.txt immediately and use the URL inspection tool in Search Console to request recrawling.
Related Reading
- Technical SEO Explained: A Plain-English Guide
- XML Sitemaps Explained: Why They Matter and How to Create One
- The Complete SEO Audit Checklist for 2026
- What Is SEO? A Beginner's Guide
RnkRocket automatically checks your robots.txt for common mistakes — including rules that accidentally block your most important pages. See what RnkRocket finds on your site.


