Live check · fetches your URL server-side

Robots.txt Checker

Parse, test per user-agent (including GPTBot/ClaudeBot), detect rule conflicts.

A robots.txt file tells crawlers which pages they can and cannot access. Most validators test one bot and quit. This robots.txt checker tests per user-agent, including the AI crawlers that matter in 2026 (GPTBot, ClaudeBot, and PerplexityBot), detects rule conflicts when multiple directives apply, and validates whether your sitemap links actually exist.

Generate the whole content, not just check it.

BlazeHive writes SEO articles end to end from a single keyword. Outline, draft, meta, schema, internal links. Free trial, no card.

Start with BlazeHive Free trial

What a robots.txt checker actually does

A robots.txt checker fetches the /robots.txt file from your domain, parses every user-agent block and disallow rule, then tests whether a given path is allowed or blocked for a specific crawler. It applies the longest-matching rule when multiple patterns overlap, follows the order-of-precedence in the spec, and reports whether a test URL would be crawled.

Most crawlers look for their own user-agent block first. If one exists, they use those rules. If not, they fall back to the wildcard User-agent: * block. This means a site can allow Googlebot into /admin while blocking all other bots. Our checker simulates this cascade for any user-agent you pick from the User-agent to test dropdown.
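The cascade described above can be sketched in a few lines of Python. This is a minimal illustration, not our checker's actual implementation: the function names parse_groups and rules_for are hypothetical, matching is by exact (case-insensitive) agent name, and directives other than User-agent, Allow, and Disallow are ignored.

```python
def parse_groups(text):
    """Split robots.txt text into (agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule was already seen, so this line opens a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def rules_for(groups, user_agent):
    """Use the named group if one exists, otherwise fall back to '*'."""
    ua = user_agent.lower()
    matched = [rules for agents, rules in groups if ua in agents]
    if not matched:
        matched = [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in matched for rule in rules]


SAMPLE = """
User-agent: Googlebot
Allow: /admin

User-agent: *
Disallow: /admin
"""

groups = parse_groups(SAMPLE)
print(rules_for(groups, "Googlebot"))  # rules from the Googlebot group only
print(rules_for(groups, "GPTBot"))     # falls back to the wildcard group
```

Note that a named group replaces the wildcard group rather than adding to it, which is why Googlebot never sees the Disallow: /admin line in this sample.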

Two common mistakes break robots.txt files silently. The first is syntax errors: misspelled directives, missing colons, stray spaces, or Windows line endings that trip up stricter parsers. Capitalization of the directive name is not one of them: Disallow and disallow parse the same way under RFC 9309, although the path that follows is case-sensitive. The second is conflicting rules: allow and disallow lines that overlap, leaving it ambiguous whether a path is blocked. Our checker flags both and shows which rule wins.

How to use this robots.txt checker

  1. Paste your full domain into Site URL. We fetch yourdomain.com/robots.txt automatically. No need to type /robots.txt.
  2. Pick a User-agent to test from the dropdown. Googlebot, Bingbot, GPTBot, ClaudeBot, CCBot, PerplexityBot, anthropic-ai, or * for wildcard. This is the crawler identity we simulate.
  3. Paste a path into Test path if you want to check a specific URL. Leave it empty to see the full parsed ruleset. A path looks like /admin or /blog/post-slug.
  4. Hit Check robots.txt. You get the parsed file, per-agent rules, sitemap links, crawl-delay if set, and a verdict for your test path.
  5. Expand Rule conflicts if any rows are flagged. We show overlapping allow/disallow lines and tell you which one a real crawler would follow.

Try testing yourdomain.com with User-agent set to GPTBot and Test path set to /blog. If your robots.txt has no GPTBot block but disallows all bots from /admin, the blog is allowed and admin is blocked. Switch the user-agent to ClaudeBot and the result might change if you have a ClaudeBot-specific block.
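You can reproduce that scenario locally with Python's standard-library urllib.robotparser. This is a rough sketch for experimenting, not a replacement for the checker: the helper name verdict and the two sample files are made up for illustration, and the standard-library parser applies the first matching rule in file order and does not understand the * and $ path wildcards, so it can disagree with Googlebot on more complex files.

```python
from urllib.robotparser import RobotFileParser


def verdict(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# No GPTBot block: every bot falls back to the wildcard rules.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /admin
"""

# Same file plus a ClaudeBot-specific block.
WITH_CLAUDEBOT = WILDCARD_ONLY + """
User-agent: ClaudeBot
Disallow: /blog
"""

print(verdict(WILDCARD_ONLY, "GPTBot", "/blog"))       # True: blog is allowed
print(verdict(WILDCARD_ONLY, "GPTBot", "/admin"))      # False: admin is blocked
print(verdict(WITH_CLAUDEBOT, "ClaudeBot", "/blog"))   # False: named block applies
print(verdict(WITH_CLAUDEBOT, "ClaudeBot", "/admin"))  # True: wildcard block no longer applies
```

The last line is the one that surprises people: once ClaudeBot has its own block, it stops inheriting the wildcard rules, so /admin is no longer blocked for it unless the ClaudeBot block says so.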

Why testing per user-agent matters

Search crawlers are not the only bots that read robots.txt anymore. AI crawlers such as GPTBot from OpenAI, ClaudeBot from Anthropic, CCBot from Common Crawl, and PerplexityBot, along with Google's Google-Extended control token, now use robots.txt to decide whether they can collect your content for model training. If you block them, compliant crawlers keep your pages out of future training datasets. If you allow them, you opt in.

Three practical consequences.

Policy clarity. A robots.txt that says User-agent: * / Disallow: / blocks everyone, including Google. If that is not your intent, you need separate blocks per agent. Testing per user-agent reveals what each bot sees before a model trains on your content.

AI crawler control. In 2026, most site owners want search bots in but training bots out. That requires explicit GPTBot, ClaudeBot, and CCBot disallow blocks. Competitors ignore these agents. We test them by default because they matter.

Conflict detection. When you have both Disallow: /blog and Allow: /blog/public, the more specific rule wins. But hand-parsing which rule is longer or more specific is error-prone. Testing shows you exactly what a bot would do, not what you think the file says.

Rule precedence and wildcards

The robots.txt spec defines a precedence order when multiple rules match the same path. The rule with the longest matching prefix wins. If two rules have the same length, the allow rule wins over disallow.
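The precedence rule is easy to express in code. The sketch below assumes plain prefix patterns with no wildcards and a rule list that has already been narrowed to one user-agent; the function name resolve is hypothetical.

```python
def resolve(rules, path):
    """Pick the winning rule for path.

    rules is a list of (kind, prefix) pairs, where kind is "allow" or
    "disallow". Returns (allowed, winning_rule); winning_rule is None
    when nothing matches, in which case the path is allowed by default.
    """
    best = None
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty pattern matches nothing
        longer = best is None or len(prefix) > len(best[1])
        tie_goes_to_allow = (
            best is not None and len(prefix) == len(best[1]) and kind == "allow"
        )
        if longer or tie_goes_to_allow:
            best = (kind, prefix)
    if best is None:
        return True, None
    return best[0] == "allow", best


rules = [("disallow", "/blog"), ("allow", "/blog/public")]
print(resolve(rules, "/blog/public/post"))  # allowed by the longer Allow rule
print(resolve(rules, "/blog/draft"))        # blocked by Disallow: /blog
print(resolve(rules, "/about"))             # no rule matches, so allowed
```

Line order in the file plays no part here: only the length of the matching pattern and, on a tie, the rule type decide the outcome.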

Wildcards make this harder to eyeball. A line like Disallow: /admin* blocks /admin, /admin/users, and /admin-panel. A line Allow: /admin/public, wherever it sits in the same block, overrides it for that one folder because /admin/public is a longer pattern than /admin*. Our checker evaluates both and tells you which applies.

The wildcard $ anchors the end of a path. Disallow: /*.pdf$ blocks all PDF files but allows /report.pdf.html because the path does not end in .pdf. Competitors often parse $ wrong or ignore it. We match the Google implementation.

The user-agent name is case-insensitive in the spec, so User-agent: googlebot and User-agent: Googlebot are identical. Disallow paths are case-sensitive on most servers. /Admin and /admin are different URLs. Our checker respects both rules.
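One common way to implement * and $ is to translate each pattern into a regular expression. The sketch below takes that approach; it is an illustration under that assumption rather than a copy of any crawler's matcher, and the names pattern_to_regex and matches are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$', the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start of the path, which gives prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/admin*", "/admin-panel"))       # True
print(matches("/*.pdf$", "/report.pdf"))        # True
print(matches("/*.pdf$", "/report.pdf.html"))   # False: path does not end in .pdf
print(matches("/search$", "/search?q=test"))    # False: '$' forbids trailing characters
print(matches("/admin", "/Admin"))              # False: paths are case-sensitive
```

Because the regex is compiled without the IGNORECASE flag, path matching stays case-sensitive, while user-agent names would be lowercased separately before comparison.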

Sitemap validation and crawl directives

Every robots.txt file should include at least one Sitemap: line pointing to your sitemap.xml file. This tells crawlers where to find the list of URLs you want indexed. Our checker fetches every sitemap URL listed in your robots.txt and reports the HTTP status code. If a sitemap returns 404, crawlers cannot use it, and you lose a signal that helps with discovery.

Multiple sitemap declarations are valid. If you have separate sitemaps for posts, pages, and products, list all three. If you use a sitemap index that references child sitemaps, list only the index. Avoid listing every child sitemap individually because it clutters the file and duplicates information already in the index.
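A sitemap check like this has two parts: pulling the Sitemap: lines out of the file and requesting each URL. The sketch below shows both with the Python standard library. The function names are hypothetical, the example URLs are placeholders, and using a HEAD request is an assumption: some servers answer HEAD differently from GET, so a real checker may need to fall back to GET.

```python
import urllib.error
import urllib.request


def sitemap_urls(robots_txt):
    """Return every URL declared on a Sitemap: line, in file order."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")  # split on the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def http_status(url, timeout=10):
    """Return the HTTP status code for url (network errors still raise)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a sitemap that no longer exists


ROBOTS = """\
Sitemap: https://www.example.com/sitemap-posts.xml
sitemap: https://www.example.com/sitemap-pages.xml

User-agent: *
Disallow: /admin
"""

found = sitemap_urls(ROBOTS)
print(found)
# To check them for real: for url in found: print(url, http_status(url))
```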

The Crawl-delay: directive sets the minimum seconds a bot should wait between requests to your server. Googlebot ignores this directive entirely and uses its own adaptive crawl rate based on server response time. Bingbot, Yandex, and some smaller crawlers respect it. A crawl-delay of 1 second is safe. A delay of 10 or higher effectively stops most crawling on large sites. Use it only if your server cannot handle normal crawl rates.

A less common directive is Request-rate:, which sets a number of requests per time window. Few crawlers support it, and it is not part of the official spec. If you see it in a robots.txt, it is likely legacy or non-standard. Our checker notes it but does not enforce it because crawler behavior varies.
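Python's standard-library urllib.robotparser exposes both directives, which makes it a quick way to see what a file declares. The sample file below is invented for illustration, and reading a value says nothing about whether a given crawler will honor it.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: bingbot
Crawl-delay: 1
Request-rate: 3/20
Disallow: /tmp

User-agent: *
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

bing_delay = parser.crawl_delay("bingbot")     # seconds between requests
bing_rate = parser.request_rate("bingbot")     # named tuple: (requests, seconds)
other_delay = parser.crawl_delay("Googlebot")  # wildcard block sets no delay

print(bing_delay)
print(bing_rate.requests, "requests per", bing_rate.seconds, "seconds")
print(other_delay)
```

The Googlebot lookup falls through to the wildcard block, which declares no Crawl-delay, so the parser returns None there.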

Syntax errors and validation edge cases

Robots.txt syntax looks simple, but small slips change what a rule means. Each directive name (User-agent, Disallow, Allow, Sitemap, Crawl-delay) must be followed by a colon. RFC 9309 treats whitespace around the colon as optional, so Disallow:/admin and Disallow: /admin parse the same way in compliant crawlers. Older or hand-rolled parsers can be stricter, so the conventional form, with no space before the colon and one space after it, is the safest to ship. Our checker flags spacing issues and suggests fixes.

Windows line endings (\r\n instead of \n) are valid under RFC 9309, but some older parsers mishandle them. When a robots.txt file is edited on Windows and uploaded without conversion, such a parser may keep the trailing carriage return as part of the path or misread where a line ends. Our checker detects non-Unix line endings and reports them as a warning.

Comments in robots.txt start with #. Everything after the # on that line is ignored. A common error is accidentally commenting out a directive: # Disallow: /admin does nothing. If you see rules that should apply but do not, check for stray # characters.

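Blank lines conventionally separate user-agent blocks, but parsers disagree on whether a blank line ends one. Under RFC 9309 and in Google's parser, blank lines are ignored and a block runs until the next User-agent: line that follows a rule. Parsers that follow the original 1994 convention treat a blank line as the end of the block. So if you have User-agent: Googlebot, Disallow: /private, then a blank line, then Allow: /public, the first kind of parser still applies the allow rule to Googlebot, while the second kind sees an orphaned directive with no user-agent and drops it. Because the two readings differ, keep every rule directly under its User-agent: line. Our checker flags orphaned directives and suggests grouping them under the correct user-agent.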
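Several of the checks in this section are mechanical enough to sketch as a small linter. The version below is a simplified illustration with a hypothetical function name and hypothetical warning wording; it covers a byte order mark, CRLF line endings, missing colons, unknown directives, rules that appear before any User-agent line, and paths without a leading slash.

```python
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint(raw_bytes):
    """Return a list of human-readable warnings for a robots.txt payload."""
    warnings = []
    if raw_bytes.startswith(b"\xef\xbb\xbf"):
        warnings.append("file starts with a UTF-8 byte order mark")
        raw_bytes = raw_bytes[3:]
    if b"\r\n" in raw_bytes:
        warnings.append("file uses CRLF line endings")

    seen_agent = False
    text = raw_bytes.decode("utf-8", "replace")
    for number, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # commented-out rules are skipped
        if not line:
            continue
        field, sep, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if not sep:
            warnings.append(f"line {number}: missing colon")
        elif field not in KNOWN_FIELDS:
            warnings.append(f"line {number}: unknown directive '{field}'")
        elif field == "user-agent":
            seen_agent = True
        elif field in ("allow", "disallow"):
            if not seen_agent:
                warnings.append(f"line {number}: rule appears before any User-agent line")
            if value and not value.startswith(("/", "*")):
                warnings.append(f"line {number}: path should start with '/'")
    return warnings


sample = (
    b"\xef\xbb\xbfDisallow: /early\r\n"
    b"User-agent: *\r\n"
    b"Disallow admin\r\n"
    b"Disallow: admin\r\n"
    b"# Disallow: /hidden\r\n"
)
for warning in lint(sample):
    print(warning)
```

The commented-out line produces no warning, since a linter cannot know whether the comment was deliberate; spotting those still takes a human read of the file.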

Common mistakes

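  • Blocking Googlebot by accident. A User-agent: * block with Disallow: / blocks every bot, including Google. If you want Googlebot in, add a separate User-agent: Googlebot block with Allow: /. Compliant crawlers pick the block that names them regardless of where it sits in the file, but placing named blocks before the wildcard block keeps the file readable and is safer with simpler parsers.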
  • Forgetting the leading slash. Disallow: admin does nothing. It must be Disallow: /admin. Our checker flags this as a probable syntax error.
  • Testing only Googlebot. Your robots.txt might allow Google but block Bingbot or GPTBot without you noticing. Test all agents you care about, not just one.
  • Leaving out AI crawlers. If your file has no GPTBot or ClaudeBot block, those bots fall back to User-agent: *. That might allow them when you thought everything was blocked. Explicit per-agent blocks make policy unambiguous.
  • Assuming sitemap links are validated elsewhere. A sitemap URL in robots.txt can be broken, return 404, or point to an XML file that no longer exists. Our checker tests the link and reports the status code.

Advanced tips

  • Test the same path against multiple user-agents in sequence. If the result changes, your per-agent blocks are working. If it stays the same, you might be relying only on the wildcard block.
  • Check the Crawl-delay line if present. Googlebot ignores it, but Bingbot and some others respect it. A delay of 10 seconds can slow a crawl to a near-halt on large sites.
  • Look at the Sitemap lines. Multiple sitemap declarations are valid. If you have a sitemap index, list it once rather than repeating every sub-sitemap. We fetch each link and confirm it returns HTTP 200.
  • Test a path with query parameters. Disallow: /search already blocks /search?q=test because rules are prefix matches, but Disallow: /search$ would not because $ expects no trailing characters. Disallow: /search* behaves the same as Disallow: /search; the trailing star only makes the intent explicit.
  • Download the parsed output as a reference. When you regenerate robots.txt or switch CMSes, re-check against the same test paths to confirm behavior did not change.
  • Use the conflict report before deploying a new robots.txt. If two rules overlap, your local interpretation might differ from Googlebot's. Testing removes the guess.

If you need to generate a new robots.txt file from scratch with presets for WordPress, Shopify, or Next.js, use our robots.txt file generator. It includes explicit AI-crawler toggles and outputs a production-ready file with syntax guaranteed valid. After deploying, re-check it with this tool. If you want to see how Googlebot renders the page after respecting robots.txt and executing JavaScript, the Google crawler simulator shows the exact HTML and visible text a bot indexes. To confirm that every URL in your sitemap is reachable and returns 200, use the sitemap checker.

Frequently Asked Questions

What is a robots.txt file?

A robots.txt file is a plain text file at the root of your domain that tells crawlers which paths they can and can't request. It lives at exactly one location: /robots.txt. Googlebot consults it before crawling, usually from a copy it caches for up to about 24 hours, and so do Bingbot, GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and PerplexityBot. Google-Extended is a control token that Google honors in the same file. The file uses a simple grammar. You write one or more User-agent blocks, each followed by Allow and Disallow rules. A Sitemap line near the top points crawlers to your XML index so they don't have to guess at the structure. Paste any site URL into our robots.txt checker, pick a User-agent, and you'll see the parsed rules table plus which rule wins for that specific bot on that specific path. If you don't have a file yet, generate one with our robots.txt generator and the right CMS preset baked in.

What does a robots.txt checker actually test?

A real robots.txt checker does four things. It fetches the file and confirms it's reachable with a 200 status and the right content-type. It parses the syntax so you catch typos that silently break rules: misspelled directive names, missing colons, stray BOM characters at the start of the file. It resolves a specific path for a specific bot so you can answer "is /admin blocked for GPTBot right now?" without guessing. And it detects rule conflicts where two bots inherit different rules from overlapping User-agent blocks. Most free checkers stop at step one. Ours runs the full set. Set Site URL, pick the bot from User-agent, drop an optional Test path, and you get a verdict per rule. When you ship a fix, confirm the change with a second pass in the checker before you move on to other work.

Where do I find my robots.txt file?

Type your domain followed by /robots.txt into any browser. If https://www.example.com/robots.txt returns a 200 and shows plain text, you have one. If it returns 404 or your CMS homepage, you don't. The file must sit at the exact root of the domain. Subdirectory paths like /blog/robots.txt are ignored entirely by every crawler. Subdomains are separate: blog.example.com and www.example.com each need their own file at their own root. WordPress sites usually have a virtual one generated by the SEO plugin; Shopify generates one automatically and locks most of it; Next.js and Astro need you to ship a static file under /public. If you're not sure what crawlers actually see, paste your URL into our robots.txt checker and we fetch it with the exact headers a real bot sends so the result matches crawler reality. For a clean rewrite with CMS presets baked in, use the generator.

How do I fix a "blocked by robots.txt" error in Search Console?

Search Console flags "blocked by robots.txt" when a Disallow rule covers the URL Google tried to crawl. Open the URL Inspection tool to see which rule Google matched. Then run the same URL through our robots.txt checker with User-agent set to Googlebot and the blocked path pasted into Test path. The checker shows you the exact rule that matched and the User-agent block it came from, so you can fix the source instead of guessing. Three fixes cover almost every case. Remove the offending Disallow line. Narrow it with a more specific path. Or add a longer, more specific Allow rule in the same block (the longer match wins on overlap, wherever the line sits). Ship the change, test the same path again in the checker, then request indexing back in Search Console. If pages still look blocked, Google's cached copy may be in play; it refreshes robots.txt roughly every 24 hours.

Should I block AI crawlers in robots.txt?

That depends on what you're optimizing for. Block them if your content is the product: publishers, paid research, subscription archives, anything where free training data hurts the business. Allow them if you want to be cited in ChatGPT and Claude answers, where being the cited source drives referral traffic back to your site. The 2026 list worth naming explicitly: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), CCBot (Common Crawl, which trains many models), PerplexityBot, and Google-Extended (controls training usage of Googlebot-crawled pages without affecting your rankings in normal Google search). Our robots.txt generator gives you one checkbox per crawler so you decide per bot, not per everything. After you ship, test each one with our checker against a real path to confirm the rule resolves as you expect for that bot. Most bugs come from rule-order conflicts between overlapping User-agent blocks, not missing entries.

How should a robots.txt file be structured?

Start with a Sitemap line pointing to your XML index. Then group rules by User-agent. The wildcard block (User-agent: *) catches every bot not named elsewhere, so put it last. Above it, add named blocks for bots you want to treat differently: Googlebot, Bingbot, GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended. Each block can have multiple Allow and Disallow lines. The longer, more specific match wins when rules overlap. Remember that paths are case-sensitive: Disallow: /Admin does not block /admin. Declare each sitemap, or your single sitemap index, once near the top; Sitemap lines apply to the whole file rather than to one User-agent block. Keep the file under 500 KiB, because Google ignores content past that limit. Our robots.txt generator scaffolds all of this for you with a CMS preset and AI-crawler toggles. Once you publish, verify the structure parses correctly with our checker against a handful of real URLs and each named bot before you close the ticket out.

What's the difference between Disallow and noindex?

Disallow in robots.txt tells a bot not to crawl a URL. It does not tell the bot not to index it. If another site links to a Disallowed page, Google can still list the URL in search results with "no description available" underneath it. To actually keep a page out of the index, use a meta robots noindex tag on the page itself, or an X-Robots-Tag noindex header in the HTTP response. The catch: Google has to crawl the page to see the noindex tag. So if you both Disallow and noindex, noindex never takes effect and the page lingers in results. Pick one per page. Disallow is for crawl budget (blocking admin, internal search, filter URLs). Noindex is for keeping content out of results entirely. For a full page-level audit of robots directives, use our crawler simulator alongside the website metadata checker.

Does robots.txt still work in 2026?

Yes, for crawlers that choose to honor it. Googlebot, Bingbot, and the major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot, Google-Extended) all respect robots.txt as a matter of policy. Rogue scrapers ignore it because the file is a polite request, not a firewall. If you need hard blocking, add server-side rules: IP deny lists, Cloudflare bot management, rate limiting, or authentication in front of the sensitive paths. Use robots.txt for what it's good at: shaping which pages the bots you care about spend their crawl budget on. The 2026 difference is AI. Five years ago, "the bots" meant Google and Bing. Today the list is longer and each AI crawler uses a different User-agent name. Our checker tests any of them in one click so you can see exactly what each bot sees. Pair it with our crawler simulator for a rendered-page view.

Can I use wildcards in robots.txt?

Yes, two wildcards are supported and understood by all major bots. The asterisk (*) matches any sequence of characters, and the dollar sign ($) anchors the pattern to the end of a URL. Disallow: /*.pdf$ blocks every URL ending in .pdf. Disallow: /*?sort= blocks any URL with a sort parameter anywhere in it. Combine them: Disallow: /search?&page=$ blocks paginated internal search results but leaves the main search page crawlable. Wildcards don't work in User-agent lines, so you can't write User-agent: Google* and hit every Google bot. Name each one explicitly (Googlebot, Googlebot-Image, Googlebot-News). The longer literal match wins over a shorter pattern match. Test wildcard rules with a concrete path in our checker because mental models break fast with nested parameters, query strings, and overlapping patterns that look fine on paper. For a clean baseline, generate one with our generator and iterate from there with test paths.

Will robots.txt protect sensitive pages?

No. Robots.txt is a public document anyone can read at yourdomain.com/robots.txt by typing it into a browser. Listing a path there tells every crawler, every competitor, and every curious human that the path exists on your site. For staging URLs, admin panels, or private files, that's the opposite of what you want: you've just advertised them. Real protection comes from server-side controls: password authentication, IP allow lists, VPN-only access, or simply not exposing the URL on a public server at all. A noindex meta tag keeps the page out of search results if the page is reachable but you want it private from searchers. For truly hidden content, don't link to it, don't list it in sitemaps, and gate it with auth. Use robots.txt for crawl-budget shaping on pages you don't mind being public. Audit what's exposed with our metadata checker and confirm robots rules with our robots.txt checker.
