What a robots.txt checker actually does
A robots.txt checker fetches the /robots.txt file from your domain, parses every user-agent block and disallow rule, then tests whether a given path is allowed or blocked for a specific crawler. It applies the longest-matching rule when multiple patterns overlap, follows the order-of-precedence in the spec, and reports whether a test URL would be crawled.
Most crawlers look for their own user-agent block first. If one exists, they use those rules. If not, they fall back to the wildcard User-agent: * block. This means a site can allow Googlebot into /admin while blocking all other bots. Our checker simulates this cascade for any user-agent you pick from the User-agent to test dropdown.
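The fallback described above can be sketched in a few lines of Python. This is a minimal illustration, not the checker's actual code: the `groups` dictionary and the `select_group` helper are hypothetical names, and the substring match on the user-agent token is an assumption about how a crawler identifies its own block.

```python
def select_group(groups, user_agent):
    """Return the rules of the most specific matching block, else the '*' block."""
    ua = user_agent.lower()
    best_token, best_rules = None, None
    for name, rules in groups.items():
        token = name.lower()
        # A named block applies when its token appears in the crawler's user-agent.
        if token != "*" and token in ua:
            if best_token is None or len(token) > len(best_token):
                best_token, best_rules = token, rules
    if best_rules is not None:
        return best_rules
    # No named block matched: fall back to the wildcard block, if any.
    return groups.get("*", [])


# Hypothetical parsed file: Googlebot may enter /admin, everyone else may not.
groups = {
    "*": [("disallow", "/admin")],
    "Googlebot": [("allow", "/admin")],
}
googlebot_rules = select_group(groups, "Googlebot")
bingbot_rules = select_group(groups, "Bingbot")
```

Here Googlebot gets its own block, while Bingbot, which has no block of its own, receives the wildcard rules.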
Two common mistakes break robots.txt files silently. The first is syntax errors: extra spaces, missing colons, Windows line endings, or a path typed in the wrong case (directive names are case-insensitive, but paths are not). The second is conflicting rules: allow and disallow lines that overlap, leaving it ambiguous whether a path is blocked. Our checker flags both and shows which rule wins.
How to use this robots.txt checker
- Paste your full domain into Site URL. We fetch yourdomain.com/robots.txt automatically. No need to type /robots.txt.
- Pick a User-agent to test from the dropdown: Googlebot, Bingbot, GPTBot, ClaudeBot, CCBot, PerplexityBot, anthropic-ai, or * for wildcard. This is the crawler identity we simulate.
- Paste a path into Test path if you want to check a specific URL. Leave it empty to see the full parsed ruleset. A path looks like /admin or /blog/post-slug.
- Hit Check robots.txt. You get the parsed file, per-agent rules, sitemap links, crawl-delay if set, and a verdict for your test path.
- Expand Rule conflicts if any rows are flagged. We show overlapping allow/disallow lines and tell you which one a real crawler would follow.
Try testing yourdomain.com with User-agent set to GPTBot and Test path set to /blog. If your robots.txt has no GPTBot block but disallows all bots from /admin, the blog is allowed and admin is blocked. Switch the user-agent to ClaudeBot and the result might change if you have a ClaudeBot-specific block.
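For a rough local equivalent of that test, Python's standard-library `urllib.robotparser` can evaluate the same scenario. This is only a sketch: the file contents and the `verdict` helper are made up for illustration, and `urllib.robotparser` applies rules in file order rather than by longest match, so it can disagree with Googlebot on files with overlapping rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: no GPTBot block, all bots kept out of /admin.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
"""


def verdict(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


blog_allowed = verdict(ROBOTS_TXT, "GPTBot", "/blog")
admin_allowed = verdict(ROBOTS_TXT, "GPTBot", "/admin")

# Adding a ClaudeBot-specific block changes the answer for that agent only.
WITH_CLAUDEBOT = ROBOTS_TXT + "\nUser-agent: ClaudeBot\nDisallow: /\n"
claudebot_blog_allowed = verdict(WITH_CLAUDEBOT, "ClaudeBot", "/blog")
gptbot_blog_still_allowed = verdict(WITH_CLAUDEBOT, "GPTBot", "/blog")
```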
Why testing per user-agent matters
Search crawlers are not the only bots that read robots.txt anymore. AI training crawlers (GPTBot from OpenAI, ClaudeBot from Anthropic, CCBot from Common Crawl, PerplexityBot, and Google-Extended) now read robots.txt to decide whether they can scrape your content for model training. If you block them, compliant crawlers keep your pages out of training datasets. If you allow them, you opt in.
Three practical consequences.
Policy clarity. A robots.txt that says User-agent: * / Disallow: / blocks everyone, including Google. If that is not your intent, you need separate blocks per agent. Testing per user-agent reveals what each bot sees before a model trains on your content.
AI crawler control. In 2026, most site owners want search bots in but training bots out. That requires explicit GPTBot, ClaudeBot, and CCBot disallow blocks. Competitors ignore these agents. We test them by default because they matter.
Conflict detection. When you have both Disallow: /blog and Allow: /blog/public, the more specific rule wins. But hand-parsing which rule is longer or more specific is error-prone. Testing shows you exactly what a bot would do, not what you think the file says.
Rule precedence and wildcards
The robots.txt spec defines a precedence order when multiple rules match the same path. The rule with the longest matching prefix wins. If two rules have the same length, the allow rule wins over disallow.
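As a minimal sketch of that precedence rule, assuming plain prefix patterns with no wildcards, the decision can be written as a comparison on (length, is-allow) pairs. The `decide` function and the rule tuples are hypothetical names used only for illustration.

```python
def decide(rules, path):
    """Return True if path is allowed.

    rules is a list of (kind, prefix) pairs, kind being "allow" or "disallow".
    Longest matching prefix wins; on a tie in length, allow beats disallow.
    """
    best = None
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        # Tuples compare by length first, then True (allow) > False (disallow).
        candidate = (len(prefix), kind == "allow")
        if best is None or candidate > best:
            best = candidate
    # With no matching rule, the path is allowed by default.
    return True if best is None else best[1]


rules = [("disallow", "/blog"), ("allow", "/blog/public")]
public_allowed = decide(rules, "/blog/public/post")
draft_allowed = decide(rules, "/blog/draft")
tie_allowed = decide([("disallow", "/page"), ("allow", "/page")], "/page")
```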
Wildcards make this harder to eyeball. A line like Disallow: /admin* blocks /admin, /admin/users, and /admin-panel. An Allow: /admin/public line overrides it for that one folder because /admin/public is longer than /admin*. Our checker evaluates both and tells you which applies.
The $ character anchors the end of a path. Disallow: /*.pdf$ blocks all PDF files but allows /report.pdf.html because the path does not end in .pdf. Competitors often parse $ wrong or ignore it. We match the Google implementation.
The user-agent name is case-insensitive in the spec, so User-agent: googlebot and User-agent: Googlebot are identical. Rule paths are matched case-sensitively, so /Admin and /admin are different URLs. Our checker respects both rules.
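One way to reason about * and $ is to translate each pattern into a regular expression. The sketch below assumes the common interpretation in which * matches any run of characters and a trailing $ pins the end of the path; `pattern_to_regex` and `matches` are hypothetical helper names, not part of any crawler's published code.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces, then rejoin them with ".*" wherever a star appeared.
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    # \Z matches only at the very end of the string.
    return re.compile(regex + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the pattern matches the path starting from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


pdf_blocked = matches("/*.pdf$", "/report.pdf")
pdf_html_blocked = matches("/*.pdf$", "/report.pdf.html")
admin_panel_blocked = matches("/admin*", "/admin-panel")
```

Without the trailing $, a pattern behaves as a prefix match, so /admin* and /admin match the same set of paths.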
Sitemap validation and crawl directives
Every robots.txt file should include at least one Sitemap: line pointing to your sitemap.xml file. This tells crawlers where to find the list of URLs you want indexed. Our checker fetches every sitemap URL listed in your robots.txt and reports the HTTP status code. If a sitemap returns 404, crawlers cannot use it, and you lose a signal that helps with discovery.
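A sitemap check of this kind can be approximated with the standard library. The sketch below is an assumption about one reasonable approach, not the checker's implementation: `extract_sitemaps` and `sitemap_status` are hypothetical names, the URLs are placeholders, and some servers reject HEAD requests, in which case a GET would be needed.

```python
import urllib.error
import urllib.request


def extract_sitemaps(robots_txt):
    """Collect the URLs from every Sitemap: line, ignoring trailing comments."""
    urls = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colon stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def sitemap_status(url, timeout=10):
    """Return the HTTP status code for url (network errors are not handled here)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


SAMPLE = """\
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/posts.xml  # posts only
"""
sitemaps = extract_sitemaps(SAMPLE)
```

Calling `sitemap_status` on each extracted URL would then report 200 for a reachable sitemap and 404 for a missing one.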
Multiple sitemap declarations are valid. If you have separate sitemaps for posts, pages, and products, list all three. If you use a sitemap index that references child sitemaps, list only the index. Avoid listing every child sitemap individually because it clutters the file and duplicates information already in the index.
The Crawl-delay: directive sets the minimum seconds a bot should wait between requests to your server. Googlebot ignores this directive entirely and uses its own adaptive crawl rate based on server response time. Bingbot, Yandex, and some smaller crawlers respect it. A crawl-delay of 1 second is safe. A delay of 10 or higher effectively stops most crawling on large sites. Use it only if your server cannot handle normal crawl rates.
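The effect of a large delay is easy to estimate with simple arithmetic. The sketch below assumes a single crawler that fetches pages one at a time and waits the full delay between requests; the 100,000-URL site is a hypothetical figure chosen for illustration.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(url_count, crawl_delay_seconds):
    """Days needed to fetch url_count pages with a fixed wait between requests."""
    return url_count * crawl_delay_seconds / SECONDS_PER_DAY


one_second = crawl_days(100_000, 1)    # a little over one day
ten_seconds = crawl_days(100_000, 10)  # roughly eleven and a half days
```

Under these assumptions, raising the delay from 1 to 10 seconds stretches a full pass over the site from about a day to well over a week.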
A less common directive is Request-rate:, which sets a number of requests per time window. Few crawlers support it, and it is not part of the official spec. If you see it in a robots.txt, it is likely legacy or non-standard. Our checker notes it but does not enforce it because crawler behavior varies.
Syntax errors and validation edge cases
Robots.txt syntax is easy to get subtly wrong. Each directive name (User-agent, Disallow, Allow, Sitemap, Crawl-delay) must be followed by a colon. The conventional form has no space before the colon and one space after it, as in Disallow: /admin. RFC 9309 treats whitespace around the colon as optional, so Disallow:/admin is valid for compliant parsers, but stricter or older parsers may reject it. Our checker flags spacing issues and suggests fixes.
Windows line endings (\r\n instead of \n) can cause problems with some parsers. The spec accepts them, but when a robots.txt file is edited on Windows and uploaded without conversion, less tolerant bots may misread line breaks and treat multiple lines as one. Our checker detects non-Unix line endings and reports them as a warning.
Comments in robots.txt start with #. Everything after the # on that line is ignored. A common error is accidentally commenting out a directive: # Disallow: /admin does nothing. If you see rules that should apply but do not, check for stray # characters.
Blank lines conventionally separate user-agent blocks. Under the strict reading, a blank line ends the current block, and the next User-agent: starts a new one. If you have User-agent: Googlebot, Disallow: /private, then a blank line, then Allow: /public, the allow rule no longer applies to Googlebot: it starts a new block with no user-agent, which is invalid. Parsers that follow RFC 9309, including Google's, ignore blank lines instead, so the same file can behave differently from crawler to crawler. Our checker flags orphaned directives and suggests grouping them under the correct user-agent.
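Several of the checks in this section can be sketched as a small linter. This is an illustrative assumption about how such checks might look, not the checker's real code: the `lint` function, the warning wording, and the strict reading in which a blank line closes the current block are all choices made for this example.

```python
KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint(raw_text):
    """Return a list of warnings for common robots.txt mistakes."""
    warnings = []
    if "\r\n" in raw_text:
        warnings.append("file uses Windows (CRLF) line endings")
    in_group = False  # True while inside a User-agent block
    for number, raw in enumerate(raw_text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            in_group = False  # strict reading: a blank line ends the block
            continue
        if line.startswith("#"):
            # Detect a directive that was commented out, e.g. "# Disallow: /admin".
            name = line.lstrip("# ").partition(":")[0].strip().lower()
            if name in KNOWN:
                warnings.append(f"line {number}: directive is commented out")
            continue
        key, sep, value = line.split("#", 1)[0].partition(":")
        key, value = key.strip().lower(), value.strip()
        if not sep:
            warnings.append(f"line {number}: missing colon")
        elif key == "user-agent":
            in_group = True
        elif key in ("allow", "disallow"):
            if not in_group:
                warnings.append(f"line {number}: rule is outside any User-agent block")
            if value and not value.startswith(("/", "*")):
                warnings.append(f"line {number}: path has no leading slash")
    return warnings


SAMPLE = (
    "User-agent: Googlebot\r\n"
    "Disallow: /private\r\n"
    "\r\n"
    "Allow: /public\r\n"
    "# Disallow: /admin\r\n"
    "Disallow: admin\r\n"
)
report = lint(SAMPLE)
```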
Common mistakes
- Blocking Googlebot by accident. A User-agent: * block with Disallow: / blocks every bot, including Google. If you want Googlebot in, add a separate User-agent: Googlebot block with Allow: /. Crawlers that follow the spec pick the most specific matching block wherever it appears in the file, so placing it before the wildcard block is a readability convention rather than a requirement.
- Forgetting the leading slash. Disallow: admin does nothing. It must be Disallow: /admin. Our checker flags this as a probable syntax error.
- Testing only Googlebot. Your robots.txt might allow Google but block Bingbot or GPTBot without you noticing. Test all agents you care about, not just one.
- Leaving out AI crawlers. If your file has no GPTBot or ClaudeBot block, those bots fall back to User-agent: *. That might allow them when you thought everything was blocked. Explicit per-agent blocks make policy unambiguous.
- Assuming sitemap links are validated elsewhere. A sitemap URL in robots.txt can be broken, return 404, or point to an XML file that no longer exists. Our checker tests the link and reports the status code.
Advanced tips
- Test the same path against multiple user-agents in sequence. If the result changes, your per-agent blocks are working. If it stays the same, you might be relying only on the wildcard block.
- Check the Crawl-delay line if present. Googlebot ignores it, but Bingbot and some others respect it. A delay of 10 seconds can slow a crawl to a near-halt on large sites.
- Look at the Sitemap lines. Multiple sitemap declarations are valid. If you have a sitemap index, list it once rather than repeating every sub-sitemap. We fetch each link and confirm it returns HTTP 200.
- Test a path with query parameters. Disallow: /search blocks /search?q=test because rules match by prefix, but Disallow: /search$ would not because $ expects no trailing characters. If you want to make the query-string match explicit, use the star: Disallow: /search*.
- Download the parsed output as a reference. When you regenerate robots.txt or switch CMSes, re-check against the same test paths to confirm behavior did not change.
- Use the conflict report before deploying a new robots.txt. If two rules overlap, your local interpretation might differ from Googlebot's. Testing removes the guess.
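The multi-agent and re-check tips above lend themselves to a small regression script. The sketch below uses Python's standard-library `urllib.robotparser`; the `verdicts` and `changed` helpers and both sample files are hypothetical, and because this parser applies rules in file order rather than by longest match, treat it as a rough cross-check rather than a Googlebot-accurate verdict.

```python
from urllib.robotparser import RobotFileParser


def verdicts(robots_txt, agents, paths):
    """Map every (agent, path) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


def changed(old_txt, new_txt, agents, paths):
    """List the (agent, path) pairs whose verdict differs between two files."""
    before = verdicts(old_txt, agents, paths)
    after = verdicts(new_txt, agents, paths)
    return sorted(key for key in before if before[key] != after[key])


# Hypothetical old and new files: the new one adds a GPTBot block.
OLD = "User-agent: *\nDisallow: /admin\n"
NEW = "User-agent: *\nDisallow: /admin\n\nUser-agent: GPTBot\nDisallow: /\n"
diff = changed(OLD, NEW, ["Googlebot", "GPTBot"], ["/blog", "/admin"])
```

An empty result means the new file behaves the same as the old one for every pair tested; here only GPTBot's access to /blog changes.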
If you need to generate a new robots.txt file from scratch with presets for WordPress, Shopify, or Next.js, use our robots.txt file generator. It includes explicit AI-crawler toggles and outputs a production-ready file with syntax guaranteed valid. After deploying, re-check it with this tool. If you want to see how Googlebot renders the page after respecting robots.txt and executing JavaScript, the Google crawler simulator shows the exact HTML and visible text a bot indexes. To confirm that every URL in your sitemap is reachable and returns 200, use the sitemap checker.