Question 1

What is a robots.txt file?

Accepted Answer

A robots.txt file is a plain text file at the root of your domain that tells crawlers which paths they can and can't request. It lives at exactly one location per host: /robots.txt. Googlebot consults it before crawling (working from a cached copy that it refreshes roughly every 24 hours), and so do Bingbot, GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), and PerplexityBot. Google-Extended is a control token for training usage rather than a separate crawler, but you set it in the same file. The file uses a simple grammar. You write one or more User-agent blocks, each followed by Allow and Disallow rules. A Sitemap line near the top points crawlers to your XML index so they don't have to guess at the structure. Paste any site URL into our robots.txt checker, pick a User-agent, and you'll see the parsed rules table plus which rule wins for that specific bot on that specific path. If you don't have a file yet, generate one with our robots.txt generator and the right CMS preset baked in.
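
To make the grammar concrete, here is a minimal sketch using Python's standard urllib.robotparser. The file contents and example.com URLs are made up for illustration. Note that the standard-library parser is a simplified reader: it may not resolve overlapping Allow/Disallow rules or wildcards the way Googlebot does, so treat it as a quick sanity check rather than a full checker.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents, for illustration only.
ROBOTS_TXT = """\
Sitemap: https://www.example.com/sitemap.xml

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# GPTBot has its own block that disallows everything.
gptbot_blog = parser.can_fetch("GPTBot", "https://www.example.com/blog/")

# Googlebot is not named, so it falls through to the wildcard block.
googlebot_blog = parser.can_fetch("Googlebot", "https://www.example.com/blog/")
googlebot_admin = parser.can_fetch("Googlebot", "https://www.example.com/admin/users")

# Sitemap lines are collected separately from the User-agent blocks.
sitemaps = parser.site_maps()
```
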

Question 2

What does a robots.txt checker actually test?

Accepted Answer

A real robots.txt checker does four things. It fetches the file and confirms it's reachable with a 200 status and a text/plain content-type. It parses the syntax so you catch typos that silently break rules: misspelled directive names, missing colons, stray BOM characters at the start of the file. It resolves a specific path for a specific bot so you can answer "is /admin blocked for GPTBot right now?" without guessing. And it detects conflicts where overlapping User-agent blocks leave two bots following different rules than you intended. Most free checkers stop at step one. Ours runs the full set. Set Site URL, pick the bot from User-agent, drop an optional Test path, and you get a verdict per rule. When you ship a fix, confirm the change with a second pass in the checker before you move on to other work.
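
The syntax step can be sketched in a few lines of Python. This is an illustrative linter, not the checker's actual implementation; the function name and the set of accepted directives (including the non-standard Crawl-delay) are assumptions.

```python
# Directive names this sketch accepts. Crawl-delay is non-standard and is
# included here only as an assumption about what a site might use.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> list[str]:
    """Return human-readable problems found in robots.txt text."""
    problems = []
    if text.startswith("\ufeff"):
        problems.append("line 1: file starts with a byte order mark (BOM)")
        text = text.lstrip("\ufeff")
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing colon in {line!r}")
            continue
        name = line.split(":", 1)[0].strip().lower()
        if name not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive {name!r}")
    return problems


# Hypothetical broken file: BOM, a missing colon, and a misspelled directive.
sample = "\ufeffUser-agent: *\nDisallow /admin\nDissallow: /tmp\nAllow: /\n"
report = lint_robots(sample)
```

Each entry in report names the line and the problem, which is the minimum a reader needs to fix the file by hand.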

Question 3

Where do I find my robots.txt file?

Accepted Answer

Type your domain followed by /robots.txt into any browser. If https://www.example.com/robots.txt returns a 200 and shows plain text, you have one. If it returns 404 or your CMS homepage, you don't. The file must sit at the exact root of the domain. Subdirectory paths like /blog/robots.txt are ignored entirely by every crawler. Subdomains are separate: blog.example.com and www.example.com each need their own file at their own root. WordPress sites usually have a virtual one generated by the SEO plugin; Shopify generates one automatically and locks most of it; Next.js and Astro need you to ship a static file under /public. If you're not sure what crawlers actually see, paste your URL into our robots.txt checker and we fetch it with the exact headers a real bot sends so the result matches crawler reality. For a clean rewrite with CMS presets baked in, use the generator.
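
As a small illustration of the "exact root, per subdomain" rule, this sketch derives the robots.txt location from any page URL. The helper name robots_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any subdomain); drop path and query.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


www_file = robots_url("https://www.example.com/blog/post?x=1")
blog_file = robots_url("https://blog.example.com/a/b")
```

The two results differ only in host, which is why each subdomain needs its own file.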

Question 4

How do I fix a "blocked by robots.txt" error in Search Console?

Accepted Answer

Search Console flags "blocked by robots.txt" when a Disallow rule covers the URL Google tried to crawl. Open the URL Inspection tool to see which rule Google matched. Then run the same URL through our robots.txt checker with User-agent set to Googlebot and the blocked path pasted into Test path. The checker shows you the exact rule that matched and the User-agent block it came from, so you can fix the source instead of guessing. Three fixes cover almost every case. Remove the offending Disallow line. Narrow it with a more specific path. Or add a longer, more specific Allow rule in the same block (for Googlebot the longest match wins on overlap, regardless of line order). Ship the change, test the same path again in the checker, then request indexing back in Search Console. If pages still look blocked, Google's cached copy may be in play; it refreshes robots.txt roughly every 24 hours.
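
The "longest match wins" rule can be sketched as follows. This is a simplified model that handles plain path prefixes only (no wildcards) and assumes Google's documented tie-break, where Allow wins when two matching rules have the same length. The function names and sample rules are hypothetical.

```python
from typing import Optional

Rule = tuple[str, str]  # (directive, path prefix), directive is "allow" or "disallow"


def winning_rule(rules: list[Rule], path: str) -> Optional[Rule]:
    """Return the matching rule with the longest prefix, or None if nothing matches."""
    best: Optional[Rule] = None
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        longer = best is None or len(prefix) > len(best[1])
        tie_goes_to_allow = (
            best is not None and len(prefix) == len(best[1]) and directive == "allow"
        )
        if longer or tie_goes_to_allow:
            best = (directive, prefix)
    return best


def is_allowed(rules: list[Rule], path: str) -> bool:
    """A path is crawlable unless the winning rule is a Disallow."""
    rule = winning_rule(rules, path)
    return rule is None or rule[0] == "allow"


# Hypothetical block: a broad Disallow narrowed by a longer Allow.
rules = [("disallow", "/docs/"), ("allow", "/docs/public/")]
public_ok = is_allowed(rules, "/docs/public/a.html")
private_ok = is_allowed(rules, "/docs/private/a.html")
```

Reversing the order of the two rules gives the same verdicts, because only match length decides the winner in this model.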

Question 5

Should I block AI crawlers in robots.txt?

Accepted Answer

That depends on what you're optimizing for. Block them if your content is the product: publishers, paid research, subscription archives, anything where free training data hurts the business. Allow them if you want to be cited in ChatGPT and Claude answers, where being the cited source drives referral traffic back to your site. The 2026 list worth naming explicitly: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), CCBot (Common Crawl, which trains many models), PerplexityBot, and Google-Extended (controls training usage of Googlebot-crawled pages without affecting your rankings in normal Google search). Our robots.txt generator gives you one checkbox per crawler so you decide bot by bot instead of all at once. After you ship, test each one with our checker against a real path to confirm the rule resolves as you expect for that bot. Most bugs come from conflicts between overlapping User-agent blocks, not missing entries.
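
A per-bot decision maps naturally onto one block per token. The sketch below generates those blocks from a set of tokens to block; the function name is hypothetical and the token list is the one named above. An empty Disallow value means "nothing is disallowed" for that bot.

```python
# Tokens taken from the list in the answer above.
AI_CRAWLERS = [
    "GPTBot",
    "ClaudeBot",
    "anthropic-ai",
    "CCBot",
    "PerplexityBot",
    "Google-Extended",
]


def ai_blocks(blocked: set[str]) -> str:
    """Render one User-agent block per AI crawler token."""
    lines = []
    for token in AI_CRAWLERS:
        lines.append(f"User-agent: {token}")
        # "Disallow: /" blocks the whole site; an empty value allows everything.
        lines.append("Disallow: /" if token in blocked else "Disallow:")
        lines.append("")  # blank line between blocks
    return "\n".join(lines)


# Hypothetical policy: block the two training-focused crawlers, allow the rest.
output = ai_blocks({"GPTBot", "CCBot"})
```
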

Question 6

How should a robots.txt file be structured?

Accepted Answer

Start with a Sitemap line pointing to your XML index. Then group rules by User-agent. The wildcard block (User-agent: *) catches every bot not named elsewhere, so put it last. Above it, add named blocks for bots you want to treat differently: Googlebot, Bingbot, GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended. Each block can have multiple Allow and Disallow lines. The longer, more specific match wins when rules overlap. Remember that paths are case-sensitive: Disallow: /Admin does not block /admin. List one Sitemap index per domain, declared once near the top. Keep the file under 500 KiB, because Google ignores content past that limit. Our robots.txt generator scaffolds all of this for you with a CMS preset and AI-crawler toggles. Once you publish, verify the structure parses correctly with our checker against a handful of real URLs and each named bot before you close the ticket out.
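
One consequence of this structure is easy to miss: a bot with its own named block follows only that block and does not also inherit the wildcard rules. A minimal sketch of that selection step, with a hypothetical function name and made-up rules:

```python
def select_group(groups: dict[str, list[str]], bot: str) -> list[str]:
    """Pick the rules a bot follows: its own named block, else the wildcard block."""
    wanted = bot.lower()
    for token, rules in groups.items():
        if token.lower() == wanted:  # User-agent tokens compare case-insensitively
            return rules
    return groups.get("*", [])


# Hypothetical parsed file: one named block plus the wildcard block.
groups = {
    "GPTBot": ["Disallow: /"],
    "*": ["Disallow: /admin/", "Disallow: /search"],
}

gptbot_rules = select_group(groups, "gptbot")
bingbot_rules = select_group(groups, "Bingbot")
```

In this example GPTBot gets only its own single rule, so anything you want applied to a named bot has to be repeated inside its block.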

Question 7

What's the difference between Disallow and noindex?

Accepted Answer

Disallow in robots.txt tells a bot not to crawl a URL. It does not tell the bot not to index it. If another site links to a Disallowed page, Google can still list the URL in search results with "no description available" underneath it. To actually keep a page out of the index, use a meta robots noindex tag on the page itself, or an X-Robots-Tag noindex header in the HTTP response. The catch: Google has to crawl the page to see the noindex tag. So if you both Disallow and noindex, noindex never takes effect and the page lingers in results. Pick one per page. Disallow is for crawl budget (blocking admin, internal search, filter URLs). Noindex is for keeping content out of results entirely. For a full page-level audit of robots directives, use our crawler simulator alongside the website metadata checker.
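
The Disallow-plus-noindex trap can be expressed as a small decision function. This is a sketch under simplifying assumptions: it only reads the X-Robots-Tag header (not the meta tag), ignores bot-specific prefixes in the header value, and uses hypothetical function names.

```python
def header_has_noindex(headers: dict[str, str]) -> bool:
    """Check response headers for an X-Robots-Tag that includes noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = [part.strip().lower() for part in value.split(",")]
            if "noindex" in directives:
                return True
    return False


def diagnose(disallowed: bool, noindex: bool) -> str:
    """Describe what a compliant crawler can do with the page."""
    if disallowed and noindex:
        return "conflict: the crawler cannot fetch the page, so it never sees noindex"
    if disallowed:
        return "not crawled, but the URL can still be indexed from external links"
    if noindex:
        return "crawled and kept out of the index"
    return "crawled and eligible for indexing"


# Hypothetical response headers for a page that is also Disallowed.
headers = {"Content-Type": "text/html", "X-Robots-Tag": "noindex, nofollow"}
verdict = diagnose(disallowed=True, noindex=header_has_noindex(headers))
```
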

Question 8

Does robots.txt still work in 2026?

Accepted Answer

Yes, for crawlers that choose to honor it. Googlebot, Bingbot, and the major AI crawlers (GPTBot, ClaudeBot, PerplexityBot, CCBot, Google-Extended) all respect robots.txt as a matter of policy. Rogue scrapers ignore it because the file is a polite request, not a firewall. If you need hard blocking, add server-side rules: IP deny lists, Cloudflare bot management, rate limiting, or authentication in front of the sensitive paths. Use robots.txt for what it's good at: shaping which pages the bots you care about spend their crawl budget on. The 2026 difference is AI. Five years ago, "the bots" meant Google and Bing. Today the list is longer and each AI crawler uses a different User-agent name. Our checker tests any of them in one click so you can see exactly what each bot sees. Pair it with our crawler simulator for a rendered-page view.

Question 9

Can I use wildcards in robots.txt?

Accepted Answer

Yes, two wildcards are supported and understood by all major bots. The asterisk (*) matches any sequence of characters, and the dollar sign ($) anchors the pattern to the end of a URL. Disallow: /*.pdf$ blocks every URL ending in .pdf. Disallow: /*?sort= blocks any URL with a sort parameter at the start of its query string. The asterisk also works mid-pattern: Disallow: /search?*&page= blocks paginated internal search results (any /search URL with a page parameter after another parameter) but leaves the main search page crawlable. Don't end that rule with $, or it would only match URLs that end in a bare &page=. Wildcards don't work in User-agent lines, so you can't write User-agent: Google* and hit every Google bot. Name each one explicitly (Googlebot, Googlebot-Image, Googlebot-News). The longer literal match wins over a shorter pattern match. Test wildcard rules with a concrete path in our checker because mental models break fast with nested parameters, query strings, and overlapping patterns that look fine on paper. For a clean baseline, generate one with our generator and iterate from there with test paths.
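
One way to reason about the two wildcards is to translate a rule path into a regular expression: * becomes "any characters" and a trailing $ becomes an end anchor, while everything else is matched literally from the start of the path. The sketch below does that; the function names and test paths are illustrative, not any crawler's actual implementation.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where each "*" was.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """True if the rule pattern covers the path (patterns match from the start)."""
    return pattern_to_regex(pattern).match(path) is not None


pdf_hit = matches("/*.pdf$", "/files/report.pdf")
pdf_with_query = matches("/*.pdf$", "/files/report.pdf?download=1")
sort_hit = matches("/*?sort=", "/shop/shoes?sort=price")
plain_prefix = matches("/admin", "/admin/users")
```

The second case is the one that usually surprises people: the $ anchor means a .pdf URL with a query string appended is not covered by the rule.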

Question 10

Will robots.txt protect sensitive pages?

Accepted Answer

No. Robots.txt is a public document anyone can read at yourdomain.com/robots.txt by typing it into a browser. Listing a path there tells every crawler, every competitor, and every curious human that the path exists on your site. For staging URLs, admin panels, or private files, that's the opposite of what you want: you've just advertised them. Real protection comes from server-side controls: password authentication, IP allow lists, VPN-only access, or simply not exposing the URL on a public server at all. A noindex meta tag keeps a reachable page out of search results, but it only hides the page from searchers; it does not restrict access. For truly hidden content, don't link to it, don't list it in sitemaps, and gate it with auth. Use robots.txt for crawl-budget shaping on pages you don't mind being public. Audit what's exposed with our metadata checker and confirm robots rules with our robots.txt checker.
