What a robots.txt generator actually does
A robots.txt generator writes the plain-text directives that crawlers read before accessing any page on your domain. It structures user-agent blocks, disallow and allow rules, crawl-delay instructions, and sitemap declarations in the exact order and syntax that bots expect. It prevents syntax errors, ensures the wildcard block comes after specific agents, and validates that every disallow path starts with a forward slash.
Most sites need three blocks. One for search bots like Googlebot and Bingbot with broad access. One wildcard block for all other crawlers with restricted access. And separate blocks for AI training bots that you want to allow or deny explicitly. Our generator handles all three and lets you toggle each AI bot independently because their policies differ.
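Two principles guide a good robots.txt. First, explicit beats implicit. If you want GPTBot blocked but ClaudeBot allowed, list both by name instead of relying on the wildcard. Second, specific rules beat general rules. If you disallow /admin for everyone but allow it for Googlebot, give Googlebot its own block and put it before the wildcard block. Spec-compliant crawlers follow the most specific matching user-agent block wherever it appears, but listing named bots first keeps the file readable and safe for simpler parsers. Our generator enforces both.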
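The block layout described above can be sketched in a few lines of Python. This is a minimal illustration, not our generator's actual code: the function names `render_group` and `build_robots_txt` and their parameters are hypothetical, and the sketch covers only a single Allow rule for search bots and a full-site Disallow for each blocked AI bot.

```python
def render_group(agents, rules):
    """Render one group: User-agent lines followed by (directive, path) rules."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines += [f"{directive}: {path}" for directive, path in rules]
    return "\n".join(lines)


def build_robots_txt(search_bots, blocked_ai_bots, wildcard_disallow, sitemaps=()):
    """Assemble the file: sitemaps, search bots, AI bots, then the wildcard."""
    sections = []
    if sitemaps:
        sections.append("\n".join(f"Sitemap: {url}" for url in sitemaps))
    if search_bots:
        sections.append(render_group(search_bots, [("Allow", "/")]))
    # One explicit group per blocked AI bot, so the policy is stated by name.
    for bot in blocked_ai_bots:
        sections.append(render_group([bot], [("Disallow", "/")]))
    # The wildcard group goes last, after every named agent.
    wildcard_rules = [("Disallow", path) for path in wildcard_disallow]
    sections.append(render_group(["*"], wildcard_rules or [("Allow", "/")]))
    return "\n\n".join(sections) + "\n"


robots_txt = build_robots_txt(
    search_bots=["Googlebot", "Bingbot"],
    blocked_ai_bots=["GPTBot"],
    wildcard_disallow=["/admin"],
    sitemaps=["https://www.yourdomain.com/sitemap.xml"],
)
print(robots_txt)
```

Keeping each group as a list of (directive, path) pairs makes the ordering rule easy to enforce: the wildcard group is always appended last.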
How to use this robots.txt generator
- Pick a CMS preset from the dropdown. WordPress blocks /wp-admin, /wp-includes, *.php$, and readme.html by default. Shopify blocks /admin, /cart, and /checkout. Next.js blocks /_next/static and /api. Astro / static gives you a minimal file. Strapi blocks /admin and /api. Custom starts blank.
- Paste your sitemap URL into Sitemap URL. This becomes the Sitemap: line at the top. If you have multiple sitemaps or a sitemap index, add one per line in the expanded field.
- Check or uncheck each AI crawler in Block AI crawlers. Checked means we add a user-agent block with Disallow: / for that bot. Unchecked means it falls back to the wildcard block. Options are GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), PerplexityBot, anthropic-ai, and Google-Extended.
- Paste additional paths to block into Disallow paths, one per line. Examples: /admin, /checkout, /cart, /private. These apply to the wildcard User-agent: * block.
- Hit Generate robots.txt. You get syntax-highlighted output, a download button, and a Test path box so you can check whether a URL would be blocked before deploying.
- Click Check robots.txt at the bottom to run the file through our validator and see how each user-agent interprets your rules.
Try selecting the WordPress preset, leaving all AI crawlers checked, and adding /staging to the disallow list. The output allows Googlebot and Bingbot full access, blocks GPTBot, ClaudeBot, CCBot, PerplexityBot, anthropic-ai, and Google-Extended from the entire site, and blocks all other bots from /wp-admin, /wp-includes, readme.html, /staging, and PHP files.
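One way to sanity-check a file like this outside the tool is Python's standard-library `urllib.robotparser`. The sketch below uses a trimmed version of the example: only one AI bot is shown (the others would follow the same shape), and the PHP wildcard rule is left out because the standard-library parser treats paths as plain prefixes and does not interpret `*` or `$`. The bot name `SomeOtherBot` and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the WordPress example, without wildcard patterns.
EXAMPLE = """\
User-agent: Googlebot
User-agent: Bingbot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin
Disallow: /staging
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

googlebot_staging = parser.can_fetch("Googlebot", "/staging/draft")   # True
gptbot_blog = parser.can_fetch("GPTBot", "/blog/post-1")              # False
other_staging = parser.can_fetch("SomeOtherBot", "/staging/draft")    # False
other_blog = parser.can_fetch("SomeOtherBot", "/blog/post-1")         # True

print(googlebot_staging, gptbot_blog, other_staging, other_blog)
```

Because the standard-library parser is simpler than a production crawler, treat its answers as a first pass rather than a guarantee of how any specific bot behaves.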
Why AI crawler toggles matter in 2026
The robots.txt convention dates from 1994, long before LLMs existed. In 2023 and 2024, AI companies introduced dedicated crawlers such as GPTBot and ClaudeBot that collect web content to train language models, joining the older CCBot, whose Common Crawl dataset is widely used for training. As of 2026, the major AI vendors say their crawlers respect robots.txt. If your file says User-agent: GPTBot / Disallow: /, OpenAI's crawler skips your site. If no rule applies to it, it crawls by default.
Three practical decisions every site makes now.
Allow search bots, block training bots. Most sites want Google and Bing to index content for search but do not want that content feeding LLM training datasets. This requires separate blocks: User-agent: Googlebot / Allow: / and User-agent: GPTBot / Disallow: /. The wildcard alone does not distinguish them.
Selectively allow certain AI bots. Some publishers allow ClaudeBot and block GPTBot, for example because they prefer one vendor's attribution or usage terms. Others allow Google-Extended, the token that controls whether content is used for Google's Gemini models (formerly Bard), but block CCBot. Per-bot control is the only way to implement a nuanced policy.
Make the policy explicit. If your robots.txt has no GPTBot block, GPTBot follows the wildcard block, and if nothing there stops it, it crawls. Explicit blocks put your intent on record, which helps if a dispute about consent arises later. We include toggles for six major AI crawlers as of April 2026. That list will grow.
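The fallback behavior behind these decisions can be sketched as a small lookup. This is a simplified model, assuming exact case-insensitive matching on the user-agent token; real crawlers may match product tokens more loosely. The function name `select_group` and the `groups` mapping are hypothetical.

```python
def select_group(groups, agent):
    """Return the rules a crawler would follow, or None if no group applies.

    `groups` maps a user-agent token to its list of rule strings.
    A named group wins over the wildcard; None means "no rules, crawl freely".
    """
    by_token = {token.lower(): rules for token, rules in groups.items()}
    if agent.lower() in by_token:
        return by_token[agent.lower()]
    return by_token.get("*")


groups = {
    "Googlebot": ["Allow: /"],
    "GPTBot": ["Disallow: /"],
    "*": ["Disallow: /admin"],
}

print(select_group(groups, "GPTBot"))     # ['Disallow: /']
print(select_group(groups, "ClaudeBot"))  # ['Disallow: /admin']
```

In this model ClaudeBot has no block of its own, so it inherits the wildcard rules and can still crawl everything outside /admin.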
CMS presets and why they differ
Every CMS has files or directories that should not be crawled. WordPress exposes /wp-admin and /wp-includes by default. Blocking them prevents bots from indexing admin panels and PHP includes, which have no SEO value and waste crawl budget. Shopify's /cart and /checkout pages create duplicate content because every session generates a new URL. Blocking them prevents index bloat.
Next.js apps serve JavaScript bundles from /_next/static. Crawling those files is pointless because they are code, not content. Blocking them with Disallow: /_next/static saves requests. Astro and other static-site generators usually do not need blocks because there are no admin routes, but blocking *.json files or /_astro/* build artifacts is common.
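Strapi, Directus, and headless CMSes expose /admin and /api endpoints that should be kept out of crawls. Keep in mind that robots.txt only asks crawlers to stay away; it does not protect those endpoints, so authentication still has to do the security work. If an API route is public and you want it indexed (rare but possible), you would allow it explicitly with User-agent: Googlebot / Allow: /api/public.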
Our presets reflect these patterns. Pick the one that matches your stack, then edit the disallow list if your setup differs. The Custom option gives you a blank starting point if you are on a framework we do not list or if you are building a robots.txt from scratch for a multi-site architecture.
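As a rough sketch, presets like these can be modeled as a lookup table merged with your own paths. The table below copies the default paths listed on this page; treating Astro / static and Custom as empty lists is an assumption, and the `PRESETS` name and `disallow_paths` helper are hypothetical.

```python
# Default disallow paths per preset, as listed on this page.
# Assumption: "Astro / static" and "Custom" start with no paths.
PRESETS = {
    "WordPress": ["/wp-admin", "/wp-includes", "*.php$", "readme.html"],
    "Shopify": ["/admin", "/cart", "/checkout"],
    "Next.js": ["/_next/static", "/api"],
    "Astro / static": [],
    "Strapi": ["/admin", "/api"],
    "Custom": [],
}


def disallow_paths(preset, extra_paths=()):
    """Preset paths followed by user-supplied ones, de-duplicated in order."""
    extras = [path.strip() for path in extra_paths if path.strip()]
    return list(dict.fromkeys(PRESETS[preset] + extras))


print(disallow_paths("Shopify", ["/private", "/cart", " "]))
# ['/admin', '/cart', '/checkout', '/private']
```

Using `dict.fromkeys` keeps the first occurrence of each path, so a custom path that repeats a preset entry does not produce a duplicate rule.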
Sitemap declaration and crawl-delay
The Sitemap: line tells crawlers where to find your sitemap.xml file. It should be a full URL, not a relative path: https://www.yourdomain.com/sitemap.xml, not /sitemap.xml. If you have a sitemap index, list the index. If you have separate sitemaps for posts, pages, and products, you can list all three on separate Sitemap: lines. We put these at the top of the file before any user-agent blocks.
The Crawl-delay: directive sets a minimum number of seconds between requests. Googlebot ignores it. Bingbot and some smaller crawlers respect it. A delay of 1 is reasonable. A delay of 10 or higher can slow crawls to a near-halt. Most sites do not need crawl-delay unless server load is an issue. If you add it, add it to the wildcard block, not to Googlebot, so search indexing stays fast.
Crawlers match user-agent names case-insensitively. User-agent: googlebot works the same as User-agent: Googlebot. Paths in Disallow and Allow rules, however, are matched case-sensitively. /Admin and /admin are different paths. Match the case your server uses or rely on lowercase paths everywhere.
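These three behaviors can be observed with Python's standard-library `urllib.robotparser` (the `site_maps()` method needs Python 3.8 or later). The sample file and the bot name `SomeOtherBot` are illustrative, and the standard-library parser is only one implementation, so other crawlers may differ in detail.

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
Sitemap: https://www.yourdomain.com/sitemap.xml

User-agent: Googlebot
Allow: /

User-agent: *
Crawl-delay: 1
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

sitemaps = parser.site_maps()                      # absolute sitemap URLs
wildcard_delay = parser.crawl_delay("SomeOtherBot")  # delay from the * block
googlebot_delay = parser.crawl_delay("Googlebot")    # no delay in its block

# User-agent names are matched case-insensitively.
lowercase_agent_ok = parser.can_fetch("googlebot", "/admin")
# Paths are matched case-sensitively: /Admin is not covered by /admin.
lower_path_ok = parser.can_fetch("SomeOtherBot", "/admin")
mixed_path_ok = parser.can_fetch("SomeOtherBot", "/Admin")

print(sitemaps, wildcard_delay, googlebot_delay)
print(lowercase_agent_ok, lower_path_ok, mixed_path_ok)
```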
Common mistakes
- Putting the wildcard block first. If User-agent: * comes before User-agent: Googlebot, some parsers stop reading after the wildcard. Major crawlers pick the most specific matching block regardless of order, but simpler parsers may not, so always list specific bots first, wildcard last.
- Blocking the entire site by accident. User-agent: * / Disallow: / blocks every bot that has no block of its own, including Google. If you want Google in, add a User-agent: Googlebot / Allow: / block before the wildcard.
- Forgetting the leading slash in disallow paths. Disallow: admin does nothing. It must be Disallow: /admin.
- Blocking CSS or JavaScript files. Google needs to see your CSS and JS to render pages correctly. Disallow: *.css or Disallow: *.js can hurt mobile usability scores and indexing. Only block these if you have a specific reason.
- Assuming AI crawlers respect implied denials. If you have no GPTBot block, GPTBot follows the wildcard rules, which usually means it crawls. The absence of a rule is not a block. Use explicit disallow blocks if you want certain bots excluded.
- Deploying without testing. Generate the file, then test a few paths with our inline tester or the robots.txt checker before pushing to production. One typo can block your whole site.
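A few of the mistakes above are mechanical enough to catch with a script. The following is a minimal lint sketch, not our validator: the function name `lint_robots_txt` and its warning messages are hypothetical, and it checks only for missing leading slashes, a site-wide block in the wildcard group, and blocked CSS or JS files.

```python
def lint_robots_txt(text):
    """Return warnings for a few common mistakes (not a full validator)."""
    warnings = []
    agents = []             # user-agents of the group currently being read
    reading_agents = False  # True while consecutive User-agent lines continue
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                agents = []  # a new group starts here
            agents.append(value)
            reading_agents = True
            continue
        reading_agents = False
        if field != "disallow":
            continue
        if value and not value.startswith(("/", "*")):
            warnings.append(f"line {number}: '{value}' lacks a leading slash")
        if value == "/" and "*" in agents:
            warnings.append(f"line {number}: wildcard group blocks the entire site")
        if value.rstrip("$").lower().endswith((".css", ".js")):
            warnings.append(f"line {number}: blocking CSS or JS can break rendering")
    return warnings


risky = "User-agent: *\nDisallow: admin\nDisallow: /\nDisallow: /*.css$\n"
for warning in lint_robots_txt(risky):
    print(warning)
```

A check like this complements, rather than replaces, testing real paths against the generated file.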
Advanced tips
- Use the $ anchor to match path endings precisely. Disallow: /*.pdf$ blocks PDF files but not /report.pdf.html. This pattern is useful for blocking file types without catching false positives.
- Use wildcards for dynamic parameters. Disallow: /search?* blocks all URLs starting with /search?, which prevents infinite crawl loops on search-results pages.
- If you want Googlebot to index a page but block Googlebot-Image from scraping images on that page, add two blocks: User-agent: Googlebot / Allow: /page and User-agent: Googlebot-Image / Disallow: /page. Googlebot-Image is a separate crawler.
- Test the file against real paths from your sitemap. If your sitemap includes /blog/post-1 but your robots.txt blocks /blog, you have a conflict. The sitemap will be ignored for those URLs.
- Save a copy of your generated robots.txt with a version number or date stamp. When you regenerate it later, compare the diff to see what changed.
- If your site has staging or dev subdomains, generate separate robots.txt files for each with Disallow: / on non-production domains. Staging should never be indexed.
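The `*` and `$` patterns from the tips above can be checked by translating them into regular expressions. This is a minimal sketch of that matching, assuming `*` matches any run of characters, a trailing `$` anchors the end of the path, and everything else is a literal prefix match. The helper names `pattern_to_regex` and `matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters; a trailing `$` anchors the end.
    Everything else is literal, and matching starts at the beginning of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """True if the rule pattern applies to the given URL path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/files/report.pdf"))   # True
print(matches("/*.pdf$", "/report.pdf.html"))    # False
print(matches("/search?*", "/search?q=shoes"))   # True
print(matches("/blog", "/blog/post-1"))          # True
```

The last line shows the sitemap conflict from the tips: a rule for /blog also covers /blog/post-1, because rules without `$` are prefix matches.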
Once your robots.txt is generated, download it and place it in your site's root directory as robots.txt, so that it is served at yourdomain.com/robots.txt. Then validate it with the robots.txt checker to confirm syntax and rule logic. If you also need to validate your sitemap links, use the sitemap checker. If you are fixing canonical URL issues at the same time, the canonical checker bulk-tests URL variants.