
Robots.txt Generator

Preset for your CMS + explicit controls for AI crawlers that matter in 2026.

A robots.txt file controls which bots can crawl your site. Most generators give you a blank textarea and a syntax guide. This robots.txt generator starts with proven presets for WordPress, Shopify, Next.js, and Astro, and adds explicit allow or block toggles for the AI crawlers that matter in 2026: GPTBot, ClaudeBot, CCBot, PerplexityBot, anthropic-ai, and Google-Extended. It outputs a production-ready file with syntax highlighting and inline testing.

Generate the whole content, not just check it.

BlazeHive writes SEO articles end to end from a single keyword. Outline, draft, meta, schema, internal links. Free trial, no card.

Start with BlazeHive Free trial

What a robots.txt generator actually does

A robots.txt generator writes the plain-text directives that crawlers read before accessing any page on your domain. It structures user-agent blocks, disallow and allow rules, crawl-delay instructions, and sitemap declarations in the exact order and syntax that bots expect. It prevents syntax errors, ensures the wildcard block comes after specific agents, and validates that every disallow path starts with a forward slash.

Most sites need three blocks. One for search bots like Googlebot and Bingbot with broad access. One wildcard block for all other crawlers with restricted access. And separate blocks for AI training bots that you want to allow or deny explicitly. Our generator handles all three and lets you toggle each AI bot independently because their policies differ.

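Two principles guide a good robots.txt. First, explicit beats implicit. If you want GPTBot blocked but ClaudeBot allowed, list both by name instead of relying on the wildcard. Second, specific rules beat general rules. A crawler follows the block that names it and falls back to User-agent: * only when no named block matches, so if you disallow /admin for everyone but allow it for Googlebot, give Googlebot its own block. Parsers that follow RFC 9309 pick that block wherever it sits in the file, but listing named blocks before the wildcard keeps the file readable and safe for older parsers. Our generator enforces both.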

How to use this robots.txt generator

  1. Pick a CMS preset from the dropdown. WordPress blocks /wp-admin, /wp-includes, /*.php$, and /readme.html by default. Shopify blocks /admin, /cart, and /checkout. Next.js blocks /_next/static and /api. Astro / static gives you a minimal file. Strapi blocks /admin and /api. Custom starts blank.
  2. Paste your sitemap URL into Sitemap URL. This becomes the Sitemap: line at the top. If you have multiple sitemaps or a sitemap index, add one per line in the expanded field.
  3. Check or uncheck each AI crawler in Block AI crawlers. Checked means we add a user-agent block with Disallow: / for that bot. Unchecked means it falls back to the wildcard block. Options are GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), PerplexityBot, anthropic-ai, and Google-Extended.
  4. Paste additional paths to block into Disallow paths, one per line. Examples: /admin, /checkout, /cart, /private. These apply to the wildcard User-agent: * block.
  5. Hit Generate robots.txt. You get syntax-highlighted output, a download button, and a Test path box so you can check whether a URL would be blocked before deploying.
  6. Click Check robots.txt at the bottom to run the file through our validator and see how each user-agent interprets your rules.

Try selecting the WordPress preset, leaving all AI crawlers checked, and adding /staging to the disallow list. The output allows Googlebot and Bingbot full access, blocks GPTBot, ClaudeBot, CCBot, PerplexityBot, anthropic-ai, and Google-Extended from the entire site, and blocks all other bots from /wp-admin, /staging, and PHP files.
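
The three-block layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the generator's actual implementation: the function name build_robots_txt, its parameters, and the example.com sitemap URL are all assumptions made for the example.

```python
def build_robots_txt(sitemaps, blocked_bots, wildcard_disallow,
                     search_bots=("Googlebot", "Bingbot")):
    """Assemble a robots.txt string (hypothetical helper, for illustration only)."""
    lines = []

    # Sitemap declarations go first, one absolute URL per line.
    for url in sitemaps:
        lines.append(f"Sitemap: {url}")
    if sitemaps:
        lines.append("")

    # Search bots get their own groups with full access.
    for bot in search_bots:
        lines += [f"User-agent: {bot}", "Allow: /", ""]

    # Each blocked AI crawler gets an explicit site-wide Disallow.
    for bot in blocked_bots:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]

    # The wildcard group comes last and carries the shared path rules.
    lines.append("User-agent: *")
    for path in wildcard_disallow:
        lines.append(f"Disallow: {path}")

    return "\n".join(lines) + "\n"


robots = build_robots_txt(
    sitemaps=["https://www.example.com/sitemap.xml"],
    blocked_bots=["GPTBot", "ClaudeBot", "CCBot"],
    wildcard_disallow=["/wp-admin", "/staging"],
)
print(robots)
```

Keeping the wildcard group last mirrors the ordering the generator is described as producing; unchecked AI bots simply have no group of their own and fall through to it.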

Why AI crawler toggles matter in 2026

The robots.txt convention dates from 1994, long before LLMs existed. In 2023 and 2024, AI companies released or documented crawlers such as GPTBot and ClaudeBot that fetch web pages to train language models. CCBot, Common Crawl's crawler, is older, but its archives are widely used as training data. By mid-2026, respecting robots.txt is standard practice for these bots. If your file says User-agent: GPTBot / Disallow: /, OpenAI's crawler skips your site. If you say nothing, it crawls by default.

Three practical decisions every site makes now.

Allow search bots, block training bots. Most sites want Google and Bing to index content for search but do not want that content feeding LLM training datasets. This requires separate blocks: User-agent: Googlebot / Allow: / and User-agent: GPTBot / Disallow: /. The wildcard alone does not distinguish them.

Selectively allow certain AI bots. Some publishers allow ClaudeBot and block GPTBot, or the reverse, depending on which vendor's attribution and licensing terms they prefer. Others allow Google-Extended, the token that governs use of your content in Google's Gemini models, but block CCBot. Per-bot control is the only way to implement a nuanced policy.

Make the policy explicit. If your robots.txt has no GPTBot block, GPTBot assumes it can crawl. Explicit blocks put your intent on record, which may help if a future dispute arises about consent. We include toggles for six major AI crawlers as of April 2026. That list will grow.

CMS presets and why they differ

Every CMS has files or directories that should not be crawled. WordPress exposes /wp-admin and /wp-includes by default. Blocking them keeps bots from crawling admin panels and PHP includes, which have no SEO value and waste crawl budget. Shopify's /cart and /checkout pages are session-specific and can produce many near-duplicate URLs. Blocking them prevents index bloat.

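Next.js apps serve JavaScript bundles from /_next/static. Those files are code, not content, so blocking them with Disallow: /_next/static saves requests from bots that do not render pages. Googlebot does fetch scripts to render client-side pages, so if your content depends on client-side rendering, keep this rule in the wildcard block and out of the Googlebot block. Astro and other static-site generators usually do not need blocks because there are no admin routes, but blocking /*.json files or /_astro/* build artifacts is common.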

Strapi, Directus, and other headless CMSes expose /admin and /api endpoints that have no search value and should be kept out of crawls. Robots.txt is advisory and is not access control, so protect those endpoints with authentication as well. If an API route is public and you want it indexed (rare but possible), you would allow it explicitly with User-agent: Googlebot / Allow: /api/public.

Our presets reflect these patterns. Pick the one that matches your stack, then edit the disallow list if your setup differs. The Custom option gives you a blank starting point if you are on a framework we do not list or if you are building a robots.txt from scratch for a multi-site architecture.

Sitemap declaration and crawl-delay

The Sitemap: line tells crawlers where to find your sitemap.xml file. It should be a full URL, not a relative path: https://www.yourdomain.com/sitemap.xml, not /sitemap.xml. If you have a sitemap index, list the index. If you have separate sitemaps for posts, pages, and products, you can list all three on separate Sitemap: lines. We put these at the top of the file before any user-agent blocks.

The Crawl-delay: directive sets a minimum number of seconds between requests. Googlebot ignores it. Bingbot and some smaller crawlers respect it. A delay of 1 is reasonable. A delay of 10 or higher can slow crawls to a near-halt. Most sites do not need crawl-delay unless server load is an issue. If you add it, add it to the wildcard block, not to Googlebot, so search indexing stays fast.
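
To see how a parser reads these two directives, here is a small sketch using Python's standard-library urllib.robotparser. The file content and the example.com URLs are made up for illustration, and site_maps() needs Python 3.8 or later. Note that this module does not implement wildcard path matching, so it is used here only to read Sitemap and Crawl-delay values.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: two sitemaps, and a crawl delay on the wildcard group only.
ROBOTS = """\
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/sitemap-products.xml

User-agent: Googlebot
Allow: /

User-agent: *
Crawl-delay: 1
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Every Sitemap line is collected, regardless of which group it sits near.
sitemaps = parser.site_maps()

# A bot without its own group inherits the wildcard delay.
wildcard_delay = parser.crawl_delay("SomeOtherBot")

# Googlebot has its own group with no Crawl-delay, so none applies to it.
googlebot_delay = parser.crawl_delay("Googlebot")

print(sitemaps, wildcard_delay, googlebot_delay)
```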

Compliant crawlers match user-agent names case-insensitively. User-agent: googlebot works the same as User-agent: Googlebot. Paths in Allow and Disallow rules, however, are case-sensitive. /Admin and /admin are different. Match the case your URLs actually use or rely on lowercase paths everywhere.

Common mistakes

  • Putting the wildcard block first. Crawlers that follow RFC 9309 pick the most specific matching block wherever it sits, but some older or simpler parsers stop at the first block that matches, and User-agent: * matches everything. List specific bots first, wildcard last.
  • Blocking the entire site by accident. User-agent: * / Disallow: / blocks every bot, including Google. If you want Google in, add a User-agent: Googlebot / Allow: / block before the wildcard.
  • Forgetting the leading slash in disallow paths. Disallow: admin does nothing. It must be Disallow: /admin.
  • Blocking CSS or JavaScript files. Google needs to see your CSS and JS to render pages correctly. Disallow: /*.css$ or Disallow: /*.js$ can hurt rendering, mobile usability scores, and indexing. Only block these if you have a specific reason.
  • Assuming AI crawlers respect implied denials. If you have no GPTBot block, GPTBot crawls. The absence of a rule is not a block. Use explicit disallow blocks if you want certain bots excluded.
  • Deploying without testing. Generate the file, then test a few paths with our inline tester or the robots.txt checker before pushing to production. One typo can block your whole site.
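
Several of the mistakes above can be caught mechanically before deployment. The following is a rough sketch of such a check, not the validator this page links to; the function name lint_robots_txt and the warning wording are assumptions. It flags paths without a leading slash, a site-wide Disallow in the wildcard group, and rules that block CSS or JavaScript.

```python
def lint_robots_txt(text):
    """Return a list of warnings for common robots.txt mistakes (illustrative only)."""
    warnings = []
    current_agents = []   # user-agent names of the group being read
    in_rules = False      # True once the current group has at least one rule

    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:                      # a new group starts here
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if value and not value.startswith("/"):
                warnings.append(f"line {number}: path '{value}' should start with '/'")
            if field == "disallow" and value == "/" and "*" in current_agents:
                warnings.append(f"line {number}: wildcard group blocks the entire site")
            if field == "disallow" and value.rstrip("$").lower().endswith((".css", ".js")):
                warnings.append(f"line {number}: blocking CSS or JS can break rendering")

    return warnings


sample = "User-agent: *\nDisallow: admin\nDisallow: /\nDisallow: /*.css$\n"
for warning in lint_robots_txt(sample):
    print(warning)
```

A check like this complements, and does not replace, testing real paths against the deployed file.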

Advanced tips

  • Use the $ anchor to match path endings precisely. Disallow: /*.pdf$ blocks PDF files but not /report.pdf.html. This pattern is useful for blocking file types without catching false positives.
  • Use wildcards for dynamic parameters. Disallow: /search?* blocks all URLs starting with /search?, which prevents infinite crawl loops on search-results pages.
  • If you want Googlebot to index a page but keep Googlebot-Image from picking up its images, add two blocks: User-agent: Googlebot / Allow: /page and User-agent: Googlebot-Image / Disallow: followed by the path the images are served from, for example /images/. Googlebot-Image is a separate crawler, and it fetches the image URLs, not the page URL.
  • Test the file against real paths from your sitemap. If your sitemap includes /blog/post-1 but your robots.txt blocks /blog, you have a conflict. Compliant crawlers will not fetch those URLs even though the sitemap lists them.
  • Save a copy of your generated robots.txt with a version number or date stamp. When you regenerate it later, compare the diff to see what changed.
  • If your site has staging or dev subdomains, generate separate robots.txt files for each with Disallow: / on non-production domains. Staging should never be indexed.
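
The * and $ patterns from the first two tips can be checked by translating a rule into a regular expression. This is a minimal sketch of that translation under the usual interpretation (* matches any run of characters, a trailing $ anchors the end of the URL); the helper names pattern_to_regex and rule_matches are made up for the example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex (illustrative)."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape literal pieces and join them with '.*' where the pattern had '*'.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def rule_matches(pattern, url_path):
    # re.match anchors at the start, which matches robots.txt prefix semantics.
    return pattern_to_regex(pattern).match(url_path) is not None


print(rule_matches("/*.pdf$", "/files/report.pdf"))    # True
print(rule_matches("/*.pdf$", "/report.pdf.html"))     # False
print(rule_matches("/search?*", "/search?q=shoes"))    # True
print(rule_matches("/search?*", "/searching"))         # False
```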

Once your robots.txt is generated, download it and place it in your site's root directory as robots.txt, so that it is served at yourdomain.com/robots.txt. Then validate it with the robots.txt checker to confirm syntax and rule logic. If you also need to validate your sitemap links, use the sitemap checker. If you are fixing canonical URL issues at the same time, the canonical checker bulk-tests URL variants.

Frequently Asked Questions

What does a robots.txt file do?

A robots.txt file tells crawlers which parts of your site they can and can't request. It sits at the exact root of your domain at /robots.txt and is the first thing polite bots read before crawling anything else on the site. The syntax is small. You write one or more User-agent blocks, each followed by Allow and Disallow lines. A Sitemap line near the top points bots at your XML index so they discover every page instead of guessing from internal links. In 2026, the file does more than manage Googlebot. It also controls AI training crawlers like GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), PerplexityBot, and Google-Extended, each with a separate User-agent name. Our robots.txt generator writes a production file from a CMS preset and checkbox toggles. Verify it with our checker against a real path before you ship the change to production.

How do I generate a robots.txt file?

Pick your CMS preset: WordPress, Shopify, Next.js, Astro, Strapi, or Custom for an empty starter. The preset drops in sensible defaults for that platform, such as blocking /wp-admin on WordPress or /checkout on Shopify so crawl budget doesn't get burned. Paste your Sitemap URL so crawlers discover pages fast. Use the AI-crawler checkboxes to block GPTBot, ClaudeBot, CCBot, PerplexityBot, anthropic-ai, and Google-Extended individually or all at once. Add any extra paths you want disallowed, one per line. Hit generate, copy the output, save it as robots.txt in your public root. On WordPress, paste it into your SEO plugin's robots.txt editor. On Next.js or Astro, drop it in /public. Once live, paste your URL into our robots.txt checker and test a few paths per bot to confirm the rules resolve the way you expect. If something is off, adjust the inputs and regenerate rather than hand-editing the output.

What should I put in my robots.txt file?

Four things at minimum. First, a Sitemap line pointing to your XML index (https://example.com/sitemap.xml). This is the single fastest discovery boost for new pages and orphan URLs not linked from the homepage. Second, a User-agent: * block with any Disallow rules that apply to every bot: admin paths, internal search results, cart and checkout on ecommerce. Third, per-bot rules if you want different treatment for AI crawlers versus search engines. Fourth, nothing else. Most broken robots.txt files are bloated with stale rules copied from a ten-year-old tutorial. Keep it short, remember that paths are case-sensitive, and declare either a single sitemap index or one Sitemap line per sitemap. Our generator scaffolds a clean version for your CMS with the 2026 AI-crawler toggles included by default. Once it's live, test a handful of real URLs per bot with our robots.txt checker to confirm every rule resolves correctly and nothing important is accidentally blocked.

How does a robots.txt file work?

When a crawler visits your site, it first requests /robots.txt. If the file returns a 200, the bot parses the rules, finds the User-agent block that matches its own name, and follows the Allow and Disallow lines in that block. If no named block matches, it uses the wildcard block (User-agent: *). The longer, more specific path match wins on overlap, so Allow: /blog/ beats Disallow: /bl when both are present. The bot caches the file for roughly 24 hours before re-fetching, so new rules don't apply instantly across the web. Robots.txt is advisory, not enforced. Compliant bots (Googlebot, Bingbot, GPTBot, ClaudeBot) respect it as policy. Rogue scrapers ignore it entirely and go straight for the content. See exactly which rule wins for any bot on any path with our robots.txt checker, and audit the rendered result with our crawler simulator after changes go live.
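
The group-selection step described above can be sketched as follows. This is a simplified illustration, not a full RFC 9309 parser: it matches user-agent names by exact case-insensitive comparison, and the helper names parse_groups and rules_for are assumptions made for the example.

```python
def parse_groups(text):
    """Map each lowercased user-agent name to its list of (field, path) rules."""
    groups = {}
    agents, in_rules = [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:                      # rules seen, so a new group begins
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and agents:
            in_rules = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def rules_for(groups, crawler_name):
    # A named group wins; the wildcard group is only a fallback.
    return groups.get(crawler_name.lower(), groups.get("*", []))


ROBOTS = """\
User-agent: *
Disallow: /admin

User-agent: GPTBot
Disallow: /
"""

groups = parse_groups(ROBOTS)
print(rules_for(groups, "gptbot"))      # the GPTBot group, despite its position
print(rules_for(groups, "SomeBot"))     # falls back to the wildcard group
```

In this sketch the named group is found even though the wildcard group appears first, which is how a parser that follows the standard is expected to behave.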

How do I block AI crawlers in robots.txt?

Add a User-agent block for each bot you want to block, with Disallow: / underneath to block the whole site. The 2026 names worth including explicitly: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), CCBot (Common Crawl, trains many open models), PerplexityBot (Perplexity's answer engine), and Google-Extended (blocks training usage of Googlebot-crawled pages without hurting your Google search rankings or feature eligibility). Our robots.txt generator has one checkbox per crawler so you can decide per bot rather than blocking or allowing everything at once. Publishers usually block all of them to protect paid archives. Marketers often leave them allowed to get cited in ChatGPT and Claude answers where citations drive referral traffic. After you generate and deploy, test each blocked bot with our robots.txt checker on a specific path to confirm the Disallow resolves correctly for that User-agent name and block order.

Do I need a robots.txt file?

Not strictly. If you don't have one, crawlers assume everything is allowed and crawl your site by following links. For a small static site with nothing to hide and no crawl-budget concerns, that's fine and you can skip it entirely. You'll probably still want one for three reasons. One, it's the standard place to declare your sitemap, which speeds up discovery of new pages across Google, Bing, and AI crawlers. Two, it gives you a switch for AI training crawlers (block GPTBot and CCBot, for instance) that you don't get by default without a file. Three, it lets you block high-traffic wasteful paths like internal search results or faceted filter URLs so Googlebot spends its budget on pages that matter. Generate a baseline with our robots.txt generator and the right CMS preset. Verify the rules resolve correctly per bot with our robots.txt checker before you move on.

What's the difference between Allow and Disallow?

Disallow blocks a path for the named User-agent. Allow explicitly permits one. You rarely need Allow because anything not Disallowed is allowed by default. The one time Allow matters is when you want to block a folder but keep one file inside it crawlable. For example, Disallow: /wp-admin/ with Allow: /wp-admin/admin-ajax.php blocks the admin interface but lets Googlebot reach the AJAX endpoint WordPress needs for core functions. The longer, more specific match wins. Allow: /blog/ beats Disallow: /bl, and Disallow: /blog/drafts/ beats Allow: /blog/ when both apply to the same bot. Rule order inside a single User-agent block doesn't matter to modern bots; specificity does. Test the outcome with our robots.txt checker by pasting the exact path you care about and seeing which rule the bot lands on. For clean starter rules matched to your CMS, use our generator and the right preset.
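
The longest-match rule can be illustrated with a short function. This sketch handles plain path prefixes only (no * or $ patterns) and resolves equal-length ties in favor of Allow, which is what RFC 9309 recommends; the function name is_allowed and the rule format are assumptions for the example.

```python
def is_allowed(rules, url_path):
    """Decide whether url_path may be crawled under (field, prefix) rules."""
    best_len, allowed = -1, True          # no matching rule means allowed
    for field, prefix in rules:
        if not prefix or not url_path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_goes_to_allow = len(prefix) == best_len and field == "allow"
        if longer or tie_goes_to_allow:
            best_len, allowed = len(prefix), field == "allow"
    return allowed


wordpress = [("disallow", "/wp-admin/"), ("allow", "/wp-admin/admin-ajax.php")]
print(is_allowed(wordpress, "/wp-admin/admin-ajax.php"))   # True
print(is_allowed(wordpress, "/wp-admin/options.php"))      # False

blog = [("allow", "/blog/"), ("disallow", "/blog/drafts/")]
print(is_allowed(blog, "/blog/drafts/idea"))               # False
print(is_allowed(blog, "/blog/post-1"))                    # True
```

Because the decision depends on prefix length, reordering the rules in either list gives the same answers.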

Where do I put the robots.txt file?

At the exact root of your domain. The URL must resolve at https://yourdomain.com/robots.txt with no subdirectory in front of it at all. Subdirectory paths like /blog/robots.txt are ignored entirely by every crawler, no matter how many rules the file contains. Subdomains count as separate domains, so blog.example.com needs its own robots.txt at its own root; the one on www.example.com does not apply. Platform-specific placement: WordPress users paste the content into their SEO plugin's robots.txt editor (Yoast, Rank Math, or All in One SEO) and the plugin serves it virtually. Shopify generates one automatically and locks most of it; you customize via robots.txt.liquid. Next.js and Astro projects drop robots.txt into /public. Static sites put it in the web root directly. Generate the file itself with our robots.txt generator, then confirm it's live and parses correctly for each named bot with our robots.txt checker.
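
Since each host has exactly one robots.txt location, you can derive it from any page URL. A minimal sketch using Python's standard urllib.parse; the function name robots_url and the example.com URLs are illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://blog.example.com/2026/post?ref=home"))
# https://blog.example.com/robots.txt
print(robots_url("https://www.example.com/blog/robots.txt"))
# https://www.example.com/robots.txt
```

The second call shows why a file placed under /blog/ is never consulted: crawlers only request the root path of the host.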

Should robots.txt include the sitemap URL?

Yes. It's one line and it's the single most useful non-rule entry in the file. A Sitemap: https://example.com/sitemap.xml declaration tells every crawler where your XML index lives, which speeds up discovery of new pages and helps bots find orphan URLs that aren't linked from the homepage or main navigation. Google and Bing both read this line and use the sitemap to discover URLs to crawl. The sitemap URL should be absolute (full https://), point to an index that returns a 200, and list only canonical URLs that match your preferred domain version. Multiple Sitemap lines are valid, but if you have many sitemaps it is tidier to put them in a sitemap index file and declare that index once. Our robots.txt generator adds the Sitemap line automatically from the Sitemap URL field. Validate the sitemap itself (status codes, duplicates, lastmod freshness across every URL) with our sitemap checker.

How do I test my robots.txt after generating it?

Three-step verification. First, confirm the file is live by loading https://yourdomain.com/robots.txt in a browser; you should see your rules as plain text, not your homepage or a 404 error. Second, paste the URL into our robots.txt checker and run one check per bot you named in the file (Googlebot, GPTBot, ClaudeBot, PerplexityBot, and so on) against a specific Test path that should be blocked. The checker tells you which rule matched and which User-agent block it came from so you can fix the source. Third, for rendered-page validation, run the same URL through our crawler simulator to see whether the page is actually fetched and indexed the way you intend. If Search Console flags any URL as blocked after deployment, Google's cache can lag by up to 24 hours before the new rules take effect across the index, so re-check after a day before filing a bug.
