Live check · fetches your URL server-side

Sitemap Checker

Crawl up to 200 URLs — status codes, duplicates, orphans, lastmod age.

A sitemap.xml file tells search engines which pages exist and how often they change. Most validators parse the XML and stop. This sitemap checker validates structure, fetches HTTP status codes for every listed URL, detects duplicates, flags orphans that are in your sitemap but not linked from your homepage, and checks whether lastmod timestamps are recent enough to justify crawl priority.

Generate the whole content, not just check it.

BlazeHive writes SEO articles end to end from a single keyword. Outline, draft, meta, schema, internal links. Free trial, no card.

Start with BlazeHive Free trial

What a sitemap checker actually does

A sitemap checker fetches your sitemap.xml file, parses every <url> entry, extracts the <loc>, <lastmod>, <changefreq>, and <priority> tags, then makes an HTTP HEAD request to each URL to confirm it returns 200. It flags redirects, 404s, and server errors, checks for duplicate URLs, and compares your sitemap structure against the XML sitemap spec.
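The parsing step can be sketched with Python's standard library. This is a minimal illustration, not the checker's actual implementation; `parse_urlset` and `SAMPLE` are hypothetical names, and the sample XML is made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_urlset(xml_text):
    """Return one dict per <url> entry with loc, lastmod, changefreq, priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entry = {}
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext(f"sm:{tag}", default=None, namespaces=SITEMAP_NS)
            # Optional tags that are absent or empty are stored as None.
            entry[tag] = value.strip() if value else None
        entries.append(entry)
    return entries


# Illustrative input only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for entry in parse_urlset(SAMPLE):
        print(entry["loc"], entry["lastmod"])
```

Each returned dict can then be passed to a status check, a duplicate check, or a lastmod check.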

If your sitemap is a sitemap index (a file that lists other sitemap files instead of individual URLs), we follow every reference, fetch each child sitemap, and aggregate the results. A single check covers your entire sitemap tree, up to 200 URLs in full mode or 50 URLs in sample mode.
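Following an index can be sketched as below. This is an assumption-laden illustration rather than our production code: `classify_sitemap` and `collect_urls` are hypothetical helpers, and `fetch` is any callable you supply that maps a sitemap URL to its XML text. A child that fails to fetch or parse is recorded and skipped, and collection stops at the 200-URL limit.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def classify_sitemap(xml_text):
    """Return ("index", child_sitemap_urls) or ("urlset", page_urls)."""
    root = ET.fromstring(xml_text)
    if root.tag == NS + "sitemapindex":
        return "index", [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    if root.tag == NS + "urlset":
        locs = root.findall(f"{NS}url/{NS}loc")
        return "urlset", [el.text.strip() for el in locs if el.text]
    raise ValueError(f"unexpected root element: {root.tag}")


def collect_urls(start_url, fetch, limit=200):
    """Walk the sitemap tree breadth-first and return (page_urls, broken_sitemaps)."""
    pending, pages, broken, seen = [start_url], [], [], set()
    while pending and len(pages) < limit:
        sitemap_url = pending.pop(0)
        if sitemap_url in seen:
            continue
        seen.add(sitemap_url)
        try:
            kind, locs = classify_sitemap(fetch(sitemap_url))
        except (OSError, LookupError, ValueError, ET.ParseError):
            broken.append(sitemap_url)  # report the reference, skip the file
            continue
        if kind == "index":
            pending.extend(locs)
        else:
            pages.extend(locs[: limit - len(pages)])
    return pages, broken
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; in practice it would wrap an HTTP GET.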

Three categories of problem show up in every sitemap audit. Broken URLs that return 404 or 500. Duplicate URLs listed more than once, which waste crawl budget. And orphan URLs that appear in the sitemap but have zero internal links, meaning a user cannot reach them by clicking through your site. Our checker flags all three in one pass.

How to use this sitemap checker

  1. Paste your sitemap URL into Sitemap URL. Usually https://www.yourdomain.com/sitemap.xml or https://www.yourdomain.com/sitemap_index.xml.
  2. Pick a Crawl depth from the dropdown. Index only validates the XML structure without fetching URLs. All referenced sitemaps follows every sitemap listed in an index. Sample 50 URLs checks status codes for 50 random URLs. Full - up to 200 URLs checks every URL we find, up to the limit.
  3. Hit Check sitemap. You get a summary table with total URLs, status code breakdown, duplicates count, average lastmod age, and any XML schema errors.
  4. Expand Problem URLs to see a row-by-row list of 404s, 301s, duplicates, and orphans. Each row shows the URL, status, lastmod date, and recommended fix.
  5. Click Download CSV to export the full report. Use it to batch-fix issues in your CMS or pass it to a developer.

Try checking a sitemap with more than one file. If your sitemap index lists five sub-sitemaps and one returns 404, we report the broken reference and skip that file. The other four are still checked. If you have a flat sitemap with 10,000 URLs, pick Sample 50 first to spot-check before running the full crawl.
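A spot-check sample is easy to reproduce locally. The sketch below is one possible approach, not necessarily how our sampler works; `sample_urls` and its `seed` parameter are hypothetical.

```python
import random


def sample_urls(urls, k=50, seed=None):
    """Pick up to k distinct URLs at random for a spot check."""
    unique = list(dict.fromkeys(urls))  # drop exact repeats, keep first-seen order
    if len(unique) <= k:
        return unique
    # A fixed seed makes the same sample repeatable between runs.
    return random.Random(seed).sample(unique, k)


if __name__ == "__main__":
    all_urls = [f"https://www.example.com/page-{i}" for i in range(10_000)]
    print(len(sample_urls(all_urls, k=50, seed=1)))
```

Fixing the seed is useful when you want to re-check the same 50 URLs after a fix.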

Why status codes matter more than XML validity

A sitemap can be perfectly valid XML and still hurt your SEO. If 30 URLs return 404, Google wastes crawl budget fetching pages that do not exist. If 50 URLs are 301 redirects, Google has to follow the redirect, which doubles the request count and slows indexing. If URLs return 500 errors, Google might drop them from the index entirely.
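To see the status breakdown for your own URLs, something like the following works as a rough sketch. `head_status`, `classify_status`, and `summarize` are hypothetical names. `http.client` is used because it does not follow redirects, so a 301 is reported as a 301 instead of the final 200.

```python
import http.client
from collections import Counter
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Send one HEAD request and return the raw status code (redirects not followed)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def classify_status(code):
    """Map a status code to the bucket used in a summary table."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client_error"
    if 500 <= code < 600:
        return "server_error"
    return "other"


def summarize(statuses):
    """statuses maps URL -> status code; returns a Counter of buckets."""
    return Counter(classify_status(code) for code in statuses.values())
```

Some servers answer HEAD differently from GET, so treat an unexpected 405 or 403 as a prompt to retry with GET before calling the URL broken.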

Three practical consequences.

Crawl budget. Google allocates a daily crawl budget to each site based on server speed, site authority, and crawl demand. Every 404 or redirect in your sitemap subtracts from that budget without indexing new content. Cleaning the sitemap before submitting it to Search Console makes every crawl count.

Index coverage. URLs with 4xx or 5xx status codes may be excluded from the index after repeated failures. If those pages are important (product pages, blog posts with backlinks, landing pages for paid campaigns), you lose traffic. A sitemap check catches this before the damage compounds.

Lastmod accuracy. The <lastmod> tag tells Google when a page was last updated. If every page has the same lastmod from three years ago, Google learns your sitemap is stale and may crawl less often. If lastmod is always "yesterday" even when content has not changed, Google learns to ignore it. Our checker reports average lastmod age and flags suspicious patterns.
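A lastmod report along these lines could be computed as follows. This is a sketch under stated assumptions: `lastmod_report` is a hypothetical helper, the two "suspicious pattern" rules are illustrative thresholds rather than our exact heuristics, and only values that begin with a full YYYY-MM-DD date are supported.

```python
from datetime import date


def lastmod_report(lastmods, today):
    """Summarize lastmod values (strings starting with YYYY-MM-DD, or None)."""
    # Only the date part is used; time and timezone are ignored for age in days.
    dates = [date.fromisoformat(value[:10]) for value in lastmods if value]
    if not dates:
        return {"count": 0, "average_age_days": None, "flags": ["no lastmod values"]}
    ages = [(today - d).days for d in dates]
    flags = []
    if len(dates) > 1 and len(set(dates)) == 1:
        flags.append("every URL shares one lastmod date")
    if all(age <= 1 for age in ages):
        flags.append("every lastmod is within the last day")
    return {
        "count": len(dates),
        "average_age_days": sum(ages) / len(ages),
        "flags": flags,
    }


if __name__ == "__main__":
    print(lastmod_report(["2024-01-01", "2024-01-11", None], date(2024, 1, 21)))
```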

Duplicate URLs and canonical mismatches

A duplicate URL in a sitemap usually means the same loc appears twice, often with a trailing-slash difference or a protocol mismatch. /page and /page/ are different URLs to a parser, even if your server treats them as identical. http://example.com/page and https://example.com/page are different. Our checker normalizes these patterns and flags them as probable duplicates.
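One way to normalize URLs for duplicate detection is sketched below. The rules (ignore the scheme, lowercase the host, drop a trailing slash) mirror the two patterns described above; `normalize` and `probable_duplicates` are hypothetical names, and a real checker may apply more rules.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def normalize(url):
    """Build a comparison key that ignores scheme, host case, and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key


def probable_duplicates(urls):
    """Group URLs by normalized key and keep groups with more than one member."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    listed = [
        "http://example.com/page",
        "https://example.com/page/",
        "https://example.com/other",
    ]
    print(probable_duplicates(listed))
```

The original URLs are kept in each group so the report can show exactly which entries collide.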

If your sitemap lists /page but that URL redirects to /page/, the redirect wastes a request. It is better to list the final destination in the sitemap and fix the redirect at the server level. We show the redirect chain and recommend listing the 200-status version.

Canonical mismatches are a related issue. If your sitemap includes /page-a but that page has a <link rel="canonical" href="/page-b"> tag, Google sees a conflict. The sitemap says "index page-a" but the page says "I am a duplicate of page-b." Google may choose to ignore the sitemap entry. Run a canonical checker on flagged URLs to confirm the canonical matches the sitemap loc.
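Checking one page for a canonical mismatch can be sketched with the standard-library HTML parser. `CanonicalFinder` and `canonical_mismatch` are hypothetical names; the sketch reads only the first canonical `<link>` tag and does not look at HTTP `Link` headers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def canonical_mismatch(page_url, html):
    """Return the canonical URL if it differs from page_url, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None
    resolved = urljoin(page_url, finder.canonical)  # handles relative hrefs
    return resolved if resolved != page_url else None


if __name__ == "__main__":
    html = '<head><link rel="canonical" href="/page-b"></head>'
    print(canonical_mismatch("https://example.com/page-a", html))
```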

Orphan pages and crawlability

An orphan page is in your sitemap but has no internal links pointing to it. A bot can find it via the sitemap, but a human cannot reach it by navigating your site. This is common after content migrations, when old URLs linger in the sitemap but the nav menu was updated.

Orphans are not always bad. A landing page for a paid ad campaign might be orphaned on purpose to control access. But orphaned blog posts or product pages signal a site-structure problem. If the page should be accessible, add internal links. If it should not exist, remove it from the sitemap and 301 it to a live page.

Our checker detects probable orphans by comparing sitemap URLs to your internal link graph. If a URL appears in the sitemap but has zero inbound links from pages we crawled, we flag it. This heuristic catches most orphans without requiring a full-site crawl.
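The orphan heuristic boils down to a set comparison. The sketch below assumes you already have the HTML of the pages you crawled; `LinkCollector` and `find_orphans` are hypothetical names, and the matching is exact (no URL normalization).

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_orphans(sitemap_urls, crawled_pages):
    """crawled_pages maps page URL -> HTML. Return sitemap URLs with no inbound link."""
    linked = set()
    for page_url, html in crawled_pages.items():
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            target = urldefrag(urljoin(page_url, href)).url  # resolve, drop #fragment
            if target != page_url:  # a page linking to itself does not count
                linked.add(target)
    return [url for url in sitemap_urls if url not in linked]
```

The fewer pages you crawl, the more false positives you get, which is why a flagged URL is only a probable orphan.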

Common mistakes

  • Submitting a sitemap index to a tool that expects flat sitemaps. Most validators choke on indexes or test only the index file itself. Ours follows every reference, so you get results for the entire tree.
  • Listing non-canonical URLs. Every URL in your sitemap should be the canonical version. Do not list the www version if the canonical is non-www. Do not list http if the canonical is https. Use your canonical checker first if unsure.
  • Including URLs blocked by robots.txt. If a URL is in your sitemap but disallowed in robots.txt, Google cannot crawl it. This creates a Search Console warning. Check robots.txt with our robots.txt checker before deploying a new sitemap.
  • Setting lastmod to the date the sitemap was generated, not the date the content changed. If your CMS regenerates the sitemap daily and stamps every URL with today's date, Google stops trusting lastmod. Populate lastmod from the post's actual updated-at timestamp.
  • Forgetting to re-check after a migration. Old URLs often remain in a sitemap after moving to a new platform. If half your sitemap returns 404, Search Console will show the drop in coverage. Audit the sitemap immediately post-migration.
  • Not checking child sitemaps individually. If your sitemap index has one broken child, you might not notice until crawl errors spike. Test each child sitemap URL in isolation to confirm it returns 200 and parses correctly.
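The robots.txt mistake in the list above is straightforward to test for with Python's `urllib.robotparser`. This is a sketch, with `blocked_urls` as a hypothetical helper. Note that the standard-library parser applies rules in file order, which can differ from Google's longest-match behavior, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /admin/\n"
    listed = ["https://example.com/admin/login", "https://example.com/blog"]
    print(blocked_urls(robots, listed))
```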

Advanced tips

  • Run a sample check first on large sitemaps. If the sample reveals a pattern (every URL is a 301, or lastmod is missing), fix it before crawling all 10,000 URLs. The sample gives you signal in 10 seconds instead of 5 minutes.
  • Compare lastmod dates to your CMS publish dates. If a post was updated last week but lastmod is six months old, your sitemap generation script is broken.
  • Check your sitemap monthly, not once. Content goes stale, redirects are added, URLs are unpublished. A monthly check catches decay before Google does.
  • If you see a spike in 404s, export the CSV and cross-reference it with your server logs. Sometimes a URL is 404 in the sitemap but still gets traffic from backlinks, meaning it should be 301'd instead of removed.
  • Test the same sitemap from two different user-agents (desktop Chrome and Googlebot). If the status codes differ, your server is cloaking or returning different responses to bots, which violates Google's guidelines.
  • If duplicates are found, check for canonical tags. A duplicate URL with a canonical pointing elsewhere can stay in the sitemap if it is a regional or language variant. If it is not a variant, remove it.
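The two-user-agent tip above can be tried with a short script. This is a sketch only: `status_as`, `status_differences`, and the two user-agent strings are illustrative, and `urllib.request` follows redirects, so the code you see is the final one in the chain.

```python
import urllib.error
import urllib.request

# Example user-agent strings; substitute the ones you want to compare.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def status_as(url, user_agent, timeout=10):
    """Return the final status code for url when requested with user_agent."""
    request = urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx arrive as exceptions


def status_differences(browser_statuses, bot_statuses):
    """Return {url: (browser_code, bot_code)} for URLs whose codes differ."""
    diffs = {}
    for url in sorted(browser_statuses.keys() & bot_statuses.keys()):
        pair = (browser_statuses[url], bot_statuses[url])
        if pair[0] != pair[1]:
            diffs[url] = pair
    return diffs
```

A difference is a reason to look closer, not proof of cloaking: firewalls and bot-protection services often block requests that claim to be Googlebot but come from a non-Google IP address.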

After fixing sitemap issues, validate that your robots.txt file correctly declares the sitemap location with a Sitemap: line. Use the robots.txt checker to confirm. Then simulate how Googlebot sees one of your pages with the Google crawler simulator to confirm the URL loads, JavaScript executes, and content is visible. If you are checking metadata alongside sitemaps, the website metadata checker renders your title, meta, and OG tags as they appear in SERPs.
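Reading the Sitemap: line back out of robots.txt takes a few lines with the standard library. A minimal sketch, assuming Python 3.8 or later (where `RobotFileParser.site_maps()` is available); `declared_sitemaps` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in robots.txt, or an empty list."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when nothing is declared


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /admin/\nSitemap: https://www.example.com/sitemap.xml\n"
    print(declared_sitemaps(robots))
```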

Frequently Asked Questions

What is a sitemap?

A sitemap is an XML file listing every URL you want search engines to crawl and index. It usually lives at yoursite.com/sitemap.xml and acts as a directory for crawlers, especially useful for large sites, new sites with few backlinks, or sites with deep navigation where pages sit five or more clicks from the homepage. Sitemaps do not guarantee indexing. Google still decides whether a page is worth indexing based on quality and duplication. Without a sitemap, Google relies on internal links and external backlinks to find pages, which can take weeks or even months for new content. With a sitemap, you tell Google the page exists and when it was last updated, which speeds up discovery and helps prioritize fresh content. Our sitemap checker fetches your sitemap.xml, parses the structure, validates XML syntax, checks HTTP status codes for every listed URL, flags duplicates, and detects orphan pages (pages in your sitemap but unreachable via internal links). Use it after launching a new site, after a migration, and on a regular schedule (monthly is a good default) to catch regressions.

How do I check if my website has a sitemap?

Try three places. First, append /sitemap.xml to your domain (yourdomain.com/sitemap.xml) and see if it loads. Most CMSes generate a sitemap at this path automatically. If you see XML with a list of URLs, that is your sitemap. Second, check robots.txt at yourdomain.com/robots.txt for a line starting with Sitemap: followed by a URL. Many sites declare their sitemap location here. Third, log into Google Search Console, go to Sitemaps under Indexing, and see what sitemap URLs you submitted. This is the list of sitemaps Google has on record for your property. If you find a sitemap URL, paste it into our Sitemap URL field to validate structure, confirm all URLs return 200 status codes, and spot duplicates or orphans. If none of these methods finds a sitemap, you likely do not have one. That is fine for sites under 50 pages but a problem for larger ones. Generate one using your CMS plugin or framework package (Yoast, Rank Math, next-sitemap), then submit it to Search Console to speed up indexing.

What are the three types of sitemaps?

The three types are XML sitemaps (for search engines), HTML sitemaps (for users), and visual sitemaps (for designers). XML sitemaps are machine-readable files in XML format that list URLs, last modification dates, update frequency, and priority. Search engines use them to discover and prioritize pages. They usually live at /sitemap.xml and are not meant for human browsing. HTML sitemaps are human-readable pages with links to every major section of your site, organized hierarchically. They help users navigate large sites and provide internal links. They live at URLs like /sitemap and are often linked from the footer. Visual sitemaps are diagrams (in Figma, Miro, Sketch) that map out page hierarchy, user flows, and navigation before a site is built. They are planning artifacts, not live pages. Most sites should have an XML sitemap (not strictly required, but strongly recommended for SEO) and benefit from an HTML sitemap if over 100 pages. Visual sitemaps are for the design phase. Our checker validates XML sitemaps only. Most CMSes generate them automatically. For Next.js or Astro sites, use next-sitemap or @astrojs/sitemap.

How do I validate a sitemap?

Validating a sitemap means checking XML structure, URL accessibility, metadata accuracy, and protocol limits. First, confirm the XML is well-formed and declares the sitemap namespace (an xmlns attribute set to http://www.sitemaps.org/schemas/sitemap/0.9). Malformed XML causes parsers to reject the entire file. Second, verify every URL returns a 200 status code, not a 301, 404, or 5xx error. Search engines may still crawl redirecting or broken URLs, but they deprioritize them. Third, check that every loc is an absolute URL (https://example.com/page, not /page) and matches your canonical domain (www or non-www, not mixed). Fourth, confirm file size is under 50 MB uncompressed with no more than 50,000 URLs. If you exceed either, split into multiple sitemaps and use a sitemap index file. Fifth, validate that lastmod dates use W3C Datetime format, a profile of ISO 8601 (YYYY-MM-DD or YYYY-MM-DDThh:mm:ss+00:00). Our tool automates all five: paste your Sitemap URL, pick crawl depth (index only, all referenced sitemaps, sample, or full), and we return status codes, duplicates, missing lastmod warnings, and a CSV export of problems.
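The offline parts of that checklist (absolute URLs, protocol limits, lastmod format) can be sketched as follows. `validate_entries` is a hypothetical helper that expects a list of dicts with `loc` and optional `lastmod` keys, and the regular expression is one reading of the W3C Datetime profile, not an official pattern.

```python
import re
from urllib.parse import urlsplit

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB uncompressed

# W3C Datetime: YYYY, YYYY-MM, YYYY-MM-DD, or a date with time plus timezone.
W3C_DATETIME = re.compile(
    r"^\d{4}(-\d{2}(-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?$"
)


def validate_entries(entries, raw_size_bytes):
    """Return a list of (subject, problem) tuples; an empty list means no issues found."""
    problems = []
    if raw_size_bytes > MAX_BYTES:
        problems.append(("file", "larger than 50 MB uncompressed"))
    if len(entries) > MAX_URLS:
        problems.append(("file", "more than 50,000 URLs"))
    for entry in entries:
        loc = entry["loc"]
        parts = urlsplit(loc)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append((loc, "loc is not an absolute URL"))
        lastmod = entry.get("lastmod")
        if lastmod and not W3C_DATETIME.match(lastmod):
            problems.append((loc, "lastmod is not in W3C Datetime format"))
    return problems
```

The pattern checks shape only; it will accept an impossible date such as 2024-13-40, so pair it with real date parsing if you need stricter validation.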

Does Google have a sitemap generator?

Google used to offer a sitemap generator (Google Sitemap Generator) for Apache and IIS, but it was discontinued years ago. You no longer need it. Nearly every modern CMS and static framework generates sitemaps automatically. WordPress (via Yoast SEO, Rank Math, or core sitemap since 5.5), Shopify (built-in), Webflow (built-in), Squarespace (built-in), Wix (built-in), Next.js (via next-sitemap), Astro (via @astrojs/sitemap), Gatsby (via gatsby-plugin-sitemap), and Hugo (built-in) all create and update XML sitemaps without manual work. For custom sites, use open-source libraries like sitemap.js (Node.js), django.contrib.sitemaps (Python, ships with Django), or a build script. Once you have a sitemap, submit it to Google Search Console under Sitemaps. Google will crawl it periodically. Google retired its sitemap ping endpoint (google.com/ping?sitemap=yoursitemapurl) in 2023, so pinging no longer works. Keep lastmod accurate and declare the sitemap in robots.txt instead. Use our sitemap checker to validate the sitemap before submitting to Search Console.

How often should I update my sitemap?

Update your sitemap every time you publish, unpublish, or significantly edit a page. Most CMSes and static generators handle this automatically. WordPress plugins regenerate the sitemap on every post publish, Shopify updates it when products change, and static frameworks rebuild the sitemap during every deploy. If managing manually (rare on custom sites), regenerate it weekly or after content batches. The lastmod field tells search engines when a page changed, which helps them prioritize fresh content over stale. If you never update lastmod or set the same date for every URL, search engines ignore it and fall back to link discovery and crawl budget. For daily publishers (news, blogs, e-commerce with inventory changes), dynamic sitemaps that regenerate on publish are essential. For monthly or quarterly publishers, a static sitemap regenerated on deploy is fine. Do not let your sitemap list URLs that 404, redirect, or are blocked by robots.txt. That wastes crawl budget and signals poor site quality. Use our checker after major changes (migration, URL restructure, bulk content changes).

What is the difference between a sitemap and robots.txt?

Robots.txt tells crawlers which parts of your site they are allowed or disallowed from accessing. A sitemap tells crawlers which pages you want them to prioritize crawling. They serve different purposes and work together. Robots.txt lives at yourdomain.com/robots.txt, uses plaintext syntax with User-agent, Allow, and Disallow directives, and blocks or permits access to paths, files, or directories. It can also declare where your sitemap lives via a Sitemap: directive. Robots.txt is fetched first. If you accidentally disallow your entire site, compliant crawlers stop there and never crawl the URLs in your sitemap. A sitemap usually lives at yourdomain.com/sitemap.xml, uses XML syntax, and lists URLs you want crawled with metadata like lastmod and priority. It does not control access. It suggests what to crawl. Crawlers can ignore your sitemap if they find pages via links, but well-behaved crawlers such as Googlebot will not bypass robots.txt. Use robots.txt to block admin panels, staging environments, and unwanted crawlers. Use a sitemap to list every indexable page.

Can a sitemap improve my SEO?

A sitemap does not directly improve rankings, but it removes discovery friction, which indirectly helps SEO by ensuring new and updated pages get crawled faster. Without a sitemap, Google relies on internal links and external backlinks to find pages, which can take weeks for new content, especially on large sites or sites with weak internal linking. With a sitemap, you tell Google the page exists and when it was last updated, which speeds up indexing. This is important for new sites with few backlinks, sites with deep page hierarchies (pages buried five clicks from the homepage), sites with orphan pages, and sites that publish frequently (blogs, news, e-commerce). A sitemap helps with crawl budget efficiency. Instead of discovering pages via link crawling, Google reads your sitemap and knows what to prioritize. However, a sitemap cannot force Google to index low-quality, duplicate, or thin pages. If a page is in your sitemap but still not indexed, the issue is usually content quality, canonicalization, or robots meta tags.

What should not be in a sitemap?

A sitemap should only list URLs you want indexed, so exclude anything blocked by robots.txt, tagged with noindex, redirecting to another URL, returning 404 or 5xx errors, or canonicalized to a different URL. Including these wastes crawl budget and signals poor site maintenance. Do not include admin pages, login pages, checkout pages, or user account dashboards (usually blocked by robots.txt or noindex). Do not include parameter URLs (like ?sort=price or ?page=2) if you use canonical tags to consolidate them. Do not include staging or test environments. Do not include URLs that redirect (301 or 302). List the final destination instead. Do not include URLs with noindex meta tags or X-Robots-Tag headers. Do not include paginated URLs unless each page has unique content worth indexing. Most e-commerce sites should only include page 1 and let internal pagination links expose the rest (Google has said it no longer uses rel=next/prev as an indexing signal). Do not include duplicate content URLs. Use canonical tags to consolidate duplicates, then only list the canonical version.

How do I fix sitemap errors?

Sitemap errors fall into three buckets: structural (malformed XML, wrong namespace, file too large), URL-level (404s, redirects, noindex pages), and metadata issues (missing lastmod, incorrect date formats). Fix structural errors first. They prevent crawlers from parsing the file. Open your sitemap in a browser or XML validator, confirm it starts with the correct XML declaration and namespace (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"), and check that every opening tag has a matching closing tag. If your sitemap exceeds 50 MB or 50,000 URLs, split it into multiple files and create a sitemap index file. Fix URL-level errors by removing or replacing broken entries. Delete any URL that returns 404, replace redirecting URLs with their final destination, remove URLs with noindex tags, and confirm every URL matches your canonical domain (www or non-www, not mixed). Fix metadata issues by ensuring lastmod dates use ISO 8601 format (YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS+00:00). Remove lastmod entirely if your CMS cannot keep it accurate. After fixing, revalidate with our tool and resubmit to Google Search Console.
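Splitting an oversized sitemap and generating the index file can be sketched like this. `chunk_urls` and `build_index` are hypothetical helpers, and the child sitemap URLs in the usage example are placeholders. Writing the child files themselves is left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk_urls(urls, size=50_000):
    """Split a URL list into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(child_sitemap_urls):
    """Return sitemap index XML that references each child sitemap."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_XMLNS)
    for child_url in child_sitemap_urls:
        sitemap = ET.SubElement(root, "sitemap")
        ET.SubElement(sitemap, "loc").text = child_url
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    urls = [f"https://www.example.com/p/{i}" for i in range(120_000)]
    chunks = chunk_urls(urls)
    children = [f"https://www.example.com/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
    print(build_index(children))
```

Submit the index file to Search Console; the child sitemaps are discovered through it.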
