What a sitemap checker actually does
A sitemap checker fetches your sitemap.xml file, parses every <url> entry, extracts the <loc>, <lastmod>, <changefreq>, and <priority> tags, then makes an HTTP HEAD request to each URL to confirm it returns 200. It flags redirects, 404s, and server errors, checks for duplicate URLs, and compares your sitemap structure against the XML sitemap spec.
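The parsing step can be sketched in a few lines of Python. This is an illustration rather than the checker's actual code: the function name `parse_urlset` and the `SAMPLE` document are made up for the example, and it assumes the sitemap uses the standard sitemaps.org 0.9 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
TAGS = ("loc", "lastmod", "changefreq", "priority")

def parse_urlset(xml_text):
    """Return one dict per <url> entry; missing or empty tags become None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for tag in TAGS:
            text = url.findtext(f"sm:{tag}", default=None, namespaces=NS)
            entry[tag] = (text or "").strip() or None
        entries.append(entry)
    return entries

# Hypothetical two-entry sitemap used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://www.example.com/about</loc></url>
</urlset>"""

entries = parse_urlset(SAMPLE)
```

Only `<loc>` is required by the protocol, so a real parser has to tolerate entries where the other three tags are absent, as the second entry shows.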
If your sitemap is a sitemap index (a file that lists other sitemap files instead of individual URLs), we follow every reference, fetch each child sitemap, and aggregate the results. A single check covers your entire sitemap tree, up to 200 URLs in full mode or 50 URLs in sample mode.
Three categories of problem show up in most sitemap audits: broken URLs that return 404 or 500; duplicate URLs listed more than once, which waste crawl budget; and orphan URLs that appear in the sitemap but have zero internal links, meaning a user cannot reach them by clicking through your site. Our checker flags all three in one pass.
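One way to walk a sitemap tree is sketched below. It is an assumption-laden illustration, not the checker's implementation: `collect_urls` and the `fetch` callable (sitemap URL in, XML text out, raising on failure) are hypothetical names, and the policy of recording a child that fails to load and moving on is just one reasonable choice.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"  # sitemaps.org 0.9 namespace

def collect_urls(start_url, fetch, limit=200):
    """Walk a sitemap tree breadth-first.

    fetch is a hypothetical callable: sitemap URL -> XML text (raises on failure).
    Returns (page_urls, broken_sitemaps).
    """
    urls, broken, seen = [], [], set()
    queue = [start_url]
    while queue and len(urls) < limit:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        try:
            root = ET.fromstring(fetch(current))
        except Exception:
            broken.append(current)  # record the bad reference, skip the file
            continue
        if root.tag == SM + "sitemapindex":
            # An index lists child sitemaps; queue each one.
            for loc in root.iterfind(f"{SM}sitemap/{SM}loc"):
                if loc.text:
                    queue.append(loc.text.strip())
        else:
            # A urlset lists page URLs; collect them up to the limit.
            for loc in root.iterfind(f"{SM}url/{SM}loc"):
                if loc.text and len(urls) < limit:
                    urls.append(loc.text.strip())
    return urls, broken

# In-memory stand-in for the network: one index, one child present, one missing.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DOCS = {
    "https://www.example.com/sitemap_index.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://www.example.com/posts.xml</loc></sitemap>"
        "<sitemap><loc>https://www.example.com/missing.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://www.example.com/posts.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://www.example.com/a</loc></url>"
        "<url><loc>https://www.example.com/b</loc></url>"
        "</urlset>"
    ),
}

urls, broken = collect_urls("https://www.example.com/sitemap_index.xml", DOCS.__getitem__)
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; in real use it would wrap an HTTP client.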
How to use this sitemap checker
- Paste your sitemap URL into Sitemap URL. Usually https://www.yourdomain.com/sitemap.xml or https://www.yourdomain.com/sitemap_index.xml.
- Pick a Crawl depth from the dropdown. "Index only" validates the XML structure without fetching URLs. "All referenced sitemaps" follows every sitemap listed in an index. "Sample 50 URLs" checks status codes for 50 random URLs. "Full - up to 200 URLs" checks every URL we find, up to the limit.
- Hit Check sitemap. You get a summary table with total URLs, status code breakdown, duplicates count, average lastmod age, and any XML schema errors.
- Expand Problem URLs to see a row-by-row list of 404s, 301s, duplicates, and orphans. Each row shows the URL, status, lastmod date, and recommended fix.
- Click Download CSV to export the full report. Use it to batch-fix issues in your CMS or pass it to a developer.
Try checking a sitemap with more than one file. If your sitemap index lists five sub-sitemaps and one returns 404, we report the broken reference and skip that file. The other four are still checked. If you have a flat sitemap with 10,000 URLs, pick Sample 50 first to spot-check before running the full crawl.
Why status codes matter more than XML validity
A sitemap can be perfectly valid XML and still hurt your SEO. If 30 URLs return 404, Google wastes crawl budget fetching pages that do not exist. If 50 URLs are 301 redirects, Google has to follow the redirect, which doubles the request count and slows indexing. If URLs return 500 errors, Google might drop them from the index entirely.
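To make the status-code breakdown concrete, here is a rough sketch using only the Python standard library. The names `head_status`, `classify`, and `summarize` are invented for the example. Because `urllib` follows redirects by default, the sketch disables that so a 301 is reported as a 301 instead of the final destination's 200.

```python
import urllib.error
import urllib.request
from collections import Counter

class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so 3xx codes stay visible."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError

def head_status(url, timeout=10):
    """Send one HEAD request and return the raw status code."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx and 5xx all arrive here

def classify(status):
    """Bucket a status code into the categories discussed above."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "client_error"
    if 500 <= status < 600:
        return "server_error"
    return "other"

def summarize(statuses):
    """Count how many URLs fall into each bucket."""
    return Counter(classify(s) for s in statuses)

# Example with already-collected codes (no network needed).
summary = summarize([200, 200, 301, 404, 500])
```

Some servers answer HEAD differently from GET, so a production checker may need to fall back to GET when HEAD returns 405.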
Three practical consequences.
Crawl budget. Google allocates a daily crawl budget to each site based on server speed, site authority, and crawl demand. Every 404 or redirect in your sitemap subtracts from that budget without indexing new content. Cleaning the sitemap before submitting it to Search Console makes every crawl count.
Index coverage. URLs with 4xx or 5xx status codes may be excluded from the index after repeated failures. If those pages are important (product pages, blog posts with backlinks, landing pages for paid campaigns), you lose traffic. A sitemap check catches this before the damage compounds.
Lastmod accuracy. The <lastmod> tag tells Google when a page was last updated. If every page has the same lastmod from three years ago, Google learns your sitemap is stale and may crawl less often. If lastmod is always "yesterday" even when content has not changed, Google learns to ignore it. Our checker reports average lastmod age and flags suspicious patterns.
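A lastmod report along these lines could look like the following sketch. The function names and the single "every URL shares one date" rule are assumptions chosen for illustration; the checker's real heuristics are not documented here.

```python
from collections import Counter
from datetime import date, datetime

def parse_lastmod(value):
    """Parse a date or full timestamp into a date; return None if unusable.

    Reduced-precision values the protocol allows (such as a bare year)
    may be rejected here, depending on the Python version.
    """
    if not value:
        return None
    try:
        return datetime.fromisoformat(value.strip().replace("Z", "+00:00")).date()
    except ValueError:
        return None

def lastmod_report(lastmods, today):
    """Average age in days plus a flag when every entry has the same date."""
    dates = [d for d in map(parse_lastmod, lastmods) if d is not None]
    if not dates:
        return {"average_age_days": None, "flags": ["no usable lastmod values"]}
    ages = [(today - d).days for d in dates]
    flags = []
    common_date, count = Counter(dates).most_common(1)[0]
    if len(dates) > 1 and count == len(dates):
        flags.append(f"every URL shares lastmod {common_date.isoformat()}")
    return {"average_age_days": sum(ages) / len(ages), "flags": flags}

# Two identical dates trigger the flag; mixed dates do not.
stale = lastmod_report(["2024-01-01", "2024-01-01"], today=date(2024, 1, 11))
mixed = lastmod_report(["2024-01-01", "2024-01-11T00:00:00Z", "bad"], today=date(2024, 1, 11))
```

Taking `today` as a parameter rather than calling `date.today()` inside the function keeps the report reproducible.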
Duplicate URLs and canonical mismatches
A duplicate URL in a sitemap usually means the same loc appears twice, often with a trailing-slash difference or a protocol mismatch. /page and /page/ are different URLs to a parser, even if your server treats them as identical. http://example.com/page and https://example.com/page are different. Our checker normalizes these patterns and flags them as probable duplicates.
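The normalization idea can be sketched as follows. The rules shown (ignore the scheme, lowercase the host, strip a trailing slash) are assumptions covering only the two patterns named above; a real checker may normalize more or less aggressively.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def normalize(url):
    """Build a comparison key that ignores scheme and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    return (host, path, parts.query)

def probable_duplicates(urls):
    """Group URLs that share a normalized key; return groups of two or more."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return [group for group in groups.values() if len(group) > 1]

duplicates = probable_duplicates([
    "http://example.com/page",
    "https://example.com/page/",
    "https://example.com/other",
])
```

The groups are labelled "probable" because normalization only shows the URLs look alike; confirming they serve the same content still requires fetching them.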
If your sitemap lists /page but that URL redirects to /page/, the redirect wastes a request. It is better to list the final destination in the sitemap and fix the redirect at the server level. We show the redirect chain and recommend listing the 200-status version.
Canonical mismatches are a related issue. If your sitemap includes /page-a but that page has a <link rel="canonical" href="/page-b"> tag, Google sees a conflict. The sitemap says "index page-a" but the page says "I am a duplicate of page-b." Google may choose to ignore the sitemap entry. Run a canonical checker on flagged URLs to confirm the canonical matches the sitemap loc.
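A canonical comparison might be sketched like this with the standard-library HTML parser. `CanonicalFinder` and `canonical_mismatch` are hypothetical names, and the sketch assumes the page HTML has already been downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.canonical = attributes["href"]

def canonical_mismatch(page_url, html):
    """Return the canonical target if it differs from page_url, else None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return None  # no canonical tag, so nothing to compare
    target = urljoin(page_url, finder.canonical)  # resolve relative hrefs
    return target if target != page_url else None

conflict = canonical_mismatch(
    "https://example.com/page-a",
    '<html><head><link rel="canonical" href="/page-b"></head></html>',
)
```

Resolving the href with `urljoin` matters because canonical tags are often written as relative paths, as in the `/page-b` example above.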
Orphan pages and crawlability
An orphan page is in your sitemap but has no internal links pointing to it. A bot can find it via the sitemap, but a human cannot reach it by navigating your site. This is common after content migrations, when old URLs linger in the sitemap but the nav menu was updated.
Orphans are not always bad. A landing page for a paid ad campaign might be orphaned on purpose to control access. But orphaned blog posts or product pages signal a site-structure problem. If the page should be accessible, add internal links. If it should not exist, remove it from the sitemap and 301 it to a live page.
Our checker detects probable orphans by comparing sitemap URLs to your internal link graph. If a URL appears in the sitemap but has zero inbound links from pages we crawled, we flag it. This heuristic catches most orphans without requiring a full-site crawl.
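Reduced to its core, that comparison is a set difference. In the sketch below, `links_by_page` is an assumed input format (each crawled page mapped to the internal URLs it links to), and ignoring self-links is a choice made for the example.

```python
def probable_orphans(sitemap_urls, links_by_page):
    """Return sitemap URLs that no other crawled page links to.

    links_by_page maps a crawled page URL to the internal URLs found on it
    (a hypothetical structure for this example).
    """
    linked = set()
    for source, targets in links_by_page.items():
        # A page linking to itself does not make it reachable from elsewhere.
        linked.update(target for target in targets if target != source)
    return [url for url in sitemap_urls if url not in linked]

sitemap_urls = [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/old-promo",
]
links_by_page = {
    "https://example.com/": ["https://example.com/about"],
    "https://example.com/about": ["https://example.com/", "https://example.com/about"],
    "https://example.com/old-promo": ["https://example.com/old-promo"],
}

orphans = probable_orphans(sitemap_urls, links_by_page)
```

The result is only as complete as the link graph: a page linked solely from an uncrawled page will be reported as an orphan, which is why the flag is "probable".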
Common mistakes
- Submitting a sitemap index to a tool that expects flat sitemaps. Most validators choke on indexes or test only the index file itself. Ours follows every reference, so you get results for the entire tree.
- Listing non-canonical URLs. Every URL in your sitemap should be the canonical version. Do not list the www version if the canonical is non-www. Do not list http if the canonical is https. Run a canonical checker first if unsure.
- Including URLs blocked by robots.txt. If a URL is in your sitemap but disallowed in robots.txt, Google cannot crawl it. This creates a Search Console warning. Check robots.txt with our robots.txt checker before deploying a new sitemap.
- Setting lastmod to the date the sitemap was generated, not the date the content changed. If your CMS regenerates the sitemap daily and stamps every URL with today's date, Google stops trusting lastmod. Populate lastmod from the post's actual updated-at timestamp.
- Forgetting to re-check after a migration. Old URLs often remain in a sitemap after moving to a new platform. If half your sitemap returns 404, Search Console will show the drop in coverage. Audit the sitemap immediately post-migration.
- Not checking child sitemaps individually. If your sitemap index has one broken child, you might not notice until crawl errors spike. Test each child sitemap URL in isolation to confirm it returns 200 and parses correctly.
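The robots.txt mistake in the list above is easy to test for with Python's built-in `urllib.robotparser`. The sketch assumes the robots.txt body is already in hand; `blocked_by_robots` is a made-up helper name. The standard-library parser does not replicate every detail of Google's matching rules, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

def blocked_by_robots(robots_txt, urls, user_agent="Googlebot"):
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

# Hypothetical robots.txt and sitemap URLs for illustration.
ROBOTS = """User-agent: *
Disallow: /private/
"""

blocked = blocked_by_robots(ROBOTS, [
    "https://example.com/private/report",
    "https://example.com/blog/post",
])
```

Any URL this returns should either be removed from the sitemap or unblocked in robots.txt, depending on whether the page is meant to be indexed.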
Advanced tips
- Run a sample check first on large sitemaps. If the sample reveals a pattern (every URL is a 301, or lastmod is missing), fix it before crawling all 10,000 URLs. The sample gives you signal in 10 seconds instead of 5 minutes.
- Compare lastmod dates to your CMS publish dates. If a post was updated last week but lastmod is six months old, your sitemap generation script is broken.
- Check your sitemap monthly, not once. Content goes stale, redirects are added, URLs are unpublished. A monthly check catches decay before Google does.
- If you see a spike in 404s, export the CSV and cross-reference it with your server logs. Sometimes a URL is 404 in the sitemap but still gets traffic from backlinks, meaning it should be 301'd instead of removed.
- Test the same sitemap from two different user-agents (desktop Chrome and Googlebot). If the status codes differ, your server is returning different responses to bots. That can be bot protection or rate limiting, but if it amounts to cloaking it violates Google's guidelines.
- If duplicates are found, check their canonical tags. Regional or language variants can stay in the sitemap as long as each one is the canonical version of itself. A duplicate whose canonical points elsewhere is not a variant, so remove it.
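The sample-first tip can be reproduced outside the tool with a few lines of Python. This is a sketch with assumed names (`sample_urls`, the optional `seed`), not a description of how the Sample 50 mode picks its URLs.

```python
import random

def sample_urls(urls, k=50, seed=None):
    """Pick up to k distinct URLs at random; pass a seed for repeatable runs."""
    unique = list(dict.fromkeys(urls))  # drop exact duplicates, keep order
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)

# Hypothetical large sitemap: 10,000 generated URLs.
all_urls = [f"https://example.com/post/{n}" for n in range(10_000)]
picked = sample_urls(all_urls, k=50, seed=1)
```

Fixing the seed makes it possible to re-check the same 50 URLs after a fix, so a before-and-after comparison is like for like.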
After fixing sitemap issues, validate that your robots.txt file correctly declares the sitemap location with a Sitemap: line. Use the robots.txt checker to confirm. Then simulate how Googlebot sees one of your pages with the Google crawler simulator to confirm the URL loads, JavaScript executes, and content is visible. If you are checking metadata alongside sitemaps, the website metadata checker renders your title, meta, and OG tags as they appear in SERPs.
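For a quick local check of the Sitemap: line, Python's `urllib.robotparser` exposes the declared sitemaps through `site_maps()` (available from Python 3.8). The helper name `declared_sitemaps` and the sample robots.txt are assumptions for the example.

```python
from urllib.robotparser import RobotFileParser

def declared_sitemaps(robots_txt):
    """Return every URL declared on a Sitemap: line, or an empty list."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when absent

# Hypothetical robots.txt for illustration.
ROBOTS_WITH_SITEMAP = """User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""

declared = declared_sitemaps(ROBOTS_WITH_SITEMAP)
```

An empty result means crawlers that rely on robots.txt for sitemap discovery will not find the sitemap there.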