🗺️ Sitemap Historian

Pull every historical sitemap.xml from the Wayback Machine, diff revisions, and discover which URLs appeared and vanished over time


How Does Sitemap Historian Track Content Lifecycle?

XML sitemaps list the pages a website wants search engines to discover. When an organization removes a URL from its sitemap, crawlers are less likely to find the page, but the content often remains on the server. Sitemap Historian uses the Wayback Machine CDX API (the Internet Archive's CDX Server) to locate every archived version of a site's sitemap.xml, parses the URLs from each revision, and builds a timeline of when each URL appeared and disappeared. The timeline is only as complete as the archive's coverage of that site.
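As a rough illustration of the lookup step, the sketch below builds a CDX query for a site's sitemap.xml and turns the JSON response into records. The function names (`build_cdx_query`, `parse_cdx_rows`, `snapshot_url`, `list_snapshots`) and the choice of fields are assumptions made for this example, not a description of how Sitemap Historian itself is implemented.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str) -> str:
    """Build a CDX query listing archived captures of <domain>/sitemap.xml."""
    params = {
        "url": f"{domain}/sitemap.xml",
        "output": "json",                    # first row of the response is a header
        "fl": "timestamp,original,digest",   # fields to return per capture
        "filter": "statuscode:200",          # skip redirects and error captures
        "collapse": "digest",                # merge adjacent captures with identical content
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(payload: str) -> list[dict]:
    """Convert the CDX JSON payload (header row + data rows) into dicts."""
    rows = json.loads(payload) if payload.strip() else []
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(row: dict) -> str:
    """URL of the archived capture; the id_ suffix requests the original bytes."""
    return f"https://web.archive.org/web/{row['timestamp']}id_/{row['original']}"


def list_snapshots(domain: str) -> list[dict]:
    """Network call: fetch and parse the capture list for one domain."""
    with urlopen(build_cdx_query(domain), timeout=30) as response:
        return parse_cdx_rows(response.read().decode("utf-8"))
```

Calling `list_snapshots("example.com")` would return one record per distinct capture, and `snapshot_url` gives the address from which each archived sitemap can be downloaded.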

What Can Vanished Sitemap URLs Reveal?

Removed URLs frequently expose discontinued products, deleted blog posts, deprecated API endpoints, reorganized site sections, or content taken down for legal reasons. According to OWASP Testing Guide v4.2 (Section 4.1.1: Search Engine Discovery), reviewing sitemap history is a recommended reconnaissance technique because organizations often forget to remove server-side content after delisting it from sitemaps. The SANS Institute Reading Room notes that sitemap analysis is particularly effective for identifying shadow infrastructure that bypasses standard security controls.

What Sitemap Formats Are Supported?

The tool parses standard XML sitemaps as defined in the Sitemaps Protocol 0.9 (sitemaps.org), including sitemap index files that reference multiple child sitemaps. Both <url><loc> entries and <sitemap><loc> index entries are extracted. The protocol also defines optional <lastmod>, <changefreq>, and <priority> metadata, all of which are captured when available; Google Search Central documentation notes that Google ignores <changefreq> and <priority>. RFC 9309 (Robots Exclusion Protocol) complements sitemap analysis because robots.txt files often list sitemap locations in Sitemap: lines, a convention that comes from the Sitemaps protocol rather than from RFC 9309 itself.
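A minimal sketch of the parsing step, using only the Python standard library, might look like the following. The `parse_sitemap` function and its return shape are hypothetical and chosen for illustration; the namespace URI is the one defined by the Sitemaps 0.9 protocol.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the Sitemaps 0.9 protocol, in ElementTree's {uri}tag form.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> dict:
    """Extract <loc> and optional metadata from a urlset or a sitemap index."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{SITEMAP_NS}sitemapindex" else "urlset"
    entry_tag = "sitemap" if kind == "index" else "url"

    entries = []
    for node in root.iter(f"{SITEMAP_NS}{entry_tag}"):
        loc = node.findtext(f"{SITEMAP_NS}loc")
        if not loc:
            continue  # skip malformed entries that have no location
        entry = {"loc": loc.strip()}
        # Optional metadata is kept only when the element is present and non-empty.
        for field in ("lastmod", "changefreq", "priority"):
            value = node.findtext(f"{SITEMAP_NS}{field}")
            if value:
                entry[field] = value.strip()
        entries.append(entry)
    return {"kind": kind, "entries": entries}
```

When `kind` is `"index"`, each entry's `loc` points at a child sitemap that would need to be fetched and parsed in turn. Archived XML is untrusted input, so a hardened parser such as the third-party `defusedxml` package may be a safer choice than the standard library in practice.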

XML Sitemap
An XML file listing URLs a website wants search engines to index, with optional metadata about modification dates and crawl priority. Defined by the Sitemaps 0.9 protocol at sitemaps.org.
Sitemap Index
A parent XML file that references multiple child sitemaps, used by large sites to split URL lists that exceed the protocol's limit of 50,000 URLs per sitemap file.
Content Lifecycle
The progression of a URL through publication, modification, and eventual removal from a sitemap. Removal from the sitemap may or may not correspond to actual server-side deletion.
Vanished URL
A URL that appeared in a previous sitemap revision but is absent from the most recent version, indicating the organization chose to delist it from search engine discovery.
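The definitions above can be expressed as a small data structure. The sketch below assumes each revision has already been reduced to a timestamp and a set of URLs; `build_lifecycle` and the `first_seen`, `last_seen`, and `status` keys are names invented for this example.

```python
def build_lifecycle(revisions):
    """Summarize each URL's lifecycle.

    revisions: list of (timestamp, set_of_urls) tuples sorted oldest first.
    Returns {url: {"first_seen", "last_seen", "status"}}.
    """
    timeline = {}
    for timestamp, urls in revisions:
        for url in urls:
            # The first sighting creates the record; later sightings extend it.
            record = timeline.setdefault(
                url, {"first_seen": timestamp, "last_seen": timestamp}
            )
            record["last_seen"] = timestamp

    # A URL is "vanished" when it is absent from the most recent revision.
    latest_urls = revisions[-1][1] if revisions else set()
    for url, record in timeline.items():
        record["status"] = "current" if url in latest_urls else "vanished"
    return timeline
```

Under this sketch, a URL that was removed and later restored counts as current, because only the most recent revision decides its status.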

🗺️ Sitemap Historian — Frequently Asked Questions

Why do URLs disappear from XML sitemaps?

URLs are removed from sitemaps when content is deleted, pages are merged or redirected, products are discontinued, or sections are moved behind authentication. Tracking these removals reveals content the organization chose to unpublish.

How does Sitemap Historian detect changes?

The tool queries the Wayback Machine CDX API to find every archived snapshot of sitemap.xml. Each version is fetched, parsed, and content-hashed for deduplication. Consecutive versions are diffed to identify URLs added or removed per revision.
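The deduplication and diffing steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual code: SHA-256 is an arbitrary choice of hash, only consecutive duplicates are dropped so that a sitemap reverting to earlier content still shows up as a change, and the function names are hypothetical.

```python
import hashlib


def dedupe_revisions(snapshots):
    """Drop snapshots whose bytes match the previously kept snapshot.

    snapshots: iterable of (timestamp, raw_bytes) tuples, oldest first.
    """
    kept, previous_hash = [], None
    for timestamp, body in snapshots:
        digest = hashlib.sha256(body).hexdigest()
        if digest != previous_hash:
            kept.append((timestamp, body))
            previous_hash = digest
    return kept


def diff_revisions(revisions):
    """Compute URLs added and removed between consecutive revisions.

    revisions: list of (timestamp, set_of_urls) tuples, oldest first.
    The first revision is compared against an empty set, so every URL in it
    is reported as added.
    """
    changes = []
    previous = set()
    for timestamp, urls in revisions:
        changes.append({
            "timestamp": timestamp,
            "added": sorted(urls - previous),
            "removed": sorted(previous - urls),
        })
        previous = urls
    return changes
```

Set difference keeps each comparison simple: `urls - previous` yields the additions and `previous - urls` the removals for that revision.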

Can vanished sitemap URLs still be accessed?

Often yes. Removing a URL from the sitemap only affects search engine discovery. Sitemap Historian probes vanished URLs to check if they still return HTTP 200, revealing content delisted from search but never actually removed.
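One way such a probe might work is shown below, as a sketch using only the standard library. The `probe` and `classify_vanished` names, the User-Agent string, and the use of HEAD requests are assumptions made for this example; some servers reject HEAD, in which case a GET fallback would be needed.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def probe(url, timeout=10.0, opener=urlopen):
    """Return the HTTP status for url, or None if the request fails outright.

    opener is injectable so the function can be exercised without network access.
    """
    request = Request(
        url,
        method="HEAD",  # headers only; avoids downloading the page body
        headers={"User-Agent": "sitemap-history-sketch/0.1"},  # hypothetical UA
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions
    except (URLError, TimeoutError, OSError):
        return None      # DNS failure, refused connection, timeout


def classify_vanished(urls, probe_fn=probe):
    """Map each vanished URL to True when it still answers with HTTP 200."""
    return {url: probe_fn(url) == 200 for url in urls}
```

Note that `urlopen` follows redirects by default, so a URL that redirects to a live page is reported as 200 here; distinguishing redirects from direct hits would need a custom opener.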