How does RSS Feed Archaeology work?
RSS (Really Simple Syndication) and Atom feeds are XML files that websites publish at predictable URLs such as /feed, /rss, or /atom.xml. A feed carries structured metadata for each article it lists: title, URL, publication date, author, and description. Most feeds list only their most recent entries, so any single snapshot covers a limited window of the site's output. The Wayback Machine archives these feeds over time, and its snapshots may contain articles the publisher later deleted, retracted, or moved.
This tool queries the Wayback Machine CDX API across 20+ common feed paths, fetches every unique archived snapshot (deduplicated by content digest), parses the XML, and merges all discovered articles into a single deduplicated timeline. The result is a publishing history that reaches as far back as the archived snapshots do, in some cases decades, including content no longer available on the live site.
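The snapshot-discovery step can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names `build_cdx_url` and `unique_snapshots` are hypothetical, and the sketch assumes the public CDX endpoint with its `output=json`, `fl`, `filter`, and `collapse` parameters. Because `collapse=digest` only merges adjacent index rows, the sketch also deduplicates digests locally.

```python
import json
from urllib.parse import urlencode

# Public CDX endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(feed_url: str) -> str:
    """Build a CDX query for all successful captures of one feed URL."""
    params = {
        "url": feed_url,
        "output": "json",
        "fl": "timestamp,original,digest,statuscode,mimetype",
        "filter": "statuscode:200",
        "collapse": "digest",  # merges adjacent rows with the same digest
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def unique_snapshots(cdx_json: str) -> list[dict]:
    """Turn a CDX JSON response into one record per distinct digest."""
    rows = json.loads(cdx_json)
    if not rows:
        return []
    header, *records = rows  # with output=json the first row names the fields
    seen = set()
    snapshots = []
    for record in records:
        row = dict(zip(header, record))
        if row["digest"] in seen:
            continue  # same content was already captured earlier
        seen.add(row["digest"])
        # The id_ modifier asks for the archived bytes without the
        # Wayback Machine's page rewriting.
        row["raw_url"] = (
            f"https://web.archive.org/web/{row['timestamp']}id_/{row['original']}"
        )
        snapshots.append(row)
    return snapshots
```

Fetching each `raw_url` (for example with `urllib.request.urlopen`) would then yield the archived feed XML, one download per distinct version of the feed.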
Why is this useful for OSINT?
Deleted articles are intelligence. When a company retracts a press release or removes a blog post, or a journalist deletes an article, an archived feed snapshot can preserve evidence that it existed. Author attributions in feeds can name writers who are otherwise anonymous. Publication dates establish timelines. And because the format is structured, no scraping or HTML parsing is needed: the data arrives pre-organized.
Key Terminology
- RSS: Really Simple Syndication, an XML format for publishing frequently updated content. RSS 2.0 uses <item> elements with <title>, <link>, <pubDate>, <author>, and <description>.
- Atom: An alternative XML syndication format (RFC 4287). Uses <entry> elements with <title>, <link href="...">, <published>, <author><name>, and <summary>/<content>.
- CDX API: The Wayback Machine's index API, returning timestamp, digest, status code, and MIME type for every archived capture of a URL. Used here to find all unique feed snapshots.
- Content Digest: A hash of the archived file content. Collapsing by digest means we only fetch feeds that actually changed, skipping duplicate snapshots.
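To make the RSS and Atom element names above concrete, here is a minimal sketch of extracting the same five fields from either format with Python's standard `xml.etree.ElementTree`. The function name `parse_feed` and the output field names are illustrative assumptions, and real-world feeds often need more tolerance (other namespaces, malformed XML) than this shows.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace (RFC 4287).
ATOM = "{http://www.w3.org/2005/Atom}"


def _text(parent, path):
    """Return the stripped text of the first matching child, or None."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else None


def parse_feed(xml_text: str) -> list[dict]:
    """Extract title, URL, date, author, and summary from RSS 2.0 or Atom."""
    root = ET.fromstring(xml_text)
    articles = []

    # RSS 2.0: <item> elements without a namespace.
    for item in root.iter("item"):
        articles.append({
            "title": _text(item, "title"),
            "url": _text(item, "link"),
            "date": _text(item, "pubDate"),
            "author": _text(item, "author"),
            "summary": _text(item, "description"),
        })

    # Atom: namespaced <entry> elements; the URL is an attribute on <link>.
    for entry in root.iter(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        articles.append({
            "title": _text(entry, f"{ATOM}title"),
            "url": link.get("href") if link is not None else None,
            "date": _text(entry, f"{ATOM}published"),
            "author": _text(entry, f"{ATOM}author/{ATOM}name"),
            "summary": _text(entry, f"{ATOM}summary")
            or _text(entry, f"{ATOM}content"),
        })
    return articles
```

Returning the same dictionary shape for both formats keeps the later merge step independent of which syndication format a site used at any point in its history.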
📡 RSS Feed Archaeology — Frequently Asked Questions
How does RSS Feed Archaeology recover deleted articles?
RSS and Atom feeds live at predictable URLs (/feed, /rss, /atom.xml). The Wayback Machine archives these XML files, which contain structured article titles, URLs, dates, authors, and descriptions. By fetching every archived feed snapshot and parsing the XML, the tool reconstructs the publishing history as far as the archive recorded it, including articles that were later deleted from the live site.
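The reconstruction step might look something like the sketch below. It assumes each snapshot has already been parsed into a list of article dictionaries with a `url` key; the names `merge_articles` and `missing_from_live`, and the `first_seen`/`last_seen` fields, are hypothetical. Articles are keyed by URL, which is one plausible deduplication key rather than necessarily the one the tool uses.

```python
def merge_articles(snapshots):
    """Merge (timestamp, articles) pairs into one timeline keyed by URL.

    Wayback timestamps are 14-digit strings (YYYYMMDDhhmmss), so sorting
    them as strings also sorts them chronologically.
    """
    timeline = {}
    for timestamp, articles in sorted(snapshots, key=lambda pair: pair[0]):
        for article in articles:
            url = article.get("url")
            if not url:
                continue  # nothing to key on; skip entries without a URL
            # Keep the earliest version of each article and note when it
            # first appeared in an archived feed.
            record = timeline.setdefault(url, {**article, "first_seen": timestamp})
            record["last_seen"] = timestamp
    return sorted(timeline.values(), key=lambda a: a["first_seen"])


def missing_from_live(timeline, live_urls):
    """List archived articles whose URL is absent from the live site."""
    return [a for a in timeline if a["url"] not in live_urls]
```

An article appearing in `missing_from_live` is only a candidate for deletion: it may also have been moved to a new URL or simply aged out of the current feed, so each hit still needs manual verification.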
What feed paths does this tool check?
The tool checks more than 20 common RSS/Atom paths, including /feed, /rss, /atom.xml, /feed.xml, /rss.xml, /index.xml, /blog/feed, /news/feed, /feed/atom, WordPress feeds (?feed=rss2), category feeds, and comment feeds.
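As a rough illustration of how such a path list turns into lookups, the sketch below expands a domain into candidate feed URLs. It includes only the paths named above, not the tool's full list; the name `candidate_feed_urls` is hypothetical, and writing the WordPress query as `/?feed=rss2` is an assumption about the exact form.

```python
# Paths named in the answer above; the tool's full list is longer.
FEED_PATHS = [
    "/feed", "/rss", "/atom.xml", "/feed.xml", "/rss.xml", "/index.xml",
    "/blog/feed", "/news/feed", "/feed/atom",
    "/?feed=rss2",  # WordPress query-style feed (assumed form)
]


def candidate_feed_urls(domain: str) -> list[str]:
    """Expand a domain into the feed URLs to look up in the archive."""
    # Drop the scheme and trailing slash so every path joins cleanly.
    host = domain.strip().removeprefix("https://").removeprefix("http://").rstrip("/")
    return [f"{host}{path}" for path in FEED_PATHS]
```

Each candidate would then be queried against the CDX index separately; paths that were never archived simply return no captures.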