🤖 Robots.txt Historian

Pull every historical robots.txt from the Wayback Machine, diff them side-by-side, and discover directories organizations stopped hiding. Removed Disallow paths are the most interesting — they can reveal infrastructure an organization once deliberately hid.


Why are removed Disallow paths in robots.txt the most valuable OSINT signal?

A robots.txt file tells search engine crawlers which paths to avoid. When an organization removes a Disallow directive, a path it once deliberately hid is no longer flagged to crawlers — and may still exist on the server. According to the OWASP Web Security Testing Guide v4.2, reviewing robots.txt history is a key step in information gathering because it reveals paths organizations intentionally hid from crawlers — admin panels, staging servers, backup directories, API endpoints, and internal tools.

How does historical robots.txt diffing work?

This tool queries the Wayback Machine CDX API to discover every archived snapshot of a domain's robots.txt. Each unique version is fetched, content-hashed for deduplication, then diffed line-by-line against its predecessor. The diff highlights added lines (new restrictions) and removed lines (newly exposed paths). The Internet Archive indexes hundreds of billions of captured web pages, making it the most comprehensive source for historical robots.txt analysis.
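The pipeline above can be sketched with Python's standard library alone. The CDX endpoint URL is the real one documented by the Internet Archive, but the helper function names and the sample robots.txt versions are illustrative assumptions, not the tool's actual code:

```python
import difflib
import hashlib
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain: str) -> str:
    """Build a CDX API query for every archived robots.txt snapshot."""
    params = {
        "url": f"{domain}/robots.txt",
        "output": "json",
        "fl": "timestamp,statuscode,digest",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def dedupe_versions(snapshots):
    """Keep only snapshots whose content hash differs from the previous one."""
    unique, last_hash = [], None
    for body in snapshots:
        h = hashlib.sha256(body.encode()).hexdigest()
        if h != last_hash:
            unique.append(body)
            last_hash = h
    return unique

def diff_versions(old: str, new: str):
    """Line-by-line diff: '+' lines are new restrictions, '-' lines are newly exposed paths."""
    added, removed = [], []
    for line in difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=""):
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:])
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:])
    return added, removed

# Hypothetical example: a Disallow line disappears between two versions,
# so it shows up in the "removed" half of the diff.
v1 = "User-agent: *\nDisallow: /admin\nDisallow: /staging"
v2 = "User-agent: *\nDisallow: /admin"
added, removed = diff_versions(v1, v2)
print(removed)
```

In a real run, each snapshot body would be fetched from `web.archive.org` before hashing; the diff and dedupe logic is independent of how the content is retrieved.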

What categories of exposed paths are most critical?

Removed Disallow paths are automatically categorized by risk level. Admin paths (/admin, /wp-admin, /cpanel) and API endpoints (/api, /graphql) are highest priority because they indicate management interfaces. Backup paths (/backup, .sql, .bak) may expose database dumps. Config paths (/.env, /config) can leak credentials. The SANS Institute OSINT collection methodology recommends cross-referencing exposed paths with directory brute-forcing for maximum coverage. As specified in RFC 9309 (the robots exclusion protocol standard), removing a Disallow only changes crawler permissions — it does not change server access controls, so the path may still be live.

Robots.txt Disallow
A directive in robots.txt instructing crawlers not to access a specific URL path. Removing it exposes that path to both crawlers and investigators.
Content Deduplication
Hashing each robots.txt version to filter out identical snapshots, showing only meaningful changes in the diff timeline.
Removed Path Probing
Checking whether a previously disallowed (and now removed) path still returns a successful HTTP response on the live server, indicating the directory is accessible.
CDX API
The Wayback Machine's index API for discovering all archived snapshots of a given URL, returning timestamps and HTTP status codes as documented by the Internet Archive CDX Server.
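Because the CDX Server returns a content digest per snapshot, deduplication can happen before any robots.txt body is fetched: consecutive rows sharing a digest are identical captures. A sketch against a hypothetical JSON response (the field names follow the CDX Server's `fl` parameter; the sample rows are invented):

```python
import json

# Hypothetical CDX JSON response: the first row is the header,
# each following row is one archived snapshot of robots.txt.
raw = json.dumps([
    ["timestamp", "statuscode", "digest"],
    ["20190301120000", "200", "AAA111"],
    ["20200615080000", "200", "AAA111"],  # same digest: identical content
    ["20210902140000", "200", "BBB222"],
])

def parse_cdx(payload: str):
    """Turn the CDX header-plus-rows JSON shape into a list of dicts."""
    rows = json.loads(payload)
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

def unique_digests(snapshots):
    """Drop consecutive snapshots that share a digest — no fetch needed."""
    seen, unique = None, []
    for snap in snapshots:
        if snap["digest"] != seen:
            unique.append(snap)
            seen = snap["digest"]
    return unique

snaps = parse_cdx(raw)
print(len(unique_digests(snaps)), "unique versions out of", len(snaps), "snapshots")
```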

🤖 Robots.txt Historian — Frequently Asked Questions

Why are removed Disallow paths in robots.txt valuable for OSINT?

When an organization removes a Disallow directive from robots.txt, a previously hidden path is no longer flagged to search engine crawlers — or to investigators reading the file's history. These removed paths often point to admin panels, staging environments, API endpoints, backup directories, or internal tools that were intentionally hidden and may still be reachable, since the robots.txt change does not remove the path from the server.

How does Robots.txt Historian find historical changes?

The tool queries the Wayback Machine CDX API to discover every archived snapshot of a domain's robots.txt file. It fetches each unique version, computes content hashes to deduplicate identical snapshots, then performs line-by-line diffs between consecutive versions to identify added and removed directives.

Can removed Disallow paths still be accessed on the live site?

Yes — removing a path from robots.txt only changes crawler permissions, not server access controls. The Robots.txt Historian can probe removed paths against the live site and check HTTP status codes. A 200 response means the previously hidden directory is still accessible, whether by intent or by oversight.
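A minimal probing sketch using only the standard library. The status-to-verdict mapping is an illustrative assumption; note that a 401 or 403 still confirms the path exists, it is just access-controlled:

```python
from urllib import request, error

def classify(status: int) -> str:
    """Map an HTTP status code to a coarse verdict for a probed path."""
    if 200 <= status < 300:
        return "accessible"
    if status in (301, 302, 307, 308):
        return "redirects"
    if status in (401, 403):
        return "exists but protected"
    if status == 404:
        return "gone"
    return "unclear"

def probe(domain: str, path: str, timeout: float = 5.0) -> str:
    """HEAD-request a previously disallowed path on the live site."""
    req = request.Request(f"https://{domain}{path}", method="HEAD")
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return classify(resp.status)
    except error.HTTPError as exc:  # urllib raises on 4xx/5xx responses
        return classify(exc.code)
    except error.URLError:
        return "unreachable"

print(classify(200))
```

A real probe run should rate-limit requests and treat redirects carefully, since a 301 to a login page and a 301 to the homepage mean very different things.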