🏛️ Wayback & CommonCrawl Recon

Query Wayback Machine & CommonCrawl archives to uncover forgotten endpoints, exposed configs, API keys, and shadow infrastructure.

100% client-side — queries go directly from your browser to archive APIs · Last updated February 10, 2026

How does archived endpoint reconnaissance uncover hidden attack surface?

Max Intel's Wayback & CommonCrawl Recon queries the Internet Archive's Wayback Machine CDX API and the CommonCrawl Index API to retrieve every archived URL recorded for a target domain. Web crawlers capture publicly accessible paths — including configuration files, admin panels, API documentation, and database backups that were briefly exposed before being removed. According to the OWASP Testing Guide v4.2, passive reconnaissance using web archives is a foundational step in application security testing because it reveals the historical attack surface that active scanning misses.
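A CDX query of the kind described above can be sketched as a URL builder. buildCdxQuery is a hypothetical helper, not part of Max Intel, but the parameters it sets (url with a trailing wildcard, output, fl, collapse, from, to) are documented CDX Server API options.

```typescript
// Sketch: build a Wayback Machine CDX API query for a target domain.
// buildCdxQuery is an illustrative helper; the query parameters are
// real CDX Server API options.
function buildCdxQuery(domain: string, fromYear?: string, toYear?: string): string {
  const params = new URLSearchParams({
    url: `${domain}/*`, // trailing wildcard = prefix match on all paths
    output: "json",     // JSON rows instead of plain text
    fl: "timestamp,original,statuscode,mimetype", // fields to return
    collapse: "urlkey", // one row per unique URL key
  });
  if (fromYear) params.set("from", fromYear);
  if (toYear) params.set("to", toYear);
  return `https://web.archive.org/cdx/search/cdx?${params.toString()}`;
}

// Example: fetch(buildCdxQuery("example.com", "2015", "2024"))
```

The response's first JSON row is a header naming the requested fields; every following row is one archived capture.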

Why do archived URLs matter for security assessments?

A file removed from a live server may still be accessible at a different path, cached by a CDN, or retrievable from the archive itself. The SANS Institute penetration testing methodology recommends web archive enumeration as a standard reconnaissance technique because it reveals:

- endpoints that existed during development but were removed before production,
- configuration files that leaked credentials or internal hostnames,
- API versions that were deprecated but never fully decommissioned, and
- backup files that contain source code or database schemas.

A 2024 HackerOne report found that approximately 18% of valid bug bounty submissions involved endpoints discovered through passive reconnaissance rather than active scanning — many of which were found in web archives.

What is the difference between Wayback Machine and CommonCrawl?

The Wayback Machine (operated by the Internet Archive since 2001) provides the largest single web archive, with over 890 billion archived pages. Its CDX API supports time-range filtering and returns HTTP status codes alongside URLs. CommonCrawl is a separate non-profit that publishes monthly web crawls as open datasets. Querying both sources provides broader coverage — CommonCrawl sometimes captures URLs that the Wayback Machine missed, and vice versa. Max Intel deduplicates results across both sources and marks URLs found in both as higher-confidence findings.
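The cross-source deduplication described above can be sketched as follows. The ArchiveHit and Finding shapes and the normalization rules are assumptions for illustration, not Max Intel's actual data model.

```typescript
// Sketch: merge results from both archives, dedupe by normalized URL,
// and mark URLs seen in both sources as higher-confidence findings.
// Field names and normalization rules are illustrative assumptions.
interface ArchiveHit { url: string; source: "wayback" | "commoncrawl"; }
interface Finding { url: string; sources: Set<string>; highConfidence: boolean; }

function dedupe(hits: ArchiveHit[]): Finding[] {
  const byUrl = new Map<string, Finding>();
  for (const hit of hits) {
    // Normalize: URL parsing lower-cases the host; we also drop the
    // fragment and any trailing slash before comparing.
    const u = new URL(hit.url);
    u.hash = "";
    const key = u.toString().replace(/\/$/, "");
    const f = byUrl.get(key) ?? { url: key, sources: new Set<string>(), highConfidence: false };
    f.sources.add(hit.source);
    f.highConfidence = f.sources.size > 1; // found in both archives
    byUrl.set(key, f);
  }
  return [...byUrl.values()];
}
```

A URL that survives normalization in both archives is unlikely to be a crawler artifact, which is why agreement between independent sources is treated as a confidence signal.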

Wayback Machine CDX API
The programmatic interface to the Internet Archive's URL index, returning archived URLs with timestamps, HTTP status codes, and MIME types for any domain. Supports wildcard subdomains, date range filtering, and result collapsing by URL key.
CommonCrawl Index
A publicly accessible index of URLs captured during CommonCrawl's monthly web crawls. Each crawl index covers approximately 3–4 billion pages and can be queried via a REST API that returns URL, status, MIME type, and crawl metadata.
Attack Surface Mapping
The process of identifying all externally accessible endpoints, services, and data stores associated with an organization. Archived endpoint discovery contributes to attack surface mapping by revealing paths that no longer appear in DNS records, sitemaps, or live crawls.
High-Value Target
An archived URL classified as containing potentially sensitive information — such as .env files, database credentials, API documentation (Swagger/OpenAPI), version control metadata (.git/config), or server configuration files that may reveal internal architecture.
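The CommonCrawl Index described above can be queried per crawl. buildCcQuery below is a hypothetical helper; the CC-MAIN-YYYY-WW index naming convention is real, and the list of available crawls is published at index.commoncrawl.org/collinfo.json.

```typescript
// Sketch: build a CommonCrawl Index API query for one monthly crawl.
// buildCcQuery is an illustrative helper; crawl IDs follow the real
// CC-MAIN-YYYY-WW convention (see index.commoncrawl.org/collinfo.json).
function buildCcQuery(crawlId: string, domain: string): string {
  const params = new URLSearchParams({
    url: `${domain}/*`, // trailing wildcard = prefix match on the domain
    output: "json",     // one JSON object per line
  });
  return `https://index.commoncrawl.org/${crawlId}-index?${params.toString()}`;
}

// Example: fetch(buildCcQuery("CC-MAIN-2024-10", "example.com"))
// Each response line carries url, status, mime, and crawl metadata.
```

Because each index covers only one crawl, broad coverage means iterating over several recent crawl IDs and merging the results.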

🏛️ Wayback & CommonCrawl Recon — Frequently Asked Questions

How does Wayback Machine and CommonCrawl reconnaissance work?

Max Intel queries the Wayback Machine CDX API and CommonCrawl index API directly from your browser. Both APIs return lists of URLs that web crawlers have archived for a given domain. The tool deduplicates results across both sources, classifies each URL by type (config files, JS, admin panels, API endpoints, backups), and flags high-value targets like .env files, credentials, and database dumps.
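The type classification step can be sketched with path patterns like the ones below. These buckets mirror the categories named above, but the specific regexes are illustrative assumptions, not Max Intel's actual rule set.

```typescript
// Sketch: classify an archived URL into the type buckets the text
// describes (config, js, admin, api, backup, other). The patterns
// are illustrative examples, not an exhaustive or official rule set.
type UrlType = "config" | "js" | "admin" | "api" | "backup" | "other";

function classifyUrl(url: string): UrlType {
  const path = new URL(url).pathname.toLowerCase();
  if (/\.(env|ini|cfg|conf|ya?ml|config)$/.test(path)) return "config";
  if (/\.(js|mjs)$/.test(path)) return "js";
  if (/\/(admin|wp-admin|administrator|login)(\/|$)/.test(path)) return "admin";
  if (/\/(api|v\d+|graphql|swagger|openapi)(\/|\.|$)/.test(path)) return "api";
  if (/\.(sql|bak|zip|tar\.gz|old|backup)(\.gz)?$/.test(path)) return "backup";
  return "other";
}
```

Checks run from most to least specific, so a config file under an admin path (e.g. /admin/settings.yml) is reported as config rather than admin.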

What types of sensitive files can archived endpoint discovery find?

Common high-value findings include exposed .env files with API keys, wp-config.php with database credentials, .git/config revealing repository structure, Swagger/OpenAPI documentation exposing internal APIs, database backups (.sql, .sql.gz), server configuration files (php.ini, web.config), and admin panel login pages. Even if these files have since been removed from the live site, the archived URLs confirm they once existed.
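Flagging those high-value paths can be sketched as a pattern list. The regexes below mirror the file types named above; they are examples, not a complete detection set.

```typescript
// Sketch: flag high-value archived URLs. These patterns mirror the
// sensitive file types listed in the text; they are examples only.
const HIGH_VALUE_PATTERNS: RegExp[] = [
  /\/\.env($|\.)/,              // exposed environment files (.env, .env.bak)
  /\/wp-config\.php/,           // WordPress database credentials
  /\/\.git\/config/,            // version control metadata
  /\/(swagger|openapi)[^/]*\.(json|ya?ml)$/, // API documentation
  /\.sql(\.gz)?$/,              // database dumps
  /\/(php\.ini|web\.config)$/,  // server configuration files
];

const isHighValue = (url: string): boolean =>
  HIGH_VALUE_PATTERNS.some(p => p.test(new URL(url).pathname.toLowerCase()));
```

Matching against the parsed pathname rather than the raw URL avoids false positives from query strings that merely mention a sensitive filename.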

Is Wayback Machine OSINT reconnaissance legal?

Querying the Wayback Machine and CommonCrawl APIs for publicly archived URLs is legal — these are public data sources that archive the open web. However, using discovered endpoints to access live systems without authorization would violate computer fraud laws. This tool is intended for authorized security testing, bug bounty programs, and attack surface assessment of domains you own or have permission to test.