The Internet Archive’s Wayback Machine stores 1 trillion+ web page captures across 99 petabytes of data, archiving roughly 498 million pages per day. Its CDX API gives investigators programmatic access to the full index — every archived URL, timestamp, status code, MIME type, and content digest. This enables five core OSINT workflows: endpoint discovery (finding forgotten APIs, admin panels, config files), deleted content recovery (pages removed from the live web), infrastructure change tracking (robots.txt and sitemap diffs over time), WHOIS and registration archaeology, and ghost profile detection (deleted social media accounts). All queries are entirely passive — no packets touch the target.
What the Wayback Machine Actually Stores
The Internet Archive, founded in 1996 by Brewster Kahle, operates the Wayback Machine from a former Christian Science church on Funston Avenue in San Francisco. Its web crawlers continuously traverse the public internet, downloading and storing snapshots of web pages with their full HTTP responses — HTML content, headers, status codes, and associated resources. Each snapshot is indexed in a CDX (Capture/Display indeX) record containing the URL, a 14-digit timestamp (YYYYMMDDhhmmss), HTTP status code, MIME type, content digest (SHA-1 hash), and compressed size.
In October 2025, the archive crossed the 1 trillion page capture milestone. The physical infrastructure runs on custom-built PetaBox rack systems, each providing 1.4 petabytes of storage. The total unique data exceeds 99 petabytes (212+ petabytes including backups and redundancy). In July 2025, the U.S. Senate designated the Internet Archive as a Federal Depository Library, and in September 2024, Google partnered with the Archive to integrate Wayback Machine links into Google Search’s “more about this page” menu, effectively replacing the retired Google Cache.
The CDX API: Complete Reference
The CDX API is the primary programmatic interface for searching the Wayback Machine’s index. The endpoint is web.archive.org/cdx/search/cdx; all options below are passed as query-string parameters.
Essential Parameters
| Parameter | Example | Purpose |
|---|---|---|
| url | example.com | Target URL (required). Supports wildcards: *.example.com |
| output | json | Response format. json returns an array of arrays with a header row |
| fl | timestamp,original,statuscode | Select specific fields to return |
| from / to | from=2020&to=2024 | Date range filter (1–14 digit YYYYMMDDhhmmss prefix) |
| filter | statuscode:200 | Regex filter on any field. Prefix with ! to negate |
| collapse | digest | Deduplicate adjacent rows by field. collapse=timestamp:8 = one capture per day |
| limit | 1000 | Max results. A negative value returns the last N results |
| matchType | domain | exact, prefix, host, or domain matching |
CDX Fields
The default CDX response includes seven fields: urlkey (SURT-encoded URL), timestamp, original (original URL), mimetype, statuscode, digest (SHA-1 of content), and length (compressed size). Use the fl parameter to select only the fields you need, which significantly reduces response size for large queries.
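As a concrete sketch of the request/response cycle (Python 3; the helper names `build_cdx_query` and `parse_cdx_json` are illustrative, not part of any official client), a CDX query can be assembled and its JSON array-of-arrays response flattened into per-capture dicts:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, **params):
    """Assemble a CDX API query URL; keyword args map one-to-one
    onto CDX parameters (output, fl, filter, collapse, limit...)."""
    return f"{CDX_ENDPOINT}?{urlencode({'url': url, **params})}"

def parse_cdx_json(rows):
    """With output=json the API returns an array of arrays whose
    first row is the field-name header; zip each data row into a
    dict so fields can be addressed by name."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

query = build_cdx_query(
    "example.com",
    output="json",
    fl="timestamp,original,statuscode",
    collapse="digest",
    limit="50",
)
# Fetch `query` with any HTTP client, then pass the decoded JSON
# body to parse_cdx_json() to get a list of field->value dicts.
```

Selecting only the needed fields via fl keeps large result sets manageable, and collapse=digest drops consecutive identical captures before they ever leave the server.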
Constructing Archive URLs
Once you have a timestamp and original URL from the CDX API, construct the archive URL in one of two forms: web.archive.org/web/{timestamp}/{url} for the rendered playback page, or web.archive.org/web/{timestamp}id_/{url} for the raw capture.
The id_ suffix returns the original archived response without any Wayback Machine modifications — essential for programmatic content extraction and diffing.
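A minimal helper for building both forms from a CDX row (the function name `archive_url` is mine, not an official API):

```python
WEB_PREFIX = "https://web.archive.org/web"

def archive_url(timestamp, original, raw=False):
    """Build a Wayback playback URL from a CDX timestamp and
    original URL. raw=True appends the id_ modifier, returning
    the archived response without toolbar injection or link
    rewriting -- the form to use for diffing and extraction."""
    modifier = "id_" if raw else ""
    return f"{WEB_PREFIX}/{timestamp}{modifier}/{original}"

# Rendered playback page (toolbar, rewritten links):
view = archive_url("20240101000000", "http://example.com/")
# Raw bytes as originally served:
raw = archive_url("20240101000000", "http://example.com/", raw=True)
```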
Five Core OSINT Workflows
1. Endpoint and Attack Surface Discovery
The Wayback Machine’s wildcard query is one of the most powerful passive reconnaissance techniques available. Querying *.example.com/* returns every URL ever archived under a domain — including admin panels, API endpoints, configuration files, backup directories, staging environments, and forgotten web applications that may still be live but unlinked.
Bug bounty hunters and penetration testers rely heavily on this technique. Filtering by mimetype:application/json reveals API endpoints; mimetype:application/javascript uncovers JS files that may contain hardcoded secrets; and filtering for paths containing /admin/, /backup/, /config/, or /.env targets high-value infrastructure.
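A triage pass over CDX results might look like the following sketch (the regex and the notion of "high-value" paths are illustrative assumptions to be tuned per engagement, not a standard list):

```python
import re

# Illustrative high-value path patterns -- extend to taste.
HIGH_VALUE = re.compile(
    r"/(admin|backup|config)(/|$)|/\.env$|\.(sql|bak|old)$", re.I)

def triage(records):
    """Given CDX records as dicts (keys: original, mimetype),
    return the subset worth manual review: JSON/JS responses
    plus any URL whose path matches the high-value pattern."""
    hits = []
    for rec in records:
        if rec.get("mimetype") in ("application/json",
                                   "application/javascript"):
            hits.append(rec)
        elif HIGH_VALUE.search(rec.get("original", "")):
            hits.append(rec)
    return hits
```

Running this over the output of a *.example.com/* wildcard query narrows tens of thousands of archived URLs down to a reviewable shortlist.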
2. Deleted Content Recovery
Pages removed from the live web remain permanently in the Wayback Machine if they were captured before deletion. This is invaluable for recovering: retracted press releases and statements, deleted social media profiles and posts, removed product pages and pricing, taken-down legal documents, and edited-then-deleted blog posts. Query the CDX API for the specific URL, then fetch the most recent pre-deletion snapshot using the id_ suffix for raw content.
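Selecting the right snapshot can be reduced to a small pure function over CDX records (a sketch; `latest_live_capture` is an illustrative name):

```python
def latest_live_capture(records):
    """From CDX records (dicts with timestamp/original/statuscode),
    return the raw (id_) URL of the most recent 200 capture, or
    None if the page was never successfully archived. The 14-digit
    timestamps sort correctly as strings."""
    live = [r for r in records if r.get("statuscode") == "200"]
    if not live:
        return None
    best = max(live, key=lambda r: r["timestamp"])
    return (f"https://web.archive.org/web/"
            f"{best['timestamp']}id_/{best['original']}")
```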
3. Robots.txt and Sitemap History
Every organization’s robots.txt is regularly archived. By pulling all historical snapshots and diffing them, investigators can discover paths that were once blocked by Disallow directives but have since been removed — suggesting the organization decommissioned something it was previously hiding, but the underlying resource may still be live. The same technique applied to sitemap.xml reveals URLs that once appeared in the sitemap but have since vanished, indicating removed pages, restructured content, or abandoned applications.
The collapse=digest parameter is critical here: it deduplicates by content hash, returning only snapshots where the file actually changed.
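Once two robots.txt snapshots are fetched (e.g., via their id_ URLs), the diff itself is trivial -- a sketch, assuming plain Disallow directives:

```python
def disallowed(robots_txt):
    """Extract the set of Disallow paths from a robots.txt body."""
    paths = set()
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths

def vanished(old_snapshot, new_snapshot):
    """Paths blocked in the older snapshot but absent from the
    newer one -- candidates for decommissioned-but-possibly-
    still-live resources."""
    return disallowed(old_snapshot) - disallowed(new_snapshot)
```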
4. WHOIS and Registration Archaeology
Before GDPR redacted most WHOIS records in 2018, registrant names, addresses, phone numbers, and email addresses were publicly displayed. The Wayback Machine archived millions of WHOIS lookup pages from services like whois.domaintools.com and who.is. By querying the CDX API for archived WHOIS pages for a target domain, investigators can often recover the original registrant information that is no longer available through live WHOIS queries. Additionally, archived homepage footers, “about” pages, and “contact” pages frequently contained business addresses and phone numbers that have since been removed.
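The lookup can be expressed as a batch of CDX queries capped at the GDPR enforcement date (25 May 2018). The lookup-page path patterns below are illustrative guesses for the two services named above -- verify them against actual archived URLs before relying on them:

```python
from urllib.parse import quote

# Assumed lookup-page patterns for the named WHOIS services.
WHOIS_PAGES = [
    "whois.domaintools.com/{domain}",
    "who.is/whois/{domain}",
]

def whois_cdx_queries(domain):
    """Build CDX queries for archived WHOIS lookup pages covering
    the target domain, restricted to pre-GDPR (before 2018-05-25)
    successful captures."""
    queries = []
    for pattern in WHOIS_PAGES:
        page = pattern.format(domain=domain)
        queries.append(
            "https://web.archive.org/cdx/search/cdx"
            f"?url={quote(page, safe='')}"
            "&to=20180525&output=json&filter=statuscode:200"
        )
    return queries
```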
5. Ghost Profile Detection
Deleted social media accounts leave traces in the Wayback Machine. An investigator can check whether archived snapshots exist for a username across platforms like Twitter/X, Facebook, Instagram, LinkedIn, Reddit, GitHub, and others. If the CDX API returns captures with statuscode:200 for a profile URL that now returns 404, the profile existed and was deleted. The archived content may include the user’s display name, bio, profile photo, post history, follower counts, and linked accounts — all preserved at the time of capture.
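The core decision rule is small enough to state as code. A sketch -- the profile URL patterns are illustrative assumptions (platforms change their path schemes), and fetching the archived/live statuses is left to the caller:

```python
# Assumed profile URL patterns; confirm each platform's actual
# path scheme before use.
PROFILE_PATTERNS = [
    "twitter.com/{u}",
    "github.com/{u}",
    "reddit.com/user/{u}",
    "instagram.com/{u}",
]

def profile_urls(username):
    """Candidate profile URLs to check for a given username."""
    return [p.format(u=username) for p in PROFILE_PATTERNS]

def is_ghost(archived_statuscodes, live_status):
    """A profile is a 'ghost' when the archive holds at least one
    successful (200) capture but the live URL now returns 404."""
    return "200" in archived_statuscodes and live_status == 404
```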
The Wayback Machine vs. CommonCrawl
| Feature | Wayback Machine | CommonCrawl |
|---|---|---|
| Operator | Internet Archive (nonprofit) | CommonCrawl Foundation (nonprofit) |
| Data Span | 1996 – present (continuous) | 2008 – present (monthly crawls) |
| Total Size | 99+ PB unique data | ~350 TB per monthly crawl |
| Access | CDX API + web interface | AWS S3 bulk download + index API |
| Best For | Specific URL history, deleted content, visual browsing | Bulk cross-domain analysis, link graphs, content mining |
| Rate Limits | Informal; ~1 req/sec recommended | None (download at S3 speed) |
| Deduplication | CDX digest field + collapse param | Per-crawl dedup only |
For OSINT, the two sources are complementary. The Wayback Machine excels at investigating specific targets with deep historical coverage. CommonCrawl excels at discovering connections across the broader web. Querying both maximizes coverage — CommonCrawl often captures URLs the Wayback Machine missed, and vice versa.
The Three Public APIs
| API | Endpoint | Purpose |
|---|---|---|
| CDX | web.archive.org/cdx/search/cdx | Search the capture index. Returns metadata (timestamp, status, digest) for all captures of a URL. |
| Availability | archive.org/wayback/available | Quick check whether a URL is archived. Returns closest snapshot URL and timestamp. |
| Save Page Now | web.archive.org/save/ | Trigger archiving of a URL. Useful for preserving evidence before it disappears. |
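The Availability API returns a small JSON document; parsing it amounts to one guarded lookup. A sketch based on the documented response shape (`closest_snapshot` is an illustrative name):

```python
def closest_snapshot(payload):
    """Parse an Availability API response. Returns
    (url, timestamp) for the closest available snapshot,
    or None when nothing is archived."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"], snap["timestamp"]
    return None

# Usage: GET archive.org/wayback/available?url=example.com,
# decode the JSON body, then call closest_snapshot() on it.
```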
Practical Tips and Rate Limits
The Wayback Machine does not publish official rate limits, but the community consensus is approximately 1 request per second for the CDX API and slower for full page fetches. Exceeding this often results in 429 (Too Many Requests) or 503 responses. Strategies for working within limits: use collapse=digest to minimize redundant fetches, batch CDX queries with broader wildcards instead of many narrow queries, add retry logic with exponential backoff (starting at 3 seconds), and consider caching responses locally since archived content is immutable.
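The backoff strategy above can be sketched with the standard library alone (no third-party client assumed):

```python
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base=3.0):
    """Exponential backoff schedule: 3s, 6s, 12s, 24s, ..."""
    return base * (2 ** attempt)

def fetch_with_backoff(url, max_tries=5):
    """GET a URL, sleeping and retrying on 429/503 responses;
    any other HTTP error is raised immediately."""
    for attempt in range(max_tries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code in (429, 503) and attempt < max_tries - 1:
                time.sleep(backoff_delay(attempt))
            else:
                raise
```

Because archived content is immutable, every successful fetch can be cached indefinitely, which also keeps you well under the informal rate limit.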
The archive experienced a significant data breach in September 2024, exposing 31 million user records. In October 2024, a DDoS attack took the site offline, and Save Page Now remained disabled until November 2024. The service has since recovered, but investigators should maintain their own caching infrastructure for reliability.
Key Terminology
- CDX (Capture/Display indeX)
- The file format and API used by the Wayback Machine to index all archived captures. Each CDX record contains the URL, timestamp, status code, MIME type, content digest, and compressed size.
- SURT (Sort-friendly URI Rewriting Transform)
- A URL canonicalization scheme used by the CDX index. Reverses the domain components so https://www.example.com/page becomes com,example,www)/page. This groups all pages from the same domain together in the index.
- Memento
- A snapshot of a web resource at a specific point in time. The Wayback Machine implements the Memento protocol (RFC 7089), providing standardized time-based access to archived resources.
- Content Digest
- A SHA-1 hash of the archived page content. When two captures share the same digest, their content is identical. The collapse=digest parameter uses this to deduplicate results.
- Save Page Now (SPN)
- The Wayback Machine’s on-demand archiving service. Investigators use it to preserve evidence — social media posts, news articles, forum threads — before the content can be deleted or altered.
- id_ Modifier
- Appended to a Wayback timestamp (e.g., /web/20240101id_/) to retrieve the raw archived content without the Wayback Machine toolbar or link rewriting. Essential for programmatic content extraction.
- CommonCrawl
- A separate nonprofit that performs monthly web crawls and publishes the raw data as freely downloadable datasets on AWS S3. Complementary to the Wayback Machine for OSINT: better for bulk analysis, worse for specific URL history.
Sources
- Wayback Machine — Wikipedia (1 trillion pages, Oct 2025)
- Wayback CDX Server README — GitHub (official CDX API documentation)
- Wayback Machine APIs — Internet Archive (Availability API, Save Page Now)
- Internet Archive — Wikipedia (99 PB data, Federal Depository Library, Google partnership)
- Bellingcat Online Investigation Toolkit (Wayback Machine investigation methodology)
- Wayback CDX Server API Beta — Internet Archive Developer Portal (CDX parameters)
- wayback Python library documentation (WaybackClient, Memento API)
- PC Gamer (Dec 2025) (150 TB/day, 175 PB including backups)
Frequently Asked Questions
How do I use the Wayback Machine CDX API for OSINT?
The CDX API endpoint is web.archive.org/cdx/search/cdx. Required parameter: url. Key optional parameters: output=json, fl= (select fields), collapse=digest (deduplicate), filter=statuscode:200, from=/to= (date range), and limit=. Wildcards like *.example.com/* return all subpages and subdomains. The API returns index records — construct archive URLs as web.archive.org/web/{timestamp}id_/{url} to fetch actual content.
Can the Wayback Machine find deleted web pages?
Yes. If the Archive’s crawlers captured a page before deletion, the content remains permanently accessible. This works for deleted social media profiles, removed press releases, retracted statements, decommissioned web apps, and pages taken down under legal pressure. The archive captures approximately 498 million pages daily.
What is the difference between the Wayback Machine and CommonCrawl?
The Wayback Machine archives continuously since 1996 with a CDX API and browsable interface. CommonCrawl performs monthly crawls published as bulk datasets on AWS S3. Wayback is better for specific URL history; CommonCrawl is better for cross-domain bulk analysis. Use both for maximum coverage.
How often does the Wayback Machine archive pages?
Frequency varies by site popularity. Popular sites may be archived multiple times per day; smaller sites a few times per year. As of late 2025, approximately 498 million pages are captured daily (150 TB of new data). You can manually trigger archiving via Save Page Now (web.archive.org/save/).