The Internet Archive’s Wayback Machine stores 1 trillion+ web page captures across 99 petabytes of data, archiving roughly 498 million pages per day. Its CDX API gives investigators programmatic access to the full index — every archived URL, timestamp, status code, MIME type, and content digest. This enables five core OSINT workflows: endpoint discovery (finding forgotten APIs, admin panels, config files), deleted content recovery (pages removed from the live web), infrastructure change tracking (robots.txt and sitemap diffs over time), WHOIS and registration archaeology, and ghost profile detection (deleted social media accounts). All queries are entirely passive — no packets touch the target.
What the Wayback Machine Actually Stores
The Internet Archive, founded in 1996 by Brewster Kahle, operates the Wayback Machine from a former Christian Science church on Funston Avenue in San Francisco. Its web crawlers continuously traverse the public internet, downloading and storing snapshots of web pages with their full HTTP responses — HTML content, headers, status codes, and associated resources. Each snapshot is indexed in a CDX (Capture/Display indeX) record containing the URL, a 14-digit timestamp (YYYYMMDDhhmmss), HTTP status code, MIME type, content digest (SHA-1 hash), and compressed size.
In October 2025, the archive crossed the 1 trillion page capture milestone. The physical infrastructure runs on custom-built PetaBox rack systems, each providing 1.4 petabytes of storage. The total unique data exceeds 99 petabytes (212+ petabytes including backups and redundancy). In July 2025, the U.S. Senate designated the Internet Archive as a Federal Depository Library, and in September 2024, Google partnered with the Archive to integrate Wayback Machine links into Google Search’s “more about this page” menu, effectively replacing the retired Google Cache.
The CDX API: Complete Reference
The CDX API is the primary programmatic interface for searching the Wayback Machine’s index. The endpoint is web.archive.org/cdx/search/cdx; all options below are passed as query-string parameters.
Essential Parameters
| Parameter | Example | Purpose |
|---|---|---|
| url | example.com | Target URL (required). Supports wildcards: *.example.com |
| output | json | Response format. json returns an array of arrays with a header row |
| fl | timestamp,original,statuscode | Select specific fields to return |
| from / to | from=2020&to=2024 | Date range filter (1–14 digit YYYYMMDDhhmmss prefix) |
| filter | statuscode:200 | Regex filter on any field. Prefix with ! to negate |
| collapse | digest | Deduplicate adjacent rows by field. collapse=timestamp:8 = one capture per day |
| limit | 1000 | Max results. A negative value returns the last N results |
| matchType | domain | exact, prefix, host, or domain matching |
CDX Fields
The default CDX response includes seven fields: urlkey (SURT-encoded URL), timestamp, original (original URL), mimetype, statuscode, digest (SHA-1 of content), and length (compressed size). Use the fl parameter to select only the fields you need, which significantly reduces response size for large queries.
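As a concrete sketch of the request/response cycle (Python 3; the helper names `build_cdx_query` and `parse_cdx_json` are illustrative, not part of any official client), a CDX query can be assembled and its JSON array-of-arrays response flattened into per-capture dicts:

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_query(url, **params):
    """Assemble a CDX API query URL; keyword args map one-to-one
    onto CDX parameters (output, fl, filter, collapse, limit...)."""
    return f"{CDX_ENDPOINT}?{urlencode({'url': url, **params})}"

def parse_cdx_json(rows):
    """With output=json the API returns an array of arrays whose
    first row is the field-name header; zip each data row into a
    dict so fields can be addressed by name."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

query = build_cdx_query(
    "example.com",
    output="json",
    fl="timestamp,original,statuscode",
    collapse="digest",
    limit="50",
)
# Fetch `query` with any HTTP client, then pass the decoded JSON
# body to parse_cdx_json() to get a list of field->value dicts.
```

Selecting only the needed fields via fl keeps large result sets manageable, and collapse=digest drops consecutive identical captures before they ever leave the server.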
Constructing Archive URLs
Once you have a timestamp and original URL from the CDX API, construct the archive URL in one of two forms: web.archive.org/web/{timestamp}/{url} for the rendered playback page, or web.archive.org/web/{timestamp}id_/{url} for the raw capture.
The id_ suffix returns the original archived response without any Wayback Machine modifications — essential for programmatic content extraction and diffing.
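A minimal helper for building both forms from a CDX row (the function name `archive_url` is mine, not an official API):

```python
WEB_PREFIX = "https://web.archive.org/web"

def archive_url(timestamp, original, raw=False):
    """Build a Wayback playback URL from a CDX timestamp and
    original URL. raw=True appends the id_ modifier, returning
    the archived response without toolbar injection or link
    rewriting -- the form to use for diffing and extraction."""
    modifier = "id_" if raw else ""
    return f"{WEB_PREFIX}/{timestamp}{modifier}/{original}"

# Rendered playback page (toolbar, rewritten links):
view = archive_url("20240101000000", "http://example.com/")
# Raw bytes as originally served:
raw = archive_url("20240101000000", "http://example.com/", raw=True)
```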
Five Core OSINT Workflows
1. Endpoint and Attack Surface Discovery
The Wayback Machine’s wildcard query is one of the most powerful passive reconnaissance techniques available. Querying *.example.com/* returns every URL ever archived under a domain — including admin panels, API endpoints, configuration files, backup directories, staging environments, and forgotten web applications that may still be live but unlinked.
Bug bounty hunters and penetration testers rely heavily on this technique. Filtering by mimetype:application/json reveals API endpoints; mimetype:application/javascript uncovers JS files that may contain hardcoded secrets; and filtering for paths containing /admin/, /backup/, /config/, or /.env targets high-value infrastructure.
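A triage pass over CDX results might look like the following sketch (the regex and the notion of "high-value" paths are illustrative assumptions to be tuned per engagement, not a standard list):

```python
import re

# Illustrative high-value path patterns -- extend to taste.
HIGH_VALUE = re.compile(
    r"/(admin|backup|config)(/|$)|/\.env$|\.(sql|bak|old)$", re.I)

def triage(records):
    """Given CDX records as dicts (keys: original, mimetype),
    return the subset worth manual review: JSON/JS responses
    plus any URL whose path matches the high-value pattern."""
    hits = []
    for rec in records:
        if rec.get("mimetype") in ("application/json",
                                   "application/javascript"):
            hits.append(rec)
        elif HIGH_VALUE.search(rec.get("original", "")):
            hits.append(rec)
    return hits
```

Running this over the output of a *.example.com/* wildcard query narrows tens of thousands of archived URLs down to a reviewable shortlist.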
2. Deleted Content Recovery
Pages removed from the live web remain permanently in the Wayback Machine if they were captured before deletion. This is invaluable for recovering: retracted press releases and statements, deleted social media profiles and posts, removed product pages and pricing, taken-down legal documents, and edited-then-deleted blog posts. Query the CDX API for the specific URL, then fetch the most recent pre-deletion snapshot using the id_ suffix for raw content.
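Selecting the right snapshot can be reduced to a small pure function over CDX records (a sketch; `latest_live_capture` is an illustrative name):

```python
def latest_live_capture(records):
    """From CDX records (dicts with timestamp/original/statuscode),
    return the raw (id_) URL of the most recent 200 capture, or
    None if the page was never successfully archived. The 14-digit
    timestamps sort correctly as strings."""
    live = [r for r in records if r.get("statuscode") == "200"]
    if not live:
        return None
    best = max(live, key=lambda r: r["timestamp"])
    return (f"https://web.archive.org/web/"
            f"{best['timestamp']}id_/{best['original']}")
```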
3. Robots.txt and Sitemap History
Every organization’s robots.txt is regularly archived. By pulling all historical snapshots and diffing them, investigators can discover paths that were once blocked by Disallow directives but have since been removed — suggesting the organization decommissioned something it was previously hiding, but the underlying resource may still be live. The same technique applied to sitemap.xml reveals URLs that once appeared in the sitemap but have since vanished, indicating removed pages, restructured content, or abandoned applications.
The collapse=digest parameter is critical here: it deduplicates by content hash, returning only snapshots where the file actually changed.
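Once two robots.txt snapshots are fetched (e.g., via their id_ URLs), the diff itself is trivial -- a sketch, assuming plain Disallow directives:

```python
def disallowed(robots_txt):
    """Extract the set of Disallow paths from a robots.txt body."""
    paths = set()
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.add(value.strip())
    return paths

def vanished(old_snapshot, new_snapshot):
    """Paths blocked in the older snapshot but absent from the
    newer one -- candidates for decommissioned-but-possibly-
    still-live resources."""
    return disallowed(old_snapshot) - disallowed(new_snapshot)
```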
4. WHOIS and Registration Archaeology
Before GDPR redacted most WHOIS records in 2018, registrant names, addresses, phone numbers, and email addresses were publicly displayed. The Wayback Machine archived millions of WHOIS lookup pages from services like whois.domaintools.com and who.is. By querying the CDX API for archived WHOIS pages for a target domain, investigators can often recover the original registrant information that is no longer available through live WHOIS queries. Additionally, archived homepage footers, “about” pages, and “contact” pages frequently contained business addresses and phone numbers that have since been removed.
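The lookup can be expressed as a batch of CDX queries capped at the GDPR enforcement date (25 May 2018). The lookup-page path patterns below are illustrative guesses for the two services named above -- verify them against actual archived URLs before relying on them:

```python
from urllib.parse import quote

# Assumed lookup-page patterns for the named WHOIS services.
WHOIS_PAGES = [
    "whois.domaintools.com/{domain}",
    "who.is/whois/{domain}",
]

def whois_cdx_queries(domain):
    """Build CDX queries for archived WHOIS lookup pages covering
    the target domain, restricted to pre-GDPR (before 2018-05-25)
    successful captures."""
    queries = []
    for pattern in WHOIS_PAGES:
        page = pattern.format(domain=domain)
        queries.append(
            "https://web.archive.org/cdx/search/cdx"
            f"?url={quote(page, safe='')}"
            "&to=20180525&output=json&filter=statuscode:200"
        )
    return queries
```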
5. Ghost Profile Detection
Deleted social media accounts leave traces in the Wayback Machine. An investigator can check whether archived snapshots exist for a username across platforms like Twitter/X, Facebook, Instagram, LinkedIn, Reddit, GitHub, and others. If the CDX API returns captures with statuscode:200 for a profile URL that now returns 404, the profile existed and was deleted. The archived content may include the user’s display name, bio, profile photo, post history, follower counts, and linked accounts — all preserved at the time of capture.
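The core decision rule is small enough to state as code. A sketch -- the profile URL patterns are illustrative assumptions (platforms change their path schemes), and fetching the archived/live statuses is left to the caller:

```python
# Assumed profile URL patterns; confirm each platform's actual
# path scheme before use.
PROFILE_PATTERNS = [
    "twitter.com/{u}",
    "github.com/{u}",
    "reddit.com/user/{u}",
    "instagram.com/{u}",
]

def profile_urls(username):
    """Candidate profile URLs to check for a given username."""
    return [p.format(u=username) for p in PROFILE_PATTERNS]

def is_ghost(archived_statuscodes, live_status):
    """A profile is a 'ghost' when the archive holds at least one
    successful (200) capture but the live URL now returns 404."""
    return "200" in archived_statuscodes and live_status == 404
```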
The Wayback Machine vs. CommonCrawl
| Feature | Wayback Machine | CommonCrawl |
|---|---|---|
| Operator | Internet Archive (nonprofit) | CommonCrawl Foundation (nonprofit) |
| Data Span | 1996 – present (continuous) | 2008 – present (monthly crawls) |
| Total Size | 99+ PB unique data | ~350 TB per monthly crawl |
| Access | CDX API + web interface | AWS S3 bulk download + index API |
| Best For | Specific URL history, deleted content, visual browsing | Bulk cross-domain analysis, link graphs, content mining |
| Rate Limits | Informal; ~1 req/sec recommended | None (download at S3 speed) |
| Deduplication | CDX digest field + collapse param | Per-crawl dedup only |
For OSINT, the two sources are complementary. The Wayback Machine excels at investigating specific targets with deep historical coverage. CommonCrawl excels at discovering connections across the broader web. Querying both maximizes coverage — CommonCrawl often captures URLs the Wayback Machine missed, and vice versa.
The Three Public APIs
| API | Endpoint | Purpose |
|---|---|---|
| CDX | web.archive.org/cdx/search/cdx | Search the capture index. Returns metadata (timestamp, status, digest) for all captures of a URL. |
| Availability | archive.org/wayback/available | Quick check whether a URL is archived. Returns closest snapshot URL and timestamp. |
| Save Page Now | web.archive.org/save/ | Trigger archiving of a URL. Useful for preserving evidence before it disappears. |
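The Availability API returns a small JSON document; parsing it amounts to one guarded lookup. A sketch based on the documented response shape (`closest_snapshot` is an illustrative name):

```python
def closest_snapshot(payload):
    """Parse an Availability API response. Returns
    (url, timestamp) for the closest available snapshot,
    or None when nothing is archived."""
    snap = payload.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"], snap["timestamp"]
    return None

# Usage: GET archive.org/wayback/available?url=example.com,
# decode the JSON body, then call closest_snapshot() on it.
```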
Practical Tips and Rate Limits
The Wayback Machine does not publish official rate limits, but the community consensus is approximately 1 request per second for the CDX API and slower for full page fetches. Exceeding this often results in 429 (Too Many Requests) or 503 responses. Strategies for working within limits: use collapse=digest to minimize redundant fetches, batch CDX queries with broader wildcards instead of many narrow queries, add retry logic with exponential backoff (starting at 3 seconds), and consider caching responses locally since archived content is immutable.
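The backoff strategy above can be sketched with the standard library alone (no third-party client assumed):

```python
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, base=3.0):
    """Exponential backoff schedule: 3s, 6s, 12s, 24s, ..."""
    return base * (2 ** attempt)

def fetch_with_backoff(url, max_tries=5):
    """GET a URL, sleeping and retrying on 429/503 responses;
    any other HTTP error is raised immediately."""
    for attempt in range(max_tries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as exc:
            if exc.code in (429, 503) and attempt < max_tries - 1:
                time.sleep(backoff_delay(attempt))
            else:
                raise
```

Because archived content is immutable, every successful fetch can be cached indefinitely, which also keeps you well under the informal rate limit.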
The archive experienced a significant data breach in September 2024, exposing 31 million user records. In October 2024, a DDoS attack took the site offline, and Save Page Now remained disabled until November 2024. The service has since recovered, but investigators should maintain their own caching infrastructure for reliability.
Key Terminology
- CDX (Capture/Display indeX)
- The file format and API used by the Wayback Machine to index all archived captures. Each CDX record contains the URL, timestamp, status code, MIME type, content digest, and compressed size.
- SURT (Sort-friendly URI Rewriting Transform)
- A URL canonicalization scheme used by the CDX index. Reverses the domain components so https://www.example.com/page becomes com,example,www)/page. This groups all pages from the same domain together in the index.
- Memento
- A snapshot of a web resource at a specific point in time. The Wayback Machine implements the Memento protocol (RFC 7089), providing standardized time-based access to archived resources.
- Content Digest
- A SHA-1 hash of the archived page content. When two captures share the same digest, their content is identical. The collapse=digest parameter uses this to deduplicate results.
- Save Page Now (SPN)
- The Wayback Machine’s on-demand archiving service. Investigators use it to preserve evidence — social media posts, news articles, forum threads — before the content can be deleted or altered.
- id_ Modifier
- Appended to a Wayback timestamp (e.g., /web/20240101id_/) to retrieve the raw archived content without the Wayback Machine toolbar or link rewriting. Essential for programmatic content extraction.
- CommonCrawl
- A separate nonprofit that performs monthly web crawls and publishes the raw data as freely downloadable datasets on AWS S3. Complementary to the Wayback Machine for OSINT: better for bulk analysis, worse for specific URL history.
Sources
- Wayback Machine — Wikipedia (1 trillion pages, Oct 2025)
- Wayback CDX Server README — GitHub (official CDX API documentation)
- Wayback Machine APIs — Internet Archive (Availability API, Save Page Now)
- Internet Archive — Wikipedia (99 PB data, Federal Depository Library, Google partnership)
- Bellingcat Online Investigation Toolkit (Wayback Machine investigation methodology)
- Wayback CDX Server API Beta — Internet Archive Developer Portal (CDX parameters)
- wayback Python library documentation (WaybackClient, Memento API)
- PC Gamer (Dec 2025) (150 TB/day, 175 PB including backups)
Frequently Asked Questions
How do I use the Wayback Machine CDX API for OSINT?
The CDX API endpoint is web.archive.org/cdx/search/cdx. Required parameter: url. Key optional parameters: output=json, fl= (select fields), collapse=digest (deduplicate), filter=statuscode:200, from=/to= (date range), and limit=. Wildcards like *.example.com/* return all subpages and subdomains. The API returns index records — construct archive URLs as web.archive.org/web/{timestamp}id_/{url} to fetch actual content.
Can the Wayback Machine find deleted web pages?
Yes. If the Archive’s crawlers captured a page before deletion, the content remains permanently accessible. This works for deleted social media profiles, removed press releases, retracted statements, decommissioned web apps, and pages taken down under legal pressure. The archive captures approximately 498 million pages daily.
What is the difference between the Wayback Machine and CommonCrawl?
The Wayback Machine archives continuously since 1996 with a CDX API and browsable interface. CommonCrawl performs monthly crawls published as bulk datasets on AWS S3. Wayback is better for specific URL history; CommonCrawl is better for cross-domain bulk analysis. Use both for maximum coverage.
How often does the Wayback Machine archive pages?
Frequency varies by site popularity. Popular sites may be archived multiple times per day; smaller sites a few times per year. As of late 2025, approximately 498 million pages are captured daily (150 TB of new data). You can manually trigger archiving via Save Page Now (web.archive.org/save/).