Open prompt · 30 min recommended
Design a Web Crawler
00:00
Target 30:00
Design a Web Crawler
You have 30 minutes. Sketch the system in this notes pane.
Scope
Functional requirements
- Start from a seed set of URLs and discover new URLs by following links
- Download and store the HTML content of each page
- Avoid crawling the same URL twice (deduplication)
- Respect
robots.txtand crawl-delay directives
Non-functional requirements
- Throughput: crawl billions of pages; target ~100–1,000 pages/second at steady state
- Politeness: do not hammer a single host faster than its
robots.txtallows - Fault tolerance: worker crashes should not lose unprocessed URLs
- Extensible: the stored content should be processable by downstream jobs (indexing, training data, etc.)
Out of scope
- Full-text indexing or PageRank computation
- JavaScript rendering (headless browser) — v1 is static HTML only
- Link extraction from PDFs or other non-HTML formats
Suggested approach
- Clarify requirements — scope (entire web vs. domain-scoped), depth limit, refresh cycle (periodic re-crawl?), storage format, politeness constraints
- High-level design — a URL Frontier (queue), a pool of Fetcher workers, a Content Store, a Link Extractor, and a URL Deduplicator
- API + data model — no external API; internally: frontier is a priority queue of
(scheduled_at, url); content stored as(url, content_hash, html, fetched_at)in object storage - Storage + caching — the URL frontier can be backed by a distributed queue (Kafka, SQS) with per-host politeness enforcement via a delay queue; seen-URL deduplication via a bloom filter + a persistent KV store for confirmed duplicates
- Bottlenecks + mitigations — URL queue size, DNS resolution latency, politeness throttling, spider traps
Reference talking points
- URL Frontier architecture: separate the frontier into a priority queue (next fetch time) and a set of per-host queues. A Selector picks the host whose next fetch time has passed and hands a URL to a Fetcher. This cleanly enforces per-host crawl-delay.
- Deduplication with bloom filters: a bloom filter can tell you with high probability whether a URL has been seen before, using very little memory (a few bits per URL). False positives mean a small fraction of pages are skipped; false negatives are impossible. Use a persistent store as a backup for definitive dedup.
- Content deduplication: two different URLs can return the same HTML (mirrors, near-duplicates). Hash the content (MD5/SHA-1) and skip storage if the hash already exists. SimHash can detect near-duplicates.
- robots.txt: fetch and parse
https://{host}/robots.txtbefore crawling any page on that host. Cache it with a reasonable TTL (24 h). Disallow rules andCrawl-delaymust be respected to avoid being banned. - Politeness and rate limiting: throttle requests to any single host to at most one request per crawl-delay interval (default ~1 s if not specified). Use a per-host token bucket or a scheduled delay queue.
- Spider traps: sites can generate infinite URLs dynamically (e.g.,
?page=1,?page=2, ...). Defenses: URL length limit, max URL depth, canonicalization (remove session IDs and UTM params), and a cap on pages crawled per host per run. - URL canonicalization: normalize before dedup — lowercase scheme and host, remove default ports, sort query parameters, strip fragment identifiers.
- Distributed fetching: multiple Fetcher workers pull from the frontier. Assign URLs to workers by host hash so a single host is always handled by the same worker, simplifying per-host throttling.
- Storage: store raw HTML in object storage (S3); keep only the URL, fetch timestamp, status code, and content hash in a DB for fast lookup. Downstream jobs can process the raw HTML from object storage asynchronously.