code.davidloor.com

Open prompt · 30 min recommended

Design a Web Crawler

You have 30 minutes. Sketch the system in this notes pane.

Scope

Functional requirements

Non-functional requirements

Out of scope

Suggested approach

  1. Clarify requirements: scope (entire web vs. domain-scoped), depth limit, refresh cycle (one-off crawl or periodic re-crawl), storage format, politeness constraints
  2. High-level design: a URL Frontier (queue), a pool of Fetcher workers, a Content Store, a Link Extractor, and a URL Deduplicator
  3. API + data model: no external API. Internally, the frontier is a priority queue of (scheduled_at, url) entries; fetched content is stored as (url, content_hash, html, fetched_at) in object storage
  4. Storage + caching: the URL frontier can be backed by a distributed queue (Kafka, SQS), with per-host politeness enforced through a delay queue; seen-URL deduplication uses a Bloom filter as a fast first check, backed by a persistent KV store to confirm hits, since a Bloom filter can report false positives
  5. Bottlenecks + mitigations: URL queue size, DNS resolution latency, politeness throttling, spider traps
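
The frontier, politeness, and deduplication ideas in steps 3 and 4 can be sketched in a few lines. This is a minimal single-process illustration, not a production design: the class name `Frontier`, the `per_host_delay` parameter, and the example URLs are hypothetical, and a plain Python `set` stands in for the Bloom filter plus KV store, while a `heapq` min-heap stands in for the distributed delay queue.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    """In-memory URL frontier: a min-heap of (scheduled_at, url) entries."""

    def __init__(self, per_host_delay=1.0):
        self.per_host_delay = per_host_delay  # assumed politeness gap, in seconds
        self._heap = []        # (scheduled_at, url), earliest entry first
        self._next_slot = {}   # host -> earliest time its next fetch may start
        self._seen = set()     # stand-in for Bloom filter + persistent KV store

    def add(self, url, now):
        """Schedule url unless it was already seen; return True if enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc
        # Never schedule a host earlier than its next free politeness slot.
        scheduled_at = max(now, self._next_slot.get(host, now))
        self._next_slot[host] = scheduled_at + self.per_host_delay
        heapq.heappush(self._heap, (scheduled_at, url))
        return True

    def pop_ready(self, now):
        """Return the next URL whose scheduled time has arrived, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None


frontier = Frontier(per_host_delay=2.0)
frontier.add("https://example.com/a", now=0.0)
frontier.add("https://example.com/b", now=0.0)  # same host, pushed back 2 s
frontier.add("https://example.org/", now=0.0)   # different host, ready at once
duplicate_added = frontier.add("https://example.com/a", now=0.0)
```

Here the second `example.com` URL is not handed out until two seconds after the first, while the `example.org` URL is available immediately, and the repeated URL is rejected. In a distributed crawler the same per-host bookkeeping would typically live with whichever worker or partition owns that host.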

Reference talking points

Your notes