Senior 4 min · June 25, 2026

Design a Web Crawler: Build a Production Crawler That Won't Get You Banned

Design a web crawler that respects robots.txt, handles rate limiting, and scales.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Last updated: June 25, 2026
Quick Answer

Design a web crawler by starting with a frontier (URL queue), a fetcher with politeness delays, a parser for links, and a deduplication store. Use a distributed architecture with message queues for scale. Always respect robots.txt and throttle requests per domain.

What Is a Web Crawler?

A web crawler is a bot that systematically downloads web pages to index content or extract data. Production crawlers must handle politeness, deduplication, scaling, and error recovery.

Plain-English First

Imagine you're a librarian tasked with cataloging every book in a city. You can't run into every library at once — you'd get thrown out. So you plan a route, visit one library at a time, wait a bit between visits, and take notes on where to go next. A web crawler does the same: it politely visits websites, waits between requests, and follows links to discover new pages.

Most web crawler tutorials are toys. They show you how to fetch a page with requests and call it a day. In production, that naive approach gets your IP banned, crashes your database, and costs you thousands in bandwidth. I've seen a startup's entire crawling pipeline grind to a halt because they forgot to deduplicate URLs — they downloaded the same page 50,000 times. This article is the real deal: how to design a web crawler that respects robots.txt, handles rate limiting, scales to millions of pages, and doesn't get you sued. By the end, you'll be able to architect a distributed crawler that runs for weeks without manual intervention.

The Frontier: Your Crawler's Brain

The frontier is the queue of URLs to crawl. Without it, your crawler is a headless chicken. The naive approach is a FIFO queue, but that ignores politeness and priority. You need a priority queue that respects robots.txt crawl-delay and gives fresh content higher priority. I've seen teams use a simple Redis list and wonder why they get banned — because they hammer the same domain with 100 concurrent requests. The correct design: a set of per-domain queues, each with its own rate limiter. When a URL is discovered, hash the domain, push to that domain's queue. A scheduler picks the next domain whose rate limit allows a request. This ensures you never exceed the crawl-delay for any domain.

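// Frontier scheduler over Redis per-domain queues (sketch; assumes an ioredis-style client `redis`).
// 'domains' is a sorted set: member = domain, score = earliest time (ms) it may be fetched again.
// 'queue:<domain>' is a list of pending URLs; 'lastcrawl:<domain>' holds the last fetch time (ms).
// getCrawlDelayMs(domain) is assumed to read the cached robots.txt crawl-delay in milliseconds.
// Run a single scheduler: the pop/push sequence below is not atomic.

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

async function nextUrl(): Promise<string> {
    while (true) {
        // Pop the domain that becomes ready first: [member, score], or [] when empty.
        const [domain] = await redis.zpopmin('domains')
        if (!domain) {
            await sleep(100)  // frontier is empty, wait for newly discovered URLs
            continue
        }
        const now = Date.now()
        const lastCrawl = Number(await redis.get('lastcrawl:' + domain)) || 0
        const delayMs = await getCrawlDelayMs(domain)
        if (now - lastCrawl < delayMs) {
            // Not ready: push back with score = lastCrawl + delay, then wait instead of spinning.
            await redis.zadd('domains', lastCrawl + delayMs, domain)
            await sleep(Math.min(lastCrawl + delayMs - now, 100))
            continue
        }
        const url = await redis.lpop('queue:' + domain)
        if (url) {
            await redis.set('lastcrawl:' + domain, now)
            // Re-schedule the domain so its remaining URLs are not lost.
            await redis.zadd('domains', now + delayMs, domain)
            return url
        }
        // Queue drained: the domain rejoins 'domains' when a new URL is enqueued for it.
    }
}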
Output
No direct output; this is a design pattern.
Production Trap: Single-Queue Starvation
If you use a single FIFO queue, a slow domain (e.g., one with 10-second crawl-delay) blocks all other domains. Always use per-domain queues with separate rate limiters.
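To make the per-domain idea concrete without Redis, here is a minimal in-memory sketch. The class name `PerDomainFrontier`, the `delayMsFor` callback, and the explicit `now` argument are all illustrative assumptions, not part of the design above; the point is only that a slow domain never blocks a ready one.

```typescript
// In-memory sketch of a per-domain frontier. All names are illustrative.
class PerDomainFrontier {
    private queues = new Map<string, string[]>()  // domain -> pending URLs (FIFO)
    private readyAt = new Map<string, number>()   // domain -> earliest next fetch time (ms)
    private delayMsFor: (domain: string) => number

    constructor(delayMsFor: (domain: string) => number) {
        this.delayMsFor = delayMsFor
    }

    enqueue(url: string): void {
        const domain = new URL(url).hostname
        const queue = this.queues.get(domain) ?? []
        queue.push(url)
        this.queues.set(domain, queue)
    }

    // Returns a URL from the domain that has been ready the longest,
    // or null if no domain with pending URLs is ready at `now`.
    next(now: number): string | null {
        let best: string | null = null
        let bestReady = Infinity
        for (const [domain, queue] of Array.from(this.queues)) {
            if (queue.length === 0) continue
            const ready = this.readyAt.get(domain) ?? 0
            if (ready <= now && ready < bestReady) {
                best = domain
                bestReady = ready
            }
        }
        if (best === null) return null
        const url = this.queues.get(best)!.shift()!
        // The domain may not be fetched again until its crawl-delay has passed.
        this.readyAt.set(best, now + this.delayMsFor(best))
        return url
    }
}
```

With a 1-second delay for every domain, two URLs on the same host are handed out a second apart, while a URL on a different host is available immediately in between.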
[Figure: Production Web Crawler Architecture. Core components for a polite, scalable, and robust crawler: Frontier Manager (URL queue with priority and politeness domains), Politeness Engine (per-domain rate limiting, crawl delays, robots.txt), Deduplication Filter (Bloom filter or hash set to avoid refetching URLs), Fetcher Worker (HTTP client with retries, JS rendering, and error handling), Storage Layer (raw HTML, metadata, and extracted content in a DB or blob store).]

Politeness: Don't Be a Jerk

Politeness is the single most important aspect of a production crawler. Ignore it and you'll get IP-banned, blocked by Cloudflare, or sued. The rules: respect robots.txt, obey crawl-delay, and throttle to a reasonable rate per domain. I've seen a team set a global delay of 1 second and still get banned because they had 100 workers hitting the same domain simultaneously. The fix: per-domain token bucket. Each domain gets a bucket with capacity = 1 and refill rate = 1/crawl-delay. Workers acquire a token before fetching. This ensures you never exceed the delay, even with hundreds of workers. Also cache robots.txt with a TTL of 24 hours, but re-fetch if you get a 403 or 429.

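// Per-domain token bucket rate limiter (sketch).
// TokenBucket(capacity, refillPerSecond) and getCrawlDelay() are assumed helpers, not shown here.
// This map lives in one process; workers on different machines need a shared limiter.
class DomainRateLimiter {
    // Store promises so concurrent callers for a new domain share a single bucket.
    private buckets: Map<string, Promise<TokenBucket>> = new Map()

    async acquire(domain: string): Promise<void> {
        let bucket = this.buckets.get(domain)
        if (!bucket) {
            bucket = this.createBucket(domain)
            this.buckets.set(domain, bucket)
        }
        await (await bucket).consume()  // resolves once a token is available
    }

    private async createBucket(domain: string): Promise<TokenBucket> {
        // Crawl-delay in seconds from robots.txt; fall back to 1s if missing or zero.
        const delay = (await this.getCrawlDelay(domain)) || 1
        return new TokenBucket(1, 1 / delay)  // capacity 1, one token per `delay` seconds
    }
}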
Output
No direct output; this is a design pattern.
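The limiter above relies on a `TokenBucket` that is not shown. One possible shape is sketched below, assuming a constructor of `(capacity, refillPerSecond)` and an injectable clock in milliseconds so it can be tested without sleeping. The method names `tryConsume` and `msUntilAvailable` are illustrative; a blocking `consume()` could be built on top by sleeping for `msUntilAvailable()` and retrying.

```typescript
// Minimal token bucket with an injectable clock (all times in milliseconds).
class TokenBucket {
    private capacity: number
    private refillPerSecond: number
    private tokens: number
    private lastRefill: number

    constructor(capacity: number, refillPerSecond: number, now: number = Date.now()) {
        this.capacity = capacity
        this.refillPerSecond = refillPerSecond
        this.tokens = capacity  // start full so the first request goes out immediately
        this.lastRefill = now
    }

    // Adds the tokens earned since the last call, never exceeding capacity.
    private refill(now: number): void {
        const elapsedSeconds = (now - this.lastRefill) / 1000
        this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond)
        this.lastRefill = now
    }

    // Takes one token if available; returns false when the caller must wait.
    tryConsume(now: number = Date.now()): boolean {
        this.refill(now)
        if (this.tokens >= 1) {
            this.tokens -= 1
            return true
        }
        return false
    }

    // Milliseconds until the next token is available (0 if one is ready now).
    msUntilAvailable(now: number = Date.now()): number {
        this.refill(now)
        if (this.tokens >= 1) return 0
        return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000)
    }
}
```

With capacity 1 and a refill rate of 0.5 tokens per second (a 2-second crawl-delay), the first request passes immediately and the next is allowed exactly 2 seconds later, no matter how many workers ask in between.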
Senior Shortcut: Cache robots.txt Aggressively
Robots.txt rarely changes. Cache it for 24 hours. But if you get a 403 or 429, re-fetch immediately: the site may have updated its rules.
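The rules you cache have to come from somewhere. Below is a deliberately simplified robots.txt parser, as a sketch only: it handles `User-agent`, `Disallow`, and `Crawl-delay`, and ignores `Allow`, wildcards, and `Sitemap`. The names `RobotsRules`, `parseRobotsTxt`, and `isAllowed` are illustrative. Keep in mind that `Crawl-delay` is a widely used extension rather than part of the Robots Exclusion Protocol standard (RFC 9309), so many sites will not set it and you still need a sensible default.

```typescript
interface RobotsRules {
    disallow: string[]                // path prefixes that must not be fetched
    crawlDelaySeconds: number | null  // null when the file sets no Crawl-delay
}

// Returns the rules for `userAgent`, falling back to the `*` group.
// Simplified: ignores Allow, wildcards (* and $), and Sitemap lines.
function parseRobotsTxt(text: string, userAgent: string): RobotsRules {
    const groups = new Map<string, RobotsRules>()
    let current: RobotsRules[] = []
    let lastWasAgent = false

    for (const rawLine of text.split(/\r?\n/)) {
        const line = rawLine.split('#')[0].trim()  // strip comments
        const colon = line.indexOf(':')
        if (colon < 0) continue
        const field = line.slice(0, colon).trim().toLowerCase()
        const value = line.slice(colon + 1).trim()

        if (field === 'user-agent') {
            // Consecutive User-agent lines share the rules that follow them.
            if (!lastWasAgent) current = []
            const key = value.toLowerCase()
            const rules = groups.get(key) ?? { disallow: [], crawlDelaySeconds: null }
            groups.set(key, rules)
            current.push(rules)
            lastWasAgent = true
            continue
        }
        lastWasAgent = false
        for (const rules of current) {
            if (field === 'disallow' && value !== '') rules.disallow.push(value)
            if (field === 'crawl-delay' && value !== '' && !Number.isNaN(Number(value))) {
                rules.crawlDelaySeconds = Number(value)
            }
        }
    }
    return groups.get(userAgent.toLowerCase())
        ?? groups.get('*')
        ?? { disallow: [], crawlDelaySeconds: null }
}

// Prefix match only; a production parser must also honor Allow and wildcards.
function isAllowed(rules: RobotsRules, path: string): boolean {
    return !rules.disallow.some((prefix) => path.startsWith(prefix))
}
```

For anything beyond a prototype, a maintained robots.txt library is the safer choice, since real files contain many edge cases this sketch skips.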
[Figure: Politeness Flow: Avoid Getting Banned. Steps: parse robots.txt (crawl-delay and disallowed paths), enforce the per-domain delay, throttle requests per second per domain, queue politely with a delay-aware priority queue.]

Deduplication: Never Fetch the Same URL Twice

Without deduplication, your crawler will fetch the same page thousands of times. The naive approach is a hash set of all seen URLs, but that doesn't scale: a billion URLs needs tens of gigabytes of RAM. Use a Bloom filter. It's probabilistic: you might get false positives (the filter reports a URL as seen when it was never added, so on its own it would skip that page) but never false negatives (a URL you added is always reported as seen, so you never fetch it twice). A Bloom filter with a 1% false positive rate for 1 billion URLs needs about 1.2GB of memory, roughly 9.6 bits per URL. That's acceptable. For absolute correctness, back it with an exact Redis set of seen URLs: check the Bloom filter first, and query Redis only when the filter reports a possible hit. A filter miss is definitive, so most never-seen URLs skip the Redis round trip entirely. Note that this saves lookups, not memory: the exact set still stores every URL.
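The 1.2GB figure follows from the standard Bloom filter sizing formulas. The helper below is a small sketch (the function name `bloomParameters` is illustrative) that computes the bit count and number of hash functions for a given item count and false-positive rate.

```typescript
// Optimal Bloom filter parameters for n items at false-positive rate p:
//   bits   m = -n * ln(p) / (ln 2)^2
//   hashes k = (m / n) * ln 2
function bloomParameters(expectedItems: number, falsePositiveRate: number) {
    const bits = Math.ceil((-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2))
    const hashCount = Math.max(1, Math.round((bits / expectedItems) * Math.LN2))
    return { bits, hashCount, bytes: Math.ceil(bits / 8) }
}

const billionUrls = bloomParameters(1e9, 0.01)
console.log(billionUrls.hashCount)                 // 7
console.log((billionUrls.bytes / 1e9).toFixed(2))  // "1.20" (GB)
```

Tightening the rate to 0.1% costs about 50% more memory, so the false-positive target is worth choosing deliberately rather than by default.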

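// Bloom filter in front of an exact Redis set (sketch; assumes an ioredis-style client
// and a BloomFilter exposing mightContain() and add()).
class UrlDeduplicator {
    private bloom: BloomFilter  // sized from expected URL count and target false-positive rate
    private redis: RedisClient

    constructor(bloom: BloomFilter, redis: RedisClient) {
        this.bloom = bloom
        this.redis = redis
    }

    async isDuplicate(url: string): Promise<boolean> {
        // A Bloom filter miss is definitive: the URL has never been added.
        if (!this.bloom.mightContain(url)) return false
        // A hit may be a false positive, so confirm against the exact set.
        return (await this.redis.sismember('crawled_urls', url)) === 1
    }

    async markCrawled(url: string): Promise<void> {
        this.bloom.add(url)
        await this.redis.sadd('crawled_urls', url)
    }
}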
Output
No direct output; this is a design pattern.
The Classic Bug: Forgetting to Check Before Enqueue
You must deduplicate at enqueue time, not just before fetch. Otherwise, the same URL gets enqueued multiple times by different workers, wasting resources.
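Enqueue-time deduplication only works if equivalent URLs map to the same key. Here is a minimal sketch using the standard `URL` class; the function `normalizeUrl`, the class `DedupingQueue`, and the trailing-slash rule are assumptions for illustration (some sites do treat `/docs` and `/docs/` as different pages).

```typescript
// Canonicalizes a URL so trivially different spellings dedupe to one key.
function normalizeUrl(raw: string): string {
    const url = new URL(raw)    // lowercases scheme and host, drops default ports
    url.hash = ''               // fragments are never sent to the server
    url.searchParams.sort()     // ?b=2&a=1 and ?a=1&b=2 are the same page
    if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
        url.pathname = url.pathname.slice(0, -1)  // assumption: /docs/ equals /docs
    }
    return url.toString()
}

// A URL enters the queue only the first time its normalized form is seen.
class DedupingQueue {
    private seen = new Set<string>()
    private pending: string[] = []

    enqueue(raw: string): boolean {
        const key = normalizeUrl(raw)
        if (this.seen.has(key)) return false
        this.seen.add(key)
        this.pending.push(key)
        return true
    }

    size(): number {
        return this.pending.length
    }
}
```

In a real crawler the in-memory `Set` would be replaced by the Bloom filter plus exact store described above; the normalization step stays the same.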
[Figure: Deduplication: Hash Set vs Bloom Filter. Hash set: exact, no false positives, about 32 bytes per URL (1B URLs → 32 GB RAM). Bloom filter: probabilistic with a small false-positive rate, tunable accuracy vs. memory.]

Distributed Architecture: Crawling at Scale

A single machine can't crawl the entire web. You need a distributed system. The standard pattern: a master node manages the frontier and assigns work to worker nodes. Workers fetch pages, parse links, and send discovered URLs back to the master. Use a message queue like RabbitMQ or Kafka for communication. The master publishes URL batches to a 'to-crawl' queue; workers consume, crawl, and publish discovered URLs to a 'discovered' queue. The master consumes 'discovered' and adds to the frontier. This decouples workers and allows easy scaling. But watch out: if a worker crashes mid-crawl, you lose that page. Implement a 're-queue' mechanism: if a worker doesn't acknowledge within a timeout, the URL goes back to the queue.

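// Master and worker loops over RabbitMQ (sketch; assumes an amqplib channel named `channel`,
// a `frontier` object exposing nextUrl(), and a parseLinks() helper).

// Master: pull the next polite URL from the frontier and publish it.
async function masterLoop() {
    while (true) {
        const url = await frontier.nextUrl()
        channel.sendToQueue('to_crawl', Buffer.from(url), { persistent: true })
    }
}

// Worker: fetch, parse, report discovered links, then acknowledge.
async function workerLoop() {
    await channel.prefetch(1)  // at most one unacknowledged URL per worker
    await channel.consume('to_crawl', async (msg) => {
        if (msg === null) return  // consumer was cancelled by the broker
        const url = msg.content.toString()
        try {
            const response = await fetch(url)
            const html = await response.text()  // fetch() resolves to a Response, not a string
            for (const link of parseLinks(html, url)) {
                channel.sendToQueue('discovered', Buffer.from(link))
            }
            channel.ack(msg)
        } catch (err) {
            // Requeue for another attempt. In production, cap retries (for example with a
            // dead-letter queue) so a permanently failing URL does not loop forever.
            channel.nack(msg, false, true)
        }
    })
}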
Output
No direct output; this is a design pattern.
Production Trap: Message Queue Backpressure
If workers are slower than the master, the 'to-crawl' queue grows unbounded. Use a bounded queue with a max size. When full, the master pauses. Also monitor queue depth — alert if it exceeds a threshold.
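The re-queue-on-timeout and bounded-queue ideas from this section can be shown together in a small in-memory model. This is a sketch only: `LeaseQueue` and its method names are illustrative, and a broker such as RabbitMQ already redelivers unacknowledged messages when a consumer's connection drops, so you would rarely hand-roll this in production.

```typescript
// Bounded work queue with leases: a URL that is not acked before its
// deadline is handed out again, and a full queue tells the producer to pause.
class LeaseQueue {
    private ready: string[] = []
    private inFlight = new Map<string, number>()  // url -> lease deadline (ms)
    private leaseMs: number
    private maxReady: number

    constructor(leaseMs: number, maxReady: number) {
        this.leaseMs = leaseMs
        this.maxReady = maxReady
    }

    // Returns false when the queue is full (backpressure signal for the master).
    offer(url: string): boolean {
        if (this.ready.length >= this.maxReady) return false
        this.ready.push(url)
        return true
    }

    // Hands out the next URL and starts its lease; null if nothing is ready.
    lease(now: number): string | null {
        this.requeueExpired(now)
        const url = this.ready.shift()
        if (url === undefined) return null
        this.inFlight.set(url, now + this.leaseMs)
        return url
    }

    // Worker finished: drop the lease so the URL is not redelivered.
    ack(url: string): void {
        this.inFlight.delete(url)
    }

    // Moves URLs whose worker went silent back to the front of the ready list.
    private requeueExpired(now: number): void {
        for (const [url, deadline] of Array.from(this.inFlight)) {
            if (deadline <= now) {
                this.inFlight.delete(url)
                this.ready.unshift(url)
            }
        }
    }
}
```

The lease length should comfortably exceed the slowest expected fetch, otherwise healthy workers get their URLs redelivered and pages are crawled twice.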

Handling JavaScript-Rendered Pages

Modern websites are SPAs that load content via JavaScript. A simple HTTP GET returns an empty shell. You need a headless browser like Puppeteer or Playwright. But this is expensive: each page takes seconds and consumes 100MB+ of RAM. Never use a headless browser for every page. First, try a regular fetch. If the page contains no meaningful content (e.g., no text, only scripts), then fall back to a headless browser. Cache the rendered HTML for a TTL. Also, use a pool of browsers to reuse contexts. I've seen a team spin up a new browser per page — they ran out of file descriptors in 10 minutes.

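// Try a plain HTTP fetch first; render in a headless browser only if the HTML is an empty shell.
// Assumes `pool` hands out Playwright browser instances and hasMeaningfulContent() is defined elsewhere.
async function fetchWithFallback(url: string): Promise<string> {
    const response = await fetch(url)
    const html = await response.text()
    if (hasMeaningfulContent(html)) {
        return html
    }
    // Fall back to a headless browser.
    const browser = await pool.acquire()
    let page  // declared outside try so the finally block can close it
    try {
        page = await browser.newPage()
        // 'networkidle' is Playwright's name; Puppeteer uses 'networkidle0' or 'networkidle2'.
        await page.goto(url, { waitUntil: 'networkidle' })
        return await page.content()
    } finally {
        if (page) await page.close()
        pool.release(browser)
    }
}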
Output
No direct output; this is a design pattern.
Never Do This: Headless Browser for Every Page
Headless browsers are 10-100x slower and more memory-hungry than plain HTTP. Only use them when necessary. Detect SPA pages by checking for a <div id="root"> with no children.
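One way to implement the "no meaningful content" check is to strip scripts, styles, and tags and measure how much visible text remains. The sketch below is a heuristic only; the 200-character threshold is an assumption to tune against your own targets, and regex-based tag stripping is crude compared with a real HTML parser.

```typescript
// Heuristic: does the raw HTML already contain enough visible text to skip rendering?
function hasMeaningfulContent(html: string, minTextLength = 200): boolean {
    const visibleText = html
        .replace(/<script[\s\S]*?<\/script>/gi, ' ')  // inline and external scripts
        .replace(/<style[\s\S]*?<\/style>/gi, ' ')    // embedded CSS
        .replace(/<[^>]+>/g, ' ')                     // all remaining tags
        .replace(/\s+/g, ' ')
        .trim()
    return visibleText.length >= minTextLength
}
```

A typical SPA shell with only a title and an empty root `div` fails this check and goes to the headless path, while a server-rendered article passes and never touches a browser.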

Error Handling and Retries

The web is unreliable. You'll get timeouts, 500s, DNS failures, and connection resets. Your crawler must handle these gracefully. Implement exponential backoff with jitter. Start with 1 second, double each retry, cap at 60 seconds, and add random jitter to avoid a thundering herd. Max retries: 3 for transient errors (network failures and 5xx), 0 for 4xx client errors other than 429. For 429 (rate limit), respect the Retry-After header. I've seen a crawler retry a 404 five times, which wasted bandwidth and filled logs. Also, log every error with context: URL, status code, response headers, and stack trace. This is invaluable for debugging.

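// Fetch with retries: exponential backoff plus jitter for network errors and 5xx,
// Retry-After for 429, and no retry for other 4xx responses.
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms))

// 1s, 2s, 4s... plus up to 1s of jitter, capped at 60s.
function backoffMs(attempt: number): number {
    return Math.min(1000 * 2 ** attempt + Math.random() * 1000, 60000)
}

async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
    for (let attempt = 0; ; attempt++) {
        const lastAttempt = attempt >= maxRetries
        try {
            const response = await fetch(url)
            const retryable = response.status === 429 || response.status >= 500
            // Success, non-retryable 4xx, or out of retries: hand the response to the caller.
            if (!retryable || lastAttempt) return response
            if (response.status === 429) {
                // Retry-After may be seconds or an HTTP date; fall back to 60s if not numeric.
                const seconds = Number(response.headers.get('Retry-After'))
                await sleep(Number.isFinite(seconds) && seconds > 0 ? seconds * 1000 : 60000)
                continue
            }
        } catch (err) {
            // Network-level failure (DNS, connection reset, timeout).
            if (lastAttempt) throw err
        }
        await sleep(backoffMs(attempt))
    }
}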
Output
No direct output; this is a design pattern.
Senior Shortcut: Use a Circuit Breaker
If a domain returns 5xx for 10 consecutive requests, stop crawling it for an hour. Use a circuit breaker pattern to avoid hammering a downed server.
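A per-domain circuit breaker along these lines can be quite small. The sketch below assumes the thresholds from the tip above (10 consecutive 5xx responses, one-hour cooldown) as defaults; the class name `DomainCircuitBreaker` and its methods are illustrative, and time is passed in explicitly so the logic is easy to test.

```typescript
// Per-domain circuit breaker: opens after N consecutive 5xx responses,
// closes again once the cooldown has elapsed.
class DomainCircuitBreaker {
    private failures = new Map<string, number>()   // domain -> consecutive 5xx count
    private openUntil = new Map<string, number>()  // domain -> time (ms) crawling may resume
    private threshold: number
    private cooldownMs: number

    constructor(threshold = 10, cooldownMs = 60 * 60 * 1000) {
        this.threshold = threshold
        this.cooldownMs = cooldownMs
    }

    // Call before fetching; false means the domain is currently paused.
    canRequest(domain: string, now: number): boolean {
        const until = this.openUntil.get(domain)
        if (until === undefined) return true
        if (now < until) return false
        // Cooldown over: close the breaker and start counting from zero.
        this.openUntil.delete(domain)
        this.failures.delete(domain)
        return true
    }

    // Call after every response with its HTTP status.
    record(domain: string, status: number, now: number): void {
        if (status < 500) {
            this.failures.delete(domain)  // any non-5xx response resets the streak
            return
        }
        const count = (this.failures.get(domain) ?? 0) + 1
        this.failures.set(domain, count)
        if (count >= this.threshold) {
            this.openUntil.set(domain, now + this.cooldownMs)
        }
    }
}
```

The scheduler would check `canRequest()` before handing out a URL for a domain and leave that domain's queue untouched while the breaker is open.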

Storage: Where to Put All That Data

You need to store the crawled pages for indexing or analysis. The naive approach is to dump raw HTML into local files. That doesn't scale. Use a distributed storage system like HDFS or S3, partitioned by domain or date. Store metadata (URL, crawl timestamp, response headers, status code) in a database like Cassandra or DynamoDB for fast lookups. Raw HTML goes to blob storage. For small-scale crawls (millions of pages), a single PostgreSQL database works: PostgreSQL automatically compresses large text values and stores them out of line via TOAST. But watch out: at larger volumes, keeping raw HTML in a relational database hurts performance, so move it to separate blob storage. I've seen a team store 10TB of HTML in MySQL, and the database became unusable.

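// Store raw HTML in S3 and metadata in Cassandra (sketch).
// Assumes `s3` is an AWS SDK v2 S3 client, `cassandra` a cassandra-driver Client,
// and hashUrl() a stable hash of the normalized URL.
async function storePage(url: string, html: string, headers: Headers, status: number) {
    const key = hashUrl(url)
    const s3Key = `pages/${key}.html`
    // Write the blob first so a metadata row never points at a missing object.
    await s3.putObject({
        Bucket: 'crawled-pages',
        Key: s3Key,
        Body: html,
        ContentType: 'text/html'
    }).promise()
    // Then record the metadata, keyed by the URL hash.
    await cassandra.execute(
        'INSERT INTO pages (url_hash, url, crawl_time, status, content_type, s3_key) VALUES (?, ?, ?, ?, ?, ?)',
        [key, url, new Date(), status, headers.get('content-type'), s3Key],
        { prepare: true }  // prepared statement so parameter types map to the column types
    )
}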
Output
No direct output; this is a design pattern.
Production Trap: Hot Partition in Cassandra
If you partition by domain, a popular domain (e.g., wikipedia.org) creates a hot partition. Use a hash of the URL as partition key to distribute evenly.

Monitoring and Observability

A crawler runs unattended for days. You need monitoring. Track: pages crawled per second, error rate by status code, queue depth, memory usage, and politeness violations. Alert on: error rate > 5%, queue depth > 100k, memory > 80%, or any 429 spike. Log every request with URL, status, duration, and worker ID. Use structured logging (JSON) and ship to Elasticsearch. I've seen a team not monitor queue depth — the master crashed, workers finished their queue and sat idle for 8 hours before anyone noticed. Also, set up a health endpoint that reports the crawler's state: frontier size, last crawl time per domain, and error counts.

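// Health endpoint example (Express.js).
// Assumes an ioredis client `redis`, an amqplib channel `channel`, and a calculateErrorRate() helper.
app.get('/health', async (req, res) => {
    // checkQueue() resolves to { queue, messageCount, consumerCount }.
    const { messageCount } = await channel.checkQueue('to_crawl')
    const stats = {
        frontierSize: await redis.zcard('domains'),  // number of domains with pending URLs
        totalCrawled: Number(await redis.get('stats:crawled')) || 0,
        errorRate: await calculateErrorRate(),
        queueDepth: messageCount,
        memoryUsage: Math.round(process.memoryUsage().heapUsed / 1024 / 1024)  // heap in MB
    }
    res.json(stats)
})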
Output
{"frontierSize": 15000, "totalCrawled": 250000, "errorRate": 0.02, "queueDepth": 5000, "memoryUsage": 512}
Interview Gold: How to Detect a Crawler Loop
If your crawler keeps discovering the same URLs (e.g., calendar links with infinite dates), you'll loop forever. Detect by tracking URL pattern frequency. If a pattern (e.g., /events?date=*) generates >1000 unique URLs, apply a limit.
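Pattern-frequency tracking can be sketched in a few lines. The version below collapses numeric or date-like path segments and all query values into `*`, then caps how many URLs each pattern may contribute. The names `urlPattern` and `TrapDetector` are illustrative, and the detector assumes it is called once per unique URL, that is, after deduplication.

```typescript
// Collapses a URL into a pattern: numeric or date-like path segments
// and every query value become "*".
function urlPattern(raw: string): string {
    const url = new URL(raw)
    const path = url.pathname
        .split('/')
        .map((segment) => (/^[\d-]+$/.test(segment) ? '*' : segment))
        .join('/')
    const keys = Array.from(new Set(Array.from(url.searchParams.keys()))).sort()
    const query = keys.length > 0 ? '?' + keys.map((key) => key + '=*').join('&') : ''
    return url.hostname + path + query
}

class TrapDetector {
    private counts = new Map<string, number>()  // pattern -> URLs seen so far
    private limit: number

    constructor(limit = 1000) {
        this.limit = limit
    }

    // Returns true while the URL's pattern is under budget, false once the cap is hit.
    // Assumes each unique URL is passed in at most once (call it after dedup).
    allow(raw: string): boolean {
        const pattern = urlPattern(raw)
        const count = (this.counts.get(pattern) ?? 0) + 1
        this.counts.set(pattern, count)
        return count <= this.limit
    }
}
```

A hard cap is blunt, so in practice you might log the patterns that hit the limit and review them, since large legitimate sections (product listings, paginated archives) look the same as a calendar trap to this heuristic.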
Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
A container running a Python crawler would OOM-kill after about 2 hours of crawling. No error logs, just a crash.
Assumption
The team assumed it was a memory leak in their custom HTML parser.
Root cause
The crawler was storing the entire HTML of every page in an in-memory list for deduplication. After 50,000 pages, the list consumed 3.8GB. The Bloom filter they thought they had was never actually initialized — it was a no-op.
Fix
Implemented a proper Bloom filter with Redis backend. Set max memory per page to 1MB. Added a watchdog that logs memory usage every 1000 pages.
Key lesson
  • Never trust a data structure you didn't see initialized.
  • Always monitor memory in production crawlers.
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit.
Symptom · 01
Crawler is getting HTTP 429 from many domains
Fix
1. Check per-domain rate limiter configuration. 2. Verify crawl-delay from robots.txt is being respected. 3. Check if multiple workers are hitting the same domain concurrently. 4. Reduce worker count or increase delay.
Symptom · 02
Crawler is crawling the same URLs repeatedly
Fix
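1. Confirm the Bloom filter is actually initialized and being populated (a high false-positive rate causes skipped pages, not repeats). 2. Verify the Redis set is being populated. 3. Check that deduplication is applied at enqueue time. 4. Check that URLs are normalized before the dedup check. 5. Flush the Redis set and restart only if it is corrupted.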
Symptom · 03
Crawler is stuck and not making progress
Fix
1. Check frontier queue depth. 2. Check if any domain has a very long crawl-delay (e.g., 3600 seconds). 3. Check if master is publishing to message queue. 4. Check worker logs for errors. 5. Restart master if queue is empty.
Web Crawler Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
Crawler is getting `429 Too Many Requests`
Immediate action
Check per-domain rate limiter config
Commands
redis-cli get 'lastcrawl:example.com'
redis-cli get 'crawldelay:example.com'
Fix now
Increase delay or reduce worker count. Set CRAWL_DELAY_OVERRIDE=5 in env.
Duplicate pages being crawled
Immediate action
Check Bloom filter and Redis set
Commands
redis-cli scard 'crawled_urls'
redis-cli get 'bloom:size'
Fix now
Flush Bloom filter and Redis set, then restart. Add URL normalization.
Crawler OOM crash
Immediate action
Check memory usage per page
Commands
ps aux | grep crawler
cat /proc/<pid>/status | grep VmRSS
Fix now
Limit page size to 1MB. Use streaming parser. Reduce headless browser pool size.
Crawler not making progress
Immediate action
Check frontier and queue
Commands
redis-cli zcard 'domains'
rabbitmqctl list_queues
Fix now
If frontier empty, check master logs. If queue full, increase worker count.
Feature / Aspect | Single-Queue Frontier | Per-Domain Queue Frontier
Politeness | Poor: one slow domain blocks all | Excellent: each domain independent
Complexity | Low | Medium
Scalability | Low: single queue bottleneck | High: distributed per domain
Starvation risk | High: slow domains starve fast ones | None: each domain gets fair share

Key takeaways

1. Always respect robots.txt and crawl-delay: politeness is non-negotiable.
2. Use per-domain queues with separate rate limiters to avoid starvation.
3. Deduplicate URLs at enqueue time using a Bloom filter + Redis set.
4. Never use a headless browser for every page: fall back only when needed.
5. Monitor queue depth, error rate, and memory: alert on anomalies.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): How does your crawler handle a domain that returns 429 after every reque...
Q02 (Senior): When would you choose a Bloom filter over a Redis set for URL deduplicat...
Q03 (Senior): What happens when your crawler encounters a URL with an infinite calenda...
Q04 (Junior): How do you ensure politeness when crawling thousands of domains concurre...
Q05 (Senior): Your crawler is running on 100 machines. One machine crashes mid-crawl. ...
Q06 (Senior): Design a crawler that can crawl 1 billion pages per day with 99.9% uptim...

Q01 of 06 (Senior)

How does your crawler handle a domain that returns 429 after every request, even with a 10-second delay?

ANSWER
Check if the server is using a per-IP rate limit that your distributed workers share. Use a proxy pool to rotate IPs. Also verify you're not sending too many concurrent requests from the same IP.
Frequently Asked Questions

1. How do I design a web crawler that respects robots.txt?
2. What's the difference between a web crawler and a web scraper?
3. How do I handle JavaScript-rendered pages in a web crawler?
4. What's the best way to deduplicate URLs in a web crawler at scale?