Senior 4 min · June 25, 2026

Design a Web Crawler: Build a Production Crawler That Won't Get You Banned

Q: How do I design a web crawler that respects robots.txt?

Fetch and parse robots.txt before crawling any domain. Cache it with a 24-hour TTL. Extract the crawl-delay directive and use it to throttle requests per domain. If no crawl-delay, default to 1 second. Re-fetch robots.txt if you get a 403 or 429.

Q: What's the difference between a web crawler and a web scraper?

A web crawler discovers and downloads web pages by following links, typically for indexing. A web scraper extracts specific data from a page. Crawlers are broader; scrapers are targeted. A crawler may use a scraper to extract data from pages it downloads.

Q: How do I handle JavaScript-rendered pages in a web crawler?

Use a headless browser like Puppeteer or Playwright, but only as a fallback. First, try a regular HTTP fetch. If the page has no meaningful content (e.g., an empty <div id="root">), then render with a headless browser. Cache the rendered HTML and reuse browser contexts.

Q: What's the best way to deduplicate URLs in a web crawler at scale?

Use a Bloom filter for memory efficiency, combined with a Redis set for correctness. Check the Bloom filter first; if it says the URL might exist, verify with Redis. A URL the Bloom filter has definitely not seen never touches Redis, so nearly every lookup for a new URL skips the network round trip while accuracy is maintained.

Design a web crawler that respects robots.txt, handles rate limiting, and scales.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren


⚡Quick Answer

Design a web crawler by starting with a frontier (URL queue), a fetcher with politeness delays, a parser for links, and a deduplication store. Use a distributed architecture with message queues for scale. Always respect robots.txt and throttle requests per domain.

✦ Definition · ~90s read

What Is a Web Crawler?

A web crawler is a bot that systematically downloads web pages to index content or extract data. Production crawlers must handle politeness, deduplication, scaling, and error recovery.


Plain-English First

Imagine you're a librarian tasked with cataloging every book in a city. You can't run into every library at once — you'd get thrown out. So you plan a route, visit one library at a time, wait a bit between visits, and take notes on where to go next. A web crawler does the same: it politely visits websites, waits between requests, and follows links to discover new pages.

Most web crawler tutorials are toys. They show you how to fetch a page with requests and call it a day. In production, that naive approach gets your IP banned, crashes your database, and costs you thousands in bandwidth. I've seen a startup's entire crawling pipeline grind to a halt because they forgot to deduplicate URLs — they downloaded the same page 50,000 times. This article is the real deal: how to design a web crawler that respects robots.txt, handles rate limiting, scales to millions of pages, and doesn't get you sued. By the end, you'll be able to architect a distributed crawler that runs for weeks without manual intervention.

The Frontier: Your Crawler's Brain

The frontier is the queue of URLs to crawl. Without it, your crawler is a headless chicken. The naive approach is a FIFO queue, but that ignores politeness and priority. You need a priority queue that respects robots.txt crawl-delay and gives fresh content higher priority. I've seen teams use a simple Redis list and wonder why they get banned — because they hammer the same domain with 100 concurrent requests. The correct design: a set of per-domain queues, each with its own rate limiter. When a URL is discovered, hash the domain, push to that domain's queue. A scheduler picks the next domain whose rate limit allows a request. This ensures you never exceed the crawl-delay for any domain.

FrontierDesign.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Frontier using Redis per-domain queues
// 'domains' is a sorted set of domains scored by the earliest time each may be fetched again
// 'queue:<domain>' is a list of pending URLs for that domain
// Scheduler pops the domain with the lowest score and fetches only if it has waited >= crawl-delay

// Pseudocode for scheduler:
function nextUrl() {
    while (true) {
        domain = redis.zpopmin('domains')  // domain with the earliest next-allowed time
        if (domain == null) {
            sleep(100ms)  // frontier is empty, wait for newly discovered URLs
            continue
        }
        lastCrawl = redis.get('lastcrawl:' + domain) ?? 0  // never crawled => ready now
        delay = getCrawlDelay(domain)  // from robots.txt cache
        if (now - lastCrawl < delay) {
            // earliest domain is not ready, so none are: push back and wait
            redis.zadd('domains', lastCrawl + delay, domain)
            sleep(100ms)
            continue
        }
        url = redis.lpop('queue:' + domain)
        if (url == null) continue  // queue drained; enqueue re-adds the domain later
        redis.set('lastcrawl:' + domain, now)
        redis.zadd('domains', now + delay, domain)  // keep the domain scheduled
        return url
    }
}

Output

No direct output; this is a design pattern.

Production Trap: Single-Queue Starvation

If you use a single FIFO queue, a slow domain (e.g., one with 10-second crawl-delay) blocks all other domains. Always use per-domain queues with separate rate limiters.
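To make the starvation point concrete, here is a minimal in-memory sketch of a per-domain frontier. It is an illustration under stated assumptions, not the Redis design above: `PerDomainFrontier`, its methods, and the `crawlDelayMs` callback are hypothetical names, and the clock is passed in as a number so the behavior is easy to follow.

```typescript
// In-memory stand-in for a per-domain frontier (illustrative only).
class PerDomainFrontier {
  private queues = new Map<string, string[]>();
  private nextAllowedMs = new Map<string, number>(); // domain -> earliest next fetch time

  // crawlDelayMs is a hypothetical lookup, e.g. backed by a robots.txt cache.
  constructor(private crawlDelayMs: (domain: string) => number) {}

  enqueue(url: string): void {
    const domain = new URL(url).hostname;
    const queue = this.queues.get(domain) ?? [];
    queue.push(url);
    this.queues.set(domain, queue);
  }

  // Returns a URL from the ready domain that has waited longest, or null if no domain is ready.
  next(nowMs: number): string | null {
    let best: string | null = null;
    let bestReadyAt = Infinity;
    this.queues.forEach((queue, domain) => {
      if (queue.length === 0) return;
      const readyAt = this.nextAllowedMs.get(domain) ?? 0;
      if (readyAt <= nowMs && readyAt < bestReadyAt) {
        best = domain;
        bestReadyAt = readyAt;
      }
    });
    if (best === null) return null;
    const chosen: string = best;
    this.nextAllowedMs.set(chosen, nowMs + this.crawlDelayMs(chosen));
    return this.queues.get(chosen)!.shift()!;
  }
}

const frontier = new PerDomainFrontier((d) => (d === "slow.example" ? 10_000 : 1_000));
frontier.enqueue("https://slow.example/a");
frontier.enqueue("https://slow.example/b");
frontier.enqueue("https://fast.example/a");
frontier.enqueue("https://fast.example/b");
```

With a 10-second delay on slow.example, the second fast.example URL becomes available after 1 second instead of waiting behind slow.example, which is exactly what a single FIFO queue cannot do.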

[Figure: Production Web Crawler Architecture]

Politeness: Don't Be a Jerk

Politeness is the single most important aspect of a production crawler. Ignore it and you'll get IP-banned, blocked by Cloudflare, or sued. The rules: respect robots.txt, obey crawl-delay, and throttle to a reasonable rate per domain. I've seen a team set a global delay of 1 second and still get banned because they had 100 workers hitting the same domain simultaneously. The fix: per-domain token bucket. Each domain gets a bucket with capacity = 1 and refill rate = 1/crawl-delay. Workers acquire a token before fetching. This ensures you never exceed the delay, even with hundreds of workers. Also cache robots.txt with a TTL of 24 hours, but re-fetch if you get a 403 or 429.

RateLimiter.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Per-domain token bucket rate limiter
class DomainRateLimiter {
    private buckets: Map<string, TokenBucket> = new Map()

    async acquire(domain: string): Promise<void> {
        let bucket = this.buckets.get(domain)
        if (!bucket) {
            // from robots.txt; default to 1 second when no crawl-delay is set
            const delay = (await this.getCrawlDelay(domain)) || 1
            // re-check after the await: a concurrent caller may have created the bucket
            bucket = this.buckets.get(domain) ?? new TokenBucket(1, 1 / delay)  // capacity 1, refill 1 per delay seconds
            this.buckets.set(domain, bucket)
        }
        await bucket.consume()  // blocks until token available
    }
}

Output

No direct output; this is a design pattern.
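The rate limiter above relies on a `TokenBucket` class that is not shown. A minimal sketch of one possible implementation follows. The constructor signature matches the call above (capacity, refill rate per second); the injectable `now` clock and the `tryConsume` method are assumptions added here to make the behavior easy to check.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;

  constructor(
    private capacity: number,
    private refillPerSecond: number,
    private now: () => number = () => Date.now(), // injectable clock (assumption)
  ) {
    this.tokens = capacity;
    this.lastRefillMs = this.now();
  }

  // Adds the tokens earned since the last refill, never exceeding capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }

  // Takes a token and returns 0 if one is available; otherwise returns the wait in ms.
  tryConsume(): number {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
  }

  // Waits until a token is available, then takes it.
  async consume(): Promise<void> {
    let waitMs = this.tryConsume();
    while (waitMs > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
      waitMs = this.tryConsume();
    }
  }
}
```

Capacity 1 matters: a larger capacity would let a domain that has been idle receive a burst of back-to-back requests, which defeats the crawl-delay. Note that this bucket lives in one process; workers on different machines would need a shared store to enforce the same limit.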

Senior Shortcut: Cache robots.txt Aggressively

Robots.txt rarely changes. Cache it for 24 hours. But if you get a 403 or 429, re-fetch immediately: the site may have updated its rules.
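As a rough sketch of the caching advice, the block below parses a Crawl-delay value and keeps it in a TTL cache. `parseCrawlDelay` and `CrawlDelayCache` are hypothetical names, the parser is deliberately simplified (substring user-agent matching, no Allow/Disallow handling), and Crawl-delay itself is a non-standard extension that not every site or crawler uses.

```typescript
// Returns the Crawl-delay (seconds) that applies to userAgent, or the default.
function parseCrawlDelay(robotsTxt: string, userAgent: string, defaultSeconds = 1): number {
  const ua = userAgent.toLowerCase();
  let groupAgents: string[] = [];
  let lastWasAgent = false; // consecutive User-agent lines share one group
  let specific: number | null = null;
  let wildcard: number | null = null;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon < 0) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      if (!lastWasAgent) groupAgents = [];
      groupAgents.push(value.toLowerCase());
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;
    if (field !== "crawl-delay") continue;
    const seconds = Number(value);
    if (value === "" || !isFinite(seconds) || seconds < 0) continue;
    if (groupAgents.some((a) => a !== "*" && a !== "" && ua.indexOf(a) !== -1)) specific = seconds;
    else if (groupAgents.indexOf("*") !== -1) wildcard = seconds;
  }
  return specific ?? wildcard ?? defaultSeconds;
}

class CrawlDelayCache {
  private entries = new Map<string, { delaySeconds: number; fetchedAtMs: number }>();
  constructor(private ttlMs = 24 * 60 * 60 * 1000) {}

  put(domain: string, robotsTxt: string, userAgent: string, nowMs: number): void {
    this.entries.set(domain, {
      delaySeconds: parseCrawlDelay(robotsTxt, userAgent),
      fetchedAtMs: nowMs,
    });
  }

  // Returns undefined when the entry is missing or older than the TTL, signalling a re-fetch.
  get(domain: string, nowMs: number): number | undefined {
    const entry = this.entries.get(domain);
    if (!entry || nowMs - entry.fetchedAtMs >= this.ttlMs) return undefined;
    return entry.delaySeconds;
  }

  // Call on a 403 or 429 so the next lookup forces a fresh robots.txt fetch.
  invalidate(domain: string): void {
    this.entries.delete(domain);
  }
}
```

A group that names the crawler specifically wins over the `*` group, and the 1-second default applies only when neither sets a delay.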

[Figure: Politeness Flow: Avoid Getting Banned]

Deduplication: Never Fetch the Same URL Twice

Without deduplication, your crawler will fetch the same page thousands of times. The naive approach is a hash set of all seen URLs, but that doesn't scale: a billion URLs needs tens of gigabytes of RAM or more. Use a Bloom filter. It's probabilistic: you might get false positives (it reports a URL as seen when it isn't, so you would skip a page you haven't crawled) but never false negatives (a URL you added is always reported as seen, so you never fetch a page twice). A Bloom filter with 1% false positive rate for 1 billion URLs needs about 1.2GB of memory. That's acceptable. For absolute correctness, back it with a Redis set of crawled URLs. Check the Bloom filter first, and go to Redis only when the filter reports a possible match. A definite miss never touches Redis, so nearly every lookup for a new URL skips the network round trip.
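The 1.2GB figure follows from the standard Bloom filter sizing formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A small sketch that reproduces the arithmetic (`bloomFilterSize` is a hypothetical helper name):

```typescript
// Optimal Bloom filter size for n expected items at false positive rate p.
function bloomFilterSize(expectedItems: number, falsePositiveRate: number) {
  const bits = Math.ceil(
    (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2),
  );
  const hashFunctions = Math.max(1, Math.round((bits / expectedItems) * Math.LN2));
  return { bits, bytes: Math.ceil(bits / 8), hashFunctions };
}

const sizing = bloomFilterSize(1e9, 0.01);
console.log((sizing.bytes / 1e9).toFixed(2), "GB,", sizing.hashFunctions, "hash functions");
// prints: 1.20 GB, 7 hash functions
```

That works out to roughly 9.6 bits per URL regardless of URL length, which is where the saving over a plain hash set comes from.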

Deduplication.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Bloom filter + Redis set for deduplication
class UrlDeduplicator {
    private bloom: BloomFilter  // initialized with expected count and false positive rate
    private redis: RedisClient
    
    async isDuplicate(url: string): Promise<boolean> {
        if (this.bloom.mightContain(url)) {
            // false positive possible, check Redis
            return await this.redis.sismember('crawled_urls', url)
        }
        return false
    }
    
    async markCrawled(url: string): Promise<void> {
        this.bloom.add(url)
        await this.redis.sadd('crawled_urls', url)
    }
}

Output

No direct output; this is a design pattern.

The Classic Bug: Forgetting to Check Before Enqueue

You must deduplicate at enqueue time, not just before fetch. Otherwise, the same URL gets enqueued multiple times by different workers, wasting resources.
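A minimal sketch of enqueue-time deduplication follows. `DedupingEnqueuer` is a hypothetical name, a plain `Set` stands in for the Bloom filter plus Redis set, and stripping the URL fragment is an assumption about what counts as "the same URL".

```typescript
class DedupingEnqueuer {
  private seen = new Set<string>(); // stand-in for the Bloom filter + Redis set
  public queue: string[] = [];

  // Returns true only the first time a URL is offered.
  enqueueIfNew(rawUrl: string): boolean {
    const url = new URL(rawUrl);
    url.hash = ""; // assumption: fragments never change the fetched document
    const key = url.toString();
    if (this.seen.has(key)) return false;
    this.seen.add(key); // mark at enqueue time, before any worker fetches it
    this.queue.push(key);
    return true;
  }
}

const enqueuer = new DedupingEnqueuer();
const firstOffer = enqueuer.enqueueIfNew("https://example.com/a");
const fragmentOffer = enqueuer.enqueueIfNew("https://example.com/a#top");
const repeatOffer = enqueuer.enqueueIfNew("https://example.com/a");
```

With several workers, the check and the mark have to be a single atomic step. In Redis, `SADD` returns 1 when the member was newly added and 0 when it already existed, so its return value can serve as that test.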

[Figure: Deduplication: Hash Set vs Bloom Filter]

Distributed Architecture: Crawling at Scale

A single machine can't crawl the entire web. You need a distributed system. The standard pattern: a master node manages the frontier and assigns work to worker nodes. Workers fetch pages, parse links, and send discovered URLs back to the master. Use a message queue like RabbitMQ or Kafka for communication. The master publishes URL batches to a 'to-crawl' queue; workers consume, crawl, and publish discovered URLs to a 'discovered' queue. The master consumes 'discovered' and adds to the frontier. This decouples workers and allows easy scaling. But watch out: if a worker crashes mid-crawl, you lose that page. Implement a 're-queue' mechanism: if a worker doesn't acknowledge within a timeout, the URL goes back to the queue.

DistributedCrawler.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Master node pseudocode
async function masterLoop() {
    while (true) {
        // get next URL from frontier (respecting politeness)
        url = await frontier.nextUrl()
        // publish to RabbitMQ
        await channel.sendToQueue('to_crawl', Buffer.from(url))
    }
}

// Worker node pseudocode
async function workerLoop() {
    channel.consume('to_crawl', async (msg) => {
        const url = msg.content.toString()
        try {
            const response = await fetch(url)
            const html = await response.text()  // fetch resolves to a Response, not the body
            const links = parseLinks(html, url)
            // send discovered URLs to master
            for (const link of links) {
                await channel.sendToQueue('discovered', Buffer.from(link))
            }
            channel.ack(msg)
        } catch (err) {
            // re-queue on failure
            channel.nack(msg, false, true)
        }
    })
}

Output

No direct output; this is a design pattern.
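The re-queue-on-timeout idea can be sketched independently of any broker. The `LeaseTracker` below is a hypothetical in-memory helper: the master records a deadline when it hands out a URL, clears it on acknowledgement, and periodically collects the URLs whose deadline has passed. Many brokers provide part of this themselves, so treat it as an illustration of the mechanism rather than something to bolt onto RabbitMQ as-is.

```typescript
class LeaseTracker {
  private inFlight = new Map<string, number>(); // url -> lease deadline (ms)

  constructor(private timeoutMs: number) {}

  // Record that a worker has been handed this URL.
  lease(url: string, nowMs: number): void {
    this.inFlight.set(url, nowMs + this.timeoutMs);
  }

  // Worker finished: the URL no longer needs watching.
  ack(url: string): void {
    this.inFlight.delete(url);
  }

  // Returns URLs whose lease expired and forgets them, so the caller can re-queue them.
  expired(nowMs: number): string[] {
    const overdue: string[] = [];
    this.inFlight.forEach((deadline, url) => {
      if (deadline <= nowMs) overdue.push(url);
    });
    overdue.forEach((url) => this.inFlight.delete(url));
    return overdue;
  }
}
```

Pick the timeout well above the slowest expected fetch (headless rendering included), otherwise healthy workers get their URLs re-queued underneath them and pages are fetched twice.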

Production Trap: Message Queue Backpressure

If workers are slower than the master, the 'to-crawl' queue grows unbounded. Use a bounded queue with a max size. When full, the master pauses. Also monitor queue depth — alert if it exceeds a threshold.
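In its simplest form, backpressure is a queue that refuses new items when full, so the producer has to pause. The sketch below uses a hypothetical in-memory `BoundedQueue`; with a real broker the same signal would come from the broker's own length limit or from polling queue depth.

```typescript
class BoundedQueue<T> {
  private items: T[] = [];

  constructor(private maxSize: number) {}

  // Returns false instead of growing past maxSize; the caller should pause and retry later.
  tryPush(item: T): boolean {
    if (this.items.length >= this.maxSize) return false;
    this.items.push(item);
    return true;
  }

  pop(): T | undefined {
    return this.items.shift();
  }

  get depth(): number {
    return this.items.length;
  }
}

// Master side: publish until the queue pushes back, and keep the rest for later.
function publishUntilFull(urls: string[], queue: BoundedQueue<string>): string[] {
  let index = 0;
  while (index < urls.length && queue.tryPush(urls[index])) index++;
  return urls.slice(index); // URLs that did not fit
}
```

The leftover URLs stay in the frontier, which is the right place for a backlog: the frontier knows about politeness and priority, the message queue does not.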

Handling JavaScript-Rendered Pages

Many modern websites are SPAs that load content via JavaScript. For those, a simple HTTP GET returns an empty shell. You need a headless browser like Puppeteer or Playwright. But this is expensive: each page takes seconds and consumes 100MB+ of RAM. Never use a headless browser for every page. First, try a regular fetch. If the page contains no meaningful content (e.g., no text, only scripts), then fall back to a headless browser. Cache the rendered HTML for a TTL. Also, use a pool of browsers to reuse contexts. I've seen a team spin up a new browser per page; they ran out of file descriptors in 10 minutes.

HeadlessCrawler.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Fallback to headless browser
async function fetchWithFallback(url: string): Promise<string> {
    const response = await fetch(url)
    const html = await response.text()
    if (hasMeaningfulContent(html)) {
        return html
    }
    // fallback to headless
    const browser = await pool.acquire()
    let page  // declared outside try so the finally block can see it
    try {
        page = await browser.newPage()
        await page.goto(url, { waitUntil: 'networkidle' })
        const content = await page.content()
        return content
    } finally {
        if (page) await page.close()  // newPage() may have thrown before page was set
        pool.release(browser)
    }
}

Output

No direct output; this is a design pattern.

Never Do This: Headless Browser for Every Page

Headless browsers are 10-100x slower and more memory-hungry than plain HTTP. Only use them when necessary. Detect SPA pages by checking for a <div id="root"> with no children.
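The fallback code calls `hasMeaningfulContent` without defining it. One possible heuristic, sketched below, strips scripts, styles, and tags with regular expressions and counts the visible text that remains. The 200-character threshold is an assumption to tune per site, and regex stripping is only a rough approximation of real HTML parsing.

```typescript
// Heuristic: does the raw HTML carry enough visible text to skip headless rendering?
function hasMeaningfulContent(html: string, minTextChars = 200): boolean {
  const visibleText = html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, " ") // drop inline scripts
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, " ")   // drop stylesheets
    .replace(/<[^>]+>/g, " ")                            // drop remaining tags
    .replace(/\s+/g, " ")
    .trim();
  return visibleText.length >= minTextChars;
}

const spaShell =
  '<html><head><title>App</title><script>window.__DATA__ = {};</script></head>' +
  '<body><div id="root"></div></body></html>';
const needsRendering = !hasMeaningfulContent(spaShell);
```

A text-length check also catches shells that use a mount point other than `<div id="root">`, at the cost of occasionally rendering pages that are simply short.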

Error Handling and Retries

The web is unreliable. You'll get timeouts, 500s, DNS failures, and connection resets. Your crawler must handle these gracefully. Implement exponential backoff with jitter. Start with 1 second, double each retry, cap at 60 seconds, and add random jitter to avoid thundering herd. Max retries: 3 for transient errors, 0 for 4xx (client errors). For 429 (rate limit), respect Retry-After header. I've seen a crawler retry a 404 five times — wasted bandwidth and filled logs. Also, log every error with context: URL, status code, response headers, and stack trace. This is invaluable for debugging.

RetryLogic.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

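async function fetchWithRetry(url: string, maxRetries = 3): Promise<Response> {
    // exponential backoff with jitter: 1s, 2s, 4s, ... capped at 60s
    const backoff = (attempt: number) =>
        Math.min(1000 * Math.pow(2, attempt) + Math.random() * 1000, 60000)
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const isLast = attempt === maxRetries
        try {
            const response = await fetch(url)
            const retryable = response.status === 429 || response.status >= 500
            // 2xx, 3xx, and other 4xx are returned immediately; so is the final failed attempt
            if (!retryable || isLast) return response
            if (response.status === 429) {
                // Retry-After may be seconds or an HTTP date; fall back to 60s if not a number
                const seconds = parseInt(response.headers.get('Retry-After') ?? '', 10)
                await sleep((Number.isNaN(seconds) ? 60 : seconds) * 1000)
            } else {
                await sleep(backoff(attempt))  // transient 5xx
            }
        } catch (err) {
            // network-level failure: timeout, DNS error, connection reset
            if (isLast) throw err
            await sleep(backoff(attempt))
        }
    }
    throw new Error(`retries exhausted for ${url}`)  // unreachable, keeps the return type honest
}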

Output

No direct output; this is a design pattern.

Senior Shortcut: Use a Circuit Breaker

If a domain returns 5xx for 10 consecutive requests, stop crawling it for an hour. Use a circuit breaker pattern to avoid hammering a downed server.
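A minimal sketch of that circuit breaker, using the thresholds from the text (10 consecutive 5xx responses, one-hour cooldown). `DomainCircuitBreaker` and its method names are hypothetical, and the clock is passed in explicitly to keep the logic easy to follow.

```typescript
class DomainCircuitBreaker {
  private consecutiveFailures = new Map<string, number>();
  private openUntilMs = new Map<string, number>();

  constructor(
    private failureThreshold = 10,
    private cooldownMs = 60 * 60 * 1000,
  ) {}

  // False while the breaker is open for this domain.
  canRequest(domain: string, nowMs: number): boolean {
    return nowMs >= (this.openUntilMs.get(domain) ?? 0);
  }

  // Call after every response with its HTTP status.
  record(domain: string, status: number, nowMs: number): void {
    if (status < 500) {
      this.consecutiveFailures.set(domain, 0); // any non-5xx response resets the streak
      return;
    }
    const failures = (this.consecutiveFailures.get(domain) ?? 0) + 1;
    if (failures >= this.failureThreshold) {
      this.openUntilMs.set(domain, nowMs + this.cooldownMs);
      this.consecutiveFailures.set(domain, 0); // start a fresh count after the cooldown
    } else {
      this.consecutiveFailures.set(domain, failures);
    }
  }
}
```

The scheduler would call `canRequest` before handing out a URL for the domain and `record` after each response. This version goes straight from open back to closed after the cooldown; a half-open state that allows a single probe request is a common refinement.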

Storage: Where to Put All That Data

You need to store the crawled pages for indexing or analysis. The naive approach is to dump raw HTML into files on local disk. That doesn't scale. Use a distributed storage system like HDFS or S3. Partition by domain or date. Store metadata (URL, crawl timestamp, response headers, status code) in a database like Cassandra or DynamoDB for fast lookups. Raw HTML goes to blob storage. For small-scale crawls (millions of pages), a single PostgreSQL database works: PostgreSQL compresses large text values and stores them out of line automatically (TOAST). Beyond that scale, storing HTML in a relational database kills performance, so move it to separate blob storage. I've seen a team store 10TB of HTML in MySQL; the database became unusable.

StorageDesign.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Store metadata in Cassandra, raw HTML in S3
async function storePage(url: string, html: string, headers: Headers, status: number) {
    const key = hashUrl(url)
    const s3Key = `pages/${key}.html`
    // raw HTML to S3 first, so a metadata row never points at a missing object
    await s3.putObject({
        Bucket: 'crawled-pages',
        Key: s3Key,
        Body: html,
        ContentType: 'text/html'
    }).promise()
    // metadata
    await cassandra.execute(
        'INSERT INTO pages (url_hash, url, crawl_time, status, content_type, s3_key) VALUES (?, ?, ?, ?, ?, ?)',
        [key, url, new Date(), status, headers.get('content-type'), s3Key]
    )
}

Output

No direct output; this is a design pattern.

Production Trap: Hot Partition in Cassandra

If you partition by domain, a popular domain (e.g., wikipedia.org) creates a hot partition. Use a hash of the URL as partition key to distribute evenly.
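To see why hashing the URL spreads a single popular domain across partitions, here is a small sketch. `fnv1a32` and `partitionFor` are illustrative names, and the 32-bit FNV-1a hash is chosen only because it fits in a few lines: it is fine for picking a partition but far too collision-prone to serve as a storage key for billions of URLs, where a cryptographic hash is the usual choice.

```typescript
// 32-bit FNV-1a over UTF-16 code units (illustration only, not a storage key).
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0; // force unsigned
}

function partitionFor(url: string, partitionCount: number): number {
  return fnv1a32(url) % partitionCount;
}

// Every URL below is on the same domain, yet they land on different partitions.
const partitionCounts = new Array<number>(8).fill(0);
for (let i = 0; i < 1000; i++) {
  partitionCounts[partitionFor(`https://en.wikipedia.org/wiki/Page_${i}`, 8)]++;
}
```

Partitioning by domain would have sent all 1,000 of these rows to one partition. The trade-off is that "all pages for a domain" is no longer a single-partition read, so that query needs a separate index if you rely on it.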

Monitoring and Observability

A crawler runs unattended for days. You need monitoring. Track: pages crawled per second, error rate by status code, queue depth, memory usage, and politeness violations. Alert on: error rate > 5%, queue depth > 100k, memory > 80%, or any 429 spike. Log every request with URL, status, duration, and worker ID. Use structured logging (JSON) and ship to Elasticsearch. I've seen a team not monitor queue depth — the master crashed, workers finished their queue and sat idle for 8 hours before anyone noticed. Also, set up a health endpoint that reports the crawler's state: frontier size, last crawl time per domain, and error counts.

Monitoring.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Health endpoint example (Express.js)
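app.get('/health', async (req, res) => {
    // amqplib's checkQueue resolves to { queue, messageCount, consumerCount }
    const { messageCount } = await channel.checkQueue('to_crawl')
    const stats = {
        frontierSize: await redis.zcard('domains'),  // domains currently scheduled
        totalCrawled: Number(await redis.get('stats:crawled')) || 0,  // Redis returns a string
        errorRate: await calculateErrorRate(),
        queueDepth: messageCount,
        memoryUsage: Math.round(process.memoryUsage().heapUsed / 1024 / 1024)  // MB
    }
    res.json(stats)
})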

Output

{"frontierSize": 15000, "totalCrawled": 250000, "errorRate": 0.02, "queueDepth": 5000, "memoryUsage": 512}

Interview Gold: How to Detect a Crawler Loop

If your crawler keeps discovering the same URLs (e.g., calendar links with infinite dates), you'll loop forever. Detect by tracking URL pattern frequency. If a pattern (e.g., /events?date=*) generates >1000 unique URLs, apply a limit.
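One way to track pattern frequency is to reduce each URL to host, path, and query parameter names, then cap how many URLs a pattern may contribute. The sketch below makes that assumption explicit: `UrlPatternLimiter` is a hypothetical name, and it only catches traps that vary query values, not ones that encode the date in the path.

```typescript
class UrlPatternLimiter {
  private counts = new Map<string, number>();

  constructor(private maxPerPattern = 1000) {}

  // Pattern = host + path + sorted query parameter names, with values replaced by "*".
  static patternOf(rawUrl: string): string {
    const url = new URL(rawUrl);
    const names: string[] = [];
    url.searchParams.forEach((_value, name) => names.push(name));
    names.sort();
    const query = names.map((n) => `${n}=*`).join("&");
    return url.hostname + url.pathname + (query ? "?" + query : "");
  }

  // Counts the URL against its pattern; returns false once the pattern is over its cap.
  allow(rawUrl: string): boolean {
    const pattern = UrlPatternLimiter.patternOf(rawUrl);
    const count = (this.counts.get(pattern) ?? 0) + 1;
    this.counts.set(pattern, count);
    return count <= this.maxPerPattern;
  }
}
```

Call `allow` after deduplication so each unique URL is counted once. For example, `UrlPatternLimiter.patternOf("https://example.com/events?date=2026-01-01")` yields `example.com/events?date=*`, so every date in an endless calendar counts against the same cap.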

● Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom

A container running a Python crawler would OOM-kill after about 2 hours of crawling. No error logs, just a crash.

Assumption

The team assumed it was a memory leak in their custom HTML parser.

Root cause

The crawler was storing the entire HTML of every page in an in-memory list for deduplication. After 50,000 pages, the list consumed 3.8GB. The Bloom filter they thought they had was never actually initialized — it was a no-op.

Fix

Implemented a proper Bloom filter with Redis backend. Set max memory per page to 1MB. Added a watchdog that logs memory usage every 1000 pages.

Key lesson

Never trust a data structure you didn't see initialized.
Always monitor memory in production crawlers.

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit.

Symptom · 01

Crawler is getting HTTP 429 from many domains

Fix

1. Check per-domain rate limiter configuration.
2. Verify crawl-delay from robots.txt is being respected.
3. Check if multiple workers are hitting the same domain concurrently.
4. Reduce worker count or increase delay.

Symptom · 02

Crawler is crawling the same URLs repeatedly

Fix

1. Check Bloom filter false positive rate.
2. Verify Redis set is being populated.
3. Check if deduplication is applied at enqueue time.
4. Flush Redis set and restart if corrupted.

Symptom · 03

Crawler is stuck and not making progress

Fix

1. Check frontier queue depth.
2. Check if any domain has a very long crawl-delay (e.g., 3600 seconds).
3. Check if master is publishing to message queue.
4. Check worker logs for errors.
5. Restart master if queue is empty.

★ Web Crawler Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

Crawler is getting `429 Too Many Requests`

Immediate action

Check per-domain rate limiter config

Commands

redis-cli get 'lastcrawl:example.com'

redis-cli get 'crawldelay:example.com'

Fix now

Increase delay or reduce worker count. Set CRAWL_DELAY_OVERRIDE=5 in env.

Duplicate pages being crawled

Crawler OOM crash

Crawler not making progress

Feature / Aspect	Single-Queue Frontier	Per-Domain Queue Frontier
Politeness	Poor: one slow domain blocks all	Excellent: each domain independent
Complexity	Low	Medium
Scalability	Low: single queue bottleneck	High: distributed per domain
Starvation risk	High: slow domains starve fast ones	None: each domain gets fair share

Key takeaways

Always respect robots.txt and crawl-delay: politeness is non-negotiable.

Use per-domain queues with separate rate limiters to avoid starvation.

Deduplicate URLs at enqueue time using a Bloom filter + Redis set.

Never use a headless browser for every page: fall back only when needed.

Monitor queue depth, error rate, and memory: alert on anomalies.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does your crawler handle a domain that returns 429 after every reque...

Q02SENIOR

When would you choose a Bloom filter over a Redis set for URL deduplicat...

Q03SENIOR

What happens when your crawler encounters a URL with an infinite calenda...

Q04JUNIOR

How do you ensure politeness when crawling thousands of domains concurre...

Q05SENIOR

Your crawler is running on 100 machines. One machine crashes mid-crawl. ...

Q06SENIOR

Design a crawler that can crawl 1 billion pages per day with 99.9% uptim...

Q01 of 06SENIOR

How does your crawler handle a domain that returns 429 after every request, even with a 10-second delay?

ANSWER

Check if the server is using a per-IP rate limit that your distributed workers share. Use a proxy pool to rotate IPs. Also verify you're not sending too many concurrent requests from the same IP.

