AWS Lambda Cold Starts — Why P99 Spikes to 1.2s at 9 AM
Lambda cold starts added 800-1200ms to our /orders API every morning.
20+ years shipping production infrastructure and CI/CD at scale. Everything here is grounded in real deployments.
- AWS Lambda runs your code on demand without provisioning or managing servers
- Three core components: Functions (your code), Triggers (event sources), Execution Environment (isolated container)
- Cold starts add 100ms–1s latency when a new container spins up
- Performance insight: More memory = more CPU; tuning memory can reduce both cost and duration for compute-heavy tasks
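- Production insight: Lambda bills for actual execution time (rounded up to the nearest millisecond), not the configured timeout, but a hung downstream call keeps running and billing until the timeout fires, so always set timeouts realistically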
- Biggest mistake: Assuming /tmp is clean between invocations — it persists across warm starts, causing silent data corruption
Imagine you own a pizza shop but you only pay the chef when someone actually orders a pizza. The chef doesn't sit around waiting — they appear the moment an order comes in, make the pizza, then disappear. AWS Lambda is exactly that chef. You write a function, AWS runs it only when something triggers it, and you pay only for the milliseconds it runs. No server to babysit, no idle hours billed, no infrastructure to patch.
Every application needs compute power — something has to run your code. Traditionally, that meant renting a virtual machine or physical server that runs 24/7, even at 3 a.m. when zero users are online. You're paying for potential, not actual work. As cloud adoption exploded, this idle-cost problem became impossible to ignore, especially for startups and teams with unpredictable traffic spikes.
AWS Lambda, launched in 2014, flipped the model. Instead of managing servers, you upload a function — a single, focused piece of logic — and AWS handles everything else: provisioning, scaling, patching, and availability. The term 'serverless' doesn't mean there are no servers; it means YOU don't manage them. The servers exist, they're just Amazon's problem. This lets your team focus entirely on business logic instead of infrastructure operations.
By the end of this article you'll understand how Lambda executes code, how to wire it to real-world triggers like API Gateway and S3, how to avoid the cold start trap that kills performance, and how to structure a production-worthy serverless workflow. You'll also know exactly when Lambda is the right tool — and when it absolutely isn't.
AWS Lambda Serverless — The Execution Model That Bites at Scale
AWS Lambda is a function-as-a-service (FaaS) platform that runs your code in ephemeral, stateless containers. You upload a function, specify a trigger (API Gateway, SQS, S3, etc.), and AWS manages the underlying compute. The core mechanic: each invocation runs in a fresh or recycled sandbox, with no persistent local state across invocations. This is not a long-running process — it's a request-scoped execution that starts, runs, and dies within minutes.
When a Lambda function is invoked, the service either reuses a warm sandbox (if one is available) or creates a new one — this is the cold start. A cold start includes downloading your code, initializing the runtime (JVM in your case), and running your static initializers. For Java, this adds 500ms–1.2s of latency before your handler even executes. The sandbox lifecycle is opaque: you cannot pin a container, and AWS recycles them aggressively (typically after 5–15 minutes of idle time).
Use Lambda when you need elastic scaling with zero idle cost — bursty workloads, event-driven pipelines, or microservices that can tolerate sub-second startup latency. It's not for latency-sensitive user-facing endpoints at the 99th percentile unless you pre-warm or use Provisioned Concurrency. In production, the 9 AM spike is a classic pattern: a wave of concurrent requests hits cold containers simultaneously, amplifying P99 latency by 3–10x.
How AWS Lambda Actually Executes Your Code — The Execution Model
Lambda's execution model is the foundation everything else builds on. When a trigger fires — say, an HTTP request hits API Gateway — Lambda needs to run your function. If a pre-warmed container exists from a recent invocation, Lambda reuses it. This is a 'warm start' and it's fast. If no container is available, Lambda has to bootstrap one from scratch: download your code package, spin up a runtime environment, run any initialisation code outside your handler, then finally invoke your handler. That bootstrap phase is the dreaded cold start.
Cold starts typically add 100ms–1000ms of latency depending on the runtime (.NET and Java are heavier; Node.js and Python are lighter). For a background job this is irrelevant. For a user-facing API call, it's noticeable.
Your handler function receives two objects: the event (the payload that triggered the invocation — could be an HTTP body, an S3 event, a queue message) and the context (metadata about the invocation itself — function name, memory limit, request ID). Understanding this distinction is critical: the event is about WHAT happened, the context is about WHO is running.
Code outside the handler runs once per container lifecycle. That's where you put database connections, SDK clients, and config loading — doing it inside the handler means re-initialising on every single invocation, which is both slow and wasteful.
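As a minimal sketch of this split between init code and handler code, the example below simulates an expensive client at module scope. The names `make_expensive_client`, `CLIENT`, and `INIT_CALLS` are hypothetical stand-ins, not part of any AWS SDK; in a real function the client would be a database connection or SDK client.

```python
import json
import time

# Module scope: executed once per execution environment, during the cold start.
INIT_CALLS = 0


def make_expensive_client():
    """Hypothetical stand-in for a DB connection or SDK client."""
    global INIT_CALLS
    INIT_CALLS += 1
    time.sleep(0.05)  # simulate slow connection setup
    return {"connected": True}


CLIENT = make_expensive_client()


def handler(event, context):
    # event: WHAT happened (the trigger payload).
    # context: metadata about this invocation (request ID, memory limit, ...).
    request_id = getattr(context, "aws_request_id", "local-test")
    body = {
        "request_id": request_id,
        "order": event.get("order_id"),
        "connected": CLIENT["connected"],  # reuses the module-level client
    }
    return {"statusCode": 200, "body": json.dumps(body)}
```

Calling `handler` twice in the same process reuses `CLIENT`, which mirrors what happens across warm invocations: the setup cost is paid once, not per request.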
Use the Init Duration field in CloudWatch logs to measure cold-start time.
Wiring Lambda to the Real World — Triggers, Events, and API Gateway
A Lambda function sitting alone does nothing. It needs a trigger — an AWS service that says 'hey, something happened, go run'. The trigger determines the shape of the event object your handler receives, which is why reading the AWS event schema docs for each trigger type matters.
The most common triggers in production are: API Gateway (HTTP requests), S3 (file uploads/deletions), SQS (queue messages for async processing), EventBridge (scheduled cron jobs and event routing), DynamoDB Streams (react to database changes), and SNS (fan-out notifications).
API Gateway is the one you'll use for building REST APIs or webhooks. When a request hits your endpoint, API Gateway wraps it into a structured event object and hands it to Lambda. Your function returns a response object with a statusCode, headers, and body, and API Gateway translates that back into a real HTTP response.
The Lambda Proxy Integration model (the default and recommended approach) passes the raw request to your function and expects you to construct the full HTTP response yourself. This gives you complete control over status codes, CORS headers, and response bodies. Older tutorials show Lambda custom integrations — avoid them, they're fiddly and add complexity for no gain.
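The request and response shapes described above can be sketched as follows. This is an illustrative handler, assuming a JSON request body with a hypothetical `order_id` field; the wildcard CORS header is a placeholder that you would normally tighten.

```python
import json


def _response(status_code, body):
    # API Gateway expects statusCode, headers, and a *string* body.
    return {
        "statusCode": status_code,
        "headers": {
            "Content-Type": "application/json",
            "Access-Control-Allow-Origin": "*",  # CORS: tighten for production
        },
        "body": json.dumps(body),
    }


def handler(event, context):
    """Minimal handler for an API Gateway Lambda proxy integration."""
    # With proxy integration, the request body arrives as a string (or None).
    try:
        payload = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return _response(400, {"error": "body must be valid JSON"})

    if not isinstance(payload, dict) or "order_id" not in payload:
        return _response(422, {"error": "order_id is required"})

    return _response(200, {"order_id": payload["order_id"], "status": "accepted"})
```

Keeping response construction in one helper makes it harder to forget that `body` must be a string, which is a common source of 502 errors from API Gateway.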
For async workloads, SQS is your best friend. Rather than calling Lambda directly (which creates tight coupling), push messages to a queue and let Lambda poll and process them in batches. This naturally handles traffic bursts without rate-limit errors.
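A batch handler for an SQS trigger might look like the sketch below. It assumes the event source mapping has `ReportBatchItemFailures` enabled, so that returning `batchItemFailures` causes only the failed messages to be retried; `process_order` is hypothetical business logic.

```python
import json


def process_order(order):
    """Hypothetical business logic; raises on bad input."""
    if "order_id" not in order:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event.get("Records", []):
        try:
            process_order(json.loads(record["body"]))
        except Exception:
            # Report only this message as failed; successful ones are not retried.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without partial batch responses, one bad message would cause the whole batch to be redelivered, so every message in it would be processed again.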
Lambda Event Source Reference Table — What Triggers Your Function
The following table catalogs the most common Lambda event sources, their invocation model, payload size limits, retry behavior, and best-fit use cases. Knowing these details helps you design reliable, cost-efficient serverless workflows. For each source, the event structure is fixed by AWS — you cannot change the schema — so you must parse the documented fields correctly in your handler.
| Event Source | Invocation Type | Max Payload | Retry Behavior | Best For |
|---|---|---|---|---|
| API Gateway | Synchronous | 10 MB at API Gateway, but Lambda caps synchronous payloads at 6 MB | No automatic retries; client handles | HTTP/REST APIs, webhooks |
| S3 (Event Notifications) | Asynchronous | 128 KB (event record) | 2 retries (async) | File processing (image resize, logs, analytics) |
| DynamoDB Streams | Stream-based | 1 MB (batch) | Indefinite retry until data expires (24h) | React to DB changes (materialized views, sync) |
| Kinesis Data Streams | Stream-based | 1 MB (per record) | Indefinite retry until data expires (7 days) | Real-time data processing (clickstreams, logs) |
| SQS (Standard) | Poll-based (event source mapping) | 256 KB per message | Retries based on redrive policy | Async decoupling, buffering, batch processing |
| SQS (FIFO) | Poll-based (event source mapping) | 256 KB per message | Retries with exactly-once semantics | Ordered processing, deduplication |
| SNS (topic subscription) | Asynchronous | 256 KB | 2 retries (async) | Fan-out notifications to multiple subscribers |
| EventBridge (scheduled or event) | Asynchronous | 256 KB | 2 retries (async) | Cron jobs, event routing between AWS services |
| CloudFront (Lambda@Edge) | Synchronous | 1 MB | No automatic retries | Modify HTTP request/response at edge |
| Lambda Function URL | Synchronous | 6 MB (request/response) | No automatic retries | Simple HTTP endpoints without API Gateway |
Key details to remember:
- Asynchronous invocations (S3, SNS, EventBridge) retry twice with 1–2 minute delays. Always configure a dead-letter queue (DLQ) for these triggers.
- Stream-based triggers (DynamoDB, Kinesis) retry until the data record expires, so a persistent bug will block the entire shard. Use bisectBatchOnFunctionError to split batches on failure.
- Synchronous triggers (API Gateway, Lambda Function URL) do not retry; your client or upstream service must implement retry logic.
- Payload size limits are hard. S3 event notifications carry only object metadata (bucket, key, size), never the object's contents, so read the object from S3 inside your handler rather than expecting it in the event.
For a full list of event sources and their exact event schemas, refer to the [AWS Lambda Developer Guide — Using AWS Lambda with other services](https://docs.aws.amazon.com/lambda/latest/dg/lambda-services.html).
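Because the event schema is fixed per source, parsing it is mostly a matter of reading the documented fields. The sketch below pulls bucket and key pairs out of an S3 event notification; `extract_s3_objects` is a hypothetical helper name. Object keys arrive URL-encoded, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects
```

Skipping the decode step works until the first upload with a space or parenthesis in its name, at which point the follow-up read from S3 fails with a missing-key error.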
Cold Starts, Memory Tuning, and the Performance Levers You Actually Control
Lambda gives you one direct performance dial: memory. You set it anywhere from 128 MB to 10,240 MB. What most developers don't realise is that CPU allocation scales proportionally with memory. A 1,024 MB Lambda function gets roughly 8x the CPU of a 128 MB one. If your function is CPU-bound (image processing, data transformation, encryption), doubling the memory can halve the execution time — and since you pay for duration × memory, the cost often stays the same or even drops.
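The duration × memory arithmetic can be checked with a few lines. This is a sketch under stated assumptions: the per-GB-second price is an assumed x86 list price, and the two duration figures are hypothetical measurements for a CPU-bound function, not benchmarks.

```python
def compute_cost_usd(memory_mb, duration_ms, price_per_gb_second=0.0000166667):
    """Compute cost of one invocation: GB allocated x seconds billed x price."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


# Hypothetical measurements: doubling memory halves the duration.
small = compute_cost_usd(memory_mb=512, duration_ms=800)
large = compute_cost_usd(memory_mb=1024, duration_ms=400)

# If the larger setting finishes in less than half the time, it is cheaper.
faster = compute_cost_usd(memory_mb=1024, duration_ms=350)
```

Here `small` and `large` come out equal, while `faster` is lower than both, which is the case where raising memory reduces the bill as well as the latency.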
Cold starts are the other major lever. Three strategies exist: Provisioned Concurrency, keeping functions warm with scheduled EventBridge pings, and minimising package size.
Provisioned Concurrency is the only AWS-supported solution. You pay for a set number of pre-warmed containers to stay alive at all times. It costs more than on-demand but eliminates cold starts entirely for that concurrency slot. Use it for customer-facing APIs where tail latency matters.
Package size matters because Lambda has to download your deployment package before running it. A 50 MB Python package with unnecessary dependencies cold-starts noticeably slower than a 3 MB lean package. Use Lambda Layers to separate large dependencies (like numpy or Pillow) from your application code, and use .zip deployment packages rather than container images unless you specifically need Docker tooling.
Finally, watch your timeout setting. The default is 3 seconds. Downstream API calls, DB queries, and S3 operations can easily exceed this. Set it realistically (15 minutes max) and always handle partial failures gracefully.
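One way to handle partial failures gracefully is to stop before the timeout hits and hand back the unprocessed work. The sketch below uses `get_remaining_time_in_millis()`, which the Python runtime's context object provides; `process_items`, `FakeContext`, and the 2-second safety margin are illustrative assumptions.

```python
def process_items(items, context, safety_margin_ms=2000):
    """Process as many items as time allows; return the unprocessed remainder."""
    done = []
    for index, item in enumerate(items):
        # Stop early if the invocation is about to time out.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return {"processed": done, "remaining": items[index:]}
        done.append(item * 2)  # placeholder for real work
    return {"processed": done, "remaining": []}


class FakeContext:
    """Local stand-in for the Lambda context object, for testing only."""

    def __init__(self, budgets_ms):
        self._budgets = iter(budgets_ms)

    def get_remaining_time_in_millis(self):
        return next(self._budgets)
```

The remainder can then be re-queued (for example to SQS) instead of being lost when the runtime kills the invocation.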
Provisioned Concurrency vs Cold Start — Visual Breakdown
Provisioned Concurrency is the only AWS-native mechanism that guarantees zero cold starts for a fixed number of concurrent invocations. The two request flows below contrast an on-demand function (which may incur a cold start) with a function that has Provisioned Concurrency.
How it works: When you enable Provisioned Concurrency, Lambda pre-initialises a specified number of execution environments and keeps them warm. Incoming invocations are routed to these warm environments instantly. On-demand environments are still used for invocations beyond the provisioned count, so cold starts still occur when the provisioned pool is exhausted. The visual logic flow:
- On-Demand Path: Request arrives → check for warm container → if none found → cold start (init + handler delay).
- Provisioned Concurrency Path: Request arrives → route to pre-warmed container → warm start (handler only, no init delay).
The benefit is a 100% elimination of cold start latency for the initial set of concurrent requests. The cost is paying for those environments even when idle.
When to use it: Only for latency-critical production endpoints where p99 must stay below, say, 500ms. For batch processing or background jobs, on-demand is sufficient and cheaper.
When NOT to use it: If your function is rarely invoked (once per hour), the cost of keeping a container warm 24/7 will far exceed any performance benefit. A simple scheduled EventBridge ping (every 5 minutes) is cheaper and nearly as effective — though not guaranteed, as AWS may reclaim containers during maintenance.
Alternative warming patterns: A common pattern is to set up an EventBridge rule that invokes your function every 5 minutes with a synthetic event (e.g., a 'warmup' field). This keeps 1–2 containers warm without Provisioned Concurrency cost. However, this is unreliable under burst traffic — if multiple concurrent requests arrive simultaneously, only one container may be warm. Provisioned Concurrency guarantees capacity.
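On the function side, the warming pattern only needs an early return. This sketch assumes the scheduled rule sends an event containing a truthy `warmup` field, as in the example above; the field name is a convention you choose, not something AWS defines.

```python
def handler(event, context):
    # Synthetic ping from the scheduled rule: return before doing real work.
    if isinstance(event, dict) and event.get("warmup"):
        return {"warmed": True}

    # Real request path (placeholder logic).
    return {"statusCode": 200, "body": "order processed"}
```

Returning early keeps the ping cheap and prevents it from triggering side effects such as database writes.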
Monitor the ProvisionedConcurrencySpilloverInvocations metric to see how many requests exceed the provisioned pool.
Lambda Resource Limits & Constraints Table — What You Can't Change
Lambda has specific hard limits that constrain how you design your serverless applications. Exceeding these limits results in deployment failures, throttling, or runtime errors. The table below shows the most important limits — know them before you architect your system.
| Resource | Limit | Notes |
|---|---|---|
| Memory per function | 128 MB – 10,240 MB (in 1 MB increments) | CPU scales with memory; more memory = more CPU |
| Ephemeral storage /tmp | 512 MB | Shared across warm invocations; not reset on reuse |
| Maximum execution timeout | 15 minutes (900 seconds) | Hard limit; cannot be increased |
| Deployment package size (.zip) | 250 MB (unzipped), 50 MB (zipped for direct upload) | Layers count toward the 250 MB unzipped total; use a container image for larger packages |
| Container image size | 10 GB (ECR image) | Larger images cause slower cold starts |
| Concurrent executions per region (default) | 1,000 | Can be increased via service quota request |
| Concurrent executions per function (default) | 1,000 (unreserved) | Can be limited with reserved concurrency |
| Invocation payload size | 6 MB (synchronous), 256 KB (asynchronous) | For larger payloads, use S3 or streaming |
| Function environment variables | 4 KB total (unencrypted) | Use AWS Secrets Manager or Parameter Store for secrets |
| Lambda Layers per function | 5 | Layer size counts toward total unzipped limit (250 MB) |
| Event source mappings per function | 10 (for SQS, DynamoDB, Kinesis) | Add more by using multiple triggers |
| Reserved concurrency per function | 0 – regional limit | Setting reserved concurrency guarantees capacity but blocks other functions |
| Provisioned Concurrency per function | 0 – regional limit | Drawn from the same regional concurrency pool (1,000 by default) |
| Function execution role | AWS IAM role | Lambda attaches this role to the execution environment |
How to work around limits:
- Package size: Layers count toward the 250 MB unzipped limit, so they help you organise large libraries (pandas, OpenCV, etc.) but do not raise the ceiling. If you exceed 250 MB unzipped, deploy the function as a container image (up to 10 GB).
- Timeout: Lambda supports up to 15 minutes. For longer jobs, use AWS Step Functions to orchestrate multiple Lambda calls, or switch to Fargate/Batch.
- Concurrency: If you anticipate more than 1,000 concurrent executions, request a limit increase in the AWS Service Quotas console. Also consider using SQS buffering to smooth traffic.
- Payload size: For payloads larger than 256 KB, upload to S3 and pass the object key in the event. Lambda reads from S3 instead of the event body.
These limits are not negotiable — building against them from day one avoids costly refactors later.
Production Patterns: Error Handling, Retries, and Observability
Lambda's default retry behaviour depends on invocation type. Synchronous invocations (API Gateway, custom apps) do NOT retry automatically — your client must handle errors. Asynchronous invocations (S3, SNS, EventBridge) retry twice using built-in retry logic, then discard the event unless you configure a dead-letter queue (DLQ). Stream-based triggers (DynamoDB Streams, Kinesis) retry until the data expires (default 24 hours) and block the shard — meaning a permanently failing function stalls your stream.
For synchronous APIs, implement retries with exponential backoff in the caller, since Lambda will not retry for you. For async triggers, always attach a DLQ (SQS or SNS) to capture failed events. Without a DLQ, failed events vanish after two retries, and you'll never know.
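A retry loop with exponential backoff might look like the sketch below. It is a generic helper, not an AWS API: `call_with_backoff` and its defaults are illustrative, and the `sleep` parameter is injectable so the logic can be tested without waiting. Random jitter is added so that many clients retrying at once do not synchronise.

```python
import random
import time


def call_with_backoff(func, max_attempts=4, base_delay=0.2, max_delay=5.0, sleep=time.sleep):
    """Call func(); on exception, retry with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay ceiling doubles each attempt: base, 2x base, 4x base, ...
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

In practice you would retry only on errors that are safe to retry (timeouts, throttling, 5xx responses) and pair this with idempotent handlers so a repeated request does no harm.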
Observability in Lambda is driven by CloudWatch Logs, CloudWatch Metrics, and AWS X-Ray. Every invocation writes a REPORT line showing duration, billed duration, memory used, and init duration. X-Ray traces show downstream calls to DynamoDB, S3, and other services — essential for debugging latency.
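The REPORT line is easy to parse if you need these numbers outside CloudWatch. The sketch below assumes the usual tab-separated layout of `Name: value unit` fields; `parse_report_line` is a hypothetical helper and the sample request ID is made up.

```python
def parse_report_line(line):
    """Parse a Lambda REPORT log line into a dict of numeric fields."""
    fields = {}
    for part in line.strip().split("\t"):
        name, _, value = part.partition(": ")
        tokens = value.split()
        if name.startswith("REPORT") or not tokens:
            continue  # skip the request ID prefix and empty segments
        try:
            fields[name] = float(tokens[0])
        except ValueError:
            continue  # non-numeric field, e.g. a trace ID
    return fields


sample = (
    "REPORT RequestId: 11111111-2222-3333-4444-555555555555\t"
    "Duration: 102.25 ms\tBilled Duration: 103 ms\tMemory Size: 512 MB\t"
    "Max Memory Used: 87 MB\tInit Duration: 843.17 ms\t"
)
report = parse_report_line(sample)
is_cold_start = "Init Duration" in report
```

Init Duration only appears on invocations that initialised a new environment, so its presence is a convenient cold-start flag.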
Structured logging is critical. Use JSON-formatted logs with a correlation ID (often the X-Ray trace ID) so you can correlate invocations. Avoid print() statements without context.
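A minimal sketch of structured logging follows. It writes one JSON object per line to stdout, which Lambda forwards to CloudWatch Logs; `format_log` and the field names are illustrative choices, and the request ID from the context is used here as the correlation ID.

```python
import json


def format_log(message, request_id, **fields):
    """Build one JSON log line carrying a correlation ID."""
    record = {"message": message, "request_id": request_id, **fields}
    return json.dumps(record, default=str)


def handler(event, context):
    request_id = getattr(context, "aws_request_id", "unknown")
    # One JSON object per line keeps logs queryable in CloudWatch Logs Insights.
    print(format_log("order received", request_id, order_id=event.get("order_id")))
    return {"statusCode": 200}
```

Because every line carries `request_id`, you can filter all log entries for a single invocation with one query.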
- Synchronous invocations: no automatic retries. The caller must handle errors.
- Asynchronous invocations: two automatic retries with exponential backoff (0, 1, 2 min delays).
- Stream-based triggers: retry forever (up to 24 hours or 7 days for Kinesis).
- Always configure a dead-letter queue (DLQ) for async triggers to catch failures.
- DLQ can be an SQS queue (for processing later) or an SNS topic (for alerting).
When Lambda is the Wrong Tool — Alternatives and Trade-offs
Lambda excels at short-lived, event-driven, bursty workloads. But it's not a general-purpose compute platform. If your workload contradicts any of the following, reach for another service.
First, long-running processes: Lambda's hard 15-minute timeout means you cannot run a nightly batch job that takes an hour. Use AWS Batch or ECS/Fargate for that.
Second, stateful applications: Lambda is stateless by design. If your application needs to hold client connections (WebSockets), maintain session state in memory, or use files that persist beyond a single invocation, you'll fight the architecture. Use EC2 or ECS with sticky sessions instead.
Third, predictable, steady traffic: If your load is constant 24/7, Lambda's per-ms billing is more expensive than a low-cost EC2 instance or a reserved instance. A t3.small running 24/7 costs roughly $15/month; 5 million Lambda invocations at 200ms average could cost $8, but steady traffic at 100 req/s would push the cost higher than an EC2 instance.
Fourth, heavy GPU/compute: Lambda has no GPU support. ML training, 3D rendering, or video transcoding with high compute needs are better on EC2 GPU instances or SageMaker.
Fifth, very low latency requirements (<10ms): Lambda's cold start and network overhead make it unsuitable for single-digit-millisecond use cases like real-time trading. Use containers on EC2 or custom hardware.
Finally, large binary processing: Lambda's deployment package limit is 250 MB (unzipped) and 50 MB (zipped) for direct upload. If you're processing multi-GB files, you'll hit storage and timeout limits. Use ECS or Batch with EFS.
The Core Concepts: Serverless & Event-Driven — What Your Manager Actually Means
Your manager says 'serverless'. You hear 'no ops work'. Both are wrong.
Serverless doesn't mean servers vanish. It means you stop caring about kernel patches, SSH keys, and OS upgrades. AWS runs the hypervisor, the runtime, and the scaling plane. Your job shrinks to code and IAM permissions. That's the trade: you give up control over the execution environment in exchange for not paging at 3 AM when a disk fills up.
Event-driven is the engine behind that trade. Your Lambda function does nothing until something pokes it. An S3 upload. An API Gateway request. A DynamoDB stream. That event arrives as a JSON payload, your function processes it, and then it dies. No daemons. No polling loops. No idle costs.
The mental model: Lambda is a stateless worker pool that only exists while handling a single request. If you write code that assumes long-lived connections, local file state, or sticky sessions, you will fail in production. Design for stateless idempotent handlers or don't deploy it.
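To make "stateless idempotent handler" concrete, the sketch below deduplicates on an event ID held in an external store. `InMemoryStore` and `put_if_absent` are hypothetical test doubles; in a real deployment the same role would be played by something like a conditional write to a database table, since memory inside the function cannot be relied on.

```python
class InMemoryStore:
    """Test double for an external key-value store (hypothetical interface)."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        """Return True if the key was newly written, False if it already existed."""
        if key in self._items:
            return False
        self._items[key] = value
        return True


def handle_payment(event, store):
    """Charge at most once per event ID, even if the event is delivered twice."""
    if not store.put_if_absent(event["event_id"], "seen"):
        return {"status": "duplicate", "charged": 0}
    return {"status": "processed", "charged": event["amount"]}
```

Retries and at-least-once delivery mean the same event can arrive more than once, so the duplicate branch is part of the normal path rather than an edge case.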
Use Cases That Won't Burn Your Budget — And Three That Will
Lambda shines when the work is asynchronous, bursty, or short-lived. It bleeds money when you try to force it into a container-shaped hole.
- Image/video processing on upload. S3 event triggers Lambda, you resize, transcode, or extract metadata. Perfect fit: milliseconds of CPU per file, scales to zero when no uploads happen.
- Webhook handlers. Stripe, GitHub, Slack — they send JSON, you validate a signature, update a database, return 200. No keepalive costs.
- Scheduled batch jobs. CloudWatch Events every 15 minutes to purge stale records or aggregate metrics. 96 invocations a day, 500ms each, costs pennies.
- Real-time file transformation. CSV → Parquet before loading into Athena. Lambda grabs the S3 object, transforms in memory, writes to a target bucket.
- Synchronous request-response APIs with tight latency SLAs (<100ms p99). Cold starts kill you. Yes, Provisioned Concurrency exists. Yes, it costs 3x more per GB-hour than warm Lambda.
- Long-running data processing (>15 minutes runtime). Lambda hard caps at 15 minutes. If your ETL job runs 20 minutes and you can't split it, Lambda is the wrong tool. Use EMR or Fargate.
- WebSocket connections with 10k+ concurrent users. Lambda per-connection costs scale linearly with active connections. A single t3.medium handling WebSockets costs less at scale.
💰 Pricing: Pay-Per-Use — The Bill That Sneaks Up on You
Lambda pricing sounds simple: pay per request and compute duration. But the details matter when your traffic goes from zero to a million requests overnight.
Requests cost $0.20 per million. Compute charges by GB-second — memory allocation times execution time. The cheaper your memory tier, the longer your function runs, and sometimes a slightly higher memory setting finishes faster and costs less overall. Always benchmark with realistic payloads.
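The billing formula is simple enough to sketch. The request price comes from the text above; the per-GB-second price is an assumed x86 list price and varies by region and architecture, so treat the constants as placeholders. The estimate ignores the free tier and data transfer.

```python
PRICE_PER_MILLION_REQUESTS = 0.20   # USD, as quoted above
PRICE_PER_GB_SECOND = 0.0000166667  # USD, assumed x86 list price


def monthly_cost_usd(invocations, memory_mb, avg_duration_ms):
    """Estimate a monthly Lambda bill, ignoring the free tier and data transfer."""
    request_cost = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    gb_seconds = invocations * (memory_mb / 1024) * (avg_duration_ms / 1000)
    return round(request_cost + gb_seconds * PRICE_PER_GB_SECOND, 2)


# 10 million invocations of a 128 MB function averaging 500 ms each.
estimate = monthly_cost_usd(10_000_000, memory_mb=128, avg_duration_ms=500)
# → 12.42
```

Running the same function with your own measured durations at two or three memory settings is usually enough to find the cheapest configuration.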
The real budget killer? The monthly free tier (1 million requests and 400,000 GB-seconds) covers only small workloads. After that, sustained traffic adds up fast. A 128MB function running 500ms, hit 10 million times per month, runs roughly $12: peanuts. But a 3GB function with 30-second invocations and retries? That bill hits $500+ real quick.
Watch for data transfer costs too. Lambda talking to RDS or S3 in different regions racks up per-GB charges. Your serverless bill isn't just Lambda — it's the entire egress chain.
⚙️ Key Features — What Makes Lambda Worth the Headache
Lambda exists because nobody wants to manage servers. The core feature is automatic scaling: zero to thousands of concurrent executions in seconds, no provisioning, no load balancers. Each request gets an isolated micro-VM — your code and dependencies, no neighbors.
Event-driven execution is the architectural win. Lambda sits downstream of 200+ AWS services as a native event target. S3 object creation, DynamoDB streams, SNS, SQS, API Gateway — just drop a Lambda in the flow and you're done. No polling, no workers, no crons.
Built-in observability through CloudWatch logs, metrics, and traces. Every invocation gets a request ID, duration, memory used, and billing breakdown. You can catch failures, retry with backoff, and DLQ dead letters to SQS or SNS for reprocessing.
But don't mistake simplicity for power. Lambda is stateless by design — you cannot store local state across invocations. Any state must live in external services. That's a feature, not a bug — it forces stateless architecture that scales horizontally without thinking.
Securing Your Account with IAM
Identity and Access Management (IAM) is the front door to your AWS account. Before writing a single Lambda function, you must understand why IAM matters: it prevents accidental data leaks, stops unauthorized cost spikes, and enforces least-privilege access. The principle is simple—every action your Lambda performs, from reading an S3 object to writing logs in CloudWatch, requires explicit permission. Start by creating dedicated IAM roles for each function rather than using a shared admin role. Attach AWS managed policies like AWSLambdaBasicExecutionRole initially, then scope down to custom inline policies that specify exact ARNs of resources your function touches. Use IAM Access Analyzer to validate your policy statements against actual usage. Avoid hardcoding credentials in environment variables; rely on the execution role's temporary credentials. Enable CloudTrail to audit all IAM actions, and rotate keys regularly for any human users. This discipline prevents the all-too-common production incident where a misconfigured policy exposes a database. Treat IAM as your first security line, not an afterthought.
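As an illustration of scoping a policy down to exact ARNs, the sketch below builds a policy document that lets one function write its own logs and read one S3 prefix. `build_execution_policy` and all argument values are hypothetical; the actions shown are standard IAM actions, but the right set depends on what your function actually does.

```python
import json


def build_execution_policy(region, account_id, function_name, bucket, prefix):
    """Build a least-privilege policy document for one function (illustrative)."""
    log_group = f"arn:aws:logs:{region}:{account_id}:log-group:/aws/lambda/{function_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteOwnLogs",
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group}:*",
            },
            {
                "Sid": "ReadUploadsOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Placeholder account ID and names, for illustration only.
policy = build_execution_policy("us-east-1", "123456789012", "orders-api", "orders-uploads", "incoming/")
policy_json = json.dumps(policy, indent=2)
```

Note that there is no wildcard resource and no write access to S3: if the function is compromised, the blast radius is one log group and one read-only prefix.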
Computing in AWS
Lambda is one compute option among many, and choosing it blindly leads to cost overruns or performance headaches. Understanding why is straightforward: compute models in AWS fall on a spectrum of control versus overhead. EC2 gives you full control over the OS, runtime, and scaling but requires managing patching, capacity planning, and failover. ECS/EKS removes the underlying server management while letting you control the container orchestration. Fargate abstracts away the infrastructure entirely but still gives you per-task pricing and no cold starts. Lambda sits at the extreme end—zero infrastructure management, automatic scaling from zero to thousands of concurrent executions, but with hard limits on execution time (15 minutes max), memory (10,240 MB), and storage (512 MB /tmp). The why: If your workload is bursty, event-driven, and short-lived, Lambda is ideal. If you need persistent connections, long-running processes, or predictable latency for user-facing APIs under load, consider Fargate or EC2 with an Auto Scaling Group. Many teams build a hybrid architecture—Lambda for glue logic and ingestion pipelines, ECS for the heavy-lifting backend.
The Cold Start P99 Spike That Killed Our API Response Times
- Measure p50 and p99 separately — if p99 is much higher than p50, cold starts or throttling are the likely cause.
- Use Provisioned Concurrency for latency-sensitive endpoints, but only for the minimum number needed.
- Minimise package size and externalise heavy dependencies to Lambda Layers.
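To see why p50 and p99 must be measured separately, the sketch below computes both from a list of latencies using the nearest-rank method. The sample data is hypothetical: 98 warm invocations and two cold starts out of 100.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: mostly warm invocations plus two cold starts.
latencies = [180] * 97 + [175, 1150, 1200]

p50 = percentile(latencies, 50)  # → 180
p99 = percentile(latencies, 99)  # → 1150
```

Two slow requests out of a hundred leave the median untouched but push p99 above one second, which is the signature of cold starts rather than a generally slow function.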
aws lambda get-function-configuration --function-name your-function
aws cloudwatch get-metric-statistics --metric-name InitDuration --namespace AWS/Lambda --dimensions Name=FunctionName,Value=your-function
Key takeaways
Common mistakes to avoid
- Initialising DB connections inside the handler
- Ignoring the 512 MB /tmp storage limit and assuming a clean filesystem
- Setting Lambda timeout lower than the slowest downstream dependency
- Using synchronous invocation for long-polling or cron tasks
- Forgetting to configure a DLQ for async triggers
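The /tmp mistake is the easiest of these to guard against in code. The sketch below gives each invocation its own scratch directory and removes it afterwards, so a warm environment never sees files left by a previous invocation; the file name and the `data` field are illustrative.

```python
import os
import tempfile


def handler(event, context):
    # A fresh directory per invocation avoids reading stale files left in /tmp
    # by a previous invocation in the same warm environment.
    with tempfile.TemporaryDirectory(dir=tempfile.gettempdir()) as workdir:
        path = os.path.join(workdir, "payload.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(str(event.get("data", "")))
        size = os.path.getsize(path)
    # The directory and its contents are removed when the block exits.
    return {"bytes_written": size, "cleaned_up": not os.path.exists(workdir)}
```

Cleaning up on exit also keeps long-lived warm environments from slowly filling the ephemeral storage quota.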
Interview Questions on This Topic
A Lambda function handles user logins and is experiencing high tail latency during morning traffic spikes. The p99 latency is 1.2 seconds but the p50 is 180ms. What's likely causing this and how would you fix it?
Frequently Asked Questions