Senior 8 min · March 06, 2026

AWS DevOps Interview Questions — Production Debugging Focus

Q: What are the top 3 AWS DevOps interview questions for senior roles?

1. Explain IAM evaluation logic including SCPs and Boundaries.
2. Compare Blue/Green vs Canary deployment technical implementation in CodeDeploy.
3. Architect a multi-region disaster recovery plan for a containerized workload including RDS replication and Route 53 failover.

Q: How do I answer 'Why did you choose Terraform over CloudFormation?'

Focus on state management control, the ability to perform dry runs via `terraform plan`, and the superior module ecosystem. Avoid saying 'CloudFormation is bad'; instead, highlight that Terraform provides better visibility into infrastructure changes before they are committed to the provider.

Q: What is the 'Principle of Least Privilege' in an AWS context?

It is the practice of granting only the minimum permissions necessary to perform a task. In AWS, this means scoping IAM policies to specific Actions and specific Resource ARNs, and using Conditions (like `aws:SourceIp` or `aws:SourceVpc`) to restrict access further. Avoid '*' for resources in production policies unless the action does not support resource-level permissions.

Q: How do you manage secrets across multiple environments on AWS?

Use AWS Secrets Manager for secrets that require rotation (like RDS passwords) and SSM Parameter Store SecureString for static secrets (like API keys). Never hardcode these in code or IaC templates; fetch them at runtime or resolve them via the orchestration layer (e.g., ECS environment variable injection).
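As a minimal sketch of the "fetch them at runtime" option, the functions below take boto3-style clients as arguments. `get_secret_value` and `get_parameter` are the Secrets Manager and SSM operations; the function names, the secret ID, and the parameter path are hypothetical and only there for illustration.

```python
import json


def fetch_db_credentials(secrets_client, secret_id):
    """Read a JSON secret (for example a rotated RDS password) from Secrets Manager."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


def fetch_api_key(ssm_client, parameter_name):
    """Read a SecureString parameter; WithDecryption=True returns the plaintext value."""
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Typical wiring (needs boto3 and AWS credentials, so it is left commented out):
#   import boto3
#   creds = fetch_db_credentials(boto3.client("secretsmanager"), "prod/orders/db")
#   api_key = fetch_api_key(boto3.client("ssm"), "/prod/orders/payment-api-key")
```

Passing the client in keeps the lookup testable without AWS access and makes it easy to point each environment at its own secret names.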

Q: How does AWS CodeDeploy handle rollbacks for Lambda functions?

CodeDeploy uses 'Linear' or 'Canary' deployment configs. It creates a new Lambda version and shifts traffic. If a pre-configured CloudWatch Alarm (e.g., 5xx errors > 1%) triggers during the 'Baking' period, CodeDeploy immediately points the alias back to the old version.

Q: How do you handle secrets in CI/CD pipelines on AWS?

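Store secrets in AWS Secrets Manager or SSM Parameter Store and have CodeBuild inject them at build time through the `env` section of the buildspec: `parameter-store` maps a variable to a parameter name (e.g., `MY_SECRET: /my-secret`) and `secrets-manager` maps a variable to a secret ID and JSON key. The `{{resolve:ssm:...}}` form is a CloudFormation dynamic reference, not buildspec syntax. Never store secrets in source code or pipeline configuration files. For Terraform, read values with the `aws_secretsmanager_secret_version` data source, and remember that anything read this way is written to the state file, so the state backend must be encrypted and access-controlled. Always enable encryption at rest and in transit.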

CannotPullContainerError with correct IAM? Missing S3 Gateway Endpoint.

Naren Founder & Principal Engineer

20+ years shipping production code across the stack, with years spent interviewing engineers. Lessons pulled from things that broke in production.

Last updated: May 24, 2026

⚡Quick Answer

AWS DevOps interview questions test architectural tradeoffs, not service definitions.
Key areas: CI/CD pipeline design, IAM policy evaluation order, IaC (CloudFormation vs Terraform), and ECS/EKS container orchestration.
Performance insight: A well-architected pipeline can cut deployment time from 30 minutes to under 5.
Production insight: IAM misconfigurations (excessive permissions, missing resource policies) are among the most common root causes of AWS security incidents.
Biggest mistake: Memorizing service names without understanding failure modes and recovery strategies.

✦ Definition~90s read

What Are AWS Interview Questions?

This article is a targeted guide for senior engineers preparing for AWS DevOps interviews that emphasize production debugging — the kind of interview that separates candidates who can recite service names from those who have actually triaged a production incident at 3 AM. It covers the specific AWS services and patterns that interviewers use to probe your ability to diagnose and resolve real-world failures: broken CI/CD pipelines, misconfigured IAM policies that silently block deployments, CloudFormation drift that corrupts state, and container orchestration issues that cause cascading outages.

The focus is on the questions that reveal whether you understand the operational consequences of architectural decisions, not just how to spin up resources.

In the AWS ecosystem, this article fills a gap between generic interview prep and deep-dive certification guides. Most DevOps interview resources cover theory — this one assumes you already know what CodePipeline and ECS are, and instead tests your ability to reason about why a build failed at 2:47 AM, how you'd trace a 503 error through CloudWatch Logs and X-Ray, or what you'd do when a Terraform apply breaks a production database.

It's not for beginners; it's for engineers who have been on-call and need to articulate that experience under pressure.

When not to use this: If you're preparing for a junior DevOps role or a pure architecture interview without operational responsibilities, this material will be overkill. Similarly, if your interview is focused on GCP or Azure, the AWS-specific services won't translate.

But for anyone interviewing at companies like Amazon, Netflix, or mid-stage startups that run on AWS and expect engineers to handle production incidents, this article is the difference between a pass and a fail.

Plain-English First

Think of AWS like a giant, perfectly organised city. EC2 instances are the buildings, IAM is the security guard deciding who gets through the door, CloudFormation is the city blueprint, and CodePipeline is the conveyor belt that takes your raw code and delivers a finished product to the right building automatically. A DevOps engineer is the city planner — they design how all those pieces talk to each other, stay healthy, and rebuild themselves when something goes wrong.

DevOps on AWS isn't just a checkbox on a job description. It's the difference between a team that ships features on a Friday afternoon with confidence and one that treats deployments like defusing a bomb. Amazon Web Services holds roughly a third of the cloud infrastructure market, and companies aren't hiring AWS DevOps engineers to click buttons in a console. They need people who can architect pipelines, diagnose failures at 2am, and make strong tradeoff decisions under pressure.

The problem most candidates run into is that they've memorised service names without understanding the reasoning behind architectural choices. Interviewers at mid-to-senior level aren't impressed by someone who can recite what S3 stands for — they want to hear you explain why you'd choose an ALB over an NLB for a microservices workload, or why you'd reach for SSM Parameter Store instead of hardcoding an environment variable.

This article covers the AWS DevOps interview questions that actually get asked in technical screens and on-site rounds at companies ranging from Series B startups to FAANG-adjacent engineering orgs. By the end, you'll have battle-ready answers with the depth and nuance that separates a senior-level response from a junior one — and you'll understand the 'why' well enough to adapt your answer to any follow-up curveball.

What AWS DevOps Interview Questions Actually Test

AWS DevOps interview questions are not trivia about services. They test your ability to debug production systems under load. The core mechanic is scenario-based: you're given a failure mode—like a 5xx spike or a deployment rollback—and expected to trace the root cause across AWS primitives (ELB, ASG, RDS, Lambda) using logs, metrics, and distributed tracing. These questions assume you understand the interplay between scaling policies, connection pools, and retry storms.

In practice, the interviewer cares about three properties: how you isolate the blast radius, how you read CloudWatch metrics to distinguish application errors from infrastructure saturation, and how you reason about eventual consistency, such as DynamoDB's default eventually consistent reads (S3 reads, by contrast, have been strongly consistent since late 2020). A typical question might involve a 502 error after a CodeDeploy rollout: you must check ALB target group health, application logs for uncaught exceptions, and database connection pool exhaustion simultaneously.

Use these questions to validate your mental model of failure modes in distributed systems. They matter because production incidents rarely have a single cause; they cascade. A senior engineer must articulate the chain: a misconfigured health check leads to premature instance termination, which triggers a connection pool drain, which causes a latency spike, which trips the ALB 5xx alarm.

Don't Memorize Service Limits

Interviewers rarely ask for exact limits (e.g., ALB idle timeout 60s). They want you to reason about what happens when you exceed them—like connection queue buildup.

Production Insight

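Real scenario: A team set ASG min=2, max=2 with no buffer, and both instances were running in the same AZ. That AZ went down, both instances terminated, and there was zero capacity for 5 minutes.

Symptom: ALB returned 503 for all requests, no healthy targets.

Rule of thumb: Always set ASG min to at least 2x the number of AZs, and attach the ASG to subnets in every AZ so instances are spread across them. The launch template defines what to launch; the ASG's subnet list decides where.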
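The arithmetic behind that rule of thumb can be sketched in a few lines. This assumes the ASG spreads instances as evenly as possible and that the fullest AZ is the one that fails; the function name is hypothetical.

```python
import math


def worst_case_surviving(total_instances, az_count):
    """Instances left after losing the fullest AZ, assuming an even spread."""
    largest_az = math.ceil(total_instances / az_count)
    return total_instances - largest_az


for total, azs in [(2, 1), (2, 2), (4, 2), (6, 3)]:
    left = worst_case_surviving(total, azs)
    print(f"{total} instances over {azs} AZ(s): {left} survive an AZ outage")
```

Two instances in one AZ leave nothing after an outage, which is the incident above. Four instances over two AZs, or six over three, keep at least two healthy targets behind the ALB.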

Key Takeaway

Production debugging questions test your ability to trace a failure across AWS service boundaries, not recite documentation.

Always start with the load balancer logs (ALB access logs) to see the request path before diving into application logs.

The most common root cause in AWS incidents is a misconfigured health check or a connection pool that doesn't handle transient failures.

CI/CD on AWS — CodePipeline, CodeBuild, and the Questions Behind Them

The most common first question in an AWS DevOps screen is some variation of: 'Walk me through your CI/CD pipeline.' Interviewers aren't looking for a list of services — they're listening for your decision-making.

CodePipeline is AWS's native pipeline orchestrator. It doesn't build or deploy anything itself — it coordinates other services. CodeBuild handles the compilation, testing, and packaging (it's a fully managed build server billed per build minute). CodeDeploy handles the actual deployment to EC2, Lambda, or ECS. Understanding that separation of concerns is critical.

Why use CodePipeline over Jenkins? The honest answer is: it depends. CodePipeline has zero infrastructure to manage and integrates natively with IAM, CloudTrail, and EventBridge. Jenkins gives you more plugin flexibility and is easier to migrate if you leave AWS. The right answer in an interview is to name the tradeoff, not pick a winner blindly.

One detail that trips people up: CodeBuild runs each build in an isolated, ephemeral container. Any state (installed dependencies, cached Docker layers) is gone after the build unless you explicitly configure a cache, either in S3 or as a local cache on the build host. Forgetting this is why builds that work locally are mysteriously slow or broken in CodeBuild.

CI_CD_Pipeline_Questions.md (Markdown)

Q: How does your team handle failed deployments in CodePipeline?

WEAK ANSWER:
  'We just re-run the pipeline.'

STRONG ANSWER:
  'We use CodeDeploy with a Blue/Green deployment strategy for ECS services.
   If the post-deployment health check fails — we define a 5-minute window
   where the load balancer monitors the new task set — CodeDeploy automatically
   rolls back by shifting traffic back to the original task set.

   For Lambda, we use CodeDeploy with a Canary10Percent5Minutes configuration:
   10% of traffic goes to the new version for 5 minutes. If CloudWatch alarms
   tied to error rate or latency spike, the deployment is rolled back automatically.
   No human has to be awake for that to happen.

   We also enable CloudTrail on the pipeline so every approval action,
   stage transition, and artifact push is auditable.'

---

Q: What is the difference between CodeDeploy in-place and Blue/Green?

IN-PLACE:
  - Stops old app version on existing instances
  - Installs new version on the SAME instances
  - Cheaper (no duplicate infrastructure)
  - Downtime risk if health check fails mid-deployment
  - Good for: dev/staging environments, non-critical workloads

BLUE/GREEN:
  - New version deployed to a SEPARATE set of instances/tasks
  - Load balancer shifts traffic only after health checks pass
  - Zero-downtime by design
  - Costs more during the transition window (double the compute)
  - Good for: production, anything customer-facing

KEY INSIGHT:
  Blue/Green is not just a deployment strategy — it's also a rollback strategy.
  Your 'blue' environment stays live until you're confident in 'green'.
  If something goes wrong, a traffic shift takes seconds, not a redeploy.

Output

N/A — interview Q&A format. These are model answers, not runnable code.
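The canary behaviour described in the strong answer can be modelled as a small simulation. This is a toy model under stated assumptions, not CodeDeploy's implementation: the alarm state is sampled once per minute, and `run_canary` is a hypothetical name.

```python
def run_canary(alarm_by_minute, canary_percent=10, bake_minutes=5):
    """Return (outcome, traffic weights on the new version, one entry per step)."""
    weights = []
    for minute in range(bake_minutes):
        # During the bake window only the canary share reaches the new version.
        weights.append(canary_percent)
        if minute < len(alarm_by_minute) and alarm_by_minute[minute]:
            # Alarm fired: all traffic goes back to the old version.
            weights.append(0)
            return "rolled_back", weights
    # Clean bake window: the rest of the traffic shifts over.
    weights.append(100)
    return "succeeded", weights


print(run_canary([False, False, True]))  # alarm fires in the third minute
print(run_canary([False] * 5))           # no alarm during the bake window
```

The point to make in an interview is that the rollback is just a weight change on an alias or listener, which is why it takes seconds and needs no human.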

Interview Gold:

When asked about CI/CD, proactively mention your rollback strategy before the interviewer asks. It signals production maturity. Say: 'And if the deployment fails, here's exactly what happens automatically...' — most candidates never get there.

Production Insight

In production, the most common CI/CD failure is a CodeBuild build that works locally but fails in the pipeline because ephemeral containers lack pre-installed tooling.

Always explicitly define the build environment image and install dependencies in the buildspec.

Rule: Never rely on the default CodeBuild image for language-specific builds.

Key Takeaway

CI/CD is about failure recovery as much as pipeline speed.

Proactively design rollback strategies into your pipeline.

The strongest answer is one that includes both.

IAM, Security, and the Principle of Least Privilege — Where Candidates Get Caught

IAM is the single most tested AWS topic in DevOps interviews, and also the most misunderstood. Most candidates can explain what a policy is — very few can explain the evaluation logic when multiple policies conflict.

Here's the core rule: AWS evaluates all applicable policies (identity-based, resource-based, permission boundaries, SCPs). An explicit Deny anywhere always wins. If there is no explicit Deny, the request still needs an Allow, and SCPs and permission boundaries act as filters rather than grants: the action must be allowed by every SCP and boundary that applies and also granted by an identity-based or resource-based policy. Anything not explicitly allowed is implicitly denied. This seems obvious until you're debugging why a Lambda function can't write to an S3 bucket even though the IAM role allows s3:PutObject, because the bucket policy has an explicit Deny for all non-VPC traffic.
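The evaluation order can be sketched as a toy model. It is deliberately simplified: same-account requests only, no session policies, no condition keys (a Deny statement here is assumed to have already matched its condition), and wildcard matching via `fnmatchcase`. The `Statement` type, `evaluate` function, and `report.csv` object are hypothetical.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str       # "Allow" or "Deny"
    actions: tuple    # e.g. ("s3:PutObject",) or ("s3:*",)
    resources: tuple  # ARN patterns; "*" matches everything


def _has(policy, effect, action, resource):
    """True if any statement in the policy has this effect and matches the request."""
    return any(
        s.effect == effect
        and any(fnmatchcase(action, a) for a in s.actions)
        and any(fnmatchcase(resource, r) for r in s.resources)
        for s in policy
    )


def evaluate(action, resource, identity, resource_policy=(), boundary=None, scps=()):
    filters = [*scps] + ([boundary] if boundary is not None else [])
    # 1. An explicit Deny in any policy ends the evaluation.
    if any(_has(p, "Deny", action, resource) for p in [identity, resource_policy, *filters]):
        return "ExplicitDeny"
    # 2. SCPs and the boundary are filters: each one must allow the action.
    if not all(_has(f, "Allow", action, resource) for f in filters):
        return "ImplicitDeny"
    # 3. Something must actually grant the action.
    if _has(identity, "Allow", action, resource) or _has(resource_policy, "Allow", action, resource):
        return "Allow"
    return "ImplicitDeny"


OBJECT = "arn:aws:s3:::thecodeforge-prod-data/report.csv"
role_policy = [Statement("Allow", ("s3:PutObject",), ("arn:aws:s3:::thecodeforge-prod-data/*",))]
bucket_deny = [Statement("Deny", ("s3:*",), ("arn:aws:s3:::thecodeforge-prod-data/*",))]
boundary = [Statement("Allow", ("dynamodb:*",), ("*",))]

print(evaluate("s3:PutObject", OBJECT, role_policy))                     # Allow
print(evaluate("s3:PutObject", OBJECT, role_policy, bucket_deny))        # ExplicitDeny
print(evaluate("s3:PutObject", OBJECT, role_policy, boundary=boundary))  # ImplicitDeny
```

The second call is the Lambda-and-bucket-policy case from the paragraph above; the third shows a permission boundary silently capping a role that looks correct on its own.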

The other area interviewers probe hard: IAM Roles vs. IAM Users for automation. The correct answer today is always roles for anything machine-to-machine. IAM users have long-lived static credentials, so a leaked key is a breach. Roles use short-lived STS credentials that rotate automatically. EC2 instance profiles, ECS task roles, and Lambda execution roles all use the role mechanism.

Permission Boundaries are often a senior-level differentiator. They let you delegate IAM administration safely: you can allow a team to create their own roles, but cap the maximum permissions those roles can ever have. It's the difference between 'trust but verify' and 'trust and you can't accidentally escalate anyway.'

io/thecodeforge/iam/SecureBucketPolicy.json (JSON)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceVPCAccessOnly",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::thecodeforge-prod-data",
        "arn:aws:s3:::thecodeforge-prod-data/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-0123456789abcdef0"
        }
      }
    }
  ]
}

Output

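Denies every S3 request that does not arrive from the named VPC. Because `aws:SourceVpc` is only present on requests made through a VPC endpoint, this also blocks console and public-internet access, which is a common senior interview discussion point.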

Watch Out:

Never say 'I'd just give it admin access temporarily to get it working.' That phrase ends interviews. The correct answer is always 'I'd scope the policy to the minimum required actions and resources, then use IAM Policy Simulator to verify.' Interviewers are testing whether you'd create a security incident in production.

Production Insight

The most embarrassing IAM outage I've seen: a Lambda function couldn't write to S3 because the IAM role had s3:PutObject, but the bucket policy included an explicit Deny for non-VPC traffic, and the Lambda wasn't in the VPC.

Always check both identity-based and resource-based policies.

Rule: An explicit Deny anywhere beats an Allow everywhere.

Key Takeaway

IAM is the most common place AWS outages originate.

Master the evaluation logic and permission boundaries.

In an interview, showing you understand that nuance makes you senior.

Infrastructure as Code — CloudFormation vs Terraform and Real Architecture Questions

CloudFormation is AWS-native IaC. Terraform is cloud-agnostic HCL-based IaC by HashiCorp. This comparison comes up in almost every AWS DevOps interview, and the trap is giving a tribal 'Terraform is better' answer without nuance.

CloudFormation's strengths: native drift detection, StackSets for multi-account/multi-region deployments, no state file management (AWS manages state), and deep integration with AWS services like Service Catalog and CDK. Its weaknesses: verbose YAML/JSON, a slower development cycle, and error messages that are notoriously unhelpful ('UPDATE_ROLLBACK_COMPLETE' tells you nothing about what actually failed).

Terraform's strengths: multi-cloud portability, a cleaner module system, better plan output (you see exactly what will change before it changes), and a massive community registry of modules. Its weaknesses: you own the state file, which means you need a backend (S3 + DynamoDB for locking), and state file corruption or drift is your problem to solve.

In practice, many mature AWS shops use both: CloudFormation for account-level infrastructure (VPCs, IAM foundations, Service Control Policies) via AWS Control Tower, and Terraform for application-level infrastructure managed by product teams. Knowing this hybrid pattern signals real-world experience.

io/thecodeforge/terraform/main.tf (HCL)

/* 
 * Production-grade Terraform backend configuration.
 * Explaining this locking mechanism shows you understand state safety.
 */
terraform {
  backend "s3" {
    bucket         = "thecodeforge-tf-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "thecodeforge-tf-locks"
    encrypt        = true
  }
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "thecodeforge-main-vpc"
  cidr   = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

Output

Provisioning a highly available VPC with NAT gateways and state locking.

Pro Tip:

When answering IaC questions, always end with 'and here's how I'd handle drift.' CloudFormation has native drift detection; Terraform has terraform plan against live infrastructure. Showing you think about what happens after initial deployment separates senior candidates from mid-level ones.
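Conceptually, both drift tools do the same thing: compare what the template declares with what the provider reports. A minimal sketch of that comparison, with hypothetical attribute names and values:

```python
def find_drift(declared, live):
    """Return {attribute: (declared_value, live_value)} for every difference."""
    keys = sorted(declared.keys() | live.keys())
    return {k: (declared.get(k), live.get(k)) for k in keys if declared.get(k) != live.get(k)}


declared = {"instance_type": "t3.medium", "ingress_cidr": "10.0.0.0/16"}
live = {"instance_type": "t3.medium", "ingress_cidr": "0.0.0.0/0", "hotfix_tag": "true"}

for attribute, (want, have) in find_drift(declared, live).items():
    print(f"{attribute}: declared={want!r} live={have!r}")
```

Here someone opened the ingress rule and added a tag by hand in the console. A good interview answer goes one step further and says what happens next: revert the live change, or codify it in the template.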

Production Insight

Teams that use CloudFormation without StackSets often hit a wall when they need to deploy the same VPC to 20 accounts.

They end up copying templates manually, which introduces drift.

Rule: If you have more than 5 accounts, use StackSets or Terraform workspaces from day one.

Key Takeaway

IaC isn't about which tool—it's about state safety and drift detection.

CloudFormation native drift detection is a hidden gem.

In an interview, mention 'drift detection' before the interviewer does.

ECS, EKS, and Container Orchestration — The Questions That Reveal Depth

Container questions are where AWS DevOps interviews get genuinely technical. The ECS vs EKS question is almost a given, and the wrong move is to immediately say 'Kubernetes is always better.'

ECS (Elastic Container Service) is AWS-native container orchestration. It uses Task Definitions (the blueprint for a container workload) and Services (which maintain the desired count of tasks and wire them to load balancers). ECS with Fargate means zero EC2 management — AWS provisions the underlying compute per task. ECS with EC2 launch type means you manage the cluster nodes, but you get more control over instance types and pricing (Reserved Instances, Savings Plans).

EKS (Elastic Kubernetes Service) gives you a managed Kubernetes control plane. Use it when your team already has Kubernetes expertise, you need to run the same workloads on-prem and in AWS, or your application requires Kubernetes-native features like custom operators or CRDs. The operational overhead is meaningfully higher than ECS.

The real interview depth comes from task role vs execution role in ECS — a distinction that trips up 80% of candidates. The execution role is what ECS uses to pull the container image from ECR and write logs to CloudWatch. The task role is what your application code uses to call other AWS services (DynamoDB, S3, etc.). Mixing these up is a classic misconfiguration that causes silent permission failures.

io/thecodeforge/ecs/TaskDefinition.yaml (YAML)

# CloudFormation snippet for an ECS Task Definition.
# Senior highlight: explicitly separating the two roles.
ForgeTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: thecodeforge-api-task
    Cpu: '256'
    Memory: '512'
    NetworkMode: awsvpc
    RequiresCompatibilities:
      - FARGATE
    # Used by the ECS agent (ECR pulls, CloudWatch Logs)
    ExecutionRoleArn: !GetAtt ForgeExecutionRole.Arn
    # Used by the application code (S3, DynamoDB calls)
    TaskRoleArn: !GetAtt ForgeTaskRole.Arn
    ContainerDefinitions:
      - Name: api-container
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/forge-api:latest
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/thecodeforge-api
            awslogs-region: us-east-1
            awslogs-stream-prefix: ecs

Output

Defines a Fargate task with clear role separation, ready for high-scale production.
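A quick lint for the role mix-up can be sketched as follows. The prefix list is an assumption for illustration: it covers image pulls, logging, and secret injection, which are the things the ECS agent does on the task's behalf, and treats everything else as an application permission that belongs on the task role.

```python
# Hypothetical allowlist of action prefixes the ECS agent typically needs.
AGENT_PREFIXES = ("ecr:", "logs:", "secretsmanager:", "ssm:", "kms:")


def misplaced_actions(execution_role_actions):
    """Actions on the execution role that look like application permissions."""
    return sorted(a for a in execution_role_actions if not a.startswith(AGENT_PREFIXES))


execution_role = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchGetImage",
    "logs:PutLogEvents",
    "s3:PutObject",
    "dynamodb:GetItem",
]
print(misplaced_actions(execution_role))
```

For this role the check flags `dynamodb:GetItem` and `s3:PutObject`: the application would still get AccessDenied at runtime because its code runs with the task role, not the execution role.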

Interview Gold:

If asked 'ECS or EKS?', give this answer: 'ECS for AWS-native teams who want low operational overhead and Fargate's serverless model. EKS when the team has Kubernetes expertise, needs multi-cloud portability, or requires CRDs and custom operators. ECS is faster to get right; EKS is harder to get right but more portable.' That's a senior-level answer.

Production Insight

The most common production issue with ECS is mixing up execution role and task role.

Developers add S3 permissions to the execution role (which ECS uses), then wonder why their app can't write to S3.

Rule: Execution role is for the ECS agent; task role is for your application.

Key Takeaway

ECS vs EKS is a team maturity decision, not a question of technical superiority.

The real depth is in networking, roles, and health checks.

In interviews, the task role vs execution role distinction is the senior trap.

Monitoring, Observability, and Incident Response — The DevOps Interview Difference

Interviewers dig into monitoring because they want to know how you detect and respond to failures before customers do. The standard answer—'we use CloudWatch alarms'—is not enough. You need to show you understand the difference between monitoring (tracking known metrics) and observability (exploring unknown failure modes).

On AWS, CloudWatch collects metrics, logs, and events. But CloudWatch Metrics alone won't catch an intermittent 503 error that happens only when a downstream service is slow. That's where structured logging (JSON format) and distributed tracing with X-Ray come in. The trick is to log correlation IDs so you can trace a request across EC2, Lambda, RDS, and S3.
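A minimal sketch of JSON logging with a correlation ID, using only the Python standard library. The field names, the `orders-api` service name, and the helper functions are hypothetical; the idea is simply that every log line for one request carries the same ID, so a CloudWatch Logs Insights query on that field reconstructs the request path.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
            "service": getattr(record, "service", None),
        })


def build_logger(name, stream=None):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


def handle_request(logger, incoming_correlation_id=None):
    # Reuse the caller's ID when present so the same value appears at every hop.
    correlation_id = incoming_correlation_id or str(uuid.uuid4())
    extra = {"correlation_id": correlation_id, "service": "orders-api"}
    logger.info("request received", extra=extra)
    logger.info("calling downstream payment service", extra=extra)
    return correlation_id


handle_request(build_logger("orders-api"))
```

In a real service the ID would arrive in a request header and be forwarded on every downstream call; the header name is a team convention, not something AWS mandates.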

A senior DevOps answer should include: metrics (CPU, memory, request latency) that trigger alarms, structured logs with context for debugging, and traces for pinpointing bottlenecks. Also mention alarm fatigue: too many alarms cause engineers to ignore them. Use composite alarms to reduce noise.

io/thecodeforge/monitoring/CompositeAlarms.yaml (YAML)

AWSTemplateFormatVersion: '2010-09-09'
Description: Composite alarm to reduce alert fatigue by combining error rate and latency.
Resources:
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighErrorRate
      MetricName: HTTPCode_Target_5XX_Count  # ALB metric for 5xx responses from targets
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref LoadBalancerFullName  # hypothetical parameter, not defined in this snippet
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanOrEqualToThreshold
  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighLatency
      MetricName: TargetResponseTime
      Namespace: AWS/ApplicationELB
      Dimensions:
        - Name: LoadBalancer
          Value: !Ref LoadBalancerFullName  # hypothetical parameter, not defined in this snippet
      ExtendedStatistic: p99  # percentiles use ExtendedStatistic, not Statistic
      Period: 60
      EvaluationPeriods: 2
      Threshold: 2  # TargetResponseTime is reported in seconds
      ComparisonOperator: GreaterThanOrEqualToThreshold
  CompositeHighErrorLatencyAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: CompositeHighErrorLatency
      AlarmRule: !Sub '(ALARM("${HighErrorRateAlarm}") AND ALARM("${HighLatencyAlarm}"))'
      ActionsEnabled: true
      AlarmActions:
        - !Ref SNSNotificationTopic  # assumed to be defined elsewhere in the template

Output

Creates a composite alarm that fires only when both error rate and latency exceed thresholds, reducing noise.

Interview Gold:

When asked about monitoring, mention 'alarm fatigue' and how you use composite alarms. It shows you've been on-call and care about signal-to-noise ratio.

Production Insight

We once had a CloudWatch alarm on average latency that never fired because the average stayed under threshold. But 1% of requests were timing out at 30 seconds.

The fix was to use a percentile alarm (p99) instead of average.

Rule: Always alarm on p99 latency, not average.
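The effect is easy to reproduce with made-up numbers. In this sketch 2% of requests hit a 30-second timeout (slightly more than in the incident, so the slow requests sit clearly above the 99th percentile), and the alarm threshold is a hypothetical 2 seconds.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100  # ceil(n * pct / 100) without floats
    return ordered[rank - 1]


latencies = [0.1] * 98 + [30.0] * 2  # seconds
average = sum(latencies) / len(latencies)
p99 = percentile(latencies, 99)
threshold = 2.0

print(f"average={average:.3f}s fires={average >= threshold}")
print(f"p99={p99:.1f}s fires={p99 >= threshold}")
```

The average comes out at roughly 0.7 seconds and stays under the threshold, while the p99 is the full 30 seconds. The healthy majority drags the mean down and hides the requests that are actually failing.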

Key Takeaway

Monitoring without observability is blind.

Use the three pillars: metrics, logs, traces.

In an interview, mention 'alarm fatigue'—it shows production maturity.

EC2 at Scale — Spot, Reserved, and the Cost Questions That Kill Candidates

Every junior can launch an EC2 instance. The interview question that separates engineers is: "How do you run 500 instances without burning through your budget?"

EC2 pricing is a design decision, not a billing afterthought. Spot Instances can cost up to 90% less than On-Demand, but AWS can reclaim them with only a two-minute warning. You design for interruption: checkpoint your work, diversify across instance types and AZs, and never run stateful workloads on Spot without a fallback.

Reserved Instances are for your baseline. If you know you need 20 m5.large instances for 12 months, commit. Convertible RIs let you change families. That matters when you're migrating from compute-optimized to memory-optimized mid-cycle.

Savings Plans are the newer, more flexible cousin. Compute Savings Plans cover Fargate and Lambda too.

Interviewers ask: "You have a batch processing job that runs nightly for 3 hours. What's the cheapest way to run it?" Right answer: Spot Fleet with a diversified allocation strategy across instance types and AZs. Wrong answer: On-Demand because it's simple.
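A back-of-the-envelope comparison for that nightly job, assuming a fleet of 50 instances running for 3 hours. Both hourly prices are hypothetical placeholders; real Spot prices vary by instance type, AZ, and time.

```python
def nightly_cost(instances, hours, hourly_price):
    """Total cost of one run in USD."""
    return instances * hours * hourly_price


ON_DEMAND_PRICE = 0.085  # hypothetical USD per instance-hour
SPOT_PRICE = 0.030       # hypothetical USD per instance-hour

on_demand = nightly_cost(50, 3, ON_DEMAND_PRICE)
spot = nightly_cost(50, 3, SPOT_PRICE)
savings_pct = (on_demand - spot) / on_demand * 100

print(f"on-demand ${on_demand:.2f}/night, spot ${spot:.2f}/night, saving {savings_pct:.0f}%")
```

Walking through numbers like these, then naming what the saving costs you (interruption handling and checkpointing), is what turns "use Spot" into a senior answer.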

ec2_spot_fleet.tf (HCL)

// io.thecodeforge
resource "aws_spot_fleet_request" "batch_workers" {
  iam_fleet_role      = aws_iam_role.fleet_role.arn
  target_capacity     = 50
  allocation_strategy = "diversified"

  launch_specification {
    instance_type     = "c5.large"
    ami               = data.aws_ami.amazon_linux_2.id
    spot_price        = "0.05"
    subnet_id         = aws_subnet.private[0].id
    user_data         = filebase64("${path.module}/checkpoint_bootstrap.sh")
  }

  launch_specification {
    instance_type     = "c5a.large"
    ami               = data.aws_ami.amazon_linux_2.id
    spot_price        = "0.05"
    subnet_id         = aws_subnet.private[1].id
    user_data         = filebase64("${path.module}/checkpoint_bootstrap.sh")
  }
}

Output

aws_spot_fleet_request.batch_workers: Creation complete after 12s

Instances launch. If AWS needs the capacity back or the Spot price rises above your max price, the instances are interrupted. Checkpoint files in S3 save your work.

Production Trap:

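Don't try to save money by setting a low Spot max price. Since AWS retired bidding, you pay the current Spot price whatever your max price is, and the default cap is the On-Demand price. A low cap doesn't cut your bill; it only gets your instances interrupted sooner when the Spot price rises.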

Key Takeaway

Cost-optimized EC2 means using Spot for fault-tolerant workloads, Reserved for steady-state, and Savings Plans for mixed usage across compute services.

Serverless Interview Questions — Lambda Cold Starts, Concurrency, and the VPC Lie

Lambda questions expose how well you understand stateless architecture. The surface-level question: "What's a Lambda function?" The real question: "Your API returns 5-second latencies every 10 minutes. Why?"

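Cold starts happen when Lambda needs to initialize a new execution environment. The penalty depends on runtime and package size: JVM and .NET functions commonly take from several hundred milliseconds to a few seconds to initialize, while Python and Node.js are usually much faster. The VPC penalty is the part most candidates get wrong. Before 2019, a VPC-attached function created an Elastic Network Interface at cold start, which could add 5-10 seconds. Lambda now creates shared ENIs when the function is created or its VPC configuration changes, so that per-invocation penalty is largely gone.

Interviewers probe: "When would you use Lambda in a VPC?" Only when you must reach private resources such as an RDS database or an ElastiCache cluster. Put RDS Proxy in front of the database to pool connections, and keep functions that only call public AWS APIs outside the VPC so they don't need a NAT Gateway or VPC endpoints.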

Concurrency limits kill production systems. The default is 1,000 concurrent executions per account per Region, shared by every function. A single misconfigured SQS trigger can exhaust that, starving all other functions. Set reserved concurrency per function. Don't trust the default.
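The budgeting behind reserved concurrency can be sketched as simple arithmetic. The function names and reservations below are hypothetical; the 100-execution floor reflects the fact that Lambda keeps a minimum unreserved pool for functions without a reservation.

```python
ACCOUNT_LIMIT = 1000   # default regional quota discussed above
MIN_UNRESERVED = 100   # Lambda will not let reservations eat into this floor


def unreserved_pool(reserved_by_function, account_limit=ACCOUNT_LIMIT):
    """Concurrency left for functions that have no reservation of their own."""
    remaining = account_limit - sum(reserved_by_function.values())
    if remaining < MIN_UNRESERVED:
        raise ValueError(
            f"only {remaining} unreserved executions left; at least {MIN_UNRESERVED} required"
        )
    return remaining


reservations = {"order-processor": 50, "sqs-consumer": 200}
print(unreserved_pool(reservations))
```

A reservation works in both directions: the SQS consumer can never exceed 200 concurrent executions, and nothing else can take those 200 away from it.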

Provisioned Concurrency pre-warms environments. Use it for your latency-sensitive endpoints. Pay the cost or own the cold start.

template.yaml (YAML)

# io.thecodeforge
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31  # required for AWS::Serverless::* resources
Resources:
  OrderProcessor:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: target/
      Handler: com.thecodeforge.OrderHandler::handleRequest
      Runtime: java11
      MemorySize: 1024
      Timeout: 10
      ReservedConcurrentExecutions: 50
      # Provisioned concurrency attaches to an alias; the alias name is illustrative.
      AutoPublishAlias: live
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 5
      Policies:
        - AWSLambdaVPCAccessExecutionRole
        - DynamoDBCrudPolicy:
            TableName: !Ref OrdersTable
      VpcConfig:
        SecurityGroupIds:
          - !Ref LambdaSecurityGroup
        SubnetIds:
          - !Ref PrivateSubnetA
          - !Ref PrivateSubnetB

Output

Cold start (first 5 invocations): ~2500ms

Warm invocations (after Provisioned Concurrency): ~120ms

Concurrent executions capped at 50: prevents account-level throttle

Production Trap:

Putting Lambda in a VPC without a NAT Gateway (or the relevant VPC endpoints) breaks calls to external services and to AWS APIs. Your function will silently time out. Always validate network plumbing before the incident.

Key Takeaway

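Attach Lambda to a VPC only when it needs private resources. The old multi-second ENI cold start penalty no longer applies, but the NAT and VPC endpoint plumbing still does. Use RDS Proxy to shield the database from connection storms. Reserved concurrency protects your other functions.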

Production incident · Post-mortem · Severity: high

ECS Task Fails to Pull Image in Private Subnet

Symptom

Task status shows 'CannotPullContainerError: Access Denied' in the ECS console. CloudWatch logs for the execution role show no errors. ECR interface endpoints are configured and appear healthy.

Assumption

The issue must be an IAM permission problem or a misconfigured ECR endpoint.

Root cause

ECR stores image layers in S3. The Fargate task had no route to S3 because we only configured ECR Interface Endpoints (which reach ECR API) but forgot the S3 Gateway Endpoint for the private subnet route table. The ECR API succeeded, but pulling the actual image bytes from S3 failed silently.

Fix

Add a com.amazonaws.<region>.s3 Gateway Endpoint (substituting your Region) to the route table of the private subnets. Also ensure that the S3 endpoint policy allows access from the VPC.

Key lesson

Always check the full data path: ECR interface endpoints handle API calls, but S3 gateway endpoints are required for layer downloads.
When debugging 'CannotPullContainerError' with correct IAM, suspect missing S3 endpoint before anything else.
Add a VPC endpoint checklist to your deployment runbook: ECR API, ECR Docker, S3 Gateway, and CloudWatch Logs interface endpoints for private Fargate tasks.
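The checklist in the last lesson is small enough to automate. The sketch below hard-codes the four endpoints named above and reports which are missing; the function names are hypothetical, and the list of configured endpoints is passed in rather than fetched from AWS so the example runs offline.

```python
def required_endpoints(region):
    """Endpoints a private Fargate task needs to pull from ECR and ship logs."""
    return {
        f"com.amazonaws.{region}.ecr.api": "Interface",
        f"com.amazonaws.{region}.ecr.dkr": "Interface",
        f"com.amazonaws.{region}.logs": "Interface",
        f"com.amazonaws.{region}.s3": "Gateway",
    }


def missing_endpoints(region, configured):
    """Return {service_name: endpoint_type} for every required endpoint not configured."""
    have = set(configured)
    return {name: kind for name, kind in required_endpoints(region).items() if name not in have}


# The setup from the incident: ECR and Logs endpoints exist, S3 does not.
configured = [
    "com.amazonaws.us-east-1.ecr.api",
    "com.amazonaws.us-east-1.ecr.dkr",
    "com.amazonaws.us-east-1.logs",
]
print(missing_endpoints("us-east-1", configured))
```

For the incident's configuration the only gap reported is the S3 Gateway endpoint, which is exactly the piece that carries the image layers.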

Production debug guide: symptom-to-action guide for common networking and permission issues (3 entries)

Symptom · 01

Task fails with 'CannotPullContainerError'

→

Fix

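Check that the ECR interface endpoints (ecr.api and ecr.dkr) and the S3 gateway endpoint exist, and that the private subnets' route tables include the S3 endpoint. Run 'aws ecr get-login-password' from a jump host in the same subnet to confirm the ECR API path; note that this does not exercise the S3 layer download.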

Symptom · 02

Task starts but health checks fail with connection timeout

→

Fix

Check security group rules: task security group must allow inbound from ALB, and ALB security group must allow outbound to task on target port. Verify network ACLs.

Symptom · 03

Task exits immediately with 'ResourceInitializationError'

→

Fix

Read the stopped task's stoppedReason ('aws ecs describe-tasks') for the underlying error. Common cause: the execution role is missing 'logs:CreateLogStream' or 'logs:PutLogEvents' on the CloudWatch log group. Also verify that the log group exists.

★ AWS ECS Deployment Debug Cheat Sheet · Quick commands and fixes for the top 3 ECS deployment failures.

Task stuck in PENDING

Immediate action

Check CPU/memory limits in the task definition; the cluster may have insufficient capacity.

Commands

aws ecs describe-clusters --clusters <cluster-name> --query 'clusters[0].registeredContainerInstancesCount'

aws ecs list-container-instances --cluster <cluster-name>

Fix now

Increase cluster capacity or reduce task size. For Fargate, ensure the task definition's CPU/Memory combinations are valid.
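
A quick way to catch an invalid Fargate size before registering a task definition is to check it against the supported CPU/memory table. The sketch below hard-codes the combinations as I understand them from the Fargate documentation at the time of writing; treat the table as an assumption and verify it against the current docs. The function names are illustrative.

```python
GIB = 1024  # MiB per GiB


def valid_fargate_memory(cpu_units):
    """Return the set of memory values (MiB) allowed for a Fargate CPU size."""
    table = {
        256: [512, 1024, 2048],
        512: range(1 * GIB, 4 * GIB + 1, GIB),             # 1-4 GB, 1 GB steps
        1024: range(2 * GIB, 8 * GIB + 1, GIB),            # 2-8 GB, 1 GB steps
        2048: range(4 * GIB, 16 * GIB + 1, GIB),           # 4-16 GB, 1 GB steps
        4096: range(8 * GIB, 30 * GIB + 1, GIB),           # 8-30 GB, 1 GB steps
        8192: range(16 * GIB, 60 * GIB + 1, 4 * GIB),      # 16-60 GB, 4 GB steps
        16384: range(32 * GIB, 120 * GIB + 1, 8 * GIB),    # 32-120 GB, 8 GB steps
    }
    return set(table.get(cpu_units, []))


def is_valid_fargate_size(cpu_units, memory_mib):
    """True if the CPU/memory pair is a supported Fargate task size."""
    return memory_mib in valid_fargate_memory(cpu_units)


print(is_valid_fargate_size(256, 512))    # smallest supported size
print(is_valid_fargate_size(256, 4096))   # too much memory for 0.25 vCPU
```

An unsupported pair is rejected when the task definition is registered or the task is run, so a check like this mainly saves a failed deployment round trip.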

Health check failures on new tasks

Task fails after start with 'OutOfMemoryError'

Aspect	AWS CodePipeline / Native CI/CD	Terraform + External Pipeline
State Management	Managed by AWS — no state file	S3 backend + DynamoDB lock — you own it
Multi-Account Deployments	CloudFormation StackSets built-in	Requires workspace strategy + CI config
Drift Detection	Native CloudFormation drift detection	terraform plan against live infra
Rollback Mechanism	Automatic via CodeDeploy strategies	No built-in rollback; revert the configuration commit and re-run terraform apply
Secret Handling	Native SSM/Secrets Manager integration	AWS provider + data sources to fetch secrets
Cost	$1/month per active V1 pipeline; V2 pipelines bill per action-minute	Terraform OSS is free; Terraform Cloud adds cost
Learning Curve	Lower for AWS-only teams	Higher, but skills transfer to other clouds
Best For	AWS-native orgs, compliance-heavy environments	Multi-cloud orgs, teams with existing TF expertise

Key takeaways

IAM policy evaluation always processes explicit Deny first: an explicit Deny in a bucket policy or SCP beats any Allow in an identity-based policy, regardless of how permissive the role looks in isolation.

ECS Task Role ≠ ECS Execution Role: the execution role is for ECS pulling your image and writing logs; the task role is for your application code calling AWS APIs at runtime. Mixing them up causes silent permission failures.

CloudFormation UPDATE_ROLLBACK_FAILED isn't a disaster: use 'continue-update-rollback --resources-to-skip' to unstick it, but prevent it by always previewing changes with Change Sets before applying updates to production stacks.

Fargate tasks in private subnets need BOTH ECR Interface Endpoints AND an S3 Gateway Endpoint: ECR stores image layers in S3, so missing the S3 endpoint causes CannotPullContainerError even when all ECR permissions are correct.

Monitoring should alarm on percentile (p99) latency, not average, to catch tail latencies: average hides the problems that actually hurt users.

Use composite alarms to reduce alert fatigue: a single metric alarm firing alone might be noise, but two correlated alarms signal a real incident.
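
To see why the percentile takeaway matters, compare the average and the p99 of the same latency sample. The numbers below are made up for illustration, and `percentile` is a simple nearest-rank implementation written for this sketch, not how CloudWatch computes its statistics.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 97 fast requests and 3 very slow ones, latencies in milliseconds.
latencies_ms = [100] * 97 + [3000] * 3

average = sum(latencies_ms) / len(latencies_ms)
p99 = percentile(latencies_ms, 99)

print(f"average: {average} ms")  # average: 187.0 ms
print(f"p99:     {p99} ms")      # p99:     3000 ms
```

An alarm threshold of, say, 500ms would never fire on the average here, while the p99 shows that some users are waiting three seconds.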

Common mistakes to avoid

4 patterns

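Putting secrets in CloudFormation parameters as plain 'String' type instead of resolving them with dynamic references

Symptom

The secret value appears in plaintext in the CloudFormation console under 'Parameters' and in CloudTrail events.

Fix

Resolve secrets with dynamic references ('{{resolve:ssm-secure:...}}' for SecureString SSM parameters or '{{resolve:secretsmanager:...}}' for Secrets Manager) so the value is never exposed in AWS console history or API responses. CloudFormation template parameters of type 'AWS::SSM::Parameter::Value<String>' do not support SecureString. If a secret must be passed as a parameter, set 'NoEcho: true'.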

Forgetting the S3 VPC Gateway Endpoint when running Fargate tasks in private subnets

Symptom

Fargate tasks fail with CannotPullContainerError even though ECR Interface Endpoints are configured. ECR stores image layers in S3, and S3 traffic has no route out of the VPC.

Fix

Add a com.amazonaws.<region>.s3 Gateway Endpoint to the route tables of your private subnets alongside your ECR interface endpoints.

Treating CodeBuild as a persistent build server and relying on local filesystem state between builds

Symptom

A CodeBuild project starts fresh every time; any npm install, pip install, or compiled artifact from a previous build is gone.

Fix

Configure a build cache in S3 (specify cache paths in buildspec.yml under 'cache: paths') or use a custom Docker image with pre-installed dependencies as your CodeBuild environment image to dramatically cut build times.

Using 'Resource': '*' in IAM policies for Lambda execution roles

Symptom

A compromise of the Lambda could lead to data exfiltration across all resources in the account.

Fix

Scope the Resource to specific ARNs like 'arn:aws:s3:::my-bucket/*' and use Conditions to limit access further.
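
The wildcard-resource mistake is easy to lint for before a policy ships. The following is a minimal sketch that flags Allow statements whose Resource is a bare '*'; the function name `wildcard_statements`, the bucket name, and the VPC endpoint ID are hypothetical placeholders, and a real review would also need to consider NotResource and wildcard Actions.

```python
def wildcard_statements(policy):
    """Return the Allow statements in an IAM policy document whose
    Resource is (or includes) a bare '*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if "*" in resources:
            flagged.append(stmt)
    return flagged


broad_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"}],
}

scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/*",
            # Hypothetical endpoint ID: only allow access through one VPC endpoint.
            "Condition": {"StringEquals": {"aws:SourceVpce": "vpce-0123456789abcdef0"}},
        }
    ],
}

print(len(wildcard_statements(broad_policy)))   # 1
print(len(wildcard_statements(scoped_policy)))  # 0
```

A check like this fits naturally into a CI step that fails the build when a Lambda execution role policy contains a flagged statement.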

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

A production ECS service with 10 tasks is failing health checks for 30% ...

Q02 · SENIOR

You need to design a secure cross-account CI/CD pipeline using AWS Organ...

Q03 · SENIOR

Explain the lifecycle of a request hitting an Application Load Balancer,...

Q04 · SENIOR

How would you implement a blue/green deployment for a stateful applicati...

Q05 · SENIOR

What is the difference between a CloudWatch Alarm and a Composite Alarm,...

Q06 · SENIOR

Your team uses Terraform and you notice state drift between the remote s...

Q01 of 06 · SENIOR

A production ECS service with 10 tasks is failing health checks for 30% of its nodes after a deployment. Walk me through the automated recovery process and your manual root cause analysis steps.

ANSWER

Automated Recovery:
- CodeDeploy with Blue/Green: if health checks fail on the new task set, traffic stays on the old set. Auto-rollback is triggered after a configurable bake time (e.g., 5 minutes).
- For canary deployments, CloudWatch alarms on error rate > 1% trigger rollback immediately.

Manual Root Cause Analysis:
1. Check ECS service events and task logs via CloudWatch Logs.
2. Verify the health check endpoint: curl the container's IP from a jump host in the same VPC.
3. Check security groups: ensure ALB can reach the task on the health check port.
4. Compare failing vs healthy tasks: task definition, launch type (Fargate vs EC2), resource constraints.
5. Look for common patterns: new task definition introduced a changed health check path, or the container runs out of memory under load.
6. If the health check is not the issue, examine the ALB target group health check configuration (interval, timeout, threshold).

Key: The automated recovery buys time. The manual analysis should focus on the difference between healthy and unhealthy tasks.
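
The "compare failing vs healthy tasks" step can be made mechanical. This is a rough sketch, assuming you have flattened each task into a dictionary of hashable attributes plus a 'healthy' flag; the attribute names and values (subnet IDs, task definition revision) are hypothetical and would come from your own tooling.

```python
from collections import defaultdict


def distinguishing_attributes(tasks):
    """Return {attribute: values} for values seen only on unhealthy tasks.

    Each task is a dict with a boolean 'healthy' key plus arbitrary
    attributes whose values must be hashable.
    """
    seen = {True: defaultdict(set), False: defaultdict(set)}
    for task in tasks:
        for key, value in task.items():
            if key != "healthy":
                seen[task["healthy"]][key].add(value)

    result = {}
    for key, values in seen[False].items():
        only_unhealthy = values - seen[True].get(key, set())
        if only_unhealthy:
            result[key] = only_unhealthy
    return result


# 10 tasks on the same revision; the 3 unhealthy ones share a subnet.
tasks = (
    [{"healthy": True, "task_definition": "web:42", "subnet": "subnet-a"}] * 7
    + [{"healthy": False, "task_definition": "web:42", "subnet": "subnet-b"}] * 3
)

print(distinguishing_attributes(tasks))  # {'subnet': {'subnet-b'}}
```

Here the shared task definition drops out and the subnet stands out, which would point the investigation at routing, NACLs, or endpoints for that subnet rather than at the application.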

FAQ · 6 QUESTIONS

Frequently Asked Questions

What are the top 3 AWS DevOps interview questions for senior roles?

How do I answer 'Why did you choose Terraform over CloudFormation?'

What is the 'Principle of Least Privilege' in an AWS context?

How do you manage secrets across multiple environments on AWS?

How does AWS CodeDeploy handle rollbacks for Lambda functions?

How do you handle secrets in CI/CD pipelines on AWS?

Naren Founder & Principal Engineer

20+ years shipping production code across the stack, with years spent interviewing engineers. Lessons pulled from things that broke in production.

May 24, 2026

last updated
