Senior 5 min · March 06, 2026

AWS DevOps Interview Questions — Production Debugging Focus

CannotPullContainerError with correct IAM? Missing S3 Gateway Endpoint.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • AWS DevOps interview questions test architectural tradeoffs, not service definitions.
  • Key areas: CI/CD pipeline design, IAM policy evaluation order, IaC (CloudFormation vs Terraform), and ECS/EKS container orchestration.
  • Performance insight: A well-architected pipeline can cut deployment time from 30 minutes to under 5.
  • Production insight: IAM misconfigurations (excessive permissions, missing resource policies) are among the most common root causes of AWS security incidents.
  • Biggest mistake: Memorizing service names without understanding failure modes and recovery strategies.
Plain-English First

Think of AWS like a giant, perfectly organised city. EC2 instances are the buildings, IAM is the security guard deciding who gets through the door, CloudFormation is the city blueprint, and CodePipeline is the conveyor belt that takes your raw code and delivers a finished product to the right building automatically. A DevOps engineer is the city planner — they design how all those pieces talk to each other, stay healthy, and rebuild themselves when something goes wrong.

DevOps on AWS isn't just a checkbox on a job description. It's the difference between a team that ships features on a Friday afternoon with confidence and one that treats deployments like defusing a bomb. Amazon Web Services holds roughly a third of the cloud infrastructure market, and companies aren't hiring AWS DevOps engineers to click buttons in a console. They need people who can architect pipelines, diagnose failures at 2am, and make strong tradeoff decisions under pressure.

The problem most candidates run into is that they've memorised service names without understanding the reasoning behind architectural choices. Interviewers at mid-to-senior level aren't impressed by someone who can recite what S3 stands for — they want to hear you explain why you'd choose an ALB over an NLB for a microservices workload, or why you'd reach for SSM Parameter Store instead of hardcoding an environment variable.

This article covers the AWS DevOps interview questions that actually get asked in technical screens and on-site rounds at companies ranging from Series B startups to FAANG-adjacent engineering orgs. By the end, you'll have battle-ready answers with the depth and nuance that separates a senior-level response from a junior one — and you'll understand the 'why' well enough to adapt your answer to any follow-up curveball.

CI/CD on AWS — CodePipeline, CodeBuild, and the Questions Behind Them

The most common first question in an AWS DevOps screen is some variation of: 'Walk me through your CI/CD pipeline.' Interviewers aren't looking for a list of services — they're listening for your decision-making.

CodePipeline is AWS's native pipeline orchestrator. It doesn't build or deploy anything itself — it coordinates other services. CodeBuild handles the compilation, testing, and packaging (it's a fully managed build server billed per build minute). CodeDeploy handles the actual deployment to EC2, Lambda, or ECS. Understanding that separation of concerns is critical.

Why use CodePipeline over Jenkins? The honest answer is: it depends. CodePipeline has zero infrastructure to manage and integrates natively with IAM, CloudTrail, and EventBridge. Jenkins gives you more plugin flexibility and is easier to migrate if you leave AWS. The right answer in an interview is to name the tradeoff, not pick a winner blindly.

One detail that trips people up: CodeBuild runs in an isolated, ephemeral container. That means any state — installed dependencies, cached layers — is gone after the build unless you explicitly configure a build cache in S3. Forgetting this is why builds that work locally are mysteriously slow or broken in CodeBuild.

CI_CD_Pipeline_Questions.md (Markdown)
Q: How does your team handle failed deployments in CodePipeline?

WEAK ANSWER:
  'We just re-run the pipeline.'

STRONG ANSWER:
  'We use CodeDeploy with a Blue/Green deployment strategy for ECS services.
   If the post-deployment health check fails — we define a 5-minute window
   where the load balancer monitors the new task set — CodeDeploy automatically
   rolls back by shifting traffic back to the original task set.

   For Lambda, we use CodeDeploy with the CodeDeployDefault.LambdaCanary10Percent5Minutes
   configuration: 10% of traffic goes to the new version for 5 minutes. If CloudWatch alarms
   tied to error rate or latency fire during that window, the deployment is rolled back automatically.
   No human has to be awake for that to happen.

   We also enable CloudTrail on the pipeline so every approval action,
   stage transition, and artifact push is auditable.'

---

Q: What is the difference between CodeDeploy in-place and Blue/Green?

IN-PLACE:
  - Stops old app version on existing instances
  - Installs new version on the SAME instances
  - Cheaper (no duplicate infrastructure)
  - Downtime risk if health check fails mid-deployment
  - Good for: dev/staging environments, non-critical workloads

BLUE/GREEN:
  - New version deployed to a SEPARATE set of instances/tasks
  - Load balancer shifts traffic only after health checks pass
  - Zero-downtime by design
  - Costs more during the transition window (double the compute)
  - Good for: production, anything customer-facing

KEY INSIGHT:
  Blue/Green is not just a deployment strategy — it's also a rollback strategy.
  Your 'blue' environment stays live until you're confident in 'green'.
  If something goes wrong, a traffic shift takes seconds, not a redeploy.
Output
N/A — interview Q&A format. These are model answers, not runnable code.
Interview Gold:
When asked about CI/CD, proactively mention your rollback strategy before the interviewer asks. It signals production maturity. Say: 'And if the deployment fails, here's exactly what happens automatically...' — most candidates never get there.
Production Insight
In production, the most common CI/CD failure is a CodeBuild build that works locally but fails in the pipeline because ephemeral containers lack pre-installed tooling.
Always explicitly define the build environment image and install dependencies in the buildspec.
Rule: Never rely on the default CodeBuild image for language-specific builds.
Key Takeaway
CI/CD is about failure recovery as much as pipeline speed.
Proactively design rollback strategies into your pipeline.
The strongest answer is one that includes both.
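
The canary behaviour described in the strong answer above can be sketched in a few lines. This is a minimal, hypothetical simulation and not CodeDeploy's implementation: `canary_decision`, the per-minute list of alarm states, and the returned traffic percentages are all assumptions made for illustration.

```python
from typing import List, Tuple


def canary_decision(alarm_states: List[str], bake_minutes: int = 5) -> Tuple[str, int]:
    """Decide the outcome of a canary deployment (illustrative model only).

    alarm_states holds one alarm state ("OK" or "ALARM") per minute.
    Returns (decision, percent_of_traffic_on_new_version).
    """
    for state in alarm_states[:bake_minutes]:
        if state == "ALARM":
            # Any alarm inside the bake window shifts all traffic back.
            return ("rollback", 0)
    # The window passed cleanly, so the remaining traffic follows the canary.
    return ("promote", 100)


healthy = canary_decision(["OK", "OK", "OK", "OK", "OK"])
failing = canary_decision(["OK", "OK", "ALARM", "OK", "OK"])
print(healthy, failing)
```

The point to make in an interview is that the rollback branch needs no human: the alarm state alone decides where traffic goes.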

IAM, Security, and the Principle of Least Privilege — Where Candidates Get Caught

IAM is the single most tested AWS topic in DevOps interviews, and also the most misunderstood. Most candidates can explain what a policy is — very few can explain the evaluation logic when multiple policies conflict.

Here's the core rule: AWS evaluates all applicable policies (identity-based, resource-based, permission boundaries, SCPs). An explicit Deny anywhere always wins. Without an explicit Deny, the request still needs an Allow from an identity-based or resource-based policy, and any SCP or permission boundary that applies must also allow the action; everything else is an implicit deny. This seems obvious until you're debugging why a Lambda function can't write to an S3 bucket even though the IAM role allows s3:PutObject, and it turns out the bucket policy has an explicit Deny for all non-VPC traffic.
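
A simplified model of that decision order may help when explaining it on a whiteboard. The sketch below is an assumption-heavy illustration, not AWS's evaluation engine: `Statement` and `evaluate` are hypothetical names, conditions are ignored (a Deny statement is assumed to have already matched its condition), and SCPs and permission boundaries are left out.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List


@dataclass(frozen=True)
class Statement:
    effect: str    # "Allow" or "Deny"
    action: str    # e.g. "s3:PutObject" or "s3:*"
    resource: str  # ARN pattern, "*" wildcards allowed


def matches(stmt: Statement, action: str, resource: str) -> bool:
    # IAM action names are case-insensitive; ARNs are compared as written.
    return (fnmatchcase(action.lower(), stmt.action.lower())
            and fnmatchcase(resource, stmt.resource))


def evaluate(statements: List[Statement], action: str, resource: str) -> str:
    applicable = [s for s in statements if matches(s, action, resource)]
    if any(s.effect == "Deny" for s in applicable):
        return "explicit deny"   # wins over any Allow
    if any(s.effect == "Allow" for s in applicable):
        return "allow"
    return "implicit deny"       # nothing granted the action


obj = "arn:aws:s3:::thecodeforge-prod-data/report.csv"
role_policy = [Statement("Allow", "s3:PutObject", "arn:aws:s3:::thecodeforge-prod-data/*")]
bucket_policy = [Statement("Deny", "s3:*", "arn:aws:s3:::thecodeforge-prod-data/*")]

print(evaluate(role_policy, "s3:PutObject", obj))
print(evaluate(role_policy + bucket_policy, "s3:PutObject", obj))
```

The role policy alone allows the write; adding the bucket policy's Deny flips the result even though the role did not change.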

The other area interviewers probe hard: IAM Roles vs. IAM Users for automation. The correct answer today is roles for anything machine-to-machine. IAM users have long-lived static credentials, so a leaked key is a breach. Roles use short-lived STS credentials that rotate automatically. EC2 instance profiles, ECS task roles, and Lambda execution roles all use the role mechanism.

Permission Boundaries are often a senior-level differentiator. They let you delegate IAM administration safely: you can allow a team to create their own roles, but cap the maximum permissions those roles can ever have. It's the difference between 'trust but verify' and 'trust and you can't accidentally escalate anyway.'
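
One way to picture a permission boundary is as a set intersection: a role can only do what both its own policy and the boundary allow. The sketch below is a deliberately simplified model under stated assumptions: actions are exact strings with no wildcards, and `effective_permissions` is a hypothetical helper, not an AWS API.

```python
from typing import Set


def effective_permissions(identity_allows: Set[str], boundary_allows: Set[str]) -> Set[str]:
    """Actions a role can actually use: allowed by its policy AND by the boundary."""
    return identity_allows & boundary_allows


# A team attaches a broad policy to a role they created themselves...
team_role_policy = {"s3:GetObject", "dynamodb:PutItem", "iam:CreateUser"}
# ...but the platform team's boundary caps what any such role may do.
boundary = {"s3:GetObject", "s3:PutObject", "dynamodb:PutItem"}

usable = effective_permissions(team_role_policy, boundary)
print(sorted(usable))
```

In this example `iam:CreateUser` drops out even though the team granted it, and `s3:PutObject` is not usable either, because a boundary limits permissions without granting any.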

io/thecodeforge/iam/SecureBucketPolicy.json (JSON)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceVPCAccessOnly",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::thecodeforge-prod-data",
        "arn:aws:s3:::thecodeforge-prod-data/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-0123456789abcdef0"
        }
      }
    }
  ]
}
Output
Demonstrates an explicit Deny for any request that does not arrive from the named VPC, a common senior interview discussion point.
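
It is worth being able to explain how the condition in this policy behaves when the request has no `aws:SourceVpc` value at all, which is the case for a caller that does not go through a VPC endpoint. A minimal sketch, assuming the documented behaviour that negated operators such as StringNotEquals match when the key is absent; `string_not_equals` and `deny_applies` are hypothetical helper names.

```python
ALLOWED_VPC = "vpc-0123456789abcdef0"


def string_not_equals(context: dict, key: str, expected: str) -> bool:
    # A missing key yields None, which is "not equal", so the condition matches.
    return context.get(key) != expected


def deny_applies(request_context: dict) -> bool:
    """True when the EnforceVPCAccessOnly statement denies the request."""
    return string_not_equals(request_context, "aws:SourceVpc", ALLOWED_VPC)


from_endpoint = {"aws:SourceVpc": ALLOWED_VPC}
other_vpc = {"aws:SourceVpc": "vpc-0fedcba9876543210"}  # made-up ID for illustration
no_vpc_context = {}  # e.g. a Lambda function that is not attached to the VPC

for ctx in (from_endpoint, other_vpc, no_vpc_context):
    print(ctx, "->", "DENY" if deny_applies(ctx) else "not denied by this statement")
```

Only the request that carries the expected VPC ID escapes the Deny; it still needs an Allow from some other policy to succeed.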
Watch Out:
Never say 'I'd just give it admin access temporarily to get it working.' That phrase ends interviews. The correct answer is always 'I'd scope the policy to the minimum required actions and resources, then use IAM Policy Simulator to verify.' Interviewers are testing whether you'd create a security incident in production.
Production Insight
The most embarrassing IAM outage I've seen: a Lambda function couldn't write to S3 because the IAM role had s3:PutObject, but the bucket policy included an explicit Deny for non-VPC traffic, and the Lambda wasn't in the VPC.
Always check both identity-based and resource-based policies.
Rule: An explicit Deny anywhere beats an Allow everywhere.
Key Takeaway
IAM is the most common place AWS outages originate.
Master the evaluation logic and permission boundaries.
In an interview, showing you understand that nuance makes you senior.

Infrastructure as Code — CloudFormation vs Terraform and Real Architecture Questions

CloudFormation is AWS-native IaC. Terraform is cloud-agnostic HCL-based IaC by HashiCorp. This comparison comes up in almost every AWS DevOps interview, and the trap is giving a tribal 'Terraform is better' answer without nuance.

CloudFormation's strengths: native drift detection, StackSets for multi-account/multi-region deployments, no state file management (AWS manages state), and deep integration with AWS services like Service Catalog and CDK. Its weaknesses: verbose YAML/JSON, a slower development cycle, and error messages that are notoriously unhelpful ('UPDATE_ROLLBACK_COMPLETE' tells you nothing about what actually failed).

Terraform's strengths: multi-cloud portability, cleaner module system, better plan output (you see exactly what will change before it changes), and a massive community registry of modules. Its weakness: you own the state file, which means you need a backend (S3 + DynamoDB for locking), and state file corruption or drift is your problem to solve.

In practice, many mature AWS shops use both: CloudFormation for account-level infrastructure (VPCs, IAM foundations, Service Control Policies) via AWS Control Tower, and Terraform for application-level infrastructure managed by product teams. Knowing this hybrid pattern signals real-world experience.

io/thecodeforge/terraform/main.tf (HCL)
/* 
 * Production-grade Terraform backend configuration.
 * Explaining this locking mechanism shows you understand state safety.
 */
terraform {
  backend "s3" {
    bucket         = "thecodeforge-tf-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "thecodeforge-tf-locks"
    encrypt        = true
  }
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "thecodeforge-main-vpc"
  cidr   = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
Output
Provisioning a highly available VPC with NAT gateways and state locking.
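
If the interviewer asks what the lock table actually buys you, it can help to describe it as a conditional write: a lock is only created when no lock item exists yet. The sketch below is an in-memory stand-in written for illustration, not Terraform's code or the DynamoDB API; `LockTable`, the lock ID format, and the owner names are assumptions.

```python
from typing import Dict


class LockTable:
    """In-memory stand-in for a lock table with put-if-absent semantics."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def acquire(self, lock_id: str, owner: str) -> bool:
        # Mirrors a conditional write: succeed only if the item is absent.
        if lock_id in self._items:
            return False
        self._items[lock_id] = owner
        return True

    def release(self, lock_id: str, owner: str) -> bool:
        # Only the current holder may remove the lock.
        if self._items.get(lock_id) == owner:
            del self._items[lock_id]
            return True
        return False


locks = LockTable()
state = "thecodeforge-tf-state/global/s3/terraform.tfstate"  # illustrative lock ID

first = locks.acquire(state, "ci-run-101")    # first apply takes the lock
second = locks.acquire(state, "laptop-alice")  # concurrent apply is refused
locks.release(state, "ci-run-101")
third = locks.acquire(state, "laptop-alice")   # succeeds once the lock is free
print(first, second, third)
```

The second caller is refused instead of silently overwriting state, which is the whole reason the lock exists.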
Pro Tip:
When answering IaC questions, always end with 'and here's how I'd handle drift.' CloudFormation has native drift detection; Terraform has terraform plan against live infrastructure. Showing you think about what happens after initial deployment separates senior candidates from mid-level ones.
Production Insight
Teams that use CloudFormation without StackSets often hit a wall when they need to deploy the same VPC to 20 accounts.
They end up copying templates manually, which introduces drift.
Rule: If you have more than 5 accounts, use StackSets or Terraform workspaces from day one.
Key Takeaway
IaC isn't about which tool—it's about state safety and drift detection.
CloudFormation native drift detection is a hidden gem.
In an interview, mention 'drift detection' before the interviewer does.

ECS, EKS, and Container Orchestration — The Questions That Reveal Depth

Container questions are where AWS DevOps interviews get genuinely technical. The ECS vs EKS question is almost a given, and the wrong move is to immediately say 'Kubernetes is always better.'

ECS (Elastic Container Service) is AWS-native container orchestration. It uses Task Definitions (the blueprint for a container workload) and Services (which maintain the desired count of tasks and wire them to load balancers). ECS with Fargate means zero EC2 management — AWS provisions the underlying compute per task. ECS with EC2 launch type means you manage the cluster nodes, but you get more control over instance types and pricing (Reserved Instances, Savings Plans).

EKS (Elastic Kubernetes Service) gives you a managed Kubernetes control plane. Use it when your team already has Kubernetes expertise, you need to run the same workloads on-prem and in AWS, or your application requires Kubernetes-native features like custom operators or CRDs. The operational overhead is meaningfully higher than ECS.

The real interview depth comes from task role vs execution role in ECS, a distinction that trips up many candidates. The execution role is what ECS uses to pull the container image from ECR and write logs to CloudWatch. The task role is what your application code uses to call other AWS services (DynamoDB, S3, etc.). Mixing these up is a classic misconfiguration that causes silent permission failures.

io/thecodeforge/ecs/TaskDefinition.yaml (YAML)
# CloudFormation snippet for an ECS Task Definition.
# Senior highlight: Explicitly separating the two roles.
ForgeTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: thecodeforge-api-task
    Cpu: 256
    Memory: 512
    NetworkMode: awsvpc
    RequiresCompatibilities:
      - FARGATE
    # Used by the ECS Agent (ECR Pulls, CloudWatch Logs)
    ExecutionRoleArn: !Ref ForgeExecutionRole
    # Used by the Application Code (S3, DynamoDB calls)
    TaskRoleArn: !Ref ForgeTaskRole
    ContainerDefinitions:
      - Name: api-container
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/forge-api:latest
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/thecodeforge-api
            awslogs-region: us-east-1
            awslogs-stream-prefix: ecs
Output
Defines a Fargate task with clear role separation, ready for high-scale production.
Interview Gold:
If asked 'ECS or EKS?', give this answer: 'ECS for AWS-native teams who want low operational overhead and Fargate's serverless model. EKS when the team has Kubernetes expertise, needs multi-cloud portability, or requires CRDs and custom operators. ECS is faster to get right; EKS is harder to get right but more portable.' That's a senior-level answer.
Production Insight
The most common production issue with ECS is mixing up execution role and task role.
Developers add S3 permissions to the execution role (which ECS uses), then wonder why their app can't write to S3.
Rule: Execution role is for the ECS agent; task role is for your application.
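
A quick review check along these lines can catch the mix-up before deployment. This is a rough sketch under stated assumptions: the allowlist below covers only image pulls and log delivery (real execution roles may legitimately need more, such as reading secrets for injection), and `misplaced_actions` is a hypothetical helper.

```python
from typing import Set

# Actions the ECS agent needs to pull from ECR and ship logs (simplified allowlist).
EXECUTION_ROLE_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "logs:CreateLogStream",
    "logs:PutLogEvents",
}


def misplaced_actions(execution_role_actions: Set[str]) -> Set[str]:
    """Return actions that look like application permissions on the execution role."""
    return execution_role_actions - EXECUTION_ROLE_ACTIONS


suspect = misplaced_actions({
    "ecr:BatchGetImage",
    "logs:PutLogEvents",
    "s3:PutObject",        # application call: belongs on the task role
    "dynamodb:GetItem",    # application call: belongs on the task role
})
print(sorted(suspect))
```

Anything this flags is a candidate to move to the task role, where the application code can actually use it.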
Key Takeaway
ECS vs EKS is a team maturity decision, not a question of technical superiority.
The real depth is in networking, roles, and health checks.
In interviews, the task role vs execution role distinction is the senior trap.

Monitoring, Observability, and Incident Response — The DevOps Interview Difference

Interviewers dig into monitoring because they want to know how you detect and respond to failures before customers do. The standard answer—'we use CloudWatch alarms'—is not enough. You need to show you understand the difference between monitoring (tracking known metrics) and observability (exploring unknown failure modes).

On AWS, CloudWatch collects metrics, logs, and events. But CloudWatch Metrics alone won't catch an intermittent 503 error that happens only when a downstream service is slow. That's where structured logging (JSON format) and distributed tracing with X-Ray come in. The trick is to log correlation IDs so you can trace a request across EC2, Lambda, RDS, and S3.

A senior DevOps answer should include: metrics (CPU, memory, request latency) that trigger alarms, structured logs with context for debugging, and traces for pinpointing bottlenecks. Also mention alarm fatigue: too many alarms cause engineers to ignore them. Use composite alarms to reduce noise.

io/thecodeforge/monitoring/CompositeAlarms.yaml (YAML)
AWSTemplateFormatVersion: '2010-09-09'
Description: Composite alarm to reduce alert fatigue by combining error rate and latency.
Resources:
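  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighErrorRate
      # ALB publishes target 5xx responses as HTTPCode_Target_5XX_Count.
      MetricName: HTTPCode_Target_5XX_Count
      Namespace: AWS/ApplicationELB
      # Add a Dimensions entry (Name: LoadBalancer) for your ALB, or the alarm has no data to evaluate.
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 2
      Threshold: 10
      ComparisonOperator: GreaterThanOrEqualToThreshold
  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighLatency
      MetricName: TargetResponseTime
      Namespace: AWS/ApplicationELB
      # Percentiles go in ExtendedStatistic; Statistic only accepts values such as Average or Sum.
      ExtendedStatistic: p99
      Period: 60
      EvaluationPeriods: 2
      # TargetResponseTime is reported in seconds, so 2 means two seconds.
      Threshold: 2
      ComparisonOperator: GreaterThanOrEqualToThreshold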
  CompositeHighErrorLatencyAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: CompositeHighErrorLatency
      AlarmRule: !Sub '(ALARM("${HighErrorRateAlarm}") AND ALARM("${HighLatencyAlarm}"))'
      ActionsEnabled: true
      AlarmActions:
        - !Ref SNSNotificationTopic
Output
Creates a composite alarm that fires only when both error rate and latency exceed thresholds, reducing noise.
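
The noise reduction is easy to demonstrate with a toy timeline. The sketch below models only the AND rule from the template; the four-minute timeline and the function name `composite_in_alarm` are made up for illustration.

```python
def composite_in_alarm(error_rate_state: str, latency_state: str) -> bool:
    """Mirror of the rule ALARM(error rate) AND ALARM(latency)."""
    return error_rate_state == "ALARM" and latency_state == "ALARM"


# One (error_rate_state, latency_state) pair per minute.
timeline = [
    ("OK", "OK"),
    ("ALARM", "OK"),     # brief error blip, latency fine
    ("OK", "ALARM"),     # slow minute, no errors
    ("ALARM", "ALARM"),  # both degraded: a real incident
]

pages_if_alarms_page_individually = sum(1 for e, l in timeline if "ALARM" in (e, l))
pages_if_only_composite_pages = sum(1 for e, l in timeline if composite_in_alarm(e, l))
print(pages_if_alarms_page_individually, pages_if_only_composite_pages)
```

In this made-up timeline, paging on either alarm alone would wake someone three times, while the composite rule pages once.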
Interview Gold:
When asked about monitoring, mention 'alarm fatigue' and how you use composite alarms. It shows you've been on-call and care about signal-to-noise ratio.
Production Insight
We once had a CloudWatch alarm on average latency that never fired because the average stayed under threshold. But 1% of requests were timing out at 30 seconds.
The fix was to use a percentile alarm (p99) instead of average.
Rule: Always alarm on p99 latency, not average.
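
The arithmetic behind that rule is worth being able to reproduce. The sketch below uses synthetic numbers chosen for illustration (2% of requests slow rather than 1%, so that p99 falls clearly inside the slow tail), a nearest-rank percentile written by hand, and an assumed 1-second threshold.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p / 100 * n) using integer math
    return ordered[rank - 1]


# 980 fast requests at 100 ms and 20 requests timing out at 30 s.
latencies = [0.1] * 980 + [30.0] * 20
threshold_seconds = 1.0  # assumed alarm threshold

average = sum(latencies) / len(latencies)
p99 = percentile(latencies, 99)

print(f"average={average:.3f}s alarm={average >= threshold_seconds}")
print(f"p99={p99:.1f}s alarm={p99 >= threshold_seconds}")
```

The average stays under the threshold because the fast majority dilutes the timeouts, while p99 lands directly on the 30-second requests.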
Key Takeaway
Monitoring without observability is blind.
Use the three pillars: metrics, logs, traces.
In an interview, mention 'alarm fatigue'—it shows production maturity.
Production incident · Post-mortem · Severity: high

ECS Task Fails to Pull Image in Private Subnet

Symptom
Task status shows 'CannotPullContainerError: Access Denied' in the ECS console. CloudWatch logs for the execution role show no errors. ECR interface endpoints are configured and appear healthy.
Assumption
The issue must be an IAM permission problem or a misconfigured ECR endpoint.
Root cause
ECR stores image layers in S3. The Fargate task had no route to S3 because we only configured ECR Interface Endpoints (which reach ECR API) but forgot the S3 Gateway Endpoint for the private subnet route table. The ECR API succeeded, but pulling the actual image bytes from S3 failed silently.
Fix
Add a com.amazonaws.<region>.s3 Gateway Endpoint to the route table of the private subnets. Also ensure that the S3 endpoint policy allows reads from the bucket ECR uses for image layers.
Key lesson
  • Always check the full data path: ECR interface endpoints handle API calls, but S3 gateway endpoints are required for layer downloads.
  • When debugging 'CannotPullContainerError' with correct IAM, suspect missing S3 endpoint before anything else.
  • Add a VPC endpoint checklist to your deployment runbook: ECR API, ECR Docker, S3 Gateway, and CloudWatch Logs interface endpoints for private Fargate tasks.
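
The endpoint checklist from the last lesson can be turned into a small pre-deployment check. This is a minimal sketch under stated assumptions: the input is a list of dictionaries shaped like the `ServiceName` and `VpcEndpointType` fields returned when describing VPC endpoints, fetching that list is left out, and `missing_endpoints` is a hypothetical helper.

```python
from typing import Dict, List

# Endpoints a private Fargate task needs, per the checklist above.
REQUIRED_ENDPOINTS = {
    "ecr.api": "Interface",
    "ecr.dkr": "Interface",
    "s3": "Gateway",
    "logs": "Interface",
}


def missing_endpoints(region: str, endpoints: List[Dict[str, str]]) -> List[str]:
    """Return a sorted list of required endpoints that are not present."""
    present = {(e["ServiceName"], e["VpcEndpointType"]) for e in endpoints}
    missing = []
    for service, endpoint_type in REQUIRED_ENDPOINTS.items():
        name = f"com.amazonaws.{region}.{service}"
        if (name, endpoint_type) not in present:
            missing.append(f"{name} ({endpoint_type})")
    return sorted(missing)


# The configuration from the incident: ECR and logs endpoints, but no S3 gateway.
incident_vpc = [
    {"ServiceName": "com.amazonaws.us-east-1.ecr.api", "VpcEndpointType": "Interface"},
    {"ServiceName": "com.amazonaws.us-east-1.ecr.dkr", "VpcEndpointType": "Interface"},
    {"ServiceName": "com.amazonaws.us-east-1.logs", "VpcEndpointType": "Interface"},
]
gaps = missing_endpoints("us-east-1", incident_vpc)
print(gaps)
```

Run against the incident's configuration, the only gap reported is the S3 gateway endpoint, which is exactly the piece that was missed.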
Production debug guide
Symptom-to-action guide for common networking and permission issues (3 entries)
Symptom · 01
Task fails with 'CannotPullContainerError'
Fix
Check ECR interface endpoints and S3 gateway endpoint. Verify route tables. Test with 'aws ecr get-login-password' from a jump host in the same subnet.
Symptom · 02
Task starts but health checks fail with connection timeout
Fix
Check security group rules: task security group must allow inbound from ALB, and ALB security group must allow outbound to task on target port. Verify network ACLs.
Symptom · 03
Task exits immediately with 'ResourceInitializationError'
Fix
Check CloudWatch logs for the execution role. Common cause: missing 'logs:PutLogEvents' permission on the CloudWatch log group. Also verify log group exists.
AWS ECS Deployment Debug Cheat Sheet
Quick commands and fixes for the top 3 ECS deployment failures.
Task stuck in PENDING
Immediate action
Check CPU/memory limits in the task definition; the cluster may have insufficient capacity.
Commands
aws ecs describe-clusters --clusters <cluster-name> --query 'clusters[0].registeredContainerInstancesCount'
aws ecs list-container-instances --cluster <cluster-name>
Fix now
Increase cluster capacity or reduce task size. For Fargate, ensure the task definition's CPU/Memory combinations are valid.
Health check failures on new tasks
Immediate action
Check the application's health check endpoint directly from a jump host in the same VPC.
Commands
curl -v http://<task-private-ip>:<port>/health
aws ecs describe-tasks --cluster <cluster> --tasks <task-arn> --query 'tasks[0].healthStatus'
Fix now
If the container itself is healthy but the load balancer reports unhealthy, verify security group rules and target group registration.
Task fails after start with 'OutOfMemoryError'
Immediate action
Check container memory limit vs. application memory usage.
Commands
aws ecs describe-tasks --cluster <cluster> --tasks <task-arn> --query "tasks[0].containers[0].managedAgents[?name=='ExecuteCommandAgent'].lastStatus"
aws ecs execute-command --cluster <cluster> --task <task-arn> --container <container> --command "free -m" --interactive
Fix now
Increase the memory hard limit in the task definition or reduce the application's memory usage. On the EC2 launch type you can also configure swap; Fargate does not support it.
| Aspect | AWS CodePipeline / Native CI/CD | Terraform + External Pipeline |
| State Management | Managed by AWS, no state file | S3 backend + DynamoDB lock; you own it |
| Multi-Account Deployments | CloudFormation StackSets built-in | Requires workspace strategy + CI config |
| Drift Detection | Native CloudFormation drift detection | terraform plan against live infra |
| Rollback Mechanism | Automatic via CodeDeploy strategies | No automatic rollback; re-apply the previous configuration from version control |
| Secret Handling | Native SSM/Secrets Manager integration | AWS provider + data sources to fetch secrets |
| Cost | Pay per active pipeline ($1/month/pipeline) | Terraform OSS is free; Terraform Cloud adds cost |
| Learning Curve | Lower for AWS-only teams | Higher, but skills transfer to other clouds |
| Best For | AWS-native orgs, compliance-heavy environments | Multi-cloud orgs, teams with existing TF expertise |

Key takeaways

1. IAM policy evaluation always processes explicit Deny first: an explicit Deny in a bucket policy or SCP beats any Allow in an identity-based policy, regardless of how permissive the role looks in isolation.
2. ECS Task Role ≠ ECS Execution Role: the execution role is for ECS pulling your image and writing logs; the task role is for your application code calling AWS APIs at runtime. Mixing them up causes silent permission failures.
3. CloudFormation UPDATE_ROLLBACK_FAILED isn't a disaster: use 'continue-update-rollback --resources-to-skip' to unstick it, but prevent it by always previewing changes with Change Sets before applying updates to production stacks.
4. Fargate tasks in private subnets need BOTH ECR Interface Endpoints AND an S3 Gateway Endpoint: ECR stores image layers in S3, so missing the S3 endpoint causes CannotPullContainerError even when all ECR permissions are correct.
5. Monitoring should alarm on percentile (p99) latency, not average, to catch tail latencies: average hides the problems that actually hurt users.
6. Use composite alarms to reduce alert fatigue: a single metric alarm firing alone might be noise, but two correlated alarms signal a real incident.

Common mistakes to avoid

Passing secrets as plain 'String' CloudFormation parameters instead of resolving them from SSM Parameter Store or Secrets Manager

Symptom
The secret value appears in plaintext in the CloudFormation console under 'Parameters' and in CloudTrail events.
Fix
Store the value as a SecureString SSM parameter or in Secrets Manager and resolve it from the template with a dynamic reference, so the value is never exposed in AWS console history or API responses.

Forgetting the S3 VPC Gateway Endpoint when running Fargate tasks in private subnets

Symptom
ECR stores image layers in S3, so even with ECR Interface Endpoints configured, Fargate tasks will still fail to pull images because S3 traffic has no route out of the VPC.
Fix
Add a com.amazonaws.<region>.s3 Gateway Endpoint to the route tables of your private subnets alongside your ECR interface endpoints.

Treating CodeBuild as a persistent build server and relying on local filesystem state between builds

Symptom
A CodeBuild project starts fresh every time; any npm install, pip install, or compiled artifact from a previous build is gone.
Fix
Configure a build cache in S3 (specify cache paths in buildspec.yml under 'cache: paths') or use a custom Docker image with pre-installed dependencies as your CodeBuild environment image to dramatically cut build times.

Using 'Resource': '*' in IAM policies for Lambda execution roles

Symptom
A compromise of the Lambda could lead to data exfiltration across all resources in the account.
Fix
Scope the Resource to specific ARNs like 'arn:aws:s3:::my-bucket/*' and use Conditions to limit access further.
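
A policy review step can flag this pattern automatically. The sketch below is a minimal linter for illustration only: it looks just for Allow statements whose Resource is exactly "*", ignores NotResource and Conditions, and `wildcard_statements` plus the example Sids are hypothetical.

```python
import json
from typing import List


def wildcard_statements(policy_json: str) -> List[str]:
    """Return the Sid (or position) of each Allow statement with Resource "*"."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # a single statement may not be wrapped in a list
        statements = [statements]
    flagged = []
    for index, stmt in enumerate(statements):
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if stmt.get("Effect") == "Allow" and "*" in resources:
            flagged.append(stmt.get("Sid", f"statement-{index}"))
    return flagged


example_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"},
        {"Sid": "Scoped", "Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-bucket/*"},
    ],
})
findings = wildcard_statements(example_policy)
print(findings)
```

Only the statement with the bare wildcard is reported; the one scoped to a bucket ARN passes.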
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Senior): A production ECS service with 10 tasks is failing health checks for 30% ...
Q02 (Senior): You need to design a secure cross-account CI/CD pipeline using AWS Organ...
Q03 (Senior): Explain the lifecycle of a request hitting an Application Load Balancer,...
Q04 (Senior): How would you implement a blue/green deployment for a stateful applicati...
Q05 (Senior): What is the difference between a CloudWatch Alarm and a Composite Alarm,...
Q06 (Senior): Your team uses Terraform and you notice state drift between the remote s...

Q01 of 06 (Senior)

A production ECS service with 10 tasks is failing health checks for 30% of its nodes after a deployment. Walk me through the automated recovery process and your manual root cause analysis steps.

ANSWER
Automated Recovery:
  • CodeDeploy with Blue/Green: if health checks fail on the new task set, traffic stays on the old set. Auto-rollback is triggered after a configurable bake time (e.g., 5 minutes).
  • For canary deployments, CloudWatch alarms on error rate > 1% trigger rollback immediately.

Manual Root Cause Analysis:
  1. Check ECS service events and task logs via CloudWatch Logs.
  2. Verify the health check endpoint: curl the container's IP from a jump host in the same VPC.
  3. Check security groups: ensure ALB can reach the task on the health check port.
  4. Compare failing vs healthy tasks: task definition, launch type (Fargate vs EC2), resource constraints.
  5. Look for common patterns: new task definition introduced a changed health check path, or the container runs out of memory under load.
  6. If the health check is not the issue, examine the ALB target group health check configuration (interval, timeout, threshold).

Key: The automated recovery buys time. The manual analysis should focus on the difference between healthy and unhealthy tasks.
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What are the top 3 AWS DevOps interview questions for senior roles?
02. How do I answer 'Why did you choose Terraform over CloudFormation?'
03. What is the 'Principle of Least Privilege' in an AWS context?
04. How do you manage secrets across multiple environments on AWS?
05. How does AWS CodeDeploy handle rollbacks for Lambda functions?
06. How do you handle secrets in CI/CD pipelines on AWS?