
AWS DevOps Interview Questions — Answered by a Senior Engineer

📍 Part of: DevOps Interview → Topic 4 of 5
⚙️ Intermediate — basic Interview knowledge assumed
In this tutorial, you'll learn
AWS DevOps interview questions with real answers, not textbook fluff.
  • IAM policy evaluation always processes explicit Deny first — an explicit Deny in a bucket policy or SCP beats any Allow in an identity-based policy, regardless of how permissive the role looks in isolation
  • ECS Task Role ≠ ECS Execution Role — the execution role is for ECS pulling your image and writing logs; the task role is for your application code calling AWS APIs at runtime — mixing them up causes silent permission failures
  • CloudFormation UPDATE_ROLLBACK_FAILED isn't a disaster — use 'continue-update-rollback --resources-to-skip' to unstick it, but prevent it by always previewing changes with Change Sets before applying updates to production stacks
Quick Answer

Think of AWS like a giant, perfectly organised city. EC2 instances are the buildings, IAM is the security guard deciding who gets through the door, CloudFormation is the city blueprint, and CodePipeline is the conveyor belt that takes your raw code and delivers a finished product to the right building automatically. A DevOps engineer is the city planner — they design how all those pieces talk to each other, stay healthy, and rebuild themselves when something goes wrong.

DevOps on AWS isn't just a checkbox on a job description — it's the difference between a team that ships features on a Friday afternoon with confidence and one that treats deployments like defusing a bomb. AWS holds roughly a third of the cloud infrastructure market, and companies aren't hiring AWS DevOps engineers to click buttons in a console. They need people who can architect pipelines, diagnose failures at 2am, and make sound tradeoff decisions under pressure.

The problem most candidates run into is that they've memorised service names without understanding the reasoning behind architectural choices. Interviewers at mid-to-senior level aren't impressed by someone who can recite what S3 stands for — they want to hear you explain why you'd choose an ALB over an NLB for a microservices workload, or why you'd reach for SSM Parameter Store instead of hardcoding an environment variable.

This article covers the AWS DevOps interview questions that actually get asked in technical screens and on-site rounds at companies ranging from Series B startups to FAANG-adjacent engineering orgs. By the end, you'll have battle-ready answers with the depth and nuance that separates a senior-level response from a junior one — and you'll understand the 'why' well enough to adapt your answer to any follow-up curveball.

CI/CD on AWS — CodePipeline, CodeBuild, and the Questions Behind Them

The most common first question in an AWS DevOps screen is some variation of: 'Walk me through your CI/CD pipeline.' Interviewers aren't looking for a list of services — they're listening for your decision-making.

CodePipeline is AWS's native pipeline orchestrator. It doesn't build or deploy anything itself — it coordinates other services. CodeBuild handles the compilation, testing, and packaging (it's a fully managed build server billed per build minute). CodeDeploy handles the actual deployment to EC2, Lambda, or ECS. Understanding that separation of concerns is critical.

Why use CodePipeline over Jenkins? The honest answer is: it depends. CodePipeline has zero infrastructure to manage and integrates natively with IAM, CloudTrail, and EventBridge. Jenkins gives you more plugin flexibility and is easier to migrate if you leave AWS. The right answer in an interview is to name the tradeoff, not pick a winner blindly.

One detail that trips people up: CodeBuild runs in an isolated, ephemeral container. That means any state — installed dependencies, cached layers — is gone after the build unless you explicitly configure a build cache in S3. Forgetting this is why builds that work locally are mysteriously slow or broken in CodeBuild.

CI_CD_Pipeline_Questions.md · MARKDOWN
Q: How does your team handle failed deployments in CodePipeline?

WEAK ANSWER:
  'We just re-run the pipeline.'

STRONG ANSWER:
  'We use CodeDeploy with a Blue/Green deployment strategy for ECS services.
   If the post-deployment health check fails — we define a 5-minute window
   where the load balancer monitors the new task set — CodeDeploy automatically
   rolls back by shifting traffic back to the original task set.

   For Lambda, we use CodeDeploy with a Canary10Percent5Minutes configuration:
   10% of traffic goes to the new version for 5 minutes. If CloudWatch alarms
   tied to error rate or latency spike, the deployment is rolled back automatically.
   No human has to be awake for that to happen.

   We also enable CloudTrail on the pipeline so every approval action,
   stage transition, and artifact push is auditable.'

---

Q: What is the difference between CodeDeploy in-place and Blue/Green?

IN-PLACE:
  - Stops old app version on existing instances
  - Installs new version on the SAME instances
  - Cheaper (no duplicate infrastructure)
  - Downtime risk if health check fails mid-deployment
  - Good for: dev/staging environments, non-critical workloads

BLUE/GREEN:
  - New version deployed to a SEPARATE set of instances/tasks
  - Load balancer shifts traffic only after health checks pass
  - Zero-downtime by design
  - Costs more during the transition window (double the compute)
  - Good for: production, anything customer-facing

KEY INSIGHT:
  Blue/Green is not just a deployment strategy — it's also a rollback strategy.
  Your 'blue' environment stays live until you're confident in 'green'.
  If something goes wrong, a traffic shift takes seconds, not a redeploy.
▶ Output
N/A — interview Q&A format. These are model answers, not runnable code.
💡Interview Gold:
When asked about CI/CD, proactively mention your rollback strategy before the interviewer asks. It signals production maturity. Say: 'And if the deployment fails, here's exactly what happens automatically...' — most candidates never get there.
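The automatic rollback in the strong answer above is worth being able to reason about out loud. Here is a toy Python model of a Canary10Percent5Minutes-style decision. The `alarm_fired` callable stands in for CloudWatch alarm state, and the function illustrates the control flow only; it is not the CodeDeploy API.

```python
def run_canary(alarm_fired, canary_weight=0.10):
    """Model of CodeDeploy's canary decision: shift a small slice of
    traffic, bake, then promote or roll back based on alarm state.

    alarm_fired: callable returning True if any CloudWatch alarm tied
    to the deployment entered ALARM state during the bake window.
    """
    # Step 1: shift 10% of traffic to the new version.
    traffic = {"old": 1.0 - canary_weight, "new": canary_weight}

    # Step 2: bake window (5 minutes in Canary10Percent5Minutes).
    if alarm_fired():
        # Rollback: point all traffic back at the old version.
        traffic = {"old": 1.0, "new": 0.0}
        return "ROLLED_BACK", traffic

    # Step 3: alarms stayed healthy, so promote fully.
    traffic = {"old": 0.0, "new": 1.0}
    return "SUCCEEDED", traffic

# Healthy bake: deployment promotes.
status, traffic = run_canary(alarm_fired=lambda: False)
print(status, traffic)   # SUCCEEDED {'old': 0.0, 'new': 1.0}

# Error-rate alarm fires: automatic rollback, no human awake.
status, traffic = run_canary(alarm_fired=lambda: True)
print(status, traffic)   # ROLLED_BACK {'old': 1.0, 'new': 0.0}
```

Being able to narrate this loop (shift, bake, check alarms, promote or revert) is exactly the "here's what happens automatically" answer interviewers want.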

IAM, Security, and the Principle of Least Privilege — Where Candidates Get Caught

IAM is the single most tested AWS topic in DevOps interviews, and also the most misunderstood. Most candidates can explain what a policy is — very few can explain the evaluation logic when multiple policies conflict.

Here's the core rule: AWS evaluates all applicable policies (identity-based, resource-based, permission boundaries, SCPs). An explicit Deny anywhere always wins. An Allow only applies if no explicit Deny exists AND the action is permitted by at least one policy. This seems obvious until you're debugging why a Lambda function can't write to an S3 bucket even though the IAM role has an s3:PutObject Allow — and the bucket policy has an explicit Deny for all non-VPC traffic.
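That evaluation order is easy to state and easy to fumble under pressure, so it helps to internalize it as code. This is a deliberately simplified Python sketch: real IAM also evaluates boundaries and SCPs, and its pattern matching is far richer than the exact-or-`*` matching used here.

```python
def evaluate(policies, action, resource):
    """Simplified IAM evaluation: explicit Deny anywhere wins;
    otherwise at least one Allow is required; the default is an
    implicit deny. Each policy is a list of statement dicts."""
    statements = [s for p in policies for s in p]
    matches = [
        s for s in statements
        if s["action"] in (action, "*") and s["resource"] in (resource, "*")
    ]
    if any(s["effect"] == "Deny" for s in matches):
        return "Deny"          # explicit Deny always wins
    if any(s["effect"] == "Allow" for s in matches):
        return "Allow"
    return "Deny"              # implicit deny: nothing granted access

role_policy = [{"effect": "Allow", "action": "s3:PutObject", "resource": "*"}]
bucket_policy = [{"effect": "Deny", "action": "s3:PutObject",
                  "resource": "arn:aws:s3:::prod-data/*"}]

# The role's broad Allow loses to the bucket policy's explicit Deny.
print(evaluate([role_policy, bucket_policy], "s3:PutObject",
               "arn:aws:s3:::prod-data/*"))   # Deny

# With no Deny in play, the Allow applies.
print(evaluate([role_policy], "s3:PutObject",
               "arn:aws:s3:::prod-data/*"))   # Allow
```

If you can walk an interviewer through why the first call returns Deny despite a permissive role, you've demonstrated the exact debugging scenario the paragraph above describes.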

The other area interviewers probe hard: IAM Roles vs. IAM Users for automation. The correct answer in 2024 is always roles for anything machine-to-machine. IAM users have long-lived static credentials — if a key leaks, you have a breach. Roles use short-lived STS tokens that auto-rotate. EC2 instance profiles, ECS task roles, Lambda execution roles — all of these use the role mechanism.

Permission Boundaries are often a senior-level differentiator. They let you delegate IAM administration safely: you can allow a team to create their own roles, but cap the maximum permissions those roles can ever have. It's the difference between 'trust but verify' and 'trust and you can't accidentally escalate anyway.'
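Mechanically, a role's effective permissions under a boundary are the intersection of what the identity policy grants and what the boundary permits. A minimal sketch with illustrative action sets (real evaluation is per-statement with conditions, not a flat set intersection):

```python
# Effective permissions = identity policy ∩ permission boundary.
identity_policy = {"s3:GetObject", "s3:PutObject",
                   "iam:CreateRole", "iam:PutRolePolicy"}
boundary = {"s3:GetObject", "s3:PutObject", "dynamodb:GetItem"}

effective = identity_policy & boundary
print(sorted(effective))   # ['s3:GetObject', 's3:PutObject']

# Even though the identity policy grants iam:CreateRole, the boundary
# caps it out: the team cannot escalate past what the boundary permits.
assert "iam:CreateRole" not in effective
```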

io/thecodeforge/iam/SecureBucketPolicy.json · JSON
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EnforceVPCAccessOnly",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::thecodeforge-prod-data",
        "arn:aws:s3:::thecodeforge-prod-data/*"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:SourceVpc": "vpc-0123456789abcdef0"
        }
      }
    }
  ]
}
▶ Output
Demonstrates a 'Deny by default' security posture for non-VPC traffic, a common senior interview discussion point.
⚠ Watch Out:
Never say 'I'd just give it admin access temporarily to get it working.' That phrase ends interviews. The correct answer is always 'I'd scope the policy to the minimum required actions and resources, then use IAM Policy Simulator to verify.' Interviewers are testing whether you'd create a security incident in production.

Infrastructure as Code — CloudFormation vs Terraform and Real Architecture Questions

CloudFormation is AWS-native IaC. Terraform is cloud-agnostic HCL-based IaC by HashiCorp. This comparison comes up in almost every AWS DevOps interview, and the trap is giving a tribal 'Terraform is better' answer without nuance.

CloudFormation's strengths: native drift detection, StackSets for multi-account/multi-region deployments, no state file to manage (AWS manages state), and deep integration with Service Catalog and tooling like the CDK, which synthesizes CloudFormation under the hood. Its weaknesses: verbose YAML/JSON, a slower development cycle, and notoriously unhelpful error messages ('UPDATE_ROLLBACK_COMPLETE' tells you nothing about what actually failed).

Terraform's strengths: multi-cloud portability, cleaner module system, better plan output (you see exactly what will change before it changes), and a massive community registry of modules. Its weakness: you own the state file, which means you need a backend (S3 + DynamoDB for locking), and state file corruption or drift is your problem to solve.
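The DynamoDB lock Terraform leans on is conceptually just a conditional write: create the lock item only if it does not already exist, so only one `apply` can hold the lock for a given state file. A toy in-memory Python model of those semantics (the dict stands in for the DynamoDB table; the real backend uses a conditional PutItem):

```python
class StateLockTable:
    """Toy model of Terraform's DynamoDB state locking: a conditional
    put that fails if the lock item already exists."""
    def __init__(self):
        self.items = {}   # stands in for the DynamoDB table

    def acquire(self, state_key, holder):
        # Mirrors a PutItem guarded by an "item must not exist" condition.
        if state_key in self.items:
            return False, self.items[state_key]   # lock held elsewhere
        self.items[state_key] = holder
        return True, holder

    def release(self, state_key, holder):
        # Only the current holder may release the lock.
        if self.items.get(state_key) == holder:
            del self.items[state_key]
            return True
        return False

table = StateLockTable()
ok, _ = table.acquire("global/s3/terraform.tfstate", "ci-runner-1")
print(ok)            # True: first apply takes the lock

ok, holder = table.acquire("global/s3/terraform.tfstate", "laptop-naren")
print(ok, holder)    # False ci-runner-1: second apply is blocked

table.release("global/s3/terraform.tfstate", "ci-runner-1")
ok, _ = table.acquire("global/s3/terraform.tfstate", "laptop-naren")
print(ok)            # True: lock is free again
```

Explaining this (two concurrent applies, one conditional write wins) is exactly the "state safety" discussion the backend block below sets up.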

In practice, many mature AWS shops use both: CloudFormation for account-level infrastructure (VPCs, IAM foundations, Service Control Policies) via AWS Control Tower, and Terraform for application-level infrastructure managed by product teams. Knowing this hybrid pattern signals real-world experience.

io/thecodeforge/terraform/main.tf · HCL
/* 
 * Production-grade Terraform backend configuration.
 * Explaining this locking mechanism shows you understand state safety.
 */
terraform {
  backend "s3" {
    bucket         = "thecodeforge-tf-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "thecodeforge-tf-locks"
    encrypt        = true
  }
}

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  name   = "thecodeforge-main-vpc"
  cidr   = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
▶ Output
Provisioning a highly available VPC with NAT gateways and state locking.
🔥Pro Tip:
When answering IaC questions, always end with 'and here's how I'd handle drift.' CloudFormation has native drift detection; Terraform has terraform plan against live infrastructure. Showing you think about what happens after initial deployment separates senior candidates from mid-level ones.

ECS, EKS, and Container Orchestration — The Questions That Reveal Depth

Container questions are where AWS DevOps interviews get genuinely technical. The ECS vs EKS question is almost a given, and the wrong move is to immediately say 'Kubernetes is always better.'

ECS (Elastic Container Service) is AWS-native container orchestration. It uses Task Definitions (the blueprint for a container workload) and Services (which maintain the desired count of tasks and wire them to load balancers). ECS with Fargate means zero EC2 management — AWS provisions the underlying compute per task. ECS with EC2 launch type means you manage the cluster nodes, but you get more control over instance types and pricing (Reserved Instances, Savings Plans).

EKS (Elastic Kubernetes Service) gives you a managed Kubernetes control plane. Use it when your team already has Kubernetes expertise, you need to run the same workloads on-prem and in AWS, or your application requires Kubernetes-native features like custom operators or CRDs. The operational overhead is meaningfully higher than ECS.

The real interview depth comes from task role vs execution role in ECS — a distinction that trips up 80% of candidates. The execution role is what ECS uses to pull the container image from ECR and write logs to CloudWatch. The task role is what your application code uses to call other AWS services (DynamoDB, S3, etc.). Mixing these up is a classic misconfiguration that causes silent permission failures.

io/thecodeforge/ecs/TaskDefinition.yaml · YAML
# CloudFormation snippet for an ECS Task Definition.
# Senior highlight: explicitly separating the two roles.
ForgeTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: thecodeforge-api-task
    Cpu: 256
    Memory: 512
    NetworkMode: awsvpc
    RequiresCompatibilities:
      - FARGATE
    # Used by the ECS Agent (ECR Pulls, CloudWatch Logs)
    ExecutionRoleArn: !Ref ForgeExecutionRole
    # Used by the Application Code (S3, DynamoDB calls)
    TaskRoleArn: !Ref ForgeTaskRole
    ContainerDefinitions:
      - Name: api-container
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/forge-api:latest
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/thecodeforge-api
            awslogs-region: us-east-1
            awslogs-stream-prefix: ecs
▶ Output
Defines a Fargate task with clear role separation, ready for high-scale production.
💡Interview Gold:
If asked 'ECS or EKS?', give this answer: 'ECS for AWS-native teams who want low operational overhead and Fargate's serverless model. EKS when the team has Kubernetes expertise, needs multi-cloud portability, or requires CRDs and custom operators. ECS is faster to get right; EKS is harder to get right but more portable.' That's a senior-level answer.
| Aspect | AWS CodePipeline / Native CI/CD | Terraform + External Pipeline |
| --- | --- | --- |
| State Management | Managed by AWS — no state file | S3 backend + DynamoDB lock — you own it |
| Multi-Account Deployments | CloudFormation StackSets built-in | Requires workspace strategy + CI config |
| Drift Detection | Native CloudFormation drift detection | terraform plan against live infra |
| Rollback Mechanism | Automatic via CodeDeploy strategies | terraform apply previous state version |
| Secret Handling | Native SSM/Secrets Manager integration | AWS provider + data sources to fetch secrets |
| Cost | Pay per active pipeline ($1/month/pipeline) | Terraform OSS is free; Terraform Cloud adds cost |
| Learning Curve | Lower for AWS-only teams | Higher, but skills transfer to other clouds |
| Best For | AWS-native orgs, compliance-heavy environments | Multi-cloud orgs, teams with existing TF expertise |

🎯 Key Takeaways

  • IAM policy evaluation always processes explicit Deny first — an explicit Deny in a bucket policy or SCP beats any Allow in an identity-based policy, regardless of how permissive the role looks in isolation
  • ECS Task Role ≠ ECS Execution Role — the execution role is for ECS pulling your image and writing logs; the task role is for your application code calling AWS APIs at runtime — mixing them up causes silent permission failures
  • CloudFormation UPDATE_ROLLBACK_FAILED isn't a disaster — use 'continue-update-rollback --resources-to-skip' to unstick it, but prevent it by always previewing changes with Change Sets before applying updates to production stacks
  • Fargate tasks in private subnets need BOTH ECR Interface Endpoints AND an S3 Gateway Endpoint — ECR stores image layers in S3, so missing the S3 endpoint causes CannotPullContainerError even when all ECR permissions are correct

⚠ Common Mistakes to Avoid

  • Putting secrets in CloudFormation parameters as 'String' type instead of 'AWS::SSM::Parameter::Value' — the secret value appears in plaintext in the CloudFormation console under 'Parameters' and in CloudTrail events. Fix: always use SecureString SSM parameters or Secrets Manager references so the value is never exposed in console history or API responses.

  • Forgetting the S3 VPC Gateway Endpoint when running Fargate tasks in private subnets — ECR stores image layers in S3, so even with ECR Interface Endpoints configured, Fargate tasks will still fail to pull images because S3 traffic has no route out of the VPC. Fix: add a com.amazonaws.<region>.s3 Gateway Endpoint to the route tables of your private subnets alongside your ECR interface endpoints.

  • Treating CodeBuild as a persistent build server and relying on local filesystem state between builds — a CodeBuild project starts fresh every time; any npm install, pip install, or compiled artifact from a previous build is gone. Fix: configure a build cache in S3 (specify cache paths in buildspec.yml under 'cache: paths') or use a custom Docker image with pre-installed dependencies as your CodeBuild environment image to dramatically cut build times.

Interview Questions on This Topic

  • Q: A production ECS service with 10 tasks is failing health checks for 30% of its nodes after a deployment. Walk me through the automated recovery process and your manual root cause analysis steps. (LeetCode Style Problem Solving)
  • Q: You need to design a secure cross-account CI/CD pipeline using AWS Organizations. How do you handle IAM Role Assumption and KMS Key Decryption for artifacts moving from a Dev account to a Prod account? (Architectural Depth)
  • Q: Explain the lifecycle of a request hitting an Application Load Balancer, routed to a Fargate task in a private subnet. What are the security group requirements at each hop? (Networking Fundamentals)
  • Q: How would you implement a blue/green deployment for a stateful application that requires database schema migrations alongside code changes? (Real-world Tradeoffs)

Frequently Asked Questions

What are the top 3 AWS DevOps interview questions for senior roles?

  1. Explain IAM evaluation logic, including SCPs and Permission Boundaries.
  2. Compare the technical implementation of Blue/Green vs Canary deployments in CodeDeploy.
  3. Architect a multi-region disaster recovery plan for a containerized workload, including RDS replication and Route 53 failover.
How do I answer 'Why did you choose Terraform over CloudFormation?'

Focus on state management control, the ability to perform dry runs via terraform plan, and the superior module ecosystem. Avoid saying 'CloudFormation is bad'; instead, highlight that Terraform provides better visibility into infrastructure changes before they are committed to the provider.

What is the 'Principle of Least Privilege' in an AWS context?

It is the practice of granting only the minimum permissions necessary to perform a task. In AWS, this means scoping IAM policies to specific Actions, specific Resource ARNs, and using Conditions (like SourceIp or SourceVpc) to restrict access further. Never use '*' for resources in production policies.

How do you manage secrets across multiple environments on AWS?

Use AWS Secrets Manager for secrets that require rotation (like RDS passwords) and SSM Parameter Store SecureString for static secrets (like API keys). Never hardcode these in code or IaC templates; fetch them at runtime or resolve them via the orchestration layer (e.g., ECS environment variable injection).

How does AWS CodeDeploy handle rollbacks for Lambda functions?

CodeDeploy uses 'Linear' or 'Canary' deployment configs. It creates a new Lambda version and shifts traffic. If a pre-configured CloudWatch Alarm (e.g., 5xx errors > 1%) triggers during the 'Baking' period, CodeDeploy immediately points the alias back to the old version.

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged