
AWS DevOps Interview Questions — Answered by a Senior Engineer

In Plain English 🔥
Think of AWS like a giant, perfectly organised city. EC2 instances are the buildings, IAM is the security guard deciding who gets through the door, CloudFormation is the city blueprint, and CodePipeline is the conveyor belt that takes your raw code and delivers a finished product to the right building automatically. A DevOps engineer is the city planner — they design how all those pieces talk to each other, stay healthy, and rebuild themselves when something goes wrong.

DevOps on AWS isn't just a checkbox on a job description — it's the difference between a team that ships features on a Friday afternoon with confidence and one that treats deployments like defusing a bomb. Amazon Web Services powers roughly a third of the internet's infrastructure, and companies aren't hiring AWS DevOps engineers to click buttons in a console. They need people who can architect pipelines, diagnose failures at 2am, and make strong tradeoff decisions under pressure.

The problem most candidates run into is that they've memorised service names without understanding the reasoning behind architectural choices. Interviewers at mid-to-senior level aren't impressed by someone who can recite what S3 stands for — they want to hear you explain why you'd choose an ALB over an NLB for a microservices workload, or why you'd reach for SSM Parameter Store instead of hardcoding an environment variable.

This article covers the AWS DevOps interview questions that actually get asked in technical screens and on-site rounds at companies ranging from Series B startups to FAANG-adjacent engineering orgs. By the end, you'll have battle-ready answers with the depth and nuance that separates a senior-level response from a junior one — and you'll understand the 'why' well enough to adapt your answer to any follow-up curveball.

CI/CD on AWS — CodePipeline, CodeBuild, and the Questions Behind Them

The most common first question in an AWS DevOps screen is some variation of: 'Walk me through your CI/CD pipeline.' Interviewers aren't looking for a list of services — they're listening for your decision-making.

CodePipeline is AWS's native pipeline orchestrator. It doesn't build or deploy anything itself — it coordinates other services. CodeBuild handles the compilation, testing, and packaging (it's a fully managed build server billed per build minute). CodeDeploy handles the actual deployment to EC2, Lambda, or ECS. Understanding that separation of concerns is critical.
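That separation is easiest to see in a pipeline definition itself. Here's a minimal CloudFormation sketch (resource names, the bucket, the build project, and the CodeDeploy application are all placeholders, not a drop-in template):

```yaml
# Sketch: CodePipeline orchestrates; CodeBuild builds; CodeDeploy deploys.
Resources:
  AppPipeline:
    Type: AWS::CodePipeline::Pipeline
    Properties:
      RoleArn: !GetAtt PipelineRole.Arn            # orchestration permissions only
      ArtifactStore: { Type: S3, Location: my-artifact-bucket }
      Stages:
        - Name: Source
          Actions:
            - Name: GitHubSource
              ActionTypeId: { Category: Source, Owner: AWS, Provider: CodeStarSourceConnection, Version: "1" }
              Configuration:
                ConnectionArn: !Ref GitHubConnectionArn   # placeholder parameter
                FullRepositoryId: my-org/my-repo
                BranchName: main
              OutputArtifacts: [ { Name: SourceOutput } ]
        - Name: Build
          Actions:
            - Name: Build
              ActionTypeId: { Category: Build, Owner: AWS, Provider: CodeBuild, Version: "1" }
              Configuration: { ProjectName: my-build-project }   # CodeBuild does the actual work
              InputArtifacts: [ { Name: SourceOutput } ]
              OutputArtifacts: [ { Name: BuildOutput } ]
        - Name: Deploy
          Actions:
            - Name: Deploy
              ActionTypeId: { Category: Deploy, Owner: AWS, Provider: CodeDeploy, Version: "1" }
              Configuration: { ApplicationName: my-app, DeploymentGroupName: prod }
              InputArtifacts: [ { Name: BuildOutput } ]
```

Notice that each stage only names the service that does the work — the pipeline itself just moves artifacts between them.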

Why use CodePipeline over Jenkins? The honest answer is: it depends. CodePipeline has zero infrastructure to manage and integrates natively with IAM, CloudTrail, and EventBridge. Jenkins gives you more plugin flexibility and is easier to migrate if you leave AWS. The right answer in an interview is to name the tradeoff, not pick a winner blindly.

One detail that trips people up: CodeBuild runs in an isolated, ephemeral container. That means any state — installed dependencies, cached layers — is gone after the build unless you explicitly configure a build cache in S3. Forgetting this is why builds that work locally are mysteriously slow or broken in CodeBuild.
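What that cache configuration looks like in a hypothetical buildspec.yml (the paths and commands are examples; the S3 cache must also be enabled on the CodeBuild project itself, not just in the buildspec):

```yaml
# buildspec.yml — the 'cache' section is what survives between ephemeral builds
version: 0.2
phases:
  install:
    commands:
      - npm ci            # runs every build, but hits the cached node_modules below
  build:
    commands:
      - npm test
      - npm run build
cache:
  paths:
    - 'node_modules/**/*'   # saved to the configured S3 cache, restored next build
```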

CI_CD_Pipeline_Questions.md · INTERVIEW
1234567891011121314151617181920212223242526272829303132333435363738394041
Q: How does your team handle failed deployments in CodePipeline?

WEAK ANSWER:
  'We just re-run the pipeline.'

STRONG ANSWER:
  'We use CodeDeploy with a Blue/Green deployment strategy for ECS services.
   If the post-deployment health check fails — we define a 5-minute window
   where the load balancer monitors the new task set — CodeDeploy automatically
   rolls back by shifting traffic back to the original task set.

   For Lambda, we use CodeDeploy with a Canary10Percent5Minutes configuration:
   10% of traffic goes to the new version for 5 minutes. If CloudWatch alarms
   tied to error rate or latency spike, the deployment is rolled back automatically.
   No human has to be awake for that to happen.

   We also enable CloudTrail on the pipeline so every approval action,
   stage transition, and artifact push is auditable.'

---

Q: What is the difference between CodeDeploy in-place and Blue/Green?

IN-PLACE:
  - Stops old app version on existing instances
  - Installs new version on the SAME instances
  - Cheaper (no duplicate infrastructure)
  - Downtime risk if health check fails mid-deployment
  - Good for: dev/staging environments, non-critical workloads

BLUE/GREEN:
  - New version deployed to a SEPARATE set of instances/tasks
  - Load balancer shifts traffic only after health checks pass
  - Zero-downtime by design
  - Costs more during the transition window (double the compute)
  - Good for: production, anything customer-facing

KEY INSIGHT:
  Blue/Green is not just a deployment strategy — it's also a rollback strategy.
  Your 'blue' environment stays live until you're confident in 'green'.
  If something goes wrong, a traffic shift takes seconds, not a redeploy.
⚠️ Interview Gold: When asked about CI/CD, proactively mention your rollback strategy before the interviewer asks. It signals production maturity. Say: 'And if the deployment fails, here's exactly what happens automatically...' — most candidates never get there.

IAM, Security, and the Principle of Least Privilege — Where Candidates Get Caught

IAM is the single most tested AWS topic in DevOps interviews, and also the most misunderstood. Most candidates can explain what a policy is — very few can explain the evaluation logic when multiple policies conflict.

Here's the core rule: AWS evaluates all applicable policies (identity-based, resource-based, permission boundaries, SCPs). An explicit Deny anywhere always wins. An Allow only applies if no explicit Deny exists AND the action is permitted by at least one policy. This seems obvious until you're debugging why a Lambda function can't write to an S3 bucket even though the IAM role has an S3:PutObject allow — and the bucket policy has an explicit Deny for all non-VPC traffic.
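That exact scenario can be sketched as two policy documents (the bucket name and VPC endpoint ID are invented for illustration):

```yaml
# Identity-based policy attached to the Lambda role — a clean Allow...
RolePolicy:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action: s3:PutObject
      Resource: arn:aws:s3:::example-bucket/*

# ...and a bucket policy whose explicit Deny wins anyway, for any request
# that does not arrive through the named VPC endpoint:
BucketPolicy:
  Version: "2012-10-17"
  Statement:
    - Effect: Deny
      Principal: "*"
      Action: s3:PutObject
      Resource: arn:aws:s3:::example-bucket/*
      Condition:
        StringNotEquals:
          aws:SourceVpce: vpce-0abc123
```

The role's Allow is real, but evaluation finds the explicit Deny first — the Lambda fails unless its traffic goes through that endpoint, regardless of how permissive the role is.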

The other area interviewers probe hard: IAM Roles vs. IAM Users for automation. The correct answer in 2024 is always roles for anything machine-to-machine. IAM users have long-lived static credentials — if a key leaks, you have a breach. Roles use short-lived STS tokens that auto-rotate. EC2 instance profiles, ECS task roles, Lambda execution roles — all of these use the role mechanism.
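The mechanics are worth being able to sketch: a role is two documents — a trust policy (who may assume it) and permission policies (what they may do). A minimal CloudFormation example for an EC2 instance role, with placeholder names and bucket ARN:

```yaml
# EC2 assumes the role via its instance profile and receives short-lived
# STS credentials — no static keys anywhere.
InstanceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:            # trust policy: only EC2 may assume
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal: { Service: ec2.amazonaws.com }
          Action: sts:AssumeRole
    Policies:
      - PolicyName: app-permissions      # hypothetical, scoped permission policy
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: [ s3:GetObject ]
              Resource: arn:aws:s3:::example-bucket/*
```

The same shape — trust policy plus scoped permissions — applies to ECS task roles, Lambda execution roles, and cross-account CI/CD roles.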

Permission Boundaries are often a senior-level differentiator. They let you delegate IAM administration safely: you can allow a team to create their own roles, but cap the maximum permissions those roles can ever have. It's the difference between 'trust but verify' and 'trust and you can't accidentally escalate anyway.'
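A sketch of that delegation pattern, assuming a hypothetical read-only boundary: the team's role policy claims s3:*, but the effective permissions are the intersection of the role policy and the boundary — read-only.

```yaml
# The boundary is the ceiling. Nothing the role policy says can exceed it.
ReadOnlyBoundary:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Action: [ s3:GetObject, s3:ListBucket ]
          Resource: "*"

TeamRole:
  Type: AWS::IAM::Role
  Properties:
    PermissionsBoundary: !Ref ReadOnlyBoundary   # caps effective permissions
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal: { AWS: arn:aws:iam::123456789012:root }  # placeholder account
          Action: sts:AssumeRole
    Policies:
      - PolicyName: team-policy          # the team asked for everything...
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - { Effect: Allow, Action: "s3:*", Resource: "*" }
```

Here s3:DeleteObject is silently denied even though the role policy allows s3:* — exactly the behaviour probed in the diagnosis question below.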

IAM_Security_Questions.md · INTERVIEW
Q: An EC2 instance has an IAM role with S3:* on *, but it can't delete
   objects from a specific bucket. What do you check first?

DIAGNOSIS STEPS:

  STEP 1: Check the S3 bucket policy.
    A resource-based policy with an explicit Deny overrides any identity-based Allow.
    Look for conditions like:
      {
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:DeleteObject",
        "Condition": { "StringNotEquals": { "aws:SourceVpce": "vpce-0abc123" } }
      }

  STEP 2: Check for SCPs (Service Control Policies).
    If the account is inside an AWS Organization, an SCP might be blocking
    s3:DeleteObject at the org level — regardless of what the IAM role says.
    SCPs don't grant permissions; they set the ceiling.

  STEP 3: Check for a Permission Boundary on the role.
    Even if the role policy allows s3:*, a permission boundary scoped to
    s3:GetObject and s3:PutObject would silently block s3:DeleteObject.

  STEP 4: Use the IAM Policy Simulator.
    Go to: IAM Console > Policy Simulator
    Select the role, the action (s3:DeleteObject), the resource ARN.
    It will tell you exactly which policy is causing the deny.
    This is your 'show your work' tool in an interview — mentioning it
    demonstrates you actually debug IAM in real life.

---

Q: What is the difference between an IAM Role and an IAM User?

IAM USER:
  - Represents a human (or legacy automation)
  - Has long-lived access keys (Access Key ID + Secret)
  - Keys can be leaked in git, logs, error messages
  - Requires manual rotation
  - Should be avoided for any automated/machine workload

IAM ROLE:
  - Represents a workload identity (Lambda, EC2, ECS task, CI/CD pipeline)
  - Uses short-lived STS tokens (default 1hr, max 12hr for some services)
  - Auto-rotated by AWS — no key management needed
  - Can be assumed cross-account safely
  - The right answer for ANY non-human authentication in AWS

GOLDEN RULE:
  If a human needs AWS access → IAM User + MFA (or SSO via IAM Identity Center)
  If a machine needs AWS access → IAM Role. Full stop.
⚠️ Watch Out: Never say 'I'd just give it admin access temporarily to get it working.' That phrase ends interviews. The correct answer is always 'I'd scope the policy to the minimum required actions and resources, then use IAM Policy Simulator to verify.' Interviewers are testing whether you'd create a security incident in production.

Infrastructure as Code — CloudFormation vs Terraform and Real Architecture Questions

CloudFormation is AWS-native IaC. Terraform is cloud-agnostic HCL-based IaC by HashiCorp. This comparison comes up in almost every AWS DevOps interview, and the trap is giving a tribal 'Terraform is better' answer without nuance.

CloudFormation's strengths: native drift detection, StackSets for multi-account/multi-region deployments, no state file management (AWS manages state), and deep integration with AWS services like Service Catalog and CDK. Its weakness: verbose YAML/JSON, slower development cycle, and error messages that are notoriously unhelpful ('UPDATE_ROLLBACK_COMPLETE' tells you nothing about what actually failed).

Terraform's strengths: multi-cloud portability, cleaner module system, better plan output (you see exactly what will change before it changes), and a massive community registry of modules. Its weakness: you own the state file, which means you need a backend (S3 + DynamoDB for locking), and state file corruption or drift is your problem to solve.

In practice, many mature AWS shops use both: CloudFormation for account-level infrastructure (VPCs, IAM foundations, Service Control Policies) via AWS Control Tower, and Terraform for application-level infrastructure managed by product teams. Knowing this hybrid pattern signals real-world experience.

The CDK angle is increasingly important: AWS CDK lets you define CloudFormation stacks using TypeScript, Python, or Java. It compiles down to CloudFormation, so you get the native integration with the developer experience of a real programming language.

IaC_Architecture_Questions.md · INTERVIEW
Q: A CloudFormation stack update is stuck in UPDATE_ROLLBACK_FAILED.
   How do you recover it?

EXPLANATION:
  This is one of the most frustrating CloudFormation states because the
  stack can't proceed forward OR roll back. You're locked.

RECOVERY STEPS:

  OPTION 1: Continue Update Rollback (skip failing resources):
    aws cloudformation continue-update-rollback \
      --stack-name my-application-stack \
      --resources-to-skip LogicalResourceIdThatIsBlocking

    This tells CloudFormation: 'Give up on rolling that resource back,
    mark it as rolled back anyway, and continue with the rest.'
    Use this when the resource was already manually deleted or
    the rollback action is genuinely impossible.

  OPTION 2: Delete the stack (nuclear option):
    Only valid for non-production stacks or if the stack has no
    retained resources you care about.
    Add DeletionPolicy: Retain on critical resources BEFORE this happens.

  OPTION 3: Manual resource reconciliation:
    Identify what the rollback was trying to undo (CloudTrail will show you).
    Manually restore the resource to its pre-update state.
    Then re-trigger the rollback.

PREVENTION (the answer interviewers actually want):
  - Use change sets before every update: 'aws cloudformation create-change-set'
    This shows you exactly what will be modified/replaced/deleted.
  - Enable termination protection on production stacks.
  - Tag stacks with Environment=production so they're identifiable in billing
    and when scripting bulk operations.
  - Set up CloudFormation stack notifications via SNS so failures
    alert your on-call channel immediately.

---

Q: How do you manage secrets in an IaC pipeline? 
   (Do NOT hardcode them — what's the right pattern?)

WRONG:
  DB_PASSWORD: 'mypassword123'  # in your CloudFormation template or tfvars

RIGHT PATTERN 1: SSM Parameter Store (for non-sensitive config + secrets):
  # Store at deploy time:
  aws ssm put-parameter \
    --name '/myapp/production/db-password' \
    --value 'actual-secret-value' \
    --type SecureString \
    --key-id alias/myapp-kms-key

  # Reference in CloudFormation via a dynamic reference (resolved at deploy
  # time — SecureString values can't be read through a template parameter
  # of type 'AWS::SSM::Parameter::Value<String>'):
  MasterUserPassword: '{{resolve:ssm-secure:/myapp/production/db-password}}'
  # Note: ssm-secure references are only supported on specific resource
  # properties, e.g. RDS MasterUserPassword.

RIGHT PATTERN 2: Secrets Manager (for auto-rotation):
  - Stores secrets with automatic rotation via Lambda
  - Applications call GetSecretValue at runtime — never baked into the template
  - Costs ~$0.40/secret/month vs SSM SecureString which is free
  - Use Secrets Manager when you need rotation; SSM when you don't

KEY POINT FOR THE INTERVIEW:
  'The secret never touches the pipeline artifact. The pipeline has permission
   to READ from Secrets Manager, but the secret value itself is never in git,
   never in a build log, and never in a CloudFormation template parameter
   visible in the console history.'
🔥 Pro Tip: When answering IaC questions, always end with 'and here's how I'd handle drift.' CloudFormation has native drift detection; Terraform has `terraform plan` against live infrastructure. Showing you think about what happens after initial deployment separates senior candidates from mid-level ones.

ECS, EKS, and Container Orchestration — The Questions That Reveal Depth

Container questions are where AWS DevOps interviews get genuinely technical. The ECS vs EKS question is almost a given, and the wrong move is to immediately say 'Kubernetes is always better.'

ECS (Elastic Container Service) is AWS-native container orchestration. It uses Task Definitions (the blueprint for a container workload) and Services (which maintain the desired count of tasks and wire them to load balancers). ECS with Fargate means zero EC2 management — AWS provisions the underlying compute per task. ECS with EC2 launch type means you manage the cluster nodes, but you get more control over instance types and pricing (Reserved Instances, Savings Plans).

EKS (Elastic Kubernetes Service) gives you a managed Kubernetes control plane. Use it when your team already has Kubernetes expertise, you need to run the same workloads on-prem and in AWS, or your application requires Kubernetes-native features like custom operators or CRDs. The operational overhead is meaningfully higher than ECS.

The real interview depth comes from task role vs execution role in ECS — a distinction that trips up 80% of candidates. The execution role is what ECS uses to pull the container image from ECR and write logs to CloudWatch. The task role is what your application code uses to call other AWS services (DynamoDB, S3, etc.). Mixing these up is a classic misconfiguration that causes silent permission failures.
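The distinction is clearest in the task definition itself, where both roles sit side by side. A hedged CloudFormation sketch — the family, image, and log group names are invented:

```yaml
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: payments-api
    RequiresCompatibilities: [ FARGATE ]
    Cpu: "256"
    Memory: "512"
    NetworkMode: awsvpc
    ExecutionRoleArn: !GetAtt ExecutionRole.Arn  # ECS uses this: pull image, write logs
    TaskRoleArn: !GetAtt TaskRole.Arn            # your code uses this: DynamoDB, S3, ...
    ContainerDefinitions:
      - Name: api
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/payments-api:1.4.2
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/payments-api
            awslogs-region: us-east-1
            awslogs-stream-prefix: api
```

Put ECR and CloudWatch Logs permissions on ExecutionRole, and the application's AWS API permissions on TaskRole — never the other way around.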

ECS_EKS_Container_Questions.md · INTERVIEW
Q: Your ECS task keeps failing to start. The service shows
   'CannotPullContainerError'. What do you check?

DIAGNOSIS FLOW:

  CHECK 1: Execution Role permissions:
    The ECS Task Execution Role needs:
      ecr:GetAuthorizationToken
      ecr:BatchCheckLayerAvailability
      ecr:GetDownloadUrlForLayer
      ecr:BatchGetImage
    Missing ANY of these → CannotPullContainerError
    Managed policy: arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

  CHECK 2: ECR repository policy:
    If pulling cross-account, the ECR repo needs a resource-based policy
    allowing the other account's execution role.

  CHECK 3: Network configuration:
    If using Fargate in a private subnet:
    - VPC needs a NAT Gateway to reach public ECR endpoints, OR
    - VPC Interface Endpoints for ECR (com.amazonaws.region.ecr.api
      and com.amazonaws.region.ecr.dkr) + S3 Gateway Endpoint
      (ECR layers are stored in S3)
    A Fargate task in a private subnet with NO NAT and NO endpoints
    will ALWAYS fail to pull. This is the #1 ECS networking mistake.

  CHECK 4: Image tag or digest:
    If the image tag doesn't exist in the repo, you get a pull error.
    Use immutable tags in ECR (enforce in repo settings) so 'latest'
    can't be overwritten unexpectedly.

---

Q: Explain the difference between an ECS Task Role and Execution Role.

  EXECUTION ROLE (used BY ECS, not your app):
    - Pulled by the ECS agent before your container starts
    - Permissions: pull image from ECR, write logs to CloudWatch,
      read secrets from Secrets Manager or SSM at container startup
    - Your application code CANNOT use this role directly

  TASK ROLE (used BY your application code at runtime):
    - Exposed to the container via the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI
      environment variable, which points at a local credentials endpoint
    - Your app's AWS SDK reads from this endpoint automatically
    - Permissions: whatever your app needs (DynamoDB:GetItem, S3:PutObject, etc.)
    - Principle of least privilege applies here — scope tightly

  REAL EXAMPLE:
    An ECS task running a Python API that reads from DynamoDB and
    writes logs needs:

    Execution Role permissions:
      - ecr:BatchGetImage (pull the Docker image)
      - logs:CreateLogStream, logs:PutLogEvents (write to CloudWatch)

    Task Role permissions:
      - dynamodb:GetItem, dynamodb:Query (on the specific table ARN only)

    If you conflate these and put DynamoDB permissions on the
    execution role, it still won't work — your app reads credentials
    from the task role endpoint, not the execution role.
⚠️ Interview Gold: If asked 'ECS or EKS?', give this answer: 'ECS for AWS-native teams who want low operational overhead and Fargate's serverless model. EKS when the team has Kubernetes expertise, needs multi-cloud portability, or requires CRDs and custom operators. ECS is faster to get right; EKS is harder to get right but more portable.' That's a senior-level answer.
| Aspect | AWS CodePipeline / Native CI/CD | Terraform + External Pipeline |
| --- | --- | --- |
| State Management | Managed by AWS — no state file | S3 backend + DynamoDB lock — you own it |
| Multi-Account Deployments | CloudFormation StackSets built-in | Requires workspace strategy + CI config |
| Drift Detection | Native CloudFormation drift detection | terraform plan against live infra |
| Rollback Mechanism | Automatic via CodeDeploy strategies | terraform apply previous state version |
| Secret Handling | Native SSM/Secrets Manager integration | AWS provider + data sources to fetch secrets |
| Cost | Pay per active pipeline ($1/month/pipeline) | Terraform OSS is free; Terraform Cloud adds cost |
| Learning Curve | Lower for AWS-only teams | Higher, but skills transfer to other clouds |
| Best For | AWS-native orgs, compliance-heavy environments | Multi-cloud orgs, teams with existing TF expertise |

🎯 Key Takeaways

  • IAM policy evaluation always processes explicit Deny first — an explicit Deny in a bucket policy or SCP beats any Allow in an identity-based policy, regardless of how permissive the role looks in isolation
  • ECS Task Role ≠ ECS Execution Role — the execution role is for ECS pulling your image and writing logs; the task role is for your application code calling AWS APIs at runtime — mixing them up causes silent permission failures
  • CloudFormation UPDATE_ROLLBACK_FAILED isn't a disaster — use 'continue-update-rollback --resources-to-skip' to unstick it, but prevent it by always previewing changes with Change Sets before applying updates to production stacks
  • Fargate tasks in private subnets need BOTH ECR Interface Endpoints AND an S3 Gateway Endpoint — ECR stores image layers in S3, so missing the S3 endpoint causes CannotPullContainerError even when all ECR permissions are correct

⚠ Common Mistakes to Avoid

  • Mistake 1: Passing secrets as plaintext 'String' CloudFormation parameters — The secret value appears in plaintext in the CloudFormation console under 'Parameters' and in CloudTrail events — Fix: Use dynamic references ('{{resolve:ssm-secure:...}}' or '{{resolve:secretsmanager:...}}') or Secrets Manager at runtime, so the value is never exposed in AWS console history or API responses
  • Mistake 2: Forgetting the S3 VPC Gateway Endpoint when running Fargate tasks in private subnets — ECR stores image layers in S3, so even with ECR Interface Endpoints configured, Fargate tasks will still fail to pull images because S3 traffic has no route out of the VPC — Fix: Add a com.amazonaws.region.s3 Gateway Endpoint to the route tables of your private subnets alongside your ECR interface endpoints
  • Mistake 3: Treating CodeBuild as a persistent build server and relying on local filesystem state between builds — A CodeBuild project starts fresh every time; any npm install, pip install, or compiled artifact from a previous build is gone — Fix: Configure a build cache in S3 (specify cache paths in buildspec.yml under 'cache: paths') or use a custom Docker image with pre-installed dependencies as your CodeBuild environment image to dramatically cut build times

Interview Questions on This Topic

  • Q: Your ECS service is running 10 tasks in production. A new deployment causes 30% of tasks to fail their health checks. Walk me through exactly what happens next and what you would do — both the automated response and your manual investigation steps.
  • Q: We have three AWS accounts: dev, staging, and production. How would you design a CI/CD pipeline that deploys to all three, with manual approval before production, and ensures that each environment uses environment-specific secrets without any credential sharing between accounts?
  • Q: A CloudFormation stack update just completed successfully, but the application is behaving differently than expected. CloudFormation shows no errors. How would you determine whether infrastructure drift is the cause, and what's the difference between drift detection and a failed update?

Frequently Asked Questions

What AWS services are most commonly asked about in a DevOps interview?

IAM, CodePipeline, CodeBuild, CodeDeploy, CloudFormation, ECS (especially Fargate), and CloudWatch are the core services. At senior level, expect questions on AWS Organizations, Service Control Policies, VPC networking (NAT, endpoints, security groups vs NACLs), and secrets management via SSM Parameter Store or Secrets Manager.

How is a blue/green deployment different from a canary deployment on AWS?

Blue/green shifts 100% of traffic to the new version after health checks pass, keeping the old environment on standby for instant rollback. Canary gradually shifts a percentage of traffic (e.g., 10%, then 50%, then 100%) and monitors metrics at each step before proceeding. AWS CodeDeploy supports both strategies for ECS and Lambda deployments, with automatic rollback tied to CloudWatch alarms.
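For Lambda, the canary pattern is only a few lines of AWS SAM. A sketch assuming a hypothetical function and CloudWatch alarm:

```yaml
# SAM sketch: canary deployment with automatic rollback on alarm.
ApiFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: app.handler
    Runtime: python3.12
    AutoPublishAlias: live             # CodeDeploy shifts traffic between versions
    DeploymentPreference:
      Type: Canary10Percent5Minutes    # 10% of traffic for 5 minutes, then 100%
      Alarms:
        - !Ref ErrorRateAlarm          # ALARM state triggers automatic rollback
```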

Should I learn CloudFormation or Terraform for AWS DevOps roles?

Learn both at a working level, but know CloudFormation deeply if you're targeting AWS-native companies. CloudFormation is inescapable in AWS environments — CDK compiles to it, Control Tower uses it, and Service Catalog stacks are built on it. Terraform knowledge is a strong differentiator for roles at multi-cloud or cloud-agnostic organisations, and its plan/apply model is genuinely better for reviewing changes before they happen.

🔥 TheCodeForge Editorial Team — Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
