AWS DevOps Interview Questions — Production Debugging Focus
CannotPullContainerError with correct IAM? Missing S3 Gateway Endpoint.
- AWS DevOps interview questions test architectural tradeoffs, not service definitions.
- Key areas: CI/CD pipeline design, IAM policy evaluation order, IaC (CloudFormation vs Terraform), and ECS/EKS container orchestration.
- Performance insight: A well-architected pipeline can cut deployment time from 30 minutes to under 5.
- Production insight: A large share of AWS security incidents stem from IAM misconfigurations (excessive permissions, missing resource policies).
- Biggest mistake: Memorizing service names without understanding failure modes and recovery strategies.
Think of AWS like a giant, perfectly organised city. EC2 instances are the buildings, IAM is the security guard deciding who gets through the door, CloudFormation is the city blueprint, and CodePipeline is the conveyor belt that takes your raw code and delivers a finished product to the right building automatically. A DevOps engineer is the city planner — they design how all those pieces talk to each other, stay healthy, and rebuild themselves when something goes wrong.
DevOps on AWS isn't just a checkbox on a job description; it's the difference between a team that ships features on a Friday afternoon with confidence and one that treats deployments like defusing a bomb. Amazon Web Services holds roughly a third of the cloud infrastructure market, and companies aren't hiring AWS DevOps engineers to click buttons in a console. They need people who can architect pipelines, diagnose failures at 2am, and make strong tradeoff decisions under pressure.
The problem most candidates run into is that they've memorised service names without understanding the reasoning behind architectural choices. Interviewers at mid-to-senior level aren't impressed by someone who can recite what S3 stands for — they want to hear you explain why you'd choose an ALB over an NLB for a microservices workload, or why you'd reach for SSM Parameter Store instead of hardcoding an environment variable.
This article covers the AWS DevOps interview questions that actually get asked in technical screens and on-site rounds at companies ranging from Series B startups to FAANG-adjacent engineering orgs. By the end, you'll have battle-ready answers with the depth and nuance that separates a senior-level response from a junior one — and you'll understand the 'why' well enough to adapt your answer to any follow-up curveball.
CI/CD on AWS — CodePipeline, CodeBuild, and the Questions Behind Them
The most common first question in an AWS DevOps screen is some variation of: 'Walk me through your CI/CD pipeline.' Interviewers aren't looking for a list of services — they're listening for your decision-making.
CodePipeline is AWS's native pipeline orchestrator. It doesn't build or deploy anything itself — it coordinates other services. CodeBuild handles the compilation, testing, and packaging (it's a fully managed build server billed per build minute). CodeDeploy handles the actual deployment to EC2, Lambda, or ECS. Understanding that separation of concerns is critical.
Why use CodePipeline over Jenkins? The honest answer is: it depends. CodePipeline has zero infrastructure to manage and integrates natively with IAM, CloudTrail, and EventBridge. Jenkins gives you more plugin flexibility and is easier to migrate if you leave AWS. The right answer in an interview is to name the tradeoff, not pick a winner blindly.
One detail that trips people up: CodeBuild runs in an isolated, ephemeral container. That means any state — installed dependencies, cached layers — is gone after the build unless you explicitly configure a build cache in S3. Forgetting this is why builds that work locally are mysteriously slow or broken in CodeBuild.
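To make the caching point concrete, here is a minimal sketch of the `cache` setting a CodeBuild project carries. The dictionary shape follows the `cache` field of a CodeBuild project definition; the bucket name and prefix are hypothetical placeholders, and the helper function is illustrative rather than part of any AWS SDK.

```python
import json


def codebuild_cache_config(bucket=None, prefix="codebuild-cache"):
    """Return a value shaped like the `cache` field of a CodeBuild project."""
    if bucket is None:
        # No cache configured: every build starts from a clean container.
        return {"type": "NO_CACHE"}
    # For the S3 cache type, the location is "<bucket>/<prefix>".
    return {"type": "S3", "location": f"{bucket}/{prefix}"}


# Hypothetical bucket name, used only for illustration.
print(json.dumps(codebuild_cache_config("example-build-artifacts"), indent=2))
```

The project setting alone is not enough: the buildspec also has to list the directories to persist under `cache: paths:` (for example a dependency directory), otherwise nothing is uploaded to the cache.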
IAM, Security, and the Principle of Least Privilege — Where Candidates Get Caught
IAM is the single most tested AWS topic in DevOps interviews, and also the most misunderstood. Most candidates can explain what a policy is — very few can explain the evaluation logic when multiple policies conflict.
Here's the core rule: AWS evaluates all applicable policies (identity-based, resource-based, permission boundaries, SCPs). An explicit Deny anywhere always wins. Without an explicit Deny, the request still needs an Allow: within a single account, an identity-based or resource-based policy must grant the action, and any permission boundary or SCP in play must also permit it, because those two only cap permissions and never grant them. This seems obvious until you're debugging why a Lambda function can't write to an S3 bucket even though the IAM role allows s3:PutObject, and it turns out the bucket policy has an explicit Deny for all non-VPC traffic.
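The evaluation order can be sketched as a toy model. This is a deliberate simplification, not AWS's actual engine: it covers a single account, matches on the action only, and ignores resources, principals, conditions, and session policies. The function names and the sample policies are hypothetical.

```python
from fnmatch import fnmatchcase


def _matches(statement, action, effect):
    """True if the statement has the given effect and covers the action."""
    if statement["Effect"] != effect:
        return False
    actions = statement["Action"]
    if isinstance(actions, str):
        actions = [actions]
    # IAM action names are case-insensitive and support * and ? wildcards.
    return any(fnmatchcase(action.lower(), p.lower()) for p in actions)


def _any_match(policy, action, effect):
    return any(_matches(s, action, effect) for s in policy)


def evaluate(action, identity, resource=(), boundary=None, scp=None):
    """Simplified same-account decision for one action."""
    everything = [identity, resource, boundary or [], scp or []]
    # 1. An explicit Deny in any policy ends the evaluation.
    if any(_any_match(p, action, "Deny") for p in everything):
        return "ExplicitDeny"
    # 2. SCPs and permission boundaries cap permissions; they never grant.
    if scp is not None and not _any_match(scp, action, "Allow"):
        return "ImplicitDeny"
    if boundary is not None and not _any_match(boundary, action, "Allow"):
        return "ImplicitDeny"
    # 3. Something must actually grant the action.
    if _any_match(identity, action, "Allow") or _any_match(resource, action, "Allow"):
        return "Allow"
    return "ImplicitDeny"


# The Lambda-to-S3 scenario: the role allows the write, the bucket policy denies it.
role_policy = [{"Effect": "Allow", "Action": "s3:PutObject"}]
bucket_policy = [{"Effect": "Deny", "Action": "s3:*"}]  # condition omitted in this model
print(evaluate("s3:PutObject", role_policy, resource=bucket_policy))
```

In the real scenario the bucket's Deny carries a condition (traffic not arriving through the VPC endpoint); the sketch treats that condition as already matched.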
The other area interviewers probe hard: IAM Roles vs. IAM Users for automation. The correct answer in 2024 is always roles for anything machine-to-machine. IAM users have long-lived static credentials — if a key leaks, you have a breach. Roles use short-lived STS tokens that auto-rotate. EC2 instance profiles, ECS task roles, Lambda execution roles — all of these use the role mechanism.
Permission Boundaries are often a senior-level differentiator. They let you delegate IAM administration safely: you can allow a team to create their own roles, but cap the maximum permissions those roles can ever have. It's the difference between 'trust but verify' and 'trust and you can't accidentally escalate anyway.'
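One way to express that delegation is a policy that allows role creation only when a specific boundary is attached, using the `iam:PermissionsBoundary` condition key. The sketch below builds such a policy document as a Python dictionary; the account ID, boundary policy name, and role path are hypothetical placeholders, and a production setup would need further statements (for example, preventing removal of the boundary).

```python
import json

ACCOUNT_ID = "123456789012"  # placeholder account ID
BOUNDARY_ARN = f"arn:aws:iam::{ACCOUNT_ID}:policy/team-boundary"  # hypothetical policy


def delegated_role_admin_policy(boundary_arn, role_path="/team/"):
    """Allow creating roles under role_path only if the boundary is attached."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateRolesOnlyWithBoundary",
                "Effect": "Allow",
                "Action": "iam:CreateRole",
                "Resource": f"arn:aws:iam::{ACCOUNT_ID}:role{role_path}*",
                "Condition": {
                    "StringEquals": {"iam:PermissionsBoundary": boundary_arn}
                },
            }
        ],
    }


print(json.dumps(delegated_role_admin_policy(BOUNDARY_ARN), indent=2))
```

A `CreateRole` call that omits the boundary, or names a different one, fails the condition and is implicitly denied.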
Infrastructure as Code — CloudFormation vs Terraform and Real Architecture Questions
CloudFormation is AWS-native IaC. Terraform is cloud-agnostic HCL-based IaC by HashiCorp. This comparison comes up in almost every AWS DevOps interview, and the trap is giving a tribal 'Terraform is better' answer without nuance.
CloudFormation's strengths: native drift detection, StackSets for multi-account/multi-region deployments, no state file management (AWS manages state), and deep integration with AWS tooling such as Service Catalog and the CDK. Its weaknesses: verbose YAML/JSON, a slower development cycle, and notoriously unhelpful failure reporting (a stack status of 'UPDATE_ROLLBACK_COMPLETE' tells you nothing about what actually failed; you have to dig through the stack events to find the resource that triggered the rollback).
Terraform's strengths: multi-cloud portability, cleaner module system, better plan output (you see exactly what will change before it changes), and a massive community registry of modules. Its weakness: you own the state file, which means you need a backend (S3 + DynamoDB for locking), and state file corruption or drift is your problem to solve.
In practice, many mature AWS shops use both: CloudFormation for account-level infrastructure (VPCs, IAM foundations, Service Control Policies) via AWS Control Tower, and Terraform for application-level infrastructure managed by product teams. Knowing this hybrid pattern signals real-world experience.
terraform plan against live infrastructure. Showing you think about what happens after initial deployment separates senior candidates from mid-level ones.
ECS, EKS, and Container Orchestration — The Questions That Reveal Depth
Container questions are where AWS DevOps interviews get genuinely technical. The ECS vs EKS question is almost a given, and the wrong move is to immediately say 'Kubernetes is always better.'
ECS (Elastic Container Service) is AWS-native container orchestration. It uses Task Definitions (the blueprint for a container workload) and Services (which maintain the desired count of tasks and wire them to load balancers). ECS with Fargate means zero EC2 management — AWS provisions the underlying compute per task. ECS with EC2 launch type means you manage the cluster nodes, but you get more control over instance types and pricing (Reserved Instances, Savings Plans).
EKS (Elastic Kubernetes Service) gives you a managed Kubernetes control plane. Use it when your team already has Kubernetes expertise, you need to run the same workloads on-prem and in AWS, or your application requires Kubernetes-native features like custom operators or CRDs. The operational overhead is meaningfully higher than ECS.
The real interview depth comes from task role vs execution role in ECS, a distinction that trips up many candidates. The execution role is what ECS itself (the container agent) uses to pull the container image from ECR and write logs to CloudWatch. The task role is what your application code uses to call other AWS services (DynamoDB, S3, etc.). Mixing these up is a classic misconfiguration that causes hard-to-diagnose permission failures.
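The two roles show up as separate fields, `executionRoleArn` and `taskRoleArn`, in an ECS task definition. The following sketch is a hypothetical pre-deployment check over a task definition dictionary; the function name, the role ARNs, and the warning wording are assumptions made for illustration.

```python
def check_task_roles(task_definition):
    """Return warnings about how the two ECS roles are wired."""
    warnings = []
    execution_role = task_definition.get("executionRoleArn")
    task_role = task_definition.get("taskRoleArn")
    if not execution_role:
        warnings.append(
            "executionRoleArn missing: ECS has no role for pulling from ECR "
            "or writing container logs"
        )
    if not task_role:
        warnings.append(
            "taskRoleArn missing: application code has no AWS credentials of its own"
        )
    if execution_role and execution_role == task_role:
        warnings.append(
            "same role used for both: application code inherits image-pull "
            "and logging permissions it does not need"
        )
    return warnings


# Hypothetical task definition fragment.
task_definition = {
    "family": "orders-api",
    "executionRoleArn": "arn:aws:iam::123456789012:role/ecs-execution",
    "taskRoleArn": "arn:aws:iam::123456789012:role/orders-api-task",
}
print(check_task_roles(task_definition))
```

A missing task role is only a problem if the application calls AWS APIs, so the check reports warnings rather than hard failures.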
Monitoring, Observability, and Incident Response — The DevOps Interview Difference
Interviewers dig into monitoring because they want to know how you detect and respond to failures before customers do. The standard answer—'we use CloudWatch alarms'—is not enough. You need to show you understand the difference between monitoring (tracking known metrics) and observability (exploring unknown failure modes).
On AWS, CloudWatch collects metrics, logs, and events. But CloudWatch Metrics alone won't catch an intermittent 503 error that happens only when a downstream service is slow. That's where structured logging (JSON format) and distributed tracing with X-Ray come in. The trick is to log correlation IDs so you can trace a request across EC2, Lambda, RDS, and S3.
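A minimal sketch of structured logging with a correlation ID, using only the Python standard library. The `X-Correlation-Id` header name, the logger name, and the field names are assumptions chosen for illustration; teams pick their own conventions.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Set via `extra=`; None when the caller did not supply one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, headers):
    """Reuse the caller's correlation ID, or mint one at the edge."""
    correlation_id = headers.get("X-Correlation-Id") or str(uuid.uuid4())
    logger.info("request received", extra={"correlation_id": correlation_id})
    return correlation_id


stream = io.StringIO()
handle_request(build_logger(stream), {"X-Correlation-Id": "abc-123"})
print(stream.getvalue(), end="")
```

Passing the same ID to every downstream call is what lets a single log query reconstruct one request's path across services.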
A senior DevOps answer should include: metrics (CPU, memory, request latency) that trigger alarms, structured logs with context for debugging, and traces for pinpointing bottlenecks. Also mention alarm fatigue: too many alarms cause engineers to ignore them. Use composite alarms to reduce noise.
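Composite alarms combine the states of other alarms through a rule expression built from `ALARM(...)` terms joined with `AND` or `OR`. The helper below is a hypothetical convenience for assembling such an expression; the alarm names are made up for the example.

```python
def composite_rule(alarm_names, operator="AND"):
    """Build a composite alarm rule that fires on the combined child alarms."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


# Page only when errors and latency are both bad, not on either alone.
print(composite_rule(["api-5xx-high", "api-latency-high"]))
```

Requiring both child alarms to be in ALARM state is one way to cut pages for transient single-metric blips, at the cost of reacting later to failures that only show up in one signal.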
ECS Task Fails to Pull Image in Private Subnet
- Always check the full data path: ECR interface endpoints handle API calls, but S3 gateway endpoints are required for layer downloads.
- When debugging 'CannotPullContainerError' with correct IAM, suspect missing S3 endpoint before anything else.
- Add a VPC endpoint checklist to your deployment runbook: ECR API, ECR Docker, S3 Gateway, and CloudWatch Logs interface endpoints for private Fargate tasks.
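The checklist in the last bullet can be sketched as a small check that compares the endpoint service names present in a VPC with the ones a private Fargate task needs. The function name and the input list are hypothetical; in practice the list of existing service names would come from the VPC's endpoint configuration.

```python
# Service name suffix -> endpoint type needed for private Fargate tasks.
REQUIRED_ENDPOINTS = {
    "ecr.api": "Interface",  # ECR API calls (auth, image metadata)
    "ecr.dkr": "Interface",  # Docker registry traffic
    "s3": "Gateway",         # image layer downloads
    "logs": "Interface",     # CloudWatch Logs
}


def missing_endpoints(region, existing_service_names):
    """Return the required endpoint service names absent from the VPC."""
    required = {f"com.amazonaws.{region}.{suffix}" for suffix in REQUIRED_ENDPOINTS}
    return sorted(required - set(existing_service_names))


# A VPC with the ECR and Logs endpoints but no S3 gateway endpoint.
existing = [
    "com.amazonaws.us-east-1.ecr.api",
    "com.amazonaws.us-east-1.ecr.dkr",
    "com.amazonaws.us-east-1.logs",
]
print(missing_endpoints("us-east-1", existing))
```

Run against the failing setup described above, the only gap reported is the S3 endpoint, which matches the CannotPullContainerError symptom.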
Common mistakes to avoid
- Putting secrets in CloudFormation parameters as 'String' type instead of 'AWS::SSM::Parameter::Value<String>'
- Forgetting the S3 VPC Gateway Endpoint when running Fargate tasks in private subnets
- Treating CodeBuild as a persistent build server and relying on local filesystem state between builds
- Using 'Resource': '*' in IAM policies for Lambda execution roles
Interview Questions on This Topic
A production ECS service with 10 tasks is failing health checks for 30% of its nodes after a deployment. Walk me through the automated recovery process and your manual root cause analysis steps.