AWS Fargate: The Complete Guide to Serverless Containers on ECS and EKS
- Fargate is AWS's serverless compute engine for running containers without managing servers
- You define CPU and memory per task – Fargate provisions and scales the infrastructure automatically
- Works with both Amazon ECS and Amazon EKS for orchestration
- You pay per second for the vCPU and memory allocated to running tasks
- Production use requires careful networking, IAM, and logging configuration
- Biggest mistake: over-provisioning CPU and memory per task, inflating costs by 3-5x
Production Debug Guide – common symptoms and diagnostic actions for Fargate production issues.

Task stuck in PENDING (existing tasks continue operating normally):

    aws ecs describe-tasks --cluster my-cluster --tasks $(aws ecs list-tasks --cluster my-cluster --desired-status RUNNING --query 'taskArns[0]' --output text) --query 'tasks[0].{status:lastStatus,stopReason:stopReason,attachments:attachments[0].details}'
    aws ec2 describe-subnets --subnet-ids subnet-xxxxx --query 'Subnets[0].AvailableIpAddressCount'

Container crashes on startup:

    aws logs get-log-events --log-group-name /ecs/my-task --log-stream-name ecs/my-container/TASK_ID --limit 50
    aws ecs describe-tasks --cluster my-cluster --tasks TASK_ARN --query 'tasks[0].containers[0].{exitCode:exitCode,reason:reason}'

Cannot pull image from ECR:

    aws ecr describe-images --repository-name my-repo --query 'imageDetails[0].imageTags'
    aws iam list-attached-role-policies --role-name ecsTaskExecutionRole --query 'AttachedPolicies[].PolicyArn'

High Fargate costs (note: CpuUtilized and MemoryUtilized are Container Insights metrics, published under the ECS/ContainerInsights namespace):

    aws cloudwatch get-metric-statistics --namespace ECS/ContainerInsights --metric-name CpuUtilized --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average
    aws cloudwatch get-metric-statistics --namespace ECS/ContainerInsights --metric-name MemoryUtilized --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average
AWS Fargate is a serverless compute engine for containers that eliminates the need to provision, configure, or scale virtual machine clusters. You package your application as a container image, define resource requirements, and Fargate runs it on infrastructure managed entirely by AWS.
Fargate shifts operational burden from managing EC2 instance fleets to defining task-level resource requirements. This simplifies capacity planning but introduces new challenges around networking configuration, IAM task roles, cold start latency, and cost optimization at scale. Production deployments require understanding these trade-offs before committing to Fargate over EC2 launch type.
What Is AWS Fargate?
AWS Fargate is a serverless compute engine for containers that works with both Amazon ECS and Amazon EKS. It removes the need to manage EC2 instances – you define container images, CPU, memory, and networking requirements, and Fargate provisions the underlying infrastructure to run your containers.
Fargate assigns each task its own kernel runtime environment and elastic network interface (ENI). This provides task-level isolation comparable to running containers on dedicated EC2 instances, without the operational overhead of managing the instance fleet.
The core abstraction is the task – a set of one or more containers that share a network namespace and storage volumes. You define tasks in a task definition, which specifies the container image, resource requirements, IAM roles, logging configuration, and networking mode. ECS or EKS schedules these tasks onto Fargate-managed infrastructure.
{
"family": "io-thecodeforge-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
"containerDefinitions": [
{
"name": "api",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
"essential": true,
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"environment": [
{"name": "NODE_ENV", "value": "production"},
{"name": "LOG_LEVEL", "value": "info"}
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-url:DATABASE_URL::"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/io-thecodeforge-api",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs",
"awslogs-create-group": "true"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
- Each task gets its own ENI, kernel, and resource isolation – no shared host contention
- You define CPU and memory per task, not per cluster – capacity planning is task-level
- Fargate works with ECS and EKS – the same serverless model for both orchestrators
- You pay per second for vCPU and memory allocated to running tasks only
- No SSH access to underlying infrastructure – all debugging happens through logs and APIs
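The subnet-planning point above can be made concrete with a quick back-of-envelope calculation. This is an illustrative sketch (the helper name and the 20% headroom figure are assumptions, not AWS guidance); it accounts for one ENI per awsvpc task and the five addresses AWS reserves in every subnet:

```python
import math

# Sketch: estimate the smallest subnet prefix that fits a peak Fargate task
# count. Each awsvpc task consumes one private IP (one ENI), and AWS reserves
# 5 addresses per subnet. Helper name and headroom default are illustrative.
AWS_RESERVED_IPS = 5

def smallest_prefix_for_tasks(peak_tasks: int, headroom_pct: float = 0.2) -> int:
    """Return the tightest /N prefix whose subnet holds peak_tasks ENIs plus headroom."""
    needed = math.ceil(peak_tasks * (1 + headroom_pct)) + AWS_RESERVED_IPS
    bits = math.ceil(math.log2(needed))  # a /N subnet has 2**(32-N) addresses
    return 32 - bits

# A service peaking at 500 tasks with 20% headroom needs 605 addresses,
# so a /22 (1024 addresses) is the tightest standard fit.
print(smallest_prefix_for_tasks(500))  # → 22
```

Run this against each service's scale-out ceiling before choosing CIDR blocks; IP exhaustion surfaces as tasks stuck in PENDING.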
Fargate Networking and Security
Fargate tasks run in awsvpc mode – each task receives its own elastic network interface (ENI) with a private IP address in your VPC subnet. This provides VPC-level security controls through security groups and network ACLs, but requires careful subnet planning to avoid IP exhaustion.
Networking decisions have cost and performance implications. Tasks in private subnets require a NAT gateway for outbound internet access, which adds data processing charges. VPC endpoints for AWS services (S3, ECR, CloudWatch, Secrets Manager) eliminate NAT gateway costs for service-to-service communication.
Security follows the principle of least privilege through two IAM roles per task: the execution role (pulling images, writing logs, fetching secrets) and the task role (application-level AWS API access). Separating these roles ensures the task can only access resources it actually needs.
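As a concrete sketch of the execution-role half of that split, here is what a least-privilege execution-role policy might look like, expressed as a Python dict (the account ID, repository, log group, and secret ARNs are placeholders, not real resources):

```python
# Sketch of a least-privilege *execution role* policy: the role ECS itself uses
# to pull the image, write logs, and fetch secrets at task startup. All ARNs
# below are placeholders. Application-level AWS access (S3, SQS, ...) belongs
# on the separate task role, never here.
EXECUTION_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Pull the container image from one ECR repository only
            "Effect": "Allow",
            "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage"],
            "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/io-thecodeforge-api",
        },
        {   # ECR auth tokens cannot be resource-scoped
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
        {   # Write container logs to this task's log group only
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/ecs/io-thecodeforge-api:*",
        },
        {   # Read only the secret referenced in the task definition
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-url-*",
        },
    ],
}
```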
# CloudFormation snippet for Fargate networking infrastructure
# Shows VPC endpoints, subnets, and security groups
Resources:
  # Private subnets for Fargate tasks
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.0.0/20  # 4091 usable IPs for Fargate tasks
      AvailabilityZone: us-east-1a
      Tags:
        - Key: Name
          Value: fargate-private-a
  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.16.0/20
      AvailabilityZone: us-east-1b
      Tags:
        - Key: Name
          Value: fargate-private-b

  # NAT Gateway for outbound internet access
  NatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnetA

  # VPC Endpoint for ECR (avoids NAT gateway charges)
  EcrApiEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.ecr.api
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
      SecurityGroupIds:
        - !Ref VpcEndpointSg

  S3Endpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Security group for Fargate tasks
  FargateTaskSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Fargate tasks
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          SourceSecurityGroupId: !Ref AlbSg
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS outbound for ECR, Secrets Manager
        - IpProtocol: tcp
          FromPort: 5432
          ToPort: 5432
          DestinationSecurityGroupId: !Ref DatabaseSg

  # Security group for VPC endpoints
  VpcEndpointSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTPS from Fargate tasks to VPC endpoints
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref FargateTaskSg
Fargate Pricing and Cost Optimization
Fargate pricing is based on vCPU and memory resources allocated to running tasks, billed per second with a one-minute minimum. This model eliminates idle capacity costs but requires right-sizing tasks to avoid over-provisioning.
Cost optimization in Fargate centers on three levers: right-sizing CPU and memory allocations, using Fargate Spot for fault-tolerant workloads, and consolidating containers into fewer, larger tasks. Most production Fargate deployments overspend by 30-50% due to inflated resource requests that do not match actual utilization.
Fargate Spot provides up to 70% cost reduction for interrupt-tolerant workloads like batch processing, CI/CD pipelines, and stateless workers. Spot tasks can be interrupted with a two-minute warning, so your application must handle graceful shutdown.
from dataclasses import dataclass
from typing import Dict, List
from enum import Enum


class PricingRegion(Enum):
    US_EAST_1 = "us-east-1"
    EU_WEST_1 = "eu-west-1"


@dataclass
class FargatePricing:
    """Fargate pricing per hour for a region."""
    vcpu_per_hour: float
    memory_per_gb_hour: float
    spot_discount: float = 0.70


@dataclass
class TaskConfiguration:
    """A Fargate task's resource allocation."""
    name: str
    vcpu: float
    memory_gb: float
    task_count: int
    hours_per_day: float = 24.0
    is_spot_eligible: bool = False


class FargateCostCalculator:
    """Calculates and optimizes Fargate costs."""

    PRICING = {
        PricingRegion.US_EAST_1: FargatePricing(
            vcpu_per_hour=0.04048, memory_per_gb_hour=0.004445
        ),
        PricingRegion.EU_WEST_1: FargatePricing(
            vcpu_per_hour=0.04655, memory_per_gb_hour=0.005112
        ),
    }

    def __init__(self, region: PricingRegion):
        self.region = region
        self.pricing = self.PRICING[region]

    def task_hourly_cost(self, task: TaskConfiguration) -> float:
        """Calculate the hourly cost of a single Fargate task."""
        base_cost = (
            task.vcpu * self.pricing.vcpu_per_hour
            + task.memory_gb * self.pricing.memory_per_gb_hour
        )
        if task.is_spot_eligible:
            return base_cost * (1 - self.pricing.spot_discount)
        return base_cost

    def monthly_cost(self, task: TaskConfiguration) -> float:
        """Calculate the monthly cost for all instances of a task."""
        return self.task_hourly_cost(task) * task.hours_per_day * 30 * task.task_count

    def total_monthly_cost(self, tasks: List[TaskConfiguration]) -> Dict:
        """Calculate the total monthly cost breakdown."""
        breakdown = {}
        total = 0.0
        for task in tasks:
            cost = self.monthly_cost(task)
            breakdown[task.name] = {
                "task_cost_hourly": round(self.task_hourly_cost(task), 4),
                "monthly_cost": round(cost, 2),
                "task_count": task.task_count,
                "vcpu_per_task": task.vcpu,
                "memory_gb_per_task": task.memory_gb,
                "spot_eligible": task.is_spot_eligible,
            }
            total += cost
        return {
            "tasks": breakdown,
            "total_monthly_cost": round(total, 2),
            "region": self.region.value,
        }

    def right_size_recommendation(
        self,
        task: TaskConfiguration,
        actual_cpu_utilization_pct: float,
        actual_memory_utilization_pct: float,
    ) -> Dict:
        """Recommend a right-sized configuration based on actual utilization."""
        # Target observed usage plus 30% headroom
        target_vcpu = task.vcpu * (actual_cpu_utilization_pct / 100) * 1.3
        target_memory = task.memory_gb * (actual_memory_utilization_pct / 100) * 1.3

        # Snap to valid Fargate sizes (max 16 vCPU / 120 GB). Note that CPU and
        # memory are not independent: each CPU value supports only a range of
        # memory values (e.g. 1 vCPU supports 2-8 GB), so validate the pair
        # against the Fargate size table before registering the task definition.
        valid_vcpu = [0.25, 0.5, 1, 2, 4, 8, 16]
        valid_memory = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 30,
                        48, 64, 96, 120]
        recommended_vcpu = min((v for v in valid_vcpu if v >= target_vcpu),
                               default=valid_vcpu[-1])
        recommended_memory = min((m for m in valid_memory if m >= target_memory),
                                 default=valid_memory[-1])

        current_cost = self.monthly_cost(task)
        right_sized_task = TaskConfiguration(
            name=task.name,
            vcpu=recommended_vcpu,
            memory_gb=recommended_memory,
            task_count=task.task_count,
            hours_per_day=task.hours_per_day,
            is_spot_eligible=task.is_spot_eligible,
        )
        new_cost = self.monthly_cost(right_sized_task)
        return {
            "current": {
                "vcpu": task.vcpu,
                "memory_gb": task.memory_gb,
                "monthly_cost": round(current_cost, 2),
            },
            "recommended": {
                "vcpu": recommended_vcpu,
                "memory_gb": recommended_memory,
                "monthly_cost": round(new_cost, 2),
            },
            "savings_monthly": round(current_cost - new_cost, 2),
            "savings_pct": round((1 - new_cost / current_cost) * 100, 1),
        }


# Example usage
calculator = FargateCostCalculator(PricingRegion.US_EAST_1)
tasks = [
    TaskConfiguration("api-server", 1, 2, 10, 24, False),
    TaskConfiguration("worker", 0.5, 1, 20, 24, True),
    TaskConfiguration("scheduler", 0.25, 0.5, 2, 24, False),
]
result = calculator.total_monthly_cost(tasks)
print(f"Total monthly cost: ${result['total_monthly_cost']}")

# Right-sizing recommendation
recommendation = calculator.right_size_recommendation(
    tasks[0], actual_cpu_utilization_pct=25, actual_memory_utilization_pct=40
)
print(f"Right-size savings: ${recommendation['savings_monthly']}/month "
      f"({recommendation['savings_pct']}%)")
- Right-size tasks using CloudWatch Container Insights – most deployments over-provision by 2-3x
- Use Fargate Spot for batch jobs, CI/CD workers, and stateless background tasks (70% savings)
- Consolidate sidecar containers into the main task to reduce per-task overhead
- Schedule non-production tasks to stop outside business hours using EventBridge + Lambda
- Use Compute Savings Plans for predictable Fargate workloads (up to 17% savings)
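The EventBridge + Lambda bullet above might be implemented with a handler like this sketch. The cluster name, service names, and task counts are placeholders; the pure `desired_counts` helper is split out so the scheduling logic is testable without AWS access:

```python
# Sketch of the Lambda behind "stop non-production tasks outside business
# hours": an EventBridge schedule invokes it with {"action": "stop"} each
# evening and {"action": "start"} each morning. Names and counts are
# placeholders.
SERVICES = {"staging-api": 2, "staging-worker": 4}  # service -> daytime count

def desired_counts(action: str) -> dict:
    """Pure helper: map each service to the count this schedule run should set."""
    return {svc: (count if action == "start" else 0)
            for svc, count in SERVICES.items()}

def handler(event, context):
    import boto3  # imported lazily; available in the Lambda runtime
    ecs = boto3.client("ecs")
    for service, count in desired_counts(event["action"]).items():
        ecs.update_service(
            cluster="staging",   # placeholder cluster name
            service=service,
            desiredCount=count,  # desiredCount=0 stops all tasks and billing
        )
```

Since Fargate bills per second of allocated vCPU and memory, setting the desired count to zero overnight and on weekends cuts a 24/7 staging bill by roughly two-thirds.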
Deploying ECS Services on Fargate
Production ECS services on Fargate require a deployment configuration that handles rolling updates, health checks, auto-scaling, and service discovery. The ECS service abstraction manages task placement, desired count, and deployment strategy across Fargate-managed infrastructure.
Rolling updates with the circuit breaker pattern prevent failed deployments from replacing healthy tasks. The circuit breaker monitors task health and automatically rolls back if new tasks fail to start. Combined with health check grace periods, this prevents deployment cascading failures.
Auto-scaling on Fargate adjusts the desired task count based on CloudWatch metrics – CPU utilization, memory utilization, request count, or custom metrics via Application Auto Scaling. Scaling policies should use target tracking for steady-state adjustments and step scaling for rapid traffic spikes.
# CloudFormation for production ECS Fargate service
# Includes deployment circuit breaker, auto-scaling, and service discovery
Resources:
  EcsCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: io-thecodeforge-production
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      ServiceConnectDefaults:
        Namespace: io-thecodeforge.local

  EcsService:
    Type: AWS::ECS::Service
    DependsOn:
      - AlbListener
    Properties:
      ServiceName: io-thecodeforge-api
      Cluster: !Ref EcsCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      PlatformVersion: LATEST
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref FargateTaskSg
          Subnets:
            - !Ref PrivateSubnetA
            - !Ref PrivateSubnetB
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref ApiTargetGroup
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
      HealthCheckGracePeriodSeconds: 120

  # Auto-scaling target
  ScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 20
      MinCapacity: 3
      ResourceId: !Sub service/${EcsCluster}/${EcsService.Name}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt AutoScalingRole.Arn

  # Target tracking scaling policy
  CpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 65
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 300
        ScaleOutCooldown: 60

  # Service discovery
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: api
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespace
        DnsRecords:
          - TTL: 10
            Type: A
      HealthCheckCustomConfig:
        FailureThreshold: 1

  # CloudWatch alarm for failed deployments
  FailedDeploymentAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: fargate-api-failed-tasks
      AlarmDescription: Alert when Fargate tasks fail to start
      Namespace: ECS/ContainerInsights  # RunningTaskCount is a Container Insights metric
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsCluster
        - Name: ServiceName
          Value: !GetAtt EcsService.Name
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: LessThanThreshold
      AlarmActions:
        - !Ref OpsSnsTopic
- Circuit breaker monitors new task health during rolling deployments
- If tasks fail to start, ECS automatically rolls back to the previous task definition
- Health check grace period gives containers time to initialize before health checks begin
- MinimumHealthyPercent: 100 ensures zero-downtime deployments – old tasks stay until new ones are healthy
- Auto-scaling adjusts task count based on CPU, memory, or custom CloudWatch metrics
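As a rough model of how target tracking behaves – this is a simplification for building intuition, not Application Auto Scaling's exact algorithm – desired capacity grows roughly in proportion to the actual-to-target metric ratio, clamped to the configured bounds:

```python
import math

# Back-of-envelope model of target-tracking scaling (a simplification, not
# the exact AWS algorithm): desired capacity scales with actual/target ratio,
# clamped between MinCapacity and MaxCapacity.
def target_tracking_desired(current_tasks: int, actual_metric: float,
                            target_metric: float, min_cap: int, max_cap: int) -> int:
    desired = math.ceil(current_tasks * actual_metric / target_metric)
    return max(min_cap, min(max_cap, desired))

# 3 tasks averaging 90% CPU against the 65% target above -> scale out to 5
print(target_tracking_desired(3, 90, 65, min_cap=3, max_cap=20))  # → 5
```

This is also why the scale-in cooldown matters: without it, the same proportional math would shed tasks aggressively the moment traffic dips.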
Fargate Logging and Observability
Production Fargate workloads require structured logging, distributed tracing, and container-level metrics. Since Fargate provides no SSH access, all observability must be configured through the task definition and external services before deployment.
CloudWatch Logs is the default log driver, but production systems benefit from FireLens – a Fluent Bit-based log router that supports structured JSON output, multi-destination routing, and log filtering. FireLens can send logs to CloudWatch, Datadog, Splunk, or Elasticsearch, even simultaneously.
Container Insights provides CPU, memory, disk, and network metrics per task and container. Combined with X-Ray for distributed tracing, this creates a complete observability stack for Fargate microservices.
{
"family": "io-thecodeforge-api-observed",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
"containerDefinitions": [
{
"name": "api",
"image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
"essential": true,
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"environment": [
{"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
],
"logConfiguration": {
"logDriver": "awsfirelens",
"options": {
"Name": "cloudwatch",
"region": "us-east-1",
"log_group_name": "/ecs/io-thecodeforge-api",
"log_stream_prefix": "ecs",
"auto_create_group": "true",
"log_key": "log"
}
},
"dependsOn": [
{
"containerName": "log-router",
"condition": "START"
}
]
},
{
"name": "xray",
"image": "amazon/aws-xray-daemon:latest",
"essential": false,
"cpu": 32,
"memoryReservation": 256,
"portMappings": [
{
"containerPort": 2000,
"protocol": "udp"
}
]
},
{
"name": "log-router",
"image": "amazon/aws-for-fluent-bit:latest",
"essential": true,
"cpu": 32,
"memoryReservation": 64,
"firelensConfiguration": {
"type": "fluentbit",
"options": {
"enable-ecs-log-metadata": "true",
"config-file-type": "file",
"config-file-value": "/fluent-bit/configs/parse-json.conf"
}
}
}
]
}
- Use FireLens (Fluent Bit) as the log router – it supports structured JSON and multi-destination output
- Enable Container Insights on the ECS cluster for per-task CPU, memory, and network metrics
- Add the X-Ray sidecar for distributed tracing across microservices
- Emit structured JSON logs with correlation IDs – never plain-text log lines
- Set a log retention policy on CloudWatch log groups – the default is never-expire, which gets expensive
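A minimal sketch of the structured-JSON-with-correlation-ID pattern from the bullets above. The field names are illustrative; the relevant constraint is that FireLens/Fluent Bit parses one JSON object per stdout line:

```python
import json
import logging
import sys
import uuid

# Minimal structured-JSON logger. Field names are illustrative; Fluent Bit
# (via FireLens) picks up one JSON object per line from the container's stdout.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Attach a per-request correlation ID via the `extra` kwarg
log.info("order created", extra={"correlation_id": str(uuid.uuid4())})
```

In a real service the correlation ID would come from an inbound request header (or be minted at the edge) rather than generated per log call, so one request's log lines can be joined across services.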
| Feature | Fargate | EC2 Launch Type | Lambda (Container Images) |
|---|---|---|---|
| Server Management | Fully managed by AWS | You manage instances | Fully managed by AWS |
| Max Memory | Up to 120 GB per task | Depends on instance type | Up to 10 GB |
| Max vCPU | Up to 16 per task | Depends on instance type | Up to 6 vCPU |
| Execution Duration | Unlimited | Unlimited | 15 minutes max |
| Networking | ENI per task in VPC | Shared ENI on instance | VPC optional |
| Cold Start | 30-90 seconds for new tasks | None (instances running) | 1-3 seconds |
| Cost Model | Per second for vCPU + memory | Per hour for instances | Per invocation + duration |
| Best For | Steady microservices, APIs | Cost-optimized at scale | Event-driven, short tasks |
Key Takeaways
- Fargate runs containers without managing servers – you define tasks, AWS handles infrastructure
- Each task gets its own ENI and resource isolation – plan subnet CIDR blocks for peak task count
- Right-sizing CPU and memory based on actual utilization saves 30-50% of compute costs
- Enable the deployment circuit breaker with rollback on every ECS Fargate service
- VPC endpoints for ECR, S3, and CloudWatch eliminate NAT gateway charges
- Structured logging via FireLens is mandatory – Fargate has no SSH access for debugging
Interview Questions on This Topic
- Q: What is AWS Fargate and how does it differ from running containers on EC2? (Junior)
- Q: How would you optimize Fargate costs for a production microservices architecture? (Mid-level)
- Q: A production Fargate service is experiencing intermittent task placement failures with tasks stuck in PENDING. Walk through your diagnosis process. (Senior)
Frequently Asked Questions
Is AWS Fargate really serverless?
Fargate is serverless in the sense that you do not provision, manage, or patch any servers. AWS manages the underlying compute infrastructure entirely. However, unlike Lambda, Fargate tasks run continuously and you are billed for the duration they run, not per invocation. You still need to define networking, IAM, and logging configuration. Fargate removes server management but not infrastructure configuration.
What is the maximum size of a Fargate task?
Fargate supports up to 16 vCPU and 120 GB of memory per task. The valid CPU values are 0.25, 0.5, 1, 2, 4, 8, and 16 vCPU. Each CPU value has a set of valid memory configurations β for example, 1 vCPU supports 2-8 GB of memory. A single task can run up to 10 containers that share the task's CPU and memory allocation.
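Those pairing rules can be encoded in a small lookup and checked before registering a task definition. The table below covers only the smaller sizes discussed here – consult the AWS documentation for the full, current list:

```python
# Partial table of valid Fargate CPU/memory pairings (memory in GB).
# Covers the common small sizes only; see AWS docs for 8 and 16 vCPU,
# whose memory goes up in 4 GB and 8 GB increments respectively.
VALID_MEMORY_GB = {
    0.25: [0.5, 1, 2],
    0.5:  [1, 2, 3, 4],
    1:    [2, 3, 4, 5, 6, 7, 8],
    2:    list(range(4, 17)),   # 4-16 GB in 1 GB steps
    4:    list(range(8, 31)),   # 8-30 GB in 1 GB steps
}

def is_valid_task_size(vcpu: float, memory_gb: float) -> bool:
    """Check a CPU/memory pair against the (partial) Fargate size table."""
    return memory_gb in VALID_MEMORY_GB.get(vcpu, [])

print(is_valid_task_size(1, 2))   # → True: 1 vCPU supports 2-8 GB
print(is_valid_task_size(1, 16))  # → False: 16 GB needs at least 2 vCPU
```

ECS rejects invalid pairs at task-definition registration time, so validating locally just fails faster in CI.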
Can Fargate tasks communicate with each other?
Yes, Fargate tasks communicate through standard networking since each task has its own ENI in the VPC. Tasks can reach each other using private IP addresses, service discovery (Cloud Map), or an internal ALB. For ECS, AWS Service Connect provides service mesh capabilities with automatic service discovery and traffic management.
How does Fargate handle persistent storage?
Fargate supports two storage options: Amazon EFS (Elastic File System) for persistent shared storage across tasks, and ephemeral storage up to 200 GB per task. EFS volumes mount inside containers like a regular filesystem and persist across task restarts. For stateful workloads, EFS is the recommended approach β it provides shared, durable storage without managing EBS volumes.
Should I use Fargate Spot for production workloads?
Fargate Spot can be used in production for fault-tolerant workloads. AWS provides a two-minute interruption warning before reclaiming Spot capacity. Your application must handle SIGTERM gracefully and drain connections. Use a mixed capacity provider strategy β run baseline capacity on regular Fargate and burst capacity on Spot. This provides cost savings while maintaining availability for critical workloads.
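The mixed strategy described above is expressed through a capacity provider strategy on the ECS service. The `capacityProvider`, `base`, and `weight` fields are real ECS concepts (base tasks are guaranteed on that provider; weights split the remainder); the counts and the `split_tasks` approximation below are illustrative, and ECS's actual placement logic differs:

```python
# Baseline-on-Fargate, burst-on-Spot strategy as an ECS service would accept
# it. base: tasks guaranteed on that provider; weight: ratio for the rest.
# Counts are illustrative.
CAPACITY_PROVIDER_STRATEGY = [
    {"capacityProvider": "FARGATE", "base": 3, "weight": 1},
    {"capacityProvider": "FARGATE_SPOT", "base": 0, "weight": 3},
]

def split_tasks(total: int, strategy=CAPACITY_PROVIDER_STRATEGY) -> dict:
    """Approximate how ECS distributes `total` tasks under the strategy."""
    placement = {s["capacityProvider"]: s["base"] for s in strategy}
    remaining = total - sum(placement.values())
    total_weight = sum(s["weight"] for s in strategy)
    for s in strategy:
        placement[s["capacityProvider"]] += round(remaining * s["weight"] / total_weight)
    return placement

# 11 tasks: 3 guaranteed on regular Fargate, remaining 8 split 1:3
print(split_tasks(11))  # → {'FARGATE': 5, 'FARGATE_SPOT': 6}
```

With this shape, a full Spot reclaim still leaves the baseline serving traffic on regular Fargate while ECS re-places the interrupted tasks.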
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.