Intermediate 5 min · April 11, 2026

AWS Fargate: Serverless Containers on ECS and EKS

AWS Fargate — ENI IP Exhaustion Blocks Deployments Silently

Q: Is AWS Fargate really serverless?

Fargate is serverless in the sense that you do not provision, manage, or patch any servers. AWS manages the underlying compute infrastructure entirely. However, unlike Lambda, Fargate tasks run continuously and you are billed for the duration they run, not per invocation. You still need to define networking, IAM, and logging configuration. Fargate removes server management but not infrastructure configuration.

Q: What is the maximum size of a Fargate task?

Fargate supports up to 16 vCPU and 120 GB of memory per task. The valid CPU values are 0.25, 0.5, 1, 2, 4, 8, and 16 vCPU. Each CPU value has a set of valid memory configurations — for example, 1 vCPU supports 2-8 GB of memory. A single task can run up to 10 containers that share the task's CPU and memory allocation.
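The CPU-to-memory pairing rules above can be checked before a task definition is registered. What follows is a minimal sketch, assuming the documented combinations for Linux tasks; the table layout and the function name `is_valid_fargate_size` are illustrative and not part of any AWS SDK.

```python
# Memory options (GB) allowed for each vCPU value on Fargate (Linux tasks).
# The dictionary layout and function name are illustrative assumptions.
VALID_MEMORY_GB = {
    0.25: [0.5, 1, 2],
    0.5: list(range(1, 5)),        # 1-4 GB in 1 GB steps
    1: list(range(2, 9)),          # 2-8 GB in 1 GB steps
    2: list(range(4, 17)),         # 4-16 GB in 1 GB steps
    4: list(range(8, 31)),         # 8-30 GB in 1 GB steps
    8: list(range(16, 61, 4)),     # 16-60 GB in 4 GB steps
    16: list(range(32, 121, 8)),   # 32-120 GB in 8 GB steps
}


def is_valid_fargate_size(vcpu: float, memory_gb: float) -> bool:
    """Return True if the vCPU/memory pair is an allowed Fargate task size."""
    return memory_gb in VALID_MEMORY_GB.get(vcpu, [])


print(is_valid_fargate_size(1, 8))    # True
print(is_valid_fargate_size(1, 16))   # False
```

A check like this fits naturally in a CI step that lints task definitions before deployment.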

Q: Can Fargate tasks communicate with each other?

Yes, Fargate tasks communicate through standard networking since each task has its own ENI in the VPC. Tasks can reach each other using private IP addresses, service discovery (Cloud Map), or an internal ALB. For ECS, AWS Service Connect provides service mesh capabilities with automatic service discovery and traffic management.

Q: How does Fargate handle persistent storage?

Fargate supports two storage options: Amazon EFS (Elastic File System) for persistent shared storage across tasks, and ephemeral storage up to 200 GB per task. EFS volumes mount inside containers like a regular filesystem and persist across task restarts. For stateful workloads, EFS is the recommended approach — it provides shared, durable storage without managing EBS volumes.

Q: Should I use Fargate Spot for production workloads?

Fargate Spot can be used in production for fault-tolerant workloads. AWS provides a two-minute interruption warning before reclaiming Spot capacity. Your application must handle SIGTERM gracefully and drain connections. Use a mixed capacity provider strategy — run baseline capacity on regular Fargate and burst capacity on Spot. This provides cost savings while maintaining availability for critical workloads.

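Fargate tasks stuck PENDING with zero errors? Each task consumes one ENI, and a /24 subnet has only 251 usable IPs, so a service of roughly 120 tasks exhausts it the moment a rolling deployment doubles the task count.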

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Last updated: July 04, 2026

Before you start⏱ 25 min

✓Solid grasp of DevOps fundamentals
✓Comfortable with command-line tools
✓Basic Linux administration knowledge


⚡Quick Answer

Fargate is AWS's serverless compute engine for running containers without managing servers
You define CPU and memory per task — Fargate provisions and scales infrastructure automatically
Works with both Amazon ECS and Amazon EKS for orchestration
You pay per second for vCPU and memory resources allocated to running tasks
Production use requires careful networking, IAM, and logging configuration
Biggest mistake: over-provisioning CPU and memory per task, inflating costs by 3-5x

✦ Definition~90s read

What is AWS Fargate?

AWS Fargate is a serverless compute engine for containers that works with both Amazon ECS and Amazon EKS. It removes the need to manage EC2 instances — you define container images, CPU, memory, and networking requirements, and Fargate provisions the underlying infrastructure to run your containers.


Fargate assigns each task its own kernel runtime environment and elastic network interface (ENI). This provides task-level isolation comparable to running containers on dedicated EC2 instances, without the operational overhead of managing the instance fleet.

The core abstraction is the task — a set of one or more containers that share a network namespace and storage volumes. You define tasks in a task definition, which specifies the container image, resource requirements, IAM roles, logging configuration, and networking mode. ECS or EKS schedules these tasks onto Fargate-managed infrastructure.

Plain-English First

Fargate is like renting individual apartments instead of buying an entire building. With EC2, you own the building and manage everything — plumbing, electricity, maintenance. With Fargate, you rent just the space you need, and AWS handles the building. You bring your containers, specify how much CPU and memory they need, and Fargate runs them without you ever seeing a server.


AWS Fargate is a serverless compute engine for containers that eliminates the need to provision, configure, or scale virtual machine clusters. You package your application as a container image, define resource requirements, and Fargate runs it on infrastructure managed entirely by AWS.

Fargate shifts operational burden from managing EC2 instance fleets to defining task-level resource requirements. This simplifies capacity planning but introduces new challenges around networking configuration, IAM task roles, cold start latency, and cost optimization at scale. Production deployments require understanding these trade-offs before committing to Fargate over EC2 launch type.

What Is AWS Fargate?

io.thecodeforge.fargate.task_definition.json (JSON)

{
  "family": "io-thecodeforge-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "NODE_ENV", "value": "production"},
        {"name": "LOG_LEVEL", "value": "info"}
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-url:DATABASE_URL::"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/io-thecodeforge-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Fargate as Serverless Container Hosting

Each task gets its own ENI, kernel, and resource isolation — no shared host contention
You define CPU and memory per task, not per cluster — capacity planning is task-level
Fargate works with ECS and EKS — same serverless model for both orchestrators
You pay per second for vCPU and memory allocated to running tasks only
No SSH access to underlying infrastructure — all debugging happens through logs and APIs

Production Insight

Fargate tasks are immutable and you cannot SSH into them for debugging.

Troubleshooting happens through CloudWatch Logs, the ECS APIs, and ECS Exec (aws ecs execute-command) when it is enabled.

Rule: invest in structured logging and health checks before deploying to Fargate.

Key Takeaway

Fargate runs containers without managing servers — you define tasks, AWS handles infrastructure.

Each task gets its own network interface and resource isolation.

Choose Fargate for operational simplicity, EC2 for cost optimization at steady scale.

Fargate vs EC2 Launch Type

If the workload has predictable, steady-state traffic → consider EC2 with Savings Plans: lower cost at consistent utilization
If the workload is bursty or has variable scaling patterns → use Fargate: pay only for running tasks, no idle instance cost
If tasks require GPUs, more than 120 GB of memory, or specific instance types → use EC2: Fargate caps tasks at 16 vCPU and 120 GB and has no GPU support
If the team wants minimal operational overhead → use Fargate: no instance patching, AMI management, or capacity planning

thecodeforge.io

Aws Fargate

Fargate Networking and Security

Fargate tasks run in awsvpc mode — each task receives its own elastic network interface (ENI) with a private IP address in your VPC subnet. This provides VPC-level security controls through security groups and network ACLs, but requires careful subnet planning to avoid IP exhaustion.

Networking decisions have cost and performance implications. Tasks in private subnets require a NAT gateway for outbound internet access, which adds data processing charges. VPC endpoints for AWS services (S3, ECR, CloudWatch, Secrets Manager) eliminate NAT gateway costs for service-to-service communication.

Security follows the principle of least privilege through two IAM roles per task: the execution role (pulling images, writing logs, fetching secrets) and the task role (application-level AWS API access). Separating these roles ensures the task can only access resources it actually needs.

io.thecodeforge.fargate.networking.yml (YAML)

# CloudFormation snippet for Fargate networking infrastructure
# Shows VPC endpoints, subnets, and security groups

Resources:
  # Private subnets for Fargate tasks
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.0.0/20  # 4091 usable IPs for Fargate tasks
      AvailabilityZone: us-east-1a
      Tags:
        - Key: Name
          Value: fargate-private-a

  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.16.0/20
      AvailabilityZone: us-east-1b
      Tags:
        - Key: Name
          Value: fargate-private-b

  # NAT Gateway for outbound internet access
  NatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnetA

  # VPC Endpoint for ECR (avoids NAT gateway charges)
  EcrApiEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.ecr.api
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      SubnetIds:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
      SecurityGroupIds:
        - !Ref VpcEndpointSg

  S3Endpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Security group for Fargate tasks
  FargateTaskSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Fargate tasks
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          SourceSecurityGroupId: !Ref AlbSg
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS outbound for ECR, Secrets Manager
        - IpProtocol: tcp
          FromPort: 5432
          ToPort: 5432
          DestinationSecurityGroupId: !Ref DatabaseSg

  # Security group for VPC endpoints
  VpcEndpointSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTPS from Fargate tasks to VPC endpoints
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref FargateTaskSg

Fargate Networking Pitfalls

Each Fargate task consumes one ENI with a private IP — plan subnet CIDR blocks for peak task count
Tasks in private subnets without NAT gateway or VPC endpoints cannot pull images from ECR
Security groups on Fargate tasks must allow outbound HTTPS (443) for ECR, Secrets Manager, and CloudWatch
Cross-AZ traffic between Fargate tasks and RDS incurs data transfer charges
Public IP on Fargate tasks exposes them directly to the internet — always use private subnets with ALB
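Subnet sizing for peak task count is simple arithmetic once you account for the five addresses AWS reserves in every subnet. The sketch below uses only the Python standard library; the function names and the `other_enis` parameter (ENIs used by load balancers, NAT gateways, or VPC endpoints in the same subnet) are illustrative assumptions, and it models a single service in a single subnet.

```python
import ipaddress

AWS_RESERVED_IPS = 5  # first four addresses plus the last one in every subnet


def usable_ips(cidr: str) -> int:
    """Number of addresses available for ENIs in a subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_IPS


def max_steady_state_tasks(cidr: str, maximum_percent: int = 200, other_enis: int = 0) -> int:
    """Largest desired count that still fits while a rolling deployment surges."""
    available = usable_ips(cidr) - other_enis
    return max(available * 100 // maximum_percent, 0)


print(usable_ips("10.0.0.0/24"))               # 251
print(usable_ips("10.0.0.0/20"))               # 4091
print(max_steady_state_tasks("10.0.0.0/24"))   # 125
```

With MaximumPercent at 200, old and new tasks coexist during a deployment, so the steady-state ceiling is roughly half the usable address count.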

Production Insight

VPC endpoints eliminate NAT gateway data processing charges for AWS service access.

A single ECR image pull through NAT costs ~$0.045/GB — at scale this adds up fast.

Rule: create VPC endpoints for ECR, S3, CloudWatch, and Secrets Manager immediately.

Key Takeaway

Fargate tasks run in awsvpc mode with dedicated ENIs in your VPC subnets.

VPC endpoints for AWS services eliminate NAT gateway costs and improve reliability.

Separate execution role (infrastructure) from task role (application) for least privilege.

Fargate Pricing and Cost Optimization

Fargate pricing is based on vCPU and memory resources allocated to running tasks, billed per second with a one-minute minimum. This model eliminates idle capacity costs but requires right-sizing tasks to avoid over-provisioning.
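The per-second billing rule with its one-minute minimum can be sketched as a small function. This is an illustrative calculation, not an official pricing API: the name `task_run_cost` is hypothetical, and the rates are the us-east-1 figures quoted in this article, which may not match current pricing.

```python
VCPU_PER_HOUR = 0.04048        # us-east-1 rate quoted in this article (assumption)
MEMORY_GB_PER_HOUR = 0.004445  # us-east-1 rate quoted in this article (assumption)
MINIMUM_BILLED_SECONDS = 60    # one-minute minimum charge


def task_run_cost(vcpu: float, memory_gb: float, runtime_seconds: int) -> float:
    """Cost in USD of one task run, billed per second with a 60-second floor."""
    billed_seconds = max(runtime_seconds, MINIMUM_BILLED_SECONDS)
    hourly_rate = vcpu * VCPU_PER_HOUR + memory_gb * MEMORY_GB_PER_HOUR
    return hourly_rate * billed_seconds / 3600


# A 20-second run is billed as 60 seconds.
print(round(task_run_cost(0.5, 1, 20), 6))   # 0.000411
```

The floor matters mostly for short-lived batch tasks: many sub-minute runs each pay for a full minute.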

Cost optimization in Fargate centers on three levers: right-sizing CPU and memory allocations, using Fargate Spot for fault-tolerant workloads, and consolidating containers into fewer, larger tasks. Most production Fargate deployments overspend by 30-50% due to inflated resource requests that do not match actual utilization.

Fargate Spot provides up to 70% cost reduction for interrupt-tolerant workloads like batch processing, CI/CD pipelines, and stateless workers. Spot tasks can be interrupted with two minutes notice, requiring graceful shutdown handling in your application.

io.thecodeforge.fargate.cost_analysis.py (Python)

from dataclasses import dataclass
from typing import List, Dict, Tuple
from enum import Enum


class PricingRegion(Enum):
    US_EAST_1 = "us-east-1"
    EU_WEST_1 = "eu-west-1"
    AP_SOUTHEAST_1 = "ap-southeast-1"


@dataclass
class FargatePricing:
    """Fargate pricing per hour for a region."""
    vcpu_per_hour: float
    memory_per_gb_hour: float
    spot_discount: float = 0.70


@dataclass
class TaskConfiguration:
    """A Fargate task's resource allocation."""
    name: str
    vcpu: float
    memory_gb: float
    task_count: int
    hours_per_day: float = 24.0
    is_spot_eligible: bool = False


class FargateCostCalculator:
    """
    Calculates and optimizes Fargate costs.
    """

    PRICING = {
        PricingRegion.US_EAST_1: FargatePricing(
            vcpu_per_hour=0.04048,
            memory_per_gb_hour=0.004445
        ),
        PricingRegion.EU_WEST_1: FargatePricing(
            vcpu_per_hour=0.04655,
            memory_per_gb_hour=0.005112
        ),
    }

    def __init__(self, region: PricingRegion):
        self.pricing = self.PRICING[region]

    def task_hourly_cost(self, task: TaskConfiguration) -> float:
        """
        Calculate the hourly cost of a single Fargate task.
        """
        base_cost = (
            task.vcpu * self.pricing.vcpu_per_hour
            + task.memory_gb * self.pricing.memory_per_gb_hour
        )
        if task.is_spot_eligible:
            return base_cost * (1 - self.pricing.spot_discount)
        return base_cost

    def monthly_cost(self, task: TaskConfiguration) -> float:
        """
        Calculate the monthly cost for all instances of a task.
        """
        hourly = self.task_hourly_cost(task)
        return hourly * task.hours_per_day * 30 * task.task_count

    def total_monthly_cost(self, tasks: List[TaskConfiguration]) -> Dict:
        """
        Calculate total monthly cost breakdown.
        """
        breakdown = {}
        total = 0.0

        for task in tasks:
            cost = self.monthly_cost(task)
            breakdown[task.name] = {
                "task_cost_hourly": round(self.task_hourly_cost(task), 4),
                "monthly_cost": round(cost, 2),
                "task_count": task.task_count,
                "vcpu_per_task": task.vcpu,
                "memory_gb_per_task": task.memory_gb,
                "spot_eligible": task.is_spot_eligible,
            }
            total += cost

        return {
            "tasks": breakdown,
            "total_monthly_cost": round(total, 2),
            "vcpu_price_per_hour": self.pricing.vcpu_per_hour,
        }

    def right_size_recommendation(
        self,
        task: TaskConfiguration,
        actual_cpu_utilization_pct: float,
        actual_memory_utilization_pct: float
    ) -> Dict:
        """
        Recommend right-sized task configuration based on actual utilization.
        """
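        # Size for observed utilization plus 30% headroom
        target_vcpu = task.vcpu * (actual_cpu_utilization_pct / 100) * 1.3
        target_memory = task.memory_gb * (actual_memory_utilization_pct / 100) * 1.3

        # Snap to the smallest valid Fargate size; memory options depend on the vCPU value
        valid_sizes = {
            0.25: [0.5, 1, 2],
            0.5: list(range(1, 5)),
            1: list(range(2, 9)),
            2: list(range(4, 17)),
            4: list(range(8, 31)),
            8: list(range(16, 61, 4)),
            16: list(range(32, 121, 8)),
        }
        recommended_vcpu, recommended_memory = next(
            (
                (vcpu, memory)
                for vcpu, memory_options in valid_sizes.items()
                for memory in memory_options
                if vcpu >= target_vcpu and memory >= target_memory
            ),
            (16, 120),  # fall back to the largest Fargate size
        )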

        current_cost = self.monthly_cost(task)
        right_sized_task = TaskConfiguration(
            name=task.name,
            vcpu=recommended_vcpu,
            memory_gb=recommended_memory,
            task_count=task.task_count,
            hours_per_day=task.hours_per_day,
            is_spot_eligible=task.is_spot_eligible,
        )
        new_cost = self.monthly_cost(right_sized_task)

        return {
            "current": {
                "vcpu": task.vcpu,
                "memory_gb": task.memory_gb,
                "monthly_cost": round(current_cost, 2),
            },
            "recommended": {
                "vcpu": recommended_vcpu,
                "memory_gb": recommended_memory,
                "monthly_cost": round(new_cost, 2),
            },
            "savings_monthly": round(current_cost - new_cost, 2),
            "savings_pct": round((1 - new_cost / current_cost) * 100, 1),
        }


# Example usage
calculator = FargateCostCalculator(PricingRegion.US_EAST_1)

tasks = [
    TaskConfiguration("api-server", 1, 2, 10, 24, False),
    TaskConfiguration("worker", 0.5, 1, 20, 24, True),
    TaskConfiguration("scheduler", 0.25, 0.5, 2, 24, False),
]

result = calculator.total_monthly_cost(tasks)
print(f"Total monthly cost: ${result['total_monthly_cost']}")

# Right-sizing recommendation
recommendation = calculator.right_size_recommendation(
    tasks[0],
    actual_cpu_utilization_pct=25,
    actual_memory_utilization_pct=40
)
print(f"Right-size savings: ${recommendation['savings_monthly']}/month ({recommendation['savings_pct']}%)")

Fargate Cost Optimization Strategies

Right-size tasks using CloudWatch Container Insights — most deployments over-provision by 2-3x
Use Fargate Spot for batch jobs, CI/CD workers, and stateless background tasks (70% savings)
Consolidate sidecar containers into the main task to reduce per-task overhead
Schedule non-production tasks to stop outside business hours using EventBridge + Lambda
Use Compute Savings Plans for predictable Fargate workloads (AWS advertises savings of up to roughly 50%)

Production Insight

Fargate Spot interruptions come with a two-minute warning via ECS.

Applications must handle SIGTERM gracefully to avoid data loss.

Rule: implement graceful shutdown hooks before using Fargate Spot in production.
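A graceful shutdown hook for a Spot-eligible worker can be quite small. The sketch below is one possible shape, using only the standard library: the `run_worker` loop and its `fetch_job` and `process_job` callables are hypothetical stand-ins for your own queue client, not an ECS API.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Flag the shutdown; the main loop finishes in-flight work and exits."""
    shutdown_requested.set()


# ECS sends SIGTERM to the container when the task is being stopped.
signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(fetch_job, process_job, poll_seconds: float = 1.0) -> int:
    """Process jobs until SIGTERM arrives; returns the number of jobs completed."""
    completed = 0
    while not shutdown_requested.is_set():
        job = fetch_job()
        if job is None:
            # Nothing to do: sleep, but wake immediately if shutdown is requested.
            shutdown_requested.wait(poll_seconds)
            continue
        process_job(job)  # the current job always finishes before the flag is re-checked
        completed += 1
    return completed
```

The handler only sets a flag, so a job that is already running completes before the loop exits. Keep individual jobs shorter than the container's stop timeout, because ECS follows SIGTERM with SIGKILL once that timeout expires.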

Key Takeaway

Fargate charges per second for allocated vCPU and memory — not actual usage.

Right-sizing tasks based on real utilization saves 30-50% of compute costs.

Fargate Spot provides 70% savings for interrupt-tolerant workloads.


Deploying ECS Services on Fargate

Production ECS services on Fargate require a deployment configuration that handles rolling updates, health checks, auto-scaling, and service discovery. The ECS service abstraction manages task placement, desired count, and deployment strategy across Fargate-managed infrastructure.

Rolling updates with the circuit breaker pattern prevent failed deployments from replacing healthy tasks. The circuit breaker monitors task health and automatically rolls back if new tasks fail to start. Combined with health check grace periods, this prevents deployment cascading failures.

Auto-scaling on Fargate adjusts the desired task count based on CloudWatch metrics — CPU utilization, memory utilization, request count, or custom metrics via Application Auto Scaling. Scaling policies should use target tracking for steady-state adjustments and step scaling for rapid traffic spikes.

io.thecodeforge.fargate.ecs_service.yml (YAML)

# CloudFormation for production ECS Fargate service
# Includes deployment circuit breaker, auto-scaling, and service discovery

Resources:
  EcsCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: io-thecodeforge-production
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      ServiceConnectDefaults:
        Namespace: io-thecodeforge.local

  EcsService:
    Type: AWS::ECS::Service
    DependsOn:
      - AlbListener
    Properties:
      ServiceName: io-thecodeforge-api
      Cluster: !Ref EcsCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      PlatformVersion: LATEST
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref FargateTaskSg
          Subnets:
            - !Ref PrivateSubnetA
            - !Ref PrivateSubnetB
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref ApiTargetGroup
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
      HealthCheckGracePeriodSeconds: 120

  # Auto-scaling target
  ScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 20
      MinCapacity: 3
      ResourceId: !Sub service/${EcsCluster}/${EcsService.Name}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt AutoScalingRole.Arn

  # Target tracking scaling policy
  CpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 65
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 300
        ScaleOutCooldown: 60

  # Service discovery
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: api
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespace
        DnsRecords:
          - TTL: 10
            Type: A
      HealthCheckCustomConfig:
        FailureThreshold: 1

  # CloudWatch alarm for failed deployments
  FailedDeploymentAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: fargate-api-failed-tasks
      AlarmDescription: Alert when Fargate tasks fail to start
      Namespace: ECS/ContainerInsights  # RunningTaskCount is a Container Insights metric
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsCluster
        - Name: ServiceName
          Value: !GetAtt EcsService.Name
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: LessThanThreshold
      AlarmActions:
        - !Ref OpsSnsTopic

Fargate Deployment Safety Net

Circuit breaker monitors new task health during rolling deployments
If tasks fail to start, ECS automatically rolls back to the previous task definition
Health check grace period gives containers time to initialize before health checks begin
MinimumHealthyPercent: 100 ensures zero-downtime deployments — old tasks stay until new ones are healthy
Auto-scaling adjusts task count based on CPU, memory, or custom CloudWatch metrics

Production Insight

Deployment circuit breaker is the most important Fargate production feature.

Without it, a bad image push replaces all healthy tasks with failing ones.

Rule: enable circuit breaker with rollback on every ECS Fargate service.

Key Takeaway

ECS services on Fargate need circuit breaker, health checks, and auto-scaling.

Circuit breaker with rollback prevents bad deployments from taking down production.

Auto-scaling on CPU utilization at 65% target provides headroom for traffic spikes.
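To build intuition for how a 65% CPU target translates into task counts, here is a simplified proportional model. It is an approximation for capacity planning only: Application Auto Scaling does not publish its exact algorithm, real scaling is driven by CloudWatch alarms and cooldowns, and the function name `estimate_desired_count` is hypothetical. The default bounds mirror the MinCapacity and MaxCapacity values used in this article.

```python
import math


def estimate_desired_count(
    current_tasks: int,
    observed_cpu_pct: float,
    target_cpu_pct: float = 65.0,
    min_tasks: int = 3,
    max_tasks: int = 20,
) -> int:
    """Proportional estimate of the task count that brings average CPU back to target."""
    if current_tasks <= 0:
        return min_tasks
    raw = math.ceil(current_tasks * observed_cpu_pct / target_cpu_pct)
    return max(min_tasks, min(max_tasks, raw))


print(estimate_desired_count(3, 90))     # 5
print(estimate_desired_count(10, 30))    # 5
print(estimate_desired_count(10, 200))   # 20 (clamped to max_tasks)
```

The clamp is the part worth noticing: once the estimate exceeds the maximum, CPU stays above target and latency rises, so the maximum should reflect real peak demand, not a guess.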

Fargate Logging and Observability

Production Fargate workloads require structured logging, distributed tracing, and container-level metrics. Since Fargate provides no SSH access, all observability must be configured through the task definition and external services before deployment.

CloudWatch Logs is the default log driver, but production systems benefit from FireLens, a Fluent Bit-based log router that supports structured JSON output, multi-destination routing, and log filtering. FireLens can send the same logs to several destinations at once, such as CloudWatch, Datadog, Splunk, and Elasticsearch.

Container Insights provides CPU, memory, disk, and network metrics per task and container. Combined with X-Ray for distributed tracing, this creates a complete observability stack for Fargate microservices.

io.thecodeforge.fargate.observability.json (JSON)

{
  "family": "io-thecodeforge-api-observed",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "/ecs/io-thecodeforge-api",
          "log_stream_prefix": "ecs",
          "auto_create_group": "true",
          "log_key": "log"
        }
      },
      "dependsOn": [
        {
          "containerName": "log-router",
          "condition": "START"
        }
      ]
    },
    {
      "name": "xray",
      "image": "amazon/aws-xray-daemon:latest",
      "essential": false,
      "cpu": 32,
      "memoryReservation": 256,
      "portMappings": [
        {
          "containerPort": 2000,
          "protocol": "udp"
        }
      ]
    },
    {
      "name": "log-router",
      "image": "amazon/aws-for-fluent-bit:latest",
      "essential": true,
      "cpu": 32,
      "memoryReservation": 64,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true",
          "config-file-type": "file",
          "config-file-value": "/fluent-bit/configs/parse-json.conf"
        }
      }
    }
  ]
}

Fargate Observability Stack

Use FireLens (Fluent Bit) as log router — supports structured JSON and multi-destination output
Enable Container Insights on the ECS cluster for per-task CPU, memory, and network metrics
Add X-Ray sidecar for distributed tracing across microservices
Emit structured JSON logs with correlation IDs — never plain text log lines
Set log retention policy on CloudWatch log groups — default is infinite, which is expensive

Production Insight

Fargate has no SSH, so logs are the primary window into running containers.

Structured JSON logs with correlation IDs are mandatory for debugging microservices.

Rule: configure logging and tracing in the task definition before first deployment.
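Structured JSON logging with a correlation ID needs nothing beyond the standard library. The sketch below is one way to do it; the field names (`correlation_id`, `level`, and so on) and the `build_logger` helper are illustrative choices, not a required schema.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set per request via logger.info(..., extra={"correlation_id": ...})
            "correlation_id": getattr(record, "correlation_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "api") -> logging.Logger:
    """Logger that writes JSON lines to stdout, where the log driver picks them up."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order accepted", extra={"correlation_id": str(uuid.uuid4())})
```

Writing one JSON object per line to stdout keeps the application unaware of the log destination, so the same image works with either awslogs or FireLens.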

Key Takeaway

Fargate requires pre-configured observability — no SSH means no after-the-fact debugging.

FireLens routes structured logs to multiple destinations with filtering.

Container Insights and X-Ray provide the metrics and tracing stack for production monitoring.

Fargate Task Lifecycle: The State Machine You Can't Ignore

When a Fargate task dies (and it will), most engineers sit there staring at CloudWatch logs asking why. The real problem isn't debugging the crash. It's understanding the lifecycle states a task passes through before it ever reaches RUNNING, and the ones it passes through on the way back down. Ignore this, and you'll waste hours chasing ghosts.

A Fargate task moves from PROVISIONING to PENDING to ACTIVATING to RUNNING, then through DEACTIVATING, STOPPING, and DEPROVISIONING to STOPPED when it fails or completes. PROVISIONING and PENDING are where the magic, and the pain, lives. That's when Fargate allocates the underlying compute, attaches the ENI, and pulls the container image. If the image pull takes more than two minutes, your task gets recycled.

Here's the production gotcha: you don't get real-time feedback during PROVISIONING. CloudWatch logs aren't streaming yet because the container isn't running. The only signal is the event stream from ECS. Wire that into your alerting pipeline. If a task sits in PENDING for more than 30 seconds, something is starving — usually vCPU quota, subnet capacity, or ECR rate limits.

The state machine is unforgiving. A STOPPED transition with stop code "ServiceSchedulerInitiated" and a reason such as "Scaling activity initiated by deployment" usually means the service scheduler stopped the task, for example because a deployment or scale-in replaced it. That's fine. But "EssentialContainerExited" means your entrypoint blew up. Treat states as signals, not noise.

FargateTaskStateMachine.yml (YAML)

# io.thecodeforge - devops tutorial

# CloudFormation sketch: EventBridge rule that forwards ECS task state transitions to SNS
Resources:
  TaskStateChangeRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source:
          - aws.ecs
        detail-type:
          - "ECS Task State Change"
        detail:
          lastStatus:
            - "RUNNING"
            - "STOPPED"
          clusterArn:
            - "arn:aws:ecs:us-east-1:123456789012:cluster/production-backend"
      Targets:
        - Arn: "arn:aws:sns:us-east-1:123456789012:alerting-topic"
          Id: alerting-topic
          InputTransformer:
            InputPathsMap:
              taskArn: "$.detail.taskArn"
              status: "$.detail.lastStatus"
              reason: "$.detail.stopCode"
            InputTemplate: '"Task <taskArn> transitioned to <status>. Reason: <reason>"'

Output

"Task arn:aws:ecs:us-east-1:123456789012:task/production-backend/abc12345 transitioned to STOPPED. Reason: EssentialContainerExited"

Production Trap:

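Know who restarts what. A standalone task started with RunTask is not restarted when its essential container exits. An ECS service, by contrast, keeps launching replacement tasks to maintain the desired count and throttles launches when they fail repeatedly. Each replacement is a new task with a new task ID and log stream, so the failing task's evidence is easy to lose. If you want to crash once and stay dead for debugging, run the task definition as a standalone task (or scale the service to zero first) and inspect its stoppedReason and container exit code. Otherwise, you'll see ephemeral flashes in CloudWatch and never catch the root cause.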

Key Takeaway

Every Fargate task moves through a fixed sequence of lifecycle states, from PROVISIONING to STOPPED. Wire the ECS event stream to your logging system on day one, not after the third incident.
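Once the event stream is wired up, something has to decide which transitions deserve an alert. The sketch below filters "ECS Task State Change" events using the `lastStatus`, `stopCode`, `stoppedReason`, and `taskArn` fields those events carry. Which stop codes count as routine is an assumption to tune per service, and `classify_task_event` is a hypothetical helper, for instance the body of a Lambda handler.

```python
from typing import Optional

# Stop codes treated as routine here are an assumption; adjust for your services.
ROUTINE_STOP_CODES = {"ServiceSchedulerInitiated", "UserInitiated"}


def classify_task_event(event: dict) -> Optional[str]:
    """Return an alert message for a task state change event, or None if routine."""
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return None
    stop_code = detail.get("stopCode", "Unknown")
    if stop_code in ROUTINE_STOP_CODES:
        return None
    return (
        f"Task {detail.get('taskArn', 'unknown')} stopped "
        f"({stop_code}): {detail.get('stoppedReason', 'no reason reported')}"
    )


crash = {
    "detail-type": "ECS Task State Change",
    "detail": {
        "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/production-backend/abc12345",
        "lastStatus": "STOPPED",
        "stopCode": "EssentialContainerExited",
        "stoppedReason": "Essential container in task exited",
    },
}
print(classify_task_event(crash))
```

Filtering in code rather than in the event pattern keeps the rule broad, so you can change what counts as noise without redeploying infrastructure.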

Ephemeral Storage Limits: Why Your 20GB Ephemeral Task Will Silently Fail

AWS Fargate gives you 20 GB of ephemeral storage by default per task. That sounds generous until your ETL job downloads a 10 GB model file, unpacks it, and tries to write a 15 GB result. The task runs for an hour, then exits with no obvious error — just a disk-full message buried in the kernel logs that you'll never see unless you dig into /var/log/messages inside the container.

The fix isn't more storage — it's knowing your data. Fargate supports up to 200 GB per task if you explicitly set ephemeralStorage.sizeInGiB in your task definition. But every extra gig costs you. Budget it like you budget memory and CPU.

Here's the worst part: ephemeral storage is shared across all containers in the same task. If you run a sidecar proxy and a main container, the proxy's log files eat into the same pool as your processing workspace. Set log rotation on every container. Pin your working directory to a mount point you monitor.

And under no circumstance should you rely on ephemeral storage for persistent data. Fargate tasks get recycled on deploy, scale-in, or failure. That 20 GB is gone with the task. If you need durable state, use EFS or S3. If you need scratch space for transient workloads, bump the storage to 60 GB and put a lifecycle check in your entrypoint that bails if disk usage exceeds 80%.
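The entrypoint lifecycle check described above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the /data path is only an example mount point, and the 80% limit is the figure from the paragraph above.

```python
import shutil
import sys


def disk_usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def scratch_space_ok(path: str, limit: float = 0.80) -> bool:
    """True while usage at `path` is at or below `limit`."""
    return disk_usage_fraction(path) <= limit


def guard_or_exit(path: str, limit: float = 0.80) -> None:
    """Exit non-zero with a clear message when the scratch path is too full."""
    if not scratch_space_ok(path, limit):
        percent = disk_usage_fraction(path) * 100
        print(f"ERROR: {path} is {percent:.0f}% full (limit {limit:.0%})", file=sys.stderr)
        sys.exit(1)
```

Calling guard_or_exit("/data") at the top of the entrypoint, and again between large pipeline stages, turns a silent disk-full exit into an explicit log line.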

FargateEphemeralStorage.ymlYAML

# io.thecodeforge - devops tutorial

# Task definition with explicit ephemeral storage and non-blocking log buffering
aws_ecs_task_definition:
  family: "etl-worker-production"
  requiresCompatibilities:
    - FARGATE
  networkMode: "awsvpc"
  cpu: "2048"
  memory: "8192"
  ephemeralStorage:
    sizeInGiB: 60
  containerDefinitions:
    - name: "etl-processor"
      image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-pipeline:latest"
      mountPoints:
        - sourceVolume: "scratch"
          containerPath: "/data"
      logConfiguration:
        logDriver: "awslogs"
        options:
          awslogs-group: "/ecs/etl-worker"
          awslogs-region: "us-east-1"
          awslogs-stream-prefix: "etl"
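          # max-buffer-size only takes effect in non-blocking mode
          mode: "non-blocking"
          max-buffer-size: "25m"
  volumes:
    # "scratch" is backed by EFS, so data under /data does not count against ephemeral storage
    - name: "scratch"
      efsVolumeConfiguration:
        fileSystemId: "fs-12345678"
        transitEncryption: "ENABLED"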

Output

[
  {
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/etl-worker-production:42",
    "status": "ACTIVE",
    "ephemeralStorage": {
      "sizeInGiB": 60
    }
  }
]

Senior Shortcut:

Don't guess storage needs. Add a 10-second health check in your container that writes a marker file, then checks disk usage. If it exceeds 90%, log a WARN and stay alive. You'll catch silent failures before they become production outages. Also: never use docker cp on Fargate — it doesn't work. Always debug with aws ecs execute-command and SSM Session Manager.

Key Takeaway

Ephemeral storage is shared, limited to 20 GB by default, and cleared on task stop. Always set explicit size, pin logs to EFS or CloudWatch, and monitor disk usage inside the container.

Why Fargate Sprawl Costs You Thousands: Right-Sizing From Day One

Most Fargate bills explode because teams deploy without a sizing strategy. Fargate charges per vCPU and per GB of memory provisioned, not used. A single idle 4-vCPU, 16-GB task costs roughly $170 a month at us-east-1 on-demand Linux/x86 rates, and that multiplies across every replica in every environment. The root cause? Developers default to oversized tasks as a safety buffer. Stop guessing. Use AWS Compute Optimizer recommendations or run your tasks with CloudWatch Container Insights to observe peak CPU and memory. Right-size each task definition: batch jobs tolerate lower memory, while web servers need memory headroom for request bursts. For spiky workloads, pair Fargate with Application Auto Scaling using target tracking on CPU or request count. For over-provisioned services this can cut compute spend by 30-50% without code changes. Remember: Fargate's elasticity is useless if every task reserves capacity it never uses. Standardize a sizing review in your CI/CD pipeline and reject any task definition without a justified resource spec.
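To make the provisioned-not-used billing concrete, here is a small cost sketch. The two rates are assumptions (on-demand Linux/x86 prices for us-east-1 as commonly published; check the current price list for your region), and the function name monthly_task_cost is hypothetical.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

# Assumed on-demand rates (USD); verify against current Fargate pricing.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445


def monthly_task_cost(vcpu: float, memory_gb: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of one always-on task, billed on provisioned vCPU and memory."""
    hourly = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return round(hourly * hours, 2)


# Compare an oversized task with a right-sized one.
oversized = monthly_task_cost(4, 16)
right_sized = monthly_task_cost(1, 2)
print(f"4 vCPU / 16 GB: ${oversized}/month")
print(f"1 vCPU / 2 GB:  ${right_sized}/month")
```

Under these assumed rates the larger task costs several times the smaller one whether or not it does any work, which is the whole argument for sizing from measured utilization.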

fargate-rightsizing-policy.ymlYAML

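# io.thecodeforge - devops tutorial

# Target-tracking Auto Scaling for a Fargate service
AWSTemplateFormatVersion: '2010-09-09'
Parameters:
  ECSCluster:
    Type: String  # name of the ECS cluster
  ServiceName:
    Type: String  # name of the ECS service
Resources: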
  FargateScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 10
      MinCapacity: 1
      ResourceId: !Sub service/${ECSCluster}/${ServiceName}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
  FargateScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 50.0
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
      ScalingTargetId: !Ref FargateScalableTarget

Output

Auto Scaling policy created. Fargate service will maintain 50% CPU target.

Production Trap:

Setting MinCapacity too high wastes money. Use an Application Auto Scaling scheduled action to drop MinCapacity and MaxCapacity to 0 at night for dev environments.

Key Takeaway

Always right-size tasks; idle vCPUs burn money. Auto scale on CPU or request count instead of running at peak capacity all day.

Clean Up Fargate Leftovers Before AWS Bills You for Ghost Resources

Fargate tasks that complete exit instantly, but some of their associated resources can linger. Fargate normally releases a task's elastic network interface (ENI) when the task stops, yet ENIs are occasionally left behind in the available state, and CloudWatch log groups and Application Load Balancer (ALB) target groups stay around until someone deletes them. An orphaned ENI is not billed on its own, but it holds a subnet IP that new tasks may need, and any public IPv4 address still associated with it is billed at $0.005 per hour (about $3.60 per month). Target groups are free, but the load balancer in front of them bills per hour even with zero healthy targets, and log groups without a retention policy accrue storage charges indefinitely. The why: stopping a task removes the task, not the infrastructure you built around it. Fix it by tagging every Fargate task definition and service with an expiry date or environment tag. Then run a nightly Lambda function that finds ENIs left in the available state and deletes them via the EC2 API. For ephemeral jobs, set a 30-day retention policy on the CloudWatch log groups; retention is a log group setting, not something the task execution role controls. Finally, add a post-deploy step to your CI/CD pipeline that deletes ALB target groups no longer attached to any listener. Cleaning isn't optional; it's a cost discipline.

cleanup-orphaned-enis.ymlYAML

# io.thecodeforge - devops tutorial

# Lambda to delete orphaned ENIs from stopped Fargate tasks
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  CleanupFunction:
    Type: AWS::Lambda::Function
    Properties:
      Handler: index.handler
      Role: !GetAtt CleanupRole.Arn
      Runtime: python3.11
      Code:
        ZipFile: |
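          import boto3
          from botocore.exceptions import ClientError
          ec2 = boto3.client('ec2')
          def handler(event, context):
              # Filter server-side: only detached ('available') ENIs. Verify the
              # description pattern against your own ECS ENIs before relying on it.
              paginator = ec2.get_paginator('describe_network_interfaces')
              pages = paginator.paginate(Filters=[
                  {'Name': 'status', 'Values': ['available']},
                  {'Name': 'description', 'Values': ['AWS ECS *']},
              ])
              deleted = []
              for page in pages:
                  for eni in page['NetworkInterfaces']:
                      eni_id = eni['NetworkInterfaceId']
                      try:
                          ec2.delete_network_interface(NetworkInterfaceId=eni_id)
                          deleted.append(eni_id)
                      except ClientError as err:
                          # Keep going: one failed delete should not stop the sweep
                          print(f'Skipped {eni_id}: {err}')
              print(f'Deleted ENIs: {deleted}')
              return {'deleted': deleted}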

Output

Orphaned ENIs deleted. Logs show removed interface IDs.

Production Trap:

Deleting an ENI attached to a running task breaks the task. Always filter by status 'available'.

Key Takeaway

Orphaned ENIs hold subnet IPs, and idle load balancers and unbounded log groups bleed costs. Automate nightly cleanup with Lambda.

● Production incident · POST-MORTEM · severity: high

Fargate Task ENI Exhaustion Blocked All New Deployments

Symptom

New Fargate tasks stuck in PENDING status indefinitely. ECS service deployments hung at 0% progress. No error messages in CloudWatch — tasks simply never transitioned to RUNNING.

Assumption

Fargate capacity was temporarily unavailable in the us-east-1a Availability Zone.

Root cause

Each Fargate task requires an elastic network interface (ENI) with a private IP address in the VPC subnet. The team used /24 subnets (251 usable IPs) across two Availability Zones. With 120 tasks running and each task consuming one ENI, plus 30 ENIs consumed by NAT gateways, ALBs, and other VPC resources, the subnets were exhausted. New tasks could not be placed because no IP addresses were available. The team had no monitoring on subnet IP utilization.

Fix

Added /20 subnets (4091 usable IPs) per AZ from a secondary VPC CIDR block, since an existing subnet's CIDR cannot be resized. Published each subnet's AvailableIpAddressCount (returned by the EC2 DescribeSubnets API) as a custom CloudWatch metric and added an alarm at 20% remaining IPs. Added subnet IP utilization to the weekly capacity review dashboard.

Key lesson

Each Fargate task consumes one ENI with a private IP — plan subnet sizing for peak task count plus infrastructure overhead
Monitor subnet AvailableIpAddressCount and alert before exhaustion
Use /20 or larger subnets for Fargate workloads to avoid IP exhaustion
Fargate supports only awsvpc networking, so there is no lower-IP network mode to fall back on; for high-density deployments, add subnets from a secondary VPC CIDR block
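The subnet arithmetic behind these lessons is easy to check with Python's standard ipaddress module. This is a sketch with hypothetical function names and example CIDR ranges; it relies on the fact that AWS reserves five addresses in every subnet, and it uses the 20% alert threshold from the fix above.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Number of addresses in the subnet that tasks and other resources can use."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_PER_SUBNET


def should_alert(cidr: str, available: int, threshold: float = 0.20) -> bool:
    """True when the share of free IPs is at or below the alert threshold."""
    return available / usable_ips(cidr) <= threshold


print(usable_ips("10.0.1.0/24"))   # 251
print(usable_ips("10.0.16.0/20"))  # 4091
print(should_alert("10.0.1.0/24", available=40))
```

The available count would come from the AvailableIpAddressCount field that describe-subnets returns for each subnet.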

Production debug guide: common symptoms and actions for Fargate production issues (5 entries)

Symptom · 01

Fargate task stuck in PENDING status

→

Fix

Check subnet available IPs, security group rules, and task execution role permissions. Run: aws ecs describe-tasks --cluster CLUSTER --tasks TASK_ARN --query 'tasks[0].stoppedReason'

Symptom · 02

Fargate task starts then exits immediately

→

Fix

Check CloudWatch Logs for the container. Verify the entrypoint and command in the task definition. Ensure the image exists in ECR with correct permissions.

Symptom · 03

Fargate tasks cannot reach RDS or other AWS services

→

Fix

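Verify the task is in a subnet with NAT gateway or VPC endpoint. Check security group outbound rules on the task and inbound rules on the target (for example, the RDS security group). Verify the task role, which is what application code uses to call AWS APIs, has the required permissions.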

Symptom · 04

Fargate deployment takes 5-10 minutes to replace tasks

→

Fix

Check health check grace period and deregistration delay on the target group. Reduce health check interval to 10s and healthy threshold to 2 for faster detection.

Symptom · 05

Fargate costs higher than expected

→

Fix

Review task CPU and memory allocation vs actual usage in CloudWatch Container Insights. Right-size tasks by analyzing p95 utilization over 14 days.
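The p95-based right-sizing in the last entry can be sketched as follows. The function names are hypothetical, the samples stand in for vCPU usage figures exported from Container Insights, and the 1.3x headroom factor is the one this guide recommends elsewhere; the list of CPU sizes is the set of valid Fargate values.

```python
VALID_FARGATE_VCPU = (0.25, 0.5, 1, 2, 4, 8, 16)


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def recommend_vcpu(used_vcpu_samples: list[float], headroom: float = 1.3) -> float:
    """Smallest valid Fargate CPU size covering p95 usage times the headroom factor."""
    target = p95(used_vcpu_samples) * headroom
    for size in VALID_FARGATE_VCPU:
        if size >= target:
            return size
    return VALID_FARGATE_VCPU[-1]  # usage exceeds the largest size; cap at 16 vCPU


# A task provisioned at 4 vCPU that mostly uses 0.3 vCPU, with rare spikes.
samples = [0.3] * 95 + [1.5] * 5
print(recommend_vcpu(samples))
```

Memory then has to be picked from the range that is valid for the chosen CPU size, so run the same percentile logic on memory samples and round up to an allowed value.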

★ AWS Fargate Quick Debug Reference: fast commands for diagnosing Fargate issues

Task stuck in PENDING

Immediate action

Check task stop reason and subnet capacity

Commands

aws ecs describe-tasks --cluster my-cluster --tasks $(aws ecs list-tasks --cluster my-cluster --desired-status RUNNING --query 'taskArns[0]' --output text) --query 'tasks[0].{status:lastStatus,stoppedReason:stoppedReason,attachments:attachments[0].details}'

aws ec2 describe-subnets --subnet-ids subnet-xxxxx --query 'Subnets[0].AvailableIpAddressCount'

Fix now

If IPs are exhausted, add larger subnets from a secondary VPC CIDR block or spread tasks across more subnets. If the execution role is missing permissions, attach the AmazonECSTaskExecutionRolePolicy managed policy to it.

Container crashes on startup

Cannot pull image from ECR

High Fargate costs

Fargate vs EC2 vs Lambda for Container Workloads

Feature	Fargate	EC2 Launch Type	Lambda (Container Images)
Server Management	Fully managed by AWS	You manage instances	Fully managed by AWS
Max Memory	Up to 120 GB per task	Depends on instance type	Up to 10 GB
Max vCPU	Up to 16 per task	Depends on instance type	Up to 6 vCPU
Execution Duration	Unlimited	Unlimited	15 minutes max
Networking	ENI per task in VPC	Shared ENI on instance	VPC optional
Cold Start	30-90 seconds for new tasks	None (instances running)	1-3 seconds
Cost Model	Per second for vCPU + memory	Per second for running instances	Per invocation + duration
Best For	Steady microservices, APIs	Cost-optimized at scale	Event-driven, short tasks

⚙ Quick Reference

9 commands from this guide

File	Command / Code	Purpose
io.thecodeforge.fargate.task_definition.json	{	What Is AWS Fargate?
io.thecodeforge.fargate.networking.yml	Resources:	Fargate Networking and Security
io.thecodeforge.fargate.cost_analysis.py	from dataclasses import dataclass	Fargate Pricing and Cost Optimization
io.thecodeforge.fargate.ecs_service.yml	Resources:	Deploying ECS Services on Fargate
io.thecodeforge.fargate.observability.json	{	Fargate Logging and Observability
FargateTaskStateMachine.yml	events:	Fargate Task Lifecycle
FargateEphemeralStorage.yml	aws_ecs_task_definition:	Ephemeral Storage Limits
fargate-rightsizing-policy.yml	AWSTemplateFormatVersion: '2010-09-09'	Why Fargate Sprawl Costs You Thousands
cleanup-orphaned-enis.yml	AWSTemplateFormatVersion: '2010-09-09'	Clean Up Fargate Leftovers Before AWS Bills You for Ghost Re

Key takeaways

Fargate runs containers without managing servers: you define tasks, AWS handles infrastructure

Each task gets its own ENI and resource isolation: plan subnet CIDR blocks for peak task count

Right-sizing CPU and memory based on actual utilization saves 30-50% of compute costs

Enable deployment circuit breaker with rollback on every ECS Fargate service

VPC endpoints for ECR, S3, and CloudWatch eliminate NAT gateway charges

Structured logging via FireLens is mandatory: Fargate has no SSH access for debugging

Common mistakes to avoid

6 patterns

Over-provisioning CPU and memory per Fargate task

Symptom

CloudWatch shows 15% CPU utilization and 25% memory utilization — paying for 85% unused resources across all tasks

Fix

Review Container Insights metrics for p95 utilization over 14 days. Right-size tasks to 1.3x actual utilization. Valid Fargate CPU values: 0.25, 0.5, 1, 2, 4, 8, 16 vCPU.

Running Fargate tasks in subnets too small for peak task count

Symptom

New tasks stuck in PENDING status — no error message, tasks never transition to RUNNING

Fix

Each task requires one ENI with a private IP. Use /20 subnets (4091 IPs) minimum. Publish each subnet's AvailableIpAddressCount (from DescribeSubnets) as a custom CloudWatch metric and alarm at a 20% remaining threshold.

Not enabling deployment circuit breaker on ECS services

Symptom

Bad deployment replaces all healthy tasks with failing ones — entire service goes down with no automatic recovery

Fix

Enable DeploymentCircuitBreaker with rollback in the ECS service definition. This automatically reverts to the previous task definition when new tasks fail.

Using default log driver without structured logging

Symptom

Debugging production issues requires grep through raw text logs — no correlation IDs, no structured fields, no multi-destination routing

Fix

Use FireLens with Fluent Bit as log router. Emit structured JSON logs with correlation IDs, service name, and request context. Set log retention policy on CloudWatch log groups.

Not creating VPC endpoints for AWS services

Symptom

All ECR image pulls and Secrets Manager fetches route through NAT gateway — high data processing charges and single point of failure

Fix

Create VPC endpoints for ECR (api + dkr), S3 (gateway), CloudWatch Logs, and Secrets Manager. This eliminates NAT gateway charges for AWS service communication.

Running production and development Fargate tasks in the same cluster without namespace isolation

Symptom

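Development deployments and experiments share cluster-level settings, IAM scope, and dashboards with production, so a mistaken command or policy can hit production services. Fargate tasks have dedicated compute, so this is an operational risk rather than a CPU or memory noisy-neighbor problem.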

Fix

Use separate ECS clusters for production and development. Tag resources with environment labels. Use cluster capacity providers to isolate Fargate and Fargate Spot workloads.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

What is AWS Fargate and how does it differ from running containers on EC...

Q02 · SENIOR

How would you optimize Fargate costs for a production microservices arch...

Q03 · SENIOR

A production Fargate service is experiencing intermittent task placement...

Q01 of 03 · JUNIOR

What is AWS Fargate and how does it differ from running containers on EC2?

ANSWER

AWS Fargate is a serverless compute engine for containers that works with Amazon ECS and Amazon EKS. The key differences from the EC2 launch type:

1. Server management: With Fargate, AWS manages the underlying infrastructure, so there are no EC2 instances to provision, patch, or scale. With EC2, you manage the instance fleet.
2. Resource model: Fargate allocates resources per task (vCPU and memory). EC2 allocates resources per instance, and tasks share instance capacity.
3. Isolation: Each Fargate task gets its own isolated runtime (it does not share a kernel with other tasks) and its own ENI. On EC2, tasks share the host kernel, and they share the host's network interface unless they run in awsvpc mode.
4. Pricing: Fargate charges per second for allocated task resources. EC2 charges for the instances you run, regardless of task utilization.
5. Scaling: With Fargate you scale only the task count (for example with Service Auto Scaling); there is no instance fleet to scale. EC2 also requires Auto Scaling Groups to scale the instance fleet.

Fargate is best for variable workloads and operational simplicity. EC2 is better for cost optimization at steady, predictable scale.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Is AWS Fargate really serverless?

What is the maximum size of a Fargate task?

Can Fargate tasks communicate with each other?

How does Fargate handle persistent storage?

Should I use Fargate Spot for production workloads?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

✓ Verified

production tested

July 04, 2026

last updated

230

articles · all by Naren
