Senior 3 min · April 11, 2026

AWS Fargate — ENI IP Exhaustion Blocks Deployments Silently

Fargate tasks stuck PENDING with zero errors? A /24 subnet exhausts at 120 tasks — each consumes one ENI.

Naren · Founder
Quick Answer
  • Fargate is AWS's serverless compute engine: it runs containers without you managing servers
  • You define CPU and memory per task — Fargate provisions and scales infrastructure automatically
  • Works with both Amazon ECS and Amazon EKS for orchestration
  • You pay per second for vCPU and memory resources allocated to running tasks
  • Production use requires careful networking, IAM, and logging configuration
  • Biggest mistake: over-provisioning CPU and memory per task, inflating costs by 3-5x
Plain-English First

Fargate is like renting individual apartments instead of buying an entire building. With EC2, you own the building and manage everything — plumbing, electricity, maintenance. With Fargate, you rent just the space you need, and AWS handles the building. You bring your containers, specify how much CPU and memory they need, and Fargate runs them without you ever seeing a server.

AWS Fargate is a serverless compute engine for containers that eliminates the need to provision, configure, or scale virtual machine clusters. You package your application as a container image, define resource requirements, and Fargate runs it on infrastructure managed entirely by AWS.

Fargate shifts operational burden from managing EC2 instance fleets to defining task-level resource requirements. This simplifies capacity planning but introduces new challenges around networking configuration, IAM task roles, cold start latency, and cost optimization at scale. Production deployments require understanding these trade-offs before committing to Fargate over EC2 launch type.

What Is AWS Fargate?

AWS Fargate is a serverless compute engine for containers that works with both Amazon ECS and Amazon EKS. It removes the need to manage EC2 instances — you define container images, CPU, memory, and networking requirements, and Fargate provisions the underlying infrastructure to run your containers.

Fargate assigns each task its own kernel runtime environment and elastic network interface (ENI). This provides task-level isolation comparable to running containers on dedicated EC2 instances, without the operational overhead of managing the instance fleet.

The core abstraction is the task — a set of one or more containers that share a network namespace and storage volumes. You define tasks in a task definition, which specifies the container image, resource requirements, IAM roles, logging configuration, and networking mode. ECS or EKS schedules these tasks onto Fargate-managed infrastructure.

io.thecodeforge.fargate.task_definition.json (JSON)
{
  "family": "io-thecodeforge-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "NODE_ENV", "value": "production"},
        {"name": "LOG_LEVEL", "value": "info"}
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-url:DATABASE_URL::"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/io-thecodeforge-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}
Fargate as Serverless Container Hosting
  • Each task gets its own ENI, kernel, and resource isolation — no shared host contention
  • You define CPU and memory per task, not per cluster — capacity planning is task-level
  • Fargate works with ECS and EKS — same serverless model for both orchestrators
  • You pay per second for vCPU and memory allocated to running tasks only
  • No SSH access to underlying infrastructure — all debugging happens through logs and APIs
Production Insight
Fargate gives you no host to SSH into.
Troubleshooting runs through CloudWatch Logs and the ECS APIs (ECS Exec can open a shell inside a container, but only if it was enabled before the task started).
Rule: invest in structured logging and health checks before deploying to Fargate.
Key Takeaway
Fargate runs containers without managing servers — you define tasks, AWS handles infrastructure.
Each task gets its own network interface and resource isolation.
Choose Fargate for operational simplicity, EC2 for cost optimization at steady scale.
Fargate vs EC2 Launch Type
If: Workload has predictable, steady-state traffic
Use: EC2 with Savings Plans (lower cost at consistent utilization)
If: Workload is bursty or has variable scaling patterns
Use: Fargate (pay only for running tasks, no idle instance cost)
If: Tasks require GPUs, more memory than Fargate's 120 GB per-task ceiling, or specific instance types
Use: EC2 (Fargate has CPU/memory limits and no GPU support)
If: Team wants minimal operational overhead
Use: Fargate (no instance patching, AMI management, or capacity planning)

Fargate Networking and Security

Fargate tasks run in awsvpc mode — each task receives its own elastic network interface (ENI) with a private IP address in your VPC subnet. This provides VPC-level security controls through security groups and network ACLs, but requires careful subnet planning to avoid IP exhaustion.

Networking decisions have cost and performance implications. Tasks in private subnets require a NAT gateway for outbound internet access, which adds data processing charges. VPC endpoints for AWS services (S3, ECR, CloudWatch, Secrets Manager) eliminate NAT gateway costs for service-to-service communication.

Security follows the principle of least privilege through two IAM roles per task: the execution role (pulling images, writing logs, fetching secrets) and the task role (application-level AWS API access). Separating these roles ensures the task can only access resources it actually needs.
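
To make the role split concrete, here is a minimal sketch of the two policy documents as Python dictionaries. The action lists and the bucket name `example-app-bucket` are illustrative assumptions, not policies taken from a real deployment. The point is only that the two roles share no permissions.

```python
import json

# Execution role: what ECS needs in order to start the task
# (pull the image, write logs, fetch secrets). "Resource": "*" keeps the
# sketch short; a real policy would scope it to specific ARNs.
EXECUTION_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken",
                "ecr:BatchCheckLayerAvailability",
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "secretsmanager:GetSecretValue",
            ],
            "Resource": "*",
        }
    ],
}

# Task role: what the application code calls at runtime.
# The bucket name is hypothetical.
TASK_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/*",
        }
    ],
}


def actions(policy: dict) -> set:
    """Collect every action named in a policy document."""
    found = set()
    for statement in policy["Statement"]:
        listed = statement["Action"]
        found.update([listed] if isinstance(listed, str) else listed)
    return found


# An empty intersection means neither role can do the other's job.
overlap = actions(EXECUTION_ROLE_POLICY) & actions(TASK_ROLE_POLICY)
print("shared actions:", sorted(overlap))
print(json.dumps(TASK_ROLE_POLICY, indent=2))
```

If the application later needs another AWS API, the change belongs in the task role; the execution role should only grow when the task definition itself changes (new secret, new log destination).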

io.thecodeforge.fargate.networking.yml (YAML)
# CloudFormation snippet for Fargate networking infrastructure
# Shows VPC endpoints, subnets, and security groups

Resources:
  # Private subnets for Fargate tasks
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.0.0/20  # 4091 usable IPs for Fargate tasks
      AvailabilityZone: us-east-1a
      Tags:
        - Key: Name
          Value: fargate-private-a

  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.16.0/20
      AvailabilityZone: us-east-1b
      Tags:
        - Key: Name
          Value: fargate-private-b

  # NAT Gateway for outbound internet access
  NatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnetA

  # VPC Endpoint for ECR (avoids NAT gateway charges)
  EcrApiEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.ecr.api
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
      SecurityGroupIds:
        - !Ref VpcEndpointSg

  S3Endpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Security group for Fargate tasks
  FargateTaskSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Fargate tasks
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          SourceSecurityGroupId: !Ref AlbSg
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS outbound for ECR, Secrets Manager
        - IpProtocol: tcp
          FromPort: 5432
          ToPort: 5432
          DestinationSecurityGroupId: !Ref DatabaseSg  # egress rules take a destination, not a source

  # Security group for VPC endpoints
  VpcEndpointSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTPS from Fargate tasks to VPC endpoints
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref FargateTaskSg
Fargate Networking Pitfalls
  • Each Fargate task consumes one ENI with a private IP — plan subnet CIDR blocks for peak task count
  • Tasks in private subnets without NAT gateway or VPC endpoints cannot pull images from ECR
  • Security groups on Fargate tasks must allow outbound HTTPS (443) for ECR, Secrets Manager, and CloudWatch
  • Cross-AZ traffic between Fargate tasks and RDS incurs data transfer charges
  • Public IP on Fargate tasks exposes them directly to the internet — always use private subnets with ALB
Production Insight
VPC endpoints eliminate NAT gateway data processing charges for AWS service access.
Pulling ECR images through a NAT gateway incurs about $0.045/GB in data processing charges (us-east-1 rate), which adds up fast at scale.
Rule: create VPC endpoints for ECR, S3, CloudWatch, and Secrets Manager immediately.
Key Takeaway
Fargate tasks run in awsvpc mode with dedicated ENIs in your VPC subnets.
VPC endpoints for AWS services eliminate NAT gateway costs and improve reliability.
Separate execution role (infrastructure) from task role (application) for least privilege.
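
A rough back-of-the-envelope sketch of the NAT charge described above. It assumes every image byte crosses the NAT gateway (no VPC endpoints) and uses the $0.045/GB figure quoted in this section; the image size and pull count are made-up inputs.

```python
NAT_PROCESSING_USD_PER_GB = 0.045  # data processing rate quoted above


def nat_pull_cost(image_size_gb: float, pulls_per_day: int, days: int = 30) -> float:
    """Estimate NAT data processing cost for image pulls over a period."""
    total_gb = image_size_gb * pulls_per_day * days
    return round(total_gb * NAT_PROCESSING_USD_PER_GB, 2)


# Hypothetical workload: a 1.5 GB image pulled 200 times a day
# (deployments, scale-out events, task restarts).
monthly = nat_pull_cost(image_size_gb=1.5, pulls_per_day=200)
print(f"Estimated NAT processing cost for pulls: ${monthly}/month")
```

That is 9,000 GB a month through the NAT gateway, or about $405 in processing charges that an ECR endpoint pair plus the S3 gateway endpoint would avoid.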

Fargate Pricing and Cost Optimization

Fargate pricing is based on vCPU and memory resources allocated to running tasks, billed per second with a one-minute minimum. This model eliminates idle capacity costs but requires right-sizing tasks to avoid over-provisioning.
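
As a worked example of per-second billing with the one-minute minimum, the sketch below prices a single task run. The hourly rates are the us-east-1 figures used in this article's cost calculator and are assumptions to verify against current pricing.

```python
import math

VCPU_PER_HOUR = 0.04048        # us-east-1 rate used elsewhere in this article
MEMORY_GB_PER_HOUR = 0.004445  # us-east-1 rate used elsewhere in this article
MINIMUM_BILLED_SECONDS = 60


def task_run_cost(vcpu: float, memory_gb: float, runtime_seconds: float) -> float:
    """Cost of one task run: billed per second, never less than one minute."""
    billed_seconds = max(MINIMUM_BILLED_SECONDS, math.ceil(runtime_seconds))
    hourly = vcpu * VCPU_PER_HOUR + memory_gb * MEMORY_GB_PER_HOUR
    return hourly * billed_seconds / 3600


# A 10-second job is billed the same as a 60-second job.
print(task_run_cost(0.5, 1, 10) == task_run_cost(0.5, 1, 60))
```

The minimum matters mostly for very short batch jobs: a task that finishes in 10 seconds still pays for 60.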

Cost optimization in Fargate centers on three levers: right-sizing CPU and memory allocations, using Fargate Spot for fault-tolerant workloads, and consolidating containers into fewer, larger tasks. Most production Fargate deployments overspend by 30-50% due to inflated resource requests that do not match actual utilization.

Fargate Spot provides up to 70% cost reduction for interrupt-tolerant workloads like batch processing, CI/CD pipelines, and stateless workers. Spot tasks can be interrupted with two minutes' notice, so your application must handle graceful shutdown.

io.thecodeforge.fargate.cost_analysis.py (PYTHON)
from dataclasses import dataclass
from typing import Dict, List
from enum import Enum


class PricingRegion(Enum):
    US_EAST_1 = "us-east-1"
    EU_WEST_1 = "eu-west-1"
    AP_SOUTHEAST_1 = "ap-southeast-1"


@dataclass
class FargatePricing:
    """Fargate pricing per hour for a region."""
    vcpu_per_hour: float
    memory_per_gb_hour: float
    spot_discount: float = 0.70  # Spot savings are "up to" 70% and vary; a flat rate is an assumption


@dataclass
class TaskConfiguration:
    """A Fargate task's resource allocation."""
    name: str
    vcpu: float
    memory_gb: float
    task_count: int
    hours_per_day: float = 24.0
    is_spot_eligible: bool = False


class FargateCostCalculator:
    """
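    Calculates and optimizes Fargate costs.

    The rates in PRICING are illustrative on-demand prices; confirm them
    against the current AWS price list. Regions missing from PRICING
    raise KeyError in __init__.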
    """

    PRICING = {
        PricingRegion.US_EAST_1: FargatePricing(
            vcpu_per_hour=0.04048,
            memory_per_gb_hour=0.004445
        ),
        PricingRegion.EU_WEST_1: FargatePricing(
            vcpu_per_hour=0.04655,
            memory_per_gb_hour=0.005112
        ),
    }

    def __init__(self, region: PricingRegion):
        # Keep the region so reports can say which price list was used.
        self.region = region
        self.pricing = self.PRICING[region]

    def task_hourly_cost(self, task: TaskConfiguration) -> float:
        """
        Calculate the hourly cost of a single Fargate task.
        """
        base_cost = (
            task.vcpu * self.pricing.vcpu_per_hour
            + task.memory_gb * self.pricing.memory_per_gb_hour
        )
        if task.is_spot_eligible:
            return base_cost * (1 - self.pricing.spot_discount)
        return base_cost

    def monthly_cost(self, task: TaskConfiguration) -> float:
        """
        Calculate the monthly cost (30 days) for all instances of a task.
        """
        hourly = self.task_hourly_cost(task)
        return hourly * task.hours_per_day * 30 * task.task_count

    def total_monthly_cost(self, tasks: List[TaskConfiguration]) -> Dict:
        """
        Calculate total monthly cost breakdown.
        """
        breakdown = {}
        total = 0.0

        for task in tasks:
            cost = self.monthly_cost(task)
            breakdown[task.name] = {
                "task_cost_hourly": round(self.task_hourly_cost(task), 4),
                "monthly_cost": round(cost, 2),
                "task_count": task.task_count,
                "vcpu_per_task": task.vcpu,
                "memory_gb_per_task": task.memory_gb,
                "spot_eligible": task.is_spot_eligible,
            }
            total += cost

        return {
            "tasks": breakdown,
            "total_monthly_cost": round(total, 2),
            # Report the region name, not the vCPU price.
            "region": self.region.value,
        }

    def right_size_recommendation(
        self,
        task: TaskConfiguration,
        actual_cpu_utilization_pct: float,
        actual_memory_utilization_pct: float
    ) -> Dict:
        """
        Recommend right-sized task configuration based on actual utilization.
        """
        recommended_vcpu = task.vcpu * (actual_cpu_utilization_pct / 100) * 1.3
        recommended_memory = task.memory_gb * (actual_memory_utilization_pct / 100) * 1.3

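        # Snap to Fargate sizes. Simplification: the memory values Fargate accepts
        # depend on the chosen vCPU size, so validate the final pair before use.
        valid_vcpu = [0.25, 0.5, 1, 2, 4, 8, 16]
        # Fargate tops out at 120 GB per task.
        valid_memory = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 30, 48, 64, 96, 120]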

        recommended_vcpu = min(v for v in valid_vcpu if v >= recommended_vcpu)
        recommended_memory = min(m for m in valid_memory if m >= recommended_memory)

        current_cost = self.monthly_cost(task)
        right_sized_task = TaskConfiguration(
            name=task.name,
            vcpu=recommended_vcpu,
            memory_gb=recommended_memory,
            task_count=task.task_count,
            hours_per_day=task.hours_per_day,
            is_spot_eligible=task.is_spot_eligible,
        )
        new_cost = self.monthly_cost(right_sized_task)

        return {
            "current": {
                "vcpu": task.vcpu,
                "memory_gb": task.memory_gb,
                "monthly_cost": round(current_cost, 2),
            },
            "recommended": {
                "vcpu": recommended_vcpu,
                "memory_gb": recommended_memory,
                "monthly_cost": round(new_cost, 2),
            },
            "savings_monthly": round(current_cost - new_cost, 2),
            "savings_pct": round((1 - new_cost / current_cost) * 100, 1),
        }


# Example usage
calculator = FargateCostCalculator(PricingRegion.US_EAST_1)

tasks = [
    TaskConfiguration("api-server", 1, 2, 10, 24, False),
    TaskConfiguration("worker", 0.5, 1, 20, 24, True),
    TaskConfiguration("scheduler", 0.25, 0.5, 2, 24, False),
]

result = calculator.total_monthly_cost(tasks)
print(f"Total monthly cost: ${result['total_monthly_cost']}")

# Right-sizing recommendation
recommendation = calculator.right_size_recommendation(
    tasks[0],
    actual_cpu_utilization_pct=25,
    actual_memory_utilization_pct=40
)
print(f"Right-size savings: ${recommendation['savings_monthly']}/month ({recommendation['savings_pct']}%)")
Fargate Cost Optimization Strategies
  • Right-size tasks using CloudWatch Container Insights — most deployments over-provision by 2-3x
  • Use Fargate Spot for batch jobs, CI/CD workers, and stateless background tasks (up to 70% savings)
  • Run small, tightly coupled containers together in one task instead of separate tasks to reduce per-task overhead
  • Schedule non-production tasks to stop outside business hours using EventBridge + Lambda
  • Use Compute Savings Plans for predictable Fargate workloads (up to 17% savings)
Production Insight
Fargate Spot interruptions come with a two-minute warning via ECS.
Applications must handle SIGTERM gracefully to avoid data loss.
Rule: implement graceful shutdown hooks before using Fargate Spot in production.
Key Takeaway
Fargate charges per second for allocated vCPU and memory — not actual usage.
Right-sizing tasks based on real utilization saves 30-50% of compute costs.
Fargate Spot provides up to 70% savings for interrupt-tolerant workloads.
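
A minimal sketch of the graceful shutdown rule above, assuming a simple worker loop. The names `run_worker` and `handle` are hypothetical; the only platform behavior relied on is that the container's main process receives SIGTERM before it is killed.

```python
import signal
import threading

# Set by the signal handler, checked by the worker between jobs.
shutdown_requested = threading.Event()


def _handle_sigterm(signum, frame):
    """Record the stop request; the worker finishes its current job first."""
    shutdown_requested.set()


def install_handlers():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, _handle_sigterm)


def run_worker(jobs, handle):
    """Process jobs one at a time, stopping cleanly once SIGTERM arrives."""
    completed = []
    for job in jobs:
        if shutdown_requested.is_set():
            break  # leave remaining jobs on the queue for another task
        handle(job)
        completed.append(job)
    return completed
```

Keeping each unit of work short relative to the interruption notice is the design choice that makes this safe: the loop only checks the flag between jobs, so a job that runs longer than the notice window can still be cut off.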

Deploying ECS Services on Fargate

Production ECS services on Fargate require a deployment configuration that handles rolling updates, health checks, auto-scaling, and service discovery. The ECS service abstraction manages task placement, desired count, and deployment strategy across Fargate-managed infrastructure.

Rolling updates with the circuit breaker pattern prevent failed deployments from replacing healthy tasks. The circuit breaker monitors task health and automatically rolls back if new tasks fail to start. Combined with health check grace periods, this prevents deployment cascading failures.
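
The two deployment percentages translate into hard task-count bounds during a rollout. The sketch below works them out, assuming the rounding ECS documents for these settings (minimum healthy rounds up, maximum rounds down); treat that rounding as an assumption to confirm for your deployment type.

```python
def deployment_bounds(desired_count: int,
                      minimum_healthy_percent: int,
                      maximum_percent: int) -> tuple:
    """Return (min_running, max_running) task counts during a rolling update."""
    # Ceiling division for the lower bound, floor division for the upper bound.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    max_running = desired_count * maximum_percent // 100
    return min_running, max_running


# DesiredCount 3 with MinimumHealthyPercent 100 and MaximumPercent 200
low, high = deployment_bounds(3, 100, 200)
print(f"ECS keeps at least {low} and at most {high} tasks during the rollout")
```

Because every Fargate task needs its own ENI, the upper bound is also the number of subnet IPs a deployment can briefly consume, so subnet headroom should be sized against it rather than against the steady-state count.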

Auto-scaling on Fargate adjusts the desired task count based on CloudWatch metrics — CPU utilization, memory utilization, request count, or custom metrics via Application Auto Scaling. Scaling policies should use target tracking for steady-state adjustments and step scaling for rapid traffic spikes.
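
Application Auto Scaling does not publish its exact algorithm, but a proportional rule is a reasonable mental model for target tracking. The sketch below is that approximation only, using a 65% CPU target and a 3 to 20 task range as example values; the function name is hypothetical.

```python
import math


def estimate_desired_count(current_tasks: int,
                           observed_cpu_pct: float,
                           target_cpu_pct: float = 65.0,
                           min_capacity: int = 3,
                           max_capacity: int = 20) -> int:
    """Approximate the task count that would bring average CPU back to target."""
    raw = math.ceil(current_tasks * observed_cpu_pct / target_cpu_pct)
    # Never leave the configured capacity range.
    return max(min_capacity, min(max_capacity, raw))


# 3 tasks averaging 90% CPU against a 65% target
print(estimate_desired_count(3, 90))
```

The clamp is the part worth remembering: once the service reaches its maximum capacity, target tracking cannot add tasks no matter how far the metric is above target.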

io.thecodeforge.fargate.ecs_service.yml (YAML)
# CloudFormation for production ECS Fargate service
# Includes deployment circuit breaker, auto-scaling, and service discovery

Resources:
  EcsCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: io-thecodeforge-production
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      ServiceConnectDefaults:
        Namespace: io-thecodeforge.local

  EcsService:
    Type: AWS::ECS::Service
    DependsOn:
      - AlbListener
    Properties:
      ServiceName: io-thecodeforge-api
      Cluster: !Ref EcsCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      PlatformVersion: LATEST
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref FargateTaskSg
          Subnets:
            - !Ref PrivateSubnetA
            - !Ref PrivateSubnetB
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref ApiTargetGroup
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
      HealthCheckGracePeriodSeconds: 120

  # Auto-scaling target
  ScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 20
      MinCapacity: 3
      ResourceId: !Sub service/${EcsCluster}/${EcsService.Name}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt AutoScalingRole.Arn

  # Target tracking scaling policy
  CpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 65
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 300
        ScaleOutCooldown: 60

  # Service discovery
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: api
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespace
        DnsRecords:
          - TTL: 10
            Type: A
      HealthCheckCustomConfig:
        FailureThreshold: 1

  # CloudWatch alarm for failed deployments
  FailedDeploymentAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: fargate-api-failed-tasks
      AlarmDescription: Alert when Fargate tasks fail to start
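      Namespace: ECS/ContainerInsights  # RunningTaskCount is a Container Insights metric
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsCluster
        - Name: ServiceName
          Value: !GetAtt EcsService.Name  # !Ref returns the service ARN, not its name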
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: LessThanThreshold
      AlarmActions:
        - !Ref OpsSnsTopic
Fargate Deployment Safety Net
  • Circuit breaker monitors new task health during rolling deployments
  • If tasks fail to start, ECS automatically rolls back to the previous task definition
  • Health check grace period gives containers time to initialize before health checks begin
  • MinimumHealthyPercent: 100 ensures zero-downtime deployments — old tasks stay until new ones are healthy
  • Auto-scaling adjusts task count based on CPU, memory, or custom CloudWatch metrics
Production Insight
Deployment circuit breaker is the most important Fargate production feature.
Without it, a bad image push replaces all healthy tasks with failing ones.
Rule: enable circuit breaker with rollback on every ECS Fargate service.
Key Takeaway
ECS services on Fargate need circuit breaker, health checks, and auto-scaling.
Circuit breaker with rollback prevents bad deployments from taking down production.
Auto-scaling on CPU utilization at 65% target provides headroom for traffic spikes.

Fargate Logging and Observability

Production Fargate workloads require structured logging, distributed tracing, and container-level metrics. Since Fargate provides no SSH access, all observability must be configured through the task definition and external services before deployment.

CloudWatch Logs is the default log driver, but production systems benefit from FireLens — a Fluent Bit-based log router that supports structured JSON output, multi-destination routing, and log filtering. FireLens sends logs to CloudWatch, Datadog, Splunk, or Elasticsearch simultaneously.

Container Insights provides CPU, memory, disk, and network metrics per task and container. Combined with X-Ray for distributed tracing, this creates a complete observability stack for Fargate microservices.

io.thecodeforge.fargate.observability.json (JSON)
{
  "family": "io-thecodeforge-api-observed",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "/ecs/io-thecodeforge-api",
          "log_stream_prefix": "ecs",
          "auto_create_group": "true",
          "log_key": "log"
        }
      },
      "dependsOn": [
        {
          "containerName": "log-router",
          "condition": "START"
        }
      ]
    },
    {
      "name": "xray",
      "image": "amazon/aws-xray-daemon:latest",
      "essential": false,
      "cpu": 32,
      "memoryReservation": 256,
      "portMappings": [
        {
          "containerPort": 2000,
          "protocol": "udp"
        }
      ]
    },
    {
      "name": "log-router",
      "image": "amazon/aws-for-fluent-bit:latest",
      "essential": true,
      "cpu": 32,
      "memoryReservation": 64,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true",
          "config-file-type": "file",
          "config-file-value": "/fluent-bit/configs/parse-json.conf"
        }
      }
    }
  ]
}
Fargate Observability Stack
  • Use FireLens (Fluent Bit) as log router — supports structured JSON and multi-destination output
  • Enable Container Insights on the ECS cluster for per-task CPU, memory, and network metrics
  • Add X-Ray sidecar for distributed tracing across microservices
  • Emit structured JSON logs with correlation IDs — never plain text log lines
  • Set log retention policy on CloudWatch log groups — default is infinite, which is expensive
Production Insight
Fargate has no SSH — logs are the only window into running containers.
Structured JSON logs with correlation IDs are mandatory for debugging microservices.
Rule: configure logging and tracing in the task definition before first deployment.
Key Takeaway
Fargate requires pre-configured observability — no SSH means no after-the-fact debugging.
FireLens routes structured logs to multiple destinations with filtering.
Container Insights and X-Ray provide the metrics and tracing stack for production monitoring.
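
On the application side, structured logging with correlation IDs can be sketched with the standard library alone. The field names and the `correlation_id` attribute are assumptions chosen for illustration, not a format that FireLens requires.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Supplied per call via logging's `extra` argument.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger()
log.info("order created", extra={"correlation_id": "req-123"})
```

Writing to the container's standard streams is deliberate: that is what the task's log driver collects, so the application needs no knowledge of where the logs end up.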
● Production incident · POST-MORTEM · severity: high

Fargate Task ENI Exhaustion Blocked All New Deployments

Symptom
New Fargate tasks stuck in PENDING status indefinitely. ECS service deployments hung at 0% progress. No error messages in CloudWatch — tasks simply never transitioned to RUNNING.
Assumption
Fargate capacity was temporarily unavailable in the us-east-1a Availability Zone.
Root cause
Each Fargate task requires an elastic network interface (ENI) with a private IP address in the VPC subnet. The team used /24 subnets (251 usable IPs) across two Availability Zones. With 120 tasks running and each task consuming one ENI, plus 30 ENIs consumed by NAT gateways, ALBs, and other VPC resources, the subnets were exhausted. New tasks could not be placed because no IP addresses were available. The team had no monitoring on subnet IP utilization.
Fix
Added new /20 subnets (4091 usable IPs each) per AZ from a secondary VPC CIDR block, since an existing subnet's CIDR cannot be resized in place. Added a CloudWatch alarm on available IP count, fed by the AvailableIpAddressCount value that EC2 reports for each subnet. Set the alarm threshold at 20% remaining IPs. Added subnet IP utilization to the weekly capacity review dashboard.
Key lesson
  • Each Fargate task consumes one ENI with a private IP — plan subnet sizing for peak task count plus infrastructure overhead
  • Monitor subnet AvailableIpAddressCount and alert before exhaustion
  • Use /20 or larger subnets for Fargate workloads to avoid IP exhaustion
  • Fargate requires awsvpc mode, so high-density deployments need more address space (additional subnets or a secondary VPC CIDR), not a different network mode
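
The subnet arithmetic in this post-mortem can be checked with a few lines of Python. The sketch assumes the standard AWS rule that five addresses in every subnet are reserved, and uses the 20% alert threshold from the fix; the function names are hypothetical.

```python
import ipaddress

AWS_RESERVED_IPS_PER_SUBNET = 5  # first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses in the subnet that tasks and other ENIs can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_IPS_PER_SUBNET


def should_alert(cidr: str, available_ips: int, threshold: float = 0.20) -> bool:
    """True when fewer than `threshold` of the usable addresses remain."""
    return available_ips < usable_ips(cidr) * threshold


print(usable_ips("10.0.0.0/24"))   # the subnet size from the incident
print(usable_ips("10.0.0.0/20"))   # the subnet size after the fix
print(should_alert("10.0.0.0/24", available_ips=40))
```

The `available_ips` input would come from the AvailableIpAddressCount field that `aws ec2 describe-subnets` returns for the subnet.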
Production debug guide: common symptoms and actions for Fargate production issues (5 entries)
Symptom · 01
Fargate task stuck in PENDING status
Fix
Check subnet available IPs, security group rules, and task execution role permissions. Run: aws ecs describe-tasks --cluster CLUSTER --tasks TASK_ARN --query 'tasks[0].stoppedReason'
Symptom · 02
Fargate task starts then exits immediately
Fix
Check CloudWatch Logs for the container. Verify the entrypoint and command in the task definition. Ensure the image exists in ECR with correct permissions.
Symptom · 03
Fargate tasks cannot reach RDS or other AWS services
Fix
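Verify the task's subnet has a route to the target: a NAT gateway or VPC endpoint for AWS service APIs, and VPC routing plus security group rules for RDS. Check security group outbound rules. Verify the task role (not the execution role) grants the AWS API permissions the application needs.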
Symptom · 04
Fargate deployment takes 5-10 minutes to replace tasks
Fix
Check health check grace period and deregistration delay on the target group. Reduce health check interval to 10s and healthy threshold to 2 for faster detection.
Symptom · 05
Fargate costs higher than expected
Fix
Review task CPU and memory allocation vs actual usage in CloudWatch Container Insights. Right-size tasks by analyzing p95 utilization over 14 days.
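
A small sketch of the p95 right-sizing step, assuming you have already exported utilization samples (as percentages of the allocation) for the 14-day window. It uses a nearest-rank percentile and the 1.3x headroom factor used elsewhere in this article; both function names are hypothetical.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def right_sized_allocation(allocated: float, utilization_pct_samples, headroom: float = 1.3) -> float:
    """Allocation that covers p95 utilization plus headroom (before snapping to a valid size)."""
    p95 = percentile(utilization_pct_samples, 95)
    return allocated * (p95 / 100) * headroom


# Hypothetical hourly CPU utilization samples for a 1 vCPU task
samples = [12, 15, 14, 18, 22, 19, 35, 16, 13, 17]
print(right_sized_allocation(1.0, samples))
```

The result still has to be rounded up to a CPU and memory combination that Fargate accepts.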
★ AWS Fargate Quick Debug Reference: fast commands for diagnosing Fargate issues
Task stuck in PENDING (existing tasks keep running normally)
Immediate action
Check task stop reason and subnet capacity
Commands
aws ecs describe-tasks --cluster my-cluster --tasks $(aws ecs list-tasks --cluster my-cluster --desired-status RUNNING --query 'taskArns[0]' --output text) --query 'tasks[0].{status:lastStatus,stoppedReason:stoppedReason,attachments:attachments[0].details}'
aws ec2 describe-subnets --subnet-ids subnet-xxxxx --query 'Subnets[0].AvailableIpAddressCount'
Fix now
If IPs are exhausted, add subnets from a larger or secondary CIDR block, or spread tasks across more subnets. If the execution role lacks permissions, attach the AmazonECSTaskExecutionRolePolicy managed policy to it.
Container crashes on startup
Immediate action
Check container logs and exit code
Commands
aws logs get-log-events --log-group-name /ecs/my-task --log-stream-name ecs/my-container/TASK_ID --limit 50
aws ecs describe-tasks --cluster my-cluster --tasks TASK_ARN --query 'tasks[0].containers[0].{exitCode:exitCode,reason:reason}'
Fix now
Exit code 137 usually means the container was OOM-killed (SIGKILL): increase memory in the task definition. Exit code 1 is an application error: check the container entrypoint and environment variables.
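
Exit codes above 128 follow the shell convention of 128 plus the signal number, which is where 137 (128 + 9, SIGKILL) comes from. A small helper, sketched here with a hypothetical name, makes that decoding explicit:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short diagnosis."""
    if code == 0:
        return "clean exit"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "application error"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

A SIGKILL exit does not prove an out-of-memory kill on its own; confirm it against the container's stop reason and memory metrics.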
Cannot pull image from ECR
Immediate action
Verify ECR permissions and VPC endpoint
Commands
aws ecr describe-images --repository-name my-repo --query 'imageDetails[0].imageTags'
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole --query 'AttachedPolicies[].PolicyArn'
Fix now
Ensure AmazonECSTaskExecutionRolePolicy is attached. If in private subnet, create ECR VPC endpoint (com.amazonaws.region.ecr.api + com.amazonaws.region.ecr.dkr).
High Fargate costs
Immediate action
Analyze actual vs allocated resource usage
Commands
aws cloudwatch get-metric-statistics --namespace AWS/ECS --metric-name CPUUtilization --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average
aws cloudwatch get-metric-statistics --namespace AWS/ECS --metric-name MemoryUtilization --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average
Fix now
If utilization is consistently below 40%, reduce the task's CPU and memory. Consider Fargate Spot for interruption-tolerant, non-critical workloads (up to 70% savings).
Fargate vs EC2 vs Lambda for Container Workloads
| Feature | Fargate | EC2 Launch Type | Lambda (Container Images) |
| --- | --- | --- | --- |
| Server Management | Fully managed by AWS | You manage instances | Fully managed by AWS |
| Max Memory | Up to 120 GB per task | Depends on instance type | Up to 10 GB |
| Max vCPU | Up to 16 per task | Depends on instance type | Up to 6 vCPU |
| Execution Duration | Unlimited | Unlimited | 15 minutes max |
| Networking | ENI per task in VPC | Shared instance ENI (bridge/host mode) or ENI per task (awsvpc mode) | VPC optional |
| Cold Start | 30-90 seconds for new tasks | None (instances running) | 1-3 seconds |
| Cost Model | Per second for vCPU + memory | Per second for running instances, regardless of task utilization | Per invocation + duration |
| Best For | Steady microservices, APIs | Cost-optimized at scale | Event-driven, short tasks |

Key takeaways

1
Fargate runs containers without managing servers: you define tasks, AWS handles infrastructure
2
Each task gets its own ENI and resource isolation: plan subnet CIDR blocks for peak task count
3
Right-sizing CPU and memory based on actual utilization can save 30-50% of compute costs
4
Enable deployment circuit breaker with rollback on every ECS Fargate service
5
VPC endpoints for ECR, S3, and CloudWatch keep AWS service traffic off the NAT gateway and remove its data processing charges for that traffic
6
Structured logging (for example via FireLens) is essential: Fargate gives you no SSH access to the host for debugging

Common mistakes to avoid

6 patterns
×

Over-provisioning CPU and memory per Fargate task

Symptom
CloudWatch shows 15% CPU utilization and 25% memory utilization: you are paying for roughly 85% unused CPU and 75% unused memory across all tasks
Fix
Review Container Insights metrics for p95 utilization over 14 days. Size each task at roughly 1.3x its p95 usage, rounded up to the next valid Fargate size. Valid Fargate CPU values: 0.25, 0.5, 1, 2, 4, 8, 16 vCPU.
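As a rough illustration of the 1.3x rule, the sketch below picks the smallest valid Fargate CPU size that covers a p95 measurement with 30% headroom. `right_size_cpu` is a hypothetical helper. It assumes the input is in CPU units (1024 units = 1 vCPU), which is how Container Insights reports `CpuUtilized`. It does not check that the memory setting is a valid combination for the chosen CPU size.

```shell
# Hypothetical helper: smallest valid Fargate CPU size, in CPU units,
# that covers p95 usage with 1.3x headroom.
right_size_cpu() {
  p95_units=$1
  # Integer ceiling of p95 * 1.3 (shell arithmetic has no floats).
  target=$(( (p95_units * 13 + 9) / 10 ))
  # Valid sizes: 0.25, 0.5, 1, 2, 4, 8, 16 vCPU expressed in CPU units.
  for size in 256 512 1024 2048 4096 8192 16384; do
    if [ "$size" -ge "$target" ]; then
      echo "$size"
      return 0
    fi
  done
  echo "p95 usage exceeds the largest Fargate task size" >&2
  return 1
}

right_size_cpu 150   # target 195 units, prints 256
right_size_cpu 400   # target 520 units, prints 1024
```

The second call shows why rounding matters: 400 units with headroom lands just above 512, so the task moves up to a full vCPU.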
×

Running Fargate tasks in subnets too small for peak task count

Symptom
New tasks stuck in PENDING status — no error message, tasks never transition to RUNNING
Fix
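Each task requires one ENI with a private IP. Use /20 subnets (4091 usable IPs) at minimum. AvailableIpAddressCount is a field returned by describe-subnets, not a built-in CloudWatch metric, so poll it on a schedule, publish it as a custom metric, and alarm when less than 20% of the subnet's addresses remain.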
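The subnet arithmetic can be checked with a short sketch. It assumes only that AWS reserves five addresses in every subnet. The function names are hypothetical, and the 20% figure is the alarm threshold suggested above.

```shell
# Usable addresses in a subnet of the given prefix length.
# AWS reserves 5 addresses per subnet.
usable_ips() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Alarm threshold: 20% of usable addresses, rounded down.
alarm_threshold() {
  echo $(( $(usable_ips "$1") * 20 / 100 ))
}

usable_ips 24        # prints 251
usable_ips 20        # prints 4091
alarm_threshold 20   # prints 818
```

Since each task takes one address, the usable count is also the hard ceiling on concurrent tasks in that subnet. Rolling deployments briefly run old and new tasks side by side, so the practical ceiling is lower.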
×

Not enabling deployment circuit breaker on ECS services

Symptom
Bad deployment replaces all healthy tasks with failing ones — entire service goes down with no automatic recovery
Fix
Enable DeploymentCircuitBreaker with rollback in the ECS service definition. This automatically reverts to the previous task definition when new tasks fail.
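One way to turn this on for an existing service is the `--deployment-configuration` shorthand of `aws ecs update-service`, sketched below. The wrapper function and the `my-cluster` / `my-service` names are placeholders. If your service relies on non-default `maximumPercent` or `minimumHealthyPercent` values, confirm they are unchanged after the update.

```shell
# Shorthand value for --deployment-configuration: enable the circuit
# breaker and roll back automatically when a deployment fails.
circuit_breaker_config='deploymentCircuitBreaker={enable=true,rollback=true}'

# Hypothetical wrapper: $1 = cluster name, $2 = service name.
enable_circuit_breaker() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --deployment-configuration "$circuit_breaker_config"
}

# Usage (needs AWS credentials, so it is left commented out):
#   enable_circuit_breaker my-cluster my-service
```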
×

Using default log driver without structured logging

Symptom
Debugging production issues means grepping through raw text logs, with no correlation IDs, no structured fields, and no multi-destination routing
Fix
Use FireLens with Fluent Bit as log router. Emit structured JSON logs with correlation IDs, service name, and request context. Set log retention policy on CloudWatch log groups.
×

Not creating VPC endpoints for AWS services

Symptom
All ECR image pulls and Secrets Manager fetches route through NAT gateway — high data processing charges and single point of failure
Fix
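Create VPC endpoints for ECR (api + dkr), S3 (gateway), CloudWatch Logs, and Secrets Manager. This removes NAT gateway data processing charges for AWS service traffic. Interface endpoints carry their own hourly and per-GB fees, so compare the two before rolling them out everywhere.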
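The endpoint service names follow a regular `com.amazonaws.<region>.<service>` pattern, so the list for a given region can be generated instead of typed by hand. This is a minimal sketch with a hypothetical function name. The S3 entry is normally created as a gateway endpoint and the others as interface endpoints.

```shell
# Print the VPC endpoint service names a private-subnet Fargate task
# typically needs, for the region passed as $1.
fargate_endpoint_services() {
  region=$1
  for svc in ecr.api ecr.dkr s3 logs secretsmanager; do
    echo "com.amazonaws.${region}.${svc}"
  done
}

fargate_endpoint_services us-east-1
```

Each printed name can be passed as `--service-name` to `aws ec2 create-vpc-endpoint`, together with your own VPC, subnet, and security group IDs.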
×

Running production and development Fargate tasks in the same cluster without namespace isolation

Symptom
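Development and production share one blast radius. Fargate isolates compute per task, so the risk is not noisy neighbors on a host but shared subnets, IAM scope, and deployment tooling: a runaway development service can consume the subnet IPs that production needs to scale.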
Fix
Use separate ECS clusters for production and development. Tag resources with environment labels. Use cluster capacity providers to isolate Fargate and Fargate Spot workloads.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
What is AWS Fargate and how does it differ from running containers on EC...
Q02 · SENIOR
How would you optimize Fargate costs for a production microservices arch...
Q03 · SENIOR
A production Fargate service is experiencing intermittent task placement...
Q01 of 03 · JUNIOR

What is AWS Fargate and how does it differ from running containers on EC2?

ANSWER
AWS Fargate is a serverless compute engine for containers that works with Amazon ECS and Amazon EKS. The key differences from EC2 launch type:
1. Server management: With Fargate, AWS manages the underlying infrastructure, so there are no EC2 instances to provision, patch, or scale. With EC2, you manage the instance fleet.
2. Resource model: Fargate allocates resources per task (vCPU and memory). EC2 allocates resources per instance, and tasks share instance capacity.
3. Isolation: Each Fargate task gets its own kernel runtime and ENI. On EC2, tasks share the host kernel and, in bridge or host network mode, the instance's network interface.
4. Pricing: Fargate charges per second for allocated task resources. EC2 charges for every second an instance runs, regardless of how much of it your tasks use.
5. Scaling: Fargate provisions capacity for each new task automatically, so you scale only the task count. EC2 also requires Auto Scaling groups to scale the instance fleet.
Fargate is best for variable workloads and operational simplicity. EC2 is better for cost optimization at steady, predictable scale.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Is AWS Fargate really serverless?
02
What is the maximum size of a Fargate task?
03
Can Fargate tasks communicate with each other?
04
How does Fargate handle persistent storage?
05
Should I use Fargate Spot for production workloads?