
AWS Fargate: The Complete Guide to Serverless Containers on ECS and EKS

Learn AWS Fargate: serverless compute for containers on ECS and EKS.
⚙️ Intermediate: basic DevOps knowledge assumed
In this tutorial, you'll learn
  • Fargate runs containers without managing servers: you define tasks, AWS handles the infrastructure
  • Each task gets its own ENI and resource isolation: plan subnet CIDR blocks for peak task count
  • Right-sizing CPU and memory based on actual utilization saves 30-50% of compute costs
⚡ Quick Answer
  • Fargate is AWS's serverless compute engine that runs containers without managing servers
  • You define CPU and memory per task; Fargate provisions and scales the infrastructure automatically
  • Works with both Amazon ECS and Amazon EKS for orchestration
  • You pay per second for vCPU and memory resources allocated to running tasks
  • Production use requires careful networking, IAM, and logging configuration
  • Biggest mistake: over-provisioning CPU and memory per task, inflating costs by 3-5x
🚨 START HERE
AWS Fargate Quick Debug Reference
Fast commands for diagnosing Fargate issues
🟡 Task stuck in PENDING
Immediate Action: Check task stop reason and subnet capacity
Commands
aws ecs describe-tasks --cluster my-cluster --tasks $(aws ecs list-tasks --cluster my-cluster --desired-status STOPPED --query 'taskArns[0]' --output text) --query 'tasks[0].{status:lastStatus,stopReason:stopReason,attachments:attachments[0].details}'
aws ec2 describe-subnets --subnet-ids subnet-xxxxx --query 'Subnets[0].AvailableIpAddressCount'
Fix Now: If IPs are exhausted, expand the subnet CIDR or spread tasks across more subnets. If the execution role is missing, attach the ecsTaskExecutionRole policy.
🟡 Container crashes on startup
Immediate Action: Check container logs and exit code
Commands
aws logs get-log-events --log-group-name /ecs/my-task --log-stream-name ecs/my-container/TASK_ID --limit 50
aws ecs describe-tasks --cluster my-cluster --tasks TASK_ARN --query 'tasks[0].containers[0].{exitCode:exitCode,reason:reason}'
Fix Now: Exit code 137 means the container was OOM-killed; increase memory in the task definition. Exit code 1 means an application error; check the container entrypoint and environment variables.
🟡 Cannot pull image from ECR
Immediate Action: Verify ECR permissions and VPC endpoints
Commands
aws ecr describe-images --repository-name my-repo --query 'imageDetails[0].imageTags'
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole --query 'AttachedPolicies[].PolicyArn'
Fix Now: Ensure AmazonECSTaskExecutionRolePolicy is attached. If in a private subnet, create ECR VPC endpoints (com.amazonaws.region.ecr.api and com.amazonaws.region.ecr.dkr) plus the S3 gateway endpoint, which ECR uses for image layers.
🟡 High Fargate costs
Immediate Action: Analyze actual vs allocated resource usage
Commands
aws cloudwatch get-metric-statistics --namespace ECS/ContainerInsights --metric-name CpuUtilized --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average
aws cloudwatch get-metric-statistics --namespace ECS/ContainerInsights --metric-name MemoryUtilized --dimensions Name=ClusterName,Value=my-cluster Name=ServiceName,Value=my-service --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) --end-time $(date -u +%Y-%m-%dT%H:%M:%S) --period 3600 --statistics Average
Fix Now: If utilization is consistently below 40%, reduce task CPU/memory. Consider Fargate Spot for non-critical workloads (up to 70% savings).
Production Incident: Fargate Task ENI Exhaustion Blocked All New Deployments
A microservices deployment on Fargate failed silently for 4 hours because the VPC subnets ran out of available IP addresses for task ENIs.
Symptom: New Fargate tasks stuck in PENDING status indefinitely. ECS service deployments hung at 0% progress. No error messages in CloudWatch; tasks simply never transitioned to RUNNING.
Assumption: Fargate capacity was temporarily unavailable in the us-east-1a Availability Zone.
Root cause: Each Fargate task requires an elastic network interface (ENI) with a private IP address in the VPC subnet. The team used /24 subnets (251 usable IPs each) across two Availability Zones. With 120 tasks running and each task consuming one ENI, plus 30 ENIs consumed by NAT gateways, ALBs, and other VPC resources, the subnets were exhausted. New tasks could not be placed because no IP addresses were available. The team had no monitoring on subnet IP utilization.
Fix: Expanded subnets to /20 (4091 usable IPs) per AZ using secondary CIDR blocks. Added a CloudWatch alarm on a custom metric publishing each subnet's AvailableIpAddressCount (there is no built-in CloudWatch metric for subnet IPs). Set the alarm threshold at 20% remaining IPs. Added subnet IP utilization to the weekly capacity review dashboard.
Key Lesson
  • Each Fargate task consumes one ENI with a private IP: plan subnet sizing for peak task count plus infrastructure overhead
  • Monitor subnet AvailableIpAddressCount and alert before exhaustion
  • Use /20 or larger subnets for Fargate workloads to avoid IP exhaustion
  • For high-density task deployments, add subnets or secondary CIDR blocks before hitting the limit
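The subnet arithmetic behind this incident is worth making concrete. Below is a minimal sketch (plain Python, no AWS API; the function names are illustrative) that checks whether a subnet prefix can hold a planned peak task count, given that AWS reserves 5 addresses per subnet and each Fargate task consumes one IP:

```python
def usable_ips(cidr_prefix: int) -> int:
    """Usable IPs in an AWS subnet: total addresses minus the 5 AWS reserves."""
    return 2 ** (32 - cidr_prefix) - 5

def subnet_fits(cidr_prefix: int, peak_tasks: int, infra_enis: int = 0,
                headroom: float = 0.2) -> bool:
    """Each Fargate task consumes one ENI/IP; keep headroom for growth and
    count non-task ENIs (NAT gateways, ALBs, endpoints) as overhead."""
    required = int((peak_tasks + infra_enis) * (1 + headroom))
    return usable_ips(cidr_prefix) >= required

print(usable_ips(24))                   # 251
print(usable_ips(20))                   # 4091
print(subnet_fits(24, peak_tasks=300))  # False
print(subnet_fits(20, peak_tasks=300))  # True
```

Running this check against planned peak capacity before choosing subnet sizes avoids the silent PENDING failures described above.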
Production Debug Guide: common symptoms and actions for Fargate production issues
Fargate task stuck in PENDING status → Check subnet available IPs, security group rules, and task execution role permissions. Run: aws ecs describe-tasks --cluster CLUSTER --tasks TASK_ARN --query 'tasks[0].stopReason'
Fargate task starts then exits immediately → Check CloudWatch Logs for the container. Verify the entrypoint and command in the task definition. Ensure the image exists in ECR with correct permissions.
Fargate tasks cannot reach RDS or other AWS services → Verify the task is in a subnet with a NAT gateway or VPC endpoint. Check security group outbound rules. Verify the task role has the required permissions.
Fargate deployment takes 5-10 minutes to replace tasks → Check the health check grace period and deregistration delay on the target group. Reduce the health check interval to 10s and the healthy threshold to 2 for faster detection.
Fargate costs higher than expected → Review task CPU and memory allocation vs actual usage in CloudWatch Container Insights. Right-size tasks by analyzing p95 utilization over 14 days.

AWS Fargate is a serverless compute engine for containers that eliminates the need to provision, configure, or scale virtual machine clusters. You package your application as a container image, define resource requirements, and Fargate runs it on infrastructure managed entirely by AWS.

Fargate shifts operational burden from managing EC2 instance fleets to defining task-level resource requirements. This simplifies capacity planning but introduces new challenges around networking configuration, IAM task roles, cold start latency, and cost optimization at scale. Production deployments require understanding these trade-offs before committing to Fargate over EC2 launch type.

What Is AWS Fargate?

AWS Fargate is a serverless compute engine for containers that works with both Amazon ECS and Amazon EKS. It removes the need to manage EC2 instances: you define container images, CPU, memory, and networking requirements, and Fargate provisions the underlying infrastructure to run your containers.

Fargate assigns each task its own kernel runtime environment and elastic network interface (ENI). This provides task-level isolation comparable to running containers on dedicated EC2 instances, without the operational overhead of managing the instance fleet.

The core abstraction is the task: a set of one or more containers that share a network namespace and storage volumes. You define tasks in a task definition, which specifies the container image, resource requirements, IAM roles, logging configuration, and networking mode. ECS or EKS schedules these tasks onto Fargate-managed infrastructure.

io.thecodeforge.fargate.task_definition.json · JSON
{
  "family": "io-thecodeforge-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "NODE_ENV", "value": "production"},
        {"name": "LOG_LEVEL", "value": "info"}
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-url:DATABASE_URL::"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/io-thecodeforge-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}
Mental Model
Fargate as Serverless Container Hosting
Fargate abstracts away the EC2 layer: you specify containers and resources, AWS handles everything below.
  • Each task gets its own ENI, kernel, and resource isolation: no shared host contention
  • You define CPU and memory per task, not per cluster: capacity planning is task-level
  • Fargate works with ECS and EKS: the same serverless model for both orchestrators
  • You pay per second for vCPU and memory allocated to running tasks only
  • No SSH access to underlying infrastructure: all debugging happens through logs and APIs
📊 Production Insight
Fargate tasks are immutable: you cannot SSH into them for debugging.
All troubleshooting must happen through CloudWatch Logs and ECS APIs.
Rule: invest in structured logging and health checks before deploying to Fargate.
🎯 Key Takeaway
Fargate runs containers without managing servers: you define tasks, AWS handles infrastructure.
Each task gets its own network interface and resource isolation.
Choose Fargate for operational simplicity, EC2 for cost optimization at steady scale.
Fargate vs EC2 Launch Type
If: Workload has predictable, steady-state traffic
→ Consider EC2 with Savings Plans: lower cost at consistent utilization
If: Workload is bursty or has variable scaling patterns
→ Use Fargate: pay only for running tasks, no idle instance cost
If: Tasks require GPU, very large memory, or specific instance types
→ Use EC2: Fargate caps out at 16 vCPU / 120 GB per task and has no GPU support
If: Team wants minimal operational overhead
→ Use Fargate: no instance patching, AMI management, or capacity planning

Fargate Networking and Security

Fargate tasks run in awsvpc mode: each task receives its own elastic network interface (ENI) with a private IP address in your VPC subnet. This provides VPC-level security controls through security groups and network ACLs, but requires careful subnet planning to avoid IP exhaustion.

Networking decisions have cost and performance implications. Tasks in private subnets require a NAT gateway for outbound internet access, which adds data processing charges. VPC endpoints for AWS services (S3, ECR, CloudWatch, Secrets Manager) eliminate NAT gateway costs for service-to-service communication.
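The NAT-versus-endpoint trade-off comes down to simple arithmetic. The sketch below compares the two using approximate us-east-1 list prices ($0.045/GB NAT data processing; roughly $0.01/hour per AZ plus $0.01/GB for an interface endpoint). Treat these numbers as assumptions and check current AWS pricing; the NAT gateway's own hourly charge is omitted since the gateway is often needed for other traffic anyway.

```python
def nat_data_cost(gb_per_month: float, per_gb: float = 0.045) -> float:
    """NAT gateway data processing charge (assumed us-east-1 list price)."""
    return gb_per_month * per_gb

def endpoint_cost(gb_per_month: float, azs: int = 2,
                  hourly: float = 0.01, per_gb: float = 0.01) -> float:
    """Interface VPC endpoint: hourly charge per AZ ENI plus per-GB processing
    (assumed list prices), over a ~730-hour month."""
    return azs * hourly * 730 + gb_per_month * per_gb

# 500 GB/month of ECR image pulls: through NAT vs an ECR interface endpoint.
print(round(nat_data_cost(500), 2))   # 22.5
print(round(endpoint_cost(500), 2))   # 19.6
```

The endpoint's fixed hourly cost means NAT can be cheaper at very low volumes, but the gap widens quickly as monthly data grows.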

Security follows the principle of least privilege through two IAM roles per task: the execution role (pulling images, writing logs, fetching secrets) and the task role (application-level AWS API access). Separating these roles ensures the task can only access resources it actually needs.

io.thecodeforge.fargate.networking.yml · YAML
# CloudFormation snippet for Fargate networking infrastructure
# Shows VPC endpoints, subnets, and security groups

Resources:
  # Private subnets for Fargate tasks
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.0.0/20  # 4091 usable IPs for Fargate tasks
      AvailabilityZone: us-east-1a
      Tags:
        - Key: Name
          Value: fargate-private-a

  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.16.0/20
      AvailabilityZone: us-east-1b
      Tags:
        - Key: Name
          Value: fargate-private-b

  # NAT Gateway for outbound internet access
  NatGateway:
    Type: AWS::EC2::NatGateway
    Properties:
      AllocationId: !GetAtt NatEip.AllocationId
      SubnetId: !Ref PublicSubnetA

  # VPC Endpoint for ECR (avoids NAT gateway charges)
  EcrApiEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.ecr.api
      VpcEndpointType: Interface
      PrivateDnsEnabled: true
      # A matching com.amazonaws.${AWS::Region}.ecr.dkr endpoint is also
      # required for image pulls, plus the S3 gateway endpoint below
      SubnetIds:
        - !Ref PrivateSubnetA
        - !Ref PrivateSubnetB
      SecurityGroupIds:
        - !Ref VpcEndpointSg

  S3Endpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcId: !Ref Vpc
      ServiceName: !Sub com.amazonaws.${AWS::Region}.s3
      VpcEndpointType: Gateway
      RouteTableIds:
        - !Ref PrivateRouteTable

  # Security group for Fargate tasks
  FargateTaskSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for Fargate tasks
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 8080
          ToPort: 8080
          SourceSecurityGroupId: !Ref AlbSg
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0  # HTTPS outbound for ECR, Secrets Manager
        - IpProtocol: tcp
          FromPort: 5432
          ToPort: 5432
          SourceSecurityGroupId: !Ref DatabaseSg

  # Security group for VPC endpoints
  VpcEndpointSg:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow HTTPS from Fargate tasks to VPC endpoints
      VpcId: !Ref Vpc
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref FargateTaskSg
⚠ Fargate Networking Pitfalls
📊 Production Insight
VPC endpoints eliminate NAT gateway data processing charges for AWS service access.
A single ECR image pull through NAT costs ~$0.045/GB; at scale this adds up fast.
Rule: create VPC endpoints for ECR, S3, CloudWatch, and Secrets Manager immediately.
🎯 Key Takeaway
Fargate tasks run in awsvpc mode with dedicated ENIs in your VPC subnets.
VPC endpoints for AWS services eliminate NAT gateway costs and improve reliability.
Separate the execution role (infrastructure) from the task role (application) for least privilege.

Fargate Pricing and Cost Optimization

Fargate pricing is based on vCPU and memory resources allocated to running tasks, billed per second with a one-minute minimum. This model eliminates idle capacity costs but requires right-sizing tasks to avoid over-provisioning.

Cost optimization in Fargate centers on three levers: right-sizing CPU and memory allocations, using Fargate Spot for fault-tolerant workloads, and consolidating containers into fewer, larger tasks. Most production Fargate deployments overspend by 30-50% due to inflated resource requests that do not match actual utilization.

Fargate Spot provides up to 70% cost reduction for interrupt-tolerant workloads like batch processing, CI/CD pipelines, and stateless workers. Spot tasks can be interrupted with two minutes' notice, requiring graceful shutdown handling in your application.
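Graceful shutdown in practice means catching SIGTERM and draining work before ECS follows up with SIGKILL. A minimal sketch of that pattern (the class and function names are illustrative, not an AWS API):

```python
import signal

class GracefulShutdown:
    """Flip a flag on SIGTERM so the worker loop can finish its current
    unit of work and exit cleanly before ECS sends SIGKILL."""
    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGTERM, self._handle)
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop_requested = True

def run_worker(shutdown: GracefulShutdown, process_batch) -> int:
    """Process batches until shutdown is requested; returns batches completed."""
    done = 0
    while not shutdown.stop_requested:
        process_batch()
        done += 1
    # Flush buffers / ack in-flight messages here before exiting.
    return done
```

In a real task entrypoint, process_batch would poll a queue or serve requests; note that ECS allows only a limited stopTimeout window between SIGTERM and SIGKILL, so drain work must fit inside it.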

io.thecodeforge.fargate.cost_analysis.py · PYTHON
from dataclasses import dataclass
from typing import List, Dict
from enum import Enum


class PricingRegion(Enum):
    US_EAST_1 = "us-east-1"
    EU_WEST_1 = "eu-west-1"
    AP_SOUTHEAST_1 = "ap-southeast-1"


@dataclass
class FargatePricing:
    """Fargate pricing per hour for a region."""
    vcpu_per_hour: float
    memory_per_gb_hour: float
    spot_discount: float = 0.70


@dataclass
class TaskConfiguration:
    """A Fargate task's resource allocation."""
    name: str
    vcpu: float
    memory_gb: float
    task_count: int
    hours_per_day: float = 24.0
    is_spot_eligible: bool = False


class FargateCostCalculator:
    """
    Calculates and optimizes Fargate costs.
    """

    PRICING = {
        PricingRegion.US_EAST_1: FargatePricing(
            vcpu_per_hour=0.04048,
            memory_per_gb_hour=0.004445
        ),
        PricingRegion.EU_WEST_1: FargatePricing(
            vcpu_per_hour=0.04655,
            memory_per_gb_hour=0.005112
        ),
    }

    def __init__(self, region: PricingRegion):
        self.pricing = self.PRICING[region]

    def task_hourly_cost(self, task: TaskConfiguration) -> float:
        """
        Calculate the hourly cost of a single Fargate task.
        """
        base_cost = (
            task.vcpu * self.pricing.vcpu_per_hour
            + task.memory_gb * self.pricing.memory_per_gb_hour
        )
        if task.is_spot_eligible:
            return base_cost * (1 - self.pricing.spot_discount)
        return base_cost

    def monthly_cost(self, task: TaskConfiguration) -> float:
        """
        Calculate the monthly cost for all instances of a task.
        """
        hourly = self.task_hourly_cost(task)
        return hourly * task.hours_per_day * 30 * task.task_count

    def total_monthly_cost(self, tasks: List[TaskConfiguration]) -> Dict:
        """
        Calculate total monthly cost breakdown.
        """
        breakdown = {}
        total = 0.0

        for task in tasks:
            cost = self.monthly_cost(task)
            breakdown[task.name] = {
                "task_cost_hourly": round(self.task_hourly_cost(task), 4),
                "monthly_cost": round(cost, 2),
                "task_count": task.task_count,
                "vcpu_per_task": task.vcpu,
                "memory_gb_per_task": task.memory_gb,
                "spot_eligible": task.is_spot_eligible,
            }
            total += cost

        return {
            "tasks": breakdown,
            "total_monthly_cost": round(total, 2),
            "vcpu_price_per_hour": self.pricing.vcpu_per_hour,
        }

    def right_size_recommendation(
        self,
        task: TaskConfiguration,
        actual_cpu_utilization_pct: float,
        actual_memory_utilization_pct: float
    ) -> Dict:
        """
        Recommend right-sized task configuration based on actual utilization.
        """
        recommended_vcpu = task.vcpu * (actual_cpu_utilization_pct / 100) * 1.3
        recommended_memory = task.memory_gb * (actual_memory_utilization_pct / 100) * 1.3

        # Snap to valid Fargate sizes (simplified: real Fargate also constrains
        # which memory values are allowed for each vCPU tier)
        valid_vcpu = [0.25, 0.5, 1, 2, 4, 8, 16]
        valid_memory = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 30, 48, 64, 96, 120]

        recommended_vcpu = min(
            (v for v in valid_vcpu if v >= recommended_vcpu),
            default=max(valid_vcpu),
        )
        recommended_memory = min(
            (m for m in valid_memory if m >= recommended_memory),
            default=max(valid_memory),
        )

        current_cost = self.monthly_cost(task)
        right_sized_task = TaskConfiguration(
            name=task.name,
            vcpu=recommended_vcpu,
            memory_gb=recommended_memory,
            task_count=task.task_count,
            hours_per_day=task.hours_per_day,
            is_spot_eligible=task.is_spot_eligible,
        )
        new_cost = self.monthly_cost(right_sized_task)

        return {
            "current": {
                "vcpu": task.vcpu,
                "memory_gb": task.memory_gb,
                "monthly_cost": round(current_cost, 2),
            },
            "recommended": {
                "vcpu": recommended_vcpu,
                "memory_gb": recommended_memory,
                "monthly_cost": round(new_cost, 2),
            },
            "savings_monthly": round(current_cost - new_cost, 2),
            "savings_pct": round((1 - new_cost / current_cost) * 100, 1),
        }


# Example usage
calculator = FargateCostCalculator(PricingRegion.US_EAST_1)

tasks = [
    TaskConfiguration("api-server", 1, 2, 10, 24, False),
    TaskConfiguration("worker", 0.5, 1, 20, 24, True),
    TaskConfiguration("scheduler", 0.25, 0.5, 2, 24, False),
]

result = calculator.total_monthly_cost(tasks)
print(f"Total monthly cost: ${result['total_monthly_cost']}")

# Right-sizing recommendation
recommendation = calculator.right_size_recommendation(
    tasks[0],
    actual_cpu_utilization_pct=25,
    actual_memory_utilization_pct=40
)
print(f"Right-size savings: ${recommendation['savings_monthly']}/month ({recommendation['savings_pct']}%)")
💡 Fargate Cost Optimization Strategies
  • Right-size tasks using CloudWatch Container Insights: most deployments over-provision by 2-3x
  • Use Fargate Spot for batch jobs, CI/CD workers, and stateless background tasks (70% savings)
  • Consolidate sidecar containers into the main task to reduce per-task overhead
  • Schedule non-production tasks to stop outside business hours using EventBridge + Lambda
  • Use Compute Savings Plans for predictable Fargate workloads (up to 17% savings)
📊 Production Insight
Fargate Spot interruptions come with a two-minute warning via ECS.
Applications must handle SIGTERM gracefully to avoid data loss.
Rule: implement graceful shutdown hooks before using Fargate Spot in production.
🎯 Key Takeaway
Fargate charges per second for allocated vCPU and memory, not actual usage.
Right-sizing tasks based on real utilization saves 30-50% of compute costs.
Fargate Spot provides 70% savings for interrupt-tolerant workloads.

Deploying ECS Services on Fargate

Production ECS services on Fargate require a deployment configuration that handles rolling updates, health checks, auto-scaling, and service discovery. The ECS service abstraction manages task placement, desired count, and deployment strategy across Fargate-managed infrastructure.

Rolling updates with the circuit breaker pattern prevent failed deployments from replacing healthy tasks. The circuit breaker monitors task health and automatically rolls back if new tasks fail to start. Combined with health check grace periods, this prevents cascading failures during deployments.

Auto-scaling on Fargate adjusts the desired task count based on CloudWatch metrics: CPU utilization, memory utilization, request count, or custom metrics via Application Auto Scaling. Scaling policies should use target tracking for steady-state adjustments and step scaling for rapid traffic spikes.
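Target tracking can be understood as proportional scaling: the new desired count is roughly the current count scaled by the ratio of the actual metric to the target, clamped to the service's min/max. The sketch below illustrates that math; it is a simplification, not the exact ECS algorithm, which also applies cooldowns and disable-scale-in logic.

```python
import math

def target_tracking_desired(current_tasks: int, metric_value: float,
                            target_value: float, min_tasks: int,
                            max_tasks: int) -> int:
    """Approximate target-tracking scaling: keep the metric near its target
    by scaling the task count proportionally, clamped to [min, max]."""
    desired = math.ceil(current_tasks * metric_value / target_value)
    return max(min_tasks, min(max_tasks, desired))

# 3 tasks at 90% average CPU against a 65% target: scale out to 5.
print(target_tracking_desired(3, 90, 65, min_tasks=3, max_tasks=20))   # 5
# 10 tasks at 20% CPU: scale in toward 4 (cooldowns slow this in practice).
print(target_tracking_desired(10, 20, 65, min_tasks=3, max_tasks=20))  # 4
```

This is why a 65% target leaves headroom: a sudden jump to full CPU still produces a bounded, proportional scale-out rather than thrashing.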

io.thecodeforge.fargate.ecs_service.yml · YAML
# CloudFormation for production ECS Fargate service
# Includes deployment circuit breaker, auto-scaling, and service discovery

Resources:
  EcsCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: io-thecodeforge-production
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      ServiceConnectDefaults:
        Namespace: io-thecodeforge.local

  EcsService:
    Type: AWS::ECS::Service
    DependsOn:
      - AlbListener
    Properties:
      ServiceName: io-thecodeforge-api
      Cluster: !Ref EcsCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      LaunchType: FARGATE
      PlatformVersion: LATEST
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: DISABLED
          SecurityGroups:
            - !Ref FargateTaskSg
          Subnets:
            - !Ref PrivateSubnetA
            - !Ref PrivateSubnetB
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref ApiTargetGroup
      ServiceRegistries:
        - RegistryArn: !GetAtt ServiceDiscoveryService.Arn
      HealthCheckGracePeriodSeconds: 120

  # Auto-scaling target
  ScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 20
      MinCapacity: 3
      ResourceId: !Sub service/${EcsCluster}/${EcsService.Name}
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt AutoScalingRole.Arn

  # Target tracking scaling policy
  CpuScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: cpu-target-tracking
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref ScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 65
        PredefinedMetricSpecification:
          PredefinedMetricType: ECSServiceAverageCPUUtilization
        ScaleInCooldown: 300
        ScaleOutCooldown: 60

  # Service discovery
  ServiceDiscoveryService:
    Type: AWS::ServiceDiscovery::Service
    Properties:
      Name: api
      DnsConfig:
        NamespaceId: !Ref ServiceDiscoveryNamespace
        DnsRecords:
          - TTL: 10
            Type: A
      HealthCheckCustomConfig:
        FailureThreshold: 1

  # CloudWatch alarm for failed deployments
  FailedDeploymentAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: fargate-api-failed-tasks
      AlarmDescription: Alert when Fargate tasks fail to start
      Namespace: AWS/ECS
      MetricName: RunningTaskCount
      Dimensions:
        - Name: ClusterName
          Value: !Ref EcsCluster
        - Name: ServiceName
          Value: !Ref EcsService
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 3
      Threshold: 2
      ComparisonOperator: LessThanThreshold
      AlarmActions:
        - !Ref OpsSnsTopic
Mental Model
Fargate Deployment Safety Net
The deployment circuit breaker automatically rolls back failed deployments: you never manually recover from a bad release.
  • Circuit breaker monitors new task health during rolling deployments
  • If tasks fail to start, ECS automatically rolls back to the previous task definition
  • Health check grace period gives containers time to initialize before health checks begin
  • MinimumHealthyPercent: 100 ensures zero-downtime deployments: old tasks stay until new ones are healthy
  • Auto-scaling adjusts task count based on CPU, memory, or custom CloudWatch metrics
📊 Production Insight
Deployment circuit breaker is the most important Fargate production feature.
Without it, a bad image push replaces all healthy tasks with failing ones.
Rule: enable circuit breaker with rollback on every ECS Fargate service.
🎯 Key Takeaway
ECS services on Fargate need circuit breaker, health checks, and auto-scaling.
Circuit breaker with rollback prevents bad deployments from taking down production.
Auto-scaling on CPU utilization at a 65% target provides headroom for traffic spikes.

Fargate Logging and Observability

Production Fargate workloads require structured logging, distributed tracing, and container-level metrics. Since Fargate provides no SSH access, all observability must be configured through the task definition and external services before deployment.

CloudWatch Logs is the default log driver, but production systems benefit from FireLens, a Fluent Bit-based log router that supports structured JSON output, multi-destination routing, and log filtering. FireLens can send logs to CloudWatch, Datadog, Splunk, or Elasticsearch simultaneously.

Container Insights provides CPU, memory, disk, and network metrics per task and container. Combined with X-Ray for distributed tracing, this creates a complete observability stack for Fargate microservices.

io.thecodeforge.fargate.observability.json · JSON
{
  "family": "io-thecodeforge-api-observed",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/apiTaskRole",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/io-thecodeforge-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ],
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch",
          "region": "us-east-1",
          "log_group_name": "/ecs/io-thecodeforge-api",
          "log_stream_prefix": "ecs",
          "auto_create_group": "true",
          "log_key": "log"
        }
      },
      "dependsOn": [
        {
          "containerName": "log-router",
          "condition": "START"
        }
      ]
    },
    {
      "name": "xray",
      "image": "amazon/aws-xray-daemon:latest",
      "essential": false,
      "cpu": 32,
      "memoryReservation": 256,
      "portMappings": [
        {
          "containerPort": 2000,
          "protocol": "udp"
        }
      ]
    },
    {
      "name": "log-router",
      "image": "amazon/aws-for-fluent-bit:latest",
      "essential": true,
      "cpu": 32,
      "memoryReservation": 64,
      "firelensConfiguration": {
        "type": "fluentbit",
        "options": {
          "enable-ecs-log-metadata": "true",
          "config-file-type": "file",
          "config-file-value": "/fluent-bit/configs/parse-json.conf"
        }
      }
    }
  ]
}
💡 Fargate Observability Stack
  • Use FireLens (Fluent Bit) as the log router: supports structured JSON and multi-destination output
  • Enable Container Insights on the ECS cluster for per-task CPU, memory, and network metrics
  • Add the X-Ray sidecar for distributed tracing across microservices
  • Emit structured JSON logs with correlation IDs, never plain-text log lines
  • Set a log retention policy on CloudWatch log groups: the default is infinite retention, which is expensive
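A minimal version of the structured-logging pattern from the bullets above, using only the Python standard library (the field names, such as correlation_id, are conventions for this sketch, not a FireLens requirement):

```python
import json
import logging
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so FireLens/CloudWatch can parse
    fields instead of grepping plain-text lines."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Present only when the caller passes it via `extra=`.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Attach a correlation ID (e.g. from an inbound request header) to each line.
cid = str(uuid.uuid4())
logger.info("order processed", extra={"correlation_id": cid})
```

Because every line is valid JSON, a Fluent Bit JSON parser (or CloudWatch Logs Insights) can filter on correlation_id to follow one request across services.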
📊 Production Insight
Fargate has no SSH: logs are the only window into running containers.
Structured JSON logs with correlation IDs are mandatory for debugging microservices.
Rule: configure logging and tracing in the task definition before first deployment.
🎯 Key Takeaway
Fargate requires pre-configured observability: no SSH means no after-the-fact debugging.
FireLens routes structured logs to multiple destinations with filtering.
Container Insights and X-Ray provide the metrics and tracing stack for production monitoring.
🗂 Fargate vs EC2 vs Lambda for Container Workloads
Choosing the right compute model for your containerized application
Feature            | Fargate                      | EC2 Launch Type           | Lambda (Container Images)
Server Management  | Fully managed by AWS         | You manage instances      | Fully managed by AWS
Max Memory         | Up to 120 GB per task        | Depends on instance type  | Up to 10 GB
Max vCPU           | Up to 16 per task            | Depends on instance type  | Up to 6 vCPU
Execution Duration | Unlimited                    | Unlimited                 | 15 minutes max
Networking         | ENI per task in VPC          | Shared ENI on instance    | VPC optional
Cold Start         | 30-90 seconds for new tasks  | None (instances running)  | 1-3 seconds
Cost Model         | Per second for vCPU + memory | Per hour for instances    | Per invocation + duration
Best For           | Steady microservices, APIs   | Cost-optimized at scale   | Event-driven, short tasks

🎯 Key Takeaways

  • Fargate runs containers without managing servers β€” you define tasks, AWS handles infrastructure
  • Each task gets its own ENI and resource isolation β€” plan subnet CIDR blocks for peak task count
  • Right-sizing CPU and memory based on actual utilization saves 30-50% of compute costs
  • Enable deployment circuit breaker with rollback on every ECS Fargate service
  • VPC endpoints for ECR, S3, and CloudWatch eliminate NAT gateway charges
  • Structured logging via FireLens is mandatory β€” Fargate has no SSH access for debugging

⚠ Common Mistakes to Avoid

    βœ• Over-provisioning CPU and memory per Fargate task
    Symptom

    CloudWatch shows 15% CPU utilization and 25% memory utilization β€” paying for 85% unused resources across all tasks

    Fix

    Review Container Insights metrics for p95 utilization over 14 days. Right-size tasks to 1.3x actual utilization. Valid Fargate CPU values: 0.25, 0.5, 1, 2, 4, 8, 16 vCPU.
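    As an illustration, a service measured at roughly 0.3 vCPU and 600 MB at p95 could be declared at the next valid Fargate combination. The family name and values below are a hypothetical sketch, not this article's deployment:

    ```json
    {
      "family": "api-service",
      "requiresCompatibilities": ["FARGATE"],
      "networkMode": "awsvpc",
      "cpu": "512",
      "memory": "1024"
    }
    ```

    Task-level cpu is expressed in CPU units (1024 = 1 vCPU) and memory in MiB; 0.5 vCPU permits 1-4 GB of memory.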

    βœ• Running Fargate tasks in subnets too small for peak task count
    Symptom

    New tasks stuck in PENDING status β€” no error message, tasks never transition to RUNNING

    Fix

    Each task requires one ENI with a private IP. Use /20 subnets (4,091 usable IPs, since AWS reserves 5 addresses per subnet) at minimum. Monitor AvailableIpAddressCount with CloudWatch alarms at a 20%-remaining threshold.
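    One way to spread ENIs across address space is to list several large subnets in the service's network configuration. The subnet and security group IDs here are placeholders:

    ```json
    {
      "networkConfiguration": {
        "awsvpcConfiguration": {
          "subnets": ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"],
          "securityGroups": ["sg-dddd4444"],
          "assignPublicIp": "DISABLED"
        }
      }
    }
    ```

    ECS distributes task placement across the listed subnets, so IP exhaustion in one subnet does not block all new tasks.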

    βœ• Not enabling deployment circuit breaker on ECS services
    Symptom

    Bad deployment replaces all healthy tasks with failing ones β€” entire service goes down with no automatic recovery

    Fix

    Enable DeploymentCircuitBreaker with rollback in the ECS service definition. This automatically reverts to the previous task definition when new tasks fail.
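    In an ECS service definition this looks roughly like the following; the percentage values are common defaults, not requirements:

    ```json
    {
      "deploymentConfiguration": {
        "deploymentCircuitBreaker": {
          "enable": true,
          "rollback": true
        },
        "maximumPercent": 200,
        "minimumHealthyPercent": 100
      }
    }
    ```

    With rollback enabled, ECS reverts to the last steady-state task definition once the failure threshold is reached, instead of retrying failing tasks indefinitely.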

    βœ• Using default log driver without structured logging
    Symptom

    Debugging production issues requires grep through raw text logs β€” no correlation IDs, no structured fields, no multi-destination routing

    Fix

    Use FireLens with Fluent Bit as log router. Emit structured JSON logs with correlation IDs, service name, and request context. Set log retention policy on CloudWatch log groups.
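    With the FireLens log router in place, the application container points its log driver at it. A sketch of the container-level configuration (region, log group, and prefix are illustrative placeholders):

    ```json
    {
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": {
          "Name": "cloudwatch_logs",
          "region": "us-east-1",
          "log_group_name": "/ecs/my-service",
          "log_stream_prefix": "app-",
          "auto_create_group": "true"
        }
      }
    }
    ```

    The options block is passed through to the Fluent Bit cloudwatch_logs output plugin; swapping Name and options redirects the same logs to a different destination without touching application code.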

    βœ• Not creating VPC endpoints for AWS services
    Symptom

    All ECR image pulls and Secrets Manager fetches route through NAT gateway β€” high data processing charges and single point of failure

    Fix

    Create VPC endpoints for ECR (api + dkr), S3 (gateway), CloudWatch Logs, and Secrets Manager. This eliminates NAT gateway charges for AWS service communication.
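    A single interface endpoint, sketched as a CloudFormation resource (region and IDs are placeholders; repeat for ecr.dkr, logs, and secretsmanager, plus a gateway endpoint for S3):

    ```json
    {
      "EcrApiEndpoint": {
        "Type": "AWS::EC2::VPCEndpoint",
        "Properties": {
          "ServiceName": "com.amazonaws.us-east-1.ecr.api",
          "VpcEndpointType": "Interface",
          "VpcId": "vpc-aaaa1111",
          "SubnetIds": ["subnet-bbbb2222"],
          "SecurityGroupIds": ["sg-cccc3333"],
          "PrivateDnsEnabled": true
        }
      }
    }
    ```

    PrivateDnsEnabled lets the standard AWS service hostnames resolve to the endpoint's private IPs, so no application or SDK changes are needed.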

    βœ• Running production and development Fargate tasks in the same cluster without namespace isolation
    Symptom

    Development task OOM kills affect production task placement. Noisy neighbor issues in shared cluster capacity.

    Fix

    Use separate ECS clusters for production and development. Tag resources with environment labels. Use cluster capacity providers to isolate Fargate and Fargate Spot workloads.

Interview Questions on This Topic

  • Q (Junior): What is AWS Fargate and how does it differ from running containers on EC2?
    AWS Fargate is a serverless compute engine for containers that works with Amazon ECS and Amazon EKS. The key differences from the EC2 launch type:
    1. Server management: with Fargate, AWS manages the underlying infrastructure β€” no EC2 instances to provision, patch, or scale. With EC2, you manage the instance fleet.
    2. Resource model: Fargate allocates resources per task (vCPU and memory). EC2 allocates resources per instance, and tasks share instance capacity.
    3. Isolation: each Fargate task gets its own kernel runtime and ENI. On EC2, tasks share the host kernel and network interface.
    4. Pricing: Fargate charges per second for allocated task resources. EC2 charges per hour for running instances, regardless of task utilization.
    5. Scaling: Fargate scales task count automatically. EC2 requires Auto Scaling Groups to scale the instance fleet.
    Fargate is best for variable workloads and operational simplicity. EC2 is better for cost optimization at steady, predictable scale.
  • Q (Mid-level): How would you optimize Fargate costs for a production microservices architecture?
    Fargate cost optimization follows a systematic approach:
    1. Right-sizing: use Container Insights to measure actual CPU and memory utilization per task over 14 days. Most deployments over-provision by 2-3x. Reduce task resources to 1.3x p95 utilization.
    2. Fargate Spot: use Spot for fault-tolerant workloads β€” background workers, batch jobs, CI/CD pipelines. Spot provides up to 70% savings. Implement graceful shutdown to handle two-minute interruption notices.
    3. Container consolidation: combine sidecar containers (logging, monitoring) into fewer tasks to reduce per-task overhead. Each task has a base cost regardless of how many containers it runs.
    4. VPC endpoints: create endpoints for ECR, S3, CloudWatch, and Secrets Manager to eliminate NAT gateway data processing charges.
    5. Scheduling: stop non-production tasks outside business hours using EventBridge-triggered Lambda functions that set the ECS service desired count to zero.
    6. Compute Savings Plans: purchase 1-year or 3-year Savings Plans for predictable baseline workloads β€” up to 17% savings.
    The most impactful single action is right-sizing β€” it typically saves 30-50% without any functionality changes.
  • Q (Senior): A production Fargate service is experiencing intermittent task placement failures with tasks stuck in PENDING. Walk through your diagnosis process.
    Systematic diagnosis for Fargate PENDING tasks:
    1. Check the task stop reason: aws ecs describe-tasks --cluster CLUSTER --tasks TASK_ARN. Look at stopReason and attachments for ENI creation failures.
    2. Subnet IP exhaustion: each task needs one ENI with a private IP. Check available IPs with aws ec2 describe-subnets --subnet-ids SUBNET_ID. If below 10%, this is the issue. Fix by expanding the CIDR or adding subnets.
    3. Task execution role: verify the ecsTaskExecutionRole has AmazonECSTaskExecutionRolePolicy attached. Without it, tasks cannot pull images from ECR or fetch secrets.
    4. Security group: check that the task security group allows outbound HTTPS (443) for ECR, Secrets Manager, and CloudWatch Logs.
    5. VPC endpoints: in private subnets without a NAT gateway, tasks need VPC endpoints for ECR (api + dkr) and CloudWatch Logs.
    6. Resource constraints: check whether the cluster has sufficient Fargate capacity and verify no service quota limits are hit.
    7. Image availability: verify the container image exists in ECR and the task execution role has ecr:GetAuthorizationToken and ecr:BatchGetImage permissions.
    The most common cause in production is subnet IP exhaustion β€” teams underestimate how many ENIs Fargate tasks consume at scale.

Frequently Asked Questions

Is AWS Fargate really serverless?

Fargate is serverless in the sense that you do not provision, manage, or patch any servers. AWS manages the underlying compute infrastructure entirely. However, unlike Lambda, Fargate tasks run continuously and you are billed for the duration they run, not per invocation. You still need to define networking, IAM, and logging configuration. Fargate removes server management but not infrastructure configuration.

What is the maximum size of a Fargate task?

Fargate supports up to 16 vCPU and 120 GB of memory per task. The valid CPU values are 0.25, 0.5, 1, 2, 4, 8, and 16 vCPU. Each CPU value has a set of valid memory configurations β€” for example, 1 vCPU supports 2-8 GB of memory. A single task can run up to 10 containers that share the task's CPU and memory allocation.
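Expressed in task-definition units (CPU units are vCPU Γ— 1024, memory is in MiB), the largest possible task would be declared roughly like this; the family name is a placeholder:

```json
{
  "family": "max-size-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "16384",
  "memory": "122880"
}
```

Here 16384 CPU units = 16 vCPU and 122880 MiB = 120 GB, the top of Fargate's supported range.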

Can Fargate tasks communicate with each other?

Yes, Fargate tasks communicate through standard networking since each task has its own ENI in the VPC. Tasks can reach each other using private IP addresses, service discovery (Cloud Map), or an internal ALB. For ECS, AWS Service Connect provides service mesh capabilities with automatic service discovery and traffic management.

How does Fargate handle persistent storage?

Fargate supports two storage options: Amazon EFS (Elastic File System) for persistent shared storage across tasks, and ephemeral storage up to 200 GB per task. EFS volumes mount inside containers like a regular filesystem and persist across task restarts. For stateful workloads, EFS is the recommended approach β€” it provides shared, durable storage without managing EBS volumes.
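Mounting EFS into a Fargate task takes a task-level volume plus a container-level mount point. A minimal sketch, with placeholder file system and access point IDs:

```json
{
  "volumes": [
    {
      "name": "shared-data",
      "efsVolumeConfiguration": {
        "fileSystemId": "fs-aaaa1111",
        "transitEncryption": "ENABLED",
        "authorizationConfig": {
          "accessPointId": "fsap-bbbb2222",
          "iam": "ENABLED"
        }
      }
    }
  ],
  "containerDefinitions": [
    {
      "name": "app",
      "mountPoints": [
        { "sourceVolume": "shared-data", "containerPath": "/data" }
      ]
    }
  ]
}
```

Using an access point with IAM authorization scopes each service to its own directory and lets the task role, rather than network reachability alone, control access.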

Should I use Fargate Spot for production workloads?

Fargate Spot can be used in production for fault-tolerant workloads. AWS provides a two-minute interruption warning before reclaiming Spot capacity. Your application must handle SIGTERM gracefully and drain connections. Use a mixed capacity provider strategy β€” run baseline capacity on regular Fargate and burst capacity on Spot. This provides cost savings while maintaining availability for critical workloads.
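A mixed strategy is declared on the ECS service via capacity providers. In this sketch, the first two tasks always run on regular Fargate and additional tasks are split 1:3 between regular and Spot (the base and weight values are illustrative):

```json
{
  "capacityProviderStrategy": [
    { "capacityProvider": "FARGATE", "base": 2, "weight": 1 },
    { "capacityProvider": "FARGATE_SPOT", "base": 0, "weight": 3 }
  ]
}
```

If Spot capacity is reclaimed, the base tasks on regular Fargate keep the service available while ECS replaces the interrupted tasks.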

πŸ”₯ Naren Β· Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged