ryo ariyama

How We Manage ECS with Terraform and GitHub Repos

Introduction

This may seem like a thoroughly discussed topic already, but since the same conversation recently came up internally at work, I decided to write about it.

In our day-to-day operations, we use Amazon ECS as the platform for running our applications. AWS resources are provisioned with Terraform, and we manage our code on GitHub with separate repositories for infrastructure and backend.

A frequent point of discussion is how much of the ECS-related configuration should be owned by the infrastructure side, which I’ll break down in this article.

Intended audience: Engineers using ECS and Terraform, especially those working in teams with separate infrastructure and application repositories

Estimated reading time: 7 minutes

How We Manage It

To cut to the chase: ECS task definitions and ECS service definitions are created in the infrastructure repository but updated from the backend repository. Everything else is managed in the infrastructure repository.

Before explaining why, let’s list the common triggers for updating these definitions:

  • Changing the tag of a container image in ECR
  • Updating environment variables or CPU/MEM parameters
  • Deploying a new revision of the task definition via the service definition

Especially with the first point, changing the container image tag typically results from updates to backend source code rather than infrastructure changes. Because of this dependency on application implementation, we believe it’s better not to manage such values from the infrastructure repository. Instead, they should be handled in the backend repository.

If you’re unsure what this looks like, consider a case where another team manages the infrastructure repository and also owns the ECS task definitions. When you want to deploy:

  • You build the container and push the image to ECR
  • You inform the infra team of the new image tag so they can create a new task definition
  • They update the ECS service to use the new task definition

Doing this every time you deploy creates a lot of communication overhead. Ideally, both teams should be able to deploy independently. That’s why we create the ECS service and task definitions in the infrastructure repo but handle updates from the backend.
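With that split, the backend-side flow can be sketched roughly as follows. This is a minimal sketch, not our exact pipeline: it assumes `ECR_REPO_URL` is exported, Docker, the AWS CLI, and ecspresso are available, and the tag scheme (short commit hash) is illustrative. The remaining environment variables required by the definitions later in this article are omitted here for brevity.

```shell
# Tag the image with the current commit (illustrative tagging scheme)
IMAGE_TAG=$(git rev-parse --short HEAD)

# Authenticate against the ECR registry and push the new image
aws ecr get-login-password --region ap-northeast-1 \
  | docker login --username AWS --password-stdin "${ECR_REPO_URL%%/*}"
docker build -t "${ECR_REPO_URL}:${IMAGE_TAG}" .
docker push "${ECR_REPO_URL}:${IMAGE_TAG}"

# Roll out a new task definition revision without touching the infra repo
IMAGE_TAG="${IMAGE_TAG}" ecspresso deploy --config ecspresso.yaml
```

No hand-off to the infra team is needed at any step, which is the whole point of the split.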

Sample Implementation

Here’s a reference implementation, starting with Terraform code for ECS:

# ECS task definition 
resource "aws_ecs_task_definition" "main" {
  family             = var.container_name
  task_role_arn      = var.task_role_arn
  execution_role_arn = var.exec_role_arn

  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_mem

  container_definitions = jsonencode([
    {
      essential   = true
      name        = var.container_name
      image       = "${var.ecr_repo_url}:latest"
      cpu         = 0
      environment = []
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-create-group  = "true"
          awslogs-group         = "/ecs/example"
          awslogs-region        = var.region
          awslogs-stream-prefix = "ecs"
        }
      }
      portMappings = [
        {
          appProtocol   = var.container_protocol
          containerPort = var.container_port
          hostPort      = var.container_port
          name          = "${var.container_name}_port"
          protocol      = "tcp"
        }
      ]
    }
  ])

  # Container definitions are updated from the backend repository,
  # so don't treat those updates as drift
  lifecycle {
    ignore_changes = [container_definitions]
  }
}

# ECS service definition
resource "aws_ecs_service" "main" {
  name                               = var.service_name
  cluster                            = var.cluster_id
  deployment_maximum_percent         = var.deployment_maximum_percent
  deployment_minimum_healthy_percent = var.deployment_minimum_healthy_percent
  desired_count                      = var.desired_count
  enable_execute_command             = var.enable_execute_command
  health_check_grace_period_seconds  = var.health_check_grace_period_seconds
  launch_type                        = "FARGATE"
  platform_version                   = "1.4.0"
  task_definition                    = aws_ecs_task_definition.main.arn

  deployment_controller {
    type = "ECS"
  }

  load_balancer {
    container_name   = var.container_name
    container_port   = var.container_port
    target_group_arn = var.tg_arn
  }

  network_configuration {
    assign_public_ip = false
    security_groups  = [var.sg_id]
    subnets          = var.subnets
  }

  # Both attributes are updated from the backend repository via deploys,
  # so don't treat those updates as drift
  lifecycle {
    ignore_changes = [
      task_definition,
      desired_count
    ]
  }
}

The lifecycle.ignore_changes blocks keep Terraform from flagging attributes owned by the backend as drift. Here, the task definition ignores changes to container_definitions (which covers the image tag and environment variables), and the service ignores task_definition and desired_count. Terraform still creates both resources initially, but subsequent updates from the backend side won’t show up in terraform plan.

For backend deployments, we use ecspresso, a CLI tool that lets you deploy using JSON definitions for task and service configurations.

Here’s what the definitions look like:

// ecs_task_def.json
{
  "containerDefinitions": [
    {
      "cpu": {{ env `ECS_CPU` `256` }},
      "essential": true,
      "image": "{{ env `ECR_REPO_IMAGE_URL` }}:{{ env `IMAGE_TAG` }}",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/example",
          "awslogs-region": "ap-northeast-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "memory": {{ env `ECS_MEMORY` `512` }},
      "name": "example",
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80,
          "protocol": "tcp"
        }
      ],
      "secrets": [
        {
          "name": "ENV",
          "valueFrom": "{{ env `SECRETS_MANAGER_ARN` }}:ENV::"
        }
      ]
    }
  ],
  "cpu": {{ env `ECS_CPU` `256` }},
  "executionRoleArn": "{{ env `EXEC_ROLE_ARN` }}",
  "family": "example",
  "memory": {{ env `ECS_MEMORY` `512` }},
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "taskRoleArn": "{{ env `TASK_ROLE_ARN` }}"
}

// ecs_service_def.json
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": false,
      "rollback": false
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  },
  "desiredCount": {{ env `DESIRED_COUNT` }},
  "enableECSManagedTags": false,
  "enableExecuteCommand": {{ env `ENABLE_ECS_EXEC` `true` }},
  "healthCheckGracePeriodSeconds": 5,
  "launchType": "FARGATE",
  "loadBalancers": [
    {
      "containerName": "example",
      "containerPort": 80,
      "targetGroupArn": "{{ env `TARGET_GROUP_ARN` }}"
    }
  ],
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": [
        "{{ env `SUBNET_1` }}",
        "{{ env `SUBNET_2` }}",
        "{{ env `SUBNET_3` }}"
      ],
      "securityGroups": [
        "{{ env `SECURITY_GROUP_ID` }}"
      ],
      "assignPublicIp": "DISABLED"
    }
  },
  "placementConstraints": [],
  "placementStrategy": [],
  "platformVersion": "1.4.0",
  "schedulingStrategy": "REPLICA",
  "serviceRegistries": []
}

You can deploy with a YAML config like this:

# ecspresso.yaml
cluster: example
service: example
service_definition: ecs_service_def.json
task_definition: ecs_task_def.json
timeout: 15m0s

$ ECS_CPU=256 \
  ECR_REPO_IMAGE_URL=<ecr-repository-url> \
  IMAGE_TAG=<ecr-container-image-tag> \
  ECS_MEMORY=512 \
  SECRETS_MANAGER_ARN=<arn-of-secrets-manager> \
  EXEC_ROLE_ARN=<arn-of-ecs-task-exec-role> \
  TASK_ROLE_ARN=<arn-of-ecs-task-role> \
  DESIRED_COUNT=<number-of-ecs-tasks> \
  ENABLE_ECS_EXEC=<true-or-false> \
  TARGET_GROUP_ARN=<arn-of-target-group> \
  SUBNET_1=<subnet-id1> \
  SUBNET_2=<subnet-id2> \
  SUBNET_3=<subnet-id3> \
  SECURITY_GROUP_ID=<security-group-id> \
  ecspresso deploy --config ecspresso.yaml
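Before rolling anything out, it’s worth previewing the change. ecspresso can diff the local definition files against what’s currently deployed, and the deploy command supports a dry run (both require AWS credentials for the target account):

```shell
# Show the difference between the local JSON definitions and the
# currently running task definition / service
ecspresso diff --config ecspresso.yaml

# Walk through the deploy without actually applying it
ecspresso deploy --config ecspresso.yaml --dry-run
```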

You simply change the IMAGE_TAG based on your code updates and run the command.
For more advanced usage, check out how to pull values from tfstate or GitHub Actions integration, though these are beyond the scope of this article.
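As a small taste of the tfstate integration: ecspresso ships with a tfstate plugin that can read values directly from your Terraform state, so IDs like subnets and target group ARNs don’t all have to be passed as environment variables. A minimal sketch, with the state location being illustrative:

```yaml
# ecspresso.yaml
cluster: example
service: example
service_definition: ecs_service_def.json
task_definition: ecs_task_def.json
plugins:
  - name: tfstate
    config:
      url: s3://example-tfstate-bucket/terraform.tfstate
```

With the plugin configured, the JSON definitions can use `{{ tfstate }}` lookups, e.g. replacing `"{{ env `TARGET_GROUP_ARN` }}"` with a reference to the corresponding Terraform resource address in state.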

Conclusion

That’s it. I’ve summarized our approach to managing ECS resources. I hope this is helpful for your development workflows.

Deployment practices like these can vary between teams and organizations, so if you have other approaches or improvements, I’d love to hear about them.
