Introduction
This may seem like a thoroughly discussed topic already, but since the same conversation recently came up internally at work, I decided to write about it.
In our day-to-day operations, we use Amazon ECS as the platform for running our applications. AWS resources are provisioned with Terraform, and we manage our code on GitHub with separate repositories for infrastructure and backend.
A frequent point of discussion is how much of the ECS-related configuration should be managed on the infrastructure side, which is what I'll break down in this article.
Intended audience: Engineers using ECS and Terraform, especially those working in teams with separate infrastructure and application repositories
Estimated reading time: 7 minutes
How We Manage It
To cut to the chase: ECS task definitions and ECS service definitions are created in the infrastructure repository but updated from the backend repository. Everything else is managed in the infrastructure repository.
Before explaining why, let’s list the common triggers for updating these definitions:
- Changing the tag of a container image in ECR
- Updating environment variables or CPU/MEM parameters
- Deploying a new revision of the task definition via the service definition
The first trigger in particular, changing the container image tag, typically results from updates to the backend source code rather than from infrastructure changes. Because the tag depends on the application implementation, we believe it's better not to manage such values from the infrastructure repository; they should be handled in the backend repository instead.
If you’re unsure what this looks like, consider a case where another team manages the infrastructure repository and also owns the ECS task definitions. When you want to deploy:
- You build the container and push the image to ECR
- You inform the infra team of the new image tag so they can create a new task definition
- They update the ECS service to use the new task definition
Doing this every time you deploy creates a lot of communication overhead. Ideally, both teams should be able to deploy independently. That’s why we create the ECS service and task definitions in the infrastructure repo but handle updates from the backend.
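To make the split concrete, a simplified layout might look like this (the names are illustrative, not our actual repositories):

infra-repo/                # owned by the infrastructure team
├── ecs.tf                 # creates the ECS task definition and service (shown below)
└── ...                    # ALB, VPC, IAM, and other supporting resources

backend-repo/              # owned by the application team
├── src/                   # application source code
├── ecs_task_def.json      # ecspresso task definition, applied on every deploy
├── ecs_service_def.json   # ecspresso service definition
└── ecspresso.yaml         # ecspresso config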
Sample Implementation
Here’s a reference implementation, starting with Terraform code for ECS:
# ECS task definition
resource "aws_ecs_task_definition" "main" {
  family                   = var.container_name
  task_role_arn            = var.task_role_arn
  execution_role_arn       = var.exec_role_arn
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_mem

  # The container definitions below are only a placeholder for the initial
  # apply; the backend side owns them afterwards, so Terraform ignores drift.
  lifecycle {
    ignore_changes = [container_definitions]
  }

  container_definitions = jsonencode([
    {
      essential   = true
      name        = var.container_name
      image       = "${var.ecr_repo_url}:latest"
      cpu         = 0
      environment = []
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          awslogs-create-group  = "true"
          awslogs-group         = "/ecs/example"
          awslogs-region        = var.region
          awslogs-stream-prefix = "ecs"
        }
      }
      portMappings = [
        {
          appProtocol   = var.container_protocol
          containerPort = var.container_port
          hostPort      = var.container_port
          name          = "${var.container_name}_port"
          protocol      = "tcp"
        }
      ]
    }
  ])
}
# ECS service definition
resource "aws_ecs_service" "main" {
  name                               = var.service_name
  cluster                            = var.cluster_id
  deployment_maximum_percent         = var.deployment_maximum_percent
  deployment_minimum_healthy_percent = var.deployment_minimum_healthy_percent
  desired_count                      = var.desired_count
  enable_execute_command             = var.enable_execute_command
  health_check_grace_period_seconds  = var.health_check_grace_period_seconds
  launch_type                        = "FARGATE"
  platform_version                   = "1.4.0"
  task_definition                    = aws_ecs_task_definition.main.arn

  deployment_controller {
    type = "ECS"
  }

  load_balancer {
    container_name   = var.container_name
    container_port   = var.container_port
    target_group_arn = var.tg_arn
  }

  network_configuration {
    assign_public_ip = false
    security_groups  = [var.sg_id]
    subnets          = var.subnets
  }

  # New task definition revisions and the desired count are set from the
  # backend repository at deploy time, so Terraform ignores changes to both.
  lifecycle {
    ignore_changes = [
      task_definition,
      desired_count
    ]
  }
}
The lifecycle.ignore_changes blocks ensure that attributes managed from the backend side won't show up as diffs in Terraform plans. In this example, we ignore container_definitions (which carries the image tag and environment variables) on the task definition, and the task definition revision and desired count on the service, so Terraform handles only the safe initial creation.
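Since the resources above take all their inputs as variables, they can live in a reusable module. A hypothetical caller (the module path and values here are illustrative) might look like this:

module "ecs" {
  source             = "./modules/ecs" # hypothetical module path
  container_name     = "example"
  container_port     = 80
  container_protocol = "http"
  task_cpu           = 256
  task_mem           = 512
  desired_count      = 2
  # ...plus the role ARNs, cluster ID, networking, and target group inputs used above
}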
For backend deployments, we use ecspresso, a CLI tool that lets you deploy using JSON definitions for task and service configurations.
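If you haven't used it before, ecspresso ships as a single binary; one way to install it is via the author's Homebrew tap (binaries are also published on GitHub releases):

$ brew install kayac/tap/ecspresso
$ ecspresso version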
Here’s what the definitions look like:
// ecs_task_def.json
{
  "containerDefinitions": [
    {
      "cpu": {{ env `ECS_CPU` `256` }},
      "essential": true,
      "image": "{{ env `ECR_REPO_IMAGE_URL` }}:{{ env `IMAGE_TAG` }}",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-create-group": "true",
          "awslogs-group": "/ecs/example",
          "awslogs-region": "ap-northeast-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "memory": {{ env `ECS_MEMORY` `512` }},
      "name": "example",
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80,
          "protocol": "tcp"
        }
      ],
      "secrets": [
        {
          "name": "ENV",
          "valueFrom": "{{ env `SECRETS_MANAGER_ARN` }}:ENV::"
        }
      ]
    }
  ],
  "cpu": {{ env `ECS_CPU` `256` }},
  "executionRoleArn": "{{ env `EXEC_ROLE_ARN` }}",
  "family": "example",
  "memory": {{ env `ECS_MEMORY` `512` }},
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "taskRoleArn": "{{ env `TASK_ROLE_ARN` }}"
}
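The {{ env `VAR` `default` }} placeholders are ecspresso's template syntax: they expand to the named environment variable, falling back to the second argument when one is provided. If you'd rather have a missing variable fail the deploy than silently expand to an empty string, ecspresso also provides must_env; for example, the image line above could be written defensively as:

"image": "{{ must_env `ECR_REPO_IMAGE_URL` }}:{{ must_env `IMAGE_TAG` }}",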
// ecs_service_def.json
{
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": false,
      "rollback": false
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  },
  "desiredCount": {{ env `DESIRED_COUNT` }},
  "enableECSManagedTags": false,
  "enableExecuteCommand": {{ env `ENABLE_ECS_EXEC` `true` }},
  "healthCheckGracePeriodSeconds": 5,
  "launchType": "FARGATE",
  "loadBalancers": [
    {
      "containerName": "example",
      "containerPort": 80,
      "targetGroupArn": "{{ env `TARGET_GROUP_ARN` }}"
    }
  ],
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": [
        "{{ env `SUBNET_1` }}",
        "{{ env `SUBNET_2` }}",
        "{{ env `SUBNET_3` }}"
      ],
      "securityGroups": [
        "{{ env `SECURITY_GROUP_ID` }}"
      ],
      "assignPublicIp": "DISABLED"
    }
  },
  "placementConstraints": [],
  "placementStrategy": [],
  "platformVersion": "1.4.0",
  "schedulingStrategy": "REPLICA",
  "serviceRegistries": []
}
You can deploy with a YAML config like this:
# ecspresso.yaml
cluster: example
service: example
service_definition: ecs_service_def.json
task_definition: ecs_task_def.json
timeout: 15m0s
Then run ecspresso, passing the values as environment variables:
$ ECS_CPU=256 \
  ECR_REPO_IMAGE_URL=<ecr-repository-url> \
  IMAGE_TAG=<ecr-container-image-tag> \
  ECS_MEMORY=512 \
  SECRETS_MANAGER_ARN=<arn-of-secrets-manager> \
  EXEC_ROLE_ARN=<arn-of-ecs-task-exec-role> \
  TASK_ROLE_ARN=<arn-of-ecs-task-role> \
  DESIRED_COUNT=<number-of-ecs-tasks> \
  ENABLE_ECS_EXEC=<true-or-false> \
  TARGET_GROUP_ARN=<arn-of-target-group> \
  SUBNET_1=<subnet-id1> \
  SUBNET_2=<subnet-id2> \
  SUBNET_3=<subnet-id3> \
  SECURITY_GROUP_ID=<security-group-id> \
  ecspresso deploy --config ecspresso.yaml
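A few other ecspresso subcommands pair well with this flow; a quick sketch using the same config (and the same environment variables exported):

$ ecspresso diff --config ecspresso.yaml     # preview differences between the local definitions and the running service
$ ecspresso status --config ecspresso.yaml   # show the service status and recent events
$ ecspresso rollback --config ecspresso.yaml # roll back to the previous task definition revision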
You simply change the IMAGE_TAG based on your code updates and run the command.
For more advanced usage, look into pulling values from tfstate and the GitHub Actions integration, though both are beyond the scope of this article.
Conclusion
That’s it. I’ve summarized our approach to managing ECS resources. I hope this is helpful for your development workflows.
Deployment practices like these can vary between teams and organizations, so if you have other approaches or improvements, I’d love to hear about them.