- The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
- Why is it so challenging?
- How does the “Just do it” approach affect pillars?
- How do VPC interface endpoints fit into all this?
- How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
- Optimizing Architecture for Cost Savings and Business Continuity
- High Level Design
- Key takeaways
The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
Working as a DevOps engineer is like juggling flaming swords while someone shouts, 'Can you deploy that by Friday?' Or worse, 'By 17:00 Friday.'
Why is it so challenging?
Explaining that your solution should align with the six pillars of the AWS Well-Architected Framework is like asking for a seatbelt in a car that's already halfway down the hill—or opening your umbrella after the rain has passed. You need time, planning, and a roadmap—and nobody wants to hear that when the only goal is “just make it work.”
“Just do it” can be an effective strategy, but of those six pillars, cost optimization and sustainability are usually the first to be sacrificed.
How does the “Just do it” approach affect pillars?
Because in the race to deliver, speed beats everything. Deadlines are sacred.
And what about budgets? Well, they’re not a problem until someone sees the monthly AWS bill and starts panicking. The cost impact is often hidden behind shared billing, and nobody has tagging discipline in the early phase.
Now you're asked to deploy a Graviton instance for a legacy application that doesn't even support ARM. Why wouldn’t you? After all, cost optimization is suddenly the top priority; never mind compatibility.
That’s when, suddenly, cost optimization becomes everyone's favorite pillar.
How do VPC interface endpoints fit into all this?
Initially, VPC endpoints are provisioned separately per VPC—because we prioritized speed over cost and, sometimes, even quality or security.
If we have 20 VPCs and create the same endpoints in each, we pay for them twenty times over, even though the traffic is almost idle. A single VPC interface endpoint provides 10 Gbps per Availability Zone, with automatic scaling up to 100 Gbps. That is enough to handle multiple workloads, even high-throughput data workloads.
For those with a programming background, this is a classic example of violating the ‘Don’t Repeat Yourself’ (DRY) principle.
Because repeating the same setup in every VPC introduces unnecessary costs for a horizontally scalable networking component designed to handle large volumes of traffic efficiently—and doing it multiple times means paying multiple times.
According to the documentation
By default, each VPC endpoint can support a bandwidth of up to 10 Gbps per Availability Zone, and automatically scales up to 100 Gbps.
How much does it cost? - A gentle overview of provisioning VPC interface endpoints for each new VPC in environments with a multi-account strategy. We will use 13 accounts (let's agree it is an unlucky number) and some randomly generated endpoint services as an example.
account | interface endpoints |
---|---|
1 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager |
2 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager, sqs, airflow.api, airflow.env, airflow.ops |
3 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager, sagemaker.api, sagemaker.runtime |
4 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, sagemaker.runtime, execute-api, secretsmanager, states, sts, acm-pca, glue, athena, macie2, ecs, bedrock-runtime |
5 | s3, sts |
6 | ssm, ssmmessages, ec2messages, ec2, s3, logs, monitoring, kms, sts |
7 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, secretsmanager, elasticfilesystem, codecommit, git-codecommit, glue, athena, application-autoscaling |
8 | logs, monitoring, sts, glue, lambda, states, secretsmanager |
9 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
10 | logs, monitoring, sts, ec2 |
11 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, secretsmanager, acm-pca |
12 | athena, logs, monitoring, kms, secretsmanager, codecommit, sagemaker.api, sagemaker.runtime, glue, git-codecommit, sts, bedrock-runtime |
13 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
If we group the endpoints by frequency, assuming one environment or four environments, the numbers look like this:
VPC Endpoint | Frequency (x1) | Frequency (x4) |
---|---|---|
sts | 14 | 56 |
logs | 12 | 48 |
monitoring | 12 | 48 |
kms | 10 | 40 |
secretsmanager | 10 | 40 |
lambda | 9 | 36 |
ecr.api | 8 | 32 |
ecr.dkr | 8 | 32 |
acm-pca | 7 | 28 |
ec2 | 6 | 24 |
ssm | 5 | 20 |
sagemaker.api | 4 | 16 |
glue | 4 | 16 |
ssmmessages | 3 | 12 |
ec2messages | 3 | 12 |
sagemaker.runtime | 3 | 12 |
athena | 3 | 12 |
ssm-contacts | 2 | 8 |
states | 2 | 8 |
bedrock-runtime | 2 | 8 |
s3 | 2 | 8 |
codecommit | 2 | 8 |
git-codecommit | 2 | 8 |
sqs | 1 | 4 |
airflow.api | 1 | 4 |
airflow.env | 1 | 4 |
airflow.ops | 1 | 4 |
execute-api | 1 | 4 |
macie2 | 1 | 4 |
ecs | 1 | 4 |
elasticfilesystem | 1 | 4 |
application-autoscaling | 1 | 4 |
Total | 132 | 528 |
Total costs
A calculation of the total costs for eu-west-2 (the London region) looks like this:
Total cost for 132 endpoints in 1 environment = $0.011/hour × 3 AZs × 24 hours × 30 days × 132 = $3,136.32/month
Total cost for 528 endpoints (4 environments) = $3,136.32 × 4 = $12,545.28/month
Data processing costs for 4 environments = $5.28 (rough estimate)
Total unique VPC endpoint count = 32
Cost for 32 endpoints = $0.011/hour × 3 AZs × 24 hours × 30 days × 32 = $760.32/month
A centralized approach, with the VPC endpoints in a shared services account for prod and another for nonprod, can provide the same scalability and high availability while reducing costs by ~87% (two hubs at $760.32 each, or $1,520.64/month, versus $12,545.28/month) and cutting the administrative burden. Of course, we can go a step further and replace the S3 and DynamoDB interface endpoints with gateway endpoints, which are free of charge, as long as we don't need to share them across VPCs: unlike interface endpoints, gateway endpoints cannot be reached over peering or transit connectivity.
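For reference, a gateway endpoint is a small piece of Terraform. The sketch below is illustrative only; the VPC and route table references are assumptions, not resources from this article:

```hcl
# Hypothetical sketch: a free S3 gateway endpoint. The referenced VPC and
# route table resources are placeholders for your own networking stack.
resource "aws_vpc_endpoint" "s3_gateway" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-west-2.s3"
  vpc_endpoint_type = "Gateway"

  # Gateway endpoints work via route table entries rather than ENIs,
  # which is also why they cannot be shared across VPCs.
  route_table_ids = [aws_route_table.private.id]
}
```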
Summary
- 132 endpoints × 3 AZs × $0.011/hour × 24 hours × 30 days = $3,136.32/month
- For 4 environments (528 endpoints): $12,545.28/month
- 32 unique endpoints across 3 AZs: $760.32/month per hub, or $1,520.64/month for a prod and a nonprod hub
- Savings: ~87%
Note: I did not include the costs for the Route 53 resolver endpoints, which add between $180 and $270 per month depending on the number of ENIs, or more precisely the number of AZs.
Optimizing Architecture for Cost Savings and Business Continuity
The costs above are not necessarily a bad thing. You get isolation between environments, and you gather extensive knowledge of how things work and of how to approach stakeholders in order to improve the situation.
Why Isn't Cost Enough to Convince the Business?
The business is only interested in certain things; I would say nobody cares that the administrative burden would be smaller. So how can you approach this?
When the interface endpoints were first deployed, they were not well secured. We now have a lot of networks with inconsistent security standards: each VPC has become a snowflake. You may want to avoid saying "it is not secure"; a more suitable approach would be:
By standardizing the security policies and security groups, you can make sure that sensitive workloads have access only to specific buckets, specific tables, and specific APIs. This improves the security baseline and reduces the blast radius, which in turn reduces the possibility of a data leak.
How to Sell Optimization Without Saying 'Security Is Bad'
By centralizing and standardizing the interface endpoints, we could achieve an 87% cost reduction. In Bulgaria, there’s a well-known satirical series called The Three Fools. In this context, it feels like we're unintentionally playing a similar role: continuing to pay thousands to AWS for redundant endpoints simply because the architecture hasn't been revisited with fresh eyes.
Note: Security is always a good selling point for the business, and nobody measures it after a change. Controlling fear and risk sells; a good example is the insurance we buy for our houses.
High Level Design
🧩 Components Table
ID | Name | Type | Description |
---|---|---|---|
C1 | Interface Endpoints | VPC Interface Endpoints | Provides private access to AWS services (e.g., ssm.eu-west-2.amazonaws.com). |
C2 | Route 53 Private Hosted Zone | DNS Zone | Hosts private DNS entries for the services. |
C3 | Route 53 Resolver Inbound Endpoint | DNS Resolver | Accepts DNS queries from the spoke VPC. |
C4 | Shared Resolver | Route 53 Resolver | Used by EC2 instances in the spoke VPC to resolve private DNS. |
C5 | AWS RAM | Resource Access Manager | Shares the inbound endpoint and private hosted zone with the spoke VPC. |
C6 | Cloud WAN Segment Network | Network Routing | Routes traffic between segments (e.g., from spoke to shared services). |
C7 | Amazon EC2 Instance | Compute | The instance initiating the request to ssm.eu-west-2.amazonaws.com. |
C8 | Spoke VPC | VPC | Contains the EC2 instance. CIDR: 192.168.20.x. |
C9 | Centralized VPC Endpoints | VPC | Hosts the interface endpoints and inbound resolver. CIDR: 192.168.10.x. |
🔗 Integrations Table
Step | Integration Description | Direction | Protocol/Mechanism |
---|---|---|---|
1 | EC2 in the spoke VPC wants to resolve ssm.eu-west-2.amazonaws.com. | Spoke → Shared | DNS query via shared resolver |
2 | The shared resolver returns IP 192.168.10.4 for the endpoint. | Shared → Spoke | DNS response |
3 | Traffic to 192.168.10.4 is not local and is forwarded to the Cloud WAN uplink. | Spoke → Cloud WAN | VPC route table / Cloud WAN routing |
4 | Cloud WAN checks whether the route to the other network is permitted. | Cloud WAN | Firewall/policy check |
5 | If permitted, traffic is routed to the shared services VPC. | Cloud WAN → Shared | Network forwarding |
I1 | The private hosted zone is associated with the shared resolver and the spoke via RAM. | Shared ↔ Spoke | AWS RAM and Route 53 association |
I4 | RAM shares the inbound resolver with the spoke VPC. | Shared → Spoke | AWS Resource Access Manager |
I5 | Spoke EC2 sends DNS queries to the shared resolver. | Spoke → Shared | DNS |
Prerequisite: all VPCs are connected via peering, Transit Gateway, or Cloud WAN.
Hub VPC
First, we need to create a centralized hub VPC that contains all of the necessary VPC interface endpoints. When you create a VPC endpoint to an AWS service, you can enable private DNS. When enabled, this setting creates an AWS-managed Route 53 private hosted zone (PHZ), which resolves the public AWS service endpoint name to the private IP of the interface endpoint. You need this disabled in order to define a centralized PHZ through a Route 53 inbound resolver, which is then shared with the other accounts.
To do this, disable it in Terraform:
```hcl
resource "aws_vpc_endpoint" "private_links" {
  for_each          = toset(local.vpc_endpoints_all)
  vpc_id            = aws_vpc.main.id
  service_name      = each.key
  vpc_endpoint_type = "Interface"

  # Disabling private DNS lets us override the default endpoint
  # resolution and use our own Route 53 hosted zone across accounts.
  private_dns_enabled = false

  security_group_ids = [aws_security_group.vpc_endpoint[each.key].id]
  policy             = data.aws_iam_policy_document.vpc_endpoints_policy.json
  subnet_ids         = local.subnets
  tags               = merge({ Name = "${var.prefix}-${each.key}-interface-endpoint" }, var.tags)
}
```
As a next step, we create a Route 53 private hosted zone for each endpoint and associate it with the centralized VPC from step 1. Then we create an alias A record in each hosted zone pointing to the VPC endpoint's DNS name. For example, for the STS endpoint, the name "sts.${data.aws_region.current.name}.amazonaws.com" should point to the DNS name of the newly created VPC endpoint for STS.
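In Terraform, this step might look roughly like the following sketch. The resource names and the exact endpoint key used to index private_links are assumptions for illustration:

```hcl
# Hypothetical sketch: a private hosted zone for STS, associated with the hub
# VPC, plus an alias record pointing at the interface endpoint.
resource "aws_route53_zone" "sts" {
  name = "sts.${data.aws_region.current.name}.amazonaws.com"

  # Declaring a vpc block makes this a private hosted zone.
  vpc {
    vpc_id = aws_vpc.main.id
  }
}

resource "aws_route53_record" "sts" {
  zone_id = aws_route53_zone.sts.zone_id
  name    = "sts.${data.aws_region.current.name}.amazonaws.com"
  type    = "A"

  alias {
    # dns_entry[0] holds the regional DNS name of the interface endpoint.
    name                   = aws_vpc_endpoint.private_links["com.amazonaws.${data.aws_region.current.name}.sts"].dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.private_links["com.amazonaws.${data.aws_region.current.name}.sts"].dns_entry[0].hosted_zone_id
    evaluate_target_health = false
  }
}
```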
This allows traffic from spoke VPCs to resolve AWS service endpoints via the centralized VPC interface endpoints and the inbound endpoint. Next, we create an inbound endpoint in Route 53 with a security group and the Do53 protocol, in at least two subnets for high availability; it will be used by the spoke VPCs as well. The idea of the inbound resolver endpoint is to route DNS queries from the spoke VPCs or other networks to the hub VPC.
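A sketch of the inbound endpoint in Terraform, assuming hypothetical security group and subnet references:

```hcl
# Hypothetical sketch: Route 53 Resolver inbound endpoint spread across two
# subnets (two AZs) for high availability, accepting queries from spoke VPCs.
resource "aws_route53_resolver_endpoint" "inbound" {
  name      = "hub-inbound"
  direction = "INBOUND"

  security_group_ids = [aws_security_group.resolver_inbound.id]

  ip_address {
    subnet_id = local.subnets[0]
  }

  ip_address {
    subnet_id = local.subnets[1]
  }
}
```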
As a last step, we share the resolver rule that targets the inbound endpoint with the other accounts, and define a policy for security, through Resource Access Manager (RAM).
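Sharing through RAM might be sketched as follows; the resolver rule reference and the organization-wide principal are illustrative assumptions:

```hcl
# Hypothetical sketch: share a resolver rule with other accounts via RAM.
resource "aws_ram_resource_share" "resolver" {
  name                      = "hub-resolver-rules"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "resolver_rule" {
  # aws_route53_resolver_rule.hub is assumed to be a FORWARD rule targeting
  # the inbound endpoint's IP addresses.
  resource_arn       = aws_route53_resolver_rule.hub.arn
  resource_share_arn = aws_ram_resource_share.resolver.arn
}

resource "aws_ram_principal_association" "org" {
  # Share with the whole organization, or list individual account IDs instead.
  principal          = data.aws_organizations_organization.current.arn
  resource_share_arn = aws_ram_resource_share.resolver.arn
}
```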
Spoke VPCs
- Each newly created spoke VPC needs to be associated with the resolver rules shared from the hub VPC. For example:
```hcl
data "aws_route53_resolver_rules" "eu_west_2" {
  owner_id     = var.resolver_rules[terraform.workspace]
  share_status = "SHARED_WITH_ME"
}

resource "aws_route53_resolver_rule_association" "eu_west_2" {
  for_each = data.aws_route53_resolver_rules.eu_west_2.resolver_rule_ids

  resolver_rule_id = each.value
  vpc_id           = data.terraform_remote_state.networking.outputs.network.aws_vpc.id
}
```
Minimizing downtime
Now you might ask: how do we move from decentralized VPC interface endpoints to centralized ones with as little downtime as possible?
In general, associate the shared resolver rules with the spoke VPC first, then destroy the decentralized VPC endpoints in a rolling deployment from development to production, backed by automated tests via Systems Manager Run Command or Lambda. This guarantees that you first gather knowledge about what can fail (everything fails, all the time) and document it in Confluence or even a README.md. That knowledge gives you confidence for production and makes the big change controllable and understandable for technical and non-technical people alike.
Key Takeaways
- Rushing architecture decisions often leads to long-term cost explosions.
- Interface endpoints are scalable; duplicating them per VPC isn't.
- Centralizing shared services like VPC endpoints saves money and simplifies security management.
- To convince stakeholders, lead with security and cost, not technical purity.
- AWS gives you the tools; architecture is about using them with purpose.