- The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
- Why is it so challenging?
- How does the “Just do it” approach affect pillars?
- How do VPC interface endpoints fit into all this?
- How much does it cost? – A gentle overview of provisioning VPC interface endpoints for each new VPC
- Optimizing Architecture for Cost Savings and Business Continuity
- High Level Design
- Key takeaways
The Hidden Cost of Speed: How 'Just Make It Work' Breaks Your AWS Budget
Working as a DevOps engineer is like juggling flaming swords while someone shouts, 'Can you deploy that by Friday?' Or worse, 'By 17:00 Friday.'
Why is it so challenging?
Explaining that your solution should align with the six pillars of the AWS Well-Architected Framework is like asking for a seatbelt in a car that's already halfway down the hill—or opening your umbrella after the rain has passed. You need time, planning, and a roadmap—and nobody wants to hear that when the only goal is “just make it work.”
“Just do it” can be an effective strategy, but of those six pillars, cost optimization and sustainability are usually the first to be sacrificed.
How does the “Just do it” approach affect pillars?
Because in the race to deliver, speed beats everything. Deadlines are sacred.
And what about budgets? Well, they’re not a problem until someone sees the monthly AWS bill and starts panicking. The cost impact is often hidden behind shared billing, and nobody has tagging discipline in the early phase.
Now you're asked to deploy a Graviton instance for a legacy application that doesn't even support ARM. Why wouldn’t you? After all, cost optimization is suddenly the top priority; never mind compatibility.
That’s when, suddenly, cost optimization becomes everyone's favorite pillar.
How do VPC interface endpoints fit into all this?
Initially, VPC endpoints are provisioned separately per VPC—because we prioritized speed over cost and, sometimes, even quality or security.
If we have 20 VPCs and create the same endpoints in each, we pay for them twenty times over, even though the traffic is almost idle. A single VPC interface endpoint provides 10 Gbps per Availability Zone, with automatic scaling up to 100 Gbps. That is enough to handle multiple workloads, even high-throughput data workloads.
For those with a programming background, this is a classic example of violating the ‘Don’t Repeat Yourself’ (DRY) principle.
Because repeating the same setup in every VPC introduces unnecessary costs for a horizontally scalable networking component designed to handle large volumes of traffic efficiently—and doing it multiple times means paying multiple times.
According to the documentation
By default, each VPC endpoint can support a bandwidth of up to 10 Gbps per Availability Zone, and automatically scales up to 100 Gbps.
How much does it cost? - A gentle overview of provisioning VPC interface endpoints for each new VPC in environments with a multi-account strategy. We will use 13 accounts (let's agree it is an unlucky number) and some randomly generated endpoint services as an example.
account | interface endpoints |
---|---|
1 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager |
2 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, ssm, ssmmessages, ssm-contacts, ec2, ec2messages, acm-pca, secretsmanager, sqs, airflow.api, airflow.env, airflow.ops |
3 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager, sagemaker.api, sagemaker.runtime |
4 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, sagemaker.runtime, execute-api, secretsmanager, states, sts, acm-pca, glue, athena, macie2, ecs, bedrock-runtime |
5 | s3, sts |
6 | ssm, ssmmessages, ec2messages, ec2, s3, logs, monitoring, kms, sts |
7 | ssm, ec2, ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, sagemaker.api, secretsmanager, elasticfilesystem, codecommit, git-codecommit, glue, athena, application-autoscaling |
8 | logs, monitoring, sts, glue, lambda, states, secretsmanager |
9 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
10 | logs, monitoring, sts, ec2 |
11 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, secretsmanager, acm-pca |
12 | athena, logs, monitoring, kms, secretsmanager, codecommit, sagemaker.api, sagemaker.runtime, glue, git-codecommit, sts, bedrock-runtime |
13 | ecr.api, ecr.dkr, logs, monitoring, lambda, kms, sts, acm-pca, secretsmanager |
If we group the endpoints by frequency, assuming one environment or four environments, the numbers look like this:
VPC Endpoint | Frequency (x1) | Frequency (x4) |
---|---|---|
sts | 14 | 56 |
logs | 12 | 48 |
monitoring | 12 | 48 |
kms | 10 | 40 |
secretsmanager | 10 | 40 |
lambda | 9 | 36 |
ecr.api | 8 | 32 |
ecr.dkr | 8 | 32 |
acm-pca | 7 | 28 |
ec2 | 6 | 24 |
ssm | 5 | 20 |
sagemaker.api | 4 | 16 |
glue | 4 | 16 |
ssmmessages | 3 | 12 |
ec2messages | 3 | 12 |
sagemaker.runtime | 3 | 12 |
athena | 3 | 12 |
ssm-contacts | 2 | 8 |
states | 2 | 8 |
bedrock-runtime | 2 | 8 |
s3 | 2 | 8 |
codecommit | 2 | 8 |
git-codecommit | 2 | 8 |
sqs | 1 | 4 |
airflow.api | 1 | 4 |
airflow.env | 1 | 4 |
airflow.ops | 1 | 4 |
execute-api | 1 | 4 |
macie2 | 1 | 4 |
ecs | 1 | 4 |
elasticfilesystem | 1 | 4 |
application-autoscaling | 1 | 4 |
Total | 132 | 528 |
Total costs
A calculation of the total costs for eu-west-2 (the London region) looks like this:
Total cost for 132 endpoints in 1 environment = $0.011/hour × 3 AZs × 24 hours × 30 days × 132 = $3,136.32/month
Total cost for 528 endpoints (4 environments) = $3,136.32 × 4 = $12,545.28/month
Data processing costs for 4 environments = $5.28 (rough estimate)
Total unique VPC endpoint count = 32
Cost for 32 endpoints = $0.011/hour × 3 AZs × 24 hours × 30 days × 32 = $760.32/month
A centralized approach, with the VPC endpoints in a shared services account for prod and another for nonprod, can provide the same scalability and high availability while reducing costs by ~87% (two hubs at $760.32 each, or $1,520.64/month, versus $12,545.28/month) and cutting the administrative burden. Of course, we can go a step further and replace the S3 and DynamoDB interface endpoints with gateway endpoints, which are free of charge, as long as we don't need to share them across VPCs: unlike interface endpoints, gateway endpoints cannot be reached over peering or transit connectivity.
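For reference, a gateway endpoint is a small piece of Terraform. The sketch below is illustrative only; the VPC and route table references are assumptions, not resources from this article:

```hcl
# Hypothetical sketch: a free S3 gateway endpoint. The referenced VPC and
# route table resources are placeholders for your own networking stack.
resource "aws_vpc_endpoint" "s3_gateway" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-west-2.s3"
  vpc_endpoint_type = "Gateway"

  # Gateway endpoints work via route table entries rather than ENIs,
  # which is also why they cannot be shared across VPCs.
  route_table_ids = [aws_route_table.private.id]
}
```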
Summary
- 132 endpoints × 3 AZs × $0.011/hour × 24 hours × 30 days = $3,136.32/month
- For 4 environments (528 endpoints): $12,545.28/month
- 32 unique endpoints across 3 AZs: $760.32/month per hub, or $1,520.64/month for a prod and a nonprod hub
- Savings: ~87%
Note: I did not include the costs for the Route 53 resolver endpoints, which add between $180 and $270 per month depending on the number of ENIs, or more precisely the number of AZs.
Optimizing Architecture for Cost Savings and Business Continuity
The costs above are not necessarily a bad thing. You get isolation between environments, and you gather extensive knowledge of how things work and of how to approach stakeholders in order to improve the situation.
Why Isn't Cost Enough to Convince the Business?
The business is only interested in certain things; I would say nobody cares that the administrative burden would be smaller. So how can you approach this?
When the interface endpoints were first deployed, they were not well secured. We now have a lot of networks with inconsistent security standards: each VPC has become a snowflake. You may want to avoid saying "it is not secure"; a more suitable approach would be:
By standardizing the security policies and security groups, you can make sure that sensitive workloads have access only to specific buckets, specific tables, and specific APIs. This improves the security baseline and reduces the blast radius, which in turn reduces the possibility of a data leak.
How to Sell Optimization Without Saying 'Security Is Bad'
By centralizing and standardizing the interface endpoints, we could achieve an 87% cost reduction. In Bulgaria, there’s a well-known satirical series called The Three Fools. In this context, it feels like we're unintentionally playing a similar role: continuing to pay thousands to AWS for redundant endpoints simply because the architecture hasn't been revisited with fresh eyes.
Note: Security is always a good selling point for the business, and nobody measures it after a change. Controlling fear and risk sells; a good example is the insurance we buy for our houses.
High Level Design
🧩 Components Table
ID | Name | Type | Description |
---|---|---|---|
C1 | Interface Endpoints | VPC Interface Endpoints | Provides private access to AWS services (e.g., ssm.eu-west-2.amazonaws.com). |
C2 | Route 53 Private Hosted Zone | DNS Zone | Hosts private DNS entries for the services. |
C3 | Route 53 Resolver Inbound Endpoint | DNS Resolver | Accepts DNS queries from the spoke VPC. |
C4 | Shared Resolver | Route 53 Resolver | Used by EC2 instances in the spoke VPC to resolve private DNS. |
C5 | AWS RAM | Resource Access Manager | Shares the inbound endpoint and private hosted zone with the spoke VPC. |
C6 | Cloud WAN Segment Network | Network Routing | Routes traffic between segments (e.g., from spoke to shared services). |
C7 | Amazon EC2 Instance | Compute | The instance initiating the request to ssm.eu-west-2.amazonaws.com. |
C8 | Spoke VPC | VPC | Contains the EC2 instance. CIDR: 192.168.20.x. |
C9 | Centralized VPC Endpoints | VPC | Hosts the interface endpoints and inbound resolver. CIDR: 192.168.10.x. |
🔗 Integrations Table
Step | Integration Description | Direction | Protocol/Mechanism |
---|---|---|---|
1 | EC2 in the spoke VPC wants to resolve ssm.eu-west-2.amazonaws.com. | Spoke → Shared | DNS query via shared resolver |
2 | The shared resolver returns IP 192.168.10.4 for the endpoint. | Shared → Spoke | DNS response |
3 | Traffic to 192.168.10.4 is not local and is forwarded to the Cloud WAN uplink. | Spoke → Cloud WAN | VPC route table / Cloud WAN routing |
4 | Cloud WAN checks whether the route to the other network is permitted. | Cloud WAN | Firewall/policy check |
5 | If permitted, traffic is routed to the shared services VPC. | Cloud WAN → Shared | Network forwarding |
I1 | The private hosted zone is associated with the shared resolver and the spoke via RAM. | Shared ↔ Spoke | AWS RAM and Route 53 association |
I4 | RAM shares the inbound resolver with the spoke VPC. | Shared → Spoke | AWS Resource Access Manager |
I5 | Spoke EC2 sends DNS queries to the shared resolver. | Spoke → Shared | DNS |
Prerequisite: all VPCs are connected via peering, Transit Gateway, or Cloud WAN.
Hub VPC
First, we need to create a centralized hub VPC that contains all of the necessary VPC interface endpoints. When you create a VPC endpoint to an AWS service, you can enable private DNS. When enabled, this setting creates an AWS-managed Route 53 private hosted zone (PHZ), which resolves the public AWS service endpoint name to the private IP of the interface endpoint. You need this disabled in order to define a centralized PHZ through a Route 53 inbound resolver, which is then shared with the other accounts.
To do this, disable it in Terraform:
```hcl
resource "aws_vpc_endpoint" "private_links" {
  for_each          = toset(local.vpc_endpoints_all)
  vpc_id            = aws_vpc.main.id
  service_name      = each.key
  vpc_endpoint_type = "Interface"

  # Disabling private DNS lets us override the default endpoint
  # resolution and use our own Route 53 hosted zone across accounts.
  private_dns_enabled = false

  security_group_ids = [aws_security_group.vpc_endpoint[each.key].id]
  policy             = data.aws_iam_policy_document.vpc_endpoints_policy.json
  subnet_ids         = local.subnets
  tags               = merge({ Name = "${var.prefix}-${each.key}-interface-endpoint" }, var.tags)
}
```
As a next step, we create a Route 53 private hosted zone for each endpoint and associate it with the centralized VPC from step 1. Then we create an alias A record in each hosted zone pointing to the VPC endpoint's DNS name. For example, for the STS endpoint, the name "sts.${data.aws_region.current.name}.amazonaws.com" should point to the DNS name of the newly created VPC endpoint for STS.
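In Terraform, this step might look roughly like the following sketch. The resource names and the exact endpoint key used to index private_links are assumptions for illustration:

```hcl
# Hypothetical sketch: a private hosted zone for STS, associated with the hub
# VPC, plus an alias record pointing at the interface endpoint.
resource "aws_route53_zone" "sts" {
  name = "sts.${data.aws_region.current.name}.amazonaws.com"

  # Declaring a vpc block makes this a private hosted zone.
  vpc {
    vpc_id = aws_vpc.main.id
  }
}

resource "aws_route53_record" "sts" {
  zone_id = aws_route53_zone.sts.zone_id
  name    = "sts.${data.aws_region.current.name}.amazonaws.com"
  type    = "A"

  alias {
    # dns_entry[0] holds the regional DNS name of the interface endpoint.
    name                   = aws_vpc_endpoint.private_links["com.amazonaws.${data.aws_region.current.name}.sts"].dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.private_links["com.amazonaws.${data.aws_region.current.name}.sts"].dns_entry[0].hosted_zone_id
    evaluate_target_health = false
  }
}
```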
This allows traffic from spoke VPCs to resolve AWS service endpoints via the centralized VPC interface endpoints and the inbound endpoint. Next, we create an inbound endpoint in Route 53 with a security group and the Do53 protocol, in at least two subnets for high availability; it will be used by the spoke VPCs as well. The idea of the inbound resolver endpoint is to route DNS queries from the spoke VPCs or other networks to the hub VPC.
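A sketch of the inbound endpoint in Terraform, assuming hypothetical security group and subnet references:

```hcl
# Hypothetical sketch: Route 53 Resolver inbound endpoint spread across two
# subnets (two AZs) for high availability, accepting queries from spoke VPCs.
resource "aws_route53_resolver_endpoint" "inbound" {
  name      = "hub-inbound"
  direction = "INBOUND"

  security_group_ids = [aws_security_group.resolver_inbound.id]

  ip_address {
    subnet_id = local.subnets[0]
  }

  ip_address {
    subnet_id = local.subnets[1]
  }
}
```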
As a last step, we share the resolver rule that targets the inbound endpoint with the other accounts, and define a policy for security, through Resource Access Manager (RAM).
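Sharing through RAM might be sketched as follows; the resolver rule reference and the organization-wide principal are illustrative assumptions:

```hcl
# Hypothetical sketch: share a resolver rule with other accounts via RAM.
resource "aws_ram_resource_share" "resolver" {
  name                      = "hub-resolver-rules"
  allow_external_principals = false
}

resource "aws_ram_resource_association" "resolver_rule" {
  # aws_route53_resolver_rule.hub is assumed to be a FORWARD rule targeting
  # the inbound endpoint's IP addresses.
  resource_arn       = aws_route53_resolver_rule.hub.arn
  resource_share_arn = aws_ram_resource_share.resolver.arn
}

resource "aws_ram_principal_association" "org" {
  # Share with the whole organization, or list individual account IDs instead.
  principal          = data.aws_organizations_organization.current.arn
  resource_share_arn = aws_ram_resource_share.resolver.arn
}
```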
Spoke VPCs
- Each newly created spoke VPC needs to be associated with the resolver rules shared from the hub VPC. For example:
```hcl
data "aws_route53_resolver_rules" "eu_west_2" {
  owner_id     = var.resolver_rules[terraform.workspace]
  share_status = "SHARED_WITH_ME"
}

resource "aws_route53_resolver_rule_association" "eu_west_2" {
  for_each = data.aws_route53_resolver_rules.eu_west_2.resolver_rule_ids

  resolver_rule_id = each.value
  vpc_id           = data.terraform_remote_state.networking.outputs.network.aws_vpc.id
}
```
Minimizing downtime
Now you might ask: how do we move from decentralized VPC interface endpoints to centralized ones with as little downtime as possible?
In general, associate the shared resolver rules with the spoke VPC first, then destroy the decentralized VPC endpoints in a rolling deployment from development to production, backed by automated tests via Systems Manager Run Command or Lambda. This guarantees that you first gather knowledge about what can fail (everything fails, all the time) and document it in Confluence or even a README.md. That knowledge gives you confidence for production and makes the big change controllable and understandable for technical and non-technical people alike.
Key Takeaways
- Rushing architecture decisions often leads to long-term cost explosions.
- Interface endpoints are scalable; duplicating them per VPC isn't.
- Centralizing shared services like VPC endpoints saves money and simplifies security management.
- To convince stakeholders, lead with security and cost, not technical purity.
- AWS gives you the tools; architecture is about using them with purpose.