AMP (Managed Prometheus) with Terraform: A Production Deep Dive
Modern infrastructure teams face a relentless challenge: observability at scale. Traditional self-managed Prometheus deployments, while powerful, introduce significant operational overhead: patching, scaling, storage management, and high availability. This complexity distracts from core business logic. Terraform, as the leading infrastructure-as-code tool, needs a streamlined way to provision and manage these critical monitoring systems. Amazon Managed Service for Prometheus (AMP) directly addresses this, offering a serverless, scalable, and cost-effective Prometheus-compatible monitoring solution. It fits squarely within IaC pipelines, acting as a foundational component of a platform engineering stack and enabling self-service observability for development teams.
What is AMP (Managed Prometheus) in Terraform Context?
AMP is managed through the AWS provider in Terraform. The primary resource is aws_prometheus_workspace, which defines the core Prometheus workspace, including its alias and tags. Currently, there isn't a comprehensive, officially maintained Terraform module for AMP, which is a gap in the ecosystem, although several community-driven modules are emerging.
Terraform-specific behavior centers on the asynchronous nature of workspace creation: a new workspace passes through a CREATING status before it becomes ACTIVE, so dependencies between the workspace and downstream resources should be expressed explicitly, typically through resource references, depends_on, or data sources that read the workspace. The lifecycle block can be used to manage resource updates, but be aware that some attributes force replacement of the workspace rather than an in-place update. Importing existing AMP workspaces is possible, but requires careful handling of the workspace ID.
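As a minimal sketch of these mechanics (the alias, rules file, and workspace ID below are illustrative placeholders), an explicit dependency and the import of an existing workspace might look like this:
resource "aws_prometheus_workspace" "this" {
  alias = "platform-observability"
}

resource "aws_prometheus_rule_group_namespace" "this" {
  name         = "platform-rules"
  workspace_id = aws_prometheus_workspace.this.id # implicit dependency via the reference
  data         = file("${path.module}/rules.yaml")

  # Explicit ordering, only needed when no direct reference exists:
  depends_on = [aws_prometheus_workspace.this]
}

# Adopting an already-existing workspace into state by its ID:
#   terraform import aws_prometheus_workspace.this ws-12345678-90ab-cdef-1234-567890abcdef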
Use Cases and When to Use
AMP isn’t a one-size-fits-all solution, but excels in specific scenarios:
- Microservices Observability: Teams deploying numerous microservices benefit from AMP’s scalability and ease of integration with existing Prometheus-based tooling (e.g., Grafana, Alertmanager). SREs can quickly onboard new services without managing underlying infrastructure.
- Kubernetes Monitoring: AMP seamlessly integrates with Kubernetes clusters, providing a centralized monitoring solution for containerized applications. This is crucial for DevOps teams adopting container orchestration.
- Multi-Account Observability: Centralizing metrics across multiple AWS accounts is simplified with AMP. This allows for organization-wide visibility and reporting, a key requirement for platform engineering teams.
- Cost Optimization: AMP's pay-as-you-go pricing model can be more cost-effective than self-managed Prometheus, especially for workloads with variable demand, since charges track metric ingestion, storage, and query usage rather than provisioned capacity.
- Rapid Prototyping: Quickly spin up a Prometheus-compatible monitoring environment for proof-of-concept projects. This accelerates development cycles and reduces time-to-market.
Key Terraform Resources
Here are the essential Terraform building blocks for working with AMP:
- aws_prometheus_workspace: Defines the core AMP workspace.
resource "aws_prometheus_workspace" "example" {
workspace_name = "my-amp-workspace"
tags = {
Environment = "production"
}
}
- aws_iam_role: Creates an IAM role for workloads that access AMP.
resource "aws_iam_role" "amp_role" {
name = "amp-access-role"
assume_role_policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = "sts:AssumeRole",
Principal = {
Service = "monitoring.amazonaws.com"
},
Effect = "Allow",
Sid = ""
}
]
})
}
- aws_iam_policy: Grants permissions to the IAM role.
resource "aws_iam_policy" "amp_policy" {
name = "amp-policy"
description = "Policy for accessing AMP"
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = [
"prometheusext:DescribeWorkspace",
"prometheusext:GetWorkspace",
"prometheusext:ListWorkspaces"
],
Effect = "Allow",
Resource = "*"
}
]
})
}
- aws_iam_role_policy_attachment: Attaches the policy to the role.
resource "aws_iam_role_policy_attachment" "amp_attachment" {
role = aws_iam_role.amp_role.name
policy_arn = aws_iam_policy.amp_policy.arn
}
- data.aws_region: Dynamically retrieves the current AWS region.
data "aws_region" "current" {}
- data.aws_caller_identity: Retrieves information about the current AWS account.
data "aws_caller_identity" "current" {}
- aws_prometheus_rule_group_namespace: Defines a namespace of alerting and recording rules, supplied as Prometheus rule-file YAML.
resource "aws_prometheus_rule_group" "example" {
workspace_id = aws_prometheus_workspace.example.id
name = "my-rule-group"
rules = jsonencode([
{
alert = "HighCPUUsage"
expr = "sum(rate(node_cpu_seconds_total{mode=\"user\"}[5m])) > 0.8"
for = "5m"
labels = {
severity = "critical"
}
annotations = {
summary = "High CPU usage detected"
description = "CPU usage is above 80%."
}
}
])
}
- Remote write endpoint: the AWS provider has no dedicated remote write resource (aws_prometheus_remote_write_configuration does not exist); remote write is configured on the metric shipper (a Prometheus server, the ADOT collector, or Grafana Agent) against the workspace endpoint, which is convenient to expose as an output.
output "amp_remote_write_url" {
  description = "Remote write URL for Prometheus-compatible metric shippers"
  value       = "${aws_prometheus_workspace.example.prometheus_endpoint}api/v1/remote_write"
}
Common Patterns & Modules
Using for_each with aws_prometheus_rule_group_namespace allows rule groups to be created dynamically from a map of rule definitions (see the sketch below). Remote state backends (e.g., S3) are essential for collaboration and state locking. A layered architecture – separating core AMP infrastructure from application-specific monitoring configurations – promotes reusability. Monorepos are well-suited for managing AMP configurations alongside application code. While a definitive public module is lacking, several community efforts are available on the Terraform Registry, but they require thorough vetting.
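A minimal sketch of that pattern, assuming each entry in a hypothetical local map points at a Prometheus rule file in the module directory:
# Hypothetical map of team name => rule-file path.
locals {
  rule_files = {
    payments = "rules/payments.yaml"
    checkout = "rules/checkout.yaml"
  }
}

resource "aws_prometheus_rule_group_namespace" "team" {
  for_each = local.rule_files

  workspace_id = aws_prometheus_workspace.example.id
  name         = each.key
  data         = file("${path.module}/${each.value}")
}
Adding or removing a team's rules then becomes a one-line change to the map.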
Hands-On Tutorial
This example creates a basic AMP workspace and a simple rule group.
Provider Setup:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # Replace with your desired region
}
Resource Configuration:
resource "aws_prometheus_workspace" "example" {
workspace_name = "my-test-amp-workspace"
tags = {
Name = "Test AMP Workspace"
}
}
resource "aws_prometheus_rule_group" "example" {
workspace_id = aws_prometheus_workspace.example.id
name = "high-cpu-alert"
rules = jsonencode([
{
alert = "HighCPUUsage"
expr = "sum(rate(node_cpu_seconds_total{mode=\"user\"}[5m])) > 0.8"
for = "5m"
labels = {
severity = "critical"
}
annotations = {
summary = "High CPU usage detected"
description = "CPU usage is above 80%."
}
}
])
}
Apply & Destroy Output:
terraform init
terraform plan
terraform apply
terraform destroy
The terraform plan output will show the resources to be created, terraform apply will provision the AMP workspace and rule group namespace, and terraform destroy will remove them.
Enterprise Considerations
Large organizations leverage Terraform Cloud/Enterprise for state management, remote operations, and collaboration. Sentinel or Open Policy Agent (OPA) enforce policy-as-code, ensuring compliance with security and governance standards. IAM roles are meticulously designed with least privilege in mind. State locking prevents concurrent modifications. Costs are monitored using AWS Cost Explorer and Terraform Cloud’s cost estimation features. Multi-region deployments require careful consideration of data replication and workspace availability.
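A remote backend sketch that provides state locking; the bucket and DynamoDB table names are hypothetical and must already exist:
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"            # hypothetical state bucket
    key            = "observability/amp/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                     # hypothetical table used for state locking
    encrypt        = true
  }
}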
Security and Compliance
Least privilege is enforced through granular IAM policies. RBAC is implemented using IAM roles and policies. Policy constraints are defined using Sentinel or OPA. Drift detection is crucial; Terraform Cloud’s drift detection feature identifies unauthorized changes. Tagging policies ensure consistent metadata. Auditability is achieved through CloudTrail logging and Terraform Cloud’s audit logs.
# Example IAM Policy for AMP access
resource "aws_iam_policy" "amp_access_policy" {
name = "amp-access-policy"
description = "Policy granting access to AMP resources"
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = [
"prometheusext:DescribeWorkspace",
"prometheusext:GetWorkspace",
"prometheusext:ListWorkspaces",
"prometheusext:CreateWorkspace",
"prometheusext:DeleteWorkspace"
],
Effect = "Allow",
Resource = "*"
}
]
})
}
Integration with Other Services
graph LR
A[Terraform] --> B(AWS Managed Prometheus);
B --> C{Grafana};
B --> D{Alertmanager};
B --> E[EC2 Instances];
B --> F[EKS Clusters];
E --> B;
F --> B;
- Grafana: Visualize AMP metrics using Grafana data sources, for example an Amazon Managed Grafana workspace (see the sketch after this list).
- Alertmanager: Configure alerts based on AMP metrics.
- EC2 Instances: Expose metrics through Prometheus exporters on EC2 instances and remote write them to AMP from a Prometheus server or agent.
- EKS Clusters: Monitor Kubernetes clusters by shipping cluster metrics to AMP, for example with the AWS-managed collector or a self-managed Prometheus in the cluster.
- Lambda Functions: Export custom metrics from Lambda functions to AMP.
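A hedged sketch of the Grafana integration, assuming an Amazon Managed Grafana workspace authenticated through IAM Identity Center and a hypothetical pre-existing service role:
resource "aws_grafana_workspace" "example" {
  name                     = "amp-dashboards"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  data_sources             = ["PROMETHEUS"]           # data sources this workspace should query
  role_arn                 = aws_iam_role.grafana.arn # hypothetical role trusted by grafana.amazonaws.com
}
Inside Grafana, AMP is then added as a Prometheus data source pointing at the workspace's query endpoint.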
Module Design Best Practices
Abstract AMP into reusable modules with well-defined input variables (e.g., workspace name, tags, rule groups) and output variables (e.g., workspace ID, ARN). Use locals to simplify complex configurations. Document modules thoroughly using Markdown. Employ a remote backend for state management. Consider versioning modules using semantic versioning.
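A sketch of such a module interface; the file layout and variable names are illustrative:
# modules/amp/variables.tf
variable "alias" {
  description = "Workspace alias"
  type        = string
}

variable "tags" {
  description = "Tags applied to the workspace"
  type        = map(string)
  default     = {}
}

variable "rule_files" {
  description = "Map of rule group namespace name => Prometheus rule-file YAML content"
  type        = map(string)
  default     = {}
}

# modules/amp/main.tf
resource "aws_prometheus_workspace" "this" {
  alias = var.alias
  tags  = var.tags
}

resource "aws_prometheus_rule_group_namespace" "this" {
  for_each     = var.rule_files
  workspace_id = aws_prometheus_workspace.this.id
  name         = each.key
  data         = each.value
}

# modules/amp/outputs.tf
output "workspace_id" {
  description = "AMP workspace ID"
  value       = aws_prometheus_workspace.this.id
}

output "workspace_arn" {
  description = "AMP workspace ARN"
  value       = aws_prometheus_workspace.this.arn
}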
CI/CD Automation
# .github/workflows/amp-deploy.yml
name: Deploy AMP Infrastructure

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    # AWS credentials must be available to the job, e.g. via
    # aws-actions/configure-aws-credentials and an OIDC role.
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform fmt -check
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -out=tfplan
      - run: terraform apply tfplan
Pitfalls & Troubleshooting
- Workspace Creation Delays: Workspace creation can take several minutes. Use depends_on or data sources to ensure dependencies are met.
- IAM Permissions: Incorrect IAM permissions prevent access to AMP. Verify the IAM role has the necessary permissions.
- Rule Group Syntax Errors: Invalid YAML in a rule group namespace causes deployment failures. Validate rule files with promtool check rules or a YAML linter.
- Workspace ID Mismatch: Incorrect workspace ID in rule groups or remote write configurations leads to errors. Double-check the ID.
- API Rate Limits: Excessive API calls can trigger rate limits. Implement retry logic.
- Data Source Staleness: Data sources may return stale information. Refresh data sources before applying changes.
Pros and Cons
Pros:
- Serverless and scalable.
- Cost-effective for variable workloads.
- Simplified management compared to self-managed Prometheus.
- Seamless integration with AWS services.
Cons:
- Limited customization options compared to self-managed Prometheus.
- Lack of a comprehensive official Terraform module.
- Vendor lock-in to AWS.
- Asynchronous workspace creation requires careful dependency management.
Conclusion
AMP, when orchestrated with Terraform, provides a powerful and efficient solution for observability at scale. It addresses the operational burden of self-managed Prometheus while enabling infrastructure-as-code best practices. Engineers should prioritize evaluating community modules, integrating AMP into their CI/CD pipelines, and leveraging Sentinel/OPA for robust policy enforcement. Start with a proof-of-concept, focusing on a critical microservice or Kubernetes cluster, to unlock the strategic value of AMP within your organization.