AMP (Managed Prometheus) with Terraform: A Production Deep Dive
Modern infrastructure teams face a relentless challenge: observability at scale. Traditional self-managed Prometheus deployments, while powerful, introduce significant operational overhead: patching, scaling, storage management, and high availability. This complexity distracts from core business logic. Terraform, as the leading infrastructure-as-code tool, needs a streamlined way to provision and manage these critical monitoring systems. Amazon Managed Service for Prometheus (AMP) directly addresses this, offering a serverless, scalable, and cost-effective Prometheus-compatible monitoring solution. It fits squarely within IaC pipelines, acting as a foundational component of a platform engineering stack and enabling self-service observability for development teams.
What is AMP (Managed Prometheus) in Terraform Context?
AMP is managed through the AWS provider in Terraform. The primary resource is aws_prometheus_workspace, which defines the core Prometheus workspace, including its alias and tags. Currently, there isn't a comprehensive, officially maintained Terraform module for AMP, which is a gap in the ecosystem, although several community-driven modules are emerging.
Terraform-specific behavior centers on the asynchronous nature of workspace creation: a new workspace passes through a CREATING status before it becomes ACTIVE, so dependencies between the workspace and downstream resources should be expressed explicitly, typically through resource references, depends_on, or data sources that read the workspace. The lifecycle block can be used to manage resource updates, but be aware that some attributes force replacement of the workspace rather than an in-place update. Importing existing AMP workspaces is possible, but requires careful handling of the workspace ID.
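As a minimal sketch of these mechanics (the alias, rules file, and workspace ID below are illustrative placeholders), an explicit dependency and the import of an existing workspace might look like this:
resource "aws_prometheus_workspace" "this" {
  alias = "platform-observability"
}

resource "aws_prometheus_rule_group_namespace" "this" {
  name         = "platform-rules"
  workspace_id = aws_prometheus_workspace.this.id # implicit dependency via the reference
  data         = file("${path.module}/rules.yaml")

  # Explicit ordering, only needed when no direct reference exists:
  depends_on = [aws_prometheus_workspace.this]
}

# Adopting an already-existing workspace into state by its ID:
#   terraform import aws_prometheus_workspace.this ws-12345678-90ab-cdef-1234-567890abcdef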
Use Cases and When to Use
AMP isn’t a one-size-fits-all solution, but excels in specific scenarios:
- Microservices Observability: Teams deploying numerous microservices benefit from AMP’s scalability and ease of integration with existing Prometheus-based tooling (e.g., Grafana, Alertmanager). SREs can quickly onboard new services without managing underlying infrastructure.
- Kubernetes Monitoring: AMP seamlessly integrates with Kubernetes clusters, providing a centralized monitoring solution for containerized applications. This is crucial for DevOps teams adopting container orchestration.
- Multi-Account Observability: Centralizing metrics across multiple AWS accounts is simplified with AMP. This allows for organization-wide visibility and reporting, a key requirement for platform engineering teams.
- Cost Optimization: AMP's pay-as-you-go pricing model can be more cost-effective than self-managed Prometheus, especially for workloads with variable demand, since charges track metric ingestion, storage, and query usage rather than provisioned capacity.
- Rapid Prototyping: Quickly spin up a Prometheus-compatible monitoring environment for proof-of-concept projects. This accelerates development cycles and reduces time-to-market.
Key Terraform Resources
Here are the essential Terraform building blocks for working with AMP:
- aws_prometheus_workspace: Defines the core AMP workspace.
resource "aws_prometheus_workspace" "example" {
workspace_name = "my-amp-workspace"
tags = {
Environment = "production"
}
}
- aws_iam_role: Creates an IAM role for workloads that access AMP.
resource "aws_iam_role" "amp_role" {
name = "amp-access-role"
assume_role_policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = "sts:AssumeRole",
Principal = {
Service = "monitoring.amazonaws.com"
},
Effect = "Allow",
Sid = ""
}
]
})
}
- aws_iam_policy: Grants permissions to the IAM role.
resource "aws_iam_policy" "amp_policy" {
name = "amp-policy"
description = "Policy for accessing AMP"
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = [
"prometheusext:DescribeWorkspace",
"prometheusext:GetWorkspace",
"prometheusext:ListWorkspaces"
],
Effect = "Allow",
Resource = "*"
}
]
})
}
- aws_iam_role_policy_attachment: Attaches the policy to the role.
resource "aws_iam_role_policy_attachment" "amp_attachment" {
role = aws_iam_role.amp_role.name
policy_arn = aws_iam_policy.amp_policy.arn
}
- data.aws_region: Dynamically retrieves the current AWS region.
data "aws_region" "current" {}
- data.aws_caller_identity: Retrieves information about the current AWS account.
data "aws_caller_identity" "current" {}
- aws_prometheus_rule_group_namespace: Defines a namespace of alerting and recording rules, supplied as Prometheus rule-file YAML.
resource "aws_prometheus_rule_group" "example" {
workspace_id = aws_prometheus_workspace.example.id
name = "my-rule-group"
rules = jsonencode([
{
alert = "HighCPUUsage"
expr = "sum(rate(node_cpu_seconds_total{mode=\"user\"}[5m])) > 0.8"
for = "5m"
labels = {
severity = "critical"
}
annotations = {
summary = "High CPU usage detected"
description = "CPU usage is above 80%."
}
}
])
}
- Remote write endpoint: the AWS provider has no dedicated remote write resource (aws_prometheus_remote_write_configuration does not exist); remote write is configured on the metric shipper (a Prometheus server, the ADOT collector, or Grafana Agent) against the workspace endpoint, which is convenient to expose as an output.
output "amp_remote_write_url" {
  description = "Remote write URL for Prometheus-compatible metric shippers"
  value       = "${aws_prometheus_workspace.example.prometheus_endpoint}api/v1/remote_write"
}
Common Patterns & Modules
Using for_each with aws_prometheus_rule_group_namespace allows rule groups to be created dynamically from a map of rule definitions (see the sketch below). Remote state backends (e.g., S3) are essential for collaboration and state locking. A layered architecture – separating core AMP infrastructure from application-specific monitoring configurations – promotes reusability. Monorepos are well-suited for managing AMP configurations alongside application code. While a definitive public module is lacking, several community efforts are available on the Terraform Registry, but they require thorough vetting.
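A minimal sketch of that pattern, assuming each entry in a hypothetical local map points at a Prometheus rule file in the module directory:
# Hypothetical map of team name => rule-file path.
locals {
  rule_files = {
    payments = "rules/payments.yaml"
    checkout = "rules/checkout.yaml"
  }
}

resource "aws_prometheus_rule_group_namespace" "team" {
  for_each = local.rule_files

  workspace_id = aws_prometheus_workspace.example.id
  name         = each.key
  data         = file("${path.module}/${each.value}")
}
Adding or removing a team's rules then becomes a one-line change to the map.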
Hands-On Tutorial
This example creates a basic AMP workspace and a simple rule group.
Provider Setup:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1" # Replace with your desired region
}
Resource Configuration:
resource "aws_prometheus_workspace" "example" {
workspace_name = "my-test-amp-workspace"
tags = {
Name = "Test AMP Workspace"
}
}
resource "aws_prometheus_rule_group" "example" {
workspace_id = aws_prometheus_workspace.example.id
name = "high-cpu-alert"
rules = jsonencode([
{
alert = "HighCPUUsage"
expr = "sum(rate(node_cpu_seconds_total{mode=\"user\"}[5m])) > 0.8"
for = "5m"
labels = {
severity = "critical"
}
annotations = {
summary = "High CPU usage detected"
description = "CPU usage is above 80%."
}
}
])
}
Apply & Destroy Output:
terraform init
terraform plan
terraform apply
terraform destroy
The terraform plan output will show the resources to be created, terraform apply will provision the AMP workspace and rule group namespace, and terraform destroy will remove them.
Enterprise Considerations
Large organizations leverage Terraform Cloud/Enterprise for state management, remote operations, and collaboration. Sentinel or Open Policy Agent (OPA) enforce policy-as-code, ensuring compliance with security and governance standards. IAM roles are meticulously designed with least privilege in mind. State locking prevents concurrent modifications. Costs are monitored using AWS Cost Explorer and Terraform Cloud’s cost estimation features. Multi-region deployments require careful consideration of data replication and workspace availability.
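A remote backend sketch that provides state locking; the bucket and DynamoDB table names are hypothetical and must already exist:
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"            # hypothetical state bucket
    key            = "observability/amp/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"                     # hypothetical table used for state locking
    encrypt        = true
  }
}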
Security and Compliance
Least privilege is enforced through granular IAM policies. RBAC is implemented using IAM roles and policies. Policy constraints are defined using Sentinel or OPA. Drift detection is crucial; Terraform Cloud’s drift detection feature identifies unauthorized changes. Tagging policies ensure consistent metadata. Auditability is achieved through CloudTrail logging and Terraform Cloud’s audit logs.
# Example IAM Policy for AMP access
resource "aws_iam_policy" "amp_access_policy" {
name = "amp-access-policy"
description = "Policy granting access to AMP resources"
policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = [
"prometheusext:DescribeWorkspace",
"prometheusext:GetWorkspace",
"prometheusext:ListWorkspaces",
"prometheusext:CreateWorkspace",
"prometheusext:DeleteWorkspace"
],
Effect = "Allow",
Resource = "*"
}
]
})
}
Integration with Other Services
graph LR
A[Terraform] --> B(AWS Managed Prometheus);
B --> C{Grafana};
B --> D{Alertmanager};
B --> E[EC2 Instances];
B --> F[EKS Clusters];
E --> B;
F --> B;
- Grafana: Visualize AMP metrics using Grafana data sources, for example an Amazon Managed Grafana workspace (see the sketch after this list).
- Alertmanager: Configure alerts based on AMP metrics.
- EC2 Instances: Expose metrics through Prometheus exporters on EC2 instances and remote write them to AMP from a Prometheus server or agent.
- EKS Clusters: Monitor Kubernetes clusters by shipping cluster metrics to AMP, for example with the AWS-managed collector or a self-managed Prometheus in the cluster.
- Lambda Functions: Export custom metrics from Lambda functions to AMP.
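A hedged sketch of the Grafana integration, assuming an Amazon Managed Grafana workspace authenticated through IAM Identity Center and a hypothetical pre-existing service role:
resource "aws_grafana_workspace" "example" {
  name                     = "amp-dashboards"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  data_sources             = ["PROMETHEUS"]           # data sources this workspace should query
  role_arn                 = aws_iam_role.grafana.arn # hypothetical role trusted by grafana.amazonaws.com
}
Inside Grafana, AMP is then added as a Prometheus data source pointing at the workspace's query endpoint.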
Module Design Best Practices
Abstract AMP into reusable modules with well-defined input variables (e.g., workspace name, tags, rule groups) and output variables (e.g., workspace ID, ARN). Use locals to simplify complex configurations. Document modules thoroughly using Markdown. Employ a remote backend for state management. Consider versioning modules using semantic versioning.
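A sketch of such a module interface; the file layout and variable names are illustrative:
# modules/amp/variables.tf
variable "alias" {
  description = "Workspace alias"
  type        = string
}

variable "tags" {
  description = "Tags applied to the workspace"
  type        = map(string)
  default     = {}
}

variable "rule_files" {
  description = "Map of rule group namespace name => Prometheus rule-file YAML content"
  type        = map(string)
  default     = {}
}

# modules/amp/main.tf
resource "aws_prometheus_workspace" "this" {
  alias = var.alias
  tags  = var.tags
}

resource "aws_prometheus_rule_group_namespace" "this" {
  for_each     = var.rule_files
  workspace_id = aws_prometheus_workspace.this.id
  name         = each.key
  data         = each.value
}

# modules/amp/outputs.tf
output "workspace_id" {
  description = "AMP workspace ID"
  value       = aws_prometheus_workspace.this.id
}

output "workspace_arn" {
  description = "AMP workspace ARN"
  value       = aws_prometheus_workspace.this.arn
}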
CI/CD Automation
# .github/workflows/amp-deploy.yml
name: Deploy AMP Infrastructure

on:
  push:
    branches:
      - main

jobs:
  deploy:
    runs-on: ubuntu-latest
    # AWS credentials must be available to the job, e.g. via
    # aws-actions/configure-aws-credentials and an OIDC role.
    steps:
      - uses: actions/checkout@v3
      - uses: hashicorp/setup-terraform@v2
      - run: terraform fmt -check
      - run: terraform init
      - run: terraform validate
      - run: terraform plan -out=tfplan
      - run: terraform apply tfplan
Pitfalls & Troubleshooting
- Workspace Creation Delays: Workspace creation can take several minutes. Use depends_on or data sources to ensure dependencies are met.
- IAM Permissions: Incorrect IAM permissions prevent access to AMP. Verify the IAM role has the necessary permissions.
- Rule Group Syntax Errors: Invalid YAML in a rule group namespace causes deployment failures. Validate rule files with promtool check rules or a YAML linter.
- Workspace ID Mismatch: Incorrect workspace ID in rule groups or remote write configurations leads to errors. Double-check the ID.
- API Rate Limits: Excessive API calls can trigger rate limits. Implement retry logic.
- Data Source Staleness: Data sources may return stale information. Refresh data sources before applying changes.
Pros and Cons
Pros:
- Serverless and scalable.
- Cost-effective for variable workloads.
- Simplified management compared to self-managed Prometheus.
- Seamless integration with AWS services.
Cons:
- Limited customization options compared to self-managed Prometheus.
- Lack of a comprehensive official Terraform module.
- Vendor lock-in to AWS.
- Asynchronous workspace creation requires careful dependency management.
Conclusion
AMP, when orchestrated with Terraform, provides a powerful and efficient solution for observability at scale. It addresses the operational burden of self-managed Prometheus while enabling infrastructure-as-code best practices. Engineers should prioritize evaluating community modules, integrating AMP into their CI/CD pipelines, and leveraging Sentinel/OPA for robust policy enforcement. Start with a proof-of-concept, focusing on a critical microservice or Kubernetes cluster, to unlock the strategic value of AMP within your organization.