DevOps Fundamental for DevOps Fundamentals

Posted on Jun 21

Terraform Fundamentals: CloudWatch Internet Monitor

#terraform #iac #aws #cloudwatchinternetmonitor

Monitoring the Internet's View of Your Infrastructure with Terraform and CloudWatch Internet Monitor

The relentless pressure to deliver reliable, performant applications demands proactive monitoring, not just of internal metrics, but of the user experience. Traditional infrastructure monitoring often falls short in detecting issues stemming from network conditions, DNS resolution problems, or edge server outages impacting global reach. This is especially critical for globally distributed applications and those reliant on CDNs. CloudWatch Internet Monitor (CWIM) addresses this gap by using the connectivity data AWS collects across its global network to report on the availability and performance of internet traffic between your users and your AWS-hosted application, broken down by location and network provider. Integrating CWIM into a Terraform-driven infrastructure pipeline allows for automated, version-controlled, and repeatable monitoring setup alongside application deployment. This isn’t a “nice-to-have” anymore; it’s a foundational component of a robust SRE practice and a key element in platform engineering stacks aiming for self-healing infrastructure.

What is CloudWatch Internet Monitor in Terraform Context?

CloudWatch Internet Monitor allows you to create monitors that track the availability and performance of internet traffic reaching your AWS resources (VPCs, CloudFront distributions, Network Load Balancers, WorkSpaces directories) from locations around the world. Terraform manages these monitors through the aws_internetmonitor_monitor resource within the AWS provider. Currently, there isn’t a dedicated Terraform module for CWIM widely adopted in the community, meaning you’ll typically define resources directly in your HCL.

The resource itself is relatively straightforward, but understanding its lifecycle is crucial. CWIM monitors have a propagation delay after creation or modification; it can take several minutes for the monitor to become fully active and start reporting data. Anything that reads monitor data immediately after apply (alarms, smoke tests, downstream automation) may see missing data or errors. A time_sleep resource from the hashicorp/time provider, wired in with depends_on, is often used to add a delay and avoid these issues. Furthermore, deleting a monitor doesn’t immediately remove the associated data; it’s retained for a period defined by AWS.
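
The delay pattern described above can be sketched as follows. This is a minimal illustration, assuming the AWS provider's aws_internetmonitor_monitor resource and the hashicorp/time provider; the monitor name, the monitored_resource_arns variable, and the five-minute duration are placeholders rather than recommended values.

```hcl
terraform {
  required_providers {
    aws  = { source = "hashicorp/aws" }
    time = { source = "hashicorp/time" }
  }
}

# Placeholder input: ARNs of the resources to monitor.
variable "monitored_resource_arns" {
  type = list(string)
}

resource "aws_internetmonitor_monitor" "example" {
  monitor_name                  = "my-internet-monitor"
  resources                     = var.monitored_resource_arns
  traffic_percentage_to_monitor = 100
}

# Waits after the monitor is created; the duration is an assumption
# and should be tuned to what you observe in your account.
resource "time_sleep" "wait_for_monitor" {
  depends_on      = [aws_internetmonitor_monitor.example]
  create_duration = "5m"
}
```

Any resource that should only be created once the monitor has had time to settle can list time_sleep.wait_for_monitor in its own depends_on.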

Use Cases and When to Use

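CWIM isn’t a replacement for traditional infrastructure monitoring, but complements it. Keep in mind that a monitor observes traffic to the AWS resources attached to it, so the scenarios below apply to the extent that the traffic in question flows through those resources. Here are key scenarios: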

Global Application Availability: Monitoring the accessibility of your application from multiple geographic regions. Critical for ensuring a consistent user experience regardless of location. SRE teams use this to establish SLOs and track error budgets.
CDN Performance Validation: Verifying that your CDN is correctly caching content and delivering it with acceptable latency. DevOps teams can integrate this into release pipelines to validate CDN configuration changes.
DNS Propagation Monitoring: Confirming that DNS changes have propagated correctly across the internet. Essential for minimizing downtime during DNS updates. Infrastructure teams rely on this during disaster recovery drills.
Third-Party API Dependency Monitoring: Tracking the availability and performance of external APIs your application relies on. Allows for proactive alerting and fallback mechanisms.
Edge Service Health Checks: Monitoring the health of edge services like load balancers or API gateways. Provides early warning of issues before they impact end-users.

Key Terraform Resources

Here are essential Terraform resources for working with CWIM:

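aws_internetmonitor_monitor: The core resource for creating and managing monitors. It takes a monitor name and the ARNs of the AWS resources to watch, rather than a list of endpoint health checks.

   resource "aws_internetmonitor_monitor" "example" {
     monitor_name = "my-internet-monitor"

     # ARNs of the resources whose internet traffic is monitored.
     # var.monitored_resource_arns is a placeholder you declare yourself.
     resources = var.monitored_resource_arns

     # Share of application traffic to monitor, in percent.
     traffic_percentage_to_monitor = 100

     # Scores (in percent) below which a health event is raised.
     health_events_config {
       availability_score_threshold = 95
       performance_score_threshold  = 95
     }
   }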

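aws_route53_health_check: Performs active probes against an endpoint (domain, port, protocol, path), which complements CWIM's traffic-based view.

   resource "aws_route53_health_check" "example" {
     fqdn              = "example.com" # endpoint to probe
     type              = "HTTP"
     port              = 80
     resource_path     = "/"
     failure_threshold = 3
     request_interval  = 30

     # Health checks have no name argument; the Name tag is shown in the console.
     tags = {
       Name = "my-route53-health-check"
     }
   }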

aws_cloudwatch_metric_alarm: Alerting based on CWIM monitor data.

   resource "aws_cloudwatch_metric_alarm" "example" {
     alarm_name          = "internet-monitor-availability"
     comparison_operator = "LessThanThreshold"
     evaluation_periods  = 1
     metric_name         = "AvailabilityScore" # reported as a percentage
     namespace           = "AWS/InternetMonitor"
     period              = 60
     statistic           = "Average"
     threshold           = 95 # illustrative value; tune to your SLO
     alarm_description   = "Alert when the monitor's availability score drops below 95"
     # Confirm the dimension set against the metrics listed in your account.
     dimensions = {
       MeasurementSource = "AWS"
       MonitorName       = aws_internetmonitor_monitor.example.monitor_name
     }
   }

aws_iam_role & aws_iam_policy: For granting necessary permissions.
aws_cloudwatch_log_group: For storing CWIM logs.
aws_cloudwatch_log_stream: For managing log streams within a log group.
data.aws_region: To dynamically determine the AWS region.
terraform_remote_state: A data source for reading outputs from another configuration's state, useful when teams share infrastructure.

Common Patterns & Modules

Using for_each with aws_internetmonitor_monitor allows you to create multiple monitors for different applications or environments. Note that for_each accepts a map or a set of strings, so a list of objects has to be converted to a map first.

variable "monitors" {
  type = list(object({
    name          = string
    resource_arns = list(string)
  }))
  # Placeholder ARNs; replace with your own resources.
  default = [
    { name = "web", resource_arns = ["arn:aws:ec2:us-east-1:111122223333:vpc/vpc-0123456789abcdef0"] },
    { name = "api", resource_arns = ["arn:aws:ec2:us-east-1:111122223333:vpc/vpc-0fedcba9876543210"] }
  ]
}

resource "aws_internetmonitor_monitor" "per_app" {
  # Key the list by name so for_each receives a map.
  for_each = { for m in var.monitors : m.name => m }

  monitor_name                  = "internet-monitor-${each.key}"
  resources                     = each.value.resource_arns
  traffic_percentage_to_monitor = 100
}

A layered module structure is recommended. A base module handles the core aws_internetmonitor_monitor resource, while separate modules manage alerting and IAM roles. This promotes reusability and maintainability. While no widely adopted public module exists, building your own is a worthwhile investment for larger organizations.
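
As a rough sketch of that layering, a root configuration might wire a base module and an alerting module together like this. The module paths, input names, and outputs are all hypothetical; they illustrate the shape of the split, not an existing module.

```hcl
# Placeholder inputs supplied by the caller.
variable "monitored_resource_arns" {
  type = list(string)
}

variable "alert_topic_arn" {
  type = string
}

# Base layer: owns only the monitor itself.
module "monitor" {
  source = "./modules/internet-monitor" # hypothetical local module

  monitor_name  = "checkout-prod"
  resource_arns = var.monitored_resource_arns
}

# Alerting layer: consumes the base module's output.
module "alerting" {
  source = "./modules/internet-monitor-alerting" # hypothetical local module

  monitor_name  = module.monitor.monitor_name
  sns_topic_arn = var.alert_topic_arn
}
```

Passing module.monitor.monitor_name into the alerting module gives Terraform the dependency it needs to order the two layers without an explicit depends_on.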

Hands-On Tutorial

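This example creates a basic CWIM monitor for a single AWS resource, such as the VPC or CloudFront distribution that serves example.com.

Provider Setup: (Assume AWS provider is already configured)

Resource Configuration:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

variable "monitored_resource_arn" {
  description = "ARN of the resource to monitor (VPC, CloudFront distribution, NLB, or WorkSpaces directory)"
  type        = string
}

resource "aws_internetmonitor_monitor" "example" {
  monitor_name                  = "example-com-monitor"
  resources                     = [var.monitored_resource_arn]
  traffic_percentage_to_monitor = 100
}

output "monitor_name" {
  value = aws_internetmonitor_monitor.example.monitor_name
}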

Apply & Destroy:

terraform init
terraform plan
terraform apply

terraform plan output will show the creation of the aws_internetmonitor_monitor resource. After applying, the monitor will take several minutes to become active.

terraform destroy

This will delete the monitor. Remember that data retention policies apply.

Enterprise Considerations

Large organizations should leverage Terraform Cloud/Enterprise for state locking, remote operations, and collaboration. Sentinel or Open Policy Agent (OPA) can enforce policy-as-code, ensuring compliance with security and governance standards. IAM roles should be narrowly scoped, granting only the necessary permissions to Terraform. Consider using separate workspaces for different environments (dev, staging, production). Costs can be significant with a large number of monitors; they depend in part on how many city-networks (location and network provider pairs) are monitored, so consider capping that number with max_city_networks_to_monitor or lowering traffic_percentage_to_monitor. Multi-region deployments require careful planning to ensure monitors cover all relevant geographic areas.
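
One way to combine per-environment workspaces with a cost cap is sketched below. The workspace names, the cap values, and the monitored_resource_arns variable are assumptions chosen for illustration; max_city_networks_to_monitor is the provider argument that limits how many city-networks a monitor tracks.

```hcl
# Placeholder input: ARNs of the resources to monitor.
variable "monitored_resource_arns" {
  type = list(string)
}

locals {
  # Illustrative caps per workspace; not recommended values.
  city_network_caps = {
    dev        = 10
    staging    = 50
    production = 500
  }

  # Fall back to the smallest cap for unknown workspaces.
  city_network_cap = lookup(local.city_network_caps, terraform.workspace, 10)
}

resource "aws_internetmonitor_monitor" "capped" {
  monitor_name                 = "app-${terraform.workspace}"
  resources                    = var.monitored_resource_arns
  max_city_networks_to_monitor = local.city_network_cap
}
```

Keeping the caps in a single map makes the cost trade-off visible in code review instead of being scattered across variable files.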

Security and Compliance

Enforce least privilege using IAM policies. For example:

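resource "aws_iam_policy" "cwim_policy" {
  name        = "cwim-terraform-policy"
  description = "Policy for Terraform to manage CloudWatch Internet Monitor"
  policy      = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Create/read/update/delete plus tag management for monitors.
        Action = [
          "internetmonitor:CreateMonitor",
          "internetmonitor:UpdateMonitor",
          "internetmonitor:DeleteMonitor",
          "internetmonitor:GetMonitor",
          "internetmonitor:ListMonitors",
          "internetmonitor:ListTagsForResource",
          "internetmonitor:TagResource",
          "internetmonitor:UntagResource"
        ]
        Effect   = "Allow"
        # Narrow this to specific monitor ARNs where possible.
        Resource = "*"
      }
    ]
  })
}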

Implement tagging policies to categorize monitors for cost allocation and reporting. Drift detection should be enabled in Terraform Cloud/Enterprise to identify unauthorized changes.
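
A tagging baseline can be sketched with the AWS provider's default_tags block, which applies tags to every taggable resource the provider creates. The region, tag keys, tag values, and variable name below are placeholders.

```hcl
# Placeholder input: ARNs of the resources to monitor.
variable "monitored_resource_arns" {
  type = list(string)
}

provider "aws" {
  region = "us-east-1" # placeholder region

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      CostCenter = "platform-observability" # placeholder value
      ManagedBy  = "terraform"
    }
  }
}

resource "aws_internetmonitor_monitor" "tagged" {
  monitor_name                  = "checkout-monitor"
  resources                     = var.monitored_resource_arns
  traffic_percentage_to_monitor = 100

  # Resource-level tags are merged with the provider defaults.
  tags = {
    Application = "checkout"
  }
}
```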

Integration with Other Services

Here's how CWIM integrates with other services:

CloudWatch Alarms: (Shown previously) Triggered by CWIM data.
SNS Notifications: Alerts sent via SNS.
Lambda Functions: Automated remediation triggered by CWIM alerts.
Route 53 Health Checks: Used to enhance CWIM health checks.
EventBridge: Routing CWIM events to other AWS services.

graph LR
    A[CloudWatch Internet Monitor] --> B(CloudWatch Alarms);
    A --> C(SNS Notifications);
    A --> D(Lambda Functions);
    A --> E[Route 53 Health Checks];
    A --> F(EventBridge);
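
The EventBridge and SNS paths in the diagram can be sketched as below. This is an illustration under assumptions: the resource names are placeholders, and the event pattern matches only on the aws.internetmonitor source, so check the event fields in your account before filtering more narrowly.

```hcl
resource "aws_sns_topic" "internet_monitor_alerts" {
  name = "internet-monitor-alerts"
}

# Matches events emitted by CloudWatch Internet Monitor.
resource "aws_cloudwatch_event_rule" "internet_monitor" {
  name        = "internet-monitor-health-events"
  description = "Routes Internet Monitor events to SNS"

  event_pattern = jsonencode({
    source = ["aws.internetmonitor"]
  })
}

resource "aws_cloudwatch_event_target" "to_sns" {
  rule = aws_cloudwatch_event_rule.internet_monitor.name
  arn  = aws_sns_topic.internet_monitor_alerts.arn
}

# EventBridge needs permission to publish to the topic.
data "aws_iam_policy_document" "allow_eventbridge" {
  statement {
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.internet_monitor_alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "allow_eventbridge" {
  arn    = aws_sns_topic.internet_monitor_alerts.arn
  policy = data.aws_iam_policy_document.allow_eventbridge.json
}
```

Setting a topic policy replaces the topic's default policy, so any other publishers to the same topic need their own statements added to the policy document.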

Module Design Best Practices

Abstract CWIM into reusable modules with clear input variables (e.g., monitor_name, resources, traffic_percentage_to_monitor). Use output variables to expose key monitor attributes (e.g., monitor_name, monitor_arn). Leverage locals for default values and complex calculations. Thorough documentation is essential. Use a remote backend (e.g., S3) for state storage.
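
A minimal module body following those practices might look like this. The variable and output names are illustrative choices, and the file is assumed to live in a local module directory of your own.

```hcl
variable "monitor_name" {
  description = "Name of the Internet Monitor monitor"
  type        = string
}

variable "resource_arns" {
  description = "ARNs of the AWS resources to monitor"
  type        = list(string)
}

variable "traffic_percentage" {
  description = "Percentage of traffic to monitor; null uses the module default"
  type        = number
  default     = null
}

locals {
  # Module-level default, kept in one place.
  traffic_percentage = coalesce(var.traffic_percentage, 100)
}

resource "aws_internetmonitor_monitor" "this" {
  monitor_name                  = var.monitor_name
  resources                     = var.resource_arns
  traffic_percentage_to_monitor = local.traffic_percentage
}

output "monitor_name" {
  value = aws_internetmonitor_monitor.this.monitor_name
}

output "monitor_arn" {
  value = aws_internetmonitor_monitor.this.arn
}
```

Using a nullable variable with coalesce lets callers omit the setting entirely while still allowing an explicit override.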

CI/CD Automation

Here's a simplified GitHub Actions workflow:

name: Terraform Apply

on:
  push:
    branches:
      - main

jobs:
  apply:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Installs the Terraform CLI on the runner
      - uses: hashicorp/setup-terraform@v3
      # AWS credentials are assumed to be supplied separately (e.g., OIDC or repository secrets)
      - run: terraform init -input=false
      - run: terraform fmt -check
      - run: terraform validate
      - run: terraform plan -input=false -out=tfplan
      # Applying a saved plan does not prompt for approval
      - run: terraform apply -input=false tfplan

Pitfalls & Troubleshooting

Propagation Delay: Errors or missing data when something reads monitor status before the monitor is active. Solution: Add a time_sleep resource and reference it with depends_on from the resources that must wait.
IAM Permissions: Insufficient permissions for Terraform to create/manage monitors. Solution: Review and update IAM policies.
Incorrect Protocol/Port: Health checks failing due to misconfigured protocol or port. Solution: Verify endpoint configuration.
DNS Resolution Issues: Monitors failing due to DNS resolution problems. Solution: Check DNS records and propagation.
Throttling: AWS API throttling limiting monitor creation/updates. Solution: Implement retry logic or reduce request frequency.
Monitor Limit: Reaching the account limit for Internet Monitors. Solution: Request an increase from AWS support.

Pros and Cons

Pros:

Proactive detection of user-facing issues.
Global visibility into application availability.
Automated monitoring setup via IaC.
Integration with existing AWS monitoring tools.

Cons:

Propagation delay requires careful Terraform configuration.
Cost can be significant at scale.
Limited customization options compared to other monitoring solutions.
Lack of a widely adopted community module.

Conclusion

CloudWatch Internet Monitor, when integrated with Terraform, provides a powerful mechanism for proactively monitoring the user experience of your applications. It’s a critical component of a modern SRE practice and a valuable addition to any platform engineering stack. Start by incorporating CWIM into a proof-of-concept for a critical application, evaluate existing modules or build your own, and establish a CI/CD pipeline to automate its deployment and management. The investment in proactive monitoring will pay dividends in reduced downtime and improved customer satisfaction.
