Platform engineers have a new nightmare: explaining to their CTO why the AI agent deployment that worked perfectly in staging is now burning through $50,000/month in production. The Terraform config looks flawless. The security groups are properly configured. The ECS tasks are healthy. But somehow, the vector database is choking on embeddings, the LLM gateway is routing traffic to the wrong regions, and the workflow orchestration is stuck in an infinite retry loop.
Traditional IaC tools weren't built for this complexity.
Traditional IaC Can't Handle AI Workloads
When ChatGPT generates your Terraform config, it looks perfect. But deploy it and everything breaks:
# This looks right but will fail in production
resource "aws_security_group" "ai_agent" {
  name = "ai-agent-sg"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"] # ❌ Too permissive
  }
}
resource "aws_ecs_service" "ai_agent" {
name = "ai-agent"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.ai_agent.arn
# ❌ Missing: vector DB networking, LLM provider configs,
# retry policies, cost controls, monitoring...
}
LLMs generating IaC are trained on public examples, not production systems. They miss vector database networking, multi-provider LLM failover, and other complexities that break under real traffic.
AI agents need completely different infrastructure:
Traditional Layer:
- Compute (ECS/Lambda)
- Storage (S3/EBS)
- Database (RDS)
- Networking (VPC/ALB)

AI-Specific Layer:
- Vector Database (Pinecone/Weaviate)
- LLM Gateway (Multi-provider routing)
- Workflow Orchestration (Temporal/Prefect)
- Model Serving & State Management
Each of these AI-specific components has its own failure modes and scaling patterns, but traditional IaC treats them all as generic cloud resources.
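To make that concrete, here's a minimal sketch of the kind of failover logic an LLM gateway layer has to encode: try a primary provider, fall back to a secondary when it errors. The provider names, endpoints, and response shape below are illustrative placeholders, not any specific gateway's API:

// A sketch of the failover logic an LLM gateway layer encodes.
// Provider names, endpoints, and the response shape are placeholders.
type Provider = { name: string; endpoint: string };

const providers: Provider[] = [
  { name: "primary", endpoint: "https://api.primary-llm.example/v1" },
  { name: "fallback", endpoint: "https://api.fallback-llm.example/v1" },
];

async function completeWithFailover(prompt: string): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      const res = await fetch(`${provider.endpoint}/completions`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
      });
      if (!res.ok) {
        throw new Error(`${provider.name} returned HTTP ${res.status}`);
      }
      const data = await res.json();
      return data.completion; // placeholder response field
    } catch (err) {
      lastError = err; // fall through and try the next provider
    }
  }
  throw lastError;
}

Multiply that by rate limits, regional routing, and cost-aware model selection, and it's clear why this layer deserves first-class infrastructure treatment.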
What Actually Works
Pulumi for AI Infrastructure
Pulumi has native AI providers that treat vector databases and LLM gateways as real infrastructure. The trade-off? Your team needs to learn TypeScript/Python instead of HCL, and you're betting on a smaller ecosystem than Terraform's.
Alternative approaches:
- Custom Terraform providers - Build your own for AI services (more work, but stays in Terraform)
- Terraform + scripts - Use Terraform for basic infra, scripts for AI-specific parts
- AWS CDK - Good if you're AWS-only
import * as pinecone from "@pulumi/pinecone";
import * as temporal from "@pulumi/temporal";

// Native vector database support
const vectorIndex = new pinecone.Index("knowledge-base", {
  name: "customer-support-kb",
  metric: "cosine",
  dimension: 1536,
  spec: {
    serverless: {
      cloud: "aws",
      region: "us-east-1"
    }
  }
});

// Workflow orchestration as code
const aiWorkflow = new temporal.Namespace("ai-workflows", {
  namespace: "customer-support",
  retention: "7d"
});
Temporal Handles Complex AI Workflows
Temporal manages the orchestration that AI agents need. Downsides: another system to operate, and your team needs to learn workflow concepts.
Alternatives:
- Prefect - Similar to Temporal but more Python-native
- Step Functions - AWS-native, simpler but less powerful
- Kubernetes Jobs - If you want to stay close to K8s
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class CustomerSupportAgent:
    @workflow.run
    async def handle_request(self, user_query: str) -> str:
        # Survives infrastructure failures
        context = await workflow.execute_activity(
            search_knowledge_base,
            user_query,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Automatic retries with backoff
        response = await workflow.execute_activity(
            call_llm_with_context,
            {"query": user_query, "context": context},
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        # Long-running workflows (hours/days/weeks)
        if needs_human_review(response):
            await workflow.wait_condition(
                lambda: workflow.info().search_attributes.get("approved")
            )

        return response
Cost controls belong in the infrastructure definition as well. A reusable component can encode patterns like spot capacity for training and right-sized capacity for inference:

import pulumi
import pulumi_aws as aws


class CostOptimizedAI(pulumi.ComponentResource):
    def __init__(self, name: str, opts=None):
        super().__init__("custom:ai:CostOptimizedAI", name, None, opts)

        # Spot instances for training
        self.training_cluster = aws.ecs.Cluster(
            f"{name}-training",
            capacity_providers=["FARGATE_SPOT"],
        )

        # Reserved capacity for production
        self.inference_service = aws.ecs.Service(
            f"{name}-inference",
            desired_count=self.calculate_optimal_capacity(),
        )
Security and Operational Considerations
API Key Management:
- Use AWS Secrets Manager or Azure Key Vault for LLM API keys
- Rotate keys on a schedule (AWS Secrets Manager supports automated rotation)
- Never put API keys in your IaC code - use secret references
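A minimal sketch of that pattern with Pulumi and ECS, assuming the key already lives in Secrets Manager (the secret name, image, and environment variable below are placeholders):

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

// Look up the existing secret by name instead of hard-coding the key.
// "openai/api-key" is a placeholder secret name.
const llmKey = aws.secretsmanager.getSecretOutput({ name: "openai/api-key" });

// Pass only a reference into the task definition; ECS injects the value at runtime.
// An executionRoleArn with permission to read the secret is also required (omitted here).
const taskDef = new aws.ecs.TaskDefinition("ai-agent", {
  family: "ai-agent",
  requiresCompatibilities: ["FARGATE"],
  networkMode: "awsvpc",
  cpu: "256",
  memory: "512",
  containerDefinitions: pulumi.jsonStringify([{
    name: "agent",
    image: "my-org/ai-agent:latest", // placeholder image
    secrets: [{
      name: "OPENAI_API_KEY",
      valueFrom: llmKey.arn, // the key itself never appears in code
    }],
  }]),
});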
Rollback Strategy:
- AI infrastructure changes can break in subtle ways
- Always test rollbacks in staging first
- Keep vector database backups before schema changes
- Use blue-green deployments for model updates
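One way to get blue-green for model updates, sketched with Pulumi and an ALB: run the current and candidate model services behind separate target groups and shift listener weights only after the candidate passes health checks. The config values (vpcId, albArn) are assumed to point at existing resources, and the health check path is a placeholder:

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

const cfg = new pulumi.Config();
const vpcId = cfg.require("vpcId");   // existing VPC
const albArn = cfg.require("albArn"); // existing application load balancer

// "Blue" serves the current model, "green" the candidate model version
const blue = new aws.lb.TargetGroup("model-blue", {
  port: 8080,
  protocol: "HTTP",
  targetType: "ip",
  vpcId: vpcId,
  healthCheck: { path: "/healthz" },
});

const green = new aws.lb.TargetGroup("model-green", {
  port: 8080,
  protocol: "HTTP",
  targetType: "ip",
  vpcId: vpcId,
  healthCheck: { path: "/healthz" },
});

// Weighted forwarding: start at 100/0, flip to 0/100 once green is verified,
// and roll back by restoring the original weights
const listener = new aws.lb.Listener("model-listener", {
  loadBalancerArn: albArn,
  port: 80,
  protocol: "HTTP",
  defaultActions: [{
    type: "forward",
    forward: {
      targetGroups: [
        { arn: blue.arn, weight: 100 },
        { arn: green.arn, weight: 0 },
      ],
    },
  }],
});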
Team Training:
- Budget 2-4 weeks for engineers to learn Pulumi + Temporal
- Start with one person, then spread knowledge
- Document your AI infrastructure patterns for the team
Monitoring That Actually Matters
Standard infrastructure monitoring misses what matters for AI systems. With AI infrastructure spending projected to reach $223 billion by 2028, you need observability built for these workloads:
import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

const aiMetrics = new aws.cloudwatch.Dashboard("ai-observability", {
  dashboardBody: pulumi.jsonStringify({
    widgets: [{
      type: "metric",
      properties: {
        metrics: [
          // Traditional metrics
          ["AWS/ECS", "CPUUtilization"],
          ["AWS/ECS", "MemoryUtilization"],
          // AI-specific metrics that actually matter
          ["AI/VectorDB", "QueryLatency"],
          ["AI/LLM", "TokensPerSecond"],
          ["AI/LLM", "ResponseQuality"],
          ["AI/Workflow", "CompletionRate"],
          ["AI/Cost", "DollarPerInteraction"]
        ],
        title: "AI System Health"
      }
    }]
  })
});
// Alert on cost spikes
const costSpike = new aws.cloudwatch.MetricAlarm("ai-cost-spike", {
  comparisonOperator: "GreaterThanThreshold",
  namespace: "AI/Cost",
  metricName: "DollarPerInteraction",
  statistic: "Average",
  period: 300,
  evaluationPeriods: 1,
  threshold: 0.50, // Alert if cost per interaction > $0.50
  alarmDescription: "AI infrastructure costs spiking"
});
What Teams Are Seeing
People adopting AI-native infrastructure report significant improvements:
- 10-100x lower costs with serverless vector databases vs. provisioned capacity
- Self-hosted models can cost significantly less than API-based solutions for high-volume workloads
Companies using Temporal for AI workflows report significantly reduced debugging time and improved reliability for long-running AI processes.
Start here:
- Check your AI costs - how much are you spending compared to self-hosted options? (see the sketch after this list)
- Pick one AI workflow to rebuild as a test
- Try Pulumi with Pinecone - deploy a test vector database
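For the cost check, a rough sketch using the AWS Cost Explorer SDK, assuming your AI resources carry a cost-allocation tag (the workload=ai-agent tag is a placeholder):

import {
  CostExplorerClient,
  GetCostAndUsageCommand,
} from "@aws-sdk/client-cost-explorer";

// Sum last month's unblended cost for resources tagged workload=ai-agent.
// The tag key/value are placeholders; use your own cost-allocation tags.
async function aiSpendLastMonth(): Promise<string> {
  const end = new Date();
  const start = new Date(end);
  start.setMonth(start.getMonth() - 1);
  const day = (d: Date) => d.toISOString().slice(0, 10);

  const client = new CostExplorerClient({ region: "us-east-1" });
  const result = await client.send(new GetCostAndUsageCommand({
    TimePeriod: { Start: day(start), End: day(end) },
    Granularity: "MONTHLY",
    Metrics: ["UnblendedCost"],
    Filter: { Tags: { Key: "workload", Values: ["ai-agent"] } },
  }));

  return result.ResultsByTime?.[0]?.Total?.UnblendedCost?.Amount ?? "0";
}

aiSpendLastMonth().then((usd) => console.log(`AI spend last month: $${usd}`));

Compare that number against a back-of-envelope self-hosted estimate (GPU instance hours plus operational overhead) before committing to a migration.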
Next month:
- Move critical AI workflows to Temporal
- Set up cost monitoring and alerts
- Add AI-specific observability
Companies building reliable, cost-effective AI infrastructure have stopped relying on traditional IaC tools alone. They've switched to AI-native approaches that treat AI workloads as first-class infrastructure.
Your call: Keep fighting with Terraform and burning money, or use patterns that actually work.