Samson Tanimawo

Building the first Agentic SRE Platform. 100 AI agents that detect, investigate, and resolve incidents autonomously.

Location: Houston
Personal website: https://novaaiops.com

Pronouns: He/Him/His

DNS: The SRE's Most Underrated Skill
2 min read
The Silent Outage: Monitoring What You Can't See
2 min read
Why Every SRE Should Learn a Little Rust
2 min read
How We Built Our Own Incident Management System
2 min read
The Role of Platform Engineering in a Startup
2 min read
Building Dashboards People Actually Use
2 min read
SRE Maturity Models: Where Is Your Team?
2 min read
The Art of Writing a Good Post-Mortem
1 min read
Why We Stopped Using Log Aggregation for Everything
1 min read
Running Postgres at Scale: Lessons Learned
2 min read
How We Reduced Our Deployment Failure Rate to Under 2%
1 min read
The Hidden Cost of Flaky Tests
1 min read
Observability for Serverless: What's Different
2 min read
From DevOps to SRE: Making the Transition
2 min read
The SRE Interview: Questions I Actually Ask
1 min read
Incident Retrospectives Without Blame
1 min read
Alert Fatigue: The Silent Productivity Killer
1 min read
Why SLIs Matter More Than SLOs
1 min read
The PagerDuty Migration Playbook
1 min read
How We Cut Datadog Bills by 60% Without Losing Observability
1 min read
Building Your First Runbook: A Template That Actually Works
1 min read
AIOps vs Traditional Monitoring: What Actually Changed
1 min read
Eventual Consistency: Debugging the Hardest Class of Bugs
4 min read
The Economics of Self-Hosting vs. Managed Monitoring
4 min read
Building an Incident Response Playbook Library
4 min read
Kubernetes Network Policies: Lessons from Production Incidents
4 min read
Reducing Toil: The Google SRE Book Applied to Startups
4 min read
Incident Severity Levels: SEV-1 to SEV-5 Calibration
4 min read
Memory Leak Detection in Long-Running Services
3 min read
CI/CD Reliability: When Your Deploy Pipeline is Your SPOF
3 min read
Multi-Region Failover: Lessons from Running It Hot
3 min read
Disaster Recovery Drills That Actually Work
3 min read
Feature Flags as a Reliability Tool, Not Just an A/B Platform
3 min read
eBPF for SREs: Observability Without Agents
3 min read
Observability as Code: Managing Dashboards and Alerts with Terraform
2 min read
Service Level Objectives for Complex Microservices
3 min read
Building a Culture of Reliability: Beyond the SRE Handbook
3 min read
Debugging Kubernetes OOMKilled: A Step-by-Step Guide
3 min read
Deployment Frequency: How We Went From Weekly to 20x/Day
3 min read
Cost-Effective Observability: The 80/20 Stack for Startups
3 min read
Incident Communication: The Status Page That Builds Trust
3 min read
Load Testing in Production: How We Do It Safely
3 min read
Effective On-Call Rotations: Lessons From Building Fair Schedules
3 min read
GitOps for Infrastructure: How We Deploy With Zero SSH
2 min read
Prometheus at Scale: Surviving the Cardinality Cliff
2 min read
Database Reliability: The SRE Approach to Keeping Data Safe
3 min read
Container Security for SREs: The Practical Checklist
3 min read
The Incident Commander Role: Running Incidents Without Chaos
2 min read
Terraform at Scale: Lessons from Managing 500+ Resources
2 min read
Why Your Microservices Need Circuit Breakers (And How to Add Them)
2 min read
The On-Call Handoff That Prevents Dropped Incidents
2 min read
SLOs That Product Managers Actually Understand
2 min read
MTTR Optimization: The 7 Levers That Actually Move the Needle
3 min read
Service Maps: The Architectural Clarity Your Team Is Missing
2 min read
AI in Incident Response: Hype vs. Reality in 2024
3 min read
Monitoring Costs Are Out of Control — Here's How to Fix It
2 min read
Hiring SREs: What I Look For After Interviewing 100+ Candidates
3 min read
Log Management at Scale: How We Cut Costs 70% Without Losing Signal
2 min read