This is a past use case of troubleshooting KEDA, the Kubernetes-based event-driven autoscaler, during an upgrade in a non-production environment.
I was upgrading KEDA from v2.10 to v2.15 on infrastructure that was unfamiliar to me; it was my first hands-on experience with KEDA. I quickly understood KEDA's purpose, having worked mostly with the HPA before that.
If you're not aware of how the different pod scaling options compare, you can read my last post on Kubernetes pod scaling patterns.
My goal was to upgrade KEDA from v2.10 to v2.15 and ensure all existing ScaledObjects
continued to function properly. The environment had been running with KEDA v2.10 for months, and all configurations appeared to be working correctly.
Initial Error Analysis
After the upgrade, the KEDA operator logs showed concerning errors:
2024/11/04 17:57:49 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
{"level":"info","ts":"2024-11-04T17:57:49.765Z","logger":"setup","msg":"KEDA Version: 2.15.1"}
{"level":"info","ts":"2024-11-04T17:57:49.765Z","logger":"setup","msg":"Git Commit: 123543fnerfin4fcw3d23d23b"}
I1104 17:57:49.866460 1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
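These lines came straight from the operator pod's logs right after the upgrade. Assuming the default keda namespace and deployment name, they can be pulled with:
# Tail the upgraded operator's logs
kubectl logs -n keda deployment/keda-operator --tail=100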
The key concern was that the last line showed only attempting to acquire leader lease without the follow-up successfully acquired lease. That means the lease is held and this pod can't act as leader, which by itself is fine: it can simply mean another pod is currently the leader.
So I went ahead and dug into the leader election process.
Understanding KEDA's leader election process was crucial. A healthy startup sequence looks like:
I1106 21:42:09.498384 1 leaderelection.go:254] attempting to acquire leader lease keda/operator.keda.sh...
I1106 21:42:55.066863 1 leaderelection.go:268] successfully acquired lease keda/operator.keda.sh
2024-11-06T21:42:55Z INFO Starting EventSource {"controller": "scaledobject"}
2024-11-06T21:42:55Z INFO Starting Controller {"controller": "scaledobject"}
The sequence should include:
- Attempting to acquire lease
- Successfully acquiring lease
- Multiple controller initialization messages
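To see which pod currently holds the lease, you can read the Lease object the operator uses for election. A minimal check, assuming the keda namespace and lease name from the logs above:
# Prints the identity of the current leader (empty if nothing holds the lease)
kubectl get lease operator.keda.sh -n keda -o jsonpath='{.spec.holderIdentity}{"\n"}'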
Configuration Investigation
Examining the failing ScaledObject revealed the root cause:
kubectl get scaledobject webapp -n test-app -o yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: webapp
namespace: test-app
creationTimestamp: "2024-05-10T13:16:22Z" # Created months ago
spec:
scaleTargetRef:
name: webapp
minReplicaCount: 1
maxReplicaCount: 1
triggers: []
status:
conditions:
- message: ScaledObject doesn't have correct triggers specification
reason: ScaledObjectCheckFailed
status: "False"
type: Ready
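The same failing condition, plus any related events, is also visible with a plain describe, which is often quicker than reading the full YAML:
# Shows spec, status conditions, and recent events for the object
kubectl describe scaledobject webapp -n test-app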
The Real Issue Discovery
When I checked another KEDA operator pod, I found the root cause:
"error":"no triggers defined in the ScaledObject/ScaledJob"
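To find every ScaledObject the operator is rejecting, grepping the logs of all operator replicas helps. A sketch, assuming the app=keda-operator label used by the default install:
# --prefix tags each line with the pod it came from; --tail=-1 keeps all lines
kubectl logs -n keda -l app=keda-operator --prefix --tail=-1 | grep "no triggers defined"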
I spent more time digging into why KEDA complained about the empty triggers in v2.15 but not in v2.10; some release after v2.10 must have added this validation and its log message.
- KEDA v2.10 behavior: silently accepted empty triggers (triggers: []) and created a default HPA with 80% CPU utilization (you can inspect that fallback HPA as shown below)
- KEDA v2.15 behavior: validates triggers and throws an error for empty arrays
- Timeline: this ScaledObject had been running incorrectly for the past 6 months, but v2.10 hid the problem.
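To see that fallback in action, look at the HPA that KEDA manages for the ScaledObject; by default it is named keda-hpa-<scaledobject-name>. Under v2.10 this showed a plain 80% CPU target rather than any event-driven metric:
kubectl get hpa keda-hpa-webapp -n test-app -o yaml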
The Fix Implementation
I found the specific GitHub issue and PR:
The empty triggers validation was introduced in:
- GitHub Issue: #5520 - "KEDA doesn't validate empty array of triggers"
- Pull Request: #5524 - "fix: Validate empty array value of triggers in ScaledObject/ScaledJob creation"
- KEDA Version: Introduced in v2.14, refined in v2.15
- Merge Date: February 2024
Configuration Fix
The solution was to add proper triggers to the ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: webapp
namespace: test-app
spec:
scaleTargetRef:
name: webapp
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: http_requests_per_second
threshold: "100"
query: sum(rate(http_requests_total[1m]))
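After applying the corrected manifest, the ScaledObject should report Ready and the HPA KEDA manages for it should expose the Prometheus-backed external metric instead of a bare CPU target. A quick verification, assuming the manifest is saved as scaledobject.yaml:
kubectl apply -f scaledobject.yaml
kubectl get scaledobject webapp -n test-app
# The managed HPA should now reference the external metric
kubectl get hpa keda-hpa-webapp -n test-app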
Validation Commands
To identify similar issues across the cluster:
# Find ScaledObjects with empty triggers
kubectl get scaledobjects -A -o jsonpath='{range .items[?(@.spec.triggers[0] == null)]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
# Check ScaledObject status
kubectl get scaledobjects -A -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=='Ready')].status"
Key Takeaways
1. Silent Failures Are Dangerous
KEDA v2.10's behavior of silently creating default HPAs masked configuration errors for months. The application had been using basic CPU scaling instead of the intended event-driven scaling.
2. Validation Improvements
The upgrade didn't break anything - it revealed existing problems. KEDA v2.15's strict validation prevents:
- Misleading functionality (thinking event-driven scaling is active when it's not)
- Resource waste from inappropriate scaling decisions
- Configuration drift
3. Understanding Version Changes
Breaking changes often fix underlying issues. The validation was introduced because:
- Empty triggers create meaningless ScaledObjects
- Default CPU-based scaling defeats KEDA's event-driven purpose
- Silent failures violate "fail fast, fail loud" principles
4. Debugging Best Practices
When investigating KEDA issues:
- Check leader election sequence completion
- Examine ScaledObject status conditions
- Validate trigger configurations before upgrades
- Test in non-production environments first
5. Prevention Strategies
- Implement CI/CD validation for empty triggers (see the sketch after this list)
- Monitor ScaledObject health status
- Set up alerts for configuration failures
- Review configurations before major upgrades
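For the CI/CD check mentioned above, here is a minimal sketch that fails a pipeline when a ScaledObject manifest has no triggers. It assumes yq v4, one resource per file, and a hypothetical manifests/ directory; adjust to your repo layout:
#!/usr/bin/env bash
# Fail the build if any ScaledObject manifest defines an empty (or missing) triggers array
set -euo pipefail

for f in manifests/*.yaml; do
  if [ "$(yq '.kind' "$f")" = "ScaledObject" ] && [ "$(yq '.spec.triggers | length' "$f")" -eq 0 ]; then
    echo "ERROR: $f defines a ScaledObject with no triggers" >&2
    exit 1
  fi
done
echo "All ScaledObject manifests define at least one trigger."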
Conclusion
What initially appeared to be a breaking change in KEDA v2.15 was actually a long-overdue fix for silent configuration failures. The ScaledObject had been misconfigured since May 2024, but v2.10 had been hiding the problem by falling back to default CPU-based scaling.
This experience reinforces that sometimes "breaking" changes reveal existing problems rather than creating new ones. The improved validation in KEDA v2.15 ensures that event-driven autoscaling works as intended, making the system more reliable and preventing future silent failures.
Understanding the difference between a tool breaking and a tool revealing existing breakage is crucial for effective debugging and system maintenance.