This is a past use case of troubleshooting KEDA, the Kubernetes-based event-driven autoscaler, during an upgrade in a non-production environment.
I was upgrading KEDA from v2.10 to v2.15 on infrastructure that was unfamiliar to me; it was my first hands-on experience with KEDA. I quickly understood KEDA's purpose, having worked mostly with the HPA before that.
If you're not aware of how the different pod scaling options compare, you can read my last post on Kubernetes pod scaling patterns.
My goal was to upgrade KEDA from v2.10 to v2.15 and ensure all existing ScaledObjects
continued to function properly. The environment had been running with KEDA v2.10 for months, and all configurations appeared to be working correctly.
Initial Error Analysis
After the upgrade, the KEDA operator logs showed concerning errors:
2024/11/04 17:57:49 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
{"level":"info","ts":"2024-11-04T17:57:49.765Z","logger":"setup","msg":"KEDA Version: 2.15.1"}
{"level":"info","ts":"2024-11-04T17:57:49.765Z","logger":"setup","msg":"Git Commit: 123543fnerfin4fcw3d23d23b"}
I1104 17:57:49.866460 1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
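These lines came straight from the operator pod's logs right after the upgrade. Assuming the default keda namespace and deployment name, they can be pulled with:
# Tail the upgraded operator's logs
kubectl logs -n keda deployment/keda-operator --tail=100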
The key concern was that the last line showed only attempting to acquire leader lease without the follow-up successfully acquired lease. That means the lease is held and this pod can't act as leader, which by itself is fine: it can simply mean another pod is currently the leader.
So I went ahead and dug into the leader election process.
Understanding KEDA's leader election process was crucial. A healthy startup sequence looks like:
I1106 21:42:09.498384 1 leaderelection.go:254] attempting to acquire leader lease keda/operator.keda.sh...
I1106 21:42:55.066863 1 leaderelection.go:268] successfully acquired lease keda/operator.keda.sh
2024-11-06T21:42:55Z INFO Starting EventSource {"controller": "scaledobject"}
2024-11-06T21:42:55Z INFO Starting Controller {"controller": "scaledobject"}
The sequence should include:
- Attempting to acquire lease
- Successfully acquiring lease
- Multiple controller initialization messages
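To see which pod currently holds the lease, you can read the Lease object the operator uses for election. A minimal check, assuming the keda namespace and lease name from the logs above:
# Prints the identity of the current leader (empty if nothing holds the lease)
kubectl get lease operator.keda.sh -n keda -o jsonpath='{.spec.holderIdentity}{"\n"}'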
Configuration Investigation
Examining the failing ScaledObject revealed the root cause:
kubectl get scaledobject webapp -n test-app -o yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: webapp
namespace: test-app
creationTimestamp: "2024-05-10T13:16:22Z" # Created months ago
spec:
scaleTargetRef:
name: webapp
minReplicaCount: 1
maxReplicaCount: 1
triggers: []
status:
conditions:
- message: ScaledObject doesn't have correct triggers specification
reason: ScaledObjectCheckFailed
status: "False"
type: Ready
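The same failing condition, plus any related events, is also visible with a plain describe, which is often quicker than reading the full YAML:
# Shows spec, status conditions, and recent events for the object
kubectl describe scaledobject webapp -n test-app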
The Real Issue Discovery
When I checked another KEDA operator pod, I found the root cause:
"error":"no triggers defined in the ScaledObject/ScaledJob"
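To find every ScaledObject the operator is rejecting, grepping the logs of all operator replicas helps. A sketch, assuming the app=keda-operator label used by the default install:
# --prefix tags each line with the pod it came from; --tail=-1 keeps all lines
kubectl logs -n keda -l app=keda-operator --prefix --tail=-1 | grep "no triggers defined"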
I spent more time digging into why KEDA complained about the empty triggers in v2.15 but not in v2.10; some release after v2.10 must have added this validation and its log message.
- KEDA v2.10 behavior: silently accepted empty triggers (triggers: []) and created a default HPA with 80% CPU utilization (you can inspect that fallback HPA as shown below)
- KEDA v2.15 behavior: validates triggers and throws an error for empty arrays
- Timeline: this ScaledObject had been running incorrectly for the past 6 months, but v2.10 hid the problem.
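To see that fallback in action, look at the HPA that KEDA manages for the ScaledObject; by default it is named keda-hpa-<scaledobject-name>. Under v2.10 this showed a plain 80% CPU target rather than any event-driven metric:
kubectl get hpa keda-hpa-webapp -n test-app -o yaml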
The Fix Implementation
I found the specific GitHub issue and PR:
The empty triggers validation was introduced in:
- GitHub Issue: #5520 - "KEDA doesn't validate empty array of triggers"
- Pull Request: #5524 - "fix: Validate empty array value of triggers in ScaledObject/ScaledJob creation"
- KEDA Version: Introduced in v2.14, refined in v2.15
- Merge Date: February 2024
Configuration Fix
The solution was to add proper triggers to the ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: webapp
namespace: test-app
spec:
scaleTargetRef:
name: webapp
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: http_requests_per_second
threshold: "100"
query: sum(rate(http_requests_total[1m]))
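After applying the corrected manifest, the ScaledObject should report Ready and the HPA KEDA manages for it should expose the Prometheus-backed external metric instead of a bare CPU target. A quick verification, assuming the manifest is saved as scaledobject.yaml:
kubectl apply -f scaledobject.yaml
kubectl get scaledobject webapp -n test-app
# The managed HPA should now reference the external metric
kubectl get hpa keda-hpa-webapp -n test-app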
Validation Commands
To identify similar issues across the cluster:
# Find ScaledObjects with empty triggers
kubectl get scaledobjects -A -o jsonpath='{range .items[?(@.spec.triggers[0] == null)]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
# Check ScaledObject status
kubectl get scaledobjects -A -o custom-columns="NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.conditions[?(@.type=='Ready')].status"
Key Takeaways
1. Silent Failures Are Dangerous
KEDA v2.10's behavior of silently creating default HPAs masked configuration errors for months. The application had been using basic CPU scaling instead of the intended event-driven scaling.
2. Validation Improvements
The upgrade didn't break anything - it revealed existing problems. KEDA v2.15's strict validation prevents:
- Misleading functionality (thinking event-driven scaling is active when it's not)
- Resource waste from inappropriate scaling decisions
- Configuration drift
3. Understanding Version Changes
Breaking changes often fix underlying issues. The validation was introduced because:
- Empty triggers create meaningless ScaledObjects
- Default CPU-based scaling defeats KEDA's event-driven purpose
- Silent failures violate "fail fast, fail loud" principles
4. Debugging Best Practices
When investigating KEDA issues:
- Check leader election sequence completion
- Examine ScaledObject status conditions
- Validate trigger configurations before upgrades
- Test in non-production environments first
5. Prevention Strategies
- Implement CI/CD validation for empty triggers (see the sketch after this list)
- Monitor ScaledObject health status
- Set up alerts for configuration failures
- Review configurations before major upgrades
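For the CI/CD check mentioned above, here is a minimal sketch that fails a pipeline when a ScaledObject manifest has no triggers. It assumes yq v4, one resource per file, and a hypothetical manifests/ directory; adjust to your repo layout:
#!/usr/bin/env bash
# Fail the build if any ScaledObject manifest defines an empty (or missing) triggers array
set -euo pipefail

for f in manifests/*.yaml; do
  if [ "$(yq '.kind' "$f")" = "ScaledObject" ] && [ "$(yq '.spec.triggers | length' "$f")" -eq 0 ]; then
    echo "ERROR: $f defines a ScaledObject with no triggers" >&2
    exit 1
  fi
done
echo "All ScaledObject manifests define at least one trigger."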
Conclusion
What initially appeared to be a breaking change in KEDA v2.15 was actually a long-overdue fix for silent configuration failures. The ScaledObject had been misconfigured since May 2024, but v2.10 had been hiding the problem by falling back to default CPU-based scaling.
This experience reinforces that sometimes "breaking" changes reveal existing problems rather than creating new ones. The improved validation in KEDA v2.15 ensures that event-driven autoscaling works as intended, making the system more reliable and preventing future silent failures.
Understanding the difference between a tool breaking and a tool revealing existing breakage is crucial for effective debugging and system maintenance.