feat: add per-node untaint grace period#286
Open
FocalChord wants to merge 1 commit into
Open
Conversation
|
Thank you for your submission! Like many open source projects, we ask that you sign our CLA (Contributor License Agreement) before we can accept your contribution. Already signed the CLA? To re-check, try refreshing the page. |
ed756da to
b43d33b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
This PR adds a new optional config field
untaint_grace_periodthat prevents a node from being re-tainted within a configurable duration after it was untainted. When Escalator untaints a node, a timestamp is recorded in an in-memory map onNodeGroupState. WhentaintInstancesselects taint candidates, it skips any node whose untaint timestamp is within the grace period.Defaults to no grace period when the field is omitted, empty, or set to "0s". Existing deployments are unaffected.
Why
When Escalator untaints a node, the Kubernetes scheduler can place new pods on it within seconds. If utilisation dips on the next tick, Escalator can re-taint that same node. Those freshly scheduled pods are now running on a NoSchedule node that will not receive further work but cannot be terminated until the pods complete or
hard_delete_grace_periodexpires.We observed this on one of our external clusters where a node group was oscillating at 28 taint/untaint cycles per hour. One node was untainted at 22:37, received 6 pipeline pods during the untainted window, and was re-tainted at 22:43. Those pods stranded the node at single-digit useful utilisation for hours.
A grace period of 5 to 10 minutes makes recently untainted nodes ineligible for re-tainting, giving the scheduler time to fill them and giving Escalator a more stable signal before deciding whether to remove them again.
Design considerations
In-memory map vs Kubernetes annotations. We considered using node annotations to store the untaint timestamp. This would survive Escalator restarts but requires an extra API write on every untaint and an extra annotation read per taint candidate per tick. For clusters with 80+ nodes, that's meaningful API load. The in-memory approach is a hashmap lookup with zero API calls.
NodeGroupStatealready carries in-memory state between scans in the same pattern:taintTracker,forceTaintTracker,scaleDelta,lastScaleOut,scaleUpLock,cpuCapacity,memCapacity. This is another instance of that, not a new concept.The tradeoff is that the map is lost on restart. If Escalator restarts, every node is immediately eligible for tainting. The worst case is a node untainted shortly before the restart could be re-tainted on the first tick after restart. That's a brief window of reduced protection, not a correctness problem.
Map growth. A cleanup step runs at the start of each
scaleNodeGroupcall and removes entries for nodes that no longer exist. The map is bounded by the number of nodes in the node group.Testing
Six test cases covering: recently untainted node is not re-tainted, expired grace period allows re-tainting, empty/0s grace period preserves existing behavior, stale tracker entries are cleaned up, wet mode records timestamps, dry mode records timestamps.
Rovo Dev code review: Rovo Dev couldn't review this pull request
Upgrade to Rovo Dev Standard to continue using code review.