Problem:
Hello, we've noticed some unexpected behaviour in some of our clusters when our workers should scale but not enough nodes are available to place all of the desired worker pods. If the autoscaler wanted to scale beyond what the cluster could accommodate, the worker pods are never scaled in, even after the queue is completely drained.
For example, let's say that we have 10 nodes available and 3 worker types, each corresponding to its own queue. The workers take different amounts of time to complete their jobs. We allow any individual worker deployment to compete for resources, so the queue with the most work gets the most workers. Also of note: different clusters (dev vs prod) have different instance types and ASG max counts due to differences in expected usage, scaling patterns, cost controls, etc.
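To make the setup concrete, here is a rough sketch (in Go, since the autoscaler itself is Go) of the queue-to-deployment mapping described above. The names, fields, and numbers are ours for illustration only and are not the autoscaler's actual configuration schema:

package main

import "fmt"

// WorkerPool is a hypothetical stand-in for one queue/deployment pairing.
type WorkerPool struct {
	Queue       string // queue the deployment watches
	Deployment  string // worker deployment the autoscaler scales
	MaxReplicas int    // differs between dev and prod (ASG size, cost controls)
}

func main() {
	pools := []WorkerPool{
		{Queue: "queue-a", Deployment: "my-worker-a", MaxReplicas: 8},
		{Queue: "queue-b", Deployment: "my-worker-b", MaxReplicas: 10},
		{Queue: "queue-c", Deployment: "my-worker-c", MaxReplicas: 20},
	}
	for _, p := range pools {
		fmt.Printf("%s -> %s (max %d replicas)\n", p.Queue, p.Deployment, p.MaxReplicas)
	}
}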
The following happens:
Queue A begins to fill, prompting Worker A to scale.
Worker A1 completes jobA1 and adds a job to Queue B prompting Worker B to scale.
Worker B1 completes jobB1 and adds a job to Queue C prompting Worker C to scale.
Worker C1 completes jobC1.
This process continues until the queues are saturated and all nodes are filled with workers. Workers A and B are unable to scale to their maximum / desired pod counts since there are no available nodes to accommodate them. Worker C pods require fewer resources and reach their maximum pod count sooner since they are quicker to start.
As the queues begin to drain and all queues reach 0 jobs, Workers C pods begin to scale down. As resources become available Workers A and B scale up and fill in the available resources, but are still unable to reach desired counts as they compete for resources.
After all queues are drained, Workers A and B never scale in and log messages similar to the following appear:
autoscaler.go:162] default/my-worker-a, starting auto-scale decision
autoscaler.go:165] default/my-worker-a is currently unstable, retry later, not enough workers (ready: 5 / wanted: 8)
autoscaler.go:162] default/my-worker-b, starting auto-scale decision
autoscaler.go:165] default/my-worker-b is currently unstable, retry later, not enough workers (ready: 8 / wanted: 10)
This continues indefinitely until manual intervention occurs. In our case, manually scaling the ASG to accommodate the desired pods allows the autoscaler to complete the desired actions, and it immediately begins to scale in the Worker A and B deployments.
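For reference, the behaviour looks consistent with a readiness guard along these lines. This is only a minimal sketch of what we assume autoscaler.go is doing around the lines shown in the log; the types, field names, and function are ours, not the actual source:

package main

import "log"

// Worker is a hypothetical stand-in for the autoscaler's view of one deployment.
type Worker struct {
	Namespace, Name               string
	ReadyReplicas, WantedReplicas int
	QueueDepth                    int
}

// decide sketches the stability check we believe gates every scale decision.
// If the deployment can never reach the previously wanted replica count
// (because the nodes simply don't exist), the whole decision is skipped,
// including scale-in, which matches what we see once the queues are at 0.
func decide(w *Worker) {
	log.Printf("%s/%s, starting auto-scale decision", w.Namespace, w.Name)

	if w.ReadyReplicas < w.WantedReplicas {
		log.Printf("%s/%s is currently unstable, retry later, not enough workers (ready: %d / wanted: %d)",
			w.Namespace, w.Name, w.ReadyReplicas, w.WantedReplicas)
		return // scale-in never runs, even with QueueDepth == 0
	}

	// ... normal path: derive a new wanted count from QueueDepth and apply it ...
}

func main() {
	// Reproduces the my-worker-a log lines above: wanted can never be satisfied.
	decide(&Worker{Namespace: "default", Name: "my-worker-a", ReadyReplicas: 5, WantedReplicas: 8, QueueDepth: 0})
}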
Expectation:
When the watched queues empty to 0 jobs, the autoscaler should be allowed to scale the deployments in normally even if the wanted count was never met. This should not be dangerous once the queue has been empty for multiple ticks.
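A minimal sketch of the behaviour we'd expect, building on the hypothetical types above and assuming a per-worker counter of consecutive empty-queue ticks (the threshold and names are purely illustrative):

package main

import "log"

// Same hypothetical state as before, plus a counter of consecutive ticks
// during which the watched queue was empty.
type Worker struct {
	Namespace, Name               string
	ReadyReplicas, WantedReplicas int
	QueueDepth                    int
	EmptyTicks                    int
}

const emptyTicksBeforeScaleIn = 3 // assumed threshold, purely illustrative

func decide(w *Worker) {
	if w.QueueDepth == 0 {
		w.EmptyTicks++
	} else {
		w.EmptyTicks = 0
	}

	// Only treat "ready < wanted" as unstable while there may still be work.
	// Once the queue has been empty for several ticks, fall through and let
	// the normal scale-in path run even though "wanted" was never reached.
	if w.ReadyReplicas < w.WantedReplicas && w.EmptyTicks < emptyTicksBeforeScaleIn {
		log.Printf("%s/%s is currently unstable, retry later, not enough workers (ready: %d / wanted: %d)",
			w.Namespace, w.Name, w.ReadyReplicas, w.WantedReplicas)
		return
	}

	if w.QueueDepth == 0 {
		log.Printf("%s/%s queue drained, scaling in towards min replicas", w.Namespace, w.Name)
		// ... apply the lower wanted count ...
	}
}

func main() {
	w := &Worker{Namespace: "default", Name: "my-worker-a", ReadyReplicas: 5, WantedReplicas: 8}
	for tick := 0; tick < 4; tick++ { // queue stays at 0 across ticks
		decide(w)
	}
}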