[bug] autoscaler tries to scale up when no jobs remain in queue #1

Open

lhriley opened this issue Mar 23, 2020 · 0 comments

lhriley commented Mar 23, 2020

Problem:

Hello, we've noticed some unexpected behaviour in some of our clusters when our workers should scale up but not enough nodes are available to place all of the desired worker pods. If the autoscaler wanted to scale beyond what the cluster can accommodate, the pods are never retired, even after the queue is completely drained.

For example, let's say that we have 10 nodes available and 3 worker types, each corresponding to a different queue. All workers take different amounts of time to complete their jobs. We allow the individual worker deployments to compete for resources, so the queue with the most work is allowed the most workers. Also of note: different clusters (dev vs prod) have different instance types / ASG max counts due to differences in expected usage, scaling patterns, cost controls, etc.

The following happens:

  • Queue A begins to fill, prompting Worker A to scale.
  • Worker A1 completes jobA1 and adds a job to Queue B, prompting Worker B to scale.
  • Worker B1 completes jobB1 and adds a job to Queue C, prompting Worker C to scale.
  • Worker C1 completes jobC1.

This process continues until the queues fill and all nodes are saturated with workers. Workers A and B are unable to scale to their maximum / desired pod counts since there are no available nodes to accommodate them. Worker C pods require fewer resources and reach their maximum pod count faster since they are quicker to start.

As the queues begin to drain and all queues reach 0 jobs, the Worker C pods begin to scale down. As resources become available, Workers A and B scale up and fill the available resources, but they are still unable to reach their desired counts as they compete with each other.

After all queues are drained, Workers A and B never scale in, and log messages similar to the following appear:

autoscaler.go:162] default/my-worker-a, starting auto-scale decision
autoscaler.go:165] default/my-worker-a is currently unstable, retry later, not enough workers (ready: 5 / wanted: 8)
autoscaler.go:162] default/my-worker-b, starting auto-scale decision
autoscaler.go:165] default/my-worker-b is currently unstable, retry later, not enough workers (ready: 8 / wanted: 10)

This continues indefinitely until manual intervention occurs. In our case, manually scaling the ASG to accommodate the desired pods allows the autoscaler to complete its desired actions, at which point it immediately begins to scale in the Worker A and B deployments.
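For illustration, here is a minimal Go sketch of the kind of readiness gate that could produce the log lines above; the workerState struct, its fields, and the decide function are assumptions for illustration only, not the project's actual code. The point is that a gate like this blocks every scaling decision, including scale-in, whenever ready < wanted:

package main

import "log"

// workerState is a hypothetical snapshot of one worker deployment.
type workerState struct {
	name       string
	readyPods  int32
	wantedPods int32
	queuedJobs int64
}

// decide returns true when the autoscaler is willing to make a scaling
// decision. Because the check only compares ready against wanted, a
// deployment that can never reach its wanted count (no nodes left) is
// stuck here forever, even with an empty queue.
func decide(w workerState) bool {
	if w.readyPods < w.wantedPods {
		log.Printf("%s is currently unstable, retry later, not enough workers (ready: %d / wanted: %d)",
			w.name, w.readyPods, w.wantedPods)
		return false
	}
	return true
}

func main() {
	// Worker A from the example above: 5 ready, 8 wanted, queue already empty.
	_ = decide(workerState{name: "default/my-worker-a", readyPods: 5, wantedPods: 8, queuedJobs: 0})
}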

Expectation:

When the watched queues drain to 0 jobs, the autoscaler should be allowed to scale in the deployments normally, even if the wanted count was never met. This should not be dangerous when the queue has been empty for multiple ticks.
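A minimal sketch of that expectation, assuming a hypothetical emptyQueueTicks counter and threshold (neither of which comes from the project), could look like this: once the queue has been at 0 jobs for several consecutive ticks, scale-in is permitted even though ready < wanted.

package main

import "log"

// emptyTicksBeforeScaleIn is an assumed safety threshold: the queue must be
// empty for this many consecutive ticks before the readiness gate is bypassed.
const emptyTicksBeforeScaleIn = 3

type workerState struct {
	name            string
	readyPods       int32
	wantedPods      int32
	queuedJobs      int64
	emptyQueueTicks int // consecutive ticks the queue has been at 0 jobs
}

// shouldScaleIn keeps the existing "wait for stability" behaviour, but lets
// a deployment that never reached its wanted count scale in once the queue
// has clearly drained.
func shouldScaleIn(w workerState) bool {
	if w.queuedJobs == 0 && w.emptyQueueTicks >= emptyTicksBeforeScaleIn {
		return true
	}
	return w.readyPods >= w.wantedPods
}

func main() {
	w := workerState{name: "default/my-worker-a", readyPods: 5, wantedPods: 8, queuedJobs: 0, emptyQueueTicks: 4}
	if shouldScaleIn(w) {
		log.Printf("%s: queue empty for %d ticks, scaling in despite ready %d < wanted %d",
			w.name, w.emptyQueueTicks, w.readyPods, w.wantedPods)
	}
}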
