[bug] autoscaler tries to scale up when no jobs remain in queue #1

Open

lhriley opened this issue Mar 23, 2020 · 0 comments

lhriley commented Mar 23, 2020

Problem:

Hello, we've noticed some unexpected behaviour in some of our clusters when our workers should scale up but not enough nodes are available to place all of the desired worker pods. If the autoscaler wanted to scale beyond what the cluster can accommodate, the pods are never retired, even after the queue is completely drained.

For example, let's say that we have 10 nodes available and 3 worker types, each corresponding to a different queue. All workers take different amounts of time to complete their jobs. We allow the individual worker deployments to compete for resources, so the queue with the most work is allowed the most workers. Also of note: different clusters (dev vs prod) have different instance types / ASG max counts due to differences in expected usage, scaling patterns, cost controls, etc.

The following happens:

  • Queue A begins to fill, prompting Worker A to scale.
  • Worker A1 completes jobA1 and adds a job to Queue B, prompting Worker B to scale.
  • Worker B1 completes jobB1 and adds a job to Queue C, prompting Worker C to scale.
  • Worker C1 completes jobC1.

This process continues until the queues fill and all nodes are saturated with workers. Workers A and B are unable to scale to their maximum / desired pod counts since there are no available nodes to accommodate them. Worker C pods require fewer resources and reach their maximum pod count faster since they are quicker to start.

As the queues begin to drain and all queues reach 0 jobs, the Worker C pods begin to scale down. As resources become available, Workers A and B scale up and fill the available resources, but they are still unable to reach their desired counts as they compete with each other.

After all queues are drained, Workers A and B never scale in, and log messages similar to the following appear:

autoscaler.go:162] default/my-worker-a, starting auto-scale decision
autoscaler.go:165] default/my-worker-a is currently unstable, retry later, not enough workers (ready: 5 / wanted: 8)
autoscaler.go:162] default/my-worker-b, starting auto-scale decision
autoscaler.go:165] default/my-worker-b is currently unstable, retry later, not enough workers (ready: 8 / wanted: 10)

This continues indefinitely until manual intervention occurs. In our case, manually scaling the ASG to accommodate the desired pods allows the autoscaler to complete its desired actions, at which point it immediately begins to scale in the Worker A and B deployments.
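For illustration, here is a minimal Go sketch of the kind of readiness gate that could produce the log lines above; the workerState struct, its fields, and the decide function are assumptions for illustration only, not the project's actual code. The point is that a gate like this blocks every scaling decision, including scale-in, whenever ready < wanted:

package main

import "log"

// workerState is a hypothetical snapshot of one worker deployment.
type workerState struct {
	name       string
	readyPods  int32
	wantedPods int32
	queuedJobs int64
}

// decide returns true when the autoscaler is willing to make a scaling
// decision. Because the check only compares ready against wanted, a
// deployment that can never reach its wanted count (no nodes left) is
// stuck here forever, even with an empty queue.
func decide(w workerState) bool {
	if w.readyPods < w.wantedPods {
		log.Printf("%s is currently unstable, retry later, not enough workers (ready: %d / wanted: %d)",
			w.name, w.readyPods, w.wantedPods)
		return false
	}
	return true
}

func main() {
	// Worker A from the example above: 5 ready, 8 wanted, queue already empty.
	_ = decide(workerState{name: "default/my-worker-a", readyPods: 5, wantedPods: 8, queuedJobs: 0})
}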

Expectation:

When the watched queues drain to 0 jobs, the autoscaler should be allowed to scale in the deployments normally, even if the wanted count was never met. This should not be dangerous when the queue has been empty for multiple ticks.
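A minimal sketch of that expectation, assuming a hypothetical emptyQueueTicks counter and threshold (neither of which comes from the project), could look like this: once the queue has been at 0 jobs for several consecutive ticks, scale-in is permitted even though ready < wanted.

package main

import "log"

// emptyTicksBeforeScaleIn is an assumed safety threshold: the queue must be
// empty for this many consecutive ticks before the readiness gate is bypassed.
const emptyTicksBeforeScaleIn = 3

type workerState struct {
	name            string
	readyPods       int32
	wantedPods      int32
	queuedJobs      int64
	emptyQueueTicks int // consecutive ticks the queue has been at 0 jobs
}

// shouldScaleIn keeps the existing "wait for stability" behaviour, but lets
// a deployment that never reached its wanted count scale in once the queue
// has clearly drained.
func shouldScaleIn(w workerState) bool {
	if w.queuedJobs == 0 && w.emptyQueueTicks >= emptyTicksBeforeScaleIn {
		return true
	}
	return w.readyPods >= w.wantedPods
}

func main() {
	w := workerState{name: "default/my-worker-a", readyPods: 5, wantedPods: 8, queuedJobs: 0, emptyQueueTicks: 4}
	if shouldScaleIn(w) {
		log.Printf("%s: queue empty for %d ticks, scaling in despite ready %d < wanted %d",
			w.name, w.emptyQueueTicks, w.readyPods, w.wantedPods)
	}
}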
