
Taskmanager stability issues #67

Merged
merged 4 commits into aerius:main on Nov 23, 2023

Conversation

BertScholten (Member)

During some local testing, I ended up with a system that wasn't calculating anymore. A few things were changed:

  • Opening a connection was moved outside of a try-with-resources block. As the exceptions can also occur within this opening method, it was moved back in.
  • Avoid the delay in the handleShutdownSignal method of the handlers: the delay caused the old (unusable) AMQP connection thread to stay up for far too long, and as each handler had to wait for the others, the state of the system would also be updated late. (Not 100% sure this was an issue; it was just noticeable, and I don't really see a point in the delay at this point.)
  • Sturdier WorkerMonitor: locally I ended up with a system where a worker would pick up a calculation (as it was already on the worker queue), but the taskmanager wouldn't send out any tasks to the workers. It turned out that, when rebooting RabbitMQ, occasionally the WorkerMonitor wouldn't kick in again (no queues left on the event exchange). Not 100% sure this is what happened, but I think the start() method that was called in handleShutdownSignal would also throw an exception, and no further attempt to start it would be made.

The `factory.getConnection()` call could also result in a `ShutdownSignalException`; doing it this way ensures the loop is not broken when that happens.
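A minimal sketch of the pattern, with illustrative names rather than the actual taskmanager code (the RabbitMQ Java client's `newConnection()` stands in here for the project's `factory.getConnection()` wrapper): because the connection is opened inside the try block, a `ShutdownSignalException` thrown while connecting lands in the same catch as one thrown while consuming, so the retry loop keeps going.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.ShutdownSignalException;

public final class ConnectRetryLoop {

  private static final long RETRY_DELAY_MS = 5_000L;

  public static void run(final ConnectionFactory factory) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      // The connection is opened *inside* the try block, so exceptions thrown
      // while connecting are handled below instead of escaping the loop.
      try (Connection connection = factory.newConnection();
          Channel channel = connection.createChannel()) {
        consume(channel); // placeholder for the actual message handling
      } catch (final ShutdownSignalException | IOException | TimeoutException e) {
        // Connection failed or dropped: back off briefly and retry.
        Thread.sleep(RETRY_DELAY_MS);
      }
    }
  }

  private static void consume(final Channel channel) throws IOException {
    // ... declare queues and consume until the connection drops ...
  }
}
```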
This caused the taskmanager to stay in an 'old' state, as it would block all the other shutdown listeners until all the handlers were done being notified. As there are quite a few handlers, this caused the disrupted connection thread to linger far longer than needed.
It also avoids a sleep when we're shutting down anyway. Not entirely sure whether that caused issues, but there's no point in sleeping at that stage.
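For illustration, a hedged sketch (hypothetical class names, not the actual handler code) of a consumer whose `handleShutdownSignal` returns immediately. Any reconnect delay belongs in the restart logic, since sleeping in this callback blocks the connection thread that still has to notify every other shutdown listener.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.ShutdownSignalException;

class QuickShutdownConsumer extends DefaultConsumer {

  private final Runnable restartCallback;

  QuickShutdownConsumer(final Channel channel, final Runnable restartCallback) {
    super(channel);
    this.restartCallback = restartCallback;
  }

  @Override
  public void handleShutdownSignal(final String consumerTag, final ShutdownSignalException sig) {
    // No Thread.sleep here: return quickly so the next listener is notified
    // and the dead connection thread can terminate. Any reconnect delay
    // belongs in the (re)start loop, not in this callback.
    restartCallback.run();
  }
}
```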
As it was, an exception could occur in the start method, and that would cause the monitor to stop working when RabbitMQ was restarted. As this monitor tracks worker utilization in order to scale dynamically, scaling would grind to a halt because the utilization data would never be updated again.
By making the monitor sturdier (catching some exceptions and retrying), it should be better at self-healing once the connection with RabbitMQ is restored.
The changes are similar to how RabbitMQMessageHandler is implemented.
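A sketch of the self-healing idea, assuming a simple scheduled retry; class and method names are hypothetical, not the actual WorkerMonitor or RabbitMQMessageHandler code. The point is that a failure in start() is caught and a new attempt is scheduled, so a broker restart can no longer leave the monitor permanently stopped.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

class ResilientMonitor {

  private static final long RETRY_DELAY_MS = 5_000L;

  private final ConnectionFactory factory;
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  ResilientMonitor(final ConnectionFactory factory) {
    this.factory = factory;
  }

  void start() {
    try {
      final Connection connection = factory.newConnection();
      final Channel channel = connection.createChannel();
      // Re-declare the queues/bindings on the event exchange and resume
      // consuming worker utilization events on `channel` here.
      connection.addShutdownListener(cause -> scheduleRestart());
    } catch (final IOException | TimeoutException e) {
      // Broker unreachable or start failed: retry later instead of giving up,
      // so the monitor heals itself once RabbitMQ is back.
      scheduleRestart();
    }
  }

  private void scheduleRestart() {
    scheduler.schedule(this::start, RETRY_DELAY_MS, TimeUnit.MILLISECONDS);
  }
}
```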
@Hilbrand Hilbrand (Member) left a comment


LGTM. One possible improvement.

@BertScholten BertScholten merged commit 6d8a037 into aerius:main Nov 23, 2023
1 check passed
@BertScholten BertScholten deleted the taskmanager_stability branch November 23, 2023 17:08