
Taskmanager stability issues #67

Merged
merged 4 commits into aerius:main on Nov 23, 2023

Conversation

BertScholten (Member)

During some local testing, I ended up with a system that wasn't calculating anymore. A few things were changed:

  • Opening a connection was moved outside of a try-with-resources block. As the exceptions can also occur within this opening method, it was moved back in.
  • Avoid the delay in the handleShutdownSignal method of the handlers: the delay caused the old (unusable) AMQP connection thread to stay up for far too long, and as each handler had to wait for the others, the state of the system would also be updated late. (Not 100% sure this was an issue; it was just noticeable, and I don't really see a point in the delay at this point.)
  • Sturdier WorkerMonitor: locally I ended up with a system where a worker would pick up a calculation (as it was already on the worker queue), but the taskmanager wouldn't send out any tasks to the workers. It turned out that, when rebooting RabbitMQ, occasionally the WorkerMonitor wouldn't kick in again (no queues left on the event exchange). Not 100% sure this is what happened, but I think the start() method that was called in handleShutdownSignal would also throw an exception, and no further attempt to start it would be made.

The `factory.getConnection()` call could also result in a `ShutdownSignalException`; doing it this way ensures the loop is not broken when that happens.
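A minimal sketch of the pattern, with illustrative names rather than the actual taskmanager code (the RabbitMQ Java client's `newConnection()` stands in here for the project's `factory.getConnection()` wrapper): because the connection is opened inside the try block, a `ShutdownSignalException` thrown while connecting lands in the same catch as one thrown while consuming, so the retry loop keeps going.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.ShutdownSignalException;

public final class ConnectRetryLoop {

  private static final long RETRY_DELAY_MS = 5_000L;

  public static void run(final ConnectionFactory factory) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      // The connection is opened *inside* the try block, so exceptions thrown
      // while connecting are handled below instead of escaping the loop.
      try (Connection connection = factory.newConnection();
          Channel channel = connection.createChannel()) {
        consume(channel); // placeholder for the actual message handling
      } catch (final ShutdownSignalException | IOException | TimeoutException e) {
        // Connection failed or dropped: back off briefly and retry.
        Thread.sleep(RETRY_DELAY_MS);
      }
    }
  }

  private static void consume(final Channel channel) throws IOException {
    // ... declare queues and consume until the connection drops ...
  }
}
```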
This caused the taskmanager to stay in an 'old' state, as it would block all the other shutdown listeners until all the handlers were done being notified. As there are quite a few handlers, this caused the disrupted connection thread to linger far longer than needed.
It also avoids a sleep when we're shutting down anyway. Not entirely sure whether that caused issues, but there's no point in sleeping at that stage.
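For illustration, a hedged sketch (hypothetical class names, not the actual handler code) of a consumer whose `handleShutdownSignal` returns immediately. Any reconnect delay belongs in the restart logic, since sleeping in this callback blocks the connection thread that still has to notify every other shutdown listener.

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.ShutdownSignalException;

class QuickShutdownConsumer extends DefaultConsumer {

  private final Runnable restartCallback;

  QuickShutdownConsumer(final Channel channel, final Runnable restartCallback) {
    super(channel);
    this.restartCallback = restartCallback;
  }

  @Override
  public void handleShutdownSignal(final String consumerTag, final ShutdownSignalException sig) {
    // No Thread.sleep here: return quickly so the next listener is notified
    // and the dead connection thread can terminate. Any reconnect delay
    // belongs in the (re)start loop, not in this callback.
    restartCallback.run();
  }
}
```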
As it was, an exception could occur in the start method, and that would cause the monitor to stop working when RabbitMQ was restarted. As this monitor tracks worker utilization in order to scale dynamically, scaling would grind to a halt because the utilization data would never be updated again.
By making the monitor sturdier (catching some exceptions and retrying), it should be better at self-healing once the connection with RabbitMQ is restored.
The changes are similar to how RabbitMQMessageHandler is implemented.
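A sketch of the self-healing idea, assuming a simple scheduled retry; class and method names are hypothetical, not the actual WorkerMonitor or RabbitMQMessageHandler code. The point is that a failure in start() is caught and a new attempt is scheduled, so a broker restart can no longer leave the monitor permanently stopped.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

class ResilientMonitor {

  private static final long RETRY_DELAY_MS = 5_000L;

  private final ConnectionFactory factory;
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  ResilientMonitor(final ConnectionFactory factory) {
    this.factory = factory;
  }

  void start() {
    try {
      final Connection connection = factory.newConnection();
      final Channel channel = connection.createChannel();
      // Re-declare the queues/bindings on the event exchange and resume
      // consuming worker utilization events on `channel` here.
      connection.addShutdownListener(cause -> scheduleRestart());
    } catch (final IOException | TimeoutException e) {
      // Broker unreachable or start failed: retry later instead of giving up,
      // so the monitor heals itself once RabbitMQ is back.
      scheduleRestart();
    }
  }

  private void scheduleRestart() {
    scheduler.schedule(this::start, RETRY_DELAY_MS, TimeUnit.MILLISECONDS);
  }
}
```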
@Hilbrand Hilbrand (Member) left a comment


LGTM. One possible improvement.

@BertScholten BertScholten merged commit 6d8a037 into aerius:main Nov 23, 2023
1 check passed
@BertScholten BertScholten deleted the taskmanager_stability branch November 23, 2023 17:08