Reject task on worker still starting up #21921
Conversation
Force-pushed from 7c05184 to 5e4a87c.
`StartupStatus.startupComplete` was set in production code, but not in tests.
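For context, a minimal sketch of what that fixup amounts to, assuming `StartupStatus` exposes a completion setter and a query method (the exact names in Trino may differ):

```java
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertTrue;

class StartupStatusTest
{
    @Test
    void markStartupComplete()
    {
        // Production marks startup complete once initialization finishes;
        // tests exercising the task endpoints must do the same, otherwise
        // the new rejection path would fire for every request in tests.
        StartupStatus startupStatus = new StartupStatus();
        startupStatus.startupComplete(); // assumed setter matching the field mentioned above
        assertTrue(startupStatus.isStartupComplete()); // assumed query method
    }
}
```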
Force-pushed from 5e4a87c to 9be0554.
Draft because of the TMP commit.
Stress test OK: https://github.com/trinodb/trino/actions/runs/9034341743/job/24827391786?pr=21921 (the actual numbers are probably a lie -- #20113)
Force-pushed from 9be0554 to 8fbbd54.
cc @sajjoseph
I like the lighter version. Let me try this out with the 447 release.
I tried out the new PR and so far all signs are positive.
The worker started processing fresh queries coming in. There were a few exceptions in the worker node's server log file. I don't think these have anything to do with the current PR; I just included them here to see if any zombie task status checks are still causing these errors in the log files. These entries are showing up even after the worker node was up for 10+ minutes.
They look wrong, cc @wendigo. I think we can assume this is not related to this PR, as it only changes behavior during the startup phase. We should have an issue about them.
These errors are …
Can it mean a bad thing, or should we just ignore these?
@findepi I think that we should ignore them.
When a worker node is restarted after a crash, the coordinator may still be unaware of the situation and may attempt to schedule tasks on it. Ideally the coordinator should not schedule tasks on a worker that is not ready, but in pipelined execution there is currently no way to move a task. Accepting a request too early will likely lead to some failure and an HTTP 500 (INTERNAL_SERVER_ERROR) response, which the coordinator won't retry. Send 503 (SERVICE_UNAVAILABLE) instead, so that the request is retried.
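A minimal sketch of the idea, using illustrative names (`TaskResource`, `checkStartupComplete`, and the `StartupStatus` methods are assumptions, not the exact Trino code): the worker checks its startup status before accepting work and answers 503 instead of letting the request proceed and fail.

```java
import jakarta.ws.rs.WebApplicationException;
import jakarta.ws.rs.core.Response;

import static jakarta.ws.rs.core.Response.Status.SERVICE_UNAVAILABLE;

public class TaskResource
{
    private final StartupStatus startupStatus;

    public TaskResource(StartupStatus startupStatus)
    {
        this.startupStatus = startupStatus;
    }

    // Invoked at the start of the task creation/update endpoints (illustrative).
    private void checkStartupComplete()
    {
        if (!startupStatus.isStartupComplete()) {
            // 503 signals a transient condition: the coordinator's HTTP
            // client retries it, whereas a 500 would fail the query.
            throw new WebApplicationException(
                    Response.status(SERVICE_UNAVAILABLE)
                            .entity("Worker is still starting up")
                            .build());
        }
    }
}
```

The appeal of this approach is that 503 is conventionally a retryable, transient status, so the coordinator's existing retry logic covers the startup window without any coordinator-side changes.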
Simpler alternative to #21744.
That other PR has the problem that, after a worker restarts, queries fail for some time because the new instance ID is not propagated.
The test being added here should prove that we can close #21735.