preprocessing hook not executing #313
Jobs without transfer tasks are created in the STAGED_IN state by default.
The processing service at a site maintains a queue of these jobs, runs
the preprocess function (which, if not implemented by the user, is
executed as a no-op), and sends updates back to the job API to advance
each job's state to PREPROCESSED.
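The flow above can be sketched roughly as follows. This is an illustrative model only, not Balsam's actual API; the function name and dict-based job representation are assumptions:

```python
# Illustrative sketch of the preprocessing step: a STAGED_IN job runs an
# optional user hook, then advances to PREPROCESSED. Not Balsam's real API.

def run_preprocessing(job, preprocess_hook=None):
    """Advance a STAGED_IN job to PREPROCESSED, running the hook if given."""
    assert job["state"] == "STAGED_IN"
    if preprocess_hook is not None:
        preprocess_hook(job)  # user-defined preprocessing hook
    # With no hook, this step is effectively a no-op before the state update.
    job["state"] = "PREPROCESSED"
    return job

job = {"id": 1, "state": "STAGED_IN"}
run_preprocessing(job)
print(job["state"])  # PREPROCESSED
```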
It sounds like a component of the processing service on the site is either
stopped, hanging, or otherwise unable to communicate with the API.
I would look for logs from the three components that handle preprocessing:
- JobSource
https://github.com/argonne-lcf/balsam/blob/main/balsam/site/job_source.py#L132
- BulkStatusUpdater
https://github.com/argonne-lcf/balsam/blob/main/balsam/site/status_updater.py
- Processing Workers
https://github.com/argonne-lcf/balsam/blob/main/balsam/site/service/processing.py#L169
If you find any of them going dark or raising an error, that would explain
the jobs getting stuck. If the logs just suddenly end, we’d need to dig
deeper to understand what’s stopping the process.
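One quick way to spot a component whose logs "just suddenly end" is to check whether its log files have stopped being written. A minimal sketch; the log directory path and 90-second threshold are assumptions, so point it at wherever your site actually writes logs:

```python
# Flag log files that have gone quiet: anything not modified within
# max_age_sec may indicate a stopped or hung component.
import time
from pathlib import Path

def stale_logs(log_dir, max_age_sec=90):
    """Return log files in log_dir not written to within max_age_sec."""
    now = time.time()
    return [p for p in Path(log_dir).glob("*.log")
            if now - p.stat().st_mtime > max_age_sec]
```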
At any rate, the various components could report their liveness by
writing a timestamp into a file, making an API call, or something similar.
Those "heartbeats" would give a tool a direct way to interrogate the
service and infer "okay, the processing service hasn't sent a heartbeat in
over 90 seconds, so I'm assuming it needs to be restarted".
Of course that doesn't solve the problem entirely; for that, you'd need a
userspace systemd unit or cron job to act as a watchdog and
restart/reauthenticate the site when it goes down. But establishing a
better way to check liveness would be the first step.
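The file-based heartbeat idea could look something like this. The file format, function names, and 90-second threshold are all illustrative assumptions; a real watchdog would pair the staleness check with a restart action:

```python
# Sketch of a file-based heartbeat: each component periodically writes a
# timestamp, and a watchdog decides whether the component has gone dark.
import json
import time
from pathlib import Path

def write_heartbeat(path, component):
    """Called periodically by a service component to record liveness."""
    Path(path).write_text(json.dumps({"component": component,
                                      "ts": time.time()}))

def heartbeat_is_stale(path, max_age_sec=90.0):
    """Watchdog side: True if the component likely needs a restart."""
    try:
        beat = json.loads(Path(path).read_text())
    except FileNotFoundError:
        return True  # never started, or heartbeat file was removed
    return time.time() - beat["ts"] > max_age_sec
```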
On Wed, Jan 25, 2023 at 2:17 PM, Christine Simpson wrote:
A problem that has been seen now by multiple users is that jobs get stuck
in the STAGED_IN state. Once this happens, subsequently created jobs also
get stuck in STAGED_IN. Since no jobs advance to PREPROCESSED, nothing can
be run. This sometimes happens with a user defined preprocessing hook, but
sometimes not. Sometimes restarting the site causes jobs to get unstuck and
advance to PREPROCESSED, but we should figure out why this happens and
prevent it.
We need to figure out a reproducer for this issue. We've not had this occur recently; some of the site stability fixes may have helped.