Unable to sync site as daemon hangs #340
On alcf-theta-knl the error is: a daemon that hangs causes job creation to fail silently, i.e. a job with no pre-processing step shows up as STAGED_IN instead of PREPROCESSED, causing the workflow to fail. Is there a way to enable better diagnostics to warn users when the daemon hangs? I'm seeing this occur frequently on alcf-theta-knl.
Thanks for flagging this @s-sajid-ali. What version of Balsam are you running? You can query the version at the command line.
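(For reference, one way to check the installed version from Python rather than from the Balsam CLI; this assumes the package is installed under the distribution name balsam.)

```python
# Check the installed Balsam version via importlib.metadata.
# Assumes the distribution is installed under the name "balsam".
from importlib.metadata import version

print(version("balsam"))
```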
This bug is occurring more frequently than expected and I'd like to fix it before submitting some large workloads. One approach I tried was to see if I could manually push the job toward the correct state, and while this does set the job to that state, a corresponding ... While I see that all jobs inherit an empty ..., could you give me some pointers on how the job creation process decides the phase of a job? I can then try debugging it to understand the root cause of this behavior. Thanks in advance!
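(As an aside, a sketch of how stuck jobs' states might be pushed by hand through the Python client; this assumes the balsam.api interface of argonne-lcf/balsam 2.x and is not necessarily the exact approach used above.)

```python
# Sketch: manually advance jobs that are stuck in STAGED_IN.
# Assumes the balsam.api client of argonne-lcf/balsam 2.x.
from balsam.api import Job

for job in Job.objects.filter(state="STAGED_IN"):
    job.state = "PREPROCESSED"  # push past the state the job is stuck in
    job.save()
```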
With a known ... More specifically, it looks like some child processes are in a zombie state.
Okay, I agree that this issue is probably related to #364. To find the process group id, you can do this at the command line: ... The first number is the process group id. You can kill all processes in the group (which should include the child processes) by typing this at the command line: ...
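(The commands referenced above are shell ones; as a rough Python equivalent, here is a sketch of finding a process group and signalling everything in it. The pid value is a placeholder for the hung service process.)

```python
# Sketch: find a process's group id and signal the whole group
# (children included), roughly what ps/kill would do at the command line.
import os
import signal

pid = 12345                      # placeholder: pid of the hung Balsam service
pgid = os.getpgid(pid)           # the process group id
print("process group:", pgid)

os.killpg(pgid, signal.SIGTERM)  # terminate every process in the group
```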
A few questions: ...
To answer some of your questions about job creation: unless the job has a transfer as part of it (which I don't think you have?), jobs are all created in the STAGED_IN state. If there is no user-specified preprocess step, the job should then advance automatically to PREPROCESSED. So what prevents the jobs from advancing to the PREPROCESSED state? Either the zombie process is somehow not running the code that advances the job state, or there's an uncaught exception in the code that advances the job state. Look in your most recent service logs in the log directory (they will have file names like ...).
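(A quick way to scan the newest service log for uncaught exceptions from Python; the site path and the *.log glob are assumptions about the log directory layout.)

```python
# Sketch: grep the newest service log in a site's log directory for
# tracebacks or errors. The site path and file suffix are placeholders.
from pathlib import Path

log_dir = Path("/path/to/my-balsam-site/log")  # placeholder site path
newest = max(log_dir.glob("*.log"), key=lambda p: p.stat().st_mtime)

for line in newest.read_text().splitlines():
    if "Traceback" in line or "ERROR" in line:
        print(f"{newest.name}: {line}")
```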
Thanks a lot for all the pointers! I haven't seen the service hang today, but I'll inspect the logs and provide them if it hangs again.
When the service hangs, I typically do ... Thanks for the detailed explanation of the job creation process. I did notice that the steps with a non-trivial ...
Okay, if you have a chance to look in the logs and you find anything, pass it along. If there's an uncaught exception happening in the code that advances jobs, we'd like to know so we can fix it.
Apologies for the erroneous post about a failure; I found a mistake in my logic that generated jobs with the same name. I'll post logs if I see a job getting stuck after fixing it. I'd forgotten to change the ...
I saw this error when trying to delete a site with stuck jobs (I wanted a fresh site to reproduce the bug with fewer logs to sift through):
I cancelled the operation and stopped the (zombie) processes individually. Could the above stack trace be the reason behind some of the Python threads created by the balsam client becoming zombie processes?
With a fresh site, creating two identical sets of jobs whose only difference is a change in one parameter, I see that one of the jobs to start a server is created in the ... state.
The only differences between the two server-start jobs are the parameter and the workdir:
Here are the two service logs from the site: ... I'd run ...
PS: I tried moving the job ...
Confirming the hypothesis that the root cause of the issue is a subset of the balsam client's threads becoming unresponsive. More specifically, the threads associated with ...
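(For capturing backtraces like these from a hung client in the future, one option, not necessarily what was used here, is Python's built-in faulthandler: register a signal handler in the client script ahead of time, then signal the stuck process to dump every thread's stack.)

```python
# Sketch: register a handler so that `kill -USR1 <pid>` dumps the
# traceback of every thread in this process (useful once it hangs).
# This must be added to the client script before the hang occurs.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```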
Hi @s-sajid-ali, sorry for the delay. I'm going to try to track down those backtraces, but a few more questions: ...
The directory for that job was not created:
Attaching them here: ... The workflow was: ...
(I split the jobs into two files to see if submitting the server job as a standalone one helps prevent the issue.) Thanks again for helping to track this bug down, and no worries about the delay.
Hopefully I can provide some helpful context, with an explanation of why you cannot delete jobs. I think part of what you are seeing is that jobs are filtered out of the Delete operation if they are still tied to a Session on the backend. This ensures that jobs currently being processed by a site don't get deleted mid-execution.

On the backend, the session API uses an atomic locking operation (...). Rather than keeping an explicit lock in the database (which would require long-lived transactions and be detrimental to performance), however, the Session behaves like a temporary lock or lease: it has an expiration date a few minutes out, and the Launcher/Processing services have a background thread that sends a heartbeat every minute to refresh the lease (https://github.com/argonne-lcf/balsam/blob/main/balsam/site/job_source.py#L25). While a job is associated with a live Session, it cannot be deleted or acquired by another session.

At the same time, we want to make sure that all sessions eventually get cleaned up. If the JobSource crashes and stops sending heartbeats, the session eventually becomes stale and will automatically be deleted the next time a different session comes online.

To summarize: ...
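(A minimal illustration of the lease-with-heartbeat pattern described above; this is not Balsam's actual Session implementation, just a sketch of the idea.)

```python
# Sketch of a lease that expires unless a background heartbeat refreshes it;
# not Balsam's actual Session code, only an illustration of the pattern.
import threading
from datetime import datetime, timedelta, timezone

LEASE_LENGTH = timedelta(minutes=5)
HEARTBEAT_PERIOD_S = 60

class Lease:
    def __init__(self):
        self.expires = datetime.now(timezone.utc) + LEASE_LENGTH

    def refresh(self):
        # What the heartbeat thread does every minute.
        self.expires = datetime.now(timezone.utc) + LEASE_LENGTH

    @property
    def stale(self):
        # A stale lease can be cleaned up by the next session that comes online.
        return datetime.now(timezone.utc) > self.expires

def heartbeat(lease, stop):
    while not stop.wait(HEARTBEAT_PERIOD_S):
        lease.refresh()

# Usage: run the heartbeat alongside the job-processing work.
lease = Lease()
stop = threading.Event()
threading.Thread(target=heartbeat, args=(lease, stop), daemon=True).start()
```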
@s-sajid-ali Thanks for sending your scripts. I think I've been able to reproduce the issue. I think the problem is the preprocessing hook for the ... This is what you sent:
I think what's happening is that the site process is stuck in that loop, looking for that file and never finding it. It cannot move on to any of the other jobs while it's stuck there, and it doesn't raise an exception because it is executing valid code. This may also be related to the problems with deleting the jobs. It would be helpful to understand a bit more about your workflow. How is the ...
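(One way to avoid an unbounded wait in a hook like that, shown as a sketch rather than the actual hook from this workflow: poll for the file with a timeout and fail loudly instead of blocking the site process forever.)

```python
# Sketch: wait for a file with a timeout so a missing file raises an error
# instead of blocking the processing service indefinitely.
import time
from pathlib import Path

def wait_for_file(path: Path, timeout_s: float = 300.0, poll_s: float = 5.0) -> None:
    deadline = time.monotonic() + timeout_s
    while not path.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"{path} did not appear within {timeout_s} seconds")
        time.sleep(poll_s)

# Example: inside a preprocessing step, wait for a hypothetical marker file.
# wait_for_file(Path("server_address.txt"))
```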