CellArmy exceptions deadlocking simulations #753
Comments
Hard to tell without knowing what MPI calls might be in flight when the exception triggers. Can you drop a link to the line(s) that throw in here, please? Is it that control never reaches the top-level catch?
We had deadlocks with the … @CharlesQiZhou, could you please send me stdout/stderr for one of those cases? I need to check whether https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/main.cc#L53 is being logged at all. That should allow me to answer @rupertnash's second question.
Closed by mistake, sorry. Reopening.
Nothing stands out to me. I immediately notice that, above: … Does this matter? Does it cost performance? I also see that you're calling …
Good points, @rupertnash. That loop should be over … @CharlesQiZhou, I don't know if you were notified about my previous message because I added your name in an edit. Please send me those files if you can.
@mobernabeu Sorry for missing your message; I only just saw it in my personal mailbox. Please find the stderr/stdout files below:
This may be useful for debugging purposes. One recent benchmark test of mine that triggered the CellArmy exception successfully invoked MPI_Abort and immediately terminated the simulation when run locally on a desktop (4 cores, about 10 minutes of running time). The same simulation running on ARCHER failed to invoke MPI_Abort and ended in a deadlock. Please find enclosed the log file from one of my local runs that triggered the exception.
Thanks @CharlesQiZhou. This is very bizarre indeed and needs more investigation to understand which part is broken (exception not being thrown, not being caught, MPI_Abort deadlocking or not doing its job). I suggest that you try to replicate on Cirrus or ARCHER2 once the code is running there, and add a bit more tracing to see which of the above is the culprit. Running it through a parallel debugger may be necessary if print statements are not sufficient.
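If more tracing helps, one possible shape for it is sketched below. This is an assumption, not hemelb's actual code: AbortWithTrace is a made-up helper and the messages are illustrative. The idea is to bracket the MPI_Abort call with rank-tagged, flushed prints so the ARCHER job log shows whether the exception is caught at all and whether MPI_Abort returns instead of killing the job.

```cpp
#include <mpi.h>
#include <cstdio>
#include <exception>
#include <stdexcept>

// Hypothetical helper (not part of hemelb): rank-tagged, flushed prints
// around the abort so the log shows how far each rank gets before a hang.
void AbortWithTrace(const std::exception& e)
{
  int rank = -1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Plain fprintf + fflush rather than a logging framework: if the logger
  // buffers output or needs MPI itself, the evidence may never reach the file.
  std::fprintf(stderr, "[rank %d] caught exception: %s\n", rank, e.what());
  std::fprintf(stderr, "[rank %d] calling MPI_Abort\n", rank);
  std::fflush(stderr);

  MPI_Abort(MPI_COMM_WORLD, 1);

  // If this line ever shows up, MPI_Abort returned instead of terminating.
  std::fprintf(stderr, "[rank %d] MPI_Abort returned!\n", rank);
  std::fflush(stderr);
}

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  try
  {
    throw std::runtime_error("simulated CellArmy failure");
  }
  catch (const std::exception& e)
  {
    AbortWithTrace(e);
  }
  MPI_Finalize();
  return 0;
}
```

Compiled with an MPI C++ wrapper and run under the same ARCHER environment, the last message printed by each rank should narrow the fault down to one of the three possibilities listed above.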
CellArmy::Fluid2CellInteractions throws exceptions for a couple of anomalous situations (e.g. numerical instability) that can potentially occur in just a subset of the MPI ranks (most of the time in a single rank). I'm puzzled about why the catch blocks in main.cc are not picking them up and MPI_Abort-ing the whole simulation on ARCHER. This has led to some costly deadlocks in production! Any thoughts @rupertnash?
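To make that failure mode concrete, here is a minimal, self-contained sketch. It is assumed rather than taken from the hemelb sources: one rank throws out of the interaction step while the others sit in a collective, and the top-level catch is supposed to MPI_Abort the whole job. Run stand-alone on ARCHER, it would also show whether MPI_Abort itself misbehaves when other ranks are blocked.

```cpp
#include <mpi.h>
#include <stdexcept>
#include <iostream>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  try
  {
    // Stand-in for CellArmy::Fluid2CellInteractions: only rank 0 detects the
    // anomaly and throws; every other rank proceeds into a collective.
    if (rank == 0)
      throw std::runtime_error("numerical instability detected on rank 0");

    double local = 1.0, globalMax = 0.0;
    MPI_Allreduce(&local, &globalMax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  }
  catch (const std::exception& e)
  {
    // Without the MPI_Abort, the remaining ranks stay blocked in the
    // Allreduce forever -- exactly the deadlock seen in production.
    std::cerr << "rank " << rank << ": " << e.what() << std::endl;
    MPI_Abort(MPI_COMM_WORLD, 1);
  }

  MPI_Finalize();
  return 0;
}
```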