CellArmy exceptions deadlocking simulations #753

Open
mobernabeu opened this issue Nov 9, 2020 · 8 comments

@mobernabeu
Contributor

CellArmy::Fluid2CellInteractions throws exceptions for a couple of anomalous situations (e.g. numerical instability) that can occur in just a subset of the MPI ranks (most of the time, a single rank). I'm puzzled about why the catch blocks in main.cc are not picking them up and MPI_Aborting the whole simulation on ARCHER. This has led to some costly deadlocks in production!
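For context, here is the failure mode I have in mind, as a minimal self-contained sketch (plain MPI, not hemeLB code; the top-level catch below is only a rough stand-in for the one in main.cc): one rank throws while the others are blocked in a collective, so everything hinges on the throwing rank reaching MPI_Abort and on that abort actually killing the blocked ranks.

#include <mpi.h>
#include <cstdlib>
#include <iostream>
#include <stdexcept>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  try
  {
    // Only one rank hits the anomalous situation; the others carry on
    // into a collective and block there waiting for it.
    if (rank == 0)
      throw std::runtime_error("numerical instability on one rank");
    MPI_Barrier(MPI_COMM_WORLD);
  }
  catch (const std::exception& e)
  {
    // Rough stand-in for the top-level catch in main.cc: log and abort everyone.
    std::cerr << "rank " << rank << ": " << e.what() << std::endl;
    MPI_Abort(MPI_COMM_WORLD, -1);
  }
  MPI_Finalize();
  return EXIT_SUCCESS;
}

If MPI_Abort does its job, the blocked ranks get killed; if it is never reached, or doesn't tear them down, they sit in the barrier forever, which is what the deadlock looks like.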

Any thoughts @rupertnash?

@rupertnash
Member

Hard to tell without knowing what MPI calls might be in flight when the exception triggers. Can you drop a link here to the line(s) that throw, please?

Is it that control never reaches the top-level catch, or that the call to MPI_Abort doesn't stop the sim?

@mobernabeu
Contributor Author

mobernabeu commented Nov 10, 2020

We had deadlocks with the throw statement in https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/redblood/CellArmy.h#L269

@CharlesQiZhou could you please send me the stdout/stderr for one of those cases? I need to check whether https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/main.cc#L53 is being logged at all. That should allow me to answer @rupertnash's second question.

@mobernabeu
Contributor Author

Closed by mistake, sorry. Reopening.

@mobernabeu mobernabeu reopened this Nov 10, 2020
@rupertnash
Member

Nothing stands out to me.

I immediately notice that, just above, at
https://github.com/UCL-CCS/hemelb-dev/blob/ea7a49a561277ba7aa5d275d3413e6dc4d71d0ec/Code/redblood/CellArmy.h#L253
you are iterating over copies of the elements in cells, which isn't usually what you want.

Does this matter? Does it cost performance?

I also see that you're calling std::map<>::at - is it this that's throwing? You can check with a temporary + IIFE ("iffy"):

auto&& tmp = [&]() -> decltype(auto) {  // decltype(auto) so we get the reference back, not a copy
  try {
    return nodeDistributions.at(cell->GetTag());
  } catch (std::out_of_range& e) {
    // Log an error and re-throw
    throw;
  }
}();
try {
  tmp.template Reindex<Stencil>(globalCoordsToProcMap, cell);
} catch (std::exception& e) {
  // Log and re-throw as before
  throw;
}
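
In case the hemeLB-specific version above is hard to read out of context, here is the same idiom in a self-contained form (the names are made up for illustration):

#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main()
{
  std::map<std::string, int> lookup{{"present", 42}};

  // Initialise the result through an immediately-invoked lambda so logging
  // can be attached to exactly one suspect call without losing the value.
  auto&& value = [&]() -> int& {
    try {
      return lookup.at("present");  // swap in a missing key to see the log fire
    } catch (std::out_of_range const& e) {
      std::cerr << "lookup threw: " << e.what() << std::endl;
      throw;
    }
  }();
  std::cout << value << std::endl;  // prints 42
}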

@mobernabeu
Contributor Author

Good points, @rupertnash. That loop should be over const&. std::map<>::at was defensive programming while debugging something that turned out to be unrelated; we can spare the bounds check now. I'll make those changes. Thanks for the iffy trick, I didn't know it.
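
Roughly what I have in mind, as a sketch only (keeping the names from your snippet; the actual loop in CellArmy.h may differ a little):

for (auto const& cell : cells)  // const& so we stop copying the elements
{
  auto found = nodeDistributions.find(cell->GetTag());
  assert(found != nodeDistributions.end());  // debug-only sanity check in place of at()'s bounds check
  found->second.template Reindex<Stencil>(globalCoordsToProcMap, cell);
}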

@CharlesQiZhou I don't know if you were notified about my previous message because I added your name in an edit. Please send me those files if you can.

@CharlesQiZhou
Contributor

@mobernabeu Sorry for missing your message. I just saw it in my personal mailbox. Please find the stderr/stdout files below:
stderr.txt
stdout.txt

@CharlesQiZhou
Contributor

CharlesQiZhou commented Feb 2, 2021

This may be useful for debugging purposes. A recent benchmark test of mine that triggers the CellArmy exception successfully invoked MPI_ABORT and immediately terminated the simulation when run locally on a desktop (4 cores, about 10 minutes of running). The same simulation running on ARCHER failed to invoke MPI_ABORT and ended in a deadlock.

Enclosed please find the log file from one of my local runs triggering the exception.
log.txt

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

@mobernabeu
Contributor Author

Thanks @CharlesQiZhou. This is very bizarre indeed and needs more investigation to understand which part is broken (the exception not being thrown, not being caught, or MPI_Abort deadlocking or not doing its job). I suggest you try to replicate it on Cirrus or ARCHER2 once the code is running there, and add a bit more tracing to see which of the above is the culprit. Running it through a parallel debugger may be necessary if print statements are not sufficient.
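
Something along these lines would narrow it down; this is only a rough sketch with a made-up Trace helper using plain MPI plus stderr, rather than hemeLB's logger:

#include <mpi.h>
#include <iostream>

// Rank-tagged, flushed stderr tracing so the last message from each rank
// survives an abort or a deadlock.
void Trace(const char* where)
{
  int rank = -1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::cerr << "[rank " << rank << "] " << where << std::endl;
}

// Probe points:
//  1. immediately before the throw in CellArmy::Fluid2CellInteractions:
//       Trace("about to throw from Fluid2CellInteractions");
//  2. at the top of the catch block in main.cc:
//       Trace("exception reached top-level catch");
//  3. immediately before the call to MPI_Abort:
//       Trace("calling MPI_Abort");

Whichever message is missing from stderr on ARCHER tells us which stage is broken.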
