Get HemeLB RBC code running on Cirrus post merge into main #775
Minimum requirements for the dependencies to be compiled on Cirrus:
@mobernabeu @rupertnash
|
@rupertnash is there any good reason not to go with gcc 10.2 on Cirrus (rather than the ancient 8.2)? |
Just tried gcc 10.2, namely with the below modules
Dependencies ( See the complete stderr.txt and stdout.txt below for this attempt with gcc 10.2: |
I'm pro updating to 10: newer is usually better for the compilers |
I think you need to load the libtool module on Cirrus: |
Thanks @rupertnash. Adding
I'm also attaching the stdout.txt here: |
Thanks @CharlesQiZhou, I suggest that you try running the non-MPI unit tests (assuming they have compiled?) and then test with the main executable whether you encounter the issue with RBC vertex coordinates in #770. Moving forward, please parse the compilation output and only add to tickets the actual errors, rather than a long list of warnings. |
Thanks @mobernabeu. Unfortunately only two executable |
You've got a linking error there. Which symbol is it complaining about? You can demangle the name with |
Well spotted @rupertnash. The demangled functions are:
|
Thanks @CharlesQiZhou. @rupertnash, could it be a lambda function that has been optimised out by mistake? Perhaps we could try setting |
Hi @rupertnash, thanks for your recent changes before Christmas. I found the superbuild quite handy. However, even with the superbuild, the compiled code still only runs as a serial job on Cirrus (with srun --ntasks=1). Parallel jobs with srun fail with the error below:
and mpirun would fail with libstdc++.so.6 error instead:
The mpirun error is beyond me, as gcc 10.2.0 rather than 8.2.0 was loaded for both the compilation and the job submission. FYI, the modules I loaded for compiling the code were:
The hemelb main executable did compile, but errors occurred during
|
Maybe it's also useful to post the error from the conventional build, i.e. manually compiling the dependencies and then the code itself. Last November both would compile and the main executable was generated. Now, running make on the dependencies gives the error below:
|
Hi Charles, while this gets looked into properly, can you check if directories |
Hi @mobernabeu, |
We agreed that @CharlesQiZhou should go back to the SuperBuild build and understand the MPI runtime error that the simulations are throwing. The sysadmin advised linking with a different MPI implementation. If it persists afterwards, we will need to understand if we are doing something wrong in the code that shows on Cirrus but not on other machines (due to different MPI implementations being used). |
Hi @mobernabeu, the superbuild with the MPT library (the default MPI implementation recommended on Cirrus) instead of Intel-MPI (i.e. intel-mpi-19, as in the previous comment above) gives us a SIGSEGV error at run time, which was also encountered for the earlier See the run errors below for (1) the serial job (srun --ntasks=1), (2) the parallel job with srun, and (3) the parallel job with mpirun, respectively. Note that mpirun gives the same error (
(1) srun --ntasks=1
(2) srun
(3) mpirun
|
Thanks, Charles. We need to understand if (1) is a bug in HemeLB or in the MPI implementation. The output is giving you a stack trace where |
Hi @mobernabeu, the print output indicates that the code crashes at the line below |
Thanks, Charles. If you mean the return statement, it should be happening within the constructor of the object being returned. Could you please see where exactly? Look for |
@mobernabeu, apologies for the confusion. The crashed line itself is Only "breakpoint 1" and "breakpoint 2" from below are printed at runtime:
|
Can you try replacing that line with If you are gonna use |
Hi @mobernabeu, I just gave it a go. Unfortunately there is no "success?" output from the recompiled code. Still "breakpoint 1" and "breakpoint 2" only |
Mmmh, puzzling, can you add an |
@mobernabeu, got the assertion error below
|
@mobernabeu, following the added assertion about commPtr in
Running with 2 cores ( |
Following discussion with @mobernabeu offline, a bug in the
This fixes the
rather than
If the above is not dramatic enough, the first way of initialising the simulation ( The stack traceback below is for
Note that the last call to hemelb in |
Note the build error below is common for the main executable built either by MPT or Intel-MPI:
I should also mention that for parallel jobs run with hemelb built by MPT,
or
|
Let's start with MPT and ntasks=4. It might be a similar problem to the previous one. |
Hi @mobernabeu, unfortunately |
I spoke with @CharlesQiZhou yesterday and it seems that flow-only simulations are sometimes crashing at initialisation, so we need to start with that before investigating RBCs any further. @CharlesQiZhou, could you please document that here? If you could do a run with the logger set to Debug, that could be useful. Separately, could you also document the errors that you get when compiling the unit tests (sequential and parallel)? We should check those first to see if they pick up any of the issues, as that will make it easier to debug. |
@rupertnash A quick follow-up of our discussion this morning.
|
Further info. on the unittest compilation error and main application runtime error (files attached at the end) after last week's merge of
|
@CharlesQiZhou can you please try again on Cirrus? The last 2 commits to main may have fixed these problems. |
Oh man, I didn't think it could be picking up the wrong ctor... 🤦 Hope you are right! |
Thanks @rupertnash for the commits.
|
Thanks @CharlesQiZhou. Please add some tracing to check which of the two assertions in |
OK - after a brief fight with Cirrus yesterday I have the tests compiling and passing with the following modules and CMake via the superbuild
|
Thanks a lot @rupertnash for negotiating with Cirrus and the new commits. I attempted a few compilations following your modules but the main executable failed to compile this time due to TinyXML library issues (see error message below). I saw your comments about
Hi @mobernabeu, sure, I'll follow up with the |
Similar to #770.
Let's start by compiling the code manually on the login nodes of the machine and then we will try to sort out builds with Fabric afterwards.
Document module configuration here.