Confusing error message if slave hangs during startup #22

Open
kyllingstad opened this issue Mar 11, 2017 · 1 comment
kyllingstad commented Mar 11, 2017

If a slave hangs/crashes during its fmiInstantiateSlave() call, coralmaster gives the following rather unhelpful information:

Parsing execution configuration file '../execonf'
Creating new execution
Parsing model configuration file '../sysconf' and spawning slaves
Error: Unexpected internal error: std::future_error: Broken promise

It should inform the user that it failed because of a timeout.
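
For illustration, a timeout like this could be reported roughly as follows. This is only a minimal sketch, not Coral's code: `WaitForSlave` and its parameters are hypothetical, and it assumes the slave's startup result is delivered through a `std::future`.

```cpp
#include <chrono>
#include <future>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Coral): waits for a slave's startup
// result and turns a timeout into an error that names the slave.
template <typename T>
T WaitForSlave(std::future<T>& result,
               const std::string& slaveName,
               std::chrono::milliseconds timeout)
{
    // wait_for() returns without consuming the result, so the future
    // stays valid and can be waited on again after a timeout.
    if (result.wait_for(timeout) != std::future_status::ready) {
        throw std::runtime_error(
            "Slave '" + slaveName + "' did not respond within "
            + std::to_string(timeout.count()) + " ms during startup");
    }
    return result.get();
}
```

With something along these lines, the user would see which slave timed out instead of a generic internal error.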

Running with a single slave named "sine" and debug/trace info on, we get:

Parsing execution configuration file '../execonf'
Creating new execution
[ trace ] ExecutionManager state change: none -> N5coral3bus19ReadyExecutionStateE (/path_redacted/coral/src/lib/bus_execution_manager_private.cpp:219)
Parsing model configuration file '../sysconf' and spawning slaves
[ trace ] Sent GetSlaveTypes request to 1 providers (/path_redacted/coral/src/lib/master_cluster.cpp:326)
[ trace ] GetSlaveTypes request to slave provider 62a34cb1-6f67-4884-81c6-a28a08c5f4a7 returned 4 types (/path_redacted/coral/src/lib/master_cluster.cpp:356)
[ trace ] ExecutionManager state change: N5coral3bus19ReadyExecutionStateE -> N5coral3bus28ReconstitutingExecutionStateE (/path_redacted/coral/src/lib/bus_execution_manager_private.cpp:219)
[ trace ] PendingSlaveControlConnectionPrivate  0x7fffe4001070: Connecting to endpoint tcp://127.0.0.1:42251 (/path_redacted/coral/src/lib/bus_slave_control_messenger.cpp:140)
[ trace ] PendingSlaveControlConnectionPrivate  0x7fffe4001070: Sent HELLO (/path_redacted/coral/src/lib/bus_slave_control_messenger.cpp:147)
[ trace ] PendingSlaveControlConnectionPrivate  0x7fffe4001070: Received MSG_HELLO (/path_redacted/coral/src/lib/bus_slave_control_messenger.cpp:182)
[ trace ] SlaveControlMessengerV0 0x7fffe400b690: connected to "sine" (ID = 1) (/path_redacted/coral/src/lib/bus_slave_control_messenger_v0.cpp:86)
[ trace ] SlaveControlMessengerV0 0x7fffe400b690: Sending MSG_SETUP (/path_redacted/coral/src/lib/bus_slave_control_messenger_v0.cpp:309)
[ trace ] SlaveControlMessengerV0 0x7fffe400b690: Send complete (/path_redacted/coral/src/lib/bus_slave_control_messenger_v0.cpp:313)
[ debug ] Unexpected exception thrown in CommThread destructor: SetPeers: Precondition not satisfied: State() == SLAVE_READY (/path_redacted/coral/src/include/coral/async.hpp:673)
Error: Unexpected internal error: std::future_error: Broken promise
@kyllingstad

This error happens during the "reconstitute" operation, i.e. when the slave is added, and here's why:

The "reconstitute" operation consists of two steps:

  1. Establish connections with all added slaves (includes sending HELLO and SETUP).
  2. Update the list of slave network addresses and send it to all slaves (SET_PEERS).

Currently, if the first step fails, the operation still continues to the second step. There, it attempts to send the SET_PEERS message to the dead slave too, which produces the [debug] message in the output above. The resulting exception kills the communication thread, which breaks the promise to the main thread and causes the final error message.
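
The "Broken promise" part can be reproduced in isolation. The sketch below is not Coral's code: `SendSetPeers` and `ErrorSeenByWaiter` are hypothetical names, and everything runs on one thread for simplicity. It only shows the general C++ behaviour: when a `std::promise` is destroyed without a value or exception having been stored, the waiting side gets `std::future_error` with `broken_promise`, and the original error text is lost.

```cpp
#include <future>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for work done on the communication thread. It owns
// the promise, and throws before fulfilling it if the slave is not alive.
void SendSetPeers(std::promise<int> result, bool slaveAlive)
{
    if (!slaveAlive) {
        throw std::logic_error("SetPeers: Precondition not satisfied");
    }
    result.set_value(1);
}

// Returns a description of what the waiting side ends up seeing.
std::string ErrorSeenByWaiter(bool slaveAlive)
{
    std::promise<int> promise;
    std::future<int> future = promise.get_future();
    try {
        SendSetPeers(std::move(promise), slaveAlive);
    } catch (const std::logic_error&) {
        // The informative exception stops here; the promise it unwound
        // past has already been destroyed without a result.
    }
    try {
        future.get();
        return "ok";
    } catch (const std::future_error& e) {
        return e.code() == std::future_errc::broken_promise
            ? "broken promise" : "other future error";
    }
}
```

One possible way to keep the original message would be to catch the exception where the promise is still alive and forward it with `set_exception(std::current_exception())`, so that `future.get()` rethrows the real error.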

There are in fact three bugs here:

  • The operation continues to the second step, when it probably shouldn't.
  • In the second step, we attempt to send a command to a dead slave.
  • The error which is finally printed does not in any way hint at what went wrong, or even point us in the right direction. (At least, it should be something like the debug message.)
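
As a rough sketch of the intended behaviour (hypothetical types and names, not Coral's actual classes), the second step would run only if every slave made it through the first, and the error would name the slaves that did not:

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical model of a slave as seen by the master.
struct Slave
{
    std::string name;
    bool connected = false;  // true once HELLO and SETUP have succeeded
    bool peersSent = false;  // true once SET_PEERS has been sent
};

// Assumes step 1 has already been attempted and recorded in `connected`.
// Step 2 is skipped entirely if any slave failed step 1.
void Reconstitute(std::vector<Slave>& slaves)
{
    std::string failed;
    for (const auto& s : slaves) {
        if (!s.connected) {
            if (!failed.empty()) failed += ", ";
            failed += s.name;
        }
    }
    if (!failed.empty()) {
        throw std::runtime_error(
            "Failed to connect to slave(s) during startup: " + failed);
    }
    for (auto& s : slaves) {
        s.peersSent = true;  // stands in for sending SET_PEERS
    }
}
```

Stopping before step 2 covers the first two points, and the exception text covers the third.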

kyllingstad added a commit to kyllingstad/coral that referenced this issue Mar 11, 2017
This fixes the first problem described in issue viproma#22. Now, if any of the
slaves hang/crash during the "add slaves" step, the whole simulation is
terminated.