-
Notifications
You must be signed in to change notification settings - Fork 165
KillSwitch terminates pending instance #429
KillSwitch terminates pending instance #429
Conversation
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
//! - terminable: `true` | ||
//! - execution_domain: `Pending` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you wanted a two line summary of this change, this is it. It would be very useful if an instance is terminable from creation onward. Before running, it will be in the Pending
domain.
And as mentioned in the FIXME
below, I will make a pass at updating this before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just 2c since this is still WIP
: I'm really glad to see this comes out to be a small and clear change overall! I'm looking forward to giving this another review once you're ready.
ee003cd
to
b4a8919
Compare
828a293
to
a2b50de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've said it before, but I'm really happy to see KillState
simplified to concerns around one variable that itself is functionally just an atomic integer. That KillSwitch::terminate
works with these changes (almost) unchanged is super cool too.
I do have a few tweaks/edge cases I'd like to see addressed, but the general shape of this looks really good.
lucet-runtime/lucet-runtime-internals/src/context/context_asm.S
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
@@ -298,6 +424,7 @@ impl KillSwitch { | |||
} | |||
Ok(KillSuccess::Signalled) | |||
} else { | |||
// If we are here, we probably faulted without updating the execution domain. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A fault today (either an unreachable
/abort
in guest code, or a segfault from a bad memory access) would be caught by the signal handler, which would then have updated to Domain::Terminated
. So to get here would require the instance exit without going through the backstop or signal handler, leaving Instance::terminate
. But then Instance::terminate
would be called in a hostcall, so the domain would not have been Guest
, unless someone provided a hostcall without #[lucet_hostcall]
.
I originally didn't want to speculate on how the system would reach this too much, not being able to think of a reason at first, but "bad hostcall" seems like something that might happen. How's it sound to note that the panic-message-seer might want to investigate hostcalls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked about this directly, and over here, but we can avoid trying to lock the domain in the signal handler. As you mentioned, that invites a lot of trouble, and we can update the domain here.
The catch is that this arm is then reachable, if we have faulted in guest code, as we won't have updated the domain from Guest
.
lucet-runtime/lucet-runtime-internals/src/context/context_asm.S
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
via pr review from @iximeow.
Because we are now viewing a pending instance as terminable, we use the execution domain to indicate whether or not it is safe to issue a signal to a running Instance. This means that the `terminable` flag is no longer needed! We shouldn't ever be in contention over this mutex, so this should not present a notable performance hit.
3978e41
to
b566730
Compare
This is ready for another round of review! After addressing your notes, this feels much more concise. The documentation for the |
/// | ||
/// See `lucet_runtime_internals::context::lucet_context_activate`, and | ||
/// `execution::enter_guest_region` for more info. | ||
fn with_terminability(&mut self) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
πΆοΈ So, this is a spicy change, but it's a no-op in the name of documentation. πΆοΈ
We have our activation thunk, our entry callback, and data to pass to the entry callback. Notably, the activation thunk does not actually care what the entry callback does, or what data is passed to it! The activation thunk exists purely to call some callback, and then jump to an address.
This block, which was previously inlined in run_func
, is how that thunk is used to make the guest terminable upon entry. Putting this into a separate private method means we can document this a little more cleanly.
I am open to bikeshedding name here if you'd like, I know returning &mut Self
means this isn't quite what we would canonically call with_something
. I just liked that this means we can place this into run_func
above inside of a line that looks like:
self.with_terminability().swap_and_return()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like making this a Thing we can talk about outside run_func
. Here's an argument: this should be a function (install_activator(callback: fn(*mut Instance))
) on Context
.
As you noted, this doesn't really care what the callback does, so terminability
is a bit misleading there. It also doesn't set up termination itself, but lays the groundwork for the instance to be terminable ... but this whole PR is about making the instance terminable from when it was allocated! A with_...
style name would suggest this could be called more than once without error too, but calling it twice on the same Context would swap the preserved guest function address with another copy of lucet_context_activate
, definitely introducing errors.
So: why on Context
? The only structure this function works with is self.ctx
and self.ctx.gpr
. It also takes a callback function, but the function's argument is also a field on Context
, so callback_data_ptr
wouldn't need to exist! If execution::enter_guest_region
is taken as an argument, the remaining body doesn't even reference data outside Context
either. I'll need to think a bit about how this interacts with #433 as it's Instance
implementation details, but I think if you squint this looks like a general "shim in a function call before this Context resumes" mechanism - I'm happy to square that away later.
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
/// Instance execution domains. | ||
/// | ||
/// This enum allow us to distinguish how to appropriately terminate an instance. | ||
/// | ||
/// We can signal in guest code, but must avoid signalling in host code lest we interrupt some | ||
/// function operating on guest/host shared memory, and invalidate invariants. For example, | ||
/// interrupting in the middle of a resize operation on a `Vec` could be extremely dangerous. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some documentation for the domain, and the hostcall methods on the kill state, since domain is now much more important w.r.t. terminability. Most of this was borrowed from existing comments etc., but I would also greatly appreciate eyes here π
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more tweaks, please! Almost entirely documentation thoughts.
/// | ||
/// See `lucet_runtime_internals::context::lucet_context_activate`, and | ||
/// `execution::enter_guest_region` for more info. | ||
fn with_terminability(&mut self) -> &mut Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like making this a Thing we can talk about outside run_func
. Here's an argument: this should be a function (install_activator(callback: fn(*mut Instance))
) on Context
.
As you noted, this doesn't really care what the callback does, so terminability
is a bit misleading there. It also doesn't set up termination itself, but lays the groundwork for the instance to be terminable ... but this whole PR is about making the instance terminable from when it was allocated! A with_...
style name would suggest this could be called more than once without error too, but calling it twice on the same Context would swap the preserved guest function address with another copy of lucet_context_activate
, definitely introducing errors.
So: why on Context
? The only structure this function works with is self.ctx
and self.ctx.gpr
. It also takes a callback function, but the function's argument is also a field on Context
, so callback_data_ptr
wouldn't need to exist! If execution::enter_guest_region
is taken as an argument, the remaining body doesn't even reference data outside Context
either. I'll need to think a bit about how this interacts with #433 as it's Instance
implementation details, but I think if you squint this looks like a general "shim in a function call before this Context resumes" mechanism - I'm happy to square that away later.
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented, re: never type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is close! Sorry for leaving this one so long - two small points to fix and this looks good to go.
lucet-runtime/lucet-runtime-internals/src/instance/execution.rs
Outdated
Show resolved
Hide resolved
// the kill state, invalidating any existing killswitches' weak references. We | ||
// forget the mutex guard so that we don't invoke its destructor and segfault. | ||
(*instance).kill_state = Arc::new(KillState::default()); | ||
mem::forget(current_domain); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mean mem::drop
here? I'm pretty sure forget
ing the lock guard would result in old KillState
accumulating over time.
Co-Authored-By: iximeow <git@iximeow.net>
by taking a `*mut Instance`, it is very easy to accidentally do some UB-invoking operation such as trying to reassign and replace `kill_state`. For operations that *are* safe on a shared ptr to `Instance`, like updating an atomic or taking a mutex lock, we only need a `*const` ptr to get a regular `&Instance` - trying to use a `&mut Instance` in callback functions is likely a sign of impending UB.
yielding does not cause a kill state reset because the same guest function may be resumed later on
β¦ yielding" This reverts commit 79c4d0f.
This reverts commit 21bf244.
Previously, we were leaking our guard on the execution domain, by calling `mem::forget`. However, calling `mem::drop` here would cause a segmentation fault. This uses `as_mut` so that our guard's lifetime can be inferred correctly. Co-Authored-By: iximeow <me@iximeow.net> Co-Authored-By: Adam C. Foltzer <acfoltzer@fastly.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
π π looks good! I have one doc suggestion which you're welcome to ignore if you feel it's not helpful. This PR has been a real journey. π
Co-Authored-By: iximeow <git@iximeow.net>
Yep!! Thanks for reviewing along the way β€οΈ |
KillSwitch Terminates Pending Instances π ββοΈ β
Overview
This updates the
lucet_runtime_internals::instance::execution
module so thatKillSwitch::terminate
will also terminate pending instances.Before if an instance was remotely terminated before running, the result would be a
KillError::NotTerminable
error. Moreover, the instance would still run a guest context ifrun
was called afterwards.Now if an instance is remotely terminated before running, the result will now be
Ok(KillSuccess::Cancelled)
rather than aKillError::NotTerminable
error. The instance will then enter aDomain::Cancelled
state, and will not enter a guest region whenrun
is called.Example
This is best summarized in a small snippet, assuming an initialized instance
inst
:Implementation
In order for this to happen, it means that an instance is now considered terminable when it is not running. In the original implementation,
terminable
was really a flag that meant that the instance could be safely signalled. (Recall: Depending on the current execution domain, it may or may not be safe to signal the instance to terminate)If the instance is now terminable when the guest context is not active (i.e. it has not yet started), we now need to indicate whether or not the instance can be safely signalled using the execution domain. This means touching some of the π» fun π» parts of
lucet
.The TLDR of the changes here is:
Domain
now has two additional variants:Pending
, andCancelled
. A new instance will be in thePending
state. If a kill switch calls itsterminate
method before the instance runs, that instance will be left in theCancelled
domain.KillState::execution_domain
.terminability
member is no longer semantically useful. (This part, is the πΆοΈ spicy part.)lucet_context_activate
should call a Rust function to lock the domainMutex
, and update the execution domain before jumping to guest code. This means we now have aenter_guest_region
that corresponds to theexit_guest_region
that we use as a backstop callback.