Skip to content
This repository has been archived by the owner on Mar 24, 2022. It is now read-only.

KillSwitch terminates pending instance #429

Conversation

data-pup
Copy link
Member

@data-pup data-pup commented Feb 19, 2020

KillSwitch Terminates Pending Instances πŸ™…β€β™€οΈ ❌

Overview

This updates the lucet_runtime_internals::instance::execution module so that KillSwitch::terminate will also terminate pending instances.

Before if an instance was remotely terminated before running, the result would be a KillError::NotTerminable error. Moreover, the instance would still run a guest context if run was called afterwards.

Now if an instance is remotely terminated before running, the result will now be Ok(KillSuccess::Cancelled) rather than a KillError::NotTerminable error. The instance will then enter a Domain::Cancelled state, and will not enter a guest region when run is called.

Example

This is best summarized in a small snippet, assuming an initialized instance inst:

let inst: Instance = // ...
let kill_switch = inst.kill_switch();
assert_eq!(kill_switch.terminate(), Ok(KillSuccess::Cancelled));

// if terminated before running, the guest should not start at all
match inst.run("onetwothree", &[]) {
    Err(Error::RuntimeTerminated(TerminationDetails::Remote)) => {}
    res => panic!("unexpected result: {:?}", res),
}

Implementation

In order for this to happen, it means that an instance is now considered terminable when it is not running. In the original implementation, terminable was really a flag that meant that the instance could be safely signalled. (Recall: Depending on the current execution domain, it may or may not be safe to signal the instance to terminate)

If the instance is now terminable when the guest context is not active (i.e. it has not yet started), we now need to indicate whether or not the instance can be safely signalled using the execution domain. This means touching some of the πŸ‘» fun πŸ‘» parts of lucet.

The TLDR of the changes here is:

  • Domain now has two additional variants: Pending, and Cancelled. A new instance will be in the Pending state. If a kill switch calls its terminate method before the instance runs, that instance will be left in the Cancelled domain.
  • Signal behavior now depends on KillState::execution_domain.
    • The terminability member is no longer semantically useful. (This part, is the 🌢️ spicy part.)
    • lucet_context_activate should call a Rust function to lock the domain Mutex, and update the execution domain before jumping to guest code. This means we now have a enter_guest_region that corresponds to the exit_guest_region that we use as a backstop callback.

Comment on lines 32 to 33
//! - terminable: `true`
//! - execution_domain: `Pending`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted a two line summary of this change, this is it. It would be very useful if an instance is terminable from creation onward. Before running, it will be in the Pending domain.

And as mentioned in the FIXME below, I will make a pass at updating this before merging.

Copy link
Contributor

@iximeow iximeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just 2c since this is still WIP: I'm really glad to see this comes out to be a small and clear change overall! I'm looking forward to giving this another review once you're ready.

@data-pup data-pup force-pushed the ktm/killswitch-terminates-pending-guest branch from ee003cd to b4a8919 Compare February 20, 2020 17:17
@data-pup data-pup force-pushed the ktm/killswitch-terminates-pending-guest branch 2 times, most recently from 828a293 to a2b50de Compare February 21, 2020 22:30
@data-pup data-pup requested a review from iximeow February 21, 2020 22:46
@data-pup data-pup marked this pull request as ready for review February 21, 2020 23:41
Copy link
Contributor

@iximeow iximeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've said it before, but I'm really happy to see KillState simplified to concerns around one variable that itself is functionally just an atomic integer. That KillSwitch::terminate works with these changes (almost) unchanged is super cool too.

I do have a few tweaks/edge cases I'd like to see addressed, but the general shape of this looks really good.

lucet-runtime/lucet-runtime-internals/src/context/mod.rs Outdated Show resolved Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance.rs Outdated Show resolved Hide resolved
lucet-runtime/lucet-runtime-internals/src/instance.rs Outdated Show resolved Hide resolved
@@ -298,6 +424,7 @@ impl KillSwitch {
}
Ok(KillSuccess::Signalled)
} else {
// If we are here, we probably faulted without updating the execution domain.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A fault today (either an unreachable/abort in guest code, or a segfault from a bad memory access) would be caught by the signal handler, which would then have updated to Domain::Terminated. So to get here would require the instance exit without going through the backstop or signal handler, leaving Instance::terminate. But then Instance::terminate would be called in a hostcall, so the domain would not have been Guest, unless someone provided a hostcall without #[lucet_hostcall].

I originally didn't want to speculate on how the system would reach this too much, not being able to think of a reason at first, but "bad hostcall" seems like something that might happen. How's it sound to note that the panic-message-seer might want to investigate hostcalls?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this directly, and over here, but we can avoid trying to lock the domain in the signal handler. As you mentioned, that invites a lot of trouble, and we can update the domain here.

The catch is that this arm is then reachable, if we have faulted in guest code, as we won't have updated the domain from Guest.

Because we are now viewing a pending instance as terminable, we use the
execution domain to indicate whether or not it is safe to issue a signal
to a running Instance.

This means that the `terminable` flag is no longer needed! We shouldn't
ever be in contention over this mutex, so this should not present a
notable performance hit.
@data-pup data-pup force-pushed the ktm/killswitch-terminates-pending-guest branch from 3978e41 to b566730 Compare February 25, 2020 05:52
@data-pup data-pup requested a review from iximeow February 25, 2020 06:04
@data-pup
Copy link
Member Author

This is ready for another round of review! After addressing your notes, this feels much more concise. The documentation for the lucet_runtime_internals::instance::execution module is still stale, but I can take a pass updating that once the changes here look good πŸ™‚

///
/// See `lucet_runtime_internals::context::lucet_context_activate`, and
/// `execution::enter_guest_region` for more info.
fn with_terminability(&mut self) -> &mut Self {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌢️ So, this is a spicy change, but it's a no-op in the name of documentation. 🌢️

We have our activation thunk, our entry callback, and data to pass to the entry callback. Notably, the activation thunk does not actually care what the entry callback does, or what data is passed to it! The activation thunk exists purely to call some callback, and then jump to an address.

This block, which was previously inlined in run_func, is how that thunk is used to make the guest terminable upon entry. Putting this into a separate private method means we can document this a little more cleanly.

I am open to bikeshedding name here if you'd like, I know returning &mut Self means this isn't quite what we would canonically call with_something. I just liked that this means we can place this into run_func above inside of a line that looks like:

self.with_terminability().swap_and_return()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like making this a Thing we can talk about outside run_func. Here's an argument: this should be a function (install_activator(callback: fn(*mut Instance))) on Context.

As you noted, this doesn't really care what the callback does, so terminability is a bit misleading there. It also doesn't set up termination itself, but lays the groundwork for the instance to be terminable ... but this whole PR is about making the instance terminable from when it was allocated! A with_... style name would suggest this could be called more than once without error too, but calling it twice on the same Context would swap the preserved guest function address with another copy of lucet_context_activate, definitely introducing errors.

So: why on Context? The only structure this function works with is self.ctx and self.ctx.gpr. It also takes a callback function, but the function's argument is also a field on Context, so callback_data_ptr wouldn't need to exist! If execution::enter_guest_region is taken as an argument, the remaining body doesn't even reference data outside Context either. I'll need to think a bit about how this interacts with #433 as it's Instance implementation details, but I think if you squint this looks like a general "shim in a function call before this Context resumes" mechanism - I'm happy to square that away later.

Comment on lines +315 to +321
/// Instance execution domains.
///
/// This enum allow us to distinguish how to appropriately terminate an instance.
///
/// We can signal in guest code, but must avoid signalling in host code lest we interrupt some
/// function operating on guest/host shared memory, and invalidate invariants. For example,
/// interrupting in the middle of a resize operation on a `Vec` could be extremely dangerous.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some documentation for the domain, and the hostcall methods on the kill state, since domain is now much more important w.r.t. terminability. Most of this was borrowed from existing comments etc., but I would also greatly appreciate eyes here πŸ™‚

Copy link
Contributor

@iximeow iximeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more tweaks, please! Almost entirely documentation thoughts.

///
/// See `lucet_runtime_internals::context::lucet_context_activate`, and
/// `execution::enter_guest_region` for more info.
fn with_terminability(&mut self) -> &mut Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like making this a Thing we can talk about outside run_func. Here's an argument: this should be a function (install_activator(callback: fn(*mut Instance))) on Context.

As you noted, this doesn't really care what the callback does, so terminability is a bit misleading there. It also doesn't set up termination itself, but lays the groundwork for the instance to be terminable ... but this whole PR is about making the instance terminable from when it was allocated! A with_... style name would suggest this could be called more than once without error too, but calling it twice on the same Context would swap the preserved guest function address with another copy of lucet_context_activate, definitely introducing errors.

So: why on Context? The only structure this function works with is self.ctx and self.ctx.gpr. It also takes a callback function, but the function's argument is also a field on Context, so callback_data_ptr wouldn't need to exist! If execution::enter_guest_region is taken as an argument, the remaining body doesn't even reference data outside Context either. I'll need to think a bit about how this interacts with #433 as it's Instance implementation details, but I think if you squint this looks like a general "shim in a function call before this Context resumes" mechanism - I'm happy to square that away later.

lucet-runtime/lucet-runtime-tests/src/timeout.rs Outdated Show resolved Hide resolved
lucet-runtime/lucet-runtime-tests/src/timeout.rs Outdated Show resolved Hide resolved
Copy link
Member

@cratelyn cratelyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented, re: never type.

@cratelyn cratelyn requested a review from iximeow March 13, 2020 23:35
Copy link
Contributor

@iximeow iximeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is close! Sorry for leaving this one so long - two small points to fix and this looks good to go.

// the kill state, invalidating any existing killswitches' weak references. We
// forget the mutex guard so that we don't invoke its destructor and segfault.
(*instance).kill_state = Arc::new(KillState::default());
mem::forget(current_domain);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean mem::drop here? I'm pretty sure forgeting the lock guard would result in old KillState accumulating over time.

cratelyn and others added 6 commits March 25, 2020 14:29
by taking a `*mut Instance`, it is very easy to accidentally do some
UB-invoking operation such as trying to reassign and replace
`kill_state`. For operations that *are* safe on a shared ptr to
`Instance`, like updating an atomic or taking a mutex lock, we only need
a `*const` ptr to get a regular `&Instance` - trying to use a `&mut
Instance` in callback functions is likely a sign of impending UB.
yielding does not cause a kill state reset because the same guest
function may be resumed later on
Previously, we were leaking our guard on the execution domain, by
calling `mem::forget`. However, calling `mem::drop` here would cause a
segmentation fault. This uses `as_mut` so that our guard's lifetime can
be inferred correctly.

Co-Authored-By: iximeow <me@iximeow.net>
Co-Authored-By: Adam C. Foltzer <acfoltzer@fastly.com>
@cratelyn cratelyn changed the title [WIP] KillSwitch terminates pending instance KillSwitch terminates pending instance Mar 26, 2020
@cratelyn cratelyn requested a review from iximeow March 30, 2020 14:38
iximeow
iximeow previously approved these changes Mar 30, 2020
Copy link
Contributor

@iximeow iximeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸŽ‰ πŸŽ‰ looks good! I have one doc suggestion which you're welcome to ignore if you feel it's not helpful. This PR has been a real journey. 😎

lucet-runtime/lucet-runtime-internals/src/instance.rs Outdated Show resolved Hide resolved
Co-Authored-By: iximeow <git@iximeow.net>
@cratelyn
Copy link
Member

This PR has been a real journey. sunglasses

Yep!! Thanks for reviewing along the way ❀️

@cratelyn cratelyn merged commit e0da8c4 into bytecodealliance:master Mar 30, 2020
@iximeow iximeow mentioned this pull request Apr 1, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants