-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify interoperability with fork and subprocess creation #474
base: main
Are you sure you want to change the base?
Conversation
In my testing, an application that holds to the following clause works as expected for both Cray SHMEM on Aries and OSHMEM/UCX on InfiniBand:
|
Yes, makes sense. |
No, the second clause says (emphasis added):
So, no; the child process could not invoke SHMEM operations. |
Thanks @nspark. The text mentioned system level interfaces, so I assume, user level threads / processes do not fall under this restriction. After |
@nspark I realize we have deprecated |
@jdinan I wonder whether |
@jdinan Re: implicit finalization, what about adding something like the following under "Notes to implementers"?
|
@nspark This is a useful note to implementors. The |
@jdinan Just to clarify, by "remove these functions," you mean totally gone from the API and not just deprecated, correct? |
Co-authored-by: Bryant C. Lam <BryantLam@users.noreply.github.com>
Note that this would require OpenSHMEM libraries using IB Verbs to call |
\item When \FUNC{fork} is invoked within the \openshmem portion of the | ||
program or after the \openshmem library has been finalized, the | ||
newly created child process shall not call any \openshmem routines; | ||
otherwise, the behavior is undefined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two cases above are still problematic. The issue is that fork marks pages shared by the parent and child processes as copy on write. When either process touches the page, it causes a copy of that page to be mapped into that process' address space and this breaks memory registration because it can change the parent's mapping for memory registered with the NIC. The second bullet above is solved by ibv_fork_init
, which marks any memory registered to the NIC with MADV_DONTFORK
. This prevents the page being remapped to a copy in the parent's address space, but it also changes the behavior of fork since some pages don't propagate to the child process. In new kernels, Linux copies the pages immediately, which solves the memory registration problem without breaking fork, but at the expense of copy space/time overhead (especially if the application does fork+exec). I'm not exactly sure what will happen in the case where fork
is called before ibv_fork_init
on older Linux kernels. Hopefully the NIC driver will see that the page is marked copy on write during registration and duplicate it in the parent's address space.
To summarize, any support for fork is problematic. I know the bullet below says that we aren't required to support it, but the two bullets above describe two usage models as things that ought to work. Fork bugs are very hard to diagnose; here's an interesting read of one user's journey discovering that fork was broken: https://blog.nelhage.com/post/a-cursed-bug/. I would suggest that if we want to keep the above bullets, we also need to give users some way to ask the SHMEM library whether or not the above usage models are supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That post is a great write-up, and it touches on a behavioral aspect of CoW that I didn't know:
Following a fork, [...] the first process to attempt to write to a copy-on-write (“CoW”) page will make a copy, leaving the original page (still visible to one or more other processes) intact.
This (potentially) interacts poorly with RDMA; If a process using RDMA forks, any future attempts to write to pages used for RDMA would result in the parent copying the page, which would mean it would no longer see any remote writes to that page by RDMA peers.
I had always thought the child always gets the copy. It sounds like the behavior in Linux 5.12 is (or can be) much better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to, for now, limit subprocess creation to posix_spawn
. We could then revisit documenting fork
compatibility with a supplemental API (e.g., shmem_is_forksafe
) later.
I think the documentation rationale for posix_spawn
is sound. I didn't know posix_spawn
existed until I found that I couldn't fork-exec
in various SHMEM implementations. I imagine many Unix-y programmers are more used to fork-exec
than posix_spawn
.
Still, what's the concern with use of fork
before shmem_init[_thread]
? I'd generally think that, in general, we're still in "vanilla Linux memory land" during that time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern I had about fork before init is what NIC drivers will do when you try to register a page that's marked CoW. I suppose this is a common case, e.g. if a new allocation is registered before it's initialized (all pages mapped to zero page), so perhaps nothing to worry about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been digging around this trying to better understand the problem and a few things have changed since this thread was initiated. It is not clear to me that posix_spawn
should better than fork()
or vfork()
. I believe both of those functions can be called depending on the setting of a few POSIX system environment variables. I would like to see this interface defined, but I know this is not a trivial task. Have any of your (@jdinan) opinions on the matter changed with updates to the kernel and networking middle layers options.
The main ambition is having a defined interface within a SHMEM program that is able to spawn off a subprocess safely. The process should have limitations on what it can do (e.g. no shmem routines).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change that comes to mind is Linux 5.12 copying the parent process pages into the child process at the time of the fork rather than marking them as CoW. Is having a copy of the parent process' data a usage model that's important for your apps? If not, we leave more freedom to implementations by not needing to support such a model.
This PR specifies interoperability with
fork
and subprocess creation viaposix_spawn
.Feedback is very much requested.
Closes #223