Specify interoperability with fork and subprocess creation #474

Open
wants to merge 4 commits into
base: main
6 changes: 5 additions & 1 deletion content/backmatter.tex
@@ -668,6 +668,10 @@ \section{Version 1.6}
\openshmem[1.5] Table 10, and clarified the types, names, and supporting
operations for team-based reductions.
\ChangelogRef{teamreducetypes}%
%
\item Clarified the interaction of the \openshmem library with
subprocess creation.%
\ChangelogRef{subsec:subprocess}%
\end{itemize}

\section{Version 1.5}
@@ -742,7 +746,7 @@ \section{Version 1.5}
subsec:shmem_reductions}%
%
\item Clarified interoperability of \openshmem with other programming models.
\ChangelogRef{sec:interoperability}%
\ChangelogRef{sec:interoperability, subsec:mpi}%
%
\item Clarified restrictions on using pointers to symmetric objects.
\ChangelogRef{
42 changes: 41 additions & 1 deletion content/interoperability.tex
@@ -7,8 +7,48 @@ \chapter{Interoperability with Other Programming Models}\label{sec:interoperabil
and advice to \openshmem library users and developers that may improve the portability
and performance of hybrid programs.

\section{Subprocesses}\label{subsec:subprocess}

\section{MPI Interoperability}
In some cases, an \openshmem application may be used to create or
orchestrate other processes, which can be created through a number of
system-level interfaces. In these instances, such subprocesses are
subject to the following interoperability constraints.

On platforms that provide the referenced POSIX\footnotemark[1] \acp{API}:

\footnotetext[1]{POSIX, the Portable Operating System Interface, is
formally specified in IEEE Std 1003.1-2017 and The Open Group
Technical Standard Base Specifications, Issue 7.}

\begin{itemize}
\item When \FUNC{fork} is invoked before the \openshmem library is
initialized, only one of either the parent or child processes may
initialize the \openshmem library.
\item When \FUNC{fork} is invoked within the \openshmem portion of the
program or after the \openshmem library has been finalized, the
newly created child process shall not call any \openshmem routines;
otherwise, the behavior is undefined.
Collaborator

These two cases above are still problematic. The issue is that fork marks pages shared by the parent and child processes as copy-on-write. When either process writes to such a page, a copy of that page is mapped into that process's address space, and this breaks memory registration because it can change the parent's mapping for memory registered with the NIC. The second bullet above is solved by ibv_fork_init, which marks any memory registered to the NIC with MADV_DONTFORK. This prevents the page from being remapped to a copy in the parent's address space, but it also changes the behavior of fork, since some pages don't propagate to the child process. In new kernels, Linux copies the pages immediately, which solves the memory registration problem without breaking fork, but at the expense of copy space/time overhead (especially if the application does fork+exec). I'm not exactly sure what will happen in the case where fork is called before ibv_fork_init on older Linux kernels. Hopefully the NIC driver will see that the page is marked copy-on-write during registration and duplicate it in the parent's address space.

To summarize, any support for fork is problematic. I know the bullet below says that we aren't required to support it, but the two bullets above describe two usage models as things that ought to work. Fork bugs are very hard to diagnose; here's an interesting read of one user's journey discovering that fork was broken: https://blog.nelhage.com/post/a-cursed-bug/. I would suggest that if we want to keep the above bullets, we also need to give users some way to ask the SHMEM library whether or not the above usage models are supported.
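
As a rough illustration of the MADV_DONTFORK side effect described above, the following Linux-specific sketch marks one anonymous page with madvise(MADV_DONTFORK), forks, and lets the child read that page. The helper name `child_faults_on_dontfork_page` is hypothetical, and nothing here is taken from a SHMEM or verbs implementation; it only shows that such a page is absent from the child's address space, assuming a Linux kernel that supports MADV_DONTFORK.

```c
#define _GNU_SOURCE            /* for MAP_ANONYMOUS and MADV_DONTFORK */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any library).
 * Returns 1 if the child was killed by SIGSEGV when reading the page,
 * 0 if the child read it successfully, and -1 on any setup error.
 */
int child_faults_on_dontfork_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 42;               /* make sure the page is populated */

    /* Exclude this range from the address space of future children. */
    if (madvise(buf, (size_t)page, MADV_DONTFORK) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: restore the default SIGSEGV action, then read the page. */
        signal(SIGSEGV, SIG_DFL);
        volatile char c = *(volatile char *)buf;
        (void)c;
        _exit(0);              /* reached only if the page was mapped */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    int status = 0;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);
    } while (w < 0 && errno == EINTR);
    munmap(buf, (size_t)page);
    if (w < 0)
        return -1;

    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) ? 1 : 0;
}
```

The parent's mapping is untouched, which is the property that keeps NIC registrations valid; the cost is that a child expecting a full copy of the parent's memory no longer gets one.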

Contributor Author

That post is a great write-up, and it touches on a behavioral aspect of CoW that I didn't know:

Following a fork, [...] the first process to attempt to write to a copy-on-write (“CoW”) page will make a copy, leaving the original page (still visible to one or more other processes) intact.

This (potentially) interacts poorly with RDMA; If a process using RDMA forks, any future attempts to write to pages used for RDMA would result in the parent copying the page, which would mean it would no longer see any remote writes to that page by RDMA peers.

I had always thought the child always gets the copy. It sounds like the behavior in Linux 5.12 is (or can be) much better.

Contributor Author

I'm inclined to, for now, limit subprocess creation to posix_spawn. We could then revisit documenting fork compatibility with a supplemental API (e.g., shmem_is_forksafe) later.

I think the documentation rationale for posix_spawn is sound. I didn't know posix_spawn existed until I found that I couldn't fork-exec in various SHMEM implementations. I imagine many Unix-y programmers are more used to fork-exec than posix_spawn.

Still, what's the concern with use of fork before shmem_init[_thread]? I'd think that, in general, we're still in "vanilla Linux memory land" during that time.
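
For readers more used to fork-exec, the posix_spawn route preferred above might look roughly like the sketch below. The helper name `run_subprocess` is hypothetical and appears in neither the PR nor any SHMEM implementation. The sketch omits the OpenSHMEM calls so that it stands alone; in a real program it would presumably be called between shmem_init and shmem_finalize.

```c
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;         /* environment passed on to the child */

/*
 * Hypothetical helper: launch argv[0] (looked up via PATH) with
 * posix_spawnp, wait for it, and return its exit status.
 * Returns -1 if the spawn or wait fails, or if the child did not
 * exit normally.
 */
int run_subprocess(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid;
    /* No file actions or spawn attributes: the child inherits defaults. */
    int rc = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;             /* rc is an errno-style error number */

    /* Wait for the child, retrying if interrupted by a signal. */
    int status = 0;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);
    } while (w < 0 && errno == EINTR);
    if (w < 0)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the caller never runs application code in a forked copy of its own address space, the library does not have to reason about what a child might do with inherited symmetric memory before exec.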

Collaborator

The concern I had about fork before init is what NIC drivers will do when you try to register a page that's marked CoW. I suppose this is a common case, e.g. if a new allocation is registered before it's initialized (all pages mapped to zero page), so perhaps nothing to worry about.

Contributor

I have been digging around this trying to better understand the problem, and a few things have changed since this thread was initiated. It is not clear to me that posix_spawn should be better than fork() or vfork(). I believe both of those functions can be called depending on the setting of a few POSIX system environment variables. I would like to see this interface defined, but I know this is not a trivial task. Have any of your (@jdinan) opinions on the matter changed with updates to the kernel and networking middle-layer options?

The main ambition is having a defined interface within a SHMEM program that is able to spawn off a subprocess safely. The process should have limitations on what it can do (e.g. no shmem routines).

Collaborator

The only change that comes to mind is Linux 5.12 copying the parent process pages into the child process at the time of the fork rather than marking them as CoW. Is having a copy of the parent process' data a usage model that's important for your apps? If not, we leave more freedom to implementations by not needing to support such a model.

\item Not all \openshmem implementations may support the use of
\FUNC{fork} within the \openshmem portion of the program or after
the \openshmem library has been finalized.
When subprocess creation is needed in these instances, the
application is encouraged to make use of the \FUNC{posix\_spawn} and
\FUNC{posix\_spawnp} \acp{API}.
\end{itemize}

\parimpnotes{
All \openshmem implementations should ensure interoperability with
the \FUNC{posix\_spawn} and \FUNC{posix\_spawnp} \acp{API}.

\openshmem implementations that support implicit library
finalization for compatibility with \FUNC{start\_pes} should ensure
that child processes created after library initialization do not
implicitly call \openshmem operations as part of exit handlers
invoked during normal process termination.
}


\section{MPI Interoperability}\label{subsec:mpi}

\openshmem and \ac{MPI} are two commonly used parallel programming models for
distributed-memory systems. The user can choose to utilize both models in the same program