-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathdesign.tex
343 lines (298 loc) · 20.8 KB
/
design.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
\section{A New Design for Secure Virtualization Systems}
\label{sec.design}
%Providing essential system functionality without exposing privileged code is a
%critical challenge in the design of secure systems.
%Currently, there are two basic approaches.
%One, known as system call interposition (SCI), checks and passes system calls
%through to the underlying kernel. The other, which we call ``functionality
%re-creation," requires rebuilding system functionality with new code. In this
%section, we show that both existing methods are limited in their ability to
%prevent attacks in the kernel.
%Using our metric described in Section~\ref{sec.metric},
%we then propose a new design scheme named \lip, which accesses only popular
%code paths through a very small trusted computing base, and utilizes
%functionality re-creation within a secure sandboxed environment for complex implementations.
In the previous section we have shown that ``popular paths'' correlate in
a statistically significant manner with security.
Next, we want to demonstrate that our ``popular paths'' metric is useful in
practice for designing
secure virtualization systems. We first briefly discuss the limitations faced
by existing methods, due to the lack of a good
security metric. We then discuss our new design scheme named \lip, which follows
our metric by accessing only popular code paths.
\subsection{Previous Attempts and Their Limitations}
%\subsubsection{System Call Interposition (SCI)}
\noindent
\textbf{System call interposition (SCI).}
SCI systems~\cite{Janus0:96, Janus:99} filter system calls to mediate requests
from untrusted user code instead of allowing them to go directly to the kernel.
The filter checks a predefined security policy to decide which system calls are
allowed to pass to the underlying kernel, and which ones must be stopped.
%Figure \ref{fig:design_system_call_interposition} illustrates how
%system call interposition works. System administrators have direct access to a policy engine that sets and changes security policies.
%\begin{figure}%[h]
%\centering
%\includegraphics[width=1.0\columnwidth]{diagram/Virtualization_Design_Model_03.png}
%\caption{\small Schematic of how System Call Interposition functions.}
%\label{fig:design_system_call_interposition}
%\end{figure}
%SCI was once popular
%approach to the design of secure virtualization systems because
%because it gave developers the ability to set and enforce security policies.
%\lois{I think I asked this during the last revision. You say ``was"
%a popular approach. Is it not a popular approach anymore?}
%\yiwen{It is not a popular approach anymore, since no modern design could simply rely on
%this idea to build practical system.}
This design is limited by its overly complicated approach to policy
decisions and implementation.
To make a policy decision, the system needs to
obtain and interpret the OS state (e.g., permissions, user groups, register flags)
associated with the programs it is monitoring.
The complexity of OS states makes this process difficult and can lead to
inaccurate policy decisions.
%In addition, there are many indirect paths in the kernel that can be accessed.
%If security policy makers overlook those paths, it renders the
%policy ineffective, as attackers will be able to
%bypass security checks.
%Moreover, blocking
%certain system calls could affect necessary functionality.
%It is difficult for developers to fully understand the side-effects of all the
%system calls in an interface as complex as the UNIX API.
%For example, many applications that rely on \texttt{setuid} fail to check its return value.
%If \texttt{setuid} fails, these applications will continue to function in a compromised state,
%with incorrect permissions and privileges.
%The above limitations make it very challenging to design and build a secure virtualization system using
%system call interposition alone.
%More importantly, there is not a good security metric to indicate if a set of policy decisions is good.
%\subsubsection{Functionality Re-Creation}
\noindent
\textbf{Functionality re-creation.}
Systems such as Drawbridge \cite{Drawbridge-11},
Bascule \cite{Bascule}, and Graphene \cite{Graphene-14} can
provide richer functionality and run more complex programs than most systems built
with SCI alone because they have their own
interfaces and libraries. We label such a design
as ``functionality re-creation."
%\begin{figure}%[h]
%\centering
%\includegraphics[width=1.0\columnwidth]{diagram/Virtualization_Design_Model_02.png}
%\caption{\small Schematic of a Functionality Re-creation System.}
%\label{fig:design_functionality_reimplementation}
%\end{figure}
The key to this design is to not fully rely on the underlying
kernel for system functions, but to re-create its own system functionality.
%As illustrated in Figure \ref{fig:design_functionality_reimplementation},
When it has to access resources, like memory, CPU, and disk storage, the system
accesses the kernel directly with its underlying TCB code.
%which can access the kernel directly.
%For example, Graphene \cite{Graphene-14} re-creates its own Linux system calls
%in \texttt{libLinux.so}. When it needs to acquire resources from the kernel, it
%uses a Platform Adaptation Layer (PAL) with access to the kernel, and provides
%basic API functions to the OS library.
Functionality re-creation provides a more realistic solution to building
virtualization systems than earlier efforts.
However, functionality re-creation has two pitfalls:
first, if the re-created functionality resides in the TCB of the virtualization system, then vulnerabilities there can expose the host OS to attack as well.
For example, hundreds of vulnerabilities have been
reported in existing virtualization systems, such as QEMU and VMWare, over the
past ten years~\cite{NVD}.
%In addition, the
%complex semantics of OS functions can easily lead to the emergence of bugs during
%the re-creation process. Some of these vulnerabilities
%can directly lead to privilege escalation, which allows attackers to escape the sandbox
%and execute arbitrary code on the host OS.
%For example, a vulnerability in VMWare's codebase caused by buffer overflows in the VIX
%API allowed local users to
%gain privilege to execute arbitrary code in the host
%OS~\cite{CVE-2008-2100}.
Second, functionality re-creation may assume that the underlying host kernel is correct.
As we have seen, this assumption is often incorrect: host kernels may have bugs in their implementation that leave them vulnerable to attack.
Thus, to provide the greatest assurance that the host kernel will not be exposed to malicious user programs,
a secure functionality re-creation design should try to deliberately avoid kernel paths that are likely to contain flaws.
We discuss this approach in detail next.
\subsection{Lock-in-Pop: Staying on the Beaten Path}
\label{lock-in-pop}
%As discussed above, a weakness of the previous approaches is the inevitable contact
%between the privileged kernel code and an untrusted application.
Recall that we want to show that the ``popular paths'' metric can be used in practice.
We do so by devising a design
in which all code, including the complex part
of the operating system interface, accesses only
popular kernel paths through a small TCB. As it ``locks" all functionality
requests into only the ``popular paths,'' we call this design \lip.
\begin{figure}%[h]
\centering
\includegraphics[width=1.0\columnwidth]{diagram/Virtualization_Design_Model_01.png}
\caption{\small \lip design ensures safe execution of untrusted user code
despite existing potential zero-day bugs in the OS kernel.}
\label{fig:design_safe_reimplementation}
\end{figure}
At the lowest level of the design (interfacing with the host OS) is the
sandbox kernel (\ding{172} in Figure \ref{fig:design_safe_reimplementation}).
The sandbox kernel's main role is to ensure that only popular paths (\ding{173} in Figure \ref{fig:design_safe_reimplementation})
of the host OS's kernel can be accessed.
The sandbox kernel could thus function as a very granular system call filter, or
as the core of a programming language sandbox. Note that the functionality
provided by the sandbox kernel is (intentionally) much less than what
an application needs. For example, an application may store files in directories and set permissions on those files.
The sandbox kernel may provide a much simpler abstraction (e.g., a block storage abstraction),
so long as the strictly needed functionality (e.g., persistent storage) is provided.
Constructing the sandbox kernel is not dependent on any specific technique or programming language.
Instead, the sandbox kernel follows a central design principle to include only simple and necessary system calls with basic flags,
which can be checked to verify that only ``popular paths'' are used.
The sandbox kernel should start with building-block functions to first form a minimum set of system calls.
To give one example, for network programs, opening a TCP connection would be considered an essential function.
We can verify that the lines of kernel code that correspond
to opening TCP sockets, such as lines in \texttt{void tcp\_init\_sock(struct sock *sk)}, are included in the ``popular paths'' for that system, and so decide to include the \texttt{open\_tcp\_connection()} function
in the sandbox kernel.
Examples of other necessary functions are \texttt{file.open}, \texttt{file.close}, \texttt{file.read}, and \texttt{file.write} for filesystem
functions, and \texttt{create\_thread}, \texttt{create\_lock}, \texttt{lock.acquire}, and \texttt{lock.release} for threading functions.
In order to make security our priority, the designed sandbox kernel should only use a subset of the ``popular paths.''
For systems where security is not as critical, trade-offs can certainly be made to include some ``unpopular paths'' to accommodate applications.
Further discussion of this trade-off is beyond the scope of this paper,
though we acknowledge it is an issue that should be addressed as \lip is deployed.
While restricting the system call interface is a big hammer for limiting access to ``popular paths'' in the kernel,
we believe that this is the best choice available, given that we do not want to require modification to the kernel, and
would like to allow users to easily run their applications without much extra effort.
The application is provided more complex functionality via the SafePOSIX re-creation
(\ding{174} in Figure \ref{fig:design_safe_reimplementation}).
SafePOSIX has the needed complexity to build more convenient higher-level
abstractions using the basic functionality provided by the sandbox kernel.
The SafePOSIX re-creation is itself isolated within a library OS sandbox, which
forces all system calls to go through the sandbox kernel.
So long as this is performed, all calls from the SafePOSIX re-creation will only
touch the permitted (popular) kernel paths in the underlying host OS.
Similarly, untrusted user code (\ding{175} in Figure \ref{fig:design_safe_reimplementation})
also must be restricted in the way in which it performs system calls.
System calls must go through the SafePOSIX re-creation, into the sandbox kernel, and then to the host OS.
This is done because if user code could directly make system calls, it could
access any desired path in the host OS's kernel,
and thus exploit bugs within it.
Note that it is expected that bugs will occur in many components, including both the
non-popular (risky) kernel paths (\ding{176} in Figure \ref{fig:design_safe_reimplementation}),
and in the SafePOSIX re-creation. Even the user program will be buggy or
perhaps explicitly malicious (created by attackers).
Since the remaining components (\ding{172} and \ding{173} in Figure \ref{fig:design_safe_reimplementation})
are small and can be thoroughly tested, this leads to a lower risk of compromise.
\subsection{Implementation of Lock-in-Pop}
\label{implementation}
To test the practicality of the ``popular paths'' metric and our \lip design,
we implement a prototype virtual machine
called Lind.\footnote{\scriptsize Lind is an old English word for a lightweight,
but still strong shield constructed from two layers of linden wood.} The purpose of building the Lind
prototype is to demonstrate that our ``popular paths'' metric is practical, and
that developers can build secure systems using it.
Lind is divided into a \emph{computational module} that enforces software fault isolation (SFI) and a
\emph{SafePOSIX module} that safely re-creates the OS functionality needed by user
applications. We use a slightly modified version of Native Client
(NaCl)~\cite{NaCl-09} for the computational module; SafePOSIX is
implemented using Restricted Python (Repy)~\cite{Repy-10} and supports
complex user applications without exposing potentially risky kernel paths.
In this section we provide a brief description of these components and how they
were integrated into Lind, followed by an example of how the system works.
\subsubsection{Primary Components}
\paragraph{Native Client.}
We use NaCl to isolate the computation of the user application
from the kernel. NaCl allows Lind to work on most types of legacy code.
It compiles the programs to produce a binary with software fault isolation.
This prevents applications from performing system calls
or executing arbitrary instructions.
Instead, the application will call into a small, privileged
part of NaCl that forwards system calls. In NaCl's original implementation,
these calls would usually be forwarded to the host OS kernel. In Lind, we
modified NaCl to instead forward these calls to our SafePOSIX re-creation
(described in detail below).
\paragraph{Repy Sandbox.}
To build an API that can access the safe parts of the underlying kernel while
still supporting existing applications, we need two things. First, we need a
restricted sandbox kernel that only allows access to popular kernel paths. We
used Seattle's Repy~\cite{Repy-10} sandbox to perform this task. Second, we
have to provide complex system functions to user programs.
For this task we created SafePOSIX, which implements the widely accepted standard
POSIX interface on top of Repy.
Because the sandbox kernel is the only code that will be in direct contact with host
system calls, it should be small (to make it easy to audit), while providing
primitives that can be used to build more complex functionality.
We used Seattle's Repy system API due to its tiny (around 8K LOC) sandbox
kernel and its minimal set of system call APIs needed to build general
computational functionality. Repy allows access only to the popular portions of
the OS kernel through 33 basic API functions, including 13 network functions, 6
file functions, 6 threading functions, and 8 miscellaneous functions (Table
\ref{table:RepyKernel})~\cite{Repy-10, RepyKernel}.
Repy is only one possible implementation of the sandbox kernel built for our \lip design.
It was chosen because it starts with basic building-block functions and tries to be conservative in what underlying kernel functionality it uses.
Repy was designed and implemented before our ``popular paths'' study, and so it was not a perfect match,
but it we experimentally verified that it uses a subset of the ``popular paths.''
As reported in our evaluation (Section~{{\ref{Reachable-Kernel-Trace-Analysis-for-Repy-Sandbox}}),
Repy accessed a subset (around 70\% to 80\%) of the ``popular paths.''
Our current implementation does not end up using all of the ``popular paths.''
It is certainly safe to use fewer paths than are available, but it is possible
that we are missing out on some performance or compatibility gains.
As we extend our prototype, the ``popular path'' metric will allow us to check
whether new APIs we add
expose potentially unsafe kernel code to applications in the sandbox.
\begin{table}
\centering
\begin{tabular}{ | p{2.5cm} | p{4.5cm} |}
\hline
\textbf{Repy Function} & \textbf{Available System Calls} \\ \hline
Networking & \emph{gethostbyname, openconnection, getmyip, socket.send, socket.receive, socket.close,
listenforconnection, tcpserversocket.getconnection, tcpserversocket.close, sendmessage, listenformessage,
udpserversocket.getmessage, and udpserversocket.close.} \\ \hline
File System I/O Operations & \emph{openfile(filename, create), file.close(), file.readat(size limit, offset), file.writeat(data, offset),
listfiles(), and removefile(filename).} \\ \hline
Threading & \emph{createlock, sleep, lock.acquire, lock.release, createthread, and getthreadname.} \\ \hline
Miscellaneous Functions & \emph{getruntime, randombytes, log, exitall, createvirtualnamespace,
virtualnamespace.evaluate, getresources, and getlasterror.} \\ \hline
\end{tabular}
\caption{Repy sandbox kernel functions that support Lind's SafePOSIX re-creation.}
\label{table:RepyKernel}
\end{table}
\subsubsection{Enhanced Safety in Call Handling with SafePOSIX Re-creation}
The full kernel interface is extremely rich and hard to protect.
The \lip design used to build Lind provides enhances
safety protection through both isolation and a POSIX interface (SafePOSIX). The
latter re-creates risky system calls to
provide full-featured API for legacy applications, with minimal impact on the kernel.
In Lind, a system call issued from user code is
received by NaCl, and then redirected to SafePOSIX.
To service a system call in NaCl, a server routine in
Lind marshals its arguments into a text string, and sends the call and the arguments
to SafePOSIX. The SafePOSIX re-creation services the system call request, marshals the result, and
returns it back to NaCl. Eventually, the result is returned as the appropriate
native type to the calling program.
SafePOSIX is safe because of two design principles.
First, its re-creation only relies on a small set of basic Repy functions (Table \ref{table:RepyKernel}).
Therefore, the interaction with the host OS kernel is strictly controlled.
Second, the SafePOSIX re-creation is run within the Repy programming language sandbox,
which properly isolates any bugs inside SafePOSIX itself.
%We now offer a more detailed example of how SafePOSIX works by reviewing how it re-creates a file system.
%The core of the SafePOSIX file system is the \texttt{open}, \texttt{close}, \texttt{read}, \texttt{write},
%\texttt{getdents}, \texttt{stat}, \texttt{mkdir} and \texttt{rmdir} system calls.
%These give the program the illusion of a normal file system even though Repy does not allow directories or access to file attributes.
%When Lind starts, the file system does some pre-initialization. Using the Repy API, the SafePOSIX file system reads a file named ``lind.metadata''
%from the local directory. This file contains packed metadata from previous runs of Lind,
%and is loaded into the runtime SafePOSIX file system data structures.
%There are three main data structures: a list of open file handles, a Python dict of inodes and file metadata,
%and a mapping table to go from a file name and path to an inode number.
%All these data structures are stored in memory, and written to disk when they are changed.
%The \texttt{open} system call is the normal starting point for most file system operations.
%Given a path, it will return a file descriptor to perform other operations like \texttt{read} and \texttt{write}.
%When SafePOSIX receives the \texttt{open} system call, it parses the path, and
%traverses the path in the inode lookup table.
%When SafePOSIX finds the file, it uses the Repy \texttt{openfile} call to get the backing file's object.
%It then picks a free entry from the file handle table, and stores a link to the inode and the file object.
%If the \texttt{create} flag is passed, it adds an entry to the inode and inode lookup table, and creates a new backing file.
%The backing files are not named the same as the actual files, but rather just ``linddata.001,'' ``linddata.002,'' etc.
%The simple names for the backing files allow us to store the real file name in the metadata, a
%necessary step because of Repy's strict rules about the content of filenames.
%Finally, the call returns the index into the file handle table or, if an error was encountered, an error number to which the Unix \texttt{errno} value is set.
%Here is another example of how SafePOSIX re-creation would work with the symbolic link function.
%Instead of relying on the underlying kernel to create symbolic links between real files
%in the host file system, SafePOSIX builds and maintains its own metadata to represent a virtual symbolic link
%between files within its system. In this case, if there is a bug in this symbolic link function,
%such as creation of a link with a deleted file, the bug will be contained within the SafePOSIX re-creation.
%As a result, instead of creating a security issue, the application is denied privileged access to the host OS kernel.
%Therefore, attackers will not be able to leverage a bug within the symbolic link function to exploit the host kernel.
%As described in the above example, the SafePOSIX re-creation only uses a few Repy sandbox kernel functions to access the hardware.
%It creates and maintains its own metadata and data structures, using the Repy programming language sandbox.