Using Multiple GPUs
This page describes how to use Python's multiprocessing module to drive multiple GPUs, by spawning one child process per GPU from a common parent. The example provided uses two GPUs to fit logistic regression models, and demonstrates how to pack up and communicate both common and private arguments from the parent to the children.
The key is to initialize theano with the CPU as the target device, and then to re-bind the assigned device to a specific GPU later, inside each child process. According to a discussion on theano-users, this re-binding of devices can occur only once per process. So this example would (probably) not work with threads, nor could it be used to switch GPUs within a running process.
Here is a generic .theanorc file, similar to the file I use:
[global]
floatX = float32
device = cpu
openmp = True
base_compiledir = /path/to/base/dir
[nvcc]
fastmath = True
[blas]
ldflags = -L/path/to/blas/libs -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lm
[cuda]
root = /path/to/cuda/
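Before launching any sub-processes, it can be worth confirming that this configuration is actually being picked up, and in particular that the device really is cpu in the parent. The check below is just a sanity test run from a plain Python session; it is not part of the original setup.
import theano                   # reads ~/.theanorc (and THEANO_FLAGS) on import
print theano.config.device     # expect 'cpu'
print theano.config.floatX     # expect 'float32'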
At a high level, the procedure looks like:
- Set up your arguments.
- Launch each of your sub-processes.
- In the child process, import theano.sandbox.cuda and bind theano to a specific GPU.
In the example below, f is the function containing the work which will be carried out by the child process, shared_args provides shared arguments from the parent, and private_args holds the name of the gpu device to use for this child.
def f(shared_args, private_args):
    # At this point, no theano import statements have been processed,
    # so the device is still unbound. Import sandbox.cuda to bind the
    # specified GPU to this subprocess, then import the remaining theano
    # and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams
    ...
That's it! After calling theano.sandbox.cuda.use(private_args['gpu']), proceed as you normally would in any theano script.
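As an aside, the same per-child device selection can also be done by setting the THEANO_FLAGS environment variable in the child before its first import of theano. The sketch below is a variant of my own, not something the rest of this page relies on; it reuses the same private_args['gpu'] entry as above.
import os

def f(shared_args, private_args):
    # Select the device via THEANO_FLAGS *before* theano is first imported
    # in this child process; the setting has no effect once theano is loaded.
    os.environ['THEANO_FLAGS'] = 'device=' + private_args['gpu']
    import theano
    import theano.tensor as T
    ...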
The example below does not cover use cases that require communication between sub-processes, nor does it perform any post-processing on the output of each sub-process. If you do need inter-process communication, the Manager (declared in the parent, see below) provides a safe and easy way to do it.
""" Test script that uses two GPUs, one per sub-process,
via the Python multiprocessing module. Each GPU fits a logistic regression model. """
# These imports will not trigger any theano GPU binding
from multiprocessing import Process, Manager
import numpy as np
import os
def f(shared_args, private_args):
    """ Build and fit a logistic regression model. Adapted from
    http://deeplearning.net/software/theano/tutorial/examples.html#a-real-example-logistic-regression
    """
    # Import sandbox.cuda to bind the specified GPU to this subprocess,
    # then import the remaining theano and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams

    rng = np.random

    # Pull the problem sizes from the shared arguments.
    shared_args_dict = shared_args[0]
    N = shared_args_dict['N']
    feats = shared_args_dict['n_features']
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = shared_args_dict['n_steps']

    # Declare Theano symbolic variables.
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct the Theano expression graph.
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))             # Probability that target = 1
    prediction = p_1 > 0.5                              # The prediction, thresholded
    xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)   # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w ** 2).sum()          # The cost to minimize
    gw, gb = T.grad(cost, [w, b])                       # Gradient of the cost

    # Compile. allow_input_downcast reassures the compiler that we are ok using
    # 64-bit floating point numbers on the cpu, but only 32-bit floats on the gpu.
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
        allow_input_downcast=True)
    predict = theano.function(inputs=[x], outputs=prediction,
                              allow_input_downcast=True)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
if __name__ == '__main__':
    # Construct a dict to hold arguments that can be shared by both processes.
    # The Manager class is a convenient way to implement this.
    # See: http://docs.python.org/2/library/multiprocessing.html#managers
    #
    # Important: managers store information in mutable *proxy* data structures,
    # but any mutation of those proxy vars must be explicitly written back to
    # the manager.
    manager = Manager()
    args = manager.list()
    args.append({})
    shared_args = args[0]
    shared_args['N'] = 400
    shared_args['n_features'] = 784
    shared_args['n_steps'] = 10000
    args[0] = shared_args

    # Construct the specific args for each of the two processes.
    p_args = {}
    q_args = {}
    p_args['gpu'] = 'gpu0'
    q_args['gpu'] = 'gpu1'

    # Run both sub-processes.
    p = Process(target=f, args=(args, p_args,))
    q = Process(target=f, args=(args, q_args,))
    p.start()
    q.start()
    p.join()
    q.join()
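If you do want the parent to collect each child's output, the same Manager can hold a results container. The sketch below only illustrates that write-back pattern; the results dict, its keys, and the choice to store the final weights are mine and not part of the script above.

# --- hypothetical additions to the parent, before p and q are constructed ---
results = manager.dict()              # shared container the children can report into
p_args['results'] = results
q_args['results'] = results

# --- hypothetical addition to the end of f(), after the training loop ---
#     private_args['results'][private_args['gpu']] = w.get_value()

# --- hypothetical additions to the parent, after p.join() and q.join() ---
print results['gpu0']                 # final weights fitted on gpu0
print results['gpu1']                 # final weights fitted on gpu1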
You can also use a Queue to transfer data between processes. One user created batches of data on the CPU and processed them on the GPU. A sketch of this approach is:
import multiprocessing, time

def batchBuilderThread(q):
    while True:                   # Create batches of data to be processed,
        time.sleep(10)            # assume this step is slow.
        batch = [1, 2]
        q.put(batch)              # Put batches into a queue.

batchQueue = multiprocessing.Queue(20)      # Bounded queue avoids running out of RAM.
threads = [multiprocessing.Process(
               target=batchBuilderThread,
               args=(batchQueue,))
           for i in xrange(multiprocessing.cpu_count())]
for thread in threads:
    thread.start()                # Create batches in parallel.

while True:
    batch = batchQueue.get()
    print batch[0] + batch[1]     # Process batches one at a time.
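The sketch above loops forever and never shuts the workers down. One common way to stop them cleanly (my own addition, not part of the original post) is a shared multiprocessing.Event that the parent sets once it has consumed enough batches; the put timeout keeps a full queue from blocking shutdown.

import multiprocessing, time, Queue

def batchBuilderProcess(q, stop_event):
    while not stop_event.is_set():   # Build batches until the parent says stop.
        time.sleep(10)               # Stand-in for slow batch construction.
        batch = [1, 2]
        try:
            q.put(batch, timeout=1)  # Time out so a full queue cannot block shutdown.
        except Queue.Full:
            pass                     # Drop the batch and re-check the stop flag.

if __name__ == '__main__':
    batchQueue = multiprocessing.Queue(20)
    stop_event = multiprocessing.Event()
    workers = [multiprocessing.Process(target=batchBuilderProcess,
                                       args=(batchQueue, stop_event))
               for i in xrange(multiprocessing.cpu_count())]
    for worker in workers:
        worker.start()

    for step in xrange(100):         # Process a fixed number of batches.
        batch = batchQueue.get()
        print batch[0] + batch[1]

    stop_event.set()                 # Ask the workers to stop producing.
    for worker in workers:
        worker.join()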