Using Multiple GPUs
This page describes how to use Python's multiprocessing module to drive multiple GPUs, by spawning one child process per GPU from a common parent. The example provided uses two GPUs to fit logistic regression models, and demonstrates how to pack up and communicate both common and private arguments from the parent to the children.
The key is to initialize theano with the CPU as the target device, and then to re-bind the assigned device to a specific GPU later, inside each child process. According to a discussion on theano-users, this re-binding of devices can occur only once per process. So this example would (probably) not work with threads, nor could it be used to switch GPUs within a running process.
Here is a generic .theanorc file, similar to the file I use:
[global]
floatX = float32
device = cpu
openmp = True
base_compiledir = /path/to/base/dir
[nvcc]
fastmath = True
[blas]
ldflags = -L/path/to/blas/libs -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -lm
[cuda]
root = /path/to/cuda/
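Before launching any sub-processes, it can be worth confirming that this configuration is actually being picked up, and in particular that the device really is cpu in the parent. The check below is just a sanity test run from a plain Python session; it is not part of the original setup.
import theano                   # reads ~/.theanorc (and THEANO_FLAGS) on import
print theano.config.device     # expect 'cpu'
print theano.config.floatX     # expect 'float32'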
At a high level, the procedure looks like:
- Set up your arguments.
- Launch each of your sub-processes.
- In the child process, import theano.sandbox.cuda and bind theano to a specific GPU.
In the example below, f is the function containing the work which will be carried out by the child process, shared_args provides shared arguments from the parent, and private_args holds the name of the gpu device to use for this child.
def f(shared_args, private_args):
    # At this point, no theano import statements have been processed,
    # so the device is still unbound. Import sandbox.cuda to bind the
    # specified GPU to this subprocess, then import the remaining theano
    # and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams
    ...
That's it! After calling theano.sandbox.cuda.use(private_args['gpu']), proceed as you normally would in any theano script.
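As an aside, the same per-child device selection can also be done by setting the THEANO_FLAGS environment variable in the child before its first import of theano. The sketch below is a variant of my own, not something the rest of this page relies on; it reuses the same private_args['gpu'] entry as above.
import os

def f(shared_args, private_args):
    # Select the device via THEANO_FLAGS *before* theano is first imported
    # in this child process; the setting has no effect once theano is loaded.
    os.environ['THEANO_FLAGS'] = 'device=' + private_args['gpu']
    import theano
    import theano.tensor as T
    ...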
The example below does not cover use cases that require communication between sub-processes, nor does it perform any post-processing on the output of each sub-process. If you do need inter-process communication, the Manager (declared in the parent, see below) provides a safe and easy way to do it.
""" Test script that uses two GPUs, one per sub-process,
via the Python multiprocessing module. Each GPU fits a logistic regression model. """
# These imports will not trigger any theano GPU binding
from multiprocessing import Process, Manager
import numpy as np
import os
def f(shared_args, private_args):
    """ Build and fit a logistic regression model. Adapted from
    http://deeplearning.net/software/theano/tutorial/examples.html#a-real-example-logistic-regression
    """
    # Import sandbox.cuda to bind the specified GPU to this subprocess,
    # then import the remaining theano and model modules.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as T
    from theano.tensor.shared_randomstreams import RandomStreams

    rng = np.random

    # Pull the problem sizes from the shared arguments.
    shared_args_dict = shared_args[0]
    N = shared_args_dict['N']
    feats = shared_args_dict['n_features']
    D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
    training_steps = shared_args_dict['n_steps']

    # Declare Theano symbolic variables.
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats), name="w")
    b = theano.shared(0., name="b")
    print "Initial model:"
    print w.get_value(), b.get_value()

    # Construct the Theano expression graph.
    p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))             # Probability that target = 1
    prediction = p_1 > 0.5                              # The prediction, thresholded
    xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)   # Cross-entropy loss function
    cost = xent.mean() + 0.01 * (w ** 2).sum()          # The cost to minimize
    gw, gb = T.grad(cost, [w, b])                       # Gradient of the cost

    # Compile. allow_input_downcast reassures the compiler that we are ok using
    # 64-bit floating point numbers on the cpu, but only 32-bit floats on the gpu.
    train = theano.function(
        inputs=[x, y],
        outputs=[prediction, xent],
        updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)),
        allow_input_downcast=True)
    predict = theano.function(inputs=[x], outputs=prediction,
                              allow_input_downcast=True)

    # Train
    for i in range(training_steps):
        pred, err = train(D[0], D[1])

    print "Final model:"
    print w.get_value(), b.get_value()
    print "target values for D:", D[1]
    print "prediction on D:", predict(D[0])
if __name__ == '__main__':
    # Construct a dict to hold arguments that can be shared by both processes.
    # The Manager class is a convenient way to implement this.
    # See: http://docs.python.org/2/library/multiprocessing.html#managers
    #
    # Important: managers store information in mutable *proxy* data structures,
    # but any mutation of those proxy vars must be explicitly written back to
    # the manager.
    manager = Manager()
    args = manager.list()
    args.append({})
    shared_args = args[0]
    shared_args['N'] = 400
    shared_args['n_features'] = 784
    shared_args['n_steps'] = 10000
    args[0] = shared_args

    # Construct the specific args for each of the two processes.
    p_args = {}
    q_args = {}
    p_args['gpu'] = 'gpu0'
    q_args['gpu'] = 'gpu1'

    # Run both sub-processes.
    p = Process(target=f, args=(args, p_args,))
    q = Process(target=f, args=(args, q_args,))
    p.start()
    q.start()
    p.join()
    q.join()
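If you do want the parent to collect each child's output, the same Manager can hold a results container. The sketch below only illustrates that write-back pattern; the results dict, its keys, and the choice to store the final weights are mine and not part of the script above.

# --- hypothetical additions to the parent, before p and q are constructed ---
results = manager.dict()              # shared container the children can report into
p_args['results'] = results
q_args['results'] = results

# --- hypothetical addition to the end of f(), after the training loop ---
#     private_args['results'][private_args['gpu']] = w.get_value()

# --- hypothetical additions to the parent, after p.join() and q.join() ---
print results['gpu0']                 # final weights fitted on gpu0
print results['gpu1']                 # final weights fitted on gpu1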
You can also use a Queue to transfer data between processes. One user created batches of data on the CPU and processed them on the GPU. A sketch of this approach is:
import multiprocessing, time

def batchBuilderThread(q):
    while True:                   # Create batches of data to be processed,
        time.sleep(10)            # assume this step is slow.
        batch = [1, 2]
        q.put(batch)              # Put batches into a queue.

batchQueue = multiprocessing.Queue(20)      # Bounded queue avoids running out of RAM.
threads = [multiprocessing.Process(
               target=batchBuilderThread,
               args=(batchQueue,))
           for i in xrange(multiprocessing.cpu_count())]
for thread in threads:
    thread.start()                # Create batches in parallel.

while True:
    batch = batchQueue.get()
    print batch[0] + batch[1]     # Process batches one at a time.
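The sketch above loops forever and never shuts the workers down. One common way to stop them cleanly (my own addition, not part of the original post) is a shared multiprocessing.Event that the parent sets once it has consumed enough batches; the put timeout keeps a full queue from blocking shutdown.

import multiprocessing, time, Queue

def batchBuilderProcess(q, stop_event):
    while not stop_event.is_set():   # Build batches until the parent says stop.
        time.sleep(10)               # Stand-in for slow batch construction.
        batch = [1, 2]
        try:
            q.put(batch, timeout=1)  # Time out so a full queue cannot block shutdown.
        except Queue.Full:
            pass                     # Drop the batch and re-check the stop flag.

if __name__ == '__main__':
    batchQueue = multiprocessing.Queue(20)
    stop_event = multiprocessing.Event()
    workers = [multiprocessing.Process(target=batchBuilderProcess,
                                       args=(batchQueue, stop_event))
               for i in xrange(multiprocessing.cpu_count())]
    for worker in workers:
        worker.start()

    for step in xrange(100):         # Process a fixed number of batches.
        batch = batchQueue.get()
        print batch[0] + batch[1]

    stop_event.set()                 # Ask the workers to stop producing.
    for worker in workers:
        worker.join()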