Basic Multiprocessing Parallelization of Matrix Multiplication

I want to parallelize the following simple expression on 2 GPUs: C = A^n + B^n, by computing A^n on GPU 0 and B^n on GPU 1 before summing the results.

In TensorFlow, I would like to:

with tf.device('/gpu:0'):
    An = matpow(A, n)
with tf.device('/gpu:1'):
    Bn = matpow(B, n)
with tf.Session() as sess:
    C = sess.run(An + Bn)

However, since PyTorch is dynamic, I am having trouble doing the same thing. I tried the following, but it takes longer.

with torch.cuda.device(0):
    A = A.cuda()
with torch.cuda.device(1):
    B = B.cuda()
C = matpow(A, n) + matpow(B, n).cuda(0)

I know that there is a module, torch.nn.DataParallel, for parallelizing a model over the batch dimension, but here I am trying to do something more basic.

1 answer

You can use CUDA streams. For example:

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

with torch.cuda.stream(s1):
    A = torch.pow(A,n)
with torch.cuda.stream(s2):
    B = torch.pow(B,n)

C = A+B

Note that this only parallelizes the computation on a single device, since both tensors live on the same GPU.
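For completeness, here is a self-contained sketch of the streams approach with explicit synchronization; the function name and the CPU fallback are my additions, not part of the original answer:

```python
import torch

def pow_sum_streams(A, B, n):
    """Elementwise A**n + B**n, overlapping the two power kernels on one GPU."""
    if not torch.cuda.is_available():
        # CPU fallback so the sketch runs anywhere; no streams involved
        return torch.pow(A, n) + torch.pow(B, n)
    A, B = A.cuda(), B.cuda()
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    with torch.cuda.stream(s1):
        An = torch.pow(A, n)
    with torch.cuda.stream(s2):
        Bn = torch.pow(B, n)
    # wait for both streams to finish before consuming the results
    torch.cuda.synchronize()
    return An + Bn
```

Without the synchronize call, reading the results on the default stream could race with the kernels still running in s1 and s2.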

If you want to use both GPUs, put each tensor on a different device:

A = A.cuda(0)
B = B.cuda(1)

Then, just before the addition, move B back to the first GPU with B = B.cuda(0) so that both operands are on the same device.
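Putting the two-GPU idea together, a minimal sketch (the helper name and the CPU/single-GPU fallbacks are mine, added for illustration):

```python
import torch

def pow_sum_two_gpus(A, B, n):
    """Compute A**n on GPU 0 and B**n on GPU 1, then sum on GPU 0."""
    n_gpus = torch.cuda.device_count()
    dev0 = torch.device('cuda:0') if n_gpus >= 1 else torch.device('cpu')
    dev1 = torch.device('cuda:1') if n_gpus >= 2 else dev0
    # CUDA kernel launches are asynchronous, so when dev0 and dev1 are
    # different GPUs these two powers can execute concurrently
    An = torch.pow(A.to(dev0), n)
    Bn = torch.pow(B.to(dev1), n)
    # move B**n to device 0 so both operands share a device for the sum
    return An + Bn.to(dev0)
```

With fewer than two GPUs this degrades gracefully to sequential execution on whatever device is available.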
