CUDA Shared Memory Bank Conflicts: How Is Memory Tied to Banks?

As far as I understand, shared memory is split into banks, and accesses by multiple threads to the same bank cause a conflict (or a broadcast, if they all read the same word).

At the moment, I am allocating a fairly large array, which conceptually represents several pairs of two matrices:

__shared__ float A[34*N];

where N is the number of pairs; the first 16 floats of each pair are one matrix and the next 18 floats are the second.

The thing is, access to the first matrix is conflict free, but access to the second one has conflicts. Those conflicts are unavoidable; however, my thinking is that because the second matrix is 18 floats long, all subsequent matrices will be shifted relative to the banks, and therefore more conflicts than necessary will occur.

Is this true, and if so, how can I avoid it?

Every time I allocate shared memory, does it start at a new bank? If so, I could potentially do

__shared__ float Apair1[34];
__shared__ float Apair2[34];
...

Any ideas?

Thanks

+5
2 answers

If your matrix pairs are stored contiguously, and if you access the elements linearly by thread index, then you will not have shared memory bank conflicts.

In other words, if you have:

A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18

And you access it using:

float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];

Then adjacent threads access adjacent elements of the matrix, and you have no bank conflicts.

Bank conflicts only occur when threads of the same half-warp (or warp, on Fermi) access different words that live in the same bank. There are 16 banks (32 on Fermi), and successive 32-bit words are assigned to successive banks. So as long as every thread accesses x + threadIdx.x (where x is the same for all threads and threadIdx.x runs over, say, 16 consecutive values), each thread hits a different bank and there is no conflict.

In your case the matrices are stored contiguously and read linearly, so the fact that the second matrix starts 16 words in, and that later pairs start at multiples of 34, only rotates which bank each thread starts in. It does not create conflicts, and no padding is needed.
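
As a concrete illustration (the kernel name, dimensions, and launch shape here are mine, not taken from the question), a load pattern along these lines is conflict-free for both matrices of a pair, because consecutive threads always touch consecutive words:

// Hypothetical sketch: N pairs of a 16-float and an 18-float matrix packed
// back to back in shared memory, filled with purely linear indexing.
#define N 8

__global__ void loadPairs(const float *src, float *dst)
{
    __shared__ float A[34 * N];

    // threadIdx.x runs over consecutive addresses, so the threads of a
    // (half-)warp hit distinct banks throughout this copy.
    for (int i = threadIdx.x; i < 34 * N; i += blockDim.x)
        A[i] = src[i];
    __syncthreads();

    // Conflict-free read of one element of the first matrix of pair 0
    // (assumes threadIdx.x < 16 so the index stays inside that matrix).
    int pairindex = 0;
    int matindex  = 0;   // 0 = 16-float matrix, 1 = 18-float matrix
    float element = A[pairindex * 34 + matindex * 16 + threadIdx.x];

    dst[blockIdx.x * blockDim.x + threadIdx.x] = element;
}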

Have a look at the SDK samples (the matrix transpose, for example), which show how bank conflicts arise and how to remove them with padding.

+5

Yes, shared memory is divided into banks. Each bank is 4 bytes wide, so successive floats map to successive banks (wrapping after 16 or 32 words, depending on the architecture). On compute capability 1.x devices there are 16 banks.
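
To make the mapping concrete, here is a tiny helper (my own, purely illustrative) that computes the bank a given float index falls into:

// Bank of the shared-memory word at float offset `index`.
// numBanks is 16 on compute capability 1.x and 32 on Fermi.
__host__ __device__ int bankOf(int index, int numBanks)
{
    return index % numBanks;   // banks are 4 bytes wide: one float per word
}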

Because your second matrix is 18 floats wide rather than 16, the banking works out differently. Consider the familiar 16x16 case first, which is usually declared padded as

__shared__ float sixteen[16][16+1];

so that a half-warp can read down a column indexed by threadIdx.x (or across a row, whichever direction is the strided one). With a plain 16x16 array, reading down a column means all 16 threads access words that are exactly 16 apart, i.e. the same bank, which is a 16-way conflict. The 1 extra element per row shifts each row by one bank, so the 16 column accesses land in 16 different banks. Keep in mind that a 2D declaration [row][col] is laid out linearly, so the element really sits at [row * (16 + 1) + col].
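
A minimal sketch of that 16x16 case (assuming 16 banks and a 16x16 thread block; the kernel is mine, not from the answer): the load is row-wise, the read is column-wise, and the +1 padding is what keeps the column read conflict-free:

__global__ void transpose16(const float *in, float *out)
{
    // Padded tile: the row stride is 17 words instead of 16.
    __shared__ float tile[16][16 + 1];

    // Row-wise load: threads of a half-warp read consecutive words.
    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 16 + threadIdx.x];
    __syncthreads();

    // Column-wise read: the 16 addresses differ by 17 words, so they land
    // in 16 different banks. Without the padding they would differ by
    // exactly 16 words and all hit the same bank (a 16-way conflict).
    out[threadIdx.y * 16 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}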

With an 18x18 matrix the numbers change, but column accesses still conflict (every bank gets hit by two threads, since 18 mod 16 = 2). The fix is the same: pad each row by 1.

__shared__ float eighteens[18][18+1];

Now, when a half-warp reads down a column (one element per row, one row per thread), the row stride is 19 words and (18 + 1) % 16 = 3, so successive rows fall into banks 3, 6, 9, 12, 15, 2, 5, 8, ..., and every thread of the half-warp ends up in a different bank.
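
That arithmetic is easy to verify with a few lines of plain host code (illustrative only):

#include <stdio.h>

int main(void)
{
    const int stride   = 18 + 1;  // padded row width in floats
    const int numBanks = 16;      // compute capability 1.x

    // Bank of column 0 in each row: (row * stride) % numBanks.
    // Prints 0, 3, 6, 9, 12, 15, 2, 5, 8, ... -- every bank exactly once.
    for (int row = 0; row < 16; ++row)
        printf("row %2d -> bank %2d\n", row, (row * stride) % numBanks);
    return 0;
}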

Whether your 18-float matrices actually suffer from this depends on how the threads index into them: linear access by threadIdx.x is fine, while a strided (column-style) access pattern is not. If it is the latter, padding each row by 1, as above, removes the conflicts.

+2
