Expected number of bank conflicts in shared memory with random access

Let A be a properly aligned array of 32-bit integers in shared memory.

If one warp accesses random elements of A, what is the expected number of bank conflicts?

In other words:

    __shared__ int A[N]; // N is some big constant integer
    ...
    int v = A[ random(0..N-1) ]; // <-- expected number of bank conflicts here?

Please assume the Tesla or Fermi architecture; I do not want to dwell on the 32-bit and 64-bit banking configurations of Kepler. Also, for simplicity, assume that all the random numbers are different (so the broadcast mechanism does not apply).

My gut feeling says somewhere between 4 and 6, but I would like to find a mathematical estimate of it.


I believe the problem can be abstracted away from CUDA and presented as a pure math problem. I searched for it as a variant of the birthday paradox, but I found really scary formulas there and never a final closed form. I hope there is an easier way...
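One way to state it formally (the notation here is mine): let X_b be the number of threads in the warp whose random index falls into bank b. Then

    (X_1, \dots, X_{32}) \sim \mathrm{Multinomial}\bigl(32;\ \tfrac{1}{32}, \dots, \tfrac{1}{32}\bigr), \qquad \text{wanted:}\quad \mathbb{E}\Bigl[\max_{1 \le b \le 32} X_b\Bigr],

since the hardware serializes a warp's access into as many passes as the most heavily loaded bank requires.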

3 answers

I assume Fermi's 32-bit shared memory mode, where each consecutive 4-byte word lives in the next bank. Using the following code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NBANK 32      /* number of shared memory banks */
    #define N 7823        /* number of elements in the array */
    #define WARPSIZE 32   /* threads per warp */
    #define NSAMPLE 10000 /* number of simulated warp accesses */

    int main(){
        srand ( time(NULL) );
        int i=0,j=0;
        int *conflictCheck=(int*)malloc(sizeof(int)*NBANK);
        int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
        int *statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
        memset(statisticCheck, 0, sizeof(int)*(NBANK+1));
        while(i<NSAMPLE){
            // generate a sample warp shared memory access
            for(j=0; j<WARPSIZE; j++){
                randomNumber[j]=rand()%NBANK;
            }
            // count accesses per bank and track the worst (maximum) degree
            memset(conflictCheck, 0, sizeof(int)*NBANK);
            int max_bank_conflict=0;
            for(j=0; j<WARPSIZE; j++){
                conflictCheck[randomNumber[j]]++;
                max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]?
                                    conflictCheck[randomNumber[j]]: max_bank_conflict;
            }
            // store statistic
            statisticCheck[max_bank_conflict]++;
            // next iter
            i++;
        }
        // report statistic
        printf("Over %d random shared memory accesses, the following percentages of bank conflicts were found\n", NSAMPLE);
        for(i=0; i<NBANK+1; i++){
            printf("%d -> %6.4f\n", i, statisticCheck[i]/(float)NSAMPLE);
        }
        return 0;
    }

I got the following result:

    Over 10000 random shared memory accesses, the following percentages of bank conflicts were found
    0 -> 0.0000
    1 -> 0.0000
    2 -> 0.0281
    3 -> 0.5205
    4 -> 0.3605
    5 -> 0.0780
    6 -> 0.0106
    7 -> 0.0022
    8 -> 0.0001
    9 -> 0.0000
    10 -> 0.0000
    11 -> 0.0000
    12 -> 0.0000
    13 -> 0.0000
    14 -> 0.0000
    15 -> 0.0000
    16 -> 0.0000
    17 -> 0.0000
    18 -> 0.0000
    19 -> 0.0000
    20 -> 0.0000
    21 -> 0.0000
    22 -> 0.0000
    23 -> 0.0000
    24 -> 0.0000
    25 -> 0.0000
    26 -> 0.0000
    27 -> 0.0000
    28 -> 0.0000
    29 -> 0.0000
    30 -> 0.0000
    31 -> 0.0000
    32 -> 0.0000

We can conclude that a 3-way or 4-way conflict is the most likely outcome of a random access. You can tune the experiment with different values of N (the number of elements in the array), NBANK (the number of banks in shared memory), WARPSIZE (the warp size of the machine) and NSAMPLE (the number of random shared memory accesses generated to evaluate the model).
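As a quick sanity check on the first rows of the table: a warp access is completely conflict-free only when all 32 indices land in distinct banks, which happens with probability

    \Pr(\text{max degree} = 1) = \frac{32!}{32^{32}} \approx 1.8 \times 10^{-13},

so the simulation essentially never observes a maximum degree of 1.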


In mathematics, this is known as the "balls into bins" problem: 32 balls are dropped at random into 32 bins. You could enumerate the possible occupancy patterns and compute their probabilities to determine the distribution, but the naive approach will not work, because the number of patterns is huge: 63!/(32! 31!), which is "almost" a quintillion.
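That count follows from the stars-and-bars identity: the number of distinct occupancy patterns of 32 indistinguishable balls in 32 bins is

    \binom{32 + 32 - 1}{32 - 1} = \binom{63}{31} = \frac{63!}{32!\,31!} \approx 9.16 \times 10^{17},

indeed just short of a quintillion.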

It is at least tractable if you build up the solution recursively, using conditional probabilities.

Take a look at the paper titled "The exact distribution of the maximum, minimum and the range of Multinomial/Dirichlet and Multivariate Hypergeometric frequencies" by Charles J. Corrado.

The approach is to start from the leftmost bucket and compute the probabilities for each number of balls that could have fallen into it. Then we move one bucket to the right and determine the conditional probabilities for each number of balls that could be in that bucket, given the number of balls and buckets already used up.
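In symbols (my formalization of the step just described): if the first b-1 buckets already hold u balls in total, then each of the remaining 32-u balls is equally likely to land in any of the 32-b+1 remaining buckets, so the count in bucket b is conditionally binomial:

    X_b \mid u \;\sim\; \mathrm{Binomial}\!\left(32 - u,\ \frac{1}{32 - b + 1}\right),

which is exactly what p_next_bucket_balls evaluates in the code below.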

Apologies for the VBA code, but VBA was all that I had when I was motivated to answer :).

    Function nCr#(ByVal n#, ByVal r#)
        Static combin#()
        Static size#
        Dim i#, j#

        If n = r Then
            nCr = 1
            Exit Function
        End If

        If n > size Then
            ReDim combin(0 To n, 0 To n)
            combin(0, 0) = 1
            For i = 1 To n
                combin(i, 0) = 1
                For j = 1 To i
                    combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
                Next
            Next
            size = n
        End If
        nCr = combin(n, r)
    End Function

    Function p_binom#(n#, r#, p#)
        p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
    End Function

    Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
            bucket#, total_buckets#, bucket_capacity#)
        If balls > bucket_capacity Then
            p_next_bucket_balls = 0
        Else
            p_next_bucket_balls = p_binom(total_balls - balls_used, balls, _
                1 / (total_buckets - bucket + 1))
        End If
    End Function

    Function p_capped_buckets#(n#, cap#)
        Dim p_prior, p_update
        Dim bucket#, balls#, prior_balls#

        ReDim p_prior(0 To n)
        ReDim p_update(0 To n)
        p_prior(0) = 1

        For bucket = 1 To n
            For balls = 0 To n
                p_update(balls) = 0
                For prior_balls = 0 To balls
                    p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                        p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
                Next
            Next
            p_prior = p_update
        Next
        p_capped_buckets = p_update(n)
    End Function

    Function expected_max_buckets#(n#)
        Dim cap#
        For cap = 0 To n
            expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
        Next
    End Function

    Sub test32()
        Dim p_cumm#(0 To 32)
        Dim cap#

        For cap = 0 To 32
            p_cumm(cap) = p_capped_buckets(32, cap)
        Next
        For cap = 1 To 32
            Debug.Print " ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
        Next
    End Sub

For 32 balls and 32 buckets, I get an expected maximum number of balls per bucket of about 3.532941.

Output for comparison with ahmad's:

     1    0.000000
     2    0.029273
     3    0.516311
     4    0.361736
     5    0.079307
     6    0.011800
     7    0.001417
     8    0.000143
     9    0.000012
    10    0.000001
    11    0.000000
    12    0.000000
    13    0.000000
    14    0.000000
    15    0.000000
    16    0.000000
    17    0.000000
    18    0.000000
    19    0.000000
    20    0.000000
    21    0.000000
    22    0.000000
    23    0.000000
    24    0.000000
    25    0.000000
    26    0.000000
    27    0.000000
    28    0.000000
    29    0.000000
    30    0.000000
    31    0.000000
    32    0.000000
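For readers without VBA handy, here is a rough C++ port of the same recursion (a sketch of my own; the variable names are mine, so treat it as illustrative rather than the original code):

    #include <cstdio>
    #include <cmath>
    #include <vector>
    #include <algorithm>

    static const int N = 32;  // balls == buckets == warp size == banks

    // Binomial coefficient from Pascal's triangle (exact in double for n <= 32).
    double nCr(int n, int r) {
        static std::vector<std::vector<double> > c;
        if (c.empty()) {
            c.assign(N + 1, std::vector<double>(N + 1, 0.0));
            for (int i = 0; i <= N; ++i) {
                c[i][0] = 1.0;
                for (int j = 1; j <= i; ++j)
                    c[i][j] = c[i - 1][j - 1] + c[i - 1][j];
            }
        }
        return c[n][r];
    }

    // P(r successes in n Bernoulli(p) trials); pow(0,0) == 1 covers the last bucket.
    double p_binom(int n, int r, double p) {
        return nCr(n, r) * std::pow(p, r) * std::pow(1.0 - p, n - r);
    }

    // P(no bucket exceeds cap) when n balls fall uniformly into n buckets,
    // sweeping buckets left to right with conditional binomials.
    double p_capped_buckets(int n, int cap) {
        std::vector<double> prior(n + 1, 0.0), update(n + 1, 0.0);
        prior[0] = 1.0;                        // before bucket 1: zero balls placed
        for (int bucket = 1; bucket <= n; ++bucket) {
            std::fill(update.begin(), update.end(), 0.0);
            for (int balls = 0; balls <= n; ++balls)        // total after this bucket
                for (int used = 0; used <= balls; ++used) { // total before it
                    int here = balls - used;
                    if (here > cap) continue;               // bucket over capacity
                    update[balls] += prior[used] *
                        p_binom(n - used, here, 1.0 / (n - bucket + 1));
                }
            prior = update;
        }
        return prior[n];                       // all n balls placed, none over cap
    }

    int main() {
        // E[max load] = sum over cap of P(max > cap)
        double expected = 0.0;
        for (int cap = 0; cap <= N; ++cap)
            expected += 1.0 - p_capped_buckets(N, cap);
        std::printf("expected max bucket load = %f\n", expected);
        return 0;
    }

It should print a value near 3.532941, matching the figure above.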

I'll attempt a mathematical answer, although I don't have a complete one yet.

Basically, you are asking: given random per-thread indexing of 32-bit words within a warp into an aligned __shared__ array, what is the expected value of the maximum number of addresses within the warp that map to the same bank?

Treated as a hashing problem, it relates to the expected maximum number of items that hash to a single location, and this paper shows an upper bound of O(log n / log log n) for hashing n items into n buckets. (The math is pretty hairy!)
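Plugging n = 32 into the bound with natural logarithms (my arithmetic; the hidden big-O constant is ignored):

    \frac{\ln 32}{\ln \ln 32} = \frac{3.466}{1.243} \approx 2.788.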

That's fine as far as it goes, but to compare against experiment, I changed ahmad's program a bit to empirically calculate the expected maximum (I also simplified the code, changed names for clarity, and fixed some errors).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <algorithm>

    #define NBANK 32
    #define WARPSIZE 32
    #define NSAMPLE 100000

    int main(){
        int i=0,j=0;
        int *bank=(int*)malloc(sizeof(int)*NBANK);
        int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
        int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
        memset(maxCount, 0, sizeof(int)*(NBANK+1));

        for (int i=0; i<NSAMPLE; ++i) {
            // generate a sample warp shared memory access
            for(j=0; j<WARPSIZE; j++){
                randomNumber[j]=rand()%NBANK;
            }
            // check the bank conflict
            memset(bank, 0, sizeof(int)*NBANK);
            int max_bank_conflict=0;
            for(j=0; j<WARPSIZE; j++){
                bank[randomNumber[j]]++;
            }
            for(j=0; j<WARPSIZE; j++)
                max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);
            // store statistic
            maxCount[max_bank_conflict]++;
        }

        // report statistic
        printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
        float expected = 0;
        for(i=1; i<NBANK+1; i++) {
            float prob = maxCount[i]/(float)NSAMPLE;
            printf("%02d -> %6.4f\n", i, prob);
            expected += prob * i;
        }
        printf("Expected maximum bank conflict degree = %6.4f\n", expected);
        return 0;
    }

Using the percentages found by the program as probabilities, the expected maximum is the sum of the products, sum(i * probability(i)), for i from 1 to 32. I calculate the expected value as 3.529 (which matches ahmad's data). That's not too far off, but 2.788 is supposed to be an upper bound. Since the bound is stated in big-O notation, I assume the constant factor is not accounted for. But this is as far as I have gotten.
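Incidentally, the sum above agrees with the tail-sum form of the expectation (a standard identity for non-negative integer random variables), which is exactly what expected_max_buckets in the VBA answer computes:

    \mathbb{E}[\max] = \sum_{i=1}^{32} i \, \Pr(\max = i) = \sum_{c=0}^{31} \Pr(\max > c).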

Open questions: is the constant factor enough to explain the difference? Is it possible to calculate the constant factor for n = 32? It would be interesting to reconcile the two, and/or to find a closed-form solution for the expected maximum bank conflict degree with 32 banks and 32 parallel threads.

This is a very useful topic, since it can help in modeling and predicting performance when shared memory addressing is random.

