Radix parallel sorting, how would this implementation really work? Are there any heuristics?

Question

Radix parallel sorting, how would this implementation really work? Are there any heuristics?

I am working on a Udacity quiz for their parallel programming course. I'm pretty obsessed with how to start a task, because I'm not sure if I understand this correctly.

For assignment (in the code) we are given two arrays and an array of values and an array of positions. We must sort the array of values with parallel sorting by the base, and also set the positions correctly.

I fully understand radix sorting and how it works. I don’t understand how they want us to implement it. Here is the template set to start the assignment

//Udacity HW 4
//Radix Sorting

#include "reference_calc.cpp"
#include "utils.h"

/* Red Eye Removal
   ===============

   For this assignment we are implementing red eye removal.  This is
   accomplished by first creating a score for every pixel that tells us how
   likely it is to be a red eye pixel.  We have already done this for you - you
   are receiving the scores and need to sort them in ascending order so that we
   know which pixels to alter to remove the red eye.

   Note: ascending order == smallest to largest

   Each score is associated with a position, when you sort the scores, you must
   also move the positions accordingly.

   Implementing Parallel Radix Sort with CUDA
   ==========================================

   The basic idea is to construct a histogram on each pass of how many of each
   "digit" there are.   Then we scan this histogram so that we know where to put
   the output of each digit.  For example, the first 1 must come after all the
   0s so we have to know how many 0s there are to be able to start moving 1s
   into the correct position.

   1) Histogram of the number of occurrences of each digit
   2) Exclusive Prefix Sum of Histogram
   3) Determine relative offset of each digit
        For example [0 0 1 1 0 0 1]
                ->  [0 1 0 1 2 3 2]
   4) Combine the results of steps 2 & 3 to determine the final
      output location for each element and move it there

   LSB Radix sort is an out-of-place sort and you will need to ping-pong values
   between the input and output buffers we have provided.  Make sure the final
   sorted results end up in the output buffer!  Hint: You may need to do a copy
   at the end.

 */


void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{

}

I specifically don’t understand how these 4 steps finish sorting the array.

, "" ( ...?). , n, 0 1 . , 1 , ?

, . - , ?

+4

sorting parallel-processing cuda

KDecker 05 . '14 19:48

2

, @Robert Crovella, . , Udacity. , , , :

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
LSB:         1       0       0       1

Predicate:   0     __1__   __1__     0
Pred. Scan:  0     __0__   __1__     2

Number of ones in predicate: 2

!Predicate:__1__     0       0     __1__
!Pred. Scan: 0       1       1       1

Offset for !Pred. Scan = Number of ones in predicate = 2

!Pred. Scan + Offset:
           __2__     3       3     __3__

Final indexes to move values after 1 iteration (on LSB):
             2       0       1       3

Values after 1 iteration (on LSB):
             14      4       7       1

(__ __) , .

( Udacity):

LSB =
( LSB): (x 1) == 0
- : (x 2) == 0
- : (x 4) == 0
- .., (<)
Pred. Scan = Predicate Scan =
! Pred. = (0- > 1 1- > 0)
- , , (/ ) Blelloch

:

( , LSB)
- Blelloch
- , , Blelloch Scan , 2 ^ n
,
:
- i- :
- i- TRUE, i- index i-
- else, i- i- Predicate Predicate
(NSB)

HW CUDA.

+3

polpetti 05 . '15 19:13

Robert Crovella · Accepted Answer · 2014-10-07T06:32:49+0000

radix , , , . , .

. , 4 . 1, 4, 7 14. , :

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001

0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

, , ( 0) , . , . :

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      1110    0100    0111    0001
bit 0:       0       0       1       1

radix . - () :

Element #    3       2       1       4
Value:       4       14      7       1
Binary:      0100    1110    0111    0001
bit 1:       0       1       1       0

, ( 1) :

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 1:       0       0       1       1

:

Element #    3       4       2       1
Value:       4       1       14      7
Binary:      0100    0001    1110    0111
bit 2:       1       0       1       1

:

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 2:       0       1       1       1

( ):

Element #    4       3       2       1
Value:       1       4       14      7
Binary:      0001    0100    1110    0111
bit 3:       0       0       1       0

:

Element #    4       3       1       2
Value:       1       4       7       14
Binary:      0001    0100    0111    1110
bit 3:       0       0       0       1

. , , , , " , ?" " , ?" , , , . 0:

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1

0 1 0:

bit 0:       1       0       0       1

              zero bits       one bits
              ---------       --------
1)histogram:         2              2

:

              zero bits       one bits
              ---------       --------
1)histogram:         2              2
2)prefix sum:        0              2

- - . , 2 ( 0 0). , , 0 , - :

bit 0:       1       0       0       1
3)offset:    0       0       1       1

- , 0 1 , 1:

0 bit 0:             1       1       
3)ex. psum:          0       1    

1 bit 0:     1                        1      
3)ex. psum:  0                        1

, 4 :

4) 2 3,

, -, (0 1), , , :

Element #    1       2       3       4
Value:       7       14      4       1
Binary:      0111    1110    0100    0001
bit 0:       1       0       0       1
hist psum:   2       0       0       2
offset:      0       0       1       1
new index:   2       0       1       3

" ":

Element #    2       3       1       4
Value:       14      4       7       1
Binary:      0111    1110    0111    0001

, , . 1, ( ) ; , .

:

Radix-, , . , , 2,3 4 .
, radix, - , . , 32- , , 1023 (2 ^ 10-1), 32 . , , 10 .
? . - , , - . radix , , .

. radix. , , 32- warp, warp, .. 32 . , , , , CUDA. , , .., .

#include <stdio.h>
#include <stdlib.h>
#define WSIZE 32
#define LOOPS 100000
#define UPPER_BIT 31
#define LOWER_BIT 0

__device__ unsigned int ddata[WSIZE];

// naive warp-level bitwise radix sort

__global__ void mykernel(){
  __shared__ volatile unsigned int sdata[WSIZE*2];
  // load from global into shared variable
  sdata[threadIdx.x] = ddata[threadIdx.x];
  unsigned int bitmask = 1<<LOWER_BIT;
  unsigned int offset  = 0;
  unsigned int thrmask = 0xFFFFFFFFU << threadIdx.x;
  unsigned int mypos;
  //  for each LSB to MSB
  for (int i = LOWER_BIT; i <= UPPER_BIT; i++){
    unsigned int mydata = sdata[((WSIZE-1)-threadIdx.x)+offset];
    unsigned int mybit  = mydata&bitmask;
    // get population of ones and zeroes (cc 2.0 ballot)
    unsigned int ones = __ballot(mybit); // cc 2.0
    unsigned int zeroes = ~ones;
    offset ^= WSIZE; // switch ping-pong buffers
    // do zeroes, then ones
    if (!mybit) // threads with a zero bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes&thrmask);
    else        // threads with a one bit
      // get my position in ping-pong buffer
      mypos = __popc(zeroes)+__popc(ones&thrmask);
    // move to buffer  (or use shfl for cc 3.0)
    sdata[mypos-1+offset] = mydata;
    // repeat for next bit
    bitmask <<= 1;
    }
  // save results to global
  ddata[threadIdx.x] = sdata[threadIdx.x+offset];
  }

int main(){

  unsigned int hdata[WSIZE];
  for (int lcount = 0; lcount < LOOPS; lcount++){
    unsigned int range = 1U<<UPPER_BIT;
    for (int i = 0; i < WSIZE; i++) hdata[i] = rand()%range;
    cudaMemcpyToSymbol(ddata, hdata, WSIZE*sizeof(unsigned int));
    mykernel<<<1, WSIZE>>>();
    cudaMemcpyFromSymbol(hdata, ddata, WSIZE*sizeof(unsigned int));
    for (int i = 0; i < WSIZE-1; i++) if (hdata[i] > hdata[i+1]) {printf("sort error at loop %d, hdata[%d] = %d, hdata[%d] = %d\n", lcount,i, hdata[i],i+1, hdata[i+1]); return 1;}
    // printf("sorted data:\n");
    //for (int i = 0; i < WSIZE; i++) printf("%u\n", hdata[i]);
    }
  printf("Success!\n");
  return 0;
}

Radix parallel sorting, how would this implementation really work? Are there any heuristics?

More articles: