3 whole key searches in CUDA

Question

3 whole key searches in CUDA

I would like to find 3 integers (ie [1 2 3]) in a large dataset with about a million points.

I am currently using MATLAB Map (hashmap), and for each point I do the following:

key = sprintf('%d ', [1 2 3]); % 23 us % key = '1 2 3 ' result = lookup_map( key ); % 32 us

This is quite a lot of time, although - 1 million points * 55 us = 55 seconds.

I would like to port this to the GPU using CUDA, but I'm not sure of a better way to approach this.

I could pass four arrays - key1, key2, key3, result , and then perform a binary search on the keys, but this will take 20 iterations (2 ^ 20 = 1048576) per key. Then I also have delays due to simultaneous access to memory from each thread.

Is there a data structure optimized for parallel (O (1), ideally) multiple key searches in CUDA?

Q: What are the boundaries of three integers? And what data is viewed?

Integer keys can be from 0 to ~ 75,000 at present, but may be more (200,000+) in the future.

For the purposes of this question, we can assume that result is an integer from 0 to the size of the data set.

Q: Why don't you pack all three numbers into one 64-bit number (21 bits per number gives you a range of 0-2,097,152). And use this to index into a sparse array?

 >> A = uint64(ones(10)); >> sparse_A = sparse(A) ??? Undefined function or method 'sparse' for input arguments of type 'uint64'. >> A = int64(ones(10)); >> sparse_A = sparse(A) ??? Undefined function or method 'sparse' for input arguments of type 'int64'.

It seems that my matlab does not support sparse arrays of 64-bit numbers.

In case this helps someone else, I wrote a quick function to create a 64-bit key from three unsigned integers: 2 ^ 21:

 function [key] = to_key(face) key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1)); end

Q: From @Dennis - why not use logical indexing?

Test it out!

 % Generate a million random integers between 0 and 1000 >> M = int32(floor(rand(10000000,4)*1000)); % Find a point to look for >> search = M(500000,1:3) search = 850 910 581 >> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc; Elapsed time is 0.089801 seconds. >> M(idx,:) ans = 850 910 581 726

Unfortunately, it takes 89801us, which is 1632x less than my existing solution (55us)! It takes 2.5 hours to run it a million times!

We can try to filter M after each search:

 >> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc; Elapsed time is 0.038272 seconds.

It's a little faster, but still 696 times slower than using a card.

I thought about this a little more, and I decided to profile the speed of re-generating some data "on the fly" from one key search - it can be faster than searching by 3 keys, given potential problems with this approach.

+6

hashmap data-structures matlab cuda lookup-tables

Alex l Oct 15 '12 at 7:49

source share

2 answers

Given the attention this question has already received, this answer seems to be too simple, but why don't you just do it like this:

 M=[1:6; 2:7; 3:8; 4:9]'; %Some matrix that contains key 1 2 3, corresponding value is 4 idx=M(:,1)==1&M(:,2)==2&M(:,3)==3; M(idx,4)

This should be evaluated fairly quickly, even if M is 1 million x 4.

0

Dennis jaheruddin Oct 17 '12 at 20:47

source share

angainor · Accepted Answer · 2012-10-15T08:28:40+0000

I assume this question is related to your previous question about the faces of the tetrahedron. I still suggest you try the sparse repository and sparse matrix-vector multiplication for this purpose:

 size(spA) ans = 1244810 1244810 tic; vv = spA*v; idx = find(vv); toc; Elapsed time is 0.106581 seconds.

This is just a time analysis, see my previous answer on how to implement it in your case. Before moving on to CUDA and doing complex things, check out simpler options.

3 whole key searches in CUDA

More articles: