3 whole key searches in CUDA

I would like to find 3 integers (ie [1 2 3]) in a large dataset with about a million points.

I am currently using MATLAB Map (hashmap), and for each point I do the following:

key = sprintf('%d ', [1 2 3]); % 23 us % key = '1 2 3 ' result = lookup_map( key ); % 32 us 

This is quite a lot of time, although - 1 million points * 55 us = 55 seconds.

I would like to port this to the GPU using CUDA, but I'm not sure of a better way to approach this.

I could pass four arrays - key1, key2, key3, result , and then perform a binary search on the keys, but this will take 20 iterations (2 ^ 20 = 1048576) per key. Then I also have delays due to simultaneous access to memory from each thread.

Is there a data structure optimized for parallel (O (1), ideally) multiple key searches in CUDA?


Q: What are the boundaries of three integers? And what data is viewed?

Integer keys can be from 0 to ~ 75,000 at present, but may be more (200,000+) in the future.

For the purposes of this question, we can assume that result is an integer from 0 to the size of the data set.


Q: Why don't you pack all three numbers into one 64-bit number (21 bits per number gives you a range of 0-2,097,152). And use this to index into a sparse array?

 >> A = uint64(ones(10)); >> sparse_A = sparse(A) ??? Undefined function or method 'sparse' for input arguments of type 'uint64'. >> A = int64(ones(10)); >> sparse_A = sparse(A) ??? Undefined function or method 'sparse' for input arguments of type 'int64'. 

It seems that my matlab does not support sparse arrays of 64-bit numbers.

In case this helps someone else, I wrote a quick function to create a 64-bit key from three unsigned integers: 2 ^ 21:

 function [key] = to_key(face) key = uint64(bitsll(face(1), 42) + bitsll(face(2), 21) + rand(face(3),1)); end 

Q: From @Dennis - why not use logical indexing?

Test it out!

 % Generate a million random integers between 0 and 1000 >> M = int32(floor(rand(10000000,4)*1000)); % Find a point to look for >> search = M(500000,1:3) search = 850 910 581 >> tic; idx = M(:,1)==search(1) & M(:,2)==search(2)&M(:,3)==search(3); toc; Elapsed time is 0.089801 seconds. >> M(idx,:) ans = 850 910 581 726 

Unfortunately, it takes 89801us, which is 1632x less than my existing solution (55us)! It takes 2.5 hours to run it a million times!

We can try to filter M after each search:

 >> tic; idx1=M(:,1)==search(1); N=M(idx1,:); idx2=N(:,2)==search(2); N2=N(idx2,:); idx3 = N2(:,3)==search(3); toc; Elapsed time is 0.038272 seconds. 

It's a little faster, but still 696 times slower than using a card.


I thought about this a little more, and I decided to profile the speed of re-generating some data "on the fly" from one key search - it can be faster than searching by 3 keys, given potential problems with this approach.

+6
source share
2 answers

I assume this question is related to your previous question about the faces of the tetrahedron. I still suggest you try the sparse repository and sparse matrix-vector multiplication for this purpose:

 size(spA) ans = 1244810 1244810 tic; vv = spA*v; idx = find(vv); toc; Elapsed time is 0.106581 seconds. 

This is just a time analysis, see my previous answer on how to implement it in your case. Before moving on to CUDA and doing complex things, check out simpler options.

+2
source

Given the attention this question has already received, this answer seems to be too simple, but why don't you just do it like this:

 M=[1:6; 2:7; 3:8; 4:9]'; %Some matrix that contains key 1 2 3, corresponding value is 4 idx=M(:,1)==1&M(:,2)==2&M(:,3)==3; M(idx,4) 

This should be evaluated fairly quickly, even if M is 1 million x 4.

0
source

Source: https://habr.com/ru/post/927716/


All Articles