CUDA memory for lookup tables

I am developing a set of mathematical functions and implementing them both on the CPU and on the GPU (with CUDA).

Some of these functions are based on lookup tables. Most of the tables take 4 KB, some of them slightly more. The lookup-table-based functions take an input, pick one or two entries of a lookup table, and then compute the result by interpolation or similar techniques.
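For concreteness, here is a minimal sketch of the kind of function I mean (the 1024-entry table, the scaling and the names are only illustrative, not my actual code):

```cuda
// Hypothetical example: a 4 KB table of 1024 floats covering inputs in [0, 1].
// The real tables and the index mapping differ; this only shows the access pattern.
__device__ float lut_eval(const float* table, float x)
{
    // Map the input to a fractional table position.
    float pos  = fminf(fmaxf(x, 0.0f), 1.0f) * 1022.0f;
    int   i    = static_cast<int>(pos);
    float frac = pos - i;

    // Pick two neighbouring entries and interpolate linearly between them.
    return table[i] + frac * (table[i + 1] - table[i]);
}

// Kernel applying the function element-wise; the lookup indexes may be
// completely uncorrelated across the threads of a warp.
__global__ void apply_lut(const float* table, const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = lut_eval(table, in[tid]);
}
```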

Now my question is: where should I store these lookup tables? A CUDA device has several places to store values (global memory, constant memory, texture memory, ...). Given that every table can be read concurrently by many threads, and that the input values, and therefore the lookup indexes, can be completely uncorrelated among the threads of each warp (leading to uncorrelated memory accesses), which memory provides the fastest access?

I should add that the contents of these tables are precomputed and completely constant.

EDIT

Just to clarify: I need to store about 10 different 4 KB lookup tables. In any case, it would also be great to know whether the solution would be the same for, say, 100 tables of 4 KB or, say, 10 lookup tables of 16 KB.

1 answer

Texture memory (now called the read-only data cache) would probably be a choice worth exploring, although not for the interpolation benefits. It supports 32-bit reads without fetching more than that amount. However, you are limited to 48 KB in total. For Kepler (compute 3.x) it is quite simple to program.
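As a rough sketch of what that looks like on compute 3.5 and later (names and the interpolation details are just placeholders): you can either qualify the pointers so the compiler can route loads through the read-only cache, or force it with `__ldg()`:

```cuda
// Minimal sketch, assuming sm_35+: loads from pointers marked
// "const ... __restrict__" can be served by the 48 KB read-only data cache,
// and __ldg() forces that path explicitly.
__global__ void apply_lut_ro(const float* __restrict__ table,
                             const float* __restrict__ in,
                             float*       __restrict__ out,
                             int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float pos  = fminf(fmaxf(in[tid], 0.0f), 1.0f) * 1022.0f;
    int   i    = static_cast<int>(pos);
    float frac = pos - i;

    // Reads via __ldg() go through the read-only data cache.
    float a = __ldg(&table[i]);
    float b = __ldg(&table[i + 1]);
    out[tid] = a + frac * (b - a);
}
```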

Global memory, unless you configure it in 32-bit mode, will often drag in 128 bytes per thread, hugely multiplying the amount of data actually needed from memory, since you (apparently) cannot coalesce the memory accesses. So 32-bit mode is probably what you need if you want to use more than 48 KB (you mentioned 40 KB).

Thinking about coalescing: if you were to fetch a set of values from these tables consecutively, you might be able to interleave the tables so that those combinations can be grouped and read as a 64- or 128-bit read per thread. That would make the 128-byte reads from global memory useful.
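Something along these lines, purely as an illustration with four hypothetical tables interleaved entry by entry:

```cuda
// Hypothetical layout: entries of four tables interleaved as
// t0[0], t1[0], t2[0], t3[0], t0[1], t1[1], ... so the four values sharing
// one index come back in a single 128-bit (float4) read per thread.
__global__ void apply_four_luts(const float4* __restrict__ tables,  // interleaved
                                const float*  __restrict__ in,
                                float4*       __restrict__ out,
                                int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int idx = static_cast<int>(fminf(fmaxf(in[tid], 0.0f), 1.0f) * 1023.0f);

    // One vectorised load fetches the matching entry from all four tables.
    float4 v = tables[idx];
    out[tid] = v;   // downstream interpolation omitted for brevity
}
```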

The problem you will run into is that the solution becomes memory-bandwidth limited by using lookup tables. Changing the L1 cache size (on Fermi / compute 2.x) to 48 KB will likely make a significant difference, especially if you are not using the other 32 KB of shared memory. Try texture memory and then global memory in 32-bit mode and see which works best for your algorithm. Finally, pick a card with a good memory bandwidth figure if you have a choice over the hardware.
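For the L1 part, the relevant runtime call is cudaDeviceSetCacheConfig; a minimal sketch:

```cuda
#include <cuda_runtime.h>

int main()
{
    // On Fermi (compute 2.x), prefer the 48 KB L1 / 16 KB shared-memory split.
    // This is a hint; the runtime may ignore it if a kernel needs more shared memory.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // ... launch the lookup-table kernels as usual ...
    return 0;
}
```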
