How to successfully read from a 2D texture

How can I:

  • Allocate pitched float memory with cudaMallocPitch and bind it to a 2D texture reference
  • Copy some host data into that 2D array on the device
  • Read from the texture (adding one, say) and write the result to either a) the pitched 2D array or b) a linear memory array
  • Copy the result back and print it

Below is the code that should do this. Please note that my code works for NxN array sizes. For NxM, where N != M, my code bites the dust (wrong results). If you can solve this problem I will award you 1 Internet (limited offer). Maybe I'm crazy, but according to the documentation this should work (and it does work for square arrays!). The attached code should run with "nvcc whateveryoucallit.cu -o runit".

Help is appreciated!

#include<stdio.h>
#include<cuda.h>
#include<iostream>

#define height 16
#define width 11
#define BLOCKSIZE 16

using namespace std;

// Device Kernels

// Texture reference declaration
texture<float,2> texRefEx;

__global__ void kernel_w_textures(float* devMPPtr, float * devMPtr, int pitch)
{
    // Thread indexes
    unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;

    // Texture coordinates
    float u=(idx)/float(width);
    float v=(idy)/float(height);

    devMPtr[idy*width+idx]=devMPPtr[idy*pitch/sizeof(float)+idx];

    // Write texture contents to malloc array +1
    devMPtr[idy*width+idx]= tex2D(texRefEx,u,v);//+1.0f;
}

int main()
{
    // memory size
    size_t memsize=height*width;
    size_t offset;
    float *data,     // input from host
          *h_out,    // host space for output
          *devMPPtr, // malloc Pitch ptr
          *devMPtr;  // malloc ptr
    size_t pitch;

    // Allocate space on the host
    data=(float *)malloc(sizeof(float)*memsize);
    h_out=(float *)malloc(sizeof(float)*memsize);

    // Define data
    for (int i = 0; i < height; i++)
        for (int j=0; j < width; j++)
            data[i*width+j]=float(j);

    // Define the grid
    dim3 grid((int)(width/BLOCKSIZE)+1,(int)(height/BLOCKSIZE)+1),
         threads(BLOCKSIZE,BLOCKSIZE);

    // allocate Malloc Pitch
    cudaMallocPitch((void**)&devMPPtr,&pitch, width * sizeof(float), height);

    // Print the pitch
    printf("The pitch is %d \n",pitch/sizeof(float));

    // Texture Channel Description
    //cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32,0,0,0,cudaChannelFormatKindFloat);

    // Bind texture to pitch mem:
    cudaBindTexture2D(&offset,&texRefEx,devMPPtr,&channelDesc,width,height,pitch);

    cout << "My Description x is " << channelDesc.x << endl;
    cout << "My Description y is " << channelDesc.y << endl;
    cout << "My Description z is " << channelDesc.z << endl;
    cout << "My Description w is " << channelDesc.w << endl;
    cout << "My Description kind is " << channelDesc.f << endl;
    cout << "Offset is " << offset << endl;

    // Set mutable properties:
    texRefEx.normalized=true;
    texRefEx.addressMode[0]=cudaAddressModeWrap;
    texRefEx.addressMode[1]=cudaAddressModeWrap;
    texRefEx.filterMode= cudaFilterModePoint;

    // Allocate cudaMalloc memory
    cudaMalloc((void**)&devMPtr,memsize*sizeof(float));

    // Read data from host to device
    cudaMemcpy2D((void*)devMPPtr,pitch,(void*)data,sizeof(float)*width,
                 sizeof(float)*width,height,cudaMemcpyHostToDevice);

    // Read back and check this memory
    cudaMemcpy2D((void*)h_out,width*sizeof(float),(void*)devMPPtr,pitch,
                 sizeof(float)*width,height,cudaMemcpyDeviceToHost);

    // Print the memory
    for (int i=0; i<height; i++){
        for (int j=0; j<width; j++){
            printf("%2.2f ",h_out[i*width+j]);
        }
        cout << endl;
    }
    cout << "Done" << endl;

    // Memory is fine...
    kernel_w_textures<<<grid,threads>>>(devMPPtr, devMPtr, pitch);

    // Copy back data to host
    cudaMemcpy((void*)h_out,(void*)devMPtr,width*height*sizeof(float),cudaMemcpyDeviceToHost);

    // Print the Result
    cout << endl;
    for (int i=0; i<height; i++){
        for (int j=0; j<width; j++){
            printf("%2.2f ",h_out[i*width+j]);
        }
        cout << endl;
    }
    cout << "Done" << endl;

    return(0);
}

Edit, October 17th: I still have not found a solution to this problem. NVIDIA has been quite silent, and so, it seems, has the rest of the world. I found a workaround using shared memory, but if someone has a texture solution I would really like to see it.

Edit, October 26th: still no answer, but I'm still interested if anyone knows one.

Edit, July 26th: Wow, 9 months have passed, and the whole time I failed to notice the correct answer. The trick was:

  if (idx < width && idy < height) {
      // ... kernel code ...
  }

As stated in the answer below. Thanks to everyone who contributed!

5 answers

This is probably due to your block size. In this code you are launching 16x16 blocks of threads against an 11x16 memory region, which means some of your threads are writing to unallocated memory. This also explains why your 16*M by 32*N tests worked: no threads fell into unallocated memory, since those dimensions were multiples of 16.

An easy way to fix this problem is:

  if ((idx < width) && (idy < height)) {
      // write output
      devMPtr[idy*width+idx] = tex2D(texRefEx, u, v);
  }

You will need to either pass the width and height to the kernel as arguments, or copy them to device constant memory before the kernel call.
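
For completeness, a minimal sketch of what the guarded kernel might look like, assuming the texRefEx texture reference and the rest of the setup from the question; width and height are simply passed in as extra arguments (the names w and h are illustrative), and the coordinate computation is left as in the original:

  __global__ void kernel_w_textures(float* devMPPtr, float* devMPtr, int pitch,
                                    int w, int h)
  {
      unsigned int idx = blockIdx.x*blockDim.x + threadIdx.x;
      unsigned int idy = blockIdx.y*blockDim.y + threadIdx.y;

      // Threads outside the w x h region must not read or write anything
      if (idx >= w || idy >= h)
          return;

      float u = idx/float(w);
      float v = idy/float(h);

      devMPtr[idy*w + idx] = tex2D(texRefEx, u, v);
  }

  // Launched as:
  // kernel_w_textures<<<grid,threads>>>(devMPPtr, devMPtr, pitch, width, height);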


I think:

  float u = (idx)/float(width);
  float v = (idy)/float(height);

it should be

  float u = (idx + 0.5f)/float(width);
  float v = (idy + 0.5f)/float(height);

to get matching input and output. Without the offset, the second output column equals the first input column rather than the second, and the second-to-last output column is also wrong.

Please correct me if you have another observation.

  // Texture coordinates
  float u = (idx + 0.5)/float(width);
  float v = (idy + 0.5)/float(height);

You need the offset to land on the center of the texel. I suspect the rounding just happened to work out for your multiple-of-16 textures. I tried this and it worked for me (both outputs were identical).
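
A tiny host-side sketch (plain C++, purely illustrative and not part of the fix itself) that prints where the normalized coordinates land for width 11, with and without the half-texel offset:

  #include <cstdio>

  int main() {
      const int width = 11;
      for (int idx = 0; idx < width; ++idx) {
          float u_plain  = idx / float(width);          // as in the question
          float u_offset = (idx + 0.5f) / float(width); // with the half-texel offset
          // u*width is the texel coordinate the hardware sees under point filtering
          printf("idx=%2d   no offset: %f   with offset: %f\n",
                 idx, u_plain * width, u_offset * width);
      }
      return 0;
  }

Without the offset every coordinate sits exactly on the boundary between two texels, so point filtering has to round one way or the other; with the +0.5f it sits in the middle of the intended texel.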


Graphics cards usually expect texture dimensions to be powers of 2; this is especially true for older NVIDIA cards. CUDA's cudaMallocPitch and cudaMemcpy2D work with such pitches, and looking at your code, the safest fix is to round the width and height up yourself. Otherwise CUDA may write to invalid memory because it expects different offsets:

  #define height 16
  #define width 11
  ...
  size_t roundUpToPowerOf2(size_t v)
  {
      // See http://graphics.stanford.edu/~seander/bithacks.html#RoundUpPowerOf2
      --v;
      v |= v >> 1;
      v |= v >> 2;
      v |= v >> 4;
      v |= v >> 8;
      v |= v >> 16;
      ++v;
      return v;
  }
  ...
  size_t horizontal_pitch = roundUpToPowerOf2(width);
  size_t vertical_pitch   = roundUpToPowerOf2(height);
  size_t memsize = horizontal_pitch * vertical_pitch;
  ...
  // Read data from host to device
  cudaMemcpy2D((void*)devMPPtr,pitch,(void*)data,sizeof(float)*horizontal_pitch,
               sizeof(float)*width,height,cudaMemcpyHostToDevice);

  // Read back and check this memory
  cudaMemcpy2D((void*)h_out,horizontal_pitch*sizeof(float),(void*)devMPPtr,pitch,
               sizeof(float)*width,height,cudaMemcpyDeviceToHost);

  // Print the memory
  for (int i=0; i<height; i++){
      for (int j=0; j<width; j++){
          printf("%2.2f ",h_out[i*horizontal_pitch+j]);
      }
      cout << endl;
  }
  ...
  // Copy back data to host
  cudaMemcpy((void*)h_out,(void*)devMPtr,horizontal_pitch*vertical_pitch*sizeof(float),cudaMemcpyDeviceToHost);

  // Print the Result
  cout << endl;
  for (int i=0; i<height; i++){
      for (int j=0; j<width; j++){
          printf("%2.2f ",h_out[i*horizontal_pitch+j]);
      }
      cout << endl;
  }
  cout << "Done" << endl;

Hopefully I have not missed a spot where horizontal_pitch/vertical_pitch should be used instead of plain width/height.
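
For what it's worth, a quick standalone check of the rounding helper above (the same bit-hack, just pasted into a main); 11 rounds up to 16, so the 11-wide case would become a 16 x 16 allocation:

  #include <cstdio>
  #include <cstddef>

  size_t roundUpToPowerOf2(size_t v) {
      // Same bit-hack as in the answer above
      --v;
      v |= v >> 1; v |= v >> 2; v |= v >> 4;
      v |= v >> 8; v |= v >> 16;
      ++v;
      return v;
  }

  int main() {
      printf("%zu %zu %zu\n",
             roundUpToPowerOf2(11),   // 16
             roundUpToPowerOf2(16),   // 16
             roundUpToPowerOf2(17));  // 32
      return 0;
  }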


Perhaps take a look at this topic: http://forums.nvidia.com/index.php?showtopic=186585

Another very useful code sample is in the NVIDIA SDK: as mentioned in the thread above, the simplePitchLinearTexture example works well.

Since texture memory is being used, I believe the dimensions of a 2D texture need to be powers of 2 on some hardware, which is also what one of the answers above suggests.

