Speeding up conversion calculations

I am programming an OpenGL 3 2D engine, and I'm currently trying to solve a bottleneck. Here is the AMD profiler output: http://h7.abload.de/img/profilerausa.png

The data was captured using several thousand sprites.

However, with 50,000 sprites the test app becomes unusable, running at 5 frames per second.

This shows that my bottleneck is the transform function I use. Here is the corresponding function: http://code.google.com/p/nightlight2d/source/browse/NightLightDLL/NLBoundingBox.cpp#130

    void NLBoundingBox::applyTransform(NLVertexData* vertices)
    {
        if ( needsTransform() )
        {
            // Apply Matrix
            for ( int i=0; i<6; i++ )
            {
                glm::vec4 transformed = m_rotation * m_translation * glm::vec4(vertices[i].x, vertices[i].y, 0, 1.0f);
                vertices[i].x = transformed.x;
                vertices[i].y = transformed.y;
            }
            m_translation = glm::mat4(1);
            m_rotation = glm::mat4(1);
            m_needsTransform = false;
        }
    }

I can't do this in the shader, because I batch all the sprites together and render them at once. This means I have to use the CPU to calculate the transforms.

My question is: What is the best way to solve this bottleneck?

I don't use threads at the moment, so when vsync is on I take an extra performance hit because the main thread waits for the screen refresh. This tells me that I should use threads.

Can I use OpenCL for this? I want to avoid CUDA because, as far as I know, it only runs on NVIDIA cards. Is that right?

PS:

You can download the demo here if you want:

http://www63.zippyshare.com/v/45025690/file.html

Please note that this requires VC++ 2008 to be installed, since it is a debug build for running under the profiler.

+7
4 answers

The first thing I would do is combine your rotation and translation matrices into a single matrix before entering the for loop, so that you don't perform two matrix-matrix multiplications plus a matrix-vector multiplication on every iteration; instead you'd do just one matrix-vector multiplication per vertex. Secondly, you may want to unroll your loop and compile with a higher optimization level (in g++ I would use at least -O2, but I am not familiar with MSVC, so you'll have to translate that optimization level yourself). This avoids the per-iteration loop overhead and helps the instruction cache. Finally, if you haven't studied it yet, check out some SSE optimizations, since you are dealing with vectors.
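A minimal sketch of the combined-matrix idea. To keep it self-contained without GLM, it uses tiny stand-in Mat4/Vec4 types (row-major, for clarity); in the real code these would be glm::mat4/glm::vec4, and the names Mat4, Vec4, VertexXY and applyTransformCombined are illustrative, not from the original source:

```cpp
#include <cstddef>

// Tiny stand-ins for glm::mat4 / glm::vec4 (row-major, for clarity).
struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[4][4]; };

// Matrix * matrix: done once, outside the vertex loop.
Mat4 mul(const Mat4& a, const Mat4& b) {
    Mat4 r{}; // zero-initialized
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Matrix * vector: the only work done per vertex.
Vec4 mul(const Mat4& a, const Vec4& v) {
    return { a.m[0][0]*v.x + a.m[0][1]*v.y + a.m[0][2]*v.z + a.m[0][3]*v.w,
             a.m[1][0]*v.x + a.m[1][1]*v.y + a.m[1][2]*v.z + a.m[1][3]*v.w,
             a.m[2][0]*v.x + a.m[2][1]*v.y + a.m[2][2]*v.z + a.m[2][3]*v.w,
             a.m[3][0]*v.x + a.m[3][1]*v.y + a.m[3][2]*v.z + a.m[3][3]*v.w };
}

struct VertexXY { float x, y; }; // simplified stand-in for NLVertexData

// Hoist the matrix product out of the loop: one mat4*mat4 up front,
// then a single mat4*vec4 per vertex instead of two matrix products each.
void applyTransformCombined(VertexXY* vertices, size_t count,
                            const Mat4& rotation, const Mat4& translation) {
    const Mat4 combined = mul(rotation, translation); // computed once
    for (size_t i = 0; i < count; ++i) {
        Vec4 v = mul(combined, Vec4{vertices[i].x, vertices[i].y, 0.f, 1.f});
        vertices[i].x = v.x;
        vertices[i].y = v.y;
    }
}
```

With GLM you would write the same thing as `const glm::mat4 combined = m_rotation * m_translation;` before the loop.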

UPDATE: I'm going to add one final idea that involves threads... basically, pipeline the vertices across your threads. For example, say you have a machine with eight available hardware threads (i.e., a quad core with hyper-threading). Set up six threads as a vertex-processing pipeline, and use non-blocking single-producer/single-consumer queues to pass messages between the pipeline stages. Each stage transforms one member of your six-element vertex array. I assume there are a great many of these six-element vertex arrays, so by streaming them through the pipeline you can keep the threads busy very efficiently and avoid mutexes and other blocking semaphores, etc. For more information on a fast non-blocking single-producer/single-consumer queue, see my answer here .
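For reference, a minimal sketch of the kind of non-blocking single-producer/single-consumer queue such a pipeline would use (this is a generic bounded ring buffer; the class and method names are illustrative, not taken from the linked answer):

```cpp
#include <atomic>
#include <cstddef>

// Minimal single-producer/single-consumer lock-free ring buffer.
// Safe only when exactly one thread pushes and one thread pops.
// One slot is kept empty to distinguish "full" from "empty".
template <typename T, size_t Capacity>
class SpscQueue {
    T buf_[Capacity];
    std::atomic<size_t> head_{0}; // next slot the consumer reads
    std::atomic<size_t> tail_{0}; // next slot the producer writes

public:
    bool try_push(const T& v) {
        const size_t t = tail_.load(std::memory_order_relaxed);
        const size_t next = (t + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire))
            return false; // queue full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release); // publish to consumer
        return true;
    }

    bool try_pop(T& out) {
        const size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false; // queue empty
        out = buf_[h];
        head_.store((h + 1) % Capacity, std::memory_order_release); // free slot
        return true;
    }
};
```

Each pipeline stage would own one of these queues as its inbox, popping a vertex batch, transforming its assigned element, and pushing the batch on to the next stage.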

UPDATE 2: You only have a dual-core processor... so drop the pipeline idea, as it will bottleneck with the threads contending for CPU resources.

+4

I can't do this in the shader, because I batch all the sprites together and render them at once. This means I have to use the CPU to calculate the transforms.

This sounds suspiciously like a premature optimization: you assumed that batching is the most important thing you can do, so you structured your renderer to issue the fewest possible draw calls. And now it's coming back to bite you.

What you need is not the fewest batches; you need the right number of batches. You know you have gone too far with batching when you give up GPU vertex transforms in favor of CPU transforms.

As Datenwolf suggested, you need to bite the bullet and get the transforms onto the GPU. But even then, you need to cut out some of the extra work you are doing. You haven't said much about what kind of scene you are rendering (tilemaps with sprites on top, a large particle system, etc.), so it's hard to know what to suggest.

GLM is an excellent math library, but it is not designed for maximum performance. It is generally not what I would use if I needed to transform 300,000 vertices on the CPU every frame.

+2

The assignment inside the loop could be a problem, but I'm not familiar with the library. Moving it outside the for loop and assigning the fields manually might help. Moving the matrix multiplication out of the loop would also help.

Edit:

This is more like what I was thinking:

    // Apply Matrix
    glm::vec4 transformed;
    glm::mat4 translation = m_rotation * m_translation; // combined, computed once
    for ( int i=0; i<6; i++ )
    {
        transformed.x = vertices[i].x;
        transformed.y = vertices[i].y;
        transformed.z = 0.f; // NLVertexData is 2D; the original code passes 0 for z
        transformed.w = 1.f;
        transformed = translation * transformed; // GLM has operator*, not a .mult() member
        vertices[i].x = transformed.x;
        vertices[i].y = transformed.y;
    }

Perhaps the assignment prevents the compiler from inlining or unrolling something. I'd guess the matrix multiply is large enough to knock this loop out of the instruction cache. Really, once you start reasoning about cache sizes, you won't be on stable ground across platforms.

You could also trade some stack space to do more work in fewer, simpler loops:

    glm::vec4 transformed[6];
    for (size_t i = 0; i < 6; i++)
    {
        transformed[i].x = vertices[i].x;
        transformed[i].y = vertices[i].y;
        transformed[i].z = 0.f; // NLVertexData is 2D
        transformed[i].w = 1.f;
    }
    glm::mat4 translation = m_rotation * m_translation; // combined, computed once
    for (size_t i = 0; i < 6; i++)
    {
        transformed[i] = translation * transformed[i]; // GLM has operator*, not .mult()
    }
    for (size_t i = 0; i < 6; i++)
    {
        vertices[i].x = transformed[i].x;
        vertices[i].y = transformed[i].y;
    }

As Jason said, manually unrolling these loops would be interesting.

I really don't think you will see an order-of-magnitude improvement from any of these changes, though.

I suspect that calling this function less often matters more than speeding the function up. The fact that you have to check whether a transform is needed inside the function makes me think it is probably being called more often than necessary.

When high-level problems like this leak into low-level code, callers end up blindly invoking the method, assuming it is free. And whatever your assumptions are about how often the transform is actually needed, they could be completely wrong.

The reality is that you should only need to call this method once: call applyTransform when you want to apply the transform, not whenever you happen to be able to. Interfaces are a contract; treat them as such.

+1

If you insist on doing your calculations on the CPU, you should do the math yourself.

You are currently using 4x4 matrices in a 2D environment, where a single 2x2 matrix is enough for the rotation and a simple vector for the translation. That is four multiplications and two additions for the rotation, plus two more additions for the translation.

If you absolutely need a matrix form (because you want to combine translation and rotation in one object), it would still be much less work than what you have now. But you can also combine the two "manually": translate the point, rotate it, and then translate it again, which may be slightly faster than a matrix multiply, though I'm not sure about that.

Compared to the work those 4x4 matrix multiplications are doing right now, this is far less.
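A sketch of the 2D math described above (the type and function names here are illustrative): a 2x2 rotation costs four multiplies and two adds per point, and the translation two more adds:

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// Rotate p by angle (radians) about the origin, then translate by t.
// Cost per point: 4 multiplies + 2 adds (rotation) + 2 adds (translation),
// versus a full 4x4 matrix-vector product (16 multiplies + 12 adds).
Vec2 transform2D(Vec2 p, float angle, Vec2 t) {
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    return { c * p.x - s * p.y + t.x,
             s * p.x + c * p.y + t.y };
}
```

For many sprites sharing one rotation, `c` and `s` would of course be computed once outside the per-vertex loop.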

+1
