As a general question to those who work on optimizing and tuning program performance: how do you figure out whether your code is CPU bound or memory bound? I understand these concepts in general, but if I have, say, "y" loads and stores and "2y" computations, how do I go about finding the bottleneck?
Also, can you figure out exactly where you are spending most of your time and say that, if you load "x" amount of data into cache in every loop iteration (assuming the code is memory bound), your code will run faster? Is there any precise way to determine this "x" other than trial and error?
Are there any tools you would use, say on the IA-32 or IA-64 architecture? Does VTune help?
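To make the "y loads and stores, 2y computations" example concrete, here is a minimal sketch of the back-of-the-envelope estimate I have in mind, in the style of the roofline model. The function names are made up for illustration, and the peak rate and bandwidth are inputs that would have to come from a data sheet or a benchmark, not from this code.

```c
#include <stdbool.h>

/* Arithmetic intensity: floating-point operations per byte moved
   to or from memory. */
double arithmetic_intensity(double flops, double bytes)
{
    return flops / bytes;
}

/* Roofline-style upper bound on the achievable rate (FLOP/s):
   the smaller of the compute peak and what the memory system can feed. */
double attainable_flops(double intensity,
                        double peak_flops,     /* FLOP/s, assumed known */
                        double mem_bandwidth)  /* bytes/s, assumed known */
{
    double memory_roof = intensity * mem_bandwidth;
    return memory_roof < peak_flops ? memory_roof : peak_flops;
}

/* The estimate calls a kernel memory bound when the memory roof
   lies below the compute peak. */
bool looks_memory_bound(double intensity, double peak_flops,
                        double mem_bandwidth)
{
    return intensity * mem_bandwidth < peak_flops;
}
```

With y loads and stores of 8-byte doubles and 2y operations, the intensity is 2y / 8y = 0.25 FLOP per byte. On a hypothetical machine with a 10 GFLOP/s peak and 10 GB/s of bandwidth, that caps the estimate at 2.5 GFLOP/s, so such a loop would look memory bound. The estimate ignores cache reuse, which is exactly the part I am unsure about.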
For example, I am currently doing the following:
I have 26 8*8 matrices of complex doubles, and I have to perform an MVM (matrix-vector multiplication) with ~4000 vectors of length 8 for each of these 26 matrices. I use SSE to perform the complex multiplication.
    /* Copy 26 matrices to temporary storage */
    for (int i = 0; i < 4000; i += 2) {    // Loop over the 4000 vectors
        for (int k = 0; k < 26; k++) {     // Loop over the 26 matrices
            /* Perform MVM in blocks of '2' between kth matrix and 'i' and 'i+1' vector */
        }
    }
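For reference, the commented step in the inner loop corresponds to something like the scalar sketch below. This is not my actual SSE kernel: the names `mvm8` and `run_all`, the row-major matrix layout, and the choice to store every result separately are assumptions made for illustration.

```c
#include <complex.h>
#include <stddef.h>

enum { N = 8, NMAT = 26 };   /* sizes from the post */

/* y = A * x for one 8x8 complex matrix, row-major (assumed layout). */
void mvm8(const double complex *A, const double complex *x,
          double complex *y)
{
    for (size_t r = 0; r < N; r++) {
        double complex acc = 0.0;
        for (size_t c = 0; c < N; c++)
            acc += A[r * N + c] * x[c];
        y[r] = acc;
    }
}

/* Same loop order as above: vectors outside (two at a time),
   matrices inside.
   mats: NMAT*N*N entries, vecs: nvec*N entries,
   out:  nvec*NMAT*N entries (hypothetical result layout).
   nvec is assumed to be even. */
void run_all(const double complex *mats, const double complex *vecs,
             double complex *out, size_t nvec)
{
    for (size_t i = 0; i < nvec; i += 2) {
        for (size_t k = 0; k < NMAT; k++) {
            const double complex *A = mats + k * N * N;
            mvm8(A, vecs + i * N,       out + (i * NMAT + k) * N);
            mvm8(A, vecs + (i + 1) * N, out + ((i + 1) * NMAT + k) * N);
        }
    }
}
```

Processing two vectors per matrix means each matrix is read once per pair, which is the point of the "blocks of 2" in the loop.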
The 26 matrices take 26 KB (the L1 cache is 32 KB), and I have laid the vectors out in memory so that I get stride-1 accesses. Once I have performed the MVM on a vector with all 26 matrices, I don't visit it again, so I don't think cache blocking will help. I have used vectorization, but I'm still stuck at 60% of peak performance.
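The 26 KB figure and the amount of work per vector follow from a quick count, sketched below. It assumes 16 bytes per complex double, counts a complex multiply as 6 real operations and a complex add as 2, and uses helper names invented for this illustration.

```c
#include <stddef.h>

enum { N = 8, NMAT = 26, BYTES_PER_COMPLEX = 16 };

/* Storage for one 8x8 complex matrix: 64 entries of 16 bytes. */
size_t matrix_bytes(void) { return (size_t)N * N * BYTES_PER_COMPLEX; }

/* Storage for all 26 matrices. */
size_t all_matrices_bytes(void) { return NMAT * matrix_bytes(); }

/* Storage for one length-8 complex vector. */
size_t vector_bytes(void) { return (size_t)N * BYTES_PER_COMPLEX; }

/* Real floating-point operations in one 8x8 complex MVM:
   N*N complex multiplies (6 real ops each) plus
   N*(N-1) complex adds (2 real ops each). */
size_t flops_per_mvm(void)
{
    return (size_t)N * N * 6 + (size_t)N * (N - 1) * 2;
}

/* Work done on one input vector across all 26 matrices. */
size_t flops_per_vector(void) { return NMAT * flops_per_mvm(); }
```

So the matrices occupy 26,624 bytes (26 KB), and each 128-byte input vector takes part in 12,896 real operations, or roughly 100 operations per byte read for the vector itself. Whether the traffic for the matrices and the results stays in L1 is what this count cannot tell me.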
I tried copying, say, 64 vectors into temporary storage on every iteration of the outer loop, thinking that they would then be in cache and that this would help, but it only decreased performance. I also tried using _mm_prefetch() as follows: when I am done with about half the matrices, I prefetch the next 'i' and 'i+1' vectors, but that hasn't helped either.
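The prefetch attempt looks roughly like the sketch below. `_mm_prefetch` and `_MM_HINT_T0` come from `<xmmintrin.h>`; the 64-byte cache line size, the helper name `prefetch_range`, and its return value (added only so the helper can be checked) are assumptions for illustration, not my exact code.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

enum { CACHE_LINE = 64 };   /* assumed cache line size in bytes */

/* Issue one prefetch hint per cache line covering [p, p + bytes),
   assuming p is aligned to a cache line.
   Returns the number of hints issued. */
size_t prefetch_range(const void *p, size_t bytes)
{
    const char *c = (const char *)p;
    size_t issued = 0;
    for (size_t off = 0; off < bytes; off += CACHE_LINE) {
        _mm_prefetch(c + off, _MM_HINT_T0);   /* hint: fetch into all cache levels */
        issued++;
    }
    return issued;
}
```

In the inner loop this would be called once, around `k == 13`, on the address of the next pair of vectors with a size of 2 * 8 * 16 = 256 bytes, which is four lines under the 64-byte assumption. A prefetch is only a hint, so the call never changes the results.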
I have done all this on the assumption that the code is memory bound, but I want to know for sure. Is there a way?