Understanding whether a code sample is CPU bound or memory bound

Question


As a general question to those working on optimization and performance tuning of programs: how do you figure out whether your code is CPU bound or memory bound? I understand these concepts in general, but if I have, say, "y" loads and stores and "2y" computations, how do I find out what the bottleneck is?
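One common way to reason about a ratio of memory operations to computations is a roofline-style comparison: compute the kernel's arithmetic intensity (flops per byte moved) and compare it with the machine's balance (peak flop rate divided by peak bandwidth). This is a technique brought in for illustration, not something the question or the answer states. The sketch below assumes 8-byte operands, and the peak numbers passed in are hypothetical placeholders that you would replace with measured values for your own machine.

```c
#include <assert.h>

/* Classification result for a kernel under a simple roofline-style model. */
typedef enum { MEMORY_BOUND, COMPUTE_BOUND } bound_t;

/* Arithmetic intensity: floating-point operations per byte moved to or
   from the level of the memory hierarchy being considered. */
static double arithmetic_intensity(double flops, double bytes)
{
    assert(bytes > 0.0);
    return flops / bytes;
}

/* Machine balance in flops per byte. Both peaks are inputs that must be
   measured or looked up; nothing here knows about a real machine. */
static double machine_balance(double peak_flops_per_s, double peak_bytes_per_s)
{
    assert(peak_bytes_per_s > 0.0);
    return peak_flops_per_s / peak_bytes_per_s;
}

/* In this model, a kernel whose intensity is below the machine balance
   cannot reach the peak compute rate, so it is classified as memory bound. */
static bound_t classify(double flops, double bytes,
                        double peak_flops_per_s, double peak_bytes_per_s)
{
    double ai = arithmetic_intensity(flops, bytes);
    double mb = machine_balance(peak_flops_per_s, peak_bytes_per_s);
    return ai < mb ? MEMORY_BOUND : COMPUTE_BOUND;
}
```

For example, with y = 1000 loads and stores of 8-byte doubles and 2y = 2000 computations, the intensity is 2000 / 8000 = 0.25 flops per byte. On a hypothetical machine with a balance of 1 flop per byte, classify(2000, 8000, 10e9, 10e9) reports MEMORY_BOUND, provided that all of those loads and stores actually go to main memory rather than hitting in cache. The model is only a first estimate and does not replace profiling.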

Also, can you figure out exactly where you are spending most of your time and say that, if you load "x" amount of data into the cache in every loop iteration (assuming the code is memory bound), your code will run faster? Is there any precise way to determine this "x", other than trial and error?

Are there any tools that you would use, say, on the IA-32 or IA-64 architecture? Does VTune help?

For example, I am currently doing the following:

I have 26 8*8 matrices of complex doubles, and I have to perform an MVM (matrix-vector multiplication) with about 4000 vectors of length 8 for each of these 26 matrices. I use SSE to perform the complex multiplication.

    /* Copy 26 matrices to temporary storage */
    for (int i = 0; i < 4000; i += 2) {   // Loop over the 4000 vectors
        for (int k = 0; k < 26; k++) {    // Loop over the 26 matrices
            /* Perform MVM in blocks of '2' between kth matrix and 'i' and 'i+1' vector */
        }
    }
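For reference, the loop above can be written out as a plain scalar version, which is useful as a correctness baseline when comparing against an SSE kernel. This is a minimal sketch and not the asker's code: the row-major matrix layout, the function names mvm8 and mvm_all, and the out result buffer are all assumptions made for illustration.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

#define N_MAT 26   /* number of 8x8 matrices, from the question */
#define DIM   8    /* matrix and vector dimension, from the question */

/* y = A * x for one DIM x DIM complex matrix.
   Row-major storage of A is an assumption. */
static void mvm8(const double complex *A, const double complex *x,
                 double complex *y)
{
    for (int r = 0; r < DIM; r++) {
        double complex acc = 0.0;
        for (int c = 0; c < DIM; c++)
            acc += A[r * DIM + c] * x[c];
        y[r] = acc;
    }
}

/* Same loop order as the question: vectors in pairs in the outer loop,
   matrices in the inner loop. out is a hypothetical result buffer holding
   n_vec * N_MAT * DIM elements, indexed by (vector, matrix, row). */
static void mvm_all(const double complex *mats, const double complex *vecs,
                    double complex *out, size_t n_vec)
{
    assert(n_vec % 2 == 0);   /* vectors are processed two at a time */
    for (size_t i = 0; i < n_vec; i += 2) {
        for (size_t k = 0; k < N_MAT; k++) {
            const double complex *A = mats + k * DIM * DIM;
            mvm8(A, vecs + i * DIM,       out + (i * N_MAT + k) * DIM);
            mvm8(A, vecs + (i + 1) * DIM, out + ((i + 1) * N_MAT + k) * DIM);
        }
    }
}
```

Processing two vectors per matrix visit means each matrix row loaded from cache is reused for both vectors, which is the point of working in blocks of two.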

The 26 matrices take 26 KB (the L1 cache is 32 KB), and I have laid the vectors out in memory so that I have stride-1 accesses. Once I have performed the MVM on a vector with the 26 matrices, I don't visit that vector again, so I don't think cache blocking will help. I have used vectorization, but I am still stuck at 60% of peak performance.
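The footprint figures can be checked with a little arithmetic: one complex double is 16 bytes, so an 8*8 matrix is 1024 bytes and a length-8 vector is 128 bytes. The sketch below, with hypothetical helper names, computes these sizes and compares them against a cache size supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* One complex double is stored as two doubles (real and imaginary part). */
static size_t complex_double_bytes(void)
{
    return 2 * sizeof(double);
}

/* Bytes occupied by n_mat dense dim x dim complex-double matrices. */
static size_t matrices_footprint(size_t n_mat, size_t dim)
{
    assert(dim > 0);
    return n_mat * dim * dim * complex_double_bytes();
}

/* Bytes occupied by n_vec complex-double vectors of length dim. */
static size_t vectors_footprint(size_t n_vec, size_t dim)
{
    assert(dim > 0);
    return n_vec * dim * complex_double_bytes();
}

/* Returns 1 if a working set of the given size fits in cache_bytes.
   This ignores associativity and anything else sharing the cache. */
static int fits_in_cache(size_t bytes, size_t cache_bytes)
{
    return bytes <= cache_bytes;
}
```

With 8-byte doubles, matrices_footprint(26, 8) is 26624 bytes (26 KB), and adding the two vectors processed per outer iteration (256 bytes) still stays under 32 KB. The full set of 4000 vectors is 512000 bytes (500 KB), so the vectors stream through the cache while the matrices can stay resident. Result vectors, if stored, add to the traffic and are not counted here.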

I tried copying, say, 64 vectors into temporary storage in every iteration of the outer loop, thinking that they would then be in cache and that this would help, but it only decreased performance. I also tried using _mm_prefetch() in the following way: when I am done with about half of the matrices, I prefetch the next "i" and "i+1" vectors, but that hasn't helped either.

I have done all this on the assumption that the code is memory bound, but I want to know for sure. Is there a way?


performance optimization c sse

user1715122 Feb 20 '13 at 19:38


1 answer

Shan · Answer 1 · 2013-04-26T07:43:00+0000

As far as I understand, the best way is to profile your application or workload. Its characteristics can vary significantly with the input data. However, this behavior can be quantified as a small number of phases [2, 3], and a histogram can broadly show the most frequent path of the workload, which is the one to optimize. The question you are asking would also require benchmark programs (for example SPEC2006, PARSEC, MediaBench) for a given architecture, and it is difficult to answer in general terms (it is an active area of research in computer architecture). However, for specific cases a quantitative result can be stated for different memory hierarchies. You can use tools such as:

Perf
OProfile
VTune
Likwid
LTTng

and other monitoring and simulation tools to get profiling traces of the application. You can look at performance counters such as IPC and CPI (to judge whether the code is CPU bound) and at memory accesses, cache misses, cache accesses, and other memory counters (to judge whether it is memory bound). Like IPC, memory accesses per cycle (MPC) is often used to determine how memory bound an application or workload is. To improve matrix multiplication specifically, I would suggest using an optimized algorithm, as in LAPACK.
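The metrics named above are simple ratios of raw event counts. The following sketch shows how IPC, CPI, MPC, and a cache miss ratio could be derived once the counts have been read from a profiler. The struct and its field names are hypothetical and do not correspond to the event names of any particular tool.

```c
#include <assert.h>

/* Raw event counts for one run, as read from a profiler.
   Field names are hypothetical, not real tool event names. */
typedef struct {
    unsigned long long cycles;
    unsigned long long instructions;
    unsigned long long mem_accesses;
    unsigned long long cache_misses;
} counters_t;

/* Instructions per cycle: low values can indicate stalls, for example
   while waiting on memory. */
static double ipc(const counters_t *c)
{
    assert(c->cycles > 0);
    return (double)c->instructions / (double)c->cycles;
}

/* Cycles per instruction: the reciprocal of IPC. */
static double cpi(const counters_t *c)
{
    assert(c->instructions > 0);
    return (double)c->cycles / (double)c->instructions;
}

/* Memory accesses per cycle (MPC). */
static double mpc(const counters_t *c)
{
    assert(c->cycles > 0);
    return (double)c->mem_accesses / (double)c->cycles;
}

/* Fraction of memory accesses that missed in the cache being counted. */
static double miss_ratio(const counters_t *c)
{
    assert(c->mem_accesses > 0);
    return (double)c->cache_misses / (double)c->mem_accesses;
}
```

On Linux, perf stat reports cycles and instructions for a run, which is enough to compute IPC and CPI. Which events supply the memory access and cache miss counts depends on the processor and the tool, so those have to be looked up for the machine at hand. No single threshold separates CPU bound from memory bound code, so the ratios are most useful when compared across variants of the same kernel.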

