Cycle time

Question

Cycle time

I analyze and measure and get different results, and my analysis and measurement. The code is two loops with a data cache of 512 bytes and a block size of 32 bytes:

int SumByColRow (int matrix[M][M], int size) { int i, j, Sum = 0; for (j = 0; j < size; j ++) { for (i = 0; i < size; i ++) { Sum += matrix[i][j]; } } return Sum; } int SumByRowCol (int matrix[M][M], int size) { int i, j, Sum = 0; for (i = 0; i < size; i ++) { for (j = 0; j < size; j ++) { Sum += matrix[i][j]; } } return Sum; }

I think it’s better not to switch rows in the inner loop, since C stores matrices row by row, and therefore SumByRowCol should be faster, but in the dimension it is the other way around. I thought it would be faster when the cache, due to the principle of spatial locality, can make internal loops faster, since the values are taken from sequential elements? What is the reason that the measurement runtime is actually measured, that SumByColRow is actually faster?

 SumByColRow: Result: 31744 6415.29 us(641529 ticks) SumByRowCol: Result: 31744 7336.47 us(733647 ticks)

Update

I ran the program again, making sure that I was actually using the data cache, and this time the result was as expected, so the above result may be a coincidence, and the following looks more like it:

 SumByColRow: Result: 31744 5961.13 us(596113 ticks) SumByRowCol: Result: 31744 2328.89 us(232889 ticks)

+7

c caching altera nios

Niklas Rosencrantz October 8th. '13 at 8:32

source share

1 answer

Jonathan leffler · Accepted Answer · 2013-10-08T09:22:45+0000

I can offer a counter example based on your code.

The code

 #include "timer.h" #include <stdio.h> enum { M = 128 }; extern int SumByColRow (int matrix[M][M], int size); extern int SumByRowCol (int matrix[M][M], int size); int SumByColRow (int matrix[M][M], int size) { int Sum = 0; for (int j = 0; j < size; j ++) { for (int i = 0; i < size; i ++) Sum += matrix[i][j]; } return Sum; } int SumByRowCol (int matrix[M][M], int size) { int Sum = 0; for (int i = 0; i < size; i ++) { for (int j = 0; j < size; j ++) Sum += matrix[i][j]; } return Sum; } static inline int max(int i, int j) { return (i > j) ? i : j; } int main(void) { int matrix[M][M]; for (int i = 0; i < M; i++) for (int j = 0; j < M; j++) matrix[i][j] = 1000*i + j; Clock clk; unsigned long long x[M]; char buffer[32]; unsigned long long sum; clk_init(&clk); clk_start(&clk); for (int i = 0; i < M; i++) x[i] = SumByColRow(matrix, max(M - i, 10)); clk_stop(&clk); sum = 0; for (int i = 0; i < M; i++) sum += x[i]; printf("SumByColRow: value = %llu, time = %s\n", sum, clk_elapsed_us(&clk, buffer, sizeof(buffer))); clk_start(&clk); for (int i = 0; i < M; i++) x[i] = SumByRowCol(matrix, max(M - i, 10)); clk_stop(&clk); sum = 0; for (int i = 0; i < M; i++) sum += x[i]; printf("SumByRowCol: value = %llu, time = %s\n", sum, clk_elapsed_us(&clk, buffer, sizeof(buffer))); return 0; }

Two SumBy functions practically do not change (small musical tweaks, but no more). The synchronization node contains the start and stop times in the Clock structure, and the clk_elapsed_us() function formats the elapsed time in microseconds into the string passed to it.

Mess with x[i] etc. should (try and) make sure that the compiler is not optimizing everything.

Exit

Machine: Mac OS X 10.8.5, GCC (i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (based on Apple Inc. build 5658) (LLVM build 2336.11.00)), Intel Core 2 Duo at a frequency of 2 GHz, 4 GB of DDR3 RAM with a capacity of 1067 MHz (Mac mini "Early 2009").

 SumByColRow: value = 33764046316, time = 0.002411 SumByRowCol: value = 33764046316, time = 0.000677

This shows the expected result - the calculation of columns by rows is slower, because the matrix is large enough to cover pages (64 KiB). It is still unclear what size M , and what size is passed to SumBy functions, but with an array "large enough" and various sizes, you can get the expected performance pattern.

Those times are not big enough for comfort — I'd rather have a lower time on the order of a second or two. Adding a for (int j = 0; j < 1600; j++) loop for (int j = 0; j < 1600; j++) before each of the time loops in the main program gives:

 SumByColRow: value = 33764046316, time = 2.529205 SumByRowCol: value = 33764046316, time = 1.022970

The ratio is smaller (3.56 versus 2.47), but still strongly rejected in favor of SumByRowCol() .

The initialization of the matrix heats the cache to the extent that it can be warmed. Reversing the order of calculations (SumByRowCol before SumByColRow) does not have a significant effect on timings. The results are pretty consistent across multiple runs.

Assembler output

Compiled with gcc -O3 -std=c99 -S :

  .section __TEXT,__text,regular,pure_instructions .globl _SumByColRow .align 4, 0x90 _SumByColRow: Leh_func_begin1: pushq %rbp Ltmp0: movq %rsp, %rbp Ltmp1: testl %esi, %esi jg LBB1_5 xorl %eax, %eax LBB1_2: popq %rbp ret LBB1_5: movl %esi, %ecx xorl %eax, %eax movq %rcx, %rdx jmp LBB1_6 .align 4, 0x90 LBB1_3: addl (%r8), %eax addq $512, %r8 decq %rsi jne LBB1_3 addq $4, %rdi decq %rdx je LBB1_2 LBB1_6: movq %rcx, %rsi movq %rdi, %r8 jmp LBB1_3 Leh_func_end1: .globl _SumByRowCol .align 4, 0x90 _SumByRowCol: Leh_func_begin2: pushq %rbp Ltmp2: movq %rsp, %rbp Ltmp3: testl %esi, %esi jg LBB2_5 xorl %eax, %eax LBB2_2: popq %rbp ret LBB2_5: movl %esi, %ecx xorl %eax, %eax movq %rcx, %rdx jmp LBB2_6 .align 4, 0x90 LBB2_3: addl (%r8), %eax addq $4, %r8 decq %rsi jne LBB2_3 addq $512, %rdi decq %rdx je LBB2_2 LBB2_6: movq %rcx, %rsi movq %rdi, %r8 jmp LBB2_3 Leh_func_end2:

Cycle time

Update

The code

Exit

Assembler output

More articles: