For loop access point

Question

For loop access point

I am trying to optimize this code.

static lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 ) { const size_t len1 = s1.size(), len2 = s2.size(); std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 ); const size_t prevColSize = prevCol.size(); for( unsigned int i = 0; i < prevColSize; i++ ) prevCol[i] = i; for( unsigned int i = 0, j; i < len1; ++i ) { col[0] = i+1; const char s1i = s1[i]; for( j = 0; j < len2; ++j ) { const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] ); col[j+1] = std::min( minPrev, prevCol[j] + ( s1i == s2[j] ? 0 : 1 ) ); } col.swap( prevCol ); } return prevCol[len2]; }

Intel VTune shows that about half of the processor time is spent in the second for command , and not in two lines inside the for loop. When I expand the assembly source, I see that the for C ++ instruction has been translated into a series of operation codes, 3 of which seem to devour processor time:

 Code Location Source Line Assembly CPU Time Block 14: [Unknown] 0x420c00 31 movq (%r12), %rcx 19.969ms 0x420c04 30 add $0x1, %r11d [Unknown] 0x420c08 32 test %rbx, %rbx [Unknown] 0x420c0b 30 movl %r11d, (%r8) [Unknown] 0x420c0e 31 movzxb (%rcx,%rdx,1), %r9d 19.964ms 0x420c13 32 jz 0x420c53 <Block 17> [Unknown] Block 15: [Unknown] 0x420c15 32 movq (%rbp), %r10 [Unknown] 0x420c19 32 mov %r11d, %edx [Unknown] 0x420c1c 32 xor %ecx, %ecx 39.928ms 0x420c1e 32 xor %edi, %edi [Unknown] Block 16: [Unknown] 0x420c20 34 add $0x1, %edi 29.994ms 0x420c23 34 mov %edi, %esi 30.956ms 0x420c25 34 movl (%rax,%rsi,4), %r15d 180.659ms 0x420c29 34 cmp %r15d, %edx 39.896ms 0x420c2c 34 cmovbe %edx, %r15d 19.951ms 0x420c30 35 xor %edx, %edx 460.772ms 0x420c32 34 add $0x1, %r15d 19.946ms 0x420c36 35 cmpb (%r10,%rcx,1), %r9b 169.659ms 0x420c3a 35 setnz %dl 49.815ms 0x420c3d 35 addl (%rax,%rcx,4), %edx [Unknown] 0x420c40 32 mov %rsi, %rcx 210.615ms <------------------ 0x420c43 32 cmp %edx, %r15d 29.936ms 0x420c46 32 cmovbe %r15d, %edx 29.938ms 0x420c4a 32 cmp %rsi, %rbx 558.298ms <------------------- 0x420c4d 35 movl %edx, (%r8,%rsi,4) 19.965ms 0x420c51 32 jnbe 0x420c20 <Block 16> 200.625ms <-------------------

I don’t understand how a simple move and comparison can be time-consuming.

+6

c ++ optimization assembly profiling vtune

qdii Apr 22 '13 at 10:37

source share

1 answer

Evgeny Kluev · Accepted Answer · 2013-04-22T11:27:26+0000

Profiler cannot show you the most time-consuming instructions, because all modern processors use non-standard and speculative execution. It is not unusual to see the longest measured time on one or two lines from the most laborious instructions.

As expected, the most laborious instructions here are cmovbe (implementation of std::min ). You can see the biggest times near them: 460.772ms and 558.298ms. cmovbe are the most time-consuming instructions, since they usually have a longer delay and more dependencies on previous instructions.

For loop access point

More articles: