What is the most efficient way to compare two memory blocks in D?

I need a comparison function for memory blocks so that I can do binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The memory blocks to be compared are already known to be the same length.

C's memcmp is actually pretty slow because it tries to preserve useful string-comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably without dipping into non-portable CPU-specific instructions?

    // Faster than C memcmp because it doesn't preserve any meaningful
    // semantics. It is just a completely arbitrary, but really fast,
    // comparison function.
    int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
        for(; n >= uint.sizeof; n -= uint.sizeof) {
            if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
                return -1;
            } else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
                return 1;
            }
            lhs += uint.sizeof;
            rhs += uint.sizeof;
        }

        for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
            if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
                return -1;
            } else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
                return 1;
            }
            lhs += ubyte.sizeof;
            rhs += ubyte.sizeof;
        }

        return 0;
    }

Edit: I've read up on SSE and I don't want to use it for three reasons:

  • It's not portable.
  • It requires programming in ASM.
  • The comparison instructions assume your data is floating point, which might be problematic if some data happens to match the pattern for a NaN.
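
For context, here is a minimal sketch of how such a comparator would plug into a binary search over fixed-length keys. It is written in C++ rather than D, and the names `compareBlocks` and `findBlock` are made up for illustration. `std::memcmp` is only a stand-in: any comparator works as long as the table was sorted with the same one, which is why the ordering itself can be arbitrary.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the comparator under discussion. Any function
// that yields a consistent total order on equal-length blocks would do.
int compareBlocks(const void* lhs, const void* rhs, std::size_t n) {
    return std::memcmp(lhs, rhs, n);
}

// Binary search over `count` records of `len` bytes each, stored back to
// back and already sorted with compareBlocks. Returns the index of the
// matching record, or `count` if the key is not present.
std::size_t findBlock(const unsigned char* table, std::size_t count,
                      std::size_t len, const unsigned char* key) {
    std::size_t lo = 0;
    std::size_t hi = count;
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        int c = compareBlocks(table + mid * len, key, len);
        if (c == 0) return mid;
        if (c < 0) {
            lo = mid + 1;  // record at mid sorts before the key
        } else {
            hi = mid;      // record at mid sorts after the key
        }
    }
    return count;
}
```

Swapping in a faster `compareBlocks` changes nothing else here, provided the table is re-sorted with the new function.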
+7
performance algorithm binary-search low-level d
6 answers

You might try:

  • check whether uint is the largest type that fits your target CPU (ulongs might fit a native register better).
  • use 2 pointers of that type
  • use 2 local variables filled with *p++ (don't dereference each pointer twice for one value)
  • divide the counter of the first loop up front (use while (counter--))
  • unroll the second loop by replacing it with a switch (if sizeof(the type that fits a register) is known and will always be the same).

Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in the case of equal values, for unrolling 4 times I get something like:

    auto lp = cast(const(uint)*) lhs;
    auto rp = cast(const(uint)*) rhs;
    uint l;
    uint r;
    size_t count = (n / uint.sizeof) / 4;

    while (count--) {
        if ((l = *lp++) != (r = *rp++)) {
            return (l < r) ? -1 : 1;
        }
        if ((l = *lp++) != (r = *rp++)) {
            return (l < r) ? -1 : 1;
        }
        if ((l = *lp++) != (r = *rp++)) {
            return (l < r) ? -1 : 1;
        }
        if ((l = *lp++) != (r = *rp++)) {
            return (l < r) ? -1 : 1;
        }
    }

Of course that leaves the remaining (n / uint.sizeof) % 4 iterations, which you can mix into this loop by interleaving a switch. I leave that as an exercise for the reader (evil grin).
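
For anyone who wants the exercise filled in, here is one possible sketch in C++. It is not the interleaved switch described above: the leftover words and bytes are handled by two plain loops after the unrolled one, which is simpler to get right. The names `loadWord` and `memoryCompareUnrolled` are made up for this example, and the words are loaded with `memcpy` so that nothing is assumed about pointer alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads one 32-bit word without assuming alignment.
static inline std::uint32_t loadWord(const unsigned char* p) {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof w);
    return w;
}

// Arbitrary but consistent ordering for two blocks of the same length n.
int memoryCompareUnrolled(const void* lhs, const void* rhs, std::size_t n) {
    const unsigned char* lp = static_cast<const unsigned char*>(lhs);
    const unsigned char* rp = static_cast<const unsigned char*>(rhs);
    constexpr std::size_t W = sizeof(std::uint32_t);

    std::size_t words = n / W;
    std::size_t blocks = words / 4;     // iterations of the unrolled loop
    std::size_t tailWords = words % 4;  // whole words left after unrolling
    std::size_t tailBytes = n % W;      // bytes left after the last word
    std::uint32_t l, r;

    while (blocks--) {  // four words per iteration, one test per word
        if ((l = loadWord(lp)) != (r = loadWord(rp))) return l < r ? -1 : 1;
        lp += W; rp += W;
        if ((l = loadWord(lp)) != (r = loadWord(rp))) return l < r ? -1 : 1;
        lp += W; rp += W;
        if ((l = loadWord(lp)) != (r = loadWord(rp))) return l < r ? -1 : 1;
        lp += W; rp += W;
        if ((l = loadWord(lp)) != (r = loadWord(rp))) return l < r ? -1 : 1;
        lp += W; rp += W;
    }
    while (tailWords--) {  // the (n / W) % 4 leftover words
        if ((l = loadWord(lp)) != (r = loadWord(rp))) return l < r ? -1 : 1;
        lp += W; rp += W;
    }
    while (tailBytes--) {  // the n % W leftover bytes
        if (*lp != *rp) return *lp < *rp ? -1 : 1;
        ++lp; ++rp;
    }
    return 0;
}
```

Whether the unrolling actually pays off depends on the compiler and the data, so it is worth measuring against the plain loop before adopting it.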

+3

I don't know much about this, but there are vector instructions that can operate on many bytes at once. You can use them to implement a very fast memcmp. I don't know exactly which instructions you would need, but if you look up the new Larrabee instructions or read this article, you may find what you are looking for: http://www.ddj.com/architect/216402188

NOTE: This CPU isn't out at the moment, AFAIK.

-Edit- Now I'm sure there are instruction sets (try looking at SSE or SSE2) that can compare 16 bytes at once if they are aligned.
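
As a rough illustration of the SSE2 idea, here is a sketch in C++ using compiler intrinsics from `<emmintrin.h>` instead of hand-written assembly. It uses the integer byte-equality instruction (`_mm_cmpeq_epi8`), so no floating-point interpretation of the data is involved, and unaligned loads (`_mm_loadu_si128`), so the buffers need not be 16-byte aligned. The function name `memoryCompareSse2` is made up, and the `__SSE2__` guard is an assumption that fits GCC and Clang; other compilers signal SSE2 support differently. Without SSE2 the function falls back to a plain byte loop.

```cpp
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 integer intrinsics
#endif

// Compares two blocks of the same length n, 16 bytes at a time where SSE2
// is available. The result has the same sign as a byte-wise comparison.
int memoryCompareSse2(const void* lhs, const void* rhs, std::size_t n) {
    const unsigned char* lp = static_cast<const unsigned char*>(lhs);
    const unsigned char* rp = static_cast<const unsigned char*>(rhs);
#if defined(__SSE2__)
    for (; n >= 16; n -= 16, lp += 16, rp += 16) {
        __m128i l = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lp));
        __m128i r = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rp));
        // One bit per byte position, set where the two bytes are equal.
        unsigned equal =
            static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(l, r)));
        if (equal != 0xFFFFu) {
            unsigned i = 0;
            while (equal & (1u << i)) ++i;  // index of first differing byte
            return lp[i] < rp[i] ? -1 : 1;
        }
    }
#endif
    // Remaining bytes, or the whole block when SSE2 is not available.
    for (; n > 0; --n, ++lp, ++rp) {
        if (*lp != *rp) return *lp < *rp ? -1 : 1;
    }
    return 0;
}
```

This still ties the fast path to x86, so the portability objection from the question stands. The sketch only shows that assembly and floating-point compares are not required.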

You can try this pure C++ code.

    template<class T>
    int memoryCompare(const T* lhs, const T* rhs, size_t n) {
        const T* endLHS = lhs + n / sizeof(T);
        while (lhs < endLHS) {
            // Caution: the difference is only reliable while it fits in an
            // int (e.g. T = unsigned char); for wider T, compare directly.
            int i = *lhs - *rhs;
            if (i != 0) return i > 0 ? 1 : -1;
            lhs++;
            rhs++;
        }
        // more code for the remaining bytes or call
        // memoryCompare<char>(lhs, rhs, n % sizeof(T));
        return 0;
    }

The advantage here is that you increment the pointer, so you can dereference it directly instead of applying an index (it's ptr_offset[index] vs ptr_offset). The above uses a template so that you can use 64-bit values on 64-bit machines. And CMP in assembly is really just a subtraction that checks the N and Z flags. Instead of comparing n and decrementing n, I just compare the pointers in my version.

+2

I think memcmp is specified to do a byte-by-byte comparison, regardless of the data type. Are you sure your compiler's implementation preserves string semantics? It shouldn't.
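
A quick way to check this on your own toolchain is a small C++ sketch like the one below (the helper names are made up for illustration). The C standard specifies that memcmp compares all n bytes as unsigned char values, so it should not stop at an embedded zero byte the way strcmp does.

```cpp
#include <cstring>

// True if memcmp keeps comparing past an embedded zero byte.
bool memcmpIgnoresNul() {
    const unsigned char a[] = {'a', 0, 'b'};
    const unsigned char b[] = {'a', 0, 'c'};
    return std::memcmp(a, b, sizeof a) < 0;
}

// True if strcmp stops at that same zero byte and reports the buffers equal.
bool strcmpStopsAtNul() {
    const char a[] = {'a', 0, 'b'};
    const char b[] = {'a', 0, 'c'};
    return std::strcmp(a, b) == 0;
}

// True if memcmp treats bytes as unsigned values (0x01 sorts before 0xFF).
bool memcmpIsUnsigned() {
    const unsigned char lo[] = {0x01};
    const unsigned char hi[] = {0xFF};
    return std::memcmp(lo, hi, 1) < 0;
}
```

If all three return true, any slowness in memcmp comes from how it is implemented, not from string semantics.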

+1

Well, a lot depends on your system and data. There are only so many assumptions we can make. Which processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? And so on.

It also depends on how different your data is. If it is very unlikely that the first byte of each buffer is the same, then the speed of the function is pretty much meaningless, since logically it will never get to the rest of the function. If it is likely that the first n-1 bytes are usually the same, then it becomes more important.

All in all, you are unlikely to see much of a difference regardless of how you do the test.

Anyway, here is a little implementation of my own. It may or may not be faster than yours (or, given that I just made it up as I went along, it may or may not work ;) )

    #include <stddef.h>
    #include <stdint.h>

    int memoryCompare(const void* lhs, const void* rhs, size_t n)
    {
        const uint8_t* l = (const uint8_t*)lhs;
        const uint8_t* r = (const uint8_t*)rhs;

        // Test the first few bytes until the remaining length is a
        // multiple of 4.
        while ((n & 0x3) != 0) {
            if (*l != *r) return (*l < *r) ? -1 : 1;
            n--; l++; r++;
        }

        // Test the next set of 32-bit integers. The loads are only
        // aligned if both buffers happen to be 32-bit aligned here.
        while (n > sizeof(uint32_t)) {
            uint32_t a = *(const uint32_t*)l;
            uint32_t b = *(const uint32_t*)r;
            if (a != b) return (a < b) ? -1 : 1;
            n -= sizeof(uint32_t);
            l += sizeof(uint32_t);
            r += sizeof(uint32_t);
        }

        // Now do the final bytes.
        while (n > 0) {
            if (*l != *r) return (*l < *r) ? -1 : 1;
            n--; l++; r++;
        }

        return 0;
    }
+1

Does the answer to this question help you?

If the compiler supports implementing memcmp() as an intrinsic/builtin function, it seems that you would have a hard time beating it.

I have to admit that I know next to nothing about D, so I have no idea whether the D compiler supports intrinsics.

+1

If you trust your compiler's optimisation, you could try a few modifications to acidzombie24's proposal:

    template<class T>
    int memoryCompare(const T* lhs, const T* rhs, size_t n) {
        // n is a byte count, as in the remainder handling below.
        const T* endLHS = &lhs[n / sizeof(T)];
        while (lhs < endLHS) {
            // A good optimiser will keep these values in register
            // and may even be clever enough to just retest the flags
            // before incrementing the pointers iff we loop again.
            // gcc -O3 did the optimisation very well.
            if (*lhs > *rhs) return 1;
            if (*lhs++ < *rhs++) return -1;
        }
        // more code for the remaining bytes or call
        // memoryCompare<char>(lhs, rhs, n % sizeof(T));
        return 0;
    }

Here is the entire gcc -O3 optimised inner loop, in x86 assembler, for a C version of the above, passing only char array pointers:

    Loop:
      incl  %eax          ; %eax is lhs
      incl  %edx          ; %edx is rhs
      cmpl  %eax, %ebx    ; %ebx is endLHS
      jbe   ReturnEq
      movb  (%edx), %cl
      cmpb  %cl, (%eax)
      jg    ReturnGT
      jge   Loop
    ReturnLT:
      ...
+1
