I found out that memset(ptr, 0, nbytes) is very fast, but is there a faster way (at least on x86)?
I assume that memset uses mov; however, when zeroing, most compilers use xor since it is faster, right? edit1: Wrong. As GregS pointed out, the xor trick only works on registers, not on memory. What was I thinking?
I also asked someone who knows more about assembler than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood him correctly.
edit2: I revisited this question and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free (instead of the non-standard <malloc.h>) */
    #include <string.h>
    #include <sys/time.h>

    /* Runs `body` once and prints the elapsed wall-clock time in milliseconds. */
    #define TIME(body) do { \
        struct timeval t1, t2; double elapsed; \
        gettimeofday(&t1, NULL); \
        body \
        gettimeofday(&t2, NULL); \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)

    #define SIZE 0x1000000

    /* Byte-by-byte zeroing. */
    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;

        for (i = 0; i < size; i++)
            foo[i] = 0;
    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;

        /* Zero one size_t-sized word per iteration. */
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        /* Zero the trailing bytes that do not fill a whole word. */
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main()
    {
        char* buffer = malloc(SIZE);
        if (buffer == NULL)
            return 1;

        TIME( memset(buffer, 0, SIZE); );
        TIME( zero_1(buffer, SIZE); );
        TIME( zero_sizet(buffer, SIZE); );

        free(buffer);
        return 0;
    }

results:
zero_1 was the slowest at every optimization level except -O3. zero_sizet was the fastest, with roughly the same performance at -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). Interestingly, at -O3 zero_1 was as fast as zero_sizet, although its disassembly had about four times as many instructions (I think this is caused by loop unrolling). I also tried to optimize zero_sizet further, but the compiler always outdid me, which is no surprise.
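One caveat about zero_sizet: it stores through a size_t pointer, so it quietly relies on buff being suitably aligned, which holds for malloc but not for an arbitrary pointer. A minimal sketch of a variant that handles the unaligned head first is below. The name zero_aligned is hypothetical, and this is an illustration, not something that was benchmarked here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of zero_sizet: byte stores until the pointer is
   size_t-aligned, then whole words, then the remaining tail bytes. */
void zero_aligned(void *buff, size_t size)
{
    assert(buff != NULL || size == 0);
    unsigned char *p = buff;

    /* Head: byte stores until p is aligned for size_t (or size runs out). */
    while (size > 0 && ((uintptr_t)p % sizeof(size_t)) != 0) {
        *p++ = 0;
        size--;
    }

    /* Body: one word-sized store per iteration. */
    size_t *w = (size_t *)p;
    size_t words = size / sizeof(size_t);
    for (size_t i = 0; i < words; i++)
        w[i] = 0;

    /* Tail: the last size % sizeof(size_t) bytes. */
    p = (unsigned char *)(w + words);
    for (size_t i = 0; i < size % sizeof(size_t); i++)
        p[i] = 0;
}
```

It can be dropped into the test program as another TIME( zero_aligned(buffer, SIZE); ); line. For a malloc'ed buffer the head loop should do nothing, so any difference from zero_sizet would be noise.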
For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I will try assembler next :)
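One way to make such a comparison less sensitive to cache state and to the order of the calls is to do an untimed warm-up pass and then keep the best of several timed runs. The sketch below assumes this approach. best_time_ms, zero_fn and zero_memset are hypothetical names, and it uses clock() (CPU time) rather than gettimeofday only to stay within standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature for the zeroing routines under test. */
typedef void (*zero_fn)(void *buff, size_t size);

/* Wrapper so that memset fits the zero_fn signature. */
void zero_memset(void *buff, size_t size)
{
    memset(buff, 0, size);
}

/* Returns the fastest of `reps` timed runs, in milliseconds of CPU time.
   One untimed warm-up call touches every page first, so no candidate
   pays the first-touch cost inside its timed runs. */
double best_time_ms(zero_fn fn, void *buff, size_t size, int reps)
{
    assert(reps > 0);

    fn(buff, size);  /* warm-up, not timed */

    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        clock_t start = clock();
        fn(buff, size);
        clock_t stop = clock();

        double ms = (double)(stop - start) * 1000.0 / CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

A call such as best_time_ms(zero_memset, buffer, SIZE, 10) would replace each TIME(...) line. Note that a 16 MiB buffer is larger than most caches, so this mainly removes the first-touch and ordering effects rather than producing a fully cache-resident measurement.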
edit3: Fixed a bug in the test code; the test results are not affected.
edit4: While digging through the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. It will be hard to beat.
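To give a rough idea of what an SSE-based zeroing loop looks like, here is a minimal sketch using SSE2 intrinsics. It is not the VS2010 routine, and zero_sse2 is a hypothetical name. A production memset typically does more, such as aligning the destination and choosing a strategy based on size, so this is only an illustration of the 16-byte store idea.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Hypothetical sketch: zero 16 bytes per store when SSE2 is available. */
void zero_sse2(void *buff, size_t size)
{
    assert(buff != NULL || size == 0);
    unsigned char *p = buff;

#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    while (size >= 16) {
        /* Unaligned 16-byte store, so p needs no particular alignment. */
        _mm_storeu_si128((__m128i *)p, zero);
        p += 16;
        size -= 16;
    }
#endif

    /* Tail bytes (or the whole buffer when SSE2 is not available). */
    memset(p, 0, size);
}
```

Whether this gets anywhere near the library memset would have to be measured; I would not expect it to win.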