A faster way to reset memory than using memset?

I found out that memset(ptr, 0, nbytes) is very fast, but is there a faster way (at least on x86)?

I assume that memset uses mov; however, when zeroing memory most compilers use xor, since it is faster, right? edit1: Wrong, as GregS pointed out: xor only works with registers. What was I thinking?

I also asked someone who knows assembler better than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not quite sure I understood him correctly.

edit2: I looked at this question again and did a little testing. Here is what I tested:

  #include <stdio.h>
  #include <malloc.h>
  #include <string.h>
  #include <sys/time.h>

  #define TIME(body) do { \
      struct timeval t1, t2; double elapsed; \
      gettimeofday(&t1, NULL); \
      body \
      gettimeofday(&t2, NULL); \
      elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
      printf("%s\n --- %f ---\n", #body, elapsed); \
  } while(0)

  #define SIZE 0x1000000

  void zero_1(void* buff, size_t size)
  {
      size_t i;
      char* foo = buff;
      for (i = 0; i < size; i++)
          foo[i] = 0;
  }

  /* I foolishly assume size_t has register width */
  void zero_sizet(void* buff, size_t size)
  {
      size_t i;
      char* bar;
      size_t* foo = buff;
      for (i = 0; i < size / sizeof(size_t); i++)
          foo[i] = 0;

      // fixes bug pointed out by tristopia
      bar = (char*)buff + size - size % sizeof(size_t);
      for (i = 0; i < size % sizeof(size_t); i++)
          bar[i] = 0;
  }

  int main()
  {
      char* buffer = malloc(SIZE);
      TIME( memset(buffer, 0, SIZE); );
      TIME( zero_1(buffer, SIZE); );
      TIME( zero_sizet(buffer, SIZE); );
      free(buffer);
      return 0;
  }

results:

zero_1 is the slowest, except at -O3. zero_sizet is the fastest, with roughly equal performance at -O1, -O2 and -O3. memset was always slower than zero_sizet (about half as fast at -O3). One interesting thing: at -O3, zero_1 was as fast as zero_sizet, yet the disassembled function had about four times as many instructions (I think this is caused by loop unrolling). I also tried to optimize zero_sizet further, but the compiler always beat me; no surprise there.

For now memset wins; the earlier results were distorted by the CPU cache. (All tests were performed on Linux.) Further testing is needed. Next I'll try assembler :)

edit3: fixed a bug in the test code; the test results are not affected

edit4: While poking around a disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zeroing. It will be hard to beat.

+57
c std
Sep 06 '10
10 answers

x86 is a fairly wide range of devices.

For completely generic x86, an inline assembly block with "rep stosd" can blast zeros into memory 32 bits at a time. Try to make sure the bulk of this work is DWORD aligned.
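
A minimal sketch of that idea, assuming GCC-style inline assembly on x86/x86-64 (the function name and tail-handling loop are my own illustration, not from the answer):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero a buffer with "rep stos": stores EAX into [EDI] ECX times,
   4 bytes per iteration ("stosl" is the AT&T name for stosd). */
static void zero_rep_stosd(void *dst, size_t nbytes)
{
    size_t dwords = nbytes / 4;
    unsigned char *tail = (unsigned char *)dst + dwords * 4;
    void *p = dst;

    __asm__ volatile ("rep stosl"
                      : "+D" (p), "+c" (dwords)   /* destination and count */
                      : "a" (0)                   /* value to store: zero */
                      : "memory");

    for (size_t i = 0; i < nbytes % 4; i++)       /* odd trailing bytes */
        tail[i] = 0;
}
```

As the answer notes, this works best when the destination is DWORD aligned; the trailing loop mops up any remainder.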

For chips with MMX, an assembly loop with movq can hit 64 bits at a time.

You might be able to get a C/C++ compiler to emit 64-bit writes using a pointer to a long long or __m64. The target must be 8-byte aligned for best performance.

For chips with SSE, movaps is fast, but only if the address is 16-byte aligned, so use movsb until you reach alignment, then complete the clear with a movaps loop.

Win32 has "ZeroMemory()", but I forget whether it is a macro for memset or an actual "good" implementation.

+32
07 Sep '10 at 0:25

memset is generally designed to be very fast general-purpose set/zeroing code. It handles all cases of differing size and alignment, which affect the kinds of instructions you can use to do the work. Depending on the system you're on (and which vendor your stdlib comes from), the underlying implementation may be in assembler specific to that architecture, to take advantage of its particular properties. It may also have internal special cases for zeroing (as opposed to setting some other value).

That said, if you have very specific, very performance-critical memory zeroing to do, you could certainly beat a particular memset implementation by doing it yourself. memset and its friends in the standard library are always fun targets for one-upmanship programming. :)

+26
Sep 06 '10

Nowadays your compiler should do all the work for you. At least as far as I know, gcc is very good at optimizing memset calls (better to check the assembler, though).

Also, avoid memset if you don't need it:

  • use calloc for heap memory
  • use proper initialization ( ... = { 0 } ) for stack memory

And for really big chunks, use mmap if you have it. That just gets zero-initialized memory from the system for free.
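
The first two suggestions above can be sketched as follows (the function name is illustrative):

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if both a calloc'd buffer and a { 0 }-initialized
   array come back fully zeroed without any memset call. */
static int all_zero_demo(void)
{
    int ok = 1;
    int *heap = calloc(1024, sizeof *heap);  /* heap memory arrives zeroed */
    int stack[16] = { 0 };                   /* initializer zeroes the whole array */

    if (!heap)
        return 0;
    for (size_t i = 0; i < 1024; i++)
        ok = ok && heap[i] == 0;
    for (size_t i = 0; i < 16; i++)
        ok = ok && stack[i] == 0;
    free(heap);
    return ok;
}
```

For large calloc allocations the zeroing is often genuinely free: the allocator requests fresh pages from the OS, which are already zero.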

+23
Sep 07 '10 at 12:39

If I remember correctly (from a couple of years ago), one of the senior developers described a fast way to do bzero() on PowerPC (the spec said we needed to zero almost all of memory at power-up). It may not translate well (if at all) to x86, but it could be worth exploring.

The idea was to load a data cache line, clear that data cache line, and then write the cleared cache line back to memory.

For what it's worth, I hope this helps.

+5
Sep 07 '10

Unless you have specific needs, or know that your compiler/stdlib is poor, stick with memset. It is general-purpose and should have decent performance overall. Also, compilers may have an easier time optimizing/inlining memset() because they can have intrinsic support for it.

For example, Visual C++ often generates inline versions of memcpy/memset that are as small as a call to the library function, which avoids push/call/ret overhead. Further optimization is possible when the size parameter can be evaluated at compile time.

That said, if you have specific needs (where the size will always be tiny *or* huge), you can gain speed by dropping down to assembly level. For example, using non-temporal stores to zero huge blocks of memory without polluting your L2 cache.
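
A sketch of the non-temporal idea using SSE2 intrinsics, assuming a 16-byte-aligned destination (the function name and tail handling are my own illustration):

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Zero a 16-byte-aligned buffer with non-temporal (streaming) stores,
   which bypass the cache so a huge clear does not evict useful data. */
static void zero_stream(void *dst, size_t nbytes)
{
    char *p = (char *)dst;
    __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    for (; i + 16 <= nbytes; i += 16)
        _mm_stream_si128((__m128i *)(p + i), zero);  /* needs 16-byte alignment */
    _mm_sfence();    /* order the streamed stores before later loads/stores */

    for (; i < nbytes; i++)
        p[i] = 0;    /* tail bytes */
}
```

Note that streaming stores only pay off on blocks much larger than the cache; on small buffers they are slower than ordinary stores.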

But it all depends; for normal stuff, please stick with memset/memcpy :)

+5
Sep 07 '10

Also see the question Strange assembly from array 0-initialization for a comparison of memset and = { 0 } .

+2
Sep 07 '10

The memset function is designed to be flexible and simple, even at the expense of speed. In many implementations it is a simple while loop that copies the specified value one byte at a time, for the given number of bytes. If you want a faster memset (or memcpy, memmove, etc.), it is almost always possible to code one yourself.

The simplest customization is to perform single-byte "set" operations until the destination address is 32- or 64-bit aligned (whatever matches your chip's architecture), then store a full CPU register at a time. You may have to do a couple of single-byte "set" operations at the end if your range doesn't end on an aligned address.
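
The align-then-word-store scheme just described can be sketched like this (the function name is illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-set until the pointer is word aligned, then store a full
   register-width word at a time, then byte-set any trailing remainder. */
static void zero_aligned(void *dst, size_t nbytes)
{
    unsigned char *p = dst;

    /* leading bytes until p is aligned to the word size */
    while (nbytes && ((uintptr_t)p % sizeof(uintptr_t)) != 0) {
        *p++ = 0;
        nbytes--;
    }

    /* full words */
    uintptr_t *w = (uintptr_t *)p;
    while (nbytes >= sizeof(uintptr_t)) {
        *w++ = 0;
        nbytes -= sizeof(uintptr_t);
    }

    /* trailing bytes */
    p = (unsigned char *)w;
    while (nbytes--)
        *p++ = 0;
}
```

This is essentially what zero_sizet in the question does, with the alignment fix-up moved to the front so the word loop always runs on aligned addresses.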

Depending on your particular processor, you may also have SIMD streaming instructions that can help. These generally work better on aligned addresses, so the technique above for reaching an aligned address is useful here too.

For zeroing large sections of memory, you may also see a speed boost by splitting the range into sections and processing them in parallel (where the number of sections matches the number of cores/hardware threads).
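
A hedged sketch of that split-and-zero-in-parallel idea using POSIX threads; the worker count and slicing scheme are illustrative choices, not a tuned design:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NWORKERS 4

struct slice { char *base; size_t len; };

static void *zero_slice(void *arg)
{
    struct slice *s = arg;
    memset(s->base, 0, s->len);   /* each thread clears its own section */
    return NULL;
}

/* Split [buf, buf + len) into NWORKERS slices and zero them concurrently. */
static void zero_parallel(char *buf, size_t len)
{
    pthread_t tid[NWORKERS];
    struct slice slices[NWORKERS];
    size_t chunk = len / NWORKERS;

    for (int i = 0; i < NWORKERS; i++) {
        slices[i].base = buf + (size_t)i * chunk;
        slices[i].len  = (i == NWORKERS - 1) ? len - (size_t)i * chunk : chunk;
        pthread_create(&tid[i], NULL, zero_slice, &slices[i]);
    }
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
}
```

Thread startup costs dwarf the memset for small buffers; this only pays off when each slice is large enough to amortize the create/join overhead, and on large blocks memory bandwidth, not core count, may be the real limit.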

Most importantly, there is no way to tell whether any of this helps unless you try it. At a minimum, look at what your compiler emits for each case, and at what other compilers emit for their standard "memset" (their implementation may be more efficient than your compiler's).

+2
Sep 07 '10 at 13:17

There is one fatal flaw in this otherwise excellent and useful benchmark: since memset is timed first, there seems to be some "memory warm-up" overhead that makes it look extremely slow. Moving the memset timing to second place with something else first, or simply timing memset twice, makes memset the fastest with all compile switches!!!

+2
Aug 19 '11

This is an interesting question. I made this implementation, which is just slightly faster (but hardly measurably so) when compiling 32-bit release builds in VC++ 2012. It can probably be improved a lot. Adding this to your own class in a multithreaded environment would probably give you even greater performance gains, since there are some reported bottlenecks with memset() in multithreaded scenarios.

  // MemsetSpeedTest.cpp : Defines the entry point for the console application.
  //
  #include "stdafx.h"
  #include <iostream>
  #include "Windows.h"
  #include <time.h>
  #pragma comment(lib, "Winmm.lib")
  using namespace std;

  /** a signed 64-bit integer value type */
  #define _INT64 __int64
  /** a signed 32-bit integer value type */
  #define _INT32 __int32
  /** a signed 16-bit integer value type */
  #define _INT16 __int16
  /** a signed 8-bit integer value type */
  #define _INT8 __int8
  /** an unsigned 64-bit integer value type */
  #define _UINT64 unsigned _INT64
  /** an unsigned 32-bit integer value type */
  #define _UINT32 unsigned _INT32
  /** an unsigned 16-bit integer value type */
  #define _UINT16 unsigned _INT16
  /** an unsigned 8-bit integer value type */
  #define _UINT8 unsigned _INT8
  /** maximum allowed value in an unsigned 64-bit integer value type */
  #define _UINT64_MAX 18446744073709551615ULL

  #ifdef _WIN32
  /** Use to init the clock */
  #define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
  /** Use to start the performance timer */
  #define TIMER_START QueryPerformanceCounter(&t1);
  /** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
  #define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
  #else
  /** Use to init the clock */
  #define TIMER_INIT clock_t start;double diff;
  /** Use to start the performance timer */
  #define TIMER_START start=clock();
  /** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
  #define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
  #endif

  void *MemSet(void *dest, _UINT8 c, size_t count)
  {
      size_t blockIdx;
      size_t blocks = count >> 3;
      size_t bytesLeft = count - (blocks << 3);
      _UINT64 cUll = c
          | (((_UINT64)c) << 8)  | (((_UINT64)c) << 16) | (((_UINT64)c) << 24)
          | (((_UINT64)c) << 32) | (((_UINT64)c) << 40) | (((_UINT64)c) << 48)
          | (((_UINT64)c) << 56);

      _UINT64 *destPtr8 = (_UINT64*)dest;
      for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = cUll;
      if (!bytesLeft) return dest;

      blocks = bytesLeft >> 2;
      bytesLeft = bytesLeft - (blocks << 2);
      _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
      for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = (_UINT32)cUll;
      if (!bytesLeft) return dest;

      blocks = bytesLeft >> 1;
      bytesLeft = bytesLeft - (blocks << 1);
      _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
      for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = (_UINT16)cUll;
      if (!bytesLeft) return dest;

      _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
      for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = (_UINT8)cUll;

      return dest;
  }

  int _tmain(int argc, _TCHAR* argv[])
  {
      TIMER_INIT

      const size_t n = 10000000;
      const _UINT64 m = _UINT64_MAX;
      const _UINT64 o = 1;
      char test[n];
      {
          cout << "memset()" << endl;
          TIMER_START;
          for (int i = 0; i < m; i++)
              for (int j = 0; j < o; j++)
                  memset((void*)test, 0, n);
          TIMER_STOP;
      }
      {
          cout << "MemSet() took:" << endl;
          TIMER_START;
          for (int i = 0; i < m; i++)
              for (int j = 0; j < o; j++)
                  MemSet((void*)test, 0, n);
          TIMER_STOP;
      }
      cout << "Done" << endl;
      int wait;
      cin >> wait;
      return 0;
  }

The output is as follows when compiling the release for 32-bit systems:

  memset() took: 5.569000
  MemSet() took: 5.544000
  Done

The output is as follows when compiling the release for 64-bit systems:

  memset() took: 2.781000
  MemSet() took: 2.765000
  Done

Here you can find the source code of the Berkeley memset(), which I believe is the most common implementation.

+2
Mar 08 '13 at 10:09

memset can be inlined by the compiler as a series of efficient opcodes, unrolled for a few iterations. For very large blocks of memory, such as a 4000x2000 64-bit framebuffer, you can try optimizing it across several threads (which you spin up for that sole task), each zeroing its own part. Note that there is also bzero(), but it is more obscure and less likely to be as optimized as memset, and the compiler will definitely notice that you are passing 0.

Typically, the compiler assumes you are memsetting large blocks, so for small blocks it would likely be more efficient to just do *(uint64_t*)p = 0 if you are initializing a large number of tiny objects.
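
A small illustration of that point (the helper name is mine; using memcpy with a constant size sidesteps the strict-aliasing hazard of writing through a cast pointer, and compiles to the same single store):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Zero a small fixed-size object with one 64-bit store instead of
   a general-purpose memset call on an 8-byte buffer. */
static void zero8(void *p)
{
    uint64_t z = 0;
    memcpy(p, &z, sizeof z);   /* constant size: typically one store instruction */
}
```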

Generally, all x86 processors are different (unless you compile for some standardized platform), and what you optimize for a Pentium 2 will behave differently on a Core Duo or an i486. So if you really care and want to squeeze out the last bits of toothpaste, it makes sense to ship several versions of your exe, compiled and optimized for different popular processor models. From personal experience, Clang with -march=native increased my game's FPS from 60 to 65, compared to no -march at all.

0
Aug 26 '19 at 15:28
