Why is memset slow?

The specs for my processor say it should get 5.336 GB/s of memory bandwidth. To test this, I wrote a simple program that runs memset (or memcpy) on a large array and reports the timing. I'm seeing 3.8 GB/s on memset and 1.9 GB/s on memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says that my Q9400 should be getting 5.336 GB/s. What's wrong?

I've tried replacing memset and memcpy with assignment loops. I've googled to try to learn about memory alignment. I've tried different compiler flags. I've spent an embarrassing number of hours on this. Thanks for any help you can provide!

I am using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic

The code:

    #include <stdio.h>
    #include <time.h>
    #include <string.h>
    #include <stdlib.h>

    #define numBytes ((long)(1024*1024*1024))
    #define numTransfers ((long)(8))

    int main(int argc, char **argv) {
        if (argc != 3) {
            printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n", argv[0]);
            return -1;
        }

        char *__restrict__ source = (char *)malloc(numBytes);
        char *__restrict__ dest = (char *)malloc(numBytes);

        struct timespec start, end;
        long totalTimeMs;
        int i;

        /* Time numTransfers memset passes over the source buffer. */
        clock_gettime(CLOCK_MONOTONIC_RAW, &start);
        for (i = 0; i < numTransfers; ++i)
            memset(source, 0, numBytes);
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);
        totalTimeMs = (end.tv_nsec - start.tv_nsec) * .000001 + 1000 * (end.tv_sec - start.tv_sec);
        printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",
               numBytes, numTransfers, numBytes / 1024.0 / 1024 / 1024 * numTransfers,
               totalTimeMs, numBytes / 1024.0 / 1024 / 1024 * 1000 * numTransfers / totalTimeMs);

        /* Time numTransfers memcpy passes from source to dest. */
        clock_gettime(CLOCK_MONOTONIC_RAW, &start);
        for (i = 0; i < numTransfers; ++i)
            memcpy(dest, source, numBytes);
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);
        totalTimeMs = (end.tv_nsec - start.tv_nsec) * .000001 + 1000 * (end.tv_sec - start.tv_sec);
        printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",
               numBytes, numTransfers, numBytes / 1024.0 / 1024 / 1024 * numTransfers,
               totalTimeMs, numBytes / 1024.0 / 1024 / 1024 * 1000 * numTransfers / totalTimeMs);

        free(source);
        free(dest);
        return EXIT_SUCCESS;
    }

Compilation Commands:

    gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.co -c memcpyStackOverflowNoParameters.c
    gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.co -o memcpy -rdynamic -lrt

Selected outputs:

    memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
    memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).

My hardware according to lshw:

        product: OptiPlex 960 ()
        vendor: Winbond Electronics
        width: 64 bits
      *-core
           description: Motherboard
           product: 0Y958C
           vendor: Winbond Electronics
         *-firmware
              description: BIOS
              capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
         *-cpu
              product: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
              physical id: 400
              size: 2666MHz
              width: 64 bits
              clock: 1333MHz
              capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
              configuration: cores=4 enabledcores=4 threads=4
            *-cache:0
                 description: L1 cache
                 physical id: 700
                 size: 256KiB
                 capacity: 256KiB
                 capabilities: internal write-back unified
            *-cache:1
                 description: L2 cache
                 physical id: 701
                 size: 6MiB
                 capacity: 6MiB
                 capabilities: internal varies unified
         *-memory
              description: System Memory
              physical id: 1000
              slot: System board or motherboard
              size: 4GiB
            *-bank:0
                 description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
                 product: CT51264AA667.M16FC
                 vendor: 7F7F7F7F7F9B0000
                 slot: DIMM_1
                 size: 4GiB
                 width: 64 bits
                 clock: 667MHz (1.5ns)
            *-bank:1
                 description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
            *-bank:2
                 description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
            *-bank:3
                 description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
+7
optimization memcpy memset
2 answers

Memory addresses are virtualized: the addresses your program uses are translated to physical addresses. This translation lets the OS allocate what your program sees as contiguous memory from whatever pieces are convenient at the time. Every general-purpose CPU does this. The translation requires a table lookup, which itself needs memory access. CPUs have a cache for it, the TLB (translation lookaside buffer), but long runs of virtual addresses can easily blow it out. So every 4 KB (2 MB on a Linux system that has figured out what you're doing), the processor stalls on the translation before it can actually issue the memory traffic. These stalls can take quite a while. You could try running two copies of your test at once: it seems reasonable that their TLB misses won't coincide, and you'd get aggregate throughput much closer to the rated capacity.

(Edit: you could also replace the #defines with

    size_t numBytes = atoi(argv[1]);
    size_t numTransfers = atoi(argv[2]);

in the body of main, so that the command-line arguments are actually used.)

Edit: by the way, the bandwidth I saw from this test on my own box (and reported in the comments) was so far below the rated capacity for my processor that it prompted me to investigate my own system. The builder of my box had put really crappy memory in the slots. I have since replaced it with a well-known brand, which more than doubled my throughput and noticeably improved my machine's performance.

+5

Last I checked, memcpy and memset were not well optimized in GCC. This was still true in 2012. See Agner Fog's Optimizing software in C++, section 2.6, "Choice of function libraries", and Table 2.1, which compares several different compilers and operating systems.

GCC has built-in functions for memcpy. They appear to be even worse than memcpy in glibc. As far as I understand, the GCC developers and the glibc developers work independently. To get the glibc versions, compile with -fno-builtin. However, although glibc is (or at least was) better, it is still not optimal. For best results, use Agner Fog's asmlib. He has optimized memcpy, memset, and many other common functions in assembly, using SSE and AVX among other optimizations.

+3