The spec for my processor says it should get 5.336 GB/s of memory bandwidth. To test this, I wrote a simple program that runs memset (or memcpy) on a large array and reports the time. I see 3.8 GB/s for memset and 1.9 GB/s for memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says that my Q9400 should get 5.336 GB/s. What's wrong?
I tried replacing memset and memcpy with assignment loops. I searched Google to learn about memory alignment. I tried different compiler flags. I have spent an embarrassing number of hours on this. Thanks for any help you can provide!
I am using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic.
The code:
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))

int main(int argc,char**argv){
    if(argc!=3){
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n",argv[0]);
        return -1;
    }
    char*__restrict__ source=(char*)malloc(numBytes);
    char*__restrict__ dest=(char*)malloc(numBytes);
    struct timespec start,end;
    long totalTimeMs;
    int i;

    /* Time numTransfers memset passes over the source buffer. */
    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memset(source,0,numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    /* Time numTransfers memcpy passes from source to dest. */
    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memcpy(dest,source,numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    free(source);
    free(dest);
    return EXIT_SUCCESS;
}
Compilation commands:
gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.co -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.co -o memcpy -rdynamic -lrt
Selected outputs:
memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).
My equipment according to lshw:
    product: OptiPlex 960 ()
    vendor: Winbond Electronics
    width: 64 bits
  *-core
       description: Motherboard
       product: 0Y958C
       vendor: Winbond Electronics
     *-firmware
          description: BIOS
          capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
     *-cpu
          product: Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz
          physical id: 400
          size: 2666MHz
          width: 64 bits
          clock: 1333MHz
          capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
          configuration: cores=4 enabledcores=4 threads=4
        *-cache:0
             description: L1 cache
             physical id: 700
             size: 256KiB
             capacity: 256KiB
             capabilities: internal write-back unified
        *-cache:1
             description: L2 cache
             physical id: 701
             size: 6MiB
             capacity: 6MiB
             capabilities: internal varies unified
     *-memory
          description: System Memory
          physical id: 1000
          slot: System board or motherboard
          size: 4GiB
        *-bank:0
             description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
             product: CT51264AA667.M16FC
             vendor: 7F7F7F7F7F9B0000
             slot: DIMM_1
             size: 4GiB
             width: 64 bits
             clock: 667MHz (1.5ns)
        *-bank:1
             description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
        *-bank:2
             description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
        *-bank:3
             description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
optimization memcpy memset
Jeff guy