We have recently acquired several new servers and are experiencing poor memcpy performance. Memcpy performance on the servers is 3x slower compared to our laptops.
Server 1 Specifications
- Chassis and Mobo: SUPER MICRO 1027GR-TRF
- Processor: 2x Intel Xeon E5-2680 @ 2.70 GHz
- Memory: 8x 16 GB DDR3 1600 MHz
Edit: I am also testing on a different server with slightly higher specifications and am seeing the same results as on the first server.
Server 2 Specifications
- Chassis and Mobo: SUPER MICRO 1027GR-TRFT
- Processor: 2x Intel Xeon E5-2650 v2 @ 2.6 GHz
- Memory: 8x 16 GB DDR3 1866 MHz
Laptop Specifications
- Case: Lenovo W530
- Processor: 1x Intel Core i7-3720QM @ 2.6 GHz
- Memory: 4x 4 GB DDR3 1600 MHz
Operating System
$ cat /etc/redhat-release
Scientific Linux release 6.5 (Carbon)

$ uname -a
Linux r113 2.6.32-431.1.2.el6.x86_64
Compiler (same on all systems)
$ gcc --version
gcc (GCC) 4.6.1
Also tested with gcc 4.8.2 based on a suggestion from @stefan. There was no performance difference between compilers.
Test Code
The following is the test code. It is a canned benchmark that duplicates the problem I am seeing in our production code. I know this benchmark is simplistic, but it is able to exercise and identify our problem. The code creates two 1 GB buffers and memcpys between them, timing the memcpy call. You can specify alternate buffer sizes on the command line using: ./big_memcpy_test [SIZE_BYTES]
#include <chrono>
#include <cstring>
#include <cstdio>
#include <cstdint>
#include <iostream>
#include <string>

class Timer
{
 public:
  Timer()
      : mStart(),
        mStop()
  {
    update();
  }

  void update()
  {
    mStart = std::chrono::high_resolution_clock::now();
    mStop  = mStart;
  }

  double elapsedMs()
  {
    mStop = std::chrono::high_resolution_clock::now();
    std::chrono::milliseconds elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(mStop - mStart);
    return elapsed_ms.count();
  }

 private:
  std::chrono::high_resolution_clock::time_point mStart;
  std::chrono::high_resolution_clock::time_point mStop;
};

std::string formatBytes(std::uint64_t bytes)
{
  static const int num_suffix = 5;
  static const char* suffix[num_suffix] = { "B", "KB", "MB", "GB", "TB" };
  double dbl_s_byte = bytes;
  int i = 0;

  for (; (int)(bytes / 1024.) > 0 && i < num_suffix; ++i, bytes /= 1024.)
  {
    dbl_s_byte = bytes / 1024.0;
  }

  const int buf_len = 64;
  char buf[buf_len];

  // use snprintf so there is no buffer overrun
  int res = snprintf(buf, buf_len, "%0.2f%s", dbl_s_byte, suffix[i]);

  // snprintf returns the number of characters that would have been written if n
  // had been sufficiently large, not counting the terminating null character.
  // if an encoding error occurs, a negative number is returned.
  if (res >= 0)
  {
    return std::string(buf);
  }
  return std::string();
}

void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
{
  memmove(pDest, pSource, sizeBytes);
}

int main(int argc, char* argv[])
{
  std::uint64_t SIZE_BYTES = 1073741824; // 1GB

  if (argc > 1)
  {
    SIZE_BYTES = std::stoull(argv[1]);
    std::cout << "Using buffer size from command line: " << formatBytes(SIZE_BYTES)
              << std::endl;
  }
  else
  {
    std::cout << "To specify a custom buffer size: big_memcpy_test [SIZE_BYTES] \n"
              << "Using built in buffer size: " << formatBytes(SIZE_BYTES)
              << std::endl;
  }

  // big array to use for testing
  char* p_big_array = NULL;

  /////////////
  // malloc
  {
    Timer timer;

    p_big_array = (char*)malloc(SIZE_BYTES * sizeof(char));
    if (p_big_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " returned NULL!" << std::endl;
      return 1;
    }

    std::cout << "malloc for " << formatBytes(SIZE_BYTES) << " took "
              << timer.elapsedMs() << "ms" << std::endl;
  }

  /////////////
  // memset
  {
    Timer timer;

    // set all data in p_big_array to 0xF (non-zero, to avoid the zero page)
    memset(p_big_array, 0xF, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memset for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;
  }

  /////////////
  // memcpy
  {
    char* p_dest_array = (char*)malloc(SIZE_BYTES);
    if (p_dest_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memcpy test"
                << " returned NULL!" << std::endl;
      return 1;
    }
    memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char));

    // time only the memcpy FROM p_big_array TO p_dest_array
    Timer timer;

    memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memcpy for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;

    // cleanup p_dest_array
    free(p_dest_array);
    p_dest_array = NULL;
  }

  /////////////
  // memmove
  {
    char* p_dest_array = (char*)malloc(SIZE_BYTES);
    if (p_dest_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memmove test"
                << " returned NULL!" << std::endl;
      return 1;
    }
    memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char));

    // time only the memmove FROM p_big_array TO p_dest_array
    Timer timer;

    // wrapped in doMemmove() so GCC does not optimize the call into memcpy
    // memmove(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));
    doMemmove(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memmove for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;

    // cleanup p_dest_array
    free(p_dest_array);
    p_dest_array = NULL;
  }

  // cleanup
  free(p_big_array);
  p_big_array = NULL;

  return 0;
}
CMake Build File
project(big_memcpy_test)
cmake_minimum_required(VERSION 2.4.0)

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

# create verbose makefiles that show each command line as it is issued
set( CMAKE_VERBOSE_MAKEFILE ON CACHE BOOL "Verbose" FORCE )

# release mode
set( CMAKE_BUILD_TYPE Release )

# grab the CXXFLAGS environment variable and append C++11 and -Wall options
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x -Wall -march=native -mtune=native" )
message( INFO "CMAKE_CXX_FLAGS = ${CMAKE_CXX_FLAGS}" )

# sources to build
set(big_memcpy_test_SRCS
  main.cpp
)

# create an executable file named "big_memcpy_test" from
# the source files in the variable "big_memcpy_test_SRCS".
add_executable(big_memcpy_test ${big_memcpy_test_SRCS})
Test results
Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | NUMA nodes (numactl --hardware)
---------------------------------------------------------------------------------------------
Laptop 1         | 0           | 127         | 113         | 1
Laptop 2         | 0           | 180         | 120         | 1
Server 1         | 0           | 306         | 301         | 2
Server 2         | 0           | 352         | 325         | 2
As you can see, the memcpys and memsets on our servers are much slower than the memcpys and memsets on our laptops.
Different buffer sizes
I tried buffers ranging from 100 MB to 5 GB, all with the same results (the servers are slower than the laptops).
NUMA Affinity
I read about people having performance issues related to NUMA, so I tried setting CPU and memory affinity using numactl, but the results remained the same.
NUMA Server Hardware
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65501 MB
node 0 free: 62608 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 63837 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
Laptop NUMA Hardware
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16018 MB
node 0 free: 6622 MB
node distances:
node   0
  0:  10
Setting NUMA Affinity
$ numactl --cpunodebind=0 --membind=0 ./big_memcpy_test
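For reference, the same binding can also be done from inside the process via libnuma instead of numactl. Below is a minimal sketch of how that might look (this is an assumption, not code from my test; it requires the libnuma headers and linking with -lnuma):

// Hypothetical sketch: bind the current thread and its allocations to NUMA
// node 0 from inside the process, similar in spirit to
// `numactl --cpunodebind=0 --membind=0`. Requires libnuma (link with -lnuma).
#include <numa.h>
#include <cstdio>
#include <cstring>

int main()
{
    if (numa_available() < 0)
    {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    numa_run_on_node(0);    // restrict this thread to the CPUs of node 0
    numa_set_preferred(0);  // prefer memory allocations from node 0

    const std::size_t SIZE_BYTES = 1073741824; // 1GB
    char* p = (char*)numa_alloc_onnode(SIZE_BYTES, 0); // allocate directly on node 0
    if (p == NULL)
    {
        return 1;
    }

    // touch the pages so they are actually faulted in on node 0
    std::memset(p, 0xF, SIZE_BYTES);

    numa_free(p, SIZE_BYTES);
    return 0;
}

numactl on the command line should cover the same ground, so this is only for completeness.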
Any help resolving this is greatly appreciated.
Edit: GCC settings
Based on the comments, I tried to compile with different GCC options:
Compiling with -march and -mtune set to native
g++ -std=c++0x -Wall -march=native -mtune=native -O3 -DNDEBUG -o big_memcpy_test main.cpp
Result: identical performance (no improvement)
Compiling with -O2 instead of -O3
g++ -std=c++0x -Wall -march=native -mtune=native -O2 -DNDEBUG -o big_memcpy_test main.cpp
Result: identical performance (no improvement)
Edit: Changed memset to write 0xF instead of 0 to avoid a NULL page (@SteveCox)
There is no improvement when memset writes a value other than 0 (0xF was used in this case).
Edit: Cachebench Results
To rule out that my test program is too simplistic, I downloaded a real benchmarking program, LLCacheBench (http://icl.cs.utk.edu/projects/llcbench/cachebench.html).
I built the benchmark on each machine separately to avoid architecture issues. Below are my results.

Note that there is a VERY large difference in performance at the larger buffer sizes. The last size tested (16777216 bytes) ran at 18849.29 MB/s on the laptop but only 6710.40 MB/s on the server. That is roughly a 3x difference in performance. You can also see that the performance drop-off on the server is much steeper than on the laptop.
Edit: memmove() is 2x FASTER than memcpy() on the server
Based on some experimentation, I tried using memmove() instead of memcpy() in my test case and found a 2x improvement on the server. memmove() on the laptop runs slower than memcpy(), but, oddly enough, it runs at the same speed as memmove() on the server. This begs the question: why is memcpy so slow?
I updated the code to test memmove alongside memcpy. I had to wrap the memmove() inside a function because if I left the call inline, GCC optimized it and it performed exactly the same as memcpy() (I assume GCC optimized it into memcpy because it knew the locations did not overlap).
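An alternative to the wrapper function (not what I used above, just a sketch) would be to mark the wrapper noinline, or to compile with -fno-builtin-memmove; either should likewise stop GCC from substituting its builtin memcpy when it can prove the buffers do not overlap:

// Hypothetical alternative to the doMemmove() wrapper in the test code above.
// __attribute__((noinline)) keeps the call out of line, so GCC cannot see at
// the call site that source and destination never overlap, and therefore
// cannot replace the memmove with memcpy.
#include <cstring>

__attribute__((noinline))
void doMemmoveNoInline(void* pDest, const void* pSource, std::size_t sizeBytes)
{
    memmove(pDest, pSource, sizeBytes);
}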
Updated Results
Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | memmove (ms) | NUMA nodes (numactl --hardware)
-----------------------------------------------------------------------------------------------------------
Laptop 1         | 0           | 127         | 113         | 161          | 1
Laptop 2         | 0           | 180         | 120         | 160          | 1
Server 1         | 0           | 306         | 301         | 159          | 2
Server 2         | 0           | 352         | 325         | 159          | 2
Edit: Naive Memcpy
Based on a suggestion from @Salgar, I implemented my own naive memcpy function and tested it.
Naive Memcpy Source
void naiveMemcpy(void* pDest, const void* pSource, std::size_t sizeBytes)
{
  char* p_dest = (char*)pDest;
  const char* p_source = (const char*)pSource;
  for (std::size_t i = 0; i < sizeBytes; ++i)
  {
    *p_dest++ = *p_source++;
  }
}
Naive memcpy results compared to memcpy()
Buffer Size: 1GB | memcpy (ms) | memmove (ms) | naiveMemcpy (ms)
-----------------------------------------------------------------
Laptop 1         | 113         | 161          | 160
Server 1         | 301         | 159          | 159
Server 2         | 325         | 159          | 159
Edit: Assembly output
Simple memcpy source
#include <cstring>
#include <cstdlib>

int main(int argc, char* argv[])
{
  size_t SIZE_BYTES = 1073741824; // 1GB

  char* p_big_array  = (char*)malloc(SIZE_BYTES * sizeof(char));
  char* p_dest_array = (char*)malloc(SIZE_BYTES * sizeof(char));

  memset(p_big_array,  0xA, SIZE_BYTES * sizeof(char));
  memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char));

  memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));

  free(p_dest_array);
  free(p_big_array);

  return 0;
}
Assembly output: This is identical on both the server and the laptop. I am saving space and not pasting both.
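For reference, output like the listing below can be generated by adding -S to the same compile flags used above, e.g.:
g++ -std=c++0x -Wall -march=native -mtune=native -O3 -S main_memcpy.cpp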
        .file   "main_memcpy.cpp"
        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB25:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movl    $1073741824, %edi
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        subq    $8, %rsp
        .cfi_def_cfa_offset 32
        call    malloc
        movl    $1073741824, %edi
        movq    %rax, %rbx
        call    malloc
        movl    $1073741824, %edx
        movq    %rax, %rbp
        movl    $10, %esi
        movq    %rbx, %rdi
        call    memset
        movl    $1073741824, %edx
        movl    $15, %esi
        movq    %rbp, %rdi
        call    memset
        movl    $1073741824, %edx
        movq    %rbx, %rsi
        movq    %rbp, %rdi
        call    memcpy
        movq    %rbp, %rdi
        call    free
        movq    %rbx, %rdi
        call    free
        addq    $8, %rsp
        .cfi_def_cfa_offset 24
        xorl    %eax, %eax
        popq    %rbx
        .cfi_def_cfa_offset 16
        popq    %rbp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE25:
        .size   main, .-main
        .ident  "GCC: (GNU) 4.6.1"
        .section        .note.GNU-stack,"",@progbits
PROGRESS!!! asmlib
Based on a suggestion from @tbenson, I tried the asmlib version of memcpy. Initially my results were poor, but after changing SetMemcpyCacheLimit() to 1 GB (the size of my buffer), it ran at the same speed as my naive loop!
The bad news is that the asmlib version of memmove is slower than the glibc version; it now runs around the 300 ms mark (on par with the glibc version of memcpy). The strange thing is that on the laptop, setting SetMemcpyCacheLimit() to a large value hurts performance...
In the results below, the rows marked SetCache have SetMemcpyCacheLimit set to 1073741824. The rows without SetCache do not call SetMemcpyCacheLimit().
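Roughly how the asmlib calls fit into the test (a sketch under my assumptions, not the exact code; function names are the ones declared in asmlib.h from Agner Fog's asmlib):

// Sketch of the asmlib-based variant of the copy test (assumes Agner Fog's
// asmlib is built, asmlib.h is on the include path, and the program is linked
// against the asmlib static library).
#include "asmlib.h"
#include <cstdlib>

int main()
{
    const size_t SIZE_BYTES = 1073741824; // 1GB

    char* p_src = (char*)malloc(SIZE_BYTES);
    char* p_dst = (char*)malloc(SIZE_BYTES);
    if (p_src == NULL || p_dst == NULL)
    {
        return 1;
    }

    A_memset(p_src, 0xA, SIZE_BYTES);
    A_memset(p_dst, 0xF, SIZE_BYTES);

    // The "SetCache" rows below call this first. As I understand it, raising
    // the limit to the full buffer size makes A_memcpy keep using ordinary
    // cached stores instead of switching to non-temporal stores for large copies.
    SetMemcpyCacheLimit(SIZE_BYTES);

    A_memcpy(p_dst, p_src, SIZE_BYTES);   // asmlib replacement for memcpy
    A_memmove(p_dst, p_src, SIZE_BYTES);  // asmlib replacement for memmove

    free(p_dst);
    free(p_src);
    return 0;
}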
Results using functions from asmlib:
Buffer Size: 1GB  | memcpy (ms) | memmove (ms) | naiveMemcpy (ms)
------------------------------------------------------------------
Laptop            | 136         | 132          | 161
Laptop SetCache   | 182         | 137          | 161
Server 1          | 305         | 302          | 164
Server 1 SetCache | 162         | 303          | 164
Server 2          | 300         | 299          | 166
Server 2 SetCache | 166         | 301          | 166
We are starting to lean towards this being a cache-related issue, but what would cause it?