We have recently acquired several new servers and are experiencing poor memcpy performance. Memcpy performance on the servers is 3x slower compared to our laptops.
Server 1 Specifications
- Chassis and Mobo: SUPER MICRO 1027GR-TRF
- Processor: 2x Intel Xeon E5-2680 @ 2.70 GHz
- Memory: 8x 16 GB DDR3 1600 MHz
Edit: I am also testing on a different server with slightly higher specifications and am seeing the same results as on the first server.
Server 2 Specifications
- Chassis and Mobo: SUPER MICRO 1027GR-TRFT
- Processor: 2x Intel Xeon E5-2650 v2 @ 2.6 GHz
- Memory: 8x 16 GB DDR3 1866 MHz
Laptop Specifications
- Case: Lenovo W530
- Processor: 1x Intel Core i7-3720QM @ 2.6 GHz
- Memory: 4x 4 GB DDR3 1600 MHz
Operating System
$ cat /etc/redhat-release
Scientific Linux release 6.5 (Carbon)

$ uname -a
Linux r113 2.6.32-431.1.2.el6.x86_64
Compiler (same on all systems)
$ gcc --version
gcc (GCC) 4.6.1
Also tested with gcc 4.8.2 based on a suggestion from @stefan. There was no performance difference between compilers.
Test Code
The following is the test code. It is a canned benchmark that duplicates the problem I am seeing in our production code. I know this benchmark is simplistic, but it is able to exercise and identify our problem. The code creates two 1 GB buffers and memcpys between them, timing the memcpy call. You can specify alternate buffer sizes on the command line using: ./big_memcpy_test [SIZE_BYTES]
#include <chrono>
#include <cstring>
#include <cstdio>
#include <cstdint>
#include <iostream>
#include <string>

class Timer
{
 public:
  Timer()
      : mStart(),
        mStop()
  {
    update();
  }

  void update()
  {
    mStart = std::chrono::high_resolution_clock::now();
    mStop  = mStart;
  }

  double elapsedMs()
  {
    mStop = std::chrono::high_resolution_clock::now();
    std::chrono::milliseconds elapsed_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(mStop - mStart);
    return elapsed_ms.count();
  }

 private:
  std::chrono::high_resolution_clock::time_point mStart;
  std::chrono::high_resolution_clock::time_point mStop;
};

std::string formatBytes(std::uint64_t bytes)
{
  static const int num_suffix = 5;
  static const char* suffix[num_suffix] = { "B", "KB", "MB", "GB", "TB" };
  double dbl_s_byte = bytes;
  int i = 0;

  for (; (int)(bytes / 1024.) > 0 && i < num_suffix; ++i, bytes /= 1024.)
  {
    dbl_s_byte = bytes / 1024.0;
  }

  const int buf_len = 64;
  char buf[buf_len];

  // use snprintf so there is no buffer overrun
  int res = snprintf(buf, buf_len, "%0.2f%s", dbl_s_byte, suffix[i]);

  // snprintf returns the number of characters that would have been written if n
  // had been sufficiently large, not counting the terminating null character.
  // if an encoding error occurs, a negative number is returned.
  if (res >= 0)
  {
    return std::string(buf);
  }
  return std::string();
}

void doMemmove(void* pDest, const void* pSource, std::size_t sizeBytes)
{
  memmove(pDest, pSource, sizeBytes);
}

int main(int argc, char* argv[])
{
  std::uint64_t SIZE_BYTES = 1073741824; // 1GB

  if (argc > 1)
  {
    SIZE_BYTES = std::stoull(argv[1]);
    std::cout << "Using buffer size from command line: " << formatBytes(SIZE_BYTES)
              << std::endl;
  }
  else
  {
    std::cout << "To specify a custom buffer size: big_memcpy_test [SIZE_BYTES] \n"
              << "Using built in buffer size: " << formatBytes(SIZE_BYTES)
              << std::endl;
  }

  // big array to use for testing
  char* p_big_array = NULL;

  /////////////
  // malloc
  {
    Timer timer;

    p_big_array = (char*)malloc(SIZE_BYTES * sizeof(char));
    if (p_big_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " returned NULL!" << std::endl;
      return 1;
    }

    std::cout << "malloc for " << formatBytes(SIZE_BYTES) << " took "
              << timer.elapsedMs() << "ms" << std::endl;
  }

  /////////////
  // memset
  {
    Timer timer;

    // set all data in p_big_array to 0xF (non-zero, to avoid the zero page)
    memset(p_big_array, 0xF, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memset for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;
  }

  /////////////
  // memcpy
  {
    char* p_dest_array = (char*)malloc(SIZE_BYTES);
    if (p_dest_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memcpy test"
                << " returned NULL!" << std::endl;
      return 1;
    }
    memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char));

    // time only the memcpy FROM p_big_array TO p_dest_array
    Timer timer;

    memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memcpy for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;

    // cleanup p_dest_array
    free(p_dest_array);
    p_dest_array = NULL;
  }

  /////////////
  // memmove
  {
    char* p_dest_array = (char*)malloc(SIZE_BYTES);
    if (p_dest_array == NULL)
    {
      std::cerr << "ERROR: malloc of " << SIZE_BYTES << " for memmove test"
                << " returned NULL!" << std::endl;
      return 1;
    }
    memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char));

    // time only the memmove FROM p_big_array TO p_dest_array
    Timer timer;

    // wrapped in doMemmove() so GCC does not optimize the call into memcpy
    // memmove(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));
    doMemmove(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));

    double elapsed_ms = timer.elapsedMs();
    std::cout << "memmove for " << formatBytes(SIZE_BYTES) << " took "
              << elapsed_ms << "ms "
              << "(" << formatBytes(SIZE_BYTES / (elapsed_ms / 1.0e3)) << " bytes/sec)"
              << std::endl;

    // cleanup p_dest_array
    free(p_dest_array);
    p_dest_array = NULL;
  }

  // cleanup
  free(p_big_array);
  p_big_array = NULL;

  return 0;
}
CMake Build File
project(big_memcpy_test)
cmake_minimum_required(VERSION 2.4.0)

include_directories(${CMAKE_CURRENT_SOURCE_DIR})

# create verbose makefiles that show each command line as it is issued
set( CMAKE_VERBOSE_MAKEFILE ON CACHE BOOL "Verbose" FORCE )

# release mode
set( CMAKE_BUILD_TYPE Release )

# grab the CXXFLAGS environment variable and append C++11 and -Wall options
set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++0x -Wall -march=native -mtune=native" )
message( INFO "CMAKE_CXX_FLAGS = ${CMAKE_CXX_FLAGS}" )

# sources to build
set(big_memcpy_test_SRCS
  main.cpp
)

# create an executable file named "big_memcpy_test" from
# the source files in the variable "big_memcpy_test_SRCS".
add_executable(big_memcpy_test ${big_memcpy_test_SRCS})
Test results
Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | NUMA nodes (numactl --hardware)
---------------------------------------------------------------------------------------------
Laptop 1         | 0           | 127         | 113         | 1
Laptop 2         | 0           | 180         | 120         | 1
Server 1         | 0           | 306         | 301         | 2
Server 2         | 0           | 352         | 325         | 2
As you can see, the memcpys and memsets on our servers are much slower than the memcpys and memsets on our laptops.
Different buffer sizes
I tried buffers ranging from 100 MB to 5 GB, all with the same results (the servers are slower than the laptops).
NUMA Affinity
I read about people having performance issues related to NUMA, so I tried setting CPU and memory affinity using numactl, but the results remained the same.
NUMA Server Hardware
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65501 MB
node 0 free: 62608 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 63837 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
Laptop NUMA Hardware
$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16018 MB
node 0 free: 6622 MB
node distances:
node   0
  0:  10
Setting NUMA Affinity
$ numactl --cpunodebind=0 --membind=0 ./big_memcpy_test
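For reference, the same binding can also be done from inside the process via libnuma instead of numactl. Below is a minimal sketch of how that might look (this is an assumption, not code from my test; it requires the libnuma headers and linking with -lnuma):

// Hypothetical sketch: bind the current thread and its allocations to NUMA
// node 0 from inside the process, similar in spirit to
// `numactl --cpunodebind=0 --membind=0`. Requires libnuma (link with -lnuma).
#include <numa.h>
#include <cstdio>
#include <cstring>

int main()
{
    if (numa_available() < 0)
    {
        std::fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    numa_run_on_node(0);    // restrict this thread to the CPUs of node 0
    numa_set_preferred(0);  // prefer memory allocations from node 0

    const std::size_t SIZE_BYTES = 1073741824; // 1GB
    char* p = (char*)numa_alloc_onnode(SIZE_BYTES, 0); // allocate directly on node 0
    if (p == NULL)
    {
        return 1;
    }

    // touch the pages so they are actually faulted in on node 0
    std::memset(p, 0xF, SIZE_BYTES);

    numa_free(p, SIZE_BYTES);
    return 0;
}

numactl on the command line should cover the same ground, so this is only for completeness.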
Any help resolving this is greatly appreciated.
Edit: GCC settings
Based on the comments, I tried to compile with different GCC options:
Compiling with -march and -mtune set to native
g++ -std=c++0x -Wall -march=native -mtune=native -O3 -DNDEBUG -o big_memcpy_test main.cpp
Result: identical performance (no improvement)
Compiling with -O2 instead of -O3
g++ -std=c++0x -Wall -march=native -mtune=native -O2 -DNDEBUG -o big_memcpy_test main.cpp
Result: identical performance (no improvement)
Edit: Changed memset to write 0xF instead of 0 to avoid a NULL page (@SteveCox)
There is no improvement when memset writes a value other than 0 (0xF was used in this case).
Edit: Cachebench Results
To rule out that my test program is too simplistic, I downloaded a real benchmarking program, LLCacheBench (http://icl.cs.utk.edu/projects/llcbench/cachebench.html).
I built the benchmark on each machine separately to avoid architecture issues. Below are my results.

Note that there is a VERY large difference in performance at the larger buffer sizes. The last size tested (16777216 bytes) ran at 18849.29 MB/s on the laptop but only 6710.40 MB/s on the server. That is roughly a 3x difference in performance. You can also see that the performance drop-off on the server is much steeper than on the laptop.
Edit: memmove() is 2x FASTER than memcpy() on the server
Based on some experimentation, I tried using memmove() instead of memcpy() in my test case and found a 2x improvement on the server. memmove() on the laptop runs slower than memcpy(), but, oddly enough, it runs at the same speed as memmove() on the server. This begs the question: why is memcpy so slow?
I updated the code to test memmove alongside memcpy. I had to wrap the memmove() inside a function because if I left the call inline, GCC optimized it and it performed exactly the same as memcpy() (I assume GCC optimized it into memcpy because it knew the locations did not overlap).
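An alternative to the wrapper function (not what I used above, just a sketch) would be to mark the wrapper noinline, or to compile with -fno-builtin-memmove; either should likewise stop GCC from substituting its builtin memcpy when it can prove the buffers do not overlap:

// Hypothetical alternative to the doMemmove() wrapper in the test code above.
// __attribute__((noinline)) keeps the call out of line, so GCC cannot see at
// the call site that source and destination never overlap, and therefore
// cannot replace the memmove with memcpy.
#include <cstring>

__attribute__((noinline))
void doMemmoveNoInline(void* pDest, const void* pSource, std::size_t sizeBytes)
{
    memmove(pDest, pSource, sizeBytes);
}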
Updated Results
Buffer Size: 1GB | malloc (ms) | memset (ms) | memcpy (ms) | memmove (ms) | NUMA nodes (numactl --hardware)
-----------------------------------------------------------------------------------------------------------
Laptop 1         | 0           | 127         | 113         | 161          | 1
Laptop 2         | 0           | 180         | 120         | 160          | 1
Server 1         | 0           | 306         | 301         | 159          | 2
Server 2         | 0           | 352         | 325         | 159          | 2
Edit: Naive Memcpy
Based on a suggestion from @Salgar, I implemented my own naive memcpy function and tested it.
Naive Memcpy Source
void naiveMemcpy(void* pDest, const void* pSource, std::size_t sizeBytes)
{
  char* p_dest = (char*)pDest;
  const char* p_source = (const char*)pSource;
  for (std::size_t i = 0; i < sizeBytes; ++i)
  {
    *p_dest++ = *p_source++;
  }
}
Naive memcpy results compared to memcpy()
Buffer Size: 1GB | memcpy (ms) | memmove (ms) | naiveMemcpy (ms)
-----------------------------------------------------------------
Laptop 1         | 113         | 161          | 160
Server 1         | 301         | 159          | 159
Server 2         | 325         | 159          | 159
Edit: Assembly output
Simple memcpy source
#include <cstring>
#include <cstdlib>

int main(int argc, char* argv[])
{
  size_t SIZE_BYTES = 1073741824; // 1GB

  char* p_big_array  = (char*)malloc(SIZE_BYTES * sizeof(char));
  char* p_dest_array = (char*)malloc(SIZE_BYTES * sizeof(char));

  memset(p_big_array,  0xA, SIZE_BYTES * sizeof(char));
  memset(p_dest_array, 0xF, SIZE_BYTES * sizeof(char));

  memcpy(p_dest_array, p_big_array, SIZE_BYTES * sizeof(char));

  free(p_dest_array);
  free(p_big_array);

  return 0;
}
Assembly output: This is identical on both the server and the laptop. I am saving space and not pasting both.
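For reference, output like the listing below can be generated by adding -S to the same compile flags used above, e.g.:
g++ -std=c++0x -Wall -march=native -mtune=native -O3 -S main_memcpy.cpp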
        .file   "main_memcpy.cpp"
        .section        .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB25:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movl    $1073741824, %edi
        pushq   %rbx
        .cfi_def_cfa_offset 24
        .cfi_offset 3, -24
        subq    $8, %rsp
        .cfi_def_cfa_offset 32
        call    malloc
        movl    $1073741824, %edi
        movq    %rax, %rbx
        call    malloc
        movl    $1073741824, %edx
        movq    %rax, %rbp
        movl    $10, %esi
        movq    %rbx, %rdi
        call    memset
        movl    $1073741824, %edx
        movl    $15, %esi
        movq    %rbp, %rdi
        call    memset
        movl    $1073741824, %edx
        movq    %rbx, %rsi
        movq    %rbp, %rdi
        call    memcpy
        movq    %rbp, %rdi
        call    free
        movq    %rbx, %rdi
        call    free
        addq    $8, %rsp
        .cfi_def_cfa_offset 24
        xorl    %eax, %eax
        popq    %rbx
        .cfi_def_cfa_offset 16
        popq    %rbp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE25:
        .size   main, .-main
        .ident  "GCC: (GNU) 4.6.1"
        .section        .note.GNU-stack,"",@progbits
PROGRESS!!! asmlib
Based on a suggestion from @tbenson, I tried the asmlib version of memcpy. Initially my results were poor, but after changing SetMemcpyCacheLimit() to 1 GB (the size of my buffer), it ran at the same speed as my naive loop!
The bad news is that the asmlib version of memmove is slower than the glibc version; it now runs around the 300 ms mark (on par with the glibc version of memcpy). The strange thing is that on the laptop, setting SetMemcpyCacheLimit() to a large value hurts performance...
In the results below, the rows marked SetCache have SetMemcpyCacheLimit set to 1073741824. The rows without SetCache do not call SetMemcpyCacheLimit().
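Roughly how the asmlib calls fit into the test (a sketch under my assumptions, not the exact code; function names are the ones declared in asmlib.h from Agner Fog's asmlib):

// Sketch of the asmlib-based variant of the copy test (assumes Agner Fog's
// asmlib is built, asmlib.h is on the include path, and the program is linked
// against the asmlib static library).
#include "asmlib.h"
#include <cstdlib>

int main()
{
    const size_t SIZE_BYTES = 1073741824; // 1GB

    char* p_src = (char*)malloc(SIZE_BYTES);
    char* p_dst = (char*)malloc(SIZE_BYTES);
    if (p_src == NULL || p_dst == NULL)
    {
        return 1;
    }

    A_memset(p_src, 0xA, SIZE_BYTES);
    A_memset(p_dst, 0xF, SIZE_BYTES);

    // The "SetCache" rows below call this first. As I understand it, raising
    // the limit to the full buffer size makes A_memcpy keep using ordinary
    // cached stores instead of switching to non-temporal stores for large copies.
    SetMemcpyCacheLimit(SIZE_BYTES);

    A_memcpy(p_dst, p_src, SIZE_BYTES);   // asmlib replacement for memcpy
    A_memmove(p_dst, p_src, SIZE_BYTES);  // asmlib replacement for memmove

    free(p_dst);
    free(p_src);
    return 0;
}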
Results using functions from asmlib:
Buffer Size: 1GB  | memcpy (ms) | memmove (ms) | naiveMemcpy (ms)
------------------------------------------------------------------
Laptop            | 136         | 132          | 161
Laptop SetCache   | 182         | 137          | 161
Server 1          | 305         | 302          | 164
Server 1 SetCache | 162         | 303          | 164
Server 2          | 300         | 299          | 166
Server 2 SetCache | 166         | 301          | 166
We are starting to lean towards this being a cache-related issue, but what would cause it?