Is multi-threaded memory access faster than single-threaded memory access?


Suppose we are working in C. A simple example: if I have a giant array A and I want to copy it into an array B of the same size, is the copy faster with multiple threads than with a single thread? How many threads are suitable for this kind of memory operation?
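For reference, the single-threaded baseline might look like this (a minimal sketch; the function name `copy_array` is my own, not from any library):

```c
#include <string.h>

/* Single-threaded baseline: copy n doubles from src to dst.
 * A multithreaded version would split this one memcpy into
 * per-thread chunks. */
void copy_array(double *dst, const double *src, size_t n) {
    memcpy(dst, src, n * sizeof(double));
}
```

It would be called as `copy_array(B, A, n)` for the arrays A and B from the question.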

EDIT: Let me narrow the question. First, we are not considering the GPU case. Optimizing memory access is very important and effective when programming on a GPU, and in my experience one should always be careful with memory operations there; this is not always the case on a CPU. Also, ignore SIMD instructions such as AVX and SSE. Memory access can also become a performance problem when a program has many memory operations and few computational operations. Suppose we work with an x86 architecture with 1-2 processors, each with several cores and a four-channel memory interface, and with DDR4 main memory, as is customary today.

My array is an array of double-precision floating-point numbers, with a size comparable to the processor's L3 cache, approximately 50 MB. Now I have two cases: 1) copy this array into another array of the same size, either element by element or using memcpy; 2) merge many small arrays into this giant array. Both are real-time operations, meaning they must be completed as quickly as possible. Does multithreading give a speedup or a slowdown? What factors affect the performance of memory operations in this case?
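For case 1, a multithreaded copy could be sketched with POSIX threads as below (the chunking scheme and the 64-thread cap are my own illustrative choices; whether this beats a single memcpy depends on whether one core can already saturate the memory interface):

```c
#include <pthread.h>
#include <string.h>

typedef struct {
    double *dst;
    const double *src;
    size_t count;   /* elements in this thread's chunk */
} chunk_t;

static void *copy_chunk(void *arg) {
    chunk_t *c = arg;
    memcpy(c->dst, c->src, c->count * sizeof(double));
    return NULL;
}

/* Copy n doubles from src to dst using nthreads threads.
 * Returns 0 on success, -1 on error. */
int parallel_copy(double *dst, const double *src, size_t n, int nthreads) {
    pthread_t tid[64];
    chunk_t chunk[64];
    if (nthreads < 1 || nthreads > 64) return -1;
    size_t per = n / nthreads;
    for (int t = 0; t < nthreads; t++) {
        size_t off = (size_t)t * per;
        chunk[t].dst = dst + off;
        chunk[t].src = src + off;
        /* the last thread takes the remainder */
        chunk[t].count = (t == nthreads - 1) ? n - off : per;
        if (pthread_create(&tid[t], NULL, copy_chunk, &chunk[t]) != 0)
            return -1;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```

Case 2 (merging many small arrays) fits the same pattern: each thread gets a disjoint destination range, so no synchronization is needed beyond the final joins.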

Someone said that it will mainly depend on DMA performance, which I assume applies when we use memcpy. What if we do an element-by-element copy instead? Will the data go through the processor cache first?

c multithreading memory
3 answers

It depends on many factors. One of them is the hardware you use. On modern PC hardware, multithreading is unlikely to improve performance, because processor time is not the limiting factor in copy operations; the memory interface is. The CPU will most likely use the DMA controller for copying, so it will not be very busy while the data is copied.


Over the years, CPU performance has increased dramatically, while RAM performance could not keep up. This has made the cache more and more important, especially since the Celeron era.

Thus, performance can go up or down depending heavily on:

  • the number of load and store units per core
  • the memory controller modules
  • memory pipeline depths and the interleaving of memory banks
  • each thread's memory access pattern (software)
  • alignment of data blocks and instruction blocks
  • how shared hardware resources and their datapaths are divided between threads
  • whether the operating system preempts the threads too often

Just optimize the code for the cache; after that, the quality of the processor will determine the performance.


Example:

The FX8150 has weaker cores than the i7-4700:

  • FX cores can scale with additional threads, but the i7 peaks with only one thread (I mean for memory-heavy code)
  • The FX has more L3 cache, but it is slower
  • The FX can work with higher-frequency RAM, but the i7 has better inter-core bandwidth (when one thread sends data to another thread)
  • The FX pipeline is too long, so it takes too long to recover from a branch misprediction

It looks like AMD spreads performance more evenly across threads, while Intel concentrates it in a single thread (a republic versus a monarchy). Perhaps that is why AMD works better with GPUs and HBM.


If I had to stop speculating, I would only worry about the cache, since it is fixed in the processor, while RAM can come in many different combinations on a motherboard.


Assuming an AMD/Intel x86-64 architecture:

One core is not able to saturate the memory bandwidth, which means multithreading can be faster. For that, the threads must run on different cores; launching as many threads as there are physical cores should give a speedup, since the OS will most likely assign them to different cores. Still, your threading library should have a function that pins a thread to a specific core; using it is best for speed. Another consideration is NUMA, if you have a multi-socket system. For maximum speed, you should also think about using AVX instructions.
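For the pinning part, on Linux with glibc the non-portable `pthread_setaffinity_np` can be used; a minimal sketch (error handling reduced to the return code):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one core; returns 0 on success. */
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Each worker thread would call `pin_to_core` with its own index before touching its chunk of the array; on a NUMA system you would also want that chunk's memory to be allocated on the thread's local node (e.g. via the first-touch policy).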

