Can I allocate memory faster using multiple threads?

If I write a loop that allocates whole int[1024] arrays (1K ints each), and I want it to allocate 10,000 of them, can I do it faster by allocating memory from multiple threads?

I want them on the heap.

Suppose I have a multi-core processor for the job.

I already tried this, but it reduced performance. I am just wondering whether I wrote bad code or whether there is something about memory allocation I did not know.

Does the answer depend on the OS? If so, please tell me how it works on different platforms.

Edit:

The integer allocation loop was just a simplified example. Don't worry about it specifically; just tell me how I can improve this.
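Roughly, the kind of thing I tried looks like this (a simplified sketch, not my exact code; the thread count is arbitrary):

    #include <thread>
    #include <vector>

    int main() {
        const int kArrays = 10000;   // total 1024-int arrays to allocate
        const int kThreads = 4;      // e.g. one thread per core
        std::vector<int*> arrays(kArrays);

        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t) {
            workers.emplace_back([&, t] {
                // Each thread allocates its own share of the arrays.
                for (int i = t; i < kArrays; i += kThreads)
                    arrays[i] = new int[1024];
            });
        }
        for (auto& w : workers) w.join();

        for (int* p : arrays) delete[] p;
        return 0;
    }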

+4
8 answers

It depends on many things, but above all:

  • the OS
  • the malloc implementation you use

The OS is responsible for handing out the "virtual memory" your process can access, and it maintains a translation table that maps virtual addresses to actual physical memory.

Now, the standard malloc implementation is usually conservative and simply puts one giant lock around everything. This means that requests are processed sequentially, and the only thing that allocating from several threads instead of one accomplishes is slowing the whole thing down.

There are smarter allocation schemes, usually based on per-thread pools, in other malloc implementations: tcmalloc (from Google) and jemalloc (used by Facebook) are two such implementations designed for high performance in multi-threaded applications.

However, there is no silver bullet: at some point the OS must perform the virtual-to-physical translation, which requires some form of locking.

Your best bet is to allocate in arenas:

  • Allocate large chunks of memory (arenas) up front
  • Split them into arrays of the appropriate size yourself

There is no need to parallelize the arena allocation itself, and you are better off asking for the largest arenas you can get (keep in mind that requests for very large allocations may fail); the splitting is the part you can then parallelize.
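A minimal sketch of that idea (sizes and thread count are illustrative, and error handling for a failed arena allocation is omitted):

    #include <algorithm>
    #include <thread>
    #include <vector>

    int main() {
        const int kArrays = 10000;
        const int kInts = 1024;

        // One big arena instead of 10,000 separate allocations.
        int* arena = new int[static_cast<size_t>(kArrays) * kInts];

        // Splitting is pure pointer arithmetic; no locking involved.
        std::vector<int*> arrays(kArrays);
        for (int i = 0; i < kArrays; ++i)
            arrays[i] = arena + static_cast<size_t>(i) * kInts;

        // If the memory must be initialized, that part parallelizes
        // cleanly: each thread touches a disjoint slice of the arena.
        const int kThreads = 4;
        std::vector<std::thread> workers;
        for (int t = 0; t < kThreads; ++t) {
            workers.emplace_back([&, t] {
                for (int i = t; i < kArrays; i += kThreads)
                    std::fill(arrays[i], arrays[i] + kInts, 0);
            });
        }
        for (auto& w : workers) w.join();

        delete[] arena; // one delete releases everything
        return 0;
    }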

tcmalloc and jemalloc may help a little; however, they are not aimed at large allocations (which are unusual), and I do not know whether it is possible to tune the size of the arenas they request.

+4

The answer depends on the memory allocation routine, which is a combination of layers: the C++ operator new, probably wrapped around the libC malloc(), which in turn occasionally calls an OS function such as sbrk(). The implementation and performance characteristics of all of these are unspecified and may vary from compiler version to version, with compiler flags, across OS versions, across operating systems, etc. If profiling shows it is slower, then that is the bottom line. You can try varying the number of threads, but what is probably happening is that the threads are all trying to obtain the same lock in order to modify the heap; the overhead involved in saying "ok, thread X gets to go next" and "thread X here, I'm done" simply wastes time. Another C++ environment might end up using atomic operations to avoid locking, which might or might not prove faster; there is no general rule.

If you want to get the job done faster, consider allocating one array of 10000 * 1024 ints and then using different parts of it for each "array" (for example, [0]..[1023], [1024]..[2047], ...).
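A sketch of that layout (bounds checks omitted):

    int main() {
        // One allocation; "array" i is the slice
        // big[i * 1024] .. big[i * 1024 + 1023].
        int* big = new int[10000 * 1024];

        int* array7 = big + 7 * 1024; // the 8th 1024-int "array"
        array7[0] = 42;

        delete[] big; // releases all 10,000 slices at once
        return 0;
    }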

+3

I think you may need to adjust your expectations of multithreading.

The main advantage of multithreading is that you can perform tasks asynchronously, i.e. in parallel. In your case, when your main thread needs more memory, it does not matter whether it was allocated by another thread: you still have to stop and wait for the allocation to complete, so there is no parallelism here. In addition, there is overhead in one thread signaling when it is done and the other waiting for completion, which can hurt performance. Moreover, if you start a thread every time you need an allocation, that is a huge overhead; if not, you need some mechanism to pass allocation requests and responses between threads, some kind of task queue, which again is overhead with no gain.

Another approach is to have a thread run ahead and pre-allocate the memory you will need. This can give you a real gain, but if you are going to pre-allocate, you might as well do it in the main thread, which is simpler. For instance, allocate 10M ints in one shot (or 10 times 1M, or as much contiguous memory as you can get) and keep an array of 10,000 pointers into it at 1024-int offsets, representing your arrays. If you do not need to release them independently of each other, this is much simpler and may well be more efficient than using multithreading.
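For illustration, a sketch of that pre-allocation in the main thread (the 10M figure from above, expressed as 10,000 * 1024 ints):

    #include <vector>

    int main() {
        // One contiguous block plus a table of 10,000 pointers into
        // it, so code that expects individual int* arrays still works.
        int* block = new int[10000 * 1024];
        std::vector<int*> arrays(10000);
        for (int i = 0; i < 10000; ++i)
            arrays[i] = block + i * 1024;

        // ... use arrays[0] .. arrays[9999] as ordinary 1024-int arrays ...

        delete[] block; // freed as a whole; no per-array deletes
        return 0;
    }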

+3

As for glibc, it uses arenas (see here), and each arena has its own lock.
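The number of arenas can also be tuned; a glibc-specific sketch (the value 4 is just an example):

    #include <malloc.h> // glibc-specific header

    int main() {
        // Cap the number of malloc arenas. More arenas reduce lock
        // contention between threads at the cost of extra memory.
        mallopt(M_ARENA_MAX, 4);
        // ... allocate from multiple threads as usual ...
        return 0;
    }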

You can also consider tcmalloc from Google (it stands for Thread-Caching malloc), which shows around a 30% performance improvement for threaded applications. We use it in our project. In debug mode it can even detect incorrect memory usage (e.g. new/free mismatches).

+1

As far as I know, all OSes have an implicit mutex lock inside the dynamic allocation calls (malloc, ...). If you think about it, if that operation were not locked, you would run into terrible problems.

You can use the Threading Building Blocks library, http://threadingbuildingblocks.org/ , which has a thread-friendly scalable allocator.
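For example, with TBB's scalable allocator (a sketch; you link against the tbbmalloc library):

    #include <tbb/scalable_allocator.h>
    #include <vector>

    int main() {
        // An STL container whose storage comes from TBB's
        // thread-friendly scalable allocator rather than the
        // default global heap.
        std::vector<int, tbb::scalable_allocator<int>> v(1024, 0);

        // A C-style interface is available as well.
        void* p = scalable_malloc(1024 * sizeof(int));
        scalable_free(p);
        return 0;
    }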

But I think the better idea is to allocate all the memory once (that should be pretty fast) and split it up yourself. I think the TBB allocator does something similar anyway.

Do something like:

    new int[1024 * 10000]

and assign the 1024-int pieces to your array of pointers, or whatever you are using.

Make sense?

0

Since the heap is shared within each process, the heap is locked for every allocation, so only one thread can allocate at a time. This may explain the performance degradation you see when allocating from multiple threads, as you do.

0

The answer depends on your operating system and runtime, but in most cases you cannot.

Typically, you will have two versions of the runtime: a multi-threaded version and a single-threaded version.

The single-threaded version is not thread safe. Two threads allocating at the same time can crash your application.

The multi-threaded version is thread safe. However, in most common implementations this simply means that malloc calls are wrapped in a mutex. Only one thread can ever be inside the malloc function at any given time, so trying to speed up allocation with multiple threads just produces a lock convoy.

It is possible that there are operating systems that can safely handle concurrent allocations within the same process using minimal locking, which would let you reduce the time spent allocating. Unfortunately, I don't know of any.

0

If the arrays belong together and will only be freed as a whole, you can simply allocate one array of 10000 * 1024 ints and then build your individual arrays inside it. Just remember that you cannot delete the small arrays, only the whole block.

    int *all_arrays = new int[1024 * 10000];
    int *small_array123 = all_arrays + 1024 * 123;

This gives you your small arrays when you replace 123 with a number from 0 to 9999.

0
