C# performance based on memory

Hope this is a valid post here; it's a mix of C# and hardware issues.

I am benchmarking our server because we found performance problems with our quant library (written in C#). I reproduced the same performance problems with simple C# code that is very memory-intensive.

Below is the code, which gets run on a threadpool thread, with up to 32 threads running at once (since our server has 4 CPUs x 8 cores each).

It's all on .NET 3.5.

The problem is that we get wildly varying performance. I run the function below 1000 times. The average run might take, say, 3.5 s, but the fastest run takes only 1.2 s and the slowest takes 7 s, for the same function!

I correlated memory usage with the timings, and there seems to be no correlation with GC activity.

One thing I noticed is that when everything runs on a single thread, the timings are identical and there is no wild variation. I also tested CPU-bound algorithms, and their timings are identical too. This made us wonder whether the memory bus can't keep up.

I was wondering whether this could be a .NET or C# problem, or whether it is related to our hardware. Would I see the same behavior if I used C++ or Java? We use 4x Intel X7550 with 32 GB of RAM. Is there any way around this problem in general?

Stopwatch watch = new Stopwatch();
watch.Start();

List<byte> list1 = new List<byte>();
List<byte> list2 = new List<byte>();
List<byte> list3 = new List<byte>();

int Size1 = 10000000;
int Size2 = 2 * Size1;
int Size3 = Size1;

for (int i = 0; i < Size1; i++)
{
    list1.Add(57);
}
for (int i = 0; i < Size2; i = i + 2)
{
    list2.Add(56);
}
for (int i = 0; i < Size3; i++)
{
    byte temp = list1.ElementAt(i);
    byte temp2 = list2.ElementAt(i);
    list3.Add(temp);
    list2[i] = temp;
    list1[i] = temp2;
}

watch.Stop();

(The code is just designed to stress memory.)

I would include the threadpool code, but we use a non-standard threadpool library.

EDIT: I reduced Size1 to 100000, which basically doesn't consume much memory, and I still get a lot of jitter. This suggests it is not the amount of memory transferred but the frequency of memory allocation?

6 answers

I don't have time to dig in deeply, but here are a few areas to start:

  • Variability as a result of the GC's internal state. The GC dynamically manages the sizes of its various pools. If you start with different pool sizes, you will get different GC behavior across the runs.
  • Vagaries in thread scheduling. Depending on random variations in thread ordering, you might get more or less favorable contention patterns. If there is any periodicity, this can lead to an amplified effect similar to constructive interference.
  • False sharing. If two threads both hit memory addresses close enough to land on the same processor cache line, you will see a marked performance drop, because the processors have to spend a lot of time re-synchronizing their caches. Depending on how you lay out your data and assign threads to process it, you may get patterns of false sharing that depend on variations at startup.
  • Another process on the system is taking up CPU time. You might want to use a user-mode measurement of process time instead of wall time. (There's a helper on the Process class somewhere.)
  • The machine is running close to its physical memory limit. Paging to disk happens in a more or less random pattern.
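On the fourth point, a minimal sketch of comparing wall time against user-mode process time (the `Process.TotalProcessorTime` property appears to be the helper alluded to; the loop body is just a placeholder workload):

```csharp
using System;
using System.Diagnostics;

class CpuTimeDemo
{
    public static void Main()
    {
        Process proc = Process.GetCurrentProcess();
        TimeSpan cpuBefore = proc.TotalProcessorTime; // user + kernel time so far
        Stopwatch wall = Stopwatch.StartNew();

        // ... run the benchmark body here ...
        long sum = 0;
        for (int i = 0; i < 50000000; i++) sum += i;

        wall.Stop();
        proc.Refresh(); // re-read the process counters
        TimeSpan cpuUsed = proc.TotalProcessorTime - cpuBefore;

        // If wall time jumps between runs while CPU time stays flat,
        // something else on the box is stealing the processor.
        Console.WriteLine("Wall: {0} ms, CPU: {1} ms",
            wall.ElapsedMilliseconds, cpuUsed.TotalMilliseconds);
    }
}
```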

Here you are up against some quite hard limits of the machine. You have many cores, but there is still only one memory bus. So if your threads do a lot of data shuffling, they are likely to bottleneck on the bandwidth of that bus. This is Amdahl's law at work.

There is one possible optimization; it depends on which operating system this machine runs. It is server hardware, but if you have a non-server edition of Windows, the garbage collector runs in workstation mode. You can then use the <gcServer> element in the application's .config file to request the server flavor of the collector. It uses multiple heaps, so threads won't fight over the GC heap lock as often when they allocate memory. YMMV.
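For reference, the element goes in the runtime section of the application's .config file (a standard .NET configuration fragment):

```xml
<configuration>
  <runtime>
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```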


List<T> uses an array for storage. I believe it doubles the size of the array every time it runs out of free space in the list.

As you go through the loop, it needs larger and larger chunks of contiguous memory to allocate new arrays as the lists grow. With a single thread, this is fairly simple. With 2+ threads, you are competing for those large chunks of contiguous memory. The GC will kick in at random times, when the arrays get bigger and contiguous memory gets harder to find.
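A sketch of the effect this answer describes: passing a capacity to the List<T> constructor performs one allocation up front, whereas the default constructor repeatedly doubles and copies the backing array as items are added (the element count here is illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class ListGrowthDemo
{
    public static long FillList(int count, bool presize)
    {
        Stopwatch sw = Stopwatch.StartNew();
        // Without a capacity, the backing array grows by doubling and is
        // copied each time it fills; with a capacity, it is allocated once.
        List<byte> list = presize ? new List<byte>(count) : new List<byte>();
        for (int i = 0; i < count; i++)
        {
            list.Add(57);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        int n = 10000000;
        Console.WriteLine("grow-on-demand: {0} ms", FillList(n, false));
        Console.WriteLine("pre-sized:      {0} ms", FillList(n, true));
    }
}
```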


Make sure the runtime configuration has gcServer enabled (<gcServer enabled="true"/>).


At this point it's all guesswork; anything anyone says is just a hypothesis. You really need more data.

I would hook up a profiler or set up some Windows performance counters:

http://support.microsoft.com/kb/300504

You should be able to add some process-specific performance counters. You can see how many threads are spawned, memory usage, and so on. I would follow the other suggestions here and measure the scenario you are seeing. If you log the performance counter data to a CSV file, you can even graph the results fairly quickly and get some real data to chew on. If you can find a metric that differs between the 1.2 s and the 7 s runs, you can start making educated guesses about what is happening and narrow in from there.
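Performance counters are one route; a lighter-weight sketch that works from inside the process uses GC.CollectionCount to log GC activity alongside each timing, so outlier runs can be lined up against a metric (the workload method and file name here are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

class CounterLog
{
    // Run the workload repeatedly and log wall time plus GC collections
    // per iteration to a CSV for graphing.
    public static void Main()
    {
        using (StreamWriter csv = new StreamWriter("timings.csv"))
        {
            csv.WriteLine("run,ms,gen0,gen2");
            for (int run = 0; run < 20; run++)
            {
                int gen0Before = GC.CollectionCount(0);
                int gen2Before = GC.CollectionCount(2);
                Stopwatch sw = Stopwatch.StartNew();

                Workload(); // stand-in for the code being benchmarked

                sw.Stop();
                csv.WriteLine("{0},{1},{2},{3}", run, sw.ElapsedMilliseconds,
                    GC.CollectionCount(0) - gen0Before,
                    GC.CollectionCount(2) - gen2Before);
            }
        }
    }

    static void Workload()
    {
        List<byte> list = new List<byte>();
        for (int i = 0; i < 1000000; i++) list.Add(57);
    }
}
```

A spike in Gen 2 (full, blocking) collections on the slow runs would point squarely at the GC hypotheses above.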


Synchronized access to shared resources such as the Console or the file system can degrade performance significantly, but by the look of it this code simply maxes out the CPU and memory, so the timing variations are probably due to other processes contending for CPU time.

