Is it faster to drop and recreate an array, or to fill it with zeros, and why?

Let's say I create an array designed to emulate processor memory:

byte[] mem = new byte[0xF00]; 

This array is used during emulation and must ultimately (read: frequently) be discarded or reset. My question is: which is faster, and why?

 mem = new byte[0xF00]; 

or

 for(int i = 0; i < mem.length; i++) mem[i] = 0; 

This may seem insignificant, but when emulating a large number of processors, even small inefficiencies matter. Any speed difference comes down to how the JVM handles this: in the first case, the dereferenced array must be garbage collected and the JVM has to allocate (and presumably zero?) new memory; in the second case, the GC cost is avoided, but we still have to iterate over every element of the array.
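To make the two options concrete, here is a minimal sketch of both reset strategies as helpers (the class and method names are made up for illustration); Arrays.fill is the library form of the hand-written loop:

 import java.util.Arrays;

 class MemReset {
     // Option 1: drop and reallocate. The old array becomes garbage and the
     // JVM hands back memory that is already zeroed.
     static byte[] resetByReallocating(byte[] mem) {
         return new byte[mem.length];
     }

     // Option 2: reuse in place. Arrays.fill is equivalent to the zeroing loop.
     static byte[] resetByFilling(byte[] mem) {
         Arrays.fill(mem, (byte) 0);
         return mem;
     }
 }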

Some additional caveats to this question:

  • Does the cost ratio change with the size of the data type? For example, what about short[] ?
  • Does the array length change the cost ratio?
  • Most importantly, why?
+4
5 answers

Reallocating the array does not actually increase the cost of each GC, since the GC only visits and copies live objects and does nothing with dead ones. However, allocation does make minor GCs happen more often. But if none of the recently allocated objects are still alive, the cost of a minor GC is very low, and major GCs will not be triggered at all.

In addition, allocating objects is cheap on modern JVMs, and zeroing the allocation space is probably the most efficient zeroing the JVM can achieve. If you manage to zero an array in your own code as fast as the JVM does (Edit: as Stephen Shlansker mentioned, the JIT compiler can optimize such fill loops), reusing the array should be faster. In any case, unless the loop you showed is optimized by the JIT compiler, I expect it to be much slower.

To answer your other questions:

  • The GC zeroes the allocation space (Eden) all at once, so it makes no difference whether it is short[] or byte[] . For the for loop, however, zeroing the same number of bytes takes only half as many iterations with short[] as with byte[] (setting a byte or a short to 0 should not cost any different per element); a small sketch after this list illustrates the comparison.
  • The longer the array, the more iterations your loop needs, so that cost grows linearly. The GC also needs amortized linear time to zero a range of bytes, so I believe the ratio between the two approaches stays roughly constant. However, zeroing one large region of memory can be done more efficiently than zeroing many small ones, which would make the GC's zeroing (the whole allocation space at once) more efficient than the loop approach. For very large arrays the situation changes: they are allocated directly in the tenured generation (unless G1 is used) and will therefore trigger major GCs, which are much more expensive.
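A rough one-shot sketch of the short[] vs byte[] point (my own illustration, not a proper benchmark; a real comparison would need warm-up or a harness such as JMH):

 import java.util.Arrays;

 public class FillWidthSketch {
     public static void main(String[] args) {
         byte[] bytes = new byte[64 * 1024];     // 64 KiB addressed as bytes
         short[] shorts = new short[32 * 1024];  // the same 64 KiB addressed as shorts

         long t0 = System.nanoTime();
         Arrays.fill(bytes, (byte) 0);           // 65,536 element writes
         long t1 = System.nanoTime();
         Arrays.fill(shorts, (short) 0);         // 32,768 element writes for the same bytes
         long t2 = System.nanoTime();

         System.out.printf("byte[] fill: %,d ns, short[] fill: %,d ns%n",
                 t1 - t0, t2 - t1);
     }
 }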
+4

You can test this yourself, but clearing and re-creating the array take about the same time.

However, re-creating the array has two drawbacks:

  • it churns through the CPU data cache, which reduces its effectiveness;
  • it makes a GC more likely, especially if done frequently, which either pauses the system or slows it down (if the collector is concurrent).

I prefer to reuse the array, not because it is the fastest option, but because it has the least impact on the rest of your application. The micro-benchmark below compares the cost of allocating a new array (with and without GC pauses) against the cost of filling an existing one:


 for (int size = 16; size <= 16 * 1024; size *= 2) {
     int count1 = 0, count1b = 0, count2 = 0;
     long total1 = 0, total1b = 0, total2 = 0;
     for (long i = 0; i < 10000000000L; i += size) {
         long start = System.nanoTime();
         long[] longs = new long[size];
         if (longs[0] + longs[longs.length - 1] != 0)
             throw new AssertionError();
         long mid = System.nanoTime();
         long time1 = mid - start;
         Arrays.fill(longs, 1L);
         long time2 = System.nanoTime() - mid;
         count1b++;
         total1b += time1;
         if (time1 < 10e3) { // no GC
             total1 += time1;
             count1++;
         }
         if (time2 < 10e3) { // no GC
             total2 += time2;
             count2++;
         }
     }
     System.out.printf("%s KB took on average of %,d ns to allocate, %,d ns to allocate including GCs and %,d ns to fill%n",
             size * 8 / 1024.0, total1 / count1, total1b / count1b, total2 / count2);
 }

prints

 0.125 KB took on average of 35 ns to allocate, 36 ns to allocate including GCs and 19 ns to fill
 0.25 KB took on average of 39 ns to allocate, 40 ns to allocate including GCs and 31 ns to fill
 0.5 KB took on average of 56 ns to allocate, 58 ns to allocate including GCs and 55 ns to fill
 1.0 KB took on average of 75 ns to allocate, 77 ns to allocate including GCs and 117 ns to fill
 2.0 KB took on average of 129 ns to allocate, 134 ns to allocate including GCs and 232 ns to fill
 4.0 KB took on average of 242 ns to allocate, 248 ns to allocate including GCs and 368 ns to fill
 8.0 KB took on average of 479 ns to allocate, 496 ns to allocate including GCs and 644 ns to fill
 16.0 KB took on average of 1,018 ns to allocate, 1,055 ns to allocate including GCs and 1,189 ns to fill
 32.0 KB took on average of 2,119 ns to allocate, 2,200 ns to allocate including GCs and 2,625 ns to fill
 64.0 KB took on average of 4,419 ns to allocate, 4,604 ns to allocate including GCs and 4,728 ns to fill
 128.0 KB took on average of 8,333 ns to allocate, 9,472 ns to allocate including GCs and 8,685 ns to fill

This shows that it is hard to assume one approach is faster than the other in all cases.

If I change long[] to int[] , I see much the same:

 0.125 KB took on average of 35 ns to allocate, 36 ns to allocate including GCs and 16 ns to fill
 0.25 KB took on average of 40 ns to allocate, 41 ns to allocate including GCs and 24 ns to fill
 0.5 KB took on average of 58 ns to allocate, 60 ns to allocate including GCs and 40 ns to fill
 1.0 KB took on average of 86 ns to allocate, 87 ns to allocate including GCs and 94 ns to fill
 2.0 KB took on average of 139 ns to allocate, 143 ns to allocate including GCs and 149 ns to fill
 4.0 KB took on average of 256 ns to allocate, 262 ns to allocate including GCs and 206 ns to fill
 8.0 KB took on average of 472 ns to allocate, 481 ns to allocate including GCs and 317 ns to fill
 16.0 KB took on average of 981 ns to allocate, 999 ns to allocate including GCs and 516 ns to fill
 32.0 KB took on average of 2,098 ns to allocate, 2,146 ns to allocate including GCs and 1,458 ns to fill
 64.0 KB took on average of 4,312 ns to allocate, 4,445 ns to allocate including GCs and 4,028 ns to fill
 128.0 KB took on average of 8,497 ns to allocate, 9,072 ns to allocate including GCs and 7,141 ns to fill
+5

I agree that reusing the array will have the least impact on the application, but your specific case does not seem to affect the GC much:

 for(int i = 0; i < mem.length; i++) mem[i] = 0; 

In the above loop (with mem.length = 61440) there will be 2*61440 assignments and 61440 comparisons.

With the GC, on the other hand, the whole memory chunk is zeroed in one go while the object's memory is allocated and marked, which IMO should be faster than the per-element work of the loop above.

But the real GC cost to application performance appears when the application's behavior triggers too many GC cycles (even worse if they are frequent major collections). Your particular case does not obviously imply more GC.

I think the loop approach would be fine for a byte[] . If it were an Object[] , we would take a different approach (see the sketch below).
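For example, a minimal sketch of the Object[] case (my own illustration): clearing the references, rather than zeroing primitive values, is what allows the old referents to be collected.

 Object[] slots = new Object[0xF00];
 // ... use slots during emulation ...
 java.util.Arrays.fill(slots, null);  // reuse the array; the old referents become collectable
 // or simply: slots = new Object[0xF00];  // drop the array and its referents together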

+3

I would definitely go for mem = new byte[0xF00]; and leave the rest to the GC.

Memory usage may be a little higher, but it will not affect your application unless you do this thousands of times per second.

The runtime cost will be much lower, and there is no need to call the GC manually; it will do its job anyway.
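A minimal sketch of how that looks in an emulation loop (the loop structure, the emulating flag and the runProcessor call are hypothetical stand-ins):

 byte[] mem = new byte[0xF00];
 while (emulating) {           // hypothetical termination flag
     runProcessor(mem);        // hypothetical emulation step that dirties mem
     mem = new byte[0xF00];    // drop the old array; the GC reclaims it when needed
 }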

+1

Four factors are important here.

1) What is the target platform? (Does it have a lot of RAM? Several processor cores?)
2) What is the maximum amount of memory you plan to allocate? (Large amounts are likely to favor allocation / release.)
3) Which JVM do you plan to use?
4) If your application is performance critical, why are you developing it in Java?

More than anything, I would say: don't worry about premature optimization. Write the software first, then profile it, and optimize the parts that run slowly. Typically, algorithm performance is a bigger problem than data structure performance, especially when your data structure is basically just a blank address space.

0
