"Is there a way to make data1-data2 = data3 without listing through a loop?" Not. It is technically impossible.
At best, or, even worse, you can call a function that will do the enumeration for you. But it will be slow. In the case of LINQ, the wicked is slow.
On the machine I'm working on right now, the other answers produce the following results for a 4 KB array (1024 integers):
- 23,560 ticks - Giannis Paraskevopoulos. Converting array -> enumerable -> array is not exactly fast; copying the array through a ToList().ToArray() chain is about 25 times slower than Array.Copy().
- 10,198 ticks - Selman22. Twice as fast, but still slow. Lambdas are eye candy that makes event code prettier, not faster. You get an anonymous method whose call overhead can cost the processor more than the work itself (remember that the math we do here takes only a few cycles).
- 566 ticks - Tim Schmelter's GetDifference() function. (The main culprit here is the JIT; with precompiled native code and/or more frequent use the difference would be insignificant.)
- 27 ticks - a plain loop. 400 times faster than Zip, and more than 800 times faster than converting the array to a list and back (both approaches are sketched below).
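For reference, my best guesses at the measured variants; these are hedged reconstructions, not verbatim copies of the referenced answers, and they assume using System.Linq:

int[] viaZip = data1.Zip(data2, (a, b) => Math.Abs(a - b)).ToArray(); // the Zip approach
int[] viaListRoundTrip = data1
    .Select((v, i) => Math.Abs(v - data2[i]))
    .ToList()
    .ToArray(); // enumerable -> list -> array round trip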
Loop code:
for (int i = 0; i < data3.Length; i++)
{
    data3[i] = Math.Abs(data1[i] - data2[i]);
}
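The tick counts above were presumably collected with something like System.Diagnostics.Stopwatch; a minimal harness sketch (warm-up and averaging are omitted here, which a serious benchmark would need):

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < data3.Length; i++)
{
    data3[i] = Math.Abs(data1[i] - data2[i]);
}
sw.Stop();
Console.WriteLine($"{sw.ElapsedTicks} ticks");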
Such basic memory operations translate directly into machine code, without LINQ's terrible performance and significant memory overhead.
Moral of the story: LINQ is for readability (debatable in this case), not for performance (very noticeable in this case).
Optimization Time! Let's play a little with our processor.
- Unroll the loop. Or not. Your mileage may vary: even in assembler, unrolling can noticeably increase or decrease performance within the same processor family, and newer processors and compilers know the old tricks and simply apply them on their own. On an i3-3220 I tested the code with the loop unrolled by 4, which made the 32-bit build faster but the 64-bit build slightly slower; unrolling by 8 gave the opposite result.
- Compile for x64. Since we are working on 32-bit data, we won't use 64-bit registers... or will we? On x86, less than half of the registers are actually available to the generated code (you can always squeeze out more in hand-written assembly); on x64, however, you get eight bonus registers that can be used freely. The more you can do without touching memory, the faster your code. In this case the gain is around 20%.
- Close Visual Studio. Do not benchmark 64-bit code from inside the 32-bit IDE (there is currently no 64-bit version, and probably won't be for a long time). It will make the x64 code about twice as slow due to the architecture mismatch. (Well... you should never benchmark code under the debugger anyway...)
- Don't trust the built-in functions either. In this case, Math.Abs() has overhead hidden inside. For some reason (verifying it would require IL analysis), checking for negative values was faster with ?: than with if-else. That check saves a lot of time; a sketch follows below.
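A minimal sketch of that check, assuming the same three arrays as above:

for (int i = 0; i < data3.Length; i++)
{
    int b = data1[i] - data2[i];
    data3[i] = b < 0 ? -b : b; // branch on the sign instead of calling Math.Abs()
}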
UPDATE: ?: is faster than if-else because of differences in the resulting machine code... at least when comparing two values. Its machine code is far less convoluted than if-else's (which does not look like what you would write "by hand"). Apparently it is not just another way of spelling an if-else statement, but a separate construct optimized for simple conditional assignment.
The resulting code was about 8 times faster than the simple loop with Math.Abs(). Remember that you can only unroll by divisors of your data set size. You wrote that your data set size is 25920, so unrolling by 8 is fine (the maximum divisor is 64, but I doubt going that high would make any sense). I suggest hiding this code in some function, since it is fugly.
int[] data3 = new int[data1.Length];
for (int i = 0; i < data1.Length; i += 8)
{
    int b;
    b = data1[i + 0] - data2[i + 0]; data3[i + 0] = b < 0 ? -b : b;
    b = data1[i + 1] - data2[i + 1]; data3[i + 1] = b < 0 ? -b : b;
    b = data1[i + 2] - data2[i + 2]; data3[i + 2] = b < 0 ? -b : b;
    b = data1[i + 3] - data2[i + 3]; data3[i + 3] = b < 0 ? -b : b;
    b = data1[i + 4] - data2[i + 4]; data3[i + 4] = b < 0 ? -b : b;
    b = data1[i + 5] - data2[i + 5]; data3[i + 5] = b < 0 ? -b : b;
    b = data1[i + 6] - data2[i + 6]; data3[i + 6] = b < 0 ? -b : b;
    b = data1[i + 7] - data2[i + 7]; data3[i + 7] = b < 0 ? -b : b;
}
This is not even its final form; I am going to try a few heretical tricks.
BitHacks, low-level cheating!
As I mentioned, there is still room for improvement.
After cutting out LINQ, the main tick-muncher was Abs(). Once that was removed, we were left with a contest between if-else and the shorthand ?: operator. Both are branching constructs, which were once widely considered slower than linear code. Nowadays, ease of use/writing is usually traded against performance (sometimes rightly, sometimes wrongly).
So let's make our branching condition linear, abusing the fact that the branch in this code contains math operating on only one variable. Let's build equivalent code without the branch.
Now, do you remember how to negate a two's complement number? Flip all the bits and add one. Let's do that in one line, without any conditions!
Bitwise operators, shine! OR and AND are boring; real men use XOR. What's so cool about XOR? Besides its usual behavior, you can turn it into a NOT (negation) or a NOP (no operation).
1 XOR 1 = 0
0 XOR 1 = 1
so XOR'ing with a value of all 1s gives you a NOT operation.
1 XOR 0 = 1
0 XOR 0 = 0
so XOR'ing with a value of all 0s does nothing at all.
We can extract the sign from our numbers. For a 32-bit integer it is as simple as x >> 31, which moves the sign bit into the lowest position. As even the wiki will tell you, the bits inserted on the left are zeros, so the result of x >> 31 will be 1 for a negative number (x < 0) and 0 for a non-negative one (x >= 0), right?
Nope. For signed values, an arithmetic shift is used instead of a plain logical shift, so we get -1 or 0 depending on the sign... meaning x >> 31 yields 111...111 for a negative value and 000...000 for a non-negative one. If you then XOR the original x with the result of that shift, you perform a NOT or a NOP depending on the sign. Another useful property: subtracting 0 is a NOP, so we can subtract -1 or 0 (i.e. add 1 or nothing) depending on the sign of the value.
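A quick demonstration of the arithmetic shift, just printing the cases described above:

int neg = -5, pos = 5;
Console.WriteLine(neg >> 31);        // -1: arithmetic shift replicates the sign bit (all ones)
Console.WriteLine(pos >> 31);        // 0: all zeros
Console.WriteLine((uint)neg >> 31);  // 1: unsigned values get a logical shift, zeros come in from the left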
So, x ^ (x >> 31) flips the bits of a negative number while leaving a non-negative one unchanged, and x - (x >> 31) adds 1 to a negative number (subtracting -1 is adding 1) while leaving a non-negative one unchanged.
Combining the two, you get (x ^ (x >> 31)) - (x >> 31)..., which can be read as:
IF X < 0 THEN X = ~X + 1
and it's just
IF X < 0 THEN X = -X
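A minimal sketch of the whole trick wrapped in a helper; the XorAbs name matches the next paragraph, while making it a static method is my own choice:

static int XorAbs(int x)
{
    int mask = x >> 31;       // 0 for non-negative x, -1 (all ones) for negative x
    return (x ^ mask) - mask; // NOT then +1 for negatives; a double NOP for non-negatives
}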
How does this affect performance? Our XorAbs() needs only four basic integer operations plus one load and one store. A branch by itself takes about the same number of CPU ticks, and although modern processors are excellent at branch prediction, they are still simply faster when fed straight-line code.
And what is the score?
- About four times faster than the built-in Abs ();
- About twice as fast as the previous code (the non-unrolled versions);
- Depending on the processor, it may give the best result without any manual unrolling: with the branch eliminated, the processor can "unroll" the loop on its own. (Haswells are weird about unrolling.)
Final code:
for (int i = 0; i < data1.Length; i++)
{
    int x = data1[i] - data2[i];
    data3[i] = (x ^ (x >> 31)) - (x >> 31);
}
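If you want to sanity-check the trick against the built-in, a small sketch; note that int.MinValue has no positive counterpart, so the trick returns int.MinValue there while Math.Abs() throws OverflowException:

foreach (int x in new[] { -5, 0, 7, int.MaxValue, int.MinValue + 1 })
{
    int abs = (x ^ (x >> 31)) - (x >> 31);    // the bithack
    Console.WriteLine($"{x} -> {abs}, Math.Abs: {Math.Abs(x)}");
}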
Parallelism and use of cache
The processor has super-fast cache memory. When an array is processed sequentially, whole chunks of it are copied into the cache. But write sloppy code and you get cache misses instead. You can fall into this trap as easily as by swapping the order of nested loops, as the sketch below shows.
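An illustration of that trap with a hypothetical 2D array: the first loop walks memory sequentially, the second does the same math with the loops swapped, so every access lands a whole row away and thrashes the cache:

int[,] grid = new int[4096, 4096]; // hypothetical example data
long sum = 0;
// cache-friendly: the inner index walks consecutive memory
for (int row = 0; row < 4096; row++)
    for (int col = 0; col < 4096; col++)
        sum += grid[row, col];
// cache-hostile: loops swapped, each access jumps 16 KB ahead
for (int col = 0; col < 4096; col++)
    for (int row = 0; row < 4096; row++)
        sum += grid[row, col];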
Parallelism (multiple threads over the same data) must work on sequential chunks to make effective use of the processor cache.
Writing threads by hand would let you assign chunks to threads yourself, but that is an intrusive approach. Since 4.0, .NET ships with helpers for this; however, by default Parallel.For creates a cache mess, so the following code is actually slower than its single-threaded version due to cache misses:
Parallel.For(0, data1.Length, fn =>
{
    int x = data1[fn] - data2[fn];
    data3[fn] = (x ^ (x >> 31)) - (x >> 31);
});
You can make use of the cached data manually by performing sequential operations on it. For example, you can unroll the loop, but that is a dirty hack, and unrolling has performance problems of its own (depending on the CPU model):
Parallel.For(0, data1.Length >> 3, i =>
{
    int j = i << 3; // each iteration handles a sequential block of 8 elements
    int b;
    b = data1[j + 0] - data2[j + 0]; data3[j + 0] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 1] - data2[j + 1]; data3[j + 1] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 2] - data2[j + 2]; data3[j + 2] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 3] - data2[j + 3]; data3[j + 3] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 4] - data2[j + 4]; data3[j + 4] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 5] - data2[j + 5]; data3[j + 5] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 6] - data2[j + 6]; data3[j + 6] = b < 0 ? (b ^ -1) + 1 : b;
    b = data1[j + 7] - data2[j + 7]; data3[j + 7] = b < 0 ? (b ^ -1) + 1 : b;
});
However, .NET also has Parallel.ForEach and a load-balancing Partitioner. Using the two together, you get the best of all worlds:
- code independent of the data set size
- short, neat code
- multithreading
- good use of cache
Thus, the final code:
// requires: using System.Collections.Concurrent; using System.Threading.Tasks;
var rangePartitioner = Partitioner.Create(0, data1.Length);
Parallel.ForEach(rangePartitioner, (range, loopState) =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        int x = data1[i] - data2[i];
        data3[i] = (x ^ (x >> 31)) - (x >> 31);
    }
});
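If you want to control the chunk size yourself, Partitioner.Create also has an overload that takes an explicit range size; the 4096 here is only an example value to tune per machine:

var sized = Partitioner.Create(0, data1.Length, 4096); // example chunk size
Parallel.ForEach(sized, (range, loopState) =>
{
    for (int i = range.Item1; i < range.Item2; i++)
    {
        int x = data1[i] - data2[i];
        data3[i] = (x ^ (x >> 31)) - (x >> 31);
    }
});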
This is still far from maximal CPU usage (which involves more than just maxing out the clock; there are multiple cache levels, multiple pipelines and much more), but it is readable, fast and platform-independent (except for the integer size, but C# int is an alias for System.Int32, so we are safe here).
I think we'll stop the optimization here. This came out as an article rather than an answer; I hope nobody will flame me for that.