Why is vectorization faster than loops?

Why is vectorization, as a rule, so dramatically faster than a loop, at the level of low-level hardware operations and basic arithmetic (i.e., things common to all implementations of programming languages)?

What does the computer actually do when looping that it does not do when using vectorization (I'm asking about the actual computations the machine performs, not what the programmer writes), or what does it do differently?

I could not convince myself why the difference should be so significant. I could probably convince myself that vectorized code sheds some loop overhead somewhere, but the computer still has to perform the same number of operations, doesn't it? For example, if we multiply a vector of size N by a scalar, we will have N multiplications to perform either way, right?

+7
performance language-agnostic vectorization low-level
3 answers

Vectorization (as the term is commonly used) refers to SIMD (single instruction, multiple data) operation.

This means, essentially, that one instruction carries out the same operation on a number of operands in parallel. For example, to multiply a vector of size N by a scalar, let M be the number of operands the hardware can operate on simultaneously. The number of instructions it then needs to execute is approximately N/M, whereas with purely scalar operations it would have to perform N operations.
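To make that arithmetic concrete, here is a minimal sketch (the function name and the N = 1000, M = 8 figures are illustrative, not taken from the answer):

```python
import math

def vector_instruction_count(n, m):
    """Approximate instruction count to process n elements when each
    instruction operates on m operands at once (scalar code is m = 1)."""
    return math.ceil(n / m)

scalar_ops = vector_instruction_count(1000, 1)  # 1000 scalar operations
vector_ops = vector_instruction_count(1000, 8)  # 125 vector instructions
```

Note the ceiling: when N is not a multiple of M, the last partially filled vector still costs a full instruction.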

For example, Intel's current AVX 2 instruction set uses 256-bit registers. Each can hold (and operate on) a set of 4 operands of 64 bits apiece, or 8 operands of 32 bits apiece.

So, assuming you are dealing with 32-bit single-precision floating-point numbers, one instruction can perform 8 operations (multiplications, in your case) at once, so (at least in theory) you can finish N multiplications using only N/8 multiplication instructions. Again at least in theory, that should let the operation finish about 8 times faster than executing one instruction at a time would allow.
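As a rough illustration in Python with NumPy (NumPy's elementwise multiply runs in compiled code that can be implemented with SIMD instructions; whether AVX2 lanes are actually used depends on the build and the CPU, so treat this as a sketch of the idea, not a guarantee):

```python
import numpy as np

n = 100_000
v = np.arange(n, dtype=np.float32)  # 32-bit single-precision operands
s = np.float32(2.5)

# Vectorized: one library call; the multiply loop runs in compiled
# code that the hardware can execute several float32 lanes at a time.
fast = v * s

# Scalar-style: one multiplication per loop iteration.
slow = np.empty_like(v)
for i in range(n):
    slow[i] = v[i] * s

assert np.array_equal(fast, slow)  # same N products either way
```

The point of the final assertion mirrors the question: the same N multiplications happen in both versions; the vector path just needs far fewer instructions (and far less interpreter overhead) to issue them.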

Of course, the exact benefit depends on how many operands each instruction supports. Intel's first attempts supported only 64-bit registers, so to operate on 8 items at once, those items could only be 8 bits apiece. They currently support 256-bit registers, and they have announced support for 512-bit (and may even have shipped it in a few high-end processors, but not in ordinary consumer processors, at least not yet). Putting this capability to good use can also be nontrivial, to put it mildly. Scheduling instructions so that you actually have N operands available and in the right places at the right times is not necessarily an easy task (at all).

To put things in perspective, the (now ancient) Cray 1 gained a lot of its speed exactly this way. Its vector unit operated on sets of 64 registers of 64 bits apiece, so it could do 64 double-precision operations per clock cycle. On optimally vectorized code, it was much closer to the speed of a current CPU than you might expect based solely on its (much lower) clock speed. Taking full advantage of it was not always easy, though (and still is not).

Keep in mind, however, that vectorization is not the only way a CPU can carry out operations in parallel. There is also instruction-level parallelism, which lets a single CPU (or a single core of a CPU) execute more than one instruction at a time. Most modern CPUs include hardware to (theoretically) execute up to 4 instructions per clock cycle, if the instructions are a mix of loads, stores, and ALU operations. They can fairly routinely execute close to 2 instructions per clock on average, or more in well-tuned loops when memory is not a bottleneck.

Then, of course, there is multithreading: running multiple streams of instructions on (at least logically) separate processors/cores.

So a modern CPU might have, say, 4 cores, each of which can execute 2 vector multiplications per clock, with each of those instructions operating on 8 operands. At least in theory, then, it can perform 4 * 2 * 8 = 64 operations per clock cycle.

Some instructions get better or worse throughput than others. For example, FP add throughput is lower than FMA or multiply on Intel before Skylake (1 vector per clock instead of 2). But boolean logic like AND or XOR has a throughput of 3 vectors per clock; it does not take many transistors to build an AND/XOR/OR execution unit, so CPUs replicate them. With high-throughput instructions, bottlenecks on the total pipeline width (the front-end that decodes and issues into the out-of-order part of the core) are common, rather than bottlenecks on a particular execution unit.

+12

Vectorization is a type of parallel processing. It lets more of the computer's hardware be devoted to performing the computation, so the computation finishes faster.

Many numerical problems, especially the solution of partial differential equations, require the same calculation to be performed for a large number of cells, elements, or nodes. Vectorization performs the calculation for many cells/elements/nodes in parallel.

Vectorization uses special hardware. Unlike a multi-core CPU, where each parallel processing unit is a fully functional CPU core, vector processing units can perform only simple operations, and all the units perform the same operation at the same time, working on a sequence of data values (a vector).
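Since the answer mentions PDE solvers, here is a hedged sketch of that pattern, using NumPy's array operations as a stand-in for vector hardware: one explicit step of a 1-D heat-equation update applies the identical formula to every interior node at once (the coefficient r and the grid below are made up for illustration):

```python
import numpy as np

def heat_step_vectorized(u, r):
    """One explicit time step: u[i] += r * (u[i-1] - 2*u[i] + u[i+1]).
    The identical update is applied to all interior nodes at once."""
    out = u.copy()
    out[1:-1] = u[1:-1] + r * (u[:-2] - 2 * u[1:-1] + u[2:])
    return out

def heat_step_looped(u, r):
    """Same update, one node per loop iteration."""
    out = u.copy()
    for i in range(1, len(u) - 1):
        out[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    return out
```

Because every node gets the same simple arithmetic, this is exactly the workload a vector unit (which can only do one simple operation, but on many values at once) handles well.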

+1

Vectorization has two main advantages.

  • The main one is that hardware designed to support vector instructions usually contains hardware that can perform multiple ALU operations in parallel when vector instructions are used. For example, if you ask it to do 16 additions with a 16-element vector instruction, it may have 16 parallel adders that can do all the additions at once. The only way to access all of those adders1 is through vectorization. With scalar instructions, you get just the 1 lone adder.

  • Some overhead is usually saved by using vector instructions. You load and store data in large chunks (up to 512 bits at a time on some recent Intel CPUs), and each loop iteration does more work, so the loop overhead is generally lower in a relative sense2; and you need fewer instructions to do the same job, so the CPU front-end overhead is lower, etc.
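The overhead point can be sketched in pure Python: a loop that handles m elements per iteration pays the per-iteration bookkeeping (index update, bound check, branch) only once per m elements. The chunk size of 8 here is illustrative:

```python
def sum_scalar(xs):
    # One loop iteration (and one round of loop overhead) per element.
    total = 0
    for x in xs:
        total += x
    return total

def sum_chunked(xs, m=8):
    # One loop iteration per m elements; each body stands in for a
    # single vector load + add, so the loop overhead is divided by m.
    total = 0
    n = len(xs) - len(xs) % m
    for i in range(0, n, m):
        total += sum(xs[i:i + m])
    for x in xs[n:]:  # scalar cleanup for the leftover elements
        total += x
    return total
```

Both functions perform the same additions; the chunked version simply runs the loop machinery an eighth as often, which is the relative saving the bullet describes.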

Finally, your dichotomy between loops and vectorization is odd. When you take non-vector code and vectorize it, you will generally still end up with a loop if there was one before, and without one if there was not. The real comparison is between scalar (non-vector) instructions and vector instructions.


1 Or at least 15 of the 16; perhaps one is also used to perform scalar operations.

2 You can probably get a similar overhead benefit in the scalar case by using a large amount of loop unrolling.

0
