Practical use of automatic vectorization?

Has anyone used the automatic vectorization that GCC can do? In real-world code (as opposed to sample code)? Does it require restructuring existing code? Are there a significant number of cases in production code that can be vectorized this way?

+4
5 answers

I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given code for algorithms that can be (and, after I rewrote them by hand using SSE intrinsics, were) vectorized.

Part of this is conservatism: especially where pointer aliasing is concerned, it can be very difficult for a C/C++ compiler to "prove" that a vectorization is safe, even when you as the programmer know that it is. Most compilers (reasonably) prefer to leave code unoptimized rather than risk miscompiling it. This is one area where higher-level languages have a real advantage over C, at least in theory (I say in theory because I don't actually know of any auto-vectorizing ML or Haskell compilers).
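As a sketch of the aliasing point above (the function names are my own, not from the answer): without `restrict`, the compiler has to assume the two pointers might overlap, which is often enough to make it give up on vectorizing.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src might alias,
   so it cannot safely vectorize the loop. */
void scale_may_alias(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

/* The restrict qualifiers (C99) promise the arrays do not overlap,
   which is often enough for GCC at -O3 (or with -ftree-vectorize)
   to emit SIMD code for this loop. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}
```

Compiling with `-fopt-info-vec` will report which of the two loops GCC managed to vectorize.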

The other part of this is simply analytical limits: most of the research on vectorization, as I understand it, is tied to optimizing classical numerical problems (fluid dynamics, for example), which were the bread and butter of the few vector machines that existed until a few years ago (when, between CUDA/OpenCL, AltiVec/SSE, and the STI Cell, vector programming in its various forms became widely available in commercial systems).

It is rather unlikely that code written with a scalar processor in mind will be easy for the compiler to vectorize. Fortunately, many of the things you can do to make it easier for the compiler to figure out how to vectorize, such as loop tiling and partial loop unrolling, also (typically) help performance on modern processors even if the compiler doesn't work out how to vectorize.
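A minimal sketch of the partial-unrolling point (my own example, not from the answer): a reduction carries a dependency on a single accumulator, which limits both vectorization and instruction-level parallelism; splitting it across independent accumulators helps either way.

```c
#include <stddef.h>

/* Naive reduction: every iteration depends on the previous value of
   'sum', which constrains both the vectorizer and the CPU pipeline. */
float dot_naive(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Partially unrolled with four independent accumulators: easier for
   the compiler to map onto SIMD lanes, and usually faster on a
   modern superscalar CPU even when it doesn't vectorize. */
float dot_unrolled(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)  /* scalar tail for leftover elements */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of the floating-point additions, which strict IEEE semantics forbid; that is exactly why the compiler won't do it on its own without `-ffast-math` or similar.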

+5

It is hard to make use of in any business logic, but it does give speedups when you process large volumes of data in a uniform way.

A good example is audio/video processing, where you apply the same operation to every sample/pixel. I have used VisualDSP for this, and you had to inspect the results after compiling to see whether vectorization was actually applied where it should have been.
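A hypothetical sketch of that per-sample pattern (my own example; the fixed-point format and names are assumptions, not from the answer): one identical, independent operation per element is exactly the shape an auto-vectorizer or DSP compiler can turn into SIMD.

```c
#include <stdint.h>
#include <stddef.h>

/* Apply a fixed-point gain (Q8.8, so 256 == 1.0) to every 16-bit
   audio sample, saturating to the int16_t range. The same operation
   on each sample, with no dependencies between iterations, makes the
   loop a natural candidate for vectorization. */
void apply_gain(int16_t *samples, size_t n, int32_t gain_q8) {
    for (size_t i = 0; i < n; ++i) {
        int32_t v = (samples[i] * gain_q8) / 256;  /* undo Q8.8 scale */
        if (v > INT16_MAX) v = INT16_MAX;          /* clamp high */
        if (v < INT16_MIN) v = INT16_MIN;          /* clamp low */
        samples[i] = (int16_t)v;
    }
}
```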

+1

Vectorization will primarily be useful for numerical programs. Vectorized programs can run faster on vector processors such as the STI Cell processor used in the PS3 game console, where the numerical computations used, for example, to render game graphics can be sped up by vectorization. Such processors are called SIMD (Single Instruction, Multiple Data) processors.

On other hardware, vectorization brings no benefit: vectorized programs rely on a vector instruction set, which a processor without SIMD support does not provide.

Intel's Nehalem processors (released in late 2008) implement the SSE 4.2 instructions, which are SIMD instructions. Source: Wikipedia.

0

Vector instructions are not limited to Cell processors; most modern workstation CPUs have them too (PPC, x86 starting with the Pentium III, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with computation-heavy tasks (filters, etc.). In my experience, though, automatic vectorization does not work very well.
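As a concrete instance of the "filters" case mentioned above (a minimal sketch, my own example): an FIR filter computes a dot product of a fixed kernel against a sliding window of the input, which is the kind of dense floating-point inner loop where SIMD pays off, whether via the auto-vectorizer or intrinsics.

```c
#include <stddef.h>

/* Direct-form FIR filter: each output sample is the dot product of
   the tap kernel with a window of the input. Only the first
   n - ntaps + 1 outputs are produced, so no padding is needed. */
void fir(float *out, const float *in, size_t n,
         const float *taps, size_t ntaps) {
    for (size_t i = 0; i + ntaps <= n; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < ntaps; ++j)
            acc += in[i + j] * taps[j];
        out[i] = acc;
    }
}
```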

0

You may have noticed that virtually nobody actually knows how to use GCC's auto-vectorizer effectively. If you browse the web for user comments, the conclusion is always the same: GCC lets you enable automatic vectorization, but it rarely kicks in, so if you want SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), you basically have to figure out how to write it yourself using compiler intrinsics or assembly language.

But the problem with intrinsics is that you effectively need to understand the assembly language anyway, and then also learn the intrinsics' way of describing what you want, which can result in significantly less efficient code than if you had written it in assembly (by a factor of 10, say), because the compiler will still have trouble making good use of your intrinsic instructions!

For example, you can use SIMD intrinsics so that several operations run in parallel, but your compiler may generate assembly that shuffles the data back and forth between the SIMD registers and the normal CPU registers, effectively making your SIMD code run at the same speed as (or even slower than) regular code!
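For reference, here is what a simple intrinsics loop looks like (a minimal x86 SSE sketch, my own example; it uses the unaligned load/store variants so the caller need not guarantee 16-byte alignment, and assumes n is a multiple of 4 to keep it short):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Add two float arrays four lanes at a time.
   Assumes n is a multiple of 4. */
void add_sse(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); /* 4 adds at once */
    }
}
```

Whether this beats the scalar loop depends on exactly the register-shuffling issue described above; checking the generated assembly (`gcc -O2 -S`) is the only way to know.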

So basically:

  • If you want up to a 100% speedup (2x speed), either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
  • If you want a 1000% speedup (10x speed), write it in assembly using the SIMD instructions by hand. Or, if it is available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since these can give similar speedups on the GPU to what SIMD gives on the CPU.
0
