How is Win32 Bitmap rendering faster than pixels?

Question


Drawing a Win32 bitmap is (much) faster than calling SetPixelV or a similar function for every pixel. How does that work, if in the end the computer still has to draw the bitmap's pixels?

+6

c++ winapi msdn

user5818733 Jan 28 '16 at 10:37


2 answers

Repeated calls to a function such as SetPixelV are slow because each call must translate the coordinates into a memory offset, and may also have to perform some color translation on the fly.

A simple “set pixel” function might look like this (without bounds checking, color translation, or anything fancy):

    // Translate (x, y) into a byte offset, then copy one pixel's worth of bytes.
    size_t offset = y * bytes_per_scanline + x * bytes_per_pixel;
    for (size_t i = offset; i < offset + bytes_per_pixel; i++)
        target[i] = source[i];

Bitmaps, on the other hand, are usually drawn through a process known as blitting, which is essentially a direct copy from one memory location to another. To do this on Windows, you create a device context for your bitmap that is compatible with the target device context. This ensures that the memory can be copied without translation. It can also allow hardware-accelerated copies, which are faster still.

A simple blit “copy” might look like this:

    // Copy the whole image as one contiguous run of bytes.
    size_t nbytes = bytes_per_scanline * height;
    for (size_t i = 0; i < nbytes; i++)
        target[i] = source[i];

This requires no coordinate lookups and is very cache-friendly, because memory is accessed sequentially. There are much faster ways to copy blocks of memory; the example above is only an illustration.
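To make the comparison concrete, here is a minimal, portable sketch of both approaches on an in-memory 32-bit image. It uses plain C++ and no Win32 calls. The `Image` struct and the functions `set_pixel_checked`, `copy_per_pixel` and `blit_copy` are hypothetical names invented for this illustration. Real GDI does considerably more work per call, so treat this as a model of the idea rather than of the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical in-memory image: 4 bytes per pixel, rows stored back to back.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> bytes;

    Image(std::size_t w, std::size_t h) : width(w), height(h), bytes(w * h * 4) {}
    std::size_t bytes_per_scanline() const { return width * 4; }
};

// Per-pixel path: a bounds check and an offset computation on every call.
inline bool set_pixel_checked(Image& img, std::size_t x, std::size_t y,
                              std::uint32_t color) {
    if (x >= img.width || y >= img.height) return false;
    const std::size_t offset = y * img.bytes_per_scanline() + x * 4;
    std::memcpy(&img.bytes[offset], &color, 4);
    return true;
}

// Copies an image one pixel at a time through set_pixel_checked.
inline void copy_per_pixel(const Image& src, Image& dst) {
    assert(src.width == dst.width && src.height == dst.height);
    for (std::size_t y = 0; y < src.height; ++y) {
        for (std::size_t x = 0; x < src.width; ++x) {
            std::uint32_t color;
            std::memcpy(&color, &src.bytes[(y * src.width + x) * 4], 4);
            set_pixel_checked(dst, x, y, color);
        }
    }
}

// Blit path: both images are assumed to share one format, so the checks
// happen once and the pixels move as a single block copy.
inline void blit_copy(const Image& src, Image& dst) {
    assert(src.width == dst.width && src.height == dst.height);
    if (!src.bytes.empty())
        std::memcpy(dst.bytes.data(), src.bytes.data(), src.bytes.size());
}
```

Both functions produce the same destination bytes. The difference is that `copy_per_pixel` repeats the check and the offset arithmetic `width * height` times, while `blit_copy` does them once.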

+2

paddy Jan 28 '16 at 22:49


Yakk · Accepted Answer · 2016-01-28T23:05:26+0000

Suppose you have a pixel. This pixel has color components A, B and C. The surface you are painting on has color components X, Y and Z.

So, first you need to check if they match. If they do not match, costs increase. Suppose they match.

Then you need to do a bounds check: did whoever called you ask for something stupid? That costs some comparisons, additions and multiplications.

Then you need to find where the pixel is. That costs some more multiplications and additions.

Now you need to access the source data and the destination data, and write the pixel.

If you work with a whole scan line at a time instead, almost all of this overhead can be paid once per line. You can calculate which part of the scan line falls within the bounds with only a little more overhead than for one pixel. You can find where the scan line is written in the destination, again for only a little more overhead than for one pixel. You can check for color space conversion with the same overhead as for a single pixel.
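The per-scan-line idea can be sketched as follows. This is a minimal illustration assuming 32-bit pixels in identical source and destination formats; the function name `blit_scanline_clipped` and its parameters are hypothetical and not part of any Windows API. The row check, the horizontal clipping and the offset computation each happen once, and the surviving run of pixels is copied as a block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `count` source pixels into row `y` of a dst_width x dst_height
// destination, starting at column `x` (which may be negative or run past
// the right edge). Returns the number of pixels actually written.
inline std::size_t blit_scanline_clipped(const std::uint32_t* src,
                                         std::ptrdiff_t count,
                                         std::uint32_t* dst,
                                         std::ptrdiff_t dst_width,
                                         std::ptrdiff_t dst_height,
                                         std::ptrdiff_t x,
                                         std::ptrdiff_t y) {
    assert(count >= 0);

    // Row check: once for the whole scan line.
    if (y < 0 || y >= dst_height) return 0;

    // Clip the run [x, x + count) against [0, dst_width): once per line.
    const std::ptrdiff_t first = std::max<std::ptrdiff_t>(x, 0);
    const std::ptrdiff_t last = std::min<std::ptrdiff_t>(x + count, dst_width);
    if (first >= last) return 0;

    // Offset computation: once per line, then a single block copy.
    const std::size_t n = static_cast<std::size_t>(last - first);
    std::memcpy(dst + y * dst_width + first, src + (first - x),
                n * sizeof(std::uint32_t));
    return n;
}
```

For example, blitting six pixels at `x = -2` into a destination four pixels wide skips the first two source pixels and copies the remaining four in one `memcpy`.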

The big difference is that instead of copying a single pixel, you copy a whole block.

As it happens, computers are really good at copying blocks of things. Some processors have built-in instructions for it, and some memory systems can do it without involving the processor at all (the processor says “copy X to Y” and is then free to do other work, and memory-to-memory bandwidth can be higher than memory-to-CPU-to-memory bandwidth). Even if you do round-trip the data through the CPU, there are SIMD instructions that let you work on 2, 4, 8, 16 or even more units of data at once, provided you treat them all the same way using a limited set of instructions.

In some cases you can even offload the work to the GPU: if both the source and the destination scan line are located on the GPU, you can say "yo GPU, you handle it", and the GPU is even more specialized for this kind of task.

The very first optimization (checking once per scan line instead of once per pixel) can easily give you a speedup of 2x to ~10x. The second (more efficient blitting) can be another 4x to ~20x faster. Running everything on the GPU can be 2x to 100x faster.

The last cost is the overhead of actually calling the function. This is usually negligible, but when you call SetPixel 1 million times (a 1000 x 1000 image, or a modest-sized screen) it adds up.

An HD display has about 2 million pixels, and at 60 refreshes per second that is 120 million pixels to process every second. A single-threaded program on a 3 GHz machine has room for only ~25 instructions per pixel if it wants to keep up with the screen, and that assumes nothing else is going on (which is unlikely). On a 4k monitor, you are down to ~6 instructions per pixel.
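The arithmetic behind those budgets can be checked with a few lines of code. This sketch assumes, as the estimate above implicitly does, roughly one instruction per clock cycle; the helper name `instructions_per_pixel` is hypothetical. The result is rounded down.

```cpp
#include <cassert>
#include <cstdint>

// Instruction budget per pixel, assuming about one instruction per cycle.
constexpr std::uint64_t instructions_per_pixel(std::uint64_t clock_hz,
                                               std::uint64_t pixels_per_frame,
                                               std::uint64_t frames_per_second) {
    return clock_hz / (pixels_per_frame * frames_per_second);
}

// "About 2 million pixels" at 60 Hz on a 3 GHz core.
static_assert(instructions_per_pixel(3000000000ULL, 2000000ULL, 60ULL) == 25,
              "rounded HD budget");

// Exact 1920 x 1080 is slightly more pixels, so the budget is slightly lower.
static_assert(instructions_per_pixel(3000000000ULL, 1920ULL * 1080ULL, 60ULL) == 24,
              "exact HD budget");

// 3840 x 2160 (4k) at 60 Hz.
static_assert(instructions_per_pixel(3000000000ULL, 3840ULL * 2160ULL, 60ULL) == 6,
              "4k budget");
```

With budgets this small, a function call, a bounds check and an offset computation per pixel can consume most of the available instructions before any pixel data is moved.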

With that many pixels being pushed around, shaving off even a single instruction per pixel can make a big difference.

The multipliers above are pulled out of thin air. I have, however, converted several per-pixel operations into per-scan-line operations and seen impressive speedups, and I have seen similarly impressive speedups from offloading CPU work to the GPU and from using SIMD.
