Suppose you have a pixel. This pixel has color components A, B and C. The surface you are painting on has color components X, Y and Z.
So, first you need to check whether they match. If they do not match, the cost goes up. Suppose they match.
Then you need to do bounds checking: did the caller ask for something stupid? That is a few comparisons, additions and multiplications.
Then you need to find where the pixel is. That is a few more multiplications and additions.
Now you need to access the source data and the destination data and write the pixel.
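The per-pixel steps above can be sketched roughly as follows. This is a minimal Python illustration, not any real graphics API: the Surface class, its fmt field and the set_pixel function are hypothetical names, and a tightly packed layout with 3 bytes per pixel is assumed.

```python
class Surface:
    """Hypothetical drawing surface: tightly packed rows, 3 bytes per pixel."""

    def __init__(self, width, height, fmt="RGB"):
        self.width = width
        self.height = height
        self.fmt = fmt  # names the three color components
        self.data = bytearray(width * height * 3)


def set_pixel(surface, x, y, color, color_fmt="RGB"):
    """Write one pixel, paying for every check on every call."""
    # 1. Do the pixel's components match the surface's components?
    if color_fmt != surface.fmt:
        raise ValueError("format mismatch: a conversion would be needed here")
    # 2. Bounds check: did the caller ask for something stupid?
    if not (0 <= x < surface.width and 0 <= y < surface.height):
        return False
    # 3. Find where the pixel lives: a few multiplications and additions.
    offset = (y * surface.width + x) * 3
    # 4. Touch the destination memory and write the three components.
    surface.data[offset:offset + 3] = bytes(color)
    return True
```

Every one of the four numbered steps runs again for each pixel, even when consecutive calls target neighbouring pixels on the same row.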
If you work a scanline at a time, almost all of this overhead can be paid once. You can work out which part of the scanline falls inside the bounds for only slightly more overhead than for one pixel. You can find where the scanline is written in the destination, again for only slightly more overhead than for one pixel. You can check for color space conversions with the same overhead as for a single pixel.
The big difference is that instead of copying a single pixel, you copy a block.
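A scanline version of the same idea might look like the sketch below. It is an illustration under assumptions, not a real API: Surface and write_scanline are hypothetical names, and pixels are assumed to be packed at 3 bytes each. The format check, the clipping and the address calculation each happen once per call, and the actual write is a single slice assignment, which CPython carries out as a bulk copy rather than a Python-level loop.

```python
class Surface:
    """Hypothetical drawing surface: tightly packed rows, 3 bytes per pixel."""
    BPP = 3

    def __init__(self, width, height, fmt="RGB"):
        self.width, self.height, self.fmt = width, height, fmt
        self.data = bytearray(width * height * self.BPP)


def write_scanline(surface, x, y, pixels, pixels_fmt="RGB"):
    """Copy a run of packed pixels onto row y, starting at column x.

    Returns the number of pixels actually written after clipping.
    """
    bpp = surface.BPP
    # Format check: once per scanline, not once per pixel.
    if pixels_fmt != surface.fmt:
        raise ValueError("format mismatch: a conversion would be needed here")
    # Bounds check: clip the whole span against the surface once.
    if not 0 <= y < surface.height:
        return 0
    count = len(pixels) // bpp
    start = max(x, 0)
    end = min(x + count, surface.width)
    if start >= end:
        return 0
    # Address calculation: once for the whole span.
    dst = (y * surface.width + start) * bpp
    src = (start - x) * bpp
    nbytes = (end - start) * bpp
    # The per-pixel work collapses into one block copy.
    surface.data[dst:dst + nbytes] = pixels[src:src + nbytes]
    return end - start
```

With this shape, filling a 1000 x 1000 image takes 1,000 calls rather than 1,000,000, and each call does its checks only once.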
As it happens, computers are really good at copying blocks of things. Some processors have built-in instructions for it, and some memory systems can do it without involving the processor (the processor says “copy X to Y” and can then do other things, and memory-to-memory bandwidth can be higher than memory-to-CPU-to-memory). Even if you round-trip through the CPU, there are SIMD instructions that let you operate on 2, 4, 8, 16 or even more units of data at once, as long as you treat them all the same way and stick to a limited set of instructions.
In some cases, you can even offload the work to the GPU: if both the source and the destination scanline live on the GPU, you can say "yo GPU, you deal with it", and the GPU is even more specialized for this kind of task.
The very first bit of optimization, doing the checks once per scanline instead of once per pixel, can easily give you a 2x to 10x speedup. The second, more efficient block copying, can be another 4x to 20x faster. Running everything on the GPU can be 2x to 100x faster.
The last thing is the overhead of actually calling the function. This is usually negligible, but when you call SetPixel 1 million times (a 1000 x 1000 image, or a modestly sized screen) it adds up.
For an HD display with 2 million pixels refreshed 60 times per second, that is 120 million pixels manipulated per second. A single-threaded program on a 3 GHz machine only has room for ~25 instructions per pixel if it wants to keep up with the screen, and that assumes nothing else is going on (which is unlikely). On a 4K monitor, you are down to about 6 instructions per pixel.
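The budget arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes one instruction per clock cycle and the common 1920 x 1080 and 3840 x 2160 resolutions, neither of which is stated in the text above; instructions_per_pixel is a hypothetical helper.

```python
def instructions_per_pixel(width, height, fps=60, clock_hz=3_000_000_000):
    """Rough per-pixel budget, assuming one instruction per clock cycle."""
    pixels_per_second = width * height * fps
    return clock_hz / pixels_per_second


hd = instructions_per_pixel(1920, 1080)   # about 24
uhd = instructions_per_pixel(3840, 2160)  # about 6
print(f"HD: {hd:.1f} instructions per pixel")
print(f"4K: {uhd:.1f} instructions per pixel")
```

Rounding HD to an even 2 million pixels gives exactly 25, which matches the figure quoted above.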
With that many pixels being manipulated, every instruction you can shave off makes a big difference.
The multipliers above are pulled out of thin air. I have, however, written several per-pixel to per-scanline transformations that showed impressive speedups, likewise for moving work from the CPU to the GPU, and I have seen SIMD give impressive speedups.