Memory alignment on modern processors?

I often see code like the following when, for example, a large bitmap is represented in memory:

    size_t width = 1280;
    size_t height = 800;
    size_t bytesPerPixel = 3;
    size_t bytewidth = ((width * bytesPerPixel) + 3) & ~3; /* Aligned to 4 bytes */
    uint8_t *pixelData = malloc(bytewidth * height);

(that is, a bitmap allocated as a contiguous block of memory whose bytewidth is aligned to a certain number of bytes, most commonly 4.)

A point on the image is then given via:

    pixelData + (bytewidth * y) + (bytesPerPixel * x)

This leads me to two questions:

  • Does aligning a buffer like this have a performance impact on modern processors? Should I be worrying about alignment at all, or will the compiler handle this?
  • If it does have an impact, could someone point me to a resource for finding the ideal byte alignment for various processors?

Thanks.

+11
performance c memory-management alignment memory
Dec 06 '09 at 17:06
4 answers

It depends on many factors. If you only access the pixel data one byte at a time, alignment will not make any difference in the vast majority of cases. To read or write one byte of data, most processors do not care whether that byte is on a 4-byte boundary or not.

However, if you access data in units larger than a byte (say, in 2-byte or 4-byte units), you will definitely see alignment effects. On some processors (for example, many RISC processors) it is outright illegal to access unaligned data at certain levels: an attempt to read a 4-byte word from an address that is not 4-byte aligned will generate a Data Access Exception (or Data Storage Exception) on a PowerPC, for example.

On other processors (for example, x86), accessing unaligned addresses is permitted, but it often comes with a hidden performance penalty. Memory loads and stores are often implemented in microcode, and the microcode will detect the unaligned access. Normally, the microcode fetches the proper 4-byte quantity from memory, but if the address is not aligned, it has to fetch two 4-byte locations from memory and reconstruct the desired 4-byte quantity from the appropriate bytes of the two locations. Fetching two memory locations is obviously slower than fetching one.
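As a rough illustration of reading a 4-byte value from an address that may not be 4-byte aligned, here is a minimal sketch in C. The helper names `load_u32_unaligned` and `load_u32_le` are made up for this example. The first relies on `memcpy`, which is valid for any address and which compilers commonly lower to a plain load where the target permits it; the second assembles the value byte by byte, assuming little-endian byte order in the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit value in host byte order from an
   address that may not be 4-byte aligned. Copying through memcpy avoids
   dereferencing a misaligned uint32_t pointer. */
static uint32_t load_u32_unaligned(const uint8_t *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: assemble the same four bytes explicitly as a
   little-endian value, using only single-byte reads. */
static uint32_t load_u32_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Whether either form is slower than an aligned load depends on the processor and compiler, so it is worth measuring rather than assuming.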

That is just for simple loads and stores. Some instructions, such as those in the MMX or SSE instruction sets, require their memory operands to be properly aligned. If you try to access unaligned memory using those special instructions, you will see something like an illegal instruction exception.

To summarize, I would not worry too much about alignment unless you are writing super performance-critical code (for example, in assembly). The compiler helps you out a lot, e.g. by padding structures so that 4-byte quantities are aligned on 4-byte boundaries, and on x86 the CPU also helps you out when dealing with unaligned accesses. Since the pixel data you are dealing with comes in 3-byte quantities, you will almost always be doing single-byte accesses anyway.

If you decide that you instead want to access pixels in single 4-byte accesses (as opposed to three 1-byte accesses), it would be better to use 32-bit pixels and align each individual pixel on a 4-byte boundary. Aligning each row to a 4-byte boundary, but not each pixel, will have little if any effect.

Based on your code, I am guessing it is related to reading the Windows bitmap file format. Bitmap files require the length of each scan line to be a multiple of 4 bytes, so setting up your pixel data buffers with that property means you can read the entire bitmap into your buffer in one fell swoop (of course, you still have to deal with the fact that the scan lines are stored bottom-to-top instead of top-to-bottom, and that the pixel data is BGR instead of RGB). This is not really much of an advantage, though: it is not much harder to read the bitmap one scan line at a time.
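To make the point about structure padding concrete, here is a small sketch in C. The struct `tagged_value` and the helper `padding_before_value` are invented for illustration; the exact amount of padding is implementation-defined, and the code only assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a 1-byte tag followed by a 4-byte value. */
struct tagged_value {
    uint8_t  tag;
    uint32_t value;   /* the compiler may insert padding before this member */
};

/* The compiler places `value` at an offset that satisfies its alignment. */
static_assert(offsetof(struct tagged_value, value) % _Alignof(uint32_t) == 0,
              "value must sit on its natural alignment boundary");

/* Number of padding bytes inserted between `tag` and `value`
   (commonly 3, but this is up to the implementation). */
static size_t padding_before_value(void)
{
    return offsetof(struct tagged_value, value) - sizeof(uint8_t);
}
```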
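A minimal sketch of the 32-bit-pixel idea, assuming a `0x00RRGGBB` layout in which the top byte is unused padding. The function names `pack_xrgb` and `expand_rgb_row` are hypothetical, and the channel order is a choice made for this example rather than anything prescribed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack three 8-bit channels into one 32-bit pixel (0x00RRGGBB).
   The top byte is padding that keeps every pixel 4-byte sized. */
static uint32_t pack_xrgb(uint8_t r, uint8_t g, uint8_t b)
{
    return ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}

/* Expand one tightly packed RGB row (3 bytes per pixel) into 32-bit
   pixels, so each pixel can later be read or written in one access. */
static void expand_rgb_row(const uint8_t *rgb, uint32_t *out, size_t width)
{
    assert(rgb != NULL && out != NULL);
    for (size_t x = 0; x < width; ++x)
        out[x] = pack_xrgb(rgb[3 * x], rgb[3 * x + 1], rgb[3 * x + 2]);
}
```

The trade-off is memory: the buffer grows by a third in exchange for every pixel being naturally aligned.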
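The scan-line handling described above could look roughly like this in C. It is a sketch that assumes 24-bit pixel data already loaded into memory, stored bottom-to-top in BGR order with each row padded to a multiple of 4 bytes; the names `bmp_stride` and `bmp_rows_to_rgb` are made up, and file header parsing is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Row length of 24-bit pixel data, rounded up to a multiple of 4 bytes. */
static size_t bmp_stride(size_t width)
{
    return (width * 3 + 3) & ~(size_t)3;
}

/* Copy bottom-up, padded BGR scan lines from `src` into a top-down,
   tightly packed RGB buffer `dst` (width * 3 bytes per row). */
static void bmp_rows_to_rgb(const uint8_t *src, uint8_t *dst,
                            size_t width, size_t height)
{
    size_t stride = bmp_stride(width);
    assert(src != NULL && dst != NULL);
    for (size_t y = 0; y < height; ++y) {
        /* The first stored row is the bottom row of the image. */
        const uint8_t *in = src + (height - 1 - y) * stride;
        uint8_t *out = dst + y * width * 3;
        for (size_t x = 0; x < width; ++x) {
            out[3 * x + 0] = in[3 * x + 2];  /* R */
            out[3 * x + 1] = in[3 * x + 1];  /* G */
            out[3 * x + 2] = in[3 * x + 0];  /* B */
        }
    }
}
```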

+7
Dec 6 '09 at 17:31

Yes, alignment does have a performance impact on modern (let's say x86) processors. Generally, loads and stores of data happen on natural alignment boundaries; if you are loading a 32-bit value into a register, it will be fastest if it is already aligned on a 32-bit boundary. If it is not, the x86 will "take care of it for you," in the sense that the CPU will still perform the load, but it will take significantly more cycles, because of the internal work needed to "re-align" the access.

Of course, in most cases this overhead is trivial. Binary data structures are often packed together in unaligned ways for transport over the network or for storage on disk, and the size benefits of the packed storage outweigh any performance hit from occasionally operating on that data.

But especially with large buffers of homogeneous data that are accessed randomly and where performance in the aggregate is really important, as in your pixel buffer above, maintaining alignment of data structures can be useful.

Note that in the example above, only each row of pixel data is aligned. The pixels themselves are still 3 bytes long and will often be unaligned within the rows, so there is not much benefit. There are texture formats, for example, that have 3 bytes of real data per pixel and literally just waste an extra byte on each one to keep the data aligned.

There is more general information here: http://en.wikipedia.org/wiki/Data_structure_alignment

(Specific characteristics vary between architectures, such as what the natural alignments are, whether the CPU handles unaligned loads and stores automatically, and how expensive they end up being. In cases where the CPU does not handle the access magically, the compiler or C runtime will often do what it can to do this work for you.)

+4
Dec 06 '09 at 17:26
  • Does aligning a buffer like this have a performance impact on modern processors?

Yes. For example, if memcpy is optimized using SIMD instructions (such as MMX/SSE), some operations will be faster with aligned memory. Some architectures have (processor) instructions that fail if the data is not aligned, so something may work on your machine but not on another.

With aligned data, you also make better use of CPU caching.

  • Do I have to worry about alignment at all, or will the compiler handle this?

You should worry about alignment when you use dynamic memory and the compiler cannot handle it (see the reply to this comment).

For other things in your code, you have the -malign flag and the aligned attribute to work with.
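For dynamically allocated buffers, one option is to request the alignment explicitly. The sketch below assumes a hosted C11 implementation that provides `aligned_alloc` (not every toolchain does); the wrapper name `alloc_aligned` is hypothetical. Because `aligned_alloc` expects the size to be a multiple of the alignment, the wrapper rounds the size up first.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocate at least `size` bytes whose address is
   a multiple of `alignment` (which must be a power of two).
   Returns NULL on failure; release the memory with free(). */
static void *alloc_aligned(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* aligned_alloc wants the size to be a multiple of the alignment. */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
}
```

For the bitmap in the question, a call such as `alloc_aligned(64, bytewidth * height)` would start the buffer on a 64-byte boundary; whether that helps in practice depends on how the data is accessed.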

+1
Dec 6 '09 at 17:22

Buffer alignment has an effect. The question is whether the effect is significant, and the answer can be highly application-specific. On architectures that do not natively support unaligned access, for example the 68000 and 68010 (the 68020 adds unaligned access), it is a real performance and/or maintenance problem, since the CPU will fault, or perhaps trap to a handler that performs the unaligned access.

The ideal alignment for various processors can be estimated: 4-byte alignment is appropriate for architectures with a 32-bit data path, and 8-byte alignment for 64-bit ones. However, L1 caching also has an effect. For many CPUs the cache line is 64 bytes, though this will no doubt change in the future.

Too high an alignment (for example, eight bytes where only two are needed) causes no performance penalty on any narrower system, even on an 8-bit microcontroller. It simply wastes (potentially) a few bytes of memory.

Your example is rather peculiar: the 3-byte elements have a 50% chance of being individually unaligned (to 32 bits), so aligning the buffer seems pointless, at least for performance reasons. However, in the case of a bulk transfer of the whole thing, it optimizes the first access. Note that an unaligned first byte may also have a performance impact on the transfer to a video controller.
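The rounding used in the question generalizes to any power-of-two alignment, such as the 4-, 8- and 64-byte figures mentioned here. A minimal sketch in C follows; the helper name `align_up` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of `alignment`, which must be a
   power of two. With alignment == 4 this matches the question's
   ((width * bytesPerPixel) + 3) & ~3 expression. */
static size_t align_up(size_t n, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}
```

For a 1281-pixel row of 3-byte pixels (3843 bytes), `align_up` gives 3844 for 4-byte alignment, 3848 for 8-byte alignment and 3904 for 64-byte alignment.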
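One way to check the 50% figure is to count how many pixels in a row have bytes on both sides of a 4-byte boundary. The sketch below assumes the row itself starts on such a boundary; the function name `count_straddling_pixels` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count the pixels in one row whose first and last bytes fall into
   different `word`-byte units, assuming the row starts on a boundary. */
static size_t count_straddling_pixels(size_t width, size_t bytes_per_pixel,
                                      size_t word)
{
    size_t count = 0;
    assert(word != 0 && bytes_per_pixel != 0);
    for (size_t x = 0; x < width; ++x) {
        size_t first = x * bytes_per_pixel;        /* offset of first byte */
        size_t last  = first + bytes_per_pixel - 1; /* offset of last byte  */
        if (first / word != last / word)
            ++count;
    }
    return count;
}
```

For the 1280-pixel rows in the question, `count_straddling_pixels(1280, 3, 4)` returns 640, i.e. half the pixels cross a 4-byte boundary, while 4-byte pixels give 0.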

+1
Dec 06 '09 at 17:39