What are the benefits of a page-aligned memory chunk?

I understand that most processors read data faster at aligned memory addresses, that is, at addresses that are a multiple of the CPU word size. However, in many places I read about page-aligned memory allocation. Why would someone want a page-aligned memory address? Is it just for even more performance?

+5
3 answers

The "traditional" way to allocate memory is to have it in an adjacent address space (a "heap" growing upwards through sbrk() calls). Each time you click the page border, a page error occurs and you get a map of the new page. These are two strategies:

  • Pages can only be freed when all allocations inside that page are freed and when all other allocations are mapped to lower addresses (the typical heap fragmentation effect).
  • Larger allocations may occupy one page more than strictly necessary (if they start somewhere in the middle of a page).

Thus, this strategy is only suitable for small blocks of memory, where you do not want to "waste" a whole page on each allocation.

For larger chunks, it is better to use mmap(), which maps new pages for you directly, so you get page-aligned memory. With this, your allocation does not share pages with other allocations, and as soon as you no longer need the memory, you can return it to the OS. Note that many malloc() implementations automatically choose whether to allocate via sbrk() or mmap(), depending on the size of the requested allocation.
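A minimal sketch of that second strategy, assuming Linux (the 1 MiB size is just an illustration): mmap() with MAP_ANONYMOUS returns freshly mapped, page-aligned memory, and munmap() hands it back to the OS as soon as it is no longer needed.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1024 * 1024;   /* a "large" allocation: 1 MiB */

    /* MAP_ANONYMOUS gives zero-filled pages not backed by any file;
       the returned address is always page aligned. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    printf("page size: %ld, block at %p\n", sysconf(_SC_PAGESIZE), p);

    /* Unlike sbrk()-grown heap memory, this block can be returned to
       the OS the moment it is no longer needed. */
    munmap(p, len);
    return EXIT_SUCCESS;
}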

+4

Alignment restrictions usually come up with direct I/O, which bypasses the page cache and copies data directly between the disk and the process address space. This can provide significant performance improvements in cases where the page cache is of no help, for example when streaming several gigabytes of data, especially when doing I/O to/from extremely fast disk systems.

Please note that only some file systems support direct I/O.

From the RedHat documentation for Linux, in particular:

Best Practices for Direct I/O


Users should always take care to use properly aligned and sized I/O. This is especially important for direct I/O access. Direct I/O should be aligned on a 'logical_block_size' boundary and be a multiple of 'logical_block_size'. With native 4K devices ('logical_block_size' is 4K), it is now critical that applications perform direct I/O that is a multiple of the 'logical_block_size'. This means that applications that perform 512-byte aligned I/O rather than 4K aligned I/O will break on native 4K devices. Applications may consult a device's 'I/O Limits' to ensure they are using properly aligned and sized I/O. The 'I/O Limits' are exposed both through sysfs and through the block device ioctl interfaces (see also: libblkid).

sysfs interface:

/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size

Please note that the use of direct I/O may be limited by the hardware as well as by the software. As noted in the RedHat documentation, the physical device's own I/O limits matter.
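As a hedged illustration of the ioctl side of that interface (assuming Linux; /dev/sda is only a placeholder device and usually needs elevated permissions to open), the limits can be queried like this:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(void)
{
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int logical = 0;
    unsigned int physical = 0, min_io = 0, opt_io = 0;

    /* error checking of the ioctl calls omitted for brevity */
    ioctl(fd, BLKSSZGET, &logical);    /* logical_block_size  */
    ioctl(fd, BLKPBSZGET, &physical);  /* physical_block_size */
    ioctl(fd, BLKIOMIN, &min_io);      /* minimum_io_size     */
    ioctl(fd, BLKIOOPT, &opt_io);      /* optimal_io_size     */

    printf("logical=%d physical=%u min_io=%u opt_io=%u\n",
           logical, physical, min_io, opt_io);

    close(fd);
    return 0;
}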

To use direct I / O on Linux, you need to open the file using the O_DIRECT flag:

 int fd = open( filename, O_RDONLY | O_DIRECT ); 
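The buffer passed to read(2) must itself be aligned, and the transfer size and file offset must be multiples of the block size, or the call typically fails with EINVAL. A minimal sketch, assuming a 4096-byte logical block size (query the real value as described above):

#define _GNU_SOURCE              /* exposes O_DIRECT on Linux */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const size_t align = 4096;      /* assumed logical_block_size */
    const size_t len = 64 * 1024;   /* multiple of the block size */

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    void *buf = NULL;
    if (posix_memalign(&buf, align, len) != 0) {   /* block-aligned buffer */
        close(fd);
        return EXIT_FAILURE;
    }

    ssize_t n = read(fd, buf, len);    /* bypasses the page cache */
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    free(buf);
    close(fd);
    return EXIT_SUCCESS;
}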

In my experience, direct I/O can lead to a 20-30% gain in I/O performance under certain circumstances. These circumstances usually involve streaming large amounts of data to/from a file on a very fast file system, with the application making few or no seek() calls.

+1

Alignment is something that always causes performance issues. When you write(2) or read(2) a file, it is best if you can fit your read boundaries on block boundaries, because otherwise you make the kernel read two blocks instead of one. The worst case is reading just two bytes that straddle a block boundary. Suppose you have a block size of 1024 bytes; then this code:

#include <fcntl.h>     /* open() */
#include <unistd.h>    /* lseek(), read() */

char var[2];
int fd;

fd = open("/etc/passwd", O_RDONLY);
lseek(fd, 1023UL, SEEK_SET);   /* position just before a 1024-byte block boundary */
read(fd, var, sizeof var);     /* this 2-byte read spans two blocks */

forces the kernel to perform two block reads (at most, since the blocks may already be cached) just to read(2) two bytes.

In the case of memory, all of this is usually managed by malloc(3), and since you do not incur page faults from it, you do not get any performance penalty (which is why there is no standard library function to get page-aligned memory, even on systems with paged virtual memory): as you consume memory, the kernel allocates it in pages for you, and the processor's virtual memory system makes page alignment almost transparent. Only if you have a misaligned memory access (say a 32-bit integer that straddles two pages, and both pages have been swapped out by the kernel) will you have to wait for the kernel to swap in two pages of memory instead of one. But that is an unlikely thing to happen: the compiler usually arranges inner loops so that they do not cross page boundaries, which minimizes the likelihood of this, and you also have the instruction cache to cope with these things.

That said, there are some cases where you do get performance improvements if you align memory yourself. I will try to show you such a scenario:

Suppose you need to dynamically manage many small structures (say, 16 bytes each), and you plan to allocate them with malloc(). malloc(3) manages memory by adding a header to each allocated block (say this header is 8 bytes long), which is an overhead of 50% over the ideal. If instead you get the memory in chunks of (say) 64 structures, you pay one such header (8 bytes) per 64*16 = 1024 bytes, an overhead of less than 1%.

To make this work, you need a way to know which chunk each of these structures belongs to (so that you can free(3) the chunk when none of its structures are in use), and there are two ways to do it: 1. store a pointer to the chunk in each structure, which adds 4 bytes to each structure and loses 25% of the memory again; or 2. force the chunk to be aligned to its own size, so that the chunk address can easily be computed from a structure's address (you only need to subtract the remainder of the address modulo the chunk size). This last method imposes no overhead to locate the chunk, but it requires every chunk to be aligned to the chunk size (not page-aligned).

This way, you improve performance a great deal, since you drastically reduce the number of malloc(3) calls and the memory wasted on headers for small allocations.
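A minimal sketch of the second scheme (all names and sizes here are illustrative, not from the answer): chunks hold 64 of the 16-byte structures and are allocated aligned to their own 1024-byte size, so the owning chunk is recovered by clearing the low bits of an item's address.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct item { char payload[16]; };   /* the small 16-byte structure */

#define ITEMS_PER_CHUNK 64
#define CHUNK_SIZE (ITEMS_PER_CHUNK * sizeof(struct item))   /* 1024 bytes */

struct item *alloc_chunk(void)
{
    /* aligned_alloc (C11) requires the size to be a multiple of the
       alignment; CHUNK_SIZE fulfils both roles here. */
    return aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);
}

/* Recover the chunk an item lives in: subtract the remainder of the
   address modulo the chunk size, i.e. clear the low bits. */
void *chunk_of(struct item *it)
{
    return (void *)((uintptr_t)it & ~(uintptr_t)(CHUNK_SIZE - 1));
}

int main(void)
{
    struct item *chunk = alloc_chunk();
    struct item *someone = &chunk[37];   /* any item inside the chunk */

    printf("chunk %p, item %p, recovered %p\n",
           (void *)chunk, (void *)someone, chunk_of(someone));

    free(chunk);
    return 0;
}

Because the chunk size here is a power of two, clearing the low bits is exactly the same as subtracting the address modulo the chunk size; with a non-power-of-two chunk size you would compute the remainder explicitly.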

By the way, malloc does not ask the operating system for memory on every call. It allocates memory in chunks, in a way similar to what has been described here, and ordinary implementations do not even return the allocated memory to the system (they reuse freed memory before allocating new memory). malloc manages the sbrk(2) calls itself, which means you will interfere with malloc if you use that system call yourself.

Linux/unix will also give you page-aligned memory via the shmat(2) system call. Try reading its man page and the related documentation.
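A minimal sketch of that, assuming a system with System V IPC (the segment size is arbitrary): the address returned by shmat(2) is page aligned.

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    size_t len = 1 << 20;   /* 1 MiB segment */

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0) {
        perror("shmget");
        return 1;
    }

    void *p = shmat(id, NULL, 0);   /* attach: the address is page aligned */
    if (p == (void *)-1) {
        perror("shmat");
        return 1;
    }
    printf("segment attached at %p\n", p);

    shmdt(p);                        /* detach ... */
    shmctl(id, IPC_RMID, NULL);      /* ... and remove the segment */
    return 0;
}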

+1
