Alignment always has some effect on performance. When you write(2) or read(2) a file, it is best to keep your reads aligned to block boundaries, because otherwise you make the kernel read two blocks instead of one. The worst case is reading just two bytes that straddle a block boundary. Suppose the block size is 1024 bytes; this code:
    char var[2];
    int fd = open("/etc/passwd", O_RDONLY);
    lseek(fd, 1023, SEEK_SET);   /* last byte of the first 1024-byte block */
    read(fd, var, sizeof var);   /* also needs the first byte of the second block */
forces the kernel to perform two block reads (at most, since the blocks may already be cached) just to satisfy a read(2) of two bytes.
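The block arithmetic behind that example can be sketched as follows. This is only an illustration: `block_start` and `blocks_touched` are hypothetical helper names, not library calls, and the sketch assumes the block size is a power of two.

```c
#include <assert.h>

/* Round a file offset down to the start of the block that contains it.
   Assumption: blksize is a nonzero power of two. */
unsigned long block_start(unsigned long offset, unsigned long blksize)
{
    assert(blksize != 0 && (blksize & (blksize - 1)) == 0);
    return offset & ~(blksize - 1);
}

/* Number of blocks touched by a read of len bytes starting at offset. */
unsigned long blocks_touched(unsigned long offset, unsigned long len,
                             unsigned long blksize)
{
    assert(len > 0);
    unsigned long first = block_start(offset, blksize);
    unsigned long last  = block_start(offset + len - 1, blksize);
    return (last - first) / blksize + 1;
}
```

With a 1024-byte block, a two-byte read at offset 1023 touches two blocks, while the same read at offset 1024 touches only one. In a real program the block size would probably come from the `st_blksize` field filled in by fstat(2) rather than from a constant.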
In the case of memory, all of this is normally managed by malloc(3), and since you are not affected by page faults there, you get no performance penalty (which is why malloc(3) gives you no way to ask for page-aligned memory, even on virtual memory systems with demand paging): as you consume memory, the kernel allocates it in pages for you. The processor’s virtual memory system makes page alignment almost transparent. You only pay if you have a misaligned memory access: suppose you access an unaligned 32-bit integer that straddles two pages, and both pages have been swapped out by the kernel; you will then have to wait for the kernel to swap in two pages instead of one. This is very unlikely to happen, though: the compiler normally keeps inner loops from straddling page boundaries to minimize the chance of it, and you also have the instruction cache to cope with these things.
That said, there are some cases where you do get a performance improvement by aligning memory. I will try to show you one such scenario:
Suppose you need to dynamically manage many small structures (say 16 bytes each), and you plan to manage them with malloc(3). malloc(3) manages memory by including a header in each allocated block (let this header be 8 bytes long), which is an overhead of 50% over the ideal. If you instead get memory in chunks of (let us say) 64 structures, you pay for only one such header (8 bytes) per 64*16 = 1024 bytes (less than 1%).
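The overhead figures can be checked with a few lines of C. The 8-byte header is the assumption made in the text, not a property of any particular malloc(3) implementation, and `header_overhead_percent` is a hypothetical helper name.

```c
#include <assert.h>

enum {
    ITEM_BYTES      = 16,  /* size of one small structure */
    HEADER_BYTES    = 8,   /* assumed per-allocation header */
    ITEMS_PER_CHUNK = 64   /* structures grouped into one allocation */
};

/* Header size expressed as a percentage of the useful payload. */
double header_overhead_percent(unsigned long header_bytes,
                               unsigned long payload_bytes)
{
    assert(payload_bytes > 0);
    return 100.0 * (double)header_bytes / (double)payload_bytes;
}
```

Calling `header_overhead_percent(HEADER_BYTES, ITEM_BYTES)` gives 50.0 for one allocation per structure, while `header_overhead_percent(HEADER_BYTES, ITEM_BYTES * ITEMS_PER_CHUNK)` gives 0.78125 for one allocation per chunk.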
To make this work, you need to know which chunk each structure belongs to (so you can free(3) a chunk when it is no longer in use), and you can do this in two ways:
1.- Storing a pointer to the chunk in each structure. This is rather pointless, since (assuming 4-byte pointers) it adds 4 bytes to each 16-byte structure and wastes 25% of the memory again.
2.- Forcing the chunks to be aligned, so the chunk address is easy to calculate from the structure address: you only need to subtract the remainder of the address modulo the chunk size.
The second method imposes no overhead for finding the chunk, but in practice it requires every chunk to be aligned to the chunk size (which is not the same as page alignment).
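A minimal sketch of the aligned-chunk approach, under some loud assumptions: it uses C11 `aligned_alloc` to get the alignment (the text does not say how the chunks are obtained), the names `struct item`, `chunk_alloc` and `chunk_of` are hypothetical, and the per-chunk bookkeeping a real allocator needs (free list, use count) is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ITEM_SIZE        16u
#define ITEMS_PER_CHUNK  64u
#define CHUNK_SIZE       (ITEM_SIZE * ITEMS_PER_CHUNK)   /* 1024 bytes */

struct item { unsigned char payload[ITEM_SIZE]; };

/* Allocate one chunk whose address is a multiple of CHUNK_SIZE.
   Returns NULL on failure; release the chunk with free(). */
struct item *chunk_alloc(void)
{
    return aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);
}

/* Recover the chunk base from the address of any item inside it by
   subtracting the remainder of the address modulo the chunk size. */
struct item *chunk_of(const struct item *it)
{
    assert(it != NULL);
    uintptr_t addr = (uintptr_t)it;
    return (struct item *)(addr - (addr % CHUNK_SIZE));
}
```

Because every chunk starts at a multiple of `CHUNK_SIZE`, `chunk_of(&chunk[i])` returns the chunk base for any `i` from 0 to 63 without storing a back pointer in each structure.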
This way you gain a lot of performance, since you significantly reduce both the number of malloc(3) calls and the memory wasted by allocating many small blocks.
By the way, malloc(3) does not ask the operating system for memory on every call. It gets memory in chunks, in a way similar to the one described here, and typical implementations do not even return freed memory to the system (they reuse freed memory before asking for more). It manages memory through sbrk(2) calls, which means that you will interfere with malloc(3) if you use that system call yourself.
Linux/Unix will give you page-aligned memory through the shmat(2) system call. Try reading its manual page and the related ones.
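As a rough illustration of that last point, the sketch below attaches a private System V shared memory segment and checks that the address chosen by the kernel is a multiple of the page size. The function name `shm_segment_is_page_aligned` is hypothetical, and the sketch assumes System V shared memory is available on the system; if it is not, the function reports that instead of an alignment result.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

/* Returns 1 if a freshly attached segment is page aligned, 0 if it is
   not, and -1 if shared memory could not be obtained at all. */
int shm_segment_is_page_aligned(size_t size)
{
    assert(size > 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    /* Let the kernel choose the attach address. */
    void *addr = shmat(id, NULL, 0);

    /* Mark the segment for removal; it goes away after the last detach. */
    shmctl(id, IPC_RMID, NULL);

    if (addr == (void *)-1)
        return -1;

    int aligned = ((uintptr_t)addr % (uintptr_t)page) == 0;
    shmdt(addr);
    return aligned;
}
```

On a system where shared memory is available, `shm_segment_is_page_aligned(4096)` is expected to return 1.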