Why does a misaligned address access require 2 or more accesses?

The usual answers to the question of why data should be aligned are more efficient access and a simpler processor design.

A related question and its answers are here, and another source is here. But neither of them resolves my question.

Suppose a processor has an access granularity of 4 bytes, meaning that it reads 4 bytes at a time. The material I listed above says that if I access misaligned data, say at address 0x1, then the CPU must make 2 accesses (one for addresses 0x0, 0x1, 0x2 and 0x3, one for addresses 0x4, 0x5, 0x6 and 0x7) and combine the results. I do not understand why. Why can't the CPU simply read the data at 0x1, 0x2, 0x3 and 0x4 when I issue address 0x1? This will not degrade performance or complicate the circuit.

Thank you in advance!

+3
performance cpu cpu-architecture computer-architecture
Oct 11 '10 at 2:21
4 answers

This will not degrade performance or complicate the circuit.

It is false assumptions like this one, accepted as fact, that really get in the way of further understanding.

Your comment on another question used much more appropriate wording ("I don't think it will degrade"...)

Do you realize that the memory architecture uses many memory chips in parallel to maximize throughput? And that a particular data item lives in only one chip? You cannot just read whichever chip happens to be most convenient and expect it to have the data you need.

Right now, the processor and memory can be wired together so that data bits 0-7 are connected only to chip 0, 8-15 to chip 1, 16-23 to chip 2, and 24-31 to chip 3. For every integer N, memory location 4N is stored in chip 0, 4N+1 in chip 1, and so on. And it is the Nth byte in each of those chips.

Look at the memory addresses stored at each offset of each memory chip:

              memory chip
 offset     0      1      2      3

      0     0      1      2      3
      1     4      5      6      7
      2     8      9     10     11
      N    4N   4N+1   4N+2   4N+3



So, if you load memory bytes 0-3 (N = 0), each chip reports its internal byte 0, all the bits end up in the right places, and everything is fine.
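The mapping in the table can be written as a small function. This is only an illustrative sketch: `NUM_CHIPS` and `locate` are names made up here, and it assumes the four byte-wide chips described above.

```python
NUM_CHIPS = 4  # assumption: four byte-wide chips, as in the table above

def locate(address):
    """Return (chip, offset) for a byte address under the 4N+k layout."""
    return address % NUM_CHIPS, address // NUM_CHIPS

for address in (0, 1, 4, 11):
    chip, offset = locate(address)
    print(f"address {address:2d} -> chip {chip}, offset {offset}")
```

Addresses 0-3 all map to offset 0, which is why a single offset on the bus is enough for an aligned load.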

Now, if you try to load a word starting at memory location 1, what will happen?

First, let's look at how it would be done. Memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, end up in bits 8-31, because that is where those memory chips are attached, even though you asked for them in bits 0-23. This is not a big deal, because the processor can swizzle them internally using the same circuitry it uses for a logical shift. Then, in the next transaction, memory byte 4, which is stored in memory chip 0 at offset 1, is read into bits 0-7 and swizzled into bits 24-31, where you want it to be.
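A toy model of those two transactions, assuming the four-chip layout from the table and a shared offset for every chip, might look like this. The function names (`make_chips`, `bus_read`, `load_word`) are hypothetical, and real hardware does this in wiring and shifters, not software.

```python
NUM_CHIPS = 4  # assumption: four byte-wide chips sharing one address bus

def make_chips(data):
    """Distribute bytes so address a lives in chip a % 4 at offset a // 4."""
    chips = [[] for _ in range(NUM_CHIPS)]
    for address, byte in enumerate(data):
        chips[address % NUM_CHIPS].append(byte)
    return chips

def bus_read(chips, offset):
    """One memory transaction: every chip sees the same offset.
    Chip k drives data bits 8k..8k+7."""
    word = 0
    for k, chip in enumerate(chips):
        word |= chip[offset] << (8 * k)
    return word

def load_word(chips, address):
    """Load 4 bytes starting at address; return (value, transaction count)."""
    offset, shift = divmod(address, NUM_CHIPS)
    low = bus_read(chips, offset)
    if shift == 0:
        return low, 1  # aligned: one transaction, no swizzling
    high = bus_read(chips, offset + 1)  # second transaction at the next offset
    # Swizzle: move the wanted bytes of each read into place and combine them.
    value = ((low >> (8 * shift)) | (high << (8 * (NUM_CHIPS - shift)))) & 0xFFFFFFFF
    return value, 2

chips = make_chips(bytes(range(0x10, 0x18)))  # addresses 0..7 hold 0x10..0x17
print(load_word(chips, 0))  # aligned load
print(load_word(chips, 1))  # unaligned load
```

The unaligned load needs two calls to `bus_read` only because `bus_read` takes a single offset for all chips.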

Notice something here. The requested word is split across offsets: the first memory transaction reads offset 0 from three chips, and the second reads offset 1 from the remaining chip. This is where the problem lies. You have to tell the memory chips the offset so that they can send back the right data, and the offset is ~40 bits wide and the signals are VERY high speed. Right now there is only one set of offset signals (called the address bus, BTW), and it connects to all the memory chips. To do an unaligned memory access in a single transaction, you would need an independent address bus running to each memory chip. For a 64-bit processor, you would go from one address bus to eight, an increase of almost 300 pins. In a world where processors use from 700 to 1300 pins, this can hardly be called a "slight increase in circuitry." Not to mention the huge increase in noise and crosstalk from that many additional high-speed signals.
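The "almost 300" figure follows from simple arithmetic. This sketch assumes one byte-wide chip per byte lane and takes the answer's approximate 40-bit address width at face value.

```python
ADDRESS_BITS = 40        # "~40 bits" per the answer; an approximation
DATA_BUS_BITS = 64       # 64-bit processor
BYTE_LANES = DATA_BUS_BITS // 8  # assumption: one byte-wide chip per lane

# One address bus already exists, so only the additional ones cost new pins.
extra_pins = (BYTE_LANES - 1) * ADDRESS_BITS
print(extra_pins)  # 280
```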

OK, it is not quite that bad, because there can be at most two different offsets on the address bus at once, and one is always the other plus one. So you could get away with one extra wire to each memory chip, saying in effect either (read the offset given on the address bus) or (read the offset after it). But now every memory chip needs an extra adder, which means it has to compute the offset before actually accessing the memory, and that lowers the maximum clock rate for the memory. In other words, aligned access becomes slower if you want unaligned access to be faster. Since 99.99% of accesses can be made aligned, this is a net loss.
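The "one extra wire per chip" idea can be sketched as follows, again assuming the four-chip layout from the table. `select_bits` and `chip_offsets` are hypothetical names; the `base + int(bit)` step stands in for the adder each chip would need.

```python
NUM_CHIPS = 4  # assumption: four byte-wide chips

def select_bits(address):
    """Per-chip flag: True means 'read the offset after the one on the bus'."""
    first_chip = address % NUM_CHIPS
    return [chip < first_chip for chip in range(NUM_CHIPS)]

def chip_offsets(address):
    """Offset each chip actually reads for a 4-byte load at address."""
    base = address // NUM_CHIPS  # the single offset on the shared address bus
    return [base + int(bit) for bit in select_bits(address)]  # per-chip adder

print(chip_offsets(1))  # chip 0 reads the next offset, chips 1-3 the base one
print(chip_offsets(4))  # aligned: every chip reads the same offset
```

Note that the addition sits on the path of every access, aligned or not, which is the slowdown the answer describes.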

So that is why an unaligned access is split into two steps: the address bus is shared by all the bytes involved. And this is actually a simplification, because when you have different offsets you may also have different cache lines involved, so all of the cache coherency logic would have to be doubled to handle the communication between processor cores.

+11
Oct 11 '10

In my opinion, that is a very simplistic assumption. The circuitry may involve many levels of pipelining and caching optimization to get certain bits of memory read. In addition, memory reads are delegated to memory subsystems, which may be built from components that differ by orders of magnitude in performance and design complexity, so they cannot simply read memory the way you imagine.

However, I will add the caveat that I am not a processor or memory designer, so I could be talking nonsense.

0
Oct 11 '10

The answer to your question is in the question itself.

The CPU has an access granularity of 4 bytes. Thus, it can only fetch data in chunks of 4 bytes.

If you had accessed address 0x0, the CPU would give you the 4 bytes from 0x0 to 0x3.

When you issue an instruction to access data at address 0x1, the processor takes this as a request for 4 bytes of data starting at 0x1 (i.e. 0x1 to 0x4). It cannot be interpreted any other way, fundamentally because of the granularity of the CPU. Therefore, the CPU fetches the data from 0x0 to 0x3 and from 0x4 to 0x7 (ergo, 2 accesses), then puts the data from 0x1 to 0x4 together as the final result.
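To make the count concrete, here is a minimal sketch that lists which aligned 4-byte chunks a request touches. `GRANULARITY` and `aligned_accesses` are illustrative names, not anything a real CPU exposes.

```python
GRANULARITY = 4  # access granularity in bytes, as assumed in the question

def aligned_accesses(address, size=GRANULARITY):
    """Return the start addresses of the aligned chunks the request touches."""
    first = address - address % GRANULARITY
    last_byte = address + size - 1
    last = last_byte - last_byte % GRANULARITY
    return list(range(first, last + 1, GRANULARITY))

print(aligned_accesses(0x0))  # one chunk, starting at 0x0
print(aligned_accesses(0x1))  # two chunks, starting at 0x0 and 0x4
```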

0
Oct 11 '10 at 4:08

Addressing 4 bytes with the first byte misaligned at 0x1, rather than at 0x0, means that the access does not start on a word boundary and spills over into the next adjacent word. The first access fetches the 3 bytes up to the word boundary (assuming a 32-bit word), and the second access then fetches byte 0x4 to complete the 4-byte, 32-bit word. The object code or assembler effectively performs the second access and the concatenation transparently for the programmer. It is best to keep to word boundaries whenever possible, usually in units of 4 bytes.

0
Nov 29 '10 at 17:14


