"This will not degrade performance or complicate the circuit."
These are false premises which, if accepted as fact, really preclude any further understanding. Your comment on another question used much more appropriate wording ("I don't think it will degrade" ...).
The memory architecture uses many memory chips in parallel in order to maximize bandwidth. A given piece of data is stored in only one of those chips; you can't just read whichever chip happens to be convenient and expect it to have the data you want.
Right now, the processor and memory are wired together so that bits 0-7 are connected only to chip 0, bits 8-15 to chip 1, bits 16-23 to chip 2, and bits 24-31 to chip 3. For every integer N, memory location 4N is stored in chip 0, 4N+1 in chip 1, and so on, and each of these is the Nth byte within its chip.
Look at the memory addresses stored at each offset of each memory chip:

                 memory chip
    offset     0     1     2     3
       0       0     1     2     3
       1       4     5     6     7
       2       8     9    10    11
       N      4N  4N+1  4N+2  4N+3
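The same mapping can be sketched in code (a rough model only; the helper names and the four-chip, 32-bit layout are just the assumptions from the table above):

    #include <stdio.h>

    #define NUM_CHIPS 4   /* four 8-bit-wide memory chips on a 32-bit bus */

    /* Which chip holds a given byte address, and at what internal offset.
       Matches the table above: address 4N+k lives in chip k at offset N. */
    static unsigned chip_of(unsigned addr)   { return addr % NUM_CHIPS; }
    static unsigned offset_of(unsigned addr) { return addr / NUM_CHIPS; }

    int main(void)
    {
        for (unsigned addr = 0; addr < 12; addr++)
            printf("byte %2u -> chip %u, offset %u\n",
                   addr, chip_of(addr), offset_of(addr));
        return 0;
    }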
So, if you load memory bytes 0-3, then N = 0, each chip reports its internal byte 0, all the bits end up in the right places, and everything is fine.
Now, what happens if you try to load a word starting at memory address 1?
First, look at how it is done today. Memory bytes 1-3, which are stored in memory chips 1-3 at offset 0, come back on bits 8-31, because that is where those memory chips are wired, even though you wanted them in bits 0-23. That does not really matter, because the processor can swizzle them internally using the same circuitry it uses for logical shifts. Then, in a second memory transaction, byte 4, which is stored in memory chip 0 at offset 1, is read into bits 0-7 and swizzled into bits 24-31, where you want it to be.
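To make those two transactions concrete, here is a rough software model (hypothetical names, little-endian byte order assumed): the first read returns offset 0 with chips 1-3 on bit lanes 8-31, the second read returns offset 1 with chip 0 on bit lanes 0-7, and the processor shifts both into place before merging them.

    #include <stdint.h>
    #include <stdio.h>

    /* Four 8-bit chips; chips[c][off] holds the byte at address 4*off + c. */
    static uint8_t chips[4][16];

    /* One memory transaction: every chip returns its byte at 'off',
       and chip c always lands on bit lanes 8*c .. 8*c+7. */
    static uint32_t read_row(unsigned off)
    {
        uint32_t word = 0;
        for (unsigned c = 0; c < 4; c++)
            word |= (uint32_t)chips[c][off] << (8 * c);
        return word;
    }

    /* Unaligned 32-bit load at address 1, done as two transactions. */
    static uint32_t load_unaligned_at_1(void)
    {
        uint32_t lo = read_row(0);   /* chips 1-3 arrive in bits 8-31 */
        uint32_t hi = read_row(1);   /* chip 0 arrives in bits 0-7    */
        /* Swizzle: move bytes 1-3 down to bits 0-23, byte 4 up to bits 24-31. */
        return (lo >> 8) | ((hi & 0xFF) << 24);
    }

    int main(void)
    {
        for (unsigned addr = 0; addr < 8; addr++)    /* fill bytes 0..7 with their address */
            chips[addr % 4][addr / 4] = (uint8_t)addr;
        printf("0x%08x\n", load_unaligned_at_1());   /* prints 0x04030201 */
        return 0;
    }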
Notice something here. The requested word is split across offsets: the first memory transaction reads offset 0 from three chips, and the second reads offset 1 from the remaining chip. Here is where the problem lies. You must tell the memory chips the offset so that they can send the right data back, that offset is roughly 40 bits wide, and the signals are VERY high speed. Right now there is only one set of offset signals, wired to all of the memory chips. To do an unaligned memory access in a single transaction, you would need an independent offset (which is called an address bus, by the way) going to each memory chip. For a 64-bit processor, you would go from one address bus to eight, an increase of almost 300 pins. In a world where processors use 700 to 1300 pins, that can hardly be called a "slight increase in circuitry". Not to mention the huge increase in noise and crosstalk from all those additional high-speed signals.
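As a back-of-the-envelope check of that pin count (assuming a ~40-bit offset/address bus and eight byte lanes on a 64-bit data bus, as above):

    #include <stdio.h>

    int main(void)
    {
        const int offset_bits = 40;  /* assumed width of the shared offset/address bus */
        const int byte_lanes  = 8;   /* one 8-bit chip per byte of a 64-bit data bus   */

        int shared_bus_pins   = offset_bits;               /*  40 pins today */
        int per_chip_bus_pins = offset_bits * byte_lanes;  /* 320 pins       */
        printf("extra pins needed: %d\n",
               per_chip_bus_pins - shared_bus_pins);       /* 280, i.e. almost 300 */
        return 0;
    }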
Well, it is not quite that bad, because there can be at most two different offsets in flight at once, and one of them is always the other plus one. So you could get away with one extra wire to each memory chip, meaning either (read the offset on the address bus) or (read that offset plus one), which is two states. But now every memory chip needs an extra adder, which means it must compute the offset before actually doing the memory access, which lowers the maximum clock rate of the memory. In other words, aligned access gets slower so that unaligned access can get faster. Since 99.99% of accesses can be made aligned, this is a net loss.
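A sketch of that "one extra wire" scheme as each chip would see it (a hypothetical model, not real hardware): the chip receives the shared offset plus a single select bit, and the adder it now needs sits in front of every array access.

    #include <stdint.h>
    #include <stdio.h>

    /* Per-chip address logic under the 'one extra wire' scheme: each chip sees
       the shared offset plus one select bit; when the bit is set, the chip reads
       offset + 1.  The addition is the extra adder described above.             */
    static uint32_t effective_offset(uint32_t shared_offset, int bump)
    {
        return shared_offset + (bump ? 1u : 0u);
    }

    int main(void)
    {
        /* Unaligned load at address 1: chips 1-3 read the broadcast offset 0,
           while chip 0 has its bump wire asserted and reads offset 1 instead. */
        for (unsigned chip = 0; chip < 4; chip++) {
            int bump = (chip < 1);   /* chips below the start byte move to the next row */
            printf("chip %u reads offset %u\n", chip, effective_offset(0, bump));
        }
        return 0;
    }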
So that is why unaligned access is split into two transactions: because the address bus is shared by all of the bytes involved. And this is actually a simplification, because when you have different offsets you can also have different cache lines, so all of the cache-coherency logic would have to double to handle twice the communication between processor cores.
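For the cache-line point, a quick check (assuming 64-byte cache lines, a typical but not universal size): an unaligned load can straddle two different lines, so two lines, and potentially two coherence lookups, are involved.

    #include <stdio.h>

    #define LINE_SIZE 64u   /* assumed cache line size */

    /* Does a load of 'len' bytes starting at 'addr' span two cache lines? */
    static int crosses_line(unsigned addr, unsigned len)
    {
        return (addr / LINE_SIZE) != ((addr + len - 1) / LINE_SIZE);
    }

    int main(void)
    {
        printf("%d\n", crosses_line(60, 4));  /* 0: bytes 60-63, one line  */
        printf("%d\n", crosses_line(62, 4));  /* 1: bytes 62-65, two lines */
        return 0;
    }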