Swapping 2 registers in 8086 assembly language (16 bits)

Question

Does anyone know how to swap the values of two registers without using another variable, register, stack, or any other storage location? Thanks!

For example, swapping AX and BX.

-1

assembly cpu-registers x86-16 16-bit

Clapa lucian Oct 20 '14 at 15:21

2 answers

If you really need to swap two registers, xchg ax, bx is the most efficient way on all modern processors in most cases. (You can construct a case where several single-uop instructions are more efficient because of some front-end effect caused by the surrounding code. Or for 32-bit operand size, where zero-latency mov makes a 3-mov sequence with a temporary register better on Intel processors.)

For code size, xchg-with-ax takes only one byte. This is where the 0x90 NOP encoding comes from: it is xchg ax,ax, or xchg eax,eax in 32-bit mode. In 64-bit mode, xchg eax,eax would truncate RAX to 32 bits, so 0x90 is explicitly a NOP instruction there, not xchg. Exchanging any other pair of registers takes 2 bytes, using the xchg r, r/m encoding (plus a REX prefix if required in 64-bit mode).
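The size difference can be seen by building both encodings by hand. This is only a sketch: the helper names xchg_ax_short and xchg_modrm and the reg16 enum are made up for this example, while the opcode values (0x90 + register number for the short form, 0x87 followed by a ModRM byte for the general form) are the standard 16-bit encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard x86 16-bit register numbers, as used in opcodes and ModRM. */
enum reg16 { AX = 0, CX = 1, DX = 2, BX = 3, SP = 4, BP = 5, SI = 6, DI = 7 };

/* Hypothetical helper: one-byte short form, xchg ax, r16 = 0x90 + reg. */
size_t xchg_ax_short(enum reg16 r, uint8_t out[1]) {
    out[0] = (uint8_t)(0x90 + r);
    return 1; /* bytes emitted */
}

/* Hypothetical helper: general form, xchg r/m16, r16 = 0x87 /r.
 * mod = 11 selects a register operand in the r/m field. */
size_t xchg_modrm(enum reg16 rm, enum reg16 reg, uint8_t out[2]) {
    out[0] = 0x87;
    out[1] = (uint8_t)(0xC0 | (reg << 3) | rm);
    return 2; /* bytes emitted */
}
```

With these helpers, xchg ax, bx comes out as the single byte 0x93, xchg ax, ax as 0x90 (the NOP encoding), and xchg bx, cx as the two bytes 0x87 0xCB.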

On an actual 8086, code fetch was usually the performance bottleneck, so xchg is by far the best way, especially using the single-byte xchg-with-AX short form.

For 32-bit / 64-bit registers, 3 mov instructions with a temporary can benefit from mov-elimination, which xchg cannot on current Intel processors. xchg is 3 uops on Intel, all of which have 1-cycle latency and need an execution unit, so one direction has 2-cycle latency and the other has 1-cycle latency. See "Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?" to learn more about the microarchitectural details of how current CPUs implement it.

On AMD Ryzen, xchg on 32/64-bit regs is 2 uops and is handled in the rename stage, so it is like two mov instructions that run in parallel. On earlier AMD processors, it is still a 2-uop instruction, but with 1-cycle latency in each direction.

xor-swaps, add/sub swaps, or any other multi-instruction sequence other than mov make no sense compared to xchg for registers. They all have 2- or 3-cycle latency and larger code size. The only alternative worth considering is mov instructions.
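For reference, this is what the xor-swap computes. A minimal sketch in C, assuming a hypothetical struct regs16 that models AX and BX as 16-bit values; each step depends on the previous one, which is where the multi-cycle latency comes from.

```c
#include <stdint.h>

/* Hypothetical model of two 16-bit registers (not a real API). */
struct regs16 {
    uint16_t ax;
    uint16_t bx;
};

/* xor-swap: three dependent operations, no temporary storage. */
void xor_swap(struct regs16 *r) {
    r->ax ^= r->bx; /* ax = a ^ b           */
    r->bx ^= r->ax; /* bx = b ^ (a ^ b) = a */
    r->ax ^= r->bx; /* ax = (a ^ b) ^ a = b */
}
```

Each line corresponds to one xor instruction between the two registers. A single xchg ax, bx does the same job in one byte.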

Or better, unroll a loop or rearrange your code so that it does not need a swap, or needs only a mov.

Note that xchg with memory has an implied lock prefix. Do not use xchg with memory unless performance doesn't matter at all but code size does (for example, in a bootloader), or unless you need it to be atomic and/or a full memory barrier, because it is both.
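In C, the atomic register/memory exchange described above corresponds to atomic_exchange from C11's <stdatomic.h>. A minimal sketch, assuming a hypothetical wrapper named swap_with_memory; x86 compilers typically implement this with an xchg instruction, but that is a code-generation choice, not something the C standard guarantees.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical wrapper: store reg into *mem and return the old contents,
 * as one atomic, sequentially consistent operation. */
uint16_t swap_with_memory(_Atomic uint16_t *mem, uint16_t reg) {
    return atomic_exchange(mem, reg);
}
```

The atomicity and ordering are what make this expensive, so it is worth using only when those guarantees are actually needed.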

If you need to exchange a register with memory and do not have a free scratch register, xor-swap may actually be the best option. Using temporary memory would require copying the memory value (for example, onto the stack using push [mem]) or first spilling the register to a second scratch memory location before loading + storing the memory operand.

The lowest-latency way is still with a scratch register; often you can choose one that is not on the critical path, or one that only needs to be reloaded (not saved in the first place, because its value is already in memory or can be recomputed from other registers with an ALU instruction).

    ; spill/reload another register
    push  edx                ; save/restore on the stack or anywhere else

    movzx edx, word [mem]    ; or just mov dx, [mem]
    mov   [mem], ax
    mov   eax, edx

    pop   edx                ; or better, just clobber a scratch reg

There are two other reasonable (but much worse) options for exchanging memory with a register. Either do not touch any other registers (except SP):

    ; using scratch space on the stack
    push  word [mem]         ; [mem] can be any addressing mode, e.g. [bx]
    mov   [mem], ax
    pop   ax                 ; dep chain = load, store, reload.

or do not touch anything else at all:

    ; using no extra space anywhere
    xor   ax, [mem]
    xor   [mem], ax          ; read-modify-write has store-forwarding + ALU latency
    xor   ax, [mem]          ; dep chain = load+xor, (parallel load)+xor+store, reload+xor

Using two memory-destination xors and one memory-source xor would be worse for throughput (more stores and a longer dependency chain).

The push/pop version only works for operand sizes that can be pushed/popped, but xor-swap works for any operand size. If you can use a temporary location on the stack, the save/restore version is probably preferable, unless you need a balance of code size and speed.

+1

Peter Cordes Oct 30 '17 at 18:19

Zaz · Accepted Answer · 2014-10-20T15:42:12+0000

You can do this using arithmetic operations. I can give you an idea. Hope this helps!

I followed this C code:

    int i = 10, j = 20;
    i = i + j;
    j = i - j;
    i = i - j;

    ; x86 arithmetic instructions write the result directly into the
    ; destination register, so no separate mov from an accumulator is needed.
    mov ax, 10
    mov bx, 20
    add ax, bx      ; ax = 30       (original ax + original bx)
    sub bx, ax      ; bx = 20 - 30  (minus the original ax)
    neg bx          ; bx = 10       (original ax)
    sub ax, bx      ; ax = 30 - 10 = 20 (original bx)
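The arithmetic behind the add/sub swap can be checked in C, including the case where the sum overflows. A minimal sketch, assuming a hypothetical struct regs16 that models AX and BX as uint16_t so that results wrap modulo 2^16 as they do in 16-bit registers:

```c
#include <stdint.h>

/* Hypothetical model of two 16-bit registers (not a real API). */
struct regs16 {
    uint16_t ax;
    uint16_t bx;
};

/* add/sub swap: the casts truncate to 16 bits, like a 16-bit register. */
void add_sub_swap(struct regs16 *r) {
    r->ax = (uint16_t)(r->ax + r->bx); /* ax = a + b       */
    r->bx = (uint16_t)(r->ax - r->bx); /* bx = (a + b) - b = a */
    r->ax = (uint16_t)(r->ax - r->bx); /* ax = (a + b) - a = b */
}
```

Because all three steps wrap the same way, the swap still works when a + b does not fit in 16 bits; for example, starting from ax = 0xFFFF and bx = 2, the result is ax = 2 and bx = 0xFFFF.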
