Even if you could directly access the hardware registers, the wrapper code around the decision to use a register instead of memory makes it that much slower.
To get performance, you need to design for performance up front.
A few examples.
Set up an x86 virtual machine by installing all the traps needed to catch the code leaving its virtual memory space. Execute the code directly: do not emulate it, just branch to it and let it run. When the code reaches outside its memory or I/O space to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, and then return control to the program. If the code is processor bound it will run very fast; if it is I/O bound it will be slow, but not as slow as emulating each instruction.
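To make the "trap it and emulate the device" step concrete, here is a minimal sketch of just the emulation side. Everything in it is an assumption for illustration: `struct io_exit` stands in for whatever your trap mechanism reports, and the serial-like device with ports 0x3F8/0x3FD is hypothetical. The part that actually runs the guest natively and catches the trap is platform specific and not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical record describing why the guest stopped running natively. */
struct io_exit {
    uint16_t port;     /* I/O port the guest touched */
    int      is_write; /* nonzero for OUT, zero for IN */
    uint8_t  value;    /* data written, or slot for the data read */
};

/* Hypothetical serial-like device: one data port, one status port. */
enum { PORT_DATA = 0x3F8, PORT_STATUS = 0x3FD, STATUS_TX_READY = 0x20 };

struct serial_dev {
    char   out[64];    /* bytes the guest has "transmitted" so far */
    size_t len;
};

/* Emulate one trapped port access. Returns 0 if handled, -1 if the
   port is unknown. After this returns, the caller would resume the
   guest at the instruction following the IN/OUT. */
int emulate_io(struct serial_dev *dev, struct io_exit *ex)
{
    if (ex->port == PORT_DATA && ex->is_write) {
        if (dev->len < sizeof dev->out)
            dev->out[dev->len++] = (char)ex->value;
        return 0;
    }
    if (ex->port == PORT_STATUS && !ex->is_write) {
        ex->value = STATUS_TX_READY;   /* always ready to accept a byte */
        return 0;
    }
    return -1;
}
```

The guest only pays this cost when it touches a device; everything between two traps runs at native speed.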
Static binary translation. Disassemble and translate the code before running it. For example, the instruction bytes 0x34,0x2E (xor al, 0x2E) would turn into ASCII text in a .c file:
al ^= 0x2E; of = 0; cf = 0; sf = al;
Ideally you perform tons of dead code removal (if the next instruction also modifies the flags, then do not modify them here, etc.) and let the optimizer in the compiler do the rest. You can get a performance gain over an emulator this way; how much depends on how well you can optimize the code. Since the result is a new program, it runs on the hardware, registers, memory and all. Processor-bound code is still slower than in a virtual machine. In some cases you do not have to deal with the processor taking exceptions to trap memory or I/O accesses, because you have simulated the memory accesses in the code, but that still has a cost and still ends up calling a simulated device, so there are no savings there.
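Here is a rough sketch of what the translator's output could look like for two guest instructions, first translated naively and then with the dead flag updates removed. The `struct cpu` layout and function names are my own assumptions, and only OF, CF, SF and ZF are modeled (PF and AF are left out to keep it short).

```c
#include <stdint.h>

/* Assumed guest state: AL plus the four flags this sketch models. */
struct cpu { uint8_t al; uint8_t of, cf, sf, zf; };

/* Naive translation: every instruction updates every flag it defines. */
void block_naive(struct cpu *c)
{
    /* 34 2E : xor al, 0x2E */
    c->al ^= 0x2E;
    c->of = 0;
    c->cf = 0;
    c->sf = c->al >> 7;
    c->zf = (c->al == 0);

    /* 04 01 : add al, 0x01 */
    uint8_t a = c->al, b = 0x01;
    uint16_t wide = (uint16_t)(a + b);
    uint8_t r = (uint8_t)wide;
    c->al = r;
    c->cf = wide > 0xFF;
    c->of = ((a ^ r) & (b ^ r) & 0x80) != 0;
    c->sf = r >> 7;
    c->zf = (r == 0);
}

/* Same block after dead flag removal: the add overwrites all four
   flags without reading them, so the xor's flag updates are dropped. */
void block_optimized(struct cpu *c)
{
    uint8_t a = c->al ^ 0x2E;          /* xor al, 0x2E (flags dead) */
    uint8_t b = 0x01;
    uint16_t wide = (uint16_t)(a + b); /* add al, 0x01 */
    uint8_t r = (uint8_t)wide;
    c->al = r;
    c->cf = wide > 0xFF;
    c->of = ((a ^ r) & (b ^ r) & 0x80) != 0;
    c->sf = r >> 7;
    c->zf = (r == 0);
}
```

Both functions leave the guest state identical for every starting value of AL. A good compiler may remove the dead stores in the naive version on its own, but the translator knows the guest's flag semantics and can remove far more.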
Dynamic translation. Similar to static binary translation, but done at run time. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually changed from x86 instructions into native Alpha instructions, so the next time around it executes the Alpha instructions directly instead of emulating the x86 instructions. Each time through the code, the program runs faster.
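The caching idea behind this can be sketched in portable C. Note the big simplification: a real dynamic translator emits native machine code, whereas this sketch only stores pre-decoded handler entries, so treat it as a stand-in for the technique. All names (`tcache`, `run_block`, etc.) are hypothetical, only two opcodes are handled, flags are omitted, and self-modifying code is ignored.

```c
#include <stdint.h>
#include <stddef.h>

struct cpu { uint8_t al; };

typedef void (*op_fn)(struct cpu *, uint8_t imm);

static void op_xor_al(struct cpu *c, uint8_t imm) { c->al ^= imm; }
static void op_add_al(struct cpu *c, uint8_t imm) { c->al = (uint8_t)(c->al + imm); }

/* One pre-decoded guest instruction. */
struct host_op { op_fn fn; uint8_t imm; };

#define MAX_OPS     16
#define CACHE_SLOTS 64

/* A translated block, keyed by the guest address it starts at. */
struct tblock {
    int      valid;
    uint32_t guest_pc;
    size_t   n_ops;
    size_t   guest_len;              /* guest bytes the block covers */
    struct host_op ops[MAX_OPS];
};

struct tcache {
    struct tblock slots[CACHE_SLOTS];
    unsigned translations;           /* how often we had to decode */
};

/* Decode guest bytes at pc until an opcode we do not handle. */
static void translate(struct tblock *tb, const uint8_t *mem,
                      size_t mem_len, uint32_t pc)
{
    size_t i = pc;
    tb->n_ops = 0;
    while (i + 1 < mem_len && tb->n_ops < MAX_OPS) {
        op_fn fn;
        if (mem[i] == 0x34)      fn = op_xor_al;  /* xor al, imm8 */
        else if (mem[i] == 0x04) fn = op_add_al;  /* add al, imm8 */
        else break;
        tb->ops[tb->n_ops].fn  = fn;
        tb->ops[tb->n_ops].imm = mem[i + 1];
        tb->n_ops++;
        i += 2;
    }
    tb->guest_pc  = pc;
    tb->guest_len = i - pc;
    tb->valid     = 1;
}

/* Run the block starting at pc, translating it only on the first
   visit. Returns the guest pc following the block. */
uint32_t run_block(struct tcache *tc, struct cpu *c,
                   const uint8_t *mem, size_t mem_len, uint32_t pc)
{
    struct tblock *tb = &tc->slots[pc % CACHE_SLOTS];
    if (!tb->valid || tb->guest_pc != pc) {
        translate(tb, mem, mem_len, pc);
        tc->translations++;
    }
    for (size_t k = 0; k < tb->n_ops; k++)
        tb->ops[k].fn(c, tb->ops[k].imm);
    return pc + (uint32_t)tb->guest_len;
}
```

The second time `run_block` sees the same guest address it skips decoding entirely, which is where the "faster each time through" effect comes from.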
Or maybe just redesign your emulator to be more efficient from an execution standpoint. Look at the emulated processors in MAME, for example: readability and maintainability of the code were sacrificed for performance. When they were written that mattered; today, with multi-core gigahertz processors, you do not have to work so hard to emulate a 1.5 MHz 6502 or a 3 MHz Z80. Something as simple as looking up the next opcode in a table and deciding not to emulate some or all of the flag calculation for an instruction can give you a noticeable boost.
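The table lookup trick might look something like this in a plain interpreter loop. This is only a sketch under my own assumptions: two opcodes, four flags, a hypothetical `kills_flags` table, and a `flag_updates` counter that exists purely so you can see how much flag work was skipped.

```c
#include <stdint.h>
#include <stddef.h>

struct cpu {
    uint8_t  al;
    uint8_t  cf, zf, sf, of;
    unsigned flag_updates;   /* instrumentation only */
};

/* 1 if the opcode overwrites every flag modeled here without
   reading any of them first. */
static const uint8_t kills_flags[256] = {
    [0x04] = 1,  /* add al, imm8 */
    [0x34] = 1,  /* xor al, imm8 */
};

/* Interpret two-byte instructions until an unknown opcode or the end
   of the buffer. Returns the offset where execution stopped. */
size_t interpret(struct cpu *c, const uint8_t *code, size_t len)
{
    size_t pc = 0;
    while (pc + 1 < len) {
        uint8_t op = code[pc], imm = code[pc + 1];
        if (op != 0x04 && op != 0x34)
            break;

        /* Peek ahead: skip the flag work only if a complete
           flag-overwriting instruction follows immediately. */
        int skip = (pc + 3 < len) && kills_flags[code[pc + 2]];

        uint8_t  a    = c->al;
        uint16_t wide = (op == 0x04) ? (uint16_t)(a + imm)
                                     : (uint16_t)(a ^ imm);
        uint8_t  r    = (uint8_t)wide;
        c->al = r;

        if (!skip) {
            c->cf = (op == 0x04) && wide > 0xFF;
            c->of = (op == 0x04) && (((a ^ r) & (imm ^ r) & 0x80) != 0);
            c->sf = r >> 7;
            c->zf = (r == 0);
            c->flag_updates++;
        }
        pc += 2;
    }
    return pc;
}
```

This only works if nothing can observe the flags between the two instructions, so an emulator that delivers interrupts at instruction boundaries would need to account for that.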
Bottom line: if you want to use the x86 hardware registers AX, BX, etc. to emulate the AX, BX, etc. registers of the program being run, the only efficient way to do that is to actually execute the instructions. Not execute-and-trap, as in single-stepping with a debugger, but executing long strings of instructions while preventing them from leaving the virtual machine's space. There are different ways to do this, the performance results will vary, and none of this means it will be faster than a well-designed emulator. It also limits you to matching the processor to the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, since you do not have to match the hardware to the program being run.