Register Implementation in C Virtual Machine

I wrote a virtual machine in C as a hobby project. It runs code written in an assembly dialect very similar to Intel x86 syntax. The problem is that the registers used by this virtual machine are registers in name only. In my VM code, registers are used the same way as x86 registers, but the machine stores them in system memory, so there is no performance gain from using registers over system memory in VM code. (I thought that locality alone would give a slight performance improvement, but in practice nothing changed.)

When interpreting a program, this virtual machine stores the arguments of instructions as pointers. This allows a virtual instruction to take a memory address, a constant value, a virtual register, or just about anything as an argument.

Since hardware registers have no addresses, I can’t figure out how to actually store my VM registers in hardware registers. Using the register keyword on my virtual register type does not work, because I have to take a pointer to a virtual register in order to use it as an argument. Is there any way to make these virtual registers more like their native counterparts?

I am willing to drop down into assembly if necessary. I know that JIT-compiling this VM code could let me use the hardware registers, but I would like to use them with my interpreted code as well.
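To illustrate the pointer-argument scheme described above, here is a minimal sketch (the names and layout are simplified for illustration, not my actual VM code): every operand is resolved to a plain pointer, so a handler works the same whether its argument lives in a "register" or in VM memory.

```c
#include <stdint.h>

typedef struct {
    int64_t regs[8];   /* virtual registers -- really just memory */
    int64_t mem[256];  /* VM data memory */
} VM;

typedef struct {
    int opcode;
    int64_t *dst;      /* resolved pointer: register, memory, ...  */
    int64_t *src;
} Insn;

enum { OP_ADD, OP_MOV };

static void step(VM *vm, const Insn *in) {
    (void)vm;          /* all state is reached through the pointers */
    switch (in->opcode) {
    case OP_ADD: *in->dst += *in->src; break;
    case OP_MOV: *in->dst  = *in->src; break;
    }
}
```

This uniformity is exactly what makes the registers impossible to pin to hardware: every operand must be addressable.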

performance assembly memory vm-implementation
6 answers
  • Machine registers have no indexing support: you cannot access a register whose index is determined at runtime without generating code. Since you are presumably decoding the register index from your instructions, the only way is a huge switch (i.e. switch (opcode) { case ADD_R0_R1: r[0] += r[1]; break; ... } ). This is probably a bad idea, as it bloats the interpreter loop enough to hurt instruction caching.

  • If we are talking about x86, an additional problem is that the number of general-purpose registers is rather small; some of them will be used for bookkeeping (holding the PC, holding the VM stack state, decoding instructions, etc.) - you are unlikely to have more than one register left free for the virtual machine.

  • Even if register indexing support were available, it would hardly gain you much performance. Typically the biggest bottleneck in interpreters is instruction decoding; x86 supports fast and compact memory addressing based on register values (i.e. mov eax, dword ptr [ebx*4 + ecx] ), so you would not win much. It is worth checking the generated assembly, though, to make sure the address of the register pool is at least kept in a register.

  • The best way to speed up interpreters is JITting; even a simple JIT without smart register allocation (basically just emitting the same code you would run in the fetch-and-switch loop, minus the instruction decoding) can give you a 3x or better speedup (these are actual results from a simple JIT on top of a Lua-like switch-based VM). The interpreter is then best kept as reference code (or used for cold code to reduce JIT memory cost - JIT compilation time is not a problem for simple JITs).
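A minimal sketch of the fully specialized switch from the first bullet, with a hypothetical opcode encoding: each case names its registers statically, so there is no register-index decoding at runtime, at the price of one case per opcode/register combination.

```c
#include <stdint.h>

/* Hypothetical opcode set: registers are baked into the opcode. */
enum { ADD_R0_R1, ADD_R1_R0, LOADI_R0, HALT };

static int64_t run(const int64_t *code) {
    int64_t r[2] = {0, 0};
    int pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case ADD_R0_R1: r[0] += r[1]; break;  /* registers fixed at  */
        case ADD_R1_R0: r[1] += r[0]; break;  /* compile time        */
        case LOADI_R0:  r[0] = code[pc++]; break;
        case HALT:      return r[0];
        }
    }
}
```

With real register files the case count explodes combinatorially, which is exactly the instruction-cache problem the bullet describes.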


Even if you could directly access hardware registers, the branching code around the decision of whether to use a register or memory would be much slower than just using memory.

To get performance, you need to design for performance from the start.

A few examples:

Prepare the x86 virtual machine by setting up traps to catch the code leaving its virtual memory space. Execute the code directly: do not emulate it, do not instrument it, just run it. When the code reaches outside its memory space or does I/O to talk to a device, etc., trap that, emulate the device or whatever it was reaching for, and return control to the program. If the code is CPU-bound it will run very fast; if it is I/O-heavy it will be slow, but not as slow as emulating every instruction.

Static binary translation. Parse and translate the code before running it; for example, the instruction bytes 0x34, 0x2E would turn into ASCII in a .c file:

al ^= 0x2E; cf = 0; of = 0; sf = al >> 7; zf = (al == 0);

Ideally you then do tons of dead-code removal (if the next instruction also modifies the flags, don't compute them here, etc.) and let the compiler's optimizer do the rest. You can get a performance boost this way over an emulator; how much depends on how well you can optimize the code. Being a new program, it runs on the hardware - registers, memory and all - so CPU-bound code runs faster than in a VM. In some cases you don't have to deal with the processor raising exceptions to trap memory/I/O, because you have simulated the memory accesses in the translated code, but that still has a cost, and you still call into the simulated devices, so there are no savings there.
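To make the translated line above concrete, here is what a hypothetical static binary translator might emit for the byte sequence 0x34 0x2E (x86 `xor al, 0x2E`): the register and the affected flags become plain C variables the host compiler can optimize, and if the next instruction overwrites the flags, the translator simply omits the flag updates.

```c
#include <stdint.h>

static uint8_t al;              /* emulated AL register          */
static int cf, of, sf, zf;      /* emulated flag bits            */

static void xor_al_0x2E(void) {
    al ^= 0x2E;
    cf = 0;                     /* XOR always clears CF and OF   */
    of = 0;
    sf = (al >> 7) & 1;         /* sign = high bit of the result */
    zf = (al == 0);
}
```

A real translator would emit one such function (or inline block) per decoded instruction and then rely on the C optimizer to collapse redundant flag computations.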

Dynamic translation is similar to SBT, but done at run time. I have heard of this being done, for example, when simulating x86 code on some other processor, say a DEC Alpha: the code is gradually replaced with native Alpha instructions translated from the x86 instructions, so the next time through, the Alpha instructions execute directly instead of the x86 instructions being emulated. Each pass through the code, the program runs faster.

Or maybe just redesign your emulator to be more efficient. Look at the emulated processors in MAME, for example: readability and maintainability of the code were sacrificed for performance. When it was written that mattered; today, with multi-core gigahertz processors, you don't have to work so hard to emulate a 1.5 MHz 6502 or a 3 MHz Z80. Something as simple as looking up the next opcode in a table and deciding not to emulate some or all of the flag calculations for an instruction can give you a significant boost.

Bottom line: if you want to use the hardware x86 registers AX, BX, etc. to emulate the AX, BX, etc. of the program being run, the only efficient way is to actually execute the instructions - not trap-and-emulate as in a debugger's single step, but execute long runs of instructions while preventing them from leaving the virtual machine's space. There are different ways to do this, results will vary, and none of it guarantees being faster than an efficient emulator. It also ties you to a host processor that matches the program. Emulating the registers with efficient code and a really good compiler (good optimizer) will give you reasonable performance and portability, since you don't have to match the hardware to the program being run.


Translate your complex register-based code ahead of time (before execution). A simple solution would be a Forth-like dual-stack VM for execution, which offers the possibility of caching the top-of-stack element (TOS) in a register. If you prefer to keep the register-based design, choose an "opcode" format that bundles as many instructions as possible (rule of thumb: with a MISC-style instruction set, up to four instructions can be packed into a byte). That way, virtual-register accesses can be resolved locally into physical-register references within each static superinstruction (clang and gcc are capable of performing such optimization). As a side effect, the lower BTB misprediction rate will give significantly better performance regardless of the particular register allocation.

The best threading techniques for C-based interpreters are direct threading (requires the labels-as-values extension) and replicated switch threading (ANSI C compliant).
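For reference, direct threading with GCC/Clang's labels-as-values extension looks roughly like this (a hypothetical three-op VM, not portable ANSI C): each "instruction" in the program is the address of its handler, so dispatch is a single indirect goto with no central switch.

```c
#include <stdint.h>

static int64_t run_threaded(int64_t acc) {
    /* The "program": handler addresses instead of opcodes.        */
    static void *prog[] = { &&op_inc, &&op_double, &&op_inc, &&op_halt };
    void **ip = prog;

    goto *(*ip++);                    /* dispatch first instruction */
op_inc:    acc += 1; goto *(*ip++);   /* each handler re-dispatches */
op_double: acc *= 2; goto *(*ip++);
op_halt:   return acc;
}
```

The per-instruction indirect jump gives the branch predictor one predictable site per handler, which is the BTB effect mentioned above.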


So you are writing an x86 interpreter, which is bound to be between one and three orders of magnitude slower than the actual hardware. On the real hardware, mov mem, foo takes much longer than mov reg, foo , while in your program mem[adr] = foo takes about as long as myRegVars[regnum] = foo (modulo caching). So are you expecting the same speed difference?

If you want to simulate the speed difference between registers and memory, you will have to do something like what Cachegrind does: keep a simulated clock, and whenever the program makes a memory reference, add a large number to it.
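A sketch of that accounting idea, with made-up latencies (the real costs would come from a cache model like Cachegrind's): keep a cycle counter and charge memory operands far more than register operands.

```c
typedef struct { long long cycles; } SimClock;

enum { COST_REG = 1, COST_MEM = 100 };   /* hypothetical latencies */

/* Charge the simulated clock for one operand access and return the
 * running total. */
static long long charge(SimClock *c, int is_memory_operand) {
    c->cycles += is_memory_operand ? COST_MEM : COST_REG;
    return c->cycles;
}
```

The interpreter would call this from its operand-resolution path, so "register" accesses stay cheap on the simulated clock even though both are host memory.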


Your virtual machine seems too complex for efficient interpretation. An obvious optimization is to have a "microcode" VM with load/store register instructions, possibly even stack-based. You can translate your high-level VM code into this simpler form before execution. Another useful optimization depends on gcc's computed-goto extension; see the Objective Caml VM interpreter for an example of such a realistically fast VM implementation.
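The microcode idea can be sketched like this (hypothetical encoding): a complex instruction such as ADD mem, reg is lowered ahead of time into simple load/op/store micro-ops that are trivial to dispatch.

```c
#include <stdint.h>

enum { U_LOAD, U_ADD, U_STORE, U_END };   /* micro-op set */

typedef struct { int op; int a; } MicroOp; /* op + one index operand */

static void run_micro(const MicroOp *u, int64_t *regs, int64_t *mem) {
    int64_t t = 0;                         /* micro-architectural temp */
    for (; u->op != U_END; u++) {
        switch (u->op) {
        case U_LOAD:  t = mem[u->a];   break;
        case U_ADD:   t += regs[u->a]; break;
        case U_STORE: mem[u->a] = t;   break;
        }
    }
}
```

For example, `ADD [3], r1` lowers to LOAD 3, ADD 1, STORE 3 - three micro-ops whose operands never need the "anything goes" pointer resolution of the original VM.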


To answer the question you asked:

You can tell your C compiler to leave a handful of registers free for your own use. Pointers into the first page of memory are normally invalid - that range is reserved so NULL-pointer dereferences fault - so you can abuse small pointer values to denote registers. This only helps if you have few virtual registers; my example uses 64-bit mode to simulate 4 registers. It is possible that the extra overhead of the switch slows execution down rather than speeding it up. Also see the other answers for general tips.

    /* compile with gcc */
    register long r0 asm("r12");
    register long r1 asm("r13");
    register long r2 asm("r14");
    register long r3 asm("r15");

    inline long get_argument(long *arg)
    {
        unsigned long val = (unsigned long)arg;
        switch (val) {
        /* leave 0 for the NULL pointer */
        case 1: return r0;
        case 2: return r1;
        case 3: return r2;
        case 4: return r3;
        default: return *arg;
        }
    }
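A portable sketch of the same tagged-pointer trick, with ordinary variables standing in for the pinned hardware registers so it compiles anywhere, plus the matching store helper (the `set_argument` name is hypothetical): values 1..4, which can never be valid pointers, select a "register"; anything else is treated as a real pointer.

```c
static long r0, r1, r2, r3;   /* stand-ins for the pinned registers */

static long *reg_slot(unsigned long tag) {
    switch (tag) {            /* 0 stays reserved for NULL */
    case 1: return &r0;
    case 2: return &r1;
    case 3: return &r2;
    case 4: return &r3;
    default: return 0;        /* not a register tag */
    }
}

static long get_argument(long *arg) {
    long *slot = reg_slot((unsigned long)arg);
    return slot ? *slot : *arg;
}

static void set_argument(long *arg, long val) {
    long *slot = reg_slot((unsigned long)arg);
    if (slot) *slot = val; else *arg = val;
}
```

With the real `register ... asm("r12")` globals above, `reg_slot` cannot exist (registers have no address), so each case of the switch must read or write the register variable directly, as in the original answer.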
