Finding the number of operands of an instruction from its opcode

I plan to write my own small disassembler. I want to decode the opcodes that I get when reading an executable file. I see the following opcode bytes:

69 62 2f 6c 64 2d 6c 

which must match:

 imul $0x6c2d646c,0x2f(%edx),%esp 

The imul instruction can have either two or three operands. How do I figure out which one it is from the opcodes I have?

It is based on the Intel i386 instruction set.

Thanks and respect,
Hrishikesh Murali

+4
5 answers

The manuals describe how to distinguish between the one-, two-, and three-operand forms.

IMUL instruction

F6 / F7: one operand; 0F AF: two operands; 6B / 69: three operands.
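As a rough illustration of the summary above, here is a minimal Python sketch that maps the leading opcode bytes to the operand count. The function name `imul_operand_count` is made up for this example, and it assumes any prefix bytes have already been stripped. One detail that is not in the summary but is in the Intel manual: F6/F7 are opcode groups shared with other instructions, and IMUL is the member whose ModR/M reg field is 5.

```python
def imul_operand_count(code):
    """Return the number of explicit operands of an IMUL encoding.

    `code` is a bytes object starting at the opcode (prefixes stripped).
    Raises ValueError if the bytes are not one of the IMUL forms.
    """
    op = code[0]
    if op in (0xF6, 0xF7):
        # F6/F7 are opcode groups: bits 5..3 of the next byte (the reg
        # field of ModR/M) select the operation, and /5 is IMUL.
        if len(code) > 1 and (code[1] >> 3) & 0b111 == 5:
            return 1
        raise ValueError("F6/F7 group, but not IMUL (reg field != 5)")
    if op == 0x0F and len(code) > 1 and code[1] == 0xAF:
        return 2
    if op in (0x69, 0x6B):
        return 3
    raise ValueError("not an IMUL opcode")


print(imul_operand_count(bytes.fromhex("69 62 2f 6c 64 2d 6c")))
```

For the bytes in the question the first byte is 69, so this reports the three-operand form.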

+2

Some tips. First, get all the instruction set references you can. For x86, try some old 8088/86 manuals as well as more recent material from Intel, plus the many opcode tables on the net. Different interpretations and documents may, firstly, contain subtle errors or differences, and secondly, some people present the information in a different and more understandable way.

Second, if this is your first disassembler, I recommend avoiding x86; it is very difficult. As your question implies you understand, variable-length instruction sets are hard: to make a remotely successful disassembler you need to follow the code in execution order, not memory order. So your disassembler has to use some kind of scheme to not only decode and print instructions, but also decode jump instructions and tag their destination addresses as entry points into the instruction stream. ARM, for example, has a fixed instruction length; you can write an ARM disassembler that starts at the beginning of memory and disassembles every word straight through (assuming, of course, that it is not a mixture of ARM and Thumb code). Thumb (not Thumb-2) can be disassembled this way too, since there is only one flavor of 32-bit instruction and everything else is 16 bits, and that one flavor can be handled with a simple state machine, because the two 16-bit halves show up as a pair.

You will not be able to disassemble everything (with a variable-length instruction set). Because of the nuances of hand-written code, or deliberate tactics to prevent disassembly, the front end that walks the code in execution order may run into what I would call a collision; take your instruction above. Say one path leads you to 0x69 as the entry point of an instruction, and from that you determine that it is a 7-byte instruction, but somewhere else there is a branch instruction whose destination computes to the 0x2f byte, treating it as the opcode of an instruction. Although very clever programming can pull something like that off, it is more likely that the disassembler has been led into disassembling data. For example:

 clear condition flag
 branch if condition flag clear
 data

The disassembler does not know that the data is data, and without additional smarts it will not realize that the conditional branch is actually an unconditional one (there can be many instructions on different branch paths between the clear and the branch-if-clear), so it assumes that the byte after the conditional branch is an instruction.
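To make the "follow execution order and tag branch destinations as entry points" idea concrete, here is a small Python sketch. It deliberately uses a made-up toy instruction set, not x86: the opcodes in `LENGTHS`, their meanings, and the function names `trace` and `collisions` are all hypothetical and exist only for this illustration.

```python
# Hypothetical toy ISA (NOT x86), chosen only to show the traversal:
#   01       NOP            (1 byte)
#   02 imm8  LOAD           (2 bytes)
#   10 addr  JMP addr       (2 bytes, unconditional)
#   11 addr  BRANCH-IF addr (2 bytes, conditional)
#   FF       HALT           (1 byte)
LENGTHS = {0x01: 1, 0x02: 2, 0x10: 2, 0x11: 2, 0xFF: 1}


def trace(code, entry=0):
    """Walk `code` in execution order; return {address: instruction length}."""
    starts = {}
    work = [entry]                      # entry points still to be explored
    while work:
        pc = work.pop()
        while 0 <= pc < len(code) and pc not in starts:
            op = code[pc]
            size = LENGTHS.get(op)
            if size is None or pc + size > len(code):
                break                   # unknown or truncated: give up on this path
            starts[pc] = size
            if op in (0x10, 0x11):
                work.append(code[pc + 1])   # branch target is a new entry point
            if op in (0x10, 0xFF):
                break                   # nothing falls through a JMP or HALT
            pc += size
    return starts


def collisions(starts):
    """Entry points that land inside another already-decoded instruction."""
    inside = set()
    for addr, size in starts.items():
        inside.update(range(addr + 1, addr + size))
    return sorted(addr for addr in starts if addr in inside)


# LOAD 5; JMP 5; one data byte (0x02) at address 4; NOP; HALT
program = bytes([0x02, 0x05, 0x10, 0x05, 0x02, 0x01, 0xFF])
print(sorted(trace(program)))
```

In this toy program a straight memory-order sweep would take the data byte at address 4 for a two-byte LOAD and swallow the NOP after it, while the traversal never visits address 4. It still cannot see through computed branches or the always-taken conditional branch trick described above.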

Finally, I applaud your effort. I often preach writing simple disassemblers (ones that assume the code is very short, intentionally generated code) as a way to learn an instruction set really well. This works as long as you do not put the disassembler in a situation where it has to follow execution order, so that it can go in memory order instead (basically, do not embed data between instructions; put it at the end or somewhere else, leaving only runs of instructions to be disassembled). Understanding the opcode decoding of an instruction set can make you much better at programming for that platform, in both low-level and high-level languages.

The short answer is that Intel used to publish, and possibly still does, technical reference manuals for its processors. I still have my 8088/86 manuals: a hardware one for the electrical side, and a software one for the instruction set and how it works. I have a 486 one and probably a 386 one too. The snapshot in Igor's answer looks like it comes straight from an Intel manual. Because the instruction set has changed so much over time, x86 is a difficult beast at best. At the same time, if the processor itself can wade through these bytes and execute them, you can write a program that does the same thing but decodes them instead. The difference is that you are most likely not going to build a simulator, so any branches that are computed by the code rather than explicit in it are invisible to you, and the destination of such a branch may never show up in your list of bytes to disassemble.

+2

Although the x86 instruction set is quite complex (it is CISC, after all), and I have seen many people here discourage you from trying to understand it, I will say the opposite: it can still be understood, and along the way you can learn why it is so complicated and how Intel managed to extend it several times, from the 8086 to modern processors.

Instructions

x86 instructions use a variable-length encoding, so they can consist of several bytes. Each byte is there to encode different things, and some of them are optional (whether these optional fields are used or not is encoded in the opcode).

For example, each opcode can be preceded by zero to four prefix bytes, which are optional. Usually you do not need to worry about them. They are used to change the size of operands, or as escape codes to the "second floor" of the opcode table with the extended instructions of modern processors (MMX, SSE, etc.).

Then there is the actual opcode, which is usually one byte, but can be up to three bytes for extended instructions. If you use only the basic instruction set, you do not need to worry about those either.

Next, there is the so-called ModR/M byte (sometimes also called mod-reg-reg/mem), which encodes the addressing mode and the operand types. It is used only by opcodes that have such operands. It has three bit fields:

  • The first two bits (the leftmost, most significant ones) encode the addressing mode (4 possible bit combinations).
  • The next three bits encode the first register (8 possible bit combinations).
  • The last three bits can encode another register or extend the addressing mode, depending on the setting of the first two bits.
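The three fields can be pulled apart with shifts and masks. A minimal Python sketch (the name `split_modrm` is just for illustration):

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11    # top two bits: addressing mode
    reg = (byte >> 3) & 0b111   # middle three bits: first register
    rm = byte & 0b111           # low three bits: second register or mode extension
    return mod, reg, rm


print(split_modrm(0x62))  # 0x62 is 01 100 010 in binary
```

For the second byte of the question's instruction, 0x62, the fields come out as mod = 01, reg = 100, and R/M = 010.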

After the ModR/M byte there may be another optional byte (depending on the addressing mode) called SIB (Scale Index Base). It is used for the more exotic addressing modes, to encode the scale factor (1x, 2x, 4x, 8x), the base address/register, and the index register used. Its layout is similar to that of the ModR/M byte, but the first two bits on the left (the most significant) encode the scale, and the next three and the last three bits encode the index and base registers, as the name implies.
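The SIB byte splits the same way. In this sketch (the function name `split_sib` is an assumption of the example) the two scale bits are turned into the actual multiplier, which per the Intel manual is 1, 2, 4, or 8:

```python
def split_sib(byte):
    """Split a SIB byte into (scale factor, index register code, base register code)."""
    scale = 1 << ((byte >> 6) & 0b11)   # 00 -> 1x, 01 -> 2x, 10 -> 4x, 11 -> 8x
    index = (byte >> 3) & 0b111
    base = byte & 0b111
    return scale, index, base


print(split_sib(0x88))  # 0x88 is 10 001 000 in binary
```

With the register codes described further down, 0x88 would mean base register code 0 plus four times index register code 1.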

If a displacement is used, it follows immediately afterwards. It can be 0, 1, 2, or 4 bytes long, depending on the addressing mode and the execution mode (16-bit / 32-bit / 64-bit).

Last comes the immediate data, if any. It can also be 0, 1, 2, or 4 bytes long.

So now that you know the overall format of x86 instructions, you just need to know the encodings of all those bytes. And there are some patterns, contrary to common belief.

For example, all register encodings follow a neat ACDB pattern. That is, for 8-bit instructions, the lowest two bits of the register code encode the A, C, D, and B registers, respectively:

00 = A register (accumulator)
01 = C register (counter)
10 = D register (data)
11 = B register (base)

I suspect that their 8-bit processors used only these four 8-bit registers encoded this way:

        second
       +---+---+
 f     | 0 | 1 |        00 = A
 i +---+---+---+        01 = C
 r | 0 | A : C |        10 = D
 s +---+ - + - +        11 = B
 t | 1 | D : B |
   +---+---+---+

Then, on 16-bit processors, they doubled this register bank and added one more bit to the register encoding to select the bank, like this:

        second                 second             0 00 = AL
       +----+----+            +----+----+         0 01 = CL
 f     | 0  | 1  |      f     | 0  | 1  |         0 10 = DL
 i +---+----+----+      i +---+----+----+         0 11 = BL
 r | 0 | AL : CL |      r | 0 | AH : CH |
 s +---+ - -+ - -+      s +---+ - -+ - -+         1 00 = AH
 t | 1 | DL : BL |      t | 1 | DH : BH |         1 01 = CH
   +---+----+----+        +---+----+----+         1 10 = DH
     0 = BANK L             1 = BANK H            1 11 = BH

But now you can also use both halves of these registers together, as full 16-bit registers. This is selected by the last bit of the opcode (the least significant, rightmost bit): if it is 0, this is an 8-bit instruction. But if this bit is set (that is, the opcode is an odd number), this is a 16-bit instruction. In this mode, the two bits encode one of the ACDB registers, as before. The pattern remains the same, but now they encode full 16-bit registers. And when the third (highest) bit is also set, they switch to a whole other bank of registers, called the index/pointer registers: SP (stack pointer), BP (base pointer), SI (source index), and DI (destination/data index). So the addressing is now as follows:

        second                 second             0 00 = AX
       +----+----+            +----+----+         0 01 = CX
 f     | 0  | 1  |      f     | 0  | 1  |         0 10 = DX
 i +---+----+----+      i +---+----+----+         0 11 = BX
 r | 0 | AX : CX |      r | 0 | SP : BP |
 s +---+ - -+ - -+      s +---+ - -+ - -+         1 00 = SP
 t | 1 | DX : BX |      t | 1 | SI : DI |         1 01 = BP
   +---+----+----+        +---+----+----+         1 10 = SI
     0 = BANK OF            1 = BANK OF           1 11 = DI
   GENERAL-PURPOSE         POINTER/INDEX
      REGISTERS              REGISTERS

With the introduction of 32-bit CPUs, they doubled those banks again. But the pattern remains the same. Only now odd opcodes mean 32-bit registers and even opcodes mean, as before, 8-bit registers. I would call the odd opcodes the "long" versions, because the 16-bit or 32-bit version is used depending on the processor and its current mode of operation. When it runs in 16-bit mode, odd ("long") opcodes mean 16-bit registers, and when it runs in 32-bit mode, odd ("long") opcodes mean 32-bit registers. This can be flipped by prefixing the whole instruction with the 66 prefix (operand-size override). Even ("short") opcodes are always 8-bit. Thus, on a 32-bit processor, the register codes are:

 0 00 = EAX      1 00 = ESP
 0 01 = ECX      1 01 = EBP
 0 10 = EDX      1 10 = ESI
 0 11 = EBX      1 11 = EDI
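The register tables and the short/long rule can be folded into one lookup. This is only a sketch under the assumptions of this answer: the function name `register_name` and its parameters are invented for the example, and the even/odd rule holds for many opcodes but not for every one.

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["e" + name for name in REG16]


def register_name(code, opcode_is_odd, mode_bits=32, has_66_prefix=False):
    """Name the register for a 3-bit register code.

    opcode_is_odd  -- True for the "long" (16/32-bit) form of an opcode
    mode_bits      -- 16 or 32, the current operating mode (an assumption
                      of this sketch; 64-bit mode is not handled)
    has_66_prefix  -- True if the operand-size override prefix is present
    """
    if not opcode_is_odd:
        return REG8[code]               # "short" opcodes are always 8-bit
    bits = mode_bits
    if has_66_prefix:
        bits = 16 if mode_bits == 32 else 32   # the prefix flips the size
    return (REG32 if bits == 32 else REG16)[code]


print(register_name(0b100, opcode_is_odd=True))
```

The same code 100 names AH for a short opcode, SP for a long opcode in 16-bit mode, and ESP for a long opcode in 32-bit mode.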

As you can see, the ACDB pattern remains the same. The SP, BP, SI, DI pattern remains the same as well. They just use longer registers.

Opcodes also follow some patterns. I have already described one of them (even vs. odd = 8-bit "short" vs. 16/32-bit "long" versions). You can see more of them in this opcode map that I once made for quick reference and hand assembling/disassembling: [image: opcode map] (This is not a complete table yet; some opcodes are missing. Maybe I will update it someday.)

As you can see, the arithmetic and logical instructions are mostly in the upper half of the table, and its left and right halves follow a similar layout. The data-movement instructions are in the lower half. All conditional branch instructions are in row 7*. There is also one full row, B*, reserved for the mov instruction that is a shorthand for loading immediate values (constants) into registers. These are all one-byte opcodes immediately followed by the immediate constant, because they encode the destination register in the opcode itself (it is selected by the column number in this table), in the three least significant (rightmost) bits. They follow the same pattern for encoding registers. And the fourth bit is the short/long selection. You can see that your imul instruction is in the table, exactly at position 69 (huh... ;-J).
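The B* row makes a nice first decoder, because everything is in the opcode byte itself. A minimal sketch, assuming 32-bit mode by default and no prefixes (the function name `decode_mov_imm` and its return shape are made up for this example):

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["e" + name for name in REG16]


def decode_mov_imm(code, mode_bits=32):
    """Decode a B0..BF 'mov reg, imm'; return (register, immediate, length)."""
    op = code[0]
    if op & 0xF0 != 0xB0:
        raise ValueError("not in the B* row")
    reg = op & 0b111                    # low three bits: destination register
    is_long = bool(op & 0b1000)         # fourth bit: short/long selection
    if is_long:
        size = mode_bits // 8
        name = (REG32 if mode_bits == 32 else REG16)[reg]
    else:
        size = 1
        name = REG8[reg]
    if len(code) < 1 + size:
        raise ValueError("truncated instruction")
    imm = int.from_bytes(code[1:1 + size], "little")
    return name, imm, 1 + size


print(decode_mov_imm(bytes.fromhex("b8 2a 00 00 00")))
```

Here B8 has register bits 000 and the long bit set, so in 32-bit mode it loads a four-byte constant into EAX.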

For many instructions, the bit just before the short/long bit encodes the order of the operands: which of the two registers encoded in the ModR/M byte is the source and which is the destination (this applies to instructions with two register operands).

As for the addressing mode field of the ModR/M byte, here's how to interpret it:

  • 11 is the simplest: it encodes register-to-register transfers. One register is encoded in the next three bits (the reg field), and the other register in the remaining three bits (the R/M field) of this byte.
  • 01 means that this byte is followed by a one-byte displacement.
  • 10 means the same, but the displacement is four bytes long (on 32-bit CPUs).
  • 00 is the trickiest: it means register-indirect addressing or a plain displacement, depending on the contents of the R/M field.

If a SIB byte is present, this is signaled by the bit pattern 100 in the R/M bits. There is also the code 101 for a 32-bit displacement-only mode, which does not use the SIB byte at all.

Here is a brief description of all these addressing modes:

 Mod R/M
 11  rrr = register-register (one encoded in `R/M` bits, the other one in `reg` bits).
 00  rrr = [ register ] (except SP and BP, which are encoded in `SIB` byte)
 00  100 = SIB byte present
 00  101 = 32-bit displacement only (no `SIB` byte required)
 01  rrr = [ rrr + disp8 ] (8-bit displacement after the `ModR/M` byte)
 01  100 = SIB + disp8
 10  rrr = [ rrr + disp32 ] (except SP, which means that the `SIB` byte is used)
 10  100 = SIB + disp32
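For a disassembler, the first practical use of this table is working out how many bytes the addressing part occupies. A sketch for 32-bit addressing only (the function name `modrm_length` is invented here). It also handles one case that the table above does not list but the Intel manual does: with Mod = 00, a SIB byte whose base field is 101 is followed by a 32-bit displacement.

```python
def modrm_length(code):
    """Bytes taken by ModR/M + optional SIB + displacement (32-bit addressing).

    `code` is a bytes object starting at the ModR/M byte.
    """
    modrm = code[0]
    mod = modrm >> 6
    rm = modrm & 0b111
    length = 1
    if mod == 0b11:
        return length                   # register-register: nothing follows
    if rm == 0b100:
        length += 1                     # SIB byte present
        # Not in the table above: Mod 00 with SIB base 101 means disp32, no base.
        if mod == 0b00 and (code[1] & 0b111) == 0b101:
            length += 4
    if mod == 0b00 and rm == 0b101:
        length += 4                     # 32-bit displacement only
    elif mod == 0b01:
        length += 1                     # disp8
    elif mod == 0b10:
        length += 4                     # disp32
    return length


print(modrm_length(bytes([0x62])))  # Mod 01, R/M 010: ModR/M plus disp8
```

Adding the opcode length in front and the immediate length behind this gives the total instruction length, which is what lets you find where the next instruction starts.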

So, let's now decode your imul:

69 is the opcode. It encodes the version of imul that takes a full-size immediate instead of a sign-extended 8-bit one. The 6B version takes an 8-bit immediate and sign-extends it. (They differ in bit 1 of the opcode, if anyone asks.)

62 is the ModR/M byte. In binary it is 0110 0010, or 01 100 010. The first two bits (the Mod field) indicate indirect addressing with an 8-bit displacement. The next three bits (the reg field) are 100 and encode the SP register (here ESP, since we are in 32-bit mode) as the destination register. The last three bits are the R/M field, and there we have 010, which encodes the D register (here EDX) as the other (source) register.

Now we expect an 8-bit displacement. And there it is: 2f is the displacement, a positive one (+47 in decimal).

The last part is the four bytes of the immediate constant required by the imul instruction. In your case that is 6c 64 2d 6c, which read as little-endian is $6c2d646c.
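The whole walk-through fits in a short function. This is a sketch for the three-operand forms 69 and 6B only, in 32-bit mode, printing AT&T syntax like the listing in the question; the names `decode_imul3` and `_hex` are made up, and the SIB and displacement-only forms are deliberately left out.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def _hex(value):
    """Format a possibly negative integer as signed hex, e.g. -0x2."""
    return ("-" if value < 0 else "") + hex(abs(value))


def decode_imul3(code):
    """Decode 'imul $imm, r/m32, r32' (opcodes 69 and 6B); return (text, length)."""
    op = code[0]
    if op not in (0x69, 0x6B):
        raise ValueError("not a three-operand imul")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    pos = 2
    if mod == 0b11:
        src = "%" + REG32[rm]
    else:
        if rm == 0b100 or (mod == 0b00 and rm == 0b101):
            raise NotImplementedError("SIB and disp32-only forms are left out")
        disp_size = (0, 1, 4)[mod]
        src = "(%" + REG32[rm] + ")"
        if disp_size:
            disp = int.from_bytes(code[pos:pos + disp_size], "little", signed=True)
            src = _hex(disp) + src
            pos += disp_size
    imm_size = 4 if op == 0x69 else 1           # 6B carries a sign-extended imm8
    imm = int.from_bytes(code[pos:pos + imm_size], "little", signed=(op == 0x6B))
    pos += imm_size
    return "imul $%s,%s,%%%s" % (_hex(imm), src, REG32[reg]), pos


print(decode_imul3(bytes.fromhex("69 62 2f 6c 64 2d 6c")))
```

Run on the seven bytes from the question, it consumes all of them and reproduces the listing shown there.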

And that's the way the cookie crumbles ;-J

+2

This is not a machine-code instruction (which would consist of an opcode and zero or more operands).

It is part of a text string; it translates to:

 $ echo -e "\x69\x62\x2f\x6c\x64\x2d\x6c"
 ib/ld-l

which is obviously part of the string "/lib/ld-linux.so.2".
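A disassembler can apply the same sanity check before trusting a decode. A tiny Python sketch (the helper name `looks_like_text` and the "printable ASCII" rule are assumptions of this example, a heuristic and nothing more):

```python
def looks_like_text(data):
    """Heuristic: True if every byte is printable ASCII (0x20..0x7E)."""
    return all(0x20 <= b < 0x7F for b in data)


raw = bytes.fromhex("69 62 2f 6c 64 2d 6c")
if looks_like_text(raw):
    print(raw.decode("ascii"))
```

A run of bytes that passes this test may still be valid code, so treat it as a hint to look at the surrounding section, not as proof.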

+1

If you don't feel like going through the opcode tables and manuals, it always helps to learn from other projects, such as the open-source disassembler bea-engine. You may even find that you no longer need to write your own, depending on what you are doing it for.

0
