How to write a disassembler?

I am interested in writing an x86 disassembler as an educational project.

The only real resource I have found is Spiral Space's "How to Write a Disassembler". While it gives a good high-level description of a disassembler's components, I am interested in more detailed resources. I also took a quick look at the NASM source code, but it is somewhat hard to learn from.

I understand that one of the main challenges of this project is the rather large x86 instruction set I will have to handle. I am also interested in the basic structure, basic disassembler references, etc.

Can anyone point me to detailed resources on writing an x86 disassembler?

+58
x86 disassembly
May 29 '09 at 3:50
5 answers

See section 17.2 of the 80386 Programmer's Reference Manual. A disassembler is really just a glorified finite-state machine. The steps in disassembly are:

  • Check if the current byte is an instruction prefix byte (F3, F2 or F0); if so, you have a REP/REPE/REPNE/LOCK prefix. Advance to the next byte.
  • Check if the current byte is an address-size prefix byte (67). If so, decode addresses in the rest of the instruction in 16-bit mode if currently in 32-bit mode, or in 32-bit mode if currently in 16-bit mode.
  • Check if the current byte is an operand-size prefix byte (66). If so, decode immediate operands in 16-bit mode if currently in 32-bit mode, or in 32-bit mode if currently in 16-bit mode.
  • Check if the current byte is a segment override byte (2E, 36, 3E, 26, 64 or 65). If so, use the corresponding segment register for decoding addresses instead of the default segment register.
  • The next byte is the opcode. If the opcode is 0F, it is an extended opcode: read the next byte as the extended opcode.
  • Read and decode the ModR/M byte, the scale-index-base (SIB) byte, the displacement (0, 1, 2 or 4 bytes) and/or the immediate value (0, 1, 2 or 4 bytes), as required by the particular opcode. The sizes of these fields depend on the opcode and on the address-size and operand-size overrides decoded earlier.
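
The prefix-handling steps above can be sketched in a few lines of Python. This is only an illustration, not code from the reference manual: the function name scan_prefixes, the state dictionary and its keys are made up for this example, and only the prefix bytes named in the list are recognised.

```python
# Prefix bytes taken from the steps above. The dict layout and all names
# here are assumptions of this sketch, not part of any real library.
REP_LOCK = {0xF3: "rep/repe", 0xF2: "repne", 0xF0: "lock"}
SEGMENT = {0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es", 0x64: "fs", 0x65: "gs"}


def scan_prefixes(code, offset=0, default_bits=32):
    """Consume prefix bytes starting at offset.

    Returns (state, offset of the first non-prefix byte, i.e. the opcode).
    """
    other_bits = 16 if default_bits == 32 else 32
    state = {
        "rep_lock": None,
        "segment": None,
        "addr_bits": default_bits,
        "operand_bits": default_bits,
    }
    while offset < len(code):
        byte = code[offset]
        if byte in REP_LOCK:
            state["rep_lock"] = REP_LOCK[byte]
        elif byte == 0x67:            # address-size override
            state["addr_bits"] = other_bits
        elif byte == 0x66:            # operand-size override
            state["operand_bits"] = other_bits
        elif byte in SEGMENT:         # segment override
            state["segment"] = SEGMENT[byte]
        else:
            break                     # not a prefix: this is the opcode
        offset += 1
    return state, offset
```

The loop accepts the prefixes in any order rather than in the fixed sequence of the list, since prefix bytes may appear in any order in real code. For example, scan_prefixes(bytes([0xF3, 0x66, 0x2E, 0x89, 0xD8])) stops at offset 3 with a 16-bit operand size and a CS override recorded.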

The opcode tells you which operation is being performed. The opcode's arguments can be decoded from the ModR/M byte, the SIB byte, the displacement and the immediate value. There are many possibilities and many special cases because of the complex nature of x86. See the reference above for a more thorough explanation.
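
As a rough illustration of the ModR/M step, the byte splits into three bit fields (mod, reg, r/m) that determine whether a SIB byte and a displacement follow. The sketch below covers 32-bit addressing only; the function names are invented for this example, and 16-bit addressing uses different rules that are not shown.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    return byte >> 6, (byte >> 3) & 7, byte & 7


def modrm_extra_bytes_32(code, offset):
    """For the ModR/M byte at code[offset], return (has_sib, disp_size).

    Assumes 32-bit addressing. disp_size is the displacement length in bytes.
    """
    mod, _reg, rm = split_modrm(code[offset])
    if mod == 3:
        return False, 0               # register operand: no SIB, no displacement
    has_sib = rm == 4                 # rm = 100 selects a SIB byte
    if mod == 1:
        disp_size = 1                 # 8-bit displacement
    elif mod == 2:
        disp_size = 4                 # 32-bit displacement
    elif rm == 5:
        disp_size = 4                 # mod = 00, rm = 101: disp32 with no base
    elif has_sib and (code[offset + 1] & 7) == 5:
        disp_size = 4                 # mod = 00, SIB base = 101: disp32 with no base
    else:
        disp_size = 0
    return has_sib, disp_size
```

For instance, in 8B 44 24 08 (mov eax, [esp+8]) the ModR/M byte 44 has mod = 01 and r/m = 100, so a SIB byte and a one-byte displacement follow.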

+58
May 29 '09 at 4:57

I would recommend checking out some open source disassemblers, preferably distorm and especially "disOps (Instructions Sets DataBase)" (Ctrl+F for it on the page).

The documentation itself is full of juicy information about opcodes and instructions.

Quote from https://code.google.com/p/distorm/wiki/x86_x64_Machine_Code

80x86 Instruction:

An 80x86 instruction is divided into a number of elements:

  • Instruction prefixes affect the behaviour of the instruction's operation.
  • A mandatory prefix is used as an opcode byte for SSE instructions.
  • Opcode bytes may be one or more bytes (up to 3 whole bytes).
  • The ModR/M byte is optional and can sometimes contain part of the opcode itself.
  • The SIB byte is optional and represents complex memory indirection forms.
  • The displacement is optional; it is a value of varying size (byte, word, long) used as an offset.
  • The immediate is optional; it is a general numeric value of varying size (byte, word, long).

The format is as follows:

 /-------------------------------------------------------------------------------------------------------------------------------------------\
 |*Prefixes | *Mandatory Prefix | *REX Prefix | Opcode Bytes | *ModR/M | *SIB | *Displacement (1,2 or 4 bytes) | *Immediate (1,2 or 4 bytes) |
 \-------------------------------------------------------------------------------------------------------------------------------------------/
 * means the element is optional.
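
One way to picture this layout in code is a record with one field per element, in encoding order. The sketch below is not diStorm's actual data structure; the class name RawInstruction and its fields are invented here purely to mirror the diagram, with an empty bytes value standing for an absent optional element.

```python
from dataclasses import dataclass


@dataclass
class RawInstruction:
    """Hypothetical container mirroring the layout diagram, field by field."""
    prefixes: bytes = b""
    mandatory_prefix: bytes = b""
    rex: bytes = b""
    opcode: bytes = b""          # the only element that is always present
    modrm: bytes = b""
    sib: bytes = b""
    displacement: bytes = b""    # 1, 2 or 4 bytes when present
    immediate: bytes = b""       # 1, 2 or 4 bytes when present

    def encode(self):
        """Concatenate the elements in the order the diagram shows them."""
        return (self.prefixes + self.mandatory_prefix + self.rex + self.opcode
                + self.modrm + self.sib + self.displacement + self.immediate)


# 48 89 D8 is "mov rax, rbx": a REX prefix, one opcode byte and a ModR/M byte.
mov_rax_rbx = RawInstruction(rex=b"\x48", opcode=b"\x89", modrm=b"\xd8")
```

A decoder would fill such a record while walking the bytes from left to right; the total instruction length is then simply len(instruction.encode()).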

Data structures and decoding phases are described at https://code.google.com/p/distorm/wiki/diStorm_Internals

Quote:

Decoding phases

  • [Prefixes]
  • [Fetch Opcode]
  • [Filter opcode]
  • [Extract operand(s)]
  • [Text formatting]
  • [Hex Dump]
  • [Decoded instruction]

Each step is also explained.




Source links are kept for historical reasons:

http://code.google.com/p/distorm/wiki/x86_x64_Machine_Code and http://code.google.com/p/distorm/wiki/diStorm_Internals

+21
May 29 '09 at 4:41

Start with some small program that has been assembled, and that gives you both the generated code and the instructions. Get yourself a reference for the instruction set architecture and work through some of the generated code by hand, with the reference next to you. You will find that the instructions have a very stereotypical structure of inst op op op, with a varying number of operands. All you need to do is translate the hexadecimal or octal representation of the code to match the instructions; a little playing around will reveal it.

That process, automated, is the core of a disassembler. Ideally, you will probably want to build an array of instruction structures internally (or externally, if the program is really large). You can then translate that array into instructions in assembler format.
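
A minimal sketch of that two-stage idea follows: first decode bytes into a list of instruction structures, then render the list as text. It assumes 32-bit mode and handles only a handful of one-byte opcodes (nop, ret, int3, push r32, pop r32); the names Instruction, decode_all and format_listing are invented for this example, and unknown bytes are emitted as db data.

```python
from dataclasses import dataclass

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
NO_OPERAND = {0x90: "nop", 0xC3: "ret", 0xCC: "int3"}


@dataclass
class Instruction:
    offset: int
    raw: bytes
    mnemonic: str
    operands: tuple = ()


def decode_all(code):
    """Stage 1: turn raw bytes into a list of Instruction structures."""
    out = []
    offset = 0
    while offset < len(code):
        byte = code[offset]
        length = 1  # every opcode handled in this sketch is one byte long
        if byte in NO_OPERAND:
            mnemonic, operands = NO_OPERAND[byte], ()
        elif 0x50 <= byte <= 0x57:      # push r32, register in the low 3 bits
            mnemonic, operands = "push", (REG32[byte - 0x50],)
        elif 0x58 <= byte <= 0x5F:      # pop r32, register in the low 3 bits
            mnemonic, operands = "pop", (REG32[byte - 0x58],)
        else:                           # unknown: emit the byte as data
            mnemonic, operands = "db", (f"0x{byte:02x}",)
        out.append(Instruction(offset, code[offset:offset + length], mnemonic, operands))
        offset += length
    return out


def format_listing(instructions):
    """Stage 2: render the structures as assembler-style text lines."""
    lines = []
    for ins in instructions:
        text = ins.mnemonic + (" " + ", ".join(ins.operands) if ins.operands else "")
        lines.append(f"{ins.offset:08x}  {ins.raw.hex():<4}  {text}")
    return lines
```

Keeping the two stages separate means the output syntax can be changed without touching the decoder. Running format_listing(decode_all(bytes([0x55, 0x90, 0x5D, 0xC3]))) yields one line each for push ebp, nop, pop ebp and ret.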

+6
May 29 '09 at 4:00

You need a table of opcodes to load from.

The fundamental lookup data structure is a trie; however, a plain table will do well enough if you do not care much about speed.

To get the base opcode type, begin by matching against the table.
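
As a small sketch of the trie idea, nested dictionaries can act as the trie: each opcode byte selects either a mnemonic (a leaf) or another dictionary (more bytes to read). The table below holds only a few well-known opcodes, and the names OPCODES and lookup are made up for this example.

```python
# Nested dicts used as a byte trie. A str value is a leaf (mnemonic);
# a dict value means another opcode byte follows, e.g. after the 0F escape.
OPCODES = {
    0x90: "nop",
    0xC3: "ret",
    0xCC: "int3",
    0x0F: {
        0x05: "syscall",
        0x0B: "ud2",
        0x31: "rdtsc",
        0xA2: "cpuid",
    },
}


def lookup(code, offset=0):
    """Walk the trie from code[offset].

    Returns (mnemonic, bytes consumed); mnemonic is None when the bytes
    are not in the table or the input ends in the middle of an opcode.
    """
    node = OPCODES
    length = 0
    while isinstance(node, dict):
        if offset + length >= len(code):
            return None, length       # ran out of bytes mid-opcode
        node = node.get(code[offset + length])
        length += 1
    return node, length
```

Here lookup(b"\x0f\xa2") walks two levels and returns ("cpuid", 2), while lookup(b"\x90") returns ("nop", 1) after a single step. A flat table would instead be searched entry by entry, which is simpler to write but slower.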

There are a few stock ways of decoding register arguments; however, there are enough special cases that most of them have to be implemented individually.
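
To illustrate, here are two such ways side by side, as a sketch limited to register-to-register MOV (opcodes 89 and 8B) and MOV r32, imm32 (opcodes B8 to BF) in 32-bit mode. The function names are hypothetical; the point is that the register number sits in the ModR/M byte in one case and in the low three bits of the opcode in the other.

```python
REGISTERS = {
    8:  ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"],
    16: ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"],
    32: ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"],
}


def decode_mov_reg_reg(code, operand_bits=32):
    """Registers encoded in the ModR/M byte (opcodes 89 and 8B)."""
    opcode, modrm = code[0], code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 3:
        raise ValueError("memory operands are not handled in this sketch")
    names = REGISTERS[operand_bits]
    if opcode == 0x89:      # MOV r/m, r: destination is the r/m field
        dst, src = names[rm], names[reg]
    elif opcode == 0x8B:    # MOV r, r/m: destination is the reg field
        dst, src = names[reg], names[rm]
    else:
        raise ValueError("opcode not handled in this sketch")
    return f"mov {dst}, {src}"


def decode_mov_reg_imm(code):
    """Register encoded in the opcode's low 3 bits (B8+r), then a 32-bit immediate."""
    opcode = code[0]
    if not 0xB8 <= opcode <= 0xBF:
        raise ValueError("opcode not handled in this sketch")
    imm = int.from_bytes(code[1:5], "little")
    return f"mov {REGISTERS[32][opcode & 7]}, {imm:#x}"
```

The same ModR/M byte D8 gives mov eax, ebx after opcode 89 but mov ebx, eax after opcode 8B, which is one of the special cases a table entry has to record per opcode.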

Since this is educational, have a look at ndisasm.

+4
May 29 '09 at 4:02

Check out the objdump sources. It is a great tool that contains many opcode tables, and its sources can provide a good base for building your own disassembler.

+2
Aug 07 '11 at 23:27


