x86: a way to tell instructions from data

Is there a more or less reliable way to determine whether the bytes at some location in memory are the beginning of a processor instruction or just data?

For example, E8 3F BD 6A 00 may be a call (E8) instruction with a relative offset of 0x6ABD3F, or it may be three bytes belonging to some other instruction followed by push 0 (6A 00).
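To make the first reading concrete, here is a minimal Python sketch of how a CALL rel32 is decoded: the four bytes after E8 are a signed little-endian displacement, relative to the address of the next instruction. The load address 0x401000 and the function name are assumptions chosen only for illustration.

```python
import struct

def decode_call_rel32(code: bytes, addr: int):
    """Interpret the 5 bytes at `addr` as CALL rel32; return the target or None."""
    if len(code) < 5 or code[0] != 0xE8:
        return None
    (rel,) = struct.unpack_from("<i", code, 1)   # signed little-endian displacement
    return (addr + 5 + rel) & 0xFFFFFFFF         # relative to the next instruction

data = bytes.fromhex("E83FBD6A00")
target = decode_call_rel32(data, 0x401000)       # hypothetical address of the bytes
print(hex(target))
# The same buffer read differently: the last two bytes alone are "push 0".
print(data[3:].hex())
```

Nothing in the bytes themselves says which reading is the right one; that depends on where execution actually begins.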

I know the question sounds silly and there probably isn't an easy way, but maybe the instruction set was designed with this in mind, or maybe some simple analysis of the code within about 100 bytes of the location could give an answer that is most likely correct.

I want to know this because I am scanning a program's code and replacing all calls to certain functions with calls to my own replacements. It works so far, but it is not impossible that at some point, as I increase the number of functions I replace, some data will look exactly like a call to one of these addresses, will be replaced, and will cause a surprising crash. I want to reduce the likelihood of this.
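The scan described above can be sketched roughly as follows. This is a guess at the approach, not the asker's actual code; the function names and addresses are hypothetical. The example image contains one real call and a data blob that happens to carry the same encoding, and the naive scan reports both.

```python
import struct

def find_call_candidates(image: bytes, base: int, target: int):
    """Return every address whose bytes look like CALL rel32 to `target`."""
    hits = []
    for off in range(len(image) - 4):
        if image[off] != 0xE8:
            continue
        (rel,) = struct.unpack_from("<i", image, off + 1)
        if (base + off + 5 + rel) & 0xFFFFFFFF == target:
            hits.append(base + off)
    return hits

def call_bytes(addr: int, target: int) -> bytes:
    """Encode CALL rel32 located at `addr` and going to `target`."""
    return b"\xE8" + struct.pack("<i", target - (addr + 5))

# A real call, a ret, then five bytes of *data* that coincide with a call encoding.
image = call_bytes(0x1000, 0x2000) + b"\xC3" + call_bytes(0x1006, 0x2000)
print([hex(a) for a in find_call_candidates(image, 0x1000, 0x2000)])
```

A byte-pattern scan cannot tell the second hit from the first, which is exactly the risk described in the question.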

+4
5 answers

If this is your own code (or any code that retains its linking and debugging information), the best way is to scan the symbol and relocation tables in the object file. Otherwise, there is no reliable way to determine whether a given byte is an instruction or data.

Perhaps the most effective approach is recursive disassembly, i.e. disassembling the code from the entry point and from every branch target found along the way. But this is not completely reliable, because it does not follow jump tables (you can try some heuristics for those, but they are not fully reliable either).
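As a rough illustration of recursive disassembly, the sketch below uses a toy decoder that understands only a handful of one-, two- and five-byte opcodes (nop, ret, push imm8, call/jmp rel32, jmp rel8). The decoder, the function names and the sample image are all assumptions made for the example; a real tool needs a full x86 decoder. It also assumes that every call returns, so the bytes after a call are treated as code.

```python
import struct

def decode(image, base, addr):
    """Toy decoder: return (length, branch_targets, falls_through) or None."""
    off = addr - base
    if not 0 <= off < len(image):
        return None
    op = image[off]
    if op == 0x90:                                     # nop
        return 1, [], True
    if op == 0xC3:                                     # ret
        return 1, [], False
    if op == 0x6A and off + 2 <= len(image):           # push imm8
        return 2, [], True
    if op in (0xE8, 0xE9) and off + 5 <= len(image):   # call / jmp rel32
        (rel,) = struct.unpack_from("<i", image, off + 1)
        return 5, [addr + 5 + rel], op == 0xE8         # only call falls through
    if op == 0xEB and off + 2 <= len(image):           # jmp rel8
        (rel,) = struct.unpack_from("<b", image, off + 1)
        return 2, [addr + 2 + rel], False
    return None                                        # unknown: stop this path

def recursive_disassemble(image, base, entry):
    """Return the set of addresses reached as instruction starts."""
    starts, work = set(), [entry]
    while work:
        addr = work.pop()
        while addr not in starts:
            ins = decode(image, base, addr)
            if ins is None:
                break
            length, targets, falls_through = ins
            starts.add(addr)
            work.extend(targets)                       # visit branch targets later
            if not falls_through:
                break
            addr += length
    return starts

# 0x1000: call 0x100B | 0x1005: ret | 0x1006: 5 data bytes that look like a call
# 0x100B: push 0      | 0x100D: ret
image = bytes.fromhex("E806000000" "C3" "E800000000" "6A00" "C3")
code = recursive_disassemble(image, 0x1000, 0x1000)
print(sorted(hex(a) for a in code))
```

The data bytes at 0x1006 are never reached from the entry point, so they are not marked as code. Anything reachable only through a jump table or function pointer would be missed in the same way, which is the limitation mentioned above.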

A solution to your problem would be to patch the replaced function itself: overwrite its beginning with a jump to your function.

+5

Unfortunately, there is no 100% reliable way to distinguish code from data. From the CPU's point of view, bytes are code only when some jump instruction causes the processor to try to execute them as code. You can try to analyze the control flow, starting from the program's entry point and following every possible execution path, but this can fail if the program uses code pointers (for example, function pointers).

On your specific issue: I understand that you want to replace an existing function with your own. I suggest you patch the replaced function itself. That is, instead of looking for all the calls to foo() and replacing them with calls to bar(), just replace the first bytes of foo() with a jump to bar() (a jmp, not a call: you don't want to mess with the stack). This is less satisfying because of the double jump, but it is reliable.
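The bytes for such a jump can be computed as in this minimal sketch. It assumes 32-bit code and the 5-byte JMP rel32 encoding (E9 followed by a signed displacement); the addresses and the function name are hypothetical, and actually writing the patch into memory (changing page protection and so on) is platform-specific and not shown.

```python
import struct

def make_jmp_patch(func_addr: int, replacement_addr: int) -> bytes:
    """Return the 5 bytes to write at func_addr so that it jumps to replacement_addr."""
    rel = replacement_addr - (func_addr + 5)     # relative to the next instruction
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is out of rel32 range")
    return b"\xE9" + struct.pack("<i", rel)

# Hypothetical addresses of foo() and bar().
patch = make_jmp_patch(0x401000, 0x402000)
print(patch.hex())
```

Because the patch sits at a known function entry, there is no need to guess which bytes elsewhere in the program are call instructions.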

+2

In general it is impossible to distinguish data from instructions; this is a consequence of the von Neumann architecture, in which code and data share the same memory. Analyzing the surrounding code is useful, and disassembly tools do this. (This may be useful. If you cannot use IDA Pro because it is commercial, use a different disassembly tool.)

+1

Ordinary code has a very characteristic entropy, so it is fairly easy to separate it from most data. This is only a probabilistic approach, but a reasonably large buffer of plain code can be recognized (especially compiler output, where you can also recognize patterns such as function prologues).
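As a sketch of the entropy idea, the following computes the Shannon entropy of a byte buffer in bits per byte. The function name is an assumption, and the sketch deliberately gives no threshold: what range counts as "code-like" would have to be calibrated against known code and data from the binaries in question.

```python
import math
from collections import Counter

def shannon_entropy(buf: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not buf:
        return 0.0
    n = len(buf)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum(c / n * math.log2(n / c) for c in Counter(buf).values())

constant = shannon_entropy(bytes(64))          # one repeated value: minimum entropy
uniform = shannon_entropy(bytes(range(256)))   # every value once: maximum entropy
print(constant, uniform)
```

In practice this would be computed over a sliding window and combined with other hints, since a single number cannot decide the status of any individual byte.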

In addition, some opcodes are reserved for future use, and others are available only in kernel mode. If you know which ones these are and can compute instruction lengths (you can try the routine written by Z0mbie for this), you can rule out byte sequences that would decode to such instructions.

+1

Thomas has the right idea. To implement it properly, you need to disassemble the first few instructions (the part that you will overwrite with the JMP) and create a small trampoline function that executes them and then jumps to the rest of the original function.
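The layout of the patch and the trampoline can be sketched as below. This is only an illustration under several assumptions: 32-bit code, instruction lengths supplied by a real disassembler (here passed in by hand), and overwritten instructions that are position-independent (no relative branches or calls among them, otherwise they would need fixing up). All names and addresses are hypothetical.

```python
import struct

JMP_LEN = 5  # size of JMP rel32

def jmp_rel32(src: int, dst: int) -> bytes:
    """Encode JMP rel32 located at `src` and going to `dst`."""
    return b"\xE9" + struct.pack("<i", dst - (src + JMP_LEN))

def build_hook(func_bytes, func_addr, hook_addr, tramp_addr, insn_lengths):
    """Return (patch, trampoline) for hooking the function at func_addr."""
    stolen = 0
    for n in insn_lengths:            # take whole instructions until the jump fits
        if stolen >= JMP_LEN:
            break
        stolen += n
    if stolen < JMP_LEN:
        raise ValueError("not enough instructions to hold a 5-byte jump")
    # Trampoline: the overwritten instructions, then a jump back into the original.
    trampoline = func_bytes[:stolen] + jmp_rel32(tramp_addr + stolen, func_addr + stolen)
    # Patch: jump to the hook, padded with nops up to an instruction boundary.
    patch = jmp_rel32(func_addr, hook_addr) + b"\x90" * (stolen - JMP_LEN)
    return patch, trampoline

# push ebp; mov ebp, esp; sub esp, 0x10; ret  (lengths 1, 2, 3, 1)
func = bytes.fromhex("55" "8BEC" "83EC10" "C3")
patch, trampoline = build_hook(func, 0x401000, 0x402000, 0x403000, [1, 2, 3, 1])
print(patch.hex(), trampoline.hex())
```

The replacement function can call the trampoline whenever it needs the original behavior.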

There are libraries that do this for you. A well-known one is Detours, but it has somewhat inconvenient licensing terms. A good implementation of the same idea with a more permissive license is Mhook.

0
