Functional imprint assembly level function

I would like to determine if two functions from two executable files were compiled from the same (C) source code and would like to do this even if they were compiled by different versions of the compiler or different compilation options. I am currently considering using some kind of fingerprint function at assembly level. A function fingerprint must have the following properties:

  • two functions compiled from the same source under different circumstances are likely to have the same fingerprint (or similar),
  • two functions compiled from different C sources may have different fingerprints,
  • (bonus), if the two functions of the source were similar, the fingerprints are also similar (for some reasonable definition of similar).

What I'm looking for right now is a set of properties of compiled functions that individually satisfy (1.) and taken together (2.).

Assumptions

Of course, this is not possible at all, but there may be something that will work in most cases. Here are some suggestions that could make this easier:

  • linux ELF binaries (no debugging information),
  • not getting confused in any way
  • compiled gcc,
  • on x86 linux (an approach that can be implemented on other architectures would be nice).

Ideas

Unfortunately, I have little experience with the assembly. Here are some ideas for the above properties:

  • types of instructions contained in a function (for example, floating point instructions, memory barriers)
  • memory access from a function (reads / writes from / to a heap?)?
  • library functions are called (their names must be available in ELF, also their order usually does not change)
  • control flow graph form (I think it will be highly dependent on the compiler)

Existing work

I managed to find only the tangent work:


Do you have any suggestions regarding function properties? Or another idea that also fulfills my goal? Or something similar has already been implemented, and I completely missed it?

+8
c assembly executable reverse-engineering fingerprint
source share
4 answers

FLIRT uses byte-level pattern matching, so it is interrupted with any changes to the encoding of instructions (for example, various register allocation / reordering commands).

For mapping graphs, see BinDiff. Although he closed the source, Halvar described some of the approaches of his blog . They even have open sources that they do to create fingerprints, in the form of a BinCrowd plugin .

+5
source share

In my opinion, the easiest way to do something like this is to decompose the assembly of functions back into some higher level form, where there are constructs (for example, for , while , function calls, etc.), and then the structure of these constructions corresponds more high level.

This will prevent reordering of commands, loop loops, unwinding loops, and any other optimizations that interact with the comparison, you can even (de) optimize these higher-level structures to the maximum at both ends to make sure they are at the same point, therefore, the comparison between the non-optimized debugging code and -O3 will not fail due to lack of time / absence of spills in the register, etc.

You can use something like boomerang as the basis for decompilation (except that you won't spit out C code).

0
source share

I suggest you approach this problem from the point of view of the language in which the code was written, and what restrictions the code places on compiler optimization.

I am not very familiar with the C standard, but C ++ has the concept of β€œobservable” behavior. The standard carefully defines this, and compilers get more freedom in optimization if the result gives the same observable behavior. My recommendation to determine if two functions are the same would be to try to determine what their observed behavior is (what I / O they perform and how they interact with other memory areas and in what order).

0
source share

If the set of problems can be reduced to a small set of known C or C++ source code functions compiled by n different compilers, each of which has m [n] different sets of compiler options, then it would be simple if the tedious solution would be to compile the code using a combination of each compiler and options and catalog received command bytes or, more efficiently, their hash signature in the database.

Many compiler options are potentially large, but in practice, engineers usually use a fairly standard and small set of parameters, usually minimally optimized for debugging and fully optimized for release. An examination of many project configurations may show that in any engineering culture there are only two or three associated with the prejudices or superstitions of how compilers work, whether accurate or not.

I suspect this approach is closest to what you really want: a way to investigate the allegedly misappropriated source code. All proposed methods for restoring the compiler parsing tree can bear fruit, but have great potential for missed symmetric solutions or ambiguous unsolvable cases.

0
source share

All Articles