This is an interesting and deep question, though perhaps not very well suited to this site.
The question, if I have understood it correctly, is how the choice of what to analyze affects static analysis when hunting for defects: should the analyzer look at the IL, or should it look at the source code? Note that I have broadened this question from its original narrow focus on division-by-zero defects.
The answer, of course, is: it depends. Both approaches are commonly used in the static analysis industry, and each has pros and cons. It depends on what kinds of defects you are looking for, what techniques you use to prune false paths and suppress false positives, and how you intend to report detected defects to developers.
Bytecode analysis has some obvious advantages over source code analysis. The main one: if you have an analyzer for Java bytecode, you can run Scala through it without ever writing a Scala analyzer. If you have an MSIL analyzer, you can run C# or VB or F# through it without writing a separate analyzer for each language.
There are also benefits at the level of the analysis itself. Control flow analysis is very easy when you have bytecode, because you can very quickly organize chunks of bytecode into "basic blocks"; a basic block is a region of code where no instruction branches into its middle, and every normal exit from the block is at its bottom. (Exceptions, of course, can happen anywhere.) By breaking the bytecode up into basic blocks we can compute a graph of blocks that branch to one another, and then summarize each block in terms of its effect on the local and global state. Bytecode is useful because it is an abstraction over the code that shows, at a lower level, what is really happening.
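As a sketch of that first step, here is a minimal illustration of partitioning an instruction stream into basic blocks. The encoding is invented for the example: instructions are just indices, and a map records where each jump instruction goes.

```java
import java.util.*;

// A minimal sketch of splitting an instruction stream into basic blocks.
// Hypothetical encoding: instructions are numbered 0..count-1, and `jumps`
// maps each jump instruction's index to its target index.
class BasicBlocks {
    static List<List<Integer>> partition(int count, Map<Integer, Integer> jumps) {
        // Leaders start blocks: instruction 0, every jump target, and
        // every instruction that follows a jump (the fall-through path).
        SortedSet<Integer> leaders = new TreeSet<>();
        leaders.add(0);
        for (Map.Entry<Integer, Integer> j : jumps.entrySet()) {
            leaders.add(j.getValue());
            if (j.getKey() + 1 < count)
                leaders.add(j.getKey() + 1);
        }
        // Each block runs from one leader up to (not including) the next.
        List<Integer> starts = new ArrayList<>(leaders);
        List<List<Integer>> blocks = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int end = (i + 1 < starts.size()) ? starts.get(i + 1) : count;
            List<Integer> block = new ArrayList<>();
            for (int pc = starts.get(i); pc < end; pc++)
                block.add(pc);
            blocks.add(block);
        }
        return blocks;
    }
}
```

For six instructions with a single jump from instruction 2 to instruction 5, this yields the blocks [0, 1, 2], [3, 4], [5]; the branches between blocks then give you the control-flow graph whose blocks you summarize.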
That, of course, is also its main drawback: the bytecode loses information about the developer's intentions. Any defect check that requires information from the source code, either to detect a defect or to prevent a false positive, will do poorly when run on bytecode. Consider, for example, this C program:
```c
#define DOBAR if(foo)bar();
...
if (blah)
  DOBAR
else
  baz();
```
If this horrible code is lowered to machine code or bytecode, all we will see is a bunch of branch instructions; we would have no way of knowing that we ought to report a defect here: that the else binds to the if(foo), and not to the if(blah) as the developer intended.
The dangers of the C preprocessor are well known. But there are also serious difficulties in analyzing heavily lowered code at the bytecode level. For example, consider this C#:
```csharp
async Task Foo(Something x)
{
    if (x == null)
        return;
    await x.Bar();
    await x.Blah();
}
```
Plainly x cannot be dereferenced when it is null here. But C# is going to lower this to a crazy-complicated state machine; part of that code will look something like this:
```csharp
int state = 0;
Action doit = () =>
{
    switch (state)
    {
        case 0:
            if (x == null) { state = -1; return; }
            state = 1;
            goto case 1;
        case 1:
            Task bar = x.Bar();
            state = 2;
            if (<bar is a completed task>)
                goto case 2;
            else
            {
                <assign doit as the completion of bar>
                return;
            }
        case 2:
```
And so on. (Except that it is much more complicated than that.) That code then gets lowered into still lower-level bytecode; imagine trying to understand this code at a level where the switch has been lowered into gotos and the delegate has been lowered into closures.
A static analyzer analyzing the equivalent bytecode would be well within its rights to say: "plainly x can be null, because we check it on one branch of the switch; this indicates that x ought to be checked for null on the other branches, and it is not, so I will report a possible null dereference defect on those branches."
But that would be a false positive. We know something the static analyzer does not, namely that state zero always executes before every other state, and that when the coroutine resumes, x has already been checked for null. That is obvious from the source code, but it would be very hard to tease out of the bytecode.
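To make that "state zero runs first" argument concrete: if you model the coroutine as a whole, with each await as an edge from one state to the next (a graph the bytecode-level analyzer never gets to see), the block containing the null check dominates every resume block. A minimal iterative dominator computation, with the graph encoding invented for the example:

```java
import java.util.*;

// A small sketch of a dominator computation: dom(v) is the set of blocks
// that lie on every path from the entry (block 0) to block v. Graph
// encoding is invented: preds[v] lists the predecessors of block v.
class Dominators {
    static List<Set<Integer>> compute(int[][] preds) {
        int n = preds.length;
        List<Set<Integer>> dom = new ArrayList<>();
        dom.add(new TreeSet<>(Set.of(0)));          // entry dominates only itself
        for (int v = 1; v < n; v++) {
            Set<Integer> all = new TreeSet<>();
            for (int i = 0; i < n; i++) all.add(i); // start from "everything"
            dom.add(all);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int v = 1; v < n; v++) {
                // dom(v) = {v} union intersection of dom(p) over predecessors p
                Set<Integer> next = null;
                for (int p : preds[v]) {
                    if (next == null) next = new TreeSet<>(dom.get(p));
                    else next.retainAll(dom.get(p));
                }
                if (next == null) next = new TreeSet<>();
                next.add(v);
                if (!next.equals(dom.get(v))) { dom.set(v, next); changed = true; }
            }
        }
        return dom;
    }
}
```

With block 0 as the state holding the null check and blocks 1 and 2 as the resume points (preds {{}, {0}, {1}}), block 0 dominates both resumes, so the check provably happens first. On the lowered switch's own control-flow graph, where every case is entered from the dispatch, no such dominance holds, which is exactly why the bytecode-level analyzer reports the false positive.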
What do you do if you want the benefits of bytecode analysis without the drawbacks? There are many techniques; for example, you could devise your own intermediate language that is higher-level than bytecode, one that has high-level constructs such as "yield" or "await" or "for loop", write an analyzer that analyzes that intermediate language, and then write lowerers that compile each target language (C#, Java, whatever) into your intermediate language. That means writing many lowerers but only one analyzer, and writing the analyzer is the hard part.
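As a tiny sketch of what such an intermediate language might look like (every type and construct name here is invented for illustration; Java records are used just to keep it short): the point is that an await survives as a single node rather than a lowered state machine, so a check that precedes it stays visibly in front of it.

```java
import java.util.List;

// Hypothetical higher-than-bytecode IL: high-level constructs survive lowering.
interface Node {}
record Var(String name) implements Node {}
record Call(Node receiver, String method) implements Node {}
record Await(Node task) implements Node {}      // one node, not a state machine
record NullCheck(Node expr) implements Node {}
record If(Node cond, List<Node> then, List<Node> otherwise) implements Node {}
record Return() implements Node {}

class Lowered {
    // The async Foo method from earlier, expressed in this IL: the null
    // check visibly precedes both awaits, so an analyzer can see at once
    // that x cannot be null at either await.
    static List<Node> foo() {
        Var x = new Var("x");
        return List.of(
            new If(new NullCheck(x), List.of(new Return()), List.of()),
            new Await(new Call(x, "Bar")),
            new Await(new Call(x, "Blah")));
    }
}
```

A lowerer from C# would build this tree from the source, and a lowerer from Java would build the same kinds of nodes from Java source; the single analyzer then works over the shared node types.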
This has been a very brief discussion, I know; it is a big and difficult question.
If you are interested in the design of bytecode-based static analyzers, consider studying Infer, an open-source static analyzer for Java and other languages that turns Java bytecode into an even lower-level bytecode suitable for analyzing heap properties; read up on separation logic first. https://github.com/facebook/infer