How to parse LLVM IR line by line

I need to parse LLVM IR code line by line during the execution of my C++ code, so that I know which operation is performed on which operands in each line.

For example, if the IR code is:

%0 = load i32* %a, align 4 

I would like to know, during the execution of my C++ code, that the value from %a is loaded into %0. I considered writing a simple text-based parser in C++ for this (reading the IR and searching for IR keywords), but I would like to know whether there are any existing libraries (possibly from LLVM itself) that would let me avoid this.

1 answer

Assumption

Theoretically, we could use llvm::LLLexer directly and write our own parser that reads LLVM IR line by line.
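To give a sense of what the hand-written route involves, here is a minimal sketch in plain C++. It does not use llvm::LLLexer or any other LLVM code; the names ParsedLine and parseLine are hypothetical and exist only for this illustration. It splits one instruction line on whitespace, commas and parentheses, and treats every token that starts with % or @ as an operand.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one textual IR instruction; not an LLVM type.
struct ParsedLine {
    std::string result;                 // e.g. "%0"; empty if no value is defined
    std::string opcode;                 // e.g. "load"
    std::vector<std::string> operands;  // tokens starting with '%' or '@'
};

// Naive tokenizer for a single instruction line (assumption: one
// instruction per line, no string constants, no metadata).
ParsedLine parseLine(const std::string &line) {
    // Turn separators into spaces so that operator>> can split tokens.
    std::string cleaned;
    for (char c : line)
        cleaned += (c == ',' || c == '(' || c == ')') ? ' ' : c;

    std::istringstream in(cleaned);
    std::vector<std::string> toks;
    for (std::string t; in >> t;)
        toks.push_back(t);

    ParsedLine out;
    std::size_t i = 0;
    // "<result> = <opcode> ..." defines a value; otherwise the opcode is first.
    if (toks.size() >= 2 && toks[1] == "=") {
        out.result = toks[0];
        i = 2;
    }
    if (i < toks.size())
        out.opcode = toks[i++];
    for (; i < toks.size(); ++i)
        if (toks[i][0] == '%' || toks[i][0] == '@')
            out.operands.push_back(toks[i]);
    return out;
}
```

For the line from the question, `%0 = load i32* %a, align 4`, this yields the result %0, the opcode load and the single operand %a. The sketch is easy to fool: a named type such as %struct.a would be reported as an operand, and constants are ignored entirely, which is a good reason to prefer LLVM's own parser.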

The following answer assumes that you are only interested in the operations inside each function of the LLVM IR file, since nothing else in the file describes an operation. An operation can only appear inside a function. The other parts of the IR, such as structure definitions and function declarations, carry only type information and say nothing about operations.

Implementation

Based on the above assumption, your question about parsing LLVM IR line by line can be restated as visiting each operation in each function of the LLVM IR file.

LLVM already ships an implementation that parses an IR file and exposes its operations directly. Since the functions are stored in the module in the order in which they appear in the IR file, the sequence of operations printed by the implementation below is the sequence of operations in the given IR file.

Therefore, we can use the parseBitcodeFile interface provided by LLVM. It reads the bitcode (.bc) form of the IR and produces an ErrorOr<llvm::Module *>; the functions in the resulting module are in the same order as in the IR file. For textual (.ll) input, LLVM's assembly parser, which is built on llvm::LLLexer, splits the file into tokens and hands them to the parser; parseIRFile from llvm/IRReader/IRReader.h accepts both forms.

Then we can iterate over each llvm::BasicBlock of each llvm::Function in the llvm::Module, visit each llvm::Instruction, and get information about each llvm::Value operand. Below is the implementation code.

    // Written against an older LLVM 3.x API (llvm/Bitcode/ReaderWriter.h,
    // parseBitcodeFile returning ErrorOr<llvm::Module *>).
    #include <iostream>
    #include <memory>
    #include <string>
    #include <llvm/Support/MemoryBuffer.h>
    #include <llvm/Support/ErrorOr.h>
    #include <llvm/IR/Module.h>
    #include <llvm/IR/LLVMContext.h>
    #include <llvm/Bitcode/ReaderWriter.h>
    #include <llvm/Support/raw_ostream.h>

    using namespace llvm;

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            std::cerr << "Usage: " << argv[0] << " bitcode_filename" << std::endl;
            return 1;
        }
        StringRef filename = argv[1];
        LLVMContext context;

        // Read the whole bitcode file into memory.
        ErrorOr<std::unique_ptr<MemoryBuffer>> fileOrErr =
            MemoryBuffer::getFileOrSTDIN(filename);
        if (std::error_code ec = fileOrErr.getError()) {
            std::cerr << "Error opening input file: " + ec.message() << std::endl;
            return 2;
        }

        // Parse the buffer into a Module.
        ErrorOr<llvm::Module *> moduleOrErr =
            parseBitcodeFile(fileOrErr.get()->getMemBufferRef(), context);
        // Check the parse result here, not the file buffer again.
        if (std::error_code ec = moduleOrErr.getError()) {
            std::cerr << "Error reading Module: " + ec.message() << std::endl;
            return 3;
        }

        Module *m = moduleOrErr.get();
        std::cout << "Successfully read Module:" << std::endl;
        std::cout << " Name: " << m->getName().str() << std::endl;
        std::cout << " Target triple: " << m->getTargetTriple() << std::endl;

        // Walk functions, then basic blocks, then instructions, in file order.
        for (auto iter1 = m->getFunctionList().begin();
             iter1 != m->getFunctionList().end(); iter1++) {
            Function &f = *iter1;
            std::cout << " Function: " << f.getName().str() << std::endl;
            for (auto iter2 = f.getBasicBlockList().begin();
                 iter2 != f.getBasicBlockList().end(); iter2++) {
                BasicBlock &bb = *iter2;
                std::cout << " BasicBlock: " << bb.getName().str() << std::endl;
                for (auto iter3 = bb.begin(); iter3 != bb.end(); iter3++) {
                    Instruction &inst = *iter3;
                    std::cout << " Instruction " << &inst << " : "
                              << inst.getOpcodeName();

                    unsigned int i = 0;
                    unsigned int opnt_cnt = inst.getNumOperands();
                    for (; i < opnt_cnt; ++i) {
                        Value *opnd = inst.getOperand(i);
                        std::string o;
                        // raw_string_ostream os(o);
                        // opnd->print(os);
                        // opnd->printAsOperand(os, true, m);
                        if (opnd->hasName()) {
                            // Named value: print its name.
                            o = opnd->getName().str();
                            std::cout << " " << o << ",";
                        } else {
                            // Unnamed value: print its address instead.
                            std::cout << " ptr" << opnd << ",";
                        }
                    }
                    std::cout << std::endl;
                }
            }
        }
        return 0;
    }

To generate an executable file, use the following command:

 clang++ ReadBitCode.cpp -o reader `llvm-config --cxxflags --libs --ldflags --system-libs` 

Take the following C code as an example:

    struct a {
        int f_a;
        int f_b;
        char f_c:5;
        char f_d:4;
    };

    int my_func(int arg1, struct a obj_a) {
        int x = arg1;
        return x+1 + obj_a.f_c;
    }

    int main() {
        int a = 11;
        int b = 22;
        int c = 33;
        int d = 44;
        struct a obj_a;
        obj_a.f_a = 1;
        obj_a.f_b = 2;
        obj_a.f_c = 3;
        obj_a.f_c = 4;
        if (a > 10) {
            b = c;
        } else {
            b = my_func(d, obj_a);
        }
        return b;
    }

Compile it to bitcode and run the reader on the result with the following commands:

    clang -emit-llvm -o foo.bc -c foo.c
    ./reader foo.bc

The result should look something like this:

    Name: foo.bc
    Target triple: x86_64-unknown-linux-gnu
    Function: my_func
    BasicBlock: entry
    Instruction 0x18deb68 : alloca ptr0x18db940,
    Instruction 0x18debe8 : alloca ptr0x18db940,
    Instruction 0x18dec68 : alloca ptr0x18db940,
    Instruction 0x18dece8 : alloca ptr0x18db940,
    Instruction 0x18de968 : getelementptr coerce, ptr0x18de880, ptr0x18de880,
    Instruction 0x18de9f0 : store obj_a.coerce0, ptr0x18de968,
    Instruction 0x18df0a8 : getelementptr coerce, ptr0x18de880, ptr0x18db940,
    Instruction 0x18df130 : store obj_a.coerce1, ptr0x18df0a8,
    Instruction 0x18df1a8 : bitcast obj_a,
    Instruction 0x18df218 : bitcast coerce,
    Instruction 0x18df300 : call ptr0x18df1a8, ptr0x18df218, ptr0x18de8d0, ptr0x18de1a0, ptr0x18de1f0, llvm.memcpy.p0i8.p0i8.i64,
    Instruction 0x18df3a0 : store arg1, arg1.addr,
    Instruction 0x18df418 : load arg1.addr,
    Instruction 0x18df4a0 : store ptr0x18df418, x,
    Instruction 0x18df518 : load x,
    Instruction 0x18df5a0 : add ptr0x18df518, ptr0x18db940,
    Instruction 0x18df648 : getelementptr obj_a, ptr0x18de880, ptr0x18deab0,
    Instruction 0x18df6b8 : load f_c,
    Instruction 0x18df740 : shl bf.load, ptr0x18deb00,
    Instruction 0x18df7d0 : ashr bf.shl, ptr0x18deb00,
    Instruction 0x18df848 : sext bf.ashr,
    Instruction 0x18df8d0 : add add, conv,
    Instruction 0x18df948 : ret add1,
    Function: llvm.memcpy.p0i8.p0i8.i64
    Function: main
    BasicBlock: entry
    Instruction 0x18e0078 : alloca ptr0x18db940,
    Instruction 0x18e00f8 : alloca ptr0x18db940,
    Instruction 0x18e0178 : alloca ptr0x18db940,
    Instruction 0x18e01f8 : alloca ptr0x18db940,
    Instruction 0x18e0278 : alloca ptr0x18db940,
    Instruction 0x18e02f8 : alloca ptr0x18db940,
    Instruction 0x18e0378 : alloca ptr0x18db940,
    Instruction 0x18e0410 : store ptr0x18de880, retval,
    Instruction 0x18e04a0 : store ptr0x18dfe30, a,
    Instruction 0x18e0530 : store ptr0x18dfe80, b,
    Instruction 0x18e05c0 : store ptr0x18dfed0, c,
    Instruction 0x18e0650 : store ptr0x18dff20, d,
    Instruction 0x18e06f8 : getelementptr obj_a, ptr0x18de880, ptr0x18de880,
    Instruction 0x18e0780 : store ptr0x18db940, f_a,
    Instruction 0x18e0828 : getelementptr obj_a, ptr0x18de880, ptr0x18db940,
    Instruction 0x18e08b0 : store ptr0x18deab0, f_b,
    Instruction 0x18e0958 : getelementptr obj_a, ptr0x18de880, ptr0x18deab0,
    Instruction 0x18e09c8 : load f_c,
    Instruction 0x18e0a50 : and bf.load, ptr0x18dff70,
    Instruction 0x18e0ae0 : or bf.clear, ptr0x18deb00,
    Instruction 0x18e0b70 : store bf.set, f_c,
    Instruction 0x18e0c18 : getelementptr obj_a, ptr0x18de880, ptr0x18deab0,
    Instruction 0x18e0c88 : load f_c1,
    Instruction 0x18e0d10 : and bf.load2, ptr0x18dff70,
    Instruction 0x18e0da0 : or bf.clear3, ptr0x18dffc0,
    Instruction 0x18ded80 : store bf.set4, f_c1,
    Instruction 0x18dedf8 : load a,
    Instruction 0x18dee80 : icmp ptr0x18dedf8, ptr0x18e0010,
    Instruction 0x18def28 : br cmp, if.else, if.then,
    BasicBlock: if.then
    Instruction 0x18def98 : load c,
    Instruction 0x18e1440 : store ptr0x18def98, b,
    Instruction 0x18df008 : br if.end,
    BasicBlock: if.else
    Instruction 0x18e14b8 : load d,
    Instruction 0x18e1528 : bitcast obj_a.coerce,
    Instruction 0x18e1598 : bitcast obj_a,
    Instruction 0x18e1680 : call ptr0x18e1528, ptr0x18e1598, ptr0x18de8d0, ptr0x18de880, ptr0x18de1f0, llvm.memcpy.p0i8.p0i8.i64,
    Instruction 0x18e1738 : getelementptr obj_a.coerce, ptr0x18de880, ptr0x18de880,
    Instruction 0x18e17a8 : load ptr0x18e1738,
    Instruction 0x18e1848 : getelementptr obj_a.coerce, ptr0x18de880, ptr0x18db940,
    Instruction 0x18e18b8 : load ptr0x18e1848,
    Instruction 0x18e1970 : call ptr0x18e14b8, ptr0x18e17a8, ptr0x18e18b8, my_func,
    Instruction 0x18e1a10 : store call, b,
    Instruction 0x18e1a88 : br if.end,
    BasicBlock: if.end
    Instruction 0x18e1af8 : load b,
    Instruction 0x18e1b68 : ret ptr0x18e1af8,

Description

To make better sense of the above output, please note the following.

LLVM uses the instruction address as the identifier of its result

Internally, LLVM represents the result of an instruction by the instruction object itself, so the address of the instruction identifies the value. When that result is used by another instruction, the operand is simply a pointer to the defining instruction.

In the human-readable IR generated by clang, results appear as %0, %add, %conv and so on for readability. Values such as %add and %conv carry a name, which the program above prints; numbers such as %0 are assigned only when the IR is printed, so unnamed values show up above as ptr followed by an address.
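The idea that an operand is just a pointer to the instruction that produced it can be shown with a toy model. The sketch below does not use the real LLVM classes: ToyInst and describe are hypothetical names made up for this illustration, and describe only imitates the name-or-address printing of the reader program above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for llvm::Instruction; NOT the real LLVM class.
struct ToyInst {
    std::string opcode;
    std::string name;                       // empty when the value is unnamed
    std::vector<const ToyInst *> operands;  // each operand points at its definition
};

// Print the opcode followed by each operand: the name if there is one,
// otherwise "ptr" plus the address of the defining instruction.
std::string describe(const ToyInst &inst) {
    std::ostringstream os;
    os << inst.opcode;
    for (const ToyInst *op : inst.operands) {
        if (!op->name.empty())
            os << " " << op->name << ",";
        else
            os << " ptr" << static_cast<const void *>(op) << ",";
    }
    return os.str();
}
```

With an unnamed load whose only operand is an alloca named x, describe returns "load x,". An add that uses that load twice prints the same address twice, because both operands are the same pointer.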

LLVM Instruction Class Does Not Contain LLVM IR Line Number Information

The only line number information that LLVM IR carries refers to the original C source code (through debug information, when it is present). This means that we cannot obtain the line number of an operation within the LLVM IR file itself.

Therefore, although we can analyze the operations one by one, we cannot know on which line of the IR file each operation is located.
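If the IR line number really matters, one possible workaround (an assumption of this note, not something LLVM provides) is to scan the textual .ll file separately and record the line number of every instruction line. Because the instructions appear in the file in the same order as they are visited in the module, the two sequences could then be matched up by position. The helper names trim and instructionLines below are hypothetical.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string &s) {
    const auto b = s.find_first_not_of(" \t\r");
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(" \t\r");
    return s.substr(b, e - b + 1);
}

// Returns (1-based line number, trimmed text) for every line inside a
// function body that is not blank, a comment, or a block label.
// Assumption: each instruction occupies exactly one line.
std::vector<std::pair<int, std::string>>
instructionLines(const std::string &irText) {
    std::vector<std::pair<int, std::string>> out;
    std::istringstream in(irText);
    std::string raw;
    bool inBody = false;
    int lineNo = 0;
    while (std::getline(in, raw)) {
        ++lineNo;
        const std::string line = trim(raw);
        if (!inBody) {
            // A function body starts on a "define ... {" line.
            if (line.compare(0, 7, "define ") == 0 && line.back() == '{')
                inBody = true;
            continue;
        }
        if (line == "}") { inBody = false; continue; }
        if (line.empty() || line[0] == ';') continue;
        // Block labels look like "entry:", possibly followed by a comment.
        const std::string head = trim(line.substr(0, line.find(';')));
        if (!head.empty() && head.back() == ':') continue;
        out.emplace_back(lineNo, line);
    }
    return out;
}
```

The sketch breaks on instructions that span several lines, such as a switch with its case list, so it should be treated as a starting point rather than a reliable mapping.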

Link

The above source code is adapted from the question "How to write a custom intermodular pass in LLVM?" and modified for this question.
