LLVM IR Tutorial: Phis, GEPs, and other things, oh my! - LLVM Developers Conference 2019
date
Jan 13, 2024
slug
llvm-ir-tutorial-llvm-dev-conf-2019
status
Published
tags
LLVM
summary
type
Post
Link to original tutorial slides: https://llvm.org/devmtg/2019-04/slides/Tutorial-Bridgers-LLVM_IR_tutorial.pdf
What is LLVM IR?
- A low-level programming language (RISC-like instruction set)
- while being able to represent high-level ideas (ideas in high-level language can be easily mapped to LLVM IR)
- Enable efficient code optimization.
IR & Compilation Process
- Source codes in different languages can be compiled into LLVM IR bit code form, which can be optimized by the optimizer of LLVM(e.g., opt) and then be linked.
Simplified IR Layout
Modules represent the top-level structure in an LLVM program. An LLVM module is effectively a translation unit or a collection of translation units merged together. (from LLVM documentation)
An Example: a basic main program
main.cpp
corresponding code in LLVM IR (omitting unnecessary parts)
Some observations:
- LLVM IR is a typed language (and no implicit conversion)
- Easily represent high-level ideas such as function.
%var is a virtual register in LLVM IR, which is used to represent a local variable. LLVM IR has infinite virtual registers. (%name or %number)
More Examples: recursive factorial
main.cpp
corresponding LLVM IR
Basic Blocks
The above example contains several basic blocks.
A basic block is a list of non-terminator instructions ending with a terminator instruction: - Branch: “br” - Return: “ret” - Switch: “switch” - Unreachable: “unreachable” - Exception handling instructions
Control Flow Graph
A Control Flow Graph captures the pred-succ relationship between basic blocks, the node set contains all basic blocks, andthe edge set contains all control flow transfer relations.
Every basic block has a label; if you do not explicitly write it, it will be implicitly added.
More Examples: iterative factorial
main.cpp
LLVM IR:
SSA: Single Static Assignment
SSA is a form of IR, with the following restrictions:
- Every variable is assigned exactly once.
- Every variable is defined before it is used.
In the above example, we want to update %temp and %i, but this violates the constraints of SSA, so what should we do?
Phi instruction comes to the rescue!
CFG when using Phis
Another Way to Cheat SSA: the Alloca instruction
The alloca instruction:
- Allocates memory on the stack frame of the executing function.
- Automatically released.
- Plays a big part in generating IR in SSA form.
CFG using alloca:
Global Variables
Allocas allocate memory for function scopes. Global variables fill that role for the module in a static way.
- They are always pointers, like the values returned by Allocas.
- GVs are always constant pointers(will not point to other locations)
LLVM’s Type System
types in LLVM’s type system(from langRef):
Aggregate Type: Array Type
The array type is defined by:
- A constant size. e.g.: @array = global [17 x ]
- An element type. e.g.: @array = global [17 x i8]
- (For GVs) An initializer: e.g.: @array = global [17 x i8] zeroinitializer
Accessing Array & Manipulating Pointers
We can access array elements using GEP instruction, which provides a way to:
- calculate array offset
- abstract details like sizeof type and padding inside structs
Manipulating Pointers using GEPs
here is an example:
if we only provide the 0-th dimension offset, GEP will return a pointer of the base type; if you provide an index in more dimensions, it will return element types.
GEP fundamentals
- Understand the first index:
- It does NOT change the resulting pointer type.
- It offsets by the base type. (may become a bit confusing when the base type is one of the aggregate types)
- Further indices:
- Offset inside aggregate types. (and vectors)
- Change the pointer type by removing one layer of “aggregation”
A common misusage of GEP
GEP offsets by base type if only the first index is provided, so in this case, it offsets by the size of the entire array.
Aggregate Type: the Struct Type
A struct type is defined by:
- A name
- Keyword “type”
- A list of types
- example: %MyStruct = type { i8, i32, [3 x i32] }
GEP with structs
Basically, it’s the same as with the array type. (offset by members), but should always be indexed by constants (so the compiler can figure out types of the elements in a static way)
For example, if the index is not constant, we cannot check whether it could be indexed at third-level. (this is based on the fact that the 2-nd element is an array)
GEP fundamentals - continue
- Struct indices must be constants.
- GEPs never load from memory!
Final Remarks
LLVM IR is constantly evolving.
Covered fundamental topics unlikely to change soon.
- There is a lot more to explore!
Next steps:
- Learn how to manipulate IR using the LLVM library
- Look at the OPT code!