LLVM IR Tutorial: Phis, GEPs, and other things, oh my!

date

Jan 13, 2024

slug

llvm-ir-tutorial-llvm-dev-conf-2019

status

Published

What is LLVM IR?

A low-level programming language (RISC-like instruction set)

while being able to represent high-level ideas (ideas in high-level language can be easily mapped to LLVM IR)

Enable efficient code optimization.

IR & Compilation Process

Source codes in different languages can be compiled into LLVM IR bit code form, which can be optimized by the optimizer of LLVM(e.g., opt) and then be linked.

Simplified IR Layout

Modules represent the top-level structure in an LLVM program. An LLVM module is effectively a translation unit or a collection of translation units merged together. (from LLVM documentation)

An Example: a basic main program

main.cpp

corresponding code in LLVM IR (omitting unnecessary parts)

Some observations:

LLVM IR is a typed language (and no implicit conversion)

Easily represent high-level ideas such as function.

%var is a virtual register in LLVM IR, which is used to represent a local variable. LLVM IR has infinite virtual registers. (%name or %number)

More Examples: recursive factorial

main.cpp

corresponding LLVM IR

Basic Blocks

The above example contains several basic blocks.

A basic block is a list of non-terminator instructions ending with a terminator instruction: - Branch: “br” - Return: “ret” - Switch: “switch” - Unreachable: “unreachable” - Exception handling instructions

Control Flow Graph

A Control Flow Graph captures the pred-succ relationship between basic blocks, the node set contains all basic blocks, andthe edge set contains all control flow transfer relations.

Every basic block has a label; if you do not explicitly write it, it will be implicitly added.

More Examples: iterative factorial

main.cpp

LLVM IR:

SSA: Single Static Assignment

SSA is a form of IR, with the following restrictions:

Every variable is assigned exactly once.

Every variable is defined before it is used.

In the above example, we want to update %temp and %i, but this violates the constraints of SSA, so what should we do?

Phi instruction comes to the rescue!

CFG when using Phis

Another Way to Cheat SSA: the Alloca instruction

The alloca instruction:

Allocates memory on the stack frame of the executing function.

Automatically released.

Plays a big part in generating IR in SSA form.

CFG using alloca:

Global Variables

Allocas allocate memory for function scopes. Global variables fill that role for the module in a static way.

They are always pointers, like the values returned by Allocas.

GVs are always constant pointers(will not point to other locations)

LLVM’s Type System

types in LLVM’s type system(from langRef):

Aggregate Type: Array Type

The array type is defined by:

A constant size. e.g.: @array = global [17 x ]

An element type. e.g.: @array = global [17 x i8]

(For GVs) An initializer: e.g.: @array = global [17 x i8] zeroinitializer

Accessing Array & Manipulating Pointers

We can access array elements using GEP instruction, which provides a way to:

calculate array offset

abstract details like sizeof type and padding inside structs

Manipulating Pointers using GEPs

here is an example:

if we only provide the 0-th dimension offset, GEP will return a pointer of the base type; if you provide an index in more dimensions, it will return element types.

GEP fundamentals

Understand the first index:

It does NOT change the resulting pointer type.

It offsets by the base type. (may become a bit confusing when the base type is one of the aggregate types)

Further indices:

Offset inside aggregate types. (and vectors)

Change the pointer type by removing one layer of “aggregation”

A common misusage of GEP

GEP offsets by base type if only the first index is provided, so in this case, it offsets by the size of the entire array.

Aggregate Type: the Struct Type

A struct type is defined by:

A name

Keyword “type”

A list of types

example: %MyStruct = type { i8, i32, [3 x i32] }

GEP with structs

Basically, it’s the same as with the array type. (offset by members), but should always be indexed by constants (so the compiler can figure out types of the elements in a static way)

For example, if the index is not constant, we cannot check whether it could be indexed at third-level. (this is based on the fact that the 2-nd element is an array)

GEP fundamentals - continue

Struct indices must be constants.

GEPs never load from memory!

Final Remarks

LLVM IR is constantly evolving.

Covered fundamental topics unlikely to change soon.

There is a lot more to explore!

Next steps:

Learn how to manipulate IR using the LLVM library

Look at the OPT code!

LLVM IR Tutorial: Phis, GEPs, and other things, oh my! - LLVM Developers Conference 2019

What is LLVM IR?

IR & Compilation Process

Simplified IR Layout

An Example: a basic main program

More Examples: recursive factorial

Basic Blocks

Control Flow Graph

More Examples: iterative factorial

SSA: Single Static Assignment

Phi instruction comes to the rescue!

Another Way to Cheat SSA: the Alloca instruction

Global Variables

LLVM’s Type System

Aggregate Type: Array Type

Accessing Array & Manipulating Pointers

Manipulating Pointers using GEPs

GEP fundamentals

A common misusage of GEP

Aggregate Type: the Struct Type

GEP with structs

GEP fundamentals - continue

Final Remarks