NIR: A new compiler IR for Mesa

Introduction

NIR (pronounced “ner”) is a new IR (internal representation) for the Mesa shader compiler that will sit between the old IR (GLSL IR) and back-end compilers. The primary purpose of NIR is to be more efficient for doing optimizations and generate better code for the back-ends. We have a lot of optimizations in implemented GLSL IR right now. However, they still generate fairly bad code primarily because its tree-based structure makes writing good optimizations difficult. For this reason, we have implemented a lot of optimizations in the i965 back-end compilers just to fix up the code we get from GLSL IR. The “proper fix” to this is to implement a better high-level IR; enter NIR.

Most of the initial work on NIR including setting up common data structures, helper methods, and a few basic passes was done over the summer by our intern, Connor Abbot. Connor did a fantastic job, but there is still a lot left to be done. I’ve spent the last two months trying to fill in the pieces that we need in order to get NIR off the ground. At this point, we’re at zero piglit regressions and the shader-db numbers aren’t terrible.

A few key points about NIR:

It is primarily an SSA-based IR.
It supports source/destination-modifiers and swizzles.
Standard GPU operations such as sin() and fmad() are first-class ALU operations, not intrinsics.
GLSL concepts like inputs, outputs, uniforms, etc. are built into the IR so we can do proper analysis on them.
Even though it’s SSA, it still has a concept of registers and write-masks in the core IR data structures. This means we can generate code that is much closer to what backends want.

Single Static Assignment (SSA) form

It’s worth spending a minute or two talking about SSA.

Every SSA value is assigned exactly once.
Undefined values have a fake assignment.
Different assignments from divergent control flow are resolved using special instructions called “phi nodes”.

Example 1

Non-SSA form:

int a;
if (foo) {
   a = 1;
} else {
   a = bar();
}
return a;

SSA form:

if (foo) {
   a_1 = 1;
} else {
   a_2 = bar();
}
a_3 = phi(a_1, a_2);
return a_3;

Example 2

Non-SSA form:

int a;
while (foo) {
   a = bar();
}
return a;

SSA form:

a_1 = ssa_undef;
while (foo) {
   a_2 = bar();
}
a_3 = phi(a_1, a_2);
return a_3;

Why SSA?

When writing optimizations you frequently want to ask the question, “At point P in the program, what value is stored in variable X?” This question is hard to answer because it involves asking a lot of subquestions:

When was X last assigned before P?
Is it possible that X was never assigned at all by the time we get to P?
Does the last assignment of X always happen or could it still have the old value?

We also want to ask the question “When is the result of instruction I used?” If the result of I is assigned to X, there are again, a lot of subquestions:

What instructions use X after I?
Is there another instruction that may assign X after I but before P?
Is there another instruction that always assigns X after I but before P?
Is X ever used before it is assigned to again?

SSA for makes all of these questions much simpler because each value is only assigned once and you can go directly from the assignment of a value to the instruction that assigned it.

Example 3

The variable a is always defined in the example below. Think about how you would figure that out with the program in the form given.

int a;
do {
   if (foo) {
      a = 5;
      break;
   }

   if (bar)
      continue;

   a = 6;
} while (!done);
return a;

In SSA form, this is trivial because we can trace the SSA definitions and see that we never hit an undef:

a_1 ssa_undef;
loop {
   if (foo) {
      a_2 = 5;
      break;
   }

   if (bar)
      continue; /* Goes to the if at the end */

   a_3 = 6;
   if (done)
      break;
}
a_4 = phi(a_2, a_3, a_1);
return a_3;

By putting a program into SSA form, we effectively move all of the control-flow analysis of variable definitions into the into-SSA pass and we can stop thinking about it in every optimization. We also have a pass to put it back into a regular non-SSA form before we hand it to the back-end compiler.

Basic Data Structures

All of the basic data structures in NIR are typedef’d C structs:

typedef struct {
   /* stuff */
} nir_foo;

Inheritance is done in the usual C way with structure inclusion. If a structure has subtypes, it has a type enum and each child type includes its parent type as a field. For each subtype, a casting function is created using the NIR_DEFINE_CAST macro for casting from parent to child.

typedef enum {
   nir_foo_type_bar,
   /* Other types */
} nir_foo_type;

typedef struct {
   nir_foo_type foo_type;
   /* stuff */
} nir_foo;

typedef struct {
   nir_foo foo;
   /* stuff */
} nir_foo_bar;

/* NIR_DEFINE_CAST(nir_foo_as_bar, nir_foo, nir_foo_bar, foo) */
static inline nir_foo_bar *
nir_foo_as_bar(nir_foo *parent)
{
   return exec_node_data(nir_foo_bar, parent, foo);
}

Note that the casting function does not first check if parent is, in fact, a nir_foo_bar before casting. This could be changed easily enough if we want “dynamic cast” behavior. Up to this point, it hasn’t been too much of a problem.

NIR datatypes and their children:

What follows is a (mostly complete) list of the core NIR data-structures with subtypes listed in sub-bullets.

nir_variable: Represents a GLSL variable. Mostly a copy-and-paste of ir_variable
nir_deref: Represents a dereference chain for a variable. The nir_deref objects form a singly linked list starting at the variable itself and working its way down the chain of structure and array dereferences.
- nir_deref_var: A variable dereference
- nir_deref_array: A dereference of an array. An array dereference can be one of: “direct”, “indirect”, or “wildcard”. A “wildcard” dereference refers to the every element of the array at the same time and is only allowed in nir_intrinsic_copy_var instructions.
- nir_deref_struct: A dereference of an element of a structure. The element to be dereferenced is denoted by its index in the structure.
nir_register: Non-SSA temporary storage. Registers can be written to multiple times and with write-masks.
nir_ssa_def: An SSA definition. This contains a pointer to the nir_instr that defines the value for easy crawling of expression trees.
nir_src: A source value for an instruction. Can be a register (with a potentially indirect offset) or an SSA definition.
- nir_alu_src: A source for an ALU instruction. ALU sources have source modifiers and swizzles in addition to the regular nir_src.
nir_dest: An instruction destination. Can be a register (with a potentially indirect offset) or an SSA definition. In the case of an SSA definition, the nir_ssa_def is actually embedded in the nir_dest.
- nir_alu_dest: The destination of an ALU instruction. ALU destinations can have the “saturate” destination modifier as well as a write-mask if the destination is a register.
nir_instr: An instruction
- nir_alu_instr: An ALU operation such as fmul, iadd, or fsin.
- nir_call_instr: A function call. Not currently used.
- nir_jump_instr: A jump instruction such as return, break, or continue.
- nir_tex_instr: A texturing operation. This structure contains all of the various bits of data required to figure out sampling, types (int or float), etc.
- nir_intrinsic_instr: An intrinsic. This is anything that isn’t really “special” but isn’t just an ALU operation. Intrinsics can’t, in general, be reordered or eliminated (there are flags to let you know when they can). Also, anything that has anything to do with a nir_variable must be an intrinsic.
- nir_load_const_instr: An instruction that just assigns some piece of constant data to its destination
- nir_ssa_undef_instr: An instruction for creating “fake” definitions of SSA values that aren’t actually defined in the code.
- nir_phi_instr: A phi node.
- nir_parallel_copy_instr: A parallel copy instruction. This crucial for going out of SSA form but you should never see one of these in the wild.
nir_cf_node: A node in the control flow graph. The CFG is explicitly maintained in NIR and each node has an exec_node embedded in it so that it can be placed in a list of control flow nodes.
- nir_block: A basic block. Contains a list of instructions none of which are allowed to be a jump instruction except possibly the last one.
- nir_if: An if statement. Each nir_if must be immediately preceded and succeded by a nir_block. (This ensures that there are no critical edges in the CFG.) Anir_ifcontains two lists of CF nodes: One for the then case and one for the else case. Thenir_ifalso contains anir_src` that is the condition for the if statement. Currently, there is no if instruction; it is built into the CFG.
- nir_loop: An endless loop. Each nir_loop must be immediately preceded and succeded by a nir_block. (This ensures that there are no critical edges in the CFG.) Anir_loop` contains a list of CF nodes that is repeated until you hit a break instruction.
- nir_function_impl: A function implementation. Not to be confused with a nir_function which may have multiple overloads not all of which are implemented. The reason for all of the indirection is to support shader subroutine and linking when the time comes. Fortunately, there are helpers in place that make it not too bad to work with. A nir_function_impl also contains function-local stuff such as local variables and registers and other metadata.
nir_shader: A shader. This may contain one or more functions as well as global registers and variables and other whole-shader type information. Right now, a nir_shader usually only contains one function called “main”, but that may change.

There are a few other data structures but they are mostly for handling function calls and linking which we don’t currently do in NIR.

Declaring NIR opcodes:

For both ALU operations and intrinsics, NIR uses a system of macros and C includes for declaring opcodes and the metadata that goes along with them. The nir_opcodes.h and nir_intrinsics.h header files each have a comment at the top of them describing a macro (not defined in the file) for declaring opcodes. Each opcode is then declared in the file using the given macro.

Then, in nir.h the opcode declaration macro is defined and nir_opcodes.h (or nir_intrinsics.h) is included to generate the enum of opcodes. The nir_opcodes.c and nir_intrinsics.c files contain a different definition of the macro and includes it again to declare the constant run-time metadata structures.

Basic flow of transformations

Before we go into the different transformations in detail, it is good to look at how the code gets from GLSL IR, through NIR, to the back-end code. This example will be taken from the i965 NIR back-end, but it should look fairly similar in other back-ends.

lower_output_reads(): In NIR, output variables are write-only. We could handle this specially, or there’s a pass that GLSL-to-TGSI also uses that lowers outputs to temporary variables with a final write at the end into the output variable. We could have our own pass, but it’s easier to just use the one that already exists.
glsl_to_nir(): Converts the GLSL IR code into NIR code. At this point, all of the information in the program is now in NIR data structures and the GLSL IR could, in theory, be thrown away. The code generated by the glsl_to_nir pass is technically in SSA form. However, it contains a lot of variable load/store intrinsics which we will eventually want to eliminate.
nir_lower_global_vars_to_local: For each global variable, it looks at all of the uses of that variable in every function implementation in the shader. If it is only ever used in one function implementation, it is lowered to a local variable.
nir_split_var_copies: Implements “copy splitting” which is similar to structure splitting only it works on copy operations rather than the datatypes themselves. The GLSL language allows you to copy one variable to another an entire structure (which may contain arrays or other structures) at a time. Normally, in a language such as C this would be handled by a “structure splitting” pass that breaks up the structures. Unfortunately for us, structures used in inputs or outputs can’t be split. Therefore, regardlesss of what we do, we have to be able to copy to/from structures.
Optimization loop: Performs a bunch of different optimizations in a loop. Each optimization pass returns a boolean value to indicate if it made any “progress”. The loop continues to repeat until it goes through a full pass of the loop without making any “progress”. I won’t go every optimization, but some of the key ones are as follows:
1. nir_lower_variables(): This pass (which probably needs a better name) lowers variable load/store intrinsics to SSA values whenever possible. In particular, it analizes whether or not a particular value is ever aliased by an indirect and, if not, lowers it to a SSA value inserting phi nodes where necessary.
2. nir_opt_algebraic(): This pass is capable of doing a variety of algebraic simplifications of expressions. One trivial example is a + 0 -> a. A more complicated example is (-|a| > 0) -> (a == 0) which shows up all the time in GLSL code that has been generated by a DirectX-to-OpenGL translation layer.
3. nir_opt_constant_folding(): Looks for ALU operations whose arguments are constants, computes the resulting constant value, and replaces the expression with a single load_const instruction. The constant folding pass is also capable of folding constants into variable dereference indirects. This way, after the constant folding pass, things that were once indirect variable dereferences are now direct and we may be able to lower them to SSA.
Lowering passes: These lower NIR concepts to things that are easier for the backend compiler to handle. The ones used by the i965 backend are:
1. nir_lower_locals_to_regs(): This pass lowers local variables to nir_register values. Again, this keeps the backend from having to worry about deref chains.
2. Dereference lowering. These passes all do basically the same thing: lower an instruction that uses a variable dereference chain to one that doesn’t. This way the backends don’t have to worry about crawling dereferences and can just work with indices into buffers and the occational indirect offset. These passes include:
  - nir_lower_io()
  - nir_lower_samplers()
  - nir_lower_system_values()
  - nir_lower_atomics()
3. nir_lower_to_source_mods(): Because it makes algebraic optimizations easier, we don’t actually emit source or destination modifiers in glsl_to_nir. Instead, we emit [fi]neg, [fi]abs, and fsat ALU instructions and work with those during optimization. After all the optimizations are done, we then lower these to source/destination modifiers.
4. nir_convert_from_ssa(): This pass takes the shader out of SSA form and into a more conventional form using nir_register. In general, going out of SSA form is very hard to do right. If other backends don’t want to be SSA, we’d rather have the out-of-SSA implementation in one place so they don’t get it wrong.
1. nir_lower_vec_to_movs(): Lowers nir vecN() operations to a series of moves with writemask.
nir_validate(): This very useful pass does nothing but check a pile of different invariants of the IR to ensure that nothing is broken. It should get run after every optimization pass when built in debug mode to ensure that we aren’t breaking any invariants as we go.
NIR to FS: Finally, we convert to the low-level FS IR that we use to then create programs to run on the i965 hardware.

In and out of SSA

One of the most important part of being able to use SSA form is going in and out of it efficiently. Going in and out of SSA naively isn’t hard but it generates terrible code. Going in and out competently is much harder but necessary if we want decent programs. For example, just going to and from SSA badly took one shader from 300 generated instructions to 800 with a register pressure of a couple hundred in the middle. Therefore, how we go to/from SSA is important

Going into SSA form

In NIR, there are two ways to go into SSA form. One is to first convert to using registers and call nir_to_ssa(). The other is what we do in i965 where we generate SSA+load/store and then call nir_lower_variables() I’ll walk through the nir_lower_variables() pass as it is the more complicated of the two, but they both follow the same algorithm for placing phi nodes etc.

Collect information about all of the variable dereference. This builds a data structure which I have called a “deref forest” which contains information about every possible variable dereference and where it is used. This is similar to the use/def information for scalar registers but is done on a per-deref basis so that you can diferentiate between different elements of a structure or different offsets in an array.
Determine what values can be lowered to SSA values. At the moment, this is a simle heuristic which only lowers a value if it can never be referenced indirectly. The deref forest makes it fairly easy to find these.
Place phi nodes at the iterated dominance frontier of the definitions. This follows the algorithm presented by Cytron et. al. in “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.” It places phi nodes at mimal locations where they are needed in order to resolve merge points in the CFG.
Variable renumbering. This is a pass that walk the list of instructions and replaces each store operation with an ssa definition and each load operation with aread from the closest SSA definition that reaches that instruction. Since we have already added Phi nodes, this can be done with a simple stack-based depth-first search algorithm.

In the nir_to_ssa() pass, we can insert phi nodes where all of the sources and destinations are the register we are trying to take into SSA form. Because we can’t use variables as sources or destinations of phi nodes, we fake it in the nir_lower_variables() pass with a hash table, but the principle is the same.

Going out of SSA form

Going out of SSA is even trickier than going into SSA. The pass we have right now would be more-or-less state-of-the-art in a scalar world. Unfortunately, vectors complicate things so we could probably be doing a lot better.

The pass we have now is based on “Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency” by Boissinot et. al. That is the only out-of-SSA paper I recommend you ever read as the others I’ve found are completely bonkers and rely on impossible data structures. The basic process is as follows:

Isolate phi nodes. This process inserts a bunch of “parallel copy” instructions at the beinning and ends of basic blocks that ensure SSA values used as sources and destinations of phi nodes are only ever used once. (They are only defined once since they are SSA values.) This prevents many of the technical difficulties of going out of SSA form
Aggressive register coalescing. This process starts by putting assigning all of the sources and destinations of a given phi node to the same register. It then tries to, one at a time, eliminate the parallel copies by assigning the source and destination of the copy to the same register. The entire time, it keeps track of the interferences and makes sure that nothing gets clobbered. This is done using a concept called a “dominance forest” which was first put forth by Budimlic et. al. in “Fast Copy Coalescing and Live-Range Identification”. Go read the papers for more info.
Assign registers. Now that we have figured out what registers to use for the phi nodes, we can assign registers to all of the other SSA values as well.
Resolve the parallel copies. A parallel copy operation takes a bunch of values and copies them to a bunch of other locations all at the same time. In this way, a single parallel copy can be used to shuffle arbitrarily many values around in an atomic fassion. Obviously, no one implements a parallel copy instruction in hardware, so this has to be lowered to a sequence of move operations.

Algebraic optimizations

For doing algebraic optimizations and lowering passes, we have an infrastructure that allows you to easily perform search/replace operations on expressions. This consists of two data-structures: nir_search_expression and nir_search_value and a function nir_replace_instr() which checks if the given instruction matches the search expression and, if it does, replaces it with a new instruction (or chain of instructions) according to the given replacement value. The framework automatically handles searching expression trees and dealing with swizzles for you so you don’t have to think about it.

Because creating these data structures is cumbersome, we have a bit if python/mako magic that auto-generates the structures and a function that walks the shader doing the search-and-replace as it goes. The function generated has a first-order switch statement so it only calls nir_replace_instr() if it knows that at least the instructions match. The search and replace expressions are written in a little language using python tuples. From nir_opt_algebraic.py:

# Convenience variables
a = 'a'
b = 'b'
c = 'c'
d = 'd'

# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value.  An expression is
# defined as a tuple of the form (<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value.  A value can be
# either a numeric constant or a string representing a variable name.  For
# constants, you have to be careful to make sure that it is the right type
# because python is unaware of the source and destination types of the
# opcodes.

optimizations = [
   (('fadd', a, 0.0), a),
   (('iadd', a, 0), a),
   (('fmul', a, 0.0), 0.0),
   (('imul', a, 0), 0),
   (('fmul', a, 1.0), a),
   (('imul', a, 1), a),
   (('ffma', 0.0, a, b), b),
   (('ffma', a, 0.0, b), b),
   (('ffma', a, b, 0.0), ('fmul', a, b)),
   (('flrp', a, b, 0.0), a),
   (('flrp', a, b, 1.0), b),
   (('flrp', a, a, b), a),
   (('flrp', 0.0, a, b), ('fmul', a, b)),
   (('fadd', ('fmul', a, b), c), ('ffma', a, b, c)),
   (('fge', ('fneg', ('fabs', a)), 0.0), ('feq', a, 0.0)),
   (('fmin', ('fmax', a, 1.0), 0.0), ('fsat', a)),
# This one may not be exact
   (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
]

As you can see, that makes adding new algebraic optimizations stupid easy. The framework can also be used for writing lowering passes to for things like lowering sat to min/max.

Metadata and analysis

Many of the optimization/lowering passes require different bits if metadata that are provided by different analysis passes. Right now, we don’t have much of this, but we do have some. As time goes on and we add things like value numbering, the amount of metadata we have will increase. In order to manage this, we have a simple metadata system consisting of an enum and two functions:

nir_metadata_require(): Declares that the given metadata (an OR of enum values) is required. The function automatically calls all of the required analysis passes for you and, upon its return, the requested metadata is available and current.
nir_metadata_preserve(): Called to declare what metadata (if any) was preserved by the given pass. If the pass didn’t touch anything, it doesn’t need to call this function. However, if it adds/removes instructions or modifies the CFG in any way, it needs to call nir_metadata_preserve(). The nir_metadata_preserve() function takes an OR of all of the bits of metadata that are preserved. That way as new metadata gets added, we don’t have to update every optimization pass to dirty it.

Unfortunately, we have no way to automatically dirty everything if you don’t call nir_metadata_preserve(). So shame on you if you forget it.

Faith Ekstrand

Navigation ↵