NIR: A new compiler IR for Mesa

Introduction

NIR (pronounced “ner”) is a new IR (internal representation) for the Mesa shader compiler that will sit between the old IR (GLSL IR) and back-end compilers. The primary purpose of NIR is to be more efficient for doing optimizations and generate better code for the back-ends. We have a lot of optimizations in implemented GLSL IR right now. However, they still generate fairly bad code primarily because its tree-based structure makes writing good optimizations difficult. For this reason, we have implemented a lot of optimizations in the i965 back-end compilers just to fix up the code we get from GLSL IR. The “proper fix” to this is to implement a better high-level IR; enter NIR.

Most of the initial work on NIR including setting up common data structures, helper methods, and a few basic passes was done over the summer by our intern, Connor Abbot. Connor did a fantastic job, but there is still a lot left to be done. I’ve spent the last two months trying to fill in the pieces that we need in order to get NIR off the ground. At this point, we’re at zero piglit regressions and the shader-db numbers aren’t terrible.

A few key points about NIR:

  1. It is primarily an SSA-based IR.
  2. It supports source/destination-modifiers and swizzles.
  3. Standard GPU operations such as sin() and fmad() are first-class ALU operations, not intrinsics.
  4. GLSL concepts like inputs, outputs, uniforms, etc. are built into the IR so we can do proper analysis on them.
  5. Even though it’s SSA, it still has a concept of registers and write-masks in the core IR data structures. This means we can generate code that is much closer to what backends want.

Single Static Assignment (SSA) form

It’s worth spending a minute or two talking about SSA.

  1. Every SSA value is assigned exactly once.
  2. Undefined values have a fake assignment.
  3. Different assignments from divergent control flow are resolved using special instructions called “phi nodes”.

Example 1

Non-SSA form:

int a;
if (foo) {
   a = 1;
} else {
   a = bar();
}
return a;

SSA form:

if (foo) {
   a_1 = 1;
} else {
   a_2 = bar();
}
a_3 = phi(a_1, a_2);
return a_3;

Example 2

Non-SSA form:

int a;
while (foo) {
   a = bar();
}
return a;

SSA form:

a_1 = ssa_undef;
while (foo) {
   a_2 = bar();
}
a_3 = phi(a_1, a_2);
return a_3;

Why SSA?

When writing optimizations you frequently want to ask the question, “At point P in the program, what value is stored in variable X?” This question is hard to answer because it involves asking a lot of subquestions:

  1. When was X last assigned before P?
  2. Is it possible that X was never assigned at all by the time we get to P?
  3. Does the last assignment of X always happen or could it still have the old value?

We also want to ask the question “When is the result of instruction I used?” If the result of I is assigned to X, there are again, a lot of subquestions:

  1. What instructions use X after I?
  2. Is there another instruction that may assign X after I but before P?
  3. Is there another instruction that always assigns X after I but before P?
  4. Is X ever used before it is assigned to again?

SSA for makes all of these questions much simpler because each value is only assigned once and you can go directly from the assignment of a value to the instruction that assigned it.

Example 3

The variable a is always defined in the example below. Think about how you would figure that out with the program in the form given.

int a;
do {
   if (foo) {
      a = 5;
      break;
   }

   if (bar)
      continue;

   a = 6;
} while (!done);
return a;

In SSA form, this is trivial because we can trace the SSA definitions and see that we never hit an undef:

a_1 ssa_undef;
loop {
   if (foo) {
      a_2 = 5;
      break;
   }

   if (bar)
      continue; /* Goes to the if at the end */

   a_3 = 6;
   if (done)
      break;
}
a_4 = phi(a_2, a_3, a_1);
return a_3;

By putting a program into SSA form, we effectively move all of the control-flow analysis of variable definitions into the into-SSA pass and we can stop thinking about it in every optimization. We also have a pass to put it back into a regular non-SSA form before we hand it to the back-end compiler.

Basic Data Structures

All of the basic data structures in NIR are typedef’d C structs:

typedef struct {
   /* stuff */
} nir_foo;

Inheritance is done in the usual C way with structure inclusion. If a structure has subtypes, it has a type enum and each child type includes its parent type as a field. For each subtype, a casting function is created using the NIR_DEFINE_CAST macro for casting from parent to child.

typedef enum {
   nir_foo_type_bar,
   /* Other types */
} nir_foo_type;

typedef struct {
   nir_foo_type foo_type;
   /* stuff */
} nir_foo;

typedef struct {
   nir_foo foo;
   /* stuff */
} nir_foo_bar;

/* NIR_DEFINE_CAST(nir_foo_as_bar, nir_foo, nir_foo_bar, foo) */
static inline nir_foo_bar *
nir_foo_as_bar(nir_foo *parent)
{
   return exec_node_data(nir_foo_bar, parent, foo);
}

Note that the casting function does not first check if parent is, in fact, a nir_foo_bar before casting. This could be changed easily enough if we want “dynamic cast” behavior. Up to this point, it hasn’t been too much of a problem.

NIR datatypes and their children:

What follows is a (mostly complete) list of the core NIR data-structures with subtypes listed in sub-bullets.

There are a few other data structures but they are mostly for handling function calls and linking which we don’t currently do in NIR.

Declaring NIR opcodes:

For both ALU operations and intrinsics, NIR uses a system of macros and C includes for declaring opcodes and the metadata that goes along with them. The nir_opcodes.h and nir_intrinsics.h header files each have a comment at the top of them describing a macro (not defined in the file) for declaring opcodes. Each opcode is then declared in the file using the given macro.

Then, in nir.h the opcode declaration macro is defined and nir_opcodes.h (or nir_intrinsics.h) is included to generate the enum of opcodes. The nir_opcodes.c and nir_intrinsics.c files contain a different definition of the macro and includes it again to declare the constant run-time metadata structures.

Basic flow of transformations

Before we go into the different transformations in detail, it is good to look at how the code gets from GLSL IR, through NIR, to the back-end code. This example will be taken from the i965 NIR back-end, but it should look fairly similar in other back-ends.

  1. lower_output_reads(): In NIR, output variables are write-only. We could handle this specially, or there’s a pass that GLSL-to-TGSI also uses that lowers outputs to temporary variables with a final write at the end into the output variable. We could have our own pass, but it’s easier to just use the one that already exists.

  2. glsl_to_nir(): Converts the GLSL IR code into NIR code. At this point, all of the information in the program is now in NIR data structures and the GLSL IR could, in theory, be thrown away. The code generated by the glsl_to_nir pass is technically in SSA form. However, it contains a lot of variable load/store intrinsics which we will eventually want to eliminate.

  3. nir_lower_global_vars_to_local: For each global variable, it looks at all of the uses of that variable in every function implementation in the shader. If it is only ever used in one function implementation, it is lowered to a local variable.

  4. nir_split_var_copies: Implements “copy splitting” which is similar to structure splitting only it works on copy operations rather than the datatypes themselves. The GLSL language allows you to copy one variable to another an entire structure (which may contain arrays or other structures) at a time. Normally, in a language such as C this would be handled by a “structure splitting” pass that breaks up the structures. Unfortunately for us, structures used in inputs or outputs can’t be split. Therefore, regardlesss of what we do, we have to be able to copy to/from structures.

  5. Optimization loop: Performs a bunch of different optimizations in a loop. Each optimization pass returns a boolean value to indicate if it made any “progress”. The loop continues to repeat until it goes through a full pass of the loop without making any “progress”. I won’t go every optimization, but some of the key ones are as follows:

    1. nir_lower_variables(): This pass (which probably needs a better name) lowers variable load/store intrinsics to SSA values whenever possible. In particular, it analizes whether or not a particular value is ever aliased by an indirect and, if not, lowers it to a SSA value inserting phi nodes where necessary.

    2. nir_opt_algebraic(): This pass is capable of doing a variety of algebraic simplifications of expressions. One trivial example is a + 0 -> a. A more complicated example is (-|a| > 0) -> (a == 0) which shows up all the time in GLSL code that has been generated by a DirectX-to-OpenGL translation layer.

    3. nir_opt_constant_folding(): Looks for ALU operations whose arguments are constants, computes the resulting constant value, and replaces the expression with a single load_const instruction. The constant folding pass is also capable of folding constants into variable dereference indirects. This way, after the constant folding pass, things that were once indirect variable dereferences are now direct and we may be able to lower them to SSA.

  6. Lowering passes: These lower NIR concepts to things that are easier for the backend compiler to handle. The ones used by the i965 backend are:

    1. nir_lower_locals_to_regs(): This pass lowers local variables to nir_register values. Again, this keeps the backend from having to worry about deref chains.

    2. Dereference lowering. These passes all do basically the same thing: lower an instruction that uses a variable dereference chain to one that doesn’t. This way the backends don’t have to worry about crawling dereferences and can just work with indices into buffers and the occational indirect offset. These passes include:

      • nir_lower_io()
      • nir_lower_samplers()
      • nir_lower_system_values()
      • nir_lower_atomics()
    3. nir_lower_to_source_mods(): Because it makes algebraic optimizations easier, we don’t actually emit source or destination modifiers in glsl_to_nir. Instead, we emit [fi]neg, [fi]abs, and fsat ALU instructions and work with those during optimization. After all the optimizations are done, we then lower these to source/destination modifiers.

    4. nir_convert_from_ssa(): This pass takes the shader out of SSA form and into a more conventional form using nir_register. In general, going out of SSA form is very hard to do right. If other backends don’t want to be SSA, we’d rather have the out-of-SSA implementation in one place so they don’t get it wrong.

    1. nir_lower_vec_to_movs(): Lowers nir vecN() operations to a series of moves with writemask.
  7. nir_validate(): This very useful pass does nothing but check a pile of different invariants of the IR to ensure that nothing is broken. It should get run after every optimization pass when built in debug mode to ensure that we aren’t breaking any invariants as we go.

  8. NIR to FS: Finally, we convert to the low-level FS IR that we use to then create programs to run on the i965 hardware.

In and out of SSA

One of the most important part of being able to use SSA form is going in and out of it efficiently. Going in and out of SSA naively isn’t hard but it generates terrible code. Going in and out competently is much harder but necessary if we want decent programs. For example, just going to and from SSA badly took one shader from 300 generated instructions to 800 with a register pressure of a couple hundred in the middle. Therefore, how we go to/from SSA is important

Going into SSA form

In NIR, there are two ways to go into SSA form. One is to first convert to using registers and call nir_to_ssa(). The other is what we do in i965 where we generate SSA+load/store and then call nir_lower_variables() I’ll walk through the nir_lower_variables() pass as it is the more complicated of the two, but they both follow the same algorithm for placing phi nodes etc.

  1. Collect information about all of the variable dereference. This builds a data structure which I have called a “deref forest” which contains information about every possible variable dereference and where it is used. This is similar to the use/def information for scalar registers but is done on a per-deref basis so that you can diferentiate between different elements of a structure or different offsets in an array.

  2. Determine what values can be lowered to SSA values. At the moment, this is a simle heuristic which only lowers a value if it can never be referenced indirectly. The deref forest makes it fairly easy to find these.

  3. Place phi nodes at the iterated dominance frontier of the definitions. This follows the algorithm presented by Cytron et. al. in “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.” It places phi nodes at mimal locations where they are needed in order to resolve merge points in the CFG.

  4. Variable renumbering. This is a pass that walk the list of instructions and replaces each store operation with an ssa definition and each load operation with aread from the closest SSA definition that reaches that instruction. Since we have already added Phi nodes, this can be done with a simple stack-based depth-first search algorithm.

In the nir_to_ssa() pass, we can insert phi nodes where all of the sources and destinations are the register we are trying to take into SSA form. Because we can’t use variables as sources or destinations of phi nodes, we fake it in the nir_lower_variables() pass with a hash table, but the principle is the same.

Going out of SSA form

Going out of SSA is even trickier than going into SSA. The pass we have right now would be more-or-less state-of-the-art in a scalar world. Unfortunately, vectors complicate things so we could probably be doing a lot better.

The pass we have now is based on “Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency” by Boissinot et. al. That is the only out-of-SSA paper I recommend you ever read as the others I’ve found are completely bonkers and rely on impossible data structures. The basic process is as follows:

  1. Isolate phi nodes. This process inserts a bunch of “parallel copy” instructions at the beinning and ends of basic blocks that ensure SSA values used as sources and destinations of phi nodes are only ever used once. (They are only defined once since they are SSA values.) This prevents many of the technical difficulties of going out of SSA form

  2. Aggressive register coalescing. This process starts by putting assigning all of the sources and destinations of a given phi node to the same register. It then tries to, one at a time, eliminate the parallel copies by assigning the source and destination of the copy to the same register. The entire time, it keeps track of the interferences and makes sure that nothing gets clobbered. This is done using a concept called a “dominance forest” which was first put forth by Budimlic et. al. in “Fast Copy Coalescing and Live-Range Identification”. Go read the papers for more info.

  3. Assign registers. Now that we have figured out what registers to use for the phi nodes, we can assign registers to all of the other SSA values as well.

  4. Resolve the parallel copies. A parallel copy operation takes a bunch of values and copies them to a bunch of other locations all at the same time. In this way, a single parallel copy can be used to shuffle arbitrarily many values around in an atomic fassion. Obviously, no one implements a parallel copy instruction in hardware, so this has to be lowered to a sequence of move operations.

Algebraic optimizations

For doing algebraic optimizations and lowering passes, we have an infrastructure that allows you to easily perform search/replace operations on expressions. This consists of two data-structures: nir_search_expression and nir_search_value and a function nir_replace_instr() which checks if the given instruction matches the search expression and, if it does, replaces it with a new instruction (or chain of instructions) according to the given replacement value. The framework automatically handles searching expression trees and dealing with swizzles for you so you don’t have to think about it.

Because creating these data structures is cumbersome, we have a bit if python/mako magic that auto-generates the structures and a function that walks the shader doing the search-and-replace as it goes. The function generated has a first-order switch statement so it only calls nir_replace_instr() if it knows that at least the instructions match. The search and replace expressions are written in a little language using python tuples. From nir_opt_algebraic.py:

# Convenience variables
a = 'a'
b = 'b'
c = 'c'
d = 'd'

# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value.  An expression is
# defined as a tuple of the form (<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value.  A value can be
# either a numeric constant or a string representing a variable name.  For
# constants, you have to be careful to make sure that it is the right type
# because python is unaware of the source and destination types of the
# opcodes.

optimizations = [
   (('fadd', a, 0.0), a),
   (('iadd', a, 0), a),
   (('fmul', a, 0.0), 0.0),
   (('imul', a, 0), 0),
   (('fmul', a, 1.0), a),
   (('imul', a, 1), a),
   (('ffma', 0.0, a, b), b),
   (('ffma', a, 0.0, b), b),
   (('ffma', a, b, 0.0), ('fmul', a, b)),
   (('flrp', a, b, 0.0), a),
   (('flrp', a, b, 1.0), b),
   (('flrp', a, a, b), a),
   (('flrp', 0.0, a, b), ('fmul', a, b)),
   (('fadd', ('fmul', a, b), c), ('ffma', a, b, c)),
   (('fge', ('fneg', ('fabs', a)), 0.0), ('feq', a, 0.0)),
   (('fmin', ('fmax', a, 1.0), 0.0), ('fsat', a)),
# This one may not be exact
   (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
]

As you can see, that makes adding new algebraic optimizations stupid easy. The framework can also be used for writing lowering passes to for things like lowering sat to min/max.

Metadata and analysis

Many of the optimization/lowering passes require different bits if metadata that are provided by different analysis passes. Right now, we don’t have much of this, but we do have some. As time goes on and we add things like value numbering, the amount of metadata we have will increase. In order to manage this, we have a simple metadata system consisting of an enum and two functions:

Unfortunately, we have no way to automatically dirty everything if you don’t call nir_metadata_preserve(). So shame on you if you forget it.