NIR (pronounced “ner”) is a new IR (intermediate representation) for the Mesa shader compiler that will sit between the old IR (GLSL IR) and the back-end compilers. The primary purpose of NIR is to be more efficient for doing optimizations and to generate better code for the back-ends. We have a lot of optimizations implemented in GLSL IR right now. However, they still generate fairly bad code, primarily because its tree-based structure makes writing good optimizations difficult. For this reason, we have implemented a lot of optimizations in the i965 back-end compilers just to fix up the code we get from GLSL IR. The “proper fix” is to implement a better high-level IR; enter NIR.
Most of the initial work on NIR, including setting up common data structures, helper methods, and a few basic passes, was done over the summer by our intern, Connor Abbott. Connor did a fantastic job, but there is still a lot left to be done. I’ve spent the last two months trying to fill in the pieces that we need in order to get NIR off the ground. At this point, we’re at zero piglit regressions and the shader-db numbers aren’t terrible.
A few key points about NIR:
It’s worth spending a minute or two talking about SSA.
Non-SSA form:

    int a;
    if (foo) {
        a = 1;
    } else {
        a = bar();
    }
    return a;
SSA form:

    if (foo) {
        a_1 = 1;
    } else {
        a_2 = bar();
    }
    a_3 = phi(a_1, a_2);
    return a_3;
Non-SSA form:

    int a;
    while (foo) {
        a = bar();
    }
    return a;
SSA form:

    a_1 = ssa_undef;
    while (foo) {
        a_2 = bar();
    }
    a_3 = phi(a_1, a_2);
    return a_3;
When writing optimizations you frequently want to ask the question, “At point P in the program, what value is stored in variable X?” This question is hard to answer because it involves a lot of subquestions about which writes to X can actually reach point P. We also want to ask the question, “When is the result of instruction I used?” If the result of I is assigned to X, there are, again, a lot of subquestions about which reads of X actually see the value that I produced.
SSA form makes all of these questions much simpler because each value is assigned exactly once and you can go directly from the use of a value to the instruction that defined it.
The variable a is always defined in the example below. Think about how you would figure that out with the program in the form given.
    int a;
    do {
        if (foo) {
            a = 5;
            break;
        }
        if (bar)
            continue;
        a = 6;
    } while (!done);
    return a;
In SSA form, this is trivial because we can trace the SSA definitions and see that we never hit an undef:

    a_1 = ssa_undef;
    loop {
        if (foo) {
            a_2 = 5;
            break;
        }
        if (bar)
            continue; /* Goes to the if at the end */
        a_3 = 6;
        if (done)
            break;
    }
    a_4 = phi(a_2, a_3, a_1);
    return a_4;
By putting a program into SSA form, we effectively move all of the control-flow analysis of variable definitions into the into-SSA pass and we can stop thinking about it in every optimization. We also have a pass to put it back into a regular non-SSA form before we hand it to the back-end compiler.
All of the basic data structures in NIR are typedef’d C structs:
    typedef struct {
        /* stuff */
    } nir_foo;
Inheritance is done in the usual C way with structure inclusion. If a structure has subtypes, it has a type enum and each child type includes its parent type as a field. For each subtype, a casting function is created using the NIR_DEFINE_CAST macro for casting from parent to child.
    typedef enum {
        nir_foo_type_bar,
        /* Other types */
    } nir_foo_type;

    typedef struct {
        nir_foo_type foo_type;
        /* stuff */
    } nir_foo;

    typedef struct {
        nir_foo foo;
        /* stuff */
    } nir_foo_bar;

    /* NIR_DEFINE_CAST(nir_foo_as_bar, nir_foo, nir_foo_bar, foo) */
    static inline nir_foo_bar *
    nir_foo_as_bar(nir_foo *parent)
    {
        return exec_node_data(nir_foo_bar, parent, foo);
    }
Note that the casting function does not first check whether parent is, in fact, a nir_foo_bar before casting. This could be changed easily enough if we want “dynamic cast” behavior. Up to this point, it hasn’t been too much of a problem.
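If we ever did want that “dynamic cast” behavior, it could be layered on top of the same structure-inclusion scheme by checking the type enum first. Here is a minimal self-contained sketch; the stripped-down types, the node_data macro, and the _checked helper are all hypothetical illustrations, not NIR’s actual API:

```c
#include <stddef.h>

/* container_of-style macro, equivalent in spirit to exec_node_data() */
#define node_data(type, node, field) \
    ((type *)((char *)(node) - offsetof(type, field)))

typedef enum {
    nir_foo_type_bar,
    nir_foo_type_baz,
} nir_foo_type;

typedef struct {
    nir_foo_type foo_type;
} nir_foo;

typedef struct {
    nir_foo foo;
    int payload;
} nir_foo_bar;

/* Unchecked cast, as NIR_DEFINE_CAST would generate it */
static inline nir_foo_bar *
nir_foo_as_bar(nir_foo *parent)
{
    return node_data(nir_foo_bar, parent, foo);
}

/* Hypothetical "dynamic cast" variant: verify the type enum first and
 * return NULL on a mismatch instead of blindly casting. */
static inline nir_foo_bar *
nir_foo_as_bar_checked(nir_foo *parent)
{
    if (parent == NULL || parent->foo_type != nir_foo_type_bar)
        return NULL;
    return nir_foo_as_bar(parent);
}
```

The cost is one extra branch per cast, which is why the unchecked version has been good enough so far.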
What follows is a (mostly complete) list of the core NIR data-structures with subtypes listed in sub-bullets.
- nir_variable: Represents a GLSL variable. Mostly a copy-and-paste of ir_variable.
- nir_deref: Represents a dereference chain for a variable. The nir_deref objects form a singly linked list starting at the variable itself and working its way down the chain of structure and array dereferences.
  - nir_deref_var: A variable dereference.
  - nir_deref_array: A dereference of an array. An array dereference can be one of “direct”, “indirect”, or “wildcard”. A “wildcard” dereference refers to every element of the array at the same time and is only allowed in nir_intrinsic_copy_var instructions.
  - nir_deref_struct: A dereference of an element of a structure. The element to be dereferenced is denoted by its index in the structure.
- nir_register: Non-SSA temporary storage. Registers can be written to multiple times and with write-masks.
- nir_ssa_def: An SSA definition. This contains a pointer to the nir_instr that defines the value for easy crawling of expression trees.
- nir_src: A source value for an instruction. Can be a register (with a potentially indirect offset) or an SSA definition.
  - nir_alu_src: A source for an ALU instruction. ALU sources have source modifiers and swizzles in addition to the regular nir_src.
- nir_dest: An instruction destination. Can be a register (with a potentially indirect offset) or an SSA definition. In the case of an SSA definition, the nir_ssa_def is actually embedded in the nir_dest.
  - nir_alu_dest: The destination of an ALU instruction. ALU destinations can have the “saturate” destination modifier as well as a write-mask if the destination is a register.
- nir_instr: An instruction.
  - nir_alu_instr: An ALU operation such as fmul, iadd, or fsin.
  - nir_call_instr: A function call. Not currently used.
  - nir_jump_instr: A jump instruction such as return, break, or continue.
  - nir_tex_instr: A texturing operation. This structure contains all of the various bits of data required to figure out sampling, types (int or float), etc.
  - nir_intrinsic_instr: An intrinsic. This is anything that isn’t really “special” but isn’t just an ALU operation. Intrinsics can’t, in general, be reordered or eliminated (there are flags to let you know when they can). Also, anything that has anything to do with a nir_variable must be an intrinsic.
  - nir_load_const_instr: An instruction that just assigns some piece of constant data to its destination.
  - nir_ssa_undef_instr: An instruction for creating “fake” definitions of SSA values that aren’t actually defined in the code.
  - nir_phi_instr: A phi node.
  - nir_parallel_copy_instr: A parallel copy instruction. This is crucial for going out of SSA form, but you should never see one of these in the wild.
- nir_cf_node: A node in the control flow graph. The CFG is explicitly maintained in NIR and each node has an exec_node embedded in it so that it can be placed in a list of control flow nodes.
  - nir_block: A basic block. Contains a list of instructions, none of which are allowed to be a jump instruction except possibly the last one.
  - nir_if: An if statement. Each nir_if must be immediately preceded and succeeded by a nir_block. (This ensures that there are no critical edges in the CFG.) A nir_if contains two lists of CF nodes: one for the then case and one for the else case. The nir_if also contains a nir_src that is the condition for the if statement. Currently, there is no if instruction; it is built into the CFG.
  - nir_loop: An endless loop. Each nir_loop must be immediately preceded and succeeded by a nir_block. (This ensures that there are no critical edges in the CFG.) A nir_loop contains a list of CF nodes that is repeated until you hit a break instruction.
  - nir_function_impl: A function implementation. Not to be confused with a nir_function, which may have multiple overloads, not all of which are implemented. The reason for all of the indirection is to support shader subroutines and linking when the time comes. Fortunately, there are helpers in place that make it not too bad to work with. A nir_function_impl also contains function-local stuff such as local variables, registers, and other metadata.
- nir_shader: A shader. This may contain one or more functions as well as global registers and variables and other whole-shader information. Right now, a nir_shader usually only contains one function called “main”, but that may change.
There are a few other data structures but they are mostly for handling function calls and linking which we don’t currently do in NIR.
For both ALU operations and intrinsics, NIR uses a system of macros
and C includes for declaring opcodes and the metadata that goes along
with them. The nir_opcodes.h
and
nir_intrinsics.h
header files each have a comment at the
top of them describing a macro (not defined in the file) for declaring
opcodes. Each opcode is then declared in the file using the given
macro.
Then, in nir.h the opcode declaration macro is defined and
nir_opcodes.h
(or nir_intrinsics.h
) is
included to generate the enum of opcodes. The nir_opcodes.c
and nir_intrinsics.c
files contain a different definition
of the macro and includes it again to declare the constant run-time
metadata structures.
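This is the classic C “X-macro” pattern. A toy version of the same trick (the opcode names and metadata fields here are illustrative; the real headers define far more metadata, and the list lives in its own file that gets included twice rather than in a macro):

```c
/* In NIR this list lives in nir_opcodes.h and is #included twice;
 * for a one-file example we put it in a macro instead. */
#define TOY_OPCODES(OP) \
    OP(fadd, 2)         \
    OP(fmul, 2)         \
    OP(fneg, 1)

/* First expansion: generate the enum of opcodes. */
#define DECL_ENUM(name, num_srcs) toy_op_##name,
typedef enum {
    TOY_OPCODES(DECL_ENUM)
    toy_num_opcodes
} toy_op;
#undef DECL_ENUM

/* Second expansion: generate the constant run-time metadata table. */
typedef struct {
    const char *name;
    unsigned num_srcs;
} toy_op_info;

#define DECL_INFO(name, srcs) [toy_op_##name] = { #name, srcs },
static const toy_op_info toy_op_infos[toy_num_opcodes] = {
    TOY_OPCODES(DECL_INFO)
};
#undef DECL_INFO
```

Adding an opcode to the list automatically keeps the enum and the metadata table in sync, which is the whole point of the scheme.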
Before we go into the different transformations in detail, it is good to look at how the code gets from GLSL IR, through NIR, to the back-end code. This example will be taken from the i965 NIR back-end, but it should look fairly similar in other back-ends.
- lower_output_reads(): In NIR, output variables are write-only. We could have handled this specially, but there’s a pass that GLSL-to-TGSI also uses that lowers outputs to temporary variables with a final write into the output variable at the end. We could have our own pass, but it’s easier to just use the one that already exists.
- glsl_to_nir(): Converts the GLSL IR code into NIR code. At this point, all of the information in the program is now in NIR data structures and the GLSL IR could, in theory, be thrown away. The code generated by the glsl_to_nir pass is technically in SSA form. However, it contains a lot of variable load/store intrinsics which we will eventually want to eliminate.
- nir_lower_global_vars_to_local(): For each global variable, this pass looks at all of the uses of that variable in every function implementation in the shader. If it is only ever used in one function implementation, it is lowered to a local variable.
- nir_split_var_copies(): Implements “copy splitting”, which is similar to structure splitting except that it works on copy operations rather than on the datatypes themselves. The GLSL language allows you to copy an entire structure (which may contain arrays or other structures) from one variable to another at a time. Normally, in a language such as C, this would be handled by a “structure splitting” pass that breaks up the structures. Unfortunately for us, structures used in inputs or outputs can’t be split. Therefore, regardless of what we do, we have to be able to copy to/from structures.
- Optimization loop: Performs a bunch of different optimizations in a loop. Each optimization pass returns a boolean value to indicate whether it made any “progress”. The loop repeats until it goes through a full iteration without making any “progress”. I won’t go over every optimization, but some of the key ones are as follows:
  - nir_lower_variables(): This pass (which probably needs a better name) lowers variable load/store intrinsics to SSA values whenever possible. In particular, it analyzes whether or not a particular value is ever aliased by an indirect and, if not, lowers it to an SSA value, inserting phi nodes where necessary.
  - nir_opt_algebraic(): This pass is capable of doing a variety of algebraic simplifications of expressions. One trivial example is a + 0 -> a. A more complicated example is (-|a| >= 0) -> (a == 0), which shows up all the time in GLSL code that has been generated by a DirectX-to-OpenGL translation layer.
  - nir_opt_constant_folding(): Looks for ALU operations whose arguments are constants, computes the resulting constant value, and replaces the expression with a single load_const instruction. The constant folding pass is also capable of folding constants into variable dereference indirects. This way, after the constant folding pass, things that were once indirect variable dereferences are now direct and we may be able to lower them to SSA.
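The “run until no progress” driver for the optimization loop is simple in shape. Here is a hedged sketch of it, with invented stub passes standing in for the real NIR passes (the stubs model a shader as a plain instruction count; the real loop in the i965 driver calls the actual passes on a nir_shader):

```c
#include <stdbool.h>

/* Each pass returns true if it changed anything.  These stubs are
 * hypothetical stand-ins that just shrink an instruction count. */
static bool
stub_opt_algebraic(int *num_instrs)
{
    if (*num_instrs > 8) { *num_instrs -= 3; return true; }
    return false;
}

static bool
stub_opt_constant_folding(int *num_instrs)
{
    if (*num_instrs > 10) { *num_instrs -= 1; return true; }
    return false;
}

/* Keep looping until a full iteration makes no progress at all. */
static int
optimize(int num_instrs)
{
    bool progress;
    do {
        progress = false;
        progress |= stub_opt_algebraic(&num_instrs);
        progress |= stub_opt_constant_folding(&num_instrs);
    } while (progress);
    return num_instrs;
}
```

The point of the fixed-point structure is that one pass frequently opens up opportunities for another, so a single linear sequence of passes would leave easy wins on the table.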
- Lowering passes: These lower NIR concepts to things that are easier for the backend compiler to handle. The ones used by the i965 backend are:
  - nir_lower_locals_to_regs(): This pass lowers local variables to nir_register values. Again, this keeps the backend from having to worry about deref chains.
  - Dereference lowering: These passes all do basically the same thing: lower an instruction that uses a variable dereference chain to one that doesn’t. This way the backends don’t have to worry about crawling dereferences and can just work with indices into buffers and the occasional indirect offset. These passes include:
    - nir_lower_io()
    - nir_lower_samplers()
    - nir_lower_system_values()
    - nir_lower_atomics()
  - nir_lower_to_source_mods(): Because it makes algebraic optimizations easier, we don’t actually emit source or destination modifiers in glsl_to_nir. Instead, we emit [fi]neg, [fi]abs, and fsat ALU instructions and work with those during optimization. After all the optimizations are done, we then lower these to source/destination modifiers.
  - nir_convert_from_ssa(): This pass takes the shader out of SSA form and into a more conventional form using nir_register. In general, going out of SSA form is very hard to do right. If other backends don’t want to be SSA, we’d rather have the out-of-SSA implementation in one place so they don’t get it wrong.
  - nir_lower_vec_to_movs(): Lowers nir vecN() operations to a series of moves with write-masks.
- nir_validate(): This very useful pass does nothing but check a pile of different invariants of the IR to ensure that nothing is broken. It should get run after every optimization pass when built in debug mode to ensure that we aren’t breaking any invariants as we go.
- NIR to FS: Finally, we convert to the low-level FS IR that we use to then create programs to run on the i965 hardware.
One of the most important parts of being able to use SSA form is going into and out of it efficiently. Going into and out of SSA naively isn’t hard, but it generates terrible code. Going into and out of it competently is much harder, but necessary if we want decent programs. For example, just going to and from SSA badly took one shader from 300 generated instructions to 800, with a register pressure of a couple hundred in the middle. Therefore, how we go to/from SSA is important.
In NIR, there are two ways to go into SSA form. One is to first convert to using registers and call nir_to_ssa(). The other is what we do in i965, where we generate SSA+load/store and then call nir_lower_variables(). I’ll walk through the nir_lower_variables() pass as it is the more complicated of the two, but they both follow the same algorithm for placing phi nodes, etc.
1. Collect information about all of the variable dereferences. This builds a data structure which I have called a “deref forest”, containing information about every possible variable dereference and where it is used. This is similar to the use/def information for scalar registers but is done on a per-deref basis so that you can differentiate between different elements of a structure or different offsets in an array.
2. Determine what values can be lowered to SSA values. At the moment, this is a simple heuristic which only lowers a value if it can never be referenced indirectly. The deref forest makes it fairly easy to find these.
3. Place phi nodes at the iterated dominance frontier of the definitions. This follows the algorithm presented by Cytron et al. in “Efficiently Computing Static Single Assignment Form and the Control Dependence Graph.” It places phi nodes at the minimal locations where they are needed in order to resolve merge points in the CFG.
4. Variable renumbering. This is a pass that walks the list of instructions and replaces each store operation with an SSA definition and each load operation with a read from the closest SSA definition that reaches that instruction. Since we have already added phi nodes, this can be done with a simple stack-based depth-first search algorithm.
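The phi-placement step can be sketched on its own. Given each block’s dominance frontier (computed elsewhere) and the set of blocks that contain a definition of the variable, phi nodes go at the iterated dominance frontier, computed with a worklist. This is a self-contained illustration in the spirit of Cytron et al., not NIR’s actual code; blocks are modeled as small integers and sets as bitmasks:

```c
#define MAX_BLOCKS 16

/* df[b] is a bitmask of the blocks in b's dominance frontier.
 * defs is a bitmask of the blocks that define the variable.
 * Returns a bitmask of the blocks that need a phi node. */
static unsigned
place_phis(const unsigned df[MAX_BLOCKS], unsigned defs)
{
    unsigned phis = 0;
    unsigned worklist = defs;

    while (worklist != 0) {
        /* Pop the lowest-numbered block from the worklist. */
        int b = __builtin_ctz(worklist);
        worklist &= worklist - 1;

        for (int d = 0; d < MAX_BLOCKS; d++) {
            if (!(df[b] & (1u << d)) || (phis & (1u << d)))
                continue;
            /* d is in b's dominance frontier, so it needs a phi.  The
             * phi is itself a new definition, so push d back on. */
            phis |= 1u << d;
            worklist |= 1u << d;
        }
    }
    return phis;
}
```

On a diamond CFG (entry 0 branching to 1 and 2, both joining at 3) with definitions in blocks 1 and 2, this places a single phi at block 3; on a simple loop it places the phi at the loop header, matching the hand-written SSA examples earlier.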
In the nir_to_ssa() pass, we can insert phi nodes where all of the sources and destinations are the register we are trying to take into SSA form. Because we can’t use variables as sources or destinations of phi nodes, we fake it in the nir_lower_variables() pass with a hash table, but the principle is the same.
Going out of SSA is even trickier than going into SSA. The pass we have right now would be more-or-less state-of-the-art in a scalar world. Unfortunately, vectors complicate things so we could probably be doing a lot better.
The pass we have now is based on “Revisiting Out-of-SSA Translation for Correctness, Code Quality, and Efficiency” by Boissinot et al. That is the only out-of-SSA paper I recommend you ever read, as the others I’ve found are completely bonkers and rely on impossible data structures. The basic process is as follows:
1. Isolate phi nodes. This process inserts a bunch of “parallel copy” instructions at the beginnings and ends of basic blocks that ensure SSA values used as sources and destinations of phi nodes are only ever used once. (They are only defined once since they are SSA values.) This prevents many of the technical difficulties of going out of SSA form.
2. Aggressive register coalescing. This process starts by assigning all of the sources and destinations of a given phi node to the same register. It then tries, one at a time, to eliminate the parallel copies by assigning the source and destination of the copy to the same register. The entire time, it keeps track of the interferences and makes sure that nothing gets clobbered. This is done using a concept called a “dominance forest”, which was first put forth by Budimlic et al. in “Fast Copy Coalescing and Live-Range Identification”. Go read the papers for more info.
3. Assign registers. Now that we have figured out what registers to use for the phi nodes, we can assign registers to all of the other SSA values as well.
4. Resolve the parallel copies. A parallel copy operation takes a bunch of values and copies them to a bunch of other locations all at the same time. In this way, a single parallel copy can be used to shuffle arbitrarily many values around in an atomic fashion. Obviously, no one implements a parallel copy instruction in hardware, so this has to be lowered to a sequence of move operations.
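The last step can be made concrete. The sketch below sequentializes a parallel copy: it emits any copy whose destination is no longer read by a pending copy, and when only cycles remain it breaks one with a spare temporary. Registers are modeled as plain int slots; this is an illustration of the standard technique, not NIR’s actual implementation:

```c
#include <stdbool.h>

#define MAX_COPIES 16

/* Perform { reg[dst[i]] := reg[src[i]] for all i, simultaneously }
 * using ordinary sequential moves plus one spare register `tmp`.
 * Note that src[] is rewritten in place when a cycle is broken. */
static void
resolve_parallel_copy(int *reg, int src[], const int dst[], int n, int tmp)
{
    bool pending[MAX_COPIES];
    for (int i = 0; i < n; i++)
        pending[i] = (src[i] != dst[i]);

    for (;;) {
        bool progress = false;
        for (int i = 0; i < n; i++) {
            if (!pending[i])
                continue;
            /* Writing dst[i] is safe once no pending copy reads it. */
            bool still_read = false;
            for (int j = 0; j < n; j++)
                if (j != i && pending[j] && src[j] == dst[i])
                    still_read = true;
            if (!still_read) {
                reg[dst[i]] = reg[src[i]];   /* emit the move */
                pending[i] = false;
                progress = true;
            }
        }
        if (progress)
            continue;

        /* Everything left forms cycles: break one with the temp. */
        int k = -1;
        for (int i = 0; i < n; i++)
            if (pending[i]) { k = i; break; }
        if (k < 0)
            break;                  /* all copies emitted */
        reg[tmp] = reg[src[k]];     /* save the value about to be clobbered */
        src[k] = tmp;               /* copy k now reads the saved value */
    }
}
```

A three-way rotation (r0 -> r1 -> r2 -> r0) is the classic case: no copy can go first, so one value is parked in the temp and the rest then fall out sequentially.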
For doing algebraic optimizations and lowering passes, we have an infrastructure that allows you to easily perform search/replace operations on expressions. This consists of two data structures, nir_search_expression and nir_search_value, and a function, nir_replace_instr(), which checks if the given instruction matches the search expression and, if it does, replaces it with a new instruction (or chain of instructions) according to the given replacement value. The framework automatically handles searching expression trees and dealing with swizzles for you so you don’t have to think about it.
Because creating these data structures is cumbersome, we have a bit of python/mako magic that auto-generates the structures and a function that walks the shader doing the search-and-replace as it goes. The generated function has a first-order switch statement so it only calls nir_replace_instr() if it knows that at least the opcodes match. The search and replace expressions are written in a little language using python tuples. From nir_opt_algebraic.py:
    # Convenience variables
    a = 'a'
    b = 'b'
    c = 'c'
    d = 'd'

    # Written in the form (<search>, <replace>) where <search> is an expression
    # and <replace> is either an expression or a value.  An expression is
    # defined as a tuple of the form (<op>, <src0>, <src1>, <src2>, <src3>)
    # where each source is either an expression or a value.  A value can be
    # either a numeric constant or a string representing a variable name.  For
    # constants, you have to be careful to make sure that it is the right type
    # because python is unaware of the source and destination types of the
    # opcodes.
    optimizations = [
       (('fadd', a, 0.0), a),
       (('iadd', a, 0), a),
       (('fmul', a, 0.0), 0.0),
       (('imul', a, 0), 0),
       (('fmul', a, 1.0), a),
       (('imul', a, 1), a),
       (('ffma', 0.0, a, b), b),
       (('ffma', a, 0.0, b), b),
       (('ffma', a, b, 0.0), ('fmul', a, b)),
       (('flrp', a, b, 0.0), a),
       (('flrp', a, b, 1.0), b),
       (('flrp', a, a, b), a),
       (('flrp', 0.0, a, b), ('fmul', a, b)),
       (('fadd', ('fmul', a, b), c), ('ffma', a, b, c)),
       (('fge', ('fneg', ('fabs', a)), 0.0), ('feq', a, 0.0)),
       (('fmin', ('fmax', a, 0.0), 1.0), ('fsat', a)),
       # This one may not be exact
       (('feq', ('fadd', a, b), 0.0), ('feq', a, ('fneg', b))),
    ]
As you can see, that makes adding new algebraic optimizations stupid easy. The framework can also be used for writing lowering passes for things like lowering sat to min/max.
Many of the optimization/lowering passes require different bits of metadata that are provided by different analysis passes. Right now, we don’t have much of this, but we do have some. As time goes on and we add things like value numbering, the amount of metadata we have will increase. In order to manage this, we have a simple metadata system consisting of an enum and two functions:
- nir_metadata_require(): Declares that the given metadata (an OR of enum values) is required. The function automatically calls all of the required analysis passes for you and, upon its return, the requested metadata is available and current.
- nir_metadata_preserve(): Called to declare what metadata (if any) was preserved by the given pass. If the pass didn’t touch anything, it doesn’t need to call this function. However, if it adds/removes instructions or modifies the CFG in any way, it needs to call nir_metadata_preserve(). The nir_metadata_preserve() function takes an OR of all of the bits of metadata that are preserved. That way, as new metadata gets added, we don’t have to update every optimization pass to dirty it.
Unfortunately, we have no way to automatically dirty everything if you don’t call nir_metadata_preserve(). So shame on you if you forget it.
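To make the require/preserve contract concrete, here is a miniature version of the scheme. All of the names and types below are invented for the sketch; the real API lives in nir.h and operates on a nir_function_impl:

```c
/* Each analysis gets one bit. */
typedef enum {
    MD_DOMINANCE   = 1 << 0,
    MD_BLOCK_INDEX = 1 << 1,
} metadata;

typedef struct {
    unsigned valid;      /* which analyses are currently up to date */
    int dominance_runs;  /* count reruns, just for the example */
} impl;

static void
metadata_require(impl *im, unsigned required)
{
    /* (Re)run only the analyses whose results are stale. */
    if ((required & MD_DOMINANCE) && !(im->valid & MD_DOMINANCE)) {
        im->dominance_runs++;        /* ...compute dominance here... */
        im->valid |= MD_DOMINANCE;
    }
    if ((required & MD_BLOCK_INDEX) && !(im->valid & MD_BLOCK_INDEX)) {
        /* ...number the basic blocks here... */
        im->valid |= MD_BLOCK_INDEX;
    }
}

static void
metadata_preserve(impl *im, unsigned preserved)
{
    /* Everything the pass did not explicitly preserve is now dirty. */
    im->valid &= preserved;
}
```

A pass that reorders instructions but leaves the CFG alone would call metadata_require(im, MD_DOMINANCE) up front and metadata_preserve(im, MD_DOMINANCE | MD_BLOCK_INDEX) on its way out; the masking in metadata_preserve is what lets new metadata bits be added without touching every existing pass.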