In defense of NIR

NIR has been an integral part of the Mesa driver stack for about six or seven years now (depending on how you count) and a lot has changed since NIR first landed at the end of 2014 and I wrote my initial NIR notes. Also, for various reasons, I’ve had to give my NIR elevator pitch a few times lately. I think it’s time for a new post. This time on why, after working on this mess for seven years, I still think NIR was the right call.

A bit of history

Shortly after I joined the Mesa team at Intel in the summer of 2014, I was sitting in the cube area asking Ken questions, trying to figure out how Mesa was put together, and I asked, “Why don’t you use LLVM?” Suddenly, all eyes turned towards Ken and myself and I realized I’d poked a bear. Ken calmly explained a bunch of the packaging/shipping issues around having your compiler in a different project as well as issues radeonsi had run into with apps bundling their own LLVM that didn’t work. But for the more technical question of whether or not it was a good idea, his answer was something about trade-offs and how it really wasn’t clear that LLVM would buy them much.

That same summer, Connor Abbott showed up as our intern and started developing NIR. By the end of the summer, he had a bunch of data structures, a few mostly untested passes, and a validator. He also had most of a GLSL IR to NIR pass which mostly passed validation. Later that year, after Connor had gone off to school, I took over NIR, finished the Intel scalar back-end NIR consumer, fixed piles of bugs, and wrote out-of-SSA and a bunch of optimization passes to get it to the point where we could finally land it in the tree at the end of 2014. Initially, it was only a few Intel folks and Emma Anholt (Broadcom, at the time) who were all that interested in NIR. Today, it’s integral to the Mesa project and at the core of every driver that’s still seeing active development. Over the past seven years, we (the Mesa community) have poured thousands of man hours (probably millions of engineering dollars) into NIR, and it’s gone from something only capable of handling fragment shaders to supporting full Vulkan 1.2 plus ray-tracing (task and mesh are coming) along with OpenCL 1.2 compute.

Was it worth it? That’s the multi-million dollar (literally) question. 2014 was a simpler time. Compute shaders were still newish and people didn’t use them for all that much more than they would have used a fancy fragment shader for a couple years earlier. More advanced features like Vulkan’s variable pointers weren’t even on the horizon. Had I known at the time how much work we’d have to put into NIR to keep up, I might have said, “Nah, this is too much effort; let’s just use LLVM.” If I had, I think I would have made the wrong call.

Distro and packaging issues

I’d like to get this one out of the way first because, while these issues are definitely real, they’re easily the least compelling reason to write a whole new piece of software. Having your compiler in a separate project, and in LLVM in particular, comes with an annoying set of problems.

First, there’s release cycles. Mesa releases on a rough 3-month cadence whereas LLVM releases on a 6-month cadence and there’s nothing syncing the two release cycles. This means that any new feature in Mesa that requires new LLVM compiler work can’t be enabled until they pick up a new LLVM. Not only does this make the question “what Mesa version has X?” unanswerable, it also means every one of these features needs conditional paths in the driver to be enabled or not depending on the LLVM version. Also, because we can’t guarantee which LLVM version a distro will choose to pair with any given Mesa version, radeonsi (the only LLVM-based hardware driver in Mesa) has to support the latest two releases of LLVM as well as tip-of-tree at all times. While this has certainly gotten better in recent years, it used to be that LLVM would switch around C++ data structures on you, requiring a bunch of wrapper classes in Mesa to deal with the mess. (They still reserve the right; it just happens less these days.)

Second is bug fixing. What do you do if there’s a compiler bug? You fix it in LLVM, of course, right? But what if the bug is in an old version of the AMD LLVM back-end and AMD’s LLVM people refuse to back-port the fix? You work around it in Mesa, of course! Yup, even though Mesa and LLVM are both open-source projects that theoretically have a stable bugfix release cycle, Mesa has to carry LLVM work-around patches because we can’t get the other team/project to back-port fixes. Things also get sticky whenever there’s a compiler bug which touches on the interface between the LLVM back-end compiler and the driver. How do you fix that in a backwards-compatible way? Sometimes, you don’t. Those interfaces can be absurdly subtle and complex and sometimes the bug is in the interface itself, so you either have to fix it in LLVM tip-of-tree and work around it in Mesa for older versions, or you have to break backwards compatibility somewhere and hope users pick up the LLVM bug-fix release.

Third is that some games actually link against LLVM and, historically, LLVM hasn’t done well with two different versions of it loaded at the same time. Some of this is LLVM and some of it is the way C++ shared library loading is handled on Linux. I won’t get into all the details but the point is that there have been some games in the past which simply can’t run on radeonsi because of LLVM library version conflicts. Some of this could probably be solved if Mesa were linked against LLVM statically but distros tend to be pretty sour on static linking unless you have a really good reason. A closed-source game pulling in their own LLVM isn’t generally considered to be a good reason.

And that, in the words of Forrest Gump, is all I have to say about that.

A compiler built for GPUs

One of the key differences between NIR and LLVM is that NIR is a GPU-focused compiler whereas LLVM is CPU-focused. Yes, AMD has an upstream LLVM back-end for their GPU hardware, Intel likes to brag about their out-of-tree LLVM back-end, and many other vendors use it in their drivers as well, even if their back-ends are closed-source and internal. However, none of that actually means that LLVM understands GPUs or is any good at compiling for them. Most HW vendors have made that choice because they needed LLVM for OpenCL support and they wanted a unified compiler, so they figured out how to make LLVM do graphics. It works but that doesn’t mean it works well.

To demonstrate this, let’s look at the following GLSL shader I stole from the texelFetch piglit test:

#version 120

#extension GL_EXT_gpu_shader4: require
#define ivec1 int
flat varying ivec4 tc;
uniform vec4 divisor;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);
    fragColor = color/divisor;
}

When compiled to NIR, this turns into

shader: MESA_SHADER_FRAGMENT
name: GLSL3
inputs: 1
outputs: 1
uniforms: 1
ubos: 1
shared: 0
decl_var uniform INTERP_MODE_NONE sampler2D tex (1, 0, 0)
decl_var ubo INTERP_MODE_NONE vec4[1] uniform_0 (0, 0, 0)
decl_function main (0 params)

impl main {
    block block_0:
    /* preds: */
    vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
    vec3 32 ssa_1 = intrinsic load_input (ssa_0) (0, 0, 34, 160) /* base=0 */ /* component=0 */ /* dest_type=int32 */ /* location=32 slots=1 */
    vec1 32 ssa_2 = deref_var &tex (uniform sampler2D)
    vec2 32 ssa_3 = vec2 ssa_1.x, ssa_1.y
    vec1 32 ssa_4 = mov ssa_1.z
    vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)
    vec4 32 ssa_6 = intrinsic load_ubo (ssa_0, ssa_0) (0, 1073741824, 0, 0, 16) /* access=0 */ /* align_mul=1073741824 */ /* align_offset=0 */ /* range_base=0 */ /* range=16 */
    vec1 32 ssa_7 = frcp ssa_6.x
    vec1 32 ssa_8 = frcp ssa_6.y
    vec1 32 ssa_9 = frcp ssa_6.z
    vec1 32 ssa_10 = frcp ssa_6.w
    vec1 32 ssa_11 = fmul ssa_5.x, ssa_7
    vec1 32 ssa_12 = fmul ssa_5.y, ssa_8
    vec1 32 ssa_13 = fmul ssa_5.z, ssa_9
    vec1 32 ssa_14 = fmul ssa_5.w, ssa_10
    vec4 32 ssa_15 = vec4 ssa_11, ssa_12, ssa_13, ssa_14
    intrinsic store_output (ssa_15, ssa_0) (0, 15, 0, 160, 132) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* src_type=float32 */ /* location=4 slots=1 */
    /* succs: block_1 */
    block block_1:
}

Then, the AMD driver turns it into the following LLVM IR:

; ModuleID = 'mesa-shader'
source_filename = "mesa-shader"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"
target triple = "amdgcn--"

define amdgpu_ps <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> @main(<4 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %0, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %1, float addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %2, <8 x i32> addrspace(6)* inreg noalias align 4 dereferenceable(18446744073709551615) %3, i32 inreg %4, i32 inreg %5, <2 x i32> %6, <2 x i32> %7, <2 x i32> %8, <3 x i32> %9, <2 x i32> %10, <2 x i32> %11, <2 x i32> %12, float %13, float %14, float %15, float %16, float %17, i32 %18, i32 %19, float %20, i32 %21) #0 {
main_body:
  %22 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 0, i32 0, i32 %5) #4
  %23 = bitcast float %22 to i32
  %24 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 1, i32 0, i32 %5) #4
  %25 = bitcast float %24 to i32
  %26 = call nsz arcp float @llvm.amdgcn.interp.mov(i32 2, i32 2, i32 0, i32 %5) #4
  %27 = bitcast float %26 to i32
  %28 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(6)* %3, i32 32, !amdgpu.uniform !0
  %29 = load <8 x i32>, <8 x i32> addrspace(6)* %28, align 4, !invariant.load !0
  %30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4
  %31 = ptrtoint float addrspace(6)* %2 to i32
  %32 = insertelement <4 x i32> <i32 poison, i32 0, i32 16, i32 163756>, i32 %31, i32 0
  %33 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 0, i32 0) #4
  %34 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 4, i32 0) #4
  %35 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 8, i32 0) #4
  %36 = call nsz arcp float @llvm.amdgcn.s.buffer.load.f32(<4 x i32> %32, i32 12, i32 0) #4
  %37 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %33) #4
  %38 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %34) #4
  %39 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %35) #4
  %40 = call nsz arcp float @llvm.amdgcn.rcp.f32(float %36) #4
  %41 = extractelement <4 x float> %30, i32 0
  %42 = fmul nsz arcp float %41, %37
  %43 = extractelement <4 x float> %30, i32 1
  %44 = fmul nsz arcp float %43, %38
  %45 = extractelement <4 x float> %30, i32 2
  %46 = fmul nsz arcp float %45, %39
  %47 = extractelement <4 x float> %30, i32 3
  %48 = fmul nsz arcp float %47, %40
  %49 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> undef, i32 %4, 4
  %50 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %49, float %42, 5
  %51 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %50, float %44, 6
  %52 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %51, float %46, 7
  %53 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %52, float %48, 8
  %54 = insertvalue <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %53, float %20, 19
  ret <{ i32, i32, i32, i32, i32, float, float, float, float, float, float, float, float, float, float, float, float, float, float, float }> %54
}

; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.interp.mov(i32 immarg, i32 immarg, i32 immarg, i32) #1

; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2

; Function Attrs: nounwind readnone willreturn
declare float @llvm.amdgcn.s.buffer.load.f32(<4 x i32>, i32, i32 immarg) #3

; Function Attrs: nounwind readnone speculatable willreturn
declare float @llvm.amdgcn.rcp.f32(float) #1

attributes #0 = { "InitialPSInputAddr"="0xb077" "denormal-fp-math"="ieee,ieee" "denormal-fp-math-f32"="preserve-sign,preserve-sign" "target-features"="+DumpCode" }
attributes #1 = { nounwind readnone speculatable willreturn }
attributes #2 = { nounwind readonly willreturn }
attributes #3 = { nounwind readnone willreturn }
attributes #4 = { nounwind readnone }

!0 = !{}

For those of you who can’t read NIR and/or LLVM or don’t want to sift through all that, let me reduce it down to the important lines:

GLSL:

vec4 color = texelFetch2D(tex, ivec2(tc), tc.w);

NIR:

vec4 32 ssa_5 = (float32)txf ssa_2 (texture_deref), ssa_2 (sampler_deref), ssa_3 (coord), ssa_4 (lod)

LLVM:

%30 = call nsz arcp <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 15, i32 %23, i32 %25, i32 %27, <8 x i32> %29, i32 0, i32 0) #4

; Function Attrs: nounwind readonly willreturn
declare <4 x float> @llvm.amdgcn.image.load.mip.2d.v4f32.i32(i32 immarg, i32, i32, i32, <8 x i32>, i32 immarg, i32 immarg) #2

attributes #2 = { nounwind readonly willreturn }
attributes #4 = { nounwind readnone }

In NIR, a texelFetch() shows up as a texture instruction. NIR has a special instruction type just for textures called nir_tex_instr to handle the combinatorial explosion of possibilities when it comes to all the different ways you can access a texture. In this particular case, the texture opcode is nir_texop_txf for a texel fetch and it is passed a texture, a sampler, a coordinate, and an LOD. Pretty standard stuff.

In AMD-flavored LLVM IR, this turns into a magic intrinsic function called llvm.amdgcn.image.load.mip.2d.v4f32.i32. A bunch of information about the operation, such as the fact that it takes a mip parameter and returns a vec4, is encoded in the function name. The AMD back-end then knows how to turn this into the right sequence of hardware instructions to load from a texture.

There are a couple of important things to note here. First is the @llvm.amdgcn prefix on the function name. This is an entirely AMD-specific function. If I dumped out the LLVM from the Intel Windows drivers for that same GLSL, it would use a different function name with a different encoding for the various bits of ancillary information such as the return type. Even though both drivers share LLVM, in theory, the way they encode graphics operations is entirely different. If you looked at NVIDIA, you would find a third encoding. There is no standardization.

Why is this important? Well, one of the most common arguments I hear from people for why we should all be using LLVM for graphics is that it allows for code sharing. Everyone can leverage all that great work that happens in upstream LLVM. Except it doesn’t. Not really. Sure, you can get LLVM’s algebraic optimizations and code motion etc. But you can’t share any of the optimizations that are really interesting for graphics because nothing graphics-related is common. Could it be standardized? Probably. But, in the state it’s in today, any claim that two graphics compilers are sharing significant optimizations because they’re both LLVM-based is a half-truth at best. And it will never become standardized unless someone other than AMD decides to put their back-end into upstream LLVM and they decide to work together.

The second important bit about that LLVM function call is that LLVM has absolutely no idea what that function does. All it knows is that it’s been decorated nounwind, readonly, and willreturn. The readonly gives it a bit of information, so it knows it can move the function call around a bit since it doesn’t write memory. However, it can’t even eliminate redundant texture ops because, for all LLVM knows, a second call will return a different result. While LLVM has pretty good visibility into the basic math in the shader, when it comes to anything that touches image or buffer memory, it’s flying entirely blind. The Intel LLVM-based graphics compiler tries to improve this somewhat by using actual LLVM pointers for buffer memory so LLVM gets a bit more visibility, but you still end up with a pile of out-of-thin-air pointers that all potentially alias each other, so it’s pretty limited.

In contrast, NIR knows exactly what sort of thing nir_texop_txf is and what it does. It knows, for instance, that, even though it accesses external memory, the API guarantees that nothing shifts out from under you so it’s fine to eliminate redundant texture calls. For nir_texop_tex (texture() in GLSL), it knows that it takes implicit derivatives and so it can’t be moved into non-uniform control-flow. For things like SSBO and workgroup memory, we know what kind of memory they’re touching and can do alias analysis that’s actually aware of buffer bindings.
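
To make that concrete, here is a minimal sketch (in C, against NIR’s public types and iteration macros, but not lifted from any real Mesa pass) of the kind of query a NIR pass can make. Because textures are a first-class instruction type, a pass can simply ask which opcodes take implicit derivatives and therefore must stay out of non-uniform control-flow; the helper names below are hypothetical, the opcodes and macros are the real ones.

#include "nir.h"

/* Hypothetical helper: does this texture instruction take implicit
 * derivatives?  txf and txl take an explicit LOD so they're safe to move;
 * tex, txb, and lod all implicitly differentiate the coordinate. */
static bool
tex_takes_implicit_derivatives(const nir_tex_instr *tex)
{
   switch (tex->op) {
   case nir_texop_tex:
   case nir_texop_txb:
   case nir_texop_lod:
      return true;
   default:
      return false;
   }
}

/* Walk a function and count the instructions a code-motion pass would have
 * to keep out of non-uniform control-flow. */
static unsigned
count_derivative_users(nir_function_impl *impl)
{
   unsigned count = 0;

   nir_foreach_block(block, impl) {
      nir_foreach_instr(instr, block) {
         if (instr->type != nir_instr_type_tex)
            continue;

         nir_tex_instr *tex = nir_instr_as_tex(instr);
         if (tex_takes_implicit_derivatives(tex))
            count++;  /* a real pass would restrict code motion here */
      }
   }

   return count;
}

It’s the same kind of metadata that lets NIR’s CSE pass safely merge two identical texel fetches: the IR itself knows the semantics of the operation instead of treating it as an opaque call.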

Code sharing

When people try to justify their use of LLVM to me, there are typically two major benefits they cite. The first is that LLVM lets them take advantage of all this academic compiler work. In the previous section, I explained why this is a weak argument at best. The second is that embracing LLVM for graphics lets them share code with their compute compiler. Does that mean that we’re against sharing code? Not at all! In fact, NIR lets us get far more code sharing than most companies do by using LLVM.

The difference is the axis for sharing. This is something I ran into trying to explain myself to people at Intel all the time. They’re usually only thinking about how to get the Intel OpenCL driver and the Intel D3D12 driver to share code. With NIR, we have compiler code shared effectively across 20 years of hardware from 8 different vendors and at least 4 APIs. So while Intel’s Linux Vulkan and OpenCL drivers don’t share a single line of compiler code, it’s not like we went off and hand-coded a whole compiler stack just for Intel Linux Vulkan.

As an example of this, consider nir_lower_tex(), a pass that lowers various types of texture operations to other texture operations, handling everything from removing texture projectors to turning implicit derivatives into explicit LODs.

Exactly which lowering is needed is highly hardware-dependent (except projectors; only old Qualcomm hardware has those) but most of these lowerings are needed by at least two different vendors’ hardware. While most of them are pretty simple, when you get into things like turning derivatives into LODs, the calculations get complex and we really don’t want everyone typing them out themselves if we can avoid it.
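
To give a flavor of how a back-end opts into this shared lowering, here is a rough sketch of a nir_lower_tex() call. The two option fields shown are real, but the full nir_lower_tex_options struct is much larger and changes across Mesa versions, so take the details as illustrative rather than as a recipe.

#include "nir.h"

static void
lower_textures_for_my_hw(nir_shader *shader)
{
   const nir_lower_tex_options opts = {
      /* This hardware has no native texture projectors, so do the divide
       * in the shader for every sampler type. */
      .lower_txp = ~0u,
      /* Rewrite rectangle-texture lookups in terms of ordinary 2D ones. */
      .lower_rect = true,
   };

   nir_lower_tex(shader, &opts);
}

Each driver fills in only the lowering its hardware actually needs and the pass leaves everything else alone.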

And texture lowering is just one example. We’ve got dozens of passes for everything from lowering read-only images to textures for OpenCL, to lowering built-in functions like frexp() to simpler math, to flipping gl_FragCoord and gl_PointCoord when rendering upside down, which is required to implement OpenGL on Linux window-systems. All that code is in one central place where it’s usable by all the graphics drivers on Linux.

Tight driver integration

I mentioned earlier that having your compiler out-of-tree is painful from a packaging and release point-of-view. What I haven’t addressed yet is just how tight driver/compiler integration has to be. It depends a lot on the API and hardware, of course, but the interface between compiler and driver is often very complex. We make it look very simple on the API side where you have descriptor sets (or bindings in GL) and then you access things from them in the shader. Simple, right? Hah!

In the Intel Linux Vulkan driver, we can access a UBO in any of four different ways, chosen by a complex heuristic.

And that’s just UBOs! SSBO binding has a similar level of complexity and also depends on the SSBO operations done in the shader. Textures have silent fall-back to bindless if we have too many, etc. In order to handle all this insanity, we have a compiler pass called anv_nir_apply_pipeline_layout() which lives in the driver. The interface between that pass and the rest of the driver is quite complex and can communicate information about exactly how things are actually laid out. We do have to serialize it to put it all in the pipeline cache, so that limits the complexity some, but we don’t have to worry about keeping the interface stable at all because it lives in the driver.

We also have passes for handling YCbCr format conversion, turning multiview into instanced rendering and constructing a gl_ViewID in the shader based on the view mask and the instance number, and a handful of other tasks. Each of these requires information from the VkPipelineCreateInfo and some of them result in magic push constants which the driver has to know need pushing.

Trying to do that with your compiler in another project would be insane. So how does AMD do it with their LLVM compiler? Good question! They either do it in NIR or as part of the NIR to LLVM conversion. By the time the shader gets to LLVM, most of the GL or Vulkanisms have been translated to simpler constructs, keeping the driver/LLVM interface manageable. It also helps that AMD’s hardware binding model is crazy simple and was basically designed for an API like Vulkan.

Structured control-flow

One of the riskier decisions we made when designing NIR was to make all control-flow inherently structured. Instead of branch and conditional branch instructions like LLVM or SPIR-V has, NIR has control-flow nodes in a tree structure. The root of the tree is always a nir_function_impl. Each function contains a list of control-flow nodes, each of which may be a nir_block, nir_if, or nir_loop. An if has a condition along with then and else cases. A loop is a simple infinite loop, and there are nir_jump_break and nir_jump_continue instructions which act exactly like their C counterparts.
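
For a sense of what that tree looks like from the inside, here is a minimal nir_builder sketch that emits a loop containing an if/break. The builder helpers (nir_push_loop, nir_push_if, nir_jump, and the variable load/store helpers) are real, but the surrounding pass boilerplate is omitted, so treat it as a sketch under those assumptions rather than a complete, compilable pass.

#include "nir_builder.h"

/* Emits, in structured NIR:
 *
 *    i = count;
 *    loop {
 *       if (i <= 0)
 *          break;
 *       i = i - 1;
 *    }
 *
 * Loop-carried values need either phis or a local variable; using a
 * variable and letting nir_lower_vars_to_ssa() clean it up afterwards is
 * the usual trick when building control-flow by hand. */
static void
build_countdown_loop(nir_builder *b, nir_ssa_def *count)
{
   nir_variable *i =
      nir_local_variable_create(b->impl, glsl_int_type(), "i");
   nir_store_var(b, i, count, 0x1);

   nir_loop *loop = nir_push_loop(b);
   {
      nir_ssa_def *cur = nir_load_var(b, i);

      nir_push_if(b, nir_ile(b, cur, nir_imm_int(b, 0)));
      {
         nir_jump(b, nir_jump_break);
      }
      nir_pop_if(b, NULL);

      nir_store_var(b, i, nir_iadd(b, cur, nir_imm_int(b, -1)), 0x1);
   }
   nir_pop_loop(b, loop);
}

Because the break is a structured jump out of a nir_loop rather than an arbitrary branch, every pass that later walks this shader knows exactly where control-flow re-converges.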

At the time, this decision was made from pure pragmatism. We had structure coming out of GLSL and most of the back-ends expected structure. Why break everything? It did mean that, when we started writing control-flow manipulation passes, things were a lot harder. A dead control-flow pass in an unstructured IR is trivial: delete any conditional branch whose condition is false, or replace it with an unconditional branch if the condition is true. Then delete any unreachable blocks and merge blocks as necessary. Done. In a structured IR, it’s a lot more fiddly. You have to manually collapse if ladders and deleting the unconditional break at the end of a loop is equivalent to loop unrolling. But we got over that hump, built tools to make it less painful, and have implemented most of the important control-flow optimizations at this point. In exchange, back-ends get structure, which is something most GPUs want thanks to the SIMT model they use.

What we didn’t see coming when we made that decision (2014, remember?) was wave/subgroup ops. In the last several years, the SIMT nature of shader execution has slowly gone from an implementation detail to something that’s baked into all modern 3D and compute APIs and shader languages. With that shift has come the need to be consistent about re-convergence. If we say “texture() has to be in uniform control flow”, is the following shader ok?

#version 120

varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    if (tc.x > 1.0)
        tc.x = 1.0;

    fragColor = texture(tex, tc);
}

Obviously, it should be. But what guarantees that you’re actually in uniform control-flow by the time you get to the texture() call? In an unstructured IR, once you diverge, it’s really hard to guarantee convergence. Of course, every GPU vendor with an LLVM-based compiler has algorithms for trying to maintain or re-create the structure but it’s always a bit fragile. Here’s an even more subtle example:

#version 120

varying vec2 tc;
uniform sampler2D tex;
out vec4 fragColor;
void main()
{
    /* Block 0 */
    float x = tc.x;
    while (1) {
        /* Block 1 */
        if (x < 1.0) {
            /* Block 2 */
            tc.x = x;
            break;
        }

        /* Block 3 */
        x = x - 1.0;
    }

    /* Block 4 */
    fragColor = texture(tex, tc);
}

The same question of validity holds but there’s something even trickier in here. Can the compiler merge block 4 and block 2? If so, where should it put it? To a CPU-centric compiler like LLVM, it looks like it would be fine to merge the two and put it all in block 2. In fact, since texture ops are expensive and block 2 is deeper inside control-flow, it may think the resulting shader would be more efficient if it did. And it would be wrong on both counts.

First, the loop exit condition is non-uniform and, since texture() takes derivatives, it’s illegal to put it in non-uniform control-flow. (Yes, in this particular case, the result of those derivatives might be a bit wonky.) Second, due to the SIMT nature of execution, you really don’t want the texture op in the loop. In the worst case, a 32-wide execution will run block 2 up to 32 separate times whereas, if you guarantee re-convergence, it runs block 4 only once.

The fact that NIR’s control-flow is structured from start to finish has been a hidden blessing here. Once we get the structure figured out from SPIR-V decorations (which is annoyingly challenging at times), we never lose that structure and the re-convergence information it implies. NIR knows better than to move derivatives into non-uniform control-flow and its code-motion passes are tuned assuming a SIMT execution model. What has become a constant fight for people working with LLVM is a non-issue for us. The only thing that has been a challenge has been dealing with SPIR-V’s less than obvious structure rules and trying to make sure we properly structurize everything that’s legal. (It’s been getting better recently.)

Side-note: NIR does support OpenCL SPIR-V which is unstructured. To handle this, we have nir_jump_goto and nir_jump_goto_if instructions which are allowed only for a very brief period of time. After the initial SPIR-V to NIR conversion, we run a couple passes and then structurize. After that, it remains structured for the rest of the compile.

Algebraic optimizations

Every GPU compiler engineer has horror stories about something some app developer did in a shader. Sometimes it’s the fault of the developer and sometimes it’s just an artifact of whatever node-based visual shader building system the game engine presents to the artists and how it’s been abused. On Linux, however, it can get even more entertaining. Not only do we have those shaders that were written for DX9, where someone lost the code so they ran them through a DX9 to HLSL translator and then through FXC, but then, to port the app to OpenGL so it could run on Linux, they did a DXBC to GLSL conversion with some horrid tool. The end result is x != 0 implemented with three levels of nested function calls, multiple splats out to a vec4, and a truly impressive pile of control-flow. I only wish I were joking….

To chew through this mess, we have nir_opt_algebraic(). We’ve implemented a little language for expressing these expression trees using python tuples and nir_opt_algebraic.py. To get a sense for what this looks like, let’s look at some excerpts from nir_opt_algebraic.py starting with the simple description at the top:

# Written in the form (<search>, <replace>) where <search> is an expression
# and <replace> is either an expression or a value.  An expression is
# defined as a tuple of the form ([~]<op>, <src0>, <src1>, <src2>, <src3>)
# where each source is either an expression or a value.  A value can be
# either a numeric constant or a string representing a variable name.
#
# <more details>

optimizations = [
   ...
   (('iadd', a, 0), a),

This rule is a good starting example because it’s so straightforward. It looks for an integer add operation of something with zero and gets rid of it. A slightly more complex example removes redundant fmax opcodes:

(('fmax', ('fmax', a, b), b), ('fmax', a, b)),

Since it’s written in python, we can also write little rule generators if the same thing applies to a bunch of opcodes or if you want to generalize across types:

# For any float comparison operation, "cmp", if you have "a == a && a cmp b"
# then the "a == a" is redundant because it's equivalent to "a is not NaN"
# and, if a is a NaN then the second comparison will fail anyway.
for op in ['flt', 'fge', 'feq']:
   optimizations += [
      (('iand', ('feq', a, a), (op, a, b)), ('!' + op, a, b)),
      (('iand', ('feq', a, a), (op, b, a)), ('!' + op, b, a)),
   ]

Because we’ve made adding new optimizations so incredibly easy, we have a lot of them. Not just the simple stuff I’ve highlighted above, either. We’ve got at least two cases where someone hand-rolled bitfieldReverse() and we match a giant pattern and turn it into a single HW instruction. (Some UE4 demo and Cyberpunk 2077, if you want to know who to blame. They hand-roll it differently, of course.) We also have patterns to chew through all the garbage from D3D9 to HLSL conversion where they emit piles of x ? 1.0 : 0.0 everywhere because D3D9 didn’t have real Boolean types. All told, as of the writing of this blog post, we have 1911 such search and replace patterns.

Not only have we made it easy to add new patterns but the nir_search framework has some pretty useful smarts in it. The expression I first showed matches a + 0 and replaces it with a, but nir_search is smart enough to know that nir_op_iadd is commutative and so it also matches 0 + a without having to write two expressions. We also have syntax for detecting constants, handling different bit sizes, and applying arbitrary C predicates based on the SSA value. Since NIR is actually a vector IR (we support a lot of vec4-based hardware), nir_search also magically handles swizzles for you.

You might think 1911 patterns is a lot and it is. Doesn’t that take forever? Isn’t it O(NPS) where N is the number of instructions, P is the number of patterns and S is the average pattern size or something like that? Nope! A couple years ago, Connor Abbott converted it to use a finite-state automaton, built at driver compile time, to filter out impossible matches as we go. The result is that the whole pass effectively runs in linear time in the number of instructions.

NIR is a low(ish) level IR

This one continues to surprise me. When we set out to design NIR, the goal was something that was SSA and used flat lists of instructions (not expression trees). That was pretty much the extent of the design requirements. However, whenever you build an IR, you inevitably make a series of choices about what kinds of things you’re going to support natively and what things are going to require emulation or be a bit more painful.

One of the most fundamental choices we made in NIR was that SSA values would be typeless vectors. Each nir_ssa_def has a bit size and a number of vector components and that’s it. We don’t distinguish between integers and floats and we don’t support matrix or composite types. Not supporting matrix types was a bit controversial but it’s turned out fine. We also have to do a bit of juggling to support hardware that doesn’t have native integers because we have to lower integer operations to float and we’ve lost the type information. When working with shaders that come from D3D to OpenGL or Vulkan translators, the type information does more harm than good. I can’t count the number of shaders I’ve seen where they declare vec4 x1 through vec4 x80 at the top and then uintBitsToFloat() and floatBitsToUint() all over everywhere.
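
A tiny nir_builder sketch makes the typeless-vector point concrete: the value below is just 32 bits and one component, and whether those bits are an integer or a float depends entirely on the instruction consuming them. The builder helpers are real; the snippet itself is illustrative and not from Mesa.

#include "nir_builder.h"

static void
typeless_values_example(nir_builder *b)
{
   /* One 32-bit component.  nir_ssa_def stores only bit_size and
    * num_components; there is no int/float tag on the value itself. */
   nir_ssa_def *bits = nir_imm_int(b, 0x3f800000);

   /* Consumed as an integer... */
   nir_ssa_def *as_int = nir_iadd(b, bits, nir_imm_int(b, 1));

   /* ...or as a float (0x3f800000 happens to be 1.0f), with no bitcast. */
   nir_ssa_def *as_float = nir_fadd(b, bits, nir_imm_float(b, 1.0f));

   /* Both results have the same shape: bit_size 32, one component. */
   (void)as_int;
   (void)as_float;
}

This is also part of why the uintBitsToFloat()/floatBitsToUint() chatter in translated shaders ends up costing nothing: in NIR the casts simply disappear.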

We also made adding new ALU ops and intrinsics really easy but also added a fairly powerful metadata system for both so the compiler can still reason about them. The lines we drew between ALU ops, intrinsics, texture instructions, and control-flow like break and continue were pretty arbitrary at the time, if we’re honest. Texturing was going to be a lot of intrinsics, so Connor added an instruction type. That was pretty much it.

The end result, however, has been an IR that’s incredibly versatile. It’s somehow both a high-level and low-level IR at the same time. When we do SPIR-V to NIR translation, we don’t have a separate IR for parsing SPIR-V. We have some data structures to deal with composite types and a handful of other stuff but when we parse SPIR-V opcodes, we go straight to NIR. We’ve got variables with fairly standard dereference chains (those do support composite types), bindings, all the crazy built-ins like frexp(), and a bunch of other language-level stuff. By the time the NIR shows up in your back-end, however, all that’s gone. Crazy built-in functions have been lowered. GL/Vulkan binding with derefs, descriptors, and locations has been turned into byte offsets and indices in a flat binding table. Some drivers have even attempted to emit hardware instructions directly from NIR. (It’s never quite worked, but it says a lot that they even tried.)

The Intel compiler back-end has probably shrunk by half in terms of optimization and lowering passes in the last seven years because we’re able to do so much in NIR. We’ve got code that lowers storage image access with unsupported formats to other image formats or even SSBO access, splitting of vector UBO/SSBO access that’s too wide for hardware, workarounds for imprecise trig ops, and a bunch of others. All of the interesting lowering is done in NIR. One reason for this is that Intel has two back-ends, one scalar and one vec4, and any lowering we can do in NIR only has to happen once. But, also, it’s nice to be able to have the full power of NIR’s optimizer run on your lowered code.

As I said earlier, I find the versatility of NIR astounding. We never intended to write an IR that could get that close to hardware. We just wanted SSA for easier optimization writing. But the end result has been absolutely fantastic and has done a lot to accelerate driver development in Mesa.

Conclusion

If you’ve gotten this far, I both applaud and thank you! NIR has been a lot of fun to build and, as you can probably tell, I’m quite proud of it. It’s also been a huge investment involving thousands of man hours, but I think it’s been well worth it. There’s a lot more work to do, of course. We still don’t have the ray-tracing situation where it needs to be, and OpenCL-style compute needs some help to be really competent. But it’s come an incredibly long way in the last seven years and I’m incredibly proud of what we’ve built and forever thankful to the many, many developers who have chipped in and fixed bugs and contributed optimization and lowering passes.

Hopefully, this post provides some additional background and explanation for the big question of why Mesa carries its own compiler stack. And maybe, just maybe, someone will get excited enough about it to play around with it and even contribute! One can hope, right?