Stefan Mada

Languages, Compilers, and more!

Loop Unrolling: GPU vs CPU

CUDA is a parallel computing platform from Nvidia that allows GPUs to be used for general-purpose processing. CUDA C++ is the frontend Nvidia provides to access this capability: a programming language that extends C++ so that a single program can be compiled to run both on the CPU and the GPU.
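
For example, a minimal CUDA C++ program mixes host and device code in one source file. This sketch is my own illustration, not from Nvidia's docs, but it uses only standard CUDA APIs:

#include <cstdio>

// Device code: runs on the GPU.
__global__ void add_one(int* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1;
}

// Host code: runs on the CPU and launches the kernel.
int main() {
  const int n = 256;
  int* data;
  cudaMallocManaged(&data, n * sizeof(int));
  for (int i = 0; i < n; ++i) data[i] = i;
  add_one<<<1, n>>>(data, n);  // <<<blocks, threads>>> launch syntax
  cudaDeviceSynchronize();
  printf("data[0] = %d\n", data[0]);  // prints 1
  cudaFree(data);
  return 0;
}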

For CUDA C++, the compilation pipeline follows two paths: one for host code, which runs on the CPU, and one for device code, which runs on the GPU. Here, I'll focus only on the device code path. First, the C++ preprocessor runs. Then middle-end optimizations are performed by nvvm, which emits PTX, a virtual instruction set for Nvidia GPUs. Finally, ptxas runs on the PTX to generate .cubin files, which contain GPU-specific machine instructions. To read more about this, see Chapter 4 of the CUDA NVCC Handbook.

For the purpose of this blog post, we are only concerned with nvvm, the middle-end optimizer, which uses NVVM IR. NVVM IR is a subset of LLVM IR: all NVVM IR is valid LLVM IR. This portion of the compilation runs standard LLVM optimization passes, as well as some GPU-specific ones, to generate more performant code. You can read more in the NVVM IR Spec if curious.

One common pass here is loop unrolling. Briefly, loop unrolling is an optimization where the body of a loop is duplicated several times, so that the induction variable does not have to be incremented (and the exit condition checked) as often. Runtime loop unrolling is the variant where the trip count is not known at compile time, which forces the compiler to emit an epilogue loop for any leftover iterations. Here is an example, where the upper bound n is only known at runtime:

for(int i = 0; i < n; ++i)
  sum += val[i];

After 4x runtime loop unrolling:

int i = 0;
for(; i + 4 <= n; i += 4) {
  sum += val[i];
  sum += val[i+1];
  sum += val[i+2];
  sum += val[i+3];
}
for(; i < n; ++i) // epilogue handles the 0-3 leftover iterations
  sum += val[i];

Loop Unrolling on a CPU

On a CPU, loop unrolling is profitable because it reduces the number of times we increment the induction variable i and test the loop condition, which can yield mild speedups. This is an easy win as long as we don't run out of registers to store temporaries, and as long as we don't grow the code so much that we start taking instruction cache misses. These two constraints limit the upper bound on how far we can unroll.

Unrolling also allows loads to complete in parallel, thanks to the nature of out-of-order processors. In the example above, the four loads can be pipelined: even if the overall latency is still large when none of them hit the cache, throughput improves by roughly 4x as long as the loads are independent (i.e., one load does not bring the other values into the cache).
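
To let the dependent adds overlap as well, not just the loads, the unrolled body can accumulate into independent registers. Here is a minimal sketch; the function name and the use of four accumulators are my own illustration:

// 4x-unrolled reduction with four independent accumulators, so the
// four loads and adds per iteration can all be in flight at once
// instead of serializing through a single sum register.
double unrolled_sum(const double* val, int n) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  int i = 0;
  for(; i + 4 <= n; i += 4) {
    s0 += val[i];
    s1 += val[i+1];
    s2 += val[i+2];
    s3 += val[i+3];
  }
  double sum = s0 + s1 + s2 + s3;
  for(; i < n; ++i) // epilogue for the remainder
    sum += val[i];
  return sum;
}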

This means there are quite large performance gains to be had with loop unrolling on the CPU.

Loop Unrolling on a GPU

On a GPU, the story is similar. We still get the same small win from incrementing the induction variable less often, but the parallel nature of GPUs makes loop unrolling even more beneficial for loads.

How is this possible?

The answer comes down to the difference between CPU and GPU architectures. Let's look at the example above, where we had to load out of a val array, and assume that one memory access does not bring the subsequent accesses into the cache (as with sparse or random access patterns). For the CPU, with a 4x unroll as shown above, we would get roughly a 4x throughput improvement.

However, for GPUs, latency is an even bigger problem: global memory has much higher latency than a CPU's memory accesses. So using unrolling to overlap memory accesses remains profitable at even higher unroll counts than on CPUs. By default, LLVM will only unroll runtime loops up to 8x, yet I have seen 16x unrolls for GPUs by default. It isn't hard to imagine that, because latency is so much worse, 32x unrolls may also be profitable on GPUs in a way that CPUs would no longer benefit from.
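
You can also request a specific unroll factor yourself in CUDA C++ with #pragma unroll. The pragma is a real CUDA feature; the kernel below is a hypothetical sketch of my own:

// Each thread strides over val, accumulating a partial sum.
// #pragma unroll 16 asks the compiler to replicate the loop body
// 16 times, letting many loads be issued before their first use.
__global__ void strided_sum(const float* val, float* out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  float sum = 0.0f;
  #pragma unroll 16
  for (int i = tid; i < n; i += stride)
    sum += val[i];
  out[tid] = sum;  // out must have one slot per thread
}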

To elaborate on the above: a GPU warp can keep issuing instructions, including multiple loads, until it hits an instruction that uses the result of an outstanding load. The time spent waiting for that result is known as a long scoreboard stall. Unrolling batches the loads together, so the warp pays for roughly one long scoreboard stall instead of one per load. And even when a warp does stall, the warp scheduler switches to executing another warp, hiding the memory latency further. This is another reason why larger unroll factors can be profitable on GPUs.
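
Here is a sketch of what the unrolled body looks like from the warp's point of view (the helper function and names are mine, for illustration):

// After unrolling, the compiler can schedule all four loads before
// the first use, so the warp waits through roughly one memory
// latency instead of four back-to-back long scoreboard stalls.
__device__ float sum4(const float* val, int i) {
  float a = val[i];     // load issued, no stall yet
  float b = val[i+1];   // load issued
  float c = val[i+2];   // load issued
  float d = val[i+3];   // load issued, still no stall
  return a + b + c + d; // first use: one long scoreboard stall
}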

Of course, there are drawbacks. Unroll too much and you can lose 50% or more of your performance, due to the nature of GPU execution, in a way that doesn't come up as often on CPUs. Unrolling necessarily increases register pressure, and because all threads resident on a streaming multiprocessor share one register file, even a small increase in per-thread register use can decrease occupancy significantly, leading to significant performance degradation.
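
One common mitigation is to cap per-thread register use, for example with CUDA's __launch_bounds__ qualifier, accepting possible register spills in exchange for occupancy. The qualifier is a real CUDA feature; the kernel is again a hypothetical sketch:

// __launch_bounds__(256, 4) tells ptxas to limit register use so
// that 256-thread blocks can run with at least 4 blocks resident
// per SM, trading possible register spills for higher occupancy.
__global__ void __launch_bounds__(256, 4)
capped_sum(const float* val, float* out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  float sum = 0.0f;
  #pragma unroll 8
  for (int i = tid; i < n; i += stride)
    sum += val[i];
  out[tid] = sum;
}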

Overall, GPU runtime unrolling offers larger performance opportunities, with correspondingly larger risks of performance degradation, which one needs to work around carefully when writing GPU code.

Conclusion

Overall, loop unrolling is very profitable for CPU and GPU code alike, but the unique design of GPU architectures makes it an even more profitable optimization on GPUs, with higher risks if you unroll too much.

Resources

CUDA NVCC Handbook

NVVM IR Spec

LLVM Runtime Unroll Upper Bound