Unique Brainfuck Optimizations

Brainfuck (BF) is an esoteric programming language created in 1993 by Swiss student Urban Müller, and is extremely simple with only eight instructions. The language essentially defines a turing machine, with the main and only abstraction being a tape and a pointer that can read and write to it. Here are the 8 instructions that exist in the language:

< - Shift tape pointer left
> - Shift tape pointer right
+ - Increment value at tape
- - Decrement value at tape
. - Print the value at the tape pointer as an ASCII character to standard out
, - Read an ASCII character from standard in and write its representation to the tape
[ - Jump If Zero, which jumps to the matching ] instruction when the tape reads 0
] - Jump Unless Zero, which jumps to the matching [ instruction when the tape does not read 0

The tape is as large as you can make it, is initialized to zero, and each cell can store values from 0-255. This was the spec I followed.

These instructions give us a turing complete language, with some relatively complex programs existing, such as a Sudoku solver.

But Why Use This Language?

This language allows getting to the most interesting parts of compiler development, namely that of optimizations, without needing to spend a large amount of time writing a tokenizer and a parser. Parsing is effectively trivial, and getting an interpreter working for this language only took me around two hours. This makes it very easy to spin up, and gives a very unique type of program to optimize.

Interesting Optimizations

There are several operations that are quite frequent, such as zeroing a cell: [-]. This set of three instructions won't exit until the current cell is zero, and this is something that comes up quite frequently. This is a trivial optimization, but this idea of simplfying loops give large performance gains here. For instance, what about this set of instructions:

[->+++++<]

This set of instructions will have the inner loop repeat as many times as the first tape cell is not 0, which is the just the value of the tape cell. So if we enter this loop with the current cell reading 5, the inner loop will execute 5 times, as we subtract one from the counter every iteration, then increment the next cell by 5. However, this can be simplified by turning this into a multiply add to an offset. This really is just:

tape[currPos+1] += tape[currPos] * 5

Optimizing out these simple loops gives speed ups of 2x compared to naive approaches, which makes them quite important here.

Real Example

Here's an example that I ran into when developing the loop simplifying pass on my AST:

[->+>++>>><+<<<<]

The offsets (other than the required 1 for the induction variable) are +1 at offset 1, +2 at offset 2, and+1 at offset 4. Using these offsets, I could remove this entire loop by replacing the additions with multiplications that store at offsets, and then zeroing the memory for the induction variable. For this case, that would be:

This set of instructions, when emitted as X86-64, yields this assembly:

movb	$1, %r10b
mulb	%r10b
addb	%al, 4(%rdi)
movb	(%rdi), %al
movb	$2, %r10b
mulb	%r10b
addb	%al, 2(%rdi)
movb	(%rdi), %al
movb	$1, %r10b
mulb	%r10b
addb	%al, 1(%rdi)
movb	$0, (%rdi)

This basically increments the offsets, then multiplies by the amount of times the induction variable will execute, then zeroes the induction variable. If the induction variable is incrementing rather than decrementing, then we also must xor it with -1 and add 1, but otherwise the logic is the same.

More we can do?

You can also perform a simple instcombine pass, which gave me a 3x speedup. This is pretty intuitive since a large sequence of +s can be reduced into just one.

So these are great, but these are the types of optimizations that LLVM could do as well. If we wrote a backend that emitted LLVM-IR, these local optimizations, as well as pretty much any peephole we could think of, would be easily emitted by LLVM. However, there are some optimizations that LLVM can't do natively.

Vectorized Scan Loops

A type of common procedure in BF programs is a memory scan. For instance:

[>]

This will constantly shift right through the tape until we find a cell with a value of 0. However, this is a whole class of moves, as you can scan by any amount:

[>>>>>>>>] (scans every 8 cells until it reads a 0)
[<<<<<] (scans every 5 cells to the left until it reads a 0)

In some mathematical programs, such as a Tower of Hanoi implementation or a Mandelbrot set, these are extremely common. Therefore, optimizations for can give great speedups.

The key idea here is that rather than checking every memory cell individually, we can use vector instructions to read a sequence of 32 cells at a time, and then use masking to extract the first zero cell that we care about.

For memory scans, I worked with AVX256 and supported positive and negative strides of 1, 2, and 4. You could support any arbitrary stride, however you would need to have up to 32 masks if the stride and number of cells you can handle with vector instructions (32 to AVX-256) are relatively prime.

The steps are as follows for positive strides:

Load 32 bytes from the tape starting at the pointer
Compare the 32 bytes byte-wise against a zero vector. For bytes that were 0, the resulting vector will have those bytes set to 0xFF
If the stride is 2 or 4, then also perform a byte-wise and against an appropriate mask - this zeroes everything that isn’t on a multiple we want
Use an instruction that creates a 32-bit mask from the high bits of the 32 bytes from the vector.
Count the trailing zeros in the new 32-bit mask - if there are no 1’s in the mask, then the instruction returns 32
Increment the memory pointer by the return of the instruction above.

For instance, the program [>>] yields this program:

label0:
    cmpb		$0, (%rdi)
    je		label1
    vpxor	%xmm0, %xmm0, %xmm0
    vpcmpeqb	(%rdi), %ymm0, %ymm0
    vpand	.STRIDE2MASK(%rip), %ymm0, %ymm0
    vpmovmskb	%ymm0, %r10
    tzcntl	%r10d, %r10d
    add		%r10, %rdi
label1:
    cmpb		$0, (%rdi)
    jne		label0

This is a unique optimization that emerges from the fact that we are operating on a tape cell, and the fact that it's very common makes it a good choice in a language such as BF.

Addendum: LLVM Performance

I also wrote an AST to LLVM backend to compare performance against my direct AST to X86-64 backend. Interestingly, my compiler was as fast or faster. Overall, performance was roughly similar, but for the Mandelbrot set, I was 3x faster than LLVM with -O3. There are likely several reasons for this, but I believe the main ones are:

I could perform vector scan optimizations
LLVM is not tuned for monolith functions

A BF program is inherently just a single function, and LLVM is not optimized for reasoning about a function that is tens of thousands of instructions long, as this is obviously not something that is very common in normal code! But this shows that if LLVM isn't tuned for something, you can do better than it even with a relatively naive compiler backend. LLVM can't solve every problem!

Conclusion

It's interesting working on more esoteric languages, as there are unique performance optimizations here that don't arise as commonly in more mainstream languages, but the skills of analyzing a language and code for missing optimizations is something that can easily be extended beyond this. Although, I do like working on BF programs just for how simple the language is!

Resources

My BF Compiler Implementation