CRuby Experiment: Branch Free SPECIAL_CONST_P
In Ruby, everything is an object, even Integers.
CRuby uses a tagged pointer scheme, a memory and performance optimization which represents some common objects within its pointer itself without a separate memory allocation. This works by repurposing the low bits, which are otherwise always 0 due to alignment.
For example, if an object is an Integer (below a certain size), its VALUE will have its least significant bit set to 1 and the remaining 63 bits represent. Internally this is called a “Fixnum” and before Ruby 2.4, where Fixnum and Bignum were unified into Integer, that class was visible to userland.
A similar technique is used to encode some Symbols, Floats, and the values of
Qtrue
(Ruby’s true
), Qfalse
(Ruby’s false
), Qnil
(Ruby’s nil
), and
Qundef (a value that’s not visible to Ruby users).
You can see the specifics in special_consts.h
This is great! It means we can do basic integer math without allocating an object and without needing to access memory elsewhere.
SPECIAL_CONST_P
The cost we pay for the benefits of these tagged non-heap values is that we
can’t assume that our VALUEs point to memory locations. Any time we want to
check the object’s flags, class, or GC attributes, we must first determine
whether or not the value is a “special constant”. The function to check this is
SPECIAL_CONST_P
.
As of Ruby 3.2 SPECIAL_CONST_P
is defined as (slightly modified for readability)
So objects with any of the 3 lowest bits set (Ruby’s internals call these
“immediate values”) are special constants and Qfalse
is a special
constant. Again Qfalse
is Ruby’s false, but it’s defined as 0 so it’s also
C’s false
.
This definition is new to Ruby 3.2, previously Qnil
was also a special case.
It was simplified in ruby/ruby#6759.
Which made me think that maybe it could be done using bitwide operations and
without branches.
Bitwise Tricks
There are two resources I really love for bitwise hacks: Sean Eron Anderson’s “Bit Twiddling Hacks” webpage and “Hacker’s Delight” by Henry S. Warren.
Hacker’s Delight is comprehensive. It starts by presenting basic techniques, identities and techniques which can be composed together. The author also made an easy to use superoptimizer, “Aha!” (A Hackers Assistant).
A superoptimizer is a program that exhaustively searches combinations of machine instructions to find the optimal sequence (by some criteria) which gives the desired output.
I used this fork of Aha! which targets ARM code. I gave it a version of the SPECIAL_CONST_P method, for reference, ran it, and it quickly output a solution:
$ ./aha 4
Searching for programs with 4 operations.
Found a 4-operation program:
neg r1,rx
and r2,r1,rx
sub r3,r2,#5
shr r4,r3,#31
Expr: (((-(x) & x) - 5) >>u 31)
It seems to work! But what’s going on?
(x - 5) >> 31
is doing a subtraction, and moving the sign bit from the most
significant position to the least significant. (Aha! runs the simulation with
32 bits). So that’s basically equivalent to (x < 5)
or (x <= 4)
.
“Hacker’s Delight” gives us a technique to create a word with 1’s in all the positions of trailing 0’s. The following are equivalent:
(x - 1) & ~x
(x & -x) - 1
~(x | -x)
So in the case of a tagged immediate, we will get a result of either 0 (for a fixnum, with the 1 bit set, an odd number), 1 (bit 2 is set, an even number), or 3 (bit 3 is set, a multiple of 4). When we get a 0 the value we get has all bits set, which is -1 when interpreted as a signed integer (the only possible negative value from this operation). Using this we can perform a signed integer comparison on the output and anything <= 3 is an special constant!
(x & -x) - 1 <= 3
That’s not far from our expression from the superoptimizer, and we can just add 1 to both sides of the inequality to clean it up further. This also removes the need for a signed integer!
(x & -x) <= 4
We can test this expression in compiler explorer and see that it essentially replicates the assembly from the superoptimizer as well.
On x86_64 gcc -O3:
SPECIAL_CONST_P(long):
mov rax, rdi
neg rax
and rax, rdi
cmp rax, 4
setle al
ret
On arm64 gcc -O3:
SPECIAL_CONST_P(long):
neg x1, x0
and x1, x1, x0
cmp x1, 4
cset w0, le
ret
The compiler knows how to make this branch-free using cmp/setle or cmp/cset, which performs the equivalent of our previous >> 31
hack.
I’ve tested this inside CRuby and it passes all tests. It works!
https://github.com/jhawthorn/ruby/commit/f4d8686c54dd5fb727723c8cef50a0f634abf9e0
Branch free code
Now we have a branch free version of SPECIAL_CONST_P
.
This in itself isn’t necessarily better, and we probably shouldn’t replace
SPECIAL_CONST_P
everywhere with this new version. Though it should be about
the same speed we don’t normally use SPECIAL_CONST_P
just to get a boolean 0
or 1 value, and compiler optimizations understand the old version better and can apply more optimizations when using it. Also
branches aren’t bad, as long as they’re predictable (what is and isn’t
“predictable” I find extremely difficult to have an intuition for as modern
branch predictors are surprisingly sophisticated. Agner Fog’s
page is a good place to read more).
Where I think there could be merit to this trick is composing it into a larger sequence of useful branch free code, especially in scenarios where the previous branches are unpredictable.
For example, a common operation in Ruby is to get the class of an object. Previously we needed branching to do this because the way we get a class on a heap object would crash if we tried it on a special constant.
Here’s the existing code
The disassembly shows that it’s full of branches.
test dil,0x7
jne 0x3f0b0 <rb_class_of+23>
mov rax,QWORD PTR [rip+0x4df422] # 0x51e4c8 <rb_cFalseClass>
test rdi,rdi
je 0x3f0ee <rb_class_of+85>
mov rax,QWORD PTR [rdi+0x8]
ret
cmp rdi,0x4
jne 0x3f0be <rb_class_of+37>
mov rax,QWORD PTR [rip+0x4df41b] # 0x51e4d8 <rb_cNilClass>
ret
cmp rdi,0x14
jne 0x3f0cc <rb_class_of+51>
mov rax,QWORD PTR [rip+0x4df405] # 0x51e4d0 <rb_cTrueClass>
ret
test dil,0x1
je 0x3f0da <rb_class_of+65>
mov rax,QWORD PTR [rip+0x4df337] # 0x51e410 <rb_cInteger>
ret
mov rax,QWORD PTR [rip+0x4df337] # 0x51e418 <rb_cFloat>
cmp dil,0xc
jne 0x3f0ee <rb_class_of+85>
mov rax,QWORD PTR [rip+0x4e008a] # 0x51f178 <rb_cSymbol>
ret
Instead what we can do is make a lookup table, mapping the last 3 bits (the tag bits) of a special constant value to its exact class.
Which looks like it has a branch but the compiler (at least clang, I found this unpredictable in GCC) emits this as a conditional mov.
mov rax,rdi
neg rax
and rax,rdi
mov ecx,edi
and ecx,0x1f
shl rcx,0x3
add rcx,QWORD PTR [rip+0x23dc0f] # 0x4b3d78
add rdi,0x8
cmp rax,0x5
cmovl rdi,rcx
mov rax,QWORD PTR [rdi]
ret
Shorter overall and no branches, but execution may take fewer instructions in the original version, depending on the path taken.
https://github.com/jhawthorn/ruby/commit/23155ce588dbb9eaaf3abb6e0391ea2f9f959167
Not too surprisingly, this version didn’t end up being an improvement when swapped out completely in CRuby. Once again I think the old version being friendlier to compiler optimizations, avoiding memory access, and being shorter in terms of execution wins out.
What use is this?
Right now nothing that I’ve seen 😅. It’s a nifty trick and I’m glad to have found it.
I think it’s still possible there’s a scenario where we have a very unpredictable branch where this would be more beneficial than the branching code. Perhaps something like GC marking, where we only care whether or not an object is on the heap.
It’s something I’ll keep in mind, but for now it’s just for fun.