I was recently using wrapping counters to implement a circular buffer. While doing this, I found a document on circular buffers in the Linux kernel, which states:

Calculation of the occupancy or the remaining capacity of an arbitrarily sized circular buffer would normally be a slow operation, requiring the use of a modulus (divide) instruction. However, if the buffer is of a power-of-2 size, then a much quicker bitwise-AND instruction can be used instead.

Although that passage is about occupancy and capacity, the same trick applies to wrapping a counter: you can use increment/bitwise-AND instead of increment/mod or increment/compare. That made me wonder how much faster bitwise-AND is than the other methods I’ve used.
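The occupancy calculation that the kernel document describes can be sketched as follows. This is a minimal illustration, not the kernel's code: the function names are hypothetical, and it assumes head and tail are unsigned indexes that are always kept in the range 0 to size - 1.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers; `size` is the buffer capacity in elements.

// Occupancy for any size: needs a modulus (divide) instruction.
// Adding `size` first keeps the intermediate value from wrapping below zero.
inline std::size_t occupancy_mod(std::size_t head, std::size_t tail, std::size_t size) {
    assert(size != 0);
    return (head + size - tail) % size;
}

// Occupancy when size is a power of 2: one subtract and one AND.
// Unsigned subtraction wraps around, and the mask discards the high bits.
inline std::size_t occupancy_and(std::size_t head, std::size_t tail, std::size_t size) {
    assert(size != 0 && (size & (size - 1)) == 0);  // power of 2 only
    return (head - tail) & (size - 1);
}
```

For example, with size 8, head 2, and tail 6, both functions report 4 occupied slots.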

Method 1: Increment and Mod

Increment a counter x, using mod to set it to 0 when it reaches n.

x = (x + 1) % n;

Method 2: Increment, Compare, and Reset

Increment a counter x, resetting it to 0 when it reaches n.

x = x + 1;
if (x == n) {
    x = 0;
}

Method 3: Increment and bitwise-AND

Increment a counter x and wrap it modulo n by using n - 1 as a mask for the bitwise-AND. This only works if n is a power of 2, because only then is n - 1 a contiguous run of low 1 bits.

x = (x + 1) & (n - 1);
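To see why the power-of-2 restriction matters, here is a small sketch with two compile-time checks. The helper names is_power_of_two and wrap_and are hypothetical and only serve the illustration.

```cpp
#include <cassert>

// Hypothetical helper: true when n is a power of 2.
constexpr bool is_power_of_two(unsigned n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Method 3 as a function. The precondition is only checked in debug builds.
inline unsigned wrap_and(unsigned x, unsigned n) {
    assert(is_power_of_two(n));
    return (x + 1) & (n - 1);
}

// For n = 8 the mask is 0b111, so 7 + 1 = 0b1000 is masked back to 0.
static_assert(((7u + 1u) & (8u - 1u)) == 0u, "wraps at a power of 2");

// For n = 6 the mask is 0b101, which has a hole at bit 1:
// 1 + 1 = 0b010 is masked to 0 instead of 2, so the counter never reaches 2.
static_assert(((1u + 1u) & (6u - 1u)) == 0u, "mask breaks for non-powers of 2");
```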

Assembly Level Comparison

To compare these methods, I wrote a test program and built it with clang++ (Apple LLVM version 7.0.2) and g++ (5.3.0). Both compilers gave similar runtime results at optimization level -O3, even though the generated assembly differed somewhat. For example, clang++ partially unrolled the loops while g++ did not, and clang++ used the INC instruction where g++ used ADD with an immediate 1.

The x86-64 (Intel 64) assembly listings below, shown in AT&T syntax, are the increment/wrap loops generated by g++. Its output was a bit simpler than clang++'s because it didn’t unroll the loops.
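A timing harness for this kind of test might look roughly like the sketch below. It is not the original test program: time_loop and its parameters are hypothetical, and it assumes the wrap value n comes from runtime input (such as a command-line argument) so the compiler cannot fold the loop into a constant.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical harness, not the original test program.
// Runs `step` on a counter `iterations` times, starting from 0, stores the
// final counter value in `result`, and returns the elapsed microseconds.
template <typename Step>
std::int64_t time_loop(std::uint64_t iterations, Step step, int& result) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    int x = 0;
    const auto start = clock::now();
    for (std::uint64_t i = iterations; i != 0; --i) {
        x = step(x);
    }
    const auto stop = clock::now();
    result = x;  // handing the value back gives the optimizer a reason to keep the loop
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Method 1 could then be timed with a call such as time_loop(1000000000, [n](int x) { return (x + 1) % n; }, out), and the other two methods by swapping the lambda body.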

Method 1: Increment and Mod

The C++ loop:

for(i = end; i != 0; --i) {
    j = (j+1) % divisor;
}

The generated assembly using ADD and IDIV instructions:

100000ac9:   xor    %eax,%eax
100000acb:   nopl   0x0(%rax,%rax,1)
100000ad0:   add    $0x1,%eax
100000ad3:   cltd
100000ad4:   idiv   %ebx
100000ad6:   mov    %edx,%r14d
100000ad9:   mov    %edx,%eax
100000adb:   sub    $0x1,%ecx
100000ade:   jne    100000ad0 <_main+0x50>

Method 2: Increment, Compare, and Reset

The C++ loop:

for(i = end; i != 0; --i) {
    if (++l == divisor) {
        l = 0;
    }
}

The generated assembly using LEA (load effective address) as a way to increment the counter and then CMP and CMOVE to implement the if branch without a jump:

100000ae5:   xor    %ebp,%ebp
100000ae7:   xor    %edx,%edx
100000ae9:   mov    %rax,%r12
100000aec:   mov    $0x3b9aca00,%eax
100000af1:   data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
100000b00:   lea    0x1(%rbp),%esi
100000b03:   cmp    %esi,%ebx
100000b05:   cmove  %edx,%esi
100000b08:   mov    %esi,%ebp
100000b0a:   sub    $0x1,%eax
100000b0d:   jne    100000b00 <_main+0x80>
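The LEA/CMP/CMOVE sequence in the listing above is a conditional select: compute the incremented value, then pick either 0 or that value. The same logic can be written as a ternary, as in this sketch with a hypothetical helper name. Whether a given compiler emits CMOV or a branch for either form depends on the compiler and its flags, so treat the instruction mapping in the comments as an illustration only.

```cpp
#include <cassert>

// Hypothetical helper: Method 2 written as a conditional select.
inline int wrap_select(int x, int n) {
    assert(n > 0 && x >= 0 && x < n);
    const int next = x + 1;          // corresponds to: lea 0x1(%rbp),%esi
    return (next == n) ? 0 : next;   // corresponds to: cmp + cmove
}
```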

Method 3: Increment and bitwise-AND

The C++ loop:

const int and_divisor = divisor-1;
for(i = end; i != 0; --i) {
    k = (k+1) & and_divisor;
}

The generated assembly using ADD and AND instructions:

100000b14:   xor    %ebx,%ebx
100000b16:   mov    $0x3b9aca00,%edx
100000b1b:   mov    %rax,%r13
100000b1e:   xchg   %ax,%ax
100000b20:   add    $0x1,%ebx
100000b23:   and    %r15d,%ebx
100000b26:   sub    $0x1,%edx
100000b29:   jne    100000b20 <_main+0xa0>

Runtime Comparison

Running the test on my machine produced:

./wrapping_counters_test 128
mod    :  7659829us
compare:  1476831us
and    :   728178us

So Method 3, bitwise-AND, is easily the fastest. Using its runtime as the baseline, we can express the runtimes of the other methods relative to it (higher is slower).

Method          Relative Runtime
Mod             10.5
Compare/Reset   2.0
Bitwise-AND     1.0
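The relative figures are each runtime divided by the bitwise-AND runtime, rounded to one decimal place. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: runtime relative to a baseline, rounded to one decimal place.
inline double relative_runtime(double runtime_us, double baseline_us) {
    assert(baseline_us > 0.0);
    return std::round(runtime_us / baseline_us * 10.0) / 10.0;
}
```

With the measurements above, relative_runtime(7659829, 728178) gives 10.5 for mod and relative_runtime(1476831, 728178) gives 2.0 for compare/reset.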

Summary

Don’t use mod to implement simple incrementing/wrapping counters. Use bitwise-AND if you are wrapping around a power of 2, and compare/reset otherwise.
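One way to package that rule of thumb is a small counter type that picks the method once, at construction. The WrappingCounter class below is a hypothetical sketch, not code from the test program.

```cpp
#include <cassert>

// Hypothetical wrapper that applies the rule of thumb above.
class WrappingCounter {
public:
    explicit WrappingCounter(unsigned n)
        : n_(n), use_mask_(n != 0 && (n & (n - 1)) == 0) {
        assert(n > 0);
    }

    // Advances the counter by one, wrapping to 0 at n, and returns the new value.
    unsigned next() {
        if (use_mask_) {
            x_ = (x_ + 1) & (n_ - 1);  // Method 3: n is a power of 2
        } else if (++x_ == n_) {       // Method 2: any other n
            x_ = 0;
        }
        return x_;
    }

    unsigned value() const { return x_; }

private:
    unsigned n_;
    bool use_mask_;
    unsigned x_ = 0;
};
```

The runtime check of use_mask_ has a cost of its own, so in a hot loop it would probably be better to choose the method at compile time, for example by fixing the buffer size as a power-of-2 constant.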

References