I recently was using wrapping counters to implement a circular buffer. While doing this, I found a doc on circular buffers in the Linux kernel which stated:

Calculation of the occupancy or the remaining capacity of an arbitrarily sized circular buffer would normally be a slow operation, requiring the use of a modulus (divide) instruction. However, if the buffer is of a power-of-2 size, then a much quicker bitwise-AND instruction can be used instead.

Although that is talking about capacity, you can also use increment/bitwise-AND instead of increment/mod or increment/compare. That made me wonder how much faster bitwise-AND is than other methods I’ve used.

## Method 1: Increment and Mod

Increment a counter x, using mod to set it to 0 when it reaches n.

## Method 2: Increment, Compare, and Reset

Increment a counter x, resetting it to 0 when it reaches n.

## Method 3: Increment and bitwise-AND

Increment a counter x, mod n by using n-1 as a mask for the bitwise-AND. This only works if n is a power of 2.

# Assembly Level Comparison

To compare these methods, I made a test. I built these tests using clang++ (Apple LLVM version 7.0.2) and g++ (5.3.0). Both compilers gave similar runtime results with optimization level set to -O3, even though the generated assembly was a bit different. For example, clang++ used partial loop unrolling while g++ did not and clang++ used the INC instruction while g++ used ADD + 1.

The Intel 64 assembly listing below show the increment/wrap loops generated by g++, which was a bit simpler than clang++ since it didn’t unroll loops.

## Method 1: Increment and Mod

The C++ loop:

The generated assembly using ADD and IDIV instructions:

100000ac9:   xor    %eax,%eax
100000acb:   nopl   0x0(%rax,%rax,1)
100000ad0:   add    $0x1,%eax 100000ad3: cltd 100000ad4: idiv %ebx 100000ad6: mov %edx,%r14d 100000ad9: mov %edx,%eax 100000adb: sub$0x1,%ecx


## Method 2: Increment, Compare, and Reset

The C++ loop:

The generated assembly using LEA (load effective address) as a way to increment the counter and then CMP and CMOVE to implement the if branch without a jump:

100000ae5:   xor    %ebp,%ebp
100000ae7:   xor    %edx,%edx
100000ae9:   mov    %rax,%r12
100000aec:   mov    $0x3b9aca00,%eax 100000af1: data16 data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1) 100000b00: lea 0x1(%rbp),%esi 100000b03: cmp %esi,%ebx 100000b05: cmove %edx,%esi 100000b08: mov %esi,%ebp 100000b0a: sub$0x1,%eax
100000b0d:   jne    100000b00 <_main+0x80>


## Method 3: Increment and bitwise-AND

The C++ loop:

The generated assembly using ADD and AND instructions:

100000b14:   xor    %ebx,%ebx
100000b16:   mov    $0x3b9aca00,%edx 100000b1b: mov %rax,%r13 100000b1e: xchg %ax,%ax 100000b20: add$0x1,%ebx
100000b23:   and    %r15d,%ebx
100000b26:   sub    \$0x1,%edx
100000b29:   jne    100000b20 <_main+0xa0>


# Runtime Comparison

Running the test on my machine produced:

./wrapping_counters_test 128
mod    :  7659829us
compare:  1476831us
and    :   728178us


So, Method 3, bitwise-AND is easily the fastest. Using it’s runtime as the baseline index, we can measure the relative performance of the other methods.

Method Relative Performance
Mod 10.5
Compare/Reset 2.0
Bitwise-AND 1

# Summary

Don’t use mod to implement simple incrementing/wrapping counters. Use bitwise-AND, if you are wrapping around a power of 2, and use compare/reset otherwise.