Ahh, that makes sense. Frontload the work. I was stuck thinking of a single divide rather than an array divide.

Thanks for the pop count link, that was good review. The bit manipulations has a rather interesting structure.

As for the algorithm complexity, if you were being pedantic, any algorithm is constant time when you fix the input size. As the pop count instruction works on a fixed number of bits (the register size), it is of course constant time. Though you may notice from the algorithm, the actual number of instructions depends on the number of bits involved in the counting. Looks to be logarithmic in the number of bits, which is pretty efficient. And of course, having a machine opcode to do it is always handy. Looks like it can be done quite efficiently at the hardware level.

I do rather enjoy the conversion

`(A % B)` =>

`(A & (B - 1))` when B is a power of 2. I use a slight variant of that for rounding up to the next multiple of a power of 2:

`roundUp(value, multiple) = (value + (multiple - 1)) & ~(multiple - 1)`

I tried your suggestion to pass a constexpr through a non-constexpr function. It seems the g++ optimizer is too good. I was not able to fool it with my simple test:

unsigned int addOne(unsigned int x) {

return x + 1;

}

unsigned int divideByTwo(unsigned int x) {

return x / addOne(1);

}

The result of

`g++ -c -S -masm=intel -O2 constExprArg.cpp` is:

_Z6addOnej:

.LFB0:

.cfi_startproc

lea eax, 1[rdi]

ret

.cfi_endproc

...

_Z11divideByTwoj:

.LFB1:

.cfi_startproc

mov eax, edi

shr eax ; Shift right (logical): Divide by 2

ret

.cfi_endproc