Re: [RFC][PATCH] Faster generic_fls

Willy Tarreau (willy@w.ods.org)
Wed, 30 Apr 2003 22:59:21 +0200


On Wed, Apr 30, 2003 at 09:15:33PM +0200, Daniel Phillips wrote:

> In the dawn of time, before God gave us Cache, my version would have been the
> fastest, because it executes the fewest instructions. In the misty future,
> as cache continues to scale and processors sprout more parallel execution
> units, it will be clearly better once again.

Daniel,

I must acknowledge that your simple code was not easy to beat ! You can try this
one on your PIII, I could only test it on an athlon mobile and a P4. With gcc 2.95.3,
it gives me a boost of about 25%, because it seems as gcc cannot optimize shifts
efficiently. On 3.2.3, however, it's between 0 and 5% depending on optimization/CPU.

But the code is a lot smaller (about 105 bytes total for the function). I could also
code a version with parallel execution of several branches, but it was slower,
because every branch not taken was computed for nothing, and the glue around it had
a cost. I tried to implement it with as many booleans as possible so that the compiler
can try to emit conditionnal instructions and make use of carries. I did operations
on an unsigned char because it has some advantages, such as 0x80 being negative ;-)

I also let the other tests in, so that you can try them if you want. Looking through
the generated code, it's clear that an assembler version would be faster. Not much,
but a bit.

Cheers,
Willy

static inline
unsigned new2_fls32(unsigned n32)
{
unsigned t = 0;
unsigned char n;

if (n32 >> 16) {
n32 >>= 16;
t = 16;
}

n = n32;
if (n32 >> 8) {
n = (unsigned char)(n32 >> 8);
t += 8;
}
#if 0
else
if (!n) return t;
// this one uses comparisons and jumps, but is short and fast
return (n < 0x10) ?
t + ((n < 0x04) ? 2 - (n < 0x02) : 4 - (n < 0x08)) :
t + ((n < 0x40) ? 6 - (n < 0x20) : 8 - (n < 0x80)) ;
#endif
#if 0
// this one compiles into several parallel branches, but is slower
if (n < 0x10) {
return t += (n < 0x04) ? n - ((n + 1) >> 2) : 4 - (n < 8);
} else {
return t += (n < 0x40) ? 6 - (n < 0x20) : 8 - (n < 0x80) ;
}
#endif
#if 0
// the same as above
return t += (n < 0x10) ? (n < 0x04) ? n - ((n + 1) >> 2) : 4 - (n < 8) :
(n < 0x40) ? 6 - (n < 0x20) : 8 - (n < 0x80) ;
#endif
#if 1
// this one uses comparisons and jumps, but is short and the fastest
return (n < 0x10) ?
(n < 0x04) ? n + t - ((n + 1) >> 2) : 4 + t - (n < 8) :
(n < 0x40) ? 6 + t - (n < 0x20) : 8 + t - (n < 0x80) ;
#endif
}

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/