Re: Common hash table implementation

Daniel Phillips (phillips@bonn-fries.net)
Sat, 21 Jul 2001 22:25:51 +0200


On Saturday 21 July 2001 02:24, Brian J. Watson wrote:
> Daniel Phillips wrote:
> > On Wednesday 18 July 2001 03:34, Larry McVoy wrote:
> > > We've got a fairly nice hash table interface in BitKeeper that
> > > we'd be happy to provide under the GPL. I've always thought it
> > > would be cool to have it in the kernel, we use it everywhere.
> > >
> > > http://bitmover.com:8888//home/bk/bugfixes/src/src/mdbm
>
> Thanks, Larry. Your hashing functions are much more sophisticated
> than the simple modulo operator I've been using for hashing by inode
> number.

Yes, I tested almost all of them to see how well they worked my
directory index application. There are really only two criterea:

1) How random is the hash
2) How efficient is it

My testing was hardly what you would call rigorous. Basically, what I
do is hash a lot of very unrandom strings and see how uniform the
resulting hash bucket distribution is. The *only* function from
Larry's set that did well on the randomness side is the linear
congruential hash - it did nearly as well as my dx_hack_hash.

Surprisingly, at least to me, the CRC32 turned in an extremely variable
performance. With a small number of buckets (say 100) it did ok, but
with a larger numbers it showed a very lumpy distribution. Yes, this
is way too imprecise a way of describing what happened and I should
take a closer look at it. I don't have the mathematical background to
be really sure about this, but I suspect CRC32 isn't optimized at all
for randomness - it's optimized for detecting bit errors and has good
properties with respect to neighbouring bits, properties that are no
use at all to a randomizing funciton. Anyway, I wasn't all that
unhappy to see CRC32 turn in a poor performance for two reasons: a) the
1K xor table would represent a 25% increase of the indexing code and b)
hashing through the table eats an extra 1K of precious L1 cache.

The linear congruential hash from Larry's set and my dx_hack_hash share
a common characteristic: they both munge each character against a
pseudorandom sequence. In Larry's hash it's a linear congruential
sequence, and in my case it's a feedback shift register. In addition,
I use a multiply to spread the effect of each character over a broader
range of bits.

Larry's hash doesn't do this and you can see right away that strings
that vary only in the last character aren't going to be distributed
very randomly. It might work a little better with the hashing step
spelled this way:

- ((h) = 0x63c63cd9*(h) + 0x9c39c33d + (c))
+ ((h) = 0x63c63cd9*(h + (c)) + 0x9c39c33d)

I haven't tried this, but I will.

There are people out there who know a lot more about analyzing hash
functions than I do, and I have their names somewhere in my mailbox.
I'll go look them up soon and submit for proper testing the whole batch
of functions that have been suggested to me over the last few months.
By the way, in case you haven't already deduced this, this stuff is
really time consuming.

[...]
> Richard Guenther sent the following link to his own common hashing
> code, which makes nice use of pseudo-templates:
>
> http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/~checkout~/glame/glame
>/src/include/hash.h?rev=1.5&content-type=text/plain
>
> A few things I would consider changing are:
>
> - ditching the pprev pointer

Yes, you want to use the generic list macros there.

> - encapsulating the next pointer inside a struct hash_head_##FOOBAR

I think the generic list macros give you that for free.

> - stripping out the hard-coded hashing function, and allowing the
> user to provide their own

Naturally. And trying to reduce the size of the macros. It's not that
easy to get stuff that has dozens of lines ending with "\" into the
kernel. You might have better luck just generalizing a few short sets
of common operations used in hashes, and showing examples of how you'd
use them to rewrite some of the existing hash code. Obviously, the
new, improved approach has to be no less efficient than the current way
of doing things.

> All the backslashes offend my aesthetic sensibility, but the
> preprocessor provides no alternative. ;)

It's hard to argue against using inlines there. It's true that there
are a lot of generalizations you just can't do with inlines, but so
what? What matters is how efficient the generated code is and to a
lesser extent, how readable the source is. You could make that source
quite a bit more readable with a few *small* macros and some inline
functions. Suggestion: express the bucket probe as an inline, compute
the hash outside and pass it in. Then you can wrap the whole thing up
in a really short macro.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/