Re: [rfc] Near-constant time directory index for Ext2

Kai Henningsen (kaih@khms.westfalen.de)
22 Feb 2001 20:38:00 +0200

phillips@innominate.de (Daniel Phillips) wrote on 20.02.01 in <01022020011905.18944@gimli>:

> But the current hash function is just a place holder, waiting for
> an better version based on some solid theory. I currently favor the
> idea of using crc32 as the default hash function, but I welcome
> suggestions.

I once liked those things, too - but I've learned better since.

Quoting _Handbook_of_Algorithms_and_Data_Structures_ (Gonnet/Baeza-Yates,

--- snip ---

3.3.1 Practical hashing functions

[...]

A universal class of hashing functions is a class with the property that
given any input, the average performance of all the functions is good.
[...] For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
a universal class of hash functions.
[...]
Keys which are strings or sequences of words (including those which are of
variable length) are best treated by considering them as a number base b.
Let the string s be composed of k characters s1s2...sk. Then

h(s) = ( sum(i=0..k-1) B^i*s(k-i) ) mod m

To obtain a more efficient version of this function we can compute

h(s) = ( ( sum(i=0..k-1) B^i*s(k-i) ) mod 2^w ) mod m

where w is the number of bits in a computer word, and the mod 2^w
operation is done by the hardware. For this function the value B = 131 is
recommended, as B^i has a maximum cycle mod 2^k for 8<=k<=64.

Hashing function for strings

int hashfunction(s)
char *s;

{ int i;
for(i=0; *s; s++) i = 131*i + *s;
return(i % m);
}

--- snip ---

I've actually used that function for a hash containing something like a
million phone numbers as keys, and there were *very* few collisions.
Similarly for another hash containgng megabytes of RFC 822 message-ids.

MfG Kai
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html