> Now that you provide source for r5 and dx_hack_hash, let me feed my
    > collections to them.
    > r5: catastrophic
    > dx_hack_hash: not bad, but the linear hash is better.
    I never expected dx_hack_hash to be particularly good at anything, but
    we might as well test the version without the mistake in it - I was
    previously using < 0 to test the sign bit - on an unsigned variable :-/
    unsigned dx_hack_hash (const char *name, int len)
    {
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
            u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
            if (hash & 0x80000000) hash -= 0x7fffffff;
            hash1 = hash0;
            hash0 = hash;
        }
        return hash0;
    }
    The correction gained me another 1% in the leaf block fullness measure. 
    I will try your hash with the htree index code tomorrow.
Basically I find the same results as before.
Actual names (N=557398)
	dx_hack+if			dx_hack-if			best
Size	min	max	av	stddev	min	max	av	stddev
2048	217	326	272.17	273.45	220	330	272.17	280.43	+
4096	97	191	136.08	138.35	100	182	136.08	138.29	-
8192	40	105	68.04	68.57	36	102	68.04	68.06	-
16384	14	59	34.02	34.36	14	59	34.02	34.08	-
32768	3	37	17.01	17.24	4	36	17.01	17.09	-
65536	0	24	8.51	8.55	0	23	8.51	8.51	-
131072	0	18	4.25	4.24	0	16	4.25	4.26	+
262144	0	13	2.13	2.12	0	11	2.13	2.13	-
Generated names
2048	195	347	272.17	509.38	191	346	272.17	636.11	+
4096	71	218	136.08	645.73	56	224	136.08	995.79	+
8192	23	125	68.04	202.16	23	135	68.04	288.99	+
16384	10	69	34.02	67.47	8	72	34.02	89.29	+
32768	1	42	17.01	26.32	1	43	17.01	31.39	+
65536	0	28	8.51	10.92	0	26	8.51	12.24	+
131072	0	17	4.25	4.93	0	18	4.25	5.28	+
262144	0	12	2.13	2.32	0	13	2.13	2.40	+
In other words, the "broken" version wins on actual names, the "correct" version
on generated names with number tail. The differences are small.
(And of course the broken version is faster.)
As a comparison:
Actual names (N=557398)
	linear hash (m=11)		linear hash (m=51)		best of 4
Size	min	max	av	stddev	min	max	av	stddev
2048	221	324	272.17	254.02	218	325	272.17	265.46	lin-11
4096	96	176	136.08	132.53	92	174	136.08	130.94	lin-51
8192	38	102	68.04	68.26	41	97	68.04	64.98	lin-51
16384	14	57	34.02	33.97	15	60	34.02	32.29	lin-51
32768	3	37	17.01	17.50	4	41	17.01	16.46	lin-51
65536	0	24	8.51	8.70	0	24	8.51	8.31	lin-51
131072	0	17	4.25	4.39	0	16	4.25	4.22	lin-51
262144	0	12	2.13	2.20	0	12	2.13	2.12	lin-51
Generated names
2048	262	283	272.17	12.25	244	298	272.17	136.72	lin-11
4096	128	146	136.08	9.39	119	151	136.08	39.73	lin-11
8192	61	76	68.04	4.41	58	79	68.04	12.47	lin-11
16384	26	41	34.02	3.61	25	44	34.02	6.32	lin-11
32768	10	23	17.01	4.36	10	25	17.01	4.04	lin-51
65536	2	14	8.51	2.79	3	16	8.51	4.54	lin-11
131072	0	8	4.25	1.85	0	10	4.25	1.88	lin-11
262144	0	5	2.13	1.10	0	6	2.13	1.22	lin-11
So both linear hash versions are far superior (if the criterion is uniform
distribution) in the case of generated names, and lin-51 also beats dx_hack
in the case of actual names. Of course it also wins in speed.
Andries
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/