Ext2 directory index: ALS paper and benchmarks

Daniel Phillips (phillips@bonn-fries.net)
Wed, 5 Dec 2001 22:26:17 +0100


For the filesystem mavens among us, here is a wealth of new information on my
htree directory index for Ext2, including the paper I presented last month at
ALS, and a series of benchmark results (therein lies a story, I'll get to
that[1]).

The paper is here, in browsable html and postscript:

http://people.nl.linux.org/~phillips/htree/paper/htree.html
http://people.nl.linux.org/~phillips/htree/htree.ps.gz
(A Directory Index for Ext2)

Now the benchmarks. The tests all consist of creating and deleting massive
numbers of empty files, with names lying in a given numerical range.
Sometimes the names are created sequentially, sometimes randomly. I've
included the C program I wrote to create files with 'random' in this post.

Why should it matter which order files are *created* in? Well, it shouldn't,
and it doesn't for the htree index, but read to the end of this post to find
out why it's interesting. On the other hand, the order in which files are
deleted matters a great deal, to some filesystems. Not htree-indexed Ext2
though, which I intentionally designed so that best and worst cases are
almost the same. (I'll provide additional benchmark results relating to
random deletion in a subsequent post.)

Here come the charts...

The first is basically the same as the one I presented for my first prototype
of the htree index, back in February of this year:

http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x10000.create.jpg
(Indexed ext2 compared to classic ext2 in the range 10,000 to 100,000 files)

The vertical axis is running time, the horizontal axis is number of files.

The only difference between this and February's chart is that this one runs
out to 100,000 files, which I was unable to do with the original prototype
because I'd implemented only a single tree level at the time - the index
became full at about 93,000 files. Oh, and the rendering quality is a lot
better this time, thankyou very much staroffice and sct (more on that later).

This chart shows that ext2 with the htree index creates large numbers of
files exponentially faster than classic ext2, the difference rapidly becoming
so large that the htree barcharts are barely visible.

The next chart shows the same thing with the classic ext2 results clipped to
the top of the chart so that we can see the linear performance of the htree
index:

http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x10000.create.clipped.jpg

File deletions show a similar result:

http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x10000.delete.jpg

Mind you, this is without Ted's recent hack to cache the Ext2 directory index
position. Unfortunately my patch steps on that, and update is needed. With
Ted's hack, ext2 file deletion performance should be much close to linear,
until we start getting into random order deletion. Hopefully I will find the
time to run another benchmark, so we can see what the position-caching hack
does for classic ext2. We are, after all, going to keep running into
unindexed directories for some time.

This next chart is perhaps the most interesting of the bunch, I think:

http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x100.create.jpg

This shows that the htree index is *always* as fast or faster than classic
ext2, even for small directories consisting of a few blocks. This is because
the htree patch actually branches to the old code for any directory
consisting of a single block, but as soon as we go to two blocks, the index
is faster.

This gives us the answer to a question that was raised some months ago:
obviously indexing a directory consisting of a single block will only slow
things down, but where is the cutoff? Several of us speculated about that (I
can't find the url at the moment) but ultimately the prediction was, that
htree would squeak ahead of the linear search, even at two blocks. This is
now confirmed.

With the small number of files, each test had to be run 10 times to get stable measurements.

Ok, now for the interesting part. With the htree patch, ext2 can handle
millions of files per directory; so can reiserfs. Which is faster? The
answer isn't completely straightforward:

http://people.nl.linux.org/~phillips/htree/indexed.vs.reiser.8x250000.create.jpg

Here, we're testing directories with up to two million files. Ext2+htree
wins for the first million files, then suffers some kind of bump that puts
reiserfs ahead for the next million. So for *really* huge directories,
advantage reiserfs, or so it seems.

Curiously, Reiserfs actually depends on the spelling of the filename for a
lot of its good performance. Creating files with names that don't follow a
lexigraphically contiguous sequence produces far different results:

http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.vs.reiser.10x10000.create.random.jpg

So it seems that for realistic cases, ext2+htree outperforms reiserfs quite
dramatically. (Are you reading, Hans? Fighting words... ;-)

Mind you, I did not try tweaking reiserfs at all, it does have a selection of
hash functions to choose from. It's likely that the default - R5 - is not
very well-suited at all to random names, i.e., names we'd expect to see in
the wild. For one thing, this shows the pitfalls of writing hash functions
that make assumptions about information present in the input string.

I'm sure we'll hear more on this subject.

Finally, it seems that ext2+htree does its work using quite a bit less cpu
than reiserfs or ext2 (logarithmic graph):

http://people.nl.linux.org/~phillips/htree/indexed.vs.reiser.10x100000.create.random.cpu.log.jpg

I will supply the full benchmark data in a subsequent post.

Here is the c program I used to create files with 'random' names:

/*
* randfiles.c
*
* Usage: randfiles <basename> <count> <echo?>
*
* For benchmarking create performance - create <count> files with
* numbered names, in random order. The <echo?> flag, if y, echoes
* filenames created. For example, 'randfiles foo 10 y' will create
* 10 empty files with names ranging from foo0 to foo9.
*
* copyleft: Daniel Phillips, Oct 6, 2001, phillips@nl.linux.org
*
*/

#include <stdlib.h>

#define swap(x, y) do { typeof(x) z = x; x = y; y = z; } while (0)

int main (int argc, char *argv[])
{
int n = (argc > 2)? strtol(argv[2], 0, 10): 0;
int i, size = 50, show = argc > 3 && !strncmp(argv[3], "y", 1);
char name[size];
int choose[n];
for (i = 0; i < n; i++) choose[i] = i;

for (i = n; i; i--)
{
int j = rand() % i;
swap(choose[i-1], choose[j]);
}

for (i = 0; i < n; i++)
{
snprintf(name, size, "%s%i", argv[1], choose[i]);
if (show) printf("create %s\n", name);
close(open(name, 0100));
}
return 0;
}

[1] I'm eternally indebted to Stephen Tweedie for coming to my rescue at the
last minute. Shortly before my ALS presentation I had exactly zero graphs
prepared, let alone slides. I'd arrived at Oakland with files full of
benchmark numbers, secure in the knowledge that I could massage them into
impressive graphics on a moment's notice. Boy was I ever wrong.

It seems it's still a 'early days' for kchart and kpresenter. To do the job,
I needed something something that actually works. Stephen Tweedie had just
the thing on his laptop: staroffice, and he knows how to use it. So we all
retired to a Chinese restaurant nearby and Stephen went to work - all I had
to do was offer advice on which tables to turn into charts. Most of the time
went by just getting the first chart, as Stephen had never actually tried
this before. Then finally, a few charts magically materialized and, even
better, began to metamorphose into a slide show. We all trooped over to the
conference halls, and continued to work on the slides during the talks preceding mine.

My talk went fine; the charts I've presented here are exactly the ones made
by Stephen that day. Thanks. :-)

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/