OK, I get it. Nice article - it would sure be nice to see this
incorporated into Documentation/filesystems/ext2.txt. I just checked my
copy of Understanding the Linux Kernel and, while the existence of the
compat fields in the superblock is noted, nothing at all is said about
them.
So then, the obvious candidate would be:
#define EXT2_FEATURE_RO_COMPAT_DIR_INDEX 0x0004
which was formerly EXT2_FEATURE_RO_COMPAT_BTREE_DIR.
Other than declaring it, I gather I have to set this flag in the
superblock every time I set the EXT2_INDEX_FL in an inode. Is that it?
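
To make the question concrete, here is a minimal sketch of what I
understand that to mean. The struct types and the helper name are
stand-ins invented for illustration, not the real ext2 structures, and
the EXT2_INDEX_FL value is an assumption; only the RO_COMPAT value comes
from the define above.

```c
#include <stdint.h>

#define EXT2_FEATURE_RO_COMPAT_DIR_INDEX 0x0004
#define EXT2_INDEX_FL 0x00001000 /* assumed value, for this sketch only */

/* Hypothetical stand-ins holding only the fields this sketch touches. */
struct sketch_super {
    uint32_t s_feature_ro_compat; /* read-only compatible feature bits */
    int dirty;                    /* nonzero: superblock needs writing */
};

struct sketch_inode {
    uint32_t i_flags;             /* per-inode flags */
};

/*
 * Mark a directory as indexed and make sure the superblock advertises
 * the feature, so that a kernel without index support mounts read-only
 * rather than modifying an indexed directory behind our back.
 */
static void sketch_set_dir_index(struct sketch_super *sb,
                                 struct sketch_inode *dir)
{
    dir->i_flags |= EXT2_INDEX_FL;

    /* Only dirty the superblock the first time the feature is used. */
    if (!(sb->s_feature_ro_compat & EXT2_FEATURE_RO_COMPAT_DIR_INDEX)) {
        sb->s_feature_ro_compat |= EXT2_FEATURE_RO_COMPAT_DIR_INDEX;
        sb->dirty = 1;
    }
}
```

The check before setting the bit is only there to avoid rewriting the
superblock on every directory that becomes indexed.
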
> > > For that matter, I'm not even sure if ext2 supports directory files
> > > larger than 2GB? Anyone? I'm not 100% sure, but it may be that e2fsck
> > > considers directories >2GB as invalid.
>
> Actually, as I think about this more, it is currently invalid for
> directories to become > 2GB. The i_size_high field in the inode is
> actually the i_dir_acl pointer, so this is only "available" for regular
> files and not directories. Given that the ext2 EA/ACL code does not use
> this field, we may allow this in the future, but for now we are limited
> to 2GB directories. Should be enough for now.
Yes, I will mask off 8 bits of the block index and this will be more than
enough to support the planned coalesce function. I can't think of any
other use for these bits than coalescing - can you?
As you point out, we can't have directories bigger than 2 (4?) GB at the
moment, and if we ever can then we could revisit the maximum size
question. I'm pretty sure that one day, not even very far in the
future, 16 GB will be too small a directory for some applications. On
the other hand, it's possible that none of these applications will be
using Ext2.
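
As a rough sketch of the masking, assuming a 32-bit block index with the
top 8 bits reserved (the bit position and all names below are my
assumptions for illustration): that leaves 24 bits of block number,
which with 1 KB blocks works out to the 16 GB figure above.

```c
#include <stdint.h>

#define SKETCH_RESERVED_BITS 8
#define SKETCH_BLOCK_BITS    (32 - SKETCH_RESERVED_BITS)
#define SKETCH_BLOCK_MASK    ((UINT32_C(1) << SKETCH_BLOCK_BITS) - 1)

/* Block number part of an index entry: the low 24 bits. */
static uint32_t sketch_entry_block(uint32_t entry)
{
    return entry & SKETCH_BLOCK_MASK;
}

/* Reserved part of an index entry: the high 8 bits. */
static uint32_t sketch_entry_reserved(uint32_t entry)
{
    return entry >> SKETCH_BLOCK_BITS;
}

/* Largest directory, in bytes, addressable through the masked index. */
static uint64_t sketch_max_dir_bytes(uint32_t block_size)
{
    return ((uint64_t)SKETCH_BLOCK_MASK + 1) * block_size;
}
```

Larger block sizes raise the limit proportionally, so the ceiling only
bites once directories can grow past the current inode size limit.
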
> > > Have you been testing this with different kinds of input for filenames,
> > > or only synthetic input data?
> >
> > Because I'm not relying on any particular properties of coherence of
> > names, i.e., my hash function is as random as I can make it, I think the
> > sequential names are a pretty good test, except that they are all of
> > similar length.
>
> You could try "bonnie++ -s 0 -n 1000 -d /mnt/index_filesystem"
>
> to create 1M (zero length) files in a single directory. The files have
> semi-random names and different lengths. Basically a fixed-length number
> with random length characters added to the end. Since bonnie++ is also
> a standard filesystem performance test, it will at least give some
> useful numbers to compare with.
OK, that sounds ideal.
> > > The fallback to a linear search is pretty much a requirement anyways,
> > > isn't it?
> >
> > Yes, in the sense that the new code also has to be able to handle
> > nonindexed directories, which it does now. As far as falling back to a
> > linear search after somebody has done something boneheaded - I think
> > it's better to fail and printk a suggestion to run fsck, which can
> > easily fix the problem instead of allowing it to go unnoticed and
> > perhaps adversely affect system performance. The problem here is that
> > most users do not check their log files unless something doesn't work.
> > In this case, failing instead of pressing on constitutes helping, not
> > hurting.
>
> I would have to disagree. In the case of unsupported/corrupt/bad index
> data, there _is_ a meaningful fallback, which is linear directory search.
> Calling ext2_error() will mark the filesystem in error (for the next
> e2fsck to clean up), and the sysadmin has the option of mounting with
> "errors=remount-ro" or "errors=panic" if this is desirable. It should be
> up to the sysadmin to decide they want their box to crash or not, if there
> is a reasonable solution.
>
> We should definitely also clear the index flag on that directory, so we
> don't keep getting the same error. The rest of the ext2 code will deal
> with the case if the actual directory data is corrupt.
>
> Also, I'm sure that if the system has a large directory (thousands of
> files), they will notice that it has become very slow. If they don't
> notice (and e2fsck cleans it up after reboot), then that is just as well.
You win and I will put it on the to-do list. As you say, it's not a lot
of work, but of course that's not the point - the point is to be
consistent with current behaviour. Marking the filesystem in error
takes care of my objection, and the user gets to press on bravely in
this circumstance.
> > > What happens with an existing directory larger than a single block
> > > when you are mounting "-o index"? You can't index it after-the-fact,
> > > so you need to linear search.
> >
> > This works now.
>
> So then bailing out of index mode (on error) to go into linear search
> mode is as easy as clearing the directory index flag and reading the
> directory from the start.
Are you sure we want to clear the index flag? The user has probably
just booted the wrong kernel. And yes, we are talking about a
strategically placed goto here, after a little cleanup.
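
Roughly, the control flow I have in mind looks like the sketch below.
Everything here is a hypothetical stand-in: the struct, the function
names and the "index probe" (which is faked with a scan) are made up for
illustration, and fs_error only marks the spot where ext2_error() would
be called. The index flag is deliberately left alone, since whether to
clear it is the open question.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_BAD_INDEX (-2)

struct sketch_dir {
    const char **names; /* directory entries in on-disk order */
    size_t count;
    int indexed;        /* stand-in for EXT2_INDEX_FL on the inode */
    int index_ok;       /* 0 simulates corrupt or unsupported index data */
    int fs_error;       /* stand-in for the effect of ext2_error() */
};

/* Plain scan from the start of the directory; -1 if not found. */
static int sketch_linear_search(const struct sketch_dir *dir, const char *name)
{
    for (size_t i = 0; i < dir->count; i++)
        if (strcmp(dir->names[i], name) == 0)
            return (int)i;
    return -1;
}

/* Placeholder for the real index probe; reports an unusable index. */
static int sketch_index_lookup(const struct sketch_dir *dir, const char *name)
{
    if (!dir->index_ok)
        return SKETCH_BAD_INDEX;
    return sketch_linear_search(dir, name); /* faked: not a real index */
}

static int sketch_find_entry(struct sketch_dir *dir, const char *name)
{
    int pos;

    if (!dir->indexed)
        goto linear;

    pos = sketch_index_lookup(dir, name);
    if (pos != SKETCH_BAD_INDEX)
        return pos;

    dir->fs_error = 1; /* mark the filesystem in error, then press on */

linear:
    return sketch_linear_search(dir, name);
}
```

Nonindexed directories and broken indexes end up on the same path,
which is why the fallback costs so little once the cleanup is done.
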
-- Daniel