On Thu, 06 Jul 2000, Steve Whitehouse wrote:
> can you explain "phase tree" and/or give a reference ?
I can describe it and give a reference. It's the algorithm used in my Tux2
filesystem that I've been working on since around Christmas, or longer than that
if you count 10 years of thinking about doing it :-)
Phase tree is my name for an algorithm similar to one that has been used in WAFL
and in a OS named Auragen that you can ask Victor Yodaiken about. I
developed the algorithm independently and it was only on reading a posting from
Victor on linux-kernel in from June, 1997, that I realized I wasn't the only
one to have thought of it.
My phase tree algorithm is different enough from the other two that I think it's
fair to call it a new algorithm, or at least a close cousin. I'm writing a
white paper on it for presentation at the ALS this fall. An abstract is
available now. I have a working prototype of Tux2 "with some issues" that I'm
now busily porting form 2.2.12 to 2.4.0.test.
I have attached Victor's original email, which makes very good reading. You can
get the Tux2 abstract by emailing me... (I'm very interested in finding out
exactly who is interested.)
Here is a brief description:
Tux2 is based on Ext2. It is not a journalling filesystem, but it does what a
journalling filesystem does: keep your files safe in the event of a processing
interruption. It does that for both data and metadata and, according to my
early benchmarks, should do it at about the same speed at which a JFS does
metadata-only. We shall see.
Tux2 uses my "phase tree" algorithm (so christened by Alan Cox - I called it
tree/phase but I like his name more). Phase tree imposes a partial ordering on
disk writes to ensure that a filesystem on disk is always updated atomically,
with a single write of the filesystem metaroot. To work properly, the entire
filesystem including all metadata, must be structured as a tree. Ext2 is not
structured as a tree, therefore, the major difference between Ext2 and Tux2 is
that all metadata has been rearranged into a tree.
Once you have the filesystem in the form of a tree you can make a copy of
the metaroot, then for all updates, apply a "branching" algorithm that works
from the updated block towards the metaroot doing a copy-on-write at each node
that needs updating. After some number of updates (the exact number is a
performance-tuning parameter) you store the new metaroot on disk, which gives
you an atomic update. So far this is similar to Auragen and WAFL.
Tux2's phase tree algorithm works almost entirely in cache and is intimately
coupled to the buffer cache system. A third metatree is added, to allow
filesystem updating to continue without pause while the second tree is commited
to disk, eventually replacing the first metatree using the abovementioned
atomic write.
In tree phase terminology, the three trees are called "phases". The three
phases are:
- recorded phase (the consistent filesystem image currently on disk)
- recording phase (diffs for a new consistent image currently being written)
- branching phase (the changing filesystem as applications see it)
Tux2 has its own update daemon that handles its "phase transitions". A phase
transition is the act of commiting a new metaroot wherein the second phase tree
becomes the first, the third becomes the second and a new metaroot is created.
(This is a function analgous to kflushd, though kflushd in its current form
can't possibly know what it would need to know to cause phase transistions at
appropriate times, and in any event, it has no way to initiate one.)
That's basically it. There are some other wrinkles in Tux2 that serve to
flatten the filesystem tree, reduce the number of block writes required and
keep cpu usage to a reasonable level.
As Alan mentioned, there are many interesting things you can do when you have a
filesystem's metadata in the form of a tree. Tux2 doesn't do most of those
things at this point, since its main purpose in life is to demonstrate the
efficacy of the phase tree algorithm and to allow me to do kernel development
without putting my precious files at risk every time I need to reset the system.
Please forgive me for my delay in responding - I've been busy settling into a
new job (which should be easily guessable from my email address).
> >
> > > You can do the same today on Linux with LVM snapshots. They are only
> > > useful for read-only consistency checking because they are read-only.
> > > (so in case of a problem you'll need to umount and rerun fsck on the
> > > normal device)
> >
> > Also there is ext2 based work going on using phase tree rather than journalling
> > which gives you similar journal properties, in future snapshots and also
> > very nice handling of multipath error recovery.
> >
> > Alan
-- Daniel--Boundary-=_nWlrBbmQBhCDarzOwKkYHIDdqSCD Content-Type: text/html; name="safe.commit.html" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="safe.commit.html"
PCEtLSByZWNlaXZlZD0iTW9uIEp1biAxNiAxMDo1NTowMiAxOTk3IEVTVCIgLS0+CjwhLS0gc2Vu dD0iTW9uLCAxNiBKdW4gMTk5NyAwOTozNjo0MiAtMDYwMCIgLS0+CjwhLS0gbmFtZT0iVmljdG9y IFlvZGFpa2VuIiAtLT4KPCEtLSBlbWFpbD0ieW9kYWlrZW5AY2hlbG0uY3Mubm10LmVkdSIgLS0+ CjwhLS0gc3ViamVjdD0iUmU6IGpvdXJuYWxpbmcgZmlsZXN5c3RlbSIgLS0+CjwhLS0gaWQ9IjE5 OTcwNjE2MTUzNi5KQUEyOTg1M0BjaGVsbS5jcy5ubXQuZWR1IiAtLT4KPCEtLSBpbnJlcGx5dG89 ImdlZXJ0ZW5AYmFydC5ubCIgLS0+Cjx0aXRsZT5MaW51eC1LZXJuZWwgQXJjaGl2ZTogUmU6IGpv dXJuYWxpbmcgZmlsZXN5c3RlbTwvdGl0bGU+CjxoMT5SZTogam91cm5hbGluZyBmaWxlc3lzdGVt PC9oMT4KVmljdG9yIFlvZGFpa2VuICg8aT55b2RhaWtlbkBjaGVsbS5jcy5ubXQuZWR1PC9pPik8 YnI+CjxpPk1vbiwgMTYgSnVuIDE5OTcgMDk6MzY6NDIgLTA2MDA8L2k+CjxwPgo8dWw+CjxsaT4g PGI+TWVzc2FnZXMgc29ydGVkIGJ5OjwvYj4gPGEgaHJlZj0iZGF0ZS5odG1sIzI5Ij5bIGRhdGUg XTwvYT48YSBocmVmPSJpbmRleC5odG1sIzI5Ij5bIHRocmVhZAogXTwvYT48YSBocmVmPSJzdWJq ZWN0Lmh0bWwjMjkiPlsgc3ViamVjdCBdPC9hPjxhIGhyZWY9ImF1dGhvci5odG1sIzI5Ij5bIGF1 dGhvciBdPC9hPgo8IS0tIG5leHQ9InN0YXJ0IiAtLT4KPGxpPiA8Yj5OZXh0IG1lc3NhZ2U6PC9i PiA8YSBocmVmPSIwMDMwLmh0bWwiPk9sYWYgVGl0ejogIlJlOiBrZXJuZWxkL211bHRpY2FzdCBi dWcgKHRpY2tsZWQgYnkgZ2F0CmVkKSI8L2E+CjxsaT4gPGI+UHJldmlvdXMgbWVzc2FnZTo8L2I+ IDxhIGhyZWY9IjAwMjguaHRtbCI+Um9naWVyIFdvbGZmOiAiVHJpdG9uSUkgSURFIGludGVyZmFj ZSBub3QgUENJIGNvbQpwbGlhbnQ/IjwvYT4KPCEtLSBuZXh0dGhyZWFkPSJzdGFydCIgLS0+Cjwh LS0gcmVwbHk9ImVuZCIgLS0+CjwvdWw+Cjxocj4KPCEtLSBib2R5PSJzdGFydCIgLS0+Ck9uIEp1 biAxNiwgIDQ6MDJhbSwgc3RlcGhlbiBmYXJyZWxsIHdyb3RlOjxicj4KIFN1YmplY3Q6IFJlOiBq b3VybmFsaW5nIGZpbGVzeXN0ZW08YnI+CjxpPiZndDtCdXQgeW91IGNhbm5vdCBndWFyYW50ZWUg dGhhdCB5b3UnbGwgbWFrZSBpdCB0aHJvdWdoIGNvbW1pdGluZyB0aGVzZTwvaT48YnI+CjxpPiZn dDtjaGFuZ2VzLCBzbyBJIGRvbid0IHVuZGVyc3RhbmQgd2hhdCB0aGlzIGdhaW5zIHlvdS4uLiAg YXQgYmVzdCBpdDwvaT48YnI+CjxpPiZndDtzb3VuZHMgbGlrZSB5b3UnZCBqdXN0IGJlIG1pbmlt aXppbmcgdGhlIHdpbmRvdyBpbiB3aGljaCBhIGNyYXNoIHdpbGw8L2k+PGJyPgo8aT4mZ3Q7aHVy dCB5b3UuPC9pPjxicj4KPHA+CkkgaGF2ZW4ndCB0aG91Z2h0IGFib3V0IHRoaXMgaXNzdWUgZm9y IGEgbG9uZyB0aW1lLCBhbmQgcGVyaGFwcyA8YnI+CkkndmUgYmVjb21lIGV2ZW4gbW9yZSBjb25m dXNlZCwgYnV0IEkgZG9uJ3QgdGhpbmsgeW91IGFyZSBjb3JyZWN0Ljxicj4KPHA+CllvdSBjYW4g aGF2ZSBhIHNhZmUgY29tbWl0LiBUaGUgYWxnb3JpdGhtLCBhcyBpcyBzdGFuZGFyZDxicj4KZm9y IERCIGxvZ2dpbmcsIGlzIHRvIGtlZXAgdHdvIGluc3RhbmNlcyBvZiB0aGUgRlMgLS0gb25lIHRo YXQ8YnI+CmlzIGEgcHJpbWFyeSwgY29uc2lzdGVudCwgdmVyc2lvbiwgYW5kIG9uZSB0aGF0IGlz IHRoZSBjdXJyZW50PGJyPgp2ZXJzaW9uLiBTdXBwb3NlIHdlIGhhdmUgdHdvIFN1cGVyQmxvY2tz LCBvbmUgcHJpbWFyeSwgb25lIGN1cnJlbnQ8YnI+CndpdGggZWFjaCBwb2ludGluZyB0byBpdHMg b3duIGZyZWUgbGlzdCBhbmQgIGlub2RlIHRhYmxlLjxicj4KQWxsIGRhdGEgaXMgd3JpdHRlbiB0 byBmcmVlIGJsb2NrcywgYWxsIGZyZWVkIGJsb2Nrczxicj4KYXJlIHBsYWNlZCBvbiBhICJ6b21i aWUgbGlzdCIgdW50aWwgY29tbWl0LiBUaGUgaW5vZGUgdGFibGUgaXM8YnI+CnVzZWQgdG8gYWxs b3cgbmV3IGlub2RlcyB0byBiZSBhbGxvY2F0ZWQgb24gY2hhbmdlcyAtLSBzbyB0aGF0PGJyPgpp bm9kZSBudW1iZXJzIGFyZSBub3QgYXNzb2NpYXRlZCB3aXRoIGZpeGVkIGRpc2sgbG9jYXRpb25z LiA8YnI+ClRoZSBjb21taXQgY2FuIGJlIGRvbmUgaW4gc2V2ZXJhbCB3YXlzLCBidXQgdG8gc3Rh cnQsIGFzc3VtZTxicj4KdGhhdCB3ZSBzaW1wbHkgZmx1c2ggYWxsIGRpcnR5IGJ1ZmZlcnMgYW5k IHRoZW4gIHdyaXRlIHRoZSA8YnI+CmN1cnJlbnQgU0IgYW5kIGN1cnJlbnQgaW5vZGUgdGFibGUs IHRoZW4gd2UgbWFyayB0aGUgY3VycmVudCA8YnI+ClNCIGFzIHRoZSBwcmltYXJ5IFNCIC0tIHRo aXMgd3JpdGUgaXMgYXRvbWljIGFuZCBmbGlwcyB1cyBmcm9tPGJyPgpvbmUgY29uc2lzdGVudCBG UyB0byBhbm90aGVyLiBPbiByZWNvdmVyeSBvciByZWJvb3QsIHRoZSA8YnI+CnByaW1hcnkgaXMg dXNlZCwgdGhlIHpvbWJpZSBsaXN0IGlzIGNvcGllZCB0byB0aGUgZnJlZSBsaXN0LCBhbmQ8YnI+ CnRoZSBwcmltYXJ5IGlzIHdyaXR0ZW4gb3ZlciBjdXJyZW50LiA8YnI+CjxwPgpUaGUgRlMgSSdt IHJlbWVtYmVyaW5nIHdhcyBkZXNpZ25lZCBmb3IgYSByZWxhdGl2ZWx5IHNpbXBsZSBzdHJhaWdo dGZvcndhcmQ8YnI+ClVuaXggRlMsIGJ1dCB0aGVyZSBhcmUgbnVtZXJvdXMgb2J2aW91cyBvcHRp bWl6YXRpb25zIC0tIGUuZy4gdG8gPGJyPgp1c2UgdGhlIHNjaGVtZSBwZXIgY2dyb3VwIG9yIHRv IHNldCBhc2lkZSBwYWlycyBvZiBjeWwtZ3JvdXBzLCBhZGRpbmc8YnI+CmEgM3JkIGxldmVsIGZv ciBpbiBtZW1vcnkgY3VycmVudCB3aGlsZSBjdXJyZW50IGlzIGJlaW5nIGZsdXNoZWQgLi4uPGJy PgpPbmUgdGhpbmcgdGhpcyBhbGdvcml0aG0gZG9lcyBub3QgZG8gaXMgd29yayB3ZWxsIHdoZW4g dGhlIGRpc2sgaXM8YnI+Cm5lYXJseSBmdWxsLCBidXQgZmF1bHQtdG9sZXJhbmNlIG11c3QgY29z dCBzb21ldGhpbmcsIGFuZCBkaXNrcyBhcmU8YnI+CmNoZWFwLiBUaGUgYWxnb3JpdGhtIGNhbiB3 b3JrIHdlbGwgd2l0aCBhIHNpbXBsZSBidWZmZXIgY2FjaGU8YnI+CmZsdXNoaW5nIGFsZ29yaXRo bSwgd2hpbGUgZnVsbCBqb3VybmFsaW5nIHNlZW1zIHRvIHJlcXVpcmUgdGhhdDxicj4KYnVmZmVy cyBnZXQgZmx1c2hlZCBpbiBhIHBhcnRpY3VsYXIgb3JkZXIuPGJyPgpJIHRoaW5rIHRoYXQgaXMg YSBjb21wZWxsaW5nIGFkdmFudGFnZSBiZWNhdXNlIHRoZSBpbnRlcmFjdGlvbiBvZjxicj4KVk0g cGFnaW5nIGFuZCBGUyBidWZmZXIgY2FjaGUgdXNlIGlzIHRvbyBjb21wbGljYXRlZCBhcyBpdCBp cy48YnI+CjxwPgpJJ20gYSBsaXR0bGUgc2tlcHRpY2FsIG9mIHRoZSBTcHJpdGUgTEZTL1plYnJh LyAgYW5kIHNpbWlsYXIgPGJyPgpqb3VybmFsaW5nIEZTIHNjaGVtZXMgZm9yIHNldmVyYWwgcmVh c29ucy48YnI+CjEuIFRoZSByZWNvdmVyeSB0aW1lIHNlZW1zIGxpa2UgaXQgY291bGQgYmUgcXVp dGUgbG9uZy48YnI+CjMuIFRoZSBwYXJhZGlnbSBvZiBkdW1waW5nIGV2ZXJ5dGhpbmcgaW4gb25l IG1hc3NpdmUgd3JpdGUgaXMgc3VzcGVjdDxicj4KICAgaW4gYSBIQSBlbnZpcm9ubWVudDogMTI4 TSBidWZmZXIgZmx1c2hlZCBhdCA2MG5zPGJyPgogICBwZXIgd29yZCBpcyBhYm91dCAyIHNlY29u ZHMuIFRoYXQncyBhIHZlcnkgbG9uZyB0aW1lLiA8YnI+CjxwPgpIZXJlJ3MgYSByZWZlcmVuY2Ug d2hpY2ggZGVzY3JpYmVzIHRoZSBGUyBicmllZmx5IGFuZCBhbHNvIHRhbGtzIGFib3V0PGJyPgp0 aGUgb3RoZXIgdmlydHVlcyBvZiB0aGUgQXVyb3MgT1Mgd2hpY2ggZGllZCBhbG9uZyB3aXRoIG15 IHN0b2NrPGJyPgpvcHRpb25zIHNvIG1hbnkgeWVhcnMgYWdvLjxicj4KPHA+CiAgIEBBcnRpY2xl e0JvcmdCbGF1R3JhZXRzY2hIZXJybWFubk9iZXJsZTg5LDxicj4KICAgICBrZXkgPSAgICAgICAg ICAiQm9yZyBldCBhbC4iLDxicj4KICAgICBhdXRob3IgPSAgICAgICAiQW5pdGEgQm9yZyBhbmQg V29sZmdhbmcgQmxhdSBhbmQgV29sZmdhbmcgR3JhZXRzY2ggYW5kPGJyPgogICAgICAgICAgICAg ICAgICAgIEZlcmRpbmFuZCBIZXJybWFubiBhbmQgV29sZmdhbmcgT2JlcmxlIiw8YnI+CiAgICAg dGl0bGUgPSAgICAgICAgIkZhdWx0IFRvbGVyYW5jZSBVbmRlciB7VU5JWH0iLDxicj4KICAgICBq b3VybmFsID0gICAgICAiQUNNIFRyYW5zYWN0aW9ucyBvbiBDb21wdXRlciBTeXN0ZW1zIiw8YnI+ CiAgICAgcGFnZXMgPSAgICAgICAgIjEtLTI0Iiw8YnI+CiAgICAgdm9sdW1lID0gICAgICAgIjci LDxicj4KICAgICBudW1iZXIgPSAgICAgICAiMSIsPGJyPgogICAgIG1vbnRoID0gICAgICAgIGZl Yiw8YnI+CiAgICAgeWVhciA9ICAgICAgICAgIjE5ODkiLDxicj4KPCEtLSBib2R5PSJlbmQiIC0t Pgo8aHI+CjxwPgo8dWw+CjwhLS0gbmV4dD0ic3RhcnQiIC0tPgo8bGk+IDxiPk5leHQgbWVzc2Fn ZTo8L2I+IDxhIGhyZWY9IjAwMzAuaHRtbCI+T2xhZiBUaXR6OiAiUmU6IGtlcm5lbGQvbXVsdGlj YXN0IGJ1ZyAodGlja2xlZCBieSBnYXRlZCkiPC9hPgo8bGk+IDxiPlByZXZpb3VzIG1lc3NhZ2U6 PC9iPiA8YSBocmVmPSIwMDI4Lmh0bWwiPlJvZ2llciBXb2xmZjogIlRyaXRvbklJIElERSBpbnRl cmZhY2Ugbm90IFBDSSBjb21wbGlhbnQ/IjwvYT4KPCEtLSBuZXh0dGhyZWFkPSJzdGFydCIgLS0+ CjwhLS0gcmVwbHk9ImVuZCIgLS0+CjwvdWw+Cgo=
--Boundary-=_nWlrBbmQBhCDarzOwKkYHIDdqSCD--
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/