Re: [lvm-devel] Re: 2x Oracle slowdown from 2.2.16 to 2.4.4

Andreas Dilger (adilger@turbolinux.com)
Fri, 13 Jul 2001 01:35:00 -0600 (MDT)


Andrea writes:
> With the current design of the pe_lock_req logic when you return from
> the ioctl(PE_LOCK) syscall, you never have the guarantee that all the
> in-flight writes are commited to disk, the
> fsync_dev(pe_lock_req.data.lv_dev) is just worthless, there's an huge
> race window between the fsync_dev and the pe_lock_req.lock = LOCK_PE
> where whatever I/O can be started without you fiding it later in the
> _pe_request list.

Yes there is a slight window there, but fsync_dev() serves to flush out the
majority of outstanding I/Os to disk (it waits for I/O completion). All
of these buffers should be on disk, right?

> Even despite of that window we don't even wait the
> requests running just after the lock test to complete, the only lock we
> have is in lvm_map, but we should really track which of those bh are
> been committed successfully to the platter before we can actually copy
> the pv under the lvm from userspace.

As soon as we set LOCK_PE, any new I/Os coming in on the LV device will
be put on the queue, so we don't need to worry about those. We have to
do something like sync_buffers(PV, 1) for the PV that is underneath the
PE being moved, to ensure any buffers that arrived between fsync_dev()
and LOCK_PE are flushed (they are the only buffers that can be in flight).
Is there another problem you are referring to?

AFAICS, there would only be a large window for missed buffers if you
were doing two PE moves at once, and had contention for _pe_lock,
otherwise fsync_dev to LOCK_PE is a very small window, I think.
However, I think we are also protected by the global LVM lock from
doing multiple PE moves at one time.

> If the logic would been sane, your patch would also been ok
> (besides the C breakage of the missing volatile but we abuse gcc this
> way in other parts of the kernel too after all).

Yes, I never thought about GCC optimizing away the two references to the
same var before and after making the check.

> I think the whole pv_move logic needs to be redesigned and rewritten, if
> you could rewrite it and send patches (possibly also against beta7 if
> a new lvm release is not scheduled shortly) that would be more than
> welcome!

Yes, well the correct solution is to do it all in a kernel thread, so
that you don't need to do kernel->user->kernel data copying. I already
discussed this with Joe Thornber (I think) and it was decided to be too
much for now (needs changes to user tools, IOP version, etc). Later.

Cheers, Andreas

-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/