Oh yes, there are lots (I'm simply shutting my eyes
to them till we've conquered at least a few :))
What I'm trying to do is look at them one at a time,
starting with the ones that have the greatest potential
for improvement, based on impact and complexity.
Otherwise it'd be just too hard to get anywhere!
Actually, that was one reason why I decided to post
this early: I want to catch the most important ones and
make sure we can at least deal with them.
Read is the easier case, which is why I started with
it. I'll come back to the write case sometime later,
after I've played with it a bit.
Even with read there is one more case besides your
list: copy_to_user() or fault_in_writable_pages() can
block too ;) (that's what I'm running into).
However, the general idea is that any blocking point
where things could just work if we returned instead of
waiting, and then issued a retry that continues from
the point where it left off, can be handled within the
basic framework. What we need to do is look at these on
a case-by-case basis and see whether that is doable. My
hunch is that for congestion and throttling points it
should be possible, and we already have a lot of
pipelining and restartability built into the VFS now.
Mostly we just need to make sure the -EIOCBQUEUED
returns can be propagated all the way up without
breaking anything.
It's really the metadata-related waits that are a
black box for me, and I wasn't planning on tackling
those yet ... more so as I guess it could mean getting
into very filesystem-specific territory, so doing it
consistently may not be that easy.
> - write() -> down(&inode->i_sem)
> - read()/write() -> read indirect block -> wait_on_buffer()
> - read()/write() -> read bitmap block -> wait_on_buffer()
> - write() -> allocate block -> mark_buffer_dirty() ->
> balance_dirty_pages() -> writer throttling
> - write() -> allocate block -> journal_get_write_access()
> - read()/write() -> update_a/b/ctime() -> journal_get_write_access()
> - ditto for other journalling filesystems
> - read()/write() -> anything -> get_request_wait()
> (This one can be avoided by polling the backing_dev_info congestion
> state.)
I was thinking of a get_request_async_wait() that
unplugs the queue, and returns -EIOCBQUEUED after
queueing an async waiter to just retry the operation.
Of course this would only work if the caller is able to
push back -EIOCBQUEUED without breaking anything.
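As a rough sketch of what I mean (the toy_* names are
just stand-ins for the block layer pieces like
get_request() and unplugging; only -EIOCBQUEUED is
real):

#include <stddef.h>

#define EIOCBQUEUED 529

struct toy_request_queue { int free_requests; };
struct toy_request { int rw; };
struct toy_kiocb { int id; };

static struct toy_request *toy_get_request(struct toy_request_queue *q)
{
        static struct toy_request rq;
        return q->free_requests-- > 0 ? &rq : NULL;
}

static void toy_unplug_queue(struct toy_request_queue *q)
{
        (void)q;        /* would kick the device to drain requests */
}

static void toy_queue_async_waiter(struct toy_request_queue *q,
                                   struct toy_kiocb *iocb)
{
        (void)q; (void)iocb;    /* wakeup would retry the iocb */
}

/* Like get_request_wait(), but instead of sleeping until a request
 * frees up: unplug the queue so in-flight requests complete, queue
 * the iocb as an async waiter (its wakeup just retries the whole
 * operation), and let -EIOCBQUEUED propagate back up. */
static long toy_get_request_async_wait(struct toy_request_queue *q,
                                       struct toy_kiocb *iocb,
                                       struct toy_request **rqp)
{
        struct toy_request *rq = toy_get_request(q);

        if (!rq) {
                toy_unplug_queue(q);             /* get the disk going */
                toy_queue_async_waiter(q, iocb); /* wakeup -> retry */
                return -EIOCBQUEUED;
        }
        *rqp = rq;
        return 0;
}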
> - read()/write() -> page allocator -> blk_congestion_wait()
> - write() -> balance_dirty_pages() -> writer throttling
> - probably others.
> Now, I assume that what you're looking for here is an 80% solution, but it
> seems that a lot more changes will be needed to get even that far.
If we don't bother about metadata and indirect blocks
just yet, wouldn't the gains we get otherwise still be
worth it?
> And given that a single kernel thread per spindle can easily keep that
> spindle saturated all the time, one does wonder "why try to do it this way at
> all?"
I need some more explanation of how/where you'd really
break up a generic_file_aio_read operation (with the
page_cache_read, readpage and copy_to_user steps) on a
per-spindle basis. Aren't we at a higher level of
abstraction than the disk at this stage?
I can visualize delegating all the readpage calls to a
worker thread per backing device or something like that
(forgetting LVM/RAID for a while), but what about the
rest of the parts?
Are you suggesting restructuring the generic_file_aio_read
code to separate out these stages, so it can identify
where it is and hand itself off accordingly to the
right worker?
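The only shape I can picture is something like this
(purely illustrative fragment, nothing like the code as
it stands today):

#define EIOCBQUEUED 529  /* same "would block, retry" convention */

/* A read op that records which stage it is in, so whoever
 * re-drives it (a per-device worker, say) can resume at the
 * right point instead of starting from the top. */
enum read_stage {
        STAGE_PAGE_CACHE_READ,  /* look up page, issue readpage() */
        STAGE_WAIT_ON_PAGE,     /* readpage I/O still in flight */
        STAGE_COPY_TO_USER,     /* copy out; can fault/block too */
        STAGE_DONE,
};

struct read_op {
        enum read_stage stage;
        int page_uptodate;      /* set by I/O completion */
};

/* Advance as far as possible without blocking; at a would-block
 * point, return -EIOCBQUEUED with op->stage marking the resume
 * point for the next worker. */
static long drive_read(struct read_op *op)
{
        for (;;) {
                switch (op->stage) {
                case STAGE_PAGE_CACHE_READ:
                        /* issue readpage(); handoff could happen here */
                        op->stage = STAGE_WAIT_ON_PAGE;
                        break;
                case STAGE_WAIT_ON_PAGE:
                        if (!op->page_uptodate)
                                return -EIOCBQUEUED;
                        op->stage = STAGE_COPY_TO_USER;
                        break;
                case STAGE_COPY_TO_USER:
                        /* copy_to_user(); could requeue here too */
                        op->stage = STAGE_DONE;
                        break;
                case STAGE_DONE:
                        return 0;
                }
        }
}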
--
Suparna Bhattacharya (email@example.com)
Linux Technology Center
IBM Software Labs, India