Filesystem AIO read-write patches

Suparna Bhattacharya (suparna@in.ibm.com)
Thu, 24 Apr 2003 10:22:22 +0530


Here is a revised version of the filesystem AIO patches
for 2.5.68.

It is built on a variation of the simple retry based
scheme originally suggested by Ben LaHaise.

Why?
------
Because 2.5 is still missing real support for regular
filesystem AIO (except for O_DIRECT).

ext2, jfs and nfs define the fops aio interfaces aio_read
and aio_write to default to generic_file_aio_read/write.
However, these routines behave fully synchronously
unless the file was opened with O_DIRECT. This means that
an io_submit could merrily block for regular aio read/write
operations, while the application thinks it's doing async
i/o.
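
For concreteness, here is roughly what such an application
looks like (a userspace sketch using the libaio wrappers;
error handling omitted, and "datafile" is just a
placeholder). With a buffered (non-O_DIRECT) file, the
io_submit() below is where the unexpected blocking happens
today:

/* Build with: gcc -o aiotest aiotest.c -laio */
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	char *buf = malloc(65536);
	int fd = open("datafile", O_RDONLY);	/* note: no O_DIRECT */

	io_setup(1, &ctx);
	io_prep_pread(&cb, fd, buf, 65536, 0);

	/* should queue the read and return at once, but for
	 * buffered files it currently performs the whole read
	 * synchronously before returning */
	io_submit(ctx, 1, cbs);

	io_getevents(ctx, 1, 1, &ev, NULL);	/* reap completion */
	return 0;
}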

How?
------
The approach we took was to identify and focus on the
most significant blocking points (confirmed by observations
from initial experimentation and profiling results), and
convert them into retry exit points.

Retries start at a very high level, driven directly by
the aio infrastructure (in the future, if the in-kernel fs
APIs change, retries could be moved one level below, i.e.
to the API level). They are kicked off via async wait
queue functions. In synchronous i/o context the default
wait queue entries are synchronous, and hence don't cause
an exit at a retry point.
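
To illustrate the idea, here is a simplified sketch of the
async wait queue function (the names follow the base retry
patch, but treat the signature and details as
approximations rather than the patch code):

/*
 * When the condition an iocb was waiting on (e.g. a page
 * becoming up to date) triggers a wakeup, the iocb is queued
 * for a retry from a worker thread, instead of a sleeping
 * task being woken up as in the synchronous case.
 */
static int aio_wake_function(wait_queue_t *wait, unsigned mode,
			     int sync)
{
	struct kiocb *iocb = container_of(wait, struct kiocb,
					  ki_wait);

	list_del_init(&wait->task_list);
	kick_iocb(iocb);	/* re-drive ki_retry() from a worker */
	return 1;
}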

One of the considerations was to try to take a careful
and less intrusive route with minimal changes to existing
synchronous i/o paths. The intent was to achieve a
reasonable level of asynchrony in a way that could then
be further optimized and tuned for workloads of relevance.

The Patches:
-----------
(which I'll be mailing out as responses to this note)
01aioretry.patch      : Base aio retry infrastructure
02aiord.patch         : Filesystem aio read
03aiowr.patch         : Minimal filesystem aio write
                        (for all archs and all filesystems
                        using generic_file_aio_write)
04down_wq-86.patch    : An asynchronous semaphore down
                        implementation (currently x86 only;
                        see the sketch after this list)
05aiowrdown_wq.patch  : Uses async down for aio write
06bread_wq.patch      : Async bread implementation
07ext2getblk_wq.patch : Async get block support for
                        the ext2 filesystem
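
As an illustration of what the down_wq patch provides,
here is a hedged sketch (not the actual x86 implementation,
which works in the semaphore fast/slow paths): try to take
the semaphore without sleeping; in async context, register
the caller's wait queue entry and return -EIOCBRETRY
instead of blocking:

int down_wq(struct semaphore *sem, wait_queue_t *wait)
{
	if (!down_trylock(sem))
		return 0;		/* acquired without blocking */

	if (!wait) {
		down(sem);		/* sync caller: sleep as usual */
		return 0;
	}

	/* async caller: queue for wakeup, unwind to retry point */
	add_wait_queue(&sem->wait, wait);
	if (!down_trylock(sem)) {	/* close the race with up() */
		remove_wait_queue(&sem->wait, wait);
		return 0;
	}
	return -EIOCBRETRY;
}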

Observations
--------------
As a quick check that this really works, I observed a
decent reduction in the time spent in io_submit (especially
for large reads) when the file is not already cached
(e.g. on first access). For the write case, I found that
I had to add the async get block support to get a
perceptible benefit. For the cached case there wasn't any
observable difference, which is expected. The patch didn't
seem to hurt synchronous read/write performance in a
simple test.

Another thing I tried was temporarily moving the retries
into io_getevents rather than the worker threads, as a
sanity check for any gross impact on cpu utilization.
That seemed OK too.

Of course, thorough performance testing is still needed;
it would show up the places where there is scope for
tuning, and how the patches affect overall system
performance numbers.

I have been playing with it for a while now, and so
far it's been running OK for me.

I would welcome feedback, bug reports, test results etc.

Full diffstat:

aiordwr-rollup.patch:
......................
 arch/i386/kernel/i386_ksyms.c |    2
 arch/i386/kernel/semaphore.c  |   30 ++-
 drivers/block/ll_rw_blk.c     |   21 +-
 fs/aio.c                      |  371 +++++++++++++++++++++++++++++++++---------
 fs/buffer.c                   |   54 +++++-
 fs/ext2/inode.c               |   44 +++-
 include/asm-i386/semaphore.h  |   27 ++-
 include/linux/aio.h           |   32 +++
 include/linux/blkdev.h        |    1
 include/linux/buffer_head.h   |   30 +++
 include/linux/errno.h         |    1
 include/linux/init_task.h     |    1
 include/linux/pagemap.h       |   19 ++
 include/linux/sched.h         |    2
 include/linux/wait.h          |    2
 include/linux/writeback.h     |    4
 kernel/fork.c                 |    9 -
 mm/filemap.c                  |   97 +++++++++-
 mm/page-writeback.c           |   17 +
 19 files changed, 616 insertions(+), 148 deletions(-)

[The patches are also available for download from the
Linux Scalability Effort project site
(http://sourceforge.net/projects/lse), categorized under
the "aio" release in the IO Scalability section:
http://sourceforge.net/project/showfiles.php?group_id=8875]

A rollup version containing all 7 patches
(aiordwr-rollup.patch) will be made available as well.

Major additions/changes since previous versions posted:
------------------------------------------------------
- Introduced _wq versions of low level routines like
  lock_page_wq, wait_on_page_bit_wq etc., which take the
  wait_queue entry as a parameter (thanks to Christoph
  Hellwig for suggesting the new and much better
  names :)).
- Reorganized code to avoid having to use the do_sync_op()
  wrapper (the forced emulation of the i/o wait context
  seemed like an overhead, and not very elegant).
- (New) Implementation of an asynchronous semaphore down
  operation for x86 (down_wq).
- Dropped the async block allocation portions from the
  async ext2_get_block patch after a discussion with
  Stephen Tweedie (the i/o patterns we anticipate are less
  likely to extend file sizes).
- Fixed use_mm() to clear the lazy tlb setting (I traced
  some of the strange hangs I was seeing on large reads to
  this).
- Removed the aio_run_iocbs() acceleration from
  io_getevents, now that the above problem is gone.

Todos/TBDs:
----------
- Support for down_wq on other archs, or compatibility
  definitions for archs where it is not implemented
  (need feedback on this).
- Should the cond_resched() calls in read/write be
  converted to retry points (this would need ctx specific
  worker threads)?
- Look at async get block implementations for other
  filesystems (e.g. jfs).
- Optional: check if it makes sense to use the retry model
  for O_DIRECT (or change sync O_DIRECT to wait for
  completion of async O_DIRECT).
- Upgrade to Ben's aio API changes (collapse of the API
  parameters into an rw_iocb) if and when they get merged.

A few comments on low level implementation details:
--------------------------------------------------
io_wait context
-------------------
The task->io_wait field reflects the wait context in which
a task is executing its i/o operations. For synchronous
i/o, task->io_wait is NULL and the wait context is local
on the stack; for threads doing io submits or retries on
behalf of async i/o callers, task->io_wait is the wait
queue function entry to be notified on completion of a
condition required for the i/o to progress.
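
In code terms the addition amounts to roughly the
following (a sketch; the placement and comments are mine):

/* include/linux/sched.h: the i/o wait context travels
 * with the task */
struct task_struct {
	...
	/* NULL for synchronous i/o (wait entry lives on the
	 * stack); points to the iocb's async wait queue entry
	 * while doing aio submits or retries */
	wait_queue_t *io_wait;
	...
};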

Low level _wq routines take a wait queue parameter, so
they can be invoked in either async or sync mode, even
when running in async context (e.g. when servicing a page
fault during an async retry).

Routines which are expected to be async whenever they run
in async context, and sync when they run in sync context,
do not need to take a wait queue parameter; they can pick
up task->io_wait directly.
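
For example, a _wq routine might look roughly like the
following (an illustrative sketch of the convention, not
the patch code verbatim; it assumes page_waitqueue() is
visible at this point):

/*
 * A NULL wait argument gives the old blocking behaviour;
 * an async wait queue entry is registered for wakeup and
 * -EIOCBRETRY is returned, so the caller unwinds to its
 * retry exit point.  Note the re-test after queueing, to
 * avoid losing a wakeup.
 */
int lock_page_wq(struct page *page, wait_queue_t *wait)
{
	if (!TestSetPageLocked(page))
		return 0;		/* got the lock, no waiting */

	if (!wait) {
		lock_page(page);	/* synchronous caller: block */
		return 0;
	}

	/* async caller: register for wakeup, retry later */
	add_wait_queue(page_waitqueue(page), wait);
	if (!TestSetPageLocked(page)) {
		remove_wait_queue(page_waitqueue(page), wait);
		return 0;
	}
	return -EIOCBRETRY;
}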

do_sync_op()
---------------
The do_sync_op() wrappers are not typically needed anymore
for sync versions of the operations; passing NULL to the
corresponding _wq functions suffices.
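
For instance (illustrative):

	err = lock_page_wq(page, NULL);	/* forced sync */
	err = lock_page_wq(page, current->io_wait);
					/* follow task context */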

However, there may be weird cases with several levels of
nesting, like:
A()->B()->C()->D()->F()->iowait()
It may seem unnatural to pass a wait queue argument all
the way through; but if we need to force sync behaviour
in one case even when called in async context, and have
async behaviour in another, then we may need to resort to
do_sync_op() (e.g. if we had kept the ext2 async block
allocation modifications).
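
One plausible shape for such a wrapper, sketched from the
description above (not the dropped code itself):

/*
 * Force default synchronous behaviour across a nested call
 * chain by clearing the task's i/o wait context for the
 * duration of the operation: nested _wq-aware routines then
 * block instead of returning -EIOCBRETRY, regardless of the
 * caller's context.
 */
static inline ssize_t do_sync_op(ssize_t (*op)(void *), void *arg)
{
	wait_queue_t *saved = current->io_wait;
	ssize_t ret;

	current->io_wait = NULL;	/* NULL => synchronous */
	ret = op(arg);
	current->io_wait = saved;
	return ret;
}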

Regards
Suparna

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Labs, India
