Re: aio-core why not using SuS? [Re: [rfc] aio-core for 2.5.29 (Re: async-io API registration for 2.5.29)]

Suparna Bhattacharya (suparna@in.ibm.com)
Fri, 16 Aug 2002 16:53:06 +0530


On Fri, Aug 16, 2002 at 12:03:34PM +0200, Andrea Arcangeli wrote:
> On Fri, Aug 16, 2002 at 03:09:46PM +0530, Suparna Bhattacharya wrote:
> > Also, wasn't the fact that the API was designed to support both POSIX
> > and completion port style semantics, another reason for a different
> > (lightweight) in-kernel api? The c10k users of aio are likely to find
> > the latter model (i.e. completion ports) more efficient.
>
> if it's handy for you, can you post a link to the API defined by
> POSIX and completion ports so I can read them too and not only SuS?

Don't have anything handy atm that's any better than what you could
get through doing a google on "IO Completion ports". (See section at
the end of this note for some info)

Completion port apis aren't really part of any standard, but are provided
by some operating systems (NT, AS/400), most of which use a similar
interface. I personally found it useful to refer to the DAFS
completion groups API (DAFS API Spec at www.dafscollaborative.org)
just to get an idea of something that takes into account these various
existing interfaces to arrive at an interface for async i/o
completion (even though this really is all direct user-space api
implementation for remote file data access and nothing to do with
in kernel i/o interfaces).

>
> btw, I don't see why there are so many API doing the same thing, I think
> for the goodness of linux it would be nice to standardize and recommend
> one of these user API so new software will use the API we recommend now,
> rather than choosing almost randomly every time. So the rest will be
> backwards compatibility stuff for apps ported from other OS, and it will
> be worthwhile to have the kernel API to match what we recommend as user
> API.

Since you are analysing this stuff I wonder if you have by any
chance looked through the aio design notes I had posted a while back.
I did try to discuss the background in terms of completion apis used
elsewhere even though I didn't record the specific details of those
interfaces. Am appending that section of the doc below.

Regards
Suparna

-------------------------------------------

2.5 Completion/Readiness notification:

Comment: Readiness notification can be treated as a completion of an
asynchronous operation to await readiness.

POSIX aio provides for waiting for completion of a particular request, or
for an array of requests, either by means of polling, or asynchronously
through signals. On some operating systems, there is a notion
of an I/O Completion port (IOCP), which provides a flexible and scalable way
of grouping completion events. One can associate multiple file descriptors
with such a completion port, so that all completion events for requests on
those files are sent to the completion port. The application can thus issue
a wait on the completion port in order to get notified of any completion
event for that group. The level of concurrency can be increased simply by
increasing the number of threads waiting on the completion port. There are
also certain additional concurrency control features that can be associated
with IOCPs (as on NT), where the system decides how many threads to
wakeup when completion events occur, depending on the concurrency limits
set for the queue, and the actual number of runnable threads at that moment.
Keeping the number of runnable threads constant in this manner protects
against blocking due to page faults and other operations that cannot be
performed asynchronously.
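
As a rough illustration of the completion port model (a user-space toy in
Python, not the NT or AS/400 API), the sketch below funnels completions from
several descriptors into one port and lets a pool of threads wait on it.
The names CompletionPort, post, get, shutdown and run_workers are all made up
for this example, and the concurrency-limit feature is not modelled.

```python
import queue
import threading


class CompletionPort:
    """Toy stand-in for an I/O completion port: one shared event queue."""

    def __init__(self):
        self._events = queue.Queue()

    def post(self, fd, result):
        # Called by the (simulated) I/O provider when a request on fd finishes.
        self._events.put((fd, result))

    def get(self):
        # Block until any completion event for the group is available.
        return self._events.get()

    def shutdown(self, nwaiters):
        # One sentinel per waiter; the FIFO order means real events drain first.
        for _ in range(nwaiters):
            self._events.put(None)


def run_workers(port, nthreads, nevents):
    """Start nthreads waiters on the port and return the events they handled."""
    handled = []
    lock = threading.Lock()

    def worker():
        while True:
            event = port.get()
            if event is None:  # sentinel: this waiter exits
                return
            with lock:
                handled.append(event)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    # Completions for three different descriptors all land on the same port.
    for i in range(nevents):
        port.post(fd=i % 3, result=i)
    port.shutdown(nthreads)
    for t in threads:
        t.join()
    return handled
```

Raising nthreads is the only change needed to get more concurrent waiters,
which is the property the paragraph above describes.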

On a similar note, the DAFS api spec incorporates completion groups for
handling async i/o completion, the design being motivated by VI completion
queues, NT IOCPs and the Solaris aiowait interfaces. Association of an
i/o with a completion group (NULL would imply the default completion queue)
happens at the time of i/o submission which lets the provider know where
to place the event when it completes, contrary to aio_suspend style interface
which specifies the grouping only when waiting on completion.

This implementation for Linux makes use of a similar notion to provide
support for completion queues. There are apis to set up and destroy such
completion queues, specifying the maximum queue lengths that a queue is
configured for. Every asynchronous i/o request is associated with a completion
queue when it is submitted (like the DAFS interfaces), and an application
can issue a wait on a given queue to be notified of a completion event for
any request associated with that queue.
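
To make the lifecycle concrete, here is a minimal user-space model in Python
of a completion queue that is set up with a maximum length, bound to each
request at submission time, and then waited on. It is only a sketch: the
names queue_setup, queue_destroy, submit, complete and wait are hypothetical
and are not the actual syscall names, and wait does not block here.

```python
import collections


class CompletionQueue:
    """Toy completion queue with a fixed capacity chosen at setup time."""

    def __init__(self, maxevents):
        self.maxevents = maxevents
        self.inflight = 0                 # submitted but not yet completed
        self.done = collections.deque()   # completed, not yet picked up


def queue_setup(maxevents):
    return CompletionQueue(maxevents)


def queue_destroy(cq):
    if cq.inflight:
        raise RuntimeError("requests still in flight")
    cq.done.clear()


def submit(cq, request):
    # The queue is named at submission, so the provider knows where the
    # completion event has to be placed later.
    if cq.inflight + len(cq.done) >= cq.maxevents:
        raise BlockingIOError("completion queue full")
    cq.inflight += 1
    return (cq, request)


def complete(handle, result):
    # Simulates the provider finishing the i/o and posting the event.
    cq, request = handle
    cq.inflight -= 1
    cq.done.append((request, result))


def wait(cq):
    # Non-blocking in this sketch: oldest completion, or None if there is none.
    return cq.done.popleft() if cq.done else None
```

Counting both in-flight and unreaped events against maxevents is one possible
reading of "maximum queue length"; the design notes do not pin this down.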

BSD kqueue (Jonathan Lemon) provides a very generic method for registering
for and handling notification of events or conditions based on the concept
of filters of different types. This covers a wide range of conditions
including file/socket readiness notification (as in poll), directory/file
(vnode) change notifications, process create/exit/stop notifications, signal
notification, timer notification and also aio completion notification
(via SIGEV_KEVENT). The kqueue is equivalent to a completion queue, and
the interface allows one to both register for events and wait for (and
pick up) any events on the queue within the same call. It is rather flexible
in terms of providing for various kinds of event registration/notification
requirements, e.g. oneshot or everytime, temporary disabling, clearing
state if transitions need to be notified, and it supports both edge and
level triggered types of filters.
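
The registration options mentioned above can be modelled in a few lines of
Python. This is a toy, not the BSD kevent() interface: the class ToyKqueue and
its methods are invented for illustration, and registration and polling are
separate calls here, whereas the real interface can do both in one call.

```python
class ToyKqueue:
    """Toy model of kqueue-style registration flags (not the BSD API)."""

    def __init__(self):
        # ident -> dict with the flags and the current condition state
        self._filters = {}

    def register(self, ident, oneshot=False, clear=False):
        self._filters[ident] = dict(oneshot=oneshot, clear=clear,
                                    enabled=True, ready=False)

    def set_enabled(self, ident, enabled):
        # Temporary disabling keeps the registration but suppresses delivery.
        self._filters[ident]["enabled"] = enabled

    def trigger(self, ident):
        # The watched condition becomes true (e.g. data arrives on a socket).
        if ident in self._filters:
            self._filters[ident]["ready"] = True

    def poll(self):
        """Return the idents whose condition is currently reportable."""
        fired = []
        for ident, f in list(self._filters.items()):
            if not (f["enabled"] and f["ready"]):
                continue
            fired.append(ident)
            if f["oneshot"]:
                del self._filters[ident]   # delivered once, then removed
            elif f["clear"]:
                f["ready"] = False         # edge-like: only transitions fire
        return fired
```

A filter registered with neither flag behaves as level triggered in this model:
it keeps firing on every poll until the condition is reset by other means.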

2.5.1 Some Requirements which are addressed:

1. Efficient for large numbers of events and connections
- The interface to register events to wait for should be separate from
the interface used to actually poll/wait for the registered events to
complete (unlike traditional poll/select), so that registrations can
hold across multiple poll waits with minimum user-kernel transfers.
(It is better to handle this at interface definition level than
through some kind of an internal poll cache)

The i/o submission routine takes a completion queue as a parameter,
which associates/registers the events with a given completion group/queue.
The application can issue multiple waits on the completion queue using a
separate interface.

- Ability to reap many events together (unlike current sigtimedwait
and sigwaitinfo interfaces)

The interface used to wait for and retrieve events, can return an
array of completed events rather than just a single event.

- Scalable/tunable queue limits - at least have a limit per queue rather
than system wide limits

Queue limits can be specified when creating a completion group.
TBD: A control interface for changing queue parameters/limits (e.g.
io_queue_grow) might be useful.

- Room for more flexible/tunable wakeup semantics for better concurrency
control

Since the core event queue can be separated from the notification mechanism,
the design allows one to provide for alternative wakeup semantics
to optimize concurrency and reduce redundant or under-utilized context
switches. Implementing these might require some additional parameters or
interfaces to be defined. BTW, it is desirable to provide a unified interface
for notification and event retrieval to a caller, to avoid synchronization
complexities, even if the core policies are separable underneath in-kernel.

[See Sec 2.6 on wakeup policies for a more detailed discussion of
this]
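
As an analogy for the first two points (register once, wait many times, and
get an array of events back per wait), here is a small sketch using Python's
standard selectors module over local socket pairs. This is a readiness
interface rather than the aio completion interface discussed here, and the
function name reap_ready is made up for the example.

```python
import selectors
import socket


def reap_ready(nconn=3):
    """Register nconn sockets once, then wait twice and collect each batch."""
    sel = selectors.DefaultSelector()
    pairs = [socket.socketpair() for _ in range(nconn)]
    try:
        # Registration happens once and persists across the waits below.
        for idx, (rd, _wr) in enumerate(pairs):
            sel.register(rd, selectors.EVENT_READ, data=idx)

        batches = []
        for _ in range(2):
            for _rd, wr in pairs:
                wr.send(b"x")
            # A single wait call returns a list of ready events.
            events = sel.select(timeout=1.0)
            batches.append(sorted(key.data for key, _mask in events))
            for key, _mask in events:
                key.fileobj.recv(16)   # drain so the next round starts clean
        return batches
    finally:
        sel.close()
        for rd, wr in pairs:
            rd.close()
            wr.close()
```

No per-wait re-registration is needed in the second round, which is the
saving in user-kernel transfers that the first requirement asks for.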

2. Enable flexible grouping of operations
- Flexible grouping at the time of i/o submission
(different operations on the same fd can belong to different groups,
operations on different fds can belong to the same group)

- Ability to wait for at least a specified number of operations from
a specified group to complete (at least N vs at least 1 helps with
batching on the way up, so that the application can perform its post
processing activities in a batch, without redundant context switches)

The DAFS api supports such a notion, both in its cg_batch_wait interface
which returns either when N events have completed, or with fewer than N
events in case of a timeout, and also in the form of a num_completions
hint at the time of i/o submission. The latter is a hint that gets sent
out to the server as a characteristic of the completion queue or session,
so the server can use this hint to batch its responses accordingly.
Knowing that the caller is interested only in batch completions helps
with appropriate optimizations.

Note: The Linux aio implementation today only supports "at least one"
and not "at least N" (as provided by, e.g., the aio_nwait interface on AIX).

The tradeoffs between responsiveness and fairness tend to
get amplified when considering "at least N" type of semantics,
and this is one of the main concerns in supporting it.
[See discussion on wakeup policies later]

- Support dynamic additions to the group rather than a static or one time
list passed through a single call

Multiple i/o submissions can specify the same completion group, enabling
events to be added to the group.

[Question: Is the option of the completion group being different from the
submission batch/group (i.e. per iocb grouping field) useful to have ?
Like POSIX using sigevent as part of iocb]
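
The "at least N, or fewer on timeout" behaviour can be sketched in user space
with a condition variable. This is only an illustration of the semantics, not
the DAFS cg_batch_wait implementation; BatchQueue, post and batch_wait are
hypothetical names.

```python
import threading
import time


class BatchQueue:
    """Toy completion group supporting an 'at least N' wait with a timeout."""

    def __init__(self):
        self._cond = threading.Condition()
        self._events = []

    def post(self, event):
        # Dynamic addition: any submitter may add events to the group.
        with self._cond:
            self._events.append(event)
            self._cond.notify_all()

    def batch_wait(self, min_events, timeout):
        """Return at least min_events events, or fewer if the timeout expires."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while len(self._events) < min_events:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                self._cond.wait(remaining)
            events, self._events = self._events, []
            return events
```

Note that the waiter here is still woken on every post and goes back to
sleep if the count is short; avoiding those interim wakeups is a wakeup
policy question of its own.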

3. Should also be able to wait for a specific operation to complete (without
being very inefficient about it)

One could either have low overhead group setup/teardown so such an operation
may be assigned a group of its own (costs can be amortized across multiple
such operations by reusing the same group if possible) or provide an
interface to wait for a specific operation to complete.

The latter would be more useful, though it requires a per-request wait queue
or something similar. The current implementation has a syscall interface
defined for this (io_wait), which hasn't been coded up as yet. The plan is
to use hashed wait queues to conserve on space.

There are also some semantics issues in terms of possibilities of another
waiter on the queue picking up the corresponding completion event for this
operation. To address this, the io_wait interface might be modified to
include an argument for the returned event.

BTW, there is an option of dealing with this using the group primitives
either in user space, or even in kernel by waiting in a loop for any event
in the group until the desired event occurs, but this could involve some
extra interim wakeups / context switches under the covers, and a user
level event distribution mechanism for the other events picked up in the
meantime.
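
The last alternative (looping on the group until the wanted event shows up)
might look roughly like this in user space. It is a sketch under simplifying
assumptions: a deque stands in for the kernel queue, nothing blocks, and
GroupWaiter, wait_any and wait_for are invented names.

```python
import collections


class GroupWaiter:
    """Toy: wait for one specific request using only group-level waits."""

    def __init__(self, events):
        self._queue = collections.deque(events)  # stands in for the kernel queue
        self._stash = {}                          # request id -> result
        self.wakeups = 0                          # group waits actually issued

    def wait_any(self):
        # Hand out stashed events first; they were already reaped earlier.
        if self._stash:
            return self._stash.popitem()
        self.wakeups += 1
        return self._queue.popleft()

    def wait_for(self, req_id):
        if req_id in self._stash:
            return self._stash.pop(req_id)
        while True:
            self.wakeups += 1
            rid, result = self._queue.popleft()
            if rid == req_id:
                return result
            # Not ours: keep it for whoever asks for it later.
            self._stash[rid] = result
```

The wakeups counter makes the cost visible: waiting for the last of three
requests takes three group waits, and the stash is the user level event
distribution mechanism the text refers to.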

4. Enable Flexible distribution of responsibility across multiple
threads/components

Different threads can handle submission for different operations,
and another pool of threads could wait on completion.
The degree of concurrency can be improved simply by increasing threads
in the pool that wait for and process completion of operations for
that group.

5. Support for Prioritized Event Delivery

This involves the basic infrastructure to be able to accord higher
priority to the delivery of certain completion events over others,
(e.g. depending on the request priority settings of the corresponding
request), i.e. if multiple completion events have arrived on the
queue, then the events for higher priorities should be picked up
first by the application.
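
One way to picture prioritized delivery is a heap-ordered completion queue.
This is a sketch only, assuming that a larger number means higher priority
and that events of equal priority are delivered in arrival order; the class
PriorityCompletionQueue and its methods are hypothetical.

```python
import heapq
import itertools


class PriorityCompletionQueue:
    """Toy completion queue that hands out higher-priority events first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def post(self, priority, event):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), event))

    def reap(self, max_events):
        """Return up to max_events events, most urgent first."""
        out = []
        while self._heap and len(out) < max_events:
            _neg_prio, _seq, event = heapq.heappop(self._heap)
            out.append(event)
        return out
```

For example, posting low, high, low, high and then reaping three events
returns both high-priority events before the first low-priority one.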
