Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Matthew D. Hall (mhall@free-market.net)
Thu, 31 Oct 2002 17:55:07 -0800


If I may respectfully weigh in...

If a new API and/or a significant change in semantics is to be applied
to the kernel for a unified event notification system, this is obviously
an issue for 2.7 or 2.9. As such, we have plenty of time to focus upon
simplicity and correctness rather than plain old inertia. We need to
bring a truly unified, and therefore new, event API to the kernel, and
it must be done right. kevent attempts to achieve this for FreeBSD, and
generally speaking, it succeeds. But Linux can do much better.

The API should present the notion of a general edge-triggered event
(e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so
simply. Linus made some suggestions on lkml back in 2000
(http://marc.theaimsgroup.com/?l=linux-kernel&m=97236943118139&w=2) that
pretty much hit the nail on the head -- with some exceptions.

* Unless every conceivable event is to be represented as a file (rather
unintuitive IMHO), his proposed interface fails to accommodate non-I/O
events (e.g. timers, signals, directory updates, etc.). As much as I
appreciate the UNIX Way, making everything a file is a massive
oversimplification. We can often stretch the definition far enough to
make this work, but I'd be impressed to see how one intends to call,
e.g., a timer a type of file.

* There is a seemingly significant overhead in performing exactly one
callback per event. Doesn't this prevent any kind of event coalescence?
It seems like we could be doing an awful lot of cache thrashing, among
other things. In some cases, this might happen anyway, but why should
the interface enforce such behavior? In most other cases, it's
perfectly acceptable to inline an event handler (either via compile-time
inlining or literally). I do realize that Linus only suggested that the
C library do the mass callbacks, BTW, so strictly speaking, it's the
userland API that would "enforce such behavior." Nonetheless, this is
of concern.
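
To make the coalescing concern concrete, here is a minimal sketch in C of the two delivery styles. Everything in it is hypothetical: struct event, dispatch_per_event(), count_cb() and coalesce_batch() are names invented for illustration, not anything from Linus's proposal or from any existing kernel or libc interface. The point is only that a consumer handed the whole batch can merge redundant notifications before doing any work, while a strict one-callback-per-event dispatcher cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record; not a real kernel ABI. */
struct event {
    int   source;   /* which object the event refers to */
    void *udata;    /* opaque user context */
};

typedef void (*event_cb)(const struct event *ev);

/* Example handler that only counts how often it was invoked. */
static size_t calls;
static void count_cb(const struct event *ev)
{
    (void)ev;
    calls++;
}

/* Style 1: one indirect call per event, no chance to merge anything.
 * Returns the number of callbacks made. */
static size_t dispatch_per_event(const struct event *evs, size_t n, event_cb cb)
{
    assert(cb != NULL);
    for (size_t i = 0; i < n; i++)
        cb(&evs[i]);
    return n;
}

/* Style 2: the consumer sees the whole batch and may coalesce it.
 * Collapses runs of adjacent events for the same source in place,
 * keeping the first event of each run. Returns the new length. */
static size_t coalesce_batch(struct event *evs, size_t n)
{
    if (n == 0)
        return 0;
    size_t out = 1;
    for (size_t i = 1; i < n; i++) {
        if (evs[i].source != evs[out - 1].source)
            evs[out++] = evs[i];
    }
    return out;
}
```

With a batch whose sources are 3, 3, 5, 3, the per-event dispatcher makes four calls, while coalesce_batch() leaves three entries to handle inline. Merging only adjacent duplicates is an arbitrary choice made to keep the sketch short.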

Enough of what we shouldn't do. Here's what we should:

* The interface should allow the implementation to be highly extensible
without horrible code contortions within the kernel. It is important to
be able to add new types of events as they become necessary. The
interface should be general and simple enough to accommodate these
extensions. Linux (really, UNIX) has failed to exercise this foresight
in the past, and that's why we have the current mess of varied
polling/triggering methods.

* I might be getting a bit utopian here, but IMHO the kernel should
move toward a completely edge-triggered event system. The old
level-triggered interfaces should only wrap this paradigm.

* Might as well reiterate: simplicity. FreeBSD's kevent solves nearly
all of the traditional problems, but it is gross. And complicated.
There were clearly too many computer scientists and not enough
engineers on that team.

* Only one queue per process or kernel thread. Multiple queues per
flow of execution are just ugly and ultimately pointless. That is not to
say that you can't multithread, but each thread simply uses the same
queue. In cases when you want one thread to wait only on certain
types of events, you can do this as well; you just can't have one
thread juggling more than one queue. Since the event notification is
edge-triggered, I cannot see any reason for a significant performance
degradation resulting from this policy. I am not altogether convinced
that this must be a strict policy, however; the issue of different
userspace threads having different event queues inside the same task is
interesting.

* No re-arming events. They must be manually killed.

* I'm sure everyone would agree that passing an opaque "user context"
pointer is necessary with each event.

* Zero-copy event delivery (of course).
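
Several of the points above (edge-triggered delivery, no re-arming, an opaque user context per event) can be tried out with epoll as it exists in later Linux kernels. This is only a sketch under that assumption; it is not the unified interface argued for here, and it covers file descriptors only. struct conn, watch_edge(), poll_once() and unwatch() are hypothetical helper names; epoll_ctl(), epoll_wait(), EPOLLET and the data.ptr field are the real calls and fields.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor state, carried as the opaque user context. */
struct conn {
    int fd;
    int edges_seen;   /* how many readiness edges were delivered */
};

/* Register c->fd for edge-triggered readability on the queue epfd.
 * The registration persists until removed; nothing is re-armed.
 * Returns 0 on success, -1 on error (errno set by epoll_ctl). */
static int watch_edge(int epfd, struct conn *c)
{
    assert(c != NULL);
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLET;
    ev.data.ptr = c;              /* handed back verbatim with each event */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

/* Collect pending events without blocking and note them in the
 * context each one carries. Returns the event count, or -1 on error. */
static int poll_once(int epfd)
{
    struct epoll_event evs[8];
    int n = epoll_wait(epfd, evs, 8, 0);
    for (int i = 0; i < n; i++) {
        struct conn *c = evs[i].data.ptr;
        c->edges_seen++;
    }
    return n;
}

/* Interest must be killed explicitly. */
static int unwatch(int epfd, struct conn *c)
{
    return epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
}
```

A typical use is to create the queue with epoll_create1(0), call watch_edge() once per descriptor, and then loop on poll_once(). Because delivery is edge-triggered, a real handler should read the descriptor until it would block; poll_once() skips that step to stay short.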

Some question marks:
- Should the kernel attempt to prune the queue of "cancelled" events
(hints later deemed irrelevant, untrue, or obsolete by newer events)?
- Is one-queue-per-task really the way to go? This stops many bad
practices but could prevent some decent ones (see user threads comment).
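
On the pruning question, one possible approach (a userspace sketch only, with every name hypothetical) is to stamp each queued event with a generation number taken from its source and to bump that number when the source is cancelled. Stale events can then be skipped lazily at dequeue time or compacted out in one pass. Whether a kernel should do either is exactly the open question; this only shows that the bookkeeping can be cheap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SOURCES 16

/* Hypothetical event source: the generation changes on every cancel. */
struct source {
    unsigned generation;
};

/* Hypothetical queued event: remembers the generation it was queued under. */
struct queued_event {
    int      source_id;
    unsigned generation;
};

static struct source sources[MAX_SOURCES];

/* Build an event stamped with the source's current generation. */
static struct queued_event make_event(int id)
{
    assert(id >= 0 && id < MAX_SOURCES);
    struct queued_event ev = { id, sources[id].generation };
    return ev;
}

/* Cancelling a source invalidates everything already queued for it. */
static void cancel_source(int id)
{
    assert(id >= 0 && id < MAX_SOURCES);
    sources[id].generation++;
}

/* Lazy check, usable at dequeue time. */
static bool event_is_stale(const struct queued_event *ev)
{
    return ev->generation != sources[ev->source_id].generation;
}

/* Eager pruning: compact the queue in place, dropping stale events
 * while preserving the order of the rest. Returns the new length. */
static size_t prune_stale(struct queued_event *q, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (!event_is_stale(&q[i]))
            q[out++] = q[i];
    }
    return out;
}
```

The lazy check costs one comparison per dequeued event, while the eager pass costs a walk over the queue. Which is preferable depends on how often cancellations happen relative to deliveries.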

Matthew D. Hall
