If a new API and/or a significant change in semantics is to be applied 
to the kernel for a unified event notification system, this is obviously 
an issue for 2.7 or 2.9.  As such, we have plenty of time to focus upon 
simplicity and correctness rather than plain old inertia.  We need to 
bring a truly unified, and therefore new, event API to the kernel, and 
it must be done right.  kevent attempts to achieve this for FreeBSD, and 
generally speaking, it succeeds.  But linux can do much better.
The API should present the notion of a general edge-triggered event 
(e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so 
simply.  Linus made some suggestions on lkml back in 2000 
(http://marc.theaimsgroup.com/?l=linux-kernel&m=97236943118139&w=2) that 
pretty much hit the nail on the head -- with some exceptions.
*  Unless every conceivable event is to be represented as a file (rather 
unintuitive IMHO), his proposed interface fails to accomodate non-I/O 
events (e.g. timers,  signals, directory updates, etc.).  As much as I 
appreciate the UNIX Way, making everything a file is a massive 
oversimplification.  We can often stretch the definition far enough to 
make this work, but I'd be impressed to see how one intends to call, 
e.g., a timer a type of file.
*  There is a seemingly significant overhead in performing exactly one 
callback per event.  Doesn't this prevent any kind of event coalescence? 
 It seems like we could be doing an awful lot of cache thrashing, among 
other things.  In some cases, this might happen anyway, but why should 
the interface enforce such behavior?  In most other cases, it's 
perfectly acceptable to inline an event handler (either via compile-time 
inlining or literally).  I do realize that Linus only suggested that the 
C library do the mass callbacks, BTW, so strictly speaking, it's the 
userland API that would "enforce such behavior."  Nonetheless, this is 
of concern.
Enough of what we shouldn't do.  Here's what we should:
*  The interface should allow the implementation to be highly extensible 
without horrible code contortions within the kernel.  It is important to 
be able to add new types of events as they become necessary.  The 
interface should be general and simple enough to accomodate these 
extensions.  Linux (really, UNIX) has failed to exercise this foresight 
in the past, and that's why we have the current mess of varied 
polling/triggering methods.
*  I might be getting a bit utopian here, but IMHO the kernel should 
move toward a completely edge-triggered event system.  The old 
level-triggered interfaces should only wrap this paradigm.
*  Might as well reiterate: simplicity.  FreeBSD's kevent solves nearly 
all of the traditional problems, but it is gross.  And complicated. 
 There were clearly too many computer scientists and not enough 
engineers on that team.
*  Only one queue per process or kernel thread.  Multiple queues per 
flow of execution is just ugly and ultimately pointless.  That is not to 
say that you can't multithread, but each thread simply uses the same 
queue.  In cases when you want one thread to only wait on a certain 
type(s) of event, you can do this as well; you just can't have one 
thread juggling more than one queue.  Since the event notification is 
edge-triggered, I cannot see any reason for a significant performance 
degradation resulting from this policy.  I am not altogether convinced 
that this must be a strict policy, however; the issue of different 
userspace threads having different event queues inside the same task is 
interesting.
*  No re-arming events.  They must be manually killed.
*  I'm sure everyone would agree that passing an opaque "user context" 
pointer is necessary with each event.
*  Zero-copy event delivery (of course).
Some question marks:
-  Should the kernel attempt to prune the queue of "cancelled" events 
(hints later deemed irrelevant, untrue, or obsolete by newer events)?
-  Is one-queue-per-task really the way to go?  This stops many bad 
practices but could prevent some decent ones (see user threads comment).
Matthew D. Hall
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/