Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Jamie Lokier (lk@tantalophile.demon.co.uk)
Thu, 31 Oct 2002 00:52:59 +0000


Davide Libenzi wrote:
> The cost of the test will be basically the cost of a ->poll(), that is
> exactly the same cost of the very first read()/write() that you would do
> by following the current API rule.

No, the cost of ->poll() is somewhat less than read()/write(), because
the latter requires a system call and the former does not. System
calls are still nowhere near as cheap as function calls.

> > I also agree with criticisms that epoll should test and send an event
> > on registration, but only _if_ the test is cheap. Nothing to do with
> > correctness (I like the edge semantics as they are), but because
> > delivering one event is so infinitesimally low impact with epoll that
> > it's preferable to doing a single speculative read/write/whatever.
> >
> > Regarding the effectiveness of the optimisation, I'd guess that quite
> > a lot of incoming connections do not come with initial data in the
> > short scheduling time after a SYN (unless it's on a LAN). I don't
> > know this for sure though.
>
> Ok Jamie, try to explain me which kind of improvement this first drop will
> bring.

I have thought about an optimal server state machine. (I presume from
your carefully thought out implementation that you have too).

In a state machine, each fd has some user-space state. I've already
hinted at how this is used to prevent starvation/livelock on a busy
server, and make service fairer.

I would take that further and _defer_ the epoll_ctl() to register an
fd until the first time I have seen EAGAIN from that fd. This is
because in some cases, epoll_ctl() would not be needed at all - so we
can remove that overhead, and the system call overhead.

Now you would force me to call read() or write() after the
epoll_ctl(), even though I _know_ the result is always going to be
EAGAIN. You're forcing me to make an always redundant system call.
But I can't omit it, because that's a race condition.

So, I've thought about the _optimal_ state machine and it's clear that
epoll should test the condition on fd registration - for best
performance. (Nothing to do with scalability, just raw performance).

> And also, how such first drop would not bring a "confusion" for the
> user, letting him think that he can go sleeping event w/out having first
> received EAGAIN. Isn't it better to say "you wait for events after EAGAIN",
> instead of "you wait for events after EAGAIN but after accept/connect".

Be careful with your rules. epoll should work with blocking fds too,
if you understand the rules well enough, and fd registration doesn't
have to be done at the same time as accept/connect/pipe.

Your current rule in practice is:

an event is generated on every "would-block" -> "ready" transition.
after fd registration, you must treat the fd as "ready".

The proposed rule is this:

an event is generated on every "would-block" -> "ready" transition.
after fd registration, you may treat the fd as in any state you like.

The proposed rule is better because it permits better optimisations in
user space, as explained earlier. (If you _really_ want to avoid the
call to ->poll() when user space doesn't care, make that a flag
argument to epoll_ctl()).

enjoy :)
-- Jamie

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/