Re: [announce] [patch] limiting IRQ load, irq-rewrite-2.4.11-B5

jamal (hadi@cyberus.ca)
Tue, 2 Oct 2001 08:10:00 -0400 (EDT)

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Stelian Pop: "[PATCH 2.4.10-ac3] move DMI scan before other subsystem init (especially PnP)"
Previous message: Roland Kuhn: "Re: aic7xxx error"

On Tue, 2 Oct 2001, Benjamin LaHaise wrote:

> On Mon, Oct 01, 2001 at 09:54:49PM -0400, jamal wrote:
>
> > And how does /proc/irq/NR/max_rate solve this?
> > I have a feeling you are trying to say that varying /proc/irq/NR/max_rate
> > gives opportunity for user processes to execute;
> > note, although that is bad logic, you could also modify the high and low
> > watermarks for when we have congestion in the backlog queue
> > (This is already doable via /proc)
>
> The high and low watermarks are only sufficient if the task the machine is
> performing is limited to bh mode operations. What I mean is that user space
> can be starved by the cyclic nature of the network queues: they will
> eventually be emptied, at which time more interrupts will be permitted.
>

Which hardware flow control has been doing since 2.1 days;

> > It is unfair to add any latency to a device that didnt cause or
> > contributre to the havoc.
>
> I disagree. When a machine is overloaded, everything gets slower. But a
> side effect of delaying interrupts is that more work gets done for each
> irq handler that is run and efficiency goes up. The hard part is balancing
> the two in an attempt to achieve a steady rate of progress.
>

Let me see if i understand this:
-scheme 1: shut down only actions (within a device) that are contributing
to the overload and only they get affected because they are misbehaving;
when things get better(and we know when they get better), we turn it on
again
- scheme two: shut down the IRQ which might infact include other devices
for a jiffy or two (which doesnt mean the condition got better)

Are you saying that you disagree scheme 1 is better?

> > I think you missed my point. i am saying there is more than one source of
> > interupt for that same IRQ number that you are indiscrimately shutting
> > down in a network device.
>
> You're missing the effect that irq throttling has: it results in a system
> that is effectively running in "polled" mode. Information does get
> processed, and thruput remains high, it is just that some additional
> latency is found in operations. Which is acceptable by definition as
> the system is under extreme load.

sure. Just like the giant bottom half lock is acceptable when you can do
fine grained locking ;->
Dont preach polling to me; i am already a convert and you attended the
presentation i gave. Weve had patches for months which have been running
on live system. We were just waiting for 2.5 ...

>
> > So, assuming that tx complete interupts do actually shut you down
> > (although i doubt that very much given the classical Donald Becker tx
> > descriptor prunning) pick another interupt source; lets say MII link
> > status; why do you want to kill that when it is not causing any noise but
> > is a source of good asynchronous information (that could be used for
> > example in HA systems)?
>
> That information will eventually be picked up. I doubt the extra latency
> will be of significant note. If it is, you've got realtime concerns,
> which is not our goal to address at this time.
>

You are still missing the point (by humping on the literal meaning of the
example i provide), the point is: fine grained vs shutting down the whole
IRQ.

>
> > and what is this "known safe limit"? ;->
>
> It's system dependant. It's load dependant. For a short list of the number
> of factors that you have to include to compute this:
>
> - number of cycles userspace needs to run
> - number of cache misses that userspace is forced to
> incur due to irq handlers running
> - amount of time to dedicate to the irq handler
> - variance due to error path handling
> - increased system cpu usage due to higher memory load
> - front side bus speed of cpu
> - speed of cpu
> - length of cpu pipelines
> - time spent waiting on io cycles
> .....
>
> It is non-trivial to determine a limit. And trying to tune a system
> automatically is just as hard: which factor do you choose for the system
> to attempt to tune itself with? How does that choice affect users who
> want to tune for other loads? What if latency is more important than
> dropping data?
>
> There are a lot of choices as to how we handle these situations. They
> all involve tradeoffs of one kind or another. Personally, I have a
> preference towards irq rate limiting as I have measured the tradeoff
> between latency and thruput, and by putting that control in the hands of
> the admin, the choice that is best for the real load of the system is
> not made at compile time.
>
> If you look at what other operating systems do to schedule interrupts
> as tasks and then looks at the actual cost, is it really something we
> want to do? Linux has made a point of keeping things as simple as
> possible, and it has brought us great wins because we do not have the
> overhead that other, more complicated systems have chosen. It might
> be a loss in a specific case to rate limit interrupts, but if that is
> so, just change the rate. What can you say about the dynamic self
> tuning techniques that didn't take into account that particular type
> of load? Recompiling is not always an option.
>

I am not sure where you are getting the opinion that there is recompiling
involved or how what we have is complex (the patch is much smaller than
what Ingo posted);
And no, you dont need to maintain any state of all those things in your
list; in 2.4, which is a good start, you have the system load being probed
via a second order effect i.e the growth rate of the backlog queue is a
good indicator that the system is not pulling packets off fast enough.
This is a very good measure of _all_ those items on your list. I am not
saying it is God's answer, merely pointing that it is a good indicator
which doesnt need to maintain state of 1000 items or cause additional
computations on the datapath.
We get a fairly early warning that we are about to be overloaded. We can then
shut off the offending device's _receive_ interupt source when it doesnt
heed the congestion notification advice weve been giving it. It heeds the
advice by mitigating.
In the 2.5 patch (i should say it is a clean patch to 2.4 actually and is
backward compatible) we worked around the fact that the 2.4 solution
requires a specific NIC feature (mitigation) among a lot of other things.
Infact we have already proven that mitigation is only good when you have
one or two NICs on the system.

> > What we are providing is actually a scheme to exactly measure that "known
> > safe limit" you are refering to without depending on someone having to
> > tell you "here's a good number for that 8 way xeon"
> > If there is system capacity available why the fsck is it not being used?
>
> That's a choice for the admin to make. Sometimes having reserves that aren't
> used is a safety net that people are willing to pay for. ext2 has by
> default a reserve that isn't normally used. Do people complain? No. It
> buys several useful features (resistance against fragmentation, space for
> daemon temporary files on disk full, ...) that pay dividends of the cost.
>

I am not sure whether you are trolling or not. We are talking about a
system conserving a work principle and you compare it a reservation
system.

> Is irq throttling the be all and end all? No. Can other techniques work
> better? Yes. Always? No. And nothing prevents us from using this and
> other techniques together. Please don't dismiss it solely because you
> see cases that it doesn't handle.
>

I am not dismising the whole patch. I most definetly dismiss those
two ideas i pointed out.

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Stelian Pop: "[PATCH 2.4.10-ac3] move DMI scan before other subsystem init (especially PnP)"
Previous message: Roland Kuhn: "Re: aic7xxx error"