Re: [Lse-tech] NUMA scheduling

Mike Kravetz (kravetz@us.ibm.com)
Tue, 26 Feb 2002 11:03:48 -0800

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Andrew Morton: "Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI"
Previous message: David Rees: "Re: IDE error on 2.4.17"

Erich,

I'm glad to see you are also exploring NUMA scheduling. The
more the merrier.

On Tue, Feb 26, 2002 at 11:33:14AM +0100, Erich Focht wrote:
>
> Well, maybe my description was a little bit misleading. My approach is not
> balancing much more aggressively, the difference is actually minimal,
> e.g. for 1ms ticks:
>
> Mike's approach:
> - idle CPU : load_balance() every 1ms (only within local node)
> balance_cpu_sets() every 2ms (balance across nodes)
> - busy CPU : load_balance() every 250ms
> balance_cpu_sets() every 500ms
> - schedule() : load_balance() if idle (only within local node)
>
> Erich's approach:
> - idle CPU : load_balance() every 1ms (first try balancing the local
> node, if already balanced (no CPU exceeds the
> current load by >25%) try to find a remote
> node with larger load than on the current one
> (>25% again)).
> - busy CPU : load_balance() every 250ms (same comment as above)
> - schedule() : load_balance() if idle (same comment as above).
>
> So the functional difference is not really that big here, I am also trying
> to balance locally first. If that fails (no imbalance), I try
> globally. The factor of 2 in the times is not so relevant, I think, and
> also I don't consider my approach significantly more aggressive.

My factor of 'two' is really a 'distance' factor. My thoughts were
along the lines that node rebalancing would occur at different rates
based on the distance between nodes. On the Sequent NUMA-Q machines
with really high remote memory latencies, you might want to be less
agressive than on machines with lower latencies. I was playing with
the idea that you would discover distances during topology discovery,
and the rate of rebalancing would somehow correspond to these distances.

> More significant is the difference in the data used for the balance
> decision:
>
> Mike: calculate load of a particular cpu set in the corresponding
> load_balance() call.
> Advantage: cheap (if spinlocks don't hurt there)
> Disadvantage: for busy CPUs it can be really old (250ms)
>
> Erich: calculate load when needed, at the load_balance() call, but not
> more than needed (normally only local node data, global data if needed,
> all lockless).
> Advantage: fresh, lockless
> Disadvantage: sometimes slower (when balancing across nodes)
>
> As Mike has mainly the cache affinity in mind, it doesn't really matter
> where a task is scheduled as long as it stays there long enough and the
> nodes are well balanced. A wrong scheduling decision (based on old
> data) will be fixed sooner or later (after x*250ms or so).

Agreed. I also played with the idea of keeping a load average over
time, which seems to be something we may want. However, I couldn't
think of an efficient way to accomplish this, and my first attempts
showed little promise. Perhaps, I will investigate this more.

-- 
Mike
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Next message: Andrew Morton: "Re: [CRASH] gdth / __block_prepare_write: zeroing uptodate buffer! / NMI"
Previous message: David Rees: "Re: IDE error on 2.4.17"