Re: [Lse-tech] NUMA scheduling

Davide Libenzi (davidel@xmailserver.org)
Mon, 25 Feb 2002 11:28:59 -0800 (PST)


On Mon, 25 Feb 2002, Larry McVoy wrote:

> On Mon, Feb 25, 2002 at 10:55:03AM -0800, Martin J. Bligh wrote:
> > > - The load_balancing() concept is different:
> > > - there are no special time intervals for balancing across pool
> > > boundaries, the need for this can occur very quickly and I
> > > have the feeling that 2*250ms is a long time for keeping the
> > > nodes unbalanced. This means: each time load_balance() is called
> > > it _can_ balance across pool boundaries (but doesn't have to).
> >
> > Imagine for a moment that there's a short spike in workload on one node.
> > By agressively balancing across nodes, won't you incur a high cost
> > in terms of migrating all the cache data to the remote node (destroying
> > the cache on both the remote and local node), when it would be cheaper
> > to wait for a few more ms, and run on the local node?
>
> Great question! The answer is that you are absolutely right. SGI tried
> a pile of things in this area, both on NUMA and on traditional SMPs (the
> NUMA stuff was more page migration and the SMP stuff was more process
> migration, but the problems are the same, you screw up the cache). They
> never got the page migration to give them better performance while I was
> there and I doubt they have today. And the process "migration" from CPU
> to CPU didn't work either, people tended to lock processes to processors
> for exactly the reason you alluded to.
>
> If you read the early hardware papers on SMP, they all claim "Symmetric
> Multi Processor", i.e., you can run any process on any CPU. Skip forward
> 3 years, now read the cache affinity papers from the same hardware people.
> You have to step back and squint but what you'll see is that these papers
> could be summarized on one sentence:
>
> "Oops, we lied, it's not really symmetric at all"
>
> You should treat each CPU as a mini system and think of a process reschedule
> someplace else as a checkpoint/restart and assume that is heavy weight. In
> fact, I'd love to see the scheduler code forcibly sleep the process for
> 500 milliseconds each time it lands on a different CPU. Tune the system
> to work well with that, then take out the sleep, and you'll have the right
> answer.

I made this test on 8 way NUMA machines ( thx to OSDLAB ). When a CPUs
went idle i let it sample the status/load of the system with 100HZ
frequency and i had a variable trigger time derivate that fired a task
steal if a certain load was observed on the same CPU for a time > K ms
I tested it with kernel builds and surprisingly enough having the idle to
observe a load for more than 60ms was a performance loss on these
machines. The moral of the story is:

"Cache trashing wheighs but wasted CPU time too ..."

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/