Re: [patch] Simple Topology API

Linus Torvalds (torvalds@transmeta.com)
Sun, 14 Jul 2002 12:17:25 -0700 (PDT)


[ I've been off-line for a week, so I didn't follow all of the discussion,
but here goes anyway ]

On 13 Jul 2002, Andi Kleen wrote:
>
> Current x86-64 NUMA essentially has no 'nodes', just each CPU has
> local memory that is slightly faster than remote memory. This means
> the node number would be always identical to the CPU number. As long
> as the API provides it, it's ok for me. Just the node concept will not be
> very useful on that platform. memblk will also be identity mapped to
> node/cpu.

The whole "node" concept sounds broken. There is no such thing as a node,
since even within nodes latencies will easily differ for different CPU's
if you have local memories for CPU's within a node (which is clearly the
only sane thing to do).

If you want to model memory behaviour, you should have memory descriptors
(in linux parlance, "zone_t") have an array of latencies to each CPU. That
latency is _not_ a "is this memory local to this CPU" kind of number, that
simply doesn't make any sense. The fact is, what matters is the number of
hops. Maybe you want to allow one hop, but not five.
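
To make that concrete, something like this (the names here are made up,
this is not the real zone_t, just the shape of the idea):

#define NR_CPUS 8                       /* made-up machine size */

struct zone_lat {
        unsigned long free_pages;       /* the usual allocator state ... */
        unsigned int latency[NR_CPUS];  /* relative access cost from each CPU */
};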

Then, make the memory binding interface a function of just what kind of
latency you allow from a set X of CPU's. Simple, straightforward, and it
has a direct meaning in real life, which makes it unambiguous.

So your "memory affinity" system call really needs just one number: the
acceptable latency. You may also want to have a CPU-set argument, although
I suspect that it's equally correct to just assume that the CPU-set is the
set of CPU's that the process can already run on.
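
So the whole interface could be as small as this (hypothetical name,
nothing that actually exists):

/*
 * Hypothetical interface, not an existing system call: bind the
 * calling process to memory no slower than max_latency, as seen
 * from the CPUs it is already allowed to run on.
 */
long mem_bind(unsigned int max_latency);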

After that, creating a new zone array is nothing more than the following
(sketched in code after the list):

- give each zone a "latency value", which is simply the minimum of all
  the latencies for that zone from CPU's that are in the CPU set.

- sort the zone array, lowest latency first.

- the passed-in latency is the cut-off-point - clear the end of the
  array (with the sanity check that you always accept one zone, even if
  it happens to have a latency higher than the one passed in).
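
Those three steps are trivial enough to sketch in plain C. Everything
here is made up for illustration (the struct is the same shape as the
earlier sketch, and the CPU set is a simple bitmask); the real kernel
would obviously do this against zone_t:

#include <stdlib.h>

#define NR_CPUS 8                       /* made-up machine size */

struct zone_lat {
        unsigned long free_pages;       /* usual allocator state ... */
        unsigned int latency[NR_CPUS];  /* access cost from each CPU */
};

struct zl_entry {
        const struct zone_lat *zone;
        unsigned int key;               /* latency for this binding */
};

/* Latency of a zone for a CPU set: the minimum over the CPUs in it. */
static unsigned int zone_latency(const struct zone_lat *z, unsigned long cpus)
{
        unsigned int best = ~0u;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if ((cpus & (1UL << cpu)) && z->latency[cpu] < best)
                        best = z->latency[cpu];
        return best;
}

static int cmp_entry(const void *a, const void *b)
{
        const struct zl_entry *x = a, *y = b;

        return (int)x->key - (int)y->key;
}

/*
 * Build the per-process zone list: compute each zone's latency for
 * the CPU set, sort lowest-latency first, and cut off at max_latency,
 * always keeping at least the best zone.  Returns the list length.
 */
static int build_zonelist(const struct zone_lat *zones, int nzones,
                          unsigned long cpus, unsigned int max_latency,
                          struct zl_entry *out)
{
        int i, n;

        for (i = 0; i < nzones; i++) {
                out[i].zone = &zones[i];
                out[i].key = zone_latency(&zones[i], cpus);
        }
        qsort(out, nzones, sizeof(*out), cmp_entry);

        for (n = 1; n < nzones; n++)
                if (out[n].key > max_latency)
                        break;
        return n;
}

Everything past the returned length is too slow for that binding and
simply never gets looked at.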

End result: you end up with a priority-sorted array of acceptable zones.
In other words, a zone list. Which is _exactly_ what you want anyway
(that's what the current "zone_table" is).

And then you associate that zone-list with the process, and use that
zone-list for all process allocations.
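
Allocation then just walks the list in order, best zone first (again a
sketch: alloc_from() is a stand-in for the real per-zone allocator):

#include <stddef.h>

struct zone;                                    /* opaque here */
void *alloc_from(struct zone *z, size_t size);  /* stand-in allocator */

/* Try the process' zones best-first, falling back down the list. */
void *zonelist_alloc(struct zone **zl, int n, size_t size)
{
        int i;

        for (i = 0; i < n; i++) {
                void *p = alloc_from(zl[i], size);
                if (p)
                        return p;
        }
        return NULL;                            /* all acceptable zones full */
}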

Advantages:

- very direct mapping to what the hardware actually does

- no complex data structures for topology

- works for all topologies; the process doesn't even have to know, since
  you can trivially encode it all internally in the kernel by just having
  the CPU latency map for each memory zone we know about.

Disadvantages:

- you cannot create "crazy" memory bindings. You can only say "I don't
  want to allocate from slow memory". You _can_ do crazy things by
  initially using a different CPU binding, then doing the memory
  binding, and then re-doing the CPU binding. So if you _want_ bad memory
  bindings you can create them, but you have to work at it.

- we have to use some standard latency measure, either purely time-based
  (which changes from machine to machine), or based on some notion of
  "relative to local memory".

My personal suggestion would be the "relative to local memory" thing, and
call that 10 units. So a cross-CPU (but same module) hop might imply a
latency of 15, while a memory access that goes over the backbone between
modules might be a 35. And one that takes two hops might be 55.

So then, for each CPU in a machine, you can _trivially_ create the mapping
from each memory zone to that CPU. And that's all you really care about.
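
On a made-up machine with two two-CPU modules and one memory zone local
to each CPU, the whole topology then collapses into a table like this:

/*
 * Hypothetical 4-CPU box: CPUs 0,1 on module A, CPUs 2,3 on module B,
 * one zone local to each CPU.  Units as above: local 10, other CPU on
 * the same module 15, across the backbone 35 (no two-hop paths here).
 */
static const unsigned int zone_cpu_latency[4][4] = {
        /* cpu0  cpu1  cpu2  cpu3 */
        {   10,   15,   35,   35 },     /* zone 0, local to cpu0 */
        {   15,   10,   35,   35 },     /* zone 1, local to cpu1 */
        {   35,   35,   10,   15 },     /* zone 2, local to cpu2 */
        {   35,   35,   15,   10 },     /* zone 3, local to cpu3 */
};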

No?

Linus
