Re: latest linus-2.5 BK broken

Ingo Molnar (mingo@elte.hu)
Wed, 19 Jun 2002 00:03:12 +0200 (CEST)


On Wed, 19 Jun 2002, Rusty Russell wrote:

> This is the NUMA "I want to be in this group" problem. If you're
> serious about this, you'll go for a sys_sched_groupaffinity call, or add
> an extra arg to sys_sched_setaffinity, or simply use the top 16 bits of
> the cpu arg.

the reason why i picked a linear cpu bitmask for the first patches to do
affinity syscalls (which ultimately found their way into 2.5) was very
simple: we do *NOT* want to deal with cache hierarchies in the kernel, at
this point.

enumerating CPUs and giving processes the ability to bind themselves to an
arbitrary set of CPUs is enough. *IF* user-space wants to do more then
they can get and use whatever NUMA information they want. There could even
be separate sets of syscalls perhaps to get the exact CPU cache hierarchy
of the system, although that would have to be done really well to be truly
generic and long-living.

so in this case the simplest approach that scales well to a reasonable
number of CPUs (thousands, at least) won.

> You will also add /proc/cpugroups or something to export this
> information to users so there's a point.

and this might not even be enough. Cache hierarchies can be pretty
non-trivial, and it's not necesserily a distinct group of CPUs, it could
be a hierarchy of multiple levels, or it could even be an assymetric
distribution of caches. In fact it might not be even expressable in
'group' categories - caches could be interconnected in a 2D or even 3D
topology. Or multiprocessing CPUs could have dynamic caches in the future
- 'cache on demand' allocated to a cache-happy CPU, while another CPU with
a smaller working set will use less cache space. [obviously the technology
is not available today.]

one thing i was *very* sure about, we frankly dont have the slightest clue
about how the really big systems will look like in 10 or 20 years. So
hardcoding anything like 'group affinity' or some of today's NUMA
hierarchies would be pretty shortsighted. I'm convinced that the 'opaque'
solution, the simple but generic setaffinity system call is the right
choice.

Ingo

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/