Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

Erich Focht (efocht@ess.nec.de)
Tue, 24 Sep 2002 23:04:44 +0200


On Monday 23 September 2002 20:47, Martin J. Bligh wrote:
> > I have two problems with this approach:
> > 1: Freeing memory is quite expensive, as it currently involves finding
> > the maximum of the array node_mem[].
>
> Bleh ... why? This needs to be calculated much more lazily than this,
> or you're going to kick the hell out of any cache affinity. Can you
> recalc this in the rebalance code or something instead?

You're right, that would be too slow. I'm thinking of marking the tasks
that need recalculation and updating their homenode when their runqueue
is scanned for a task to be stolen.
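
A minimal user-space sketch of that lazy scheme, not the patch itself. The
struct layout, the homenode_stale flag and all function names are
assumptions made up for illustration; only node_mem[] and homenode come
from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NUMNODES 4   /* assumption: small fixed node count */

/* Hypothetical per-task state, not the real task_struct fields. */
struct task {
    unsigned long node_mem[MAX_NUMNODES]; /* pages held per node */
    int homenode;                         /* node with most pages at last recalc */
    int homenode_stale;                   /* set on free, cleared on recalc */
};

/* Fast path on page free: adjust the counter and mark the task only. */
static void task_free_page(struct task *t, int node)
{
    assert(node >= 0 && node < MAX_NUMNODES);
    if (t->node_mem[node] > 0)
        t->node_mem[node]--;
    t->homenode_stale = 1;
}

/* Slow path: scan node_mem[] for its maximum, only for marked tasks. */
static void task_recalc_homenode(struct task *t)
{
    int best = 0;

    for (int n = 1; n < MAX_NUMNODES; n++)
        if (t->node_mem[n] > t->node_mem[best])
            best = n;
    t->homenode = best;
    t->homenode_stale = 0;
}

/*
 * Stand-in for the runqueue scan done when looking for a task to steal.
 * Returns how many tasks had their homenode recalculated.
 */
static size_t scan_runqueue(struct task **rq, size_t nr)
{
    size_t recalculated = 0;

    for (size_t i = 0; i < nr; i++) {
        if (rq[i]->homenode_stale) {
            task_recalc_homenode(rq[i]);
            recalculated++;
        }
    }
    return recalculated;
}
```

The free path stays O(1), and the maximum search is paid at most once per
marked task per scan, however many pages were freed in between.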

> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
> > like them to run on different nodes (that's why node_mem is not in mm),
> > but they could (legally) free pages which they did not allocate and
> > have wrong values in node_mem[].
>
> Yes, that really ought to be per-process, not per task. Which means
> locking or atomics ... and overhead. Ick.

Hmm, I think it is sometimes ok to have it per task. Take, for example,
OpenMP parallel jobs working on huge arrays. The "first-touch" of these
arrays leads to pagefaults generated by the different tasks and thus to
a different node_mem[] array for each task. As long as they just allocate
memory, all is well. The same holds if they release it only at the end of
the job. This probably goes wrong if we have a long-running task that
spawns short-lived clones. They inherit node_mem from the parent, but
pages they add to the common mm are not reflected in the parent's
node_mem after their death.
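
The clone case can be illustrated with a small user-space model. This is
only a sketch: struct mm, struct task and every function name here are
hypothetical stand-ins, and the per-node page count kept in the mm exists
only so the drift can be measured.

```c
#include <string.h>

#define MAX_NUMNODES 4   /* assumption: small fixed node count */

/* Hypothetical shared address space: true per-node page counts. */
struct mm {
    unsigned long pages[MAX_NUMNODES];
};

/* Hypothetical task with its private node_mem[] bookkeeping. */
struct task {
    struct mm *mm;
    unsigned long node_mem[MAX_NUMNODES];
};

static void task_init(struct task *t, struct mm *mm)
{
    memset(t, 0, sizeof(*t));
    t->mm = mm;
}

/* A clone shares the mm and inherits a copy of the parent's node_mem[]. */
static void task_clone(struct task *child, const struct task *parent)
{
    *child = *parent;
}

/* First touch: the page lands in the shared mm, the count in this task. */
static void task_fault_page(struct task *t, int node)
{
    t->mm->pages[node]++;
    t->node_mem[node]++;
}

/* Pages the mm holds on a node that this task's node_mem[] does not see. */
static long node_mem_drift(const struct task *t, int node)
{
    return (long)t->mm->pages[node] - (long)t->node_mem[node];
}
```

If the parent faults in 10 pages on node 0, then clones a child that
faults in 5 pages on node 1 and exits, the parent's node_mem[1] is still 0
while the shared mm holds 5 pages there.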

> For the first cut of the NUMA sched, maybe you could just leave page
> allocation alone, and do that separately? or is that what the second
> patch was meant to be?

The first patch needs a correction: in load_balance(), add
    if (!busiest) goto out;
after the call to find_busiest_queue(). With that, the first patch works
on its own. On top of this pooling NUMA scheduler we can put whichever
node affinity approach fits best, with or without memory allocation.
I'll update the patches and their setup code (thanks for the comments!)
and resend them.
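
For context, a simplified user-space sketch of where that check sits. The
struct runqueue below, the function signatures and the "move half the
imbalance" step are assumptions for illustration and do not match the
kernel code; only the names load_balance, find_busiest_queue, busiest and
the "out" label come from the fix above.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced runqueue: just a load counter. */
struct runqueue {
    int nr_running;
};

/*
 * Returns the queue with the highest load that is strictly more loaded
 * than this_rq, or NULL when no such queue exists.
 */
static struct runqueue *find_busiest_queue(struct runqueue *this_rq,
                                           struct runqueue *rqs, size_t nr)
{
    struct runqueue *busiest = NULL;
    int max_load = this_rq->nr_running;

    for (size_t i = 0; i < nr; i++) {
        if (&rqs[i] != this_rq && rqs[i].nr_running > max_load) {
            max_load = rqs[i].nr_running;
            busiest = &rqs[i];
        }
    }
    return busiest;
}

/* Returns the number of tasks moved to this_rq. */
static int load_balance(struct runqueue *this_rq,
                        struct runqueue *rqs, size_t nr)
{
    int moved = 0;
    struct runqueue *busiest = find_busiest_queue(this_rq, rqs, nr);

    if (!busiest)   /* nothing to steal: must not dereference NULL below */
        goto out;

    /* Stand-in for the real task stealing: move half the imbalance. */
    moved = (busiest->nr_running - this_rq->nr_running) / 2;
    busiest->nr_running -= moved;
    this_rq->nr_running += moved;
out:
    return moved;
}
```

Without the check, a call on an already balanced system would dereference
the NULL pointer returned by find_busiest_queue().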

Regards,
Erich
