Re: Scalability problem (kmap_lock) with -aa kernels

Andrea Arcangeli (andrea@suse.de)
Wed, 20 Mar 2002 19:26:25 +0100


On Wed, Mar 20, 2002 at 02:41:14PM -0300, Rik van Riel wrote:
> Going for a smaller pool just doesn't make sense if you want
> the mappings to be cached, it could even result in more processes
> _sleeping_ on kmap entries to be freed.

256 simultaneous kmaps isn't that small; you need 256 tasks page faulting
simultaneously before any of them sleeps. The bigger problem, I think, is
the frequency of the IPI and the reduced caching, not the "_sleeping_".
In fact it isn't much slower either. I had a feeling that 256 entries was
about the lowest number that still keeps persistent kmaps useful, even if
I knew it was a bit underpowered, but I needed to check that it wasn't
the O(N) pass that was hurting.
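
(For reference, the sleeping path is the allocation loop that runs under
kmap_lock; roughly the below, a trimmed sketch of map_new_virtual() in
2.4 mm/highmem.c, so don't hold me to the exact details:)

        count = LAST_PKMAP;
        for (;;) {
                last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
                if (!last_pkmap_nr) {
                        flush_all_zero_pkmaps();        /* the O(N) pass */
                        count = LAST_PKMAP;
                }
                if (!pkmap_count[last_pkmap_nr])
                        break;                  /* found an unpinned entry */
                if (--count)
                        continue;

                /* every entry is pinned: sleep until a kunmap_high()
                 * releases one and wakes us up, then rescan */
                {
                        DECLARE_WAITQUEUE(wait, current);

                        current->state = TASK_UNINTERRUPTIBLE;
                        add_wait_queue(&pkmap_map_wait, &wait);
                        spin_unlock(&kmap_lock);
                        schedule();
                        remove_wait_queue(&pkmap_map_wait, &wait);
                        spin_lock(&kmap_lock);
                        count = LAST_PKMAP;
                }
        }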

What worried me, seeing such long kmap_high times, is that with lots of
concurrent tasks faulting, most of the entries were pinned most of the
time, so the pkmap pass had to run much more frequently and far less
usefully than on a system where the kmap frequency is high but the
simultaneous kmaps are rare.
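
(The pass I mean is the flush_all_zero_pkmaps() scan; again only a sketch
of the 2.4 code from memory, not a verbatim quote:)

static void flush_all_zero_pkmaps(void)
{
        int i;

        for (i = 0; i < LAST_PKMAP; i++) {
                struct page *page;
                pte_t pte;

                /* count > 1 means the entry is still pinned by a user
                 * and can't be recycled; with lots of concurrent
                 * faulters most entries are in this state, so the pass
                 * frees almost nothing and just has to run more often */
                if (pkmap_count[i] != 1)
                        continue;
                pkmap_count[i] = 0;
                pte = ptep_get_and_clear(pkmap_page_table + i);
                page = pte_page(pte);
                page->virtual = NULL;
        }
        /* one global flush for the whole pass: the IPI on SMP */
        flush_tlb_all();
}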

Of course 2048 entries is better from a caching and TLB-flushing
perspective (even if the O(N) pass takes more time); not incidentally, I
was using 1024 instead of 128 :). (Note that mainline only has 512 with
PAE; this is another reason why I asked to try with 256.)
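
(To put numbers on the virtual address space involved, with 4k pages;
this is 2.4 include/asm-i386/highmem.h from memory:)

#ifdef CONFIG_X86_PAE
#define LAST_PKMAP 512          /* 512 * 4k = 2M pkmap window */
#else
#define LAST_PKMAP 1024         /* 1024 * 4k = 4M pkmap window */
#endif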

> Please make a choice, do you want kmaps to be cached or would
> you be content to have just cpu-local or maybe process-local
> kmaps and get rid of the global kmap pool ?

One design idea I had to avoid the locking of the persistent kmaps around
the copy-user is to rewrite the persistent code completely with a pool of
atomic kmaps per CPU: pin the task to the current CPU before doing the
copy-user, count the number of reentrant persistent kmaps happening on
the current CPU, and if that count overflows, block waiting for somebody
else to kunmap and wake us up. Pros: it would avoid all the locks
(complete scalability, everything per-CPU), and it would avoid the IPI
and the global flush. Cons: it will possibly waste some more address
space, since not all NR_CPUS are going to be used on all machines; it
will not be able to do any caching, so page->virtual could then be
dropped entirely on x86; and it will reduce the ability of the scheduler
to reschedule onto idle CPUs during copy-users around persistent kmaps.
And of course those kmaps should be dropped ASAP, they are meant only for
things like copy-users, or we risk pinning stuff on CPUs; it should
always be the scheduler, not the kmap, that chooses which CPU the task
runs on after a wakeup.
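
Very roughly it would look like the below; every name in the sketch
(kmap_cpu, kunmap_cpu, KMAP_CPU_SLOTS, PKMAP_CPU_ADDR) is made up for
illustration, nothing like this exists in the tree, and the waitqueue
init, the pte setup and the saving of the old cpus_allowed mask are all
omitted:

#define KMAP_CPU_SLOTS  32              /* per-cpu pool size, say */

struct kmap_cpu_pool {
        int nr_in_use;                  /* reentrant kmaps on this cpu */
        wait_queue_head_t wait;         /* sleep here on overflow */
} ____cacheline_aligned;

static struct kmap_cpu_pool kmap_cpu_pool[NR_CPUS];

void *kmap_cpu(struct page *page)
{
        struct kmap_cpu_pool *pool;
        int cpu;

        /* pin the task to the current cpu: the mapping is only ever
         * visible here, so no kmap_lock, no IPI, no global flush */
        current->cpus_allowed = 1UL << smp_processor_id();
        cpu = smp_processor_id();
        pool = &kmap_cpu_pool[cpu];

        while (pool->nr_in_use == KMAP_CPU_SLOTS)
                /* overflow: block until a kunmap_cpu() on this cpu */
                sleep_on(&pool->wait);

        /* the pte would be installed in slot nr_in_use of this cpu's
         * window with set_pte plus a local invlpg, like kmap_atomic() */
        return (void *) PKMAP_CPU_ADDR(cpu, pool->nr_in_use++);
}

void kunmap_cpu(void *vaddr)
{
        struct kmap_cpu_pool *pool = &kmap_cpu_pool[smp_processor_id()];

        pool->nr_in_use--;
        wake_up(&pool->wait);
        /* give the cpu back to the scheduler ASAP (restoring the
         * original affinity mask, not shown here) */
        current->cpus_allowed = ~0UL;
}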

And again, such reasoning and the above idea are oriented towards
very-high-end scalability; a 1G laptop would prefer the current
persistent kmaps, possibly also running invlpg during the O(N) pass
instead of the global TLB flush (we can't do that on SMP because it would
take too long to run such a pass on all CPUs, so we have to IPI a global
flush instead). In turn, I'm not going to make any change like this in
2.4; it simply isn't important enough for 90% of the userbase. I will
just limit myself to avoiding persistent kmaps in pte-highmem, which is
doable without any pain.
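
(On UP that would mean, inside the O(N) pass, something along the lines
of this instead of the single flush_tlb_all() at the end; __flush_tlb_one()
is the invlpg wrapper on i386:)

                pkmap_count[i] = 0;
                pte = ptep_get_and_clear(pkmap_page_table + i);
                pte_page(pte)->virtual = NULL;
                /* flush only the entry we are tearing down */
                __flush_tlb_one(PKMAP_ADDR(i));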

btw, this thread has been very useful for learning about some of the
timings and the problems on the 16-way NUMA-Q.

comments?

Andrea