Re: [patch 3/4] slab reclaim balancing

Andrew Morton (akpm@digeo.com)
Thu, 26 Sep 2002 12:49:14 -0700


Manfred Spraul wrote:
>
> Andrew Morton wrote:
> >
> > (What Ed said - we do hang onto one page. And I _have_ measured
> > cost in kmem_cache_shrink...)
> >
> I totally agree about kmem_cache_shrink - it's total abuse that
> fs/dcache.c calls it regularly. It was intended to be called before
> module unload, or during ifdown, etc.
> On NUMA, it's probably worse, because it does an IPI to all cpus.
> dcache.c should not call kmem_cache_shrink, and kmem_cache_reap should
> be improved.

Sorry, I meant kmem_cache_reap. It was the call into it from
vmscan.c which was the problem. On my 4-way (which is fairly
immune to lock contention for some reason), I was seeing *25%*
contention on the spinlock inside the cache_chain_sem semaphore.
With, of course, tons of context switching going on.

And it was giving a sort of turnstile effect, where each CPU needed
to pass through kmem_cache_reap before entering page reclaim proper.
Which maybe wasn't so bad in the 2.4 setup. But in 2.5 the VM is
basically per-zone, so different CPUs can run reclaim against different
zones with no contention of any form. This is especially important on
NUMA.

But the above could have been solved in different ways, and there are
still global effects from the requirement to prune slab. I expect that
the batching and perhaps a per-slab exclusion mechanism will fix up
any remaining problems.
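
Roughly the kind of exclusion I have in mind, as a userspace sketch
only (pthread trylock stands in for whatever primitive we end up
using, and none of the names below are real slab.c code): a CPU which
finds another CPU already reaping simply skips the reap and goes on
to reclaim, instead of queueing behind it.

#include <pthread.h>
#include <stdio.h>

/* Stand-in for the global lock which every CPU currently queues on. */
static pthread_mutex_t reap_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * "Skip if busy" reaping: a CPU which finds another CPU already
 * walking the cache chain just moves on to page reclaim proper,
 * instead of queueing behind it.
 */
static void try_reap_slabs(int cpu)
{
	if (pthread_mutex_trylock(&reap_lock) != 0) {
		printf("cpu%d: reap already in progress, skipping\n", cpu);
		return;
	}
	printf("cpu%d: reaping slab caches\n", cpu);
	/* ... walk the cache chain, return empty slabs to the buddy ... */
	pthread_mutex_unlock(&reap_lock);
}

int main(void)
{
	try_reap_slabs(0);
	try_reap_slabs(1);	/* would skip if CPU 0 still held the lock */
	return 0;
}

Whether the skip is per-cache or global is an open question; the point
is just that nobody turnstiles.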

> > Before:
> > Elapsed: 20.18s User: 192.914s System: 48.292s CPU: 1195.6%
> >
> > After:
> > Elapsed: 19.798s User: 191.61s System: 43.322s CPU: 1186.4%
> >
> > That's for a kernel compile.
> >
> UP or SMP?
> And was that the complete patch, or just the modification to slab.c?

That is the effect of the per-cpu hot lists within the page allocator
on a 16- or 32-way NUMAQ.

It shows some of the benefit we can get by carefully recycling
known-to-be-hot pages back into places where they will soon be
touched again by the same CPU. Most of the gains were in
do_anonymous_page, copy_foo_user(), zap_pte_range(), etc. Where
you'd expect.

I expect to end up with two forms of page allocation and freeing:
alloc_hot_page/alloc_cold_page and free_hot_page/free_cold_page
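
A minimal sketch of the shape I have in mind, in userspace C (the
struct layout and the signatures are invented for illustration; only
the four function names above are part of the plan): each CPU keeps
two small LIFO lists of free pages, and the hot list hands back the
most recently freed page, the one most likely to still be in that
CPU's cache.

#include <stdio.h>

/* Toy stand-in for struct page: just enough to sit on a free list. */
struct page {
	struct page *next;
};

/*
 * Per-CPU free-page lists.  LIFO, so the most recently freed page --
 * the one most likely to still be in this CPU's cache -- is handed
 * out first.
 */
struct percpu_pages {
	struct page *hot;	/* pages believed hot in this CPU's cache */
	struct page *cold;	/* pages with no useful cache footprint */
};

static struct page *pop(struct page **list)
{
	struct page *p = *list;

	if (p)
		*list = p->next;
	return p;
}

static void push(struct page **list, struct page *p)
{
	p->next = *list;
	*list = p;
}

/* Hot allocation: the caller will touch the page immediately. */
struct page *alloc_hot_page(struct percpu_pages *pcp)
{
	struct page *p = pop(&pcp->hot);

	return p ? p : pop(&pcp->cold);
}

/* Cold allocation: leave the hot pages for callers who will touch them. */
struct page *alloc_cold_page(struct percpu_pages *pcp)
{
	struct page *p = pop(&pcp->cold);

	return p ? p : pop(&pcp->hot);
}

void free_hot_page(struct percpu_pages *pcp, struct page *p)
{
	push(&pcp->hot, p);
}

void free_cold_page(struct percpu_pages *pcp, struct page *p)
{
	push(&pcp->cold, p);
}

int main(void)
{
	struct percpu_pages pcp = { 0, 0 };
	struct page pg[3];
	int i;

	for (i = 0; i < 3; i++)
		free_hot_page(&pcp, &pg[i]);

	/* The page freed last comes back first. */
	printf("got %p, expected %p\n",
	       (void *)alloc_hot_page(&pcp), (void *)&pg[2]);
	return 0;
}

The cold path exists so that allocations which won't be touched soon
(things like readahead) don't steal pages some other caller could
have used while they were still warm.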

> I've made a microbenchmark of kmem_cache_alloc/free of 4 kb objects, on
> UP, AMD Duron:
>            1 object      4 objects
> cur        145 cycles    662 cycles
> patched    133 cycles    2733 cycles
>
> Summary:
> * for one object, the patch is a slight performance improvement. The
> reason is that the fallback from partial to free list in
> kmem_cache_alloc_one is avoided.
> * the overhead of kmem_cache_grow/shrink is around 500 cycles, nearly a
> factor-of-4 slowdown. The cache had no constructor/destructor.
> * everything was in cache-hot state. [100 runs in a loop, loop overhead
> subtracted. 98 or 99 runs completed in the given time, except for
> patched-4obj, where 24 runs completed in 2735 cycles, 72 in 2733 cycles]

Was the microbenchmark actually touching the memory which it was
allocating from slab? If so then yes, we'd expect to see cache
misses against those cold pages coming out of the buddy.
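
Something like the following is what I mean by touching the memory (a
userspace sketch: malloc/free stand in for kmem_cache_alloc/free, the
object size and count mirror the test above, and timing uses x86
rdtsc; it is not Manfred's actual benchmark): write a byte into every
cacheline of each object after allocating it, so the cost of cold
pages coming out of the buddy shows up in the cycle count instead of
being deferred to whoever touches the object later.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define OBJ_SIZE	4096	/* same 4 kb objects as the test above */
#define NR_OBJS		4
#define CACHELINE	64

/* Read the CPU cycle counter (x86 rdtsc). */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	void *obj[NR_OBJS];
	uint64_t t0, t1;
	int i, j;

	t0 = rdtsc();
	for (i = 0; i < NR_OBJS; i++) {
		obj[i] = malloc(OBJ_SIZE);	/* stand-in for kmem_cache_alloc() */
		/* Touch every cacheline, so cold pages cost what they cost. */
		for (j = 0; j < OBJ_SIZE; j += CACHELINE)
			((volatile char *)obj[i])[j] = 1;
	}
	for (i = 0; i < NR_OBJS; i++)
		free(obj[i]);			/* stand-in for kmem_cache_free() */
	t1 = rdtsc();

	printf("%d objects: %llu cycles (alloc + touch + free)\n",
	       NR_OBJS, (unsigned long long)(t1 - t0));
	return 0;
}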

> For SMP and slabs that are per-cpu cached, the change could be right,
> because the arrays should absorb bursts. But I do not think that the
> change is the right approach for UP.

I'd suggest that we wait until we have slab freeing its pages into
the hotlists, and allocating from them. That should pull things back.
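
Roughly the interaction I mean, again as an illustrative sketch only
(the helpers are stubs for the hypothetical hot-list interfaces
sketched earlier, not real slab.c code): slab growth pulls its backing
pages from the per-CPU hot list and slab shrinking pushes them back
there, so a page which has just been emptied of objects stays warm for
the next grower on that CPU.

#include <stdio.h>
#include <stdlib.h>

/* Stubs standing in for the hypothetical hot-list interfaces. */
static void *alloc_hot_page(void)	{ return malloc(4096); }
static void free_hot_page(void *p)	{ free(p); }

/* A heavily simplified slab: one backing page, an object count. */
struct slab {
	void *page;
	int inuse;
};

/* Grow: take a (hopefully still cache-warm) page from the hot list. */
static int cache_grow(struct slab *s)
{
	s->page = alloc_hot_page();
	s->inuse = 0;
	return s->page != NULL;
}

/*
 * Shrink: hand the page back to the hot list rather than the buddy,
 * so the next grower on this CPU gets it back while it is still warm.
 */
static void cache_shrink(struct slab *s)
{
	if (s->page && s->inuse == 0) {
		free_hot_page(s->page);
		s->page = NULL;
	}
}

int main(void)
{
	struct slab s = { NULL, 0 };

	if (cache_grow(&s))
		printf("grew slab with backing page %p\n", s.page);
	cache_shrink(&s);
	return 0;
}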