2.5.72-wli-1

William Lee Irwin III (wli@holomorphy.com)
Fri, 20 Jun 2003 11:39:39 -0700


Available from:
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/kernels/2.5.72/linux-2.5.72-wli-1.bz2

I've decided to outright slurp up various people's code that has uses
in various places for this release, as opposed to pounding out original
material. This went smoothly apart from a minor bug in one of them that
slipped through someone's audit.

This release should feature truly stupendous i386 PAE resource
scalability with respect to task counts. I did a bit of benchmarking
to find some things to do, and observed very little lowmem pressure
with elevated process counts while collecting profiles etc. The
benchmark loads tested were not feasible on mainline.

I'd be much obliged if someone either less terrified of lawyers or
with a benchmark that cares and whose results are easily publishable
could run this through the mill.

Quantitatively, stacks and pagetables on PAE eat a grand total of 4KB
of lowmem per process with all this applied. Not per thread. Per
process. (That's for an unthreaded process; tack threads onto a process
and it's 4KB/thread on top of that, since I've not put stacks into
highmem yet.)
Resource scalability gains are also reaped from dmc+mbligh's objrmap.

Mainline eats 20KB per process and 8KB per additional thread worth of
lowmem for stacks and pagetables. So modulo mm_structs, vma's, filp's,
and other miscellanea, this quintuples process capacity. Plus whatever
gains come from objrmap.
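
Spelling the arithmetic out (the exact breakdown below is a reading of
the numbers above, so treat it as approximate): mainline's fixed lowmem
cost for a PAE process is an 8KB kernel stack plus roughly 12KB of
pgd/pmd pagetable pages, i.e. about 20KB, and every extra thread adds
another 8KB stack. With 4KB stacks + irqstacks and highpmd applied, the
pmds move to highmem and the stack halves, leaving just the 4KB stack
in lowmem:

    mainline:  8KB stack + ~12KB pgd/pmd pages = 20KB/process, 8KB/thread
    -wli:      4KB stack + pmds in highmem     =  4KB/process, 4KB/thread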

Changes since 2.5.71-bk2-wli-1:

+ pgd_ctor fix
Pointer arithmetic goes wrong unless page_address()'s result is
cast to pgd_t *. So cast it and fix the AGP bad pmd bugs. (A toy
illustration follows this changelog.)

+ remap_page_range() vs. highpmd fix
fix an off-by-one in what remap_page_range() was pmd_unmap()'ing

+ mremap() vs. highpmd fix
simplify logic and fix brokenness

+ inline vm_account()
Benchmarks said this was better to inline, despite looking large.

+ inline pte_chain_alloc()
This didn't require any substantial layering violation, and
benchmarks said this was good to inline.

+ partially re-inline i386 kmap*() functions
Between an incidental highpmd bit that called kmap_atomic() for
all PTE access and the lowmem_page_address() microoptimization,
it turned out to be better to inline the bits that check for
lowmem, falling back to the out-of-line highmem helpers as
needed. (Sketch after this changelog.)

+ O(1) task_mem()
It was trivial to extend VM accounting to take care of all the
stats task_mem() wanted. Also rip out the ->mmap_sem
acquisitions, since they produce no more reliable statistics
than going without, and measurable efficiency improvements can
be gained by sampling the statistics racily (this includes
pushing the taking of mm->mmap_sem down into task_vsize(), if
necessary). We're just fishing integers out of the mm_struct
here, and no longer touching vma's. (Sketch after this
changelog.)

+ NR_CPUS-adaptive mapping->page_lock
This wants to be a spinlock on smaller systems and an rwlock
on larger systems. #ifdef on NR_CPUS and wrap the accesses so
the lock flavor adapts to NR_CPUS. (Sketch after this
changelog.)

+ RCU vfsmount
Originally by Maneesh Soni and Dipankar Sarma. Minor /proc/
bugfix brewed up simultaneously by everyone, including myself.
This is actually a series of 2 patches.

+ irqstacks, 4KB stacks, and mcount-based stack overflow checking
Originally by Ben LaHaise and Dave Hansen. Slightly debugged.
This is actually a series of 4 patches.

+ objrmap
Originally by Dave McCracken and Martin Bligh. Adapted to
highpmd by yours truly.

+ jack up batchcount
O(1) buffered_rmqueue() won't burn cpu doing larger batches.
So let it.
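
Re the pgd_ctor fix above, a toy userspace illustration of why the cast
matters (this is only an illustration, not code from the patch):
page_address() hands back a void *, and gcc's void * arithmetic steps
in bytes, so an offset meant to count pgd slots lands at the wrong
address unless the pointer is cast to pgd_t * (8 bytes per entry under
PAE) first.

    #include <stdio.h>

    typedef struct { unsigned long long pgd; } pgd_t; /* 8 bytes, as under PAE */
    #define USER_PTRS_PER_PGD 3                       /* illustrative value */

    int main(void)
    {
            pgd_t pgdir[4];
            void *base = pgdir;     /* stand-in for page_address(page) */

            /* gcc treats void * arithmetic as byte arithmetic: +3 bytes, wrong */
            printf("no cast:   offset %td bytes\n",
                   (char *)(base + USER_PTRS_PER_PGD) - (char *)base);
            /* pgd_t * arithmetic: +3 entries = 24 bytes, right */
            printf("with cast: offset %td bytes\n",
                   (char *)((pgd_t *)base + USER_PTRS_PER_PGD) - (char *)base);
            return 0;
    }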
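
Re the partially re-inlined kmap*() entry, roughly the shape of the
fast path (a sketch only; __kmap_atomic() here is a stand-in name for
whatever out-of-line highmem helper the real code falls back to, and
the preempt bookkeeping is elided):

    static inline void *kmap_atomic(struct page *page, enum km_type type)
    {
            if (!PageHighMem(page))
                    return lowmem_page_address(page);  /* inlined lowmem check */
            return __kmap_atomic(page, type);          /* highmem slow path */
    }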
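
Re the O(1) task_mem() entry, the point is that once the counters are
maintained at pte-update time, reporting is just reading integers out
of the mm_struct. Something along these lines (a sketch, trimmed to two
stats; the real thing reports the full set):

    /* no vma walk, no mmap_sem -- the stats are sampled racily */
    char *task_mem(struct mm_struct *mm, char *buf)
    {
            return buf + sprintf(buf,
                    "VmSize:\t%8lu kB\n"
                    "VmRSS:\t%8lu kB\n",
                    mm->total_vm << (PAGE_SHIFT - 10),
                    mm->rss << (PAGE_SHIFT - 10));
    }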
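
Re the NR_CPUS-adaptive mapping->page_lock entry, the mechanism is just
compile-time selection hidden behind wrappers; schematically (the
threshold and wrapper names here are illustrative, not the patch's):

    #if NR_CPUS > 4                         /* illustrative threshold */
    typedef rwlock_t mapping_lock_t;
    #define mapping_read_lock(m)    read_lock(&(m)->page_lock)
    #define mapping_read_unlock(m)  read_unlock(&(m)->page_lock)
    #define mapping_write_lock(m)   write_lock(&(m)->page_lock)
    #define mapping_write_unlock(m) write_unlock(&(m)->page_lock)
    #else                                   /* small SMP/UP: plain spinlock */
    typedef spinlock_t mapping_lock_t;
    #define mapping_read_lock(m)    spin_lock(&(m)->page_lock)
    #define mapping_read_unlock(m)  spin_unlock(&(m)->page_lock)
    #define mapping_write_lock(m)   spin_lock(&(m)->page_lock)
    #define mapping_write_unlock(m) spin_unlock(&(m)->page_lock)
    #endif

struct address_space's page_lock is then declared as mapping_lock_t and
callers only ever use the wrappers, so readers run in parallel only
where the rwlock pays for itself.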

All 25 patches:

O(1) rmqueue_bulk()
Implement deferred coalescing with lists-of-lists-structured
order-0 deferred queues so buffered_rmqueue() is O(1) expected
time. (Rough sketch at the end of this list.)

lowmem_page_address() microoptimization
Use page_to_pfn() to inherit its arch-specific microoptimizations.
(Sketch at the end of this list.)

highpmd
Shove pmd's into highmem, by brute force. (Access-pattern sketch
at the end of this list.)

Trivial /proc/ BKL removals
Kill off some blatantly unnecessary BKL grabbing in /proc/.

i386 pagetable cache
Resurrect the i386 pagetable cache, but safely this time.

pgd_ctor
Use slab ctors for i386 pgd's, and be safe with AGP and highpmd.

O(1) proc_pid_readdir()
Originally due to Manfred Spraul; figures out its position from
a small pid hashtable rearrangement.

O(1) proc_pid_statm()
Originally due to Ben LaHaise; keeps count of the various
proc_pid_statm() counters whenever twiddling ptes.

pgd_ctor fix
Pointer arithmetic goes wrong unless page_address()'s result is
cast to pgd_t *.

remap_page_range() vs. highpmd fix
make remap_page_range() pmd_unmap() the right thing

mremap() vs. highpmd fix
simplify logic and fix brokenness

inline vm_account()
This turned out to be better to inline, despite looking largeish.

inline pte_chain_alloc()
This didn't require any substantial layering violation, and sped
things up slightly.

partially re-inline i386 kmap*() functions
Between an incidental highpmd bit that called kmap_atomic() for
all PTE access and the lowmem_page_address() microoptimization,
it turned out to be better to inline the bits that check for
lowmem.

O(1) task_mem()
It was trivial to extend VM accounting to take care of all the
stats task_mem() wanted. Also rip out some of the ->mmap_sem
acquisitions, since they produce no more reliable statistics
than going without, and measurable efficiency improvements can
be gained by sampling the statistics racily (this includes
pushing the taking of mm->mmap_sem down into task_vsize(), if
necessary).

NR_CPUS-adaptive mapping->page_lock
This wants to be a spinlock on smaller systems and an rwlock
on larger systems. #ifdef on NR_CPUS and wrap the accesses so
the lock flavor adapts to NR_CPUS.

RCU vfsmount
Originally by Maneesh Soni and Dipankar Sarma. Minor /proc/
bugfix brewed up simultaneously by everyone, including myself.
This is actually a series of 2.

irqstacks, 4KB stacks, and mcount-based stack overflow checking
Originally by Ben LaHaise with ongoing maintenance and
contributions by Dave Hansen. Slightly debugged.
This is actually a series of 4.

objrmap
Originally by Dave McCracken and Martin Bligh. Adapted to
highpmd by yours truly.

jack up batchcount
O(1) buffered_rmqueue() won't burn cpu doing larger batches.
So let it.
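
On the O(1) rmqueue_bulk() entry, the "lists of lists" phrase amounts
to keeping order-0 pages pre-grouped into batches whose headers are
themselves chained, so handing a batch to buffered_rmqueue() is a
single splice rather than a walk over 'count' pages. A rough structural
sketch (the struct and its fields are a guess at the shape, not the
patch's actual types):

    struct page_batch {
            struct list_head list;   /* chains batches on the deferred queue */
            struct list_head pages;  /* the order-0 pages in this batch */
            int count;               /* how many pages 'pages' holds */
    };

    /* grabbing a batch is then O(1): list_del() the first batch header
     * and list_splice() its pages onto the caller's list. */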
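
On the lowmem_page_address() microoptimization, the whole change is
essentially a one-liner of this shape (paraphrased, so check it against
the patch itself):

    /* express the address computation via page_to_pfn() so any
     * arch-specific cleverness in page_to_pfn() is picked up for free */
    #define lowmem_page_address(page) __va(page_to_pfn(page) << PAGE_SHIFT)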
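
On highpmd, the practical consequence for callers (and the source of
the two pmd_unmap() fixes above) is that pmds are no longer permanently
mapped: every pmd access becomes a map/unmap pair, much as highpte did
for ptes. Schematically, with pmd_offset_map() standing in for whatever
the patch actually names its mapping helper:

    pmd_t *pmd = pmd_offset_map(pgd, addr); /* kmap_atomic() under the covers */
    /* ... read or populate pmd entries for this address ... */
    pmd_unmap(pmd);                         /* must pair with the map above */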

-- wli