[PATCH] rmap15a incremental diff against 2.4.20-ac1

Javier Marcet (jmarcet@pobox.com)
Mon, 2 Dec 2002 04:24:48 +0100


I sent this patch a few hours ago but never saw it show up on the list...
In any case, there was a mistake in the diff I sent, so here is a new
version.
I've merged the incremental rmap diffs (rmap14c-rmap15 and
rmap15-rmap15a) with 2.4.20-ac1. The merge was clean except for three
spots, namely:

...from the original rmap...
diff -Nru a/mm/shmem.c b/mm/shmem.c
--- a/mm/shmem.c	Mon Nov 18 10:28:28 2002
+++ b/mm/shmem.c	Mon Nov 18 10:28:28 2002
@@ -557,7 +557,7 @@
 		unsigned long flags;
 
 		/* Look it up and read it in.. */
-		page = find_get_page(&swapper_space, entry->val);
+		page = find_pagecache_page(&swapper_space, entry->val);
 		if (!page) {
 			swp_entry_t swap = *entry;
 			spin_unlock (&info->lock);

...to how I left it...
diff -purN linux-2.4.20-ac1/mm/shmem.c linux-2.4.20-ac1-rmap15a/mm/shmem.c
--- linux-2.4.20-ac1/mm/shmem.c	2002-12-01 11:01:04.000000000 +0100
+++ linux-2.4.20-ac1-rmap15a/mm/shmem.c	2002-12-01 10:43:15.000000000 +0100
@@ -593,7 +593,7 @@ repeat:
 		unsigned long flags;
 
 		/* Look it up and read it in.. */
-		page = lookup_swap_cache(*entry);
+		page = find_pagecache_page(&swapper_space, entry->val);
 		if (!page) {
 			swp_entry_t swap = *entry;
 			spin_unlock (&info->lock);

I didn't know which version to keep, rmap's or the original -ac one.
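For what it's worth, the attached mm/swap_state.c hunk rewires
lookup_swap_cache() itself to go through the new helper, so either
resolution should end up doing the same swapper_space hash lookup. Here
is a minimal sketch of that relationship, paraphrased from the hunks in
the attachment (it leaves out the swap-cache statistics the real
function updates, so it is illustrative only):

/* Illustrative sketch, paraphrased from the rmap15a hunks in the
 * attached diff; the real lookup_swap_cache() also bumps the
 * swap-cache statistics counters. */

/* pagemap.h: find_get_page() becomes an alias of the new helper */
#define find_pagecache_page(mapping, index) \
	__find_pagecache_page(mapping, index, page_hash(mapping, index))
#define find_get_page(mapping, index) \
	__find_pagecache_page(mapping, index, page_hash(mapping, index))

/* swap_state.c: the swap cache is swapper_space keyed by entry.val */
struct page *lookup_swap_cache(swp_entry_t entry)
{
	return find_pagecache_page(&swapper_space, entry.val);
}

So keeping find_pagecache_page(&swapper_space, entry->val), as the
merged hunk above does, should be equivalent to -ac's
lookup_swap_cache(*entry) apart from those statistics.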

Next, this enum differs between -ac and stock, so I wasn't entirely
comfortable merging it...
diff -Nru a/include/linux/brlock.h b/include/linux/brlock.h
--- a/include/linux/brlock.h	Mon Nov 18 10:28:28 2002
+++ b/include/linux/brlock.h	Mon Nov 18 10:28:28 2002
@@ -34,6 +34,7 @@
 enum brlock_indices {
 	BR_GLOBALIRQ_LOCK,
 	BR_NETPROTO_LOCK,
+	BR_LRU_LOCK,
 
 	__BR_END
 };

diff -purN linux-2.4.20-ac1/include/linux/brlock.h linux-2.4.20-ac1-rmap15a/include/linux/brlock.h
--- linux-2.4.20-ac1/include/linux/brlock.h	2002-12-01 11:01:04.000000000 +0100
+++ linux-2.4.20-ac1-rmap15a/include/linux/brlock.h	2002-12-01 10:43:15.000000000 +0100
@@ -37,6 +37,7 @@ enum brlock_indices {
 	BR_GLOBALIRQ_LOCK,
 	BR_NETPROTO_LOCK,
 	BR_LLC_LOCK,
+	BR_LRU_LOCK,
 	__BR_END
 };
 
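The merged version above simply appends BR_LRU_LOCK after -ac's
BR_LLC_LOCK; as far as I can tell the exact position in the enum
shouldn't matter as long as the whole tree is rebuilt, since the index
is only used by the new lru lock helpers. For reference, this is how
the attached mm_inline.h hunk uses it (copied here with a couple of
comments added):

/* The lru lock in rmap15a is a big-reader lock (indexed by
 * BR_LRU_LOCK) combined with a per-zone spinlock.  Passing a zone
 * takes the lock for that zone only; passing ALL_ZONES (which appears
 * to be NULL) takes the br lock exclusively, locking out every zone. */
static inline void lru_lock(struct zone_struct *zone)
{
	if (zone) {
		br_read_lock(BR_LRU_LOCK);
		spin_lock(&zone->lru_lock);
	} else {
		br_write_lock(BR_LRU_LOCK);
	}
}

static inline void lru_unlock(struct zone_struct *zone)
{
	if (zone) {
		spin_unlock(&zone->lru_lock);
		br_read_unlock(BR_LRU_LOCK);
	} else {
		br_write_unlock(BR_LRU_LOCK);
	}
}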

And last, in this one the wmb() call is not present in -ac; everything
else was the same.
diff -Nru a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c	Mon Nov 18 10:28:28 2002
+++ b/mm/vmscan.c	Mon Nov 18 10:28:28 2002
@@ -846,7 +961,7 @@
 	/* OK, the VM is very loaded. Sleep instead of using all CPU. */
 	kswapd_overloaded = 1;
 	set_current_state(TASK_UNINTERRUPTIBLE);
-	schedule_timeout(HZ / 4);
+	schedule_timeout(HZ / 40);
 	kswapd_overloaded = 0;
 	wmb();
 	return;
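For context: the wmb() comes from the rmap side (the rmap changelog
credits it to Arjan van de Ven, "add wmb() to wakeup_memwaiters"). As I
understand it, kswapd_overloaded is polled from other CPUs to decide
whether to wake kswapd or to back off, so the barrier only makes the
cleared flag visible promptly, and keeping it should be harmless on -ac
too. A rough sketch of the idea, with a purely hypothetical reader side
for illustration:

/* Sketch only: kswapd_overloaded is a plain int read from other CPUs.
 * The writer side mirrors the merged hunk above; the reader below is
 * hypothetical and only illustrates what the wmb() is ordering. */
int kswapd_overloaded;

static void kswapd_back_off(void)		/* writer: kswapd side */
{
	kswapd_overloaded = 1;
	set_current_state(TASK_UNINTERRUPTIBLE);
	schedule_timeout(HZ / 40);	/* 25ms with HZ=100 (was HZ/4 = 250ms) */
	kswapd_overloaded = 0;
	wmb();	/* push the cleared flag out before anyone checks it again */
}

static int kswapd_is_overloaded(void)		/* hypothetical reader */
{
	return kswapd_overloaded;
}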

Feel free to try it. I'm running it right now and so far no problems.
VM behaviour has definitely improved, but there are still slight stalls
under heavy disk I/O: roughly every 2-3 seconds the system stops
responding for a few hundredths of a second, as if it had tachycardia.

--
Javier Marcet <jmarcet@pobox.com>

[Attachment: 001_2.4.20-ac1_ac1-rmap15a.diff (text/plain, quoted-printable)]

diff -purN linux-2.4.20-ac1/Changelog.rmap linux-2.4.20-ac1-rmap15a/Changel= og.rmap --- linux-2.4.20-ac1/Changelog.rmap 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/Changelog.rmap 2002-12-01 10:59:22.000000000 += 0100 @@ -0,0 +1,204 @@ +The first maintenance release of the 15th version of the reverse +mapping based VM is now available. +This is an attempt at making a more robust and flexible VM +subsystem, while cleaning up a lot of code at the same time. +The patch is available from: + + http://surriel.com/patches/2.4/2.4.19-rmap15a +and http://linuxvm.bkbits.net/ + + +My big TODO items for a next release are: + - backport speedups from 2.5 + - pte-highmem + +rmap 15a: + - more agressive freeing for higher order allocations (me) + - export __find_pagecache_page, find_get_page define (me, Cristoph, A= rjan) + - make memory statistics SMP safe again (me) + - make page aging slow down again when needed (Andrew Morton) + - first stab at fine-tuning arjan's O(1) VM (me) + - split active list in cache / working set (me) + - fix SMP locking in arjan's O(1) VM (me) +rmap 15: + - small code cleanups and spelling fixes for O(1) VM (me) + - O(1) page launder, O(1) page aging (Arjan van de Ve= n) + - resync code with -ac (12 small patches) (me) +rmap 14c: + - fold page_over_rsslimit() into page_referenced() (me) + - 2.5 backport: get pte_chains from the slab cache (William Lee Irw= in) + - remove dead code from page_launder_zone() (me) + - make OOM detection a bit more agressive (me) +rmap 14b: + - don't unmap pages not in pagecache (ext3 & reiser) (Andrew Morton, = me) + - clean up mark_page_accessed a bit (me) + - Alpha NUMA fix for Ingo's per-cpu pages (Fl=E1vio Leitne= r, me) + - remove explicit low latency schedule zap_page_range (Robert Love) + - fix OOM stuff for good, hopefully (me) +rmap 14a: + - Ingo Molnar's per-cpu pages (SMP speedup) (Christoph Hellw= ig) + - fix SMP bug in page_launder_zone (rmap14 only) (Arjan van de Ve= n)=20 + - semicolon day, fix typo in rmap.c w/ DEBUG_RMAP (Craig Kulesa) + - remove unneeded pte_chain_unlock/lock pair vmscan.c (Craig Kulesa) + - low latency zap_page_range also without preempt (Arjan van de Ve= n) + - do some throughput tuning for kswapd/page_launder (me) + - don't allocate swap space for pages we're not writing (me) +rmap 14: + - get rid of stalls during swapping, hopefully (me) + - low latency zap_page_range (Robert Love) +rmap 13c: + - add wmb() to wakeup_memwaiters (Arjan van de Ve= n) + - remap_pmd_range now calls pte_alloc with full address (Paul Mackerras) + - #ifdef out pte_chain_lock/unlock on UP machines (Andrew Morton) + - un-BUG() truncate_complete_page, the race is expected (Andrew Morton, = me) + - remove NUMA changes from rmap13a (Christoph Hellw= ig) +rmap 13b: + - prevent PF_MEMALLOC recursion for higher order allocs (Arjan van de Ve= n, me) + - fix small SMP race, PG_lru (Hugh Dickins) +rmap 13a: + - NUMA changes for page_address (Samuel Ortiz) + - replace vm.freepages with simpler kswapd_minfree (Christoph Hellw= ig) +rmap 13: + - rename touch_page to mark_page_accessed and uninline (Christoph Hellw= ig) + - NUMA bugfix for __alloc_pages (William Irwin) + - kill __find_page (Christoph Hellw= ig) + - make pte_chain_freelist per zone (William Irwin) + - protect pte_chains by per-page lock bit (William Irwin) + - minor code cleanups (me) +rmap 12i: + - slab cleanup (Christoph Hellw= ig) + - remove references to compiler.h from mm/* (me) + - move rmap to marcelo's bk tree (me) + - minor cleanups (me) +rmap 12h: + 
- hopefully fix OOM detection algorithm (me) + - drop pte quicklist in anticipation of pte-highmem (me) + - replace andrea's highmem emulation by ingo's one (me) + - improve rss limit checking (Nick Piggin) +rmap 12g: + - port to armv architecture (David Woodhouse) + - NUMA fix to zone_table initialisation (Samuel Ortiz) + - remove init_page_count (David Miller) +rmap 12f: + - for_each_pgdat macro (William Lee Irw= in) + - put back EXPORT(__find_get_page) for modular rd (me) + - make bdflush and kswapd actually start queued disk IO (me) +rmap 12e + - RSS limit fix, the limit can be 0 for some reason (me) + - clean up for_each_zone define to not need pgdata_t (William Lee Irw= in) + - fix i810_dma bug introduced with page->wait removal (William Lee Irw= in) +rmap 12d: + - fix compiler warning in rmap.c (Roger Larsson) + - read latency improvement (read-latency2) (Andrew Morton) +rmap 12c: + - fix small balancing bug in page_launder_zone (Nick Piggin) + - wakeup_kswapd / wakeup_memwaiters code fix (Arjan van de Ve= n) + - improve RSS limit enforcement (me) +rmap 12b: + - highmem emulation (for debugging purposes) (Andrea Arcangel= i) + - ulimit RSS enforcement when memory gets tight (me) + - sparc64 page->virtual quickfix (Greg Procunier) +rmap 12a: + - fix the compile warning in buffer.c (me) + - fix divide-by-zero on highmem initialisation DOH! (me) + - remove the pgd quicklist (suspicious ...) (DaveM, me) +rmap 12: + - keep some extra free memory on large machines (Arjan van de Ve= n, me) + - higher-order allocation bugfix (Adrian Drzewiec= ki) + - nr_free_buffer_pages() returns inactive + free mem (me) + - pages from unused objects directly to inactive_clean (me) + - use fast pte quicklists on non-pae machines (Andrea Arcangel= i) + - remove sleep_on from wakeup_kswapd (Arjan van de Ve= n) + - page waitqueue cleanup (Christoph Hellw= ig) +rmap 11c: + - oom_kill race locking fix (Andres Salomon) + - elevator improvement (Andrew Morton) + - dirty buffer writeout speedup (hopefully ;)) (me) + - small documentation updates (me) + - page_launder() never does synchronous IO, kswapd + and the processes calling it sleep on higher level (me) + - deadlock fix in touch_page() (me) +rmap 11b: + - added low latency reschedule points in vmscan.c (me) + - make i810_dma.c include mm_inline.h too (William Lee Irw= in) + - wake up kswapd sleeper tasks on OOM kill so the + killed task can continue on its way out (me) + - tune page allocation sleep point a little (me) +rmap 11a: + - don't let refill_inactive() progress count for OOM (me) + - after an OOM kill, wait 5 seconds for the next kill (me) + - agpgart_be fix for hashed waitqueues (William Lee Irw= in) +rmap 11: + - fix stupid logic inversion bug in wakeup_kswapd() (Andrew Morton) + - fix it again in the morning (me) + - add #ifdef BROKEN_PPC_PTE_ALLOC_ONE to rmap.h, it + seems PPC calls pte_alloc() before mem_map[] init (me) + - disable the debugging code in rmap.c ... 
the code + is working and people are running benchmarks (me) + - let the slab cache shrink functions return a value + to help prevent early OOM killing (Ed Tomlinson) + - also, don't call the OOM code if we have enough + free pages (me) + - move the call to lru_cache_del into __free_pages_ok (Ben LaHaise) + - replace the per-page waitqueue with a hashed + waitqueue, reduces size of struct page from 64 + bytes to 52 bytes (48 bytes on non-highmem machines) (William Lee Irw= in) +rmap 10: + - fix the livelock for real (yeah right), turned out + to be a stupid bug in page_launder_zone() (me) + - to make sure the VM subsystem doesn't monopolise + the CPU, let kswapd and some apps sleep a bit under + heavy stress situations (me) + - let __GFP_HIGH allocations dig a little bit deeper + into the free page pool, the SCSI layer seems fragile (me) +rmap 9: + - improve comments all over the place (Michael Cohen) + - don't panic if page_remove_rmap() cannot find the + rmap in question, it's possible that the memory was + PG_reserved and belonging to a driver, but the driver + exited and cleared the PG_reserved bit (me) + - fix the VM livelock by replacing > by >=3D in a few + critical places in the pageout code (me) + - treat the reclaiming of an inactive_clean page like + allocating a new page, calling try_to_free_pages() + and/or fixup_freespace() if required (me) + - when low on memory, don't make things worse by + doing swapin_readahead (me) +rmap 8: + - add ANY_ZONE to the balancing functions to improve + kswapd's balancing a bit (me) + - regularize some of the maximum loop bounds in + vmscan.c for cosmetic purposes (William Lee Irw= in) + - move page_address() to architecture-independent + code, now the removal of page->virtual is portable (William Lee Irw= in) + - speed up free_area_init_core() by doing a single + pass over the pages and not using atomic ops (William Lee Irw= in) + - documented the buddy allocator in page_alloc.c (William Lee Irw= in) +rmap 7: + - clean up and document vmscan.c (me) + - reduce size of page struct, part one (William Lee Irw= in) + - add rmap.h for other archs (untested, not for ARM) (me) +rmap 6: + - make the active and inactive_dirty list per zone, + this is finally possible because we can free pages + based on their physical address (William Lee Irw= in) + - cleaned up William's code a bit (me) + - turn some defines into inlines and move those to + mm_inline.h (the includes are a mess ...) (me) + - improve the VM balancing a bit (me) + - add back inactive_target to /proc/meminfo (me) +rmap 5: + - fixed recursive buglet, introduced by directly + editing the patch for making rmap 4 ;))) (me) +rmap 4: + - look at the referenced bits in page tables (me) +rmap 3: + - forgot one FASTCALL definition (me) +rmap 2: + - teach try_to_unmap_one() about mremap() (me) + - don't assign swap space to pages with buffers (me) + - make the rmap.c functions FASTCALL / inline (me) +rmap 1: + - fix the swap leak in rmap 0 (Dave McCracken) +rmap 0: + - port of reverse mapping VM to 2.4.16 (me) diff -purN linux-2.4.20-ac1/fs/buffer.c linux-2.4.20-ac1-rmap15a/fs/buffer.c --- linux-2.4.20-ac1/fs/buffer.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/fs/buffer.c 2002-12-01 10:43:15.000000000 +0100 @@ -2926,6 +2926,30 @@ int bdflush(void *startup) } } =20 + +/* + * Do some IO post-processing here!!! 
+ */ +void do_io_postprocessing(void) +{ + int i; + struct buffer_head *bh, *next; + + spin_lock(&lru_list_lock); + bh =3D lru_list[BUF_LOCKED]; + if (bh) { + for (i =3D nr_buffers_type[BUF_LOCKED]; i-- > 0; bh =3D next) { + next =3D bh->b_next_free; + + if (!buffer_locked(bh))=20 + __refile_buffer(bh); + else=20 + break; + } + } + spin_unlock(&lru_list_lock); +} + /* * This is the kernel update daemon. It was used to live in userspace * but since it's need to run safely we want it unkillable by mistake. @@ -2977,6 +3001,7 @@ int kupdate(void *startup) #ifdef DEBUG printk(KERN_DEBUG "kupdate() activated...\n"); #endif + do_io_postprocessing(); sync_old_buffers(); run_task_queue(&tq_disk); } diff -purN linux-2.4.20-ac1/fs/proc/proc_misc.c linux-2.4.20-ac1-rmap15a/fs= /proc/proc_misc.c --- linux-2.4.20-ac1/fs/proc/proc_misc.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/fs/proc/proc_misc.c 2002-12-01 10:59:16.000000= 000 +0100 @@ -191,7 +191,10 @@ static int meminfo_read_proc(char *page, "Cached: %8lu kB\n" "SwapCached: %8lu kB\n" "Active: %8u kB\n" + "ActiveAnon: %8u kB\n" + "ActiveCache: %8u kB\n" "Inact_dirty: %8u kB\n" + "Inact_laundry:%8u kB\n" "Inact_clean: %8u kB\n" "Inact_target: %8u kB\n" "HighTotal: %8lu kB\n" @@ -207,9 +210,12 @@ static int meminfo_read_proc(char *page, K(i.bufferram), K(pg_size - swapper_space.nrpages), K(swapper_space.nrpages), - K(nr_active_pages), - K(nr_inactive_dirty_pages), - K(nr_inactive_clean_pages), + K(nr_active_anon_pages()) + K(nr_active_cache_pages()), + K(nr_active_anon_pages()), + K(nr_active_cache_pages()), + K(nr_inactive_dirty_pages()), + K(nr_inactive_laundry_pages()), + K(nr_inactive_clean_pages()), K(inactive_target()), K(i.totalhigh), K(i.freehigh), diff -purN linux-2.4.20-ac1/include/linux/brlock.h linux-2.4.20-ac1-rmap15a= /include/linux/brlock.h --- linux-2.4.20-ac1/include/linux/brlock.h 2002-12-01 11:01:04.000000000 += 0100 +++ linux-2.4.20-ac1-rmap15a/include/linux/brlock.h 2002-12-01 10:43:15.000= 000000 +0100 @@ -37,6 +37,7 @@ enum brlock_indices { BR_GLOBALIRQ_LOCK, BR_NETPROTO_LOCK, BR_LLC_LOCK, + BR_LRU_LOCK, __BR_END }; =20 diff -purN linux-2.4.20-ac1/include/linux/mm.h linux-2.4.20-ac1-rmap15a/inc= lude/linux/mm.h --- linux-2.4.20-ac1/include/linux/mm.h 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/include/linux/mm.h 2002-12-01 10:59:16.0000000= 00 +0100 @@ -1,5 +1,23 @@ #ifndef _LINUX_MM_H #define _LINUX_MM_H +/* + * Copyright (c) 2002. All rights reserved. + * + * This software may be freely redistributed under the terms of the + * GNU General Public License. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Authors:=20 + * Linus Torvalds + * Stephen Tweedie + * Andrea Arcangeli + * Rik van Riel + * Arjan van de Ven + * and others + */ =20 #include <linux/sched.h> #include <linux/errno.h> @@ -168,7 +186,7 @@ typedef struct page { unsigned long flags; /* atomic flags, some possibly updated asynchronously */ struct list_head lru; /* Pageout list, eg. active_list; - protected by pagemap_lru_lock !! */ + protected by the lru lock !! */ unsigned char age; /* Page aging counter. */ struct pte_chain * pte_chain; /* Reverse pte mapping pointer. 
* protected by PG_chainlock @@ -279,7 +297,7 @@ typedef struct page { * * Note that the referenced bit, the page->lru list_head and the * active, inactive_dirty and inactive_clean lists are protected by - * the pagemap_lru_lock, and *NOT* by the usual PG_locked bit! + * the lru lock, and *NOT* by the usual PG_locked bit! * * PG_skip is used on sparc/sparc64 architectures to "skip" certain * parts of the address space. @@ -300,18 +318,20 @@ typedef struct page { #define PG_referenced 2 #define PG_uptodate 3 #define PG_dirty 4 -#define PG_inactive_clean 5 -#define PG_active 6 +#define PG_active_anon 5 #define PG_inactive_dirty 7 -#define PG_slab 8 -#define PG_skip 10 -#define PG_highmem 11 -#define PG_checked 12 /* kill me in 2.5.<early>. */ -#define PG_arch_1 13 -#define PG_reserved 14 -#define PG_launder 15 /* written out by VM pressure.. */ -#define PG_chainlock 16 /* lock bit for ->pte_chain */ -#define PG_lru 17 +#define PG_inactive_laundry 8 +#define PG_inactive_clean 9 +#define PG_slab 10 +#define PG_skip 11 +#define PG_highmem 12 +#define PG_checked 13 /* kill me in 2.5.<early>. */ +#define PG_arch_1 14 +#define PG_reserved 15 +#define PG_launder 16 /* written out by VM pressure.. */ +#define PG_chainlock 17 /* lock bit for ->pte_chain */ +#define PG_lru 18 +#define PG_active_cache 19 /* Don't you dare to use high bits, they seem to be used for something els= e! */ =20 =20 @@ -429,11 +449,21 @@ extern void FASTCALL(set_page_dirty(stru #define PageClearSlab(page) clear_bit(PG_slab, &(page)->flags) #define PageReserved(page) test_bit(PG_reserved, &(page)->flags) =20 -#define PageActive(page) test_bit(PG_active, &(page)->flags) -#define SetPageActive(page) set_bit(PG_active, &(page)->flags) -#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags) -#define TestandSetPageActive(page) test_and_set_bit(PG_active, &(page)->fl= ags) -#define TestandClearPageActive(page) test_and_clear_bit(PG_active, &(page)= ->flags) +#define PageActiveAnon(page) test_bit(PG_active_anon, &(page)->flags) +#define SetPageActiveAnon(page) set_bit(PG_active_anon, &(page)->flags) +#define ClearPageActiveAnon(page) clear_bit(PG_active_anon, &(page)->flags) +#define TestandSetPageActiveAnon(page) test_and_set_bit(PG_active_anon, &(= page)->flags) +#define TestandClearPageActiveAnon(page) test_and_clear_bit(PG_active_anon= , &(page)->flags) + +#define PageActiveCache(page) test_bit(PG_active_cache, &(page)->flags) +#define SetPageActiveCache(page) set_bit(PG_active_cache, &(page)->flags) +#define ClearPageActiveCache(page) clear_bit(PG_active_cache, &(page)->fla= gs) +#define TestandSetPageActiveCache(page) test_and_set_bit(PG_active_cache, = &(page)->flags) +#define TestandClearPageActiveCache(page) test_and_clear_bit(PG_active_cac= he, &(page)->flags) + +#define PageInactiveLaundry(page) test_bit(PG_inactive_laundry, &(page)->f= lags) +#define SetPageInactiveLaundry(page) set_bit(PG_inactive_laundry, &(page)-= >flags) +#define ClearPageInactiveLaundry(page) clear_bit(PG_inactive_laundry, &(pa= ge)->flags) =20 #define PageInactiveDirty(page) test_bit(PG_inactive_dirty, &(page)->flags) #define SetPageInactiveDirty(page) set_bit(PG_inactive_dirty, &(page)->fla= gs) diff -purN linux-2.4.20-ac1/include/linux/mm_inline.h linux-2.4.20-ac1-rmap= 15a/include/linux/mm_inline.h --- linux-2.4.20-ac1/include/linux/mm_inline.h 2002-12-01 11:01:04.00000000= 0 +0100 +++ linux-2.4.20-ac1-rmap15a/include/linux/mm_inline.h 2002-12-01 10:59:16.= 000000000 +0100 @@ -2,23 +2,125 @@ #define _LINUX_MM_INLINE_H =20 #include 
<linux/mm.h> +#include <linux/module.h> +#include <linux/brlock.h> + + +/* + * Copyright (c) 2002. All rights reserved. + * + * This software may be freely redistributed under the terms of the + * GNU General Public License. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + * + * Authors:=20 + * Linus Torvalds + * Stephen Tweedie + * Andrea Arcangeli + * Rik van Riel + * Arjan van de Ven + * and others + */ + +GPL_HEADER() + +extern unsigned char active_age_bias; =20 /* * These inline functions tend to need bits and pieces of all the * other VM include files, meaning they cannot be defined inside * one of the other VM include files. + *=20 + */ +=20 +/** + * page_dirty - do we need to write the data out to disk + * @page: page to test + * + * Returns true if the page contains data which needs to + * be written to disk. Doesn't test the page tables (yet?). + */ +static inline int page_dirty(struct page *page) +{ + struct buffer_head *tmp, *bh; + + if (PageDirty(page)) + return 1; + + if (page->mapping && !page->buffers) + return 0; + + tmp =3D bh =3D page->buffers; + + do { + if (tmp->b_state & ((1<<BH_Dirty) | (1<<BH_Lock))) + return 1; + tmp =3D tmp->b_this_page; + } while (tmp !=3D bh); + + return 0; +} + +/** + * page_anon - is this page ram/swap backed ? + * @page - page to test * - * The include file mess really needs to be cleaned up... + * Returns 1 if the page is backed by ram/swap, 0 if the page is + * backed by a file in a filesystem on permanent storage. */ +static inline int page_anon(struct page * page) +{ + /* Pages of an mmap()d file won't trigger this unless they get + * referenced on the inactive list and really are in the working + * set of the process... 
*/ + if (page->pte_chain) + return 1; + + if (!page->mapping && !page->buffers) + return 1; + + if (PageSwapCache(page)) + return 1; + + if (!page->mapping->a_ops->writepage) + return 1; =20 -static inline void add_page_to_active_list(struct page * page) + /* TODO: ramfs, tmpfs shm segments and ramdisk */ + + return 0; +} + + + +static inline void add_page_to_active_anon_list(struct page * page, int ag= e) { struct zone_struct * zone =3D page_zone(page); DEBUG_LRU_PAGE(page); - SetPageActive(page); - list_add(&page->lru, &zone->active_list); - zone->active_pages++; - nr_active_pages++; + SetPageActiveAnon(page); + list_add(&page->lru, &zone->active_anon_list[age]); + page->age =3D age + active_age_bias; + zone->active_anon_pages++; +} + +static inline void add_page_to_active_cache_list(struct page * page, int a= ge) +{ + struct zone_struct * zone =3D page_zone(page); + DEBUG_LRU_PAGE(page); + SetPageActiveCache(page); + list_add(&page->lru, &zone->active_cache_list[age]); + page->age =3D age + active_age_bias; + zone->active_cache_pages++; +} + +static inline void add_page_to_active_list(struct page * page, int age) +{ + if (page_anon(page)) + add_page_to_active_anon_list(page, age); + else + add_page_to_active_cache_list(page, age); } =20 static inline void add_page_to_inactive_dirty_list(struct page * page) @@ -28,7 +130,15 @@ static inline void add_page_to_inactive_ SetPageInactiveDirty(page); list_add(&page->lru, &zone->inactive_dirty_list); zone->inactive_dirty_pages++; - nr_inactive_dirty_pages++; +} + +static inline void add_page_to_inactive_laundry_list(struct page * page) +{ + struct zone_struct * zone =3D page_zone(page); + DEBUG_LRU_PAGE(page); + SetPageInactiveLaundry(page); + list_add(&page->lru, &zone->inactive_laundry_list); + zone->inactive_laundry_pages++; } =20 static inline void add_page_to_inactive_clean_list(struct page * page) @@ -38,16 +148,31 @@ static inline void add_page_to_inactive_ SetPageInactiveClean(page); list_add(&page->lru, &zone->inactive_clean_list); zone->inactive_clean_pages++; - nr_inactive_clean_pages++; } =20 -static inline void del_page_from_active_list(struct page * page) +static inline void del_page_from_active_anon_list(struct page * page) +{ + struct zone_struct * zone =3D page_zone(page); + unsigned char age; + list_del(&page->lru); + ClearPageActiveAnon(page); + zone->active_anon_pages--; + age =3D page->age - active_age_bias; + if (age<=3DMAX_AGE) + zone->active_anon_count[age]--; + DEBUG_LRU_PAGE(page); +} + +static inline void del_page_from_active_cache_list(struct page * page) { struct zone_struct * zone =3D page_zone(page); + unsigned char age; list_del(&page->lru); - ClearPageActive(page); - nr_active_pages--; - zone->active_pages--; + ClearPageActiveCache(page); + zone->active_cache_pages--; + age =3D page->age - active_age_bias; + if (age<=3DMAX_AGE) + zone->active_cache_count[age]--; DEBUG_LRU_PAGE(page); } =20 @@ -56,18 +181,25 @@ static inline void del_page_from_inactiv struct zone_struct * zone =3D page_zone(page); list_del(&page->lru); ClearPageInactiveDirty(page); - nr_inactive_dirty_pages--; zone->inactive_dirty_pages--; DEBUG_LRU_PAGE(page); } =20 +static inline void del_page_from_inactive_laundry_list(struct page * page) +{ + struct zone_struct * zone =3D page_zone(page); + list_del(&page->lru); + ClearPageInactiveLaundry(page); + zone->inactive_laundry_pages--; + DEBUG_LRU_PAGE(page); +} + static inline void del_page_from_inactive_clean_list(struct page * page) { struct zone_struct * zone =3D page_zone(page); 
list_del(&page->lru); ClearPageInactiveClean(page); zone->inactive_clean_pages--; - nr_inactive_clean_pages--; DEBUG_LRU_PAGE(page); } =20 @@ -184,7 +316,8 @@ static inline int zone_inactive_limit(st { int inactive, target, inactive_base; =20 - inactive_base =3D zone->active_pages + zone->inactive_dirty_pages; + inactive_base =3D zone->active_anon_pages + zone->active_cache_pages + + zone->inactive_dirty_pages; inactive_base /=3D INACTIVE_FACTOR; =20 /* GCC should optimise this away completely. */ @@ -253,7 +386,13 @@ static inline int inactive_low(struct zo */ static inline int inactive_high(struct zone_struct * zone) { - return inactive_limit(zone, VM_HIGH); + unsigned long active, inactive; + active =3D zone->active_anon_pages + zone->active_cache_pages + + zone->free_pages; + inactive =3D zone->inactive_dirty_pages + zone->inactive_clean_pages + zo= ne->inactive_laundry_pages; + if (inactive * 5 > (active+inactive)) + return -1; + return 1; } =20 /* @@ -263,12 +402,33 @@ static inline int inactive_target(void) { int target; =20 - target =3D nr_active_pages + nr_inactive_dirty_pages - + nr_inactive_clean_pages; + target =3D nr_active_anon_pages() + nr_active_cache_pages() + + nr_inactive_dirty_pages() + nr_inactive_clean_pages() + + nr_inactive_laundry_pages(); =20 target /=3D INACTIVE_FACTOR; =20 return target; } =20 +static inline void lru_lock(struct zone_struct *zone) +{ + if (zone) { + br_read_lock(BR_LRU_LOCK); + spin_lock(&zone->lru_lock); + } else { + br_write_lock(BR_LRU_LOCK); + } +} + +static inline void lru_unlock(struct zone_struct *zone) +{ + if (zone) { + spin_unlock(&zone->lru_lock); + br_read_unlock(BR_LRU_LOCK); + } else { + br_write_unlock(BR_LRU_LOCK); + } +} + #endif /* _LINUX_MM_INLINE_H */ diff -purN linux-2.4.20-ac1/include/linux/mmzone.h linux-2.4.20-ac1-rmap15a= /include/linux/mmzone.h --- linux-2.4.20-ac1/include/linux/mmzone.h 2002-12-01 11:01:04.000000000 += 0100 +++ linux-2.4.20-ac1-rmap15a/include/linux/mmzone.h 2002-12-01 10:59:16.000= 000000 +0100 @@ -13,11 +13,7 @@ * Free memory management - zoned buddy allocator. 
*/ =20 -#ifndef CONFIG_FORCE_MAX_ZONEORDER #define MAX_ORDER 10 -#else -#define MAX_ORDER CONFIG_FORCE_MAX_ZONEORDER -#endif =20 typedef struct free_area_struct { struct list_head free_list; @@ -29,6 +25,9 @@ struct pte_chain; =20 #define MAX_CHUNKS_PER_NODE 8 =20 +#define MAX_AGE 15 +#define INITIAL_AGE 3 + #define MAX_PER_CPU_PAGES 512 typedef struct per_cpu_pages_s { int nr_pages, max_nr_pages; @@ -50,19 +49,27 @@ typedef struct zone_struct { per_cpu_t cpu_pages[NR_CPUS]; spinlock_t lock; unsigned long free_pages; - unsigned long active_pages; + unsigned long active_anon_pages; + unsigned long active_cache_pages; unsigned long inactive_dirty_pages; + unsigned long inactive_laundry_pages; unsigned long inactive_clean_pages; unsigned long pages_min, pages_low, pages_high, pages_plenty; int need_balance; + int need_scan; + int active_anon_count[MAX_AGE+1]; + int active_cache_count[MAX_AGE+1]; =20 /* * free areas of different sizes */ - struct list_head active_list; + struct list_head active_anon_list[MAX_AGE+1]; + struct list_head active_cache_list[MAX_AGE+1]; struct list_head inactive_dirty_list; + struct list_head inactive_laundry_list; struct list_head inactive_clean_list; free_area_t free_area[MAX_ORDER]; + spinlock_t lru_lock; =20 /* * wait_table -- the array holding the hash table diff -purN linux-2.4.20-ac1/include/linux/module.h linux-2.4.20-ac1-rmap15a= /include/linux/module.h --- linux-2.4.20-ac1/include/linux/module.h 2002-12-01 11:01:04.000000000 += 0100 +++ linux-2.4.20-ac1-rmap15a/include/linux/module.h 2002-12-01 10:43:15.000= 000000 +0100 @@ -287,6 +287,9 @@ static const struct gtype##_id * __modul static const char __module_license[] __attribute__((section(".modinfo"))) = =3D \ "license=3D" license =20 +#define GPL_HEADER() \ +static const char cpyright=3D"This software may be freely redistributed un= der the terms of the GNU General Public License."; + /* Define the module variable, and usage macros. */ extern struct module __this_module; =20 @@ -302,7 +305,6 @@ static const char __module_kernel_versio static const char __module_using_checksums[] __attribute__((section(".modi= nfo"))) =3D "using_checksums=3D1"; #endif - #else /* MODULE */ =20 #define MODULE_AUTHOR(name) @@ -311,6 +313,7 @@ static const char __module_using_checksu #define MODULE_SUPPORTED_DEVICE(name) #define MODULE_PARM(var,type) #define MODULE_PARM_DESC(var,desc) +#define GPL_HEADER() =20 /* Create a dummy reference to the table to suppress gcc unused warnings. 
= Put * the reference in the .data.exit section which is discarded when code is= built diff -purN linux-2.4.20-ac1/include/linux/pagemap.h linux-2.4.20-ac1-rmap15= a/include/linux/pagemap.h --- linux-2.4.20-ac1/include/linux/pagemap.h 2002-12-01 11:01:04.000000000 = +0100 +++ linux-2.4.20-ac1-rmap15a/include/linux/pagemap.h 2002-12-01 10:59:16.00= 0000000 +0100 @@ -70,10 +70,6 @@ static inline unsigned long _page_hashfn =20 #define page_hash(mapping,index) (page_hash_table+_page_hashfn(mapping,ind= ex)) =20 -extern struct page * __find_get_page(struct address_space *mapping, - unsigned long index, struct page **hash); -#define find_get_page(mapping, index) \ - __find_get_page(mapping, index, page_hash(mapping, index)) extern struct page * __find_lock_page (struct address_space * mapping, unsigned long index, struct page **hash); extern struct page * find_or_create_page(struct address_space *mapping, @@ -90,6 +86,15 @@ extern void add_to_page_cache_locked(str extern int add_to_page_cache_unique(struct page * page, struct address_spa= ce *mapping, unsigned long index, struct page **hash); =20 extern void ___wait_on_page(struct page *); +extern int wait_on_page_timeout(struct page *page, int timeout); + + +extern struct page * __find_pagecache_page(struct address_space *mapping, + unsigned long index, struct page **hash); +#define find_pagecache_page(mapping, index) \ + __find_pagecache_page(mapping, index, page_hash(mapping, index)) +#define find_get_page(mapping, index) \ + __find_pagecache_page(mapping, index, page_hash(mapping, index)) =20 static inline void wait_on_page(struct page * page) { diff -purN linux-2.4.20-ac1/include/linux/swap.h linux-2.4.20-ac1-rmap15a/i= nclude/linux/swap.h --- linux-2.4.20-ac1/include/linux/swap.h 2002-12-01 11:01:04.000000000 +01= 00 +++ linux-2.4.20-ac1-rmap15a/include/linux/swap.h 2002-12-01 10:59:16.00000= 0000 +0100 @@ -85,9 +85,11 @@ extern int nr_swap_pages; =20 extern unsigned int nr_free_pages(void); extern unsigned int nr_free_buffer_pages(void); -extern int nr_active_pages; -extern int nr_inactive_dirty_pages; -extern int nr_inactive_clean_pages; +extern unsigned int nr_active_anon_pages(void); +extern unsigned int nr_active_cache_pages(void); +extern unsigned int nr_inactive_dirty_pages(void); +extern unsigned int nr_inactive_laundry_pages(void); +extern unsigned int nr_inactive_clean_pages(void); extern atomic_t page_cache_size; extern atomic_t buffermem_pages; extern spinlock_cacheline_t pagecache_lock_cacheline; @@ -115,6 +117,7 @@ extern int FASTCALL(try_to_unmap(struct=20 =20 /* linux/mm/swap.c */ extern void FASTCALL(lru_cache_add(struct page *)); +extern void FASTCALL(lru_cache_add_dirty(struct page *)); extern void FASTCALL(__lru_cache_del(struct page *)); extern void FASTCALL(lru_cache_del(struct page *)); =20 @@ -130,6 +133,7 @@ extern void swap_setup(void); extern wait_queue_head_t kswapd_wait; extern struct page * FASTCALL(reclaim_page(zone_t *)); extern int FASTCALL(try_to_free_pages(unsigned int gfp_mask)); +extern int rebalance_laundry_zone(struct zone_struct *, int, unsigned int); extern void wakeup_kswapd(unsigned int); extern void rss_free_pages(unsigned int); =20 @@ -175,8 +179,6 @@ extern struct swap_list_t swap_list; asmlinkage long sys_swapoff(const char *); asmlinkage long sys_swapon(const char *, int); =20 -extern spinlock_cacheline_t pagemap_lru_lock_cacheline; -#define pagemap_lru_lock pagemap_lru_lock_cacheline.lock =20 extern void FASTCALL(mark_page_accessed(struct page *)); =20 @@ -191,14 +193,18 @@ extern void 
FASTCALL(mark_page_accessed( =20 /* * List add/del helper macros. These must be called - * with the pagemap_lru_lock held! + * with the lru lock held! */ #define DEBUG_LRU_PAGE(page) \ do { \ - if (PageActive(page)) \ + if (PageActiveAnon(page)) \ + BUG(); \ + if (PageActiveCache(page)) \ BUG(); \ if (PageInactiveDirty(page)) \ BUG(); \ + if (PageInactiveLaundry(page)) \ + BUG(); \ if (PageInactiveClean(page)) \ BUG(); \ } while (0) diff -purN linux-2.4.20-ac1/kernel/ksyms.c linux-2.4.20-ac1-rmap15a/kernel/= ksyms.c --- linux-2.4.20-ac1/kernel/ksyms.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/kernel/ksyms.c 2002-12-01 10:43:15.000000000 += 0100 @@ -262,7 +262,6 @@ EXPORT_SYMBOL(no_llseek); EXPORT_SYMBOL(__pollwait); EXPORT_SYMBOL(poll_freewait); EXPORT_SYMBOL(ROOT_DEV); -EXPORT_SYMBOL(__find_get_page); EXPORT_SYMBOL(__find_lock_page); EXPORT_SYMBOL(find_or_create_page); EXPORT_SYMBOL(grab_cache_page_nowait); diff -purN linux-2.4.20-ac1/Makefile linux-2.4.20-ac1-rmap15a/Makefile --- linux-2.4.20-ac1/Makefile 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/Makefile 2002-12-01 10:59:16.000000000 +0100 @@ -1,7 +1,7 @@ VERSION =3D 2 PATCHLEVEL =3D 4 SUBLEVEL =3D 20 -EXTRAVERSION =3D -ac1 +EXTRAVERSION =3D -ac1-rmap15a =20 KERNELRELEASE=3D$(VERSION).$(PATCHLEVEL).$(SUBLEVEL)$(EXTRAVERSION) =20 diff -purN linux-2.4.20-ac1/mm/filemap.c linux-2.4.20-ac1-rmap15a/mm/filema= p.c --- linux-2.4.20-ac1/mm/filemap.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/filemap.c 2002-12-01 10:59:16.000000000 +01= 00 @@ -55,15 +55,14 @@ EXPORT_SYMBOL(vm_min_readahead); =20 spinlock_cacheline_t pagecache_lock_cacheline =3D {SPIN_LOCK_UNLOCKED}; /* - * NOTE: to avoid deadlocking you must never acquire the pagemap_lru_lock= =20 + * NOTE: to avoid deadlocking you must never acquire the lru lock=20 * with the pagecache_lock held. * * Ordering: * swap_lock -> - * pagemap_lru_lock -> + * lru lock -> * pagecache_lock */ -spinlock_cacheline_t pagemap_lru_lock_cacheline =3D {SPIN_LOCK_UNLOCKED}; =20 #define CLUSTER_PAGES (1 << page_cluster) #define CLUSTER_OFFSET(x) (((x) >> page_cluster) << page_cluster) @@ -183,7 +182,7 @@ void invalidate_inode_pages(struct inode =20 head =3D &inode->i_mapping->clean_pages; =20 - spin_lock(&pagemap_lru_lock); + lru_lock(ALL_ZONES); spin_lock(&pagecache_lock); curr =3D head->next; =20 @@ -216,7 +215,7 @@ unlock: } =20 spin_unlock(&pagecache_lock); - spin_unlock(&pagemap_lru_lock); + lru_unlock(ALL_ZONES); } =20 static int do_flushpage(struct page *page, unsigned long offset) @@ -880,6 +879,32 @@ void unlock_page(struct page *page) wake_up_all(waitqueue); } =20 + +/* like wait_on_page but with a timeout (in jiffies). + * returns 1 on timeout=20 + */ +int wait_on_page_timeout(struct page *page, int timeout) +{ + wait_queue_head_t *waitqueue =3D page_waitqueue(page); + struct task_struct *tsk =3D current; + DECLARE_WAITQUEUE(wait, tsk); +=09 + if (!PageLocked(page)) + return 0; + + add_wait_queue(waitqueue, &wait); + do { + set_task_state(tsk, TASK_UNINTERRUPTIBLE); + if (!PageLocked(page)) + break; + sync_page(page); + timeout =3D schedule_timeout(timeout); + } while (PageLocked(page) && timeout); + __set_task_state(tsk, TASK_RUNNING); + remove_wait_queue(waitqueue, &wait); + return PageLocked(page); +} + /* * Get a lock on the page, assuming we need to sleep * to get it.. 
@@ -914,26 +939,6 @@ void lock_page(struct page *page) __lock_page(page); } =20 -/* - * a rather lightweight function, finding and getting a reference to a - * hashed page atomically. - */ -struct page * __find_get_page(struct address_space *mapping, - unsigned long offset, struct page **hash) -{ - struct page *page; - - /* - * We scan the hash list read-only. Addition to and removal from - * the hash-list needs a held write-lock. - */ - spin_lock(&pagecache_lock); - page =3D __find_page_nolock(mapping, offset, *hash); - if (page) - page_cache_get(page); - spin_unlock(&pagecache_lock); - return page; -} =20 /* * Same as above, but trylock it instead of incrementing the count. @@ -1069,19 +1074,42 @@ static void drop_behind(struct file * fi * been increased since the last time we were called, we * stop when the page isn't there. */ - spin_lock(&pagemap_lru_lock); + lru_lock(ALL_ZONES); while (--index >=3D start) { struct page **hash =3D page_hash(mapping, index); spin_lock(&pagecache_lock); page =3D __find_page_nolock(mapping, index, *hash); spin_unlock(&pagecache_lock); - if (!page || !PageActive(page)) + if (!page || !PageActiveCache(page)) break; drop_page(page); } - spin_unlock(&pagemap_lru_lock); + lru_unlock(ALL_ZONES); +} + +/* + * Look up a page in the pagecache and return that page with + * a reference helt + */ +struct page * __find_pagecache_page(struct address_space *mapping, + unsigned long offset, struct page **hash) +{ + struct page *page; + + /* + * We scan the hash list read-only. Addition to and removal from + * the hash-list needs a held write-lock. + */ + spin_lock(&pagecache_lock); + page =3D __find_page_nolock(mapping, offset, *hash); + if (page) + page_cache_get(page); + spin_unlock(&pagecache_lock); + return page; } =20 +EXPORT_SYMBOL_GPL(__find_pagecache_page); + /* Same as grab_cache_page, but do not wait if the page is unavailable. * This is intended for speculative data generators, where the data can * be regenerated if the page couldn't be grabbed. This routine should @@ -1092,7 +1120,7 @@ struct page *grab_cache_page_nowait(stru struct page *page, **hash; =20 hash =3D page_hash(mapping, index); - page =3D __find_get_page(mapping, index, hash); + page =3D __find_pagecache_page(mapping, index, hash); =20 if ( page ) { if ( !TryLockPage(page) ) { @@ -1378,7 +1406,7 @@ void mark_page_accessed(struct page *pag /* Mark the page referenced, AFTER checking for previous usage.. */ SetPageReferenced(page); =20 - if (unlikely(PageInactiveClean(page))) { + if (unlikely(PageInactiveClean(page) || PageInactiveLaundry(page))) { struct zone_struct *zone =3D page_zone(page); int free =3D zone->free_pages + zone->inactive_clean_pages; =20 @@ -1899,7 +1927,7 @@ static ssize_t do_readahead(struct file=20 nr =3D max; =20 /* And limit it to a sane percentage of the inactive list.. 
*/ - max =3D nr_inactive_clean_pages / 2; + max =3D (nr_inactive_clean_pages() + nr_inactive_laundry_pages()) / 2; if (nr > max) nr =3D max; =20 @@ -2022,7 +2050,7 @@ retry_all: */ hash =3D page_hash(mapping, pgoff); retry_find: - page =3D __find_get_page(mapping, pgoff, hash); + page =3D __find_pagecache_page(mapping, pgoff, hash); if (!page) goto no_cached_page; =20 @@ -2885,7 +2913,7 @@ struct page *__read_cache_page(struct ad struct page *page, *cached_page =3D NULL; int err; repeat: - page =3D __find_get_page(mapping, index, hash); + page =3D __find_pagecache_page(mapping, index, hash); if (!page) { if (!cached_page) { cached_page =3D page_cache_alloc(mapping); diff -purN linux-2.4.20-ac1/mm/page_alloc.c linux-2.4.20-ac1-rmap15a/mm/pag= e_alloc.c --- linux-2.4.20-ac1/mm/page_alloc.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/page_alloc.c 2002-12-01 10:59:16.000000000 = +0100 @@ -26,9 +26,6 @@ #include <linux/smp.h> =20 int nr_swap_pages; -int nr_active_pages; -int nr_inactive_dirty_pages; -int nr_inactive_clean_pages; pg_data_t *pgdat_list; =20 /* Used to look up the address of the struct zone encoded in page->zone */ @@ -109,16 +106,19 @@ BUG(); if (PageLocked(page)) BUG(); - if (PageActive(page)) + if (PageActiveAnon(page)) + BUG(); + if (PageActiveCache(page)) BUG(); if (PageInactiveDirty(page)) BUG(); + if (PageInactiveLaundry(page)) + BUG(); if (PageInactiveClean(page)) BUG(); if (page->pte_chain) BUG(); page->flags &=3D ~((1<<PG_referenced) | (1<<PG_dirty)); - page->age =3D PAGE_AGE_START; =09 zone =3D page_zone(page); =20 @@ -562,7 +562,9 @@ */ defragment: { - int freed =3D 0; + int try_harder =3D 0; + unsigned int mask =3D 0; + int numpages; defragment_again: zone =3D zonelist->zones; for (;;) { @@ -571,6 +573,22 @@ break; if (!z->size) continue; + + /* + * Try to free the zone's inactive laundry pages. + * Nonblocking in the first pass; blocking in the + * second pass, but never on very new IO. + */ + numpages =3D z->inactive_laundry_pages; + if (try_harder) { + numpages /=3D 2; + mask =3D gfp_mask; + } + + current->flags |=3D PF_MEMALLOC; + rebalance_laundry_zone(z, numpages, mask); + current->flags &=3D ~PF_MEMALLOC; + while (z->inactive_clean_pages) { struct page * page; /* Move one page to the free list. */ @@ -585,12 +603,9 @@ } } =20 - /* XXX: do real defragmentation instead of calling launder ? */ - if (!freed & !(current->flags & PF_MEMALLOC)) { - freed =3D 1; - current->flags |=3D PF_MEMALLOC; - try_to_free_pages(gfp_mask); - current->flags &=3D ~PF_MEMALLOC; + /* If we can wait for IO to complete, we wait... */ + if (!try_harder && (gfp_mask & __GFP_FS)) { + try_harder =3D 1; goto defragment_again; } } @@ -641,19 +656,29 @@ } =20 /* - * Total amount of free (allocatable) RAM: + * These statistics are held in per-zone counters, so we need to loop + * over each zone and read the statistics. We use this silly macro + * so we don't need to duplicate the code for every statistic. 
+ * If you have a better idea on how to implement this (cut'n'paste + * isn't considered better), please let me know - Rik */ -unsigned int nr_free_pages (void) -{ - unsigned int sum; - zone_t *zone; +#define NR_FOO_PAGES(__function_name, __stat) \ + unsigned int __function_name (void) \ + { \ + unsigned int sum =3D 0; \ + zone_t *zone; \ + \ + for_each_zone(zone) \ + sum +=3D zone->__stat; \ + return sum; \ + } =20 - sum =3D 0; - for_each_zone(zone) - sum +=3D zone->free_pages; -=09 - return sum; -} +NR_FOO_PAGES(nr_free_pages, free_pages) +NR_FOO_PAGES(nr_active_anon_pages, active_anon_pages) +NR_FOO_PAGES(nr_active_cache_pages, active_cache_pages) +NR_FOO_PAGES(nr_inactive_dirty_pages, inactive_dirty_pages) +NR_FOO_PAGES(nr_inactive_laundry_pages, inactive_laundry_pages) +NR_FOO_PAGES(nr_inactive_clean_pages, inactive_clean_pages) =20 /* * Amount of free RAM allocatable as buffer memory: @@ -671,6 +696,7 @@ for (zone =3D *zonep++; zone; zone =3D *zonep++) { sum +=3D zone->free_pages; sum +=3D zone->inactive_clean_pages; + sum +=3D zone->inactive_laundry_pages; sum +=3D zone->inactive_dirty_pages; } =20 @@ -729,9 +755,10 @@ nr_free_highpages() << (PAGE_SHIFT-10)); =20 printk("( Active: %d, inactive_dirty: %d, inactive_clean: %d, free: %d )\= n", - nr_active_pages, - nr_inactive_dirty_pages, - nr_inactive_clean_pages, + nr_active_anon_pages() + nr_active_cache_pages(), + nr_inactive_dirty_pages(), + nr_inactive_laundry_pages(), + nr_inactive_clean_pages(), nr_free_pages()); =20 for (type =3D 0; type < MAX_NR_ZONES; type++) { @@ -941,12 +968,25 @@ zone->lock =3D SPIN_LOCK_UNLOCKED; zone->zone_pgdat =3D pgdat; zone->free_pages =3D 0; + zone->active_anon_pages =3D 0; + zone->active_cache_pages =3D 0; zone->inactive_clean_pages =3D 0; + zone->inactive_laundry_pages =3D 0; zone->inactive_dirty_pages =3D 0; zone->need_balance =3D 0; - INIT_LIST_HEAD(&zone->active_list); + zone->need_scan =3D 0; + for (k =3D 0; k <=3D MAX_AGE ; k++) { + INIT_LIST_HEAD(&zone->active_anon_list[k]); + zone->active_anon_count[k] =3D 0; + } + for (k =3D 0; k <=3D MAX_AGE ; k++) { + INIT_LIST_HEAD(&zone->active_cache_list[k]); + zone->active_cache_count[k] =3D 0; + } INIT_LIST_HEAD(&zone->inactive_dirty_list); + INIT_LIST_HEAD(&zone->inactive_laundry_list); INIT_LIST_HEAD(&zone->inactive_clean_list); + spin_lock_init(&zone->lru_lock); =20 if (!size) continue; diff -purN linux-2.4.20-ac1/mm/rmap.c linux-2.4.20-ac1-rmap15a/mm/rmap.c --- linux-2.4.20-ac1/mm/rmap.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/rmap.c 2002-12-01 10:43:15.000000000 +0100 @@ -14,7 +14,7 @@ /* * Locking: * - the page->pte_chain is protected by the PG_chainlock bit, - * which nests within the pagemap_lru_lock, then the + * which nests within the lru lock, then the * mm->page_table_lock, and then the page lock. * - because swapout locking is opposite to the locking order * in the page fault path, the swapout path uses trylocks @@ -195,7 +195,7 @@ out: * table entry mapping a page. Because locking order here is opposite * to the locking order used by the page fault path, we use trylocks. * Locking: - * pagemap_lru_lock page_launder() + * lru lock page_launder() * page lock page_launder(), trylock * pte_chain_lock page_launder() * mm->page_table_lock try_to_unmap_one(), trylock @@ -263,7 +263,7 @@ out_unlock: * @page: the page to get unmapped * * Tries to remove all the page table entries which are mapping this - * page, used in the pageout path. Caller must hold pagemap_lru_lock + * page, used in the pageout path. 
Caller must hold lru lock * and the page lock. Return values are: * * SWAP_SUCCESS - we succeeded in removing all mappings diff -purN linux-2.4.20-ac1/mm/shmem.c linux-2.4.20-ac1-rmap15a/mm/shmem.c --- linux-2.4.20-ac1/mm/shmem.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/shmem.c 2002-12-01 10:43:15.000000000 +0100 @@ -581,7 +581,7 @@ repeat: * cache and swap cache. We need to recheck the page cache * under the protection of the info->lock spinlock. */ =20 - page =3D find_get_page(mapping, idx); + page =3D find_pagecache_page(mapping, idx); if (page) { if (TryLockPage(page)) goto wait_retry; @@ -593,7 +593,7 @@ repeat: unsigned long flags; =20 /* Look it up and read it in.. */ - page =3D lookup_swap_cache(*entry); + page =3D find_pagecache_page(&swapper_space, entry->val); if (!page) { swp_entry_t swap =3D *entry; spin_unlock (&info->lock); diff -purN linux-2.4.20-ac1/mm/swap.c linux-2.4.20-ac1-rmap15a/mm/swap.c --- linux-2.4.20-ac1/mm/swap.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/swap.c 2002-12-01 10:59:16.000000000 +0100 @@ -36,7 +36,6 @@ pager_daemon_t pager_daemon =3D { /** * (de)activate_page - move pages from/to active and inactive lists * @page: the page we want to move - * @nolock - are we already holding the pagemap_lru_lock? * * Deactivate_page will move an active page to the right * inactive list, while activate_page will move a page back @@ -51,18 +50,20 @@ void deactivate_page_nolock(struct page=20 * (some pages aren't on any list at all) */ ClearPageReferenced(page); - page->age =3D 0; - if (PageActive(page)) { - del_page_from_active_list(page); + if (PageActiveAnon(page)) { + del_page_from_active_anon_list(page); + add_page_to_inactive_dirty_list(page); + } else if (PageActiveCache(page)) { + del_page_from_active_cache_list(page); add_page_to_inactive_dirty_list(page); } }=09 =20 void deactivate_page(struct page * page) { - spin_lock(&pagemap_lru_lock); + lru_lock(page_zone(page)); deactivate_page_nolock(page); - spin_unlock(&pagemap_lru_lock); + lru_unlock(page_zone(page)); } =20 /** @@ -74,16 +75,54 @@ void deactivate_page(struct page * page) * on the inactive_clean list it is placed on the inactive_dirty list * instead. * - * Note: this function gets called with the pagemap_lru_lock held. + * Note: this function gets called with the lru lock held. */ +void drop_page_zone(struct zone_struct *zone, struct page * page) +{ + if (!TryLockPage(page)) { + if (page->mapping && page->buffers) { + page_cache_get(page); + lru_unlock(zone); + try_to_release_page(page, GFP_NOIO); + lru_lock(zone); + page_cache_release(page); + } + UnlockPage(page); + } + + /* Make sure the page really is reclaimable. 
*/ + pte_chain_lock(page); + if (!page->mapping || PageDirty(page) || page->pte_chain || + page->buffers || page_count(page) > 1) + deactivate_page_nolock(page); + + else if (page_count(page) =3D=3D 1) { + ClearPageReferenced(page); + if (PageActiveAnon(page)) { + del_page_from_active_anon_list(page); + add_page_to_inactive_clean_list(page); + } else if (PageActiveCache(page)) { + del_page_from_active_cache_list(page); + add_page_to_inactive_clean_list(page); + } else if (PageInactiveDirty(page)) { + del_page_from_inactive_dirty_list(page); + add_page_to_inactive_clean_list(page); + } else if (PageInactiveLaundry(page)) { + del_page_from_inactive_laundry_list(page); + add_page_to_inactive_clean_list(page); + } + } + pte_chain_unlock(page); +} + void drop_page(struct page * page) { if (!TryLockPage(page)) { if (page->mapping && page->buffers) { page_cache_get(page); - spin_unlock(&pagemap_lru_lock); + lru_unlock(ALL_ZONES); try_to_release_page(page, GFP_NOIO); - spin_lock(&pagemap_lru_lock); + lru_lock(ALL_ZONES); page_cache_release(page); } UnlockPage(page); @@ -97,13 +136,18 @@ void drop_page(struct page * page) =20 else if (page_count(page) =3D=3D 1) { ClearPageReferenced(page); - page->age =3D 0; - if (PageActive(page)) { - del_page_from_active_list(page); + if (PageActiveAnon(page)) { + del_page_from_active_anon_list(page); + add_page_to_inactive_clean_list(page); + } else if (PageActiveCache(page)) { + del_page_from_active_cache_list(page); add_page_to_inactive_clean_list(page); } else if (PageInactiveDirty(page)) { del_page_from_inactive_dirty_list(page); add_page_to_inactive_clean_list(page); + } else if (PageInactiveLaundry(page)) { + del_page_from_inactive_laundry_list(page); + add_page_to_inactive_clean_list(page); } } pte_chain_unlock(page); @@ -116,21 +160,21 @@ void activate_page_nolock(struct page *=20 { if (PageInactiveDirty(page)) { del_page_from_inactive_dirty_list(page); - add_page_to_active_list(page); + add_page_to_active_list(page, INITIAL_AGE); + } else if (PageInactiveLaundry(page)) { + del_page_from_inactive_laundry_list(page); + add_page_to_active_list(page, INITIAL_AGE); } else if (PageInactiveClean(page)) { del_page_from_inactive_clean_list(page); - add_page_to_active_list(page); + add_page_to_active_list(page, INITIAL_AGE); } - - /* Make sure the page gets a fair chance at staying active. */ - page->age =3D max((int)page->age, PAGE_AGE_START); } =20 void activate_page(struct page * page) { - spin_lock(&pagemap_lru_lock); + lru_lock(page_zone(page)); activate_page_nolock(page); - spin_unlock(&pagemap_lru_lock); + lru_unlock(page_zone(page)); } =20 /** @@ -140,10 +184,10 @@ void activate_page(struct page * page) void lru_cache_add(struct page * page) { if (!PageLRU(page)) { - spin_lock(&pagemap_lru_lock); + lru_lock(page_zone(page)); SetPageLRU(page); - add_page_to_active_list(page); - spin_unlock(&pagemap_lru_lock); + add_page_to_active_list(page, INITIAL_AGE); + lru_unlock(page_zone(page)); } } =20 @@ -152,14 +196,18 @@ void lru_cache_add(struct page * page) * @page: the page to add * * This function is for when the caller already holds - * the pagemap_lru_lock. + * the lru lock. 
*/ void __lru_cache_del(struct page * page) { - if (PageActive(page)) { - del_page_from_active_list(page); + if (PageActiveAnon(page)) { + del_page_from_active_anon_list(page); + } else if (PageActiveCache(page)) { + del_page_from_active_cache_list(page); } else if (PageInactiveDirty(page)) { del_page_from_inactive_dirty_list(page); + } else if (PageInactiveLaundry(page)) { + del_page_from_inactive_laundry_list(page); } else if (PageInactiveClean(page)) { del_page_from_inactive_clean_list(page); } @@ -172,9 +220,9 @@ void __lru_cache_del(struct page * page) */ void lru_cache_del(struct page * page) { - spin_lock(&pagemap_lru_lock); + lru_lock(page_zone(page)); __lru_cache_del(page); - spin_unlock(&pagemap_lru_lock); + lru_unlock(page_zone(page)); } =20 /* diff -purN linux-2.4.20-ac1/mm/swap_state.c linux-2.4.20-ac1-rmap15a/mm/swa= p_state.c --- linux-2.4.20-ac1/mm/swap_state.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/swap_state.c 2002-12-01 10:43:15.000000000 = +0100 @@ -196,7 +196,7 @@ struct page * lookup_swap_cache(swp_entr { struct page *found; =20 - found =3D find_get_page(&swapper_space, entry.val); + found =3D find_pagecache_page(&swapper_space, entry.val); /* * Unsafe to assert PageSwapCache and mapping on page found: * if SMP nothing prevents swapoff from deleting this page from @@ -224,10 +224,10 @@ struct page * read_swap_cache_async(swp_ /* * First check the swap cache. Since this is normally * called after lookup_swap_cache() failed, re-calling - * that would confuse statistics: use find_get_page() + * that would confuse statistics: use find_pagecache_page() * directly. */ - found_page =3D find_get_page(&swapper_space, entry.val); + found_page =3D find_pagecache_page(&swapper_space, entry.val); if (found_page) break; =20 diff -purN linux-2.4.20-ac1/mm/vmscan.c linux-2.4.20-ac1-rmap15a/mm/vmscan.c --- linux-2.4.20-ac1/mm/vmscan.c 2002-12-01 11:01:04.000000000 +0100 +++ linux-2.4.20-ac1-rmap15a/mm/vmscan.c 2002-12-01 10:59:16.000000000 +0100 @@ -12,6 +12,7 @@ * to bring the system back to freepages.high: 2.4.97, Rik van Riel. * Zone aware kswapd started 02/00, Kanoj Sarcar (kanoj@sgi.com). * Multiqueue VM started 5.8.00, Rik van Riel. + * O(1) rmap vm, Arjan van de ven <arjanv@redhat.com> */ =20 #include <linux/slab.h> @@ -37,16 +38,36 @@ static void wakeup_memwaiters(void); */ #define DEF_PRIORITY (6) =20 -static inline void age_page_up(struct page *page) +static inline void age_page_up_nolock(struct page *page, int old_age) { - page->age =3D min((int) (page->age + PAGE_AGE_ADV), PAGE_AGE_MAX);=20 -} + int new_age; +=09 + new_age =3D old_age+4; + if (new_age < 0) + new_age =3D 0; + if (new_age > MAX_AGE) + new_age =3D MAX_AGE;=09 + =09 + if (PageActiveAnon(page)) { + del_page_from_active_anon_list(page); + add_page_to_active_anon_list(page, new_age);=09 + } else if (PageActiveCache(page)) { + del_page_from_active_cache_list(page); + add_page_to_active_cache_list(page, new_age);=09 + } else if (PageInactiveDirty(page)) { + del_page_from_inactive_dirty_list(page); + add_page_to_active_list(page, new_age);=09 + } else if (PageInactiveLaundry(page)) { + del_page_from_inactive_laundry_list(page); + add_page_to_active_list(page, new_age);=09 + } else if (PageInactiveClean(page)) { + del_page_from_inactive_clean_list(page); + add_page_to_active_list(page, new_age);=09 + } else return; =20 -static inline void age_page_down(struct page *page) -{ - page->age -=3D min(PAGE_AGE_DECL, (int)page->age); } =20 + /* Must be called with page's pte_chain_lock held. 
*/ static inline int page_mapping_inuse(struct page * page) { @@ -84,9 +105,9 @@ struct page * reclaim_page(zone_t * zone =20 /* * We need to hold the pagecache_lock around all tests to make sure - * reclaim_page() cannot race with find_get_page() and friends. + * reclaim_page() doesn't race with other pagecache users */ - spin_lock(&pagemap_lru_lock); + lru_lock(zone); spin_lock(&pagecache_lock); maxscan =3D zone->inactive_clean_pages; while (maxscan-- && !list_empty(&zone->inactive_clean_list)) { @@ -94,12 +115,7 @@ struct page * reclaim_page(zone_t * zone page =3D list_entry(page_lru, struct page, lru); =20 /* Wrong page on list?! (list corruption, should not happen) */ - if (unlikely(!PageInactiveClean(page))) { - printk("VM: reclaim_page, wrong page on list.\n"); - list_del(page_lru); - page_zone(page)->inactive_clean_pages--; - continue; - } + BUG_ON(unlikely(!PageInactiveClean(page))); =20 /* Page is being freed */ if (unlikely(page_count(page)) =3D=3D 0) { @@ -144,7 +160,7 @@ struct page * reclaim_page(zone_t * zone UnlockPage(page); } spin_unlock(&pagecache_lock); - spin_unlock(&pagemap_lru_lock); + lru_unlock(zone); return NULL; =20 =20 @@ -152,11 +168,10 @@ found_page: __lru_cache_del(page); pte_chain_unlock(page); spin_unlock(&pagecache_lock); - spin_unlock(&pagemap_lru_lock); + lru_unlock(zone); if (entry.val) swap_free(entry); UnlockPage(page); - page->age =3D PAGE_AGE_START; if (page_count(page) !=3D 1) printk("VM: reclaim_page, found page with count %d!\n", page_count(page)); @@ -164,458 +179,626 @@ found_page: } =20 /** - * page_dirty - do we need to write the data out to disk - * @page: page to test + * need_rebalance_dirty - do we need to write inactive stuff to disk? + * @zone: the zone in question * - * Returns true if the page contains data which needs to - * be written to disk. Doesn't test the page tables (yet?). + * Returns true if the zone in question has an inbalance between inactive + * dirty on one side and inactive laundry + inactive clean on the other + * Right now set the balance at 50%; may need tuning later on */ -static inline int page_dirty(struct page *page) +static inline int need_rebalance_dirty(zone_t * zone) { - struct buffer_head *tmp, *bh; - - if (PageDirty(page)) + if (zone->inactive_dirty_pages > zone->inactive_laundry_pages + zone->ina= ctive_clean_pages) return 1; =20 - if (page->mapping && !page->buffers) - return 0; - - tmp =3D bh =3D page->buffers; - - do { - if (tmp->b_state & ((1<<BH_Dirty) | (1<<BH_Lock))) - return 1; - tmp =3D tmp->b_this_page; - } while (tmp !=3D bh); + return 0; +} =20 +/** + * need_rebalance_laundry - does the zone have too few inactive_clean page= s? + * @zone: the zone in question + * + * Returns true if the zone in question has too few pages in inactive clean + * + free + */ +static inline int need_rebalance_laundry(zone_t * zone) +{ + if (free_low(zone) >=3D 0) + return 1; return 0; } =20 /** - * page_launder_zone - clean dirty inactive pages, move to inactive_clean = list + * launder_page - clean dirty page, move to inactive_laundry list * @zone: zone to free pages in * @gfp_mask: what operations we are allowed to do - * @full_flush: full-out page flushing, if we couldn't get enough clean pa= ges + * @page: the page at hand, must be on the inactive dirty list * - * This function is called when we are low on free / inactive_clean - * pages, its purpose is to refill the free/clean list as efficiently - * as possible. 
- * - * This means we do writes asynchronously as long as possible and will - * only sleep on IO when we don't have another option. Since writeouts - * cause disk seeks and make read IO slower, we skip writes alltogether - * when the amount of dirty pages is small. - * - * This code is heavily inspired by the FreeBSD source code. Thanks - * go out to Matthew Dillon. - */ -int page_launder_zone(zone_t * zone, int gfp_mask, int full_flush) -{ - int maxscan, cleaned_pages, target, maxlaunder, iopages, over_rsslimit; - struct list_head * entry, * next; - - target =3D max_t(int, free_plenty(zone), zone->pages_min); - cleaned_pages =3D iopages =3D 0; - - /* If we can get away with it, only flush 2 MB worth of dirty pages */ - if (full_flush) - maxlaunder =3D 1000000; - else { - maxlaunder =3D min_t(int, 512, zone->inactive_dirty_pages / 4); - maxlaunder =3D max(maxlaunder, free_plenty(zone) * 4); - } -=09 - /* The main launder loop. */ - spin_lock(&pagemap_lru_lock); -rescan: - maxscan =3D zone->inactive_dirty_pages; - entry =3D zone->inactive_dirty_list.prev; - next =3D entry->prev; - while (maxscan-- && !list_empty(&zone->inactive_dirty_list) && - next !=3D &zone->inactive_dirty_list) { - struct page * page; - =09 - /* Low latency reschedule point */ - if (current->need_resched) { - spin_unlock(&pagemap_lru_lock); - schedule(); - spin_lock(&pagemap_lru_lock); - continue; - } - - entry =3D next; - next =3D entry->prev; - page =3D list_entry(entry, struct page, lru); - - /* This page was removed while we looked the other way. */ - if (!PageInactiveDirty(page)) - goto rescan; + * per-zone lru lock is assumed to be held, but this function can drop + * it and sleep, so no other locks are allowed to be held. + * + * returns 0 for failure; 1 for success + */ +int launder_page(zone_t * zone, int gfp_mask, struct page *page) +{ + int over_rsslimit; =20 - if (cleaned_pages > target) - break; + /* + * Page is being freed, don't worry about it, but report progress. + */ + if (unlikely(page_count(page)) =3D=3D 0) + return 1; =20 - /* Stop doing IO if we've laundered too many pages already. */ - if (maxlaunder < 0) - gfp_mask &=3D ~(__GFP_IO|__GFP_FS); + BUG_ON(!PageInactiveDirty(page)); + del_page_from_inactive_dirty_list(page); + add_page_to_inactive_laundry_list(page); + /* store the time we start IO */ + page->age =3D (jiffies/HZ)&255; + /* + * The page is locked. IO in progress? + * If so, move to laundry and report progress + * Acquire PG_locked early in order to safely + * access page->mapping. + */ + if (unlikely(TryLockPage(page))) { + return 1; + } =20 - /* - * Page is being freed, don't worry about it. - */ - if (unlikely(page_count(page)) =3D=3D 0) - continue; + /* + * The page is in active use or really unfreeable. Move to + * the active list and adjust the page age if needed. + */ + pte_chain_lock(page); + if (page_referenced(page, &over_rsslimit) && !over_rsslimit && + page_mapping_inuse(page)) { + del_page_from_inactive_laundry_list(page); + add_page_to_active_list(page, INITIAL_AGE); + pte_chain_unlock(page); + UnlockPage(page); + return 1; + } =20 - /* - * The page is locked. IO in progress? - * Acquire PG_locked early in order to safely - * access page->mapping. - */ - if (unlikely(TryLockPage(page))) { - iopages++; - continue; + /* + * Anonymous process memory without backing store. Try to + * allocate it some swap space here. + * + * XXX: implement swap clustering ? 
+ */ + if (page->pte_chain && !page->mapping && !page->buffers) { + page_cache_get(page); + pte_chain_unlock(page); + lru_unlock(zone); + if (!add_to_swap(page)) { + activate_page(page); + lru_lock(zone); + UnlockPage(page); + page_cache_release(page); + return 0; } - - /* - * The page is in active use or really unfreeable. Move to - * the active list and adjust the page age if needed. - */ - pte_chain_lock(page); - if (page_referenced(page, &over_rsslimit) && !over_rsslimit && - page_mapping_inuse(page)) { - del_page_from_inactive_dirty_list(page); - add_page_to_active_list(page); - page->age =3D max((int)page->age, PAGE_AGE_START); - pte_chain_unlock(page); + lru_lock(zone); + page_cache_release(page); + /* Note: may be on another list ! */ + if (!PageInactiveLaundry(page)) { UnlockPage(page); - continue; + return 1; + } + if (unlikely(page_count(page)) =3D=3D 0) { + UnlockPage(page); + return 1; } + pte_chain_lock(page); + } =20 - /* - * Anonymous process memory without backing store. Try to - * allocate it some swap space here. - * - * XXX: implement swap clustering ? - */ - if (page->pte_chain && !page->mapping && !page->buffers) { - /* Don't bother if we can't swap it out now. */ - if (maxlaunder < 0) { + /* + * The page is mapped into the page tables of one or more + * processes. Try to unmap it here. + */ + if (page->pte_chain && page->mapping) { + switch (try_to_unmap(page)) { + case SWAP_ERROR: + case SWAP_FAIL: + goto page_active; + case SWAP_AGAIN: pte_chain_unlock(page); UnlockPage(page); - list_del(entry); - list_add(entry, &zone->inactive_dirty_list); - continue; - } - page_cache_get(page); - pte_chain_unlock(page); - spin_unlock(&pagemap_lru_lock); - if (!add_to_swap(page)) { - activate_page(page); - UnlockPage(page); - page_cache_release(page); - spin_lock(&pagemap_lru_lock); - continue; - } - page_cache_release(page); - spin_lock(&pagemap_lru_lock); - pte_chain_lock(page); + return 0; + case SWAP_SUCCESS: + ; /* fall through, try freeing the page below */ + /* fixme: add a SWAP_MLOCK case */ } + } + pte_chain_unlock(page); =20 + if (PageDirty(page) && page->mapping) { /* - * The page is mapped into the page tables of one or more - * processes. Try to unmap it here. + * The page can be dirtied after we start writing, but + * in that case the dirty bit will simply be set again + * and we'll need to write it again. */ - if (page->pte_chain && page->mapping) { - switch (try_to_unmap(page)) { - case SWAP_ERROR: - case SWAP_FAIL: - goto page_active; - case SWAP_AGAIN: - pte_chain_unlock(page); - UnlockPage(page); - continue; - case SWAP_SUCCESS: - ; /* try to free the page below */ - } + int (*writepage)(struct page *); + + writepage =3D page->mapping->a_ops->writepage; + if ((gfp_mask & __GFP_FS) && writepage) { + ClearPageDirty(page); + SetPageLaunder(page); + page_cache_get(page); + lru_unlock(zone); + + writepage(page); + + page_cache_release(page); + lru_lock(zone); + return 1; + } else { + del_page_from_inactive_laundry_list(page); + add_page_to_inactive_dirty_list(page); + /* FIXME: this is wrong for !__GFP_FS !!! */ + UnlockPage(page); + return 0; } - pte_chain_unlock(page); + } =20 - if (PageDirty(page) && page->mapping) { - /* - * It is not critical here to write it only if - * the page is unmapped beause any direct writer - * like O_DIRECT would set the PG_dirty bitflag - * on the physical page after having successfully - * pinned it and after the I/O to the page is finished, - * so the direct writes to the page cannot get lost. 
- */ - int (*writepage)(struct page *); + /* + * If the page has buffers, try to free the buffer mappings + * associated with this page. If we succeed we try to free + * the page as well. + */ + if (page->buffers) { + /* To avoid freeing our page before we're done. */ + page_cache_get(page); + lru_unlock(zone); =20 - writepage =3D page->mapping->a_ops->writepage; - if ((gfp_mask & __GFP_FS) && writepage) { - ClearPageDirty(page); - SetPageLaunder(page); - page_cache_get(page); - spin_unlock(&pagemap_lru_lock); + try_to_release_page(page, gfp_mask); + UnlockPage(page); =20 - writepage(page); - maxlaunder--; - iopages++; - page_cache_release(page); + /*=20 + * If the buffers were the last user of the page we free + * the page here. Because of that we shouldn't hold the + * lru lock yet. + */ + page_cache_release(page); =20 - spin_lock(&pagemap_lru_lock); - continue; - } else { - UnlockPage(page); - list_del(entry); - list_add(entry, &zone->inactive_dirty_list); - continue; - } - } + lru_lock(zone); + return 1; + } =20 + /* + * If the page is really freeable now, move it to the + * inactive_laundry list to keep LRU order. + * + * We re-test everything since the page could have been + * used by somebody else while we waited on IO above. + * This test is not safe from races; only the one in + * reclaim_page() needs to be. + */ + pte_chain_lock(page); + if (page->mapping && !PageDirty(page) && !page->pte_chain && + page_count(page) =3D=3D 1) { + pte_chain_unlock(page); + UnlockPage(page); + return 1; + } else { /* - * If the page has buffers, try to free the buffer mappings - * associated with this page. If we succeed we try to free - * the page as well. + * OK, we don't know what to do with the page. + * It's no use keeping it here, so we move it + * back to the active list. */ - if (page->buffers) { - /* To avoid freeing our page before we're done. */ - page_cache_get(page); + page_active: + activate_page_nolock(page); + pte_chain_unlock(page); + UnlockPage(page); + } + return 0; +} =20 - spin_unlock(&pagemap_lru_lock); =20 - if (try_to_release_page(page, gfp_mask)) { - if (!page->mapping) { - /* - * We must not allow an anon page - * with no buffers to be visible on - * the LRU, so we unlock the page after - * taking the lru lock - */ - spin_lock(&pagemap_lru_lock); - UnlockPage(page); - __lru_cache_del(page); +unsigned char active_age_bias =3D 0; =20 - /* effectively free the page here */ - page_cache_release(page); +/* Ages down all pages on the active list */ +/* assumes the lru lock held */ +static inline void kachunk_anon(struct zone_struct * zone) +{ + int k; + if (!list_empty(&zone->active_anon_list[0])) + return; + if (!zone->active_anon_pages) + return; =20 - cleaned_pages++; - continue; - } else { - /* - * We freed the buffers but may have - * slept; undo the stuff we did before - * try_to_release_page and fall through - * to the next step. - * But only if the page is still on the inact. dirty=20 - * list. 
- */ - - spin_lock(&pagemap_lru_lock); - /* Check if the page was removed from the list - * while we looked the other way.=20 - */ - if (!PageInactiveDirty(page)) { - page_cache_release(page); - continue; - } - page_cache_release(page); - } - } else { - /* failed to drop the buffers so stop here */ - UnlockPage(page); - page_cache_release(page); - maxlaunder--; - iopages++; + for (k =3D 0; k < MAX_AGE; k++) { + list_splice_init(&zone->active_anon_list[k+1], &zone->active_anon_list[k= ]); + zone->active_anon_count[k] =3D zone->active_anon_count[k+1]; + zone->active_anon_count[k+1] =3D 0; + } + + active_age_bias++; + /* flag this zone as having had activity -> rescan to age up is desired */ + zone->need_scan++; +} + +static inline void kachunk_cache(struct zone_struct * zone) +{ + int k; + if (!list_empty(&zone->active_cache_list[0])) + return; + if (!zone->active_cache_pages) + return; + + for (k =3D 0; k < MAX_AGE; k++) { + list_splice_init(&zone->active_cache_list[k+1], &zone->active_cache_list= [k]); + zone->active_cache_count[k] =3D zone->active_cache_count[k+1]; + zone->active_cache_count[k+1] =3D 0; + } =20 - spin_lock(&pagemap_lru_lock); + active_age_bias++; + /* flag this zone as having had activity -> rescan to age up is desired */ + zone->need_scan++; +} + +#define BATCH_WORK_AMOUNT 64 + +/* + * returns the active cache ratio relative to the total active list + * times 10 (eg. 30% cache returns 3) + */ +static inline int cache_ratio(struct zone_struct * zone) +{ + if (!zone->size) + return 0; + return 10 * zone->active_cache_pages / (zone->active_cache_pages + + zone->active_anon_pages + 1); +} + +/* + * If the active_cache list is more than 20% of all active pages, + * we do extra heavy reclaim from this list and less reclaiming of + * the active_anon pages. + * These arrays are indexed by cache_ratio(), ie 0%, 10%, 20% ... 100% + */ +static int active_anon_work[11] =3D {32, 32, 12, 4, 2, 1, 1, 1, 1, = 1, 1}; +static int active_cache_work[11] =3D {32, 32, 52, 60, 62, 63, 63, 63, 63, = 63, 63}; + +/** + * refill_inactive_zone - scan the active list and find pages to deactivate + * @priority: how much are we allowed to scan + * + * This function will scan a portion of the active list of a zone to find + * unused pages, those pages will then be moved to the inactive list. + */ +int refill_inactive_zone(struct zone_struct * zone, int priority, int targ= et) +{ + int maxscan =3D (zone->active_anon_pages + zone->active_cache_pages) >> p= riority; + struct list_head * page_lru; + struct page * page; + int over_rsslimit; + int progress =3D 0; + int ratio; + + /* Take the lock while messing with the list... */ + lru_lock(zone); + if (target < BATCH_WORK_AMOUNT) + target =3D BATCH_WORK_AMOUNT; + + ratio =3D cache_ratio(zone); + + while (maxscan-- && zone->active_anon_pages + zone->active_cache_pages > = 0 && target > 0) { + int anon_work, cache_work; + anon_work =3D active_anon_work[ratio]; + cache_work =3D active_cache_work[ratio]; + + while (anon_work-- >=3D 0 && zone->active_anon_pages) { + if (list_empty(&zone->active_anon_list[0])) { + kachunk_anon(zone); continue; } - } =20 + page_lru =3D zone->active_anon_list[0].prev; + page =3D list_entry(page_lru, struct page, lru); =20 - /* - * If the page is really freeable now, move it to the - * inactive_clean list. - * - * We re-test everything since the page could have been - * used by somebody else while we waited on IO above. - * This test is not safe from races, but only the one - * in reclaim_page() needs to be. 
- */ - pte_chain_lock(page); - if (page->mapping && !PageDirty(page) && !page->pte_chain && - page_count(page) =3D=3D 1) { - del_page_from_inactive_dirty_list(page); - add_page_to_inactive_clean_list(page); + /* Wrong page on list?! (list corruption, should not happen) */ + BUG_ON(unlikely(!PageActiveAnon(page))); + =09 + /* Needed to follow page->mapping */ + if (TryLockPage(page)) { + /* The page is already locked. This for sure means + * someone is doing stuff with it which makes it + * active by definition ;) + */ + del_page_from_active_anon_list(page); + add_page_to_active_anon_list(page, INITIAL_AGE); + continue; + } + + /* + * Do aging on the pages. + */ + pte_chain_lock(page); + if (page_referenced(page, &over_rsslimit) && !over_rsslimit) { + pte_chain_unlock(page); + age_page_up_nolock(page, 0); + UnlockPage(page); + continue; + } pte_chain_unlock(page); + + deactivate_page_nolock(page); + target--; + progress++; UnlockPage(page); - cleaned_pages++; - } else { + } + + while (cache_work-- >=3D 0 && zone->active_cache_pages) { + if (list_empty(&zone->active_cache_list[0])) { + kachunk_cache(zone); + continue; + } + + page_lru =3D zone->active_cache_list[0].prev; + page =3D list_entry(page_lru, struct page, lru); + + /* Wrong page on list?! (list corruption, should not happen) */ + BUG_ON(unlikely(!PageActiveCache(page))); + =09 + /* Needed to follow page->mapping */ + if (TryLockPage(page)) { + /* The page is already locked. This for sure means + * someone is doing stuff with it which makes it + * active by definition ;) + */ + del_page_from_active_cache_list(page); + add_page_to_active_cache_list(page, INITIAL_AGE); + continue; + } + /* - * OK, we don't know what to do with the page. - * It's no use keeping it here, so we move it to - * the active list. + * Do aging on the pages. */ -page_active: - del_page_from_inactive_dirty_list(page); - add_page_to_active_list(page); + pte_chain_lock(page); + if (page_referenced(page, &over_rsslimit) && !over_rsslimit) { + pte_chain_unlock(page); + age_page_up_nolock(page, 0); + UnlockPage(page); + continue; + } pte_chain_unlock(page); + + deactivate_page_nolock(page); + target--; + progress++; UnlockPage(page); } } - spin_unlock(&pagemap_lru_lock); + lru_unlock(zone); =20 - /* Return the number of pages moved to the inactive_clean list. */ - return cleaned_pages + iopages; + return progress; } =20 -/** - * page_launder - clean dirty inactive pages, move to inactive_clean list - * @gfp_mask: what operations we are allowed to do - * - * This function iterates over all zones and calls page_launder_zone(), - * balancing still needs to be added... - */ -int page_launder(int gfp_mask) +static int need_active_anon_scan(struct zone_struct * zone) { - struct zone_struct * zone; - int freed =3D 0; + int low =3D 0, high =3D 0; + int k; + for (k=3D0; k < MAX_AGE/2; k++) + low +=3D zone->active_anon_count[k]; =20 - /* Global balancing while we have a global shortage. */ - if (free_high(ALL_ZONES) >=3D 0) - for_each_zone(zone) - if (free_plenty(zone) >=3D 0) - freed +=3D page_launder_zone(zone, gfp_mask, 0); -=09 - /* Clean up the remaining zones with a serious shortage, if any. 
*/ - for_each_zone(zone) - if (free_low(zone) >=3D 0) { - int fullflush =3D free_min(zone) > 0; - freed +=3D page_launder_zone(zone, gfp_mask, fullflush); - } + for (k=3DMAX_AGE/2; k <=3D MAX_AGE; k++) + high +=3D zone->active_anon_count[k]; + + if (high<low) + return 1; + return 0; +} + +static int need_active_cache_scan(struct zone_struct * zone) +{ + int low =3D 0, high =3D 0; + int k; + for (k=3D0; k < MAX_AGE/2; k++) + low +=3D zone->active_cache_count[k]; + + for (k=3DMAX_AGE/2; k <=3D MAX_AGE; k++) + high +=3D zone->active_cache_count[k]; =20 - return freed; + if (high<low) + return 1; + return 0; } =20 -/** - * refill_inactive_zone - scan the active list and find pages to deactivate - * @priority: how much are we allowed to scan - * - * This function will scan a portion of the active list of a zone to find - * unused pages, those pages will then be moved to the inactive list. +static int scan_active_list(struct zone_struct * zone, int age, int anon) +{ + struct list_head * list, *page_lru , *next; + struct page * page; + int over_rsslimit; + + if (anon) + list =3D &zone->active_anon_list[age]; + else + list =3D &zone->active_cache_list[age]; + + /* Take the lock while messing with the list... */ + lru_lock(zone); + list_for_each_safe(page_lru, next, list) { + page =3D list_entry(page_lru, struct page, lru); + pte_chain_lock(page); + if (page_referenced(page, &over_rsslimit) && !over_rsslimit) + age_page_up_nolock(page, age); + pte_chain_unlock(page); + } + lru_unlock(zone); + return 0; +} + +/* + * Move max_work pages to the inactive clean list as long as there is a ne= ed + * for this. If gfp_mask allows it, sleep for IO to finish. */ -int refill_inactive_zone(struct zone_struct * zone, int priority) +int rebalance_laundry_zone(struct zone_struct * zone, int max_work, unsign= ed int gfp_mask) { - int maxscan =3D zone->active_pages >> priority; - int nr_deactivated =3D 0, over_rsslimit; - int target =3D inactive_high(zone); struct list_head * page_lru; + int max_loop; + int work_done =3D 0; struct page * page; =20 + max_loop =3D max_work; + if (max_loop < BATCH_WORK_AMOUNT) + max_loop =3D BATCH_WORK_AMOUNT; /* Take the lock while messing with the list... */ - spin_lock(&pagemap_lru_lock); - while (maxscan-- && !list_empty(&zone->active_list)) { - page_lru =3D zone->active_list.prev; + lru_lock(zone); + while (max_loop-- && !list_empty(&zone->inactive_laundry_list)) { + page_lru =3D zone->inactive_laundry_list.prev; page =3D list_entry(page_lru, struct page, lru); =20 /* Wrong page on list?! (list corruption, should not happen) */ - if (unlikely(!PageActive(page))) { - printk("VM: refill_inactive, wrong page on list.\n"); - list_del(page_lru); - nr_active_pages--; - continue; - } - =09 - /* Needed to follow page->mapping */ + BUG_ON(unlikely(!PageInactiveLaundry(page))); + + /* TryLock to see if the page IO is done */ if (TryLockPage(page)) { - list_del(page_lru); - list_add(page_lru, &zone->active_list); - continue; + /* + * Page is locked (IO in progress?). If we can sleep, + * wait for it to finish, except when we've already + * done enough work. + */ + if ((gfp_mask & __GFP_WAIT) && (work_done < max_work)) { + int timed_out; + =09 + page_cache_get(page); + lru_unlock(zone); + run_task_queue(&tq_disk); + timed_out =3D wait_on_page_timeout(page, 5 * HZ); + lru_lock(zone); + page_cache_release(page); + /* + * If we timed out and the page has been in + * flight for over 30 seconds, this might not + * be the best page to wait on; move it to + * the head of the dirty list. 
+ */ + if (timed_out & PageInactiveLaundry(page)) { + unsigned char now; + now =3D (jiffies/HZ)&255; + if (now - page->age > 30) { + del_page_from_inactive_laundry_list(page); + add_page_to_inactive_dirty_list(page); + } + continue; + } + /* We didn't make any progress for our caller, + * but we are actively avoiding a livelock + * so undo the decrement and wait on this page + * some more, until IO finishes or we timeout. + */ + max_loop++; + continue; + } else + /* No dice, we can't wait for IO */ + break; } + UnlockPage(page); =20 /* - * If the object the page is in is not in use we don't - * bother with page aging. If the page is touched again - * while on the inactive_clean list it'll be reactivated. - * From here until the end of the current iteration - * both PG_locked and the pte_chain_lock are held. + * If we get here either the IO on the page is done or + * IO never happened because it was clean. Either way + * move it to the inactive clean list. */ - pte_chain_lock(page); - if (!page_mapping_inuse(page)) { - pte_chain_unlock(page); - UnlockPage(page); - drop_page(page); - continue; - } + + /* FIXME: check if the page is still clean or is accessed ? */ + + del_page_from_inactive_laundry_list(page); + add_page_to_inactive_clean_list(page); + work_done++; =20 /* - * Do aging on the pages. + * If we've done the minimal batch of work and there's + * no longer a need to rebalance, abort now. */ - if (page_referenced(page, &over_rsslimit)) { - age_page_up(page); - } else { - age_page_down(page); - } + if ((work_done > BATCH_WORK_AMOUNT) && (!need_rebalance_laundry(zone))) + break; + } =20 - /*=20 - * If the page age is 'hot' and the process using the - * page doesn't exceed its RSS limit we keep the page. - * Otherwise we move it to the inactive_dirty list. + lru_unlock(zone); + return work_done; +} + +/* + * Move max_work pages from the dirty list as long as there is a need. + * Start IO if the gfp_mask allows it. + */ +int rebalance_dirty_zone(struct zone_struct * zone, int max_work, unsigned= int gfp_mask) +{ + struct list_head * page_lru; + int max_loop; + int work_done =3D 0; + struct page * page; + + max_loop =3D max_work; + if (max_loop < BATCH_WORK_AMOUNT) + max_loop =3D BATCH_WORK_AMOUNT; + /* Take the lock while messing with the list... */ + lru_lock(zone); + while (max_loop-- && !list_empty(&zone->inactive_dirty_list)) { + page_lru =3D zone->inactive_dirty_list.prev; + page =3D list_entry(page_lru, struct page, lru); + + /* Wrong page on list?! (list corruption, should not happen) */ + BUG_ON(unlikely(!PageInactiveDirty(page))); + + /* + * Note: launder_page() sleeps so we can't safely look at + * the page after this point! + * + * If we fail (only happens if we can't do IO) we just try + * again on another page; launder_page makes sure we won't + * see the same page over and over again. */ - if (page->age && !over_rsslimit) { - list_del(page_lru); - list_add(page_lru, &zone->active_list); - } else { - deactivate_page_nolock(page); - if (++nr_deactivated > target) { - pte_chain_unlock(page); - UnlockPage(page); - goto done; - } - } - pte_chain_unlock(page); - UnlockPage(page); + if (!launder_page(zone, gfp_mask, page)) + continue; =20 - /* Low latency reschedule point */ - if (current->need_resched) { - spin_unlock(&pagemap_lru_lock); - schedule(); - spin_lock(&pagemap_lru_lock); - } + work_done++; + + /* + * If we've done the minimal batch of work and there's + * no longer any need to rebalance, abort now. 
+ */ + if ((work_done > BATCH_WORK_AMOUNT) && (!need_rebalance_dirty(zone))) + break; } + lru_unlock(zone); + + return work_done; +} + +/* goal percentage sets the goal of the laundry+clean+free of the total zo= ne size */ +int rebalance_inactive_zone(struct zone_struct * zone, int max_work, unsig= ned int gfp_mask, int goal_percentage) +{ + int ret =3D 0; + /* first deactivate memory */ + if (((zone->inactive_laundry_pages + zone->inactive_clean_pages + zone->f= ree_pages)*100 < zone->size * goal_percentage) && + (inactive_high(zone) > 0)) + refill_inactive_zone(zone, 0, max_work + BATCH_WORK_AMOUNT); + + if (need_rebalance_dirty(zone)) + ret +=3D rebalance_dirty_zone(zone, max_work, gfp_mask); + if (need_rebalance_laundry(zone)) + ret +=3D rebalance_laundry_zone(zone, max_work, gfp_mask); =20 -done: - spin_unlock(&pagemap_lru_lock); + /* These pages will become freeable, let the OOM detection know */ + ret +=3D zone->inactive_laundry_pages; =20 - return nr_deactivated; + return ret; } =20 -/** - * refill_inactive - checks all zones and refills the inactive list as nee= ded - * - * This function tries to balance page eviction from all zones by aging - * the pages from each zone in the same ratio until the global inactive - * shortage is resolved. After that it does one last "clean-up" scan to - * fix up local inactive shortages. - */ -int refill_inactive(void) +int rebalance_inactive(unsigned int gfp_mask, int percentage) { - int maxtry =3D 1 << DEF_PRIORITY; - zone_t * zone; + struct zone_struct * zone; + int max_work; int ret =3D 0; =20 - /* Global balancing while we have a global shortage. */ - while (maxtry-- && inactive_low(ALL_ZONES) >=3D 0) { - for_each_zone(zone) { - if (inactive_high(zone) >=3D 0) - ret +=3D refill_inactive_zone(zone, DEF_PRIORITY); - } - } + max_work =3D 4 * BATCH_WORK_AMOUNT; + /* If we're in deeper trouble, do more work */ + if (percentage >=3D 50) + max_work =3D 8 * BATCH_WORK_AMOUNT; =20 - /* Local balancing for zones which really need it. */ - for_each_zone(zone) { - if (inactive_min(zone) >=3D 0) - ret +=3D refill_inactive_zone(zone, 0); - } + for_each_zone(zone) + ret +=3D rebalance_inactive_zone(zone, max_work, gfp_mask, percentage); + /* 4 * BATCH_WORK_AMOUNT needs tuning */ =20 return ret; } @@ -636,7 +819,9 @@ static inline void background_aging(int=20 =20 for_each_zone(zone) if (inactive_high(zone) > 0) - refill_inactive_zone(zone, priority); + refill_inactive_zone(zone, priority, BATCH_WORK_AMOUNT); + for_each_zone(zone) + rebalance_dirty_zone(zone, BATCH_WORK_AMOUNT, GFP_KSWAPD); } =20 /* @@ -655,18 +840,13 @@ static int do_try_to_free_pages(unsigned * Eat memory from filesystem page cache, buffer cache, * dentry, inode and filesystem quota caches. */ - ret +=3D page_launder(gfp_mask); + ret +=3D rebalance_inactive(gfp_mask, 100); ret +=3D shrink_dcache_memory(DEF_PRIORITY, gfp_mask); ret +=3D shrink_icache_memory(1, gfp_mask); #ifdef CONFIG_QUOTA ret +=3D shrink_dqcache_memory(DEF_PRIORITY, gfp_mask); #endif =20 - /* - * Move pages from the active list to the inactive list. - */ - refill_inactive(); - /* =09 * Reclaim unused slab cache memory. */ @@ -682,12 +862,54 @@ static int do_try_to_free_pages(unsigned * Hmm.. Cache shrink failed - time to kill something? * Mhwahahhaha! This is the part I really like. Giggle. 
*/ - if (ret < free_low(ANY_ZONE)) + if (ret < free_low(ANY_ZONE) && (gfp_mask&__GFP_WAIT)) out_of_memory(); =20 return ret; } =20 +/* + * Worker function for kswapd and try_to_free_pages, we get + * called whenever there is a shortage of free/inactive_clean + * pages. + * + * This function will also move pages to the inactive list, + * if needed. + */ +static int do_try_to_free_pages_kswapd(unsigned int gfp_mask) +{ + int ret =3D 0; + struct zone_struct * zone; + + ret +=3D shrink_dcache_memory(DEF_PRIORITY, gfp_mask); + ret +=3D shrink_icache_memory(DEF_PRIORITY, gfp_mask); +#ifdef CONFIG_QUOTA + ret +=3D shrink_dqcache_memory(DEF_PRIORITY, gfp_mask); +#endif + + /* + * Eat memory from filesystem page cache, buffer cache, + * dentry, inode and filesystem quota caches. + */ + rebalance_inactive(gfp_mask, 5); + + for_each_zone(zone) + while (need_rebalance_dirty(zone)) + rebalance_dirty_zone(zone, 16 * BATCH_WORK_AMOUNT, gfp_mask); + + for_each_zone(zone) + if (free_high(zone)>0) + rebalance_laundry_zone(zone, BATCH_WORK_AMOUNT, 0); + + refill_freelist(); + + /* Start IO when needed. */ + if (free_plenty(ALL_ZONES) > 0 || free_low(ANY_ZONE) > 0) + run_task_queue(&tq_disk); + + return ret; +} + /** * refill_freelist - move inactive_clean pages to free list if needed * @@ -764,7 +986,7 @@ int kswapd(void *unused) * zone is very short on free pages. */ if (free_high(ALL_ZONES) >=3D 0 || free_low(ANY_ZONE) > 0) - do_try_to_free_pages(GFP_KSWAPD); + do_try_to_free_pages_kswapd(GFP_KSWAPD); =20 refill_freelist(); =20 @@ -846,7 +1068,7 @@ static void wakeup_memwaiters(void) /* OK, the VM is very loaded. Sleep instead of using all CPU. */ kswapd_overloaded =3D 1; set_current_state(TASK_UNINTERRUPTIBLE); - schedule_timeout(HZ / 4); + schedule_timeout(HZ / 40); kswapd_overloaded =3D 0; return; } @@ -888,6 +1110,7 @@ int try_to_free_pages(unsigned int gfp_m void rss_free_pages(unsigned int gfp_mask) { long pause =3D 0; + struct zone_struct * zone; =20 if (current->flags & PF_MEMALLOC) return; @@ -895,7 +1118,10 @@ void rss_free_pages(unsigned int gfp_mas current->flags |=3D PF_MEMALLOC; =20 do { - page_launder(gfp_mask); + rebalance_inactive(gfp_mask, 100); + for_each_zone(zone) + if (free_plenty(zone) >=3D 0) + rebalance_laundry_zone(zone, BATCH_WORK_AMOUNT, 0); =20 set_current_state(TASK_UNINTERRUPTIBLE); schedule_timeout(pause); @@ -907,11 +1133,78 @@ void rss_free_pages(unsigned int gfp_mas return; } =20 +/* + * The background page scanning daemon, started as a kernel thread + * from the init process.=20 + * + * This is the part that background scans the active list to find + * pages that are referenced and increases their age score. + * It is important that this scan rate is not proportional to vm pressure + * per se otherwise cpu usage becomes unbounded. 
On the other hand, if the= re's + * no VM pressure at all it shouldn't age stuff either otherwise everything + * ends up at the maximum age.=20 + */ +#define MAX_AGING_INTERVAL 5*HZ +#define MIN_AGING_INTERVAL HZ/2 +int kscand(void *unused) +{ + struct task_struct *tsk =3D current; + struct zone_struct * zone; + unsigned long pause =3D MAX_AGING_INTERVAL; + int total_needscan =3D 0; + int age_faster =3D 0; + int num_zones =3D 0; + int age; + + daemonize(); + strcpy(tsk->comm, "kscand"); + sigfillset(&tsk->blocked); +=09 + for (;;) { + set_current_state(TASK_INTERRUPTIBLE); + schedule_timeout(pause);=09 + for_each_zone(zone) { + if (need_active_anon_scan(zone)) { + for (age =3D 0; age < MAX_AGE; age++) { + scan_active_list(zone, age, 1); + if (current->need_resched) + schedule(); + } + } + + if (need_active_cache_scan(zone)) { + for (age =3D 0; age < MAX_AGE; age++) { + scan_active_list(zone, age, 0); + if (current->need_resched) + schedule(); + } + } + + /* Check if we've been aging quickly enough */ + if (zone->need_scan >=3D 2) + age_faster++; + total_needscan +=3D zone->need_scan; + zone->need_scan =3D 0; + num_zones++; + } + if (age_faster) + pause =3D max(pause / 2, MIN_AGING_INTERVAL); + else if (total_needscan < num_zones) + pause =3D min(pause + pause / 2, MAX_AGING_INTERVAL); + + total_needscan =3D 0; + age_faster =3D 0; + num_zones =3D 0; + } +} + + static int __init kswapd_init(void) { printk("Starting kswapd\n"); swap_setup(); kernel_thread(kswapd, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL); + kernel_thread(kscand, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL); return 0; } =20
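A note for anyone trying to follow Arjan's O(1) aging from the flattened diff above: the active lists are split into per-age buckets, a referenced page gets bumped a few steps up with its new age clipped to [0, MAX_AGE], and when the oldest bucket runs empty the whole array is shifted down in one go ("kachunk") rather than every page being touched. The fragment below is only a standalone toy sketch of that idea for illustration - all the toy_* / TOY_* names, the bucket counts and the TOY_MAX_AGE value are invented here and are not the kernel structures or constants from the patch:

/* toy_aging.c - standalone sketch, not kernel code */
#include <stdio.h>

#define TOY_MAX_AGE  15          /* assumed stand-in for the patch's MAX_AGE   */
#define TOY_AGE_STEP  4          /* the patch bumps ages in steps of 4         */

struct toy_zone {
	/* number of pages sitting in each age bucket, 0 = oldest */
	int count[TOY_MAX_AGE + 1];
};

/* Keep an age inside [0, TOY_MAX_AGE]. */
static int clamp_age(int age)
{
	if (age < 0)
		return 0;
	if (age > TOY_MAX_AGE)
		return TOY_MAX_AGE;
	return age;
}

/* A referenced page in bucket 'old_age' moves up by one step. */
static int toy_age_up(struct toy_zone *z, int old_age)
{
	int new_age = clamp_age(old_age + TOY_AGE_STEP);

	z->count[old_age]--;
	z->count[new_age]++;
	return new_age;
}

/*
 * "kachunk": when the oldest bucket is empty, slide every bucket down
 * one slot so bucket 0 is populated again.  This is what keeps global
 * aging cheap instead of walking every page on the active list.
 */
static void toy_kachunk(struct toy_zone *z)
{
	int k;

	if (z->count[0] != 0)
		return;
	for (k = 0; k < TOY_MAX_AGE; k++)
		z->count[k] = z->count[k + 1];
	z->count[TOY_MAX_AGE] = 0;
}

int main(void)
{
	struct toy_zone z = { .count = { 0, 3, 0, 0, 5 } };

	toy_kachunk(&z);		/* bucket 0 empty -> slide everything down */
	toy_age_up(&z, 0);		/* a referenced page jumps to bucket 4     */
	printf("bucket 0 now holds %d pages\n", z.count[0]);
	return 0;
}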

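Related, the new kscand thread at the end of the vmscan.c diff does not scan at a fixed rate: it halves its sleep interval (down to MIN_AGING_INTERVAL, HZ/2) when zones report that aging fell behind, and stretches it back out (up to MAX_AGING_INTERVAL, 5*HZ) when there was little activity, so cpu usage stays bounded when there is no VM pressure. Roughly, and again only as an illustrative sketch - HZ_ASSUMED, next_pause and the busy/idle flags are invented here, only the two interval limits come from the patch - the interval adapts like this:

/* toy_kscand_interval.c - illustrative only, not the patch itself */
#include <stdio.h>

#define HZ_ASSUMED           100                /* assumed ticks per second */
#define MIN_AGING_INTERVAL  (HZ_ASSUMED / 2)    /* fastest rescan: 0.5s     */
#define MAX_AGING_INTERVAL  (5 * HZ_ASSUMED)    /* slowest rescan: 5s       */

/*
 * busy != 0 means at least one zone flagged that aging is falling
 * behind (need_scan >= 2 in the patch); idle means hardly any zone
 * saw activity at all during the last pass.
 */
static long next_pause(long pause, int busy, int idle)
{
	if (busy) {
		pause /= 2;
		if (pause < MIN_AGING_INTERVAL)
			pause = MIN_AGING_INTERVAL;
	} else if (idle) {
		pause += pause / 2;
		if (pause > MAX_AGING_INTERVAL)
			pause = MAX_AGING_INTERVAL;
	}
	return pause;
}

int main(void)
{
	long pause = MAX_AGING_INTERVAL;

	pause = next_pause(pause, 1, 0);	/* pressure: 500 -> 250 ticks */
	pause = next_pause(pause, 1, 0);	/* still busy: 250 -> 125     */
	pause = next_pause(pause, 0, 1);	/* quiet again: 125 -> 187    */
	printf("current pause: %ld ticks\n", pause);
	return 0;
}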