[rfc] "laptop mode"

Andrew Morton (akpm@zip.com.au)
Tue, 04 Jun 2002 15:54:50 -0700


Here's a patch which is designed to make the kernel play more nicely
with portable computers. I've been using it for a couple of days
and it seems to do the right thing. I'm wondering if anyone has
any comments/suggestions/etc.

To test this code you'll also need
http://www.zip.com.au/~akpm/linux/patches/2.4.20/pdflush-sysctl.patch
(Hmm, the server seems to be dead, so both patches are attached here.)

Here's the algorithm, from the Documentation/filesystems/proc.txt
section describing /proc/sys/vm/:

laptop_mode
-----------

Setting this entry to '1' will put the kernel's dirty data writeout
algorithms into a mode which is better suited to laptop/notebook
computers. This mode is specifically designed to minimise the
frequency of disk spinups. Laptop mode works as follows:

- Dirty data remains in memory for longer periods of time (controlled
  by laptop_writeback_centisecs).

- If there is pending dirty data and the disk is spun up for any
  reason (even for a read) then all dirty data will be written back
  shortly afterwards. ie: when the disk is spun up, make good use of
  it.

- When the decision is made to write back some dirty data, the kernel
  will write back all dirty data.
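
The three rules above can be modelled in user space as a small two-state
machine. This is only a sketch under stated assumptions, not the kernel
code: struct lm_model, lm_disk_spun_up() and lm_tick() are hypothetical
names, time is an abstract tick counter, and "flushing" just zeroes a
counter of dirty pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the laptop-mode rules. */
enum lm_state {
	LM_IDLE,                /* waiting for disk activity */
	LM_WAIT_FOR_INACTIVITY  /* just flushed; let the I/O settle */
};

struct lm_model {
	enum lm_state state;
	int disk_activity_seen;     /* set when any request reaches the disk */
	unsigned long last_flush;   /* tick at which we last flushed */
	unsigned long interval;     /* maximum dirty-data age, in ticks */
	unsigned long dirty_pages;  /* stand-in for pending dirty data */
	unsigned long flushes;      /* number of flushes performed */
};

/* Model of a driver reporting that the disk is spinning. */
static void lm_disk_spun_up(struct lm_model *m)
{
	assert(m != NULL);
	m->disk_activity_seen = 1;
}

/*
 * Periodic callback. Returns 1 if all dirty data was written back on
 * this tick, 0 otherwise. Assumes `now` never goes backwards.
 */
static int lm_tick(struct lm_model *m, unsigned long now)
{
	assert(m != NULL);

	/* Dirty data has reached its maximum age: force a writeback. */
	if (now - m->last_flush > m->interval)
		m->disk_activity_seen = 1;

	switch (m->state) {
	case LM_IDLE:
		if (m->disk_activity_seen) {
			m->dirty_pages = 0;     /* write back everything */
			m->flushes++;
			m->last_flush = now;
			m->state = LM_WAIT_FOR_INACTIVITY;
			return 1;
		}
		break;
	case LM_WAIT_FOR_INACTIVITY:
		/* Our own writeback generated disk activity; ignore it. */
		m->disk_activity_seen = 0;
		m->state = LM_IDLE;
		break;
	}
	return 0;
}
```

The second state exists so that the I/O generated by the flush itself is
not mistaken for a fresh spinup on the next tick.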

laptop_writeback_centisecs
--------------------------

This tunable determines the maximum age of dirty data when the machine
is operating in Laptop mode. The default value is 30000 - five
minutes. This means that if applications are generating a small amount
of write traffic, the disk will spin up once per five minutes.

If the disk is spun up for any other reason (such as for a read) then
all dirty data will be flushed anyway, and this timer is reset to zero.

laptop_writeback_centisecs has no effect when the machine is not
operating in Laptop mode.
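
As a quick check of the default: 30000 centiseconds is 300 seconds, which
is five minutes. The patch turns centisecond tunables into jiffies with
(centisecs * HZ) / 100. A minimal sketch of that arithmetic follows; the
helper names are hypothetical and the hz parameter stands in for the
kernel's HZ constant.

```c
#include <assert.h>

/*
 * Convert a *_centisecs tunable into timer ticks, given the number of
 * ticks per second. Mirrors the (centisecs * HZ) / 100 expression used
 * in the patch; hz is passed in because HZ is a kernel constant.
 */
static unsigned long centisecs_to_ticks(unsigned long centisecs,
					unsigned long hz)
{
	assert(hz > 0);
	return (centisecs * hz) / 100;
}

/* Convert a *_centisecs tunable into whole seconds. */
static unsigned long centisecs_to_seconds(unsigned long centisecs)
{
	return centisecs / 100;
}
```

With hz = 100 the tick count equals the centisecond value, so the default
of 30000 is 30000 ticks.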

This implementation doesn't try to be very smart: there's a direct
call out of do_ide_request() into the writeback code. This couldn't
be done from within ll_rw_blk.c, because then a write to the ramdisk
would spin the disk up. Even as-is, a read from the IDE CDROM
drive will cause the IDE hard disk to spin up and flush data, so
the call in do_ide_request() should probably only be made if the
device is writable. Suggestions are sought, but let's try not to
get too fancy here...
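
One way the "only if the device is writable" idea might look, as a
user-space model rather than real IDE driver code. Everything here is an
assumption for illustration: struct model_dev, its writable flag,
model_disk_spun_up() and model_do_request() are hypothetical, and the
patch itself calls disk_spun_up() unconditionally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device; not a kernel structure. */
struct model_dev {
	int writable;   /* 0 for e.g. a CD-ROM, 1 for a hard disk */
};

/* Counts how often the writeback code was told about a spinup. */
static int spinup_signals;

/* Model of the disk_spun_up() hook added by the patch. */
static void model_disk_spun_up(void)
{
	spinup_signals++;
}

/*
 * Model of the request entry point: only a writable device reports
 * a spinup, so a read from read-only media triggers no flush.
 */
static void model_do_request(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->writable)
		model_disk_spun_up();
	/* ... the request itself would be dispatched here ... */
}
```

A read from the CD-ROM then leaves the dirty data alone, while any
request to the hard disk still triggers the flush.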
--------------0E196CBCA810BB4B5A6FAB4A
Content-Type: text/x-diff; charset=us-ascii;
name="pdflush-sysctl.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="pdflush-sysctl.patch"

--- 2.5.19/include/linux/sysctl.h~pdflush-sysctl Sun Jun 2 00:46:24 2002
+++ 2.5.19-akpm/include/linux/sysctl.h Sun Jun 2 00:47:49 2002
@@ -130,16 +130,21 @@ enum
/* CTL_VM names: */
enum
{
- VM_SWAPCTL=1, /* struct: Set vm swapping control */
- VM_SWAPOUT=2, /* int: Linear or sqrt() swapout for hogs */
- VM_FREEPG=3, /* struct: Set free page thresholds */
+ VM_UNUSED1=1, /* was: struct: Set vm swapping control */
+ VM_UNUSED2=2, /* was: int: Linear or sqrt() swapout for hogs */
+ VM_UNUSED3=3, /* was: struct: Set free page thresholds */
VM_BDFLUSH_UNUSED=4, /* Spare */
VM_OVERCOMMIT_MEMORY=5, /* Turn off the virtual memory safety limit */
- VM_BUFFERMEM=6, /* struct: Set buffer memory thresholds */
- VM_PAGECACHE=7, /* struct: Set cache memory thresholds */
+ VM_UNUSED4=6, /* was: struct: Set buffer memory thresholds */
+ VM_UNUSED5=7, /* was: struct: Set cache memory thresholds */
VM_PAGERDAEMON=8, /* struct: Control kswapd behaviour */
- VM_PGT_CACHE=9, /* struct: Set page table cache parameters */
- VM_PAGE_CLUSTER=10 /* int: set number of pages to swap together */
+ VM_UNUSED6=9, /* was: struct: Set page table cache parameters */
+ VM_PAGE_CLUSTER=10, /* int: set number of pages to swap together */
+ VM_DIRTY_BACKGROUND=11, /* dirty_background_ratio */
+ VM_DIRTY_ASYNC=12, /* dirty_async_ratio */
+ VM_DIRTY_SYNC=13, /* dirty_sync_ratio */
+ VM_DIRTY_WB_CS=14, /* dirty_writeback_centisecs */
+ VM_DIRTY_EXPIRE_CS=15, /* dirty_expire_centisecs */
};


--- 2.5.19/kernel/sysctl.c~pdflush-sysctl Sun Jun 2 00:46:24 2002
+++ 2.5.19-akpm/kernel/sysctl.c Sun Jun 2 00:46:24 2002
@@ -31,6 +31,7 @@
#include <linux/init.h>
#include <linux/sysrq.h>
#include <linux/highuid.h>
+#include <linux/writeback.h>

#include <asm/uaccess.h>

@@ -264,6 +265,19 @@ static ctl_table vm_table[] = {
&pager_daemon, sizeof(pager_daemon_t), 0644, NULL, &proc_dointvec},
{VM_PAGE_CLUSTER, "page-cluster",
&page_cluster, sizeof(int), 0644, NULL, &proc_dointvec},
+ {VM_DIRTY_BACKGROUND, "dirty_background_ratio",
+ &dirty_background_ratio, sizeof(dirty_background_ratio),
+ 0644, NULL, &proc_dointvec},
+ {VM_DIRTY_ASYNC, "dirty_async_ratio", &dirty_async_ratio,
+ sizeof(dirty_async_ratio), 0644, NULL, &proc_dointvec},
+ {VM_DIRTY_SYNC, "dirty_sync_ratio", &dirty_sync_ratio,
+ sizeof(dirty_sync_ratio), 0644, NULL, &proc_dointvec},
+ {VM_DIRTY_WB_CS, "dirty_writeback_centisecs",
+ &dirty_writeback_centisecs, sizeof(dirty_writeback_centisecs), 0644,
+ NULL, &proc_dointvec},
+ {VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs",
+ &dirty_expire_centisecs, sizeof(dirty_expire_centisecs), 0644,
+ NULL, &proc_dointvec},
{0}
};

--- 2.5.19/mm/page-writeback.c~pdflush-sysctl Sun Jun 2 00:46:24 2002
+++ 2.5.19-akpm/mm/page-writeback.c Sun Jun 2 00:46:24 2002
@@ -26,29 +26,56 @@
* The maximum number of pages to writeout in a single bdflush/kupdate
* operation. We do this so we don't hold I_LOCK against an inode for
* enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.
+ * been forced to throttle against that inode. Also, the code reevaluates
+ * the dirty state each time it has written this many pages.
*/
#define MAX_WRITEBACK_PAGES 1024

/*
- * Memory thresholds, in percentages
- * FIXME: expose these via /proc or whatever.
+ * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
+ * will look to see if it needs to force writeback or throttling. Probably
+ * should be scaled by memory size.
+ */
+#define RATELIMIT_PAGES 1000
+
+/*
+ * When balance_dirty_pages decides that the caller needs to perform some
+ * non-background writeback, this is how many pages it will attempt to write.
+ * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * large amounts of I/O are submitted.
+ */
+#define SYNC_WRITEBACK_PAGES 1500
+
+
+/*
+ * Dirty memory thresholds, in percentages
*/

/*
* Start background writeback (via pdflush) at this level
*/
-static int dirty_background_ratio = 40;
+int dirty_background_ratio = 40;

/*
* The generator of dirty data starts async writeback at this level
*/
-static int dirty_async_ratio = 50;
+int dirty_async_ratio = 50;

/*
* The generator of dirty data performs sync writeout at this level
*/
-static int dirty_sync_ratio = 60;
+int dirty_sync_ratio = 60;
+
+/*
+ * The interval between `kupdate'-style writebacks.
+ */
+int dirty_writeback_centisecs = 5 * 100;
+
+/*
+ * The largest amount of time for which data is allowed to remain dirty
+ */
+int dirty_expire_centisecs = 30 * 100;
+

static void background_writeout(unsigned long _min_pages);

@@ -84,12 +111,12 @@ void balance_dirty_pages(struct address_
sync_thresh = (dirty_sync_ratio * tot) / 100;

if (dirty_and_writeback > sync_thresh) {
- int nr_to_write = 1500;
+ int nr_to_write = SYNC_WRITEBACK_PAGES;

writeback_unlocked_inodes(&nr_to_write, WB_SYNC_LAST, NULL);
get_page_state(&ps);
} else if (dirty_and_writeback > async_thresh) {
- int nr_to_write = 1500;
+ int nr_to_write = SYNC_WRITEBACK_PAGES;

writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, NULL);
get_page_state(&ps);
@@ -118,7 +145,7 @@ void balance_dirty_pages_ratelimited(str
int cpu;

cpu = get_cpu();
- if (ratelimits[cpu].count++ >= 1000) {
+ if (ratelimits[cpu].count++ >= RATELIMIT_PAGES) {
ratelimits[cpu].count = 0;
put_cpu();
balance_dirty_pages(mapping);
@@ -162,17 +189,6 @@ void wakeup_bdflush(void)
pdflush_operation(background_writeout, ps.nr_dirty);
}

-/*
- * The interval between `kupdate'-style writebacks.
- *
- * Traditional kupdate writes back data which is 30-35 seconds old.
- * This one does that, but it also writes back just 1/6th of the dirty
- * data. This is to avoid great I/O storms.
- *
- * We chunk the writes up and yield, to permit any throttled page-allocators
- * to perform their I/O against a large file.
- */
-static int wb_writeback_jifs = 5 * HZ;
static struct timer_list wb_timer;

/*
@@ -183,9 +199,9 @@ static struct timer_list wb_timer;
* just walks the superblock inode list, writing back any inodes which are
* older than a specific point in time.
*
- * Try to run once per wb_writeback_jifs jiffies. But if a writeback event
- * takes longer than a wb_writeback_jifs interval, then leave a one-second
- * gap.
+ * Try to run once per dirty_writeback_centisecs. But if a writeback event
+ * takes longer than a dirty_writeback_centisecs interval, then leave a
+ * one-second gap.
*
* older_than_this takes precedence over nr_to_write. So we'll only write back
* all dirty pages if they are all attached to "old" mappings.
@@ -201,9 +217,9 @@ static void wb_kupdate(unsigned long arg
sync_supers();
get_page_state(&ps);

- oldest_jif = jiffies - 30*HZ;
+ oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
start_jif = jiffies;
- next_jif = start_jif + wb_writeback_jifs;
+ next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100;
nr_to_write = ps.nr_dirty;
writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, &oldest_jif);
blk_run_queues();
@@ -223,7 +239,7 @@ static void wb_timer_fn(unsigned long un
static int __init wb_timer_init(void)
{
init_timer(&wb_timer);
- wb_timer.expires = jiffies + wb_writeback_jifs;
+ wb_timer.expires = jiffies + (dirty_writeback_centisecs * HZ) / 100;
wb_timer.data = 0;
wb_timer.function = wb_timer_fn;
add_timer(&wb_timer);
--- 2.5.19/include/linux/writeback.h~pdflush-sysctl Sun Jun 2 00:46:24 2002
+++ 2.5.19-akpm/include/linux/writeback.h Sun Jun 2 00:46:24 2002
@@ -45,6 +45,12 @@ static inline void wait_on_inode(struct
/*
* mm/page-writeback.c
*/
+extern int dirty_background_ratio;
+extern int dirty_async_ratio;
+extern int dirty_sync_ratio;
+extern int dirty_writeback_centisecs;
+extern int dirty_expire_centisecs;
+
void balance_dirty_pages(struct address_space *mapping);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
--- 2.5.19/Documentation/filesystems/proc.txt~pdflush-sysctl Sun Jun 2 01:24:03 2002
+++ 2.5.19-akpm/Documentation/filesystems/proc.txt Sun Jun 2 01:30:44 2002
@@ -948,120 +948,43 @@ program to load modules on demand.
-----------------------------------------------

The files in this directory can be used to tune the operation of the virtual
-memory (VM) subsystem of the Linux kernel. In addition, one of the files
-(bdflush) has some influence on disk usage.
+memory (VM) subsystem of the Linux kernel.

-bdflush
--------
+dirty_background_ratio
+----------------------

-This file controls the operation of the bdflush kernel daemon. It currently
-contains nine integer values, six of which are actually used by the kernel.
-They are listed in table 2-2.
-
-
-Table 2-2: Parameters in /proc/sys/vm/bdflush
-..............................................................................
- Value Meaning
- nfract Percentage of buffer cache dirty to activate bdflush
- ndirty Maximum number of dirty blocks to write out per wake-cycle
- nrefill Number of clean buffers to try to obtain each time we call refill
- nref_dirt buffer threshold for activating bdflush when trying to refill
- buffers.
- dummy Unused
- age_buffer Time for normal buffer to age before we flush it
- age_super Time for superblock to age before we flush it
- dummy Unused
- dummy Unused
-..............................................................................
+Contains, as a percentage of total system memory, the number of pages at which
+the pdflush background writeback daemon will start writing out dirty data.

-nfract
-------
-
-This parameter governs the maximum number of dirty buffers in the buffer
-cache. Dirty means that the contents of the buffer still have to be written to
-disk (as opposed to a clean buffer, which can just be forgotten about).
-Setting this to a higher value means that Linux can delay disk writes for a
-long time, but it also means that it will have to do a lot of I/O at once when
-memory becomes short. A lower value will spread out disk I/O more evenly.
-
-ndirty
-------
-
-Ndirty gives the maximum number of dirty buffers that bdflush can write to the
-disk at one time. A high value will mean delayed, bursty I/O, while a small
-value can lead to memory shortage when bdflush isn't woken up often enough.
-
-nrefill
--------
-
-This is the number of buffers that bdflush will add to the list of free
-buffers when refill_freelist() is called. It is necessary to allocate free
-buffers beforehand, since the buffers are often different sizes than the
-memory pages and some bookkeeping needs to be done beforehand. The higher the
-number, the more memory will be wasted and the less often refill_freelist()
-will need to run.
-
-nref_dirt
----------
-
-When refill_freelist() comes across more than nref_dirt dirty buffers, it will
-wake up bdflush.
-
-age_buffer and age_super
-------------------------
-
-Finally, the age_buffer and age_super parameters govern the maximum time Linux
-waits before writing out a dirty buffer to disk. The value is expressed in
-jiffies (clockticks), the number of jiffies per second is 100. Age_buffer is
-the maximum age for data blocks, while age_super is for filesystems meta data.
-
-buffermem
----------
-
-The three values in this file control how much memory should be used for
-buffer memory. The percentage is calculated as a percentage of total system
-memory.
-
-The values are:
-
-min_percent
------------
-
-This is the minimum percentage of memory that should be spent on buffer
-memory.
-
-borrow_percent
---------------
-
-When Linux is short on memory, and the buffer cache uses more than it has been
-allotted, the memory management (MM) subsystem will prune the buffer cache
-more heavily than other memory to compensate.
-
-max_percent
------------
-
-This is the maximum amount of memory that can be used for buffer memory.
-
-freepages
----------
+dirty_async_ratio
+-----------------

-This file contains three values: min, low and high:
+Contains, as a percentage of total system memory, the number of pages at which
+a process which is generating disk writes will itself start writing out dirty
+data.
+
+dirty_sync_ratio
+----------------
+
+Contains, as a percentage of total system memory, the number of pages at which
+a process which is generating disk writes will itself start writing out dirty
+data and waiting upon completion of that writeout.
+
+dirty_writeback_centisecs
+-------------------------
+
+The pdflush writeback daemons will periodically wake up and write `old' data
+out to disk. This tunable expresses the interval between those wakeups, in
+hundredths of a second.
+
+dirty_expire_centisecs
+----------------------
+
+This tunable is used to define when dirty data is old enough to be eligible
+for writeout by the pdflush daemons. It is expressed in hundredths of a second.
+Data which has been dirty in-memory for longer than this interval will be
+written out next time a pdflush daemon wakes up.

-min
----
-When the number of free pages in the system reaches this number, only the
-kernel can allocate more memory.
-
-low
----
-If the number of free pages falls below this point, the kernel starts swapping
-aggressively.
-
-high
-----
-The kernel tries to keep up to this amount of memory free; if memory falls
-below this point, the kernel starts gently swapping in the hopes that it never
-has to do really aggressive swapping.

kswapd
------
@@ -1113,79 +1036,6 @@ On the other hand, enabling this feat
and thrash the system to death, so large and/or important servers will want to
set this value to 0.

-pagecache
----------
-
-This file does exactly the same job as buffermem, only this file controls the
-amount of memory allowed for memory mapping and generic caching of files.
-
-You don't want the minimum level to be too low, otherwise your system might
-thrash when memory is tight or fragmentation is high.
-
-pagetable_cache
----------------
-
-The kernel keeps a number of page tables in a per-processor cache (this helps
-a lot on SMP systems). The cache size for each processor will be between the
-low and the high value.
-
-On a low-memory, single CPU system, you can safely set these values to 0 so
-you don't waste memory. It is used on SMP systems so that the system can
-perform fast pagetable allocations without having to acquire the kernel memory
-lock.
-
-For large systems, the settings are probably fine. For normal systems they
-won't hurt a bit. For small systems ( less than 16MB ram) it might be
-advantageous to set both values to 0.
-
-swapctl
--------
-
-This file contains no less than 8 variables. All of these values are used by
-kswapd.
-
-The first four variables
-* sc_max_page_age,
-* sc_page_advance,
-* sc_page_decline and
-* sc_page_initial_age
-are used to keep track of Linux's page aging. Page aging is a bookkeeping
-method to track which pages of memory are often used, and which pages can be
-swapped out without consequences.
-
-When a page is swapped in, it starts at sc_page_initial_age (default 3) and
-when the page is scanned by kswapd, its age is adjusted according to the
-following scheme:
-
-* If the page was used since the last time we scanned, its age is increased
- by sc_page_advance (default 3). Where the maximum value is given by
- sc_max_page_age (default 20).
-* Otherwise (meaning it wasn't used) its age is decreased by sc_page_decline
- (default 1).
-
-When a page reaches age 0, it's ready to be swapped out.
-
-The variables sc_age_cluster_fract, sc_age_cluster_min, sc_pageout_weight and
-sc_bufferout_weight, can be used to control kswapd's aggressiveness in
-swapping out pages.
-
-Sc_age_cluster_fract is used to calculate how many pages from a process are to
-be scanned by kswapd. The formula used is
-
-(sc_age_cluster_fract divided by 1024) times resident set size
-
-So if you want kswapd to scan the whole process, sc_age_cluster_fract needs to
-have a value of 1024. The minimum number of pages kswapd will scan is
-represented by sc_age_cluster_min, which is done so that kswapd will also scan
-small processes.
-
-The values of sc_pageout_weight and sc_bufferout_weight are used to control
-how many tries kswapd will make in order to swap out one page/buffer. These
-values can be used to fine-tune the ratio between user pages and buffer/cache
-memory. When you find that your Linux system is swapping out too many process
-pages in order to satisfy buffer memory demands, you may want to either
-increase sc_bufferout_weight, or decrease the value of sc_pageout_weight.
-
2.5 /proc/sys/dev - Device specific parameters
----------------------------------------------

--- 2.5.19/Documentation/sysctl/vm.txt~pdflush-sysctl Sun Jun 2 01:31:17 2002
+++ 2.5.19-akpm/Documentation/sysctl/vm.txt Sun Jun 2 01:33:30 2002
@@ -9,116 +9,28 @@ This file contains the documentation for
/proc/sys/vm and is valid for Linux kernel version 2.2.

The files in this directory can be used to tune the operation
-of the virtual memory (VM) subsystem of the Linux kernel, and
-one of the files (bdflush) also has a little influence on disk
-usage.
+of the virtual memory (VM) subsystem of the Linux kernel and
+the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:
-- bdflush
-- buffermem
-- freepages
- kswapd
- overcommit_memory
- page-cluster
-- pagecache
-- pagetable_cache
+- dirty_async_ratio
+- dirty_background_ratio
+- dirty_expire_centisecs
+- dirty_sync_ratio
+- dirty_writeback_centisecs

==============================================================

-bdflush:
+dirty_async_ratio, dirty_background_ratio, dirty_expire_centisecs,
+dirty_sync_ratio, dirty_writeback_centisecs:

-This file controls the operation of the bdflush kernel
-daemon. The source code to this struct can be found in
-linux/fs/buffer.c. It currently contains 9 integer values,
-of which 4 are actually used by the kernel.
-
-From linux/fs/buffer.c:
---------------------------------------------------------------
-union bdflush_param {
- struct {
- int nfract; /* Percentage of buffer cache dirty to
- activate bdflush */
- int dummy1; /* old "ndirty" */
- int dummy2; /* old "nrefill" */
- int dummy3; /* unused */
- int interval; /* jiffies delay between kupdate flushes */
- int age_buffer; /* Time for normal buffer to age */
- int nfract_sync;/* Percentage of buffer cache dirty to
- activate bdflush synchronously */
- int dummy4; /* unused */
- int dummy5; /* unused */
- } b_un;
- unsigned int data[N_PARAM];
-} bdf_prm = {{30, 64, 64, 256, 5*HZ, 30*HZ, 60, 0, 0}};
---------------------------------------------------------------
-
-int nfract:
-The first parameter governs the maximum number of dirty
-buffers in the buffer cache. Dirty means that the contents
-of the buffer still have to be written to disk (as opposed
-to a clean buffer, which can just be forgotten about).
-Setting this to a high value means that Linux can delay disk
-writes for a long time, but it also means that it will have
-to do a lot of I/O at once when memory becomes short. A low
-value will spread out disk I/O more evenly, at the cost of
-more frequent I/O operations. The default value is 30%,
-the minimum is 0%, and the maximum is 100%.
-
-int interval:
-The fifth parameter, interval, is the minimum rate at
-which kupdate will wake and flush. The value is expressed in
-jiffies (clockticks), the number of jiffies per second is
-normally 100 (Alpha is 1024). Thus, x*HZ is x seconds. The
-default value is 5 seconds, the minimum is 0 seconds, and the
-maximum is 600 seconds.
-
-int age_buffer:
-The sixth parameter, age_buffer, governs the maximum time
-Linux waits before writing out a dirty buffer to disk. The
-value is in jiffies. The default value is 30 seconds,
-the minimum is 1 second, and the maximum 6,000 seconds.
-
-int nfract_sync:
-The seventh parameter, nfract_sync, governs the percentage
-of buffer cache that is dirty before bdflush activates
-synchronously. This can be viewed as the hard limit before
-bdflush forces buffers to disk. The default is 60%, the
-minimum is 0%, and the maximum is 100%.
-
-==============================================================
-buffermem:
-
-The three values in this file correspond to the values in
-the struct buffer_mem. It controls how much memory should
-be used for buffer memory. The percentage is calculated
-as a percentage of total system memory.
-
-The values are:
-min_percent -- this is the minimum percentage of memory
- that should be spent on buffer memory
-borrow_percent -- UNUSED
-max_percent -- UNUSED
-
-==============================================================
-freepages:
-
-This file contains the values in the struct freepages. That
-struct contains three members: min, low and high.
-
-The meaning of the numbers is:
-
-freepages.min When the number of free pages in the system
- reaches this number, only the kernel can
- allocate more memory.
-freepages.low If the number of free pages gets below this
- point, the kernel starts swapping aggressively.
-freepages.high The kernel tries to keep up to this amount of
- memory free; if memory comes below this point,
- the kernel gently starts swapping in the hopes
- that it never has to do real aggressive swapping.
+See Documentation/filesystems/proc.txt

==============================================================

@@ -180,38 +92,3 @@ The number of pages the kernel reads in
2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
for swap because we only cluster swap data in 32-page groups.

-==============================================================
-
-pagecache:
-
-This file does exactly the same as buffermem, only this
-file controls the struct page_cache, and thus controls
-the amount of memory used for the page cache.
-
-In 2.2, the page cache is used for 3 main purposes:
-- caching read() data from files
-- caching mmap()ed data and executable files
-- swap cache
-
-When your system is both deep in swap and high on cache,
-it probably means that a lot of the swapped data is being
-cached, making for more efficient swapping than possible
-with the 2.0 kernel.
-
-==============================================================
-
-pagetable_cache:
-
-The kernel keeps a number of page tables in a per-processor
-cache (this helps a lot on SMP systems). The cache size for
-each processor will be between the low and the high value.
-
-On a low-memory, single CPU system you can safely set these
-values to 0 so you don't waste the memory. On SMP systems it
-is used so that the system can do fast pagetable allocations
-without having to acquire the kernel memory lock.
-
-For large systems, the settings are probably OK. For normal
-systems they won't hurt a bit. For small systems (<16MB ram)
-it might be advantageous to set both values to 0.
-

--------------0E196CBCA810BB4B5A6FAB4A
Content-Type: text/x-diff; charset=us-ascii;
name="laptop-mode.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="laptop-mode.patch"

--- 2.5.20/mm/page-writeback.c~laptop-mode Tue Jun 4 15:27:54 2002
+++ 2.5.20-akpm/mm/page-writeback.c Tue Jun 4 15:27:54 2002
@@ -76,6 +76,21 @@ int dirty_writeback_centisecs = 5 * 100;
*/
int dirty_expire_centisecs = 30 * 100;

+/*
+ * A global sysctl-controlled flag which puts the machine into "laptop mode"
+ */
+int laptop_mode;
+
+/*
+ * When in laptop mode, this sysctl sets the interval between global flushes,
+ * in centiseconds.
+ */
+int laptop_writeback_centisecs = 5 * 60 * 100;
+
+/*
+ * A flag which is set when the disk is spun up.
+ */
+static int disk_activity_seen;

static void background_writeout(unsigned long _min_pages);

@@ -157,6 +172,8 @@ void balance_dirty_pages_ratelimited(str
/*
* writeback at least _min_pages, and keep writing until the amount of dirty
* memory is less than the background threshold, or until we're all clean.
+ *
+ * In laptop mode, just write all dirty data.
*/
static void background_writeout(unsigned long _min_pages)
{
@@ -169,7 +186,8 @@ static void background_writeout(unsigned
struct page_state ps;

get_page_state(&ps);
- if (ps.nr_dirty < background_thresh && min_pages <= 0)
+ if (!laptop_mode && ps.nr_dirty < background_thresh &&
+ min_pages <= 0)
break;
nr_to_write = MAX_WRITEBACK_PAGES;
writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, NULL);
@@ -205,8 +223,10 @@ static struct timer_list wb_timer;
*
* older_than_this takes precedence over nr_to_write. So we'll only write back
* all dirty pages if they are all attached to "old" mappings.
+ *
+ * When operating in laptop mode, writeback all dirty data.
*/
-static void wb_kupdate(unsigned long arg)
+static unsigned long wb_kupdate(void)
{
unsigned long oldest_jif;
unsigned long start_jif;
@@ -216,14 +236,66 @@ static void wb_kupdate(unsigned long arg

sync_supers();
get_page_state(&ps);
-
+ nr_to_write = ps.nr_dirty;
oldest_jif = jiffies - (dirty_expire_centisecs * HZ) / 100;
start_jif = jiffies;
next_jif = start_jif + (dirty_writeback_centisecs * HZ) / 100;
- nr_to_write = ps.nr_dirty;
- writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, &oldest_jif);
+
+ if (laptop_mode) {
+ nr_to_write *= 2;
+ writeback_unlocked_inodes(&nr_to_write, WB_SYNC_NONE, NULL);
+ } else {
+ writeback_unlocked_inodes(&nr_to_write,
+ WB_SYNC_NONE, &oldest_jif);
+ }
blk_run_queues();
yield();
+ return next_jif;
+}
+
+/*
+ * Laptop mode: flush all dirty data once the disk is seen active or the laptop interval expires.
+ */
+static unsigned long laptop_kupdate(void)
+{
+ static enum {
+ idle, /* Waiting for disk activity */
+ wait_for_inactivity, /* Waiting for I/O to stop */
+ } state = idle;
+ static unsigned long last_flush_jifs;
+ unsigned long interval = (laptop_writeback_centisecs * HZ) / 100;
+ unsigned long ret = jiffies + (dirty_writeback_centisecs * HZ) / 100;
+
+ if (time_after(last_flush_jifs, jiffies))
+ last_flush_jifs = jiffies; /* sanify the start-up state */
+
+ if (time_after(jiffies, last_flush_jifs + interval))
+ disk_activity_seen = 1; /* force writeback */
+
+ switch (state) {
+ case idle:
+ if (disk_activity_seen) {
+ ret = wb_kupdate();
+ last_flush_jifs = jiffies;
+ state = wait_for_inactivity;
+ }
+ break;
+ case wait_for_inactivity:
+ disk_activity_seen = 0;
+ state = idle;
+ }
+ return ret;
+}
+
+static void kupdate(unsigned long unused)
+{
+ unsigned long next_jif;
+ unsigned long (*fn)(void);
+
+ fn = wb_kupdate;
+ if (laptop_mode)
+ fn = laptop_kupdate;
+ next_jif = (*fn)();

if (time_before(next_jif, jiffies + HZ))
next_jif = jiffies + HZ;
@@ -232,7 +304,7 @@ static void wb_kupdate(unsigned long arg

static void wb_timer_fn(unsigned long unused)
{
- if (pdflush_operation(wb_kupdate, 0) < 0)
+ if (pdflush_operation(kupdate, 0) < 0)
mod_timer(&wb_timer, jiffies + HZ);
}

@@ -246,6 +318,23 @@ static int __init wb_timer_init(void)
return 0;
}
module_init(wb_timer_init);
+
+/*
+ * Device drivers call in here to indicate that the disk has spun up.
+ */
+void disk_spun_up(void)
+{
+ if (laptop_mode && !disk_activity_seen)
+ disk_activity_seen = 1;
+}
+EXPORT_SYMBOL(disk_spun_up);
+
+/*
+ * Journalling filesystems which perform their own writeback scheduling
+ * need these.
+ */
+EXPORT_SYMBOL(laptop_mode);
+EXPORT_SYMBOL(laptop_writeback_centisecs);

/*
* A library function, which implements the vm_writeback a_op. It's fairly
--- 2.5.20/include/linux/sysctl.h~laptop-mode Tue Jun 4 15:27:54 2002
+++ 2.5.20-akpm/include/linux/sysctl.h Tue Jun 4 15:27:54 2002
@@ -145,6 +145,8 @@ enum
VM_DIRTY_SYNC=13, /* dirty_sync_ratio */
VM_DIRTY_WB_CS=14, /* dirty_writeback_centisecs */
VM_DIRTY_EXPIRE_CS=15, /* dirty_expire_centisecs */
+ VM_LAPTOP_MODE=16, /* Enter "laptop" writeback mode */
+ VM_LAPTOP_WB_CS=17, /* Periodic flushtime when in laptop mode */
};


--- 2.5.20/kernel/sysctl.c~laptop-mode Tue Jun 4 15:27:54 2002
+++ 2.5.20-akpm/kernel/sysctl.c Tue Jun 4 15:27:54 2002
@@ -20,6 +20,7 @@

#include <linux/config.h>
#include <linux/mm.h>
+#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/sysctl.h>
#include <linux/swapctl.h>
@@ -278,6 +279,11 @@ static ctl_table vm_table[] = {
{VM_DIRTY_EXPIRE_CS, "dirty_expire_centisecs",
&dirty_expire_centisecs, sizeof(dirty_expire_centisecs), 0644,
NULL, &proc_dointvec},
+ {VM_LAPTOP_MODE, "laptop_mode", &laptop_mode, sizeof(laptop_mode),
+ 0644, NULL, &proc_dointvec},
+ {VM_LAPTOP_WB_CS, "laptop_writeback_centisecs",
+ &laptop_writeback_centisecs, sizeof(laptop_writeback_centisecs),
+ 0644, NULL, &proc_dointvec},
{0}
};

--- 2.5.20/drivers/ide/ide.c~laptop-mode Tue Jun 4 15:27:54 2002
+++ 2.5.20-akpm/drivers/ide/ide.c Tue Jun 4 15:27:54 2002
@@ -1043,6 +1043,7 @@ static void do_request(struct ata_channe

void do_ide_request(request_queue_t *q)
{
+ disk_spun_up();
do_request(q->queuedata);
}

--- 2.5.20/fs/jbd/transaction.c~laptop-mode Tue Jun 4 15:27:54 2002
+++ 2.5.20-akpm/fs/jbd/transaction.c Tue Jun 4 15:27:54 2002
@@ -45,7 +45,8 @@ extern spinlock_t journal_datalist_lock;

static transaction_t * get_transaction (journal_t * journal, int is_try)
{
- transaction_t * transaction;
+ transaction_t *transaction;
+ unsigned long expires;

transaction = jbd_kmalloc (sizeof (transaction_t), GFP_NOFS);
if (!transaction)
@@ -56,14 +57,25 @@ static transaction_t * get_transaction (
transaction->t_journal = journal;
transaction->t_state = T_RUNNING;
transaction->t_tid = journal->j_transaction_sequence++;
- transaction->t_expires = jiffies + journal->j_commit_interval;

- /* Set up the commit timer for the new transaction. */
- J_ASSERT (!journal->j_commit_timer_active);
+ /*
+ * Set up the commit timer for the new transaction. In laptop mode
+ * we expect commits to be forced by core kernel kupdate activity, so
+ * just set the transaction to expire five seconds after that in case
+ * something changes or goes wrong there.
+ */
+ expires = jiffies;
+ if (laptop_mode)
+ expires += 5 * HZ + (laptop_writeback_centisecs * HZ) / 100;
+ else
+ expires += journal->j_commit_interval;
+
+ transaction->t_expires = expires;
+ J_ASSERT(!journal->j_commit_timer_active);
journal->j_commit_timer_active = 1;
journal->j_commit_timer->expires = transaction->t_expires;
add_timer(journal->j_commit_timer);
-
+
J_ASSERT (journal->j_running_transaction == NULL);
journal->j_running_transaction = transaction;

--- 2.5.20/include/linux/fs.h~laptop-mode Tue Jun 4 15:27:54 2002
+++ 2.5.20-akpm/include/linux/fs.h Tue Jun 4 15:27:54 2002
@@ -1291,5 +1291,9 @@ static inline ino_t parent_ino(struct de
return res;
}

+extern void disk_spun_up(void);
+extern int laptop_mode;
+extern int laptop_writeback_centisecs;
+
#endif /* __KERNEL__ */
#endif /* _LINUX_FS_H */
--- 2.5.20/Documentation/filesystems/proc.txt~laptop-mode Tue Jun 4 15:28:32 2002
+++ 2.5.20-akpm/Documentation/filesystems/proc.txt Tue Jun 4 15:35:04 2002
@@ -985,6 +985,38 @@ for writeout by the pdflush daemons. It
Data which has been dirty in-memory for longer than this interval will be
written out next time a pdflush daemon wakes up.

+laptop_mode
+-----------
+
+Setting this entry to '1' will put the kernel's dirty data writeout
+algorithms into a mode which is better suited to laptop/notebook
+computers. This mode is specifically designed to minimise the
+frequency of disk spinups. Laptop mode works as follows:
+
+- Dirty data remains in memory for longer periods of time (controlled
+ by laptop_writeback_centisecs).
+
+- If there is pending dirty data and the disk is spun up for any
+ reason (even for a read) then all dirty data will be written back
+ shortly afterwards. ie: when the disk is spun up, make good use of
+ it.
+
+- When the decision is made to write back some dirty data, the kernel
+ will write back all dirty data.
+
+laptop_writeback_centisecs
+--------------------------
+
+This tunable determines the maximum age of dirty data when the machine
+is operating in Laptop mode. The default value is 30000 - five
+minutes. This means that if applications are generating a small amount
+of write traffic, the disk will spin up once per five minutes.
+
+If the disk is spun up for any other reason (such as for a read) then
+all dirty data will be flushed anyway, and this timer is reset to zero.
+
+laptop_writeback_centisecs has no effect when the machine is not
+operating in Laptop mode.

kswapd
------

--------------0E196CBCA810BB4B5A6FAB4A--
