Re: [VM] 2.4.14/15-pre4 too "swap-happy"?

Ken Brownfield (brownfld@irridia.com)
Mon, 19 Nov 2001 21:09:41 -0600

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: John Ellson: "Re: [PATCH] "make modules_install" breaks with new /bin/cp"
Previous message: Eric W. Biederman: "Re: Swap"

Well, I think you'll be pleased to hear that your untested patch
compiled, booted, _and_ fixed the problem. :)

The minimum free RAM was about 9.8-11MB (matching your guestimate) and
kswapd seemed to behave the same as the watermark patch. The results of
top were basically the same, so I'm omitting it.

However, I do have some profiling numbers, thanks to Marcelo. Attached
are numbers from "readprofile | sort -nr +2 | head -20". I think the
pre4 numbers point to shrink_cache, prune_icache, and statm_pgd_range.
The other two might have significance for wizards, but statistically
don't stand out to me, except maybe statm_pgd_range.

I reset the counters just before starting Oracle and the stress test. I
think a -pre7 with a blessed patch would be good, since my testing was
very narrow.

I'll test new kernels as I hear new info.

Thanks much!

-- 
Ken.
brownfld@irridia.com


2.4.15-pre4 with your original patch:
(shorter time period since the machine went to hell fast)
(matches vanilla behaviour)

164536 default_idle                             3164.1538
101562 shrink_cache                             113.8587
  3683 prune_icache                              13.5404
  3034 file_read_actor                           12.2339
   914 DAC960_BA_InterruptHandler                 5.5732
  1128 statm_pgd_range                            2.9072
    40 page_cache_release                         0.8333
    31 add_page_to_hash_queue                     0.5167
    89 page_cache_read                            0.4363
    25 remove_inode_page                          0.4167
    26 unlock_page                                0.3095
   509 __make_request                             0.3008
    66 smp_call_function                          0.2946
    21 set_bh_page                                0.2917
     9 __brelse                                   0.2812
    90 try_to_free_buffers                        0.2778
    13 mark_page_accessed                         0.2708
     8 __free_pages                               0.2500
    43 get_hash_table                             0.2443
    42 activate_page                              0.2234

2.4.15-pre6 with watermark patch:

1617446 default_idle                             31104.7308
 27599 DAC960_BA_InterruptHandler               168.2866
 38918 file_read_actor                          156.9274
   528 page_cache_release                        11.0000
   554 add_page_to_hash_queue                     9.2333
 15487 __make_request                             9.1531
  3453 statm_pgd_range                            8.8995
   514 remove_inode_page                          8.5667
  1453 blk_init_free_list                         7.2650
   377 set_bh_page                                5.2361
   898 page_cache_read                            4.4020
   590 add_to_page_cache_unique                   4.3382
   136 __brelse                                   4.2500
  1120 kmem_cache_alloc                           3.8356
   628 kunmap_high                                3.7381
  1189 try_to_free_buffers                        3.6698
   625 get_hash_table                             3.5511
   439 lru_cache_add                              3.4297
  1715 rmqueue                                    3.0194
   105 remove_wait_queue                          2.9167

2.4.15-pre6 with Linus patch:

1249875 default_idle                             24036.0577
 65324 file_read_actor                          263.4032
 36979 DAC960_BA_InterruptHandler               225.4817
  9809 statm_pgd_range                           25.2809
  1039 page_cache_release                        21.6458
   994 add_page_to_hash_queue                    16.5667
   922 remove_inode_page                         15.3667
  2409 blk_init_free_list                        12.0450
 20159 __make_request                            11.9143
  1198 lru_cache_add                              9.3594
  1628 page_cache_read                            7.9804
   987 add_to_page_cache_unique                   7.2574
  2202 try_to_free_buffers                        6.7963
  1038 get_unused_buffer_head                     6.6538
   484 unlock_page                                5.7619
  3182 rmqueue                                    5.6021
   874 kunmap_high                                5.2024
   164 __brelse                                   5.1250
   900 get_hash_table                             5.1136
   357 set_bh_page                                4.9583


On Mon, Nov 19, 2001 at 03:52:44PM -0800, Linus Torvalds wrote:
| 
| On Mon, 19 Nov 2001, Ken Brownfield wrote:
| >
| > I went straight to the aa patch, and it looks like it either fixes the
| > problem or (because of the side-effects Linus mentioned) otherwise
| > prevents the issue:
| 
| So is this pre6aa1, or pre6 + just the watermark patch?
| 
| > The machine went into swap immediately when the page cache stopped
| > growing and hovered at 100-400MB.  Also, in my experience the page cache
| > will grow until there's only 5ishMB of free RAM, but with the aa patch
| > it looks like it stops at 320MB or maybe 10% of RAM.  Was that the aa
| > patch, or part of -pre6?
| 
| That was the watermarking. The way Andrea did it, the page cache will
| basically refuse to touch as much of the "normal" page zone, because it
| would prefer to allocate more from highmem..
| 
| I think it's excessive to have 320MB free memory, though, that's just
| an insane waste. I suspect that the real number should be somewhere
| between the old behaviour and the new one. You can tweak the behaviour of
| andrea's kernel by changing the "reserved" page numbers, but I'd like to
| hear whether my simpler approach works too..
| 
| > The Oracle SGA is set to ~522MB, with nothing else running except a
| > couple of sshds, getty, etc.  Now that I'm looking, 2.8GB page cache
| > plus 328MB free adds up to about 3.1GB of RAM -- where does the 512MB
| > shared memory segment fit?  Is it being swapped out in deference to page
| > cache?
| 
| Shared memory actually uses the page cache too, so it will be accounted
| for in the 2.8GB number.
| 
| Anyway, can you try plain vanilla pre6, with the appended patch? This is
| my suggested simplified version of what Andrea tried to do, and it should
| try to keep only a few extra megs of memory free in the low memory
| regions, not 300+ MB.
| 
| (and the profiling would be interesting regardless, but I think Andrea did
| find the real problem, his fix just seems a bit of an overkill ;)
| 
| 		Linus

| diff -u --recursive --new-file pre6/linux/mm/page_alloc.c linux/mm/page_alloc.c
| --- pre6/linux/mm/page_alloc.c	Sat Nov 17 19:07:43 2001
| +++ linux/mm/page_alloc.c	Mon Nov 19 15:13:36 2001
| @@ -299,29 +299,26 @@
|  	return page;
|  }
|  
| -static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
| -{
| -	long free = zone->free_pages - (1UL << order);
| -	return free >= 0 ? free : 0;
| -}
| -
|  /*
|   * This is the 'heart' of the zoned buddy allocator:
|   */
|  struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
|  {
| +	unsigned long min;
|  	zone_t **zone, * classzone;
|  	struct page * page;
|  	int freed;
|  
|  	zone = zonelist->zones;
|  	classzone = *zone;
| +	min = 1UL << order;
|  	for (;;) {
|  		zone_t *z = *(zone++);
|  		if (!z)
|  			break;
|  
| -		if (zone_free_pages(z, order) > z->pages_low) {
| +		min += z->pages_low;
| +		if (z->free_pages > min) {
|  			page = rmqueue(z, order);
|  			if (page)
|  				return page;
| @@ -334,16 +331,18 @@
|  		wake_up_interruptible(&kswapd_wait);
|  
|  	zone = zonelist->zones;
| +	min = 1UL << order;
|  	for (;;) {
| -		unsigned long min;
| +		unsigned long local_min;
|  		zone_t *z = *(zone++);
|  		if (!z)
|  			break;
|  
| -		min = z->pages_min;
| +		local_min = z->pages_min;
|  		if (!(gfp_mask & __GFP_WAIT))
| -			min >>= 2;
| -		if (zone_free_pages(z, order) > min) {
| +			local_min >>= 2;
| +		min += local_min;
| +		if (z->free_pages > min) {
|  			page = rmqueue(z, order);
|  			if (page)
|  				return page;
| @@ -376,12 +375,14 @@
|  		return page;
|  
|  	zone = zonelist->zones;
| +	min = 1UL << order;
|  	for (;;) {
|  		zone_t *z = *(zone++);
|  		if (!z)
|  			break;
|  
| -		if (zone_free_pages(z, order) > z->pages_min) {
| +		min += z->pages_min;
| +		if (z->free_pages > min) {
|  			page = rmqueue(z, order);
|  			if (page)
|  				return page;

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Next message: John Ellson: "Re: [PATCH] "make modules_install" breaks with new /bin/cp"
Previous message: Eric W. Biederman: "Re: Swap"