Re: Break 2.4 VM in five easy steps

Linus Torvalds (torvalds@transmeta.com)
6 Jun 2001 13:47:35 -0700


In article <3B1D5ADE.7FA50CD0@illusionary.com>,
Derek Glidden <dglidden@illusionary.com> wrote:
>
>After reading the messages to this list for the last couple of weeks and
>playing around on my machine, I'm convinced that the VM system in 2.4 is
>still severely broken.

Now, this may well be true, but what you actually demonstrated is that
"swapoff()" is extremely (and I mean _EXTREMELY_) inefficient, to the
point that it can certainly be called broken.

It got worse in 2.4.x not so much due to any generic VM worseness, as
due to the fact that the much more persistent swap cache behaviour in
2.4.x just exposes the fundamental inefficiencies of "swapoff()" more
clearly. I don't think the swapoff() algorithm itself has changed, it's
just that the algorithm was always exponential, I think (and because of
the persistent swap cache, the "n" in the algorithm became much bigger).

So this is really a separate problem from the general VM balancing
issues. Go and look at the "try_to_unuse()" logic, and wince.

I'd love to have somebody look a bit more at swap-off. It may well be,
for example, that swap-off does not correctly notice dead swap-pages at
all - somebody should verify that it doesn't try to read in and
"try_to_unuse()" dead swap entries. That would make the inefficiency
show up even more clearly.

(Quick look gives the following: right now try_to_unuse() in
mm/swapfile.c does something like

lock_page(page);
if (PageSwapCache(page))
delete_from_swap_cache_nolock(page);
UnlockPage(page);
read_lock(&tasklist_lock);
for_each_task(p)
unuse_process(p->mm, entry, page);
read_unlock(&tasklist_lock);
shmem_unuse(entry, page);
/* Now get rid of the extra reference to the temporary
page we've been using. */
page_cache_release(page);

and we should trivially notice that if the page count is 1, it cannot be
mapped in any process, so we should maybe add something like

lock_page(page);
if (PageSwapCache(page))
delete_from_swap_cache_nolock(page);
UnlockPage(page);
+ if (page_count(page) == 1)
+ goto nothing_to_do;
read_lock(&tasklist_lock);
for_each_task(p)
unuse_process(p->mm, entry, page);
read_unlock(&tasklist_lock);
shmem_unuse(entry, page);
+
+ nothing_to_do:
+
/* Now get rid of the extra reference to the temporary
page we've been using. */
page_cache_release(page);

which should (assuming I got the page count thing right - I'v eobviously
not tested the above change) make sure that we don't spend tons of time
on dead swap pages.

Somebody interested in trying the above add? And looking for other more
obvious bandaid fixes. It won't "fix" swapoff per se, but it might make
it bearable and bring it to the 2.2.x levels.

The _real_ fix is to really make "swapoff()" work the other way around -
go through each process and look for swap entries in the page tables
_first_, and bring all entries for that device in sanely, and after
everything is brought in just drop all the swap cache pages for that
device.

The current swapoff() thing is really a quick hack that has lived on
since early 1992 with quick hacks to make it work with the big VM
changes that have happened since.

That would make swapoff be O(n) in VM size (and you can easily do some
further micro-optimizations at that time by avoiding shared mappings
with backing store and other things that cannot have swap info involved)

Is anybody interested in making "swapoff()" better? Please speak up..

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/