<!-- received="Wed Apr 26 16:26:01 2000 EET DST" -->
<!-- sent="Wed, 26 Apr 2000 06:11:15 -0700" -->
<!-- name="David S. Miller" -->
<!-- email="davem@redhat.com" -->
<!-- subject="Re: [PATCH] 2.3.99-pre6-3+  VM rebalancing" -->
<!-- id="200004261311.GAA13838@pizda.ninka.net" -->
<!-- inreplyto="[PATCH] 2.3.99-pre6-3+  VM rebalancing" -->
<title>Linux-kernel mailing list archive 2000-17: Re: [PATCH] 2.3.99-pre6-3+  VM rebalancing</title>
<body bgcolor="#FFFFFF"><font face="Arial,Helvetica">
<h1>Re: [PATCH] 2.3.99-pre6-3+  VM rebalancing</h1>
<b>David S. Miller</b> (<a href="mailto:davem@redhat.com"><i>davem@redhat.com</i></a>)<br>
<i>Wed, 26 Apr 2000 06:11:15 -0700</i>
<p>
<ul>
<li> <b>Messages sorted by:</b> <a href="date.html#486">[ date ]</a><a href="index.html#486">[ thread ]</a><a href="subject.html#486">[ subject ]</a><a href="author.html#486">[ author ]</a>
<!-- next="start" -->
<li> <b>Next message:</b> <a href="0487.html">ralf willenbacher: "System hangs when reading or writing many or large files"</a>
<li> <b>Previous message:</b> <a href="0485.html">George Anzinger: "Re: kernel debugger"</a>
<li> <b>Maybe in reply to:</b> <a href="0022.html">Rik van Riel: "[PATCH] 2.3.99-pre6-3+  VM rebalancing"</a>
<!-- nextthread="start" -->
<li> <b>Next in thread:</b> <a href="0503.html">Stephen C. Tweedie: "Re: [PATCH] 2.3.99-pre6-3+  VM rebalancing"</a>
<li> <b>Next in thread:</b> <a href="0501.html">David S. Miller: "Re: [PATCH] 2.3.99-pre6-3+  VM rebalancing"</a>
<li> <b>Reply:</b> <a href="0503.html">Stephen C. Tweedie: "Re: [PATCH] 2.3.99-pre6-3+  VM rebalancing"</a>
<!-- reply="end" -->
</ul>
<hr>
<!-- body="start" -->
   Date: Wed, 26 Apr 2000 14:00:31 +0100<br>
   From: "Stephen C. Tweedie" &lt;<a href="mailto:sct@redhat.com">sct@redhat.com</a>&gt;<br>
<p>
   Doing it isn't the problem.  Doing it efficiently is, if you have <br>
   fork() and mremap() in the picture.  With mremap(), you cannot assume<br>
   that the virtual address of an anonymous page is the same in all<br>
   processes which have the page mapped.<br>
<p>
Who makes that assumption?  The virtual address of a physical page<br>
is:<br>
<p>
	(page-&gt;index - vma-&gt;vm_pgoff) &lt;&lt; PAGE_SHIFT<br>
<p>
Add that to vma-&gt;vm_start and if the resulting value is not<br>
&gt;= vma-&gt;vm_end, then you have the proper virtual address, always.<br>
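<p>
A minimal userspace sketch of that arithmetic, for illustration only.<br>
The struct vma_sketch type, the helper names and the 4 KiB page size are<br>
assumptions made for the example, not kernel definitions.  The extra<br>
check against vm_pgoff is also an addition: it guards the unsigned<br>
subtraction when the page index lies below the start of the VMA.<br>

```c
#include <assert.h>

#define PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Hypothetical, trimmed-down stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte of the mapping */
	unsigned long vm_pgoff;	/* page offset of vm_start within the object */
};

/* Index recorded for a page faulted in at ADDRESS (same formula as
 * anon_page_insert() in the patch).
 */
static unsigned long addr_to_index(const struct vma_sketch *vma,
				   unsigned long address)
{
	return ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
}

/* Store the virtual address of page INDEX in *OUT and return 1 if the
 * page falls inside VMA, otherwise return 0 and leave *OUT alone.
 */
static int index_to_addr(const struct vma_sketch *vma, unsigned long index,
			 unsigned long *out)
{
	unsigned long address;

	if (index < vma->vm_pgoff)
		return 0;	/* page lies before the start of this VMA */

	address = vma->vm_start + ((index - vma->vm_pgoff) << PAGE_SHIFT);
	if (address >= vma->vm_end)
		return 0;	/* page lies past the end of this VMA */

	*out = address;
	return 1;
}
```

Because vm_pgoff travels with the VMA, the same page index resolves to<br>
a different address in a VMA that mremap() has moved, which is the<br>
case the quoted text worries about.<br>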
<p>
   So, basically, to find all the ptes for a given page, you have to<br>
   walk every single vma in every single mm which is a fork()ed <br>
   ancestor or descendent of the mm whose address_space you indexed<br>
   the page against.<br>
<p>
If you implement things correctly, this is not true at all.<br>
<p>
   Detecting the right vma isn't hard, because the vma's vm_pgoff is<br>
   preserved over mremap().  It's the linear scan that is the danger.<br>
<p>
In my implementation there is no linear scan; only the VMAs which<br>
can actually contain the anonymous page in question are scanned.<br>
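<p>
Roughly, the lookup could be sketched in userspace C as below.  This<br>
is an illustration under assumptions, not code from the patch: struct<br>
shared_vma, its next_share link and anon_page_addresses() are<br>
hypothetical names, and a 4 KiB page size is assumed.  The walk only<br>
visits VMAs chained to one anon object and skips those whose range<br>
does not cover the page.<br>

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for a VMA chained on an anon object's share list. */
struct shared_vma {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_pgoff;
	struct shared_vma *next_share;	/* next VMA attached to the same object */
};

/* Walk the share list starting at HEAD and collect, in ADDRS, the virtual
 * address of page INDEX in every VMA that can contain it.  At most MAX
 * addresses are stored.  Returns the number of addresses stored.
 */
static size_t anon_page_addresses(const struct shared_vma *head,
				  unsigned long index,
				  unsigned long *addrs, size_t max)
{
	const struct shared_vma *vma;
	size_t n = 0;

	for (vma = head; vma != NULL && n < max; vma = vma->next_share) {
		unsigned long address;

		if (index < vma->vm_pgoff)
			continue;	/* page is below this VMA */

		address = vma->vm_start +
			  ((index - vma->vm_pgoff) << PAGE_SHIFT);
		if (address >= vma->vm_end)
			continue;	/* page is above this VMA */

		addrs[n++] = address;
	}

	return n;
}
```

The cost is bounded by the number of VMAs sharing the object, not by<br>
the number of VMAs in every related mm.<br>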
<p>
It's called an anonymous layer, and it provides pseudo backing objects<br>
for VMAs which have at least one privatized anonymous page.  Each<br>
such object is no more than a reference count, and an address_space<br>
struct.  The anonymous pages are queued into the address_space page<br>
list, and have their page-&gt;index fields set appropriately.<br>
<p>
When VMAs move around, get duplicated in forked processes, etc.,<br>
the anon layer gets called and adjusts things appropriately.<br>
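<p>
The lifetime of such an object can be sketched as plain reference<br>
counting.  The following is a simplified userspace illustration, not<br>
the kernel code: struct anon_sketch, struct vma_ref and the three<br>
helpers are hypothetical names, the count is a plain int where the<br>
kernel would use atomic_t, and the address_space part is left out.<br>

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the anon object: just the reference count. */
struct anon_sketch {
	int count;	/* number of VMAs currently attached */
};

/* Hypothetical stand-in for a VMA that may point at an anon object. */
struct vma_ref {
	struct anon_sketch *vm_anon;
};

/* First private page in VMA: create the object lazily.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int anon_attach(struct vma_ref *vma)
{
	if (vma->vm_anon == NULL) {
		struct anon_sketch *anon = malloc(sizeof(*anon));

		if (anon == NULL)
			return -1;
		anon->count = 1;
		vma->vm_anon = anon;
	}
	return 0;
}

/* fork(): the child VMA shares the parent's object. */
static void anon_share(struct vma_ref *parent, struct vma_ref *child)
{
	assert(parent->vm_anon != NULL);

	parent->vm_anon->count++;
	child->vm_anon = parent->vm_anon;
}

/* VMA goes away: drop its reference.  Returns 1 if this was the last
 * user and the object was freed, 0 otherwise.
 */
static int anon_release(struct vma_ref *vma)
{
	struct anon_sketch *anon = vma->vm_anon;
	int last;

	assert(anon != NULL && anon->count >= 1);

	last = (--anon->count == 0);
	if (last)
		free(anon);
	vma->vm_anon = NULL;

	return last;
}
```

Only the final release frees the object; earlier releases just detach<br>
the VMA that is going away.<br>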
<p>
Instead of talk, I'll show some code :-)  The following is the<br>
anon layer I implemented for 2.3.x in my hacks.<br>
<p>
--- ./mm/anon.c.~1~	Tue Apr 25 00:39:55 2000<br>
+++ ./mm/anon.c	Tue Apr 25 07:08:28 2000<br>
@@ -0,0 +1,370 @@<br>
+/*<br>
+ *	linux/mm/anon.c<br>
+ *<br>
+ * Written by DaveM.<br>
+ */<br>
+<br>
+#include &lt;linux/kernel.h&gt;<br>
+#include &lt;linux/sched.h&gt;<br>
+#include &lt;linux/mm.h&gt;<br>
+#include &lt;linux/slab.h&gt;<br>
+#include &lt;linux/fs.h&gt;<br>
+#include &lt;linux/swap.h&gt;<br>
+#include &lt;linux/pagemap.h&gt;<br>
+#include &lt;linux/spinlock.h&gt;<br>
+#include &lt;linux/highmem.h&gt;<br>
+<br>
+/* The anon layer provides a virtual backing object for anonymous<br>
+ * private pages.  The anon objects hang off of vmas and are created<br>
+ * at the first cow fault into a private mapping.<br>
+ *<br>
+ * The anon address space is just like the page cache, it holds a<br>
+ * reference to each of the pages attached to it.<br>
+ */<br>
+<br>
+/* The layout of this structure is completely private to the<br>
+ * anon layer.  There is no reason to export it so we don't.<br>
+ */<br>
+struct anon_area {<br>
+	atomic_t		count;<br>
+	struct address_space	mapping;<br>
+};<br>
+<br>
+extern spinlock_t pagecache_lock;<br>
+static kmem_cache_t *anon_cachep = NULL;<br>
+<br>
+static __inline__ void anon_insert_vma(struct vm_area_struct *vma,<br>
+				       struct anon_area *anon)<br>
+{<br>
+	struct address_space *mapping = &amp;anon-&gt;mapping;<br>
+	struct vm_area_struct *next;<br>
+<br>
+	spin_lock(&amp;mapping-&gt;i_shared_lock);<br>
+	next = mapping-&gt;i_mmap;<br>
+	if ((vma-&gt;vm_anon_next_share = next) != NULL)<br>
+		next-&gt;vm_anon_pprev_share = &amp;vma-&gt;vm_anon_next_share;<br>
+	mapping-&gt;i_mmap = vma;<br>
+	vma-&gt;vm_anon_pprev_share = &amp;mapping-&gt;i_mmap;<br>
+	spin_unlock(&amp;mapping-&gt;i_shared_lock);<br>
+}<br>
+<br>
+static __inline__ void anon_remove_vma(struct vm_area_struct *vma,<br>
+				       struct anon_area *anon)<br>
+{<br>
+	struct address_space *mapping = &amp;anon-&gt;mapping;<br>
+	struct vm_area_struct *next;<br>
+<br>
+	spin_lock(&amp;mapping-&gt;i_shared_lock);<br>
+	next = vma-&gt;vm_anon_next_share;<br>
+	if (next)<br>
+		next-&gt;vm_anon_pprev_share = vma-&gt;vm_anon_pprev_share;<br>
+	*(vma-&gt;vm_anon_pprev_share) = next;<br>
+	spin_unlock(&amp;mapping-&gt;i_shared_lock);<br>
+}<br>
+<br>
+/* Attach VMA's anon_area to NEW_VMA */<br>
+void anon_dup(struct vm_area_struct *vma, struct vm_area_struct *new_vma)<br>
+{<br>
+	struct anon_area *anon = vma-&gt;vm_anon;<br>
+<br>
+	if (anon == NULL)<br>
+		BUG();<br>
+<br>
+	atomic_inc(&amp;anon-&gt;count);<br>
+	anon_insert_vma(new_vma, anon);<br>
+	new_vma-&gt;vm_anon = anon;<br>
+}<br>
+<br>
+/* Free up all the pages associated with ANON. */<br>
+static void invalidate_anon_pages(struct anon_area *anon)<br>
+{<br>
+	spin_lock(&amp;pagecache_lock);<br>
+<br>
+	for (;;) {<br>
+		struct list_head *entry = anon-&gt;mapping.pages.next;<br>
+		struct page *page;<br>
+<br>
+		if (entry == &amp;anon-&gt;mapping.pages)<br>
+			break;<br>
+<br>
+		page = list_entry(entry, struct page, list);<br>
+<br>
+		get_page(page);<br>
+		if (TryLockPage(page)) {<br>
+			spin_unlock(&amp;pagecache_lock);<br>
+			lock_page(page);<br>
+			spin_lock(&amp;pagecache_lock);<br>
+		}<br>
+<br>
+		if (PageSwapCache(page)) {<br>
+			spin_unlock(&amp;pagecache_lock);<br>
+			__delete_from_swap_cache(page);<br>
+			spin_lock(&amp;pagecache_lock);<br>
+		}<br>
+<br>
+		put_page(page);<br>
+<br>
+		lru_cache_del(page);<br>
+<br>
+		list_del(&amp;page-&gt;list);<br>
+		anon-&gt;mapping.nrpages--;<br>
+		ClearPageAnon(page);<br>
+		page-&gt;mapping = NULL;<br>
+		UnlockPage(page);<br>
+<br>
+		__free_page(page);<br>
+	}<br>
+<br>
+	spin_unlock(&amp;pagecache_lock);<br>
+<br>
+	if (anon-&gt;mapping.nrpages != 0)<br>
+		BUG();<br>
+}<br>
+<br>
+/* VMA has been resized in some way, or one of the anon_area owners<br>
+ * has gone away.  Trim the anonymous pages from the anon_area which<br>
+ * have a reference count of one.  These pages are no longer<br>
+ * referenced validly by any VMA and thus can be safely disposed.<br>
+ *<br>
+ * This is actually an optimization of sorts, we could just<br>
+ * ignore this situation and let the eventual final anon_put<br>
+ * get rid of the pages.<br>
+ *<br>
+ * It is the caller's responsibility to unmap and free the<br>
+ * pages from the address space of the process before invoking<br>
+ * this.  It cannot work otherwise.<br>
+ */<br>
+void anon_trim(struct vm_area_struct *vma)<br>
+{<br>
+	struct anon_area *anon = vma-&gt;vm_anon;<br>
+	struct list_head *entry;<br>
+<br>
+	spin_lock(&amp;pagecache_lock);<br>
+<br>
+	entry = anon-&gt;mapping.pages.next;<br>
+	while (entry != &amp;anon-&gt;mapping.pages) {<br>
+		struct page *page = list_entry(entry, struct page, list);<br>
+		struct list_head *next = entry-&gt;next;<br>
+<br>
+		entry = next;<br>
+<br>
+		if (page_count(page) != 1)<br>
+			continue;<br>
+<br>
+		if (TryLockPage(page))<br>
+			continue;<br>
+<br>
+		lru_cache_del(page);<br>
+<br>
+		list_del(&amp;page-&gt;list);<br>
+		anon-&gt;mapping.nrpages--;<br>
+		ClearPageAnon(page);<br>
+		page-&gt;mapping = NULL;<br>
+		UnlockPage(page);<br>
+<br>
+		__free_page(page);<br>
+	}<br>
+<br>
+	spin_unlock(&amp;pagecache_lock);<br>
+}<br>
+<br>
+/* Disassociate VMA from the vm_anon attached to it. */<br>
+void anon_put(struct vm_area_struct *vma)<br>
+{<br>
+	struct anon_area *anon = vma-&gt;vm_anon;<br>
+<br>
+	if (anon == NULL)<br>
+		BUG();<br>
+	if (atomic_read(&amp;anon-&gt;count) &lt; 1)<br>
+		BUG();<br>
+<br>
+	anon_remove_vma(vma, anon);<br>
+<br>
+	if (atomic_dec_and_test(&amp;anon-&gt;count)) {<br>
+		if (anon-&gt;mapping.i_mmap != NULL)<br>
+			BUG();<br>
+		invalidate_anon_pages(anon);<br>
+		kmem_cache_free(anon_cachep, anon);<br>
+	} else<br>
+		anon_trim(vma);<br>
+<br>
+	vma-&gt;vm_anon = NULL;<br>
+}<br>
+<br>
+<br>
+/* Forcibly delete an anon_area page.  This also kills the<br>
+ * original reference made by anon_cow.<br>
+ */<br>
+void anon_page_kill(struct page *page)<br>
+{<br>
+	spin_lock(&amp;pagecache_lock);<br>
+<br>
+	if (TryLockPage(page)) {<br>
+		spin_unlock(&amp;pagecache_lock);<br>
+<br>
+		lock_page(page);<br>
+<br>
+		spin_lock(&amp;pagecache_lock);<br>
+	}<br>
+<br>
+	lru_cache_del(page);<br>
+<br>
+	page-&gt;mapping-&gt;nrpages--;<br>
+	list_del(&amp;page-&gt;list);<br>
+	ClearPageAnon(page);<br>
+	page-&gt;mapping = NULL;<br>
+	UnlockPage(page);<br>
+<br>
+	put_page(page);<br>
+	__free_page(page);<br>
+<br>
+	spin_unlock(&amp;pagecache_lock);<br>
+}<br>
+<br>
+static int anon_try_to_free_page(struct page *page)<br>
+{<br>
+	int ret = 0;<br>
+<br>
+	if (page_count(page) &lt;= 1)<br>
+		BUG();<br>
+	if (!PageLocked(page))<br>
+		BUG();<br>
+<br>
+	spin_lock(&amp;pagecache_lock);<br>
+	if (PageSwapCache(page)) {<br>
+		spin_unlock(&amp;pagecache_lock);<br>
+		__delete_from_swap_cache(page);<br>
+		spin_lock(&amp;pagecache_lock);<br>
+	}<br>
+	if (page_count(page) == 2) {<br>
+		struct address_space *mapping = page-&gt;mapping;<br>
+<br>
+		mapping-&gt;nrpages--;<br>
+		list_del(&amp;page-&gt;list);<br>
+<br>
+		ClearPageAnon(page);<br>
+		page-&gt;mapping = NULL;<br>
+		ret = 1;<br>
+	}<br>
+	spin_unlock(&amp;pagecache_lock);<br>
+<br>
+	if (ret == 1)<br>
+		__free_page(page);<br>
+<br>
+	return ret;<br>
+}<br>
+<br>
+struct address_space_operations anon_address_space_operations = {<br>
+	try_to_free_page:	anon_try_to_free_page<br>
+};<br>
+<br>
+/* SLAB constructor for anon_area structs. */<br>
+static void anon_ctor(void *__p, kmem_cache_t *cache, unsigned long flags)<br>
+{<br>
+	struct anon_area *anon = __p;<br>
+	struct address_space *mapping = &amp;anon-&gt;mapping;<br>
+<br>
+	INIT_LIST_HEAD(&amp;mapping-&gt;pages);<br>
+	mapping-&gt;nrpages = 0;<br>
+	mapping-&gt;a_ops = &amp;anon_address_space_operations;<br>
+	mapping-&gt;host = anon;<br>
+	spin_lock_init(&amp;mapping-&gt;i_shared_lock);<br>
+}<br>
+<br>
+/* Create a new anon_area, and attach it to VMA. */<br>
+static struct anon_area *anon_alloc(struct vm_area_struct *vma)<br>
+{<br>
+	struct anon_area *anon = kmem_cache_alloc(anon_cachep, GFP_KERNEL);<br>
+<br>
+	if (anon) {<br>
+		struct address_space *mapping = &amp;anon-&gt;mapping;<br>
+<br>
+		atomic_set(&amp;anon-&gt;count, 1);<br>
+		mapping-&gt;i_mmap = vma;<br>
+		vma-&gt;vm_anon = anon;<br>
+		vma-&gt;vm_anon_next_share = NULL;<br>
+		vma-&gt;vm_anon_pprev_share = &amp;mapping-&gt;i_mmap;<br>
+	}<br>
+<br>
+	return anon;<br>
+}<br>
+<br>
+static void anon_page_insert(struct vm_area_struct *vma, unsigned long address, struct address_space *mapping, struct page *page)<br>
+{<br>
+	page-&gt;index = ((address - vma-&gt;vm_start) &gt;&gt; PAGE_SHIFT) + vma-&gt;vm_pgoff;<br>
+<br>
+	get_page(page);<br>
+<br>
+	spin_lock(&amp;pagecache_lock);<br>
+	SetPageAnon(page);<br>
+	mapping-&gt;nrpages++;<br>
+	list_add(&amp;page-&gt;list, &amp;mapping-&gt;pages);<br>
+	page-&gt;mapping = mapping;<br>
+	spin_unlock(&amp;pagecache_lock);<br>
+<br>
+	lru_cache_add(page);<br>
+}<br>
+<br>
+static __inline__ struct anon_area *get_anon(struct vm_area_struct *vma)<br>
+{<br>
+	struct anon_area *anon = vma-&gt;vm_anon;<br>
+<br>
+	if (anon == NULL)<br>
+		anon = anon_alloc(vma);<br>
+<br>
+	return anon;<br>
+}<br>
+<br>
+int anon_page_add(struct vm_area_struct *vma, unsigned long address, struct page *page)<br>
+{<br>
+	struct anon_area *anon = get_anon(vma);<br>
+<br>
+	if (anon) {<br>
+		anon_page_insert(vma, address, &amp;anon-&gt;mapping, page);<br>
+		return 0;<br>
+	}<br>
+<br>
+	return -1;<br>
+}<br>
+<br>
+/*<br>
+ * We special-case the C-O-W ZERO_PAGE, because it's such<br>
+ * a common occurrence (no need to read the page to know<br>
+ * that it's zero - better for the cache and memory subsystem).<br>
+ */<br>
+static inline void copy_cow_page(struct page * from, struct page * to, unsigned long address)<br>
+{<br>
+	if (from == ZERO_PAGE(address)) {<br>
+		clear_user_highpage(to, address);<br>
+		return;<br>
+	}<br>
+	copy_user_highpage(to, from, address);<br>
+}<br>
+<br>
+struct page *anon_cow(struct vm_area_struct *vma, unsigned long address, struct page *orig_page)<br>
+{<br>
+	struct anon_area *anon = get_anon(vma);<br>
+<br>
+	if (anon) {<br>
+		struct page *new_page = alloc_page(GFP_HIGHUSER);<br>
+<br>
+		if (new_page) {<br>
+			copy_cow_page(orig_page, new_page, address);<br>
+			anon_page_insert(vma, address, &amp;anon-&gt;mapping, new_page);<br>
+		}<br>
+<br>
+		return new_page;<br>
+	}<br>
+<br>
+	return NULL;<br>
+}<br>
+<br>
+void anon_init(void)<br>
+{<br>
+	anon_cachep = kmem_cache_create("anon_area",<br>
+					sizeof(struct anon_area),<br>
+					0, SLAB_HWCACHE_ALIGN,<br>
+					anon_ctor, NULL);<br>
+	if (!anon_cachep)<br>
+		panic("anon_init: Cannot alloc anon_area cache.");<br>
+}<br>
<p>
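The share list kept by anon_insert_vma() and anon_remove_vma() uses a<br>
next pointer plus a pointer to whatever pointer refers to the node, so<br>
removal needs no list head and no special case for the first element.<br>
A standalone userspace sketch of that idiom follows.  The struct node<br>
type and the function names are assumptions made for the example, and<br>
the locking done in the patch is omitted.<br>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node using the next/pprev linkage. */
struct node {
	struct node *next;	/* next node, or NULL at the tail */
	struct node **pprev;	/* address of the pointer that points at us */
	int id;
};

/* Push N on the front of the list whose head pointer is *HEAD. */
static void share_insert(struct node **head, struct node *n)
{
	struct node *next = *head;

	if ((n->next = next) != NULL)
		next->pprev = &n->next;
	*head = n;
	n->pprev = head;
}

/* Unlink N.  The head pointer is not needed: N's pprev already refers
 * to either the head pointer or the previous node's next field.
 */
static void share_remove(struct node *n)
{
	struct node *next = n->next;

	if (next)
		next->pprev = n->pprev;
	*(n->pprev) = next;
}
```

Removing the first, a middle, or the last node all go through the same<br>
two assignments.<br>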
<!-- body="end" -->
</font></body>
