Strangely enough not. You just have to try and stay out of the kernel as
much as possible ;-)
Of course some idiot sold the total-cost-of-ownership thingy of linux to
the customers. What they really need is a OS/360... 
  
> > 2. We're making a MPI library, and as such we don't have any control
> > with the application. 
> 
> I can't remember that the MPI spec tells anything about intercepting
> syscalls..
> 
It's says quite a bit about what memory can be used for comm buffers. 
> > 3b. the performance loss from copying from a receive area to the
> > userspace buffer is unacceptable. 
> > 3c. It's therefore necessary for HW to access user pages. 
> > 4. In order to to 3, the user pages must be pinned down. 
> > 5. the way MPI is written, it's not using a special malloc() to allocate
> > the send receive buffers. It can't since it would break language binding
> > to fortran. Thus ANY writeable user page may be used. 
> 
> so use get_user_pages.
Let me clearify: pinning pages are not, repeat not a problem. 
The problem occur when you 
1. pinn a buffer  
2. sbrk(-n) or munmap() (usually thru free()) the area the buffer  
3. a new malloc() resulting in a sbrk(+n) or mmap()
4. then my new buffer has the exactly same virtual address as the prev.
(belive it or not this happens, and relatively frequently). 
 
> 
> > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > buffers every time they're used. 
> 
> Umm, pinning memory all the time means you get a bunch of nice DoS
> attachs due to the huge amount of memory.
> 
This is HPC clusters. DoS is a non issue. This is not the normal multi
user systems. In fact you run one active process per CPU.  
> > 7. The only way to cache buffers (to see if they're used before and
> > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > to a device file is prohibitive expensive under point 1.  
> 
> That's a horribly b0rked approach..
> 
It's *FAST*.
> Again, where's your driver source so we can help you to find a better
> approach out of that mess?
>  
The trace module we made to trace munmap() and sbrk() could be opened,
but you'll be disappointed since all the pinning ( get_user_pages() and
friends), send() recv() etc are in the drivers for the various hardware,
most of which are not our property.  
The module works as follows. It catches sbrk(-arg) and munmap() and lays
out the trace info in a memory area mmap()'able thru the device file.  
So when processes need the trace info they have the info in memory to
avoid doing a ioctl().
Thats all we need to know if a given virtual address needs to be
(re)pinned. 
Lets deal, I'll GPL the trace module if you get me a 
EXPORT_SYMBOL_GPL(sys_call_table);
TJ
-- _________________________________________________________________________Terje Eggestad mailto:terje.eggestad@scali.no Scali Scalable Linux Systems http://www.scali.com
Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE) P.O.Box 150, Oppsal +47 975 31 574 (MOBILE) N-0619 Oslo fax: +47 22 62 89 51 NORWAY _________________________________________________________________________
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/