Re: [RFC][PATCH] linux-2.5.34_vsyscall_A0

Andrea Arcangeli (andrea@suse.de)
Fri, 18 Oct 2002 19:01:16 +0200


On Fri, Oct 18, 2002 at 09:13:39AM -0700, Stephen Hemminger wrote:
> One reason gettimeofday ends up being important is that several
> databases call it a lot. They use it to build up a transaction id. Under
> big transaction loads, even the fast linux syscall path ends up being a
> bottleneck. Also, on NUMA machines the data used for time of day (xtime)
> ends up being a significant portion of the cache traffic.

Yep. However the main bottleneck is to go inside/outside the kernel, the
xtime is one l1 cacheline readonly that can be trivially shared under
high load. I would be surprised if that was the bottleneck, today you
should see an huge bottleneck in the xtime_lock before you can remotely
see a bottleneck in xtime data itself. (I'm speaking HZ=100 at least,
HZ=1000 would hurt more here)

> It would be great to rework the whole TSC time of day stuff to work with
> per cpu data and allow unsychronized TSC's like NUMA. The problem is
> that for fast user level access, there would need to be some way to find
> out the current CPU and avoid preemption/migration for a short period.
> It seems like the LDT stuff for per-thread data could provide the
> current cpu (and maybe current pid) somehow. And it would be possible
> to avoid preemption while in a vsyscall text page, some other Unix
> variants do this to implement portions of the thread library in kernel
> provided user text pages.

actually my idea on 64bit was to use the high 8 bit of each 64bit word to
give you the cpuid, to get out the coherent data, including the sequence
number that are read and written inversely with mb() like now (the
sequence number as well will become per-cpu), so it is definitely doable
without any single problem and in a very performant way, just not as
easy as without the per-cpu info. Even if segmentation per-cpu tricks
would be possible or available (remeber long mode is pure paging, no
segmentation) it would be not worthwhile IMHO, the cpuid encoded
atomically in each 64bit data provided by the vsyscall seems a much
simpler and possibly more performant solution. You set a different
per-cpu data-mapping with different pte settings in each cpu. The
vsyscall bytecode remains the same, aware about this cpuid encoded in
each 64bit word. Doing it in 32bit is ugly (or at least much slower)
since most data is natively at least 32bit, it would need some slow
demultiplexing.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/