Re: Why Plan 9 C compilers don't have asm("")

Linus Torvalds (torvalds@transmeta.com)
Wed, 4 Jul 2001 17:22:44 +0000 (UTC)


In article <20010704002436.C1294@ftsoj.fsmlabs.com>,
Cort Dougan <cort@fsmlabs.com> wrote:
>
>There isn't such a crippling difference between straight-line and code with
>unconditional branches in it with modern processors. In fact, there's very
>little measurable difference.

Oh, the small details get to you eventually.

And it's not just the "call" and "ret" instructions. They _do_ hurt,
even on modern CPU's, btw. They tend to break up the prefetching, and
often mean that you cannot do as good of a instruction mix.

But there's an even more serious problem: a function call in C is a
VERY heavy operation as far as the compiler is concerned. It's a major
sequence point, and the compiler doesn't know what memory locations are
potentially dead etc.

Which means that the compiler has to save everything that might be
relevant to memory, and depending on the calling convention has to
assume that registers are trashed. And when you come back, you have to
re-load everything again, on the assumption that the function might have
changed state.

You also often have issues like reloading the gp pointer on many 64-bit
architectures, where functions can be in different "domains", and
returning from an unknown function means that you have to do other nasty
setup in order to get at your global data.

And trust me, it's noticeable. On alpha, a fast function call _should_
be a a simple two-cycle thing - branch and return. But because of
practical linker issues, what the compiler ends up having to generate
for calls to targets that it doesn't know where they are is

- load a 64-bit address off the GP area that the linker will have fixed
up.
- do an indirect branch to that address
- the callee re-loads the GP with _its_ copy of the GP if it needs any
global data or needs to call anybody else.
- we return to the caller
- the caller reloads its GP.

Your theoretical two cycles that the CPU could follow in the front end
and speculate around turns into multiple loads, a indirect branch and
about 10 instructions. And that's without any of the other effects even
being taken into account. No matter _how_ good the CPU is, that's going
to be slower than not doing it.

[ And yes, I know there are optimizing linkers for the alpha around that
improve this and notice when they don't need to change GP and can do a
straight branch etc. I don't think GNU ld _still_ does that, but who
knows. Even the "good" Digital compilers tended to nop out unnecessary
instructions rather than remove them, causing more icache pressure on
a CPU that was already famous for needing tons of icache ]

Now, you could get around a bit of this by allowing for special calling
conventions. Gcc actually has this for some details - namely the
"register arguments" part, which actually makes for much more readable
code (that's my main personal use for it - never mind the fact that it
is probably faster _too_).

But gcc doesn't have a good "you can re-order this call wrt other stuff"
setup, and gcc lacks the ability to change the calling convention
on-the-fly ("this function will not clobber any registers").

Try it and see. There are good reasons for "inline asm", not the least
of which is that it often makes the produced code much more readable.

And if you never look at the produced assembler code, then you'll never
have a fast system. Really. Compilers can do only so much. People who
understand what the end result is makes for a difference.

Now, you could probably argue that instead of inline asms we should have
more flexibility in doing a per-callee calling convention. That would be
good too, no question about it.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/