Looks like a cacheline alignment issue to me.
This loop of yours occupies x cachelines on your cpu;
moving it in memory by adding the printf
might cause it to occupy x+1 cachelines.
That might be noticeable if x is a really small number,
such as 1.
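To make the x versus x+1 point concrete, here is a rough sketch that counts
how many cachelines a block of code touches. The function name and the
64-byte line size are my assumptions for illustration; check what your cpu
actually uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cacheline size. 64 bytes is common, but not universal. */
#define CACHELINE_SIZE 64u

/* Number of cachelines touched by `len` bytes starting at address `start`. */
static size_t cachelines_spanned(uintptr_t start, size_t len)
{
    uintptr_t first, last;

    if (len == 0)
        return 0;
    first = start / CACHELINE_SIZE;             /* line holding the first byte */
    last  = (start + len - 1) / CACHELINE_SIZE; /* line holding the last byte  */
    return (size_t)(last - first + 1);
}
```

With these assumptions, a 40-byte loop starting at 0x1000 fits in one line,
while the same loop shifted to 0x1020 straddles two.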
> gcc is 3.2.1 (same happens for 2.95..)
> Note this is with -O3. If I don't specify -O then
> leaving the printf in speeds things up by about 15%
Sure - going from -O3 to -O changes code generation so
your loop code hits the cachelines differently.
In this case the printf moved the loop into
My advice is to put your test loop in a function of its own,
and do the printing in the function that calls it.
Functions are always aligned the same (good) way, so
that calling them will be fast.
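A minimal sketch of that layout, assuming gcc. test_loop() and
run_and_report() are made-up names, and the loop body is only a stand-in
for whatever you are really timing.

```c
#include <stdio.h>

/*
 * Hypothetical stand-in for the loop under test. The noinline attribute
 * (a GCC extension) keeps the compiler from merging the loop back into
 * the caller, which would defeat the purpose.
 */
__attribute__((noinline))
static unsigned long test_loop(unsigned long iterations)
{
    unsigned long sum = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++)
        sum += i;
    return sum;
}

/*
 * All printing lives in the caller, so adding or removing a printf here
 * does not change the instructions generated for test_loop().
 */
static unsigned long run_and_report(unsigned long iterations)
{
    unsigned long result = test_loop(iterations);

    printf("result = %lu\n", result);
    return result;
}
```

Call run_and_report() from main() and time only the test_loop() call if
you want the printf kept out of the measurement as well.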
You can tune the speed of your inner loop by experimenting
with the insertion of one or more NOP asms in front
of the loop. Just be aware that all such tuning is wasted once
you change anything at all in that function - you'll have to
re-do the tuning each time.
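One way to run that experiment without editing the source each time is
sketched below. PAD_NOPS and padded_loop() are made-up names, it assumes
gcc-style inline asm on a cpu whose assembler accepts "nop", and the
compiler is not obliged to keep the NOPs directly in front of the loop, so
check the generated assembly.

```c
/* Hypothetical tuning knob: how many NOPs to emit ahead of the loop. */
#ifndef PAD_NOPS
#define PAD_NOPS 0
#endif

static unsigned long padded_loop(unsigned long iterations)
{
    unsigned long sum = 0;
    unsigned long i;

    /* Each NOP shifts the code that follows by one instruction. */
#if PAD_NOPS >= 1
    __asm__ volatile ("nop");
#endif
#if PAD_NOPS >= 2
    __asm__ volatile ("nop");
#endif
#if PAD_NOPS >= 3
    __asm__ volatile ("nop");
#endif

    for (i = 0; i < iterations; i++)
        sum += i;
    return sum;
}
```

Rebuild with -DPAD_NOPS=1, -DPAD_NOPS=2 and so on, and time each build.
The result is the same whatever the padding; only the loop's position in
memory changes.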
The compiler should ideally align the loops for maximum performance.
That can be hard though, considering all the different processors
that might run your program. And aligning everything optimally
could waste a _lot_ of code space - so do this only for
small loops with lots of iterations.
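To put a number on the wasted space, this sketch computes how many padding
bytes it takes to push a loop up to the next aligned address. The function
name is mine. gcc exposes the same trade-off through options such as
-falign-loops.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Padding bytes needed so that code at `addr` starts on a multiple of
 * `boundary`. Returns 0 when addr is already aligned; boundary must be
 * non-zero.
 */
static size_t alignment_padding(uintptr_t addr, size_t boundary)
{
    return (size_t)((boundary - addr % boundary) % boundary);
}
```

The worst case is boundary-1 bytes per aligned loop, so with a 64-byte
boundary every aligned loop can cost up to 63 bytes of padding.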