With gcc 3.x I get:
495MB/s with -O3 -march=athlon-tbird -mcpu=athlon-tbird -falign-loops=4
        -falign-functions=4
488MB/s with -O3 -march=athlon-tbird -mcpu=athlon-tbird -falign-loops=4
467MB/s with -O0 -march=i686 -mcpu=i686
That is a difference of almost 30MB/s, or about 6%, purely from the
options passed to the same compiler.  It may not mean much over one
second, but few things where we care about performance run for only
one second.
I'd expect something below 3%, and realistically closer to 1%.  Any
ideas as to why it makes such a difference?  Does the C execution path
leading to the function really cost enough to drop memory bandwidth by
30MB/s?  From the looks of it this program is very small, and it should
get to the asm functions very quickly.