Re: Crusoe's performance on linux?

dean gaudet (dean-list-linux-kernel@arctic.org)
Mon, 23 Jun 2003 18:43:28 -0700 (PDT)


On Mon, 23 Jun 2003, Samphan Raruenrom wrote:

> Desktop - Pentium III 1 G Hz 754 MB -> 10.x min.
> Tablet PC - Crusoe TM5800 1 GHz 731 MB -> 17.x min.

how much real memory are in these boxes? the above don't look like any
real memory sizes i'm aware of (even if i try to guess if your M=10^6
or M=2^20).

> From freshdiagnos benchmack, the TPC has about 2x faster RAM.

you're mistaken if you believe marketing doodoo which says that DDR is
"twice as fast" as SDR -- only data transfer is clocked at twice the
bus speed. there's command bus costs which are identical between SDR
and DDR -- and its these costs which dominate the bus in non-sequential
benchmarks such as compilation (as opposed to excessively sequential
benchmarks which are used for marketing purposes).

expansion memory for the tablet PC is PC133 -- you can verify this
by reading the part numbers off the SODIMM and doing a lookup on the
manufacturer's website...

if you don't have any expansion memory in your tablet PC then you've
got only 256MB of RAM and i really don't think you should be using tmpfs.

> Shouldn't TM5800 with 4-wide VLIW engine and 64 registers,
> working on a single task, run as fast as a Pentium III?

nope.

you're assuming that the VLIW has 4 completely orthogonal processing
units -- it doesn't. try measuring code sequences such as:

add %eax,%ebx
add %eax,%ecx
add %eax,%edx
add %eax,%edi
add %eax,%ebx
add %eax,%ecx
add %eax,%edx
add %eax,%edi
...

you'll quickly figure out how many ALUs the machine has, plus you can
figure out their latencies and throughputs (by increasing or decreasing
the number of independent additions). if you do this you'll find
that p3 and tm5800 are essentially just as "wide" as each other for
ALU operations: a pair of 32-bit ALUs with single-cycle latency and
single-cycle throughput.

similar microarchitctural analysis of other aspects of the cores can
help explain the differences you see.

traditionally VLIW have been used for DSP and other such related tasks
in which there are a large number of orthogonal units. in tm5800 you
can think of the VLIW as a set of up to 4 micro-ops which feed the front
of up to 4 pipelines in each cycle -- much like the decoded micro-ops
in the p3 feed one of several pipelines in each cycle.

but i'm really guessing you're causing excessive disk i/o by having a
small memory system use a huge tmpfs... get rid of the tmpfs and
see what happens. and also consider doing i/o benchmarks while
running something which soaks up idle cycles (i.e. a tight loop
incrementing a counter) to see how the two architectures differ.

-dean
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/