Re: 2.5.20 RAID5 compile error

Martin Dalecki (dalecki@evision-ventures.com)
Tue, 04 Jun 2002 16:24:17 +0200


Jens Axboe wrote:
> On Tue, Jun 04 2002, Martin Dalecki wrote:
>
>>Jens Axboe wrote:
>>
>>>Neil,
>>>
>>>I tried converting umem to see how it fit together, this is what I came
>>>up with. This does use a queue per umem unit, but I think that's the
>>>right split anyways. Maybe at some point we can use the per-major
>>>statically allocated queues...
>>
>>>/*
>>>+ * remove the queue from the plugged list, if present. called with
>>>+ * queue lock held and interrupts disabled.
>>>+ */
>>>+inline int blk_remove_plug(request_queue_t *q)
>>
>>
>>Jens - I have noticed some unlikely() tag "optimizations" in
>>tcq code too.
>>Please tell my, why do you attribute this exported function as inline?
>>I "hardly doubt" that it will ever show up on any profile.
>>Contrary to popular school generally it only pays out to unroll vector code
>>on modern CPUs not decision code like the above.
>
>
> I doubt it matters much in this case. But it certainly isn't called
> often enough to justify the inline, I'll uninline later.
>
> WRT the unlikely(), if you have the hints available, why not pass them
> on?

Well it's kind like the answer to the question: why don't do it all in hand
optimized assembler? Or in other words - let's give the GCC guys good
reasons for more hard work. But more seriously:

Things like unlikely() tricks and other friends seldomly really
pay off if applied randomly. But they can:

1. Have quite contrary effects to what one would expect due to
the fact that one is targetting a single instruction set but in
reality multiple very different CPU archs or even multiple archs.

2. Changes/improvements to the compiler.

My personal rule of thumb is - don't do something like the
above unless you did:

1. Some real global profiling.
2. Some real CPU cycle counting on the micro level.
3. You really have too. (This should be number 1!)

Unless one of the above cases applies there is only one rule
for true code optimization, which appears to be generally
valid: Write tight code. It will help small and old
systems which usually have quite narrow memmory constrains
and on bigger systems it will help keeping the intruction
caches and internal instruction decode more happy.

Think for example about hints for:

1. AMD's and P4's internal instruction predecode/CISC-RISC translation.
Small functions will give immediately the information - "hey buddy
you saw this code already once...". But inlined stuff will still
trash the predecode engines along the single execution.
2. Think on the fly instruction set translation (Transmeta).
Its rather pretty obvious that a small function is acutally more likely
to be more palatable to software like that...
3. Even the Cyrix486 contained already very efficient mechanism to make
call instructions nearly zero cost in terms of instruction
cycles. (Think return address stack and hints for branch prediction.)

Of corse the above is all valid for decission code, which is the case here,
but not necessarily for tight loops of vectorized operations...

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/