Re: Linux/Pro -- clusters

Doug Ledford (dledford@redhat.com)
Thu, 06 Dec 2001 13:38:44 -0500

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Rik van Riel: "Re: Kernel 2.4.16 & Heavy I/O"
Previous message: Martin J. Bligh: "Re: SMP/cc Cluster description"
Maybe in reply to: Donald Becker: "Linux/Pro -- clusters"
Next in thread: Martin Dalecki: "Re: Linux/Pro -- clusters"

Linus Torvalds wrote:

>In article <E16BlnL-00080m-00@the-village.bc.nu>,
>Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
>>You still need the scsi code. There are a whole sequence of common, quite
>>complex and generic functions that the scsi layer handles (in paticular
>>error handling).
>>
>
>Well, the preliminary patches already handle _some_ common things, like
>building the proper command request for reads and writes etc, and that
>will probably continue. We'll probably have to have all the old helpers
>for things like "this target only wants to be probed on lun 0" etc
>
Personally, I think it would be a mistake to have any of the low level
driver know a thing about the ll_rw_blk.c request structs or bio
structures or similar stuff. They simply don't *need* to know those
things. Low level drivers need a command that can go to the drive and
they need a sg array. Smart hosts, aka raid controllers and the like,
could theoretically use the higher level structs for reasonable things,
but I won't make any assumptions about which ones could or couldn't use
that information because I haven't looked into it. So, making the scsi
mid layer a helper library may be OK, but for the largest number of
drivers, it really means that the call chain would likely look something
like this:

blk_dev->request_fn()
scsi_dispatch_request() // map from the major/minor to driver
driver_queue_routine(bio *)
Scsi_Cmnd = scsi_make_cmnd(bio *);
send_command(cmnd); //put it in the hardware

driver_interrupt_routine()
if (cmnd = get_errored_command()) {
// Pick out the commands that had a transfer related error
// and rework or error them out here
if(retryable_error(cmnd))
requeue_command(cmnd);
else {
cmnd->bio->completion_handler(cmnd->bio);
scsi_free_command(cmnd);
}
}
if (cmnd = get_completed_command()) {
retryable = scsi_check_sense(cmnd);
switch(retryable) {
case TRANSIENT_ERROR:
requeue_command(cmnd);
break;
case FATAL_ERROR:
case NO_ERROR:
cmnd->bio->completion_handler(cmnd->bio);
scsi_free_command(cmnd);
break;
case BUSY:
requeue_command_with_delay(cmnd);
break;
case ...
}
}

Personally, I don't like it. The queueing stuff is somewhat OK (but
requires other things to be changed to accomodate and I don't think
those things should change), but I don't want to have to deal with
result parsing in every driver. Proper result parsing is huge and easy
to get wrong. The current code gets it wrong. There are a few ways in
which it could be fixed to do the right thing without disrupting the
world. Duplicating that in all the low level drivers will be a
nightmare though. I also don't like one aspect of putting the queueing
totally in the low level drivers. That way, all of the low level
drivers are going to have to maintain A) delayable queues with variable
delay times and undelay on completion semantics (my driver already has
this, but it's been a source of pain to maintain, it would be nice if it
didn't need it) and B) ordering requirements when multiple commands for
a device operating in untagged mode are in the queue (this one isn't so
hard, but getting it wrong means things like tape drives will store
garbage when you least expect it).

Anyway, I suspect the end result would be the same either way: drivers
would end up having control over their own flow of commands. The only
difference would be how that control would be achieved. A) by changing
the mid layer to accept more driver specific parameters (aka, even
though MegaRAID may take up to 60 seconds to complete a normal command,
the aic7xxx should never take more than 10 seconds, so the default
command timeout length would be driver specific) and to honor the low
level driver's idea of what to do on a timeout (let the low level driver
tell the mid layer what action to take, and that should be the limit of
the mid layers error handling "brains", performing those actions). Or
B) removing the mid layer from the picture all together (as much as
possible anyway, it (or a similar variant) will still likely be needed
on queueing just to map from major/minor to driver unless we change the
device allocation scheme or something) and then having the low level
driver call into helper routines.

>
>I disagree about the error handling, though.
>
>Traditionally, the timeouts and the reset handling was handled in the
>SCSI mid-layer, and it was a complete and utter disaster. Different
>hosts simply wanted so different behaviour that it's not even funny.
>
That's not really accurate. It was too opaque, sure. But for what it
was it did a decent job. The mid layer driver was trying to make
generic timeout decisions without the benefit of the low level driver's
knowledge of the current bus state. For example, 20 commands may time
out at once, but in reality, only one of those commands is probably
holding up the SCSI bus. The low level drivers can (usually) look at
their card to see which command is *really* the hold up. Then, they
could have all the other commands simply put back to sleep without doing
anything and take appropriate action against the holdup command. The
current mid layer (at least in the old error handling) couldn't do that.
The new_eh code attempts to allow drivers to tell them these things by
use of the strategy function. In reality, that's all you need. That's
the driver's ability to tell the mid layer *how* to proceed on any given
command.

>
>Timeouts for different commands were so different that people ended up
>making most timeouts so long that they no longer made sense for other
>commands etc.
>
Not accurate here either. For most commands across all controllers,
timeouts are pretty uniform (the timeout to read a CD-ROM for instance
is pretty constant). The timeouts *only* started to vary when you ran
across smart RAID controllers that were doing too much work rebuilding
things and didn't respond to your requests in a timely fashion. And
that only applys to their logical drives. It would be pretty easy to
just allow disk timeouts to be adjusted on a disk by disk basis to a
reasonable default for that controller and solve this problem.

>
>Other device drivers have been able to handle timeouts and errors on
>their own before, and have _not_ had the kinds of horrendous problems
>that the SCSI layer has had.
>
If you use the new eh strategy function (or so I understand it, if I'm
wrong here then I'll take the time to *make* myself right by changing
the code), then you are essentially allowing your driver to take control
of *what* happens, and the mid layer timeout code becomes nothing more
than a glorified timer creation, monitoring, and teardown framework that
happens to be plugged into the queueing and completion framework so it
can notice new commands and when old commands are done. Nothing more.
And that's all a SCSI driver that wants to do its own thing really
needs. Done properly, this could be a "good" thing. Done poorly,
everyone will hate it. But, the ability is there to make your driver do
what you want with the strategy call in.

>
>We'll see what the details will end up being, but I personally think
>that it is a major mistake to try to have generic error handling. The
>only true generic thing is "this request finished successfully / with an
>error", and _no_ high-level retries etc. It's up to the driver to decide
>if retries make sense.
>
Well, I disagree on this. I think asking all those drivers to deal with
sense data and keep up with changing standards and new sense returns on
new devices, etc. is just asking for a lot of out of date drivers that
do the wrong thing. Sense parsing and decision making is *device*
specific, not controller specific, and I don't think it has any place
being in the controller's responsibility list. You might as well start
asking eth drivers to not only check checksums on packets but also
protocol flags before passing the packet up to the IP or TCP layers and
to perform all the TCP retrans operations in themselves instead of in
the TCP layer.

>
>(Often retrying _doesn't_ make sense, because the firmware on the
>high-end card or disk itself may already have done retries on its own,
>and high-level error handling is nothing but a waste of time and causes
>the error notification to be even more delayed).
>
> Linus
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Rik van Riel: "Re: Kernel 2.4.16 & Heavy I/O"
Previous message: Martin J. Bligh: "Re: SMP/cc Cluster description"
Maybe in reply to: Donald Becker: "Linux/Pro -- clusters"
Next in thread: Martin Dalecki: "Re: Linux/Pro -- clusters"