Re: Undo aic7xxx changes (now rc7+aic20030603)

Stephan von Krawczynski (skraw@ithnet.com)
Sat, 21 Jun 2003 01:48:28 +0200


On Sat, 21 Jun 2003 00:03:31 +0200
Willy Tarreau <willy@w.ods.org> wrote:

> Hi !
>
> On Fri, Jun 20, 2003 at 06:13:53PM -0300, Marcelo Tosatti wrote:
> > > Actually, without another copy of the data on a different system to
> > > verify it with, you can't know that for sure. It could easily be getting
> > > to the tape (the actual media) just fine, but then get corrupted during
> > > the verify readback.
> >
> > Right. Stephan, if you could use a bit of your time to isolate the problem
> > I would be VERY grateful.
>
> I remember Stephan once said that he used tar to verify the tape, and that
> for one backup, he did several tests showing corruption on different files.
> Although that doesn't mean that the tape is written totally correctly, it
> proves that there's at least a read corruption.

Hello Willy, hello Marcelo,

In fact, I noticed that over multiple verify cycles the so-called corruption
only rarely (read: _very_ rarely) hits the same files. So it is indeed very
likely that the read path is a problem.
Another thing to note is that I did not manage to produce a failed verify on a
dataset tar'ed to the 3ware RAID instead of to tape. I did not test that very
intensively, but given the failure rate I saw in the tape cycles, I would have
expected a corruption to show up in the tests I did.

> I think that comparing multiple reads to find a pattern in corruption offsets
> (if any) is the only thing he could do (not speaking about mixing read/writes
> with good/bad kernels). Of course, storing several times 70GB on disk is not
> easy, but at least a 16-bit checksum for each 1 kB block would result in
> files of about 140 MB, which will be "easier" to compare. It could be enough to
> check for empty blocks, duplicated blocks or totally random ones.
>
> Stephan, if you're willing to do the test but don't have such a tool, I may
> write a quick dirty one tomorrow if that helps.
>
> BTW, it could be interesting to note the read buffer's hardware address for
> each test, in case it matters.
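
For illustration, the per-block checksum idea quoted above could look roughly
like the Python sketch below. It is not Willy's tool. The function names are
made up here, and using the low 16 bits of CRC-32 as the checksum is an
assumption, since no algorithm was specified. At 2 bytes per 1 kB block,
70 GB of data gives roughly 70 million blocks, hence about 140 MB of checksums
per read.

```python
import io
import struct
import zlib

BLOCK_SIZE = 1024  # 1 kB blocks, as suggested in the quoted mail


def checksum16(block: bytes) -> int:
    """16-bit checksum of one block (assumption: low 16 bits of CRC-32)."""
    return zlib.crc32(block) & 0xFFFF


def block_checksums(stream, block_size=BLOCK_SIZE):
    """Yield one checksum per block read from a binary stream."""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        yield checksum16(block)


def write_checksums(src, dst, block_size=BLOCK_SIZE):
    """Store checksums as 2-byte big-endian values; return the block count."""
    count = 0
    for value in block_checksums(src, block_size):
        dst.write(struct.pack(">H", value))
        count += 1
    return count


def differing_blocks(sums_a, sums_b):
    """Indices of blocks whose checksums differ between two reads."""
    return [i for i, (a, b) in enumerate(zip(sums_a, sums_b)) if a != b]
```

Multiplying a differing block index by the block size gives the byte offset
to inspect. The sketch assumes the stream returns full blocks until end of
data, as a regular file does; reading straight from a tape device may need
different handling.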

Well, in fact I am a bit lost here because of the sheer data volume. I have
space for several sets on disk, but it takes a damn long time to produce one
write/verify cycle. Anyway, I will do it if that helps. The big problem with
tar is that I have (to my knowledge) no way to make it save the verify-failing
data parts somewhere. I guess this could help a lot, because we could then
see what the corruption looks like, how long (in bytes) it is and so on.
If anybody has an idea how to achieve this goal, let me know.
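
One possible approach, outside of tar, is to compare two copies of the same
data that are already on disk and record where they differ. The Python sketch
below is only an illustration under that assumption; diff_runs and save_run
are hypothetical names, not part of any existing tool. It reports each run of
differing bytes as (offset, length), which shows how long a corruption is and
where it starts.

```python
import io


def diff_runs(a, b, chunk_size=64 * 1024):
    """Compare two binary streams; return (offset, length) per differing run.

    Only the common length of the two streams is compared.
    """
    runs = []
    start = None  # offset where the current run of differences began
    offset = 0    # absolute offset of the chunk being compared
    while True:
        ca = a.read(chunk_size)
        cb = b.read(chunk_size)
        n = min(len(ca), len(cb))
        if n == 0:
            break
        if ca[:n] == cb[:n]:
            # Whole chunk matches: a pending run ends at the chunk start.
            if start is not None:
                runs.append((start, offset - start))
                start = None
        else:
            for i in range(n):
                if ca[i] != cb[i]:
                    if start is None:
                        start = offset + i
                elif start is not None:
                    runs.append((start, offset + i - start))
                    start = None
        offset += n
    if start is not None:
        runs.append((start, offset - start))
    return runs


def save_run(src, dst, offset, length):
    """Copy one differing region from a seekable stream for later inspection."""
    src.seek(offset)
    dst.write(src.read(length))
```

Calling save_run on both copies for each reported run would keep the good and
the bad version of every corrupted region side by side. The sketch assumes
regular, seekable files; it does not handle short reads from pipes.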

I am not 100% confident that the tests would look the same if I simply read the
whole tape onto the disks again and then verify via file compare, but anyway I
should try this too several times to complete the picture.

OK, the weekend is here; I will see what can be done.

Regards,
Stephan