Re: 2.2.18 data corruption issues

Denis Vlasenko (vda@port.imtp.ilyichevsk.odessa.ua)
Tue, 9 Apr 2002 09:08:06 -0200


On 9 April 2002 00:52, Russell Miller wrote:
> I'm not subscribed to the list, so please CC me on any responses.
>
> I am running the stock 2.4.18 kernel, downloaded a few days ago from the
> kernel mailing list. The kernel was custom-built to my specifications,
> using the default RH7.2 gcc (config available upon request). The machine
> is a dual pentium-III 1000 MHz, one scsi drive (sym53cxxx criver) and two
> ide drives. All filesystems are ext3 journaling.

What is your GCC version?

> We copied several very large partitions from one machine to another in an
> attempt to put a new machine in service. Just for kicks, we attempted to
> verify the copy. It turns out that a small amount of files, about 60 to
> 100 on a 17 gig partition, are corrupted. Mod times are exactly the same,
> owners, even file size. It turns out that pretty consistently four null
> characters (and occasionally other characters and a different number) are
> appended to the beginning of the file, and the last four characters are
> rolled off the end. We ran the copy (and rsync and stuff) multiple times.
> Each time different files were modified, in a seemingly random fashion, but
> with a fairly consistent pattern of corruption.
>
> I have turned off DMA on the disk drives to no effect. I have replaced the
> ide cables with higher quality cables. The problem seems to be occuring on
> both the scsi and ide drives, which to me eliminates the ide or scsi
> controllers, drivers, or anything on the back end of them as the source of
> the problem.This same machine was in service previously, minus one disk
> drive, and this problem never manifested itself, leading me to believe it
> is either something to do with the ext3 jfs, or with the 2.4.18 kernel.

It was Linux? What kernel version? Did you try copying with that kernel?

> Does anyone have any tips on how to debug this? I have administrative
> access to the machine, and although it is running production, I am very
> keen on getting this resolved and will provide any information you need.
> If this is a kernel or ext3 problem as I suspect I imagine you want to get
> this resolved as much as I do.

You may try to repeat your test with:
* newer / older kernel (maybe this is a kernel bug?)
* newer GCC (miscompiled kernel?)
* different fs (ext3 bug?)
* different hardware (last resort to rule out hw problems)

--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/