Re: cerberus errors on 2.4.19 (ide dma related)

Ed Sweetman (safemode@speakeasy.net)
18 Aug 2002 03:26:42 -0400


Ok, I reran the test using a process of elimination.
The problem occurs only when DMA is enabled on the hard drive attached to the Promise controller.

With DMA disabled on the Promise controller, the cerberus test ran for 15 minutes without any errors, whether DMA on the VIA controller was enabled or disabled. With DMA enabled on the Promise controller, cerberus reported MEMORY errors within 30 seconds both times, again regardless of whether DMA was enabled or disabled on the VIA controller.
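
The elimination described here can be written down as a small Python sketch. It only encodes the four runs reported above (DMA on or off for the Promise drive, crossed with DMA on or off for the VIA drive) and checks which controller's DMA setting lines up with the failures. The names `observed` and `implicated` are made up for illustration; they are not part of cerberus, hdparm, or the kernel.

```python
# Outcomes of the four runs described above.
# Key: (promise_dma_enabled, via_dma_enabled)
# Value: True if cerberus reported MEMORY errors in that run.
observed = {
    (False, False): False,
    (False, True): False,
    (True, False): True,
    (True, True): True,
}


def implicated(results):
    """Return the controllers whose DMA setting alone predicts every outcome.

    A controller is implicated only if each run failed exactly when DMA was
    enabled on that controller, independent of the other controller.
    """
    names = ("promise", "via")
    culprits = []
    for idx, name in enumerate(names):
        if all(failed == settings[idx] for settings, failed in results.items()):
            culprits.append(name)
    return culprits


print(implicated(observed))  # prints ['promise']
```

This is only a bookkeeping aid for a two-factor test matrix. It says which setting correlates with the errors, not whether the fault lies in the driver, the controller, the drive, or the cabling.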

It appears, then, that there is a DMA problem with my Promise controller under its current driver. My swap used to be on the drive attached to the Promise controller, which would explain the fs corruption on both drives (swap cached and such).

If whoever develops this driver wants more bug testing or specific information, I can provide it. I'd like to help get the problem solved.

On Sun, 2002-08-18 at 02:44, Ed Sweetman wrote:
> (overview written in hindsight of writing email)
> I ran all these tests on ide/host2/bus0/target0/lun0/part1
> whenever dma was enabled or disabled, it was done to both drives at the same
> time.
> I do not know whether cerberus cares where it is run from to do its tests,
> but the program was on the drive it was testing when run, and
> throughout the email i assume it runs its drive tests primarily on
> the partition you've run it on. I see now that this is probably wrong,
> and instead of changing where i run the test i should alternate which
> drive gets dma enabled and disabled, so that process of elimination will show
> just the kind of dma bug i'm seeing
> (/overview)
>
>
> I've been trying to track down why i seem to get disk corruption on my
> harddrives after some good amount of usage all the time. It's been
> happening for a long time across a number of different kernel versions.
> I believe this is because i stick to the same board manufacturer, Abit
> and use via chipsets.
>
> I ran cerberus with dma enabled at UDMA4 and at UDMA2. At udma4 cerberus
> reports MEMORY errors and BBidehost2bus0target0lun0discN1 errors, but
> mostly MEMORY errors, before the kernel panics after a minute or two. At
> udma2 cerberus reports no errors, but the kernel panics after a minute or two.
> I ran cerberus a couple of times on each; with UDMA4 it began to error
> about 30 seconds into the test with MEMORY errors.
>
> I thought, well this could be ram errors, so i ran memtest for a couple
> hours. Nothing reported as being bad. I then thought, my hardware
> could be the problem, so I ran e2fsck -c on the partition I was running
> cerberus on with dma disabled via hdparm -d0 and it completed with no
> errors found. I then rebooted, enabled udma2 and the kernel panic'd
> with the same test after a few minutes.
>
> The rest of this email is just information regarding the setup
>
>
> First off, my filesystems are set up as follows:
>
> swap + files are now all on my primary master ide drive on the
> motherboard ide controller. Swap on my primary master promise controller
> seemed too problematic because of corruption, but i'm not sure if the
> corruption i've seen is related only to the promise controller or if
> it's not controller specific. I'll have to run the test without swap on
> the promise drive and then run the test on my primary motherboard hdd
> and again without swap.
>
> cerberus version : 1.3.0pre4
> dmesg info : http://signal-lost.homeip.net/lkml/dmesg
> hdparm info : http://signal-lost.homeip.net/lkml/hdparm
> pci info : http://signal-lost.homeip.net/lkml/lspci
>
> tests completed before escaping in pio mode:
> http://signal-lost.homeip.net/lkml/tests_passed
>
> Errors during last test that caused kernel panic (udma2)
> http://signal-lost.homeip.net/lkml/memory
>
> Errors during test of udma4 (first test)
> http://signal-lost.homeip.net/lkml/memory2
> http://signal-lost.homeip.net/lkml/dmesg2
> various segfaults of badblocks of BBidehost tests.
>
>
> I ran memtest for an extensive amount of time after the first test
> reported memory errors and got absolutely no errors (wasn't using dma
> mode at the time either). And since these errors aren't produced when
> not using DMA on my drives I find it very unlikely that it's "System
> Ram" as the cause of them. I'm going to rerun the test on my
> motherboard primary drive after posting this in case something happens
> and i hose everything.

> Forgot to add my kernel config.
> http://signal-lost.homeip.net/lkml/config
