Is there something strange going on with the ext2 Filesystem?

John L. Males (jlmales@yahoo.com)
Thu, 6 Jun 2002 04:17:07 -0400


Hi,

***** Please BCC me on any reply rather than CC me, for two reasons: I
am not subscribed to the LKML, and I am suffering BIG time from SPAM
after posting to the mailing list. Thanks in advance. *****

Ok, I have been having some odd to very serious problems with the
ext2 file system. The problems seem to have started when I began
using the 2.4.x kernels; the prime suspect is 2.4.15-pre5. I had
previously used 2.4.9-ac10, 2.4.9-ac18, 2.4.10-ac12, and 2.4.13-ac5,
as well as a SuSE 2.4.9 variant that had some odd problems, not file
system related, and that I used for only a single short session and
boot.

There are four basic issues at hand:

1) e2fsck 1.19 would hang with a 2.4.x kernel. The hang first
occurred on 2.4.15-pre5; I then tried 2.4.13-ac5 and 2.4.9-ac18, and
e2fsck hung on those two as well. With no other changes, I booted
into 2.2.19 (current at the time) with the Openwall patch, which was
already installed on the same system, and e2fsck ran to completion.
I do not recall whether any errors were detected. I may have notes
on this, as in the console messages, but I will have to do a bit of
digging.

2) BIG TIME cross-linked inode problems, and BIG BIG time bitmap
problems, that are either related to the special test programs I have
developed for testing the kernel VM subsystem, or maybe unrelated.
The bottom line is that I had really serious problems after this
e2fsck, wherein many files were lost, or, say, the "ps" name entry no
longer pointed to the "ps" program but to, say, the Free Pascal
compiler. There were a number of these messes, and so I have since
moved back to the 2.2.x kernel series. There had been "smaller
scale" problems like these before, but not with cross-linked/
duplicate inodes or the like, with other 2.4.x kernels prior to
2.4.15-pre5. I therefore feel there is a problem sitting about that
may not have been addressed yet. I have no notes on this, as these
problems were with the root file system and I had no way to capture
the messages, which at times numbered several hundred. For my other
mounts I can, since I have XFree86 up at that point and can cut and
paste the console messages into an editor and save them.

3) I am starting to see a parallel even now that I have been using
the 2.2.19/2.2.20 kernels for a number of months. The parallel is
that when the file system gets full (I will skip how this happens,
but it happens frequently given what I do), either the next boot of
the system or the next scheduled e2fsck results in various bitmap,
wrong group, directory count, etc. errors. The number of times this
has happened is more than suggestive of a problem (a rough sketch of
the reproduction I have in mind follows this list). I currently do
not have the time, or a "current" mainstream Linux release, to test
my theory without risking my one and only day-to-day Linux system. I
am trying to find the time, and a current distribution I can download
over a 56K modem line, to do such a test. I also need to see whether
my VM tests are a factor, as in whether stressing the kernel in the
VM context has secondary negative effects. Not to mention seeing
what progress the VM subsystem has made. My last check, with
2.4.15-pre5, was OK, but there were still many problems.

4) Now, in the past few days, I had a case where the system was shut
down normally, then started again the next day for a few hours of
very light activity and shut down normally again. On the next
startup there were group/directory count problems and bitmap
problems, all after two clean shutdowns and no mount/device-full
conditions during either of those sessions.
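
To make 3) and 4) concrete, the reproduction I have in mind is
roughly the following. This is only a minimal sketch: the mount
point, backing device, and file sizes are placeholders of my
choosing, and it must only ever be pointed at an expendable scratch
file system, never a live one.

#!/usr/bin/env python
# Minimal sketch: fill a scratch ext2 mount to ENOSPC, delete the
# fill files, unmount, and run e2fsck to see whether the full
# condition left damage behind. MOUNT_POINT and DEVICE are
# placeholders for an expendable test file system.

import errno
import os
import subprocess

MOUNT_POINT = "/mnt/scratch"   # placeholder scratch mount
DEVICE = "/dev/loop0"          # placeholder backing device
CHUNK = "x" * 65536            # 64 KiB write chunks

def fill_to_enospc(directory):
    """Create files until the file system reports ENOSPC."""
    n = 0
    try:
        while True:
            with open(os.path.join(directory, "fill.%06d" % n), "w") as f:
                for _ in range(64):          # up to ~4 MiB per file
                    f.write(CHUNK)
            n += 1
    except (IOError, OSError) as e:
        if e.errno != errno.ENOSPC:
            raise
    return n

files = fill_to_enospc(MOUNT_POINT)
print("device full after %d files" % files)

# Remove only the files we created (lost+found stays), unmount, and
# force a full read-only check of the underlying device.
for name in os.listdir(MOUNT_POINT):
    if name.startswith("fill."):
        os.unlink(os.path.join(MOUNT_POINT, name))
subprocess.call(["umount", MOUNT_POINT])
# -f forces a check even if the fs is marked clean; -n answers "no"
# to every question so nothing is modified.
rc = subprocess.call(["e2fsck", "-f", "-n", DEVICE])
print("e2fsck exit status: %d" % rc)

If my suspicion is right, repeating that fill/delete cycle a few
times before the check should turn up the same bitmap and count
errors I keep seeing.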

In conclusion, I think there needs to be more formalized and specific
QA/testing of file systems. The conditions to be tested need to
become a formal and routine part of Linux kernel testing. I realize
it can be time consuming, but I am sure there are "creative" ways and
techniques that can be applied to ease the effort and automate the
process to a great extent. From my ongoing experience, the behaviour
of the ext2 file system under device/file-system-full conditions,
with different block sizes, inode density configurations, and kernel
stress conditions, is a key starting element for such an ongoing
sanity check of the kernel. I would expect all the other file
systems to be put through the generic file system tests as well, in
combination with their own specific configurations/options. I
happened not to take the defaults, and therefore these combinations
may not have been tested at all, or not very well, for any file
system, not just ext2.
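
To give a concrete idea of what I mean by automating such a matrix,
here is a minimal sketch. It is not anything I have run as part of
these reports: the image path, image size, and the particular values
in the matrix are placeholders I am picking for illustration, though
the mke2fs -b/-i and e2fsck -f/-n options themselves are the standard
e2fsprogs ones.

#!/usr/bin/env python
# Minimal sketch of an automated ext2 sanity matrix: for each block
# size and inode density, build a loopback image, mke2fs it, and
# verify that a forced e2fsck comes back clean. A real harness would
# mount the image and exercise it (fill to ENOSPC, heavy create/
# delete, VM pressure) before the check; that step is elided here.

import subprocess

IMAGE = "/tmp/ext2-test.img"            # scratch image, safe to destroy
BLOCK_SIZES = [1024, 2048, 4096]        # mke2fs -b values to cover
BYTES_PER_INODE = [1024, 4096, 16384]   # mke2fs -i values to cover

def run(cmd):
    rc = subprocess.call(cmd)
    if rc != 0:
        raise RuntimeError("%s failed with status %d" % (cmd[0], rc))

for bs in BLOCK_SIZES:
    for bpi in BYTES_PER_INODE:
        if bpi < bs:
            continue   # mke2fs requires bytes-per-inode >= block size
        # A 64 MiB sparse image is plenty for a smoke test.
        with open(IMAGE, "wb") as f:
            f.truncate(64 * 1024 * 1024)
        run(["mke2fs", "-q", "-F", "-b", str(bs), "-i", str(bpi), IMAGE])
        # ... mount, stress, unmount would go here ...
        # -f forces a full check even when the fs is marked clean;
        # -n keeps e2fsck read-only. Non-zero exit means trouble.
        rc = subprocess.call(["e2fsck", "-f", "-n", IMAGE])
        print("bs=%d bpi=%d -> e2fsck status %d" % (bs, bpi, rc))

The real work, of course, is the stress step in the middle;
file-system-full and VM-pressure loads like the ones I described
above would slot in there, and the matrix makes sure non-default
configurations get covered too.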

I would like to trust Linux with its file system management, as I can
tell you first hand that The Other OS has big-time problems, on a
routine basis, with the two primary file systems it uses. I know
this from first hand experience, BIG TIME. I can tell you I am not
your average user, and I know there are many very large installations
that have used Linux under what are believed to be stressful and
demanding workloads over several hours. All I can tell you is that
for some reason I seem to break the general 80/20 rule more often,
and my uses of systems tend to be more of a 70/30 or 60/40 split.

For that reason I seem to encounter, and find by "accident", more
problems than others do in my day-to-day use of systems. I will not
tell you what happens when I actually try to test and break software
to validate its design/stability/duty cycles.

I am taking the time now to articulate my experiences with my
"modest" ext2 use to give a heads-up before a possibly more serious
problem is experienced. Again, I am aware there are clearly more
complicated, stressed systems than my "workstation" usage. Perhaps
my "experience" is an opportunity to detect something before it
becomes much worse. I do read the kernel patch/release updates on a
regular basis, and so far I cannot say I have seen anything that
suggests the problem I am experiencing has been directly or
indirectly addressed. I do recall the 2.4.x kernels had some
superblock-related activity. I have no idea whether what I have
experienced is somehow related or not. All I can say is a very
remote "maybe", as I am not a kernel developer/hacker and I have no
grasp of the details or impact of such work. I know there has been
SCSI subsystem work done as well, but again I have no real knowledge
of what impact, if any, that may be having on what I have
experienced. All I know is that without being able to duplicate the
problem, it will be difficult to investigate it and find out what
issues there may be.

Regards,

John L. Males
Willowdale, Ontario
Canada
06 June 2002 04:17

==================================================================
Please BCC me by replacing yahoo.com after the "@" as follows:
TLD = The last three letters of the word internet
Domain name = The first four letters of the word software,
              followed by the last four letters of the word
              homeless.
My apologies in advance for the jumbled eMail address
and request to BCC me, but SPAM has become a very serious
problem. The eMail address in my header information is
not a valid eMail address for me. I needed to use a valid
domain due to ISP SMTP screening rules.
