Frequent system lockups under load

Daniel Khan (dk@webcluster.at)
Thu, 23 Jan 2003 14:30:27 +0100


Hello List,

this posting is my last attempt to find a solution to a problem I have had
since last summer.

Last summer we bought 2 servers for a cluster and ran RedHat 7.2 on them.
We had frequent lockups (described below) and decided to replace the whole
hardware. On the 2 new servers we installed RedHat 7.3, and the same problems
occurred.
We have always run the latest RedHat SMP kernels; we are _now_ on
2.4.18-19.7.xsmp.

The lockup:
The problem mostly happens under higher load (cron jobs).
When the problem occurs, the remote shell is dropped. The local shell also
stops accepting commands.
All ports remain open.
All interfaces remain up.
A telnet to ports like pop3 results in a "connected" message but nothing
more.
Only Courier IMAP gives a "Hello" message, but that is all.
So no services are reachable anymore.
The only solution is a hard reset.
There are no logfile entries that indicate any problem.
It seems as if syslog also crashes before it can log anything.
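
The "connected but nothing more" symptom can be checked from a second machine
with a small probe. The following is only a minimal sketch in Python: the host
address 192.0.2.10 and the function names are placeholders made up for
illustration, and the port numbers are simply the standard ones for SMTP, POP3
and IMAP. It separates "connection refused" from "connection accepted, but no
greeting arrives".

```python
import socket


def probe_banner(host, port, timeout=5.0):
    """Connect to host:port and wait up to `timeout` seconds for a greeting.

    Returns (connected, banner); banner is b"" if nothing arrived in time.
    """
    try:
        # The timeout also stays in effect for the later recv() call.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or the connect itself timed out.
        return False, b""
    with sock:
        try:
            banner = sock.recv(256)
        except OSError:
            # Includes the read timeout: connected, but no greeting.
            banner = b""
    return True, banner


def classify(connected, banner):
    """Turn a probe result into a short status string."""
    if not connected:
        return "down"
    if not banner:
        return "accepts-but-silent"
    return "ok"


if __name__ == "__main__":
    # Hypothetical host; replace with the address of the affected server.
    host = "192.0.2.10"
    for name, port in (("smtp", 25), ("pop3", 110), ("imap", 143)):
        connected, banner = probe_banner(host, port)
        print(name, port, classify(connected, banner), banner[:60])
```

Run from cron on another host, this would at least give a timestamp for the
moment the services stop answering, which the local logs do not provide.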

The system:
glibc-2.2.5-42
heartbeat-0.4.9f-1
RedHat 7.3 2.4.18-19.7.xsmp
pam_ldap
nss_ldap-198-3
openldap-2.0.25-1

All 7 partitions are ext3 with default parameters on an IDE hardware RAID 5:
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/hda2    26G  5.4G    19G   22%  /
/dev/hda1    45M   42M   2.1M   96%  /boot
/dev/hda3    65G   12G    50G   19%  /home
none        756M     0   755M    0%  /dev/shm
/dev/hda9   243M  4.8M   225M    3%  /tmp
/dev/hda5    24G  159M    22G    1%  /var/lib/mysql
/dev/hda6    20G  3.1G    16G   17%  /var/log/httpd
/dev/hda7    13G   33M    12G    1%  /var/spool/mail

The hardware:
IDE RAID controller: ACCUSYS ACS7610 B0W121 (on the first servers it was a
3ware)
2x 1.2 GHz CPUs
1 GB RAM

What I found out:
As long as I don't do LDAP lookups in postfix, the system seems to be stable.
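
To test this observation without postfix in the picture, one could generate
user lookups directly through NSS. This is a rough sketch under assumptions,
not a confirmed reproducer: it assumes that /etc/nsswitch.conf routes passwd
lookups to ldap, so that each pwd.getpwnam() call ends up in nss_ldap, and the
function names and default values are invented for illustration.

```python
import pwd
import sys
import threading


def lookup_users(names, rounds):
    """Resolve every name through NSS `rounds` times; return (hits, misses)."""
    hits = misses = 0
    for _ in range(rounds):
        for name in names:
            try:
                pwd.getpwnam(name)  # goes through the NSS passwd sources
                hits += 1
            except KeyError:
                # getpwnam raises KeyError when no such user is found.
                misses += 1
    return hits, misses


def run_parallel(names, rounds, workers):
    """Run lookup_users in `workers` threads and sum up the results."""
    results = []
    lock = threading.Lock()

    def worker():
        result = lookup_users(names, rounds)
        with lock:
            results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(h for h, _ in results), sum(m for _, m in results)


if __name__ == "__main__":
    # Pass LDAP-backed user names on the command line (none are assumed here).
    names = sys.argv[1:] or ["root"]
    hits, misses = run_parallel(names, rounds=1000, workers=4)
    print("hits:", hits, "misses:", misses)
```

Whether the lookups really overlap depends on the interpreter; starting
several copies of the script at once is a simple alternative. If the machine
locks up under this load alone, postfix itself could be ruled out.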

Other things:
I don't use nscd.
I have updated all packages (At least RedHat network tells me that
everything's fine.)

Why I think it's a kernel issue:
Well, it may also be a problem in nss_ldap, but I think that everything goes
back to the kernel if the system gets locked up totally.
I am now thinking that it may be a problem in ext3 in conjunction with
nss_ldap.
But, as you have already found out, I'm no kernel hacker at all. I have simply
been trying to debug and rule things out for months, and now I'm really stuck.

I would be glad if somebody could help me with this problem or point me in
the right direction, because the only things I can search for are
"RedHat crash", "RedHat freeze" and "RedHat ext3 crash",
as I simply don't know why and how this happens...

Would you please CC replies to me, as I am not subscribed yet, and please
don't shoot me...

Thanks in advance - _AND_ thanks for your work on the Linux kernel.

--
Daniel Khan
