Ukko cluster overloaded
29.11.2011 - 12:00 - 18:00
Description:
Ukko cluster was (and partly still is) badly overloaded because of a runaway user process. Some or all nodes will probably need a reboot to fix this.
Update 14:40
The nodes that still respond to ssh requests are probably usable now. The rest of the nodes probably need to be rebooted. The hpc-report page might now show correct information.
Update 16:15
Most nodes are still down. Killing the runaway process was not enough, since it kept restarting itself through ssh. Nodes are still being restarted.
Update 17:45
The cluster is now back up and running. I will schedule a reboot next week for the nodes that survived this incident.