Ukko cluster overloaded

29.11.2011 - 12:00 - 18:00
Description:

The Ukko cluster was (and partly still is) badly overloaded by a runaway user process. Some or all nodes will probably need a reboot to fix this.

Update 14:40

The nodes that still respond to ssh requests are probably usable now. The rest of the nodes probably need to be rebooted. The hpc-report page may now show correct information.

Update 16:15

Most nodes are still down. Killing the runaway process was not enough, because it kept restarting itself over ssh. Nodes are still being rebooted.

Update 17:45

The cluster is back up and running. Next week I will schedule a reboot for the nodes that survived this incident.

29.11.2011 - 18:48 Jani Jaakkola
29.11.2011 - 14:34 Jani Jaakkola