Bringing the Ukko cluster down
On Friday, March 4th, we began a heavy-duty computational experiment to find out the Ukko computing cluster's maximum power draw. Since nobody had run a "stress test" on the cluster before, we did not really know what would happen if the cluster was fully loaded (or overloaded) for a longer period. What's more, it would also be interesting to find the relationship between the job pattern on Ukko and its power consumption.
Our original idea was to use a large-scale BitTorrent experiment as the stress test. However, since BitTorrent also involves complicated networking operations, it would have been somewhat unrepresentative as a pure stress test. We changed our plan a little and scripted a 50% (16 GB) memory consumption on each node while performing some nasty calculations byte by byte, e.g., (e**sin(e**y)) % 2**63.
Our stress test focused on intensive CPU calculation and memory operations. On each node, we started 16 processes, matching the number of cores. Each process performed exactly the same job, as we wanted each node to remain just fully loaded.
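The setup described above can be sketched roughly as follows. This is not the script we ran, only a minimal illustration under stated assumptions: the names nasty, worker and load_node, the fill pattern, and the tiny buffer size are hypothetical, and each buffer is processed only once instead of for hours. Only the expression itself and the figure of 16 processes per node come from the text.

```python
import math
from multiprocessing import Pool

N_PROCS = 16  # from the text: one process per core on each node


def nasty(y: int) -> float:
    """The expression quoted in the text, evaluated for one byte value y."""
    return (math.e ** math.sin(math.e ** y)) % 2**63


def worker(n_bytes: int) -> float:
    """Allocate n_bytes, then read, compute on and rewrite every byte once."""
    # Hypothetical fill pattern; the original data is not described.
    buf = bytearray(i % 256 for i in range(n_bytes))
    checksum = 0.0
    for i in range(n_bytes):
        value = nasty(buf[i])
        buf[i] = int(value) % 256  # write back so the memory stays in use
        checksum += value
    return checksum


def load_node(n_procs: int, bytes_per_proc: int) -> list[float]:
    """Run the same job in n_procs parallel processes."""
    with Pool(n_procs) as pool:
        return pool.map(worker, [bytes_per_proc] * n_procs)


if __name__ == "__main__":
    # Tiny demo size (assumption); the real test used about 16 GB per node.
    print(load_node(n_procs=2, bytes_per_proc=4096))
```

To approximate the real load, one would call load_node with N_PROCS processes and a per-process buffer that adds up to half of the node's memory, and repeat the pass in a loop.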
During our computation we measured a steady 68 kW power draw from Ukko, and an aggregate power draw of 115 kW for the entire data center. On Saturday, March 5th, after the experiment had been running for about 8.5 hours, Ukko's power connection failed. Initially, we assumed that the fuses had blown. In reality, Ukko's power supply line had completely melted.
Measured power draw in kW. During computation, Ukko's power draw surges from ca. 40 kW to almost 70 kW.
On Saturday morning, the on-duty repairman reported that Ukko's fuse box was "somehow wrong", and no further repair could be initiated. Two days later, when our electricians came back to work, we discovered that the three 100 A fuses had not only blown, they had taken the fuse box with them.
It took four days of downtime to locate and install a new fuse box. With the new box in place, we could repeat the experiment and check whether the problem had been eliminated. It had not; this time, it took only five hours for the fuses to blow.
A much more thorough examination started. By measuring each of the three electrical phases separately, our electricians and Pekka Niklander discovered that Ukko's three-phase power draw was gravely unbalanced. One of the phases drew 83 A, while the other two drew 101 A and 105 A. The supply fuses were rated for only 100 A.
As the load was just a little over the rating, it took a few hours for a fuse to overload and blow. When the fuse finally blew, Ukko's power supply units (PSUs) took care of the rest. The redundant PSUs shift their load over to the remaining connectors when a connector disappears. In our case, a failing phase ensured that the remaining phases were certainly overloaded as well.
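The arithmetic behind this can be checked with a few lines of Python. The measured currents and the 100 A rating are from the text; everything else is an assumption made for illustration: the phase labels L1 to L3, a nominal 230 V phase voltage with the power factor ignored, and a simplified model in which the load of a failed phase is split evenly between the other two.

```python
FUSE_RATING_A = 100  # from the text
# Measured currents from the text; the labels L1..L3 are hypothetical.
PHASE_CURRENTS_A = {"L1": 83, "L2": 101, "L3": 105}
PHASE_VOLTAGE_V = 230  # assumption: nominal phase-to-neutral voltage


def overloaded(currents: dict, rating: float) -> dict:
    """Return the overload in percent for every phase above the fuse rating."""
    return {
        phase: 100.0 * (amps - rating) / rating
        for phase, amps in currents.items()
        if amps > rating
    }


def after_failure(currents: dict, failed: str) -> dict:
    """Simplified model (assumption): split the failed phase's load evenly."""
    remaining = {p: a for p, a in currents.items() if p != failed}
    share = currents[failed] / len(remaining)
    return {p: a + share for p, a in remaining.items()}


def approx_power_kw(currents: dict, voltage: float) -> float:
    """Rough total power, ignoring the power factor (assumption)."""
    return sum(currents.values()) * voltage / 1000


if __name__ == "__main__":
    print(overloaded(PHASE_CURRENTS_A, FUSE_RATING_A))
    print(after_failure(PHASE_CURRENTS_A, "L3"))
    print(approx_power_kw(PHASE_CURRENTS_A, PHASE_VOLTAGE_V))
```

Under these assumptions, two phases sit 1% and 5% above the rating, which fits a fuse that holds for hours before blowing. Once the most heavily loaded phase drops out, the model puts the other two at 135.5 A and 153.5 A, far beyond 100 A. The rough power estimate of about 66 kW is also in line with the 68 kW we measured.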
The fuse box was another matter. It turned out to be a case of human error during installation. As the fuse cabinet was a bit too narrow for the box, the eager electrician installing the fuses had bent the box slightly to install the fuses "correctly". This, in turn, caused the fuse connectors to mismatch, and ultimately melted the whole box.
Fuse box with blown fuses. Note the metal connectors bent by the surge of electricity.
A number of improvements are underway, but the lesson to learn is this:
don't try this at home, kids!
Text and images: Mikko Pervilä and Liang Wang
The relentless IT teams of the Department and HIIT noticed on Saturday morning that something was wrong with the Ukko cluster. This is a selection from the teams' IRC channel log (published with their permission).