Mysterious Reboots

April 24, 2014
by admin

Some weird reboots/reinstalls occurred that may or may not make any sense.

1d 36m ago, compute-1-18 rebooted. It’s plugged into one of the white UPSes. No other nodes plugged into this UPS went down at that time.

Seventeen hours ago, the following nodes went down: 1-12, 1-14, and 2-22 through 2-25. 1-12 and 1-14 are plugged into a power strip (along with many other nodes that did not go down), which is plugged into the 3000VA UPS that is second from the bottom in the stack. That UPS has bad batteries; we stole batteries from the bottom UPS, which is off, and made a new battery pack. 2-22 through 2-25 are plugged into the top half of the 208V PDU. All of these nodes were plugged into locations where plenty of other nodes were also plugged in and didn’t go down.

2 hours ago, the following nodes went down: 1-5, 1-10, 2-1 through 2-4. These nodes are all plugged into the top 3000VA UPS in the stack. However, compute-1-11 and 1-9 are also plugged into that UPS, and they did not go down at that time. They’re also fully loaded and have been the whole time.

What gives? At this point it is entirely unclear to me.
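One way to make the pattern easier to see is to tabulate each outage against the power source the nodes share. The snippet below is only a rough sketch: it covers just the most recent outage, the map lists only the nodes named above (there may be others on that UPS), and the names `plugged_into`, `expand`, and `classify` are made up for illustration. If a UPS had dropped its whole load, every node on it should show up as down; a "partial" verdict points at something closer to the individual node or chassis.

```python
def expand(rack, first, last):
    """Return node names like '2-1' .. '2-4' for a contiguous range."""
    return {f"{rack}-{i}" for i in range(first, last + 1)}


# Assumption: only the nodes named in the post are listed for this source.
plugged_into = {
    "top 3000VA UPS": {"1-5", "1-9", "1-10", "1-11"} | expand(2, 1, 4),
}

# Nodes that went down in the most recent outage ("2 hours ago").
down_2h_ago = {"1-5", "1-10"} | expand(2, 1, 4)


def classify(plugged_into, down):
    """For each power source, report whether all, some, or none of its
    nodes went down, plus the nodes on that source that stayed up."""
    result = {}
    for source, nodes in plugged_into.items():
        hit = nodes & down
        if not hit:
            verdict = "none"
        elif hit == nodes:
            verdict = "all"
        else:
            verdict = "partial"
        result[source] = (verdict, sorted(nodes - down))
    return result


for source, (verdict, survivors) in classify(plugged_into, down_2h_ago).items():
    print(f"{source}: {verdict} outage, still up: {survivors}")
```

With the data above, the top UPS comes out as a partial outage, since 1-9 and 1-11 stayed up, which is exactly what makes a simple "the UPS dropped its load" explanation hard to believe.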

  1.   admin Says:

    compute-2-22 through 2-25 died because the power supply in that rack died. I ordered two new ones.

  2.   admin Says:

    compute-1-18 rebooted itself at approximately 3:15 this morning for no apparent reason. It’s back, but I wonder if it has a failing power supply. It didn’t run out of memory, as far as I can tell. It could also be overheating, I suppose.
    Compute-1-6 kernel panicked. Again. I think it’s overheating.

  3.   admin Says:

    I took the door off the front of rack 1 to see if it helped with 1-18 and 1-6 overheating (if that’s even what’s happening with them). Restarted the renders that crashed when those machines went down. Breakfast time.

  4.   admin Says:

    So, it looks like the mysterious reboots are mostly due to the 3000VA UPSes misbehaving/having bad batteries/providing slightly unstable voltage. We could either buy some 0U rack PDUs with L5-30P supply plugs, or get new batteries for the UPSes. We ran a big long cord back to P11, the last available circuit in the room, and things are currently stable with the cluster running full-bore all weekend, but the situation is less than ideal. Two of the UPSes are out of service because of bad batteries, which means we’ve got 60A of power in the wall we can’t use.

  5.   admin Says:

    …except that 2-5 through 2-8 rebooted and reinstalled yesterday. They’re on the 208V PDU. So… unstable power, or a slightly bad power supply, or the chassis overheated, or some other reason?
    A lot of those C6100s have blinking orange lights much of the time; I’m not sure what that means. I should look it up, as it could very well be related.
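For reference, the 60A figure in comment 4 works out if each of the two out-of-service UPSes sits on its own 30A circuit, which is what an L5-30 connector is rated for. The rough arithmetic below is only a sketch: the 120V nominal voltage and the 80% continuous-load derating are assumptions, not measurements from this room, and `stranded_capacity` is a hypothetical helper name.

```python
CIRCUIT_AMPS = 30        # NEMA L5-30 is rated 30 A at 125 V
IDLE_CIRCUITS = 2        # two UPSes out of service, assumed one circuit each
NOMINAL_VOLTS = 120      # assumption: nominal single-phase voltage
CONTINUOUS_PERCENT = 80  # assumption: common 80% rule for continuous loads


def stranded_capacity(amps, circuits, volts, continuous_percent):
    """Return (total amps, total VA, VA usable for continuous load)."""
    total_amps = amps * circuits
    total_va = total_amps * volts
    # Integer arithmetic keeps the result exact.
    usable_va = total_va * continuous_percent // 100
    return total_amps, total_va, usable_va


total_amps, total_va, usable_va = stranded_capacity(
    CIRCUIT_AMPS, IDLE_CIRCUITS, NOMINAL_VOLTS, CONTINUOUS_PERCENT
)
print(f"{total_amps} A idle, about {total_va} VA, {usable_va} VA continuous")
```

Under those assumptions, the idle circuits represent roughly 7200 VA, or about 5760 VA of continuous load that the cluster cannot draw until the UPSes are repaired or replaced with PDUs.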

