Mysterious Reboots
April 24, 2014 by admin
Some weird reboots/reinstalls occurred that may or may not make any sense.
1d 36m ago, compute-1-18 rebooted. It’s plugged into one of the white UPSes. No other nodes plugged into this UPS went down at that time.
Seventeen hours ago, the following nodes went down: 1-12, 1-14, and 2-22 through 2-25. 1-12 and 1-14 are plugged into a power strip (along with many other nodes that did not go down), which is plugged into the 3000VA UPS that is second from the bottom in the stack. That UPS has bad batteries, so we stole batteries from the bottom UPS, which is off, and made a new battery pack. 2-22 through 2-25 are plugged into the top half of the 208V PDU. All of these nodes were plugged into locations that had plenty of other nodes also plugged in that didn’t go down.
2 hours ago, the following nodes went down: 1-5, 1-10, 2-1 through 2-4. These nodes are all plugged into the top 3000VA UPS in the stack. However, compute-1-11 and 1-9 are also plugged into that UPS, and they did not go down at that time. They’re also fully loaded and have been the whole time.
What gives? At this point it is entirely unclear to me.
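One way to keep the comparison straight is to tabulate, per power feed, which nodes dropped and which stayed up. Below is a rough sketch of that bookkeeping, not a diagnosis tool. The feed labels are informal names made up for this example, and the map only includes nodes named above, so the unnamed nodes on the shared power strip and the PDU are missing.

```python
# Power feed for each node, transcribed from the post. The feed labels are
# informal descriptions invented for this sketch, not real device names.
# Nodes the post does not name individually are omitted.
TOP_UPS = "top 3000VA UPS"
SECOND_UPS = "3000VA UPS, second from bottom"
PDU_208V = "208V PDU, top half"
WHITE_UPS = "white UPS"

FEED = {
    "1-18": WHITE_UPS,
    "1-12": SECOND_UPS, "1-14": SECOND_UPS,
    "2-22": PDU_208V, "2-23": PDU_208V, "2-24": PDU_208V, "2-25": PDU_208V,
    "1-5": TOP_UPS, "1-10": TOP_UPS,
    "2-1": TOP_UPS, "2-2": TOP_UPS, "2-3": TOP_UPS, "2-4": TOP_UPS,
    "1-9": TOP_UPS, "1-11": TOP_UPS,
}


def node_key(name):
    """Sort '1-10' after '1-9' by comparing the numeric parts."""
    return tuple(int(part) for part in name.split("-"))


def split_by_feed(down, feed=FEED):
    """For every feed touched by an outage, list which known nodes went
    down and which stayed up. A feed with survivors did not lose power
    as a whole."""
    down_set = set(down)
    touched = {feed[node] for node in down_set}
    report = {}
    for source in touched:
        members = sorted((n for n, s in feed.items() if s == source), key=node_key)
        report[source] = {
            "down": [n for n in members if n in down_set],
            "up": [n for n in members if n not in down_set],
        }
    return report


if __name__ == "__main__":
    # The outage from two hours ago.
    latest = ["1-5", "1-10", "2-1", "2-2", "2-3", "2-4"]
    for source, nodes in split_by_feed(latest).items():
        print(source, "down:", nodes["down"], "up:", nodes["up"])
```

For the most recent outage this reports 1-9 and 1-11 as survivors on the top UPS, which is the puzzle: the feed as a whole evidently did not drop.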
April 24th, 2014 at 2:59 pm
compute-2-22 through 2-25 died because the power supply in that rack died. I ordered two new ones.
April 25th, 2014 at 6:02 am
compute-1-18 rebooted itself at approximately 3:15 this morning for no apparent reason. It’s back, but I wonder if it has a failing power supply. It didn’t run out of memory, as far as I can tell. It could also be overheating, I suppose.
Compute-1-6 kernel panicked. Again. I think it’s overheating.
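For the out-of-memory question, one quick check is to scan the node's system log for OOM-killer messages from just before the reboot. This is only a sketch: the function names are made up here, and the log location is an assumption (often /var/log/messages on Red Hat style systems), so pass in whatever path the node actually uses.

```python
import re

# The Linux kernel logs lines containing "Out of memory" and "oom-killer"
# when the OOM killer fires; match either, ignoring case.
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)


def oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file. The path is supplied by the caller, since the
    location of the system log varies between distributions."""
    with open(path, errors="replace") as log:
        return oom_lines(log)
```

Calling `scan_log` on the node's system log and getting an empty list back would be consistent with the reboot not being memory related, though it doesn't prove it, since a machine that loses power never gets to write anything.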
April 25th, 2014 at 6:20 am
I took the door off the front of rack 1 to see if it helps with 1-18 and 1-6 overheating (if that’s even what’s happening with them). Restarted the renders that crashed when those machines went down. Breakfast time.
May 5th, 2014 at 8:03 am
So, it looks like the mysterious reboots are mostly due to the 3000VA UPSes misbehaving/having bad batteries/providing slightly unstable voltage. We could either buy some 0U rack PDUs with L5-30P supply plugs, or get new batteries for the UPSes. We ran a big long cord back to P11, the last available circuit in the room, and things are currently stable with the cluster running full-bore all weekend, but the situation is less than ideal. Two of the UPSes are out of service because of bad batteries, which means we’ve got 60A of power in the wall we can’t use.
May 5th, 2014 at 8:09 am
…except that 2-5 through 2-8 rebooted and reinstalled yesterday. They’re on the 208V PDU. So… unstable power, or a slightly bad power supply, or the chassis overheated, or some other reason?
Many of those C6100s show blinking orange lights much of the time; I’m not sure what that means. I should look it up, as it could very well be related.