Archive for April, 2014

Mysterious Reboots

Thursday, April 24th, 2014

Some weird reboots/reinstalls occurred that may or may not make any sense.

1d 36m ago, compute-1-18 rebooted. It’s plugged into one of the white UPSes. No other nodes plugged into this UPS went down at that time.

Seventeen hours ago, the following nodes went down: 1-12, 1-14, and 2-22 through 2-25. 1-12 and 1-14 are plugged into a power strip (along with many other nodes that did not go down), which is plugged into the 3000VA UPS that is second from the bottom in the stack. That UPS has bad batteries; we stole batteries from the bottom UPS, which is off, and made a new battery pack. 2-22 through 2-26 are plugged into the top half of the 208V PDU. All of these nodes were plugged into locations where plenty of other nodes were also plugged in and didn’t go down.

Two hours ago, the following nodes went down: 1-5, 1-10, and 2-1 through 2-4. These nodes are all plugged into the top 3000VA UPS in the stack. However, compute-1-11 and 1-9 are also plugged into that UPS, and they did not go down at that time. Those two are also fully loaded and have been the whole time.
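The reasoning for this last incident can be written down as a small set comparison. This is only a rough sketch: the node lists are the ones named above (there may be other nodes on that UPS that I haven't listed), and the function names are made up for illustration. The idea is that if the UPS itself had dropped its load, every node plugged into it should have gone down.

```python
# Nodes on the top 3000VA UPS, limited to the ones named in this post
# (assumption: the UPS may feed other nodes not listed here).
TOP_UPS = {"1-5", "1-10", "2-1", "2-2", "2-3", "2-4", "1-9", "1-11"}

# Nodes that went down in the most recent incident.
WENT_DOWN = {"1-5", "1-10", "2-1", "2-2", "2-3", "2-4"}


def split_by_outcome(plugged_in, went_down):
    """Return (down, survived) lists for the nodes on one power source."""
    down = sorted(plugged_in & went_down)
    survived = sorted(plugged_in - went_down)
    return down, survived


def source_explains_outage(plugged_in, went_down):
    """True only if every node on the source went down,
    which is what a failure of the source itself would look like."""
    return plugged_in <= went_down


down, survived = split_by_outcome(TOP_UPS, WENT_DOWN)
print("down:", down)
print("survived:", survived)
print("UPS failure explains it:", source_explains_outage(TOP_UPS, WENT_DOWN))
```

With the lists above, two nodes survive, so a plain UPS dropout does not account for the outage on its own.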

What gives? At this point it is entirely unclear to me.

Hard drives smaller than 80GB no longer work

Monday, April 21st, 2014

Due to some reconfiguration of the partition schema and some additions to the distro, it appears that any hard drive smaller than 80GB no longer works in a cluster node. I also had a weird problem on compute-1-8, 1-16, and 1-15, where they hung at “Verifying DMI Pool Data….” forever. When Dan and I swapped out the hard drives for 80GB SATA drives, that problem disappeared, as did the more obvious problem of the ROCKS installer exiting with “You need 857MB more hard drive space”.

All nodes should be back in service shortly, probably by the time you read this. Thanks to Rae-Ann for donating all the old 80GB hard drives that User Support was throwing away 🙂
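For anyone who wants to check a drive before putting it in a node, here is a rough sketch of a size check. It is not part of ROCKS; the function names are hypothetical, and the 80GB cutoff is just the empirical figure from this post, taken in the decimal sense (80 × 10^9 bytes). On Linux, `/sys/block/<device>/size` reports the drive size in 512-byte units.

```python
from pathlib import Path

SECTOR_BYTES = 512        # /sys/block/<dev>/size is given in 512-byte units
MIN_BYTES = 80 * 10**9    # assumption: "80GB" means 80 * 10^9 bytes


def capacity_bytes(sectors: int) -> int:
    """Convert a sector count from sysfs into bytes."""
    return sectors * SECTOR_BYTES


def big_enough(sectors: int, min_bytes: int = MIN_BYTES) -> bool:
    """True if the drive meets the minimum size assumed above."""
    return capacity_bytes(sectors) >= min_bytes


def read_sectors(device: str) -> int:
    """Read the sector count for a block device such as 'sda' (Linux only)."""
    return int(Path("/sys/block", device, "size").read_text().strip())


# Example use on a node (device name is an assumption):
#   print(big_enough(read_sectors("sda")))
```

The installer's own complaint was about being 857MB short, so the real minimum depends on the partition schema; treat the threshold here as a placeholder to adjust.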