Archive for April, 2009

Wednesday, April 29th, 2009

compute-0-4 is back in service; it was indeed the hard drive. I pulled a replacement from one of the old nodes – a Barracuda, because they seem to last the longest… haven’t had one die yet (knock on wood). For some reason, I had to remove the node from the cluster and then reinsert it with insert-ethers to get it to PXE install correctly. Ah well.

Tuesday, April 28th, 2009

compute-0-4 is being funky… bad hard drive, I suspect. I’ll replace it later this week, don’t have time now…

compute-0-5 back in service

Monday, April 27th, 2009

Tore apart the power supply, replaced the fan, and replaced the inside rear exhaust fan as well. Noticed when putting it back in the rack that in every case where a node needed a fan replaced, it was the inside one. That probably means the inside fan is subjected to significantly more heat than the outside one, which would make sense, since most of the heat comes off the processors, which are in the middle of the machine. Interesting. Maybe I should preempt further failures and replace the remaining two inside fans that have not yet been replaced.

Thursday, April 23rd, 2009

Got a moment when the cluster wasn’t busy and put the RAM back in those nodes. Jean ordered some fans for me.

Power outage this morning, took down the blades. Since they boot up into reinstall by default, and the reinstall doesn’t work (you have to use PXE for the 64-bit nodes), and they take FOREVER to initialize on reboot, getting them back up and running was actually really annoying. How much is a 240V UPS that can handle a load of 4000W? Probably a lot. I should check into it though. Probably the blades don’t actually pull anywhere near that much power under normal circumstances, so maybe that’s overkill.
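A rough back-of-the-envelope sizing for that UPS question, as a sketch only. The 4000W load and 240V supply are the figures above; the 0.9 power factor and the 80% maximum loading are assumed round numbers, not measurements from the blades, and the function name is made up for illustration.

```python
def ups_sizing(load_watts, volts, power_factor=0.9, max_load_fraction=0.8):
    """Estimate current draw and a minimum UPS VA rating for a given load.

    power_factor and max_load_fraction are assumed values for illustration.
    """
    # UPS units are rated in volt-amps; real power (W) = VA * power factor.
    load_va = load_watts / power_factor
    # Current drawn from the supply at the given voltage.
    amps = load_va / volts
    # Leave headroom so the UPS is not run at 100% of its rating.
    min_ups_va = load_va / max_load_fraction
    return amps, load_va, min_ups_va


amps, load_va, min_ups_va = ups_sizing(4000, 240)
print(f"{amps:.1f} A, {load_va:.0f} VA load, {min_ups_va:.0f} VA minimum UPS rating")
```

With those assumptions, a 4000W worst case works out to roughly 18.5 A and a UPS in the 5.5 kVA range. Plugging in a measured draw instead of 4000 would show how much smaller a unit would do.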

More nodes dying

Monday, April 20th, 2009

Hm. Compute-3-0 died. I won’t bother with it, most likely. It was free and I don’t have any extras. Compute-0-5 just died as well. I think it’s a bad power supply. That I’ll fix, as I have a pile of old nodes with the same power supply, possibly even a brand new one. I used up all the extra fans I bought right away. I should have bought more. One stick of RAM each in compute-1-1 and compute-1-2 died. Weird. I RMA’d them as pairs because that’s how they were bought, and left the machines up with 2GB of RAM each. I have the replacements back now, but the cluster is too busy for me to take down a couple of nodes to put the other half of the RAM back in. Maybe I’ll pull compute-0-5 out of the rack and check it out today, if I can just figure out where I left my nice screwdriver…

More coffee first 🙂