Archive for March, 2012

New parts in

Friday, March 30th, 2012

Got some new parts in – a new power supply for compute-1-4 and a new motherboard and RAM for compute-1-7. If it works out, I might update more of rack 1 to this configuration: new 8-core Phenom IIs with 16GB of RAM. I’ll have to check the old motherboards; there’s even a possibility I won’t have to get new ones.

Some dead nodes

Monday, March 26th, 2012

Looks like a fair portion of the cluster has been down for a couple of days. Checking into it reveals the following problems:
1. compute-1-4 won’t power on. Probably a power supply.
2. Much of racks 1 and 2 had hard-locked, mostly at the same time, so the likely cause is a job that was submitted around then.
3. Several nodes had an error message on the console from tractor: “Load averages are unobtainable”. Some of these nodes had crashed and some had not (compute-1-6, for instance, had not).
4. compute-3-3 seems to have finally died the same death as the other Opteron nodes.
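
A side note on the “Load averages are unobtainable” message: that wording matches the text of the OSError that Python’s os.getloadavg() raises when the system call fails, so it may be coming from a Python process on the node rather than from the kernel. I have not confirmed that this is where tractor gets it. Assuming it is, a minimal sketch of a check that could be run by hand on a suspect node might look like this (the function name read_load_averages is made up for illustration):

```python
import os


def read_load_averages(getloadavg=None):
    """Return (1min, 5min, 15min) load averages, or None if unobtainable.

    getloadavg is a hypothetical hook so the failure path can be exercised
    without a broken node; by default the real os.getloadavg is used.
    """
    if getloadavg is None:
        getloadavg = os.getloadavg  # available on Unix-like systems
    try:
        return tuple(getloadavg())
    except OSError:
        # os.getloadavg raises OSError when the load average
        # cannot be obtained from the system.
        return None


if __name__ == "__main__":
    loads = read_load_averages()
    if loads is None:
        print("load averages unobtainable on this node")
    else:
        print("load averages: %.2f %.2f %.2f" % loads)
```

A node that reports None here but is otherwise responsive would fit the compute-1-6 case: the error appears, yet the machine itself has not crashed.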