Some dead nodes
March 26, 2012 by Wm. Josiah Erikson (wjens)
Looks like a fair portion of the cluster has been down for a couple of days. Checking into it reveals the following problems:
1. compute-1-4 won’t power on. Probably a power supply.
2. Much of racks 1 and 2 had hard-locked, nearly all at the same time, so it must have been a job submitted around that time.
3. Several nodes had an error message on the console from Tractor about “Load averages are unobtainable”. Some of these nodes had crashed, some had not (compute-1-6, for instance, had not).
4. compute-3-3 seems to have finally died the same death as the other Opteron nodes.
March 26th, 2012 at 1:54 pm
Just spoke with Bassam: this was Blender crashing nodes by running them out of memory. Submitting jobs with a minimum of 8GB of RAM seems to solve the problem.
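For anyone who wants to sanity-check a node by hand, here is a rough sketch of the same idea: read the node's total RAM from /proc/meminfo and compare it against a floor. This is not how the job scheduler enforces the minimum; the function names and the 8 GiB default are assumptions made up for illustration. Note that a node sold as "8GB" usually reports a bit less than 8 GiB in MemTotal, so the threshold may need to be lowered slightly.

```python
def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like: "MemTotal:       16326428 kB"
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def has_enough_ram(meminfo_text, min_gb=8.0):
    """True if total RAM is at least min_gb GiB (hypothetical threshold)."""
    return mem_total_kb(meminfo_text) >= min_gb * 1024 * 1024


# On a Linux node, pass in the real file contents:
#   with open("/proc/meminfo") as f:
#       print(has_enough_ram(f.read()))
```

The functions take the file contents as a string so they can be tried on a workstation with pasted output from a node.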