Some dead nodes

March 26, 2012
by Wm. Josiah Erikson (wjens)

It looks like a fair portion of the cluster has been down for a couple of days. Checking into it turned up the following problems:
1. compute-1-4 won’t power on. Probably a power supply.
2. Much of racks 1 and 2 had hard-locked, mostly all at the same time, so the likely cause is a job that was submitted around then.
3. Several nodes had an error message from tractor on the console: “Load averages are unobtainable”. Some of these nodes had crashed and some had not (compute-1-6, for instance, had not).
4. compute-3-3 seems to have finally died the same death as the other Opteron nodes.
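
A note on item 3: the wording of that console message matches the text of the OSError that Python's os.getloadavg() raises when the system can't report a load average. Assuming that is where the message comes from (a guess, not something confirmed here), this is a minimal sketch of how a monitoring script could handle that case instead of dying on it. The helper name read_load_average is made up for illustration and is not part of tractor.

```python
import os


def read_load_average():
    """Return the (1, 5, 15 minute) load averages, or None if unavailable.

    Hypothetical helper for illustration; not taken from tractor.
    """
    try:
        # os.getloadavg() raises OSError when the load average
        # cannot be obtained from the system.
        return os.getloadavg()
    except (OSError, AttributeError):
        # AttributeError covers platforms where os.getloadavg is missing.
        return None


if __name__ == "__main__":
    load = read_load_average()
    if load is None:
        print("load average unavailable on this node")
    else:
        print("load averages: %.2f %.2f %.2f" % load)
```

Run on a node, this prints either the three load averages or a short notice, which makes it easy to tell a node that is merely complaining from one that has actually crashed.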



One Response to “Some dead nodes”

  1.   wjens Says:

    Just spoke with Bassam: this was Blender crashing nodes by running them out of memory. Submitting jobs with a minimum of 8GB of RAM seems to solve the problem.
