Problem list while upgrading cluster

November 13, 2012
by Wm. Josiah Erikson (wjens)

1. compute-1-2 has a bad power supply (that’s why it was down!)
2. compute-1-5 has a bad exhaust fan
3. compute-4-5 (new monster node) will not POST
4. compute-1-6 will not POST

I guess I’ll try to figure out 4-5 first, since it’s most important…



12 Responses to “Problem list while upgrading cluster”

  1.   wjens Says:

    compute-4-5’s problem was bad RAM. fixed – had extra from Jeremy and Chris…

  2.   wjens Says:

    compute-1-5’s fan started when I hit it with a screwdriver. Ordering a new one, putting it back into production for now. compute-1-6 just needed a RAM reseat and blowing out the channels… definitely seemed like the slot closest to the edge of the board was the culprit (not the RAM stick, the slot itself), but then I blew it out again and put the RAM back and it was fine. Pulling apart compute-1-2 now…

  3.   wjens Says:

    Replaced compute-1-2’s power supply with one Bruner was giving away. Had to adapt power cable with butt connectors and electrical tape. Cluster is fully operational again.

  4.   wjens Says:

    …except that 1-6 never came up… investigating….

  5.   wjens Says:

    Hm. Might be a software problem with the Radeon r600 driver… kernel fails to load it. The installer seems to work fine, reinstalling to see what happens this time. Might need some investigation or a post to the rocks-discuss list…

  6.   wjens Says:

    I disabled the internal graphics and it boots fine… just can’t see what it’s doing. Not ideal, but it works! Heh.

  7.   wjens Says:

    compute-1-6 is misbehaving again, as reported by Tom, probably bad RAM. Will investigate…

  8.   wjens Says:

    Running a long memory test on compute-1-6… for some reason, when I reset the CMOS, it ran the ROCKS installer in text mode, which might eliminate the need to turn off the onboard video card, which would be nice… not sure what setting changed that…

  9.   wjens Says:

    Huh. It passed 8 consecutive passes of the full memtest86+ memory suite. I usually consider 1 good enough. Putting it back into production… also modifying extend-compute.xml and rebuilding the distro first to see if I can get it to come back up with tractor-blade running…. I’ve been having an issue where once the blades come up, you have to either reboot them or just run /etc/init.d/tractor-blade start … not sure why yet, as I have chkconfig --add tractor-blade, touch /var/log/tractor-blade, chown pixar /var/log/tractor-blade, and then /etc/init.d/tractor-blade start as the last things in extend-compute.xml…. who knows…. I’ll figure it out.
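
    For reference, the four steps listed above amount to roughly the following. This is only a sketch, not the actual contents of extend-compute.xml: the run wrapper, the DRY_RUN variable, and the blade_post_install function name are made up for illustration, so the sequence can be printed and checked without touching a real node.

```shell
#!/bin/sh
# Sketch of the tractor-blade steps described above.
# run, DRY_RUN and blade_post_install are hypothetical names, not from the real file.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

blade_post_install() {
    svc=tractor-blade
    log=/var/log/tractor-blade

    run chkconfig --add "$svc"      # register the init script with chkconfig
    run touch "$log"                # make sure the log file exists
    run chown pixar "$log"          # the pixar user owns the log file
    run "/etc/init.d/$svc" start    # start the blade service
}
```

    Calling DRY_RUN=1 blade_post_install just prints the four commands in order; calling blade_post_install without DRY_RUN set actually runs them, which needs root and a node with tractor-blade installed.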

  10.   wjens Says:

    Huh. Set the graphics card to use 32MB and unganged mode and now it seems to work even in graphics mode with VGA. OK….

  11.   wjens Says:

    …But now it’s not pinging….

  12.   wjens Says:

    Hooked it up with DVI, we’ll see…. running the graphical installer this time though…. huh.
