Problem list while upgrading cluster
November 13, 2012 by Wm. Josiah Erikson (wjens)
1. compute-1-2 has a bad power supply (that’s why it was down!)
2. compute-1-5 has a bad exhaust fan
3. compute-4-5 (new monster node) will not POST
4. compute-1-6 will not POST
I guess I’ll try to figure out 4-5 first, since it’s most important…
November 13th, 2012 at 1:36 pm
compute-4-5’s problem was bad RAM. fixed – had extra from Jeremy and Chris…
November 14th, 2012 at 3:41 pm
compute-1-5’s fan started when I hit it with a screwdriver. Ordering a new one, putting it back into production for now. compute-1-6 just needed a RAM reseat and blowing out the channels… definitely seemed like the slot closest to the edge of the board was the culprit (not the RAM stick, the slot itself), but then I blew it out again and put the RAM back and it was fine. Pulling apart compute-1-2 now…
November 15th, 2012 at 7:17 am
Replaced compute-1-2’s power supply with one Bruner was giving away. Had to adapt power cable with butt connectors and electrical tape. Cluster is fully operational again.
November 15th, 2012 at 7:19 am
…except that 1-6 never came up… investigating….
November 15th, 2012 at 7:25 am
Hm. Might be a software problem with the Radeon r600 driver… kernel fails to load it. The installer seems to work fine, reinstalling to see what happens this time. Might need some investigation or a post to the rocks-discuss list…
November 15th, 2012 at 8:02 am
I disabled the internal graphics and it boots fine… just can’t see what it’s doing. Not ideal, but it works! Heh.
January 7th, 2013 at 9:41 am
compute-1-6 is misbehaving again, as reported by Tom, probably bad RAM. Will investigate…
January 9th, 2013 at 1:38 pm
Running a long memory test on compute-1-6… for some reason, when I reset the CMOS, it ran the ROCKS installer in text mode, which might eliminate the need to turn off the onboard video card, which would be nice… not sure what setting changed that…
January 10th, 2013 at 8:35 am
Huh. It completed 8 consecutive passes of the full memtest86+ memory suite. I usually consider 1 good enough. Putting it back into production… also modifying extend-compute.xml and rebuilding the distro first to see if I can get it to come back up with tractor-blade running…. I’ve been having an issue where once the blades come up, you have to either reboot them or just run /etc/init.d/tractor-blade start … not sure why yet, as I have chkconfig --add tractor-blade, touch /var/log/tractor-blade, chown pixar /var/log/tractor-blade, and then /etc/init.d/tractor-blade start as the last things in extend-compute…. who knows…. I’ll figure it out.
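For reference, the four commands listed above can be sketched as a small shell script. This is only an illustration of the sequence as described, not the actual contents of extend-compute.xml (where such commands would presumably sit inside a post-install section). The helper names `run` and `setup_tractor_blade` and the `DRY_RUN` variable are hypothetical; with `DRY_RUN` left at its default of 1 the script only prints what it would do.

```shell
#!/bin/sh
# Sketch of the tractor-blade setup steps named in the post.
# DRY_RUN, run, and setup_tractor_blade are hypothetical names added
# for illustration; only the four commands come from the post itself.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

setup_tractor_blade() {
    # Register the init script so it starts at boot.
    run chkconfig --add tractor-blade
    # Create the log file and hand it to the pixar user.
    run touch /var/log/tractor-blade
    run chown pixar /var/log/tractor-blade
    # Start the service right away.
    run /etc/init.d/tractor-blade start
}

setup_tractor_blade
```

Running it with `DRY_RUN=0` as root on a node would execute the commands for real, which may be a quick way to check whether the sequence works by hand even when it does not take effect at install time.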
January 10th, 2013 at 10:15 am
Huh. Set the graphics card to use 32MB and unganged mode and now it seems to work even in graphics mode with VGA. OK….
January 10th, 2013 at 10:42 am
…But now it’s not pinging….
January 10th, 2013 at 10:47 am
Hooked it up with DVI, we’ll see…. running the graphical installer this time though…. huh.