compute-1-1’s power supply died

August 6, 2013
by Wm. Josiah Erikson (wjens)

I’ll replace it at some point… but probably not before I go on vacation.


UPS Overwhelmed

May 10, 2013
by Wm. Josiah Erikson (wjens)

I guess the new nodes draw a lot more power when going full-tilt. The latest job that Tom submitted took out one of the UPSes, and therefore a bunch of nodes. I’ve moved some of them over to wall power, and they’re reinstalling now… they should be back up soon, but you’ll have to restart some of your jobs (or your whole run, depending on how you feel about the randomness of that event), Tom. Sorry! I guess we need another couple of UPSes if we want to cover the whole cluster. I have two spares, but they need batteries; around $500 would put them both back in business.


Troubleshooting emergent

May 3, 2013
by Wm. Josiah Erikson (wjens)

There’s a problem with emergent and breve: they both segfault regularly, probably because I built them wrong. Jaime has had to write a little script to detect this and kill/restart them as appropriate (I don’t understand the details of what he’s done, but suffice it to say that I know it’s broken). So I’m working on recompiling it more correctly, on a node where there aren’t stray random libraries in /usr/local, etc.
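For anyone curious what a detect-and-restart watchdog of that kind might look like, here is a minimal sketch. It is not Jaime’s script (I don’t have the details of that); the function name `run_with_restart`, the restart cap, and whatever command you pass in are all hypothetical. It relies on the fact that, on POSIX, Python’s `subprocess` reports a child killed by a signal as a negative return code, so a segfault shows up as `-signal.SIGSEGV`.

```python
import signal
import subprocess


def run_with_restart(cmd, max_restarts=3):
    """Run cmd, restarting it each time it dies from SIGSEGV.

    Returns the number of restarts performed. Hypothetical helper,
    not the script actually in use on the cluster.
    """
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        # On POSIX, a child killed by signal N has returncode == -N.
        segfaulted = result.returncode == -signal.SIGSEGV
        if not segfaulted or restarts >= max_restarts:
            return restarts
        restarts += 1
```

The restart cap is there so a binary that segfaults on every launch doesn’t loop forever; any other exit, clean or not, ends the loop.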

Here are my notes from Trello, where I’m tracking the project. I’m currently recompiling Qt:

emergent won’t compile without GL support in Qt… installing GL libraries on compute-2-1, where I’m compiling… lots of dependencies, total PITA

58 minutes ago
Wm. Josiah Erikson

Qt-everywhere had been built with OpenGL support, since that’s installed on the head node, but it seemed to be broken anyway, we don’t need it, and it’s not installed on the compute nodes, so recompiling without it…

yesterday at 10:53 am
Wm. Josiah Erikson

Compiling on compute-2-1. Also forced 64-bit compile….

yesterday at 10:34 am
Wm. Josiah Erikson

Gonna see if I can compile it on a node.

yesterday at 10:01 am
Wm. Josiah Erikson

Didn’t help. Still crashes in X, too. There are a few more libraries I can install on the nodes to see if it makes any difference, but they’re installed on the head node and it crashes in X there – though we haven’t tried running what he runs through tractor on the head node, which also segfaults. Something else is up…. can’t figure out what. breve segfaults too… what?

May 1 at 8:50 am
Wm. Josiah Erikson

Recompiling with correct lib locations


Just bought 12 new 8-core 24GB nodes… for $2700

April 19, 2013
by Wm. Josiah Erikson (wjens)

I just got three of these.
They’re a bit old, but benchmarks put that processor at about half as fast as an FX-8350, and there are two of them per node, with 24GB of RAM each. They’re Dell servers, so they won’t break all the time, and the whole thing will only take up 6U, so I can ditch all of Rack2 to make space for them. Win win win, and better price/performance than gutting and rebuilding the Rack 1 nodes all the time. Cool! I hope this works out for everyone. I think it will.
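To put rough numbers on the price/performance claim, here is the back-of-the-envelope arithmetic using only the figures from this post. The "half an FX-8350 per processor" ratio is the benchmark estimate quoted above, not something measured here.

```python
# Figures taken from the post above.
total_cost_usd = 2700
nodes = 12
cores_per_node = 8
ram_gb_per_node = 24
cpus_per_node = 2
# Assumption from the post: each CPU is roughly half as fast as one FX-8350.
fx8350_equiv_per_cpu = 0.5

cost_per_node = total_cost_usd / nodes
total_cores = nodes * cores_per_node
total_ram_gb = nodes * ram_gb_per_node
fx8350_equiv = nodes * cpus_per_node * fx8350_equiv_per_cpu
cost_per_fx8350_equiv = total_cost_usd / fx8350_equiv

print(f"${cost_per_node:.0f} per node, {total_cores} cores, {total_ram_gb} GB RAM")
print(f"~{fx8350_equiv:.0f} FX-8350 equivalents at ${cost_per_fx8350_equiv:.0f} each")
```

So each node works out to roughly one FX-8350’s worth of CPU for $225, RAM and chassis included.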


compute-1-18 randomly rebooting

April 12, 2013
by Wm. Josiah Erikson (wjens)

Perhaps the power supply. I can’t remember if that’s the one I replaced the power supply in, or the one that still has the crappy Athena Power one that came with the case. Probably won’t check into it today; will wait until Monday, most likely.


Fly crashed again

April 12, 2013
by Wm. Josiah Erikson (wjens)

Identical symptoms. Again, the hostname was fly.hampshire.edu on restart, and again it wouldn’t mount /helga. I had to manually mount /helga, set the hostname to fly.local, tentakel restart rpcidmapd (and also on the head node), then start tractor-engine (which wouldn’t start because /helga wasn’t mounted). But the real question is: why is it crashing? Do we need new hardware? I should run some tests… I’m afraid it will go down again this weekend. However, I have to run to manager training…
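Since this is now the second time through the same recovery, the order of operations can be written down as a small checklist generator. This is only a sketch: `recovery_steps` is a hypothetical helper, it prints what to do rather than doing it, and the step text just restates the manual procedure above instead of giving exact command lines.

```python
import os
import socket

HELGA = "/helga"
EXPECTED_HOSTNAME = "fly.local"


def recovery_steps(hostname, helga_mounted):
    """Return the manual recovery steps still needed, in order."""
    steps = []
    if not helga_mounted:
        steps.append("mount /helga")
    if hostname != EXPECTED_HOSTNAME:
        steps.append("set hostname to fly.local")
    steps.append("restart rpcidmapd via tentakel and on the head node")
    # tractor-engine goes last: it won't start unless /helga is mounted.
    steps.append("start tractor-engine")
    return steps


if __name__ == "__main__":
    mounted = os.path.ismount(HELGA)
    for step in recovery_steps(socket.gethostname(), mounted):
        print("TODO:", step)
```

Run on fly after a crash, it would list all four steps; once /helga is mounted and the hostname is right, only the rpcidmapd restart and the tractor-engine start remain.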


Upgraded compute-1-2

April 11, 2013
by Wm. Josiah Erikson (wjens)

…with compute-1-12’s old internals, since I replaced them, thinking it was mobo/cpu/ram when really it was the hard drive. Heh.

On second thought, apparently one of the sticks of RAM has gone bad in the meantime, so 1-2 will have 6GB of RAM until the RMA comes back….


compute-1-14 needs new hard drive

April 9, 2013
by Wm. Josiah Erikson (wjens)

Guess that old 20GB unit from Josh Crawford finally bit the dust, along with a few others. They all seem to be dying around now. Good life – 12 years – they were manufactured in 2001. I’d say they don’t owe us anything, especially since they were free to begin with 🙂


Fly crashed

April 9, 2013
by Wm. Josiah Erikson (wjens)

Flashing Caps Lock and Scroll Lock, no signal to VGA… nothing weird in the Nagios graphs to indicate why.


Random reboot on compute-1-7, crash on 1-6

March 29, 2013
by Wm. Josiah Erikson (wjens)

Kernel panic on 1-6. Rebooted, seems OK so far, but I should watch it: compute-1-6 has been trouble for a while. Dunno why 1-7 rebooted, but I’d guess it’s another bad power supply. Also worth watching.