compute-1-12 died

March 11, 2013
by Wm. Josiah Erikson (wjens)

Probably bad power supply. Will check it out this afternoon after lunch.


Built two new cluster nodes from scratch

December 10, 2012
by Wm. Josiah Erikson (wjens)

Test-case nodes from Newegg, which I’m using to benchmark and figure out which nodes will be cheapest in both the short and long term, price/performance-wise. We’ve run into a bit of a snafu with pmap giving slowdowns compared to map on AMD hardware, and disappointing gains on Intel hardware. Anyway, the nodes themselves came out well, except that one of the cases came with a bad power supply. (Oh well, I don’t feel like RMA’ing it, since I have some extras around. I also ordered a couple of spare power supplies, since they seem to keep dying, and threw away all of my bad ones.)
compute-1-17 is an Intel Core i7 3770K with 16GB of RAM. Total node cost, including case: $607.86
compute-1-18 is an AMD FX-8350 with 16GB of RAM. Total node cost, including case: $489.96
I used hard drives I had around, so I should probably add $80 or so to those costs for a hard drive. The Core i7 node happens to be WAY faster with Clojure and pmap, due to some Clojure or Java oddness that we have yet to understand. However, the FX-8350 is actually faster than the i7 on single-threaded Clojure maps. Odder and odder, since all the available benchmarks out there say the opposite of that in both cases. Now to benchmark rendering!
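
For a rough like-for-like comparison, here is a quick Python sketch of the arithmetic above. The two base prices are the ones quoted in this post; the $80 drive figure is only my ballpark estimate, not a real quote, and the names (NODE_COSTS, DRIVE_ESTIMATE, cost_with_drive) are made up for illustration.

```python
# Base prices quoted in the post (case included, hard drive NOT included).
NODE_COSTS = {
    "compute-1-17 (Core i7 3770K)": 607.86,
    "compute-1-18 (FX-8350)": 489.96,
}

# Assumption: the post's rough "$80 or so" guess for a hard drive.
DRIVE_ESTIMATE = 80.00


def cost_with_drive(base_cost, drive_cost=DRIVE_ESTIMATE):
    """Return the node cost with an estimated hard drive added."""
    return round(base_cost + drive_cost, 2)


# Estimated all-in cost per node.
totals = {name: cost_with_drive(cost) for name, cost in NODE_COSTS.items()}

# How much more the i7 node costs than the FX-8350 node.
price_gap = round(
    NODE_COSTS["compute-1-17 (Core i7 3770K)"]
    - NODE_COSTS["compute-1-18 (FX-8350)"],
    2,
)

for name, total in totals.items():
    print(f"{name}: ${total:.2f} with drive")
print(f"i7 premium over FX-8350: ${price_gap:.2f}")
```

The drive estimate is the same for both nodes, so it doesn't change the gap between them. It only matters when comparing these against prebuilt nodes that ship with a drive.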


compute-1-1’s power supply died…

November 16, 2012
by Wm. Josiah Erikson (wjens)

They’re dying like flies… hahaha. Yeah. Ahem. I guess these ancient (2001) p/s’s can’t really be expected to last this long or handle what I’m throwing at them. They’ve done a bang-up job really, and hardly owe us anything…


Upgrade notes

November 13, 2012
by Wm. Josiah Erikson (wjens)

I set /etc/hostname to fly.local to fix the rpc.idmapd domain problem (users’ homedirs were all mapped to being owned by nobody), but that may not have been the right fix. Remember to set up backups again once things are up and running.


Problem list while upgrading cluster

November 13, 2012
by Wm. Josiah Erikson (wjens)

1. compute-1-2 has a bad power supply (that’s why it was down!)
2. compute-1-5 has a bad exhaust fan
3. compute-4-5 (new monster node) will not POST
4. compute-1-6 will not POST

I guess I’ll try to figure out 4-5 first, since it’s most important…


Upgrading to ROCKS 6.0

November 12, 2012
by Wm. Josiah Erikson (wjens)

I will be rebuilding the cluster today, tomorrow, and Wednesday. I expect to take it down shortly and I don’t expect it to be fully functional again for a couple of days.


compute-1-1 is back

September 19, 2012
by Wm. Josiah Erikson (wjens)

Replaced power supply with one salvaged from the Library Basement. Good to go.


Power outage

September 19, 2012
by Wm. Josiah Erikson (wjens)

We had a power outage last night. All the nodes came back up fine (most of them didn’t go down at all, thanks to the UPSes, actually, but all of Rack 2 went down, of course) except for compute-4-4, which came back up with a hard power-off and power-on. Not sure why compute-4-4 didn’t make it through the power outage when the others did… maybe I need to swap around the dual power supplies a bit…


Current problems with the cluster

September 19, 2012
by Wm. Josiah Erikson (wjens)

1. compute-1-6 has some bad RAM (memtest86+ says so). Fix: RMA the RAM. Timeframe: Very soon. Difficulty: PITA, but whatever.
2. compute-1-1 probably has a bad power supply. Fix: Replace power supply. Timeframe: Very soon. Difficulty: Easy.
3. compute-4-5 (the monsterest node) isn’t compatible with the installed version of ROCKS. Fix: Rebuild the cluster with the latest version of ROCKS. Timeframe: Have to coordinate with GP and Animation folks. Difficulty: Well, every time we rebuild the cluster there are about a million things to figure out again. Hard, I guess, but quite doable.
4. Tractor is out of date and NIMBY still doesn’t work. Fix: Update Tractor and figure out the new permissions system. Timeframe: Coordinate with GP and Animation folks. Difficulty: Probably not that hard, but I’d rather do it when rebuilding the cluster if possible.


New: Biggest node yet!

April 26, 2012
by admin

I bought us a SuperMicro 2042G-TRF and installed four 16-core processors and 16 x 4GB quad-ranked RAM from IT (they couldn’t use it due to some motherboard limitation we don’t have), so we now have a 64-core, 64GB node! It’s currently called compute-4-6 for some silly reason (fly picked up two MACs from the node, and the second appears to be the primary interface. Oh well. Maybe I’ll fix it at some point). I haven’t installed it in the rack yet, but I’m psyched to try it out! Oh and also, this whole thing only cost us $3600, so I had some money left over to get a couple clones of compute-1-7!