Overheating problem with new node

April 6, 2012
by Wm. Josiah Erikson (wjens)

It looks like Kwaku accidentally discovered a problem with the new node’s config. It crashed while he was running something on it, and I found it totally unresponsive, probably due to overheating. I think that 2U just isn’t enough room for this heatsink and fan combo to work with, and I also think that the placement of the memory blocks airflow through the heatsink fins. Clearly the fins should be facing the other direction. I think it will be clear from this pic. Unless I can figure out some way around this, that nixes using these. Perhaps a different heatsink, or maybe even just getting two 8GB memory modules and putting them in the slots further away from the processor?
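One way to confirm an overheating suspicion before the node locks up is to poll its temperature sensors. The following is only a sketch, assuming the node runs Linux and exposes sensors under /sys/class/thermal (not every board does); the 80 °C limit and the function names are hypothetical and not taken from this cluster's setup.

```python
import glob


def parse_millidegrees(raw):
    """Convert a sysfs reading such as '45000\\n' (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000.0


def read_zone_temps(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {sensor_path: degrees_c} for every readable thermal zone.

    Assumption: the kernel exposes thermal zones at this path. Zones that
    cannot be read or parsed are skipped rather than treated as errors.
    """
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                temps[path] = parse_millidegrees(handle.read())
        except (OSError, ValueError):
            continue
    return temps


def zones_over(temps, limit_c):
    """List the sensors whose reading is at or above limit_c."""
    return [path for path, temp in temps.items() if temp >= limit_c]


if __name__ == "__main__":
    LIMIT_C = 80.0  # hypothetical threshold, pick one that suits the CPU
    readings = read_zone_temps()
    for path in zones_over(readings, LIMIT_C):
        print("HOT: %s at %.1f C" % (path, readings[path]))
```

Run from cron or a loop while a heavy job is on the node, this would show whether temperatures climb before the crash.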


compute-1-4 back up; compute-1-7 went from worst to best in rack 1

April 5, 2012
by Wm. Josiah Erikson (wjens)

I replaced compute-1-4’s power supply. I also replaced compute-1-7’s innards. It was a 2.16GHz Core 2 Duo with 4GB of RAM. It is now a 3.1GHz octa-core Phenom with 16GB of RAM. Cost for this upgrade: ~$400. I’ll be “benchmarking” it shortly (with dnetc, not a real benchmark, but might have something to do with something).


New parts in

March 30, 2012
by Wm. Josiah Erikson (wjens)

Got some new parts in – new power supply for compute-1-4 and new motherboard and RAM for compute-1-7. If it works out, I might update more of rack 1 to this configuration: new 8-core Phenom IIs with 16GB of RAM. I’ll have to check the old motherboards; there’s even a possibility I won’t have to get new ones.


Some dead nodes

March 26, 2012
by Wm. Josiah Erikson (wjens)

Looks like a fair portion of the cluster has been down for a couple of days. Checking into it reveals the following problems:
1. compute-1-4 won’t power on. Probably a power supply.
2. Much of racks 1 and 2 had just hardlocked, and mostly all at the same time, so it must have been a job that got submitted at that time.
3. Several nodes had an error message on the console from tractor about “Load averages are unobtainable”. Some of these nodes were crashed, some were not (compute-1-6, for instance, was not).
4. Compute-3-3 seems to have finally died the same death as the other Opteron nodes.
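The “Load averages are unobtainable” wording in item 3 matches the message CPython attaches to the OSError raised by os.getloadavg(), so one plausible but unconfirmed reading is that tractor’s Python code hit that error on those nodes. A minimal sketch of a node-side check that reports the condition instead of dying on it (the function names are hypothetical, and this is not tractor’s code):

```python
import os


def read_load_averages():
    """Return the (1 min, 5 min, 15 min) load averages, or None if unavailable.

    os.getloadavg() raises OSError when the OS cannot supply the values;
    AttributeError covers platforms where the function does not exist.
    """
    try:
        return os.getloadavg()
    except (OSError, AttributeError):
        return None


def describe_load(loads):
    """Format a load-average tuple (or None) as a one-line status string."""
    if loads is None:
        return "load averages unobtainable"
    return "load 1m=%.2f 5m=%.2f 15m=%.2f" % tuple(loads)


if __name__ == "__main__":
    print(describe_load(read_load_averages()))
```

Running this on a node that showed the message but had not crashed (compute-1-6, say) would tell you whether the OS itself is failing to report load or whether the problem is elsewhere.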


Small upgrades

November 7, 2011
by Wm. Josiah Erikson (wjens)

At some point this summer, I upgraded compute-1-9 to a Core 2 Quad and 8GB of RAM, and upgraded compute-1-6 to a hexacore Athlon and 16GB of RAM, both for piddly amounts of money, bringing us up to Total CPUs: 258, Total Memory: 303.9 GB.

I’ll probably upgrade compute-1-7 to match compute-1-6 soon, as it’s our last 4GB, dual-core node in Rack 1. Or maybe I’ll see if I can do something crazy with a quad-processor motherboard or something. We’ll see how bored I get 🙂


Total CPUs: 250 Total Memory: 288.5 GB

June 23, 2011
by Wm. Josiah Erikson (wjens)

Yeah. That’s up 96 CPUs and 64GB of RAM from before lunch. Not bad.

I love those nodes. They are just so enormously badass. They’re like a cluster in a box. So in case it wasn’t obvious, we now have compute-4-3 and compute-4-4. They’re also just so easy to install. I took the rails out of the box, clipped them into the rack, took the nodes out of the boxes, dropped them into the rails, and slid them into the rack. Then I plugged each node’s two power cords into two different UPSes, plugged in and routed the network cables, ziptied up the power cords to make them look pretty, fired up insert-ethers --rack 4 --rank 3, turned on node 3, plugged in a keyboard and mouse, hit F12, waited for it to come up in insert-ethers and start the install, and repeated for the other one. I think it took me less than half an hour from start to finish.
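The software side of that install is one insert-ethers call per node, with the rack and rank picking the name the node ends up with. As a rough sketch, assuming the long options are spelled with double hyphens and that rack 4, rank 3 yields compute-4-3 as in this post, the commands for both new nodes could be generated like this (the helper names are hypothetical; the script only prints the commands, since insert-ethers is interactive and has to run on the head node):

```python
import shlex


def node_name(rack, rank):
    """Name a compute node gets for a given rack and rank, e.g. compute-4-3."""
    return "compute-%d-%d" % (rack, rank)


def insert_ethers_command(rack, rank):
    """Build the insert-ethers argument list for one node.

    Only the two options mentioned in the post are used; anything else
    insert-ethers asks for is left to its interactive prompts.
    """
    return ["insert-ethers", "--rack", str(rack), "--rank", str(rank)]


if __name__ == "__main__":
    # The two new nodes from this post: rack 4, ranks 3 and 4.
    for rank in (3, 4):
        command = insert_ethers_command(4, rank)
        printable = " ".join(shlex.quote(part) for part in command)
        print("%s: %s" % (node_name(4, rank), printable))
```

Each command is started before powering on the matching node, so the node is picked up with the intended rack and rank.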


Lefthand rack replaced with square hole rack, finally

June 23, 2011
by Wm. Josiah Erikson (wjens)

In anticipation of two new 48-core nodes (!!), Kyle Harrington and I (thanks Kyle!) replaced the old threaded-hole rack that compute-2-x, compute-3-3, compute-4-x, the head node, and the UPSes for the cluster sat in. I have not yet put compute-3-3 back up. The new Dells are coming with rails that only work in round or square hole racks, so in order to rack these and any other modern machines, this move was necessary. In the process, we had to shut down the entire cluster, and compute-1-10 did not come back up. Looks like a power supply. Compute-1-6 is down because I have replaced its internals with a hexa-core Phenom and 16GB of RAM, but the onboard ethernet is not supported by ROCKS. I’m going to get a motherboard whose onboard ethernet is supported, but we are out of money, so it has to wait until July 1…

I just got an email that the two new 48-core nodes are in, so perhaps this afternoon, the capacity of the cluster will nearly double…


UPS Batteries and compute-3-1

January 31, 2011
by Wm. Josiah Erikson (wjens)

Compute-3-1 is dead. Oh well. Another free machine from Dartmouth lived out its useful life. On a much better note, Jean ordered us more new UPS batteries and now all four Pulsar Evolution 3000s have brand-new batteries. Yay! No more beeping in the server room! We like that. Also, less importantly, jobs will not die because of a brief power outage.


More dead hard drives

January 3, 2011
by Wm. Josiah Erikson (wjens)

compute-1-8 and compute-1-13 appear to have dead hard drives. Time to tear them apart and see what’s up. I’m about out of spare hard drives, too – I’ll have to order some.


Cleaning

December 17, 2010
by Wm. Josiah Erikson (wjens)

Finally got around to unracking one of the old dead Opterons from Dartmouth. I’m hoping that I’ll get a new lefthand rack with actual square holes from Chris and Jeremy shortly, so in preparation for that, it’s time to get rid of the cruft.

I also really hope to get rid of the blades sometime soon and replace them with something far better, preferably something more like Rack 1, which has been great. Basically, I want those cases exactly… they’re my favorite rackmount cases ever.