fly rebuilt

May 16, 2014
by admin

I have rebuilt fly with a machine donated by Mt. Holyoke (thanks Jeremy and Ron!). It’s a Dell 2950 – dual quad-core 2.0GHz Xeons with 32GB of RAM and a couple of 1TB hard drives we had lying around in the library server room. I upgraded to ROCKS 6.1.1 while I was at it; it turns out it isn’t all that much different from ROCKS 6.0, but the ganglia interface is a little nicer and it comes with lots of security updates, etc. I deleted some unused accounts, installed newer versions of some software (including tractor), and now I’m trying to compile emergent again.

That was MUCH easier than I thought it was going to be. It only took me a few hours – I thought it would take a week.

One lesson learned: Make sure /etc/sysconfig/network has fly.local set as the hostname so that idmapd properly maps things to the .local domain. If you don’t, everything gets mapped to being owned by nobody when you log into the nodes. The first obvious symptom of this is that you get asked for a password when logging into a node, because SSH says that ownership doesn’t match on .ssh/authorized_keys, which it in fact doesn’t, because you are not nobody 🙂
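For the record, here’s roughly what that looks like, sketched from memory and assuming the stock CentOS 6 base that ROCKS 6.1.1 is built on:

    # /etc/sysconfig/network needs the fully qualified .local hostname:
    #   NETWORKING=yes
    #   HOSTNAME=fly.local

    # rpc.idmapd derives its NFSv4 domain from the hostname's domain part
    # unless Domain is set explicitly in /etc/idmapd.conf, so verify:
    hostname --fqdn           # should print fly.local

    # then restart idmapd so ownership stops mapping to nobody:
    service rpcidmapd restart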


Power rebalance and upgrade project

May 5, 2014
by admin

I have ordered new UPS batteries for all the 3000VA UPSes – they’re three years and four months old, according to this blog – I blogged the last time I replaced them. That makes me feel old. Also I have ordered a nice 0U metered PDU that should be able to power most of Rack 1, and the other 0U PDU on the other side can do the rest. I hope that between these two things, there will be no more power strips hanging out behind the racks, and I should be able to properly balance and use all the power available.


Mysterious Reboots

April 24, 2014
by admin

Some weird reboots/reinstalls have occurred that may or may not make any sense.

1d 36m ago, compute-1-18 rebooted. It’s plugged into one of the white UPSes. No other nodes plugged into this UPS went down at that time.

Seventeen hours ago, the following nodes went down: 1-12, 1-14, and 2-22 through 2-25. 1-12 and 1-14 are plugged into a power strip (along with many other nodes that did not go down) that is plugged into the 3000VA UPS that is second from the bottom in the stack. It has bad batteries; we stole good ones from the bottom UPS, which is off, and made a new battery pack. 2-22 through 2-26 are plugged into the top half of the 208V PDU. All of these nodes were plugged into locations that also had plenty of other nodes plugged in that didn’t go down.

Two hours ago, the following nodes went down: 1-5, 1-10, and 2-1 through 2-4. These nodes are all plugged into the top 3000VA UPS in the stack. However, compute-1-11 and 1-9 are also plugged into that UPS, and they did not go down at that time. They’re also fully loaded and have been the whole time.

What gives? At this point it is entirely unclear to me.
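If it happens again, a first step would be something like this (hypothetical, I haven’t run exactly this): pull the last boot time off every node and line the timestamps up against the power map.

    # 'rocks run host' fans a command out to every compute appliance;
    # who -b prints each node's last boot time.
    rocks run host compute collate=yes command="who -b"

    # If apcupsd is watching any of the UPSes, its event log (default
    # location) would show on-battery events to correlate against:
    grep -i "power failure" /var/log/apcupsd.events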


Hard drives smaller than 80GB no longer work

April 21, 2014
by Wm. Josiah Erikson (wjens)

Due to some reconfiguration of the partition schema and some additions to the distro, it appears that anything smaller than 80GB no longer works in a cluster node. I also had a weird problem with compute-1-8, 1-16, and 1-15, where they hung at “Verifying DMI Pool Data….” forever; when Dan and I swapped out the hard drives for 80GB SATA drives, that problem disappeared, as did the more obvious problem of the ROCKS installer exiting with “You need 857MB more hard drive space”.
All nodes should be back in service shortly, probably by the time you read this. Thanks to Rae-Ann for donating all the old 80GB hard drives that User Support was throwing away 🙂
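For next time, here’s a quick sanity check to spot undersized drives before kicking off reinstalls (an untested sketch, not something I actually ran):

    # The Disk line of fdisk -l reports total size, e.g.
    # "Disk /dev/sda: 80.0 GB, ...". Anything under ~80GB is suspect.
    rocks run host compute collate=yes command="fdisk -l /dev/sda 2>/dev/null | grep '^Disk /dev'"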


New nodes

January 24, 2014
by admin

I bought some new C6100s – X5560s with 24GB of RAM this time. Somebody’s discovered the secret of the C6100s, as they’ve gone UP in price. They’re still a steal, though – $1500 per four-node 2U rackmount unit. I bought two. I have installed one so far and am waiting for compute-1-17 to be done with what it’s doing before I move it over to rack 1, where it belongs, and replace it with the other C6100. Everything’s all out of order now – I’ll have to reorder things when I next reinstall the cluster. I labeled the compute-2-x nodes so I wouldn’t get confused. Also: we’re almost at 1TB of RAM, and will be over it once I set up the second C6100. An important milestone! 🙂


Circuit blew again

December 4, 2013
by Wm. Josiah Erikson (wjens)

We blew circuit #10 this morning, which is a 120V 20A circuit with one UPS and a couple of other things plugged into it. I took compute-2-17 through 2-21 off that UPS and plugged them into the non-UPS 208V circuit. The biggest problem with blowing this circuit was that one of the infrastructure switches was plugged into that UPS, so it looked like a lot more nodes were down than really were. We shouldn’t blow that circuit again, but now we have four nodes that aren’t on UPS. We should probably get a 208V UPS.
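The rough math on why #10 was marginal (with the usual caveat that I’m estimating the loads):

    # A 120V 20A branch circuit is 2400VA total, and the standard 80%
    # continuous-load rule leaves about 1920VA to work with:
    echo $((120 * 20))          # 2400 VA
    echo $((120 * 20 * 8 / 10)) # 1920 VA continuous budget
    # One UPS recharging hard plus "a couple of other things" can eat
    # that pretty quickly.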


compute-2-1 can’t talk to its hard drive

October 7, 2013
by Wm. Josiah Erikson (wjens)

Gonna try booting it with a USB hard drive I had sitting around. Janky, but it might work 🙂


Constant power outages, UPSes

October 7, 2013
by Wm. Josiah Erikson (wjens)

I’ve moved some nodes over to the 3000VA UPSes, brought in a 1500VA UPS from the old Five College rack and replaced its batteries, and added a 2200VA UPS I found in G-8. That one probably needs new batteries too, but might be enough to pull through a few blips; I ordered new batteries for it, plus the one in B-18…. I’m leaving a few nodes un-tractored until tomorrow morning so I can put what I can on UPS without interrupting Tom’s runs. I started up compute-4-1 anyway, since it has dual power supplies and I can move it without disturbing things. Not sure how many compute-2-x nodes I can really put on these UPSes – those things draw a lot of power. Maybe a couple of enclosures at least. We shall see. Jeff Neumann says they’re working on it, but I think more blips will happen before they figure it out.
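Back-of-envelope on the enclosure question, with guessed numbers (I haven’t metered a loaded C6100): call a 3000VA UPS good for about 2700W of real power at a 0.9 power factor, and a fully loaded four-node enclosure something like 850W at the wall.

    # 2700W / ~850W per enclosure is about 3, so two enclosures with
    # headroom, three if my draw estimate is pessimistic:
    echo $((3000 * 9 / 10 / 850))   # => 3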


30A circuit blew

September 10, 2013
by Wm. Josiah Erikson (wjens)

Wacky. The power strip didn’t pop – the actual 30A 208V circuit breaker in the closet tripped… that shouldn’t happen. I don’t know how that’s possible without tripping the power strip unless, perhaps, there was a voltage dip or something. I guess that’s what we get for not having a UPS… but then again, UPSes often cause their own problems! Nodes reinstalling now….


Eight new cluster nodes

August 6, 2013
by Wm. Josiah Erikson (wjens)

Two more of those C6100s with L5520s and 24GB of RAM per node. Racked up and set to go. They were $700 each, plus $200 for a lot of ten 250GB hard drives, since they came driveless.