Archive for the 'Uncategorized' Category

compute-1-6 and 1-7 are dead, long live compute-0-1 through compute-0-4 (and 0-5 through 0-8 soon!)

Friday, July 14th, 2017

I have removed compute-1-6 and 1-7, the oldest single-CPU nodes in
the cluster, which were taking up 2U each, to make room for 8 shiny
“new” Dell 6220 nodes, each of which has dual octa-core Xeon E5-2650s
and 64GB of RAM. This makes them by far the most powerful nodes in the
cluster, as they should be. You’ll notice they present to the OS as if
they had 32 CPUs each – this is hyperthreading in action: 16 physical
cores, each showing up as two virtual cores. The two virtual cores on a
physical core share its execution units, but hyperthreading often
still helps throughput, so we should probably leave it on. However, if
we’re not seeing significant performance speedups from them, we could
try turning it off.
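
If you want to see for yourself how many of those 32 CPUs are physical cores, here is a rough sketch that counts both from the contents of /proc/cpuinfo. It assumes the usual x86 Linux layout (one blank-line-separated block per logical CPU, with “processor”, “physical id” and “core id” fields); the function name count_cores is just something I made up for illustration.

```python
def count_cores(cpuinfo_text):
    """Return (logical, physical) CPU counts from /proc/cpuinfo-style text.

    Assumes the x86 Linux format: one block per logical CPU, separated by
    blank lines, each with "processor", "physical id" and "core id" fields.
    """
    logical = 0
    physical = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue  # not a CPU block
        logical += 1
        # A physical core is identified by its (socket, core) pair;
        # hyperthreaded siblings share the same pair.
        if "physical id" in fields and "core id" in fields:
            physical.add((fields["physical id"], fields["core id"]))
    return logical, len(physical)
```

On a node, pass it open("/proc/cpuinfo").read(). With hyperthreading on, the new nodes should report twice as many logical CPUs as physical cores.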

Right now I have four nodes up and running. I’m going to add them to
tractor with some sensible service tags… I’ll assume “tom” should be
one of them, Tom?

Four more are in the mail – the originals got damaged in shipping
and they’re sending me some replacement parts.

bad enclosure?

Thursday, April 6th, 2017

Compute-2-26 through 2-29 are down. 29 has been down for quite a while… when I tried to replace it, for some reason all the rest of the nodes in that enclosure rebooted. I’m becoming suspicious of the enclosure itself and am going to replace it, as I have spares.

compute-1-7

Monday, February 27th, 2017

Died. Will not power on again. Probably bad power supply. Will check.

compute-1-18 and compute-1-3

Tuesday, January 31st, 2017

These two nodes weren’t running tractor – they had rebooted themselves and reinstalled. Compute-1-18 had an uptime of 5 days, 11:39 and compute-1-3 an uptime of 1 day, 5:18. Not sure why – neither has anything in the logs, and neither shows anything particularly suspicious in ganglia.
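
Surprise reboots like these are easier to catch if something flags short uptimes automatically. Below is a minimal sketch, assuming uptimes arrive as strings in the “5 days, 11:39” form quoted above; the function names and the one-week threshold are my own inventions, not part of any existing tool here.

```python
import re
from datetime import timedelta

# Matches "5 days, 11:39", "1 day, 5:18", or a bare "11:39".
_UPTIME = re.compile(r"(?:(\d+) days?,\s*)?(\d+):(\d{2})")


def parse_uptime(text):
    """Convert an uptime string like '5 days, 11:39' into a timedelta."""
    match = _UPTIME.fullmatch(text.strip())
    if match is None:
        raise ValueError("unrecognized uptime format: %r" % text)
    days, hours, minutes = match.groups()
    return timedelta(days=int(days or 0), hours=int(hours), minutes=int(minutes))


def recently_rebooted(uptimes, threshold=timedelta(days=7)):
    """Return the sorted names of nodes whose uptime is below the threshold.

    `uptimes` maps node name -> uptime string. The threshold is a
    hypothetical default, not a value anyone has agreed on.
    """
    return sorted(
        node for node, up in uptimes.items() if parse_uptime(up) < threshold
    )
```

Feeding it the two uptimes above would flag both nodes. Note it does not handle the “35 min” form that uptime prints during the first hour after boot.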

compute-2-25

Tuesday, January 31st, 2017

Appears to have rebooted and reinstalled itself about 20 hours ago. Not
sure why – nothing in the logs. I’m beginning to suspect that the chassis
holding 2-22 through 2-25 may have a backplane problem – I can replace it
if so, as I have two spares.

compute-2-23

Monday, January 30th, 2017

Died. Like won’t turn on. Luckily I had just bought some new compute nodes, and I replaced it with a shiny new 6-core node with 48GB of RAM (combined the RAM from the old and the new – there were spare slots). Upgrades!

2-5 through 8

Monday, January 23rd, 2017

One of the C6100’s, containing compute-2-5 through 2-8, rebooted 18 hours ago. I’m not entirely sure why yet, but if you saw runs mysteriously die, that’s why. It could be a power supply issue, or it could be that that side of the PDU got overloaded. I’ve moved it to the other side, where the power draw is 1A lower. If it does it again, I’ll replace the power supply.

compute-1-10 bad RAM

Friday, January 20th, 2017

One of the four sticks was bad. Am RMA’ing it. Will have 24GB of RAM until it comes back.

Three node notes

Friday, January 20th, 2017

1. Compute-1-7 had crashed with a “soft lockup” kernel bug. I rebooted.
Will keep watching it. Could indicate a hardware problem, could be random.

2. Compute-1-17’s hard drive died. I replaced it.

3. Compute-2-24 died entirely – would not power on. As it was a node in
one of the C6100’s, and I had a spare, I replaced it. Now it has faster
processors and twice as much RAM. Note to self: I’m out of spare C6100
nodes.

compute-1-3 is dead, long live compute-1-3!

Thursday, July 23rd, 2015

Motherboard this time. Video went all wonky. Tried replacing both the RAM and the CPU; neither helped. Had a new AM3+ motherboard on hand, so got 16GB of RAM and an FX-8350 for under $300 and threw ’em in there.