Archive for January, 2017

compute-1-18 and compute-1-3

Tuesday, January 31st, 2017

These two nodes weren’t running tractor – they had rebooted themselves and reinstalled. Compute-1-18 had an uptime of 5 days, 11:39 and compute-1-3 an uptime of 1 day, 5:18. Not sure why – neither has anything in the logs, and neither shows anything particularly suspicious in ganglia.
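For lining up reboots with the logs and ganglia graphs, it can help to turn an uptime figure like the ones above into an approximate boot time. Below is a minimal Python sketch, assuming the uptime is written in the "N days, H:MM" form quoted in this post; the function names and the observation timestamp are hypothetical, not part of any existing tooling on the cluster.

```python
import re
from datetime import datetime, timedelta

# Matches only the "N day(s), H:MM" form quoted in the post. Other uptime
# formats (for example "35 min") are deliberately not handled in this sketch.
_UPTIME_RE = re.compile(r"^\s*(\d+)\s+days?,\s*(\d+):(\d{2})\s*$")


def parse_uptime(text):
    """Convert a string such as '5 days, 11:39' into a timedelta."""
    match = _UPTIME_RE.match(text)
    if match is None:
        raise ValueError("unrecognized uptime format: %r" % text)
    days, hours, minutes = (int(group) for group in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes)


def boot_time(observed_at, uptime_text):
    """Estimate when a node booted, given when its uptime was read."""
    return observed_at - parse_uptime(uptime_text)


if __name__ == "__main__":
    # Placeholder observation time; substitute the moment uptime was checked.
    observed_at = datetime(2017, 1, 31, 12, 0)
    for node, uptime in [("compute-1-18", "5 days, 11:39"),
                         ("compute-1-3", "1 day, 5:18")]:
        print(node, "booted around", boot_time(observed_at, uptime))
```

The resulting timestamps give a window to search in the logs, or to compare against when a run died.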

compute-2-25

Tuesday, January 31st, 2017

It appears to have rebooted and reinstalled itself about 20 hours ago. Not
sure why – nothing in the logs. I’m beginning to suspect that the chassis
holding compute-2-22 through 2-25 may have a backplane problem – I can
replace it if so, as I have two spares.

compute-2-23

Monday, January 30th, 2017

Died, as in it won’t turn on. Luckily I had just bought some new compute nodes, so I replaced it with a shiny new 6-core node with 48GB of RAM (I combined the RAM from the old node and the new one – there were spare slots). Upgrades!

compute-2-5 through 2-8

Monday, January 23rd, 2017

One of the C6100s, containing compute-2-5 through 2-8, rebooted 18 hours ago. I’m not entirely sure why yet, but if you saw runs mysteriously die, that’s why. It could be a power supply issue, or it could be that its side of the PDU got overloaded. I’ve moved it to the other side, where the power draw is 1A lower. If it reboots again, I’ll replace the power supply.

compute-1-10 bad RAM

Friday, January 20th, 2017

One of the four sticks was bad. I am RMA’ing it. The node will have 24GB of RAM until the replacement comes back.

Three node notes

Friday, January 20th, 2017

1. Compute-1-7 had crashed with a “soft lockup” kernel bug. I rebooted.
Will keep watching it. Could indicate a hardware problem, could be random.

2. Compute-1-17’s hard drive died. I replaced it.

3. Compute-2-24 died entirely – it would not power on. As it was a node in
one of the C6100s, and I had a spare, I replaced it. Now it has faster
processors and twice as much RAM. Note to self: I’m out of spare C6100
nodes.