compute-1-10 is back

January 5, 2015
by Wm. Josiah Erikson (wjens)

Power supply was bad. Good thing I just bought all those Antec Green Power ones.


compute-1-18 rebooting randomly

September 15, 2014
by admin

Not sure why. It doesn’t do it with dnetc, just Tom/render jobs (though it’s in the “notom” Profile, maybe because of this). I ran it for 16 days at full load with dnetc, and it’s stable as a rock. Just started tractor-blade again, and we’ll see if it reboots itself again…

It’s an FX-8350. Maybe I bought it one of those crappy MSI motherboards.

These AMD machines seem to be pretty flaky… but it might actually just be the MSI motherboards. Not sure I’ve had any trouble with a machine with the Gigabyte motherboard.


compute-1-2 upgraded to FX-8320, 16GB

August 25, 2014
by admin

Got the RMA’ed motherboards back (MSI 760GM-E51(FX)’s), and the CPUs from AMD. One of the motherboards was DOA. Go MSI. However, I upgraded compute-1-2 with the other one. I now have a spare FX-8120 and 8GB of RAM… I guess I’ll buy a Gigabyte motherboard and another 8GB of RAM and upgrade compute-1-3. I’m tired of those terrible MSI motherboards… I don’t know if I even WANT them to send me a new one. Heh.


2-15 and 2-27 having hard drive issues

June 30, 2014
by admin

I may just have a lot of bad hard drives, or they may have backplane issues… I saw this with compute-2-1 at some point, and I don’t remember the exact resolution… I’ll go look through the old posts and see if I blogged about it. I’ve replaced the hard drives multiple times and seen different kinds of errors: first it can’t find the hard drive at all, then it loses communication in the middle of an install and resets multiple times. At this point I’ve got a couple of hard drives in there that are working, but I still think it might be a physical backplane connectivity issue…


compute-4-3 reset

June 30, 2014
by admin

BIOS says: “Warning: A fatal error has caused system reset! Continue?” I said yes. If it happens again, I’ll worry about it.


compute-1-12 has hardware errors

June 26, 2014
by admin

I’m getting:

kernel: [Hardware Error]: Combined Unit Error: VB Data/ECC error

It could be the CPU, memory, or motherboard. The poster in this thread has almost exactly the same hardware we’ve got: http://ubuntuforums.org/showthread.php?t=2010489 and he didn’t have bad memory… those motherboards have been really lousy. We’ve got two coming back from RMA soon, and I’ve got a couple of spare CPUs sitting around that came back from RMA… I guess I could swap one or the other of those out.
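To tell a one-off from a recurring fault, it can help to count how often each hardware error message shows up. The following is only a minimal sketch in Python, assuming the log text has already been read into a string; the function name `count_hardware_errors` is made up for illustration and is not part of any tool mentioned here.

```python
from collections import Counter

# Marker the kernel puts in front of machine-check style messages,
# as seen in the log line quoted in this post.
MARKER = "[Hardware Error]:"

def count_hardware_errors(log_text):
    """Count '[Hardware Error]' messages, keyed by the text after the marker."""
    counts = Counter()
    for line in log_text.splitlines():
        _, found, message = line.partition(MARKER)
        if found:  # partition returns an empty separator when the marker is absent
            counts[message.strip()] += 1
    return counts

sample = (
    "kernel: [Hardware Error]: Combined Unit Error: VB Data/ECC error\n"
    "kernel: some unrelated message\n"
    "kernel: [Hardware Error]: Combined Unit Error: VB Data/ECC error\n"
)
for message, n in count_hardware_errors(sample).items():
    print(n, message)
```

A count that keeps climbing after swapping one part would point at one of the parts that were not swapped.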


weird eth2 issue

May 21, 2014
by admin

eth2 on fly (fly1) went down, with the following error message in dmesg:

------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
Hardware name: PowerEdge 2950
NETDEV WATCHDOG: eth2 (bnx2): transmit queue 0 timed out
Modules linked in: nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt 8021q garp stp llc p4_clockmod freq_table speedstep_lib ipt_REJECT xt_state iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ses enclosure bnx2 sg serio_raw lpc_ich mfd_core i5000_edac edac_core i5k_amb shpchp ext4 jbd2 mbcache sd_mod crc_t10dif sr_mod cdrom megaraid_sas pata_acpi ata_generic ata_piix usb_storage radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mperf]
Pid: 0, comm: swapper Not tainted 2.6.32-431.11.2.el6.x86_64 #1
Call Trace:
[] ? warn_slowpath_common+0x87/0xc0
[] ? warn_slowpath_fmt+0x46/0x50
[] ? dev_watchdog+0x26b/0x280
[] ? scheduler_tick+0x11e/0x260
[] ? dev_watchdog+0x0/0x280
[] ? run_timer_softirq+0x197/0x340
[] ? tick_dev_program_event+0x65/0xc0
[] ? __do_softirq+0xc1/0x1e0
[] ? tick_program_event+0x2a/0x30
[] ? call_softirq+0x1c/0x30
[] ? do_softirq+0x65/0xa0
[] ? irq_exit+0x85/0x90
[] ? smp_apic_timer_interrupt+0x4a/0x60
[] ? apic_timer_interrupt+0x13/0x20
[] ? mwait_idle+0x77/0xd0
[] ? atomic_notifier_call_chain+0x1a/0x20
[] ? cpu_idle+0xb6/0x110
[] ? start_secondary+0x2ac/0x2ef
---[ end trace 0131d3805b9feaaf ]---
bnx2 0000:05:00.0: eth2:
bnx2 0000:05:00.0: eth2: RV2P_PFTQ_CTL 00010000
bnx2 0000:05:00.0: eth2: RV2P_TFTQ_CTL 00020000
bnx2 0000:05:00.0: eth2: RV2P_MFTQ_CTL 00020000
bnx2 0000:05:00.0: eth2: TBDR_FTQ_CTL 00004002
bnx2 0000:05:00.0: eth2: TDMA_FTQ_CTL 00010002
bnx2 0000:05:00.0: eth2: TXP_FTQ_CTL 00010002
bnx2 0000:05:00.0: eth2: TXP_FTQ_CTL 00010002
bnx2 0000:05:00.0: eth2: TPAT_FTQ_CTL 00010000
bnx2 0000:05:00.0: eth2: RXP_CFTQ_CTL 00008000
bnx2 0000:05:00.0: eth2: RXP_FTQ_CTL 00100000
bnx2 0000:05:00.0: eth2: COM_COMXQ_FTQ_CTL 00010000
bnx2 0000:05:00.0: eth2: COM_COMTQ_FTQ_CTL 00020000
bnx2 0000:05:00.0: eth2: COM_COMQ_FTQ_CTL 00010000
bnx2 0000:05:00.0: eth2: CP_CPQ_FTQ_CTL 00008000
bnx2 0000:05:00.0: eth2: CPU states:
bnx2 0000:05:00.0: eth2: 045000 mode b84c state 80001000 evt_mask 500 pc 8000bf8 pc 8000bf0 instr 1f82821
bnx2 0000:05:00.0: eth2: 085000 mode b84c state 80001000 evt_mask 500 pc 800068c pc 8000694 instr 3c180800
bnx2 0000:05:00.0: eth2: 0c5000 mode b84c state 80001000 evt_mask 500 pc 80044c4 pc 80044c8 instr 32a20003
bnx2 0000:05:00.0: eth2: 105000 mode b84c state 80001000 evt_mask 500 pc 8000774 pc 800074c instr af8a0014
bnx2 0000:05:00.0: eth2: 145000 mode b880 state 80000000 evt_mask 500 pc 8004e10 pc 8000f58 instr 3e00008
bnx2 0000:05:00.0: eth2: 185000 mode b84c state 80008000 evt_mask 500 pc 80006f8 pc 800042c instr 3c0c0800
bnx2 0000:05:00.0: eth2:
bnx2 0000:05:00.0: eth2:
bnx2 0000:05:00.0: eth2: TBDC free cnt: 32
bnx2 0000:05:00.0: eth2: LINE CID BIDX CMD VALIDS
bnx2 0000:05:00.0: eth2: 00 000800 c488 00 [0]
bnx2 0000:05:00.0: eth2: 01 000800 c488 00 [0]
bnx2 0000:05:00.0: eth2: 02 000800 3060 00 [0]
bnx2 0000:05:00.0: eth2: 03 000800 9aa8 00 [0]
bnx2 0000:05:00.0: eth2: 04 000800 9ab0 00 [0]
bnx2 0000:05:00.0: eth2: 05 000800 7528 00 [0]
bnx2 0000:05:00.0: eth2: 06 000800 7538 00 [0]
bnx2 0000:05:00.0: eth2: 07 000800 7500 00 [0]
bnx2 0000:05:00.0: eth2: 08 000800 73c8 00 [0]
bnx2 0000:05:00.0: eth2: 09 000800 73e8 00 [0]
bnx2 0000:05:00.0: eth2: 0a 000800 c408 00 [0]
bnx2 0000:05:00.0: eth2: 0b 000800 c410 00 [0]
bnx2 0000:05:00.0: eth2: 0c 000800 c390 00 [0]
bnx2 0000:05:00.0: eth2: 0d 000800 c3a8 00 [0]
bnx2 0000:05:00.0: eth2: 0e 000800 c3b0 00 [0]
bnx2 0000:05:00.0: eth2: 0f 000800 d238 00 [0]
bnx2 0000:05:00.0: eth2: 10 000800 d268 00 [0]
bnx2 0000:05:00.0: eth2: 11 000800 d2c8 00 [0]
bnx2 0000:05:00.0: eth2: 12 000800 d2b8 00 [0]
bnx2 0000:05:00.0: eth2: 13 000800 d2c0 00 [0]
bnx2 0000:05:00.0: eth2: 14 000800 2ff8 00 [0]
bnx2 0000:05:00.0: eth2: 15 000800 3048 00 [0]
bnx2 0000:05:00.0: eth2: 16 000800 3098 00 [0]
bnx2 0000:05:00.0: eth2: 17 000800 3d40 00 [0]
bnx2 0000:05:00.0: eth2: 18 15e600 e640 cb [0]
bnx2 0000:05:00.0: eth2: 19 0d4200 db90 91 [0]
bnx2 0000:05:00.0: eth2: 1a 110e00 1a70 18 [0]
bnx2 0000:05:00.0: eth2: 1b 0e6f00 cc08 42 [0]
bnx2 0000:05:00.0: eth2: 1c 08b800 0ae0 0f [0]
bnx2 0000:05:00.0: eth2: 1d 1c1680 8208 18 [0]
bnx2 0000:05:00.0: eth2: 1e 0a8800 82c0 17 [0]
bnx2 0000:05:00.0: eth2: 1f 109180 fdd8 81 [0]
bnx2 0000:05:00.0: eth2:
bnx2 0000:05:00.0: eth2: DEBUG: intr_sem[0] PCI_CMD[02b8055e]
bnx2 0000:05:00.0: eth2: DEBUG: PCI_PM[1d002000] PCI_MISC_CFG[81020088]
bnx2 0000:05:00.0: eth2: DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
bnx2 0000:05:00.0: eth2: DEBUG: RPM_MGMT_PKT_CTRL[00000000]
bnx2 0000:05:00.0: eth2: DEBUG: HC_STATS_INTERRUPT_STATUS[00000000]
bnx2 0000:05:00.0: eth2:
bnx2 0000:05:00.0: eth2: DEBUG: MCP_STATE_P0[00000106] MCP_STATE_P1[ffffffff]
bnx2 0000:05:00.0: eth2: DEBUG: MCP mode[0000b880] state[80000000] evt_mask[00000500]
bnx2 0000:05:00.0: eth2: DEBUG: pc[08007298] pc[08004cb4] instr[afbf0014]
bnx2 0000:05:00.0: eth2: DEBUG: shmem states:
bnx2 0000:05:00.0: eth2: DEBUG: drv_mb[01030003] fw_mb[00000003] link_status[0000006f] drv_pulse_mb[00005e5b]
bnx2 0000:05:00.0: eth2: DEBUG: dev_info_signature[44564905] reset_type[01005254] condition[00000106]
bnx2 0000:05:00.0: eth2: DEBUG: 000001c0: 01005254 42530000 00000106 fbffffff
bnx2 0000:05:00.0: eth2: DEBUG: 000003cc: 44444444 44444444 44444444 00000a28
bnx2 0000:05:00.0: eth2: DEBUG: 000003dc: 0004ffff 00000000 00000000 00000000
bnx2 0000:05:00.0: eth2: DEBUG: 000003ec: 00000000 00000000 00000000 00a22fa0
bnx2 0000:05:00.0: eth2: DEBUG: 0x3fc[0000ffff]
bnx2 0000:05:00.0: eth2:
bnx2 0000:05:00.0: eth2: NIC Copper Link is Down

Running ifdown eth2 and then ifup eth2 appears to have fixed the issue… for now. I’m going to post to the ROCKS list to see if there’s anything else I should do about it.
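The key line in that dump is the NETDEV WATCHDOG one, and it is easy to pick out of a log programmatically. Here is a minimal sketch in Python, assuming the dmesg output has been captured as text; the helper name `find_watchdog_timeouts` is hypothetical, and the script only reports the timeout rather than bouncing the interface itself.

```python
import re

# Matches the kernel's transmit-timeout warning, for example:
# "NETDEV WATCHDOG: eth2 (bnx2): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)

def find_watchdog_timeouts(log_text):
    """Return a list of (interface, driver, queue) for each watchdog timeout."""
    return [
        (m.group("iface"), m.group("driver"), int(m.group("queue")))
        for m in WATCHDOG_RE.finditer(log_text)
    ]

sample = (
    "Hardware name: PowerEdge 2950\n"
    "NETDEV WATCHDOG: eth2 (bnx2): transmit queue 0 timed out\n"
    "bnx2 0000:05:00.0: eth2: NIC Copper Link is Down\n"
)
for iface, driver, queue in find_watchdog_timeouts(sample):
    print(f"{iface} ({driver}): transmit queue {queue} timed out")
```

Something like this could run from cron against the saved dmesg output and send mail when it finds a match, so the next occurrence gets noticed before users do.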


compute-1-7 has new heatsink, motherboard

May 16, 2014
by admin

compute-1-7 has been on the table with the top off for quite a while, because it crashes when loaded heavily in the rack. The likely cause is that the stock heatsink’s top-mounted fan doesn’t have enough clearance under the top of the case to move the amount of air required to properly cool the CPU. So we bought it a Dynatron A27G, which is specifically designed for socket AM2 2U applications. Unfortunately, it has sideways, not front-to-back, airflow, but it might be OK. We’ll find out…

Unfortunately, while attempting to attach said heatsink without using the bottom plate that came with it (I tried to use the stock bottom plate – bad idea), I slipped with the screwgun and destroyed a trace on the 760GM-E51 motherboard. Fortunately, I had two spares, so all is well.

It reinstalled properly, and I’ll put it back in the rack now and see how it fares.


fly rebuilt

May 16, 2014
by admin

I have rebuilt fly with a machine donated from Mt. Holyoke (thanks Jeremy and Ron!). It’s a Dell 2950 – dual quad-core 2.0GHz Xeon with 32GB of RAM and a couple of 1TB hard drives we had lying around in the library server room. I upgraded to ROCKS 6.1.1 while I was at it, which isn’t all that much different from ROCKS 6.0, it turns out, but the ganglia interface is a little nicer, and it comes with lots of security updates, etc. I deleted some unused accounts, installed newer versions of some software (including tractor), and now I’m trying to compile emergent again.

That was MUCH easier than I thought it was going to be. It only took me a few hours – I thought it would take a week.

One lesson learned: Make sure /etc/sysconfig/network has fly.local set as the hostname so that idmapd properly maps things to the .local domain. If you don’t, everything gets mapped to being owned by nobody when you log into the nodes. The first obvious symptom of this is that you get asked for a password when logging into a node, because SSH says that ownership doesn’t match on .ssh/authorized_keys, which it in fact doesn’t, because you are not nobody 🙂
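That hostname setting is easy to verify before anyone trips over the password prompt. Below is a minimal sketch in Python, assuming the usual KEY=value layout of /etc/sysconfig/network; the function names `read_hostname` and `hostname_in_domain` are invented for this example.

```python
def read_hostname(network_file_text):
    """Return the HOSTNAME value from /etc/sysconfig/network-style text, or None."""
    for line in network_file_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and anything that is not KEY=value
        key, _, value = line.partition("=")
        if key.strip() == "HOSTNAME":
            return value.strip().strip("\"'")  # tolerate optional quotes
    return None

def hostname_in_domain(network_file_text, domain="local"):
    """True if HOSTNAME is set and ends with '.<domain>', e.g. fly.local."""
    hostname = read_hostname(network_file_text)
    return hostname is not None and hostname.endswith("." + domain)

good = "NETWORKING=yes\nHOSTNAME=fly.local\n"
bad = "NETWORKING=yes\nHOSTNAME=fly\n"
print(hostname_in_domain(good), hostname_in_domain(bad))
```

On the real frontend the text would come from reading /etc/sysconfig/network; the strings above are stand-ins so the sketch runs anywhere.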


Power rebalance and upgrade project

May 5, 2014
by admin

I have ordered new UPS batteries for all the 3000VA UPSes – they’re three years and four months old, according to this blog – I blogged the last time I replaced them. That makes me feel old. Also I have ordered a nice 0U metered PDU that should be able to power most of Rack 1, and the other 0U PDU on the other side can do the rest. I hope that between these two things, there will be no more power strips hanging out behind the racks, and I should be able to properly balance and use all the power available.