Random reboot on compute-1-7, crash on 1-6

March 29, 2013
by Wm. Josiah Erikson (wjens)

Kernel panic on 1-6. Rebooted, seems OK so far, but I should watch it – compute-1-6 has been trouble for a while. Dunno why 1-7 rebooted, but I’d guess it’s another bad power supply. Also worth watching.



21 Responses to “Random reboot on compute-1-7, crash on 1-6”

  1.   wjens Says:

    Hm. So blank screen on compute-1-6, machine powered on, nothing on the console. Rebooted and all appears well. I would normally run memtest, but I already did that. If this happens again, perhaps replace the processor? Compute-1-7, on the other hand, was off. Just powered off. Removed from the rack, took the top off, applied power, and it powered on again and appears to be working fine. Again, unclear what is going on! The power supply in 1-7 is a new one – “Green Power” EA-380D. Could be bad, but I doubt it.

  2.   wjens Says:

    compute-1-7 reinstalled, then died with a blank console. We’ll run memtest86+ and see what we come up with, since it’s still running and therefore might not be a power supply issue. It could also be a bad CPU or motherboard.

  3.   wjens Says:

    memtest86+ passed 10 complete passes with no errors, so it’s not the memory… trying to reinstall again, watching it more closely this time.

  4.   wjens Says:

    compute-1-7 appears to have the r600 plague as well… but it’s unclear to me why it only started now. It’s trying to load a firmware file (R600_rlc.bin) that isn’t there and was never provided in CentOS 6.2. Ah, I see – the problem only happens if it’s plugged into a monitor while booting, and it could possibly be that it only happens if it’s plugged into THIS monitor (1280×1024 LCD) while booting. I did try it both with DVI and VGA…. and it appears to immediately kernel panic if you plug it in after it boots. Everything makes so much sense now! This is extremely unfortunate. Perhaps I will be buying another 78LMT-USB3, which doesn’t exhibit this problem.

  5.   wjens Says:

    compute-1-6 crashed again. This time there’s something on the console: Kernel panic – not syncing: Watchdog detected hard LOCKUP on cpu 3. Bad CPU? I did just get two RMA’ed MSI motherboards back, too, so I could try that…
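If the console output were being captured as text somewhere (the post doesn't say it is, so that is an assumption), a rough way to check whether the same core keeps showing up in these panics could look like the sketch below. The function name `lockup_cpus` and the sample string are made up for illustration; only the panic message itself comes from the console.

```python
import re

# Matches the panic line quoted above. Accepts a plain "-" or the
# typographic dash the blog software substitutes for it.
LOCKUP_RE = re.compile(
    r"Kernel panic\s*[-–]\s*not syncing:\s*"
    r"Watchdog detected hard LOCKUP on cpu\s*(\d+)"
)

def lockup_cpus(log_text):
    """Return the CPU numbers named in hard-lockup panic lines, in order."""
    return [int(m.group(1)) for m in LOCKUP_RE.finditer(log_text)]

# Hypothetical captured console text, for illustration only.
sample = "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 3\n"
print(lockup_cpus(sample))
```

If the same CPU number came up crash after crash, that would lean toward a bad core. Different numbers each time would make something shared, like heat or power, look more likely.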

  6.   wjens Says:

    Now when compute-1-6 reinstalls, you can’t log in – it says your password is wrong (it shouldn’t even be asking). Reinstalling compute-1-2 to see if it has the same problem… did something happen to the install image? Possibly…. maybe I need to remake it…. fly has crashed a lot recently.

  7.   wjens Says:

    Wacky. So compute-1-2 reinstalled fine and came up normally. I’ll try reinstalling compute-1-6 again and if it still fails, I’ll rebuild it with a new motherboard.

  8.   wjens Says:

    1-6 and 1-7 crashed again. New motherboards, coming up!

  9.   wjens Says:

    One more chance for compute-1-7… its rear fan was out and it’s a fanless power supply. Replaced – maybe that will solve the problem.

  10.   wjens Says:

    …or not. Crashed in the middle of installing.

  11.   wjens Says:

    Replaced motherboard in compute-1-6 with one of the RMA’ed 760GM-E51’s. Installing now. We’ll see what happens. If this isn’t it, I guess it’s the CPU.

  12.   wjens Says:

    compute-1-6 already crashed on install. Guess it wasn’t the motherboard. We’ll try CPU next.

  13.   wjens Says:

    Ordered a new FX-8320 for it – they were on special sale for $158.

  14.   wjens Says:

    On a whim, I put the 78LMT-USB3 back in it and tried again. I got the weird problem where it wouldn’t let me log in and didn’t ping. I replaced the RAM with some older, smaller RAM, and it booted, installed, and works fine. For now. Definitely don’t trust it.

  15.   wjens Says:

    Trying to replace compute-1-7’s motherboard. It’s the single remaining non-RMA’ed 760GM-E51. Suspect! 🙂

  16.   wjens Says:

    compute-1-7 crashed AGAIN. Is it RAM? It definitely works better than it did before… as in, it boots and installs 🙂 Why do I do this when I could just buy some C6100’s?

  17.   wjens Says:

    I came into the server room and compute-1-7 was off. Power supply? It’s newer, but it’s a “Green Power” one that maybe doesn’t have enough juice.

  18.   wjens Says:

    Compute-1-6 has been fine for a while now. Weird. Still don’t trust it. As far as compute-1-7 goes, I tried replacing the power supply, no go, tried replacing the motherboard again, tried replacing the RAM, tried replacing the CPU with the new FX-8320, still no go. Then I unplugged the 80GB Seagate Serial ATA hard drive. Boom! Fine. Replaced with old 80GB PATA drive that I brought in from home. Re-racked. Now I have an extra FX-8120, RAM, and motherboard to upgrade another node with. I still don’t trust compute-1-6 though…

  19.   wjens Says:

    Compute-1-6 crashed again. I will replace the CPU.

  20.   wjens Says:

    My spare FX-8120 is toast. Right. I’ll have to RMA it. Running it with the top off to see if it’s a heat issue. It does still boot fine.

  21.   wjens Says:

    Hm. Well, it has been running with the top off for 11 days with no problems, often partially loaded, and fully loaded since Monday morning (more than 48 hours ago). I think that’s clear. It’s a heat issue. Huh. OK. I’ll have to get clever about moving more air through that case.
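For keeping an eye on a heat problem like this, one option is to poll the temperature sensors that Linux exposes under `/sys/class/hwmon`, where `temp*_input` files hold millidegrees Celsius. The sketch below is only that, a sketch: it assumes the node's sensor driver is loaded and actually exposes those files, and the 70 °C limit is an arbitrary placeholder, not a figure from this thread.

```python
import glob
import os

def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {path: degrees C} for every temp*_input file under base."""
    temps = {}
    for path in glob.glob(os.path.join(base, "hwmon*", "temp*_input")):
        try:
            with open(path) as f:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable or empty; skip it
    return temps

def over_limit(temps, limit_c=70.0):
    """Return sorted (path, temp) pairs above limit_c (placeholder threshold)."""
    return sorted((p, t) for p, t in temps.items() if t > limit_c)

if __name__ == "__main__":
    for path, temp in over_limit(read_hwmon_temps()):
        print(f"{path}: {temp:.1f} C")
```

Run from cron with the case closed and then open, this would give a rough before-and-after comparison. It says nothing about which component is overheating unless the sensor labels are checked as well.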
