New: Biggest node yet!

April 26, 2012
by admin

I bought us a Supermicro 2042G-TRF and installed four 16-core processors and 16 x 4GB quad-ranked RAM from IT (they couldn’t use it due to some motherboard limitation we don’t have), so we now have a 64-core, 64GB node! It’s currently called compute-4-6 for a silly reason: fly picked up two MACs from the node, and the second appears to be the primary interface. Oh well, maybe I’ll fix it at some point. I haven’t installed it in the rack yet, but I’m psyched to try it out! Oh, and this whole thing only cost us $3600, so I had some money left over to get a couple of clones of compute-1-7!



8 Responses to “New: Biggest node yet!”

  1.   admin Says:

    I installed it in the rack this morning – Supermicro now has nice click-in rails that are almost as nice as Dell’s ReadyRails, but not quite.

  2.   admin Says:

    …and I think I’ll leave it as compute-4-6, as Paula says we can get another R815 (this one a 64-core!) from Dell, which will become compute-4-5. Then I’ll use next year’s money to upgrade rack 1, and possibly retire rack 2, or just leave it for students to experiment on… or just use them as storage nodes for pvfs2 exclusively.

  3.   wjens Says:

    …or not. I’m reinstalling it as compute-4-5. We may not get said node.

  4.   wjens Says:

    Found a weird problem with the IPMI where it had stolen the IP of the primary interface… I thought that fixing that would fix the problem with ganglia where the node disappears randomly, but no, it did not. It did, however, fix the problem where the node would disconnect me every 30 seconds. Also, dnetc is MISERABLY slow on it, like 1/10 the speed of compute-4-4. Hum. Clearly this experiment cannot yet be called a success.

  5.   wjens Says:

    The dnetc problem looks like it’s probably just that dnetc doesn’t recognize the processor tag and loads the wrong core – it’s loading Intel-optimized instructions.

  6.   wjens Says:

    Hm. Actually it’s not even able to use all of the processor, and even when I manually select the right core, it’s still insanely slow, and the clock is running really fast… apparently CentOS 5 doesn’t fully support the Interlagos processors, in a way that I don’t fully understand (it lacks support for the FMA4 instructions, apparently, but I’m not sure why that would make the clock run wrong!). Somebody on the rocks-discuss list told me that I should stop hald and that would fix the clock problem, but it didn’t. I think I shall try something like this:

    https://wiki.rocksclusters.org/wiki/index.php/Install_RHEL_6_kernel_for_AMD_Interlagos_support_on_Rocks_5.4.3

    I have currently installed Ubuntu 12.04 on compute-4-5 and I’m hoping to verify that it works properly with an up-to-date kernel/OS. So far so good. The clock is running correctly and dnetc uses all 64 cores at 99% each, but I don’t have a benchmark yet… not sure how meaningful the benchmark will be, but hopefully it will be a lot better than the last one!!
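
For anyone wanting to repeat the check on another OS, here is a minimal sketch (not what was actually run here) that counts the logical processors the kernel sees and looks for the `fma4` flag in `/proc/cpuinfo`. The function name `summarize_cpuinfo` is made up for illustration, and it assumes the usual Linux x86 `key : value` layout of that file.

```python
def summarize_cpuinfo(text):
    """Count logical processors and collect CPU flags from /proc/cpuinfo-style text."""
    processors = 0
    flags = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        if key == "processor":
            processors += 1
        elif key == "flags":
            flags.update(value.split())
    return processors, flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            count, flags = summarize_cpuinfo(f.read())
    except OSError:
        print("/proc/cpuinfo is not available on this system")
    else:
        print("logical processors:", count)
        print("fma4 flag reported:", "fma4" in flags)
```

On this node the first number should be 64; whether an older kernel reports the flag is exactly the thing to look at.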

  7.   wjens Says:

    So yes, it looks like this worked. dnetc benchmarks almost exactly the same on the new monster node as on the old monster nodes. This isn’t entirely surprising, since apparently there’s some weirdness where a 16-core processor actually has only 8 FPUs, though they’re each 256 bits wide, etc. I’m pretty sure that OGR is mostly FPU… perhaps we’ll see better performance on integer stuff, but I think mostly we’ll use FPU, so maybe we should have stuck with the older Magny-Cours stuff. Or just gone Intel… The cheap 8-core AMD Phenoms seem to be performing very well though; I just have to replace that power supply… I should go do that.
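
To put numbers on the shared-FPU point, here is a back-of-the-envelope sketch using only the figures mentioned in this thread (four sockets, 16 cores and 8 FPUs per processor, 256-bit FPUs). The function name and the 64-bit double width used for the last figure are my own assumptions for illustration, not measurements from the node.

```python
def node_fpu_summary(sockets=4, cores_per_socket=16,
                     fpus_per_socket=8, fpu_width_bits=256):
    """Tally cores and FPUs for the whole node from per-socket figures."""
    cores = sockets * cores_per_socket
    fpus = sockets * fpus_per_socket
    return {
        "cores": cores,
        "fpus": fpus,
        # how many integer cores contend for each FPU
        "cores_per_fpu": cores // fpus,
        # 64-bit doubles that fit in one FPU-wide register
        "doubles_per_fpu_op": fpu_width_bits // 64,
    }


if __name__ == "__main__":
    for name, value in node_fpu_summary().items():
        print(name, value)
```

So the "64-core" node has 32 FPUs, with two cores sharing each one, which fits with FPU-bound work not scaling with the core count.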

  8.   wjens Says:

    Aha! They just released ROCKS 6.0. That should solve this problem when I rebuild the cluster next. I’ll put it on the shelf until then.
