Slashdot Mirror


ARM Chips Designed For 480-Core Servers

angry tapir writes "Calxeda revealed initial details about its first ARM-based server chip, designed to let companies build low-power servers with up to 480 cores. The Calxeda chip is built on a quad-core ARM processor, and low-power servers could have 120 ARM processing nodes in a 2U box. The chips will be based on ARM's Cortex-A9 processor architecture."

20 of 132 comments

  1. Going to be expensive! by ikarys · · Score: 5, Funny

    It'll likely cost an ARM and a leg.

    1. Re:Going to be expensive! by fuzzyfuzzyfungus · · Score: 2

      I suspect that cost will largely boil down to the "fabric", type unspecified, and whatever the "because we can" premium for this device happens to be.

      Since the A9s are in mass production, and have some vendor competition, they should be reasonably cheap, and of basically knowable price; but, depending on what sort of interconnect this thing has, you could end up paying handsomely for that. "Basically ethernet; but cut down to handle short signal paths over known PCBs" shouldn't be too bad; but if it is some sort of custom NUMA unified memory thing, bend over and open your checkbook...

  2. is it worth it? by metalmaster · · Score: 2

    When you start piling all you can onto a chip, the power consumption is naturally going to creep up. Once you reach a certain threshold of x chips, you lose the benefit of ARM being "low-power." Am I wrong?

    1. Re:is it worth it? by swalve · · Score: 3, Insightful

      It's low power in that the cores that aren't being used can (I assume) be shut down. Like a switchmode power supply versus a linear one. So you are always using the least amount of power possible.

    2. Re:is it worth it? by L4t3r4lu5 · · Score: 5, Interesting

      Cortex A9 is 250mW per core at 1GHz

      You're looking at, for a 240 core 2U node, 60W for CPUs. Pretty impressive.
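
      As a quick sanity check of that figure, here is a small Python sketch. The 250 mW per core value is the one quoted above; the function name `cpu_power_watts` and the 480-core case (the maximum from the summary) are added for illustration. The result covers the CPU cores only, not memory or interconnect.

```python
def cpu_power_watts(cores: int, milliwatts_per_core: float = 250.0) -> float:
    """Total power drawn by the CPU cores alone, in watts."""
    return cores * milliwatts_per_core / 1000.0


# 240 cores is the figure used in the comment; 480 is the maximum from the summary.
for cores in (240, 480):
    print(f"{cores} cores: {cpu_power_watts(cores):.0f} W")
```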

      --
      Finally had enough. Come see us over at https://soylentnews.org/

    3. Re:is it worth it? by fuzzyfuzzyfungus · · Score: 4, Interesting

      It really depends on how much (and what kind of) support hardware ends up being involved in having lots and lots of them together in some useful way. That and what inefficiencies, if any, are present because your workload was really expecting a smaller number of higher-performance cores.

      The power/performance of the core itself remains the same whether you have 1 or 1 million. The power demands of the memory may or may not change: phones and the like usually use a fairly small amount of low-power RAM in a package-on-package stack with the CPU. For server applications, something that takes DIMMs or SODIMMs might be more attractive, because PoP usually limits you in terms of quantity.

      The big server-specific questions are going to be the nature of the "fabric" across which 120 nodes in a 2U are communicating. Because 120 ports worth of 10/100 or GigE would occupy 3Us and nonzero power themselves, I'm assuming that this fabric is either not ethernet at all, or some sort of cut-down "we don't need to care about the standards because the signal only has to travel 6 inches over boards we designed, with our hardware at both ends" pseudo-ethernet that looks like an ethernet connection for compatibility purposes; but is electrically more frugal. Whatever that costs, in terms of energy, will have to be added on to the effective energy cost of the CPUs themselves.

      Then you get perhaps the most annoying variable: Many tasks are (either fundamentally, or because nobody bothered to program them to support it) basically dependent on access to a single very fast core, or to a modest number of cores with very fast access to one another's memory. For such applications, the performance of 400+ slow cores is going to be way worse than a naive addition of their individual powers would suggest. Sharing time on a fast core is both fundamentally easier, and enjoys a much longer history of development, than does dividing a task among small ones. With some workloads, that will make this box nearly useless (especially if the interconnect is slow and/or doesn't do memory access). For others, performance might be nearly as good as a naive prediction would suggest.
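
      One common way to put rough numbers on this is Amdahl's law, which the comment does not name; the sketch below is an illustration under stated assumptions, not a measurement. The function names are hypothetical, and the assumption that each slow core runs at a tenth of the speed of a fast core is a placeholder borrowed from a guess made elsewhere in this thread.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over one core of the same type, per Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def throughput_vs_fast_core(parallel_fraction: float, cores: int,
                            relative_core_speed: float) -> float:
    """Throughput of `cores` slow cores relative to a single fast core."""
    return relative_core_speed * amdahl_speedup(parallel_fraction, cores)


# Assumed: 480 slow cores, each 0.1x the speed of the fast core.
for p in (0.5, 0.9, 0.999):
    ratio = throughput_vs_fast_core(p, cores=480, relative_core_speed=0.1)
    print(f"parallel fraction {p}: {ratio:.2f}x one fast core")
```

      Under these assumptions, a workload that is only half parallel ends up at about 0.2x the throughput of one fast core, while an almost fully parallel one comes out well ahead.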

    4. Re:is it worth it? by somersault · · Score: 3, Interesting

      Not really, the server could stay powered up the whole time (unless you really get 0% usage at non-peak times, and those times are predictable, in which case it makes sense to just power down completely at those times). By scaling up I mean enabling more cores, thus improving the processing capacity of the server. Then you'd get the best of both worlds, with the server being fine for anything from small to massive workloads, while still using less power than the equivalent x86 setup. Like modern engines which can enable or disable cylinders at will to conserve fuel when not much power is needed.

      --
      which is totally what she said

    5. Re:is it worth it? by wvmarle · · Score: 2

      Most servers do not do heavy computing work: they serve up (dynamic) web pages, handle SQL queries, process e-mail, serve files. That sounds to me like lots and lots of threads that each have relatively little work to do.

      For example /.: the serving of a single page to a single visitor will take a few dozen SQL queries and the running of a Perl script to stitch it all together. This takes, say, 0.001 seconds of time of an x86 core - a wild guess, may be an order of magnitude off, good enough for the sake of the argument. An ARM core is maybe a tenth of that speed, so that single page would need 0.01 seconds of processing power to build up. And that is assuming the processor is the bottleneck. Likely the network to access the SQL servers is the bottleneck, which may end up the same overall time to build up that web page.

      But now there are thousands upon thousands of visitors - all requesting pages. As this all goes parallel, it would simply require ten ARM cores to replace one x86 core and retain the same overall output.
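
      The back-of-the-envelope numbers above can be sketched in Python. The per-page times are the comment's own wild guesses (0.001 s on x86, ten times that on ARM); the constant and function names are made up for illustration, and the model assumes the CPU is the only bottleneck and that requests parallelise perfectly.

```python
X86_SECONDS_PER_PAGE = 0.001  # wild guess from the comment
ARM_SLOWDOWN = 10             # "maybe a tenth of that speed"


def pages_per_second(cores: int, seconds_per_page: float) -> float:
    """Aggregate throughput if every core serves pages independently."""
    return cores / seconds_per_page


arm_seconds_per_page = X86_SECONDS_PER_PAGE * ARM_SLOWDOWN
x86_throughput = pages_per_second(1, X86_SECONDS_PER_PAGE)
arm_throughput = pages_per_second(10, arm_seconds_per_page)

print(f"1 x86 core:   {x86_throughput:.0f} pages/s")
print(f"10 ARM cores: {arm_throughput:.0f} pages/s")
```

      Throughput matches, but each individual page still takes ten times longer to build on the ARM core in this model, which only matters if the CPU rather than the network is the bottleneck.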

      Indeed when you're doing heavy scientific calculations - then ARM definitely won't stand a chance. But web pages won't even need you to do any floating point arithmetic. The same for handling an e-mail queue. It's I/O that's important, the capacity of moving the correct bits from A to B. And from what I've learned about these processors I don't think ARM is doing that so much worse than x86. So depending on the server load, there may really be something to it. Especially as those ten ARM cores use just a fraction of the power of a single x86 core.

    6. Re:is it worth it? by wagnerrp · · Score: 2

      The comment wasn't intended to be derogatory against the ARM. The ARM was just designed from the ground up with low power consumption in mind, not performance. The Cortex A9 has an 8-stage pipeline, 2.5 instructions per clock, around 13M transistors per core, runs at 800MHz to 1.5GHz, and has up to 512KB of L2 cache. The Pentium 3 has a 10-stage pipeline, 2.5 instructions per clock, around 10M transistors, runs at 500MHz-1.4GHz, and has up to 512KB of L2 cache. They're fairly comparable processors, with the ARM probably having a better instruction dispatcher and branch predictor, and the P3 having better floating point performance.

      While it doesn't have a lot of power compared to modern x86 chips, it absolutely blows them away in performance per watt. It's a much better prospect for low power systems than the Atom, where Intel effectively tossed out 15 years of microprocessor design ripping out parts to cut power consumption.

    7. Re:is it worth it? by TheRaven64 · · Score: 2

      They're fairly comparable processors, with the ARM probably having a better instruction dispatcher and branch predictor, and the P3 having better floating point performance.

      The ARM chip probably doesn't have a better branch predictor. The Pentium 4 had a very good one, which was back-ported to the Pentium-M. The Pentium 3 one was pretty good. ARM chips didn't have one at all until very recently, because branch prediction is much less important with the ARM ISA.

      A lot of ARM instructions are predicated, meaning that they are evaluated, but their results are only retired if a specific condition register is set. Branch prediction on x86 is very important, because short if sequences cause a pipeline stall if they are not correctly predicted. For example, consider this made up example:

      if (x % 2)
      {
          x++;
      }

      With an x86 chip, this will be a conditional branch to skip over the increment. The Pentium 3 branch predictor should get this right most of the time, but if it gets it wrong then you have to flush all of the instructions that were put into the pipeline after the branch instruction (which can be quite a lot, but is probably around 10 in a typical case). In contrast, the ARM version will just use the predicated version of the increment instruction, so the worst that happens is that you lose one cycle.
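
      To make the trade-off concrete, here is a toy cost model in Python. It is only a sketch: the 10-cycle flush penalty stands in for the "around 10" flushed instructions mentioned above (assuming roughly one instruction per cycle), the 1-cycle figure is the worst case quoted for the predicated instruction, and the function names and rates are invented for illustration.

```python
def branch_cost(mispredict_rate: float, flush_penalty: float = 10.0) -> float:
    """Average cycles lost per execution of the `if` when it is a real branch."""
    return mispredict_rate * flush_penalty


def predicated_cost(skip_rate: float, wasted_cycles: float = 1.0) -> float:
    """Average cycles lost when the increment is predicated and its result discarded."""
    return skip_rate * wasted_cycles


# Assumed: the condition is false half the time (e.g. x is even half the time).
for mispredict_rate in (0.05, 0.25, 0.5):
    print(f"mispredict {mispredict_rate:.0%}: "
          f"branch loses {branch_cost(mispredict_rate):.2f} cycles, "
          f"predicated loses {predicated_cost(0.5):.2f} cycles")
```

      In this model the branch only breaks even when the predictor is wrong about one time in twenty; on unpredictable data the predicated form wins clearly.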

      For longer branches, the cost of a pipeline stall is less important relative to the overall cost of execution, but it's still quite important. Older ARM chips had very short pipelines, so it wasn't really worth bothering wasting power on a branch predictor. Newer ones do branch prediction, but you can turn off the predictor to save power.

      Comparing ARM instructions per clock and x86 instructions per clock is pretty hard. x86 has some trivial instructions and some incredibly complex ones. ARM instruction density is often very good - it's about the only ISA that regularly beats x86. For example, ARM load instructions get a free barrel shift, which makes array indexing very fast - often a single instruction for accessing an array element.

      --
      I am TheRaven on Soylent News

  3. Re:And it's useless. No 64-bit support. by GeLeTo · · Score: 2

    ARM's Large Physical Address Extensions (LPAE) allow access to up to 1TB of memory. While I doubt applications will use this, it will allow each virtualized host on the server to use 4GB of memory.

  4. Re:WANTED: 1U low-power rack server by TheRaven64 · · Score: 2

    Take a look at the PandaBoard, if you want a low-power, dual-core ARM server, although you'd have to use CF + USB for storage, not SATA. Note, however, that VirtualBox is x86-only. If you want virtualisation, you're currently pretty limited on ARM. There is a Xen port, but it's not really packaged for end users yet.

    --
    I am TheRaven on Soylent News

  5. Re:WANTED: 1U low-power rack server by espiesp · · Score: 2

    While not in 1U format, a lot of off-the-shelf NAS boxes use ARM. My LG N2R1 NAS has an 800MHz Marvell 88F6192 and runs Lenny. I won't be surprised to see some NanoITX boards out running similar hardware. Plus, I've been very impressed with how many Debian packages are available for ARMEL. While not perfect, it's the most useful Linux server I've ever had.

  6. Re:And it's useless. No 64-bit support. by TheRaven64 · · Score: 4, Informative

    How about a link to this rant, if you want us to read it? And, if you've got a problem with PAE-like extensions, then I presume you're aware that both Intel's and AMD's virtualisation extensions use PAE-like addressing?

    All that PAE and LPAE do is decouple the size of the physical and virtual address spaces. This is a fairly trivial extension to existing virtual memory schemes. On any modern system, there is some mechanism for mapping from virtual to physical pages, so each application sees a 4GB private address space (on a 32-bit system) and the pages that it uses are mapped to some pages of physical memory. With PAE / LPAE, the only difference is that this mapping now lets you map to a larger physical address space - for example, 32-bit virtual to 36-bit physical. You see exactly the opposite of this on almost all 64-bit platforms, where you have a 64-bit virtual address space but only a 40- or 48-bit physical address space.
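
    The address-space sizes mentioned above follow directly from the bit widths. A minimal Python sketch (the helper name `address_space_gib` and the labels are just for illustration):

```python
GIB = 1 << 30  # bytes in one GiB


def address_space_gib(bits: int) -> int:
    """Size in GiB of an address space with the given number of address bits."""
    return (1 << bits) // GIB


for label, bits in [("32-bit virtual", 32),
                    ("36-bit physical", 36),
                    ("40-bit physical", 40),
                    ("48-bit physical", 48)]:
    print(f"{label}: {address_space_gib(bits)} GiB")

# How many full 4 GiB virtual address spaces fit into 36-bit physical memory.
print(address_space_gib(36) // address_space_gib(32))
```

    A 40-bit physical address space works out to 1024 GiB, which is the 1TB figure quoted for LPAE earlier in the thread.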

    The big problem with PAE was that most machines that supported it came with 32-bit peripherals and no IOMMU. This meant that the peripherals could do DMA transfers to and from the low 4GB, but not anywhere else in memory. This dramatically complicated the work that the kernel had to do, because it needed to either remap memory pages from the low 4GB and copy their contents or use bounce buffers, neither of which was good for performance (which, generally, is something that people who need more than 4GB of RAM care about).

    The advantage is that you can add more physical memory without changing the ABI. Pointers remain 32 bits, and applications are each limited to 4GB of virtual address space, but you can have multiple applications all using 4GB without needing to swap. Oh, and you also get better cache usage than with a pure 64-bit ABI, because you're not using 8 bytes to store a pointer into an address space that's much smaller than 4GB.
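
    The cache argument can be illustrated with a trivial calculation. The 64-byte cache line below is an assumed, typical figure rather than a property of any particular chip discussed here, and `pointers_per_cache_line` is a made-up helper name.

```python
def pointers_per_cache_line(pointer_bytes: int, cache_line_bytes: int = 64) -> int:
    """How many pointers fit in one cache line (line size is an assumption)."""
    return cache_line_bytes // pointer_bytes


print("32-bit pointers per line:", pointers_per_cache_line(4))
print("64-bit pointers per line:", pointers_per_cache_line(8))
```

    Pointer-heavy data structures therefore need roughly twice as many cache lines for their pointers under a 64-bit ABI, whatever the actual line size is.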

    By the way, I just did a quick check on a few 64-bit machines that I have accounts on. Out of about 700 processes running on these systems (one laptop, two servers, one compute node), none were using more than 4GB of virtual address space.

    --
    I am TheRaven on Soylent News

  7. Re:And it's useless. No 64-bit support. by pmontra · · Score: 2

    How about a link to this rant

    http://blog.linuxolution.org/archives/117

  8. Re:And it's useless. No 64-bit support. by Bengie · · Score: 2

    64bit memory range? Each node is going to have its own memory slot(s). 120 cores, 4 cores per node = 30 nodes. If you plan to have less than 4GB of memory in this system, how small does each stick have to be when you plug 30 in? ~128MB. Good luck finding a bunch of DDR2/3 128MB sticks to plug into your 4GB 120 core web server. Anyway, each node needs its own local copy of the data it needs to serve up. If your web page needs ~256MB, each node is going to need the same 256MB of data duplicated, plus any extra overhead. You can't expect all 30 nodes to access the same 2-3 memory slots; that would scale like crap. This is one of the issues you get when scaling via cores. Interconnection bandwidth/latency becomes an issue and you need to use local storage to allow fully independent processing. Once you start getting up into these ranges, you're better off thinking of each node as its own computer with a fairly high speed network.
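
    The per-node arithmetic can be sketched as follows. The 30-node count is the one used in this comment and 120 is the node count given in the summary; the 4GB total and 256MB working set are the comment's figures, and the function names are hypothetical.

```python
def mib_per_node(total_gib: float, nodes: int) -> float:
    """Memory available to each node if a fixed total is split evenly."""
    return total_gib * 1024 / nodes


def duplicated_total_gib(working_set_mib: float, nodes: int) -> float:
    """Total memory needed if every node keeps its own copy of the working set."""
    return working_set_mib * nodes / 1024


for nodes in (30, 120):
    print(f"{nodes} nodes: {mib_per_node(4, nodes):.1f} MiB each from 4 GiB, "
          f"{duplicated_total_gib(256, nodes):.1f} GiB total for 256 MiB copies")
```

    Splitting 4GB across 30 nodes leaves a little over 128MB each, and duplicating a 256MB working set on every node pushes the total well past 4GB.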

  9. Re:WANTED: 1U low-power rack server by Nursie · · Score: 2

    You need to watch out with them also though. The WD Sharespace I have uses a 500MHz chip which is totally inadequate for decent throughput between the 4-disk array and the GigE interface.

    And I had to write my own device support into the kernel to get it running a modern OS! It came with 2.6.12!

  10. Re:Cheaper way by jDeepbeep · · Score: 4, Funny

    Nah, too RISCy

    --
    Reply to That ||

  11. Re:And it's useless. No 64-bit support. by TheRaven64 · · Score: 2

    His complaint basically boils down to the fact that the kernel needs to be able to map all of physical memory, and have some address space left over for memory-mapped I/O. This is a valid complaint for a kernel developer (although Linus' 'everyone who disagrees with me is an idiot' style is quite irritating), but it is largely irrelevant to the issue at hand. There is nothing stopping a kernel on ARM with LPAE from using 64-bit pointers internally. You still need to translate userspace pointers, but you need to do that anyway on most architectures (on x86, context switches are insanely expensive, so typically you use a segment for the kernel and run system call handlers without changing the page tables, just making the kernel segment visible by switching to ring 0), so that code already exists in all of the relevant places in the kernel.

    --
    I am TheRaven on Soylent News

  12. Re:And it's useless. No 64-bit support. by the+linux+geek · · Score: 2

    This kind of arrangement gets brought up over and over - one of the more recent examples is SiCortex, and it sucked. Having a Single System Image is always preferable to a "cluster in a box."