Slashdot Mirror


ARM Unveils One-chip SMP Multiprocessor Core

An anonymous reader writes "ARM Ltd. will unveil a unique multi-processor core technology, capable of up to 4-way cache coherent symmetric multi-processing (SMP) running Linux, this week at the Embedded Processor Forum in San Jose, Calif. The "synthesizable multiprocessor" core -- a first for ARM -- is the result of a partnership with NEC Electronics announced last October, and is based on ARM's ARMv6 architecture. ARM says its new "MPCore" multiprocessor core can be configured to contain between one and four processors delivering up to 2600 Dhrystone MIPS of aggregate performance, based on clock rates between 335 and 550 MHz."

20 of 145 comments

  1. Re:Interesting by Tune · · Score: 4, Informative

    It appears to be similar to other dual core technologies except developers need to worry less about threads accessing the same data. This is accomplished by cache snooping, which is a dated, but very fast way to avoid (L0) cache inconsistencies. That should take care of a major hurdle wrt. keeping SMP threads busy, especially if the clock speeds are relatively low.
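
    For anyone who has not met cache snooping before, here is a minimal sketch of the write-invalidate flavour. It is a toy model under loud assumptions that are not from the article: write-through caches, one word per line, a single shared bus, and made-up class names (Bus, SnoopingCache). The real MPCore coherency logic is certainly more involved.

```python
# Toy write-invalidate snooping model. Assumptions (not from the article):
# write-through caches, one word per cache line, one shared bus.
class Bus:
    def __init__(self):
        self.memory = {}   # backing store: address -> value
        self.caches = []   # every cache attached to this bus

    def broadcast_write(self, writer, addr, value):
        # Update memory, then let every other cache snoop the write.
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # locally cached copies: address -> value
        bus.caches.append(self)

    def read(self, addr):
        # On a miss, fetch the current value from shared memory.
        if addr not in self.lines:
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)

    def snoop_invalidate(self, addr):
        # Another core wrote this address, so drop our stale copy.
        self.lines.pop(addr, None)


bus = Bus()
core0, core1 = SnoopingCache(bus), SnoopingCache(bus)
core0.write(0x100, 1)
first = core1.read(0x100)    # core1 now holds a cached copy
core0.write(0x100, 2)        # the snoop invalidates core1's copy
second = core1.read(0x100)   # miss, so core1 refetches the new value
```

    Without the snoop_invalidate step, the second read on core1 would return the stale cached value, which is exactly the inconsistency the hardware is there to prevent.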

    Notice that SMP has been a dream of the ARM team from its early Acorn/Archimedes days on. It seems they finally got it working...

  2. I've been running SMP desktops for years... by pointbeing · · Score: 5, Informative
    The _ONLY_ reason to do this is as a last resort when you can no longer clock your existing core any higher.

    Incorrect.

    As the subject line says, I've been running SMP desktop PCs for years. My current home PC is a dual 1GHz P-III, my wife's is a dual 850 and my Linux web/file/mail/whatever server is a dual 700 with a 12% overclock.

    You can only figure on about a 40% performance increase with a dual processor desktop PC, but being able to play Quake and burn a DVD at the same time has its advantages ;-)

    As others have mentioned, multitasking is greatly enhanced - and two midrange processors are generally cheaper than one high-end processor.

    Also, even though some applications aren't multithreaded, all modern desktop OSes are - so you get a performance boost even running single-task applications. If you're into running Windows, Internet Explorer is multithreaded, as are all Microsoft Office applications. There's a real-world productivity boost using SMP machines.

    --
    we see things not as they are, but as we are.
    -- anais nin
  3. Re:Hype by TonyJohn · · Score: 5, Informative

    As Intel is now discovering (and promoting) it has long been known that clock frequency is not a sufficient measure of performance. It matters how much processing you can do in each clock tick as well as how often your clock ticks. Naturally, the faster the clock ticks, the less processing you can do per clock tick.
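
    To put rough numbers on that, here is a small sketch of "work per tick times ticks per second". The 2600 DMIPS and 550 MHz figures come from the story summary; reading them as four cores all at the top clock is my assumption, and the two cores compared at the end are entirely hypothetical.

```python
def dmips_per_mhz(aggregate_dmips, cores, clock_mhz):
    """Work done per clock tick, per core."""
    return aggregate_dmips / (cores * clock_mhz)


def throughput(work_per_tick, clock_mhz):
    """Performance is work per tick multiplied by ticks per second."""
    return work_per_tick * clock_mhz


# Figures from the summary, assuming 2600 DMIPS means 4 cores at 550 MHz.
per_core = dmips_per_mhz(2600, cores=4, clock_mhz=550)

# Hypothetical cores: a slow clock with more work per tick can beat
# a fast clock with less work per tick.
slow_wide = throughput(1.2, 550)
fast_narrow = throughput(0.4, 1500)

print(f"{per_core:.2f} DMIPS/MHz per core")
print(f"slow_wide={slow_wide:.0f}, fast_narrow={fast_narrow:.0f}")
```

    Under that reading each core manages a little under 1.2 DMIPS/MHz, and the made-up 550 MHz core comes out ahead of the made-up 1.5 GHz one.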

    1/2 GHz quoted for this core may not sound a lot, but there are some good reasons for it:

    - ARM cores use a shorter pipeline than Intel cores (in general). This requires less logic to get a good throughput of operations. Less logic means less area (less cost) and less power consumption. These are important in embedded applications (you don't want your phone to be putting out 50W and costing $200).

    - These cores are synthesisable. This means that ARM will deliver a "model" of the device, and customers can translate this to a silicon layout on their own process, and they can integrate peripherals, memory etc. on the same silicon. Getting a higher clock speed requires custom logic which is hard to translate between processes. Essentially, such a processor has to be sold separately as a piece of silicon, and this means a slow off-chip interface to the rest of the system.

    For a multi-threaded or multi-process application such as this core is targeted at, using MP cores makes more sense than using a single high-speed core and switching between processes all the time. For one thing you save all the context switching overhead.

    --
    Owl tried to think of something wise to say, but couldn't.
  4. Re:MMP ARM server by mbge7psh · · Score: 4, Informative
    Your dreams are answered - it does have floating point.

    It also features configurable level 1 caches, 64-bit AMBA AXI interfaces, vector floating-point coprocessors and programmable interrupt distribution.
  5. ARM6 *NOT* a server chip by Tune · · Score: 2, Informative


    If I recall correctly, chips prior to ARM6 had register 15 (ARM's PC) designed with the upper six bits reserved for status. Having a program address space of only 2^26 = 64 MB was a major obstacle, even for (successors of) Acorn's RiscPC, a desktop model. With that resolved in the ARM6 series, it is still unable to look beyond the 4GB boundary. In the 4-way SMP server market this is likely to become a major pain.
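
    A rough sketch of what that combined register looked like, as I understand the classic 26-bit layout: six status bits on top (N, Z, C, V, I, F), a word-aligned program counter in bits 25..2, and two processor mode bits at the bottom (those are not mentioned above). The function and constant names are made up for illustration.

```python
# Sketch of the 26-bit ARM R15 layout, as commonly documented:
#   bits 31..26  status (N Z C V I F)
#   bits 25..2   program counter (word aligned)
#   bits 1..0    processor mode
PC_MASK = 0x03FFFFFC


def decode_r15(r15):
    """Split a combined PC/status word into its three fields."""
    return {
        "status": (r15 >> 26) & 0x3F,
        "pc": r15 & PC_MASK,
        "mode": r15 & 0x3,
    }


ADDRESS_SPACE_26 = 1 << 26   # bytes reachable with a 26-bit PC
ADDRESS_SPACE_32 = 1 << 32   # bytes reachable with a full 32-bit PC

fields = decode_r15(0xFFFFFFFF)
print(fields, ADDRESS_SPACE_26 // 2**20, "MB", ADDRESS_SPACE_32 // 2**30, "GB")
```

    With every bit set, the largest program address that survives the mask is 0x03FFFFFC, just under the 64 MB limit.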

    So either they found a nice way to add yet more MIPS per megahertz (or per watt) to serve higher-end embedded systems or they're targeting (very) low end servers.

  6. Re:ARM servers by Christopher+Thomas · · Score: 4, Informative

    For decades now, memory frequency scaling has lagged that of the microprocessor. Although there has been some great strides recently, latency is still rearing its ugly head. External DRAM is too electrically distant to remain at the heart of any high-performance system.

    Once we get processor and memory combined, we'll see performance increasing by several orders of magnatude.


    This idea has been around for what is almost certainly longer than either of us have been alive. It turns out that there are problems.

    The main problem is that no matter how much memory a system has, we find ways to use it. In the time I've been using computers, memory size has gone up four orders of _magnitude_, and I'm sure the greybeards listening will top that. The processor sitting in your machine right now has more on-die memory (the cache) than, say, an early XT had, but the tasks you're running have a memory footprint too large to fit. This is the price for being able to _do_ more than you could do on that old XT.

    Another problem is with the structure of memory itself. You've heard of "fast, cheap, good - pick two"? Memory is "large, fast, densely-packed - pick _one_". The reason why integrated logic/DRAM processes tend to do one or the other badly is that DRAM and logic have to optimize transistor characteristics for exactly opposite things (high "on" current for logic, low leakage current for DRAM). Among other things, this means that DRAM is either slow or very power-hungry. SRAM is bulky no matter what you do - it's the cost of playing, when you have six transistors instead of one. Any kind of large RAM array is slow no matter what you do - you have to propagate signals across a huge structure instead of a smaller one.

    The solution to date has been a hierarchical cache system, where small, fast, on-die memory is accessed whenever possible, and when that overflows, larger, moderately fast, on-die memory, and when that fails, DRAM. This works amazingly well, giving you almost all of the benefits of fully on-die memory for problems that fit in cache. Problems that don't fit in cache won't fit in on-die memory, so going with an on-die implementation doesn't help for them.
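
    As a back-of-the-envelope illustration, here is a sketch of the usual average-access-time calculation for such a hierarchy. All latencies and hit rates below are invented round numbers, not measurements of any real part, and the model assumes every access that reaches a level pays that level's latency before moving on.

```python
def average_access_time(levels, memory_latency):
    """levels: (latency_cycles, hit_rate) pairs, fastest cache first."""
    total = 0.0
    reach = 1.0   # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        total += reach * latency
        reach *= 1.0 - hit_rate
    return total + reach * memory_latency


# Hypothetical hierarchy: 1-cycle L1, 10-cycle L2, 100-cycle DRAM.
fits_in_cache = average_access_time([(1, 0.95), (10, 0.80)], 100)
thrashing = average_access_time([(1, 0.0), (10, 0.0)], 100)

print(f"working set fits: {fits_in_cache:.2f} cycles")
print(f"working set does not fit: {thrashing:.2f} cycles")
```

    With these made-up numbers the well-behaved case averages about 2.5 cycles per access against 111 cycles when nothing hits, which is the gap between the two kinds of problem described above.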

    Progress in improving memory response times is made in two ways. The first is to use a better cache indexing algorithm that is less susceptible to pathological situations. In the simpler indexing schemes, you can end up with situations where a short repeating access pattern can hammer on the same small set of cache blocks, causing cache misses even when there's plenty of space elsewhere. Higher associativity and tricks like victim caches reduce this problem. Techniques like a "preferred" block in a set reduce the time penalty for high associativity, and techniques like content-addressable memory reduce the power penalty. This is still a field of active research - build a better cache, and you get closer to a system that _acts_ as if it has all memory on-die.
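
    The pathological case is easy to reproduce in a toy simulation. This sketch assumes an 8 KiB cache with 32-byte lines and LRU replacement (sizes chosen only for illustration) and compares a direct-mapped layout with a 2-way set-associative one on a two-address ping-pong pattern.

```python
from collections import OrderedDict


def count_misses(addresses, num_sets, ways, line_size=32):
    """Count misses in a set-associative cache with LRU replacement."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets      # which set the line maps to
        tag = line // num_sets       # identifies the line within the set
        entries = sets[index]
        if tag in entries:
            entries.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict least recently used
            entries[tag] = True
    return misses


# Two addresses exactly one cache size (8 KiB) apart, accessed alternately.
pattern = [0x0000, 0x2000] * 100

direct_mapped = count_misses(pattern, num_sets=256, ways=1)
two_way = count_misses(pattern, num_sets=128, ways=2)
print(direct_mapped, two_way)
```

    Both configurations hold the same 256 lines, but the direct-mapped one misses on every access because the two addresses keep evicting each other, while the 2-way one only takes the two compulsory misses.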

    The second way of improving memory subsystem performance is to use memory speculation. This involves either figuring out (or even guessing) what memory locations are going to be needed and preemptively fetching their contents, or taking a guess at the value that will be returned by a memory fetch before the real result comes in. In both cases, you're masking most of the latency of the memory access, while paying a price for failed speculations (either in higher memory _bandwidth_ required, or in power for speculated threads that have to be squashed). Build a better address and data speculation engine, and you'll again approach performance of an impossible all-on-die-and-fast system.
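
    Here is a minimal sketch of the address-guessing half of that (value speculation is not modelled). It assumes the simplest possible guess, "the next line will be needed", an unbounded cache, and 32-byte lines; all of that is for illustration only.

```python
def run(addresses, prefetch, line_size=32):
    """Return (demand_misses, lines_fetched) for an access stream."""
    cached = set()
    demand_misses = 0
    fetched = 0   # total lines brought in, i.e. the bandwidth cost
    for addr in addresses:
        line = addr // line_size
        if line not in cached:
            demand_misses += 1
            cached.add(line)
            fetched += 1
        if prefetch and (line + 1) not in cached:
            cached.add(line + 1)   # speculative fetch of the next line
            fetched += 1
    return demand_misses, fetched


sequential = list(range(0, 64 * 32, 4))      # walks 64 consecutive lines
strided = [i * 4 * 32 for i in range(64)]    # touches every fourth line

print(run(sequential, prefetch=False), run(sequential, prefetch=True))
print(run(strided, prefetch=False), run(strided, prefetch=True))
```

    On the sequential walk the guess is almost always right, so nearly all of the misses disappear for one extra line fetched. On the strided walk every guess is wrong: the miss count is unchanged and the number of lines fetched doubles, which is the bandwidth price of failed speculation.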

    In summary, it turns out that putting all of the memory of a general-purpose system on-die isn't practical now and won't be as long as requirements for memory keep increasing. However, caches already give you performance approaching this for problems that are small enough to _fit_ in on-die memory, and cache technology is constantly being improved. This is where effort should be (and is) going.

  7. ARM6 != ARMv6 by hattig · · Score: 3, Informative

    One is a ~1990 era version of the ARMv3 architecture (IIRC).
    The other is ARM's latest version of the ARM architecture.

    26-bit addressing limitations were removed ~14 years ago. I don't even think any of the more recent versions of the ARM architecture support it.

  8. Re:Synthesizable = can put it in an FPGA by NoMercy · · Score: 3, Informative

    I'm not sure how to tell you this, but you're virtually totally wrong on every point.

    Synthesisable to silicon, for ASICs mostly, though people like Philips turn them into microcontrollers and Intel makes a few microprocessors. The idea mostly is that you can put an LCD controller, SIM card reader, DSP, etc. all on one lump of silicon with an ARM processor and put it in your mobile phone.

    And you don't license a PowerPC core to put in an FPGA, you get a PowerPC chip actually inside the FPGA (Virtex-II Pro), any IP-Cores you see in the core-gen are simply the hooks into these devices that are already there, similar to the GCM's.

    And the big plus of this... well I don't really know but depending on how much number crunching it can do, and how much heat it generates when it does it, it could see all manner of applications.

  9. Re:Synthesizable = can put it in an FPGA by eclectro · · Score: 4, Informative

    Too bad its not open source, as there are other wicked fast processor cores available. For example Xilinx can license you to put a PowerPC in its FPGA cores.

    There is this.

    You can find the code easily. There are a couple of other clones, but I have not heard much about them. Another one is BlackARM developed in Sweden a couple of years ago.

    I think these projects would be ok as long as they are instruction compatible, but not an internal clone. In which case ARM would pull out their lawyer dogs.

    But there are a couple of other open source cores available, which IMHO would be smarter to use because you could do more with them without the fear of legal reprisal from ARM.

    If you are designing an embedded system, you might could get by using such a core. The thing ARM has going for it is that commercial support and toolkits are available, which can be handy if you have a complex application that needs a lot of debugging. And there is a lot of third party support that you are not going to find with your homegrown core.

    That being said, you could save a fair amount of money using an open core. But if you need to get something important out the door quickly (like a toy for Christmas) you go with the commercial solution. Unless you have the necessary in-house resources to troubleshoot problems.

    Just my .02

    --
    Take the cheese to sickbay, the doctor should see it as soon as possible - B'Elanna Torres, "Learning Curve"
  10. Why? by Anonymous Coward · · Score: 4, Informative
    Low power. Die size. Cost.

    You don't use an Opteron in the same situation as an arc core. It's a synthesisable mini processor used for controlling real time systems. It can be embedded in chips with custom VLSI logic to provide a platform for an operating system. It's not meant for competing with Opterons or any of the other such stupid ideas.


    Why 4 cores?


    Not all customers need 4 cores, some only need 1 (washing machines) or maybe 2. The system is therefore scalable to die size/power/cost requirements. Note it's configurable, it does not have to have 4 cores. If I were a customer of arc I could choose how much die space to devote to the core and how much power I really needed.

    4 cores, instead of one bigger more complex one, is easier to engineer and get right. Look at modern graphics architectures, it's the same principle (though one can argue about cache coherency).

    Multiple cores would make dynamic power management much easier to handle I imagine. An entire core could shut down when its process(es) are not busy. A properly designed embedded system could benefit enormously from this power saving and the hardware design is made relatively easy rather than trying to cut voltage on one large core.

    Embedded systems using arc cores often need to meet real time needs. One advantage of a multicore system would be to place a critical software component on a single core and, with correct use of memory, guarantee a fixed throughput rate of data. Of course I can use thread priorities but this makes things harder IMO. Maybe that's what they refer to by easier programming.


    To me, this looks like a clean idea, which although not revolutionary in terms of an idea, does provide significant advantages for embedded device designers by being synthesisable.


    Wroceng
    (no association with ARM at all but I forgot my password temporarily)

  11. Re:Synthesizable = can put it in an FPGA by NoMercy · · Score: 2, Informative

    Doh, that's what you get for modifying posts too much. You can put it on an FPGA, but you wouldn't want to outside of development. If you look at the picture in the article, there's 2500 dollars' worth of FPGA there, and the whole unit is probably looking at 10,000, and it's a tad big. Put it on the intended final target, a silicon chip, and you've got something which will fit in the tiny space behind the battery in your mobile phone.

  12. Re:ARM servers by pedantic+bore · · Score: 2, Informative

    Cobalt servers were based on MIPS, and then migrated to AMD-K6 processors.

    Not that they wouldn't have worked just fine with ARM, but as far as I can tell the idea never even came up.

    --
    Am I part of the core demographic for Swedish Fish?
  13. Re:ARM servers by drinkypoo · · Score: 3, Informative
    First try a google for Cobalt server ARM and then try another one for Cobalt server MIPS and see how you do. Cobalt Qube and Raq up to 2 were MIPS architecture machines, not ARM.

    ARM has been used in many PDAs as you say, and in Acorn/Archimedes computers. It's also in the Game Boy Advance (ARM7 I believe) and will likely be the foundation of the Dual Screen (ARM9 and ARM7 both will be in the box, if leaked specs can be believed.) ARM also begat StrongARM, and Intel purchased (some level of) rights to the StrongARM II architecture, which they call XScale.

    --
    "You're right," Fisheye says. "I should have set it on 'whip' or 'chop.'"
  14. Just a small detail by JamesP · · Score: 0, Informative

    ARM SUCKS!

    I already did some stuff using ARM7TDMI and I can say that it SUCKS BIG TIME.

    Why? NO INTEGER DIVISION. You have blazing fast code 90% of the time and the other 10% it's crunching the single division in your program.
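
    For reference, this is roughly what "crunching the division" means on a core with no divide instruction: a software routine has to do it one bit at a time. Below is a minimal sketch of textbook shift-and-subtract (restoring) division for unsigned 32-bit values. The name udiv32 is made up, and real runtime library routines are usually more heavily optimised than this.

```python
def udiv32(numerator, denominator):
    """Unsigned 32-bit divide by shift-and-subtract, one bit per step."""
    if denominator == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Bring down the next bit of the numerator.
        remainder = (remainder << 1) | ((numerator >> bit) & 1)
        # If the divisor fits, subtract it and set this quotient bit.
        if remainder >= denominator:
            remainder -= denominator
            quotient |= 1 << bit
    return quotient, remainder


print(udiv32(100, 7))
```

    That is 32 trips round a loop for a single divide, against one or a few cycles for an add or a shift, which is why one division in an inner loop can dominate the run time.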

    --
    how long until /. fixes commenting on Chrome?
  15. Nice! by Archibald+Buttle · · Score: 2, Informative

    I've been an ARM fan for many, many years, so it's great to see this development. I've always thought this kind of thing should happen with ARM chips, and that the ARM should be well suited for this kind of application.

    ARM cores have a great advantage of having an incredibly low transistor count. As a result the simpler ARM chips tend to have incredibly good production yields. I don't know if that's true for the more complex ARM variants like XScale. This multi-core processor should also be an order of magnitude less complicated than a Pentium, so it too should get good yields and thus for volume production be very cheap.

    However it's also always struck me that the low transistor count of ARM chips could be of use in very high performance computing applications. It is difficult to build high transistor count chips in exotic materials, but an ARM-based chip needn't have those problems. This is of course why most chips are still made on silicon.

    Also the low transistor count means that even in high speed situations you shouldn't have the clock-skew problems that plague larger processors. (Clock-skew is the problem whereby it takes longer than a single clock tick for a signal to reach from one side of the processor to the other.) A good proportion of the transistors in Pentium IVs and PowerPC G5s are there to deal with that very issue.

  16. Re:ARM servers by pantherace · · Score: 2, Informative
    Actually StrongARM owes nothing to ARM (the company), as it was made by DEC when they realized that lower power could be possible by turning down voltage etc on alphas, and instead of either creating a new instruction set, or using the alpha's instruction set (the first pure 64-bit arch, which was needed in servers, but not really in ultra-low power stuff at the time.) they decided to use ARM.

    In a court case between DEC & Intel which was settled, DEC sold its fabs (I think they had one or two left) & StrongARM to Intel, with Intel to produce the next generation Alphas, and the court also barred them from buying the Alpha tech*. There is little evidence that Intel tried to fab the Alphas, before saying they couldn't. By the time of the HP merger, Compaq (which had bought what was left of DEC) sold the Alpha tech (non-exclusive licence apparently to get by the court decision) to Intel.

    XScale is Intel doing what Intel does best: ramping up clock speeds, and having core errors in the processor. (On PXA250 (number from memory: double check) XScales, they ran a risk of corrupting the cache, which could only be worked around by disabling the cache, making them really slow.) At an equivalent clock speed a StrongARM (even as old as those in the Newton) is faster: a tribute to the DEC engineers who knew what they were doing. But StrongARM is only ARM in name & instruction set (armv4l as I recall for StrongARM, while XScale is armv5 as I recall).

  17. Re:ARM's 1st synthesizable proc? by Wesley+Felter · · Score: 2, Informative

    ARM has been making synthesizable cores for years. The article is just confused.

  18. Re:Arm != Intel ? by TonyJohn · · Score: 3, Informative

    Eeek. No.

    Intel bought part of DEC (Digital), which had, in its product portfolio, the StrongARM processor. StrongARM is a DEC implementation of the ARM Instruction Set Architecture (version 4 if you care).

    ARM is still a separate, publicly listed company. XScale is an Intel implementation of the ARM ISA (version 5TE I think). Intel pays ARM to use their architecture.

    ARM also designs implementations of the ARM ISA and licences these designs to chip designers to include in System-on-Chip designs.

    --
    Owl tried to think of something wise to say, but couldn't.
  19. Re:ARM's 1st synthesizable proc? by TonyJohn · · Score: 2, Informative

    The article is correct but misleading. This is ARM's first multiprocessor core and therefore also its first "synthesisable multiprocessor core".

    --
    Owl tried to think of something wise to say, but couldn't.
  20. Re:ARM servers by default+luser · · Score: 2, Informative

    Non-volatile ram is a different concept, you'll probably want to steer clear for the purposes of this discussion.

    You're probably thinking of SRAM, in which a single bit cell requires 6 transistors. The advantages of SRAM:

    - Data remains resident as long as the cell remains powered.

    - With the exception of leakage, the only power required is for switching, making SRAM good for low-power applications.

    That said, a single DRAM bit is about as simple as you can get. It consists of a single transistor and a capacitor to hold the data. The disadvantages of DRAM:

    - Data degrades over time, requiring periodic refresh.

    - The data contained by the capacitor is also destroyed on read, requiring it to be re-written.

    - Due to their design, DRAM cells have inherently slower performance (although there are tricks to improve this).

    These issues make your memory interface more complex and power-hungry, but the space savings often make it worthwhile to go with embedded DRAM over SRAM.

    --

    Man is the animal that laughs.
    And occasionally whores for Karma.