ARM In Supercomputers — 'Get Ready For the Change'
An anonymous reader writes "Commodity ARM CPUs are poised to replace x86 CPUs in modern supercomputers just as commodity x86 CPUs replaced vector CPUs in early supercomputers. An analysis by the EU Mountblanc Project (PDF) (using Nvidia Tegra 2/3, Samsung Exynos 5 & Intel Core i7 CPUs) highlights the suitability and energy efficiency of ARM-based solutions. They finish off by saying, 'Current limitations [are] due to target market condition — not real technological challenges. ... A whole set of ARM server chips is coming — solving most of the limitations identified.'"
PC user, hardcore gamer and programmer here; for me, energy efficiency is a lesser priority than speed in a CPU. Make an ARM CPU compete with an Intel Core i7 2600K, and show me it's overclockable with few issues, and you got my attention.
Really, Soulskill?
Like the CDC6600?
Power/performance ratios are with x86.
Most of the actual processing power in current supercomputers comes from GPUs, not CPUs. There are exceptions (that all-SPARC Japanese one, or a few Cell-based ones), but they're just that, exceptions.
So sure, replace the Xeons and Opterons with Cortex-A15s. Doesn't really change much.
What might be interesting is a GPU-heavy SoC - some light CPU cores on the die of a supercomputer-class GPU. I have heard Nvidia is working on such (using Tegra CPUs and Tesla GPUs), and I would not be surprised if AMD is as well, although they'd be using one of their x86 cores for it (probably Bulldozer - damn thing was practically built for heavily-virtualized servers, not much different from supercomputers).
google 'slow clap compilation youtube'
Shows how dominant Intel have become that they were actually able to keep competing RISC processors out of many supercomputers for so long.
This isn't to say that ARM *can't* be there, but thus far all of the implementations have focused on 'good enough' performance within a tightly constrained power envelope. Intel's designs have traditionally been highly inefficient in that power band, but at peak conditions they are still compelling.
I recall one 'study' which claimed to demonstrate ARM as inarguably better. It got way more attention than it should have, because they measured the ARM side of the test but just *assumed* TDP would be the accurate power number for x86. There are very few workloads that would cause a processor to *average* TDP over the course of a benchmark.
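To put the TDP point in numbers, here is a minimal sketch. Every figure in it is hypothetical and picked only for illustration; none of it comes from the study in question or from any real benchmark. It shows how dividing throughput by TDP instead of by measured average draw understates performance per watt.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency as sustained throughput divided by power draw."""
    return gflops / watts

# All three values are made-up assumptions, not measurements.
X86_GFLOPS = 100.0              # assumed sustained throughput
X86_TDP_WATTS = 95.0            # assumed rated TDP
X86_MEASURED_AVG_WATTS = 60.0   # assumed average draw during the run

# Efficiency if you simply plug in TDP as the power number.
assumed = gflops_per_watt(X86_GFLOPS, X86_TDP_WATTS)
# Efficiency if you use the (hypothetical) measured average instead.
measured = gflops_per_watt(X86_GFLOPS, X86_MEASURED_AVG_WATTS)

# Factor by which the TDP shortcut understates the chip's efficiency.
understatement = measured / assumed

print(f"using TDP:      {assumed:.2f} GFLOPS/W")
print(f"using measured: {measured:.2f} GFLOPS/W")
print(f"TDP shortcut understates efficiency by {understatement:.2f}x")
```

The factor is just TDP divided by average draw, so the further a workload stays below TDP, the more the shortcut skews the comparison.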
The thing that really *is* stealing x86 thunder is the GPU world. Intel's Phi strives to answer it, but thus far falls short in performance. There continue to be areas where GPU architecture is an ill fit, and ultimately I think Phi may end up being a pretty good solution.
XML is like violence. If it doesn't solve the problem, use more.
As I understand it, Intel still has the advantage in the performance per watt category for general processing and GPUs have better performance per watt IF you can optimize for that specific environment--both things which have been commented to death endlessly by people far more knowledgeable than I.
However, to me there are at least 3 questions unanswered:
1. ASICs (and possibly FPGAs): Bitcoin miners and DES breakers are the best known examples. Where is the dividing line between where your operations are specific enough to employ an ASIC vs not specific enough and needing a GPU (or even CPU)? Could further optimization move this line more toward the ASIC?
2. Huge dies: This has been talked about before, but it seems that, for applications that are embarrassingly parallel, this is clearly where the next revolution will be, with hundreds of cores (at least, and of whatever kind of "core" you want). So when will this stop being vaporware?
3. But what do we do about all the NON-parallel jobs? If you can't apply an ASIC and you can't break it down, you're still stuck at the basic wall we've been at for around a decade now: where's Moore's (performance) law here? It would seem the only hope is new algorithms: TRUE computer science!
Hopefully this means we should start seeing ARM-using motherboards in an ATX form-factor. The Pi and Beaglebone are nice, but I want something that's essentially just like a commodity x86 motherboard except it uses ARM.
Current ARM processors may indeed have a role to play in supercomputing, but the advantages this article implies don't exist.
Go look at performance figures for the Cortex-A15. It's *much* faster than the Cortex-A9. It also draws far more power. There's a reason why ARM's own product literature identifies the Cortex-A15 as a smartphone chip at the high end, but suggests strategies like big.LITTLE for lowering total power consumption. Next year, ARM's Cortex-A57 will start to appear. That'll be a 64-bit chip, it'll be faster than the Cortex-A15, it'll incorporate some further power efficiency improvements, and it'll use more power at peak load.
That doesn't mean ARM chips are bad -- it means that when it comes to semiconductors and the laws of physics, there are no magic bullets and no such thing as a free lunch.
http://www.extremetech.com/computing/155941-supercomputing-director-bets-2000-that-we-wont-have-exascale-computing-by-2020
I'm the author of that story, but I'm discussing a presentation given by one of the US's top supercomputing people. Pay particular attention to this graph:
http://www.extremetech.com/wp-content/uploads/2013/05/CostPerFlop.png
What it shows is the cost, in energy, of moving data. Keeping data local is essential to keeping power consumption down in a supercomputing environment. That means that smaller, less-efficient cores are a bad fit for environments in which data has to be synchronized across tens of thousands of cores and hundreds of nodes. Now, can you build ARM cores that have higher single-threaded efficiency? Absolutely, yes. But they use more power.
ARM is going to go into datacenters and supercomputers, but it has no magic powers that guarantee it better outcomes.
I have long pined for a server with maybe ten 4-core ARM CPUs. Basically my server spends its time serving up web stuff from memory. Each web request needs to do a bit of thinking and then fire the data out the port. Disk IO is not an issue, nor is server bandwidth. Quite simply, I don't need much CPU but I need many CPUs. A big powerful Intel is of less interest.
Also by breaking up the system into physically separate CPUs I suspect that an interesting memory accessing architecture could be conjured up preventing another potential choke point.
Has anybody else seen/considered the Xilinx Zynq? It's a mix of ARM cores and FPGA, which could be interesting in supercomputing solutions.
For anyone willing to tweak around with it there are development boards around like the ZedBoard that is priced at US$395. Not the cheapest device around, but for anyone willing to learn more about this interesting chip it is at least not an impossible sum. Xilinx also have the Zynq®-7000 AP SoC ZC702 Evaluation Kit which is priced at US$895, which is quite a bit more expensive and not as interesting for hobbyists.
Done right you may be able to do a lot of interesting stuff with a FPGA a lot faster than an ordinary processor can and then let the processor take care of stuff where performance isn't a critical part.
Those chips are right now starting to find their way into vehicle ECUs, but it's still in an early phase so there aren't many mass produced cars yet with it.
As I see it - supercomputers will have to look at every avenue to get maximum performance for the lowest possible power consumption - and avoid solutions with high power consumption in standby situations.
If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
"keeping the things fed with memory can be slower than doing it on a CPU in the first place" is the line you've missed and is why GPUs don't solve every highly parallel problem at the moment. They can do reverse time migration, but can't currently do time migration, depth migration, tomography etc etc. The penalties of swapping so much memory in and out are far too costly, to the point of orders of magnitude of performance or complete showstoppers where you just can't get enough in for it to work at all.
There is already one line of supercomputers built from embedded hardware: the IBM Blue Gene. Their CPUs are embedded PowerPC cores. That's the reason why those systems typically have an order of magnitude more cores than their x86-based competition.
Now, the problem with BG is that not all codes scale well with the number of cores. Especially when you're doing strong scaling (i.e. you fix the problem size, but throw more and more cores at the problem), Amdahl's law tells you that it's beneficial to have fewer/faster cores.
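For anyone who wants to see the strong-scaling argument in numbers, here is a minimal sketch of Amdahl's law. The parallel fraction, the core counts and the relative core speed are all assumptions chosen for illustration, not Blue Gene or x86 measurements.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup over a single core for a fixed problem size (Amdahl's law)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

# Hypothetical setup: the code is 95% parallelizable. Machine A has 16 fast
# cores; machine B has ten times as many cores, each a quarter as fast.
PARALLEL_FRACTION = 0.95       # assumed
FAST_CORES = 16                # assumed
SLOW_CORES = 160               # assumed
SLOW_CORE_RELATIVE_SPEED = 0.25  # assumed, relative to one fast core

# Both results are expressed relative to one fast core.
fast_machine = amdahl_speedup(PARALLEL_FRACTION, FAST_CORES)
slow_machine = amdahl_speedup(PARALLEL_FRACTION, SLOW_CORES) * SLOW_CORE_RELATIVE_SPEED

print(f"16 fast cores:  {fast_machine:.2f}x one fast core")
print(f"160 slow cores: {slow_machine:.2f}x one fast core")
```

Under these assumptions the many-core machine has 2.5 times the aggregate peak (160 x 0.25 against 16) and still finishes the fixed-size job more slowly, because the serial 5% runs on a slow core.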
Finally I consider the study to be fundamentally flawed as it compares the OEM prices of consumer-grade embedded chips with retail prices of high-end server chips. This is wrong for so many reasons... you might then throw in the 947 GFLOPS, $500 AMD Radeon 7970, which beats even the ARM SoCs by a margin of 2x (ARM: ~1 GFLOPS/$, AMD Radeon: ~2 GFLOPS/$).
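The GFLOPS-per-dollar figures above are easy to check. This small sketch uses only the numbers quoted in this comment (947 GFLOPS and $500 for the Radeon, roughly 1 GFLOPS/$ for the ARM SoCs); the variable names are just for illustration.

```python
def gflops_per_dollar(gflops: float, price_usd: float) -> float:
    """Price efficiency as peak GFLOPS divided by purchase price."""
    return gflops / price_usd

# Figures as quoted in the comment above.
RADEON_7970_GFLOPS = 947.0
RADEON_7970_PRICE_USD = 500.0
ARM_SOC_GFLOPS_PER_DOLLAR = 1.0  # the approximate ARM figure from the comment

radeon = gflops_per_dollar(RADEON_7970_GFLOPS, RADEON_7970_PRICE_USD)
# How many times more GFLOPS per dollar the Radeon offers than the ARM SoCs.
advantage = radeon / ARM_SOC_GFLOPS_PER_DOLLAR

print(f"Radeon 7970: {radeon:.3f} GFLOPS/$")
print(f"Advantage over ARM SoCs: {advantage:.2f}x")
```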
Computer simulation made easy -- LibGeoDecomp
I may be wrong here, but I get the impression that the MIPS architecture is much more power efficient than the ARM architecture.
If they are going to talk about building big iron out of CPUs with high power efficiency, I reckon a MIPS CPU might be more suitable for this task than one from the ARM camp.
Muchas Gracias, Señor Edward Snowden !
Slashdot seems to have lots of ARM fanboys that look at ARM's low power processors and assume that ARM could make processors on par with Intel chips but much more efficient. They seem to think Intel does things poorly, as though they don't spend billions on R&D.
Of course that would beg the question as to why ARM doesn't and the answer is they can't. The more features you bolt on to a chip, the higher the clock speed, and so on, the more power it needs. So you want 64-bit? More power. Bigger memory controller? More power. Heavy hitting vector unit? More power. And so on.
There's no magic juju in ARM designs. They are low power designs, in both senses of the word. Now that's wonderful, we need that for cellphones. You can't be slogging around with a 100 watt chip in a phone or the like. However don't mistake that for meaning that they can keep that low consumption and offer performance equal to the 100 watt chip.
Frankly I agree with you. I'm thinking the average /. reader will find your post incoherent though.
Help stamp out iliturcy.
I think my initial general comment about memory is properly aimed at a high school level readership Mr "sixth level cache" :)
...but also reliability (because supercomputers are really large and one failed node will generally crash the whole job, thereby wasting gazillions of core hours; that's one reason why SC centers buy expensive Nvidia Tesla hardware instead of the cheaper GeForce series) and IO and memory bandwidth and finally integration density. That one Intel chip can be more tightly integrated as it won't generate as much excess heat per GFLOPS (according to TFA...).
Computer simulation made easy -- LibGeoDecomp
Well, you don't expect them to just go and outperform the others; they are obviously gonna take time optimizing. But what interests me more is more competition and something new to look forward to.
Cheers,
I am a fan boy for the small ARM boards... I have built an MPI cluster out of Raspberry-Pi boards and it is not even close except as a teaching exercise where it excels.
However many site services can be dedicated to these little boards where corp IT seems to dedicate virtual machines.
Department Web Servers... with mostly static content... via NFS or a revision control system like hg.
Department and internal caching name servers... NTP servers and managed central storage for each building or closet.
The impact of the little ARM boards has kicked Intel in their lethargy-loaded-behind. Their next generation sub 25 Watt systems will take names and kick butt as long as IT does not overload them with WindowZ.
IT departments will find the management advantage of Chromebox devices connected to quality screens compelling.
Users will find that flipping open the company ChromeOS laptop will put them on the same page as the big screen in the office...
It is true that this is not 100% ready for prime time for all of us but the handwriting is on the wall.
Truth is stranger than fiction, but it is because Fiction is obliged to stick to possibilities; Truth isn't. Mark Twain.
It's Mont-Blanc, not Mountblanc.
Not something you care about with a mobile phone, but with a HPC system you really DO care about every watt dissipated.
A gimp version of windows is not going to get the job done.
On the other hand, a Windows version of GIMP does get a lot of jobs done that don't quite need Adobe Photoshop.
But seriously, the reason Windows RT is "gimped" is because Microsoft has refused to endorse recompiling desktop applications. That's not a failing of ARM, as ARM ran RISC OS on Acorn computers, as much as a power grab by Microsoft.
Some of the Samsung Slate tablets however come with an x86...and are actually fully functional! Can you point to an ARM tablet that can do everything it can?
Some ARM tablets run Ubuntu. Other Android tablets run Debian in a chroot, with video out through an X11 server app for Android. These can't run Windows applications in Wine the way x86 tablets do, but they work for any GNU/Linux application that has been recompiled for ARM.
Scientific computing has ALWAYS been bespoke for the big iron. Because each scientific model is unique to the problem domain and the ideas of how it is going to be solved.
The compiler is relied upon to produce the most optimal code resulting from the (usually FORTRAN) source and the computer libraries called are optimised to the machine that they run on.
I work for the UK Met Office and I've not heard of any other big computing centres that do things differently.
An ISA is only as good as its most efficient implementation.
I think ARM can win this, because it has a superior, more streamlined ISA. x86 is a relic, a dinosaur, and it's all over the place. Just like x86 can do low-power designs, ARM can also reach the same performance, but also with a smaller and more power efficient implementation thanks to its refined ISA.
Signature intentionally left blank.
That advantage goes away if your core is superscalar -- you still have issues with branching and not keeping the queue full. Some versions of x86 superscalar can execute both sides of branches, then discard the results of the branch not taken. There is no reason that an architecture with an ARM instruction set could not do this; but then some of the performance-per-watt benefits would be leveled out.
Typical supercomputing tasks usually scale quite well. (Otherwise there's little point of running it on a supercomputer in the first place). Which is of course why GPUs are so interesting.
Why, then, are both of the new video game consoles moving to x86-based architectures?
True, but with limits. There is a reason why LRZ bought SuperMUC without GPUs: a) fewer, faster cores, b) users didn't have to change their codes. Now, machines like BG/Q scale extremely well, despite having such a high core count. But they have the interconnect built right into the chip architecture. We don't have anything comparable on current ARM designs, but hey, the future is gonna be interesting.
Computer simulation made easy -- LibGeoDecomp
I keep hearing this kind of thing from ARM fans. Ok, show it to me. You can't, because it doesn't exist, nor anything even close to it. What that means is you are just hoping this is the case, making things up, not that it is actually the case.
is the lack of an IOMMU by default on all ARMs.
And that is, ARM has many makers/sellers , but intel..... is just one intel..... a single source for all your $$$$. More than one ARM source is better for competition.
Liberty freedom are no1, not dicks in suits.
How much thorium is needed to power a CPU for 5 years?
Liberty freedom are no1, not dicks in suits.