big.LITTLE: ARM's Strategy For Efficient Computing

Posted by Soulskill on Tuesday July 9, 2013 @07:35PM from the i-see-what-you-did-there dept.

MojoKid writes "big.LITTLE is ARM's solution to a particularly nasty problem: smaller and smaller process nodes no longer deliver the kind of overall power consumption improvements they did years ago. Before 90nm technology, semiconductor firms could count on new chips being smaller, faster, and drawing less power at a given frequency. Eventually, that stopped being true. Tighter process geometries still pack more transistors per square millimeter, but the improvements to power consumption and maximum frequency have been falling with each smaller node. Rising defect densities have created a situation where — for the first time ever — 20nm wafers won't be cheaper than the 28nm processors they're supposed to replace. This is a critical problem for the mobile market, where low power consumption is absolutely vital. big.LITTLE is ARM's answer to this problem. The strategy requires manufacturers to implement two sets of cores — the Cortex-A7 and Cortex-A15 are the current match-up. The idea is for the little cores to handle the bulk of the device's work, with the big cores used for occasional heavy lifting. ARM's argument is that this approach is superior to dynamic voltage and frequency scaling (DVFS) because it's impossible for a single CPU architecture to retain a linear performance/power curve across its entire frequency range. This is the same argument Nvidia made when it built the Companion Core in Tegra 3."
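
For context on the DVFS argument in the summary: a common first-order model (not taken from the article) puts dynamic CMOS power at roughly activity x capacitance x voltage^2 x frequency. Higher clocks generally need higher voltage, so power grows faster than performance. A minimal Python sketch, where the function name and the capacitance, voltage, and frequency values are made up purely for illustration:

```python
def dynamic_power(cap_farads, volts, freq_hz, activity=1.0):
    """First-order dynamic power model: P = activity * C * V^2 * f.

    Ignores leakage and everything else a real chip does; all inputs
    used below are hypothetical numbers, not measurements.
    """
    return activity * cap_farads * volts ** 2 * freq_hz


# Hypothetical core: same switched capacitance at both operating points.
slow = dynamic_power(1e-9, 1.0, 1.0e9)   # 1.0 V at 1 GHz
fast = dynamic_power(1e-9, 1.2, 2.0e9)   # 1.2 V at 2 GHz

# Doubling the clock while raising voltage by 20% costs 2 * 1.2^2 = 2.88x power.
ratio = fast / slow
```

Under this model the only way to stay on a linear curve is to hold voltage constant, which is not possible across a wide frequency range. That is the gap a separate small core is meant to fill.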

old news by Anonymous Coward · 2013-07-09 19:52 · Score: 3, Informative

Advertising much?
Re:big.LITTLE is a joke by Anonymous Coward · 2013-07-09 20:08 · Score: 3, Insightful

Powered-down circuits have no leakage. Also a "little" implementation has vastly lower leakage than a bigger core.
Re:big.LITTLE by Thanshin · 2013-07-09 20:11 · Score: 4, Funny

Are you proposing the name IBig.ULittle?
Re:This Looks Familiar... by Rockoon · 2013-07-09 20:16 · Score: 2

GPU caches are designed to maximize bandwidth.
CPU caches are designed to minimize latency.

These two goals are at odds with each other.

It is no surprise that there is a market for GPUs. I think the surprise was that 3dfx could offer much-better-than-using-the-CPU performance so cheaply.

--
"His name was James Damore."
It's not necessarily ARM's solution by m6ack · 2013-07-09 20:27 · Score: 5, Insightful

Big/little is a lazy way out of the power problem... Because instead of investing in design and development and in fine-grained power control in your processor, you make the design decision of, "Heck with this -- silicon is cheap!" and throw away a good chunk of silicon when the processor goes into a different power mode... You have no graceful scaling -- just a brute-force throttle and a clunky interface for the kernel.
So, not all ARM licensees have been convinced or seen the need to go to a big/little architecture, because big/little has the big disadvantages of added complexity and wasted real estate (and cost) on the die. Unlike Nvidia (Tegra) and Samsung (Exynos), Qualcomm has thus far been able to keep power under control in its Snapdragon designs without having to resort to big/little, and has thus been able to excel on the phone. So far, the Qualcomm strategy seems to be a winning one for phones in terms of both overall power savings and performance per milliwatt -- where on phones every extra hour of battery life is a cherished commodity. Such may not be true for tablets that can stand to have larger batteries and where performance at "some reasonable expectation" of battery life may be the more important.
1. Re:It's not necessarily ARM's solution by TheRaven64 · 2013-07-09 21:19 · Score: 5, Insightful
  
  The power difference between an A7 and an A15 is huge. There's really nothing that you could do to something like an A15 to get it close to the power consumption of the A7 without killing performance. They're entirely different pipeline structures (in-order, dual-issue-if-you're-lucky vs. out-of-order multi-issue). The first generation from Samsung had some bugs in cache coherency that made them painful for the OS, but the newer ones are much better: they allow you to have any combination of A7s and A15s powered at the same time, so if you have a single CPU-bound task you can give it an A15 to run on and put everything else on one or more A7s (depending on how many other processes you've got, running multiple A7s at a lower clock speed may be more efficient than running one at full speed). The OS is in a far better place to make these decisions than the CPU, because it can learn a lot about the prior behaviour of a process and about other processes spawned from the same program binary.
  
  --
  I am TheRaven on Soylent News
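
The placement idea in the comment above (one A15 for a CPU-bound task, everything else spread over A7s, decided from each task's prior behaviour) could be sketched roughly as follows. This is an illustration only, not how any real kernel scheduler works: `place_tasks`, the core names, the 0.8 "heavy" threshold, and the load samples are all hypothetical.

```python
def place_tasks(load_history, n_big=2, n_little=4, heavy=0.8):
    """Assign each task to a core based on its average recent load.

    load_history maps a task name to a list of past utilisation samples
    in the range 0.0-1.0. The threshold and core counts are assumptions.
    """
    avg = {task: sum(h) / len(h) for task, h in load_history.items()}
    # Heaviest tasks first, so they get first pick of the big cores.
    ranked = sorted(avg, key=avg.get, reverse=True)

    placement = {}
    big_used = 0
    little_next = 0
    for task in ranked:
        if avg[task] >= heavy and big_used < n_big:
            placement[task] = f"big{big_used}"
            big_used += 1
        else:
            # Everything else is spread round-robin over the little cores.
            placement[task] = f"little{little_next % n_little}"
            little_next += 1
    return placement


example = place_tasks({
    "encoder": [0.9, 1.0],   # CPU-bound: gets a big core
    "ui": [0.3, 0.2],
    "mail": [0.1, 0.0],
})
```

The point of the sketch is only that the decision uses history the hardware does not have. A real scheduler would also weigh clock speeds, wake-up latency, and thermal state.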
2. Re:It's not necessarily ARM's solution by Anonymous Coward · 2013-07-09 21:20 · Score: 2, Interesting
  
  where on phones every extra hour of battery life is a cherished commodity. Such may not be true for tablets that can stand to have larger batteries and where performance at "some reasonable expectation" of battery life may be the more important.
  This isn't directly for phones and tablets and it isn't "a lazy way out of the power problem".
  We are not talking about a gradual increase in efficiency here, this is to solve the standby energy requirements for permanently powered consumer devices like TV-sets. (See the One Watt Initiative)
  The first generation of devices that solved the problem had dual power supplies. One that was optimized for high efficiency for a low load. This was used to power a microcontroller that dealt with the remote control and started the primary power supply and the rest of the electronics.
  Later there were pretty large improvements in switched power supplies that made it possible to go back to just having a single transformer.
  The problem is that there aren't really any devices in the 32bit-range that can get down below the 1mA-range without being completely shut down. (This isn't just ARM, it's also true for PIC32, ColdFire, AVR32 and other competing controllers, and no, Atom is not even trying to get down in this range.)
  Because of this the common solution is to have a small 8bit/16bit controller to handle the standby mode and possibly some of the low latency tasks that the larger controllers have problems with.
  The big.LITTLE solution to this problem isn't new, controllers with asymmetric cores have been available for a while. The benefit of it isn't a power saving, it is a space saving. This will allow developers to remove the external controllers.
  It also doesn't add complexity compared to a system with multiple controllers that are completely separate from each other.
3. Re:It's not necessarily ARM's solution by tlhIngan · 2013-07-10 03:54 · Score: 2
  
  it's still lazy and ends up with juggling between the two cores, but that's not ARM's problem so they went with it.
  but this is pretty much the 4+1 core solution from nvidia anyways. it would be far better if they could just shut down parts of the one core to stop leakage. article blurb is just stupid though, it implies this would be a way towards cheaper, while it obviously isn't since it uses more die space.
  Except it's easier in software to use big.LITTLE. If you wanted to switch from A7 to A15 and back, as long as the cores lined up and you had a cache-coherent bus (a requirement anyway), all you'd really do is activate the new cores, force a context save on all 4 cores, then restore the context on the new cores. Shut down the old cores, let the caches in the new cores warm up through cache coherency, then issue one final flush on the old cores and you're done.
  As far as the OS is concerned, nothing changed. The scheduler doesn't even need to be super-aware of the change (it could be, which makes life easier, but if it isn't, you can use an upper-level library to do it underneath the OS).
  NVidia's solution requires scheduler awareness, and runs the risk that the +1 is not sufficient to handle existing tasks thus keeping the 4 cores unnecessarily up.
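
The cluster switch described above (activate, save, restore, shut down, flush) can be modelled as a toy simulation. This is a sketch under heavy assumptions: `Core` and `migrate_cluster` are hypothetical names, the "context" is just a dict standing in for register state, and caches and the coherent bus are not modelled at all, so the flush is only recorded as a step.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    powered: bool = False
    context: dict = field(default_factory=dict)  # stand-in for register state


def migrate_cluster(old, new):
    """Move execution from the old cores to the new cores, one-to-one."""
    if len(old) != len(new):
        raise ValueError("cores must line up one-to-one")
    steps = []

    for core in new:                      # 1. activate the new cores
        core.powered = True
    steps.append("power on new cores")

    saved = [dict(core.context) for core in old]   # 2. context save
    steps.append("save context")

    for core, ctx in zip(new, saved):     # 3. restore on the new cores
        core.context = ctx
    steps.append("restore context")

    for core in old:                      # 4. shut down the old cores
        core.powered = False
        core.context = {}
    steps.append("power off old cores")

    steps.append("flush old caches")      # 5. not modelled, recorded only
    return steps


little = [Core("A7-0", True, {"pc": 100}), Core("A7-1", True, {"pc": 200})]
big = [Core("A15-0"), Core("A15-1")]
steps = migrate_cluster(little, big)
```

Nothing above the migration layer has to change, which is the comment's point: the running software sees the same number of cores with the same state before and after.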
  The big problem though isn't power draw - the A15 is a huge power hog, true, but the problem with that is thermal issues.
  Yes, thermal. You run 4 A15 cores full tilt and the chip heats up significantly - easily reaching max junction temperature in a few minutes. And there's very little in the way of cooling - you can't have huge heat planes on the PCB because you're talking high-density BGA parts, and the top of the chip is covered by a PoP memory chip.
  I've seen thermal analysis done - the best that could be done would be 2 A15s running full tilt, with the other two software-modulated to run under 50% load after a few minutes, which would keep the chip at max temperature. And the system would try to be progressive - as the chip heated up, the cores would increasingly require being idle to maintain thermal limits.
  Of course, a few minutes is all you need when you're e-peening GeekBench and other related benchmarks.
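
The progressive throttling described above (cores forced idle more and more as the chip heats up) might look something like this in its simplest form. The function name and both temperature thresholds are invented for illustration and do not come from any real thermal analysis or SoC datasheet.

```python
def throttle_duty(temp_c, start_c=70.0, limit_c=85.0):
    """Return the allowed duty cycle (0.0-1.0) for the throttled cores.

    Below start_c the cores run freely; at or above limit_c they are
    held fully idle; in between the allowance falls off linearly.
    Both thresholds are hypothetical.
    """
    if temp_c <= start_c:
        return 1.0
    if temp_c >= limit_c:
        return 0.0
    return (limit_c - temp_c) / (limit_c - start_c)
```

A real governor would apply something like this in a control loop with hysteresis, and only to some of the cores, as in the "2 at full tilt, 2 modulated" case above.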
Re:Power not die area efficient. by TheRaven64 · 2013-07-09 21:00 · Score: 4, Insightful

This solution _might_ be more power efficient. But it can not be more die and space efficient
Two words: Dark Silicon. As process technologies have improved, the amount of the chip that you can have powered at any given time has decreased. This is why we've seen a recent rise in instruction set extensions that improve the performance of a relatively small set of algorithms. If you add something that needs to be powered all of the time, all you do is push closer to the thermal limit where you need to reduce clock speed. If you add something that is only powered infrequently, then you can get a big performance win when it is used but pay a price when it isn't.
TL;DR version: transistors are cheap. Powered transistors are expensive.

--
I am TheRaven on Soylent News
Re:and... by TheRaven64 · 2013-07-09 21:31 · Score: 4, Insightful

Nothing, except that Intel's most power efficient chips are in the same ballpark as the A15 (the power-hungry, fast 'big' chip) and they currently have nothing comparable to the A7 (the power-efficient, slow 'LITTLE' chip). And in the power envelope of the A7, an x86 decoder is a significant fraction of your total power consumption.
One of the reasons why RISC had an advantage over CISC in the '80s was the large amount of die area (10-20% of the total die size) that the CISC chips had to use to deal with the extra complexity of decoding a complex non-orthogonal variable-length instruction set. This started to be eroded in the '90s for two reasons. The first was that chips got bigger, whereas decoders stayed the same size and so were proportionally smaller. The second was that CISC encodings were often denser, and so used less instruction cache, than RISC.
Intel doesn't have either of these advantages at the low-power end. The decoder is still a significant fraction of a low-power chip and, worse, it is a part that has to be powered all of the time. They also don't win on instruction density, because both x86 and Thumb-2 are approximately as dense.
MIPS might be able to do something similar. They've been somewhat unfocussed in the processor design area for the past decade, but this has meant that a lot of their licensees have produced chips with very different characteristics, so they may be able to license two of these and implement something similar quite easily. Their main problem is that no one cares about MIPS.

--
I am TheRaven on Soylent News
Re:Power not die area efficient. by RoboJ1M · 2013-07-09 22:22 · Score: 3, Informative

Found it:
http://semiaccurate.com/2013/05/01/sonics-licenses-fabric-tech-to-arm/
"Sonics and ARM just made an agreement to use Sonics interconnects patents and some power management tech in ARM products."
"If Sonics is to be taken at face value on their functionality, then you can slap just about any IP block you have on an ARM core now with a fair bit of ease."
This is kind of relevant too, the internet will eat all our electricities:
http://www.theregister.co.uk/2012/11/26/interview_rod_tucker/
"and if we don’t do anything, it could become ten percent between 2020 and 2025"
Although if you read it, the lion's share of internet electricity usage is actually those amp-happy DSL connections we have.
Re:big.LITTLE is a joke by AlecC · 2013-07-09 22:28 · Score: 3, Informative

Royalties in many licenses allow an unlimited number of CPUs on the same chip. You pay the royalty per design per chip.

--
Consciousness is an illusion caused by an excess of self consciousness.
Re:and... by TheRaven64 · 2013-07-09 22:38 · Score: 4, Informative

Citation needed. Anandtech benchmarked Clovertrail against Tegra-3, the least power-efficient ARM SoC currently on the market. The Tegra-3 has a very power-hungry GPU (which is nice if you've got the batteries for it...) and a fairly standard Cortex A9 core, which has lower performance-per-Watt than either the A7 or A15 and lower performance in absolute terms than the A15. Intel's latest Atom SoCs are in the same ballpark as the A15 in both power consumption and performance, but they're nowhere near the A7, which uses less power under load than Clovertrail uses at idle.

--
I am TheRaven on Soylent News