ARM's New Processors Are Designed To Power the Machine-Learning Machines (theverge.com)


Posted by msmash on Monday May 29, 2017 @08:00AM from the going-forward dept.

An anonymous reader shares an article: Official today, the ARM Cortex-A75 is the new flagship-tier mobile processor design, with a claimed 22 percent improvement in performance over the incumbent A73. It's joined by the new Cortex-A55, which has the highest power efficiency of any mid-range CPU ARM's ever designed, and the Mali-G72 graphics processor, which also comes with a 25 percent improvement in efficiency relative to its predecessor G71. The efficiency improvements are evolutionary and predictable, but the revolutionary aspects of this new lineup relate to artificial intelligence: this is the first set of processing components designed specifically to tackle the challenges of onboard AI and machine learning. Plus, last year's updates to improve performance in the power-hungry tasks of augmented and virtual reality are being extended and elaborated. [...] ARM won't just be powering machine learning with its new chips, it'll benefit from ML too. The new designs benefit from an improved branch predictor that uses neural network algorithms to improve data prefetching and overall performance.

27 comments


Need Memory Improvements Too by lobiusmoop · 2017-05-29 08:31 · Score: 4, Interesting

IMHO what is mostly needed is faster memory. Modern ML often involves working with multi-Gigabyte domain models, stored in DRAM, where the access latency hasn't changed particularly in the last 10 years.

--
"I bless every day that I continue to live, for every day is pure profit."
1. Re:Need Memory Improvements Too by Anonymous Coward · 2017-05-29 09:40 · Score: 0
  
  In the past, there was smart memory which moved arithmetic operations onto video memory chips. That was supposed to accelerate basic window systems. But then everything moved onto GPU's.
2. Re:Need Memory Improvements Too by Anonymous Coward · 2017-05-29 09:55 · Score: 0
  
  It also depends on the ML model, the data, and the memory model. For example, if the problem can be segmented such that a processor needs only data that can be stored locally to that processor, not needing hops to more distant memory on the same board, or to other machines, then you can keep memory access latencies low, which is beneficial. In this case relatively low core counts and relatively low memory per board may be sufficient, which is often the case with ARM-based systems.
  If the problem relies on streaming of data then bandwidth may be more important than latency, and potentially network connectivity may be more limiting than memory performance. Again, easily achievable with ARM if you invest in the interconnect.
  In both those use cases you can potentially scale up the performance of the solution by increasing component count, and so energy efficiency then becomes important.
  Not all problems fit such cases, and where relatively non-local access with low latency and high bandwidth is required, things get much more difficult.
  A
3. Re:Need Memory Improvements Too by Anonymous Coward · 2017-05-29 10:08 · Score: 0
  
  Or, since the features from input datasets are generally either discrete values from one of a few classes or floating points with experimental measurement errors larger than 0.1%, people could stop trying to use double precision floats for everything and switch to smaller integers and half precision floats in all but the very few cases where more precision is needed.
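To put rough numbers on this point, here is a minimal Python sketch (an illustration, not something posted in the thread). It uses the standard `struct` module to compare the storage cost of double, single, and half precision, and to measure the rounding error of a half-precision round trip. The helper names and the sample value are assumptions chosen for the example.

```python
import struct

# struct format codes: "d" = 64-bit double, "f" = 32-bit single, "e" = 16-bit half.
FORMATS = {"double": "<d", "single": "<f", "half": "<e"}

# Bytes needed to store one value in each format.
SIZES = {name: struct.calcsize(fmt) for name, fmt in FORMATS.items()}


def roundtrip(value, fmt):
    """Store a Python float in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def relative_error(value, fmt):
    """Relative rounding error introduced by storing `value` in `fmt`."""
    return abs(roundtrip(value, fmt) - value) / abs(value)


# Hypothetical measurement, well inside half precision's range (max about 65504).
sample = 3.14159
half_error = relative_error(sample, FORMATS["half"])
```

For values in half precision's normal range the round-trip error stays below roughly 0.05% (2^-11), which is smaller than the 0.1% measurement error mentioned above, while each value takes a quarter of the space of a double.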
4. Re:Need Memory Improvements Too by Anonymous Coward · 2017-05-29 11:00 · Score: 0
  
  This can also make better use of cache, provided you are stepping sensibly through cache memory with your operations, as well as better use of DRAM.
5. Re:Need Memory Improvements Too by Anonymous Coward · 2017-05-29 21:32 · Score: 0
  
  "floating points with experimental measurement errors larger than 0.1%"
  This is misleading.
  While the original data may be represented by single precision, depending on the nature of the calculation it can easily cause problems due to limited precision.
  Life is too short to be chasing problems caused by lack of precision, so I don't blame most coders for using "double" for everything.
  Of course I would be the first to change my mind if the task required loading a few GB of floating point numbers to hold in RAM as a data source...
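One concrete way limited precision bites, even when the inputs are only known to a few digits, is accumulation. The following is a minimal sketch, assuming single precision is emulated by rounding through `struct` after every addition; the function names and the starting value are illustrative choices, not anything from the thread.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate_f32(start, increment, steps):
    """Add `increment` to `start` repeatedly, rounding to single precision each step."""
    total = f32(start)
    inc = f32(increment)
    for _ in range(steps):
        total = f32(total + inc)
    return total


def accumulate_f64(start, increment, steps):
    """Same accumulation in ordinary double precision."""
    total = start
    for _ in range(steps):
        total += increment
    return total


# 2**24 = 16777216.0: above this, consecutive single-precision values are 2 apart.
BIG = float(2 ** 24)
single_total = accumulate_f32(BIG, 1.0, 1000)
double_total = accumulate_f64(BIG, 1.0, 1000)
```

In single precision every one of the 1000 additions is rounded away and the total never moves, while the double-precision total advances by 1000. Bugs of this kind are the ones that are tedious to chase, which is the argument for defaulting to "double" when memory is not the constraint.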
ARM Sucks by Anonymous Coward · 2017-05-29 08:58 · Score: 0

Purchase Cyrix processors only.
1. Re:ARM Sucks by __aaclcg7560 · 2017-05-29 09:57 · Score: 1
  
  Purchase Cyrix processors only.
  I loved my Cyrix 6x86 CPU back in the late 1990's. It ran Linux flawlessly for my file server.
  https://en.wikipedia.org/wiki/Cyrix_6x86
2. Re:ARM Sucks by jwhyche · 2017-05-29 10:06 · Score: 2
  
  Those where great processors for the money that you paid for them. I believe it used Pentium pro instructions instead of the pure Pentium. Ran Linux perfectly too. I was using one in my main workstation.
  
  --
  I read at +2. If your post doesn't reach that level I will not see or respond to it.
3. Re: ARM Sucks by cyber-vandal · 2017-05-29 11:41 · Score: 2
  
  "Were" you thick git
4. Re: ARM Sucks by jwhyche · 2017-05-29 11:58 · Score: 1
  
  Why don't you kiss my ass.
  
  --
  I read at +2. If your post doesn't reach that level I will not see or respond to it.
5. Re: ARM Sucks by Anonymous Coward · 2017-05-29 13:20 · Score: 0
  
  If it's as big as creimer's, I'm gonna need bigger lips.
6. Re: ARM Sucks by Anonymous Coward · 2017-05-29 13:59 · Score: 0
  
  Unfortunately, many of the motherboards people would buy with those CPUs were absolute garbage. They would overclock the PCI and AGP buses because they all used multipliers of the same clock source instead of having a separate clock for the CPU. This resulted in many lockups and exceptions caused by PCI devices and video cards being clocked way faster than designed, and they wouldn't run properly unless you underclocked the CPU, which was designed for a faster base clock.
blah blah revolutionary blah blah by Anonymous Coward · 2017-05-29 09:18 · Score: 0

Hey we get you like to push slashvertisements but you gotta give us some substance too, and hold the MBAese, thanks.
What, exactly, is it that makes this chip dance, hm? Give us some code examples already. ASSEMBLY code, yes, thanks.
No they aren't by locater16 · 2017-05-29 10:37 · Score: 4, Informative

No, ARM's new processors are not "designed" to power AI. They added an INT8 instruction, something useful for deploying neural nets (not training them, though). And that's it. Otherwise it's a standard evolution of both designs, taped out on "10nm" rather than 14nm. They just know AI is HOT HOT HOT and so hope to grab some of that PR magic quick.

For those really interested Anandtech has the actual computer engineering of the whole thing: http://www.anandtech.com/show/...
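For anyone wondering why 8-bit integer arithmetic helps inference, here is a minimal Python sketch of a quantized dot product. It is an illustration under stated assumptions, not ARM's instruction semantics: it assumes symmetric per-vector quantization to the range -127..127 and a wide integer accumulator, and all function names are hypothetical.

```python
def quantize(values, scale):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def choose_scale(values):
    """Pick the scale so the largest magnitude maps to 127 (1.0 if all zeros)."""
    return max(abs(v) for v in values) / 127 or 1.0


def int8_dot(a_q, b_q):
    """Multiply 8-bit values pairwise and accumulate in a wide integer."""
    return sum(x * y for x, y in zip(a_q, b_q))


def quantized_dot(a, b):
    """Approximate the float dot product of a and b using 8-bit operands."""
    scale_a, scale_b = choose_scale(a), choose_scale(b)
    accumulator = int8_dot(quantize(a, scale_a), quantize(b, scale_b))
    # A single floating-point rescale at the end recovers the real-valued result.
    return accumulator * scale_a * scale_b


weights = [0.5, -1.0, 0.25]      # hypothetical trained weights
activations = [1.0, 0.5, -2.0]   # hypothetical layer inputs
exact = sum(w * x for w, x in zip(weights, activations))
approx = quantized_dot(weights, activations)
```

The inner loop touches only small integers, so each operand needs a quarter of the memory traffic of a 32-bit float. The result is close to, but not identical to, the exact value, which is tolerable when running an already-trained network and much less so for the small gradient updates used in training.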
1. Re:No they aren't by Anonymous Coward · 2017-05-29 11:38 · Score: 0
  
  Oh come on, anandtech is fake news!
2. Re:No they aren't by Anonymous Coward · 2017-05-29 23:18 · Score: 0
  
  Actually there are also double-speed 16-bit floating-point instructions which should be useful for training.
  Apart from that, I agree with you that the AI marketing blurb is annoying, especially since all news sources just repeat it.
Instruction sets by DrYak · 2017-05-29 10:56 · Score: 1

Those where great processors for the money that you paid for them. I believe it used Pentium pro instructions instead of the pure Pentium.
According to wikipedia :
- the Cyrix MII - was Pentium Pro / Pentium II compatible, as you mention.
- before that : Cyrix 6x86 MX - was Pentium MMX compatible.
- even before : the previous Cyrix 6x86 & 6x86L - were more or less, but not entirely, Pentium compatible. (They officially identified themselves as "486") (I remember that bit)
Also:
- their FPU was less optimized, because most of the typical software workload was integer back then. (Also rings a bell)
For a Linux server they would very likely have been quite decent, because :
- FPU is indeed irrelevant.
- GCC and Linux kernel *do handle* 6x86 (it's not considered as a pure 486, they can use the pentium-compatible parts).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
GDDR by DrYak · 2017-05-29 11:20 · Score: 1

But then everything moved onto GPU's.
Modern GDDR retains the capability to clear buffers by itself.
But indeed, the bitmasking capability of older WRAM and SGRAM has been made redundant by the much more general-purpose capabilities offered by the GPU coupled with the much more complex modern interface. (i.e.: It's OpenGL running your Linux Compiz / Apple Quartz and whatever was the Windows equivalent).

--
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
latency thermal wall by epine · 2017-05-29 11:56 · Score: 4, Informative

IMHO what is mostly needed is faster memory. Modern ML often involves working with multi-Gigabyte domain models, stored in DRAM, where the access latency hasn't changed particularly in the last 10 years.
You should write advertising copy.

What is needed is faster relief. We've improved the package perforation. Now rips open 2x faster!
Faster has many dimensions, yet you fixate on just one. It turns out, however, that slapping you down was a royal PITA: all of the vendors involved in HBM{1,2,3} pony up sweet-shit-all concerning latency (wanted: an edible, colour-coded haymark).
Finally I found this comment by one Tuna-Fish from 2010:

Memory latency of many devices using GDDR5 (like GPUs) is a lot higher than on the typical device that uses DDR3, but this has nothing to do with the RAM, and everything to do with the controller.
Basically, GPUs can expect to see a lot of accesses to addresses reasonably close to each other (like reading color values out of a texture) in a relatively short time, and the devices are typically good at finding other work to do while waiting on memory accesses. Because of this, and the fact that larger transfers are more efficient, GPUs tend to delay initiating transfers a bit to wait for opportunities to combine them.
It's entirely possible to have a memory controller that does this to GPU-like transfers and doesn't do it to CPU-like transfers.
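The trade-off described in that quote can be shown with a toy model. This is a minimal Python sketch, assuming a controller that collects a fixed number of requests before issuing and merges those that land in the same burst-aligned block; the function name, the 64-byte burst size, and the two access streams are all assumptions made for illustration and do not describe any real memory controller.

```python
def count_transactions(addresses, burst_bytes=64, window=1):
    """Count memory transactions issued for a stream of byte addresses.

    The modelled controller collects `window` requests before issuing anything
    and merges requests in the same batch that fall in the same burst-aligned
    block. window=1 models issuing every request immediately.
    """
    total = 0
    for start in range(0, len(addresses), window):
        batch = addresses[start:start + window]
        total += len({addr // burst_bytes for addr in batch})
    return total


# Texture-like stream: 64 consecutive 4-byte reads.
texture_reads = list(range(0, 256, 4))
# Pointer-chasing-like stream: 64 reads, each in a different 4 KiB page.
scattered_reads = [i * 4096 for i in range(64)]

texture_immediate = count_transactions(texture_reads, window=1)
texture_batched = count_transactions(texture_reads, window=16)
scattered_batched = count_transactions(scattered_reads, window=16)
```

In this model, waiting lets the texture-like stream collapse from 64 transactions to 4, while the scattered stream still needs 64 and only pays the extra delay. That is the sense in which the added latency comes from the controller's policy and not from the RAM itself.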
I'm not the only frustrated person.
* AMD's upcoming Fiji GPU will feature new memory interface — Joel Hruska, 30 April 2015

Bandwidth, however, is just one characteristic of memory performance. Latency is equally important, but data on HBM latency compared with GDDR5 is much harder to come by. The implication, if I've read the various slide decks and data sheets correctly, is that HBM latency should be modestly better than GDDR5's — but possibly not by much. Certainly it won't improve by anything like the bandwidth jumps we're going to see.
The gist of the fragments I managed to find is that HBM latency is roughly on par with the concurrent GDDR generation, and this is—in most controllers—actually worse than the concurrent DDR generation, hence the industry-wide tight-lip syndrome.
Only that's not the whole story, because HBM has more channels than GDDR and allows more pages to be open concurrently. For a sufficiently parallel workload, HBM latency as a function of bandwidth can be excellent compared to the alternatives.
And certainly the thermal density is yards superior. Which is itself interesting, because you hardly ever see plots pitting latency against J/bit-ns. Awesome! A brand shiny new thermal wall. Physical distance, aka latency, actually functions as an implicit thermal spreader, and this goes away when the engineers get too pie-eyed over rail-gun-drone–accelerated rolling drive-thru nirvana (recommended: a Kevlar fish net on a titanium pole, and a Quick eye).
A Study of Application Performance with Non-Volatile Main Memory — Yiying Zhang (2015)
The fastest of the prospective non-volatile technologies (which are thermally desirable due to lack of refresh) is NRAM.
Fast NRAM to be released 2019-epsilon by Nantero/Fujitsu — August 2016
It actually has the endurance to be used as an on-chip SRAM replacement with eDRAM access times, but I don't know whether joint fabrication with CMOS is viable (in particular, at the high end). Note that ultimate durability is as yet unknown, because their 10^14-cycle test bench is taking a while to return 0/1.
[*] I wou
1. Re:latency thermal wall by lobiusmoop · 2017-05-30 05:21 · Score: 1
  
  Um... yea.. Thanks for that.
  Are you high?
  
  --
  "I bless every day that I continue to live, for every day is pure profit."
Machine-Learning Machines? by Gravis+Zero · 2017-05-29 12:12 · Score: 0

If the Machines are Learning Machines, who is Learning the Machine-Learning Machine Machines? ;)

--
Anons need not reply. Questions end with a question mark.
Incorrect summary by Anonymous Coward · 2017-05-29 12:28 · Score: 0

"Interestingly, ARM won’t just be powering machine learning with its new chips, it’ll benefit from ML too. The new designs benefit from an improved branch predictor that uses neural network algorithms to improve data prefetching and overall performance."
Fucking marketing. This is factually incorrect. They are really drinking the koolaid these days labeling everything as AI or ML.
Appropriate reference by Anonymous Coward · 2017-05-29 12:47 · Score: 0

Isaac Asimov's Nine Tomorrows: The Feeling of Power.

Nine times seven, thought Shuman with deep satisfaction, is sixty-three, and I don't need a computer to tell me so. The computer is in my own head. And it was amazing the feeling of power that gave him.
Bunch of marketing cock by tietokone-olmi · 2017-05-29 13:24 · Score: 2

Nothing in there points in any way to machine learning. There's just a fancy branch predictor in there, the design of which may have been informed by something related to neural networks, but that's true of all CPUs of the current generation. (kind of like integrated memory controllers were like 10 years ago.)
But that's just as well, given how AI is a marketing scam anyway.
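For context on what a neural-network-inspired branch predictor can look like, here is a minimal Python sketch of a perceptron-style predictor of the kind described in the academic literature. It is not ARM's design, which the article does not detail. The class name, the 8-entry history, and the training threshold of 16 are assumptions chosen for the example, and a real predictor would keep a table of such perceptrons indexed by branch address.

```python
class PerceptronPredictor:
    """A single perceptron that predicts a branch from recent outcomes."""

    def __init__(self, history_length=8, threshold=16):
        # History holds +1 for taken and -1 for not taken, oldest first.
        self.history = [1] * history_length
        # weights[0] is the bias; weights[i + 1] pairs with history[i].
        self.weights = [0] * (history_length + 1)
        self.threshold = threshold

    def _output(self):
        return self.weights[0] + sum(
            w * h for w, h in zip(self.weights[1:], self.history)
        )

    def predict(self):
        """Predict taken when the weighted sum is non-negative."""
        return self._output() >= 0

    def update(self, taken):
        """Train on the real outcome, then shift it into the history."""
        output = self._output()
        target = 1 if taken else -1
        mispredicted = (output >= 0) != taken
        # Train on a misprediction, or while confidence is still low.
        if mispredicted or abs(output) <= self.threshold:
            self.weights[0] += target
            for i, h in enumerate(self.history):
                self.weights[i + 1] += target * h
        self.history = self.history[1:] + [target]


def accuracy(predictor, outcomes):
    """Fraction of outcomes predicted correctly, training as it goes."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)
```

Fed a strictly alternating taken/not-taken pattern, the weights settle after a few dozen branches and the predictor stops mispredicting. The whole thing is a dot product and a few increments, closer to a small adaptive filter than to anything most people would call machine learning.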
no by nettechindia0 · 2017-05-29 21:07 · Score: 1

They added an INT8 instruction, something useful for AI deploying neural nets (not training them though). There's just a fancy branch predictor in there, the design of which may have been informed by something related to neural networks, but that's true of all CPUs of the current generation. (kind of like integrated memory controllers were like 10 years ago.)