MIT Develops Algorithm To Accelerate Neural Networks By 200x (extremetech.com)

← Back to Stories (view on slashdot.org)

MIT Develops Algorithm To Accelerate Neural Networks By 200x (extremetech.com)

Posted by BeauHD on Friday March 22, 2019 @08:50AM from the efficient-methods dept.

An anonymous reader quotes a report from ExtremeTech: MIT researchers have reportedly developed an algorithm that can accelerate [neural networks] by up to 200x. The NAS (Neural Architecture Search, in this context) algorithm they developed "can directly learn specialized convolutional neural networks (CNNs) for target hardware platforms -- when run on a massive image dataset -- in only 200 GPU hours," MIT News reports. This is a massive improvement over the 48,000 hours Google reported taking to develop a state-of-the-art NAS algorithm for image classification. The goal of the researchers is to democratize AI by allowing researchers to experiment with various aspects of CNN design without needing enormous GPU arrays to do the front-end work. If finding state of the art approaches requires 48,000 GPU arrays, precious few people, even at large institutions, will ever have the opportunity to try.

Algorithms produced by the new NAS were, on average, 1.8x faster than the CNNs tested on a mobile device with similar accuracy. The new algorithm leveraged techniques like path level binarization, which stores just one path at a time to reduce memory consumption by an order of magnitude. MIT doesn't actually link out to specific research reports, but from a bit of Google sleuthing, the referenced articles appear to be here and here -- two different research reports from an overlapping group of researchers. The teams focused on pruning entire potential paths for CNNs to use, evaluating each in turn. Lower probability paths are successively pruned away, leaving the final, best-case path. The new model incorporated other improvements as well. Architectures were checked against hardware platforms for latency when evaluated. In some cases, their model predicted superior performance for platforms that had been dismissed as inefficient. For example, 7x7 filters for image classification are typically not used, because they're quite computationally expensive -- but the research team found that these actually worked well for GPUs.

43 comments

Min score:

Reason:

Sort:

First! by madsh · 2019-03-22 08:53 · Score: 0

Because someone is trying to read the article
1. Re:First! by ceoyoyo · 2019-03-22 08:58 · Score: 1
  
  Trying to read the summary was bad enough.
No neural networks in Federal Prison by Anonymous Coward · 2019-03-22 08:57 · Score: 0

Sorry Trump traitors!
1. Re:No neural networks in Federal Prison by Anonymous Coward · 2019-03-22 09:09 · Score: 0
  
  Well that was pretty low effort. At least try to tie the story in with your trump derangement in some meaningful way. Occasionally these are amusing, but this one is pathetic.
  What's wrong? You're enthusiasm waning? Starting to accept the situation?
2. Re: No neural networks in Federal Prison by Anonymous Coward · 2019-03-22 09:46 · Score: 0
  
  Does it say that? Will you commit suicide if it turns out to not say that?
3. Re:No neural networks in Federal Prison by Anonymous Coward · 2019-03-22 09:49 · Score: 0
  
  You're getting repetitive ivanbot
Accelerate [development of] neural networks by SlaveToTheGrind · 2019-03-22 08:59 · Score: 2

Even the summary says that the 200x improvement is the learning cycle. The actual execution speed is less than 2x faster.
1. Re: Accelerate [development of] neural networks by Anonymous Coward · 2019-03-22 09:02 · Score: 1
  
  The learning cycle is the important part...
2. Re: Accelerate [development of] neural networks by Anonymous Coward · 2019-03-22 09:23 · Score: 0
  
  Suck my dick, faggot
3. Re:Accelerate [development of] neural networks by Anonymous Coward · 2019-03-22 10:22 · Score: 0
  
  You don't seem to understand how ML works.
  Learn(), the hard part that takes hours.
  Compute() the easy part that takes microseconds.
  200x speed up, 2x speed up -- both _really_ fucking good.
4. Re:Accelerate [development of] neural networks by phantomfive · 2019-03-22 18:09 · Score: 1
  
  The execution speed of neural networks is not very CPU intensive, it can happen almost immediately. The training is the part that takes a long time.
  
  --
  "First they came for the slanderers and i said nothing."
5. Re: Accelerate [development of] neural networks by Anonymous Coward · 2019-03-23 00:55 · Score: 0
  
  An. Not the.
6. Re:Accelerate [development of] neural networks by Anonymous Coward · 2019-03-23 00:59 · Score: 0
  
  Whether the execution is CPU-intensive or not depends on how the neural network is constructed. See SpiNNaker for something that takes up a lot of CPU resource. Generally, for most production NNs, training takes up more resource on the devices or devices on which training occurs than on a single device doing classification, but overall, over all devices, it's a lot of resource, and so optimising the classification task is important in some senses. Optimising it so the response is within an acceptable time for a given quality is certainly important.
7. Re:Accelerate [development of] neural networks by hcs_$reboot · 2019-03-23 11:34 · Score: 1
  
  It seems the 'x' was a '%', actually.
  
  --
  Slashdot, fix the reply notifications... You won't get away with it...
Utter bullshit by goombah99 · 2019-03-22 09:07 · Score: 3, Insightful

Apparently no one has heard of Wolpert's No-Free-Lunch-Theorem for search. It says then when averaged over all use cases no search algorithm out performs another (provided resources are not an issue). So one can have more resource efficient searches and one can have search algorithms that do better on some problems than others. It's great when you find a class of problems your search method is optimal for. But in general, no. Can't be done.
TO get a 200x speed up on the test set they must have a 200x slow down on average elsewhere.
That said this could be really useful for a large class of practical problems. So it's the hyperbole that is the bullshit not the research.

--
Some drink at the fountain of knowledge. Others just gargle.
1. Re:Utter bullshit by Anonymous Coward · 2019-03-22 09:47 · Score: 0
  
  Apparently no one has heard of Wolpert's No-Free-Lunch-Theorem for search.
  
  But, this simply isn't a search.
2. Re:Utter bullshit by Anonymous Coward · 2019-03-22 10:02 · Score: 0
  
  It's searching for a set of parameters that achieves the right output response for given inputs. It is a search.
  However, the no free lunch theorem isn't that important in general, and an extreme interpretation is false. The obvious counterexample is two search functions that give identical results (and have no race conditions), but one calls Sleep(1000) in a loop that takes 10 years. That's less efficient than the one without the sleep. So clearly you can achieve algorithmic optimizations if there's simple inefficiencies, the theorem doesn't apply to that. It's more like the best-possible implementation of one search algorithm isn't going to be better than another search algorithm. It's entirely possible there are huge inefficiencies.
  The other aspect is you are averaging together useful and non-useful search algorithms. Which was alluded to but I think the point isn't stressed enough. We're average performance across cases that include really stupid inputs.
3. Re:Utter bullshit by Anonymous Coward · 2019-03-22 10:58 · Score: 0
  
  Apparently you have not heard that stating rules of thumb as hard facts is moronic...
4. Re:Utter bullshit by Anonymous Coward · 2019-03-22 14:00 · Score: 0
  
  TO get a 200x speed up on the test set they must have a 200x slow down on average elsewhere.
  Speeding up something on a computer by 200x happens all the time. Generally I break things up something like this...
  1. First, get a better algorithm (ass in this article). Yes some algorithms are much faster than others. A hash plus collision table (i.e. dictionary) is way faster than say a linear search to find a string in a really big list. A plane old boring array is faster still, when that fits the problem. You can often go from a hash plus collision approach to an array, if your source key(s) are well ordered such that with a simply translation you can get a unique number. (i.e. if you are using 4 numbers ranging 0..15 as keys to data, well that is easy to make a single number out of.)
  2. Now optimize your code. Reduce redundant copies. Reduce re computation of intermediate values when it helps. Look for ways the algorithm can be tailored to the actual hardware.
  3. Sometimes optimizing can mean breaking a problem up into parts and then optimizing each one. If your sub problems mostly can execute in CPU cache and you can spin a few thousand iterations on one before switching to the next, that can be significantly faster.
  4. Don't forget CPUs have a crap load of features. Are you using all the ones that might be useful? rdtsc() is awesome for precise timing if you want to do a bit of work, and basically free to boot.
  5. Look for areas where locks and similar are perhaps chewing up too much time. Can you architect around them?
  6. Look at your thread usage. Are you using too few or too many? Both can impact performance.
  7. Optimize your data structures. You can usually count on a compiler to put things on x8 boundaries and such, but you are the one that actually knows how often some bit is used. Packing structures can reduce memory usage, thus improving performance indirectly.
  8. Even things like the mod operator have cost, and sometimes there is a trivial workaround, such as simply waiting till a count equals a total then setting the count back to zero.
  9. Re-examine your problem. Can you simplify anything? Perhaps a model that is 90% as good is good enough? For something like a pseudo random number generator, examine the actual purpose of the random numbers. If they are just for test displays, then filling a circular buffer is likely fine. If you need something a bit better, but not cryptographically significant, well there are lots of sources of entropy.
  10. Are any libraries your using contributing to the problem? Can you do anything about it?
  11. Don't forget specialized hardware. Sometimes using a video card or a FPGA may be worth it for some of the algorithm. This one is last for a reason though, so don't go here lightly. I count assembly language here as well. Generally assembly is for something you can't easily do in C. Its hard to beat a compiler these days.
5. Re:Utter bullshit by goombah99 · 2019-03-22 16:23 · Score: 1
  
  While your points are well intentioned, you haven't read the No-free-lunch-theorem. It addresses all your points and it's still true.
  
  --
  Some drink at the fountain of knowledge. Others just gargle.
6. Re:Utter bullshit by religionofpeas · 2019-03-22 17:51 · Score: 1
  
  TO get a 200x speed up on the test set they must have a 200x slow down on average elsewhere.
  Possibly, yes. But typical images only form an absolutely tiny subset of all possible images. If we can speed up recognition of real-life images, at the cost of losing speed in fields of noisy random pixels, it's still a useful improvement.
7. Re:Utter bullshit by Anonymous Coward · 2019-03-22 19:32 · Score: 1
  
  >TO get a 200x speed up on the test set they must have a 200x slow down on average elsewhere.
  You're making the unfounded assumption that both optimization algorithms are performing optimally.
  I can trivially disprove you by taking an existing algorithm A, and making a deliberately shittier version (A') that performs worse in all aspects.
8. Re:Utter bullshit by MobyDisk · 2019-03-23 01:58 · Score: 1
  
  No, it is you who misunderstand the theorem. It most certainly does not say that "no search algorithm out performs another." And it is mathematically false that "when averaged over all use cases no search algorithm out performs another." That is easily mathematically proven to be false. For any search algorithm, I can create an algorithm that is 200x slower and provides no benefit. Using a real-world example, selection sort's best case is n^2 but merge sort's worst-case is n*log(n). Therefore "when averaged over all use cases" merge-sort will outperform selection sort.
9. Re:Utter bullshit by goombah99 · 2019-03-23 02:56 · Score: 1
  
  Good god man. you are completely wrong. The NFL excludes deliberate stupidity and inefficiency of course. But in the extreme, a random guessing or uphill climbing finds the minimum just as fast as steepest descent or genetic optimization. yes that is proven. The catch is that for problem on which steepest descent works well then it out performs the crazy idea of finding a minimum by going up hill. And that's likely most problems you will encounter in real life. You might even wonder how going up hill could find the minimum faster than going downhill. But in many problems you will blunder into the minimum only after going over a ridge. When you sum this over all problems-- which is the clever part of the proof-- it turns out that all non-revisiting algorithms (i.e. ones that dont get stuck) will find the global minimum in the same average time. It's rather shocking. It thus says, the ONLY thing you can ever say about a search algorithm is it's better for a certain class of problems, then accurately describe the circle of efficiency defining that class. One can of course be stupid and make a revisiting algorithm but that's not germane. The other escape clause of the NFL grimness is that if you are not trying to find the minumum but rather something-close to the minimum the it may be possible to get there faster under certain limitations but quantifying this has not been achieved yet.
  
  --
  Some drink at the fountain of knowledge. Others just gargle.
10. Re:Utter bullshit by MobyDisk · 2019-03-25 04:51 · Score: 1
  
  Okay, between your reply and Wikipedia I am swayed. That's pretty amazing. Still, be careful generalizing this to mean "No improvements can ever be made to AI learning algorithms."
Loss in quality? by Anonymous Coward · 2019-03-22 09:21 · Score: 1

It's not hard to accelerate a process, if you are allowed to leave away details without caring for their importance. The hard part is to not lose quality in the process.
TFA doesn't seem to say anything about that.
1. Re:Loss in quality? by ceoyoyo · 2019-03-22 11:38 · Score: 1
  
  Yes... they tested with one task on a single dataset. For a technique that (as far as I can gather from the terrible article) evaluates and discards options, that has a high risk of working well only in particular circumstances.
2. Re:Loss in quality? by phantomfive · 2019-03-22 18:14 · Score: 1
  
  Worth mentioning that often the difficulty with neural networks isn't the training time, it's collecting the dataset in the first place.
  
  --
  "First they came for the slanderers and i said nothing."
3. Re:Loss in quality? by ceoyoyo · 2019-03-23 03:05 · Score: 1
  
  Very true. I'm not sure where the tens of thousands of GPU hours figure comes from either. Google may have trained for that long because they've got the hardware. You can train a decent ImageNet clone on a commodity GPU on your own computer in a day or so.
took out cookies & hostage user consent delays by Anonymous Coward · 2019-03-22 09:23 · Score: 0

that would do it?
So it can spot the gorillas much faster by Anonymous Coward · 2019-03-22 09:50 · Score: 0

And sheep.
CNNs by Anonymous Coward · 2019-03-22 09:54 · Score: 0

CNN is fake news
Confusing title by Chris+Mattern · 2019-03-22 12:26 · Score: 1

I though it was saying the algorithm would work before 2010.
1. Re:Confusing title by Misagon · 2019-03-22 14:04 · Score: 1
  
  Besides, the multiplication symbol is not the letter 'x', but Ã--
  ... oh wait, this is Slashdot. We're still stuck in ASCII.
  
  --
  "We mustn't be caught by surprise by our own advancing technology" -- Aldous Huxley
We're all gonna die by nospam007 · 2019-03-22 12:33 · Score: 1

That's the gist of it.
Possible recipe for disaster by xonen · 2019-03-22 14:29 · Score: 1

The teams focused on pruning entire potential paths for CNNs to use, evaluating each in turn. Lower probability paths are successively pruned away, leaving the final, best-case path.

Now, i don't know about AI that much, so in this scenario it may be totally different.
With chess engines, early pruning is a recipe for disaster. You might fool beginner players, you won't fool GM's. Pruning in general is already questionable, as you are deleting possibilities based on assumptions that are in turn based on a limited subset of your data. Early pruning leads to big performance gains but may also easily overlook possibilities because certain paths are assumed wrong. Just because sometimes something that appears to be wrong is actually good.
As said, it may or may not apply to AI, but it looks it does as they use similar words to describe a similar process. If it sounds too good to be true. Performance gains by a factor of 200 smells like red flags.

--
A glitch a day keeps the bugs away.
1. Re:Possible recipe for disaster by religionofpeas · 2019-03-22 18:01 · Score: 1
  
  You might fool beginner players, you won't fool GM's. Pruning in general is already questionable, as you are deleting possibilities based on assumptions that are in turn based on a limited subset of your data. Early pruning leads to big performance gains but may also easily overlook possibilities because certain paths are assumed wrong
  If you get big performance gains that means it plays better overall, so it will be more likely to fool GMs as well.
  Keep in mind that every human player also uses early pruning. A GM looking at board typically prunes 25 of the 30 possible moves right away.
2. Re: Possible recipe for disaster by Anonymous Coward · 2019-03-23 03:20 · Score: 0
  
  This new algorithm really needs to be tried out on Leela Zero with decent hardware, to see if the learning process is 200 times faster. I don't have decent hardware, so I urge the author of Leela Zero to try it out.
Incorrect PDF links. by Anonymous Coward · 2019-03-23 02:59 · Score: 0

The two links are to the summary page and downloading the same PDF, it doesn't actually link to the 2nd pdf or review page :(
Dumbass by Anonymous Coward · 2019-03-23 09:51 · Score: 0

Sorting is not search. Search is minimization of an objective function. Take a CS course? If not don't pretend you understand something you don't know.
1. Re:Dumbass by MobyDisk · 2019-03-25 04:49 · Score: 1
  
  Suppose the search algorithm is "sort then do a binary search." Goombah99's arguments are more worth replying to than.
Huh by Anonymous Coward · 2019-03-24 04:37 · Score: 0

When I was a senior in high school in Texas, students started asking each other where they were going to go to college. There was a fellow named Allen in my electronics class that I asked where he was going. He said MIT. I asked him what and where it was and he was surprised to see that I had never heard of it. I had an even greater surprise on my face when he told me that it was in Boston. No respectable Texan would ever consider going to school anywhere in Yankeedom, as far as I was concerned. Of course, having heard about MIT for many years since, I now consider it to possibly be the ultimate of all the world's universities.