"The term "deep" comes from the idea that the algorithm is trying to learn something deeper than previous algorithms. In fact, the usual set of machine learning algorithms are termed shallow learning now. The difference is that deep learning tries to model P(X) whereas shallow learning (SVM, NN, naive Bayes, etc..) try to learn P(X|Y) where X is your input space and Y is the label space. "
Well, more correctly, deep learning tries to model P(X) through P(X | H) for some set of "hidden" or latent features H, which is in some ways far simpler than the raw data and space of X, and then learns P(Y | H, X) after doing some training for P(X | H).
"So, you admit 'deep' is a marketing buzzword...thank you. It's *obviously* not a technical term."
Technical people in the field, when they hear "deep belief networks", have an excellent idea of the class of computational methods being discussed. So yes, "deep" is part of a reasonably precise technical phrase, and the word "deep" in itself does have a connotation in the technical literature: certainly more than one hidden layer, and probably more than two hidden layers.
People in the field also know that typical earlier training approaches relying only on supervised learning did not show any consistent advantage in going beyond two hidden layers, whereas the newer class of methods does show some clear advantages on some problems.
If you want to attack something, it is the "belief" part of the phrase, and not "deep", as this is less clearly defined.
Summary: it's not vapid marketing-speak just invented, any more than "nuclear magnetic resonance".
"I'm sure all of that helped, but the key ingredient is training mechanisms. Traditionally networks with multiple layers did not train very well, because the standard training mechanism "backpropagates" an error estimate, and it gets very diffuse as at goes backwards. So most of the training happened in the last layer or two."
This problem can be easily remedied by scaling up the gradient terms for earlier and earlier layers. Doing so doesn't solve the deep network problem.
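To make the "diffuse" gradient concrete, here is a minimal sketch of my own (nothing from the papers): a stack of sigmoid layers with random weights, one backward pass, and the gradient norm measured at each layer. The depth, width, initialisation and squared-error loss are all assumptions picked for illustration. The last line computes the per-layer scale factors that would equalise the magnitudes, which is the cheap remedy described above.

```python
import numpy as np

def layer_gradient_norms(depth=8, width=32, seed=0):
    """Norm of the loss gradient w.r.t. each layer's weights,
    ordered from the first (input-side) layer to the last."""
    rng = np.random.default_rng(seed)
    # Hypothetical setup: square layers, weights ~ N(0, 1/width).
    weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width))
               for _ in range(depth)]
    x = rng.normal(size=width)
    target = rng.normal(size=width)

    # Forward pass, keeping every layer's activation.
    activations = [x]
    for w in weights:
        activations.append(1.0 / (1.0 + np.exp(-(w @ activations[-1]))))

    # Backward pass for squared error; delta is dLoss/d(pre-activation).
    out = activations[-1]
    delta = (out - target) * out * (1.0 - out)
    norms = []
    for i in reversed(range(depth)):
        # Gradient for layer i is the outer product of delta and its input.
        norms.append(np.linalg.norm(np.outer(delta, activations[i])))
        if i > 0:
            a = activations[i]
            delta = (weights[i].T @ delta) * a * (1.0 - a)
    return norms[::-1]

norms = layer_gradient_norms()
# Scale factors that would bring every layer up to the last layer's magnitude.
scale = [norms[-1] / n for n in norms]
```

Printing norms shows the magnitudes dropping off toward the input side. Multiplying each layer's gradient by scale fixes the magnitudes only; it says nothing about whether the directions are any good, which is the point being made here.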
As others have mentioned, the breakthroughs came from combining highly parallelizable unsupervised representation methods with traditional supervised learning.
It's a naive question. People have been combining evolutionary methods for architecture selection with more traditional gradient/function value optimization since, well, at least the late 1970s. It will still evolve only in the space that the human set it up to evolve.
The breakthroughs did not occur because of an automated computer. They occurred through a large-scale evolutionary algorithm known as Smart-Fraction-Of-Human-Civilization-Thinking-And-Working-and-Writing-For-Decades. We needed new ideas and persistence to test them thoroughly.
What made it possible in our society is long-term government funding of research.
There is a new thing. It has long been known that "deep networks" could theoretically represent more sophisticated features and concepts, and there were obvious biological examples of this working successfully.
The artificial neural network methods of 1990 (as you say, hill-climbing on a multi-dimensional landscape) turned out not to work particularly well on deep networks or, more correctly, provided little additional benefit over shallow networks. After this time, research in statistical learning moved from just these parametric models to more clearly statistical methods with some clever tricks, e.g. support vector machines and boosted ensembles of simple learners. These appeared to have some training advantages over traditional neural networks, because they could be transformed into more deterministic optimization subproblems. SVMs in particular could be transformed into quadratic optimization which had deterministic solutions (i.e. convex optimization instead of the very rough and fractal error surface of MLPs/networks).
However, it turned out that some of these methods did not scale well to really large problem sizes: training and scoring SVMs on millions to billions of data points, instead of the 1,000-50,000 of typical academic "benchmark" datasets, doesn't work well in convex optimization. The time necessary to train in the convenient dual space, which has this property, can be quadratic or worse in the number of points. So what is state of the art in large scale SVM training? Uh, stochastic gradient descent, just like those yucky neural networks.
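For anyone who hasn't seen what SGD on an SVM looks like, here is a rough sketch in the style of the Pegasos solver: plain stochastic subgradient steps on the regularised hinge loss of a linear SVM, in the primal, one point at a time. The function name, the regularisation constant, the epoch count and the toy data are all my own assumptions; this is an illustration, not any particular production implementation.

```python
import numpy as np

def sgd_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style SGD on the L2-regularised hinge loss.
    X: (n, d) features, y: labels in {-1, +1}. Returns the weight vector."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * (X[i] @ w)     # margin under the current weights
            w *= (1.0 - eta * lam)         # shrink step from the L2 penalty
            if margin < 1.0:               # point violates the margin
                w += eta * y[i] * X[i]     # hinge-loss subgradient step
    return w

# Toy data (assumed): two Gaussian blobs placed symmetrically about the
# origin, so no bias term is needed.
rng = np.random.default_rng(1)
n = 200
y = np.where(rng.random(n) < 0.5, -1.0, 1.0)
X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]

w = sgd_linear_svm(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

Each update touches one data point, so the cost per pass is linear in the number of points and no n-by-n kernel matrix is ever built. That is the scaling argument in a nutshell.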
Now, back to the new generation of neural networks. The typical trick now for the newer generation of neural networks (and yes, Geoff Hinton and his lab are the leaders in this revival) is that most of the training does NOT use supervised methods (matching to the target). Much of the initial phase of training involves unsupervised methods: statistical methods which attempt to find "interesting" structure in the input data, for some arguable form of "interesting", thereby doing "dimensionality reduction".
Then at the end, there is traditional supervised optimization to 'clean things up'. Of course, now the trick is matching the right biases in the unsupervised layers to the task at hand, and that is likely still trial and error. The papers show the successes. But the point is that their successes are occasionally so spectacular that the general approach appears to be pretty valuable.
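The two-phase recipe can be sketched in a few lines. To be loud about the assumptions: this uses plain PCA as a stand-in for the unsupervised stage, which is NOT what deep belief networks use (they stack RBMs or similar), and a logistic regression as the supervised 'clean up'. The function names and the synthetic data are hypothetical, chosen only to show that stage 1 never sees the labels.

```python
import numpy as np

def fit_unsupervised_features(X, k):
    """Stage 1 (no labels): learn k linear features H with PCA."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    scale = s[:k] / np.sqrt(len(X) - 1)    # std dev along each component
    return mean, vt[:k], scale

def encode(X, mean, components, scale):
    """Map raw inputs X to the low-dimensional, unit-variance features H."""
    return (X - mean) @ components.T / scale

def fit_logistic(H, y, lr=0.5, steps=500):
    """Stage 2 (labels): logistic regression on H by gradient descent."""
    w, b = np.zeros(H.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))
        grad = p - y                       # d(log-loss)/d(logit)
        w -= lr * (H.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Assumed toy data: a 2-D latent cause, shifted by class, observed through
# a random 20-D linear mixing plus a little noise.
rng = np.random.default_rng(0)
n = 400
y = (rng.random(n) < 0.5).astype(float)
latent = rng.normal(size=(n, 2)) + 3.0 * y[:, None]
X = latent @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(n, 20))

mean, comps, scale = fit_unsupervised_features(X, k=2)   # labels unused
H = encode(X, mean, comps, scale)
w, b = fit_logistic(H, y)                                # labels used here
accuracy = np.mean(((H @ w + b) > 0) == (y == 1))
```

The toy data is rigged so that the structure PCA finds "interesting" happens to be the structure that predicts the label. When that match fails, the unsupervised stage helps much less, which is the trial-and-error part mentioned above.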
It's quite possible this is exactly what evolution has done---evolve different unsupervised priors/algorithms by evolving wetware---which happen to turn out to be useful for the types of statistical patterns occurring in the various forms of sensory inputs.
As far as I can tell, the original connectionist manifesto is still correct. This is the only plausible approach I see towards artificial intelligence, as opposed to being just machine learning (which is in some ways a superset, but also a subset in ambition).
"Since we are all pretty well aware that we are between ice ages it doesn't say much at all and it gives absolutely no indication if the current warming trend is usual or not."
The atmosphere is guaranteed to be unusual, because we have dug up and combusted carbon which had been sequestered geologically since long before many, many interglacial/ice age cycles. When that carbon was being laid down (massive plant growth), bacteria and fungi had not yet evolved the ability to break down lignin, so the wood piled up and up and up and up and turned into coal.
Today that isn't the case, so it's quite possible that human actions today have significantly changed the properties of the atmosphere for the remainder of the Earth's lifetime.
Except of course that there is no local signal from intermittent volcanism, and that this issue was examined by scientists decades ago and the result is confirmed by many other measuring stations. And that the extra CO2 from fossil fuels can be distinguished by a slightly different isotopic ratio.
There is a persistent behavior in climate "skeptics" who think they are clever. They take 15 seconds and imagine one simple consideration in response to a popularized sound bite and assume that somehow they gotcha'ed thousands of people who spend their lifetimes working on the problem.
Moral: Never hire an asshat without a working reality distortion field and a profound ability to choose and motivate very skilled but diverse people into working extremely hard to make the right thing at the right time.
Jobs' hat and ass are a bug, not a feature, but he did have features.
Come on, they could have announced publicly, "We have detected accounting problems and we are lowering our offer to 0.5*X." The target will howl, but then somebody will be forced to actually audit them.
"I've always thought Oracle would be a miserable place to work, but now I see some people are definitely having fun, just not programmers."
I've always thought Oracle would be a miserable place to work, but now I see Larry Ellison definitely having fun, just not people-who-are-not-Larry.
FTFY
Are you sure it's not vice versa?
Hamas knows with 100% confidence that after they launch rockets they will face strong retaliation. They launch rockets.
Russia's existing systems are already much more capable than the European interceptors.
"Speed is not actually all that relevant, it's just a number to the computer. It reduces reaction time, but since a computer would fire the missile anyways, that isn't much of an issue."
Yes, of course speed is the issue and it is not just a number to a computer. An ICBM warhead comes in at extraordinary speeds (~8-10 kilometers per second). And it is very, very small compared to space. It has nearly all the energy that a large rocket provided during its boost phase. Your own interceptors have physical limitations on range as well. You need to be able to detect and distinguish which is real, which is very hard when the enemy's warheads are tiny things far away and they have launched many very cheap physical and electronic countermeasures like radar reflectors, which in space have exactly the same ballistic trajectories as the warheads. You can only tell the difference when they hit the atmosphere going down (warheads are heavy), and by then it's too late. 2-3 seconds from stratosphere to boom.
"You mean that SDI might work after all? That will get us out of the nuclear age. A stop rate of 90% eliminates a first strike advantage."
No. There is an enormous difference between the rockets of Hamas and an ICBM. An ICBM is fast. Really fast.
I don't think people fully understand the difficulty of stopping an ICBM; perhaps this fact will help visualize it. An ICBM warhead, upon re-entry, descends from the stratosphere (say U-2 altitude, 70,000 feet) to ground level in about 2 to 3 seconds.
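The arithmetic behind that figure is simple enough to check. This is a back-of-envelope sketch with loud assumptions: the warhead falls straight down at a constant speed, using the ~8-10 km/s range quoted in this thread, and drag and the slant of the real trajectory are ignored (a shallower path is longer, so this is the short end).

```python
FEET_TO_METERS = 0.3048

def descent_time_seconds(altitude_ft, speed_km_s):
    """Time to cover the altitude at constant speed, straight down."""
    return altitude_ft * FEET_TO_METERS / (speed_km_s * 1000.0)

# 70,000 ft is about 21.3 km; try both ends of the assumed speed range.
times = {v: descent_time_seconds(70_000, v) for v in (8.0, 10.0)}
for speed, seconds in times.items():
    print(f"{speed:.0f} km/s -> {seconds:.1f} s")
```

Both ends of that range land between 2 and 3 seconds.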
http://en.wikipedia.org/wiki/File:Peacekeeper-missile-testing.jpg
Indeed. It's an instructive example.
Sri Lanka dealt with the Tamil Tigers with overwhelming military force, annexation, and only modest attention to civilian casualties. There is no "peace process", no Tamilstan, no "two-state solution", and nobody advocating boycotts of Sri Lanka dipped in fairly transparent anti-Sinhalese bigotry. And everybody else seems perfectly happy about it. (I am too, as the LTTE turned into vicious terrorist scum who deserved what they got.)
What's up with that?
http://atheism.about.com/library/FAQs/religion/blgrk_aphrodite02.htm
And this woman would be in the lowest 20-25% of BMI in the USA today, even of her age group.
Did you ever ask a liberal why they don't like oil?
"It keeps ahead of the curve, which is the counter-intuitive bit that so many physical scientists self-curbstomp over, not understanding the economics has been shown to work over and over again."
Uh, no. Why don't we have cheap $20 oil? Why isn't East Texas squirting easy-to-find oil out of the ground once again?
Let's actually have resource prices going down in this area before you start crowing about the triumph of ignoring physical science.
The point is that we have to do ugly and expensive and difficult things to get something which used to be easy. Why? What's different about this compared with manufacturing engines?
"Why not put those resources into carbon-neutral energy generation?"
Because the people with the money don't think the other choices make enough money. And they're right (about not making enough money).
"This means now the US can set about selling off our natural resources to the highest bidder like every other Third World shithole."
Other than WW2, when have the US's natural resource extractors ever NOT sold their product to the highest bidder?
Typically Third World shitholes "sell" off the resources to the brother-in-law of the deciderer-for-life, who then actually sells them to the highest bidder.
"Wait, what? There is an idea that natural gas will curb CO2 emissions?"
For a given energy output, burning CH4 emits less carbon than burning long chain hydrocarbons (petroleum) or solid blocks of carbon (coal).
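You can put rough numbers on this from textbook heats of combustion. Loud assumptions: n-octane stands in for petroleum and pure graphite for coal (real coal and real crude are messier), the enthalpy values are rounded higher-heating values, and this ignores methane leakage, plant efficiency and everything else upstream.

```python
CO2_MOLAR_MASS_G = 44.01

# fuel: (moles of CO2 per mole of fuel, approx. heat of combustion in kJ/mol)
FUELS = {
    "methane (CH4)": (1, 890.0),
    "n-octane (C8H18), petroleum stand-in": (8, 5470.0),
    "graphite (C), coal stand-in": (1, 393.5),
}

def grams_co2_per_megajoule(moles_co2, heat_kj_per_mol):
    """Mass of CO2 released per MJ of heat from complete combustion."""
    return moles_co2 * CO2_MOLAR_MASS_G / (heat_kj_per_mol / 1000.0)

emissions = {name: grams_co2_per_megajoule(*spec)
             for name, spec in FUELS.items()}
for name, grams in emissions.items():
    print(f"{name}: {grams:.0f} g CO2 per MJ")
```

On these idealised figures methane comes out at roughly 49 g CO2 per MJ, octane around 64, and pure carbon around 112. A lot of methane's energy comes from burning its hydrogen, which is why it emits less carbon per unit of heat.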
Actually, taxing exports requires a Constitutional amendment.
Pissing people off is not necessarily the worst thing; not getting stuff done right is the worst thing.
The ousting that really matters isn't happening.
"Finally, flush down the memory hole that the same end game would have been achieved had Motors Liquidation Co., at the time still named General Motors, never received its much vaunted and extolled "bail outs""
No, it would not have. You need debtor-in-possession financing, and a bunch of it, for bankruptcy, and no private banks were remotely interested in lending for such an enterprise in 2009. The bankruptcy would have been legally extremely long and arduous, like that of Lehman Bros, which is still ongoing. The suppliers were themselves weeks from bankruptcy, and that would have resulted in Ford collapsing as well. Ford's CEO explicitly advocated that the government engage in this plan even though it meant saving a fierce competitor.
It would have been, in practice, a catastrophe.