Google's Latest Machine Vision Breakthrough
mikejuk writes "Google Research recently released details of a Machine Vision technique which might bring high power visual recognition to simple desktops and even mobile computers. It claims to be able to recognize 100,000 different types of object within a photo in a few minutes — and there isn't a deep neural network mentioned. It is another example of the direct 'engineering' approach to implementing AI catching up with the biologically inspired techniques. This particular advance is based on converting the usual mask-based filters to a simpler ordinal computation and using hashing to avoid having to do the computation most of the time. The result of the change to the basic algorithm is a speed-up of around 20,000 times, which is astounding. The method was tested on 100,000 object detectors using over a million filters on multiple resolution scalings of the target image, which were all computed in less than 20 seconds using nothing but a single, multi-core machine with 20GB of RAM."
Can it sort and identify duplicates automagically in my porn collection?
When Fascism comes to America, it will call itself Anti-Fascism, and tell you to give up your guns.
"... using nothing but a single, multi-core machine with 20GB of RAM" Phew... here I was thinking it'd need some unrealistically high specs from my PC!!
"...might bring high power visual recognition to simple desktops and even mobile computers... computed in less than 20 seconds using nothing but a single, multi-core machine with 20GB of RAM."
Right... and by mobile computers you mean computers that I can lug from one desk to another.
My cat can spot a Dentabite bag from across the room in 20 milliseconds. Does that mean my cat has 20TB of RAM?
... or Sarah Connor for that matter?
So CAPTCHAs will become even easier to crack? Great, the sooner we can get rid of them, the better. As it is, they are getting impossible for humans to read, thanks to idiots who don't know how to design them.
Yes, it's a breakthrough. It won the best paper award at this year's Conference on Computer Vision and Pattern Recognition, a tier 1 computer vision conference.
Hashing invariant properties in images isn't new, but,
banded winner-take-all hashing of histograms-of-oriented-gradient part filters and then using matches across those bands to identify a test feature's nearest neighbors, while simultaneously computing an upper bound or exact dot products of those test features with their nearest learned features, for up to 100,000 objects with small amounts of memory, is new.
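For anyone wondering what "banded winner-take-all hashing" looks like in practice, here is a minimal sketch. It assumes the usual WTA definition (permute the vector, look at the first k entries, record the position of the largest one) and groups the resulting codes into bands used as hash-table keys. The names `wta_hash`, `make_perms`, and `band_keys`, the toy vector, and all parameter values are made up for illustration. This is not the paper's code, and it leaves out the HOG features and the dot-product step entirely.

```python
import random

def make_perms(dim, n_hashes, seed=0):
    """Build n_hashes fixed random permutations of range(dim)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_hashes):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms

def wta_hash(vec, perms, k):
    """One code per permutation: the position (0..k-1) of the largest
    value among the first k entries of the permuted vector."""
    codes = []
    for perm in perms:
        window = [vec[i] for i in perm[:k]]
        codes.append(max(range(k), key=lambda j: window[j]))
    return codes

def band_keys(codes, band_size):
    """Group consecutive codes into bands; each band acts as one hash-table key."""
    return [tuple(codes[i:i + band_size]) for i in range(0, len(codes), band_size)]

perms = make_perms(dim=8, n_hashes=6, seed=0)
x = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
# A monotonic rescaling keeps the rank order, so the codes do not change.
y = [10 * v + 3 for v in x]
codes_x = wta_hash(x, perms, k=4)
codes_y = wta_hash(y, perms, k=4)
bands = band_keys(codes_x, band_size=2)
```

The point of the ordinal trick is that only the rank order of the values matters, which is why `codes_x` and `codes_y` come out identical. Two vectors that share a band key land in the same bucket and become candidate neighbours without any dot product being computed.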
When you say small amounts, you mean 32GB.
That's a small amount.
Seriously, what's with all the 1980s throwbacks on here today...
20GB per 100000 objects is 209kB per object. Don't know what resolution each image was, but I think 200kB is quite small.
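The 209kB figure checks out if you read 20GB as 20 GiB (20 × 2^30 bytes) and kB as 1024 bytes. Both readings are assumptions, since neither the summary nor the comment says which units are meant. A quick sketch of the arithmetic:

```python
total_bytes = 20 * 2**30          # assumption: "20GB" means 20 GiB
n_objects = 100_000

bytes_per_object = total_bytes / n_objects
kib_per_object = bytes_per_object / 1024       # binary kilobytes
kb_per_object_decimal = 20e9 / n_objects / 1000  # decimal reading of the same figure

print(f"{kib_per_object:.1f} KiB per object (binary units)")
print(f"{kb_per_object_decimal:.1f} kB per object (decimal units)")
```

Either way it is roughly 200 to 210 kilobytes per object detector.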
BMW has a forward-facing camera under the rear view mirror that scans for highway signs showing posted speed limits and no-passing zones and displays them on the dash. I am not sure whether it comes with the basic car or whether you have to buy some advanced tech package for it.
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
It would be nice if it could identify bird species (or other animals), preferably down to specific individual animals, like they already do with whales and penguins.
I'd gladly pay money for such a program instead of getting only a free version where I can check whether aunt Mary with a drink in hand is in any photo in my collection.
We have already been waiting for years for a program that can identify bird songs, ever since Shazam came out. No luck yet, but hey, after all, many towns already have a program that tells them: somebody shot somebody with a .45, 0.23 miles in that direction, so there is still hope.
You have been tagged at the ATM
You have been tagged at the laundromat
You have been tagged at the Quickie Mart
You have been tagged at work
You have been tagged at the gym
You have been tagged.
What's with people acting as though everyone and their grandma has to be able to run it on a personal computer before a new technology counts as a phenomenal breakthrough? Does your grandma compute PageRank for billions of web pages on her home computer?
sed -e 's/Chuck Norris/Rajnikant/g' joke > fact
It's fast, but the training set is random garbage from YT thumbnails and they have NO PROCESS to assess accuracy. All they can do is measure precision, and it's ~16% on average. What this means is that their algorithm could very well just say FACE every single time, and if by sheer coincidence every sixth image in the dataset contains some face, tada, you just reached 16% precision.
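To make the arithmetic in that argument concrete, here is a toy sketch with made-up labels where exactly one image in six contains a face and the "classifier" always answers face. It only illustrates plain single-class precision. Mean average precision over many classes is a different measure, so this says nothing either way about how the paper's number was computed.

```python
def precision(predictions, labels, positive):
    """Precision = true positives / everything predicted positive."""
    predicted_pos = [l for p, l in zip(predictions, labels) if p == positive]
    if not predicted_pos:
        return 0.0
    true_pos = sum(1 for l in predicted_pos if l == positive)
    return true_pos / len(predicted_pos)

# Synthetic dataset (an assumption for illustration): every sixth image is a face.
labels = ["face" if i % 6 == 0 else "other" for i in range(600)]

# Degenerate classifier that says "face" for every image.
predictions = ["face"] * len(labels)

p = precision(predictions, labels, positive="face")
print(f"precision of the always-face classifier: {p:.3f}")
```

With these labels the constant classifier scores 100 hits out of 600 predictions, i.e. the base rate of faces in the data.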
Who logs in to gdm? Not I, said the duck.
Being a software engineer myself, I understand the sense of excitement and accomplishment after completing internal testing. But as with many projects, as soon as this leaves the controlled "lab testing" environment it's a whole different ball game. Until then it's still a white paper product and I'd suggest remaining cautiously optimistic...
When you say small amounts, you mean 32GB.
Which, according to Newegg, can be had for $229. Sure, it's not pocket change, but think of, say, a computer vision program for a car... that's a tree, that's a house, that's a dog, there's a child running around. I would imagine it's a lot easier to collect sensor data than to make sense of it in real time. If you can rapidly identify points of interest, like the facial recognition in photo cameras on steroids, you can put processing power (and potentially directional sensors) to good use. For example, you look very closely for clues as to whether a pedestrian walking down the street past a crossing is going to use it, and so should an AI driver. Blunt use of raw processing power will only take you so far.
Live today, because you never know what tomorrow brings
Bear in mind, this particular method is just a way to quickly do a large number of convolutions and get statistically fairly accurate results for the most activated convolution kernels.
This isn't incompatible with deep neural network models. This method can be combined with them and provide the same speedup there.
But is it a breakthrough compared with the normal way to speed up a convolution, that is to compute it in Fourier space using a Fast Fourier Transform (FFT) or variants (DFT, DCT), etc.? I don't know the answer to this and would like a comparison as it is very relevant to my research work...
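For reference, the Fourier-space route mentioned here rests on the convolution theorem: transform both signals, multiply pointwise, transform back. The sketch below checks that against a direct circular convolution on a toy 1-D signal. It uses a naive O(n^2) DFT purely for readability (a real FFT does the same transform in O(n log n)), the function names and input values are made up, and it does not benchmark anything against the hashing method.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); inverse divides by n."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_convolve_direct(a, b):
    """Circular convolution computed term by term."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def circular_convolve_fourier(a, b):
    """Same result via the convolution theorem: multiply the spectra."""
    fa, fb = dft(a), dft(b)
    prod = [p * q for p, q in zip(fa, fb)]
    return [v.real for v in dft(prod, inverse=True)]

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25, 0.0, 0.0]
direct = circular_convolve_direct(signal, kernel)
fourier = circular_convolve_fourier(signal, kernel)
```

Both routes compute every output value exactly (up to rounding). As described elsewhere in this thread, the hashing approach instead tries to avoid doing most of the computation at all, so the two are not directly equivalent trade-offs.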
I'm sure it would take me more than a few minutes to identify that many objects.
However, how fast can it find Waldo?
It probably has been well tested "in the real world" - check out Google Goggles sometime (which is available for Android and iOS).
In fact, this probably came out of the stuff that Goggles does, where you snap a photo and Goggles figures out what's in it. If you snap a QR code, it'll decode it; if you snap a barcode, it'll pop up a Google search for that product. For other items it'll attempt either OCR or object recognition. Basically it gives you a list of things (snap a sign and it'll probably try to OCR it, offer you a translation, tell you what kinds of cars it sees, etc).
- First they ignore you, then they laugh at you, then ???, then profit.
I built out an $8k desktop for home earlier this year. I can assure you, the 64GB of RAM that went in there was probably one of the cheapest components in it. You can do 64GB for as little as $300. A single high-end graphics card is going to run you around $1000.
To make the Kinect work (version 1.0) Microsoft gathered thousands upon thousands, possibly millions of data points, processed the images, checked the results etc. and after zillions of computations ended with digested data and some algorithms that use it, giving an accurate result in real time.
From reading the abstract I'm under the impression that Google basically did the same thing; it's trading computation for memory use. The "hashes" of what the camera sees are somehow matched against the digested data they amassed, and thus the object gets classified. They do mention the training data.
represents a speed-up of approximately 20,000 times - four orders of magnitude - when compared with performing the convolutions explicitly on the same hardware. While mean average precision over the full set of 100,000 object classes is around 0.16 due in large part to the challenges in gathering training data and collecting ground truth for so many classes, we achieve a mAP of at least 0.20 on a third of the classes and 0.30 or better on about 20% of the classes.
I can't comment further on this. I dunno if that new Google thing is basically/fundamentally the same concept used in the Kinect or if there are relevant differences other than scale.