Breakthrough In Face Recognition Software
An anonymous reader writes: Face recognition software underwent a revolution in 2001 with the creation of the Viola-Jones algorithm. Now, the field looks set to dramatically improve once again: computer scientists from Stanford and Yahoo Labs have published a new, simple approach that can find faces turned at an angle and those that are partially blocked by something else. The researchers "capitalize on the advances made in recent years on a type of machine learning known as a deep convolutional neural network. The idea is to train a many-layered neural network using a vast database of annotated examples, in this case pictures of faces from many angles. To that end, Farfade and co created a database of 200,000 images that included faces at various angles and orientations and a further 20 million images without faces. They then trained their neural net in batches of 128 images over 50,000 iterations. ... What's more, their algorithm is significantly better at spotting faces when upside down, something other approaches haven't perfected."
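For a sense of scale, the training figures quoted in the summary can be multiplied out. This is only back-of-the-envelope arithmetic on those numbers: the summary does not say how faces and non-faces were mixed within a batch, so the function name and the comparison against the combined image count are my own framing, not something from the paper.

```python
# Figures quoted in the summary above.
BATCH_SIZE = 128
ITERATIONS = 50_000
FACE_IMAGES = 200_000
NON_FACE_IMAGES = 20_000_000


def samples_seen(batch_size: int, iterations: int) -> int:
    """Total number of training samples drawn, counting any repeats."""
    return batch_size * iterations


total = samples_seen(BATCH_SIZE, ITERATIONS)
print(total)  # 6400000

# Fraction of the combined image pool that 6.4 million draws could cover
# at most (assumes no image is drawn twice, which the summary does not state).
print(round(total / (FACE_IMAGES + NON_FACE_IMAGES), 2))  # 0.32
```

So the quoted schedule draws 6.4 million samples in total, fewer than the number of images in the database.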
Very much anecdotal, but here goes anyway - a little while back, I found a recipe for cow tongue that seemed intriguing. I couldn't recall whether I had eaten it before; at least I had never prepared it myself. So off to the butcher's I went, as tongue is not found in every shop. The tongues they had on display there seemed very tiny (in retrospect, they must have been veal tongues), so I said "give me the largest tongue you have". As the saying goes, "you should be careful what you wish for" - what I ended up with was a monster, something over 1.3kg (nearly three pounds). I really didn't need that much, but all I could do was say thanks and go home with my prey.
As I laid it on my cutting board, pretty much filling it entirely, it looked at the same time so awesome and gruesome that I had to take a photo of it (not a food blogger, or a blogger of any kind, I just had to document it). And to share the experience, I sent it to a friend via Hangouts. Now, as she uses Hangouts from the GMail web interface, the images are not visible inline but are Google+ links. So she clicks the link.
...and G+ helpfully asks her "Is this xxxxx?" (xxxxx == her name). While people are, rightfully, concerned about whether companies such as Google know too much about their lives, at least when it comes to facial recognition, Google has a long way to go.
They're using a standard technique. Convolutional networks started to become big with LeCun's 1998 paper on learning to recognize hand-written digits http://yann.lecun.com/exdb/pub... . His LeNet-5 network could identify the digit accurately 99% of the time.
Convolutional networks are also starting to be used to play Go, e.g. 'Move evaluation in Go using Deep Convolutional Neural Networks', by Maddison, Huang, Sutskever and Silver, http://arxiv.org/pdf/1412.6564... Maddison et al. used a 12-layer convolutional network to predict where an expert would move next with 50% accuracy :-)
Progress on convolutional networks moves forward all the time, in an incremental way. If we had one article per day about one increment it would quickly lose mass appeal though :-) The article is about one increment along the way, but does symbolize the massive progress that is being made.
Convolutional networks work well partly because they can take advantage of the massive computational capacity made available in GPU hardware.
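To make the GPU point concrete, here is a minimal pure-Python sketch of the core operation in a convolutional layer. It is an illustration only, not the code from any of the papers mentioned: the function name is made up, and like most deep learning libraries it actually computes a cross-correlation (the kernel is not flipped) with no padding.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) with no padding.

    Computes a cross-correlation: the kernel is applied as-is, not flipped.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            # Each output pixel depends only on its own input window,
            # so every (y, x) could be computed by a separate thread.
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv2d_valid(image, kernel))  # [[-4.0, -4.0], [-4.0, -4.0]]
```

The two outer loops have no dependencies between iterations, which is the property that lets this kind of work spread across thousands of GPU threads.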
As someone who literally works on face detection/tracking software on low power ARMv7/8 CPUs, I can safely say you are dead wrong.
Assuming width==height (not likely given any current video formats or cameras), and assuming width%8 == 0 - it's a simple transposition of the rows and/or columns to do +/- 90/180 degrees, yes - and assuming you can fit your ENTIRE image in L1 cache you're going to incur minimal stalls (especially with an SoC that has a decent prefetch engine).
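For reference, the index shuffling behind those rotations can be sketched in a few lines of Python. This only shows the mapping (transpose plus a reversal), using made-up function names; it says nothing about NEON, cache behaviour or timing, which is where the real cost lies. Note that with width != height the output dimensions swap.

```python
def transpose(img):
    """Swap rows and columns of an image stored as a list of rows."""
    return [list(col) for col in zip(*img)]


def rotate90_cw(img):
    """Rotate 90 degrees clockwise: transpose, then reverse each row."""
    return [row[::-1] for row in transpose(img)]


def rotate90_ccw(img):
    """Rotate 90 degrees counter-clockwise: transpose, then reverse row order."""
    return transpose(img)[::-1]


def rotate180(img):
    """Rotate 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in img[::-1]]


img = [[1, 2, 3],
       [4, 5, 6]]          # 3 wide, 2 high
print(rotate90_cw(img))    # [[4, 1], [5, 2], [6, 3]]  (2 wide, 3 high)
print(rotate90_ccw(img))   # [[3, 6], [2, 5], [1, 4]]
print(rotate180(img))      # [[6, 5, 4], [3, 2, 1]]
```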
In reality:
* width != height
* width is however typically divisible by 8 so you can do pure NEON (not hybrid NEON + ALU/VFP) transpositions
* an 8-bit grayscale VGA (640x480) image doesn't even fit in L1 cache, let alone a 720p/1080p format (though most CV applications scale things down significantly, so you tend to work at 320x180 - but that still doesn't fit in most L1 caches, although it does fit in 'some')
* L2 cache hits are dozens of cycles, L2 cache misses are HUNDREDS of cycles
* A real-world rotation of a 320x180 image takes ~2ms on a 700MHz Cortex A9. That is not 'practically zero': it's 12% of your processing time at 60Hz, and 36% of your processing time if you're going to rotate 3 times.
(Note: I'm using a 700MHz Cortex A9 as the example because that's typical of the automotive hardware we deal with, although the last 2 years have brought ~1-1.5GHz A15s into the mix, though most of those cars aren't even on the market yet)
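The numbers in the list above are easy to check. A quick sketch of the arithmetic follows; the L1 data-cache sizes are assumptions for illustration (16, 32 and 64 KiB), not figures given in the comment, and the helper names are made up.

```python
def image_bytes(width: int, height: int, bytes_per_pixel: int = 1) -> int:
    """Memory footprint of an uncompressed image (8-bit grayscale by default)."""
    return width * height * bytes_per_pixel


def budget_fraction(cost_ms: float, fps: float) -> float:
    """Share of one frame's time budget used by an operation costing `cost_ms`."""
    frame_ms = 1000.0 / fps
    return cost_ms / frame_ms


# Assumed L1 data-cache sizes, for comparison only.
L1_SIZES = [16 * 1024, 32 * 1024, 64 * 1024]

for w, h in [(640, 480), (320, 180)]:
    size = image_bytes(w, h)
    fits = [size <= l1 for l1 in L1_SIZES]
    print(w, h, size, fits)
# 640 480 307200 [False, False, False]
# 320 180 57600 [False, False, True]

print(round(budget_fraction(2.0, 60), 2))      # 0.12
print(round(3 * budget_fraction(2.0, 60), 2))  # 0.36
```

A 60Hz frame leaves about 16.7ms in total, so a 2ms rotation is 12% of it and three rotations are 36%, matching the figures quoted.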
I didn't read the article, of course, but the summary sounds like they're doing face *detection* not recognition.
Detection: find which portions of an image are faces.
Recognition: compare to a database of faces and find out whose face it is.
The first is far easier than the second.
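The difference shows up in what each step returns. A toy sketch, with everything in it hypothetical: `Box` stands in for a detector's output, and `recognize` is a plain nearest-neighbour lookup over made-up feature vectors, not how any real system (or the paper) does it.

```python
from math import dist
from typing import Dict, NamedTuple, Optional, Sequence


class Box(NamedTuple):
    """What a detector outputs: where a face is, not whose it is."""
    x: int
    y: int
    w: int
    h: int


def recognize(embedding: Sequence[float],
              gallery: Dict[str, Sequence[float]],
              max_distance: float) -> Optional[str]:
    """Return the closest known identity, or None if nothing is close enough."""
    best_name, best_d = None, max_distance
    for name, reference in gallery.items():
        d = dist(embedding, reference)  # Euclidean distance
        if d <= best_d:
            best_name, best_d = name, d
    return best_name


# Detection result: one region of the image, no identity attached.
face = Box(x=40, y=30, w=64, h=64)

# Recognition: compare a feature vector for that region to known people.
gallery = {"alice": (0.0, 0.0), "bob": (1.0, 1.0)}
print(recognize((0.1, 0.0), gallery, max_distance=0.5))  # alice
print(recognize((5.0, 5.0), gallery, max_distance=0.5))  # None
```

Recognition needs a database of known faces and a distance threshold; detection needs neither.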
I apologize for the lack of a signature.