Researchers Teach Computers To Perceive 3D from 2D

Posted by ryuzaki0 on Wednesday June 14, 2006 @07:16AM from the your-battlebot-wants-an-upgrade dept.

hamilton76 writes to tell us that researchers at Carnegie Mellon have found a way to allow computers to extrapolate 3 dimensional models from 2 dimensional pictures. From the article: "Using machine learning techniques, Robotics Institute researchers Alexei Efros and Martial Hebert, along with graduate student Derek Hoiem, have taught computers how to spot the visual cues that differentiate between vertical surfaces and horizontal surfaces in photographs of outdoor scenes. They've even developed a program that allows the computer to automatically generate 3-D reconstructions of scenes based on a single image. [...] Identifying vertical and horizontal surfaces and the orientation of those surfaces provides much of the information necessary for understanding the geometric context of an entire scene. Only about three percent of surfaces in a typical photo are at an angle, they have found."

5 of 145 comments

Errr... by Ayanami+Rei · 2006-06-14 07:26 · Score: 5, Informative

you've always been able to do that.
Cities aren't the kind of thing this is targeted at.
You can get building plans and architectural drawings and everything from the city for free. There are algorithms that can easily map pictures to objects if you know ahead of time the shape of the things that "should" be there.

This stuff is for deciding the shape of unknown things, and more importantly, to gain new heuristics for image searches.

With this technology, you could ask for "things that are round, and have a box".

More importantly, you could show the computer one picture of something, and have it attempt to find more pictures of it (from different angles, with different colors, etc.). Like you show it a Volvo C90, and it shows you any and all pictures of Volvo C90s by the shape.

--
THIS THING CAN TURN ON A DIME, MACROSSZERO STYLE ALSO FUCK BETA, ~NYORON
It is a fairly simple process by IndustrialComplex · 2006-06-14 07:33 · Score: 2, Informative

I remember doing something similar to this while an undergrad at Penn State. It was just an undergraduate computer vision course, but one of our exercises involved identifying common reference points in two or more images of the same object. These points can then be used to estimate the parallax between the images. It is really fun to play with, since you can use a few still images to create the illusion that a camera is panning around the object. Of course, that example is quite simple. It is very easy for the point matching to give false positives, and the processing time of our unoptimized algorithms nearly made it unusable. But it did at least give a proof of concept. However, taking this and expanding it to create 3D models, if they can do so reliably, is quite amazing.
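To make the parallax idea concrete, here is a minimal sketch that turns matched points into depth estimates. It assumes a rectified left/right pair (the camera moved purely sideways) and uses the standard pinhole relation depth = focal length × baseline / disparity. The function name, the focal length, and the baseline are hypothetical values chosen for illustration, not anything from the course exercise.

```python
def depths_from_matches(matches, focal_px, baseline_m):
    """Estimate depth for matched points in a rectified left/right image pair.

    matches: list of ((x_left, y_left), (x_right, y_right)) pixel coordinates.
    Returns depths in metres, or None where the disparity is not positive.
    """
    depths = []
    for (x_left, _), (x_right, _) in matches:
        # Parallax: how far the same point shifted horizontally between views.
        disparity = x_left - x_right
        if disparity > 0:
            depths.append(focal_px * baseline_m / disparity)
        else:
            depths.append(None)  # a false match or a point at infinity
    return depths


# Hypothetical matches: a large shift means a nearby point.
matches = [((320.0, 200.0), (300.0, 200.0)),   # 20 px shift
           ((410.0, 150.0), (405.0, 150.0))]   # 5 px shift
print(depths_from_matches(matches, focal_px=500.0, baseline_m=0.5))
# prints [12.5, 50.0]
```

A single bad match produces a wildly wrong depth, which is why false positives hurt so much in practice.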

--
Out of modpoints but really liked a post? 1BDkF6TtmmeZ3yqXbz9yhdYVqRYnwFoXDj
I worked with them briefly by moultano · 2006-06-14 07:42 · Score: 3, Informative

The complexity of the models that the program is able to extract is similar to what you would see in a game like doom. All "floors" are perfectly horizontal, all "walls" are perfectly vertical, and most objects (people, trees, cars) become small vertical walls. This doesn't attempt to capture surface geometry at all; it approximates things with large planes. What they are saying is that most things you see in pictures are very well approximated by these simple primitives, such that when they create a scene using them it provides convincing parallax as you move around it. It's a really neat effect.
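The "floors and walls" geometry can be sketched in a few lines. This is not the researchers' code, only an illustration under simple assumptions: a level pinhole camera at a known height, a known horizon row, and, for each image column, the row where a vertical surface meets the ground. All names and numbers below are hypothetical.

```python
def ground_depth(row, horizon_row, focal_px, camera_height_m):
    """Depth of a ground-plane pixel seen by a level pinhole camera.

    Image rows increase downward, so ground pixels lie below the horizon row.
    """
    if row <= horizon_row:
        raise ValueError("ground pixels must lie below the horizon")
    return focal_px * camera_height_m / (row - horizon_row)


def pop_up(contact_rows, horizon_row, focal_px, camera_height_m):
    """Stand one vertical 'billboard' per column at the depth where it touches the ground."""
    return [ground_depth(r, horizon_row, focal_px, camera_height_m)
            for r in contact_rows]


# Three columns whose walls meet the ground at rows 300, 250 and 220.
print(pop_up([300, 250, 220], horizon_row=200, focal_px=500.0, camera_height_m=1.5))
# prints [7.5, 15.0, 37.5]
```

Walls that touch the ground closer to the horizon end up farther away, which is what produces the parallax when the virtual camera moves.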
Re:"Enemy of the State" by Jerf · 2006-06-14 07:50 · Score: 4, Informative

It's worth pointing out that a lot of that stuff isn't, strictly speaking, impossible.

What's impossible is to take a single photo out of the stream and "enhance" it to the n-th degree without using the rest of the video.

And no matter how good your technique, you can't generate information, so there will be some limit to your zooming in.

But the idea that if you consider the entire video stream, you can extract a lot more information is not impossible at all, and you'd probably be surprised by both what is in there and what isn't. Seeing "through" something probabilistically is possible if the object being "seen" was in video at some point. On the other hand, "zooming" in to something on the counter that has been there for the entire duration of the video and has never moved is impossible, because while you may have 15,000 pictures of the object, they're all the same pictures.
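A toy illustration of that point, assuming an idealized one-dimensional "camera" that keeps every second sample of a scene: two frames taken at different offsets can be interleaved to recover the whole scene, while two frames taken at the same offset recover nothing new. The functions and the scene values are made up for this sketch; real multi-frame methods also have to deal with blur, noise, and unknown motion.

```python
def capture(scene, offset, step=2):
    """A low-resolution frame: every step-th sample of the scene, starting at offset."""
    return scene[offset::step]


def fuse(frames_with_offsets, length, step=2):
    """Put each frame's samples back at the scene positions they came from."""
    fused = [None] * length
    for frame, offset in frames_with_offsets:
        for i, value in enumerate(frame):
            fused[offset + i * step] = value
    return fused


scene = [3, 9, 4, 7, 1, 8, 2, 6]

# The camera (or the object) moved by one sample between frames.
moving = fuse([(capture(scene, 0), 0), (capture(scene, 1), 1)], len(scene))

# Nothing moved: the second frame repeats the first.
static = fuse([(capture(scene, 0), 0), (capture(scene, 0), 0)], len(scene))

print(moving)  # prints [3, 9, 4, 7, 1, 8, 2, 6]
print(static)  # prints [3, None, 4, None, 1, None, 2, None]
```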

Normally I don't bring this up when we're having one of our usual bitch-fests about CSI here on Slashdot because by and large the standard bitching is still correct. But as AI advances, some of the stuff that seems impossible now will become very possible.

One early example I remember seeing is the demonstration of a system that could identify a person with about 15x15 pixel, high-temporal-resolution monochrome video of them walking, by comparing walking patterns. This was a while ago, and it's worth pointing out your brain can do a pretty decent job of the same task when shown the same video. I mention this because any given frame of the video is basically a random assortment of gray blobs, but in motion, not only is it "a person" but it's a specific person; making it a video adds a lot of information.
Re:Using multiple camera angles... by Anonymous Coward · 2006-06-14 08:34 · Score: 1, Informative

Actually, there is a technique called Scale Invariant Feature Transform (SIFT) that can do the same thing. I'm doing an undergraduate research project on it right now. The way it works is by taking an image and repeatedly convolving it with a Gaussian kernel, then subtracting each blurred image from the next one. That difference approximates a convolution with the second derivative of a Gaussian (the Mexican-hat function, which kinda looks like a sombrero when you plot it). You do this throughout your "octave" (however many steps it is, I usually use n = 6), getting n+2 images, the last of which has an effective resolution of half that of the initial image. You then decrease the resolution of the image (easily done by averaging groups of 4 pixels) and repeat. In each octave, you then take each convolved image and find the local minima and maxima across that image, the image immediately prior (one convolution before) and the image immediately after (one convolution later). These are then considered to be features, and the octave in which they were found indicates their relative size. These features are then categorized in a few ways. I handle rotation by convolving another kernel over just the area containing the feature to find the gradients in the X and Y directions, which allows me to then calculate the gradient magnitude of each pixel in the feature. I then use a weighted average (more weight as the pixel is closer to the center of the feature) to determine the feature's rotation (similar things could be used to try to determine skew or transform, but those are not as useful). Finally, I create a histogram that categorizes each feature in a manner that is searchable (this is difficult, I'm working on it now). The hope is that if I perform the same SIFT algorithm on another image and find its features, I can match the features in an effort to identify them in other images. If I find a potential feature match, I know the relative scale of the feature because I know the octave in which I found it in the original image, and I can look for other features that might be present at that octave and then attempt to match those. If I find many matches in close proximity, then I have likely identified an object.
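The extremum search is easier to see in one dimension. The sketch below is a toy, not the implementation described above and not full SIFT: it blurs a 1-D signal at a series of scales, subtracts adjacent blurred copies (a difference of Gaussians), and reports samples that are strictly smaller or larger than all their neighbours in position and scale. For simplicity it blurs the original signal directly at each scale instead of convolving repeatedly, and the starting sigma, the scale ratio, and the number of levels are assumptions picked for the example.

```python
import math


def gaussian_kernel(sigma):
    """Sampled Gaussian, truncated at about 3 sigma and normalised to sum to 1."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Convolve the signal with a Gaussian, clamping indices at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[k]
        out.append(acc)
    return out


def dog_extrema(signal, sigma0=1.0, ratio=1.6, levels=6):
    """Return (scale_index, position) pairs that are extrema in scale and space."""
    blurred = [blur(signal, sigma0 * ratio ** s) for s in range(levels)]
    # Difference of Gaussians: each blurred copy minus the next sharper one.
    dog = [[b - a for a, b in zip(blurred[s], blurred[s + 1])]
           for s in range(levels - 1)]
    keypoints = []
    for s in range(1, len(dog) - 1):
        for i in range(1, len(signal) - 1):
            value = dog[s][i]
            neighbours = [dog[t][j]
                          for t in (s - 1, s, s + 1)
                          for j in (i - 1, i, i + 1)
                          if (t, j) != (s, i)]
            if value < min(neighbours) or value > max(neighbours):
                keypoints.append((s, i))
    return keypoints


# A 7-sample bright bump centred on index 32 in an otherwise flat signal.
signal = [0.0] * 64
for i in range(29, 36):
    signal[i] = 1.0

keypoints = dog_extrema(signal)
print(keypoints)
```

The bump's centre shows up as a keypoint, and the scale index at which it is found grows with the width of the bump, which is the sense in which the scale "indicates relative size".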

This sounds complicated, but it actually runs quite quickly because the repeated gaussian convolution is not a particularly difficult problem (it's O(NxM) where N and M are the length and width of the image, and with a small kernel, that's not very many operations). There are some ways to speed it up, however. One trick is to note that the convolution operation is a simple multiplication in the frequency domain, so if you use a Fast Fourier Transform (FFT) on the image to find its frequency content, you could then apply the convolution as a multiplication, but I haven't actually tried this because it is NOT a trivial programming task.
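For anyone curious about the frequency-domain trick, here is a small sketch that checks it numerically in one dimension. It uses a textbook recursive radix-2 FFT written out by hand (so the length must be a power of two) and compares a direct circular convolution against multiplying the two spectra and transforming back. All function names are made up for this example; a real image pipeline would use a 2-D FFT from a numerical library.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]


def circular_convolve_direct(a, b):
    """O(n^2) circular convolution straight from the definition."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def circular_convolve_fft(a, b):
    """Same result via pointwise multiplication in the frequency domain."""
    product = [x * y for x, y in zip(fft(a), fft(b))]
    return [v.real for v in ifft(product)]


# Zero-padded so the circular result equals the ordinary (linear) convolution.
signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
kernel = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

direct = circular_convolve_direct(signal, kernel)
via_fft = circular_convolve_fft(signal, kernel)
print(all(abs(d - f) < 1e-9 for d, f in zip(direct, via_fft)))  # prints True
```

The FFT route pays off mainly for large kernels; for a small Gaussian kernel the direct convolution is usually fast enough.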