Google Using ReCAPTCHA To Decode Street Addresses

Posted by timothy on Thursday March 29, 2012 @09:10AM from the you-are-the-crowd-being-sourced dept.

smolloy writes "Apparently some users of reCAPTCHA have recently begun seeing photographs appear in their CAPTCHA puzzles: photos that look very much like zoomed-in house numbers taken from Google Street View. It appears that Google has decided to put the reCAPTCHA system to work helping clean up Google Street View images, and 'according to a Google spokesperson, the system isn't limited to street addresses, but also involves street names and even traffic signs.' A large collection of these has appeared on the Blackhatworld website."

Take off your tin foil hat by Anonymous Coward · 2012-03-29 09:13 · Score: 4, Insightful

This is an incredibly fascinating and great use of the technology.
1. Re:Take off your tin foil hat by Desler · 2012-03-29 09:34 · Score: 2
  
  I'm guessing you've never done a copy-and-paste on, say, Google Books, because the OCRed text quite frequently contains typos, randomly inserted spaces and completely wrong words. And since reCAPTCHA is used to supplement the OCR on Google Books, it would appear they aren't as smart as you would like them to seem.
2. Re:Take off your tin foil hat by Anonymous Coward · 2012-03-29 10:02 · Score: 2, Insightful
  
  It's mostly the fault of 4chan.
  Ever since reCAPTCHA was implemented there, most of the RC results are
  '(Checkword) Nigger'
Re:I'm a Microsoft whore by nedlohs · 2012-03-29 09:15 · Score: 4, Insightful

Yeah, because those street numbers, designed to tell everyone passing by what number the house is on the street, are meant to be private.
If I just type out the necessary word... by mykos · 2012-03-29 09:18 · Score: 2

What happens to the other part? Does Google keep recycling it until it has multiples of the same answer? Can we all agree on a word for the addresses just to have some fun with Google?
1. Re:If I just type out the necessary word... by X0563511 · 2012-03-29 09:34 · Score: 4, Insightful
  
  Great. You know what they were previously? OCR for things like libraries.
  I think your own answer to them describes what you are.
  
  --
  For large sets, this will be our guide even unto death, for the LORD will work for each type of data it is applied to...
Re:Eyebleed site by bertoelcon · 2012-03-29 09:20 · Score: 3, Informative

Wow that site is so terrible looking that it makes Geocities and myspace look decent. The only thing it's missing is cosmic cursors.
Yeah, Techcrunch is really ugly isn't it.

--
Anything can be found funny, from a certain point of view.
Be a Roman harlot instead! by AliasMarlowe · 2012-03-29 09:21 · Score: 5, Funny

And put your house number in Roman Numerals. Nothing like living in number CLXXIV to screw up the recaptcha. Anyone answering with 174 is likely counted as wrong...

--
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
1. Re:Be a Roman harlot instead! by gnick · 2012-03-29 09:51 · Score: 3, Funny
  
  And put your house number in Roman Numerals. Nothing like living in number CLXXIV to screw up the recaptcha.
  Not to mention the postal service! Damn snooty mailmen with their eagle-logo cars and fancy uniforms... Now I know how to get back at them.
  
  --
  He's getting rather old, but he's a good mouse.
Re:This is actually kind of frightening... by nine-times · 2012-03-29 09:21 · Score: 4, Insightful

I don't find it worrying. The existence of a street address is properly public knowledge. It's not an invasion of privacy until they link the address with who lives there.
Re:This is actually kind of frightening... by medlefsen · 2012-03-29 09:22 · Score: 4, Funny

Oh shit
http://www.whitepages.com/
I seem to have missed something... by Gen-GNU · 2012-03-29 09:28 · Score: 4, Informative

I have read the quote from Google about what they are doing several times, and I don't see what everyone else sees. It appears to me that they are using the already known street names and numbers as possible ReCAPTCHA images. What they are NOT doing is using the results given by people to define what the image says. The point of the experiment is to determine whether these images are sufficient to separate people from web-bots. I imagine that they will look at the number of 'wrong' answers from both sides of the test, and see if bots are able to parse the street view images significantly more often than the standard test images.
So... can anyone point to something in the Google quote to show me where I went wrong? From TFA, here is the quote:
We’re currently running an experiment in which characters from Street View images are appearing in CAPTCHAs. We often extract data such as street names and traffic signs from Street View imagery to improve Google Maps with useful information like business addresses and locations. Based on the data and results of these reCaptcha tests, we’ll determine if using imagery might also be an effective way to further refine our tools for fighting machine and bot-related abuse online.
1. Re:I seem to have missed something... by icebike · 2012-03-29 10:01 · Score: 2
  
  Getting around reCAPTCHA logins is usually easy. Just correctly type the easy to read word, and an approximation of the number of characters in the hard to read one. You don't even have to be close.
  Google could have a few thousand house numbers they already know (their own recognition system is probably capable of this), and they can swap these in as well as a hard to read scanned word from a book, and you could never be sure which one was the reCAPTCHA and which was the CAPTCHA.
  
  --
  Sig Battery depleted. Reverting to safe mode.
Re:I'm a Microsoft whore by b4dc0d3r · 2012-03-29 09:33 · Score: 2

Different angles make it hard to be sure you have the number right. If you look at a street photo like a book you're going to OCR, you have first the layout detection, then identify the image part and the text part. Solving this problem would be similar to identifying where the page number is, to be eliminated from the text.
Taking a laser measurement, un-warping the photo, and then doing traditional OCR would be awesome, if they had the forethought to include the laser part in their vast collection, but they didn't. Then you have the multiple "type faces" available.
Anyway, lots of places don't have a specific street address. Type something in and you get a blobby sort of approximation. Or data from Open Street Map - my home address is just a dot in the middle of the street. With street view they could get it more precise.
I would guarantee this is all shots from places like mine, where they may or may not have street names, and definitely don't have address ranges for the blocks. Connect the street name to the GPS tag in the photo, apply that to the orientation of the vehicle, and add the street numbers - accurate mapping better than any in-dash system has today.
Re:I'm a Microsoft whore by Baloroth · 2012-03-29 09:54 · Score: 5, Informative

Yet Google would have to know what the address number really was in order to validate the reCAPTCHA, so that can hardly be why they are doing it. They don't need to crowdsource an answer that they already know.
No they don't. They also add an altered text image alongside the picture (which presumably they generated), and can use that to validate the CAPTCHA. The street number can be validated by numerical probability (if 70% of them say it is "257", and the numbers "2,5,7" appear frequently in the rest, it is probably "257") even if they don't already know what it is.
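The "numerical probability" idea above can be sketched roughly as follows. This is only an illustration, not Google's actual method: the function name `likely_number`, the 70% threshold taken from the comment's example, and the crude "digit support" measure (how many digits in the dissenting answers also occur in the leading answer) are all assumptions.

```python
from collections import Counter

def likely_number(answers, min_share=0.7):
    """Return (best_guess, share, digit_support), or None if no answer
    reaches min_share of the votes.

    digit_support is the fraction of digits in the dissenting answers
    that also occur in the leading answer (an assumed, crude measure).
    """
    if not answers:
        return None
    best, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    if share < min_share:
        return None
    # Collect every character typed by people who disagreed with the leader.
    rest_digits = [ch for a in answers if a != best for ch in a]
    if rest_digits:
        support = sum(ch in best for ch in rest_digits) / len(rest_digits)
    else:
        support = 1.0
    return best, share, support

# Hypothetical votes for one house-number image.
votes = ["257"] * 8 + ["251", "267"]
result = likely_number(votes)
```

Here eight of ten people agree on "257", and four of the six digits typed by the two dissenters also appear in "257", which is the kind of evidence the comment describes.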

--
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
Re:I'm a Microsoft whore by cforciea · 2012-03-29 09:54 · Score: 5, Informative

I don't think you know how reCAPTCHA works. You are always presented with two different items to decode. One of them is always a known answer, and the other they are less sure about, but become more sure after they show it to enough people and get a crowd sourced answer. They don't give you two prompts just to be double sure you are human.
Re:I'm a Microsoft whore by eldorel · 2012-03-29 09:57 · Score: 3, Informative

reCAPTCHA works by pairing a known value with an unknown one; that's why you have to type two words.

One of the two words is considered solved and is the actual CAPTCHA; the second word is using you as OCR.

After enough people provide the same solution for the second word, it goes into the solved category and is used for validation.

They don't have to pay people to validate the addresses, we're doing it for free.
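A minimal sketch of the two-word check described above, assuming a case-insensitive exact match on the known word. The function name `check_submission` and the plain list used to collect guesses are hypothetical, not part of any real reCAPTCHA API.

```python
def check_submission(control_answer, typed_control, typed_unknown, candidates):
    """Pass the user only if the known (control) word matches.

    When the control word is right, the user's guess for the unknown
    word is recorded as one more crowdsourced candidate.
    """
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        candidates.append(typed_unknown.strip().lower())
    return passed

# Hypothetical session: "overlooks" is the solved word, the house number is not.
candidates = []
first = check_submission("overlooks", "Overlooks", "257", candidates)
second = check_submission("overlooks", "overloks", "999", candidates)
```

Only the first submission passes, so only its guess ("257") lands in `candidates`; the guess from the failed attempt is discarded.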
Re:Eyebleed site by wmbetts · 2012-03-29 10:01 · Score: 2

Baziiiinga!
Fixed it for you.

--
"Ubuntu" -- an African word, meaning "Slackware is too hard for me". - stolen from Dan C alt.os.linux.slackware
Re:Are people actually annoyed at this? by icebike · 2012-03-29 10:04 · Score: 4, Insightful

Oh, climb down off that ledge before you get hurt.
reCAPTCHA is for whatever you want to use it for. It's simply a technique for crowdsourcing guesses.
In my estimation, Google maps and street view is one of the great accomplishments of our time, easily worth every penny Google monetizes out of it.

--
Sig Battery depleted. Reverting to safe mode.
Re:How does ReCAPTCHA "solve" new images? by Anonymous Coward · 2012-03-29 10:16 · Score: 2, Interesting

They give you two words to solve. One is an old, known word and the other is a new, unknown word. You have no way to tell which is which. To pass the CAPTCHA, you need to answer both and get the known one correct. Eventually entries can go from unknown to known when enough people provide the same answer.
Re:How does ReCAPTCHA "solve" new images? by eldorel · 2012-03-29 10:38 · Score: 3, Informative

Not exactly, but pretty close.

They give you 2 words: one is an already solved, known value, and the other is an unknown word.
If you get the first word correct, they take the value from your second word and add it to the "possible solutions" list.

After 2000 or so people have solved the word, they examine the results for a statistically unique answer. If there is no outlier (say only 65% have the same answer), it goes back into the unknown pile.

Once they find a statistically significant answer, it's considered "solved" and is used as one of the initial validation words.

Rinse, repeat.
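The cycle described in this comment could look something like the sketch below. The 2000-guess sample size comes from the comment; the 75% agreement cutoff is purely an assumed placeholder (the comment only implies that 65% is not enough), and `review_word` is a hypothetical name.

```python
from collections import Counter

def review_word(guesses, min_guesses=2000, min_share=0.75):
    """Decide what happens to an unknown word after collecting guesses.

    Returns ("pending", None) until enough guesses exist,
    ("solved", answer) when one answer clearly dominates,
    or ("unknown", None) when it should go back into the unknown pile.
    """
    if len(guesses) < min_guesses:
        return ("pending", None)
    answer, count = Counter(guesses).most_common(1)[0]
    if count / len(guesses) >= min_share:
        return ("solved", answer)
    return ("unknown", None)

# Hypothetical tallies: 90% agreement is promoted, 65% is recycled.
clear = review_word(["257"] * 1800 + ["251"] * 200)
murky = review_word(["257"] * 1300 + ["251"] * 700)
```

A word that comes back as "solved" would then be eligible to serve as the known half of a future pair.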
Would make for some interesting kids' books by mykos · 2012-03-29 11:15 · Score: 2

I do not like green eggs and FUCK
I do not like FUCK Sam I am
Re:Um, what? That's exactly what they're doing. by martin-boundary · 2012-03-29 11:45 · Score: 4, Interesting

Yeah, the problem with that is that it can't work when most of the humans are robots. The robots will make guesses using standard algorithms, and their guesses will be pretty consistent with the other robots' guesses (which are quite probably the same robot in another instance). Then Google thinks the robot guess is correct, because it's overwhelmingly the most consistent answer. And humans who give the correct answer get marked wrong, because they're a minority.
It's quite noticeable if you use a site which relies heavily on reCAPTCHAs. For example, when you get a word which has an old English S that looks like a modern lowercase F, you're much better off claiming it's an F instead of giving the correct answer.
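The failure mode described here is easy to reproduce with a toy majority vote. The numbers and the words are made up for illustration: 70 bot submissions that all misread a long s as "f", against 30 humans who read it correctly.

```python
from collections import Counter

def majority_answer(guesses):
    """Return the single most common guess."""
    return Counter(guesses).most_common(1)[0][0]

# Assumed scenario: bots share one OCR engine, so they all make the same mistake.
bot_guesses = ["fufpect"] * 70
human_guesses = ["suspect"] * 30

accepted = majority_answer(bot_guesses + human_guesses)
rejected_humans = sum(g != accepted for g in human_guesses)
print(accepted, rejected_humans)  # prints: fufpect 30
```

Because the bots' errors are correlated, the wrong reading wins the vote and every human who typed the right word disagrees with the "solved" answer.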