Google Using ReCAPTCHA To Decode Street Addresses
smolloy writes "Apparently some users of reCAPTCHA have recently begun seeing photographs appear in their CAPTCHA puzzles: photos that look very much like zoomed-in house numbers taken from Google Street View. It appears that Google has decided to put the reCAPTCHA system to work helping clean up Google Street View images, and 'according to a Google spokesperson, the system isn't limited to street addresses, but also involves street names and even traffic signs.' A large collection of these has appeared on the Blackhatworld website."
This is an incredibly fascinating and great use of the technology.
Yeah, because those street numbers, designed to tell everyone passing by what number the house is on the street, are meant to be private.
Wow that site is so terrible looking that it makes Geocities and myspace look decent. The only thing it's missing is cosmic cursors.
What happens to the other part? Does Google keep recycling it until it has multiples of the same answer? Can we all agree on a word for the addresses just to have some fun with Google?
They're using us to identify our own home and business addresses. Does anyone else feel a little violated by this?
Could just be me being paranoid, but this sounds like something out of a science fiction book. Whoever had the idea to do this, I have to admit, was really using their head though.
If carrots got you drunk, rabbits would be fucked up. - Comedian Mitch Hedberg R.I.P. 03/30/68-2/24/05
Don't feed the trolls... (but you're right, though)
And put your house number in Roman Numerals. Nothing like living in number CLXXIV to screw up the recaptcha. Anyone answering with 174 is likely counted as wrong...
Those who can make you believe absurdities can make you commit atrocities. - Voltaire
Awesome. More invasion of privacy. Fuck Google.
bonch
What makes this more of an invasion of privacy than whatever they used to do to find house numbers? I assume they used some combination of databases, OCR, and paying someone to do it.
I'm surprised that this is a big help to them - if they can identify that something on a house is the house number (as opposed to a shadow or some home design pattern), it's surprising that they can't identify the number itself. It seems like there are going to be relatively few instances where something is identifiable as a house number, but the number itself is not OCRable -- especially when they already have a hint from the neighboring house numbers. Though I guess when you're dealing with identifying millions of structures, even "relatively few" is a lot.
Do they allow the early set of people to type whatever they want, then pass the next set of people on if they type something that matches, let's say, the top five words typed by the first set of people, then assume after 1000 people or whatever that the most frequently entered word properly represents what's in the image?
I have read the quote from Google about what they are doing several times, and I don't see what everyone else sees. It appears to me that they are using the already known street names and numbers as possible ReCAPTCHA images. What they are NOT doing is using the results given by people to define what the image says. The point of the experiment is to determine whether these images are sufficient to separate people from web-bots. I imagine that they will look at the number of 'wrong' answers from both sides of the test, and see if bots are able to parse the street view images significantly more often than the standard test images.
So... can anyone point to something in the Google quote to show me where I went wrong? From TFA, here is the quote:
We’re currently running an experiment in which characters from Street View images are appearing in CAPTCHAs. We often extract data such as street names and traffic signs from Street View imagery to improve Google Maps with useful information like business addresses and locations. Based on the data and results of these reCaptcha tests, we’ll determine if using imagery might also be an effective way to further refine our tools for fighting machine and bot-related abuse online.
Different angles make it hard to be sure you have the number right. If you look at a street photo like a book you're going to OCR, you have first the layout detection, then identify the image part and the text part. Solving this problem would be similar to identifying where the page number is, to be eliminated from the text.
Taking a laser measurement, un-warping the photo, and then doing traditional OCR would be awesome, if they had the forethought to include the laser part in their vast collection, but they didn't. Then you have the multiple "type faces" available.
Anyway, lots of places don't have a specific street address. Type something in and you get a blobby sort of approximation. Or data from Open Street Map - my home address is just a dot in the middle of the street. With street view they could get it more precise.
I would guarantee this is all shots from places like mine, where they may or may not have street names, and definitely don't have address ranges for the blocks. Connect the street name to the GPS tag in the photo, apply that to the orientation of the vehicle, and add the street numbers - accurate mapping better than any in-dash system has today.
I don't see how anyone can be pissy they're doing this.
They already list the number of the house on maps.
My internetting is no good.
Yet Google would have to know what the address number really was in order to validate the reCAPTCHA, so that can hardly be why they are doing it. They don't need to crowdsource an answer that they already know.
No they don't. They also add an altered text image alongside the picture (which presumably they generated), and can use that to validate the CAPTCHA. The street number can be validated by numerical probability (if 70% of them say it is "257", and the numbers "2,5,7" appear frequently in the rest, it is probably "257") even if they don't already know what it is.
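For anyone wondering what that kind of vote counting might look like, here is a minimal Python sketch. The 70% threshold comes from the example above; the function name and the min_votes cutoff are made up for illustration, and none of this is confirmed to be what Google actually runs.

```python
from collections import Counter

def consensus(answers, threshold=0.7, min_votes=5):
    """Return the most common answer if enough people agree on it, else None.

    threshold and min_votes are illustrative assumptions, not known
    reCAPTCHA parameters.
    """
    # Ignore empty submissions and surrounding whitespace.
    votes = Counter(a.strip() for a in answers if a.strip())
    total = sum(votes.values())
    if total < min_votes:
        return None  # not enough data to trust any answer yet
    answer, count = votes.most_common(1)[0]
    return answer if count / total >= threshold else None

# Seven of ten people read "257"; the rest are near misses.
print(consensus(["257"] * 7 + ["251", "267", "25"]))  # 257
```

With a split vote (say three people typing "257" and three typing "251") the function returns None, so the image would simply stay in the "unknown" pile until more people have seen it.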
"None can love freedom heartily, but good men; the rest love not freedom, but license." --John Milton
I don't think you know how reCAPTCHA works. You are always presented with two different items to decode. One of them is always a known answer, and the other they are less sure about, but become more sure after they show it to enough people and get a crowd sourced answer. They don't give you two prompts just to be double sure you are human.
Yet Google would have to know what the address number really was in order to validate the reCAPTCHA, so that can hardly be why they are doing it. They don't need to crowdsource an answer that they already know.
Doubtful. They post two images. One they know and one they don't. They use the data for the one they don't, combine it with data from 1000s of other people who have also solved that captcha to get an accurate picture of what that particular number is. They use the one they know to validate the recaptcha data and verify you're human...
Recaptcha works by pairing a known value with an unknown one; that's why you have to type 2 words.
One of the two words is considered solved, and is the actual captcha. The second word is using you as an OCR.
After enough people provide the same solution for the second word, it goes into the solved category and is used for validation.
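The known/unknown scheme described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name, the promote_after count, and the case-insensitive matching are all hypothetical, and the real service's rules are not public in this thread.

```python
from collections import Counter, defaultdict

class TwoWordCaptcha:
    """Toy model: one control word gates access, one unknown word collects votes."""

    def __init__(self, promote_after=3):
        self.known = {}                    # image id -> accepted answer
        self.votes = defaultdict(Counter)  # image id -> counts per typed answer
        self.promote_after = promote_after # assumed number of agreeing votes

    def submit(self, known_id, known_guess, unknown_id, unknown_guess):
        # The control word decides pass/fail.
        if known_guess.strip().lower() != self.known[known_id].lower():
            return False  # failed the control, so the other guess is discarded
        guess = unknown_guess.strip().lower()
        self.votes[unknown_id][guess] += 1
        # Once enough people agree, the unknown image becomes a control image.
        if self.votes[unknown_id][guess] >= self.promote_after:
            self.known[unknown_id] = guess
            del self.votes[unknown_id]
        return True

captcha = TwoWordCaptcha(promote_after=2)
captcha.known["img-a"] = "morning"
captcha.submit("img-a", "morning", "img-b", "overlooks")
captcha.submit("img-a", "morning", "img-b", "overlooks")
print(captcha.known["img-b"])  # overlooks
```

Note that in this model whatever is typed for the unknown image is accepted as long as the control word is right, which is exactly why coordinated junk answers are a concern.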
They don't have to pay people to validate the addresses, we're doing it for free.
But unless Google is paying a zillion people to validate these images visually,
That is EXACTLY what Google is doing. And the payment is access to the site the reCAPTCHA is protecting.
What they are NOT doing is using the results given by people to define what the image says.
Um, no, that's exactly what ReCaptcha is for! The standard ReCaptcha images are all from old books that were scanned in (and presumably had trouble being OCRed with high confidence), and Google used ReCaptcha to "read" the words.
For heaven's sake, ReCaptcha's MOTTO is: "reCAPTCHA: Stop Spam, Read Books"
I read how it works. Multiple users are shown the same image, and once a few people have identified a given image as the same word, it's treated as the "correct" answer, and then later users have to match that answer to get past the ReCaptcha. This is why they show you more than one word: one word has a "known" answer, the other word is one they're still trying to figure out the "right" answer to.
With the first link, the chain is forged.
I'm glad something is being done. I can't recall how many times I've looked up a street address to find Google Maps reporting it as being 4 or 5 blocks away (on average) from where it actually is.
Thank you for the information, I've often wondered about them.
I only have about a 60% success rate on those swirly semi-inverted ones. My wife's friend's decaptcha software does a much better job than I do with its 79% success rate. I had wondered whether, as they get harder to read, the day was almost here when only machines would be able to decode captchas and prove that they were human.
I said - don't look Ethel!..., but it was too late..., she'd already looked.
I do not like green eggs and FUCK
I do not like FUCK Sam I am
ReCaptcha will accept any sequence of symbols for the unknown word. The most telling sign that a word is unknown is that, out of the two, it is the one that is ACTUALLY A WORD. Other signs are non-standard fonts, scanning distortions, non-Latin symbols, and punctuation marks.
Furthermore, there is a 1-character fault tolerance on the sequence of letters that ReCaptcha actually uses to check whether you pass or fail.
It's quite noticeable if you use a site which relies heavily on recaptchas. For example, when you get a word which has the old English long s, which looks like a modern lowercase f, you're much better off claiming it's an f instead of giving the correct answer.
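A one-character tolerance like the one described above could be checked roughly as follows. This is a sketch, assuming "one character" means a single substitution, insertion, or deletion and that case is ignored; the comment doesn't say which, and the function name is made up.

```python
def within_one_edit(guess, expected):
    """True if guess matches expected after at most one character edit."""
    a, b = guess.lower(), expected.lower()
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Walk past the common prefix to the first mismatch.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substituted character
    return a[i:] == b[i + 1:]          # one extra character in b

print(within_one_edit("fuccess", "success"))  # True
print(within_one_edit("sxcyess", "success"))  # False
```

Under a check like this, typing an f for a long s costs you your single allowed mistake, so any other slip in the same word would fail.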
Plus, street numbers in the US typically go odd/even on either side of the street, so they can extrapolate most of the time.
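As a rough illustration of the odd/even idea, here is a Python sketch that lists the numbers a house could plausibly have given its two same-side neighbours. The function name is hypothetical and it assumes numbers simply step by 2 along one side of the street, which is typical in the US but far from universal.

```python
def plausible_numbers(prev_number, next_number):
    """Numbers strictly between two same-side neighbours, sharing their parity."""
    if prev_number % 2 != next_number % 2:
        raise ValueError("neighbours should be on the same side of the street")
    lo, hi = sorted((prev_number, next_number))
    # Step by 2 so every candidate keeps the odd/even side of its neighbours.
    return list(range(lo + 2, hi, 2))

print(plausible_numbers(255, 259))  # [257]
print(plausible_numbers(170, 178))  # [172, 174, 176]
```

A crowd answer that falls outside the candidate list wouldn't necessarily be wrong, but it would be a reasonable trigger for showing the image to more people.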
your thin skin doesn't make me a troll
/b/ has standardized on "nigger" for anything unreadable.
ReCaptcha will accept any sequence of symbols for the unknown word.
Wait - you can type something other than "nigger" for the unknown word?
One of these days I'm going to do that when someone's looking over my shoulder and get a serious WTF from them.
Back when reCaptcha showed two words that you could find in the dictionary, black on white I had no problem with it, it seemed like a good idea and you might be contributing to digitizing a book or something.
But now you just get randomly generated characters with a zigzag going through the middle and blobs that invert it and it's hard to tell if this one letter is an 'i' or an 'r' or a 't'.
So I don't even bother looking at the real word and just solve the generated one.
ReCaptcha will accept any sequence of symbols for the unknown word.
Wait - you can type something other than "nigger" for the unknown word?
One of these days I'm going to do that when someone's looking over my shoulder and get a serious WTF from them.
To whoever modded the parent at -1, pay attention.
Out of the two images you are presented, one is known, the other is unknown. When a large enough number of people have entered the same answer for the unknown image, it gets moved to the 'known' list with that particular answer.
So on some places like 4chan, there has been a large effort to get as many people as possible to answer the unknown image with the word 'nigger'. If enough people do it on a single unknown image, it will get added to the pool with the "correct" answer set to the word 'nigger'... thus polluting the reCaptcha system. As the percentage of polluted entries in the "known" image list grows, so does the chance that the answer to any reCaptcha is 'nigger'.
And what would that achieve exactly?
Fun.
I didn't know about the nigger thing, but I've always submitted nonsense for the book one.
That is used to digitize books
http://www.google.com/recaptcha
So many addresses have been fuzzy that I thought it could only be a strange design choice.
If Google really cared they would fix Android Chrome to reflow text, instead of discriminating
I thought text in Streetview was blurred out by design in the same way that faces were-- automatically and for security reasons (read: so Google doesn't get sued by crazy OMG I'M ON TEH INTERNET people).
I'd actually prefer if they un-blurred all street numbers and signs. It's fine to rely on Maps' street number location when you're in a huge city, and the difference between 123 fake street and 125 fake street is ten feet or so. But last time I planned a road trip, the difference between 123 Country Side Road and 200 Country Side Road could be dozens of kilometers or more. Often I'll get a recommendation to visit Out Of The Way Restaurant that has the red sign, just keep an eye out for it. I'll go into Street View, "drive" along my intended route looking for that sign-- and pass by dozens of little buildings with red signs that read "{&o /// &&6$#q blurrrrrrrrrrrrrrrrry".
UTF-8: There and Back Again
It very rapidly caused Google to abort all plans to "accept" unknown words when a consensus appeared to form.
You may note that reCAPTCHA now always has a very obvious computer-generated word as the "known" word.
I don't know whether or not they manually accept words for translation after a while, but if they did, they would need at least an automated system to filter out "common" words (i.e. if the same word is typed for many different scanned words, then it is probably not the real word, and it's either people being lazy or they're actively trying to pollute the database).
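A filter along those lines could be as simple as counting how many different images each typed answer shows up on. The Python sketch below is purely illustrative: the function name, the (image_id, answer) input shape, and the max_images cutoff are assumptions, and genuinely common words such as "the" would need an allowlist or a frequency-adjusted cutoff to avoid being flagged.

```python
from collections import defaultdict

def suspicious_answers(submissions, max_images=3):
    """Flag answers typed for more than max_images distinct unknown images.

    submissions is an iterable of (image_id, answer) pairs.
    """
    images_per_answer = defaultdict(set)
    for image_id, answer in submissions:
        images_per_answer[answer.strip().lower()].add(image_id)
    return {
        answer
        for answer, images in images_per_answer.items()
        if len(images) > max_images
    }

# "spam" is typed for five different images; the honest answers appear once each.
subs = [(i, "spam") for i in range(5)] + [(0, "upon"), (1, "which")]
print(suspicious_answers(subs))  # {'spam'}
```

Votes for flagged answers could then be dropped before any consensus count is taken, which covers both the lazy typists and the deliberate polluters.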