cdzrherntphpqjwetxeqjggpvkq (verb) To decrypt an HD-DVD. Example usage: "Bob cdzrherntphpqjwetxeqjggpvkqed The Matrix so that he could watch it in Linux."
Hint:
s = "CDZRHERNTPHPQJWETXEQJGGPVKQ"
n = 0
for c in s:
    n *= 26
    n += ord(c) - 65  # 'A' -> 0 ... 'Z' -> 25
print(n)
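For anyone who'd rather not trace the loop by hand: it reads the string as a base-26 number, with A=0 through Z=25. Here's a sketch of the same thing as a pair of helper functions, written to run on Python 3. The names from_base26 and to_base26 are mine, purely for illustration.

```python
def from_base26(s):
    """Read an uppercase A-Z string as a base-26 integer (A=0 ... Z=25)."""
    n = 0
    for c in s:
        n = n * 26 + (ord(c) - ord("A"))
    return n


def to_base26(n):
    """Inverse of from_base26 for positive integers (hypothetical helper)."""
    digits = []
    while n > 0:
        n, r = divmod(n, 26)
        digits.append(chr(ord("A") + r))
    return "".join(reversed(digits)) or "A"


s = "CDZRHERNTPHPQJWETXEQJGGPVKQ"
n = from_base26(s)
print(hex(n))             # the same number, shown in hexadecimal
print(to_base26(n) == s)  # round-trip check
```

Printing the number in hex is just a different view of the same value the original loop computes.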
Obviously I can't moderate because I've already commented, but here's an honorary +1, Informative. :) Thanks for the info!
Yeah, it's hidden in their wiki:
http://wiki.laptop.org/go/Give_One_Get_One
Will the North American Laptops include any human-power system?
no.
The North American model sadly doesn't come with the hand-crank. It's not clear if those will be available for purchase later on, or if I can use (or mod) my cell-phone hand-crank to work with the XO laptop. Excited to try out the XO though, and I'm very happy to support this project.
This is quite possibly an Internet urban legend. It certainly sounds plausible, but I've never seen a report of such an attack "in the wild". In addition, doing this attack with reCAPTCHA would require a high level of sophistication, as we have security features in place specifically to detect this man-in-the-middle attack.
We have noticed one such "humans filling out CAPTCHAs for spammers" attack on reCAPTCHA, but in this case it was offshore workers being paid to solve CAPTCHAs. We shut them out of the system promptly. (But even if we hadn't, it's still a win over using nothing, because at least the spammers are incurring a non-trivial economic cost for every CAPTCHA solved.)
PWNtcha does not defeat reCAPTCHA, nor are we aware of any existing OCR or CAPTCHA-breaking algorithms that do. We are working with research groups at a couple of universities who are trying to break our CAPTCHA (and if they can, we'll obviously fix it). In case we do notice a break, it's trivial for us to switch to a completely different kind of CAPTCHA (using different distortions). Because our system is a web service, if there is a security breach, we can fix it for all sites at once by simply changing the distortions on our challenge images. This is a big security benefit compared to other CAPTCHA systems that are difficult (at best) to patch and update.
As you point out, if we did get broken on a wide scale, it would be possible to seed bad data into the system. However, it's easy enough for us to simply distrust all responses that happened during the vulnerable period.
Since this is a university project, we do actually care quite a bit about transcribing books :) In fact, that's the aspect of the system that I'm primarily responsible for. However, there is a lot of really interesting data along the lines of what you're suggesting, and I'm sure some of that data will eventually make it into papers.
"I wonder, after this is running for a while, most of the unknown words will be nonsense"
It's already been running for a few months, and we're getting millions of solutions a day, and there's still a pretty good mix of words in the system :) Most words in the source documents aren't nonsense.
You said "people" putting in wrong words (a la the suggestion someone made below about "everyone fill in CowboyNeal!"), which is quite different from automated attacks. For that, we have numerous scripts that notice various forms of anomalous behavior from any given IP. We manually review these to make sure the answers are reasonable. We are also working with CERT, who have a large database of botnetted machines, to detect attacks. I'm not going to give complete details of everything we check, but rest assured that we are very active in preventing attacks -- our goal is to be the best CAPTCHA in the world, and we take security threats very seriously.
In terms of the digital output, we spot-check some of the transcribed pages every day. These spot-checks will also turn up any anomalous solutions, with high probability.
Sorry, but we've already thought of this attack :)
We can compute the daily frequency of each human-provided solution and automatically flag anything that suddenly jumps in popularity. It's especially suspicious if these answers always disagree with the OCR's guess (often the OCR happens to be right, but just doesn't have high confidence).
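A toy sketch of that kind of check, not our production code: count how often each answer was submitted on two consecutive days and flag the ones that jumped. The function name flag_spikes and the thresholds min_count and ratio are made up for illustration, as is the sample data.

```python
from collections import Counter


def flag_spikes(yesterday, today, min_count=20, ratio=5.0):
    """Return answers whose daily count jumped sharply between two days.

    yesterday, today: iterables of submitted answer strings.
    min_count and ratio are illustrative thresholds, not real settings.
    """
    prev = Counter(a.strip().lower() for a in yesterday)
    curr = Counter(a.strip().lower() for a in today)
    flagged = []
    for answer, count in curr.items():
        baseline = prev.get(answer, 0) + 1  # +1 avoids dividing by zero
        if count >= min_count and count / baseline >= ratio:
            flagged.append(answer)
    return sorted(flagged)


yesterday = ["private"] * 30 + ["much"] * 25 + ["cowboyneal"] * 1
today = ["private"] * 32 + ["much"] * 27 + ["CowboyNeal"] * 40
spikes = flag_spikes(yesterday, today)
print(spikes)
```

A real check would also want to compare flagged answers against the OCR's guess for the words they were submitted for, since consistent disagreement is the more suspicious signal.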
"And that's not even counting malice where people deliberately put wrong words in."
We're already getting several million legitimate solutions a day. The chance that a few malicious people would happen to get the same CAPTCHA is relatively small. Also, for many of our words, the OCR's answer happens to be correct -- it just doesn't have high confidence in the word. If a single person agrees with the OCR in this case, we can mark the word as "read" with no further human confirmation. For this reason, many of the words will only ever be shown to a single human.
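Roughly, the decision for a single unknown word could look like the sketch below. This is an illustration under assumptions, not the real logic: the name resolve_word is hypothetical, answers are compared case-insensitively, and votes_needed=3 is a number I picked for the example.

```python
def resolve_word(ocr_guess, human_answers, votes_needed=3):
    """Return the accepted transcription, or None if still undecided.

    One human agreeing with the low-confidence OCR guess settles the word;
    otherwise several humans have to agree with each other.
    """
    counts = {}
    for answer in human_answers:
        answer = answer.strip().lower()
        if answer == ocr_guess.strip().lower():
            return answer  # human confirms the OCR: no more votes needed
        counts[answer] = counts.get(answer, 0) + 1
        if counts[answer] >= votes_needed:
            return answer  # enough humans agree with each other
    return None


print(resolve_word("much", ["much"]))
print(resolve_word("rnuch", ["much", "much"]))
print(resolve_word("rnuch", ["much", "much", "much"]))
```

The first call settles after a single answer, which is why many words never need to be shown to a second person.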
Our demo at http://recaptcha.net/fastcgi/demo/recaptcha keeps track of the number of words you've digitized. :)
A couple things:
1) We've done some studies at CMU that show that recognizing and typing 2 real English words is much easier and faster than typing 6 or 7 random letters and numbers. Would you rather type "private much" (which is what just showed up for reCAPTCHA) or "KXd2cM" (which is what showed up for Yahoo's CAPTCHA)?
2) Any given CAPTCHA is only shown to a couple of users. We're getting millions of legitimate solutions a day, so even a relatively sophisticated bot would have little chance of seeing the same image twice.
The lack of 3-legged things is an "unintended" side effect of the fact that bilateral symmetry is practically ubiquitous in animals (excepting sponges and cnidarians), and any animal with bilateral symmetry is going to end up with an even number of legs. Bilateral symmetry has many evolutionary "uses" aside from locomotion, so it's fair to presume that an odd number of legs *could* still result in efficient locomotion.
Things with wheels aren't popular in nature either, but that doesn't mean wheels aren't effective for locomotion.
"When I take action, I'm not going to fire a $2 million missile at a $10 empty tent and hit a camel in the butt. It's going to be decisive." --George W. Bush
Bush said that, but didn't live up to it. This is the main reason the US is losing in Iraq: we spend billions of dollars to fight against people with cheap weapons and little training. Local militias can be a very "cheap and effective defense" against an occupying country, even if their technology is decades out of date.
"Having 2,000+ emails sitting in your inbox is plain nuts."
Lock me up, then :)
---Mutt: =ok [Msgs:15816 Flag:16]---
Almost 16K non-spam messages in my mailbox and it doesn't drive me crazy. Of these, 16 are flagged as important, which basically means "I should look at these every day as a reminder for stuff I need to do." And a lot of it is stuff like "Barbecue Friday, 7pm". I rewrite the subjects of important mail so that it's clear at a glance why I chose to flag the message. With a couple hand-crafted keybindings and some knowledge of regexes, sorting through my mail is a breeze.
I think it's faster for me to process and type two English words than to type one word that's just random letters. My brain is trained to read and type English; it's not trained to type nonsense.
CAPTCHAs simply tell whether there's a person sitting at the other end of the machine. No CAPTCHA can tell you whether that person is a malicious user or not. With this approach, at least the spammers are helping to digitize books.
"So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time."
Well... sort of. Multiple agreements are required before the system will accept that it knows the spelling of a previously unknown word. So you're not going to singlehandedly subvert the system; at the very least you need a cabal of friends. But with millions of words available in the system, the chance that you and a bunch of friends will all get the same word and write in the same bogus data is pretty close to zero. I'm not saying this system is impossible to game, but I think it'd be a heck of a lot easier (and more rewarding, if it's the sort of thing that floats your boat) to vandalize Wikipedia instead.
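To put a rough number on "pretty close to zero": assume, purely for illustration, that each request gets one unknown word picked uniformly at random from a fixed pool, and that each colluder solves exactly one CAPTCHA. Neither assumption describes how words are really served, and the pool size and number of colluders below are made-up values.

```python
from fractions import Fraction


def collusion_probability(pool_size, colluders):
    """Chance that every colluder is served the same unknown word.

    The first colluder can get any word; each of the others must
    independently be served that same word (uniform-pool assumption).
    """
    return Fraction(1, pool_size) ** (colluders - 1)


p = collusion_probability(pool_size=1_000_000, colluders=3)
print(float(p))
```

Repeated attempts raise the odds, but every attempt also means correctly solving the test word, which is real work.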
The system serves two words to the user. The system knows the correct answer to one of these words -- this is the one used to test whether the user is a human or a bot. If the user got the test word right, then there's a good chance they also got the unknown word right. If a bunch of humans all agree on the same transcription of a given unknown word, the system will eventually "know" the correct spelling of the unknown word and can then serve it as a "known" word in the future.
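In code, the two-word idea might be sketched like this. It's a toy model, not the real service: the name grade_response is hypothetical, and a case-insensitive exact match stands in for whatever comparison the real system does.

```python
def grade_response(known_word, known_answer, unknown_answer, votes):
    """Check the test word; if it is right, record a vote for the unknown word.

    votes maps each proposed transcription of the unknown word to the
    number of verified humans who typed it.
    """
    passed = known_answer.strip().lower() == known_word.strip().lower()
    if passed:
        guess = unknown_answer.strip().lower()
        votes[guess] = votes.get(guess, 0) + 1
    return passed


votes = {}
grade_response("private", "private", "much", votes)  # passes, vote counted
grade_response("private", "privale", "spam", votes)  # fails, answer ignored
grade_response("private", "Private", "much", votes)  # passes, vote counted
print(votes)
```

Answers from anyone who fails the test word are thrown away, so only verified humans get a say on the unknown word.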
The article is lacking some information. Here are some better links:
Official reCAPTCHA site
Hide your email address with reCAPTCHA (super easy!)
A more detailed blog post about how the system works
Disclaimer: I work with Luis von Ahn, who's the professor running the reCAPTCHA project.
"Actually you'll find that "core" Republican voters REALLY REALLY hate trial attorneys."
Really?
Hint: #2 on the list of Bush donors is trial lawyers.
(Of course, the lawyers have their fingers in every pot: they're #1 on Kerry's list and #2 on Nader's list.)
Gmail has "solved" the *easy* half of the spam problem. Indeed, I never get spam in my Gmail mailbox. On the other hand, about 1-3 non-spam messages get marked as Spam every week. I'd rather the filter erred in the opposite direction -- I'd much rather see 10 spams a day but be assured that there are no false positives.
I don't know that Slash is really to blame here. The initial article had the same text verbatim, also without any superscripts. See the bottom of this page: http://www.extremetech.com/article2/0,1697,2131553,00.asp
"How is a mail delivery protocol supposed to be able to distinguish between these situations?"
I've thought of this before and I think that part of the problem is that mail delivery is not a true end-to-end protocol. As far as SMTP is concerned, mail delivery "succeeds" when my mail server accepts an email for me. In reality, I have my own set of spam filters that run after the mail has been accepted. If one of these spam filters rejects a mail, ideally the sender would be informed -- that way a legitimate sender can try harder to "defeat" my filter, or simply contact me out-of-band (e.g. call my cell phone). If I don't have to worry about ever rejecting legitimate email, I can configure some very aggressive spam filters.
I'm not 100% sure how this would be implemented. I envision something closer to IM -- you run a "client" on your machine which is responsible for accepting & rejecting requests. If your client isn't "on" when someone tries to send you a message, their client queues the message, and tries sending it as soon as it finds that you're available. I'm very open to suggestions/criticisms.
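Here's a very rough sketch of what I mean, with everything in one process and no networking. All the names (Recipient, Sender, offer, flush) are hypothetical, and a keyword match stands in for a real spam filter. The point is only that the sender learns about a rejection directly, and queues mail while the recipient is offline.

```python
from collections import deque


class Recipient:
    def __init__(self, blocked_words):
        self.online = False
        self.blocked_words = blocked_words  # stand-in for a real spam filter
        self.inbox = []

    def offer(self, message):
        """Accept or reject a message; the sender sees the verdict at once."""
        if any(word in message.lower() for word in self.blocked_words):
            return False
        self.inbox.append(message)
        return True


class Sender:
    def __init__(self):
        self.queue = deque()
        self.rejected = []  # messages the sender knows were refused

    def send(self, recipient, message):
        self.queue.append((recipient, message))
        self.flush()

    def flush(self):
        """Try each queued message once; requeue if the recipient is offline."""
        for _ in range(len(self.queue)):
            recipient, message = self.queue.popleft()
            if not recipient.online:
                self.queue.append((recipient, message))
            elif not recipient.offer(message):
                self.rejected.append(message)


alice = Recipient(blocked_words=["viagra"])
bob = Sender()
bob.send(alice, "Barbecue Friday, 7pm")  # alice is offline, so this is queued
alice.online = True
bob.flush()                              # now it is delivered
bob.send(alice, "Cheap viagra")          # rejected, and bob knows it
print(alice.inbox, bob.rejected)
```

Because a legitimate sender finds out about the rejection, they can rephrase or contact me out-of-band, which is what would let the filter be aggressive.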
In Linux, digikam has the capability to do this -- there's a "Sync Gallery" plugin that claims to keep your library in sync with a Gallery install, presumably including the tags and other metadata. However, I've not tried it personally (though I have used digikam extensively to do other things).