Randomizing Survey Answers For Accuracy
Saint Aardvark writes: "The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results. Strangely enough, they can get accurate information out of the aggregate of enough answers -- but it's completely anonymized. Since conservative estimates say nearly half of all survey answers are bogus, there's an interest in persuading people to be more truthful. As ever, you can use the Random NY Times Registration Generator to falsify your registration details and read the article..."
Ok, fine. They've managed to come up with a model that doesn't actually collect any data. And how will this help people to enter REAL data? People don't give data because they don't trust the company. If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?
Javascript + Nintendo DSi = DSiCade
Did you lie when answering this question?
O Yes
O No
O Cowboy Neal told me the answer
Do they expect that people will enter real data on the mere promise that it will be stored in some randomized, aggregate, or other form that does not invade their privacy? If the corporation could not be trusted in the first place, no statement they make will make them trustworthy.
Sounds all fine and dandy for science, but people are usually honest with a professional researcher who is going to guarantee their anonymity, and moreover the research data is going to be used for something tangible rather than selling something right back to you.
Market researchers want information on YOU. They want generic info on your demographics, but this information has been available from other venues for a long time. When spyware and other information-gathering techniques are employed against someone, they are being used to collect data to target marketing at that person specifically. Literally employed against that person.
As such, I'll still say that I'm female, in my 50's, from Yemen and making less than $12,000 a year. Randomize away.
It doesn't take the irony nazi to point out the sweet, sweet irony in using the random NYT account generator to read the story.
Bringing irony to the Slash-masses
I think there is something to be said about companies that ask for information as an option versus companies that ask for information as a requirement.
For example, company XYZ has released a program called Widget. In order to download Widget, users are asked to fill out a survey so that XYZ may gauge the demographics of their target audience.
Some sites will allow you to bypass this step and proceed to download the software. Other sites require this information before revealing the download link. I think that the psychological difference between "required" and "optional" would heavily influence the honesty of the answers.
I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company. The difference? I'm not being forced to give anything up in advance.
Is this true in general? I don't know. But it makes sense to me.
I have an idea for something to replace the survey forms - an AI program to carry out a conversation with the user. Ah ha! We just have to watch out for users that say to the AI - "I am lying" - and hope the AI doesn't need therapy.
Price, Quality, Time. Pick none. What, you thought you had a choice?
Personally, if I don't trust them enough to tell them how much I make, I'm not going to trust them to randomize my results. I don't see how this will increase accuracy -- especially if I keep telling everyone I'm a 108 year old female in Uganda making $100,000+ per year who works in the sales department of an Educational field and plans to make purchases of an SUV, a house, a console gaming system, and an optical mouse in the next six months and rates their internet experience as very low. My e-mail address is sjobs@mac.com and I would like to apply for your quarterly, monthly, weekly, daily, and hourly newsletters and I do give permission to pass this information to your affiliates.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
Let me summarize:
1) People lie on surveys, most likely because they don't trust the taker - but probably also just because they like putting in other answers (yeah, I'm a millionaire, woohoo!, etc). This only addresses the trust issue, ignoring other potential sources of lying.
2) In order to work around the trust issue, they've developed a method of injecting random noise into the original answers as they are recorded and then extracting useful data in the end.
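To make point 2 concrete, here is a minimal sketch of the general idea, not the IBM researchers' actual algorithm (the summary doesn't describe it). It assumes numeric answers (ages), zero-mean uniform noise, and a hypothetical helper called randomize; all names and numbers are made up for illustration.

```python
import random

def randomize(answer, spread, rng):
    """Return the answer plus zero-mean uniform noise in [-spread, +spread]."""
    return answer + rng.uniform(-spread, spread)

rng = random.Random(42)

# Hypothetical true ages of 50,000 respondents (the surveyor never stores these).
true_ages = [rng.randint(18, 80) for _ in range(50_000)]

# What the surveyor actually records: each age shifted by up to 30 years.
recorded = [randomize(age, 30, rng) for age in true_ages]

true_mean = sum(true_ages) / len(true_ages)
recorded_mean = sum(recorded) / len(recorded)
print(f"true mean {true_mean:.2f}, recorded mean {recorded_mean:.2f}")
```

Any single recorded value says little about the person behind it, but because the noise averages out to zero, the mean of the recorded values lands close to the mean of the true ones once there are enough answers.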
Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.
Without bothering with all this mumbo-jumbo, I can build a trustworthy system. I simply record survey statistics, and I promise not to use the individuals' personal data individually.
They can either trust me that I'm telling the truth about this, or they can lie. In the IBM researchers' scenario, the users are again asked to trust that the backend system doesn't compromise them, and again they can choose to trust it or choose to lie.
Given the above, why on earth would you bother with this research and unnecessary complexity? It's not going to make any difference over just promising your users that you don't invade their privacy. You could replace their research results with a banner on top of the survey that says "After you submit your data to us, we use Magical HibiJibi technology to prevent ourselves from invading your privacy, so please trust us and answer truthfully"
What a waste of research.
11*43+456^2
No, what they are saying is that their random number generator is very random. If you have a perfectly randomly generated bell curve set of numbers, it makes it easier to reconstruct the original data. Think of an audio signal for example. If you have a sine wave (your data) mixed with white noise (perfectly random), you can quite easily pick out the sine wave. It's the one frequency that is louder than all the rest. However, if instead of white noise you have noise that is not perfectly random, you will not see a clear sine wave, but several different frequencies.
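The audio analogy is easy to try out. This is a rough sketch of the analogy only, not of anything IBM does: it buries a sine wave in Gaussian white noise and looks for the loudest bin of a naive discrete Fourier transform. The sample count, tone frequency, noise level, and the name magnitude_spectrum are all arbitrary choices for the example.

```python
import cmath
import math
import random

def magnitude_spectrum(samples):
    """Naive discrete Fourier transform; returns |X[k]| for k = 0 .. N/2 - 1."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]

rng = random.Random(1)
N = 512          # number of samples in the window
TONE_BIN = 20    # the sine completes 20 cycles over the window

# Sine wave ("your data") plus white noise of comparable amplitude.
signal = [
    math.sin(2 * math.pi * TONE_BIN * t / N) + rng.gauss(0, 1)
    for t in range(N)
]

spectrum = magnitude_spectrum(signal)
# Skip bin 0 (the DC offset) and pick the loudest remaining frequency.
loudest = max(range(1, len(spectrum)), key=spectrum.__getitem__)
print("loudest frequency bin:", loudest)
```

The noise spreads its energy thinly over every bin while the sine puts all of its energy into one, which is why it stands out even though the raw samples look like static.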
----
All of whose base are belong to the what-now?
Interesting approach, but useless unless people actually understand and trust the system. For this to happen will probably require widespread adoption, an easy to understand explanation of the process, and assurances that answers really are randomized. These requirements obviously force a bit of a chicken and the egg scenario.
Explaining the whole randomization process (how it protects privacy, how it provides useful info) will be a little much for most people I think, but a good user interface might alleviate this, perhaps with a 'randomize' button that is used before hitting the 'submit' button. This would take the user input and change it right in front of their eyes. Of course many would be rightfully concerned that the randomize button is just for show (or simply encodes but doesn't anonymize), but I think that enough people might buy into the false sense of security that demonstrated 'randomization' provides to at least partly improve the % of bona fide results. Also, the system could be set up so users who don't mind submitting traceable information could be encouraged ("extra 10% off") to submit without randomization, with a simple flag sorting data into randomized/anonymous and non-randomized/non-anonymous data.
This approach would be even better if the randomization approach becomes a ubiquitous standard backed by a consistent and legally accountable and well-known entity/brand (IBM for instance). I'm not sure how well an open solution would work unless there was a central group assuming responsibility and accountability for the system, enforcing trademarks, and suing spoofers. Also, people feel safer when they feel there's someone to blame for any abuse/mistakes (hence, giving their credit card freely to a waiter but not to a website).
My next sig will be ready soon, but friends can beat the rush!
As another poster observes, if you don't trust them with the data, why trust them to randomize it?
My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:
With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.
If the second hand is between 1-30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31-60, they tell the truth.
The surveyor knows very precisely the expected number of 'lies' and can extract accurate data, and the respondent has confidence in and control over their privacy. All without a transistor.
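Here is a small simulation of the coin-and-clock trick as described above, assuming a fair coin and a second hand that is equally likely to be in either half of the minute. Under those assumptions the chance of a "yes" is 0.25 + 0.5 * p, where p is the share of people whose honest answer is yes, so the surveyor can solve for p. The function names and the 30% figure are hypothetical.

```python
import random

def respond(truth, rng):
    """One respondent's answer under the coin-and-clock rule."""
    second = rng.randint(1, 60)          # position of the second hand
    coin_is_heads = rng.random() < 0.5   # private coin flip
    if second <= 30:
        return coin_is_heads             # answer per the coin (heads = yes)
    return truth                         # otherwise tell the truth

def estimate_true_rate(answers):
    """Invert P(yes) = 0.25 + 0.5 * p to recover p from the observed yes rate."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)

rng = random.Random(7)
TRUE_RATE = 0.30   # hypothetical share whose honest answer is "yes"

answers = [respond(rng.random() < TRUE_RATE, rng) for _ in range(100_000)]
estimate = estimate_true_rate(answers)
print(f"estimated yes rate: {estimate:.3f}")
```

No individual "yes" proves anything, since it may have come from the coin, yet with enough respondents the estimate settles near the true rate. The cost is statistical: roughly half the answers carry no information, so you need more respondents for the same precision.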
"Judge, I did not know she was 14 years old. I'm pleading innocent by reason of randomized, aggregate data!"
This is not a dream, not a dream...we are transmitting from the year 1-9-9-9.
Hey, it's me. The guy who put together and hosts the New York Times random login generator. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.
I'd also like to remind everyone that you're welcome to download, copy, and mirror the source of that page on your own servers, or even keep it as an HTML page on your desktop or whatever. It's just javascript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it or they start blocking requests from my page or something. (It will also help distribute my load, though I haven't had any real trouble yet...)
As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?
A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!
Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.
"How to Do Nothing," kids activities, back in print!
I don't really understand how SSL works, but I trust my browser (a bit) and when I see https in the URL then I'm comfortable with that. Not because I fully understand SSL, but because I listen to the opinions of people who do.
So if it became accepted practice that pressing the Randomize button on your browser (why not build it into the browser?) made your response anonymous, then nobody would need to understand it any more than they do SSL.
Actually, why not have a new HTTP method, POST-RANDOM instead of POST, so the server knows that the data has been randomized?
80N
Timeo idiotikOS et dona ferentes