Randomizing Survey Answers For Accuracy
Saint Aardvark writes: "The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results. Strangely enough, they can get accurate information out of the aggregate of enough answers -- but it's completely anonymized. Since conservative estimates say nearly half of all survey answers are bogus, there's an interest in persuading people to be more truthful. As ever, you can use the Random NY Times Registration Generator to falsify your registration details and read the article..."
No, what they are saying is that their random number generator is very random. If the noise you add is a perfectly random, bell-curve-distributed set of numbers, it is easier to reconstruct the original data. Think of an audio signal, for example. If you have a sine wave (your data) mixed with white noise (perfectly random), you can quite easily pick out the sine wave. It's the one frequency that is louder than all the rest. However, if instead of white noise you have noise that is not perfectly random, you will not see one clear sine wave, but several different frequencies.
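To make the audio analogy concrete, here is a rough sketch, not anything from the article. It buries a sine wave in Gaussian white noise and looks for the loudest frequency bin with a naive discrete Fourier transform. The names `dft_magnitudes`, `loudest_bin`, `N`, and `TONE_BIN`, and the noise level, are all made up for illustration.

```python
import cmath
import math
import random


def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns magnitudes for bins 0 .. n/2 - 1."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]


def loudest_bin(samples):
    """Index of the strongest non-DC frequency bin."""
    mags = dft_magnitudes(samples)
    return max(range(1, len(mags)), key=lambda k: mags[k])


rng = random.Random(0)   # fixed seed so the run is repeatable
N = 256                  # number of samples (assumed)
TONE_BIN = 20            # the sine completes 20 cycles over the window (assumed)

# Sine wave ("your data") plus white noise with the same order of amplitude.
signal = [math.sin(2 * math.pi * TONE_BIN * t / N) + rng.gauss(0, 1.0)
          for t in range(N)]

print("loudest bin:", loudest_bin(signal), "tone was placed in bin:", TONE_BIN)
```

The tone's energy all lands in one bin while the noise is spread thinly over every bin, which is why the peak stands out even though the raw samples look like garbage.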
----
All of whose base are belong to the what-now?
Did you read the article? This doesn't randomize the poll at all. It won't do anything unless people answer truthfully. It's for protecting privacy by randomizing the raw data they receive.
Oops....you'll know what I'm talkin about in a bit.
As another poster observes, if you don't trust them with the data, why trust them to randomize it?
My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:
With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.
If the second hand is between 1 and 30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31 and 60, they tell the truth.
The surveyor knows very precisely the expected number of 'lies' and can extract accurate data, and the respondent has confidence in and control over their privacy. All without a transistor.
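Here is a quick simulation of the coin-and-clock trick as I understand it from the description above. It is a sketch under assumptions: the 30% true rate, the sample size, and the function names `respond` and `estimate_true_rate` are invented for the example. Half the answers come from the coin, so the chance of a "yes" is 0.25 + 0.5 * p, where p is the true rate, and the surveyor just solves for p.

```python
import random


def respond(truth, rng):
    """One respondent's answer under the coin-and-clock scheme."""
    second = rng.randint(1, 60)          # where the second hand is, 1..60
    coin_is_heads = rng.random() < 0.5   # private coin flip, heads = yes
    if second <= 30:
        return coin_is_heads             # answer per the coin
    return truth                         # answer truthfully


def estimate_true_rate(answers):
    """Invert P(yes) = 0.5 * 0.5 + 0.5 * p to recover p."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)


rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_RATE = 0.30          # hypothetical share of true "yes" respondents
N = 100_000               # hypothetical number of respondents

truths = [rng.random() < TRUE_RATE for _ in range(N)]
answers = [respond(t, rng) for t in truths]
estimate = estimate_true_rate(answers)

print("true rate:", TRUE_RATE, "estimated:", round(estimate, 3))
```

No single answer tells the surveyor anything certain about that person, but the aggregate estimate lands close to the true rate once the sample is large.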
Hey, it's me. The guy who put together and hosts the New York Times random login generator. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.
I'd also like to remind everyone that you're welcome to download, copy, and mirror the source of that page on your own servers, or even keep it as an HTML page on your desktop or whatever. It's just JavaScript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it or they start blocking requests from my page or something. (It will also help distribute my load, though I haven't had any real trouble yet...)
The idea of using randomness to get better survey results is not a new one. In his 1990 book "Innumeracy", John Allen Paulos posits a system for asking a potentially embarrassing yes-or-no question whereby the examiner asks the subject to flip a fair coin before responding. If the subject gets heads he should give the embarrassing answer; if tails, he should tell the truth. The idea is that the subject is then spared the trauma of giving the embarrassing answer, since the examiner is not told the result of the coin flip and it is possible the subject just flipped heads. Knowing the "probability distribution" of a fair coin, it can then be assumed that half the respondents gave the embarrassing answer as a result of their coin flip. These can then be removed from the data, leaving a statistically accurate result.
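The "remove half the respondents" step works out as below. This is only a sketch with made-up numbers (1000 respondents, 650 embarrassing answers), and the function names are mine, not Paulos's. It also shows the price of the privacy: the estimate is noisier than a direct, honest survey of the same size would be.

```python
import math


def estimate_from_counts(yes_count, n):
    """Drop the n/2 'yes' answers expected from heads; the rest answered truthfully."""
    forced_yes = n / 2
    truthful = n / 2
    return (yes_count - forced_yes) / truthful


def standard_error(yes_count, n):
    """Approximate standard error of the estimate above (binomial approximation)."""
    observed = yes_count / n
    return 2 * math.sqrt(observed * (1 - observed) / n)


N = 1000          # hypothetical number of respondents
YES_COUNT = 650   # hypothetical number of embarrassing answers received

p = estimate_from_counts(YES_COUNT, N)
se = standard_error(YES_COUNT, N)
direct_se = math.sqrt(p * (1 - p) / N)   # what an honest direct survey would give

print("estimated true rate:", round(p, 3))
print("standard error:", round(se, 4), "vs direct survey:", round(direct_se, 4))
```

With these numbers the estimate is 150 truthful "yes" answers out of 500 truthful respondents, and the standard error comes out roughly twice that of the direct survey.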
It seems that what the IBM folks are doing is a straightforward extension of this idea to a larger response domain (numerical ages as opposed to boolean questions) and to a more automated system in which the website flips the coin for the subject and amends his answer accordingly.
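For numerical answers the same idea looks roughly like this. To be clear, this is my own guess at the flavor of it, not IBM's actual method, which isn't spelled out here: each age gets a random offset from a publicly known distribution, and the surveyor backs out aggregate statistics. The noise level `NOISE_SD`, the age range, and the sample size are all assumptions, and the sketch only recovers the mean and variance rather than the full distribution.

```python
import random
import statistics

NOISE_SD = 20.0   # assumed, publicly known spread of the random offset (years)

rng = random.Random(7)   # fixed seed so the run is repeatable

# Hypothetical true ages; the surveyor never sees these.
true_ages = [rng.randint(18, 80) for _ in range(50_000)]

# What actually gets submitted: the age plus a zero-mean random offset.
reported = [age + rng.gauss(0, NOISE_SD) for age in true_ages]

# Zero-mean noise leaves the average unchanged in expectation.
est_mean = statistics.fmean(reported)

# Independent noise adds its variance, so subtract the known amount.
est_var = statistics.pvariance(reported) - NOISE_SD ** 2

print("mean age: true", round(statistics.fmean(true_ages), 2),
      "estimated", round(est_mean, 2))
print("variance: true", round(statistics.pvariance(true_ages), 1),
      "estimated", round(est_var, 1))
```

Any single reported age can be off by decades, so it says little about the individual, yet the aggregates come out close to the real ones.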
As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?
A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!
Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.
"How to Do Nothing," kids activities, back in print!
My partner and I have a company that does this for large corporations (a great deal of it in the automotive sector), and here's what we've found.
Frequently, the people who give input simply misread questions... for example 'How many males over the age of 18 in your household INCLUDING YOU' as opposed to 'NOT COUNTING YOURSELF'. Or they make typos. Error checking can often fix that. It's wrong to say the whole dataset is incorrect just because someone mis-keyed their zip code.
We've found that the most positive way to get good data is to get people that WANT to tell you their opinions to take the survey. Forcing someone to take the survey for free stuff or to take part in something just doesn't work. Giving them the free stuff then saying "Hey, would you like to give us your opinion" on the other hand, does. The only drawback is that you would assume you're tainting the respondent's opinion. Given the amount of research we've put in, we've actually found the opposite... people say "hey, I've already got my free shit, now I'll tell em how I REALLY feel". I don't see much of a purpose in what IBM has come up with.