Randomizing Survey Answers For Accuracy
Saint Aardvark writes: "The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results. Strangely enough, they can get accurate information out of the aggregate of enough answers -- but it's completely anonymized. Since conservative estimates say nearly half of all survey answers are bogus, there's an interest in persuading people to be more truthful. As ever, you can use the Random NY Times Registration Generator to falsify your registration details and read the article..."
In the past, I'd give false answers. Now I'll need to randomize my true/false answers to throw their randomness off.
Ok, fine. They've managed to come up with a model that doesn't actually collect any data. And how will this help people to enter REAL data? People don't give data because they don't trust the company. If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?
Did you lie when answering this question?
O Yes
O No
O Cowboy Neal told me the answer
Do they expect that people will enter real data on the mere promise that it will be stored in some randomized, aggregate, or other form that does not invade their privacy? If the corporation could not be trusted in the first place, no statement they make will make them trustworthy.
Sounds all fine and dandy for science, but people are usually honest with a professional researcher who is going to guarantee your anonymity, and moreover the research data is going to be used for something tangible rather than selling something right back to you.
Market researchers want information on YOU. They want generic info on your demographics, but this information has been available from other venues for a long time. When spy ware and other information gathering techniques are employed against someone they are being used to collect data to target marketing at that person specifically. Literally employed against that person.
As such, I'll still say that I'm female, in my 50's, from Yemen and making less than $12,000 a year. Randomize away.
It doesn't take the irony nazi to point out the sweet, sweet irony in using the random NYT account generator to read the story.
Bringing irony to the Slash-masses
I don't see how people would trust this any more than entering it normally.
Typical session:
What is your age? (Results will be randomized)
23
OK, we're putting down you are 28 based on a random number we picked. Aren't we good to protect your privacy?
(Then behind the scenes the database gets the real age put into it, how will the user ever know?)
Even if the user can view their profile later on, the database can just store their real age + the so-called random modifier, and the user will be none the wiser.
What a pointless "technology".
I've had enough abrasive sigs. Kittens are cute and fuzzy.
So what they're saying is that they've proven that their random number generator isn't really all that random? :p
What's the point of living if you can't screw with market research?
:)
Dr. Ann Cavoukian sounds like she can help you with that too. Maybe that was her plan in the first place.
I think there is something to be said about companies that ask for information as an option versus companies that ask for information as a requirement.
For example, company XYZ has released a program called Widget. In order to download Widget, users are asked to fill out a survey so that XYZ may gauge the demographics of their target audience.
Some sites will allow you to bypass this step and proceed to download the software. Other sites require this information before revealing the download link. I think that the psychological difference between "required" and "optional" would heavily influence the honesty of the answers.
I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company. The difference? I'm not being forced to give anything up in advance.
Is this true in general? I don't know. But it makes sense to me.
I have an idea for something to replace the survey forms - an AI program to carry out a conversation with the user. Ah ha! We just have to watch out for users that say to the AI - "I am lying" - and hope the AI doesn't need therapy.
Price, Quality, Time. Pick none. What, you thought you had a choice?
Personally, if I don't trust them enough to tell them how much I make, I'm not going to trust them to randomize my results. I don't see how this will increase accuracy -- especially if I keep telling everyone I'm a 108 year old female in Uganda making $100,000+ per year who works in the sales department of an Educational field and plans to make purchases of an SUV, a house, a console gaming system, and an optical mouse in the next six months and rates their internet experience as very low. My e-mail address is sjobs@mac.com and I would like to apply for your quarterly, monthly, weekly, daily, and hourly newsletters and I do give permission to pass this information to your affiliates.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
heh, with a name like that, he must not get many new clients...
"hello mr. smith, you have a 1pm appointment with dr. 'ka-voa kee-an' today."
"no thanks, I feel better."
"nonsense- the doctor just got the death machine warmed up!"
(and yes, I know it's not the same guy)
Let me summarize:
1) People lie on surveys, most likely because they don't trust the taker - but probably also just because they like putting in other answers (yeah, I'm a millionaire, woohoo!, etc). This only addresses the trust issue, ignoring other potential sources of lying.
2) In order to work around the trust issue, they've developed a method of injecting random noise into the original answers as they are recorded and then extracting useful data in the end.
Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.
Without bothering with all this mumbo-jumbo, I can build a trustworthy system. I simply record survey statistics, and I promise not to use the individuals' personal data individually.
They can either trust me that I'm telling the truth about this, or they can lie. In the IBM researchers' scenario, the users are again asked to trust that the backend system doesn't compromise them, and again they can choose to trust it or choose to lie.
Given the above, why on earth would you bother with this research and unnecessary complexity? It's not going to make any difference over just promising your users that you don't invade their privacy. You could replace their research results with a banner on top of the survey that says "After you submit your data to us, we use Magical HibiJibi technology to prevent ourselves from invading your privacy, so please trust us and answer truthfully".
What a waste of research.
11*43+456^2
I'm always a bit skeptical when I'm told I'm about to be surveyed anonymously, and I can't think of a way that this can be implemented (or at least is likely to be implemented) that would reassure me. The non-skeptics are filling in their information already. Perhaps businesses could pick one in five to survey and offer the people who don't want to take it the ability to just skip it; I'll bet a good amount of crap in the databases is coming from people who have to fill in eighty mandatory fields for free e-mail or music or whatever.
Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
I agree with lots of folks here that this system works only if you don't have to trust the remote site to apply the obfuscating transformation. Here's a suggestion to make things somewhat more transparent.
Create a form with attached Javascript. You enter the real data and hit the "obfuscate" button. The script then locally adds noise to your answers. At this point, the "obfuscate" button turns into "submit", allowing you to send the visibly obfuscated responses to the remote site.
Of course, you'll probably want to read the source to make sure the real answers are not sent along with the obfuscated ones. Still, this scheme would go a ways toward creating the perception of honesty.
Interesting approach, but useless unless people actually understand and trust the system. For this to happen will probably require widespread adoption, an easy to understand explanation of the process, and assurances that answers really are randomized. These requirements obviously force a bit of a chicken and the egg scenario.
Explaining the whole randomization process (how it protects privacy, how it provides useful info) will be a little much for most people I think, but a good user interface might alleviate this, perhaps with a 'randomize' button that is used before hitting the 'submit' button. This would take the user input and change it right in front of their eyes. Of course many would be rightfully concerned that the randomize button is just for show (or simply encodes but doesn't anonymize), but I think that enough people might buy into the false sense of security that demonstrated 'randomization' provides to at least partly improve the % of bona fide results. Also, the system could be set up so users who don't mind submitting traceable information could be encouraged ("extra 10% off") to submit without randomization, with a simple flag sorting data into randomized/anonymous and non-randomized/non-anonymous data.
This approach would be even better if the randomization approach becomes a ubiquitous standard backed by a consistent and legally accountable and well-known entity/brand (IBM for instance). I'm not sure how well an open solution would work unless there was a central group assuming responsibility and accountability for the system, enforcing trademarks, and suing spoofers. Also, people feel safer when they feel there's someone to blame for any abuse/mistakes (hence, giving their credit card freely to a waiter but not to a website).
My next sig will be ready soon, but friends can beat the rush!
Basically, people providing false answers will very often pick a false answer based on location. Many will pick the middle option, with others being picked less, though the first and last may have an oddly large number of people choosing them compared to the second and second-to-last options.
/. poll that illustrates the phenomenon. For an exercise, imagine what the results would look like if the offerings were randomly ordered for each person the poll was shown to. My bet is each would be near 12.5%.
When the answers are given in random order, each cycles into the different spots. The liars are actually cancelling out other liars who used the form before them. The differences in the answers are mostly based on how the truth tellers answered (I say mostly because some liars may have a different way of selecting a lie, such as the longest string offered in the answers), and so you can derive more meaningful statistics from them.
See this
-no broken link
Did you read the article? This doesn't randomize the poll at all. It won't do anything unless people answer truthfully. It's for protecting privacy by randomizing the raw data they receive.
Oops....you'll know what I'm talkin about in a bit.
As another poster observes, if you don't trust them with the data, why trust them to randomize it?
My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:
With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.
If the second hand is between 1-30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31-60, they tell the truth.
The surveyor knows very precisely the expected number of 'lies' and can extract accurate data, and the respondent has confidence and control over their privacy. All without a transistor.
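The arithmetic behind the coin-and-clock trick can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coin is fair, the second hand falls in each half of the minute equally often, and the function names and the 30% true rate are made up for the simulation. Under those assumptions the observed share of "yes" answers is 0.25 + 0.5 * (true rate), which can be inverted.

```python
import random

def simulate_responses(true_rate, n, rng):
    """Simulate n respondents following the coin-and-clock rule."""
    answers = []
    for _ in range(n):
        truth = rng.random() < true_rate        # the respondent's real answer
        use_coin = rng.random() < 0.5           # second hand in the first half-minute
        if use_coin:
            answers.append(rng.random() < 0.5)  # heads = yes, regardless of the truth
        else:
            answers.append(truth)               # otherwise answer honestly
    return answers

def estimate_true_rate(answers):
    """Invert observed = 0.5 * 0.5 + 0.5 * true_rate."""
    observed = sum(answers) / len(answers)
    return 2.0 * (observed - 0.25)

rng = random.Random(42)                          # fixed seed so the run is repeatable
answers = simulate_responses(0.30, 100_000, rng)
estimate = estimate_true_rate(answers)
print(f"estimated true 'yes' rate: {estimate:.3f}")
```

No single answer reveals anything certain about the respondent, yet the estimate lands close to the hypothetical 30% once the sample is large.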
this song by Three Dead Trolls in a Baggie
"Listen: We are here on Earth to fart around. Don't let anybody tell you any different!" - Kurt Vonnegut
"Judge, I did not know she was 14 years old. I'm pleading innocent by reason of randomized, aggregate data!"
This is not a dream, not a dream...we are transmitting from the year 1-9-9-9.
Hey, it's me. The guy who put together and hosts the New York Times random login generator. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.
I'd also like to remind everyone that they are welcome to download, copy, and mirror the source of that page on their own servers, or even as an HTML page on their desktop or whatever. It's just javascript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it or they start blocking requests from my page or something. (It will also help distribute my load, though I haven't had any real trouble yet...)
The kind of questions that most of these sites ask include stuff that is impolite for friends to ask each other sometimes, never mind some random business. If they want accurate results, they should include the option for people to answer with a "MYOB" option. People are rather unlikely to keep tossing in crap data when they have the "MYOB" option, at least not in the 40% range. There is no way in hell that anyone making 100k+/year would actually admit it and give a business their real e-mail address. They would be begging for a flood of advertisements.
Why is it that online businesses feel they have the right to try and force so much personal information out of us? In brick 'n mortar stores, the worst info anyone asks me for is my zip code (or age to purchase alcohol). They can get my name if I use my credit card, but I can easily pay cash to avoid that.
It's very ironic that NYTimes would run this story.... Why do they expect me to tell them where I live, work, and what I make, just to read their articles? The paper version is nowhere near this invasive.
The idea of using randomness to get better survey results is not a new one. In his 1990 book "Innumeracy", John Allen Paulos posits a system for asking a potentially embarrassing yes or no question whereby the examiner asks the subject to flip a fair coin before responding. If the subject gets heads he should give the embarrassing answer; on tails he should tell the truth. The idea is that the subject is then spared the trauma of giving the embarrassing answer since the examiner is not told the result of the coin flip and it is possible the subject just flipped heads. Knowing the "probability distribution" of a fair coin, it can then be assumed that half the respondents gave the embarrassing answer as a result of their coin flip. These can then be removed from the data, leaving a statistically accurate result.
It seems that what the IBM folks are doing is a straightforward extension of this idea to a larger response domain (numerical ages as opposed to boolean questions) and to a more automated system in which the website flips the coin for the subject and amends his answer accordingly.
If the respondents are already randomizing the data, the statistical analysis should be able to produce the same result.
Or hadn't they thought of that?
A couple of thoughts.
First, I found this funny:
Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.
Yes, they *need* to make even more money off your data.
Second, anyone find it interesting that they assume a distribution and then work towards it:
"When people lie randomly -- and that is what they do now when they answer questions -- we get very poor results," he said. But by "adding random values to true values," he said, "we can reconstruct a distribution that is very close to the actual one."
Using this information, Dr. Srikant said, the researchers make a first guess at what the true distribution should be. Then the program crunches through the analysis and produces a slightly better guess. This guess is crunched again, and the process is repeated over and over again, getting closer and closer to the actual distribution.
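The "first guess, then a slightly better guess" loop quoted above can be sketched as a Bayes-style iterative update. This is not the researchers' program, just one common way to do such a refinement, and everything specific here is an assumption: integer ages, uniform noise of -15 to +15 known to the analyst, a candidate age range of 0 to 100, 50 iterations, and the hypothetical names `reconstruct` and `uniform_noise_pmf`.

```python
import random

def reconstruct(observed, support, noise_pmf, iterations=50):
    """Iteratively estimate the distribution of true values.

    observed  -- randomized values (true value + noise)
    support   -- candidate true values, e.g. range(0, 101)
    noise_pmf -- function d -> probability that the added noise equals d
    """
    support = list(support)
    counts = {}
    for w in observed:                       # tally repeated observations once
        counts[w] = counts.get(w, 0) + 1
    n = len(observed)
    # First guess: every candidate true value is equally likely.
    estimate = {a: 1.0 / len(support) for a in support}
    for _ in range(iterations):
        updated = {a: 0.0 for a in support}
        for w, c in counts.items():
            # Posterior over true values for an observed w, given the current guess.
            weights = {a: noise_pmf(w - a) * estimate[a] for a in support}
            total = sum(weights.values())
            if total == 0.0:
                continue                     # observation impossible under this model
            for a in support:
                updated[a] += c * weights[a] / total
        estimate = {a: updated[a] / n for a in support}
    return estimate

def uniform_noise_pmf(d, spread=15):
    """Assumed noise model: each integer in [-spread, spread] is equally likely."""
    return 1.0 / (2 * spread + 1) if -spread <= d <= spread else 0.0

rng = random.Random(2002)
# Made-up population: 70% aged 20-30, 30% aged 50-60.
true_ages = [rng.randint(20, 30) if rng.random() < 0.7 else rng.randint(50, 60)
             for _ in range(20_000)]
observed = [age + rng.randint(-15, 15) for age in true_ages]

estimate = reconstruct(observed, range(0, 101), uniform_noise_pmf)
estimated_mean = sum(a * p for a, p in estimate.items())
true_mean = sum(true_ages) / len(true_ages)
print(f"true mean {true_mean:.1f}, reconstructed mean {estimated_mean:.1f}")
```

Only the randomized values and the noise model go into `reconstruct`; the true ages are used solely to check the result afterwards.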
My guess is that they hope people don't truly lie randomly, and then use their random additions or subtractions to bring people closer to the actual distribution - i.e. I may say I make $0 or $50,000 (or whatever the low/high end is), but not pick an option one away from my real income. They are hoping that people, as a group, behave predictably even when any one individual doesn't. Which, if my org behavior prof is to be believed, is generally the case and the way people can shape others' responses and behaviors.
Interestingly enough, randomization is a useful tool in surveys. If you are asking about very private information that people may lie about if they fear the answer will be leaked, you can tell them to flip a coin - heads, they answer truthfully; tails, they put down no (for a yes/no survey). With a large enough sample, you can back out the real results based on the 50/50 results of the coin toss, without knowing how anyone actually answered.
Of course, companies should probably ask themselves how many Josef Stalins live in Moscow Idaho and were born on Oct 24, 1917?
I'm a consultant - I convert gibberish into cash-flow.
When a company tells me they aren't going to use the information I give them for anything but demographics research, then asks me for my phone number and address and makes both fields required, I consider it safe to assume that company is lying, and don't think it's at all naughty to fib.
On the other hand, if the company really only requires me to answer questions of demographic importance, such as what country and state/province I am from and my age, I am likely to respond truthfully.
1 Age [28] *Will be randomized*
2 Age [56 (Randomized)] *28*
The value 56 gets submitted to the server, not the value 28 - which is my real age ;).
This is auditable because I can inspect the source code which is part of the web-page, and I can even monitor the network packets if I'm really paranoid.
Now I could still lie, or mess with the algorithms in the Javascript, but what would be the point?
80N
It was an interesting article, and I can see how this technique will work when the surveyors have the goodwill of the respondents, so that any respondent's primary concern is only that of keeping his individual privacy.
But is privacy the core issue in market research, or is it simply a label of convenience that a lot of people use for something else that we don't have easy words for? I will lie on many surveys even when I am fully confident of my personal anonymity-- though I prefer to avoid those surveys entirely when I can. OTOH, when a survey is done by a group that I have aligned myself with, I might well enthusiastically bare my soul without any regard to the privacy issue. And I know that I am not at all uncommon in these respects.
I suspect that my reactions stem from the same source as nationalism, patriotism, ethnic pride, and that whole mess of things where I'm not behaving as an individual protecting my privacy, but as a member of a group who feels called upon to defend my group.
Mostly I see marketing as an attempt by outsiders to mess with my group, to get us to buy stuff through conning us rather than letting us apply our own standards of value to the goods offered. I think I lie on surveys to protect my group from these subtle attacks; to misdirect and confound my group's enemy.
So I really don't think privacy has much to do with it. I think all this lying is a natural group reaction to consumerism, and its belief that it is perfectly okay to sell product by conning your customers into thinking that what you are pushing today is something they want.
Not in my group, buster. We don't need no steenkeeng pushers in our neighborhood.
People won't trust sites to actually randomize the data. Actually, people probably won't notice that the site is promising to, or take this as a reason to give good results. What they should do is set up a system where the randomization is done by the browser (which people trust), in accordance with a distribution specified by the site and provided to the user.
That way, the browser tells you that your entry will be randomized to tell the site your age +-30 years, or give your actual gender 20% more frequently. Based on the numbers the site is using, you can decide whether to answer accurately, knowing just how hard it would be to track you based on this information. The web site would then be able to remove the noise from the aggregate data, and have a confidence based on the distribution they ask for (aside from people who think the margin is too small and lie).
As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?
A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!
Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.
"How to Do Nothing," kids activities, back in print!
Asking people right out "Hey, did you have unprotected anal sex on your casual encounter?" was found to be not a particularly good way to elicit truthful answers. So what you do is give people a fair coin (or the equivalent) and have them flip the coin for each question. If the coin lands heads, they answer "yes". If the coin lands tails, they answer *truthfully*. Looking over an answer sheet, you have no idea which "yes" answers are real and which are not, and subjects did feel like nobody really could "get" any personal information off their answer sheet. In the statistical aggregate, however, you could get perfectly useful average rates for a given population. (Basically, you just adjust for the "yes answer background".)
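Adjusting for the "yes answer background" comes down to one line of arithmetic, sketched here in Python. The sketch assumes a fair coin, so the observed share of "yes" is 0.5 + 0.5 * (true rate); the function name and the tally of 620 "yes" answers out of 1,000 sheets are invented for illustration.

```python
import math

def estimate_forced_yes(yes_count, total):
    """Estimate the true "yes" rate when heads forces a "yes" answer.

    Assumes a fair coin: observed = 0.5 + 0.5 * true_rate.
    """
    observed = yes_count / total
    estimate = 2.0 * observed - 1.0
    # The correction doubles the sampling error of the observed share.
    std_error = 2.0 * math.sqrt(observed * (1.0 - observed) / total)
    return estimate, std_error

# Hypothetical tally: 620 "yes" answers on 1,000 answer sheets.
estimate, std_error = estimate_forced_yes(620, 1000)
print(f"estimated true rate: {estimate:.3f} +/- {std_error:.3f}")
```

The standard error is the price of the privacy: the coin's noise has to be averaged out, so the same precision needs more respondents than a direct question would.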
A great idea, but its use in a wide-range study of this type was axed, I believe, when the study itself was blasted by certain members of congress...but that's another story.
Babar
They've missed the point about why their forms are full of bullshit.
The forms are giant time-wasters.
If the folks giving these surveys would stick to EXACTLY what they NEED to know, we wouldn't balk at filling them out properly- especially since personal data is one thing they generally do NOT need to know for marketing!
Forget the name, address, interests (the BIGGEST time waster of all.) Generally, the most important information that you can get from site visitors is:
1) Zip code. This tells you the geographic area that your visitors are coming from. Useful for location-relevant information, but completely impersonal.
2) Age range. This is really the prime info that marketers want, as so much of their "science" is based on generational observation. Again, totally impersonal.
3) How you heard about the site. This is the most important thing you can learn from your visitors, as it gives you some information on which advertisements are performing!
If every site I signed up to asked me these three questions and these three questions ONLY, I'd answer them all truthfully. As it stands, I have to dig through a mountain of shit, and these days I generally just throw the shovel at the pile and move on.
What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey
...which is not unusual on Slashdot - I do it all the time as well.
The idea of randomising answers is not new. It has been used in 'socially sensitive' surveys for years, if not decades.
Simple explanation:
Have a survey of 10 questions people don't like to reveal the truth of, each with a yes/no answer.
For each question, either
a) reply truthfully
b) flip a coin and record whatever the coin gives.
If challenged about your answer, you can always say that's the answer the coin required.
Analyse the results for a large population of completed surveys. Any significant deviation from 50% yes and 50% no answers tells you which way the population answered, without revealing who actually holds those views.
All you need is a coin to randomise your answers. This is independent of any web form, doctored answer sheet etc etc - so particular answers cannot be pinned on you.
It's fun administering the same survey to people with and without the randomisation - you get to see what people in general lie about!
Hope this gives a useful summary of the method.
Regards,
pgrb
This line intentionally left..uh..blank?
ISTR it's in Tanner's book on Gibbs sampling, as a method used to extract accurate population estimates about embarrassing, personal or even incriminating subjects, such as past exposure to STDs, sexual orientation, or the use of particular controlled drugs.
Of course, your survey has to be big enough so that the expected number of true positives (N.p) stands out above the expected uncertainty in the number of false positives, approx sqrt(N.p'.(1-p')). If p is small, N may have to be really quite big.
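The sample-size condition above can be rearranged into a rough lower bound on N. Here p is read as the chance that a respondent contributes a true positive and p' as the chance of a false positive; the factor k (how many standard deviations count as "standing out") is an added assumption, not something from the comment.

```latex
\[
  N p \;>\; k \sqrt{N\,p'\,(1 - p')}
  \quad\Longrightarrow\quad
  N \;>\; \frac{k^{2}\,p'\,(1 - p')}{p^{2}}
\]
% Illustrative numbers: p = 0.01, p' = 0.5, k = 2 give
% N > 4 * 0.25 / 0.0001 = 10,000 respondents.
```

Because the bound scales with 1/p^2, halving p quadruples the number of respondents needed.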
I remember reading about something similar to this in a psychology class in 1988 or so. The idea was for people doing a door-to-door survey asking things like sexual behavior. There's important public health reasons to have the data, but also strong reluctance to give honest answers.
What they did was give the person being polled a spinner, like from a board game. (Remember those, oh you young /.ers? Maybe not...) It was divided in two parts, 2/3 would say "yes" and 1/3 would be "no". The questioner would ask if the person's answer to some yes/no question matched that shown on the spinner (which the questioner couldn't see). You couldn't know what any single person's answer was, but you could do the math and get how many had done what.
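"Doing the math" for the spinner can be sketched as follows. The sketch assumes the spinner shows "yes" with probability 2/3 as described, and that respondents report only whether their true answer matches it; the function names and the 20% true rate in the simulation are hypothetical.

```python
import random

def estimate_from_matches(match_fraction, p_yes=2/3):
    """Invert P(match) = p_yes * rate + (1 - p_yes) * (1 - rate)."""
    return (match_fraction - (1.0 - p_yes)) / (2.0 * p_yes - 1.0)

def simulate_spinner_survey(true_rate, n, rng, p_yes=2/3):
    """Return the fraction of n respondents who report a match."""
    matches = 0
    for _ in range(n):
        truth = rng.random() < true_rate      # respondent's real yes/no answer
        spinner_yes = rng.random() < p_yes    # what the hidden spinner shows
        if truth == spinner_yes:
            matches += 1
    return matches / n

rng = random.Random(1988)                      # fixed seed so the run is repeatable
match_fraction = simulate_spinner_survey(0.20, 200_000, rng)
estimate = estimate_from_matches(match_fraction)
print(f"share reporting a match: {match_fraction:.3f}, estimated true rate: {estimate:.3f}")
```

The spinner has to be lopsided for this to work: with a 50/50 split the match rate is 1/2 whatever the true rate is, and the formula divides by zero.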
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
I used to work for a company whose customers had to provide accurate information in order to sign up -- the service wouldn't work with false info -- but the problem was getting people to sign up.
One of the main selling points was that customer data was completely secure: no one will ever be able to read your data, only an aggregate report of all our users. The company went to a lot of trouble to make this point convincing, going so far as to suggest that users had legal protections against abuse. There were people in the building who spent all their time trying to think of ways to convince more people to drop their defenses so we could exploit their information -- cold, calculating, 24-7, like WOPR spends all its time playing World War Three.
I believed their claims until the day I saw a user's sensitive data on an engineer's screen. And then that engineer showed me another user's data, and another. "We've always had the ability to do this," he said, "for, ahem, quality control purposes."
If a company tells you it isn't collecting the valuable data you provide, you need to assume it is lying (unless you can personally verify the claim or you are positive that the law protects you against abuse).
People are "lying" because corporations lie, as a matter of policy. This will never change because lies are more profitable than truth. Only corporations don't call their behavior "lying," they call it "marketing." So when I fill out an intrusive form with false information, I don't consider it lying either. I call it "standing up for my right to privacy." This system of "marketing" versus "standing up for my rights" is well-balanced, but this new masking technology is simply a marketing attempt to tip the scales in the corporations' favor by tricking consumers into volunteering information on false assumptions.
The whole idea of the thing is that they'll have pretty much the same data at the end, at least in aggregate form. What they won't have is the exact data on any individual person. If their randomizer adds a value from -15 to 15 to the age, and my result comes up as 37, then I could actually be anywhere from 22 to 52 years old.
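A small Python sketch of that point: with zero-mean noise the average survives, while any single reported age pins its owner down only to a range. The uniform integer noise from -15 to 15 follows the example in the comment; the population of ages 18 to 80 and the helper name `possible_true_ages` are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(7)                                    # fixed seed for repeatability
true_ages = [rng.randint(18, 80) for _ in range(50_000)]  # made-up population
reported = [age + rng.randint(-15, 15) for age in true_ages]

true_mean = statistics.mean(true_ages)
reported_mean = statistics.mean(reported)

def possible_true_ages(reported_age, spread=15):
    """Range of true ages consistent with one reported value."""
    return reported_age - spread, reported_age + spread

print(f"true mean {true_mean:.2f}, reported mean {reported_mean:.2f}")
print("a reported 37 could be any age in", possible_true_ages(37))
```

The noise does widen the spread of the reported values, so recovering more than the mean takes extra work on the analyst's side.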
This Space Intentionally Left Blank
Timeo idiotikOS et dona ferentes
Why do companies take these polls to begin with? To make money. Either there is money to be made in interpreting the results or even in providing the results in the first place (see election exit polls). If the pollsters are looking to make a profit off of the information, why not share that profit with the people that gave you the information to begin with?
Going back to the example of the exit poll, if all you're going to do is try to make money by predicting who will win an election, it's much more satisfying for the voter to lie and watch them squirm when they get it wrong. Why should we tell the truth?
(From a mathematical perspective) but isn't a random function reversible? Though the chance of reconstructing the scramble is slim, we shouldn't rule out the risk. Why don't they use some irreversible function like MD5?
The company is providing you with a product, often for free, and they request that for you to use their product you give them a little personal information. It is their product, so they get to make the rules. Your choices are to give what they want and take what you want, or you could just live without it. I don't understand the position of taking what you want and not leaving what they want.
Or you consider this tiny piece of personal information part of the price. Instead of giving them $5, you give them your age, salary, and email address. You don't try to trick the grocery store clerk when you think the bill was too high for what you bought, do you? Why would this be any different? If you don't like the invasion of privacy, then the cost is too high for you and you don't take the product.
I can see where people may say that capital is a required part of making the product and personal information isn't. Since they don't need my email address then I should feel free to not give it to them. However, this personal information often does translate into capital for them (the goal of business is to make money most of the time). Besides, that isn't your decision to make. The company wants your privates and they are giving you the product, so their desire carries more weight. If you were not receiving something back in return, then their desires would not override yours.
That it is only a "little" lie doesn't change the fundamental fact that lying is a priori wrong.
...that we use a Random NY Times Registration Generator to falsify [our] registration details to access an article about ways to persuade people to give correct answers to survey questions?
:)
Helluva page btw, majcher. Thanks
May we live long and die out
Beef jerky?
Light cup, beer drink, thin so chain, neck turtle fat, man I won't say it again
You'd need it to be a browser feature, rather than a feature implemented by the browser while running javascript in the page. Obviously, this involves a lot of changes, but this is theoretical research anyway.