Randomizing Survey Answers For Accuracy
Saint Aardvark writes: "The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results. Strangely enough, they can get accurate information out of the aggregate of enough answers -- but it's completely anonymized. Since conservative estimates say nearly half of all survey answers are bogus, there's an interest in persuading people to be more truthful. As ever, you can use the Random NY Times Registration Generator to falsify your registration details and read the article..."
If I want to intentionally put a bogus answer into a poll, such randomization doesn't affect it at all. Quite often, I will answer a poll not in the way that I actually feel, but in the way that interests me the most at that particular moment. This randomization doesn't affect that.
In the past, I'd give false answers. Now I'll need to randomize my true/false answers to throw their randomness off.
fp
It hurts and stuff -- plus this story is a dupe.
Ok, fine. They've managed to come up with a model that doesn't actually collect any data. And how will this help people to enter REAL data? People don't give data because they don't trust the company. If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?
Javascript + Nintendo DSi = DSiCade
Did you lie when answering this question?
O Yes
O No
O Cowboy Neal told me the answer
There is a great amount of irony in the fact that we're all reading an article about obtaining accurate information by clicking on a link that will generate false information.
That's just way too wonderful to put into mere words.
My
Limekiller
Do they expect that people will enter real data on the mere promise that it will be stored in some randomized, aggregate, or other form that does not invade their privacy? If the corporation could not be trusted in the first place, no statement they make will make them trustworthy.
Sounds all fine and dandy for science, but people are usually honest with a professional researcher who is going to guarantee their anonymity, and moreover the research data is going to be used for something tangible rather than selling something right back to you.
Market researchers want information on YOU. They want generic info on your demographics, but this information has been available from other venues for a long time. When spy ware and other information gathering techniques are employed against someone they are being used to collect data to target marketing at that person specifically. Literally employed against that person.
As such, I'll still say that I'm female, in my 50's, from Yemen and making less than $12,000 a year. Randomize away.
Does he help people kill themselves or do they kill themselves over all these e-commerce surveys?
While a lot of people are concerned about their privacy, somehow, I don't think that the fact that they won't be able to tie the answers to you will lead to any more truthful answering.
What's the point of living if you can't screw with market research? It's just fun, and my little way of getting some revenge for the countless webpages they cover with annoying advertisements, or the time they steal in between my TV programs.
I don't see how people would trust this any more than entering it normally.
Typical session:
What is your age? (Results will be randomized)
23
OK, we're putting down you are 28 based on a random number we picked. Aren't we good to protect your privacy?
(Then behind the scenes the database gets the real age put into it, how will the user ever know?)
Even if the user can view their profile later on, the database can just store their real age + the so-called random modifier, and the user will be none the wiser.
What a pointless "technology".
I've had enough abrasive sigs. Kittens are cute and fuzzy.
-CySurflex
Either I trust "them" to be completely honest and fair, then I can give them my information. If I don't trust them to use my information properly then why should I trust them that they actually use this IBM-program? Either they find some way to get information out of my false answers or make me trust them.
So what they're saying is that they've proven that their random number generator isn't really all that random? :p
A few people have posted that you still have to trust the company you're sending the info to to randomize the data for you. It doesn't have to work like this. You could have the program work so that the info is randomized at your end, maybe by having the browser make a call to a "registration" program. It could be open source so that we could be sure of what it's doing, and the company then can't get your real info (without hacking your box).
"Save me jebus!" - Homer Simpson (btw, I'm probably talkin out of me arse)
You left out the big red graphic 'W' from the beginning of the article.
I think there is something to be said about companies that ask for information as an option versus companies that ask for information as a requirement.
For example, company XYZ has released a program called Widget. In order to download Widget, users are asked to fill out a survey so that XYZ may gauge the demographics of their target audience.
Some sites will allow you to bypass this step and proceed to download the software. Other sites require this information before revealing the download link. I think that the psychological difference between "required" and "optional" would heavily influence the honesty of the answers.
I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company. The difference? I'm not being forced to give anything up in advance.
Is this true in general? I don't know. But it makes sense to me.
I have an idea for something to replace the survey forms - an AI program to carry out a conversation with the user. Ah ha! We just have to watch out for users that say to the AI - "I am lying" - and hope the AI doesn't need therapy.
Price, Quality, Time. Pick none. What, you thought you had a choice?
They're blurry so they look nicer! :)
and unless the IBM software runs on my computer and is fully OSS, I won't trust it more than Company X's Privacy Policy to begin with...
That's actually an old statistical trick. Adding noise with a known distribution to statistical data doesn't have to hurt the accuracy of the final aggregate. With a little Java button that randomizes the data you've entered in the form (that is, before sending the data to the firm), it protects your privacy while still giving useful data to the firm. They've got a nice idea, but it surely won't stop some people from faking answers "for fun". I do that sometimes :-)
Personally, if I don't trust them enough to tell them how much I make, I'm not going to trust them to randomize my results. I don't see how this will increase accuracy -- especially if I keep telling everyone I'm a 108 year old female in Uganda making $100,000+ per year who works in the sales department of an Educational field and plans to make purchases of an SUV, a house, a console gaming system, and an optical mouse in the next six months and rates their internet experience as very low. My e-mail address is sjobs@mac.com and I would like to apply for your quarterly, monthly, weekly, daily, and hourly newsletters and I do give permission to pass this information to your affiliates.
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
Not only does this not make any sense, the article was really poorly written. There is no way this system will be any more truthful than the lame one we use now. What this does prove is that the morons at college actually believe the crap they write...what a sad concept. It is like actually believing the commercials we see on TV...
errr....umm...*whooosh* *whoosh* Is this thing on ?
Let me summarize:
1) People lie on surveys, most likely because they don't trust the taker - but probably also just because they like putting in other answers (yeah, I'm a millionaire, woohoo!, etc). This only addresses the trust issue, ignoring other potential sources of lying.
2) In order to work around the trust issue, they've developed a method of injecting random noise into the original answers as they are recorded and then extracting useful data in the end.
Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.
Without bothering with all this mumbo-jumbo, I can build a trustworthy system. I simply record survey statistics, and I promise not to use the individuals' personal data individually.
They can either trust me that I'm telling the truth about this, or they can lie. In the IBM researchers' scenario, the users are again asked to trust that the backend system doesn't compromise them, and again they can choose to trust it or choose to lie.
Given the above, why on earth would you bother with this research and unnecessary complexity? It's not going to make any difference over just promising your users that you don't invade their privacy. You could replace their research results with a banner on top of the survey that says "After you submit your data to us, we use Magical HibiJibi technology to prevent ourselves from invading your privacy, so please trust us and answer truthfully"
What a waste of research.
11*43+456^2
I don't quite understand it yet, but I like it because it somehow arouses my interest in data mining.
KOS-MOS
There are three kinds of lies: Lies, Damn Lies, and Statistics.
This is probably the most sensible way of doing this, using, for example, Java or even ActiveX.
However, it still doesn't fix the problem that people lie. Even if they know that their privacy is guaranteed, they'll still lie, simply because it's fun--after all, rules are made to be broken.
I'm always a bit skeptical when I'm told I'm about to be surveyed anonymously, and I can't think of a way that this can be implemented (or at least is likely to be implemented) that would reassure me. The non-skeptics are filling in their information already. Perhaps businesses could pick one in five to survey and offer the people who don't want to take it the ability to just skip it; I'll bet a good amount of crap in the databases is coming from people who have to fill in eighty mandatory fields for free e-mail or music or whatever.
Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.
I heard something similar to this a while back, where they were surveying college students with personal questions like "have you had sex?" To make students more inclined to answer truthfully, each student would go into a room alone with a survey and a coin, and for every question they were asked to flip the coin. If the coin came up heads they would answer truthfully; if it came up tails they would flip the coin again, and heads would mean true and tails would mean false.
I can't remember where I read this. If someone has a link, could you please post it?
I agree with lots of folks here that this system works only if you don't have to trust the remote site to apply the obfuscating transformation. Here's a suggestion to make things somewhat more transparent.
Create a form with attached Javascript. You enter the real data and hit the "obfuscate" button. The script then locally adds noise to your answers. At this point, the "obfuscate" button turns into "submit", allowing you to send the visibly obfuscated responses to the remote site.
Of course, you'll probably want to read the source to make sure the real answers are not sent along with the obfuscated ones. Still, this scheme would go a ways toward creating the perception of honesty.
Flip a coin twice. If the result is two tails, answer "yes" to this question. If the result is two heads, answer "no". If the result is one head and one tail, answer truthfully:
Are you a homosexual?
Nothing known about that one person, but integrate the results over a large enough sample... The catch: the person taking the survey must trust the random number generator, so low-tech things (like coin tosses) would work best.
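How large is "large enough"? Here is a rough Python sketch of the arithmetic for this two-coin scheme, assuming fair coins and respondents who follow the instructions. A quarter of answers are a forced "yes", a quarter a forced "no", and half are truthful, so the reported "yes" rate is 1/4 + p/2 for a true rate p. The function name and the sample counts are made up for illustration.

```python
import math

def two_coin_estimate(yes_count, n):
    """Estimate the true 'yes' rate under the two-coin scheme.

    Two tails (prob 1/4) forces "yes", two heads (prob 1/4) forces "no",
    one of each (prob 1/2) gives a truthful answer, so
    P(reported yes) = 1/4 + p/2 and therefore p = 2 * observed - 1/2.
    """
    observed = yes_count / n
    p_hat = 2.0 * observed - 0.5
    # The factor of 2 in the inversion also doubles the standard error.
    std_err = 2.0 * math.sqrt(observed * (1.0 - observed) / n)
    return p_hat, std_err

# Hypothetical survey: 300 "yes" answers out of 1000 responses.
p_hat, std_err = two_coin_estimate(yes_count=300, n=1000)
print(f"estimated true rate: {p_hat:.3f} +/- {std_err:.3f}")
```

The price of the privacy is the doubled standard error: you need roughly four times as many respondents as a direct (and honest) survey to get the same precision.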
Interesting approach, but useless unless people actually understand and trust the system. For this to happen will probably require widespread adoption, an easy to understand explanation of the process, and assurances that answers really are randomized. These requirements obviously force a bit of a chicken and the egg scenario.
Explaining the whole randomization process (how it protects privacy, how it provides useful info) will be a little much for most people I think, but a good user interface might alleviate this, perhaps with a 'randomize' button that is used before hitting the 'submit' button. This would take the user input and change it right in front of their eyes. Of course many would be rightfully concerned that the randomize button is just for show (or simply encodes but doesn't anonymize), but I think that enough people might buy into the false sense of security that demonstrated 'randomization' provides to at least partly improve the % of bona fide results. Also, the system could be set up so users who don't mind submitting traceable information could be encouraged ("extra 10% off") to submit without randomization, with a simple flag sorting data into randomized/anonymous and non-randomized/non-anonymous data.
This approach would be even better if the randomization approach becomes a ubiquitous standard backed by a consistent and legally accountable and well-known entity/brand (IBM for instance). I'm not sure how well an open solution would work unless there was a central group assuming responsibility and accountability for the system, enforcing trademarks, and suing spoofers. Also, people feel safer when they feel there's someone to blame for any abuse/mistakes (hence, giving their credit card freely to a waiter but not to a website).
My next sig will be ready soon, but friends can beat the rush!
"The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results."
Maybe I'm crazy, but isn't this exactly the same as the random response stuff that I (and presumably everyone else) learned in high school finite? The way it worked in finite is you ask someone 2 questions, like "have you ever killed?" and "have you ever been on a plane?", then you tell them to answer one of those based on some random event, like "answer the first question only if it's Monday, Tuesday or Wednesday, otherwise the other question", then you calculate the probability that they answered each question using finite. This protects the privacy since there is no way of knowing whether they said "true" to killing or being on a plane, and people know this so they will be more likely to be honest. I don't see how this "new" system that IBM has created is much different than a slightly modified version of what I have just described.
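For what it's worth, the calculation for that two-question setup can be sketched in a few lines of Python. This assumes you already know (from some other source) what fraction of people would say "yes" to the harmless question, and that respondents are spread evenly over the days of the week; the 0.8 plane figure and the 0.48 observed rate below are invented numbers, and the function name is hypothetical.

```python
def unrelated_question_estimate(yes_rate, prob_sensitive, innocuous_rate):
    """Back out the 'yes' rate for the sensitive question.

    yes_rate       : fraction of all respondents who answered "yes"
    prob_sensitive : chance a respondent was directed to the sensitive question
    innocuous_rate : known 'yes' rate for the harmless question (an assumption)

    Observed: yes_rate = q * p_sensitive + (1 - q) * innocuous_rate
    """
    return (yes_rate - (1.0 - prob_sensitive) * innocuous_rate) / prob_sensitive

# Monday, Tuesday or Wednesday: 3 days out of 7 answer the sensitive question.
q = 3 / 7
estimate = unrelated_question_estimate(
    yes_rate=0.48,        # hypothetical observed share of "yes" answers
    prob_sensitive=q,
    innocuous_rate=0.80,  # hypothetical share of people who have been on a plane
)
print(f"estimated rate for the sensitive question: {estimate:.4f}")
```

Note that the estimate is only as good as the assumed rate for the harmless question, so in practice you would pick a harmless question whose answer distribution is well known.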
Why do I sometimes put bogus information in surveys? Because the survey stands between me and something I'm after, and I don't want to waste any more time than necessary, getting whatever it is. For example, reading an online article, downloading eval software, entering an etailer's contest, getting on a geek mailing list, etc. are all things for which I've been asked to fill out surveys.
Privacy concerns don't necessarily enter into the equation in those situations. I may or may not want to take the time/brain power to even attempt to answer accurately.
I hadn't heard about this before. Randomized Response
There is a fairly well-known survey technique called randomized response that implements just this. The way to ensure trust is simply to let the user generate the random noise, for example by flipping a coin.
For example, to answer a sensitive yes/no question, you could be instructed to flip a coin. If you get heads, you answer the question truthfully. If you get tails, you flip the coin again, and answer "yes" for heads / "no" for tails. Thus, there is a 50% probability that any given answer is completely random, but the noise can easily be removed from the aggregate statistics.
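To see that the noise really does come out in the wash, here is a small Python simulation of that coin-flip procedure. It is only a sketch: the 30% true rate, the population size, and the function names are all made up, and it assumes fair coins and honest participants.

```python
import random

def randomized_answer(truth, rng):
    """One respondent: first flip heads -> truthful, tails -> second flip decides."""
    if rng.random() < 0.5:          # first flip came up heads
        return truth
    return rng.random() < 0.5       # second flip: heads -> "yes", tails -> "no"

def denoise_yes_rate(answers):
    """P(reported yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5."""
    observed = sum(answers) / len(answers)
    return 2.0 * observed - 0.5

rng = random.Random(42)             # fixed seed so the run is repeatable
true_rate = 0.30                    # hypothetical share of true "yes" answers
population = [rng.random() < true_rate for _ in range(100_000)]
reported = [randomized_answer(t, rng) for t in population]

estimate = denoise_yes_rate(reported)
print(f"raw reported yes rate: {sum(reported) / len(reported):.3f}")
print(f"de-noised estimate:    {estimate:.3f}")
```

The surveyor only ever sees the `reported` list, in which any single "yes" may just be a coin flip, yet the de-noised estimate lands close to the true rate.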
Mind you, I don't think that's what the NYT article is describing. The text is too vague to be sure, but it does seem as if they're describing a server-side randomization system. If so, I wouldn't trust that either.
I don't think trust has anything to do with it. I already filled in forms on the internet truthfully years ago when I was young and stupid, and lots of sites have my details. They're gonna get them one way or another: maybe another company will sell them my information illegally, or maybe I'll fill in a form. It makes no difference.
The reason I don't fill in forms truthfully is that it takes too long. I can't be bothered to read questions. I see a drop-down box of countries and quickly click Afghanistan, then I click on the post/zip code box and hit my keyboard to fill it with crap, moving on, click, tick, select whatever gets me through as quickly as possible. I _really_ have no time to keep filling in these forms. I just want to get on the site so I can post a question, read an article or whatever (btw the random NY Times generator was excellent :).
Then there are other sites that I just don't feel deserve my details. I'm fed up with junk mail, so my email address usually ends up being webmaster@[the-web-site].com etc. (give them a taste of their own medicine). I think there are very few people who actually go around filling in false data just to sabotage the statistics. All people want is a _By-Pass_ button that will just get them in. It really is for the web sites' own good if they want to maintain a reasonably accurate database, because otherwise they will just get my useless random fist-hitting-keyboard data.
This comment does not represent the views or opinions of the user.
Dear God, I learned about this technique in an undergrad intro to stats course back in 1985. What's the NY Times going to discover next, "Harvard Researchers Prove Accuracy in Counting to 10 Improved by Averaging Finger Counts"?
The technique described by the researchers is designed to reduce false answers users give to protect their privacy. Users don't trust the pollsters with their private info, so they lie to protect it. The way I read the article, and I did read the article, they are suggesting the answers get randomized before they get to the pollster. Like this:
"Sir, take this clipboard and go to that table way over there where we can't watch you too closely. For each question you wish to answer, spin the Wheel-Of-Fortune-type wheel. Add the value of the wheel to the value of your answer and mark the sum down on the clipboard. In this way we cannot discern your actual information."
If privacy concerns are the sole reason the user would normally lie to a pollster, this technique may convince the user that their actual data cannot be linked back to them, thus preserving their privacy.
On the other hand, if the user just wants to screw with the pollster, as many slashdotters have confessed, this technique is of no help.
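The wheel version is easy to try out in Python. The sketch below assumes a wheel with twenty equally likely values (0, 5, ..., 95) and respondents who report their true age plus the spin; the wheel layout, the age range, and the function names are all invented for illustration. Since the pollster knows the wheel, the average spin can simply be subtracted from the average reported value.

```python
import random
import statistics

# Hypothetical wheel: twenty equally likely values 0, 5, ..., 95.
WHEEL_VALUES = list(range(0, 100, 5))

def spin_and_report(true_value, rng):
    """What the respondent writes on the clipboard: true answer plus the spin."""
    return true_value + rng.choice(WHEEL_VALUES)

def recover_mean(reported):
    """Subtract the known average spin from the average reported value."""
    return statistics.fmean(reported) - statistics.fmean(WHEEL_VALUES)

rng = random.Random(7)               # fixed seed so the run is repeatable
true_ages = [rng.randint(18, 70) for _ in range(50_000)]
reported = [spin_and_report(age, rng) for age in true_ages]

estimated_mean = recover_mean(reported)
print(f"true mean age:      {statistics.fmean(true_ages):.2f}")
print(f"recovered mean age: {estimated_mean:.2f}")
```

This only recovers the average. Getting back the whole shape of the distribution takes more work, and none of it helps if the respondent writes down a made-up number before spinning.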
What a pointless "technology".
"Pointless" would be to argue with you about the meaning of higher education.
Let's think what the article actually says: IBM has employed a technique which lets them estimate the original distribution of data by adding a certain amount of random data with known distribution. That surely should be useful in other areas as well?!
A Google search on Random Perturbation gives quite a long list with applications in weather simulation, computer graphics, chaotic dynamical systems, etc.
Still pointless? What about a search in the then NEC Research Index? Wowwww... Pointless, eh...?
Excellence: Moderate (mostly affected by comments on your karma)
As another poster observes, if you don't trust them with the data, why trust them to randomize it?
My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:
With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.
If the second hand is between 1-30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31-60, they tell the truth.
The surveyor knows very precisely the number of 'lies', can extract accurate data, and the respondent has confidence and control over their privacy. All without a transistor.
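The extraction step is one line of algebra, and it generalizes. Below is a hedged Python sketch of a general "forced response" estimator: the surveyor only needs to know how often people answer truthfully and how often a forced answer is "yes". The function name and the example numbers are hypothetical.

```python
def forced_response_estimate(yes_rate, prob_truth, prob_forced_yes):
    """Recover the true 'yes' rate from randomized answers.

    yes_rate        : observed fraction of "yes" answers
    prob_truth      : chance the respondent is told to answer truthfully
    prob_forced_yes : chance a non-truthful (forced) answer is "yes"

    Observed: yes_rate = prob_truth * p + (1 - prob_truth) * prob_forced_yes
    """
    return (yes_rate - (1.0 - prob_truth) * prob_forced_yes) / prob_truth

# Clock-and-coin scheme: seconds 31-60 -> truth (1/2), otherwise a fair coin decides.
clock_estimate = forced_response_estimate(
    yes_rate=0.40,          # hypothetical observed "yes" share
    prob_truth=0.5,
    prob_forced_yes=0.5,
)
print(f"estimated true yes rate: {clock_estimate:.2f}")
```

Other variants in this discussion are the same formula with different settings: always forcing "yes" on heads is `prob_forced_yes=1.0`, and always forcing "no" on tails is `prob_forced_yes=0.0`.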
If the users always pick the first choice, regardless of the 'true distribution', you're not going to get any information.
this song by Three Dead Trolls in a Baggie
"Listen: We are here on Earth to fart around. Don't let anybody tell you any different!" - Kurt Vonnegut
"Judge, I did not know she was 14 years old. I'm pleading innocent by reason of randomized, aggregate data!"
This is not a dream, not a dream...we are transmitting from the year 1-9-9-9.
Hey, it's me. The guy who put together and hosts the New York Times random login generator. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.
I'd also like to remind anyone who wants to that they can download, copy, and mirror the source of that page on their own servers, or even keep it as an HTML page on their desktop or whatever. It's just javascript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it or they start blocking requests from my page or something. (It will also help distribute my load, though I haven't had any real trouble yet...)
The kind of questions that most of these sites ask include stuff that is impolite for friends to ask each other sometimes, never mind some random business. If they want accurate results, they should include the option for people to answer with a "MYOB" option. People are rather unlikely to keep tossing in crap data when they have the "MYOB" option, at least not in the 40% range. There is no way in hell that anyone making 100k+/year would actually admit it and give a business their real e-mail address. They would be begging for a flood of advertisements.
Why is it that online businesses feel they have the right to try and force so much personal information out of us? In brick 'n mortar stores, the worst info anyone asks me for is my zip code (or age to purchase alcohol). They can get my name if I use my credit card, but I can easily pay cash to avoid that.
It's very ironic that NYTimes would run this story.... Why do they expect me to tell them where I live, work, and what I make, just to read their articles? The paper version is nowhere near this invasive.
The idea of using randomness to get better survey results is not a new one. In his 1990 book "Innumeracy", John Allen Paulos posits a system for asking a potentially embarrassing yes or no question whereby the examiner asks the subject to flip a fair coin before responding. If the subject gets heads he should give the embarrassing answer; if tails, he should tell the truth. The idea is that the subject is then spared the trauma of giving the embarrassing answer, since the examiner is not told the result of the coin flip and it is possible the subject just flipped heads. Knowing the "probability distribution" of a fair coin, it can then be assumed that half the respondents gave the embarrassing answer as a result of their coin flip. These can then be removed from the data, leaving a statistically accurate result.
It seems that what the IBM folks are doing is a straightforward extension of this idea to a larger response domain (numerical ages as opposed to boolean questions) and to a more automated system in which the website flips the coin for the subject and amends his answer accordingly.
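The "remove the coin-flip half" step can be written out in a few lines of Python. This is only a sketch of the arithmetic as described above, assuming a fair coin and cooperative subjects; the function name and the counts are made up.

```python
def forced_yes_estimate(yes_count, n):
    """Heads -> forced embarrassing answer ("yes"), tails -> truthful answer.

    P(reported yes) = 1/2 + p/2, so p = 2 * observed - 1.
    Sampling noise can push the raw value slightly below zero, so clamp it.
    """
    observed = yes_count / n
    return max(0.0, 2.0 * observed - 1.0)

# Hypothetical survey: 620 of 1000 subjects gave the embarrassing answer.
# About 500 of those are coin-flip heads; of the remaining ~500 truthful
# answers, about 120 were "yes", i.e. roughly 24%.
estimate = forced_yes_estimate(yes_count=620, n=1000)
print(f"estimated true rate: {estimate:.2f}")
```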
If the respondents are already randomizing the data, the statistical analysis should be able to produce the same result.
Or hadn't they thought of that?
A couple of thoughts.
First, I found this funny:
Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.
Yes, they *need* to make even more money off your data.
Second, does anyone find it interesting that they assume a distribution and then work towards it:
"When people lie randomly -- and that is what they do now when they answer questions -- we get very poor results," he said. But by "adding random values to true values," he said, "we can reconstruct a distribution that is very close to the actual one."
Using this information, Dr. Srikant said, the researchers make a first guess at what the true distribution should be. Then the program crunches through the analysis and produces a slightly better guess. This guess is crunched again, and the process is repeated over and over again, getting closer and closer to the actual distribution.
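To make the "guess, crunch, guess again" loop concrete, here is a small Python sketch of one way such an iteration can work. It is not taken from the article and may well differ from what the IBM researchers actually do: it is a plain Bayes-style update over a handful of discrete brackets, with a made-up noise distribution and made-up data, and every name in it is hypothetical.

```python
import random
from collections import Counter

def reconstruct(observed, support, noise_pmf, iterations=500):
    """Iteratively estimate the distribution of true values.

    observed  : perturbed values (true value + random offset)
    support   : the possible true values
    noise_pmf : dict mapping each possible offset to its known probability
    """
    counts = Counter(observed)
    n = len(observed)
    estimate = {a: 1.0 / len(support) for a in support}   # first guess: uniform
    for _ in range(iterations):
        updated = {a: 0.0 for a in support}
        for w, c in counts.items():
            # How plausible is each true value a, given the perturbed value w?
            weights = {a: noise_pmf.get(w - a, 0.0) * estimate[a] for a in support}
            total = sum(weights.values())
            if total == 0.0:
                continue
            for a in support:
                updated[a] += c * weights[a] / total
        estimate = {a: updated[a] / n for a in support}    # slightly better guess
    return estimate

rng = random.Random(1)
support = [0, 1, 2, 3, 4]                      # e.g. five income brackets
true_dist = [0.10, 0.40, 0.30, 0.15, 0.05]     # hypothetical true shares
noise_pmf = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}    # known random offset

truths = rng.choices(support, weights=true_dist, k=100_000)
observed = [t + rng.choice([-1, 0, 1]) for t in truths]

estimate = reconstruct(observed, support, noise_pmf)
for a, p in zip(support, true_dist):
    print(f"bracket {a}: true {p:.2f}, reconstructed {estimate[a]:.2f}")
```

The loop never sees `truths`, only the perturbed values and the noise distribution. Note that it starts from a flat first guess rather than from any assumption about what the answer should look like.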
My guess is that they hope people don't truly lie randomly, and then use their random additions or subtractions to bring people closer to the actual distribution, i.e. I may say I make $0 or $50,000 (or whatever the low/high end is), but not pick one away from my real income. They are hoping that people, as a group, behave predictably even when any one individual doesn't. Which, if my org behavior prof is to be believed, is generally the case and the way people can shape others' responses and behaviors.
Interestingly enough, randomization is a useful tool in surveys. If you are asking about very private information that people may lie about if they fear the answer will be leaked, you can tell them to flip a coin: heads, they answer truthfully; tails, they put down no (for a yes/no survey). With a large enough sample, you can back out the real results based on the 50/50 results of the coin toss, without knowing how anyone actually answered.
Of course, companies should probably ask themselves how many Josef Stalins live in Moscow Idaho and were born on Oct 24, 1917?
I'm a consultant - I convert gibberish into cash-flow.
I am truly amazed that the NYT would report this as news since they always have one or more front page stories which are the results of a poll.
What's so amazing is that the polls they report as news are usually constructed so as to come up with results that the NYT editorial board wants to have.
How about being respectable journalists by:
1. reporting facts to not try to influence public opinion
2. not reporting manufactured news stories such as poll results
3. putting editorials on the editorial page and not the front page since editorials are designed to change public opinion
4. reporting both good and bad news about a president/congressman whether or not it helps/hurts the political outlook of the NYT editorial board
5. reporting news stories which don't center around NYC, DC, and LA.
Doesn't anyone else see how this is such an incredibly stupid waste of time and CPU cycles?... paraphrased: "yea... we could add or subtract a random quantity from the results to either conform to a normal distribution or just linearly.." Is this the whole point? Chances are they'll have pretty much the same data at the end... or is it some psychological device?
When a company tells me they aren't going to use the information I give them for anything but demographics research, then asks me for my phone number and address and makes both fields required, I consider it safe to assume that company is lying, and don't think it's at all naughty to fib.
On the other hand, if the company really only requires me to answer questions of demographic importance, such as what country and state/province I am from and my age, I am likely to respond truthfully.
...that their surveys are bogus, that we see them assume that the crappy answers they get are the fault of the respondents.
Back when McCain's campaign was maligning Bush in the 2000 primaries, I got a call from a pollster, obviously from the McCain camp, who offered the most leading questions and answer choices I've ever seen in such a poll. Of course, when I said that I thought McCain was a commie, and that I didn't agree with even a tenth of his allegedly 'conservative' positions, the guy started shuffling to end that interview. I am certain my survey responses got flushed.
In space, no one can hear you moo.
I submitted this a while ago before it appeared here. Same for other stories.
2002-07-19 22:51:55 Web form information, True or False? (articles,news) (rejected)
I won't submit another story again as it is pointless.
----- Whats wrong with this picture? http://www.revoh.org:1234/whatswrong
I am more prone to lie about facts like birth date, income, education status.
In the end it all comes down to the questions: if they are along the lines of my interests and my age range, then I feel more comfortable saying I am in my late 20's and how I like to travel than giving them my birth date and which hotel chains I stay at.
...was when I could actually check the button marked:
Household income? >$100,000
and then, only once. Of course, the rest of the block were filled in with random junk.
If they take actual inputs, and randomize the results, why bother with the expensive survey taking in the first place? Just generate a set of random responses.
"Well, we had 1,500 responses, and 23% were female, potential snow tire buyers, age 18-25" (of course, we did not actually take a survey, and just had the computer over in the corner generate a bunch of random responses).
1 Age [28] *Will be randomized*
2 Age [56 (Randomized)] *28*
The value 56 gets submitted to the server, not the value 28 - which is my real age ;).
This is auditable because I can inspect the source code which is part of the web-page, and I can even monitor the network packets if I'm really paranoid.
Now I could still lie, or mess with the algorithms in the Javascript, but what would be the point?
80N
It was an interesting article, and I can see how this technique will work when the surveyors have the goodwill of the respondents, so that any respondent's primary concern is only that of keeping his individual privacy.
But is privacy the core issue in market research, or is it simply a label of convenience that a lot of people use for something else that we don't have easy words for? I will lie on many surveys even when I am fully confident of my personal anonymity-- though I prefer to avoid those surveys entirely when I can. OTOH, when a survey is done by a group that I have aligned myself with, I might well enthusiastically bare my soul without any regard to the privacy issue. And I know that I am not at all uncommon in these respects.
I suspect that my reactions stem from the same source as nationalism, patriotism, ethnic pride, and that whole mess of things where I'm not behaving as an individual protecting my privacy, but as a member of a group who feels called upon to defend my group.
Mostly I see marketing as an attempt by outsiders to mess with my group, to get us to buy stuff through conning us rather than letting us apply our own standards of value to the goods offered. I think I lie on surveys to protect my group from these subtle attacks; to misdirect and confound my group's enemy.
So I really don't think privacy has much to do with it. I think all this lying is a natural group reaction to consumerism, and its belief that it is perfectly okay to sell product by conning your customers into thinking that what you are pushing today is something they want.
Not in my group, buster. We don't need no steenkeeng pushers in our neighborhood.
There's a better way. Run the demo-collecting software on the client. The user enters their info, the client randomizes it and sends it on.
Similarly for customized ads. Your client (open source of course) knows your demographics. But it also has 5 other (fake) profiles. It sends them all to the server, the server sends back 5 customized ads, one for each profile. The client picks the right one and shows you.
Everyone wins!
Ciao for now,
Simon Woodside
http://www.simonwoodside.com/
PS. Please, check out http://www.semacode.org/ and give me some feedback !
You vote for candidate B, this is randomized to be candidate E. The voting machine has a record that shows that you voted for E. This can be inspected by you to determine that your (randomized) vote was not tampered with.
The outcome of the election is determined simply by removing the randomizing bias...
80N
People won't trust sites to actually randomize the data. Actually, people probably won't notice that the site is promising to, or take this as a reason to give good results. What they should do is set up a system where the randomization is done by the browser (which people trust), in accordance with a distribution specified by the site and provided to the user.
That way, the browser tells you that your entry will be randomized to tell the site your age +-30 years, or give your actual gender 20% more frequently. Based on the numbers the site is using, you can decide whether to answer accurately, knowing just how hard it would be to track you based on this information. The web site would then be able to remove the noise from the aggregate data, and have a confidence based on the distribution they ask for (aside from people who think the margin is too small and lie).
As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?
A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!
Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.
"How to Do Nothing," kids activities, back in print!
So perhaps the real objective of this "research" is actually to persuade the guys who pay for the surveys that IBM consulting has better ways of doing them - trust us, we *know*, and that's an extra 20% on the bill for all that research we did into how to do market research.
Panurge has posted for the last time. Thanks for the positive moderations.
Asking people right out "Hey, did you have unprotected anal sex on your casual encounter?" was found to be not a particularly good way to elicit truthful answers. So what you do is give people a fair coin (or the equivalent) and have them flip the coin for each question. If the coin lands heads, they answer "yes". If the coin lands tails, they answer *truthfully*. Looking over an answer sheet, you have no idea which "yes" answers are real and which are not, and subjects did feel like nobody really could "get" any personal information off their answer sheet. In the statistical aggregate, however, you could get perfectly useful average rates for a given population. (Basically, you just adjust for the "yes answer background".)
A great idea, but its use in a wide-range study of this type was axed, I believe, when the study itself was blasted by certain members of Congress...but that's another story.
Babar
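The "adjust for the yes answer background" step above can be sketched as follows. This assumes a fair coin, so the observed "yes" rate is 0.5 + 0.5 * (true rate); the function names and the 20% true rate in the simulation are made up for illustration.

```python
import random

def respond(truth, rng):
    """Heads: answer yes regardless. Tails: answer truthfully."""
    return True if rng.random() < 0.5 else truth

def estimate_true_rate(answers):
    """Invert observed_yes = 0.5 + 0.5 * true_rate."""
    yes_rate = sum(answers) / len(answers)
    return 2 * yes_rate - 1

# Simulated population in which 20% would truthfully answer yes.
rng = random.Random(1)
population = [rng.random() < 0.2 for _ in range(100_000)]
answers = [respond(truth, rng) for truth in population]

print(f"estimated true rate: {estimate_true_rate(answers):.3f}")
```

No individual "yes" can be pinned on a respondent, since half of all sheets say "yes" by coin alone.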
I fail to see how this is a real technology. They have untrustworthy results, so they want to get results that make sense, so they mess with them. So they already know what they want to get, and by implementing this system, they manipulate the "data" so that they do. What's the worth in asking actual people, anyway? If you assume that they lie and then change their data entirely to fit into your bell, why don't you just make up the data entirely? I bet you could make it fit much nicer on the bell if you did that.
They've missed the point about why their forms are full of bullshit.
The forms are giant time-wasters.
If the folks giving these surveys would stick to EXACTLY what they NEED to know, we wouldn't balk at filling them out properly- especially since personal data is one thing they generally do NOT need to know for marketing!
Forget the name, address, interests (the BIGGEST time waster of all.) Generally, the most important information that you can get from site visitors is:
1) Zip code. This tells you the geographic area that your visitors are coming from. Useful for location-relevant information, but completely impersonal.
2) Age range. This is really the prime info that marketers want, as so much of their "science" is based on generational observation. Again, totally impersonal.
3) How you heard about the site. This is the most important thing you can learn from your visitors, as it gives you some information on which advertisements are performing!
If every site I signed up to asked me these three questions and these three questions ONLY, I'd answer them all truthfully. As it stands, I have to dig through a mountain of shit, and these days I generally just throw the shovel at the pile and move on.
What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey
Client-side obfuscation is a great idea, but trusting some website's Javascript isn't good enough. Not everyone knows how to inspect source code for hidden fields. This would be best implemented as a browser feature or plug-in, so you use the same code to obfuscate forms every time, and it's made by people independent of the surveying website.
Isn't the fact that people are providing false data enough randomization? Why doesn't IBM just process the data that is already provided, rather than trying to collect it "their" way?
All they need is an idea of the distribution of the random numbers generated by the average person.
...which is not unusual on Slashdot - I do it all the time as well.
The idea of randomising answers is not new. It has been used in 'socially sensitive' surveys for years, if not decades.
Simple explanation:
Have a survey of 10 questions people don't like to reveal the truth of, each with a yes/no answer.
For each question, either
a) reply truthfully
b) flip a coin and record whatever the coin gives.
If challenged about your answer, you can always say that's the answer the coin required.
Analyse the results for a large population of completed surveys. Any significant deviation from 50% yes and 50% no answers tells you which way the population answered, without revealing who actually holds those views.
All you need is a coin to randomise your answers. This is independent of any web form, doctored answer sheet etc etc - so particular answers cannot be pinned on you.
It's fun administering the same survey to people with and without the randomisation - you get to see what people in general lie about!
Hope this gives a useful summary of the method.
Regards,
pgrb
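The post above doesn't say how a respondent chooses between (a) and (b). A sketch of the analysis step, assuming the choice itself is made at random with a known probability (`p_truthful`, a hypothetical parameter, defaulting to one half):

```python
def estimate_true_yes_rate(observed_yes_rate, p_truthful=0.5):
    """Invert observed = p_truthful * true + (1 - p_truthful) * 0.5.

    Truthful respondents say yes at the true rate; the rest report a
    fair coin, which says yes half the time.
    """
    coin_yes = (1 - p_truthful) * 0.5
    return (observed_yes_rate - coin_yes) / p_truthful

# If 35% of completed surveys say yes, the implied true rate is about 20%.
print(round(estimate_true_yes_rate(0.35), 3))
```

An observed rate of exactly 50% maps back to a true rate of 50%, matching the remark that only deviations from 50/50 carry information.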
This line intentionally left..uh..blank?
I enter a new sub every time I read NYTimes...
This time my email was theevildoers@whitehouse.gov, I was 60 yrs old, lived in Afghanistan, was CEO in the Energy business (that sounds about right for the email addy), and subscribed to the Times.
If you can't get me to input truthful info in the first place, which I won't, no matter how "anonymously" the data will be kept, how is randomization going to help?
Isn't it ironic that we are using the random NY Times registration generator to read an article about random registration data? Sort of proves the point, doesn't it?
Most people will not understand what the browser does (probably with JavaScript). Even those who could would not bother reading the page source to make sure the data isn't transmitted in the clear.
It will not make any difference whether the data is encrypted on the client or the server.
ISTR it's in Tanner's book on Gibbs sampling, as a method used to extract accurate population estimates about embarrassing, personal or even incriminating subjects, such as past exposure to STDs, sexual orientation, or the use of particular controlled drugs.
Of course, your survey has to be big enough so that the expected number of true positives (N.p) stands out above the expected uncertainty in the number of false positives, approx sqrt(N.p'.(1-p')). If p is small, N may have to be really quite big.
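To put rough numbers on "quite big": requiring N.p >= k.sqrt(N.p'.(1-p')) and solving for N gives N >= k^2.p'.(1-p')/p^2. A small sketch of that arithmetic, where `k` (how many standard deviations the signal should clear) is an assumed knob not in the original comment:

```python
import math

def min_sample_size(p, p_noise, k=1.0):
    """Approximate smallest N with N*p >= k * sqrt(N * p_noise * (1 - p_noise))."""
    return math.ceil(k * k * p_noise * (1 - p_noise) / (p * p))

# A 1% trait hidden behind coin-flip noise (p' = 0.5):
print(min_sample_size(0.01, 0.5))        # roughly 2,500 just to match the noise
print(min_sample_size(0.01, 0.5, k=3))   # roughly 22,500 to clear it by 3 sigma
```

The 1/p^2 term is what hurts: halving the prevalence of the trait quadruples the required sample.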
I remember reading about something similar to this in a psychology class in 1988 or so. The idea was for people doing a door-to-door survey asking things like sexual behavior. There are important public health reasons to have the data, but also strong reluctance to give honest answers.
What they did was give the person being polled a spinner, like from a board game. (Remember those, oh you young /.ers? Maybe not...) It was divided in two parts, 2/3 would say "yes" and 1/3 would be "no". The questioner would ask if the person's answer to some yes/no question matched that shown on the spinner (which the questioner couldn't see). You couldn't know what any single person's answer was, but you could do the math and get how many had done what.
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood
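"Doing the math" for the spinner scheme above might look like this. If a fraction p of people would truthfully say "yes" and the spinner shows "yes" 2/3 of the time, the chance of a reported match is (2/3)p + (1/3)(1 - p), which can be solved for p. The function name is made up for illustration.

```python
def estimate_yes_rate(match_rate, spinner_yes=2 / 3):
    """Invert match = s*p + (1 - s)*(1 - p) for p, where s = spinner_yes."""
    if spinner_yes == 0.5:
        # A 50/50 spinner makes the match rate 0.5 whatever p is.
        raise ValueError("a 50/50 spinner carries no information")
    return (match_rate - (1 - spinner_yes)) / (2 * spinner_yes - 1)

# If about 43.3% of respondents report a match, about 30% are true "yes".
print(round(estimate_yes_rate(1 / 3 + 0.1), 3))
```

The closer the spinner is to an even split, the more privacy each respondent gets and the more responses are needed for a stable estimate.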
I used to work for a company whose customers had to provide accurate information in order to sign up -- the service wouldn't work with false info -- but the problem was getting people to sign up.
One of the main selling points was that customer data was completely secure: no one will ever be able to read your data, only an aggregate report of all our users. The company went to a lot of trouble to make this point convincing, going so far as to suggest that users had legal protections against abuse. There were people in the building who spent all their time trying to think of ways to convince more people to drop their defenses so we could exploit their information -- cold, calculating, 24-7, like WOPR spends all its time playing World War Three.
I believed their claims until the day I saw a user's sensitive data on an engineer's screen. And then that engineer showed me another user's data, and another. "We've always had the ability to do this," he said, "for, ahem, quality control purposes."
If a company tells you it isn't collecting the valuable data you provide, you need to assume it is lying (unless you can personally verify the claim or you are positive that the law protects you against abuse).
People are "lying" because corporations lie, as a matter of policy. This will never change because lies are more profitable than truth. Only corporations don't call their behavior "lying," they call it "marketing." So when I fill out an intrusive form with false information, I don't consider it lying either. I call it "standing up for my right to privacy." This system of "marketing" versus "standing up for my rights" is well-balanced, but this new masking technology is simply a marketing attempt to tip the scales in the corporations' favor by tricking consumers into volunteering information on false assumptions.
studies have shown that over 67% of statistics are made up on the spot :)
Jeremy
CmdrTaco, root@cmdrtaco.net
Timeo idiotikOS et dona ferentes
Why do companies take these polls to begin with? To make money. Either there is money to be made in interpreting the results or even in providing the results in the first place (see election exit polls). If the pollsters are looking to make a profit off of the information, why not share that profit with the people that gave you the information to begin with?
Going back to the example of the exit poll, if all you're going to do is try to make money by predicting who will win an election, it's much more satisfying for the voter to lie and watch them squirm when they get it wrong. Why should we tell the truth?
ROFLMAO
Or the computer could pick random numbers out of a hat, with the chances of picking any one number the same as for any other.
Now, all other problems with the article aside, I think this little sentence about generating random numbers is the most problematic. As most people are aware, generating random numbers on a computer is no trivial thing. If they have since grown the ability to draw numbers out of a hat...I'd really like to see it.
i always thought that the way to get accurate data is to make fields *non required* so people don't have to game the form if they don't want to enter the data. this seems pretty obvious to me, but i guess marketing people aren't known for having a firm grasp of the obvious.
In his book The Shockwave Rider, Brunner posits an online service called "Oracle" which operates by aggregating responses from a large enough cross section of the general population to gain answers to questions. Brunner suggested that even though the members of the cross-section didn't actually know the answers, in the aggregate the oracle was more often right than wrong.
Looks like the NYTimes reads old science fiction novels.
All surveys have a confidence interval, and 1000 flips of a coin have a very high confidence of being very close to 50-50 heads/tails.
There are of course more sophisticated versions of this with dice or a more complex random number generator, but with the interviewer not knowing if a person is telling the truth, it usually makes people much more confident in being honest as nothing can ever be proven about what they say.
This is an attempt to find a technical solution to a problem of human motivation.
Nobody gives a crap about providing truthful survey answers to marketing weasels. Obviously, technology can never provide a solution for this problem.
This is exactly parallel to the music industry's problem. Nobody gives a crap about their copyrights, and so their attempt to find a technical solution has been a failure.
To fully understand technology, you need to understand what it CAN'T do.
Adding noise does an OK job of protecting an individual response, but after years of submitting survey responses to many web surveys, there'd be plenty of data to make excellent estimates of your personal attributes.
How many users are aware that many of the sites they visit pool data?
There's an enormous body of research on how to hide individual records in databases, ranging from adding noise to preventing queries that access fewer than a set number of records. In the end, none of the methods work well - all have simple or clever workarounds. Even individual records aren't very well protected by adding noise if the record size is large enough and fields are dependent.
If randomizing simply introduces an element of inaccuracy, why not just have the options include a range?
i.e Age?: (0-10), (10-15), (15-25) etc
To me, this is all randomizing seems to achieve in regards to privacy, and overall accuracy of the results. At least then you don't have to rely on the host collecting the results randomizing them, because you would already know the answers are fairly 'vague'.
I.O.U One Sig.
WHAT'S your age? Your salary?
Online merchants who ask nosy questions like that on surveys at their Web sites have learned what usually honest visitors will do.
Fib, most likely.
People give false answers to protect their privacy. Then, because the data is so unreliable, companies can't use it to help them run their businesses.
Two I.B.M. researchers have devised software that seeks to get around this information age impasse. Rakesh Agrawal and Ramakrishnan Srikant, computer scientists at the I.B.M. Almaden Research Center in San Jose, Calif., have devised a data-mining program that would cloak individual truthful answers that people might enter once their trust was won but still recover important characteristics of the overall group.
For instance, instead of recording the answer "41" to a nosy question like "How old are you?" the software automatically adds a random number of years within a specified range, say minus 30 to plus 30, to the answer. No record of initial answers is kept.
Then, using a series of mathematical guesses based partly on how the initial data was randomized, the program gradually reconstructs a realistic distribution of the age groups that responded -- how many people were 20 to 25, say, or 40 to 45. Demographic information like this might be of great interest to a company in quest of 25-year-olds to buy its sports cars or computer games.
Some inaccuracy results when the I.B.M. program approximates the actual distribution of age, salary or other characteristics in such large data sets, said Ann Cavoukian, the commissioner of information and privacy in Ontario. "But in return for about 5 percent inaccuracy, you have a privacy model in which individual answers are not used," she said.
Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.
"Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said. Conservative estimates are 42 percent, but anecdotally the rates are far higher, she added. "People are lying," she said, "and vendors don't know what is false and accurate, so the information is useless."
Dr. Agrawal said that his way of reconstructing data was based on hiding the true numbers, although not through the sort of lying practiced by ordinary people confronting a questionnaire.
"When people lie randomly -- and that is what they do now when they answer questions -- we get very poor results," he said. But by "adding random values to true values," he said, "we can reconstruct a distribution that is very close to the actual one."
Dr. Srikant said, "We know a lot about the distribution of these random values."
The random numbers generated by the computer could be distributed in a bell curve, for instance, with most values clustered near zero and fewer at either end. Or the computer could pick random numbers out of a hat, with the chances of picking any one number the same as for any other.
Using this information, Dr. Srikant said, the researchers make a first guess at what the true distribution should be. Then the program crunches through the analysis and produces a slightly better guess. This guess is crunched again, and the process is repeated over and over again, getting closer and closer to the actual distribution.
"When you do this for 10,000 answers, the overall distribution is likely to be accurate," Dr. Srikant said.
Johannes Gehrke, an assistant professor of computer science at Cornell University who specializes in data mining, said the program was the first effort to address in depth the challenge of reconstructing a distribution of large data sets in the context of data mining.
"You know the record after randomization and you also know how you randomized the record," he said. Those two pieces of information, along with a standard statistical theorem called Bayes' rule, allow the program to estimate the prior distribution.
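The guess-and-refine loop described above can be sketched in a few lines. This is an illustration in the spirit of the article's description, not the researchers' actual program: the function name, the candidate age range, the made-up population and the iteration count are all assumptions.

```python
import random
from collections import Counter

def reconstruct(observed, candidates, noise_pmf, iterations=50):
    """Iteratively estimate P(true value = a) for each candidate a."""
    candidates = list(candidates)
    counts = Counter(observed)
    n = len(observed)
    est = {a: 1 / len(candidates) for a in candidates}  # uniform first guess
    for _ in range(iterations):
        new = dict.fromkeys(candidates, 0.0)
        for w, c in counts.items():
            # Bayes' rule: how likely is each true value, given that the
            # randomized record is w and given the current guess?
            weights = {a: noise_pmf(w - a) * est[a] for a in candidates}
            total = sum(weights.values())
            if total == 0:
                continue  # w is unreachable from every candidate
            for a, wt in weights.items():
                new[a] += c * wt / total
        est = {a: v / n for a, v in new.items()}  # the slightly better guess
    return est

SPREAD = 30

def uniform_noise_pmf(d):
    """Noise drawn "out of a hat": every offset in [-30, 30] equally likely."""
    return 1 / (2 * SPREAD + 1) if -SPREAD <= d <= SPREAD else 0.0

# Made-up population: about 70% aged 20-30, the rest aged 50-60.
rng = random.Random(42)
true_ages = [rng.randint(20, 30) if rng.random() < 0.7 else rng.randint(50, 60)
             for _ in range(10_000)]
observed = [a + rng.randint(-SPREAD, SPREAD) for a in true_ages]

estimate = reconstruct(observed, range(18, 71), uniform_noise_pmf)
under_40 = sum(p for a, p in estimate.items() if a < 40)
print(f"estimated share under 40: {under_40:.2f}")
```

Only the randomized records and the known noise distribution go in; the true ages are used here solely to manufacture test data.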
Random perturbation, the formal name of the technique used by the I.B.M. researchers to mask the original answers, satisfies the demand for privacy to a greater degree than many other procedures available to organizations, said David F. Andrews, who recently retired as a professor in the department of statistics at the University of Toronto.
"The idea that you can take data from a population, add random noise to it and then recover important characteristics from this perturbed data has a long history," he said.
Techniques that reconstruct distributions without revealing individual information may be welcome not only to people filling out forms but also to companies that ask touchy questions. "If companies have data and it escapes, they could be liable for data breaches of security," Dr. Andrews said. "This way, you can't be sued."
The program and related ones by other researchers may help companies explore raw data presently closed to them, said Christopher W. Clifton, an associate professor of computer science at Purdue University and author of a chapter on security and privacy in the forthcoming LEA Handbook of Data Mining (Lawrence Erlbaum Associates). "These programs ensure that the original data values can't be reconstructed, but are still close enough to the real results to be meaningful."
The I.B.M. program has been tested in the lab and a prototype is available. Dr. Cavoukian said she hoped that businesses would soon come forward to do beta tests of the software.
"Usually technology is used to invade privacy," she said. "I like this program because here we are using technology to protect privacy."
One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.
<sarcasm>
That would certainly be psychologically reassuring. It would show users that the data they entered could not be used to target them specifically but could still provide useful group demographics to the company. As a web retailer, I can't wait for this to become widespread. I'll put up a registration form on my web site with a "randomize" button next to the submit button. When a user clicks on the randomize button, the javascript on the page will back up the real information that the customer entered into another set of variables, then randomize what the display boxes show to make it look like the data has been randomized. When that user then clicks the submit button, which belongs to the form containing all the backed up true values in "hidden" type inputs, the customer's data will be sent through a POST so the user can't see that his or her data has not been protected.
It's brilliant. It gets me the data I want without regard for the user's privacy and it solves the problem of the user not trusting my web-based business.
<sarcasm> <!-- does sarcasm stack? -->
But what if some random user clicks "view source" and finds out what I'm doing? Well, of course they'll report it to the EFF which will muster its army of thousands of high paid lawyers, public relations masters, and black belt ninjas to sue me out of business, run television ad campaigns to keep the public informed, and quickly and (get this) anonymously take out all my top execs, just like they've always done when combatting such problems as Passport(R)'s invasions of privacy and the DMCA.
</sarcasm>
</sarcasm>
There's no incentive to answer correctly. Good old-fashioned generosity and truthfulness are more than cancelled out by spam.
I don't lie to deceive. I just answer as quickly as possible. I don't care if people know who I am. It's just easier to enter a@a.com and pick the first choice in each multiple choice group.
... of all survey answers are bogus
And how exactly do they measure that? Are you lying:
Yes
No
Random
Obviously you then discount everyone who said no, because only the ones that answered yes can be certain to have told the truth, but then they... oh dear
And in related news, 74% of all statistics are made up.
Glenn
The Smrt way to trade CFDs on the ASX
So it seems to me that IBM has just reinvented the wheel. What would have made it more interesting is if they had some way to randomize the answers automatically without having to trust the company since without this the whole process is useless and asking visitors to randomize the answers themselves has two problems:
How many people would actually do it properly rather than just get through the stupid forms as fast as possible? If you completely lie then there is no way to extract the truth (unless you can show that everyone's lies form some statistical pattern)
Is human randomness truly random? If not (and it's almost certainly not!) then a lot of work will be needed to extract the correct distributions, especially since there may be correlations. For example when asked to randomize their age are women more likely to subtract 10 years than add 10 years? Are teenagers more likely to add 10 years than subtract it etc.
What might solve the second problem is a browser plugin that would display forms and add randomization to them when submitted. However I don't see any way to solve the first problem so I think this is really a non-starter.
What I hate is the e-mail checkers. To get past them, I have to risk using someone else's real e-mail address.
Back to the topic, even if they promised to fake 100% of the answers, I still wouldn't fill in the truth. I've been spammed by almost every company I've given my e-mail to, even when they promised not to.
'SBEMAIL!' is better than a goat!!
From a mathematical perspective, isn't a random function reversible? The chance of reconstructing the scrambled data is slim, but we shouldn't rule out the risk. Why don't they use an irreversible function like MD5?
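One caveat on the MD5 suggestion: a hash is only hard to reverse when the input is hard to guess. For a value like age there are so few possibilities that every one can be hashed in advance. A small sketch, assuming ages are hashed as plain decimal strings with no salt (the helper names and the 0 to 120 range are made up for illustration):

```python
import hashlib

def md5_of_age(age):
    """Hash an age the naive way: MD5 of its decimal string, no salt."""
    return hashlib.md5(str(age).encode("ascii")).hexdigest()

# Precompute the hash of every plausible age.
LOOKUP = {md5_of_age(age): age for age in range(121)}

def recover_age(digest):
    """Return the age whose MD5 digest matches, or None if out of range."""
    return LOOKUP.get(digest)

print(recover_age(md5_of_age(28)))  # prints 28
```

Since equal ages always produce equal hashes, the hashed values still map one-to-one onto the originals, so nothing is hidden from whoever holds the table.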
The company is providing you with a product, often for free, and they request that for you to use their product you give them a little personal information. It is their product, so they get to make the rules. Your choices are to give what they want and take what you want, or you could just live without it. I don't understand the position of taking what you want and not leaving what they want.
Or you consider this tiny piece of personal information part of the price. Instead of giving them $5, you give them your age, salary, and email address. You don't try to trick the grocery store clerk when you think the bill was too high for what you bought, do you? Why would this be any different? If you don't like the invasion of privacy, then the cost is too high for you and you don't take the product.
I can see where people may say that capital is a required part of making the product and personal information isn't. Since they don't need my email address then I should feel free to not give it to them. However, this personal information often does translate into capital for them (the goal of business is to make money most of the time). Besides, that isn't your decision to make. The company wants your privates and they are giving you the product, so their desire carries more weight. If you were not receiving something back in return, then their desires would not override yours.
That it is only a "little" lie doesn't change the fundamental aspect that lying is a priori wrong.
...that we use a Random NY Times Registration Generator to falsify [our] registration details to access an article about ways to persuade people to give correct answers to survey questions?
:)
Helluva page btw, majcher. Thanks
May we live long and die out
Since conservative estimates say nearly half of all survey answers are bogus...
How did they come up with that estimate? A survey?
davejenkins.com |
Beef jerky?
Light cup, beer drink, thin so chain, neck turtle fat, man I won't say it again
That latter objection is the one that has botched any number of theoretically sound online voting systems.
Voting systems were designed to randomize some votes to protect anonymity? Very interesting. Even if people trust it, though, I wonder about the constitutionality and politics of it. Could you imagine Florida 2000 with some votes intentionally randomized?
My objection to online voting was not so much anonymity as security. It would be worth huge amounts of money to some people (from foreign gov'ts, to politicians, to large companies) to rig an election. How do you secure every variation of computer that exists in America against that kind of attacker; think of the NSA (i.e. their foreign equivalents). But that's OT.
This confuses them, and my responses certainly will be dropped as "statistical outliers". But it sure is fun to get these tele-surveyors working outside of the comfort zone!
Now, if the politicians who commission these "unbiased" surveys hadn't excluded themselves from the state "no-call" law, all of this wouldn't be necessary. Until they do that, I'll just enjoy myself at their expense.
You'd need it to be a browser feature, rather than a feature implemented by the browser while running javascript in the page. Obviously, this involves a lot of changes, but this is theoretical research anyway.