Slashdot Mirror


Randomizing Survey Answers For Accuracy

Saint Aardvark writes: "The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results. Strangely enough, they can get accurate information out of the aggregate of enough answers -- but it's completely anonymized. Since conservative estimates say nearly half of all survey answers are bogus, there's an interest in persuading people to be more truthful. As ever, you can use the Random NY Times Registration Generator to falsify your registration details and read the article..."

26 of 224 comments

  1. I don't get it. by AKAImBatman · · Score: 4, Insightful

    Ok, fine. They've managed to come up with a model that doesn't actually collect any data. And how will this help people to enter REAL data? People don't give data because they don't trust the company. If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?

    1. Re:I don't get it. by plugger · · Score: 3, Insightful

      If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?

      Fair point. One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.

      Then again, if all you are interested in is aggregate data, just don't ask for any personally identifying information.

    2. Re:I don't get it. by dboyles · · Score: 3, Insightful

      ...what really pisses me off is places that correlate the state you select with the zip code. Places like that seem to be deliberately positioning themselves AGAINST me, so I intentionally fill it with erroneous data because they have become my adversary in the case of this page.

      You seem to have some sort of problem with this, as if they are somehow tricking you. No, it's just a validity check in an attempt to ensure accurate data. What I find interesting is that they would give you an error and ask you to fill in the form again.

      Let me explain: let's say you've filled out a 10-question form asking for name, email, age, location, and a few "consumer behavior" questions. If you've done all this accurately, it files your data and lets you proceed. But if you've done it inaccurately (in this case, filled out a ZIP/state that don't match), it kicks you back and makes you correct it. So this time you put in a valid ZIP/state. You submit it, and it files your data away and lets you proceed.

      The problem is that your data still isn't accurate, and therefore should be thrown out. Maybe your ZIP/state is correct now, but maybe you just put 90210/CA. A much better solution from a data integrity standpoint is to allow that user to enter junk data, but to not factor in that bad data when drawing conclusions.

      I think there needs to be much more research in this area if anybody expects to get good data out of the internet. IBM's studies seem to be a step in the right direction. Not only do they want to improve data integrity for the company, they're also factoring in another important issue: privacy.

      --
      -- "Complacency is a far more dangerous attitude than outrage." -Naomi Littlebear
    3. Re:I don't get it. by pheonix · · Score: 3, Informative

      My partner and my company do this for large corporations (a great deal in the automotive sector) and here's what we've found.

      Frequently, the people that give input simply misread questions... for example 'How many males over the age of 18 in your household INCLUDING YOU' as opposed to 'NOT COUNTING YOURSELF'. Or they make typos. Error checking can frequently fix that. It is not correct to say that the whole dataset is wrong just because they mis-keyed their ZIP.

      We've found that the most positive way to get good data is to get people that WANT to tell you their opinions to take the survey. Forcing someone to take the survey for free stuff or to take part in something just doesn't work. Giving them the free stuff then saying "Hey, would you like to give us your opinion" on the other hand, does. The only drawback is that you would assume you're tainting the respondent's opinion. Given the amount of research we've put in, we've actually found the opposite... people say "hey, I've already got my free shit, now I'll tell em how I REALLY feel". I don't see much of a purpose in what IBM has come up with.

    4. Re:I don't get it. by DennyK · · Score: 3, Interesting

      Heck, usually it's LESS work to lie. Much easier to select the first or last option in a list than to hunt for the one that applies to you, or say you live in "dkjhgkjhdgs dshkjgdsh, AL" than to actually type your real address. And if they insist on cross-checking your ZIP and state, then what else is there except CA and 90210? ;) (Guess crappy TV shows can have their uses after all... ;) ) I'd love to see a study done about what % of visitors put CA/90210 for a state/ZIP in those places that do the cross-checking. That would give you a damn good idea about how many people lie like hell on those surveys... ;)

      DennyK

  2. Slashdot Poll? by jedwards · · Score: 5, Funny

    Did you lie when answering this question?

    O Yes
    O No
    O Cowboy Neal told me the answer

    1. Re:Slashdot Poll? by Subcarrier · · Score: 5, Funny

      Did you lie when answering this question? Yes

      Truth is often the most devious of lies.

      --
      "I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush
  3. This will not affect user behavior by treat · · Score: 5, Insightful

    Do they expect that people will enter real data on the mere promise that it will be stored in some randomized, aggregate, or other form that does not invade their privacy? If the corporation could not be trusted in the first place, no statement they make will make them trustworthy.

    1. Re:This will not affect user behavior by Blue+Stone · · Score: 4, Insightful


      All they have to do is stop asking for my name and e-mail address, and I could be truthful about pretty much anything else they'd care to ask.

      --
      Corporation, n. An ingenious device for obtaining individual profit without individual responsibility. - Ambrose Bierce
  4. Missing the Point by Inexile2002 · · Score: 3, Insightful

    Sounds all fine and dandy for science, but people are usually honest with a professional researcher who is going to guarantee your anonymity, and moreover the research data is going to be used for something tangible rather than selling something right back to you.

    Market researchers want information on YOU. They want generic info on your demographics, but this information has been available from other venues for a long time. When spyware and other information gathering techniques are employed against someone they are being used to collect data to target marketing at that person specifically. Literally employed against that person.

    As such, I'll still say that I'm female, in my 50's, from Yemen and making less than $12,000 a year. Randomize away.

  5. Re:How does this stop people from being false? by irony+nazi · · Score: 3, Funny

    It doesn't take the irony nazi to point out the sweet, sweet irony in using the random NYT account generator to read the story.

    --

    Bringing irony to the Slash-masses
  6. optional vs. required by verbatim · · Score: 5, Insightful

    I think there is something to be said about companies that ask for information as an option versus companies that ask for information as a requirement.

    For example, company XYZ has released a program called Widget. In order to download Widget, users are asked to fill out a survey so that XYZ may gauge the demographics of their target audience.

    Some sites will allow you to bypass this step and proceed to download the software. Other sites require this information before revealing the download link. I think that the psychological difference between "required" and "optional" would heavily influence the honesty of the answers.

    I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company. The difference? I'm not being forced to give anything up in advance.

    Is this true in general? I don't know. But it makes sense to me.

    I have an idea for something to replace the survey forms - an AI program to carry out a conversation with the user. Ah ha! We just have to watch out for users that say to the AI - "I am lying" - and hope the AI doesn't need therapy.

    --
    Price, Quality, Time. Pick none. What, you thought you had a choice?
    1. Re:optional vs. required by Anne_Nonymous · · Score: 3, Funny

      >> information as an option versus...information as a requirement

      The New York Times thinks I'm a 146 year-old lady who makes less than $10,000 a year, has 3 children in high-school, and enjoys golf and motorsports in her spare time.

  7. Does this increase trust? by SeanTobin · · Score: 5, Funny

    I hope these companies aren't asking users to 'trust' them with their personal information based on the fact that we are supposed to trust them to randomize it.

    Personally, if I don't trust them enough to tell them how much I make, I'm not going to trust them to randomize my results. I don't see how this will increase accuracy -- especially if I keep telling everyone I'm a 108 year old female in Uganda making $100,000+ per year who works in the sales department of an Educational field and plans to make purchases of an SUV, a house, a console gaming system, an optical mouse in the next six months and rates their internet experience as very low. My e-mail address is sjobs@mac.com and I would like to apply for your quarterly, monthly, weekly, daily, and hourly newsletters and I do give permission to pass this information to your affiliates.

    --
    Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;
  8. That's just stupid by photon317 · · Score: 4, Insightful


    Let me summarize:

    1) People lie on surveys, most likely because they don't trust the taker - but probably also just because they like putting in other answers (yeah, I'm a millionaire, woohoo!, etc). This only addresses the trust issue, ignoring other potential sources of lying.

    2) In order to work around the trust issue, they've developed a method of injecting random noise into the original answers as they are recorded and then extracting useful data in the end.
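
    A minimal sketch of point 2, assuming the simplest possible scheme: every answer gets zero-mean random noise added before it is recorded, and only an aggregate (here, the mean) is extracted afterwards. The function names, the age range, and the noise width are made up for illustration, and this is not claimed to be the researchers' actual reconstruction method.

```python
import random

def perturb(value, spread, rng):
    """Record the answer plus zero-mean uniform noise in [-spread, spread]."""
    return value + rng.uniform(-spread, spread)

def mean(values):
    return sum(values) / len(values)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical survey: 10,000 respondents report their age.
true_ages = [rng.randint(18, 80) for _ in range(10_000)]

# Only the perturbed values are ever stored.
recorded = [perturb(age, 20, rng) for age in true_ages]

true_mean = mean(true_ages)
estimated_mean = mean(recorded)
print(f"true mean {true_mean:.2f}, mean of noisy records {estimated_mean:.2f}")
```

    Any single recorded age can be off by up to 20 years, but the noise averages out across many respondents, so the aggregate stays close to the real figure.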

    Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.

    Without bothering with all this mumbo-jumbo, I can build a trustworthy system. I simply record survey statistics, and I promise not to use the individuals' personal data individually.

    They can either trust me that I'm telling the truth about this, or they can lie. In the IBM researchers' scenario, the users are again asked to trust that the backend system doesn't compromise them, and again they can choose to trust it or choose to lie.

    Given the above, why on earth would you bother with this research and unnecessary complexity? It's not going to make any difference over just promising your users that you don't invade their privacy. You could replace their research results with a banner on top of the survey that says "After you submit your data to us, we use Magical HibiJibi technology to prevent ourselves from invading your privacy, so please trust us and answer truthfully."

    What a waste of research.

    --
    11*43+456^2
  9. Re:5% Error on "reconstructed data" by DaCool42 · · Score: 3, Informative

    No, what they are saying is that their random number generator is very random. If you have a perfectly randomly generated bell curve set of numbers, it makes it easier to reconstruct the original data. Think of an audio signal for example. If you have a sine wave (your data) mixed with white noise (perfectly random), you can quite easily pick out the sine wave. It's the one frequency that is louder than all the rest. However, if instead of white noise you have noise that is not perfectly random, you will not see a clear sine wave, but several different frequencies.
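
    The audio analogy can be tried out with a rough sketch like the one below. It assumes a naive discrete Fourier transform written with the standard library only (slow, but dependency-free); the window length, the tone's frequency bin, and the noise level are arbitrary choices for illustration.

```python
import cmath
import math
import random

def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform; magnitudes for bins 0..n/2-1."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        total = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        mags.append(abs(total))
    return mags

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 512
tone_bin = 40  # hypothetical: the sine completes 40 cycles over the window

# The "data": a pure sine wave with amplitude 1.
signal = [math.sin(2 * math.pi * tone_bin * t / n) for t in range(n)]

# Mix in white Gaussian noise with standard deviation 1.
noisy = [s + rng.gauss(0, 1) for s in signal]

mags = dft_magnitudes(noisy)
loudest_bin = max(range(len(mags)), key=mags.__getitem__)
print("loudest frequency bin:", loudest_bin)
```

    White noise spreads its energy roughly evenly over all frequencies, while the sine piles all of its energy into one bin, which is why that bin stands out.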

    --

    ----
    All of whose base are belong to the what-now?
  10. User Interface and Implementation by WEFUNK · · Score: 4, Insightful

    Interesting approach, but useless unless people actually understand and trust the system. For this to happen will probably require widespread adoption, an easy to understand explanation of the process, and assurances that answers really are randomized. These requirements obviously force a bit of a chicken and the egg scenario.

    Explaining the whole randomization process (how it protects privacy, how it provides useful info) will be a little much for most people I think, but a good user interface might alleviate this, perhaps with a 'randomize' button that is used before hitting the 'submit' button. This would take the user input and change it right in front of their eyes. Of course many would be rightfully concerned that the randomize button is just for show (or simply encodes but doesn't anonymize), but I think that enough people might buy into the false sense of security that demonstrated 'randomization' provides to at least partly improve the % of bona fide results. Also, the system could be set up so users who don't mind submitting traceable information could be encouraged ("extra 10% off") to submit without randomization, with a simple flag sorting data into randomized/anonymous and non-randomized/non-anonymous data.

    This approach would be even better if the randomization approach becomes a ubiquitous standard backed by a consistent and legally accountable and well-known entity/brand (IBM for instance). I'm not sure how well an open solution would work unless there was a central group assuming responsibility and accountability for the system, enforcing trademarks, and suing spoofers. Also, people feel safer when they feel there's someone to blame for any abuse/mistakes (hence, giving their credit card freely to a waiter but not to a website).

    --
    My next sig will be ready soon, but friends can beat the rush!
  11. Old trick by guanxi · · Score: 4, Informative

    As another poster observes, if you don't trust them with the data, why trust them to randomize it?

    My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:

    With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.

    If the second hand is between 1-30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31-60, they tell the truth.

    The surveyor knows very precisely the expected number of 'lies' and can extract accurate data, and the respondent has confidence and control over their privacy. All without a transistor.
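
    For anyone who wants to check the arithmetic, here is a small simulation of the coin-and-clock trick. It assumes a fair coin and a clock that lands in each half-minute equally often, so the observed "yes" rate is 0.25 + 0.5 * p, where p is the true "yes" rate. The function names and the 30% figure are invented for the example.

```python
import random

def respond(truth, rng):
    """One answer under the coin-and-clock scheme."""
    if rng.random() < 0.5:          # second hand in 1-30: answer per the coin
        return rng.random() < 0.5   # heads = yes
    return truth                    # second hand in 31-60: tell the truth

def estimate_true_yes_rate(answers):
    """Invert observed = 0.25 + 0.5 * p to recover p."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)

rng = random.Random(7)  # fixed seed so the run is repeatable
true_rate = 0.30        # hypothetical share whose honest answer is "yes"
truths = [rng.random() < true_rate for _ in range(100_000)]
answers = [respond(t, rng) for t in truths]

estimate = estimate_true_yes_rate(answers)
print(f"estimated yes rate: {estimate:.3f}")
```

    The surveyor never learns which individual answers came from the coin, only that about half of them did, and that is enough to back out the aggregate.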

    1. Re:Old trick by cduffy · · Score: 3, Insightful

      The problem with these techniques is that you can't force the user to do it manually (as they won't), and the user can't trust their own computer (running someone else's software) to do it for themselves. That latter objection is the one that has botched any number of theoretically sound online voting systems.

      Useful in theory? Very. Useful in practice? Not so much.

    2. Re:Old trick by AJWM · · Score: 3, Interesting

      Indeed, very old trick. (For my sins, in my earlier days I used to help PhD psych students run statistical analyses on their survey data.)

      A variation on this is to give the respondent a die (i.e., half a pair of dice), tell them to pick a number between one and six, and every time they roll that number, intentionally give a false answer on the survey. Thus, looking at any individual survey response, you don't know whether it's true or false, but you can factor in the 16.7% false responses into the statistical analysis.

      Sure, that can be computerized, but as someone above pointed out, how does the respondent know he can trust it? The above old technique is entirely under the respondent's control.
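
      As a worked example of factoring in the false responses, assume a yes/no question, a fair die, and one roll per question, so each answer is flipped with probability 1/6. The observed "yes" rate is then p * 5/6 + (1 - p) * 1/6, which can be solved for the true rate p. The function name and the tally are hypothetical.

```python
from fractions import Fraction

LIE_PROB = Fraction(1, 6)  # chance the die shows the respondent's chosen number

def corrected_yes_rate(observed_yes_rate):
    """Invert observed = p * (1 - q) + (1 - p) * q, with q = LIE_PROB."""
    return (observed_yes_rate - LIE_PROB) / (1 - 2 * LIE_PROB)

# Hypothetical tally: 300 "yes" answers out of 1,000 surveys.
observed = Fraction(300, 1000)
print(float(corrected_yes_rate(observed)))
```

      With that tally the corrected estimate is 20% "yes", noticeably lower than the raw 30%, because a sixth of the honest "no" answers were flipped to "yes".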

      --
      -- Alastair
  12. I'm innocent! by HD+Webdev · · Score: 3, Funny

    "Judge, I did not know she was 14 years old. I'm pleading innocent by reason of randomized, aggregate data!"

    --
    This is not a dream, not a dream...we are transmitting from the year 1-9-9-9.
  13. NYT Random Login Generator by majcher · · Score: 5, Informative

    Hey, it's me. The guy who put together and hosts the New York Times random login generator. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.

    I'd also like to remind everyone that you're welcome to download, copy, and mirror the source of that page on your own servers, or even keep it as an HTML page on your desktop or whatever. It's just javascript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it or they start blocking requests from my page or something. (It will also help distribute my load, though I haven't had any real trouble yet...)

    1. Re:NYT Random Login Generator by shepd · · Score: 4, Insightful

      >If you're one of those paranoid psychos, then don't give them your life story.

      Too bad there's no "Skip this crap" option in their registration screen, huh?

      So, the only way to not give them your life story is to lie. I know! Let's make it easy and create a random login generator so I don't have to type more random crap on every computer I use!

      And, BTW, if you think I'm paranoid, I'll let you know that I was able to make any changes I wanted [but only did what I asked, of course] to my grandmother's phone line by simply asking her age and full name -- ALL of which are sent to NYT on that page. They only asked to hear a lady's voice, which my mother happily provided. Armed with just a birthdate and name I can make all sorts of changes to your services -- anonymously.

      Knowing that, do you want to give me your name and address? If you don't, you should know there's no reason why I'm not working at the NYT right now... I will tell you that where I do work I have access to many, many, many records including Full Names and Birthdates. Feeling uneasy yet? Well, if you trust me, I've never abused those privileges.

      >When they change their registration process and perhaps start charging for their online content, don't start bitching.

      My only bitching will be the fact their site goes offline for everyone. You can't compete in a (literally) Free market by charging infinitely more than your competitors. With the amount of newspapers online right now, and the amount of good content that doesn't come from the NYT, I think they'll end up another salon.

      --
      If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
  14. Why I don't trust survey takers by dpbsmith · · Score: 3, Informative

    As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?

    A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!

    Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.

  15. Re:This is how it would work: by 80N · · Score: 3, Interesting

    95% of web users don't understand a lot of things, but if someone they trust tells them it's OK then they will be happy.

    I don't really understand how SSL works, but I trust my browser (a bit) and when I see https in the URL then I'm comfortable with that. Not because I fully understand SSL, but because I listen to the opinions of people who do.

    So if it became accepted practice that pressing the Randomize button on your browser (why not build it into the browser) made your response anonymous then nobody needs to understand it any more than they do SSL.

    Actually, why not have a new http method: POST-RANDOM instead of POST so the server knows that the data has been randomized.

    80N

  16. Randomizing for Accuracy by hysterion · · Score: 4, Funny

    Rakesh Agrawal and Ramakrishnan Srikant have devised a data-mining program that would cloak individual truthful answers

    Don't trust these guys. They are (obviously) piping their names through some obfuscation algorithm.