Randomizing Survey Answers For Accuracy

How does this stop people from being false? by OneStepFromElysium · 2002-07-21 05:08 · Score: 1

If I want to intentionally put a bogus answer into a poll, such randomization doesn't affect it at all. Quite often, I will answer a poll not in the way that I actually feel, but in the way that interests me the most at that particular moment. This randomization doesn't affect that.

Re:How does this stop people from being false? by Ignavus+Anonymous · 2002-07-21 05:15 · Score: 1

I will answer a poll not in the way that I actually feel, but in the way that interests me the most at that particular moment

Even worse, it's a lot of fun just to screw the poll and prove statistics wrong :D

Re:How does this stop people from being false? by irony+nazi · 2002-07-21 05:19 · Score: 3, Funny

It doesn't take the irony nazi to point out the sweet, sweet irony in using the random NYT account generator to read the story.

--

Bringing irony to the Slash-masses
Re:How does this stop people from being false? by Fjord · 2002-07-21 05:49 · Score: 2, Interesting

Basically people providing false answers will very often pick a false answer based on location. Many will pick the middle option, with others being picked less, though the first and last may have an oddly large number of people choosing them compared to the second and second-to-last options.

When the answers are given in random order, each cycles into the different spots. The liars are actually cancelling out other liars who used the form before them. The differences in the answers are mostly based on how the truth tellers answered (I say mostly because some liars may have a different way of selecting a lie, such as the longest string offered in the answers), and so you can derive more meaningful statistics from them.

See this /. poll that illustrates the phenomenon. As an exercise, imagine what the results would look like if the offerings were randomly ordered for each person the poll was shown to. My bet is each would be near 12.5%.
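
A minimal sketch of the cancelling-out effect described above, in Python. It assumes a deliberately crude model that does not come from the article or the poll: careless respondents always click whatever is shown in the first slot. The name `run_poll` and its parameters are made up for illustration.

```python
import random
from collections import Counter

def run_poll(options, n_careless, shuffle, seed=0):
    """Tally votes from respondents who always click the first slot shown.

    Assumed model: a careless respondent ignores the text and picks by
    position only. With shuffle=True each respondent sees a fresh order.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_careless):
        shown = list(options)
        if shuffle:
            rng.shuffle(shown)
        counts[shown[0]] += 1
    return counts

options = ["A", "B", "C", "D"]
fixed = run_poll(options, 8000, shuffle=False)
mixed = run_poll(options, 8000, shuffle=True)
print("fixed order:   ", dict(fixed))
print("shuffled order:", dict(mixed))
```

With a fixed order every careless vote lands on one option; with shuffling the same votes spread roughly evenly, so whatever differences remain would come from people who actually read the choices.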

--
-no broken link
Re:How does this stop people from being false? by SuperLiquidSex · 2002-07-21 06:09 · Score: 2, Informative

Did you read the article? This doesn't randomize the poll at all. It won't do anything unless people answer truthfully. It's for protecting privacy by randomizing the raw data they receive.

--
Oops....you'll know what I'm talkin about in a bit.
Re:How does this stop people from being false? by donpardo · 2002-07-21 07:18 · Score: 1

Basically people providing false answers will very often pick a false answer based on location.

And this has what to do with the article?

--
Nothing to see here. Move along.
Re:How does this stop people from being false? by Anonymous Coward · 2002-07-21 08:00 · Score: 1, Insightful

Gallup has been randomizing the order of poll answers for many, many years now.

The next major step Gallup made was to randomly give slightly differing forms of a question to estimate the systematic bias due to the phrasing of a question. In my opinion, that is the real key to error estimation in polls. Since you have to phrase a question one way or another, you can never really remove this kind of systematic error, but at least you can estimate how large the error is.
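
A rough sketch of the split-wording idea above, in Python. This is not Gallup's actual procedure, only an illustration: the form ids `v1`/`v2`, the helper names, and the hand-made responses are all hypothetical.

```python
import random
from collections import defaultdict

def assign_form(forms, rng):
    """Pick one wording at random for a respondent."""
    return rng.choice(forms)

def yes_rate_by_form(responses):
    """responses: iterable of (form_id, answered_yes) pairs."""
    totals = defaultdict(int)
    yeses = defaultdict(int)
    for form_id, answered_yes in responses:
        totals[form_id] += 1
        yeses[form_id] += int(answered_yes)
    return {form: yeses[form] / totals[form] for form in totals}

def wording_spread(rates):
    """Gap between the highest and lowest yes-rate across wordings."""
    return max(rates.values()) - min(rates.values())

# Hand-made responses, purely illustrative.
responses = [("v1", True), ("v1", True), ("v1", False), ("v1", True),
             ("v2", True), ("v2", False), ("v2", False), ("v2", True)]
rates = yes_rate_by_form(responses)
print(rates, wording_spread(rates))
```

Because the wording is assigned at random, a large spread between forms points at the phrasing rather than at the respondents; it gives a size for the bias, not a way to remove it.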
Re:How does this stop people from being false? by Yosemite_Mark · 2002-07-22 11:27 · Score: 1

I think the reasoning is that by letting people know that their input will be randomized, they'll be more likely to answer truthfully, as they don't have to worry that their personal information will be stored forever.
Not sure if this would work, though, as you'll have to trust that they're really randomizing, and the reason many people lie in the first place is because they DON'T trust the web site.

Hrm by Anonymous Coward · 2002-07-21 05:08 · Score: 2, Funny

In the past, I'd give false answers. Now I'll need to randomize my true/false answers to throw their randomness off.

fp by fozzy(pro) · 2002-07-21 05:08 · Score: 0, Troll

fp

frost pist by Anonymous Coward · 2002-07-21 05:09 · Score: 0

It hurts and stuff -- plus this story is a dupe.

I don't get it. by AKAImBatman · 2002-07-21 05:12 · Score: 4, Insightful

Ok, fine. They've managed to come up with a model that doesn't actually collect any data. And how will this help people to enter REAL data? People don't give data because they don't trust the company. If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?

--
Javascript + Nintendo DSi = DSiCade

Re:I don't get it. by plugger · 2002-07-21 05:20 · Score: 3, Insightful

If they don't trust the company, do you really think they'll believe some mumbo-jumbo about "randomizing"?

Fair point. One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.

Then again, if all you are interested in is aggregate data, just don't ask for any personally identifying information.
Re:I don't get it. by Otter · 2002-07-21 05:37 · Score: 2, Insightful

Fair point. One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.
But, again, why would a user bother? People resent being pestered for information. It's minimally more work to lie than to provide accurate information and much more satisfying.

--
What I'm listening to now on Pandora...
Re:I don't get it. by vipw · 2002-07-21 07:25 · Score: 2, Interesting

I don't give out real data because I don't feel a need to at all.
I find it takes a lot less time to fill in crap data than real data. What really pisses me off is places that correlate the state you select with the zip code. Places like that seem to be deliberately positioning themselves AGAINST me, so I intentionally fill it with erroneous data because they have become my adversary in the case of this page.

Filling in webforms doesn't become an issue of trust until I actually need them to have these data; in which case I try to be careful with who I give my credit card number, but don't care all that much about the rest.

I think the only reason people give out real data when presented with pointless web forms (à la NYT) is that they are unsure if it will operate properly if they enter the wrong information. I assume a goodly percentage of truthful answers come from a demographic that never intentionally fills erroneous answers into web forms; people who aren't very interested in where limitations exist in these computers that they just happen to use.
Re:I don't get it. by Anonymous Coward · 2002-07-21 07:28 · Score: 0

Lying is less work once you get good at it. Just checking the first entry in every list is simple, street addresses become one word, date of birth becomes 11/11/11, etc. Quick and easy to key in.

Unless they're stupid fuckers who check to make sure your zipcode is in your state, but I think that is a capital crime anyway ;)
Re:I don't get it. by dboyles · 2002-07-21 08:52 · Score: 3, Insightful

...what really pisses me off is places that correlate the state you select with the zip code. Places like that seem to be deliberately positioning themselves AGAINST me, so I intentionally fill it with erroneous data because they have become my adversary in the case of this page.

You seem to have some sort of problem with this, as if they are somehow tricking you. No, it's just a validity check in an attempt to ensure accurate data. What I find interesting is that they would give you an error and ask you to fill in the form again.

Let me explain: let's say you've filled out a 10-question form asking for name, email, age, location, and a few "consumer behavior" questions. If you've done all this accurately, it files your data and lets you proceed. But if you've done it inaccurately (in this case, filled out a ZIP/state that don't match), it kicks you back and makes you correct it. So this time you put in a valid ZIP/state. You submit it, and it files your data away and lets you proceed.

The problem is that your data still isn't accurate, and therefore should be thrown out. Maybe your ZIP/state is correct now, but maybe you just put 90210/CA. A much better solution from a data integrity standpoint is to allow that user to enter junk data, but to not factor in that bad data when drawing conclusions.
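
A small sketch of that accept-but-flag approach, in Python. Everything here is hypothetical: the two-entry `ZIP_PREFIX_TO_STATE` table stands in for a real ZIP database, and the record layout is invented for the example.

```python
# Illustrative lookup only; a real check would need a full ZIP database.
ZIP_PREFIX_TO_STATE = {"902": "CA", "198": "DE"}

def flag_record(record):
    """Return a copy of the record with a 'suspect' flag instead of rejecting it."""
    expected = ZIP_PREFIX_TO_STATE.get(record["zip"][:3])
    mismatch = expected is not None and expected != record["state"]
    return {**record, "suspect": mismatch}

def usable(records):
    """Keep every submission on file, but analyze only the unflagged ones."""
    return [r for r in records if not r["suspect"]]

submissions = [
    {"zip": "90210", "state": "CA"},
    {"zip": "90210", "state": "AL"},
    {"zip": "19808", "state": "DE"},
]
flagged = [flag_record(r) for r in submissions]
print(len(flagged), "stored,", len(usable(flagged)), "used for analysis")
```

The user never sees an error, so there is no prompt to swap in a matching but equally fake ZIP/state. A 90210/CA entry still passes this check, so it only catches the careless junk, not the deliberate kind.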

I think there needs to be much more research in this area if anybody expects to get good data out of the internet. IBM's studies seem to be a step in the right direction. Not only do they want to improve data integrity for the company, they're also factoring in another important issue: privacy.

--
-- "Complacency is a far more dangerous attitude than outrage." -Naomi Littlebear
Re:I don't get it. by Yosi · 2002-07-21 09:19 · Score: 2, Funny

I remember once I was watching an eleven year old kid fill out a form for something completely truthfully. When he hit submit, it took him back to the form, complaining that the age he gave was too young (for them to be collecting information on him), and suggesting that he fix it. So of course, he did.

huh?
Re:I don't get it. by Bert690 · 2002-07-21 09:33 · Score: 1

You (and MANY others in this discussion) are completely missing the point.
The scheme allows randomization to be performed on the CLIENT side of the wire (that is, your own machine) so that the company/companies NEVER ever sees the real data.
Yes, deploying this properly requires some client-side software to achieve this processing. But given this software, all you have to trust is that the client side implementation is properly implementing the scheme. You enter your real data into the client-side software once, and then forget about it. Server-side software interacts with this client-side software to obtain ONLY the appropriately randomized data. Your privacy is preserved, but companies you interact with can perform meaningful aggregate computations in order to better optimize their services.
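
To make the client/server split concrete, here is a minimal Python sketch. It assumes the simplest possible scheme (zero-mean uniform noise added on the user's machine) rather than whatever the IBM researchers actually implemented; `randomize_on_client`, `server_mean`, and the spread of 30 are made up for illustration.

```python
import random
import statistics

def randomize_on_client(true_value, spread, rng):
    """Runs on the user's machine: only the perturbed value leaves it.

    Assumed scheme: add uniform noise in [-spread, +spread], which has mean 0.
    """
    return true_value + rng.uniform(-spread, spread)

def server_mean(received):
    """Runs on the server, which only ever sees perturbed values."""
    return statistics.fmean(received)

rng = random.Random(42)
true_ages = [rng.randint(18, 70) for _ in range(20000)]
sent = [randomize_on_client(age, 30, rng) for age in true_ages]

print("true mean:    ", round(statistics.fmean(true_ages), 2))
print("server's mean:", round(server_mean(sent), 2))
```

Any single value in `sent` can be off by up to 30 years, so it says little about one person, yet the average over many users stays close to the real one.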
Re:I don't get it. by pheonix · 2002-07-21 09:55 · Score: 3, Informative

My partner and I run a company that does this for large corporations (a great deal in the automotive sector) and here's what we've found.

Frequently, the people that give input simply misread questions... for example 'How many males over the age of 18 in your household INCLUDING YOU' as opposed to 'NOT COUNTING YOURSELF'. Or they make typos. Error checking can fix that frequently. Just because they mis-keyed their zip doesn't mean the whole dataset is incorrect.

We've found that the most positive way to get good data is to get people that WANT to tell you their opinions to take the survey. Forcing someone to take the survey for free stuff or to take part in something just doesn't work. Giving them the free stuff then saying "Hey, would you like to give us your opinion" on the other hand, does. The only drawback is that you would assume you're tainting the respondent's opinion. Given the amount of research we've put in, we've actually found the opposite... people say "hey, I've already got my free shit, now I'll tell em how I REALLY feel". I don't see much of a purpose in what IBM has come up with.
Re:I don't get it. by AvitarX · 2002-07-21 09:55 · Score: 1

What I was actually thinking when reading through this is why randomize at all?

If only the aggregate is of interest, why not do research on how people falsify and, as you said, trash the obvious lies? These people lying are, after all, just noise (but probably not 100% random).

I have not seen a single post from someone here saying that this will help them tell the truth, so it would seem that applying the same theories to already collected data, but taking into account that the noise is not pure, would be a much more efficient way to do it.

Of course that would not sound new, and would therefore not be marketable.

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:I don't get it. by DennyK · 2002-07-21 10:08 · Score: 3, Interesting

Heck, usually it's LESS work to lie. Much easier to select the first or last option in a list than to hunt for the one that applies to you, or say you live in "dkjhgkjhdgs dshkjgdsh, AL" than to actually type your real address. And if they insist on cross-checking your ZIP and state, then what else is there except CA and 90210? ;) (Guess crappy TV shows can have their uses after all... ;) ) I'd love to see a study done about what % of visitors put CA/90210 for a state/ZIP in those places that do the cross-checking. That would give you a damn good idea about how many people lie like hell on those surveys... ;)

DennyK
Re:I don't get it. by mr3038 · 2002-07-21 12:05 · Score: 1

...but maybe you just put 90210/CA
Nah, I prefer 1537 Paper St., Wilmington, DE 19808.

--
_________________________
Spelling and grammar mistakes left as an exercise for the reader.
Re:I don't get it. by shepd · 2002-07-21 12:35 · Score: 1

>Unless they're stupid fuckers who check to make sure your zipcode is in your state

That's the only reason why Beverly Hills, 90210 didn't suck. I suppose you could forget what state that's in...

--
If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
Re:I don't get it. by mcjulio · 2002-07-21 21:18 · Score: 1

I always use 90210/CA and aaa@aaa.com. One only hopes that there is no Arthur Alex Anderson who works for the American Automobile Association.
Re:I don't get it. by RhetoricalQuestion · 2002-07-22 03:16 · Score: 2

I work in Marketing and part of my job is occasionally analyzing the results of our eval surveys. I think far more people provide false answers out of laziness than out of deliberate lying.
I think this because we use this data (in part) to analyze the effectiveness of our magazine ads in order to budget accordingly. For a long time, one of the magazines, beginning with the letter A, had been getting really good results from the eval survey. So we put more money into ads and articles for that magazine, but didn't notice any increase in evaluation downloads.
Then we changed the order of the magazine dropdown, and suddenly, no one was picking that magazine. (Now we regularly rotate the list.)
Yes, there is quite a bit of false data -- lots of people who work at Foo or Test -- but I think it's mostly people trying to get to the survey quickly and not people trying to protect their privacy.

--
I can spell. I just can't type.
Re:I don't get it. by Anonymous Coward · 2002-07-22 06:24 · Score: 0

I usually use suck@my.dick or fuck@your.mother. If that fails due to TLD verification or reverse DNS lookups, I'll use you@suck.com. Sure it's childish, but I still hope that at least occasionally I manage to offend someone at the nosy company as deeply as they offend me with their data collection practices.

Slashdot Poll? by jedwards · 2002-07-21 05:13 · Score: 5, Funny

Did you lie when answering this question?

O Yes
O No
O Cowboy Neal told me the answer

Re:Slashdot Poll? by Lord_Slepnir · 2002-07-21 05:26 · Score: 2

If logic holds true, then there will be 100% No answers (minus the people who take the Cowboy Neal option). If you're a lier, then you can't click yes because then you wouldn't be lying, so you have to choose no (because you're a lyer). If you aren't lying, then you have to choose no.
Re:Slashdot Poll? by Anonymous Coward · 2002-07-21 05:30 · Score: 0

Make up your mind, jesus. First you use "lier", then you use "lyer".

You're wrong both times (damn American, lol! A pathetic nation of educational failures); the proper spelling is "liar".
Re:Slashdot Poll? by blixel · 2002-07-21 06:11 · Score: 1

If logic holds true, then there will be 100% No answers (minus the people who take the Cowboy Neal option). If you're a lier, then you can't click yes because then you wouldn't be lying, so you have to choose no (because you're a lyer). If you aren't lying, then you have to choose no.

Wasn't that from a Yu-Gi-Oh Episode?

This is from Episode 17 - Double Trouble Duel (Part 1)

Yugi: "I hate to disappoint you Joey but I don't think we've solved this
riddle just yet. In fact, I think we're going down the wrong road
ourselves - so to speak."

Joey: "But I told you Yug' I heard this one before."

Yugi: "Your riddle has some things in common with our present predicament
but there are some key differences. And trust me Joey those key differences
change the entire nature of the problem."

Joey: "Huh?"

Yugi: "In your riddle there was only one person to question at the crossroads.
But in our situation we have two, Para and Dox. Now both Para and Dox
have told us the exact same thing. One of them will speak nothing but
the truth, and the other will speak nothing but lies. But there's a problem
with that already. Because if they were, as they said, one truth teller and
one liar, the liar could never admit to it. That would be telling the truth.
The only way they could both make that statement is if they were both lying.
And that means we can't trust either Para, Dox, or their riddle."

Joey: "Wow. My brain hurts."
Re:Slashdot Poll? by Anonymous Coward · 2002-07-21 06:11 · Score: 0

(damn American, lol!
Don't they teach you how to use a comma in fagland?
Re:Slashdot Poll? by Subcarrier · 2002-07-21 06:22 · Score: 5, Funny

Did you lie when answering this question? Yes

Truth is often the most devious of lies.

--
"I have opinions of my own, strong opinions, but I don't always agree with them." -- George H. W. Bush
Re:Slashdot Poll? by rworne · 2002-07-21 06:39 · Score: 1

This was also in a Dr. Who episode where he was presented with two robots, one that always lied and the other that always told the truth.
Methinks that this is a clever, yet popular riddle for geek shows.

--
I tried every decent and legal way I could think of to resolve the issue w/the business before I rented the chicken suit
Re:Slashdot Poll? by Ho-Lee-Cow! · 2002-07-21 07:18 · Score: 1

Always go for Cowboy Neal!

--
In space, no one can hear you moo.
Re:Slashdot Poll? by isorox · 2002-07-21 07:32 · Score: 2

Truth is often the most devious of lies.

Truth is just an excuse for a lack of imagination
Re:Slashdot Poll? by Tackhead · 2002-07-21 07:56 · Score: 2

> > Did you lie when answering this question? Yes
>
>Truth is often the most devious of lies.
"If I were to ask you what your answer to the question 'will you lie when answering this question' would be, how would you answer?" ;-)
Re:Slashdot Poll? by G-funk · 2002-07-21 11:31 · Score: 2

You're going about the problem the wrong way. Don't think about what you "can" and "can't" click. Because we both know this crowd would all click "yes" anyway.

The fact is, those who are the "liars" must tick yes anyway, because it's the statement that cannot be true, hence it's a lie.

Then again, anybody who is a liar can also tick NO, since he's lying.

--
Send lawyers, guns, and money!
Re:Slashdot Poll? by falzer · 2002-07-21 18:10 · Score: 1

Pepsi?
Re:Slashdot Poll? by Helmholtz+Coil · 2002-07-22 02:41 · Score: 1

And geek games, too. IIRC, it was also in Ultima VI, with a two-headed horse, right before you got the plans to the balloon.

Circumvention by limekiller4 · 2002-07-21 05:14 · Score: 1

There is a great amount of irony in the fact that we're all reading an article about obtaining accurate information by clicking on a link that will generate false information.

That's just way too wonderful to put into mere words.

--
My .02,
Limekiller

Re:Circumvention by Anonymous Coward · 2002-07-21 15:40 · Score: 0

The great amount of irony about the above comment is that it was actually put into words???

Granted that my comment should bring about a string of comments on the irony of the above comment about the above comment on the above comment.

This will not affect user behavior by treat · 2002-07-21 05:14 · Score: 5, Insightful

Do they expect that people will enter real data on the mere promise that it will be stored in some randomized, aggregate, or other form that does not invade their privacy? If the corporation could not be trusted in the first place, no statement they make will make them trustworthy.

Re:This will not affect user behavior by Blue+Stone · 2002-07-21 07:20 · Score: 4, Insightful

All they have to do is stop asking for my name and e-mail address, and I could be truthful about pretty much anything else they'd care to ask.

--
Corporation, n. An ingenious device for obtaining individual profit without individual responsibility. - Ambrose Bierce
Re:This will not affect user behavior by serutan · 2002-07-21 07:35 · Score: 2

Very True. If you repeatedly poison the well, eventually even the stupidest people will stop drinking out of it. Our lean-and-mean, smart-sized, just-in-time business community doesn't seem to understand this behavior any better than our politicians. Oops, same thing.

Missing the Point by Inexile2002 · 2002-07-21 05:14 · Score: 3, Insightful

Sounds all fine and dandy for science, but people are usually honest with a professional researcher who is going to guarantee your anonymity, and moreover the research data is going to be used for something tangible rather than selling something right back to you.

Market researchers want information on YOU. They want generic info on your demographics, but this information has been available from other venues for a long time. When spyware and other information gathering techniques are employed against someone they are being used to collect data to target marketing at that person specifically. Literally employed against that person.

As such, I'll still say that I'm female, in my 50's, from Yemen and making less than $12,000 a year. Randomize away.

Re:Missing the Point by sailesh · 2002-07-21 20:24 · Score: 1

Well actually you are missing the point :-)
The NYT has done a disservice by presenting this as a tool for increasing trust in the web. It's actually a tool for preserving privacy of data.
A better example would be the analysis of medical records. Now, the IS people at HMOs like Kaiser Permanente have one overriding fear - that they will get up one morning and find the NYT/WSJ reporting that medical records data has been compromised. As a result, they lock down this data and impose extreme security restrictions on access to it. This denies the opportunity to perform data mining analysis, say by an outside contractor.
If you now pose the question as follows: how can an outside agency perform some data mining analysis (for instance, infection rates of two diseases in particular areas) on this data without compromising individual security? One nice answer to this is this technique from the IBM researchers.
Re:Missing the Point by mcjulio · 2002-07-21 21:21 · Score: 1

You sound hot. Can I get your phone number?

Dr.... by CySurflex · 2002-07-21 05:18 · Score: 0, Flamebait

"Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said.

Does he help people kill themselves or do they kill themselves over all these e-commerce surveys?

Re:Dr.... by morgajel · 2002-07-21 05:35 · Score: 2

heh, with a name like that, he must not get many new clients...

"hello mr. smith, you have a 1pm appointment with dr. 'ka-voa kee-an' today."

"no thanks, I feel better."

"nonsense- the doctor just got the death machine warmed up!"

(and yes, I know it's not the same guy)

--
Looking for Book Reviews? Check out Literary Escapism.

Privacy? by martyn+s · 2002-07-21 05:20 · Score: 1

While a lot of people are concerned about their privacy, somehow, I don't think that the fact that they won't be able to tie the answers to you will lead to any more truthful answering.

My Advice by iONiUM · 2002-07-21 05:20 · Score: 1

What's the point of living if you can't screw with market research? It's just fun, and my little way of getting some revenge for the countless webpages they cover with annoying advertisements, or the time they steal in between my TV programs.

Re:My Advice by GigsVT · 2002-07-21 05:25 · Score: 2

What's the point of living if you can't screw with market research?

Dr. Ann Cavoukian sounds like she can help you with that too. Maybe that was her plan in the first place. :)

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
Re:My Advice by khorosho · 2002-07-21 10:48 · Score: 1

yet another slashdotter being incredibly stupid and not saying anything worth reading.

or the time they steal in between my TV programs

they're not stealing time from your programs asshole, they're the ones paying for it. But yet again, you think that software, music, tv, simply all things that you enjoy in life should be free. Good...when are you going to start giving me candy bars and big steaks...because I enjoy those in life..and they deserve to be free. Sometimes people just need to understand that these supposed rights that we have out there are only so because we have grown up in a world which says that those are natural and right. And what we think is a privilege today we think is a right tomorrow...and sooner or later we'll have people suing because they are not afforded their constitutional right to access the internet in the area they live (I'm sure they'll justify it somehow). Figure out that you want the world to be a certain way because of your beliefs, and not because of some overarching good/right. Oh, and while you're at it, I think you should eat your own poop and punch yourself in the face. Later...

Of course by GigsVT · 2002-07-21 05:21 · Score: 2

I don't see how people would trust this any more than entering it normally.

Typical session:
What is your age? (Results will be randomized)
23

OK, we're putting down you are 28 based on a random number we picked. Aren't we good to protect your privacy?

(Then behind the scenes the database gets the real age put into it, how will the user ever know?)

Even if the user can view their profile later on, the database can just store their real age + the so-called random modifier, and the user will be none the wiser.

What a pointless "technology".

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.

Re:Of course by Anonymous Coward · 2002-07-21 05:26 · Score: 1, Insightful

What a pointless "technology".

Not at all, not at all. Like 80% of the stuff these days, it exists merely to get some nice paperwork for the students, after that it will be forgotten. Once they have their Masters/Doctorate in an incredibly narrow field, gotten themselves into debt, given money to textbook makers and given jobs to profs, they will have their paper that will get them a nice nice job, all the while perpetuating the myth of higher education and raising the bar for everyone else.

Hardly pointless, is it? I mean, it's the only way for a modern society to still use capitalism.
Re:Of course by verbatim · 2002-07-21 05:32 · Score: 1

*clap* *clap* *clap*

You sir, are my hero. :)

--
Price, Quality, Time. Pick none. What, you thought you had a choice?
Re:Of course by Anonymous Coward · 2002-07-21 06:19 · Score: 0

And just how many certifications do you put on your resume?

Trust is the issue... by CySurflex · 2002-07-21 05:21 · Score: 1

This is a very novel scientific idea with very little useful application, IMO. What percentage of the public is going to actually believe that their individual answers are not going to be stored? Or an even better question - what percentage of companies claiming to use this technique will not actually store the data entered by their users?

-CySurflex

Do you trust them now? by ypoint · 2002-07-21 05:22 · Score: 0

Either I trust "them" to be completely honest and fair, then I can give them my information. If I don't trust them to use my information properly then why should I trust them that they actually use this IBM-program? Either they find some way to get information out of my false answers or make me trust them.

5% Error on "reconstructed data" by Keeper · 2002-07-21 05:24 · Score: 2

So what they're saying is that they've proven that their random number generator isn't really all that random? :p

Re:5% Error on "reconstructed data" by DaCool42 · 2002-07-21 05:37 · Score: 3, Informative

No, what they are saying is that their random number generator is very random. If you have a perfectly randomly generated bell curve set of numbers, it makes it easier to reconstruct the original data. Think of an audio signal for example. If you have a sine wave (your data) mixed with white noise (perfectly random), you can quite easily pick out the sine wave. It's the one frequency that is louder than all the rest. However, if instead of white noise you have noise that is not perfectly random, you will not see a clear sine wave, but several different frequencies.

--

----
All of whose base are belong to the what-now?
Re:5% Error on "reconstructed data" by Moofie · 2002-07-21 05:57 · Score: 1

But, doesn't that assume that your data should in fact look like a bell curve? I mean, if you know that ahead of time, why bother taking the data? Isn't the point of gathering the information figuring out how your audience does NOT look like a Gaussian distribution?

I mean, the way the article read to me, the surveyors figured that they couldn't get accurate data, so they just had a computer make some up. I can not understand how this is useful.

(Now, I admit that I am rather aggressively ignorant about statistics...I think the entire field of study is complete bullshit.)

--
Why yes, I AM a rocket scientist!
Re:5% Error on "reconstructed data" by imkonen · 2002-07-21 06:17 · Score: 2, Insightful

Or, what they are saying is that they used (or assumed) a realistically finite number of data points to try to reconstruct the original distribution. The random noise they add may well be perfectly characterized, and the random number generator perfectly random*, but if they are estimating 1000 randomized responses, or 10000, there is also a predictable, non-zero uncertainty in the result when they try to extract the original distribution.
However, since the reconstruction error would depend on the number of respondents, which will vary dramatically from site to site, I might also guess the 5% number was rectally extracted, and only used to make a point for the article that it will still be better than the error due to respondents lying, despite not being perfect.
All of this, of course, under the dubious assumption that people will stop lying just because random numbers have been added to their information, as numerous other posts here have discussed...
*...yea..yea..I know, there's no such thing as a perfect random number generator, but those tests you hear about mathematicians running on RNG algorithms are for the truly anal-retentive who are worried about patterns showing up after the 2^64th repetition or whatever. I doubt that even a relatively low-tech random number algorithm would be taxed by this technique.
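
The dependence on the number of respondents mentioned above can be sketched with the standard error of a mean, in Python. This assumes independent responses and independent zero-mean noise; the standard deviations 10 and 20 are arbitrary illustration values, not figures from the article.

```python
import math

def standard_error(true_sd, noise_sd, n):
    """Standard error of the mean of n independently randomized responses."""
    return math.sqrt((true_sd ** 2 + noise_sd ** 2) / n)

for n in (100, 1000, 10000):
    print(n, round(standard_error(10.0, 20.0, n), 3))
```

The noise inflates the uncertainty, but it still shrinks as one over the square root of the sample size, so a site with 100 times as many respondents gets an estimate roughly 10 times tighter.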
Re:5% Error on "reconstructed data" by Knacklappen · 2002-07-21 06:27 · Score: 1

*...yea..yea..I know, there's no such thing as a perfect random number generator, but those tests you hear about mathematicians running on RNG algorithms are for the truly anal-retentive who are worried about patterns showing up after the 2^64th repetition or whatever. I doubt that even a relatively low-tech random number algorithm would be taxed by this technique.
Or the absence of patterns, false ones. A truly *random* generator should be able to reproduce any series (like 1-2-3-4-5-6-7-8-9) for a while, but of course with a very high improbability.

--

Excellence: Moderate (mostly affected by comments on your karma)
Re:5% Error on "reconstructed data" by Joheines · 2002-07-21 07:09 · Score: 1

The data does not have to look like a standard normal distribution, the randomizing just has to be done in a way following a certain known distribution (for example, the standard normal).
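
A quick Python sketch of that point, using deliberately non-bell-shaped data. It is not the reconstruction procedure from the article, only a simpler check that recovers the mean and variance; the noise level `NOISE_SD`, the two clusters at 20 and 60, and the sample size are all assumptions chosen for illustration.

```python
import random
import statistics

NOISE_SD = 15.0  # assumed to be published, so the analyst knows it

rng = random.Random(7)
# Deliberately non-Gaussian "true" data: two separate clusters.
true_values = [rng.choice([20.0, 60.0]) + rng.uniform(-3, 3) for _ in range(20000)]
observed = [v + rng.gauss(0.0, NOISE_SD) for v in true_values]

# Independent noise with mean 0 leaves the mean alone and adds its variance.
est_mean = statistics.fmean(observed)
est_variance = statistics.pvariance(observed) - NOISE_SD ** 2

print("true mean/variance:     ", statistics.fmean(true_values), statistics.pvariance(true_values))
print("estimated mean/variance:", est_mean, est_variance)
```

Only the noise needs a known shape; the underlying answers can be lumpy, skewed, or anything else, and subtracting the known noise variance still gets close to the real spread.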

In reply to "you still have to trust the company" by qubit64 · 2002-07-21 05:25 · Score: 1

A few people have posted that you still have to trust the company you're sending the info to to randomize the data for you. It doesn't have to work like this. You could have the program work so that the info is randomized at your end, maybe by having the browser make a call to a "registration" program. It could be open source so that we could be sure of what it's doing, and the company then can't get your real info (without hacking your box).

--
"Save me jebus!" - Homer Simpson (btw, I'm probably talkin out of me arse)

Re:Mirror by martyn+s · 2002-07-21 05:26 · Score: 1

You left out the big red graphic 'W' from the beginning of the article.

optional vs. required by verbatim · 2002-07-21 05:30 · Score: 5, Insightful

I think there is something to be said about companies that ask for information as an option versus companies that ask for information as a requirement.

For example, company XYZ has released a program called Widget. In order to download Widget, users are asked to fill out a survey so that XYZ may gauge the demographics of their target audience.

Some sites will allow you to bypass this step and proceed to download the software. Other sites require this information before revealing the download link. I think that the psychological difference between "required" and "optional" would heavily influence the honesty of the answers.

I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company. The difference? I'm not being forced to give anything up in advance.

Is this true in general? I don't know. But it makes sense to me.

I have an idea for something to replace the survey forms - an AI program to carry out a conversation with the user. Ah ha! We just have to watch out for users that say to the AI - "I am lying" - and hope the AI doesn't need therapy.

--
Price, Quality, Time. Pick none. What, you thought you had a choice?

Re:optional vs. required by Anonymous Coward · 2002-07-21 05:53 · Score: 0

It's a matter of trust: this company trusts me to the point that it provides me the software without requiring anything. In return, I choose to trust it and give it my personal information: this company made the first step.
OTOH, that company clearly shows that it doesn't trust me, because it is requiring me to fill out stuff before rewarding me with the data. Therefore, why should I trust it and give it my real info?

It's almost the same problem with Anonymous Cowards like me: I show, by posting AC, that I don't trust the medium or the readers of my comments. Then how can readers take seriously what I say?
Re:optional vs. required by danpbrowning · 2002-07-21 06:37 · Score: 2

I'm the same way. I've set up about a dozen Juno accounts for friends and family, and every time I "fill out" the form for them by randomly clicking one thing from each selection. If it wasn't required, I wouldn't pollute their statistics, but oh well.

--
Daniel
Re:optional vs. required by dboyles · 2002-07-21 08:39 · Score: 2

I know that I never honestly fill out required forms. I'll fill in a bunch of bogus details, get the link, and be on my way. However, if the form is optional, I may download first and, if I like the program, provide some details to the company.

I agree with your theory, but I want to expound on it a little bit.

I don't think many people will be inclined to actually return to the site and voluntarily provide information. However, think about the people who would fill out optional forms in the first place. The demographic probably fits that of the casual internet user. That user is much more likely to provide accurate information - but just as importantly, they're unlikely to provide inaccurate information. So by making a form optional, you've seriously improved the integrity of your data.

Then, a company can look at that (supposedly very good) data and make assumptions about the users. However, they must be careful to not assume that the data is a full picture just because it is not inaccurate (I'm purposely not using the word "accurate"). In other words, if 40% of the respondents indicate that they like Murder She Wrote, you can't assume that that extrapolates to 40% of your user base. Instead, the company must associate that data only with the respondents. But since they have very accurate information about their respondents, they can assume that their conclusions are equally accurate.

So the question arises, "What about the non-respondents?" That's true, the company doesn't have accurate information about them. But what's better, good information about a small group, or bad information about a large one?

--
-- "Complacency is a far more dangerous attitude than outrage." -Naomi Littlebear
Re:optional vs. required by pheonix · 2002-07-21 10:13 · Score: 2

So the question arises, "What about the non-respondents?" That's true, the company doesn't have accurate information about them. But what's better, good information about a small group, or bad information about a large one?

That's so backwards though. There is a difference between a Survey and a Census. Asking every single person that comes to your site what they think is a Census. Yes, that's obviously the best way, but not the most cost effective.

A Survey is talking to a percentage of your user base, and extrapolating the data. If done properly, you can interview a random group of around 15% of your user base and be statistically 95% accurate. Thus far, it's the best we have, if you discount cheating and poor data collection practices.

My point is, if you do your survey correctly, if 40% of your respondents indicate that they like Murder She Wrote, it's safe to say that 40% of your user base also does, plus or minus a small percentage. That's the whole point of statistics.
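
For the "plus or minus a small percentage" part, a minimal sketch using the usual normal approximation for a sample proportion. It assumes a simple random sample from a large user base, which is exactly the assumption being argued about in this subthread for optional online polls. The function name is illustrative; 1.96 is the standard two-sided 95% z value.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# 40% of respondents like the show; how tight is that for different sample sizes?
for n in (100, 1_000, 10_000):
    print(n, round(margin_of_error(0.40, n), 4))
```

Under that assumption the margin is driven mostly by the number of respondents rather than by what fraction of the user base they are.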
Re:optional vs. required by dboyles · 2002-07-21 10:54 · Score: 2

My point is, if you do your survey correctly, if 40% of your respondents indicate that they like Murder She Wrote, it's safe to say that 40% of your user base also does, plus or minus a small percentage. That's the whole point of statistics.

That's assuming that the respondents represent an accurate model of your population. My argument is that that's not the case in optional, online polling. Maybe 40% of the respondents like Murder She Wrote, but maybe 70% of respondents were between the ages of 50 and 70.

--
-- "Complacency is a far more dangerous attitude than outrage." -Naomi Littlebear
Re:optional vs. required by pheonix · 2002-07-21 13:22 · Score: 2

That would call the quality of the surveying process into question, but not the survey itself. A few writeups we've been looking at indicate that, typically, there is no direct correlation between age or sex and likelihood of accurate survey completion. There is a vague correlation between tech-knowledge and accurate survey completion, but it's not particularly strong, and well within statistical bounds.

Basically, if they don't get a representative sample within reason, it's due to a poorly administered survey, not the seemingly arbitrary nature of their polling.
Re:optional vs. required by Anne_Nonymous · 2002-07-21 15:55 · Score: 3, Funny

>> information as an option versus...information as a requirement

The New York Times thinks I'm a 146 year-old lady who makes less than $10,000 a year, has 3 children in high-school, and enjoys golf and motorsports in her spare time.

Anti-aliased statistics! by Anonymous Coward · 2002-07-21 05:31 · Score: 0

They're blurry so they look nicer! :)

Flawed method - multiple answers still reveal data by Anonymous Coward · 2002-07-21 05:32 · Score: 0

The method is flawed because multiple answers given to multiple companies (which are often cooperating) will give a good estimate of the original data. So this method only works if you register only once, *ever* - which is unrealistic.

And unless the IBM software runs on my computer and is fully OSS, I won't trust it more than Company X's Privacy Policy to begin with...
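
A minimal sketch of the pooling concern, assuming each registration adds fresh, independent zero-mean Gaussian noise to the same true value and that the cooperating sites can link the reports to one person. The age, noise level, and function name are hypothetical.

```python
import random
import statistics

def average_of_reports(true_value, noise_sd, k, rng):
    """Average k independently randomized reports of the same true value."""
    reports = [true_value + rng.gauss(0.0, noise_sd) for _ in range(k)]
    return statistics.fmean(reports)

rng = random.Random(0)  # fixed seed so the run is repeatable
true_age = 35.0         # hypothetical respondent
for k in (1, 10, 100, 1000):
    estimate = average_of_reports(true_age, 30.0, k, rng)
    print(f"{k:5d} reports -> pooled estimate {estimate:.1f}")
```

Under these assumptions the noise in the pooled estimate shrinks like 1/sqrt(k), so it takes a fair number of linked reports before one person's value becomes sharp. A scheme that reused the same noise for a given user would not leak this way.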

old but useful by optikron · 2002-07-21 05:33 · Score: 1

That's actually an old statistical trick. Adding homogeneous noise to statistical data doesn't really hurt the accuracy of the final aggregate. With a little Java button that randomizes the data you've entered in the form (thus before sending the data to the firm), it protects your privacy while still giving useful data to the firm. They got a nice idea, but sure, it won't stop some people from faking answers "for fun". I do that sometimes :-)

Re:old but useful by jhoger · 2002-07-21 07:00 · Score: 1

Yeah, it seems odd that this was presented as a new idea considering they taught it to us in introductory statistics.

They use it on surveys when they ask embarrassing questions: something like, the respondent flips a coin, and if it comes up one way they can lie and if it comes up the other they have to tell the truth. But the surveyor doesn't really know whether any given answer is true, so the respondent needn't be embarrassed or worry about the cops coming for them.

-- John.

Does this increase trust? by SeanTobin · 2002-07-21 05:35 · Score: 5, Funny

I hope these companies aren't asking users to 'trust' them with their personal information based on the fact that we are supposed to trust them to randomize it.

Personally, if I don't trust them enough to tell them how much I make, I'm not going to trust them to randomize my results. I don't see how this will increase accuracy -- especially if I keep telling everyone I'm a 108 year old female in Uganda making $100,000+ per year who works in the sales department of an Educational field and plans to make purchases of an SUV, a house, a console gaming system, an optical mouse in the next six months and rates their internet experience as very low. My e-mail address is sjobs@mac.com and I would like to apply for your quarterly, monthly, weekly, daily, and hourly newsletters and I do give permission to pass this information to your affiliates.

--
Karma: SELECT `karma` FROM `users` WHERE `userid`=138474;

LOL oh this is rich by Archfeld · 2002-07-21 05:35 · Score: 1

Not only does this not make any sense, the article was really poorly written. There is no way this system will be any more truthful than the lame one we use now. What this does prove is that the morons at college actually believe the crap they write...what a sad concept. It is like actually believing the commercials we see on TV...

--
errr....umm...*whooosh* *whoosh* Is this thing on ?

That's just stupid by photon317 · 2002-07-21 05:36 · Score: 4, Insightful

Let me summarize:

1) People lie on surveys, most likely because they don't trust the taker - but probably also just because they like putting in other answers (yeah, I'm a millionaire, woohoo!, etc). This only addresses the trust issue, ignoring other potential sources of lying.

2) In order to work around the trust issue, they've developed a method of injecting random noise into the original answers as they are recorded and then extracting useful data in the end.

Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.

Without bothering with all this mumbo-jumbo, I can build a trustworthy system. I simply record survey statistics, and I promise not to use the individuals' personal data individually.

They can either trust me that I'm telling the truth about this, or they can lie. In the IBM researchers' scenario, the users are again asked to trust that the backend system doesn't compromise them, and again they can choose to trust it or choose to lie.

Given the above, why on earth would you bother with this research and unnecessary complexity? It's not going to make any difference over just promising your users that you don't invade their privacy. You could replace their research results with a banner on top of the survey that says "After you submit your data to us, we use Magical HibiJibi technology to prevent ourselves from invading your privacy, so please trust us and answer truthfully"

What a waste of research.

--
11*43+456^2

Re:That's just stupid by optikron · 2002-07-21 05:40 · Score: 1

Who said the randomization was done on the other side? That would be pointless indeed... With a simple Java randomizer CLIENT SIDE, with a little button, you could see exactly what you send! (You could still check the Java code if you are doubtful.) Don't take IBM scientists for dumber than they are!
Re:That's just stupid by IO+ERROR · 2002-07-21 05:41 · Score: 2

You seem to have missed the point. This technology assumes that users are going to lie, and mitigates the effects of those lies on the final results of the survey with a minimal loss in "accuracy."

--
How am I supposed to fit a pithy, relevant quote into 120 characters?
Re:That's just stupid by optikron · 2002-07-21 05:44 · Score: 1

No, YOU are missing the point :) It's made so the user actually tells the TRUTH, and then homogeneous noise is applied over that truth, thus protecting privacy without destroying the "statistical distribution" (if you ever did some statistics, you'll know what I mean).
So if ppl still lie, the accuracy won't change... you can't have good accuracy with "wrong" data...
Re:That's just stupid by honkycat · 2002-07-21 05:47 · Score: 1

no!

the article intro kind of implies that, but that's not what the system does. I first read it thinking this is what they'd claimed to have found, but if that had been the case there'd have to be assumptions about *how* people lie.

as it is, this is just pointing out that if you add noise with a known distribution to data with an unknown distribution, you can remove the known distribution from an aggregation of data points but not from any single data point.

this doesn't fix the garbage-in garbage-out problem.
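
A minimal sketch of the "remove the known distribution from the aggregate" idea, assuming additive zero-mean Gaussian noise with a known standard deviation. This only recovers the mean and variance of the original answers, not a full distribution, and it is not claimed to be the researchers' actual algorithm. The population and all names here are made up for illustration.

```python
import random
import statistics

def randomize(values, noise_sd, rng):
    """Add independent zero-mean Gaussian noise to every answer."""
    return [v + rng.gauss(0.0, noise_sd) for v in values]

def reconstruct_aggregate(noisy, noise_sd):
    """Estimate mean and variance of the original answers from the noisy ones.

    The noise has mean 0 and known variance, so the means match and the
    variances add: Var(noisy) = Var(original) + noise_sd**2.
    """
    mean = statistics.fmean(noisy)
    variance = max(statistics.pvariance(noisy) - noise_sd ** 2, 0.0)
    return mean, variance

rng = random.Random(42)
true_ages = [rng.gauss(40.0, 12.0) for _ in range(50_000)]  # made-up population
noisy_ages = randomize(true_ages, 25.0, rng)

est_mean, est_var = reconstruct_aggregate(noisy_ages, 25.0)
print("true mean/var:     ", statistics.fmean(true_ages), statistics.pvariance(true_ages))
print("estimated mean/var:", est_mean, est_var)
print("one respondent:    ", true_ages[0], "was reported as", noisy_ages[0])
```

The aggregate figures come back close to the truth while any single reported value stays far from the real one. If the true answers were lies to begin with, the reconstruction faithfully recovers the lies.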
Re:That's just stupid by photon317 · 2002-07-21 06:42 · Score: 2

Yes, but surveys are targeted at and only work with masses of people, the common plural man. From this person's perspective, it doesn't matter that the randomization technically happened on their PC in a Java applet or JavaScript code. In either case they're entering personal data and trusting the company to not abuse them.

--
11*43+456^2
Re:That's just stupid by ndecker · 2002-07-21 08:37 · Score: 1

Notice their technology doesn't do anything to fix the underlying problem. The hope is that users will understand and trust the backend randomizer system, and that based on this trust they will answer more truthfully.
Well, they should leave the randomisation to the user:
Please enter your age and add a random number using a normal distribution with μ=30, σ=10: ...
Re:That's just stupid by Bert690 · 2002-07-21 09:36 · Score: 1

No, you're just stupid. :-)
The randomization can be (and should be) performed on the user's OWN MACHINE by appropriate software, so that the server is NEVER obtaining real data. All you have to trust is that the client-side implementation of the randomization scheme is doing the appropriate thing. Open source versions could be provided that would allow anyone to verify this easily.
Re:That's just stupid by photon317 · 2002-07-21 09:46 · Score: 2

No, you're still stupid. :-)

Surveys survey the masses, which means computer illiterate people. It doesn't matter to them the ins and outs of where the randomization happens. If they have to put their real data into a form, they will assume the company asking the survey can get it if they want it. Telling them "don't worry, a client-side javascript function is randomizing this before it gets sent in" doesn't reassure them any more than the HibiJibi statement.

--
11*43+456^2
Re:That's just stupid by Bert690 · 2002-07-21 09:58 · Score: 1

Well, our mutual stupidity aside, I think I agree with you about not trusting a javascript thingy... realistically for this thing to be deployed in a trustable manner I believe it would require some browser / protocol extension (along the lines of HTTPS). Thus just like people trust the little "lock" icon in their browsers (whether they should or not :) there could be another indicator that only privacy-preserved information would be provided to the site.
In more limited application domains this wouldn't be as big a problem.
Maybe it's a pipe dream, but it's still an interesting idea.

Well , good read. by photon_chac · 2002-07-21 05:36 · Score: 1

I don't quite understand it yet, but I like it because it somehow arouses my interest in data mining.

--
KOS-MOS

Re:Well , good read. by Moofie · 2002-07-21 06:02 · Score: 1

This isn't data mining, this is data-making-up. : )

--
Why yes, I AM a rocket scientist!

Like Benjamin Disraeli once said ... by Spacelord · 2002-07-21 05:37 · Score: 1

There are three kinds of lies: Lies, Damn Lies, and Statistics.

Re:Like Benjamin Disraeli once said ... by Seehund · 2002-07-21 06:10 · Score: 1

There are three kinds of survey answers: Lies, Damn Lies, and CowboyNeal.

--
Help saving AmigaOS and a free PowerPC market

Compared with Slashdot polls by Rural · 2002-07-21 05:42 · Score: 2, Interesting

"Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said. Conservative estimates are 42 percent, but anecdotally the rates are far higher

Gee, considering that, the /. polls (even with their prominent disclaimers) seem to have more meaningful results than polls you see on websites, and probably even more than some "scientific" web polls. At least the results usually look right.

Re:Compared with Slashdot polls by ngtni · 2002-07-21 05:49 · Score: 1

Slashdot polls are only accurate because the polls are never too personal - "how many WPM can you type?" is a lot different from "how much do you earn each year?". People don't like answering deeply personal questions.
Re:Compared with Slashdot polls by Dthoma · 2002-07-21 06:28 · Score: 1

In addition to what ngtni said, there's also the fact that Slashdot polls are optional, and that they are a joke, not to be taken seriously. The more important a survey is, the more likely people will come and try to screw it up for a laugh. And if you're *forced* to fill out a survey, then you're more likely to deliberately falsify your results out of spite.

--
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".
Re:Compared with Slashdot polls by Anonymous Coward · 2002-07-21 07:04 · Score: 0

Which raises an obvious topic for a Slashdot poll...how big is your dick?
/me says "bigger than Cowboy Neal's"

Re:In reply to "you still have to trust the compan by ngtni · 2002-07-21 05:43 · Score: 1

This is probably the most sensible way of doing this, using, for example, Java or even ActiveX.

However, it still doesn't fix the problem that people lie. Even if they know that their privacy is guaranteed, they'll still lie, simply because it's fun--after all, rules are made to be broken.

Re:In reply to "you still have to trust the compan by Sheetrock · 2002-07-21 05:46 · Score: 2

Trust is a necessity any way you slice it. If the randomization takes place completely on the client, the numbers probably won't be random enough. If the randomization takes place on their end, and you hit a button client-side to 'roll the dice' until you get a set of numbers you are comfortable with, the combined human psychology of those surveyed can mung the randomness (numbers ending in 7, for example, might be favored over numbers ending in 0 because of our subconscious understanding that numbers ending in 0 are easier to work with mathematically and are therefore 'less safe'). If you only get one set of numbers from the remote randomizer, you don't know that they aren't using an intentionally weak pseudorandom generator that they'll be able to reverse and get all the original results from (or simply giving the same set of numbers to everybody).

I'm always a bit skeptical when I'm told I'm about to be surveyed anonymously, and I can't think of a way that this can be implemented (or at least is likely to be implemented) that would reassure me. The non-skeptics are filling in their information already. Perhaps businesses could pick one in five to survey and offer the people who don't want to take it the ability to just skip it; I'll bet a good amount of crap in the databases is coming from people who have to fill in eighty mandatory fields for free e-mail or music or whatever.

--

Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.

kind of like... by I+Like+Purple · 2002-07-21 05:46 · Score: 1

I heard something similar to this a while back where they were surveying college students asking personal questions like "have you had sex?" and to make students more inclined to answer truthfully the students would go into a room by themselves with a survey and a coin and for every question they were asked to flip a coin. If the coin came up heads they would answer truthfully and if it came up tails they would flip the coin again and heads would mean true and tails would mean false.

I can't remember where I read this. If someone has a link, could you please post it?

Re:kind of like... by optikron · 2002-07-21 05:52 · Score: 1

There are a lot of different techniques to extract data from ppl; we are only big mice, you know :D I remember the technique used for VERY problematic questions asked to young women (like about rape). They would ask her two different questions, both true/false: the first was the very problematic one, the other a very common question (did you like what you ate at lunch?). She answered yes/no or true/false without saying which question she was answering, thus protecting her privacy but giving enough data for the statistical distribution.
Re:kind of like... by Moofie · 2002-07-21 06:10 · Score: 1

Huh? OK, this sounds really interesting, but I haven't the faintest clue how the data would be meaningful. Would you mind walking me through it?

--
Why yes, I AM a rocket scientist!

How about... by jbuhler · 2002-07-21 05:47 · Score: 2

I agree with lots of folks here that this system works only if you don't have to trust the remote site to apply the obfuscating transformation. Here's a suggestion to make things somewhat more transparent.

Create a form with attached Javascript. You enter the real data and hit the "obfuscate" button. The script then locally adds noise to your answers. At this point, the "obfuscate" button turns into "submit", allowing you to send the visibly obfuscated responses to the remote site.

Of course, you'll probably want to read the source to make sure the real answers are not sent along with the obfuscated ones. Still, this scheme would go a ways toward creating the perception of honesty.

Re:How about... by bandicut · 2002-07-21 06:23 · Score: 1

Then you still need to trust them, *or* you need the ability to distinguish a formula which adds noise from one that encrypts it. (With the key in the hands of the people at the remote site.)
Re:How about... by flonker · 2002-07-21 06:49 · Score: 1

My "randomizer"

foreach field
value{field} = value{field} xor 3

Apply again at the server side, and you get the user's actual input data back.
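
A runnable version of the pseudocode above, to show why a visibly "scrambled" value proves nothing by itself. The field names are hypothetical, and integer-valued fields are assumed so that XOR applies.

```python
def xor_randomizer(fields, key=3):
    """The 'randomizer' from the post: XOR every integer field with a fixed key."""
    return {name: value ^ key for name, value in fields.items()}

answers = {"age": 34, "income_bracket": 5}   # hypothetical form fields
sent = xor_randomizer(answers)               # what leaves the browser
recovered = xor_randomizer(sent)             # what the server can compute

print(sent)
print(recovered == answers)
```

XOR with a fixed key is its own inverse, so the server gets everything back. Real randomization has to be non-invertible per respondent, which is something you can only check by reading the code.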

For example by bgreska · 2002-07-21 05:47 · Score: 1

Flip a coin twice. If the result is two tails, answer "yes" to this question. If the result is two heads, answer "no". If the result is one head and one tail, answer truthfully:

Are you a homosexual?

Nothing known about that one person, but integrate the results over a large enough sample... The catch: the person taking the survey must trust the random number generator, so low-tech things (like coin tosses) would work best.
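
A minimal simulation of the two-coin scheme above, assuming fair coins and respondents who follow the instructions. Two tails (probability 1/4) force a "yes", two heads (1/4) force a "no", and the rest answer truthfully, so P(yes) = 1/4 + p/2 and the true rate is p = 2 * (P(yes) - 1/4). The 10% true rate and the function names are made up.

```python
import random

def forced_response(truth, rng):
    """One respondent: two coin flips decide whether the answer is forced."""
    flips = (rng.random() < 0.5, rng.random() < 0.5)  # True = heads
    if flips == (False, False):
        return True      # two tails: forced "yes"
    if flips == (True, True):
        return False     # two heads: forced "no"
    return truth         # one of each: answer truthfully

def estimate_true_rate(answers):
    """Invert P(yes) = 1/4 + p/2 to get an estimate of p."""
    yes_rate = sum(answers) / len(answers)
    return 2.0 * (yes_rate - 0.25)

rng = random.Random(7)
true_rate = 0.10  # hypothetical
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [forced_response(t, rng) for t in population]
print(round(estimate_true_rate(answers), 3))
```

The price is sample size: only about half the answers carry information, so the estimate is noisier than a direct (honest) survey of the same size would be.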

User Interface and Implementation by WEFUNK · 2002-07-21 05:48 · Score: 4, Insightful

Interesting approach, but useless unless people actually understand and trust the system. For this to happen will probably require widespread adoption, an easy to understand explanation of the process, and assurances that answers really are randomized. These requirements obviously force a bit of a chicken and the egg scenario.

Explaining the whole randomization process (how it protects privacy, how it provides useful info) will be a little much for most people I think, but a good user interface might alleviate this, perhaps with a 'randomize' button that is used before hitting the 'submit' button. This would take the user input and change it right in front of their eyes. Of course many would be rightfully concerned that the randomize button is just for show (or simply encodes but doesn't anonymize), but I think that enough people might buy into the false sense of security that demonstrated 'randomization' provides to at least partly improve the % of bona fide results. Also, the system could be set up so users who don't mind submitting traceable information could be encouraged ("extra 10% off") to submit without randomization, with a simple flag sorting data into randomized/anonymous and non-randomized/non-anonymous data.

This approach would be even better if the randomization approach becomes a ubiquitous standard backed by a consistent and legally accountable and well-known entity/brand (IBM for instance). I'm not sure how well an open solution would work unless there was a central group assuming responsibility and accountability for the system, enforcing trademarks, and suing spoofers. Also, people feel safer when they feel there's someone to blame for any abuse/mistakes (hence, giving their credit card freely to a waiter but not to a website).

--
My next sig will be ready soon, but friends can beat the rush!

Re:User Interface and Implementation by Bert690 · 2002-07-21 09:51 · Score: 1

>Interesting approach, but useless unless people
>actually understand and trust the system.
I don't know... people use HTTPS/SSL all the time. You really think more than .01% of the population understand the subtleties of certificate authorities, public / private key encryption, etc.? Yet still it's far from useless.
As long as people trust that it works, there's no need to actually understand it. I think this trust would be best provided by open source implementations of schemes such as this.
Re:User Interface and Implementation by phanki · 2002-07-21 15:46 · Score: 1

What you suggest is interesting, but there is one major thing I think people are missing here. The idea behind the research was to protect people's privacy. Whether one trusts it or not is one's own prerogative, but if someone is working to achieve that, I guess the effort should be appreciated, and if there are any suggestions, do let them know. I think your suggestion of using a randomize button does allay people's fears. I think this idea should be backed by IBM, and I am sure people would respond. A final small observation: spammers are generally held in disregard. When we use the NYTimes random generator, are we not doing something similar? Are we not generating scores of fake e-mail IDs? It is TRUE that NYTimes should not ask for information to read an article, but can someone reflect on the number of random e-mails being stored in NYTimes' database?

They just came up with this? by danny256 · 2002-07-21 05:48 · Score: 1

"The New York Times reports that two researchers at IBM have come up with a way to persuade people to give correct answers to survey questions: randomize the results."

Maybe I'm crazy, but isn't this exactly the same as the random response stuff that I (and presumably everyone else) learned in high school finite? The way it worked in finite is you ask someone 2 questions, like "have you ever killed?" and "have you ever been on a plane?", then you tell them to answer one of those based on some random event, like "answer the first question only if it's Monday, Tuesday or Wednesday, otherwise the other question", then you calculate the probability that they answered each question using finite. This protects their privacy since there is no way of knowing whether they said "true" to killing or to being on a plane, and people know this so they will be more likely to be honest. I don't see how this "new" system that IBM has created is much different than a slightly modified version of what I have just described.
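
A minimal sketch of the two-question calculation, under two assumptions the post doesn't spell out: the surveyor can't see which question was picked (the random event has to be private to the respondent), and the yes-rate q of the harmless question is known from elsewhere. Then P(yes) = (3/7) p + (4/7) q, which can be solved for p. The rates below are invented for the simulation.

```python
import random

P_SENSITIVE = 3 / 7   # Mon-Wed out of seven days, as in the post

def unrelated_question_answer(sensitive_truth, innocuous_truth, rng):
    """Answer the sensitive question with probability 3/7, else the innocuous one."""
    if rng.random() < P_SENSITIVE:
        return sensitive_truth
    return innocuous_truth

def estimate_sensitive_rate(answers, innocuous_yes_rate):
    """Solve P(yes) = P_SENSITIVE * p + (1 - P_SENSITIVE) * q for p."""
    yes_rate = sum(answers) / len(answers)
    return (yes_rate - (1 - P_SENSITIVE) * innocuous_yes_rate) / P_SENSITIVE

rng = random.Random(3)
p_true, q_known = 0.05, 0.80   # both made up for the simulation
answers = [
    unrelated_question_answer(rng.random() < p_true, rng.random() < q_known, rng)
    for _ in range(200_000)
]
print(round(estimate_sensitive_rate(answers, q_known), 3))
```

If q is only guessed, any error in it feeds straight into the estimate of p, scaled by (4/7)/(3/7).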

Re:They just came up with this? by letxa2000 · 2002-07-21 07:53 · Score: 1

I don't see how this "new" system that IBM has created is much different than a slightly modified version of what I have just described.
Because it assumes that you answer ALL the questions accurately ALL the time, but they will add "noise" so supposedly your exact answer cannot be known. They assume that since they add noise that people will answer all the questions accurately all the time. That, unfortunately, is not the case.
All they'll end up with is a dataset where the few accurate answers are now distorted by noise, and the lies are also distorted making it harder to differentiate the two.
It's a non-starter and useless...

This wouldn't help with my answers by Helen+O'Boyle · 2002-07-21 05:51 · Score: 1

Why do I sometimes put bogus information in surveys? Because the survey stands between me and something I'm after, and I don't want to waste any more time than necessary, getting whatever it is. For example, reading an online article, downloading eval software, entering an etailer's contest, getting on a geek mailing list, etc. are all things for which I've been asked to fill out surveys.

Privacy concerns don't necessarily enter into the equation in those situations. I may or may not want to take the time/brain power to even attempt to answer accurately.

Re:This wouldn't help with my answers by Anonymous Coward · 2002-07-21 06:46 · Score: 1, Insightful

Yes, this is something that seems to have been overlooked by the other posters. They seem to look at the surveys as a way to steal their privacy, whereas I simply see them as a waste of time. I would say this is true of a vast majority of the people I know, many of whom don't even understand that people go out of their way to track them.

If people didn't try to make their surveys mandatory, or obnoxious they might receive more truthful answers from that subset of the population that gives a crap enough to tell them they earn $30,000/y, work as a waitress, and has 2.5 children, and heard about this site/contest/whatever from spam e-mail sent by the publishers of the survey.

Until then, beware my "Bob Dole" fake survey answers.

This is cool. by Sheetrock · 2002-07-21 05:53 · Score: 1

I hadn't heard about this before. Randomized Response

--

Try not. Do or do not, there is no try.
-- Dr. Spock, stardate 2822-3.

It can be done, and has been done before by iltzu · 2002-07-21 06:00 · Score: 1

I'm always a bit skeptical when I'm told I'm about to be surveyed anonymously, and I can't think of a way that this can be implemented (or at least is likely to be implemented) that would reassure me.

There is a fairly well-known survey technique called randomized response that implements just this. The way to ensure trust is simply to let the user generate the random noise, for example by flipping a coin.

For example, to answer a sensitive yes/no question, you could be instructed to flip a coin. If you get heads, you answer the question truthfully. If you get tails, you flip the coin again, and answer "yes" for heads / "no" for tails. Thus, there is a 50% probability that any given answer is completely random, but the noise can easily be removed from the aggregate statistics.
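
A small worked calculation for the coin scheme above, assuming a fair coin and honest compliance. Half the answers are truthful and half are coin flips, so P(yes) = p/2 + 1/4, which inverts to p = 2 * (P(yes) - 1/4). The last function applies Bayes' rule to show how much a single "yes" says about that respondent. The 10% rate and the function names are hypothetical.

```python
def expected_yes_rate(p):
    """P(answer is yes) when half of answers are truthful and half are coin flips."""
    return 0.5 * p + 0.25

def estimate_p(observed_yes_rate):
    """Invert expected_yes_rate to recover the population rate."""
    return 2.0 * (observed_yes_rate - 0.25)

def prob_truly_yes_given_yes(p):
    """Bayes: P(truly yes | answered yes); a true 'yes' answers yes with probability 0.75."""
    return 0.75 * p / expected_yes_rate(p)

p = 0.10  # hypothetical true rate
print(expected_yes_rate(p))
print(estimate_p(expected_yes_rate(p)))
print(prob_truly_yes_given_yes(p))
```

So an individual answer is not completely uninformative: with these numbers a "yes" raises the odds for that person, but it leaves plenty of deniability.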

Mind you, I don't think that's what the NYT article is describing. The text is too vague to be sure, but it does seem as if they're describing a server-side randomization system. If so, I wouldn't trust that either.

It's time wasting by t_allardyce · 2002-07-21 06:08 · Score: 1

I don't think trust has anything to do with it. I already filled in forms on the internet truthfully years ago when I was young and stupid, and lots of sites have my details. They're gonna get them one way or another: maybe another company will sell them my information illegally, or maybe I'll fill in a form. It makes no difference.

The reason I don't fill in forms truthfully is because it takes too long. I can't be bothered to read questions; I see a drop-down box of countries and quickly click Afghanistan, then I click on the post/zip code box and hit my keyboard to fill it with crap, moving on, click, tick, select whatever gets me through as quickly as possible. I _really_ have no time to keep filling in these forms; I just want to get on the site so I can post a question, read an article or whatever (btw the random NY Times generator was excellent :). Then there are other sites that I just don't feel deserve my details. I'm fed up of junk mail, so my email address usually ends up being webmaster@[the-web-site].com etc. (give them a taste of their own medicine). I think there are very few people who actually go around filling in false data just to sabotage the statistics. All people want is a _By-Pass_ button that will just get them in. It really is for the web sites' own good if they want to maintain a reasonably accurate database, because otherwise they will just get my useless random fist-hitting-keyboard data.

--
This comment does not represent the views or opinions of the user.

Did these reporters ever take a stats course???? by Anonymous Coward · 2002-07-21 06:09 · Score: 0

Dear God, I learned about this technique in an undergrad intro to stats course back in 1985. What's the NY Times going to discover next, "Harvard Researchers Prove Accuracy in Counting to 10 Improved by Averaging Finger Counts"?

Can be solved by Anonymous Coward · 2002-07-21 06:13 · Score: 0

The technique described by the researchers is designed to reduce false answers users give to protect their privacy. Users don't trust the pollsters with their private info, so they lie to protect it. The way I read the article, and I did read the article, they are suggesting the answers get randomized before they get to the pollster. Like this:

"Sire, take this clipboard and go to that table way over there where we can't watch you too closely. For each question you wish to answer, spin the Wheel-Of-Fortune-type wheel. Add the value of the wheel to the value of your answer and mark the sum down on the clipboard. In this way we cannot discern your actual information."

If privacy concerns are the sole reason the user would normally lie to a pollster, this technique may convince the user that their actual data cannot be linked back to them, thus preserving their privacy.

On the other hand, if the user just wants to screw with the pollster, as many slashdotters have confessed, this technique is of no help.

Maybe not so pointless after all by Knacklappen · 2002-07-21 06:13 · Score: 1

What a pointless "technology".
"Pointless" would be to argue with you about the meaning of higher education.

Let's think about what the article actually says: IBM has employed a technique which lets them estimate the original distribution of data by adding a certain amount of random data with a known distribution. That surely should be useful in other areas as well?!
A Google search on Random Perturbation gives quite a long list with applications in weather simulation, computer graphics, chaotic dynamical systems, etc.
Still pointless? What about a search in the then NEC Research Index? Wowwww... Pointless, eh...?

--

Excellence: Moderate (mostly affected by comments on your karma)

Re:Maybe not so pointless after all by cduffy · 2002-07-21 06:31 · Score: 1

Yes, pointless -- this "research" is so well-known that the results are immediately obvious to anyone with any knowledge of statistics. The funding, media attention and whatever else that has been spent on studying and documenting this particular use is thus utterly without value.
Re:Maybe not so pointless after all by Knacklappen · 2002-07-21 21:23 · Score: 1

Well, then it's not research but simply engineering (aka applied research). Still, the technique itself is not pointless at all, and the application may be obvious but imho not without any value.
Example: Just because you spent millions of dollars to produce something that is worth 1 buck, it doesn't mean the thing is worthless. It's worth 1 buck! But you yourself would be totally worthless as an entrepreneur... :-)

--

Excellence: Moderate (mostly affected by comments on your karma)

Old trick by guanxi · 2002-07-21 06:17 · Score: 4, Informative

As another poster observes, if you don't trust them with the data, why trust them to randomize it?

My college stats professor 10 years ago explained a simpler trick that puts control in the respondent's hands. It went something like this:

With each question, the respondent flips a coin and looks at the second hand of a clock. Only the respondent can see the coin or the clock.

If the second hand is between 1-30 seconds, they answer per the coin (e.g. heads=yes). If it's between 31-60, they tell the truth.

The surveyor knows very precisely the expected number of 'lies' and can extract accurate data, and the respondent has confidence and control over their privacy. All without a transistor.

Re:Old trick by cduffy · 2002-07-21 06:34 · Score: 3, Insightful

The problem with these techniques is that you can't force the user to do it manually (as they won't), and the user can't trust their own computer (running someone else's software) to do it for themselves. That latter objection is the one that has botched any number of theoretically sound online voting systems.

Useful in theory? Very. Useful in practice? Not so much.
Re:Old trick by Dthoma · 2002-07-21 06:37 · Score: 1

Good idea, but the respondent would have to make sure that their clock wasn't set to the correct time. Otherwise, since the surveyor will have their clock set correctly, the second hands on the clocks will be approximately synchronised and the surveyor will know whether or not the respondent was actually answering the question:
"Hey! It's 48 seconds past, so I'm going to answer 'yes'!"
"Hey! We got a response at 48 seconds past! Assuming that the respondent's clock is correctly set (which it probably is) then 'yes' is their proper answer!"
This could even be used across different time zones, since these only affect the hour but not the second.

--
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".
Re:Old trick by AJWM · 2002-07-21 07:00 · Score: 3, Interesting

Indeed, very old trick. (For my sins, in my earlier days I used to help PhD psych students run statistical analyses on their survey data.)

A variation on this is to give the respondent a die (i.e., half a pair of dice), tell them to pick a number between one and six, and every time they roll that number, intentionally give a false answer on the survey. Thus, looking at any individual survey response, you don't know whether it's true or false, but you can factor the 16.7% false responses into the statistical analysis.
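For a yes/no question, "factoring in" the false responses is a one-line correction. The sketch below is only an illustration under the assumption that every answer is flipped independently with probability 1/6; the function name and the 40% example rate are hypothetical.

```python
def correct_for_lies(observed_yes_rate, lie_prob=1/6):
    """Recover the true 'yes' rate when each answer is flipped with lie_prob.

    Observed rate = p * (1 - lie_prob) + (1 - p) * lie_prob, solved for p.
    """
    if lie_prob == 0.5:
        raise ValueError("a 50% lie rate destroys all information")
    return (observed_yes_rate - lie_prob) / (1.0 - 2.0 * lie_prob)

# Hypothetical example: 40% of respondents would truthfully say yes.
true_rate = 0.40
observed = true_rate * (5 / 6) + (1 - true_rate) * (1 / 6)
print(round(observed, 4), round(correct_for_lies(observed), 4))
```

The price of the privacy is precision: dividing by (1 - 2 * lie_prob) stretches any sampling error by a factor of 1.5, so a somewhat larger sample is needed for the same accuracy.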

Sure, that can be computerized, but as someone above pointed out, how does the respondent know he can trust it? The above old technique is entirely under the respondent's control.

--
-- Alastair
Re:Old trick by Anonymous Coward · 2002-07-21 07:39 · Score: 0

Yeah, but you know what? I'm too lazy to fill out a survey truthfully. I usually just end up picking whatever my mouse comes to first and moving on. This also means that I'm WAY too lazy to flip a coin, then look at a clock and determine what I mean. I'm also too lazy to do anything else outside of what the survey requires (because it's sitting between me, what I want and my free time). So, I'm not going to randomize anything unless it's a button I click (and have to click) or it's totally transparent.

Now, just imagine a person slightly less lazy than me, who'll give the correct answers if their privacy is guaranteed but won't go so far as to flip a coin, roll a die, etc. That's where a computerized system is going to be helpful.

After all, aren't computers supposed to be doing the work for me?!? Why should I have to flip a coin or roll a die when I can use a random number generator on my computer to do it for me and make it totally transparent?
Re:Old trick by Bert690 · 2002-07-21 09:47 · Score: 1

> As another poster observes, if you don't trust
> them with the data, why trust them to randomize it?
Duh, because you don't have to -- you have software (which can be open source provided and verified) randomize the data according to the scheme ON YOUR OWN MACHINE.
While randomizing data isn't an entirely new trick for ensuring privacy, IF ANYONE HERE would have bothered reading their paper they would realize that this work takes it quite a bit further than before in allowing reconstruction of complex classification models rather than simple aggregate function computations.
Re:Old trick by sunhou · 2002-07-21 10:21 · Score: 2

Another version of this is to have two questions. E.g. if you want to know how many people have shoplifted, you actually give them two questions:

1) Have you ever shoplifted?
2) Do you have any siblings? (Or some other innocuous question.)

You tell the person "Roll a die (or just mentally choose a random number between 1 and 6). If you get a 5 or 6, answer question 1. Otherwise, answer question 2." You can use the fact that there is a 1/3 chance of answering question 1, together with Bayes' Theorem, to figure out the percentage of people who said yes to question 1. People feel more confident about answering honestly, because the experiment is simple enough that most people believe the researcher doesn't know which question they answered (although some people will still be suspicious, of course).

Note: if you have them mentally choose a number between 1 and 6, you first need to do another experiment to find the percentage of people who choose 5 or 6, since it probably is not 1/3.
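The arithmetic behind the two-question version is really the law of total probability: observed yes rate = (1/3) * (shoplifting rate) + (2/3) * (sibling rate). Here is a rough Python sketch of solving that for the sensitive rate. It assumes the yes rate for the innocuous question is already known from some other source; the 80% and 58% figures and the function name are invented for the example.

```python
def estimate_sensitive_rate(observed_yes_rate, innocuous_yes_rate, p_sensitive=1/3):
    """Solve observed = p_sensitive * s + (1 - p_sensitive) * innocuous for s.

    p_sensitive is the chance a respondent was routed to the sensitive
    question (1/3 for "roll a 5 or 6"). If people pick the number in their
    heads, replace it with the measured share who pick 5 or 6.
    """
    return (observed_yes_rate - (1.0 - p_sensitive) * innocuous_yes_rate) / p_sensitive

# Hypothetical numbers: 80% of people have siblings, and 58% of all
# answers collected were "yes".
estimate = estimate_sensitive_rate(observed_yes_rate=0.58, innocuous_yes_rate=0.80)
print(round(estimate, 3))
```

Because the sensitive question only contributes a third of the answers, errors in the observed rate are multiplied by three, which is why the routing probability is a trade-off between privacy and sample size.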

I read a nice little article on this subject a while back called "How to ask sensitive questions without getting punched in the nose", I believe it was in volume 3 of a series called Modules in Applied Mathematics, but I don't have it handy on my shelf. But it's a very well-known example in statistics, I believe it's called a randomized response design.
Re:Old trick by dr_canak · 2002-07-21 13:36 · Score: 1

I'm not sure how old "Randomized Response" surveys are, but I think it goes back to at least 1965 [Warner, 1965, Journal of the American Statistical Association, 60, 63-69].

There is a nice summary of several randomized response methods in:

Fox, J., & Tracy, P. (1986). Randomized Response; A Method for Sensitive Surveys. Sage University Papers

While I'm not sure these particular techniques are applicable to the types of things IBM is doing, they can be very helpful in research when you are trying to establish a baseline/prevalence of behavior that has some stigma attached to it (i.e. drug use, child sexual assault, rape, etc...)
Re:Old trick by mbertini · 2002-07-22 00:30 · Score: 1

Yep, some years ago I was interviewed by a student working on his Master's thesis in Statistics (Univ. of Florence) to do something similar.

I had to perform some simple calculations using the serial number of a banknote taken from my wallet, then answer according to the number obtained, or tell the truth.

Bullshit by ROBOKATZ · 2002-07-21 06:22 · Score: 1

If the users always pick the first choice, regardless of the 'true distribution', you're not going to get any information.

Re:Bullshit by Dthoma · 2002-07-21 06:40 · Score: 1

Internet forms are not necessarily all multiple choice. Some ask you stuff like your address which drop-down forms can't take into account.

--
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".
Re:Bullshit by 80N · 2002-07-21 08:15 · Score: 1

Some on-line survey software randomizes the order in which the choices are listed to avoid this bias.
80N

Reminds me of... by jburroug · 2002-07-21 06:22 · Score: 2

this song by Three Dead Trolls in a Baggie

--
"Listen: We are here on Earth to fart around. Don't let anybody tell you any different!" - Kurt Vonnegut

I'm innocent! by HD+Webdev · 2002-07-21 06:29 · Score: 3, Funny

"Judge, I did not know she was 14 years old. I'm pleading innocent by reason of randomized, aggregate data!"

--
This is not a dream, not a dream...we are transmitting from the year 1-9-9-9.

NYT Random Login Generator by majcher · 2002-07-21 06:33 · Score: 5, Informative

Hey, it's me. The guy who put together and hosts the New York Times random login generator. First off, thanks for all your cards and letters - I originally just created that page to save myself some trouble, but I'm glad to see that everyone likes it so much.

I'd also like to remind everyone that anyone who wants to can download, copy, and mirror the source of that page on their own servers, or even keep it as an HTML page on their desktop or whatever. It's just javascript, so it's portable, and that way you'll still be able to use it when the NYT lawyers finally get around to noticing it or they start blocking requests from my page or something. (It will also help distribute my load, though I haven't had any real trouble yet...)

Re:NYT Random Login Generator by Kelerain · 2002-07-21 10:03 · Score: 1

Here's a thought.. Could this be implemented in a bookmarklet? http://www.bookmarklets.com/
Is it just the formatting of a random URL? I don't know much about the process or the coding, but if you can do it in HTML. It would be a pretty extreme coding challenge of course as you only have 256 bytes to work with. Maybe a few 'throw away' logins you could use for a week or two, then replace them randomly from a page at the end of the week?
Re:NYT Random Login Generator by Frank+of+Earth · 2002-07-21 12:18 · Score: 1

I don't know why slashdot promotes this link. The NY Times is constantly linked from /. and is a great source of information; from technology to World news.

All they ask is that you register on their site so they can learn from their user base and they provide you with access to everything for free. If you're one of those paranoid psychos, then don't give them your life story.

By using this random generator, you're really fucking up the NYtimes db. In working with people who analyze site traffic [not nytimes.com], this just skews all the numbers, especially all the hits from /. which really makes it a pain in the ass to figure out who the user base is when you got all these people registering from "Grenada".

When they change their registration process and perhaps start charging for their online content, don't start bitching.

[There goes my good karma rating ;-]

--
Live web cams
Re:NYT Random Login Generator by evilviper · 2002-07-21 12:48 · Score: 2

It's a good idea... but I don't suppose there's a javascript-free version for those of us that know better than to enable javascript?

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Re:NYT Random Login Generator by shepd · 2002-07-21 12:55 · Score: 4, Insightful

>If you're one of those paranoid psychos, then don't give them your life story.

Too bad there's no "Skip this crap" option in their registration screen, huh?

So, the only way to not give them your life story is to lie. I know! Let's make it easy and create a random login generator so I don't have to type more random crap on every computer I use!

And, BTW, if you think I'm paranoid, I'll let you know that I was able to make any changes I wanted [but only did what I asked, of course] to my grandmother's phone line by simply asking her age and full name -- ALL of which are sent to NYT on that page. They only asked to hear a lady's voice, which my mother happily provided. Armed with just a birthdate and name I can make all sorts of changes to your services -- anonymously.

Knowing that, do you want to give me your name and address? If you don't, you should know there's no reason why I'm not working at the NYT right now... I will tell you that where I do work I have access to many, many, many records including Full Names and Birthdates. Feeling uneasy yet? Well, if you trust me, I've never abused those privileges.

>When they change their registration process and perhaps charging for their online content, don't start bitching.

My only bitching will be the fact their site goes offline for everyone. You can't compete in a (literally) Free market by charging infinitely more than your competitors. With the amount of newspapers online right now, and the amount of good content that doesn't come from the NYT, I think they'll end up another salon.

--
If you could be told what you can see or read, then it follows that you could be told what to say or think - BoC
Re:NYT Random Login Generator by Sarin · 2002-07-21 13:28 · Score: 2

I know why slashdot is promoting it. I stumbled upon the generator a few months ago, while searching for something else with google.

Then I decided to make it part of my evil plot to take over the world by using it as my signature on slashdot (you have to start somewhere, you know).
It was funny to see that other people noticed it and started to use it as well.

Now that the first step of my mission is completed, I will have to think of another signature.
Re:NYT Random Login Generator by orn · 2002-07-22 06:54 · Score: 1

Sorry for the (very soon to be moderated down) simple and annoying reply, but...

Kick ass! :-)

--
1. 2.

None of your business...... by stuartkahler · 2002-07-21 06:42 · Score: 2, Insightful

The kind of questions that most of these sites ask include stuff that is impolite for friends to ask each other sometimes, never mind some random business. If they want accurate results, they should include the option for people to answer with a "MYOB" option. People are rather unlikely to keep tossing in crap data when they have the "MYOB" option, at least not in the 40% range. There is no way in hell that anyone making 100k+/year would actually admit it and give a business their real e-mail address. They would be begging for a flood of advertisements.
Why is it that online businesses feel they have the right to try and force so much personal information out of us? In brick 'n mortar stores, the worst info anyone asks me for is my zip code (or age to purchase alcohol). They can get my name if I use my credit card, but I can easily pay cash to avoid that.
It's very ironic that NYTimes would run this story.... Why do they expect me to tell them where I live, work, and what I make, just to read their articles? The paper version is nowhere near this invasive.

Re:None of your business...... by astroboscope · 2002-07-21 08:09 · Score: 0

But it won't fly.
1. If they were willing to accept "MYOB", they'd also let you leave the field blank.
2. If someone answered MYOB to income, they'd just assume it was more than $100 000 or whatever the trigger level is for massive junk mailing.

--
If we were ants living on a Rubik's cube, differential geometry would be a little more confusing.
Re:None of your business...... by stuartkahler · 2002-07-21 16:33 · Score: 1

But then you could be comfortable saying 'myob' to the 'e-mail' field, and then fill in the rest truthfully. They just have to make it so that there is at least as much effort involved in checking off the 'myob' answer as putting in a real one.
I'm beginning to think that the internet is being taken over by moron execs who never actually USE the internet themselves. I see stupid things all the time that make me leave a website before I've even gotten past the second page.
Re:None of your business...... by unclelib · 2002-07-21 20:13 · Score: 0

Why is it that online business feel they have the right to try and force so much personal information out of us?
Because they need to make money too! In a real store or business you are paying $$ for a service, so they don't ask as many questions. When you go to the NY Times website, you don't pay anything. But the website still has bills it must pay. This is how they are able to pay their bills, by collecting data and using that information for marketing purposes.

Not an entirely new idea by James+Ezick · 2002-07-21 06:47 · Score: 2, Informative

The idea of using randomness to get better survey results is not a new one. In his 1990 book "Innumeracy", John Allen Paulos posits a system for asking a potentially embarrassing yes or no question whereby the examiner asks the subject to flip a fair coin before responding. If the subject gets heads he should give the embarrassing answer; if tails, he should tell the truth. The idea is that the subject is then spared the trauma of giving the embarrassing answer since the examiner is not told the result of the coin flip and it is possible the subject just flipped heads. Knowing the "probability distribution" of a fair coin, it can then be assumed that half the respondents gave the embarrassing answer as a result of their coin flip. These can then be removed from the data, leaving a statistically accurate result.
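As a back-of-the-envelope illustration of "removing" the coin-driven answers, the Python sketch below assumes a fair coin, so that P(yes) = 1/2 + p/2 where p is the true rate. The function name and the 620-out-of-1000 figures are hypothetical, and the standard error uses the usual binomial approximation.

```python
import math

def forced_yes_estimate(yes_count, n):
    """Estimate the true rate p from forced-'yes' data, with a rough standard error.

    Heads (probability 1/2) forces 'yes'; tails gives the truth, so
    P(yes) = 1/2 + p/2  and therefore  p = 2 * P(yes) - 1.
    """
    observed = yes_count / n
    p_hat = 2.0 * observed - 1.0
    # Doubling the observed rate also doubles its sampling error.
    std_err = 2.0 * math.sqrt(observed * (1.0 - observed) / n)
    return p_hat, std_err

p_hat, std_err = forced_yes_estimate(yes_count=620, n=1000)
print(round(p_hat, 3), round(std_err, 3))
```

With those made-up counts, about a quarter of the respondents would be estimated to hold the embarrassing answer, even though no individual "yes" can be attributed to anything but the coin.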

It seems that what the IBM folks are doing is a straightforward extension of this idea to a larger response domain (numerical ages as opposed to boolean questions) and to a more automated system in which the website flips the coin for the subject and amends his answer accordingly.

Re:Not an entirely new idea by Dthoma · 2002-07-21 06:58 · Score: 1

"It seems that what the IBM folks are doing is a straightforward extension of this idea to a larger response domain (numerical ages as opposed to boolean questions) and to a more automated system in which the website flips the coin for the subject and amends his answer accordingly."
Since most people probably won't trust the automaton to randomize their data, then you should convince the user to do it at their end. The problem with this approach is that there is not a great deal of person-independent random data within a large enough and set range which is readily accessible to the user. Sure, you could ask them to flip a coin 20 times and add on the number of times they get heads, but what are the odds of that?

--
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".

Data is already "randomized". by blair1q · 2002-07-21 06:59 · Score: 2

If the respondents are already randomizing the data, the statistical analysis should be able to produce the same result.

Or hadn't they thought of that?

Re:Data is already "randomized". by Joheines · 2002-07-21 07:12 · Score: 1

But people randomize data using an unknown kind of distribution (some may always check the first best thing, some always answer yes, etc.). When IBM does the randomizing for you, it can design the process so as to randomize according to a statistical distribution of its choosing. When they know which statistical distribution was used to randomize the data, they can remove its influence from the aggregated data.
Re:Data is already "randomized". by Anonymous Coward · 2002-07-21 13:51 · Score: 0

You're stupid.
Re:Data is already "randomized". by blair1q · 2002-07-22 09:09 · Score: 2

It's not unknown. It's random. And the space of available errors is finite. So there's a good chance it will either be a uniform distribution or a gaussian one. And there's a cute statistical trick where if you sum a random selection of random distributions, the sum's PDF averages to the normal distribution anyway. Which is all the proposal proposes to add themselves. So they're really not changing the statistics at all.

I bet if they applied their decommutation algorithm to the existing data, they'd wet their pants.

--Blair
Re:Data is already "randomized". by blair1q · 2002-07-22 09:11 · Score: 2

You wet your pants.

Re:Mirror by Registered+Coward+v2 · 2002-07-21 07:08 · Score: 2

A couple of thoughts.

First, I found this funny:

Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.

Yes, they *need* to make even more money off your data.

Second, anyone find it interesting that they assume a distribution and then work towards it:

"When people lie randomly -- and that is what they do now when they answer questions -- we get very poor results," he said. But by "adding random values to true values," he said, "we can reconstruct a distribution that is very close to the actual one."

Using this information, Dr. Srikant said, the researchers make a first guess at what the true distribution should be. Then the program crunches through the analysis and produces a slightly better guess. This guess is crunched again, and the process is repeated over and over again, getting closer and closer to the actual distribution.
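One way such a guess-and-refine loop could look is sketched below in Python. This is only my reading of the description, not necessarily the researchers' exact algorithm: it starts from a uniform guess and repeatedly applies Bayes' rule to each reported value, using the known noise distribution. The uniform noise of -15 to +15 years, the age grid of 18 to 80, and all names are assumptions made for the example.

```python
import random
from collections import Counter

def reconstruct(perturbed, grid, noise_pmf, iterations=50):
    """Iteratively estimate the distribution of true values on `grid`.

    perturbed : list of reported values (true value + random noise)
    grid      : candidate true values
    noise_pmf : function d -> P(noise == d), known to the analyst
    """
    counts = Counter(perturbed)
    estimate = {a: 1.0 / len(grid) for a in grid}  # first guess: uniform
    for _ in range(iterations):
        updated = {a: 0.0 for a in grid}
        for w, c in counts.items():
            # Posterior over the true value a, given report w and the current guess.
            weights = {a: noise_pmf(w - a) * estimate[a] for a in grid}
            total = sum(weights.values())
            if total == 0.0:
                continue  # report impossible under this grid and noise model
            for a in grid:
                updated[a] += c * weights[a] / total
        norm = sum(updated.values())
        estimate = {a: v / norm for a, v in updated.items()}
    return estimate

def uniform_noise_pmf(d, spread=15):
    """Noise drawn uniformly from the integers -spread..spread (an assumption)."""
    return 1.0 / (2 * spread + 1) if -spread <= d <= spread else 0.0

rng = random.Random(0)
true_ages = [rng.choice([25, 26, 27, 50, 51, 52]) for _ in range(2000)]
reported = [age + rng.randint(-15, 15) for age in true_ages]
estimate = reconstruct(reported, list(range(18, 81)), uniform_noise_pmf)
est_mean = sum(a * p for a, p in estimate.items())
print(round(est_mean, 1), round(sum(true_ages) / len(true_ages), 1))
```

The function only ever sees the reported values and the noise distribution, never `true_ages`, which are kept around purely to compare against the estimate.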

My guess is that they hope people don't truly lie randomly, and then use their random additions or subtractions to bring people closer to the actual distribution, i.e. I may say I make $0 or $50,000 (or whatever the low/high end is), but not pick one one away from my real income. They are hoping that people, as a group, behave predictably even when any one individual doesn't. Which, if my org behavior prof is to be believed, is generally the case and the way people can shape others' responses and behaviors.

Interestingly enough, randomization is a useful tool in surveys. If you are asking about very private information that people may lie about if they fear the answer will be leaked, you can tell them to flip a coin: heads, ask them to answer truthfully; tails, put down no (for a yes/no survey). With a large enough sample, you can back out the real results based on the 50/50 results of the coin toss, without knowing how anyone actually answered.

Of course, companies should probably ask themselves how many Josef Stalins live in Moscow Idaho and were born on Oct 24, 1917?

--
I'm a consultant - I convert gibberish into cash-flow.

Amazingly ironic and hypocritical of the NYT by Anonymous Coward · 2002-07-21 07:08 · Score: 0

I am truly amazed that the NYT would report this as news since they always have one or more front page stories which are the results of a poll.

What's so amazing is that the polls they report as news are usually constructed so as to come up with results that the NYT editorial board wants to have.

How about being respectable journalists by:
1. reporting facts to not try to influence public opinion
2. not reporting manufactured news stories such as poll results
3. putting editorials on the editorial page and not the front page since editorials are designed to change public opinion
4. reporting both good and bad news about a president/congressman whether or not it helps/hurts the political outlook of the NYT editorial board
5. reporting news stories which don't center around NYC, DC, and LA.

Re:Amazingly ironic and hypocritical of the NYT by Anonymous Coward · 2002-07-22 06:06 · Score: 0

It's about time newspapers actually had some credibility in their reporting.

I am sick of the practice of including one or two articles in each paper to appeal to each demographic group.

Any statisticians in da 'ouse? by Anonymous Coward · 2002-07-21 07:09 · Score: 0

Doesn't anyone else see how this is such an incredibly stupid waste of time and CPU cycles?... Paraphrased: "yea... we could add or subtract a random quantity from the results to either conform to a normal distribution or just linearly.." Is this the whole point? Chances are they'll have pretty much the same data at the end... or is it some psychological device?

Re:Any statisticians in da 'ouse? by Daetrin · 2002-07-21 09:12 · Score: 2

The whole idea of the thing is that they'll have pretty much the same data at the end, at least in aggregate form. What they won't have is the exact data on any individual person. If their randomizer adds a value from -15 to 15 to the age, and my result comes up as 37, then I could actually be anywhere from 22 to 52 years old.
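A quick Python sketch of both halves of that claim, assuming the noise is a whole number of years drawn uniformly from -15 to 15 (the helper names and the 18 to 70 age range are made up for the example): one reported age only pins the person down to a 31-year window, while the average over many people barely moves because the noise averages out to zero.

```python
import random

def possible_true_range(reported, spread=15):
    """Smallest and largest true value consistent with one reported value."""
    return reported - spread, reported + spread

def mean(values):
    return sum(values) / len(values)

rng = random.Random(1)  # fixed seed so the run is repeatable
true_ages = [rng.randint(18, 70) for _ in range(10_000)]
reported_ages = [age + rng.randint(-15, 15) for age in true_ages]

print(possible_true_range(37))
print(round(mean(true_ages), 2), round(mean(reported_ages), 2))
```

Only `reported_ages` would ever leave the respondent's side; `true_ages` exists here just to show how little the aggregate changes.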

--
This Space Intentionally Left Blank

Type of questions, too by Bastian · 2002-07-21 07:12 · Score: 2

When a company tells me they aren't going to use the information I give them for anything but demographics research, then asks me for my phone number and address and makes both fields required, I consider it safe to assume that company is lying, and don't think it's at all naughty to fib.

On the other hand, if the company really only requires me to answer questions of demographic importance, such as what country and state/province I am from and my age, I am likely to respond truthfully.

Re:Type of questions, too by verbatim · 2002-07-21 08:50 · Score: 2

You're right - the distance between the information and the user affects the result. That is, I'll tell you my age group, but not my birthdate. I'll tell you my region, but not my street-address.

Kind of like what dboyles said: by allowing the user to skip questions they don't want to answer, the questions they do answer are far more likely to be honest.

--
Price, Quality, Time. Pick none. What, you thought you had a choice?

It's when they can face the truth... by Ho-Lee-Cow! · 2002-07-21 07:15 · Score: 1

...that their surveys are bogus, that we see them assume that the crappy answers they get are the fault of the respondents.

Back when McCain's campaign was maligning Bush in the 2000 primaries, I got a call from a pollster, obviously from the McCain camp, who offered the most leading questions and answer choices I've ever seen in such a poll. Of course, when I said that I thought McCain was a commie, and that I didn't agree with even a tenth of his allegedly 'conservative' positions, the guy started shuffling to end that interview. I am certain my survey responses got flushed.

--
In space, no one can hear you moo.

Slashdot, fast with the news by ZaneMcAuley · 2002-07-21 07:18 · Score: 1

I submitted this a while ago before it appeared here. Same for other stories.

2002-07-19 22:51:55 Web form information, True or False? (articles,news) (rejected)

I won't submit another story again as it is pointless.

--
----- Whats wrong with this picture? http://www.revoh.org:1234/whatswrong

The type of questions make a difference. by implex · 2002-07-21 07:19 · Score: 1

I am more prone to lie about facts like birth date, income, education status.

In the end it all comes down to the questions: if they are along the lines of my interests and my age range, then I feel more comfortable saying I am in my late 20's and that I like to travel than giving them my birth date and which hotel chains I stay at.

the only time i answered truthfully... by YrWrstNtmr · 2002-07-21 07:30 · Score: 1

...was when I could actually check the button marked:
Household income? >$100,000

and then, only once. Of course, the rest of the block were filled in with random junk.

If they take actual inputs, and randomize the results, why bother with the expensive survey taking in the first place? Just generate a set of random responses.

"Well, we had 1,500 responses, and 23% were female, potential snow tire buyers, age 18-25" (of course, we did not actually take a survey, and just had the computer over in the corner generate a bunch of random responses).

This is how it would work: by 80N · 2002-07-21 07:36 · Score: 2, Interesting

This is how it would work: You have a web page that asks you for your age (see 1 below). On the web-page is a JavaScript function that adds a random modifier. The value you entered is displayed as a non-input field to the right and the value you entered in the input field is replaced by the randomized version (see 2 below).

1 Age [28] *Will be randomized*
2 Age [56 (Randomized)] *28*

The value 56 gets submitted to the server, not the value 28 - which is my real age ;).

This is auditable because I can inspect the source code which is part of the web-page, and I can even monitor the network packets if I'm really paranoid.

Now I could still lie, or mess with the algorithms in the Javascript, but what would be the point?

80N

Re:This is how it would work: by GigsVT · 2002-07-21 07:44 · Score: 1

Now I could still lie, or mess with the algorithms in the Javascript, but what would be the point?

95% or more of web users won't know or understand a client side function vs. a server side one. This might work with some site that marketed only to geeks that did know the difference.

--
I've had enough abrasive sigs. Kittens are cute and fuzzy.
Re:This is how it would work: by 80N · 2002-07-21 08:05 · Score: 3, Interesting

95% of web users don't understand a lot of things, but if someone they trust tells them it's OK then they will be happy.
I don't really understand how SSL works, but I trust my browser (a bit) and when I see https in the URL then I'm comfortable with that. Not because I fully understand SSL, but because I listen to the opinions of people who do.
So if it became accepted practice that pressing the Randomize button on your browser (why not build it into the browser) made your response anonymous then nobody needs to understand it any more than they do SSL.
Actually, why not have a new http method: POST-RANDOM instead of POST so the server knows that the data has been randomized.
80N
Re:This is how it would work: by petecarlson · 2002-07-21 08:15 · Score: 1

95% or more of web users won't know or understand a client side function vs. a server side one. This might work with some site that marketed only to geeks that did know the difference.

Any geek that took the time to look at the Javascript would deliberately screw with their answers because they can.

I question a basic assumption by mysticgoat · 2002-07-21 07:45 · Score: 2, Interesting

It was an interesting article, and I can see how this technique will work when the surveyors have the goodwill of the respondents, so that any respondent's primary concern is only that of keeping his individual privacy.

But is privacy the core issue in market research, or is it simply a label of convenience that a lot of people use for something else that we don't have easy words for? I will lie on many surveys even when I am fully confident of my personal anonymity-- though I prefer to avoid those surveys entirely when I can. OTOH, when a survey is done by a group that I have aligned myself with, I might well enthusiastically bare my soul without any regard to the privacy issue. And I know that I am not at all uncommon in these respects.

I suspect that my reactions stem from the same source as nationalism, patriotism, ethnic pride, and that whole mess of things where I'm not behaving as an individual protecting my privacy, but as a member of a group who feels called upon to defend my group.

Mostly I see marketing as an attempt by outsiders to mess with my group, to get us to buy stuff through conning us rather than letting us apply our own standards of value to the goods offered. I think I lie on surveys to protect my group from these subtle attacks; to misdirect and confound my group's enemy.

So I really don't think privacy has much to do with it. I think all this lying is a natural group reaction to consumerism, and its belief that it is perfectly okay to sell product by conning your customers into thinking that what you are pushing today is something they want.

Not in my group, buster. We don't need no steenkeeng pushers in our neighborhood.

a better way by sbwoodside · 2002-07-21 07:52 · Score: 1

There's a better way. Run the demo-collecting software on the client. The user enters their info, the client randomizes it and sends it on.

Similarly for customized ads. Your client (open source of course) knows your demographics. But it also has 5 other (fake) profiles. It sends them all to the server, the server sends back 5 customized ads, one for each profile. The client picks the right one and shows you.

Everyone wins!

Ciao for now,
Simon Woodside
http://www.simonwoodside.com/

PS. Please, check out http://www.semacode.org/ and give me some feedback !

--
home page

Voting Machine by 80N · 2002-07-21 07:53 · Score: 1

Umm...wouldn't this technique be useful for a Voting Machine? See Unauditable Voting Machines. If every vote is randomized then anonymity can be guaranteed while at the same time maintaining a complete audit of the poll.

You vote for candidate B, this is randomized to be candidate E. The voting machine has a record that shows that you voted for E. This can be inspected by you to determine that your (randomized) vote was not tampered with.

The outcome of the election is determined simply by removing the randomizing bias...

80N

They need to let the randomization be client-side by iabervon · 2002-07-21 07:57 · Score: 2

People won't trust sites to actually randomize the data. Actually, people probably won't notice that the site is promising to, or take this as a reason to give good results. What they should do is set up a system where the randomization is done by the browser (which people trust), in accordance with a distribution specified by the site and provided to the user.

That way, the browser tells you that your entry will be randomized to tell the site your age +-30 years, or give your actual gender 20% more frequently. Based on the numbers the site is using, you can decide whether to answer accurately, knowing just how hard it would be to track you based on this information. The web site would then be able to remove the noise from the aggregate data, and have a confidence based on the distribution they ask for (aside from people who think the margin is too small and lie).

Why I don't trust survey takers by dpbsmith · 2002-07-21 08:00 · Score: 3, Informative

As many others have noted, the technique is silly because if you don't trust survey takers in the first place, why would you trust them when they say they are following the IBM randomization technique?

A couple of years ago, I received a survey in the mail that said the results would be kept completely confidential and anonymous. I thought it was odd that there was a mysterious seven-digit number in one corner, but anyway, I said to heck with it and pitched it. A week later I got a follow-up letter noting that I hadn't sent in my survey yet! Some anonymity!

Incidentally, this is not the only time I've gotten "anonymous, confidential" surveys with mysterious multi-digit numbers. In at least one case, it was at a big company and the survey involved things that nobody in their right mind would want their bosses to know about... and there were mysterious multi-digit numbers on the forms and, indeed, checking with colleagues confirmed that the numbers were different on each of our forms. Naturally, we all put down safe, inaccurate answers.

--

"How to Do Nothing," kids activities, back in print!

Re:Why I don't trust survey takers by Dthoma · 2002-07-21 08:31 · Score: 1

These surveys have curious seven-digit numbers in the corner. Do WWW surveys have curious seven-digit numbers in the corner that they can identify you with? No, they don't. They can't slip an all-knowing seven-digit number into the page without your knowing. All they can do is get your IP address, but with that all they can do is try to send you email (or DoS you, but what are the odds of that?). This isn't a problem if you tick the little "No mail, please" box at the bottom.

--
Note to M1-ers: a curt but otherwise insightful message is not "Flamebait" or "Troll".
Re:Why I don't trust survey takers by blisspix · 2002-07-21 15:51 · Score: 1

anonymity/confidentiality in questionnaires does NOT mean that the person conducting the survey doesn't know who you are.

It means that you will not be identifiable in the results, and any identifying information about you gathered by the questionnaire will be kept separate from the coded results.

I am conducting a questionnaire at the moment for my thesis. I have contacted specific people to return the questionnaire, and I only know who they are if they choose to tell me what their email address is (if they want to know the results). Otherwise, I have no idea who they are, although I am certain that they come from my original email sample.

If you are ever concerned about your confidentiality, ask to see their ethics approval (if it is an academic study) or feel free to give feedback to the person conducting the survey and let them know where you are concerned.

What about the users of the data by panurge · 2002-07-21 08:09 · Score: 1

The people who use the data - the corporations - are using it based on the belief that they are paying for some acceptable level of accuracy. They have to believe or be made to believe the market research companies are somehow extracting good information from that band of congenital liars, we the people.

So perhaps the real objective of this "research" is actually to persuade the guys who pay for the surveys that IBM consulting has better ways of doing them - trust us, we know, and that's an extra 20% on the bill for all that research we did into how to do market research.

--
Panurge has posted for the last time. Thanks for the positive moderations.

neat technique first published by Robyn Dawes by King+Babar · 2002-07-21 08:11 · Score: 2

I think it was first by Robyn Dawes...anyway, a very similar technique was used in what was a brilliant design for a study on sexual behavior during the perceived height of the AIDS epidemic in this country. In a nutshell, we were faced with an epidemic spread by sexual contact, but did not really know what the base rates for any of the more (or less) dangerous activities were, or if they had ever been tried.

Asking people right out "Hey, did you have unprotected anal sex on your casual encounter?" was found to be not a particularly good way to elicit truthful answers. So what you do is give people a fair coin (or the equivalent) and have them flip the coin for each question. If the coin lands heads, they answer "yes". If the coin lands tails, they answer *truthfully*. Looking over an answer sheet, you have no idea which "yes" answers are real and which are not, and subjects did feel like nobody really could "get" any personal information off their answer sheet. In the statistical aggregate, however, you could get perfectly useful average rates for a given population. (Basically, you just adjust for the "yes answer background".)
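
A minimal sketch of the "yes answer background" adjustment, assuming a fair coin and respondents who follow the rules. With a true rate p, the chance of a recorded "yes" is 0.5 + 0.5 * p, so p can be recovered as 2 * P(yes) - 1. The function names and the simulated population are illustrative only and do not come from the study mentioned above.

```python
import random


def forced_yes_response(truth: bool, rng: random.Random) -> bool:
    """Heads: always answer yes. Tails: answer truthfully."""
    heads = rng.random() < 0.5
    return True if heads else truth


def estimate_true_rate(responses) -> float:
    """Remove the forced-yes background: p = 2 * P(yes) - 1, clamped at 0."""
    observed_yes = sum(responses) / len(responses)
    return max(0.0, 2.0 * observed_yes - 1.0)


def simulate(true_rate: float = 0.2, n: int = 100_000, seed: int = 1) -> float:
    """Simulate n respondents (hypothetical population) and return the estimate."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n)]
    responses = [forced_yes_response(t, rng) for t in truths]
    return estimate_true_rate(responses)


if __name__ == "__main__":
    print(f"estimated rate: {simulate():.3f}")
```

The estimate is twice as noisy as a direct question would be, which is the price paid for no single answer sheet being incriminating.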

A great idea, but its use in a wide-range study of this type was axed, I believe, when the study itself was blasted by certain members of congress...but that's another story.

--

Babar

Huh? by Galahad2 · 2002-07-21 08:14 · Score: 1

I fail to see how this is a real technology. They have untrustworthy results, so they want to get results that make sense, so they mess with them. So they already know what they want to get, and by implementing this system, they manipulate the "data" so that they do. What's the worth in asking actual people, anyway? If you assume that they lie and then change their data entirely to fit into your bell, why don't you just make up the data entirely? I bet you could make it fit much nicer on the bell if you did that.

Completely missing the point by Perianwyr+Stormcrow · 2002-07-21 08:16 · Score: 2

They've missed the point about why their forms are full of bullshit.

The forms are giant time-wasters.

If the folks giving these surveys would stick to EXACTLY what they NEED to know, we wouldn't balk at filling them out properly- especially since personal data is one thing they generally do NOT need to know for marketing!

Forget the name, address, interests (the BIGGEST time waster of all.) Generally, the most important information that you can get from site visitors is:

1) Zip code. This tells you the geographic area that your visitors are coming from. Useful for location-relevant information, but completely impersonal.

2) Age range. This is really the prime info that marketers want, as so much of their "science" is based on generational observation. Again, totally impersonal.

3) How you heard about the site. This is the most important thing you can learn from your visitors, as it gives you some information on which advertisements are performing!

If every site I signed up to asked me these three questions and these three questions ONLY, I'd answer them all truthfully. As it stands, I have to dig through a mountain of shit, and these days I generally just throw the shovel at the pile and move on.

--

What we call folk wisdom is often no more than a kind of expedient stupidity.-Edward Abbey

Re:scrolling errors by Brian+Goldman · 2002-07-21 08:25 · Score: 0

Client-side obfuscation is a great idea, but trusting some website's Javascript isn't good enough. Not everyone knows how to inspect source code for hidden fields. This would be best implemented as a browser feature or plug-in, so you use the same code to obfuscate forms every time, and it's made by people independent of the surveying website.

the data is already random by Anonymous Coward · 2002-07-21 08:27 · Score: 0

isn't the fact that people are providing false data enough randomization? Why doesn't IBM just process the data that is already provided, rather than trying to collect it "their" way.

All they need is an idea of the distribution of the random numbers generated by the average person.

So far, people seem to have missed the point... by pgrb · 2002-07-21 08:30 · Score: 2, Insightful

...which is not unusual on Slashdot - I do it all the time as well.

The idea of randomising answers is not new. It has been used in 'socially sensitive' surveys for years, if not decades.

Simple explanation:
Have a survey of 10 questions people don't like to reveal the truth of, each with a yes/no answer.

For each question, either
a) reply truthfully
b) flip a coin and record whatever the coin gives.

If challenged about your answer, you can always say that's the answer the coin required.

Analyse the results for a large population of completed surveys. Any significant deviation from 50% yes and 50% no answers tells you which way the population answered, without revealing who actually holds those views.

All you need is a coin to randomise your answers. This is independent of any web form, doctored answer sheet etc etc - so particular answers cannot be pinned on you.

It's fun administering the same survey to people with and without the randomisation - you get to see what people in general lie about!

Hope this gives a useful summary of the method.

Regards,

pgrb

--
This line intentionally left..uh..blank?

How ironic... by Anonymous Coward · 2002-07-21 08:31 · Score: 0

I enter a new sub everytime I read NYTimes...

This time my email was theevildoers@whitehouse.gov, I was 60 yrs old, lived in Afghanistan, was CEO in the Energy business (that sounds about right for the email addy), and subscribed to the Times.

If you can't get me to input truthful info in the first place, which I won't, no matter how "anonymously" the data will be kept, how is randomization going to help?

The Irony... by rimsky · 2002-07-21 08:41 · Score: 1

Isn't it ironic that we are using the random NY Times registration generator to read an article about random registration data? Sort of proves the point, doesn't it?

Re:They need to let the randomization be client-si by ndecker · 2002-07-21 08:43 · Score: 1

That way, the browser tells you that your entry will be randomized...

Most people will not understand what the browser does (probably with JavaScript). Even those who could would not bother reading the page source to make sure the data isn't transmitted in the clear.
It will not make any difference whether the data is encrypted on the client or the server.

Drug abuse surveys etc by JPMH · 2002-07-21 08:51 · Score: 2

I believe this technique (or the variant above with a single d6) has been used as a standard textbook example in the literature on Bayesian methods in biomedical statistics.

ISTR it's in Tanner's book on Gibbs sampling, as a method used to extract accurate population estimates about embarrassing, personal or even incriminating subjects, such as past exposure to STDs, sexual orientation, or the use of particular controlled drugs.

Of course, your survey has to be big enough so that the expected number of true positives (N.p) stands out above the expected uncertainty in the number of false positives, approx sqrt(N.p'.(1-p')). If p is small, N may have to be really quite big.
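
As a rough worked example, the condition above can be turned into a minimum survey size. Requiring N.p >= k.sqrt(N.p'.(1-p')) and solving for N gives N >= k^2.p'.(1-p')/p^2. The factor k (how many standard deviations count as "standing out") and the function name are assumptions added for illustration; p and p' are used as in the comment.

```python
import math


def min_survey_size(p_true: float, p_false: float, k: float = 2.0) -> int:
    """Smallest N with N*p_true >= k * sqrt(N * p_false * (1 - p_false)).

    p_true  -- true positive rate in the population (p above)
    p_false -- chance that randomization yields a false positive (p' above)
    k       -- assumed number of standard deviations of separation wanted
    """
    if not (0.0 < p_true <= 1.0 and 0.0 < p_false < 1.0):
        raise ValueError("p_true must be in (0, 1] and p_false in (0, 1)")
    return math.ceil(k * k * p_false * (1.0 - p_false) / (p_true * p_true))


if __name__ == "__main__":
    for p in (0.25, 0.05, 0.01):
        print(p, min_survey_size(p, 0.5))
```

With p' = 0.5 and k = 2, a 1% true rate already calls for on the order of ten thousand respondents, because the required N grows as 1/p^2.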

not a new idea by Mr.+Slippery · 2002-07-21 08:53 · Score: 2

I remember reading about something similar to this in a psychology class in 1988 or so. The idea was for people doing a door-to-door survey asking things like sexual behavior. There's important public health reasons to have the data, but also strong reluctance to give honest answers.

What they did was give the person being polled a spinner, like from a board game. (Remember those, oh you young /.ers? Maybe not...) It was divided in two parts, 2/3 would say "yes" and 1/3 would be "no". The questioner would ask if the person's answer to some yes/no question matched that shown on the spinner (which the questioner couldn't see). You couldn't know what any single person's answer was, but you could do the math and get how many had done what.
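
"Doing the math" for the spinner can be sketched like this. If a fraction p of people would truthfully say "yes" and the spinner shows "yes" with probability 2/3, then the chance of reporting a match is (2/3).p + (1/3).(1-p) = 1/3 + p/3, so p = 3.P(match) - 1. The function names and the simulated population below are hypothetical, and the sketch assumes everyone reports the match honestly.

```python
import random


def spinner_response(truth: bool, rng: random.Random, p_yes: float = 2 / 3) -> bool:
    """Report only whether the true answer matches the hidden spinner."""
    spinner_says_yes = rng.random() < p_yes
    return truth == spinner_says_yes


def estimate_yes_rate(match_reports, p_yes: float = 2 / 3) -> float:
    """Invert P(match) = p_yes * p + (1 - p_yes) * (1 - p) to recover p."""
    if abs(2 * p_yes - 1) < 1e-12:
        raise ValueError("a 50/50 spinner carries no information about p")
    match_rate = sum(match_reports) / len(match_reports)
    estimate = (match_rate - (1 - p_yes)) / (2 * p_yes - 1)
    return min(1.0, max(0.0, estimate))  # clamp sampling noise into [0, 1]


def simulate_spinner_survey(true_rate: float, n: int, seed: int) -> float:
    """Simulate n door-to-door interviews and return the estimated yes rate."""
    rng = random.Random(seed)
    reports = [spinner_response(rng.random() < true_rate, rng) for _ in range(n)]
    return estimate_yes_rate(reports)


if __name__ == "__main__":
    print(f"{simulate_spinner_survey(0.3, 200_000, seed=5):.3f}")
```

The spinner must not be split 50/50: the further it is from an even split, the less noisy the estimate, but the more each individual answer gives away.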

--
Tom Swiss | the infamous tms | my blog
You cannot wash away blood with blood

People lie because corporations lie by guttentag · 2002-07-21 09:09 · Score: 2

No company is really going to use this, but a company will claim it does to gain your "trust." Have you ever heard a hardcore marketing goon talk about trust? It's really chilling.

I used to work for a company whose customers had to provide accurate information in order to sign up -- the service wouldn't work with false info -- but the problem was getting people to sign up.

One of the main selling points was that customer data was completely secure: no one will ever be able to read your data, only an aggregate report of all our users. The company went to a lot of trouble to make this point convincing, going so far as to suggest that users had legal protections against abuse. There were people in the building who spent all their time trying to think of ways to convince more people to drop their defenses so we could exploit their information -- cold, calculating, 24-7, like WOPR spends all its time playing World War Three.

I believed their claims until the day I saw a user's sensitive data on an engineer's screen. And then that engineer showed me another user's data, and another. "We've always had the ability to do this," he said, "for, ahem, quality control purposes."

If a company tells you it isn't collecting the valuable data you provide, you need to assume it is lying (unless you can personally verify the claim or you are positive that the law protects you against abuse).

Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.
"Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said. Conservative estimates are 42 percent, but anecdotally the rates are far higher, she added. "People are lying," she said, "and vendors don't know what is false and accurate, so the information is useless."

People are "lying" because corporations lie, as a matter of policy. This will never change because lies are more profitable than truth. Only corporations don't call their behavior "lying," they call it "marketing." So when I fill out an intrusive form with false information, I don't consider it lying either. I call it "standing up for my right to privacy." This system of "marketing" versus "standing up for my rights" is well-balanced, but this new masking technology is simply a marketing attempt to tip the scales in the corporations' favor by tricking consumers into volunteering information on false assumptions.

really?? by JebusIsLord · 2002-07-21 09:53 · Score: 1

studies have shown that over 67% of statistics are made up on the spot :)

--
Jeremy

Re:POLLS SUCK by Anonymous Coward · 2002-07-21 09:59 · Score: 0

I'll help

CmdrTaco, root@cmdrtaco.net

Randomizing for Accuracy by hysterion · 2002-07-21 10:00 · Score: 4, Funny

Rakesh Agrawal and Ramakrishnan Srikant have devised a data-mining program that would cloak individual truthful answers

Don't trust these guys. They are (obviously) piping their names through some obfuscation algorithm.

--
Timeo idiotikOS et dona ferentes

Real easy way to get honest answers by Guppy06 · 2002-07-21 10:28 · Score: 2

Why do companies take these polls to begin with? To make money. Either there is money to be made in interpreting the results or even in providing the results in the first place (see election exit polls). If the pollsters are looking to make a profit off of the information, why not share that profit with the people that gave you the information to begin with?

Going back to the example of the exit poll, if all you're going to do is try to make money by predicting who will win an election, it's much more satisfying for the voter to lie and watch them squirm when they get it wrong. Why should we tell the truth?

Mod Parent UP by Anonymous Coward · 2002-07-21 10:56 · Score: 0

ROFLMAO

The computer can do what now?? by IncohereD · 2002-07-21 10:57 · Score: 1

Or the computer could pick random numbers out of a hat, with the chances of picking any one number the same as for any other.

Now, all other problems with the article aside, I think this little sentence about generating random numbers is the most problematic. As most people are aware, generating random numbers on a computer is no trivial thing. If they have since grown the ability to draw numbers out of a hat...I'd really like to see it.

correct answer is user choice by joe094287523459087 · 2002-07-21 10:58 · Score: 1

i always thought that the way to get accurate data is to make fields *non required* so people don't have to game the form if they don't want to enter the data. this seems pretty obvious to me, but i guess marketing people aren't known for having a firm grasp of the obvious.

Hm... Reminds me of John Brunner's Oracle by Anonymous Coward · 2002-07-21 11:22 · Score: 0

In his book The Shockwave Rider, Brunner posits an online service called "Oracle" which operates by aggregating responses from a large enough cross section of the general population to gain answers to questions. Brunner suggested that even though the members of the cross-section didn't actually know the answers, in the aggregate the oracle was more often right than wrong.

Looks like the NYTimes reads old science fiction novels.

There is already a better way by oopy_-_ · 2002-07-21 11:33 · Score: 1

There are already methods of randomizing responses for sensitive topics that work much better than this by instilling real trust. The way it works is quite simple. You ask one thousand people whether they have used drugs in the last 30 days. For each person, during the interview you ask them to flip a coin, and if it is heads, answer truthfully, if it is tails say "yes". Through some very simple formulas, you can translate this to the true result.

All surveys have a confidence interval, and 1000 flips of a coin has a very high confidence of being very close to 50-50 heads/tails.

There are of course more sophisticated versions of this with dice or a more complex random number generator, but with the interviewer not knowing if a person is telling the truth, it usually makes people much more confident in being honest as nothing can ever be proven about what they say.

Waste Of Time by Anonymous Coward · 2002-07-21 11:46 · Score: 0

This is an attempt to find a technical solution to a problem of human motivation.

Nobody gives a crap about providing truthful survey answers to marketing weasels. Obviously, technology can never provide a solution for this problem.

This is exactly parallel to the music industry's problem. Nobody gives a crap about their copyrights, and so their attempt to find a technical solution has been a failure.

To fully understand technology, you need to understand what it CAN'T do.

Doesn't really protect privacy cross-site by cgdemarc · 2002-07-21 11:51 · Score: 1

Adding noise does an OK job of protecting an individual response, but after years of submitting survey responses to many web surveys, there'd be plenty of data to make excellent estimates of your personal attributes.
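
A small simulation of that pooling effect, under the assumption that every site adds independent uniform noise of up to 30 years either way to the same true age and that the sites share what they receive. The function names are illustrative only.

```python
import random
import statistics


def noisy_answer(true_age: int, rng: random.Random, spread: int = 30) -> int:
    """One randomized submission: true age plus uniform integer noise."""
    return true_age + rng.randint(-spread, spread)


def pooled_estimate(true_age: int, n_surveys: int, seed: int = 0) -> float:
    """Average of n_surveys noisy submissions from the same person."""
    rng = random.Random(seed)
    return statistics.mean(noisy_answer(true_age, rng) for _ in range(n_surveys))


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(pooled_estimate(41, n), 1))
```

A single submission can be off by up to 30 years, but the error of the average shrinks roughly as 1/sqrt(n), so a few hundred pooled responses pin the age down to within a couple of years.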

How many users are aware that many of the sites they visit pool data?

There's an enormous body of research on how to hide individual records in databases, ranging from adding noise to preventing queries that access fewer than a set number of records. In the end, none of the methods work well - all have simple or clever workarounds. Even individual records aren't very well protected by adding noise if the record size is large enough and fields are dependent.

Just use multiple choice with a range by shird · 2002-07-21 12:42 · Score: 1

If randomizing simply introduces an element of inaccuracy, why not just have the options include a range?

i.e Age?: (0-10), (10-15), (15-25) etc

To me, this is all randomizing seems to achieve in regards to privacy, and overall accuracy of the results. At least then you don't have to rely on the host collecting the results randomizing them, because you would already know the answers are fairly 'vague'.

--
I.O.U One Sig.

Free Reg My Ass by Anonymous Coward · 2002-07-21 12:51 · Score: 0

WHAT'S your age? Your salary?

Online merchants who ask nosy questions like that on surveys at their Web sites have learned what usually honest visitors will do.

Fib, most likely.

People give false answers to protect their privacy. Then, because the data is so unreliable, companies can't use it to help them run their businesses.

Two I.B.M. researchers have devised software that seeks to get around this information age impasse. Rakesh Agrawal and Ramakrishnan Srikant, computer scientists at the I.B.M. Almaden Research Center in San Jose, Calif., have devised a data-mining program that would cloak individual truthful answers that people might enter once their trust was won but still recover important characteristics of the overall group.

For instance, instead of recording the answer "41" to a nosy question like "How old are you?" the software automatically adds a random number of years within a specified range, say minus 30 to plus 30, to the answer. No record of initial answers is kept.

Then, using a series of mathematical guesses based partly on how the initial data was randomized, the program gradually reconstructs a realistic distribution of the age groups that responded -- how many people were 20 to 25, say, or 40 to 45. Demographic information like this might be of great interest to a company in quest of 25-year-olds to buy its sports cars or computer games.

Some inaccuracy results when the I.B.M. program approximates the actual distribution of age, salary or other characteristics in such large data sets, said Ann Cavoukian, the commissioner of information and privacy in Ontario. "But in return for about 5 percent inaccuracy, you have a privacy model in which individual answers are not used," she said.

Programs like this one could lead to greater truthfulness in the answers people volunteer on the Web, she said, provided that they were willing to replace some of their native caution with a bit of good will toward a company and its need for data-mining.

"Right now, the rate of falsification on Web surveys is extremely high," Dr. Cavoukian said. Conservative estimates are 42 percent, but anecdotally the rates are far higher, she added. "People are lying," she said, "and vendors don't know what is false and accurate, so the information is useless."

Dr. Agrawal said that his way of reconstructing data was based on hiding the true numbers, although not through the sort of lying practiced by ordinary people confronting a questionnaire.

"When people lie randomly -- and that is what they do now when they answer questions -- we get very poor results," he said. But by "adding random values to true values," he said, "we can reconstruct a distribution that is very close to the actual one."

Dr. Srikant said, "We know a lot about the distribution of these random values."

The random numbers generated by the computer could be distributed in a bell curve, for instance, with most values clustered near zero and fewer at either end. Or the computer could pick random numbers out of a hat, with the chances of picking any one number the same as for any other.

Using this information, Dr. Srikant said, the researchers make a first guess at what the true distribution should be. Then the program crunches through the analysis and produces a slightly better guess. This guess is crunched again, and the process is repeated over and over again, getting closer and closer to the actual distribution.
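
The guess-and-refine loop can be sketched roughly as follows. This is not the I.B.M. program, only a minimal illustration under stated assumptions: ages are whole numbers, the noise is a uniform integer between minus `spread` and plus `spread`, and each round applies Bayes' rule to spread every randomized answer over the true ages that could have produced it. The names `randomize` and `reconstruct` are made up for this example.

```python
import random
from collections import Counter


def randomize(ages, spread, rng):
    """Add independent uniform integer noise in [-spread, spread] to each age."""
    return [a + rng.randint(-spread, spread) for a in ages]


def reconstruct(noisy, spread, support, iterations=100):
    """Estimate the distribution of true ages from randomized answers only."""
    counts = Counter(noisy)
    # True ages that could have produced each observed value under the known noise.
    candidates = {w: [a for a in support if abs(w - a) <= spread] for w in counts}
    est = {a: 1.0 / len(support) for a in support}  # first guess: uniform
    for _ in range(iterations):
        new = dict.fromkeys(support, 0.0)
        for w, c in counts.items():
            total = sum(est[a] for a in candidates[w])
            if total == 0.0:
                continue  # value impossible under the current guess; skip it
            for a in candidates[w]:
                # Posterior P(age = a | observed w), weighted by how often w occurred.
                new[a] += c * est[a] / total
        norm = sum(new.values())
        if norm == 0.0:
            raise ValueError("no observation is compatible with the support")
        est = {a: v / norm for a, v in new.items()}  # slightly better guess
    return est


if __name__ == "__main__":
    rng = random.Random(7)
    true_ages = [25] * 7000 + [55] * 3000
    est = reconstruct(randomize(true_ages, 30, rng), 30, range(0, 101))
    young = sum(p for a, p in est.items() if a < 40)
    print(f"estimated share under 40: {young:.2f}")
```

Because the noise here is uniform, its density cancels out of the Bayes update; a bell-curve noise model would need each candidate age weighted by the noise density at the observed offset.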

"When you do this for 10,000 answers, the overall distribution is likely to be accurate," Dr. Srikant said.

Johannes Gehrke, an assistant professor of computer science at Cornell University who specializes in data mining, said the program was the first effort to address in depth the challenge of reconstructing a distribution of large data sets in the context of data mining.

"You know the record after randomization and you also know how you randomized the record," he said. Those two pieces of information, along with a standard statistical theorem called Bayes' rule, allow the program to estimate the prior distribution.

Random perturbation, the formal name of the technique used by the I.B.M. researchers to mask the original answers, satisfies the demand for privacy to a greater degree than many other procedures available to organizations, said David F. Andrews, who recently retired as a professor in the department of statistics at the University of Toronto.

"The idea that you can take data from a population, add random noise to it and then recover important characteristics from this perturbed data has a long history," he said.

Techniques that reconstruct distributions without revealing individual information may be welcome not only to people filling out forms but also to companies that ask touchy questions. "If companies have data and it escapes, they could be liable for data breaches of security," Dr. Andrews said. "This way, you can't be sued."

The program and related ones by other researchers may help companies explore raw data presently closed to them, said Christopher W. Clifton, an associate professor of computer science at Purdue University and author of a chapter on security and privacy in the forthcoming LEA Handbook of Data Mining (Lawrence Erlbaum Associates). "These programs ensure that the original data values can't be reconstructed, but are still close enough to the real results to be meaningful."

The I.B.M. program has been tested in the lab and a prototype is available. Dr. Cavoukian said she hoped that businesses would soon come forward to do beta tests of the software.

"Usually technology is used to invade privacy," she said. "I like this program because here we are using technology to protect privacy."

Privacy trap by Erpo · 2002-07-21 13:33 · Score: 1

One solution might be to perform the randomization on the client side and display the result. That way the user can see that the answers have been munged before they are sent.

<sarcasm>
That would certainly be psychologically assuring. It would show users that the data they entered could not be used to target them specifically but could still provide useful group demographics to the company. As a web retailer, I can't wait for this to become widespread. I'll put up a registration form on my web site with a "randomize" button next to the submit button. When a user clicks on the randomize button, the javascript on the page will back up the real information that the customer entered into another set of variables, then randomize what the display boxes show to make it look like the data has been randomized. When that user then clicks the submit button, which belongs to the form containing all the backed up true values in "hidden" type inputs, the customer's data will be sent through a POST so the user can't see that his or her data has not been protected.

It's brilliant. It gets me the data I want without regard for the user's privacy and it solves the problem of the user not trusting my web-based business.

<sarcasm> 
But what if some random user clicks "view source" and finds out what I'm doing? Well, of course they'll report it to the EFF, which will muster its army of thousands of highly paid lawyers, public relations masters, and black belt ninjas to sue me out of business, run television ad campaigns to keep the public informed, and quickly and (get this) anonymously take out all my top execs, just like they've always done when combatting such problems as Passport(R)'s invasions of privacy and the DMCA.
</sarcasm>
</sarcasm>

I'd still lie by ggruschow · 2002-07-21 13:41 · Score: 1

Nobody'll read the instructions that tell them there are privacy enhancements in place. The few who do read the instructions still won't understand how the enhancements work anyway. Even if they do kind of grasp it, they still won't be convinced their privacy is safe.

There's no incentive to answer correctly. Good old-fashioned generosity and truthfulness are more than cancelled out by spam.

I don't lie to deceive. I just answer as quickly as possible. I don't care if people know who I am. It's just easier to enter a@a.com and pick the first choice in each multiple choice group.

conservative estimates say nearly half ... by HaggiZ · 2002-07-21 13:49 · Score: 1

... of all survey answers are bogus

And how exactly do they measure that? Are you lying:
Yes
No
Random

Obviously you then discount everyone who said no, because only the ones that answered yes can be certain to have told the truth, but then they... oh dear

And in related news, 74% of all statistics are made up.

--
Glenn
The Smrt way to trade CFDs on the ASX

Nothing New by Roger+W+Moore · 2002-07-21 14:43 · Score: 1

The techniques they describe in this article are absolutely nothing new: physicists (and probably many other scientists) have been using them for years. In my particular field, particle physics, we use large detectors to look at high energy particle interactions. Unfortunately these detectors never give an exact picture of what happens; instead, their response varies randomly. However, by knowing how the detector randomizes things we can, on average, extract the original picture and find out what happened.
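
The idea of undoing a known randomization can be sketched with a single yes/no survey question. This is an illustration only, not the I.B.M. program and not real detector-unfolding code: it assumes each respondent reports the truth with probability `p_truth` and otherwise reports a fair coin flip, and every function name here is made up for the example.

```python
import random
from typing import List


def randomize_answer(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Report the true answer with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reported: List[bool], p_truth: float) -> float:
    """Invert the known randomization.

    Expected share of reported "yes" = p_truth * t + (1 - p_truth) * 0.5,
    where t is the true share, so solve that equation for t.
    """
    observed = sum(reported) / len(reported)
    return (observed - (1.0 - p_truth) * 0.5) / p_truth


def simulate(true_rate: float, p_truth: float, n: int, seed: int = 0) -> float:
    """Draw n true answers, randomize each one, and estimate the true rate."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n)]
    reported = [randomize_answer(t, p_truth, rng) for t in truths]
    return estimate_true_rate(reported, p_truth)
```

With `p_truth = 0.5`, no single reported answer says much about the person who gave it, yet a call such as `simulate(0.3, 0.5, 200000)` should land close to the true rate of 0.3, because the randomization averages out over many respondents.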

So it seems to me that IBM has just reinvented the wheel. What would have made it more interesting is if they had some way to randomize the answers automatically without having to trust the company. Without that, the whole process is useless, and asking visitors to randomize the answers themselves has two problems:

How many people would actually do it properly rather than just get through the stupid forms as fast as possible? If you completely lie then there is no way to extract the truth (unless you can show that everyone's lies form some statistical pattern).

Is human randomness truly random? If not (and it's almost certainly not!) then a lot of work will be needed to extract the correct distributions, especially since there may be correlations. For example, when asked to randomize their age, are women more likely to subtract 10 years than add 10 years? Are teenagers more likely to add 10 years than subtract them?

What might solve the second problem is a browser plugin that would display forms and add randomization to them when submitted. However, I don't see any way to solve the first problem, so I think this is really a non-starter.
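
A minimal sketch of the second problem, under invented assumptions: software (such as the hypothetical plugin) adds zero-mean noise to each age, while a human asked to "randomize" subtracts 10 years 70% of the time and adds 10 years otherwise. The 70% figure and all function names are assumptions for illustration, not measured behavior.

```python
import random


def perturb_ages(ages, noise_fn, rng):
    """Add one noise draw per respondent; only the perturbed ages are kept."""
    return [age + noise_fn(rng) for age in ages]


def symmetric_noise(rng):
    # Software-generated noise: uniform on [-10, 10], so its mean is 0.
    return rng.uniform(-10, 10)


def biased_human_noise(rng):
    # Assumed human habit: subtract 10 years 70% of the time, else add 10.
    return -10 if rng.random() < 0.7 else 10


def mean_error(ages, noise_fn, seed=0):
    """Return (mean of perturbed ages) minus (mean of true ages)."""
    rng = random.Random(seed)
    perturbed = perturb_ages(ages, noise_fn, rng)
    return sum(perturbed) / len(perturbed) - sum(ages) / len(ages)
```

With zero-mean noise the average age survives almost unchanged. With the lopsided human noise it drifts by roughly 0.7 * (-10) + 0.3 * 10 = -4 years, and that drift can only be corrected if the bias is known in advance.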

Zip Code Checkers by DustMagnet · 2002-07-21 15:48 · Score: 1

I always use 10101, since it's binary and easy to type. It's for New York, NY, so that's easy to remember too.

What I hate is the e-mail checkers. To get past them, I have to risk using someone else's real e-mail address.

Back to the topic, even if they promised to fake 100% of the answers, I still wouldn't fill in the truth. I've been spammed by almost every company I've given my e-mail to, even when they promised not to.

--
'SBEMAIL!' is better than a goat!!

Random function by jsse · 2002-07-21 16:51 · Score: 2

From a mathematical perspective, isn't a random function reversible? The chance of reconstructing the scrambled data is slim, but we shouldn't rule out the risk. Why don't they use an irreversible function like MD5?

It is amazing how people justify lying. by Jayson · 2002-07-21 17:41 · Score: 2

I find it quite amazing how people will justify their behavior. This is a good example of the selfishness of people: I want everything my way and if it conflicts with my belief then I should have the right to discard it.

The company is providing you with a product, often for free, and they request that for you to use their product you give them a little personal information. It is their product, so they get to make the rules. Your choices are to give what they want and take what you want, or you could just live without it. I don't understand the position of taking what you want and not leaving what they want.

Or you consider this tiny piece of personal information part of the price. Instead of giving them $5, you give them your age, salary, and email address. You don't try to trick the grocery store clerk when you think the bill was too high for what you bought, do you? Why would this be any different? If you don't like the invasion of privacy, then the cost is too high for you and you don't take the product.

I can see where people may say that capital is a required part of making the product and personal information isn't. Since they don't need my email address then I should feel free to not give it to them. However, this personal information often does translate into capital for them (the goal of business is to make money most of the time). Besides, that isn't your decision to make. The company wants your privates and they are giving you the product, so their desire carries more weight. If you were not receiving something back in return, then their desires would not override yours.

Saying it is only a "little" lie doesn't change the fundamental fact that lying is a priori wrong.

Re:It is amazing how people justify lying. by Anonymous Coward · 2002-07-22 12:12 · Score: 0

Yeah, but seeing that all of slashdot is about lying, fanaticism, FUD and selfishness without bounds... it fits right in.

Am I the only one that finds it ironic... by Kidbro · 2002-07-21 19:25 · Score: 2

...that we use a Random NY Times Registration Generator to falsify [our] registration details to access an article about ways to persuade people to give correct answers to survey questions?

Helluva page btw, majcher. Thanks :)

--
May we live long and die out

80% of all statistics are made up on the spot... by davejenkins · 2002-07-21 21:00 · Score: 1

Since conservative estimates say nearly half of all survey answers are bogus...

How did they come up with that estimate? A survey?

--

davejenkins.com

Grandma? by iiii · 2002-07-22 03:09 · Score: 2

Grandma, is that you? How all of the family has been searching to find you. What a joyful day this is! No one believed you were serious when you, Sjembo Obsowetu, vowed to put an Apple computer in every classroom in our home country, but look at you now.

Beef jerky?

--
Light cup, beer drink, thin so chain, neck turtle fat, man I won't say it again

Online voting systems by guanxi · 2002-07-22 03:17 · Score: 1

That latter objection is the one that has botched any number of theoretically sound online voting systems.

Voting systems were designed to randomize some votes to protect anonymity? Very interesting. Even if people trust it, though, I wonder about the constitutionality and politics of it. Could you imagine Florida 2000 with some votes intentionally randomized?

My objection to online voting was not so much anonymity as security. It would be worth huge amounts of money to some people (from foreign gov'ts, to politicians, to large companies) to rig an election. How do you secure every variation of computer that exists in America against that kind of attacker? Think of the NSA (or rather, its foreign equivalents). But that's OT.

Re:Online voting systems by cduffy · 2002-07-22 10:57 · Score: 1

No, not systems designed to randomize votes -- systems designed to use some protocol which, given a set of steps performed by the user, assures that user's security, anonymity, &c.

There are several of these on the market or in academic use; see GNU.FREE, or Sensus, or the commercial system used by VoteHere (they have white papers on their site), or the algorithm described by David Chaum in Eurocrypt '89. These algorithms are all theoretically secure, until one realizes that the steps which "the voter" is supposed to perform are all performed not by the voter but by their computer -- and that this computer can be subverted by any number of means (including attack on the distribution mechanism of the voting software, which is more often than not implemented as a Java application).

There are a great many other issues with online voting, of course; while a polling booth can be monitored to prevent vote-buying, for instance, how can one prevent this when the polling booth can be located anywhere a computer with a net connection exists? But as you say, this goes beyond the scope of this discussion.

three sigma responses to political surveys by Anonymous Coward · 2002-07-22 06:26 · Score: 0

I don't see how this would much affect folks like me. When a phone surveyor calls, if I talk to them at all, I tell them exactly what I want, not a response from their carefully prepared list. This applies especially to political surveys, where they manipulate the multiple-choice responses to elicit the results they were paid to produce.

This confuses them, and my responses certainly will be dropped as "statistical outliers". But it sure is fun to get these tele-surveyors working outside of their comfort zone!

Now, if the politicians who commission these "unbiased" surveys hadn't excluded themselves from the state "no-call" law, all of this wouldn't be necessary. Until they do that, I'll just enjoy myself at their expense.

Re:They need to let the randomization be client-si by iabervon · 2002-07-22 07:26 · Score: 2

You'd need it to be a browser feature, rather than a feature implemented by the browser while running javascript in the page. Obviously, this involves a lot of changes, but this is theoretical research anyway.

224 comments