Big Talk About Small Samples
The smallest sample I've ever used to make an argument was when I submitted some legal briefs, each no longer than five pages, in the anti-spam cases that I'd been filing in Washington State small claims court. Since I suspected the judges were not taking the cases seriously, I filed the briefs with the third and fourth pages stuck together in the center, by a tiny thread of paper joining the back of the third page to the front of the fourth page. (If someone were to turn the pages and actually read the brief, the thread would break.) I did something similar in six different cases, and when the motions were all rejected, I went to the courthouse to look at the paper motions still in the file. In three out of six cases, the judge had rejected the motion without reading it first.
Now, the point was not to make any accurate estimation of the actual proportion, in the total population of small claims court judges, who would reject a brief in an anti-spam case without reading it. There's no basis for saying that the proportion of such judges is close to 50%. But we can still probably reject any contention that the proportion of such judges is very low. If only 10% of judges were rejecting motions without reading them, then there is only about a 1.6% chance of taking a random sample of six rejected motions and finding that in three or more cases, the judge did not read the motion. Even if 20% of judges were doing so, for an event with a probability of p=0.20 you would still see it occur in three or more out of six cases only about 9.9% of the time. (If an event has probability p, the exact probability of that event occurring three or more times in six trials is given by 20*(p^3)*((1-p)^3) + 15*(p^4)*((1-p)^2) + 6*(p^5)*((1-p)^1) + 1*(p^6)*((1-p)^0).) So we can say that the proportion of such judges is quite probably more than 20%. I did this repeatedly because even after I had "caught" the first judge, I wanted to head off any objection that this was just an isolated case of rare behavior.
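For readers who would rather check the arithmetic than trust it, here is a minimal Python sketch that evaluates the formula above for p = 0.10 and p = 0.20. The function name prob_at_least is my own, chosen for illustration; the only library call is math.comb, which needs Python 3.8 or later.

```python
from math import comb

def prob_at_least(k, n, p):
    """Exact probability of k or more successes in n independent trials,
    each succeeding with probability p (a binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Three or more unread motions out of six, under two hypothesized rates.
for p in (0.10, 0.20):
    tail = prob_at_least(3, 6, p)
    print(f"p = {p:.2f}: P(3 or more of 6) = {tail:.2%}")
```

The coefficients 20, 15, 6 and 1 in the formula are just comb(6, 3) through comb(6, 6), so the same function works for any other sample size or threshold.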
And, as always, it's important not to generalize too much about the behavior whose probability we're estimating. I don't think that 20% or more of judges, even in small claims court, are throwing out most types of cases without reading or listening to the arguments. My impression was that most judges view small claims court as a place to redress injustices, and that they see anti-spam and anti-telemarketer plaintiffs as just trying to "make money" at it, so they take those suits less seriously. I disagreed with this stance because (1) anti-spam plaintiffs usually really have been harmed and are not just "whining about one email" on which they are trying to "cash in" (I still get so much spam that it interferes somewhat with the operation of my server and with my ability to get through my daily email); and (2) the law is intended after all as a deterrent, with disproportionate damages in order to discourage spammers from spamming in the first place. However, the charitable reading of the results is to assume that judges are merely biased against anti-spam plaintiffs -- but at least they probably don't treat all cases as casually as they treat anti-spam suits!
Back to the issue of small samples. My previous article was prompted by an editorial about the online response that had been elicited by two different photos -- one showing a black woman breastfeeding, and a nearly identical photo showing a white woman breastfeeding. The author asserted that the photos had received vastly different responses, which she attributed to racism. I presented a survey to a sample of users recruited from Amazon's Mechanical Turk, randomly showed each survey-taker one of the two photos, and asked:
Our academic department has asked everyone to submit a "fun" photo of themselves, so that our photos can be displayed together on the department home page. One of our employees submitted a photo that has caused some internal debate about whether the photo is inappropriate. I wanted to do a poll to get the opinion of a random sample of Internet users of different backgrounds.
Do you think this is an appropriate picture to be used in a photo collection on our academic department home page?
Out of 47 respondents who saw the black woman's photo, 36 of them (77%) said it was inappropriate. Out of 54 respondents who saw the white woman's photo, 38 of them (70%) said it was inappropriate.
As before, these samples are too small to say precisely what the relevant proportions in the background populations are, but we can probably reject certain statements about the populations -- for example, that the percentage of users offended by the black woman's photo is 20 percentage points higher than the percentage of users offended by the white woman's photo. This is where the counterintuitive part comes in. Suppose that in the background population, 81% of respondents would find the black woman's photo offensive, but only 61% would be offended by the white woman's photo. What are the odds of getting 77% or less "yes that's offensive" responses from a sample of 47 users shown the black woman's photo, and getting 70% or more "yes that's offensive" responses from a sample of 54 users shown the white woman's photo? It doesn't sound unlikely at all, because the percentages are quite close to the originals -- but you can verify, either with statistical calculations or with a quickly written computer program, that the odds are only about 2.5%.
Two main factors contribute to this counterintuitive result. First, even with a sample size of a few dozen, the frequency of an event starts to tend very closely to the frequency in the background population (if 80% of your population has some trait, and you take a sample of size 50, there's about a 95% chance that the number with that trait in your sample will be between 34 and 46). Second, to find the odds of seeing both of these deviations at the same time (deviating from an assumed 81% in the background population down to 77% in the first sample, and deviating from an assumed 61% in the background population up to 70% in the second sample), you have to multiply the probabilities of these two unlikely events. The probability of the first deviation is about 19%, the probability of the second is about 13%, and so the probability of them both occurring is about 2.5%.
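The "statistical calculations" route can be sketched in a few lines of Python using exact binomial sums. This is my own illustration, not the script behind the figures quoted above, and the helper names (pmf, prob_at_most, prob_at_least) are invented for it. It reads "77% or less of 47" as 36 or fewer respondents and "70% or more of 54" as 38 or more, which matches the survey counts. Exact sums need not agree digit for digit with rounded or simulated figures.

```python
from math import comb

def pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_most(k, n, p):
    return sum(pmf(i, n, p) for i in range(0, k + 1))

def prob_at_least(k, n, p):
    return sum(pmf(i, n, p) for i in range(k, n + 1))

# Hypothesized population rates: 81% vs. 61% (the assumed 20-point gap).
low_side = prob_at_most(36, 47, 0.81)    # 36 or fewer of 47 offended
high_side = prob_at_least(38, 54, 0.61)  # 38 or more of 54 offended
joint = low_side * high_side             # the two samples are independent

# Concentration check: 80% trait, sample of 50, count between 34 and 46.
middle = sum(pmf(i, 50, 0.80) for i in range(34, 47))

print(f"first deviation:  {low_side:.1%}")
print(f"second deviation: {high_side:.1%}")
print(f"both together:    {joint:.1%}")
print(f"34..46 out of 50: {middle:.1%}")
```

Multiplying the two tails is only valid because each respondent saw one photo and the two groups were drawn independently.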
The reason I calculated the odds of getting 77% or less "offended" responses for the black woman's photo while also getting 70% or more "offended" responses for the white woman's photo, is that in calculating the "unlikeliness" of a statistical result, it's customary to calculate the odds of getting "this result or a more extreme one". For example, suppose you want to know if a company's hiring process is gender-balanced (assuming a 50/50 gender split in the population), and you notice that in a random sample of 100 recent hires, 61 were men. You wouldn't ask "What are the odds of there being exactly 61 men in this sample?", because the odds of getting any particular number are small. You'd ask, "What are the odds of getting this result or a more extreme one -- i.e. the odds of getting 61 or more men out of a random sample of 100, if the population were truly gender-balanced?" As this calculation tool shows, the odds are only about 1.7%.
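If the calculation tool is not at hand, the hiring example takes only a few lines of Python. This is a sketch under the same assumption as the text (each hire is independently a man with probability 0.5); the variable names are mine.

```python
from math import comb

n, observed = 100, 61

# With p = 0.5 every sequence of 100 hires is equally likely, so each
# probability is a count of sequences divided by 2**100.
exactly = comb(n, observed) / 2**n
or_more = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n

print(f"exactly {observed} men:  {exactly:.2%}")
print(f"{observed} or more men:  {or_more:.2%}")
```

The first number is small for every possible count, including the most likely one, which is why the "this result or a more extreme one" tail is the quantity worth reporting.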
Similarly, in the case of the two populations being measured, the author of the original editorial hypothesized that there was some significant gap between the percentages of the population that were offended by the two photos, which I arbitrarily assumed to be 20 percentage points. Under that assumption, showing the two pictures to two different groups and having them be offended at similar rates, is the unexpected, "extreme" result, and the closer the rates are to each other, the more extreme the result is. That's why I calculated "77% or less" for the first group vs. "70% or more" for the second group.
And out of the pairs of numbers that I tested which were separated by 20 percentage points, 81% and 61% were the numbers which made the given result the least unlikely. 80/60 and 79/59 give odds of about 2.5% and 2.4%; 82/62 and 83/63 give odds of 2.4% and 2.2%.
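The search over pairs can be automated. The sketch below is my own and assumes the same reading of the survey counts (36 or fewer of 47, 38 or more of 54). The scanned range of 70% to 95% is an arbitrary choice of mine, and an exact calculation may rank neighboring pairs slightly differently from a simulation.

```python
from math import comb

def pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def joint_probability(p_black, p_white):
    """P(36 or fewer of 47 offended) * P(38 or more of 54 offended)."""
    low = sum(pmf(i, 47, p_black) for i in range(0, 37))
    high = sum(pmf(i, 54, p_white) for i in range(38, 55))
    return low * high

# Scan whole-percent pairs separated by exactly 20 points.
results = {}
for pct in range(70, 96):
    results[pct] = joint_probability(pct / 100, (pct - 20) / 100)

best = max(results, key=results.get)
for pct in sorted(results):
    marker = "  <-- least unlikely" if pct == best else ""
    print(f"{pct}% / {pct - 20}%: {results[pct]:.2%}{marker}")
```

Whichever pair comes out on top, the point is that even the most favorable pair with a 20-point gap leaves the observed result unlikely.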
You can do the statistical calculations directly, but in case you won't believe it unless you see the results unfold with your own eyes, you can run this perl script, which iterates through a million trials of the experiment, counting the number of times that the unexpected result occurs.
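The Perl script itself is not reproduced here. As a rough stand-in, this is an independent Python sketch of the same kind of simulation, with the function name simulate, the seed and the trial count all chosen by me for illustration. Each trial draws 47 respondents who are offended with probability 0.81 and 54 who are offended with probability 0.61, then checks whether the "unexpected" result (36 or fewer in the first group, 38 or more in the second) occurred.

```python
import random

def simulate(trials, seed=None):
    """Fraction of trials in which the first sample shows 36 or fewer
    'offended' out of 47 AND the second shows 38 or more out of 54."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        black = sum(rng.random() < 0.81 for _ in range(47))
        white = sum(rng.random() < 0.61 for _ in range(54))
        if black <= 36 and white >= 38:
            hits += 1
    return hits / trials

# 20,000 trials runs in about a second; raise it to 1_000_000 to match
# the scale described in the text (expect a much longer run in pure Python).
rate = simulate(20_000, seed=1)
print(f"unexpected result occurred in {rate:.2%} of trials")
```

With a fixed seed the run is repeatable; with a different seed the estimate will wobble by a few tenths of a percentage point at this trial count.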
Why did I assume a 20-point gap? That was the most subjective leap that I made. Looking through the original editorial, I figured that on the basis of inflammatory statements like
"Only one woman was called 'adorable' by the media and portrayed with girlish innocence, and it wasn't the black one. It never is."
and
"The contrast in headlines is so stark, it deserves to be examined" [I assume here she meant the contrast in responses]
the author meant to imply a difference in people's attitudes that was at least that large. But the results suggest that it isn't.
For all of this effort, of course, I could have just expanded the original experiment to a sample of several hundred and mollified some people's concerns. But I wanted to argue for what you can show, even with small samples, because I would like to try (and would like others to try) similar experiments in the future, and do not think people should be discouraged if they can't afford to pay a thousand Amazon Mechanical Turk workers to take their survey. I paid my 100 respondents $0.25 each; naturally, one experiment I'd like to do soon is to figure out what's the lowest I can get away with paying them.
Slashdot is trying to move their user base from news for nerds and geeks to news for normals.
Seriously, I've noticed the Register getting more active as people move over there.
We, geeks, view this entire article as a bunch of shenanigans that waste our time. Please stop spitting in my face.
Give me an article about Intel's latest and greatest chipset plans or how AMD screwed the pooch or about how one can modify a blackberry to run android applications. Those things are Useful.
Infotainment designed to incite does not nor should enter my world, it makes my world more stressful and wastes my time.
He already wasted ten minutes of my life with his last episode of keyboard effluent, why should I waste my time with him anymore?
I guess it's kinda cool that you took over what used to be a major tech-news website and turned it into your personal blog.
Fried post.
No kidding. Who is Bennett, and why does he get to use /. as his blog? What happened to WP:NOTBLOG?
That way the rest of us don't have to hear about his bullshit.
General Relativity: Space-time tells matter where to go; Matter tells space-time what shape to be.
There's a really good book that talks about brevity and how to communicate your ideas more concisely with fewer words. I suggest Bennett read it.
"First they came for the slanderers and i said nothing."
I'm sorry, but this is getting absurd.
If Slashdot is going to be Bennett "aint I smug and pointless" Haselton's personal blog ....
Give us a STORY EXCLUSION for this clown.
I do not see value in Bennett and his shit, and I don't care.
But apparently at least samzenpus and timothy will post any of the shit this idiot writes.
Seriously, just fucking make it stop. Nobody here gives a shit about Bennett Haselton. So give us a fucking way to stop reading his crap.
Lost at C:>. Found at C.
It might be different if Bennett were a frequent poster and would be actively engaged in discussions, but he's not. He's just some guy who once heard that brevity is the soul of wit and went off to write ten thousand words explaining what it meant.
I am TheRaven on Soylent News
I don't read his long articles, generally speaking, but he has been an advocate against censorship and I respect that much.
No one makes anyone read the articles, and without even checking, I'd guess you can configure /. not to even show them.
The Haselton hate reminds me of the Jon Katz days, which is kind of amusing ;)
Slashdot by now has OBVIOUSLY seen how much we don't like this guy. The fact that they keep posting him means they're just trolling us, or going for pageviews, or both. Or maybe Bennett has some kind of deal with the site, or has something on one of the editors. Whatever. I don't care. From now on, NO ONE post any comments on one of his stories. Not even to say how much you hate his stories. This will be my last comment on one of his stories. Hope this takes!
Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
But we can still probably reject any contention that the proportion of such judges is very low. If only 10% of judges were rejecting motions without reading them, then there is only about a 1.4% chance of taking a random sample of six rejected motions and finding that in three or more cases, the judge did not read the motion.
But you DIDN'T HAVE A RANDOM SAMPLE. In particular, you had a sample from Washington State small claims court. So you can ONLY draw conclusions about Washington State small claims court. You have no idea what happens in New York, or in England. But that's only one example of how non-random your sample was. The problem is, ANY small sample is going to have non-random attributes, because it's a small sample. You can roll a dice three times and the results will appear highly non-random - no instances at all of some values - you have to roll it a hundred times to get a good distribution and the dice is random. If you start with a non-random dice - like your "sampling only from one court" or your "using Mechanical Turk" - your small sample size gives you results that are simply MEANINGLESS.
Go and study stats and stop posting this drivel on Slashdot where people might believe it.
Bennett, try this experiment.
Make a program that flips 54 coins and notes the number of heads and the number of tails at each round. Then run this program for one million rounds.
When you're done, note the number of rounds the random generator saw 38 or more heads and frame this as a proportion; ie - "the random generator reached this level X% of the time".
Then compare your results with the random generator. If your results are unlikely to come from the random generator, then perhaps you have something.
Now, "unlikely" is an arbitrary measure with no compelling foundation (it's the wrong measure to determine the significance of a result(*)), but in scientific circles we use a "rule of thumb": results are considered significant when they are less likely than 95% of the random results.
Even at this level, we expect 1-in-20 studies to be due to random chance, but then follow-on studies should confirm or deny the findings (and 1-in-20x20 of *those* will be due to random chance as well).
If the results might lead to potentially catastrophic decisions we might use a higher level of significance; for example, 99% confidence when deciding whether a drug is safe. Physics uses an insanely high level of confidence.
Try that and get back to us - we await your next post with bated breath.
(*) The correct measure is the number of bits saved by compressing the original data by factoring out the result (glossing over some details).
But why is Bennett's garbage being approved? I understand slashvertisements, because there is at least a monetary benefit to posting them. I also understand some pseudoscience occasionally slipping by, because the editor didn't read it carefully. But this crap? It is obvious shit from beginning to end. He has nothing to say. It is just completely pointless.
Bennett Haselton (born November 20, 1978) is a frequent commenter on the website Slashdot.org, where he is widely disliked by readers.
Slashdot is not your blog. Go away.
You want to see a more meaningful sample size? Look at the number of comments in Bennett's "submissions" that are complaining about this waste of time. Compare that to the number that actually gives a shit. /. now needs to post a whiny follow-up piece???
It was bad enough that the first story the other day had NOTHING whatsoever to do with news for nerds, nor was it well written, nor was it well conducted.
But
Few people care about this Miley Cyrus' opinion on things that do matter, and fewer still care about his opinion on all the crap that doesn't matter.
Breastfeeding pictures? Burning Man parking? Burning Man Ice distribution? How come 5th Amendment?
Fuck this clown.
Wikipedia is not a blog, but Slashdot is not Wikipedia. Plenty of newspapers and the like have in-house opinion columnists and other writers producing exclusive original content that distinguishes each publication from other AP/Reuters aggregators.
You can say that but you are wrong.
With a small, non-random sample you cannot say ANYTHING about anything.
Random is not the same as non-random.
A small sample size that is random is NOT THE SAME as a small sample size that is non-random.
Again, your sample was not random.
No matter how many times you try to imply/claim that it was random, it was not random.
As I said, I included the link to the perl script in the article, so that you don't have to take my word for it about the statistical calculations -- you can run one million trials of the experiment and verify that, under the posited hypothesis, a result similar to the one that occurred will only occur about 2.5% of the time. So the posited hypothesis is probably wrong.
Three minutes before posting this you were smacked down by a statistics prof posting as AC. I recommend you just apologize for having defended your small sample size with bad statistics, and hope people forget in a few years.