Slashdot Mirror


Big Talk About Small Samples

Bennett Haselton writes: My last article garnered some objections from readers saying that the sample sizes were too small to draw meaningful conclusions. (36 out of 47 survey-takers, or 77%, said that a picture of a black woman breast-feeding was inappropriate; while in a different group, 38 out of 54 survey-takers, or 70%, said that a picture of a white woman breast-feeding was inappropriate in the same context.) My conclusion was that, even on the basis of a relatively small sample, the evidence was strongly against a "huge" gap in the rates at which the surveyed population would consider the two pictures to be inappropriate. I stand by that, but it's worth presenting the math to support that conclusion, because I think the surveys are valuable tools when you understand what you can and cannot demonstrate with a small sample. (Basically, a small sample can present only weak evidence as to what the population average is, but you can confidently demonstrate what it is not.) Keep reading to see what Bennett has to say.

The smallest sample I've ever used to make an argument was when I submitted some legal briefs, each no longer than five pages, in the anti-spam cases that I'd been filing in Washington State small claims court. Since I suspected the judges were not taking the cases seriously, I filed the briefs with the third and fourth pages stuck together in the center, by a tiny thread of paper joining the back of the third page to the front of the fourth page. (If someone were to turn the pages and actually readthe brief, the thread would break.) I did something similar in six different cases, and when the motions were all rejected, I went to the courthouse to look at the paper motions still in the file. In three out of six cases, the judge had rejected the motion without reading it first.

Now, the point was not to make any accurate estimation of the actual proportion, in the total population of small claims court judges, who would reject a brief in an anti-spam case without reading it. There's no basis for saying that the proportion of such judges is close to 50%. But we can still probably reject any contention that the proportion of such judges is very low. If only 10% of judges were rejecting motions without reading them, then there is only about a 1.4% chance of taking a random sample of six rejected motions and finding that in three or more cases, the judge did not read the motion. Even if 20% of judges were doing so, for an event with a probability of p=0.20 you would still only see it occur in three out of six cases, about 8.2% of the time. (If an event has probability p, the exact probability of that event occuring three or more times in six trials is given by 20*(p^3)*((1-p)^3) + 15*(p^4)*((1-p)^2) + 6*(p^5)*((1-p)^1) + 1*(p^6)*((1-p)^0).) So we can say that the proportion of such judges is quite probably more than 20%. I did this repeatedly because even after I had "caught" the first judge, I wanted to head off any objection that this was just an isolated case of rare behavior.

And, as always, it's important not to generalize too much about the behavior whose probability we're estimating. I don't think that 20% or more of judges, even in small claims court, are throwing most types of cases without reading or listening to the arguments. My impression was that most judges see view small claims court as a place to redress injustices, and that they see anti-spam and anti-telemarketer plaintiffs as just trying to "make money" at it, so they take those suits less seriously. I disagreed with this stance because (1) anti-spam plaintiffs usually really have been harmed and are not just "whining about one email" which they are trying to "cash in" (I still get so much spam that it interferes somewhat with the operation of my server and with my ability to get through my daily email); and (2) the law is intended after all as a deterrent, with disproportionate damages in order to discourage spammers from spamming in the first place. However, the charitable reading of the results is to assume that judges are merely biased against anti-spam plaintiffs -- but at least they probably don't treat all cases as casually as they treat anti-spam suits!

Back to the issue of small samples. My previous article was prompted by an editorial about the online response that had been elicited by two different photos -- one showing a black woman breastfeeding, and a nearly identical photo showing a white woman breastfeeding. The author asserted that the photos had received vastly different responses, which she attributed to racism. I presented a survey to a sample if users recruited from Amazon's Mechanical Turk, randomly showed each survey-taker one of the two photos, and asked:

Our academic department has asked everyone to submit a "fun" photo of themselves, so that our photos can be displayed together on the department home page. One of our employees submitted a photo that has caused some internal debate about whether the photo is inappropriate. I wanted to do a poll to get the opinion of a random sample of Internet users of different backgrounds.

Do you think this is an appropriate picture to be used in a photo collection on our academic department home page?

Out of 47 respondents who saw the black woman's photo, 36 of them (77%) said it was inappropriate. Out of 54 respondents who saw the white woman's photo, 38 of them (70%) said it was inappropriate.

As before, these samples are to small to say precisely what the relevant proportions in the background populations are, but we can probably reject certain statements about the populations -- for example, that the percentage of users offended by the black woman's photo is 20 percentage points higher than the percentage of users offended by the white woman's photo. This is where the counterintuitive part comes in. Suppose that in the background population, 81% of respondents would find the black woman's photo offensive, but only 61% would be offended by the white woman's photo. What are the odds of getting 77% or less "yes that's offensive" responses from a sample of 47 users shown the black woman's photo, and getting 70% or more "yes that's offensive" responses from a sample of 54 users shown the white woman's photo? It doesn't sound unlikely at all, because the percentages are quite close to the originals -- but you can verify, either with statistical calculations or with a quickly written computer program, that the odds are only about 2.5%.

Two main factors contribute to this counterintuitive result. First, even with a sample size of a few dozen, the frequency of an event starts to tend very closely to the frequency in the background population (if 80% of your population has some trait, and you take a sample of size 50, there's about a 95% chance that the number with that trait in your population will be between 34 and 46). Second, to find the odds of seeing both of these deviations at the same time (deviating from an assumed 81% in the background population down to 77% in the first sample, and deviating from an assumed 61% in the background population up to 70% in the second sample), you have to mutiply the probabilities of these two unlikely events. The probability of the first deviation is about 19%, the probability of the second is about 13%, and so the probability of them both occurring is about 2.5%.

The reason I calculated the odds of getting 77% or less "offended" responses for the black woman's photo while also getting 70% or more "offended" responses for the white woman's photo, is that in calculating the "unlikeliness" of a statistical result, it's customary to calculate the odds of getting "this result or a more extreme one". For example, suppose you want to know if a company's hiring process is gender-balanced (assuming a 50/50 gender split in the population), and you notice that in a random sample of 100 recent hires, 61 were men. You wouldn't ask "What are the odds of there being exactly 61 men in this sample?", because the odds of getting any particular number, are small. You'd ask, "What are the odds of getting this result or a more extreme one -- i.e. the odds of getting 61 or more men out of a random sample of 100, if the population were truly gender-balanced? As this calculation tool shows, the odds are only about 1.7%.

Similarly, in the case of the two populations being measured, the author of the original editorial hypothesized that there was some significant gap between the percentages of the population that were offended by the two photos, which I arbitrarily assumed to be 20 percentage points. Under that assumption, showing the two pictures to two different groups and having them be offended at similar rates, is the unexpected, "extreme" result, and the closer the rates are to each other, the more extreme the result is. That's why I calculated "77% of less" for the first group vs. "70% or more" for the second group.

And out of the pairs of numbers that I tested which were separated by 20 percentage points, 81% and 61% were the numbers which made the given result the least unlikely. 80/60 and 79/59 give odds of about 2.5% and 2.4%; 82/62 and 83/63 give odds of 2.4% and 2.2%.

You can do the statistical calculations directly, but in case you won't believe it unless you see the results unfold with your own eyes, you can run this perl script, which iterates through a million trials of the experiment, counting the number of times that the unexpected result occurs.

Why did I assume a 20-point gap? That was the most subjective leap that I made. Looking through the original editorial, I figured that on the basis of inflammatory statements like

"Only one woman was called 'adorable' by the media and portrayed with girlish innocence, and it wasn't the black one. It never is."

and

"The contrast in headlines is so stark, it deserves to be examined" [I assume here she meant the contrast in responses]

the author meant to imply a difference in people's attitudes that was at least that large. But the results suggest that it isn't.

For all of this effort, of course, I could have just expanded the original experiment to a sample of several hundred and mollified some people's concerns. But I wanted to argue for what you can show, even with small samples, because I would like to try (and would like others to try) similar experiments in the future, and do not think people should be discouraged if they can't afford to pay a thousand Amazon Mechanical Turk workers to take their survey. I paid my 100 respondents $0.25 each; naturally, one experiment I'd like to do soon is to figure out what's the lowest I can get away with paying them.

5 of 246 comments (clear)

  1. It's a straightforward proportions test! by Anonymous Coward · · Score: 2, Informative

    Bennett, what the hell are you doing? This is a straightforward difference between proportions test. In R it's simply
    > prop.test(c(36,38),c(47,54))
    which gives a 95% confidence interval for the difference as (-0.13 to 0.25), meaning loosely that we wouldn't be terribly surprised if the white women were thought to be inappropriate at 13 percentage points higher or if the black women were thought to be inappropriate at 25 percentage points higher.

  2. Re:I am not reading that. by daw · · Score: 4, Informative

    Haselton needs an article about basic statistics. The 95% confidence interval on the difference between the two proportions is 6% +/- 17%, i.e. the range from about -11% to +23%. This (a) demonstrates that the sample is indeed underpowered to distinguish the sort of effect sizes that Haselton appears to be interested in, and (b) demonstrates that a +20% difference in proportion, contrary to Haselton's assertion, absolutely falls within the range of true values that can't be ruled out at a standard level of statistical confidence given the outcome of this experiment.

    see http://www.kean.edu/~fosborne/bstat/06d2pop.html for the basic statistics.

  3. Re:tldr by Anonymous Coward · · Score: 2, Informative

    What's an example of one?

    Your second sentence is grammatically incorrect -- "I could spot" instead of "I spotted". ("I could spot" makes no sense, because it sounds as if you were able to spot a certain number of mistakes, but of course you could only know that if you actually did spot them.) It's an understandable error for a non-native-speaker, but it's the kind of thing you should watch out for when you are complaining about other people's grammar...

    WTF Bennet??

    I can't see any claim by the parent of being grammatically perfect.

    The parent doesn't have to be grammatically perfect in order to highlight the many, many mistakes that you have made.

    "I could spot" is perfectly valid. If you can't understand this formation, and how it is different from "I spotted", then there really is something wrong with you.

  4. Re:Look, give us an exclusion. by jones_supa · · Score: 3, Informative

    Bennett is just the latest incarnation of Katz and that other guy before him who I've thankfully forgotten the name of.

    Roland Piquepaille.

  5. Re:Still no, sorry by Aighearach · · Score: 3, Informative

    You're not helping yourself here, or even defending yourself. You're tripping over your own feet, and making a total clown of yourself.

    Small samples have no statistical value for real reasons. Get over yourself and stop thinking you invented a new math that allows small samples to have extra meaning via some clever asshat trick. They don't have value. They don't tell you anything at all about the background population. Small samples are not instructive. Look up the AC statistics professor above who tries to teach you the math that you totally mangled.