The rightly esteemed statistician R.A. Fisher would agree with you, I think. He was pretty firmly against any statistical analysis that was not based on a controlled experimental design.
However, by this argument, it is impossible to prove most real-world effects using statistics. The best example of this is the link between smoking and lung cancer. Even after a preponderance of evidence from multiple studies accumulated, Fisher refused to acknowledge a link, because each study had some methodological flaw which allowed some alternative possible explanation for the correlation. There's a description of the debate in the excellent (and very readable) book "The Lady Tasting Tea" by David Salsburg, in Chapter 18.
Essentially, the problem here is that there /could/ be bias because Mechanical Turk participants differ from the general population in some way. On the other hand, there's no particular reason to think that any such bias would push the results in a particular direction. If the effect here is significant and strong, we would want to repeat this with other populations to confirm the lack of a bias...but the possibility of a bias will pretty much always exist, no matter how many groups of people we study. As we study more populations, though, it becomes less likely that the same bias explains the result in every one of them.
Now, I don't mean to say that the author's statistical argument is correct. In particular, I think that he shows a tendency toward confirmation bias (which we all fall prey to from time to time): focusing on the evidence which supports the hypothesis we already believe. Here's my analysis, from the numbers provided by the author:
I've posted a quick barplot to illustrate. You can see that other than the "Excellent" category, the tendency is for higher self-reported math ability to actually correspond to a higher tendency to say that yes, the law was violated.
A quick and dirty statistical analysis assumes as its null hypothesis that all four categories (no respondents self-reported as "Poor") are the same: they each have the same probability of voting "Yes" for whether the law was violated. A standard maximum-likelihood estimate for this probability pools all the responses, arriving at about a 67% chance of saying "Yes". This estimate does have some variance (it's an estimate, not the truth), but we'll ignore that for the moment and assume it's the correct percentage. Now the question is, if this is the case, what is the probability of getting the 44% (12/27) in the "Excellent" category? If we use a binomial distribution, the chance of getting 12 or fewer "Yes" responses out of 27 is about 0.013, which seems to have been the source of the author's conclusion that this result is about 1% likely to happen by chance. However, we should also consider that we have four categories; whenever you run multiple tests, you have a better chance of getting a "significant" result just by chance. A standard significance cutoff level is 0.05, reflecting that if something has less than a 5% probability of happening by chance, this is evidence that something other than chance is at work. Since we are running 4 tests, we use a Bonferroni correction, dividing 0.05 by 4, to get 0.0125 as our actual cutoff. You can see, however, that 0.013 is higher than 0.0125, so once we correct for multiple tests it actually does not (quite) meet the 5% significance level.
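For anyone who wants to check the arithmetic, here is a minimal sketch of that binomial tail and the Bonferroni cutoff in Python. It assumes the pooled "Yes" rate is the rounded 0.67 quoted above (the exact pooled counts would shift the result slightly), and the function and variable names are mine, not the author's.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Assumption: the pooled "Yes" rate, rounded to the "about 67%" quoted above.
pooled_yes_rate = 0.67

# From the comment: 12 "Yes" responses out of 27 in the "Excellent" category.
n_excellent = 27
yes_excellent = 12

# Lower-tail probability of seeing 12 or fewer "Yes" under the null.
p_value = binom_cdf(yes_excellent, n_excellent, pooled_yes_rate)

# Bonferroni correction: split the 0.05 level across the 4 category tests.
alpha = 0.05
n_tests = 4
bonferroni_cutoff = alpha / n_tests

significant = p_value < bonferroni_cutoff

print(f"P(X <= {yes_excellent}) = {p_value:.4f}")
print(f"Bonferroni cutoff = {bonferroni_cutoff:.4f}")
print(f"Significant after correction: {significant}")
```

The tail is summed directly with `math.comb` so the sketch needs nothing outside the standard library; a statistics package's binomial CDF would do the same job.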
As a final note for all the real statisticians out there, it's clear there are still a number of issues with the above analysis (such as the fact that we ignore the dependence between the categories, for example). A chi-squared test provides a value of 8.06, with a p-value around 0.045; however, this is caused about equally by the low "yes" result for "Excellent" and the high "yes" result for "Very good"; that difference, unfortunately, allows for a variety of expl