The Fallacy of Hard Tests


Posted by kdawson on Saturday June 16, 2007 @06:47PM from the do-the-math dept.

Al Feldzamen writes in with a blog post on the fallacious math behind many specialist examinations. "'The test was very hard,' the medical specialist said. 'Only 35 percent passed.' 'How did they grade it?' I asked. 'Multiple choice,' he said. 'They count the number right.' As a former mathematician, I immediately knew the test results were meaningless. It was typical of the very hard test, like bar exams or medical license exams, where very often the well-qualified and knowledgeable fail the exam. But that's because the exam itself is a fraud."

Re:Worthless by WFFS · 2007-06-16 19:01 · Score: 5, Funny

Ok... the test will be on... girls. Huh? What do you mean that isn't fair?
Re:Worthless by Derekloffin · 2007-06-16 19:09 · Score: 5, Insightful

Yeah, this is a pretty bloody poor analysis. If I know 2X as much (even assuming we could quantify it that easily), that doesn't automatically mean I get 2X the score on a test, and it certainly doesn't mean my guesses are as bad as those of the guy with 1/2 my knowledge. It depends heavily on what my knowledge is and what is covered by the test. The potential is even there for the guy with 1/2 my knowledge to beat me simply by getting lucky on what the test covers.
Just for an example, say we were doing a geography test on the states of the United States and their capitals. I know 1/2 of them, and another guy knows 1/4 of them. Now, each question is a simple 4-option multiple-choice question: State X, what is its capital? A, B, C, or D. The thing is, even for those I don't know, I can eliminate 1/2 of the potential answers (on average) because I know them, while the other guy, on average, can only eliminate 1/4 of them. So, I would get 50% on knowing the answers, and about 1/2 of the remaining on guesses. The other guy would get 1/4 on knowing them, and only 1/3 of the rest on guesses. And that's just the basic mathematical flaw in his reasoning.
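The arithmetic in this example can be checked with a minimal sketch. It assumes exactly the simplification used above: someone who knows a fraction of the capitals rules out that same fraction of the four options on a question he can't answer, then guesses uniformly among what is left. The name `expected_score` is illustrative only.

```python
from fractions import Fraction


def expected_score(known, options=4):
    """Expected fraction of questions answered correctly.

    known: fraction of capitals the test taker actually knows.
    Assumption (from the comment above): on an unknown question the
    test taker eliminates `known * options` choices and guesses
    uniformly among the remainder.
    """
    known = Fraction(known)
    remaining = options - known * options   # choices left after elimination
    guess_rate = 1 / remaining              # chance a guess is right
    return known + (1 - known) * guess_rate


print(expected_score(Fraction(1, 2)))  # 3/4
print(expected_score(Fraction(1, 4)))  # 1/2
```

Under these assumptions, knowing twice as much moves the expected score from 50% to 75%, not from 50% to 100%.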
My experience by Tim_UWA · 2007-06-16 19:26 · Score: 5, Interesting

I once had a test that had a check box for how confident you were that your answer was correct, which affected your score in the following way:

If you ticked "confident" and you were wrong, -2
If you ticked "confident" and you were right, +2
If you ticked "unsure" and you were wrong, -0
If you ticked "unsure" and you were right, +1

I guess the point is that it's advantageous to guess, but only if you choose the lesser-scoring option.
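A quick sketch of the expected points per question under this scheme, as a function of how likely you are to be right. The function name and the sample probabilities are made up for illustration; the point values come from the list above.

```python
from fractions import Fraction


def expected_points(p_correct, confident):
    """Expected points for one question under the scheme above.

    confident: +2 if right, -2 if wrong
    unsure:    +1 if right,  0 if wrong
    """
    p = Fraction(p_correct)
    if confident:
        return 2 * p - 2 * (1 - p)
    return p


# Sample probabilities are illustrative only.
for p in (Fraction(1, 4), Fraction(2, 3), Fraction(9, 10)):
    print(p, expected_points(p, True), expected_points(p, False))
# 1/4 -1 1/4
# 2/3 2/3 2/3
# 9/10 8/5 9/10
```

Ticking "confident" only pays off when you are right more than 2/3 of the time. A blind guess on a four-option question (1/4) is worth a little if marked "unsure" and costs you if marked "confident".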
Re:Worthless by nephyo · 2007-06-16 19:52 · Score: 5, Informative

His argument is that the harder the test, the less relevant knowledge of the actual answers to the questions posed on the test is to determining your relative score. As a result, on a very hard test, two test takers with vastly different levels of knowledge of the correct answers to the test questions do not on average end up with scores that reflect that difference.
The "educated guess" does not contradict that argument. Again, the harder the test, the smaller the difference will be between the number of potentially correct answers you can eliminate and the number that he can eliminate. With a sufficiently hard test, "educated guessing" makes no difference whatsoever.
So basically, with a multiple-choice test that counts only the correct answers, increasing the difficulty is not an effective means of making the test more likely to accurately filter out candidates with lesser knowledge of the subject matter it covers. Increasing the difficulty only increases the degree to which randomness has an impact on the results.
This is true, well known, and not very controversial. However, you would of course need to examine the specific tests in question to determine whether they are effective. They may have other features to help mitigate this effect. Also, his analysis is purely mathematical. It doesn't take into account the likelihood of a challenging test to create social pressure that influences people to self-filter. It could be argued that most of these tests are not testing the takers' knowledge of the material so much as they are testing the takers' ability to study and react to the pressure that the tests provide.
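A rough sketch of the first point, under a deliberately simple assumption that is mine and not the article's exact setup: every question a test taker doesn't know is answered by a blind guess among four options, with no elimination. It compares one candidate with another who knows exactly twice as many answers.

```python
from fractions import Fraction


def expected_fraction_right(known, options=4):
    """Expected score when every unknown question is a blind guess."""
    known = Fraction(known)
    return known + (1 - known) / options


# The smaller `known` is, the harder the test is for both candidates.
for known in (Fraction(2, 5), Fraction(1, 5), Fraction(1, 10), Fraction(1, 20)):
    weaker = expected_fraction_right(known)
    stronger = expected_fraction_right(2 * known)
    ratio = float(stronger / weaker)
    print(f"knows {known} vs {2 * known}: score ratio {ratio:.2f}")
```

The stronger candidate always knows twice as much, but the ratio of expected scores slides toward 1 as the test gets harder, because guessed points make up a growing share of both scores.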

--
I grant all that I write to the public domain.
I find Mr. Feldzamen's post hard to believe. by mbstone · 2007-06-16 20:57 · Score: 5, Interesting

Mr. Feldzamen claims to have passed the Virginia bar exam, but I can't find any evidence he was ever admitted to the Virginia bar, or to any state bar (he's not in Martindale-Hubbell). He cites the Virginia bar exam -- which I also passed (IAAL, licensed to practice in CA and VA) -- as one of his examples of a "complete fraud." In fact, when I took the Virginia bar exam it had over a dozen one-hour essay components, testing each and every possible subject. By contrast, the California bar exam had essay tests covering six randomly chosen subjects out of a possible 15 or so, and it had other non-multiple-choice components. The multiple-choice section of every state's bar exam, the Multistate Bar Exam, is no walk in the park. So I don't understand how he includes bar exams in his claim that the tests are invalid. If anything, the low pass rate of bar exams, typically 50% or less among a candidate pool of mostly recent law school grads, suggests that they are very hard indeed.
Re:Worthless by loganrapp · 2007-06-16 21:14 · Score: 5, Funny

A test on girls isn't fair because no matter what answer you give, it'll be wrong.
Re:Worthless by Derekloffin · 2007-06-16 21:27 · Score: 5, Interesting

The "educated guess" does not contradict that argument. Again, the harder the test, the smaller the difference will be between the number of potentially correct answers you can eliminate and the number that he can eliminate. With a sufficiently hard test, "educated guessing" makes no difference whatsoever.
Actually, the problem here is that his example is a total worst-case scenario and doesn't tell us what the 'Pass' level is. The tests mentioned are not relative knowledge tests, they are pass/fail tests. In other words, I don't care how much Joe knows compared to Bill; all I care about is whether Joe demonstrates the necessary level of knowledge to pass. In that case, assuming the test maker has the slightest clue, the pass mark in the example would likely be at 75%+ (you need about 1/2 right legit, and 1/2 of the remaining right on guesses or better), meaning that its difficulty is fine, as it has correctly blocked both people because they didn't show the necessary level of knowledge.
He might have a point IF he qualified this to scaled result tests (i.e., the top X people will pass regardless of their scores, only relative position counts), but he didn't. But even in that case he'd have to analyze the distribution of all testees, not just 2. Once again, his math does work and doesn't support the argument.
Re:Worthless by pionzypher · 2007-06-16 22:25 · Score: 5, Funny

No. I think what he was trying to say was that no matter what, you'd never score with a girl. ;)

--
I'll believe in corporations having personhood when Texas executes one... - advocate_one
Re:Worthless by clifyt · 2007-06-16 23:26 · Score: 5, Interesting

In a well-designed exam, the 'educated guess' is just as much a part of the design as anything else. You *HAVE* to have questions that have answers somewhat similar, or you make it way too easy to guess the answer by way of elimination. At the same time, we want enough questions where one can eliminate one or more answers immediately.

For instance, I'm a testing person, but not a content person (i.e., I design towards what the stats tell me, as well as the actual wording and structure of the exam...I always work with someone who understands the content areas from a very advanced level and can deal with that end). One of the last MC exams I was helping validate, I knew NOTHING about the content -- it was a medical exam. First thing I did was go through the entire exam, read all the questions quickly, and see if logic could remove any of the answers. Statistically, I would have gotten a 20% by random means, but in this case, I received somewhere around 43% (if I remember correctly). The educated guess is a BIG part of these things...you aren't just measuring content knowledge, but application, and that means if someone can raise the bar, they might actually do well in the real world. If I had a doctor who had never seen a case like mine, and it defied traditional practice, I think I'd be more impressed with the man that got 40% purely on logic than the guy that got the 40% based around actually knowing something about the problem (and actually, I had a team of doctors several years ago like this...I sat around trying to figure out how I was going to die for a couple of months while one doctor who had seen problems like mine couldn't figure out what the cause was, while the one that wasn't an expert in the field methodically ruled out what wasn't the cause, and ended up finding me a specialist, something the first doctor SHOULD have been able to do because his field encompassed a hell of a lot more of the specialty than my general physician's 'specialty').

And it kinda depends on the type of test and what you are measuring. When designing these things, you ask a lot of questions based around the type of assessment one is looking for. And you design accordingly. By correlating my exams with others that have some sense of validity, I can see the levels of the testees before they take the new one. This in itself will show you quite a few things about the design of the new exam. For instance, we can tell certain questions might have 50% of the folks answering correctly, but which 50%? On the original test, you have two groups take the exam, novices and experts (I'm heavily simplifying this for /.). If the experts get the question wrong, while the novices get it right -- the question is struck. Someone with little experience in test design may look at the question and wonder what's wrong -- the answer is correct and all of your colleagues agree -- but in some way it is wrongly worded. So again, it is either struck, or restructured to be inserted for calibration and validation at a later point (on a large exam like the Bar the author had derided, a good chunk of the questions are probably not scored and are only there to see how well they work and if they can be put into the next exam).
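A toy sketch of the novice/expert check described above. The item names and proportions are hypothetical, and real item analysis uses proper discrimination statistics and not a bare comparison like this one.

```python
def flag_items(expert_rate, novice_rate):
    """Return the items on which novices outperform experts.

    Both arguments map an item id to the proportion of that group
    answering the item correctly.
    """
    return [
        item
        for item in expert_rate
        if novice_rate[item] > expert_rate[item]
    ]


# Hypothetical proportions correct, for illustration only.
expert_rate = {"Q1": 0.90, "Q2": 0.35, "Q3": 0.70}
novice_rate = {"Q1": 0.40, "Q2": 0.65, "Q3": 0.50}

print(flag_items(expert_rate, novice_rate))  # ['Q2']
```

Q2 is the kind of item that would be struck or reworded: the answer key may be right, but something in the wording is tripping up the people who know the material.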

Beyond that, you have panels of experts who go over questions. Have them all vote on things like the difficulty of the item, the appropriateness for the exam. Things like that. Folks like me will take these and sort the items into usable or unusable stacks, rewrite them (again with experts), and then sort X amount of the lower difficulty, Y of the medium, Z of the hard (the easy questions are there to give motivation...it's amazing how much better someone will do if they get positive reinforcement in that they KNOW these questions...it will prime the neural pathways to hopefully give more routes to specific knowledge in order to get the reward...I can feel the endorphin rush when I'm doing poorly but then get a win every now and then and it helps). And finally, one analyzes everything to see h
Re:Worthless by Jaidan · 2007-06-16 23:32 · Score: 5, Interesting
No, just no. The point of the tests is to determine who is over a particular threshold of knowledge and who isn't. The method being called a fraud fails to accurately do that. Since randomness has a proven substantial impact on those tests, that threshold becomes blurred. To make matters worse, the harder the test, the MORE randomness affects the score. As a result the test results are meaningless at any scale. His examples were simplified to illustrate the essential math behind them; he does not need more than 2 people to compare since the math is equally applicable no matter how many are tested. He also does not need to set a scale because the math is equally applicable to any bar you might set.

The point of the article was to illustrate that these hard tests are meant to establish a minimum required level of knowledge; however, due to the nature of counting only correct answers, randomness incurs a great penalty to the accuracy of the attempted measurement of knowledge. He is suggesting, and rightly so, that a test that instead incurs an effective net effect of 0 from guessing would much more accurately measure the knowledge of the participants by reducing the effects of guessing to nearly 0.
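A small sketch of what "an effective net effect of 0" works out to. It assumes the usual "formula scoring" correction, where each wrong answer on a k-option question costs 1/(k-1) of a point; the comment doesn't say which penalty is meant, so that choice is an assumption, as are the function names.

```python
from fractions import Fraction


def blind_guess_value(options, penalty):
    """Expected points from one uniformly random guess.

    A right answer earns 1 point; a wrong answer costs `penalty` points.
    """
    p_right = Fraction(1, options)
    return p_right * 1 - (1 - p_right) * penalty


def zero_gain_penalty(options):
    """Penalty per wrong answer that makes a blind guess worth 0 on average."""
    return Fraction(1, options - 1)


for k in (2, 4, 5):
    penalty = zero_gain_penalty(k)
    print(k, penalty, blind_guess_value(k, 0), blind_guess_value(k, penalty))
# 2 1 1/2 0
# 4 1/3 1/4 0
# 5 1/4 1/5 0
```

With no penalty a blind guess is worth 1/k of a point on average; with the 1/(k-1) penalty its expected value is exactly 0. This only covers the average, and says nothing by itself about how much scores spread around that average.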
What this really comes down to is accuracy and precision. We assume that a test score can be equated to a measurement of knowledge, and for your benefit (it's completely irrelevant) we'll assume that a passing test is 60%.
- We give 1 person 5 different tests. We allow for random guessing with no penalty, and the test is very hard. He takes them all and scores wildly differently, but averages 65% across all of the tests. If I were to know for a fact that the person in question does indeed deserve to score a 65% then we can say the test was very accurate, but low in precision. On any given test the subject may have passed or failed depending on his luck with guessing.
- We now give the same person 5 new tests. This time we mostly remove randomness by penalizing wrong answers by an amount that results in an effective gain of 0 for random guessing. This time, when he takes the tests, all his scores are within a few points of each other, and in fact he averages 65% again. In this case the test is highly accurate and is also high precision. On any given test the subject most likely would pass.
The article's math indeed illustrates this point very clearly. The unspoken point is that in tests such as these, designed to set standards to be met, it is a fraud to use a test with low accuracy at measuring actual knowledge. The precision gained by penalizing guessing allows the test to be much more fair in its administration.
http://en.wikipedia.org/wiki/Accuracy
--
Mobius Custom Computers
He is totally and completely wrong. by kklein · 2007-06-17 00:12 · Score: 5, Interesting

Ugh. I just wrote a pretty polite reply at his page after skimming his idiotic article. Now that I've read it, I'm actually angry.
This guy knows NOTHING about testing. Nothing. He isn't even to the level of Classical Test Theory (CTT), which is really not much more than means and Pearson correlations, and is nowhere near how high-stakes (and even medium- and low-stakes, increasingly) multiple choice (MC) tests work now, and how they have worked for many, many years.
IAAP (I am a psychometrician). A big part of what I do for a living is design a particular MC test, pilot the items, and interpret the results. But I don't just count up the correct items and give you the percentage. Why? Because that would be insane. You can guess on those.
Oh, but he says this:

But suppose the grading attempts to adjust for guessing. There is no way of knowing what is in the mind of the test-taker, so the customary is to subtract, from the number correct, some fraction of the number wrong.
--Which is just fine until I tell you I have NEVER heard of dealing with guessing that way on a professional-level test.
As a general rule, we don't do any easy mathematics. At all.
Here is part of the output for a test I'm working on right now:

Seq  Item   Type  Location  SE     FitResid  DF      ChiSq   DF  Prob
35   I0035  Poly   0.685    0.089   2.239    525.69  15.636  8   0.05
36   I0036  Poly  -1.946    0.165  -0.587    525.69   6.754  8   0.56
37   I0037  Poly   0.02     0.093   2.603    525.69  12.704  8   0.12

This is generated by RUMM2020, a tool for Rasch analysis. The Rasch model was developed in the 60s as an ideal model of item response. These are the stats on 3 items of this test. The two most important columns are Location and Probability.
The location is the item difficulty. Given the sample's performance on this item, and given their ability, how hard is this item? Item 35 is quite difficult; item 36, quite easy.
The probability is the p value for the chi square. Basically, if it's 0.05 or below, that item is operating significantly (statistically significantly, that is) outside of the model. It displays poor "fit." We generally toss these items before going on to the next step (ideally, these are weeded out during pilot testing, before the test goes live--in this case, it is an experimental test of a construct I'm not even sure exists anymore, but I digress). If an item has poor fit with the model, it is too much of a loose cannon, and its results cannot be trusted. This is what the benighted blogger (is there any other kind?) was whining about. That item is hard not because it is good, but because it is evidently stupid. The responses are all over the place, which means people were probably just guessing. Out it goes before it ruins any examinees' lives.
The next step is to get person locations. In the case of people, these numbers indicate the person's ability. This is calculated by looking at their performance on the items, given their difficulty (Which is calculated based on people's performance on them! Incestuous! But given a large enough sample, it all works out to a fine enough grain to be useful). Here is the output for the people:

ID  Total  Max  Miss  Locn   SE    Residual  DegFree  DataPts
1   67     125  125   0.254  0.21  -0.272    123.60   125
2   77     125  125   0.700  0.21  -0.178    123.60   125
3   86     125  125   1.120  0.22  -1.030    123.60   125

So, the first person didn't do so hot; the last did pretty well (these usually top out at 3ish). As you can see in "DataPts," there were 125 items on this test. I started with 160. Do you hear that, Mr. Unexpected "Truths?" We have your back! We're not just handing you a naked score based on our crap items. WE PULL THE CRAP ITEMS.
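To show how person and item locations fit together, here is a minimal sketch of the dichotomous Rasch model, where the chance of a correct response depends only on the person location minus the item location (both in logits). The locations are copied from the two outputs above. The items there are typed "Poly", so treating them as right/wrong items is a simplification, and `rasch_probability` is an illustrative name, not anything from RUMM2020.

```python
import math


def rasch_probability(person_location, item_location):
    """P(correct) under the dichotomous Rasch model.

    P = exp(theta - b) / (1 + exp(theta - b)), written in the
    equivalent logistic form.
    """
    return 1 / (1 + math.exp(-(person_location - item_location)))


# Locations taken from the item and person outputs in the comment.
items = {"I0035": 0.685, "I0036": -1.946, "I0037": 0.02}
people = {1: 0.254, 2: 0.700, 3: 1.120}

for person_id, ability in people.items():
    row = ", ".join(
        f"{item}: {rasch_probability(ability, difficulty):.2f}"
        for item, difficulty in items.items()
    )
    print(f"person {person_id} -> {row}")
```

When a person's location equals an item's location the probability is exactly 0.5; an easy item like I0036 comes out well above that for all three people, a hard one like I0035 much closer to a coin flip.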
That location score will usually be rescaled to something prettier, since no one would really like to see something like
Re:Worthless by ryanov · 2007-06-17 01:47 · Score: 5, Interesting

I have a lot of experience with this lately, having come down with an odd virus that had no treatment but was/is excruciatingly painful. There may be no treatment available, but I wager the vast majority of these folks who go to a doctor but have nothing wrong with them DO have some symptom or another... for me, getting the symptom treated is almost as important as having the cause treated, as I probably wouldn't have gotten out of my chair without it. One doctor recently seemed much more concerned with the cause and the symptom was nearly an afterthought -- as a result, I was in a lot of pain for 24 hours with no way to fix it. He saw the antibiotic as more important (though it ultimately turned out not to be bacterial), but I saw something for the pain as something that should have happened immediately.

Another thing -- most people want to feel like the doctor at least LOOKED for something. One doctor I went to recently made me wait 40 mins to see him and then looked at me for like 30 seconds and prescribed something. Yes, that makes sense if you know what it is straight off and know what to do about it, but you might just wanna look for other things that I /didn't/ mention, in case I have more than one thing or in case there are different diagnoses that have similar symptoms except for a couple.