The Fallacy of Hard Tests
Al Feldzamen writes in with a blog post on the fallacious math behind many specialist examinations. "'The test was very hard,' the medical specialist said. 'Only 35 percent passed.' 'How did they grade it?' I asked. 'Multiple choice,' he said. 'They count the number right.' As a former mathematician, I immediately knew the test results were meaningless. It was typical of the very hard test, like bar exams or medical license exams, where very often the well-qualified and knowledgeable fail the exam. But that's because the exam itself is a fraud."
What a worthless post. He gave one situation where guessing is more important than knowledge, but didn't at all address the specifics of the tests he was talking about. A typical vapid blog that for some reason gets posted to /.
Stories like this could never get on Slashdot. Seriously, this is like a maths problem I'd give to my Year 9 kids. This is definitely not news, and certainly doesn't matter.
If you have 100 questions, and 20 right ones and 20 wrong ones, it leaves 60 unanswered questions.
That's why the article talks about only counting right ones. In order to avoid guessing, there should be a difference between picking a wrong answer and not picking an answer at all.
As a medical student, I know how much our education is divided into what we do in real life and what the proper answer is for exams. Quite often, during our education exercises, we're given scenarios like "A patient presents with symptoms X, Y and Z. What do you do next?" That's when the resident says "You would diagnose condition A from those symptoms, but for the exam, you'd say you'd get an MRI to rule out B". So many questions come down to having an intuition for where the question is guiding you to, rather than practical medicine. Often, it's extremely difficult to discern what the question wants.

There will be some question along the lines of "A patient presents with general fatigue over the past 3 months, which one blood test do you want to order?" and you'll narrow down the answer choices to either thyroid stimulating hormone or a complete blood count. Both studies are equally important in the evaluation of fatigue, but the question wants you to know which one is more important. In real life, you would always get both, because both conditions are fairly common and you want to evaluate both at once to save the patient time and effort. However, the question will nail you if you don't know some obscure study which states that there is like a 1% difference in the incidence of hypothyroidism vs anemia in fatigue. Moreover, if you were on the hospital floor and you were to say "I'm getting only a CBC, because it's more likely," the resident would chide you for not considering hypothyroidism and getting the thyroid stimulating hormone too, making you look bad. So yeah, learning for the test doesn't really ever end.
If anything, testing has become FAR FAR too easy. People pass CS courses and come out the other side with only a vague notion of how a computer works.
I won't claim his post is correct or not, but he claims the technology behind such tests is wrong and lets less educated people pass through with guessing, while more educated people try to pass without guessing and fail.
People see the tests produce poor selection, and make the tests harder and harder in an attempt to remedy this (but it won't work, since it's the technology of the test that's wrong).
Then you come here and support his opinion 1:1 by claiming tests are too easy (i.e. should be harder) and idiots pass through.
Ironic, isn't it.
You're missing the point. Counting only correct answers on a multi-choice test doesn't measure what you know, or whether you have the necessary minimum knowledge.
With 4 choices for each question on a 100 question test, the average student (student A) who knows 50% of the answers will get about 62 correct on average if they guess entirely at random when they don't know the answer (50 plus 50/4 correct guesses). The average student who knows only 25% of the material (student B) will get about 44 correct on average using the same approach (25 plus 75/4). Although A knows twice as much as B, A's score is only about 40% better (not 100%).
Of course, it's even worse than this. First, because there is a large degree of scatter: a student choosing at random might do much better or much worse than this. Second, because multi-choice questions are often structured so that half of the possible answers are obviously incorrect, which changes the odds.
With only two plausible answers to choose between, A might get 75 correct and B might get 63: in this case A, who knows twice as much as B, gets a score only 19% better than B.
If points are subtracted for incorrect answers (say -1/4 pt to -1/2 for each one wrong), the effect of guesses can be taken out of the equation so that differences in scores actually reflect differences in knowledge. Or if the questions are easier, a smaller proportion of both students' answers will be guesses, so the effect should be smaller.
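For anyone who wants to check the numbers, here is a minimal sketch of the expected-score arithmetic. The function name and the exact penalty of 1/3 point are my own assumptions for illustration (1/(choices - 1) is the value at which a blind guess is worth zero on average; it sits inside the -1/4 to -1/2 range mentioned above).

```python
from fractions import Fraction

def expected_score(known, total=100, choices=4, penalty=Fraction(0)):
    """Average score when every unknown question is guessed uniformly at random.

    `penalty` is the number of points subtracted per wrong answer
    (hypothetical parameter, 0 means "count the number right").
    """
    guessed = total - known
    right_guesses = Fraction(guessed, choices)   # average lucky guesses
    wrong_guesses = guessed - right_guesses      # average unlucky guesses
    return known + right_guesses - penalty * wrong_guesses

# Count-the-number-right grading, 4 choices per question.
a = expected_score(50)
b = expected_score(25)
print(float(a), float(b), float(a / b))

# Only two plausible answers per question.
print(float(expected_score(50, choices=2)), float(expected_score(25, choices=2)))

# Penalty of 1/(choices - 1): the averages collapse back to what each student knows.
print(expected_score(50, penalty=Fraction(1, 3)), expected_score(25, penalty=Fraction(1, 3)))
```

These are averages only; the scatter around them for any individual student is a separate matter.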
I think what he should have said is that multiple choice tests are a stupid idea (it's okay if one or two questions are a block of multiple choice lines but not the whole test). Let the student explain things with his own words.
Justice is the sheep getting arrested while an impartial judge declares the vote void.
Subtracting points for wrong answers is supposed to encourage students to skip a question if they don't know what to say rather than give a wrong answer. If someone gets 48% right from his knowledge he can't spray and pray for the remaining 2%.
Though some of his logic was overblown (see the comments made directly on his blog), I think his larger point has some merit. In fields which require lots of studying before beginning as a professional, such as medicine and law, you always hear that you have to be absolutely brilliant to 'get in'. The fact of the matter is that this is not the case: you should be darn smart, but you needn't be the best student in the world to be successful as a doctor. Many of the students who go to law or medical school (I'd guess most) are completely qualified for positions in their respective fields, but by the same token, are not necessarily any more qualified than their peers: they've all studied the same material, had the same experience in the lab, and know the whole picture within a reasonable approximation of each other.
Yet to maintain the level of exclusivity that these careers have, there must be some way to select a subset of the candidates to proceed, and at this point, there are few distinguishing features among them. Some will be far and away brilliant, and will easily get a career regardless; but the majority can't be differentiated from one another. So, how should it be decided who is a doctor and who isn't? By making a test that's so hard it amounts to a randomising function, and then selecting a subset of top scorers to pass. Passing doesn't mean one is inherently more qualified; it just means one guessed better on that day. This also explains why people can pass on their second or third try: they are no better than their competitors the next time around, but eventually one will guess luckily, and get in. It'd be interesting to do some statistical analysis on how many tries it takes people to 'pass' a particular exam, and see if the results fit probabilistic models: If the results of such analysis fit too well, the test is too hard, whereas if they deviate greatly from probabilistic expectations, then the test is more likely to be an actual test of one's knowledge.
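As a rough sketch of what the "pure luck" model would predict: if every attempt were an independent draw with the same pass probability, the number of tries needed would follow a geometric distribution. The 35% figure is only borrowed from the pass rate quoted in the summary, and the function names are made up for illustration.

```python
def share_passing_on_attempt(k, p):
    """Probability of failing k-1 times and then passing on attempt k,
    assuming every attempt is an independent draw with pass probability p."""
    return (1 - p) ** (k - 1) * p

def share_passed_within(k, p):
    """Probability of having passed at least once within k attempts."""
    return 1 - (1 - p) ** k

p = 0.35  # assumed pass rate, taken from the "only 35 percent passed" example
for k in range(1, 6):
    print(k, round(share_passing_on_attempt(k, p), 4), round(share_passed_within(k, p), 4))
```

Real attempt counts that track these shares closely would point toward luck; counts that deviate strongly would point toward the exam measuring something.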
To be sure, there will be some individuals who can pass based entirely on their knowledge, just as there will be some individuals who simply aren't cut out for life as a lawyer that will fail the exam. But ultimately, it allows the higher-ups to select candidates for job positions based on the single indisputable criterion of the candidate having passed an exam, thus avoiding any messy issues when someone complains about them choosing a particular candidate in lieu of one better qualified.
Time for a terrible analogy, since it's 0300 here: Really hard exams are the bouncers at the door to the club of medical careers.
In many professional specialties, including law and medicine, there are times when a quick, decisive educated guess may produce better results than an exhaustively researched, definitively confirmed answer.
So tests that force students to do a lot of guessing may still be good tools for evaluating their professional qualifications.
A doctor or lawyer who can guess right may be superior to one who plods to the right answer only after many expensive lab tests or hours of legal research. That's not to say that doctors and lawyers shouldn't do lab tests and research -- of course they should. But there are many situations, especially time-sensitive ones, where quick judgment is more important than absolute knowledge: during surgery or a health crisis, during a trial or deposition, etc.
I had a teacher in college who gave very hard tests. Intel Assembly class. For a midterm, we had to decipher Object-Oriented Assembly, and decipher self-modifying code. After 3 weeks of introduction to Assembly.
I got an A, with an average of 58% in the class.
For the 2-hour final, he got up at the 1-hour point, and yelled: "The test is over. All pencils down." We just sat there dumbfounded for about 10 seconds, and then he said, "Just kidding. I always wanted to do that."
Ya, a real great pal there!
Worst teacher I had in college. He didn't last long.
Don't steal. The government hates competition.
I once had a test that had a check box for how confident you were that your answer was correct, which affected your score in the following way:
If you ticked "confident" and you were wrong, -2
If you ticked "confident" and you were right, +2
If you ticked "unsure" and you were wrong, -0
If you ticked "unsure" and you were right, +1
I guess the point is that it's advantageous to guess, but only if you choose the lesser-scoring option.
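A quick sketch of the expected points under that scheme, as a function of how likely you think you are to be right. The function names are mine, and the 25% blind-guess figure assumes four answer choices, which the test may or may not have had.

```python
def expected_points(p_right, confident):
    """Average points for one question, given the chance p_right of being correct."""
    if confident:
        return 2 * p_right - 2 * (1 - p_right)   # +2 if right, -2 if wrong
    return 1 * p_right - 0 * (1 - p_right)       # +1 if right, 0 if wrong

def best_box(p_right):
    """Which box has the higher average payoff for this level of certainty."""
    if expected_points(p_right, True) > expected_points(p_right, False):
        return "confident"
    return "unsure"

for p in (0.25, 0.5, 0.9):
    print(p, expected_points(p, True), expected_points(p, False), best_box(p))
```

Setting 4p - 2 = p gives a break-even at p = 2/3: below that, "unsure" pays better on average, and a guess marked "unsure" can never cost you points.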
TFA makes sense. Observe:
News for nerds?: yes[ ] no[x]
Stuff that matters?: yes[ ] no[x]
Clearly the editorial process is fraudulent - as this is a multiple choice, it is obvious that guessing tends to count much more than knowledge.
From this we can conclude one of two things:
1) Zonk is bad at guessing
2) The author is speaking out of his ass
Tempting as it is, I am going to stick with 2... But I could, of course, be guessing.
I loved the exams we had: a question was posed or a problem stated which required the knowledge we had learnt to solve it. Sometimes more than one question was asked to offer a lead. But no answers were given. Those are real tests. Applied knowledge. Usually, with multiple choice, a very basic knowledge of the subject lets you sort out the most probable response for many questions. This is how I breezed through my English multiple choice at the university, and hell, look at how bad (or how good ;)) my English is. Face it, multiple choice might be an easy way out for professors to correct exams, but it is the poorest choice to test the knowledge and reasoning ability of the student.
C. Sagan : A demon haunted world:
http://www.amazon.com/gp/product/0345409469/
visit randi.org
Having done experiments on MBA students
See, I KNEW they were good for something. Let me guess, the reason you opted for MBAs over mice is that there are far fewer protests when you do cruel medical experiments on MBA students than on mice, correct?
Monstar L
Mr. Feldzamen claims to have passed the Virginia bar exam, but I can't find any evidence he was ever admitted to the Virginia bar, or to any state bar (he's not in Martindale-Hubbell). He cites the Virginia bar exam -- which I also passed (IAAL, licensed to practice in CA and VA) -- as one of his examples of a "complete fraud." In fact, when I took the Virginia bar exam it had over a dozen one-hour essay components, testing each and every possible subject. By contrast, the California bar exam had essay tests covering six randomly chosen subjects out of a possible 15 or so, and it had other non-multiple-choice components. The multiple-choice section of every state's bar exam, the Multistate Bar Exam, is no walk in the park. So I don't understand how he includes bar exams in his claim that the tests are invalid. If anything, the low pass rate of bar exams, typically 50% or less among a candidate pool of mostly recent law school grads, suggests that they are very hard indeed.
I find the fact that medical and lawyer exams are based on multiple choice rather disturbing. As an engineer, almost all of my tests were long answer. Sure, some multi questions, but mostly show all your work or explain the whole process. And I just design systems and networks! Now someone can just luckily guess enough multiple choice questions and start slicing me up?
Like I said, disturbing.
Vote monkeys into Congress. They are cheaper and more trustworthy.
A person has heartburn, do you:
A) Perform a colonoscopy
B) Perform open heart surgery
C) Tickle him
D) Fart
E) Refer him to Cowboy Neil
I'm going to Mexico for my next check up. At least you'll get tequila first....
I just skimmed TFA, but it seemed to me like he was advocating a guessing penalty.
Funnily enough, one of the most hardassed profs I ever had also taught the introductory assembler class. (except for us it was PDP-11 and 68K) His tests were legendary for their difficulty, and the average was somewhere in the 20-30% range. However, it was curved after the fact and was a perfectly valid exam since there was absolutely no opportunity to guess. He gave us self-modifying assembler code too, without telling us such a thing was possible in advance! He also had a unique way of assigning readings. He would say, "Have you read chapter X yet? If you haven't, you're screwed!" Still, despite his apparent sadism we did learn a lot in his class.
In a later course I had a prof who would run our class through proofs that would span 3 or 4 lectures. If you fell asleep once in that period of time you'd be utterly lost. At the end of his proofs he would often say, "Does this make sense? Does everybody get this? If not, you had better think about dropping the course!" (Somehow it was hilarious in his thick Indian accent. He really rolled the 'r' in dropping too.)
-1 for an incorrect answer? That's pretty weak. As an air traffic control student, our wrong answers get punished with planes full of dead people.
The problem there is that averages are one thing, but in practice there still is a non-zero chance that he'll actually score higher than you do.
Let's say it's 20 questions, 4 possible answers each. He'll know 5 of those, has to guess 15. There's even a 1 in a billion chance that he'll get all 20 right. (4^15 = 2^30 = approx 1 billion.) If you gave that test in China, by now you'd have at least one guy who pulled exactly that stunt.
There's also the issue of how well those questions fit your and his domain of knowledge. Let's say you can't possibly test _all_ the questions, because that's usually the case. You can do it for state capitals, but you can't possibly cover a whole domain like medicine or law.
There are 50 states, you know 25, the other guy knows, say, 12 (rounded down), so it's not impossible that the 20 questions are all from the 25 you don't know, but include all 12 that guy knows. In fact, assuming a very very very large domain (much larger than 50, anyway), there's about a 1 in a million chance that all 20 questions will be from the 50% you don't know.
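Here is a small sketch that checks those odds. The variable names are mine, and the finite 50-state case assumes the 20 questions are drawn without repeats from the 50 states.

```python
from math import comb

# 15 blind guesses on 4-choice questions, all correct.
p_all_guesses_right = 1 / 4 ** 15
print(4 ** 15 == 2 ** 30, 4 ** 15)

# Very large domain: each of the 20 questions independently falls in the
# half you don't know with probability 1/2.
p_all_unknown_large_domain = 0.5 ** 20
print(2 ** 20)

# Exactly 50 states: 20 distinct questions, all drawn from the 25 you don't know.
p_all_unknown_50_states = comb(25, 20) / comb(50, 20)
print(p_all_unknown_50_states < p_all_unknown_large_domain)
```

With a small finite pool the "all from the half you don't know" outcome is far rarer than in the large-domain approximation, because each unlucky draw uses up one of the unknown items.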
Now when testing states, that doesn't carry any deeper meaning, because (at least theoretically) all states are equally important. In other domains, like medicine, law, even CS, that's not the case: stuff ranges from vital basics to pure trivia that no one gives a damn about. (Or not for the scope of the problem at hand: e.g., if I'm hiring a Java programmer, asking questions about COBOL would be just trivia.)
And a lot of "hard tests" are "hard" just by including inordinate amounts of stuff that's unimportant trivia. E.g., if I'm giving a test for a unix admin job, I can make it arbitrarily "hard" by including such trivia as "in which directory is Mozilla installed under SuSE Linux?" It's stuff that won't actually affect your ability to admin a unix box in any form or shape. The fact that SuSE does install some programs in different directories is just trivia.
(And if that sounds like a convoluted imaginary example, let's say that some "hard" certification exams ask just that: where is program X installed in distribution Y? And at least one version of Sun's Java certification asked such idiotically stupid trivia as in which package is class X, or whether class Y is final. Who cares about that trivia? It's less than half a second to get any IDE to fill in the package for you. E.g., in Eclipse it just takes a CTRL+SPACE.)
And in view of that previous point, including trivia in an exam just to make it "hard" is outright counter-productive. There is a non-null chance that you'll pass someone who memorized all the trivia, but doesn't know the basics.
Not all knowledge is created equal, and that's one point that many "hard" exams and certifications miss. If a lawyer doesn't know the intricacies of Melchett vs The Vatican, who cares? In the unlikely situation that they need it, they can google it. If they don't understand Habeas Corpus, on the other hand, they're just unfit to be a lawyer at all. Cramming trivia into an exam can get you just that kind of screwed up situation: you passed someone who happened to know that Melchett vs The Vatican is actually a gag question, and that case name appears in Stephen Fry's "The Letter", yet flunked someone with a solid grasp of the basics who knows how to extrapolate from there and where to get more information when he needs it.
Rewarding random guesswork is worse. Probably the most important thing one should know is what he _doesn't_ know, so he can research it instead of taking a dumb uninformed guess. Most RL problems aren't neatly organized into 4 possible answers, so it can be a monumental waste of time to just take wild guesses and see if it works. I've seen entirely too many people wasting time trying wrong guess after wrong guess, instead of just doing some research. E.g., I've actually witnessed a guy trying every single bloody combination between *, & and nothing in front of every single variable in a C function, because he never understood how poin
A polar bear is a cartesian bear after a coordinate transform.
> This is really a question of statistics not of mathematics. Having done experiments on MBA
> students, we found that a well written multiple choice question is more accurate than 4 well
> written essays. The fact that we can easily have 50 multiple choice questions and a maximum
> of 8 essays makes it a no brainer that multiple choice is much more accurate.
I don't know how you judge whether a question is well written or not. In my experience, multiple choice questions are very easy to write wrongly. A wrongly worded MC question can easily have exactly the opposite effect to the one you want: you reward the ones who know the subject less (it seldom just gives you random noise). Worse, you won't know it happened until you're told. I've read many exam MC questions during exam paper review meetings, and my feeling from reading such questions for 4 years is that one in 4-6 MC questions is poor in this way. In contrast, a wrongly worded essay question will present students with some real-life trouble (the questions that they will face will be full of inaccuracies!), and when marking them you know the question is written wrongly, but at the same time you know whether the students are good anyway.
But the real problem with multiple choice questions is that they don't present the student with any real world test. In the real world, nobody would tell you "You are in a situation, you can do A, B, C or D. Please choose one". Instead, what you see is "Somebody is in this situation. Please advise." Being good at multiple choice questions usually has doubtful utility in the real world. And education systems will have to align with the judgement system, so in the end the teachers train their students in the wrong technique as well.
Of course, there are benefits of MC questions: they can be marked mechanically, which means that (1) they lessen the workload of markers, (2) they are marked with perfect consistency, and (3) their markings are free from language or hand-writing proficiency. I don't think "accuracy" is one of those, though, since MC questions are just testing the wrong ability.
Actually it has to be a % passing. If the supply of licensed doctors and attorneys were not limited, the costs for their services would reduce, so these exams have to be a part of the system to control the supply. A test may be written to ensure a spread (so it tests knowledge) and also to ensure that the passing score is largely unattainable. So, I think the analysis is incorrect. The tests are not too hard to be useful as tests, it is just that there is a conflict of interest as regards their use. As medical care begins to take on the characteristics of a human right, as representation in court is a political right, perhaps we'll begin to see a breaking down of the cartel system so that medical and law educations are not restricted and final competency tests can be tests of competency rather than also being a link in a chain of controlling supply to increase price.
--
Electricity without fuel costs: http://mdsolar.blogspot.com/2007/01/slashdot-users-selling-solar.html
Ugh. I just wrote a pretty polite reply at his page after skimming his idiotic article. Now that I've read it, I'm actually angry.
This guy knows NOTHING about testing. Nothing. He isn't even to the level of Classical Testing Theory (CTT), which is really not much more than means and Pearson correlations, and is nowhere near how high-stakes (and even medium- and low-stakes, increasingly) multiple choice (MC) tests work now, and how they have worked for many many years.
IAAP (I am a psychometrician). A big part of what I do for a living is design a particular MC test, pilot the items, and interpret the results. But I don't just count up the correct items and give you the percentage. Why? Because that would be insane. You can guess on those.
Oh, but he says this:
But suppose the grading attempts to adjust for guessing. There is no way of knowing what is in the mind of the test-taker, so the customary is to subtract, from the number correct, some fraction of the number wrong.
--Which is just fine until I tell you I have NEVER heard of dealing with guessing that way on a professional-level test.
As a general rule, we don't do any easy mathematics. At all.
Here is part of the output for a test I'm working on right now:
This is generated by RUMM2020, a tool for Rasch analysis. The Rasch model was developed in the 60s as an ideal model of item response. These are the stats on 3 items of this test. The two most important columns are Location and Probability.
The location is the item difficulty. Given the sample's performance on this item, and given their ability, how hard is this item? Item 35 is quite difficult; item 36, quite easy.
The probability is the p value for the chi square. Basically, if it's 0.05 or below, that item is operating significantly (statistically significantly, that is) outside of the model. It displays poor "fit." We generally toss these items before going on to the next step (ideally, these are weeded out during pilot testing, before the test goes live--in this case, it is an experimental test of a construct I'm not even sure exists anymore, but I digress). If an item has poor fit with the model, it is too much of a loose cannon, and its results cannot be trusted. This is what the benighted blogger (is there any other kind?) was whining about. That item is hard not because it is good, but because it is evidently stupid. The responses are all over the place, which means people were probably just guessing. Out it goes before it ruins any examinees' lives.
The next step is to get person locations. In the case of people, these numbers indicate the person's ability. This is calculated by looking at their performance on the items, given their difficulty (Which is calculated based on people's performance on them! Incestuous! But given a large enough sample, it all works out to a fine enough grain to be useful). Here is the output for the people:
So, the first person didn't do so hot; the last did pretty well (these usually top out at 3ish). As you can see in "DataPts," there were 125 items on this test. I started with 160. Do you hear that, Mr. Unexpected "Truths?" We have your back! We're not just handing you a naked score based on our crap items. WE PULL THE CRAP ITEMS.
That location score will usually be rescaled to something prettier, since no one would really like to see something like
Had an algorithms prof (of all things) give us a test where every question had the following possible answers:
Yes, No, Sometimes, Maybe, Unknown
Then, he had questions like 1. Some scientists believe that P=NP?
To which, of course, you could argue ANY answer is correct.
That being said, this blog post comes across as the usual whining we've all done or had to put up with through the years. No testing methodology is perfect, and everyone tests differently on different kinds of tests. Fact is, though, they're pretty damn good. It's a common belief that millions of people who are otherwise idiots are graduating with great grades, while millions of geniuses can't test well - but that's horseshit. The majority of people manage to test at their level of understanding. The fact that people actually notice the odd idiot who guesses well is the exception that proves the rule.
Endless arguments over trivial contradictions in books written by ignorant savages to explain thunder in the dark.
The hardest part of most medical specialist exams is the orals. Nobody ever complains about the written component. You get to sit in a room with one or more examiners for a few hours of intense grilling. There is no way to hide any lack of knowledge and your deficiencies are exposed for all to see.
Also the US has a strange system of certifying specialists. After completing residency (usually based on putting in your hours) you can practice medicine under the appellation 'board-eligible.' Once you've passed your exams, then you can be called 'board certified.'
In Canada, you can't practice at all unless you pass your board (Royal College) exams. The exams are reputedly harder in Canada as well (from those I know who have written both).
I don't care what they don't know.
I give multiple choice exams with between 100 and 200 questions, and 4 possible answers.
Each correct answer is worth 2 points; they need to answer 50 correctly to get 100.
They don't HAVE to answer any question, or any number of questions. If they can answer 30 questions, they can get a D. Any question answered incorrectly is -1 point. This serves two purposes.
It prevents guessing, and it forces the student to consider whether they actually know the answer, or just think they do.
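A minimal sketch of that scoring rule and of what a guess is worth on average under it. The function names are invented for illustration, and the cases with fewer than 4 live choices assume the student has already ruled out some answers.

```python
from fractions import Fraction

def exam_points(correct, wrong):
    """+2 per correct answer, -1 per incorrect answer, 0 for a skipped question."""
    return 2 * correct - 1 * wrong

def expected_guess_points(live_choices=4):
    """Average points from guessing at random among `live_choices` remaining answers."""
    p = Fraction(1, live_choices)
    return 2 * p - 1 * (1 - p)

print(exam_points(50, 0))   # 50 answered correctly, nothing else attempted
print(exam_points(30, 0))   # 30 answered correctly, nothing else attempted
for n in (4, 3, 2):
    print(n, expected_guess_points(n))
```

A blind guess among all 4 answers loses a quarter point on average, so the rule does discourage it; once a student can rule out two answers, guessing between the remaining two comes out ahead on average, which fits the idea of rewarding partial knowledge.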
I typically give 4 of these per semester. After the first one I usually get several complaints because they're not used to testing in this way. After the second I usually get one or two stating they can't break the habit of answering every question. After the final, I get many compliments and high marks on my evaluations, and the students tell me they are much more confident in what they've learned than from any other class. I've had occasion to run across previous students from years past, and they claim they still remember more from my class than from others.
I've had administrators forbid me to do it this way. I did it anyway. When they saw the results, they relented, and many suggested the process to others.
"I may be synthetic, but I'm not stupid." -- Bishop 341-B
[i]For True-False exams for example, the number subtracted would most likely be (Number Wrong ÷ 2). Let's see how that would work out, for the sample case above. You, answering two questions correctly and guessing at 98 would be likely, on the average, to get 49 wrong, and so have a final score of 2 + 49 - (49 ÷ 2), or 75.5, while I, again on the average. answering only 1 correctly and guessing at 97, would get a final score of 1 + (97 ÷ 2) - ((97 ÷ 2) ÷ 2)), which comes out to be 25.25. Here there is a substantial difference between our scores, closer to the two-fold difference in our actual knowledge.[/i]

Let's think about this: 51 - 24.5 = 26.5, not 75.5. Further, knowing one would mean guessing at 99, not 97: 1 + (99/2) - (99/4) = 25.75. This means the average difference, if adjusting for guessing, moves from .5 (average score of 50.5 vs 51) to .75, hardly a substantial difference. Of course the numbers will separate out at greater levels of knowledge, as he showed earlier: if one person can answer 50 and the other 25, the average scores will be 62.5 and 43.75.
Now he probably simply didn't check his math, but twice in the same paragraph?
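For the record, a short sketch that redoes the true-false arithmetic, assuming a 100 question test and the quoted rule of subtracting (number wrong ÷ 2). The function names are mine.

```python
def raw_true_false(known, total=100):
    """Average number right when every unknown true-false question is a coin flip."""
    return known + (total - known) / 2

def adjusted_true_false(known, total=100):
    """Average score after subtracting half of the wrong answers."""
    guessed = total - known
    wrong = guessed / 2          # on average half of the guesses are wrong
    return raw_true_false(known, total) - wrong / 2

print(raw_true_false(2), raw_true_false(1))            # unadjusted averages
print(adjusted_true_false(2), adjusted_true_false(1))  # adjusted averages
print(adjusted_true_false(2) - adjusted_true_false(1))
```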
Well, since you are much more of an authority in this subject area than most of us on Slashdot, perhaps you could give me some insight on this little conundrum?
If a man is walking in a forest, and he's talking to himself, and there are no women around, is he still wrong?
I agree with everything you said except this part:
"A multiple choice question might only have one right answer and its point value is the exact same as that of something much easier (especially, when on the harder on, the wrong choice might even be 'righter' than the correct choice on the easy question) -- but thats why there is an entire field of psychometrics out there to ensure that these sorts of exams are doing what they say they are."
Seems to me like that is more an example of psychometricians being forced to accept a less than valid form of test scoring. The proper way to do things has to incorporate Rasch's principle that the likelihood that a given test-taker will give the correct answer (on a question that is valid for the quantity it is being used to measure) depends on the product of the easiness of the question and the ability of the test-taker. For that matter, lumped scores (pass-fail, ranking, or absolute) on professional proficiency exams - which by their nature must test disparate quantities with various non-linear contributions to professional qualification - cannot properly be interpreted as measurements of anything without a well-thought out unified criterion that describes the contributions and dependencies of the various quantities measured by the questions to the overall measurement of professional competence.
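To make the "product" remark concrete, here is a minimal sketch of the standard dichotomous Rasch model in both its usual logistic form and its multiplicative (odds) form. The function names are hypothetical, ability and difficulty are assumed to be on the logit scale, and this is only the textbook formula, not what any particular analysis package computes internally.

```python
from math import exp

def rasch_p_correct(ability, difficulty):
    """Logistic form: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1 / (1 + exp(-(ability - difficulty)))

def rasch_p_correct_product(ability, difficulty):
    """Same model written as odds = (ability factor) * (easiness factor)."""
    odds = exp(ability) * exp(-difficulty)
    return odds / (1 + odds)

# A person whose ability equals the item's difficulty has an even chance.
print(rasch_p_correct(0.0, 0.0))
# Both forms agree for any pair of values.
print(rasch_p_correct(1.0, -0.5), rasch_p_correct_product(1.0, -0.5))
```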
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry