The Fallacy of Hard Tests
Al Feldzamen writes in with a blog post on the fallacious math behind many specialist examinations. "'The test was very hard,' the medical specialist said. 'Only 35 percent passed.' 'How did they grade it?' I asked. 'Multiple choice,' he said. 'They count the number right.' As a former mathematician, I immediately knew the test results were meaningless. It was typical of the very hard test, like bar exams or medical license exams, where very often the well-qualified and knowledgeable fail the exam. But that's because the exam itself is a fraud."
2 + 49 - (49 ÷ 2) = 75.5?
Seems like he added rather than subtracted the (49/2). Pretty much ruins the whole argument.
If you have 100 questions, and 20 right ones and 20 wrong ones, it leaves 60 unanswered questions.
That's why the article talks about only counting right ones. In order to avoid guessing, there should be a difference between picking a wrong answer and not picking an answer at all.
In many professional specialties, including law and medicine, there are times when a quick, decisive educated guess may produce better results than an exhaustively researched, definitively confirmed answer.
So tests that force students to do a lot of guessing may still be good tools for evaluating their professional qualifications.
A doctor or lawyer who can guess right may be superior to one who plods to the right answer only after many expensive lab tests or hours of legal research. That's not to say that doctors and lawyers shouldn't do lab tests and research -- of course they should. But there are many situations, especially time-sensitive ones, where quick judgment is more important than absolute knowledge: during surgery or a health crisis, during a trial or deposition, etc.
I once had a test with a check box for how confident you were that your answer was correct, which affected your score in the following way:
If you ticked "confident" and you were wrong, -2
If you ticked "confident" and you were right, +2
If you ticked "unsure" and you were wrong, -0
If you ticked "unsure" and you were right, +1
I guess the point is that it's advantageous to guess, but only if you choose the lesser-scoring option.
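To make the trade-off concrete, here is a rough sketch of the expected score per answer under that scheme. The function names are made up for illustration, and `p_correct` (your chance of being right on a given question) is an assumption, not something the test itself gives you.

```python
from fractions import Fraction

def expected_score(p_correct, confident):
    """Expected points for one answer that is right with probability p_correct."""
    if confident:
        # "confident": +2 if right, -2 if wrong
        return 2 * p_correct - 2 * (1 - p_correct)
    # "unsure": +1 if right, 0 if wrong
    return 1 * p_correct

def best_box(p_correct):
    """Box with the higher expected score (ties go to "unsure")."""
    if expected_score(p_correct, True) > expected_score(p_correct, False):
        return "confident"
    return "unsure"

for p in (Fraction(1, 4), Fraction(1, 2), Fraction(2, 3), Fraction(9, 10)):
    print(p, expected_score(p, True), expected_score(p, False), best_box(p))
```

Under these assumptions the break-even point is p = 2/3 (solve 4p - 2 = p): below that, ticking "unsure" has the higher expected score, and it can never cost you points.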
Ah, so the process of eliminating the dangerous in favour of the obvious is actually built into the tutoring system, while the exams appear to teach the Right Thing to make the problem even harder to fix.
Got a friend who used to work in healthcare and this goes some way to explaining a few incidents where the Unlikely Explanation was instantly ruled out and the patient died (or, by over-concerned parents, was immediately taken elsewhere for tests and a life-threatening condition found).
This isn't like sloppy programming where the greatest danger of a buffer overflow is a pwned machine. If there is a discrepancy like that between "what you will actually do" and "what you claim you would do", and someone's life might depend on it, don't play along - rock the boat. If it'd be unhelpful to do so now, please make an issue of it once you've graduated.
Mr. Feldzamen claims to have passed the Virginia bar exam, but I can't find any evidence he was ever admitted to the Virginia bar, or to any state bar (he's not in Martindale-Hubbell). He cites the Virginia bar exam -- which I also passed (IAAL, licensed to practice in CA and VA) -- as one of his examples of a "complete fraud." In fact, when I took the Virginia bar exam it had over a dozen one-hour essay components, testing each and every possible subject. By contrast, the California bar exam had essay tests covering six randomly chosen subjects out of a possible 15 or so, and it had other non-multiple-choice components. The multiple-choice section of every state's bar exam, the Multistate Bar Exam, is no walk in the park. So I don't understand how he includes bar exams in his claim that the tests are invalid. If anything, the low pass rate of bar exams, typically 50% or less among a candidate pool of mostly recent law school grads, suggests that they are very hard indeed.
Actually, the problem here is his example is a total worst case scenario and doesn't tell us what the 'Pass' level is. The tests mentioned are not relative knowledge tests, they are pass/fail tests, in other words, I don't care how much Joe knows compared to Bill, all I care about is does Joe demonstrate the necessary level of knowledge to pass. In that case, assuming the test maker has the slightest clue, in the example the pass mark would likely be at 75%+ (you need about 1/2 right legit, and 1/2 of the remaining right on guesses or better) meaning that its difficulty is fine as it has correctly blocked both people as they didn't show the necessary level of knowledge.
He might have a point IF he qualified this to scaled result tests (ie the top X people will pass regardless of their scores, only relative position counts), but he didn't. But, even in that case he'd have to analyze the distribution of all testees, not just 2. Once again, his math does work and doesn't support the argument.
Funnily enough, one of the most hardassed profs I ever had also taught the introductory assembler class. (except for us it was PDP-11 and 68K) His tests were legendary for their difficulty, and the average was somewhere in the 20-30% range. However, it was curved after the fact and was a perfectly valid exam since there was absolutely no opportunity to guess. He gave us self-modifying assembler code too, without telling us such a thing was possible in advance! He also had a unique way of assigning readings. He would say, "Have you read chapter X yet? If you haven't, you're screwed!" Still, despite his apparent sadism we did learn a lot in his class.
In a later course I had a prof who would run our class through proofs that would span 3 or 4 lectures. If you fell asleep once in that period of time you'd be utterly lost. At the end of his proofs he would often say, "Does this make sense? Does everybody get this? If not, you had better think about dropping the course!" (Somehow it was hilarious in his thick Indian accent. He really rolled the 'r' in dropping too.)
The problem there is that averages are one thing, but in practice there still is a non-zero chance that he'll actually score higher than you do.
Let's say it's 20 questions, 4 possible answers each. He'll know 5 of those, has to guess 15. There's even a 1 in a billion chance that he'll get all 20 right. (4^15 = 2^30 = approx 1 billion.) If you gave that test in China, by now you'd have at least one guy who pulled exactly that stunt.
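For anyone who wants to check that figure, a quick sketch (the function name is just for illustration, and it assumes every guess is independent and uniformly random over the 4 options):

```python
from fractions import Fraction

def p_all_guesses_right(n_guessed, n_options):
    """Chance that every one of n_guessed blind guesses is correct."""
    return Fraction(1, n_options) ** n_guessed

p = p_all_guesses_right(15, 4)
print(p)         # 1/1073741824
print(float(p))  # roughly one in a billion
```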
There's also the issue of how well those questions fit your and his domain of knowledge. Let's say you can't possibly test _all_ the questions, because that's usually the case. You can do it for state capitals, but you can't possibly cover a whole domain like medicine or law.
There are 50 states, you know 25, the other guy knows, say 12 (rounded down), so it's not impossible that the 20 questions are all from the 25 you don't know, but include all 12 that guy knows. In fact, assuming a very very very large domain (much larger than 50, anyway), there's about 1 in a million chance that all 20 questions will be from the 50% you don't know.
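A sketch of both versions of that estimate, for the curious. The helper names are invented here; the "large domain" case treats each question as an independent 50/50 draw, while the finite case assumes 20 distinct questions drawn from exactly 50 items, of which you don't know 25.

```python
from fractions import Fraction
from math import comb

def p_all_unknown_large_domain(n_questions, frac_unknown):
    """Very large domain: each question independently falls in the unknown part."""
    return frac_unknown ** n_questions

def p_all_unknown_finite(n_items, n_unknown, n_questions):
    """Finite pool: n_questions distinct items, all drawn from the n_unknown ones."""
    return Fraction(comb(n_unknown, n_questions), comb(n_items, n_questions))

large = p_all_unknown_large_domain(20, Fraction(1, 2))
finite = p_all_unknown_finite(50, 25, 20)
print(large)          # 1/1048576
print(float(finite))  # far smaller, since the unknown pool runs out quickly
```

So "about 1 in a million" holds for the very-large-domain case; with only 50 items and no repeats, the same unlucky draw is much less likely.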
Now when testing states that doesn't have a deeper moral, because (at least theoretically) all states are equally important. In other domains, like medicine, law, even CS, that's not the case: stuff ranges from vital basics to pure trivia that no one gives a damn about. (Or not for the scope of the problem at hand: e.g., if I'm hiring a Java programmer, asking questions about COBOL would be just trivia.)
And a lot of "hard tests" are "hard" just by including inordinate amounts of stuff that's unimportant trivia. E.g., if I'm giving a test for a unix admin job, I can make it arbitrarily "hard" by including such trivia as "in which directory is Mozilla installed under SuSE Linux?" It's stuff that won't actually affect your ability to admin a unix box in any form or shape. The fact that SuSE does install some programs in different directories is just trivia.
(And if that sounds like a convoluted imaginary example, let's say that some "hard" certification exams ask just that: where is program X installed in distribution Y? And at least one version of Sun's Java certification asked such idiotically stupid trivia as in which package is class X, or whether class Y is final. Who cares about that trivia? It's less than half a second to get any IDE to fill in the package for you. E.g., in Eclipse it just takes a CTRL+SPACE.)
And in view of that previous point, including trivia in an exam just to make it "hard" is outright counter-productive. There is a non-null chance that you'll pass someone who memorized all the trivia, but doesn't know the basics.
Not all knowledge is created equal, and that's one point that many "hard" exams and certifications miss. If a lawyer doesn't know the intricacies of Melchett vs The Vatican, who cares? In the unlikely situation that they need it, they can google it. If they don't understand Habeas Corpus, on the other hand, they're just unfit to be a lawyer at all. Cramming trivia into an exam can get you just that kind of screwed up situation: you passed someone who happened to know that Melchett vs The Vatican is actually a gag question, and that case name appears in Stephen Fry's "The Letter", yet flunked someone with a solid grasp of the basics and who knows how to extrapolate from there and where to get more information when he needs it.
Rewarding random guesswork is worse. Probably the most important thing one should know is what he _doesn't_ know, so he can research it instead of taking a dumb uninformed guess. Most RL problems aren't neatly organized into 4 possible answers, so it can be a monumental waste of time to just take wild guesses and see if it works. I've seen entirely too many people wasting time trying wrong guess after wrong guess, instead of just doing some research. E.g., I've actually witnessed a guy trying every single bloody combination between *, & and nothing in front of every single variable in a C function, because he never understood how poin
A polar bear is a cartesian bear after a coordinate transform.
> This is really a question of statistics not of mathematics. Having done experiments on MBA
> students, we found that a well written multiple choice question is more accurate than 4 well
> written essays. The fact that we can easily have 50 multiple choice questions and a maximum
> of 8 essays makes it a no brainer that multiple choice is much more accurate.
I don't know how you judge whether a question is well written or not. In my experience, multiple choice questions are very easy to write wrongly. A wrongly worded MC question can easily have exactly the opposite effect to the one you want: you reward the ones who know the subject less (it seldom just gives you random noise). Worse, you won't know it happened until you're told. I've read many exam MC questions during exam paper review meetings; my feeling from reading such questions for 4 years is that one in 4-6 MC questions is poor enough to do this. In contrast, a wrongly worded essay question will present students some real-life trouble (the questions that they will face will be full of inaccuracies!), and when marking them you know the question is written wrongly, but at the same time you know whether the students are good anyway.
But the real problem of multiple choice questions is that they don't present the student with any real world test. In the real world, nobody would tell you that "You are in this situation, you can do A, B, C or D. Please choose one". Instead, what they see is "Somebody is in this situation. Please advise." Being good at multiple choice questions usually has doubtful utility in the real world. And education systems will have to align with the judgement system, so in the end the teachers train their students in the wrong technique as well.
Of course, there are benefits of MC questions: they can be marked mechanically, which means that (1) they lessen the workload of markers, (2) they are marked with perfect consistency, and (3) their markings are free from language or hand-writing proficiency. I don't think "accuracy" is one of those, though, since MC questions are just testing the wrong ability.
In a well designed exam, the 'educated guess' is just as much a part of the design as anything else. You *HAVE* to have questions that have answers somewhat similar, or you make it way too easy to guess the answer by way of elimination. At the same time, we want enough questions where one can eliminate one or more answers immediately.
For instance, I'm a testing person, but not a content person (i.e., I design towards what the stats tell me, as well as the actual wording and structure of the exam...I always work with someone who understands the content areas from a very advanced level and can deal with that end). One of the last MC exams I was helping validate, I knew NOTHING about the content -- it was a medical exam. First thing I did was go through the entire exam, read all the questions quickly, and see if logic could remove any of the answers. Statistically, I would have gotten a 20% by random means, but in this case, I received somewhere around 43% (if I remember correctly). The educated guess is a BIG part of these things...you aren't just measuring content knowledge, but application and that means if someone can raise the bar, they might actually do well in the real world. If I had a doctor who had never seen a case like mine, and it defied traditional practice, I think I'd be more impressed with the man that got 40% on pure logic, than the guy that got the 40% based around actually knowing something about the problem (and actually, I had a team of doctors several years ago like this...I sat around trying to figure out how I was going to die for a couple of months while one doctor who had seen problems like mine couldn't figure out what the cause was, while the one that wasn't an expert in the field methodically ruled out what wasn't the cause, and ended up finding me a specialist, something the first doctor SHOULD have been able to do because his field encompassed a hell of a lot more of the specialty than my general physician's 'specialty').
And it kinda depends on the type of test and what you are measuring. When designing these things, you ask a lot of questions based around the type of assessment one is looking for. And you design accordingly. By correlating my exams with others that have some sense of validity, I can see the levels of the testees before they take the new one. This in itself will show you quite a few things about the design of the new exam. For instance, we can tell certain questions might have 50% of the folks answering correctly, but which 50%? On the original test, you have two groups take the exam, novices and experts (and heavily simplifying this for /.). If the experts get the question wrong, while the novices get it right -- the question is struck. Someone with little experience in test design may look at the question and wonder what's wrong -- the answer is correct and all of your colleagues agree -- but in some way it is wrongly worded. So again, it is either struck, or restructured to be inserted for calibration and validation at a later point (on a large exam like the Bar the author had derided, a good chunk of the questions are probably not scored and are only there to see how well they work and if they can be put into the next exam).
Beyond that, you have panels of experts who go over questions. Have them all vote on things like the difficulty of the item, the appropriateness for the exam. Things like that. Folks like me will take these and sort the items into usable or unusable stacks, rewrite them (again with experts), and then sort X amount of the lower difficulty, Y of the medium, Z of the hard (the easy questions are there to give motivation...it's amazing how much better someone will do if they get positive reinforcement in that they KNOW this question...it will prime the neural pathways to hopefully give more routes to specific knowledge in order to get the reward...I can feel the endorphin rush when I'm doing poorly but then get a win every now and then and it helps). And finally, one analyzes everything to see h
No, just no. The point of the tests is to determine who is over a particular threshold of knowledge and who isn't. The method being called a fraud fails to accurately do that. Since randomness has a proven substantial impact on those tests that threshold becomes blurred. To make matters worse, the harder the test the MORE randomness affects score. As a result the test results are meaningless at any scale. His examples were simplified to illustrate the essential math behind them, he does not need more than 2 people to compare since the math is equally applicable no matter how many are tested. He also does not need to set a scale because the math is equally applicable to any bar you might set.
The point of the article was to illustrate that these hard tests are meant to establish a minimum required level of knowledge; however, because only correct answers are counted, randomness greatly reduces the accuracy of the attempted measurement of knowledge. He is suggesting, and rightly so, that a test on which guessing has an effective net effect of 0 would measure the knowledge of the participants much more accurately.
What this really comes down to is accuracy and precision. We assume that a test score can be equated to a measurement of knowledge, and for your benefit (it's completely irrelevant) we'll assume that a passing test is 60%.
The article's math indeed illustrates this point very clearly. The unspoken point is that in tests such as these, designed to set standards to be met, it is a fraud to use a test with low accuracy at measuring actual knowledge. The precision gained by penalizing guessing allows the test to be much more fair in its administration.
http://en.wikipedia.org/wiki/Accuracy
Mobius Custom Computers
Actually it has to be a % passing. If the supply of licensed doctors and attorneys were not limited, the costs for their services would reduce, so these exams have to be a part of the system to control the supply. A test may be written to ensure a spread (so it tests knowledge) and also to ensure that the passing score is largely unattainable. So, I think the analysis is incorrect. The tests are not too hard to be useful as tests, it is just that there is a conflict of interest as regards their use. As medical care begins to take on the characteristics of a human right as representation in court is a political right, perhaps we'll begin to see a breaking down of the cartel system so that medical and law educations are not restricted and final competency tests can be tests of competency rather than also being a link in a chain of controlling supply to increase price.
--
Electricity without fuel costs: http://mdsolar.blogspot.com/2007/01/slashdot-users-selling-solar.html
Ugh. I just wrote a pretty polite reply at his page after skimming his idiotic article. Now that I've read it, I'm actually angry.
This guy knows NOTHING about testing. Nothing. He isn't even to the level of Classical Test Theory (CTT), which is really not much more than means and Pearson correlations, and is nowhere near how high-stakes (and even medium- and low-stakes, increasingly) multiple choice (MC) tests work now, and how they have worked for many many years.
IAAP (I am a psychometrician). A big part of what I do for a living is design a particular MC test, pilot the items, and interpret the results. But I don't just count up the correct items and give you the percentage. Why? Because that would be insane. You can guess on those.
Oh, but he says this:
> But suppose the grading attempts to adjust for guessing. There is no way of knowing what is in the mind of the test-taker, so the customary is to subtract, from the number correct, some fraction of the number wrong.
--Which is just fine until I tell you I have NEVER heard of dealing with guessing that way on a professional-level test.
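For reference, the adjustment the quoted passage describes is usually written as "rights minus a fraction of wrongs". A minimal sketch, assuming the textbook fraction 1/(k - 1) for k answer options (the quote itself doesn't say which fraction; the function names are illustrative only):

```python
from fractions import Fraction

def formula_score(n_right, n_wrong, n_options):
    """Number right minus n_wrong / (n_options - 1); blanks score 0."""
    return n_right - Fraction(n_wrong, n_options - 1)

def expected_gain_from_blind_guess(n_options):
    """Average change in score from one uniformly random guess."""
    p_right = Fraction(1, n_options)
    penalty = Fraction(1, n_options - 1)
    return p_right * 1 - (1 - p_right) * penalty

print(formula_score(20, 20, 5))            # 15
print(expected_gain_from_blind_guess(5))   # 0
```

With that fraction a blind guess is worth zero on average, which is the whole idea; it does nothing about the luck of any individual guesser.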
As a general rule, we don't do any easy mathematics. At all.
Here is part of the output for a test I'm working on right now:
This is generated by RUMM2020, a tool for Rasch analysis. The Rasch model was developed in the 60s as an ideal model of item response. These are the stats on 3 items of this test. The two most important columns are Location and Probability.
The location is the item difficulty. Given the sample's performance on this item, and given their ability, how hard is this item? Item 35 is quite difficult; item 36, quite easy.
The probability is the p value for the chi square. Basically, if it's 0.05 or below, that item is operating significantly (statistically significantly, that is) outside of the model. It displays poor "fit." We generally toss these items before going on to the next step (ideally, these are weeded out during pilot testing, before the test goes live--in this case, it is an experimental test of a construct I'm not even sure exists anymore, but I digress). If an item has poor fit with the model, it is too much of a loose cannon, and its results cannot be trusted. This is what the benighted blogger (is there any other kind?) was whining about. That item is hard not because it is good, but because it is evidently stupid. The responses are all over the place, which means people were probably just guessing. Out it goes before it ruins any examinees' lives.
The next step is to get person locations. In the case of people, these numbers indicate the person's ability. This is calculated by looking at their performance on the items, given their difficulty (Which is calculated based on people's performance on them! Incestuous! But given a large enough sample, it all works out to a fine enough grain to be useful). Here is the output for the people:
So, the first person didn't do so hot; the last did pretty well (these usually top out at 3ish). As you can see in "DataPts," there were 125 items on this test. I started with 160. Do you hear that, Mr. Unexpected "Truths?" We have your back! We're not just handing you a naked score based on our crap items. WE PULL THE CRAP ITEMS.
That location score will usually be rescaled to something prettier, since no one would really like to see something like
I go to UC Berkeley, but the MO of universities is to assign a professor to a class that is either within his field of research or is a fundamental part of what he/she does at the school (i.e. physics profs. in math). Professors are there because they are intellectuals and researchers, usually not because they love to teach. Because they have such vast knowledge and probably aren't very good at discerning how much their students are absorbing, they write impossibly hard exams, simply because they can't understand anything more basic. As a result, the top grade might be 58%. But knowing 58% on a difficult test might require the same level of knowledge as it takes to get a 90% on an easy exam.
In everything up till high school it was pretty easy to go back to that 90-80-70 standard because all the material was so elementary teachers could easily make tests to place students in those categories. At universities, the material is so much more complex that it's pointless/impossible to write exams to those arbitrarily defined 90-80-70 categories. Think of an exam in which half of the points are based on knowledge of the fundamentals, and the rest is on complicated, hazy, trivial bits of knowledge. In that case it's totally fine for a 58% to be an A since he understood (probably) most of the fundamentals and a few of the smaller facts.
BTW: Even if you are a reasonably talented individual, I doubt you'll get grades in the 70%-100% range your whole life....
Had an algorithms prof (of all things) give us a test where every question had the following possible answers:
Yes, No, Sometimes, Maybe, Unknown
Then, he had questions like 1. Some scientists believe that P=NP?
To which, of course, you could argue ANY answer is correct.
That being said, this blog post comes across as the usual whining we've all done or had to put up with through the years. No testing methodology is perfect, and everyone tests differently on different kinds of tests. Fact is, though, they're pretty damn good. It's a common belief that millions of people who are otherwise idiots are graduating with great grades, while millions of geniuses can't test well - but that's horseshit. The majority of people manage to test at their level of understanding. The fact that people actually notice the odd idiot who guesses well is the exception that proves the rule.
Endless arguments over trivial contradictions in books written by ignorant savages to explain thunder in the dark.
The worst case is if the experts will also start doing this: trying to offload the patient - and therefore the risk - to someone else as soon as possible. That will lead to the people with actual serious illnesses dying, since no one will actually diagnose them in their hurry to send them to someone else before they have a chance to die on them.
Have you ever seen that episode of Scrubs where they take that wealthy hospital donor to every department to try to figure out what is wrong with them, but no one knows?
It turned out the best solution was to do nothing at all (which it turned out the protagonist did because he simply did not know what was wrong) and the problem went away.
Had they actually done something, it might have caused more of a problem than the correct answer of "do nothing".
"I am the king of the Romans, and am superior to rules of grammar!"
-Sigismund, Holy Roman Emperor (1368-1437)
I don't care what they don't know.
I give multiple choice exams with between 100 and 200 questions, and 4 possible answers.
Each correct answer is worth 2 points; they need to answer 50 correctly to get 100.
They don't HAVE to answer any question, or any number of questions. If they can answer 30 questions, they can get a D. Any question answered incorrectly is -1 point. This serves two purposes.
It prevents guessing, and it forces the student to consider whether they actually know the answer, or just think they do.
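A rough check of how that scheme treats guessing, assuming a guess is uniformly random among whatever answers the student hasn't ruled out (the function name and the `n_remaining` parameter are illustrative, not part of the exam itself):

```python
from fractions import Fraction

POINTS_RIGHT = 2   # each correct answer is worth 2 points
POINTS_WRONG = -1  # each incorrect answer costs 1 point

def expected_points_from_guess(n_remaining):
    """Average points from one random guess among n_remaining candidate answers."""
    p_right = Fraction(1, n_remaining)
    return p_right * POINTS_RIGHT + (1 - p_right) * POINTS_WRONG

for n in (4, 3, 2):
    print(n, expected_points_from_guess(n))
```

Under those assumptions a blind guess among all 4 answers loses a quarter point on average, narrowing to 3 breaks even, and narrowing to 2 gains half a point. So the penalty discourages blind guessing while still paying off for partial knowledge.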
I typically give 4 of these per semester. After the first one I usually get several complaints because they're not used to testing in this way. After the second I usually get one or two stating they can't break the habit of answering every question. After the final, I get many compliments and high marks on my evaluations, and the students tell me they are much more confident in what they've learned than from any other class. I've had occasion to run across previous students from years past, and they claim they still remember more from my class than from others.
I've had administrators forbid me to do it this way. I did it anyway. When they saw the results, they relented, and many suggested the process to others.
"I may be synthetic, but I'm not stupid." -- Bishop 341-B
I have a lot of experience with this lately, having come down with an odd virus that had no treatment but was/is excruciatingly painful. There may be no treatment available, but I wager the vast majority of these folks who go to a doctor but have nothing wrong with them DO have some symptom or another... for me, getting the symptom treated is almost equally as important as having the cause treated, as I probably wouldn't have gotten out of my chair without it. One doctor recently seemed much more concerned with the cause and the symptom was nearly an afterthought -- as a result, I was in a lot of pain for 24 hours with no way to fix it. He saw the antibiotic as more important (though it ultimately turned out not to be bacterial), but I saw something for pain to be something that should have happened immediately.
Another thing -- most people want to feel like the doctor at least LOOKED for something. One doctor I went to recently made me wait 40 mins to see him and then looked at me for like 30 seconds and prescribed something. Yes, that makes sense if you know what it is straight off and know what to do about it, but you might just wanna look for other things that I /didn't/ mention, in case I have more than one thing or in case there are different diagnoses that have similar symptoms except for a couple.
I agree with everything you said except this part:
"A multiple choice question might only have one right answer and its point value is the exact same as that of something much easier (especially, when on the harder one, the wrong choice might even be 'righter' than the correct choice on the easy question) -- but that's why there is an entire field of psychometrics out there to ensure that these sorts of exams are doing what they say they are."
Seems to me like that is more an example of psychometricians being forced to accept a less than valid form of test scoring. The proper way to do things has to incorporate Rasch's principle that the likelihood that a given test-taker will give the correct answer (on a question that is valid for the quantity it is being used to measure) depends on the product of the easiness of the question and the ability of the test-taker. For that matter, lumped scores (pass-fail, ranking, or absolute) on professional proficiency exams - which by their nature must test disparate quantities with various non-linear contributions to professional qualification - cannot properly be interpreted as measurements of anything without a well-thought out unified criterion that describes the contributions and dependencies of the various quantities measured by the questions to the overall measurement of professional competence.
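A minimal sketch of that principle, for anyone who hasn't seen it written down. In the multiplicative form the odds of a correct answer are ability times easiness; the more common logistic form uses a person location `theta` and an item difficulty `b` on a log scale. The function and parameter names are illustrative, not taken from any particular software.

```python
import math

def rasch_p_from_product(ability, easiness):
    """P(correct) when the odds of success equal ability * easiness."""
    odds = ability * easiness
    return odds / (1.0 + odds)

def rasch_p_logistic(theta, b):
    """Same model on the log scale: theta = log(ability), b = -log(easiness)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta, b = 1.5, 0.5  # made-up person location and item difficulty
print(rasch_p_logistic(theta, b))
print(rasch_p_from_product(math.exp(theta), math.exp(-b)))  # same value
```

When a person's location equals the item's difficulty, the chance of a correct answer is exactly one half; that is what puts people and items on the same scale.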
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
A normal person may score "wildly differently" on a 300-question exam from one attempt to the next, but the variance will be based more on differences in preparation, physical and mental comfort, stress, and how much sleep he got the night before.
The article's math is actually pretty pathetic. For one, he assumes that a person who knows half as much will guess just as accurately. For another, his entire point seems to be based on implying that the less knowledgeable person has a good chance of scoring as well as the more knowledgeable one, but he only calculates that probability for a trivial, extreme case. Why doesn't he tell us the probability for one of the more reasonable cases he describes? Either he never bothered calculating those probabilities, or he decided they weakened his case. I don't particularly care which one; his credibility is approximately zero either way.
There's a very easy solution. Require essay tests. Make sure at least 10% of your class doesn't complete the test (but A students get done 20-30m early on a 2 hour test).
It works like a charm, and weeds out exactly who needs weeded out.
Here's a hint. After the first exam of a semester in a weed out class, the instructors can predict with a 98% accuracy your final grade. Your test taking skills mean shit. If you're not bright when we talk to you, if you're slow on the uptake, you won't do well. The tests are designed to avoid giving you passing grades. Of course some people break the mold, but these people are always super-intelligent and have to put an absurd amount of effort in.
Then, the rest of the semester is really about teaching material, the scores don't really matter that much.