The Fallacy of Hard Tests
Al Feldzamen writes in with a blog post on the fallacious math behind many specialist examinations. "'The test was very hard,' the medical specialist said. 'Only 35 percent passed.' 'How did they grade it?' I asked. 'Multiple choice,' he said. 'They count the number right.' As a former mathematician, I immediately knew the test results were meaningless. It was typical of the very hard test, like bar exams or medical license exams, where very often the well-qualified and knowledgeable fail the exam. But that's because the exam itself is a fraud."
What a worthless post. He gave one situation where guessing is more important than knowledge, but didn't at all address the specifics of the tests he was talking about. A typical vapid blog that for some reason gets posted to /.
Stories like this could never get on Slashdot. Seriously, this is like a maths problem I'd give to my Year 9 kids. This is definitely not news, and certainly doesn't matter.
As a medical student, I know how much our education is divided into what we do in real life, and what is the proper answer for exams. Quite often, during our education exercises, we're given scenarios like "A patient presents with symptoms X, Y and Z. What do you do next?". At that point, the resident says "You would diagnose condition A from those symptoms, but for the exam, you'd say you'd get an MRI to rule out B". So many questions basically come down to having an intuition for where the question is guiding you, rather than practical medicine. Often, it's extremely difficult to discern what the question wants. There will be some question along the lines of "A patient presents with general fatigue over the past 3 months, which one blood test do you want to order?" and you'll narrow down the answer choices to either thyroid stimulating hormone or a complete blood count. Both studies are equally important in the evaluation of fatigue, but the question wants you to know which one is more important. In real life, you would always get both because both conditions are fairly common, and you want to evaluate both at once to save the patient time and effort. However, the question will nail you if you don't know some obscure study which states that there is like a 1% difference in the incidence of hypothyroidism vs anemia in fatigue. Moreover, if you were on the hospital floor and you were to say "I'm getting only a CBC, because it's more likely," the resident would chide you for not considering hypothyroidism and getting the thyroid stimulating hormone as well, making you look bad. So yeah, learning for the test doesn't really ever end.
If anything, testing has become FAR FAR too easy; people pass CS courses and come out the other side with only a vague notion of how a computer works.
I won't claim his post is correct or not, but he claims the technology behind such tests is wrong and lets less educated people pass through by guessing, while more educated people try to pass without guessing and fail.
People see the tests produce poor selection, and make the tests harder and harder in an attempt to remedy this (but that won't work, since it's the technology of the test that's wrong).
Then you come here and support his opinion 1:1 by claiming tests are too easy (i.e. should be harder) and idiots pass through.
Ironic, isn't it.
I had a teacher in college who gave very hard tests. Intel Assembly class. For a midterm, we had to decipher Object-Oriented Assembly, and decipher self-modifying code. After 3 weeks of introduction to Assembly.
I got an A, with an average of 58% in the class.
For the 2-hour final, he got up at the 1-hour point, and yelled: "The test is over. All pencils down." We just sat there dumbfounded for about 10 seconds, and then he said, "Just kidding. I always wanted to do that."
Ya, a real great pal there!
Worst teacher I had in college. He didn't last long.
Don't steal. The government hates competition.
I once had a test that had a check box for how confident you were your answer was correct, that affected your score the following way:
If you ticked "confident" and you were wrong, -2
If you ticked "confident" and you were right, +2
If you ticked "unsure" and you were wrong, -0
If you ticked "unsure" and you were right, +1
I guess the point is that it's advantageous to guess, but only if you choose the lesser-scoring option.
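To see when ticking "confident" actually pays, here is a rough Python sketch of the expected score per question under the payoffs listed above. The function names and the idea of plugging in your own probability of being right are my assumptions, not anything from that test.

```python
from fractions import Fraction

# Payoffs copied from the scheme described above.
PAYOFF = {
    "confident": {"right": 2, "wrong": -2},
    "unsure": {"right": 1, "wrong": 0},
}

def expected_score(choice, p_right):
    """Expected points for one question if you are right with probability p_right."""
    pay = PAYOFF[choice]
    return p_right * pay["right"] + (1 - p_right) * pay["wrong"]

def best_choice(p_right):
    """Box with the higher expected score (ties go to 'confident')."""
    return max(PAYOFF, key=lambda c: expected_score(c, p_right))

for p in (Fraction(1, 4), Fraction(1, 2), Fraction(2, 3), Fraction(3, 4)):
    print(p, expected_score("confident", p), expected_score("unsure", p), best_choice(p))
```

"Confident" is worth 4p - 2 and "unsure" is worth p, so the two break even at p = 2/3. A blind guess among four answers (p = 1/4) still earns a positive expected score under "unsure", which is the point made above.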
TFA makes sense. Observe:
News for nerds?: yes[ ] no[x]
Stuff that matters?: yes[ ] no[x]
Clearly the editorial process is fraudulent - as this is a multiple choice, it is obvious that guessing tends to count much more than knowledge.
From this we can conclude one of two things:
1) Zonk is bad at guessing
2) The author is speaking out of his ass
Tempting as it is, I am going to stick with 2... But I could, of course, be guessing.
Having done experiments on MBA students
See, I KNEW they were good for something. Let me guess, the reason you opted for MBAs over mice is that there are far fewer protests when you do cruel medical experiments on MBA students than on mice, correct?
Mr. Feldzamen claims to have passed the Virginia bar exam, but I can't find any evidence he was ever admitted to the Virginia bar, or to any state bar (he's not in Martindale-Hubbell). He cites the Virginia bar exam -- which I also passed (IAAL, licensed to practice in CA and VA) -- as one of his examples of a "complete fraud." In fact, when I took the Virginia bar exam it had over a dozen one-hour essay components, testing each and every possible subject. By contrast, the California bar exam had essay tests covering six randomly chosen subjects out of a possible 15 or so, and it had other non-multiple-choice components. The multiple-choice section of every state's bar exam, the Multistate Bar Exam, is no walk in the park. So I don't understand how he includes bar exams in his claim that the tests are invalid. If anything, the low pass rate of bar exams, typically 50% or less among a candidate pool of mostly recent law school grads, suggests that they are very hard indeed.
I find the fact that medical and lawyer exams are based on multiple choice rather disturbing. As an engineer, almost all of my tests were long answer. Sure, some multiple-choice questions, but mostly show all your work or explain the whole process. And I just design systems and networks! Now someone can just luckily guess enough multiple choice questions and start slicing me up?
Like I said, disturbing.
Vote monkeys into Congress. They are cheaper and more trustworthy.
Jesus christ, hopefully you didn't get the job, it was harder than fuck to understand what the hell you just said.
Fate, it seems, is not without a sense of irony.
The problem there is that averages are one thing, but in practice there still is a non-zero chance that he'll actually score higher than you do.
Let's say it's 20 questions, 4 possible answers each. He'll know 5 of those, has to guess 15. There's even a 1 in a billion chance that he'll get all 20 right. (4^15 = 2^30 = approx 1 billion.) If you gave that test in China, by now you'd have at least one guy who pulled exactly that stunt.
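For anyone who wants to check the arithmetic, a small Python sketch of the guessing odds. The "at least 6 of 15" case is a made-up illustration (it assumes the other candidate scores 10 out of 20), not a number from the thread.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k right out of n independent guesses, each right with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

GUESSED = 15     # questions he has to guess
P_GUESS = 1 / 4  # four possible answers per question

# Chance that every one of the 15 guesses is right.
p_all = P_GUESS ** GUESSED
print(f"all 15 guesses right: 1 in {round(1 / p_all):,}")

# Hypothetical: to beat a score of 10/20 with only 5 known answers,
# he needs at least 6 of his 15 guesses to land.
print(f"at least 6 of 15 right: {prob_at_least(6, GUESSED, P_GUESS):.3f}")
```

The first figure is the long shot from the example above; the second shows that merely outscoring a better-prepared candidate is far less exotic than a perfect run.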
There's also the issue of how well those questions fit your and his domain of knowledge. Let's say you can't possibly test _all_ the questions, because that's usually the case. You can do it for state capitals, but you can't possibly cover a whole domain like medicine or law.
There are 50 states, you know 25, the other guy knows, say 12 (rounded down), so it's not impossible that the 20 questions are all from the 25 you don't know, but include all 12 that guy knows. In fact, assuming a very very very large domain (much larger than 50, anyway), there's about 1 in a million chance that all 20 questions will be from the 50% you don't know.
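A quick Python check of that estimate, comparing the "very large domain" approximation with the actual 50-state case, where questions are drawn without replacement. The function name is mine, and it assumes every set of 20 questions is equally likely.

```python
from math import comb

def prob_all_unknown(domain, known, asked):
    """P(every asked question falls in the part you don't know), drawing without replacement."""
    unknown = domain - known
    return comb(unknown, asked) / comb(domain, asked)

# Very large domain, half of it known: each question is an independent coin flip.
p_large = 0.5 ** 20
print(f"large domain: 1 in {round(1 / p_large):,}")

# Exactly 50 states, 25 known, 20 asked.
p_states = prob_all_unknown(50, 25, 20)
print(f"50 states:    1 in {round(1 / p_states):,}")
```

With only 50 states the odds are much longer than 1 in a million, because each unknown state that gets asked leaves fewer unknown ones in the pool. The 1 in a million figure only holds when the domain is much larger than the test.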
Now, when testing states, that has no deeper significance, because (at least theoretically) all states are equally important. In other domains, like medicine, law, even CS, that's not the case: stuff ranges from vital basics to pure trivia that no one gives a damn about. (Or not for the scope of the problem at hand: e.g., if I'm hiring a Java programmer, asking questions about COBOL would be just trivia.)
And a lot of "hard tests" are "hard" just by including inordinate amounts of stuff that's unimportant trivia. E.g., if I'm giving a test for a unix admin job, I can make it arbitrarily "hard" by including such trivia as "in which directory is Mozilla installed under SuSE Linux?" It's stuff that won't actually affect your ability to admin a unix box in any form or shape. The fact that SuSE does install some programs in different directories is just trivia.
(And if that sounds like a convoluted imaginary example, let's say that some "hard" certification exams ask just that: where is program X installed in distribution Y? And at least one version of Sun's Java certification asked such idiotically stupid trivia as in which package is class X, or whether class Y is final. Who cares about that trivia? It's less than half a second to get any IDE to fill in the package for you. E.g., in Eclipse it just takes a CTRL+SPACE.)
And in view of that previous point, including trivia in an exam just to make it "hard" is outright counter-productive. There is a non-zero chance that you'll pass someone who memorized all the trivia, but doesn't know the basics.
Not all knowledge is created equal, and that's one point that many "hard" exams and certifications miss. If a lawyer doesn't know the intricacies of Melchett vs The Vatican, who cares? In the unlikely situation that they need it, they can google it. If they don't understand Habeas Corpus, on the other hand, they're just unfit to be a lawyer at all. Cramming trivia into an exam can get you just that kind of screwed up situation: you passed someone who happened to know that Melchett vs The Vatican is actually a gag question, and that case name appears in Stephen Fry's "The Letter", yet flunked someone with a solid grasp of the basics and who knows how to extrapolate from there and where to get more information when he needs it.
Rewarding random guesswork is worse. Probably the most important thing one should know is what he _doesn't_ know, so he can research it instead of taking a dumb uninformed guess. Most RL problems aren't neatly organized into 4 possible answers, so it can be a monumental waste of time to just take wild guesses and see if it works. I've seen entirely too many people wasting time trying wrong guess after wrong guess, instead of just doing some research. E.g., I've actually witnessed a guy trying every single bloody combination between *, & and nothing in front of every single variable in a C function, because he never understood how poin
A polar bear is a cartesian bear after a coordinate transform.
Ugh. I just wrote a pretty polite reply at his page after skimming his idiotic article. Now that I've read it, I'm actually angry.
This guy knows NOTHING about testing. Nothing. He isn't even to the level of Classical Test Theory (CTT), which is really not much more than means and Pearson correlations, and is nowhere near how high-stakes (and even medium- and low-stakes, increasingly) multiple choice (MC) tests work now, and how they have worked for many many years.
IAAP (I am a psychometrician). A big part of what I do for a living is design a particular MC test, pilot the items, and interpret the results. But I don't just count up the correct items and give you the percentage. Why? Because that would be insane. You can guess on those.
Oh, but he says this:
But suppose the grading attempts to adjust for guessing. There is no way of knowing what is in the mind of the test-taker, so the customary is to subtract, from the number correct, some fraction of the number wrong.
--Which is just fine until I tell you I have NEVER heard of dealing with guessing that way on a professional-level test.
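For reference, the adjustment he is describing is usually called formula scoring. A minimal Python sketch, assuming the "fraction" is 1/(k - 1) for k answer options, which is the textbook choice; the quoted article does not say which fraction it means.

```python
def formula_score(right, wrong, options):
    """Number right minus wrong/(options - 1); blank answers are not penalised."""
    return right - wrong / (options - 1)

# 20 questions, 4 options each. A pure guesser expects 5 right and 15 wrong.
print(formula_score(right=5, wrong=15, options=4))

# Someone who knows 12 answers and leaves the other 8 blank keeps all 12.
print(formula_score(right=12, wrong=0, options=4))
```

The penalty is sized so that blind guessing has an expected score of zero. Again, that is the classroom version, not what professional-level tests do.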
As a general rule, we don't do any easy mathematics. At all.
Here is part of the output for a test I'm working on right now:
This is generated by RUMM2020, a tool for Rasch analysis. The Rasch model was developed in the 60s as an ideal model of item response. These are the stats on 3 items of this test. The two most important columns are Location and Probability.
The location is the item difficulty. Given the sample's performance on this item, and given their ability, how hard is this item? Item 35 is quite difficult; item 36, quite easy.
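For readers who haven't met the model: in the dichotomous Rasch model, the chance of a correct answer depends only on the gap between the person's ability and the item's location. A minimal Python sketch; the example locations are made-up numbers, not the values from the output discussed here.

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct) under the dichotomous Rasch model, both values in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item locations for illustration only.
hard_item, easy_item = 2.0, -2.0
for ability in (-1.0, 0.0, 1.0):
    print(ability,
          round(rasch_probability(ability, hard_item), 3),
          round(rasch_probability(ability, easy_item), 3))
```

When ability equals the item location the probability is exactly 0.5, which is what puts people and items on the same scale.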
The probability is the p value for the chi square. Basically, if it's 0.05 or below, that item is operating significantly (statistically significantly, that is) outside of the model. It displays poor "fit." We generally toss these items before going on to the next step (ideally, these are weeded out during pilot testing, before the test goes live--in this case, it is an experimental test of a construct I'm not even sure exists anymore, but I digress). If an item has poor fit with the model, it is too much of a loose cannon, and its results cannot be trusted. This is what the benighted blogger (is there any other kind?) was whining about. That item is hard not because it is good, but because it is evidently stupid. The responses are all over the place, which means people were probably just guessing. Out it goes before it ruins any examinees' lives.
The next step is to get person locations. In the case of people, these numbers indicate the person's ability. This is calculated by looking at their performance on the items, given their difficulty (Which is calculated based on people's performance on them! Incestuous! But given a large enough sample, it all works out to a fine enough grain to be useful). Here is the output for the people:
So, the first person didn't do so hot; the last did pretty well (these usually top out at 3ish). As you can see in "DataPts," there were 125 items on this test. I started with 160. Do you hear that, Mr. Unexpected "Truths?" We have your back! We're not just handing you a naked score based on our crap items. WE PULL THE CRAP ITEMS.
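To give a feel for where a person location comes from, here is a rough Python sketch of a maximum-likelihood ability estimate with the item locations held fixed. This is a generic Newton-Raphson illustration with made-up numbers, not a description of what RUMM2020 does internally.

```python
import math

def rasch_p(theta, b):
    """P(correct) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, iterations=50, tol=1e-8):
    """Maximum-likelihood person location for 0/1 responses, item locations fixed."""
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        # All wrong or all right: the likelihood has no finite maximum.
        raise ValueError("no finite estimate for an extreme score")
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_p(theta, b) for b in difficulties]
        gradient = raw - sum(probs)                     # observed minus expected score
        information = sum(p * (1 - p) for p in probs)   # test information at theta
        step = gradient / information
        theta += step
        if abs(step) < tol:
            break
    return theta

# Hypothetical three-item test; a real one would have far more items.
difficulties = [-1.0, 0.0, 1.0]
print(round(estimate_ability([1, 1, 0], difficulties), 3))
print(round(estimate_ability([1, 0, 0], difficulties), 3))
```

The estimate is the ability at which the expected number right matches the observed number right, which is why perfect and zero scores have no finite location.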
That location score will usually be rescaled to something prettier, since no one would really like to see something like