The Fallacy of Hard Tests
Al Feldzamen writes in with a blog post on the fallacious math behind many specialist examinations. "'The test was very hard,' the medical specialist said. 'Only 35 percent passed.' 'How did they grade it?' I asked. 'Multiple choice,' he said. 'They count the number right.' As a former mathematician, I immediately knew the test results were meaningless. It was typical of the very hard test, like bar exams or medical license exams, where very often the well-qualified and knowledgeable fail the exam. But that's because the exam itself is a fraud."
What a worthless post. He gave one situation where guessing is more important than knowledge, but didn't at all address the specifics of the tests he was talking about. A typical vapid blog that for some reason gets posted to /.
Hard tests are meaningless? What's his solution, easy tests where even an idiot can score 100%?
If anything, testing has become FAR FAR too easy: people pass CS courses and come out the other side with only a vague notion of how a computer works.
If you mod me down, I will become more powerful than you can imagine....
Stories like this could never get on Slashdot. Seriously, this is like a maths problem I'd give to my Year 9 kids. This is definitely not news, and certainly doesn't matter.
It's hard to believe this guy is really a mathematician. I read this with interest as I teach college classes and have to give tests. However, there's not much content in the article.
His point about only counting the correct answers is rather silly. In a test where each question is either right or wrong, counting the wrong answers into the score does not add any information (you can tell how many are wrong if you know how many are right). The only thing it does is change the scaling of the resulting scores. This only makes a difference if you have an issue interpreting the scores. He seems to want the scores to be proportional to the amount of knowledge someone has, so that if I have twice as much knowledge as you, my score is twice as high. But in the example case of a professional qualifying examination, all that matters is whether or not you achieve some minimum. Whether that is represented as % correct or as % correct - % incorrect/2 really makes no difference.
Designing better tests generally involves moving beyond multiple choice, not manipulating the scoring process.
2 + 49 - (49 ÷ 2) = 75.5?
Seems like he added rather than subtracted the (49/2). Pretty much ruins the whole argument.
And now we know why this man is a former mathematician. This is just bad math.
Suppose the test is really hard and contains many answers which are wrong, but which can be thought correct by a person who is moderately knowledgeable about the question. Now if you penalize guessing, I may answer 20 questions correctly and 80 with "reasonable" answers which are not correct; my score is 0, assuming 4 choices per question. On the other hand, someone who answers 10 questions correctly and puts random guesses for the other 90 will likely get a score close to 10.
Basically, multiple choice tests which are so hard that even successful candidates will get most questions wrong are worthless. Consider also the potential for undetectable fraud if, say, the janitor cleaning the instructors' room leaks questions in advance.
I haven't had many exams with multiple choice, but my university statistics course was one of them.
:-)
Each question had 5 options, and only one was correct. A correct answer gave 5 points, an incorrect answer gave -1 point.
Now, as the smart reader can guess, 4 x -1 + 5 = 1, so guessing still pays off... especially if one or more of the options are very unlikely to be correct.
Did the teacher design this test incorrectly, since guessing was rewarded? Well, actually, the only test of real-life application of statistical knowledge was to understand this, so those who started to guess, basically demonstrated their statistical knowledge, and I guess that should be rewarded.
One of the questions was about the outcome of a distribution, where the value should be looked up in a distribution table that was used by the course. Only one of the 5 options was in the table as a result value. That made this one easy
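The "4 x -1 + 5 = 1" reasoning above can be restated as an expected value per blind guess. This is only a sketch: the function name is made up for illustration, and it assumes every option is equally likely to be picked.

```python
from fractions import Fraction

def expected_guess_points(options, right_points, wrong_points):
    """Expected points from one blind guess among equally likely options."""
    p_right = Fraction(1, options)
    return p_right * right_points + (1 - p_right) * wrong_points

# Scoring described in the comment: 5 options, +5 if right, -1 if wrong.
print(expected_guess_points(5, 5, -1))  # 1/5 of a point per blind guess
```

Under the same assumption, the guess would only be neutral if a correct answer were worth 4 points instead of 5.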
I am a French student and we very rarely, if ever, have multiple-choice questions (QCM in French) in our exams. When there are some QCM, as in the maths test of the baccalauréat, they count for only a small part of the final grade, and this is very recent. The only QCM-only test I have taken was the TOEFL.
Is it that common in the US? Is it common even outside scientific studies?
If you have 100 questions, and 20 right ones and 20 wrong ones, it leaves 60 unanswered questions.
That's why the article talks about only counting right ones. In order to avoid guessing, there should be a difference between picking a wrong answer and not picking an answer at all.
As a medical student, I know how much our education is divided into what we do in real life and what the proper answer is for exams. Quite often, during our education exercises, we're given scenarios like "A patient presents with symptoms X, Y and Z. What do you do next?" That's when the resident says, "You would diagnose condition A from those symptoms, but for the exam, you'd say you'd get an MRI to rule out B." So many questions are basically about having an intuition for where the question is guiding you, rather than about practical medicine. Often, it's extremely difficult to discern what the question wants. There will be some question along the lines of "A patient presents with general fatigue over the past 3 months; which one blood test do you want to order?" and you'll narrow down the answer choices to either thyroid stimulating hormone or a complete blood count. Both studies are equally important in the evaluation of fatigue, but the question wants you to know which one is more important. In real life, you would always get both, because both conditions are fairly common and you want to evaluate both at once to save the patient time and effort. However, the question will nail you if you don't know some obscure study which states that there is something like a 1% difference in the incidence of hypothyroidism vs anemia in fatigue. What's more, if you were on the hospital floor and you were to say "I'm getting only a CBC, because it's more likely," the resident would chide you for not considering hypothyroidism and getting the thyroid stimulating hormone as well, making you look bad. So yeah, learning for the test doesn't really ever end.
This is really a question of statistics, not of mathematics. Having done experiments on MBA students, we found that a well written multiple choice question is more accurate than 4 well written essays. The fact that we can easily have 50 multiple choice questions and a maximum of 8 essays makes it a no-brainer that multiple choice is much more accurate.
So it isn't a matter of how you reward guessing (psychologists will say that rewarding guessing actually gets better accuracy). It is a question of how well written the questions are. Further, the pass rate has absolutely nothing to do with the fraction needed to pass. Even high school students understand this one. So he seems totally confused.
He really flubbed his math in the second-to-last paragraph. Sort of eats your credibility.
If you have a true/false test, penalizing wrong answers by the full amount would statistically eliminate the advantage of guessing. If you add "true/false/dontknow" and don't penalize for dontknow, you could penalize more heavily those who either guess or have a wrong understanding of the subject. I'd rather my doctor not know and have to check than just hand me the wrong treatment.
Besides the teachers can still have tight score limits on passing. As long as it's far removed from the 50% guess-rate, there is no huge bias towards guessing, as the author seems to suggest.
IMHO, the real beef of true/false tests is that most subjects are not well suited to be answered just by true/false.
His basic assumptions are so retarded as to invalidate his own thesis. Yes, depending on the difficulty of the exam, the range between the best and worst candidates will narrow. But the effect of guessing only becomes important in the extreme cases he looks at (impossibly hard test vs. impossibly easy).
And who the hell sets multiple choice questions with only two options? Rerun the numbers with five options and report back. You'll find the guesser is far more severely punished.
Besides pass rates as indication of the difficulty of exams is a myth. Set any exam with even the slightest differentiation, and you can have whatever pass rate you like. You just pick your passing grade appropriately.
Athletic Scholarships to universities make as much sense as academic scholarships to sports teams.
...the reasoning is... incomplete because it is based on an undefined variable ("knows twice as much as" (that itself is no easy task to measure)), and excludes the reasoning that, if a test is on the single subject the testee's 'level of knowledge' is "calculated" on, he with more knowledge/experience in that subject and its workings as a whole would have a greater chance of "guesstimating" correctly on the questions he was unable to answer with 100% certainty. Even more so if the test isn't a fixed set of true/false questions.
I'm sure it is possible to reduce such questions to mathematical formulae, but the algorithm would be m~u~c~h more complicated, and even then I think we could only be hinting at "closest averages".
No, no sig. Really.
ThePromenader
Though some of his logic was overblown (see the comments made directly on his blog), I think his larger point has some merit. In fields which require lots of studying before beginning as a professional, such as medicine and law, you always hear that you have to be absolutely brilliant to 'get in'. The fact of the matter is that this is not the case: you should be darn smart, but you needn't be the best student in the world to be successful as a doctor. Many of the students who go to law or medical school (I'd guess most) are completely qualified for positions in their respective fields, but by the same token, are not necessarily any more qualified than their peers: they've all studied the same material, had the same experience in the lab, and know the whole picture within a reasonable approximation of each other.
Yet to maintain the level of exclusivity that these careers have, there must be some way to select a subset of the candidates to proceed, and at this point, there are few distinguishing features among them. Some will be far and away brilliant, and will easily get a career regardless; but the majority can't be differentiated from one another. So, how should it be decided who is a doctor and who isn't? By making a test that's so hard it amounts to a randomising function, and then selecting a subset of top scorers to pass. Passing doesn't mean one is inherently more qualified; it just means one guessed better on that day. This also explains why people can pass on their second or third try: they are no better than their competitors the next time around, but eventually one will guess luckily, and get in. It'd be interesting to do some statistical analysis on how many tries it takes people to 'pass' a particular exam, and see if the results fit probabilistic models: If the results of such analysis fit too well, the test is too hard, whereas if they deviate greatly from probabilistic expectations, then the test is more likely to be an actual test of one's knowledge.
To be sure, there will be some individuals who can pass based entirely on their knowledge, just as there will be some individuals who simply aren't cut out for life as a lawyer that will fail the exam. But ultimately, it allows the higher-ups to select candidates for job positions based on the single indisputable criterion of the candidate having passed an exam, thus avoiding any messy issues when someone complains about them choosing a particular candidate in lieu of one better qualified.
Time for a terrible analogy, since it's 0300 here: Really hard exams are the bouncers at the door to the club of medical careers.
much like: Which is most right? 2+2= a) 3 b) 3.4 c) 9 d) 4.1
no answer is correct, but one has to choose the closest thing... but perhaps it's the closest without going over? etc... so many ways to interpret some of the questions.
In many professional specialties, including law and medicine, there are times when a quick, decisive educated guess may produce better results than an exhaustively researched, definitively confirmed answer.
So tests that force students to do a lot of guessing may still be good tools for evaluating their professional qualifications.
A doctor or lawyer who can guess right may be superior to one who plods to the right answer only after many expensive lab tests or hours of legal research. That's not to say that doctors and lawyers shouldn't do lab tests and research -- of course they should. But there are many situations, especially time-sensitive ones, where quick judgment is more important than absolute knowledge: during surgery or a health crisis, during a trial or deposition, etc.
I had a teacher in college who gave very hard tests. Intel Assembly class. For a midterm, we had to decipher Object-Oriented Assembly, and decipher self-modifying code. After 3 weeks of introduction to Assembly.
I got an A, with an average of 58% in the class.
For the 2-hour final, he got up at the 1-hour point, and yelled: "The test is over. All pencils down." We just sat there dumbfounded for about 10 seconds, and then he said, "Just kidding. I always wanted to do that."
Ya, a real great pal there!
Worst teacher I had in college. He didn't last long
Don't steal. The government hates competition.
I once had a test that had a check box for how confident you were your answer was correct, that affected your score the following way:
If you ticked "confident" and you were wrong, -2
If you ticked "confident" and you were right, +2
If you ticked "unsure" and you were wrong, -0
If you ticked "unsure" and you were right, +1
I guess the point is that it's advantageous to guess, but only if you choose the lesser-scoring option.
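To see when ticking "confident" is worth it under the scoring listed above, here is a small sketch. The function name and the idea of a single probability `p_right` of being correct are illustrative assumptions, not something the test itself specified.

```python
from fractions import Fraction

def expected_points(p_right):
    """Expected points for each box, given the chance p_right of being correct."""
    confident = 2 * p_right - 2 * (1 - p_right)  # +2 if right, -2 if wrong
    unsure = 1 * p_right                         # +1 if right, 0 if wrong
    return confident, unsure

for p in (Fraction(1, 4), Fraction(1, 2), Fraction(2, 3), Fraction(9, 10)):
    confident, unsure = expected_points(p)
    print(f"p={p}: confident={confident}, unsure={unsure}")
```

Under these assumptions the two boxes break even at p = 2/3. Below that, "unsure" has the higher expectation, which matches the point about guessing with the lesser-scoring option.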
Cue the underground brilliance of every Slashdot troll claiming that he is no less than a genius and that nothing truly mentally stimulating can be classified as difficult.
-tyfighter
TFA makes sense. Observe:
News for nerds?: yes[ ] no[x]
Stuff that matters?: yes[ ] no[x]
Clearly the editorial process is fraudulent - as this is a multiple choice, it is obvious that guessing tends to count much more than knowledge.
From this we can conclude one of two things:
1) Zonk is bad at guessing
2) The author is speaking out of his ass
Tempting as it is, I am going to stick with 2... But I could, of course, be guessing.
You mean it ain't me noggin, it's me teachers?
As many have posted, this blogpost is mostly pretentious at best. However, in the post he states:
Now suppose the test is very hard. As hard as it could be actually. Suppose the test is so hard that I, with lesser knowledge, can only answer one question based on actual knowledge. I answer that question, and guess at the other 99. You, who know twice as much as I, can answer two questions based on knowledge. So you guess at 98 answers. As you can readily imagine, the odds of you getting a higher grade than I are very slight. In fact, over 45 percent of the time, in repeated trials, I would outscore you, even though my knowledge is half that of yours.
I'd like to point out the simple fact: in reality we don't worry about those who are two or three times as smart as the rest, their levels of knowledge are mostly indistinguishable (as pointed out by the blogpost, albeit shakily), but we are looking for the many magnitudes of times smarter than the rest (so smart in fact that they surpass the flaws that he has pointed out). And that's where those taking 'hard tests' succeed and others do not. All these flaws are arguably non-existent but even in supposing their existence, it would do nothing to correct these flaws in our ability to be able to separate those capable of Med School or Law school and those who are not. Those capable, those in the very upper-ring, are just so capable that they surpass the very flaws of the test itself.
So it's not just a "Typo" that distracts, it supports a completely faulty conclusion.
One is left wondering what kind of mathematics background the author had. Also, noting the dittography earlier ("question question"), whether proofreading or "checking your work" formed part of the author's training.
In any case, the post also assumes that test-makers don't spend an awful lot of time validating their tests; so instead of taking the rules from any given test, a couple of straw-men examinations are supplied.
Consider, for example, the case where a multiple-choice test featuring 4 possible answers penalizes wrong answers by one third of a point. In that case, blind guessing is not advantageous; a guess only has a positive expected value once the examinee can eliminate at least one answer: hence "partial knowledge" can count for something.
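A rough check of where the break-even sits for the 4-option, one-third-penalty rule described above. The helper below is hypothetical and assumes the guess is spread evenly over whatever options remain after elimination.

```python
from fractions import Fraction

def guess_value(options=4, penalty=Fraction(1, 3), eliminated=0):
    """Expected points from guessing among the options not yet ruled out."""
    remaining = options - eliminated
    p_right = Fraction(1, remaining)
    return p_right - (1 - p_right) * penalty

for eliminated in range(3):
    print(eliminated, guess_value(eliminated=eliminated))
# 0 0
# 1 1/9
# 2 1/3
```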
Oh yeah, and who the heck said that the test was "hard" because most of the answers were unknown? Heck, if you look at your big standardized tests (such as the SAT), and just the multiple-choice parts, you'll find that, for those who take the test more than once, there's not much noise in their scores. So why should a medical exam be different?
I love the exams we had: a question was posed or a problem stated which required the knowledge we had learnt to solve it. Sometimes more than one question was asked, to offer a lead. But no answers were given. Those are real tests. Applied knowledge. Usually, in a multiple-choice test, with a very basic knowledge of the subject you can sort out, for many questions, which response is the most probable. This is how I breezed through my English multiple-choice exams at the university, and hell, look at how bad (or how good ;)) my English is. Face it: multiple choice might be an easy way out for professors to correct exams, but it is the poorest choice for testing the student's knowledge and ability to reason.
C. Sagan : A demon haunted world:
http://www.amazon.com/gp/product/0345409469/
visit randi.org
In the long run he will score 1 + ( 99 * 0.5 ) = 50.5.
My expectancy is 2 + ( 98 * 0.5 ) = 51.
Seems I score more.
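Both the expected scores above and the "over 45 percent" figure quoted from the blog post can be checked directly. This is a sketch under the post's own assumptions (100 true/false questions, fair-coin guesses); the function names are invented for illustration.

```python
from math import comb

def expected_score(known, total=100):
    """Known answers plus half of the coin-flip guesses, on average."""
    return known + (total - known) * 0.5

def prob_weaker_outscores(weak_known=1, strong_known=2, total=100):
    """Exact chance that the weaker candidate ends up strictly ahead."""
    n_weak, n_strong = total - weak_known, total - strong_known
    prob = 0.0
    for x in range(n_weak + 1):          # lucky guesses by the weaker candidate
        px = comb(n_weak, x) / 2**n_weak
        for y in range(n_strong + 1):    # lucky guesses by the stronger candidate
            if weak_known + x > strong_known + y:
                prob += px * comb(n_strong, y) / 2**n_strong
    return prob

print(expected_score(1), expected_score(2))  # 50.5 51.0
print(prob_weaker_outscores())
```

The averages favour the better candidate by only half a point, while the spread from guessing is several points wide, so a single sitting says little about which of the two knows more.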
Only three things are certain; death, taxes, and apocryphal quotations - Ben Franklin.
A close second was counting negatives. To make easy concepts more difficult on tests, professors would often throw in layers of negative concepts (which of these isn't...). As I took the test, I'd count on my fingers while saying negative, positive, negative, positive ad nauseam. Once, I counted five negatives in one question and its correct answer.
This wasn't testing my knowledge of the subject being taught. It was just seeing how well I could parse.
Of course essay tests are much more difficult to administer, though they are better indicators of your grasp of the subject.
I had a physics professor for two entire physics series. This man was... a machine. He was VERY intelligent, and was a VERY good teacher. He was, however, quite anal. He would not expect you to know things he hadn't taught, but he expected you to know what he had taught with *perfect* mastery.
He provided copies of all former tests, along with answers and how to solve them, to the local copy store for students to buy (amazingly, this prof DIDN'T try to take you on them, the only cost was of the copies). The tests changed VERY little over the years. Two nights before the exam, you were welcome to go to a study session, where he would take problems VERY similar to what would be on the test, and walk you through solving them. And he would let you take a 4x5 card into the test with you, with anything you wanted.
His tests consisted of three questions. Just three. At the end of the allotted hour for the exams, the majority of the class would NOT have finished. Those tests were *tough*. I also had a calculus professor who would give exams that consisted of just two problems, and few people finished them completely. That wasn't so much that he gave hard problems, just problems that took a lot of work to solve. That almost seems backwards, since the point of calculus is to make difficult problems easier... or at least possible, anyway. But he was VERY generous with partial credit.
Oh, you're not stuck, you're just unable to let go of the onion rings.
What do you call a person who graduated at the bottom of their class in medical school?
A doctor...
It's a joke. Before you say anything, a doctor isn't a doctor until they pass the medical board tests.
Don't allow yourself to dream away time. Be productive. -- Some fortune cookie
Look, it's all bad 19th century design.
If the question said, pick the 'odd ones out', each worth n%, it's better.
There is no wrong or right unless the answer says so. But did the person designing the questions have a degree in writing/psychology/reading as well?
It's easy to know who is a rote learner vs a true genius; even Hawking flunked a lot at school.
Liberty freedom are no1, not dicks in suits.
He's right as far as it goes that a multiple choice test where the recipients know almost none of the answers is not very accurate at measuring their marginal knowledge.
However in my experience, hard multiple choice tests have a different problem..
"hard" can mean that you compare against a curve that's known for that particular test and that the curve has a long enough upper tail to seem to measure something at the upper end. Ie, the last couple of questions as you approach 100% are worth more than the questions before them.
The problem with that is that it seems to me that a common way to make a test have that longer upper tail is to make some of the questions ambiguous bad questions. If there are 10 questions on a test that are poorly designed where a knowledgeable person is likely to pick a "wrong" answer, then you can count on it that VERY few people will get all of the "right" answers. Instant "hard" test!
...there are so many who rabidly defend the 'Blogosphere'. I'm guessing they all have blogs.
Mr. Feldzamen claims to have passed the Virginia bar exam, but I can't find any evidence he was ever admitted to the Virginia bar, or to any state bar (he's not in Martindale-Hubbell). He cites the Virginia bar exam -- which I also passed (IAAL, licensed to practice in CA and VA) -- as one of his examples of a "complete fraud." In fact, when I took the Virginia bar exam it had over a dozen one-hour essay components, testing each and every possible subject. By contrast, the California bar exam had essay tests covering six randomly chosen subjects out of a possible 15 or so, and it had other non-multiple-choice components. The multiple-choice section of every state's bar exam, the Multistate Bar Exam, is no walk in the park. So I don't understand how he includes bar exams in his claim that the tests are invalid. If anything, the low pass rate of bar exams, typically 50% or less among a candidate pool of mostly recent law school grads, suggests that they are very hard indeed.
I find the fact that medical and lawyer exams are based on multiple choice rather disturbing. As an engineer, almost all of my tests were long answer. Sure, some multi questions, but mostly show all your work or explain the whole process. And I just design systems and networks! Now someone can just luckily guess enough multiple choice questions and start slicing me up?
Like I said, disturbing.
Vote monkeys into Congress. They are cheaper and more trustworthy.
A person has heartburn, do you:
A) Perform a colonoscopy
B) Perform open heart surgery
C) Tickle him
D) Fart
E) Refer him to Cowboy Neil
I'm going to Mexico for my next check up. At least you'll get tequila first....
Vote monkeys into Congress. They are cheaper and more trustworthy.
Hah! Ok, how about a test on soap operas or celebrity trivia. Or sports?
They whose government reduces their essential liberties for temporary security, receive neither liberty nor security.
In our first year of engineering school (in France. Call it college, for the USA), our math teacher only did multiple choice exams. I was always floored by how accurate the results of those exams were. Of course, all answers counted, and guesses were punished.
The rumor was that he had done his thesis on the subject of multiple choice exams. Sadly, he is retired now, and newer students no longer benefit from his type of quick and accurate exams.
Misleading titles? Inflammatory blurbs? Keep in mind that Slashdot is a tabloid.
The chess ranking is typically a 3-digit figure. Given two chess players, you can work out approximate odds of one winning from the difference in these figures. The figures are compiled from the games people have won, and the ranks of people they have played against. As in multi-choice tests, each individual question or game has a right (win) and a wrong (lose) solution, and a stalemate (not filling in anything) option. From this we can estimate ranks of people we have not met; we can estimate ranks of people in history; we can even estimate corrections for ranks between cultures. For instance in the 19th century, how might a woman chess player in London (where the culture did not encourage chess) rank against a man from Prague (where cafes typically had chess boards in the tabletop, and most people played with friends and strangers in their lunch hours) had their backgrounds been equal, and assuming a native talent for chess is spread equally? This last point is not obvious - the differences between London and Prague and between women and men may not be wholly cultural, but the others can discuss that ad et ultra nauseam.
No-one designed chess with a perfect solution, and yet we can rank people. The IQ tests started from a similar point. People did not understand what intelligence was, exactly, but if they made tests that seemed to be testing the right sort of thing, and got the best people to design the next round of tests, then it was hoped that an incisive test for intelligence would evolve, even though no-one had defined what intelligence was. Unfortunately, in the early days, what was being tested as 'intelligence' was probably better named "how like minded are you to the white male that designed the test". The test can be as exacting as a chess ranking if you do enough tests, but the figure is less useful because it is not a measurement of something abstract and useful (unless you were IBM in the sixties, looking for white males with short hair that would sing the company song, in which case it was perfect).
There is a further downside to IQ tests. If you sit and stare at them, you can often reason a second or third possible answer using different reasoning. I also have a problem with forms that means I think long and hard about the answer, and then tick the wrong box. The trick seems to be to work really quickly, and let your instincts drive your answer. I have only ever done about three IQ tests, and all of these were done ages ago for job applications to computer companies. The last one must have been twenty five years ago. We had two hours and 300 questions. I deliberately hammered through the questions, and handed the paper in after 40 minutes to avoid the temptation of fiddling with the answers. Incredibly, this seemed to cure my usual error rate with forms, and I got a perfect score. I wasn't any smarter that day - I just happened to be in the zone, I guess.
Didn't get the job, though. They thought I was too scary.
Thinking about the original post, though. The guy claimed that a multi-choice was 'fraudulent'. Isn't 'fraud' where someone is trying to deceive someone else? Multi-choice questions are an attempt to separate the test for the presence or absence of knowledge from the talents of presentation (good handwriting, confident presentation style, etc), but often flawed by laziness in trying to pass off examination skills to a computer. A good multi-choice questionnaire would have to be much longer than such tests usually are to reliably separate the thing you are trying to measure from the noise (think how much measurement goes into a chess ranking, for example). But 'fraudulent'? And this was supposed to be a legal exam? I have my doubts about the original posting. Interesting subject, though.
Just make it 10 possible answers... with 5 or 6 quite obviously wrong answers... for someone who studied...
Therefore the more "educated" the guess, the higher the probability of a reward...
The scenarios sketched in the post are not exactly close to the real world. Multiple choice tests, no matter how hard, can easily be constructed so that the probability of passing the exam simply by guessing is insignificant. For example:
There are 20 questions and you have to be correct on at least 16 of them. If there are just two options on each question, your chances of passing by guessing are 1:170; if there are four options, your chances are 1:2600000.
If you are interested see http://en.wikipedia.org/wiki/Binomial_distributio
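Those odds come from the upper tail of a binomial distribution. A minimal sketch, assuming every question is answered by an independent, uniformly random guess (the function name is made up for illustration):

```python
from math import comb

def prob_pass_by_guessing(questions, needed, options):
    """P(at least `needed` correct) when every answer is a uniform random guess."""
    p = 1 / options
    return sum(
        comb(questions, k) * p**k * (1 - p) ** (questions - k)
        for k in range(needed, questions + 1)
    )

for options in (2, 4):
    odds = 1 / prob_pass_by_guessing(20, 16, options)
    print(f"{options} options: about 1 in {odds:,.0f}")
# 2 options: roughly 1 in 170
# 4 options: roughly 1 in 2.6 million
```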
Sadly it was kdawson who posted this turd. (Or did I miss the memo about trolling slashdot with misinformation that seems to be circulating.)
Summary: 2 + 49 - (49 / 2) = 26.5, not 75.5 as per the article.
I want the minutes of my life spent reading this back. The author rattles off a bunch of crap about his credentials, including math credentials, and then rolls out some bs which pretty much amounts to admitting he is trolling the slashdot submission queue.
Anyone want to refer me to less dumb versions of this site? At this point I'm just waiting for Jon Katz to start posting again.
-- http://thegirlorthecar.com funny dating game for guys
Funnily enough, one of the most hardassed profs I ever had also taught the introductory assembler class. (except for us it was PDP-11 and 68K) His tests were legendary for their difficulty, and the average was somewhere in the 20-30% range. However, it was curved after the fact and was a perfectly valid exam since there was absolutely no opportunity to guess. He gave us self-modifying assembler code too, without telling us such a thing was possible in advance! He also had a unique way of assigning readings. He would say, "Have you read chapter X yet? If you haven't, you're screwed!" Still, despite his apparent sadism we did learn a lot in his class.
In a later course I had a prof who would run our class through proofs that would span 3 or 4 lectures. If you fell asleep once in that period of time you'd be utterly lost. At the end of his proofs he would often say, "Does this make sense? Does everybody get this? If not, you had better think about dropping the course!" (Somehow it was hilarious in his thick Indian accent. He really rolled the 'r' in dropping too.)
This article is nonsense from beginning to end. First, as others have pointed out, the arithmetic is wrong. Second, the point of penalising wrong answers is misrepresented: it's nothing to do with improving the accuracy of scoring for different abilities, but is to minimise the difference between those with equal abilities who choose to guess answers they don't know and those who don't. Finally, the model of abilities is completely wrong. A better student will not only know the answer to more questions, but will be more likely to be right on questions he guesses. So far from swamping the true difference, guessed answers add to the accuracy of the test.
Maybe I didn't read it carefully, but I think his math in that last example is all wrong. It seemed wrong to me. I didn't see how that kind of adjustment could amplify the difference the way it did.
I get the following:
The guy who answers 1 correctly and guesses at 99 ends up with an expected score of 1 + 49.5 - 24.75 = 25.75
The guy who answers 2 correctly and guesses at 98 ends up with an expected score of 2 + 49.0 - 24.5 = 26.5
This is obvious, right?
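For anyone who wants to check that arithmetic, here is a rough Python sketch. It assumes a 100-question true/false test, blind guessing on every unknown question, and the "subtract half the number wrong" adjustment quoted from the article; the function and parameter names are made up for illustration.

```python
def expected_score(known, total=100, choices=2, penalty=0.5):
    """Expected adjusted score: `known` sure answers, blind guesses on the rest."""
    guessed = total - known
    expected_right = guessed / choices           # guesses that land on average
    expected_wrong = guessed - expected_right    # guesses that miss on average
    return known + expected_right - penalty * expected_wrong

print(expected_score(1))  # 25.75
print(expected_score(2))  # 26.5
```

Under these assumptions the gap between the two test-takers is 0.75 points, nowhere near a two-fold difference.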
I wonder where he got his math education. It is fairly simple to show that there exists a mapping between the results on a multiple choice test and "actual knowledge" K=T+|e|, where |e| is the statistical error, accounting for guessing statistically. Subtracting for wrong answers etc. is just "psychology". The statistical uncertainty "e" can easily be reduced below any significant value with more choices and more questions.
The example the author shows maximized the statistical uncertainty of guessing, and is not relevant. To illustrate the point: take the 100 question true/false test.
A) If you give 1 point for correct and no point for wrong, the student will score from S0=50 (randomness) to S0=100 (perfect). Now calculate a new score
S = 2*S0 - 100, and you have results from 0 to 100 (round anything less than 0 to 0).
B) Announce you will subtract 1 point for each wrong. Now you will get scores from T0=0 to T0=100, and your map is just T=T0.
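A small Python sketch of why the two schemes carry the same information, assuming every one of the 100 questions gets answered (so wrong = 100 - right). The function names are illustrative only.

```python
def rescaled_count_right(s0, total=100):
    # Scheme A: count right answers, then map the chance level (50) to 0.
    return max(0, 2 * s0 - total)

def right_minus_wrong(s0, total=100):
    # Scheme B: +1 per right answer, -1 per wrong answer, nothing left blank.
    wrong = total - s0
    return s0 - wrong

for s0 in (50, 75, 100):
    print(s0, rescaled_count_right(s0), right_minus_wrong(s0))
```

For any raw count at or above chance, both functions return the same number, so the penalty changes how the score is presented rather than what it measures.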
don't cut it off www.mgmbill.org
The problem there is that averages are one thing, but in practice there still is a non-zero chance that he'll actually score higher than you do.
Let's say it's 20 questions, 4 possible answers each. He'll know 5 of those, has to guess 15. There's even a 1 in a billion chance that he'll get all 20 right. (4^15 = 2^30 = approx 1 billion.) If you gave that test in China, by now you'd have at least one guy who pulled exactly that stunt.
There's also the issue of how well those questions fit your and his domain of knowledge. Let's say you can't possibly test _all_ the questions, because that's usually the case. You can do it for state capitals, but you can't possibly cover a whole domain like medicine or law.
There are 50 states, you know 25, the other guy knows, say 12 (rounded down), so it's not impossible that the 20 questions are all from the 25 you don't know, but include all 12 that guy knows. In fact, assuming a very very very large domain (much larger than 50, anyway), there's about 1 in a million chance that all 20 questions will be from the 50% you don't know.
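Both of those odds are easy to check. The sketch below assumes each guess is an independent 1-in-4 shot and that each question is drawn independently from a very large domain of which you know exactly half; the variable names are made up.

```python
from fractions import Fraction

# 15 blind guesses, 4 choices each, all correct
p_all_guesses_right = Fraction(1, 4) ** 15

# 20 questions, each independently landing in the half of the domain you don't know
p_all_from_unknown_half = Fraction(1, 2) ** 20

print(1 / p_all_guesses_right)      # 1073741824
print(1 / p_all_from_unknown_half)  # 1048576
```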
Now when testing states that doesn't have a higher moral, because (at least theoretically) all states are equally important. In other domains, like medicine, law, even CS, that's not the case: stuff ranges from vital basics to pure trivia that no one gives a damn about. (Or not for the scope of the problem at hand: e.g., if I'm hiring a Java programmer, asking questions about COBOL would be just trivia.)
And a lot of "hard tests" are "hard" just by including inordinate amounts of stuff that's unimportant trivia. E.g., if I'm giving a test for a unix admin job, I can make it arbitrarily "hard" by including such trivia as "in which directory is Mozilla installed under SuSE Linux?" It's stuff that won't actually affect your ability to admin a unix box in any form or shape. The fact that SuSE does install some programs in different directories is just trivia.
(And if that sounds like an convoluted imaginary example, let's say that some "hard" certification exams ask just that: where is program X installed in distribution Y? And at least one version of Sun's Java certification asked such idiotically stupid trivia as in which package is class X, or whether class Y is final. Who cares about that trivia? It's less than half a second to get any IDE to fill in the package for you. E.g., in Eclipse it just takes a CTRL+SPACE.)
And in view of that previous point, including trivia in an exam just to make it "hard" is outright counter-productive. There is a non-null chance that you'll pass someone who memorized all the trivia, but doesn't know the basics.
Not all knowledge is created equal, and that's one point that many "hard" exams and certifications miss. If a lawyer doesn't know the intricacies of Melchett vs The Vatican, who cares? In the unlikely situation that they need it, they can google it. If they don't understand Habeas Corpus, on the other hand, they're just unfit to be a lawyer at all. Cramming trivia into an exam can get you just that kind of screwed up situation: you passed someone who happened to know that Melchett vs The Vatican is actually a gag question, and that case name appears in Stephen Fry's "The Letter", yet flunked someone with a solid grasp of the basics and who knows how to extrapolate from there and where to get more information when he needs it.
Rewarding random guesswork is worse. Probably the most important thing one should know is what he _doesn't_ know, so he can research it instead of taking a dumb uninformed guess. Most RL problems aren't neatly organized into 4 possible answers, so it can be a monumental waste of time to just take wild guesses and see if it works. I've seen entirely too many people wasting time trying wrong guess after wrong guess, instead of just doing some research. E.g., I've actually witnessed a guy trying every single bloody combination between *, & and nothing in front of every single variable in a C function, because he never understood how poin
A polar bear is a cartesian bear after a coordinate transform.
Actually it has to be a % passing. If the supply of licensed doctors and attorneys were not limited, the costs for their services would reduce, so these exams have to be a part of the system to control the supply. A test may be written to ensure a spread (so it tests knowledge) and also to ensure that the passing score is largely unattainable. So, I think the analysis is incorrect. The tests are not too hard to be useful as tests, it is just that there is a conflict of interest as regards their use. As medical care begins to take on the characteristics of a human right as representation in court is a political right, perhaps we'll begin to see a breaking down of the cartel system so that medical and law educations are not restricted and final competency tests can be tests of competency rather than also being a link in a chain of controlling supply to increase price.
--
Electricity without fuel costs: http://mdsolar.blogspot.com/2007/01/slashdot-users-selling-solar.html
What I don't like about his reasoning is his assumption that "hard" tests will test substantial knowledge that even the most educated test-taker will not get correct. I would submit that such tests are poorly designed, at least for a final/qualification test.
If your goal is to teach X amount of material and you want to give tests to see how far along a student is at learning X, then such a test is okay, as there will be naturally parts of X that you haven't even taught yet that the student is likely to get wrong. However, if you're now giving a final/qualification test where a good student is expected to know all of X, then the test should test for all of X, and no more. Many of the students should be scoring very close to 100%. In this way, guessing doesn't become a large statistical factor in overall score.
If you administer such a test and even the best students are missing half the questions, either you're testing for more than X, or X was not taught very well. Now, in college classes, we want X from different classes in different years, at least recent ones, to be equivalent; that is, we want the kid who got 100% on the test this year to know as much as the one who got 100% last year. So one needs to be careful about lowering the standards for what qualifies as X knowledge. However, statistically speaking, it's very unlikely for a college or professional class to have a "poor" year where everyone in the class is a poor learner so they can't even reach 100% of X even if taught properly. So I don't think it's a bad idea if test high score is only 50% on a final to either make the final easier or change one's teaching methods. The chances of it actually "dumbing down" the qualifications relative to previous years is small.
It's a shame this guy went to the effort of creating this blog post without making any effort to involve useful metrics of how informative a test really is. The words sensitivity, specificity, and variance don't come up at all. There is a kernel of truth here, which is that you can have both noisy items (not informative about the taker's knowledge) and informative items on tests, and hard tests tend to have more noisy items. The author seems to miss the point that two-choice items at which students guess maximize the error variance. In other words, he chooses the best possible case to support his argument, even though it's unrealistic. Five-choice guessing items contribute less, although it depends how the items are structured (if three of the choices are easily eliminated by even the worst students, then it's much closer to a two-choice item). As a thought experiment, if there were a million choices per "hard" item, they would contribute almost no variance to test scores. The article seems to make no reference to the true score variance among the test takers, which is obviously critical.
I would have liked to see an analysis of the relationship between the number of plausible choices per question and the probability of mis-ordering two test-takers (giving the less knowledgeable a higher score). That would have been a lot more informative than simply saying, essentially, "two is bad, you do the math for more -- but trust me, it's a mathematical certainty."
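A rough sketch of what that analysis could look like, in Python. Everything here is an assumption for illustration: a 100-question test scored by number right, two hypothetical test-takers who know 60 and 50 answers, purely blind guessing on everything else, and every choice equally plausible.

```python
from math import comb

def score_pmf(known, total, choices):
    """Distribution of the number-right score: sure answers plus blind guesses."""
    guessed = total - known
    p = 1 / choices
    return {known + k: comb(guessed, k) * p**k * (1 - p)**(guessed - k)
            for k in range(guessed + 1)}

def p_misorder(known_hi, known_lo, total, choices):
    """Probability that the less knowledgeable taker scores strictly higher."""
    hi = score_pmf(known_hi, total, choices)
    lo = score_pmf(known_lo, total, choices)
    return sum(ph * pl
               for sh, ph in hi.items()
               for sl, pl in lo.items()
               if sl > sh)

for choices in (2, 4, 5, 10):
    print(choices, round(p_misorder(60, 50, 100, choices), 4))
```

Under these assumptions the mis-ordering probability shrinks as the number of plausible choices grows, which matches the point about two-choice items being the worst case.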
There is a kernel of truth here, that multiple-choice tests are often not that sensitive, and that when everyone is guessing on an item, it contributes only noise to the measure. At issue really is how much variance in the test score is explainable by knowledge. In other words, how much information is contained in the test score. An article that uses phrases like "mathematical certainty" and "complete fraud" is obligated to provide some legitimate analysis, or at least references to the literature, not just anecdotes.
Sure, maybe the spread between the "actually knowledgeable" and "lucky guessers" might be small in a test that is consistently "hard" or consistently "easy", but when I compose a multiple-choice test, I intentionally generate questions that vary individually in their difficulty. I've got some easy questions, medium questions, and hard questions. That pulls apart the differences very well -- so well, in fact, that it is common for the test results to have a bimodal distribution. It's very obvious who had no clue and who did.
Ugh. I just wrote a pretty polite reply at his page after skimming his idiotic article. Now that I've read it, I'm actually angry.
This guy knows NOTHING about testing. Nothing. He isn't even to the level of Classical Testing Theory (CTT), which is really not much more than means and Pearson correlations, and is nowhere near how high-stakes (and even medium- and low-stakes, increasingly) multiple choice (MC) tests work now, and how they have worked for many many years.
IAAP (I am a psychometrician). A big part of what I do for a living is design a particular MC test, pilot the items, and interpret the results. But I don't just count up the correct items and give you the percentage. Why? Because that would be insane. You can guess on those.
Oh, but he says this:
But suppose the grading attempts to adjust for guessing. There is no way of knowing what is in the mind of the test-taker, so the customary is to subtract, from the number correct, some fraction of the number wrong.
--Which is just fine until I tell you I have NEVER heard of dealing with guessing that way on a professional-level test.
As a general rule, we don't do any easy mathematics. At all.
Here is part of the output for a test I'm working on right now:
This is generated by RUMM2020, a tool for Rasch analysis. The Rasch model was developed in the 60s as an ideal model of item response. These are the stats on 3 items of this test. The two most important columns are Location and Probability.
The location is the item difficulty. Given the sample's performance on this item, and given their ability, how hard is this item? Item 35 is quite difficult; item 36, quite easy.
The probability is the p value for the chi square. Basically, if it's 0.05 or below, that item is operating significantly (statistically significantly, that is) outside of the model. It displays poor "fit." we generally toss these items before going on to the next step (ideally, these are weeded out during pilot testing, before the test goes live--in this case, it is an experimental test of a construct I'm not even sure exists anymore, but I digress). If an item has poor fit with the model, it is too much of a loose cannon, and its results cannot be trusted. This is what the benighted blogger (is there any other kind?) was whining about. That item is hard not because it is good, but because it is evidently stupid. The responses are all over the place, which means people were probably just guessing. Out it goes before it ruins any examinees' lives.
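For readers who haven't met the Rasch model, a minimal Python sketch of its core formula for right/wrong items: the probability of a correct response depends only on the gap between the person's ability and the item's location, both on the same logit scale. The numbers below are hypothetical and are not taken from the commenter's output.

```python
import math

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch model: P(correct) = exp(a - d) / (1 + exp(a - d))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical item locations: an easy, an average, and a hard item,
# answered by a person of average ability (0 logits).
for difficulty in (-1.5, 0.0, 1.5):
    print(difficulty, round(rasch_probability(0.0, difficulty), 3))
```

A person whose ability equals the item's location has a 50% chance of getting it right; fit statistics like the chi-square mentioned above ask whether observed responses follow this curve.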
The next step is to get person locations. In the case of people, these numbers indicate the person's ability. This is calculated by looking at their performance on the items, given their difficulty (Which is calculated based on people's performance on them! Incestuous! But given a large enough sample, it all works out to a fine enough grain to be useful). Here is the output for the people:
So, the first person didn't do so hot; the last did pretty well (these usually top out at 3ish). As you can see in "DataPts," there were 125 items on this test. I started with 160. Do you hear that, Mr. Unexpected "Truths?" We have your back! We're not just handing you a naked score based on our crap items. WE PULL THE CRAP ITEMS.
That location score will usually be rescaled to something prettier, since no one would really like to see something like
"(And if that sounds like an convoluted imaginary example, let's say that some "hard" certification exams ask just that: where is program X installed in distribution Y? And at least one version of Sun's Java certification asked such idiotically stupid trivia as in which package is class X, or whether class Y is final. Who cares about that trivia? It's less than half a second to get any IDE to fill in the package for you. E.g., in Eclipse it just takes a CTRL+SPACE.)"
And what do you do when you don't have such a crutch to fall back upon? That's what "hard" tests address.
Since when was 2 + 49 - (49 ÷ 2) equal to 75.5?
Had an algorithms prof (of all things) give us a test where every question had the following possible answers:
Yes, No, Sometimes, Maybe, Unknown
Then, he had questions like 1. Some scientists believe that P=NP?
To which, of course, you could argue ANY answer is correct.
That being said, this blog post comes across as the usual whining we've all done or had to put up with through the years. No testing methodology is perfect, and everyone tests different on different kinds of tests. Fact is, though, they're pretty damn good. It's a common belief that millions of people who are otherwise idiots are graduating with great grades, while millions of geniuses can't test well - but that's horseshit. The majority of people manage to test at their level of understanding. The fact that people actually notice the odd idiot who guesses well is the exception that proves the rule.
Endless arguments over trivial contradictions in books written by ignorant savages to explain thunder in the dark.
He should have stuck with the more useful observation that almost* any test with a very low pass rate will be unreliable.
All tests have a margin of error, although it's a rather taboo subject - when did you ever get a test result that stated the 95% confidence interval? If only a small proportion pass, there is a danger that these errors will dominate.
There are:
Now, an issue with multi-choice is the "guessing" problem, but there are (as TFA points out) work-arounds. TFA misses out the most important way of reducing guessing - which is designing the questions carefully so that each alternative is seductive and/or represents a common error. The real problem with multi-choice is the last two bullets above - it really is the most artificial and superficial form of test possible. Done well, it's a good way of quickly romping through a large domain to offset the "sampling" problem, but it should never be the totality of a test. The depressing problem is that it's so easy to mark and administer - and is cheap to deliver on computer (c.f. more ambitious computer-based testing, which is expensive to develop).
*I'm sure it's possible to contrive a counter-example.
In a survey of 100 programmers, 111111 thought that duck-typing was a good idea.
I read TFA, and aside from the bad math in the last paragraph I also have a problem with the logic of his example. He assumes that someone who knows twice as much as me should score twice as high as me. But that doesn't take into account that twice VERY LITTLE knowledge is still not a lot. I can understand that multiple choice is not a perfect way of judging someone's knowledge, but think about how many people need to take the exam and how few people there are to grade them, not to mention that it is impossible to ask long-answer questions on every aspect of a course. Multiple choice questions are still not too bad a way of quizzing a large group of people on subject matter that has a large range of subtopics, assuming of course the questions aren't uber-hard. Damn, this wasn't exactly something I wanted to read 2 days before my exams (which happen to be largely MC questions).
assumptions. The premise is "IF we had person A that knows 2 times as much as person B, a well devised test ought to score person A twice as high as person B". No one is saying that they found person A and person B, i.e. two people where they can show beyond any doubt that A knows twice as much as B; that is exactly what an ideal test would do. The problem here is construction of such an ideal test.
And surely it is possible to construct a test that approaches the ideal without having people A and B. All one has to know about A and B is that A will answer two times as many questions correctly, if they don't answer questions they don't know answers to. So, one can then see how to score tests to reflect that. And the answer is to penalize guessing to some degree, which will depend on the structure of the test.
As the island of our knowledge grows, so does the shore of our ignorance.
but these tests do not test knowledge (all the people attempting BAR or medical license exam already have degrees). They are devised to cull and decide who "gets in", rather than test knowledge.
It is a naive assumption to think that more knowledgeable should get in it seems :). World doesn't work that way.
As the island of our knowledge grows, so does the shore of our ignorance.
I know this isn't a comforting thought, but isn't some of the domain of doctors and lawyers in effect specialized, logical guesswork? For example, many diseases could share a common set of symptoms. Certainly, it takes knowledge, but it also takes a wee bit of luck.
...they are designed to touch on subjects which are likely to get them onto news sites like Slashdot...
So, is that the new sport? First there was First Post, then (before they switched) getting 50 Karma Points. Now we have to get Slashdot to Feature our Third Party Blog?
When our name is on the back of your car, we're behind you all the way!
The hardest part of most medical specialist exams is the orals. Nobody ever complains about the written component. You get to sit in a room with one or more examiners for a few hours of intense grilling. There is no way to hide any lack of knowledge and your deficiencies are exposed for all to see.
Also the US has a strange system of certifying specialists. After completing residency (usually based on putting in your hours) you can practice medicine under the appellation 'board-eligible.' Once you've passed your exams, then you can be called 'board certified.'
In Canada, you can't practice at all unless you pass your board (Royal College) exams. The exams are reputedly harder in Canada as well (from those I know who have written both).
I don't care what they don't know.
I give multiple choice exams with between 100 and 200 questions, and 4 possible answers.
Each correct answer is worth 2 points; they need to answer 50 correctly to get 100.
They don't HAVE to answer any question, or any number of questions. If they can answer 30 questions, they can get a D. Any question answered incorrectly is -1 point. This serves two purposes.
It prevents guessing, and it forces the student to consider whether they actually know the answer, or just think they do.
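A quick Python sketch of how that scheme plays out, assuming +2 per correct answer, -1 per incorrect answer, 0 for a blank, and 4 equally likely choices when guessing blind. The function names are illustrative.

```python
def score(correct, incorrect, points_right=2, points_wrong=-1):
    """Total points; unanswered questions contribute nothing."""
    return correct * points_right + incorrect * points_wrong

def blind_guess_value(choices=4, points_right=2, points_wrong=-1):
    """Expected points from one blind guess among `choices` options."""
    return points_right / choices + points_wrong * (choices - 1) / choices

print(score(50, 0))         # 100
print(score(30, 0))         # 60
print(blind_guess_value())  # -0.25
```

A blind guess loses a quarter point on average under these assumptions, so leaving a question blank beats guessing unless the student can rule out at least one choice.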
I typically give 4 of these per semester. After the first one I usually get several complaints because they're not used to testing in this way. After the second I usually get one or two stating they can't break the habit of answering every question. After the final, I get many compliments and high marks on my evaluations, and the students tell me they are much more confident in what they've learned than from any other class. I've had occasion to run across previous students from years past, and they claim they still remember more from my class than from others.
I've had administrators forbid me to do it this way. I did it anyway. When they saw the results, they relented, and many suggested the process to others.
"I may be synthetic, but I'm not stupid." -- Bishop 341-B
[i]For True-False exams for example, the number subtracted would most likely be (Number Wrong ÷ 2). Let's see how that would work out, for the sample case above. You, answering two questions correctly and guessing at 98 would be likely, on the average, to get 49 wrong, and so have a final score of 2 + 49 - (49 ÷ 2), or 75.5, while I, again on the average. answering only 1 correctly and guessing at 97, would get a final score of 1 + (97 ÷ 2) - ((97 ÷ 2) ÷ 2)), which comes out to be 25.25. Here there is a substantial difference between our scores, closer to the two-fold difference in our actual knowledge.[/i]
Let's think about this: 51 - 24.5 = 26.5, not 75.5. Further, knowing one would mean guessing at 99, not 97: 1 + (99/2) - (99/4) = 25.75. This means the average difference when adjusting for guessing moves from .5 (average score of 50.5 vs 51) to .75, hardly a substantial difference. Of course the numbers will separate out at greater levels of knowledge as he showed earlier: if one person can answer 50 and the other 25, the average scores will be 62.5 and 43.75.
Now he probably simply didn't check his math, but twice in the same paragraph?
Parent post gets the point, and states it better than TFA.
Study after study has shown that performance on standardized tests is highly correlated with career success. While they may not be perfect, the tests are the best way to identify the people who are truly competent and weed out the unqualified. The fact that a particular subset of the population may achieve lower than average results does not invalidate the efficacy of the tests, and is not an excuse for lowering standards or abandoning testing.
A well designed test wouldn't be all multiple choice!
Ask anyone with a teaching license and they will tell you that there is a lot of debate over what multiple choice tests actually test: knowledge of the material or ability to take tests. There are a lot of educators who argue that essays or portfolios are a more accurate measure of how much someone knows than multiple choice or true or false tests.
http://www.popularculturegaming.com -- my blog about the culture of videogame players
So that's why he became a doctor. Yikes, please don't diagnose me. It's simple: if the number right needed for a pass is the same as you'd get by randomly guessing, of course the test is fallacious; however, if it is much higher (say 70), then the test is probably fine. It's easy to tell if the test is valid by looking at your curve, mean and median: if they're significantly different from random, there you go.
each correct answer worth c
and each incorrect answer worth -i:
If you have no idea which answer is correct, and (n-1)*i < c, then guess.
Likewise, if you can eliminate some of the answers, so you are only choosing from m possible correct answers, and m < n, then guess if (m-1)*i < c
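The rule follows from the expected value of a guess. A minimal Python sketch, assuming n in the comment means the number of choices per question, and using made-up point values (c = 4, i = 1) purely as an example:

```python
def guess_value(remaining, c, i):
    """Expected points from guessing among `remaining` equally likely choices."""
    return (c - (remaining - 1) * i) / remaining

def should_guess(remaining, c, i):
    """Guess exactly when the expected value is positive, i.e. (m-1)*i < c."""
    return (remaining - 1) * i < c

# Hypothetical scoring: +4 for a right answer, -1 for a wrong one.
for remaining in (5, 4, 3, 2):
    print(remaining, should_guess(remaining, 4, 1), guess_value(remaining, 4, 1))
```

With these example values a blind guess among 5 choices is break-even, and every choice you can eliminate pushes the expected value further above zero.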
It got me a full-tuition scholarship to an ivy-league school (where I learned to hyphenate adjectival phrases). Your results may vary.
In any case, please be sure to join the slashdotters posting 'you are an idiot' message on the article's comments. It's important to keep the slashdot's reputation as the premiere internet home of arrogant assholes.
Not all knowledge is created equal, and that's one point that many "hard" exams and certifications miss.
That might be less of a problem than you think. See these comments.
If a lawyer doesn't know the intricacies of Melchett vs The Vatican, who cares? In the unlikely situation that they need it, they can google it. If they don't understand Habeas Corpus, on the other hand, they're just unfit to be a lawyer at all.
This is a common misperception about law. It's actually more important to know the laws and cases than abstract concepts, because the concepts are defined solely by specific laws and cases. In applying a concept you must always provide a citation. The best lawyers are those with a giant capacity for remembering specific laws and cases, and applying them to current situations. A general grasp of concepts is useful in writing about law for the general public, but actually not that useful for practicing law.
Build a man a fire, he's warm for one night. Set him on fire, and he's warm for the rest of his life.
can be hard and they don't mean that much, as they are easy for people to cram and pass without having a clue about how to do the work, and they cover things that you do not see / use in the real world or they do things in a way that is not the best way to do it.
"'The test was very hard,' the medical specialist said. 'Only 35 percent passed.' 'How did they grade it?' I asked. 'Multiple choice,' he said. 'They count the number right.' As a former mathematician, I immediately knew the test results were meaningless.
He/she must not have been a very good mathematician. They're assuming that the reason that only 35% passed is because each individual question was very hard, in which case their argument is correct. However, it's more likely that each question is relatively easy, but to pass you have to get almost all of them right. Since it's impossible to know which of these situations was the case based on what the doctor said, the former mathematician couldn't have known the test results were meaningless. In my experience taking the MCAT, med school tests, and USMLE licensing exams, they're usually composed of many easy questions, and you have to get almost all of them right. That sort of mimics what being a doctor is like, which is that most of the time the diagnosis and treatment is fairly straightforward, but the tolerance for mistakes is very, very low.
"But he was VERY generous with partial credit."
So far you're the only poster who has pointed out the big problem with multiple choice questions - NO PART MARKS! This all-or-nothing approach to each question skews the whole result enough to make it meaningless. Taking physics as an example, a student could work through a very complicated physics problem, but make a trivial arithmetic subtraction error at the very end and so choose "E. None of the above". Result for that question: zero, despite actually understanding the physics!
I demonstrated to the very young "professor" that by changing placement alone of various factoids my grade could have been anywhere from a high B to an F. What made matters worse, he declared that the bell shape curve of the outcomes validated his scheme. I couldn't quite get across the bell shape curve that, for instance, loaded dice create.
My adviser agreed with me but stayed out of the dispute. I am still angry!
Mastery of a subject or body of knowledge means only one thing and
that is to know *everything*.
Therefore, any test score that is less than 100%, or perfect, should
constitute a complete failure.
Throw away all of these theoretical quibbles and calculations. The only
eminently sensible solution is to accept nothing less than absolute
perfection.
I, for one, would not want to undergo brain surgery by a person that
passed his board certifying exams with a multitude of "acceptable" errors.
The same obtains for all the other professions that are deemed crucial
by our society.
If one cannot know it all, then one should not bother to know anything.
If nobody has suggested this until here:
What about a plurality of answers being potentially correct ?
Let's say 4 alternatives; and 0,1,2,3,4 may be correct.
Now we could consider
- the answer correct in case of all ticks being correct (resp. correctly unticked)
- to allocate partial marks: '+' for correct ticks and '-' for incorrect ones
At least, in both cases guessing will deliver close to nothing.
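A small Python sketch of both scoring ideas, to see what blind guessing earns. The reading of the partial-marks rule (+1 per correct tick, -1 per incorrect tick, nothing for unticked boxes) and the answer key are assumptions for illustration.

```python
from itertools import product

def all_or_nothing(key, ticks):
    """1 point only if every box matches the key, ticked or unticked."""
    return 1 if key == ticks else 0

def partial_marks(key, ticks):
    """+1 for each correct tick, -1 for each incorrect tick."""
    return sum(1 if k else -1 for k, t in zip(key, ticks) if t)

# Every possible way to tick 4 boxes, standing in for a uniformly random guesser.
patterns = list(product([False, True], repeat=4))
key = (True, False, True, False)  # hypothetical key with two correct alternatives

avg_all = sum(all_or_nothing(key, p) for p in patterns) / len(patterns)
avg_partial = sum(partial_marks(key, p) for p in patterns) / len(patterns)
print(avg_all, avg_partial)  # 0.0625 0.0
```

Note that the partial-marks average is exactly zero only because this key is balanced; a key with more correct than incorrect alternatives would reward random ticking somewhat.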
The blogger touts his math/stat skills and then argues that multi choice scores are a fraud. Like many self proclaimed experts, this one falls short. He posts a formula without variables and shows the wrong answer.
Worse as a supposed stats expert, he also quotes the formula for guessing incorrectly.
He doesn't mention that standardized tests that use the "guessing formula" do not require one to guess. If you know only two answers and answer ONLY those questions, there is no penalty for unanswered questions.
Also his extreme examples aren't the ones to support his hypothesis. His primary two examples were both 100 True/False questions. In one "extreme" example, one person knows one answer and the other two. In that case, regardless of the math, we know who on average is the more knowledgeable person. His second example on this test compared two people, one knowing all and one knowing half. Again, applying the guessing formula widens the delta, but we still know who's more knowledgeable.
The example that exposes the fraud is a 100 question T/F test where one person knows 50 and marks guesses for the other 50, while the second person knows 64 answers and doesn't guess, leaving 36 questions unmarked. Person 1 is going to average a score of 75, frequently a passing grade, while the more knowledgeable person scores a 64, often a failing grade.
However the blame here lies with the test preparation. If there is no "guessing" penalty for wrong answers, then all test takers should guess on all unknown questions. If all do, then the person who knew 64 answers will on average score an 82, beating the person who knew only 50. If there is a "guessing" penalty for wrong answers, then whether or not the test taker makes blind guesses is irrelevant. As another reply to the blogger points out, knowledgeable people rarely are blind guessers and thus should guess, as they are likely to beat the odds of the guessing penalty.
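Those three averages can be checked with a few lines of Python, assuming a true/false test scored by number right with no penalty; the function name is made up.

```python
def expected_number_right(known, blind_guesses, choices=2):
    """Sure answers plus the average yield of blind guesses; blanks score 0."""
    return known + blind_guesses / choices

print(expected_number_right(50, 50))  # 75.0  knows 50, guesses the other 50
print(expected_number_right(64, 0))   # 64.0  knows 64, leaves the rest blank
print(expected_number_right(64, 36))  # 82.0  knows 64, guesses the other 36
```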
If there is a fraud, it is if the standard for passing is so low that a person making random guesses can pass the exam one out of three or four times.
Well, since you are much more of an authority in this subject area than most of us on Slashdot, perhaps you could give me some insight on this little conundrum?
If a man is walking in a forest, and he's talking to himself, and there are no women around, is he still wrong?
I just wanted to say, I think that 95 percent of all exams are cop-outs, whether issued so deliberately or just because they were lulled-up that way. This is not including 'take-home' exams. In a perfect world, rather than spend all the resources we have on lawyers, advertising, physical distribution of virtual goods, cash registers their operators, and who knows what else, we could have more people compensated to learn how to teach and have them teach and spend time assessing students individually or in smaller groups over longer periods of examining, paying attention to who they actually are and what they have to say. And even in this perfect world, there would still be more room for people to become teachers through the returns of what that education gives back. To those who say that machines, computers, paperless offices and trust-based systems take jobs away from people, who might also say that exams are the natural result of logistics, I say - please consider the nature of what education provides a society and how far a human mind can actually go. And please do not give up.
The typical college entrance exams I took had 4 choices each and 4 points for a correct answer and -1 for a wrong answer. The thing was, in a lot of the cases you may not know the correct answer, but if you know something about the subject, 1 or 2 of the choices are obviously wrong. Now if you eliminate those and guess amongst the rest, your chances are much higher than simply not answering a question you don't know the answer to. Mathematically, with 2 choices eliminated you have a .5 chance of guessing right, so an expected value of 4*.5 - 1*.5 = 2 - .5 = 1.5, as opposed to 0 for not attempting. I think this kind of guessing is fair, as you are getting rewarded for your partial knowledge, which lets you eliminate at least the nonsense solution.
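The same expected-value arithmetic can be sketched for any number of remaining choices. This assumes the +4 / -1 scheme described above and a uniform guess among whatever choices are left; the function name is hypothetical.

```python
from fractions import Fraction


def guess_value(remaining_choices, reward=4, penalty=1):
    """Expected points from guessing uniformly among the remaining choices."""
    p_right = Fraction(1, remaining_choices)
    return p_right * reward - (1 - p_right) * penalty


for remaining in (4, 3, 2):
    print(remaining, float(guess_value(remaining)))
```

With two choices left this gives the 1.5 points worked out above. Under these assumptions even a completely blind guess among all four choices comes out slightly positive (0.25), so leaving a question blank is never the better bet with this particular scheme.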
**Life is too short to be serious**
I once took a test that I knew absolutely nothing about, and got the best score of anyone taking the test!
...
...
:( But I got the best score out of the 100 or so people taking that test at that site.
I was in high school, so this is almost 30 years ago. Also, I was the brightest kid my school had ever seen, literally. Okay, not all that hard when a typical graduating class only has 50 people in it (this was a rural school system), but still
In my junior year (that is, grade 11), they decided to have me represent the school in all sorts of competitions. Math and science - math was really my specialty, I had taken every math class they had to offer by this time, but I also did well in science. So anyway, this contest comes up in which you're allowed to take 2 subjects. Of course I took math and did okay. Not spectacular, my school didn't offer Calculus or other advanced math classes, but respectable. And since no one else in the school was willing to take the test in Physics, I took that as well.
You have to understand, my school didn't even offer Physics until grade 12. I had taken general science 2 years earlier (and got such a good score that it made everyone else look silly), then Biology and was at the time taking Chemistry, but Physics wasn't until the next year. I figured "How hard could it be?" Well, I got the test, and maybe had an idea how to work out one question.
I wasn't going to leave the test blank. Like the college entrance tests, they actually assign a negative score to wrong answers so that there's not supposed to be any advantage to guessing, but I wasn't going to sit there for an hour and do nothing. So I started looking through the answers
Now that I've been a teacher also, I know about (theoretical) test design. You're supposed to include a couple of reasonable-sounding but wrong answers (referred to as "distractors") to catch the people who have some idea what they're doing but are trying to be lazy, and a couple of completely wrong answers - and of course the correct answer. I was able to eliminate the completely wrong answers, then look just at the others and determine which one had to be the correct answer on over 90% of the questions.
In one sense it didn't work. I still got done with 20 minutes left and had to sit there with nothing to do for the rest of the time.
This was actually a national contest, but I won't give the name. Fortunately, there were people who beat me at other locations; it would have been really embarrassing if I'd won the national competition just by guessing.
And needless to say, I never gave my students multiple choice tests (in Math, that was my subject after all). I know from experience that multiple choice tests are worthless.
Can't speak for law school but virtually every exam my wife took in med school as well as for her licensing exams was multiple choice. She informs me that most if not all med schools in the United States give the vast majority of tests in multiple choice format. The main exceptions are practical examinations where multiple choice is not an option.
The number of questions somebody gets right can be thought of as a Binomial distribution with two parameters: p, the fraction of questions I expect to get right, and n, the total number of questions.
As such, the mean of a Binomial variable is n*p, and the standard deviation is the square root of n*p*(1-p). People who have the same "knowledge", i.e., the same probability p of getting a question right, will score between the mean minus two standard deviations and the mean plus two standard deviations just over 95 percent of the time (assuming n is large enough).
This means, for example, on a test with 100 questions (n=100), where it's multiple choice with four possibilities, then if I know next to nothing (p=0.25), my score will range (95% of the time) somewhere between n*p-2*sqrt(n*p*(1-p)) and n*p+2*sqrt(n*p*(1-p)), or between 16.34 and 33.66.
So, I can very easily set a threshold score above which people with little knowledge will not pass; on the other hand, there is random variation in the grade that is not due to a difference in knowledge, so it seems unfair to give a B+ to someone who scored 5 points higher than the person who got a B, because that difference could have been due to luck.
Obviously, as n gets larger, the problem diminishes (since the mean grows with n and the standard deviation only with the square root of n, so the standard deviation becomes smaller relative to the mean score); it's worst when p=0.5 and less of a problem for p close to 0 or 1.
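A short sketch of the numbers in this comment, using only the binomial mean and standard deviation quoted above (the two-standard-deviation band is the usual normal approximation, so treat it as approximate; the function names are made up):

```python
import math


def score_band(n, p, k=2.0):
    """Mean +/- k standard deviations for a Binomial(n, p) score."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean - k * sd, mean + k * sd


def relative_spread(n, p):
    """Standard deviation divided by the mean; shrinks like 1/sqrt(n)."""
    return math.sqrt(n * p * (1 - p)) / (n * p)


low, high = score_band(100, 0.25)
print(f"{low:.2f} to {high:.2f}")  # 16.34 to 33.66

for n in (100, 400, 1600):
    print(n, round(relative_spread(n, 0.25), 4))
```

The band for a pure guesser on 100 four-choice questions comes out at roughly 16.3 to 33.7, and quadrupling the number of questions halves the spread relative to the mean.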
One thing that is sometimes done is to introduce a small percentage of questions where the best answer is to leave it blank: all choices are wrong. These might not be scored directly, but used to adjust the weighting factor for wrong answers.
In a 100 item exam with 4 choices per item, there might be 5 weight-factor items. A perfect score would be 95 correct answers out of the 95 test items, with all the 5 weight-factor items left blank. The adjusted score would be 100.
A raw score of 79 correct, 16 wrong, and all weight-factor items answered would indicate this subject had been guessing randomly when he did not know the answer: his adjusted score would be decreased by a relatively large amount (25% of wrong subtracted from correct, expressed as percent of perfect: adjusted score = 78.9).
Another raw score of 79 correct, 16 wrong, but all weight-factor items left blank would indicate that subject had been using somewhat educated guesses when he did not know the answer: his adjusted score would be decreased by a lesser amount (20% of wrong subtracted from correct, blah blah, adjusted score = 79.8). Even though his raw scores are the same as in the first case, he has the smarts to make better guesses, and he is rewarded with a slightly higher adjusted score.
Yet another raw score of 79 correct, 8 wrong, 8 left blank, and all weight-factors left blank would show the use of more sophisticated educated guessing (20% of wrong subtracted, blah blah, adjusted score = 81.5).
If the passing level for the test was an adjusted score of 80, then this approach provides good discrimination between subjects with scores right around that mark.
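The three adjusted scores above can be reproduced with a small sketch. The comment does not say how the penalty fraction is derived from the weight-factor items, so here it is simply passed in as a parameter; the function name and signature are hypothetical.

```python
def adjusted_score(correct, wrong, penalty_fraction, scored_items=95):
    """Correct answers minus a fraction of the wrong ones, as a percent of perfect."""
    return 100 * (correct - penalty_fraction * wrong) / scored_items


print(round(adjusted_score(79, 16, 0.25), 1))  # 78.9  (random guesser)
print(round(adjusted_score(79, 16, 0.20), 1))  # 79.8  (educated guesser)
print(round(adjusted_score(79, 8, 0.20), 1))   # 81.5  (guessed less, left 8 blank)
print(adjusted_score(95, 0, 0.25))             # 100.0 (perfect paper)
```

With a passing mark of 80, only the third candidate passes, even though all three have the same 79 correct answers.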
Their real purpose is to hinder competitors from entering the market.
If you are worried about quality of care, then transparency is the key. Transparency allows the customer to weed out incompetent or inexperienced practitioners.
I agree with everything you said except this part:
"A multiple choice question might only have one right answer and its point value is the exact same as that of something much easier (especially, when on the harder on, the wrong choice might even be 'righter' than the correct choice on the easy question) -- but thats why there is an entire field of psychometrics out there to ensure that these sorts of exams are doing what they say they are."
Seems to me like that is more an example of psychometricians being forced to accept a less than valid form of test scoring. The proper way to do things has to incorporate Rasch's principle that the likelihood that a given test-taker will give the correct answer (on a question that is valid for the quantity it is being used to measure) depends on the product of the easiness of the question and the ability of the test-taker. For that matter, lumped scores (pass-fail, ranking, or absolute) on professional proficiency exams - which by their nature must test disparate quantities with various non-linear contributions to professional qualification - cannot properly be interpreted as measurements of anything without a well-thought out unified criterion that describes the contributions and dependencies of the various quantities measured by the questions to the overall measurement of professional competence.
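For readers who have not met it, the product rule mentioned here can be sketched as follows. This uses the multiplicative form of the Rasch model, in which the odds of a correct answer equal ability times easiness; the parameter values are invented purely for illustration.

```python
def rasch_probability(ability, easiness):
    """Probability of a correct answer when odds = ability * easiness (both > 0)."""
    odds = ability * easiness
    return odds / (1 + odds)


# Illustrative values only: one hard item and one easy item.
for ability in (0.5, 1.0, 2.0):
    hard = rasch_probability(ability, easiness=0.25)
    easy = rasch_probability(ability, easiness=4.0)
    print(ability, round(hard, 3), round(easy, 3))
```

Doubling a test-taker's ability doubles the odds on every item, whatever its easiness, which is what lets ability and difficulty be placed on a common scale.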
"Is life so dear, or peace so sweet, as to be purchased at the price of chains and slavery?" - Patrick Henry
"Tests don't prove you know anything; they only prove you know how to take tests!"
If you disagree with me on social issues, then it's pretty clear that you are a narrow-minded bigot.
AAAPIT (I am a psychometrician in training). He clearly knows nothing about psychometrics, and is pretty much a fool for assuming that the people who put together the tests have never bothered to think about such elementary problems. There is well-developed statistical methodology behind the scoring of standardized tests. Most licensing tests these days are put together with Item Response Theory, which gives the test developer a very precise idea of how much of a role guessing plays in each question. (You might be surprised to find that the floor guessing parameter is not just based on the number of choices; it varies depending on the details of each question). IRT also yields a test information function that lets you see how much information the test is giving you along the range of ability levels. The argument he makes about deducting fractions for incorrect answers (known as "formula scoring") is BS, because no standardized test ever reports just the raw score. Different forms of the test differ in difficulty, and so must be equated to one another. In the process, raw scores are converted to scaled scores, and the conversion is typically not a linear one. Formula scoring results in lower raw scores than if you don't apply the penalty (dichotomously scored), but all that means is that the range between the lowest and the highest raw score is less with the dichotomously scored test. If that range is too small, you can always add more questions. Suppose you took two versions of the same test, one dichotomously scored and one with formula scoring. (Assume for the purposes of simplicity that there's no measurement error.) Yes, you would get a higher raw score on the dichotomously scored test, but so would the whole test-taking population. Your percentile rank would not change, and the scaled score would still work out the same.
Let me tell you a story. When my parents bought a ZX-81 with 1K RAM back in the day, that thing didn't even have enough memory for an assembler. I learned assembly by translating it all in hex by hand. I had a big notebook with all combinations of opcodes and registers, and their hex codes. Forget writing "for" loops or even "goto", you had to actually count bytes by hand to do a jump.
Or did I tell you about the time when a PHB gave me a computer with a compiler, but literally no editor? (Not even EDLIN.) Yeah, we had to do with a disk editor until that was sorted out, because the alternative was to sit and twiddle thumbs. Even if with a damn good excuse.
So I _can_ do, and did do, without even the "crutch" of a compiler or assembler or even a text editor. Can _you_?
That said, I genuinely don't miss those days. They're not some "good old days", they're days when I wasted time on stuff that a tool would have done better. That was wasted time. There's a reason there are better tools nowadays, and that is that they genuinely make you more productive. They let you focus on the things that actually _matter_, like algorithm and design, not on the mechanical bullshit that a compiler or assembler does better or faster anyway.
_That_ is what makes a good programmer: algorithms, data structures, patterns, and knowing how to use a tool or library for the rest. Doing stuff by hand that the IDE or compiler does better, that's not a reason for pride, it's a waste of time and (employer's) money.
It's like hiring, say, a gardener and discovering that his grand reason for professional pride is that he can mow the lawn with some small scissors, instead of relying on the "crutch" of a lawn mower. Well, who cares? He's still doing a crap job and wasting more time than someone else. If the tools do that faster, freakin' use them. In fact, if a gardener actually did that, you might even suspect him of fraud: that he's deliberately wasting time so he gets paid for more hours.
A polar bear is a cartesian bear after a coordinate transform.
First off, this is cheating. I'm responding to a comment made at Blogspot here on Slashdot, but there's no way I'm creating an account over there just for this and besides they don't even have threaded conversations. Instead I'll quote the guy from Blogspot's comment here in full and then basically say he's full of shit.
Aaron said...
I am sorry, but as a psychometrician (i.e. someone who writes multiple choice tests and interprets the results), I have to simply chime in with this:
We know. That's why we don't just count correct answers.
Any major test (GRE, LSAT, TOEFL, TOEIC, etc.) uses some kind of item response theory (IRT) to determine the score. This means that the final score is actually the person's ability, given their performance on the items, which are weighted differently (to put it VERY simply) according to people's performance on them. It doesn't matter what easy-to-read numbers the test gives you as your score; your REAL score is a number between 0 and 1. Sometimes that number is rescaled to the actual number of items that were on the instrument to give people the illusion of a classical MC test.
Another point is this: Remember when you took your SAT (I think it was)? They told you not to guess if you weren't sure about that answer, right? The reason for that is that with a really well-worn and robust test, the developers have been able to figure out who picks which distractors, and can therefore derive further meaning from whatever option you choose. So instead of a simple binary item (right or wrong), they can create a partial-credit item. Say "A" is the right answer, but people who are pretty smart seem to pick "B" a lot. So maybe the stats will assign a value of 0.5 for that one. Maybe "C" is just a throwaway distractor and doesn't mean anything other than you missed the question. But what if "D" turns out to really distract total morons? The stats might end up assigning a NEGATIVE value if you pick that. So read the test specifications before you take a big test. If they say not to guess, that's why. What you don't know can actually hurt your score more than just skipping it.
Look into the Rasch model and multi-parameter IRT. It's late and I actually need to develop some questions tonight (no kidding!), so I leave it to you and Wikipedia.
So to sum up: Basically, you are right about the problems with MC tests, but wrong about how much this affects people's lives.
June 17, 2007 4:06 AM
So, as I was saying --bullshit.
I'm also a writer of GRE/TOEFL practice tests and I am quite sure this is not true. This was true for the TOEFL, but only for a few years. With the advent of the computer based TOEFL in 2000 there were weighted responses and the successful implementation of this feature was one of the primary differentiations between software practice test products that were published at that time such as my own which you are welcome to buy on Amazon but I'm sure you won't if you're already reading this in English.
However, that computer test was dropped in favor of a radically redesigned test in 2005 --another reason you probably won't buy it at Amazon-- in which ETS specifically documented that they were dropping weighted scoring entirely. This was specifically stated in documentation from ETS and it was distressing to me because I was offering one of the few projects that had a reasonably accurate weighted scoring system so I am absolutely sure of this. It cost me money big time.
As for GRE, well this is location dependent. In some locations the GRE computer based test still uses weighted scoring, but in most of Asia that test is no longer offered and a non-weighted test is currently the only choice. The reason the
- VA is the only state in which you can still read for bar...
- no need to go to law school!
...but my limited math skills are all going red-flag on me at the moment:
For True-False exams for example, the number subtracted would most likely be (Number Wrong ÷ 2). Let's see how that would work out, for the sample case above. You, answering two questions correctly and guessing at 98 would be likely, on the average, to get 49 wrong, and so have a final score of 2 + 49 - (49 ÷ 2), or 75.5, while I, again on the average. answering only 1 correctly and guessing at 97, would get a final score of 1 + (97 ÷ 2) - ((97 ÷ 2) ÷ 2)), which comes out to be 25.25. Here there is a substantial difference between our scores, closer to the two-fold difference in our actual knowledge.
OK, forgive me for RTFA, but how is 2 + 49 - (49/2) equal to 75.5? My trusty calculator tells me this is 26.5, just over one point higher (1.25) than the second example -- about what I would expect.
The entire argument is fallacious...I know twice as much as you, so much that I get 100 questions right, you get 50 right and guess at the other 50...50 + 25 - (25/2) = 62.5. Not quite a 2:1 ratio there.
While I agree with the author's premise that guessing should be penalized, he does a terrible job proving his point.
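The arithmetic in dispute is easy to check with a short sketch. It assumes true/false questions and the article's rule of subtracting half the number wrong; the function and parameter names are made up for illustration.

```python
def expected_formula_score(known, guessed, choices=2, penalty_divisor=2):
    """Expected score = known + lucky guesses - (wrong guesses / penalty_divisor)."""
    right_by_luck = guessed / choices
    wrong = guessed - right_by_luck
    return known + right_by_luck - wrong / penalty_divisor


print(expected_formula_score(2, 98))   # 26.5, not 75.5
print(expected_formula_score(1, 97))   # 25.25
print(expected_formula_score(50, 50))  # 62.5

# With the textbook correction, wrong / (choices - 1), guesses cancel out on average.
print(expected_formula_score(2, 98, penalty_divisor=1))  # 2.0
print(expected_formula_score(1, 97, penalty_divisor=1))  # 1.0
```

Subtracting only half the wrong answers on a two-choice test still leaves a net gain from every blind guess, which is why the ratios above never come out at 2:1. Subtracting the full number wrong is what would make the expected score equal the number of answers actually known.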
The real fraud of that sort of test is that the number of passing grades is set first, then the pass/fail cutoff is moved to meet that figure. If few take the bar exam, a drooling moron may pass. If many take it, being well qualified isn't good enough.
It has to be long and hard so moving the cutoff can provide fine-grained control on the number who are admitted into the profession.
The way we assess future professionals may be wrong: We give them a piece of paper or sit them in front of a computer screen full of questions, and ask them to either choose from multiple answers or write down their own answer. However, few of these professionals will ever need to do exactly that in their actual jobs. In essence, we benchmark candidates by asking them to do something they will rarely do in real life. The results are easy to predict: Some will learn how to pass tests without exhibiting real-life performance, while others will be able to do the job but fail on the test. In the end, tests seem to mainly assess the candidates' patience and conformity to social hierarchies.
The answer is F) all the above.
You are reading a sig. Cancel or allow?
One gets a feeling one's in the wrong crowd after seeing what happened to his comment thread after Slashdot reported this: a genteel and thoughtful chat becomes filled with increasingly crude, uninformed and insulting remarks.
Maybe I don't want to be here....
Doesn't it strike anyone as odd as to how inefficient the education system must be to produce such a high failure rate? The screening process that admits candidates to these elite professional programs must be broken too, as it obviously allows too many candidates that just can't cut it into the programs. On the other hand, maybe it is just the testing process is broken...
For instance, I'm a testing person, but not a content person (i.e., I design towards what the stats tell me, as well as the actual wording and structure of the exam...I always work with someone who understands the content areas from a very advanced level and can deal with that end). One of the last MC exams I was helping validate, I knew NOTHING about the content -- it was a medical exam. First thing I did was go through the entire exam, read all the questions quickly, and see if logic could remove any of the answers. Statistically, I would have gotten a 20% by random means, but in this case, I received somewhere around 43% (if I remember correctly). The educated guess is a BIG part of these things...you aren't just measuring content knowledge, but application and that means if someone can raise the bar, they might actually do well in the real world.
If you knew NOTHING (your words) and you could get 43% through logic, in what SHOULD have been 20%, then I think you prove the author's point even more. How good is a 5-choice multiple choice test if someone with ZERO knowledge can score 43% by applying logic/common sense ? It sounds like what you are describing is the exact opposite of an educated guess
JWall: GUI client for IPTables
I have had all kinds of experience. Some of it a little strange.
A couple of things I did around the end of law school bear mention here.
I probably should not discuss it, but I helped calibrate the Multi-State Bar exam, during my third year of law school. Most lawyers will scream bloody murder that I was allowed anywhere near the data.
It is not like it sounds. I was working with a real psychometrician. He knew the statistics and methodology, and I knew the practical parts of computer systems. We both knew SAS, very well.
(Statistical Analysis System - it is its own little language. In many ways, the language is an improvement over languages like Fortran, and I *like* Fortran)
The data was double-blinded. Neither my friend nor I saw the questions or any of the answers. Someone else handled that part. All we knew was, for each one of the thousands of examinees and for each question, whether or not the examinee got the correct answer. The order of the questions was also scrambled, so we did not even know the order of the questions as they were taken by the examinees.
FWIW, for the Multi-State Bar in my State, and many others, only one thing counts: the total correct. Nothing is taken off for wrong answers. A passing score is much higher than 25% of the total. I do not recall now, but 60-80% is the neighborhood of correct answers to pass (actually it was a combined score from the written and Multi-State, but if you got only 25% on the Multi-State, you failed, period.)
Bad statisticians get crappy results because they make wrong assumptions. Whoever the guy is that wrote the article, never let him do your statistics. He makes assumptions that competent psychometricians know are false.
I know the article's assumptions. I made them myself until I worked with my friend.
Strange things happen with really good test questions. This is not all, but most.
First, some guy randomly guessing, say, by going down the questions and always taking the first answer, will fail. Even if he is incredibly lucky and is nearly three standard deviations out (one of the very unlikely possibilities from a uniform distribution), he will still fail the exam.
Second, if his answers were educated guesses instead of blindly picking from a, b, c, d, or e, then his chances of getting the correct answer went down.
You cannot even take the exam until you graduate from an accredited law school, or you practice in California. *Every* single person that took the exam made at least educated guesses on most of the answers.
(One of the top guys in my law school class decided he was going to give the psychometricians a heart attack, by answering every single question correctly. He bragged about it before the exam. The smartest guy I ever met, bar none. Later he said that at the end of the day, he looked up, realized that he had twenty-five questions to go, and there were five minutes left. He *blindly* answered the last twenty-five questions (all with (b), IIRC) and turned in his exam. He passed, of course.)
My friend and I could sort of tell the order of some of the questions, in spite of the double-blind. The more difficult questions, in the last fifty or sixty, clearly had a higher random component of correct answers. Taking into account the difficulty of the question and the size of the random component, we felt confident that we could identify and order three-fourths of the final forty questions.
Difficulty == number of examinees getting a correct answer, adjusted for their relative ability to correctly answer all the other questions. For small numbers of examinees, this is perilous. Our sample set was more than ten thousand, and verified against results from previous years going back. Those answers, in turn, were sampled, then diligently validated against LSAT scores (a law school entrance IQ test), law school grades, relative difficulty of law school, undergraduate grades, and personal inquiries to indiv
All is paradox. Retired lawyer, so this is just one more layman's opinion.
other countries use written answers instead of true/false.
But they must be too difficult to correct, with no correction sheet for those ones.
To paraphrase, he said that decent tests count the number of right answers you get, but really good tests also count how many times your answer is, "It depends."
The man who does not read good books has no advantage over the man who cannot read them. - Mark Twain
There is no way to hide any lack of knowledge
Right, because this is an iterative, custom-fit exam. They assume you generally know your stuff, since you passed the writtens; they don't care about where you're strong, they want to know where you're weak, and how weak you are. As soon as the examiners start to smell the whiff of ignorance in an oral exam, they pursue it mercilessly, and work together to explore the depths of your particular areas of ignorance the same way a tag-team of sadistic dentists will use an array of very small and very sharp bits of steel to dig in and thoroughly explore a bad spot on a tooth. God help you if you give them an especially juicy target to work on, or if you give them more than one.
(Shudders when thinking back on doctoral oral exams.)
The man who does not read good books has no advantage over the man who cannot read them. - Mark Twain
My classmates proved this in High School, inadvertently. Our Chemistry class was notoriously hard, and graded on a curve. For the practice final one guy answered 'C' for everything. Another answered randomly. They both finished fairly early, to the chagrin of the teacher, but did fairly well on the final curve. If I recall, "C" beat random. This test did not include the wrong-answer penalty like the SAT.
Why, oh why, didn't I take the Blue Pill?
As a psychometrician, I must disagree with his post and examples. When a latent trait (in this case, knowledge of the subject, or ability) is estimated, the difficulty of each question and the probability of guessing are taken into consideration. As a mathematician, you must be familiar with Item Response Theory (IRT) and the Rasch model, and its modifications. Even if IRT is not used, extremely difficult (like those in your example) or easy items are usually not included in tests, since they do not have any informational value, and the guessing parameter is considered when scoring the responses. Konstantin Augemberg (konstantin at augemberg.com)
Under your system, if I am over 33% confident in my answer, it is still to my advantage to make a guess. Maybe that's the effect that you're going for, but being 34% confident in myself is not enough for me to claim to "know" something.
Imagine the consequences. If I were taking your test, any time I can eliminate just one choice, my expected value (or penalty) for guessing is 0, assuming I don't have clue #1 about the other choices. But if I actually took your class, I would hope that I would at least have clue #1, so any time I could eliminate one choice with confidence, I would take a stab at answering the question.
Looking at it a different way, let's say I'm a slacker and I only know 50% of the material on your test. What score would I get if I took your test? You are hoping that I'll get 50%, typically a failing grade (if I only learned half of what I was supposed to, I'd say failing is appropriate). But in reality, I would expect to pass your test.
Why? Well, I know 50% of the material, so I'm going to get 50% on your test based on that alone. But the story doesn't end there. If I know 50% of the material, I should be able to eliminate two of the four choices on the questions for which I do not know the answer. That means that I will expect to get half of the remaining questions correct (and half incorrect, of course).
On a 100 question test, I will get:
50*2=100 points for my 50% mastery of the material
25*2=50 points for my "good" guesses
25*(-1)=-25 points for my "bad" guesses
That gives me 125 out of 200 points, or 63%. Nothing to post on the refrigerator, for sure, but I passed, eh?
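The break-even confidence and the slacker's score can both be checked with a small sketch. It assumes the scheme implied above (+2 for a right answer, -1 for a wrong one, four choices per question); the function names are hypothetical.

```python
from fractions import Fraction


def guess_ev(p_right, reward=2, penalty=1):
    """Expected points from answering with probability p_right of being correct."""
    return p_right * reward - (1 - p_right) * penalty


def slacker_result(total=100, known=50, remaining_choices=2, reward=2, penalty=1):
    """Points and fraction of maximum for someone who guesses between the
    remaining choices on every question they do not know."""
    guessed = total - known
    lucky = Fraction(guessed, remaining_choices)
    points = known * reward + lucky * reward - (guessed - lucky) * penalty
    return points, points / (total * reward)


print(guess_ev(Fraction(1, 3)))  # 0 -> break-even at one-in-three confidence
print(guess_ev(Fraction(1, 2)))  # 1/2 -> guessing between two choices pays

points, fraction = slacker_result()
print(points, float(fraction))   # 125 0.625
```

Eliminating a single choice out of four puts a blind guesser exactly at the break-even point of one in three, and the half-prepared student ends up with 62.5 percent, which rounds to the 63 percent quoted above.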
They don't grade fathers, but if your daughter's a stripper, you fucked up. --Chris Rock
I have a friend who has - in my presence - suffered from varying forms of seizures and episodes. A few times she almost fainted in my arms, and once her eyes glazed over and she was making weird noises and slumping over until I carried her over to a chair (she came to after awhile, but never remembered the time she was "out"). According to her doctor, it was simply because she was too tall and sometimes not getting enough blood circulation to her head (despite no BP issues), no further tests, no prescriptions.
I'm dreading one day where I'll hear she has had a serious accident due to a seizure. I've had little luck helping her find another doctor either as *none* want to contradict a fellow doctor's diagnosis...
I don't think it's #1, since we never named her doctor to others, and #2 doesn't apply since it's Canada and we have public healthcare. I'm not sure about #3, but I would think that another doctor might be willing to *see* the patient before making that assumption.
These were also different doctors in different areas of town, but the impression I got from mine is that the various medical associations frown upon one doctor overruling another's judgement, even if the first was wrong.