Harvard Macy Institute - Blog - The Multiple Choice Conundrum

For anyone who has ever written multiple choice questions, or taken a multiple choice exam (which includes basically everyone on the planet who can read), the limitations of this type of test are pretty obvious. In fact, if you look into the history of the multiple choice question (MCQ) format, it is interesting to learn that it was never meant to become our only standardized test method. It was invented to test potential military recruits for intelligence, but was later discarded when it was found not to be reliable. The uses we now put MCQ tests to are staggering.

One thing MCQ tests are definitely good for is to detect serious problems with taking MCQ tests. We have all known students who were everything we value from the point of view of a teacher: thoughtful, empathic learners who are able to apply their knowledge in the real world, but who have great difficulty with MCQ tests. In fact, it is not that uncommon to detect serious learning disabilities in adulthood through problems with testing. But is this really what we want to use to evaluate the rest of our students?

Most people think that writing MCQ items comes naturally, and many teachers write them in an off-hand way. If all you want to test is trivia, MCQ items are indeed easy to write. But as educators, we should be striving to evaluate our students for higher levels of thinking rather than memorization of discrete facts, which in the medical field are often changing. If you have ever tried to write an MCQ item that tests higher-level reasoning, you know how difficult (if not impossible) that is.

As an example, consider how the National Board of Medical Examiners writes MCQs for its certifying exams. First comes an exhaustive and detailed editing process designed to weed out any possible structural clues to the correct answer. An entire expert panel then debates each question for considerable time and throws many of them out as unsalvageable. The few that are accepted are each tried out on a large number of students, with statistics run to determine the quality of the question before it is ever used in a ‘real’ test to score performance. So if you think you can sit down and write a lot of MCQ items in a short time and test at the higher levels of Bloom’s pyramid, you are probably fooling yourself. An extensive ‘Item Writing Manual’ is available online; it gives the details of all of the rules and regulations for writing ‘NBME-style’ test items and is used as a model for many high-stakes MCQ exams.

It is easy to write difficult MCQs. Trickiness is not the point. Great MCQ test-takers have learned all the tricks in the book, and the exam score becomes like a game of poker. If you can eliminate one choice (too long and detailed) and another choice (different in nature from the other options), you can quickly change your chances from 25% to 50% when randomly selecting an answer to a question on information you don’t know. Do we really want to waste our students’ brain cells on this kind of gamesmanship? They need every neuron for the real tasks of learning to take care of patients.
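The arithmetic behind that jump from 25% to 50% can be sketched in a few lines of Python. This is only an illustration: the function name and the four-option, 100-question exam are assumptions made for the example, not details from any real test.

```python
from fractions import Fraction


def guess_probability(num_options: int, num_eliminated: int) -> Fraction:
    """Chance of a correct blind guess after ruling out some wrong options.

    Assumes exactly one option is correct and that every eliminated
    option really was wrong (both are assumptions of this sketch).
    """
    remaining = num_options - num_eliminated
    if remaining < 1:
        raise ValueError("at least one option must remain")
    return Fraction(1, remaining)


# A four-option item with no clues versus the same item after
# two options have been ruled out on structural clues alone.
blind = guess_probability(4, 0)
after_tricks = guess_probability(4, 2)

print(f"{float(blind):.0%}")         # 25%
print(f"{float(after_tricks):.0%}")  # 50%

# Expected number correct on a hypothetical 100-item exam where the
# test-taker knows none of the content.
print(100 * blind, 100 * after_tricks)  # 25 50
```

In other words, a test-taker who knows nothing about the content but can spot two structural giveaways per item would be expected to double a pure-guess score.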

If you do a lot of case-based interactive teaching, you already know how to ask probing questions that force your learners to think aloud and reveal the quality of their reasoning processes. How could we use this question format to actually test knowledge? Before the complete takeover of exams by MCQs, other question types were widely used: essay and short answer. In an interactive teaching session, those are the types of questions that are usually being asked. What is the problem with these questions for other exam settings? In the past, there was no way to grade them by computer, making them prohibitively difficult to deal with for large classes. And for the essay question, items are often graded in a subjective and inconsistent way by different graders.

Since we are now living in the 21st century, it is time we re-examined the limitations and virtues of the types of questions we use on exams, as we now have the computer power to be able to grade written text in a consistent manner using sophisticated rubrics. Some standardized exams have continued to use the essay format with variable results from computer grading. Some studies have suggested that this old-style computer grading of essay questions gave scores that mostly correlated with the total number of words written by the student, which is obviously not our goal.

Beginning with short-answer, even without sophisticated programming, it is slightly tedious but possible to provide a fairly comprehensive list of correct options, being careful to include all alternative word orders, synonyms, etc. I have done this for years using a website that was previously known as Spaced Education, now renamed as QStream. This site also incorporates an important educational concept of repetition, which has been shown in sophisticated multi-center educational trials to improve retention of knowledge. This site has always offered question authors the option to use short answer instead of MCQ, which tests learners in a more realistic way, forcing them to come up with answers on their own just like in the real clinical setting. No fancy programming is needed for this, just a bit of hard work trying to input all possible correct options. Crowd-sourcing can definitely help with this low-tech method of generating computer-graded fill-in-the-blank questions, as readers were more than willing (in fact, sometimes rather vehement) to send in their thoughts on other correct answers, helping to make the list of options more complete.
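To make the low-tech approach concrete, here is a minimal sketch of grading a short answer against a hand-built list of accepted options. It is not how QStream or any other site actually works; the function names and the sample answer key are invented for illustration. The only trick is to normalize case, punctuation and word order before comparing, so that the list does not need a separate entry for every ordering.

```python
import re
from typing import Iterable, Set, Tuple


def normalize(text: str) -> Tuple[str, ...]:
    """Lowercase, drop punctuation, and sort the words of an answer.

    Sorting makes the comparison insensitive to word order, so that
    "lung, collapsed" and "collapsed lung" normalize identically.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return tuple(sorted(words))


def build_answer_key(accepted: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Turn the author's list of accepted answers into a lookup set."""
    return {normalize(answer) for answer in accepted}


def is_correct(response: str, key: Set[Tuple[str, ...]]) -> bool:
    """True if the learner's response matches any accepted answer."""
    return normalize(response) in key


# Hypothetical answer key for a single question; synonyms still have to
# be typed in by hand (or crowd-sourced from readers).
key = build_answer_key(["pneumothorax", "collapsed lung"])

print(is_correct("Pneumothorax.", key))     # True
print(is_correct("Lung, collapsed", key))   # True
print(is_correct("pleural effusion", key))  # False
```

Every new synonym that readers send in becomes one more string in the accepted list, which is exactly the slightly tedious work described above.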

A more sophisticated method involves natural language processing, or NLP. In NLP, more computer firepower is needed, but such systems can make use of Boolean approaches to right and wrong answers (THIS is correct, but NOT this), and can partly automate the somewhat tedious task of making a list of all correct answers. At the American College of Radiology, a programming team has been working on this approach for several years, and is nearly ready to roll it out in their flagship website, Case-in-Point. As editor-in-chief of this website, I am eager to move these interactive online cases to a more interesting and provocative question format that would more closely simulate actual practice (where patients never come in with four choices of diagnosis written out for us to choose from). The ultimate goal of this project would be to generate computer rubrics for entire radiology reports, which could give sophisticated feedback on which items were missed or incorrect, which items were most important, and how various findings were described. Similar approaches could revolutionize testing in many areas, not just radiology. With NLP widely used by sites like Google (to fill in your search request based on just a few typed letters) and the existence of lexicons of medical terminology, it is to be hoped that we will soon see the end of the MCQ forever. In my opinion it is not a moment too soon.
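The Boolean idea (THIS is correct, but NOT this) can be illustrated with a toy rule checker. This sketch is far simpler than real NLP and is not the American College of Radiology's implementation; the class name, field names and sample phrases are all assumptions made for the example, and the negation handling is deliberately naive.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class BooleanRule:
    any_of: List[str]   # at least one of these phrases must appear
    none_of: List[str]  # none of these phrases may appear


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive whole-word search for a phrase in free text."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def matches(rule: BooleanRule, response: str) -> bool:
    """True if the response has a required phrase and no forbidden one."""
    has_required = any(contains_phrase(response, p) for p in rule.any_of)
    has_forbidden = any(contains_phrase(response, p) for p in rule.none_of)
    return has_required and not has_forbidden


# Hypothetical rule: the finding must be named, but not negated.
rule = BooleanRule(
    any_of=["pneumothorax"],
    none_of=["no pneumothorax", "without pneumothorax"],
)

print(matches(rule, "Small left apical pneumothorax."))  # True
print(matches(rule, "No pneumothorax is seen."))         # False
print(matches(rule, "Normal chest."))                    # False
```

A real system would draw on a medical lexicon for synonyms and use proper negation detection rather than a fixed list of forbidden phrases, but the right-and-wrong logic has the same shape.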

What do you think? Is it time to retire MCQs?

Kitt Shaffer

Dr. Kitt Shaffer earned her MD from Tufts University, her Anatomy PhD from Kansas University, and is currently Vice-Chair for Radiology Education at Boston Medical Center. She is the recipient of numerous educational awards including the Stauffer and Whitley Awards of the Association of University Radiologists (AUR), the Faculty Prize for Teaching at Harvard Medical School, and Outstanding Educator of the Year for the Radiological Society of North America; most recently, in 2016, she was named Educator of the Year by the AUR. She collaborated with Dr. Petra Lewis in the development of national curricular guidelines for medical students in Radiology and is a co-founder with Dr. Angelisa Paladin of the Clinician Educator Development Program of the American Roentgen Ray Society, a national program to train radiologists as educators.