EXAM DEVELOPMENT

How Should Item Writers Be Trained?

The science is surprisingly silent, but real-world experience and recent innovations are worth a closer look.

All professional certification and licensure exams are written by people who are subject matter experts (SMEs) in the role being tested. Ironically, however, they are selected because they are experts in the exam topic, yet asked to perform a task at which they are not experts: writing questions for exams. Writing effective items is surprisingly tricky, and there is a learning curve to become proficient. Items need to be concise yet clear; they need to effectively address a specific piece of knowledge or a specific skill; and they need to be of an appropriate level of cognitive complexity and difficulty. Therefore, before any exam content can be created, item writers need to be trained to write items.

The Traditional Model

The typical training starts with a set of item-writing rules—almost always some variation of the 43 rules for multiple-choice items published by Haladyna and Downing in 1989. (If you are unfamiliar with these rules, just do a web search for “rules for writing multiple-choice items”; the first several pages of hits are all variations on those original 43 rules.)

The next step is to devise a training curriculum around these 43 rules as well as any “local,” in-house rules that ensure the item pool has a consistent editorial style and voice. An example of a local rule is: “When writing questions, people are always referred to by last name, and gender pronouns are avoided. Correct: Mead has two apples. Incorrect: Alan/He has two apples.”

The bulk of the traditional training is therefore a long list of rules with correct and incorrect examples. Commonly, item writer training also includes some practice in writing items and perhaps critiquing the items as a group. With 43 rules, plus local rules, plus practice, this training is typically very long, often multiple hours or even days, even when it focuses only on the most important rules.

Analyzing the Results

Anecdotal evidence suggested that this long, tedious training was not particularly effective. For example, most programs need an editor to review and revise items because item writers violate these rules. One study (Case, Holtzman, & Ripkey, 2001) found that the main cost of item writing was editorial and staff time. Also, most exam development staff and item writers privately dread these training sessions. This came to a head when the COVID-19 pandemic hit: moving these lengthy training sessions online caused significant problems. For example, it was hard for programs to keep SMEs engaged for hours of online training. Many programs found that they had no choice but to reduce training time.

Because training SME item writers is such an important step, and particularly in the face of anecdotal evidence that the training was burdensome, you would assume that there is a large amount of literature that has empirically examined the question of how best to prepare SMEs to write items, right? For example, researchers would have validated that these rules improve item writing, right? Or someone would have established that the rules and practice were necessary, right? Or empirical evidence would be used to establish which of the rules was most important, right? Well, no—it turns out that there is essentially no research comparing different training methods!

How Can We Study the Experts?

The ideal research design to identify the best training method would (a) assemble a group of SMEs, (b) randomly assign them to two different training conditions, and (c) evaluate how well each group does. It may seem like an academic technicality, but random assignment to groups is necessary to interpret the results of such an experiment. If the groups are preexisting, then we cannot rule out that their performance is due to preexisting differences. (Remember when you memorized “Correlation does not imply causation”? Well, random assignment is the “secret ingredient” that would allow us to infer causation if we observed a correlation in this study.) Shockingly, we cannot find a single instance of such a study ever having been done!
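
To make the idea of random assignment concrete, here is a minimal sketch in Python. The names and group labels are hypothetical; the point is simply that chance, not any preexisting grouping, determines who receives which training.

```python
import random

# Hypothetical pool of SME item writers (names are placeholders).
smes = [f"SME_{i}" for i in range(1, 21)]

# Randomly shuffle, then split into two training conditions.
random.shuffle(smes)
half = len(smes) // 2
condition_a = smes[:half]   # e.g., traditional rule-based training
condition_b = smes[half:]   # e.g., an alternative training method

# With random assignment, any later difference in item quality between the
# groups can be attributed to the training itself rather than to preexisting
# differences between the groups.
print("Condition A:", condition_a)
print("Condition B:", condition_b)
```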

One study (Case et al., 2001) compared items from traditional test committees (TC), ad-hoc task force committees (AH), and “item harvesting” (IH; i.e., individual SMEs submitting items). The traditional test committee members received two days of traditional, in-person item-writer training and then wrote items remotely during a three-year tenure on the committee. The ad-hoc task force committees met in person for 1.5 days and received 30 minutes of training. The item harvesting participants were sent instructional materials and an item-writing manual and submitted items on their own. In all cases, some degree of editing and review was applied. The results were mixed: some metrics strongly favored the traditional test committees (they wrote more items, and their items received the highest average quality rating, 3.4/4.0), while other metrics favored item harvesting (lowest cost and lowest editor time per item, with adequate average rated quality, 3.0/4.0).

What Can We Conclude?

We should not draw too many conclusions from one study of one exam with one type of development. Moreover, participants were not randomly assigned to conditions, so the degree to which traditional test committee or ad-hoc task force members were better (or worse) item writers to begin with is an uncontrolled variable that prevents us from making strong causal statements. More broadly, best practices should be shaped by meta-analyses of many studies conducted by many researchers across a wide set of circumstances (types of exams and items, domains of practice, types of professionals, etc.) before we can have a good perspective on what type of training is most effective.

However, two tentative conclusions seem warranted: (1) Item writers who learned to write items and then wrote many items produced better results. This makes perfect sense: writing items is a skill, and experienced item writers produce better results than new ones. (2) The average quality of items submitted by individuals in the “item harvesting” condition was adequate (3.0/4.0) and not much worse than that of the best condition (the traditional test committees, 3.4/4.0).

Because research on training item writers is lacking, we can examine an allied domain where SME training has been studied extensively: performance appraisal. Organizational psychologists have long been interested in improving managers’ performance appraisals. The most common appraisal method is a manager’s rating of employee performance, and there are well-documented errors that all raters tend toward (similarity, severity, leniency, halo, etc.). Performance appraisal is thus a highly analogous domain: SMEs are recruited for their domain expertise (managers know the performance of their employees) but are asked to perform a task in which they are not experts (rating performance without making common rating errors). And, as with item writing, the performance appraisal literature has looked to training to solve this problem.

And the analogy continues: the traditional training for performance appraisal focused on rater errors and how to avoid them. Psychologists saw that this training was not very effective, so they devised new training methods and conducted empirical research comparing their effectiveness. So far, the best approach appears to be “frame-of-reference training,” in which raters are shown examples (e.g., videos) of various levels of performance (e.g., highly effective, adequate, and inadequate performance).

How Should Item-Writing Training Evolve?

Applying the lessons learned from the performance appraisal literature to item-writer training suggests that lengthy training on lists of rules may not be the most effective approach. Teaching raters about common errors and training them to avoid those errors makes little difference in the ratings managers actually produce. Rather, the best training teaches raters what constitutes better and worse performance and how to calibrate their expectations to a common metric.

The Certiverse platform is flexible, and any training or organizational method can be used with the item-writing module, but we find that minimal training produces reliable results because the platform itself supports item writers. For example, we do not need to train item writers on the parts of an item (exhibit, stem, key, distractors, etc.) because a wizard leads SMEs step by step through writing an effective item. Built into this wizard are natural language processing (NLP) algorithms that encourage best practices (avoid negated items, end the stem with a question, ensure that the key and distractors are similar in length).
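
As a rough illustration of what automated checks like these can look like, here is a minimal sketch in Python. The function name, rules, and thresholds are hypothetical and far simpler than a production NLP system; the sketch only shows the general idea of flagging rule violations as an item is written.

```python
import re

def check_item(stem: str, key: str, distractors: list[str]) -> list[str]:
    """Flag simple violations of common item-writing rules (illustrative only)."""
    warnings = []

    # Avoid negatively worded stems (e.g., "Which of these is NOT...").
    if re.search(r"\b(not|except|never)\b", stem, flags=re.IGNORECASE):
        warnings.append("Stem appears to be negated; consider rewording positively.")

    # Phrase the stem as a direct question.
    if not stem.strip().endswith("?"):
        warnings.append("Stem does not end with a question mark.")

    # Keep the key and distractors similar in length so that length
    # does not cue the correct answer.
    lengths = [len(option) for option in [key] + distractors]
    if max(lengths) > 2 * min(lengths):
        warnings.append("Answer options vary widely in length.")

    return warnings

# Example usage
print(check_item(
    stem="Which of the following is NOT a fruit",
    key="Carrot",
    distractors=["Apple", "Banana", "A long and detailed option about oranges"],
))
```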

Similar to how modern word processors flag spelling errors, item writers using Certiverse can see when they violate Haladyna and Downing’s rules and correct the errors immediately, which has two significant benefits. First, the item writer learns to follow the rules faster, without lengthy training. Second, rule violations can be fixed by the SME immediately rather than sending the item to an editor who would need to send it back to the SME, making the whole item-writing process more efficient.

Certiverse NLP can detect only a portion of potential rule violations. To enforce the remaining rules efficiently, Certiverse uses peer review. After submission, items automatically go into a review queue, where an independent SME evaluates the item against the remaining rules, phrased in plain language (e.g., Is the item on topic? Is it clear?). The reviewer chooses Yes/No to indicate whether the item meets each rule and, where there are violations, types a written comment. The most common feedback concerns unclear wording, disputes about whether the key is correct under all circumstances, spelling or grammar issues, and problems with the supporting rationales or references. This feedback is returned to the author as soon as the review is complete; the author corrects the item and resubmits it to the review queue. By default, an item must be approved by two peer reviewers to be accepted.
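
The review flow described above can be summarized with a small sketch. The class and function names below are hypothetical and are not Certiverse’s actual data model; the two-approval threshold mirrors the default mentioned in the text.

```python
from dataclasses import dataclass, field

APPROVALS_REQUIRED = 2  # default acceptance threshold described above

@dataclass
class Item:
    text: str
    approvals: int = 0
    status: str = "in_review"              # in_review -> returned or accepted
    feedback: list[str] = field(default_factory=list)

def peer_review(item: Item, answers: dict[str, bool], comments: list[str]) -> None:
    """Apply one independent SME review to a submitted item."""
    if all(answers.values()):
        item.approvals += 1
        if item.approvals >= APPROVALS_REQUIRED:
            item.status = "accepted"
    else:
        # Any "No" answer returns the item to its author with written comments;
        # the author revises and resubmits, restarting the approval count.
        item.feedback.extend(comments)
        item.approvals = 0
        item.status = "returned"

# Example: one reviewer flags a clarity issue...
item = Item("Which port does HTTPS use by default?")
peer_review(item, {"on_topic": True, "clear": False}, ["Clarify which protocol is meant."])
print(item.status, item.feedback)          # returned ['Clarify which protocol is meant.']

# ...the author revises and resubmits, and two reviewers approve.
item.text = "Which TCP port does HTTPS use by default?"
item.status = "in_review"
peer_review(item, {"on_topic": True, "clear": True}, [])
peer_review(item, {"on_topic": True, "clear": True}, [])
print(item.status)                         # accepted
```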

After pretesting, we use the psychometric results to grade items into “red,” “yellow,” and “green” categories that indicate an item’s readiness for use on a live exam. The most typical pool-level outcome for Certiverse is 3-8% of items graded “red,” or unacceptable. In the instances where that percentage was higher, the cause was an exam that was too difficult overall. We consider these results superior because, with other methods, it is common to plan for 20-30% of items to be discarded after pretesting.
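
For readers unfamiliar with this kind of grading, here is a hypothetical sketch of how classical item statistics might map to red/yellow/green categories. The statistics and cut points below are assumptions chosen only for illustration; the article does not specify which metrics or thresholds Certiverse uses.

```python
def grade_item(p_value: float, point_biserial: float) -> str:
    """Grade a pretested item from classical statistics (illustrative thresholds)."""
    # "Red": far too easy or too hard, or the item fails to discriminate
    # between higher- and lower-performing candidates.
    if p_value < 0.20 or p_value > 0.95 or point_biserial < 0.05:
        return "red"
    # "Yellow": borderline statistics that merit editorial review.
    if p_value < 0.30 or p_value > 0.90 or point_biserial < 0.15:
        return "yellow"
    # "Green": ready for use on a live exam form.
    return "green"

# Example usage
print(grade_item(p_value=0.62, point_biserial=0.28))  # green
print(grade_item(p_value=0.12, point_biserial=0.02))  # red
```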

As a field, we do not have empirical research indicating the best form of item-writer training, but we do have anecdotal evidence that lengthy training does not have a good ROI. The Certiverse platform therefore emphasizes having SMEs dive into item writing, supporting them with a well-designed interface and AI assistance, and giving them the feedback they need to avoid common pitfalls and write items that demonstrate both topic expertise and exam development expertise.

References

Case, S. M., Holtzman, K., & Ripkey, D. R. (2001). Developing an item pool for CBT: A practical comparison of three models of item writing. Academic Medicine: Journal of the Association of American Medical Colleges, 76(10 Suppl), S111–S113. https://doi.org/10.1097/00001888-200110001-00037

Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37–50. https://doi.org/10.1207/s15324818ame0201_3
