Item review is a critical step in ensuring high-quality items. No matter how good the SME item author (or AI) is, exam items often have issues like an unclear stem, a correct answer that sticks out, or implausible distractors. Such quality issues can undermine the usefulness of the exam. The Certiverse platform already uses AI and machine learning to support item writers throughout the development process, providing guidance when necessary to promote high-quality items. However, quality issues are not always immediately obvious, even to the trained eye. This is why a thorough (and typically labor-intensive) review process is necessary.
Wouldn’t it be great if AI could automatically review newly written exam items and provide fast, actionable feedback? Could AI find the same issues that human reviewers find, and guide item writers to make targeted revisions? The first step to using AI for item review and revision is to see whether large language models (LLMs) can pick up on the same item-quality issues as human reviewers. Therefore, we tested how accurately two LLMs identified the same key issues as a human review on an exam containing 43 four-option multiple-choice items.
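To make the setup concrete, here is a minimal sketch of how an LLM could be asked to review a single multiple-choice item. This is an illustration only, not the Certiverse implementation: the sample item, the prompt wording, and the list of issue categories are assumptions, and the code simply uses the standard OpenAI Python SDK.

```python
from openai import OpenAI  # assumes the openai SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

# Hypothetical item for illustration; the actual exam items are not shown here.
item = """
Stem: Which layer of the OSI model is responsible for routing packets between networks?
A. Physical
B. Data link
C. Network  (key)
D. Transport
"""

REVIEW_PROMPT = (
    "You are an exam item reviewer. Review the multiple-choice item below and flag any of "
    "these quality issues: conspicuous key, implausible distractors, redundant distractors, "
    "double key, unclear stem, unclear or implausible key, superficial question. "
    "For each issue you flag, briefly explain why and suggest a targeted revision.\n\n"
)

# The same call can be repeated with model="gpt-4o" to compare the two LLMs.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": REVIEW_PROMPT + item}],
)

print(response.choices[0].message.content)
```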
We found promising initial results for leveraging AI to review items. As shown in the table below, human review determined that the most common issues in our exam items were (1) a correct answer that stood out (i.e., conspicuous key) and (2) unrealistic or obvious distractors (i.e., distractor plausibility). Both LLMs identified nearly all of these issues, and GPT-4.1 missed only a single case: it identified 100% of the conspicuous keys and 96% of the implausible distractors, outperforming GPT-4o. The LLMs also identified nearly all the less common issues (e.g., stem clarity issues).
| Quality Issue | Frequency of Issue from Human Review | GPT-4.1 Accuracy | GPT-4o Accuracy |
|---|---|---|---|
| Conspicuous Key | 20 | 100.00% | 80.00% |
| Distractor Clarity | 2 | 100.00% | 100.00% |
| Distractor Plausibility | 27 | 96.00% | 96.00% |
| Distractor Redundancy | 2 | 100.00% | 100.00% |
| Double Key | 2 | 50.00% | 0.00% |
| Key Clarity | 1 | 100.00% | 100.00% |
| Key Plausibility | 2 | 100.00% | 100.00% |
| Stem Clarity | 10 | 100.00% | 100.00% |
| Superficial Question | 3 | 100.00% | 100.00% |
In short, GPT-4.1 outperformed GPT-4o in identifying the most common item-quality issues: implausible distractors, conspicuous keys, and unclear stems. GPT-4.1 found at least 96% of the issues in each of these categories.
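For readers curious how an agreement rate like those in the table might be computed, here is a minimal sketch. It assumes "accuracy" means the share of human-flagged issues in each category that the LLM also flagged; the item IDs and counts below are made up for illustration and are not the study data.

```python
# Hypothetical data: for each issue category, the item IDs flagged by human review
# and by the LLM. The real review data is not shown here.
human_flags = {
    "Conspicuous Key": {1, 4, 9},
    "Distractor Plausibility": {2, 4, 7, 11},
}
llm_flags = {
    "Conspicuous Key": {1, 4, 9},
    "Distractor Plausibility": {2, 7, 11},
}

def agreement_by_category(human, llm):
    """Share of human-flagged issues in each category that the LLM also flagged."""
    rates = {}
    for category, flagged in human.items():
        found = flagged & llm.get(category, set())
        rates[category] = len(found) / len(flagged)
    return rates

for category, rate in agreement_by_category(human_flags, llm_flags).items():
    print(f"{category}: {rate:.0%}")
```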
These initial results are a promising first step. The next steps will be to evaluate the usefulness of the AI-generated feedback and suggested revisions. Additionally, because the current results reflect only a single exam, we need to confirm that AI item review accurately identifies and addresses item-quality issues across various item types and content domains. Overall, these findings suggest that we can leverage state-of-the-art LLMs to reduce the time and effort needed to review exam items, allowing faster development of high-quality exams.